## Why shouldn't I use linear regression if my outcome is binary?

This week a student asked me (quite reasonably) whether using linear regression to model binary outcomes was really such a bad idea, and if it was, why. In this post I'll look at some of the issues involved and try to answer the question.
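As a small illustration of one commonly cited issue (not necessarily the only one the post discusses), the sketch below fits ordinary least squares to a made-up binary outcome and shows that the fitted "probabilities" can fall outside the range 0 to 1. The data and variable names are assumptions chosen purely for illustration.

```python
# Toy data (made up for illustration): the outcome is 0 for small x and 1 for large x.
x = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least squares with a single covariate.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fitted values are meant to estimate P(y = 1 | x), but nothing
# constrains them to lie between 0 and 1.
fitted = [intercept + slope * xi for xi in x]
print("smallest fitted value:", min(fitted))
print("largest fitted value:", max(fitted))
```

With these data the smallest fitted value is negative and the largest exceeds 1, which cannot be interpreted as probabilities.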

## Interpreting odds and odds ratios

Odds and odds ratios are important measures of the absolute and relative chance, respectively, of an event of interest happening, but their interpretation can be tricky to master. In this short post, I'll describe these concepts in a (hopefully) clear way.
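To make the definitions concrete, here is a minimal sketch of the arithmetic. The function names `odds` and `odds_ratio` and the two example probabilities are hypothetical, chosen only for illustration.

```python
def odds(p):
    """Convert a probability p (0 <= p < 1) into odds, p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_ratio(p1, p0):
    """Odds of the event in group 1 divided by the odds in group 0."""
    return odds(p1) / odds(p0)


# Hypothetical event probabilities in two groups.
p_exposed = 0.2
p_unexposed = 0.1

print("odds, exposed:", odds(p_exposed))            # about 0.25, i.e. 1 to 4
print("odds, unexposed:", odds(p_unexposed))        # about 0.111, i.e. 1 to 9
print("odds ratio:", odds_ratio(p_exposed, p_unexposed))  # about 2.25
print("risk ratio:", p_exposed / p_unexposed)       # about 2.0
```

Note that the odds ratio (about 2.25) is not the same as the ratio of the probabilities (2.0). The two are close only when the event is rare in both groups.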

## Comparing predictive ability of two nested logistic regression models

A very common situation in biostatistics, and much more broadly, is that one wants to compare the predictive ability of two competing models. A key question of interest is often whether adding a new marker or variable Y to an existing set X improves prediction. The most obvious way of assessing this is to fit a regression model that includes Y and then test the null hypothesis that the coefficient of Y in the expanded model is zero. An alternative approach is to test whether adding the new variable improves some measure of predictive ability, such as the area under the ROC curve.
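The first approach can be sketched with a likelihood ratio test between the two nested models, which is one standard way (alongside the Wald test) of testing that the new variable's coefficient is zero. This is a minimal illustration on simulated data, not the post's own code: the helper functions, the simulation settings, and the names `x` (existing covariate), `marker` (the new variable) and `outcome` are all assumptions, and a real analysis would normally use a statistics package rather than a hand-written Newton-Raphson fit.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson; return (coefficients, log-likelihood).

    Each row of `rows` must already include a leading 1.0 for the intercept.
    """
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(rows, y):
            p = expit(sum(b * v for b, v in zip(beta, xi)))
            for r in range(k):
                grad[r] += (yi - p) * xi[r]
                for c in range(k):
                    info[r][c] += p * (1 - p) * xi[r] * xi[c]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    loglik = 0.0
    for xi, yi in zip(rows, y):
        p = expit(sum(b * v for b, v in zip(beta, xi)))
        loglik += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return beta, loglik


# Simulated data (assumed settings): the new marker truly affects the outcome.
random.seed(2024)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
marker = [random.gauss(0, 1) for _ in range(n)]
outcome = [
    1 if random.random() < expit(-0.5 + 1.0 * xi + 0.8 * mi) else 0
    for xi, mi in zip(x, marker)
]

# Reduced model: intercept + x. Expanded model: intercept + x + marker.
_, ll_reduced = fit_logistic([[1.0, xi] for xi in x], outcome)
beta_full, ll_full = fit_logistic(
    [[1.0, xi, mi] for xi, mi in zip(x, marker)], outcome
)

# Likelihood ratio statistic; under the null it is approximately chi-squared
# with 1 degree of freedom, whose upper tail equals erfc(sqrt(stat / 2)).
lr_stat = 2 * (ll_full - ll_reduced)
p_value = math.erfc(math.sqrt(max(lr_stat, 0.0) / 2))
print("LR statistic:", lr_stat)
print("p-value:", p_value)
```

The expanded model can never have a lower maximised log-likelihood than the reduced one, so the statistic is non-negative. A small p-value is evidence against the null hypothesis that the marker's coefficient is zero.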

## Checking functional form in logistic regression using loess plots

When we include a continuous variable as a covariate in a regression model, it's important that we include it using the correct (or an approximately correct) functional form. For example, with a continuous outcome Y and continuous covariate X, it may be the case that the expected value of Y is a linear function of X and X^2, rather than a linear function of X alone. For linear regression there are a number of ways of assessing what the appropriate functional form is for a covariate. A simple but often effective approach is to look at a scatter plot of Y against X, to visually assess the shape of the association.
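As a rough numerical stand-in for eyeballing a scatter plot, the sketch below simulates a continuous outcome whose mean depends on X and X^2, then prints the mean of Y within bins of X. This binned-mean summary is a crude substitute for a scatter plot or a loess smoother, and the simulated coefficients, noise level, and number of bins are assumptions made only for illustration.

```python
import random

random.seed(1)

# Simulated data (assumed settings): E[Y | X] = 1 + 0.5 * X + X^2.
n = 200
x = [-2 + 4 * i / (n - 1) for i in range(n)]
y = [1 + 0.5 * xi + xi ** 2 + random.gauss(0, 0.5) for xi in x]

# Split the range of X into equal-width bins and average Y within each bin.
n_bins = 5
lo, hi = min(x), max(x)
width = (hi - lo) / n_bins
sums = [0.0] * n_bins
counts = [0] * n_bins
for xi, yi in zip(x, y):
    b = min(int((xi - lo) / width), n_bins - 1)  # put x == hi in the last bin
    sums[b] += yi
    counts[b] += 1
bin_means = [s / c for s, c in zip(sums, counts)]

for b, mean in enumerate(bin_means):
    left = lo + b * width
    print(f"X in [{left:+.1f}, {left + width:+.1f}): mean Y = {mean:.2f}")
```

If the mean of Y were linear in X, the bin means would rise or fall steadily across the bins. Here they dip in the middle bin and rise at both ends, which points to a curved (roughly quadratic) relationship.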