This week a student asked me (quite reasonably) whether using linear regression to model binary outcomes was really such a bad idea, and if it was, why. In this post I'll look at some of the issues involved and try to answer the question.
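One concrete problem with the linear approach can be seen in a tiny sketch: fit ordinary least squares to a 0/1 outcome and the fitted "probabilities" can escape the [0, 1] range. The data below are invented purely for illustration.

```python
# Illustrative sketch: least squares on made-up binary data can give
# predicted "probabilities" below 0 and above 1.

xs = list(range(10))                  # covariate values 0..9
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # binary outcome

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Ordinary least squares slope and intercept
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Fitted values at the ends of the covariate range
pred_low = intercept + slope * xs[0]    # below 0
pred_high = intercept + slope * xs[-1]  # above 1
```

With these numbers the fitted line gives roughly -0.18 at x = 0 and 1.18 at x = 9, neither of which is a valid probability.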
Odds and odds ratios are important measures of, respectively, the absolute and relative chance of an event of interest happening, but their interpretation is sometimes a little tricky to master. In this short post, I'll describe these concepts in a (hopefully) clear way.
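As a quick worked illustration of the two concepts (the probabilities below are invented): the odds of an event with probability p are p/(1-p), and an odds ratio compares the odds in two groups.

```python
# Worked example of odds and an odds ratio (numbers are made up).

def odds(p):
    """Convert a probability to odds: p / (1 - p)."""
    return p / (1 - p)

p_group_a = 0.8   # chance of the event in one group
p_group_b = 0.2   # chance of the event in the other group

odds_a = odds(p_group_a)   # 0.8 / 0.2 = 4, i.e. 4 events per non-event
odds_b = odds(p_group_b)   # 0.2 / 0.8 = 0.25

odds_ratio = odds_a / odds_b   # 16: the odds are 16 times higher in group A
```

Note that although the probability is only 4 times higher in group A than group B, the odds are 16 times higher, which is one reason odds ratios can be tricky to interpret.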
A very common situation in biostatistics, but also much more broadly of course, is that one wants to compare the predictive ability of two competing models. A key question of interest often is whether adding a new marker or variable Y to an existing set X improves prediction. The most obvious way of testing this hypothesis is to use a regression model, and then test whether adding the new variable Y improves fit, by testing the null hypothesis that the coefficient of Y in the expanded model is zero. An alternative approach is to test whether adding the new variable improves some measure of predictive ability, such as the area under the ROC curve.
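The coefficient-testing route can be sketched with a likelihood ratio test. In the special case below, a logistic model with a single binary predictor, the maximum likelihood fits are just the group event rates, so no iterative fitting is needed; the 2x2 data are invented for illustration.

```python
# Hedged sketch: likelihood ratio test for whether adding a single binary
# predictor to a logistic model improves fit. Data are invented.

from math import log

def bernoulli_loglik(events, total, p):
    """Log-likelihood of `events` successes out of `total` at probability p."""
    return events * log(p) + (total - events) * log(1 - p)

# Events / totals in the two predictor groups
events0, n0 = 10, 30   # predictor = 0
events1, n1 = 20, 30   # predictor = 1

# Null model (no predictor): one common event probability
p_null = (events0 + events1) / (n0 + n1)
ll_null = bernoulli_loglik(events0 + events1, n0 + n1, p_null)

# Expanded model: a separate event probability per group
ll_full = (bernoulli_loglik(events0, n0, events0 / n0)
           + bernoulli_loglik(events1, n1, events1 / n1))

# Twice the log-likelihood difference, referred to chi-squared on 1 df;
# the 5% critical value is 3.84
lr_stat = 2 * (ll_full - ll_null)
improves_fit = lr_stat > 3.84
```

Here the statistic is about 6.8, so at the 5% level we would reject the null that the new variable's coefficient is zero.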
When we include a continuous variable as a covariate in a regression model, it's important that we include it using the correct (or at least approximately correct) functional form. For example, with a continuous outcome Y and continuous covariate X, it may be the case that the expected value of Y is a linear function of X and X^2, rather than a linear function of X. For linear regression there are a number of ways of assessing what the appropriate functional form is for a covariate. A simple but often effective approach is simply to look at a scatter plot of Y against X, to visually assess the shape of the association.
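A numeric stand-in for the scatter-plot check is to bin X and look at the mean of Y within each bin; a pattern in the bin means reveals the shape of the association. The sketch below simulates data with a deliberately quadratic relationship (all the numbers are invented for illustration).

```python
# Binned means of Y across X as a rough check of functional form.
# Simulated data with a quadratic truth: E[Y] = (X - 5)^2.

import random

random.seed(1)
xs = [random.uniform(0, 10) for _ in range(500)]
ys = [(x - 5) ** 2 + random.gauss(0, 1) for x in xs]

# Mean of Y within five equal-width bins of X
n_bins = 5
bin_means = []
for b in range(n_bins):
    lo, hi = b * 2.0, (b + 1) * 2.0
    in_bin = [y for x, y in zip(xs, ys) if lo <= x < hi]
    bin_means.append(sum(in_bin) / len(in_bin))

# A U-shape in the bin means (high - low - high) suggests that entering
# X linearly is inadequate and an X^2 term may be needed
```

Here the outer bins have much larger means than the middle bin, the numeric signature of the curvature a scatter plot would show.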
In a previous post we looked at the popular Hosmer-Lemeshow test for logistic regression, which can be viewed as assessing whether the model is well calibrated. In this post we'll look at one approach to assessing the discrimination of a fitted logistic model, via the receiver operating characteristic (ROC) curve.
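A useful fact for computing discrimination is that the area under the ROC curve equals the probability that a randomly chosen event has a higher predicted risk than a randomly chosen non-event (ties counting one half). The sketch below uses this pairwise definition; the predicted risks are invented for illustration.

```python
# Minimal sketch: AUC as the concordance probability between events
# and non-events. Predicted risks are made up.

def auc(pos_scores, neg_scores):
    """AUC by comparing every event/non-event pair of predicted risks."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

events = [0.9, 0.8, 0.7]        # predicted risks where the outcome occurred
non_events = [0.75, 0.6, 0.4]   # predicted risks where it did not

discrimination = auc(events, non_events)   # 8 of 9 pairs concordant: 8/9
```

A value of 0.5 would correspond to a model that discriminates no better than chance, and 1 to perfect separation of events from non-events.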
In this post we'll look at the deviance goodness of fit test for Poisson regression with individual count data. Many software packages report this test in the output when fitting a Poisson regression model, or can perform it after such a model has been fitted (e.g. Stata), which may lead researchers and analysts into relying on it. In this post we'll see that often the test will not perform as expected, and therefore, I argue, it ought to be used with caution.
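To fix ideas, here is a hedged sketch of the test statistic itself. For simplicity it fits an intercept-only Poisson model, whose fitted mean is just the sample mean; the counts are invented, and in practice the statistic would come from your fitted regression model.

```python
# Sketch of the deviance goodness-of-fit statistic for a Poisson model,
# using an intercept-only fit on invented counts.

from math import log

counts = [1, 2, 0, 3, 1, 2, 4, 0, 1, 2]
mu = sum(counts) / len(counts)   # intercept-only Poisson MLE: the sample mean

# Deviance: D = 2 * sum( y*log(y/mu) - (y - mu) ),
# with the y*log(y/mu) term taken as 0 when y = 0
deviance = 2 * sum((y * log(y / mu) if y > 0 else 0.0) - (y - mu)
                   for y in counts)

# Under the null that the model fits, D is referred to a chi-squared
# distribution with n - p = 10 - 1 = 9 df; the 5% critical value is 16.92
lack_of_fit = deviance > 16.92
```

Here the deviance is about 11.0 on 9 degrees of freedom, so the test gives no evidence of lack of fit; the catch, as the post discusses, is that the chi-squared reference distribution can be a poor approximation with small counts.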