In a previous post we looked at the popular Hosmer-Lemeshow test for logistic regression, which can be viewed as assessing whether the model is well calibrated. In this post we'll look at one approach to assessing the discrimination of a fitted logistic model, via the receiver operating characteristic (ROC) curve.
Before discussing the ROC curve, first let's consider the difference between calibration and discrimination, in the context of logistic regression. As in previous posts, I'll assume that we have a binary outcome $Y$ and covariates $X_1, \dots, X_p$. The logistic regression model assumes that:

$$\text{logit}\{P(Y=1|X_1,\dots,X_p)\} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$

The model parameters are the regression coefficients $\beta_0, \beta_1, \dots, \beta_p$, and these are usually estimated by the method of maximum likelihood.
Good calibration is not enough
For given values of the model covariates, we can obtain the predicted probability $P(Y=1|X_1,\dots,X_p)$. The model is said to be well calibrated if the observed risk matches the predicted risk (probability). That is, if we were to take a large group of observations which are assigned a predicted probability of 0.2, the proportion of these observations with $Y=1$ ought to be close to 20%. If instead the observed proportion were 80%, we would probably agree that the model is not performing well - it is underestimating risk for these observations. The comparison between predicted probabilities and observed proportions is the basis for the Hosmer-Lemeshow test.
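As a rough sketch of this idea (the grouping idea underlying Hosmer-Lemeshow, not the test statistic itself), we can group observations by their fitted probabilities and compare the mean predicted probability in each group with the observed proportion of positive outcomes. The seed, sample size, and decile grouping here are illustrative choices:

```r
# Illustrative sketch: checking calibration by comparing predicted
# probabilities with observed proportions within groups
set.seed(1234)
n <- 10000
x <- rnorm(n)
pr <- exp(x)/(1+exp(x))
y <- 1*(runif(n) < pr)
mod <- glm(y ~ x, family = "binomial")
predpr <- predict(mod, type = "response")

# form ten groups based on deciles of the fitted probabilities
grp <- cut(predpr, breaks = quantile(predpr, probs = seq(0, 1, 0.1)),
           include.lowest = TRUE)

# for a well calibrated model the two columns should be close
cbind(mean_predicted = tapply(predpr, grp, mean),
      observed_proportion = tapply(y, grp, mean))
```

Since the data were generated from the model being fitted, the two columns should agree closely here, up to sampling variability within each decile group.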
Should we be content to use a model so long as it is well calibrated? Unfortunately not. To see why, suppose we fit a model for our outcome $Y$ but without any covariates, i.e. the model:

$$\text{logit}\{P(Y=1)\} = \beta_0$$
This (null) model assigns every observation the same predicted probability, since it does not use any covariates. The estimate of the single parameter $\beta_0$ will be the observed overall log odds of a positive outcome, such that the predicted value of $P(Y=1)$ will be identical to the proportion of $Y=1$ observations in the dataset.
This rather useless model will nevertheless have good calibration - in future samples, the observed proportion will be close to our estimated probability. However, the model isn't really useful because it doesn't discriminate between those observations at high risk and those at low risk. The situation is analogous to a weather forecaster who, every day, says the chance of rain tomorrow is 10%. This prediction might be well calibrated, but it doesn't tell people whether it is more or less likely to rain on a given day, and so isn't really a helpful forecast!
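To make this concrete, here is a small sketch (with an arbitrary seed and a roughly 10% prevalence) showing that the intercept-only model assigns every observation the same fitted probability, equal to the sample proportion of positive outcomes:

```r
# Sketch: the intercept-only (null) model gives everyone the same
# fitted probability, equal to the observed proportion of y = 1
set.seed(2345)
n <- 1000
y <- 1*(runif(n) < 0.1)   # outcome with roughly 10% prevalence

nullmod <- glm(y ~ 1, family = "binomial")
predpr <- predict(nullmod, type = "response")

length(unique(round(predpr, 10)))  # a single distinct fitted probability...
predpr[1]                          # ...which equals the sample proportion
mean(y)
```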
As well as being well calibrated, we would therefore like our model to have high discrimination ability. In the binary outcome context, this means that observations with $Y=1$ ought to be assigned high predicted probabilities, and those with $Y=0$ ought to be assigned low predicted probabilities. Such a model allows us to discriminate between low and high risk observations.
Sensitivity and specificity
To explain the ROC curve, we first recall the important notions of sensitivity and specificity of a test or prediction rule. The sensitivity is defined as the probability of the prediction rule or model classifying an observation as 'positive' given that in truth it is positive ($Y=1$). In words, the sensitivity is the proportion of truly positive observations which are classified as such by the model or test. Conversely, the specificity is the probability of the model classifying an observation as 'negative' given that it is truly negative ($Y=0$).
Our model or prediction rule is perfect at classifying observations if it has 100% sensitivity and 100% specificity. Unfortunately in practice this is (usually) not attainable. So how can we summarize the discrimination ability of our logistic regression model? For each observation, our fitted model can be used to calculate the fitted probabilities $\hat{P}(Y=1|X_1,\dots,X_p)$. On their own, these don't tell us how to classify observations as positive or negative. One way to create such a classification rule is to choose a cut-point $c$, and classify those observations with a fitted probability above $c$ as positive and those at or below it as negative. For this particular cut-point, we can estimate the sensitivity by the proportion of observations with $Y=1$ which have a predicted probability above $c$, and similarly we can estimate the specificity by the proportion of $Y=0$ observations with a predicted probability at or below $c$.
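As a sketch, using a simulated dataset like the one fitted later in the post, the sensitivity and specificity at a chosen cut-point (0.5 here, an arbitrary choice) can be estimated directly from the fitted probabilities:

```r
# Sketch: estimating sensitivity and specificity at a single cut-point
set.seed(63126)
n <- 1000
x <- rnorm(n)
pr <- exp(x)/(1+exp(x))
y <- 1*(runif(n) < pr)
mod <- glm(y ~ x, family = "binomial")
predpr <- predict(mod, type = "response")

cutpoint <- 0.5
sens <- mean(predpr[y == 1] > cutpoint)   # proportion of y=1 classified positive
spec <- mean(predpr[y == 0] <= cutpoint)  # proportion of y=0 classified negative
c(sensitivity = sens, specificity = spec)
```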
If we increase the cut-point $c$, fewer observations will be predicted as positive. This will mean that fewer of the $Y=1$ observations will be predicted as positive (reduced sensitivity), but more of the $Y=0$ observations will be predicted as negative (increased specificity). In picking the cut-point, there is thus an intrinsic trade-off between sensitivity and specificity.
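The trade-off can be seen directly by sweeping the cut-point over a grid (same illustrative simulation as elsewhere in the post): as the cut-point increases, the estimated sensitivity can only fall and the estimated specificity can only rise:

```r
# Sketch: sensitivity falls and specificity rises as the cut-point increases
set.seed(63126)
n <- 1000
x <- rnorm(n)
pr <- exp(x)/(1+exp(x))
y <- 1*(runif(n) < pr)
mod <- glm(y ~ x, family = "binomial")
predpr <- predict(mod, type = "response")

cuts <- c(0.25, 0.5, 0.75)
sens <- sapply(cuts, function(cc) mean(predpr[y == 1] > cc))
spec <- sapply(cuts, function(cc) mean(predpr[y == 0] <= cc))
rbind(cutpoint = cuts, sensitivity = sens, specificity = spec)
```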
The receiver operating characteristic (ROC) curve
Now we come to the ROC curve, which is simply a plot of the values of sensitivity against one minus specificity, as the value of the cut-point $c$ is increased from 0 through to 1:
A model with high discrimination ability will have high sensitivity and specificity simultaneously, leading to an ROC curve which goes close to the top left corner of the plot. A model with no discrimination ability will have an ROC curve which is the 45 degree diagonal line.
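To connect the definition to the plot, here is a hand-rolled sketch of the curve (before using any package), tracing out (1 - specificity, sensitivity) as the cut-point moves from 0 to 1 over a grid; the grid spacing is an arbitrary choice:

```r
# Sketch: tracing out the ROC curve by hand over a grid of cut-points
set.seed(63126)
n <- 1000
x <- rnorm(n)
pr <- exp(x)/(1+exp(x))
y <- 1*(runif(n) < pr)
mod <- glm(y ~ x, family = "binomial")
predpr <- predict(mod, type = "response")

cuts <- seq(0, 1, by = 0.01)
sens <- sapply(cuts, function(cc) mean(predpr[y == 1] > cc))
fpr  <- sapply(cuts, function(cc) mean(predpr[y == 0] > cc))  # 1 - specificity

plot(fpr, sens, type = "l", xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)  # the 45 degree no-discrimination line
```

At cut-point 0 every observation is classified positive (top right corner of the plot), and at cut-point 1 every observation is classified negative (bottom left corner); the curve joins these two extremes.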
Plotting the ROC curve in R
There are a number of packages in R for creating ROC curves. The one I've used here is the pROC package. First, let's simulate a dataset with one predictor x:
set.seed(63126)
n <- 1000
x <- rnorm(n)
pr <- exp(x)/(1+exp(x))
y <- 1*(runif(n) < pr)
mod <- glm(y~x, family="binomial")
Next we extract from the fitted model object the vector of fitted probabilities:
predpr <- predict(mod,type=c("response"))
We now load the pROC package, and use the roc function to generate an roc object. The basic syntax is to specify a regression type equation with the response y on the left hand side and the object containing the fitted probabilities on the right hand side:
library(pROC)
roccurve <- roc(y ~ predpr)
The roc object can then be plotted using

plot(roccurve)
which gives us the ROC plot (see previously shown plot). Note that here, because our logistic regression model only included one covariate, the ROC curve would look exactly the same if we had used roc(y ~ x), i.e. we needn't have fitted the logistic regression model. This is because with just one covariate the fitted probabilities are a monotonic function of that covariate. However in general (i.e. with more than one covariate in the model), this won't be the case.
Previously we said that for a model with good discrimination ability, the ROC curve will go close to the top left corner. To check this with a simulation, we will re-simulate the data, increasing the log odds ratio from 1 to 5:
set.seed(63126)
n <- 1000
x <- rnorm(n)
pr <- exp(5*x)/(1+exp(5*x))
y <- 1*(runif(n) < pr)
mod <- glm(y~x, family="binomial")
predpr <- predict(mod,type=c("response"))
roccurve <- roc(y ~ predpr)
plot(roccurve)
Now let's run the simulation one more time but where the variable x is in fact independent of y. To do this we simply modify the line generating the probability vector pr to
pr <- exp(0*x)/(1+exp(0*x))
which gives the following ROC curve
Area under the ROC curve
A popular way of summarizing the discrimination ability of a model is to report the area under the ROC curve. We have seen that a model with good discrimination ability has an ROC curve which goes close to the top left hand corner of the plot, whereas a model with no discrimination ability has an ROC curve close to the 45 degree line. Thus the area under the curve ranges from 0.5, corresponding to a model with no discrimination ability, to 1, corresponding to perfect discrimination. The area under the ROC curve is also sometimes referred to as the c-statistic (c for concordance).
The area under the estimated ROC curve (AUC) is reported when we plot the ROC curve in R's console. We can also obtain the AUC using

auc(roccurve)
I'll return to the topics of confidence interval estimation for the estimated AUC and adjusting for optimism in later posts.
For more information on the pROC package, I'd suggest taking a look at this paper, published in the open access journal BMC Bioinformatics.
Interpretation of the area under the ROC curve
Although it is not obvious from its definition, the area under the ROC curve (AUC) has a somewhat appealing interpretation. It turns out that the AUC is the probability that if you were to take a random pair of observations, one with $Y=1$ and one with $Y=0$, the observation with $Y=1$ has the higher predicted probability. The AUC thus gives the probability that the model correctly ranks such pairs of observations.
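This pairwise interpretation can be checked directly. A sketch, reusing the earlier illustrative simulation: compute the proportion of all ($Y=1$, $Y=0$) pairs in which the $Y=1$ observation has the higher fitted probability, counting ties as half-concordant. This quantity coincides with the area under the empirical ROC curve:

```r
# Sketch: AUC as the probability that a random y=1 observation has a
# higher fitted probability than a random y=0 observation (ties count half)
set.seed(63126)
n <- 1000
x <- rnorm(n)
pr <- exp(x)/(1+exp(x))
y <- 1*(runif(n) < pr)
mod <- glm(y ~ x, family = "binomial")
predpr <- predict(mod, type = "response")

p1 <- predpr[y == 1]   # fitted probabilities for the y=1 observations
p0 <- predpr[y == 0]   # fitted probabilities for the y=0 observations

# proportion of concordant pairs across all (y=1, y=0) pairs
concordance <- mean(outer(p1, p0, ">") + 0.5 * outer(p1, p0, "=="))
concordance
```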
In the biomedical context of risk prediction modelling, the AUC has been criticized by some. In the risk prediction context, individuals have their risk of developing (for example) coronary heart disease over the next 10 years predicted. Thus a measure of discrimination which examines the predicted probabilities of pairs of individuals, one with $Y=1$ and one with $Y=0$, does not really match the prospective risk prediction setting, where we do not have such pairs.
For more on risk prediction, and other approaches to assessing the discrimination of logistic (and other) regression models, I'd recommend looking at Steyerberg's Clinical Prediction Models book, an (open access) article published in Epidemiology, and Harrell's Regression Modeling Strategies book.