A very common situation in biostatistics, but also much more broadly of course, is that one wants to compare the predictive ability of two competing models. A key question of interest is often whether adding a new marker or variable Y to an existing set of variables X improves prediction. The most obvious way of testing this is to fit a regression model and test whether adding the new variable Y improves fit, by testing the null hypothesis that the coefficient of Y in the expanded model is zero. An alternative approach is to test whether adding the new variable improves some measure of predictive ability, such as the area under the ROC curve.
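To make the first approach concrete, here is a minimal sketch in Python using statsmodels. The data (x, y, d) are hypothetical simulated variables, not from any of the papers discussed below; the point is simply to show the Wald test for the coefficient of Y in the expanded logistic model.

```python
# A minimal sketch (hypothetical simulated data) of testing whether Y adds
# predictive information by examining its coefficient in the expanded model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2023)
n = 500
x = rng.normal(size=n)                        # existing predictor X
y = rng.normal(size=n)                        # new marker Y
lp = -1 + 1.0 * x + 0.5 * y                   # assumed true linear predictor
d = rng.binomial(1, 1 / (1 + np.exp(-lp)))    # binary outcome D

# expanded model including both X and Y
X_full = sm.add_constant(np.column_stack([x, y]))
fit_full = sm.Logit(d, X_full).fit(disp=0)

# Wald test of the null that the coefficient of Y equals zero (last column)
print(fit_full.params[2], fit_full.pvalues[2])
```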
A number of papers have investigated the problem of testing whether the area under the ROC curve (AUC) is improved by adding a new variable to a logistic regression model for a binary outcome. For a selection of open access ones, see here, here, and here. Apparently, a common practice for comparing the AUCs of two nested logistic regression models was (and maybe still is) to apply the nonparametric test developed by DeLong and colleagues to the linear predictors generated from the two logistic regression models.
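To make this practice concrete, below is a sketch of DeLong et al's test for comparing two correlated AUCs, written in Python/NumPy (it is not code from any of the linked papers). In the setting described, the two score vectors would be the linear predictors from the reduced and expanded model fits (using the fitted probabilities instead gives the same AUCs, since they are a monotone transform of the linear predictor).

```python
# A sketch of DeLong et al's (1988) test for the difference between two
# correlated AUCs, computed from two score vectors measured on the same subjects.
import numpy as np
from scipy.stats import norm

def _placements(cases, controls):
    """Placement values (DeLong's structural components) for one marker."""
    # psi(case, control) = 1 if case > control, 0.5 if tied, 0 otherwise
    psi = (cases[:, None] > controls[None, :]).astype(float)
    psi += 0.5 * (cases[:, None] == controls[None, :])
    return psi.mean(axis=1), psi.mean(axis=0)   # per-case, per-control

def delong_test(score1, score2, d):
    """Compare the empirical AUCs of score1 and score2 for binary outcome d (0/1)."""
    score1, score2, d = map(np.asarray, (score1, score2, d))
    comps = [_placements(s[d == 1], s[d == 0]) for s in (score1, score2)]
    v10 = np.column_stack([c[0] for c in comps])   # n_cases x 2
    v01 = np.column_stack([c[1] for c in comps])   # n_controls x 2
    aucs = v10.mean(axis=0)                        # empirical AUCs
    cov = (np.cov(v10, rowvar=False) / v10.shape[0]
           + np.cov(v01, rowvar=False) / v01.shape[0])
    contrast = np.array([1.0, -1.0])
    se = np.sqrt(contrast @ cov @ contrast)
    z = (aucs[0] - aucs[1]) / se
    return aucs, z, 2 * norm.sf(abs(z))
```

Note that the covariance calculation treats the two score vectors as fixed, known markers; this is exactly the assumption that breaks down when the scores are linear predictors with estimated coefficients, as discussed next.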
As the papers linked above describe, a common finding when adopting this approach is that even when the newly added variable Y is statistically significant in the larger model, the DeLong test for the increase in AUC is not statistically significant. This result appears quite paradoxical: the significant p-value for the new variable Y in the larger model indicates evidence that Y adds independent predictive information over and above the other variables already included in the model, yet there is apparently no evidence that the AUCs of the two models differ. Applying DeLong's test in this way has thus been described as extremely conservative. As pointed out by what I think is probably the most important paper to read on this topic, by Pepe and colleagues, applying DeLong et al's test to the linear predictors generated by the two models ignores the fact that the coefficients/parameters in the two models have been estimated in a first stage. Indeed, in their 1988 paper DeLong et al considered a situation where two markers are being compared in terms of their AUCs. Each of these two markers could of course be formed by taking a linear combination of variables, but the construction of the DeLong et al test assumes that the coefficients in the two linear combinations are fixed, known quantities, which is not the case when comparing the linear predictors of two nested logistic regression model fits.
Equivalence of null hypotheses
In what I think is an extremely useful paper, Pepe et al prove a number of important results. To describe them, let D be the binary outcome which takes values 0 or 1, and let r(X,Y)=P(D=1|X,Y) denote the true risk of a positive outcome given X and Y. Similarly, let r(X)=P(D=1|X). One can then construct the (true) ROC curves using r(X) and r(X,Y) as markers. Pepe et al prove that the resulting AUC values are identical if and only if r(X,Y)=r(X), that is, the AUCs of the true risk functions are identical if and only if Y provides no additional predictive information about D over and above X. The consequence of this result is that the null hypothesis that Y provides no additional predictive information over and above X is equivalent to the null hypothesis that the AUC of the true risk function using X alone is identical to the AUC of the true risk function which uses Y in addition to X. Pepe et al also prove that these conditions are equivalent to a variety of other conditions which state equality of other measures of predictive ability. The fact that these null hypotheses are equivalent means that there is no justification (or point) in testing twice, i.e. testing both the statistical significance of Y when it is added to the logistic regression model and whether adding Y increases the AUC.
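In symbols (my summary of the above, not notation taken from the paper), writing AUC{r} for the area under the ROC curve constructed using the risk score r as the marker, the chain of equivalent null hypotheses is

$$\text{AUC}\{r(X,Y)\} = \text{AUC}\{r(X)\} \iff r(X,Y)=r(X) \text{ almost surely} \iff D \perp Y \mid X,$$

and if the expanded model is a correctly specified logistic regression, $\text{logit}\, r(X,Y) = \beta_{0} + \beta_{X}^{T}X + \beta_{Y}Y$, each of these is in turn equivalent to $H_{0}: \beta_{Y}=0$, the null hypothesis tested by the usual Wald or likelihood ratio test.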
Performance of DeLong’s test when applied to compare nested logistic models
In simulations, Pepe et al found, as in other recent papers, that applying DeLong et al's test to compare the fits of two nested logistic regression models gives a highly conservative test, such that the chance of rejecting the null of no improvement in predictive ability is much lower than 5% when this null is true. Consequently, when there is an improvement in predictive ability, the test has low power, and in particular much lower power than the likelihood ratio test (or asymptotically equivalent Wald test) for the coefficient of Y in the logistic regression model. Pepe et al also tried using bootstrapping to allow for the fact that the coefficients in the two models are estimated. Although this approach increased power, it was still substantially lower than the power of the likelihood ratio test comparing the fits of the two logistic regression models. To some extent this is to be expected: as Pepe et al point out, in the special case where X is empty, applying DeLong's test is equivalent to using the Wilcoxon two-sample test, which is less powerful than a z-test for a difference in group means (of Y) when Y is conditionally normal given D, and this z-test is asymptotically equivalent to the test for the coefficient of Y in the logistic regression model.
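A small simulation along these lines (a sketch under the null that Y adds nothing given X, not Pepe et al's own code, and reusing the delong_test function sketched earlier) illustrates the conservativeness:

```python
# Sketch of a simulation under the null (Y adds no information given X):
# compare the rejection rate of DeLong's test applied to the two estimated
# linear predictors with that of the Wald test for Y's coefficient.
# Assumes the delong_test function from the earlier sketch is in scope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, nsim, alpha = 500, 1000, 0.05
reject_delong = reject_wald = 0

for _ in range(nsim):
    x = rng.normal(size=n)
    y = rng.normal(size=n)                          # independent of D given X
    d = rng.binomial(1, 1 / (1 + np.exp(-(-1 + x))))
    X_red = sm.add_constant(x)
    X_full = sm.add_constant(np.column_stack([x, y]))
    fit_red = sm.Logit(d, X_red).fit(disp=0)
    fit_full = sm.Logit(d, X_full).fit(disp=0)
    lp_red = X_red @ fit_red.params                 # estimated linear predictors
    lp_full = X_full @ fit_full.params
    _, _, p_delong = delong_test(lp_full, lp_red, d)
    reject_delong += p_delong < alpha
    reject_wald += fit_full.pvalues[2] < alpha

print("DeLong rejection rate:", reject_delong / nsim)  # typically far below alpha
print("Wald rejection rate:  ", reject_wald / nsim)    # close to alpha
```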
Practical recommendations
For me the take home message from Pepe et al's paper is that, for testing whether the variable Y adds predictive ability over and above X, one should focus on the usual likelihood based tests available when fitting regression models (i.e. likelihood ratio or Wald tests, as sketched below). These tests are known to be asymptotically optimal in terms of power, and alternative approaches, such as mis-applying DeLong's test in this setting, have much lower power. Pepe et al also make the point that calibration of fitted models is very important, and that overfitting can be an issue when there are a large number of predictor variables. In particular, using the Wald or likelihood ratio test to assess whether Y adds predictive information over and above X relies on the regression model(s) being correctly specified.
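For completeness, here is a sketch of the recommended likelihood ratio test comparing the two nested fits, again using the hypothetical arrays x, y and d from the first sketch above:

```python
# Likelihood ratio test for whether Y adds predictive information over X,
# comparing the nested logistic fits (hypothetical arrays x, y, d as above).
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

fit_red = sm.Logit(d, sm.add_constant(x)).fit(disp=0)
fit_full = sm.Logit(d, sm.add_constant(np.column_stack([x, y]))).fit(disp=0)

lr_stat = 2 * (fit_full.llf - fit_red.llf)   # chi-squared on 1 df here
print(lr_stat, chi2.sf(lr_stat, df=1))
```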
They also emphasize that quantifying the increase in predictive ability, as measured by some metric, which could be the AUC or one of the newer metrics, is important. In the case of the AUC, we can calculate the increase in AUC from the model which uses X to the one which uses X and Y. It should be possible to use bootstrapping to obtain a valid confidence interval for the increase in AUC. An issue here is that we can expect the bootstrap confidence interval to sometimes (probably often in fact) include the null value of zero even when Y is statistically significant in the model. Pepe et al suggest that one could increase the confidence interval’s lower limit to zero whenever the p-value for the coefficient of Y in the larger model is statistically significant. How to find confidence intervals for the increase in predictive ability in such a way that there is a 1-1 concordance with the test for the coefficient of Y in the regression model is an area for future research.
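One way of obtaining such an interval, as a sketch (again with the hypothetical x, y, d arrays, and not code from Pepe et al), is a simple percentile bootstrap that refits both models within each resample, so that the uncertainty in the estimated coefficients is propagated into the interval for the increase in AUC:

```python
# Percentile bootstrap CI for the increase in (in-sample) AUC when Y is
# added to the model, refitting both models within each bootstrap resample.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def auc_increase(x, y, d):
    X_red = sm.add_constant(x)
    X_full = sm.add_constant(np.column_stack([x, y]))
    lp_red = X_red @ sm.Logit(d, X_red).fit(disp=0).params
    lp_full = X_full @ sm.Logit(d, X_full).fit(disp=0).params
    return roc_auc_score(d, lp_full) - roc_auc_score(d, lp_red)

rng = np.random.default_rng(7)
n = len(d)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)        # resample subjects with replacement
    if d[idx].min() == d[idx].max():        # need both outcome classes present
        continue
    boot.append(auc_increase(x[idx], y[idx], d[idx]))

est = auc_increase(x, y, d)
lo, hi = np.percentile(boot, [2.5, 97.5])
print(est, (lo, hi))
```

Note this sketch uses in-sample AUCs; as Pepe et al caution, with many candidate predictors overfitting would need to be addressed (for example by cross-validation), and as discussed above the resulting interval may well include zero even when the test for Y's coefficient is significant.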