A very common situation in biostatistics, but also much more broadly of course, is that one wants to compare the predictive ability of two competing models. A key question of interest is often whether adding a new marker or variable Y to an existing set of predictors X improves prediction. The most obvious way of testing this is to fit a regression model and test the null hypothesis that the coefficient of Y in the expanded model is zero. An alternative approach is to test whether adding the new variable improves some measure of predictive ability, such as the area under the ROC curve.
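As a rough numerical sketch of these two approaches (in Python with numpy, on simulated data rather than any particular study; the helper names `fit_logistic` and `auc` are my own, and the Newton-Raphson fitter is just an illustrative stand-in for what any regression package would do), one can fit the two nested logistic models, form a Wald test for the coefficient of the new marker, and compare the AUCs of the two sets of predicted risks:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    # Newton-Raphson for logistic regression; returns (beta_hat, covariance)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1.0 - p))[:, None])   # Fisher information
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    return beta, np.linalg.inv(info)

def auc(score, y):
    # Area under the ROC curve via the Mann-Whitney rank statistic
    n = len(score)
    n1 = int(y.sum())
    n0 = n - n1
    ranks = np.empty(n)
    ranks[np.argsort(score)] = np.arange(1, n + 1)
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

# Simulated data: outcome d depends on an existing covariate x
# and on a genuinely informative new marker
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
new_marker = rng.normal(size=n)
lp = -0.5 + 1.0 * x + 1.0 * new_marker
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-lp)))

X_base = np.column_stack([np.ones(n), x])
X_full = np.column_stack([np.ones(n), x, new_marker])

b_base, _ = fit_logistic(X_base, d)
b_full, cov_full = fit_logistic(X_full, d)

# Approach 1: Wald test of H0 that the new marker's coefficient is zero
z = b_full[2] / np.sqrt(cov_full[2, 2])

# Approach 2: compare AUCs of the predicted risks from the two models
auc_base = auc(X_base @ b_base, d)
auc_full = auc(X_full @ b_full, d)
print(z, auc_base, auc_full)
```

With a truly informative marker the Wald statistic is large and the AUC of the expanded model exceeds that of the base model; the subtle point, discussed at length in the literature, is that these two ways of framing "does Y help?" do not always behave the same way, particularly when the AUC comparison is done as a formal test.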
Stata/Mata’s st_view function – use with care!
I use Stata a lot, and I think it’s a great package. An excellent addition a few years ago was the Mata language, a fully fledged matrix programming language which sits on top of, but largely separate from, Stata’s regular dataset and command/syntax structure. Many of Stata’s built-in commands are, I believe, programmed using Mata. I’ve been using Mata quite a bit to program new commands, and in the process have come across some strange behaviour in Mata’s st_view function which I think can cause real difficulties (it did for me!). This post will hopefully help others avoid ending up with the problems I did.
Adjusting for optimism/overfitting in measures of predictive ability using bootstrapping
In a previous post we looked at the area under the ROC curve for assessing the discrimination ability of a fitted logistic regression model. An issue that we ignored there was that we used the same dataset to fit the model (estimate its parameters) and to assess its predictive ability.
A problem with doing this, particularly when the dataset used to fit/train the model is small, is that such estimates of predictive ability are optimistic. That is, the model will fit the dataset that was used to estimate its parameters somewhat better than it will fit new data. In some sense this is because, with small datasets, the fitted model adapts to chance characteristics of the observed data which won’t occur in future data. A silly example of this would be a linear regression model of a continuous variable Y fitted to a continuous covariate X with just n=2 data points. The fitted line will simply be the line connecting the two data points. In this case the R squared measure will be 1 (100%), suggesting your model has perfect predictive power(!), when of course with new data it would almost certainly not have an R squared of 1.
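The n=2 example is easy to check numerically. The sketch below (Python with numpy; the two training points and the new data are made up purely for illustration) fits a line through two points, finds an apparent R squared of essentially 1, and then shows how badly the same line does on fresh data generated with no real relationship between X and Y:

```python
import numpy as np

# "Training" data: just two points
x_train = np.array([1.0, 2.0])
y_train = np.array([1.3, 0.4])

# The least-squares line through two points passes through both exactly
slope, intercept = np.polyfit(x_train, y_train, 1)

def r_squared(x, y):
    # 1 - residual sum of squares / total sum of squares,
    # using the line estimated from the training data
    resid = y - (intercept + slope * x)
    return 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

r2_train = r_squared(x_train, y_train)   # essentially 1, up to rounding

# New data in which X in truth has no effect on Y
rng = np.random.default_rng(0)
x_new = rng.normal(size=100)
y_new = rng.normal(size=100)
r2_new = r_squared(x_new, y_new)         # far below 1 (can even be negative)

print(r2_train, r2_new)
```

The apparent R squared of 1 is pure overfitting: the line has adapted perfectly to the two observed points, and that "perfection" tells us nothing about predictive ability in new data.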