overfitting – The Stats Geek

In a previous post we looked at the area under the ROC curve for assessing the discrimination ability of a fitted logistic regression model. An issue that we ignored there was that we used the same dataset to fit the model (estimate its parameters) and to assess its predictive ability.

A problem with doing this, particularly when the dataset used to fit/train the model is small, is that such estimates of predictive ability are optimistic. That is, the model will fit the dataset which has been used to estimate its parameters somewhat better than it will fit new data. In some sense, this is because with small datasets the fitted model adapts to chance characteristics of the observed data which won’t occur in future data. A silly example of this would be a linear regression model of a continuous variable Y fitted to a continuous covariate X with just n=2 data points. The fitted line will just be the line connecting the two data points. In this case, the R squared measure will be 1 (100%), suggesting your model has perfect predictive power(!), when of course with new data it would almost certainly not have an R squared of 1.
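
The n=2 example is easy to check by simulation. The following is a minimal sketch in Python with NumPy, not code from the original analysis: the data-generating model (Y = 1 + 0.5X + standard normal noise), the seed, the size of the new dataset, and the helper names `simulate` and `r_squared` are all assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed, for reproducibility only


def simulate(n):
    """Draw n observations from an assumed model: Y = 1 + 0.5*X + noise."""
    x = rng.normal(size=n)
    y = 1.0 + 0.5 * x + rng.normal(size=n)
    return x, y


def r_squared(y, y_pred):
    """R squared = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


# Fit a straight line to just n=2 points: it passes exactly through both.
x_train, y_train = simulate(2)
slope, intercept = np.polyfit(x_train, y_train, deg=1)
r2_train = r_squared(y_train, intercept + slope * x_train)

# Evaluate the same fitted line on new data from the same model.
x_new, y_new = simulate(1000)
r2_new = r_squared(y_new, intercept + slope * x_new)

print(f"R squared on the 2 fitting points: {r2_train:.3f}")
print(f"R squared on 1000 new points:      {r2_new:.3f}")
```

The R squared on the two fitting points is 1 up to floating-point error, whatever the true relationship is. On the new data it will be lower, and it can even be negative, which happens when the fitted line predicts worse than simply using the mean of the new outcomes.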