I've been teaching a modelling course recently, and have been reading and thinking about the notion of goodness of fit. R squared, the proportion of variation in the outcome Y, explained by the covariates X, is commonly described as a measure of goodness of fit. This of course seems very reasonable, since R squared measures how close the observed Y values are to the predicted (fitted) values from the model.
An important point to remember, however, is that R squared does not give us information about whether our model is correctly specified. That is, it does not tell us whether we have correctly specified how the expectation of the outcome Y depends on the covariate(s). In particular, a high value of R squared does not necessarily mean that our model is correctly specified. This is easiest to illustrate with a simple example.
First we will simulate some data using R. To do this, we randomly generate X values from a standard normal distribution (mean zero, variance one). We then generate the outcome Y equal to X plus a random error, again with a standard normal distribution:
n <- 1000 set.seed(512312) x <- rnorm(n) y <- x + rnorm(n)
We can then fit the (correct) linear regression model for Y with X as covariate:
mod1 <- lm(y~x) summary(mod1) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -2.8571 -0.6387 -0.0022 0.6050 3.0716 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.02193 0.03099 0.708 0.479 x 0.93946 0.03127 30.040 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.98 on 998 degrees of freedom Multiple R-squared: 0.4748, Adjusted R-squared: 0.4743 F-statistic: 902.4 on 1 and 998 DF, p-value: < 2.2e-16
Since we have specified the model correctly, we are not surprised that the parameter estimates of the intercept and slope are close to the values used to simulate the data (0 for the intercept, 1 for the slope). We also note that the R squared value is 0.47, indicating that X explains an estimated 47% of the variation in Y.
We can also plot the data, overlayed with the fitted line from the model:
Now let's re-generate the data, but generating Y such that its expectation is an exponential function of X:
n <- 1000 set.seed(512312) x <- rnorm(n) y <- exp(x) + rnorm(n)
In practice of course, we don't simulate our data - we observe or collect the data and then try to fit a reasonable model to it. So, as before, we might start by fitting a simple linear regression model, which assumes the expectation of Y is a linear function of X:
mod2 <- lm(y~x) summary(mod2) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -3.5022 -0.9963 -0.1706 0.6980 21.7411 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.65123 0.05220 31.63 <2e-16 *** x 1.53517 0.05267 29.15 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.651 on 998 degrees of freedom Multiple R-squared: 0.4598, Adjusted R-squared: 0.4593 F-statistic: 849.5 on 1 and 998 DF, p-value: < 2.2e-16
Unlike in the first case, the parameter estimates we obtain (1.65, 1.54) are not unbiased estimates of the parameters in the 'true' data generating mechanism, in which the expectation of Y is a linear function of exp(X). Also, we see that we obtain an R squared value of 0.46, indicating again that X (included linearly) explains quite a large amount of the variation in Y. We might take this to mean that the model we have used, that the expectation of Y is linear in X, is reasonable. However, if we again plot the observed data, and overlay it with the fitted line:
Overlaying the fitted line onto the observed data makes clear that the model we have used is not correctly specified, despite the fact that the R squared value is quite large. In particular we see that for both low and high values of X, the fitted values are too small. This is obviously a consequence of the fact that the expectation of Y depends on exp(X), whereas the model we have used assumes it is a linear function of X.
This simple example illustrates that, although R squared is an important measure, a high value does not mean our model is correctly specified. Arguably a better way of describing R squared is as a measure of 'explained variation'. For assessing whether our model is correctly specified, we should make use of model diagnostic techniques, such as plots of residuals against the covariates or the linear predictor.