R squared and adjusted R squared

One quantity people often report when fitting linear regression models is the R squared value. This measures the proportion of the variation in the outcome Y that can be explained by the covariates/predictors. If R squared is close to 1 (unusual in my line of work), it means that the covariates jointly explain almost all of the variation in Y, so that Y can be accurately predicted (in some sense) from the covariates. Conversely, a low R squared means Y is poorly predicted by the covariates. Of course, an effect can be substantively important without explaining a large amount of variance - blood pressure affects the risk of cardiovascular disease, but it is not a strong enough predictor to explain a large amount of the variation in outcomes. Put another way, knowing someone's blood pressure can't tell you with much certainty whether a particular individual will suffer from cardiovascular disease.

If we write our linear model as

Y=\beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} +...+\beta_{p} X_{p} + \epsilon

where \epsilon \sim N(0,\sigma^{2}), the true value of R squared is

R^{2} = \frac{Var(\beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} +...+\beta_{p} X_{p})}{Var(Y)} = \frac{Var(Y)-\sigma^{2}}{Var(Y)} = 1 - \frac{\sigma^{2}}{Var(Y)}

The denominator is the total variance of Y, while the numerator is this quantity minus the residual variability not explained by the covariates, i.e. the variability in Y that is explained by the covariates.
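
To make this concrete, consider the special case of a single covariate X_{1} with Var(X_{1})=1. Then Var(Y)=\beta_{1}^{2}+\sigma^{2}, and the formula above reduces to

R^{2} = \frac{\beta_{1}^{2}}{\beta_{1}^{2}+\sigma^{2}}

so, for example, a weak effect of \beta_{1}=0.1 with \sigma^{2}=1 corresponds to a true R squared of only 0.01/1.01 \approx 0.0099. An effect can therefore be far from zero and yet explain very little of the variance in Y.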

Most stats packages present two R squared measures. In R, the summary of a fitted linear model gives 'Multiple R-squared' and 'Adjusted R-squared'. What's the difference, and which should we use? Let's first look at the 'Multiple R-squared'. This is an estimate of the population R squared value, obtained by dividing the model sum of squares (an estimate of the variability of the linear predictor) by the total sum of squares, or equivalently as one minus the ratio of the residual sum of squares to the total sum of squares:

\hat{R}^{2} = 1 - \frac{\sum^{n}_{i=1}(Y_{i}-\hat{Y}_{i})^{2}}{\sum^{n}_{i=1}(Y_{i}-\bar{Y})^{2}} = 1 - \frac{\frac{1}{n}\sum^{n}_{i=1}(Y_{i}-\hat{Y}_{i})^{2}}{\frac{1}{n}\sum^{n}_{i=1}(Y_{i}-\bar{Y})^{2}}

where \hat{Y}_{i} denotes the predicted value of Y_{i} and \bar{Y} denotes the sample mean of Y.

To check this, we can simulate a simple dataset in R as follows:

> set.seed(5132)
> x <- rnorm(10)
> y <- rnorm(10)
> mod <- lm(y~x)
> summary(mod)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2575 -0.7416  0.2801  0.4595  0.9286 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06486    0.28430   0.228    0.825
x           -0.05383    0.25122  -0.214    0.836

Residual standard error: 0.8464 on 8 degrees of freedom
Multiple R-squared:  0.005707,  Adjusted R-squared:  -0.1186 
F-statistic: 0.04591 on 1 and 8 DF,  p-value: 0.8357

> anova(mod)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value Pr(>F)
x          1 0.0329 0.03289  0.0459 0.8357
Residuals  8 5.7310 0.71637
> 0.0329/(0.0329+5.7310)
[1] 0.005707941 

This code simulates two vectors X and Y independently, from the N(0,1) distribution. It then fits the linear model for Y with X as covariate. The R-squared value from the summary is 0.005707, suggesting (correctly here) that X is not a good predictor of Y. We then use the anova command to extract the analysis of variance table for the model, and check that the 'Multiple R-squared' figure is equal to the ratio of the model to the total sum of squares.
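
Equivalently, the same figure can be obtained directly from the model residuals, mirroring the first expression for \hat{R}^{2} above. A quick check, reusing the mod and y objects from the code above (this should reproduce the 'Multiple R-squared' value up to rounding):

# one minus the residual sum of squares divided by the total sum of squares
1 - sum(residuals(mod)^2)/sum((y - mean(y))^2)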

The bias of the standard R squared estimator

Unfortunately, this estimator of R squared is biased. The magnitude of the bias depends on how many observations are available to fit the model and how many covariates there are relative to this sample size. The bias can be particularly large with small sample sizes and a moderate number of covariates. To illustrate this bias, we can perform a small simulation study in R. To do this, we will repeatedly (1000 times) generate data for covariates X, Z1, Z2, Z3, Z4 (independent N(0,1)). We will then generate an outcome Y which depends (weakly) only on X, in such a way that the true R squared is approximately 0.01 (1%). We will then fit the model to each dataset, and record the R-squared estimates:

n <- 20
simulations <- 1000
R2array <- array(0, dim=simulations)
for (i in 1:simulations) {
  # generate five independent N(0,1) covariates and an outcome which depends (weakly) only on x
  x <- rnorm(n)
  z <- array(rnorm(n*4), dim=c(n,4))
  y <- 0.1*x + rnorm(n)
  # fit the linear model with all five covariates and store the R squared estimate
  mod <- lm(y~x+z)
  R2array[i] <- summary(mod)$r.squared
}
# mean of the estimates, with a 95% confidence interval for this mean
mean(R2array)
t.test(R2array)$conf.int

Running this code, I get a mean of the R2array of 0.266, with a 95% CI from 0.257 to 0.274. This compares with the true R squared value of approximately 0.01. Here the simple R squared estimator is severely biased, due to the large number of predictors relative to the number of observations.

Why the standard R squared estimator cannot be unbiased

Why is the standard estimator of R squared biased? One way of seeing why it can't be unbiased is that, by its definition, the estimates always lie between 0 and 1. From one perspective this is a very appealing property - since the true R squared lies between 0 and 1, having estimates which fall outside this range would be undesirable (as can happen with adjusted R squared). However, suppose the true R squared is 0 - i.e. the covariates are completely independent of the outcome Y. In repeated samples, the R squared estimates will be above 0, and their average will therefore be above 0. Since the bias is the difference between the average of the estimates in repeated samples and the true value (0 here), the simple R squared estimator must be positively biased.
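
This can be seen directly by simulation. Here is a minimal sketch, reusing the set-up of the earlier simulation but now generating Y independently of all the covariates, so that the true R squared is exactly 0 (the object name R2null is just illustrative):

n <- 20
simulations <- 1000
R2null <- array(0, dim=simulations)
for (i in 1:simulations) {
  x <- rnorm(n)
  z <- array(rnorm(n*4), dim=c(n,4))
  # y does not depend on any of the covariates
  y <- rnorm(n)
  mod <- lm(y~x+z)
  R2null[i] <- summary(mod)$r.squared
}
# every estimate lies above the true value of 0, so their mean must too
min(R2null)
mean(R2null)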

Incidentally, lots of the estimators we use are not unbiased (more on this in a later post), so lack of unbiasedness doesn't on its own concern us. The problem is that the bias can be large for certain designs, where the number of covariates is large relative to the number of observations used to fit the model.

Adjusted R squared

So, the simple R squared estimator is upwardly biased. What can we do? Well, we can modify the estimator to try and reduce this bias. A number of approaches have been proposed, but the one usually referred to as 'adjusted R squared' is motivated by returning to the definition of the population R squared as

R^{2} = 1 - \frac{\sigma^{2}}{Var(Y)}

The standard R squared estimator uses biased estimators of \sigma^{2} and Var(Y), by using the divisor n for both. These estimators are biased because they ignore the degrees of freedom used to estimate the regression coefficients and mean of Y. If instead, we estimate Var(Y) by the usual unbiased sample variance (which uses n-1 as the divisor) and \sigma^{2} by its unbiased estimator which uses n-p-1 as the divisor, we obtain the estimator:

\hat{R}^{2}_{adj} = 1 - \frac{\frac{1}{n-p-1}\sum^{n}_{i=1}(Y_{i}-\hat{Y}_{i})^{2}}{\frac{1}{n-1}\sum^{n}_{i=1}(Y_{i}-\bar{Y})^{2}}

where n denotes the sample size and p the number of covariates. Writing \hat{R}^{2} for the standard R squared estimator, this can be re-arranged to show that

\hat{R}^{2}_{adj} = \hat{R}^{2} - (1-\hat{R}^{2})\frac{p}{n-p-1}
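
To see the adjustment at work, we can plug in the numbers from the simple simulated dataset fitted earlier, where \hat{R}^{2}=0.005707, p=1 and n=10:

# adjusted R squared computed by hand for the earlier two-variable model
0.005707 - (1 - 0.005707)*1/(10 - 1 - 1)

which reproduces (up to rounding) the 'Adjusted R-squared' of -0.1186 reported by summary(mod).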

From the formula above, we can see that the adjusted estimator is always less than the standard one. We can also see that the larger p is relative to n, the larger the adjustment. Also, as we saw in the simple simulated dataset earlier, it is quite possible for \hat{R}^{2}_{adj} to be negative. A final point: although the adjusted R squared estimator uses unbiased estimators of the residual variance and the variance of Y, it is not itself unbiased, because the expectation of a ratio is not in general equal to the ratio of the expectations. To check this, try repeating the earlier simulation, but collecting the adjusted R squared values (stored in summary(mod)$adj.r.squared) and looking at the mean of the estimates.
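
A minimal sketch of that check, reusing the set-up of the earlier simulation (the object name adjR2array is just illustrative):

n <- 20
simulations <- 1000
adjR2array <- array(0, dim=simulations)
for (i in 1:simulations) {
  x <- rnorm(n)
  z <- array(rnorm(n*4), dim=c(n,4))
  y <- 0.1*x + rnorm(n)
  mod <- lm(y~x+z)
  # store the adjusted R squared estimate this time
  adjR2array[i] <- summary(mod)$adj.r.squared
}
# compare the mean of the adjusted estimates with the true R squared
mean(adjR2array)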

