R squared/correlation depends on variance of predictor

I've written about R squared a few times before. In a discussion I was involved with today, the question was raised of whether, and how, the R squared in a linear regression model with a single continuous predictor depends on the variance of that predictor variable. The answer is of course yes.

An algebraic explanation
Let's suppose the simplest possible linear model between an outcome $Y$ and continuous predictor $X$:

$Y = \alpha + \beta X + \epsilon$

The (true) R squared of the regression model is the proportion of variance in the outcome $Y$ that is explained by the predictor $X$. The variance explained by $X$ is the variance of the linear predictor:

$Var(\alpha + \beta X) = \beta^{2} Var(X)$

The total variance of the outcome in the population is then the sum of the variance of the linear predictor and the variance of the residuals, $\sigma^{2}$. Thus the true population R squared is:

$R^{2} = \frac{\beta^{2} Var(X)}{\beta^{2} Var(X) + \sigma^{2}}$

Suppose now that we consider how well $X$ predicts $Y$ (using R squared) in a restricted population, where we restrict on the basis of values of $X$ (in some way). If we do this, the variance of $X$ in the restricted population will be reduced, relative to its variance in the original population. Using the preceding formula for R squared, we can see that the effect of this will be to reduce R squared. In the extreme, if we restricted to those individuals with $X$ values in a very small range, the variance of $X$ in the restricted population would be almost zero, such that R squared would be close to zero.
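We can check this relationship numerically. The following sketch codes up the formula for the true R squared and evaluates it for a range of predictor variances; the values β = 1 and σ² = 1 are illustrative choices for the example, not anything special:

```r
# true R squared as a function of Var(X), for the model
# Y = alpha + beta*X + epsilon with Var(epsilon) = sigma2
trueR2 <- function(varX, beta = 1, sigma2 = 1) {
  (beta^2 * varX) / (beta^2 * varX + sigma2)
}

# as the variance of X shrinks, so does the true R squared
trueR2(c(100, 10, 1, 0.1))
```

As Var(X) heads towards zero the numerator vanishes while the residual variance stays fixed, so the true R squared heads towards zero too.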

A visual illustration
We can also visualize the preceding concept easily in R. We first simulate data from a linear model with a very large sample size:

# simulate a large sample from the model y = x + error,
# i.e. alpha = 0, beta = 1, sigma^2 = 1
n <- 10000
set.seed(456)
x <- 100*runif(n)
y <- x+rnorm(n)


If we plot $Y$ against $X$ in the total sample, using

plot(x,y)


we have:

Visually, it appears that $X$ is a very good predictor of $Y$. Fitting the corresponding linear model confirms this:

summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
Min      1Q  Median      3Q     Max
-4.1295 -0.6794 -0.0023  0.6879  3.5579

Coefficients:
Estimate Std. Error  t value Pr(>|t|)
(Intercept) 0.0068489  0.0204500    0.335    0.738
x           0.9999752  0.0003534 2829.539   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.015 on 9998 degrees of freedom
Multiple R-squared:  0.9988,	Adjusted R-squared:  0.9988
F-statistic: 8.006e+06 on 1 and 9998 DF,  p-value: < 2.2e-16


giving an R squared of 0.9988.

Next, we plot the data again, but restricted to those with $X<1$:

plot(x[x<1],y[x<1])


Y against X, for those with X<1

Now, visually at least, $X$ appears to explain a much smaller proportion of the variance of the outcome, and fitting the linear model to the restricted sample confirms this:

summary(lm(y[x<1]~x[x<1]))

Call:
lm(formula = y[x < 1] ~ x[x < 1])

Residuals:
Min       1Q   Median       3Q      Max
-2.93421 -0.73513 -0.09459  0.69282  2.59506

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.0893     0.2432  -0.367  0.71459
x[x < 1]      1.3960     0.4386   3.183  0.00215 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.121 on 72 degrees of freedom
Multiple R-squared:  0.1233,	Adjusted R-squared:  0.1112
F-statistic: 10.13 on 1 and 72 DF,  p-value: 0.002155


with a much lower R squared value of 0.1233.
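We can tie the simulation back to the algebra. Since the data were generated with β = 1 and σ² = 1, plugging the sample variance of $X$ in each group into the earlier formula gives the R squared we should expect; the sketch below does this (re-running the same simulation for self-containedness). For the restricted group the formula gives roughly 1/13, somewhat below the 0.1233 observed above, which reflects sampling variability in a subsample of only 74 observations:

```r
set.seed(456)
n <- 10000
x <- 100*runif(n)
y <- x + rnorm(n)

# true R squared formula with beta = 1, sigma^2 = 1
trueR2 <- function(varX) varX / (varX + 1)

# full sample: Var(X) is large, so R squared is near 1
trueR2(var(x))

# restricted sample: Var(X) is tiny, so R squared is small
trueR2(var(x[x < 1]))
```

Since $X$ is uniform on (0,100), Var(X) is about 100²/12 ≈ 833 in the full sample, but only about 1/12 after restricting to $X<1$, which is what drives the collapse in R squared.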