Linear regression is one of the workhorses of statistical analysis, allowing us to model how the expectation of an outcome Y depends on one or more predictors (also called covariates, regressors, or independent variables) X. Previously I wrote about the assumptions required for the validity of ordinary linear regression estimates and their inferential procedures (tests, confidence intervals), assuming (as we often do) that the residuals are normally distributed with constant variance.
In my experience, when the ordinary least squares (OLS) estimators and their inferential procedures are taught to students, the predictors X are treated as fixed. That is, we act as if they are controlled by the design of our study or experiment. In practice, in many studies the predictors are not under the control of the investigator, but are random variables, just like the outcome Y. As an example, suppose we are interested in finding out how blood pressure (BP) depends on weight. We take a random sample of 1,000 individuals and measure their BP and weight. We then fit a linear regression model with Y=BP and X=weight. The predictor X here is certainly not under the control of the investigator, and so it seems quite reasonable to ask: why is it valid to use inferential procedures which treat X as fixed, when in my study X is random?
First, let's consider the unbiasedness of the OLS estimator. Recall that the OLS estimator can be written as

$$\hat{\beta} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$$

where the bold $\mathbf{Y}$ and $\mathbf{X}$ denote the vector and matrix of the outcome and predictors respectively from our sample of subjects. Then under the assumption that $E(\mathbf{Y}|\mathbf{X}) = \mathbf{X}\beta$, the OLS estimator is easily shown to be conditionally (on $\mathbf{X}$) unbiased, since:

$$E(\hat{\beta}|\mathbf{X}) = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}E(\mathbf{Y}|\mathbf{X}) = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{X}\beta = \beta.$$
But what if our predictors are random? Then we are interested in the unconditional expectation of the estimator. In this case we can simply appeal to the law of total expectation to see that

$$E(\hat{\beta}) = E\{E(\hat{\beta}|\mathbf{X})\} = E(\beta) = \beta.$$
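To see this numerically, here is a small simulation sketch (the coefficients, sample size, and weight distribution are all made up for illustration): each replication draws a fresh random X along with Y, and the OLS estimates still average out to the true coefficients.

```python
import numpy as np

# Illustrative simulation (hypothetical numbers): even though X is random,
# the OLS estimator averages out to the true beta across replications.
rng = np.random.default_rng(2024)
beta_true = np.array([90.0, 0.5])   # made-up intercept and slope for BP ~ weight
n, n_sims = 1000, 2000

estimates = np.empty((n_sims, 2))
for s in range(n_sims):
    weight = rng.normal(75.0, 10.0, size=n)        # predictor is random, not designed
    X = np.column_stack([np.ones(n), weight])      # design matrix with intercept
    y = X @ beta_true + rng.normal(0.0, 5.0, size=n)
    estimates[s] = np.linalg.lstsq(X, y, rcond=None)[0]  # (X'X)^{-1} X'Y

print(estimates.mean(axis=0))   # close to the true [90, 0.5]
```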
What about the sampling variability of the OLS estimator? Assuming the residuals have constant variance $\sigma^{2}$, we can find its variance conditional on the observed values of the predictors:

$$Var(\hat{\beta}|\mathbf{X}) = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Var(\mathbf{Y}|\mathbf{X})\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}$$

which equals $\sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}$. In software, the variances of the OLS estimates are calculated using this formula, with the observed matrix $\mathbf{X}$ and the sample estimate of the residual variance, $\hat{\sigma}^{2}$. But what about the unconditional variance of the estimator? Using the law of total variance, we find that

$$Var(\hat{\beta}) = E\{Var(\hat{\beta}|\mathbf{X})\} + Var\{E(\hat{\beta}|\mathbf{X})\} = E\{\sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}\} + Var(\beta) = \sigma^{2}E\{(\mathbf{X}^{T}\mathbf{X})^{-1}\}$$

where the second term vanishes because $E(\hat{\beta}|\mathbf{X})=\beta$ is a constant.
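As a quick numerical check of the conditional variance formula (with made-up constants), the sketch below holds $\mathbf{X}$ fixed and redraws only the residuals; the empirical variance of the slope estimate should then match the slope entry of $\sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}$.

```python
import numpy as np

# Check Var(beta_hat | X) = sigma^2 (X'X)^{-1} by holding X fixed
# and redrawing only the residuals (all constants are illustrative).
rng = np.random.default_rng(7)
n, sigma = 500, 2.0
beta_true = np.array([1.0, 3.0])

x = rng.uniform(0.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])          # X is fixed from here on
theory = sigma**2 * np.linalg.inv(X.T @ X)    # conditional covariance matrix

n_sims = 5000
slopes = np.empty(n_sims)
for s in range(n_sims):
    y = X @ beta_true + rng.normal(0.0, sigma, size=n)
    slopes[s] = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(slopes.var(ddof=1), theory[1, 1])       # the two should be close
```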
The unconditional variance of the OLS estimator is therefore the average, across samples in which both X and Y are random, of the variance that the OLS estimator would have were X fixed at its observed values. Moreover, since $E\{\hat{\sigma}^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}|\mathbf{X}\} = \sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}$, another application of the law of total expectation shows that the usual OLS variance estimator, derived assuming fixed X, is unbiased for the unconditional variance of the estimator.
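Finally, a sketch of this last point (again with illustrative numbers): when both X and Y are redrawn in every replication, the empirical variance of the slope estimate should agree, on average, with the usual fixed-X variance estimate $\hat{\sigma}^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}$.

```python
import numpy as np

# With X random, the variance of the slope across full replications
# (new X and Y each time) should match the average of the usual
# fixed-X variance estimates sigma_hat^2 (X'X)^{-1}. Numbers are made up.
rng = np.random.default_rng(42)
n, sigma = 200, 2.0
beta_true = np.array([1.0, 3.0])

n_sims = 4000
slopes = np.empty(n_sims)
var_hats = np.empty(n_sims)
for s in range(n_sims):
    x = rng.normal(5.0, 2.0, size=n)               # predictor redrawn each replication
    X = np.column_stack([np.ones(n), x])
    y = X @ beta_true + rng.normal(0.0, sigma, size=n)
    beta_hat, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2_hat = rss[0] / (n - 2)                  # residual variance estimate
    slopes[s] = beta_hat[1]
    var_hats[s] = sigma2_hat * np.linalg.inv(X.T @ X)[1, 1]

print(slopes.var(ddof=1), var_hats.mean())         # approximately equal
```

The right-hand quantity is exactly what software reports for each single sample, which is why the usual standard errors remain valid with random predictors.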