# Assumptions for linear regression

Linear regression is one of the most commonly used statistical methods; it allows us to model how an outcome variable $Y$ depends on one or more predictors (sometimes called independent variables) $X_{1},X_{2},\dots,X_{p}$. In particular, we model how the mean, or expectation, of the outcome $Y$ varies as a function of the predictors:

$E(Y|X_{1},\dots,X_{p}) = \beta_{0}+\beta_{1}X_{1}+\dots+\beta_{p}X_{p}$

Equivalently, the linear model can be expressed by:

$Y=\beta_{0}+\beta_{1}X_{1}+\dots+\beta_{p}X_{p} + \epsilon$

where $\epsilon$ denotes a mean-zero error, or residual, term. To carry out statistical inference, additional assumptions, such as normality of $\epsilon$, are typically made.
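To make the model concrete, here is a minimal sketch that simulates data from a linear model with two predictors and recovers the coefficients by ordinary least squares. The helper name `fit_ols`, the coefficient values, the sample size and the seed are all illustrative assumptions, not part of any particular analysis.

```python
import numpy as np


def fit_ols(X, y):
    """Return OLS coefficient estimates, intercept first (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # Prepend a column of ones so that beta_0 is estimated as well.
    design = np.column_stack([np.ones(len(y)), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta_hat


rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))              # two predictors X_1, X_2
true_beta = np.array([1.0, 2.0, -0.5])   # illustrative beta_0, beta_1, beta_2
eps = rng.normal(scale=1.0, size=n)      # mean-zero errors
y = true_beta[0] + X @ true_beta[1:] + eps

beta_hat = fit_ols(X, y)
print(beta_hat)
```

With a sample this large, the estimates should land close to the values used to generate the data.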

However, a common misconception about linear regression is that it assumes that the outcome $Y$ is normally distributed. In fact, linear regression assumes normality of the residual errors $\epsilon$, which represent the variation in $Y$ that is not explained by the predictors. It may be the case that marginally (i.e. ignoring any predictors) $Y$ is not normal, but after removing the effects of the predictors, the remaining variability, which is precisely what the residuals represent, is normal, or at least closer to normal.
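The distinction between the marginal distribution of $Y$ and the distribution of the residuals can be seen in a small simulation. The sketch below assumes a single binary predictor with a large effect (6 units, an arbitrary choice), so that $Y$ is bimodal overall while the errors are exactly normal. Excess kurtosis, which is zero for a normal distribution, is used here as a rough numerical check on normality; the helper `excess_kurtosis` is written for this illustration.

```python
import numpy as np


def excess_kurtosis(values):
    """Sample excess kurtosis; approximately 0 for normally distributed data."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    return (centred ** 4).mean() / (centred ** 2).mean() ** 2 - 3.0


rng = np.random.default_rng(1)
n = 5000
group = rng.integers(0, 2, size=n)   # binary predictor taking values 0 or 1
eps = rng.normal(size=n)             # normal errors
y = 6.0 * group + eps                # marginally, y is a mixture of two normals

# With one binary predictor, the OLS fitted values are simply the group means.
fitted = np.where(group == 1, y[group == 1].mean(), y[group == 0].mean())
residuals = y - fitted

print("excess kurtosis of y:        ", excess_kurtosis(y))
print("excess kurtosis of residuals:", excess_kurtosis(residuals))
```

The outcome itself is clearly non-normal (strongly negative excess kurtosis, reflecting the two modes), whereas the residuals are close to normal once the predictor has been accounted for.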

So, inferential procedures for linear regression are typically based on a normality assumption for the residuals. However, a second fact, perhaps less widely known amongst analysts, is that as the sample size increases, the normality assumption for the residuals is no longer needed. More precisely, if we consider repeated sampling from our population, then for large sample sizes the distribution (across repeated samples) of the ordinary least squares estimates of the regression coefficients is approximately normal, even when the residuals themselves are not. As a consequence, for moderate to large sample sizes, non-normality of the residuals should not adversely affect the usual inferential procedures. This is a consequence of an extremely important result in statistics, the central limit theorem.
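The repeated-sampling argument can be checked by simulation. The following sketch draws many datasets with strongly skewed errors (a centred exponential distribution, chosen purely for illustration), fits a simple linear regression to each, and looks at the distribution of the slope estimates. The function names, the sample size of 200, the number of replications and the true coefficients are assumptions made for the example.

```python
import numpy as np


def skewness(values):
    """Sample skewness; approximately 0 for a symmetric distribution."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5


def simulate_slopes(n, reps, rng, beta0=1.0, beta1=2.0):
    """OLS slope estimates from `reps` simulated datasets of size `n`."""
    x = rng.normal(size=(reps, n))
    # Exponential(1) minus 1: mean zero, but heavily right-skewed.
    eps = rng.exponential(scale=1.0, size=(reps, n)) - 1.0
    y = beta0 + beta1 * x + eps
    # Closed-form slope for simple regression, computed row by row.
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / (xc ** 2).sum(axis=1)


rng = np.random.default_rng(2)
error_sample = rng.exponential(scale=1.0, size=100_000) - 1.0
slopes = simulate_slopes(n=200, reps=5000, rng=rng)

print("skewness of the errors:         ", skewness(error_sample))
print("skewness of the slope estimates:", skewness(slopes))
print("mean of the slope estimates:    ", slopes.mean())
```

Although the errors are far from symmetric, the slope estimates across repeated samples are centred on the true slope and show very little skewness, which is what the central limit theorem leads us to expect.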

A further assumption made by linear regression is that the residuals have constant variance (homoscedasticity). That is, their variance does not change across different levels of the predictors. In contrast to the normality assumption, if the residuals do not satisfy the constant variance assumption, standard errors and confidence intervals based on the standard formulae will be adversely affected, however large the sample size. Even in this case, though, the ordinary least squares estimators remain unbiased. If the constant variance assumption is violated, we can instead use the robust sandwich variance estimator, which gives valid standard errors in large samples without assuming constant variance.
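As a rough illustration of the difference between the standard and sandwich standard errors, the sketch below computes both by hand for simulated data in which the error variance grows with the square of the predictor. It uses the simplest (HC0) form of the sandwich estimator, $(X^{T}X)^{-1}X^{T}\mathrm{diag}(\hat{\epsilon}_{i}^{2})X(X^{T}X)^{-1}$; the function name, the data-generating values and the seed are assumptions for the example, and in practice one would normally rely on a statistics package rather than hand-written code.

```python
import numpy as np


def ols_with_standard_errors(X, y):
    """OLS estimates with classical and HC0 sandwich standard errors (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    design = np.column_stack([np.ones(len(y)), X])
    n, k = design.shape

    xtx_inv = np.linalg.inv(design.T @ design)
    beta_hat = xtx_inv @ design.T @ y
    resid = y - design @ beta_hat

    # Classical formula: assumes one common residual variance sigma^2.
    sigma2 = resid @ resid / (n - k)
    se_classical = np.sqrt(np.diag(sigma2 * xtx_inv))

    # HC0 sandwich: each observation contributes its own squared residual.
    meat = design.T @ (design * resid[:, None] ** 2)
    se_robust = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta_hat, se_classical, se_robust


rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
eps = np.abs(x) * rng.normal(size=n)   # error variance equals x^2 (not constant)
y = 1.0 + 2.0 * x + eps

beta_hat, se_classical, se_robust = ols_with_standard_errors(x, y)
print("estimates:   ", beta_hat)
print("classical SE:", se_classical)
print("sandwich SE: ", se_robust)
```

In this setup the coefficient estimates are still close to the true values, but the classical standard error for the slope is noticeably too small compared with the sandwich version, so confidence intervals based on it would be too narrow.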