I've written about R squared a few times before. In a discussion I was involved with today the question was raised as to how/whether the R squared in a linear regression model with a single continuous predictor depends on the variance of the predictor variable. The answer to the question is of course yes.

# Linear regression

## Interval regression with heteroskedastic errors

Interval regression allows one to fit a linear model of an outcome on covariates when the outcome is subject to censoring. In Stata an interval regression can be fitted using the intreg command. Each outcome value is either observed exactly, is interval censored (we know it lies in a certain range), left censored (we only know the outcome is less than some value), or right censored (we only know the outcome is greater than some value). In Stata's implementation the robust option is available, which with regular linear regression can be used when the residual variance is not constant. Using robust option doesn't change the parameter estimates, but the standard errors (SEs) are calculated using the sandwich variance estimator. In this post I'll briefly look at the rationale for using robust with interval regression, and highlight the fact that if the residual variances are not constant, unlike for regular linear regression, the interval regression estimates are biased.

## The robust sandwich variance estimator for linear regression (using R)

In a previous post we looked at the (robust) sandwich variance estimator for linear regression. This method allowed us to estimate valid standard errors for our coefficients in linear regression, without requiring the usual assumption that the residual errors have constant variance.

In this post we'll look at how this can be done in practice using R, with the sandwich package (I'll assume below that you've installed this library). To illustrate, we'll first simulate some simple data from a linear regression model where the residual variance increases sharply with the covariate:

## R squared and goodness of fit in linear regression

I've been teaching a modelling course recently, and have been reading and thinking about the notion of goodness of fit. R squared, the proportion of variation in the outcome Y, explained by the covariates X, is commonly described as a measure of goodness of fit. This of course seems very reasonable, since R squared measures how close the observed Y values are to the predicted (fitted) values from the model.

## R squared and adjusted R squared

One quantity people often report when fitting linear regression models is the R squared value. This measures what proportion of the variation in the outcome Y can be explained by the covariates/predictors. If R squared is close to 1 (unusual in my line of work), it means that the covariates can jointly explain the variation in the outcome Y. This means Y can be accurately predicted (in some sense) using the covariates. Conversely, a low R squared means Y is poorly predicted by the covariates. Of course, an effect can be substantively important but not necessarily explain a large amount of variance - blood pressure affects the risk of cardiovascular disease, but it is not a strong enough predictor to explain a large amount of variation in outcomes. Put another way, knowing someone's blood pressure can't tell you with much certainty whether a particular individual will suffer from cardiovascular disease.

## The robust sandwich variance estimator for linear regression (theory)

In a previous post we looked at the properties of the ordinary least squares linear regression estimator when the covariates, as well as the outcome, are considered as random variables. In this post we'll look at the theory sandwich (sometimes called robust) variance estimator for linear regression. See this post for details on how to use the sandwich variance estimator in R.