In a previous post we looked at the (robust) sandwich variance estimator for linear regression. This method allowed us to estimate valid standard errors for our coefficients in linear regression, without requiring the usual assumption that the residual errors have constant variance.

In this post we'll look at how this can be done in practice using R, with the sandwich package (I'll assume below that you've installed this library). To illustrate, we'll first simulate some simple data from a linear regression model where the residual variance increases sharply with the covariate:

set.seed(194812) n <- 100 x <- rnorm(n) residual_sd <- exp(x) y <- 2*x + residual_sd*rnorm(n)

This code generates Y from a linear regression model given X, with true intercept 0, and true slope 2. However, the residual standard deviation has been generated as exp(x), such that the residual variance increases with increasing levels of X. We can visually see the effect of this:

plot(x,y)

which gives

In this simple case it is visually clear that the residual variance is much larger for larger values of X, thus violating one of the key assumptions needed for the 'model based' standard errors to be valid. In any case, let's see what the results are if we fit the linear regression model as usual:

> mod<- lm(y~x) > summary(mod) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -25.9503 -0.8574 -0.1751 0.9809 13.4015 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.08757 0.36229 -0.242 0.809508 x 1.18069 0.31071 3.800 0.000251 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.605 on 98 degrees of freedom Multiple R-squared: 0.1284, Adjusted R-squared: 0.1195 F-statistic: 14.44 on 1 and 98 DF, p-value: 0.0002512

This shows that we have strong evidence against the null hypothesis that Y and X are independent. For comparison later, we note that the standard error of the X effect is 0.311.

Now we will use the (robust) sandwich standard errors, as described in the previous post. To do this we will make use of the sandwich package.

Next we load the sandwich package, and then pass the earlier fitted lm object to a function in the package which calculates the sandwich variance estimate:

> library(sandwich) > vcovHC(mod, type = "HC") (Intercept) x (Intercept) 0.08824454 0.1465642 x 0.14656421 0.3414185

The resulting matrix is the estimated variance covariance matrix of the two model parameters. Thus the diagonal elements are the estimated variances (squared standard errors). We can therefore calculate the sandwich standard errors by taking these diagonal elements and square rooting:

> sandwich_se <- diag(vcovHC(mod, type = "HC"))^0.5 > sandwich_se (Intercept) x 0.2970598 0.5843103

So, the sandwich standard error for the coefficient of X is 0.584. This contrasts with the earlier model based standard error of 0.311. Because here the residual variance is not constant, the model based standard error underestimates the variability in the estimate, and the sandwich standard error corrects for this. Let's see what impact this has on the confidence intervals and p-values. To do this we use the result that the estimators are asymptotically (in large samples) normally distributed. First, to get the confidence interval limits we can use:

> coef(mod)-1.96*sandwich_se (Intercept) x -0.66980780 0.03544496 > coef(mod)+1.96*sandwich_se (Intercept) x 0.4946667 2.3259412

So the 95% confidence interval limits for the X coefficient are (0.035, 2.326). To find the p-values we can first calculate the z-statistics (coefficients divided by their corresponding standard errors), and compare the squared z-statistics to a chi-squared distribution on one degree of freedom:

> z_stat <- coef(mod)/sandwich_se > p_values <- pchisq(z_stat^2, 1, lower.tail=FALSE) > p_values (Intercept) x 0.76815365 0.04331485

We now have a p-value for the dependence of Y on X of 0.043, in contrast to p-value obtained earlier from lm of 0.00025. Using the sandwich standard errors has resulted in much weaker evidence against the null hypothesis of no association.

Note that there are in fact other variants of the sandwich variance estimator available in the sandwich package.

I like your explanation about this, but I was confused by the final conclusion. Since we have already known that y is equal to 2*x plus a residual, which means x has a clear relationship with y, why do you think "the weaker evidence against the null hypothesis of no association" is a better choice? and what's more, since we all know the residual variance among x is not a constant, it increases with increasing levels of X, but robust method also take it as a constant, a bigger constant, it is not the true case either, why we should think this robust method is a better one? Thank you for your sharing.

Hi Amenda, thanks for your questions. So when the residual variance is in truth not constant, the standard model based estimate of the standard error of the regression coefficients is biased. Consequently, p-values and confidence intervals based on this will not be valid - for example 95% confidence intervals based on the constant variance based SE will not have 95% coverage in repeated samples. On your second point, the robust/sandwich SE is estimating the SE of the regression coefficient estimates, not the residual variance itself, which here was not constant as X varied. So when the residual variance is not constant as X varies, the robust/sandwich SE will give you a valid estimate of the repeated sampling variance for the regression coefficient estimates.