The robust sandwich variance estimator for linear regression (using R)

In a previous post we looked at the (robust) sandwich variance estimator for linear regression. This method allowed us to estimate valid standard errors for our coefficients in linear regression, without requiring the usual assumption that the residual errors have constant variance.

In this post we’ll look at how this can be done in practice using R, with the sandwich package (I’ll assume below that you’ve installed this library). To illustrate, we’ll first simulate some simple data from a linear regression model where the residual variance increases sharply with the covariate:

n <- 100
x <- rnorm(n)
residual_sd <- exp(x)
y <- 2*x + residual_sd*rnorm(n)

This code generates Y from a linear regression model given X, with true intercept 0, and true slope 2. However, the residual standard deviation has been generated as exp(x), such that the residual variance increases with increasing levels of X. We can visually see the effect of this:


which gives

Plot of simulated Y against X data, where residual variance increases with increasing levels of X
Plot of simulated Y against X data, where residual variance increases with increasing levels of X

In this simple case it is visually clear that the residual variance is much larger for larger values of X, thus violating one of the key assumptions needed for the 'model based' standard errors to be valid. In any case, let's see what the results are if we fit the linear regression model as usual:

> mod<- lm(y~x)
> summary(mod)

lm(formula = y ~ x)

     Min       1Q   Median       3Q      Max 
-25.9503  -0.8574  -0.1751   0.9809  13.4015 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.08757    0.36229  -0.242 0.809508    
x            1.18069    0.31071   3.800 0.000251 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.605 on 98 degrees of freedom
Multiple R-squared:  0.1284,    Adjusted R-squared:  0.1195 
F-statistic: 14.44 on 1 and 98 DF,  p-value: 0.0002512

This shows that we have strong evidence against the null hypothesis that Y and X are independent. For comparison later, we note that the standard error of the X effect is 0.311.

Now we will use the (robust) sandwich standard errors, as described in the previous post. To do this we will make use of the sandwich package.

Next we load the sandwich package, and then pass the earlier fitted lm object to a function in the package which calculates the sandwich variance estimate:

> library(sandwich)
> vcovHC(mod, type = "HC")
            (Intercept)         x
(Intercept)  0.08824454 0.1465642
x            0.14656421 0.3414185

The resulting matrix is the estimated variance covariance matrix of the two model parameters. Thus the diagonal elements are the estimated variances (squared standard errors). We can therefore calculate the sandwich standard errors by taking these diagonal elements and square rooting:

> sandwich_se <- diag(vcovHC(mod, type = "HC"))^0.5
> sandwich_se
(Intercept)           x 
  0.2970598   0.5843103 

So, the sandwich standard error for the coefficient of X is 0.584. This contrasts with the earlier model based standard error of 0.311. Because here the residual variance is not constant, the model based standard error underestimates the variability in the estimate, and the sandwich standard error corrects for this. Let's see what impact this has on the confidence intervals and p-values. To do this we use the result that the estimators are asymptotically (in large samples) normally distributed. First, to get the confidence interval limits we can use:

> coef(mod)-1.96*sandwich_se
(Intercept)           x 
-0.66980780  0.03544496 
> coef(mod)+1.96*sandwich_se
(Intercept)           x 
  0.4946667   2.3259412 

So the 95% confidence interval limits for the X coefficient are (0.035, 2.326). To find the p-values we can first calculate the z-statistics (coefficients divided by their corresponding standard errors), and compare the squared z-statistics to a chi-squared distribution on one degree of freedom:

> z_stat <- coef(mod)/sandwich_se
> p_values <- pchisq(z_stat^2, 1, lower.tail=FALSE)
> p_values
(Intercept)           x 
 0.76815365  0.04331485 

We now have a p-value for the dependence of Y on X of 0.043, in contrast to p-value obtained earlier from lm of 0.00025. Using the sandwich standard errors has resulted in much weaker evidence against the null hypothesis of no association.

Note that there are in fact other variants of the sandwich variance estimator available in the sandwich package.

14 thoughts on “The robust sandwich variance estimator for linear regression (using R)”

  1. I like your explanation about this, but I was confused by the final conclusion. Since we have already known that y is equal to 2*x plus a residual, which means x has a clear relationship with y, why do you think “the weaker evidence against the null hypothesis of no association” is a better choice? and what’s more, since we all know the residual variance among x is not a constant, it increases with increasing levels of X, but robust method also take it as a constant, a bigger constant, it is not the true case either, why we should think this robust method is a better one? Thank you for your sharing.

    • Hi Amenda, thanks for your questions. So when the residual variance is in truth not constant, the standard model based estimate of the standard error of the regression coefficients is biased. Consequently, p-values and confidence intervals based on this will not be valid – for example 95% confidence intervals based on the constant variance based SE will not have 95% coverage in repeated samples. On your second point, the robust/sandwich SE is estimating the SE of the regression coefficient estimates, not the residual variance itself, which here was not constant as X varied. So when the residual variance is not constant as X varies, the robust/sandwich SE will give you a valid estimate of the repeated sampling variance for the regression coefficient estimates.

  2. Hi Jonathan, really helpful explanation, thank you for it. I got a couple of follow up questions, I’ll just start. 1. When you created the z-value, isn’t it necessary to subtract the expected value? 2. Why did you set the lower.tail to FALSE, isn’t it common to use it? And 3. I used your code on my data and compered it with the ones I got when I used the “coeftest” command. I got similar but not the equal results, sometimes it even made the difference between two significance levels, is it possible to compare these two or did I miss something?
    I hope I didn’t over asked you, all in all this was a great and helpful article.



    • Hi Nick

      1. So I was calculating a p-value for a test of the null that the coefficient of X is zero. In general the test statistic would be the estimate minus the value under the null, divided by the standard error. Here the null value is zero, so the test statistic is simply the estimate divided by its standard error.

      2. Because I squared the z statistic, this gives a chi squared variable under the null on 1 degree of freedom, with large positive values indicating evidence against the null (these correspond to either large negative or large positive values of the z-statistic). Thus I want the upper tail probability, not the lower.

      3. I have not used ceoftest before, but from looking at the documentation, are you passing the sandwich variance estimate to coeftest? If you just pass the fitted lm object I would guess it is just using the standard model based (i.e. not sandwich) variance estimates, and hence you would get differences.


  3. Hi Jonathan, thanks for the nice explanation. I just have one question, can I apply this for logit/probit regression models?

  4. Thank a lot. I got the same results using your detailed method and the following method. sorry if my question and comments are too naive :), really new to the topic.

    coeftest(model, vcov = vcovHC(model, “HC”))


  5. Hi Jonathan

    2 things:

    1. Site is super helpful. Thank you so much

    2. “and compare the squared z-statistics to a chi-squared distribution on one degree of freedom”… Why are we using one df?

    Thanks in adv.

    • The z-statistic follows a standard normal distribution under the null. So you can either find the two tailed p-value using this, or equivalently, the one tailed p-value for the squared z-statistic with reference to a chi-squared distribution on 1 df. Why 1 df? Because a standard normal random variable squared follows the chi-squared distribution on 1 df.

  6. Hi Jonathan, super helpful, thanks so much! I have one question: I am using this in a logit regression (dependent variable binary, independent variables not) with the following command:
    model <- glm(DV ~ IV+IV+…+IV, family = binomial(link = "logit"), data = DATA)

    When I follow your approach, I can use HC0 and HC1, but if try to use HC2 and HC3, I get "NA" or "NaN" as a result. My preference for HC3 comes from a paper from Long and Ervin (2000) who argue that HC3 is most reliable for samples with less than 250 observations – however, they have looked at linear models.

    Why can I only use HC0 and HC1 but not HC2 and HC3 in a logit regression? Many thanks in advance!

  7. Hi Jonathan,

    Thanks so much for posting this. If all the assumptions for my multiple regression were satisfied except for homogeneity of variance, then I can still trust my coefficients and just adjust the SE, z-scores, and p-values as described above, right? Can/should I make a similar adjustment to the F test result as well?



    • Hi Devyn. Correct. The standard F-test is not valid if the errors don’t have constant variance. I don’t know if there is a robust version of this for linear regression. I think you could perform a joint Wald test that all the coefficients are zero, using the robust/sandwich version of the variance covariance matrix.

      • Hi Jonathan,

        Thanks so much, that makes sense. I have tried it. In general, my SEs were adjusted to be a little larger, but one thing I have noticed is that the standard errors actually got quite a bit smaller for a couple of dummy-coded groups where the vast majority of entries in the data are 0. Can you think of why the sandwich estimator could sometimes result in smaller SEs?




Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.