In a previous post we looked at the (robust) sandwich variance estimator for linear regression. This method allowed us to estimate valid standard errors for our coefficients in linear regression, without requiring the usual assumption that the residual errors have constant variance.
In this post we'll look at how this can be done in practice using R, with the sandwich package (I'll assume below that you've installed this library). To illustrate, we'll first simulate some simple data from a linear regression model where the residual variance increases sharply with the covariate:
set.seed(194812) n <- 100 x <- rnorm(n) residual_sd <- exp(x) y <- 2*x + residual_sd*rnorm(n)
This code generates Y from a linear regression model given X, with true intercept 0, and true slope 2. However, the residual standard deviation has been generated as exp(x), such that the residual variance increases with increasing levels of X. We can visually see the effect of this:
In this simple case it is visually clear that the residual variance is much larger for larger values of X, thus violating one of the key assumptions needed for the 'model based' standard errors to be valid. In any case, let's see what the results are if we fit the linear regression model as usual:
> mod<- lm(y~x) > summary(mod) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -25.9503 -0.8574 -0.1751 0.9809 13.4015 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.08757 0.36229 -0.242 0.809508 x 1.18069 0.31071 3.800 0.000251 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.605 on 98 degrees of freedom Multiple R-squared: 0.1284, Adjusted R-squared: 0.1195 F-statistic: 14.44 on 1 and 98 DF, p-value: 0.0002512
This shows that we have strong evidence against the null hypothesis that Y and X are independent. For comparison later, we note that the standard error of the X effect is 0.311.
Next we load the sandwich package, and then pass the earlier fitted lm object to a function in the package which calculates the sandwich variance estimate:
> library(sandwich) > vcovHC(mod, type = "HC") (Intercept) x (Intercept) 0.08824454 0.1465642 x 0.14656421 0.3414185
The resulting matrix is the estimated variance covariance matrix of the two model parameters. Thus the diagonal elements are the estimated variances (squared standard errors). We can therefore calculate the sandwich standard errors by taking these diagonal elements and square rooting:
> sandwich_se <- diag(vcovHC(mod, type = "HC"))^0.5 > sandwich_se (Intercept) x 0.2970598 0.5843103
So, the sandwich standard error for the coefficient of X is 0.584. This contrasts with the earlier model based standard error of 0.311. Because here the residual variance is not constant, the model based standard error underestimates the variability in the estimate, and the sandwich standard error corrects for this. Let's see what impact this has on the confidence intervals and p-values. To do this we use the result that the estimators are asymptotically (in large samples) normally distributed. First, to get the confidence interval limits we can use:
> coef(mod)-1.96*sandwich_se (Intercept) x -0.66980780 0.03544496 > coef(mod)+1.96*sandwich_se (Intercept) x 0.4946667 2.3259412
So the 95% confidence interval limits for the X coefficient are (0.035, 2.326). To find the p-values we can first calculate the z-statistics (coefficients divided by their corresponding standard errors), and compare the squared z-statistics to a chi-squared distribution on one degree of freedom:
> z_stat <- coef(mod)/sandwich_se > p_values <- pchisq(z_stat^2, 1, lower.tail=FALSE) > p_values (Intercept) x 0.76815365 0.04331485
We now have a p-value for the dependence of Y on X of 0.043, in contrast to p-value obtained earlier from lm of 0.00025. Using the sandwich standard errors has resulted in much weaker evidence against the null hypothesis of no association.
Note that there are in fact other variants of the sandwich variance estimator available in the sandwich package.