It is well known that adjusting for one or more baseline covariates can increase statistical power in randomized controlled trials. One reason that adjusted analyses are not used more widely may be that researchers are concerned that results will be biased if the baseline covariate(s)’ effects are not modelled correctly in the regression model for the outcome. For example, a continuous baseline covariate would by default be entered linearly in a regression model, but in truth its effect on outcome may be non-linear. In this post we’ll review an important result which shows that for continuous outcomes modelled with linear regression, this does not matter in terms of bias: we obtain (asymptotically) unbiased estimates of the treatment effect even if we mis-specify a baseline covariate’s effect on outcome.
Setup
We’ll assume that we have data from a two arm trial on $n$ subjects. For the $i$th subject we record a baseline covariate $X_i$ and outcome $Y_i$. We let $Z_i$ denote a binary indicator of whether the subject is randomized to the new treatment group ($Z_i=1$) or the standard treatment group ($Z_i=0$). In some situations the baseline covariate $X_i$ may be a measurement of the same variable (e.g. blood pressure) which is being measured at follow-up by $Y_i$.
A linear regression model for the data is then specified as

$$Y_i = \alpha + \beta Z_i + \gamma X_i + \epsilon_i$$

where $\epsilon_i$ are errors which have expectation zero conditional on $Z_i$ and $X_i$. The parameters (regression coefficients) $\alpha$, $\beta$ and $\gamma$ are then estimated using ordinary least squares. We let $\hat{\beta}$ denote the estimated treatment effect. The true treatment effect is

$$\beta = E(Y_i \mid Z_i = 1) - E(Y_i \mid Z_i = 0).$$
Robustness to misspecification
We now ask the question: is the ordinary least squares estimator $\hat{\beta}$ unbiased for $\beta$, even if the assumed linear regression model is not correctly specified? The answer is yes (asymptotically), and below we outline the proof given by Yang and Tsiatis in their 2001 paper ‘Efficiency Study of Estimators for a Treatment Effect in a Pretest-Posttest Trial’.
This means that for continuous outcomes analysed by linear regression, we do not need to worry that by potentially mis-specifying the effect of $X_i$ we may introduce bias into the treatment effect estimator. The price of mis-specifying the effect of $X_i$ is a reduction in efficiency. For more on methods which adaptively model this effect, see the 2008 paper by Tsiatis et al here and also Chapter 5 section 4 of Tsiatis’ book, ‘Semiparametric Theory and Missing Data’. We also note a further result from Yang and Tsiatis: in general a more efficient estimator can be obtained by allowing for an interaction between $X_i$ and treatment $Z_i$.
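As a small illustrative sketch (not code from either paper), one way to implement this interaction estimator in R is to centre the covariate at its sample mean before including the interaction; the coefficient on the treatment indicator then estimates the marginal treatment effect. The variables y, z and x here are as in the simulation code further below:

#centre the covariate at its sample mean
xc <- x - mean(x)
#fit the model with treatment-by-covariate interaction
interactMod <- lm(y ~ z + xc + z:xc)
#coefficient on z estimates the marginal treatment effect
coef(interactMod)["z"]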
Another important point to remember is that the standard ‘model based’ standard error from linear regression assumes that the residual errors have constant variance. If this assumption doesn’t hold, it’s important to account for this in our inferences. Providing the sample size is not small, this can be achieved by using sandwich standard errors, which I covered in an earlier post here.
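As a brief sketch of how this can be done in R, assuming the sandwich and lmtest packages are installed (and with mod standing for any of the fitted lm models):

library(sandwich)  #heteroskedasticity-robust (sandwich) variance estimators
library(lmtest)    #coeftest() for inference with a user-supplied variance

mod <- lm(y ~ z + x)
#robust (sandwich) standard errors for the regression coefficients
coeftest(mod, vcov = vcovHC(mod, type = "HC3"))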
Simulations
To illustrate these results, we perform a small simulation study. For trials of size $n=1000$, we will simulate the treatment indicator $Z_i \sim \text{Bernoulli}(0.5)$ and a baseline covariate $X_i \sim N(0,1)$. We will then simulate the outcome $Y_i$ from a linear regression model, but with linear and quadratic effects of $X_i$:

$$Y_i = X_i + X_i^2 + Z_i + \epsilon_i, \quad \epsilon_i \sim N(0,1).$$

The true treatment effect is set to $\beta = 1$.
We perform three analyses: 1) an unadjusted analysis using lm(), equivalent to a two sample t-test, 2) an adjusted analysis, including $X_i$ linearly, and hence mis-specifying the outcome model, and 3) the correct adjusted analysis, including both linear and quadratic effects of $X_i$.
The code is given by:
nsim <- 1000
n <- 1000
pi <- 0.5

unadjusted <- array(0, dim=nsim)
adjustedmisspec <- array(0, dim=nsim)
adjustedcorrspec <- array(0, dim=nsim)

for (sim in 1:nsim) {
  z <- rbinom(n, 1, pi)
  x <- rnorm(n)
  y <- x+x^2+z+rnorm(n)

  #analysis not adjusting for baseline covariate
  unadjustedMod <- lm(y~z)
  unadjusted[sim] <- coef(unadjustedMod)[2]

  #adjusted analysis, mis-specified (linear effect of x only)
  adjustedmisspecMod <- lm(y~z+x)
  adjustedmisspec[sim] <- coef(adjustedmisspecMod)[2]

  #adjusted analysis, correctly specified (linear and quadratic effects of x)
  xsq <- x^2
  adjustedcorrspecMod <- lm(y~z+x+xsq)
  adjustedcorrspec[sim] <- coef(adjustedcorrspecMod)[2]
}

mean(unadjusted)
mean(adjustedmisspec)
mean(adjustedcorrspec)
sd(unadjusted)
sd(adjustedmisspec)
sd(adjustedcorrspec)
Running this, I obtained (without setting the seed, you will get slightly different results):
> mean(unadjusted)
[1] 0.9988225
> mean(adjustedmisspec)
[1] 0.9980142
> mean(adjustedcorrspec)
[1] 0.9995535
> sd(unadjusted)
[1] 0.121609
> sd(adjustedmisspec)
[1] 0.1090832
> sd(adjustedcorrspec)
[1] 0.0639239
As expected, all three estimators are unbiased. In particular, the estimator based on a mis-specified adjustment for the baseline covariate remains unbiased, as per the theory. Moreover, we see that even with the mis-specified model, the estimator is less variable than the unadjusted estimator, corresponding to a gain in efficiency. However, we also see that a much larger efficiency gain is possible if we are able to correctly specify the effect of the baseline covariate.
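To put numbers on these efficiency gains (an illustrative calculation using the simulation output above, not part of the original code), we can compare the empirical variances of the three estimators:

#empirical efficiency of the adjusted estimators relative to unadjusted
var(unadjusted) / var(adjustedmisspec)   #approx. (0.1216/0.1091)^2 = 1.24
var(unadjusted) / var(adjustedcorrspec)  #approx. (0.1216/0.0639)^2 = 3.62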
Proof
The following proof is taken from a 2001 paper by Yang and Tsiatis, which can be accessed at JSTOR here. First we centre all three variables. The variables centred by their true expectations are denoted $\tilde{Y}_i = Y_i - E(Y_i)$, $\tilde{Z}_i = Z_i - E(Z_i)$ and $\tilde{X}_i = X_i - E(X_i)$. We let $\hat{Y}_i$, $\hat{Z}_i$ and $\hat{X}_i$ denote the variables centred by their sample (as opposed to population) means. After centring, our model for the variables centred about their population means becomes

$$\tilde{Y}_i = \beta \tilde{Z}_i + \gamma \tilde{X}_i + \epsilon_i$$
where now there is no intercept because of the centring. Note also that after centring the variables about their corresponding sample means, we can fit the model to the empirically centred variables without an intercept, and obtain the same estimates $\hat{\beta}$ and $\hat{\gamma}$ as without centring. We now let $\hat{W}_i = (\hat{Z}_i, \hat{X}_i)^T$ and $\hat{\theta} = (\hat{\beta}, \hat{\gamma})^T$. The OLS estimators can then be expressed as

$$\hat{\theta} = \left( \sum_{i=1}^{n} \hat{W}_i \hat{W}_i^T \right)^{-1} \sum_{i=1}^{n} \hat{W}_i \hat{Y}_i.$$
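This equivalence between the centred no-intercept fit and the usual fit with an intercept can be checked numerically; the following is a quick sketch reusing y, z and x from the simulation code above:

#slopes from the uncentred fit with an intercept...
fit1 <- lm(y ~ z + x)
#...equal those from the sample-mean-centred, no-intercept fit
fit2 <- lm(I(y - mean(y)) ~ 0 + I(z - mean(z)) + I(x - mean(x)))
coef(fit1)[c("z", "x")]
coef(fit2)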
As the sample size tends to infinity, the OLS estimators converge in probability to

$$\theta^* = \begin{pmatrix} \beta^* \\ \gamma^* \end{pmatrix} = E(\tilde{W}_i \tilde{W}_i^T)^{-1} E(\tilde{W}_i \tilde{Y}_i)$$

where $\tilde{W}_i = (\tilde{Z}_i, \tilde{X}_i)^T$.
We can then derive

$$E(\tilde{W}_i \tilde{W}_i^T) = \begin{pmatrix} E(\tilde{Z}_i^2) & E(\tilde{Z}_i \tilde{X}_i) \\ E(\tilde{Z}_i \tilde{X}_i) & E(\tilde{X}_i^2) \end{pmatrix}.$$
Then we have that since $\tilde{Z}_i$ has mean zero, $E(\tilde{Z}_i^2) = Var(Z_i) = \pi(1-\pi)$, where $\pi = P(Z_i = 1)$. Since $\tilde{X}_i$ similarly has mean zero, $E(\tilde{X}_i^2) = Var(X_i) = \sigma^2_X$, where $\sigma^2_X$ denotes the variance of $X_i$. Lastly, by randomization, $Z_i$ and $X_i$ are statistically independent, which means that $E(\tilde{Z}_i \tilde{X}_i) = E(\tilde{Z}_i)E(\tilde{X}_i) = 0$. Since the off diagonal elements are zero, we then have that

$$\beta^* = \frac{E(\tilde{Z}_i \tilde{Y}_i)}{\pi(1-\pi)}.$$
To find the latter expectation we can use the law of total expectation to give

$$E(\tilde{Z}_i \tilde{Y}_i) = E\{E(\tilde{Z}_i \tilde{Y}_i \mid Z_i)\} = \pi(1-\pi)E(\tilde{Y}_i \mid Z_i = 1) - (1-\pi)\pi E(\tilde{Y}_i \mid Z_i = 0),$$

since $\tilde{Z}_i = 1 - \pi$ when $Z_i = 1$ and $\tilde{Z}_i = -\pi$ when $Z_i = 0$.
Since we can write $E(\tilde{Y}_i \mid Z_i = 1) = E(Y_i \mid Z_i = 1) - E(Y_i)$ and $E(\tilde{Y}_i \mid Z_i = 0) = E(Y_i \mid Z_i = 0) - E(Y_i)$, we have that

$$E(\tilde{Y}_i \mid Z_i = 1) - E(\tilde{Y}_i \mid Z_i = 0) = E(Y_i \mid Z_i = 1) - E(Y_i \mid Z_i = 0) = \beta$$
and so

$$E(\tilde{Z}_i \tilde{Y}_i) = \pi(1-\pi)\beta.$$
Finally then, we have

$$\beta^* = \frac{E(\tilde{Z}_i \tilde{Y}_i)}{\pi(1-\pi)} = \frac{\pi(1-\pi)\beta}{\pi(1-\pi)} = \beta.$$
We have thus shown that the OLS estimator of treatment effect is consistent for $\beta$, irrespective of whether the linearity assumption for the baseline covariate's effect on $Y_i$ is correct or not.
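As a final check, the key step $E(\tilde{Z}_i \tilde{Y}_i) = \pi(1-\pi)\beta$ can be verified by Monte Carlo; the following sketch uses the same data generating mechanism as the simulation study, with $\pi = 0.5$ and $\beta = 1$:

set.seed(1)
n <- 10^6
pi <- 0.5
z <- rbinom(n, 1, pi)
x <- rnorm(n)
y <- x + x^2 + z + rnorm(n)

#E(Z~ Y~) should be close to pi*(1-pi)*beta = 0.25
mean((z - pi) * (y - mean(y)))
#dividing by pi*(1-pi) recovers the true treatment effect beta = 1
mean((z - pi) * (y - mean(y))) / (pi * (1 - pi))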