Multiple imputation has become a popular approach for handling missing data (see www.missingdata.org.uk). Suppose that we have an outcome (the dependent variable in our model of interest) Y, and a covariate X. Suppose further that X contains some missing values, and that we are happy to assume that these satisfy the missing at random assumption. Then we might consider using multiple imputation to impute the missing values in X. A natural question that then follows is whether the variable Y should be included as a covariate in the imputation model for X. Particularly when Y is measured later in time than X, our intuition may lead us to think that it is inappropriate to use the future information contained in Y when imputing X. This, however, is not the case.
To illustrate the concepts, we simulate a small dataset in Stata, initially with no missing data:
clear
set seed 123
set obs 100
gen x=rnormal()
gen y=x+0.25*rnormal()
twoway (scatter y x) (lfit y x)
Next we set X to missing for the first 50 of the 100 observations, and create an indicator xmiss recording which values were made missing:
replace x=. if _n<=50
gen xmiss=(_n<=50)
The job of the imputation model
The job of the imputation model is to synthetically generate imputations of the missing values in such a way that statistical analysis of the resulting imputations leads to valid statistical inferences. In the present context, where we have two variables Y and X, and the analysis model consists of some type of regression of Y on X (meaning Y is the dependent variable and X is the covariate), we want our imputations to be generated such that we get valid estimates of the parameters in the model for Y|X.
Imputing X ignoring Y
Suppose that we were to impute X using a regression model, but did not include Y as a covariate in the imputation model. We can do this easily in Stata, generating just one imputed value for each missing value and then plotting Y against either the resulting imputed values of X or the observed X (when it was observed):
mi set flong
mi register imputed x
mi impute reg x, add(1)
twoway (scatter y x if _mi_m==1 & xmiss==1, mcolor(red) legend(lab(1 "Imputed"))) /*
*/ (scatter y x if _mi_m==1 & xmiss==0, mcolor(blue) legend(lab(2 "Observed")))
The plot clearly shows the problem with imputing the missing values in X while ignoring Y: amongst those for whom we have imputed X, there is no association between Y and X, when in fact there should be. If we were to fit our analysis model for Y|X to this imputed dataset, we would obtain biased estimates, with the dependence of Y on X diluted relative to the true regression coefficient value.
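The dilution can also be checked numerically rather than just by eye. The following is a minimal sketch, assuming the single imputed dataset created above (flong style, with the xmiss indicator) is still in memory. Because only one imputation was generated, it uses plain regress on the imputed copy rather than mi estimate. Since half of the X values were imputed independently of Y, a rough large-sample argument suggests the pooled slope is pulled towards about half of the true value of 1, although in a sample of 100 the estimate will vary.

```stata
* Assumes the flong mi dataset from the step above is in memory

* Slope of Y on X among those with imputed X (should be close to zero)
regress y x if _mi_m==1 & xmiss==1

* Slope among those with observed X (should be close to the true value of 1)
regress y x if _mi_m==1 & xmiss==0

* Slope in the full imputed dataset (diluted towards zero)
regress y x if _mi_m==1
```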
Imputing taking the outcome into account
If we instead impute X taking the outcome Y into account (as a covariate in the imputation model for X), the following steps will happen. The imputation model for X|Y will be fitted using those individuals with X observed. Since we are assuming X is missing at random given Y, this complete case analysis fit is valid. Thus if in truth there is no association between X and Y, we should (in expectation) find this in this complete case fit. Otherwise, we obtain a valid estimate of the dependence of X on Y. We then randomly impute the missing X values based on this complete case estimate of the distribution of X|Y (in fact there is an additional Bayesian posterior step, but this is not important for the present discussion).
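The steps just described can be sketched by hand, which may help make clear what the imputation command is doing. This is an illustrative simplification, not what mi impute does internally: it plugs in the complete case estimates directly and so skips the Bayesian posterior step mentioned above. The variable names xhat and x_manual are made up for this sketch, and the code assumes the mi dataset created earlier is in memory (preserve and restore are used so that it is left untouched).

```stata
preserve
* Work on the original (unimputed) data
mi extract 0, clear

* Step 1: fit the model for X|Y using those with X observed
regress x y if xmiss==0

* Step 2: predicted mean of X given Y for everyone
predict xhat, xb

* Step 3: impute missing X as predicted mean plus random residual noise
* (parameter uncertainty is ignored here, unlike in proper imputation)
gen x_manual = cond(xmiss==1, xhat + e(rmse)*rnormal(), x)

twoway (scatter y x_manual if xmiss==1, mcolor(red) legend(lab(1 "Imputed"))) /*
*/ (scatter y x_manual if xmiss==0, mcolor(blue) legend(lab(2 "Observed")))
restore
```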
The fact that Y may follow X chronologically is, for the purposes of multiple imputation, irrelevant. The goal is to specify a realistic model for the conditional distribution of the partially observed variable(s) given the fully observed ones.
To continue with our simulated dataset, we first discard the imputed values generated previously, then re-impute X, but this time including Y as a covariate in the imputation model:
mi extract 0, clear
mi set flong
mi register imputed x
mi impute reg x = y, add(1)
twoway (scatter y x if _mi_m==1 & xmiss==1, mcolor(red) legend(lab(1 "Imputed"))) /*
*/ (scatter y x if _mi_m==1 & xmiss==0, mcolor(blue) legend(lab(2 "Observed")))
Now we see that in our (single) imputed dataset, the X values appear to have been imputed in such a way that the association between Y and X has been preserved.
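In practice we would generate several imputations and combine the results using Rubin's rules, rather than rely on a single imputed dataset. A minimal sketch of how this might look for our simulated data is given below. The choice of 20 imputations and the random number seed are arbitrary choices made for illustration, and the code assumes the mi dataset from the previous step is in memory.

```stata
* Discard the single imputation and start again from the original data
mi extract 0, clear
mi set flong
mi register imputed x

* Impute X from Y, this time creating 20 imputed datasets
* (20 and the seed value are arbitrary illustrative choices)
mi impute regress x = y, add(20) rseed(2024)

* Fit the analysis model in each imputed dataset and pool the estimates
mi estimate: regress y x
```

The pooled coefficient of X can then be compared with the true value of 1 used in the simulation.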
Variable selection in multiple imputation
A general rule when selecting which variables to include in imputation models is that all variables involved in the analysis model(s) must be included, either as variables which are being imputed, or as covariates in imputation models. Thus, the recommendation that the outcome variable of the analysis model be included follows automatically from the general rule.
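To illustrate the rule with more than two variables, the sketch below simulates a separate, hypothetical dataset in which the analysis model is a regression of Y on X and a second, fully observed covariate Z. The variable z, the coefficients, the sample size, the missingness proportion and the seeds are all invented for this example and do not come from the dataset used earlier. The point is simply that the imputation model for X contains every other variable in the analysis model, including the outcome.

```stata
* Hypothetical example: z is an invented, fully observed covariate
preserve
clear
set seed 456
set obs 200
gen z = rnormal()
gen x = 0.5*z + rnormal()
gen y = x + z + rnormal()

* Make roughly 30% of X missing completely at random
replace x = . if runiform() < 0.3

mi set flong
mi register imputed x
mi register regular y z

* Imputation model for X includes both Y and Z from the analysis model
mi impute regress x = y z, add(20) rseed(789)

* Analysis model
mi estimate: regress y x z
restore
```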
For further reading on the importance of including the outcome variable in imputation models, see a letter (here) I published previously with colleagues, and also this overview paper in the British Medical Journal.