Multiple imputation has become an extremely popular approach to handling missing data, for a number of reasons. One is that once the imputed datasets have been generated, they can each be analysed using standard analysis methods, and the results pooled using Rubin’s rules. However, in addition to the missing at random assumption, for multiple imputation to give unbiased point estimates the model(s) used to impute missing data need to be (at least approximately) correctly specified. Because of this, care must be taken when choosing the imputation model.
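To make the pooling step concrete, here is a minimal sketch of Rubin’s rules for a single parameter. The function name pool_rubin and the example numbers are made up for illustration; the inputs are the point estimates and squared standard errors obtained from each imputed dataset.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and squared standard errors (Rubin's rules)."""
    m = len(estimates)
    pooled_estimate = mean(estimates)      # average of the point estimates
    within = mean(variances)               # within-imputation variance
    between = variance(estimates)          # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)

# Hypothetical results from M = 3 imputed datasets
est, se = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(est, se)
```

The sketch only returns the pooled estimate and its standard error; the degrees of freedom needed for confidence intervals are left out.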
What constitutes a reasonable imputation model will obviously depend on the dataset and situation at hand. One situation which is commonly encountered, but where it is not obvious what one should do, is where the dataset, or the model(s) which will be fitted after imputation, contains interaction terms or non-linear terms such as squared terms.
Example – an interaction with missing values
As an example, let’s suppose that the analysis (substantive) model for an outcome Y, i.e. the model we will fit after performing multiple imputation, contains an interaction between two variables, X1 and X2. Further, let’s suppose that X1 contains missing values, so that correspondingly the interaction variable X1*X2 also has missing values. So, we have missingness in two variables, X1 and X1*X2, although the missingness in the interaction variable X1*X2 is induced by the missingness in the covariate X1.
The question is, how should we go about imputing the missing values? Whatever approach we use, we should ensure that imputed values in X1 (and X1*X2) are generated which are consistent with our analysis model, in which Y is assumed to depend on X1, X2 and X1*X2. An implication of this interaction is that the association between Y and X1 will be different for different values of X2.
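As a small illustration of that last point, suppose the analysis model is a linear regression with mean b0 + b1*X1 + b2*X2 + b3*X1*X2. The coefficient values below are arbitrary assumptions chosen for the example; the sketch just shows that the slope of Y on X1 equals b1 + b3*X2, and so changes with X2.

```python
def mean_y(x1, x2, b0=0.0, b1=1.0, b2=1.0, b3=0.5):
    """Conditional mean of Y under a linear model with an X1*X2 interaction."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def slope_in_x1(x2):
    """Change in the mean of Y per unit increase in X1, at a given value of X2."""
    return mean_y(1.0, x2) - mean_y(0.0, x2)

for x2 in (0.0, 1.0, 2.0):
    print(x2, slope_in_x1(x2))
```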
Impute then transform
The most obvious approach is to impute X1, using Y and X2 in the imputation model. Then in the imputed datasets we can ‘passively’ impute the interaction variable X1*X2. This approach was also termed ‘impute then transform’ in a paper by von Hippel, and is appealing due to its simplicity. However, unfortunately it (in general) leads to bias – the missing values in X1 are imputed from a model which (at least by default) assumes additive effects of X2 and Y, which is incompatible with the interaction term X1*X2 we want to include in our analysis model.
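The mechanics of impute then transform can be sketched as follows. This is a simplified illustration under stated assumptions, not what any particular imputation package does: the data are simulated, the helper names (solve, ols, impute_then_transform) are my own, and the imputation uses the fitted regression coefficients directly rather than drawing them from their posterior, so it is not proper multiple imputation.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def ols(design, response):
    """Least-squares coefficients and residual SD for response regressed on design."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, response)) for i in range(p)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(design, response)]
    sd = (sum(e * e for e in resid) / (len(response) - p)) ** 0.5
    return beta, sd

def impute_then_transform(data, rng):
    """Impute x1 from an additive model in y and x2, then compute x1*x2 passively."""
    obs = [d for d in data if d["x1"] is not None]
    beta, sd = ols([[1.0, d["y"], d["x2"]] for d in obs], [d["x1"] for d in obs])
    completed = []
    for d in data:
        x1 = d["x1"]
        if x1 is None:
            # Additive imputation model: no Y*X2 term, which is the source of the problem
            x1 = beta[0] + beta[1] * d["y"] + beta[2] * d["x2"] + rng.gauss(0.0, sd)
        completed.append({"y": d["y"], "x1": x1, "x2": d["x2"], "x1x2": x1 * d["x2"]})
    return completed

# Simulated data (an assumption for illustration): Y depends on X1, X2 and X1*X2
rng = random.Random(1)
data = []
for _ in range(200):
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    y = x1 + x2 + x1 * x2 + rng.gauss(0, 1)
    data.append({"y": y, "x1": None if rng.random() < 0.3 else x1, "x2": x2})

completed = impute_then_transform(data, rng)
```

In the completed data the interaction variable is always exactly the product of X1 and X2, but the imputed X1 values come from a model with no interaction.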
Realising that the default imputation model for X1 is incompatible with the interaction term in our analysis model, we might try to modify the imputation model to allow for it. If X1 and X2 interact in their associations with Y, this means the association between Y and X1 differs according to X2. Therefore a natural approach is to include the Y*X2 interaction in the imputation model for X1. This approach will usually perform better than the first, where we effectively completely ignored the interaction. However, it may not give unbiased estimates, because the correct imputation distribution, implied by the analysis model and a model for X1 given X2, may not be the one we are using, even when we include the Y*X2 interaction. For simulation results demonstrating this, see this paper.
Imputing separately in subgroups
Suppose that X2 is a factor (categorical) variable. In this case, provided there are not too many levels, and there are a reasonable number of observations in each level of X2, we can carry out multiple imputation separately within each level of X2. Why would we do this? Imputing separately within each level allows the association between X1 and Y to differ according to the level of X2, and so allows for the interaction which we would like to include in our analysis model.
This can be an excellent way of handling the interaction. However, there are some situations where it can’t be used:
1) if X2 is continuous, we can’t do this, because there are too many different ‘levels’
2) we also can’t use this approach if X2 also contains missing values.
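When X2 is a fully observed factor, imputing within subgroups might look something like the sketch below. The data are simulated with a binary X2, the function names are hypothetical, and as a simplification the regression coefficients are used as estimated rather than drawn from their posterior. Because X2 is constant within a level, the imputation model within each level only needs Y.

```python
import random
from collections import defaultdict

def impute_x1_given_y(rows, rng):
    """Impute missing x1 from a simple linear regression of x1 on y (simplified)."""
    obs = [r for r in rows if r["x1"] is not None]
    n = len(obs)
    mean_y = sum(r["y"] for r in obs) / n
    mean_x = sum(r["x1"] for r in obs) / n
    syy = sum((r["y"] - mean_y) ** 2 for r in obs)
    slope = sum((r["y"] - mean_y) * (r["x1"] - mean_x) for r in obs) / syy
    intercept = mean_x - slope * mean_y
    rss = sum((r["x1"] - intercept - slope * r["y"]) ** 2 for r in obs)
    sd = (rss / (n - 2)) ** 0.5
    return [
        dict(r, x1=r["x1"] if r["x1"] is not None
             else intercept + slope * r["y"] + rng.gauss(0.0, sd))
        for r in rows
    ]

def impute_by_level(data, rng):
    """Run the imputation separately within each level of x2 and stack the results."""
    groups = defaultdict(list)
    for r in data:
        groups[r["x2"]].append(r)
    completed = []
    for level in sorted(groups):
        # Each level gets its own intercept, slope and residual SD
        completed.extend(impute_x1_given_y(groups[level], rng))
    return completed  # note: rows are returned grouped by level, not in original order

# Simulated data (an assumption): the slope of Y on X1 is 1 when x2 = 0 and 2 when x2 = 1
rng = random.Random(2)
data = []
for _ in range(300):
    x1, x2 = rng.gauss(0, 1), rng.choice([0, 1])
    y = x1 + x2 + x1 * x2 + rng.gauss(0, 1)
    data.append({"y": y, "x1": None if rng.random() < 0.3 else x1, "x2": x2})

completed = impute_by_level(data, rng)
```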
Transform then impute, or just another variable (JAV)
von Hippel’s proposed solution to the problem is to impute the interaction variable X1*X2 directly, as if it were ‘just another variable’ (JAV). The consequence of this is that in the imputed datasets, the imputed values of the interaction variable X1X2 will not be equal to the product of the imputed value of X1 and the observed value of X2. At first sight, this doesn’t appear to be a sensible approach, since we have imputed values in the interaction variable which are not consistent with the deterministic relationship that X1X2=X1*X2.
However, it turns out that in certain special cases (described in detail here), such an approach does give unbiased estimates. Specifically, if our analysis model is linear regression, and the data are missing completely at random, it will be unbiased. The intuition for this result is that although the imputation model isn’t correctly specified (manifested by the inconsistency in the imputed values), it does create imputed datasets where Y, X1, X2 and X1X2 have the correct means and covariances, and since the coefficients of a linear regression model only depend on these, unbiased estimates are obtained.
Unfortunately, for other types of model (e.g. logistic regression) the above argument doesn’t hold, and this approach results in biased estimates (see here for simulation results). Bias also occurs when the data are missing at random, as opposed to satisfying the more restrictive missing completely at random assumption. Having said that, when the analysis model is linear regression, JAV is probably the best route to take among the approaches available using standard imputation software.
Substantive model compatible full conditional specification / chained equations
Recently, with colleagues, I’ve developed an alternative approach to imputing in the presence of interactions or non-linear terms (paper here). The approach builds upon the popular chained equations or full conditional specification approach to multiple imputation. The essence of the approach is to ensure that each partially observed variable (just X1 in our running example) is imputed using a model which is compatible with the analysis (substantive) model. This is achieved using a sampling technique called rejection sampling. So, if our analysis model contains an interaction between X1 and X2, X1 is imputed using Y and X2 from a model which is compatible with the analysis model (i.e. with the interaction).
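To give a flavour of the rejection sampling step, here is a minimal sketch for a single missing X1 value when the analysis model is a normal linear regression with the X1*X2 interaction. It is an illustration under strong simplifying assumptions, not the smcfcs implementation: the parameter values are treated as known (in the actual algorithm they are drawn from their posteriors at each iteration), X1 given X2 is assumed to be standard normal, and the function name draw_x1 is made up. A candidate X1 is proposed from the model for X1 given X2, and accepted with probability proportional to the density of Y given the candidate and X2.

```python
import math
import random

def draw_x1(y, x2, beta, sigma, x1_mean, x1_sd, rng, max_tries=10000):
    """Draw x1 given y and x2, compatible with y ~ N(b0 + b1*x1 + b2*x2 + b3*x1*x2, sigma^2)."""
    b0, b1, b2, b3 = beta
    for _ in range(max_tries):
        proposal = rng.gauss(x1_mean, x1_sd)   # candidate from the assumed model for X1 given X2
        resid = y - (b0 + b1 * proposal + b2 * x2 + b3 * proposal * x2)
        # The normal density of Y is largest when the residual is zero, so this
        # ratio of the density to its upper bound lies between 0 and 1
        accept_prob = math.exp(-resid ** 2 / (2 * sigma ** 2))
        if rng.random() <= accept_prob:
            return proposal
    raise RuntimeError("no proposal accepted; increase max_tries")

rng = random.Random(3)
# Assumed values: Y = X1 + X2 + X1*X2 + error with SD 0.5, and X1 given X2 ~ N(0, 1)
draws = [draw_x1(y=3.0, x2=1.0, beta=(0.0, 1.0, 1.0, 1.0), sigma=0.5,
                 x1_mean=0.0, x1_sd=1.0, rng=rng) for _ in range(2000)]
print(sum(draws) / len(draws))  # should be close to the exact conditional mean 16/17
```

In this simple normal case the conditional distribution of X1 could be derived directly, which gives a way to check the draws. The appeal of rejection sampling is that the same recipe still works when no such closed form is available.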
In simulations (see the paper), this approach performed favourably compared to the previously described methods. As well as interactions, the approach can accommodate non-linear terms in the analysis model. Examples of this include squared terms, ratios (e.g. body mass index), and transformations (e.g. log transform) in the analysis model.
This approach is implemented in the R package smcfcs and in Stata (to install, type: ssc install smcfcs).