Multiple imputation has become an extremely popular approach to handling missing data, for a number of reasons. One is that once the imputed datasets have been generated, they can each be analysed using standard analysis methods, and the results pooled using Rubin's rules. However, in addition to the missing at random assumption, for multiple imputation to give unbiased point estimates the model(s) used to impute missing data need to be (at least approximately) correctly specified. Because of this, care must be taken when choosing the imputation model.

What constitutes a reasonable imputation model will obviously depend on the dataset and situation at hand. One situation which is commonly encountered, but where it is not obvious what one should do, is where the dataset, or the model(s) which will be fitted after imputation, contains interaction terms or non-linear terms such as squared terms.

**Example - an interaction with missing values**

As an example, let's suppose that the analysis (substantive) model for an outcome Y, i.e. the model we will fit after performing multiple imputation, contains an interaction between two variables, X1 and X2. Further, let's suppose that X1 contains missing values, so that correspondingly the interaction variable X1*X2 also has missing values. So, we have missingness in two variables, X1 and X1*X2, although the missingness in the interaction variable X1*X2 is induced by the missingness in the covariate X1.

The question is, how should we go about imputing the missing values? Whatever approach we use, we should ensure that imputed values in X1 (and X1*X2) are generated which are consistent with our analysis model, in which Y is assumed to depend on X1, X2 and X1*X2. An implication of this interaction is that the association between Y and X1 will be different for different values of X2.
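
To make that implication concrete, suppose (purely as an assumption for illustration, since nothing above fixes a particular form) that the analysis model is a linear regression with hypothetical coefficients β0 to β3. Collecting the terms that involve X1 gives:

```latex
\begin{aligned}
E(Y \mid X_1, X_2) &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \\
                   &= (\beta_0 + \beta_2 X_2) + (\beta_1 + \beta_3 X_2)\, X_1
\end{aligned}
```

So the slope of Y on X1 is β1 + β3·X2, which changes with X2 whenever β3 is non-zero.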

**Impute then transform**

The most obvious approach is to impute X1, using Y and X2 in the imputation model. Then in the imputed datasets we can 'passively' impute the interaction variable X1*X2. This approach was also termed 'impute then transform' in a paper by von Hippel, and is appealing due to its simplicity. Unfortunately, it leads (in general) to bias: the missing values in X1 are imputed from a model which (at least by default) assumes additive effects of X2 and Y, which is incompatible with the interaction term X1*X2 we want to include in our analysis model.
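
A small simulation can make the problem visible. The sketch below is only illustrative and rests on assumptions that are not in the post: a linear analysis model with all coefficients equal to 1, standard normal X1 and X2, half of X1 missing completely at random, and a single stochastic regression imputation (with parameters fixed at their estimates) instead of proper multiple imputation pooled with Rubin's rules.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 20_000

# Hypothetical data-generating model: Y = X1 + X2 + X1*X2 + noise
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + x1 * x2 + rng.normal(size=n)

# Make about half of X1 missing completely at random
missing = rng.random(n) < 0.5
obs = ~missing


def ols(columns, response):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(response)), *columns])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


# Default-style imputation model: X1 on Y and X2, additive effects only
a = ols([y[obs], x2[obs]], x1[obs])
resid = x1[obs] - (a[0] + a[1] * y[obs] + a[2] * x2[obs])
resid_sd = resid.std(ddof=3)

x1_imp = x1.copy()
x1_imp[missing] = (a[0] + a[1] * y[missing] + a[2] * x2[missing]
                   + rng.normal(scale=resid_sd, size=missing.sum()))

# Passive step: derive the interaction from the imputed X1
b_imputed = ols([x1_imp, x2, x1_imp * x2], y)
b_complete = ols([x1[obs], x2[obs], x1[obs] * x2[obs]], y[obs])

print("interaction, complete cases:       ", round(b_complete[3], 2))
print("interaction, impute then transform:", round(b_imputed[3], 2))
```

With the true interaction coefficient set to 1, the complete-case estimate should sit close to 1 (the data are MCAR here), while the impute-then-transform estimate should come out noticeably below it.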

Realising that the default imputation model for X1 is incompatible with the interaction term in our analysis model, we might try to modify the imputation model to allow for it. If X1 and X2 interact in their associations with Y, this means the association between Y and X1 differs according to X2. Therefore a natural approach is to include the Y*X2 interaction in the imputation model for X1. This approach will usually perform better than the first, where we effectively completely ignored the interaction. However, it may not give unbiased estimates, because the correct imputation distribution, implied by the analysis model and a model for X1 given X2, may not be the one we are using, even when we include the Y*X2 interaction. For simulation results demonstrating this, see this paper.
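
To see why adding Y*X2 may still not be enough, it can help to write out the correct imputation distribution in one simple case. What follows is a sketch under assumptions not stated in the post: a normal linear analysis model with hypothetical coefficients β0 to β3 and residual variance σ², and a covariate model X1 | X2 ~ N(μ, τ²).

```latex
\begin{aligned}
f(X_1 \mid Y, X_2) &\propto f(Y \mid X_1, X_2)\, f(X_1 \mid X_2), \\
X_1 \mid Y, X_2 &\sim N\!\left(m(Y, X_2),\; v(X_2)\right), \\
v(X_2) &= \left[\frac{1}{\tau^2} + \frac{(\beta_1 + \beta_3 X_2)^2}{\sigma^2}\right]^{-1}, \\
m(Y, X_2) &= v(X_2)\left[\frac{\mu}{\tau^2} + \frac{(\beta_1 + \beta_3 X_2)\,(Y - \beta_0 - \beta_2 X_2)}{\sigma^2}\right].
\end{aligned}
```

When β3 is non-zero, the conditional mean is a ratio of polynomials in X2 and the conditional variance changes with X2. A linear imputation model with constant residual variance cannot reproduce this exactly, even with a Y*X2 term included.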

**Imputing separately in subgroups**

Suppose that X2 is a factor variable. In this case, provided there are not too many levels, and there are a reasonable number of observations in each level of X2, we can carry out multiple imputation separately in the different levels of X2. Why would we do this? By imputing separately in the different levels of X2, we allow the associations between X1 and Y to differ according to the level of X2. So by imputing separately in the different levels of X2, we impute in a way which allows for the possibility of the interaction which we would like to include in our analysis model.

This can be an excellent way of handling the interaction. However, there are some situations where it can't be used:

1) if X2 is continuous, we can't do this, because there are too many different 'levels'

2) we also can't use this approach if X2 also contains missing values.
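
The subgroup idea can be sketched in a few lines. The example below is an illustration under assumed settings, not a recipe from the post: X2 is a fully observed two-level factor, all analysis-model coefficients equal 1, X1 is missing completely at random, and each group gets a single stochastic regression imputation of X1 on Y rather than proper multiple imputation.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

# Hypothetical data: X2 is a fully observed two-level factor (coded 0/1)
x2 = rng.integers(0, 2, size=n)
x1 = rng.normal(size=n)
y = x1 + x2 + x1 * x2 + rng.normal(size=n)
missing = rng.random(n) < 0.5   # X1 missing completely at random

x1_imp = x1.copy()
for level in (0, 1):
    grp_obs = (x2 == level) & ~missing
    grp_mis = (x2 == level) & missing
    # Within this level of X2, impute X1 from a linear regression on Y alone
    design = np.column_stack([np.ones(grp_obs.sum()), y[grp_obs]])
    coef, *_ = np.linalg.lstsq(design, x1[grp_obs], rcond=None)
    resid_sd = (x1[grp_obs] - design @ coef).std(ddof=2)
    x1_imp[grp_mis] = (coef[0] + coef[1] * y[grp_mis]
                       + rng.normal(scale=resid_sd, size=grp_mis.sum()))

# Analysis model with the interaction, fitted to the stacked imputed data
design = np.column_stack([np.ones(n), x1_imp, x2, x1_imp * x2])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
print("interaction estimate:", round(beta[3], 2))
```

Because the regression of X1 on Y is fitted separately in each level of X2, its slope and residual variance are free to differ between levels, which is what lets the imputed data carry the interaction.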

**Transform then impute, or just another variable (JAV)**

von Hippel's proposed solution to the problem is to impute the interaction variable directly, as if it were 'just another variable' (JAV). That is, we create a variable X1X2 equal to the product X1*X2 wherever X1 is observed, and impute its missing values alongside those of X1. The consequence of this is that in the imputed datasets, the imputed values of X1X2 will not be equal to the product of the imputed value of X1 and the observed value of X2. At first sight, this doesn't appear to be a sensible approach, since we have imputed values in the interaction variable which are not consistent with the deterministic relationship X1X2=X1*X2.

However, it turns out that in certain special cases (described in detail here), such an approach does give unbiased estimates. Specifically, if our analysis model is linear regression, and the data are missing completely at random, it will be unbiased. The intuition for this result is that although the imputation model isn't correctly specified (manifested by the inconsistency in the imputed values), it does create imputed datasets where Y, X1, X2 and X1X2 have the correct means and covariances, and since the coefficients of a linear regression model only depend on these, unbiased estimates are obtained.
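
The special case can be checked with a rough simulation. This sketch assumes a linear analysis model with all coefficients equal to 1, MCAR missingness in X1, and a single stochastic draw from a bivariate normal regression of (X1, X1X2) on Y and X2. These settings are illustrative choices, not taken from the post or from von Hippel's paper.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 20_000

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + x1 * x2 + rng.normal(size=n)
missing = rng.random(n) < 0.5   # X1 missing completely at random
obs = ~missing

# Treat X1 and the product X1*X2 as two ordinary variables to impute
targets = np.column_stack([x1, x1 * x2])

# Multivariate linear regression of both targets on Y and X2 (complete cases)
design_obs = np.column_stack([np.ones(obs.sum()), y[obs], x2[obs]])
coef, *_ = np.linalg.lstsq(design_obs, targets[obs], rcond=None)  # shape (3, 2)
resid = targets[obs] - design_obs @ coef
resid_cov = resid.T @ resid / (obs.sum() - 3)

# Impute both columns jointly, with correlated normal residual noise
design_mis = np.column_stack([np.ones(missing.sum()), y[missing], x2[missing]])
noise = rng.multivariate_normal(np.zeros(2), resid_cov, size=missing.sum())
imputed = targets.copy()
imputed[missing] = design_mis @ coef + noise

# In imputed rows, imputed[:, 1] no longer equals imputed[:, 0] * x2
design = np.column_stack([np.ones(n), imputed[:, 0], x2, imputed[:, 1]])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
print("interaction estimate:", round(beta[3], 2))
```

Even though the imputed product column is inconsistent with the imputed X1, the interaction estimate should land close to the true value of 1 in this MCAR, linear-regression setting.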

Unfortunately, for other types of model (e.g. logistic regression), the above argument doesn't hold, and this approach gives biased estimates (see here for simulation results). Bias also occurs when the data are missing at random (as opposed to satisfying the more restrictive missing *completely* at random assumption). Having said that, when the analysis model is linear regression, among the approaches available using standard imputation software, JAV is probably the best route to take.

**Substantive model compatible full conditional specification / chained equations**

Recently, with colleagues, I've developed an alternative approach to imputing in the presence of interactions or non-linear terms (paper here). The approach builds upon the popular chained equations or full conditional specification approach to multiple imputation. The essence of the approach is to ensure that each partially observed variable (just X1 in our running example) is imputed using a model which is compatible with the analysis (substantive) model. This is achieved using a sampling technique called rejection sampling. So, if our analysis model contains an interaction between X1 and X2, X1 is imputed using Y and X2 from a model which is compatible with the analysis model (i.e. with the interaction).
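
The rejection sampling step can be illustrated with a toy example. This is not the code from the R or Stata software, and it simplifies heavily: the substantive-model and covariate-model parameters are treated as known (hypothetical values of 1 for every coefficient, standard normal X1 independent of X2), whereas a full implementation would also need to estimate and update those parameters. The cap `max_tries` is likewise an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4_000

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + x1 * x2 + rng.normal(size=n)
missing = rng.random(n) < 0.5   # X1 missing completely at random

# Assumption for illustration: model parameters are treated as known
beta0, beta1, beta2, beta3, sigma = 0.0, 1.0, 1.0, 1.0, 1.0


def substantive_mean(x1_val, x2_val):
    """Mean of Y given X1 and X2 under the analysis model with interaction."""
    return beta0 + beta1 * x1_val + beta2 * x2_val + beta3 * x1_val * x2_val


x1_imp = x1.copy()
todo = np.flatnonzero(missing)   # rows still waiting for an accepted draw
max_tries = 5_000                # hypothetical cap on proposals per row
for _ in range(max_tries):
    if todo.size == 0:
        break
    # Propose from the covariate model X1 | X2, here N(0, 1)
    proposal = rng.normal(size=todo.size)
    x1_imp[todo] = proposal      # rows left over at the cap keep their last proposal
    # Accept with probability f(Y | X1, X2) divided by its maximum over X1
    resid = y[todo] - substantive_mean(proposal, x2[todo])
    accept_prob = np.exp(-0.5 * (resid / sigma) ** 2)
    accepted = rng.random(todo.size) < accept_prob
    todo = todo[~accepted]

design = np.column_stack([np.ones(n), x1_imp, x2, x1_imp * x2])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print("interaction estimate:", round(beta_hat[3], 2))
```

Accepted draws follow a density proportional to f(Y | X1, X2) f(X1 | X2), so the imputed X1 values are compatible with an analysis model that contains the X1*X2 interaction.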

In simulations (see the paper), this approach performed favourably compared to the previously described methods. As well as interactions, the approach can accommodate non-linear terms in the analysis model. Examples of this include squared terms, ratios (e.g. body mass index), and transformations (e.g. log transform) in the analysis model.

**Software**

Free R and Stata software implementing the approach is available here.

Hi!

Great text! Congrats!

I thought the approach to dealing with interaction variables (one a factor with two levels, the other continuous) was quite interesting. To impute in different subsets! I would like to try that on my data, although I'm not sure how easy it will be to combine the results after the imputation.

Although, theoretically if both variables are complete (continuous and factor that interact with each other), it will not be necessary to add the interaction to the dataset since the idea is that imputed values from variables with interaction will differ. Am I understanding this correctly?

Thanks! Keep up the good work

Thanks! If you're a Stata user imputing separately in subsets is particularly easy - you just add by(group) as an option to the command. It then imputes separately in each group, and combines the imputed datasets from each group together for you.

I'm not sure I entirely understand your question though - do you mean how would you impute another covariate X3 if X1, X2 and Y were complete?

I'm an R user. I will look into how easy it is to implement that solution in R.

What I was trying to say is: suppose you have an interaction between X1 (a factor with two levels) and X2 (continuous), and those two variables do not have missing observations. Several other variables (X3, X4, X5, etc.) do have missing observations. Do you think it will benefit the model if I include the interaction between X1 and X2, although they are complete?

Thank you very much for your guidance!

Thanks. Yes, if you are imputing X3, X4 and X5 using the standard MICE/ICE approach (as opposed to the SMC-FCS approach we have developed) you should include the interactions/non-linearities which are in the substantive model (e.g. the interaction X1*X2) in the imputation models. If you want to read more about this, it's given quite a detailed treatment in Chapters 6 and 7 of my colleagues' (Carpenter and Kenward) book, Multiple Imputation and its Application.

Hi, just come across this, thanks! I am using Mplus to run a multilevel model (two levels, average cluster size is only 2). I have interactions at within levels and across levels (gulps). Missing data on outcome as well as predictors (and thus interactions). Do you have any Mplus script examples for this by any chance? Many thanks again for interesting info on this tricky issue...

Sorry I don't have any MPlus code as I've never used it myself. But I think if you include the covariates as additional outcomes/dependent variables, and thereby model their distribution, MPlus may enable you to handle the missing values by its (I think) default full information maximum likelihood approach. So long as the joint model you specify includes your desired interactions, then I think it ought to effectively handle these when it obtains the maximum likelihood estimates.

Hi,

Thank you very much for the article, it's very useful.

I am a Stata user and am now facing a problem with multiple imputation. I use MICE to handle the missingness in my dataset. Using mhodds analysis, I have identified 6 potential effect modifiers (say X1 - X6) of the association between particular risk factors and the outcome. Those effect modifiers are categorical, completely observed, and have 2-4 levels.

With respect to the interactions, I plan to use the by ( ) option. I was wondering if I can impute the missing data using the by ( ) option for those 6 interaction terms simultaneously in one command. Or, do you have any suggestions regarding my problem?

With very many thanks

Using by() here is probably not a good idea, no, because there will be so many combinations of levels, such that the sample size in some combinations may be very small. Moreover this can't work if some of the variables involved in the interactions have missing values. The smcfcs command may however be a feasible solution - see https://blogs.lshtm.ac.uk/missingdata/2017/06/06/substantive-model-compatible-imputation-of-missing-covariates/

Hi, many thanks for the smcfcs method. That's really inspiring. My question is could it be feasible for longitudinal analyses? I am using xtmixed (multilevel modelling) for the data analyses. The dataset includes repetitive variables (e.g. edu1 edu2 edu3) across waves. According to your guidance, it seems at this moment smcfcs is only feasible for linear, logistic regression etc. Is there any possible way for me to use smcfcs to impute time-varying variables with interactions in my longitudinal dataset?

Thank you very much!

Hi Wendy. Unfortunately I've not yet extended smcfcs to that type of model. There have been some recent publications on this setting: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-017-0372-y and http://journals.sagepub.com/doi/full/10.1177/0962280217730851