Multiple imputation has become an extremely popular approach to handling missing data, for a number of reasons. One is that once the imputed datasets have been generated, they can each be analysed using standard analysis methods, and the results pooled using Rubin’s rules. However, in addition to the missing at random assumption, for multiple imputation to give unbiased point estimates the model(s) used to impute missing data need to be (at least approximately) correctly specified. Because of this, care must be taken when choosing the imputation model.

What constitutes a reasonable imputation model will obviously depend on the dataset and situation at hand. One situation which is commonly encountered, but where it is not obvious what one should do, is where the dataset, or the model(s) which will be fitted after imputation, contains interaction terms or non-linear terms such as squared terms.

**Example – an interaction with missing values**

As an example, let’s suppose that the analysis (substantive) model for an outcome Y, i.e. the model we will fit after performing multiple imputation, contains an interaction between two variables, X1 and X2. Further, let’s suppose that X1 contains missing values, so that correspondingly the interaction variable X1*X2 also has missing values. So, we have missingness in two variables, X1 and X1*X2, although the missingness in the interaction variable X1*X2 is induced by the missingness in the covariate X1.

The question is, how should we go about imputing the missing values? Whatever approach we use, we should ensure that imputed values in X1 (and X1*X2) are generated which are consistent with our analysis model, in which Y is assumed to depend on X1, X2 and X1*X2. An implication of this interaction is that the association between Y and X1 will be different for different values of X2.

**Impute then transform**

The most obvious approach is to impute X1, using Y and X2 in the imputation model. Then in the imputed datasets we can ‘passively’ impute the interaction variable X1*X2. This approach was termed ‘impute then transform’ in a paper by von Hippel, and is appealing due to its simplicity. However, unfortunately it in general leads to bias – the missing values in X1 are imputed from a model which (at least by default) assumes additive effects of X2 and Y on X1, which is incompatible with the interaction term X1*X2 we want to include in our analysis model.
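In Stata, an ‘impute then transform’ analysis might look something like the following. This is just a sketch for the running example (variable names y, x1, x2 and the number of imputations are illustrative):

```stata
* Impute X1 using Y and X2 (default additive imputation model)
mi set flong
mi register imputed x1
mi impute chained (regress) x1 = y x2, add(20) rseed(1234)

* Passively create the interaction in each imputed dataset
mi passive: generate x1x2 = x1*x2

* Fit the analysis model with the interaction and pool using Rubin's rules
mi estimate: regress y x1 x2 x1x2
```

The bias described above arises because the imputation model in the first step ignores the interaction, even though the analysis model in the final step includes it.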

Realising that the default imputation model for X1 is incompatible with the interaction term in our analysis model, we might try to modify the imputation model to allow for it. If X1 and X2 interact in their associations with Y, this means the association between Y and X1 differs according to X2. Therefore a natural approach is to include the Y*X2 interaction in the imputation model for X1. This approach will usually perform better than the first, which effectively ignored the interaction entirely. However, it may still not give unbiased estimates, because the correct imputation distribution, implied by the analysis model and a model for X1 given X2, may differ from the one we are using, even when we include the Y*X2 interaction. For simulation results demonstrating this, see this paper.
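A sketch of this modified imputation model in Stata (again for the running example; here the Y*X2 product is created as an explicit variable so it can be included as a covariate when imputing X1):

```stata
* Create the Y*X2 interaction to use as a covariate when imputing X1
generate yx2 = y*x2

mi set flong
mi register imputed x1
mi register regular y x2 yx2

* Impute X1 allowing its association with Y to vary with X2
mi impute chained (regress) x1 = y x2 yx2, add(20) rseed(1234)
```

As discussed above, this improves on the default additive imputation model but is still not guaranteed to match the imputation distribution implied by the analysis model.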

**Imputing separately in subgroups**

Suppose that X2 is a factor variable. In this case, provided there are not too many levels, and there are a reasonable number of observations in each level of X2, we can carry out multiple imputation separately within each level of X2. Why would we do this? Imputing separately within each level of X2 allows the association between X1 and Y to differ according to the level of X2, and hence allows for the possibility of the interaction which we would like to include in our analysis model.

This approach can be an excellent approach to handling the interaction. However, there are some situations where it can’t be used:

1) if X2 is continuous, we can’t do this, because there are too many distinct ‘levels’

2) we can’t use this approach if X2 itself also contains missing values.
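In Stata this subgroup imputation can be done with the by() option of mi impute. A sketch for the running example, assuming X2 is a fully observed factor:

```stata
mi set flong
mi register imputed x1

* Impute X1 separately within each level of X2, allowing the
* X1-Y association (and hence the interaction) to differ by level
mi impute chained (regress) x1 = y, add(20) by(x2) rseed(1234)

* The interaction can then be included when fitting the analysis model
mi estimate: regress y c.x1##i.x2
```

Stata combines the imputed datasets from the different levels of X2 for you, so the analysis step proceeds as usual.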

**Transform then impute, or just another variable (JAV)**

von Hippel’s proposed solution to the problem is to impute the interaction variable X1*X2 directly, as if it were ‘just another variable’ (JAV). The consequence of this is that in the imputed datasets, the imputed values of the interaction variable X1X2 will not be equal to the product of the imputed value of X1 and the observed value of X2. At first sight, this doesn’t appear to be a sensible approach, since we have imputed values in the interaction variable which are not consistent with the deterministic relationship that X1X2=X1*X2.

However, it turns out that in certain special cases (described in detail here), such an approach does give unbiased estimates. Specifically, if our analysis model is linear regression, and the data are missing completely at random, it will be unbiased. The intuition for this result is that although the imputation model isn’t correctly specified (manifested by the inconsistency in the imputed values), it does create imputed datasets where Y, X1, X2 and X1X2 have the correct means and covariances, and since the coefficients of a linear regression model only depend on these, unbiased estimates are obtained.
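In standard imputation software, JAV amounts to registering the interaction as a variable to be imputed in its own right. A hedged Stata sketch for the running example:

```stata
* Create the interaction, leaving it missing wherever X1 is missing
generate x1x2 = x1*x2

mi set flong
mi register imputed x1 x1x2

* Impute X1 and X1*X2 as if the latter were 'just another variable'
mi impute chained (regress) x1 x1x2 = y x2, add(20) rseed(1234)

mi estimate: regress y x1 x2 x1x2
```

Note that in the imputed datasets the imputed x1x2 values will generally not equal the product of the imputed x1 and observed x2, for the reasons discussed above.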

Unfortunately however, for other types of model (e.g. logistic regression), the above argument doesn’t hold, and this approach results in biased estimates (see here for simulation results). Bias also occurs when the data are missing at random (as opposed to the more restrictive missing completely at random). Having said that, when the analysis model is linear regression, among the approaches available using standard imputation software, JAV is probably the best route to take.

**Substantive model compatible full conditional specification / chained equations**

Recently, with colleagues, I’ve developed an alternative approach to imputing in the presence of interactions or non-linear terms (paper here). The approach builds upon the popular chained equations or full conditional specification approach to multiple imputation. The essence of the approach is to ensure that the partially observed variables (just X1 in our running example) are imputed using a model which is compatible with the analysis (substantive) model. This is achieved using a sampling technique called rejection sampling. So, if our analysis model contains an interaction between X1 and X2, X1 is imputed, using Y and X2, from a model which is compatible with the analysis model (i.e. with the interaction).

In simulations (see the paper), this approach performed favourably compared to the previously described methods. As well as interactions, the approach can accommodate non-linear terms in the analysis model. Examples of this include squared terms, ratios (e.g. body mass index), and transformations (e.g. log transform) in the analysis model.

**Software**

This approach is implemented in the R package smcfcs and in Stata (to install, type: ssc install smcfcs).
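By way of illustration, a call for the running example might look something like the following sketch, based on the option syntax used in the comments below; consult help smcfcs in Stata for the exact syntax your installed version expects:

```stata
* Substantive model: linear regression of Y on X1, X2 and their interaction.
* X1 is partially observed and is imputed compatibly with this model.
generate x1x2 = x1*x2
smcfcs regress y x1 x2 x1x2, regress(x1) passive(x1x2 = x1*x2) m(20)

* Analyse the imputed datasets as usual
mi estimate: regress y x1 x2 x1x2
```

Unlike JAV, the passively imputed interaction here is always consistent with the imputed X1 and observed X2, and the imputations respect the interaction in the substantive model.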

Hi!

Great text! Congrats!

I thought the approach to deal with interaction variables (one factor with two levels, the other continuous) quite interesting – to impute in different subsets! I would like to try that with my data, although I’m not sure how easy it will be to combine the results after the imputation.

Although, theoretically, if both variables that interact with each other are complete (one continuous, one factor), it will not be necessary to add the interaction to the dataset, since the idea is that imputed values from variables with an interaction will differ. Am I understanding this correctly?

Thanks! Keep the good work

Thanks! If you’re a Stata user imputing separately in subsets is particularly easy – you just add by(group) as an option to the command. It then imputes separately in each group, and combines the imputed datasets from each group together for you.

I’m not sure I entirely understand your question though – do you mean how would you impute another covariate X3 if X1, X2 and Y were complete?

I’m a R user. I will look about how easy it is to implement that solution in R.

What I was trying to say is: suppose you have an interaction between X1 (two levels) and X2 (continuous), and those two variables do not have missing observations, while several other variables X3, X4, X5, etc. have missing observations. Do you think it will benefit the model if I include the interaction between X1 and X2, even though they are complete?

Thank you very much for your guidance!

Thanks. Yes, if you are imputing X3, X4 and X5 using the standard MICE/ICE approach (as opposed to the SMC-FCS approach we have developed) you should include the interactions/non-linearities which are in the substantive model (e.g. the interaction X1*X2) in the imputation models. If you want to read more about this, it’s given quite a detailed treatment in Chapters 6 and 7 of my colleagues Carpenter and Kenward’s book, Multiple Imputation and its Application.

Hi, just came across this, thanks! I am using Mplus to run a multilevel model (two levels, average cluster size is only 2). I have interactions at the within level and across levels (gulp). Missing data on the outcome as well as predictors (and thus interactions). Do you have any Mplus script examples for this by any chance? Many thanks again for the interesting info on this tricky issue…

Sorry, I don’t have any Mplus code as I’ve never used it myself. But I think if you include the covariates as additional outcomes/dependent variables, and thereby model their distribution, Mplus may enable you to handle the missing values by its (I think) default full information maximum likelihood approach. So long as the joint model you specify includes your desired interactions, then I think it ought to effectively handle these when it obtains the maximum likelihood estimates.

Hi,

Thank you very much for the article, it’s very useful.

I am a Stata user and am now facing a problem with multiple imputation. I use MICE to handle the missingness in my dataset. Using mhodds analysis, I have identified 6 potential effect modifiers (say X1 – X6) on the association between particular risk factors and the outcome. Those effect modifiers are categorical, fully observed, and have 2-4 levels.

With respect to the interactions, I plan to use the by() option. I was wondering if I can impute the missing data using the by() option for those 6 interaction terms simultaneously in one command. Or do you have any suggestions regarding my problem?

With very many thanks

Using by() here is probably not a good idea, no, because there will be so many combinations of levels, such that the sample size in some combinations may be very small. Moreover this can’t work if some of the variables involved in the interactions have missing values. The smcfcs command may however be a feasible solution – see https://blogs.lshtm.ac.uk/missingdata/2017/06/06/substantive-model-compatible-imputation-of-missing-covariates/

Hi, many thanks for the smcfcs method. That’s really inspiring. My question is: would it be feasible for longitudinal analyses? I am using xtmixed (multilevel modelling) for the data analyses. The dataset includes repeated measures variables (e.g. edu1 edu2 edu3) across waves. According to your guidance, it seems at this moment smcfcs is only feasible for linear, logistic regression etc. Is there any possible way for me to use smcfcs to impute time-varying variables with interactions in my longitudinal dataset?

Thank you very much!

Hi Wendy. Unfortunately I’ve not yet extended smcfcs to that type of model. There have been some recent publications on this setting: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-017-0372-y and http://journals.sagepub.com/doi/full/10.1177/0962280217730851

Hi, and thank you for the overview of the literature; it is most helpful! I have one question related to JAV, and another one related to the smcfcs algorithm.

JAV: In Seaman et al. (2012), it is said that the JAV approach makes the assumption that the vector (Y, X, Z, XZ) is jointly normally distributed. Moreover, it seems to be implied that one can extend (Y, X, Z, XZ) to (Y, X, Z, XZ, S), where S is a vector of covariates/auxiliary variables, and (Y, X, Z, XZ, S) is assumed to be jointly normally distributed. Does that mean that all of the variables Y, X, Z, XZ, and S must be imputed using a linear model when using multiple imputation by chained equations, even if some of the variables are not continuous, similar to what was done in White et al. (2010)?

Smcfcs: Can the smcfcs algorithm work in cases where some variables in the substantive model are scales derived from multiple items, but one wants to impute the items, not the scales?

Hello Jonathan, Thanks so much for this article and for focusing on imputation. I have a situation similar to your “Example – an interaction with missing values” above, except that my missing value is on X2, the variable for which the relationship between Y and X1 differ. My X2 is binary.

In a complete case analysis, X1 shows a STRONG effect on Y in one subgroup of X2 yet NO effect in the other subgroup of X2. When I impute X2 (18% missing) across the full dataset, it then mis-classifies subjects for my needed subgroup analyses. My effect gets washed out. Do you know any creative ways to address this?

So missing values in X2, X2 is binary, and you want to allow for an interaction between X1 and X2 in the model for Y? You can use my smcfcs software for this. Are you using R or Stata?

Great stuff!

1. When interactions turn out non-significant, do you simply remove them in a postestimation command, or does the imputation need to be re-run without the interactions?

Say you have 4 dichotomous variables with missing, and 2 interactions: x1 x2 x3 x4 x1x3 x1x4, in a Cox model.

Abbreviated Stata code:

gen x1x3 = x1 * x3

gen x1x4 = x1 * x4

stset timevar (…)

smcfcs stcox x1 x2 x3 x4 x1x3 x1x4, logit(x1 x2 x3 x4) passive(x1x3 = x1 * x3 | x1x4 = x1 * x4) m(20)

mi estimate, post hr: stcox x1 x2 x3 x4 x1x3 x1x4 –> interaction testing, with no significant interactions

Should you then report:

mi estimate, post hr: stcox x1 x2 x3 x4

or, rerun the whole imputation without the interaction terms:

smcfcs stcox x1 x2 x3 x4, logit(x1 x2 x3 x4) m(20)

2. How do you handle smcfcs and time-varying covariates?

Say x2 violates the PH assumption and you would include this as tvc(x2).

Should this be treated as any other interaction and be created manually?

stsplit, at(failures) (this is a command that causes severe memory issues on my setup…)

gen x2_tvc = x2 * _t

smcfcs stcox x1 x2 x3 x4, logit(x1 x2 x3 x4) passive(x2_tvc = x2 * _t) m(20)

Or as a postestimation command after:

smcfcs stcox x1 x2 x3 x4, logit(x1 x2 x3 x4) m(20)

mi estimate, post hr: stcox x1 x2 x3 x4, tvc(x2)

1. I think it is fine to fit the model without the interactions after imputing allowing for the possibility of the interactions. You could re-impute without allowing for the interactions, and you would likely gain slightly more efficient estimates.

2. I think this about the effect of a covariate varying with time? I suggest looking at this paper https://doi.org/10.1002/sim.7842

Hello! Thank you for this explanation, it is much appreciated. I have a question, but will give some background to my particular case first.

Firstly, I know how to use both SAS and SPSS, but for this purpose I am using SAS.

Secondly, I have a binary outcome and several binary and multi-level categorical covariates (no continuous covariates) that I am using in a multiple logistic regression model. In my pre-imputation model building phase, all covariates were identified as significantly contributing to the model (using a purposeful selection process and likelihood ratio tests). Two of my covariates were identified as having a significant interaction between them – let’s call them X1 and X2. The main explanatory variable in the model is X1 and it is binary. There are no missing data on the outcome, X1, or other covariates, but there are 12% missing data for X2, which is also binary.

Thirdly, I would like to perform a multiple imputation (the proc mi procedure in SAS) using an FCS logistic model to impute values for X2 and the interaction between X1*X2 to see if the interaction is still significant after imputation.

As you suggest above, it is better to impute the interaction term as “just another variable”. I am keen to do this; however, SAS does not allow me to put an interaction term into the model using “X1*X2” syntax if X2 is one of the variables being imputed. Do you have a recommendation to circumvent this? I suppose I could create a new variable that is the product of X1 and X2 (e.g. newvar = X1*X2, taking care to leave missing values where X2 is missing) and then put “newvar” into the multiple imputation model, but then, in the analysis/pooling phase (after imputation), there does not seem to be a way of getting stratified beta coefficients or odds ratios after pooling (the PROC MIANALYZE procedure in SAS).

I ultimately would like to report odds ratios for the binary outcome, comparing the binary categories of X1, adjusted for all the significant covariates, stratified by X2.

I have searched for resources and have not found an explanation. I feel like I may be missing something that is right in front of me! I would greatly appreciate any advice 🙂

Actually, as I wrote above, for logistic regression models we found in simulations that ‘just another variable’ doesn’t work too well.

A highly relevant paper for this situation can be found here https://doi.org/10.1016/j.jclinepi.2016.07.004. In your situation, one approach would be to impute separately in levels defined by the fully observed X1 variable. Because you impute separately by X1, the X2–>outcome association is allowed to differ between X1 levels, which is compatible with your interaction effect in the outcome model. Alternatively, you can impute the whole group at once, but include the X1*Y interaction variable (where Y is the outcome) in the imputation model for X2 as a covariate. To do this you can simply create a new variable X1_Y = X1*Y before imputing and include it as a covariate in the imputation model. In the analysis (using PROC MIANALYZE) this variable won’t be used.

Hello! I’m trying to develop my first multiple imputation model. I apologise if this is a basic question. I need to undertake an analysis with sex as an effect modifier.

I used your description above to impute with by(sex) within the imputation.

However, I don’t know how to actually run the regression with the imputed dataset, as when I run an interaction term with ## I get an error. Any advice?

Hi Ali. I assume because of the ## in your question that you are using Stata? If so, after doing the imputation with by(sex), mi estimate should work fine fitting your regression model with interactions included using the ## syntax.

Dear Jonathan, Thank you for such an insightful source of information!

I use Stata and I am currently running mice to do multiple imputation, but was not considering the interaction term in my final model. In your post you mention a place to access how to implement the approach you describe (“Free Stata software implementing the approach is available here.”); however, the link is no longer working. Would you be so kind as to point me to a place where I could find sample syntax to implement the passive imputation in Stata to incorporate the interaction in my model? Thank you so much!

Thanks Sandra. I’ve updated the post to give up to date details. Our imputation software in Stata can be downloaded by typing the following in Stata’s console:

ssc install smcfcs