I just came across a very interesting draft paper on arXiv by Paul von Hippel on ‘maximum likelihood multiple imputation’. von Hippel has made many important contributions to the multiple imputation (MI) literature, including the paper which advocated that one ‘transform then impute’ when the substantive model of interest contains interaction or non-linear terms. The present paper on maximum likelihood multiple imputation is in its seventh draft on arXiv, the first having been released back in 2012. I haven’t read every detail of the paper, but it looks to me to be another thought-provoking and potentially practice-changing paper. This post does not attempt to cover all of the important points made in the paper, but just highlights a few.
Including the outcome in imputation models of covariates
Multiple imputation has become a popular approach for handling missing data (see www.missingdata.org.uk). Suppose that we have an outcome Y (the dependent variable in our model of interest) and a covariate X. Suppose further that X contains some missing values, and that we are happy to assume that these satisfy the missing at random assumption. Then we might consider using multiple imputation to impute the missing values in X. A natural question that follows is whether Y should be included as a covariate in the imputation model for X. Particularly when Y is measured later in time than X, our intuition may lead us to think that it is inappropriate to use the future information contained in Y when imputing X. This, however, is not the case.
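To make this concrete, here is a small simulation sketch in Python. It is my own illustration rather than anything taken from von Hippel's paper, and every setting in it is an assumption: a linear model with normal errors, a true slope of 1, half of the X values missing completely at random (a special case of missing at random), and imputation model parameters fixed at their estimates rather than drawn from a posterior, which is a simplification of proper MI. It imputes X once ignoring Y and once conditioning on Y, then compares the estimated slope of Y on X.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Assumed data-generating model (illustrative only): Y = beta * X + noise.
n, beta = 20_000, 1.0
x = rng.normal(size=n)
y = beta * x + rng.normal(size=n)

# Make half of the X values missing completely at random.
missing = rng.random(n) < 0.5
obs = ~missing


def slope(x_vals, y_vals):
    """OLS slope from regressing y on x."""
    return np.cov(x_vals, y_vals)[0, 1] / np.var(x_vals, ddof=1)


def impute_without_y():
    """Impute X from its marginal distribution, ignoring Y."""
    x_imp = x.copy()
    x_imp[missing] = rng.normal(
        x[obs].mean(), x[obs].std(ddof=1), missing.sum()
    )
    return x_imp


def impute_with_y():
    """Impute X from a linear regression of X on Y, plus residual noise."""
    b, a = np.polyfit(y[obs], x[obs], 1)  # slope, then intercept
    resid_sd = np.std(x[obs] - (a + b * y[obs]), ddof=2)
    x_imp = x.copy()
    x_imp[missing] = a + b * y[missing] + rng.normal(
        0.0, resid_sd, missing.sum()
    )
    return x_imp


# Average the slope estimate over M imputed datasets.
M = 5
slope_without_y = np.mean([slope(impute_without_y(), y) for _ in range(M)])
slope_with_y = np.mean([slope(impute_with_y(), y) for _ in range(M)])

print(f"true slope:            {beta:.2f}")
print(f"imputing X without Y:  {slope_without_y:.2f}")
print(f"imputing X with Y:     {slope_with_y:.2f}")
```

Under these assumptions the slope from imputing without Y should come out at roughly half the true value, because the imputed X values are by construction unrelated to Y, while imputing with Y should recover a slope close to 1. The attenuation in the first case gets worse as the proportion of missing X values grows.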