Multiple imputation separately by groups in R and Stata

When using multiple imputation to impute missing values, there are often situations where one wants to perform the imputation process completely separately in groups of subjects defined by some fully observed variable (e.g. sex or treatment group). In Stata, this is made very easy through use of the by() option. You simply tell the mi impute command which variable (or variables) you want to stratify the imputation on. Stata will then impute separately within the groups defined by the variable(s), and assemble the imputations from each stratum back together so that you have your desired number of imputed datasets.

Last week someone asked me how to do it in R, ideally with the mice package. Compared to Stata, one has to do a little bit more work. One approach is to use the mice.impute.bygroup function in the miceadds package, a package which extends the functionality of mice in various directions. If you instead want to do it manually, you can do so by making use of the rbind function within the mice package.
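To make the 'impute separately, then stack back together' idea concrete, here is a minimal language-neutral sketch in plain Python. It is not how Stata or mice implement it: the function impute_by_group and the toy data are hypothetical, and a simple random draw from the observed values in the same group stands in for a real imputation model, so this is not proper multiple imputation.

```python
import random
from collections import defaultdict


def impute_by_group(rows, group_key, target_key, m=5, seed=2024):
    """Return m completed copies of rows, imputing target_key within groups.

    Hypothetical helper for illustration only: missing values (None) are
    filled by a random draw from the observed values in the same group.
    """
    rng = random.Random(seed)

    # Split row indices by the fully observed grouping variable.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[group_key]].append(i)

    completed = []
    for _ in range(m):
        data = [dict(row) for row in rows]  # copy, so the input is untouched
        for idx in groups.values():
            # Donor pool: observed values from this group only.
            donors = [rows[i][target_key] for i in idx
                      if rows[i][target_key] is not None]
            for i in idx:
                if data[i][target_key] is None:
                    if not donors:
                        raise ValueError("group has no observed values")
                    data[i][target_key] = rng.choice(donors)
        # Rows stay in their original order, so the groups are already
        # reassembled into one completed dataset.
        completed.append(data)
    return completed


# Toy data (made up): bmi is missing for one subject in each sex group.
rows = [
    {"sex": "F", "bmi": 21.0},
    {"sex": "F", "bmi": None},
    {"sex": "F", "bmi": 23.5},
    {"sex": "M", "bmi": 27.0},
    {"sex": "M", "bmi": None},
    {"sex": "M", "bmi": 29.5},
]
imputed = impute_by_group(rows, "sex", "bmi", m=5)
```

Each of the five completed datasets contains all six subjects, and every imputed value comes only from subjects in the same group, which is the defining feature of imputing by group.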


Maximum likelihood multiple imputation

I just came across a very interesting draft paper on arXiv by Paul von Hippel on ‘maximum likelihood multiple imputation’. Von Hippel has made many important contributions to the multiple imputation (MI) literature, including the paper which advocated that one ‘transform then impute’ when one has interaction or non-linear terms in the substantive model of interest. The present paper on maximum likelihood multiple imputation is in its seventh draft on arXiv, the first having been released back in 2012. I haven’t read every detail of the paper, but it looks to me to be another thought-provoking and potentially practice-changing paper. This post will not attempt to cover all of the important points made in the paper, but will just highlight a few.


Including the outcome in imputation models of covariates

Multiple imputation has become a popular approach for handling missing data (see www.missingdata.org.uk). Suppose that we have an outcome (the dependent variable in our model of interest) Y, and a covariate X. Suppose further that X contains some missing values, and that we are happy to assume that these satisfy the missing at random assumption. Then we might consider using multiple imputation to impute the missing values in X. A natural question that then follows is whether, in the imputation model for X, the variable Y should be included as a covariate. Particularly when Y is a variable measured later in time than X, our intuition may lead us to think that it is inappropriate to use the future information contained in Y when imputing X. This, however, is not the case.
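A small simulation can illustrate why. The sketch below is in plain Python and rests on assumptions of my own choosing: X is standard normal, Y = X + standard normal noise (so the true slope of Y on X is 1), half of the X values are missing completely at random, and the imputation distributions use the true parameter values rather than estimates. It performs a single stochastic imputation rather than full multiple imputation, which is enough to show the bias. The function names ols_slope and simulate are hypothetical.

```python
import random


def ols_slope(x, y):
    """Least-squares slope from regressing y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + rng.gauss(0, 1) for xi in x]            # true slope is 1
    missing = [rng.random() < 0.5 for _ in range(n)]  # X missing for ~50%

    # Imputation model that ignores Y: draw from the marginal X ~ N(0, 1).
    x_no_y = [rng.gauss(0, 1) if m else xi
              for xi, m in zip(x, missing)]

    # Imputation model that includes Y: under the assumed data-generating
    # model, X | Y ~ N(Y / 2, 1 / 2), so draw from that conditional.
    sd_cond = 0.5 ** 0.5
    x_with_y = [rng.gauss(yi / 2, sd_cond) if m else xi
                for xi, yi, m in zip(x, y, missing)]

    return ols_slope(x_no_y, y), ols_slope(x_with_y, y)


slope_no_y, slope_with_y = simulate()
print("slope, imputing X without Y:", round(slope_no_y, 3))
print("slope, imputing X with Y:   ", round(slope_with_y, 3))
```

Under these assumptions, imputing X without Y makes the imputed values independent of Y, so the estimated slope is pulled towards roughly half its true value, whereas including Y in the imputation model gives a slope close to 1.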
