Maximum likelihood multiple imputation

I just came across a very interesting draft paper on arXiv by Paul von Hippel on ‘maximum likelihood multiple imputation’. von Hippel has made many important contributions to the multiple imputation (MI) literature, including the paper advocating that one ‘transform then impute’ when the substantive model of interest contains interactions or non-linear terms. The paper on maximum likelihood multiple imputation is in its seventh draft on arXiv, the first having been released back in 2012. I haven’t read every detail of the paper, but it looks to me to be another thought-provoking and potentially practice-changing piece of work. This post will not attempt to cover all of the important points made in the paper, but will just highlight a few.

Standard MI
Rubin proposed MI within the Bayesian paradigm. In what would be considered the standard MI approach, before imputing the mth dataset one first takes a draw from the observed data posterior distribution of the parameters of the imputation model. The missing data are then drawn from the conditional distribution of the missing data given the observed data, conditioning on the parameter value drawn in the preceding step. Drawing a new parameter value for each imputed dataset is key to ensuring the imputations are ‘proper’, a concept defined by Rubin. If one follows this scheme, variance estimation for the final inference is, under certain assumptions, delightfully straightforward thanks to ‘Rubin’s rules’: one (essentially) sums the average within-imputation variance and a slightly inflated between-imputation variance.
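
To make the pooling rule concrete, here is a minimal sketch (in Python, with made-up numbers) of Rubin's rules applied to a set of point estimates and their within-imputation variances; the (1 + 1/m) factor on the between-imputation variance is the 'slight inflation' just mentioned.

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Pool m point estimates and their within-imputation variances
    using Rubin's rules (a generic sketch, not tied to any package)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)

    theta_bar = estimates.mean()            # pooled point estimate
    w_bar = variances.mean()                # average within-imputation variance
    b = estimates.var(ddof=1)               # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b     # Rubin's total variance

    return theta_bar, total_var

# Example: pool results from m = 5 imputed-data analyses (made-up numbers)
est, var = rubins_rules([1.02, 0.97, 1.10, 1.05, 0.99],
                        [0.040, 0.038, 0.042, 0.039, 0.041])
```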

Even for standard models like the multivariate normal, the first step of obtaining draws from the observed data posterior distribution is non-trivial (at least with non-monotone missingness). Typically we resort to MCMC methods, which, as von Hippel points out, depend on various choices and checks (burn-in, number of iterations between draws to ensure approximate independence, diagnostics for non-convergence). Thus an imputation approach which does not require drawing from the observed data posterior may be quite appealing, since it would avoid all of these issues.

Maximum likelihood MI
von Hippel proposes generating each imputed dataset conditional on the observed data maximum likelihood estimate (MLE), an approach he terms maximum likelihood MI (MLMI). As he describes, obtaining the MLE is often performed anyway as a first step, to choose starting values for the MCMC sampler in standard posterior draw MI (PDMI). In MLMI the additional (often non-trivial) step of obtaining posterior draws is thus obviated. This is probably the main advantage of MLMI over PDMI, and it is potentially quite a big one. A further point is that one does not need to specify a prior for the imputation model parameters, as one does in PDMI. Having said that, implementations of PDMI are very often based on default ‘vague’ priors for the imputation model, and indeed software implementing standard MI often does not allow the user to specify alternative prior choices.
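
To illustrate the difference, here is a toy sketch assuming a single continuous variable y with missing values and a fully observed covariate x, imputed under a normal linear model. It is not von Hippel's general implementation, but it shows that the only change relative to PDMI is the absence of a per-imputation posterior draw.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlmi_impute(y, x, m=10):
    """Toy MLMI for a normal linear imputation model: y imputed given x,
    with missingness in y only. Every imputation conditions on the same
    observed-data MLE; PDMI would instead draw new parameter values from
    their posterior before each imputation."""
    obs = ~np.isnan(y)
    X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
    beta_hat = np.linalg.lstsq(X_obs, y[obs], rcond=None)[0]  # MLE of coefficients
    resid = y[obs] - X_obs @ beta_hat
    sigma2_hat = resid @ resid / obs.sum()                    # ML residual variance

    X_mis = np.column_stack([np.ones((~obs).sum()), x[~obs]])
    imputations = []
    for _ in range(m):
        y_imp = y.copy()
        # MLMI: impute from the fitted conditional distribution at the MLE
        y_imp[~obs] = X_mis @ beta_hat + rng.normal(0.0, np.sqrt(sigma2_hat),
                                                    size=(~obs).sum())
        imputations.append(y_imp)
    return imputations
```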

For a given number of imputations, MLMI is statistically more efficient than PDMI, since the latter is subject to additional randomness from the posterior draws. This point was made in an important 1998 paper by Wang and Robins, who developed large sample theory for the asymptotic distributions of both MLMI and PDMI, including variance estimators for the former. A nice observation from von Hippel is that the variance of a single draw from the observed data posterior is twice the variance of the observed data MLE. This follows because, in large samples, a posterior draw is approximately equal to the MLE plus an independent mean zero perturbation whose variance equals the sampling variance of the MLE, so the draw's variance is twice the MLE's variance.
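
Spelling this argument out, with \hat{\theta} the observed data MLE, V its sampling variance, and \tilde{\theta} a single posterior draw, we have (roughly, in large samples):

```latex
% Large-sample heuristic for the variance of a single posterior draw
% \hat{\theta}: observed-data MLE with sampling variance V
% \tilde{\theta}: one draw from the observed-data posterior
\begin{align*}
\tilde{\theta} &\approx \hat{\theta} + u, \qquad u \sim N(0, V) \text{ independently of } \hat{\theta},\\
\operatorname{Var}(\tilde{\theta}) &\approx \operatorname{Var}(\hat{\theta}) + \operatorname{Var}(u) = V + V = 2V.
\end{align*}
```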

As others have noted before, Rubin’s rules variance estimator is invalid when applied to MLMI. von Hippel goes on to describe formulae for consistently estimating the variance of MLMI estimates, based on within- and between-imputation variance components. One immediate issue with these, however, is that some of the component estimates can be negative, particularly when the fraction of missing information is large and the number of imputations is small, even though the quantities they estimate are non-negative. He therefore proposes alternative shrinkage estimators that are guaranteed to return values within the corresponding parameter spaces. A consequence of this shrinkage is that the estimates can be biased when the fraction of missing information is large and the number of imputations is not sufficiently large. Nevertheless, at first glance von Hippel’s within/between variance estimation approach looks attractive, particularly when the fraction of missing information is below 50%. Moreover, the estimators do not appear to be much more complicated to implement than Rubin’s rules.

Bootstrapped MI
von Hippel also considers use of the bootstrap with MLMI and PDMI, something I have written about before. He describes a simple two-way random effects model which implies a result for the variance of the bootstrapped MI estimator (where one first bootstraps, then performs MI within each bootstrap sample, then averages the point estimates across all bootstraps and imputations). This variance decomposition is used to recommend that, if one uses the bootstrap, one should use m=2 imputations within each bootstrap sample. He then describes how the components of variance can be estimated after imputation has been performed, leading to a variance estimate for the bootstrap MI estimator of the parameter of interest.
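
As a rough illustration of the workflow (point estimation only; I won't attempt to reproduce the paper's variance formula here), a 'bootstrap then impute' procedure with m=2 imputations per bootstrap sample might look like the following sketch, in which impute_and_analyse is a hypothetical user-supplied function that imputes a dataset once and returns the point estimate of interest.

```python
import numpy as np

rng = np.random.default_rng(2)

def boot_mi_estimate(data, impute_and_analyse, n_boot=200, m=2):
    """Bootstrap-then-impute point estimation: resample the data, create
    m imputations within each bootstrap sample, analyse each, and average
    all point estimates. The within- and between-bootstrap components
    returned here feed into the variance formula described in the paper
    (not reproduced in this sketch)."""
    n = len(data)
    estimates = np.empty((n_boot, m))
    for b in range(n_boot):
        boot_sample = data[rng.integers(0, n, n)]  # resample rows with replacement
        for j in range(m):
            estimates[b, j] = impute_and_analyse(boot_sample)

    point_estimate = estimates.mean()
    within = estimates.var(axis=1, ddof=1).mean()   # within-bootstrap variance
    between = estimates.mean(axis=1).var(ddof=1)    # between-bootstrap variance
    return point_estimate, within, between
```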

Assumptions and congeniality
In the discussion, von Hippel highlights that the within/between variance estimator(s) he proposes require, for their validity, that the imputation and analysis models are the same correctly specified model. Some would argue that this is quite a drawback since, in contrast, as von Hippel notes, Meng’s 1994 work, which defined the notion of congeniality, showed that PDMI with Rubin’s rules can give conservative inferences under certain types of uncongeniality.
