Tomorrow I'm giving a talk (slides here) at the Joint Statistical Meetings in Vancouver on some work I've been doing on combining bootstrapping with multiple imputation (MI), something I've written about here before. That post looked at a recent paper by Schomaker and Heumann (2018) on various ways of combining bootstrapping and MI. A more recent post discussed an arXiv paper by von Hippel (2018) on maximum likelihood multiple imputation, which also contains a nice proposal for combining bootstrapping and MI. My talk this week is about how these approaches perform when the imputation and analysis models are not congenial.

**MI inference using Rubin's rules**

Part of MI's success has been the simplicity of Rubin's combination rules. After imputation, these combine estimates of the full data variance from each imputation with the between imputation variance of the point estimates. When the imputation and analysis models are the same model (or conditionals of a single joint model), Rubin's variance estimator is (asymptotically) unbiased and confidence intervals attain their nominal coverage.
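As a concrete illustration (my own sketch, not from the talk), Rubin's rules can be written in a few lines of Python; `rubin_rules` is just an illustrative name, and in practice the inputs would be the per-imputation analysis-model estimates and their estimated full data variances:

```python
import numpy as np

def rubin_rules(estimates, variances):
    """Pool M per-imputation estimates and variances using Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(estimates)
    theta_bar = estimates.mean()        # pooled point estimate
    W = variances.mean()                # average within-imputation variance
    B = estimates.var(ddof=1)           # between-imputation variance
    T = W + (1 + 1 / M) * B             # Rubin's total variance estimator
    return theta_bar, T
```

Under congeniality T is (asymptotically) unbiased for the repeated sampling variance of the pooled estimate; it is this property that fails under uncongeniality.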

In other (so-called uncongenial) situations, the variance estimator can be biased. In many practical situations this bias is upwards, so that inferences are conservative. In many applications this conservatism may not be a big concern. But in some, like randomised trials, non-trivial conservatism is not ideal - it means we could have run the trial with fewer patients for the same (true) precision. Alternatively, for a given trial, we could obtain narrower confidence intervals than those given by Rubin's rules.

**A motivating example**

A motivating example is so-called control or reference based MI for missing data in trials. The basic idea here is that missing data for patients in the active group are imputed based (partly at least) on estimates of parameters from the control or reference arm. This might be sensible when the missing data are due to dropout, and patients who dropped out then went on to a control-like treatment. It turns out that here Rubin's variance estimator is biased upwards for the true repeated sampling variance (see Seaman et al (2014)). There is an ongoing debate about what the 'right' variance to use for these reference based imputation methods should be (see here). I shan't get into that debate here, but will assume that we are interested in frequentist valid inferences, i.e. the true repeated sampling variance of the estimator.

One approach is to try to analytically derive the variance of this particular MI estimator in a particular setting. For a repeatedly measured continuous endpoint, Tang has recently done just this. The expression for the variance estimator is quite complicated (see slide 6). More generally, we would need to derive a variance estimator on a case-by-case basis, which is hard. Robins and Wang (2000) described a general variance estimator which is valid regardless of congeniality, but it too essentially requires situation specific derivations which are pretty hard to work through. In conclusion, it would be nice to have something that works out of the box, much in the same way as Rubin's rules do (under congeniality). Which takes us to the bootstrap...

**Four combinations of bootstrap with MI**

Schomaker and Heumann considered four different combinations of bootstrap with MI. Using their terminology, MI boot Rubin consists of imputing M times. For each imputation, B bootstraps are generated to estimate the full data variance. Rubin's rules are then applied. Since in the end this uses Rubin's rules, we shouldn't expect unbiased variance estimates or confidence intervals that attain nominal coverage in situations where the imputation and analysis models are not congenial.
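A toy Python sketch of this scheme (my own illustration, with a simple mean standing in for the analysis-model estimate and a simulated vector standing in for each imputed dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
M, B = 10, 200
ests, variances = [], []
for m in range(M):
    # Stand-in for the m-th imputed dataset; in practice this would come
    # from an imputation model fitted to the incomplete data
    imputed = rng.normal(size=100)
    ests.append(imputed.mean())        # analysis-model estimate
    # Bootstrap the imputed dataset to estimate the full data variance
    boot_ests = [rng.choice(imputed, size=imputed.size, replace=True).mean()
                 for _ in range(B)]
    variances.append(np.var(boot_ests, ddof=1))

# Rubin's rules applied to the M estimates and bootstrap variances
W = np.mean(variances)
Bvar = np.var(ests, ddof=1)
T = W + (1 + 1 / M) * Bvar
```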

The second variant, MI boot pooled, is the same, except that in the end we form a percentile based confidence interval, constructed using the 2.5% and 97.5% percentiles of the pooled MB estimates. For large M, this approach can be viewed as approximately equivalent to taking draws from the posterior of the imputation/analysis model parameter(s), since bootstrapping followed by fitting the model is in large samples equivalent to taking a draw from the posterior distribution. Taking a different view, one can show (see slides) that the variance of the pooled estimates will be close to Rubin's variance estimator when M and B are large. For small M, this variance estimator will be biased downwards, which accords with Schomaker and Heumann's finding that with small M this approach leads to intervals which undercover.
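Continuing the same kind of toy setup, the pooled variant simply collects all MB bootstrap estimates and takes empirical percentiles; again, this is an illustrative sketch with a mean as the analysis-model estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
M, B = 10, 200
pooled = []
for m in range(M):
    imputed = rng.normal(size=100)     # stand-in for the m-th imputed dataset
    for b in range(B):
        boot = rng.choice(imputed, size=imputed.size, replace=True)
        pooled.append(boot.mean())     # analysis-model estimate on resample

# Percentile confidence interval from the pooled M*B estimates
lower, upper = np.percentile(pooled, [2.5, 97.5])
```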

The third and fourth variants first bootstrap, then do MI. The third variant (Boot MI) generates B bootstraps of the observed data, then imputes each M times. The standard bootstrap variance estimator is then calculated, where for each bootstrap the estimate is the average of the estimates across the M imputations. Since this corresponds to application of bootstrapping, whose validity does not rely on notions of congeniality, to the MI estimator with M imputations, we should expect it to be valid even under uncongeniality.
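In code, Boot MI reverses the order: resample the observed (incomplete) data first, then impute each bootstrap sample M times. The sketch below is mine, and uses a deliberately naive imputation (drawing from a normal fitted to the observed values) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def impute(data, rng):
    """Toy imputation: replace NaNs with draws from a normal distribution
    fitted to the observed values (a real analysis would use a proper
    imputation model)."""
    out = data.copy()
    miss = np.isnan(out)
    obs = out[~miss]
    out[miss] = rng.normal(obs.mean(), obs.std(ddof=1), size=miss.sum())
    return out

data = rng.normal(size=200)
data[rng.random(200) < 0.3] = np.nan      # make roughly 30% of values missing

B, M = 200, 2
boot_means = []
for b in range(B):
    boot = rng.choice(data, size=data.size, replace=True)  # bootstrap first
    ests = [impute(boot, rng).mean() for _ in range(M)]    # then impute M times
    boot_means.append(np.mean(ests))   # average estimate across imputations

var_boot_mi = np.var(boot_means, ddof=1)   # standard bootstrap variance
```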

The fourth variant (Boot MI pooled) again pools all BM estimates, and then forms percentile confidence intervals. One can show that the variance of the BM estimates is equal to the bootstrap estimator of variance from Boot MI plus an additional term. We would thus expect variance estimates to be biased upwards, and intervals to overcover, whether or not the imputation and analysis models are congenial.
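The relationship between the two variance quantities can be seen numerically. Below I generate a synthetic B x M array of estimates (a between-bootstrap component shared within each row, plus between-imputation noise, purely illustrative); the variance of all BM pooled estimates exceeds the Boot MI variance of the bootstrap means:

```python
import numpy as np

rng = np.random.default_rng(4)
B, M = 200, 10
# Synthetic estimates: a between-bootstrap component shared within each row,
# plus independent between-imputation noise
est = rng.normal(size=(B, 1)) + rng.normal(size=(B, M))

var_boot_mi = np.var(est.mean(axis=1), ddof=1)  # Boot MI variance estimator
var_pooled = np.var(est, ddof=1)                # variance of all B*M estimates

# var_pooled is var_boot_mi plus (essentially) the average within-bootstrap
# between-imputation variance - the additional term driving the overcoverage
```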

So the third approach, 'Boot MI', is the only one we expect to be valid under uncongeniality. The drawback of this approach is computational cost. A reasonably large number of bootstraps B are needed to obtain reliable variance estimates. If we choose M=1 (or small M), our estimator is somewhat inefficient, and the Monte-Carlo error in our point estimate may be larger than we are comfortable with. If we choose a large value of M for efficiency reasons, the total number of bootstraps and imputations rapidly gets very large, which is computationally costly.

**von Hippel's Boot MI**

von Hippel also suggested what Schomaker and Heumann call Boot MI. However, he proposed fitting a one-way random effects model to the point estimates obtained. From this one can estimate the between bootstrap variance and the within bootstrap between imputation variance. These can then be used to construct an unbiased variance estimator for the estimator which averages all BM estimates. By doing this, we can use a small value of M (von Hippel recommends M=2), thus reducing computational cost, but still obtain an efficient estimator of the target parameter, because we estimate it by the average of the BM estimates. It thus seems very attractive for unbiased variance estimation in uncongenial settings.
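A sketch of this estimator in Python, using the standard one-way ANOVA method-of-moments decomposition to estimate the two variance components; the final combination formula is my reading of von Hippel's proposal, so do check the paper before relying on it:

```python
import numpy as np

def von_hippel_boot_mi(estimates):
    """Point estimate and variance from a B x M array of estimates
    (B bootstraps, each imputed M >= 2 times), following von Hippel's
    Boot MI proposal (combination formula as I read it from the paper)."""
    est = np.asarray(estimates, dtype=float)
    B, M = est.shape
    theta_hat = est.mean()                 # average of all B*M estimates
    boot_means = est.mean(axis=1)
    # One-way ANOVA mean squares: between and within bootstraps
    MSB = M * np.sum((boot_means - theta_hat) ** 2) / (B - 1)
    MSW = np.sum((est - boot_means[:, None]) ** 2) / (B * (M - 1))
    sigma2_b = max((MSB - MSW) / M, 0.0)   # between-bootstrap component
    sigma2_w = MSW                         # within-bootstrap component
    # Sampling variance of theta_hat: the bootstrap variance sigma2_b plus
    # Monte-Carlo error from using finite B and M
    total_var = (1 + 1 / B) * sigma2_b + sigma2_w / (B * M)
    return theta_hat, total_var
```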

**Simulations - jump to reference imputation**

Early evidence (see slides) from simulations in the jump to reference situation, in an admittedly very simple setup, supports the preceding conclusions. In the congenial setting, all the methods provide valid frequentist inferences, except MI boot pooled with small M, whose intervals slightly undercover (as expected), and Boot MI pooled, which overcovers, again as expected. Using jump to reference imputation, which is uncongenial to the analysis model, we see that Rubin's rules intervals overcover, as expected. The two MI boot variants perform very similarly, because they are essentially equivalent to using Rubin's rules. Only Boot MI and von Hippel's variant of it provide frequentist valid inference. For the former we used B=200 bootstraps and M=10 imputations, while for the latter we were able to use just M=2.

In conclusion, von Hippel's proposed bootstrap followed by MI approach seems to me very nice, and offers an out of the box, efficient, and general solution for frequentist valid inferences even under uncongeniality. The main assumptions we need are that the data are iid and that the estimator is normally distributed. The latter assumption is not needed with Schomaker and Heumann's Boot MI approach if we use percentile based intervals, but the drawback there is the higher computational cost.

A paper describing this work should be out on arXiv in the coming months.