Reference based imputation for continuous missing data in R with bootstrap inference

In clinical trials patients often drop out of the trial, for a variety of reasons. Historically, outcome measures were not obtained after such dropout, and the dropout often coincided with the patient no longer receiving their original randomised treatment. For a treatment policy estimand (i.e. what historically would have been called the intention to treat effect), the missing at random (MAR) assumption is questionable if patients who don’t drop out remain on their randomised treatment while those who drop out discontinue it (see my previous post). In particular, analyses of such data assuming MAR effectively impute the post dropout outcomes as if the patients were still on their randomised treatment.

In recent years so-called reference based or control based imputation methods, proposed by Carpenter et al (2013), have become increasingly popular for handling this problem. This approach involves imputing missing post dropout (or post deviation) outcomes for patients in the active treatment group using an imputation distribution which is constructed using estimates of certain parameters from the control arm. Broadly speaking the idea is that if the patients dropping out in the active arm are no longer receiving active treatment but instead the control treatment, we should impute their missing outcomes based on what the data tell us about how outcomes are distributed when patients are on the control treatment.
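To make the idea concrete, here is a minimal sketch of a jump to reference style draw for a single active arm patient who drops out after their first visit. It assumes multivariate normal outcomes and, purely for simplicity, one covariance matrix shared by both arms; Carpenter et al describe a more general construction. The function name j2rImpute and all the parameter values are hypothetical, chosen only for illustration.

```r
# Hypothetical parameter values, for illustration only
muActive <- c(10, 12, 14)     # active arm visit means
muRef    <- c(10, 10.5, 11)   # reference (control) arm visit means
Sigma <- matrix(c(4, 2, 1,
                  2, 4, 2,
                  1, 2, 4), nrow = 3, byrow = TRUE)  # assumed common covariance

# Hypothetical helper: impute the visits after dropout for one patient
j2rImpute <- function(yObs, muActive, muRef, Sigma) {
  nObs <- length(yObs)
  nVisits <- length(muActive)
  stopifnot(nObs >= 1, nObs < nVisits)   # needs observed and missing visits
  obs <- seq_len(nObs)
  mis <- (nObs + 1):nVisits
  # Mean vector: active arm means up to dropout, reference means afterwards
  muJ2R <- c(muActive[obs], muRef[mis])
  # Regression coefficients of the missing visits on the observed visits
  B <- Sigma[mis, obs, drop = FALSE] %*% solve(Sigma[obs, obs, drop = FALSE])
  condMean <- muJ2R[mis] + B %*% (yObs - muJ2R[obs])
  condVar <- Sigma[mis, mis, drop = FALSE] - B %*% Sigma[obs, mis, drop = FALSE]
  # Draw from the conditional multivariate normal distribution
  L <- t(chol(condVar))
  as.vector(condMean + L %*% rnorm(length(mis)))
}

set.seed(1)
j2rImpute(yObs = 13, muActive, muRef, Sigma)
```

The patient’s observed value of 13 is above the active arm mean of 12 at visit 1, so the imputed values are centred somewhat above the reference arm means at visits 2 and 3, but they no longer carry any of the active arm’s later treatment benefit.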

Following the publication of reference based methods, Seaman et al pointed out that if one uses them to impute missing data and then uses Rubin’s combination rules for inference, estimated variances are larger (sometimes by a sizeable amount) than the true repeated sampling variance of the estimator. The discrepancy is due to uncongeniality between the imputation model and analysis model. Since then there has been a debate (which is ongoing) about which is the correct variance (see Cro et al 2019 and White et al 2020).
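For readers less familiar with them, Rubin’s combination rules can be sketched in a few lines. The estimates and squared standard errors below are made-up numbers standing in for the results from five imputed datasets; they are not from any real analysis.

```r
# Hypothetical treatment effect estimates and squared standard errors
# from M = 5 imputed datasets (illustrative values only)
est <- c(1.82, 2.05, 1.94, 2.11, 1.88)
se2 <- c(0.210, 0.198, 0.205, 0.215, 0.201)

M <- length(est)
pooledEst <- mean(est)           # pooled point estimate
W <- mean(se2)                   # within-imputation variance
B <- var(est)                    # between-imputation variance
totalVar <- W + (1 + 1/M) * B    # Rubin's total variance

c(estimate = pooledEst, se = sqrt(totalVar))
```

The point made by Seaman et al is that, with reference based imputation, totalVar computed this way tends to overstate the variance of pooledEst in repeated samples.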

I am not going to enter this debate here, although I will soon put up a pre-print paper where I do. Suffice to say, I am interested in approaches for using reference based MI and obtaining frequentist valid inferences, i.e. I want the variance estimate to correspond to the variance of the reference based MI estimator in repeated samples. There has been work on this already. In particular Tang (2017) derives analytic estimators for the frequentist/repeated sampling variance of reference based MI. These are pretty complex, and would be tricky to extend, for example to time to event settings, or when imputation is performed differently depending on the patient’s reason for dropout.

An alternative approach for estimating the frequentist variance is to use bootstrapping, such as the efficient approach proposed by von Hippel, which is implemented in the R package bootImpute. The original implementation of reference based imputation is thanks to James Roger, and his SAS macros. For Stata users, we have Suzie Cro’s mimix package. As far as I am aware, there are no publicly available implementations of reference based imputation for continuous endpoints in R. To address this, I have written a function refBasedCts in the package mlmi. At the moment it supports only MAR and jump to reference imputation. I will add other variants in due course, particularly if someone contacts me to say they would want one of them! Right now refBasedCts is only in the GitHub version of the mlmi package, while it undergoes some further development. This development version can be installed into R using:


One of the drawbacks of using bootstrapping for inference is that one needs to use a large number of bootstraps to get reliable inferences. In the approach described by von Hippel and Bartlett (2019), this is partly mitigated by the nested bootstrap/imputation scheme. It is further mitigated by the fact that this bootstrap approach does not require the imputations to be proper, in the sense that we can impute conditional on the maximum likelihood estimates of the imputation model parameters, rather than needing to obtain posterior draws of these. For the continuous endpoint reference based approach, obtaining such posterior draws requires the use of MCMC, substantially increasing the computational cost and requiring (at least in principle) convergence checking for the Markov chains. Thus the refBasedCts function in mlmi fits a mixed model to each treatment arm’s data, and imputes conditional on the (restricted) maximum likelihood estimates of the parameters.
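To show the shape of the nested scheme, here is a toy sketch using a single normally distributed variable with some values missing completely at random. It is not the bootImpute or refBasedCts code: the data, the simple normal imputation model, and the variable names are all made up for illustration, and the one-way ANOVA style variance combination at the end is my simplified rendering of the approach, so treat it as an assumption rather than the package internals.

```r
set.seed(2021)
n <- 200
y <- rnorm(n, mean = 5, sd = 2)
y[sample(n, 50)] <- NA               # toy data: 50 values missing completely at random

nBoot <- 200
nImp <- 2
ests <- matrix(NA_real_, nrow = nBoot, ncol = nImp)

for (b in seq_len(nBoot)) {
  yb <- y[sample(n, replace = TRUE)]     # 1. bootstrap the incomplete data
  obs <- yb[!is.na(yb)]
  muHat <- mean(obs)                     # 2. ML estimates from this bootstrap sample
  sdHat <- sqrt(mean((obs - muHat)^2))
  nMis <- sum(is.na(yb))
  for (m in seq_len(nImp)) {
    yImp <- yb
    # 3. impute conditional on the ML estimates (no posterior draws)
    yImp[is.na(yb)] <- rnorm(nMis, muHat, sdHat)
    ests[b, m] <- mean(yImp)             # 4. analyse the completed dataset
  }
}

# Split the variation in the estimates into between-bootstrap and
# within-bootstrap (between-imputation) mean squares
bootMeans <- rowMeans(ests)
MSB <- nImp * sum((bootMeans - mean(ests))^2) / (nBoot - 1)
MSW <- sum((ests - bootMeans)^2) / (nBoot * (nImp - 1))
sigma2Btw <- (MSB - MSW) / nImp

pointEst <- mean(ests)
varEst <- (1 + 1/nBoot) * sigma2Btw + MSW / (nBoot * nImp)
c(estimate = pointEst, se = sqrt(varEst))
```

Only two imputations are generated per bootstrap sample, and each imputation needs only a model fit rather than an MCMC run, which is what keeps the cost of a large number of bootstraps manageable.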

To give an illustration of the code to do this, the following snippet gives the example code from refBasedCts, which imputes a simulated example dataset using jump to reference:

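library(mlmi)
library(bootImpute)

#bootstrap the data, then impute each bootstrap sample twice using jump to reference
bootImps <- bootImpute(ctsTrialWide, refBasedCts, nBoot=1000, nImp=2,
                         outcomeVarStem="y", nVisits=3, trtVar="trt",
                         baselineVars=c("v", "y0"), type="J2R", M=1)

#write a small wrapper function to perform an ANCOVA at the final time point
ancova <- function(inputData) {
    coef(lm(y3~v+y0+trt, data=inputData))
}
#apply the ANCOVA to every imputed dataset and combine the results
ests <- bootImputeAnalyse(bootImps, ancova)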

If anyone does try it out and has feedback or finds bugs, please get in touch. In due course it will be uploaded to CRAN, hopefully with added functionality for reference based imputation of recurrent event endpoints, as described by Keene et al 2014.
