Bootstrapping multiple imputation using multiple cores/processors in R

I’ve written previously about combining bootstrapping with multiple imputation, in particular when the imputation and analysis models may not be congenial. This work has recently been published in Statistical Methods in Medical Research (open access paper here). The approach we recommend in this paper, proposed earlier by Paul von Hippel, is implemented in the R package bootImpute.

Perhaps the biggest limitation of using bootstrapping for inference rather than Rubin’s rules is that you need a much larger number of bootstraps to obtain reproducible inferences than the number of imputations typically used with multiple imputation. To mitigate this I have just released a new version of the bootImpute package which can make use of multiple cores/processors, thereby considerably reducing computation time. It does this by exploiting the parallel computing functionality that is now built into R. This is possible because the imputation (and also the analysis) of the different bootstrapped datasets can be performed completely independently of each other. It is an example of an embarrassingly parallel process.
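To make the idea of an embarrassingly parallel bootstrap concrete, here is a minimal sketch using base R's parallel package directly. It is not the internals of bootImpute. The toy data, the oneBoot helper and the choice of two workers are all assumptions made purely for illustration.

```r
library(parallel)

# Toy data (hypothetical): we bootstrap the mean of a numeric vector
set.seed(1)
x <- rnorm(200)

# One bootstrap replicate: resample with replacement, then compute the statistic.
# The first argument is just the replicate index and is otherwise unused.
oneBoot <- function(b, data) {
  mean(sample(data, replace = TRUE))
}

nBoot <- 2000

cl <- makeCluster(2)               # start two worker processes
clusterSetRNGStream(cl, 651234)    # give each worker its own random number stream
bootMeans <- unlist(parLapply(cl, seq_len(nBoot), oneBoot, data = x))
stopCluster(cl)

# Percentile interval from the bootstrap replicates
quantile(bootMeans, c(0.025, 0.975))
```

Each replicate depends only on the original data and its own random draws, so the workers never need to communicate with each other while they run.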

To illustrate the potential gains in terms of computation time, in this post I will impute the nhanes2 dataset from the mice package. The call to mice given as the example in the documentation for the mice function is:

mice(nhanes2, meth=c('sample','pmm','logreg','norm'))

The bootImpute package provides a wrapper function for when you want to use mice, called bootMice. To use this we can run the following code, where I have asked for 2000 bootstraps, each to be imputed twice:

bootImps <- bootMice(nhanes2, nBoot=2000, nImp=2, meth=c('sample','pmm','logreg','norm'))

Running this code on my (admittedly somewhat old!) desktop takes 5 minutes.

We can re-do the imputation but now using 4 cores with:

bootImps <- bootMice(nhanes2, nBoot=2000, nImp=2, nCores=4, seed=651234,
                     meth=c('sample','pmm','logreg','norm'))

Note that when we specify more than one core, we also have to pass a seed to the bootMice function, so that the random number streams are correctly set up across the multiple cores. This run takes just 1.5 minutes. For larger datasets and/or slower imputation methods, computation time will be much longer, but using multiple cores can reduce it substantially.
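To see why a seed matters once several cores are involved, the following sketch uses base R's clusterSetRNGStream, which gives each worker its own L'Ecuyer-CMRG stream derived from a single seed. The drawOnWorkers helper is hypothetical and exists only for this illustration. It shows the general mechanism rather than the exact code inside bootMice.

```r
library(parallel)

# Hypothetical helper: draw n standard normal values on each of nCores workers
drawOnWorkers <- function(nCores, seed, n = 3) {
  cl <- makeCluster(nCores)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, seed)   # one seed, a separate stream for each worker
  clusterCall(cl, rnorm, n)       # returns a list with one vector per worker
}

run1 <- drawOnWorkers(2, seed = 651234)
run2 <- drawOnWorkers(2, seed = 651234)

# Same seed gives the same draws on a re-run
identical(run1, run2)

# The two workers within a run do not share a stream
identical(run1[[1]], run1[[2]])
```

Without a step like this, each worker would seed itself independently, and results would not be reproducible from one run to the next.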

After creating the bootstrapped and imputed datasets, we will of course want to analyse them. This can be done using the bootImputeAnalyse function, which also lets you specify multiple cores to reduce the computation time.
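As a rough sketch of what the analysis step could look like, the code below fits a simple linear model to each imputed dataset. The analysis model (chl regressed on bmi), the analyseImp function name and the small nBoot are illustrative assumptions, and I am assuming that the cores argument of bootImputeAnalyse is called nCores, as it is in bootMice; check the package documentation for the exact interface.

```r
library(mice)        # provides nhanes2 and the imputation methods
library(bootImpute)

# A small nBoot keeps this illustration quick; use far more in practice
bootImps <- bootMice(nhanes2, nBoot = 50, nImp = 2, nCores = 2, seed = 651234,
                     meth = c('sample', 'pmm', 'logreg', 'norm'))

# Hypothetical analysis function: takes one imputed data frame and
# returns a vector of parameter estimates
analyseImp <- function(inputData) {
  coef(lm(chl ~ bmi, data = inputData))
}

# nCores is assumed here to work as it does in bootMice
results <- bootImputeAnalyse(bootImps, analyseImp, nCores = 2)
```

The analysis function only needs to accept a single data frame and return the estimates of interest, so any model that can be fitted to one complete dataset can be plugged in.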
