Missing data – Page 4 – The Stats Geek

Conditional mean reference-based multiple imputation

September 24, 2021 (updated May 19, 2022) by Jonathan Bartlett

The reference-based approach to imputing missing data has become popular in clinical trials, as I’ve blogged about previously. In the standard approach, the multiple imputations are generated as draws from the posterior distribution under a Bayesian model. With a continuous outcome, each of the imputed datasets is analysed using a linear regression model for the outcome (typically measured at the final time point), with treatment group and some baseline variables as covariates.

In a new pre-print available on arXiv, joint work with Marcel Wolbers and colleagues at Roche, we propose an alternative approach to reference-based imputation for continuous outcomes. This approach results in a treatment effect point estimate and (frequentist) standard error without any Monte-Carlo error.

Summary statistics after imputation with mice

September 22, 2021 by Jonathan Bartlett

Someone recently asked how they could calculate summary statistics after performing multiple imputation with the mice package. The first thing to say is that if you are only interested in calculating a certain summary statistic on each of the imputed datasets, this is easy to achieve. You can extract each imputed dataset using the complete() function, and then apply whatever function you would normally use to calculate the summary statistic in question.

In the rest of this post, I’ll consider the situation where you are interested in performing inference for the summary statistic (or functional, if you will). That is, if you are interested in, say, the median in your data, this is ultimately because you are interested in the median of the variable in the population from which your sample came. Viewed this way, the summary statistic is an estimator of a population parameter, and so we should apply the usual procedure for multiple imputation: estimate the parameter and its corresponding complete data variance on each imputed dataset, and then pool these using Rubin’s rules. For some quantities (e.g. the mean), this is pretty easy. For others, at least as far as I can see, it requires a bit more work.
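To make the pooling step concrete for the easy case of the mean, here is a minimal sketch in plain Python rather than R, so it does not use mice at all. It assumes each imputed dataset is simply a list of values for one variable, and takes the complete data variance of the mean to be the sample variance divided by n. The function name `pool_mean_rubin` and the toy data are hypothetical, made up purely for illustration, and the sketch returns only the pooled estimate and standard error (no degrees of freedom or confidence interval).

```python
import math
import statistics


def pool_mean_rubin(imputed_datasets):
    """Pool the sample mean across imputed datasets using Rubin's rules.

    Hypothetical helper for illustration: each element of
    `imputed_datasets` is one completed copy of the same variable.
    """
    m = len(imputed_datasets)

    # Step 1: estimate the parameter and its complete data variance
    # (s^2 / n for a mean) separately on each imputed dataset.
    estimates = [statistics.fmean(d) for d in imputed_datasets]
    within_vars = [statistics.variance(d) / len(d) for d in imputed_datasets]

    # Step 2: pool. The point estimate is the average of the estimates.
    pooled_estimate = statistics.fmean(estimates)

    # Within-imputation variance: average of the complete data variances.
    within = statistics.fmean(within_vars)
    # Between-imputation variance: sample variance of the estimates
    # (divisor m - 1).
    between = statistics.variance(estimates)

    # Rubin's total variance.
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)


# Made-up example: three imputed copies of a variable with four values,
# where only the last value was missing and has been imputed.
imputed = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 6.0],
    [1.0, 2.0, 3.0, 5.0],
]
estimate, se = pool_mean_rubin(imputed)
print(estimate, se)
```

For a quantity like the median, the first step is where the extra work comes in, since there is no equally simple formula for the complete data variance.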

smcfcs imputation in R – now with parallel functionality

June 17, 2021 by Jonathan Bartlett

Substantive model compatible fully conditional specification multiple imputation can be useful for imputing missing values in covariates in a way which accommodates the form of the substantive/outcome model. One of its drawbacks compared to standard FCS imputation, as implemented in the mice package in R, is its higher computational burden. This is due to the use of rejection sampling when imputing missing values in continuous covariates.

I am happy to announce that, thanks to the efforts of Edouard Bonneville, the smcfcs package in R now supports the use of multiple cores via parallel processing. The package now has a function smcfcs.parallel, which can be used to call the other smcfcs functions in parallel. Having specified the number of imputations desired, smcfcs.parallel splits these across the number of cores/processors specified by the user in the n_core argument. Since multiple imputation is ‘embarrassingly parallel’, substantial speed improvements can be achieved. Many thanks to Ed for his continuing contributions to smcfcs in R.
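The idea of splitting a fixed number of imputations across workers can be sketched generically. The following Python sketch is not the smcfcs code and does not claim to reproduce how smcfcs.parallel allocates imputations or handles seeds. All function names are hypothetical, the "imputation" is a placeholder random draw, and threads are used only to keep the example short; a real speed-up for CPU-bound imputation would need separate processes.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_imputations(m, n_workers):
    """Divide m imputations as evenly as possible across n_workers."""
    base, extra = divmod(m, n_workers)
    # The first `extra` workers each take one additional imputation.
    return [base + (1 if i < extra else 0) for i in range(n_workers)]


def run_chunk(job):
    """Stand-in for one worker generating its share of imputations."""
    n_imputations, seed = job
    rng = random.Random(seed)  # each worker gets its own seeded generator
    # Placeholder: one random draw per imputation instead of a real dataset.
    return [rng.gauss(0.0, 1.0) for _ in range(n_imputations)]


def impute_in_parallel(m, n_workers, seed=2021):
    """Run the chunks independently and combine them into m results."""
    chunks = [c for c in split_imputations(m, n_workers) if c > 0]
    jobs = [(c, seed + i) for i, c in enumerate(chunks)]
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        results = pool.map(run_chunk, jobs)
    # No communication between workers is needed: results are just stacked.
    return [draw for chunk in results for draw in chunk]


print(split_imputations(10, 4))
imputations = impute_in_parallel(m=10, n_workers=4)
print(len(imputations))
```

Because each chunk depends only on its own inputs and seed, the workers never need to exchange information, which is what makes the problem embarrassingly parallel.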