Summary statistics after imputation with mice

Someone recently asked how they could calculate summary statistics after performing multiple imputation with the mice package. The first thing to say is that if you are only interested in calculating a certain summary statistic on each of the imputed datasets, this is easy to achieve. You can extract each imputed dataset using the complete() function, and then apply whatever function you would normally use to calculate the summary statistic in question.

In the rest of this post, I’ll consider the situation where you are interested in performing inference for the summary statistic (or functional if you will). That is, if you are interested in say the median in your data, you are interested because ultimately you are interested in the median of the variable in the population (from which your sample data came from). Viewed this way, the summary statistic is an estimator of a population parameter, and so we should apply the usual procedure for multiple imputation: estimate the parameter on each imputed dataset and its corresponding complete data variance, and then pool these using Rubin’s rules. For some quantities (e.g. the mean), this is pretty easy. For others, at least as far as I can see, it requires a bit more work.

Mean

The mice package has an amazing functionality for fitting a range of different models to imputed datasets. As far as possible, for a given statistic of interest, we want to figure out if we can ‘fit’ a model which returns the estimate of interest and works with mice. For the mean this is easy. The sample mean of a variable can be estimated using a linear model with no covariates (except the intercept). The following R code illustrates how to estimate the mean of the BMI variable in the NHANES data that comes with the mice package:

library(mice)
#first impute, say 20 times
m <- 20
set.seed(6134)
imp <- mice(nhanes, m=m)
imp
#mean BMI (warm up)
bmi_mean <- with(imp, lm(bmi~1))
summary(pool(bmi_mean))

Running this we obtain in the final step:

       term estimate std.error statistic       df     p.value
(Intercept)   26.477 0.9498893  27.87377 15.44971 1.24345e-14

So our estimate of the population mean BMI is 26.477 with a standard error of 0.945.

Median

Let’s now consider the median. Ordinarily we might calculate the median using R’s quantile function, but this won’t provide us with a standard error. To overcome this, we can use quantile regression, which is an approach for estimating how a given quantile (of which the median is an example) varies with covariates. We will thus use the same trick as before – fit a model with the dependent variable being BMI and no covariates. The estimate will be the sample median. But by using a quantile regression (the rq function in the quantreg package) we will be able to extract complete data variance estimates, which is what we need to apply Rubin’s rules. First, let’s fit the quantile regression to the imputed datasets:

library(quantreg)
#fit quantile regression to each imputed dataset
bmi_med <- with(imp, rq(bmi~1))

Unfortunately mice’s pool function will not work with the output returned by the rq function. Instead we can manually collect up the point estimates of the median and variances from each of the imputed datasets and then pass these to the pool.scalar function from mice to pool them using Rubin’s rules. In this case I had to do a little digging into the form of the output returned by the rq function in order to figure out how to extract the estimated median and variance:

#extract estimates and variances
ests <- array(0, dim=m)
vars <- array(0, dim=m)
for (i in 1:m) {
  ests[i] <- summary.rq(bmi_med$analyses[[i]],covariance=TRUE)$coefficients[1,1]
  vars[i] <- summary.rq(bmi_med$analyses[[i]],covariance=TRUE)$coefficients[1,2]^2
}
#now apply Rubin's rules using pool.scalar
med_rubin <- pool.scalar(Q=ests, U=vars, n=dim(nhanes[1]))

From this we can extract the point estimate and standard error

#median point estimate
med_rubin$qbar
#median standard error
med_rubin$t^0.5

which gives 26.67 and a SE of 1.36.

This took quite a bit of extra work on our part, but if you are familiar with the function you would apply to complete data to calculate your summary statistic of interest and a standard error, it shouldn’t be too tricky to extract these from the fits from each imputed dataset.

One thing to remember is that Rubin’s rules relies on an assumption that your complete data estimator is normally (or t) distributed. So for certain quantities it may be important to apply a transformation first, apply Rubin’s rules and then back transform, to render this normal assumption more accurate.

3 thoughts on “Summary statistics after imputation with mice”

nathalie3108

June 17, 2022 at 8:31 am

Hello! Thank you! What if I want to recreate a normal summary statistics table with mean, standard deviation, min and max values?
kikatuso

August 10, 2022 at 2:38 pm

How would you combine multiple estimates from M imputed datasets for categorical data?
- Jonathan Bartlett
  
  September 8, 2022 at 1:13 pm
  
  The same principle can be applied as described in the post. For the parameter of interest (e.g. a proportion), calculate the estimate of it and an appropriate standard error for each imputed dataset, and then you can pool these using Rubin’s rules, as implemented in the pool.scalar function in the mice package.

Mean

Median

3 thoughts on “Summary statistics after imputation with mice”

Leave a Reply to nathalie3108 Cancel reply