Multiple imputation separately by groups in R and Stata

When using multiple imputation to impute missing values, one often wants to perform the imputation completely separately within groups of subjects defined by some fully observed variable (e.g. sex or treatment group). In Stata this is made very easy by the by() option: you simply tell the mi impute command which variable (or variables) to stratify the imputation on. Stata then imputes separately within the groups defined by these variables and assembles the imputations from each stratum back together, so that you end up with your desired number of imputed datasets.

Last week someone asked me how to do this in R, ideally with the mice package. Compared to Stata, one has to do a little more work. One approach is to use the mice.impute.bygroup function in the miceadds package, which extends the functionality of mice in various directions. If you instead want to do it manually, you can do so by making use of the rbind function within the mice package.
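For the miceadds route, a minimal sketch might look like the following. It assumes, following the example in ?mice.impute.bygroup, that the grouping variable and the underlying imputation method are passed to mice() as named lists called group and imputationFunction. The object names (phr, meth, pred, imp_bygroup), the choice of pmm within each group, and the seed are illustrative, so check the help page for your installed version before relying on it.

```r
library(mice)
library(miceadds)  # provides mice.impute.bygroup, selected via method "bygroup"

# Potthoff-Roy data with d10 set to missing for nine children
phr <- potthoffroy
phr[c(3, 6, 9, 10, 13, 16, 23, 24, 27), "d10"] <- NA

# Default methods and predictors; only d10 is incomplete
meth <- make.method(phr)
pred <- make.predictorMatrix(phr)
pred[, c("id", "sex")] <- 0   # id is an identifier; sex only defines the groups

# Impute d10 within levels of sex, using pmm inside each group
# (argument names assumed from the ?mice.impute.bygroup example)
meth["d10"] <- "bygroup"
imp_bygroup <- mice(phr, method = meth, predictorMatrix = pred,
                    group = list(d10 = "sex"),
                    imputationFunction = list(d10 = "pmm"),
                    m = 5, seed = 2024, printFlag = FALSE)
```

Here sex is removed from the predictor matrix because it is constant within each group; it enters only through the group argument.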

To illustrate, let’s construct the Potthoff-Roy data following the example code in ?potthoffroy within the mice package:

library(mice)
### create missing values at age 10 as in Little and Rubin (1987)
phr <- potthoffroy
idmis <- c(3, 6, 9, 10, 13, 16, 23, 24, 27)
phr[idmis, 4] <- NA
summary(phr)

       id       sex          d8             d10             d12             d14       
 Min.   : 1.0   F:11   Min.   :16.50   Min.   :20.00   Min.   :19.00   Min.   :19.50  
 1st Qu.: 7.5   M:16   1st Qu.:21.00   1st Qu.:22.12   1st Qu.:23.00   1st Qu.:25.00  
 Median :14.0          Median :22.00   Median :23.00   Median :24.00   Median :26.00  
 Mean   :14.0          Mean   :22.19   Mean   :23.61   Mean   :24.65   Mean   :26.09  
 3rd Qu.:20.5          3rd Qu.:23.25   3rd Qu.:25.00   3rd Qu.:26.00   3rd Qu.:27.75  
 Max.   :27.0          Max.   :27.50   Max.   :28.00   Max.   :31.00   Max.   :31.50  
                                       NA's   :9                                    

This data frame contains observations on 16 boys and 11 girls who at ages 8, 10, 12, and 14 had the distance (mm) from the center of the pituitary gland to the pterygomaxillary fissure measured, as described further in ?potthoffroy. We will impute the missing values artificially created in d10 separately in the boys and in the girls. First we split the data into these two groups:

#create separate male and female datasets
phr_male  <- subset(phr, sex=="M")
phr_female  <- subset(phr, sex=="F")

Next we will define a predictor matrix so that the id variable is not used as a predictor in the imputation process:

predMat <- make.predictorMatrix(phr_male)
#specify not to use id variable as predictor
predMat[,'id'] <- 0
predMat

    id sex d8 d10 d12 d14
id   0   1  1   1   1   1
sex  0   0  1   1   1   1
d8   0   1  0   1   1   1
d10  0   1  1   0   1   1
d12  0   1  1   1   0   1
d14  0   1  1   1   1   0

Now we can perform the imputation in each of the two groups, using this custom predictor matrix. First with the males:

male_imps <- mice(phr_male, predictorMatrix=predMat)

Running this produces a warning from mice about a logged event: it has (correctly) dropped the sex variable as a predictor, because sex is constant within the male subset and so cannot predict anything there. Next we impute in the females:

female_imps <- mice(phr_female, predictorMatrix=predMat)
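To see what mice dropped in each subset, you can inspect the returned objects. The sketch below rebuilds the two imputations and then looks at the loggedEvents and predictorMatrix components of each mids object. The seeds, printFlag = FALSE and suppressWarnings() are additions for reproducibility and quieter output; none of them is in the code above.

```r
library(mice)

phr <- potthoffroy
phr[c(3, 6, 9, 10, 13, 16, 23, 24, 27), 4] <- NA
phr_male   <- subset(phr, sex == "M")
phr_female <- subset(phr, sex == "F")

predMat <- make.predictorMatrix(phr_male)
predMat[, "id"] <- 0

# Seed values are arbitrary; suppressWarnings() hides the logged-events warning
male_imps <- suppressWarnings(
  mice(phr_male, predictorMatrix = predMat, seed = 101, printFlag = FALSE))
female_imps <- suppressWarnings(
  mice(phr_female, predictorMatrix = predMat, seed = 102, printFlag = FALSE))

# Each logged event records a variable that mice removed and the reason
male_imps$loggedEvents
female_imps$loggedEvents

# The predictor matrix mice ended up using; if sex was dropped as a
# constant, its column should be all zero here
male_imps$predictorMatrix
```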

Lastly we need to combine the two sets of imputations, using rbind:

imps <- rbind(female_imps, male_imps)

Although it appears that we are using R’s regular rbind function, mice actually has its own version which overrides R’s version when rbind is called with objects of class mids (as returned by the mice function). mice’s rbind does precisely what we want here – it combines the first imputation of the females with the first imputation of the males, and this becomes the first imputed dataset in the object I have called imps above, and similarly for the subsequent imputations. To see the details in the documentation, type ?rbind.mids in the console.
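As a quick check that the combined object behaves as described, the sketch below reruns the steps above using split() and lapply(), which extends the manual approach to any number of groups. It then confirms that a completed dataset contains all 27 children with no missing d10 values, and fits and pools an analysis model across the imputations. The seed, the object name imps_by_sex and the linear model for d14 are purely illustrative choices.

```r
library(mice)

phr <- potthoffroy
phr[c(3, 6, 9, 10, 13, 16, 23, 24, 27), 4] <- NA

predMat <- make.predictorMatrix(phr)
predMat[, "id"] <- 0

# Impute within each level of sex, then stack the resulting mids objects
set.seed(2024)  # arbitrary seed, set once so both groups share one RNG stream
imps_by_sex <- lapply(split(phr, phr$sex), function(d) {
  suppressWarnings(mice(d, predictorMatrix = predMat, printFlag = FALSE))
})
imps <- Reduce(rbind, imps_by_sex)  # with mice attached, rbind is mice's version

# mice's default is m = 5, so imps should hold five completed datasets
first <- complete(imps, 1)
nrow(first)        # number of children in the stacked data
anyNA(first$d10)   # whether any d10 values are still missing

# Illustrative analysis: fit the same model in each completed dataset and pool
fit <- with(imps, lm(d14 ~ d10 + sex))
summary(pool(fit))
```

Note that the rows of each completed dataset are stacked group by group (females first here) rather than in the original id order.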

3 thoughts on “Multiple imputation separately by groups in R and Stata”

  1. Hi Jonathan,

    Super helpful post, thank you! I have a question regarding this.

    I am trying to compile two ‘omics datasets that were collected on separate samples within the same experimental framework. They have a few detected features in common (about 20) and several more which are unique to each dataset (about 20 and 40). I would like to compile the two datasets and impute the variables that are missing in one dataset but not the other. The mice function did this extremely smoothly.

    However, there are some confounding issues which I am hoping you can help me address. First, there are two treatments, so I would need to impute separately for each treatment combination. This seems straightforward based on your post (thank you, again).

    The second is that the patterns among the common variables across treatments vary somewhat between the two datasets. I am interpreting this as set or replicate variation, and when I have analyzed the 20 common variables by themselves, I have incorporated ‘set’ as a covariate. However, if I am understanding the process correctly, mice imputes based on the data that are present for a single variable, meaning that the values that are imputed to flesh out that variable in dataset 2 would take on the same overall pattern as that same variable from dataset 1. This would mean that any variation attributable to one dataset would be carried across both datasets through imputation and there would be no way for me to account for it.

    What I would like to do is use the covariate adjustment to incorporate dataset variation into the imputation for each variable. Is there any way to do this? I’ve done my best to be clear here, but I am happy to provide other information.


