Meng’s concept of congeniality in multiple imputation (MI) is, I think, a tricky one (for me at least!). Loosely speaking, congeniality is about whether the imputation and analysis models make different assumptions about the data. Meng gave a definition in his 1994 paper, but I prefer the one given in a more recent paper by Xie and Meng, which is the one Rachael Hughes and I used in our paper this year on different methods of combining bootstrapping with MI. In words (see the papers for the precise statements in equations; a rough version in symbols is sketched just after this list), it is that there exists a Bayesian model for the data such that:
- given the complete (full) data, the posterior mean of the parameter of interest matches the point estimate obtained by fitting our analysis model of interest to those data, and the posterior variance matches the variance estimator from that analysis model fit.
- the conditional distribution of the missing data given the observed data in this Bayesian model matches that used by our imputation model.
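Roughly, in symbols (my paraphrase; see the papers for the precise statements): writing \(Y\) for the complete data, \(\hat{\theta}(\cdot)\) and \(\hat{V}(\cdot)\) for the analysis model’s point and variance estimators, and \(g\) for the predictive distribution the imputation model draws from, congeniality requires that there exists a Bayesian model \(f\) with

\[
E_f(\theta \mid Y) = \hat{\theta}(Y), \qquad \text{Var}_f(\theta \mid Y) = \hat{V}(Y),
\]
\[
f(Y_{\text{mis}} \mid Y_{\text{obs}}) = g(Y_{\text{mis}} \mid Y_{\text{obs}}).
\]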
If the imputation and analysis models are congenial and correctly specified, Rubin’s variance estimator is (asymptotically) unbiased for the true repeated sampling variance of the MI point estimator(s).
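For reference, with \(M\) imputations yielding estimates \(\hat{\theta}_m\) and variance estimates \(\hat{V}_m\) from each imputed dataset, Rubin’s rules give

\[
\hat{\theta}_{MI} = \frac{1}{M}\sum_{m=1}^{M}\hat{\theta}_m, \qquad \hat{V}_{MI} = \overline{W} + \left(1+\frac{1}{M}\right)B,
\]

where \(\overline{W} = M^{-1}\sum_{m=1}^{M}\hat{V}_m\) is the average within-imputation variance and \(B = (M-1)^{-1}\sum_{m=1}^{M}(\hat{\theta}_m - \hat{\theta}_{MI})^2\) is the between-imputation variance.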
One of the potentially useful features of MI is that we can include variables at the imputation stage which we then don’t use in the analysis model. Including such auxiliary variables in the imputation model can make the MAR assumption more plausible when the auxiliary variable is associated with the probability of missingness, and can increase efficiency depending on how strongly it is correlated with the variable(s) being imputed. A nice paper (among many) on the potential of including auxiliary variables in MI is Hardt et al 2012. In this post, I’ll consider whether including auxiliary variables in the imputation model leads to uncongeniality. The post was prompted by a discussion earlier in the year with my colleague Paul von Hippel.
Including auxiliary variables doesn’t necessarily lead to uncongeniality
Including auxiliary variables at the imputation stage doesn’t necessarily lead to uncongeniality. To see this, let’s consider a very simple situation with three continuous variables, y1, y2, y3. Suppose some data are missing in y2, with the probability of missingness depending on y1 and y3 in some way but, conditional on y1 and y3, not on the value of y2 itself, so that the data are MAR (given y1 and y3). Our analysis model of interest is a linear regression model for y1 with y2 as covariate. Thus y3 is not involved in the analysis model. Our imputation model will be a normal linear regression model for y2, with y1 and y3 as covariates. We are thus using y3 as an auxiliary variable in the imputation model.
The imputation and analysis models here are congenial. This is because there exists a Bayesian model for the three variables, namely a trivariate normal distribution, for which the conditional distribution of the missing data given the observed data matches that used in the imputation model, and for which, given complete data, the posterior mean and variance for the parameters in the model for y1|y2 match what our analysis model (a standard ordinary least squares regression of y1 on y2) returns.
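To sketch why (this is just the standard conditional distribution property of the multivariate normal, nothing specific to our setup): if \((y_1, y_2, y_3)\) is trivariate normal, then both conditionals of interest are normal linear regressions,

\[
y_2 \mid y_1, y_3 \sim N(\alpha_0 + \alpha_1 y_1 + \alpha_3 y_3,\ \sigma^2_{2\cdot13}), \qquad y_1 \mid y_2 \sim N(\beta_0 + \beta_1 y_2,\ \sigma^2_{1\cdot2}).
\]

The first is exactly the form of our imputation model. For the second, under a standard noninformative prior, the posterior mean of \((\beta_0, \beta_1)\) given complete data equals the OLS estimates, and the posterior variance matches the usual OLS variance estimator (asymptotically, given the t versus normal distinction).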
A simulation demonstration
Given they are congenial, we should expect Rubin’s variance estimator to be (asymptotically) unbiased for the true repeated sampling variance of the MI estimator for the parameters in the analysis model. We can check this empirically with a simulation in R:
library(MASS)
library(mice)
nSim <- 10000
n <- 500
expit <- function(x) exp(x)/(1+exp(x))
mySigma <- matrix(c(1,0.5,0.25,0.5,1,0.3,0.25,0.3,1),nrow=3,ncol=3)
mySigma
estArray <- array(0, dim=nSim)
varArray <- array(0, dim=nSim)
set.seed(72342)
for (i in 1:nSim) {
print(i)
simData <- data.frame(mvrnorm(n=n, mu=c(0,0,0), Sigma=mySigma))
colnames(simData) <- c("y1", "y2", "y3")
#make some y2 data MAR conditional on y1 and y3
missPr <- expit(simData$y1-simData$y3)
simData$y2[runif(n)<missPr] <- NA
#impute 5 times using mice, with norm
#only one iteration needed because missingness in one variable
imps <- mice(simData, method="norm", m=5, maxit=1, printFlag=FALSE)
#fit analysis model of y1 on y2 only, ignoring y3
impfit <- with(imps, lm(y1~y2))
mirubin <- pool(impfit)
#save the Rubin's rules point estimate and variance estimator for the y2 coefficient
#in the pooled results, row 2 is the y2 term; column 3 is the estimate
#and column 6 is Rubin's total variance t
estArray[i] <- mirubin$pooled[2,3]
varArray[i] <- mirubin$pooled[2,6]
}
mean(estArray)
var(estArray)
mean(varArray)
After a few minutes of running (I chose 10,000 simulations to keep the Monte-Carlo error small!), this results in:
> mean(estArray)
[1] 0.4949839
> var(estArray)
[1] 0.002886426
> mean(varArray)
[1] 0.002880525
We see that the empirical variance of the estimates (var(estArray)) is very close to the average of Rubin’s variance estimator across the 10,000 simulations (mean(varArray)). Interestingly, we see some slight downwards bias in the average of the MI point estimates (0.495 compared to the true value, which here is 0.5). Increasing the sample size n to 5,000 and re-running confirms this is just a (slight) finite sample bias.
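As a quick extra check (not part of the output above), we can compute the Monte-Carlo standard error of the mean point estimate from the saved estimates. It comes out at roughly 0.0005, so the bias of around 0.005 is far too large to be simulation noise, consistent with it being a genuine (small) finite sample bias:

#Monte-Carlo standard error of mean(estArray); the observed bias of
#around 0.005 is many times larger than this, so it is not
#attributable to simulation noise
sd(estArray)/sqrt(nSim)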
Implications of auxiliary variables for congeniality
Adding auxiliary variables to an imputation model doesn’t in and of itself lead to uncongeniality. Whether an imputation model which uses auxiliary variable(s) and the analysis model are congenial depends on the modelling assumptions made in the two models, and on whether there exists a joint Bayesian model for all the variables under consideration which satisfies the conditions described earlier in the post.