Sullivan and colleagues have recently published a nice paper exploring multiple imputation for missing covariates or outcome when one is interested in estimating relative risks. They performed simulations where missing covariates or outcomes were imputed either using multivariate normal imputation or using fully conditional specification imputation (FCS), and where the true outcome model is a log link binomial model. They concluded that multivariate normal imputation performed poorly, producing estimated coefficients which were biased towards the null. Fully conditional specification performed better, although estimates were still biased in certain situations.
Sullivan et al investigated the performance of the two imputation approaches through simulation. They simulated datasets with two covariates, X1 and X2 (binary or normally distributed), and an outcome Y that depended on X1 and X2 using a log link regression. Values in X2 and Y were then made missing according to an MAR mechanism. For FCS imputation, they imputed missing values in Y using a logistic regression model conditional on X1 and X2. As the authors noted, since this differed to the true model for Y|X1,X2, the imputation model was to some extent misspecified. For imputation of X2 they used either logistic or normal linear conditional models, depending on whether X2 was binary or continuous. In the first scenario X1 and X2 were both binary, while in the second they were bivariate normally distributed. Given the biases found for both FCS and multivariate normal imputation, they recommended further research for imputation of missing values when the analyst wants to estimate a risk ratio.
One important consideration is that Sullivan et al assumed that because a risk ratio was the target of inference, they simulated the outcome from a log link regression. In practice, while a risk ratio may be the target of inference, it may be more plausible that the conditional distribution of a binary outcome given covariates follows a logistic regression. This is partly because with some continuous covariates, it is quite possible for the implied predicted probability for a log link model to lie above 1. It is because of this that one often encounters difficulties with fitting log link models (see this paper). Arguably, in general it might be assumed that a logit model is more plausible a priori because it gives valid probabilities for any linear predictor value, unlike a log link model.
An alternative perspective is that while we may be interested in estimating a well defined risk ratio parameter, at the same time we may believe that a more plausible model for the binary outcome given the covariates is a logistic regression. In this case it is possible to impute missing values compatibly with an assumption that the outcome follows a logistic regression model, but then to estimate the marginal risk ratio, for example as described here in a previous post.
Here I will illustrate with a small simulation example in R. First we simulate a large dataset with X1 and X2 bivariate normal and Y binary, where Y follows a logistic model given X1 and X2. The R program can be downloaded here.
First we simulate a very large full dataset:
#simulate full data library(MASS) expit <- function(x) { exp(x)/(1+exp(x)) } set.seed(259815) n <- 100000 x <- mvrnorm(n, mu=c(0,0), Sigma=matrix(c(1,0.2,0.2,1), nrow=2)) x1 <- x[,1] x2 <- x[,2] y <- 1*(runif(n) < expit(-3+log(3)*x1+log(3)*x2)) mean(y) [1] 0.11052
Next we estimate the marginal risk ratio for the effect of changing X1 from 1 to 0 (see this previous post for more on this):
#estimate the marginal risk ratio for x1=1 vs x1=0 with full data by standardization #for later use with MI, we will write a function that takes a dataset and returns this estimate marginalRiskRatio <- function(inputData) { ymod <- glm(y~x1+x2, family="binomial", data=inputData) #predict risk under x1=0 risk0 <- expit(coef(ymod)[1]+coef(ymod)[3]*inputData$x2) #predict risk under x1=1 risk1 <- expit(coef(ymod)[1]+coef(ymod)[2]*1+coef(ymod)[3]*inputData$x2) #estimate marginal risk ratio mean(risk1)/mean(risk0) } fullData <- data.frame(y=y,x1=x1,x2=x2) marginalRiskRatio(fullData) [1] 2.295438
Next we make some values in Y and X2 missing, using one of the mechanisms considered by Sullivan et al:
#make some data missing, as per Sullivan et al z1 <- x1/0.2^0.5 r_y <- 1*(runif(n) < expit(2.5+2*z1)) r_x2 <- 1*(runif(n) < expit(2.5+2*z1)) obsData <- fullData obsData$y[r_y==0] <- NA obsData$x2[r_x2==0] <- NA
Now we can impute the missing values in Y and X2. We will do this using the smcfcs package, which will impute missing outcomes from the specified logistic outcome model and the missing covariate values from an imputation model which is compatible with the logistic outcome model:
#we impute using smcfcs, compatibly with logistic outcome model library(smcfcs) numImps <- 10 imps <- smcfcs(obsData, smtype="logistic", smformula="y~x1+x2", method=c("", "", "norm"), m=numImps) [1] "Outcome variable(s): y" [1] "Passive variables: " [1] "Partially obs. variables: x2" [1] "Fully obs. substantive model variables: x1" [1] "Imputation 1" [1] "Imputing missing outcomes using specified substantive model." [1] "Imputing: x2 using x1 plus outcome" [1] "Imputation 2" [1] "Imputation 3" [1] "Imputation 4" [1] "Imputation 5" [1] "Imputation 6" [1] "Imputation 7" [1] "Imputation 8" [1] "Imputation 9" [1] "Imputation 10" Warning message: In smcfcs.core(originaldata, smtype, smformula, method, predictorMatrix, : Rejection sampling failed 7 times (across all variables, iterations, and imputations). You may want to increase the rejection sampling limit.
Lastly, we can apply our earlier defined function for estimating the marginal risk ratio to each imputed dataset, and combine these using Rubin's rules (i.e. taking the average of the log risk ratios):
estLogRR <- array(0, dim=numImps) for (i in 1:numImps) { estLogRR[i] <- log(marginalRiskRatio(imps$impDatasets[[i]])) } #pooled estimate of log risk ratio is mean(estLogRR) [1] 0.8325685 #and estimate of risk ratio exp(mean(estLogRR)) [1] 2.299217
We obtain an estimate after imputation which is very close to the full data estimate. This is because data are MAR and we have imputed from a correctly specified imputation model (and because the sample size is very large).
Note we haven't obtained a variance estimate. This is because obtaining a standard error for the standardization estimate of the marginal risk ratio is somewhat harder - we could use an asymptotic analytical estimator, as implemented for example in Stata's teffects command (see previous post), or using bootstrapping for each imputed dataset.