I was recently asked about whether smcfcs, my R and Stata packages for multiple imputation of covariates, can accommodate non-linear relationships between covariates. The answer is yes, and in this post I’ll illustrate how this can be done.

To illustrate, we will simulate a very large dataset with two covariates, x1 and x2, and a continuous outcome y. We will simulate x2 as having a mean which is a quadratic function of x1, and then make some values of x2 missing, with the probability of missingness dependent on the outcome y:

```r
library(smcfcs)

expit <- function(x) {
  exp(x) / (1 + exp(x))
}

set.seed(123)
n <- 10000
x1 <- rnorm(n)
x2 <- x1^2 + rnorm(n)
y <- x1 + x2 + rnorm(n)
# make some x2 values missing conditional on y
x2[runif(n) < expit(y)] <- NA
mydata <- data.frame(x1, x2, y)
```

The true coefficients of the substantive model are thus 0 (intercept), 1 (coefficient of x1) and 1 (coefficient of x2). Note that here there are no non-linearities in the substantive model, but there is a non-linearity in the dependence of x2 on x1. Although this latter non-linearity may not be of particular scientific interest, we will now see that failure to accommodate this in the imputation process will lead to biased inferences:

```r
imps1 <- smcfcs(mydata, smtype = "lm", smformula = "y~x1+x2",
                numit = 50, method = c("", "norm", ""))
# note we use more iterations than the default because
# a preliminary run showed this was needed for convergence
library(mitools)
impobj <- imputationList(imps1$impDatasets)
models <- with(impobj, lm(y ~ x1 + x2))
summary(MIcombine(models))
```

This gives as output:

```
[1] "Outcome variable(s): y"
[1] "Passive variables: "
[1] "Partially obs. variables: x2"
[1] "Fully obs. substantive model variables: x1"
[1] "Imputation 1"
[1] "Imputing: x2 using x1 plus outcome"
[1] "Imputation 2"
[1] "Imputation 3"
[1] "Imputation 4"
[1] "Imputation 5"
Warning message:
In smcfcs.core(originaldata, smtype, smformula, method, predictorMatrix,  :
  Rejection sampling failed 11974 times (across all variables, iterations,
  and imputations). You may want to increase the rejection sampling limit.
Multiple imputation results:
      with(impobj, lm(y ~ x1 + x2))
      MIcombine.default(models)
             results         se   (lower    upper) missInfo
(Intercept) 0.511528 0.02282962 0.4640523 0.5590037     48 %
x1          1.467190 0.02633373 1.4097039 1.5246753     64 %
x2          1.070601 0.01513689 1.0407036 1.1004980     17 %
```

We see appreciable bias in the estimate of the intercept and coefficient of x1. smcfcs is imputing the missing values in x2 assuming that x2 follows a linear regression model conditional on x1, with the conditional mean being linear in x1. Because we simulated the data, we know this is incorrect: the conditional mean of x2 is quadratic in x1. Since x1 is here fully observed, we can easily accommodate this by generating a new variable for the square of x1 and including this in the substantive model. Doing so will mean that x1 squared is automatically conditioned on in the imputation model for x2:

```r
mydata$x1sq <- mydata$x1^2
imps2 <- smcfcs(mydata, smtype = "lm", smformula = "y~x1+x2+x1sq",
                numit = 50, method = c("", "norm", "", ""))
impobj <- imputationList(imps2$impDatasets)
models <- with(impobj, lm(y ~ x1 + x2))
summary(MIcombine(models))
```

This gives as output:

```
[1] "Outcome variable(s): y"
[1] "Passive variables: "
[1] "Partially obs. variables: x2"
[1] "Fully obs. substantive model variables: x1,x1sq"
[1] "Imputation 1"
[1] "Imputing: x2 using x1,x1sq plus outcome"
[1] "Imputation 2"
[1] "Imputation 3"
[1] "Imputation 4"
[1] "Imputation 5"
Warning message:
In smcfcs.core(originaldata, smtype, smformula, method, predictorMatrix,  :
  Rejection sampling failed 101 times (across all variables, iterations,
  and imputations). You may want to increase the rejection sampling limit.
Multiple imputation results:
      with(impobj, lm(y ~ x1 + x2))
      MIcombine.default(models)
                results         se      (lower      upper) missInfo
(Intercept) -0.01652469 0.01311926 -0.04266813 0.009618758     25 %
x1           1.04806406 0.01266759  1.02212900 1.073999125     42 %
x2           1.03650918 0.01316507  1.00460701 1.068411347     84 %
```

We now have estimates which are pretty close to the true values used in the data generating mechanism.

One point to note is that we have modified not only the model assumed for x2|x1, but also the substantive model (at least the one used as part of the imputation process), which now includes x1sq. Just because x1 has a non-linear effect on x2, this does not necessarily mean we would want to include x1sq as a covariate in the substantive model. It is possible to incorporate x1sq in the covariate model for x2 but not in the substantive model used for imputation, by using the predictorMatrix option and leaving x1sq out of the substantive model formula:

```r
# rows of predictorMatrix index the variable being imputed, and columns its
# predictors; columns here follow the data frame order: x1, x2, y, x1sq
predictorMatrix <- array(0, dim = c(4, 4))
# impute x2 using x1 and x1sq, while keeping x1sq out of the substantive model
predictorMatrix[2, c(1, 4)] <- 1
imps3 <- smcfcs(mydata, smtype = "lm", smformula = "y~x1+x2",
                numit = 50, predictorMatrix = predictorMatrix,
                method = c("", "norm", "", ""))
impobj <- imputationList(imps3$impDatasets)
models <- with(impobj, lm(y ~ x1 + x2))
summary(MIcombine(models))
```

This gives as output:

```
[1] "Outcome variable(s): y"
[1] "Passive variables: "
[1] "Partially obs. variables: x2"
[1] "Fully obs. substantive model variables: x1"
[1] "Imputation 1"
[1] "Imputing: x2 using x1,x1sq plus outcome"
[1] "Imputation 2"
[1] "Imputation 3"
[1] "Imputation 4"
[1] "Imputation 5"
Warning message:
In smcfcs.core(originaldata, smtype, smformula, method, predictorMatrix,  :
  Rejection sampling failed 503 times (across all variables, iterations,
  and imputations). You may want to increase the rejection sampling limit.
Multiple imputation results:
      with(impobj, lm(y ~ x1 + x2))
      MIcombine.default(models)
               results          se      (lower      upper) missInfo
(Intercept) -0.0274234 0.015746687 -0.06054163 0.005694823     53 %
x1           1.0075646 0.018740270  0.96407720 1.051052088     77 %
x2           1.0026004 0.008043873  0.98549090 1.019709850     56 %
```

In general, the effect of excluding x1sq from the substantive model used in the imputation process, when it is not needed there, is to give somewhat more precise inferences.

Here x1 was fully observed. What if x1 also had some missing values? Then we would need to tell smcfcs how to impute x1, and then to passively impute the x1sq variable. How should we impute x1, given our knowledge of the true data generating model? x1 was simulated as normally distributed marginally, and x2|x1 as normal, with a quadratic mean in x1. This implies that x1|x2 is some non-standard distribution, and not one that smcfcs can correctly accommodate. Nevertheless, we will proceed, asking smcfcs to impute x1 using the norm method:

```r
# make some x1 values missing completely at random
mydata$x1[runif(n) < 0.25] <- NA
mydata$x1sq <- mydata$x1^2
# impute x1 using x2
predictorMatrix[1, 2] <- 1
# impute x1 with "norm", and passively impute x1sq as x1^2
imps4 <- smcfcs(mydata, smtype = "lm", smformula = "y~x1+x2",
                numit = 50, predictorMatrix = predictorMatrix,
                method = c("norm", "norm", "", "x1^2"))
impobj <- imputationList(imps4$impDatasets)
models <- with(impobj, lm(y ~ x1 + x2))
summary(MIcombine(models))
```

which gives as output:

```
[1] "Outcome variable(s): y"
[1] "Passive variables: x1sq"
[1] "Partially obs. variables: x1,x2"
[1] "Fully obs. substantive model variables: "
[1] "Imputation 1"
[1] "Imputing: x1 using x2 plus outcome"
[1] "Imputing: x2 using x1,x1sq plus outcome"
[1] "Imputation 2"
[1] "Imputation 3"
[1] "Imputation 4"
[1] "Imputation 5"
Warning message:
In smcfcs.core(originaldata, smtype, smformula, method, predictorMatrix,  :
  Rejection sampling failed 17260 times (across all variables, iterations,
  and imputations). You may want to increase the rejection sampling limit.
Multiple imputation results:
      with(impobj, lm(y ~ x1 + x2))
      MIcombine.default(models)
              results         se   (lower    upper) missInfo
(Intercept) 0.2687343 0.04002737 0.1694782 0.3679903     88 %
x1          1.0276229 0.03432337 0.9436348 1.1116109     86 %
x2          1.0742299 0.01635284 1.0385746 1.1098852     64 %
```

First, we note the dramatically increased number of rejection sampling failures, which we can attribute to the misspecification of the x1 imputation model. Second, we now see some bias in the parameter estimates.
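As the warning message suggests, the rejection sampling limit itself can be raised via smcfcs's rjlimit argument (the value 10000 below is just an illustration; see ?smcfcs for the default). Note, though, that this only reduces the warning; it does not address the underlying misspecification of the x1 imputation model:

```r
# re-run the final imputation with a higher rejection sampling limit;
# this reduces sampling failures but does not remove the bias caused by
# the misspecified normal model for x1 | x2
imps4b <- smcfcs(mydata, smtype = "lm", smformula = "y~x1+x2",
                 numit = 50, predictorMatrix = predictorMatrix,
                 method = c("norm", "norm", "", "x1^2"),
                 rjlimit = 10000)
```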

This example also illustrates a theoretical issue with smcfcs: while it imputes each covariate from an imputation model which is compatible with the specified substantive (outcome) model, this does not imply that the imputation models for the different covariates are mutually compatible. Specifically, the models used for the distribution of each covariate conditional on the others may not be compatible with any single joint model.
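To see why a normal imputation model for x1 given x2 is misspecified here, consider the data generating mechanism directly. The following base R snippet (my own illustration, not part of the original analysis) shows that when x2 = x1^2 + error, conditioning on a large value of x2 leaves x1 concentrated in two modes, near -sqrt(x2) and +sqrt(x2):

```r
# illustration: under the simulation model, x1 | x2 is bimodal, not normal
set.seed(123)
n <- 100000
x1 <- rnorm(n)
x2 <- x1^2 + rnorm(n)
# among observations with x2 close to 4, x1 clusters near -2 and +2
sel <- abs(x2 - 4) < 0.1
mean(abs(x1[sel]) > 1)  # almost all conditional mass lies away from zero
mean(x1[sel] > 0)       # roughly half the mass in each mode
```

A single normal distribution cannot capture this bimodality, which is why imputing x1 with the norm method introduces bias.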

A more principled approach is to specify a single joint model for the data, and impute under its implied conditional distributions. This can be achieved using JAGS, for example. A couple of years ago I wrote an R package for fitting Bayesian models with missing covariates using JAGS from R. It's available on GitHub here, but it needs some more work to get it to give back imputed values after fitting the Bayesian model.
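To make the joint modelling approach concrete, here is a sketch (my own, not from any package) of how the joint model for this example could be written in the JAGS/BUGS language. When x1[i] or x2[i] is NA in the supplied data, JAGS samples it from its full conditional distribution under this joint model, so the imputations are mutually compatible by construction. The prior choices below are one common, if debatable, default:

```
model {
  for (i in 1:n) {
    # marginal model for x1
    x1[i] ~ dnorm(0, tau.x1)
    # quadratic covariate model for x2 | x1
    x2[i] ~ dnorm(gamma0 + gamma1*x1[i] + gamma2*x1[i]*x1[i], tau.x2)
    # substantive model for y | x1, x2
    y[i] ~ dnorm(beta0 + beta1*x1[i] + beta2*x2[i], tau.y)
  }
  # vague priors (JAGS parameterizes dnorm by precision, not variance)
  gamma0 ~ dnorm(0, 1.0E-4)
  gamma1 ~ dnorm(0, 1.0E-4)
  gamma2 ~ dnorm(0, 1.0E-4)
  beta0 ~ dnorm(0, 1.0E-4)
  beta1 ~ dnorm(0, 1.0E-4)
  beta2 ~ dnorm(0, 1.0E-4)
  tau.x1 ~ dgamma(0.01, 0.01)
  tau.x2 ~ dgamma(0.01, 0.01)
  tau.y ~ dgamma(0.01, 0.01)
}
```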