One of the things users have often asked me about the substantive model compatible fully conditional specification multiple imputation approach is the problem of perfect prediction. This problem arises when imputing a binary (or more generally a categorical) variable using a binary (or categorical) predictor, when for one or more levels of the predictor the outcome is always 0 or always 1. Typically a logistic regression model is specified for the binary variable being imputed, and under perfect prediction the MLE of one or more parameters (on the log odds scale) is infinite. As described by White, Royston and Daniel (2010), this leads to problems in the imputations. In particular, to make the imputation process proper, new parameters of the logistic regression imputation model are drawn from a multivariate normal approximation to their posterior. The perfect prediction data configuration leads to standard errors that are essentially infinite, and that in practice on the computer will be extremely large. These huge standard errors lead to posterior draws (or what are used in place of posterior draws) which fluctuate between very large negative and very large positive values, when in reality they ought to be large in only one direction (see Section 4 of White, Royston and Daniel (2010)).
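To see the problem concretely, here is a minimal sketch (using made-up data, not from the example later in this post) of what happens when we fit an ordinary logistic regression to perfectly separated data:

```r
# A binary outcome x and a predictor z that perfectly predicts it:
# whenever z=0, x=0, and whenever z=1, x=1
x <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
z <- x

# glm warns that fitted probabilities are numerically 0 or 1;
# the MLE of the z coefficient is infinite, and the reported
# estimate and standard error are just very large finite numbers
fit <- suppressWarnings(glm(x ~ z, family = "binomial"))
summary(fit)$coefficients
```

Drawing from a normal approximation centred at such an estimate, with such a standard error, is what produces the wildly fluctuating posterior draws described above.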
Various solutions to this problem have been proposed in the modelling literature generally and in the multiple imputation literature in particular. In the latter case, White, Royston and Daniel (2010) proposed a data augmentation scheme in which a few carefully crafted additional observations are added before the imputation model is fitted, which has the effect of rendering the MLEs finite and the resulting procedure statistically stable. An alternative is Firth's bias correction approach, which modifies the maximum likelihood procedure by removing its first order finite sample bias. The approach can also be cast as a Bayesian one, using Jeffreys' invariant prior. A consequence is that the resulting estimates are always finite, even in cases of perfect prediction (Kosmidis and Firth 2021). As such, it is a neat solution to the problem of perfect prediction in multiple imputation.
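As a sketch of this in action (assuming the brglm2 package is installed), we can refit the perfectly separated toy data from above using brglm2's bias-reducing fitter, which plugs into glm via its method argument:

```r
library(brglm2)

# Perfectly separated toy data: z perfectly predicts x
x <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
z <- x

# Firth-type bias-reduced fit: estimates are finite
# even though the MLEs are infinite
fit_br <- glm(x ~ z, family = "binomial", method = brglm2::brglmFit)
coef(fit_br)
```

The estimates are finite (here the intercept is log(0.5/5.5) = -2.398 and the z coefficient 4.796, corresponding to adding a half observation to each cell of the 2x2 table), with correspondingly finite standard errors.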
Until now, the smcfcs packages in R and Stata have not been able to handle datasets with perfect prediction. I'm pleased to say that the smcfcs package in R now has functionality, courtesy of the brglm2 package, for using Firth's bias-reduced method when it fits logistic regression models, whether for the outcome (substantive) model or for one of the models for a partially observed covariate.
To illustrate, let's simulate a tiny (n=10) dataset where a binary covariate x is measured by a misclassified version z. In fact, we will simulate z so that whenever x=0, z=0, and whenever x=1, z=1. We will set x to be missing in 2 of the 10 observations, but z will be fully observed. We also simulate a continuous outcome y which depends on x.
n <- 10
z <- c(0,0,0,0,0,1,1,1,1,1)
x <- z
y <- x + rnorm(n)
x[c(5,10)] <- NA
simData <- data.frame(x,z,y)
Let’s now try and impute using smcfcs. We will make use of the predictorMatrix argument to tell smcfcs that we want to use z as an auxiliary variable, but we will not include it in the outcome/substantive model, compatible with the way we have simulated the data above:
library(smcfcs)

#impute x using z as an auxiliary variable
predMat <- array(0, dim=c(3,3))
predMat[1,2] <- 1

rm(x,z,y)
set.seed(62377)
imps <- smcfcs(simData, smtype="lm", smformula="y~x",
               method=c("logreg", "", ""),
               predictorMatrix=predMat, m=1)
Imputing:  x  using  z  plus outcome
Error in fittedMean[imputationNeeded, xMisVal] : subscript out of bounds
In addition: Warning message:
In rbinom(length(imputationNeeded), 1, directImpProbs) : NAs produced
Note the line "Imputing: x using z plus outcome", which shows that smcfcs will, as we wanted, use z as an auxiliary variable when imputing x. Because we have not included z as a covariate in the smformula argument, smcfcs makes the (in this case correct) assumption that y and z are conditionally independent given x.
The error messages we receive are due to perfect prediction: the fact that z perfectly predicts x. We could omit z from the imputation process, and this would solve the perfect prediction issue. But since z is such a good predictor of x, we would like to use it, because it (considerably!) reduces the uncertainty about the missing values in x.
To deal with this, we slightly modify the smcfcs call so that the method argument value used to impute x is brlogreg (bias-reduced logistic regression):
set.seed(62377)
imps <- smcfcs(simData, smtype="lm", smformula="y~x",
               method=c("brlogreg", "", ""),
               predictorMatrix=predMat, m=1)
This runs with no errors, and if we look at the single imputed dataset we see
imps$impDatasets
   x z           y
1  0 0  0.01271471
2  0 0 -0.32367364
3  0 0  0.15830966
4  0 0  0.70151758
5  1 0  0.99240572
6  1 1  0.21656320
7  1 1 -0.07626927
8  1 1  1.35128655
9  1 1 -0.68468862
10 1 1  2.02276208
The missing x values were in rows 5 and 10, and in truth were 0 and 1 respectively. In this imputed dataset, the 10th row value of x has been imputed as 1, but the 5th row value has also been imputed as 1. This is because the bias-reduced logistic regression fit (and hence the approximate posterior draws of its parameters that smcfcs generates) does not imply that z perfectly predicts x. To examine this, we can re-run smcfcs with noisy=TRUE to actually see the fit of the bias-reduced logistic model for f(x|z). Below I've picked out this model fit from the last iteration:
"Iteration 10"

Call:
glm(formula = xmodformula, family = "binomial", data = xmoddata,
    method = brglm2::brglmFit)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.4172  -0.4172   0.0000   0.4172   0.4172

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.398      1.618  -1.482   0.1384
z              4.796      2.288   2.096   0.0361 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13.8629  on 9  degrees of freedom
Residual deviance:  1.7402  on 8  degrees of freedom
AIC: 5.7402

Type of estimator: AS_mixed (mixed bias-reducing adjusted score equations)
Number of Fisher Scoring iterations: 1
Based on this fit, P(x=1|z=0) = expit(-2.398) = 0.08, while P(x=1|z=1) = expit(-2.398+4.796) = 0.92.
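We can verify these probabilities directly from the coefficients in the model fit above:

```r
# expit (inverse logit) function
expit <- function(x) 1/(1 + exp(-x))

# imputation probabilities implied by the bias-reduced fit
expit(-2.398)          # P(x=1|z=0), approx 0.08
expit(-2.398 + 4.796)  # P(x=1|z=1), approx 0.92
```

So even though z=0 always went with x=0 in the data, the bias-reduced fit assigns a non-negligible probability of imputing x=1 when z=0, which is how row 5 came to be imputed as 1.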
The smtype argument can also be set to brlogistic to analogously use the bias reduced logistic regression method when fitting the substantive/outcome model.
As the sample size gets larger, perfect prediction problems will tend to occur less often, and the differences between bias-reduced and regular logistic regression will shrink, such that it doesn't much matter which you use. But in (admittedly extreme and simplistic) settings like the one above, the bias-reduced route seems attractive. I therefore think that, with binary covariates and outcomes, the bias-reduced logistic regression options in the method and smtype arguments should probably be used by default.
The new functionality described in this post is currently available in the development version of smcfcs, which can be installed using the devtools package:
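A sketch of the install command (I am assuming the GitHub repository path here; check the package's GitHub page for the definitive one):

```r
# install the development version of smcfcs from GitHub
# (repository path assumed)
devtools::install_github("jwb133/smcfcs")
```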
These new features will then be available in the CRAN version of smcfcs soon.
This work was supported by a UK Medical Research Council grant (MR/T023953/1).