A while ago I got involved in a project led by Anna-Carolina Haensch and Bernd Weiß investigating multiple imputation methods for baseline covariates in discrete time survival analysis. The work has recently been published open access in the journal Sociological Methods & Research. The paper compares a range of multiple imputation approaches. My main contribution was to extend the substantive model compatible fully conditional specification (smcfcs) approach for multiple imputation to the discrete time survival model setting, and to add this functionality to the smcfcs package in R. In this short post I’ll give a quick demonstration of this functionality.
One of the things users have often asked me about the substantive model compatible fully conditional specification multiple imputation approach is the problem of perfect prediction. This problem arises when imputing a binary (or more generally a categorical) variable using a binary (or categorical) predictor, if within one or more levels of the predictor the variable being imputed is always 0 or always 1. Typically a logistic regression model is specified for the binary variable being imputed, and in the case of perfect prediction, the MLE for one or more parameters (on the log odds scale) is infinite. As described by White, Royston and Daniel (2010), this leads to problems in the imputations. In particular, to make the imputation process proper, new values of the parameters of the logistic regression imputation model are drawn from a multivariate normal distribution. Under perfect prediction the standard errors are in principle infinite, and in practice the software reports values that are finite but extremely large. These huge standard errors lead to posterior draws (or what are used in place of posterior draws) which fluctuate between being very large and negative and very large and positive, when in reality they ought to be large in only one direction (see Section 4 of White, Royston and Daniel (2010)).
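To make the problem concrete, here is a minimal sketch in base R. The toy data, variable names and seed are made up for illustration, and the final step draws only the coefficient of x from a univariate normal, which is a simplification of the multivariate normal draw described above.

```r
# Toy data (made up): y is always 0 when x = 1, so x = 1 perfectly predicts y = 0
x <- rep(c(0, 1), each = 50)
y <- c(rep(c(0, 1), times = 25), rep(0, 50))

# Fit the logistic regression; glm may emit warnings here, so they are suppressed
fit <- suppressWarnings(glm(y ~ x, family = binomial))
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
c(estimate = est, std_error = se)

# Simplified stand-in for the parameter draw step: a univariate normal
# for the coefficient of x only (assumption, not the full multivariate draw)
set.seed(2010)
draws <- rnorm(1000, mean = est, sd = se)

# Proportion of draws with the wrong (positive) sign
mean(draws > 0)
```

The estimated log odds ratio should be large and negative with a standard error that is orders of magnitude larger, so roughly half of the draws land on the wrong side of zero.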
When using multiple imputation to handle missing data, one must sooner or later decide how many imputations to base inferences on. The validity of inferences does not rely on how many imputations are used, but the statistical efficiency of the inference can be increased by using more imputations. Moreover, we may want our results to be reproducible to a given precision, in the sense that if someone were to re-impute the same data using the same number of imputations but with a different random number seed, they would obtain the same estimates to the desired precision. For a great summary of the considerations involved in choosing the number of imputations, see the corresponding section of Stef van Buuren’s book.
In this post I provide a small bit of R code which, given a pooled analysis after performing imputation using the mice package in R, calculates the so-called Monte Carlo standard error of the multiple imputation point estimates. Stata has really nice functionality for doing this built into mi estimate.
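As a rough sketch of the calculation itself: the Monte Carlo standard error of a pooled point estimate is the square root of B/M, where B is the between-imputation variance of the estimates and M is the number of imputations. The function name mi_mcse and the example numbers below are hypothetical, and the code uses only base R rather than a mice object.

```r
# Monte Carlo standard error of a pooled multiple imputation point estimate.
# 'ests' holds the estimate of one parameter from each of the M imputed datasets.
mi_mcse <- function(ests) {
  M <- length(ests)   # number of imputations
  B <- var(ests)      # between-imputation variance
  sqrt(B / M)
}

# Hypothetical estimates from M = 5 imputed datasets (made-up numbers)
ests <- c(1.02, 0.97, 1.10, 0.97, 1.04)

mean(ests)      # pooled point estimate under Rubin's rules
mi_mcse(ests)   # Monte Carlo standard error of that pooled estimate
```

With mice, my understanding is that the object returned by pool() stores the between-imputation variance and the number of imputations in the b and m columns of its pooled data frame, so the same quantity should be available as the square root of b/m. That is worth checking against the version of mice you have installed.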