G-formula (sometimes known as G-computation) is an approach for estimating the causal effects of treatments or exposures which can vary over time and which are subject to time-varying confounding. It is one of the so called G-methods developed by Jamie Robins and co-workers. For a nice overview of these, I recommend this open access paper by Naimi et al 2017, and for more details, the What If book by Hernán and Robins. In this post, I’ll describe some recent work with Camila Olarte Parra and Rhian Daniel in which we have explored the use of multiple imputation methods and software as a route to implementing G-formula estimators.

# multiple imputation

## Multiple imputation for missing baseline covariates in discrete time survival analysis

A while ago I got involved in a project led by Anna-Carolina Haensch and Bernd Weiß investigating multiple imputation methods for baseline covariates in discrete time survival analysis. The work has recently been published open access in the journal Sociological Methods & Research. The paper investigates a variety of different multiple imputation approaches. My main contribution was the extension of the substantive model compatible fully conditional specification (smcfcs) approach for multiple imputation to the discrete time survival model setting, and extending the functionality of the smcfcs package in R to incorporate this. In this short post I’ll give a quick demonstration of this functionality.

## Perfect prediction handling in smcfcs for R

One of the things users have often asked me about the substantive model compatible fully conditional specification multiple imputation approach is the problem of perfect prediction. This problem arises when imputing a binary (or more generally a categorical variable) and there is a binary (or categorical) predictor, if among one or more levels of the predictor, the outcome is always 0 or always 1. Typically a logistic regression model is specified for the binary variable being imputed, and in the case of perfect prediction, the MLE for one or more parameters (on the log odds scale) is infinite. As described by White, Royston and Daniel (2010), this leads to problems in the imputations. In particular, to make the imputation process proper, a draw from the multivariate normal is used to draw new parameters of the logistic regression imputation model. The perfect prediction data configuration leads to standard errors that are essentially infinite, but in practice on the computer will be very very large. These huge standard errors lead to posterior draws (or what are used in place of posterior draws) which fluctuate from being very large and negative to very large and positive, when in reality they ought to be only large in one direction (see Section 4 of White, Royston and Daniel (2010)).