G-formula for causal inference via multiple imputation

G-formula (sometimes known as G-computation) is an approach for estimating the causal effects of treatments or exposures which can vary over time and which are subject to time-varying confounding. It is one of the so called G-methods developed by Jamie Robins and co-workers. For a nice overview of these, I recommend this open access paper by Naimi et al 2017, and for more details, the What If book by Hernán and Robins. In this post, I’ll describe some recent work with Camila Olarte Parra and Rhian Daniel in which we have explored the use of multiple imputation methods and software as a route to implementing G-formula estimators.

G-formula implementation

Without going into all the details here, G-formula as typically implemented in software simulates longitudinal histories of the time-varying confounders and outcomes under the treatment or exposure regimes which are of scientific interest. Typically, as for example in the R package gfoRmula or in the Stata gformula package, models are fitted for the conditional distributions of the time-varying confounders given the earlier variables. Values are then simulated conditional on the fitted models (typically fitted by maximum likelihood) for a user specified number of individuals. Quantities of interest, such as the mean final outcome under the regime(s) of interest can then be calculated. For inference, the non-parametric bootstrap is used. This means that the original data is resampled many times. To each resample, the conditional models are fitted, the longitudinal histories are simulated, and the estimand(s) of interest calculated. The standard error of quantities being estimated can then be estimated by the bootstrap standard error. One (maybe minor) drawback of using bootstrapping is that it is computationally intensive, although these days this is often less of an issue than it would have been in the past.

G-formula via multiple imputation

The simulation based G-formula estimator has strong similarities with imputation estimators. In particular, we can view the simulation G-formula estimator as a form of synthetic imputation – we are imputing longitudinal histories for new individuals under the treatment regimes of interest given our fitted models. Synthetic imputation was proposed many years ago as a possible solution in the setting where one wants to release datasets (e.g. from a survey) but does not want to release the original data in order to protect the confidentiality of those in the original data. Raghunthan, Reiter and Rubin, in a 2003 paper, showed how such an approach could work using Bayesian multiple imputation techniques. In particular, they demonstrated that Rubin’s standard combination rules overestimate the variance of estimates obtained from analysing synthetic imputation datasets. Moreover, they derived an alternative variance estimator that correctly accounts for the fact that in the synthetic imputation setting, the original data used to fit the imputation model are no longer involved in the analysis stage. Interestingly, whereas Rubin’s original variance estimator sums the between and within imputation variances (save for an adjustment due to the finite number of imputations), the synthetic imputation variance estimator is the between minus within variance (again with an adjustment for the finite number of imputations).

In a pre-print available now on arXiv, we show how this methodology can be used to implement G-formula estimators using standard multiple imputation software, such as the mice package in R. This involves augmenting the original dataset with additional rows in which all variables are set to missing except the treatment indicators, which are set to the values under the treatment regimes of interest. The (artificial) missing values are then imputed, and the mean (for example) counterfactual final outcome under each regime can be estimated using the resulting synthetic imputations. Simulation evidence obtained thus far suggests this approach can work well with as few as 25 imputations.

Missing data

In practice, the original dataset we have often suffers from some missing values. We are then faced with the question of how to handle these missing values in the context of applying G-formula. In the pre-print, we propose using multiple imputation software to first impute the missing values in the original data. Each of the resulting imputed datasets can then be augmented with additional rows and the counterfactual outcomes in these rows imputed. These can then be analysed using the synthetic imputation rules developed by Raghunthan et al (2003).

R package gFormulaMI

Implementing G-formula using standard multiple imputation software is relatively straightforward. However, to facilitate use of the approach, we have developed an R package, gFormulaMI, implementing it. The package exploits the functionality of the mice package in R to impute the counterfactual outcomes under the treatment regimes of interest. The synthetic imputations are then analysed, and the results pooled using the synthetic imputation rules.

For an introduction to the functionality of the package, please see the vignette here.

The development version of the gFormulaMI package can be installed into R using:



The package will be uploaded to CRAN in due course. In the meantime, if anyone has feedback or experiences errors running it, please let me know via the Issues page at Github.


If you are interested in hearing more about this work, I’ll be talking about it in London at LSHTM (and streamed online) on 8th February (recording available here). For more details on this, please see here. The slides of the talk are available here.


This work was supported by a grant from the UK Medical Research Council (MR/T023953/1).

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.