Research Fellow post at LSHTM – machine learning for missing data

We are currently recruiting for a Research Fellow position at the London School of Hygiene & Tropical Medicine to work on an exciting new project that will develop machine learning based methods for handling missing data in statistical analyses. The project, funded by the UK’s Economic and Social Research Council, will develop new missing data methods based on recent developments in double or debiased machine learning. The project team includes myself (Jonathan Bartlett), Shaun Seaman at the MRC Biostatistics Unit, and Richard Silverwood from UCL.

The post will be for 3.5 years, and we are accepting applications until 30th September. For further details on the role and to apply, please see the LSHTM jobs site.

The role of post intercurrent event data in the estimation of hypothetical estimands in clinical trials

Clinical trial estimands which make use of the so-called hypothetical strategy target the effect of one randomised treatment compared to another in a scenario where the corresponding intercurrent event does not happen. Historically, estimation of such estimands has made use of established techniques for handling missing data, after setting any data observed after the intercurrent event to missing.

In the last few years it has been shown that data after the intercurrent event can be used for estimation of such hypothetical estimands, using methods such as G-formula and G-estimation from causal inference. These offer the potential for increased statistical power, but rely on making certain assumptions about how the intercurrent event influences subsequent outcomes. In a new pre-print available on arXiv, Rhian Daniel and I examine further the role of such post intercurrent event data in estimation of hypothetical estimands.

In the paper we:

  • show certain G-formula estimators are identical to certain G-estimators, something which is not obvious from their construction
  • show these estimators can improve efficiency and power only by making additional assumptions, which are not required by estimators that do not use data observed after the intercurrent event (such as missing data estimators based on imputation)
  • show the gain in efficiency/power will typically be modest, since in most trials the rate of such intercurrent events is not too large
  • argue that the additional assumptions necessary will often not be plausible on clinical grounds

As such, we conclude by recommending that estimation of estimands that adopt the hypothetical strategy continue to be based on estimators that do not use data after the intercurrent event occurs. This involves setting any data observed after the intercurrent event to missing and handling the resulting missing counterfactual (no intercurrent event) outcomes using missing data methods, such as multiple imputation or inverse probability weighting.
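The first step of this recommended approach, setting data observed after the intercurrent event to missing, can be sketched in a few lines of Python. This is only an illustration under assumed conventions, not code from the paper: the function name mask_post_ice, the participant layout, and the convention that ice_visit is the 0-based index of the first visit affected by the intercurrent event are all hypothetical.

```python
from typing import List, Optional, Sequence


def mask_post_ice(
    outcomes: Sequence[Optional[float]], ice_visit: Optional[int]
) -> List[Optional[float]]:
    """Return a copy of one participant's outcomes with post-event values set to None.

    ice_visit is assumed to be the 0-based index of the first visit affected by
    the intercurrent event, or None if the participant had no such event.
    """
    if ice_visit is None:
        # No intercurrent event: all observed outcomes are kept as they are.
        return list(outcomes)
    # Keep visits before the event; treat every later value as missing.
    return [y if visit < ice_visit else None for visit, y in enumerate(outcomes)]


# Hypothetical participants: (outcomes by visit, index of first affected visit).
participants = {
    "p1": ([10.0, 9.5, 9.0, 8.5], None),  # no intercurrent event
    "p2": ([10.0, 9.8, 7.0, 6.5], 2),     # event affects visits 2 and 3
}

masked = {pid: mask_post_ice(y, ice) for pid, (y, ice) in participants.items()}
print(masked)
```

The None entries then stand for the missing counterfactual (no intercurrent event) outcomes, which would be handled in a second step by a missing data method such as multiple imputation or inverse probability weighting.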

Multiple imputation for coarsened (grouped) factor covariates

Missing data are a common problem in statistical analyses. A closely related but slightly different problem arises when, for an individual in a dataset, we do not know the exact value of a particular variable but do have some partial information about it. Specifically, we know the value belongs to a subset of the sample space. Such data are said to have been coarsened. An example is a factor variable that takes values a, b, or c, where for some individuals we know only that their value is a or c, and for other individuals we know only that their value is b or c, but in each case we are not sure which.

In such a setting, we could try to use multiple imputation (MI) to impute the missing values. This would involve setting the ‘a or c’ values and ‘b or c’ values to missing, and imputing. An obvious issue with this approach is that for individuals with ‘a or c’, some could be imputed as b: the imputation has not respected the known information about the true value. Ideally we want our imputation to respect and utilise this partial information about the missing value.
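To make the idea of respecting partial information concrete, here is a minimal Python sketch of a single imputation draw that is restricted to the categories known to be possible. It is not how smcfcs is implemented; it only illustrates the principle. The function name draw_restricted and the fitted probabilities are hypothetical, and the sketch assumes the coarsening mechanism can be ignored, so that the relevant distribution is simply the imputation model's probabilities conditioned on the allowed subset.

```python
import random
from typing import Dict, Set


def draw_restricted(probs: Dict[str, float], allowed: Set[str], rng: random.Random) -> str:
    """Draw one category from probs, conditional on it lying in allowed.

    probs holds (hypothetical) fitted probabilities for each level of the factor.
    allowed is the subset of levels the individual's true value is known to be in.
    """
    levels = sorted(allowed)  # fixed order so results are reproducible for a given seed
    weights = [probs[level] for level in levels]
    if sum(weights) <= 0:
        raise ValueError("allowed categories have zero total probability")
    # random.choices normalises the weights, which is exactly the conditioning step.
    return rng.choices(levels, weights=weights, k=1)[0]


# Hypothetical fitted probabilities for one individual with value 'a or c'.
fitted = {"a": 0.5, "b": 0.3, "c": 0.2}
rng = random.Random(2025)
draws = [draw_restricted(fitted, {"a", "c"}, rng) for _ in range(1000)]
print(sorted(set(draws)))  # 'b' can never appear among the imputed values
```

In contrast, a naive draw from all three levels would impute b for this individual roughly 30% of the time under these made-up probabilities, contradicting what is known about them.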

Thanks to the work of Lars van der Burg, the smcfcs package in R for MI of missing covariates now incorporates functionality for imputing factor covariates which are missing but for which there is such partial information (for some individuals). To see how the new functionality works, please see the accompanying vignette. For further details of the methodology, including simulations and an illustrative example, see van der Burg et al 2025, available open-access in Statistics in Medicine.