Multiple imputation for coarsened (grouped) factor covariates

Missing data are a common problem in statistical analyses. A closely related but slightly different problem arises when we do not know the exact value of a particular variable for an individual in a dataset, but we do have some partial information about it. Specifically, we know the value belongs to a subset of the sample space. Such data are said to be coarsened. An example is a factor variable that takes values a, b, or c, where for some individuals we know only that their value is a or c, and for others only that it is b or c, without knowing which.

In such a setting, we could try to use multiple imputation (MI) to impute the missing values. This would involve setting the ‘a or c’ values and ‘b or c’ values to missing, and imputing. An obvious issue with this approach is that some individuals with ‘a or c’ could be imputed as b, so the imputation would not respect the known information about the true value. Ideally we want our imputation to respect and utilise this partial information about the missing value.
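To make the idea concrete, here is a minimal sketch in Python of what respecting the partial information means when drawing an imputed value. It is not the smcfcs implementation: the function name restricted_draw and the probabilities are hypothetical, chosen only for illustration. Given some model-based probabilities for a, b and c for one individual, we discard the categories ruled out by what we know, renormalise, and draw from what remains.

```python
import random


def restricted_draw(probs, allowed, rng):
    """Draw one category from `probs`, restricted to the `allowed` subset.

    `probs` maps each category to its (hypothetical) conditional probability;
    `allowed` is the set of categories consistent with the partial information.
    """
    # Keep only the categories that the partial information permits.
    weights = {cat: p for cat, p in probs.items() if cat in allowed}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("allowed categories have zero total probability")
    cats = sorted(weights)
    # Renormalise so the remaining probabilities sum to one, then draw.
    return rng.choices(cats, weights=[weights[c] / total for c in cats], k=1)[0]


# Hypothetical probabilities for one individual (an assumption, not real output).
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
rng = random.Random(2025)

# An individual known to be 'a or c' is never imputed as b.
draws = [restricted_draw(probs, {"a", "c"}, rng) for _ in range(1000)]
print({cat: draws.count(cat) for cat in sorted(set(draws))})
```

By contrast, setting the value to fully missing corresponds to allowing all three categories, in which case b would be drawn with probability 0.3 for this individual.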

Thanks to the work of Lars van der Burg, the smcfcs package in R for MI of missing covariates now includes functionality for imputing factor covariates that are missing but for which such partial information is available for some individuals. To see how the new functionality works, please see the accompanying vignette. For further details of the methodology, including simulations and an illustrative example, see van der Burg et al. (2025), available open-access in Statistics in Medicine.

Multiple imputation with flexible parametric survival models

Following a recent request, I’ve extended the functionality of my R package smcfcs, which performs multiple imputation of missing covariates compatibly with a user-specified substantive or outcome model. The package can now impute compatibly with a flexible parametric Royston-Parmar type model. In this post I’ll briefly highlight some of the potential uses of this new functionality.


Estimating hypothetical estimands with causal inference and missing data estimators in a diabetes trial

Together with Camila Olarte Parra (LSHTM), Rhian Daniel (Cardiff), and David Wright (AstraZeneca), I recently posted a new paper on arXiv that explores the use of estimators from both the causal inference and missing data literatures for estimating a so-called hypothetical estimand in a previously conducted clinical trial in diabetes.
