Multiple imputation for coarsened (grouped) factor covariates

Missing data are a common problem in statistical analyses. A closely related but slightly different problem arises when, for an individual in a dataset, we do not know the exact value of a particular variable but do have some partial information about it. Specifically, we know the value belongs to a subset of the sample space. Such data are said to be coarsened. An example is a factor variable that takes values a, b, or c, where for some individuals we know only that their value is a or c, and for others only that it is b or c, but in each case we are not sure which.

In such a setting, we could try to use multiple imputation (MI) to impute the missing values. This would involve setting the ‘a or c’ values and ‘b or c’ values to missing, and imputing. An obvious issue with this approach is that some individuals with ‘a or c’ could be imputed as b: the imputation has not respected the known information about the true value. Ideally we want our imputation to respect and utilise this partial information about the missing value.
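To make the difference concrete, here is a minimal sketch in Python of the two ways of drawing an imputed value for a single individual. It is an illustration of the idea only, not how smcfcs implements it: the function names and the fitted probabilities are made up, and in a real imputation the probabilities would come from the imputation model and vary between individuals. The restricted draw simply sets the probability of the excluded levels to zero and renormalises over the levels that remain.

```python
import random

LEVELS = ["a", "b", "c"]

# Hypothetical fitted probabilities for one individual (invented for illustration).
example_probs = {"a": 0.5, "b": 0.3, "c": 0.2}


def draw_unrestricted(probs, rng):
    """Naive approach: treat the value as fully missing and draw from all levels."""
    weights = [probs[level] for level in LEVELS]
    return rng.choices(LEVELS, weights=weights, k=1)[0]


def restricted_probs(probs, allowed):
    """Zero out levels outside `allowed` and renormalise the rest to sum to 1."""
    total = sum(probs[level] for level in LEVELS if level in allowed)
    if total <= 0:
        raise ValueError("allowed levels have zero total probability")
    return {
        level: (probs[level] / total if level in allowed else 0.0)
        for level in LEVELS
    }


def draw_restricted(probs, allowed, rng):
    """Draw an imputed level that respects the partial information in `allowed`."""
    truncated = restricted_probs(probs, allowed)
    weights = [truncated[level] for level in LEVELS]
    return rng.choices(LEVELS, weights=weights, k=1)[0]


rng = random.Random(2025)
# An individual known to be 'a or c': the naive draws can include 'b',
# the restricted draws cannot.
naive = [draw_unrestricted(example_probs, rng) for _ in range(10)]
respecting = [draw_restricted(example_probs, {"a", "c"}, rng) for _ in range(10)]
print("unrestricted:", naive)
print("restricted:  ", respecting)
```

With the restricted draw, an individual known to be ‘a or c’ can never be imputed as b, and the relative probabilities of a and c are the same as under the unrestricted model.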

Thanks to the work of Lars van der Burg, the smcfcs package in R for MI of missing covariates now incorporates functionality for imputing factor covariates whose exact value is missing for some individuals but for which such partial information is available. To see how the new functionality works, please see the accompanying vignette. For further details of the methodology, including simulations and an illustrative example, see van der Burg et al. (2025), available open-access in Statistics in Medicine.
