Multiple imputation for coarsened (grouped) factor covariates

Missing data are a common problem in statistical analyses. A closely related but slightly different problem arises when, for an individual in a dataset, we do not know the exact value of a particular variable but do have some partial information about it. Specifically, we know the value belongs to a subset of the sample space. Such data are said to be coarsened. An example is a factor variable that takes values a, b, or c, where for some individuals we know the value is a or c, and for other individuals we know the value is b or c, but we are not sure which.

In such a setting, we could try to use multiple imputation (MI) to impute the missing values. This would involve setting the 'a or c' values and 'b or c' values to missing, and imputing. An obvious issue with this approach is that for individuals with 'a or c', some values could be imputed as b – the imputation has not respected the known information about the true value. Ideally we want our imputations to respect and utilise this partial information about the missing value.
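The core idea of respecting partial information can be sketched as follows: instead of drawing an imputed category from the full conditional distribution over a, b, and c, we restrict the distribution to the subset the value is known to lie in and renormalise before drawing. This is only an illustrative sketch, not the smcfcs algorithm (which fits and draws from substantive-model-compatible conditionals); the function name `impute_coarsened` and the probabilities below are hypothetical.

```python
import random

def impute_coarsened(probs, known_subset, rng=random):
    """Draw an imputed category from probs restricted to known_subset.

    probs: dict mapping category -> conditional probability (from some model)
    known_subset: set of categories the true value is known to lie in
    """
    restricted = {k: probs[k] for k in known_subset}
    total = sum(restricted.values())  # renormalising constant
    r = rng.random() * total
    cum = 0.0
    for k, p in restricted.items():
        cum += p
        if r < cum:
            return k
    return k  # guard against floating-point rounding at the boundary

# Hypothetical example: a model gives P(a)=0.5, P(b)=0.3, P(c)=0.2,
# but for this individual we know the true value is 'a' or 'c'
random.seed(2024)
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
draws = [impute_coarsened(probs, {"a", "c"}) for _ in range(10000)]

# The partial information is respected: 'b' is never imputed, and 'a'
# is drawn with renormalised probability 0.5/0.7 (roughly 0.71)
assert "b" not in draws
```

A naive MI approach would instead sample from the full distribution, so roughly 30% of the 'a or c' individuals would wrongly be imputed as b.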

Thanks to the work of Lars van der Burg, the smcfcs package in R for MI of missing covariates now incorporates functionality for imputing factor covariates that are missing but for which such partial information is available for some individuals. To see how the new functionality works, please see the accompanying vignette. For further details of the methodology, including simulations and an illustrative example, see van der Burg et al 2025, available open-access in Statistics in Medicine.

What is meant by a ‘while on treatment’ estimand?

The ICH E9 R1 addendum on estimands in clinical trials has made big waves in the clinical trial world in the last few years. It aims to provide a framework to think about and define more precisely what exactly the treatment effect(s) of interest is in a clinical trial, in light of what the addendum calls ‘intercurrent events’ (ICEs):

Events occurring after treatment initiation that affect either the interpretation or the existence of the measurements associated with the clinical question of interest. It is necessary to address intercurrent events when describing the clinical question of interest in order to precisely define the treatment effect that is to be estimated.

A couple of weeks ago a really nice paper was published by Harrison and Brummel in The American Statistician which explored the five different 'strategies' described in the E9 addendum for handling ICEs in a simple example using potential outcomes. For each strategy they gave an example of an estimand defined using the strategy and a simple estimator for estimating the estimand from the data. In this post, I want to focus on the while on treatment strategy, as I think it's one area where there is some debate as to what exactly the E9 addendum meant. I of course do not claim to have the definitive answer, but the following is my view.


Does a Bernoulli/binomial model really assume everyone has the same probability p?

When you estimate a proportion and want to calculate a standard error for the estimate, you would normally do so by assuming that the number of 'successes' in the sample is a draw from a binomial distribution, which counts the number of successes in a series of n independent Bernoulli 0/1 draws, where each draw has a probability p of 'success'. Does the model rely on or assume that for each of these binary observations the success probability is the same? In the third paragraph of this blog post, Frank Harrell seems to argue that it does. In this post I'll delve into this a bit further, using the same numerical example Frank gives.

Suppose we have a random sample of n individuals on whom we observe a binary outcome indicating presence or absence of disease. Suppose that in a sample of n=100, 40 have the disease, and so our estimate of the proportion with disease in the population (which I will denote p) from which the sample was drawn is \hat{p}=40/100=0.4.
