Causal (in)validity of the trimmed means estimand

This week I’ve been given the opportunity to present some ongoing work with colleagues Camila Olarte Parra and Rhian Daniel about the so-called ‘trimmed means estimand’ in clinical trials at the International Biometric Conference in Riga, Latvia. The slides of my talk are available here for anyone interested. In this post I’ll give a brief overview of my talk.

Intercurrent events and composite strategies

In clinical trials some patients may experience what is now termed an ‘intercurrent event’ (ICE) in the ICH E9 estimand addendum. This is an event that occurs after treatment initiation that affects the existence or interpretation of a patient’s outcome. Some examples relevant for the method I’m talking about here are discontinuation of randomised treatment due to perceived lack of efficacy or toxicity / adverse events.

The ICH E9 addendum discusses various strategies that can be used to handle such intercurrent events. One is the composite strategy, whereby one somehow incorporates the occurrence of the ICE into the definition of the outcome or endpoint variable. With an endpoint that is binary, this can be relatively straightforward: if a patient experiences the ICE, which here would constitute a bad outcome or failure of treatment, their endpoint value is set to the category or level which constitutes a bad outcome. Hence we lump together patients who experienced the intercurrent event with those who did not but who nevertheless had a bad outcome.

When the endpoint in the trial is continuous, applying a composite strategy is harder. What value should we assign for patients who experience the ICE?

Trimmed means approach

Permutt and Li (2017) proposed a trimmed means approach to answering this question. It works as follows. Patients who experience the ICE are assigned a very bad value (e.g. the worst value possible for the variable) for the outcome, with the justification that the ICE occurring constitutes a bad outcome (in a general sense) for the patient. It does not matter precisely what value is assigned, because in the next step we trim (delete) the worst x% of patients from each treatment group. Thus as long as the very bad value is so bad that we are guaranteed the patients assigned this value are trimmed (excluded), it doesn’t matter precisely what the value is. We then calculate the difference in means (difference in trimmed means) between the two treatment groups.

The percentage x we choose needs to be at least as large as the largest of the proportions of patients who experience the ICE in the two treatment groups, so that we ensure all patients with an ICE are trimmed. Note that this means that some patients who don’t experience an ICE will also get trimmed.
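To make the steps above concrete, here is a minimal sketch of the calculation in Python. It is only an illustration under my own assumptions, not the implementation from Permutt and Li: the function name trimmed_mean and the toy numbers are made up, higher values are taken to be better, and minus infinity stands in for the ‘very bad value’.

```python
import numpy as np

def trimmed_mean(outcomes, had_ice, trim_fraction):
    """Mean outcome among the best (1 - trim_fraction) of one treatment group.

    Hypothetical helper for illustration; higher outcome values are better.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    had_ice = np.asarray(had_ice, dtype=bool)
    # The trimming fraction must cover every patient with an ICE.
    if had_ice.mean() > trim_fraction:
        raise ValueError("trim_fraction is smaller than the ICE proportion")
    # Composite endpoint: ICE patients get a value worse than any real outcome.
    u = np.where(had_ice, -np.inf, outcomes)
    # Number of patients to trim (rounded up), then drop the worst ones.
    n_trim = int(np.ceil(trim_fraction * len(u)))
    kept = np.sort(u)[n_trim:]
    return kept.mean()

# Toy data: no ICEs under control, two ICEs (outcome unobserved) under active.
control_tm = trimmed_mean([1.0, 2.0, 3.0, 4.0], [False] * 4, 0.5)
active_tm = trimmed_mean([np.nan, 5.0, 6.0, np.nan],
                         [True, False, False, True], 0.5)
print("difference in trimmed means:", active_tm - control_tm)
```

The same trimming fraction is applied to both groups, so in the control group here two patients without an ICE are trimmed, as noted above.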

The trimmed means approach is attractive because we incorporate the occurrence of the ICE in the endpoint, but, at least to a certain extent, the precise value we assign when a patient has an ICE is not important.

Interpretation of the trimmed mean

Permutt and Li explain that one of the nice things about the trimmed mean is that it is readily interpretable:

Some patients did badly on treatment. Either they completed with bad outcomes, or they dropped out. For some medical conditions, it will not matter much whether they dropped out or completed with bad outcomes, nor how bad the bad outcomes were. The trimmed mean is the average outcome for other patients, those who did best in each group.

My work on this topic has focused on thinking about whether the trimmed means comparison between treatment groups is fair or valid as a measure of the causal effect of treatment. Specifically, are the patients who in the end are being compared in the difference in trimmed means comparable?

Randomisation and exchangeability

By randomly assigning which treatment each patient gets in a clinical trial, we are guaranteed that (on average) the characteristics of the patients assigned to one treatment have the same distribution as those of the patients assigned to the other treatment. This allows us to compare outcomes between the treatment groups and interpret differences seen as a causal effect of treatment. In contrast, in an observational study where treatments are not randomly assigned, differences in outcomes between treatment groups could, partly or wholly, be due to the fact that certain types of patients tended to be given one treatment and certain other types of patients the other.

Trimmed means in DAGs

One approach to looking at this is via directed acyclic graphs (DAGs). The following DAG shows the setup we have, where A denotes randomised treatment, X baseline variables, Y the outcome/endpoint and R whether or not the patient experiences the ICE. Because of randomisation, there is no arrow going directly between randomised treatment A and the baseline variables X. We then have the composite endpoint U, which is equal to Y for those patients who do not experience the ICE, and is equal to the very bad value for those that do. The binary variable Trim then indicates whether the patient is ‘trimmed out’ (Trim=1) or not (Trim=0). Whether a patient gets trimmed depends on the value of the composite endpoint U and whether this falls below the relevant quantile for that treatment group (hence the arrow from A to Trim).

When we remove (trim) patients whose value of U falls below (here taking low values as bad outcomes) the x% quantile of U in their treatment group, we analyse the remaining patients, those with Trim=0. Or put another way, we stratify or condition on Trim=0. When we do this, the rules for DAGs indicate that (in general) we get a correlation or dependence between the baseline variables X and the treatment group A, because we are conditioning on Trim which is a ‘child’ of A and X. Or put another way, the distribution of the baseline variables X differs between the two treatment groups – the two groups of patients being compared are no longer exchangeable.
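A small simulation can illustrate this. The sketch below is only an illustration under assumptions I have invented for the purpose: a single baseline variable X, an outcome Y that depends on X but not on treatment, an ICE that can only occur in active-arm patients with X > 0, and a trimming fraction of 20%. None of these choices come from a real trial.

```python
import numpy as np

rng = np.random.default_rng(2022)
n = 200_000

x = rng.normal(size=n)              # baseline variable X
a = rng.integers(0, 2, size=n)      # randomised treatment A (0 = control, 1 = active)
y = x + rng.normal(size=n)          # outcome Y, higher is better, no treatment effect

# Hypothetical ICE mechanism: 30% of active-arm patients with X > 0 have the ICE.
r = (a == 1) & (x > 0) & (rng.random(n) < 0.3)

# Composite endpoint U: a value worse than any observed outcome for ICE patients.
worst = y.min() - 1.0
u = np.where(r, worst, y)

# Trim the worst 20% of U within each arm (Trim = True means trimmed out).
trim_fraction = 0.2
trim = np.zeros(n, dtype=bool)
for arm in (0, 1):
    in_arm = a == arm
    cutoff = np.quantile(u[in_arm], trim_fraction)
    trim[in_arm] = u[in_arm] <= cutoff

keep = ~trim
diff_before = x[a == 1].mean() - x[a == 0].mean()
diff_after = x[keep & (a == 1)].mean() - x[keep & (a == 0)].mean()
print("difference in mean X, all randomised patients:", round(diff_before, 3))
print("difference in mean X, patients with Trim=0:   ", round(diff_after, 3))
```

Before trimming, the difference in mean X between the arms should be close to zero, as randomisation guarantees. Among the patients with Trim=0 it is clearly non-zero, because trimming removes different types of patients in the two arms.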

An unrealistic but possibly useful example

Here’s a more concrete example to show the problem. The example is clearly not very likely to occur in practice, but it hopefully shows why the two groups being compared after trimming are not in general similar types of patients. Imagine that there are two types of patients: mild patients, making up 50% of the population, and severe patients, making up the rest. Under control treatment, the population distribution of outcomes is N(0,1), with the mild patients all taking values above 0 (better outcomes) and the severe patients all taking values below 0 (worse outcomes). Suppose under control treatment, no patients experience the ICE.

Distribution of outcomes under control treatment, according to the patient type (mild or severe)

Under active treatment, all the mild patients get the ICE, but none of the severe patients do. For the severe patients, who don’t get the ICE, their individual outcomes are equal to minus what their outcome would have been under control. As such, their distribution of outcomes under active is the positive half normal:

Distribution of outcomes under active treatment among those not experiencing the ICE, which here consists only of severe patients.

50% of patients, the mild ones, experience the ICE under active, so let’s choose to trim the worst 50% of values from each group. Since the mild patients have been assigned an arbitrarily bad value, they all get trimmed, and we are left with the positive half normal shown above in orange which consists entirely of severe patients. In the control group, the worst 50% are the severe patients, and so we are left with only mild patients. Hence when we contrast the trimmed means between treatment groups, we are comparing a group that wholly consists of mild patients with a group that wholly consists of severe patients. We have completely lost the benefit of randomisation in terms of making the patients in the two groups exchangeable.
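This example is easy to reproduce numerically. The sketch below is my own illustrative construction (the function names, sample size and seed are arbitrary): each arm is built with exactly half mild and half severe patients, so that trimming 50% is guaranteed to remove every patient with an ICE.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000          # patients per arm (arbitrary choice)
half = n // 2

def simulate_arm(active):
    """Return the composite endpoint U and the mild indicator for one arm."""
    mild = np.arange(n) < half                     # exactly 50% mild, 50% severe
    magnitude = np.abs(rng.normal(size=n))         # half-normal magnitudes
    y_control = np.where(mild, magnitude, -magnitude)  # N(0,1) overall under control
    if active:
        # Mild patients all have the ICE; severe patients get minus their
        # control outcome.
        u = np.where(mild, -np.inf, -y_control)
    else:
        u = y_control                              # no ICEs under control
    return u, mild

def keep_best_half(u):
    """Indices of the patients remaining after trimming the worst 50%."""
    return np.argsort(u)[len(u) // 2:]

u_control, mild_control = simulate_arm(active=False)
u_active, mild_active = simulate_arm(active=True)
kept_control = keep_best_half(u_control)
kept_active = keep_best_half(u_active)

print("proportion mild after trimming, control:", mild_control[kept_control].mean())
print("proportion mild after trimming, active: ", mild_active[kept_active].mean())
print("trimmed mean, control:", u_control[kept_control].mean())
print("trimmed mean, active: ", u_active[kept_active].mean())
```

After trimming, the control group consists only of mild patients and the active group only of severe patients. In this particular construction both remaining groups follow the same positive half normal distribution, so the two trimmed means are close to each other even though the patients being compared are entirely different types.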

The ICH E9 addendum on estimands has a useful line which I think speaks to the problem we have demonstrated above:

An estimand is a precise description of the treatment effect reflecting the clinical question posed by a given clinical trial objective. It summarises at a population level what the outcomes would be in the same patients under different treatment conditions being compared.
ICH E9 (R1) addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials

We have seen from the above examples that in general the trimmed means estimand fails this criterion. The trimmed mean from the control group tells us what the mean outcome would be in the best performing patients if we assigned the whole population to control. Analogously, the trimmed mean from the active group tells us what the mean outcome would be in the best performing patients if we assigned the whole population to receive active treatment. But the best performing patients in these two scenarios are not in general the same patients.

Trimmed means as an estimator of other estimands

One of the things that I find repeatedly comes up in discussions of estimands is that a given statistical method (estimator) is not necessarily linked uniquely to one estimand. The trimmed means estimator (rather than estimand) is one such case. Under certain assumptions, the trimmed means estimator can be shown to be a valid estimator of the full population average treatment effect, where here it is being used as a method to handle missing data caused by patients having an ICE.

This work was supported by a UK MRC grant MR/T023953/1.

4 thoughts on “Causal (in)validity of the trimmed means estimand”

Stephen Senn

July 17, 2022 at 7:26 am

Thanks for an interesting analysis and clear exposition. Over 40 years ago Larry Gould proposed an analysis for dealing with dropouts that led to a continuous ranking of all patients: they were ranked by when they dropped out and the ranking was then continued with completes using the intended clinical measure. See https://www.jstor.org/stable/2556126 Do you have any views on this?
- Jonathan Bartlett
  
  July 7, 2023 at 11:20 am
  
  A belated thank you Stephen. I was not aware of this paper. I am increasingly attracted to such an approach for handling what are now called intercurrent events in the analysis, and hope in the near future to work on effect estimation for the resulting composite endpoint.
Christian Bressen Pipper

July 25, 2022 at 8:15 am

Thanks for posting this! Another great example that you should never condition on colliders
Tom Permutt

October 6, 2022 at 2:32 am

Notice, however, that the control group you would like to compare the best treated outcomes to (the same subjects if they had had control, or similar subjects who had control) did no better than the group you actually compared them to (the best control outcomes). So, no, you didn’t estimate the causal effect for this subgroup, but you may have a very robust demonstration that it’s in the right direction. This is important in drug regulation.