In recent years, increasing attention has been focused on violations of the proportional hazards assumption made in Cox’s famous proportional hazards model. This has been driven in particular by trials of immuno-oncology treatments, where the mechanism of action may imply that any benefit of treatment would only emerge some time after treatment initiation. As a result, a large number of papers have been published recently examining alternative methods of analysis (to Cox’s model), which I have written about before.

Recently a paper was published in the journal Statistics in Biopharmaceutical Research proposing methods of analysis for time to event data in randomised trials when it is anticipated that the proportional hazards assumption may be violated. The paper proposed an approach for hypothesis testing, the ‘max combo’ method, which is based on a series of weighted log rank tests. Weighted log rank tests are a generalisation of the standard log rank test for censored time to event data which weight the different follow-up periods differentially, ideally giving highest weight to the period where one expects the largest treatment effect. Since one may not correctly anticipate when this will occur, the max combo test uses the weighted log rank test which is most statistically significant (among a small collection of weighted log rank tests), with an appropriate penalty incorporated for the fact that the particular weighted log rank test is selected precisely because it is the most significant. As well as hypothesis testing, effect estimation is obviously of crucial importance. In the presence of non-proportional hazards, the aforementioned paper advocated the use of period specific hazard ratios (estimating separate hazard ratios in distinct portions of follow-up), and also a weighted hazard ratio corresponding to the ‘winning’ weighted log rank test, in order to characterise how the *treatment effect changes over time*.
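
As an aside, the building block of the max combo test can be illustrated with the Fleming-Harrington family of weighted log rank tests, available via the rho argument of survdiff in the survival package. The following is a minimal sketch on hypothetical simulated data of my own invention (the max combo’s selection of the most significant statistic, with its multiplicity penalty based on the joint distribution of the statistics, is not shown here):

```
library(survival)
set.seed(1)
#illustrative data: exponential failure times, administrative censoring at t=2
n <- 200
trt <- rep(0:1, each=n/2)
t <- rexp(n, rate=ifelse(trt==1, 0.7, 1))
d <- 1*(t<2)
t[d==0] <- 2
#rho=0 gives the standard (unweighted) log rank test
survdiff(Surv(t,d)~trt, rho=0)
#rho=1 weights events by the Kaplan-Meier estimate S(t-),
#placing more weight on earlier follow-up
survdiff(Surv(t,d)~trt, rho=1)
```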

In response to this, I and a number of colleagues have questioned the proposal to use period specific and weighted hazard ratios to describe how treatment effects change over time. Our concerns are not based on new insights, but on points previously made by others, for example in Hernán’s ‘The Hazards of Hazard Ratios’ and Aalen et al’s ‘Does Cox Analysis of a Randomized Survival Study Yield a Causal Treatment Effect?’. The essence of the issue is as follows: in a randomised trial, all baseline factors (observed and unobserved) are balanced (in expectation) across treatment groups. The hazard ratio at a particular time during follow-up contrasts failure rates between treatment groups (possibly conditional on some measured baseline covariates) among those still at risk of failure. There will be baseline factors which influence the time to event but which are not measured. By randomisation these will be balanced (on average) across treatment groups. But as time moves on, generally the distribution of these factors will not remain balanced between the patients still at risk in the two treatment groups. As we will see below, this can cause changes in the hazard ratio between the two treatment groups over time, even in the absence of time-changing effects of treatment. As such, in line with others, we believe it is inadvisable in general to use period specific (or time weighted) hazard ratios to characterise how the treatment effect changes over time.

In our letter we describe a hypothetical oncology example to illustrate the idea, and here I’ll show some simulation results for this. Suppose that following surgery to remove cancer, patients are randomised 1:1 to a new active treatment versus a control. Suppose that the surgery cures 50% of the patients. Thus (in expectation) 50% of the control patients are cured and 50% of the active treatment patients are cured. Among the cured patients, as a result of being cured, in the time frame of the trial we assume the hazard of death is essentially zero. Among the non-cured patients, suppose the new treatment reduces the hazard from 1 to 0.5. Thus among the non-cured patients, the hazard ratio comparing the new treatment to control is 0.5, and this does not change over time, while in the cured patients, there is no effect of treatment (since they are cured). In the following R code I simulate data from such a trial with a very large sample size, with a follow-up time of 3 units of time.

```
set.seed(82623)
n <- 10000
cured <- 1*(runif(n)<0.5)
trt <- 1*(runif(n)<0.5)
haz <- rep(1,n)
haz[trt==1] <- haz[trt==1]*0.5
t <- rexp(n, rate=haz)
#for those cured, set failure time to something large
t[cured==1] <- 100
#censor at t=3
d <- 1*(t<3)
t[d==0] <- 3
#set up observed data, remembering cure status is not observed
dat <- data.frame(trt=trt, t=t, d=d)
```

Next we plot the Kaplan-Meier curves for the two treatment groups.

```
library(survival)
plot(survfit(Surv(t,d)~trt, data=dat), col=c("red","blue"))
```

The Kaplan-Meier curves do not obviously show that there is a problem with the proportional hazards assumption. Now let’s fit a Cox model to the data (assuming proportional hazards):

```
coxmod <- coxph(Surv(t,d)~trt, data=dat)
summary(coxmod)
Call:
coxph(formula = Surv(t, d) ~ trt, data = dat)

  n= 10000, number of events= 4363

        coef exp(coef) se(coef)      z Pr(>|z|)    
trt -0.34030   0.71156  0.03048 -11.16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    exp(coef) exp(-coef) lower .95 upper .95
trt    0.7116      1.405    0.6703    0.7554

Concordance= 0.547  (se = 0.004 )
Likelihood ratio test= 125.6  on 1 df,   p=<2e-16
Wald test            = 124.7  on 1 df,   p=<2e-16
Score (logrank) test = 125.9  on 1 df,   p=<2e-16
```

The estimated hazard ratio is 0.71, suggesting active treatment reduces hazard. If we plot a smoothed estimate of the hazard ratio over time using the cox.zph function we obtain:

`plot(cox.zph(coxmod, transform="identity"))`

The y-axis of the plot is the log HR, so that 0 on the y-axis corresponds to an HR of 1. It shows that initially the HR is less than 1, but that as time moves on, the HR moves above 1, apparently suggesting that the treatment is beneficial initially but then becomes harmful. Based on this, we might estimate piecewise HRs using the data before and after t=2, which can be done using the survSplit function:

```
splitDat <- survSplit(Surv(t,d)~trt, data=dat, cut=2, id="id", episode="timegroup")
splitCox <- coxph(Surv(t,d)~trt*strata(timegroup), data=splitDat)
summary(splitCox)
Call:
coxph(formula = Surv(t, d) ~ trt * strata(timegroup), data = splitDat)

  n= 16207, number of events= 4363

                                      coef exp(coef) se(coef)      z Pr(>|z|)    
trt                              -0.43161   0.64946  0.03291 -13.11  < 2e-16 ***
trt:strata(timegroup)timegroup=2  0.69449   2.00268  0.09199   7.55 4.36e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                                 exp(coef) exp(-coef) lower .95 upper .95
trt                                 0.6495     1.5397    0.6089    0.6927
trt:strata(timegroup)timegroup=2    2.0027     0.4993    1.6723    2.3983

Concordance= 0.553  (se = 0.004 )
Likelihood ratio test= 184.2  on 2 df,   p=<2e-16
Wald test            = 181.3  on 2 df,   p=<2e-16
Score (logrank) test = 184.1  on 2 df,   p=<2e-16

#estimated HR after t=2 is
exp(coef(splitCox)[1]+coef(splitCox)[2])
     trt 
1.300662 
```

In accordance with the previous plot, this suggests that before t=2, the HR is 0.65 (active treatment beneficial), whereas after t=2 the HR is estimated as 1.30 (active treatment harmful). Since we simulated the data, we know that in terms of causal effects of treatment this is not correct. For the 50% of patients cured by surgery (before treatment initiation), active treatment has no effect compared to control. For the other 50%, active treatment reduces the hazard by a factor of 0.5, and this is constant over time.
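
Because the data are simulated, we can check this directly. Using the latent cured indicator (which would not be observed in a real trial), a Cox model fitted among the non-cured patients alone should recover an HR close to the true value of 0.5:

```
#cured is latent, available here only because we simulated the data
noncuredCox <- coxph(Surv(t,d)~trt, data=dat[cured==0,])
summary(noncuredCox)
#the estimated HR, exp(coef(noncuredCox)), should be close to 0.5
```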

To further explore the issue, let’s now look at how the proportion of cured patients changes over time in the risk sets in the two treatment groups:

```
cureControl <- Vectorize(function(x) {
  mean(cured[(t>x) & (trt==0)])
})
cureActive <- Vectorize(function(x) {
  mean(cured[(t>x) & (trt==1)])
})
curve(cureControl, from=0, to=3, col="red",
      ylab="Proportion cured among those still at risk", xlab="Time")
curve(cureActive, from=0, to=3, add=TRUE, col="blue")
text(1.7, 0.9, "Control")
text(2, 0.7, "Active")
```

This plot makes the issue clear. At baseline (where all patients are still at risk of the event), each treatment group contains approximately 50% cured patients. As time moves on, this proportion increases in both treatment groups, since only the non-cured patients are susceptible to failure. However, crucially, this proportion increases more quickly in the control arm (because the active treatment reduces the hazard). This explains the earlier finding that in later stages of follow-up, the active treatment appears harmful – it is because among the survivors in the active arm a smaller proportion are cured patients than in the control arm. The hazard ratio based on the later part of follow-up is essentially confounded – the survivors in the two treatment groups are imbalanced with respect to cure status.

Because of the difficulties of (validly) attaching causal interpretations to period specific and weighted hazard ratios, we recommend in our letter well known alternatives for characterising the treatment effect, such as contrasts of survival probabilities at specified times, contrasts of specified quantiles (e.g., median survival), or differences/ratios of restricted mean survival time.
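
As an illustration of the last of these, the difference and ratio of restricted mean survival time up to tau=3 could be estimated with the rmst2 function from the survRM2 package. This is just a sketch of one of several available implementations, assuming the package is installed:

```
library(survRM2)
#RMST difference/ratio between arms with confidence intervals,
#restricting follow-up to tau=3
rmstFit <- rmst2(time=dat$t, status=dat$d, arm=dat$trt, tau=3)
rmstFit
```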

Thank you Jonathan for an excellent and informative post (as always)!

If I’m not mistaken, however, the HR after t=2 should be 0.6495*2.0027 (this doesn’t change the conclusions of the post, obviously).

Thank you for spotting this Andrea! I have updated the post accordingly.

Thank you. This makes a lot of sense (after re-reading a few times). I wish I knew about this earlier.

Kudos to the title of the letter as well 🙂