Testing equality of two survival distributions: log-rank/Cox versus RMST

Cox's proportional hazards model is by far the most common approach used to model survival or time-to-event data. For a simple two-group comparison, such as in a randomised controlled trial, the model says that the hazard of failure in one group is a constant multiple (over time) of the hazard of failure in the other group. A test that this hazard ratio (HR) equals 1 is a test of the null hypothesis of equality of the survival functions of the two groups. The log rank test is essentially equivalent to the score test of HR=1 in the Cox model, and is commonly used as the primary analysis hypothesis test in randomised trials.

Sometimes, however, the proportional hazards assumption may not hold, raising the question of how the survival functions of two groups ought to be modelled and compared. One thing to note is that the log rank test does not assume proportional hazards per se. It is a valid test of the null hypothesis of equality of the survival functions without any assumptions (save assumptions regarding censoring). It is, however, most powerful against alternative hypotheses in which the hazards are proportional.
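To make the log rank test concrete, here is a minimal sketch in plain Python of the usual two-group statistic: at each distinct event time, the observed number of events in one group is compared with the number expected under the null, using a hypergeometric variance. The function name `logrank_test` and its arguments are my own choices for illustration, and the p-value relies on a normal approximation; for real analyses an established survival package should be preferred.

```python
import math

def logrank_test(times, events, groups):
    """Two-group log rank test (illustrative sketch).

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if censored
    groups : 0 or 1 for each subject
    Returns (z, two-sided p-value) from a normal approximation.
    """
    data = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in data if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)               # at risk, both groups
        n1 = sum(1 for u, _, g in data if u >= t and g == 1)   # at risk, group 1
        d = sum(1 for u, e, _ in data if u == t and e == 1)    # events at time t
        d1 = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        obs_minus_exp += d1 - d * n1 / n                       # observed minus expected
        if n > 1:
            # hypergeometric variance of d1 given the margins
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    z = obs_minus_exp / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Made-up data: group 1 tends to fail later than group 0.
z, p = logrank_test([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1])
print(z, p)
```

Nothing in the calculation refers to a hazard ratio, which is consistent with the point that the test is valid without proportional hazards.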

Restricted mean survival time (RMST)
An alternative approach to modelling failure time data is to estimate the so-called restricted mean survival time (RMST). 'Restricted' refers to the fact that one estimates the mean of the smaller of the time to failure and a specified time t. The restriction is needed because, unless follow-up continues until every subject experiences the event of interest (or, in the presence of censoring, until the Kaplan-Meier estimator reaches zero), the overall mean failure time cannot be estimated, at least not without making parametric assumptions about the failure time distribution.
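In symbols (the notation here is mine, not taken from any particular paper): writing T for the failure time and S(u) = P(T > u) for its survival function, the RMST to time t is

```latex
\mathrm{RMST}(t) \;=\; E\left[\min(T, t)\right] \;=\; \int_0^{t} S(u)\, du .
```

The second equality is the standard identity for the mean of a non-negative random variable, applied to min(T, t), whose survival function equals S(u) for u < t and zero afterwards.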

In a trial setting, the RMST (for a given time t) can be estimated in each treatment group, and then they can be compared between treatment groups, either as a difference or ratio. The RMST can be estimated non-parametrically, by the area under the Kaplan-Meier curve up to time t, or based on parametric models. Royston and Parmar in particular have developed approaches for RMST based on flexible parametric models (see for example this paper).
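As a rough illustration of the non-parametric estimator, the sketch below computes the Kaplan-Meier curve for one group and sums the area under it up to the chosen time. The function name `km_rmst` and the argument `t_star` are hypothetical, and the code assumes `t_star` does not exceed the longest follow-up time; if the last observation before `t_star` is censored, the curve is simply carried forward at its last value.

```python
def km_rmst(times, events, t_star):
    """Area under the Kaplan-Meier curve from 0 to t_star (illustrative sketch).

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0        # current value of the Kaplan-Meier estimate
    area = 0.0
    last_time = 0.0
    i = 0
    while i < len(data) and data[i][0] <= t_star:
        t = data[i][0]
        d = sum(1 for u, e in data if u == t and e == 1)  # events at t
        c = sum(1 for u, e in data if u == t and e == 0)  # censorings at t
        area += surv * (t - last_time)   # rectangle up to this time
        if d > 0:
            surv *= 1 - d / n_at_risk    # Kaplan-Meier step
        n_at_risk -= d + c
        last_time = t
        i += d + c
    area += surv * (t_star - last_time)  # final rectangle up to t_star
    return area

# Four subjects failing at times 1, 2, 3 and 4, no censoring, RMST to time 3:
print(km_rmst([1, 2, 3, 4], [1, 1, 1, 1], 3))  # 1 + 0.75 + 0.5 = 2.25
```

In a two-arm trial one would call this once per treatment group with the same `t_star` and then take the difference or ratio.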

Unlike the hazard ratio, the RMST is well defined without requiring any assumptions. The RMST is also arguably attractive because of its interpretation: it is the average event-free time up to the specified time.

Hypothesis testing
Hypothesis tests can be constructed based on RMST, by testing that the difference in RMST between groups is zero. Since equality of the two groups' survival functions implies that their RMSTs are equal, this test is also a valid test of equality of survival functions. It is interesting to note that one could have a situation where the RMSTs (to a given time) are equal although the two groups' survival functions differ. In this case the RMST-based test would have essentially no power to detect the difference in survival functions.
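A minimal sketch of such a test is given below. It estimates each group's RMST as the area under its Kaplan-Meier curve, attaches a Greenwood-type plug-in variance (the form I understand to be commonly used for this estimator; I am not claiming it is exactly what any particular paper or package does), and forms a z statistic for the difference. The function names and the toy data are made up for illustration.

```python
import math

def km_rmst_with_var(times, events, t_star):
    """Kaplan-Meier RMST to t_star and a Greenwood-type variance (sketch)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    steps = []  # (event time, S just after it, variance increment)
    i = 0
    while i < len(data) and data[i][0] <= t_star:
        t = data[i][0]
        d = sum(1 for u, e in data if u == t and e == 1)
        c = sum(1 for u, e in data if u == t and e == 0)
        if d > 0:
            surv *= 1 - d / n_at_risk
            remaining = n_at_risk - d
            # increment set to 0 when nobody remains at risk (a convention)
            inc = d / (n_at_risk * remaining) if remaining > 0 else 0.0
            steps.append((t, surv, inc))
        n_at_risk -= d + c
        i += d + c
    # Walk backwards from t_star, accumulating the area to the right
    # of each event time; the variance weights each increment by that area squared.
    rmst_var = 0.0
    area_right = 0.0
    right_edge = t_star
    for t, s_after, inc in reversed(steps):
        area_right += s_after * (right_edge - t)
        rmst_var += inc * area_right ** 2
        right_edge = t
    rmst = area_right + right_edge  # S = 1 before the first event time
    return rmst, rmst_var

def rmst_difference_test(times1, events1, times0, events0, t_star):
    """z test of zero difference in RMST between two groups (sketch)."""
    r1, v1 = km_rmst_with_var(times1, events1, t_star)
    r0, v0 = km_rmst_with_var(times0, events0, t_star)
    se = math.sqrt(v1 + v0)
    if se == 0:
        raise ValueError("estimated standard error is zero")
    z = (r1 - r0) / se
    return r1 - r0, z, math.erfc(abs(z) / math.sqrt(2))

# Toy data: the two curves clearly differ, yet both have RMST to time 4 equal to 2.
diff, z, p = rmst_difference_test([1, 1, 3, 3], [1, 1, 1, 1],
                                  [2, 2, 2, 2], [1, 1, 1, 1], 4)
print(diff, z, p)
```

In the toy data, half of the first group fail at time 1 and half at time 3, while everyone in the second group fails at time 2. The estimated difference in RMST is exactly zero, so the test gives no evidence of a difference even though the survival curves are plainly not the same.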

Now we get to the reason for this post: Tian and colleagues have just published a paper, available in Early View at Biometrics, in which they compare the power (efficiency) of the log rank/Cox model test with that of the non-parametric RMST test. Analytical work and simulations are used to investigate the relative power of the two approaches in different scenarios. The key findings are broadly:

  • Crossing hazards: HR<1 at time 0 changing to HR>1 at time t - RMST can have substantially greater power than Cox/log-rank
  • HR converging to 1: HR<1 at time 0 and converges to 1 at time t - RMST has superior power compared to Cox/log-rank, but the difference is modest
  • HR decreasing from 1: HR=1 at time 0, decreasing to <1 at time t - RMST has lower power than Cox/log-rank

Their findings have important implications for whether a trial might consider choosing to use RMST as the primary hypothesis test rather than Cox/log rank when hazards are anticipated to be non-proportional. As an example, the authors note:

"HR-based test is more powerful for detecting the delayed treatment effect encountered in recent oncology trials testing the efficacy of immunotherapy"

Of course, while the statistical power of a hypothesis test is obviously important, interpretability of the corresponding effect measure is also important. In particular, when hazards are non-proportional (in a material way), interpreting the Cox model estimated HR is arguably problematic. For different views on this, see:
