The log rank test is often used to test the hypothesis of equality for the survival functions of two treatment groups in a randomised controlled trial. Alongside this, trials often estimate the hazard ratio (HR) comparing the hazards of failure in the two groups. Typically the HR is estimated by fitting Cox’s proportional hazards model, and a 95% confidence interval is used to indicate the precision of the estimated HR.

There are of course many different ways of constructing confidence intervals for parameter estimates. For estimates found by the method of maximum likelihood, we most often use so-called Wald intervals, which are formed by taking the estimated log HR plus and minus 1.96 standard errors and then exponentiating the limits. A drawback of the Wald interval is that it is possible for the log rank p-value to be statistically significant, but for the Wald 95% interval for the HR to include the null value of 1, leading to an apparently inconsistent result.
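As a minimal sketch of the Wald construction, the R code below fits a Cox model with the `survival` package and builds the 95% interval by hand from the estimated log HR and its standard error. The data frame `d` is made up purely for illustration (it is not from any trial), and the variable names are my own.

```r
library(survival)

# Made-up illustrative data (not from any trial): 12 patients, no tied times
d <- data.frame(
  time   = c(5, 8, 12, 14, 20, 21, 27, 30, 33, 38, 41, 45),
  status = c(1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1),  # 1 = event, 0 = censored
  group  = c(1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0)   # 1 = treatment, 0 = control
)

fit <- coxph(Surv(time, status) ~ group, data = d)

log_hr <- unname(coef(fit))        # estimated log hazard ratio
se     <- sqrt(vcov(fit)[1, 1])    # its standard error
z      <- qnorm(0.975)             # approximately 1.96

# Wald interval: log HR +/- z * SE, back-transformed to the HR scale
wald_ci <- exp(log_hr + c(-1, 1) * z * se)
c(HR = exp(log_hr), lower = wald_ci[1], upper = wald_ci[2])
```

The same limits should be returned by `exp(confint(fit))`, which uses the same normal approximation.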

One approach to avoid the possibility of this inconsistency is to form the CI based on the likelihood score test. When there are no tied failure times in the dataset, this approach gives a 95% CI for the HR which includes 1 if and only if the log rank test p-value is greater than 0.05. Unfortunately, this concordance no longer holds when there are ties.
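The link between the two tests can be checked numerically at the null value. The sketch below, again on made-up data with no tied failure times, compares the log rank chi-squared statistic from `survdiff` with the score test statistic reported by `summary()` of a `coxph` fit. It only illustrates the test of HR = 1, not the inversion of the score test into a confidence interval.

```r
library(survival)

# Made-up illustrative data (not from any trial): 12 patients, no tied times
d <- data.frame(
  time   = c(5, 8, 12, 14, 20, 21, 27, 30, 33, 38, 41, 45),
  status = c(1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1),  # 1 = event, 0 = censored
  group  = c(1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0)   # 1 = treatment, 0 = control
)

# Log rank test
lr <- survdiff(Surv(time, status) ~ group, data = d)

# Score test of HR = 1 from the Cox model
score <- summary(coxph(Surv(time, status) ~ group, data = d))$sctest

# With no ties the two chi-squared statistics should agree
c(logrank_chisq = lr$chisq, score_chisq = unname(score["test"]))
```

Introducing tied event times into `d` would be one way to see the two statistics drift apart.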

An alternative is to estimate the HR and form a 95% CI based on an approach proposed by Peto. The 95% CI for the HR formed using Peto’s method contains 1 if and only if the log rank test p-value is greater than 0.05, even when there are ties. Unfortunately, as shown in a recently published paper by Lin *et al* in Biometrics, Peto’s estimator for the HR is not consistent. Thus even in large samples, it is biased (although perhaps not much), and consequently the corresponding 95% CI does not have the correct coverage level.
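To make the concordance concrete, here is a sketch assuming the usual one-step form of Peto's estimator: log HR ≈ (O − E)/V with standard error 1/√V, where O − E and V are the log rank observed-minus-expected count and its variance. These are taken from a `survdiff` fit on made-up data; the formula is my assumption about what "Peto's method" refers to here, as the post does not spell it out.

```r
library(survival)

# Made-up illustrative data (not from any trial): 12 patients, no tied times
d <- data.frame(
  time   = c(5, 8, 12, 14, 20, 21, 27, 30, 33, 38, 41, 45),
  status = c(1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1),  # 1 = event, 0 = censored
  group  = c(1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0)   # 1 = treatment, 0 = control
)

lr <- survdiff(Surv(time, status) ~ group, data = d)

# Observed minus expected events and variance for group = 1 (second level)
o_minus_e <- lr$obs[2] - lr$exp[2]
v         <- lr$var[2, 2]

# Assumed one-step (Peto) estimate of the log HR and its 95% CI
peto_log_hr <- unname(o_minus_e / v)
peto_ci     <- exp(peto_log_hr + c(-1, 1) * qnorm(0.975) / sqrt(v))

# The CI contains 1 exactly when the log rank statistic is below the 5% critical value
ci_contains_1  <- peto_ci[1] < 1 && peto_ci[2] > 1
logrank_nonsig <- lr$chisq < qchisq(0.95, df = 1)
c(HR = exp(peto_log_hr), lower = peto_ci[1], upper = peto_ci[2])
```

Because the interval is built from the same O − E and V as the log rank statistic, `ci_contains_1` and `logrank_nonsig` agree by construction, which is the concordance described above.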

To address this, Lin *et al* propose a modification to the likelihood score test, and invert this modified score test to form a 95% CI for the HR. Their approach ensures consistency with the log rank test p-value, including in the case that stratification factors are included. Unlike the CI found from Peto’s method, their proposed CI has correct coverage, and compared to the Wald based CI, is generally narrower.

Lin *et al*’s approach requires a numerical method to find the CI limits, and they have made a SAS macro implementing their method available. Their approach seems attractive, and it will be interesting to see how quickly it is taken up in trial analyses.

Mehrotra and Roth (2011; Statistics in Biopharmaceutical Research) proposed a simple approach which guarantees inferential agreement between the logrank test p-value and the corresponding confidence interval for the hazard ratio. Their method also allows for ties and is easy to implement. For testing the usual null hypothesis (HR=1), their proposed generalized logrank (GLR) statistic is identical to Mantel’s logrank statistic with no ties (which is equal to the score test from the Cox PH model); with ties, their corresponding GLR_E statistic is a generalization of the statistic proposed by Efron (1977).

How is a clinician to interpret an HR when it is statistically significant but its CI includes the null value? E.g., a recent paper reported p = .03 but the 95% CI was 0.57 to 15.5 (Ann Transl Med, “Estimating tumor mutational burden across multiple cancer types using whole-exome sequencing”).

That shouldn’t happen. In the paper you mention, I think the result you are referring to is this: “Patients with hepatocellular carcinoma also showed a significant difference between two groups, with the TMB-high group demonstrating an apparent survival disadvantage (P=0.03, HR =2.97, 95% CI: 0.57–15.46)”. The HR of 2.97 is in the middle (on the log scale) between 0.57 and 15.46, so the HR and the 95% CI are consistent with each other here. Assuming they are correct, you can back-transform the CI limits to get the standard error of the log HR, and from this get a Wald p-value (R code included below). From this I get a 2-sided p-value of 0.20, i.e. not significant, consistent with the CI including 1. Thus I suspect the p=0.03 is incorrect. Indeed one of the groups has only 6 individuals in it, making it more plausible that the correct result is not statistically significant.

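    # Standard error of the log HR, recovered from the upper 95% CI limit
    se <- (log(15.46) - log(2.97)) / 1.96
    # Two-sided Wald p-value for HR = 1
    2 * pnorm(log(2.97) / se, lower.tail = FALSE)
    # = 0.1959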