Last week I listened to a great presentation about new trial designs by Mahesh Parmar, director of the Medical Research Council Clinical Trials Unit in London. Among the topics he touched on were multi-arm trials (and their extensions), which are an attractive alternative to the classic two-arm trial. There seem to be a number of advantages to such a design, in which, in the simplest case, patients are randomized either to control or to one of a number of experimental treatments.
The presentation led me to read up a little on the topic, in the course of which I came across a nice short piece recently published in the Lancet by Mahesh, my colleague James Carpenter, and Matthew Sydes. In it they advocate a shift to multiarm, phase 3 superiority trials. As they explain, there are a number of potential advantages to using a multiarm design. One of the stated advantages, which I'll focus on here, is the following:
Compellingly, increasing the number of research arms increases the probability within one trial of reliably showing that at least one new treatment is superior to control, even allowing for the inevitable correlation between comparisons
With some reasonable assumptions, they show that "the probability of at least one success increases rapidly as the number of groups increases", where a success is defined as finding a statistically significant treatment effect (compared to control) in at least one of the arms examined in the multiarm trial. The implication is thus that one of the benefits of a multi-arm trial is increased power to detect at least one statistically significant treatment effect.
However, intuitively one would expect that this increase in power for detecting the alternative hypothesis (that at least one treatment is superior to control) comes with an increased type 1 error rate: the probability that, if in truth all the treatments were equivalent to control, we would wrongly find at least one treatment arm to be (apparently) superior to control.
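To get a rough feel for the size of both the power gain and the type 1 error inflation, here is a minimal back-of-envelope sketch (my own illustration, not Parmar et al's calculation). It assumes the comparisons with control are independent, whereas in reality the shared control arm induces positive correlation between them, which dampens both quantities somewhat.

#with K experimental arms, if each comparison with control were independent, with
#marginal power p against its alternative and marginal type 1 error alpha under
#its null, then P(at least one significant comparison) = 1 - (1 - p)^K, or
#1 - (1 - alpha)^K under the global null
K <- 1:5
p <- 0.8
alpha <- 0.05
powerAtLeastOne <- 1 - (1 - p)^K
fwerAtLeastOne <- 1 - (1 - alpha)^K
round(cbind(K, powerAtLeastOne, fwerAtLeastOne), 3)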
A further problem I would expect is that readers of the results will, as I suspect is often the case in such multiarm trials, focus on the estimated effect of the apparently best treatment, and that this estimate is likely to be exaggerated relative to the true effect.
Should we worry about multiplicity here?
It turns out (unsurprisingly perhaps) that the question of whether multiplicity adjustments of some kind are appropriate in the setting of multiarm trials has been hotly debated; a good recent summary can be found in this paper published by Wason et al. As Wason et al explain, one view is that if the different hypotheses represent distinct research questions, then it is reasonable not to adjust for multiple comparisons. This seems entirely reasonable to me. However, once one starts to advertise, as an advantage of a multiarm trial, the higher probability of detecting at least one statistically significant treatment effect, it would seem that the hypotheses are no longer really being treated as distinct questions. Rather, as noted earlier, the implication is that one wants to 'find' evidence of a treatment effect for at least one treatment.
One argument made against the need for multiplicity adjustments in this context is that if separate two-arm trials were conducted for each of the treatments under consideration, no multiplicity adjustment would be made in practice. However, as Wason et al note, multiarm trial results are usually reported in a single paper, with the treatment effects discussed and compared relative to each other. Furthermore, even in the classic situation of separate two-arm trials, if I were presented with the results of 100 two-arm trials of different treatments, and say only two or three of these had found evidence of a benefit for their corresponding treatment, surely we should again be careful about multiplicity, since we know that each standard frequentist analysis has a type 1 error rate of 5%. Indeed, in this situation, arguably the appropriate approach to analysing the results of the 100 trials would be a Bayesian meta-analysis, which would, as has been discussed elsewhere by others (e.g. Andrew Gelman – http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf), potentially accommodate the multiple comparisons 'automatically'.
A small simulation study
To investigate this empirically, I've performed a very small simulation study (in R), based on the assumptions made by Parmar et al in coming up with their figures for the statistical power of multiarm trials. The simulation performs repeated 3-arm trials (two treatment arms and one control arm). In each, it is assumed that the true treatment effects are drawn from some 'population' of true treatment effects, with beneficial treatments corresponding to negative values. Each treatment arm is compared to the control arm in a conventional frequentist analysis, and we check whether each of these analyses gives statistically significant evidence of an effect at the conventional 5% level. We can then see, across simulated datasets, how the power to detect at least one significant effect is affected by having two treatment arms rather than one. Running the code (given at the bottom of the post), as we would expect, and in line with the analytical results of Parmar et al, the probability of detecting at least one statistically significant treatment effect is increased relative to the corresponding probability when just considering one of the treatment arms versus control. This matches one of the stated advantages of the multiarm design.
The second half of the code finds, in each simulation, the treatment whose effect appears to be the best (here, the one with the most negative treatment effect estimate). It then compares this estimate to the corresponding true treatment effect. Running the simulation, we see that the difference between the estimate and the true effect is on average negative: as expected, if in each multiarm trial we focus on the treatment which has the largest apparent effect, then this effect estimate is on average exaggerated compared to the true effect. Interestingly, the corresponding confidence interval coverage is still fine (at 95%). I need to think more about why coverage is maintained despite this conditional bias, but if anyone can shed light on it, please do so in a comment.
Next, if we re-run the simulation with
mu <- c(0,0)
tausq <- 0
at the top, this sets all true treatment effects to zero, which enables us to examine type 1 error. Doing so, we see that the type 1 error of each individual treatment effect test is 5%, as we would expect. However, we also see (as we should expect, given the earlier power results) that the probability of finding at least one statistically significant treatment effect is larger than 5%: an inflation of the familywise type 1 error rate.
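For this particular configuration, the familywise type 1 error rate can also be obtained analytically. The following is a small sketch of that calculation (my addition, using the mvtnorm package): under the global null, the two test statistics are bivariate normal with correlation 0.5 because of the shared control arm.

library(mvtnorm)
#P(at least one of the two |z| statistics exceeds 1.96) under the global null,
#allowing for the correlation of 0.5 induced by the common control arm
nullCorr <- matrix(c(1, 0.5, 0.5, 1), nrow=2)
fwer <- 1 - pmvnorm(lower=c(-1.96,-1.96), upper=c(1.96,1.96), mean=c(0,0), corr=nullCorr)
fwer
#somewhat below the 1 - 0.95^2 = 0.0975 we would get for two independent
#comparisons, but still well above 5%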
A Bayesian approach
Following the ideas in Andrew Gelman's paper linked to earlier, it would seem that a natural (and, in principle, easy) approach to handling the preceding issues is to perform a Bayesian analysis, in which we specify a prior for the 'population' distribution of true treatment effects. The posterior mean of each treatment effect is then shrunk towards the estimated population mean treatment effect, by an amount depending on the estimated variance of the true treatment effects and on the precision of each estimated effect. The idea that the effects being estimated in a given study can be thought of as being drawn from some population of possible effects is actually the basis for the calculations of Parmar et al. Of course, the difficulty in practice would be how to choose the prior for the parameters of the distribution of true effects.
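To make the shrinkage concrete, below is a minimal sketch of the normal-normal calculation, assuming (purely for illustration, and unrealistically) that the hyperparameters of the population of true effects are known and equal to the values used in the simulation; in a full Bayesian analysis these would themselves be given priors and estimated. The two 'estimates' are hypothetical numbers.

#population of true effects assumed N(muPop, tausqPop); each arm's estimate has
#known error variance sigmasq
muPop <- -1
tausqPop <- 2
sigmasq <- 1
rawEst <- c(-2.5, -0.3)   #hypothetical effect estimates for the two treatment arms
#posterior mean is a precision-weighted average of the raw estimate and muPop
postMean <- (rawEst/sigmasq + muPop/tausqPop)/(1/sigmasq + 1/tausqPop)
cbind(rawEst, postMean)
#the apparently best (most negative) estimate, -2.5, is shrunk towards muPop (to -2
#here), which works against the exaggeration seen when focusing on the best-looking arm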
Classical frequentist approaches for handling multiplicity generally (I believe) focus on adjusting p-values to control the familywise error rate. One drawback, it seems to me, is that these approaches operate only on the p-values (as far as I know) and make no adjustment to the treatment effect estimates themselves. Thus there would remain the problem that, if one focuses on the apparently most beneficial treatment, in expectation its estimated effect is larger than the true effect.
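For example, the standard adjustments available in base R via p.adjust illustrate this: the p-values change (and with them which comparisons are declared significant), but the effect estimates are untouched. The p-values below are hypothetical ones for a trial with four treatment arms versus control.

pvals <- c(0.012, 0.04, 0.20, 0.65)   #hypothetical unadjusted p-values
p.adjust(pvals, method="bonferroni")
p.adjust(pvals, method="holm")
#both methods control the familywise error rate, but neither says anything about
#how the corresponding effect estimates should be modified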
R code for simulation
Below is the R code used for the simulations referred to earlier. If anyone spots an error in it, please let me know in a comment.
library(MASS)

nSims <- 1000

#specify mean and variance of the population of true treatment effects
mu <- c(-1,-1)
tausq <- 2
#mu <- c(0,0)
#tausq <- 0

#specify correlation rho between the true treatment effects
rho <- 0
trueCov <- matrix(c(tausq, rho*tausq, rho*tausq, tausq), nrow=2)

#the error correlation is 0.5, due to the common control arm
errorRho <- 0.5
errorVariance <- 1
errorCov <- matrix(c(errorVariance, errorRho*errorVariance, errorRho*errorVariance, errorVariance), nrow=2)

sigResult <- array(0, dim=c(nSims,2))
trueEffects <- array(0, dim=c(nSims,2))
estEffects <- array(0, dim=c(nSims,2))

for (i in 1:nSims) {
  #generate true treatment effects
  trueEffects[i,] <- mvrnorm(n=1, mu=mu, Sigma=trueCov)
  #generate estimated effects by adding (correlated) estimation errors
  estEffects[i,] <- trueEffects[i,] + mvrnorm(n=1, mu=c(0,0), Sigma=errorCov)
  #z-tests of each treatment arm vs control
  testStat <- estEffects[i,]/(errorVariance^0.5)
  p_value <- 2*pnorm(abs(testStat), lower.tail=FALSE)
  sigResult[i,] <- 1*(p_value<0.05)
}

#distribution of the number of significant comparisons per trial
table(rowSums(sigResult))

#proportions of trials where each treatment group is found
#to be superior to control, according to statistical significance
colMeans(sigResult)

#proportion where at least one result is significant
mean(rowSums(sigResult)>0)
#as expected, this is increased compared to the single-comparison proportions
#this is the gain in power (for detecting at least one statistically significant
#treatment effect)

#find which of the two treatments in each trial has (apparently) the largest
#beneficial effect compared to control
#here this is equivalent to the effect estimate which is smallest, since the error
#variance is assumed the same for both treatment arms (vs control)
bestTrt <- array(0, dim=c(nSims,1))
bestTrtTrueEff <- array(0, dim=c(nSims,1))
bestEstEff <- array(0, dim=c(nSims,1))
bestEstMinusTrue <- array(0, dim=c(nSims,1))
ciCov <- array(0, dim=c(nSims,1))

for (i in 1:nSims) {
  bestTrt[i] <- which.min(estEffects[i,])
  #find corresponding true effect
  bestTrtTrueEff[i] <- trueEffects[i,bestTrt[i]]
  bestEstEff[i] <- estEffects[i,bestTrt[i]]
  bestEstMinusTrue[i] <- bestEstEff[i] - bestTrtTrueEff[i]
  #does the 95% confidence interval for the best-looking arm cover its true effect?
  ciCov[i] <- 1*(((bestEstEff[i]-1.96*errorVariance^0.5) < bestTrtTrueEff[i]) &
                 ((bestEstEff[i]+1.96*errorVariance^0.5) > bestTrtTrueEff[i]))
}

mean(bestEstEff-bestTrtTrueEff)
#this is what we would expect - treatment effect estimates are on average exaggerated
#relative to their true value, if we focus in each trial on the treatment which
#apparently has the largest effect
mean(ciCov)
James Wason (james.wason@mrc-bsu.cam.ac.uk) commented to me via email:
Great post Jonathan. I definitely think the Bayesian approach sounds like an excellent idea, especially when there are effect estimates for a good range of similar treatments available.
I think arguments about multiplicity etc should be secondary to us all encouraging the use of multi-arm trials in practice. Whether or not correction for multiple testing is used, they provide great benefits.
Someone recently asked me why I am vaguely in favour of correction for multiple testing in multi-arm trials. I gave the following made-up example: "Imagine you read a paper which compared a number of different drugs to placebo, and the best p-value was 0.04. Would you be more convinced that this drug was truly effective if the number of experimental arms tested was 2 or 8? If you genuinely don't think there is a difference in those two situations, then you shouldn't adjust for multiple testing!". I myself would definitely be more convinced if there were two experimental arms.
However we do have to think about how far we can go with this view. In my mind, trials which continually add in new treatments are an exciting way to do trials in the future. In that case it is impossible to stringently control the chance of making a type I error – eventually a false positive treatment will get through no matter how stringent you are. I wouldn’t like stringent multiple testing to be the only reason we cannot use such trial designs in confirmatory settings.
Thanks for the comment James. I entirely agree with everything you wrote.
On your last paragraph, in such a trial where treatments are added continually, even if (as you say) it is impossible to stringently control the familywise type 1 error rate, I would think it important to use statistical methods which give point estimates and measures of precision that take some account of the multiplicity. And if it really is a confirmatory setting, would one really be adding new treatments continually? This sounds more like earlier-stage trials?
Dear Jonathan, I enjoyed reading your post.
I have been interested for quite some time in the topic of multiple testing in clinical trials and its relation to Bayesian inference, or why the Bayesians "don't care". Since you drifted in this direction, and because something vaguely bothered me in the back of my head when reading the email response by Mr. Wason, I decided to share my thoughts for discussion. I hope I am not missing the point, as I have not yet read the full paper by Mr. Wason.
What untangled a knot for me was the exposition in "The P-Value Fallacy" by Goodman (1999), who explains that multiplicity and other issues are really a question of "deduction vs. induction". Type 1 and 2 error rates refer to Neyman and Pearson's method of describing the certainty of an inference in terms of the validity of a deductive behaviour in a long-run sequence within a particular sampling space (of, e.g., particular clinical trials). This kind of inference explicitly precludes the possibility of getting any information out of the data inductively. One cannot talk about the "evidence provided by the data" in support of some hypothesis WITHIN this long-run framework. The reason is that a single sample can legitimately be an element of several different long-run sampling spaces. Goodman cites a classical example of this, where the p-value differs when analyzing the same data for the same endpoint within two different "long-run sampling spaces": a seeming paradox that results from trying to use the two p-values in an inductive fashion.
When using Bayesian inference calculus you can play this game the other way round, for example in the context of sequential adaptive trials, where one can look at the data repeatedly without actually increasing the Neyman type 1 error because the inference is strictly inductive, obeys the likelihood principle and is in particular independent of the stopping rules for the trial. Berger also has some interesting thoughts on this in the context of conditional frequentist testing procedures.
With regard to the multi-arm multiplicity issue two things now occur to me:
1. Since the inference procedure is of the Neyman/Pearson deductive kind, I don't see the point in talking about an increase of power to "detect _at least one_ drug superior to control". The only reason this may appear appealing at first, to me, is that one is tempted to translate it into "more evidence" for the efficacy of a particular drug. But this it cannot provide, because that would be induction, and surely, as Mr. Wason also pointed out in a way, the evidence that the data provides for the superiority of a particular drug X cannot depend on whatever performance some other drug may have. Increasing the number of arms may thus increase Neyman/Pearson power, but each power statement is with respect to a different long-run sampling space and precludes by nature any talk about the treatment effect in light of the data for a particular drug. At least in terms of the error rates, I mean.
2. If one decides to enjoy the advantage of increased power in a multi-arm study, one necessarily also decides for a particular long-run sampling space in which then there is an inflated FWER of type 1, as you pointed out by your simulation. If you decide to report power for this sampling space but neglect to control FWER accordingly, you are cherry-picking, no? Either your interest is rather “inductive in nature” regarding a particular drug, or you accept the increased FWER in the long-run sampling space corresponding to your multi-arm study with increased power.
So my point would be that perhaps both the appeal and the confusion with this topic may stem from a lack of distinction between inductive vs. deductive inference?
That would be my humble opinion, at least, what are your thoughts on this?
johannes
Thanks Johannes for your thoughtful comments! I had not read the Goodman paper before either.
Regarding your comments, a few thoughts. You wrote “When using Bayesian inference calculus you can play this game the other way round, for example in the context of sequential adaptive trials, where one can look at the data repeatedly without actually increasing the Neyman type 1 error because the inference is strictly inductive” – are you sure it is correct that the type 1 error would not be increased here? I appreciate that within the Bayesian paradigm one does not adjust for taking multiple looks. But if one evaluated the long run type 1 error of a (Bayesian) procedure that made multiple looks and had some rule for deciding when to declare that there was an effect, surely the type 1 error would increase as one increased the number of looks? Isn’t the point that the Bayesian is not interested in long run frequentist properties?
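To make the question concrete, here is a quick sketch (my own illustration) of the kind of long-run evaluation I have in mind. It assumes a flat prior on a normal mean with known unit variance, and a simple rule that declares an effect whenever the posterior probability that the mean is positive exceeds 0.975, checked at a number of interim looks.

set.seed(6231)
nSims <- 10000
looks <- seq(20, 200, by=20)   #cumulative sample sizes at the interim looks
declared <- rep(0, nSims)
for (i in 1:nSims) {
  #data generated under the null: true mean zero
  y <- rnorm(max(looks), mean=0, sd=1)
  for (n in looks) {
    #with a flat prior and known unit variance, the posterior for the mean given the
    #first n observations is N(mean(y[1:n]), 1/n), so
    #P(mean > 0 | data) = pnorm(mean(y[1:n])*sqrt(n))
    if (pnorm(mean(y[1:n])*sqrt(n)) > 0.975) {
      declared[i] <- 1
      break
    }
  }
}
#long-run frequentist type 1 error of this procedure; it exceeds 0.025 and grows
#with the number of looks
mean(declared)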
With regard to your point 1), I think as you say the long run sampling space is indeed different for the multi-arm design as opposed to two arm designs. The point of the Mahesh et al work I think however is to say that the power to declare at least one treatment effect ‘significant’ is increased by using the multi-arm design. This is appealing at least if the objective of the trial is confined to finding a statistically significant p-value for at least one active arm compared to control. As I think your comments speak to, a second and crucial question is what one is able to infer about the true treatment effects from such a design. But perhaps I haven’t fully understood your point here?
Regarding your point 2), I entirely agree – if one starts measuring the performance of a design in terms of frequentist power, it seems to me one must correspondingly be concerned with controlling type 1 error rates.
Having read the Goodman piece, I have two (probably ill-informed!) thoughts. Goodman says that Fisher introduced the p-value as an informal index of discrepancy between the data and the null hypothesis, but not as anything more formal. He then, I think, argues that the p-value is deductive, not inductive. Although the p-value's construction is deductive, I think people do then make the next jump and use it inductively: if the p-value is small, this is evidence against this particular null hypothesis being true.
Goodman's description of the p-value fallacy seems to come down to the illustration that one can obtain two different p-values for the same set of data, which could have arisen from two different study designs. The argument seems to be based on the assumption that the measure of evidence ought to be the same in both cases, and the fact that the p-values are not shows that the p-value is not a valid measure of evidence in the "short run". I know that this has been argued and discussed a lot over the decades (the strong likelihood principle), but intuitively it is not immediately obvious to me that the evidence should be the same in the two designs; it does not seem unreasonable that the amount of evidence might differ, for the same set of data, according to what design was used to obtain the data (the two different sampling schemes).
A reference I have found very useful for discussion of this is Chapter 7 of Pawitan’s book “In All Likelihood”.
Hey Jonathan.
Yes, apologies, I had not presented my thoughts in a very tidy manner.
With regard to the Bayesian sequential designs, I must admit that I have not yet implemented any simulations myself, although it is on my list. You are certainly right that one can look at the frequentist properties of Bayesian inference. With regard to the reproducibility (as in the definition of 'science') of results from particular procedures, I currently think that this has a meaningful purpose in the right context. Although a Bayesian may not have this as a primary concern, Jaynes demonstrates rather impressively, I find, in his book (Logic of Science, chapter 9) that, if there is a meaningful relationship to be had, probability theory should allow you to make the transition from (Bayesian) probabilities to actual corresponding frequencies/prevalences in a sensible way. In Bayesian sequential designs, your inference is truly inductive and respects the likelihood principle, since it is based on posterior probabilities. As Berger argues in some of his talks, the latter do not depend on the stopping rules of the sequential design, and therefore "data peeking" does not "spend frequentist alpha". A great paper demonstrating this, I think, is "Bayesian Adaptive Designs for Clinical Trials" by Cheng & Shen. They take a straightforward decision-theoretic approach and can also define a type 1 error under the likelihood principle, which allows them to determine upper bounds for, e.g., the type I error of the design independently of the stopping rules. This insight allows them to calibrate the loss function according to the frequentist operating characteristics. They go on to show in several simulations that the actual frequencies of type I and II errors then really do stay within the predicted boundaries. At the same time, however, their design has no fixed sample size, and although the full procedure respects the specified long-run average error rates, as calibrated through the loss function, the number of "data peeks" is variable as far as I understand. Intuitively, the design keeps on recruiting until the data provides "sufficient" information.
For me, this is also exactly the point with the Royall/Berry/Goodman example of the two differing p-values: although people try to give it inductive meaning as a measure of evidence for (Berkson 1942) or against H0 (Fisher…), it is the result of a deductive procedure and it does _not_ reflect the information provided by the data in a way that respects the likelihood principle. And that is also why Neyman and Pearson developed their theory, I guess. In that theory such a p-value is only meaningful in terms of whether or not it is below the corresponding type I error threshold; they clearly state that you give up the option of looking at "evidence supplied by the particular data set" within that scope, as Goodman also argues.
The reason why I had brought this up was because of Mr. Wason’s thought experiment: “If you genuinely don’t think there is a difference in those two situations, then you shouldn’t adjust for multiple testing!” If you genuinely don’t think there is a difference in those two situations, then you must be reasoning inductively. The _evidence provided by the data for a particular parameter or hypothesis_ cannot logically depend on other stuff that you also do around this. Therefore then, as the likelihood principle suggests, your inference shouldn’t either. When talking about power or error-rates in Neyman-Pearson theory, however, you are reasoning deductively. And that’s why I thought there can’t be any confusion as to whether or not one should control type I FWER in the case of the frequentist multi-arm trial designs.
Either way, I think this way of looking at the problem of multiple testing (deductive vs. inductive inference) is very illuminating. For me it was, at least.
Btw, with regard to the Gelman paper you might also find “A Bayesian perspective on the Bonferroni adjustment” by Westfall (1997) interesting, if you don’t know it already.
Thanks Johannes for your additional thoughts. I think I understand your points now. I will take a look at the Westfall and Cheng & Shen papers! Best wishes, Jonathan.