Matching analysis to design: stratified randomization in trials

Yesterday I was re-reading the recent articles by Brennan Kahan and Tim Morris on how to analyse trials that use stratified randomization. Stratified randomization is commonly used in trials, and involves randomizing in such a way that treatment assignments are balanced within strata defined by chosen baseline covariates.

The intuitive rationale for such an approach to randomization can be viewed as follows. Suppose the trial will enroll men and women, and it is known that men on average have worse outcomes than women. Suppose we are going to conduct a very small trial, with just four patients, and that of the four patients recruited, two are men and two are women. First let’s imagine that we randomize the two treatments (A & B) to the four patients using simple randomization. By chance, it could then be that the two men are randomized to A, and the two women to B. If we then analyse the trial, ignoring gender, and compare the two treatment groups, our intuition tells us that we are perhaps not obtaining a good estimate of the effect of treatment, since both patients randomized to A were men and both randomized to B were women, and we know gender is predictive of outcome. What do I mean by “good estimate”? If we were to repeat this trial many times, the estimates from a simple unadjusted analysis would be unbiased – on average across repetitions they would equal the true effect. But in some repetitions our estimate may be far from the true effect because of occurrences such as the one just described.

One approach to try to obtain a treatment effect estimate closer to the truth would be to adjust for gender in the analysis. In the example situation above, where both men are randomized to A and both women to B, we cannot adjust, since gender and treatment are collinear, or ‘aliased’ – we cannot from the data distinguish between the effects of treatment and gender. Of course in a larger study such an occurrence would be highly unlikely, and so usually it would be possible to use a regression model to adjust for gender when estimating the treatment effect. When there is just a single baseline covariate, such a regression model can be thought of as comparing outcomes between the two treatment groups separately in strata defined by the baseline covariate, and then pooling these treatment effect estimates under an assumption of no interaction, weighting each by its precision (the inverse of its squared standard error).
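As a small numerical sketch of this pooling idea (the stratum-specific estimates and standard errors below are entirely made up for illustration):

```python
import numpy as np

# Hypothetical stratum-specific treatment effect estimates (A minus B)
# and their standard errors, e.g. from separate comparisons in men and women.
estimates = np.array([1.8, 2.4])
ses = np.array([0.9, 0.6])

# Inverse-variance weighting: more precise strata receive more weight.
weights = 1 / ses**2
pooled = np.sum(weights * estimates) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

print(f"pooled estimate = {pooled:.3f}, SE = {pooled_se:.3f}")
```

This is exactly the fixed-effect pooling that a regression model with treatment and the covariate approximately performs when the no-interaction assumption holds.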

Now let’s suppose that we redo the randomization, stratified on gender. In this case, it is guaranteed that among the two men enrolled, one will be assigned to treatment A and one to treatment B, and similarly among the women. By doing this, we exclude the possibility described previously that can occur under simple randomization. By stratifying the randomization on gender, we ensure that the treatment groups are balanced with respect to the gender distribution. Simple randomization guarantees this in expectation, but not in any given sample. By ensuring balance in each sample, stratified randomization enables us to obtain a more precise estimate of the effect of treatment.

However, there is a catch, as highlighted by the work of Kahan and Morris. The use of stratified randomization induces a dependence in the data between patients. As their BMJ article graphically illustrates, the treatment-specific means are positively correlated across repeated trials. The consequence of this non-independence is that if one analyses the trial ignoring the factors used in the stratified randomization, the standard error estimate is larger than it should be. That is, if one uses stratified randomization to ensure balance between arms with respect to the baseline variables used in the randomization, but ignores these variables in the analysis, the benefit in terms of improved precision is not realised in the calculated standard error. The effect is that power is lower than it needs to be, and the type 1 error rate is controlled at a lower level than intended.
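A quick simulation can illustrate this. The following is a sketch under assumed parameters (a trial of 50 men and 50 women, men’s outcomes 2 units lower on average, and a true treatment effect of zero), comparing the usual unadjusted standard error with the actual sampling variability of the estimate:

```python
import numpy as np

rng = np.random.default_rng(2024)
n_per_sex, n_sims = 50, 2000   # 50 men and 50 women per trial (assumed)
half = n_per_sex // 2

sex = np.repeat([0, 1], n_per_sex)   # 0 = female, 1 = male
ests, model_ses = [], []
for _ in range(n_sims):
    # Stratified randomization: exactly half of each sex to each arm
    trt = np.concatenate([rng.permutation(np.repeat([0, 1], half)),
                          rng.permutation(np.repeat([0, 1], half))])
    y = -2.0 * sex + rng.normal(size=2 * n_per_sex)   # true treatment effect = 0
    diff = y[trt == 1].mean() - y[trt == 0].mean()
    # Usual unadjusted two-sample standard error, ignoring sex
    se = np.sqrt(y[trt == 1].var(ddof=1) / n_per_sex +
                 y[trt == 0].var(ddof=1) / n_per_sex)
    ests.append(diff)
    model_ses.append(se)

# The model-based SE overstates the true sampling variability
print("empirical SD of estimates:", round(np.std(ests), 3))
print("average model-based SE:  ", round(np.mean(model_ses), 3))
```

Under these assumed parameters, the average model-based standard error is noticeably larger than the empirical standard deviation of the estimates across simulated trials, which is precisely the conservatism described above.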

To understand the positive correlation between the treatment group means, I found the following logic helpful (although maybe this is just obvious to everyone other than me!). Suppose as before that men tend to have worse (say lower) outcomes than women, and consider the scatter plot in Kahan & Morris’ paper showing the treatment A mean vs the treatment B mean in repeated hypothetical trials from some population. Suppose that you are told that the treatment A mean is higher than average (the average across repeated trials) in a given trial. This suggests that treatment group A had more women than in the average trial conducted in this population. Then, since you know that stratified randomization was used, if there were more women in group A than on average, there must also be more women in group B in this trial than on average, and so we would also expect the group B mean to be higher than average. We thus have positive correlation between the group means across repeated trials conducted in a particular population.
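This logic can be checked by simulation. A sketch, under assumed parameters: women’s outcomes are 2 units higher on average, the sex mix varies from trial to trial (each patient is female with probability one half), and randomization is stratified by sex:

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_sims = 80, 4000   # 80 patients per simulated trial (assumed)

def stratified_assign(m, rng):
    """Assign m patients half to arm A (1) and half to arm B (0),
    with the odd one out (if any) assigned at random."""
    a = m // 2 + int(m % 2 and rng.random() < 0.5)
    return rng.permutation(np.array([1] * a + [0] * (m - a), dtype=int))

means_a, means_b = [], []
for _ in range(n_sims):
    sex = rng.integers(0, 2, size=n)   # 1 = female; mix varies by trial
    trt = np.empty(n, dtype=int)
    for s in (0, 1):                   # randomize within each sex stratum
        idx = np.where(sex == s)[0]
        trt[idx] = stratified_assign(len(idx), rng)
    y = 2.0 * sex + rng.normal(size=n)   # women score higher (assumed)
    means_a.append(y[trt == 1].mean())
    means_b.append(y[trt == 0].mean())

print("corr(mean_A, mean_B) =", round(np.corrcoef(means_a, means_b)[0, 1], 3))
```

Because stratification forces the two arms to share each trial’s sex mix, a trial with unusually many women pushes both group means up together, and the correlation across simulated trials comes out clearly positive.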

One solution when the trial has used stratified randomization is to analyse it using a regression approach, adjusting for the baseline variables used in the randomization as covariates. By doing this, we break the dependency in the data – patients’ outcomes are independent conditional on treatment assignment and the baseline variables used in the randomization. The net result is a valid standard error estimate, and a gain in power.
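A simulation sketch of the adjusted analysis, under the same kind of assumed setup as before (50 men and 50 women, men’s outcomes 2 units lower, true treatment effect zero): here ordinary least squares of outcome on treatment and sex is computed by hand with numpy, and its model-based standard error is compared with the estimator’s actual sampling variability.

```python
import numpy as np

rng = np.random.default_rng(123)
n_per_sex, n_sims = 50, 2000
half = n_per_sex // 2

sex = np.repeat([0, 1], n_per_sex)   # 0 = female, 1 = male
ests, model_ses = [], []
for _ in range(n_sims):
    # Stratified randomization: exactly half of each sex to each arm
    trt = np.concatenate([rng.permutation(np.repeat([0, 1], half)),
                          rng.permutation(np.repeat([0, 1], half))])
    y = -2.0 * sex + rng.normal(size=2 * n_per_sex)   # true treatment effect = 0
    # OLS of outcome on intercept, treatment, and sex
    X = np.column_stack([np.ones(2 * n_per_sex), trt, sex])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    ests.append(beta[1])                              # treatment coefficient
    model_ses.append(np.sqrt(s2 * XtX_inv[1, 1]))

# With adjustment, the model-based SE agrees with the sampling variability
print("empirical SD of estimates:", round(np.std(ests), 3))
print("average model-based SE:  ", round(np.mean(model_ses), 3))
```

In contrast to the unadjusted analysis, the average model-based standard error here closely matches the empirical standard deviation of the estimates, so the precision gained by stratifying the randomization is actually reflected in the reported uncertainty.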

According to the literature review performed by Kahan and Morris, just 26% of the trials in their review that used some form of stratified randomization accounted for all of the stratification factors in the analysis. They conclude that if trials use such randomization schemes, it is important that the subsequent analysis adequately accounts for the stratification factors.

7 thoughts on “Matching analysis to design: stratified randomization in trials”

  1. The assumption is that the stratification variable is correlated with outcome. Also, the treatment effect estimate is unbiased even without adjusting for the stratification variable, only less efficient.

  2. Is it also necessary to control for the stratification in the power calculation to determine sample size? Or is this already accounted for in the estimates input into the calculation?

    And if the study is powered to allow subgroup analysis by the stratification factor, should the power calculation be conducted individually for each stratum using stratum-specific effect size estimates?

    Thank you!

    • So the power calculation would account for it provided you use the variance of the outcome conditional on treatment & stratification factors, i.e. the variance of the outcome within levels of the variables used in the stratification. If the stratification factors are prognostic for outcome, this variance is lower than the variance if you don’t stratify/condition on these, and this will therefore reduce your required sample size commensurately.

      To your second question, if you want to have good power for detecting treatment effects separately in a number of subgroups, you could do the power calculation for each subgroup. Leaving aside multiplicity issues, this would tell you how many patients you need in each stratum. To determine the total number of patients in the trial, you would then need estimates of the population proportions in each stratum, so as to determine the total n that will give you at least the required sample size for each stratum-specific analysis.
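As a rough sketch of this two-step calculation (normal approximation for a two-sample comparison of means; the effect sizes, within-stratum SDs, and population proportions below are made up for illustration):

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf

def n_per_arm(delta, sigma, alpha=0.025, power=0.9):
    """Normal-approximation sample size per arm for a two-sample
    comparison of means, using the within-stratum SD sigma
    (one-sided alpha, as in the example above)."""
    return math.ceil(2 * (z(1 - alpha) + z(power))**2 * sigma**2 / delta**2)

# Hypothetical strata: smaller effect in the younger stratum
n_young = 2 * n_per_arm(delta=0.8, sigma=2.0)   # total n needed, younger stratum
n_old   = 2 * n_per_arm(delta=1.2, sigma=2.0)   # total n needed, older stratum
p_young, p_old = 0.4, 0.6                       # assumed population proportions

# Total n so that expected recruitment meets each stratum's requirement
n_total = math.ceil(max(n_young / p_young, n_old / p_old))
print(n_young, n_old, n_total)
```

The binding constraint is whichever stratum needs the most patients relative to its share of the population, which is why the total can be much larger than the sum of the stratum-specific requirements.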

      • This is a really helpful reply – thank you so much.

        Just as a quick clarification question on the second part, regarding the population proportions. Say for example you stratify on age (above or below a certain threshold), where the observed effect is lower in the younger population. In individual calculations the sample needed for the younger group is n=150 and for the older is n=80 (alpha = 0.025). In the population, the younger group accounts for 30% of cases and the older 70%. Is it correct then that the total sample size would not just be 150+80 – the population proportions would need to be taken into account (i.e., 150 (30%) + 350 (70%))? It is possible I have misunderstood this.

        Thanks again for your time.

