Yesterday I was re-reading the nice recent articles by Brennan Kahan and Tim Morris on how to analyse trials that use stratified randomization. Stratified randomization is commonly used in trials, and involves constraining the randomization so that treatments are assigned in a balanced way within strata defined by chosen baseline covariates.

The intuitive rationale for such an approach to randomization can be viewed as follows. Suppose the trial will enroll men and women, and it is known that men on average have worse outcomes than women. Suppose we are going to conduct a very small trial, with just four patients, and that of the four patients recruited, two are men and two are women. First let's imagine that we randomize the two treatments (A & B) to the four patients using simple randomization. By chance, it could then be that the two men are randomized to A, and the two women to B. If we then analyse the trial, ignoring gender, and compare the two treatment groups, our intuition tells us that we are perhaps not obtaining a good estimate of the effect of treatment, since both patients randomized to A were men and both randomized to B were women, and we know gender is predictive of outcome. What do I mean by "good estimate"? If we were to repeat this trial many times, the estimates from a simple unadjusted analysis would be unbiased - on average they would equal the true effect. But in some repetitions our estimate may be far from the true effect because of occurrences such as the one just described.
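To make this concrete, here is a small simulation (purely an illustrative sketch of the four-patient example, with invented labels) of how often simple randomization completely confounds treatment with gender:

```python
import random

random.seed(1)  # for reproducibility of this illustration

def simple_randomize(n):
    """Simple randomization: each patient gets A or B by an independent coin flip."""
    return [random.choice("AB") for _ in range(n)]

# Four patients: indices 0 and 1 are men, 2 and 3 are women.
n_sims = 100_000
aliased = 0  # trials where both men get one arm and both women get the other
for _ in range(n_sims):
    arms = simple_randomize(4)
    if arms[0] == arms[1] and arms[2] == arms[3] and arms[0] != arms[2]:
        aliased += 1

print(f"Proportion of completely confounded trials: {aliased / n_sims:.3f}")
```

The theoretical probability is 2 x (1/2)^4 = 1/8 = 0.125, so roughly one in eight such four-patient trials would leave treatment and gender aliased.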

One approach to obtaining a treatment effect estimate closer to the truth would be to adjust for gender in the analysis. In the example situation above, where both men are randomized to A and both women are randomized to B, we cannot adjust, since gender and treatment are co-linear or 'aliased' - we cannot distinguish from the data between the effects of treatment and gender. Of course in a larger study such an occurrence would be highly unlikely, and so usually it would be possible to use a regression model to adjust for gender when estimating the treatment effect. When there is just a single baseline covariate, such a regression model can be thought of as comparing outcomes between the two treatment groups separately in strata defined by the baseline covariate, and then pooling these treatment effect estimates, under an assumption of no interaction, weighting each by the inverse of its variance.
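Here is a sketch of that stratify-then-pool analysis on simulated data (the trial size, effect sizes and noise level are all invented for illustration):

```python
import random
import statistics

random.seed(2)

# Hypothetical model: outcome = treatment effect + gender effect + noise.
true_effect = 1.0

def outcome(arm, is_man):
    return (true_effect if arm == "A" else 0.0) - (2.0 if is_man else 0.0) + random.gauss(0, 1)

# A moderately sized trial with simple randomization and random genders.
patients = [(random.choice("AB"), random.random() < 0.5) for _ in range(400)]
data = [(arm, man, outcome(arm, man)) for arm, man in patients]

# Stratum-specific treatment effect estimates and their variances.
estimates = []
for man in (True, False):
    ya = [y for arm, m, y in data if m == man and arm == "A"]
    yb = [y for arm, m, y in data if m == man and arm == "B"]
    diff = statistics.mean(ya) - statistics.mean(yb)
    var = statistics.variance(ya) / len(ya) + statistics.variance(yb) / len(yb)
    estimates.append((diff, var))

# Pool across strata with inverse-variance weights (no-interaction assumption).
weights = [1 / v for _, v in estimates]
pooled = sum(w * d for (d, _), w in zip(estimates, weights)) / sum(weights)
print(f"Pooled adjusted estimate: {pooled:.2f}")
```

The pooled estimate lands close to the true effect of 1, and is essentially what a linear regression of outcome on treatment and gender would give.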

Now let's suppose that we redo the randomization stratified on gender. In this case, it is guaranteed that among the two men enrolled, one will be assigned to treatment A and one to treatment B, and similarly for the women. By doing this, we exclude the possibility described previously that can occur under simple randomization. By stratifying the randomization on gender, we ensure that the treatment groups are balanced in respect of gender distribution in every sample - simple randomization guarantees this only in expectation. By ensuring balance in each sample, stratified randomization enables us to obtain a more precise estimate of the effect of treatment.
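A minimal sketch of how such a stratified randomization could be implemented (the function name and interface are my own invention, and I assume even stratum sizes for simplicity):

```python
import random

random.seed(3)

def stratified_randomize(strata_sizes):
    """Randomize within each stratum so that arms A and B are exactly balanced.

    strata_sizes: dict mapping stratum label -> number of patients
    (assumed even here). Returns a dict mapping stratum -> list of arms.
    """
    assignments = {}
    for stratum, n in strata_sizes.items():
        arms = ["A", "B"] * (n // 2)  # equal numbers of A and B
        random.shuffle(arms)          # random order within the stratum
        assignments[stratum] = arms
    return assignments

alloc = stratified_randomize({"men": 2, "women": 2})
print(alloc)  # each stratum gets exactly one A and one B
```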

However, there is a catch, as highlighted by the work of Kahan and Morris. The use of stratified randomization induces a dependence in the data between patients. As their BMJ article graphically illustrates, the treatment specific means are positively correlated when considering running repeated trials. It turns out that the consequence of this non-independence is that if one analyses the trial ignoring the factors used in the stratified randomization, the standard error estimate is larger than it should be. That is, if one uses stratified randomization in order to ensure balance between arms in respect of the baseline variables used in the randomization, and one ignores these baseline variables in the analysis, the benefit in terms of improved precision is not realised in the calculated standard error. The effect of this is that power is lower than it needs to be, and the type 1 error is controlled at a lower level than intended.

To understand the positive correlation between the treatment group means, I found the following logic helpful (although maybe this is just obvious to everyone other than me!). Suppose as before that men tend to have worse (say lower) outcomes than women, and consider the scatter graph in Kahan & Morris' paper showing the treatment A mean vs treatment B mean in repeated hypothetical trials from some population. Suppose that you are told that the treatment A mean is higher than average (the average across repeated trials) in a given trial. This suggests that treatment group A had more women than the average trial conducted in this population. Then since you know that stratified randomization was used, if there were more women in group A than on average, there must also be more women in group B in this trial than on average, and so we would also expect the group B mean to be higher than average. We thus have positive correlation between the group means in repeated trials conducted from within a particular population.

One solution when the trial has used stratified randomization is to analyse the trial using a regression approach, adjusting for the baseline variables used in the randomization as covariates. By doing this, we break the dependency in the data - patients' outcomes are independent, conditional on treatment assignment and the baseline variables used in the randomization. The net result is a valid standard error estimate, and a gain in power.
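To see the problem the adjustment fixes, we can compare the usual unadjusted standard error with how much the estimate actually varies across repeated stratified trials (again a simulation with invented numbers and no true treatment effect):

```python
import random
import statistics

random.seed(5)

def one_stratified_trial(n=40):
    """One trial: half men, half women, stratified randomization, no treatment effect."""
    genders = [i < n // 2 for i in range(n)]  # first half men (True), rest women
    arms = [None] * n
    for g in (True, False):
        idx = [i for i in range(n) if genders[i] == g]
        random.shuffle(idx)
        for k, i in enumerate(idx):
            arms[i] = "AB"[k % 2]  # exactly balanced within the stratum
    ys = [(-2.0 if g else 0.0) + random.gauss(0, 1) for g in genders]
    ya = [y for y, a in zip(ys, arms) if a == "A"]
    yb = [y for y, a in zip(ys, arms) if a == "B"]
    est = statistics.mean(ya) - statistics.mean(yb)
    # Usual two-sample standard error, ignoring gender.
    se = (statistics.variance(ya) / len(ya) + statistics.variance(yb) / len(yb)) ** 0.5
    return est, se

results = [one_stratified_trial() for _ in range(3000)]
empirical_sd = statistics.stdev([e for e, _ in results])
avg_model_se = statistics.mean([s for _, s in results])
print(f"empirical SD of the estimates: {empirical_sd:.3f}")
print(f"average unadjusted SE:         {avg_model_se:.3f}")
```

The average model-based standard error comes out noticeably larger than the empirical standard deviation of the estimates - the unadjusted analysis overstates the uncertainty, which is exactly the conservatism Kahan and Morris describe. Adjusting for gender in a regression recovers a standard error that matches the true variability.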

According to the literature review performed by Kahan and Morris, just 26% of the trials in their review which used some form of stratified randomization accounted for all of the stratification factors in their analysis. They conclude that if trials use such randomization schemes, it is important that the subsequent analysis adequately accounts for the stratification factors.
