Setting seeds when running R simulations in parallel

I’ve written previously about running simulations in R, and a few years ago about using Amazon Web Services to run simulations in R. I’m currently using the University of Bath’s high performance computing cluster, called Balena, to run computationally intensive simulations. To run a large number N of independent statistical simulations, I first generate the N input datasets, which is computationally fast. I then split the N datasets into M batches and ask the cluster to run my analysis script M times. The i’th call to the script is passed the integer i as a task id environment variable by the cluster’s scheduler. In my case the scheduler is Slurm, and the top of my R analysis script looks like:

slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the task id string to an integer
batch <- as.integer(slurm_arrayid)
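
For concreteness, here is a minimal sketch of how the batch index might then select which datasets to analyse, assuming the N datasets are split evenly across M batches (the values of N and M and the file naming are hypothetical):

# Hypothetical: map this batch to its share of the N datasets,
# assuming N is a multiple of M and datasets are saved as
# dataset_1.rds, dataset_2.rds, ...
N <- 1000 # total number of simulated datasets (assumed)
M <- 100  # number of batches (assumed)
per_batch <- N / M
indices <- ((batch - 1) * per_batch + 1):(batch * per_batch)
for (i in indices) {
  dat <- readRDS(paste0("dataset_", i, ".rds"))
  # ... analyse dat and save the results for dataset i ...
}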

Now, my statistical analysis involves random number generation (bootstrapping and multiple imputation). It is therefore important to set R’s random number seed. As described by Tim Morris and colleagues (section 4.1.1), when using parallel computing systems it is important to set the seed carefully, by having each instance or ‘worker’ use a different random number stream. They mention the rstream package in R, but I have instead been making use of the built-in parallel package’s functionality. The documentation for the parallel package shows how this can be done:

RNGkind("L'Ecuyer-CMRG")
set.seed(2002) # something
M <- 16 ## start M workers
s <- .Random.seed
for (i in 1:M) {
  s <- nextRNGStream(s)
  # send s to worker i as .Random.seed
}

To operationalise this on my high performance cluster, I adapt this code so that my analysis program looks like:

slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the task id string to an integer
batch <- as.integer(slurm_arrayid)

# find the seed for this batch's random number stream
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(69012365) # common seed, shared by all batches
s <- .Random.seed
# advance to the batch'th random number stream
for (i in 1:batch) {
  s <- nextRNGStream(s)
}
.GlobalEnv$.Random.seed <- s

Each independent batch runs this code. The i’th batch generates the seed for the i’th random number stream, and then sets R’s global environment .Random.seed value to the corresponding value. The initial set.seed call with a common value ensures that the whole process can be reproduced if required.
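
As a quick local check (a minimal sketch of my own, not part of the cluster workflow), one can verify that different batch values give different, reproducible streams:

library(parallel)

# Hypothetical helper mimicking the top of the analysis script:
# returns the .Random.seed value for a given batch under the common seed
stream_for_batch <- function(batch, common_seed = 69012365) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(common_seed)
  s <- .Random.seed
  for (i in 1:batch) {
    s <- nextRNGStream(s)
  }
  s
}

.GlobalEnv$.Random.seed <- stream_for_batch(1)
draws1 <- runif(3)
.GlobalEnv$.Random.seed <- stream_for_batch(2)
draws2 <- runif(3)
.GlobalEnv$.Random.seed <- stream_for_batch(1)
draws1_again <- runif(3)

identical(draws1, draws1_again) # TRUE: batch 1 is reproducible
identical(draws1, draws2)       # FALSE: batches use different streams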

Postscript: based on my reading of the parallel package documentation, I think what I have done above is correct, but should anyone know or think otherwise, please shout for my (and others’!) benefit.

Robustness of ANCOVA in randomised trials with unequal randomisation

In my previous post I wrote about a new paper in Biometrics which shows that when ANCOVA is used to analyse a randomised trial with adjustment for baseline covariates, not only is the treatment effect estimator consistent, but the usual model based standard error (SE) is also valid, irrespective of whether the regression model is correctly specified. As I wrote, these results were proved assuming that the trial used simple randomisation, with equal probability of assignment to the two groups.

In a pre-print available on arXiv, I extend this paper’s results to the case where the randomisation probabilities are not equal. Although 1:1 randomisation is by far the most common approach, unequal randomisation is not that uncommon. In this situation, the point estimator for the treatment effect is still consistent; this is not affected by the unequal randomisation.

The analyses in the paper show that when the randomisation is not 1:1, the model based SE is no longer generally consistent if the outcome model is misspecified. It remains valid if the true regression coefficients of the outcome on the covariates are the same in the two treatment groups and the error variances in the two groups are equal, but otherwise in general it is not. So, for example, even if the true regression coefficients are equal in the two groups, if the error variances are not, the model based SE is not valid. Alternatively, if the true regression coefficients differ in the two groups (i.e. there are interactions between treatment and some baseline covariates), again the model based SE would not in general be expected to be valid.
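
To illustrate the first of these situations, here is a minimal simulation sketch of my own (the setup is illustrative, not taken from the pre-print): 3:1 randomisation, equal covariate effects in the two groups, but unequal error variances:

# Illustrative simulation (assumed setup): 3:1 randomisation, equal slopes,
# but unequal error variances in the two arms
set.seed(1234)
nsim <- 2000
n <- 500
est <- se_model <- numeric(nsim)
for (j in 1:nsim) {
  z <- rbinom(n, 1, 0.75) # unequal randomisation, P(z = 1) = 0.75
  x <- rnorm(n)           # baseline covariate
  y <- 0.5 * z + x + rnorm(n, sd = ifelse(z == 1, 1, 3))
  fit <- lm(y ~ z + x)
  est[j] <- coef(fit)["z"]
  se_model[j] <- summary(fit)$coefficients["z", "Std. Error"]
}
sd(est)        # empirical SD of the treatment effect estimator
mean(se_model) # average model based SE - here noticeably too small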

The consequence of such invalid SEs is that the type I error will not generally be controlled at the desired level, and confidence intervals will not have the correct coverage, even for large sample sizes. The results show that, depending on the configuration, the SEs can be biased upwards or downwards (see the pre-print for details).

These results mean that in trials with simple randomisation where the randomisation is not 1:1, if one is concerned about the ANCOVA model being misspecified, the model based SE shouldn’t be used. Instead, robust sandwich SEs, which are widely available in statistical packages, are recommended. These provide asymptotically valid variance estimation under essentially arbitrary model misspecification.
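
In R, one way of obtaining such sandwich SEs is via the sandwich and lmtest packages; a minimal usage sketch with simulated data:

library(sandwich)
library(lmtest)

# hypothetical trial data for illustration
set.seed(1)
n <- 200
z <- rbinom(n, 1, 0.75) # 3:1 randomisation
x <- rnorm(n)
y <- 0.5 * z + x + rnorm(n, sd = ifelse(z == 1, 1, 2))

fit <- lm(y ~ z + x)
# coefficient table with heteroskedasticity-consistent (HC3) sandwich SEs
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))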

December 2019 – this work has now been published in Biometrics.

ANCOVA in RCTs – model based standard errors are valid even under misspecification

A nice paper by Wang and colleagues has just been published in Biometrics which examines the robustness of ANCOVA (i.e. linear regression) for analysing continuous outcomes in randomised trials. By this they mean a linear regression which adjusts for randomised treatment group and baseline covariates. Yang and Tsiatis showed in 2001 that this estimator is consistent for the average treatment effect even if the model is misspecified, and as such can be recommended for general use in the analysis of such trials. They offered a variance estimator for the resulting treatment effect estimate that is valid even if the model is misspecified, and compared it in simulations to the usual model based variance estimator from ANCOVA (which they refer to as the OLS variance). Yang and Tsiatis reported that the OLS variance estimator performed well even when the linear model was misspecified, but suggested that it had some bias as the sample size was increased.
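
Concretely, the estimator in question is simply the coefficient on treatment from a fit like the following (a minimal sketch with simulated data):

# ANCOVA: regress the outcome on treatment group and baseline covariate(s)
set.seed(42)
n <- 100
z <- rbinom(n, 1, 0.5) # 1:1 randomisation
x <- rnorm(n)          # baseline covariate
y <- 0.3 * z + 0.8 * x + rnorm(n)

fit <- lm(y ~ z + x)
summary(fit)$coefficients["z", ] # estimate and the usual model based SE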

In their new paper Wang and colleagues prove that so long as the randomisation ratio is 1:1, the standard model based variance estimator for the adjusted treatment effect from ANCOVA is valid even under model misspecification. As such, this further strengthens support for using a linear regression model with adjustment for baseline covariates in randomised trials.

An important caveat to the result of Wang and colleagues is that they assumed that the treatment group is assigned completely at random, and in particular independently of the baseline covariates. This rules out certain randomisation schemes, such as stratified randomisation, where randomisation depends on the subject’s baseline covariates. Indeed, we know that in this setting, if we don’t properly model the effects of the variables used to stratify the randomisation, our treatment effect variance estimates are in general not correct.