I’ve written previously about running simulations in R and a few years ago on using Amazon Web Services to run simulations in R. I’m currently using the University of Bath’s high performance computing cluster, called Balena, to run computationally intensive simulations. To run a large number of N independent statistical simulations, I first generate the N input datasets, which is computationally fast to do. I then split the N datasets into M batches. I then ask the high performance cluster to run my analysis script M times. The i’th call to the script gets passed the integer i as the task id environment variable by the scheduler system on the high performance cluster. In my case the scheduler is Slurm, and the top of my R analysis script looks like:
slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the value to an integer
batch <- as.numeric(slurm_arrayid)
Now, my statistical analysis involves drawing random number generation (bootstrapping and multiple imputation). It is therefore important to set R’s random number seed. As described by Tim Morris and colleagues (section 4.1.1), it is important when using parallel computing systems to set the seed carefully, by having each instance or ‘worker’ use a different random number stream. They mention the rstream package in R, but I instead have been making use of the built-in parallel package‘s functionality. The documentation for the parallel package shows how this can be done:
RNGkind("L'Ecuyer-CMRG")
set.seed(2002) # something
M <- 16 ## start M workers
s <- .Random.seed
for (i in 1:M) {
s <- nextRNGStream(s)
# send s to worker i as .Random.seed
}
To operationalise this with my high performance cluster, I adapt this code as follows, so that my analysis program looks like:
slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the value to an integer
batch <- as.numeric(slurm_arrayid)
#find seed for this batch
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(69012365) #set seed to something
s <- .Random.seed
for (i in 1:batch) {
s <- nextRNGStream(s)
}
.GlobalEnv$.Random.seed <- s
Each independent batch will run this code. The i’th batch will generate the required seed for the i’th random number stream, and then set R’s global environment .Random.seed value to the corresponding value. The first set.seed call in this code to a common value ensures that we can re-produce the whole process if required.
Postscript: based on my reading of the parallel package documentation, I think what I have done above is correct, but should anyone know or think otherwise, please shout for my (and others!) benefit.