Setting seeds when running R simulations in parallel

I’ve written previously about running simulations in R, and a few years ago about using Amazon Web Services to run simulations in R. I’m currently using the University of Bath’s high performance computing cluster, Balena, to run computationally intensive simulations. To run a large number N of independent statistical simulations, I first generate the N input datasets, which is computationally cheap. I then split the N datasets into M batches and ask the cluster to run my analysis script M times. The i’th call to the script is passed the integer i as a task id environment variable by the cluster’s scheduler. In my case the scheduler is Slurm, and the top of my R analysis script looks like:

slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the value to an integer
batch <- as.integer(slurm_arrayid)
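
When testing the script interactively, SLURM_ARRAY_TASK_ID is typically unset, so Sys.getenv returns an empty string and the coercion produces NA. A defensive variant might look like the following sketch (the fallback to batch 1 is my assumption, not part of the original script):

```r
slurm_arrayid <- Sys.getenv("SLURM_ARRAY_TASK_ID")
# fall back to batch 1 when running outside Slurm (assumed default)
batch <- if (nzchar(slurm_arrayid)) as.integer(slurm_arrayid) else 1L
# fail early if the task id was not a valid integer
stopifnot(!is.na(batch), batch >= 1L)
```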

Now, my statistical analysis involves random number generation (for bootstrapping and multiple imputation). It is therefore important to set R’s random number seed. As described by Tim Morris and colleagues (section 4.1.1), when using parallel computing it is important to set the seed carefully, by having each instance or ‘worker’ use a different random number stream. They mention the rstream package in R, but I have instead been making use of the built-in parallel package’s functionality. The documentation for the parallel package shows how this can be done:

library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(2002) # something
M <- 16 ## start M workers
s <- .Random.seed
for (i in 1:M) {
  s <- nextRNGStream(s)
  # send s to worker i as .Random.seed
}
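
As a standalone check of what these streams provide, the following sketch (my own, not from the package documentation) advances two streams and verifies that a given stream reproduces the same draws while distinct streams differ:

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(2002)
s1 <- nextRNGStream(.Random.seed)  # stream for worker 1
s2 <- nextRNGStream(s1)            # stream for worker 2

.GlobalEnv$.Random.seed <- s1
x1 <- runif(3)
.GlobalEnv$.Random.seed <- s2
x2 <- runif(3)
.GlobalEnv$.Random.seed <- s1
x1_again <- runif(3)

stopifnot(identical(x1, x1_again))  # same stream is reproducible
stopifnot(!identical(x1, x2))       # distinct streams give distinct draws
```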

To operationalise this with my high performance cluster, I adapt this code as follows, so that my analysis program looks like:

slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the value to an integer
batch <- as.integer(slurm_arrayid)

# find the seed for this batch
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(69012365) # set seed to something
s <- .Random.seed
for (i in 1:batch) {
  s <- nextRNGStream(s)
}
.GlobalEnv$.Random.seed <- s

Each independent batch will run this code. The i’th batch advances the random number generator to the i’th stream, and then sets R’s global environment .Random.seed to the corresponding value. Calling set.seed with a common value at the start ensures that the whole process can be reproduced if required.
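
To convince yourself that each batch receives a distinct, reproducible stream, you can mimic several task ids within one R session. The helper function below is hypothetical (it just packages up the logic of the script above):

```r
library(parallel)

# hypothetical helper reproducing the script's seed-setting logic for a task id
set_batch_seed <- function(batch, seed = 69012365) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  s <- .Random.seed
  for (i in seq_len(batch)) {
    s <- nextRNGStream(s)
  }
  .GlobalEnv$.Random.seed <- s
}

set_batch_seed(1); d1 <- rnorm(5)
set_batch_seed(2); d2 <- rnorm(5)
set_batch_seed(1); d1_again <- rnorm(5)

stopifnot(identical(d1, d1_again))  # batch 1 is reproducible
stopifnot(!identical(d1, d2))       # batches 1 and 2 use different streams
```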

Postscript: based on my reading of the parallel package documentation, I think what I have done above is correct, but should anyone know or think otherwise, please shout for my (and others’!) benefit.

2 thoughts on “Setting seeds when running R simulations in parallel”

  1. Hi Jonathan

    Thanks for sharing this.

    I was wondering:
    1.) In your analysis program, due to your ‘for’ loop, what happens to the values of ‘s’ for i==1…(batch-1)? Do they get sent to “worker”s indeed, as per the example from the parallel package documentation? Or is only the ‘final’ value of ‘s’ (i.e., for i==batch) being assigned to ‘.GlobalEnv$.Random.seed’?

    2.) Subsequently, do all cores (“worker”s?) use the same value of ‘.GlobalEnv$.Random.seed’ (i.e., the value of ‘s’ as assigned to ‘.GlobalEnv$.Random.seed’ on the last line of your analysis program)?

    • So the high performance cluster scheduler runs that second program M times independently. The batch variable stores which batch is currently being run. The for loop runs from 1 up to batch so that it obtains the seed needed for that batch. Each independent batch then uses a different seed (which is what we want), and the code ensures these come from distinct streams.

