Setting seeds when running R simulations in parallel

I’ve written previously about running simulations in R, and a few years ago about using Amazon Web Services to run simulations in R. I’m currently using the University of Bath’s high performance computing cluster, called Balena, to run computationally intensive simulations. To run N independent statistical simulations, I first generate the N input datasets, which is computationally fast to do, and split them into M batches (a sketch of this preparation step is given after the snippet below). I then ask the cluster to run my analysis script M times. The i’th call to the script is passed the integer i as a task id environment variable by the cluster’s scheduler. In my case the scheduler is Slurm, and the top of my R analysis script looks like:

slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the value to an integer
batch <- as.integer(slurm_arrayid)
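
As mentioned above, the N input datasets are generated and split into batches before any analysis runs. Here is a minimal sketch of how that preparation step might look; the data-generating function, counts and file names are hypothetical placeholders rather than my actual code:

N <- 1000        # number of independent simulated datasets
M <- 50          # number of batches, i.e. Slurm array tasks
set.seed(123456) # fix a seed here too, so the inputs themselves are reproducible
simulate_one_dataset <- function(n = 200) {
  data.frame(x = rnorm(n), y = rnorm(n)) # placeholder data-generating model
}
datasets <- lapply(1:N, function(i) simulate_one_dataset())
# assign datasets to batches and save one input file per batch
batch_id <- rep(1:M, length.out = N)
for (b in 1:M) {
  saveRDS(datasets[batch_id == b], file = sprintf("inputs_batch_%03d.rds", b))
}

The analysis script can then load its own inputs with readRDS(), using the batch number it reads from SLURM_ARRAY_TASK_ID.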

Now, my statistical analysis involves random number generation (bootstrapping and multiple imputation). It is therefore important to set R’s random number seed. As described by Tim Morris and colleagues (section 4.1.1), when using parallel computing it is important to set the seed carefully, by having each instance or ‘worker’ use a different random number stream. They mention the rstream package in R, but I have instead been making use of the built-in parallel package’s functionality. The documentation for the parallel package shows how this can be done:

RNGkind("L'Ecuyer-CMRG")
set.seed(2002) # something
M <- 16 ## start M workers
s <- .Random.seed
for (i in 1:M) {
  s <- nextRNGStream(s)
  # send s to worker i as .Random.seed
}
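
As an aside, when the M workers are R processes launched from within a single session (for example via makeCluster()), the parallel package can do this stream assignment for you with clusterSetRNGStream(); a small illustration:

library(parallel)
cl <- makeCluster(4)
clusterSetRNGStream(cl, iseed = 2002) # gives each worker its own L'Ecuyer-CMRG stream
res <- parLapply(cl, 1:4, function(i) runif(1)) # reproducible given the same iseed
stopCluster(cl)

On the cluster, however, each batch is a separate R process started by Slurm, so I have to advance to the correct stream myself.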

To operationalise this on my high performance cluster, I adapt the documentation example so that my analysis program looks like:

slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the value to an integer
batch <- as.integer(slurm_arrayid)

# find the random number stream (seed) for this batch
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(69012365) # common seed, the same for every batch
s <- .Random.seed
for (i in 1:batch) {
  s <- nextRNGStream(s)
}
.GlobalEnv$.Random.seed <- s

Each independent batch runs this code. The i’th batch advances to the i’th random number stream and then sets the .Random.seed value in R’s global environment to the corresponding state. Because every batch calls set.seed with the same common value, the whole process can be reproduced if required.
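
As a quick sanity check (not part of my actual pipeline), the scheme can be verified in a single R session: each batch number yields a reproducible stream, and different batch numbers yield different draws.

library(parallel)
draws_for_batch <- function(batch, n = 3) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(69012365) # the same common seed as in the analysis script
  s <- .Random.seed
  for (i in 1:batch) {
    s <- nextRNGStream(s)
  }
  .GlobalEnv$.Random.seed <- s
  runif(n)
}
draws_for_batch(1)                                # fixed draws for batch 1
draws_for_batch(2)                                # different draws for batch 2
identical(draws_for_batch(5), draws_for_batch(5)) # TRUE: reproducible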

Postscript: based on my reading of the parallel package documentation, I think what I have done above is correct, but should anyone know or think otherwise, please shout for my (and others’!) benefit.

Running simulation studies in R

In my work, and indeed in blog posts on this site, I often perform simulation studies. They can be invaluable for exploring and testing the performance of statistical methods under different conditions. Recently Tim Morris, Ian White and Michael Crowther published an excellent paper in Statistics in Medicine, freely available here, on how to plan and run simulation studies. The paper contains a wealth of useful guidance and advice, and in particular highlights how things can go wrong, for example through inappropriate setting of random number seeds!

Tim has an accompanying GitHub repository with Stata code for the illustrative example from the paper, in which they simulate survival data and analyse it using a number of different survival regression models. As part of the new MSc in Data Science & Statistics here at the University of Bath, I’ve put together a short introductory tutorial on performing simulation studies using R. It can be accessed here. I hope it gives a good introduction to the key elements of programming up a simulation study in R. If anyone has comments on it, or thinks I’ve omitted something important that should be covered, please get in touch via email or a comment on this page.
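
To give a flavour of the key elements involved, here is a small self-contained sketch in a similar spirit; it is my own toy example rather than the paper’s, and the sample size, hazard rates and number of repetitions are arbitrary choices. It simulates two-arm exponential survival data, fits a Cox model to each simulated dataset, and summarises the estimates of the log hazard ratio.

library(survival)
one_rep <- function(n = 200, log_hr = 0.5) {
  trt <- rbinom(n, 1, 0.5)
  event_time <- rexp(n, rate = 0.1 * exp(log_hr * trt)) # true log hazard ratio 0.5
  cens_time <- rexp(n, rate = 0.05)                     # independent censoring
  time <- pmin(event_time, cens_time)
  status <- as.numeric(event_time <= cens_time)
  fit <- coxph(Surv(time, status) ~ trt)
  c(est = unname(coef(fit)), se = sqrt(vcov(fit)[1, 1]))
}
set.seed(7361)
nsim <- 1000
res <- t(replicate(nsim, one_rep()))
mean(res[, "est"]) - 0.5 # bias of the estimated log hazard ratio
sd(res[, "est"])         # empirical standard error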