Running simulations in R using Amazon Web Services

I've recently been working on some simulation studies in R which involve computationally intensive MCMC sampling. Ordinarily I would use my institution's computing cluster for these, making use of its large number of cores, but a temporary lack of availability led me to investigate using Amazon Web Services (AWS) instead. In this post I'll describe the steps I went through to get my simulations going in R. As background, I am mainly a Windows user, and had never really used the Linux operating system. Nonetheless, the process wasn't actually too tricky to get going in the end, and it's enabled me to get the simulations completed far more quickly than if I'd just used my desktop's 8 cores. The advantage of using a cloud computing resource (from my perspective) is that in principle you can use as little or as much computing power as you need or want, and it is always available: you don't have to compete against other users' demands, as would typically be the case on an academic institution's computer cluster.

Initial setup with Amazon Web Services
First off, sign up for a free account at https://aws.amazon.com. As part of this you will be asked for your credit card details. The way that Amazon have set this up is that you get a certain amount of computing time for free, but thereafter you are charged. Later on, when you start up computers in the cloud on your account, you should be careful to understand what will and what won't cost you, although as far as I can tell you are always warned if you select something that will cost you money.

Next, you need to run through all of the steps described here. EC2 is the Amazon service that we will use. Be warned - there is a lot of terminology in the documentation, and at first I really had little clue what was what. However, after a few goes things become clearer. In the following I'll describe what's what, at least as far as I understand it!

Step 1 - Sign up for AWS
You should have already done this, as described above.

Step 2 - Create an IAM User
The first paragraph in this section explains the purpose of creating an IAM user account. This account is the one we will use later to log in. Note however that to check your billing (if you do anything more than what is included in the free account), you must log in using the details of your original email address account.

Step 3 - Create a Key Pair
The primary way to log in to the cloud computer will be using the SSH protocol. This is basically a command line interface to the computer. Part of the log in process via SSH is to use a secure key which only you have, meaning that only you are allowed to log in to the computer. Follow the instructions to set up a public and private key. The private key is the one you will need for logging in to the cloud computer.

Step 4 - Create a Virtual Private Cloud (VPC)
You can skip this step, as the documentation for it explains.

Step 5 - Create a Security Group
In a minute we will start a computer in the cloud to which we can connect. However, we want to make sure that only we can connect to it. Setting up a security group means the cloud computer we will soon start up will only allow connections from you. Follow the instructions to create a new security group. SSH is the protocol that we will be using to connect to the cloud computer, while HTTP and HTTPS are web protocols.

Starting your first instance
We will now set up our first cloud computer. Amazon refers to one of these as an instance. First, log in using the IAM credentials you set up earlier, by going to: https://your_aws_account_id.signin.aws.amazon.com/console/ , substituting your AWS account ID in the preceding address. Then click EC2, in the top left of the home page. This takes you to the EC2 dashboard.

To start an instance, follow step 1 of the instructions here. This shows you how to start an instance that won't cost you any money, and how to ensure that you assign the security group you set up earlier to the instance. Then follow the instructions for step 2, connecting to your instance. For a Windows user, this involves installing the free PuTTY software, which you use to connect to your instance using the SSH protocol. This essentially gives you a DOS-like command line connection to your instance.

Once you've finished playing around with your instance, make sure you terminate it from the EC2 dashboard, by right-clicking on the instance and selecting terminate.

Installing R onto your instance
The Getting Started instructions referred to above describe how to set up an instance running the Amazon Linux AMI, which as far as I understand is Amazon's own variant of Linux (if anyone can correct me on this, please add a comment). To run R, I have instead used a plain Ubuntu installation. Ubuntu is one of the most popular distributions of the Linux operating system.

Start a new instance, running the Ubuntu AMI shown in the main list. It should say "Free tier eligible" below it. Then connect to it via SSH using PuTTY. Note that whereas for the Amazon AMI the username is ec2-user, for Ubuntu the username you need to connect is ubuntu. Also remember that you will need to find the public DNS of the new instance in the EC2 dashboard in your browser. So the host name in PuTTY will look like ubuntu@ec2-1-1-1-1.compute-1.amazonaws.com.

Next we can install R. For me, having never used Linux before, this was the hardest bit. Fortunately there are a number of blogs and pages showing you different ways of doing it. I had to use a variant of what is described here, which didn't work on AWS exactly as described. First, in PuTTY type:

sudo nano /etc/apt/sources.list

This loads the sources.list file into the nano text editor. We need to add a line to point to where R can be downloaded. Add a line as follows, to the bottom of the file:

deb http://cran.rstudio.com/bin/linux/ubuntu trusty/

Then press Ctrl+X, type Y, then enter, to save the file. Next, type the following three commands in sequence, selecting Y for yes where required:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt-get update
sudo apt-get install r-base
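
The repository line above hard-codes trusty, the codename of the Ubuntu release used here. A minimal sketch of building the same line for another release is given below. The helper name cran_repo_line is hypothetical, and whether the CRAN mirror actually serves a given release is an assumption you would need to check.

```shell
#!/bin/bash
# Build the CRAN repository line for a given Ubuntu release codename.
# cran_repo_line is a hypothetical helper, not part of any package.
cran_repo_line() {
  printf 'deb http://cran.rstudio.com/bin/linux/ubuntu %s/\n' "$1"
}

# Show the line for the release used in this post.
cran_repo_line trusty

# On the instance itself, lsb_release -cs prints the codename of the running
# release. Appending the line needs sudo, so it is left commented out here:
# cran_repo_line "$(lsb_release -cs)" | sudo tee -a /etc/apt/sources.list
```

Appending the line this way is an alternative to editing sources.list by hand in nano; the three commands above are still needed afterwards.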

If this was successful, you should now have R installed. To check, at the prompt type R. If it worked, you should see:

R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

You can now work in R, from the command line. To quit, type q() and press enter. You are then asked if you want to save your workspace or not.

Now you probably don't want to work in R from the command line like this of course! If running simulations, I get an R script file ready on my desktop, and then copy it across to AWS to run it (see below). For details on how to transfer files from your local machine to your AWS instance, see here for instructions using the freely available WinSCP program (the instructions are about three quarters of the way down the page). If you did want to run RStudio on AWS, see Louis Aslett's excellent page on running R and RStudio on AWS.

Running parallel simulations
The original reason for me trying to get AWS going was to run computationally intensive simulations in R. The way I have done this is to first start an instance type with many cores (36), specifically the c4.8xlarge type. This currently costs about $1.70 an hour to run, plus tax. To run the simulations in parallel, I use a shell script file which launches multiple copies of R. To write this script file, type: nano myscript.sh at your command line. This opens the nano text editor program. Enter the following lines:

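#!/bin/bash
# Make sure the folder for the console output exists
mkdir -p ~/outputs
for set in {1..36}
do
  # Start one background copy of R per set, passing set as the only argument
  nohup Rscript myrprog.R $set > ~/outputs/myprogoutputs_$set 2>&1 &
done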

This script calls R 36 times, with each call passing a different integer from 1 to 36 as the only argument to R. The first part of the line within the for loop is nohup. Putting this at the start of the call ensures that the simulations will continue even if we log out from the instance, i.e. they will carry on running while we are not connected. The next part calls R with Rscript, giving the name of our R script. The $set then passes the current value of the for loop variable set as an argument to R.

The second half of the call says that we want the output to be stored in a subdirectory called outputs, with a filename myprogoutputs_$set, where $set inserts the set variable into the filename. The 2>&1 specifies how we want output and any errors from the call to R to be captured (see here). The last & requests that R be run in the background. This part is crucial, as it allows us to call R 36 times simultaneously. Without this final &, after the first call to Rscript, the second call would only be issued once the first Rscript call had completed.
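
To see the effect of the trailing & in isolation, here is a minimal sketch that can be run anywhere, with or without R installed. The worker function is a hypothetical stand-in for the Rscript call, a temporary directory stands in for ~/outputs, and the wait at the end is an addition that is not in the script above.

```shell
#!/bin/bash
# Hypothetical stand-in for "Rscript myrprog.R $set": it only reports
# which set number it was given.
worker() {
  echo "processing set $1"
}

outdir="$(mktemp -d)"   # temporary directory standing in for ~/outputs

for set in 1 2 3 4
do
  # The trailing & puts each worker in the background, so the loop moves
  # straight on to the next set without waiting for this one to finish.
  worker "$set" > "$outdir/myprogoutputs_$set" 2>&1 &
done

# wait blocks until every background job started by this shell has exited.
wait
echo "all sets finished"
```

The wait is only useful if something should happen after all the runs are done, such as printing a message. The script above deliberately returns at once, so that you can log out while R carries on running.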

Next, we have to change the permissions of the script file we have just created to enable it to be executed. To do this, type the following at the PuTTY command prompt:

chmod +x myscript.sh

Before running our script file, we need to edit our R program to make use of the argument that we are passing to it. At the top of my R program I have the following lines added:

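# Read the arguments given after the script name on the Rscript command line
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 1) stop("usage: Rscript myrprog.R <set>")
set <- as.numeric(args[[1]])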

This picks up the value of set passed to R by the script file earlier. We can then use the R variable set to ensure, for example, that we generate the appropriate datasets to be analysed, or to set the random number seed appropriately (e.g. a seed that depends on set). If the program sets a fixed seed that does not depend on set, every call to R will produce identical results. (If we randomly generate data and don't set the seed at all, we would probably still obtain different results from each run, since R creates an initial seed based on the time and its process ID.)

Let's suppose that we wanted to perform 3600 simulations. If we are using a 36 core instance, we could run 100 simulations in each call to R. Each call then saves a dataset with results from 100 simulations. We can then combine the 36 datasets to form a dataset of the desired 3600 simulation results.
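
The arithmetic can be sketched in the shell as follows. The numbering of simulations from 1 to 3600 and the helper name set_range are assumptions made for illustration; the post itself only fixes the totals.

```shell
#!/bin/bash
total=3600                      # total number of simulations wanted
cores=36                        # number of simultaneous calls to R
per_call=$(( total / cores ))   # simulations performed by each call

# Print the first and last simulation number handled by a given set.
# set_range is a hypothetical helper used only in this sketch.
set_range() {
  first=$(( ($1 - 1) * per_call + 1 ))
  last=$(( $1 * per_call ))
  echo "$first $last"
}

echo "simulations per call: $per_call"
for set in 1 2 36
do
  echo "set $set: simulations $(set_range "$set")"
done
```

With these numbers each call performs 100 simulations, set 1 covering simulations 1 to 100 and set 36 covering 3501 to 3600.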

To save the results, at the end of my R program I have a line like:

results <- list(estimates=estimates, ciLower=ciLower, ciUpper=ciUpper)
setwd("outputs")
save(results, file=paste("results_",set,".RData",sep=""))

This saves an R data file called results_set.RData, where set is replaced by the value of our set variable, to the outputs folder.

We now have a setup which can run 36 simultaneous copies of R. Typing ./myscript.sh at the command prompt starts them all. The 36 R datasets created can then be combined together and analysed.
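
Before combining the datasets it can be worth checking that every run actually produced its results file. A minimal sketch follows, assuming the results_<set>.RData naming used above; the helper name missing_sets is hypothetical.

```shell
#!/bin/bash
# Print the set numbers whose results file is absent from a directory.
# Arguments: the directory to check, and the number of sets expected.
# missing_sets is a hypothetical helper used only in this sketch.
missing_sets() {
  dir="$1"
  n="$2"
  set_no=1
  while [ "$set_no" -le "$n" ]
  do
    [ -f "$dir/results_$set_no.RData" ] || echo "$set_no"
    set_no=$(( set_no + 1 ))
  done
}

# Small demonstration: 3 sets expected, but only sets 1 and 3 have a file.
demo="$(mktemp -d)"
touch "$demo/results_1.RData" "$demo/results_3.RData"
missing_sets "$demo" 3
```

On the instance the equivalent call would be missing_sets ~/outputs 36, which prints nothing once all 36 files are present.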

Using more than 36 cores
But what if our simulations take so long that we want to use more than 36 cores/processors? The route I have taken is to start multiple instances, each with 36 cores, and to edit my R program so that each call only performs a smaller number of simulations, thus completing more quickly. Doing this isn't quite as tortuous as it sounds. First off, it would be a pain to have to install R and our required packages on each of the instances. Fortunately AWS has a nifty feature called Amazon Machine Images. To make use of this, right-click on the instance you have running with R installed, select Image, then Create Image. This will take a snapshot of your instance, including the operating system, installed software, and whatever you had stored on the attached storage. Now when you go to start a second instance, Step 1 is to "Choose an Amazon Machine Image (AMI)". If you click on "My AMIs" on the left hand side, you can choose to start a new instance using the image you just saved from your first instance. This means that when you connect to this second instance, R and everything else will already be set up.

Splitting the simulations across many instances is however a bit of a pain. It means one has to SSH into each instance, start off the simulations, wait until they complete, then copy the results files back to your local drive with WinSCP, repeating this for every instance. I have no doubt there is a much slicker route to achieving the same thing. If anyone can suggest how, I'd very much appreciate it!
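
One possible way to cut down the repeated copying is to loop over the instances from a command line. The sketch below assumes a machine with the OpenSSH scp client (rather than WinSCP) and a private key in the .pem format that client uses. The key path and host names are hypothetical placeholders, and with DRY_RUN=1 the script only prints the commands it would run.

```shell
#!/bin/bash
# Hypothetical values: replace with your own key file and instance addresses.
KEY="$HOME/my-key-pair.pem"
HOSTS="ec2-1-1-1-1.compute-1.amazonaws.com ec2-2-2-2-2.compute-1.amazonaws.com"
DRY_RUN=1   # 1 = only print the commands; 0 = actually run them

# Either print a command or execute it, depending on DRY_RUN.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy the results files from every instance into results/<host>/.
fetch_results() {
  for host in $HOSTS
  do
    run mkdir -p "results/$host"
    run scp -i "$KEY" "ubuntu@$host:outputs/results_*.RData" "results/$host/"
  done
}

fetch_results
```

Each instance gets its own local folder because every instance names its files results_1 to results_36, so copying them all into a single folder would overwrite earlier files.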

Conclusions
I've been using AWS to run simulations (on and off) for about a week. While it is costing me money, it's enabled me to complete my simulations in a tiny fraction of the time it would have taken my desktop machine. And the charges only accrue while your instances are running - once you're done you can terminate them.

Lastly, if anyone has any tips of their own for running simulations on AWS, or a correction to anything I've written, please add a comment.

8 thoughts on “Running simulations in R using Amazon Web Services”

  1. Thanks so much for writing this! It's super helpful. Your instructions just saved me a ton of time! One question: Once your simulation is complete, how do you get the resulting output file out of your AWS instance and onto your local machine?

  2. Hi Jonathan,
    Thanks for posting about this! Some other people suggest using the parallel library or other multi-core libraries in R, instead of using multiple R instances the way that you do. It is a little easier, which is kind of cool, and possibly more robust to errors... are there other reasons why you prefer your method? Do you have any idea which is faster?

    • Good question Sean. I did it the way I did it primarily because it was the most obvious approach to me, and didn't require me to have to make any substantive changes to the code. I'm certainly no expert, but I think where you are running completely independent simulations there can be no speed advantage compared to running it with some parallel / multi-core package - the difference is the latter sorts out managing the runs for you. As I say though, I'm really not an expert, so someone else may indicate other advantages in the simulation setting to doing it differently.

    • If there is variance in completion times for the simulations, a library like foreach or parallel might save time since it has the option of allocating jobs to cores as they finish, rather than preallocating a batch of jobs to each core. Take a look at the %dopar% infix operator in the foreach library for parallelization. It makes parallel processing very easy.

  3. Hi Jonathan,

    First of all, many thanks for writing this! I just have a couple of questions:

    - When you type " > ~/outputs/myprogoutputs_$set 2>&1 & ", you mentioned that you want to store the outputs in a folder called "outputs" with the name, I suppose, "myprogoutputs_$set". However, you also add " save(results, file = paste("results_", set, ".RData", sep = "")) " to the end of your R program, where, as I understand, you will save the result with the name "results_$set". Should those be the same name?

    - Moreover, after adding " args=(commandArgs(TRUE)); set <- as.numeric(args[[1]]) " at the top of the R program (a .R file?), I got an error saying " Error in args[[1]] : subscript out of bounds " when I ran the R script from the command line. Do you have any idea how to solve this problem?

    Thank you in advance! I would also appreciate it if anyone else could answer my questions.

    • Hi Sam

      There are two different outputs. The first one in your question is specifying where the R console output will be saved. The second one, within the R script, is saving an R object containing whatever simulation results we want to save, to a file.

      I'm not sure about your second question. It looks like the set argument is not being passed into R correctly.

      Best wishes
      Jonathan
