Running simulations in R using Amazon Web Services

I’ve recently been working on some simulation studies in R which involve computationally intensive MCMC sampling. Ordinarily I would run these on my institution’s computing cluster, making use of its large number of cores, but a temporary lack of availability led me to investigate using Amazon Web Services (AWS) instead. In this post I’ll describe the steps I went through to get my simulations going in R. As background, I am mainly a Windows user, and had never really used the Linux operating system. Nonetheless, the process wasn’t too tricky to get going in the end, and it has enabled me to complete the simulations far more quickly than if I’d just used my desktop’s 8 cores. The advantage of using a cloud computing resource (from my perspective) is that in principle you can use as little or as much computing power as you need, and it is always available: you don’t have to compete against other users’ demands, as would typically be the case on an academic institution’s computer cluster.

smcfcs in R – updated version 1.1.1 with critical bug fix

For any users of my R package smcfcs, I’ve just released a new version (1.1.1) which, along with a few small changes, includes a critical bug fix. The bug affected imputation of categorical variables (both binary variables and those with more than two levels) when the substantive model is linear regression (other substantive model types were not affected). All users should update to the new version, which is available on CRAN.

Machine learning vs. traditional modelling techniques

In the process of organising a conference session on machine learning, I’ve finally got around to reading the late Leo Breiman’s thought-provoking 2001 Statistical Science article “Statistical Modeling: The Two Cultures”. I highly recommend reading the paper, and the discussion that follows it. In the paper Breiman argues that statistics as a field should open its eyes to analysing data not only with traditional ‘data models’ (his terminology), by which he means standard (usually parametric) probabilistic models, but also by making much more use of so-called machine learning algorithmic techniques.
