PhD in estimands/causal inference in trials (UK/EU)

If you are a UK/EU resident interested in pursuing a PhD on estimands/causal inference in clinical trials, please see the advert here. There is (rightly) increasing emphasis in clinical trials on clear specification of the scientific question and hence the target estimand or parameter.

While one might think that choosing and specifying the estimand is usually easy, in many settings various things can happen during follow-up which complicate it. Examples include patients changing treatments, failing from competing risks, or dying before the endpoint of interest can be measured. This has led to the ICH E9 addendum on estimands, whose final version will soon be published. There remain a number of areas where deciding what the most appropriate estimand is, and how it can validly be estimated from the observable data, is challenging, and this PhD will seek to address some of these outstanding areas. For more background on this area, I’d recommend reading this paper.

The PhD will be based at the University of Bath, with myself as primary supervisor. The student will benefit from additional supervision from leading researchers in causal inference: Rhian Daniel (Cardiff), Jack Bowden (Bristol) and Daniel Farewell (Cardiff).

For information about funding and the application process, please see the information here. The application deadline is 25th November 2019.

What’s the difference between statistics and machine learning?

I had an interesting discussion at work today (among people I think would all call themselves statisticians!) about the distinction(s) between statistics and machine learning. It is something I am still not very clear about myself, and I have yet to find a satisfactory answer. It’s a topic that seems to get some statisticians particularly hot under the collar, when machine learning courses apparently claim that methods statisticians tend to think of as part of statistics are in fact part of machine learning.

This post is certainly not going to tell you what the difference between machine learning and statistics is. Rather, I hope that it spurs readers of the post to help me understand their differences.

Historically I think it’s the case that machine learning algorithms were developed in computer science departments of universities, whereas statistics was developed within mathematics or statistics departments. But this is merely about the historical origins, rather than any fundamental distinction.

Machine learning (about which I know a lot less) tends, I think, to focus on algorithms, and a subset of these have as their objective predicting some outcome based on a set of inputs (or predictors, as we might call them in statistics). In contrast to parametric statistical models, these algorithms typically do not make rigid assumptions about the relationships between the inputs and the outcome, and therefore can perform well when the dependence of the outcome on the predictors is complex or non-linear. The potential to capture such complex relationships is, however, not unique to machine learning – within statistics we have flexible parametric and semiparametric models, and even non-parametric methods such as non-parametric regression.

The Wikipedia page on machine learning states:

Machine learning and statistics are closely related fields in terms of methods, but distinct in their principal goal: statistics draws population inferences from a sample, while machine learning finds generalizable predictive patterns.

So statistics is about using sample data to draw inferences or learn about a wider population from which the sample has been drawn, whereas machine learning finds patterns in the data that can be generalised. It’s not clear from this quote alone what machine learning will generalise to, but the natural thing that comes to mind is some broader collection or population similar to the sample at hand. So this apparent distinction seems quite subtle. Indeed, the Wikipedia page goes on to say:

A core objective of a learner is to generalize from its experience.[2][17] Generalization in this context is the ability of a learning machine to perform accurately on new, unseen examples/tasks after having experienced a learning data set. The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences) and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.

To me, when one starts saying that the training data is considered representative of the space of occurrences, this sounds remarkably similar to the notion of the training data being a sample from some larger population, as would often be assumed by statistical models.

An interesting short article in Nature Methods by Bzdok and colleagues considers the differences between machine learning and statistics. The key distinction they draw out is that statistics is about inference, whereas machine learning tends to focus on prediction. They acknowledge that statistical models can often be used both for inference and prediction, and that while some methods fall squarely in one of the two domains, some methods, such as bootstrapping, are used by both. They write:

ML makes minimal assumptions about the data-generating systems; they can be effective even when the data are gathered without a carefully controlled experimental design and in the presence of complicated nonlinear interactions. However, despite convincing prediction results, the lack of an explicit model can make ML solutions difficult to directly relate to existing biological knowledge.

The claim that ML methods can be effective even when the data are not collected through a carefully controlled experimental design is interesting. First, it seems to imply that statistics is mainly useful only when the data come from an experiment, something which epidemiologists conducting observational studies or survey statisticians conducting national surveys would presumably take issue with. Second, it seems to suggest that ML can give useful predictions for the future with minimal assumptions about how the training data arose. This seems problematic, and I cannot see why the importance of how the data arose should differ depending on whether you use a statistical method or a machine learning method. For example, if we collect data on the association between an exposure (e.g. alcohol consumption) and an outcome (e.g. blood pressure) from an observational (non-experimental) study, I cannot see how machine learning can, without additional assumptions, overcome the probable issue of confounding.

As I wrote earlier, I do not have a well formed view of the distinction between machine learning and statistics. My best attempt is the following: statistics starts with a model assumption, which could be more rigid (e.g. simple parametric models) or less so (e.g. semiparametric or nonparametric), and which describes aspects of the data generating distribution in a way that answers a question of interest, or that could be used for prediction for the population from which the sample has been drawn. Uncertainty about parameters in the model, or about predictions, can be quantified. Machine learning doesn’t assume a model, but is a collection of algorithms for building prediction rules or finding clusters in data. The prediction rules should work well at predicting future data. Quantifying uncertainty about predictions or clusters is presumably harder, or not possible, given the absence of a model.
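To make this contrast concrete, here is a toy sketch in R (my own illustration, not from any of the sources quoted above): the same fitted linear model supports both inference, an estimate with quantified uncertainty, and prediction for new inputs.

```r
# Toy illustration: one linear model used for both inference and prediction.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

# Inference: estimate of the slope, with a 95% confidence interval
# quantifying our uncertainty about the population parameter.
ci <- confint(fit, "x", level = 0.95)

# Prediction: point predictions for new inputs; because we have a model,
# we can also attach prediction intervals.
newdata <- data.frame(x = c(-1, 0, 1))
preds <- predict(fit, newdata, interval = "prediction")
```

A prediction-focused machine learning algorithm, such as a random forest, would give the predictions but would not directly give the confidence interval for the slope parameter.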

Please add your views in a comment and help me understand the distinction(s).

I posted this to the website Hacker News last night, and it picked up some interest. As a consequence there are lots of interesting comments from people in its Hacker News thread.

Setting seeds when running R simulations in parallel

I’ve written previously about running simulations in R and, a few years ago, about using Amazon Web Services to run simulations in R. I’m currently using the University of Bath’s high performance computing cluster, called Balena, to run computationally intensive simulations. To run a large number N of independent statistical simulations, I first generate the N input datasets, which is computationally fast to do. I then split the N datasets into M batches, and ask the cluster to run my analysis script M times. The i’th call to the script gets passed the integer i as the task id environment variable by the scheduler on the cluster. In my case the scheduler is Slurm, and the top of my R analysis script looks like:

slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the value to an integer
batch <- as.integer(slurm_arrayid)
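For concreteness, here is a sketch (the numbers, file names and analysis function are my own hypothetical choices, not from the original setup) of how the batch number can then pick out which of the N pre-generated datasets this job should analyse, assuming N is divisible by M:

```r
# Hypothetical sketch: map batch i to its share of the N datasets.
# In the real script, batch comes from SLURM_ARRAY_TASK_ID as above;
# it is hard-coded here so the sketch is self-contained.
batch <- 1

N <- 1000        # total number of pre-generated datasets (hypothetical)
M <- 50          # number of batches / cluster jobs (hypothetical)
per_batch <- N / M

# batch i analyses datasets (i - 1) * per_batch + 1 through i * per_batch
dataset_ids <- ((batch - 1) * per_batch + 1):(batch * per_batch)

for (id in dataset_ids) {
  # e.g. dat <- readRDS(sprintf("data/dataset_%04d.rds", id))
  # res <- analyseDataset(dat)   # hypothetical analysis function
}
```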

Now, my statistical analysis involves random number generation (bootstrapping and multiple imputation). It is therefore important to set R’s random number seed. As described by Tim Morris and colleagues (section 4.1.1), when using parallel computing it is important to set the seed carefully, by having each instance or ‘worker’ use a different random number stream. They mention the rstream package in R, but I have instead been making use of the built-in parallel package’s functionality. The documentation for the parallel package shows how this can be done:

RNGkind("L'Ecuyer-CMRG")
set.seed(2002) # something
M <- 16 ## start M workers
s <- .Random.seed
for (i in 1:M) {
  s <- nextRNGStream(s)
  # send s to worker i as .Random.seed
}

To operationalise this with my high performance cluster, I adapt this code as follows, so that my analysis program looks like:

slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the value to an integer
batch <- as.integer(slurm_arrayid)

#find seed for this batch
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(69012365) #set seed to something
s <- .Random.seed
for (i in 1:batch) {
  s <- nextRNGStream(s)
}
.GlobalEnv$.Random.seed <- s

Each independent batch will run this code. The i’th batch will generate the seed for the i’th random number stream, and then set R’s global environment .Random.seed value accordingly. Setting the seed to a common value in the initial set.seed call ensures that the whole process can be reproduced if required.
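As a quick sanity check of this logic (my own sketch, using only the built-in parallel package), one can verify on a local machine that the stream derived for a given batch is deterministic, and that different batches get different streams:

```r
library(parallel)

# Reproduce the per-batch seed derivation from the analysis script.
streamForBatch <- function(batch, seed = 69012365) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  s <- .Random.seed
  for (i in 1:batch) {
    s <- nextRNGStream(s)
  }
  s
}

s3a <- streamForBatch(3)
s3b <- streamForBatch(3)
s4  <- streamForBatch(4)

identical(s3a, s3b)  # TRUE: same batch always yields the same stream
identical(s3a, s4)   # FALSE: different batches use different streams
```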

Postscript: based on my reading of the parallel package documentation, I think what I have done above is correct, but should anyone know or think otherwise, please shout for my (and others’!) benefit.