What’s the difference between statistics and machine learning?

I had an interesting discussion at work today (among people I think would all call themselves statisticians!) about the distinction(s) between statistics and machine learning. It is something I am still not very clear about myself, and I have yet to find a satisfactory answer. It’s a topic that seems to get some statisticians particularly hot under the collar, especially when machine learning courses apparently claim that methods statisticians tend to think of as part of statistics are in fact part of machine learning.

This post is certainly not going to tell you what the difference between machine learning and statistics is. Rather, I hope that it spurs readers to help me understand their differences.

Historically I think it’s the case that machine learning algorithms were developed in computer science departments of universities, whereas statistics was developed within mathematics or statistics departments. But this is merely about the historical origins, rather than any fundamental distinction.

Machine learning (about which I know a lot less) tends, I think, to focus on algorithms, and a subset of these has as its objective to predict some outcome based on a set of inputs (or predictors, as we might call them in statistics). In contrast to parametric statistical models, these algorithms typically do not make rigid assumptions about the relationships between the inputs and the outcome, and can therefore perform well when the dependence of the outcome on the predictors is complex or non-linear. The potential to capture such complex relationships is, however, not unique to machine learning – within statistics we have flexible parametric and semiparametric models, and even non-parametric methods such as non-parametric regression.
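As a toy illustration of this point (a sketch of my own, not taken from any particular course or paper), even base R's non-parametric tools can capture a non-linear relationship that a rigid linear model misses:

set.seed(42)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)  # non-linear true relationship
fit_lm <- lm(y ~ x)                 # simple parametric (linear) model
fit_np <- loess(y ~ x)              # non-parametric regression
# residual mean squared error: the smoother fits far better here
c(linear = mean(residuals(fit_lm)^2), loess = mean(residuals(fit_np)^2))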

The Wikipedia page on machine learning states:

Machine learning and statistics are closely related fields in terms of methods, but distinct in their principal goal: statistics draws population inferences from a sample, while machine learning finds generalizable predictive patterns.

So statistics is about using sample data to draw inferences or learn about a wider population from which the sample has been drawn, whereas machine learning finds patterns in the data that can be generalised. It’s not clear from this quote alone what machine learning will generalise to, but the natural thing that comes to mind is some broader collection or population which is similar to the sample at hand. So this apparent distinction seems quite subtle. Indeed the Wikipedia page goes on to say:

A core objective of a learner is to generalize from its experience.[2][17] Generalization in this context is the ability of a learning machine to perform accurately on new, unseen examples/tasks after having experienced a learning data set. The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences) and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.

To me, when one starts saying that the training data is considered representative of the space of occurrences, this sounds remarkably similar to the notion of the training data being a sample from some larger population, as would often be assumed by statistical models.

An interesting short article in Nature Methods by Bzdok and colleagues considers the differences between machine learning and statistics. The key distinction they draw out is that statistics is about inference, whereas machine learning tends to focus on prediction. They acknowledge that statistical models can often be used both for inference and prediction, and that while some methods fall squarely in one of the two domains, some methods, such as bootstrapping, are used by both. They write:

ML makes minimal assumptions about the data-generating systems; they can be effective even when the data are gathered without a carefully controlled experimental design and in the presence of complicated nonlinear interactions. However, despite convincing prediction results, the lack of an explicit model can make ML solutions difficult to directly relate to existing biological knowledge.

The claim that ML methods can be effective even when the data are not collected through a carefully controlled experimental design is interesting. First, it seems to imply that statistics is useful mainly when the data come from an experiment, something which epidemiologists conducting observational studies or survey statisticians conducting national surveys would presumably take issue with. Second, it seems to suggest that ML can give useful predictions for the future with minimal assumptions about how the training data arose. This seems problematic, and I cannot see why the importance of how the data arose should differ depending on whether you use a statistical method or a machine learning method. For example, if we collect data on the association between an exposure (e.g. alcohol consumption) and an outcome (e.g. blood pressure) from an observational (non-experimental) study, I cannot see how machine learning can, without additional assumptions, overcome the probable issue of confounding.

As I wrote earlier, I do not have a well-formed view of the distinction between machine learning and statistics. My best attempt is the following: statistics starts with a model assumption, which could be more rigid (e.g. simple parametric models) or less so (e.g. semiparametric or nonparametric models), and which describes aspects of the data generating distribution in a way that answers a question of interest, or could be used for prediction for the population from which the sample has been drawn. Uncertainty about parameters in the model, or about predictions, can be quantified. Machine learning doesn’t assume a model, but is a collection of algorithms for building prediction rules or finding clusters in data. The prediction rules should work well at predicting future data. Quantifying uncertainty about predictions or clusters is presumably not possible, or at least harder, given the absence of a model.

Please add your views in a comment and help me understand the distinction(s).

I posted this to the website Hacker News last night, and it picked up some interest. As a consequence there are lots of interesting comments from people available on its Hacker News thread.

Setting seeds when running R simulations in parallel

I’ve written previously about running simulations in R and, a few years ago, about using Amazon Web Services to run simulations in R. I’m currently using the University of Bath’s high performance computing cluster, called Balena, to run computationally intensive simulations. To run a large number N of independent statistical simulations, I first generate the N input datasets, which is computationally fast to do. I then split the N datasets into M batches and ask the high performance cluster to run my analysis script M times. The i’th call to the script gets passed the integer i as a task id environment variable by the scheduler on the cluster. In my case the scheduler is Slurm, and the top of my R analysis script looks like:

slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the environment variable (a string) to an integer batch number
batch <- as.integer(slurm_arrayid)
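For concreteness, here is a hypothetical sketch (the numbers and file names are my own, purely for illustration) of how batch i might map to its share of the N datasets, assuming N is divisible by M and the datasets are stored as numbered .rds files:

# illustrative only: suppose N = 1600 datasets are split into M = 16 batches
N <- 1600
M <- 16
per_batch <- N / M
# indices of the datasets this batch is responsible for analysing
dataset_ids <- ((batch - 1) * per_batch + 1):(batch * per_batch)
# read this batch's datasets from disk (file naming scheme is hypothetical)
datasets <- lapply(dataset_ids, function(i) readRDS(sprintf("data_%04d.rds", i)))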

Now, my statistical analysis involves random number generation (bootstrapping and multiple imputation). It is therefore important to set R’s random number seed. As described by Tim Morris and colleagues (section 4.1.1), when using parallel computing it is important to set the seed carefully, by having each instance or ‘worker’ use a different random number stream. They mention the rstream package in R, but I have instead been making use of the built-in parallel package’s functionality. The documentation for the parallel package shows how this can be done:

library(parallel) # for nextRNGStream
RNGkind("L'Ecuyer-CMRG")
set.seed(2002) # something
M <- 16 ## start M workers
s <- .Random.seed
for (i in 1:M) {
  s <- nextRNGStream(s)
  # send s to worker i as .Random.seed
}

To operationalise this with my high performance cluster, I adapt this code as follows, so that my analysis program looks like:

slurm_arrayid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
# coerce the environment variable (a string) to an integer batch number
batch <- as.integer(slurm_arrayid)

# find the seed for this batch's random number stream
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(69012365) # master seed, common to all batches
s <- .Random.seed
# advance to the batch'th stream
for (i in 1:batch) {
  s <- nextRNGStream(s)
}
# set R's random number state to the start of this batch's stream
.GlobalEnv$.Random.seed <- s

Each independent batch will run this code. The i’th batch will generate the seed for the i’th random number stream, and then set R’s global environment .Random.seed value to the corresponding value. The initial set.seed call to a common value ensures that the whole process can be reproduced if required.
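As a quick sanity check (a sketch of my own, not part of the cluster scripts), one can verify locally that different batch numbers yield distinct random number streams, while re-running with the same batch number reproduces identical draws:

library(parallel)

# return the .Random.seed corresponding to the batch'th stream
seed_for_batch <- function(batch, master_seed = 69012365) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(master_seed)
  s <- .Random.seed
  for (i in 1:batch) s <- nextRNGStream(s)
  s
}

.GlobalEnv$.Random.seed <- seed_for_batch(1)
draws_batch1 <- rnorm(3)
.GlobalEnv$.Random.seed <- seed_for_batch(2)
draws_batch2 <- rnorm(3)
identical(draws_batch1, draws_batch2) # FALSE: batches use distinct streams
.GlobalEnv$.Random.seed <- seed_for_batch(1)
identical(draws_batch1, rnorm(3))     # TRUE: same batch reproduces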

Postscript: based on my reading of the parallel package documentation, I think what I have done above is correct, but should anyone know or think otherwise, please shout for my (and others’!) benefit.

Robustness of ANCOVA in randomised trials with unequal randomisation

In my previous post I wrote about a new paper in Biometrics which shows that when ANCOVA is used to analyse a randomised trial with adjustment for baseline covariates, not only is the treatment effect estimator consistent, but the usual model based standard error (SE) is also valid, irrespective of whether the regression model is correctly specified. As I wrote, these results were proved assuming that the trial used simple randomisation to the two groups, with equal probability of randomisation to each.

In a pre-print available on arXiv, I extend this paper’s results to the case where the randomisation probabilities are not equal. Although 1:1 randomisation is, I think, by far the most common approach, unequal randomisation is not that uncommon. In this situation, the point estimator of the treatment effect is still consistent – this is not affected by the unequal randomisation.

The analyses in the paper show that when the randomisation is not 1:1, the model based SE is no longer generally consistent if the outcome model is misspecified. It is valid if the true regression coefficients of the outcome on the covariates are the same in the two treatment groups and the error variances in the two groups are equal, but otherwise it is not in general. So, for example, even if the true regression coefficients are equal in the two groups, if the error variances are not, the model based SE is not valid. Alternatively, if the true regression coefficients differ between the two groups (i.e. there are interactions between treatment and some baseline covariates), again the model based SE would not in general be expected to be valid.

The impact of such invalidity in the SEs is that the type I error will not generally be controlled at the desired level, and confidence intervals will not have the correct coverage, even for large sample sizes. The results show that, depending on the configuration, the SEs can be biased upwards or downwards (see the pre-print for details).

These results mean that in trials with simple randomisation that is not 1:1, if one is concerned about the ANCOVA model being misspecified, the model based SE shouldn’t be used. Instead, robust sandwich SEs, which are widely available in statistical packages, are recommended. These provide asymptotically valid variance estimation under essentially arbitrary model misspecification.
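For concreteness, here is a minimal sketch (my own, not taken from the pre-print) of obtaining robust sandwich SEs for an ANCOVA analysis in R, using the sandwich and lmtest packages. The data generating mechanism below is purely illustrative: 2:1 randomisation, a treatment by covariate interaction, and unequal error variances, so that the fitted ANCOVA model is misspecified:

library(sandwich)
library(lmtest)

set.seed(1234)
n <- 500
z <- rbinom(n, 1, 2/3)                 # 2:1 (unequal) randomisation
x <- rnorm(n)                          # baseline covariate
# true model has an interaction and unequal error variances across groups
y <- z + x + 0.5 * z * x + rnorm(n, sd = 1 + z)
fit <- lm(y ~ z + x)                   # ANCOVA model (misspecified)
# usual model based SEs
summary(fit)$coefficients
# robust sandwich SEs
coeftest(fit, vcov. = vcovHC(fit, type = "HC3"))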

December 2019 – this work has now been published in Biometrics.