Miscellaneous – The Stats Geek

PhD on causal inference for competing risks data

November 13, 2022 by Jonathan Bartlett

Applications are invited for a 3-year PhD studentship from the ESRC UBEL DTP (UCL, Bloomsbury and East London Doctoral Training Partnership)

We are seeking applicants who would like to pursue PhD research on the project described below. This project is offered as part of the Longitudinal Analysis and Design topic under the Quantitative Social Science Pathway of the ESRC UBEL DTP. Successful applicants will be based in the Department of Medical Statistics at LSHTM. Further information on the funding scheme can be found at https://ubel-dtp.ac.uk/esrc-studentships/

3 year post-doc in Bath – Clinical trial estimands – from definition to estimation

April 3, 2020 / March 31, 2020 by Jonathan Bartlett

Applications are now open for a 3-year post-doc Research Associate position at the University of Bath. The position is funded by a UK Medical Research Council grant ‘Clinical Trial Estimands – from definition to estimation’.

The context

Clinical trials represent the gold standard for evaluating the effects of treatments or interventions. Nevertheless, many trials are affected by a variety of issues which complicate their design and analysis. Examples include patients discontinuing their randomised treatment or taking additional rescue medications. In other settings, such as cancer studies, where quality of life endpoints are important secondary outcomes, a non-trivial proportion of patients may die before the quality of life endpoint can be measured, leading to ‘missingness due to death’. In cardiovascular trials, primary interest may be in estimating the treatments’ effects on incidence of cardiovascular events, but patients may die from other causes during follow-up, leading to so-called competing risks.

In recent years there has been a growing recognition that such issues need careful consideration at both the design and analysis stages of a randomised trial. Within the world of pharmaceutical trials, this recognition has led to the publication of the ICH E9 estimand addendum. The addendum gives welcome focus to these issues and offers a framework for the definition of a clinical trial estimand in the presence of these issues (so-called intercurrent events). It does not, however, say very much about which statistical methods ought to be used to estimate different estimands from clinical trial data. Moreover, the addendum does not explicitly discuss causal inference concepts, although these are implicitly present in the document (e.g. the principal stratification method, which is mentioned).

The project and post-doc

The grant funding this post-doc position aims to address the question of how statistical methods can be used to estimate a variety of estimands in the presence of so-called intercurrent events. In particular, it seeks to apply the many developments made in the field of modern causal inference to the problem. These methods were mostly developed with the analysis of non-randomised observational studies in mind. In randomised trials, although treatment group is randomly assigned, the post-baseline intercurrent events that take place are not randomly assigned. As such, the randomised trial becomes like an observational study, with the special property that the initial assignment to treatment was random.

The 3 year position will be based at the Department of Mathematical Sciences at the University of Bath. The post holder will be supervised by myself and Dr Rhian Daniel, Cardiff University, an expert in causal inference methods. The project will also benefit from regular input from the statisticians at the pharmaceutical company AstraZeneca.

For further details about the position and how to apply, please go to the University of Bath jobs page. For informal enquiries about the position, please email me at j.w.bartlett@bath.ac.uk

What’s the difference between statistics and machine learning?

August 9, 2019 / August 8, 2019 by Jonathan Bartlett

I had an interesting discussion at work today (among people I think would all call themselves statisticians!) about the distinction(s) between statistics and machine learning. It is something I am still not very clear about myself, and I have yet to find a satisfactory answer. It’s a topic that seems to get some statisticians particularly hot under the collar, when machine learning courses apparently claim that methods statisticians tend to think are part of statistics are in fact part of machine learning:

When *linear* regression becomes machine learning….. And is repackaged in an online course….@f2harrell @FamedCelebrity @statsepi @ADAlthousePhD @AndrewPGrieve @stephensenn pic.twitter.com/tZnyPZVDbD
— ChristosArgyropoulos (@ChristosArgyrop) August 6, 2019

This post is certainly not going to tell you what the difference between machine learning and statistics is. Rather, I hope that it spurs readers of the post to help me understand their differences.

Historically I think it’s the case that machine learning algorithms were developed in computer science departments of universities, whereas statistics was developed within mathematics or statistics departments. But this is merely about the historical origins, rather than any fundamental distinction.

Machine learning (about which I know a lot less) tends I think to focus on algorithms, and a subset of these have as their objective to predict some outcome based on a set of inputs (or predictors as we might call them in statistics). In contrast to parametric statistical models, these algorithms typically do not make rigid assumptions about the relationships between the inputs and the outcome, and therefore can perform well when the dependence of the outcome on the predictors is complex or non-linear. The potential to capture such complex relationships is however not unique to machine learning – within statistical models we have flexible parametric / semiparametric, and even non-parametric methods such as non-parametric regression.
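To make the point about rigid assumptions concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the simulated relationship (the outcome is the sine of the input plus noise), the sample sizes, and the choice of k-nearest-neighbour averaging as a stand-in for a flexible method. It compares a straight-line fit with the flexible method on fresh data from the same source.

```python
import math
import random


def simulate(n, rng):
    """Draw n (x, y) pairs with a non-linear signal: y = sin(x) + noise (hypothetical setup)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares straight line: a rigid parametric model."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=10):
    """k-nearest-neighbour regression: average the outcomes of the k closest inputs."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def mse(predict, xs, ys):
    """Mean squared prediction error on the supplied data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(2019)
train_x, train_y = simulate(500, rng)
test_x, test_y = simulate(500, rng)  # fresh data from the same distribution

line_mse = mse(fit_line(train_x, train_y), test_x, test_y)
knn_mse = mse(fit_knn(train_x, train_y), test_x, test_y)
print(f"straight line test MSE:        {line_mse:.3f}")
print(f"10-nearest-neighbour test MSE: {knn_mse:.3f}")
```

Under these assumptions the straight line should predict noticeably worse, because no line can follow a sine curve. Note that nearest-neighbour averaging is itself a form of non-parametric regression, so the same example could equally be filed under statistics.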

The Wikipedia page on machine learning states:

Machine learning and statistics are closely related fields in terms of methods, but distinct in their principal goal: statistics draws population inferences from a sample, while machine learning finds generalizable predictive patterns.

So statistics is about using sample data to draw inferences or learn about a wider population from which the sample has been drawn, whereas machine learning finds patterns in the data that can be generalised. It’s not clear from this quote alone what machine learning will generalise to, but the natural thing that comes to mind is some broader collection or population which is similar to the sample at hand. So this apparent distinction seems quite subtle. Indeed the Wikipedia page goes on to say:

A core objective of a learner is to generalize from its experience. Generalization in this context is the ability of a learning machine to perform accurately on new, unseen examples/tasks after having experienced a learning data set. The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences) and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.

To me, when one starts saying that the training data is considered representative of the space of occurrences, this sounds remarkably similar to the notion of the training data being a sample from some larger population, as would often be assumed by statistical models.

An interesting short article in Nature Methods by Bzdok and colleagues considers the differences between machine learning and statistics. The key distinction they draw out is that statistics is about inference, whereas machine learning tends to focus on prediction. They acknowledge that statistical models can often be used both for inference and prediction, and that while some methods fall squarely in one of the two domains, some methods, such as bootstrapping, are used by both. They write:

ML makes minimal assumptions about the data-generating systems; they can be effective even when the data are gathered without a carefully controlled experimental design and in the presence of complicated nonlinear interactions. However, despite convincing prediction results, the lack of an explicit model can make ML solutions difficult to directly relate to existing biological knowledge.

The claim that ML methods can be effective even when the data are not collected through a carefully controlled experimental design is interesting. First it seems to imply that statistics is mainly useful only when the data are from an experiment, something which epidemiologists conducting observational studies or survey statisticians conducting national surveys would presumably take issue with. Second it seems to suggest that ML can give useful predictions for the future with minimal assumptions on how the training data arose. This seems problematic, and I cannot see why the importance of how the data arose should be different depending on whether you use a statistical method or a machine learning method. For example, if we collect data on the association between an exposure (e.g. alcohol consumption) and an outcome (e.g. blood pressure) from an observational (non-experimental) study, I cannot see how machine learning can without additional assumptions overcome the probable issue of confounding.
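The confounding concern can be illustrated with a small simulation. This is a sketch under assumptions invented for the example: a single confounder drives both the exposure and the outcome, the exposure has no causal effect on the outcome at all, and all relationships are linear with standard normal noise. The variable names and the nearest-neighbour predictor are hypothetical choices, not anything from the article by Bzdok and colleagues.

```python
import random


def slope(xs, ys):
    """Least-squares slope from regressing ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def residuals(ys, xs):
    """Residuals from a least-squares regression of ys on xs."""
    b = slope(xs, ys)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return [(y - mean_y) - b * (x - mean_x) for x, y in zip(xs, ys)]


def knn_predict(xs, ys, x0, k=200):
    """Average outcome among the k observations whose input is closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


rng = random.Random(60)
n = 2000
confounder = [rng.gauss(0.0, 1.0) for _ in range(n)]
exposure = [c + rng.gauss(0.0, 1.0) for c in confounder]
# The outcome depends on the confounder only: the true effect of exposure is zero.
outcome = [2.0 * c + rng.gauss(0.0, 1.0) for c in confounder]

# Both a regression and a nearest-neighbour predictor see an exposure-outcome association.
unadjusted = slope(exposure, outcome)
knn_gap = knn_predict(exposure, outcome, 1.0) - knn_predict(exposure, outcome, -1.0)

# Adjusting for the measured confounder (partialling it out of both variables)
# removes the association, but only because we assumed the confounder is measured.
adjusted = slope(residuals(exposure, confounder), residuals(outcome, confounder))

print(f"unadjusted slope:           {unadjusted:.2f}")
print(f"kNN prediction gap (+1/-1): {knn_gap:.2f}")
print(f"confounder-adjusted slope:  {adjusted:.2f}")
```

Both the regression and the nearest-neighbour predictor pick up a strong association between exposure and outcome, even though changing the exposure would do nothing here. The flexible method predicts well but is no closer to the causal answer; what removes the bias is the extra assumption that the confounder has been measured and adjusted for.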

As I wrote earlier, I do not have a well formed view of the distinction between machine learning and statistics. My best attempt is the following: statistics starts with a model assumption, which could be more rigid (e.g. simple parametric models) or less so (e.g. semiparametric or nonparametric models), and which describes aspects of the data generating distribution in a way that answers a question of interest or could be used for prediction for the population from which the sample has been drawn. Uncertainty about parameters in the model or about predictions can be quantified. Machine learning doesn’t assume a model, but is a collection of algorithms for building prediction rules or finding clustering in data. The prediction rules should work well at predicting future data. Quantifying uncertainty about predictions or clustering is presumably not possible, or at least harder, given the absence of a model.
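As an illustration of what quantifying uncertainty under a model looks like, here is a minimal sketch of a simple linear regression fitted by least squares, with the usual model-based standard error for the slope. The simulated data, the true slope of 2, and the use of the normal approximation (the 1.96 multiplier) for a 95% interval are all assumptions made for the example.

```python
import math
import random

rng = random.Random(62)
n = 200
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Hypothetical data generating model: y = 1 + 2x + standard normal noise.
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares estimates of the slope and intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual variance estimate with n - 2 degrees of freedom.
rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
sigma2 = rss / (n - 2)

# Model-based standard error and an approximate 95% confidence interval.
se_slope = math.sqrt(sigma2 / sxx)
lower = slope - 1.96 * se_slope
upper = slope + 1.96 * se_slope

print(f"slope estimate: {slope:.2f}")
print(f"standard error: {se_slope:.3f}")
print(f"approx. 95% CI: ({lower:.2f}, {upper:.2f})")
```

The standard error formula is only justified by the model assumptions (a linear mean, constant-variance independent errors), which is why this kind of statement about uncertainty comes naturally once a model has been assumed.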

Please add your views in a comment and help me understand the distinction(s).

I posted this to the website Hacker News last night, and it picked up some interest. As a consequence there are lots of interesting comments from people available on its Hacker News thread.