In the process of organising a conference session on machine learning, I’ve finally got around to reading the late Leo Breiman’s thought-provoking 2001 Statistical Science article “Statistical Modeling: The Two Cultures”. I highly recommend reading the paper, and the discussion that follows it. In the paper Breiman argues that statistics as a field should open its eyes to analysing data not only with traditional ‘data models’ (his terminology), by which he means standard (usually parametric) probabilistic models, but also with so-called algorithmic machine learning techniques, which he argues should be used much more widely.
My only real experience so far with machine learning techniques has been with random forest, first proposed by Breiman, in the context of multiple imputation for missing data. So my knowledge of the area is very limited. Nevertheless, for what it’s worth, the following are my main take-away points and thoughts having read the paper:
- Breiman mainly focuses on the task of predicting an outcome Y from a (possibly high-dimensional) set of predictors X. He repeatedly says that the primary criterion for judging a model/method is its predictive accuracy. This to me seems far too narrow – there are many situations where the objective is not simply to best predict an outcome using a set of predictors.
- Breiman highlights ill-advised practices using standard modelling techniques (which are common, according to him), such as fitting parametric regression models and drawing conclusions about substantive questions without wider consideration of issues such as whether the observational data at hand can reasonably be expected to answer the causal question of interest. To me, however, this is not a legitimate criticism of ‘data models’ themselves, but rather of how they are sometimes (inappropriately) used.
- A number of machine learning techniques are explained, and shown to have prediction accuracy superior to that of more classical parametric modelling techniques, such as logistic regression.
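To make the kind of comparison Breiman describes concrete, here is a little sketch of my own (not an example from the paper), comparing cross-validated prediction accuracy of logistic regression and a random forest on simulated data using scikit-learn. The dataset and all settings are purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# simulated classification problem: 20 predictors, 10 of them informative
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=1)

logit = LogisticRegression(max_iter=1000)
forest = RandomForestClassifier(n_estimators=200, random_state=1)

# 5-fold cross-validated accuracy for each method
acc_logit = cross_val_score(logit, X, y, cv=5, scoring="accuracy").mean()
acc_forest = cross_val_score(forest, X, y, cv=5, scoring="accuracy").mean()
print(f"logistic regression: {acc_logit:.3f}, random forest: {acc_forest:.3f}")
```

Which method wins will of course depend entirely on how the data were generated; the point is only that predictive accuracy gives a single, easily computed yardstick for this kind of head-to-head comparison.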
- For more complex problems, Breiman acknowledges that more elaborate data modelling techniques exist, but says these “become more cumbersome”. He makes reference to Bayesian techniques using MCMC. But there is no mention of methods which would fall into Breiman’s ‘data model’ category but which attempt to relax some of the assumptions made by parametric models, in order to allow one to obtain valid answers to the question of interest while having more robustness to violations of certain assumptions. Cox’s proportional hazards model for time to event data is a case in point.
- Standard parametric modelling techniques may give wrong conclusions if their assumptions are wrong, and the goodness of fit and model checking techniques we have available typically have low power. Consequently, when we find no evidence of poor fit, we may often wrongly conclude that our model fits well.
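To illustrate this low power point, here is a toy simulation of my own (not from the paper): data are generated with a mild quadratic curvature, a straight line is fitted, and a RESET-style check (adding the squared fitted values as an extra regressor and testing its coefficient) is applied at the 5% level. Despite the linear model being genuinely misspecified, the check detects the misfit only a minority of the time:

```python
import numpy as np

rng = np.random.default_rng(3)

def reset_rejects(n=50, curve=0.15):
    """Fit a straight line to mildly quadratic data, then apply a
    RESET-style misspecification check at the (approximate) 5% level."""
    x = rng.normal(size=n)
    y = x + curve * x**2 + rng.normal(size=n)  # truth has curvature
    X1 = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]
    fitted = X1 @ beta
    # augment the design with squared fitted values and refit
    X2 = np.column_stack([X1, fitted**2])
    b = np.linalg.lstsq(X2, y, rcond=None)[0]
    resid = y - X2 @ b
    sigma2 = resid @ resid / (n - 3)
    cov = sigma2 * np.linalg.inv(X2.T @ X2)
    t = b[2] / np.sqrt(cov[2, 2])  # t-statistic for the added regressor
    return abs(t) > 1.96

# proportion of simulated datasets in which the misfit is detected
power = np.mean([reset_rejects() for _ in range(500)])
print(f"rejection rate: {power:.2f}")
```

With these (arbitrary) settings the rejection rate comes out well below one, so in most samples we would see no evidence of poor fit and might wrongly conclude the linear model is adequate.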
- Although some results have been derived (and no doubt more have been since 2001 when the article was written) regarding the properties of machine learning techniques, theory for them is much less developed than for the model standard probabilistic modelling methods. This is one aspect that concerned me when using random forest in the context of multiple imputation – if one is not able to theoretically understand a method’s properties even under ideal conditions, how can we place confidence in using the method widely?
- He compares a number of machine learning techniques in terms of their interpretability (of results or output) and prediction accuracy. He states that a:

> model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model. The goal is not interpretability, but accurate information.
The random forest technique is rated highly in terms of its prediction accuracy, but poorly in terms of interpretability. When prediction is the objective, I would agree that the lack of interpretability, in terms of which predictors are most important, or how large the effect of a particular predictor is, matters less or not at all. But there are many situations where we are not interested in just prediction of the outcome. For example, in analyses of observational studies where we are interested in estimating the causal effect of an exposure of interest on an outcome, and there are confounders or possibly effect modifiers, the task cannot simply be reduced to one of trying to best predict the outcome from the covariates.
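As a concrete illustration of what limited interpretability means here (my own sketch using scikit-learn, not an example from the paper): a random forest does report variable ‘importances’, but these only rank predictors; unlike a regression coefficient, they say nothing about the size or direction of a predictor’s effect:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# illustrative data: 5 predictors, only 2 of them informative
X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=2, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# importances sum to 1 and rank the predictors, but give no effect
# size or direction, and no standard error or confidence interval
for i, imp in enumerate(forest.feature_importances_):
    print(f"x{i}: {imp:.3f}")
```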
- Three data examples are discussed, in which the machine learning techniques outperform classical regression modelling. But these examples are all essentially prediction problems. As per my previous point, what would one do, if restricted to machine learning techniques, to say try and estimate the causal effect of a non-randomized exposure in an observational study?
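A toy simulation of my own (not Breiman’s) of why pure prediction is not enough here: when a confounder drives both exposure and outcome, the unadjusted exposure–outcome association, however well it predicts, is simply the wrong answer to the causal question, whereas adjusting for the confounder recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
c = rng.normal(size=n)                       # confounder
a = c + rng.normal(size=n)                   # exposure depends on confounder
y = 1.0 * a + 2.0 * c + rng.normal(size=n)   # true causal effect of a is 1

# unadjusted: regress y on a alone (best linear predictor from a)
Xa = np.column_stack([np.ones(n), a])
naive = np.linalg.lstsq(Xa, y, rcond=None)[0][1]

# adjusted: regress y on a and the confounder c
Xac = np.column_stack([np.ones(n), a, c])
adjusted = np.linalg.lstsq(Xac, y, rcond=None)[0][1]

# naive is close to 2 (badly biased), adjusted close to the true value 1
print(f"unadjusted: {naive:.3f}, adjusted: {adjusted:.3f}")
```

The unadjusted coefficient here is arguably the better pure predictor of Y from A alone, yet it is roughly double the true causal effect, which is exactly why predictive accuracy alone cannot be the criterion in such analyses.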
- The first of the three examples given is a survival analysis, but in fact the outcome is dichotomised to one of surviving vs. not surviving. In practice, in follow-up studies patients are often followed up for different lengths of time, such that one cannot (validly) simply create a binary outcome variable of death vs. no death. In the statistical literature there is a rich body of techniques for handling the problem of censoring in the outcome. I cannot see how one could accommodate this issue without in some way using probability models.
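To illustrate (again a toy simulation of my own): with censored follow-up, the naive binary death/no-death proportion is biased, whereas the Kaplan–Meier product-limit estimator, which explicitly accounts for censoring (assuming it is independent of survival), recovers the true survival probability:

```python
import numpy as np

def kaplan_meier(time, event, t):
    """Product-limit estimate of S(t), where event=1 indicates an
    observed death and event=0 indicates censoring."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    n = len(time)
    surv = 1.0
    for i in range(n):
        if time[i] > t:
            break
        if event[i] == 1:
            at_risk = n - i          # number still at risk at this time
            surv *= 1 - 1 / at_risk
    return surv

rng = np.random.default_rng(1)
n = 20_000
true_t = rng.exponential(scale=10.0, size=n)   # true survival times
cens_t = rng.uniform(0, 20, size=n)            # independent censoring times
time = np.minimum(true_t, cens_t)              # observed follow-up time
event = (true_t <= cens_t).astype(int)

t = 5.0
# naive: treat anyone not observed to die by t as a "survivor" (biased up)
naive = 1 - np.mean((time <= t) & (event == 1))
km = kaplan_meier(time, event, t)
# true S(5) = exp(-0.5) ≈ 0.607; the naive proportion overshoots it
print(f"naive: {naive:.3f}, Kaplan-Meier: {km:.3f}")
```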
- All of the machine learning techniques (and again, this comment is caveated with an acknowledgement of my large ignorance about the field!) seem to be about point prediction. But often we are interested in other aspects of the dependence of an outcome on predictors. For example, we may be interested in how the variability in an outcome varies as a function of predictors. Or in clustered or multi-level data, we are often interested in decomposing the variation in the outcome into that which can be attributed to cluster effects (between cluster variation) and within cluster variation. It isn’t clear how such data, which have a complex dependency structure, can be adequately modelled using machine learning techniques.
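For example (an illustrative simulation of mine, with balanced clusters), the classical one-way ANOVA decomposition recovers the between- and within-cluster variance components and hence the intra-cluster correlation, a quantity that a point-prediction algorithm has no obvious way of targeting:

```python
import numpy as np

rng = np.random.default_rng(2)
n_clusters, m = 500, 20
u = rng.normal(scale=2.0, size=n_clusters)        # cluster effects, variance 4
e = rng.normal(scale=1.0, size=(n_clusters, m))   # within-cluster noise, variance 1
y = u[:, None] + e                                # balanced clustered outcomes

# method-of-moments (one-way ANOVA) decomposition for balanced clusters
within = y.var(axis=1, ddof=1).mean()              # estimates sigma_w^2
between = y.mean(axis=1).var(ddof=1) - within / m  # estimates sigma_b^2
icc = between / (between + within)                 # intra-cluster correlation

# should be close to 1, 4 and 0.8 respectively
print(f"within: {within:.2f}, between: {between:.2f}, ICC: {icc:.2f}")
```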
I think that machine learning techniques likely have lots to offer the data analyst/statistician, and I am keen to explore them further. Here I would of course agree with Breiman that we must open ourselves to the possibility of using a wider tool set. Nevertheless, I did not really come away particularly convinced that our large scale use of probabilistic parametric/semiparametric/nonparametric models, across many different disciplines, should be so radically changed in favour of using machine learning techniques. I can see that the latter certainly should play an increasing role in our analyses in certain situations (e.g. when good prediction is the aim), but there seem to be a whole host of types of analyses where I can’t see how they would be able to help answer the substantive question of interest. This may simply be due to my ignorance about the techniques of course, or my failure to conceive of new adaptations of them to handle some of the issues I’ve discussed.