Machine learning vs. traditional modelling techniques

In the process of organising a conference session on machine learning, I’ve finally got around to reading the late Leo Breiman’s thought-provoking 2001 Statistical Science article “Statistical Modeling: The Two Cultures”. I highly recommend reading the paper, and the discussion that follows it. In the paper Breiman argues that statistics as a field should open its eyes to analysing data not only with traditional ‘data models’ (his terminology), by which he means standard (usually parametric) probabilistic models, but also with so-called algorithmic machine learning techniques.

My only real experience so far with machine learning techniques has been with random forest, first proposed by Breiman, in the context of multiple imputation for missing data, so my knowledge of the area is very limited. Nevertheless, for what it’s worth, the following are my main take-away points and thoughts having read the paper:

  • Breiman mainly focuses on the task of predicting an outcome Y from a (possibly high-dimensional) set of predictors X. He repeatedly says that the primary criterion for judging a model/method is its predictive accuracy. This seems to me far too narrow – there are many situations where the objective is not simply to predict an outcome as well as possible from a set of predictors.
  • Ill-advised practices using standard modelling techniques are highlighted (practices which, according to Breiman, are common), such as fitting parametric regression models and drawing conclusions about substantive questions without wider consideration of issues such as whether the observational data at hand can reasonably be expected to answer the causal question of interest. To me, however, this does not seem a legitimate criticism of ‘data models’ themselves, but of how they are sometimes (inappropriately) used.
  • A number of machine learning techniques are explained, and illustrated to have prediction accuracy superior to that of more classical parametric modelling techniques, such as logistic regression. (A minimal sketch of this kind of comparison is given after this list.)
  • For more complex problems, Breiman acknowledges that more elaborate data modelling techniques exist, but says these “become more cumbersome”, making reference to Bayesian techniques using MCMC. There is, however, no mention of methods which would fall into Breiman’s ‘data model’ category but which attempt to relax some of the assumptions made by parametric models, allowing one to obtain valid answers to the question of interest while being more robust to violations of certain assumptions. Cox’s proportional hazards model for time-to-event data is a case in point.
  • Standard parametric modelling techniques may give wrong conclusions if their assumptions are wrong, and the goodness-of-fit and model-checking techniques we have available typically have low power. Consequently, when we find no evidence of poor fit, we may often wrongly conclude that our model fits well. (A small simulation of this low power is sketched after this list.)
  • Although some results have been derived (and no doubt more have been since 2001, when the article was written) regarding the properties of machine learning techniques, the theory for them is much less developed than for the more standard probabilistic modelling methods. This is one aspect that concerned me when using random forest in the context of multiple imputation – if we cannot understand a method’s properties theoretically, even under ideal conditions, how can we place confidence in using it widely?
  • He compares a number of machine learning techniques in terms of their interpretability (of results or output) and prediction accuracy. He states that a:

    model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model. The goal is not interpretability, but accurate information.

    The random forest technique is rated highly in terms of its prediction accuracy, but poorly in terms of interpretability. When prediction is the objective, I would agree that interpretability – in terms of which predictors are most important, or how large the effect of a particular predictor is – matters less, or not at all. But there are many situations where we are not interested in just prediction of the outcome. For example, in analyses of observational studies where we are interested in estimating the causal effect of an exposure of interest on an outcome, and there are confounders or possibly effect modifiers, the task cannot simply be reduced to one of trying to best predict the outcome from the covariates.

  • Three data examples are discussed, in which the machine learning techniques outperform classical regression modelling. But these examples are all essentially prediction problems. As per my previous point, what would one do, if restricted to machine learning techniques, to, say, try to estimate the causal effect of a non-randomized exposure in an observational study?
  • The first of the three examples given is a survival analysis, but in fact the outcome is dichotomised to one of surviving vs. not surviving. In practice, patients in follow-up studies are often followed up for different durations, so that one cannot validly just create a binary outcome variable of death vs. no death. In the statistical literature there is a rich body of techniques for handling the problem of censoring in the outcome, and I cannot see how one could accommodate this issue without in some way using probability models. (A short survival sketch after this list illustrates the point.)
  • All of the machine learning techniques (and again, this comment is caveated with an acknowledgement of my large ignorance about the field!) seem to be about point prediction. But often we are interested in other aspects of the dependence of an outcome on predictors. For example, we may be interested in how the variability in an outcome varies as a function of predictors. Or, in clustered or multi-level data, we are often interested in decomposing the variation in the outcome into that which can be attributed to cluster effects (between-cluster variation) and within-cluster variation. It isn’t clear how such data, which have a complex dependency structure, can be adequately modelled using machine learning techniques. (A mixed-model sketch after this list shows the standard decomposition.)
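
To give a concrete flavour of the accuracy comparisons Breiman reports, here is a minimal sketch using scikit-learn on simulated data. The data-generating mechanism, sample size and tuning choices are my own illustrative assumptions, not Breiman’s examples.

```python
# A sketch comparing logistic regression and random forest on simulated
# data in which the outcome depends nonlinearly on the predictors, so a
# main-effects logistic regression is misspecified. Illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 2000, 10
X = rng.normal(size=(n, p))
logit = X[:, 0] * X[:, 1] + X[:, 2] ** 2 - 1
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

print("logistic regression test accuracy:", lr.score(X_te, y_te))
print("random forest test accuracy:      ", rf.score(X_te, y_te))
```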
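
On the low power of model checks, a small simulation sketch: a straight line is fitted to mildly quadratic data and checked with a RESET-style F-test on the squared fitted values. The effect size, sample size and number of simulations are arbitrary illustrative choices; the point is simply that mild misspecification frequently goes undetected.

```python
# Simulating the power of a simple model check: the true model is mildly
# quadratic, the fitted model is a straight line, and we check fit by
# adding the squared fitted values and F-testing the augmented model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, n_sim, alpha = 100, 1000, 0.05
rejections = 0
for _ in range(n_sim):
    x = rng.normal(size=n)
    y = x + 0.15 * x ** 2 + rng.normal(size=n)  # truth: quadratic
    X1 = sm.add_constant(x)
    fit1 = sm.OLS(y, X1).fit()                  # misspecified straight line
    X2 = np.column_stack([X1, fit1.fittedvalues ** 2])
    fit2 = sm.OLS(y, X2).fit()
    p_value = fit2.compare_f_test(fit1)[1]      # F-test of the added term
    rejections += p_value < alpha
print("estimated power of the check:", rejections / n_sim)
```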
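
On censoring, a sketch of what dichotomising a survival outcome throws away. The data are simulated under illustrative assumptions, and lifelines’ CoxPHFitter is used simply as one familiar ‘data model’ that accommodates unequal follow-up.

```python
# Simulated censored survival data: each subject's event time may be cut
# short by an independent censoring time, so a crude alive/dead binary
# outcome is ill defined. A Cox model uses the censored times validly.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
event_time = rng.exponential(scale=np.exp(-0.5 * x))   # hazard rises with x
censor_time = rng.exponential(scale=1.0, size=n)       # independent censoring
df = pd.DataFrame({
    "x": x,
    "time": np.minimum(event_time, censor_time),
    "event": (event_time <= censor_time).astype(int),
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()  # log hazard ratio for x should be roughly 0.5
```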
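
Finally, on clustered data, a minimal random-intercept sketch with statsmodels, decomposing outcome variation into between- and within-cluster components. The cluster structure and variance values are again illustrative assumptions.

```python
# A random-intercept model on simulated clustered data: the between-cluster
# variance is 1 and the within-cluster (residual) variance is 4, and the
# mixed model recovers both, plus the intraclass correlation.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_clusters, per_cluster = 50, 20
u = rng.normal(scale=1.0, size=n_clusters)              # cluster effects
cluster = np.repeat(np.arange(n_clusters), per_cluster)
y = u[cluster] + rng.normal(scale=2.0, size=n_clusters * per_cluster)
df = pd.DataFrame({"y": y, "cluster": cluster})

fit = smf.mixedlm("y ~ 1", df, groups=df["cluster"]).fit()
between = float(fit.cov_re.iloc[0, 0])  # between-cluster variance
within = fit.scale                      # within-cluster variance
print("between:", between, "within:", within,
      "ICC:", between / (between + within))
```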

I think that machine learning techniques likely have lots to offer the data analyst/statistician, and I am keen to explore them further. Here I would of course agree with Breiman that we must open ourselves to the possibility of using a wider tool set. Nevertheless, I did not really come away convinced that our large-scale use of probabilistic parametric/semiparametric/nonparametric models, across many different disciplines, should be so radically changed in favour of machine learning techniques. I can see that the latter should certainly play an increasing role in our analyses in certain situations (e.g. when good prediction is the aim), but there seem to be a whole host of types of analyses where I can’t see how they would be able to help answer the substantive question of interest. This may of course simply be due to my ignorance about the techniques, or my failure to conceive of new adaptations of them to handle some of the issues I’ve discussed.

3 thoughts on “Machine learning vs. traditional modelling techniques”

  1. Thanks for this very useful summary, Jonathan. Just a small comment about what you say about making causal inferences about the effect of a particular exposure on an outcome from observational data not fitting into this class of prediction problems. You are right, of course. However, I imagine many would argue that, after some important “pre-processing” on the part of the user, causal inference problems do amount to prediction problems. For example, suppose we want to estimate the marginal causal effect of an exposure X on an outcome Y controlling for a set of confounders C. We can either take a traditional approach of predicting Y from C, first in the exposed and then in the unexposed, and then averaging the difference in predictions over the distribution of C; or we can take a propensity score-based approach, first predicting X given C to estimate the propensity score, and then comparing the mean of Y in the exposed and unexposed after some suitable adjustment for the estimated propensity score. In either case, the majority of the data analysis boils down to a prediction problem (either [Y|C,X=1] and [Y|C,X=0], or [X|C]) – would you agree? Mark van der Laan’s group at UC Berkeley have done a lot of work on incorporating machine learning techniques into causal inference.
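
For concreteness, here is a rough sketch of both routes on simulated data. The data-generating mechanism and the simple linear/logistic working models are purely illustrative assumptions, with inverse probability weighting standing in for the “suitable adjustment” by the propensity score; the prediction steps are exactly where machine learning could be slotted in.

```python
# Both 'pre-processed' causal analyses reduce to prediction problems:
# (a) outcome regression (g-computation): predict Y given (C, X), then
#     average the difference in predictions under X=1 and X=0 over C;
# (b) propensity scores: predict X given C, then weight the outcomes.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(4)
n = 5000
C = rng.normal(size=(n, 3))                              # confounders
X = rng.binomial(1, 1 / (1 + np.exp(-C.sum(axis=1))))    # exposure depends on C
Y = 1.0 * X + C.sum(axis=1) + rng.normal(size=n)         # true causal effect = 1

# (a) outcome regression, standardised over the distribution of C
m = LinearRegression().fit(np.column_stack([C, X]), Y)
m1 = m.predict(np.column_stack([C, np.ones(n)]))   # predictions with X set to 1
m0 = m.predict(np.column_stack([C, np.zeros(n)]))  # predictions with X set to 0
print("g-computation estimate:", (m1 - m0).mean())

# (b) inverse probability weighting by the estimated propensity score
ps = LogisticRegression().fit(C, X).predict_proba(C)[:, 1]
print("IPW estimate:", (Y * X / ps).mean() - (Y * (1 - X) / (1 - ps)).mean())
```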

    • Thank you Rhian! Two comments in reply:

      1) Could one not argue that the causal theory which tells you that, under certain assumptions, you can estimate causal effects using certain approaches is itself built on the sort of ‘data models’ that Breiman seemed to be against? Now you might say that this model can be almost entirely nonparametric, and so isn’t really a ‘model’. But the theory that gets you to the conclusion that, e.g., you can estimate the causal effect using propensity scores is very much based on thinking about the underlying causal mechanisms between the variables in the study, and this, it seems to me, goes against the thrust of Breiman’s message. As you say, the machine learning prediction techniques may be very useful as a tool within the chain, as they (potentially) are within multiple imputation for missing data.

      2) In the case of outcome modelling for estimating an exposure’s effect, is it always just a prediction problem? When you write [Y|C,X=1], I think you mean modelling the conditional distribution, rather than just, say, the conditional mean of Y. Although the latter may suffice for certain causal estimands, is it not the case that, in general, in the outcome modelling approach one must specify a correct model for the conditional distribution of the outcome given exposure and confounders? If so, as far as I know (which, as I’ve said, is not very far!), the machine learning techniques are about point prediction for the outcome, rather than being models for the actual outcome distribution.

      • Hi Jonathan, on (2) yes, in general you would need the whole distributions [Y|C,X=1] and [Y|C,X=0], so I see what you mean. However, for many parameters often targeted, e.g. E(Y^1)-E(Y^0), you only need point predictions of E(Y|C_i,X=1) and E(Y|C_i,X=0). And then maybe machine learning techniques have a part to play, but only, I agree, as one part of a chain.
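
In symbols (my notation, not from the thread): for E(Y^1)-E(Y^0), a plug-in standardisation estimator needs only a point predictor m̂_x of E(Y|C,X=x), however that predictor was fitted:

```latex
% Plug-in (standardisation) estimator of E(Y^1) - E(Y^0): only the point
% predictions \hat{m}_x(C_i) of E(Y \mid C = C_i, X = x) are required,
% not the full conditional distributions.
\[
  \widehat{E(Y^1) - E(Y^0)}
    \;=\; \frac{1}{n} \sum_{i=1}^{n}
          \left\{ \hat{m}_1(C_i) - \hat{m}_0(C_i) \right\},
  \qquad
  \hat{m}_x(c) \;\approx\; E(Y \mid C = c, X = x).
\]
```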
