Recently my colleague Ruth Keogh and I had a paper published: ‘Bayesian correction for covariate measurement error: a frequentist evaluation and comparison with regression calibration’ (open access here). The paper compares the popular regression calibration approach for handling covariate measurement error in regression models with a Bayesian approach. The two methods are compared from the frequentist perspective, and one of the arguments we make is that frequentists should more often consider using Bayesian methods.
At first sight this may seem a strange suggestion. The Bayesian approach by construction is not concerned with frequentist notions of repeated sampling properties. It requires the user to specify their a priori beliefs about the model parameters, and these are combined with the model likelihood function to give the posterior distribution for the parameters. In contrast, the method of maximum likelihood is very commonly used by frequentists, and often forms an important component of frequentist statistics courses.
From the frequentist perspective, a method can of course be evaluated in terms of bias and variance even if it was not constructed with a view to having good frequentist properties. Like Bayesian methods, maximum likelihood is not derived from a frequentist perspective. Nevertheless, we can evaluate its repeated sampling properties. In finite samples, the maximum likelihood estimator is not necessarily unbiased. For example, in logistic regression with a single covariate, the maximum likelihood estimator of the odds ratio is biased away from the null. But as the sample size grows, subject to regularity conditions the bias of the maximum likelihood estimator goes to zero. Moreover, its variance also goes to zero, so that the estimator is consistent. Lastly, among regular estimators in parametric models, the maximum likelihood estimator achieves the lowest possible asymptotic variance: it is an efficient estimator.
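To see the finite sample bias concretely, here is a small simulation sketch in Python (the group sizes, true log odds ratio and event probabilities are arbitrary illustrative choices, not numbers from the paper). It uses the fact that, with a single binary covariate, the logistic regression maximum likelihood estimate of the log odds ratio equals the sample log odds ratio from the 2x2 table:

```python
import numpy as np

rng = np.random.default_rng(0)
true_log_or = 1.0          # true log odds ratio (illustrative choice)
n_per_group = 25           # small study: 25 unexposed, 25 exposed
p0 = 0.5                                  # P(Y=1 | unexposed)
p1 = 1 / (1 + np.exp(-true_log_or))       # P(Y=1 | exposed), approx. 0.731

estimates = []
for _ in range(5000):
    y0 = rng.binomial(n_per_group, p0)    # events among the unexposed
    y1 = rng.binomial(n_per_group, p1)    # events among the exposed
    # skip the rare tables with a zero cell, where the MLE is infinite
    if y0 in (0, n_per_group) or y1 in (0, n_per_group):
        continue
    # with a single binary covariate, the logistic regression MLE of the
    # log odds ratio is the sample log odds ratio from the 2x2 table
    est = np.log(y1 * (n_per_group - y0) / (y0 * (n_per_group - y1)))
    estimates.append(est)

mean_est = np.mean(estimates)
print(f"true log OR: {true_log_or:.3f}, mean MLE: {mean_est:.3f}")
```

With these settings the average estimate sits above the true log odds ratio of 1, i.e. it is biased away from the null; the bias shrinks as the group sizes grow.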
Given these properties, it is perhaps unsurprising that the method of maximum likelihood is so popular, particularly among those who are comfortable with the frequentist paradigm. Perhaps less well known is the fact that, again under regularity conditions, Bayesian estimators enjoy exactly the same large sample properties as maximum likelihood estimators (see Chapter 4 of Gelman et al.'s Bayesian Data Analysis book). This is because as the sample size grows, the posterior is increasingly dominated by the likelihood function, with the prior having less and less impact.
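This large sample dominance of the likelihood is easy to see in a conjugate toy example (a Python sketch with made-up numbers: a binomial proportion whose true value is 0.3, estimated under a deliberately poor Beta(8, 2) prior centred at 0.8):

```python
import numpy as np

rng = np.random.default_rng(2)
p_true = 0.3
a, b = 8.0, 2.0   # deliberately poor Beta(8, 2) prior, centred at 0.8

gaps = []
for n in (10, 100, 10000):
    y = rng.binomial(n, p_true)
    mle = y / n                         # maximum likelihood estimate
    post_mean = (y + a) / (n + a + b)   # conjugate Beta posterior mean
    gaps.append(abs(post_mean - mle))
    print(f"n={n:6d}  MLE={mle:.4f}  posterior mean={post_mean:.4f}")
```

Even with a prior centred far from the truth, the posterior mean and the maximum likelihood estimate become practically indistinguishable once n is large.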
In practice we do not have infinitely large samples, but finite ones. In this situation, it may be possible, by choosing a sensible prior, to construct a Bayesian estimator that performs better (e.g. in terms of bias and variance) than maximum likelihood or other estimators. For example, consider the use of logistic regression models in small to moderately sized epidemiological studies. The paper linked earlier demonstrates how the maximum likelihood estimator of the odds ratio for an exposure effect will in general be biased away from the null. This finite sample bias could potentially be mitigated by using a weakly informative prior, which encodes the empirical observation that associations seen across epidemiological studies are typically small to moderately sized. For more on this, see these interesting papers by Gelman et al and Hamra et al.
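As a rough sketch of how this might work (Python, with a hypothetical two-group logistic regression setup and no measurement error; the group sizes, true log odds ratio of 1 and normal(0, 2.5^2) prior on the log odds ratio are all arbitrary illustrative choices, not settings from the paper), we can compare the average maximum likelihood estimate with the posterior mode (MAP) under a weakly informative prior:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
true_log_or = 1.0          # true log odds ratio (illustrative choice)
n_per_group = 25           # small study: 25 unexposed, 25 exposed
p0 = 0.5                                  # P(Y=1 | unexposed)
p1 = 1 / (1 + np.exp(-true_log_or))       # P(Y=1 | exposed)

def neg_log_post(theta, y0, y1, prior_sd):
    """Negative penalised log-likelihood: binomial likelihoods for the two
    exposure groups plus a normal(0, prior_sd^2) prior on the log OR."""
    a, b = theta  # intercept and log odds ratio
    ll = 0.0
    for y, eta in ((y0, a), (y1, a + b)):
        ll += y * eta - n_per_group * np.logaddexp(0.0, eta)
    ll += -0.5 * (b / prior_sd) ** 2  # weakly informative prior on b
    return -ll

ml_ests, map_ests = [], []
for _ in range(2000):
    y0 = rng.binomial(n_per_group, p0)
    y1 = rng.binomial(n_per_group, p1)
    # skip the rare tables with a zero cell, where the MLE is infinite
    if y0 in (0, n_per_group) or y1 in (0, n_per_group):
        continue
    # MLE of the log OR is the sample log odds ratio
    ml_ests.append(np.log(y1 * (n_per_group - y0) / (y0 * (n_per_group - y1))))
    # MAP estimate of the log OR under the weakly informative prior
    fit = minimize(neg_log_post, x0=[0.0, 0.0], args=(y0, y1, 2.5))
    map_ests.append(fit.x[1])

print(f"true log OR: {true_log_or:.3f}")
print(f"mean ML estimate:  {np.mean(ml_ests):.3f}")
print(f"mean MAP estimate: {np.mean(map_ests):.3f}")
```

The prior pulls each estimate toward the null, working against the small sample bias away from it; with a prior this diffuse the shrinkage is modest.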
Of course, whereas for a given model there is a single maximum likelihood estimator, there are infinitely many different Bayes estimators, corresponding to the infinitely many possible prior specifications. But I'm not sure this is an argument against the Bayesian approach for the frequentist. Andrew Gelman, I believe, has argued that the prior should be viewed as another part of the model specification. From this perspective, the prior enables the analyst to encode additional choices/specifications which have the potential to allow improved inferences. Of course, if the prior places most of its density far from the true value of the model parameter (i.e. it is a bad prior choice), the Bayes estimator may have worse finite sample frequentist properties than the maximum likelihood estimator. But this is a general feature of statistics: you may improve inferences by making stronger assumptions, but only so long as those stronger assumptions are correct.