On “The fallacy of placing confidence in confidence intervals”

Note: if you read this post, make sure to read the comments/discussion below it with Richard Morey, author of the paper in question, who put me straight on a number of points.

Thanks to Twitter I came across the latest draft of a very nicely written and thought-provoking paper, “The fallacy of placing confidence in confidence intervals”, by Morey, Rouder, Hoekstra, Lee and Wagenmakers. The paper aims to show why frequentist confidence intervals do not possess a number of properties that researchers often believe they do. In contrast, the authors show that Bayesian credible intervals do possess these desired properties, and they advocate replacing confidence intervals with Bayesian credible intervals.

The fundamental fallacy
Their paper is very nice, and I highly recommend reading it, both in general and so that the following makes sense. I am personally becoming more of a Bayesian by the day, so this post is not a defence of confidence intervals in general. Nonetheless, I do take issue somewhat with the authors’ ‘fundamental confidence fallacy’, which I think comes down to different perspectives on the meaning of probability. My view is probably a consequence of my own ignorance and misunderstanding, but I shall describe it here in any case, so that I might be corrected in my thinking.

The authors describe this fundamental fallacy as:

“If the probability that a random interval contains the true value is X%, then the plausibility or probability that a particular observed interval contains the true value is X%,” or “We have X% confidence that the observed interval contains the true value.”

The authors describe an example situation (locating a submarine), with a single unknown parameter \theta. They then describe a number of different procedures for constructing a 50% confidence interval for \theta. In the particular setup they consider, the different confidence intervals are nested within each other. They then argue (following Fisher apparently):

“If all intervals had a 50% probability of containing the true value, then all the probability must be contained in the shortest of the intervals. Because each procedure is by itself a 50% procedure, the procedure which chooses the shortest of 50% intervals will contain the true value less than 50% of the time. Hence believing the FCF (fundamental confidence fallacy) results in a logical contradiction.”

This quote begins “If all intervals had a 50% probability of containing the true value”. First, and as the authors clearly understand, given the observed datasets and the calculated CIs, in truth each CI either does or does not contain the true value. Thus if we state that an interval has a 50% probability of containing the truth, we must be clear about what we mean by that probability. A Bayesian defines probability as a measure of certainty or belief and, as far as I understand it, does not make reference to long-run frequencies. In contrast, a frequentist defines probabilities in terms of an imagined repetition of experiments. In the case of a 50% CI, the frequentist means that in 50% of repeated experiments the CI would contain the true value.
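
To make the frequentist reading concrete, here is a minimal simulation sketch of my own (not from the paper), assuming a normal model with known variance and a central 50% interval for the mean:

```python
import numpy as np
from scipy.stats import norm

# Frequentist reading of a 50% CI: over many repeated experiments,
# roughly 50% of the computed intervals contain the true value.
rng = np.random.default_rng(1)
theta, sigma, n, reps = 3.0, 1.0, 10, 100_000
z = norm.ppf(0.75)  # half-width multiplier for a central 50% interval

x = rng.normal(theta, sigma, size=(reps, n))
xbar = x.mean(axis=1)
half = z * sigma / np.sqrt(n)
covered = (xbar - half <= theta) & (theta <= xbar + half)
print(covered.mean())  # approximately 0.5
```

Each individual interval either does or does not contain theta; the 50% is a property of the procedure across the repetitions.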

After some deductions, the authors’ argument ends “the procedure which chooses the shortest of 50% intervals will contain the true value less than 50% of the time”. What do the authors mean here by “50% of the time”? It sounds to me as if they are essentially calibrating the 50% probability statement by reference to a long run sequence of experiments. If this is indeed the case, and the procedure they have chosen (the one which produces the shortest intervals) is indeed a 50% CI, then, contrary to their conclusion, it will surely contain the true value in 50% of repeated experiments.

The authors then go on to show, in a different way, why the fallacy cannot be true. In the submarine example (Fig 1b), they give a dataset where, by the nature of the problem (read the paper to see the details), you can almost with certainty logically infer the value of \theta. Thus with this dataset you could logically deduce (or obtain the same answer by using the likelihood function) the true value of \theta. The authors point out, however, that three of the CIs they consider encompass the likelihood interval (the range of values for which the likelihood function is non-zero). They then argue:

“…meaning that there is 100% certainty that these 50% confidence intervals contain the hatch. Reporting 50% certainty in an interval that surely contains the parameter would clearly be a mistake”

Again, this ‘proof’ of the fallacy being false seems problematic, in the following way. If someone gives you one of these three 50% CIs, and you know nothing else, then you might (although the authors wouldn’t!) state that your interval has a 50% probability (in the long-run sense) of containing the truth, and I think this is correct. Now suppose someone then explains to you that, because of the structure of the problem and the data observed, it is in fact possible to logically deduce that the true value of the parameter is guaranteed to be included in your 50% CI. In this case it would be ridiculous to just report the 50% CI, rather than reporting that we can in fact deduce the true value of the parameter. Nonetheless, in my view this does not invalidate the 50% probability statement that was made before this clever person came along and explained that you could in fact deduce the true value of the parameter here. Basing inferences on the likelihood function or on a Bayesian analysis, one would be able to conclude the value of \theta with certainty, while a researcher using only one of the three aforementioned CIs would not. This example, and no doubt others, shows that frequentist methods can be sub-optimal as a method of inference, and I entirely agree with that conclusion. However, for me this does not render the ‘fundamental fallacy’ false.
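
To see how the deduction works, here is a small sketch of my own, assuming (as I read the submarine example) that each of two observations falls uniformly within 5 units of the true location \theta; the dataset below is hypothetical:

```python
import numpy as np

# In this uniform location model the likelihood is non-zero only for theta
# with every observation inside (theta - 5, theta + 5), i.e. for
# theta in (max(x) - 5, min(x) + 5).
x = np.array([-4.9, 4.9])  # made-up dataset with widely separated observations

lik_lo, lik_hi = x.max() - 5, x.min() + 5
print(f"likelihood interval: ({lik_lo:.2f}, {lik_hi:.2f}), width {lik_hi - lik_lo:.2f}")
# Width 0.2: theta is pinned down almost exactly, so any 50% CI that covers
# this whole interval is certain to contain theta for this particular dataset.
```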

On giving up confidence intervals and instead being a Bayesian
The paper concludes by arguing that giving up CIs is highly advisable, in that we would not lose much but would gain a lot by using Bayesian inference. I think I agree that one does gain a lot by being a Bayesian. However, in practice it is arguably not as simple to be a Bayesian as the authors seem to imply. The difficulty of course is that to perform a Bayesian analysis we must, in addition to the model, specify priors for the parameters, which is sometimes easier said than done. That is not to say that specifying priors is impossible or always difficult, but simply that prior specification is not a piece of cake, and sometimes (with small datasets or little information about the parameters) it materially affects posterior inferences. Moreover, while for simple models it may be relatively straightforward to think about what is reasonable to specify as a prior, for more complex models this seems to me to become much more problematic. For example, in a complex model of longitudinal data, using random effects to model trajectories with cubic splines, what would be my prior for the variances of the random effects and their correlations? I’m really not sure!
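
As a hedged illustration of the point that the prior can matter with small datasets, here is a minimal conjugate normal-normal sketch of my own (the data and prior settings are arbitrary, not from the paper):

```python
import numpy as np

# Conjugate normal-normal model: x_i ~ N(theta, sigma^2) with sigma known,
# prior theta ~ N(m0, s0^2). The posterior mean is a precision-weighted
# average of the prior mean and the sample mean, so with few observations
# the choice of prior can noticeably shift the posterior.
def posterior(x, sigma, m0, s0):
    prec = len(x) / sigma**2 + 1 / s0**2          # posterior precision
    mean = (x.sum() / sigma**2 + m0 / s0**2) / prec
    return mean, np.sqrt(1 / prec)

x = np.array([1.8, 2.4, 2.1])                     # small made-up dataset, sigma assumed known
for m0, s0 in [(0.0, 10.0), (0.0, 0.5)]:          # vague prior vs. fairly tight prior
    m, s = posterior(x, sigma=1.0, m0=m0, s0=s0)
    print(f"prior N({m0}, {s0}^2): posterior mean {m:.2f}, sd {s:.2f}")
```

With the vague prior the posterior mean sits close to the sample mean (about 2.1), whereas the tight prior centred at zero pulls it down to about 0.9; with more data the two would agree far more closely.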

5 thoughts on “On “The fallacy of placing confidence in confidence intervals””

  1. I think, based on what you say (“If this is indeed the case, and the procedure they have chosen (the one which produces the shortest intervals) is indeed a 50% CI, then, contrary to their conclusion, it will surely contain the true value in 50% of repeated experiments.”), that there is a misunderstanding about the “choose the shortest interval” example. I’ll outline the logic in more detail here, to show how the fallacy arises (as it must, due to the frequentist problem of reference classes).

    We can expand the logic of the Fundamental Confidence Fallacy as follows:
    1. If I apply an X% confidence procedure C in an infinite sequence of samples, then the proportion of intervals that contain the true value approaches X% (true by definition).
    2. Of course, the infinite sequence above is merely theoretical. I computed the interval (L, U) for *these* observed data (L and U are specific numbers). (also true)
    3. I note that if I had been applying the procedure in the infinite sequence and came across *these* data, I would compute the CI (L, U). (also true)
    4. I identify the long-run “probability” X% with the observed interval (L, U): That is, “The probability that the observed interval (L, U) contains the true value is X%.” (wrong)

    Step 4 is the problematic one. First, I will note that the word “probability” is in quotes. This is because the word I choose here doesn’t matter. I could pick “bunkiness”. The problem lies not fundamentally with the word I use, but rather with the reference class problem, which exists regardless.

    Let’s do the nested CI example. In the nested case, we have four 50% CIs: that is, all four were computed from procedures that contain the true value 50% of the time. For any one of these intervals, I can use the above logic to say that “The probability that the observed interval (L, U) contains the true value is X%” where L,U are the endpoints of the intervals. This is *already* strange, because there are four different intervals I can say this about. But suppose we happen to be using the SD interval (and, incidentally, the SD interval is the shortest among them, though this doesn’t matter for now). We say, on the basis of the fact that the SD interval is a 50% CI, that “there is a 50% probability that (L,U) contains the true value” (where L,U are the endpoints of the SD interval).

    Now, consider the following procedure. We compute all four intervals, then choose the shortest interval as our confidence interval. Suppose that in this case, it was the SD interval (though it doesn’t matter). In other samples, it might have been the Bayes interval, or the UMP interval. If I follow this procedure, my CIs will contain the true value in *less than* 50% of samples. This can be confirmed by simulation, or by simply noting that the procedure is the same as applying one of the four procedures but *shortening* the interval a large proportion of the time. It must have <50% probability of containing the true value, and thus it is a <50% confidence procedure.
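
    For instance, here is a quick simulation sketch along these lines (a simplified two-procedure version of the setup rather than the paper’s four intervals: two observations uniform within ±5 of theta, and two genuine 50% procedures):

    ```python
    import numpy as np

    # Two genuine 50% confidence procedures for theta, given x1, x2 ~ Uniform(theta-5, theta+5):
    #   A: the interval between the two observations, (min(x), max(x))
    #   B: xbar +/- c, with c chosen so that P(|xbar - theta| <= c) = 0.5
    rng = np.random.default_rng(0)
    theta, reps = 0.0, 200_000
    c = 5 * (1 - 1 / np.sqrt(2))                  # gives exactly 50% coverage for procedure B

    x = rng.uniform(theta - 5, theta + 5, size=(reps, 2))
    lo_a, hi_a = x.min(axis=1), x.max(axis=1)     # procedure A
    xbar = x.mean(axis=1)
    lo_b, hi_b = xbar - c, xbar + c               # procedure B

    cov = lambda lo, hi: np.mean((lo <= theta) & (theta <= hi))

    # "Shortest interval" procedure: for each dataset keep whichever of A, B is shorter.
    use_a = (hi_a - lo_a) < (hi_b - lo_b)
    lo_s, hi_s = np.where(use_a, lo_a, lo_b), np.where(use_a, hi_a, hi_b)

    print(cov(lo_a, hi_a), cov(lo_b, hi_b), cov(lo_s, hi_s))
    # A and B both come out at ~0.50; the choose-the-shortest rule comes out well below 0.50.
    ```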

    But note that we could apply precisely the same logic as our four steps above. The resulting *interval* is the same, for these data, regardless of whether we used the SD procedure or the "shortest interval" procedure. The problem comes when I try to uniquely identify which sequence this observation comes from, a theoretical "SD procedure" sequence or a theoretical "shortest interval procedure" sequence. I cannot uniquely identify one sequence: the observed interval (L,U) would be observed under both sequences.

    This means that I could choose to identify either 50% probability (or, "bunkiness", or certainty, or plausibility, or confidence, or whatever I want to call it) with the interval, or <50%. As long as I can find a sequence that would give me L,U as an interval for these data, and has a particular long-run probability of containing the true value of X% — and I can find one of those for *any* X — I could assign the "probability" that the interval contains the true value as X%. The FCF leads to mutually contradictory assessments of intervals.

    If you read Neyman (1952) (available here: http://bayesfactor.blogspot.co.uk/2015/04/my-favorite-neyman-passage-on.html) you'll see that he constructs an example with nested confidence intervals and says: "the theory of confidence intervals does not assert anything about the probability that the unknown parameter θ will fall within any specified limits. What it does assert is that the probability of success in estimation using [any such] formula[…] is equal to [X%]." If one does not follow Neyman's advice, one will fall into mutually contradictory assertions.

    • Many thanks Richard. I previously hadn’t understood what you meant by shortest interval, but now I do.

      First, you find it strange that four different intervals could all apparently have the same confidence level (probability of containing the truth). This ought not to seem strange, given that one can obtain different confidence intervals based on inverting tests with different statistical power.

      You describe a new procedure which for each dataset calculates four 50% CIs and then uses the shortest as the interval. I can entirely believe that this procedure has less than 50% coverage in repeated sampling. Of course frequentist theory would not necessarily assert that it would have 50% coverage.

      You then argue that, given an observed interval (l,u) that could have come from either procedure A or procedure B, we cannot uniquely identify which infinite sequence it belongs to (one consisting of applying the shortest-interval procedure, or one consisting of applying one of the individual CI procedures). If the two sequences have different confidence levels, we cannot determine the confidence level of the interval (l,u).

      I’m afraid this logic seems a bit bizarre to me still, since surely one can immediately and unambiguously identify which infinite sequence the interval belongs to according to which procedure was used to calculate the interval. As an analyst I (ought to!) know whether I calculated four intervals and then picked the shortest, or just went straight to calculating the SD interval. In the latter case I can assert 50% confidence, while in the former I have less than 50% confidence. To me there seems no contradiction here.

      Following your reply I’ve been trying to reconcile my (no doubt dodgy!) intuition about CIs with some sort of logical argument. Once the CI has been calculated, the interval either includes or doesn’t include the truth, so I guess if I am going to try and make a probability statement about this, I must be trying to use probability in the belief sense, rather than long run sense. So I will try and perform a (clearly suboptimal type of) Bayesian analysis to justify this belief:

      I am going to collect some data and calculate a 50% CI (which we’ll assume has correct coverage) based on it for some parameter of interest. Let S denote a 0/1 variable indicating whether the resulting interval includes the true parameter value. Then we can agree (I presume) that prior to performing the study, P(S=1)=0.5. Once the study is performed, S has realized some value s. In some contrived examples (e.g. a procedure whose interval is the whole real line 50% of the time and the empty set the other 50%) we can logically deduce whether s=1 or s=0, but in general we cannot deduce its value. In any case, let’s suppose I am not clever enough or simply do not try and make such deductions.

      Now suppose that I try to perform a Bayesian analysis for the model for S, but I do not use any information observed in the data / pretend that the observed data say nothing about S (this evidently may not be very sensible in some situations; see the contrived example above). I have a prior P(S=1)=0.5. If I don’t use/see any of the observed data, then I can argue that the likelihood function L(S|data) is flat, since I am pretending that relevant (to S) ‘data’ is empty. In this case the posterior is proportional to the prior times a flat likelihood, so P(S=1|data)=0.5 (spelled out below). Thus, having conducted my study, I have a posterior probability of 0.5 that the observed interval contains the truth.
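
      Spelling out the update I have in mind, with the (admittedly odd) flat likelihood taking the value 1 for both S=0 and S=1:

      ```latex
      P(S=1 \mid \text{data})
        = \frac{P(S=1)\,L(S=1 \mid \text{data})}
               {P(S=1)\,L(S=1 \mid \text{data}) + P(S=0)\,L(S=0 \mid \text{data})}
        = \frac{0.5 \times 1}{0.5 \times 1 + 0.5 \times 1}
        = 0.5
      ```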

  2. First, I think it would be good for me to say that there’s nothing controversial about our argument among philosophers of statistics and theoretical statisticians; it has been well known for a long time that frequentist inference can, in situations like these, suffer from problems (if interpreted incorrectly). The reference class / representative subset problem raises its head here, but it’s not anything new, really.

    > “I’m afraid this logic seems a bit bizarre to me still, since surely one can immediately and unambiguously identify which infinite sequence the interval belongs to according to which procedure was used to calculate the interval. As an analyst I (ought to!) know whether I calculated four intervals and then picked the shortest, or just went straight to calculating the SD interval.”

    Remember, when talking about a CI, we’re talking not about a procedure, but rather about an interval. It would be a very strange situation if given the *same data* and the *same two-number interval*, the probability of the specific interval containing the true value — which you calculated after observing the data — would depend on *how* you calculated the interval. It’s important to emphasize how absurd this result would be, because two people, given the same data, same model, and the same interval, could assess differently the probability that the interval contains the true parameter.

    > “In any case, let’s suppose I am not clever enough or simply do not try and make such deductions.”

    If you do this, you will make invalid inferences from the data. You’re basically not looking at the data.

    > “I don’t use/see any of the observed data, then I can argue that the likelihood function L(S|data) is flat, since I am pretending that relevant (to S) ‘data’ is empty.”

    Well, that wouldn’t be a likelihood. You’re just recapitulating what you knew before you saw the data – that is, the definition of the confidence interval. The data have added nothing. Think about it: if you’re defending a statement by asserting that it would be perfectly fine *if we didn’t look at the data*, then there’s a problem here 🙂 But you have hit on the important aspect of the theory of CIs: it is based on *pre-data* statements. That’s why you need to ignore the data to make them work.

    • Thanks Richard. You say it would be absurd if given the same data, the same model and the same interval, you could come up with different probabilities that the interval contains the true parameter. But this need not be the case since you can produce many different (frequentist) valid estimators/CI procedures for a given model, each with different statistical properties. If one CI arises from a more efficient estimator, it will typically be narrower than the CI arising from a less efficient estimator. Thus I could imagine being able to construct two intervals, based on two different procedures, which are in fact identical for a given dataset, yet have different coverage levels.

      Regarding my probably crazy last argument: I guess yes I am saying that for the argument to work knowing the actual realized values of L=l and U=u of the CI wouldn’t affect my belief that the interval contains the truth. This would be a valid post data statement if one buys the argument that knowing the values of l and u does not change my belief about whether the interval contains the truth.

      • > “Thus I could imagine being able to construct two intervals, based on two different procedures, which are in fact identical for a given dataset, yet have different coverage levels.”

        Yes, this is true, but the absurdity comes when you then try to associate specific probabilities/certainties/whatever to the individual intervals, due to the frequentist reference class problem. That’s the whole point, and why the Fundamental Confidence Fallacy is in fact a fallacy. As Pearson (1939) said, “Following Neyman’s approach [of proscribing probability statements for intervals], there is no inconsistency in this result, since one probability is associated with the employment of [one procedure], the other with the [other procedure]. It is only when we try to divorce the probability measure from the rule and to regard the former as something associated with a particular interval, that the need for a unique probability measure [something CIs don’t provide] seems to be felt.”

        > “This would be a valid post data statement if one buys the argument that knowing the values of l and u does not change my belief about whether the interval contains the truth.”

        Not really; l<mu<u is not the same proposition as, say, 1<mu<3. You're not retaining the same belief so much as taking one belief (the belief in an unknown interval proposition, pre-data) and asserting that the belief you had about it is *also* the belief you have about the new, observed interval. If you want to do this, you need a principle. You can't simply stipulate that you're going to do this.

        Had you been asked about the numbers in the observed interval before you observed them, would you have said 95% (or whatever)? If not, then your beliefs changed due to the interval. But if you don't change them, you're in for absurdity. Consider the following scenario. I ask you "If you observe the CI [1,2], what would your assessment of the interval [1,2] be?" You say 95%. I then ask you "If you observe the CI [2,3], what would your assessment of the interval [2,3] be?" You say 95%. I then ask you "What are your assessments of [1,2] and [2,3] *now*?" If you claim that the observed CI does not change your beliefs, you're committed to a 95% probability assessment of each. Now, of course, the probability of [1,3] is 190%. That's not good.
