Recently my colleague Ruth Keogh and I had a paper published: 'Bayesian correction for covariate measurement error: a frequentist evaluation and comparison with regression calibration' (open access here). The paper compares the popular regression calibration approach for handling covariate measurement error in regression models with a Bayesian approach. The two methods are compared from the frequentist perspective, and one of the arguments we make is that frequentists should more often consider using Bayesian methods.

# Inference

## On "The fallacy of placing confidence in confidence intervals"

*Note: if you read this post, make sure to read the comments/discussion below it with Richard Morey, author of the paper in question, who put me straight on a number of points.*

Thanks to Twitter I came across the latest draft of a very nicely written and thought-provoking paper "The fallacy of placing confidence in confidence intervals", by Morey, Rouder, Hoekstra, Lee and Wagenmakers. The paper aims to show why frequentist confidence intervals do not possess a number of properties that researchers often believe they do. In contrast, the authors show that Bayesian credible intervals possess these desired properties, and advocate the replacement of confidence intervals with Bayesian credible intervals.

## Banning p-values from journals

A psychology journal (Basic and Applied Social Psychology) has recently caused a bit of a stir by banning p-values from its published articles. For what it's worth, here are a few views on the journal's new policy, and on the use of p-values and confidence intervals in empirical research.

## Adjusting for optimism/overfitting in measures of predictive ability using bootstrapping

In a previous post we looked at the area under the ROC curve for assessing the discrimination ability of a fitted logistic regression model. An issue that we ignored there was that we used the same dataset to fit the model (estimate its parameters) and to assess its predictive ability.

A problem with doing this, particularly when the dataset used to fit/train the model is small, is that such estimates of predictive ability are optimistic. That is, the fitted model will fit the dataset that was used to estimate its parameters somewhat better than it will fit new data. In some sense, this is because with small datasets the fitted model adapts to chance characteristics of the observed data which won't occur in future data. A silly example of this would be a linear regression model of a continuous variable Y fitted to a continuous covariate X with just n=2 data points. The fitted line will just be the line connecting the two data points. In this case, the R squared measure will be 1 (100%), suggesting your model has perfect predictive power(!), when of course with new data it would almost certainly not have an R squared of 1.
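The n=2 example can be checked numerically. The following is a minimal Python sketch, not code from the original post: the data-generating model (Y = 0.5·X + standard normal noise), the seed, and the helper names `fit_line`, `r_squared` and `simulate` are all assumptions made for illustration. It fits a least-squares line to two simulated points and then evaluates the same fitted line on a large fresh sample.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y, intercept, slope):
    """R squared of a given line on a dataset: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


rng = random.Random(1)  # arbitrary seed, for reproducibility only


def simulate(n):
    """Hypothetical data-generating model: Y = 0.5 * X + N(0, 1) noise."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [0.5 * xi + rng.gauss(0, 1) for xi in x]
    return x, y


# Fit to just two points: the line passes through both of them.
x_train, y_train = simulate(2)
intercept, slope = fit_line(x_train, y_train)
r2_train = r_squared(x_train, y_train, intercept, slope)

# Evaluate the same fitted line on a large independent sample.
x_new, y_new = simulate(1000)
r2_new = r_squared(x_new, y_new, intercept, slope)

print(f"R squared on the fitting data: {r2_train:.3f}")
print(f"R squared on new data:         {r2_new:.3f}")
```

The apparent R squared equals 1 up to rounding error, while the value on new data is lower and can even be negative, since a line fitted to two noisy points may predict worse than the mean of the new outcomes.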

## Wilcoxon-Mann-Whitney as an alternative to the t-test

The two sample t-test is one of the most used statistical procedures. Its purpose is to test the hypothesis that the means of two groups are the same. The test assumes that the variable in question is normally distributed in the two groups. When this assumption is in doubt, the non-parametric Wilcoxon-Mann-Whitney (or rank sum) test, which doesn't rely on distributional assumptions, is sometimes suggested as an alternative to the t-test (e.g. the Wikipedia page on the t-test). But is this necessarily a good 'replacement'?

## A/B testing - confidence interval for the difference in proportions using R

In a previous post we looked at how Pearson's chi-squared test (or Fisher's exact test) can be used to test whether the 'success' proportions are equal under two conditions. In biostatistics this setting arises (for example) when patients are randomized to receive one or other of two treatments, and for each patient we observe either a 'success' (of course this could be a bad outcome, such as death) or 'failure'. In web design people may have data where web site visitors are sent to one of two versions of a page at random, and for each visit a success is defined as some outcome such as a purchase of a product. In both cases, we may be interested in testing the hypothesis that the true proportions of successes in the population are equal, and this is what we looked at in an earlier post. Note that the randomization described in these two examples is not necessary for the statistical procedures described in this post, but of course randomization affects our interpretation of the differences between the groups.
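As a rough illustration of the kind of interval the title refers to, here is a minimal sketch of the standard Wald (normal approximation) confidence interval for a difference in two proportions. It is written in Python rather than R purely for illustration, and it is not necessarily the interval used in the full post. The function name `diff_prop_ci` and the visitor and purchase counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_prop_ci(successes_a, n_a, successes_b, n_b, level=0.95):
    """Wald confidence interval for p_a - p_b.

    Returns (estimate, lower, upper). Relies on the normal approximation,
    so it can behave poorly with small samples or proportions near 0 or 1.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    diff = p_a - p_b
    # Standard error of the difference between two independent proportions.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Two-sided critical value from the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical A/B test: 120 purchases from 1000 visits to version A,
# 100 purchases from 1000 visits to version B.
diff, lower, upper = diff_prop_ci(120, 1000, 100, 1000)
print(f"Estimated difference: {diff:.3f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

With these made-up counts the interval includes zero, so the data would be compatible with no difference between the two versions at the 5% level.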