The two sample t-test is one of the most widely used statistical procedures. Its purpose is to test the hypothesis that the means of two groups are the same. The test assumes that the variable in question is normally distributed in the two groups. When this assumption is in doubt, the non-parametric Wilcoxon-Mann-Whitney (or rank sum) test, which doesn't rely on distributional assumptions, is sometimes suggested as an alternative to the t-test (e.g. on the Wikipedia page on the t-test). But is this necessarily a good 'replacement'?
The Wilcoxon-Mann-Whitney test
The Wilcoxon-Mann-Whitney (WMW) test consists of taking all the observations from the two groups and ranking them in order of size (ignoring group membership). The ranks of the observations from the first group (it doesn't matter which group you choose) are then summed, and the test statistic is formed, in its usual form, as
$$U = R_1 - \frac{n_1(n_1+1)}{2}$$
where $R_1$ denotes the sum of the ranks in the first group and $n_1$ the number of observations in that group.
Under the null hypothesis that the distribution of the variable in question is identical (in the population) in the two groups, the sampling distribution of $U$ can be determined (or a normal approximation is invoked) and thus a p-value calculated. The test is available in most (if not all) statistical packages.
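As a minimal sketch of the ranking procedure just described, the Python snippet below pools two small samples, ranks them, sums the ranks of the first group and forms the statistic in its usual Mann-Whitney form, U = R1 - n1(n1+1)/2. The data values and the helper name `wmw_u` are made up for illustration, and the comparison assumes SciPy's `scipy.stats.mannwhitneyu` is available.

```python
import numpy as np
from scipy import stats


def wmw_u(x, y):
    """Return (R1, U) for the first group, using average ranks for ties."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1 = len(x)
    # Rank all observations together, ignoring group membership.
    ranks = stats.rankdata(np.concatenate([x, y]))
    r1 = ranks[:n1].sum()            # rank sum of the first group
    u = r1 - n1 * (n1 + 1) / 2       # usual Mann-Whitney U for the first group
    return r1, u


# Hypothetical toy data, chosen only to illustrate the calculation.
x = [1.1, 2.3, 3.8, 5.0]
y = [0.4, 1.9, 2.7, 4.4, 6.1]

r1, u = wmw_u(x, y)
result = stats.mannwhitneyu(x, y, alternative="two-sided")
print("rank sum:", r1, "U by hand:", u)
print("SciPy statistic:", result.statistic, "p-value:", result.pvalue)
```

For small samples without ties SciPy can use the exact null distribution of U, and otherwise it falls back on a normal approximation, which mirrors the two routes to a p-value mentioned above.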
What hypothesis is WMW testing?
If WMW is to be used as an alternative to the two sample t-test (for example because the normality assumption made by the latter is in doubt), it would seem a reasonable requirement that it ought to be testing the same 'thing'. What null and alternative hypotheses is WMW testing? Although papers or books may present a single set of hypotheses, it turns out the WMW test is valid under a range of different sets of possible null and alternative hypotheses (see this paper by Fay and Proschan). But the commonly stated hypotheses are that the distributions in the two groups are the same (null) vs that the probability that a random observation from group 1 exceeds a random observation from group 2 differs from 0.5 (under the null this probability is 0.5).
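To make the 'probability that a random observation from group 1 exceeds a random observation from group 2' concrete, here is a small Python sketch that estimates it by comparing every observation in one sample with every observation in the other. The function name `prob_x_exceeds_y` and the data are hypothetical, and counting ties as one half is a common convention that I am assuming here rather than something taken from the papers cited.

```python
import numpy as np


def prob_x_exceeds_y(x, y):
    """Estimate P(X > Y) from all pairs, counting ties as one half."""
    x = np.asarray(x, dtype=float)[:, None]   # column vector
    y = np.asarray(y, dtype=float)[None, :]   # row vector
    wins = (x > y).sum()                      # pairs where x exceeds y
    ties = (x == y).sum()                     # pairs where the two are equal
    return (wins + 0.5 * ties) / (x.size * y.size)


# Hypothetical toy data.
group1 = [3, 5, 8]
group2 = [1, 4, 5, 9]

p_hat = prob_x_exceeds_y(group1, group2)
print("estimated P(X > Y):", p_hat)
```

An estimate far from 0.5 is the kind of departure the test is sensitive to, and note that nothing in this quantity refers to the group means.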
As a nice article by Fagerland (freely available here) shows, a statistically significant WMW test can result even when the population means of the variable in question are identical in the two groups (i.e. when the t-test null hypothesis is true). Fagerland demonstrates this empirically by simulating data from gamma and log-normal distributions in which the means and medians are identical in the two groups, but the variability (standard deviation) differs between the two groups. The simulations show that the WMW test rejects the null hypothesis more than 5% of the time in these situations (with the rejection rate depending on the particular setup). Of course there is nothing wrong with this result - the distributions in the two groups are not identical, so the null hypothesis of the WMW test is not true, and we would hope that the WMW test would reject the null.
However, if our objective is to test for equality of means, the WMW would mislead us, since the two groups have identical means (and medians) yet the WMW rejects the null hypothesis more than 5% of the time. The difficulty (in my view) with the WMW test is that if we obtain a statistically significant p-value, it is rather unspecific as to what we have found. We have found evidence against the null of identical distributions in the two groups, but we cannot (without doing further analysis) be more specific as to the way in which the two distributions differ.
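The following Python sketch illustrates the general phenomenon with a small simulation. It is not a reproduction of Fagerland's design: the setup is my own assumption, using two log-normal distributions whose population means are equal but whose spreads (and, in this particular setup, medians) differ. The sample size, number of simulations, seed and the function name `rejection_rates` are all arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats


def rejection_rates(n=50, n_sims=1000, alpha=0.05, seed=1):
    """Share of simulated data sets in which WMW and Welch's t-test reject."""
    rng = np.random.default_rng(seed)
    sigma1, sigma2 = 1.0, 0.5
    # A log-normal has mean exp(mu + sigma^2 / 2); choose mu so that
    # both groups have the same population mean, exp(0.5).
    mu1 = 0.5 - sigma1**2 / 2
    mu2 = 0.5 - sigma2**2 / 2
    wmw_rejections = 0
    t_rejections = 0
    for _ in range(n_sims):
        x = rng.lognormal(mean=mu1, sigma=sigma1, size=n)
        y = rng.lognormal(mean=mu2, sigma=sigma2, size=n)
        if stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha:
            wmw_rejections += 1
        # equal_var=False requests the unequal-variance (Welch) t-test.
        if stats.ttest_ind(x, y, equal_var=False).pvalue < alpha:
            t_rejections += 1
    return wmw_rejections / n_sims, t_rejections / n_sims


wmw_rate, t_rate = rejection_rates()
print("WMW rejection rate:    ", wmw_rate)
print("Welch t rejection rate:", t_rate)
```

In a configuration like this one would expect the WMW rejection rate to sit well above 5% even though the means are equal, with the t-test much closer to its nominal level. The exact figures depend on the distributions, sample size and seed.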
More specific interpretations of the WMW test can be given, but only if we are willing, in advance of performing the test (and ideally seeing the data), to assume the possibility of a more restrictive alternative hypothesis. One is the location shift, where under the null the two groups have the same distribution, and under the alternative one group's distribution is shifted in location (see Perspective 6 on page 10 of the paper by Fay and Proschan). However, these sets of nulls and alternatives do not include the one I am assuming in this post is of interest, namely the null that the means of the two groups are the same, versus the alternative that the means differ.
What to do?
As I described in a previous post, provided the sample size is moderately large, the two-sample t-test is robust to non-normality due to the central limit theorem. Fagerland's simulation results demonstrate this, with the t-test giving a rejection rate of approximately 5% in the simulation study (in contrast to the WMW, which rejects more than 5% of the time). So the usual t-test (possibly allowing for unequal variances) can usually be used, provided the sample sizes are not too small and the distribution is not extremely skewed.
But what if the sample size is small? One thought is to use a permutation test, based on computing the difference in sample means and permuting the group membership. However, this suffers from the same issue as WMW (see here), and so shouldn't be used if under the null of equal means we believe it is possible for the groups' distributions to differ in other respects.
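For readers unfamiliar with the procedure, a permutation test of this kind might be sketched in Python as below. This is only an illustration of the mechanics, not a recommendation, given the caveat just described. The function name `permutation_test_mean_diff`, the number of permutations, the seed and the data are all hypothetical choices.

```python
import numpy as np


def permutation_test_mean_diff(x, y, n_perm=10000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1 = len(x)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    extreme = 0
    for _ in range(n_perm):
        # Shuffle the pooled values, i.e. permute the group membership.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:n1].mean() - shuffled[n1:].mean()
        # Small tolerance guards against floating-point round-off.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical toy data.
a = [10, 11, 12, 13, 14]
b = [0, 1, 2, 3, 4]
diff, p = permutation_test_mean_diff(a, b)
print("difference in means:", diff, "permutation p-value:", p)
```

Because shuffling the labels treats the two groups as exchangeable, the reference distribution is built under the assumption that the groups have identical distributions, which is why the test inherits the same limitation as WMW.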
My personal approach, if I were worried about the normality assumption and the sample size were small, would be to use bootstrapping and the relationship between confidence intervals and hypothesis tests. Specifically, I would find the (superior) bias-corrected and accelerated (BCa) 95% confidence interval for the difference in means between the groups. If this interval excludes the null value of zero, then p<0.05; otherwise p>0.05. If we want to get the actual p-value, we would need to find the confidence level at which the confidence interval just includes zero. Although the justification for the bootstrap technique relies on a large sample argument, it has been shown to often work remarkably well with quite small sample sizes.
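One way this could be done in Python is with `scipy.stats.bootstrap`, which offers a `method="BCa"` option. The sketch below assumes a reasonably recent SciPy release in which BCa intervals are supported for two-sample statistics. The data, the seed, the number of resamples and the helper names `mean_diff` and `bca_mean_diff_ci` are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def mean_diff(x, y, axis=-1):
    """Difference in sample means, vectorised over bootstrap resamples."""
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)


def bca_mean_diff_ci(x, y, confidence_level=0.95, n_resamples=9999, seed=0):
    """BCa bootstrap confidence interval for mean(x) - mean(y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    res = stats.bootstrap(
        (x, y),                 # the two groups are resampled independently
        mean_diff,
        n_resamples=n_resamples,
        confidence_level=confidence_level,
        method="BCa",
        vectorized=True,
        random_state=seed,
    )
    ci = res.confidence_interval
    return ci.low, ci.high


# Hypothetical small samples.
g1 = [5.1, 6.3, 4.8, 7.2, 5.9, 6.6, 5.4, 8.1]
g2 = [1.2, 0.8, 2.5, 1.9, 0.5, 3.1, 1.4, 2.2]

low, high = bca_mean_diff_ci(g1, g2)
print("95% BCa interval for the difference in means:", (low, high))
print("zero excluded (p < 0.05):", not (low <= 0.0 <= high))
```

To approximate the p-value itself, one could call `bca_mean_diff_ci` over a grid of `confidence_level` values and look for the level at which zero sits on the boundary of the interval.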
A different view
A final note in the interest of balance. In reading online when writing this article I came across the following paper by Sawilowsky, who evidently has a quite different view regarding the choice between WMW and the t-test. That WMW and the t-test are testing different hypotheses is categorised by Sawilowsky as a "True Statement Irrelevant in Choosing Between the t and Wilcoxon" test. Why the fact that the hypotheses being tested differ between the two is irrelevant is not clear to me - if anyone can enlighten me please do so by adding a comment to this page.
For more reading on WMW, the t-test, and its normality assumption, in addition to the previously mentioned papers, I'd also recommend looking at this very readable paper by Lumley and colleagues.