The t-test is one of the most commonly used tests in statistics. The two-sample t-test allows us to test the null hypothesis that the population means of two groups are equal, based on samples from each of the two groups. In its simplest form, it assumes that in the population, the variable/quantity of interest X follows a normal distribution N(μ1, σ²) in the first group and N(μ2, σ²) in the second group. That is, the variance σ² is assumed to be the same in both groups, and the variable is normally distributed around the group mean. The null hypothesis is then that μ1 = μ2.
A simple extension allows for the variances to be different in the two groups, i.e. that in the first group, the variable of interest X is distributed N(μ1, σ1²) and in the second group as N(μ2, σ2²). Since often variances can differ between the two groups being tested, it is generally advisable to allow for this possibility.
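As a small illustration of the two versions, R's t.test function performs the unequal-variance (Welch) test by default, with var.equal=TRUE giving the classical equal-variance test. The data below are simulated purely for illustration:

```r
# Two groups with different variances; the null (equal means) is true
set.seed(1234)
x <- rnorm(50, mean = 0, sd = 1)
y <- rnorm(50, mean = 0, sd = 2)

t.test(x, y)                    # Welch test, allows unequal variances (default)
t.test(x, y, var.equal = TRUE)  # classical two-sample t-test, pooled variance
```

The two tests differ in how they estimate the standard error of the difference in means, and the Welch version also adjusts the degrees of freedom.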
So, as constructed, the two-sample t-test assumes normality of the variable X in the two groups. On the face of it then, we would worry if, upon inspection of our data, say using histograms, we were to find that our data looked non-normal. In particular, we would worry that the t-test will not perform as it should – i.e. that if the null hypothesis is true, it will falsely reject the null 5% of the time (assuming we are using the usual 5% significance level).
In fact, as the sample size in the two groups gets large, the t-test is valid (i.e. the type 1 error rate is controlled at 5%) even when X doesn’t follow a normal distribution. I think the most direct route to seeing why this is so, is to recall that the t-test is based on the two group means X̄1 and X̄2. Because of the central limit theorem, the distribution of these, in repeated sampling, converges to a normal distribution, irrespective of the distribution of X in the population. Also, the estimator that the t-test uses for the standard error of the sample means is consistent irrespective of the distribution of X, and so this too is unaffected by non-normality. As a consequence, the test statistic converges to a standard normal N(0, 1) distribution under the null hypothesis as the sample size tends to infinity.
What does this mean in practice? Provided our sample size isn’t too small, we shouldn’t be overly concerned if our data appear to violate the normal assumption. Also, for the same reasons, the 95% confidence interval for the difference in group means will have correct coverage, even when X is not normal (again, when the sample size is sufficiently large). Of course, for small samples, or highly skewed distributions, the above asymptotic result may not give a very good approximation, and so the type 1 error rate may deviate from the nominal 5% level.
Let’s now use R to examine how quickly the sample mean’s distribution (in repeated samples) converges to a normal distribution. We will simulate data from a log-normal distribution – that is, log(X) follows a normal distribution. We can generate random samples from this distribution by exponentiating random draws from a normal distribution. First we will draw a large (n=100000) sample and plot its distribution to see what it looks like:
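A sketch of this step in R, assuming a standard log-normal, i.e. log(X) ~ N(0,1) – the particular parameter values and seed here are illustrative choices:

```r
# Draw a large sample from a log-normal distribution by
# exponentiating standard normal draws, then plot its histogram
set.seed(1234)
x <- exp(rnorm(100000))
hist(x, breaks = 100, main = "Histogram of X", xlab = "X")
```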
We can see that its distribution is highly skewed. On the face of it, we would be concerned about using the t-test for such data, which is derived assuming X is normally distributed.
To see what the sampling distribution of X̄ looks like, we will choose a sample size n, and repeatedly take draws of size n from the log-normal distribution, calculate the sample mean, and then plot the distribution of these sample means. The following shows a histogram of the sample means for n=3 (from 10,000 repeated samples):
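This step can be sketched as follows, again assuming a standard log-normal for X:

```r
# Repeatedly draw samples of size n=3 from the log-normal distribution,
# storing the sample mean from each, then plot the sampling distribution
set.seed(1234)
nSamples <- 10000
sampleMeans <- replicate(nSamples, mean(exp(rnorm(3))))
hist(sampleMeans, breaks = 50, main = "Sample means, n=3", xlab = "Sample mean")
```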
Here the sampling distribution of X̄ is skewed. With such a small sample size, if one of the sampled values is a high value from the tail of the distribution, this will give a sample mean which is quite far from the true mean. If we repeat, but now with n=10:
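The same simulation, now with n=10 per sample (again a sketch using a standard log-normal):

```r
# Sampling distribution of the mean with n=10 per sample
set.seed(1234)
nSamples <- 10000
sampleMeans <- replicate(nSamples, mean(exp(rnorm(10))))
hist(sampleMeans, breaks = 50, main = "Sample means, n=10", xlab = "Sample mean")
```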
It is now starting to look more normal, but it is still skewed – the sample mean is occasionally large. Notice that the x-axis range is now smaller – the variability of the sample mean is smaller than with n=3. Lastly, we try n=100:
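And the final case, n=100 (same illustrative setup):

```r
# Sampling distribution of the mean with n=100 per sample
set.seed(1234)
nSamples <- 10000
sampleMeans <- replicate(nSamples, mean(exp(rnorm(100))))
hist(sampleMeans, breaks = 50, main = "Sample means, n=100", xlab = "Sample mean")
```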
Now the sample mean’s distribution (in repeated samples from the population) looks pretty much normal. When n is large, even though one of our observations might be in the tail of the distribution, all the other observations near the centre of the distribution pull the mean back towards the true mean. This suggests that the t-test should be ok with n=100, for this particular X distribution. A more direct way of checking this would be to perform a simulation study where we empirically estimate the type 1 error rate of the t-test, applied to this distribution with a given choice of n.
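Such a simulation study might be sketched as below. Both groups are drawn from the same standard log-normal distribution (an illustrative choice), so the null hypothesis of equal means holds, and the proportion of rejections estimates the type 1 error rate:

```r
# Empirically estimate the type 1 error rate of the t-test
# for log-normal data with n=100 per group; the null is true
set.seed(1234)
nSim <- 10000
n <- 100
pvals <- replicate(nSim, {
  x <- exp(rnorm(n))
  y <- exp(rnorm(n))
  t.test(x, y)$p.value
})
mean(pvals < 0.05)  # empirical type 1 error rate; compare with the nominal 0.05
```

If the asymptotic argument above holds well at n=100, this proportion should be close to 0.05.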
Of course if X isn’t normally distributed, even if the type 1 error rate for the t-test assuming normality is close to 5%, the test will not be optimally powerful. That is, there will exist alternative tests of the null hypothesis which have greater power to detect alternative hypotheses.
For more on the large sample properties of hypothesis tests, robustness, and power, I would recommend looking at Chapter 3 of ’Elements of Large-Sample Theory’ by Lehmann. For more on the specific question of the t-test and robustness to non-normality, I’d recommend looking at this paper by Lumley and colleagues.
Addition – 1st May 2017
Below Teddy Warner queries in a comment whether the t-test ‘assumes’ normality of the individual observations. The following image is from the book Statistical Inference by Casella and Berger, and is provided just to illustrate the point that the t-test is, by its construction, based on assuming normality for the individual (population) values: