Someone recently asked me what the difference was between the sample mean and the population mean. This question goes to the heart of what it means to perform statistical inference. Whatever field we are working in, we are usually interested in answering some kind of question, and often this can be expressed in terms of some numerical quantity, e.g. what is the mean income in the US? This question can be framed mathematically by saying we would like to know the value of a parameter describing some distribution. In the case of the mean US income, the parameter is the mean of the distribution of US incomes. Here the population is the US population, and the population mean is the mean of all the incomes in the US population. For our objective, the population mean is the parameter of interest.
In some (or maybe most) settings, the population is large but finite. However, often the population is so large that we actually assume the population is infinite, to make some of the maths easier. Because the population is large, we usually cannot hope to calculate the parameter of interest (e.g. the population mean) exactly, because to do so we would have to obtain income information from every member of the population. This is often infeasible due to costs or practicalities.
Instead, we take a sample from the population of interest, and calculate the mean of the sample (or, more generally, an estimate of our parameter of interest, based on the sample data), giving the sample mean. Now of course the sample mean will generally not be exactly equal to the population mean. But if the sample is a simple random sample, the sample mean is an unbiased estimate of the population mean. This means that the sample mean is not systematically smaller or larger than the population mean. Or put another way, if we were to repeatedly take lots and lots (actually an infinite number) of samples, the mean of the sample means would equal the population mean.
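To make the idea of repeated sampling concrete, here is a small simulation sketch in Python. Everything in it is made up for illustration: the "population" is 100,000 simulated incomes drawn from a log-normal distribution (not real US data), and the helper `simulate_sample_means`, the sample size of 100 and the 2,000 repetitions are my own arbitrary choices. We can only take a finite number of samples, so the mean of the sample means will be close to, rather than exactly equal to, the population mean.

```python
import random
import statistics


def simulate_sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated simple random samples (without replacement)
    from `population` and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical finite population of right-skewed "incomes" (simulated, not real data).
pop_rng = random.Random(42)
population = [pop_rng.lognormvariate(10.5, 0.8) for _ in range(100_000)]
population_mean = statistics.fmean(population)

# Take 2,000 simple random samples of 100 people each.
sample_means = simulate_sample_means(population, sample_size=100, n_samples=2_000)
mean_of_sample_means = statistics.fmean(sample_means)

print(f"population mean:      {population_mean:,.0f}")
print(f"first sample mean:    {sample_means[0]:,.0f}")
print(f"mean of sample means: {mean_of_sample_means:,.0f}")
```

Any individual sample mean will typically miss the population mean by a noticeable amount, but the average across the repeated samples should sit very close to it, which is what unbiasedness says.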
Because the sample mean is not equal to the population mean which we are actually interested in, if we are to use the sample mean in place of the population mean we should always report it with some measure of how precise it is. The most common ways of doing this are to report a standard error or a confidence interval, topics I will return to in later posts.
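As a preview, here is a minimal sketch of how such a report might be computed from a single sample. It assumes a simple random sample from a population large enough to be treated as infinite (so no finite population correction is applied), and it uses the usual normal approximation with the 1.96 multiplier for an approximate 95% confidence interval. The function name `mean_with_standard_error` and the simulated data are hypothetical, chosen just for this example.

```python
import math
import random
import statistics


def mean_with_standard_error(sample):
    """Return (sample mean, estimated standard error of the mean)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return mean, standard_error


# Simulated sample of 400 "incomes" (made-up data, not real incomes).
rng = random.Random(1)
sample = [rng.lognormvariate(10.5, 0.8) for _ in range(400)]

mean, se = mean_with_standard_error(sample)

# Approximate 95% confidence interval based on the normal approximation.
lower, upper = mean - 1.96 * se, mean + 1.96 * se

print(f"sample mean:    {mean:,.0f}")
print(f"standard error: {se:,.0f}")
print(f"approx. 95% CI: ({lower:,.0f}, {upper:,.0f})")
```

The standard error shrinks as the sample size grows, so larger samples give narrower intervals around the sample mean.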