The t-test and robustness to non-normality

The t-test is one of the most commonly used tests in statistics. The two-sample t-test allows us to test the null hypothesis that the population means of two groups are equal, based on samples from each of the two groups. In its simplest form, it assumes that in the population, the variable/quantity of interest X follows a normal N(\mu_{1},\sigma^{2}) distribution in the first group and an N(\mu_{2},\sigma^{2}) distribution in the second group. That is, the variance is assumed to be the same in both groups, and the variable is normally distributed around the group mean. The null hypothesis is then that \mu_{1}=\mu_{2}.
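To make this concrete, here is a minimal sketch of how such a test might be carried out in Python with scipy; the group means, common standard deviation, and sample sizes below are simulated purely for illustration.

```python
# Minimal sketch: classical two-sample t-test assuming equal variances,
# applied to simulated data (the means, SD and sample sizes are illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)

# Simulate the two groups: mu1 = 0, mu2 = 0.5, common standard deviation 1
group1 = rng.normal(loc=0.0, scale=1.0, size=50)
group2 = rng.normal(loc=0.5, scale=1.0, size=50)

# Two-sample t-test of H0: mu1 = mu2, assuming equal variances
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```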

Read more

Linear regression with random regressors, part 2

Previously I wrote about how, when linear regression is introduced and derived, it is almost always done assuming the covariates/regressors/independent variables are fixed quantities. As I wrote, in many studies such an assumption does not match reality, in that both the regressors and the outcome in the regression are realised values of random variables. I showed that the usual ordinary least squares (OLS) estimators are unbiased with random covariates, and that the usual standard error estimator, derived assuming fixed covariates, is unbiased with random covariates. This gives us some understanding of the behaviour of these estimators in the random covariate setting.
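As a rough illustration of this behaviour, the sketch below simulates data in which the covariate is itself drawn at random in each replication, fits OLS repeatedly, and checks that the slope estimates average out to roughly the true value; the true coefficients and sample sizes are arbitrary choices for illustration.

```python
# Rough simulation sketch: with a randomly drawn covariate X in each
# replication, the OLS slope estimates still average out to the true slope.
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1 = 1.0, 2.0          # true intercept and slope (arbitrary choices)
n, n_sim = 100, 5000

slope_estimates = np.empty(n_sim)
for i in range(n_sim):
    x = rng.normal(size=n)                        # covariate is itself random
    y = beta0 + beta1 * x + rng.normal(size=n)    # outcome given the covariate
    slope_estimates[i] = np.polyfit(x, y, deg=1)[0]   # OLS slope estimate

print("Mean of OLS slope estimates:", slope_estimates.mean())  # close to 2.0
```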

Read more

Regression inference assuming predictors are fixed

Linear regression is one of the workhorses of statistical analysis, permitting us to model how the expectation of an outcome Y depends on one or more predictors (or covariates, regressors, independent variables) X. Previously I wrote about the assumptions required for validity of ordinary linear regression estimates and their inferential procedures (tests, confidence intervals), assuming (as we often do) that the residuals are normally distributed with constant variance.
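For readers who want something to experiment with, here is a small sketch of that normal-theory inference (coefficient estimates, t-test p-values and confidence intervals) using statsmodels; the data and true coefficients are simulated purely for illustration.

```python
# Small sketch: normal-theory inference for linear regression, fitted to
# simulated data with normal, constant-variance errors, using statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 0.5 * x + rng.normal(scale=1.0, size=100)  # normal errors, constant variance

X = sm.add_constant(x)            # design matrix with an intercept column
results = sm.OLS(y, X).fit()

print(results.params)             # coefficient estimates
print(results.pvalues)            # t-test p-values for each coefficient
print(results.conf_int())         # 95% confidence intervals
```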

Read more

Assumptions for linear regression

Linear regression is one of the most commonly used statistical methods; it allows us to model how an outcome variable Y depends on one or more predictors (sometimes called independent variables) X_{1},X_{2},...,X_{p}. In particular, we model how the mean, or expectation, of the outcome Y varies as a function of the predictors:
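E(Y|X_{1},...,X_{p}) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{p}X_{p}

where \beta_{0},\beta_{1},...,\beta_{p} are the regression coefficients to be estimated.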

Read more

A/B testing and Pearson's chi-squared test of independence

A good friend of mine asked me recently about how to do A/B testing. As he explained, A/B testing refers to the process whereby, when someone visits a website, the site sends them to one of two (or possibly more) different 'landing' or home pages, with the version chosen at random. The purpose is to determine which page version generates a superior outcome, e.g. which page generates more advertising revenue, or which page leads a greater proportion of visitors to continue visiting the site.
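To give a flavour of how Pearson's chi-squared test is applied in this setting, here is a minimal sketch using scipy on a made-up 2x2 table of counts (page version by whether the visitor continued browsing).

```python
# Minimal sketch: Pearson's chi-squared test of independence applied to a
# made-up A/B testing 2x2 table of counts.
import numpy as np
from scipy import stats

# Rows: page versions A and B; columns: continued browsing vs. did not
table = np.array([[120, 380],
                  [150, 350]])

chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)
print(f"chi-squared = {chi2:.3f}, df = {dof}, p = {p_value:.4f}")
```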

Read more

The difference between the sample mean and the population mean

Someone recently asked me what the difference was between the sample mean and the population mean. This is really a question which goes to the heart of what it means to perform statistical inference. Whatever field we are working in, we are usually interested in answering some kind of question, and often this can be expressed in terms of some numerical quantity, e.g. the mean income in the US. This question can be framed mathematically by saying we would like to know the value of a parameter describing some distribution. In the case of the mean US income, the parameter is the mean of the distribution of US incomes. Here the population is the US population, and the population mean is the mean of all the incomes in the US population. For our objective, the population mean is the parameter of interest.
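As a toy numerical sketch of the distinction (using a made-up 'population' of incomes), the population mean is a fixed parameter, while the sample mean computed from a random sample is only an estimate of it and varies from sample to sample.

```python
# Toy sketch: the population mean is a fixed parameter; the sample mean from a
# random sample is an estimate of it (the 'population' here is simulated).
import numpy as np

rng = np.random.default_rng(7)

# Pretend this array is the entire (finite) population of incomes
population = rng.lognormal(mean=10.5, sigma=0.8, size=1_000_000)
print("Population mean:", population.mean())   # the parameter of interest

# A survey draws a random sample and uses the sample mean as an estimate
sample = rng.choice(population, size=500, replace=False)
print("Sample mean:", sample.mean())           # varies from sample to sample
```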

Read more