The Hosmer-Lemeshow goodness of fit test for logistic regression

Before a model is relied upon to draw conclusions or predict future outcomes, we should check, as far as possible, that the model we have assumed is correctly specified: that is, that the data do not conflict with the assumptions the model makes. For binary outcomes, logistic regression is the most popular modelling approach. In this post we'll look at the popular, but sometimes criticized, Hosmer-Lemeshow goodness of fit test for logistic regression.
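The statistic itself is straightforward to compute: observations are grouped (conventionally into deciles) by fitted probability, and observed success counts are compared with expected counts in each group. The post works in R; as an illustrative, language-neutral sketch (the function name and simulated data below are my own), here it is in plain Python:

```python
import random

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow statistic: partition observations into g groups
    by fitted probability, then compare observed and expected successes."""
    pairs = sorted(zip(p, y))  # order observations by fitted probability
    n = len(pairs)
    stat = 0.0
    for k in range(g):
        group = pairs[k * n // g:(k + 1) * n // g]
        m = len(group)
        exp = sum(pi for pi, _ in group)   # expected number of successes
        obs = sum(yi for _, yi in group)   # observed number of successes
        stat += (obs - exp) ** 2 / (exp * (1 - exp / m))
    return stat  # when the model fits, compare with chi-squared on g-2 df

# Simulated example: outcomes are generated from the fitted probabilities,
# so the model is correctly specified and the statistic should be modest
# relative to the 0.95 quantile of chi-squared(8), roughly 15.51.
random.seed(1)
p = [random.uniform(0.05, 0.95) for _ in range(2000)]
y = [1 if random.random() < pi else 0 for pi in p]
print(hosmer_lemeshow(y, p))
```

With ten groups and a fitted logistic model, the statistic is conventionally referred to a chi-squared distribution with g-2 = 8 degrees of freedom.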


A/B testing - confidence interval for the difference in proportions using R

In a previous post we looked at how Pearson's chi-squared test (or Fisher's exact test) can be used to test whether the 'success' proportions are equal under two conditions. In biostatistics this setting arises (for example) when patients are randomized to receive one or the other of two treatments, and for each patient we observe either a 'success' (which could of course be a bad outcome, such as death) or a 'failure'. In web design, visitors may be sent to one of two versions of a page at random, with a success defined as some outcome such as the purchase of a product. In both cases we may be interested in testing the hypothesis that the true proportions of success in the two populations are equal, which is what we looked at in the earlier post. Note that the randomization described in these two examples is not necessary for the statistical procedures described in this post, but of course randomization affects how we interpret differences between the groups.
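Beyond testing, we usually want an interval for the difference in proportions. The standard Wald interval is p1 - p2 plus or minus 1.96 times the estimated standard error sqrt(p1(1-p1)/n1 + p2(1-p2)/n2). The post does this in R; as a plain-Python sketch (the counts below are made up for illustration):

```python
import math

def diff_prop_ci(x1, n1, x2, n2, z=1.96):
    """Wald 95% confidence interval for the difference in two
    proportions, p1 - p2, based on the normal approximation."""
    p1, p2 = x1 / n1, x2 / n2
    d = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d - z * se, d + z * se

# 200/1000 successes under version A vs 150/1000 under version B
lo, hi = diff_prop_ci(200, 1000, 150, 1000)
print(f"95% CI: ({lo:.4f}, {hi:.4f})")  # prints "95% CI: (0.0168, 0.0832)"
```

Since the interval excludes zero, at the 5% level we would reject the hypothesis that the two true proportions are equal, consistent with the chi-squared test from the earlier post.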


The robust sandwich variance estimator for linear regression (using R)

In a previous post we looked at the (robust) sandwich variance estimator for linear regression. This method allowed us to estimate valid standard errors for our coefficients in linear regression, without requiring the usual assumption that the residual errors have constant variance.

In this post we'll look at how this can be done in practice using R, with the sandwich package (I'll assume below that you've installed this library). To illustrate, we'll first simulate some simple data from a linear regression model where the residual variance increases sharply with the covariate:
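As a plain-Python stand-in for the R code (the simulation settings and variable names here are my own), the following simulates such heteroskedastic data and compares the classical OLS standard error for the slope with the HC0 sandwich standard error, which for simple regression is sqrt(sum((xi - xbar)^2 ei^2) / Sxx^2):

```python
import math
import random

random.seed(2)
n = 500
x = [random.uniform(0, 10) for _ in range(n)]
# residual standard deviation grows sharply with the covariate
y = [1.0 + 2.0 * xi + random.gauss(0.0, 0.3 * xi ** 1.5) for xi in x]

# ordinary least squares fit of y = intercept + slope * x
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar
resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

# classical standard error: assumes constant residual variance
s2 = sum(e ** 2 for e in resid) / (n - 2)
se_ols = math.sqrt(s2 / sxx)

# HC0 sandwich standard error for the slope: uses each squared residual
se_hc0 = math.sqrt(sum((xi - xbar) ** 2 * e ** 2
                       for xi, e in zip(x, resid)) / sxx ** 2)

print(f"slope {slope:.3f}, classical SE {se_ols:.4f}, sandwich SE {se_hc0:.4f}")
```

Because the residual variance is largest where (xi - xbar)^2 is also large, the sandwich standard error comes out larger than the classical one here, and it is the valid one; this mirrors what vcovHC from the sandwich package computes in R.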


Wald vs likelihood ratio test

When taking a course on likelihood-based inference, one of the key topics is testing and confidence interval construction based on the likelihood function. Usually the Wald, likelihood ratio, and score tests are covered. In this post I'm going to review the advantages and disadvantages of the Wald and likelihood ratio tests. I will focus on confidence intervals rather than tests, because the deficiencies of the Wald approach are seen more transparently there.
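A binomial proportion gives a simple illustration of the contrast (this example is my own, not the post's). The Wald interval is phat plus or minus 1.96 standard errors, while the likelihood ratio interval collects all p whose deviance 2(l(phat) - l(p)) stays below the chi-squared(1) 0.95 quantile, 3.841. With few successes, the Wald interval can stray outside [0, 1] while the likelihood ratio interval cannot:

```python
import math

def loglik(p, x, n):
    """Binomial log-likelihood (dropping the constant binomial coefficient)."""
    return x * math.log(p) + (n - x) * math.log(1 - p)

def lr_ci(x, n, crit=3.841):
    """Likelihood ratio CI: all p with 2*(l(phat) - l(p)) <= chi2(1) 0.95 quantile."""
    phat = x / n
    lmax = loglik(phat, x, n)
    def dev(p):
        return 2 * (lmax - loglik(p, x, n))
    def solve(inside, outside):
        # bisect between a point inside the interval and one outside it
        for _ in range(200):
            mid = (inside + outside) / 2
            if dev(mid) > crit:
                outside = mid
            else:
                inside = mid
        return inside
    return solve(phat, 1e-12), solve(phat, 1 - 1e-12)

x, n = 3, 20
phat = x / n
se = math.sqrt(phat * (1 - phat) / n)
wald = (phat - 1.96 * se, phat + 1.96 * se)   # lower limit is negative here
lr = lr_ci(x, n)

print(f"Wald: ({wald[0]:.4f}, {wald[1]:.4f})")
print(f"LR:   ({lr[0]:.4f}, {lr[1]:.4f})")
```

The likelihood ratio interval is also invariant to reparametrization, whereas the Wald interval is not, which is another of the deficiencies discussed in the post.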


Adjusting for baseline covariates in randomized controlled trials

Randomized controlled trials are generally considered the gold standard design for evaluating the effects of an intervention or treatment of interest. The fact that participants are randomized to the two (sometimes more) groups ensures that, at least in expectation, the treatment groups are balanced with respect to both measured and, importantly, unmeasured factors which may influence the outcome. As a consequence, differences in outcomes between the two groups can be attributed to the effect of being randomized to the treatment rather than the control (which would often be another treatment).
