The Stats Geek

The Hosmer-Lemeshow goodness of fit test for logistic regression

February 16, 2014 (updated September 12, 2024) by Jonathan Bartlett

Before a model is relied upon to draw conclusions or predict future outcomes, we should check, as far as possible, that the model we have assumed is correctly specified, that is, that the data do not conflict with the assumptions made by the model. For binary outcomes, logistic regression is the most popular modelling approach. In this post we’ll look at the widely used, but sometimes criticized, Hosmer-Lemeshow goodness of fit test for logistic regression.

A/B testing – confidence interval for the difference in proportions using R

February 15, 2014 (updated February 23, 2014) by Jonathan Bartlett

In a previous post we looked at how Pearson’s chi-squared test (or Fisher’s exact test) can be used to test whether the ‘success’ proportions are equal under two conditions. In biostatistics this setting arises, for example, when patients are randomized to receive one or the other of two treatments, and for each patient we observe either a ‘success’ (which could of course be a bad outcome, such as death) or a ‘failure’. In web design, people may have data where web site visitors are sent at random to one of two versions of a page, and for each visit a success is defined as some outcome such as the purchase of a product. In both cases, we may be interested in testing the hypothesis that the true proportions of successes under the two conditions are equal, which is what that earlier post covered. Note that the randomization described in these two examples is not necessary for the statistical procedures described in this post, but of course randomization affects our interpretation of the differences between the groups.
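As a rough illustration of this two-group setting, the sketch below uses base R's prop.test to estimate the two success proportions and a confidence interval for their difference. The counts are made up for illustration and do not come from the post; the continuity correction is switched off here so that the test matches the uncorrected Pearson chi-squared test, which is an assumption about what one might want rather than a recommendation.

```r
# Hypothetical A/B data: successes and total visitors for each page version
successes <- c(45, 60)
visitors  <- c(500, 500)

# Two-sample test of equal proportions, without continuity correction
result <- prop.test(successes, visitors, correct = FALSE)

# Estimated success proportion in each group
print(result$estimate)

# 95% confidence interval for (proportion in group 1 - proportion in group 2)
print(result$conf.int)

# p-value for the null hypothesis that the two proportions are equal
print(result$p.value)
```

The interval is for the first group's proportion minus the second's, so the order in which the groups are supplied determines its sign.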

The robust sandwich variance estimator for linear regression (using R)

February 14, 2014 (updated May 10, 2014) by Jonathan Bartlett

In a previous post we looked at the (robust) sandwich variance estimator for linear regression. This method allows us to estimate valid standard errors for our coefficients in linear regression, without requiring the usual assumption that the residual errors have constant variance.

In this post we’ll look at how this can be done in practice using R, with the sandwich package (I’ll assume below that you’ve installed this package). To illustrate, we’ll first simulate some simple data from a linear regression model where the residual variance increases sharply with the covariate:
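The post's own code is not included in this excerpt. A minimal sketch of such a simulation might look like the following, where the seed, sample size, slope, and the exp(x) form for the residual standard deviation are all assumptions chosen for illustration. The call to vcovHC from the sandwich package is guarded so that the rest of the sketch still runs if the package is not installed.

```r
# Hypothetical simulation settings (not taken from the original post)
set.seed(1234)
n <- 1000
x <- rnorm(n)

# Residual SD grows sharply with x, so the constant-variance assumption fails
residual_sd <- exp(x)
y <- 2 * x + residual_sd * rnorm(n)

# Ordinary least squares fit and the usual model-based standard errors
mod <- lm(y ~ x)
model_se <- sqrt(diag(vcov(mod)))
print(model_se)

# Sandwich (heteroskedasticity-consistent) standard errors, if available
if (requireNamespace("sandwich", quietly = TRUE)) {
  robust_se <- sqrt(diag(sandwich::vcovHC(mod, type = "HC0")))
  print(cbind(model_se, robust_se))
}
```

With variance that depends on the covariate like this, the model-based and sandwich standard errors for the slope would be expected to differ noticeably.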