Multiple imputation with interactions and non-linear terms

Multiple imputation has become an extremely popular approach to handling missing data, for a number of reasons. One is that once the imputed datasets have been generated, they can each be analysed using standard analysis methods, and the results pooled using Rubin’s rules. However, in addition to the missing at random assumption, for multiple imputation to give unbiased point estimates the model(s) used to impute missing data need to be (at least approximately) correctly specified. Because of this, care must be taken when choosing the imputation model.

What constitutes a reasonable imputation model depends on the dataset and situation at hand. One commonly encountered situation, where it is not obvious what one should do, is when the dataset, or the model(s) which will be fitted after imputation, contains interaction terms or non-linear terms such as squared terms.

Adjusting for covariate misclassification in logistic regression – predictive value weighting

When we fit regression models, we implicitly assume that the values in our dataset are accurate measurements of the variables of interest. In many settings, the measurements we actually have are imperfect. In the case of a categorical variable, for some of the records in our dataset the observed value may differ from the true value, due to misclassification. Misclassification arises for many different reasons. In epidemiology, the instruments used to measure conditions are often imperfect: sometimes observations which should be recorded as 1 are recorded as 0, and vice versa. In this post I’ll focus on the common situation where logistic regression is used to model an outcome Y, and one of the covariates is subject to misclassification.
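To make the problem concrete, here is a minimal simulation sketch in Python showing how misclassification of a binary covariate can distort a logistic regression estimate. Everything in it is an illustrative assumption rather than something taken from a real dataset: the names X (true covariate) and W (misclassified version), the sensitivity of 0.8, the specificity of 0.9, and the true log odds ratio of 1. It assumes the misclassification is non-differential, that is, unrelated to Y. With a single binary covariate, the logistic regression slope estimate equals the log odds ratio from the 2x2 table, so no model-fitting library is needed.

```python
import math
import random


def log_odds_ratio(pairs):
    """Log odds ratio for y versus a binary covariate, from (covariate, y) pairs.

    With one binary covariate this equals the logistic regression slope estimate.
    """
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for x, y in pairs:
        counts[(x, y)] += 1
    return math.log(
        counts[(1, 1)] * counts[(0, 0)] / (counts[(1, 0)] * counts[(0, 1)])
    )


def simulate(n=200_000, beta0=-1.0, beta1=1.0, sens=0.8, spec=0.9, seed=1):
    """Simulate Y from a logistic model in X, then misclassify X to give W.

    All parameter values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    true_pairs, observed_pairs = [], []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p = 1 / (1 + math.exp(-(beta0 + beta1 * x)))
        y = 1 if rng.random() < p else 0
        # Non-differential misclassification: the error does not depend on y.
        if x == 1:
            w = 1 if rng.random() < sens else 0
        else:
            w = 0 if rng.random() < spec else 1
        true_pairs.append((x, y))
        observed_pairs.append((w, y))
    return log_odds_ratio(true_pairs), log_odds_ratio(observed_pairs)


true_est, naive_est = simulate()
print(f"log OR using true X:          {true_est:.3f}")
print(f"log OR using misclassified W: {naive_est:.3f}")
```

Under these assumptions the estimate based on W should be noticeably closer to zero than the one based on the true X, which is the attenuation that adjustment methods aim to correct.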

Area under the ROC curve – assessing discrimination in logistic regression

In a previous post we looked at the popular Hosmer-Lemeshow test for logistic regression, which can be viewed as assessing whether the model is well calibrated. In this post we’ll look at one approach to assessing the discrimination of a fitted logistic model, via the receiver operating characteristic (ROC) curve.
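As a rough illustration of what the area under the ROC curve measures, the sketch below computes it in Python directly from its interpretation as a concordance probability: the proportion of (case, non-case) pairs in which the case has the higher predicted probability, with ties counted as one half. The function name auc and the toy data are assumptions made for this example; in practice one would usually rely on an established statistical package rather than this quadratic-time loop.

```python
def auc(predicted, outcomes):
    """Area under the ROC curve, computed as the concordance probability.

    predicted: predicted probabilities from a fitted model
    outcomes:  observed binary outcomes (1 = case, 0 = non-case)
    """
    cases = [p for p, y in zip(predicted, outcomes) if y == 1]
    controls = [p for p, y in zip(predicted, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")

    score = 0.0
    for p_case in cases:
        for p_control in controls:
            if p_case > p_control:
                score += 1.0  # concordant pair
            elif p_case == p_control:
                score += 0.5  # tied pair counts as one half
    return score / (len(cases) * len(controls))


# Toy example: 3 of the 4 case/non-case pairs are concordant.
predicted = [0.1, 0.4, 0.35, 0.8]
outcomes = [0, 0, 1, 1]
print(auc(predicted, outcomes))  # 0.75
```

A value of 0.5 corresponds to predictions that discriminate no better than chance, while a value of 1 means every case received a higher predicted probability than every non-case.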
