When is complete case/records logistic regression unbiased?

It's sometimes thought that when data are missing, complete case analysis or complete records analysis, where those with missing values on the variables involved in the analysis are dropped, is biased unless data are missing completely at random (MCAR). In a previous post I explored the fact that complete case/records analysis can in fact be unbiased so long as missingness is unrelated to the outcome variable, conditional on the covariates. Depending on which variable(s) suffer from missingness, this can correspond to data being missing at random (MAR) or even missing not at random (MNAR).

Yesterday I gave a seminar at LSHTM discussing some recent work that brings together earlier results which have perhaps been somewhat neglected, looking at the specific case of logistic regression models. It turns out that, because of the symmetry property of the odds ratio which lies at the heart of logistic regression, a complete case/records logistic regression analysis can be unbiased for the association of a variable of interest (e.g. exposure), adjusted for a number of other covariates (e.g. confounders), in a perhaps surprising range of situations. The slides can be downloaded here, and an audio recording is available here.

As described in the slides, missingness can depend on the outcome and confounders, or on the exposure and confounders, and the complete records estimate of the exposure association remains unbiased. Depending on which variables have missing values, these conditions sometimes correspond to the MAR assumption and other times to an MNAR assumption. In general, if missingness depends jointly on exposure and outcome, estimates of the exposure association are biased. However, as described in the slides, there are even special cases here where estimates of the exposure association remain unbiased.
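To make the odds ratio symmetry concrete, here is a small sketch in Python. It is not taken from the slides or the paper: the coefficient values, the missingness mechanisms and the function names are all made up for illustration, and it assumes a binary exposure X and a binary confounder C. Rather than simulating data, it computes the exposure-outcome log odds ratio that holds among the complete records, within each level of the confounder, at the population level.

```python
import math

# Hypothetical outcome model (values chosen only for illustration):
# logit P(Y=1 | X, C) = B0 + B1*X + B2*C, with binary exposure X and confounder C.
B0, B1, B2 = -1.0, 0.7, 0.5


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def complete_records_log_or(p_obs, c):
    """Population log odds ratio for X and Y among complete records with C = c.

    p_obs(x, y, c) is the assumed probability that a record is complete.
    P(X = x | C = c) cancels out of the odds ratio, so it is not needed.
    """
    def cell(x, y):
        p_y1 = expit(B0 + B1 * x + B2 * c)
        p_y = p_y1 if y == 1 else 1.0 - p_y1
        return p_y * p_obs(x, y, c)

    return math.log(cell(1, 1) * cell(0, 0) / (cell(1, 0) * cell(0, 1)))


# Three hypothetical missingness mechanisms.
def depends_on_outcome_and_confounder(x, y, c):
    return expit(-0.5 + 1.5 * y + 0.8 * c)


def depends_on_exposure_and_confounder(x, y, c):
    return expit(0.3 - 1.0 * x + 0.8 * c)


def depends_jointly_on_exposure_and_outcome(x, y, c):
    # Exposed cases are much more likely to be complete than everyone else.
    return 0.9 if (x == 1 and y == 1) else 0.3


if __name__ == "__main__":
    mechanisms = [
        depends_on_outcome_and_confounder,
        depends_on_exposure_and_confounder,
        depends_jointly_on_exposure_and_outcome,
    ]
    for p_obs in mechanisms:
        for c in (0, 1):
            value = complete_records_log_or(p_obs, c)
            print(f"{p_obs.__name__}, C={c}: log OR = {value:.4f} (truth {B1})")
```

Under the first two mechanisms the completeness probabilities cancel from the odds ratio, so the complete records log odds ratio equals B1 in both confounder strata. Under the third mechanism it is shifted by log(3), to roughly 1.80. The intercept and confounder coefficient are not examined here; the sketch only concerns the exposure association.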

October 2015: This work has now been published in the American Journal of Epidemiology, and is available open-access here.

Missing covariates in structural equation models

This week I was talking to a friend about how covariates which have missing values are handled in structural equation modelling (SEM) software. I'll preface this post by saying that I'm definitely not an expert (or anywhere close!) in structural equation models, so if anyone spots errors/problems please add a comment. My friend thought that certain implementations of SEMs in some packages have the ability to automatically accommodate missingness in covariates, using so-called 'full information maximum likelihood'. In the following I'll describe my subsequent exploration of how Stata's sem command handles missingness in covariates.


Multiple imputation using random forest

In recent years a number of researchers have proposed using machine learning techniques to impute missing data. One of these is the so-called random forest technique. I recently gave a talk at the International Biometric Society's conference in Florence, Italy, on the topic. In case it is of interest to anyone, the slides of the talk are available below.

Slides from talk at IBC2014 on random forest multiple imputation

Multiple imputation with interactions and non-linear terms

Multiple imputation has become an extremely popular approach to handling missing data, for a number of reasons. One is that once the imputed datasets have been generated, they can each be analysed using standard analysis methods, and the results pooled using Rubin's rules. However, in addition to the missing at random assumption, for multiple imputation to give unbiased point estimates the model(s) used to impute missing data need to be (at least approximately) correctly specified. Because of this, care must be taken when choosing the imputation model.
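As a rough illustration of the pooling step, Rubin's rules for a single scalar estimate can be sketched in Python as follows. The function name pool_rubin and the example numbers are made up for illustration; in practice the estimates and variances would come from fitting the analysis model to each imputed dataset.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool a scalar estimate across m imputed datasets using Rubin's rules.

    estimates: the point estimate from each imputed dataset
    variances: the corresponding squared standard errors
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations, with one variance per estimate")

    pooled = mean(estimates)       # average of the m point estimates
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between

    # Rubin's degrees of freedom for the t reference distribution.
    if between == 0:
        df = math.inf
    else:
        df = (m - 1) * (1.0 + within / ((1.0 + 1.0 / m) * between)) ** 2
    return pooled, total, df


if __name__ == "__main__":
    # Made-up estimates and variances from m = 3 imputed datasets.
    pooled, total, df = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(pooled, math.sqrt(total), df)
```

The total variance adds the between-imputation variance, inflated by a factor of (1 + 1/m), to the average within-imputation variance, which is how the extra uncertainty due to the missing data enters the pooled standard error.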

What constitutes a reasonable imputation model will obviously depend on the dataset and situation at hand. One situation which is commonly encountered, but where it is not obvious what one should do, is where the dataset, or the model(s) which will be fitted after imputation, contains interaction terms or non-linear terms such as squared terms.


When is complete case analysis unbiased?

My primary research area is that of missing data. Missing data are a common issue in empirical research. Within biostatistics missing data are almost ubiquitous - patients often do not come back to visits as planned, for a variety of reasons. In surveys, participants may move between survey waves and we lose contact with them, such that we are missing their responses to the questions we would have liked to ask them.
