My primary research area is missing data. Missing data are a common issue in empirical research. Within biostatistics missing data are almost ubiquitous – patients often do not come back to visits as planned, for a variety of reasons. In surveys, participants may move between survey waves and we lose contact with them, so that we are missing their responses to the questions we would have liked to ask them.
Missing data always cause, to a lesser or greater extent, a loss of information. This shows up as larger standard errors and wider confidence intervals for parameter estimates. But an arguably more important consequence is that missing data may induce bias in our estimates, unless missingness is unrelated to the variables involved in our analysis (the so-called missing completely at random assumption).
There is a vast range of statistical techniques for accommodating missing data (see www.missingdata.org.uk). Perhaps the most commonly adopted is to simply exclude from our analysis those participants who have any data missing in the variables we are concerned with. This is commonly known as a ‘complete case analysis’ or ‘listwise deletion’ – we analyse only the complete cases. I recently gave a seminar (slides here) at LSHTM about when a complete case analysis is unbiased and a method for improving upon the efficiency of complete case analysis. In this post I’ll describe the first aspect, that of when a complete case analysis is unbiased.
Missing completely at random
As I noted earlier, if data are missing completely at random, meaning that the chance of data being missing is unrelated to any of the variables involved in our analysis, a complete case analysis is unbiased. This is because the subset of complete cases represents a random (albeit smaller than intended) sample from the population.
In general, if the data are not missing completely at random, so that the complete cases are systematically different from the incomplete cases (and hence from the sample as a whole), analysing only the complete cases will lead to biased estimates.
For example, suppose we are interested in estimating the median income of some population. We send out an email asking people to complete a questionnaire, in which participants are asked to say how much they earn. But only a proportion of the target sample return the questionnaire, and so we have missing incomes for the remaining people. If those who returned an answer to the income question have systematically higher or lower incomes than those who did not, the median income of the complete cases will be biased.
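The income example can be illustrated with a small simulation. This is only a sketch under assumptions of my own choosing: incomes are drawn from a hypothetical log-normal distribution, and the response probabilities (0.6 under MCAR, and 0.8 versus 0.4 when response depends on income) are arbitrary illustrative values, not estimates from any real survey.

```python
import random
import statistics


def simulate_income_survey(n=20000, seed=1):
    """Return (full-sample median, MCAR responder median, selective responder median)."""
    rng = random.Random(seed)

    # Hypothetical population: log-normal incomes (an assumption for illustration).
    incomes = [rng.lognormvariate(10.0, 0.5) for _ in range(n)]
    true_median = statistics.median(incomes)

    # MCAR: everyone responds with the same probability, whatever their income.
    mcar_responders = [y for y in incomes if rng.random() < 0.6]

    # Not MCAR: people with incomes above the median respond less often.
    def response_prob(y):
        return 0.8 if y < true_median else 0.4

    selective_responders = [y for y in incomes if rng.random() < response_prob(y)]

    return (
        true_median,
        statistics.median(mcar_responders),
        statistics.median(selective_responders),
    )


if __name__ == "__main__":
    truth, mcar, selective = simulate_income_survey()
    print(f"full sample median:               {truth:10.0f}")
    print(f"complete cases, MCAR:             {mcar:10.0f}")
    print(f"complete cases, income-dependent: {selective:10.0f}")
```

Under these assumptions the MCAR responders should give a median close to that of the full sample, while the median among the income-dependent responders should sit noticeably below it, because high earners are under-represented among the complete cases.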
Complete case analysis validity when data are not MCAR
However, in some cases, a complete case analysis can actually give unbiased estimates even when the data are not missing completely at random. One of these settings is that in which our analysis consists of fitting a regression model, relating the distribution of some outcome Y (or dependent variable) to one or more predictors (or independent variables) X (here X could consist of a number of predictors). Examples of such models are linear regression for continuous outcomes and logistic regression for binary outcomes. When missingness occurs in the outcome Y, in one or more of the predictors X, or in both, fitting the regression model to the complete cases gives unbiased estimates provided the probability of being a complete case is independent of Y, conditional on X (see the slides here for an explanation of why).
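A rough simulation sketch of this condition follows. Everything in it is an assumption made for illustration: a simple linear model Y = 1 + 2X + error with standard normal X and error, and made-up probabilities (0.9 versus 0.1) of being a complete case. It compares the fitted slope when being a complete case depends only on X with the slope when it depends on Y.

```python
import random


def ols_slope(pairs):
    """Least-squares slope from a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


def complete_case_slopes(n=50000, seed=2):
    """Return slopes from (full data, X-dependent complete cases, Y-dependent complete cases)."""
    rng = random.Random(seed)

    # Hypothetical data-generating model: Y = 1 + 2*X + standard normal error.
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)
        data.append((x, y))

    # Probability of being a complete case depends on X only
    # (so it is independent of Y, conditional on X).
    cc_depends_on_x = [(x, y) for x, y in data
                       if rng.random() < (0.9 if x < 0.0 else 0.1)]

    # Probability of being a complete case depends on Y itself.
    cc_depends_on_y = [(x, y) for x, y in data
                       if rng.random() < (0.9 if y < 1.0 else 0.1)]

    return ols_slope(data), ols_slope(cc_depends_on_x), ols_slope(cc_depends_on_y)


if __name__ == "__main__":
    full, by_x, by_y = complete_case_slopes()
    print(f"slope, full data:                    {full:.3f}")
    print(f"slope, complete cases (depends on X): {by_x:.3f}")
    print(f"slope, complete cases (depends on Y): {by_y:.3f}")
```

With these settings the X-dependent complete cases should recover a slope close to the true value of 2, despite the data clearly not being MCAR, whereas the Y-dependent complete cases should give a slope that is pulled below 2.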
In some settings, such as cohort studies, where people are followed up over time, this condition might reasonably be assumed to hold. For example, suppose X are factors measured on subjects at recruitment into the cohort study, and that the outcome Y is measured some time after recruitment. Suppose one of the predictors in X has missing values. Then missingness in X can’t be directly caused by Y, since the future value of Y is yet to be determined. Missingness in X is either caused by the value of X itself, or by other factors/variables. Only if missingness is caused by such other factors, and these factors independently affect the outcome Y, will complete case analysis be biased.
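The last point can also be sketched in a simulation. The setup is hypothetical: U stands for some other factor, independent of X, that drives missingness in X, and the coefficients and missingness probabilities are arbitrary. The function fits the regression of Y on X to the complete cases, once with U having no effect on Y and once with U affecting Y.

```python
import random


def ols_fit(pairs):
    """Least-squares (intercept, slope) from a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cohort_complete_case_fit(u_effect, n=50000, seed=3):
    """Fit Y on X using only subjects whose X was observed.

    Hypothetical model: Y = 1 + 2*X + u_effect*U + error, with U independent
    of X. Whether X is observed depends only on U, never on Y directly.
    """
    rng = random.Random(seed)
    complete_cases = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)
        y = 1.0 + 2.0 * x + u_effect * u + rng.gauss(0.0, 1.0)
        # X is much more likely to be recorded when U is below zero.
        x_observed = rng.random() < (0.9 if u < 0.0 else 0.1)
        if x_observed:
            complete_cases.append((x, y))
    return ols_fit(complete_cases)


if __name__ == "__main__":
    for effect in (0.0, 1.5):
        intercept, slope = cohort_complete_case_fit(effect)
        print(f"U effect on Y = {effect}: intercept {intercept:.3f}, slope {slope:.3f}")
```

In this particular setup, when U has no effect on Y both coefficients should be close to their true values (1 and 2). When U does affect Y, the complete cases over-represent low values of U, so the fitted intercept should be biased downwards; the slope happens to be unaffected here only because U was generated independently of X and missingness depends on U alone.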
Unfortunately, as is usually the case in analyses of missing data, this assumption about missingness cannot be definitively confirmed using the data at hand – to do this we would need to have the missing data available. However, in some cases the assumption that missingness is independent of outcome, after adjusting for the predictors, might be deemed plausible. In this case, whilst complete case analysis is not optimally efficient (it throws away the data from incomplete cases), it is at least unbiased.
So, for a particular analysis, before we ditch the humble complete case analysis, which all stats packages can perform (indeed it is typically the default approach for handling missing values), in favour of some more sophisticated method, we should stop and think about whether our complete case results might actually be ok (from a bias perspective). It’s important to say, however, that even when complete case analysis is unbiased, it is inefficient – it throws away all the information in the incomplete cases.
p.s. October 2015 – this paper I co-authored may be of interest – Asymptotically unbiased estimation of exposure odds ratios in complete records logistic regression