Missing covariates in competing risks analysis

Today I gave a seminar at the Centre for Biostatistics, University of Manchester, as part of a three-seminar afternoon on missing data. My talk described recent work on methods for handling missing covariates in competing risks analysis, with a focus on when complete case analysis is valid and on multiple imputation approaches. For the latter, our substantive model compatible adaptation of fully conditional specification now supports competing risks analysis, in both R and Stata (see here).

The slides of my talk are available here.

Update 13th May 2016: the corresponding paper is now available (open access) here.

Multiple imputation followed by deletion of imputed outcomes

In 2007, Paul von Hippel published a nice paper proposing a variant of the conventional multiple imputation (MI) approach to handling missing data. The paper advocated a multiple imputation followed by deletion (MID) approach. The context considered was where we are interested in fitting a regression model for an outcome Y with covariates X, and some Y and X values are missing. The approach advocated consists of running imputation as usual, imputing missing values in Y and X, but then discarding those records where the outcome Y had been imputed. The reduced datasets, with missing X values imputed but only observed Y values, are then analysed as usual, with results combined using Rubin's rules.

Read more

Substantive model compatible imputation of covariates - smcfcs in R

I'm pleased to announce the release of an R package, smcfcs, which implements multiple imputation of missing covariates using substantive model compatible fully conditional specification. As described in a previous post, this is a modified version of the popular fully conditional specification, or chained equations, approach to multiple imputation (e.g. as implemented in the excellent MICE package).

smcfcs is an attractive approach when the outcome or substantive model includes interactions or non-linear covariate effects, or is itself a non-linear model, such as Cox's proportional hazards model. In these cases, it can be difficult, or sometimes impossible, to directly specify an imputation model for partially observed covariates that is compatible with the outcome/substantive model. Such incompatibility can lead to biased estimates, due to mis-specification of the imputation model. smcfcs resolves this potential problem by ensuring that each partially observed covariate is imputed from an imputation model which is compatible with a user-specified outcome/substantive model.
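To make the incompatibility problem concrete, here is a minimal simulation sketch in Python (numpy only). It does not use or reproduce smcfcs. The data generating model, sample size, missingness proportion and variable names are all assumptions chosen for illustration. The covariate x has a quadratic effect on y, but x is imputed from a normal linear regression on y, which is not compatible with that substantive model. For simplicity it uses a single stochastic imputation with parameters fixed at their estimates, rather than proper multiple imputation.

```python
import numpy as np

rng = np.random.default_rng(2015)
n = 200_000

# Hypothetical substantive model: y = x + x^2 + e, true quadratic coefficient 1
x = rng.normal(size=n)
y = x + x**2 + rng.normal(size=n)

# Make half of the x values missing completely at random
missing = rng.random(n) < 0.5
obs = ~missing
n_mis = int(missing.sum())

# Incompatible imputation model: x given y assumed normal and linear in y,
# fitted to the complete records (np.polyfit returns slope, then intercept)
slope, intercept = np.polyfit(y[obs], x[obs], 1)
resid_sd = np.std(x[obs] - (intercept + slope * y[obs]))
x_imp = x.copy()
x_imp[missing] = (intercept + slope * y[missing]
                  + rng.normal(scale=resid_sd, size=n_mis))


def quadratic_coef(xv, yv):
    """Least-squares coefficient on x^2 in the model y ~ 1 + x + x^2."""
    design = np.column_stack([np.ones_like(xv), xv, xv**2])
    beta, *_ = np.linalg.lstsq(design, yv, rcond=None)
    return beta[2]


coef_full = quadratic_coef(x, y)         # before any values are deleted
coef_imputed = quadratic_coef(x_imp, y)  # after incompatible imputation

print(f"quadratic coefficient, full data:    {coef_full:.3f}")
print(f"quadratic coefficient, imputed data: {coef_imputed:.3f}")
```

In this setup the estimate from the imputed data should be clearly attenuated towards zero relative to the true value of 1, even though the data are missing completely at random and the sample is very large.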

smcfcs is available on CRAN. It supports linear and logistic regression outcome models, as well as Cox proportional hazards models for censored time to event outcomes. Competing risks outcomes can also be accommodated through specification of Cox models for each cause specific hazard function. A Stata version is also available, and can be installed from within Stata from the SSC archive using: ssc install smcfcs

Including the outcome in imputation models of covariates

Multiple imputation has become a popular approach for handling missing data (see www.missingdata.org.uk). Suppose that we have an outcome (dependent variable in our model of interest) Y, and a covariate X. Suppose further that X contains some missing values, and that we are happy to assume that these satisfy the missing at random assumption. Then we might consider using multiple imputation to impute the missing values in X. A natural question that then follows is whether, in the imputation model for X, the variable Y should be included as a covariate. Particularly when Y is a variable measured later in time than X, our intuition may lead us to think that it is inappropriate to use the future information contained in Y when imputing X. This, however, is not the case.
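A small simulation sketch in Python (numpy only) shows what goes wrong when Y is left out. Everything here is an assumption made for illustration: the linear model with true slope 1, the 50% missing completely at random mechanism, and the variable names. To keep it short it uses a single stochastic imputation with parameters fixed at their estimates, not proper multiple imputation.

```python
import numpy as np

rng = np.random.default_rng(2014)
n = 200_000

# Hypothetical model of interest: y = x + e, so the true slope of y on x is 1
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# Half of the x values are missing completely at random
missing = rng.random(n) < 0.5
obs = ~missing
n_mis = int(missing.sum())


def slope_y_on_x(xv, yv):
    """Least-squares slope from regressing y on x."""
    return np.polyfit(xv, yv, 1)[0]


# (a) Impute x ignoring y: draw from the observed marginal distribution of x
x_without_y = x.copy()
x_without_y[missing] = rng.normal(x[obs].mean(), x[obs].std(), size=n_mis)

# (b) Impute x using y: linear regression of x on y, plus residual noise
b, a = np.polyfit(y[obs], x[obs], 1)
resid_sd = np.std(x[obs] - (a + b * y[obs]))
x_with_y = x.copy()
x_with_y[missing] = a + b * y[missing] + rng.normal(scale=resid_sd, size=n_mis)

slope_without_y = slope_y_on_x(x_without_y, y)
slope_with_y = slope_y_on_x(x_with_y, y)

print(f"slope, imputation model omits y:    {slope_without_y:.3f}")
print(f"slope, imputation model includes y: {slope_with_y:.3f}")
```

Imputing X without Y produces imputed values that are independent of Y, so the estimated slope is pulled towards zero (to about half the true value in this setup). Including Y in the imputation model preserves the association.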

Read more

When is complete case/records logistic regression unbiased?

It's sometimes thought that when data are missing, complete case analysis or complete records analysis, where those with missing values on the variables involved in the analysis are dropped, is biased unless data are missing completely at random (MCAR). In a previous post I explored the fact that complete case/records analysis can in fact be unbiased so long as missingness is unrelated to the outcome variable, conditional on the covariates. Depending on which variable(s) suffer from missingness, this can correspond to data being missing at random (MAR) or even missing not at random (MNAR).
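The linear regression version of this result can be checked with a short simulation sketch in Python (numpy only). The model, the two missingness mechanisms and all names are hypothetical choices made for illustration. In the first mechanism, whether a record is complete depends only on the covariate x. In the second it depends on the outcome y, taken to the extreme that records are complete only when y is below 1.

```python
import numpy as np

rng = np.random.default_rng(2013)
n = 200_000

# Hypothetical model: y = 1 + 2x + e, so the true slope is 2
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

# Mechanism 1: probability of being a complete record depends on x only
complete_given_x = rng.random(n) < np.where(x < 0, 0.9, 0.3)

# Mechanism 2: being a complete record depends on the outcome y
complete_given_y = y < 1

# Complete records analysis under each mechanism
slope_x = np.polyfit(x[complete_given_x], y[complete_given_x], 1)[0]
slope_y = np.polyfit(x[complete_given_y], y[complete_given_y], 1)[0]

print(f"complete records slope, missingness depends on x: {slope_x:.3f}")
print(f"complete records slope, missingness depends on y: {slope_y:.3f}")
```

Under the first mechanism the complete records slope should be close to 2, whichever variable actually has the missing values. If it is x itself that is missing, this mechanism is MNAR, yet the complete records estimate is still unbiased. Under the second mechanism the slope is attenuated.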

Yesterday I gave a seminar at LSHTM discussing some recent work that brings together earlier results which have perhaps been somewhat neglected, looking at the specific case of logistic regression models. It turns out that because of the special symmetry property of the odds ratio measure which lies at the heart of logistic regression, a logistic regression complete case/records analysis can be unbiased for the association of a variable of interest (e.g. exposure) adjusted for a number of other covariates (e.g. confounders) in a perhaps surprising range of situations. The slides can be downloaded here, and an audio recording is available here.

As described in the slides, missingness can depend on the outcome and confounders, or on the exposure and confounders, and the complete records estimate of the exposure association remains unbiased. Depending on which variables have missing values, these conditions sometimes correspond to the MAR assumption and other times to an MNAR assumption. In general, if missingness depends jointly on exposure and outcome, estimates of the exposure association are biased. However, as described in the slides, there are even special cases here where estimates for the exposure association remain unbiased.
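The symmetry property can be seen in the simplest possible case, a 2x2 table with a binary exposure X, a binary outcome Y and no confounders. The following Python sketch (numpy only) uses made-up cell probabilities and made-up probabilities of being a complete record. It works with exact population proportions, not simulated data, and is only an illustration of the odds ratio algebra, not of the more general results in the slides.

```python
import numpy as np


def odds_ratio(table):
    """Odds ratio of a 2x2 table indexed as table[x, y]."""
    return (table[1, 1] * table[0, 0]) / (table[1, 0] * table[0, 1])


# Hypothetical joint probabilities P(X = x, Y = y); rows are x, columns are y
full = np.array([[0.40, 0.10],
                 [0.20, 0.30]])

# Hypothetical probabilities that a record is complete, by (x, y) cell
depends_on_y = np.array([[0.9, 0.5],
                         [0.9, 0.5]])     # varies with outcome only
depends_on_x = np.array([[0.8, 0.8],
                         [0.3, 0.3]])     # varies with exposure only
depends_on_both = np.array([[0.9, 0.9],
                            [0.9, 0.3]])  # joint dependence on x and y

or_full = odds_ratio(full)
# Cell proportions among complete records are proportional to full * P(complete)
or_y = odds_ratio(full * depends_on_y)
or_x = odds_ratio(full * depends_on_x)
or_both = odds_ratio(full * depends_on_both)

print(f"full data odds ratio:                 {or_full:.2f}")
print(f"complete records, depends on y:       {or_y:.2f}")
print(f"complete records, depends on x:       {or_x:.2f}")
print(f"complete records, depends on x and y: {or_both:.2f}")
```

When the probability of being complete depends only on Y or only on X, the selection probabilities cancel in the cross-product ratio, so the complete records odds ratio equals the full data odds ratio. With the joint dependence chosen here they do not cancel and the odds ratio changes.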

October 2015: This work has now been published in the American Journal of Epidemiology, and is available open-access here.

Missing covariates in structural equation models

This week I was talking to a friend about how covariates which have missing values are handled in structural equation modelling (SEM) software. I'll preface this post by saying that I'm definitely not an expert (or anywhere close!) in structural equation models, so if anyone spots errors/problems please add a comment. My friend thought that certain implementations of SEMs in some packages have the ability to automatically accommodate missingness in covariates, using so-called 'full information maximum likelihood'. In the following I'll describe my subsequent exploration of how Stata's sem command handles missingness in covariates.

Read more