For any users of my R package smcfcs, I've just released a new version (1.1.1) which, along with a few small changes, includes a critical bug fix. The bug affected imputation of categorical variables (both binary variables and those with more than two levels) when the substantive model is linear regression (other substantive model types were not affected). All users should update to the new version, which is available on CRAN.

# Missing data

## Weighting after multiple imputation for MNAR sensitivity analysis not recommended

A concern when analysing data with missing values is that the missing at random (MAR) assumption, upon which a number of methods rely, does not hold. When the MAR assumption is in doubt, ideally we should perform sensitivity analyses, whereby we assess how sensitive our conclusions are to plausible deviations from MAR. One route to performing such a sensitivity analysis, which is convenient if one has already performed multiple imputation (assuming MAR), is the weighting method proposed by Carpenter *et al* in 2007. This involves applying a weighted version of Rubin's rules to the parameter estimates obtained from the MAR imputations, with the weight given to a particular imputation estimate depending on how plausible the imputations in that dataset are under an assumed missing not at random (MNAR) mechanism. The method is appealing because, computationally, it requires relatively little additional effort once MAR imputations have been generated.

In an important paper just published by Rezvan *et al* in BMC Medical Research Methodology, the performance of this weighting method has been explored through a series of simulation studies. In summary, they find that the method does not recover unbiased estimates, even when the number of imputations used is large and the correct (true) value of the MNAR sensitivity parameter is used. The paper explains in detail possible reasons for the failure of the method, but the summary conclusion is that the weighting method ought not to be used for performing MNAR sensitivity analyses after MAR multiple imputation.

What might one do as an alternative? One option is to perform a selection modelling MNAR sensitivity analysis using software such as WinBUGS or JAGS, in which the substantive model and selection (missingness) model are jointly fitted, and one uses an informative prior for the sensitivity parameter. A further alternative, which like the weighting approach can (in certain situations) exploit multiple imputations generated assuming MAR, is the pattern mixture approach, whereby the MAR imputations are modified to reflect an assumed MNAR mechanism. The modified imputations can then be analysed and results combined using Rubin's rules in the usual way.
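As a rough illustration of the pattern mixture route, here is a minimal Python sketch. It assumes the simplest possible modification, adding a fixed offset `delta` to every imputed value, which is only one of many ways MAR imputations might be altered. The function names, the toy data and the value of `delta` are all made up for illustration, and the analysis is just the mean of the outcome.

```python
from statistics import mean, variance


def delta_adjust(values, was_imputed, delta):
    """Shift only the imputed values by delta; observed values are untouched."""
    return [v + delta if imp else v for v, imp in zip(values, was_imputed)]


def mean_and_variance(values):
    """Estimate of the mean and its estimated sampling variance."""
    return mean(values), variance(values) / len(values)


def rubin_pool(estimates, variances):
    """Combine per-imputation estimates and variances using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Toy data: four observed outcomes plus two values imputed under MAR,
# repeated for m = 3 imputations (all numbers are made up for illustration).
observed = [1.0, 2.0, 3.0, 4.0]
mar_imputations = [[2.0, 3.0], [1.0, 4.0], [3.0, 3.0]]
was_imputed = [False] * len(observed) + [True] * 2
delta = -1.0  # assumed MNAR shift: missing outcomes are 1 unit lower on average

results = [
    mean_and_variance(delta_adjust(observed + imp, was_imputed, delta))
    for imp in mar_imputations
]
pooled_estimate, pooled_variance = rubin_pool(
    [est for est, _ in results], [var for _, var in results]
)
print(pooled_estimate, pooled_variance)
```

In a real sensitivity analysis one would repeat this over a range of plausible `delta` values and see how the pooled estimate and its variance change.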

## Missing covariates in competing risks analysis

Today I gave a seminar at the Centre for Biostatistics, University of Manchester, as part of a three-seminar afternoon on missing data. My talk described recent work on methods for handling missing covariates in competing risks analysis, with a focus on when complete case analysis is valid and on multiple imputation approaches. For the latter, our substantive model compatible adaptation of fully conditional specification now supports competing risks analysis, both in R and Stata (see here).

The slides of my talk are available here.

Update 13th May 2016: the corresponding paper is now available (open access) here.

## Multiple imputation followed by deletion of imputed outcomes

In 2007, Paul von Hippel published a nice paper proposing a variant of the conventional multiple imputation (MI) approach to handling missing data. The paper advocated a multiple imputation followed by deletion (MID) approach. The context considered was where we are interested in fitting a regression model for an outcome Y with covariates X, and some Y and X values are missing. The approach advocated consists of running imputation as usual, imputing missing values in Y and X, but then discarding those records where the outcome Y had been imputed. The reduced datasets, with missing X values imputed but only observed Y values, are then analysed as usual, with results combined using Rubin's rules.
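The deletion step itself is easy to picture. The following Python sketch assumes each imputed dataset is a list of `(y, x)` records in the original row order, and that we have kept an indicator of which outcomes were missing before imputation. The function name and the toy values are hypothetical and are not taken from the paper.

```python
def delete_imputed_outcomes(imputed_datasets, y_was_missing):
    """Keep only records whose outcome Y was observed in the original data."""
    return [
        [record for record, missing in zip(dataset, y_was_missing) if not missing]
        for dataset in imputed_datasets
    ]


# Toy example with two imputed datasets of four (y, x) records each.
# Record 2 had a missing Y; record 3 had a missing X (made-up values).
y_was_missing = [False, True, False, False]
imputed_datasets = [
    [(1.0, 0.5), (2.3, 1.1), (3.0, 1.4), (4.0, 2.2)],
    [(1.0, 0.5), (1.8, 1.1), (3.0, 1.7), (4.0, 2.2)],
]

mid_datasets = delete_imputed_outcomes(imputed_datasets, y_was_missing)
for dataset in mid_datasets:
    print(dataset)
```

Each reduced dataset still contains the record with an imputed X, but not the record with an imputed Y. The substantive model would then be fitted to each reduced dataset and the results pooled with Rubin's rules.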


## Substantive model compatible imputation of covariates - smcfcs in R

I'm pleased to announce the release of an R package, *smcfcs*, which implements multiple imputation of missing covariates using substantive model compatible fully conditional specification. As described in a previous post, this is a modified version of the popular fully conditional specification, or chained equations, approach to multiple imputation (e.g. as implemented in the excellent MICE package).

*smcfcs* is an attractive approach when the outcome or substantive model includes interactions or non-linear covariate effects, or is itself a non-linear model, such as Cox's proportional hazards model. In these cases, it can be difficult, or sometimes impossible, to directly specify an imputation model for partially observed covariates that is compatible with the outcome/substantive model. Such incompatibility can lead to biased estimates, due to mis-specification of the imputation model. *smcfcs* resolves this potential problem by ensuring that each partially observed covariate is imputed from an imputation model which is compatible with a user specified outcome/substantive model.

*smcfcs* is available for R on CRAN. It supports linear and logistic regression outcome models, as well as Cox proportional hazards models for censored time to event outcomes. Competing risks outcomes can also be accommodated through specification of Cox models for each cause specific hazard function. A Stata version is also available, and can be installed from within Stata from the SSC archive using: `ssc install smcfcs`

## Including the outcome in imputation models of covariates

Multiple imputation has become a popular approach for handling missing data (see www.missingdata.org.uk). Suppose that we have an outcome (dependent variable in our model of interest) Y, and a covariate X. Suppose further that X contains some missing values, and that we are happy to assume that these satisfy the missing at random assumption. Then we might consider using multiple imputation to impute the missing values in X. A natural question that then follows is whether, in the imputation model for X, the variable Y should be included as a covariate. Particularly when Y is a variable measured later in time than X, our intuition may lead us to think that it is inappropriate to use the future information contained in Y when imputing X. This, however, is not the case.
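A small simulation can make the point concrete. The Python sketch below rests on simplifying assumptions of my own choosing: X is standard normal, Y = X + noise (so the true slope of Y on X is 1), half of the X values are missing completely at random, and each missing X is imputed once by a random draw from a model fitted to the complete cases. This is a single stochastic imputation with the fitted parameters treated as known, not proper multiple imputation, but it is enough to show the direction of the bias.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=20000, seed=2016):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + rng.gauss(0, 1) for xi in x]          # true slope of Y on X is 1
    missing = [rng.random() < 0.5 for _ in range(n)]  # X missing completely at random

    x_obs = [xi for xi, m in zip(x, missing) if not m]
    y_obs = [yi for yi, m in zip(y, missing) if not m]
    n_obs = len(x_obs)
    mean_x = sum(x_obs) / n_obs
    mean_y = sum(y_obs) / n_obs

    # Imputation ignoring Y: draw from a normal fitted to the observed X values.
    sd_x = (sum((xi - mean_x) ** 2 for xi in x_obs) / (n_obs - 1)) ** 0.5
    x_no_y = [rng.gauss(mean_x, sd_x) if m else xi for xi, m in zip(x, missing)]

    # Imputation using Y: regress X on Y in the complete cases, then draw
    # from the fitted conditional normal distribution of X given Y.
    b = ols_slope(y_obs, x_obs)
    a = mean_x - b * mean_y
    resid_ss = sum((xi - a - b * yi) ** 2 for xi, yi in zip(x_obs, y_obs))
    resid_sd = (resid_ss / (n_obs - 2)) ** 0.5
    x_with_y = [
        a + b * yi + rng.gauss(0, resid_sd) if m else xi
        for xi, yi, m in zip(x, y, missing)
    ]

    return ols_slope(x_no_y, y), ols_slope(x_with_y, y)


slope_no_y, slope_with_y = simulate()
print("slope of Y on X, imputing X without Y:", round(slope_no_y, 3))
print("slope of Y on X, imputing X with Y:   ", round(slope_with_y, 3))
```

Under these assumptions, imputing X without Y makes the imputed values independent of Y, so the estimated slope is pulled towards zero (roughly halved here, since half the X values are imputed). Imputing from a model that includes Y gives an estimate close to the true value of 1.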
