Multiple imputation followed by deletion of imputed outcomes

In 2007, Paul von Hippel published a nice paper proposing a variant of the conventional multiple imputation (MI) approach to handling missing data. The paper advocated a multiple imputation followed by deletion (MID) approach. The context considered was one where we are interested in fitting a regression model for an outcome Y with covariates X, and some Y and X values are missing. The approach consists of running imputation as usual, imputing missing values in both Y and X, but then discarding those records where the outcome Y had been imputed. The reduced datasets, which contain imputed X values but only observed Y values, are then analysed as usual, with results combined using Rubin’s rules.
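The steps just described can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not von Hippel's implementation: a single covariate, a bivariate normal imputation model fitted to a bootstrap sample of the complete cases (a crude stand-in for a proper posterior draw, and only reasonable if data are missing completely at random), and hypothetical function names (`ols_slope`, `impute_once`, `mid_slope`).

```python
import numpy as np


def ols_slope(x, y):
    """OLS slope of y on x (with intercept) and its model-based variance."""
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta[1], cov[1, 1]


def impute_once(x, y, rng):
    """One imputed copy of (x, y); NaN marks a missing value.

    Assumption: (X, Y) is bivariate normal, with parameters estimated
    from a bootstrap sample of the complete cases.
    """
    cc = np.flatnonzero(~np.isnan(x) & ~np.isnan(y))
    boot = rng.choice(cc, size=cc.size, replace=True)
    mu_x, mu_y = x[boot].mean(), y[boot].mean()
    s = np.cov(x[boot], y[boot])  # 2x2 covariance matrix of (X, Y)

    xi, yi = x.copy(), y.copy()
    mx, my = np.isnan(x), np.isnan(y)

    # X missing and Y missing: draw X from its marginal distribution.
    both = mx & my
    xi[both] = mu_x + np.sqrt(s[0, 0]) * rng.standard_normal(both.sum())

    # X missing, Y observed: draw X from the conditional X | Y.
    only_x = mx & ~my
    sd_x_given_y = np.sqrt(s[0, 0] - s[0, 1] ** 2 / s[1, 1])
    xi[only_x] = (mu_x + s[0, 1] / s[1, 1] * (y[only_x] - mu_y)
                  + sd_x_given_y * rng.standard_normal(only_x.sum()))

    # Y missing: draw Y from the conditional Y | X, using the completed X.
    sd_y_given_x = np.sqrt(s[1, 1] - s[0, 1] ** 2 / s[0, 0])
    yi[my] = (mu_y + s[0, 1] / s[0, 0] * (xi[my] - mu_x)
              + sd_y_given_x * rng.standard_normal(my.sum()))
    return xi, yi


def mid_slope(x, y, n_imputations=20, seed=0):
    """Multiple imputation, then deletion of records whose Y was imputed."""
    rng = np.random.default_rng(seed)
    y_observed = ~np.isnan(y)
    estimates, variances = [], []
    for _ in range(n_imputations):
        xi, yi = impute_once(x, y, rng)
        # Deletion step: analyse only the records with an observed Y.
        b, v = ols_slope(xi[y_observed], yi[y_observed])
        estimates.append(b)
        variances.append(v)
    estimates, variances = np.array(estimates), np.array(variances)
    # Rubin's rules: pooled estimate and total variance.
    pooled = estimates.mean()
    within = variances.mean()
    between = estimates.var(ddof=1)
    total = within + (1 + 1 / n_imputations) * between
    return pooled, total
```

Note that in this one-pass sketch the imputed Y values never feed into the imputation of X, so imputing them has no effect on the final answer. In an iterative procedure such as chained equations, the current imputed Y values would be used when imputing X.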

von Hippel advocated the MID approach first on the basis that it usually gives more efficient inferences when the number of imputations is finite and not chosen to be very large. The second reason proposed is that MID is more robust to problems with the imputation model used for Y, since in the end only records with observed Y are used. Part of the argument for the deletion step is that when one is interested in the conditional distribution Y|X, and in the absence of auxiliary variables, individuals with missing Y give no information about the parameters of this conditional distribution.
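The role played by the number of imputations can be seen from Rubin's rules. The notation below is my own, not taken from von Hippel's paper: M is the number of imputations, and \hat{\theta}_m and \hat{V}_m are the estimate and its estimated variance from the m-th imputed dataset.

```latex
\bar{\theta}_M = \frac{1}{M}\sum_{m=1}^{M} \hat{\theta}_m, \qquad
W = \frac{1}{M}\sum_{m=1}^{M} \hat{V}_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M} \bigl(\hat{\theta}_m - \bar{\theta}_M\bigr)^2,
\qquad
T = W + \Bigl(1 + \frac{1}{M}\Bigr) B .
```

The B/M term in the total variance T reflects the Monte Carlo noise that comes from using only finitely many imputations, and it shrinks towards zero as M grows. Imputed outcomes contribute to the between-imputation variance B, which is one way to see why discarding them can help when M is small.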

The first thing to note is that the efficiency difference can always be made as small as desired by simply using standard MI with more imputations. Regarding the second proposed advantage, on the face of it, and in the absence of auxiliary variables, this seems reasonable. However, in this setting (in the absence of auxiliary variables) the imputation model for Y|X would ordinarily be chosen to be the same as the analysis model. Thus if we are worried that this model is badly misspecified, then we are equivalently saying that our analysis model is badly misspecified. This argument is not entirely clear cut though, since the imputation model ordinarily requires a full parametric model for Y|X, whereas the analysis model could make fewer assumptions. For example, if one fits a linear regression model for Y|X, then provided one uses robust standard errors and has a moderate sample size, the normality assumption is not needed.

von Hippel also discussed possible disadvantages of the MID approach, principally considering situations where auxiliary variables are available which could help impute the missing Ys. His arguments in this regard, however, mainly focused on the efficiency of estimates. In the last week a new paper has been published by Thomas Sullivan and colleagues in the American Journal of Epidemiology, looking at the bias and precision of the MID approach. One cautionary finding from their work is that when missingness in the outcome is associated with an auxiliary variable, the MID approach can lead to biased estimates. Suppose we have an auxiliary variable V which is associated with Y, and that Y is missing at random (MAR) given V. In this case, marginalising over V, missingness in Y will be associated with Y itself (conditional on X). As such, if in the end we only analyse those with observed Ys (as in the MID approach), this hybrid of MI and complete case (complete records) analysis will be biased, as empirically demonstrated by Sullivan and colleagues. Consequently, Sullivan and colleagues caution researchers that MID may be inadvisable when one has such auxiliary variables, which seems entirely sensible.
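A small simulation can illustrate the mechanism. The setup below is my own and is not taken from Sullivan and colleagues' paper: X is fully observed, the auxiliary variable V is correlated with X and affects Y, and the probability of observing Y depends on V only. Because X is never missing here, MID reduces exactly to a complete case analysis. All coefficients, the sample size and the seed are arbitrary choices, and a single stochastic imputation stands in for full MI since only point estimates are compared.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 50_000

x = rng.standard_normal(n)
v = 0.5 * x + rng.standard_normal(n)             # auxiliary variable
y = 1.0 + 2.0 * x + v + rng.standard_normal(n)   # implies E[Y | X] = 1 + 2.5 X

# Y is MAR given V: the chance of observing Y depends on V alone.
p_observed = 1.0 / (1.0 + np.exp(2.0 * v))
observed = rng.random(n) < p_observed
missing = ~observed


def fit(covariates, response):
    """Least-squares coefficients (intercept first) of response on covariates."""
    design = np.column_stack([np.ones(len(response)), *covariates])
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    return beta


# Benchmark: regression of Y on X using the full data, before any deletion.
beta_full = fit([x], y)

# MID in this setting: regression of Y on X among records with observed Y.
beta_mid = fit([x[observed]], y[observed])

# Imputation that uses V: fit Y | X, V on the observed records, then draw
# the missing Ys from that model and analyse all records.
gamma = fit([x[observed], v[observed]], y[observed])
fitted_obs = gamma[0] + gamma[1] * x[observed] + gamma[2] * v[observed]
resid_sd = np.std(y[observed] - fitted_obs, ddof=3)
y_imputed = y.copy()
y_imputed[missing] = (gamma[0] + gamma[1] * x[missing] + gamma[2] * v[missing]
                      + resid_sd * rng.standard_normal(missing.sum()))
beta_imputed = fit([x], y_imputed)

print("full data      :", beta_full)
print("MID / complete :", beta_mid)
print("imputed with V :", beta_imputed)
```

In this setup the estimates based only on records with observed Y should sit noticeably away from the full-data values of roughly 1 and 2.5, while the analysis that keeps the outcomes imputed using V should land close to them.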
