Missing data books

General Missing Data Books

Handbook of Missing Data Methodology, 2014, 600 pages

In 2014 the Handbook of Missing Data Methodology was published by CRC Press. This volume is edited by leaders in the field: Molenberghs, Fitzmaurice, Kenward, Tsiatis and Verbeke, and gives encyclopedic coverage of missing data methodology. In chapters written by both the editors and other leading contributors to the field, the early developments of the field are carefully described, and the most recent developments are given excellent coverage. Following an introductory section, the book is split according to the three broad approaches for handling missing data: likelihood and Bayesian, multiple imputation, and semi-parametric approaches. The next section then covers approaches to analysing sensitivity to the missing at random assumption, while the last section contains chapters on special topics, including missing data in clinical trials and in survey analysis.

I'm still working my way through various chapters, but so far, as a researcher working in the field, I've found it extremely useful.

Pros: the most up to date volume covering missing data methodology.

Cons: none really.

Statistical Analysis with Missing Data, by Little and Rubin, 2002, 408 pages

Rod Little and Don Rubin have contributed massively to the development of theory and methods for handling missing data (Rubin being the originator of multiple imputation). In this book they take a rigorous and principled approach to handling missing data.

The first part of the book begins by outlining the problems caused by missing data, and Rubin's classification of missing data mechanisms. It then goes on to discuss complete case analysis and the key ideas behind imputing missing data. Next, they explain why with only single imputation, obtaining valid standard errors and confidence intervals is tricky, which is one of the motivations for multiple imputation, which is then introduced.

The second part of the book looks at likelihood-based frequentist and Bayesian approaches to the general problem of missing data. These are illustrated using a number of examples, with the multivariate normal being the main one. The EM algorithm is introduced as an approach to find maximum likelihood estimates when the missingness pattern is non-monotone. The last section introduces multiple imputation more formally, including a justification for Rubin's now famous 'rules' for combining estimates across imputations.
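As a rough illustration of what Rubin's rules do for a single scalar parameter, here is a minimal Python sketch. The function name pool_rubin and the example numbers are made up for illustration and are not taken from the book. The pooled estimate is the average of the per-imputation estimates, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    pooled = mean(estimates)           # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation (sample) variance
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total

# Hypothetical estimates and squared standard errors from M = 5 imputed datasets
est, within, between, total = pool_rubin(
    [1.0, 1.2, 0.8, 1.1, 0.9],
    [0.04, 0.05, 0.04, 0.05, 0.04],
)
print(est, total ** 0.5)  # pooled estimate and its standard error
```

The sketch leaves out the degrees of freedom used for confidence intervals and tests, which the books discussed here cover in detail.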

The last part of the book goes through a number of examples: the multivariate normal model, the log linear model for multivariate categorical data and the general location model for when a mixture of continuous and categorical variables are present. The final section looks at analyses under a missing not at random assumption.

Pros: the book rigorously introduces the key concepts relevant for analysing datasets subject to missingness. All of the main statistical methods are covered.

Cons: because of its age, there is no coverage of newer approaches, for example the chained equations approach to multiple imputation.

Analysis of Incomplete Multivariate Data, by Schafer, 1997, 444 pages

Schafer's book is a highly readable account of likelihood based and Bayesian approaches to analysing datasets suffering from missingness. The topics covered are fairly similar to Little and Rubin, except Schafer doesn't look at MNAR analyses, and also gives less detail about approaches such as weighted complete case analysis and alternative variance estimation methods such as the bootstrap.

The book is fairly detailed regarding the computational implementation of approaches such as the EM algorithm, which readers may find useful to a greater or lesser extent depending on their interests. An arguable strength of the book compared to Little and Rubin's is that it gives more practical guidance on implementing multiple imputation.

Pros: highly readable and careful exposition of the key concepts. Alongside the technical details, useful practical guidance is given regarding applying the methods to real datasets.

Cons: arguably a little out of date now (e.g. chained equations imputation isn't included).

Missing Data in Clinical Studies, by Molenberghs and Kenward, 2007, 528 pages

This book gives a broad account of the issues raised by, concepts needed for, and statistical methods for handling missing data in clinical studies. After introducing examples which are used throughout the book, Rubin's taxonomy for missing data mechanisms and the concept of ignorability are introduced.

The second part of the book briefly looks at so called 'classical techniques', including complete case analysis and various simple imputation approaches. This includes the perhaps still commonly used (in longitudinal studies) approach of last observation carried forward.

The third part looks at analyses when the missing data mechanism can be ignored (i.e. missing at random plus a variation independence assumption on parameters). The first approach is maximum likelihood (termed 'direct likelihood' in this text). This includes introducing mixed effect models (random-effect models) for multivariate or repeated measures data, which when fitted by likelihood (as they almost always are), give valid inferences under ignorability. The next chapter introduces the EM algorithm as one method for obtaining maximum likelihood estimates which is particularly attractive in the presence of missing or latent data.

The next chapter introduces multiple imputation (MI) generically, and the justification for Rubin's rules. A notable feature of this book is the inclusion, in chapter 10, of weighted generalized estimating equations (GEE), which for non-continuous outcomes (e.g. binary), provide an alternative approach to generalized linear mixed models. The next chapter then looks at how MI and GEE can be combined.

Chapter 12 looks at issues surrounding frequentist inferences based on the likelihood. This includes coverage of the earlier work by the authors concerning whether the expected or observed information matrix should be used (the observed is recommended) when there are missing data (see here for the open access paper on which this is based). Chapter 13 then illustrates the ideas described in the preceding chapters with one of the example datasets, and chapter 14 gives details of the methods' implementations in SAS.

Part 5 then gives an extensive account of sensitivity analyses to the MAR assumption (i.e. analyses assuming MNAR). This is probably the most detailed account in a book of this topic, with Hogan and Daniels' book (see below) being a competitor. The final two chapters of the book then explore how these approaches might be applied to two of the illustrative datasets.

Pros: an expansive coverage of the different approaches to handling missing data in clinical studies. The detailed description of the likelihood and weighted GEE approaches is also great, and something given less or no coverage in other books.

Cons: from a software perspective SAS is the focus, so users of other software will naturally find the code of less use. If one is interested in the practicalities of using MI, the MI-focused books (see below) are arguably better.

Multiple Imputation books

Multiple imputation (MI) has become an extremely popular approach to handling missing data. However, there are a large number of issues and choices to be considered when applying it.

Flexible Imputation of Missing Data, by van Buuren, 2012, 342 pages

Stef van Buuren was one of the originators of the chained equations / full conditional specification approach to multiple imputation, and his popular R package MICE is used throughout the book. This approach involves specifying separate conditional models for each variable with missing values, as opposed to explicitly specifying a multivariate model.
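To make the idea of separate conditional models concrete, here is a simplified Python sketch of one chained equations run for all-continuous data. It is not the MICE implementation: the function names are hypothetical, every variable is imputed with a normal linear regression on all the others, and the regression parameters are drawn under a standard vague prior before each imputation.

```python
import numpy as np

def draw_linear_imputations(X_obs, y_obs, X_mis, rng):
    """Draw imputations from a Bayesian normal linear regression (vague prior)."""
    n, p = X_obs.shape
    xtx_inv = np.linalg.inv(X_obs.T @ X_obs)
    beta_hat = xtx_inv @ X_obs.T @ y_obs
    resid = y_obs - X_obs @ beta_hat
    sigma2 = resid @ resid / rng.chisquare(n - p)                # draw residual variance
    beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)   # draw coefficients
    noise = rng.normal(0.0, np.sqrt(sigma2), size=X_mis.shape[0])
    return X_mis @ beta + noise

def chained_equations(data, n_iter=10, seed=0):
    """Return one completed copy of `data` (2-D float array, NaN marks missing)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    filled = data.copy()
    # Starting values: fill each hole with the observed mean of its column.
    col_means = np.nanmean(data, axis=0)
    filled[missing] = np.take(col_means, np.nonzero(missing)[1])
    for _ in range(n_iter):
        for j in range(data.shape[1]):
            mis_j = missing[:, j]
            if not mis_j.any():
                continue
            # Conditional model for column j given all other columns,
            # using the current imputations of those other columns.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            filled[mis_j, j] = draw_linear_imputations(
                design[~mis_j], filled[~mis_j, j], design[mis_j], rng
            )
    return filled
```

Calling chained_equations M times with different seeds gives M completed datasets, which would then be analysed separately and pooled. Real software adds much more, such as different model types per variable and convergence diagnostics.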

The first two chapters introduce the general missing data problem and then the key ideas of MI. The third chapter then describes the mechanics of imputation of a single variable step by step, in a very intuitive way. This includes the predictive mean matching approach, which is the default method for quantitative variables in MICE. It also includes coverage of newer approaches based on regression trees. Multi-level models are also described for imputation of a single partially observed variable in a multi-level/hierarchical context.

The fourth chapter describes the two main approaches for specification of an imputation model when multiple variables have missing data, namely Rubin's original multivariate/joint model approach, and the newer chained equations / full conditional specification approach. The fifth chapter discusses some of the practical issues which arise when applying MI, including the choice of variables to include, diagnostics for imputation models, and how to allow for interactions and non-linearities. Chapter six covers the process of pooling the results across imputations, and includes a discussion of how to apply variable selection methods (e.g. stepwise) after using MI.
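The core of predictive mean matching can be sketched in a few lines of Python. This is a deliberately stripped-down version, not the MICE code: the function name pmm_impute and the choice of k = 5 candidate donors are assumptions for illustration, and a proper implementation would also draw the regression coefficients so that parameter uncertainty is reflected in the imputations.

```python
import numpy as np

def pmm_impute(x, y, k=5, seed=0):
    """Fill NaNs in y by predictive mean matching on a linear regression of y on x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mis = np.isnan(y)
    design = np.column_stack([np.ones(len(y)), x])
    # Fit the regression on cases with y observed, then predict for everyone.
    beta, *_ = np.linalg.lstsq(design[~mis], y[~mis], rcond=None)
    pred = design @ beta
    donor_pred, donor_y = pred[~mis], y[~mis]
    out = y.copy()
    for i in np.flatnonzero(mis):
        # The k observed cases whose predicted means are closest to case i's.
        nearest = np.argsort(np.abs(donor_pred - pred[i]))[:k]
        # Impute the actual observed value of one randomly chosen donor.
        out[i] = donor_y[rng.choice(nearest)]
    return out
```

Because every imputed value is borrowed from an observed case, imputations always stay within the range of the observed data, which is one reason the method is attractive for skewed or bounded variables.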

Chapters seven, eight and nine describe a number of different case studies, illustrating some of the ideas introduced earlier. This includes a chapter on imputation with longitudinal data. The book then concludes with a discussion of a number of do's and don'ts, and areas for future research.

Pros: a very practical guide to applying MI. Complex ideas are intuitively and clearly explained. Illustrated with R code using the MICE package throughout.

Cons: not many. If you're using something other than R, then of course the R code included will be of less interest!

Multiple Imputation and its Application, by Carpenter and Kenward, 2013, 368 pages

This book, authored by my colleagues James Carpenter and Mike Kenward, focuses on multiple imputation and describes how it can be successfully applied to handle a number of complications which often arise. The first part of the book describes Rubin's missing data taxonomy and how consideration of the missing data mechanism is key to choosing the statistical approach which is to be used. There is also a careful but understandable justification for Rubin's rules, and also an explanation of the important notion of congeniality between imputation and analysis models.

The second part begins by describing the two main approaches to MI, namely the original approach based on a multivariate/joint model, and then the chained equations or full conditional specification approach. This includes material on when the chained equations approach is equivalent to joint model MI. In the following sections, Carpenter and Kenward then explain how both approaches can be applied to impute i) quantitative, ii) binary and ordinal, iii) unordered categorical data. The remainder of the second part deals with the important situation(s) where interactions or non-linear relationships are present.

The final part of the book looks at a number of more advanced topics, including how to perform MI with multi-level or hierarchical data, how MI can be used to perform missing not at random sensitivity analyses and how MI can be applied with weighted survey data.

Pros: tackles how MI can be applied in a number of commonly occurring complex situations - i.e. with interactions and non-linearities, with multi-level or hierarchical data. Up to date availability of the various approaches using different software packages is also discussed throughout. The methods are also described using real examples throughout.

Cons: some of the issues which are tackled are relatively complicated, meaning that some of the material is of a more theoretical nature.

Clinical trials

Clinical Trials with Missing Data: A Guide for Practitioners, by O'Kelly and Ratitch, 2014, 472 pages

Missing data is a big issue in the world of clinical trials. While many of the other missing data books do mention clinical trials (some quite extensively), this book focuses exclusively on missing data in trials. It has just been published, and I've not looked at it yet, but my guess is that it will be of use to many statisticians and trialists. I will review this when I get a copy.

Longitudinal data

Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis, by Daniels and Hogan, 2008, 328 pages

This book is, as far as I am aware, the only one which focuses specifically on missing data in longitudinal studies. The first chapter introduces a number of case studies, which are then carried through the book as running examples. Chapters two to four then give a detailed overview of models for longitudinal data when there are no data missing.

Chapter five sets up a framework for handling missingness in longitudinal data, and what MCAR, MAR and MNAR mean in this context. The three approaches to sensitivity analysis under MNAR (selection modelling, pattern-mixture modelling, and shared parameter models) are also introduced. Chapters six and seven then cover analysis under 'ignorability' (the MAR assumption plus a parameter distinctiveness condition).

The final three chapters look in detail at performing sensitivity analyses when data are thought to be missing not at random. The exposition is clear and detailed. After reading them however, one realises that conducting MNAR sensitivity analyses in the longitudinal context is really quite a complex endeavour!

Pros: the only book which specifically focuses on missing data in longitudinal studies, and which describes sensitivity analyses in so much detail.

Cons: not really a con of the book, since MNAR sensitivity analyses are intrinsically complicated, but the material covered in this book is fairly technical.

Semiparametric approaches

Semiparametric Theory and Missing Data, by Tsiatis, 2006, 404 pages

In the 90s, Jamie Robins and colleagues at Harvard applied recently developed theory for semiparametric models to the problem of handling missing data. In this book, Tsiatis very carefully and didactically explains this theory. The first part of the book describes the theory for estimation in semiparametric models in the absence of missing data. This theory is very interesting in its own right - important examples of the models discussed are generalized estimating equations for multivariate data and Cox's proportional hazards model for survival data.

The second part of the book then describes how this semiparametric theory can be applied to the problem of parameter estimation when we have missing data under the missing at random assumption. This leads to so-called augmented inverse probability weighted complete case estimators, which combine the idea of a weighted complete case analysis with an imputation type approach. A strength of these estimators is the so-called double robustness property: they are consistent provided either the missingness model or the imputation type model is correctly specified.

Pros: this is a beautiful book, explaining this complex theory in as accessible a way as is probably possible.
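For the simplest case, estimating the mean of an outcome that is missing for some individuals, the augmented inverse probability weighted estimator can be sketched as below. This is an illustrative sketch rather than anything taken from the book: the function name aipw_mean is hypothetical, and the fitted observation probabilities (pi_hat) and outcome predictions (m_hat) are assumed to have been obtained beforehand from whatever models the analyst chose.

```python
def aipw_mean(y, observed, pi_hat, m_hat):
    """AIPW estimate of the mean of y when some outcomes are missing.

    y        : outcomes (any placeholder, e.g. None, where missing)
    observed : 1 if y[i] was observed, else 0
    pi_hat   : fitted probabilities that y[i] is observed (missingness model)
    m_hat    : fitted predictions of y[i] given covariates (outcome model)
    """
    total = 0.0
    for yi, ri, pi, mi in zip(y, observed, pi_hat, m_hat):
        ipw_term = yi / pi if ri else 0.0     # weighted complete case part
        augmentation = (ri - pi) / pi * mi    # imputation type correction
        total += ipw_term - augmentation
    return total / len(y)

# Toy data: the third outcome is missing and its true value is 3.0.
y = [1.0, 2.0, None, 4.0]
r = [1, 1, 0, 1]

# Outcome predictions exactly right, observation probabilities arbitrary:
# the estimate still equals the full-data mean of 2.5.
print(aipw_mean(y, r, pi_hat=[0.5, 0.9, 0.3, 0.6], m_hat=[1.0, 2.0, 3.0, 4.0]))
```

The toy call shows one side of double robustness in miniature: when the outcome predictions are right, badly chosen weights do no harm, because each individual's contribution reduces to their predicted value.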

Cons: none, although as a book about the theory of semiparametric methods, it is probably not so useful for applied researchers looking for practical guides for how to handle missing data.
