Missing data books

I have compiled below reviews of some of the missing data books I am (to varying extents) familiar with, together with a summary of each book's content.

Links are provided to Amazon, from which I may earn an affiliate commission if you subsequently make a purchase.

General Missing Data Books

Handbook of Missing Data Methodology, 2014, 600 pages (Amazon Affiliate link)

In 2014 the Handbook of Missing Data Methodology was published by CRC Press. This volume is edited by leaders in the field: Molenberghs, Fitzmaurice, Kenward, Tsiatis and Verbeke, and gives encyclopedic coverage of missing data methodology. In chapters written by the editors and other leading contributors to the field, the early developments of the field are carefully described, and the most recent developments are given excellent coverage. Following an introductory section, the book is split according to the three broad approaches for handling missing data: likelihood and Bayesian, multiple imputation, and semi-parametric approaches. The next section covers approaches to performing sensitivity analyses (to the missing at random assumption), while the last section contains chapters on special topics, including missing data in clinical trials and in survey analysis.

I’m still working my way through various chapters, but so far, as a researcher working in the field, I’ve found it extremely useful.

Pros: the most up-to-date volume covering missing data methodology.

Cons: none really.

Statistical Analysis with Missing Data, by Little and Rubin, 3rd edition 2019, 462 pages (Amazon Affiliate link)

Rod Little and Don Rubin have contributed massively to the development of theory and methods for handling missing data (Rubin being the originator of multiple imputation). In this book they take a rigorous and principled approach to handling missing data.

The first part of the book begins by outlining the problems caused by missing data and Rubin's classification of missing data mechanisms. It then goes on to discuss complete case analysis and the key ideas behind imputing missing data. Next, they explain why obtaining valid standard errors and confidence intervals is tricky with single imputation, which is one of the motivations for multiple imputation, which is then introduced.

The second part of the book looks at likelihood-based frequentist and Bayesian approaches to the general problem of missing data. These are illustrated using a number of examples, with the multivariate normal being the main one. The EM algorithm is introduced as an approach to finding maximum likelihood estimates when the missingness pattern is non-monotone. The last section introduces multiple imputation more formally, including a justification for Rubin's now famous 'rules' for combining estimates across imputations.
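
For reference, Rubin's rules take the following form. With estimates $\hat{\theta}_m$ and associated variance estimates $W_m$ obtained from each of $M$ imputed datasets, the combined point estimate and its variance are

$$\bar{\theta} = \frac{1}{M}\sum_{m=1}^{M} \hat{\theta}_m, \qquad T = \bar{W} + \left(1 + \frac{1}{M}\right)B,$$

where $\bar{W} = M^{-1}\sum_{m=1}^{M} W_m$ is the average within-imputation variance and $B = (M-1)^{-1}\sum_{m=1}^{M} (\hat{\theta}_m - \bar{\theta})^2$ is the between-imputation variance.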

The last part of the book goes through a number of examples: the multivariate normal model, the log-linear model for multivariate categorical data, and the general location model for when a mixture of continuous and categorical variables is present. The final section looks at analyses under a missing not at random assumption.

3rd edition updates: The 3rd edition was published in 2019, updating this classic text to cover some of the important developments in the field since the 2002 2nd edition. In particular, the 3rd edition covers multiple imputation by chained equations, and adds a number of applied examples, such as using multiple imputation for measurement error and in clinical trials.

Pros: the book rigorously introduces the key concepts relevant for analysing datasets subject to missingness. All of the main statistical methods are covered.

Cons: none particularly.

Analysis of Incomplete Multivariate Data, by Schafer, 1997, 444 pages (Amazon Affiliate link)

Schafer’s book is a highly readable account of likelihood-based and Bayesian approaches to analysing datasets suffering from missingness. The topics covered are fairly similar to Little and Rubin's, except that Schafer doesn't look at MNAR analyses, and gives less detail about approaches such as weighted complete case analysis and alternative variance estimation methods such as the bootstrap.

The book is fairly detailed regarding the computational implementation of approaches such as the EM algorithm, which readers may find more or less useful depending on their interests. An arguable strength of the book compared to Little and Rubin's is that more practical guidance is given on implementing multiple imputation.

Pros: highly readable and careful exposition of the key concepts. Alongside the technical details, useful practical guidance is given regarding applying the methods to real datasets.

Cons: arguably a little out of date now (e.g. chained equations imputation isn’t included).

Missing Data in Clinical Studies, by Molenberghs and Kenward, 2007, 528 pages (Amazon Affiliate link)

This book gives a broad account of the issues raised by, concepts needed for, and statistical methods for handling missing data in clinical studies. After introducing examples which are used throughout the book, Rubin's taxonomy of missing data mechanisms and the concept of ignorability are introduced.
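
For readers new to the area, the taxonomy can be summarised as follows. Writing $Y = (Y_{obs}, Y_{mis})$ for the observed and missing parts of the data and $R$ for the missingness indicators, the mechanisms are

$$\begin{aligned} \text{MCAR:} \quad & f(R \mid Y) = f(R) \\ \text{MAR:} \quad & f(R \mid Y) = f(R \mid Y_{obs}) \\ \text{MNAR:} \quad & f(R \mid Y) \text{ depends on } Y_{mis}. \end{aligned}$$

Ignorability then corresponds to MAR holding together with distinctness of the parameters of the data and missingness models.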

The second part of the book briefly looks at so-called 'classical techniques', including complete case analysis and various simple imputation approaches. This includes last observation carried forward, an approach perhaps still commonly used in longitudinal studies.

The third part looks at analyses when the missing data mechanism can be ignored (i.e. missing at random plus a variation independence assumption on parameters). The first approach is maximum likelihood (termed 'direct likelihood' in this text). This includes introducing mixed effects models (random effects models) for multivariate or repeated measures data, which, when fitted by likelihood (as they almost always are), give valid inferences under ignorability. The next chapter introduces the EM algorithm as one method for obtaining maximum likelihood estimates which is particularly attractive in the presence of missing or latent data.
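
In outline, given a parametric model $f(Y_{obs}, Y_{mis}; \theta)$, the EM algorithm alternates between

$$\text{E-step:} \quad Q(\theta \mid \theta^{(t)}) = E\left\{\log f(Y_{obs}, Y_{mis}; \theta) \mid Y_{obs}; \theta^{(t)}\right\}$$

$$\text{M-step:} \quad \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}),$$

and each iteration is guaranteed not to decrease the observed data log-likelihood.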

The next chapter introduces multiple imputation (MI) generically, along with the justification for Rubin's rules. A notable feature of this book is the inclusion, in chapter 10, of weighted generalized estimating equations (GEE), which, for non-continuous outcomes (e.g. binary), provide an alternative approach to generalized linear mixed models. The next chapter then looks at how MI and GEE can be combined.
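
In one common (observation-level weighting) form, the weighted GEE approach solves estimating equations of the form

$$\sum_{i=1}^{n} D_i^{\top} V_i^{-1} W_i \left\{Y_i - \mu_i(\beta)\right\} = 0,$$

where $D_i = \partial \mu_i / \partial \beta$, $V_i$ is a working covariance matrix, and $W_i$ is a diagonal matrix of weights $R_{ij}/\hat{\pi}_{ij}$, with $\hat{\pi}_{ij}$ the estimated probability that subject $i$ is observed at occasion $j$. Weighting in this way restores consistency under MAR, provided the model for the $\pi_{ij}$ is correctly specified.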

Chapter 12 looks at issues surrounding frequentist inferences based on the likelihood. This includes coverage of the authors' earlier work concerning whether the expected or observed information matrix should be used when there are missing data (the observed is recommended; see here for the open access paper on which this is based). Chapter 13 then illustrates the ideas described in the preceding chapters with one of the example datasets, and chapter 14 gives details of the methods' implementations in SAS.

Part 5 then gives an extensive account of sensitivity analyses to the MAR assumption (i.e. analyses assuming MNAR). This is probably the most detailed account of this topic in any book, with Daniels and Hogan's book (see below) being a competitor. The final two chapters of the book then explore how these approaches might be applied to two of the illustrative datasets.

Pros: an expansive coverage of the different approaches to handling missing data in clinical studies. The detailed description of the likelihood and weighted GEE approaches is also great, and something given little or no coverage in other books.

Cons: from a software perspective SAS is the focus, so users of other software will naturally find the code of less use. If one is interested in the practicalities of using MI, arguably the MI-focused books (see below) are better.

Multiple Imputation books

Multiple imputation (MI) has become an extremely popular approach to handling missing data. However, there are a large number of issues and choices to be considered when applying it.

Multiple Imputation for Nonresponse in Surveys, by Rubin, 1987, 287 pages (Amazon Affiliate link)

Rubin’s original book on multiple imputation. I only recently got my own copy of this, but so far it's been a pleasure to read, in particular because this is the book that really started MI's huge success. The context for the book is the handling of missing data in surveys, but of course much of it is highly relevant more broadly.

After an introductory chapter, Chapter 2 reviews the statistical framework used in the remainder of the book. The first part of this is that the objective is assumed to be inference for a characteristic or quantity of a fixed finite population. This differs from the usual model based paradigm, where the sample is assumed to have come from an infinite population, and the estimand of interest is a parameter indexing a probability distribution for this infinite population. The chapter then describes how Bayesian methods can be used for finite population inference, and sets out assumptions about the sampling and response (or missingness) mechanisms.

Chapter 3 gives the implementation details for MI. Chapter 4 then gives arguments for conditions under which MI will give frequentist-valid inferences. This includes the important notion of what it means for an imputation procedure to be 'proper' for a particular complete data analysis procedure. Examples are then given of imputation procedures that do and do not satisfy this criterion. The final two chapters then consider in detail imputation procedures under ignorable (essentially MAR) and non-ignorable (essentially MNAR) nonresponse or missingness mechanisms.

For those interested in the origins of MI and its original justifications and derivations, this book is highly recommended. The language and exposition are clear, and as well as covering the technical justification for MI, it is rich in practical guidance about how to sensibly impute partially observed datasets.

Flexible Imputation of Missing Data, by van Buuren, 2012, 342 pages (Amazon Affiliate link), and online version here

Stef van Buuren was one of the originators of the chained equations / full conditional specification approach to multiple imputation, and his popular R package MICE is used throughout the book. This approach involves specifying separate conditional models for each variable with missing values, as opposed to explicitly specifying a multivariate model.

The first two chapters introduce the general missing data problem and then the key ideas of MI. The third chapter then describes the mechanics of imputation of a single variable step by step, in a very intuitive way. This includes the predictive mean matching approach, which is the default method for quantitative variables in MICE. It also covers newer approaches based on regression trees. Multi-level models are also described for imputation of a single partially observed variable in a multi-level/hierarchical context.

The fourth chapter describes the two main approaches for specification of an imputation model when multiple variables have missing data, namely Rubin's original multivariate/joint model approach, and the newer chained equations / full conditional specification approach. The fifth chapter discusses some of the practical issues which arise when applying MI, including the choice of variables to include, diagnostics for imputation models, and how to allow for interactions and non-linearities. Chapter six covers the process of pooling the results across imputations, and includes a discussion of how to apply variable selection methods (e.g. stepwise) after using MI.
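
As a rough sketch of the impute/analyse/pool workflow these chapters describe, the following R code uses the nhanes example dataset that ships with the mice package (the dataset and variable names come from the package's examples, not from the book):

```r
# A minimal multiple imputation workflow with mice, using the
# nhanes example dataset shipped with the package
library(mice)

# Create m = 5 imputed datasets; predictive mean matching ("pmm")
# is the default method for quantitative variables
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123)

# Fit the substantive analysis model to each imputed dataset
fit <- with(imp, lm(chl ~ age + bmi))

# Pool estimates and standard errors across imputations
# using Rubin's rules
summary(pool(fit))
```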

Chapters seven, eight and nine describe a number of different case studies, illustrating some of the ideas introduced earlier. This includes a chapter on imputation with longitudinal data. The book then concludes with a discussion of a number of do’s and don’ts, and areas for future research.

Pros: a very practical guide to applying MI. Complex ideas are intuitively and clearly explained. Illustrated with R code using the MICE package throughout.

Cons: not many. If you’re using something other than R, then of course the R code included will be of less interest!

Multiple Imputation and its Application, by Carpenter, Bartlett, Morris, Wood, Quartagno and Kenward, 2023, 464 pages (Amazon Affiliate link)

I had previously written a brief review of the 1st edition of this book. I am now a co-author of the 2nd edition, and so I am almost surely biased in my opinions of it! That being said, I think some of the book's particular strengths include:

  • in-depth discussion of congeniality and compatibility, and the practical implications of this theory for data analysts
  • detailed descriptions of both joint modelling and fully conditional specification approaches to multiple imputation, and their relative pros and cons
  • an updated chapter on performing imputation with derived variables, such as interactions, non-linear effects, sum scores, and splines
  • expanded chapter on MI with survival data, including imputing missing covariates in Cox models and MI for case-cohort and nested case-control studies
  • new chapters on multiple imputation for / in the context of:
    • prognostic models
    • measurement error and misclassification
    • causal inference
    • using MI in practice

Clinical trials

Clinical Trials with Missing Data: A Guide for Practitioners, by O’Kelly and Ratitch, 2014, 472 pages (Amazon Affiliate link)

Missing data is a big issue in the world of clinical trials. While many of the other missing data books do mention clinical trials (some quite extensively), this book focuses exclusively on missing data in trials. It has just been published, and I’ve not looked at it yet, but my guess is that it will be of use to many statisticians and trialists. I will review this when I get a copy.

Longitudinal data

Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis, by Daniels and Hogan, 2008, 328 pages (Amazon Affiliate link)

This book is, as far as I am aware, the only one which focuses specifically on missing data in longitudinal studies. The first chapter introduces a number of case studies, which are then carried through the book as running examples. Chapters two to four then give a detailed overview of models for longitudinal data when there are no missing data.

Chapter five sets up a framework for handling missingness in longitudinal data, and explains what MCAR, MAR and MNAR mean in this context. The three approaches to sensitivity analysis under MNAR (selection modelling, pattern-mixture modelling, and shared parameter models) are also introduced. Chapters six and seven then cover analysis under 'ignorability' (the MAR assumption plus a parameter distinctiveness condition).
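
These three approaches correspond to different factorisations of the joint distribution of the outcomes $Y$ and missingness indicators $R$ (with $b$ denoting shared random effects):

$$\begin{aligned} \text{Selection:} \quad & f(Y, R) = f(Y)\, f(R \mid Y) \\ \text{Pattern-mixture:} \quad & f(Y, R) = f(R)\, f(Y \mid R) \\ \text{Shared parameter:} \quad & f(Y, R) = \int f(Y \mid b)\, f(R \mid b)\, f(b)\, db. \end{aligned}$$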

The final three chapters look in detail at performing sensitivity analyses when data are thought to be missing not at random. The exposition is clear and detailed. After reading them, however, one realises that conducting MNAR sensitivity analyses in the longitudinal context is really quite a complex endeavour!

Pros: the only book which specifically focuses on missing data in longitudinal studies, and which describes sensitivity analyses in so much detail.

Cons: not really a con of the book, since sensitivity analyses to the MNAR assumption are intrinsically complicated, but the material covered in this book is fairly technical.

Semiparametric approaches

Semiparametric Theory and Missing Data, by Tsiatis, 2006, 404 pages (Amazon Affiliate link)

In the 90s, Jamie Robins and colleagues at Harvard applied recently developed theory for semiparametric models to the problem of handling missing data. In this book, Tsiatis very carefully and didactically explains this theory. The first part of the book describes the theory for estimation in semiparametric models in the absence of missing data. This theory is very interesting in its own right; important examples of the models discussed are generalized estimating equations for multivariate data and Cox's proportional hazards model for survival data.

The second part of the book then describes how this semiparametric theory can be applied to the problem of parameter estimation when we have missing data under the missing at random assumption. This leads to so-called augmented inverse probability weighted complete case estimators, which combine the idea of a weighted complete case analysis with an imputation type approach. A strength of these estimators is the so-called double robustness property: they are consistent provided either the missingness model or the imputation type model is correctly specified.
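
To give a flavour, for estimation of a mean $\mu = E(Y)$ when $Y$ is missing at random given fully observed covariates $X$, the augmented inverse probability weighted estimator takes the form

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} \left\{\frac{R_i Y_i}{\hat{\pi}(X_i)} - \frac{R_i - \hat{\pi}(X_i)}{\hat{\pi}(X_i)}\, \hat{m}(X_i)\right\},$$

where $R_i$ indicates that $Y_i$ is observed, $\hat{\pi}(X)$ is an estimated model for $P(R = 1 \mid X)$, and $\hat{m}(X)$ an estimated model for $E(Y \mid X)$. The estimator is consistent if either of the two models (not necessarily both) is correctly specified.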

Pros: this is a beautiful book, explaining this complex theory in as accessible a way as is probably possible.

Cons: none, although as a book about the theory of semiparametric methods, it is probably not so useful for applied researchers looking for practical guides for how to handle missing data.

