I previously wrote a post about the meaning of missing at random for longitudinal data in clinical trials, stemming from an earlier question from someone. Somebody recently asked an excellent question in the comments to that post, which I’ll follow up on here using directed acyclic graphs (DAGs). The idea of using DAGs to understand missingness assumptions has been written about by a number of authors, including Daniel et al and Thoemmes and Mohan.
The setup is a clinical trial where we are measuring patients repeatedly over time. To simplify, suppose there are only two visits and we consider a group of patients assigned to receive a particular treatment. At visit 1, measurements are made on patients, and based on these, a decision is made on whether the patient will drop out of the study. If they drop out, their subsequent outcome measurement at visit 2 is missing. I assumed previously that if patients drop out, the treatment they receive may change, and this in turn will affect their (unobserved) subsequent outcome measurement. The original question was whether the resulting missing data are missing at random. My previous post argued that the data were not missing at random. The following DAG encodes the assumptions made:
The full outcome data consist of Y1 and Y2. Suppose we don’t observe the variable called Trt, which represents the patient’s treatment received between time/visit 1 and time/visit 2. The missingness indicator here is the Dropout variable, since we assume that if the patient drops out, Y2 is not observed. To determine if Y2 is MAR conditional on Y1, we can use the rules of DAGs to ask if the missingness indicator Dropout is independent of the partially observed variable Y2, conditional on Y1. The answer is No – there is an open path from the Dropout variable to Y2, via the treatment received variable Trt, indicated in the DAG below by the curved orange line.
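The d-separation argument can be checked mechanically. The following is a minimal sketch in plain Python, using the standard ancestral moral graph criterion. The edge list is an assumption read off the prose (Y1 → Y2, Y1 → Dropout, Dropout → Trt, Trt → Y2) rather than a copy of the figure, and the function names are purely illustrative. It checks Dropout against Y2 conditional on Y1 alone and, for comparison, conditional on Y1 and Trt.

```python
from itertools import combinations

# Assumed edges, inferred from the description in the text (not taken from the figure).
EDGES = [("Y1", "Y2"), ("Y1", "Dropout"), ("Dropout", "Trt"), ("Trt", "Y2")]


def ancestors(nodes, edges):
    """Return the given nodes together with all of their ancestors."""
    result = set(nodes)
    changed = True
    while changed:
        changed = False
        for parent, child in edges:
            if child in result and parent not in result:
                result.add(parent)
                changed = True
    return result


def d_separated(x, y, given, edges):
    """True if x and y are d-separated by `given` (ancestral moral graph criterion)."""
    # 1. Restrict the DAG to x, y, the conditioning set, and their ancestors.
    keep = ancestors({x, y} | set(given), edges)
    sub = [(p, c) for p, c in edges if p in keep and c in keep]
    # 2. Moralise: drop edge directions and connect parents that share a child.
    undirected = {frozenset(e) for e in sub}
    for child in keep:
        parents = [p for p, c in sub if c == child]
        undirected |= {frozenset(pair) for pair in combinations(parents, 2)}
    # 3. Remove the conditioning nodes and look for any remaining path from x to y.
    blocked = set(given)
    frontier, seen = [x], {x}
    while frontier:
        node = frontier.pop()
        if node == y:
            return False  # a path survives, so x and y are not d-separated
        for edge in undirected:
            if node in edge:
                (other,) = edge - {node}
                if other not in seen and other not in blocked:
                    seen.add(other)
                    frontier.append(other)
    return True


mar_given_y1 = d_separated("Dropout", "Y2", {"Y1"}, EDGES)
mar_given_y1_trt = d_separated("Dropout", "Y2", {"Y1", "Trt"}, EDGES)
print("Dropout independent of Y2 given Y1:", mar_given_y1)
print("Dropout independent of Y2 given Y1 and Trt:", mar_given_y1_trt)
```

Under these assumed edges, the path Dropout → Trt → Y2 stays open when only Y1 is conditioned on, which is the path described above. If the real DAG has further arrows (for example Y1 → Trt), they can be added to `EDGES` and the check rerun.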
A recent comment to my original post asked whether this was true if one accounts for the variable Trt recording what treatment patients receive between the visits. If the variable is observed (historically it often would not be among those who drop out), then it is true that Y2 is MAR conditional on Y1 and Trt, since the above orange path is blocked by adjusting/conditioning on Trt. In principle one could then use multiple imputation to impute the missing Y2 values in those who dropped out, provided we adjust/condition on both Y1 and Trt. Some who drop out will remain on treatment (Trt=1) and some will not (Trt=0). The imputation distribution for Y2 will need to adjust for Trt.
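To illustrate why the imputation model needs Trt, here is a small simulation sketch using NumPy. Everything in it is an assumption made for illustration: the data-generating values (dropout probability depending on Y1, 80% of completers and 30% of dropouts on treatment, a treatment effect of 1) are invented, and for brevity it uses single regression mean imputation rather than proper multiple imputation, which would also draw parameters and residual noise. It compares the estimated mean of Y2 when imputing from Y1 alone with imputing from Y1 and Trt.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 20_000

# Hypothetical data-generating process consistent with the DAG described in the text.
y1 = rng.normal(size=n)
dropout = rng.random(n) < 1 / (1 + np.exp(-y1))      # dropout depends on Y1
p_trt = np.where(dropout, 0.3, 0.8)                   # treatment depends on dropout
trt = (rng.random(n) < p_trt).astype(float)
y2 = 0.5 * y1 + 1.0 * trt + rng.normal(size=n)        # Y2 depends on Y1 and Trt

observed = ~dropout  # Y2 is only seen for those who did not drop out


def impute_mean(design):
    """Fit OLS among completers, fill in missing Y2 with predictions, return the mean."""
    beta, *_ = np.linalg.lstsq(design[observed], y2[observed], rcond=None)
    filled = np.where(observed, y2, design @ beta)
    return filled.mean()


ones = np.ones(n)
est_y1_only = impute_mean(np.column_stack([ones, y1]))
est_y1_trt = impute_mean(np.column_stack([ones, y1, trt]))
full_data_mean = y2.mean()  # known here only because the data are simulated

print(f"full-data mean of Y2:        {full_data_mean:.3f}")
print(f"imputing from Y1 only:       {est_y1_only:.3f}")
print(f"imputing from Y1 and Trt:    {est_y1_trt:.3f}")
```

In this setup the Y1-only imputation assumes dropouts have the same treatment mix as completers, so it should overstate the mean of Y2, while the model that also conditions on Trt should land close to the full-data mean.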
Suppose however that all those who don’t drop out remain on treatment. In this case there is no information in the data to estimate the distribution of Y2 (given Y1) conditional on not receiving treatment (Trt=0), and so we cannot (at least not without strong untestable assumptions) impute Y2 for those patients who dropped out and stopped receiving treatment. Intuitively this makes complete sense: we have no information in the observed data on what happens to Y2 if patients stop taking treatment. This obstacle is an example of a violation of the so-called positivity assumption. In particular, in this special case the probability of Y2 being observed is 0 if Trt=0. In practice the issue would show up when fitting the imputation model for Y2: your statistical software would give some sort of warning or error, because among those with Y2 observed, who are the patients used to fit the imputation model, there is no variation in the covariate Trt (all values of Trt in this subset being 1), and hence its effect on Y2 cannot be estimated.
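The positivity problem can be seen directly in the design matrix of the imputation model. The sketch below uses NumPy with invented numbers (a 40% dropout rate, and half of dropouts stopping treatment), so it is an illustration under those assumptions rather than a description of any particular trial or software. Among completers the Trt column is identical to the intercept column, so the matrix loses a rank and the Trt coefficient is not identified.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000

y1 = rng.normal(size=n)
dropout = rng.random(n) < 0.4  # hypothetical dropout rate
# All completers stay on treatment; among dropouts, roughly half stop (assumed).
trt = np.where(dropout, rng.random(n) < 0.5, True).astype(float)
observed = ~dropout

# Design matrix for an imputation model of Y2 on an intercept, Y1 and Trt.
design = np.column_stack([np.ones(n), y1, trt])

# Only completers can be used to fit the model.
rank_completers = np.linalg.matrix_rank(design[observed])
n_columns = design.shape[1]

# Empirical probability that Y2 is observed among those with Trt=0.
prop_observed_off_trt = observed[trt == 0].mean()

print("columns in design matrix:", n_columns)
print("rank among completers:   ", rank_completers)
print("P(Y2 observed | Trt=0):  ", prop_observed_off_trt)
```

A rank lower than the number of columns is the kind of condition that would typically lead regression software to drop the Trt term or report a collinearity warning when fitting the imputation model.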