The missing at random (MAR) assumption plays an extremely important role in the context of analysing datasets subject to missing data. Its importance lies primarily in the fact that if we are willing to assume data are MAR, we can identify (estimate) target parameters. There are a variety of methods for handling data which are assumed to be MAR. One approach is estimation of a model for the variables of interest using the method of maximum likelihood. In the context of randomised trials, primary analyses are sometimes based on methods which are valid under MAR, such as linear mixed models (MMRM). A key concern however is whether the MAR assumption is plausibly valid in any given situation.

A common type of missing data in randomised trials is due to dropout - once a patient drops out their follow-up data is (often) unavailable for the subsequent planned follow-up visits. If missingness is only due to dropout, the MAR assumption can be shown to be equivalent to the following condition: among those patients who had not dropped out by visit t-1, under MAR the probability of dropping out before visit t may depend on data measured at visit t-1 and preceding visits, but given these, does not depend on the possibly unobserved visit t observations (or later observations).
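
To make the condition concrete, here is one way of writing it down. The notation is my own assumption rather than anything fixed above: R_t = 1 if the patient is still in the study at visit t (and 0 otherwise), and Y_1, ..., Y_T are the outcomes at the T planned visits.

```latex
% Assumed notation: R_t = 1 if the patient is still in the study at visit t, 0 otherwise;
% Y_1, ..., Y_T are the outcomes at the T planned visits.
\[
P(R_t = 0 \mid R_{t-1} = 1, Y_1, \ldots, Y_T)
  = P(R_t = 0 \mid R_{t-1} = 1, Y_1, \ldots, Y_{t-1}),
  \qquad t = 2, \ldots, T.
\]
```

In words: among patients still in the study at visit t-1, the outcomes from visit t onwards carry no further information about dropout once the outcomes up to visit t-1 are known.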

Last year I was asked an interesting question on the missing data Google group that I maintain. I was asked about whether MAR was plausible in a longitudinal trial with a particular setup: consider those patients who attend visit t-1 (and hence have not yet dropped out). Suppose measurements are taken on these patients at visit t-1, and based on these, and possibly past measurements, a decision is made as to whether the patient drops out, or continues in the study.

On the face of it, the missingness caused by dropout would seem to be MAR, since missingness/dropout depends (in a causal sense) only on observed data (data recorded at visit t-1, and possibly earlier visits). Unfortunately, this logic (I believe) is not necessarily sound. Suppose, as will often be the case, that the distribution of outcome at time t differs between those patients who have not yet dropped out of the study and those who have, even after conditioning/adjusting for past data. In this case, because the indicator of whether a patient drops out between visit t-1 and visit t is predictive of outcome at time t, even conditional on the past information, the MAR assumption will not hold.

A somewhat contrived example of such a situation would be where, at visit t-1, each patient who has not yet dropped out tosses a coin to decide whether they will now drop out of the study. Next, suppose that if they drop out, they are no longer able to receive the intervention to which they were originally randomised. Lastly, suppose that the distribution of their outcome following dropout differs from that of the patients who did not drop out, even after adjusting for past measurements. This might be expected because, in contrast to the patients who did not drop out, they are no longer receiving their randomised intervention. Because of this, missingness will be associated with the outcome value at time t, even after adjusting for past data, such that MAR will not hold. This will be the case even though the missingness was generated by a purely random coin toss.

If anyone has thoughts on the above, or thinks there is a flaw in my logic, please add a comment.

This sounds right to me. The missingness indicator at time t (R_t) directly influences the outcome at time t (Y_t). Hence it is not true that Y_t and R_t are conditionally independent given what you have observed, and hence MAR is violated. Maybe it seems counterintuitive because we usually ask "what influences R?" whereas we should also ask "what is influenced by R?".

I think that you are correct, Jonathon, but that there is an important element missing from the explanations provided: perfect prediction. The conditions for MAR are very nearly satisfied in the examples given, except for the fact that there is imperfect overlap (structural support) between the participants who are lost and those who are retained, i.e. nobody with certain characteristics or who loses the coin toss is followed up. With only slight modification to study designs, the conditions for MAR can be satisfied: if some of the people who would otherwise be dropped from the study can be followed up (have their outcome observed), then MAR becomes plausible, given the information that led to participants being dropped.

Putting it like this might seem a bit like stating the obvious ('if we had more data about people with missing data, then missing data wouldn't be a problem'!) but I think it helps to clarify the source of the violation of MAR, and indicates the solutions that must be designed into studies of complex treatments in which people may become unsuitable for further treatment after treatment allocation. The problem is not that there are variables missing from missingness/imputation models, but that observed variables perfectly predict whether or not someone will be dropped. Break the perfect prediction and the problem is solved.
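
A rough sketch of what breaking the perfect prediction could look like in the coin toss setting, again under purely illustrative assumptions: the same made-up outcome model as one might use for the blog's example (slope 0.5, treatment effect 1.0), a hypothetical 20% random follow-up of those who lose the coin toss, and a helper `stratified_mean` that is my own invention. It uses only the Python standard library.

```python
import random
from statistics import fmean

rng = random.Random(1)
patients = []  # each entry: (off_treatment, observed, y_t)
for _ in range(100_000):
    past = rng.gauss(0.0, 1.0)
    off_treatment = rng.random() < 0.5   # loses the coin toss, stops the intervention
    followed_up = rng.random() < 0.2     # assumed 20% random follow-up subsample
    y_t = 0.5 * past + (0.0 if off_treatment else 1.0) + rng.gauss(0.0, 1.0)
    # Outcome is seen for everyone still on treatment, and for the followed-up subsample.
    observed = (not off_treatment) or followed_up
    patients.append((off_treatment, observed, y_t))


def stratified_mean(patients):
    """Average the observed stratum means, weighted by each stratum's share of all patients."""
    total = 0.0
    for stratum in (False, True):
        members = [p for p in patients if p[0] == stratum]
        seen = [p[2] for p in members if p[1]]
        total += len(members) / len(patients) * fmean(seen)
    return total


true_mean = fmean([p[2] for p in patients])
naive_mean = fmean([p[2] for p in patients if p[1]])
adjusted_mean = stratified_mean(patients)

print(f"true: {true_mean:.2f}, naive observed: {naive_mean:.2f}, stratified: {adjusted_mean:.2f}")
```

Because some outcomes are now observed in both coin toss groups, missingness is random within each group, and conditioning on the coin toss result recovers the overall mean. The naive mean of the observed outcomes remains biased, since people who lost the coin toss are still under-represented among those observed.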

Thanks James. Everything you say makes sense to me, except I can't really see why one could say that MAR is 'very nearly satisfied' in the examples I described. If the outcome distribution changes by a large amount (relative to those who don't drop out, and conditional on past information) following the decision to drop out, then it is surely 'very far' from being MAR.

Sorry, that was a poor choice of words. I was referring to the fact that all of the necessary variables are there (but not in a sufficient subsample of participants); not that the residual bias will necessarily be small. So we are 'nearly there' in the sense that 'it would only take a little bit more data', not 'there is only a little bit of bias' (technical terminology!). You're quite right - the bias could still be enormous.

I agree that dropouts should not be treated as MAR in longitudinal studies.

However, I don't quite understand the last part with the coin toss example. Are we still collecting data for the dropouts at time t?

MAR is the pattern of missingness. Since missingness is decided by a coin toss, whatever happened after the dropout (i.e. at time t) is not available. This missingness does not depend on any data at t-1, t-2, etc. It is a separate issue if we collect time t data for dropouts, but then these subjects will not be "dropouts". They would be considered "subjects who missed 1 period of intervention".

In that case, shouldn't the coin toss example represent missingness at t as MCAR?

Hi. I wasn't trying to say that MAR is always a bad assumption in longitudinal studies. I was just trying to explain how one could end up with MNAR even when the missingness is caused by a completely random process, if the occurrence of missingness/dropout influences subsequent outcomes.

In the example I described, those for whom the coin toss goes heads (say) drop out and no longer have their (future) outcomes observed. However, I assumed that the values of the future outcomes were affected by dropout, as might be the case for example if dropping out meant they stopped taking a particular drug or intervention. In this case, although dropout does not depend causally on the post-dropout outcomes, statistically they are dependent because the occurrence of dropout affects (we are assuming) the subsequent outcomes. The data would thus be MNAR.

It's probably also worth pointing out here that MAR is an *assumption* about the pattern of missingness in a given analysis; not the pattern of missingness itself. Everybody talks about whether "data are MAR or NMAR" but the focus in such statements is misdirected and leads to erroneous conclusions based on whether the real-life missing-data-generating mechanism was random or not, without due consideration of the analysis. A better way to think about it is to ask whether the information that is included in the analysis is sufficient to make the MAR assumption plausible. e.g.

1. If the coin toss does not affect the clinical outcome then "the MCAR assumption is plausible" and complete case analysis, etc., may be valid. An assumption of MAR would also be plausible, regardless of whether it incorporated information about the coin toss.

2. If the coin toss does affect the clinical outcome, then the MAR assumption is plausible *IF* sufficient information about the effect of the coin toss on the outcome is included in the analysis AND information about the coin toss is observed or can be modelled for everyone with a missing clinical outcome. If sufficient information about the effect of coin toss on the outcome and the coin toss status of those with missing outcomes is not included in the analysis, then the MAR assumption is not plausible (data are NMAR *with respect to this analysis*).

All of the adverbs in the second sentence, above, are to highlight another common mistake. The "real-life missing-data-generating mechanism" (coin toss, etc.) does not actually matter, except as far as it is implicated in the "missingness mechanism". The "missingness mechanism" refers to the conditional relationships between model variables and the likelihood of data being missing, which may or may not be related to the real-life mechanism. Schafer JL and Graham JW. Missing data: our view of the state of the art. Psychol Methods. 2002; 7: 147-77 provide a good explanation of this. If a coin toss affects the outcome, then there is a different "missingness mechanism" than if a coin toss does not affect the outcome, but both are based on the same real-life phenomenon.

Thanks James! I will add one more comment/clarification for people reading your excellent comment - I wouldn't say MCAR/MAR/MNAR are assumptions about "patterns" of missingness. At least not if one uses the term pattern to refer to which combination of variables is observed. Rather it is about the statistical dependence of missingness on the variables involved in the analysis.

Hello there. I'm looking at some longitudinal data for which there are missing values: it's hospital admissions data and although the date of admission is always present, the date of discharge can be missing. The date of discharge/length of hospital stay is important to the analyses. Can longitudinal missing data methods be used to explore missing dates/length of stay, based on say, characteristics of the patient (such as diagnosis, medical history) and/or the hospital? The condition obviously is that the length of stay for a given visit can't extend into the next admission. There are a number of ways I've been looking at this issue but wondered if "missing data" approaches might offer something worth exploring?

Hi Ron. I'm curious about what you had in mind when you referred to "missing data approaches", but presumably multiple imputation and the like. Without being dismissive, you have missing data, therefore anything that you do is a missing data approach. 'Doing nothing' usually means that your software will just ignore anybody with missing data, which is known as 'complete case analysis' or 'listwise deletion'. This is one possible approach to missing data, and usually not a very good one. In your case it might be defensible though, in which case you're in luck--if there is good reason to think that having a missing discharge date is unrelated to the length of hospital stay or any other variable that you are particularly interested in. This could be the case, for example, if missing discharge dates were only caused by data administrators in a hurry. But perhaps people with missing discharge dates were actually transferred, or perhaps they died in hospital, in which case the likelihood of having missing data would almost certainly be related to whatever it is that you are studying. A good starting point would be looking at everything that you do know about people with missing discharge dates. Do they differ on any known characteristics from people without missing discharge dates? Have you asked the data administrators why discharge dates might be missing (another important starting point, if open to you)?
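
As a very rough sketch of that first step, one could tabulate known characteristics by whether the discharge date is missing. Everything here is hypothetical: the records are made up, and the field names `age`, `emergency` and `discharge_date` and the helper `compare_by_missingness` are just for illustration.

```python
from statistics import fmean

# Made-up admissions records; None marks a missing discharge date.
admissions = [
    {"age": 54, "emergency": False, "discharge_date": "2020-01-08"},
    {"age": 81, "emergency": True, "discharge_date": None},
    {"age": 47, "emergency": False, "discharge_date": "2020-02-04"},
    {"age": 77, "emergency": True, "discharge_date": None},
    {"age": 62, "emergency": True, "discharge_date": "2020-03-05"},
    {"age": 58, "emergency": False, "discharge_date": "2020-03-29"},
]


def compare_by_missingness(rows, field):
    """Mean of `field` among admissions with and without a missing discharge date."""
    missing = [float(r[field]) for r in rows if r["discharge_date"] is None]
    recorded = [float(r[field]) for r in rows if r["discharge_date"] is not None]
    return {"missing": fmean(missing), "recorded": fmean(recorded)}


age_summary = compare_by_missingness(admissions, "age")
emergency_summary = compare_by_missingness(admissions, "emergency")

print("mean age:", age_summary)
print("proportion emergency:", emergency_summary)
```

Large differences between the two groups would suggest that missingness is related to patient characteristics, and so that ignoring the incomplete records is hard to defend. Similar groups would not prove the opposite, since missingness could still depend on the unobserved length of stay itself.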