Multiple imputation for missing baseline covariates in discrete time survival analysis

A while ago I got involved in a project led by Anna-Carolina Haensch and Bernd Weiß investigating multiple imputation methods for baseline covariates in discrete time survival analysis. The work has recently been published open access in the journal Sociological Methods & Research. The paper investigates a variety of different multiple imputation approaches. My main contribution was the extension of the substantive model compatible fully conditional specification (smcfcs) approach for multiple imputation to the discrete time survival model setting, and extending the functionality of the smcfcs package in R to incorporate this. In this short post I’ll give a quick demonstration of this functionality.

Example dataset

The smcfcs package includes an example dataset, ex_dtsam, that I’ll use for the demonstration here. The first few rows of the dataset looks like:

> library(smcfcs)
> head(ex_dtsam)
  x1         x2 failtime d
1 NA -0.3534953        1 0
2 NA  0.4747339        1 1
3  0 -1.0519961        1 1
4  1 -1.8582732        1 1
5  0 -0.1040417        3 0
6 NA  0.1881088        8 0

x1 is a binary variable with some missing values and x2 is a fully observed continuous variable. failtime is the discrete failure/censoring time variable, and d is the event indicator (=1 means failtime is the failure time, =0) means the failtime is a censoring time).

Complete case analysis

We’ll first fit a discrete time survival model with x1 and x2 as covariates to the complete cases – here meaning those with x1 observed. The dataset is in so called wide form – one row per individual. To fit a discrete time survival model using the standard glm function, we need to expand the data to long form. To do this, we can make use of the survSplit function in the survival package:

library(survival)
longCCData <- survSplit(Surv(failtime,d)~x1+x2, data=ex_dtsam,
                      cut=unique(ex_dtsam$failtime[ex_dtsam$d==1]))

To check it has worked, let’s take a look at the first 10 rows of the long form dataset:

> longCCData[1:10,]
   x1         x2 tstart failtime d
1  NA -0.3534953      0        1 0
2  NA  0.4747339      0        1 1
3   0 -1.0519961      0        1 1
4   1 -1.8582732      0        1 1
5   0 -0.1040417      0        1 0
6   0 -0.1040417      1        2 0
7   0 -0.1040417      2        3 0
8  NA  0.1881088      0        1 0
9  NA  0.1881088      1        2 0
10 NA  0.1881088      2        3 0

In the long form dataset, we have a new variable tstart that indicates the time that the corresponding interval starts. The failtime variable now indicates the end time of the corresponding interval.

The first 4 individuals fail or are censored in the first period, and so still only have one row in the long form dataset. The fifth individual was censored at time 3. Consequently, they have 3 rows in the long form dataset (rows 5-7). The failure indicator d is 0 for all 3 rows. If instead of being censored at time 3 they had failed, the value of d in the third row for this individual (i.e. row 7) would be 1.

We now use glm to fit a logistic regression model. There are different possibilities for how to model the effect of time on the baseline hazard. Here we make no assumptions about the shape of the (discrete) baseline hazard, by including tstart in the model as a covariate as a factor, in addition to x1 and x2:

ccMod <- glm(d~-1+factor(tstart)+x1+x2, 
   family="binomial", data=longCCData)

If we had included as a continuous covariate, we would be assuming that the baseline hazard changes linearly with time. Next we summarise the model:

> summary(ccMod)

Call:
glm(formula = d ~ -1 + factor(tstart) + x1 + x2, family = "binomial", 
    data = longCCData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7843  -0.8740  -0.6487   1.1486   2.3913  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
factor(tstart)0 -1.03926    0.12369  -8.402  < 2e-16 ***
factor(tstart)1 -0.81559    0.14605  -5.584 2.35e-08 ***
factor(tstart)2 -0.78402    0.18766  -4.178 2.94e-05 ***
factor(tstart)3 -0.52645    0.23051  -2.284   0.0224 *  
factor(tstart)4 -0.70223    0.31910  -2.201   0.0278 *  
factor(tstart)5 -0.86466    0.43622  -1.982   0.0475 *  
factor(tstart)6 -0.14377    0.48521  -0.296   0.7670    
factor(tstart)7  1.37926    0.90508   1.524   0.1275    
factor(tstart)8  1.04948    1.51623   0.692   0.4888    
x1               0.74245    0.14637   5.072 3.93e-07 ***
x2              -0.89708    0.08599 -10.433  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1608.1  on 1160  degrees of freedom
Residual deviance: 1333.7  on 1149  degrees of freedom
  (1555 observations deleted due to missingness)
AIC: 1355.7

Number of Fisher Scoring iterations: 4

Focusing on the coefficients for x1 and x2, we see strong evidence that x1=1 increases the hazard of failure, while increasing x2 decreases the hazard of failure. Note that I included a -1 in the model formula to prevent R from including an intercept, so that the tstart coefficients represent the hazard in each time (when x1 and x2 are set to 0).

Towards the bottom of the output we see that 1555 observations have been deleted (more accurately, not used) due to missingness. This is because of the missingness in x1. The original dataset had 1000 individuals, while the expanded long form dataset has 2715 rows. Next let’s use smcfcs to impute the missing x1 values.

Multiple imputation using smcfcs

To impute missing covariates compatibly with a discrete time survival model, we use the new smcfcs.dtsam function. This function inherits many of the usual arguments of the base smcfcs function, but has a few extra ones specific to the discrete time survival setting. An important point to note is that smcfcs.dtsam expects as input the data in wide form, i.e. the original ex_dtsam dataset. To impute the missing x1 values 5 times we call the following code:

#set a random seed for reproducibility
set.seed(272)
M <- 5
imps <- smcfcs.dtsam(ex_dtsam, smformula="Surv(failtime,d)~x1+x2",
                 method=c("logreg","", "", ""),m=M)

The smformula argument specifies the outcome/substantive model formula. It uses the Surv() function to create the censored failure time outcome. Note however that we have not specified in this model formula how the effect of time on the baseline hazard is to be handled. smcfcs.dtsam has an argument timeEffects for specifying this. Since we have not specified a value of this argument, the default value (factor) has been used – as such smcfcs is imputing compatibly with the logistic regression model we fitted earlier to the complete cases that models the effect of time as a factor variable (i.e. not assuming anything about the effect of time). The alternatives for this argument currently available are linear and quad, specifying linear or quadratic effects of time.

Analysing the imputed datasets

The imputed datasets returned by smcfcs.dtsam into the object we have called imps are in the wide form. To analyse the imputed datasets we therefore have to use the survSplit function on each one to expand it into the long form, fit the logistic regression model, save the results, and then pass these to a function to pool them using Rubin’s rules. In the smcfcs package I make use of the mitools package to do this, and this is what we’ll do here:

> #fit dtsam model to each dataset manually, since we need
> #to expand to person-period data form first
> ests <- vector(mode = "list", length = M)
> vars <- vector(mode = "list", length = M)
> for (i in 1:M) {
+   longData <- survSplit(Surv(failtime,d)~x1+x2, data=imps$impDatasets[[i]],
+                         cut=unique(ex_dtsam$failtime[ex_dtsam$d==1]))
+   mod <- glm(d~-1+factor(tstart)+x1+x2, family="binomial", data=longData)
+   ests[[i]] <- coef(mod)
+   vars[[i]] <- diag(vcov(mod))
+ }
> library(mitools)
> summary(MIcombine(ests,vars))
Multiple imputation results:
      MIcombine.default(ests, vars)
                   results         se     (lower      upper) missInfo
factor(tstart)0 -1.0793271 0.10135697 -1.2802512 -0.87840305     21 %
factor(tstart)1 -0.7931045 0.11333959 -1.0166214 -0.56958758     15 %
factor(tstart)2 -0.8477692 0.13694692 -1.1165919 -0.57894644      7 %
factor(tstart)3 -0.6476199 0.15818249 -0.9576972 -0.33754272      2 %
factor(tstart)4 -0.6011301 0.20005064 -0.9933020 -0.20895815      3 %
factor(tstart)5 -0.5155458 0.24885376 -1.0032980 -0.02779367      1 %
factor(tstart)6 -0.5973793 0.33037308 -1.2449107  0.05015212      1 %
factor(tstart)7 -0.4448869 0.44904216 -1.3250340  0.43526022      1 %
factor(tstart)8  0.2026454 0.63344764 -1.0399326  1.44522332      5 %
factor(tstart)9  1.3550266 1.07378793 -0.7503416  3.46039484      4 %
x1               0.8037503 0.13454266  0.5252672  1.08223336     46 %
x2              -0.9016313 0.05899317 -1.0176224 -0.78564019     11 %

Focusing on the x1 and x2 coefficients, we do not see major changes from the complete case estimates. Here the dataset and its missingness were simulated (see here), and the missingness in x1 was simulated with the probability of missingness dependent on x2. As such, the complete case analysis was unbiased, and the MAR assumption made by multiple imputation is also satisfied (note that these do not coincide in all situations). The standard error for the coefficient for x1 is a little bit smaller, but we see a bigger reduction in standard error for the coefficient of x2 – by imputing the missing values in x1 we principally gain information for the coefficients of the fully observed covariate (here x2).

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.