Following a recent request, I’ve extended the functionality of my R package smcfcs, which performs multiple imputation of missing covariates compatibly with a user-specified substantive (outcome) model. The package can now impute compatibly with a flexible parametric Royston-Parmar type survival model. In this post I’ll briefly highlight some of the potential uses of this new functionality.
Flexible parametric models using flexsurv
To incorporate flexible parametric Royston-Parmar models into smcfcs, I’ve made use of the fantastic flexsurv package, written by Christopher Jackson. Thanks to the functionality built into flexsurv, plugging it into smcfcs was relatively easy. The latest development version of smcfcs has a new function, smcfcs.flexsurv, which imputes missing covariates compatibly with a flexible parametric model, based on the flexsurvspline function in the flexsurv package. The basic syntax looks like:
library(smcfcs)
library(flexsurv)
set.seed(63213)
imps <- smcfcs.flexsurv(ex_flexsurv,
                        k=2,
                        smformula="Surv(t,d)~x+z",
                        method=c("","","logreg",""))
# analyse imputed datasets
library(mitools)
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, flexsurvspline(Surv(t,d)~x+z, k=2))
summary(MIcombine(models))
The only additional argument compared to the Cox model approach is k, which specifies how many knots to use in the model (here k=2).
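For reference, the model being fitted can be written out explicitly. This is a sketch that assumes the flexsurvspline default of modelling on the log cumulative hazard scale; the symbols v_j (the natural cubic spline basis functions in log time) and the beta subscripts are notation introduced here for illustration, not names used by the package.

```latex
\log H(t \mid x, z) = \gamma_0 + \gamma_1 \log t + \sum_{j=1}^{k} \gamma_{j+1}\, v_j(\log t) + \beta_x x + \beta_z z
```

With k=2 internal knots there are four spline parameters, \gamma_0 to \gamma_3. With k=0 the sum disappears and the model reduces to a Weibull model.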
Why flexible parametric models?
Why would one want to use a flexible parametric survival model when one has the semiparametric Cox model? Much has been written about the advantages of taking a parametric approach, and indeed in a number of interviews I recall David Cox speaking to these, particularly in the context of his semiparametric model having been so successful. I’ll instead focus below on some of the specific potential benefits in the context of multiple imputation.
Theoretical underpinnings
Since its beginnings the smcfcs package has had functionality for Cox proportional hazards models, based on the derivations given in this paper. One potentially troubling aspect of our approach is that the algorithm did not account for uncertainty in the estimate of the baseline hazard function, which in a Cox model is left completely unspecified. In order for Rubin’s rules to work, imputations must be generated conditional on draws of the parameters from their corresponding posterior distributions, and we were unsure how to do this for the (infinite dimensional) baseline hazard function. We acknowledged this in the paper, and simulations suggested good performance of Rubin’s rules despite it. Further simulations in the competing risks setting also showed good performance. Nonetheless, the theoretical basis is arguably not completely solid, as noted in a 2020 paper by Eriksson et al. By using a parametric yet flexible survival model in smcfcs, we overcome these theoretical issues: the model is fully parametric, and smcfcs takes a posterior draw of the full set of parameters in the survival model, by utilising the normboot.flexsurvreg function in flexsurv.
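To illustrate the kind of parameter draw involved, here is a minimal sketch using the bc dataset that ships with flexsurv. It is not the internal smcfcs code: the dataset, the seed, and the choice of B = 1 with raw = TRUE are assumptions made purely for illustration.

```r
library(flexsurv)

# Fit a Royston-Parmar model with 2 internal knots to the bc data from flexsurv
fit <- flexsurvspline(Surv(recyrs, censrec) ~ group, data = bc, k = 2)

set.seed(1234)
# One draw from the approximate multivariate normal distribution of the
# parameter estimates. raw = TRUE is assumed here so that the spline
# parameters and the covariate effects come back together as one vector of draws.
draw <- normboot.flexsurvreg(fit, B = 1, raw = TRUE)
draw
```

Because every parameter in the model is finite dimensional, a single call like this gives a draw of the whole model, which is exactly what is not available for the Cox baseline hazard.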
Imputing censored event times
The person whose query started this piece of work wanted to impute both missing covariates and some of the ‘missing’ right-censored event times. Ordinarily we don’t worry about trying to impute event times that are right-censored because we have statistical tools that correctly recognise the contribution to the likelihood of such partial observations. But there are situations where we may want to impute the event times for those who are censored. These include settings where we may think there are covariates that affect censoring and the time-to-event but which we do not want to use in our substantive model. In this case we should condition on these to render the independent censoring assumption plausible, yet do not want to condition on them in the substantive model. Here one could impute censored times conditional on the covariates, and then perform analysis on the resulting imputations not conditioning on them. Multiple imputation is also useful for assessing how results change if censoring is not independent even after accounting for measured covariates (see Jackson et al 2014).
In the context of imputing censored event times, use of a parametric model again has advantages compared to a Cox model. To allow for dependent censoring, Jackson et al 2014 proposed using a bootstrap resampling approach to account for uncertainty in the Cox model parameters (regression coefficients and baseline hazard function). In the parametric approach, where all the parameters are finite dimensional, we can appeal to large sample theory and draw from the (approximately) multivariate normal posterior distribution.
Imputing event times for those who were originally censored is achieved in smcfcs.flexsurv using the imputeTimes argument:
# impute missing covariate + censored event times
imps <- smcfcs.flexsurv(ex_flexsurv,
                        k=2,
                        smformula="Surv(t,d)~x+z",
                        method=c("","","logreg",""),
                        imputeTimes=TRUE)
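To make concrete what imputing a censored event time involves, the sketch below draws each censored subject's event time from the model's distribution conditional on the event occurring after their censoring time. It is a deliberate simplification and not the smcfcs.flexsurv implementation: it assumes a Weibull model with no covariates rather than the spline model, and the function name, parameter values and toy data are all hypothetical.

```r
# Impute event times for censored subjects under a Weibull model (illustrative only).
# shape, scale and the toy data below are hypothetical values, not estimates.
impute_censored_weibull <- function(t, d, shape, scale) {
  t_imp <- t
  cens <- which(d == 0)
  # Survival probability S(c) at each subject's censoring time c
  s_c <- pweibull(t[cens], shape = shape, scale = scale, lower.tail = FALSE)
  # Given T > c, S(T) is uniform on (0, S(c)); draw it, then invert S
  u <- runif(length(cens)) * s_c
  t_imp[cens] <- qweibull(u, shape = shape, scale = scale, lower.tail = FALSE)
  t_imp
}

set.seed(2024)
t <- c(1.2, 0.4, 2.5, 3.1)
d <- c(1, 0, 0, 1)   # 1 = event observed, 0 = right-censored
t_imp <- impute_censored_weibull(t, d, shape = 1.5, scale = 2)
t_imp
```

In a proper multiple imputation procedure the parameters would themselves be drawn afresh for each imputed dataset, rather than fixed as they are here, and the conditional distribution would depend on each subject's covariates.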
Time-varying covariate effects
One of the strengths of flexible parametric models is the ease with which more complex models can be fitted, for example to allow for the effects of covariates to be time-varying. This is achieved in the flexsurvspline function by allowing the spline parameters to vary by levels of chosen covariates, as described in Section 5.1 of the flexsurv user guide. For example, in our earlier model we can add gamma1(x) to allow the effect of log(t) on the log cumulative hazard to vary with x (i.e. differ between the two levels of x since here x is binary):
flexsurvspline(Surv(t,d)~x+z+gamma1(x), k=2)
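Written out, this model takes roughly the following form. This is a sketch under the same assumptions as before (log cumulative hazard scale, v_j denoting the spline basis functions), and \gamma_{1x} is a label introduced here for the additional coefficient, not the name flexsurv gives it.

```latex
\log H(t \mid x, z) = \gamma_0 + (\gamma_1 + \gamma_{1x}\, x)\log t + \sum_{j=1}^{2} \gamma_{j+1}\, v_j(\log t) + \beta_x x + \beta_z z
```

When \gamma_{1x} = 0 this reduces to the earlier proportional hazards model; otherwise the hazard ratio comparing the two levels of x changes over time.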
If one or more covariates is partially observed in this setting, how should you impute the covariates (should you choose to)? My colleagues Ruth Keogh and Tim Morris investigated this in a 2018 paper, showing how smcfcs could be extended to accommodate time-varying covariate effects in a Cox model.
In the flexible parametric survival model approach, because of the functionality of the flexsurv package, it is straightforward to impute missing covariates compatible with a model that allows for time-varying effects:
imps <- smcfcs.flexsurv(ex_flexsurv,
                        k=2,
                        smformula="Surv(t,d)~x+z+gamma1(x)",
                        method=c("","","logreg",""))
Installation and feedback
The smcfcs.flexsurv function is available in the development version of the smcfcs package, and can be installed via:
devtools::install_github("jwb133/smcfcs")
I will upload the new version to CRAN soon, but if anyone tries it out and finds bugs or has other feedback, please get in touch.