Multiple imputation (MI) is a popular approach to handling missing data. In the final stage of MI, inferences for parameter estimates are made using simple rules developed by Rubin. These rules rely on the analyst having a calculable standard error for the parameter estimate from each imputed dataset. This is fine for standard analyses, e.g. regression models fitted by maximum likelihood, where standard errors based on asymptotic theory are easily calculated. For many analyses, however, analytic standard errors are not available, or are difficult to derive. For such methods, if there were no missing data, an attractive approach to obtaining standard errors and confidence intervals is bootstrapping. But if one is using MI to handle missing data, and would ordinarily use bootstrapping for standard errors and confidence intervals, how should the two be combined?
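To make the pooling step concrete, here is a minimal base-R sketch of Rubin's rules (the estimates and standard errors below are made up purely for illustration):

```r
# Pool point estimates and standard errors from M imputed datasets
# using Rubin's rules.
pool_rubin <- function(estimates, ses) {
  M <- length(estimates)
  qbar <- mean(estimates)               # pooled point estimate
  ubar <- mean(ses^2)                   # within-imputation variance
  b <- var(estimates)                   # between-imputation variance
  total_var <- ubar + (1 + 1 / M) * b   # Rubin's total variance
  list(estimate = qbar, se = sqrt(total_var))
}

# Hypothetical estimates and SEs from M = 5 imputed datasets
res <- pool_rubin(c(1.02, 0.98, 1.05, 1.00, 0.97),
                  c(0.10, 0.11, 0.10, 0.12, 0.10))
res$estimate  # pooled estimate
res$se        # pooled standard error
```

The question raised above is precisely that the `ses` input to such a pooling function is unavailable when no analytic standard error exists for the analysis of each imputed dataset.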

## Multiple imputation for missing covariates in Poisson regression

This week I've released a new version of the smcfcs package for R on CRAN. smcfcs performs multiple imputation for missing covariates in regression models, using an adaptation of the chained equations / fully conditional specification approach to imputation, which we called Substantive Model Compatible Fully Conditional Specification (SMC-FCS) MI.

The new version of smcfcs now supports Poisson regression outcome / substantive models, which are often used for count outcomes. Future versions will add support for negative binomial regression, which is often used to model overdispersed count outcomes, and also for offsets, which are often needed when fitting count regression models.
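As a base-R illustration of why offsets matter for count models (this is separate from smcfcs itself, and the data here are simulated purely for illustration): when subjects are observed for different lengths of follow-up, including log person-time as an offset means the Poisson model targets the event rate rather than the raw count.

```r
# Simulated count data: events observed over varying follow-up times
set.seed(1)
n <- 200
x <- rnorm(n)
persontime <- runif(n, 1, 5)               # exposure, e.g. years of follow-up
rate <- exp(-1 + 0.5 * x)                  # true event rate per unit time
y <- rpois(n, lambda = rate * persontime)  # counts depend on exposure

# Including log(person-time) as an offset models the rate, not the count
fit <- glm(y ~ x, family = poisson, offset = log(persontime))
summary(fit)$coefficients
```

The fitted coefficient for `x` should be close to the true log rate ratio of 0.5; omitting the offset would instead conflate high counts due to long follow-up with high counts due to high rates.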

## On the missing at random assumption in longitudinal trials

The missing at random (MAR) assumption plays an extremely important role in the context of analysing datasets subject to missing data. Its importance lies primarily in the fact that if we are willing to assume data are MAR, we can identify (estimate) target parameters from the observed data. There are a variety of methods for handling data which are assumed to be MAR. One approach is estimation of a model for the variables of interest using the method of maximum likelihood. In the context of randomised trials, primary analyses are sometimes based on methods which are valid under MAR, such as linear mixed models (MMRM). A key concern however is whether the MAR assumption is plausibly valid in any given situation.
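To state the assumption concretely: writing Y_obs for the observed components of the data, Y_mis for the missing components, and R for the missingness indicators, MAR says that

```
P(R | Y_obs, Y_mis) = P(R | Y_obs)
```

i.e. given the observed data, the probability of missingness does not further depend on the values that are missing. In a longitudinal trial this means, for example, that dropout may depend on a patient's previously observed outcomes, but not additionally on the unobserved outcomes after dropout.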

## Running simulations in R using Amazon Web Services

I've recently been working on some simulation studies in R which involve computer-intensive MCMC sampling. Ordinarily I would use my institution's computing cluster to do these, making use of its large number of computer cores, but a temporary lack of availability led me to investigate using Amazon Web Services (AWS) instead. In this post I'll describe the steps I went through to get my simulations going in R. As background, I am mainly a Windows user, and had never really used the Linux operating system. Nonetheless, the process wasn't too tricky to get going in the end, and it's enabled me to complete the simulations far more quickly than if I'd just used my desktop's 8 cores.

The advantage of a cloud computing resource (from my perspective) is that in principle you can use as little or as much computing power as you need or want, and it is always available - you don't have to compete against other users' demands, as would typically be the case on an academic institution's computer cluster.
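My actual AWS setup isn't shown here, but the basic pattern of spreading simulation replications across cores in R looks something like the following sketch using the base parallel package (the replication function here is a trivial placeholder, not my MCMC simulation):

```r
library(parallel)

# One simulation replication: estimate the mean from a random sample.
# In a real study this would be the full data-generation + analysis step.
one_rep <- function(i) {
  x <- rnorm(100, mean = 2)
  mean(x)
}

# Spread 1000 replications across all available cores
cl <- makeCluster(detectCores())
clusterSetRNGStream(cl, 123)   # reproducible parallel random numbers
results <- parLapply(cl, 1:1000, one_rep)
stopCluster(cl)

mean(unlist(results))  # should be close to the true mean of 2
```

The same code runs unchanged whether the cluster has 8 cores on a desktop or many more on a cloud instance, which is what makes a service like AWS attractive for this kind of work.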

## smcfcs in R - updated version 1.1.1 with critical bug fix

For any users of my R package smcfcs, I've just released a new version (1.1.1), which along with a few small changes, includes a critical bug fix. The bug affected imputation of categorical variables (both binary variables and those with more than two levels) when the substantive model is linear regression; other substantive model types were not affected. All users should update to the new version, which is available on CRAN.

## Machine learning vs. traditional modelling techniques

In the process of organising a conference session on machine learning, I've finally got around to reading the late Leo Breiman's thought-provoking 2001 Statistical Science article "Statistical Modeling: The Two Cultures". I highly recommend reading the paper, and the discussion that follows it. In the paper Breiman argues that statistics as a field should open its eyes to analysing data not only with traditional 'data models' (his terminology), by which he means standard (usually parametric) probabilistic models, but also to make much more use of so-called machine learning algorithmic techniques.