This week I was talking to a friend about how structural equation modelling (SEM) software handles covariates with missing values. I’ll preface this post by saying that I’m definitely not an expert (or anywhere close!) in structural equation models, so if anyone spots errors or problems, please add a comment. My friend thought that the SEM implementations in some packages can automatically accommodate missingness in covariates, using so-called ‘full information maximum likelihood’. In what follows I’ll describe my subsequent exploration of how Stata’s sem command handles missingness in covariates.
Missing data
Multiple imputation using random forest
In recent years a number of researchers have proposed using machine learning techniques to impute missing data. One of these is the so-called random forest technique. I recently gave a talk on this topic at the International Biometric Society’s conference in Florence, Italy. In case it is of interest to anyone, the slides from the talk are available below.
Slides from talk at IBC2014 on random forest multiple imputation
Multiple imputation with interactions and non-linear terms
Multiple imputation has become an extremely popular approach to handling missing data, for a number of reasons. One is that once the imputed datasets have been generated, they can each be analysed using standard analysis methods, and the results pooled using Rubin’s rules. However, in addition to the missing at random assumption, for multiple imputation to give unbiased point estimates the model(s) used to impute missing data need to be (at least approximately) correctly specified. Because of this, care must be taken when choosing the imputation model.
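To make the pooling step concrete, here is a minimal Python sketch of Rubin’s rules for a single scalar parameter. The function name `pool_rubin` and the example numbers are purely illustrative and are not taken from any package. It assumes you already have one point estimate and one squared standard error from each imputed dataset.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation results for one scalar parameter using Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")

    pooled = mean(estimates)       # average of the m point estimates
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)

    # Total variance: within plus between, inflated by (1 + 1/m)
    # to account for using a finite number of imputations.
    total = within + (1 + 1 / m) * between
    return pooled, total


# Illustrative (made-up) results from m = 3 imputed datasets
est, var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The pooled standard error is the square root of the total variance. In practice one would normally let the software do this (for example, Stata’s `mi estimate`), which also computes the appropriate degrees of freedom for inference.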
What constitutes a reasonable imputation model will obviously depend on the dataset and situation at hand. One situation which is commonly encountered, but where it is not obvious what one should do, is where the imputation model, or the model(s) which will be fitted after imputation, needs to include interaction terms or non-linear terms such as squared terms.