Missing covariates in structural equation models

This week I was talking to a friend about how covariates which have missing values are handled in structural equation modelling (SEM) software. I'll preface this post by saying that I'm definitely not an expert (or anywhere close!) in structural equation models, so if anyone spots errors/problems please add a comment. My friend thought that certain implementations of SEMs in some packages have the ability to automatically accommodate missingness in covariates, using so called 'full information maximum likelihood'. In the following I'll describe my subsequent exploration of how Stata's sem command handles missingness in covariates.

To investigate how missing covariates are handled, I'll consider the simplest possible situation, in which we have one outcome Y and one covariate X, with Y following a simple linear regression model given X. First we'll simulate a large dataset, so that we know the true parameter values:

clear
set seed 6812312
set obs 10000
gen x=rnormal()
gen y=x+rnormal()

Here the true intercept parameter is 0 and the true slope parameter is 1. The residual errors have variance one. Next, let's set some of the covariate values missing. To do this we'll use a missingness mechanism where the probability of missingness depends on the (fully observed) outcome Y. This means that the missingness mechanism will satisfy the so called missing at random assumption. Specifically, we will calculate the probability that X is observed based on a logistic regression model where Y enters as the sole covariate:

gen rxb=-2+2*y
gen rpr=exp(rxb)/(1+exp(rxb))
gen r=(runiform() < rpr)
replace x=. if r==0

Now we can apply Stata's sem command to fit the SEM:

sem (x -> y, )
(7270 observations with missing values excluded)

Endogenous variables

Observed:  y

Exogenous variables

Observed:  x

Fitting target model:

Iteration 0:   log likelihood = -6732.1256  
Iteration 1:   log likelihood = -6732.1256  

Structural equation model                       Number of obs      =      2730
Estimation method  = ml
Log likelihood     = -6732.1256

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
           x |   .6179208   .0179671    34.39   0.000      .582706    .6531355
       _cons |    .999025   .0200306    49.88   0.000     .9597658    1.038284
-------------+----------------------------------------------------------------
     var(e.y)|   .6472101   .0175178                      .6137707    .6824714
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 =      .

In the absence of missing values, the sem command by default uses maximum likelihood to estimate the model parameters. As explained in the (very good) documentation, this is performed assuming full joint normality of all the variables. When there are missing values, the sem command's default behaviour, like I think every other regular estimation command in Stata (and other packages) is to perform a complete case analysis (listwise deletion). Here we obtained an estimated intercept of 1.00 (true value 0) and slope of 0.62 (true value 1). The bias here is because missingness is related to the outcome/dependent variable, even after conditioning on the covariate X (see here).

But sem also has another option, which will enable us to fit the model using the observed data from all 10,000 records. If the sem is specified using the graphical dialog boxes, this can be selected in the estimation options box as "Maximum likelihood with missing values". From the command line, we can select it by:

. sem (x -> y, ), method(mlmv)

*output cut

Structural equation model                       Number of obs      =     10000
Estimation method  = mlmv
Log likelihood     = -20549.731

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
           x |   .9804851   .0156235    62.76   0.000     .9498637    1.011107
       _cons |  -.0145543    .025363    -0.57   0.566    -.0642649    .0351562
-------------+----------------------------------------------------------------
      mean(x)|   .0032305   .0257089     0.13   0.900     -.047158    .0536189
-------------+----------------------------------------------------------------
     var(e.y)|    1.02696   .0324877                      .9652191     1.09265
       var(x)|   .9847265   .0314871                       .924907    1.048415
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 =      .

We now have estimates of the intercept of -0.01 (true 0) and slope of 0.98 (true 1) - the estimates are now unbiased. How has Stata achieved this. As explained in the manual:

"For method MLMV to perform what might seem like magic, joint normality of all variables is assumed and missing values are assumed to be missing at random (MAR)."

So, we obtain unbiased estimates (for this data generation setup) because Stata's sem command (correctly here) assumes joint normality of Y and X, and that missingness satisfies the MAR assumption.

Non-normal X
Let's now re-run the simulation, but now make X follow a chi-squared distribution on one degree of freedom, by squaring the rnormal() draw:

clear
set seed 6812312
set obs 10000
gen x=(rnormal())^2
gen y=x+rnormal()

gen rxb=-2+2*y
gen rpr=exp(rxb)/(1+exp(rxb))
gen r=(runiform() < rpr)
replace x=. if r==0

Running sem with the missing value option, we obtain:

. sem (x -> y, ), method(mlmv)

*output cut

Structural equation model                       Number of obs      =     10000
Estimation method  = mlmv
Log likelihood     = -25316.281

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
           x |   .8281994   .0066085   125.32   0.000      .815247    .8411518
       _cons |   .4792567   .0161389    29.70   0.000      .447625    .5108883
-------------+----------------------------------------------------------------
      mean(x)|   .5842649   .0224815    25.99   0.000     .5402019    .6283279
-------------+----------------------------------------------------------------
     var(e.y)|   .7537745   .0157842                      .7234643    .7853546
       var(x)|   3.073801   .0551011                       2.96768    3.183717
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 =      .

Now we again have biased estimates, since the joint normality assumption for Y and X no longer holds. Thus we see that the joint normality assumption is critical when we have missing covariates if we use this option.

Missing completely at random
Let's run the simulation one last time, again with X chi-squared distributed, but now with X being made missing completely at random (MCAR):

clear
set seed 6812312
set obs 10000
gen x=(rnormal())^2
gen y=x+rnormal()
replace x=. if (runiform()<0.5)
sem (x -> y, ), method(mlmv)

*output cut

Structural equation model                       Number of obs      =     10000
Estimation method  = mlmv
Log likelihood     = -25495.152

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
           x |   .9985166   .0093366   106.95   0.000     .9802173    1.016816
       _cons |  -.0092478   .0158659    -0.58   0.560    -.0403445    .0218488
-------------+----------------------------------------------------------------
      mean(x)|   .9738369   .0158113    61.59   0.000     .9428474    1.004826
-------------+----------------------------------------------------------------
     var(e.y)|   1.033884    .020162                      .9951133    1.074166
       var(x)|    1.83369   .0330307                       1.77008    1.899585
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 =      .

Now we again have unbiased estimates, despite the fact that the joint normality assumption is violated. I think this occurs here because when the data are MCAR, the mean and covariance structure can be estimated consistently even if the normality assumption is violated, and for SEMs the model parameters (as far as I understand) only depend on means and covariance parameters, since they are all based on normal models. Unfortunately of course MCAR is often not a tenable assumption in practice. Moreover, even in this situation where we obtain unbiased estimates under violation of normality, I don't think the confidence intervals and p-values will necessarily be valid.

The ability to handle missing covariate values essentially automatically in SEMs in Stata is certainly a nice feature. But as the manual explains, and as illustrated above, if one or more partially observed covariates are not normally distributed, the modelling assumptions Stata uses to handle the missing values will not be appropriate, and thus we may obtain biased estimates and invalid inferences as a result.

Mplus
From looking at the documentation for the popular Mplus SEM software and some online forum posts, it looks like Mplus like Stata will by default drop records from the analysis that contain missing covariates, and that the only way to avoid this is to move the covariates to the left hand side of the equations, and model them as additional dependent variables.

Leave a Reply