This week I was talking to a friend about how covariates with missing values are handled in structural equation modelling (SEM) software. I’ll preface this post by saying that I’m definitely not an expert (or anywhere close!) in structural equation models, so if anyone spots errors/problems please add a comment. My friend thought that certain implementations of SEMs in some packages have the ability to automatically accommodate missingness in covariates, using so-called ‘full information maximum likelihood’ (FIML). In the following I’ll describe my subsequent exploration of how Stata’s sem command handles missingness in covariates.
To investigate how missing covariates are handled, I’ll consider the simplest possible situation, in which we have one outcome Y and one covariate X, with Y following a simple linear regression model given X. First we’ll simulate a large dataset, so that we know the true parameter values:
clear
set seed 6812312
set obs 10000
gen x=rnormal()
gen y=x+rnormal()
Here the true intercept parameter is 0, the true slope parameter is 1, and the residual errors have variance 1. Next, let’s set some of the covariate values to missing. To do this we’ll use a missingness mechanism in which the probability of missingness depends on the (fully observed) outcome Y. This means that the missingness mechanism will satisfy the so-called missing at random (MAR) assumption. Specifically, we will calculate the probability that X is observed from a logistic regression model in which Y enters as the sole covariate:
gen rxb=-2+2*y
gen rpr=exp(rxb)/(1+exp(rxb))
gen r=(runiform() < rpr)
replace x=. if r==0
Now we can apply Stata's sem command to fit the SEM:
. sem (x -> y, )
(7270 observations with missing values excluded)

Endogenous variables
  Observed:  y

Exogenous variables
  Observed:  x

Fitting target model:
Iteration 0:   log likelihood = -6732.1256
Iteration 1:   log likelihood = -6732.1256

Structural equation model                       Number of obs      =      2730
Estimation method  = ml
Log likelihood     = -6732.1256

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
           x |   .6179208   .0179671    34.39   0.000      .582706   .6531355
       _cons |    .999025   .0200306    49.88   0.000     .9597658   1.038284
-------------+----------------------------------------------------------------
    var(e.y) |   .6472101   .0175178                       .6137707   .6824714
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 = .
In the absence of missing values, the sem command by default uses maximum likelihood to estimate the model parameters. As explained in the (very good) documentation, this is performed assuming full joint normality of all the variables. When there are missing values, the sem command's default behaviour, like (I believe) that of every other standard estimation command in Stata (and in other packages), is to perform a complete case analysis (listwise deletion). Here we obtained an estimated intercept of 1.00 (true value 0) and slope of 0.62 (true value 1). The bias arises because missingness is related to the outcome/dependent variable, even after conditioning on the covariate X (see here).
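To convince yourself that this bias is a property of the missingness mechanism and not of Stata specifically, here is a rough Python translation of the same simulation (variable names and the numpy-based fit are my own choices, not Stata's code): we simulate the data, delete X with probability depending on Y, and fit ordinary least squares to the complete cases.

```python
import numpy as np

rng = np.random.default_rng(6812312)
n = 10_000
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)      # true intercept 0, true slope 1

# Missingness in x depends on the fully observed outcome y (MAR given y)
p_obs = 1 / (1 + np.exp(-(-2 + 2 * y)))
obs = rng.random(n) < p_obs

# Complete case (listwise deletion) least squares fit of y on observed x
X = np.column_stack([np.ones(obs.sum()), x[obs]])
intercept, slope = np.linalg.lstsq(X, y[obs], rcond=None)[0]

# Because high-y records are over-represented among the complete cases,
# the slope is biased downward and the intercept upward, mirroring
# the Stata complete case results above.
print(intercept, slope)
```

The exact numbers differ from the Stata run because of the different random number generator, but the pattern (slope well below 1, intercept well above 0) is the same.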
But sem also has another option, which will enable us to fit the model using the observed data from all 10,000 records. If the SEM is specified using the graphical dialog boxes, this can be selected in the estimation options box as "Maximum likelihood with missing values". From the command line, we can select it with the method(mlmv) option:
. sem (x -> y, ), method(mlmv)
*output cut

Structural equation model                       Number of obs      =     10000
Estimation method  = mlmv
Log likelihood     = -20549.731

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
           x |   .9804851   .0156235    62.76   0.000     .9498637   1.011107
       _cons |  -.0145543    .025363    -0.57   0.566    -.0642649   .0351562
-------------+----------------------------------------------------------------
     mean(x) |   .0032305   .0257089     0.13   0.900     -.047158   .0536189
-------------+----------------------------------------------------------------
    var(e.y) |    1.02696   .0324877                      .9652191    1.09265
      var(x) |   .9847265   .0314871                       .924907   1.048415
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 = .
We now have estimates of the intercept of -0.01 (true 0) and slope of 0.98 (true 1) - the estimates are now essentially unbiased. How has Stata achieved this? As explained in the manual:
"For method MLMV to perform what might seem like magic, joint normality of all variables is assumed and missing values are assumed to be missing at random (MAR)."
So, we obtain unbiased estimates (for this data generation setup) because Stata's sem command (correctly here) assumes joint normality of Y and X, and that missingness satisfies the MAR assumption.
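To make this less magical, here is a Python sketch (my own illustration, not Stata's actual implementation) of the full-information likelihood being maximized: complete cases contribute the bivariate normal density of (X, Y), while cases with X missing contribute only the marginal normal density of Y. The SEM parameters (intercept, slope, residual variance, plus the mean and variance of X) determine that joint distribution.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)
n = 10_000
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)

# MAR missingness in x driven by y, as in the Stata simulation
p_obs = 1 / (1 + np.exp(-(-2 + 2 * y)))
obs = rng.random(n) < p_obs

def neg_loglik(theta):
    # theta = (intercept a, slope b, log resid var, mean of x, log var of x)
    a, b, lve, mx, lvx = theta
    ve, vx = np.exp(lve), np.exp(lvx)   # log scale keeps variances positive
    mu_y = a + b * mx                   # implied marginal mean of y
    var_y = b * b * vx + ve             # implied marginal variance of y
    mean = np.array([mx, mu_y])
    cov = np.array([[vx, b * vx],
                    [b * vx, var_y]])   # implied joint covariance of (x, y)
    # Complete cases: full bivariate normal density
    ll = multivariate_normal(mean, cov).logpdf(
        np.column_stack([x[obs], y[obs]])).sum()
    # Cases with x missing: marginal density of y only
    ll += norm(mu_y, np.sqrt(var_y)).logpdf(y[~obs]).sum()
    return -ll

res = minimize(neg_loglik, x0=np.array([0.0, 0.5, 0.0, 0.0, 0.0]),
               method="Nelder-Mead",
               options={"maxiter": 10000, "maxfev": 10000})
a_hat, b_hat = res.x[:2]
print(a_hat, b_hat)
```

Because the incomplete records still contribute their Y values through the implied marginal distribution of Y, the slope and intercept estimates come out close to the true values of 1 and 0, just as Stata's mlmv does.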
Non-normal X
Let's now re-run the simulation, but now make X follow a chi-squared distribution on one degree of freedom, by squaring the rnormal() draw:
clear
set seed 6812312
set obs 10000
gen x=(rnormal())^2
gen y=x+rnormal()
gen rxb=-2+2*y
gen rpr=exp(rxb)/(1+exp(rxb))
gen r=(runiform() < rpr)
replace x=. if r==0
Running sem with the missing value option, we obtain:
. sem (x -> y, ), method(mlmv)
*output cut

Structural equation model                       Number of obs      =     10000
Estimation method  = mlmv
Log likelihood     = -25316.281

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
           x |   .8281994   .0066085   125.32   0.000      .815247   .8411518
       _cons |   .4792567   .0161389    29.70   0.000      .447625   .5108883
-------------+----------------------------------------------------------------
     mean(x) |   .5842649   .0224815    25.99   0.000     .5402019   .6283279
-------------+----------------------------------------------------------------
    var(e.y) |   .7537745   .0157842                      .7234643   .7853546
      var(x) |   3.073801   .0551011                       2.96768   3.183717
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 = .
Now we again have biased estimates, since the joint normality assumption for Y and X no longer holds. Thus the joint normality assumption is critical when this option is used to handle missing covariate values.
Missing completely at random
Let's run the simulation one last time, again with X chi-squared distributed, but now with X being made missing completely at random (MCAR):
clear
set seed 6812312
set obs 10000
gen x=(rnormal())^2
gen y=x+rnormal()
replace x=. if (runiform()<0.5)

. sem (x -> y, ), method(mlmv)
*output cut

Structural equation model                       Number of obs      =     10000
Estimation method  = mlmv
Log likelihood     = -25495.152

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
           x |   .9985166   .0093366   106.95   0.000     .9802173   1.016816
       _cons |  -.0092478   .0158659    -0.58   0.560    -.0403445   .0218488
-------------+----------------------------------------------------------------
     mean(x) |   .9738369   .0158113    61.59   0.000     .9428474   1.004826
-------------+----------------------------------------------------------------
    var(e.y) |   1.033884    .020162                      .9951133   1.074166
      var(x) |    1.83369   .0330307                       1.77008   1.899585
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 = .
Now we again have unbiased estimates, despite the fact that the joint normality assumption is violated. I think this occurs here because when the data are MCAR, the observed data form a random subsample, so the mean and covariance structure can be estimated consistently even if the normality assumption is violated; and for SEMs the model parameters (as far as I understand) depend only on the mean and covariance parameters, since the models are all based on normal distributions. Unfortunately, of course, MCAR is often not a tenable assumption in practice. Moreover, even in this situation where we obtain unbiased estimates under violation of normality, I don't think the confidence intervals and p-values will necessarily be valid.
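The moment-based intuition can be checked directly in a few lines of Python (my own sketch, not anything Stata does): in the linear SEM the slope is determined by the moments as Cov(X,Y)/Var(X), and under MCAR the complete cases are a random subsample whose sample moments are consistent, whatever the distribution of X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n) ** 2     # chi-squared(1) covariate: clearly non-normal
y = x + rng.standard_normal(n)      # true slope 1

miss = rng.random(n) < 0.5          # MCAR: missingness independent of everything
xc, yc = x[~miss], y[~miss]

# Under MCAR the complete cases are a simple random subsample, so their
# sample covariance and variance are consistent, and the implied slope
# cov(x, y) / var(x) recovers the true value 1 despite non-normal x.
slope = np.cov(xc, yc)[0, 1] / np.var(xc, ddof=1)
print(slope)
```

Under the MAR-on-Y mechanism of the previous section this argument breaks down, because the complete cases are then no longer a random subsample and their moments are distorted.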
The ability to handle missing covariate values essentially automatically in SEMs in Stata is certainly a nice feature. But as the manual explains, and as illustrated above, if one or more partially observed covariates are not normally distributed, the modelling assumptions Stata uses to handle the missing values will not be appropriate, and thus we may obtain biased estimates and invalid inferences as a result.
Mplus
From looking at the documentation for the popular Mplus SEM software and some online forum posts, it appears that Mplus, like Stata, will by default drop records containing missing covariates from the analysis, and that the only way to avoid this is to move the covariates to the left-hand side of the equations, modelling them as additional dependent variables.