Estimating risk ratios from observational data in Stata

When analysing binary outcomes, logistic regression is the analyst’s default approach for regression modelling. The logit link used in logistic regression is the so-called canonical link function for the binomial distribution. Exponentiated coefficients from logistic regression are odds ratios, which measure the multiplicative effect of each predictor on the odds of a positive outcome, holding the other predictors constant. However, most people find risk ratios easier to interpret than odds ratios. In randomized studies it is of course easy to estimate the risk ratio comparing the two treatment (intervention) groups. With observational data, where the exposure or treatment is not randomly allocated, estimating the risk ratio for the effect of the treatment is somewhat trickier.

The ideal situation – randomized treatment assignment
Ideally the assignment to treatment groups would be randomized, as in a randomized controlled trial. To illustrate the methods to come, we first simulate (in Stata) a large dataset which could arise in a randomized trial:

clear
set seed 1234
set obs 10000
* baseline covariate x, simulated from a standard normal distribution
gen x=rnormal()
* randomized binary treatment z, with P(z=1)=0.5
gen z=(runiform()<0.5)
* binary outcome y from a logistic model with linear predictor x+z
gen xb=x+z
gen pr=exp(xb)/(1+exp(xb))
gen y=(runiform() < pr)

This code generates a dataset of 10,000 individuals. Each has a value of a baseline variable x, simulated from a standard N(0,1) distribution. Next, as in a randomized study, we simulate a binary treatment indicator z, equal to 1 with probability 0.5 and 0 otherwise. The binary outcome y is then generated from a logistic regression model, with log odds of y=1 equal to x+z. The true conditional odds ratio for z=1 compared to z=0, adjusted for x, is thus exp(1)=2.72.
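Note that this odds ratio is not the same as the risk ratio implied by the model, which is smaller and varies with x. As a quick check of the arithmetic (using Stata's invlogit function), at x=0:

di exp(1)                     // true conditional odds ratio = 2.72
di invlogit(1)/invlogit(0)    // conditional risk ratio at x=0 = 0.731/0.5 = 1.46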

Since the treatment is randomly assigned, we can ignore x and estimate the risk ratio comparing z=1 to z=0 using Stata's glm command with a log link:

. glm y z, family(binomial) link(log) eform

Generalized linear models                          No. of obs      =     10000
Optimization     : ML                              Residual df     =      9998
                                                   Scale parameter =         1
Deviance         =  12960.39176                    (1/df) Deviance =  1.296298
Pearson          =        10000                    (1/df) Pearson  =    1.0002

Variance function: V(u) = u*(1-u)                  [Bernoulli]
Link function    : g(u) = ln(u)                    [Log]

                                                   AIC             =  1.296439
Log likelihood   = -6480.195879                    BIC             = -79124.59

------------------------------------------------------------------------------
             |                 OIM
           y | Risk Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           z |   1.429341   .0241683    21.13   0.000     1.382748    1.477503
       _cons |    .495886   .0070829   -49.11   0.000     .4821963    .5099643
------------------------------------------------------------------------------

The risk ratio is estimated as 1.43, and because the dataset is large, the 95% confidence interval is quite narrow.

Estimating risk ratios from observational data
Let us now consider the case of observational data. To do so we simulate a new dataset, where now the treatment assignment depends on x:

clear
set seed 1234
set obs 10000
* baseline covariate x, as before
gen x=rnormal()
* treatment assignment now depends on x, via a logistic model
gen tr_xb=x
gen tr_pr=exp(tr_xb)/(1+exp(tr_xb))
gen z=(runiform() < tr_pr)
* outcome model is unchanged: log odds of y=1 equal to x+z
gen xb=x+z
gen pr=exp(xb)/(1+exp(xb))
gen y=(runiform() < pr)

If we fit the same log-link model for y, ignoring x, we obtain:

. glm y z, family(binomial) link(log) eform

Generalized linear models                          No. of obs      =     10000
Optimization     : ML                              Residual df     =      9998
                                                   Scale parameter =         1
Deviance         =  12014.93919                    (1/df) Deviance =  1.201734
Pearson          =        10000                    (1/df) Pearson  =    1.0002

Variance function: V(u) = u*(1-u)                  [Bernoulli]
Link function    : g(u) = ln(u)                    [Log]

                                                   AIC             =  1.201894
Log likelihood   = -6007.469594                    BIC             = -80070.04

------------------------------------------------------------------------------
             |                 OIM
           y | Risk Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           z |   1.904111   .0353876    34.65   0.000        1.836    1.974748
       _cons |   .4101531   .0069811   -52.36   0.000      .396696    .4240667
------------------------------------------------------------------------------

The crude risk ratio is now biased upwards, since we have generated the data such that those with higher values of x are more likely to be in the z=1 group, and those with higher values of x are more likely to have y=1. How then can we adjust for the confounding effects of x, and estimate the risk ratio for z=1 compared to z=0?

Using a log-link generalized linear model
The most obvious approach is to add x to our glm command:

. glm y z x, family(binomial) link(log) eform

Iteration 0:   log likelihood = -9271.4631  
Iteration 1:   log likelihood = -5833.7585  
Iteration 2:   log likelihood = -5733.9167  (not concave)
Iteration 3:   log likelihood = -5733.9167  (not concave)
...

This, however, fails to converge, with Stata giving repeated (not concave) warnings. This problem of log-link GLMs failing to converge is well known, and is an apparent roadblock to estimating a valid risk ratio for the effect of treatment, adjusted for the confounder x.
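One can sometimes coax such models into converging, for example by passing standard maximize options such as difficult or supplying starting values, though frequently (and quite possibly here) this does not help:

* may still fail to converge; difficult is a standard maximize option
glm y z x, family(binomial) link(log) eform difficult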

Estimating the risk ratio via a logistic working model
A relatively easy alternative is to use a logistic working model to estimate a risk ratio for treatment which adjusts for x. To do this we first fit an appropriate logistic regression model for y, with x and z as predictors:

. logistic y z x

Logistic regression                               Number of obs   =      10000
                                                  LR chi2(2)      =    2797.60
                                                  Prob > chi2     =     0.0000
Log likelihood = -5343.6848                       Pseudo R2       =     0.2075

------------------------------------------------------------------------------
           y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           z |   2.947912   .1442639    22.09   0.000     2.678297    3.244668
           x |   2.693029   .0811728    32.87   0.000     2.538542    2.856918
       _cons |   .9910342   .0322706    -0.28   0.782      .929761    1.056345
------------------------------------------------------------------------------

This of course gives us an odds ratio for the treatment effect, not a risk ratio. However, we can use this model to calculate predicted probabilities for each individual assuming first that all individuals are not treated (z=0), and then assuming that all individuals are treated (z=1):

gen zcopy=z         // keep a copy of the original treatment assignment
replace z=0         // set everyone to untreated
predict pr_z0, pr   // predicted P(y=1) under z=0
replace z=1         // set everyone to treated
predict pr_z1, pr   // predicted P(y=1) under z=1

This code first generates a new variable, zcopy, which keeps a copy of the original treatment assignment variable. We then set z to 0 for all individuals, and ask for the predicted probability that y=1. We then set z to 1 for all individuals, and again calculate P(y=1). To estimate the risk ratio for the effect of z=1 compared to z=0, we simply take the ratio of the marginal risks under these two conditions, i.e. P(y=1|z=1)/P(y=1|z=0):

summ pr_z0 pr_z1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       pr_z0 |     10000    .4970649    .2060767   .0166056   .9765812
       pr_z1 |     10000    .7094445    .1765126   .0474181   .9919309

di .7094445/.4970649
1.4272673
replace z=zcopy

This gives us an estimated risk ratio, comparing z=1 to z=0, of 1.43, essentially identical to the risk ratio estimated when we first simulated data in which treatment assignment was entirely random (and in particular independent of x).

By using a logistic regression working model to come up with the predictions, we overcome the numerical difficulties which are often encountered when one instead attempts to directly fit a log-link binomial GLM adjusting for the confounders. The approach we have described here is not new: see this paper by Sander Greenland. However, the approach is still perhaps not widely used.

Confidence intervals
We have found a point estimate for the risk ratio, but we would of course also like a confidence interval, to indicate the precision of the estimate. One option is to bootstrap the whole procedure. An alternative is based on the theory of estimating equations, and is implemented in Stata's teffects command. Thanks to David Drukker, of StataCorp, for assistance with the following code.
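A minimal sketch of the bootstrap approach is as follows (the helper program name rrboot and the number of replications are our choices, not part of the original procedure); the resulting confidence limits for the log risk ratio can then be back-transformed with exp():

capture program drop rrboot
program define rrboot, rclass
    // fit the logistic working model to the (bootstrap) sample
    logistic y z x
    // predict risks with everyone untreated, then everyone treated
    tempvar zcopy pr0 pr1
    gen `zcopy' = z
    replace z = 0
    predict double `pr0', pr
    replace z = 1
    predict double `pr1', pr
    replace z = `zcopy'
    // return the log of the marginal risk ratio
    summ `pr0', meanonly
    local m0 = r(mean)
    summ `pr1', meanonly
    return scalar lnrr = ln(r(mean)/`m0')
end

bootstrap lnrr=r(lnrr), reps(1000): rrboot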

First, we use teffects to give us estimates of the marginal mean of the binary outcome (equivalent to the probability that y=1) when z is set to 0 and then to 1:

teffects ra (y x, logit) (z), pom

Iteration 0:   EE criterion =  1.898e-21  
Iteration 1:   EE criterion =  5.519e-34  

Treatment-effects estimation                    Number of obs      =     10000
Estimator      : regression adjustment
Outcome model  : logit
Treatment model: none
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
POmeans      |
           z |
          0  |   .4957607   .0072813    68.09   0.000     .4814897    .5100318
          1  |   .7078432   .0071566    98.91   0.000     .6938164    .7218699
------------------------------------------------------------------------------

The first part, (y x, logit), tells Stata that the outcome model for y is a logistic regression with x as a predictor. Whereas in our earlier manual implementation we fitted the logistic regression model to all individuals, adjusting for z, teffects ra instead finds the predictions by fitting separate logistic regression models for y (with x as predictor) in the two groups defined by z=0 and z=1. This is why the predicted mean outcomes under z=0 and z=1 from teffects ra differ a little from those we calculated earlier.
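To see this, here is a manual sketch of the separate-models computation (the variable names pr_ra0 and pr_ra1 are ours):

logit y x if z==0
predict double pr_ra0, pr    // predictions for all individuals from the z=0 model
logit y x if z==1
predict double pr_ra1, pr    // predictions for all individuals from the z=1 model
summ pr_ra0 pr_ra1           // means should match the POmeans reported by teffects ra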

To calculate the risk ratio and a confidence interval, we first use teffects ra, coeflegend to find the names under which Stata has saved the estimates:

teffects ra , coeflegend 

Treatment-effects estimation                    Number of obs      =     10000
Estimator      : regression adjustment
Outcome model  : logit
Treatment model: none
------------------------------------------------------------------------------
           y |      Coef.  Legend
-------------+----------------------------------------------------------------
POmeans      |
           z |
          0  |   .4957607  _b[POmeans:0bn.z]
          1  |   .7078432  _b[POmeans:1.z]
------------------------------------------------------------------------------

We can now calculate the risk ratio and its confidence interval using the nlcom command. However, since this will give us a symmetric Wald-based confidence interval, it is preferable to find this interval for the log risk ratio, and then to back-transform the resulting interval to the risk ratio scale:

nlcom log(_b[POmeans:1.z] / _b[POmeans:0bn.z])

       _nl_1:  log(_b[POmeans:1.z] / _b[POmeans:0bn.z])

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _nl_1 |   .3561291   .0172327    20.67   0.000     .3223536    .3899047
------------------------------------------------------------------------------

. di exp(.3561291)
1.4277919

. di exp(.3223536)
1.3803728

. di exp(.3899047)
1.47684

We thus have an estimated risk ratio of 1.43, with 95% confidence interval 1.38 to 1.48.

Assumptions and alternatives
The preceding procedure relies on an assumption that the logistic regression working model is correctly specified. That is, if we use teffects ra, we assume that in each treatment group, y follows a logistic regression model given x. Additional confounders can be added to the outcome model, with suitable interactions and non-linear terms if deemed necessary, as in the hypothetical sketch below.
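For example, with a hypothetical additional confounder w and a quadratic effect of x, the outcome model could be specified as:

* hypothetical: w is an additional confounder; c.x##c.x includes x and x squared
teffects ra (y c.x##c.x w, logit) (z), pom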

The teffects command offers a number of alternative approaches to the regression adjustment approach we have taken here. The first is inverse probability weighting (IPW) by the propensity score, using teffects ipw. Here, rather than modelling the distribution of the outcome conditional on the confounders, we specify a model for the treatment assignment mechanism. The validity of estimates then relies on the model for treatment assignment being correctly specified. For our simple setup above, this is performed by typing:

teffects ipw (y) (z x), pom

which assumes a logistic regression model for the treatment assignment mechanism, with x included as a predictor. See here for a nice paper on the propensity score approach, and some discussion on its merits relative to the regression adjustment approach.
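For intuition, the two marginal risks can also be estimated manually with inverse probability weights (a sketch only; the variable names ps and ipw are ours, and unlike teffects ipw this gives no standard errors):

logit z x                                     // propensity score model
predict double ps, pr
gen double ipw = cond(z==1, 1/ps, 1/(1-ps))   // inverse probability weights
summ y [iw=ipw] if z==1                       // weighted mean estimates P(y=1) under treatment
summ y [iw=ipw] if z==0                       // and under no treatment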

A further approach combines the regression adjustment and IPW approaches (teffects ipwra). Here we specify both a model for the outcome and a model for the treatment assignment mechanism. This approach is so-called doubly robust: it gives consistent estimates provided at least one of the two models is correctly specified, thus giving some protection against model mis-specification. For our simple example, this can be performed using:

teffects ipwra (y x, logit) (z x, logit), pom

Here the first bracket specifies the outcome model, while the second bracket specifies the treatment assignment model.

Paper by Knol et al
Knol et al recently reported results of a simulation study comparing the regression method described here (referred to by the authors as Austin's method) with a number of others. They observed that the regression method based on a logistic working model gave somewhat biased estimates of the risk ratio for certain parameter values. This bias arose, as suggested by an online response to this paper, because they simulated data assuming that y|x,z follows a log-link GLM, with additive effects of x and z. Consequently the logistic regression working model for y|x,z was mis-specified, leading to the slight bias. In this situation, the IPW or doubly robust estimators could be used to obtain a consistent estimate, provided the treatment assignment model is correctly specified.

12 thoughts on “Estimating risk ratios from observational data in Stata”

    • Thanks Jessie. For odds ratios, there are (at least) two possible targets: the conditional (on covariates) odds ratio for the exposure/treatment effect and the marginal one. For the conditional one you could simply fit a logistic regression model to the data, with treatment and confounders as covariates, and the estimated odds ratio for treatment is the (estimate of) conditional odds ratio. For the marginal odds ratio you could follow the same process as in the post, except rather than using nlcom to calculate the risk ratio, use it to calculate the marginal odds ratio with confidence interval.
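
      For example, a minimal sketch using the POmeans coefficient names from the post (on the log scale, to be back-transformed with exp()):

      nlcom log((_b[POmeans:1.z]/(1-_b[POmeans:1.z])) / (_b[POmeans:0bn.z]/(1-_b[POmeans:0bn.z])))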

      • Many thanks for your help. I would like to use nlcom to calculate the marginal odds ratio with CI. Could you advise me on what code I should use?

  1. This is extremely useful! Is there a way to get the odds ratio with teffects psmatch using nearest neighbor and other matching methods?

  2. I would like to pose a question that arose while analyzing some of my own data. I was expecting that the null hypothesis of no effect (i.e., a RR of 1 or a risk difference of 0) would be rejected independent of the measure used to quantify the effect. That is, I expected that one measure of effect (for example, RR) would always reject the null when the other measure (risk difference) did. I furthermore expected the P values for the two measures to be the same even though one effect size was a ratio while the other was a linear effect size.

    However, I was intrigued by the fact that, using this excellent technique, different P values are obtained for risk differences as opposed to risk ratios. For example, in some of my own data, I have seen the null hypothesis of a RR of 1 rejected by the nlcom command while the null hypothesis of RD=0 was not rejected. This was true whether I performed the nlcom command on log-transformed or untransformed parameters.

    Could you help me understand why the two are different? Any help is appreciated! Thank you.

    • The main reason I think is because if you are using a Wald test (and associated confidence intervals), and this is what nlcom is using, the test statistic is the estimate divided by its standard error, and this is not invariant to transformations. So even though the null hypotheses that the risk ratio is 1 and the risk difference is 0 are the same, the Wald tests of these null hypotheses are not equivalent. As such they will give different p-values. See the section ‘Non-invariance to re-parametrisations’ at Wikipedia’s page on the Wald test.
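
      For example, using the POmeans names from earlier in the post, the two Wald tests would be (a sketch):

      nlcom log(_b[POmeans:1.z] / _b[POmeans:0bn.z])   // tests log(RR)=0
      nlcom _b[POmeans:1.z] - _b[POmeans:0bn.z]        // tests RD=0

      The two z statistics, and hence the p-values, will in general differ.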

  3. Hi Jonathan,

    I’ve been binge-reading papers on the OR/RR debate this week to prepare for a lecture.

    Aside from the well-known convergence issues for log-binomial models, two additional criticisms of risk ratios are the potential lack of constancy across levels of a covariate (when the outcome is common among the unexposed), and the lack of invariance to whether disease or lack-of-disease is modelled.

    I wonder if the method you describe here – which looks suspiciously like g-computation – side-steps both of those issues.

  4. Very very useful post. Thank you for writing this post.

    Interpretation of the output
    teffects ra (y x, logit) (z), pom

    What is the interpretation of this result
    ------------------------------------------------------------------------------
                 |               Robust
               y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    POmeans      |
               z |
              0  |   .4957607   .0072813    68.09   0.000     .4814897    .5100318
              1  |   .7078432   .0071566    98.91   0.000     .6938164    .7218699
    ------------------------------------------------------------------------------

    Suppose I have to write the interpretation of this table in the results section of my paper: how would I describe these coefficients? I understand the risk ratio you have calculated. But if someone just wants to stick to the coefficients only, how would you describe the impact of the intervention/treatment based on these coefficients?

    • The coefficients in the table you have there are the estimated mean outcomes for the population under each of the two treatments, i.e. if you give the population treatment 0, the estimated population mean outcome is 0.496, while if you give the population treatment 1, the estimated population mean is 0.708.

