Stata-Mata’s st_view function – use with care!

I use Stata a lot, and I think it’s a great package. An excellent addition a few years ago was Mata, a fully fledged matrix programming language which can work with, or independently of, Stata’s regular dataset and command/syntax structure. Many of Stata’s built-in commands are programmed using Mata, I believe. I’ve been using Mata quite a bit to program new commands, and in the process have come across some strange behaviour in Mata’s st_view() function which I think can cause real difficulties (it did for me!). Hopefully this post will help others avoid the problems I ran into.

st_view()
The st_view() function in Mata enables you to create objects which are “a view onto current Stata dataset”. Specifically, you can create a Mata object which points to one or more parts of the current Stata dataset. For example, after

st_view(x, ., "x")

we can use the Mata object x to examine the contents of the x variable. A very nice feature (and the difference with st_data()) is that if we make a change to the Mata object x, we change the corresponding value(s) in the x variable in the Stata dataset.
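
To make the difference between the two functions concrete, here is a minimal sketch run interactively rather than inside a Mata function. The object names xcopy and xview are made up for illustration, and the V=. form is used because in interactive Mata the first argument has to exist before it can be passed to st_view().

```stata
clear
set obs 3
gen x = 2

mata:
// st_data() returns a copy: editing it leaves the Stata dataset untouched
xcopy = st_data(., "x")
xcopy[1] = 99

// st_view() returns a view: editing it writes through to the Stata dataset
st_view(xview=., ., "x")
xview[1] = 42
end

list x
```

After running this, the listing of x should reflect the change made through xview, but not the change made to xcopy.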

There are a number of ways of calling st_view(). In the preceding call we passed it the name of the variable we want x to point to – a very useful feature. Another way of calling it is to specify the column number(s) you want the newly created Mata object to point to.
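
As a small sketch of the two calling styles, the following creates one view by variable name and one by column number on a toy dataset. The names byname and byindex are hypothetical, and the column number 2 is only correct because x happens to be the second variable created here.

```stata
clear
set obs 5
gen y = 1
gen x = 2

mata:
// call by variable name
st_view(byname=., ., "x")

// call by column number: x is the second variable in this dataset
st_view(byindex=., ., 2)

// both objects should show the same column of the dataset
byname
byindex
end
```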

st_view() objects point to column numbers, not to variables
The first important point, which I discovered through encountering strange behaviour, is that the st_view() object points to a particular column in the Stata dataset, even when it is created by passing a variable name. This means that, having created the Mata object x, which is (was) a view onto the variable x in the Stata dataset, if the variable x in the Stata dataset changes position, the Mata st_view() object x will no longer be pointing to the correct variable. To illustrate, try pasting the following code into your do-file editor, and highlighting and running it:

mata:
mata clear

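void test() {
	// create a view onto the Stata variable x and print it
	st_view(x, ., "x")
	x
	// drop y, which shifts x from column 2 to column 1
	stata("drop y")
	// print the view again without recreating it
	x
}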

end

clear
set obs 5
gen y=1
gen x=2
mata: test()

The Stata code at the bottom generates a Stata dataset with two variables, y (containing 1s) and x (containing 2s). The Mata program test() first creates the Mata object x, which points to the Stata dataset variable x. We then print the contents of x. We then call, from the Mata program, the Stata command drop y. We then print the Mata object x again. Running the code, we get as output:

. mata: test()
       1
    +-----+
  1 |  2  |
  2 |  2  |
  3 |  2  |
  4 |  2  |
  5 |  2  |
    +-----+
       1
    +-----+
  1 |  .  |
  2 |  .  |
  3 |  .  |
  4 |  .  |
  5 |  .  |
    +-----+

We see that the first call to print x gives us the x variable, which contains 2s. But the second time we print x, we get a column of missing values. What’s happened? Although unfortunately not documented in the help file for st_view(), it turns out that the view points to particular column(s) of the dataset, not to particular variable(s). When the Mata object x is created, the x variable occupies column 2 of the Stata dataset, so the Mata object x points to column 2. When we drop y from the dataset, the x variable moves to column 1, and column 2 is now empty.

What are the implications of this behaviour? If we call st_view() so as to point to a particular variable, we should be aware that if the Stata dataset’s variables are re-ordered, our st_view() object may no longer be pointing to the right variable. This means that every time a Mata program makes a change to the Stata dataset, we should redefine any Mata st_view() objects that we then want to use.
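
Here is a sketch of that fix applied to the example above. The only change is that the view is recreated after drop y; the function name test_redefine() is made up for this illustration, and like the original it relies on Mata's default setting that allows undeclared variables inside functions.

```stata
mata:
mata clear

void test_redefine() {
	st_view(x, ., "x")
	x
	stata("drop y")
	// the dataset has changed, so recreate the view before using it again
	st_view(x, ., "x")
	x
}

end

clear
set obs 5
gen y=1
gen x=2
mata: test_redefine()
```

With the view recreated, the second print of x should again show the contents of the x variable rather than missing values.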

Personally I think the above behaviour is a little bit dangerous (and evidently another user, Sergiy Radyakin, seems to agree). The ability to call st_view() by passing it a variable name led me to assume that the resulting object would point to that variable. Since it actually points to a particular column, I think it would be better if the user instead had to call something like st_view(x, ., st_varindex("x")), since this would make the programmer realise that st_view() points to columns, not variables.

st_view can be disrupted more generally
Unfortunately, even when the variable we want our st_view() object to point to apparently occupies the same column as when we created the Mata object, we can still have problems, as illustrated by the following code:

mata:
mata clear

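void test() {
	// fit a linear regression and save the linear predictor as xb
	stata("reg x y")
	stata("predict xb, xb")
	// create a view onto xb and print it
	st_view(xb, ., "xb")
	xb
	// fit a logistic regression, then print the view again without recreating it
	stata("logit y x")
	xb
}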

end

clear
set obs 10
gen y=(_n<=5)
gen x=1+y+rnormal()
mata: test()

which gives as output

mata: test()

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =    2.68
       Model |  1.56994026     1  1.56994026           Prob > F      =  0.1403
    Residual |  4.68793052     8  .585991315           R-squared     =  0.2509
-------------+------------------------------           Adj R-squared =  0.1572
       Total |  6.25787079     9  .695318976           Root MSE      =   .7655

------------------------------------------------------------------------------
           x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           y |   .7924494   .4841451     1.64   0.140    -.3239913     1.90889
       _cons |   .8035085   .3423423     2.35   0.047     .0140657    1.592951
------------------------------------------------------------------------------
                  1
     +---------------+
   1 |  1.595957994  |
   2 |  1.595957994  |
   3 |  1.595957994  |
   4 |  1.595957994  |
   5 |  1.595957994  |
   6 |  .8035085201  |
   7 |  .8035085201  |
   8 |  .8035085201  |
   9 |  .8035085201  |
  10 |  .8035085201  |
     +---------------+

Iteration 0:   log likelihood = -6.9314718  
Iteration 1:   log likelihood = -5.5810723  
Iteration 2:   log likelihood = -5.5794228  
Iteration 3:   log likelihood = -5.5794223  

Logistic regression                               Number of obs   =         10
                                                  LR chi2(1)      =       2.70
                                                  Prob > chi2     =     0.1001
Log likelihood = -5.5794223                       Pseudo R2       =     0.1951

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |    1.48188   1.015263     1.46   0.144    -.5079979    3.471758
       _cons |  -1.785538   1.437518    -1.24   0.214    -4.603022    1.031946
------------------------------------------------------------------------------
        1
     +-----+
   1 |  1  |
   2 |  1  |
   3 |  1  |
   4 |  1  |
   5 |  1  |
   6 |  1  |
   7 |  1  |
   8 |  1  |
   9 |  1  |
  10 |  1  |
     +-----+

The xb variable occupies the 3rd column in the Stata dataset. When stata("logit y x") is executed, Stata evidently creates a column of 1s for the constant in the model as a temporary variable. One would have thought, though, that any such variable would be added after the existing variables, such that the xb variable would still occupy the third position, and the Mata st_view() object would still point to the correct variable. Unfortunately that's not the case, and we again end up with a Mata st_view() object not pointing to the variable we had intended it to.

What are the implications of this? As explained to me on the Stata forum, one must recreate the st_view() object after any "changes to the dataset", and here "changes" evidently includes running a regression model.
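
A sketch of two defensive options for the regression example: recreate the view after the logit call, or, if you only need to read the values, take a copy with st_data(), which does not depend on the later state of the dataset. The names test_copy() and xbcopy are hypothetical, and this is one way of following the advice, not an official recipe.

```stata
mata:
mata clear

void test_copy() {
	stata("reg x y")
	stata("predict xb, xb")
	// st_data() returns a copy, so later commands cannot redirect it
	xbcopy = st_data(., "xb")
	st_view(xb, ., "xb")
	stata("logit y x")
	// recreate the view before using it again
	st_view(xb, ., "xb")
	// print the refreshed view next to the copy taken earlier
	xb, xbcopy
}

end

clear
set obs 10
gen y=(_n<=5)
gen x=1+y+rnormal()
mata: test_copy()
```

If the view has been refreshed correctly, the two printed columns should match.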

So the moral of the story is, st_view()s can be very useful, but if you use them, use them with care!
