Stata/Mata’s st_view function – use with care!

December 2025 – update

Thanks to a user comment below, it seems Stata have changed st_view’s behaviour, and the two example problems in my original post below now (at least as of version 19.5) give the results I had originally expected to get.

Original post follows

I use Stata a lot, and I think it’s a great package. An excellent addition a few years ago was the Mata language, a fully fledged matrix programming language which sits on top of, but separate from, Stata’s regular dataset and command/syntax structure. Many of Stata’s built-in commands are, I believe, programmed using Mata. I’ve been using Mata quite a bit to program new commands, and in the process have come across some strange behaviour in Mata’s st_view function which I think can cause real difficulties (it did for me!). This post will hopefully help others avoid the problems I ran into.

Read more

Adjusting for optimism/overfitting in measures of predictive ability using bootstrapping

In a previous post we looked at the area under the ROC curve for assessing the discrimination ability of a fitted logistic regression model. An issue that we ignored there was that we used the same dataset to fit the model (estimate its parameters) and to assess its predictive ability.

A problem with doing this, particularly when the dataset used to fit/train the model is small, is that such estimates of predictive ability are optimistic. That is, the model will fit the dataset that was used to estimate its parameters somewhat better than it will fit new data. In some sense, this is because with small datasets the fitted model adapts to chance characteristics of the observed data which won’t occur in future data. A silly example of this would be a linear regression model of a continuous variable Y fitted to a continuous covariate X with just n=2 data points. The fitted line will simply be the line connecting the two data points, so the R squared measure will be 1 (100%), suggesting your model has perfect predictive power(!), when of course with new data it would almost certainly not have an R squared of 1.
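To make the idea concrete, here is a minimal sketch of the bootstrap optimism correction for the AUC, written in Python with scikit-learn rather than the software used in the full post; the simulated dataset, the choice of 200 bootstrap samples, and all variable names are my own illustrative assumptions, not taken from the post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Simulate a smallish dataset: 5 covariates, only the first predictive
n = 100
X = rng.normal(size=(n, 5))
p = 1 / (1 + np.exp(-0.5 * X[:, 0]))
y = rng.binomial(1, p)

# Apparent (in-sample) AUC: fit and evaluate on the same data
model = LogisticRegression().fit(X, y)
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Bootstrap estimate of optimism: refit the model in each bootstrap
# sample and compare its AUC there with its AUC on the original data
B = 200
optimism = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    mb = LogisticRegression().fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], mb.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, mb.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)

corrected_auc = apparent_auc - np.mean(optimism)
print(f"apparent AUC {apparent_auc:.3f}, "
      f"optimism {np.mean(optimism):.3f}, "
      f"corrected AUC {corrected_auc:.3f}")
```

In each bootstrap iteration the refitted model is evaluated both on the bootstrap sample (optimistic, since it was fitted there) and on the original data; the average gap between the two is the estimated optimism, which is subtracted from the apparent AUC.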

Read more

Multiple imputation using random forest

In recent years a number of researchers have proposed using machine learning techniques to impute missing data. One of these is the so-called random forest technique. I recently gave a talk on this topic at the International Biometric Society’s conference in Florence, Italy. In case it is of interest to anyone, the slides of the talk are available below.

Slides from talk at IBC2014 on random forest multiple imputation
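For readers who want a feel for the idea before opening the slides, here is a rough Python sketch of random-forest imputation; it is my own illustration, not the specific method covered in the talk. A forest is fitted to the complete cases, and to turn single predictions into multiple imputations it draws each missing value from one randomly chosen tree — a crude stand-in for the more careful donor-based draws used by proper multiple-imputation variants of the method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Simulated data: x2 depends on x1, with ~30% of x2 set missing at random
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
miss = rng.random(n) < 0.3
x2_obs = np.where(miss, np.nan, x2)

# Fit a random forest for x2 given x1 using the complete cases only
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(x1[~miss].reshape(-1, 1), x2_obs[~miss])

# Generate M completed datasets; each missing value is filled in with
# the prediction of one randomly chosen tree, so imputations vary
# across datasets (proper MI methods inject this between-imputation
# variability more carefully, e.g. via donors within terminal nodes)
M = 5
X_miss = x1[miss].reshape(-1, 1)
completed_datasets = []
for _ in range(M):
    trees = rng.integers(0, len(rf.estimators_), size=X_miss.shape[0])
    draws = np.array([rf.estimators_[t].predict(X_miss[i:i + 1])[0]
                      for i, t in enumerate(trees)])
    completed = x2_obs.copy()
    completed[miss] = draws
    completed_datasets.append(completed)
```

Each completed dataset would then be analysed separately and the results pooled using Rubin’s rules, as in any multiple imputation procedure.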