My first foray with Stan

A number of people have mentioned Stan recently to me. Stan fits probability models to data using the Bayesian approach to statistical inference. WinBUGS was the first package to really allow users to fit complex, user defined models with Bayesian methods. As far as I understand, Stan’s strongest selling points are that it is fast, because it compiles your model into C++ code, and because of the clever sampling methods it implements (for more on this, see the sub-section in Stan’s manual).

Stan from R
You can run Stan from a number of statistical packages. So far I’ve been running Stan from R, first following the instructions in the quick start guide to get everything installed and running.

Simple linear regression
My first (after running the example in the quick start guide) baby step into using Stan was to fit a simple linear regression model. The first step is to write the file for the Stan model, which I could essentially take straight from the Stan manual. This consists of a file linreg.stan:

data {
  int N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}

The first part of this file, called data, declares the scalars, vectors and matrices which will be passed to Stan as inputs. Here that consists of the integer sample size N (the lower part telling Stan that 0 is a lower limit for this quantity). Next we declare two vectors of length N for the outcome y and covariate x. The next part declares the parameters of the model, which here consist of the intercept alpha, the slope beta, and the residual standard deviation sigma. Ordinarily we would specify proper priors here for these parameters. Since we have not, Stan will assume flat uniform (improper) priors for the two regression parameters, and a uniform prior on the residual standard deviation parameter.

Next we can simulate a dataset, and fit the model using Stan and our file linreg.stan, by running the following R code:

set.seed(123)
n <- 100
x <- rnorm(n)
y <- x+rnorm(n)
mydata <- list(N = n, y = y, x=x)

fit <- stan(file = 'linreg.stan', data = mydata, 
            iter = 1000, chains = 4)

The first time a Stan model is fitted, there is a delay of some seconds while the model is compiled into C++. As the developers of Stan describe however, once you have compiled a model, it is possible to apply it to new datasets without having to repeat the compilation process (a big advantage in the context of performing simulation studies.

In the above code we've asked Stan to run 4 independent chains, with 1000 iterations each. After running, we can summarize the output by:

fit
Inference for Stan model: linreg.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

        mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
alpha  -0.10    0.00 0.10  -0.29  -0.16  -0.10  -0.04   0.09  1346    1
beta    0.95    0.00 0.11   0.75   0.88   0.95   1.02   1.17  1467    1
sigma   0.98    0.00 0.07   0.85   0.93   0.98   1.03   1.12  1265    1
lp__  -47.54    0.06 1.24 -50.77 -48.02 -47.24 -46.68 -46.17   503    1

Samples were drawn using NUTS(diag_e) at Mon Jun 08 18:35:58 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

For the regression slope beta, we have a posterior mean of 0.95 (close to the true value of 1 used to simulate the data). To form a 95% posterior credible interval, we simply take the 2.5% and 97.5% centiles of the sampled posterior, which here is 0.75 to 1.17.

You can get various other quantities out from the fitted model. One is to plot the posterior distribution for one of the model parameters. To obtain this for the regression slope we can run the following:

result <- extract(fit)
hist(result$beta)

which gives

Histogram of posterior distribution of beta
Histogram of posterior distribution of beta

Let's now fit the linear model using standard ordinary least squares:

summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9073 -0.6835 -0.0875  0.5806  3.2904 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.10280    0.09755  -1.054    0.295    
x            0.94753    0.10688   8.865  3.5e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9707 on 98 degrees of freedom
Multiple R-squared:  0.4451,	Adjusted R-squared:  0.4394 
F-statistic:  78.6 on 1 and 98 DF,  p-value: 3.497e-14

This gives us an estimate of the slope of 0.95, the same to 2 decimal places as the posterior mean from Stan, with a standard error of 0.11, which is the same as the posterior SD from Stan. Here we had a moderately large dataset, such that the posterior inferences (or I should say the numerical quantities, rather than the inferences themeselves) are very similar to those obtained by a frequentist maximum likelihood analysis.

My interest in Stan and Bayesian inference
The linear regression model fitted above is obviously not very interesting an example! My interest in exploring Stan, and using it to perform Bayesian inference, is motivated by problems in measurement error and missing data. As was described and shown many years ago by the authors of WinBUGS and others, the Bayesian approach is very natural when tackling problems with different sources of uncertainty over and above parameter uncertainty, e.g. missing data or covariates measured with error. Indeed the popular multiple imputation approach to missing data is developed within the Bayesian paradigm, and in fact can be viewed as an approximation to a full Bayesian analysis.

In future posts I'll explore using Stan to fit models allowing for measurement error and missing data.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.