Playing with Bayes and RStan

statistical computing

First time pretending to be a Bayesian, and trying RStan.

Author

Kenneth Hung

Published

September 25, 2020

I did not really much statistical training in my undergrad days, and my knowledge of statistics is pretty much confined to whatever grad level statistics classes Berkeley offered — 99% of those was frequentist — so I lack the Bayesian exposure that most statistics undergrad would have received. So when something slightly Bayesian (does empirical Bayes count?) showed up, I decided to teach myself using Gelman et al. (2013).

It is hard to learn something new without any examples, and I happen to stumble upon this tweet:

Trend line can be misleading, and there are good years too. But certainly it looks like bad years had become worse. Just curious, where can I find the data itself for educational purposes?
— Kenneth Hung ((kenhungkk?)) September 7, 2020

The data itself is here and I got to learn how to handle PDFs with pdftools as well.

library(pdftools)

Using poppler version 22.08.0

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)
library(stringr)
library(ggplot2)
library(rstan)

Loading required package: StanHeaders

rstan (Version 2.21.5, GitRev: 2e1f913d3ca3)

For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)


Attaching package: 'rstan'

The following object is masked from 'package:tidyr':

    extract

library(foreach)
library(doParallel)

Loading required package: iterators

Loading required package: parallel

theme_set(theme_minimal())
options(mc.cores = parallel::detectCores())
registerDoParallel(cores = parallel::detectCores())

raw.text <- pdf_text(
  'https://www.fire.ca.gov/media/11397/fires-acres-all-agencies-thru-2018.pdf'
) %>%
  str_split("\n") %>%
  unlist()

data <- raw.text[4:35] %>%
  as_tibble(.name_repair = "unique") %>%
  mutate(
    full.year = str_sub(value, end = 6), acres = str_sub(value, start = 132)
  ) %>%
  select(-value) %>%
  mutate_all(str_trim) %>%
  mutate_all(str_replace_all, ",", "") %>%
  mutate_all(as.numeric) %>%
  mutate(year = full.year - min(full.year))
N <- nrow(data)

ggplot(data) + geom_line(aes(x = full.year, y = acres))

Like I said in the tweet, a linear fit seems off.

Using the default priors, I ran a Bayesian linear regression. I have to say, seeing the chains running and utilizing my new laptop’s computation power was very exciting.

model.stan <- "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
"

fit <- stan(
  model_code = model.stan,
  data = list(N = N, x = data$year, y = data$acres),
  iter = 5000
)

summary(fit)$summary

             mean      se_mean           sd        2.5%         25%         50%
alpha 240006.8645 2.539418e+03 1.551917e+05 -68013.3021 137908.1746 239675.9123
beta   25350.8491 1.403148e+02 8.616489e+03   8211.8978  19674.1566  25370.7932
sigma 447205.8818 8.786446e+02 6.045076e+04 348075.7837 404490.9808 440528.5524
lp__    -418.5083 2.070267e-02 1.255783e+00   -421.7157   -419.1005   -418.1869
              75%       97.5%    n_eff      Rhat
alpha 344199.6611 552040.4477 3734.813 0.9997395
beta   31137.5510  42277.3893 3770.976 0.9999660
sigma 481952.3839 586203.5716 4733.442 1.0006884
lp__    -417.5703   -417.0499 3679.393 1.0005287

Since all Bayesians do is do posterior draws, I don’t find it hard to understand the result. But what matters the most to me is that the data is fit well. As BDA would suggest, I should do a posterior predictive check, specifically something that would demonstrate my suspicion that linear model isn’t a good fit. By looking at the time series, I would guess that there are fewer positive residuals than negative ones. So I used the proportion of positive residuals after OLS as the test statistic. Embarrassingly it took me a long time to realize how to do a posterior predictive check for regression, the most basic example in Chapter 14 surprisingly did not emphasize this part.

post <- extract(fit)
post.pred.stat <- foreach(
  alpha = post$alpha,
  beta = post$beta,
  sigma = post$sigma,
  .combine = "c"
) %dopar% {
  y.rep <- alpha + beta * data$year + sigma * rnorm(N)
  residual.rep <- residuals(lm(y.rep ~ data$year))
  mean(residual.rep > 0)
}

residual <- residuals(lm(acres ~ year, data = data))
obs.stat <- mean(residual > 0)

ggplot() +
  geom_histogram(
    aes(x = post.pred.stat), alpha = 0.5, binwidth = 1 / nrow(data)
  ) +
  geom_vline(xintercept = obs.stat)

It looks like we have way too few positive residuals and I should probably use a log-linear model.

model.stan <- "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
"

fit <- stan(
  model_code = model.stan,
  data = list(N = N, x = data$year, y = log(data$acres)),
  iter = 5000
)

summary(fit)$summary

             mean      se_mean         sd         2.5%         25%        50%
alpha 12.41908326 0.0043276351 0.27139695  11.89028656 12.23991697 12.4138605
beta   0.04168707 0.0002393191 0.01502463   0.01160145  0.03212815  0.0418301
sigma  0.76934629 0.0015107737 0.10299026   0.60040099  0.69765468  0.7585032
lp__  -7.16169509 0.0235539924 1.33780557 -10.65728017 -7.71969447 -6.8090985
              75%       97.5%    n_eff     Rhat
alpha 12.59937319 12.95217175 3932.861 1.000875
beta   0.05154402  0.07100689 3941.421 1.000715
sigma  0.82879300  0.99799036 4647.222 1.000257
lp__  -6.20961693 -5.68965437 3225.945 1.000177

And we perform the same check to see if the residuals are symmetric.

post <- extract(fit)
post.pred.stat <- foreach(
  alpha = post$alpha,
  beta = post$beta,
  sigma = post$sigma,
  .combine = "c"
) %dopar% {
  y.rep <- alpha + beta * data$year + sigma * rnorm(N)
  residual.rep <- residuals(lm(y.rep ~ data$year))
  mean(residual.rep > 0)
}

residual <- residuals(lm(log(acres) ~ year, data = data))
obs.stat <- mean(residual > 0)

ggplot() +
  geom_histogram(
    aes(x = post.pred.stat), alpha = 0.5, binwidth = 1 / nrow(data)
  ) +
  geom_vline(xintercept = obs.stat)

Much better! Of course the model is not going to be correct, but we just need to keep checking for statistics that we care about. One idea I had is the number of times a new record is set. From the time series, it is 5 — we will count the very first year, not that it really matters. I thought this may be revealing if there is a lot of autocorrelation in the time series — for example, the more acres are burnt the previous year, the less there is to burn the year after.

post <- extract(fit)
post.pred.stat <- foreach(
  alpha = post$alpha,
  beta = post$beta,
  sigma = post$sigma,
  .combine = "c"
) %dopar% {
  y.rep <- alpha + beta * data$year + sigma * rnorm(N)
  sum(y.rep == cummax(y.rep))
}

obs.stat <- sum(data$acres == cummax(data$acres))

ggplot() +
  geom_histogram(
    aes(x = post.pred.stat), alpha = 0.5, binwidth = 1
  ) +
  geom_vline(xintercept = obs.stat)

Not too bad! For both model it looks like the slope is positive. There are many other data that would have been relevant to this analysis, such as the rainfall the year before and other climate data. There are also more sophisticated things such as Bayesian ARIMA that I could do (but I don’t know how), but hey, there are only 32 points in this dataset.

References

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013), Bayesian Data Analysis.