Textbook

Introduction to Bayesian Data Analysis for Cognitive Science

Nicenboim, Schad, Vasishth

Be sure to read the textbook’s chapters 1-3 in addition to watching this lecture.

Linear modeling

Suppose \(y\) is a vector of continuous responses; assume for now that the \(y\) are independent and identically distributed:

\(y \stackrel{iid}{\sim} Normal(\mu,\sigma)\)

This is the simple linear model:

\(y = \mu + \varepsilon \hbox{ where } \varepsilon \sim Normal(0,\sigma)\)

There are two parameters, \(\mu,\sigma\), so we need priors on these. We expand on this simple model next.

The way we will conduct data analysis follows a general workflow, which is summarized at the end of this lecture. We will now work through some specific examples to illustrate how the data analysis process works.

Example 1: A single subject pressing a button repeatedly

As a first example, we will fit a simple linear model to some reaction time data.

The data frame df_spacebar contains data of a subject pressing the space bar without reading in a self-paced reading experiment.

Install the library

Install the library bcogsci. See:

https://github.com/bnicenboim/bcogsci
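One common way to install it from GitHub (a sketch; this assumes the remotes package is available):

## not run; installs bcogsci from GitHub:
## install.packages("remotes")
remotes::install_github("bnicenboim/bcogsci")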

Load the data

library(bcogsci)
data("df_spacebar")
head(df_spacebar)
## # A tibble: 6 × 2
##       t trial
##   <int> <int>
## 1   141     1
## 2   138     2
## 3   128     3
## 4   132     4
## 5   126     5
## 6   134     6

Visualizing the data

It is a good idea to look at the distribution of the data before doing anything else. See Figure \(\ref{fig:m1visualize}\).

plot(density(df_spacebar$t),
     main="Button-press data",
     xlab="Tapping time")
\label{fig:m1visualize}Visualizing the data.

The data look a bit skewed, but we ignore this for the moment.

Define the likelihood function

Let’s model the data with the following assumptions:

  • There is a true underlying time, \(\mu\), that it takes the participant to press the space bar.
  • There is some noise around \(\mu\).
  • The noise is normally distributed (this assumption is questionable given the skew, but we will fix it later).

This means that the likelihood for each observation \(n\) will be:

\[\begin{equation} \begin{aligned} t_n \sim Normal(\mu, \sigma) \end{aligned} \end{equation}\]

where \(n =1 \ldots N\).

This is just the simple linear model:

\[\begin{equation} t = \mu + \varepsilon \hbox{ where } \varepsilon \sim Normal(0,\sigma) \end{equation}\]

Define the priors for the parameters

We are going to use the following priors for the two parameters in this model:

\[\begin{equation} \begin{aligned} \mu &\sim Normal(0, 2000) \\ \sigma &\sim Normal(0, 500) \text{ truncated so that } \sigma > 0 \end{aligned} \end{equation}\]

In order to decide on a prior for the parameters, always visualize them first. See the Figure below.
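One way to visualize them (a minimal sketch using base R; dtnorm, the truncated normal density, comes from the extraDistr package):

## a sketch: plot the two prior densities side by side
library(extraDistr) ## provides dtnorm, the truncated normal density
op <- par(mfrow = c(1, 2))
x <- seq(-6000, 6000, by = 10)
plot(x, dnorm(x, mean = 0, sd = 2000), type = "l",
     main = "Prior for mu", xlab = expression(mu), ylab = "density")
s <- seq(0, 2000, by = 10)
plot(s, dtnorm(s, mean = 0, sd = 500, a = 0), type = "l",
     main = "Prior for sigma", xlab = expression(sigma), ylab = "density")
par(op)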

\label{fig:m1priors}Visualizing the priors for example 1.

Prior predictive distribution

With these priors, we are going to generate something called the prior predictive distribution. This helps us check whether the priors make sense.

Formally, we want to know the density \(f(\cdot)\) of the data points \(t_1,\dots,t_n\), averaging over the parameter vector \(\Theta\), which has prior density \(f(\Theta)\). In our example, \(\Theta=\langle\mu,\sigma \rangle\). The prior predictive density is:

\[\begin{equation} f(t_1,\dots,t_n)= \int f(t_1\mid \Theta)\cdot f(t_2\mid \Theta)\cdots f(t_n\mid \Theta)\, f(\Theta) \, d\Theta \end{equation}\]

In essence, we integrate out the parameters. Here is one way to do it in R:

  • Take one sample from each of the priors
  • Generate nobs data points using those samples
  • Repeat this nsim times

This gives us an nsim by nobs matrix of generated data. We can then plot the prior predictive densities generated.

## needed for half-normal distribution
library(extraDistr) 
## number of simulations
nsim<-1000
## number of observations generated 
## each time:
nobs<-100
y <- matrix(rep(NA, nsim * nobs), ncol = nobs)
mu <- rnorm(nsim, mean = 0, sd = 2000)
## truncated normal, cut off at 0:
sigma <- rtnorm(nsim, mean = 0, sd = 500, a = 0)

for (i in 1:nsim) {
  y[i, ] <- rnorm(nobs, mean = mu[i], sd = sigma[i])
}
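To visualize the prior predictive densities, one can overlay, say, the first 50 simulated data sets (a sketch):

## overlay the densities of the first 50 simulated data sets:
plot(density(y[1, ]), main = "Prior predictive densities",
     xlab = "y", xlim = c(-10000, 10000))
for (i in 2:50) {
  lines(density(y[i, ]), col = "gray")
}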

We can try to redefine the prior for \(\mu\) to have only positive values, and then check again. We still get some negative values, but that is because we are assuming that

\(y \sim Normal(\mu,\sigma)\)

which will have negative values for small \(\mu\) and large \(\sigma\).

y <- matrix(rep(NA, nsim * nobs), ncol = nobs)
## truncated normal prior for mu, cut off at 0:
mu <- rtnorm(nsim, mean = 0, sd = 2000, a = 0)
for (i in 1:nsim) {
  y[i, ] <- rnorm(nobs, mean = mu[i], sd = sigma[i])
}

Generating prior predictive distributions using Stan

We can generate a prior predictive distribution using Stan as follows.

First, we define a Stan model that defines the priors and defines how the data are to be generated.

Documentation on Stan is available at mc-stan.org.

priorpred<-"data {
  int N; 
}
parameters {
real<lower=0> mu;
real<lower=0> sigma;
}
model {
    mu ~ normal(0,2000);
    sigma ~ normal(0,500);
}
generated quantities {
  vector[N] y_sim;
  for(i in 1:N) {
    y_sim[i] = normal_rng(mu,sigma);
  }}"

Load RStan and brms.

## load rstan
library(rstan)
options(mc.cores = parallel::detectCores())
library(brms)

Then we generate the data:

## generate 100 data-points 
dat<-list(N=100)

## fit model:
m1priorpred <- stan(model_code = priorpred,
                    data = dat,
                    chains = 4,
                    warmup = 1000,
                    iter = 2000)
## extract and plot one of the data-sets:
y_sim<-extract(m1priorpred,pars="y_sim")
str(y_sim)
## List of 1
##  $ y_sim: num [1:4000, 1:100] 2211 2928 1639 1628 4730 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ iterations: NULL
##   .. ..$           : NULL
## plot 100th simulated data set:
hist(y_sim$y_sim[100,],
     main="Prior predictive distribution",
     xlab="y_sim",freq=FALSE)

Fit the model in Stan

Having satisfied ourselves that the priors mostly make sense (do they, though?), we now fit the model to simulated data generated from known parameter values. The goal here is to ensure that the model recovers the true underlying parameters.

Next, we write the Stan model, adding a likelihood in the model block:

m1<-"data {
  int N;
  real y[N]; // data
}
parameters {
real<lower=0> mu;
real<lower=0> sigma;
}
model {
mu ~ normal(0,2000);
sigma ~ normal(0,500);
y ~ normal(mu,sigma);
}
generated quantities {
  vector[N] y_sim;
  for(i in 1:N) {
    y_sim[i] = normal_rng(mu,sigma);
  }}
"

Then generate simulated data with known parameter values (we decide what these are):

set.seed(123)
N <- 500
true_mu <- 400
true_sigma <- 125
y <- rnorm(N, true_mu, true_sigma)

y <- round(y) 
sim_data <- data.frame(y=y)
dat<-list(y=y,N=N)

Finally, we fit the model:

## fit model:
m1rstan <- stan(model_code = m1,
                data = dat,
                chains = 4,
                iter = 2000)

## extract posteriors:
posteriors <- extract(m1rstan,
                      pars = c("mu", "sigma"))
\label{fig:m1rstanpost}Posteriors from fake data, model m1. Vertical lines show the true values of the parameters.

Posterior predictive checks

Once we have the posterior distribution \(f(\Theta\mid y)\), we can derive the predictions based on this posterior distribution:

\[\begin{equation} p(y_{pred}\mid y ) = \int p(y_{pred}, \Theta\mid y)\, d\Theta= \int p(y_{pred}\mid \Theta,y)p(\Theta\mid y)\, d\Theta \end{equation}\]

Assuming that past and future observations are conditionally independent given \(\Theta\), i.e., \(p(y_{pred}\mid \Theta,y)= p(y_{pred}\mid \Theta)\), we can write:

\[\begin{equation} p(y_{pred}\mid y )=\int p(y_{pred}\mid \Theta) p(\Theta\mid y)\, d\Theta \end{equation}\]

Note that we are conditioning \(y_{pred}\) only on \(y\); we do not condition on what we don’t know (\(\Theta\)), but rather integrate out the unknown parameters.

This posterior predictive distribution is different from the frequentist approach, which gives only a predictive distribution of \(y_{pred}\) given our estimate of \(\theta\) (a point value).

In the Stan code above, we have already generated the posterior predictive distribution, in the generated quantities block.
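For example, one can extract the simulated data sets from the fitted model and compare one of them to the (simulated) data (a sketch):

## extract the posterior predictive data sets:
y_pred <- extract(m1rstan, pars = "y_sim")$y_sim
## compare one posterior predictive data set to the data:
hist(y, freq = FALSE,
     main = "Data vs. one posterior predictive draw", xlab = "y")
lines(density(y_pred[1, ]), lty = 2)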

Implementing the model in brms

This model is expressed in brms in the following way. First, define the priors:

priors <- c(set_prior("normal(0, 2000)", 
                      class = "Intercept"),
            set_prior("normal(0, 500)", 
                      class = "sigma"))

Then, define the generative process assumed:

m1brms<-brm(t~1,df_spacebar,prior = priors,
       iter = 2000,
       warmup = 1000,
       chains = 4,
       family = gaussian(), 
       control = list(adapt_delta = 0.99))

Summarizing the posteriors, and convergence diagnostics

A graphical summary of posterior distributions of model m1 is shown below:
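The figure was presumably produced with something like the following (a sketch using brms’ stanplot function):

stanplot(m1brms, type = "hist")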

\label{fig:m1stanplot}Posterior distributions of the parameters in model m1.

The trace plots below show how well the four chains are mixing:
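A sketch of how such trace plots can be produced:

stanplot(m1brms, type = "trace")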

\label{fig:m1traceplot}Trace plots in model m1.

An alternative way to plot is shown below.
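This combined view of densities and trace plots comes from brms’ plot method (a sketch):

plot(m1brms)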

\label{fig:m1plot}Posterior distributions and trace plots in model m1.

Fitting the brms model on simulated data

m1_fakebrms<-brm(y~1,sim_data,prior = priors,
       iter = 2000, chains = 4,family = gaussian(), 
       control = list(adapt_delta = 0.99))

Summarizing the posterior distribution

We are assuming that there’s a true underlying time it takes to press the space bar, \(\mu\), and that normally distributed Normal(0,\(\sigma\)) noise generates the different button-tapping times. All this is encoded in our likelihood by assuming that the tapping times are normally distributed with an unknown true mean \(\mu\) (and an unknown standard deviation \(\sigma\)).

The objective of the Bayesian model is to learn about the plausible values of \(\mu\), or in other words, to get a distribution that encodes what we know about the true mean of the distribution of RTs, and about the true standard deviation, \(\sigma\), of the distribution of RTs.

Our model allows us to answer questions such as:

What is the probability that the underlying time for a mindless press of the space bar is over, say, 170 ms?

As an example, consider this model that we ran above.

priors <- c(set_prior("normal(0, 2000)", 
                      class = "Intercept"),
            set_prior("normal(0, 500)", 
                      class = "sigma"))

m1brms<-brm(t~1,df_spacebar,prior = priors,
       iter = 2000,
       warmup = 1000,
       chains = 4,
       family = gaussian(), 
       control = list(adapt_delta = 0.99))

Now compute the posterior probability \(Prob(\mu>170)\):

mu_post<-posterior_samples(m1brms,
         variable=c("b_Intercept"))$b_Intercept
mean(mu_post>170)
## [1] 0.14225

The credible interval

The 95% credible interval can be extracted for \(\mu\) as follows:

posterior_interval(m1brms,
    variable=c("b_Intercept"))
##                 2.5%    97.5%
## b_Intercept 166.0771 171.1545
posterior_summary(m1brms,
    variable=c("b_Intercept"))
##             Estimate Est.Error     Q2.5    Q97.5
## b_Intercept 168.6368  1.290527 166.0771 171.1545

This type of interval is also known as a credible interval.

A credible interval demarcates the range within which the true value of a parameter lies with the stated probability (for example, 95%), given the data and the model.

Aside: This is very different from the frequentist confidence interval, which has the following meaning: If you were (counterfactually) to run the same experiment over and over again, say 100 times, then 95% of the 100 confidence intervals you would generate would contain the true value of the parameter. The frequentist confidence interval does not tell you the range over which you can be 95% certain that the true value of the parameter lies: the parameter is not a random variable but rather is a point value, so one cannot make probability statements about it.

The percentile interval is a type of credible interval (the most common one), where we assign equal probability mass to each tail.

We generally report 95% credible intervals. But we can extract any interval; a 73% interval, for example, leaves 13.5% of the probability mass on each tail, and we can calculate it like this:

round(quantile(mu_post,
   prob=c(0.135,0.865)))
## 13.5% 86.5% 
##   167   170

The influence of priors and sensitivity analysis

\[\begin{equation} \begin{aligned} \mu &\sim Uniform(0,5000) \\ \sigma &\sim Uniform(0, 500) \end{aligned} \end{equation}\]

priors <- c(set_prior("uniform(0, 5000)", 
                      class = "Intercept",
                      lb = 0,
                      ub = 5000),
            set_prior("uniform(0, 500)", 
                      class = "sigma",
                      lb=0,
                      ub=500))
m2<-brm(t~1,df_spacebar,prior = priors,
       iter = 2000, chains = 4,
       family = gaussian(), 
       control = list(adapt_delta = 0.99))

For looking at the results:

short_summary(m2)
## ...
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   168.66      1.32   166.09   171.21 1.00     2273     1798
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    24.97      0.93    23.17    26.84 1.00     2098     1761
## 
## ...

In general, we don’t want our priors to have too much influence on our posterior.

This is unless we have very good reasons for having informative priors, such as a very small sample and/or a lot of prior information; an example would be if we have data from an impaired population, which makes it hard to increase our sample size.

We usually center the priors on 0 and we let the likelihood dominate in determining the posterior.

This type of prior is sometimes called a weakly informative prior. Notice that a uniform prior is not a weakly informative prior: it assumes that every value is equally likely, so zero is as likely as 5000.

You should always do a sensitivity analysis to check how influential the prior is: try different priors and verify that the posterior doesn’t change drastically.
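A minimal sketch of such a sensitivity analysis, using a hypothetical alternative prior for the intercept, and comparing the posterior summaries:

priors_alt <- c(set_prior("normal(200, 100)",
                          class = "Intercept"),
                set_prior("normal(0, 100)",
                          class = "sigma"))
m1_alt <- brm(t ~ 1, df_spacebar, prior = priors_alt,
              iter = 2000, chains = 4,
              family = gaussian())
## compare with the posterior under the original priors:
posterior_summary(m1_alt, variable = "b_Intercept")
posterior_summary(m1brms, variable = "b_Intercept")

If the two posterior summaries are essentially the same, the likelihood, not the prior, is dominating.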

Example 2: Investigating adaptation effects

More realistically, we might have run the small experiment to find out whether the participant tended to speed up (a practice effect) or slow down (a fatigue effect) while pressing the space bar.

Preprocessing the data

  • We need to have data about the number of times the space bar was pressed for each observation, and add it to our list.

  • It’s a good idea to center the number of presses (a covariate) to have a clearer interpretation of the intercept.

  • In general, centering predictors is always a good idea, for interpretability and for computational reasons.

  • See the contrast coding chapters in the book for details on this point.

library(dplyr) ## for mutate() and the pipe
df_spacebar <- df_spacebar %>%
  mutate(c_trial = trial - mean(trial))

Probability model

Our model changes, because we have a new parameter.

\[\begin{equation} t_n \sim Normal(\alpha + c\_trial_n \cdot \beta,\sigma) \end{equation}\]

where \(n =1 \ldots N\).

We could use the following priors.

\[\begin{equation} \begin{aligned} \alpha &\sim Normal(0, 2000) \\ \beta &\sim Normal(0, 500) \\ \sigma &\sim Normal(0, 500) \text{ truncated so that } \sigma > 0 \\ \end{aligned} \end{equation}\]

We are basically fitting a linear model: \(\alpha\) represents the intercept (namely, the grand mean of the RTs, since the trial covariate is centered), and \(\beta\) represents the slope.

What information are the priors encoding?

Do the priors make sense?

We’ll write this in brms as follows.

priors <- c(set_prior("normal(0, 2000)", 
                      class = "Intercept"),
            set_prior("normal(0, 500)", 
                      class = "b",
                      coef="c_trial"),
            set_prior("normal(0, 500)", 
                      class = "sigma"))
m2<-brm(t~1+c_trial,df_spacebar,
        prior = priors,
       iter = 2000, chains = 4,family = gaussian(), 
       control = list(adapt_delta = 0.99))

Posteriors

library(bayesplot)
stanplot(m2, type = "hist")

We’ll need to examine what happens with \(\beta\). One way is to extract the posterior samples and summarize them ourselves:

m2_post_samp_b <- posterior_samples(m2, "^b")
str(m2_post_samp_b)
## 'data.frame':    4000 obs. of  2 variables:
##  $ b_Intercept: num  170 168 168 168 169 ...
##  $ b_c_trial  : num  0.0874 0.0723 0.0764 0.0938 0.0811 ...
beta_samples <- m2_post_samp_b$b_c_trial 
beta_mean<-mean(beta_samples)
quantiles_beta <- quantile(beta_samples,
                           prob=c(0.025,0.975))
beta_low<-quantiles_beta[1]
beta_high<-quantiles_beta[2]

Alternatively, posterior_summary gives the relevant information directly:

m2_post_samp_b2 <- posterior_summary(m2, "^b")
round(m2_post_samp_b2,2)
##             Estimate Est.Error   Q2.5  Q97.5
## b_Intercept   168.65      1.24 166.21 170.98
## b_c_trial       0.09      0.01   0.07   0.11

Let’s say we know that our model is working as expected, since we already used simulated data to test the recovery of the parameters.

Posterior predictive checks

To do posterior predictive checks for our last example, using brms, we need to do:

library(ggplot2) ## for theme()
pp_check(m2, nsamples = 100) +
  theme(text = element_text(size = 16),
        legend.text = element_text(size = 16))
\label{fig:m2ppc}Posterior predictive check of model m2.

Using the log-normal likelihood

If \(y\) is log-normally distributed with parameters \(\mu\) and \(\sigma\), then \(\log(y)\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma\); equivalently, exponentiating samples from a \(Normal(\mu,\sigma)\) distribution yields log-normal samples. The following simulation illustrates this.

mu <- 6
sigma <- 0.5
N <- 100000
## generate N random samples
## from a log-normal distribution:
sl <- rlnorm(N, mu, sigma)
lognormal_plot <- ggplot(data.frame(samples = sl),
  aes(samples)) + geom_histogram() +
  ggtitle("Log-normal distribution\n") +
  ylim(0, 25000) + xlim(0, 2000) + theme_bw()
## generate N random samples
## from a normal distribution,
## and then exponentiate them:
sn <- exp(rnorm(N, mu, sigma))
normalplot <- ggplot(data.frame(samples = sn),
  aes(samples)) + geom_histogram() +
  ggtitle("Exponentiated samples of\na normal distribution") +
  ylim(0, 25000) + xlim(0, 2000) + theme_bw()
plot(lognormal_plot)
\label{fig:logndemo}The log-normal distribution.

plot(normalplot)
\label{fig:explogndemo}Exponentiated samples from a normal distribution.

Re-fit the model assuming a log-normal likelihood

If we assume that RTs are log-normally distributed, we’ll need to change our model:

\[\begin{equation} t_n \sim LogNormal(\alpha + c\_trial_n \cdot \beta,\sigma) \end{equation}\]

where \(n = 1 \ldots N\).

But now the scale of our priors needs to change! They are on the log scale, no longer in milliseconds.

\[\begin{equation} \begin{aligned} \alpha &\sim Normal(0, 10) \\ \beta &\sim Normal(0, 1) \\ \sigma &\sim Normal(0, 2) \text{ truncated so that } \sigma > 0 \\ \end{aligned} \end{equation}\]
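To get a feel for the new scale, it helps to exponentiate draws from the prior for \(\alpha\) and inspect the implied values in milliseconds (a sketch):

alpha_prior <- rnorm(100000, mean = 0, sd = 10)
## implied median RTs in ms under the prior:
summary(exp(alpha_prior))
## for orientation: exp(6) is approximately 403 ms

In brms, these log-scale priors are then set as follows: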

priors_log <- c(set_prior("normal(0, 10)", 
                      class = "Intercept"),
            set_prior("normal(0, 1)", 
                      class = "b",
                      coef="c_trial"),
            set_prior("normal(0, 2)", 
                      class = "sigma"))
m2_logn<-brm(t~1+ c_trial,df_spacebar,
             prior = priors_log,
       iter = 2000, chains = 4,family = lognormal(), 
       control = list(adapt_delta = 0.99,
                      max_treedepth=15))

Summarizing the posterior and inference

Next, we turn to the question of what we can report as our results, and what we can conclude from the data.

  • We can summarize the posterior and do inference as discussed in Example 1.

  • If we want to talk about the effect estimated by the model, we summarize the posterior of \(\beta\) in the following way:

  • \(\hat\beta = 0.0874124\), 95% CrI = \([ 0.0653573 , 0.1088305 ]\), \(P(\beta >0) \approx 1\)
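A sketch of how these summaries can be computed, using the beta_samples extracted earlier from model m2:

## posterior mean and 95% credible interval:
round(mean(beta_samples), 7)
round(quantile(beta_samples, prob = c(0.025, 0.975)), 7)
## posterior probability that beta > 0:
mean(beta_samples > 0)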

Posterior predictive checks and distribution of summary statistics

We can now verify whether our predicted datasets look more similar to the real dataset. See the figure below.

pp_check(m2_logn, nsamples = 100)+
  theme(text = element_text(size=16),
  legend.text=element_text(size=16))
\label{fig:lognppc}Posterior predictive check.
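The distribution of summary statistics mentioned in the heading can also be examined with pp_check; type "stat" shows how a statistic computed over the simulated data sets compares to the same statistic in the observed data (a sketch):

## e.g., the distribution of minima across simulated data sets:
pp_check(m2_logn, type = "stat", stat = "min", nsamples = 100)
## and of means:
pp_check(m2_logn, type = "stat", stat = "mean", nsamples = 100)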


General workflow

This is the general workflow that we suggest for a Bayesian model.

  1. Define the full probability model:
    1. Decide on the likelihood.
    2. Decide on the priors.
    3. Write the brms or Stan model.
  2. Do prior predictive checks to determine if priors make sense.
  3. Check model using simulated data:
    1. Simulate data with known values for the parameters.
    2. Fit the model and do MCMC diagnostics.
    3. Verify that it recovers the parameters from simulated data.
  4. Fit the model with real data and do MCMC diagnostics.
  5. Evaluate the model’s fit (e.g., posterior predictive checks, distribution of summary statistics). This may send you back to 1.
  6. Inference/prediction/decisions.
  7. Conduct model comparison if there’s an alternative model (to be discussed later).