# Chapter 3 Computational Bayesian data analysis

In the previous chapter, we learned how to analytically derive the posterior distribution of the parameters in our model. In practice, however, this is possible for only a very limited number of cases. Although the numerator of the Bayes rule, the unnormalized posterior, is easy to calculate (by multiplying the probability density/mass functions analytically), the denominator, the marginal likelihood, requires us to carry out an integration; see Equation (3.1).

\begin{aligned} p(\boldsymbol{\Theta}|\boldsymbol{y}) &= \cfrac{ p(\boldsymbol{y}|\boldsymbol{\Theta}) \cdot p(\boldsymbol{\Theta}) }{p(\boldsymbol{y})}\\ p(\boldsymbol{\Theta}|\boldsymbol{y}) &= \cfrac{ p(\boldsymbol{y}|\boldsymbol{\Theta}) \cdot p(\boldsymbol{\Theta}) }{\int_{\boldsymbol{\Theta}} p(\boldsymbol{y}|\boldsymbol{\Theta}) \cdot p(\boldsymbol{\Theta}) d\boldsymbol{\Theta} } \end{aligned} \tag{3.1}

Unless we are dealing with conjugate distributions, the solution will be extremely hard to derive or there will be no analytical solution. This was the major bottleneck of Bayesian analysis in the past, and required Bayesian practitioners to program an approximation method by themselves before they could even begin the Bayesian analysis. Fortunately, many of the probabilistic programming languages freely available today (see the next section for a listing) allow us to define our models without having to acquire expert knowledge about the relevant numerical techniques.

## 3.1 Deriving the posterior through sampling

Let’s say that we want to derive the posterior of the model from section 2.2, that is, the posterior distribution of the cloze probability of “umbrella”, $$\theta$$, given the following data: a word (e.g., “umbrella”) was answered 80 out of 100 times, and assuming a binomial distribution as the likelihood function, and $$\mathit{Beta}(a=4,b=4)$$ as a prior distribution for the cloze probability. If we can obtain samples from the posterior distribution of $$\theta$$, instead of an analytically derived posterior distribution, given enough samples we will have a good approximation of the posterior distribution. Obtaining samples from the posterior will be the only viable option in the models that we will discuss in this book. By “obtaining samples”, we are talking about a situation analogous to when we use rbinom() or rnorm() to obtain samples from a particular distribution. For more details about sampling algorithms, see the further readings suggested in section 3.10.

Thanks to probabilistic programming languages, it will be relatively straightforward to get these samples, and we will discuss how we will do it in more detail in the next section. For now let’s assume that we used some probabilistic programming language to obtain 20000 samples from the posterior distribution of the cloze probability, $$\theta$$: 0.828, 0.813, 0.786, 0.81, 0.806, 0.737, 0.771, 0.722, 0.763, 0.77, 0.778, 0.829, 0.736, 0.838, 0.776, 0.816, 0.743, 0.73, 0.701, 0.764, … Figure 3.1 shows that the approximation of the posterior looks quite similar to the analytically derived posterior. The difference between the analytically computed and approximated mean and variance are $$-0.0004$$ and $$0.000005$$ respectively.

## 3.2 Bayesian Regression Models using Stan: brms

The surge in popularity of Bayesian statistics is closely tied to the increase in computing power and the appearance of probabilistic programming languages, such as WinBUGS (Lunn et al. 2000), JAGS (Plummer 2016), PyMC3 (Salvatier, Wiecki, and Fonnesbeck 2016), Turing (Ge, Xu, and Ghahramani 2018), and Stan (Carpenter et al. 2017); for a historical review, see Plummer (2022).

These probabilistic programming languages allow the user to define models without having to deal (for the most part) with the complexities of the sampling process. However, they require learning a new language since the user has to fully specify the statistical model using a particular syntax.7 Furthermore, some knowledge of the sampling process is needed to correctly parameterize the models and to avoid convergence issues (these topics will be covered in detail in chapter 11).

There are some alternatives that allow Bayesian inference in R without having to fully specify the model “by hand”. The packages rstanarm (Goodrich et al. 2018) and brms (Bürkner 2019) provide Bayesian equivalents of many popular R model-fitting functions, such as (g)lmer (Bates, Mächler, et al. 2015b) and many others; both rstanarm and brms use Stan as the back-end for estimation and sampling. The package R-INLA (Lindgren and Rue 2015) allows for fitting a limited selection of likelihood functions and priors in comparison to rstanarm and brms (R-INLA can fit models that can be expressed as latent Gaussian models). This package uses the integrated nested Laplace approximation (INLA) method for approximating Bayesian inference rather than a sampling algorithm as it is used by the other probabilistic languages listed. Another alternative is JASP (JASP Team 2019), which provides a graphical user interface for both frequentist and Bayesian modeling, and is intended to be an open-source alternative to SPSS.

We will focus on brms in this part of the book. This is because it can be useful for a smooth transition from frequentist models to their Bayesian equivalents. The package brms is not only powerful enough to satisfy the statistical needs of many cognitive scientists, it has the added benefit that the Stan code can be inspected (with the brms functions make_stancode() and make_standata()), allowing the user to customize their models or learn from the code produced internally by brms to eventually transition to writing the models entirely in Stan. We revisit the models of this and the following chapters in the introduction to Stan in chapter 10.

### 3.2.1 A simple linear model: A single subject pressing a button repeatedly (a finger tapping task)

We’ll use the following example of a finger tapping task (for a review, see Hubel et al. 2013) to illustrate the basic steps for fitting a model. Suppose that a subject first sees a blank screen. Then, after a certain amount of time (say $$200$$ ms), the subject sees a cross in the middle of a screen, and as soon as they see the cross, they tap on the space bar as fast as they can until the experiment is over ($$361$$ trials). The dependent measure here is the time it takes in milliseconds from one press of the space bar to the next one. The data in each trial are therefore finger tapping times in milliseconds. Suppose that the research question is: how long does it take for this particular subject to press a key? (As an aside, notice that the data are not independent and identically distributed but are ignoring that detail for now; we will address that issue later in the book).

Let’s model the data with the following assumptions:

1. There is a true (unknown) underlying time, $$\mu$$ ms, that the subject needs to press the space bar.
2. There is some noise in this process.
3. The noise is normally distributed (this assumption is questionable given that finger tapping as also response times are generally skewed; we will fix this assumption later).8

This means that the likelihood for each observation $$n$$ will be:

$$$t_n \sim \mathit{Normal}(\mu, \sigma) \tag{3.2}$$$

where $$n =1, \ldots, N$$, and $$t$$ is the dependent variable (finger tapping times in milliseconds). The variable $$N$$ indexes the total number of data points. The symbol $$\mu$$ indicates the location of the normal distribution function; the location parameter shifts the distribution left or right on the horizontal axis. For the normal distribution, the location is also the mean of the distribution. The symbol $$\sigma$$ indicates the scale of the distribution; as the scale decreases, the distribution gets narrower. This compressing approaches a spike (all the probability mass get concentrated near one point) as the scale parameter approaches zero. For the normal distribution, the scale is also its standard deviation.

The reader may have encountered the model shown in Equation (3.2) in the form shown in Equation (3.3):

$$$t_n = \mu + \varepsilon \hbox{, where } \varepsilon_n \stackrel{iid}{\sim} \mathit{Normal}(0,\sigma) \tag{3.3}$$$

When the model is written in this way, it should be understood as meaning that each data point $$t_n$$ has some variability around a mean value $$\mu$$, and that variability has standard deviation $$\sigma$$. The term iid’’ (independent and identically distributed) implies that each data point $$t_n$$ is independently generated (is not correlated with any of the other data points), and is coming from the same distribution (namely, $$Normal(\mu,\sigma)$$).

For a frequentist model that will give us the maximum likelihood estimate (the sample mean) of the time it takes to press the space bar, this would be enough information to write the formula in R, t ~ 1, and plug it into the function lm() together with the data: lm(t ~ 1, data). The meaning of the 1 here is that lm will estimate the intercept in the model, in our case $$\mu$$. If the reader is completely unfamiliar with linear models, the references in section 4.5 will be helpful.

For a Bayesian linear model, we will also need to define priors for the two parameters of our model. Let’s say that we know for sure that the time it takes to press a key will be positive and lower than a minute (or $$60000$$ ms), but we don’t want to make a commitment regarding which values are more likely. We encode what we know about the noise in the task in $$\sigma$$: we know that this parameter must be positive and we’ll assume that any value below $$2000$$ ms is equally likely. These priors are in general strongly discouraged: A flat (or very wide) prior will almost never be the best approximation of what we know. Prior specification will be discussed in detail in chapter 6.

In this case, even if we know very little about the task, we know that pressing the spacebar will take at most a couple of seconds. We’ll use flat priors in this section for pedagogical purposes; the next sections will already show more realistic uses of priors.

\begin{aligned} \mu &\sim \mathit{Uniform}(0, 60000) \\ \sigma &\sim \mathit{Uniform}(0, 2000) \end{aligned} \tag{3.4}

First, load the data frame df_spacebar from the bcogsci package:

data("df_spacebar")
df_spacebar
## # A tibble: 361 × 2
##       t trial
##   <int> <int>
## 1   141     1
## 2   138     2
## 3   128     3
## # … with 358 more rows

It is always a good idea to plot the data before doing anything else; see Figure 3.2. As we suspected, the data look a bit skewed, but we ignore this for the moment.


ggplot(df_spacebar, aes(t)) +
geom_density() +
xlab("Finger tapping times") +
ggtitle("Button-press data")

#### 3.2.1.1 Specifying the model in brms

Fit the model defined by equations (3.2) and (3.4) with brms in the following way.

fit_press <- brm(t ~ 1,
data = df_spacebar,
family = gaussian(),
prior = c(
prior(uniform(0, 60000), class = Intercept, lb = 0, ub = 60000),
prior(uniform(0, 2000), class = sigma, lb = 0, ub = 2000)
),
chains = 4,
iter = 2000,
warmup = 1000
)

The brms code has some differences from a model fit with lm. At this beginning stage, we’ll focus on the following options:

1. The term family = gaussian() makes it explicit that the underlying likelihood function is a normal distribution (Gaussian and normal are synonyms). This detail is implicit in lm. Other linking functions are possible, exactly as in the glm function. The default for brms that corresponds to the lm function is gaussian().
2. The term prior takes as argument a vector of priors. Although this specification of priors is optional, the researcher should always explicitly specify each prior. Otherwise, brms will define priors by default, which may or may not be appropriate for the research area. In cases where the distribution has a restricted coverage, that is, not every value is valid (e.g., smaller than $$0$$ or larger than $$60000$$ are not valid for the intercept), we need to set lower and upper boundaries with lb and ub.9
3. The term chains refers to the number of independent runs for sampling (by default four).
4. The term iter refers to the number of iterations that the sampler makes to sample from the posterior distribution of each parameter (by default 2000).
5. The term warmup refers to the number of iterations from the start of sampling that are eventually discarded (by default half of iter).

The last three options, chains, iter, warmup determine the behavior of the sampling algorithm: the No-U-Turn Sampler (NUTS; Hoffman and Gelman 2014) extension of Hamiltonian Monte Carlo (Duane et al. 1987; Neal 2011). We will discuss sampling in a bit more depth in chapter 10, but the basic process is explained next.

#### 3.2.1.2 Sampling and convergence in a nutshell

The code specification starts four chains independently from each other. Each chain “searches” for samples of the posterior distribution in a multidimensional space, where each parameter corresponds to a dimension. The shape of this space is determined by the priors and the likelihood. The chains start at random locations, and in each iteration they take one sample each. When sampling begins, the samples may or may not belong to the posterior distributions of the parameters. Eventually, the chains end up in the vicinity of the posterior distribution, and from that point onward the samples will belong to the posterior.

Thus, when sampling begins, the samples from the different chains can be far from each other, but at some point they will “converge” and start delivering samples from the posterior distributions. Although there are no guarantees that the number of iterations we run the chains for will be sufficient for obtaining samples from the posteriors, the default values of brms (and Stan) are in many cases sufficient to achieve convergence. When the default number of iterations do not suffice, brms (actually, Stan) will print out warnings, with suggestions for fixing the convergence problems. If all the chains converge to the same distribution, by removing the “warmup” samples, we make sure that we do not get samples from the initial path to the posterior distributions. The default in brms is that half of the total number of iterations in each chain (which default to 2000) will count as “warmup”. So, if one runs a model with four chains and the default number of iterations, we will obtain a total of 4000 samples from the four chains, after discarding the warmup iterations.

Figure 3.3(a) shows the path of the chains from the warmup phase onwards. Such plots are called trace plots. The warmup is shown only for illustration purposes; generally, one should only inspect the chains after the point where convergence has (presumably) been achieved (i.e., after the dashed line). After convergence has occurred, a visual diagnostic check is that chains should look like a “fat hairy caterpillar.” Compare the trace plot of our model in Figure 3.3(a) with the trace plot of a model that did not converge, shown in Figure 3.3(b).

Trace plots are not always diagnostic as regards convergence. The trace plots might look fine, but the model may not have converged. Fortunately, Stan automatically runs several diagnostics with the information from the chains, and if there are no warnings after fitting the model and the trace plots look fine, we can be reasonably sure that the model converged, and assume that our samples are from the true posterior distribution. However, it is necessary to run more than one chain (preferably four), with a couple of thousands of iterations (at least) in order for the diagnostics to work.

#### 3.2.1.3 Output of brms

Once the model has been fit (and assuming that we got no warning messages about convergence problems), we can print out the samples of the posterior distributions of each of the parameters using as_draws_df() (which stores metadata about the chains) or with as.data.frame():

as_draws_df(fit_press) %>%
head(3) 
## # A draws_df: 3 iterations, 1 chains, and 4 variables
##   b_Intercept sigma lprior  lp__
## 1         167    27    -19 -1686
## 2         171    24    -19 -1684
## 3         171    24    -19 -1684
## # ... hidden reserved variables {'.chain', '.iteration', '.draw'}

The term b_Intercept in the brms output corresponds to our $$\mu$$. We can ignore the last two columns: lp is not really part of the posterior, it’s the log-density of the unnormalized posterior for each iteration (lp will be discussed later in Box 10.1), lprior is the log-density of the (joint) prior distribution and it is there for compatibility with the package priorsense (https://github.com/n-kall/priorsense).

Plot the density and trace plot of each parameter after the warmup (Figure 3.4).

Printing the object with the brms fit provides a nice, if somewhat verbose, summary:

fit_press
# posterior_summary(fit_press) is also useful
##  Family: gaussian
##   Links: mu = identity; sigma = identity
## Formula: t ~ 1
##    Data: df_spacebar (Number of observations: 361)
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
##
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   168.63      1.26   166.18   171.06 1.00     3234     2568
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    25.02      0.95    23.25    26.95 1.00     2809     2222
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The Estimate is just the mean of the posterior samples, Est.Error is the standard deviation of the posterior and the CIs mark the lower and upper bounds of the 95% credible intervals (to distinguish credible intervals from frequentist confidence intervals, the former will be abbreviated as CrIs):

as_draws_df(fit_press)$b_Intercept %>% mean() ## [1] 169 as_draws_df(fit_press)$b_Intercept %>% sd()
## [1] 1.26
sigma_samples <- as_draws_df(fit_press)sigma normal_predictive_distribution( mu_samples = mu_samples, sigma_samples = sigma_samples, N_obs = N_obs ) ## # A tibble: 1,444,000 × 3 ## iter trialn t_pred ## <dbl> <int> <dbl> ## 1 1 1 132. ## 2 1 2 128. ## 3 1 3 174. ## # … with 1,443,997 more rows The brms function posterior_predict() is a convenient function that delivers samples from the posterior predictive distribution. Using the command posterior_predict(fit_press) yields the predicted finger tapping times in a matrix, with the samples as rows and the observations (data-points) as columns. (Bear in mind that if a model is fit with sample_prior = "only", the dependent variable is ignored and posterior_predict will yield samples from the prior predictive distribution). The posterior predictive distribution can be used to examine the “descriptive adequacy” of the model under consideration (Gelman et al. 2014, Chapter 6; Shiffrin et al. 2008). Examining the posterior predictive distribution to establish descriptive adequacy is called posterior predictive checks. The goal here is to establish that the posterior predictive data look more or less similar to the observed data. Achieving descriptive adequacy means that the current data could have been generated by the model. Although passing a test of descriptive adequacy is not strong evidence in favor of a model, a major failure in descriptive adequacy can be interpreted as strong evidence against a model (Shiffrin et al. 2008). For this reason, comparing the descriptive adequacy of different models is not enough to differentiate between their relative performance. When doing model comparison, it is important to consider the criteria that Roberts and Pashler (2000) define. Although Roberts and Pashler (2000) are more interested in process models and not necessarily Bayesian models, their criteria are important for any kind of model comparison. Their main point is that it is not enough to have a good fit to the data for a model to be convincing. One should check that the range of predictions that the model makes is reasonably constrained; if a model can capture any possible outcome, then the model fit to a particular data set is not so informative. In the Bayesian modeling context, although posterior predictive checking is important, it is only a sanity check to assess whether the model behavior is reasonable (for more on this point, see chapter 7). In many cases, one can simply use the plot functions from brms (that act as wrappers for bayesplot functions). For example, the plotting function pp_check() takes as arguments the model, the number of predicted data sets, and the type of visualization, and it can display different visualizations of posterior predictive checks. In these type of plots, the observed data are plotted as $$y$$ and predicted data as $$y_{rep}$$. Below, we use pp_check() to investigate how well the observed distribution of finger tapping times fit our model based on some number ($$11$$ and $$100$$) of samples of the posterior predictive distributions (that is, simulated data sets) ; see Figures 3.8 and 3.9. pp_check(fit_press, ndraws = 11, type = "hist") pp_check(fit_press, ndraws = 100, type = "dens_overlay") The data is slightly skewed and has no values smaller than $$100$$ ms, but the predictive distributions are centered and symmetrical; see Figures 3.8 and 3.9. This posterior predictive check shows a slight mismatch between the observed and predicted data. Can we build a better model? We’ll come back to this issue in the next section. ## 3.7 The influence of the likelihood Finger tapping times (and response times in general) are not usually normally distributed. A more realistic distribution is the log-normal. A random variable (such as time) that is log-normally distributed takes only positive real values and is right-skewed. Although other distributions can also produce data with such properties, the log-normal will turn out to be a pretty reasonable distribution for finger tapping times and response times. ### 3.7.1 The log-normal likelihood If $$\boldsymbol{y}$$ is log-normally distributed, this means that $$\log(\boldsymbol{y})$$ is normally distributed.12 The log-normal distribution is also defined using the parameters location, $$\mu$$, and scale, $$\sigma$$, but these are on the log ms scale; they correspond to the mean and standard deviation of the logarithm of the data $$\boldsymbol{y}$$, $$\log(\boldsymbol{y})$$, which will be normally distributed. Thus, when we model some data $$\boldsymbol{y}$$ using the log-normal likelihood, the parameters $$\mu$$ and $$\sigma$$ are on a different scale than the data $$\boldsymbol{y}$$. Equation (3.9) shows the relationship between the log-normal and the normal. \begin{aligned} \log(\boldsymbol{y}) &\sim \mathit{Normal}( \mu, \sigma)\\ \boldsymbol{y} &\sim \mathit{LogNormal}( \mu, \sigma) \end{aligned} \tag{3.9} We can obtain samples from the log-normal distribution, using the normal distribution by first setting an auxiliary variable $$z$$, so that $$z = \log(y)$$. This means that $$z \sim \mathit{Normal}(\mu, \sigma)$$. Then we can just use $$exp(z)$$ as samples from the $$\mathit{LogNormal}(\mu, \sigma)$$, since $$\exp(z) =\exp(\log(y)) = y$$. The code below produces Figure 3.10. mu <- 6 sigma <- 0.5 N <- 500000 # Generate N random samples from a log-normal distribution sl <- rlnorm(N, mu, sigma) ggplot(tibble(samples = sl), aes(samples)) + geom_histogram(aes(y = after_stat(density)), binwidth = 50) + ggtitle("Log-normal distribution\n") + coord_cartesian(xlim = c(0, 2000)) # Generate N random samples from a normal distribution, # and then exponentiate them sn <- exp(rnorm(N, mu, sigma)) ggplot(tibble(samples = sn), aes(samples)) + geom_histogram(aes(y = after_stat(density)), binwidth = 50) + ggtitle("Exponentiated samples from\na normal distribution") + coord_cartesian(xlim = c(0, 2000)) ### 3.7.2 Using a log-normal likelihood to fit data from a single subject pressing a button repeatedly If we assume that finger tapping times are log-normally distributed, the likelihood function changes as follows: $$$t_n \sim \mathit{LogNormal}(\mu,\sigma)$$$ But now the scale of the priors needs to change! Let’s start with uniform priors for ease of exposition, even though, as we mentioned earlier, these are not really appropriate here. (More realistic priors are discussed below.) \begin{aligned} \mu &\sim \mathit{Uniform}(0, 11) \\ \sigma &\sim \mathit{Uniform}(0, 1) \\ \end{aligned} \tag{3.10} Because the parameters are on a different scale than the dependent variable, their interpretation changes and it is more complex than if we were dealing with a linear model that assumes a normal likelihood (location and scale do not coincide with the mean and standard deviation of the log-normal): • The location $$\mu$$: In our previous linear model, $$\mu$$ represented the mean (in a normal distribution, the mean happens to be identical to the median and the mode). But now, the mean needs to be calculated by computing $$\exp(\mu +\sigma ^{2}/2)$$. In other words, in the log-normal, the mean is dependent on both $$\mu$$ and $$\sigma$$. The median is just $$\exp(\mu)$$. Notice that the prior of $$\mu$$ is not on the milliseconds scale, but rather on the log milliseconds scale. • The scale $$\sigma$$: This is the standard deviation of the normal distribution of $$\log(\boldsymbol{y})$$. The standard deviation of a log-normal distribution with location $$\mu$$ and scale $$\sigma$$ will be $$\exp(\mu +\sigma ^{2}/2)\times \sqrt{\exp(\sigma^2)- 1}$$. Unlike the normal distribution, the spread of the log-normal distribution depends on both $$\mu$$ and $$\sigma$$. To understand the meaning of the priors on the millisecond scale, both the priors and the likelihood need to be taken into account. Generating a prior predictive distribution will help in interpreting the priors. This distribution can be generated by just exponentiating the samples produced by normal_predictive_distribution() (or, alternatively, edit the function by replacing rnorm() with rlnorm()). N_samples <- 1000 N_obs <- nrow(df_spacebar) mu_samples <- runif(N_samples, 0, 11) sigma_samples <- runif(N_samples, 0, 1) prior_pred_ln <- normal_predictive_distribution( mu_samples = mu_samples, sigma_samples = sigma_samples, N_obs = N_obs ) %>% mutate(t_pred = exp(t_pred)) Next, plot the distribution of some representative statistics; see Figure 3.11. We cannot generate negative values any more, since $$\exp($$any finite real number$$) > 0$$. These priors might work in the sense that the model might converge; but it would be better to have regularizing priors for the model. An example of regularizing priors: \begin{aligned} \mu &\sim \mathit{Normal}(6, 1.5) \\ \sigma &\sim \mathit{Normal}_+(0, 1) \\ \end{aligned} \tag{3.11} The prior for $$\sigma$$ here is a truncated distribution, and although its location is zero, this is not its mean. We can calculate its approximate mean from a large number of random samples of the prior distribution using the function rtnorm() from the package extraDistr. In this function, we have to set the parameter a = 0 to express the fact that the normal distribution is truncated from the left at 0. (Box 4.1 discusses this type of distribution in detail): mean(rtnorm(100000, 0, 1, a = 0)) ## [1] 0.795 Even before generating the prior predictive distributions, we can calculate the values within which we are 95% sure that the expected median of the observations will lie. We do this by looking at what happens at two standard deviations away from the mean of the prior, $$\mu$$, that is $$6 - 2\times 1.5$$ and $$6 + 2\times 1.5$$, and exponentiating these values: c( lower = exp(6 - 2 * 1.5), higher = exp(6 + 2 * 1.5) ) ## lower higher ## 20.1 8103.1 This means that the prior for $$\mu$$ is still not too informative (these are medians; the actual values generated by the log-normal distribution can be much more spread out). Next, plot the distribution of some representative statistics of the prior predictive distributions. brms allows one to sample from the priors, ignoring the observed data t , by setting sample_prior = "only" in the brm function. If we want to use brms to generate prior predictive data in this manner even before we have any data, (at least for now) we do need to have some non-NA values as the dependent variable t. Setting sample_prior = "only" will ignore the data, but we still need to add a data frame in the data = specification in the brm function: In this case, we add a vector of 1 as “data”. The family is specified as lognormal(); recall that in the first example, the family was gaussian(). df_spacebar_ref <- df_spacebar %>% mutate(t = rep(1, n())) fit_prior_press_ln <- brm(t ~ 1, data = df_spacebar_ref, family = lognormal(), prior = c( prior(normal(6, 1.5), class = Intercept), prior(normal(0, 1), class = sigma) ), sample_prior = "only", control = list(adapt_delta = .9) ) To avoid the warnings, increase the adapt_delta parameter’s default value from $$0.8$$ to $$0.9$$ to simulate the data. Since Stan samples from the prior distributions in the same way that it samples from the posterior distribution, one should not ignore warnings; always ensure that the model converged. In that respect, the custom function normal_predictive_distribution() defined in section 3.3 has the advantage that it will always yield independent samples from the prior distribution and will not experience any convergence problems. This is because it just relies on the rnorm() function in R. Plot the prior predictive distribution of means with the following code (the figure is not produced here, to conserve space). In a prior predictive distribution, we generally want to ignore the data; this requires setting prefix = "ppd" in pp_check(). pp_check(fit_prior_press_ln, type = "stat", stat = "mean", prefix = "ppd") + coord_cartesian(xlim = c(0.001, 300000)) + scale_x_continuous("Finger tapping times [ms]", trans = "log", breaks = c(0.001, 1, 100, 1000, 10000, 100000), labels = c( "0.001", "1", "100", "1000", "10000", "100000" ) ) + ggtitle("Prior predictive distribution of means") To plot the distribution of minimum, and maximum values, just replace mean with min, and max respectively. The distributions of the three statistics are displayed in Figure 3.12.  p1 <- pp_check(fit_prior_press_ln, type = "stat", stat = "mean", prefix = "ppd") + coord_cartesian(xlim = c(0.001, 300000)) + scale_x_continuous("Finger tapping times [ms]", trans = "log", breaks = c(0.001, 1, 100, 1000, 10000, 100000), labels = c( "0.001", "1", "100", "1000", "10000", "100000" ) ) + ggtitle("Prior predictive distribution of means") p2 <- pp_check(fit_prior_press_ln, type = "stat", stat = "min", prefix = "ppd") + coord_cartesian(xlim = c(0.001, 300000)) + scale_x_continuous("Finger tapping times [ms]", trans = "log", breaks = c(0.001, 1, 100, 1000, 10000, 100000), labels = c( "0.001", "1", "100", "1000", "10000", "100000" ) ) + ggtitle("Prior predictive distribution of minimum values") p3 <- pp_check(fit_prior_press_ln, type = "stat", stat = "max", prefix = "ppd") + coord_cartesian(xlim = c(0.001, 300000)) + scale_x_continuous("Finger tapping times [ms]", trans = "log", breaks = c(0.001, 1, 100, 1000, 10000, 100000), labels = c( "0.001", "1", "10", "1000", "10000", "100000" ) ) + ggtitle("Prior predictive distribution of maximum values") plot_grid(p1, p2, p3, nrow = 3, ncol =1) Figure 3.12 shows that the priors used here are still quite uninformative. The tails of the prior predictive distributions that correspond to our normal priors shown in Figure 3.12 are even further to the right, reaching more extreme values than for the prior predictive distributions generated by uniform priors (shown in Figure 3.11). The new priors are still far from representing our prior knowledge. We could run more iterations of choosing priors and generating prior predictive distributions until we have priors that generate realistic data. However, given that the bulk of the distributions of the mean, maximum, minimum values lies roughly in the correct order of magnitude, these priors are going to be acceptable. In general, summary statistics (e.g., mean, median, min, max) can be used to test whether the priors are in a plausible range. This can be done by defining, for the particular research problem under study, the extreme data that would be very implausible to ever observe (e.g., reading times at a word larger than one minute) and choosing priors such that such extreme finger tapping times occur only very rarely in the prior predictive distribution. Next, fit the model; recall that both the distribution family and prior change in comparison to the previous example. fit_press_ln <- brm(t ~ 1, data = df_spacebar, family = lognormal(), prior = c( prior(normal(6, 1.5), class = Intercept), prior(normal(0, 1), class = sigma) ) ) When we look at the summary of the posterior, the parameters are on the log-scale: fit_press_ln ## ... ## Population-Level Effects: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## Intercept 5.12 0.01 5.10 5.13 1.00 3994 2616 ## ## Family Specific Parameters: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## sigma 0.13 0.00 0.13 0.15 1.00 3247 2641 ## ## ... If the research goal is to find out how long it takes to press the space bar in milliseconds, we need to transform the $$\mu$$ (or Intercept in the model) to milliseconds. Because the median of the log-normal distribution is $$\exp(\mu)$$, the following returns the estimates in milliseconds: estimate_ms <- exp(as_draws_df(fit_press_ln)b_Intercept)

To display the mean and 95% credible interval of these samples, type:

c(mean = mean(estimate_ms), quantile(estimate_ms, probs = c(.025, .975)))
##  mean  2.5% 97.5%
##   167   165   169

Next, check whether the predicted data sets look similar to the observed data set. See Figure 3.13; compare this with the earlier Figure 3.9.

pp_check(fit_press_ln, ndraws = 100)

The key question is: Are the posterior predicted data now more similar to the observed data, compared to the case where we had a Normal likelihood? According to Figure 3.13, it seems so, but it’s not easy to tell.

Another way to examine the extent to which the predicted data looks similar to the observed data would be to look at the distribution of some summary statistic. As with prior predictive distributions, examine the distribution of representative summary statistics for the data sets generated by different models. However, in contrast with what occurs with prior predictive distributions, at this point we have a clear reference, our observations, and this means that we can compare the summary statistics with the observed statistics from our data. We suspect that the normal distribution would generate finger tapping times that are too fast (since it’s symmetrical) and that the log-normal distribution may capture the long tail better than the normal model. Based on this supposition, compute the distribution of minimum and maximum values for the posterior predictive distributions, and compare them with the minimum and maximum value respectively in the data. The function pp_check() implements this by specifying stat either "min" or "max" for both fit_press, and fit_press_ln; an example is shown below. The plots are shown in Figures 3.14 and 3.15.

pp_check(fit_press, type = "stat", stat = "min")

Figure 3.14 shows that the log-normal likelihood does a slightly better job since the minimum value is contained in the bulk of the log-normal distribution and in the tail of the normal one. Figure 3.15 shows that both models are unable to capture the maximum value of the observed data. One explanation for this is that the log-normal-ish observations in our data are being generated by the task of pressing as fast as possible, whereas the observations with long finger tapping times are being generated by lapses of attention. This would mean that two probability distributions are mixed here; modeling this process involves more complex tools that we will take up in chapter 19.

This completes our introduction to brms. We are now ready to learn about more regression models.

## 3.8 List of the most important commands

Here is a list of the most important commands we learned in this chapter.

• The core brms function for fitting models, for generating prior predictive and posterior predictive data:
fit_press <- brm(t ~ 1,
data = df_spacebar,
family = gaussian(),
prior = c(
prior(uniform(0, 60000), class = Intercept, lb = 0, ub = 60000),
prior(uniform(0, 2000), class = sigma, lb = 0, ub = 2000)
),
chains = 4,
iter = 2000,
warmup = 1000
## uncomment for prior predictive:
## sample_prior = "only",
## uncomment when dealing with divergent transitions
## control = list(adapt_delta = .9)
)
• Extract samples from fitted model:
as_draws_df(fit_press)
• Basic plot of posteriors
plot(fit_press)
• Plot prior predictive/posterior predictive data
## Posterior predictive check:
pp_check(fit_press, ndraws = 100, type = "dens_overlay")
## Plot posterior predictive distribution of statistical summaries:
pp_check(fit_press, ndraws = 100, type = "stat", stat = "mean")
## Plot prior predictive distribution of statistical summaries:
pp_check(fit_press, ndraws = 100, type = "stat", stat = "mean",
prefix = "ppd")

## 3.9 Summary

This chapter showed how to fit and interpret a Bayesian model with a normal likelihood. We looked at the effect of priors by investigating prior predictive distributions and by carrying out a sensitivity analysis. We also looked at the fit of the posterior, by inspecting the posterior predictive distribution (which gives us some idea about the descriptive adequacy of the model). We also showed how to fit a Bayesian model with a log-normal likelihood, and how to compare the predictive accuracy of different models.

## 3.10 Further reading

Sampling algorithms are discussed in detail in Gamerman and Lopes (2006). Also helpful are the sections on sampling from the short open-source book by Bob Carpenter, Probability and Statistics: a simulation-based introduction (https://github.com/bob-carpenter/prob-stats), and the sections on sampling algorithms in Lambert (2018) and Lynch (2007). Introductory linear modeling theory is covered in Dobson and Barnett (2011); more advanced treatments are in Montgomery, Peck, and Vining (2012) and Seber and Lee (2003). Generalized linear models are covered in detail in McCullagh and Nelder (2019). The reader may also benefit from our own freely available online lecture notes on linear modeling: https://github.com/vasishth/LM.

## 3.11 Exercises

Exercise 3.1 A simple linear model.

1. Fit the model fit_press with just a few iterations, say 50 iterations (set warmup to the default of 25, and use four chains). Does the model converge?

2. Using normal distributions, choose priors that better represent your assumptions/beliefs about finger tapping times. To think about a reasonable set of priors for $$\mu$$ and $$\sigma$$, you should come up with your own subjective assessment about what you think a reasonable range of values can be for $$\mu$$ and how much variability might happen. There is no correct answer here, we’ll discuss priors in depth in chapter 6. Fit this model to the data. Do the posterior distributions change?

Exercise 3.2 Revisiting the button-pressing example with different priors.

1. Can you come up with very informative priors that influence the posterior in a noticeable way (use normal distributions for priors, not uniform priors)? Again, there are no correct answers here; you may have to try several different priors before you can noticeably influence the posterior.
2. Generate and plot prior predictive distributions based on this prior and plot them.

3. Generate posterior predictive distributions based on this prior and plot them.

Exercise 3.3 Posterior predictive checks with a log-normal model.

1. For the log-normal model fit_press_ln, change the prior of $$\sigma$$ so that it is a log-normal distribution with location ($$\mu$$) of $$-2$$ and scale ($$\sigma$$) of $$0.5$$. What does such a prior imply about your belief regarding button-pressing times in milliseconds? Is it a good prior? Generate and plot prior predictive distributions. Do the new estimates change compared to earlier models when you fit the model?
2. For the log-normal model, what is the mean (rather than median) time that takes to press the space bar, what is the standard deviation of the finger tapping times in milliseconds?

Exercise 3.4 A skew normal distribution.

Would it make sense to use a “skew normal distribution” instead of the log-normal? The skew normal distribution has three parameters: location $$\xi$$ (this is the lower-case version of the Greek letter $$\Xi$$, pronounced “chi”, with the “ch” pronounced like the “ch” in “Bach”), scale $$\omega$$ (omega), and shape $$\alpha$$. The distribution is right skewed if $$\alpha >0$$, is left skewed if $$\alpha <0$$, and is identical to the regular normal distribution if $$\alpha =0$$. For fitting this in brms, one needs to change family and set it to skew_normal(), and add a prior of class = alpha (location remains class = Intercept and scale, class = sigma).

1. Fit this model with a prior that assigns approximately 95% of the prior probability of alpha to be between 0 and 10.
2. Generate posterior predictive distributions and compare the posterior distribution of summary statistics of the skew normal with the normal and log-normal.

### References

Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015b. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Bürkner, Paul-Christian. 2019. brms: Bayesian Regression Models Using “Stan”. https://CRAN.R-project.org/package=brms.

Carpenter, Bob, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael J. Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76 (1). Columbia Univ., New York, NY (United States); Harvard Univ., Cambridge, MA (United States).

Dobson, Annette J, and Adrian Barnett. 2011. An Introduction to Generalized Linear Models. CRC press.

Duane, Simon, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. “Hybrid Monte Carlo.” Physics Letters B 195 (2): 216–22. https://doi.org/10.1016/0370-2693(87)91197-X.

Gamerman, Dani, and Hedibert F Lopes. 2006. Markov chain Monte Carlo: Stochastic simulation for Bayesian inference. CRC Press.

Ge, Hong, Kai Xu, and Zoubin Ghahramani. 2018. “Turing: A Language for Flexible Probabilistic Inference.” In Proceedings of Machine Learning Research, edited by Amos Storkey and Fernando Perez-Cruz, 84:1682–90. Playa Blanca, Lanzarote, Canary Islands: PMLR. http://proceedings.mlr.press/v84/ge18b.html.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. Third Edition. Boca Raton, FL: Chapman; Hall/CRC Press.

Gelman, Andrew, Daniel Simpson, and Michael J. Betancourt. 2017. “The Prior Can Often Only Be Understood in the Context of the Likelihood.” Entropy 19 (10): 555. https://doi.org/10.3390/e19100555.

Goodrich, Ben, Jonah Gabry, Imad Ali, and Sam Brilleman. 2018. “Rstanarm: Bayesian Applied Regression Modeling via Stan.” http://mc-stan.org/.

Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (1): 1593–1623. http://dl.acm.org/citation.cfm?id=2627435.2638586.

Hubel, Kerry A, Bruce Reed, E William Yund, Timothy J Herron, and David L Woods. 2013. “Computerized Measures of Finger Tapping: Effects of Hand Dominance, Age, and Sex.” Perceptual and Motor Skills 116 (3). SAGE Publications Sage CA: Los Angeles, CA: 929–52.

JASP Team. 2019. “JASP (Version 0.11.1)[Computer software].” https://jasp-stats.org/.

Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge university press.

Jeffreys, Harold. 1939. Theory of Probability. Oxford: Clarendon Press.

Lambert, Ben. 2018. A Student’s Guide to Bayesian Statistics. London, UK: Sage.

Lindgren, Finn, and Håvard Rue. 2015. “Bayesian Spatial Modelling with R-INLA.” Journal of Statistical Software 63 (1): 1–25.

Luce, R Duncan. 1991. Response Times: Their Role in Inferring Elementary Mental Organization. Oxford University Press.

Lunn, D.J., A. Thomas, N. Best, and D. Spiegelhalter. 2000. “WinBUGS-A Bayesian Modelling Framework: Concepts, Structure, and Extensibility.” Statistics and Computing 10 (4). Springer: 325–37.

Lynch, Scott Michael. 2007. Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. New York, NY: Springer.

McCullagh, Peter, and J.A. Nelder. 2019. Generalized Linear Models. Second Edition. Boca Raton, Florida: Chapman; Hall/CRC.

Montgomery, D. C., E. A. Peck, and G. G. Vining. 2012. An Introduction to Linear Regression Analysis. 5th ed. Hoboken, NJ: Wiley.

Neal, Radford M. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo, edited by Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Taylor & Francis. https://doi.org/10.1201/b10905-10.

O’Hagan, Antony, and Jonathan Forster. 2004. “Kendall’s Advanced Theory of Statistics, Vol. 2B: Bayesian Inference.” Wiley.

Plummer, Martin. 2016. “JAGS Version 4.2.0 User Manual.”

Plummer, Martyn. 2022. “Simulation-Based Bayesian Analysis.” Annual Reviews.

Roberts, Seth, and Harold Pashler. 2000. “How Persuasive Is a Good Fit? A Comment on Theory Testing.” Psychological Review 107 (2): 358–67.

Salvatier, John, Thomas V. Wiecki, and Christopher Fonnesbeck. 2016. “Probabilistic Programming in Python Using PyMC3.” PeerJ Computer Science 2 (April). PeerJ: e55. https://doi.org/10.7717/peerj-cs.55.

Seber, George A. F., and Allen J. Lee. 2003. Linear Regression Analysis. 2nd Edition. Hoboken, NJ: John Wiley; Sons.

Shiffrin, Richard, Michael D. Lee, Woojae Kim, and Eric-Jan Wagenmakers. 2008. “A Survey of Model Evaluation Approaches with a Tutorial on Hierarchical Bayesian Methods.” Cognitive Science: A Multidisciplinary Journal 32 (8): 1248–84. https://doi.org/10.1080/03640210802414826.

Vasishth, Shravan, Zhong Chen, Qiang Li, and Gueilan Guo. 2013. “Processing Chinese Relative Clauses: Evidence for the Subject-Relative Advantage.” PLoS ONE 8 (10). Public Library of Science: 1–14.

Vasishth, Shravan, and Felix Engelmann. 2022. Sentence Comprehension as a Cognitive Process: A Computational Approach. Cambridge, UK: Cambridge University Press. https://books.google.de/books?id=6KZKzgEACAAJ.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2019. “Rank-Normalization, Folding, and Localization: An Improved $$\widehat R$$ for Assessing Convergence of MCMC.” arXiv Preprint arXiv:1903.08008.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2019. “Rank-Normalization, Folding, and Localization: An Improved $$\widehat{R}$$ for Assessing Convergence of MCMC.”

1. The Python package PyMC3 and the Julia library Turing are recent exceptions since they are fully integrated into their respective languages.

2. We refer to the time it takes for a subject to respond or react to a stimuli as response time (response and reaction times are often used interchangeably, cf. Luce 1991). In this case, however, there are no stimuli, and the subject only taps the space bar.

3. The problem here is that although the parameter for the intercept is assigned a uniform distribution bounded between $$0$$ and $$60000$$ ms, the sampler might start sampling from an initial value outside this range. The sampler can start from an initial value that is outside the $$0$$-$$60000$$ range because the initial value is chosen randomly (unless the user specifies an initial value explicitly).

4. We’ll see later how to generate prior predictive distributions of statistics such as mean, minimum, or maximum value in section 3.7.2 using brms and pp_check().

5. Even though, in theory, one could use even wider priors, in practice, these are the widest priors that achieve convergence.

6. More precisely, $$\log_e(\boldsymbol{y})$$ or $$\ln(\boldsymbol{y})$$, but we’ll write it as just $$log()$$.