Chapter 3 Computational Bayesian data analysis

In the previous chapter, we learned how to analytically derive the posterior distribution of the parameters in our model. In practice, however, this is possible for only a very limited number of cases. Although the numerator of the Bayes rule, the unnormalized posterior, is easy to calculate (by multiplying the probability density/mass functions analytically), the denominator, the marginal likelihood, requires us to carry out an integration; see Equation (3.1).

\[\begin{equation} \begin{aligned} p(\boldsymbol{\Theta}|\boldsymbol{y}) &= \cfrac{ p(\boldsymbol{y}|\boldsymbol{\Theta}) \cdot p(\boldsymbol{\Theta}) }{p(\boldsymbol{y})}\\ p(\boldsymbol{\Theta}|\boldsymbol{y}) &= \cfrac{ p(\boldsymbol{y}|\boldsymbol{\Theta}) \cdot p(\boldsymbol{\Theta}) }{\int_{\boldsymbol{\Theta}} p(\boldsymbol{y}|\boldsymbol{\Theta}) \cdot p(\boldsymbol{\Theta}) d\boldsymbol{\Theta} } \end{aligned} \tag{3.1} \end{equation}\]

Unless we are dealing with conjugate distributions, the solution will be extremely hard to derive or there will be no analytical solution. This was the major bottleneck of Bayesian analysis in the past, and required Bayesian practitioners to program an approximation method by themselves before they could even begin the Bayesian analysis. Fortunately, many of the probabilistic programming languages freely available today (see the next section for a listing) allow us to define our models without having to acquire expert knowledge about the relevant numerical techniques.

3.1 Deriving the posterior through sampling

Let’s say that we want to derive the posterior of the model from section 2.2, that is, the posterior distribution of the cloze probability of “umbrella”, \(\theta\), given the following data: a word (e.g., “umbrella”) was answered 80 out of 100 times, and assuming a binomial distribution as the likelihood function, and \(\mathit{Beta}(a=4,b=4)\) as a prior distribution for the cloze probability. If we can obtain samples from the posterior distribution of \(\theta\), instead of an analytically derived posterior distribution, given enough samples we will have a good approximation of the posterior distribution. Obtaining samples from the posterior will be the only viable option in the models that we will discuss in this book. By “obtaining samples”, we are talking about a situation analogous to when we use rbinom() or rnorm() to obtain samples from a particular distribution. For more details about sampling algorithms, see the further readings suggested in section 3.10.

Thanks to probabilistic programming languages, it will be relatively straightforward to get these samples, and we will discuss how we will do it in more detail in the next section. For now let’s assume that we used some probabilistic programming language to obtain 20000 samples from the posterior distribution of the cloze probability, \(\theta\): 0.828, 0.813, 0.786, 0.81, 0.806, 0.737, 0.771, 0.722, 0.763, 0.77, 0.778, 0.829, 0.736, 0.838, 0.776, 0.816, 0.743, 0.73, 0.701, 0.764, … Figure 3.1 shows that the approximation of the posterior looks quite similar to the analytically derived posterior. The difference between the analytically computed and approximated mean and variance are \(-0.0004\) and \(0.000005\) respectively.

Histogram of the samples of \(\theta\) from the posterior distribution generated via sampling. The black line shows the density plot of the analytically derived posterior.

FIGURE 3.1: Histogram of the samples of \(\theta\) from the posterior distribution generated via sampling. The black line shows the density plot of the analytically derived posterior.

3.2 Bayesian Regression Models using Stan: brms

The surge in popularity of Bayesian statistics is closely tied to the increase in computing power and the appearance of probabilistic programming languages, such as WinBUGS (Lunn et al. 2000), JAGS (Plummer 2016), PyMC3 (Salvatier, Wiecki, and Fonnesbeck 2016), Turing (Ge, Xu, and Ghahramani 2018), and Stan (Carpenter et al. 2017); for a historical review, see Plummer (2022).

These probabilistic programming languages allow the user to define models without having to deal (for the most part) with the complexities of the sampling process. However, they require learning a new language since the user has to fully specify the statistical model using a particular syntax.7 Furthermore, some knowledge of the sampling process is needed to correctly parameterize the models and to avoid convergence issues (these topics will be covered in detail in chapter 11).

There are some alternatives that allow Bayesian inference in R without having to fully specify the model “by hand”. The packages rstanarm (Goodrich et al. 2018) and brms (Bürkner 2019) provide Bayesian equivalents of many popular R model-fitting functions, such as (g)lmer (Bates, Mächler, et al. 2015b) and many others; both rstanarm and brms use Stan as the back-end for estimation and sampling. The package R-INLA (Lindgren and Rue 2015) allows for fitting a limited selection of likelihood functions and priors in comparison to rstanarm and brms (R-INLA can fit models that can be expressed as latent Gaussian models). This package uses the integrated nested Laplace approximation (INLA) method for approximating Bayesian inference rather than a sampling algorithm as it is used by the other probabilistic languages listed. Another alternative is JASP (JASP Team 2019), which provides a graphical user interface for both frequentist and Bayesian modeling, and is intended to be an open-source alternative to SPSS.

We will focus on brms in this part of the book. This is because it can be useful for a smooth transition from frequentist models to their Bayesian equivalents. The package brms is not only powerful enough to satisfy the statistical needs of many cognitive scientists, it has the added benefit that the Stan code can be inspected (with the brms functions make_stancode() and make_standata()), allowing the user to customize their models or learn from the code produced internally by brms to eventually transition to writing the models entirely in Stan. We revisit the models of this and the following chapters in the introduction to Stan in chapter 10.

3.2.1 A simple linear model: A single subject pressing a button repeatedly (a finger tapping task)

We’ll use the following example of a finger tapping task (for a review, see Hubel et al. 2013) to illustrate the basic steps for fitting a model. Suppose that a subject first sees a blank screen. Then, after a certain amount of time (say \(200\) ms), the subject sees a cross in the middle of a screen, and as soon as they see the cross, they tap on the space bar as fast as they can until the experiment is over (\(361\) trials). The dependent measure here is the time it takes in milliseconds from one press of the space bar to the next one. The data in each trial are therefore finger tapping times in milliseconds. Suppose that the research question is: how long does it take for this particular subject to press a key? (As an aside, notice that the data are not independent and identically distributed but are ignoring that detail for now; we will address that issue later in the book).

Let’s model the data with the following assumptions:

  1. There is a true (unknown) underlying time, \(\mu\) ms, that the subject needs to press the space bar.
  2. There is some noise in this process.
  3. The noise is normally distributed (this assumption is questionable given that finger tapping as also response times are generally skewed; we will fix this assumption later).8

This means that the likelihood for each observation \(n\) will be:

\[\begin{equation} t_n \sim \mathit{Normal}(\mu, \sigma) \tag{3.2} \end{equation}\]

where \(n =1, \ldots, N\), and \(t\) is the dependent variable (finger tapping times in milliseconds). The variable \(N\) indexes the total number of data points. The symbol \(\mu\) indicates the location of the normal distribution function; the location parameter shifts the distribution left or right on the horizontal axis. For the normal distribution, the location is also the mean of the distribution. The symbol \(\sigma\) indicates the scale of the distribution; as the scale decreases, the distribution gets narrower. This compressing approaches a spike (all the probability mass get concentrated near one point) as the scale parameter approaches zero. For the normal distribution, the scale is also its standard deviation.

The reader may have encountered the model shown in Equation (3.2) in the form shown in Equation (3.3):

\[\begin{equation} t_n = \mu + \varepsilon \hbox{, where } \varepsilon_n \stackrel{iid}{\sim} \mathit{Normal}(0,\sigma) \tag{3.3} \end{equation}\]

When the model is written in this way, it should be understood as meaning that each data point \(t_n\) has some variability around a mean value \(\mu\), and that variability has standard deviation \(\sigma\). The term ``iid’’ (independent and identically distributed) implies that each data point \(t_n\) is independently generated (is not correlated with any of the other data points), and is coming from the same distribution (namely, \(Normal(\mu,\sigma)\)).

For a frequentist model that will give us the maximum likelihood estimate (the sample mean) of the time it takes to press the space bar, this would be enough information to write the formula in R, t ~ 1, and plug it into the function lm() together with the data: lm(t ~ 1, data). The meaning of the 1 here is that lm will estimate the intercept in the model, in our case \(\mu\). If the reader is completely unfamiliar with linear models, the references in section 4.5 will be helpful.

For a Bayesian linear model, we will also need to define priors for the two parameters of our model. Let’s say that we know for sure that the time it takes to press a key will be positive and lower than a minute (or \(60000\) ms), but we don’t want to make a commitment regarding which values are more likely. We encode what we know about the noise in the task in \(\sigma\): we know that this parameter must be positive and we’ll assume that any value below \(2000\) ms is equally likely. These priors are in general strongly discouraged: A flat (or very wide) prior will almost never be the best approximation of what we know. Prior specification will be discussed in detail in chapter 6.

In this case, even if we know very little about the task, we know that pressing the spacebar will take at most a couple of seconds. We’ll use flat priors in this section for pedagogical purposes; the next sections will already show more realistic uses of priors.

\[\begin{equation} \begin{aligned} \mu &\sim \mathit{Uniform}(0, 60000) \\ \sigma &\sim \mathit{Uniform}(0, 2000) \end{aligned} \tag{3.4} \end{equation}\]

First, load the data frame df_spacebar from the bcogsci package:

## # A tibble: 361 × 2
##       t trial
##   <int> <int>
## 1   141     1
## 2   138     2
## 3   128     3
## # … with 358 more rows

It is always a good idea to plot the data before doing anything else; see Figure 3.2. As we suspected, the data look a bit skewed, but we ignore this for the moment.

Visualizing the button-press data.

FIGURE 3.2: Visualizing the button-press data. Specifying the model in brms

Fit the model defined by equations (3.2) and (3.4) with brms in the following way.

The brms code has some differences from a model fit with lm. At this beginning stage, we’ll focus on the following options:

  1. The term family = gaussian() makes it explicit that the underlying likelihood function is a normal distribution (Gaussian and normal are synonyms). This detail is implicit in lm. Other linking functions are possible, exactly as in the glm function. The default for brms that corresponds to the lm function is gaussian().
  2. The term prior takes as argument a vector of priors. Although this specification of priors is optional, the researcher should always explicitly specify each prior. Otherwise, brms will define priors by default, which may or may not be appropriate for the research area. In cases where the distribution has a restricted coverage, that is, not every value is valid (e.g., smaller than \(0\) or larger than \(60000\) are not valid for the intercept), we need to set lower and upper boundaries with lb and ub.9
  3. The term chains refers to the number of independent runs for sampling (by default four).
  4. The term iter refers to the number of iterations that the sampler makes to sample from the posterior distribution of each parameter (by default 2000).
  5. The term warmup refers to the number of iterations from the start of sampling that are eventually discarded (by default half of iter).

The last three options, chains, iter, warmup determine the behavior of the sampling algorithm: the No-U-Turn Sampler (NUTS; Hoffman and Gelman 2014) extension of Hamiltonian Monte Carlo (Duane et al. 1987; Neal 2011). We will discuss sampling in a bit more depth in chapter 10, but the basic process is explained next. Sampling and convergence in a nutshell

The code specification starts four chains independently from each other. Each chain “searches” for samples of the posterior distribution in a multidimensional space, where each parameter corresponds to a dimension. The shape of this space is determined by the priors and the likelihood. The chains start at random locations, and in each iteration they take one sample each. When sampling begins, the samples may or may not belong to the posterior distributions of the parameters. Eventually, the chains end up in the vicinity of the posterior distribution, and from that point onward the samples will belong to the posterior.

Thus, when sampling begins, the samples from the different chains can be far from each other, but at some point they will “converge” and start delivering samples from the posterior distributions. Although there are no guarantees that the number of iterations we run the chains for will be sufficient for obtaining samples from the posteriors, the default values of brms (and Stan) are in many cases sufficient to achieve convergence. When the default number of iterations do not suffice, brms (actually, Stan) will print out warnings, with suggestions for fixing the convergence problems. If all the chains converge to the same distribution, by removing the “warmup” samples, we make sure that we do not get samples from the initial path to the posterior distributions. The default in brms is that half of the total number of iterations in each chain (which default to 2000) will count as “warmup”. So, if one runs a model with four chains and the default number of iterations, we will obtain a total of 4000 samples from the four chains, after discarding the warmup iterations.

Figure 3.3(a) shows the path of the chains from the warmup phase onwards. Such plots are called trace plots. The warmup is shown only for illustration purposes; generally, one should only inspect the chains after the point where convergence has (presumably) been achieved (i.e., after the dashed line). After convergence has occurred, a visual diagnostic check is that chains should look like a “fat hairy caterpillar.” Compare the trace plot of our model in Figure 3.3(a) with the trace plot of a model that did not converge, shown in Figure 3.3(b).

Trace plots are not always diagnostic as regards convergence. The trace plots might look fine, but the model may not have converged. Fortunately, Stan automatically runs several diagnostics with the information from the chains, and if there are no warnings after fitting the model and the trace plots look fine, we can be reasonably sure that the model converged, and assume that our samples are from the true posterior distribution. However, it is necessary to run more than one chain (preferably four), with a couple of thousands of iterations (at least) in order for the diagnostics to work.

a. Trace plots of our brms model for the button-pressing data. All the chains start from initial values above 200 and are outside of the plot. b. Trace plots of a model that did not converge. We can diagnose the non-convergence by the observing that the chains do not overlap—each chain seems to be sampling from a different distribution.

FIGURE 3.3: a. Trace plots of our brms model for the button-pressing data. All the chains start from initial values above 200 and are outside of the plot. b. Trace plots of a model that did not converge. We can diagnose the non-convergence by the observing that the chains do not overlap—each chain seems to be sampling from a different distribution. Output of brms

Once the model has been fit (and assuming that we got no warning messages about convergence problems), we can print out the samples of the posterior distributions of each of the parameters using as_draws_df() (which stores metadata about the chains) or with

## # A draws_df: 3 iterations, 1 chains, and 4 variables
##   b_Intercept sigma lprior  lp__
## 1         167    27    -19 -1686
## 2         171    24    -19 -1684
## 3         171    24    -19 -1684
## # ... hidden reserved variables {'.chain', '.iteration', '.draw'}

The term b_Intercept in the brms output corresponds to our \(\mu\). We can ignore the last two columns: lp is not really part of the posterior, it’s the log-density of the unnormalized posterior for each iteration (lp will be discussed later in Box 10.1), lprior is the log-density of the (joint) prior distribution and it is there for compatibility with the package priorsense (

Plot the density and trace plot of each parameter after the warmup (Figure 3.4).

Density and trace plots of our brms model for the button-pressing data.

FIGURE 3.4: Density and trace plots of our brms model for the button-pressing data.

Printing the object with the brms fit provides a nice, if somewhat verbose, summary:

##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: t ~ 1 
##    Data: df_spacebar (Number of observations: 361) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   168.63      1.26   166.18   171.06 1.00     3234     2568
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    25.02      0.95    23.25    26.95 1.00     2809     2222
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The Estimate is just the mean of the posterior samples, Est.Error is the standard deviation of the posterior and the CIs mark the lower and upper bounds of the 95% credible intervals (to distinguish credible intervals from frequentist confidence intervals, the former will be abbreviated as CrIs):

## [1] 169
## [1] 1.26
##  2.5% 97.5% 
##   166   171

Furthermore, the summary provides the Rhat, Bulk_ESS, and Tail_ESS of each parameter. R-hat compares the between- and within-chain estimates of each parameter. R-hat is larger than 1 when chains have not mixed well, one can only rely on the model if the R-hats for all the parameters are less than 1.05. (R warnings will appear otherwise). Bulk ESS (bulk effective sample size) is a measure of sampling efficiency in the bulk of the posterior distribution, that is the effective sample size for the mean and median estimates, whereas tail ESS (tail effective sample size) indicates the sampling efficiency at the tails of the distribution, that is the minimum of effective sample sizes for 5% and 95% quantiles. The effective sample size is generally smaller than the number of post-warmup samples, because the samples from the chains are not independent (they are correlated to some extent), and carry less information about the posterior distribution in comparison to independent samples. In some cases, however, such as with the Intercept here, the effective sample size is actually larger than the number of post-warmup samples. This might happen for parameters with a normally distributed posterior (in the unconstrained space, see Box (thm:target)) and low dependence on the other parameters (Vehtari, Gelman, Simpson, Carpenter, and Bürkner 2019). Very low effective sample size indicates sampling problems (and are accompanied by R warnings) and in general appear together with chains that are not properly mixed. As rule of thumb, Vehtari, Gelman, Simpson, Carpenter, and Bürkner (2019) suggest that a minimum of 400 effective sample size is required for statistical summaries.

We see that we can fit our model without problems, and we get some posterior distributions for our parameters. However, we should ask ourselves the following questions:

  1. What information are the priors encoding? Do the priors make sense?
  2. Does the likelihood assumed in the model make sense for the data?

We’ll try to answer these questions by looking at the prior and posterior predictive distributions, and by doing sensitivity analyses. This is explained in the following sections.

3.3 Prior predictive distribution

We had defined the following priors for our linear model:

\[\begin{equation} \begin{aligned} \mu &\sim \mathit{Uniform}(0, 60000) \\ \sigma &\sim \mathit{Uniform}(0, 2000) \end{aligned} \tag{3.5} \end{equation}\]

These priors encode assumptions about the kind of data we would expect to see in a future study. To understand these assumptions, we are going to generate data from the model; such data, which is generated entirely by the prior distributions, is called the prior predictive distribution. Generating prior predictive distributions repeatedly helps us to check whether the priors make sense. What we want to know here is, do the priors generate realistic-looking data?

Formally, we want to know the density \(p(\cdot)\) of data points \(y_{pred_1},\dots,y_{pred_N}\) from a data set \(\boldsymbol{y_{pred}}\) of length \(N\), given a vector of priors \(\boldsymbol{\Theta}\) and our likelihood \(p(\cdot|\boldsymbol{\Theta})\); (in our example, \(\boldsymbol{\Theta}=\langle\mu,\sigma \rangle\)). The prior predictive density is written as follows:

\[\begin{equation} \begin{aligned} p(\boldsymbol{y_{pred}}) &= p(y_{pred_1},\dots,y_{pred_n})\\ &= \int_{\boldsymbol{\Theta}} p(y_{pred_1}|\boldsymbol{\Theta})\cdot p(y_{pred_2}|\boldsymbol{\Theta})\cdots p(y_{pred_N}|\boldsymbol{\Theta}) p(\boldsymbol{\Theta}) \, d\boldsymbol{\Theta} \end{aligned} \end{equation}\]

In essence, the vector of parameters is integrated out. This yields the probability distribution of possible data sets given the priors and the likelihood, before any observations are taken into account.

The integration can be carried out computationally by generating samples from the prior distribution.

Here is one way to generate prior predictive distributions:

Repeat the following many times:

  1. Take one sample from each of the priors.
  2. Plug those samples into the probability density/mass function used as the likelihood in the model to generate a data set \(y_{pred_1},\ldots,y_{pred_n}\).

Each sample is an imaginary or potential data set.

Create a function that does this:

The following code produces \(1000\) samples of the prior predictive distribution of the model that we defined in section 3.2.1. This means that it will produce \(361000\) predicted values (\(361\) predicted observations for each of the \(1000\) simulations). Although this approach works, it’s quite slow (a couple of seconds). See Box 3.1 for a more efficient version of this function. Section 3.7.2 will show that it’s possible to use brms to sample from the priors, ignoring the t in the data by setting sample_prior = "only". However, since brms still depends on Stan’s sampler, which uses Hamiltonian Monte Carlo, the prior sampling process can also fail to converge, especially when one uses very uninformative priors, like the ones used in this example. In contrast, our function above, which uses rnorm(), cannot have convergence issues and will always produce multiple sets of prior predictive data \(y_{pred_1},\ldots,y_{pred_n}\).

## 2.48 sec elapsed
## # A tibble: 361,000 × 3
##   trialn t_pred  iter
##    <dbl>  <dbl> <dbl>
## 1      1 16710.     1
## 2      2 16686.     1
## 3      3 17245.     1
## # … with 360,997 more rows

Box 3.1 A more efficient function for generating prior predictive distribution

A more efficient function can be created in the following way using the map2_dfr() function from the purrr package. This function yields an approximately 10-fold increase in speed. Although the distributions should be the same with both functions, the specific numbers in the tables won’t be, due to the randomness in the process of sampling.

The purrr function map2_dfr() (which works similarly to the base R function lapply() and Map()) essentially runs a for-loop, and builds a data frame with the output. It iterates over the values of two vectors (or lists) simultaneously, here, mu_samples and sigma_samples and, in each iteration, it applies a function to each value of the two vectors, here, mu and sigma. The output of each function is a data frame (or tibble in this case) with N_obs observations which is bound in a larger data frame at the end of the loop. Each of these data frames bound together represents an iteration in the simulation, and we identify the iterations by setting .id = "iter".

Although this method for generating prior predictive distributions is a bit involved, it presents an advantage in comparison to the more straightforward use of predict() (or posterior_predict(), which can also generate prior predictions) together with setting sample_prior = "only" in the brms model (as we will do in section 3.7.2). Namely, here we don’t depend on Stan’s sampler, and that means that no matter the number of iterations in our simulation or how uninformative our priors, there will never be any convergence problems.

## 0.596 sec elapsed

Figure 3.5 shows the first 18 samples of the prior predictive distribution (i.e., 18 independently generated prior predicted data sets) with the code below.

Eighteen samples from the prior predictive distribution of the model defined in section 3.2.1.

FIGURE 3.5: Eighteen samples from the prior predictive distribution of the model defined in section 3.2.1.

The prior predictive distribution in Figure 3.5 shows prior data sets that are not realistic: Apart from the fact that the data sets show that finger tapping times distributions are symmetrical–and we know from prior experience with such data that they are generally right-skewed–some data sets present finger tapping times that are unrealistically long. Worse yet, if we inspect enough samples from the prior predicted data, it will become clear that a few data sets have negative finger tapping time values.

We can also look at the distribution of summary statistics in the prior predictive data. Even if we don’t know beforehand what the data should look like, it’s very likely that we have some expectations for possible mean, minimum, or maximum values. For example, in the button-pressing example, it seems reasonable to assume that average finger tapping times are between \(200\)-\(600\) ms; finger tapping times are very unlikely to be below \(50\) ms (given the delays in keyboards), and even long lapses of attention won’t be greater than a couple of seconds.10 Three distributions of summary statistics are shown in Figure 3.6.

The prior predictive distributions of the mean, minimum, and maximum values of the button-pressing model defined in section 3.2.1.

FIGURE 3.6: The prior predictive distributions of the mean, minimum, and maximum values of the button-pressing model defined in section 3.2.1.

Figure 3.6 shows that we used much less prior information than what we could have: Our priors were encoding the information that any mean between \(0\) and \(60000\) ms is equally likely. It seems clear that a value close to \(0\) or to \(60000\) ms would be extremely surprising. This wide range of mean values occurs because of the uniform prior on \(\mu\). Similarly, maximum values are quite “uniform”, spanning a much wider range than what one would expect. Finally, in the distribution of minimum values, negative finger tapping times occur. This might seem surprising (our prior for \(\mu\) excluded negative values), but the reason that negative values appear is that the prior is interpreted together with the likelihood (Gelman, Simpson, and Betancourt 2017), and the likelihood is a normal distribution, which will allow for negative samples even if the location parameter \(\mu\) has a positive value.

To summarize the above discussion, the priors used in the example are clearly not very realistic given what we might know about finger tapping times for such a button pressing task. This raises the question: what priors should we have chosen? In the next section, we consider this question.

3.4 The influence of priors: sensitivity analysis

For most cases that we will encounter in this book, there are four main classes of priors that we can choose from. In the Bayesian community, there is no fixed nomenclature for classifying different kinds of priors. For this book, we have chosen specific names for each type of prior, but this is just a convention that we follow for consistency. There are also other classes of prior that we do not discuss in this book. An example is improper priors such as \(\mathit{Uniform}(-\infty,+\infty)\), which are not proper probability distributions because the area under the curve does not sum to 1.

When thinking about priors, the reader should not get hung up on what precisely the name is for a particular type of prior; they should rather focus on what that prior means in the context of the research problem.

3.4.1 Flat, uninformative priors

One option is to choose priors that are as uninformative as possible. The idea behind this approach is to let the data “speak for itself” and to not bias the statistical inference with “subjective” priors. There are several issues with this approach: First, the prior is as subjective as the likelihood, and in fact, different choices of likelihood might have a much stronger impact on the posterior than different choices of priors. Second, uninformative priors are in general unrealistic because they give equal weight to all values within the support of the prior distribution, ignoring the fact that usually there is some minimal information about the parameters of interest. Usually, at the very least, the order of magnitude is known (response times or finger tapping times will be in milliseconds and not days, EEG signals some microvolts and not volts, etc.). Third, uninformative priors make the sampling slower and might lead to convergence problems. Unless there is a large amount of data, it would be wise to avoid such priors. Fourth, it is not always clear which parameterization of a given distribution the flat priors should be assigned to. For example, the Normal distribution is sometimes defined based on its standard deviation (\(\sigma\)), variance (\(\sigma^2\)), or precision (\(1/\sigma^2\)): a flat prior for the standard deviation is not flat for the precision of the distribution. Although it is sometimes possible to find an uninformative prior that is uninvariant under a change of parameters (also called Jeffreys priors; Jaynes 2003, sec. 6.15; Jeffreys 1939, Chapter 3), this is not always the case. Finally, if Bayes factors need to be computed, uninformative priors can lead to very misleading conclusions (chapter 15).

In the button-pressing example discussed in this chapter, an example of a flat, uninformative prior would be \(\mu \sim \mathit{Uniform}(-10^{20},10^{20})\). On the millisecond scale, this is a very strange prior to use for a parameter representing mean button-pressing time: it allows for impossibly large positive values, and it also allows negative button-pressing times, which is of course impossible. It is technically possible to use such a prior, but it wouldn’t make much sense.

3.4.2 Regularizing priors

If there does not exist much prior information (and if this information cannot be worked out through reasoning about the problem), and there is enough data (what “enough” means here will presently become clear when we look at specific examples), it is fine to use regularizing priors. These are priors that down-weight extreme values (that is, they provide regularization), they are usually not very informative, and mostly let the likelihood dominate in determining the posteriors. These priors are theory-neutral; that is, they usually do not bias the parameters to values supported by any prior belief or theory. The idea behind this type of prior is to help to stabilize computation. These priors are sometimes called weakly informative or mildly informative priors in the Bayesian literature. For many applications, they perform well, but discussed in chapter 15, they tend to be problematic if Bayes factors need to be computed.

In the button-pressing example, an example of a regularizing prior would be \(\mu \sim \mathit{Normal}_{+}(0,1000)\). This is a Normal distribution prior truncated at \(0\) ms, and allows a relatively constrained range of positive values for button-pressing times (roughly, up to \(2000\) ms or so). This is a regularizing prior because it rules out negative button-pressing times and down-weights extreme values over \(2000\) ms.

3.4.3 Principled priors

The idea here is to have priors that encode all (or most of) the theory-neutral information that the researcher has. Since one generally knows what one’s data do and do not look like, it is possible to build priors that truly reflect the properties of potential data sets, using prior predictive checks. In this book, many examples of this class of priors will come up.

In the button-pressing data, an example of a principled prior would be \(\mu \sim \mathit{Normal}_{+}(250,100)\). This prior is not overly restrictive, but represents a guess about plausible button-pressing times. Prior predictive checks using principled priors should produce realistic distributions of the dependent variable.

3.4.4 Informative priors

There are cases where a lot of prior knowledge exists. In general, unless there are very good reasons for having relatively informative priors (see chapter 15), it is not a good idea to let the priors have too much influence on the posterior. An example where informative priors would be important is when investigating a language-impaired population from which we can’t get many subjects, but a lot of previously published papers exist on the research topic.

In the button-pressing data, an informative prior could be based on a meta-analysis of previously published or existing data, or the result of prior elicitation from an expert (or multiple experts) on the topic under investigation. An example of an informative prior would be \(\mu \sim \mathit{Normal}_{+}(200,20)\). This prior will have some influence on the posterior for \(\mu\), especially when one has relatively sparse data.

These four options constitute a continuum. The uniform prior from the last model (section 3.2.1) falls between flat, uninformative and regularizing priors. In practical data analysis situations, we are mostly going to choose priors that fall between regularizing and principled. Informative priors, in the sense defined above, will be used only relatively rarely; but they become more important to consider when doing Bayes factor analyses (chapter 15).

3.5 Revisiting the button-pressing example with different priors

What would happen if even wider priors were used for the model defined previously (in section 3.2.1)? Suppose that every mean between \(-10^{6}\) and \(10^{6}\) ms is assumed to be equally likely. This prior is clearly unrealistic and actually makes no sense at all: we are not expecting negative finger tapping times. Regarding the standard deviation, one could assume that any value between \(0\) and \(10^{6}\) is equally likely.11 The likelihood remains unchanged.

\[\begin{equation} \begin{aligned} \mu &\sim \mathit{Uniform}(-10^{6}, 10^{6}) \\ \sigma &\sim \mathit{Uniform}(0, 10^{6}) \end{aligned} \tag{3.6} \end{equation}\]

Even with these extremely unrealistic priors, which require us to change the adapt_delta and max_treedepth default values to achieve convergence, the output of the model is virtually identical to the previous one (see Figure 3.7).

## ...
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   168.65      1.34   166.02   171.28 1.00     3939     3345
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    25.00      0.98    23.13    27.04 1.01      568      470
## ...
Comparison of the posterior distributions from the model with extremely unrealistic priors, fit_press_unif, against the previous model with more “realistically” bounded uniform distributions (but still not recommended), fit_press.

FIGURE 3.7: Comparison of the posterior distributions from the model with extremely unrealistic priors, fit_press_unif, against the previous model with more “realistically” bounded uniform distributions (but still not recommended), fit_press.

Next, consider what happens if very informative priors are used. Assume that mean values very close to \(400\) ms are the most likely, and that the standard deviation of the finger tapping times is very close to \(100\) ms. Given that this is a model of button-pressing times, such an informative prior seems wrong—\(200\) ms seems like a more realistic mean button-pressing time, not \(400\) ms. You can check this by doing an experiment yourself and looking at the recorded times; a software like Linger ( makes it easy to set up such an experiment.

The \(\mathit{Normal}_+\) notation indicates a normal distribution truncated at zero such that only positive values are allowed (Box 4.1 discusses this type of distribution in detail). Even though the prior for the standard deviation is restricted to be positive, we are not required to add lb = 0 to the prior, and it is automatically taken into account by brms.

\[\begin{equation} \begin{aligned} \mu &\sim \mathit{Normal}(400, 10) \\ \sigma &\sim \mathit{Normal}_+(100, 10) \end{aligned} \tag{3.7} \end{equation}\]

Despite these unrealistic but informative priors, the likelihood mostly dominates and the new posterior means and credible intervals are just a couple of milliseconds away from the previous estimates:

## ...
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   172.89      1.38   170.22   175.66 1.00     2452     2659
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    26.05      1.03    24.08    28.14 1.00     2791     2418
## ...

As a final example of a sensitivity analysis, choose some principled priors. Assuming that we have some prior experience with previous similar experiments, suppose the mean reaction time is expected to be around \(200\) ms, with a 95% probability of the mean ranging from \(0\) to \(400\) ms. This uncertainty is perhaps unreasonably large, but one might want to allow a bit more uncertainty than one really thinks is reasonable (this kind of conservativity in allowing somewhat more uncertainty is sometimes called Cromwell’s rule in Bayesian statistics; see O’Hagan and Forster 2004, sec. 3.19). In such a case, one can decide on the prior \(\mathit{Normal}(200, 100)\). Given that the experiment involves only one subject and the task is very simple, one might not expect the residual standard deviation \(\sigma\) to be very large: As an example, one can settle on a location of \(50\) ms for a truncated normal distribution, but still allow for relatively large uncertainty: \(Normal_{+}(50,50)\). The prior specifications are summarized below.

\[\begin{equation} \begin{aligned} \mu &\sim \mathit{Normal}(200, 100) \\ \sigma &\sim \mathit{Normal}_+(50, 50) \end{aligned} \end{equation}\]

Why are these priors principled? The designation “principled” here largely depends on our domain knowledge. Chapter 6 discusses how one can use domain knowledge when specifying priors.

One can achieve a better understanding of what a particular set of priors imply by visualizing the priors graphically, and carrying out prior predictive checks. These steps are skipped here, but these issues will be discussed in detail in chapters 6 and 7. These chapters will give more detailed information about choosing priors and on developing a principled workflow for Bayesian data analysis.

The new estimates are virtually the same as before:

## ...
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   168.70      1.32   166.07   171.24 1.00     3533     2548
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    25.01      0.94    23.24    26.94 1.00     3171     2484
## ...

The above examples of using different priors should not be misunderstood to mean that priors never matter. When there is enough data, the likelihood will dominate in determining the posterior distributions. What constitutes “enough” data is also a function of the complexity of the model; as a general rule, more complex models require more data.

Even in cases where there is enough data and the likelihood dominates in determining the posteriors, regularizing, principled priors (i.e., priors that are more consistent with our a priori beliefs about the data) will in general speed-up model convergence.

In order to determine the extent to which the posterior is influenced by the priors, it is a good practice to carry out a sensitivity analysis: try different priors and either verify that the posterior doesn’t change drastically, or report how the posterior is affected by some specific priors (for examples from psycholinguistics, see Vasishth et al. 2013; Vasishth and Engelmann 2022). Chapter 15 will demonstrate that sensitivity analysis becomes crucial for reporting Bayes factors; even in cases where the choice of priors does not affect the posterior distribution, it generally affects the Bayes factor.

3.6 Posterior predictive distribution

The posterior predictive distribution is a collection of data sets generated from the model (the likelihood and the priors). Having obtained the posterior distributions of the parameters after taking into account the data, the posterior distributions can be used to generate future data from the model. In other words, given the posterior distributions of the parameters of the model, the posterior predictive distribution gives us some indication of what future data might look like.

Once the posterior distributions \(p(\boldsymbol{\Theta}\mid \boldsymbol{y})\) are available, the predictions based on these distributions can be generated by integrating out the parameters:

\[\begin{equation} p(\boldsymbol{y_{pred}}\mid \boldsymbol{y} ) = \int_{\boldsymbol{\Theta}} p(\boldsymbol{y_{pred}}, \boldsymbol{\Theta}\mid \boldsymbol{y})\, d\boldsymbol{\Theta}= \int_{\boldsymbol{\Theta}} p(\boldsymbol{y_{pred}}\mid \boldsymbol{\Theta},\boldsymbol{y})p(\boldsymbol{\Theta}\mid \boldsymbol{y})\, d\boldsymbol{\Theta} \end{equation}\]

Assuming that past and future observations are conditionally independent given \(\boldsymbol{\Theta}\), i.e., \(p(\boldsymbol{y_{pred}}\mid \boldsymbol{\Theta},\boldsymbol{y})= p(\boldsymbol{y_{pred}}\mid \boldsymbol{\Theta})\), the above equation can be written as:

\[\begin{equation} p(\boldsymbol{y_{pred}}\mid \boldsymbol{y} )=\int_{\boldsymbol{\Theta}} p(\boldsymbol{y_{pred}}\mid \boldsymbol{\Theta}) p(\boldsymbol{\Theta}\mid \boldsymbol{y})\, d\boldsymbol{\Theta} \tag{3.8} \end{equation}\]

In Equation (3.8), we are conditioning \(\boldsymbol{y_{pred}}\) only on \(\boldsymbol{y}\), we do not condition on what we don’t know (\(\boldsymbol{\Theta}\)); the unknown parameters have been integrated out. This posterior predictive distribution has important differences from predictions obtained with the frequentist approach. The frequentist approach gives a point estimate of each predicted observation given the maximum likelihood estimate of \(\boldsymbol{\Theta}\) (a point value), whereas the Bayesian approach gives a distribution of values for each predicted observation. As with the prior predictive distribution, the integration can be carried out computationally by generating samples from the posterior predictive distribution. The same function that we created before, normal_predictive_distribution(), can be used here. The only difference is that instead of sampling mu and sigma from the priors, the samples come from the posterior.

## # A tibble: 1,444,000 × 3
##    iter trialn t_pred
##   <dbl>  <int>  <dbl>
## 1     1      1   132.
## 2     1      2   128.
## 3     1      3   174.
## # … with 1,443,997 more rows

The brms function posterior_predict() is a convenient function that delivers samples from the posterior predictive distribution. Using the command posterior_predict(fit_press) yields the predicted finger tapping times in a matrix, with the samples as rows and the observations (data-points) as columns. (Bear in mind that if a model is fit with sample_prior = "only", the dependent variable is ignored and posterior_predict will yield samples from the prior predictive distribution).

The posterior predictive distribution can be used to examine the “descriptive adequacy” of the model under consideration (Gelman et al. 2014, Chapter 6; Shiffrin et al. 2008). Examining the posterior predictive distribution to establish descriptive adequacy is called posterior predictive checks. The goal here is to establish that the posterior predictive data look more or less similar to the observed data. Achieving descriptive adequacy means that the current data could have been generated by the model. Although passing a test of descriptive adequacy is not strong evidence in favor of a model, a major failure in descriptive adequacy can be interpreted as strong evidence against a model (Shiffrin et al. 2008). For this reason, comparing the descriptive adequacy of different models is not enough to differentiate between their relative performance. When doing model comparison, it is important to consider the criteria that Roberts and Pashler (2000) define. Although Roberts and Pashler (2000) are more interested in process models and not necessarily Bayesian models, their criteria are important for any kind of model comparison. Their main point is that it is not enough to have a good fit to the data for a model to be convincing. One should check that the range of predictions that the model makes is reasonably constrained; if a model can capture any possible outcome, then the model fit to a particular data set is not so informative. In the Bayesian modeling context, although posterior predictive checking is important, it is only a sanity check to assess whether the model behavior is reasonable (for more on this point, see chapter 7).

In many cases, one can simply use the plot functions from brms (that act as wrappers for bayesplot functions). For example, the plotting function pp_check() takes as arguments the model, the number of predicted data sets, and the type of visualization, and it can display different visualizations of posterior predictive checks. In these type of plots, the observed data are plotted as \(y\) and predicted data as \(y_{rep}\). Below, we use pp_check() to investigate how well the observed distribution of finger tapping times fit our model based on some number (\(11\) and \(100\)) of samples of the posterior predictive distributions (that is, simulated data sets) ; see Figures 3.8 and 3.9.

Histograms of eleven samples from the posterior predictive distribution of the model fit_press (\(y_{rep}\)).

FIGURE 3.8: Histograms of eleven samples from the posterior predictive distribution of the model fit_press (\(y_{rep}\)).

A posterior predictive check that shows the fit of the model fit_press in comparison to data sets from the posterior predictive distribution using an overlay of density plots.

FIGURE 3.9: A posterior predictive check that shows the fit of the model fit_press in comparison to data sets from the posterior predictive distribution using an overlay of density plots.

The data is slightly skewed and has no values smaller than \(100\) ms, but the predictive distributions are centered and symmetrical; see Figures 3.8 and 3.9. This posterior predictive check shows a slight mismatch between the observed and predicted data. Can we build a better model? We’ll come back to this issue in the next section.

3.7 The influence of the likelihood

Finger tapping times (and response times in general) are not usually normally distributed. A more realistic distribution is the log-normal. A random variable (such as time) that is log-normally distributed takes only positive real values and is right-skewed. Although other distributions can also produce data with such properties, the log-normal will turn out to be a pretty reasonable distribution for finger tapping times and response times.

3.7.1 The log-normal likelihood

If \(\boldsymbol{y}\) is log-normally distributed, this means that \(\log(\boldsymbol{y})\) is normally distributed.12 The log-normal distribution is also defined using the parameters location, \(\mu\), and scale, \(\sigma\), but these are on the log ms scale; they correspond to the mean and standard deviation of the logarithm of the data \(\boldsymbol{y}\), \(\log(\boldsymbol{y})\), which will be normally distributed. Thus, when we model some data \(\boldsymbol{y}\) using the log-normal likelihood, the parameters \(\mu\) and \(\sigma\) are on a different scale than the data \(\boldsymbol{y}\). Equation (3.9) shows the relationship between the log-normal and the normal.

\[\begin{equation} \begin{aligned} \log(\boldsymbol{y}) &\sim \mathit{Normal}( \mu, \sigma)\\ \boldsymbol{y} &\sim \mathit{LogNormal}( \mu, \sigma) \end{aligned} \tag{3.9} \end{equation}\]

We can obtain samples from the log-normal distribution, using the normal distribution by first setting an auxiliary variable \(z\), so that \(z = \log(y)\). This means that \(z \sim \mathit{Normal}(\mu, \sigma)\). Then we can just use \(exp(z)\) as samples from the \(\mathit{LogNormal}(\mu, \sigma)\), since \(\exp(z) =\exp(\log(y)) = y\). The code below produces Figure 3.10.

Two log-normal distributions with the same parameters generated by either generating samples from a log-normal distribution or exponentiating samples from a normal distribution.Two log-normal distributions with the same parameters generated by either generating samples from a log-normal distribution or exponentiating samples from a normal distribution.

FIGURE 3.10: Two log-normal distributions with the same parameters generated by either generating samples from a log-normal distribution or exponentiating samples from a normal distribution.

3.7.2 Using a log-normal likelihood to fit data from a single subject pressing a button repeatedly

If we assume that finger tapping times are log-normally distributed, the likelihood function changes as follows:

\[\begin{equation} t_n \sim \mathit{LogNormal}(\mu,\sigma) \end{equation}\]

But now the scale of the priors needs to change! Let’s start with uniform priors for ease of exposition, even though, as we mentioned earlier, these are not really appropriate here. (More realistic priors are discussed below.)

\[\begin{equation} \begin{aligned} \mu &\sim \mathit{Uniform}(0, 11) \\ \sigma &\sim \mathit{Uniform}(0, 1) \\ \end{aligned} \tag{3.10} \end{equation}\]

Because the parameters are on a different scale than the dependent variable, their interpretation changes and it is more complex than if we were dealing with a linear model that assumes a normal likelihood (location and scale do not coincide with the mean and standard deviation of the log-normal):

  • The location \(\mu\): In our previous linear model, \(\mu\) represented the mean (in a normal distribution, the mean happens to be identical to the median and the mode). But now, the mean needs to be calculated by computing \(\exp(\mu +\sigma ^{2}/2)\). In other words, in the log-normal, the mean is dependent on both \(\mu\) and \(\sigma\). The median is just \(\exp(\mu)\). Notice that the prior of \(\mu\) is not on the milliseconds scale, but rather on the log milliseconds scale.
  • The scale \(\sigma\): This is the standard deviation of the normal distribution of \(\log(\boldsymbol{y})\). The standard deviation of a log-normal distribution with location \(\mu\) and scale \(\sigma\) will be \(\exp(\mu +\sigma ^{2}/2)\times \sqrt{\exp(\sigma^2)- 1}\). Unlike the normal distribution, the spread of the log-normal distribution depends on both \(\mu\) and \(\sigma\).

To understand the meaning of the priors on the millisecond scale, both the priors and the likelihood need to be taken into account. Generating a prior predictive distribution will help in interpreting the priors. This distribution can be generated by just exponentiating the samples produced by normal_predictive_distribution() (or, alternatively, edit the function by replacing rnorm() with rlnorm()).

Next, plot the distribution of some representative statistics; see Figure 3.11.

The prior predictive distribution of the mean, median, minimum, and maximum value of the log-normal model with priors defined in Equation (3.10), that is \(\mu \sim \mathit{Uniform}(0, 11)\) and \(\sigma \sim \mathit{Uniform}(0, 1)\). The x-axis is log-transformed.

FIGURE 3.11: The prior predictive distribution of the mean, median, minimum, and maximum value of the log-normal model with priors defined in Equation (3.10), that is \(\mu \sim \mathit{Uniform}(0, 11)\) and \(\sigma \sim \mathit{Uniform}(0, 1)\). The x-axis is log-transformed.

We cannot generate negative values any more, since \(\exp(\)any finite real number\() > 0\). These priors might work in the sense that the model might converge; but it would be better to have regularizing priors for the model. An example of regularizing priors:

\[\begin{equation} \begin{aligned} \mu &\sim \mathit{Normal}(6, 1.5) \\ \sigma &\sim \mathit{Normal}_+(0, 1) \\ \end{aligned} \tag{3.11} \end{equation}\]

The prior for \(\sigma\) here is a truncated distribution, and although its location is zero, this is not its mean. We can calculate its approximate mean from a large number of random samples of the prior distribution using the function rtnorm() from the package extraDistr. In this function, we have to set the parameter a = 0 to express the fact that the normal distribution is truncated from the left at 0. (Box 4.1 discusses this type of distribution in detail):

## [1] 0.795

Even before generating the prior predictive distributions, we can calculate the values within which we are 95% sure that the expected median of the observations will lie. We do this by looking at what happens at two standard deviations away from the mean of the prior, \(\mu\), that is \(6 - 2\times 1.5\) and \(6 + 2\times 1.5\), and exponentiating these values:

##  lower higher 
##   20.1 8103.1

This means that the prior for \(\mu\) is still not too informative (these are medians; the actual values generated by the log-normal distribution can be much more spread out). Next, plot the distribution of some representative statistics of the prior predictive distributions. brms allows one to sample from the priors, ignoring the observed data t , by setting sample_prior = "only" in the brm function.

If we want to use brms to generate prior predictive data in this manner even before we have any data, (at least for now) we do need to have some non-NA values as the dependent variable t. Setting sample_prior = "only" will ignore the data, but we still need to add a data frame in the data = specification in the brm function: In this case, we add a vector of 1 as “data”. The family is specified as lognormal(); recall that in the first example, the family was gaussian().

To avoid the warnings, increase the adapt_delta parameter’s default value from \(0.8\) to \(0.9\) to simulate the data. Since Stan samples from the prior distributions in the same way that it samples from the posterior distribution, one should not ignore warnings; always ensure that the model converged. In that respect, the custom function normal_predictive_distribution() defined in section 3.3 has the advantage that it will always yield independent samples from the prior distribution and will not experience any convergence problems. This is because it just relies on the rnorm() function in R.

Plot the prior predictive distribution of means with the following code (the figure is not produced here, to conserve space). In a prior predictive distribution, we generally want to ignore the data; this requires setting prefix = "ppd" in pp_check().

To plot the distribution of minimum, and maximum values, just replace mean with min, and max respectively. The distributions of the three statistics are displayed in Figure 3.12.

The prior predictive distributions of the mean, maximum, and minimum values of the log-normal model with priors defined in equation (3.11). The prior predictive distributions are labeled \(y_{pred}\). The x-axis shows values back-transformed from the log-scale.

FIGURE 3.12: The prior predictive distributions of the mean, maximum, and minimum values of the log-normal model with priors defined in equation (3.11). The prior predictive distributions are labeled \(y_{pred}\). The x-axis shows values back-transformed from the log-scale.

Figure 3.12 shows that the priors used here are still quite uninformative. The tails of the prior predictive distributions that correspond to our normal priors shown in Figure 3.12 are even further to the right, reaching more extreme values than for the prior predictive distributions generated by uniform priors (shown in Figure 3.11). The new priors are still far from representing our prior knowledge. We could run more iterations of choosing priors and generating prior predictive distributions until we have priors that generate realistic data. However, given that the bulk of the distributions of the mean, maximum, minimum values lies roughly in the correct order of magnitude, these priors are going to be acceptable. In general, summary statistics (e.g., mean, median, min, max) can be used to test whether the priors are in a plausible range. This can be done by defining, for the particular research problem under study, the extreme data that would be very implausible to ever observe (e.g., reading times at a word larger than one minute) and choosing priors such that such extreme finger tapping times occur only very rarely in the prior predictive distribution.

Next, fit the model; recall that both the distribution family and prior change in comparison to the previous example.

When we look at the summary of the posterior, the parameters are on the log-scale:

## ...
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     5.12      0.01     5.10     5.13 1.00     3994     2616
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.13      0.00     0.13     0.15 1.00     3247     2641
## ...

If the research goal is to find out how long it takes to press the space bar in milliseconds, we need to transform the \(\mu\) (or Intercept in the model) to milliseconds. Because the median of the log-normal distribution is \(\exp(\mu)\), the following returns the estimates in milliseconds:

To display the mean and 95% credible interval of these samples, type:

##  mean  2.5% 97.5% 
##   167   165   169

Next, check whether the predicted data sets look similar to the observed data set. See Figure 3.13; compare this with the earlier Figure 3.9.

The posterior predictive distribution of fit_press_ln.

FIGURE 3.13: The posterior predictive distribution of fit_press_ln.

The key question is: Are the posterior predicted data now more similar to the observed data, compared to the case where we had a Normal likelihood? According to Figure 3.13, it seems so, but it’s not easy to tell.

Another way to examine the extent to which the predicted data looks similar to the observed data would be to look at the distribution of some summary statistic. As with prior predictive distributions, examine the distribution of representative summary statistics for the data sets generated by different models. However, in contrast with what occurs with prior predictive distributions, at this point we have a clear reference, our observations, and this means that we can compare the summary statistics with the observed statistics from our data. We suspect that the normal distribution would generate finger tapping times that are too fast (since it’s symmetrical) and that the log-normal distribution may capture the long tail better than the normal model. Based on this supposition, compute the distribution of minimum and maximum values for the posterior predictive distributions, and compare them with the minimum and maximum value respectively in the data. The function pp_check() implements this by specifying stat either "min" or "max" for both fit_press, and fit_press_ln; an example is shown below. The plots are shown in Figures 3.14 and 3.15.

The distributions of minimum values in a posterior predictive check, using the normal and log-normal probability density functions. The minimum in the data is \(110\) ms.The distributions of minimum values in a posterior predictive check, using the normal and log-normal probability density functions. The minimum in the data is \(110\) ms.

FIGURE 3.14: The distributions of minimum values in a posterior predictive check, using the normal and log-normal probability density functions. The minimum in the data is \(110\) ms.

The distributions of maximum values in a posterior predictive check using the normal and log-normal. The maximum in the data is \(409\) ms.The distributions of maximum values in a posterior predictive check using the normal and log-normal. The maximum in the data is \(409\) ms.

FIGURE 3.15: The distributions of maximum values in a posterior predictive check using the normal and log-normal. The maximum in the data is \(409\) ms.

Figure 3.14 shows that the log-normal likelihood does a slightly better job since the minimum value is contained in the bulk of the log-normal distribution and in the tail of the normal one. Figure 3.15 shows that both models are unable to capture the maximum value of the observed data. One explanation for this is that the log-normal-ish observations in our data are being generated by the task of pressing as fast as possible, whereas the observations with long finger tapping times are being generated by lapses of attention. This would mean that two probability distributions are mixed here; modeling this process involves more complex tools that we will take up in chapter 19.

This completes our introduction to brms. We are now ready to learn about more regression models.

3.9 Summary

This chapter showed how to fit and interpret a Bayesian model with a normal likelihood. We looked at the effect of priors by investigating prior predictive distributions and by carrying out a sensitivity analysis. We also looked at the fit of the posterior, by inspecting the posterior predictive distribution (which gives us some idea about the descriptive adequacy of the model). We also showed how to fit a Bayesian model with a log-normal likelihood, and how to compare the predictive accuracy of different models.

3.10 Further reading

Sampling algorithms are discussed in detail in Gamerman and Lopes (2006). Also helpful are the sections on sampling from the short open-source book by Bob Carpenter, Probability and Statistics: a simulation-based introduction (, and the sections on sampling algorithms in Lambert (2018) and Lynch (2007). Introductory linear modeling theory is covered in Dobson and Barnett (2011); more advanced treatments are in Montgomery, Peck, and Vining (2012) and Seber and Lee (2003). Generalized linear models are covered in detail in McCullagh and Nelder (2019). The reader may also benefit from our own freely available online lecture notes on linear modeling:

3.11 Exercises

Exercise 3.1 A simple linear model.

  1. Fit the model fit_press with just a few iterations, say 50 iterations (set warmup to the default of 25, and use four chains). Does the model converge?

  2. Using normal distributions, choose priors that better represent your assumptions/beliefs about finger tapping times. To think about a reasonable set of priors for \(\mu\) and \(\sigma\), you should come up with your own subjective assessment about what you think a reasonable range of values can be for \(\mu\) and how much variability might happen. There is no correct answer here, we’ll discuss priors in depth in chapter 6. Fit this model to the data. Do the posterior distributions change?

Exercise 3.2 Revisiting the button-pressing example with different priors.

  1. Can you come up with very informative priors that influence the posterior in a noticeable way (use normal distributions for priors, not uniform priors)? Again, there are no correct answers here; you may have to try several different priors before you can noticeably influence the posterior.
  2. Generate and plot prior predictive distributions based on this prior and plot them.

  3. Generate posterior predictive distributions based on this prior and plot them.

Exercise 3.3 Posterior predictive checks with a log-normal model.

  1. For the log-normal model fit_press_ln, change the prior of \(\sigma\) so that it is a log-normal distribution with location (\(\mu\)) of \(-2\) and scale (\(\sigma\)) of \(0.5\). What does such a prior imply about your belief regarding button-pressing times in milliseconds? Is it a good prior? Generate and plot prior predictive distributions. Do the new estimates change compared to earlier models when you fit the model?
  2. For the log-normal model, what is the mean (rather than median) time that takes to press the space bar, what is the standard deviation of the finger tapping times in milliseconds?

Exercise 3.4 A skew normal distribution.

Would it make sense to use a “skew normal distribution” instead of the log-normal? The skew normal distribution has three parameters: location \(\xi\) (this is the lower-case version of the Greek letter \(\Xi\), pronounced “chi”, with the “ch” pronounced like the “ch” in “Bach”), scale \(\omega\) (omega), and shape \(\alpha\). The distribution is right skewed if \(\alpha >0\), is left skewed if \(\alpha <0\), and is identical to the regular normal distribution if \(\alpha =0\). For fitting this in brms, one needs to change family and set it to skew_normal(), and add a prior of class = alpha (location remains class = Intercept and scale, class = sigma).

  1. Fit this model with a prior that assigns approximately 95% of the prior probability of alpha to be between 0 and 10.
  2. Generate posterior predictive distributions and compare the posterior distribution of summary statistics of the skew normal with the normal and log-normal.


Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015b. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48.

Bürkner, Paul-Christian. 2019. brms: Bayesian Regression Models Using “Stan”.

Carpenter, Bob, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael J. Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76 (1). Columbia Univ., New York, NY (United States); Harvard Univ., Cambridge, MA (United States).

Dobson, Annette J, and Adrian Barnett. 2011. An Introduction to Generalized Linear Models. CRC press.

Duane, Simon, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. “Hybrid Monte Carlo.” Physics Letters B 195 (2): 216–22.

Gamerman, Dani, and Hedibert F Lopes. 2006. Markov chain Monte Carlo: Stochastic simulation for Bayesian inference. CRC Press.

Ge, Hong, Kai Xu, and Zoubin Ghahramani. 2018. “Turing: A Language for Flexible Probabilistic Inference.” In Proceedings of Machine Learning Research, edited by Amos Storkey and Fernando Perez-Cruz, 84:1682–90. Playa Blanca, Lanzarote, Canary Islands: PMLR.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. Third Edition. Boca Raton, FL: Chapman; Hall/CRC Press.

Gelman, Andrew, Daniel Simpson, and Michael J. Betancourt. 2017. “The Prior Can Often Only Be Understood in the Context of the Likelihood.” Entropy 19 (10): 555.

Goodrich, Ben, Jonah Gabry, Imad Ali, and Sam Brilleman. 2018. “Rstanarm: Bayesian Applied Regression Modeling via Stan.”

Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (1): 1593–1623.

Hubel, Kerry A, Bruce Reed, E William Yund, Timothy J Herron, and David L Woods. 2013. “Computerized Measures of Finger Tapping: Effects of Hand Dominance, Age, and Sex.” Perceptual and Motor Skills 116 (3). SAGE Publications Sage CA: Los Angeles, CA: 929–52.

JASP Team. 2019. “JASP (Version 0.11.1)[Computer software].”

Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge university press.

Jeffreys, Harold. 1939. Theory of Probability. Oxford: Clarendon Press.

Lambert, Ben. 2018. A Student’s Guide to Bayesian Statistics. London, UK: Sage.

Lindgren, Finn, and Håvard Rue. 2015. “Bayesian Spatial Modelling with R-INLA.” Journal of Statistical Software 63 (1): 1–25.

Luce, R Duncan. 1991. Response Times: Their Role in Inferring Elementary Mental Organization. Oxford University Press.

Lunn, D.J., A. Thomas, N. Best, and D. Spiegelhalter. 2000. “WinBUGS-A Bayesian Modelling Framework: Concepts, Structure, and Extensibility.” Statistics and Computing 10 (4). Springer: 325–37.

Lynch, Scott Michael. 2007. Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. New York, NY: Springer.

McCullagh, Peter, and J.A. Nelder. 2019. Generalized Linear Models. Second Edition. Boca Raton, Florida: Chapman; Hall/CRC.

Montgomery, D. C., E. A. Peck, and G. G. Vining. 2012. An Introduction to Linear Regression Analysis. 5th ed. Hoboken, NJ: Wiley.

Neal, Radford M. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo, edited by Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Taylor & Francis.

O’Hagan, Antony, and Jonathan Forster. 2004. “Kendall’s Advanced Theory of Statistics, Vol. 2B: Bayesian Inference.” Wiley.

Plummer, Martin. 2016. “JAGS Version 4.2.0 User Manual.”

Plummer, Martyn. 2022. “Simulation-Based Bayesian Analysis.” Annual Reviews.

Roberts, Seth, and Harold Pashler. 2000. “How Persuasive Is a Good Fit? A Comment on Theory Testing.” Psychological Review 107 (2): 358–67.

Salvatier, John, Thomas V. Wiecki, and Christopher Fonnesbeck. 2016. “Probabilistic Programming in Python Using PyMC3.” PeerJ Computer Science 2 (April). PeerJ: e55.

Seber, George A. F., and Allen J. Lee. 2003. Linear Regression Analysis. 2nd Edition. Hoboken, NJ: John Wiley; Sons.

Shiffrin, Richard, Michael D. Lee, Woojae Kim, and Eric-Jan Wagenmakers. 2008. “A Survey of Model Evaluation Approaches with a Tutorial on Hierarchical Bayesian Methods.” Cognitive Science: A Multidisciplinary Journal 32 (8): 1248–84.

Vasishth, Shravan, Zhong Chen, Qiang Li, and Gueilan Guo. 2013. “Processing Chinese Relative Clauses: Evidence for the Subject-Relative Advantage.” PLoS ONE 8 (10). Public Library of Science: 1–14.

Vasishth, Shravan, and Felix Engelmann. 2022. Sentence Comprehension as a Cognitive Process: A Computational Approach. Cambridge, UK: Cambridge University Press.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2019. “Rank-Normalization, Folding, and Localization: An Improved \(\widehat R\) for Assessing Convergence of MCMC.” arXiv Preprint arXiv:1903.08008.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2019. “Rank-Normalization, Folding, and Localization: An Improved \(\widehat{R}\) for Assessing Convergence of MCMC.”

  1. The Python package PyMC3 and the Julia library Turing are recent exceptions since they are fully integrated into their respective languages.

  2. We refer to the time it takes for a subject to respond or react to a stimuli as response time (response and reaction times are often used interchangeably, cf. Luce 1991). In this case, however, there are no stimuli, and the subject only taps the space bar.

  3. The problem here is that although the parameter for the intercept is assigned a uniform distribution bounded between \(0\) and \(60000\) ms, the sampler might start sampling from an initial value outside this range. The sampler can start from an initial value that is outside the \(0\)-\(60000\) range because the initial value is chosen randomly (unless the user specifies an initial value explicitly).

  4. We’ll see later how to generate prior predictive distributions of statistics such as mean, minimum, or maximum value in section 3.7.2 using brms and pp_check().

  5. Even though, in theory, one could use even wider priors, in practice, these are the widest priors that achieve convergence.

  6. More precisely, \(\log_e(\boldsymbol{y})\) or \(\ln(\boldsymbol{y})\), but we’ll write it as just \(log()\).