# Chapter 7 Workflow

Although modern Bayesian analysis tools (such as `brms`) greatly facilitate Bayesian computations, the model specification is still (as it should be) the responsibility of the user. In the previous chapters (e.g., chapter 3), we have outlined some of the steps needed to arrive at a useful and robust analysis. In this chapter, we bring these ideas together to spell out a principled approach to developing a workflow. This chapter is based on a recent introduction of a principled Bayesian workflow to cognitive science (Schad, Betancourt, and Vasishth 2019; for a revised published version, see Schad, Betancourt, and Vasishth 2020).

Much research has been carried out in recent years to develop tools to ensure robust Bayesian data analyses (e.g., Gabry et al. 2017; Talts et al. 2018). One of the most recent end-products of this research has been the formulation of a principled Bayesian workflow for conducting a probabilistic analysis (Betancourt 2018; Schad, Betancourt, and Vasishth 2019). This workflow provides an initial coherent set of steps to take for a robust analysis, leaving room for further improvements and methodological developments. At an abstract level, parts of this workflow can be applied to any kind of data analysis, be it frequentist or Bayesian, be it based on sampling or on analytic procedures.

Here, we introduce parts of this principled Bayesian workflow by illustrating its use with experimental data from a reading-time experiment. Some parts of this principled Bayesian workflow are specifically recommended when using advanced/non-standard models.

A number of questions should be asked when fitting a model, and several checks should be performed to validate a probabilistic model. Before going into the details of this discussion, we first treat the process of model building, and how different traditions have yielded different approaches to this question.

## 7.1 Model building

One strategy for model building is to start with a minimal model that captures just the phenomenon of interest but not much other structure in the data. For example, this could be a linear model with just the factor or covariate of main interest. For this model, we perform a number of checks described in detail in the following sections. If the model passes all checks and does not show signs of inadequacy, then it can be applied in practice and we can be confident that the model provides reasonably robust inferences on our scientific question. However, if the model shows signs of trouble on one or more of these checks, then the model may need to be improved. Alternatively, we may need to be more modest with respect to our scientific question. For example, in a repeated measures data set, we may be interested in estimating the correlation parameter between by-group adjustments (their random effects correlation) based on a sample of \(30\) subjects. If model analysis reveals that our sample size is not sufficiently large to estimate the correlation reliably, then we may need to either increase our sample size, or give up on our plan.

During the model building process, we make use of an aspirational model \(\mathcal{M}_A\): we mentally imagine a model with all the possible details that the phenomenon and measurement process contain; i.e., we imagine a model that one would fit if there were no limitations in resources, time, mathematical and computational tools, subjects, and so forth. It would contain all systematic effects that might influence the measurement process, such as influences of time or heterogeneity across individuals. The aspirational model should be taken to guide and inform model development; such a procedure prevents random walks in model space during model development. The model has to consider both the latent phenomenon of interest as well as the environment and experiment used to probe it.

In contrast to the aspirational model, the initial model \(\mathcal{M}_1\) may only contain enough structure to incorporate the phenomenon of core scientific interest, but none of the additional aspects/structures relevant for the modeling or measurement. The additional, initially left-out structures, which reflect the difference between the initial (\(\mathcal{M}_1\)) and the aspirational model (\(\mathcal{M}_A\)), can then be probed for using specifically designed summary statistics. These summary statistics can thus inform model expansion from the initial model \(\mathcal{M}_1\) into the direction of the aspirational model \(\mathcal{M}_A\). If the initial model proves inadequate, then the aspirational model and the associated summary statistics guide model development. If the expanded model is still not adequate, then another cycle of model development is conducted.

The range of prior and posterior predictive checks discussed in the following sections serve as a basis for a principled approach to model expansion. The notion of *expansion* is critical here. If an expanded model does not prove more adequate, one can always fall back to the previous model version.

Some researchers suggest an alternative strategy of fitting models with all the group-level variance components (e.g., by-participant and by-items) allowed by the experimental design and a full variance covariance matrix for all the group-level parameters (as we did in section 5.2.5). This type of model is sometimes called a “maximal” model (e.g., Barr et al. 2013). However, this model is maximal within the scope of a linear regression. In section 5.2.6, for example, we saw distributional models, which are more complex than the so-called maximal models. However, a maximal model can provide an alternative starting point for the principled Bayesian workflow. In this case, the focus does not lie so much on model expansion. Instead, for maximal models the workflow can be used to specify priors encoding domain expertise (possibly to ensure computational faithfulness and model sensitivity), and to ensure model adequacy. Some steps in the principled Bayesian workflow (e.g., the computationally more demanding steps of assessing computational faithfulness and model sensitivity) may even be performed only for models coded in Stan or only once for a given research program, where similar designs are repeatedly used. We will explain this in more detail below.

In the maximal model, “maximal” refers to maximal specification of the variance components within the scope of the linear regression approximation, not maximal with respect to the actual data generating process. Models that are bound by the linear regression structure cannot capture effects such as selection bias in the data, dynamical changes in processes across time, or measurement error. Importantly, the “maximal” models are not the aspirational model, which is an image of the true data generating process.

Finally, sometimes the results from the Bayesian workflow will show that our experimental design or data is not sufficient to answer our scientific question at hand. In this case, ambition needs to be reduced, or new data needs to be collected, possibly with a different experimental design more sensitive to the phenomenon of interest.

One important development in open science practices is pre-registration of experimental analyses before the data are collected (Chambers 2019). This can be done using online platforms such as the Open Science Foundation or AsPredicted (but see Szollosi et al. 2020). What information can or should one document in a preregistration of the Bayesian workflow? If one plans on using the maximal model for analysis, then this maximal model, including contrast coding (Schad et al. 2020) and population- and group-level effects (also known as fixed and random effects), should be described. In the case of incremental model building, if a model isn't a good fit to the data, then any resulting inference will be limited if not useless, so a rigid preregistration is of little value unless one knows exactly what the model is. Thus, the deeper issue with preregistration is that a model cannot be confirmed until the phenomenon *and* the experiment are extremely well understood.
One practical possibility is to describe the initial and the aspirational model, and the incremental strategy used to probe the initial model to move more towards the aspirational model. This can also include delineation of summary statistics that one plans to use for probing the tested models. Even if it is difficult to spell out the aspirational model fully, it can be useful to preregister the initial model, summary statistics, and the principles one intends to apply in model selection.
Although the maximal modeling approach clearly reflects confirmatory hypothesis testing, the incremental model building strategy towards the aspirational model may be seen as lying at the boundary between confirmatory and exploratory, and becomes more confirmatory the more clearly the aspirational model can be spelled out a priori.

## 7.2 Principled questions on a model

What characterizes a useful probabilistic model? A useful probabilistic model should be consistent with domain expertise. Moreover, a useful probabilistic model should be rich enough to capture the structure of the true data generating process needed to answer scientific questions. When very complex or non-standard models are developed, there are two additional requirements that must be met (we will briefly touch upon these in the present chapter): it is key for the model to allow accurate posterior approximation, and the model must capture enough of the experimental design to give useful answers to our questions.

So what can we do to ensure that our probabilistic model has these properties? In the following, we outline a number of analysis steps to take and questions to ask in order to improve these properties for our model.

In a first step, we will use prior predictive checks to investigate whether our model is consistent with our domain expertise. Moreover, posterior predictive checks assess model adequacy for the given data set; that is, they investigate whether the model captures the relevant structure of the true data generating process. We will also briefly discuss two additional steps that are computationally very expensive and can be used, e.g., when coding advanced/non-standard models, but which are also part of the principled workflow: investigating computational faithfulness by studying whether posterior estimation is accurate, and studying model sensitivity, i.e., the question of whether we can recover model parameters with the given design and model.

### 7.2.1 Prior predictive checks: Checking consistency with domain expertise

The first key question for checking the model is whether the model and the distributions of prior parameters are consistent with domain expertise. Prior distributions can be selected based on prior research or plausibility. However, for complex models it is often difficult to know which prior distributions should be chosen, and what consequences distributions of prior model parameters have for expected data. A viable solution is to use prior distributions to simulate hypothetical data from the model and to check whether the simulated data are plausible and consistent with domain expertise. This approach is often much easier to judge compared to assessing prior distributions in complex models directly.

In practice, this approach can be implemented by the following steps:

- Take the prior \(p(\boldsymbol{\Theta})\) and randomly draw a parameter set \(\boldsymbol{\Theta_{pred}}\) from it: \(\boldsymbol{\Theta_{pred}} \sim p(\boldsymbol{\Theta})\)
- Use this parameter set \(\boldsymbol{\Theta_{pred}}\) to simulate hypothetical data \(\boldsymbol{y_{pred}}\) from the model: \(\boldsymbol{y_{pred}} \sim p(\boldsymbol{y} \mid \boldsymbol{\Theta_{pred}})\)

To assess whether prior model predictions are consistent with domain expertise, it is useful to compute summary statistics of the simulated data \(t(\boldsymbol{y_{pred}})\). The distribution of these summary statistics can be visualized using, for example, histograms (see Figure 7.1). This can quickly reveal whether the data falls in an expected range, or whether a substantial number of extreme data points are expected a priori. For example, in a study using self-paced reading times, extreme values may be considered to be reading times smaller than \(50\) ms or larger than \(2000\) ms. Reading times for a word larger than \(2000\) ms are not impossible, but would be implausible and largely inconsistent with domain expertise. Experience with reading studies shows that a small number of observations may actually take extreme values. However, if we observe a large number of extreme data points in the hypothetical data, and if these are inconsistent with domain expertise, the priors or the model should be adjusted so that they yield hypothetical data within the range of reasonable values.
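As a minimal sketch of such a summary statistic, the following R code simulates one hypothetical data set of log-normal reading times from illustrative prior draws (the priors and sample size here are made up for illustration, not the chapter's model) and computes the proportion of a priori implausible values using the \(50\) ms and \(2000\) ms cutoffs mentioned above:

```r
set.seed(123)
# Illustrative prior draws (not the chapter's model)
n_obs <- 1000
mu <- rnorm(1, mean = 6, sd = 0.5)          # prior draw for the log-mean
sigma <- abs(rnorm(1, mean = 0, sd = 0.5))  # prior draw for the log-sd
# One prior predictive data set of reading times
rt_pred <- rlnorm(n_obs, meanlog = mu, sdlog = sigma)
# Summary statistic t(y_pred): proportion of a priori implausible values
t_extreme <- mean(rt_pred < 50 | rt_pred > 2000)
```

In a full prior predictive check, this statistic would be computed once per simulated data set and its distribution inspected.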

Choosing good summary statistics is more an art than a science. However, the choice of summary statistics will be crucial, as they provide key markers of what we want the model to account for in the data. They should thus be carefully chosen and designed based on the expectations that we have about the true data generating process and about the kinds of structures and effects we expect the data may exhibit. Interestingly, summary statistics can also be used to critique the model: if someone wants to criticize an analysis, then they can formalize that criticism into a summary statistic they expect to show undesired behavior. Such criticism can serve as a very constructive way to write reviews in a peer-review setting. Here, we will show some examples of useful summary statistics below when discussing data analysis for a concrete example data set.

Choosing good priors will be particularly relevant in cases where the likelihood is not sufficiently informed by the data (see Figure 7.2, in particular g-i). In hierarchical models, for example, this often occurs in cases where a “maximal” model is fitted for a small data set that does not constrain estimation of all group-level effects variance and covariance parameters.^{25}

In such situations, using a prior in a Bayesian analysis (or a more informative prior rather than a relatively uninformative one) should incorporate just enough domain expertise to suppress extreme, although not impossible, parameter values. This may allow the model to be fit, as the posterior is now sufficiently constrained. Thus, introducing prior information in Bayesian computation allows us to fit and interpret models that cannot be validly estimated using frequentist tools.

A welcome side-effect of incorporating more domain expertise (into what still constitutes weakly informative priors) is thus more concentrated prior distributions, which can facilitate Bayesian computation. This allows more complex models to be estimated; that is, using prior knowledge can make it possible to fit models that could otherwise not be estimated using the available tools. In other words, incorporating prior knowledge can allow us to get closer to the aspirational model in the iterative model building procedure. Moreover, more informative priors also lead to faster convergence of MCMC algorithms.

Incorporating more domain expertise into the prior also has crucial consequences for Bayesian modeling when computing Bayes factors (see chapter 15, on Bayes factors).

Importantly, this first step simulates from the *prior predictive distribution*, which specifies how the prior interacts with the likelihood. Mathematically, it computes an average (the integral) over different possible (prior) parameter values. The prior predictive distribution is (also see chapter 3):

\[\begin{equation} \begin{aligned} p(\boldsymbol{y_{pred}}) &= \int p(\boldsymbol{y_{pred}}, \boldsymbol{\Theta}) \; d\boldsymbol{\Theta} = \int p(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta}) p(\boldsymbol{\Theta}) \; d\boldsymbol{\Theta}\\ &= \int \mathrm{likelihood}(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta}) \cdot \mathrm{prior}(\boldsymbol{\Theta}) \; d\boldsymbol{\Theta} \end{aligned} \end{equation}\]

As a concrete example, suppose we assume that our likelihood is a Normal distribution with mean \(\mu\) and standard deviation \(\sigma\). Suppose that we now define the following priors on the parameters: \(\mu \sim \mathit{Normal}(0,1)\), and \(\sigma \sim \mathit{Uniform}(1,2)\). We can generate the prior predictive distribution using the following steps:

- Do the following 100000 times:
    - Take one sample m from a Normal(0,1) distribution.
    - Take one sample s from a Uniform(1,2) distribution.
    - Generate and save a data point from Normal(m,s).
- The generated data are the prior predictive distribution.

More complex generative processes involving repeated measures data can also be defined.
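The steps above can be written out directly in R:

```r
set.seed(1)
n_sim <- 100000
m <- rnorm(n_sim, mean = 0, sd = 1)       # draws of mu from its prior
s <- runif(n_sim, min = 1, max = 2)       # draws of sigma from its prior
y_pred <- rnorm(n_sim, mean = m, sd = s)  # one data point per (m, s) pair
# y_pred approximates the prior predictive distribution
```

Because each data point is generated from its own prior draw, the result averages over the prior, exactly as in the integral above.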

### 7.2.2 Computational faithfulness: Testing for correct posterior approximations

Approximations of posterior expectations can be inaccurate. For example, a computer program that is designed to sample from a posterior can be erroneous. This could involve an error in the specification of the likelihood (e.g., caused by an error in the R syntax formula), or insufficient sampling of the full density of the posterior. The sampler may be biased, sampling parameter values that are larger or smaller than the true posterior, or the variance of the posterior samples may be larger or smaller than the true posterior uncertainty. However, posterior sampling from simple and standard models should work properly in most cases. Thus, we think that in many applications, a further check of computational faithfulness may be asking for too much, and might need to be performed only once for a given research program, where different experiments are rather similar to each other. However, checking computational faithfulness can become an important issue when dealing with more advanced/non-standard models (such as those discussed in the later chapters of this book). Here, errors in the specification of the likelihood can occur more easily.

Given that posterior approximations can be inaccurate, it is important to design a procedure to test whether the posterior approximation of choice is indeed accurate, e.g., that the software used to implement the sampling works without errors for the specific problem at hand. This checking can be performed using simulation-based calibration (SBC; Talts et al. 2018; Schad, Betancourt, and Vasishth 2019). This is a very simulation-intensive procedure, which can take a long time to run for considerably complex models and larger data sets. We do not discuss SBC in detail here, but refer the reader to its later treatment in chapter 18, where SBC is applied for models coded in Stan directly, as well as to the description in Schad, Betancourt, and Vasishth (2019).

Assuming that our posterior computations are accurate and faithful, we can take a next step, namely looking at the sensitivity of the model analyses.

### 7.2.3 Model sensitivity

What can we realistically expect from the posterior of a model, and how can we check whether these expectations are justified for the current setup? First, we might expect that the posterior recovers the true parameters generating the data without bias. That is, when we simulate hypothetical data based on a true parameter value, we may expect that the posterior mean is close to the true value. However, for a given model, experimental design, and data set, this expectation may or may not be justified. Indeed, parameter estimation for some, e.g., non-linear, models may be biased, such that the true value of the parameter can practically not be recovered from the data. At the same time, we might expect from the posterior that it is highly informative with respect to the parameters that generated the data. That is, we may hope for small posterior uncertainty (a small posterior standard deviation) relative to our prior knowledge. However, posterior certainty may sometimes be low. Some experimental designs, models, or data sets may yield highly uninformative estimates, where uncertainty is not reduced compared to our prior information. This can be the case when we have very little data, or when the experimental design does not allow us to constrain certain model parameters; i.e., the model is too complex for the experimental design.

To study model sensitivity, one can investigate two questions about the model:

- How well does the estimated posterior mean match the true simulating parameter?
- How much is uncertainty reduced from the prior to the posterior?

To investigate these questions, it is again possible to perform extensive simulation studies. This is crucial to do for complex, non-standard, or cognitive models, but may be less important for simpler and more standard models. Indeed, the same set of simulations can be used that are also used in SBC. Therefore, both analyses can be usefully applied in tandem. Again, here we skip the details of how these computations can be implemented, and refer the interested reader to Schad, Betancourt, and Vasishth (2019).
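Both questions can be illustrated with a deliberately simple sketch: a normal model with known residual standard deviation, where the posterior is available in closed form (all numbers here are illustrative, and this is much simpler than the simulation studies described in Schad, Betancourt, and Vasishth 2019):

```r
set.seed(2)
# Simulate data from a known "true" parameter value
beta_true <- 0.5
n <- 50
sigma <- 1  # residual sd, assumed known for this sketch
y <- rnorm(n, mean = beta_true, sd = sigma)
# Prior: beta ~ Normal(0, 1); conjugate posterior for a normal mean
prior_var <- 1
post_var <- 1 / (1 / prior_var + n / sigma^2)
post_mean <- post_var * sum(y) / sigma^2
# Question 1: recovery -- is the posterior mean close to the true value?
recovery_error <- post_mean - beta_true
# Question 2: contraction -- how much is prior uncertainty reduced?
contraction <- 1 - post_var / prior_var
```

Repeating this over many simulated data sets (and many true parameter values drawn from the prior) gives the distributions of recovery error and posterior contraction used to assess model sensitivity.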

### 7.2.4 Posterior predictive checks: Does the model adequately capture the data?

“*All models are wrong but some are useful.*” (Box 1979, 2). We know that our model probably does not fully capture the true data generating process, which is noisily reflected in the observed data. Our question therefore is whether our model is close enough to the true process that has generated the data, and whether the model is useful for informing our scientific question. To compare the model to the true data generating process (i.e., to the data), we can simulate data from the model and compare the simulated to the real data. This can be formulated via a posterior predictive distribution (see chapter 3): the model is fit to the data, and the estimated posterior model parameters are used to simulate new data.

Mathematically, the posterior predictive distribution is written:

\[\begin{equation} p(\boldsymbol{y_{pred}} \mid \boldsymbol{y}) = \int p(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta}) p(\boldsymbol{\Theta} \mid \boldsymbol{y}) \; d \boldsymbol{\Theta} \end{equation}\]

Here, the observed data \(\boldsymbol{y}\) is used to infer the posterior distribution over model parameters, \(p(\boldsymbol{\Theta} \mid \boldsymbol{y})\). This is combined with the model or likelihood function, \(p(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta})\), to yield new, now simulated, data, \(\boldsymbol{y_{pred}}\). The integral \(\int d \boldsymbol{\Theta}\) indicates averaging across different possible values for the posterior model parameters (\(\boldsymbol{\Theta}\)).

As mentioned in chapter 3, we can’t evaluate this integral exactly: \(\boldsymbol{\Theta}\) can be a vector of many parameters, making this a very complicated integral with no analytical solution. However, we can approximate it using sampling. We can now use each of the posterior samples as parameters to simulate new data from the model. This procedure then approximates the integral and yields an approximation to the posterior predictive distribution.

To summarize, in the posterior predictive distribution, the model is fit to the data, and the estimated posterior model parameters are used to simulate new data. Critically, the question then is how close the simulated data is to the observed data.

One approach is to use features of the data that we care about, and to test how well the model can capture these features. Indeed, we had already defined summary statistics in the prior predictive checks. We can now compute these summary statistics for the data simulated from the posterior predictive distribution. This will yield a distribution for each summary statistic. In addition, we compute the summary statistic for the observed data, and can now check whether the data falls within the distribution of the model predictions (cf. Figure 7.3a), or whether the model predictions are far from the observed data (see Figure 7.3b). If the observed data is similar to the posterior-predicted data, then this supports model adequacy. If we observe a large discrepancy, then this indicates that our model likely is missing some important structure of the true process that has generated the data, and that we have to use our domain expertise to further improve the model. Alternatively, a large discrepancy can be due to the data being an extreme observation, which was nevertheless generated by the process captured in our model. In general, we can’t discriminate between these two possibilities. Consequently, we have to use our best judgement as to which possibility is more relevant, in particular changing the model only if the discrepancy is consistent with a known missing model feature.
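A minimal sketch of this logic, using a normal model with known standard deviation so that posterior samples can be drawn directly (all numbers are illustrative; in practice the posterior samples would come from the fitted model):

```r
set.seed(3)
# "Observed" data and a normal model with known sd (illustrative values)
y <- rnorm(40, mean = 250, sd = 30)
n <- length(y)
sigma <- 30
# Conjugate posterior for the mean under a Normal(200, 100^2) prior
post_var <- 1 / (1 / 100^2 + n / sigma^2)
post_mean <- post_var * (200 / 100^2 + sum(y) / sigma^2)
# Simulate one replicated data set per posterior draw
n_draws <- 2000
mu_post <- rnorm(n_draws, post_mean, sqrt(post_var))
t_pred <- sapply(mu_post, function(m) mean(rnorm(n, m, sigma)))
# Compare the observed summary statistic to its predictive distribution
t_obs <- mean(y)
p_stat <- mean(t_pred > t_obs)  # values near 0 or 1 signal a discrepancy
```

Here the summary statistic is the mean; in practice one would use the statistics chosen during the prior predictive checks.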

## 7.3 Exemplary data analysis

We perform an exemplary analysis of a data set from Gibson and Wu (2013), a data set that we have already encountered in previous chapters. The methodology they used is the familiar self-paced reading method that we encountered in earlier chapters. Gibson and Wu (2013) collected self-paced reading data using Chinese relative clauses. Relative clauses are sentences like: *The student who praised the teacher was very happy*. Here, the head noun, *student*, is modified by the relative clause *who praised the teacher*, and the head noun is the subject of the relative clause as well: the student praised the teacher. Such relative clauses are called subject relatives. By contrast, one can also have object relative clauses, where the head noun is modified by a relative clause which takes the head noun as an object. An example is: *The student whom the teacher praised was very happy*. Here, the teacher praised the student.
Gibson and Wu (2013) were interested in testing the hypothesis that Chinese shows an object relative (OR) processing advantage compared to the corresponding subject relative (SR). The theoretical reason for this processing advantage is that, in Chinese, the distance (which can be approximated by counting the number of intervening words) between the relative clause verb (*praised*) and the head noun is shorter in ORs than in SRs. This prediction arises because, unlike in English, the relative clause appears before the head noun in Chinese; see Gibson and Wu (2013) for a detailed explanation.

Their experimental design had one factor with two levels: (i) object relative sentences, and (ii) subject relative sentences. We use sum coding (-1, +1) for this factor, which we call `so`, an abbreviation for subject-object. Following Gibson and Wu (2013), we analyze reading time on the target word, which was the head noun of the relative clause. As mentioned above, in Chinese, the head noun appears after the relative clause. By the time the subject reads the head noun, they already know whether they are reading a subject or an object relative. Because the distance between the relative clause verb and the head noun is shorter in Chinese object relatives compared to subject relatives, reading the head noun is expected to be easier in object relatives.

The data set contains reading time measurements in milliseconds from \(37\) subjects and from \(15\) items (there were \(16\) items originally, but one item was removed during analysis). The design is a classic repeated measures Latin square design.

### 7.3.1 Prior predictive checks

The first step in Bayesian data analysis is to specify the statistical model and the priors for the model parameters. As a statistical model, we use what is called the maximal model (Barr et al. 2013) for the design. Such a model includes population-level effects for the intercept and the slope (coded using sum contrast coding: \(+1\) for object relatives, and \(-1\) for subject relatives), correlated group-level intercepts and slopes for subjects, and correlated group-level intercepts and slopes for items. We define the likelihood as follows:

\[\begin{equation} rt_n \sim \mathit{LogNormal}(\alpha + u_{subj[n],1} + w_{item[n],1} + so_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}), \sigma) \end{equation}\]

In `brms` syntax, the model would be specified as follows: `rt ~ so + (so | subj) + (so | item), family = lognormal()`.

Because we assume a possible correlation between group-level (or random) effects, the adjustments to the intercept and slope for subjects and items have multivariate (in this case, bivariate) normal distributions with means zero, as in Equation (7.1).

\[\begin{equation} \begin{aligned} {\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix}} &\sim {\mathcal {N}} \left( {\begin{pmatrix} 0\\ 0 \end{pmatrix}} ,\boldsymbol{\Sigma_u} \right) \\ {\begin{pmatrix} w_{j,1} \\ w_{j,2} \end{pmatrix}} &\sim {\mathcal {N}} \left( {\begin{pmatrix} 0\\ 0 \end{pmatrix}} ,\boldsymbol{\Sigma_w} \right) \end{aligned} \tag{7.1} \end{equation}\]
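To make this generative model concrete, the following sketch simulates one data set from the likelihood above with hand-picked, purely illustrative parameter values (the condition assignment is a simplification of the actual Latin square, and the same covariance matrix is reused for subjects and items for brevity):

```r
library(MASS)  # for mvrnorm()
set.seed(4)
n_subj <- 37
n_item <- 15
# Illustrative "true" values on the log-ms scale
alpha <- 6; beta <- -0.05; sigma <- 0.5
tau <- c(0.1, 0.05); rho <- 0.5
Sigma_u <- matrix(c(tau[1]^2, rho * tau[1] * tau[2],
                    rho * tau[1] * tau[2], tau[2]^2), nrow = 2)
u <- mvrnorm(n_subj, mu = c(0, 0), Sigma = Sigma_u)  # by-subject adjustments
w <- mvrnorm(n_item, mu = c(0, 0), Sigma = Sigma_u)  # by-item adjustments
# One observation per subject-item pair; so is the +1/-1 contrast
d <- expand.grid(subj = 1:n_subj, item = 1:n_item)
d$so <- ifelse(d$item %% 2 == 0, 1, -1)
mu_log <- alpha + u[d$subj, 1] + w[d$item, 1] +
  d$so * (beta + u[d$subj, 2] + w[d$item, 2])
d$rt <- rlnorm(nrow(d), meanlog = mu_log, sdlog = sigma)
```

Each term in `mu_log` corresponds one-to-one to a term in the likelihood equation above.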

One possible standard setup for (relatively) uninformative priors which is sometimes used in reading studies (e.g., Paape, Nicenboim, and Vasishth 2017; Vasishth, Mertzen, Jäger, et al. 2018b) is as follows:

\[\begin{align} \alpha &\sim \mathit{Normal}(0, 10)\\ \beta &\sim \mathit{Normal}(0, 1)\\ \sigma &\sim \mathit{Normal}_+(0, 1)\\ \tau_{\{1,2\}} &\sim \mathit{Normal}_+(0, 1)\\ \rho &\sim \mathit{LKJ}(2) \end{align}\]

We define these priors in `brms` syntax as follows:

```
priors <- c(
  prior(normal(0, 10), class = Intercept),
  prior(normal(0, 1), class = b, coef = so),
  prior(normal(0, 1), class = sd),
  prior(normal(0, 1), class = sigma),
  prior(lkj(2), class = cor)
)
```

For the intercept (\(\alpha\)) we use a normal distribution with mean \(0\) and standard deviation \(10\). This is on the log scale as we assume a log-normal distribution of reading times. That is, this approach assumes a priori that the intercept for reading times varies between \(0\) seconds and (one standard deviation) \(exp(10) = 22026\) ms (i.e., \(22\) sec) or (two standard deviations) \(exp(20) = 485165195\) ms (i.e., \(135\) hours). Going from seconds to hours within one standard deviation shows how uninformative this prior is.

Moreover, for the effect of linguistic manipulations on reading times (\(\beta\)), one common standard prior is to assume a mean of \(0\) and a standard deviation of \(1\) (also on the log scale). The prior on the effect size on log scale is a multiplicative factor, that is, the prediction for the effect size depends on the intercept. For an intercept of \(exp(6) = 403\) ms, a variation to one standard deviation above multiplies the base effect by \(2.71\), increasing the mean from \(403\) to \(exp(6) \times exp(1) = 1097\). Likewise a variation to one standard deviation below multiplies the base effect by \(1 / 2.71\), decreasing the mean from \(403\) to \(exp(6) \times exp(-1) = 148\). This effect size is strongly changed when assuming a different intercept: for a slightly smaller value for the intercept of \(exp(5) = 148\) ms, the expected condition difference is reduced to \(37\)% (\(349\) ms), and for a slightly larger value for the intercept of \(exp(7) = 1097\) ms, the condition difference is enhanced to \(272\)% (\(2578\) ms). Also see Box 4.3 in chapter 4 for an explanation about the non-linear behavior of the log-normal model. Even though it seems \(\mathit{Normal}(0,1)\) is not entirely appropriate as a prior for the difference between object-relative and subject-relative sentences (i.e., the slope), we use it for illustrative purposes. We use the same \(\mathit{Normal}(0, 1)\) prior for the \(\tau\) parameter and \(\sigma\). Finally, for the group-level effects correlation between the intercept and the slope, we use an LKJ prior (Lewandowski, Kurowicka, and Joe 2009) with a relatively uninformative/regularizing prior parameter value of \(2\) (for visualization of the prior see Figure 7.4).
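The multiplicative arithmetic above can be checked directly in R:

```r
exp(6)                   # baseline intercept: ~403 ms
exp(6) * exp(1)          # one SD of beta above: ~1097 ms
exp(6) * exp(-1)         # one SD of beta below: ~148 ms
# Condition differences (from -1 SD to +1 SD of beta) at three intercepts:
exp(6 + 1) - exp(6 - 1)  # ~948 ms at intercept exp(6)
exp(5 + 1) - exp(5 - 1)  # ~349 ms at intercept exp(5)
exp(7 + 1) - exp(7 - 1)  # ~2578 ms at intercept exp(7)
```

This makes the non-linear behavior visible: the same prior on the slope implies very different condition differences in milliseconds depending on the intercept.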

For the prior predictive checks, we use these priors to draw random parameter sets from the distributions, and to simulate hypothetical data using the statistical model. We load the data and code the predictor variable `so`. We next use `brms` to simulate prior predictive data from the hierarchical model.

```
data("df_gibsonwu")
df_gibsonwu <- df_gibsonwu %>%
  mutate(so = ifelse(type == "obj-ext", 1, -1))
fit_prior_gibsonwu <- brm(rt ~ so + (so | subj) + (so | item),
  data = df_gibsonwu,
  family = lognormal(),
  prior = priors,
  sample_prior = "only"
)
```

Based on the simulated data we can now perform prior predictive checks: we compute summary statistics, and plot the distributions of the summary statistic across simulated data sets. First, we visualize the distribution of the simulated data. For a single data set, this could be visualized as a histogram. Here, we have a large number of simulated data sets, and thus a large number of histograms. We represent this uncertainty: for each bin, we plot the median as well as quantiles showing where 10%-90%, 20%-80%, 30%-70%, and 40%-60% of the histograms lie (for R code, see Schad, Betancourt, and Vasishth 2019). For the current prior data simulations, this shows (see Figure 7.5) that most of the hypothetical reading times are close to zero or larger than \(2000\) ms. It is immediately clear that the data predicted by this prior follows a very implausible distribution: it looks exponential; we would expect a log-normal distribution for reading times. Most data points take on extreme values.

As an additional summary statistic, in Figure 7.6 we take a look at the mean per simulated data set and also at the standard deviation. We create functions that use the mean and the standard deviation as summary statistics; the functions collapse all values over \(2000\) ms, making the figure more readable. Then we use these summary statistics to visualize prior predictive checks using the `pp_check()` function, where we enter the prior predictions (`fit_prior_gibsonwu`) and the relevant summary statistic (`mean_2000` and `sd_2000`).

```
# Summary statistics, capped at 2000 ms for readability:
mean_2000 <- function(x) {
  tmp <- mean(x)
  tmp[tmp > 2000] <- 2000
  tmp
}
sd_2000 <- function(x) {
  tmp <- sd(x)
  tmp[tmp > 2000] <- 2000
  tmp
}
fig_pri_1b <- pp_check(fit_prior_gibsonwu,
  type = "stat",
  stat = "mean_2000",
  prefix = "ppd"
) +
  labs(x = "Mean RT [ms]") +
  theme(legend.position = "none")
fig_pri_1b
fig_pri_1c <- pp_check(fit_prior_gibsonwu,
  type = "stat",
  stat = "sd_2000",
  prefix = "ppd"
) +
  labs(x = "Standard Deviation RT [ms]") +
  theme(legend.position = "none")
fig_pri_1c
```

The results, displayed in Figure 7.6, show that the mean (or the standard deviation) varies across a wide range, with a substantial number of data sets having a mean (or the standard deviation) larger than \(2000\) ms. Again, this reveals a highly implausible assumption about the intercept parameter.

Moreover, we also plot the difference between object relative and subject relative sentences as a measure of effect size with the following code. We first compute the summary statistic, here the difference in mean reading times between object versus subject relative sentences (i.e., based on the variable `so`). Then we set difference values larger than \(2000\) ms to \(2000\) ms, and values smaller than \(-2000\) ms to \(-2000\) ms, because these represent implausibly large/small values; this makes the plot easier to read. We plot these prior predictive data using `pp_check` by supplying the prior samples (`fit_prior_gibsonwu`) and the summary statistic (`effsize_2000`).

```
effsize_2000 <- function(x) {
  tmp <- mean(x[df_gibsonwu$so == +1]) -
    mean(x[df_gibsonwu$so == -1])
  tmp[tmp > +2000] <- +2000
  tmp[tmp < -2000] <- -2000
  tmp
}
fig_pri_1d <- pp_check(fit_prior_gibsonwu,
  type = "stat",
  stat = "effsize_2000",
  prefix = "ppd"
) +
  labs(x = "Object - Subject [S-O RT]") +
  theme(legend.position = "none")
fig_pri_1d
```

The results in Figure 7.7 show that our priors commonly assume differences in reading times between conditions of more than \(2000\) ms, which are larger than we would expect for a psycholinguistic manipulation of the kind investigated here. More specifically, given that we model reading times using a log-normal distribution, the expected effect size depends on the value for the intercept. To take an extreme example, for an intercept of \(exp(1) = 2.7\) ms and an effect size in log space of \(1\) (i.e., one standard deviation of the prior for the effect size), expected reading times for the two conditions are \(exp(1-1) = 1\) ms and \(exp(1+1) = 7\) ms. By contrast, for an intercept of \(exp(10) = 22026\) ms the corresponding reading times for the two conditions would be \(exp(10-1) = 8103\) ms and \(exp(10+1) = 59874\) ms.
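The dependence of the effect size on the intercept can be reproduced directly; in this sketch, the \(\pm 1\) corresponds to one prior standard deviation of the slope on the log scale:

```
alpha <- c(1, 6, 10)  # three hypothetical intercepts on the log scale
round(exp(alpha - 1)) # condition means (ms) one SD below: 1, 148, 8103
round(exp(alpha + 1)) # condition means (ms) one SD above: 7, 1097, 59874
```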

This implies highly variable expectations for the effect size, including the possibility of very large effect sizes. If there is good reason to believe that the effect size is likely to be relatively small, priors with smaller expected effect sizes may be more reasonable. In chapter 6 we discussed different methods for working out ballpark estimates of the range of variation in expected effect sizes in reading studies on relative clause processing.

It is also useful to look at individual-level differences in the effect of object versus subject relatives. Figure 7.8 shows the subject with the largest (absolute) difference in reading times between object versus subject relatives.

Here, we first assign the prior simulated reading times to the data frame `df_gibsonwu` (terming the variable `rtfake`). Then we group the data frame by subject and by experimental condition (`so`), average the prior predictive reading times for each subject and condition, and compute the difference in reading times between object versus subject relative sentences for each subject (the subset where `so == 1` minus the subset where `so == -1`). Now, we can take the absolute value of this difference (`abs(tmp$dif)`), and take the maximum or the standard deviation across all subjects. Again, we set values larger than \(2000\) ms to \(2000\) ms for better visibility.

```
effsize_max_2000 <- function(x) {
  df_gibsonwu$rtfake <- x
  tmp <- df_gibsonwu %>%
    group_by(subj, so) %>%
    summarize(rtfake = mean(rtfake)) %>%
    # calculates the difference between conditions:
    summarize(dif = rtfake[so == 1] - rtfake[so == -1])
  effect_size_max <- max(abs(tmp$dif), na.rm = TRUE)
  effect_size_max[effect_size_max > 2000] <- 2000
  effect_size_max
}
effsize_sd_2000 <- function(x) {
  df_gibsonwu$rtfake <- x
  tmp <- df_gibsonwu %>%
    group_by(subj, so) %>%
    summarize(rtfake = mean(rtfake)) %>%
    summarize(dif = rtfake[so == 1] - rtfake[so == -1])
  effect_size_SD <- sd(tmp$dif, na.rm = TRUE)
  effect_size_SD[effect_size_SD > 2000] <- 2000
  effect_size_SD
}
fig_pri_1e <- pp_check(fit_prior_gibsonwu,
  type = "stat",
  stat = "effsize_max_2000",
  prefix = "ppd"
) +
  labs(x = "Max Effect Size [S-O RT]") +
  theme(legend.position = "none")
fig_pri_1e
fig_pri_1f <- pp_check(fit_prior_gibsonwu,
  type = "stat",
  stat = "effsize_sd_2000",
  prefix = "ppd"
) +
  labs(x = "SD Effect Size [S-O RT]") +
  theme(legend.position = "none")
fig_pri_1f
```

The prior simulations in Figure 7.8 commonly show maximal effect sizes larger than \(2000\) ms, which is more than we would expect in observed data; similarly, the variance in hypothetical effect sizes is large, with many SDs larger than \(2000\) ms, and thus again takes many values that are inconsistent with our domain expertise about reading experiments.

### 7.3.2 Adjusting priors

Based on these analyses of prior predictive data, we can next use our domain expertise to refine our priors and adjust them to values for which we expect more plausible prior predictive hypothetical data as captured in the summary statistics.

First, we adapt the intercept; recall that in chapter 6 we made a first attempt at coming up with priors for the intercept; now we take our reasoning one step further. Given our prior knowledge about mean reading times (see the discussion in the previous chapter), we could choose a normal distribution in log-space with a mean of \(6\). This corresponds to an expected grand average reading time of \(exp(6) = 403\) ms. For the standard deviation, we use a value of SD = \(0.6\). For these prior values, we expect a strongly reduced mean reading time and a strongly reduced residual standard deviation in the simulated hypothetical data. Moreover, implausibly small or large values for reading times should no longer be expected. For a visualization of the prior distribution of the intercept parameter in log-space and in ms-space, see Figure 7.9a+b. Other values for the standard deviation close to \(0.6\) (e.g., SD = \(0.5\) or \(0.7\)) may yield similar results. Our goal is not to specify a precise value, but rather to use prior parameter values that are qualitatively in line with our domain expertise about expected observed reading time data, and that do not produce highly implausible hypothetical data.

Next, for the effect of object minus subject relative sentences, we define a normally distributed prior with mean \(0\) and a much smaller standard deviation of \(0.05\). Again, we do not have precise information on the specific value for the standard deviation, but as we saw in chapter 6, we have some understanding of the range of variation seen in reading studies involving relative clauses. We expect a generally smaller effect size (see the meta-analysis in chapter 6), and we can check through prior predictive checks (data simulation and investigation of summary statistics) whether this yields a plausible pattern of expected results. Figures 7.9c+d show expected effects in log-scale and in ms-scale for a simple linear regression example.
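As an informal check on these choices (this code is not from the text), one can look at the \(95\)% intervals that the \(\mathit{Normal}(6, 0.6)\) intercept prior and the \(\mathit{Normal}(0, 0.05)\) slope prior imply on the millisecond scale:

```
# 95% interval of the intercept prior, back-transformed to ms:
exp(qnorm(c(.025, .975), mean = 6, sd = 0.6))   # roughly 124 to 1307 ms
# 95% interval of the slope prior, as a multiplicative factor:
exp(qnorm(c(.025, .975), mean = 0, sd = 0.05))  # roughly 0.91 to 1.10
```

That is, a priori the condition difference stays within about \(\pm 10\)% of the average reading time, which is in the ballpark suggested by the meta-analysis in chapter 6.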

In addition, we assume much smaller values for the standard deviations in how the intercept and the slope vary across subjects and across items of \(0.1\), and a smaller residual standard deviation of \(0.5\). Our expectation for the correlation between group-level effects is unchanged. In summary, we settle on the following prior specification:

```
priors2 <- c(
  prior(normal(6, 0.6), class = Intercept),
  prior(normal(0, 0.05), class = b, coef = so),
  prior(normal(0, 0.1), class = sd),
  prior(normal(0, 0.5), class = sigma),
  prior(lkj(2), class = cor)
)
```

#### 7.3.2.1 Prior predictive checks after increasing the informativity of the priors

We have now adjusted the priors, increasing their informativity. These priors are still not very informative, but they are somewhat principled since they include some of the theory-neutral information that we have. Based on this new set of now weakly informative priors, we can again perform prior predictive checks as we did before. The new prior predictive checks are shown in Figure 7.10.

Figure 7.10a shows that now the distribution over histograms of the data looks much more reasonable, i.e., more like what we would expect for a histogram of observed data. Very small values for reading times are now rare, and not heavily inflated any more. Moreover, extremely large values for reading times larger than \(2000\) ms are rather unlikely.

We also take a look at the hypothetical average reading times (Figure 7.10b), and find that our expectations are now much more reasonable. We expect average reading times of around \(exp(6) = 403\) ms. Most of the expected average reading times lie between \(50\) ms and \(1500\) ms, and only relatively few extreme values beyond these numbers are observed. The standard deviations of reading times are also in a much more reasonable range (see Figure 7.10d), with only very few values larger than the extreme value of \(2000\) ms.

As a next step, we look at the expected effect size (OR minus SR) in the hypothetical data (Figure 7.10c). Extreme values of larger or smaller than \(2000\) ms are now very rare, and most of the absolute values of expected effect sizes are smaller than \(200\) ms. More specifically, we also check the maximal effect size among all subjects (Figure 7.10e). Most of the distribution centers below a value of \(1000\) ms, reflecting a more plausible range of expected values. Likewise, the standard deviation of the psycholinguistically interesting effect size now rarely takes values larger than \(500\) ms (Figure 7.10f), reflecting more realistic a priori assumptions than in our initial (relatively) uninformative prior.

### 7.3.3 Computational faithfulness and model sensitivity

The next formal steps in the principled Bayesian workflow are to investigate computational faithfulness (using simulation-based calibration; SBC) and model sensitivity. These allow the researcher to determine whether the posterior is estimated accurately for the given problem. Moreover, model sensitivity can be used to test whether parameter estimates are unbiased and whether anything can be learned by sampling data under the given design. Computational faithfulness (i.e., accurate posterior estimation) and model sensitivity need to be checked for non-standard and more complex models, but for simpler/standard models they may be performed only once for a research program in which experimental designs and models are similar across studies. These steps are computationally very expensive, and can take a very long time to run for realistic data sets and models. For details on how to implement them, we refer the interested reader to Schad, Betancourt, and Vasishth (2019) and Betancourt (2018). In chapter 18, we discuss a simplified version inspired by SBC for more advanced custom models implemented directly in Stan.
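To give a flavor of the SBC logic without the computational cost, here is a self-contained toy version (not from the text): a conjugate normal model with known residual standard deviation, where the posterior is available in closed form. We repeatedly draw a parameter from the prior, simulate data from it, and record the rank of the true parameter among draws from the corresponding posterior; if the posterior computation is faithful, these ranks are uniformly distributed.

```
set.seed(1)
n_sims <- 500; n_obs <- 20; n_draws <- 100
sigma <- 1           # known residual SD
mu0 <- 0; tau0 <- 1  # prior: mu ~ Normal(mu0, tau0)
ranks <- numeric(n_sims)
for (s in 1:n_sims) {
  mu_true <- rnorm(1, mu0, tau0)     # 1. draw parameter from the prior
  y <- rnorm(n_obs, mu_true, sigma)  # 2. simulate data from that parameter
  # 3. exact conjugate posterior for mu:
  post_var <- 1 / (1 / tau0^2 + n_obs / sigma^2)
  post_mean <- post_var * (mu0 / tau0^2 + sum(y) / sigma^2)
  post_draws <- rnorm(n_draws, post_mean, sqrt(post_var))
  # 4. rank of the true parameter among the posterior draws:
  ranks[s] <- sum(post_draws < mu_true)
}
hist(ranks, breaks = 20)  # should look approximately uniform
```

In the full workflow, the analytic posterior in step 3 is replaced by MCMC sampling from the actual model, which is what makes SBC so computationally demanding.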

### 7.3.4 Posterior predictive checks: Model adequacy

Having examined the prior predictive data in detail, we can now take the observed data and perform posterior inference on it. We start by fitting a maximal `brms` model to the observed data.

```
fit_gibsonwu <- brm(rt ~ so + (1 + so | subj) + (1 + so | item),
  data = df_gibsonwu,
  family = lognormal(),
  prior = priors2
)
```

One could examine the posteriors from this model; we skip this step for brevity, but the reader is encouraged to run the above code and examine the posterior summaries.

Figure 7.11 shows the posterior distribution for the slope parameter, which estimates the difference in reading times between object minus subject relative sentences.

```
mcmc_hist(fit_gibsonwu, pars = c("b_so")) +
  labs(
    x = "Object - subject relatives",
    title = "Posterior distribution"
  )
```


Figure 7.11 shows that reading times in object relative sentences tend to be somewhat shorter than in subject relative sentences (\(P(\beta<0) = 0.88\)); this is as predicted by Gibson and Wu (2013). However, given the wide \(95\)% credible intervals, it is difficult to rule out the possibility that there is effectively no difference in reading time between the two conditions without doing model comparison (with Bayes factors or cross-validation).
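The posterior probability \(P(\beta<0)\) can be computed from the posterior draws of the fitted model; a sketch (`as_draws_df()` comes from the posterior package, which `brms` re-exports):

```
draws <- as_draws_df(fit_gibsonwu)
# proportion of posterior draws in which the slope is negative:
mean(draws$b_so < 0)
```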

To assess model adequacy, we perform posterior predictive checks. We simulate data based on posterior samples of parameters. This then allows us to investigate the simulated data by computing the summary statistics that we used in the prior predictive checks, and by comparing model predictions with the observed data.
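Since the summary-statistic functions (`mean_2000`, `sd_2000`, `effsize_2000`, and so on) were already defined for the prior predictive checks, the posterior predictive versions can be obtained the same way; a sketch for one of the panels of Figure 7.12 (without `prefix = "ppd"`, `pp_check` uses the posterior predictive distribution):

```
# posterior predictive check on the mean reading time
pp_check(fit_gibsonwu,
  type = "stat",
  stat = "mean_2000"
) +
  labs(x = "Mean RT [ms]")
```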

The results from these analyses show that the log-normal distribution (see Figure 7.12a) provides an approximation to the distribution of the data. However, although the fit looks reasonable, the model's predictions still deviate systematically from the data. This deviation suggests that a constant offset may be needed in addition to the log-normal distribution. This can be implemented in `brms` by replacing the family specification `family = lognormal()` with the shifted version `family = shifted_lognormal()`, and motivates another round of model validation (see exercise 12.1, and also chapter 20, which deals with a log-normal race model using Stan; see Nicenboim, Logačev, et al. 2016 and Rouder 2005 for a discussion of shifted log-normal models).

Next, for the other summary statistics, we first look at the distribution of means. The posterior predictive means capture the mean reading time in the observed data (i.e., the vertical line in Figure 7.12b); the data are not perfectly captured, but still lie within the model's predictive distribution. For the standard deviation, we can see that the model posterior assumes too little variation and thus does not capture the standard deviation of the data well (Figure 7.12d). Figure 7.12c shows the effect size of object minus subject relative sentences predicted by the model (histogram) and observed in the data (vertical line). Here, posterior model predictions for the effect are in line with the empirical data, although the model mostly predicts smaller effect sizes than the data show. For the biggest effect among all subjects (Figure 7.12e), the model captures the data reasonably well. For the standard deviation of the effect across subjects (Figure 7.12f), the variation in the model is again a bit too small considering the variation in the data. Potentially, the lack of agreement between data and posterior predictive distributions is due to a mismatch between the true distribution of the data and the log-normal likelihood that we assumed. This could be checked by running the model again using a shifted log-normal instead of a log-normal likelihood.
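This follow-up fit can be sketched as follows (it is not run in the text; note that the shift parameter `ndt` of the shifted log-normal would receive the `brms` default prior unless a prior for it is added to `priors2`):

```
fit_gibsonwu_shift <- brm(rt ~ so + (1 + so | subj) + (1 + so | item),
  data = df_gibsonwu,
  family = shifted_lognormal(),
  prior = priors2
)
```

The full workflow (prior predictive checks, fitting, posterior predictive checks) would then be repeated for this model.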

## 7.4 Summary

In this chapter, we have introduced key questions to ask about a model and the inference process as discussed by Betancourt (2018) and by Schad, Betancourt, and Vasishth (2019), and have applied these to a data set from an experiment involving a typical repeated measures design used in cognitive psychology and psycholinguistics. Prior predictive checks using analyses of simulated prior data suggest that, compared to previous applications in reading experiments (e.g., Nicenboim and Vasishth 2018), far more informative priors can and should be used. We demonstrated that including such additional domain knowledge in the priors leads to more plausible expected data. Moreover, incorporating more informative priors should also speed up the sampling process. These more informative priors, however, may not alter posterior inferences much for the present design. Posterior predictive checks showed weak support for our statistical model, as the model only partially recovered the tested summary statistics. This may reflect a misfit of the likelihood, i.e., the data may be better explained by a shifted log-normal distribution than by a log-normal distribution. For inference on whether reading times differ between Chinese object versus subject relative sentences, a Bayes factor analysis would be needed to compare a null model assuming no effect to an alternative model assuming a difference between object versus subject relative sentences. See Vasishth, Yadav, et al. (2022) for more discussion.

In summary, this analysis provides an example and tutorial for using a principled Bayesian workflow (Betancourt 2018; Schad, Betancourt, and Vasishth 2019) in cognitive science experiments. The workflow reveals useful information about which (weakly informative) priors to use, and performs checks of the used inference procedures and the statistical model. The workflow provides a robust foundation for using a statistical model to answer scientific questions, and will be useful for researchers developing analysis plans as part of pre-registrations, registered reports, or simply as preparatory design analyses prior to conducting an experiment.

## 7.5 Further reading

Some important articles relating to developing a principled Bayesian workflow are by Betancourt (2018), Gabry et al. (2017), Gelman et al. (2020), and Talts et al. (2018). The `stantargets` R package provides tools for a systematic, efficient, and reproducible workflow (Landau 2021). Also recommended is the article on reproducible workflows by Wilson et al. (2017).

### References

Barr, Dale J, Roger Levy, Christoph Scheepers, and Harry J Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” *Journal of Memory and Language* 68 (3). Elsevier: 255–78.

Betancourt, Michael J. 2018. “Towards a Principled Bayesian Workflow.” https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html.

Box, George EP. 1979. “Robustness in the Strategy of Scientific Model Building.” In *Robustness in Statistics*, 201–36. Elsevier.

Chambers, Chris. 2019. *The Seven Deadly Sins of Psychology: A Manifesto for Reforming the Culture of Scientific Practice*. Princeton University Press.

Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael J. Betancourt, and Andrew Gelman. 2017. “Visualization in Bayesian Workflow.” *arXiv Preprint arXiv:1709.01449*.

Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020. “Bayesian Workflow.” *arXiv Preprint arXiv:2011.01808*.

Gibson, Edward, and H-H Iris Wu. 2013. “Processing Chinese Relative Clauses in Context.” *Language and Cognitive Processes* 28 (1-2). Taylor & Francis: 125–55.

Landau, William Michael. 2021. “The Stantargets R Package: A Workflow Framework for Efficient Reproducible Stan-Powered Bayesian Data Analysis Pipelines.” *Journal of Open Source Software* 6 (60): 3193. https://doi.org/10.21105/joss.03193.

Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” *Journal of Multivariate Analysis* 100 (9): 1989–2001.

Nicenboim, Bruno, Pavel Logačev, Carolina Gattei, and Shravan Vasishth. 2016. “When High-Capacity Readers Slow down and Low-Capacity Readers Speed up: Working Memory and Locality Effects.” *Frontiers in Psychology* 7 (280). https://doi.org/10.3389/fpsyg.2016.00280.

Nicenboim, Bruno, and Shravan Vasishth. 2018. “Models of Retrieval in Sentence Comprehension: A Computational Evaluation Using Bayesian Hierarchical Modeling.” *Journal of Memory and Language* 99: 1–34. https://doi.org/10.1016/j.jml.2017.08.004.

Paape, Dario, Bruno Nicenboim, and Shravan Vasishth. 2017. “Does Antecedent Complexity Affect Ellipsis Processing? An Empirical Investigation.” *Glossa: A Journal of General Linguistics* 2 (1).

Rouder, Jeffrey N. 2005. “Are Unshifted Distributional Models Appropriate for Response Time?” *Psychometrika* 70 (2). Springer Science + Business Media: 377–81. https://doi.org/10.1007/s11336-005-1297-7.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled Bayesian Workflow in Cognitive Science.” *Psychological Methods* 26 (1). American Psychological Association: 103–26.

Schad, Daniel J., Michael Betancourt, and Shravan Vasishth. 2019. “Toward a Principled Bayesian Workflow in Cognitive Science.” arXiv. https://doi.org/10.48550/ARXIV.1904.12765.

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2019. “How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” *Journal of Memory and Language* 110. https://doi.org/10.1016/j.jml.2019.104038.


Szollosi, Aba, David Kellen, Danielle J Navarro, Richard Shiffrin, Iris van Rooij, Trisha Van Zandt, and Chris Donkin. 2020. “Is Preregistration Worthwhile?” *Trends in Cognitive Sciences* 24 (2). Elsevier: 94–95.

Talts, Sean, Michael J. Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” *arXiv Preprint arXiv:1804.06788*.

Vasishth, Shravan, Daniela Mertzen, Lena A Jäger, and Andrew Gelman. 2018b. “The Statistical Significance Filter Leads to Overoptimistic Expectations of Replicability.” *Journal of Memory and Language* 103: 151–75.

Vasishth, Shravan, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. 2022. “Sample Size Determination for Bayesian Hierarchical Models Commonly Used in Psycholinguistics.” *Computational Brain and Behavior*.

Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K Teal. 2017. “Good Enough Practices in Scientific Computing.” *PLoS Computational Biology* 13 (6). Public Library of Science San Francisco, CA USA: e1005510.

In frequentist methods (such as the `lmer` function in the lme4 package), this problem manifests itself as convergence problems in the optimizer, which indicate that the likelihood is too flat and that the parameter estimates are not constrained by the data.↩