# Chapter 7 Workflow

Although modern Bayesian analysis tools (such as `brms`

) greatly facilitate Bayesian computations, model specification is still (as it should be) the responsibility of the user. Chapter 3 is one of the earlier chapters where some of the steps required to arrive at a robust and useful analysis were described. In this chapter, these ideas are brought together to spell out a principled approach to developing a workflow. This chapter is an abbreviated version a recent introduction of a principled Bayesian workflow to cognitive science (Schad, Betancourt, and Vasishth 2019, for a revised published version see 2020).

A lot of research has been done recently in order to create tools that guarantee reliable Bayesian data analyses (see, for example, Gabry et al. 2017; Talts et al. 2018). The development of a principled Bayesian workflow for performing a probabilistic analysis is one of the most recent outcomes of this research (Betancourt 2018; Schad, Betancourt, and Vasishth 2019). This process leaves space for future advancements in methodology and offers a logical first set of steps to take for a robust analysis. Parts of this workflow can, in principle, be applied to any type of data analysis, whether frequentist or Bayesian, whether sampling-based or based on analytic procedures.

In this chapter, we discuss some aspects of the principled Bayesian workflow. Certain components of this workflow are particularly useful when working with advanced or non-standard models.

When fitting a model, it is important to ask several questions and perform various checks to validate a probabilistic model. Before delving into the details of this discussion, we first examine the process of model building and how different traditions have led to different approaches to these questions.

## 7.1 Building a model

An effective approach to model building is to begin with a minimal model that captures only the phenomenon of interest, without incorporating much other structure in the data. For instance, this could be a linear model that includes only the factor or covariate of primary interest. This model is then subjected to a series of checks, which are described in detail in the following sections. If the model passes all of these checks and does not exhibit any signs of inadequacy, it can be applied to the problem at hand with confidence, knowing that it provides reasonably robust inferences for the scientific question that needs to be answered. However, if the model fails one or more of these checks, the model may need to be improved; in addition, even the scientific question may need to be changed. For example, in a repeated measures data set, we may use a sample of \(30\) subjects with the aim to estimate the correlation parameter between by-group adjustments (their random effects correlation). If the analysis shows that the sample is not large enough to reliably estimate the correlation term, the sample size may need to be increased, or the plan to investigate any correlation may need to be abandoned.

To guide and inform model development, initially an aspirational model \(\mathcal{M}_A\) is specified. This model is an idea that encompasses every aspect of the phenomenon and the measurement procedure, as if time, money, subjects, computational and mathematical tools, and other resources were all infinite. It accounts for all systematic effects that might influence the measurement process, such as influences of time or heterogeneity across individuals. By using this model as a starting point, random walks in model space can be avoided during the development of the model. The model has to capture both the latent phenomenon of interest and also the environment and experiment used to probe it.

The initial model \(\mathcal{M}_1\) is designed to incorporate only the phenomenon of core scientific interest, without including any additional aspects or structures relevant for modeling or measurement. This is in contrast to the aspirational model \(\mathcal{M}_A\), which includes all possible details (of course, within reason) of the phenomenon and measurement process. The difference between these two models can be probed using specifically designed summary statistics. These summary statistics can inform model expansion from the initial model \(\mathcal{M}_1\) towards the aspirational model \(\mathcal{M}_A\). If the initial model turns out to be inadequate, then the aspirational model, together with the relevant summary statistics, inform model development. In case the expanded model still shows problems in model checking, then model development is continued with another cycle of development.

In the following sections, prior and posterior predictive checks are discussed briefly, because they provide a foundation for a principled approach to model expansion. Critically, model development is best built up via *expansion*. In the case that an expanded model turns out to not be a better description of the data, it’s always possible to go back to a previous, simpler, version of the model.

An alternative strategy for model fitting is proposed by some researchers, whereby the model contains all group-level variance components (e.g., by-participant and by-items) that are allowed by the experimental design, as well as a full variance-covariance matrix for all group-level parameters. A commonly used name for this model, especially in psychology and psycholinguistics, is the “maximal” model (e.g., Barr et al. 2013). However, this model can be seen as “maximal” only in the framework of a linear model. For example, section 5.2.6 treated distributional models, which are already beyond the scope of the maximal models in the sense of Barr et al. (2013). Nevertheless, a maximal model can provide an alternative starting point for the principled Bayesian workflow. Here, model expansion is not the focus. Instead, if the maximal model approach is taken, the workflow that we discuss here can be useful for specifying priors encoding domain expertise (which may help to ensure computational faithfulness and model sensitivity) and to ensure model adequacy. In the principled Bayesian workflow, it may be reasonable for some steps (especially computationally demanding steps such as testing for computational faithfulness and model sensitivity) to be executed only one time for a given series of related studies (with similar designs) or only if models are coded in Stan.

The term “maximal” in a maximal model as used by Barr et al. (2013) does not refer to the actual process of generating data; rather, it refers to the maximal specification of the variance components within the parameters of the linear regression approximation. Models constrained by the linear regression structure are unable to account for factors like measurement error, dynamic changes in processes over time, or selection bias in the data. Crucially, the aspirational model–a representation of the actual data-generating process–is not the “maximal” model.

Last but not least, occasionally the outcomes of the Bayesian workflow will demonstrate that the data or experimental design used is insufficient to address the specific scientific question at hand. In this instance, either the level of ambition must be lowered or fresh data must be gathered, maybe using an alternative experimental design that is more sensitive to the relevant phenomenon.

Pre-registration of experimental analyses prior to data collection is a significant development in open science practices (Chambers 2019). Online resources like AsPredicted (https://aspredicted.org/) and the Open Science Foundation (https://osf.io/) can be used for this (but see Szollosi et al. 2020). What details should or can be recorded during the Bayesian workflow’s preregistration? The population- and group-level effects (also referred to as fixed and random effects) and contrast coding (Schad, Vasishth, et al. 2020) should be described if the maximal model is going to be used for analysis.
Rigid preregistration is meaningless in the context of incremental model building unless one knows precisely what the model is, as any subsequent inference will be limited, if not useless, if the model isn’t appropriate for the data at hand. Preregistration’s deeper problem is that a model cannot be validated until the phenomenon *and* experiment are thoroughly understood.
Defining the initial and aspirational models as well as the incremental strategy used to probe the initial model in order to move it closer to the aspirational model is one feasible option. Delineating summary statistics that one intends to use to probe the tested models can also fall under this category.
It can be helpful to preregister the initial model, laying out the summary statistics that will be investigated, and the principles one plans to use in model selection; this is a helpful step even though it can be challenging, or indeed impossible, to fully define the aspirational model. The incremental model building strategy towards the aspirational model may be seen as lying at the boundary between confirmatory and exploratory, and becomes more confirmatory the more clearly the aspirational model can be spelled out a priori.

## 7.2 Principled questions to ask on a model

What qualities are key for a useful probabilistic model? A first quality is consistency with domain expertise. Furthermore, in order to effectively address scientific inquiries, a probabilistic model must possess sufficient richness to accurately represent the structure of the actual data generation process. Two additional requirements must be met when developing very complex or non-standard models (which we will touch on briefly in this chapter): the model must allow accurate posterior approximation and it must capture enough of the experimental design to provide meaningful insights into our research questions.

In order to meet these requirements for our probabilistic model, what can we do? We will go over the several analysis steps and questions to ask.

We will first examine whether our model is consistent with our domain expertise using prior predictive checks. Additionally, posterior predictive checks evaluate whether the model adequately captures the relevant structure of the actual data-generating process for the given data set. We will also touch on two more, computationally costly steps that are part of the principled workflow and can be used, for example, when coding complex or non-standard models: this includes examining (a) model sensitivity, and (b) the question of whether we can recover model parameters with the provided design, including checks of computational faithfulness, by examining whether posterior estimation is accurate.

### 7.2.1 Checking whether assumptions are consist with domain expertise: Prior predictive checks

When investigating the model, it is crucial to first determine whether the model and the prior parameter distributions are in line with domain knowledge. Prior distributions may be chosen on the basis of previous studies or practicality. It is frequently challenging to determine which prior distributions to use for complex models, as well as the effects that distributions of prior model parameters have on the a priori expected data. One workable solution is to simulate artificial data from the model using prior distributions, and then verify if the simulated data make sense and align with domain knowledge. When compared to directly evaluating prior distributions in complex models, this (simulation-based) method is frequently far simpler to judge.

To put this strategy into action, take the following actions:

- Using the prior \(p(\boldsymbol{\Theta})\), draw a parameter set \(\boldsymbol{\Theta_{pred}}\) from it via random sampling: \(\boldsymbol{\Theta_{pred}} \sim p(\boldsymbol{\Theta})\)
- Based on this parameter set \(\boldsymbol{\Theta_{pred}}\), simulate artificial data \(\boldsymbol{y_{pred}}\) from the model: \(\boldsymbol{y_{pred}} \sim p(\boldsymbol{y} \mid \boldsymbol{\Theta_{pred}})\)

It is helpful to compute summary statistics of the simulated data \(t(\boldsymbol{y_{pred}})\) in order to evaluate whether previous model predictions are consistent with domain expertise. Histograms can be used to display the distribution of these summary statistics (see Figure 7.1). This may rapidly show whether the data falls within an expected range or whether a sizable number of extreme data points are expected a priori. Extreme values, for instance, might be reading times less than \(50\) ms or more than \(2000\) ms in a study utilizing self-paced reading times. While reading times longer than \(2000\) ms for a word are not impossible, they are unlikely and largely at odds with domain knowledge. Reading research has demonstrated that a tiny percentage of observations may truly take extreme values. However, if we find a substantial number of extreme data points in the hypothetical data and if these are at odds with domain expertise, then the model or the priors should be changed to produce hypothetical data that falls within the range of acceptable values.

Selecting quality summary statistics is more art than science. However, the choice of summary statistics will be important since they offer important indicators of the information that we want the model to take into account. As a result, they ought to be carefully selected and created in accordance with our expectations regarding the actual process of data generation as well as the kinds of structures and effects we expect the data will display.

One can use summary statistics to criticize the model as well. For example, one can formalize criticism of an analysis into a summary statistic that one believes will exhibit undesirable behavior. Writing reviews in a peer-review setting can benefit greatly from such kind of criticism. When we talk about data analysis for a specific example data set, we will illustrate some examples of helpful summary statistics below.

Selecting appropriate priors will be especially important when the data does not provide enough information to determine the likelihood (see Figure 7.2, especially g-i). This frequently happens, for instance, in hierarchical models when a “maximal” model is fitted for a small data set that does not constrain the estimation of variance and covariance parameters for all group-level effects.^{28}

In such cases, a prior in a Bayesian analysis (or a more informative one instead of a relatively uninformative one) should incorporate just enough domain expertise to suppress extreme but not impossible parameter values. Since the posterior is now sufficiently constrained, it may now be possible to fit the model. Therefore, by incorporating prior knowledge into Bayesian computation, we can fit and understand models that frequentist tools are unable to reliably estimate.

Thus, more concentrated prior distributions are a welcome side-effect of adding more domain expertise (into what still constitutes weakly informative priors), which can help with Bayesian computation. This makes it possible to estimate more complex models; in other words, models that would not otherwise be able to be estimated with the tools at hand, can be fitted thanks to the use of prior knowledge. Put another way, by utilizing prior knowledge, the iterative model-building process can help us approach the aspirational model better. Moreover, MCMC algorithms will converge more quickly once additional informative priors are provided.

When computing Bayes factors, adding more domain expertise to the prior has important implications for Bayesian modeling (see chapter 15, on Bayes factors).

Crucially, the *prior predictive distribution*, which describes the interaction between the prior and the likelihood, can be used to simulate prior predictive data. It calculates an integral, or average, over various possible (prior) parameter values mathematically. As previously mentioned (also refer to chapter 3), the prior predictive distribution is:

\[\begin{equation} \begin{aligned} p(\boldsymbol{y_{pred}}) &= \int p(\boldsymbol{y_{pred}}, \boldsymbol{\Theta}) \; d\boldsymbol{\Theta} = \int p(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta}) p(\boldsymbol{\Theta}) \; d\boldsymbol{\Theta}\\ &= \int \mathrm{likelihood}(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta}) \cdot \mathrm{prior}(\boldsymbol{\Theta}) \; d\boldsymbol{\Theta} \end{aligned} \end{equation}\]

As an illustration, let’s say that we consider our likelihood to be a Normal distribution with mean \(\mu\) and standard deviation \(\sigma\). Assume that we now define \(\sigma \sim \mathit{Uniform}(1,2)\) and \(\mu \sim \mathit{Normal}(0,1)\) as priors on the parameters. The steps below can be used to create the prior predictive distribution:

- Perform the following 100,000 times:
- Select one sample (m) from the distribution Normal(0,1).
- Take one sample (s) from a Uniform(1,2) distribution
- Create and store a data point from a Normal(m,s) distribution.

- The prior predictive distribution is the generated data.

It is also possible to define more intricate generative processes involving data from repeated measures.

### 7.2.2 Testing for correct posterior approximations: Checks of computational faithfulness

Approximations of posterior expectations can be inaccurate. For example, a computer program that is designed to sample from a posterior can be erroneous. This may be due to an error in the likelihood specification (for example, an error in the R syntax formula) or to an inadequate sampling of the posterior’s entire density. The sampler may be biased, sampling parameter values that are larger or smaller than the true posterior, or the variance of the posterior samples may be larger or smaller than the true posterior uncertainty. However, posterior sampling from simple and standard models should work properly in most cases. Thus, we think that in many applications, a further check of computational faithfulness may be asking for too much, and might need to be performed only once for a given research program, where different experiments are rather similar to each other. However, checking computational faithfulness can become an important issue when dealing with more advanced/non-standard models (such as those discussed in the later chapters of this book). Here, errors in the specification of the likelihood can occur more easily.

Designing a process to verify whether the posterior approximation of choice is accurate is crucial because posterior approximations can be erroneous. For example, one should make sure that the software utilized to implement the sampling is error-free for the particular problem at hand. This checking can be performed using simulation-based calibration (SBC; Talts et al. 2018; Schad, Betancourt, and Vasishth 2019). This is a very simulation-intensive procedure, which can take a long time to run for considerably complex models and larger data sets. We do not discuss SBC in detail here, but refer the reader to its later treatment in chapter 18, where SBC is applied for models coded in Stan directly, as well as to the description in Schad, Betancourt, and Vasishth (2019).

After confirming the accuracy and faithfulness of our posterior computations, we can move on to examine the model analyses’ sensitivity.

### 7.2.3 Sensitivity of the model

What can we reasonably expect from a model’s posterior, and how can we determine whether these expectations are reasonable given the current configuration? First, we might expect that the data is generated without bias as the posterior recovers the true parameters. In other words, we could anticipate that the posterior mean will be near to the true value when simulating hypothetical data based on a true parameter value. This expectation, however, might or might not be warranted for a particular model, experimental setup, and data set. In fact, some models–such as non-linear models–may have biased parameter estimates, making it nearly impossible to determine the parameter’s true value from the data. Simultaneously, we could expect that the posterior is very informative concerning the parameters that produced the data. In other words, in comparison to our past knowledge, we could aim for a small posterior uncertainty, or a small posterior standard deviation. But posterior certainty isn’t always high. When compared to our past knowledge, some experimental designs, models, or data sets may produce extremely uninformative estimates where the degree of uncertainty is not decreased. This may occur when there is a dearth of data or when the model is too complex for the experimental design, preventing us from constraining specific model parameters.

In order to examine model sensitivity, two model-related questions can be looked into:

- How closely does the true simulating parameter match the estimated posterior mean?
- To what extent is uncertainty reduced between the posterior and the prior?

To investigate these questions, it is again possible to perform extensive simulation studies. This is crucial to do for complex, non-standard, or cognitive models, but may be less important for simpler and more standard models. Indeed, the same set of simulations can be used that are also used in SBC. Therefore, both analyses can be usefully applied in tandem. Again, here we skip the details of how these computations can be implemented, and refer the interested reader to Schad, Betancourt, and Vasishth (2019).

### 7.2.4 Does the model adequately capture the data?–Posterior predictive checks

“*All models are wrong but some are useful.*” (Box 1979, 2).

We are aware that the observed data noisily reflects the true data generating process, which our model most likely does not fully capture. Therefore, we want to know if our model accurately approximates the true process that produced the data and if it helps us formulate our scientific question. We can simulate data from the model and compare the simulated to the real data in order to compare the model to the actual data generating process (i.e., to the data). A posterior predictive distribution (refer to chapter 3) can be used to formulate this: the model is fitted to the data, and new data is simulated using the estimated posterior model parameters.

The posterior predictive distribution can be expressed mathematically as follows:

\[\begin{equation} p(\boldsymbol{y_{pred}} \mid \boldsymbol{y}) = \int p(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta}) p(\boldsymbol{\Theta} \mid \boldsymbol{y}) \; d \boldsymbol{\Theta} \end{equation}\]

Here, the posterior distribution over model parameters, \(p(\boldsymbol{\Theta} \mid \boldsymbol{y})\), is inferred from the observed data, \(\boldsymbol{y}\). To produce new, simulated data, \(\boldsymbol{y_{pred}}\), this is combined with the likelihood function, \(p(\boldsymbol{y_{pred}} \mid \boldsymbol{\Theta})\). Averaging over various possible values for the posterior model parameters (\(\boldsymbol{\Theta}\)) is indicated by the integral \(\int d \boldsymbol{\Theta}\).

As stated in chapter 3, we are unable to evaluate this integral exactly. Since \(\boldsymbol{\Theta}\) can be a vector with multiple parameters, this integral is extremely complex and lacks an analytical solution. On the other hand, sampling allows us to approximate it. We can first obtain samples from the parameter posterior. Next, we can use these posterior samples to simulate new, artificial data from the model. This process approximates the posterior predictive distribution (which also approximates the integral).

In summary, we first fit the model to the data to obtain the posterior, and then simulate new data using the estimated posterior model parameters. The crucial question is then how closely the new simulated data resembles the observed data.

One strategy for comparing the data and the model is to use important features from the data and gauging the model’s ability to capture them. In fact, in the prior predictive checks, we had already defined summary statistics. Now that we have the data simulated from the posterior predictive distribution, we can compute these summary statistics. Every summary statistic will then have a distribution. We also compute the summary statistic for the observed data. We can now determine whether the observed data falls within the distribution of the model predictions (see Figure 7.3a) or whether the model predictions deviate significantly from the observed data (see Figure 7.3b).

Model adequacy is supported if the observed data matches the posterior-predicted data. A significant disparity suggests that our model is probably missing some crucial details from the actual process that produced the data, and we will need to apply our domain knowledge to enhance the model further. On the other hand, a significant disparity might result from the data being an extreme observation that was nonetheless produced by the process that our model was able to capture. Generally speaking, we are unable to distinguish between these two possibilities. As a result, we must decide which possibility is more important using our best judgment. Specifically, we ought to modify the model exclusively in situations where the disparity aligns with a recognized absent model feature.

## 7.3 Further reading

Some important articles relating to developing a principled Bayesian workflow are by Betancourt (2018), Gabry et al. (2017), Gelman et al. (2020), and Talts et al. (2018). The `stantargets`

R package provides tools for a systematic, efficient, and reproducible workflow (Landau 2021). Also recommended is the article on reproducible workflows by Wilson et al. (2017).

### References

Barr, Dale J., Roger P. Levy, Christoph Scheepers, and Harry J. Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” *Journal of Memory and Language* 68 (3): 255–78.

Betancourt, Michael J. 2018. “Towards a Principled Bayesian Workflow.” https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html.

Box, George E. P. 1979. “Robustness in the Strategy of Scientific Model Building.” In *Robustness in Statistics*, 201–36. Elsevier.

Chambers, Chris. 2019. *The Seven Deadly Sins of Psychology: A Manifesto for Reforming the Culture of Scientific Practice*. Princeton University Press.

Gabry, Jonah, Daniel P. Simpson, Aki Vehtari, Michael J. Betancourt, and Andrew Gelman. 2017. “Visualization in Bayesian Workflow.” *arXiv Preprint arXiv:1709.01449*.

Gelman, Andrew, Aki Vehtari, Daniel P. Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020. “Bayesian Workflow.” *arXiv Preprint arXiv:2011.01808*.

Landau, William Michael. 2021. “The Stantargets R Package: A Workflow Framework for Efficient Reproducible Stan-Powered Bayesian Data Analysis Pipelines.” *Journal of Open Source Software* 6 (60): 3193. https://doi.org/10.21105/joss.03193.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2019. “Toward a Principled Bayesian Workflow in Cognitive Science.” *arXiv Preprint*. https://doi.org/10.48550/ARXIV.1904.12765.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2019. “Toward a Principled Bayesian Workflow in Cognitive Science.” *arXiv Preprint*. https://doi.org/10.48550/ARXIV.1904.12765.

*Psychological Methods*26 (1): 103–26. https://doi.org/https://doi.org/10.1037/met0000275.

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2019. “How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” *Journal of Memory and Language* 110. https://doi.org/10.1016/j.jml.2019.104038.

*Journal of Memory and Language*110 (February): 104038. https://doi.org/10/gf9tjp.

Szollosi, Aba, David Kellen, Danielle J. Navarro, Richard M Shiffrin, Iris van Rooij, Trisha Van Zandt, and Christopher Donkin. 2020. “Is Preregistration Worthwhile?” *Trends in Cognitive Sciences* 24 (2): 94–95.

Talts, Sean, Michael J. Betancourt, Daniel P. Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” *arXiv Preprint arXiv:1804.06788*.

Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal. 2017. “Good Enough Practices in Scientific Computing.” *PLoS Computational Biology* 13 (6): e1005510.

This issue shows up as problems with optimizer convergence in frequentist methods (like the ones used in the

`lme4`

package), indicating that the likelihood is too flat and the parameter estimates are not limited by the data.↩︎