# Chapter 14 Introduction to model comparison

A key goal of cognitive science is to decide which theory under consideration accounts for the experimental data better. This can be accomplished by implementing the theories (or some aspects of them) as Bayesian models and comparing their predictive power. Thus, model comparison and hypothesis testing are closely related ideas. There are two Bayesian perspectives on model comparison: a *prior* predictive perspective based on the Bayes factor using marginal likelihoods, and a *posterior* predictive perspective based on cross-validation. The main characteristic difference between the prior predictive approach (Bayes factor) and the posterior predictive approach (cross-validation) is the following: the Bayes factor examines how well the model (prior *and* likelihood) explains the experimental data. By contrast, the posterior predictive approach assesses model predictions for held-out data after seeing most of the data.

That is, the predictive accuracy of the Bayes factor is based only on the model’s prior predictive distribution. In Bayes factor analyses, the prior model predictions are used to evaluate the support that the data give to the model. By contrast, in cross-validation, the model is fit to a large subset of the data (i.e., the training data). The posterior distributions of the parameters of this fitted model are then used to make predictions for held-out or validation data, and model fit is assessed on this held-out subset of the data. Typically, this process is repeated several times, until the entire data set has served as held-out data. This attempts to assess whether the model will generalize to truly new, unobserved data. Of course, the held-out data are usually not “truly new” because they are part of the data that was collected, but at least they are data that the model has not been exposed to. That is, the predictive accuracy of cross-validation methods is based on how well the posterior predictive distribution fit to most of the data (i.e., the training data) characterizes out-of-sample data (i.e., the test or held-out data).
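The cross-validation workflow just described can be sketched with a toy model. The following Python sketch (this book’s models are fit in R with `brms`; the data, the conjugate normal model, and the priors here are invented purely for illustration) holds out each observation in turn, computes the posterior from the remaining data, and sums the log predictive densities of the held-out points for a null model (effect fixed at zero) and an alternative model (effect estimated from the data):

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of a Normal(mean, sd) distribution at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def heldout_logdens_alt(train, test, sigma=0.5, tau=1.0):
    """M1: y_i ~ Normal(mu, sigma), prior mu ~ Normal(0, tau).
    Compute the conjugate posterior for mu from the training data, then
    evaluate the posterior predictive log density at the held-out point."""
    prec = 1 / tau ** 2 + len(train) / sigma ** 2   # posterior precision
    m = (sum(train) / sigma ** 2) / prec            # posterior mean (prior mean is 0)
    v = 1 / prec                                    # posterior variance
    return normal_logpdf(test, m, math.sqrt(v + sigma ** 2))

def heldout_logdens_null(train, test, sigma=0.5):
    """M0: mu is fixed at 0, so there is nothing to fit."""
    return normal_logpdf(test, 0.0, sigma)

# Invented data: eight measurements of some effect
y = [0.9, 1.1, 1.3, 0.8, 1.2, 1.0, 0.7, 1.4]

# Leave-one-out cross-validation: every observation serves once as held-out data
elpd_alt = sum(heldout_logdens_alt(y[:i] + y[i + 1:], y[i]) for i in range(len(y)))
elpd_null = sum(heldout_logdens_null(y[:i] + y[i + 1:], y[i]) for i in range(len(y)))

# The model with the larger (less negative) sum of held-out log densities has
# the better estimated out-of-sample predictive accuracy.
print(f"M1 (mu estimated):  {elpd_alt:.1f}")
print(f"M0 (mu fixed at 0): {elpd_null:.1f}")
```

The chapters that follow develop the real machinery for this comparison; this sketch only mirrors the logic of fitting on training data and scoring on held-out data.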

The prior predictive distribution is obviously highly sensitive to the priors: it evaluates the probability of the observed data under prior assumptions. By contrast, the posterior predictive distribution is less dependent on the priors because the priors are combined with the likelihood (and are thus less influential, given sufficient data) before making predictions for held-out validation data.
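This difference in sensitivity can be made concrete with a small numerical experiment. In the following Python sketch (a toy conjugate normal model with invented data; the prior scales \(\tau = 1\) and \(\tau = 100\) are arbitrary choices), widening the prior shifts the log marginal likelihood (the quantity behind the Bayes factor) by several log points, while the posterior predictive density of a held-out point is nearly unaffected:

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of a Normal(mean, sd) distribution at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def fit(y, sigma, tau):
    """y_i ~ Normal(mu, sigma) with prior mu ~ Normal(0, tau).
    Returns the log marginal likelihood (computed sequentially via the chain
    rule: each observation is scored under the predictive distribution given
    the previous ones) plus the posterior mean and variance of mu."""
    m, v, logml = 0.0, tau ** 2, 0.0
    for yi in y:
        logml += normal_logpdf(yi, m, math.sqrt(v + sigma ** 2))
        prec = 1 / v + 1 / sigma ** 2          # conjugate posterior update
        m = (m / v + yi / sigma ** 2) / prec
        v = 1 / prec
    return logml, m, v

y = [0.9, 1.1, 1.3, 0.8, 1.2, 1.0, 0.7, 1.4]   # invented data
sigma, y_new = 0.5, 1.0                         # known sd; one held-out point

for tau in (1.0, 100.0):
    logml, m, v = fit(y, sigma, tau)
    lppd = normal_logpdf(y_new, m, math.sqrt(v + sigma ** 2))
    print(f"tau = {tau:6.1f}: log marginal likelihood = {logml:7.2f}, "
          f"held-out log density = {lppd:6.2f}")
```

The log marginal likelihood drops substantially under the wide prior (it pays for the diffuse prior predictions), whereas the held-out log density barely changes, because with eight observations the likelihood dominates both posteriors.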

Jaynes (2003, chap. 20) compares these two perspectives to “a cruel realist” and “a fair judge”. According to Jaynes, the Bayes factor adopts the posture of a cruel realist, who “judge[s] each model taking into account the prior information we actually have pertaining to it; that is, we penalize a model if we do not have the best possible prior information about its parameters, although that is not really a fault of the model itself.” By contrast, cross-validation adopts the posture of a scrupulously fair judge, “who insists that fairness in comparing models requires that each is delivering the best performance of which it is capable, by giving each the best possible prior probability for its parameters (similarly, in Olympic games we might consider it unfair to judge two athletes by their performance when one of them is sick or injured; the fair judge might prefer to compare them when both are doing their absolute best).”

Regardless of whether we use Bayes factor or cross-validation or any other method for model comparison, there are several important points that one should keep in mind:

Although the objective of model comparison might ultimately be to find out which of the models under consideration generalizes better, this generalization can only be done well within the range of the observed data (see Vehtari and Lampinen 2002; Vehtari and Ojanen 2012). That is, if one hypothesis, implemented as the model \(\mathcal{M}_1\), proves to be superior to a second hypothesis, implemented as the model \(\mathcal{M}_2\), according to the Bayes factor and/or cross-validation when evaluated with a young western university student population, this doesn’t mean that \(\mathcal{M}_1\) will be superior to \(\mathcal{M}_2\) when it is evaluated with a broader population (and in fact it seems that many times it won’t; see Henrich, Heine, and Norenzayan 2010). However, if we can’t generalize even within the range of the observed data (e.g., university students in the northern part of the western hemisphere), there is no hope of generalizing outside of that range (e.g., non-university students). Navarro (2019) argues that one of the most important functions of a model is to encourage directed exploration of new territory; our view is that this makes sense only if historical data is also accounted for (this is analogous to regression testing in software development—existing functionality/empirical coverage should not be lost when the model is extended to cover new data). In practice, what this means for us is that evaluating a model’s performance should be carried out using historical benchmark data in addition to any new data one has; just using isolated pockets of new data to evaluate a model is not convincing. For an example from psycholinguistics of model evaluation using historical benchmark data, see Nicenboim, Vasishth, and Rösler (2020b).

Model comparison can provide a quantitative way to evaluate models, but this cannot replace understanding the qualitative patterns in the data (see, e.g., Navarro 2019). A model can provide a good fit by behaving in a way that contradicts our substantive knowledge. For example, Lissón et al. (2021) examine two computational models of sentence comprehension. One of the models yielded higher predictive accuracy when the parameter that is related to the probability of correctly comprehending a sentence was higher for impaired subjects (individuals with aphasia) than for the control population. This contradicts domain knowledge—impaired subjects are generally observed to show worse performance than unimpaired control subjects—and led to a re-evaluation of the model.

Model comparison is based on finding the most “useful” model for characterizing our data, but neither the Bayes factor nor cross-validation (nor any other method that we are aware of) guarantees selecting the model closest to the truth (even with enough data). This is related to our previous point: a model that is closest to the true data-generating process is not guaranteed to produce the best (prior or posterior) predictions, and a model with a clearly wrong data-generating process is not guaranteed to produce poor (prior or posterior) predictions. See Wang and Gelman (2014) for an example with cross-validation, and Navarro (2019) for a toy example with the Bayes factor.

One should also check that the effect being modeled is estimated with high precision; if the effect has high uncertainty (i.e., the posterior distribution of the target parameter is widely spread out), then any measure of model fit can be uninformative, because we don’t have accurate estimates of the effect of interest. In the Bayesian context, this implies that the prior predictive and posterior predictive distributions of the effects generated by the model should be theoretically plausible and reasonably constrained, and the target parameter of interest should be estimated with as high precision as possible; this in turn implies that we need sufficient data if we want to obtain precise estimates of the parameter of interest. Later in this part of the book, we will discuss the adverse impact of imprecision in the data on model comparison (see section 15.5.2). We will show that, in the face of low precision, we generally won’t learn much from model comparison.

When comparing a null model with an alternative model, it is important to be clear about what the null model specification is. For example, in section 5.2.4, we encountered the correlated varying intercepts and varying slopes model of the effect of cloze probability on the N400. The `brms` formula for the full model was:

`n400 ~ 1 + c_cloze + (1 + c_cloze | subj)`

If we want to test the null hypothesis that centered cloze has no effect on the dependent variable, one null model is:

`n400 ~ 1 + (1 + c_cloze | subj)` (Model `M0a`)

In model `M0a`, by-subject variability is allowed; just the fixed effect of centered cloze is assumed to be zero. This is called a nested model comparison, because the null model is subsumed in the full model.

An alternative null model could remove only the varying slopes:

`n400 ~ 1 + c_cloze + (1 | subj)` (Model `M0b`)

Model `M0b`, which is also nested inside the full model, is testing a different null hypothesis than `M0a` above: is the between-subject variability in the centered cloze effect zero?

Yet another possibility is to remove both the fixed and random effects of centered cloze:

`n400 ~ 1 + (1 | subj)` (Model `M0c`)

Model `M0c` is also nested inside the full model, but it now has two parameters missing instead of one. Usually, it is best to compare models that differ by only one parameter; otherwise, one cannot be sure which parameter was responsible for our rejecting or accepting the null hypothesis.

**Box 14.1 Credible intervals should not be used to reject a null hypothesis**

Researchers often incorrectly use credible intervals for null hypothesis testing, that is, to test whether a parameter \(\beta\) is zero or not. A common approach is to check whether zero is included in the 95% credible interval for the parameter \(\beta\); if it is, then the null hypothesis that the effect is zero is accepted, and if zero is outside the interval, then the null is rejected. For example, in a tutorial paper that two of the authors of this book wrote (Nicenboim and Vasishth 2016), we incorrectly suggested that the credible interval can be used to reject the hypothesis that \(\beta\) is zero. This is not the correct approach.

The problem with this approach is that it is a heuristic: it will work in some cases and might be misleading in others (for an example, see Vasishth, Yadav, et al. 2022). Unfortunately, when the heuristic will work and when it won’t is, in fact, not well defined.

Why is the credible-interval approach only a heuristic? One line of (incorrect) reasoning that justifies looking at the overlap between the credible interval and zero is based on the fact that the most likely values of \(\beta\) lie within the 95% credible interval.^{41} This entails that if zero is outside the interval, it must have a low probability density. This is true, but it’s meaningless: regardless of where zero (or any other point value) lies, it will have a probability mass of exactly zero, since we are dealing with a continuous distribution. The lack of overlap doesn’t tell us how much posterior probability the null model has.

A partial solution could be to look at a probability interval close to zero rather than at the point value zero (e.g., an interval of, say, \(-2\) to \(2\) ms in a response time experiment), so that we obtain a non-zero probability mass. While the lack of overlap would then be slightly more informative, excluding a small interval can be problematic when the prior probability mass of that interval is very small to begin with (as was the case with the regularizing priors we assigned to our parameters). Rouder, Haaf, and Vandekerckhove (2018) show that if prior probability mass is added to the point value zero using a *spike-and-slab* prior (or if probability mass is added to the small interval close to zero, if one considers that equivalent to the null model), looking at whether zero is in the 95% credible interval is analogous to the Bayes factor. Unfortunately, the spike-and-slab prior cannot be implemented in Stan, because it relies on a discrete parameter. However, other probabilistic programming tools (such as `PyMC3`, JAGS, or Turing) can be used if such a prior needs to be fit; see the further reading section.
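To build intuition for this analogy, consider a toy conjugate model in which both marginal likelihoods are available in closed form (a Python sketch with invented data; a real spike-and-slab model would be fit with a discrete indicator parameter in JAGS or PyMC3, not like this). With prior probability 0.5 on the spike at zero, the posterior probability of the null follows directly from the two marginal likelihoods:

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of a Normal(mean, sd) distribution at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def log_ml_slab(y, sigma, tau):
    """Log marginal likelihood under the slab: mu ~ Normal(0, tau),
    computed sequentially via conjugate updates."""
    m, v, logml = 0.0, tau ** 2, 0.0
    for yi in y:
        logml += normal_logpdf(yi, m, math.sqrt(v + sigma ** 2))
        prec = 1 / v + 1 / sigma ** 2
        m = (m / v + yi / sigma ** 2) / prec
        v = 1 / prec
    return logml

def log_ml_spike(y, sigma):
    """Log marginal likelihood under the spike: mu is exactly 0."""
    return sum(normal_logpdf(yi, 0.0, sigma) for yi in y)

y = [0.9, 1.1, 1.3, 0.8, 1.2, 1.0, 0.7, 1.4]   # invented data
sigma, tau = 0.5, 1.0

# Bayes factor in favor of the spike (the null), on the log scale
log_bf01 = log_ml_spike(y, sigma) - log_ml_slab(y, sigma, tau)

# With prior odds of 1 (prior probability 0.5 on the spike), the posterior
# probability of the null is a simple function of the Bayes factor.
p_null = 1 / (1 + math.exp(-log_bf01))
print(f"log BF01 = {log_bf01:.1f}, P(mu = 0 | y) = {p_null:.2e}")
```

For these invented data, which are clearly far from zero, the spike receives essentially no posterior probability; the point is only that the null here is a separate hypothesis with its own prior mass, not a point inside a single continuous posterior.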

Rather than looking at the overlap of the 95% credible interval with zero, we might be tempted to conclude that there is evidence for an effect because the probability that the parameter is positive is high, that is, \(P(\beta > 0) \gg 0.5\). However, the same logic from the previous paragraph renders this meaningless. Given that the probability mass of a point value, \(P(\beta = 0)\), is zero, all we can conclude from \(P(\beta > 0) \gg 0.5\) is that \(\beta\) is very likely to be positive rather than negative; we can’t make any assertions about whether \(\beta\) is exactly zero.

As we saw, the main problem with these heuristics is that they ignore the fact that the null model is a separate hypothesis. In many situations, the null hypothesis may not be of interest, and it might be perfectly fine to base our conclusions on credible intervals or on \(P(\beta > 0)\). The problem arises when these heuristics are used to provide evidence in favor of or against the null hypothesis. If one wants to argue for or against a null hypothesis, Bayes factors or cross-validation will be needed; these are discussed in the next two chapters.

How can credible intervals be used sensibly? The region of practical equivalence (ROPE) approach (Spiegelhalter, Freedman, and Parmar 1994; Freedman, Lowe, and Macaskill 1984; and, more recently, Kruschke and Liddell 2018; Kruschke 2014) is a reasonable alternative to hypothesis testing and arguing for or against a null. This approach is related to the spike-and-slab discussion above. In the ROPE approach, one defines, before the data are seen, a predicted range of values for a target parameter. Of course, there has to be a principled justification for choosing this range a priori; one example of a principled justification would be the prior predictions of a computational model. Then, the overlap (or lack thereof) between this predicted range and the observed credible interval can be used to infer whether the estimates are consistent (or partly consistent) with the predicted range. Here, we are not ruling out any null hypothesis, and we are not using the credible interval to make a decision like “the null hypothesis is true/false.”
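The posterior summaries discussed in this box are easy to compute from posterior samples. Below is a minimal Python sketch; the “posterior samples” are simulated from a normal distribution rather than taken from a fitted model, and the \(\pm 2\) ms ROPE is an arbitrary choice for illustration:

```python
import random

random.seed(123)

# Hypothetical posterior samples for an effect beta (in ms). In a real analysis
# these would be MCMC draws from a fitted model; here they are simulated from a
# normal distribution purely for illustration.
posterior = [random.gauss(15.0, 8.0) for _ in range(4000)]

# Region of practical equivalence, fixed before seeing the data (arbitrary here)
rope = (-2.0, 2.0)

p_in_rope = sum(rope[0] < b < rope[1] for b in posterior) / len(posterior)
p_positive = sum(b > 0 for b in posterior) / len(posterior)

# These are descriptions of the posterior, not decisions about a null hypothesis.
print(f"P(beta in ROPE) = {p_in_rope:.3f}")
print(f"P(beta > 0)     = {p_positive:.3f}")
```

Both quantities describe where the posterior mass lies relative to the predicted range; neither licenses a statement that the null hypothesis is true or false.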

## 14.1 Further reading

Roberts and Pashler (2000) and Pitt and Myung (2002) argue for the need to go beyond “a good fit” (which, in the context of Bayesian data analysis, corresponds to a good posterior predictive check), and for model comparison with a focus on measuring the generalizability of a model. Navarro (2019) deals with the problematic aspects of model selection in the context of the psychological literature and cognitive modeling. Fabian Dablander’s blog post, https://fabiandablander.com/r/Law-of-Practice.html, shows a very clear comparison between the Bayes factor and PSIS-LOO-CV. Rodriguez, Williams, and Rast (2021) provide JAGS code for fitting models with spike-and-slab priors. Fabian Dablander has a comprehensive blog post on how to implement a Gibbs sampler in R when using such a prior: https://fabiandablander.com/r/Spike-and-Slab.html.

### References

Freedman, Laurence S., D. Lowe, and P. Macaskill. 1984. “Stopping Rules for Clinical Trials Incorporating Clinical Opinion.” *Biometrics* 40 (3): 575–86.

Henrich, Joseph, Steven J. Heine, and Ara Norenzayan. 2010. “The Weirdest People in the World?” *Behavioral and Brain Sciences* 33 (2-3). Cambridge University Press: 61–83. https://doi.org/10.1017/S0140525X0999152X.

Jaynes, Edwin T. 2003. *Probability Theory: The Logic of Science*. Cambridge university press.

Kruschke, John. 2014. *Doing Bayesian Data Analysis: A tutorial with R, JAGS, and Stan*. Academic Press.

Kruschke, John, and Torrin M Liddell. 2018. “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” *Psychonomic Bulletin & Review* 25 (1). Springer: 178–206.

Lissón, Paula, Dorothea Pregla, Bruno Nicenboim, Dario Paape, Mick van het Nederend, Frank Burchert, Nicole Stadie, David Caplan, and Shravan Vasishth. 2021. “A Computational Evaluation of Two Models of Retrieval Processes in Sentence Processing in Aphasia.” *Cognitive Science* 45 (4): e12956. https://onlinelibrary.wiley.com/doi/full/10.1111/cogs.12956.

Navarro, Danielle J. 2019. “Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection.” *Computational Brain & Behavior* 2 (1): 28–34.

Nicenboim, Bruno, and Shravan Vasishth. 2016. “Statistical methods for linguistic research: Foundational Ideas - Part II.” *Language and Linguistics Compass* 10 (11): 591–613. https://doi.org/10.1111/lnc3.12207.

Nicenboim, Bruno, Shravan Vasishth, and Frank Rösler. 2020b. “Are Words Pre-Activated Probabilistically During Sentence Comprehension? Evidence from New Data and a Bayesian Random-Effects Meta-Analysis Using Publicly Available Data.” *Neuropsychologia* 142. https://doi.org/10.1016/j.neuropsychologia.2020.107427.

Pitt, Mark A., and In Jae Myung. 2002. “When a Good Fit Can Be Bad.” *Trends in Cognitive Sciences* 6 (10): 421–25. https://doi.org/10.1016/S1364-6613(02)01964-2.

Roberts, Seth, and Harold Pashler. 2000. “How Persuasive Is a Good Fit? A Comment on Theory Testing.” *Psychological Review* 107 (2): 358–67.

Rodriguez, Josue E., Donald R. Williams, and Philippe Rast. 2021. “Who Is and Is Not ‘Average’? Random Effects Selection with Spike-and-Slab Priors.” PsyArXiv.

Rouder, Jeffrey N., Julia M Haaf, and Joachim Vandekerckhove. 2018. “Bayesian Inference for Psychology, Part IV: Parameter Estimation and Bayes Factors.” *Psychonomic Bulletin & Review* 25 (1): 102–13.

Spiegelhalter, David J, Laurence S. Freedman, and Mahesh KB Parmar. 1994. “Bayesian Approaches to Randomized Trials.” *Journal of the Royal Statistical Society. Series A (Statistics in Society)* 157 (3): 357–416.

Vasishth, Shravan, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. 2022. “Sample Size Determination for Bayesian Hierarchical Models Commonly Used in Psycholinguistics.” *Computational Brain and Behavior*.

Vehtari, Aki, and Jouko Lampinen. 2002. “Bayesian Model Assessment and Comparison Using Cross-Validation Predictive Densities.” *Neural Computation* 14 (10): 2439–68. https://doi.org/10.1162/08997660260293292.

Vehtari, Aki, and Janne Ojanen. 2012. “A Survey of Bayesian Predictive Methods for Model Assessment, Selection and Comparison.” *Statistical Surveys* 6 (0). Institute of Mathematical Statistics: 142–228. https://doi.org/10.1214/12-ss102.

Wang, Wei, and Andrew Gelman. 2014. “Difficulty of Selecting Among Multilevel Models Using Predictive Accuracy.” *Statistics at Its Interface* 7: 1–8.

This is strictly true only for a highest density interval (HDI), that is, a credible interval where all the points within the interval have a higher probability density than any point outside the interval. However, when posterior distributions are symmetrical, these intervals are virtually identical to the equal-tail intervals we use in this book.↩