# Chapter 16 Cross-validation

A popular way to evaluate and compare models is to investigate their ability to make predictions for “out-of-sample data”, that is, to use what we learned from the observed data to predict future or unseen observations. Cross-validation is used to test which of the models under consideration is/are able to learn the most from our data in order to make better predictions. However, in cognitive science, our objective will rarely be to predict future observations, but rather to compare how well different models fare in accounting for the observed data.

The objective of cross-validation is to avoid over-optimistic predictions; such over-optimistic predictions would arise if we were to use the data to estimate the parameters of our model, and then use these estimates to predict the same data. That amounts to using the data twice. The basic idea behind cross-validation is that the models are with a large subset of the data, the training set, and then we predict a smaller part of the data, the held-out set. In order to treat the entire data set as a held-out set, and to evaluate the predictive accuracy using every observation, one changes what constitutes the training set and the held-out set. This ensures that the predictions of the model are tested over the entire data set.

## 16.1 The expected log predictive density of a model

In order to compare the quality of the posterior predictions of two models, a utility function or a scoring rule is used (see Gneiting and Raftery 2007 for a review on scoring rules). The logarithmic score rule (Good 1952), shown in equation (16.1), has been proposed as a reasonable way to assess the posterior predictive distribution of a candidate model $$\mathcal{M}_1$$ given the data $$y$$. This approach is reasonable because it takes into account the uncertainty of the predictions (compare this the mean square error). If new observations are well-accounted by the posterior predictive distribution, then the density of the posterior predictive distribution is high and so is its logarithm.

$\begin{equation} u( \mathcal{M}_1, y_{pred}) = \log p(y_{pred}| y, \mathcal{M}_1) \tag{16.1} \end{equation}$

Unlike the Bayes factor, the prior is absent from equation (16.1). However, the prior does have a role here: The posterior predictive distribution is based on the posterior distribution $$p(\Theta\mid y)$$ (where $$\Theta$$ is a vector of all the parameters of the model), which, according to Bayes’ rule, depends on both priors and likelihood together. Recall equation (3.8) in section 3.6, repeated here for convenience:

$\begin{equation} p(y_{pred}\mid y )=\int_\Theta p(y_{pred}\mid \Theta) p(\Theta\mid y)\, d\Theta \tag{16.2} \end{equation}$

In equation (16.2), we are implicitly conditioning on the model under consideration:

$\begin{equation} p(y_{pred}\mid y, \mathcal{M}_1)=\int_\Theta p(y_{pred}\mid \Theta, \mathcal{M}_1) p(\Theta\mid y, \mathcal{M}_1)\, d\Theta \tag{16.3} \end{equation}$

The predicted data, $$y_{pred}$$, are unknown to the utility function, so the utility function as presented in equation (16.1) cannot be evaluated. For this reason, we marginalize over all possible future data (calculating $$E[\log p(y_{pred}| y, \mathcal{M}_1)]$$); this expression is called the expected log predictive density of model $$\mathcal{M}_1$$:

$\begin{equation} elpd = u(\mathcal{M}_1) = \int_{y_{pred}} p_t(y_{pred}) \log p(y_{pred} \mid y, \mathcal{M}_1) \, dy_{pred} \tag{16.4} \end{equation}$

where $$p_t$$ is the true data generating distribution. If we consider a set of models, the model with the highest $$elpd$$ is the model with the predictions that are the closest to those of the true data generating process.44 The intuition behind Equation (16.4) is that we are evaluating the predictive distribution of $$\mathcal{M}_1$$ over all possible future data weighted by how likely the future data is according to its true distribution. This means that observations that are very likely according to the true model will have a higher weight than unlikely ones.

But we don’t know the true data-generating distribution, $$p_t$$! If we knew it, we wouldn’t be looking for the best model, since $$p_t$$ is the best model.

We can use the observed data distribution as a proxy for the true data generation distribution. So instead of weighting the predictive distribution by the true density of all possible future data, we just use the $$N$$ observations that we have. We can do that because our observations are presumed to be samples from the true distribution of the data: Under this assumption, observations with higher likelihood according to the true distribution of the data will also be more commonly obtained. This means that instead of integrating, we sum the posterior predictive density of the observations and we give to each observation the same weight; this is valid because observations that are more common will be already appearing more often (see also Box 16.1). This quantity is called log pointwise predictive density or $$lpd$$ (without the $$1/N$$ in Vehtari, Gelman, and Gabry 2017b):

$\begin{equation} lpd = \frac{1}{N} \sum_{n=1}^{N} \log p(y_n|y, \mathcal{M}_1) \tag{16.5} \end{equation}$

The $$lpd$$ is an overestimate of elpd for actual future data, because the parameters of the posterior predictive distribution are estimated with the same observations that we are considering out-of-sample. Incidentally, this also explains why posterior predictive checks are generally optimistic and good fits cannot be taken too seriously. But they do serve the purpose of identifying very strong model misspecifications.45

However, we can obtain a more conservative estimate of the predictive performance of a model using cross-validation (Geisser and Eddy 1979). This is explained next. (As an aside, we mention here that there are also other alternatives to cross-validation; these are presented in Vehtari and Ojanen 2012).

Box 16.1 How do we get rid of the integral in the approximation of elpd?

As an example, imagine that there are $$N$$ observations in an experiment. Suppose also that the true generative process (which is always unknown to us) is a Beta distribution:

$\begin{equation} p_t(y) = \mathit{ Beta}(y | 1, 3) \end{equation}$

Set $$N$$ and observe some simulated data $$y$$:

N <- 10000
y_data <- rbeta(N, 1, 3)
head(y_data)
##  0.222 0.466 0.509 0.117 0.614 0.362

Let’s say that we fit the Bayesian model $$\mathcal{M}_{1}$$, and somehow, after getting the posterior distribution, we are able derive the analytical form of its posterior predictive distribution for the model:

$\begin{equation} p(y_{pred} | y, \mathcal{M}_1) = \mathit{Beta}(y_{pred} | 2, 2) \end{equation}$

This distribution will tell us how likely different future observations will be, and it also entails that our future observations will be bounded by $$0$$ and $$1$$. (Any observation outside this range will have a probability density of zero).

Imagine that we could know the true distribution of the data, $$p_t$$, which is conveniently close to our posterior predictive distribution. This means that equation (16.4), repeated below, is simple enough, and we know all its terms:

$\begin{equation} elpd = u(\mathcal{M}_1) = \int_{y_{pred}} p_t(y_{pred}) \log p(y_{pred} \mid y, \mathcal{M}_1)\, dy_{pred} \end{equation}$

We can compute this quantity in R. Notice that we don’t introduce the data at any point. However, the data had to be used when p, the posterior predictive distribution, was derived; we skipped that step here.

# True distribution:
p_t <- function(y) dbeta(y, 1, 3)
# Predictive distribution:
p <- function(y) dbeta(y, 2, 2)
# Integration:
integrand <- function(y) p_t(y) * log(p(y))
integrate(f = integrand, lower = 0, upper = 1)
## -0.375 with absolute error < 0.00000068

Because we will never know p_t, this integral can be approximated using the data, y_data. It is possible to approximate the integration without any reference to p_t (see equation (16.5)):

1/N * sum(log(p(y_data)))
##  -0.356

The main problem with this approach is that we are using y_data twice, once to derive p, the predictive posterior distribution, and once for the approximation of $$elpd$$. We’ll see that cross-validation approaches rely on deriving the posterior predictive distribution with part of the data, and estimating the approximation to $$elpd$$ with unseen data. (Don’t worry that we don’t know the analytical form of the posterior predictive distribution: we saw that we could generate samples from that distribution based on the likelihood and posterior samples.)

## 16.2 K-fold and leave-one-out cross-validation

The basic idea of K-fold cross-validation (K-fold-CV) is to split the N observations of our data into K subsets, such that each subset is used as a validation (or held-out) set, $$D_k$$, while the remaining set (the training set), $$D_{-k}$$ is used for estimating the parameters and approximating $$p_t$$, the true data distribution. The leave-one-out cross-validation (LOO-CV) method represent a special case of K-fold-CV where the training set only excludes one observation ($$K = N$$). We estimate $$elpd$$ as follows.

$\begin{equation} \widehat{elpd} = \frac{1}{N} \sum_{n=1}^{N} \log p(y_n| D_{\backslash n}, \mathcal{M}_1) \tag{16.6} \end{equation}$

In equation (16.6), each observation, $$y_n$$, belongs to a certain “validation” fold, $$D_k$$, and the predictive accuracy of $$y_n$$ is evaluated based on a posterior predictive model trained on the set, $$D_{\backslash n}$$, which is the complete data set excluding the validation fold that contains the $$n$$-th observation. This means that the posterior predictive distribution is used to evaluate $$y_{n}$$, even though the posterior predictive distribution was derived without having information from that $$y_n$$-th observation (in other words, the model was trained without that observation in the subset of the data, $$D_{-k}$$). In K-fold-CV, several observations are held out in same (validation) fold. This means that the the held-out observations are split among K folds, and $$D_{\backslash n}$$, the data used to derive the posterior predictive distribution, contain only a proportion of the observations; this proportion is $$(1 - 1/K)$$. By contrast, in leave-one-out cross-validation, the held-out data set includes only one observation. That is, $$D_{\backslash n}$$ contains the entire data set except for one data point, $$y_n$$, with $$n=1,\dots,N$$. Box 16.2 explains the algorithm in detail.

Vehtari, Gelman, and Gabry (2017b) define the expected log pointwise predictive density of the observation $$y_n$$ as follows:

$\begin{equation} \widehat{elpd}_{n} = \log p(y_n| D_{\backslash n} , \mathcal{M}_1) \end{equation}$

This quantity indicates the predictive accuracy of the model $$\mathcal{M}_1$$ for a single observation, and it is reported in the package loo and also in brms. In addition, the loo package uses the sum of the expected log pointwise predictive density, $$\sum elpd_n$$ (equation (16.6) without $$\frac{1}{N}$$) as a measure of predictive accuracy (this is referred as elpd_loo or elpd_kfold by loo and brms packages). For model comparison, the difference between the $$\sum elpd_n$$ of competing models can be computed, including the standard deviation of the sampling distribution of the difference. It’s important to notice that we are calculating an approximation to the expectation that we actually want to compute, $$elpd$$, and thus we always need to consider its inherent randomness (Vehtari, Simpson, et al. 2019).

Unlike what is common with information criterion methods (such as the Akaike Information Criterion, AIC, and the Deviance Information Criterion, DIC), a higher $$\widehat{elpd}$$ means higher predictive accuracy. An alternative to using $$\widehat{elpd}$$ is to examine $$-2\times \widehat{elpd}$$, which is equivalent to deviance, and is called the LOO Information Criterion (LOOIC) (see section 22 of Vehtari 2022).

The approximation to the true data generating distribution is worse when fewer observations are used, and thus ideally we would set $$K =N$$, and thus compute LOO-CV rather than K-fold-CV. The main advantage of LOO-CV is its robustness, since the training set is as similar as possible to the observed data, and the same observations are never used simultaneously for training and evaluating the predictions. A major disadvantage is the computational burden (Vehtari and Ojanen 2012), since we need to fit a model as many times as the number of observations. The package loo provides an approximation to LOO-CV, Pareto smoothed importance sampling leave-one-out (PSIS-LOO; Vehtari and Gelman 2015; Vehtari, Gelman, and Gabry 2017b) which, as we show next, is relatively straightforward to use in brms and in Stan models (see https://mc-stan.org/loo/articles/loo2-with-rstan.html). However, in some cases, its estimates can be unreliable; this is indicated by the estimated shape parameter $$\hat{k}$$ of the generalized Pareto distribution. In those cases, where one or several pointwise predictive density have associated large (larger than 0.5 or 0.7, see https://mc-stan.org/loo/reference/pareto-k-diagnostic.html) $$\hat{k}$$, either (i) the problematic predictions can be refitted with exact LOO-CV, (ii) one can try some additional computations using the existing posterior sample based on the moment matching approximation (see https://mc-stan.org/loo/articles/loo2-moment-matching.html and Paananen et al. 2021), or (iii) one can abandon PSIS-LOO-CV and use K-fold-CV, with K typically set to 10.

One of the main disadvantages of cross-validation (in comparison with Bayes factor at least) is that the numerical difference in predictive accuracy is hard to interpret. As a rule of thumb, it has been suggested that if the elpd difference (elpd_diff in the loo package) is less than 4, the difference is small, and if it is larger than 4, one should compare that difference to its standard error (se_diff) (see section 16 of Vehtari 2022).

Box 16.2 The cross-validation algorithm

Here we spell out the Bayesian cross-validation algorithm in detail:

1. Split the data pseudo-randomly into $$K$$ held-out or validation sets $$D_k$$, (where $$k=1,\dots,K$$) that are a fraction of the original data, and $$K$$ training sets, $$D_{-k}$$. The length of the held-out data vector $$D_k$$ is approximately $$1/K$$-th the size of the full data set. It is common to use $$K=10$$ for K-fold-CV. For LOO-CV, K should be set to the number of observations.

2. Fit $$K$$ models using each of the $$K$$ training sets, and obtain posterior distributions $$p_{-k} (\Theta) = p(\Theta\mid D_{-k})$$, where $$\Theta$$ is the vector of model parameters.

3. Each posterior distribution $$p(\Theta\mid D_{-k})$$ is used to compute the predictive accuracy for each held-out data-point $$y_n$$ in the vector $$D_{k}$$:

$\begin{equation} \widehat{elpd}_n = \log p(y_n \mid D_{-k}) %= \log \int p(y_n \mid \Theta) p(\Theta\mid D_{-k})\, d\Theta \end{equation}$

Given that the posterior distribution $$p(\Theta\mid D_{-k})$$ is summarized by $$S$$ samples, the log predictive density for each data point $$y_n$$ in a data vector $$D_k$$ can be approximated as follows:

$\begin{equation} \widehat{elpd}_n = \log \left(\frac{1}{S} \sum_{s=1}^S p(y_n\mid \Theta^{k,s})\right) \tag{16.7} \end{equation}$

where $$\Theta^{k,s}$$ corresponds to the sample $$s$$ of the posterior of the model fit to the validation set $$-k$$.

1. We obtain the $$elpd_{kfold}$$ (or $$elpd_{loo}$$) for all the held-out data points by summing up the $$\widehat{elpd}_n$$:

$\begin{equation} elpd_{kfold} = \sum_{n=1}^n \widehat{elpd}_n \tag{16.8} \end{equation}$

We can also compute the standard deviation of the sampling distribution (the standard error) by multiplying the standard deviation (or square root of variance) of the $$N$$ components by $$\sqrt{N}$$. Letting $$\widehat{ELPD}$$ be the vector $$\widehat{elpd}_1,\dots,\widehat{elpd}_N$$, we can write:

$\begin{equation} se(\widehat{elpd}) = \sqrt{N \mathit{Var}(\widehat{ELPD})} \tag{16.9} \end{equation}$

The difference between the $$elpd_{kfold}$$ of two competing models, $$\mathcal{M}_1$$ and $$\mathcal{M}_2$$ , is a measure of relative predictive performance. We can also compute the standard error of their difference using the formula discussed in Vehtari, Gelman, and Gabry (2017b).

$\begin{equation} se(\widehat{elpd}_{\mathcal{M}1} - \widehat{elpd}_{\mathcal{M}2}) = \sqrt{N \mathit{Var}(\widehat{ELPD_{\mathcal{M}1}} - \widehat{ELPD_{\mathcal{M}2}})} \tag{16.10} \end{equation}$

## 16.3 Testing the N400 effect using cross-validation

As we did in section 15.2 with the Bayes factor, we revisit section 5.2, where we estimated the effect of cloze probability on the N400 average signal. We consider two models here, a model that includes the effect of cloze probability, such as fit_N400_sih from section 5.2.5, and a null model.

We can verify that the likelihood that we fit was appropriate for a hierarchical model that includes an effect of cloze probability as follows:

formula(fit_N400_sih) 
## n400 ~ c_cloze + (c_cloze | subj) + (c_cloze | item)

In contrast to the situation with Bayes factor, priors are less critical for cross-validation. Priors are only important in cross-validation to the extent that they affect parameter estimation: As we saw previously, very narrow priors can bias the posterior; and unrealistically wide priors can lead to convergence problems. The number of samples is also less critical than with Bayes factor, most of the uncertainty in the estimates of the $$\widehat{elpd}$$ is due to the number of observations. However, a very small number of samples can affect the $$\widehat{elpd}$$ because the posterior estimation will be affected by the small sample size. We update our previous formula to define a null model as follows:

fit_N400_sih_null <- update(fit_N400_sih, ~ . - c_cloze) 

### 16.3.1 Cross-validation with PSIS-LOO

Estimating $$elpd$$ using PSIS-LOO is very straightforward with brms, which uses the package loo as a back-end. There is no need to refit the model, and loo takes care of the applying the PSIS approximation to derive estimates and standard errors.

(loo_sih <- loo(fit_N400_sih))
##
## Computed from 4000 by 2863 log-likelihood matrix
##
##          Estimate   SE
## elpd_loo -11092.9 46.7
## p_loo        81.5  2.8
## looic     22185.7 93.4
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
(loo_sih_null <- loo(fit_N400_sih_null))
##
## Computed from 4000 by 2863 log-likelihood matrix
##
##          Estimate   SE
## elpd_loo -11095.5 46.5
## p_loo        89.5  3.0
## looic     22191.0 93.1
## ------
## Monte Carlo SE of elpd_loo is 0.2.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.

The function loo reports three quantities with their standard error:

1. elpd_loo is the sum of pointwise predictive accuracy (a larger, less negative number indicates better predictions).
2. p_loo is an estimate of effective complexity of the model; asymptotically and under certain regularity conditions, p_loo can be interpreted as the effective number of parameters. If p_loo is larger than the number of data points or parameters, this may indicate a severe model misspecification.
3. looic is simply -2*elpd_loo, the $$elpd$$ on the deviance scale. This is called the information criterion, and is mainly provided for historical reasons: other information criteria like the AIC (Akaike Information Criterion) and the DIC (Deviance Information Criterion) are commonly used in model selection (Venables and Ripley 2002; Lunn et al. 2012).

It’s important to bear in mind that the PSIS-LOO approximation to LOO can only be trusted if Pareto k estimates ($$\hat{k}$$) are smaller than 0.7. To compare the models, we need to take a look at the difference between elpd_loo and the standard error of that difference:

loo_compare(loo_sih, loo_sih_null)
##                   elpd_diff se_diff
## fit_N400_sih       0.0       0.0
## fit_N400_sih_null -2.6       2.5

Although the model that includes cloze probability as a predictor has higher predictive accuracy, the difference is smaller than 4 and it’s smaller than two SE. This means that from the perspective LOO-CV, both models are almost indistinguishable! In fact, the same will happen if we compare the model with logarithmic predictability to the linear or null model; see exercise 16.1.

We could also check whether the alternative model is making good predictions for some range of values by examining the difference in pointwise predictive accuracy as a function of, for example, cloze probability. In the following plot, we subtract the predictive accuracy of the alternative model from the accuracy of the null model; we can now interpret larger differences as an advantage for the alternative model. However, we see that as far as posterior predictive accuracy goes, both models are quite similar. Figure 16.1 shows that the difference in predictive accuracy is symmetrical with respect to the zero; as we go further from the mean cloze (which is around $$0.5$$), the differences in predictions are larger but they span over positive and negative values.

The following code stores the difference in predictive accuracy of the models in a variable and plots it in Figure 16.1.

df_eeg <- mutate(df_eeg,
diff_elpd = loo_sih$pointwise[, "elpd_loo"] - loo_sih_null$pointwise[, "elpd_loo"]
)
ggplot(df_eeg, aes(x = cloze, y = diff_elpd)) +
geom_point(alpha = .4, position = position_jitter(w = .001, h = 0)) FIGURE 16.1: The difference in predictive accuracy between a model including the effect of cloze and a null model. A larger (more positive) difference indicates an advantage for the model that includes the effect of cloze.

This is unsettling because the effect of cloze probability on the N400 has been replicated in numerous studies. We would expect to see that, similar to the Bayes factor, cross-validation techniques will also show that a model that includes cloze probability as a predictor is superior to a model without it. Before we discuss why we don’t see a large difference, let us check what K-fold-CV yields.

### 16.3.2 Cross-validation with K-fold

Estimating $$elpd$$ using k-fold-CV has the advantage of omitting one layer of approximations: the $$elpd$$ based on PSIS-LOO-CV is an approximation of the $$elpd$$ based on exact LOO-CV (and we saw how any cross-validation approach gave as an approximation to the true $$elpd$$). This means that we don’t need to worry about $$\hat{k}$$. However, K-fold also uses a reduced training set in comparison with LOO, worsening the approximation to the true generating process $$p_{t}$$.

Because we divide our data into folds, we need to think about the way we split the data: We could do it randomly, but we take the risk that in some of the training sets, observations from a subject, for example, would be completely absent. This will lead to large differences in predictive accuracy between folds. We can avoid that by using stratification: we split the observations into groups ensuring that relative category frequencies are approximately preserved. We do this with the kfold() function, available in the package brms, by setting folds = "stratified" and group = "subj", by default K is set to 10, but that can be changed.

kfold_sih <- kfold(fit_N400_sih,
folds = "stratified",
group = "subj")
kfold_sih_null <- kfold(fit_N400_sih_null,
folds = "stratified",
group = "subj")

Running K-fold CV takes some time since each model is refit K times. We can now inspect the $$elpd$$ values:

kfold_sih
##
## Based on 10-fold cross-validation
##
##            Estimate   SE
## elpd_kfold -11097.4 46.6
## p_kfold        85.7  3.5
## kfoldic     22194.8 93.1
kfold_sih_null
##
## Based on 10-fold cross-validation
##
##            Estimate   SE
## elpd_kfold -11098.7 46.5
## p_kfold        93.3  3.7
## kfoldic     22197.5 93.0

Compare the two models using loo_compare (this function is usedf for both PSIS-LOO-CV and K-fold-CV):

loo_compare(kfold_sih, kfold_sih_null)
##                   elpd_diff se_diff
## fit_N400_sih       0.0       0.0
## fit_N400_sih_null -1.3       4.1

We see that, in this case, the results with K-fold-CV and PSIS-LOO-CV are quite similar: We can’t really distinguish between the two models.

### 16.3.3 Leave-one-group-out cross-validation

An alternative to splitting the observations randomly using stratification is to treat naturally occurring clusters as folds, this is leave-one-group-out cross-validation (LOGO-CV). By doing this we can interpret the output of cross-validation as the capacity of the models for generalizing to unseen clusters. We implement LOGO-CV with subjects.

logo_sih <- kfold(fit_N400_sih,
group = "subj")
logo_sih_null <- kfold(fit_N400_sih_null,
group = "subj")

Running LOGO CV with subjects takes some time since each model is refit as many times as we have subjects, in this case 37 times. We can now inspect the $$elpd$$ estimates and evaluate which model generalizes better to unseen subjects.

We compare the models using loo_compare.

loo_compare(logo_sih, logo_sih_null)
##                   elpd_diff se_diff
## fit_N400_sih       0.0       0.0
## fit_N400_sih_null -1.5       2.3

As before, and as with PSIS-LOO-CV, K-fold-CV, we can’t distinguish between the two models.

## 16.4 Comparing different likelihoods with cross-validation

We now compare two models with different likelihoods. In section 3.7.2 from chapter 3, we saw how a log-normal distribution was a more appropriate likelihood than a normal distribution for response times data. This was because response times are bounded by zero and right skewed, unlike the symmetrical normal distribution. Now we’ll use PSIS-LOO-CV to compare the predictive accuracy of the Stroop model from section 5.3 in chapter 5, which assumed a log-normal likelihood to fit response times of correct responses with a similar model which assumes a normal likelihood.

Load the data from bcogsci, create a sum coded predictor (see chapter 8 for more details), and fit the model as in section 5.3.

data("df_stroop")
df_stroop <- df_stroop %>%
mutate(c_cond = if_else(condition == "Incongruent", 1, -1))
fit_stroop_log <- brm(RT ~ c_cond + (c_cond | subj),
family = lognormal(),
prior =
c(
prior(normal(6, 1.5), class = Intercept),
prior(normal(0, 1), class = b),
prior(normal(0, 1), class = sigma),
prior(normal(0, 1), class = sd),
prior(lkj(2), class = cor)
),
data = df_stroop
)

Calculate the $$elpd_{loo}$$ for the original model with the log-normal likelihood:

loo_stroop_log <- loo(fit_stroop_log)
loo_stroop_log 
##
## Computed from 4000 by 3058 log-likelihood matrix
##
##          Estimate    SE
## elpd_loo -19858.7  93.6
## p_loo        60.5   4.1
## looic     39717.4 187.1
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     3057  100.0%  1069
##  (0.5, 0.7]   (ok)          1    0.0%  203
##    (0.7, 1]   (bad)         0    0.0%  <NA>
##    (1, Inf)   (very bad)    0    0.0%  <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.

The summary shows that $$k$$ estimates are ok.

Now fit a similar model where we assume that the likelihood is a normal distribution. It’s important now to change the priors since they are on a different scale (namely, in milliseconds). We choose reasonable but wide priors. We can do a sensitivity analysis, if we are unsure about the priors. However, unlike what happened with the Bayes factor in chapter 15, priors are going to affect cross-validation based model comparison only as far as they have a noticeable effect on the posterior distribution.

fit_stroop_normal <- brm(RT ~ c_cond + (c_cond | subj),
family = gaussian(),
prior =
c(
prior(normal(400, 600), class = Intercept),
prior(normal(0, 100), class = b),
prior(normal(0, 300), class = sigma),
prior(normal(0, 300), class = sd),
prior(lkj(2), class = cor)
),
data = df_stroop
)

If we try to obtain $$elpd$$ based on PSIS-LOO-CV, we’ll find several large $$\hat{k}$$ values.

loo_stroop_normal <- loo(fit_stroop_normal)
loo_stroop_normal
##
## Computed from 4000 by 3058 log-likelihood matrix
##
##          Estimate    SE
## elpd_loo -21588.9 480.5
## p_loo       117.5  67.2
## looic     43177.8 961.1
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     3054  99.9%   1305
##  (0.5, 0.7]   (ok)          1   0.0%   84
##    (0.7, 1]   (bad)         2   0.1%   43
##    (1, Inf)   (very bad)    1   0.0%   3
## See help('pareto-k-diagnostic') for details.

We can try the model matching approximation for problematic observations, by setting moment_match = TRUE in the loo() call (and we also need to fit the model with save_pars = save_pars(all = TRUE)). In this particular case, this approximation won’t solve our problem. We skip that step here. We can use exact LOO (rather than its approximation) for the problematic observations. By setting reloo = TRUE, we re-fit the 3 problematic observations with $$\hat{k}$$ values over $$0.7$$ using exact LOO-CV.

loo_stroop_normal <- loo(fit_stroop_normal, reloo = TRUE)
loo_stroop_normal
##
## Computed from 4000 by 3058 log-likelihood matrix
##
##          Estimate     SE
## elpd_loo -21702.3  589.5
## p_loo       230.7  180.2
## looic     43404.7 1178.9
## ------
## Monte Carlo SE of elpd_loo is 0.3.
##
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     3057  100.0%  1
##  (0.5, 0.7]   (ok)          1    0.0%  77
##    (0.7, 1]   (bad)         0    0.0%  <NA>
##    (1, Inf)   (very bad)    0    0.0%  <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.

We are ready to compare the models.

loo_compare(loo_stroop_log, loo_stroop_normal)
##                   elpd_diff se_diff
## fit_stroop            0.0       0.0
## fit_stroop_normal -1843.6     533.8

Here cross-validation shows a clear advantage for the model with the log-normal likelihood. We visualize the pointwise predictive accuracy in Figure 16.2.

df_stroop <- mutate(df_stroop,
diff_elpd = loo_stroop_log$pointwise[, "elpd_loo"] - loo_stroop_normal$pointwise[, "elpd_loo"]
)
ggplot(df_stroop, aes(x = RT, y = diff_elpd)) +
geom_point(alpha = .4) + xlab("RC (ms)") FIGURE 16.2: The difference in predictive accuracy between a Stroop model with a log-normal likelihood and a model with a normal likelihood. A larger (more positive) difference indicates an advantage for the model with the log-normal likelihood.

Figure 16.2 shows that at first glance, the advantage of the log-normal likelihood lseems to lie in being able to capture extremely slow observations.

Figure 16.3 zooms in to visualize the pointwise predictive accuracy for observations with response times smaller than 2 seconds.

ggplot(df_stroop, aes(x = RT, y = diff_elpd)) +
geom_point(alpha = .3) +
xlab("RC (ms)") +
geom_hline(yintercept = 0, linetype = "dashed")+
coord_cartesian(xlim = c(0, 2000), ylim = c(-10, 10)) FIGURE 16.3: The difference in predictive accuracy between a Stroop model with a log-normal likelihood and a model with a normal likelihood for observations smaller than 2 seconds. A larger (more positive) difference indicates an advantage for the model with the log-normal likelihood.

Figure 16.3 suggests that the advantage of the log-normal likelihood seems to lie in being able to account for most of the observations in the data set, which occur around the $$500$$ ms.

## 16.5 Issues with cross-validation

Sivula, Magnusson, and Vehtari (2020) analyzed the behavior of the uncertainty estimate of the $$elpd$$ in typical situations. Although they focus on LOO-CV, the consequences are the same for K-fold-CV (and cross-validation in a non-Bayesian context). Sivula, Magnusson, and Vehtari (2020) identified three cases where the uncertainty estimates can perform badly:

1. The models make very similar predictions.
2. The number of observations is small.
3. The models are misspecified with outliers (influential extreme values) in the data.

When the models make similar predictions (as it is the case with nested models that we saw in earlier model comparisons) and when there is not much difference in the predictive performance of the models, the uncertainty estimates will behave badly. In these situations, cross-validation is not very useful for separating very small effect sizes from zero effect sizes. In addition, small differences in the predictive performance cannot reliably be detected by cross-validation if the number of observations is small (say, around 100 observations). However, if the predictions are very similar, Sivula, Magnusson, and Vehtari (2020) show that the same problems persist even with a larger data set.

One of the issues that cross-validation methods face when they are used to compare nested models lies in the way that the exact $$elpd$$ is approximated: In cross-validation approximations, out-of-sample observations are used, which are not part of the model that that was fit. Every time we evaluate the predictive accuracy of an observation, we ignore modeling assumptions. One of the weaknesses of cross-validation is the high variance in the approximation of the integral over the unknown true data distribution, $$p_t$$ (Vehtari and Ojanen 2012, sec. 4).

Cross-validation methods are sometimes criticized because when a lot of data are available, they will give undue preference to the complex model in comparison to a true simpler model (Gronau and Wagenmakers 2018). This might be true for toy examples where we can have unlimited observations and we compare a “wrong” model with the true model.46 However, the problems that we face in practice are often very different: This is because the true model is unknown and very likely not under consideration in our comparison (see Navarro 2019). In our experience, we are very far from the asymptotic behavior of cross-validation whereby it gives undue preference to a more complex model in comparison to a true simpler model. The main weakness of cross-validation lies in its lack of assumptions, which prevents it from selecting a more complex model rather than a simple one when there is only a modest gain in predictions (Vehtari, Simpson, et al. 2019).

An alternative to the cross-validation approach discussed here for nested models is the projection predictive method (Piironen, Paasiniemi, and Vehtari 2020). However, this approach (which is less general since it is valid only for generalized linear models) has a somewhat different objective. In the projection predictive method, we first build the most complete predictive model, the reference model, and then we look for a simpler model that gives as similar predictions as the reference model. The idea is that for a given complexity (number of predictors), the model with the smallest predictive discrepancy to the reference model should be selected. See https://github.com/stan-dev/projpred for an implementation of this approach. Thus this approach focuses on model simplification rather than on model comparison.

For models that are badly misspecified, the bias in the uncertainty makes their comparison unreliable as well. In this case, posterior predictive checks and possible model refinements are worth considering before carrying out model comparison.

If there are a large number of observations and the models under consideration are different enough from each other, the differences in predictive accuracy will dwarf the variance in the estimate of $$elpd$$, and cross-validation can be very useful (see also Piironen and Vehtari 2017). An example of this situation appeared in section 16.4. When models are very different, one advantage of cross-validation methods in comparison with the Bayes factor is that the selection of priors is less critical in cross-validation. It is sometimes hard to decide on priors that encode our knowledge for one model, and this difficulty is exacerbated when we want to assign comparable prior information to models with a different number of parameters that might be on a different scale. Given that cross-validation methods are less sensitive to prior specification, different models can be compared on the same footing. See Nicenboim and Vasishth (2018) for an example from psycholinguistics where K-fold-CV does help in distinguishing between models.

## 16.6 Cross-validation in Stan

We can also use PSIS-LOO-CV and K-fold-CV with our Stan models, but we should be careful to store the appropriate log-likelihood in the generated quantities block.

### 16.6.1 PSIS-LOO-CV in Stan

As explained earlier, PSIS-LOO (as implemented in the package loo) approximates the likelihood of the held-out data based on the observed data: it’s faster (because only one model is fit), and it only requires a minimal modification of the Stan code we need to fit a model. By default, Stan only saves the sum of the log likelihood of each observation (in the parameter lp__). If we want to store the log-likelihood of each observation, we have to do this in the generated quantities block.

We revisit the model implemented in section 10.4.2, which was evaluated using the Bayes factor in chapter 15. Now, we want to compare the predictive performance of a model that assumes an effect of attentional load on pupil size against a similar model that assumes no effect. To do this, we assume the following likelihood:

$\begin{equation} p\_size_n \sim \mathit{Normal}(\alpha + c\_load_n \cdot \beta_1 + c\_trial \cdot \beta_2 + c\_load \cdot c\_trial \cdot \beta_3, \sigma) \end{equation}$

Define priors for all the $$\beta$$’s as before:

\begin{equation} \begin{aligned} \alpha &\sim \mathit{Normal}(1000, 500) \\ \beta_{\{1,2,3\}} &\sim \mathit{Normal}(0, 100) \\ \sigma &\sim \mathit{Normal}_+(0, 1000) \end{aligned} \end{equation}

Prepare the data as in section 10.4.2:

df_pupil <- df_pupil %>%
ls_pupil <- list(
c_load = df_pupil$c_load, c_trial= df_pupil$c_trial,
p_size = df_pupil$p_size, N = nrow(df_pupil) ) Add a generated quantities block to the model shown below. (It is also possible to run this block in a stand-alone file with the rstan function gqs()). If we use the variable name log_lik in the Stan code, the loo package will know where to find the log likelihood of the observations. Code the effects as beta1, beta2, beta3 to more easily compare the model with the one used in the BF chapter, but in this case we could have used a vector or an array instead. This is the model pupil_null.stan shown below: data { int<lower = 1> N; vector[N] c_load; vector[N] c_trial; vector[N] p_size; } parameters { real alpha; real<lower = 0> beta1; real beta2; real beta3; real<lower = 0> sigma; } model { // priors including all constants target += normal_lpdf(alpha | 1000, 500); target += normal_lpdf(beta1 | 0, 100); target += normal_lpdf(beta2 | 0, 100); target += normal_lpdf(beta3 | 0, 100); target += normal_lpdf(sigma | 0, 1000) - normal_lccdf(0 | 0, 1000); target += normal_lpdf(p_size | alpha + c_load * beta1 + c_trial * beta2 + c_load .* c_trial * beta3, sigma); } generated quantities{ array[N] real log_lik; for (n in 1:N){ log_lik[n] = normal_lpdf(p_size[n] | alpha + c_load[n] * beta1 + c_trial[n] * beta2 + c_load[n] * c_trial[n] * beta3, sigma); } } For the null model, just omit the term with beta1 in both the model block and the generated quantities block. This is the model pupil_model_cv.stan shown below: data { int<lower = 1> N; vector[N] c_load; vector[N] c_trial; vector[N] p_size; } parameters { real alpha; real beta2; real beta3; real<lower = 0> sigma; } model { target += normal_lpdf(alpha | 1000, 500); target += normal_lpdf(beta2 | 0, 100); target += normal_lpdf(beta3 | 0, 100); target += normal_lpdf(sigma | 0, 1000) - normal_lccdf(0 | 0, 1000); target += normal_lpdf(p_size | alpha + c_trial * beta2 + c_load .* c_trial * beta3, sigma); } generated quantities{ array[N] real log_lik; for (n in 1:N){ log_lik[n] = normal_lpdf(p_size[n] | alpha + c_trial[n] * beta2 + c_load[n] * c_trial[n] * beta3, sigma); } } The models can be found in the bcogsci package: pupil_model_cv <- system.file("stan_models", "pupil_model_cv.stan", package = "bcogsci") pupil_null <- system.file("stan_models", "pupil_null.stan", package = "bcogsci") Fit the models: fit_pupil_int_pos_ll <- stan( file = pupil_model_cv, iter = 3000, data = ls_pupil ) fit_pupil_int_null_ll <- stan( file = pupil_null, iter = 3000, data = ls_pupil ) Show summary of predictive accuracy of the models using the function loo. Unlike brms, the package loo is configured to warn the user if $$\hat{k} > 0.5$$ (rather than $$\hat{k} > 0.5$$). In practice, however, PSIS-LOO-CV has good performance for values of $$\hat{k}$$ up to $$0.7$$ and the pointwise $$\widehat{elpd}$$ values with these associated $$\hat{k}$$ can still be used (Vehtari, Gelman, and Gabry 2017b). (loo_pos <- loo(fit_pupil_int_pos_ll)) ## ## Computed from 6000 by 41 log-likelihood matrix ## ## Estimate SE ## elpd_loo -258.7 5.6 ## p_loo 3.4 1.0 ## looic 517.4 11.2 ## ------ ## Monte Carlo SE of elpd_loo is 0.0. ## ## All Pareto k estimates are good (k < 0.5). ## See help('pareto-k-diagnostic') for details. (loo_null <- loo(fit_pupil_int_null_ll)) ## ## Computed from 6000 by 41 log-likelihood matrix ## ## Estimate SE ## elpd_loo -258.2 5.6 ## p_loo 3.2 1.0 ## looic 516.3 11.2 ## ------ ## Monte Carlo SE of elpd_loo is 0.0. ## ## All Pareto k estimates are good (k < 0.5). ## See help('pareto-k-diagnostic') for details. loo_compare(loo_pos, loo_null) ## elpd_diff se_diff ## model2 0.0 0.0 ## model1 -0.5 0.4 As it happened with the cloze probability effect in the previous section, we cannot decide which model has better predictive accuracy according to PSIS-LOO. #### 16.6.1.1 K-fold-CV in Stan If we want to use K-fold-CV (or LOGO-CV) in Stan (as opposed to PSIS-LOO), we need to be careful to store the log-likelihood of the held-out data, since we evaluate our model with only this subset of the data. The following example closely follows the vignette https://cran.r-project.org/web/packages/loo/vignettes/loo2-elpd.html. The steps taken are as follows: 1. Split the data in 10 folds. Since there is only one subject, we don’t need to stratify (using kfold_split_stratified), and we use kfold_split_random from the loo package. df_pupil$fold <- kfold_split_random(K = 10, N = nrow(df_pupil))
# Show number of obs for each fold:
df_pupil %>%
group_by(fold) %>%
count() %>%
print(n=10)
## # A tibble: 10 × 2
## # Groups:   fold 
##     fold     n
##    <int> <int>
##  1     1     4
##  2     2     4
##  3     3     4
##  4     4     4
##  5     5     4
##  6     6     4
##  7     7     4
##  8     8     4
##  9     9     4
## 10    10     5
1. Fit and extract the log pointwise predictive densities for each fold.

Compile the alternative and the null models first with stan_model, and prepare two matrices to store the predictive densities from the held out data. Each matrix has as many rows as post-warmup iterations we’ll produce ($$3000 \times 2$$), and as many columns as observations in the data set.

pupil_stanmodel <- stan_model(pupil_model_cv)
pupil_null_stanmodel <- stan_model(pupil_null)
log_pd_kfold <- matrix(nrow = 6000, ncol = nrow(df_pupil))
log_pd_null_kfold <- matrix(nrow = 6000, ncol = nrow(df_pupil))

Next, loop over the 10 folds. Each loop carries out the following steps. First, fit each model (i.e., the alternative and null model) to all the observations except the ones belonging to the held-out fold using sampling(); this uses the already-compiled models. Second, compute the log pointwise predictive densities for the held-out fold with gqs(). This function produces generated quantities based on samples from a posterior (in the draw argument) and ignores all the blocks except generated quantities.47 Finally, store the predictive density for the observations of the held-out fold in a matrix by extracting the log likelihood of the held-out data. The output of this loop is a matrix of the log pointwise predictive densities of all the observations.

# Loop over the folds
for(k in 1:10){
# Training set for k
df_pupil_train <- df_pupil %>%
filter(fold != k)
ls_pupil_train <- list(
c_load = df_pupil_train$c_load, c_trial= df_pupil_train$c_trial,
p_size = df_pupil_train$p_size, N = nrow(df_pupil_train) ) # Held out set for k df_pupil_ho <- df_pupil %>% filter(fold == k) ls_pupil_ho <- list( c_load = df_pupil_ho$c_load,
c_trial= df_pupil_ho$c_trial, p_size = df_pupil_ho$p_size,
N = nrow(df_pupil_ho)
)
# Train the models
fit_train <- sampling(pupil_stanmodel,
iter = 3000,
data = ls_pupil_train)
fit_null_train <- sampling(pupil_null_stanmodel,
iter = 3000,
data = ls_pupil_train)
# Generated quantities based on the posterior from the training set
# and the data from the held out set
gq_ho <- gqs(pupil_stanmodel,
draws = as.matrix(fit_train),
data = ls_pupil_ho)
gq_null_ho <- gqs(pupil_null_stanmodel,
draws = as.matrix(fit_null_train),
data = ls_pupil_ho)
# Extract log likelihood which represents
# the pointwise predictive density
log_pd_kfold[, df_pupil$fold == k] <- extract_log_lik(gq_ho) log_pd_null_kfold[, df_pupil$fold == k] <-
extract_log_lik(gq_null_ho)
}
1. Compute K-fold $$\widehat{elpd}$$.

Now we evaluate the predictive performance of the model on the 10 folds using elpd().

(elpd_pupil_kfold <- elpd(log_pd_kfold))
##
## Computed from 6000 by 41 log-likelihood matrix using the generic elpd function
##
##      Estimate   SE
## elpd   -259.7  6.0
## ic      519.5 12.0
(elpd_pupil_null_kfold <- elpd(log_pd_null_kfold))
##
## Computed from 6000 by 41 log-likelihood matrix using the generic elpd function
##
##      Estimate   SE
## elpd   -259.6  6.0
## ic      519.1 12.1
1. Compare the $$\widehat{elpd}$$ estimates.
loo_compare(elpd_pupil_kfold, elpd_pupil_null_kfold)
##        elpd_diff se_diff
## model2  0.0       0.0
## model1 -0.2       0.5

As with PSIS-LOO, we cannot decide which model has better predictive accuracy according to K-fold-CV.

## 16.7 Summary

In this chapter, we learned how to use K-fold cross-validation and leave-one-out cross-validation, using both built-in functionality in brms as well as Stan, in conjunction with the loo package. We saw an example of model comparison where cross-validation helped distinguish between the two models (log-normal vs. normal likelihood), and another example where no important differences were found between the models being compared (the N400 data with cloze probability as predictor). In general, cross-validation will be helpful when comparing rather different models (for an example from psycholinguistics, see Nicenboim and Vasishth 2018); when the models are highly similar, it will be difficult to distinguish between them. In particular, for typical psychology and linguistics data sets, it will be difficult to get conclusive results from model comparisons using cross-validation that aim to find evidence for the presence of a population-level (or fixed) effect, if the effect is very small and/or the data are relatively sparse (this is often the case, especially in psycholinguistic data). In such cases, if the aim is to find evidence for a theoretical claim, other model comparison methods like Bayes factors might be more meaningful.

A technical discussion about cross-validation methods can be found in Chapter 7 of Gelman et al. (2014). For a discussion about the advantages and disadvantages of (leave-one-out) cross-validation, see Gronau and Wagenmakers (2018), Vehtari, Simpson, et al. (2019) and Gronau and Wagenmakers (2019). A LOO glossary from loo package can be found in (https://mc-stan.org/loo/reference/loo-glossary.html). Cross-validation is still an active area of research, there are multiple websites and blog posts on this topic: Aki Vehtari, the creator of the loo package has a comprehensive FAQ about cross-validation in https://avehtari.github.io/modelselection/CV-FAQ.html; on Andrew Gelman’s blog, he also discusses the situations where cross-validation can be applied: https://statmodeling.stat.columbia.edu/2018/08/03/loo-cross-validation-approaches-valid/.

## 16.9 Exercises

Exercise 16.1 Predictive accuracy of the linear and the logarithm effect of cloze probability.

Is there a difference in predictive accuracy between the model that incorporates a linear effect of cloze probability and one that incorporates log-transformed cloze probabilities?

Exercise 16.2 Lognormal model

Use PSIS-LOO to compare a model of Stroop as the one in 11.1 with a model that assumes no population-level effect

1. in brms.
2. in Stan.

Exercise 16.3 Lognormal vs rec-normal model in Stan

In section 12.1, we proposed a reciprocal truncated normal distribution (rec-normal) to response times data, as an alternative to the log-normal distribution. The log-likelihood (of $$\mu$$ and $$\sigma$$) of an individual observation, $$\mathit{RT}_{n}$$, for the rec-normal distribution would be the following one.

$\begin{equation} \log \mathcal{L} = \log(\mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma)) - 2 \cdot \log(\mathit{RT}_n) \end{equation}$

As explained in 12.1, we obtain the log-likelihood based on all the $$N$$ observations by summing the log-likelihood of individual observations.

$\begin{equation} \log \mathcal{L} = \sum_n^N \log(\mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma)) - \sum_n^N 2 \cdot \log(\mathit{RT}_n) \end{equation}$

Since these two models assume right-skewed data with only positive values, the question that we are interested in here is if we can really distinguish between them. Investigate this in the following way:

1. Generate data (N = 100 and N = 1000) with a rec-normal distribution (e.g., rt = 1 / rtnorm(N, mu, sigma, a = 0)).
2. Generate data (N = 100 and N = 1000) with a log-normal distribution

Fit a rec-normal and a log-normal model using Stan to each of the four datasets, and use PSIS-LOO to compare the models.

What do you conclude?

### References

Bernardo, José M, and Adrian FM Smith. 2009. Bayesian Theory. Vol. 405. John Wiley & Sons.

Geisser, Seymour, and William F Eddy. 1979. “A Predictive Approach to Model Selection.” Journal of the American Statistical Association 74 (365). Taylor & Francis Group: 153–60.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. Third Edition. Boca Raton, FL: Chapman; Hall/CRC Press.

Gneiting, Tilmann, and Adrian E Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102 (477). Taylor & Francis: 359–78. https://doi.org/10.1198/016214506000001437.

Good, I. J. 1952. “Rational Decisions.” Journal of the Royal Statistical Society. Series B (Methodological) 14 (1). [Royal Statistical Society, Wiley]: 107–14. http://www.jstor.org/stable/2984087.

Gronau, Quentin F., and Eric-Jan Wagenmakers. 2018. “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection.” Computational Brain & Behavior. https://doi.org/10.1007/s42113-018-0011-7.

Gronau, Quentin F., and Eric-Jan Wagenmakers. 2018. “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection.” Computational Brain & Behavior. https://doi.org/10.1007/s42113-018-0011-7.

2019. “Rejoinder: More Limitations of Bayesian Leave-One-Out Cross-Validation.” Computational Brain & Behavior 2 (1). Springer: 35–47.

Lunn, David, Chris Jackson, David J Spiegelhalter, Nicky Best, and Andrew Thomas. 2012. The BUGS Book: A Practical Introduction to Bayesian Analysis. Vol. 98. CRC Press.

Navarro, Danielle J. 2019. “Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection.” Computational Brain & Behavior 2 (1): 28–34. https://doi.org/10.1007/s42113-018-0019-z.

Nicenboim, Bruno, and Shravan Vasishth. 2018. “Models of Retrieval in Sentence Comprehension: A Computational Evaluation Using Bayesian Hierarchical Modeling.” Journal of Memory and Language 99: 1–34. https://doi.org/10.1016/j.jml.2017.08.004.

Paananen, Topi, Juho Piironen, Paul-Christian Bürkner, and Aki Vehtari. 2021. “Implicitly Adaptive Importance Sampling.” Statistics and Computing 31 (2). Springer Science; Business Media LLC. https://doi.org/10.1007/s11222-020-09982-2.

Piironen, Juho, Markus Paasiniemi, and Aki Vehtari. 2020. “Projective inference in high-dimensional problems: Prediction and feature selection.” Electronic Journal of Statistics 14 (1). Institute of Mathematical Statistics; Bernoulli Society: 2155–97. https://doi.org/10.1214/20-EJS1711.

Piironen, Juho, and Aki Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model Selection.” Statistics and Computing 27 (3): 711–35. https://doi.org/10.1007/s11222-016-9649-y.

Sivula, Tuomas, Måns Magnusson, and Aki Vehtari. 2020. “Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison.”

Vehtari, Aki, and Andrew Gelman. 2015. “Pareto Smoothed Importance Sampling.” arXiv Preprint arXiv:1507.02646.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017b. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.

Vehtari, Aki, and Janne Ojanen. 2012. “A Survey of Bayesian Predictive Methods for Model Assessment, Selection and Comparison.” Statistical Surveys 6 (0). Institute of Mathematical Statistics: 142–228. https://doi.org/10.1214/12-ss102.

Vehtari, Aki, Daniel P. Simpson, Yuling Yao, and Andrew Gelman. 2019. “Limitations of ‘Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection’.” Computational Brain & Behavior 2 (1): 22–27. https://doi.org/10.1007/s42113-018-0020-6.

Venables, William N., and Brian D. Ripley. 2002. Modern Applied Statistics with S-PLUS. New York: Springer.

1. Maximizing the $$elpd$$ in (16.4) is also equivalent to minimizing the Kullback–Leibler (KL) divergence from the true data generating distribution $$p_t(y_{pred})$$ to the posterior predictive distribution of the candidate model $$\mathcal{M}_1$$.

2. The double use of the data is also a problem when one relies on information criteria like the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).

3. If the true model is under consideration among the models being compared, we are under an $$M_{closed}$$ scenario. However, this is rarely realistic. The most common case is an $$M_{open}$$ scenario (Bernardo and Smith 2009), where the true model is not included in the set of models being compared.

4. The reader using cmdstanr rather than rstan might find a cryptic error here. This is because cmdstanr expects the parameters not to change. A workaround can be found in https://discourse.mc-stan.org/t/generated-quantities-returns-error-mismatch-between-model-and-fitted-parameters-csv-file/17869/15