5.2 A hierarchical model with a normal likelihood: The N400 effect

Event-related potentials (ERPs) allow scientists to observe electrophysiological responses in the brain, measured by means of electroencephalography (EEG), that are time-locked to a specific event (i.e., the presentation of the stimuli). A very robust ERP effect in the study of language is the N400: words with low predictability are accompanied by an N400 effect in comparison with highly predictable words, a relative negativity that peaks around 300-500 ms after word onset over central parietal scalp sites (first reported by Kutas and Hillyard 1980 for semantic anomalies, and in 1984 for low-predictability words; for a review, see Kutas and Federmeier 2011). The N400 is illustrated in Figure 5.4.


FIGURE 5.4: Typical ERP for the grand average across the N400 spatial window (central parietal electrodes: Cz, CP1, CP2, P3, Pz, P4, POz) for high and low predictability nouns (specifically from the constraining context of the experiment reported in Nicenboim, Vasishth, and Rösler 2020a). The x-axis indicates time in seconds and the y-axis indicates voltage in microvolts (unlike many EEG/ERP plots, the negative polarity is plotted downwards).

For example, in (1) below, the continuation ‘paint’ has higher predictability than the continuation ‘dog’, and thus we would expect a more negative signal, that is, an N400 effect, for ‘dog’ in (1b) in comparison with ‘paint’ in (1a). Predictability is often measured with a cloze task (see section 1.4).

  1. Example from Kutas and Hillyard (1984)
    a. Don’t touch the wet paint.
    b. Don’t touch the wet dog.

The EEG data are typically recorded from tens of electrodes every couple of milliseconds, but for our purposes (i.e., for learning about Bayesian hierarchical models), we can safely ignore this complexity. A common way to simplify the high-dimensional EEG data when dealing with the N400 is to focus on the average amplitude of the EEG signal in the typical spatio-temporal window of the N400 (for example, see Frank et al. 2015).

For this example, we are going to focus on the N400 effect for critical nouns in a subset of the data of Nieuwland et al. (2018). Nieuwland et al. (2018) presented a replication attempt of the original experiment of DeLong, Urbach, and Kutas (2005), which used sentences like (2).

  2. Example from DeLong, Urbach, and Kutas (2005)
    a. The day was breezy so the boy went outside to fly a kite.
    b. The day was breezy so the boy went outside to fly an airplane.

We’ll ignore the goal of the original experiment (DeLong, Urbach, and Kutas 2005) and of its replication (Nieuwland et al. 2018), and focus on the N400 at the final nouns in the experimental stimuli. In (2), the final noun ‘kite’ has higher predictability than ‘airplane’, and thus we would expect a more negative signal for ‘airplane’ in (2b) in comparison with ‘kite’ in (2a).

To speed up computation, we restrict the data set to the subjects from the Edinburgh lab, using the relevant subset of the Edinburgh data from df_eeg in the bcogsci package. We center the cloze probability before using it as a predictor.
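A minimal sketch of the data preparation follows; it assumes that df_eeg in bcogsci already contains only the Edinburgh subset.

library(dplyr)
library(bcogsci)
data("df_eeg")
# Center the cloze probability:
df_eeg <- df_eeg %>%
  mutate(c_cloze = cloze - mean(cloze))
df_eeg
# Number of subjects:
df_eeg %>%
  distinct(subj) %>%
  count()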

## # A tibble: 2,863 × 7
##    subj cloze  item  n400 cloze_ans     N c_cloze
##   <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl>   <dbl>
## 1     1  0        1  7.08         0    44  -0.476
## 2     1  0.03     2 -0.68         1    44  -0.446
## 3     1  1        3  1.39        44    44   0.524
## # … with 2,860 more rows
## # A tibble: 1 × 1
##       n
##   <int>
## 1    37

One convenient aspect of using averages of EEG data is that they are roughly normally distributed. This allows us to use the Normal likelihood. Figure 5.5 shows the distribution of the data.
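A sketch of such a plot with ggplot2 is shown below; the bin width is an arbitrary choice.

library(ggplot2)
ggplot(df_eeg, aes(n400)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 4, colour = "gray", fill = "lightgray") +
  stat_function(fun = dnorm,
                args = list(mean = mean(df_eeg$n400),
                            sd = sd(df_eeg$n400))) +
  xlab("Average signal in the N400 spatiotemporal window (microvolts)")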


FIGURE 5.5: Histogram of the N400 averages for every trial, overlaid is a density plot of a normal distribution.

5.2.1 Complete pooling model (\(M_{cp}\))

We’ll start with the simplest model, which is basically the linear regression we encountered in the preceding chapter.

5.2.1.1 Model assumptions

This model, call it \(M_{cp}\), makes the following assumptions.

  1. The EEG averages for the N400 spatiotemporal window are normally distributed.
  2. Observations are independent.
  3. There is a linear relationship between cloze and the EEG signal for the trial.

This model is incorrect for these data due to assumption (2) being violated.

With the last assumption, we are saying that the difference in the average signal when we compare nouns with cloze probabilities of 0 and 0.1 is the same as the difference when we compare nouns with cloze values of 0.1 and 0.2 (or 0.9 and 1). This is just an assumption, and it may not hold in the actual data. It means that we will obtain a posterior for \(\beta\) that is conditional on the assumption that the linear relationship holds. Even if the relationship holds approximately, we still don’t know how much the data deviate from this assumption.

We can now decide on a likelihood and priors.

5.2.1.2 Likelihood and priors

A normal likelihood seems reasonable for these data:

\[\begin{equation} signal_n \sim \mathit{Normal}( \alpha + c\_cloze_n \cdot \beta,\sigma) \tag{5.1} \end{equation}\]

where \(n =1, \ldots, N\), and \(signal\) is the dependent variable (average signal in the N400 spatiotemporal window in microvolts). The variable \(N\) represents the total number of data points.

As always, we need to rely on our previous knowledge and domain expertise to decide on priors. We know that ERPs (signals time-locked to a stimulus) have mean amplitudes of a couple of microvolts: this is easy to see in any plot in the EEG literature. This means that we don’t expect the effect of our manipulation to exceed, say, \(10 \mu V\). As before, a priori we’ll assume that effects can be negative or positive. We can quantify our prior knowledge regarding plausible values of \(\beta\) as a normal distribution centered at zero with a standard deviation of \(10 \mu V\). (Other values, such as \(5 \mu V\), would also have been reasonable, since a standard deviation of \(5 \mu V\) entails that 95% of the prior probability mass lies between \(-10\) and \(10\) \(\mu V\).)

If the signal for each ERP is baselined, that is, if the mean signal of a time window before the window of interest is subtracted from the window of interest, then the mean signal will be relatively close to 0. Since we know that the ERPs were baselined in this study, we expect the grand mean of our signal to be relatively close to zero. Our prior for \(\alpha\) is then a normal distribution centered at zero with a standard deviation of 10 \(\mu V\).

The standard deviation of our signal distribution is harder to guess. We know that EEG signals are quite noisy, and that the standard deviation must be higher than zero. Our prior for \(\sigma\) is a truncated normal distribution with location zero and scale 50. Recall that since we truncate the distribution, the parameters location and scale do not correspond to the mean and standard deviation of the new distribution; see Box 4.1.

We can draw random samples from this truncated distribution and calculate their mean and standard deviation:
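One way to do this is with rtnorm() from the extraDistr package (a sketch; any sampler for a truncated normal would do).

library(extraDistr)
samples_sigma <- rtnorm(100000, mean = 0, sd = 50, a = 0)
c(mean = mean(samples_sigma), sd = sd(samples_sigma))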

## mean   sd 
## 39.6 29.7

So we are essentially saying that, a priori, we expect the true standard deviation of the signal to lie in the following interval with 95% probability:
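Using the samples drawn above:

quantile(samples_sigma, probs = c(0.025, 0.975))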

##   2.5%  97.5% 
##   1.55 110.40

To sum up, we are going to use the following priors:

\[\begin{equation} \begin{aligned} \alpha &\sim \mathit{Normal}(0,10)\\ \beta &\sim \mathit{Normal}(0,10)\\ \sigma &\sim \mathit{Normal}_{+}(0,50) \end{aligned} \end{equation}\]

A model such as \(M_{cp}\) is sometimes called a fixed-effects model: all the parameters are fixed in the sense that they do not vary from subject to subject or from item to item. A similar frequentist model would correspond to fitting a simple linear model with the lm function: lm(n400 ~ 1 + c_cloze, data = df_eeg).

We fit this model in brms as follows (the default family is gaussian() so we can omit it). As with the lm function in R, by default an intercept is fitted and thus n400 ~ c_cloze is equivalent to n400 ~ 1 + c_cloze:
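A sketch of the call, with the priors listed above:

library(brms)
fit_N400_cp <- brm(n400 ~ c_cloze,
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b, coef = c_cloze),
    prior(normal(0, 50), class = sigma)
  ),
  data = df_eeg
)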

For now, check the summary, and plot the posteriors of the model (Figure 5.6).
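Printing the fitted object shows the summary, and plot() displays the posteriors (Figure 5.6 shows only the densities):

fit_N400_cp
plot(fit_N400_cp)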

## ...
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     3.66      0.22     3.23     4.09 1.00     4060     2945
## c_cloze       2.25      0.54     1.15     3.33 1.00     3906     2929
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    11.82      0.16    11.50    12.14 1.00     3873     2881
## 
## ...

FIGURE 5.6: Posterior distributions of the complete pooling model, fit_N400_cp.

5.2.2 No pooling model (\(M_{np}\))

One of the assumptions of the previous model is clearly wrong: observations are not independent; they are clustered by subject (and also by item, but we’ll ignore this until section 5.2.4). It is reasonable to assume that EEG signals are more similar within subjects than between them. The following model assumes that each subject is completely independent of the others.14

5.2.2.1 Model assumptions

  1. EEG averages for the N400 spatio-temporal window are normally distributed.
  2. Every subject’s model is fit independently of the other subjects; the subjects have no parameters in common (an exception is the standard deviation; this is the same for all subjects).
  3. There is a linear relationship between cloze and the EEG signal for the trial.

What likelihood and priors can we choose here?

5.2.2.2 Likelihood and priors

The likelihood is a normal distribution as before:

\[\begin{equation} signal_n \sim \mathit{Normal}( \alpha_{subj[n]} + c\_cloze_n \cdot \beta_{subj[n]},\sigma) \end{equation}\]

As before, \(n\) represents each observation, that is, the \(n\)th row in the data frame, which has \(N\) rows; the index \(i\) (used in the priors below) identifies the subject. The notation \(subj[n]\), which roughly follows Gelman and Hill (2007), indicates the subject of observation \(n\); for example, if \(subj[10]=3\), then the \(10\)th row of the data frame is from subject \(3\).

We define the priors as follows:

\[\begin{equation} \begin{aligned} \alpha_{i} &\sim \mathit{Normal}(0,10)\\ \beta_{i} &\sim \mathit{Normal}(0,10)\\ \sigma &\sim \mathit{Normal}_+(0,50) \end{aligned} \end{equation}\]

In brms, such a model can be fit by removing the common intercept with the formula n400 ~ 0 + factor(subj) + c_cloze:factor(subj).

This formula forces the model to estimate one intercept and one slope for each level of subj.15 The by-subject intercepts are indicated with factor(subj) and the by-subject slopes with c_cloze:factor(subj). It is very important to specify that subject should be treated as a factor and not as a number; we don’t assume that subject number 3 will show an average signal 3 times more positive (or negative) than subject number 1! The model fits 37 independent intercepts and 37 independent slopes. By setting a prior on class = b and omitting coef, we are essentially setting identical priors on all the intercepts and slopes of the model. The parameters are independent of each other; it is only our previous knowledge (or prior beliefs) about their possible values, encoded in the priors, that is identical. We could set a different prior for each intercept and slope, but that would mean setting 74 priors!
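A sketch of the no-pooling fit with these priors:

fit_N400_np <- brm(n400 ~ 0 + factor(subj) + c_cloze:factor(subj),
  prior = c(
    prior(normal(0, 10), class = b),
    prior(normal(0, 50), class = sigma)
  ),
  data = df_eeg
)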

For this model, printing a summary means printing the 75 parameters (\(\alpha_{1,...,37}\), \(\beta_{1,...,37}\), and \(\sigma\)). We could do this as always by printing out the model results: just type fit_N400_np.

It may be easier to understand the output of the model by plotting \(\beta_{1,\ldots,37}\) using bayesplot (brms also includes a wrapper for this family of functions, mcmc_plot()). We can take a look at the internal names that brms gives to the parameters with variables(fit_N400_np); they consist of b_factorsubj, followed by the subject index, and then :c_cloze. The code shown further below changes the subject labels back to their original numerical indices and plots the effects in Figure 5.7, with the subjects ordered by the magnitude of their mean effects.

The model \(M_{np}\) does not estimate a unique population-level effect; instead, there is a different effect estimated for each subject. However, given the posterior means from each subject, it is still possible to calculate the average of these estimates \(\hat\beta_{1,...,I}\), where \(I\) is the total number of subjects:
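One way to compute this average is sketched below; it assumes the b_factorsubjN:c_cloze naming pattern mentioned earlier and uses the posterior package.

library(posterior)
draws_np <- as_draws_matrix(fit_N400_np)
# The 37 by-subject slopes:
slope_cols <- grep("^b_factorsubj.*:c_cloze$", colnames(draws_np), value = TRUE)
# Average the slopes within each posterior draw, then summarize:
beta_across_subj <- rowMeans(draws_np[, slope_cols])
grand_mean_np <- tibble(mean = mean(beta_across_subj),
                        lq = quantile(beta_across_subj, 0.025),
                        hq = quantile(beta_across_subj, 0.975))
grand_mean_np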

## # A tibble: 1 × 3
##    mean    lq    hq
##   <dbl> <dbl> <dbl>
## 1  2.18  1.17  3.17

In Figure 5.7, the 95% credible interval of this overall mean effect is plotted as two vertical lines together with the effect of cloze probability for each subject (ordered by effect size). Here, rather than using a plotting function from brms, we can extract the summary of by-subject effects, reorder them by magnitude, and then plot the summary with a custom plot using ggplot2.

The code below generates Figure 5.7.
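A sketch of such a plot, building on grand_mean_np computed above (the exact aesthetics of Figure 5.7 may differ):

library(ggplot2)
library(tibble)
# Posterior summaries of the 37 by-subject slopes, ordered by magnitude:
beta_by_subj <- fixef(fit_N400_np) %>%
  as.data.frame() %>%
  rownames_to_column("param") %>%
  filter(grepl(":c_cloze", param)) %>%
  mutate(subj = gsub("factorsubj|:c_cloze", "", param)) %>%
  arrange(Estimate) %>%
  mutate(subj = factor(subj, levels = subj))
ggplot(beta_by_subj,
       aes(x = Estimate, xmin = Q2.5, xmax = Q97.5, y = subj)) +
  geom_point() +
  geom_errorbarh() +
  geom_vline(xintercept = grand_mean_np$mean) +
  geom_vline(xintercept = c(grand_mean_np$lq, grand_mean_np$hq),
             linetype = "dashed") +
  xlab("Effect of cloze probability in microvolts")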


FIGURE 5.7: 95% credible intervals of the effect of cloze probability for each subject according to the no pooling model, fit_N400_np. The solid vertical line represents the mean over all the subjects; and the broken vertical lines mark the 95% credible interval for this mean.

5.2.3 Varying intercepts and varying slopes model (\(M_{v}\))

One major problem with the no-pooling model is that it completely ignores the fact that all the subjects were doing the same experiment: we fit each subject’s data ignoring the information available in the other subjects’ data. The no-pooling model is therefore very likely to overfit the individual subjects’ data; we ignore the generalities in the data and may end up overinterpreting noisy estimates from each subject’s data. The model can be modified to explicitly assume that the subjects share an overall effect, with the individual subjects deviating from this common effect.

In the model that we fit next, we will assume that there is an overall effect that is common to the subjects and, importantly, that all subjects’ parameters originate from one common (normal) distribution. With this specification, the posterior estimated for each subject is also informed by what we know about all the subjects together. We begin with a hierarchical model with uncorrelated varying intercepts and slopes.16

5.2.3.1 Model assumptions

  1. EEG averages for the N400 spatio-temporal window are normally distributed.
  2. Each subject deviates to some extent (this is made precise below) from the grand mean and from the mean effect of predictability. This implies that there is some between-subject variability in the individual-level intercept and slope adjustments by subject.
  3. There is a linear relationship between cloze and the EEG signal.

5.2.3.2 Likelihood and priors

The likelihood now incorporates the assumption that both the intercept and slope are adjusted by subject.

\[\begin{equation} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot (\beta+ u_{subj[n],2}),\sigma) \end{equation}\]

\[\begin{equation} \begin{aligned} \alpha &\sim \mathit{Normal}(0,10)\\ \beta &\sim \mathit{Normal}(0,10)\\ u_1 &\sim \mathit{Normal}(0,\tau_{u_1})\\ u_2 &\sim \mathit{Normal}(0,\tau_{u_2})\\ \tau_{u_1} &\sim \mathit{Normal}_+(0,20) \\ \tau_{u_2} &\sim \mathit{Normal}_+(0,20) \\ \sigma &\sim \mathit{Normal}_+(0,50) \end{aligned} \end{equation}\]

In this model each subject has their own intercept adjustment, \(u_{subj,1}\), and slope adjustment, \(u_{subj,2}\).17 If \(u_{subj,1}\) is positive, the subject will have a more positive EEG signal than the grand mean average. If \(u_{subj,2}\) is positive, the subject will have a more positive EEG response to a change of one unit in c_cloze than the overall mean effect (i.e., there will be a more positive effect of cloze probability on the N400). The parameters \(u\) are sometimes called random effects, and thus a model with fixed effects (\(\alpha\) and \(\beta\)) and random effects is called a mixed model. However, random effects have different meanings in different contexts; to avoid ambiguity, brms calls these random-effects parameters group-level effects. Since we are estimating \(\alpha\) and \(u\) at the same time and we assume that the average of the \(u\)’s is 0 (since they are assumed to be normally distributed with mean 0), what is common between the subjects, the grand mean, is estimated as the intercept \(\alpha\), and the deviations of individual subjects’ means from this grand mean are the adjustments \(u_1\). Similarly, the mean effect of cloze is estimated as \(\beta\), and the deviations of individual subjects’ mean effects of cloze from \(\beta\) are the adjustments \(u_2\). The standard deviations of these two adjustment terms, \(\tau_{u_1}\) and \(\tau_{u_2}\), respectively, represent between-subject variability; see Box 5.2.

Thus, the model \(M_{v}\) has three standard deviations: \(\sigma\), \(\tau_{u_1}\) and \(\tau_{u_2}\). In statistics, it is conventional to talk about variances (the square of these standard deviations); for this reason, these standard deviations are also (confusingly) called variance components. The variance components \(\tau_{u_1}\) and \(\tau_{u_2}\) characterize between-subject variability, and the variance component \(\sigma\) characterizes within-subject variability.

The by-subject adjustments \(u_1\) and \(u_2\) are parameters in the model, and therefore have priors defined on them.18 Parameters that appear in the prior specifications for other parameters, such as \(\tau_u\), are often called hyperparameters,19 and the priors on such hyperparameters are called hyperpriors. Thus, the parameter \(u_1\) has \(\mathit{Normal}(0,\tau_{u_1})\) as a prior; \(\tau_{u_1}\) is a hyperparameter, and the hyperprior on \(\tau_{u_1}\) is \(\mathit{Normal}_+(0,20)\).20

We know that in general, in EEG experiments, the standard deviations for the by-subject adjustments are smaller than the standard deviation of the observations (which is the within-subjects standard deviation). That is, usually the between-subject variability in the intercepts and slopes is smaller than the within-subjects variability in the data. For this reason, reducing the scale of the truncated normal distribution to \(20\) (in comparison to \(50\)) seems reasonable for the priors of the \(\tau\) parameters. As always, we can do a sensitivity analysis to verify that our priors are reasonably uninformative (if we intended them to be uninformative).

Box 5.2 Some important (and sometimes confusing) points:
  • Why does \(u\) have a mean of \(0\)?

    Because we want \(u\) to capture only differences between subjects. We could achieve the same thing by assuming the following relationship between the likelihood and the intercept and slope:

    \[\begin{equation} \begin{aligned} signal_n &\sim \mathit{Normal}(\alpha_{subj[n]} + \beta_{subj[n]} \cdot c\_cloze_n, \sigma) \\ \alpha_i &\sim \mathit{Normal}(\alpha,\tau_{u_1})\\ \beta_i &\sim \mathit{Normal}(\beta,\tau_{u_2})\\ \end{aligned} \end{equation}\]

    In fact, this is another common way to write the model.

  • Why do the adjustments \(u\) have a normal distribution?

Mostly by convention, the adjustments \(u\) are assumed to come from a Normal distribution. Another reason is that if we don’t know anything about the distribution besides its mean and variance, the normal distribution is the most conservative assumption (see chapter 9 of McElreath 2015).

For now, we are assuming that there is no relationship (no correlation) between the by-subject intercept and slope adjustments \(u_1\) and \(u_2\); this lack of correlation is indicated in brms using the double pipe ||.21 In brms, we need to specify hyperpriors for \(\tau_{u_1}\) and \(\tau_{u_2}\); these have the class sd in brms, to distinguish these standard deviations from the standard deviation of the residuals, \(\sigma\). As with the population-level effects, the by-subject intercept adjustments are implicitly fit for the group-level effects, and thus (c_cloze || subj) is equivalent to (1 + c_cloze || subj). If we don’t want an intercept, we need to indicate this explicitly with (0 + c_cloze || subj) or (-1 + c_cloze || subj); such a removal of the intercept is not normally done.
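A sketch of the call; the double pipe in (c_cloze || subj) encodes the lack of correlation, and the sd class priors are the hyperpriors on \(\tau_{u_1}\) and \(\tau_{u_2}\).

fit_N400_v <- brm(n400 ~ c_cloze + (c_cloze || subj),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b, coef = c_cloze),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 20), class = sd, coef = Intercept, group = subj),
    prior(normal(0, 20), class = sd, coef = c_cloze, group = subj)
  ),
  data = df_eeg
)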

When we print a brms fit, we first see the summaries of the posteriors of the standard deviations of the by-group intercepts and slopes, \(\tau_{u_1}\) and \(\tau_{u_2}\), as sd(Intercept) and sd(c_cloze); then, as with previous models, the population-level effects, \(\alpha\) and \(\beta\), as Intercept and c_cloze; and the scale of the likelihood, \(\sigma\), as sigma. The full summary can be printed out by typing:
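Either of the following prints the full output:

fit_N400_v
summary(fit_N400_v)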

Because the above command will produce several pages of output, it is easier to inspect the summary graphically (Figure 5.8). Rather than the wrapper plot(), we use mcmc_dens() from the bayesplot package to show only density plots. We extract the first 6 parameters of the model with variables(fit_N400_v)[1:6].
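A sketch of the plotting call:

library(bayesplot)
mcmc_dens(fit_N400_v, pars = variables(fit_N400_v)[1:6])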


FIGURE 5.8: Posterior distributions of the parameters in the model fit_N400_v.

Because we estimated how the population-level effect of cloze is adjusted for each subject, we can examine how each subject is affected by the manipulation. The code below does this and produces Figure 5.9. Bear in mind that these are the adjustments, \(u_{1,2}, \ldots, u_{37,2}\), and not the effects of the manipulation by subject, which would be \(\beta + u_{i,2}\).
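A sketch of the extraction and plot; it relies on ranef() from brms, which returns the by-subject adjustments as an array of subjects by summary statistics by coefficients.

# By-subject slope adjustments (posterior mean and 95% CrI):
u_2 <- ranef(fit_N400_v)$subj[, , "c_cloze"] %>%
  as.data.frame() %>%
  rownames_to_column("subj") %>%
  arrange(Estimate) %>%
  mutate(subj = factor(subj, levels = subj))
ggplot(u_2, aes(x = Estimate, xmin = Q2.5, xmax = Q97.5, y = subj)) +
  geom_point() +
  geom_errorbarh() +
  xlab("Adjustment to the effect of cloze probability in microvolts")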


FIGURE 5.9: 95% credible intervals of the adjustments to the effect of cloze probability for each subject (\(u_{1,2}, \ldots, u_{37,2}\)) according to the varying intercepts and varying slopes model, fit_N400_v.

There is an important difference between the no-pooling model and the varying intercepts and slopes model that we just fit. The no-pooling model fits each subject’s intercept and slope independently of the other subjects. By contrast, the varying intercepts and slopes model takes all the subjects’ data into account in order to estimate the fixed effects \(\alpha\) and \(\beta\), and the model “shrinks” (Pinheiro and Bates 2000) the by-subject intercept and slope adjustments towards the fixed-effects estimates. In Figure 5.10, we can see the shrinkage of the estimates in the varying intercepts model by comparing them with the estimates of the no-pooling model (\(M_{np}\)). The code below produces Figure 5.10; it is more involved because it requires building a data frame with the by-subject effects (\(\beta + u_2\)).
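A sketch of that comparison is shown below; coef() returns the by-subject sums \(\beta + u_2\) for the hierarchical model, while fixef() gives the independent by-subject slopes of the no-pooling model (the ordering by effect size used in Figure 5.10 is omitted here for brevity).

# beta + u_2 for every subject from the hierarchical model:
eff_v <- coef(fit_N400_v)$subj[, , "c_cloze"] %>%
  as.data.frame() %>%
  rownames_to_column("subj") %>%
  mutate(model = "varying intercepts/slopes")
# Independent by-subject slopes from the no-pooling model:
eff_np <- fixef(fit_N400_np) %>%
  as.data.frame() %>%
  rownames_to_column("param") %>%
  filter(grepl(":c_cloze", param)) %>%
  mutate(subj = gsub("factorsubj|:c_cloze", "", param),
         model = "no pooling") %>%
  select(subj, Estimate, Est.Error, Q2.5, Q97.5, model)
bind_rows(eff_v, eff_np) %>%
  ggplot(aes(x = Estimate, xmin = Q2.5, xmax = Q97.5,
             y = subj, color = model)) +
  geom_point(alpha = .7) +
  geom_errorbarh(alpha = .7) +
  xlab("Effect of cloze probability in microvolts")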


FIGURE 5.10: Comparison of the estimates of the effect of cloze probability for each subject in (i) the no-pooling model, fit_N400_np, and (ii) the hierarchical varying intercepts and varying slopes model, fit_N400_v.

5.2.4 Correlated varying intercept varying slopes model (\(M_{h}\))

The model \(M_{v}\) allowed for differences in intercepts (mean voltage) and slopes (effects of cloze) across subjects, but it carries the implicit assumption that these varying intercepts and varying slopes are independent. It is in principle possible that subjects who show a more negative voltage overall also show stronger (or weaker) effects. Next, we fit a model that allows for a correlation between the intercepts and slopes. We model the correlation between varying intercepts and slopes by defining a variance-covariance matrix \(\boldsymbol{\Sigma}\) between the by-subject varying intercepts and slopes, and by assuming that both adjustments (intercept and slope) come from a multivariate (in this case, a bivariate) normal distribution.

  • In \(M_h\), we model the EEG data with the following assumptions:
  1. EEG averages for the N400 spatio-temporal window are normally distributed.
  2. Some aspects of the mean signal voltage and of the effect of predictability depend on the subject, and these two aspects might be correlated; i.e., we assume group-level intercepts and slopes by subject, and allow a correlation between them.
  3. There is a linear relationship between cloze and the EEG signal for the trial.

The likelihood remains identical to the model \(M_v\), which assumes no correlation between group-level intercepts and slopes (section 5.2.3):

\[\begin{equation} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot (\beta + u_{subj[n],2}),\sigma) \end{equation}\]

The correlation is indicated in the priors on the adjustments for intercept \(u_{1}\) and slopes \(u_{2}\).

  • Priors: \[\begin{equation} \begin{aligned} \alpha & \sim \mathit{Normal}(0,10) \\ \beta & \sim \mathit{Normal}(0,10) \\ \sigma &\sim \mathit{Normal}_+(0,50)\\ {\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix}} &\sim {\mathcal {N}} \left( {\begin{pmatrix} 0\\ 0 \end{pmatrix}} ,\boldsymbol{\Sigma_u} \right) \end{aligned} \end{equation}\]

In this model, a bivariate normal distribution generates the varying intercepts and varying slopes \(\mathbf{u}\); this is a matrix with one row per subject and two columns. The variance-covariance matrix \(\boldsymbol{\Sigma_u}\) defines the standard deviations of the varying intercepts and varying slopes, and the correlation between them. Recall from section 1.6.2 that the diagonal of the variance-covariance matrix contains the variances of the correlated random variables, and the off-diagonals contain the covariances. In this example, the covariance \(Cov(u_1,u_2)\) between the two variables \(u_1\) and \(u_2\) is defined as the product of their correlation \(\rho_u\) and their standard deviations \(\tau_{u_1}\) and \(\tau_{u_2}\); in other words, \(Cov(u_1,u_2) = \rho_u \tau_{u_1} \tau_{u_2}\).

\[\begin{equation} \boldsymbol{\Sigma_u} = {\begin{pmatrix} \tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\ \rho_u \tau_{u_1} \tau_{u_2} & \tau_{u_2}^2 \end{pmatrix}} \end{equation}\]

In order to specify a prior for \(\Sigma_u\), we need priors for the standard deviations \(\tau_{u_1}\) and \(\tau_{u_2}\), and also for their correlation \(\rho_u\). We can use the same priors for the \(\tau\) parameters as before. For the correlation parameter \(\rho_u\) (and for correlation matrices more generally), we use the LKJ prior. The basic idea of the LKJ prior on the correlation matrix is that as its parameter \(\eta\) (eta) increases, it increasingly favors correlations closer to zero.22 At \(\eta = 1\), the LKJ correlation distribution is uninformative (similar to \(\mathit{Beta}(1,1)\)); at \(\eta < 1\), it favors extreme correlations (similar to \(\mathit{Beta}(a<1,b<1)\)). We set \(\eta = 2\) so that we don’t favor extreme correlations, while still representing our lack of knowledge through the wide spread of the prior between \(-1\) and \(1\). Thus, \(\eta = 2\) gives us a regularizing, relatively uninformative or mildly informative prior.

Figure 5.11 shows a visualization of different parametrizations of the LKJ prior.


FIGURE 5.11: Visualization of the LKJ correlation distribution prior with four different values of the \(\eta\) parameter.

\[\begin{equation} \begin{aligned} \tau_{u_1} &\sim \mathit{Normal}_+(0,20)\\ \tau_{u_2} &\sim \mathit{Normal}_+(0,20)\\ \rho_u &\sim \mathit{LKJcorr}(2) \end{aligned} \end{equation}\]

In our brms model, we allow a correlation between the by-subject intercepts and slopes by using a single pipe | instead of the double pipe || that we used previously. This convention follows that in the frequentist lmer function. As before, the varying intercepts are implicitly fit.

Because we have a new parameter, the correlation \(\rho_{u}\), we need to add a new prior; in brms, this is achieved by adding a prior for the parameter class cor.
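A sketch of the call, now with a single pipe and an LKJ prior on the correlation:

fit_N400_h <- brm(n400 ~ c_cloze + (c_cloze | subj),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b, coef = c_cloze),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 20), class = sd, coef = Intercept, group = subj),
    prior(normal(0, 20), class = sd, coef = c_cloze, group = subj),
    prior(lkj(2), class = cor, group = subj)
  ),
  data = df_eeg
)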

The estimates do not change much in comparison with the varying intercepts and slopes model, probably because the estimate of the correlation is quite poor (i.e., there is a lot of uncertainty). As before, we show the estimates graphically (Figure 5.12). One can access the complete summary, as always, by typing fit_N400_h.


FIGURE 5.12: The posteriors of the parameters in the model fit_N400_h.

We are now half-way to what is called the maximal hierarchical model (Barr et al. 2013), because everything that we said about subjects is also relevant for items. The next section spells out this type of model.

5.2.5 By-subjects and by-items correlated varying intercept varying slopes model (\(M_{sih}\))

Our new model, \(M_{sih}\), will allow for differences in intercepts (mean voltage) and slopes (effects of predictability) across subjects and across items. In typical Latin square designs, subjects and items are said to be crossed random effects: each subject sees exactly one instance of each item. Here we assume a possible correlation between varying intercepts and slopes by subjects, and another one by items.

  • In \(M_{sih}\), we model the EEG data with the following assumptions:
  1. EEG averages for the N400 spatio-temporal window are normally distributed.
  2. Some aspects of the mean signal voltage and of the effect of predictability depend on the subject, i.e., we assume group-level intercepts and slopes, and a correlation between them, by subject.
  3. Some aspects of the mean signal voltage and of the effect of predictability depend on the item, i.e., we assume group-level intercepts and slopes, and a correlation between them, by item.
  4. There is a linear relationship between cloze and the EEG signal for the trial.
  • Likelihood:

\[\begin{multline} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} + \\ c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}), \sigma) \end{multline}\]

  • Priors: \[\begin{equation} \begin{aligned} \alpha & \sim \mathit{Normal}(0,10) \\ \beta & \sim \mathit{Normal}(0,10) \\ \sigma &\sim \mathit{Normal}_+(0,50)\\ {\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix}} &\sim {\mathcal {N}} \left( {\begin{pmatrix} 0\\ 0 \end{pmatrix}} ,\boldsymbol{\Sigma_u} \right) \\ {\begin{pmatrix} w_{j,1} \\ w_{j,2} \end{pmatrix}} &\sim {\mathcal {N}} \left( {\begin{pmatrix} 0\\ 0 \end{pmatrix}} ,\boldsymbol{\Sigma_w} \right) \end{aligned} \end{equation}\]

We have added the index \(j\), which represents each item, as we did with subjects; \(item[n]\) indicates the item that corresponds to the observation in the \(n\)-th row of the data frame.

We have hyperparameters and hyperpriors as before:

\[\begin{equation} \begin{aligned} \boldsymbol{\Sigma_u} & = {\begin{pmatrix} \tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\ \rho_u \tau_{u_1} \tau_{u_2} & \tau_{u_2}^2 \end{pmatrix}}\\ \boldsymbol{\Sigma_w} & = {\begin{pmatrix} \tau_{w_1}^2 & \rho_w \tau_{w_1} \tau_{w_2} \\ \rho_w \tau_{w_1} \tau_{w_2} & \tau_{w_2}^2 \end{pmatrix}} \end{aligned} \end{equation}\]

\[\begin{equation} \begin{aligned} \tau_{u_1} &\sim \mathit{Normal}_+(0,20)\\ \tau_{u_2} &\sim \mathit{Normal}_+(0,20)\\ \rho_u &\sim \mathit{LKJcorr}(2) \\ \tau_{w_1} &\sim \mathit{Normal}_+(0,20)\\ \tau_{w_2} &\sim \mathit{Normal}_+(0,20)\\ \rho_w &\sim \mathit{LKJcorr}(2) \\ \end{aligned} \end{equation}\]

We set identical priors for the by-item group-level effects as for the by-subject group-level effects, but this is only because we don’t have any differentiated prior information about subject-level vs. item-level variation. Bear in mind, however, that the estimation for items is completely independent of the estimation for subjects. Although we wrote many more equations than before, the brms model is quite straightforward to extend:
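A sketch of the extended call, with the by-item group-level effects added:

fit_N400_sih <- brm(n400 ~ c_cloze + (c_cloze | subj) + (c_cloze | item),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b, coef = c_cloze),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 20), class = sd, coef = Intercept, group = subj),
    prior(normal(0, 20), class = sd, coef = c_cloze, group = subj),
    prior(normal(0, 20), class = sd, coef = Intercept, group = item),
    prior(normal(0, 20), class = sd, coef = c_cloze, group = item),
    prior(lkj(2), class = cor, group = subj),
    prior(lkj(2), class = cor, group = item)
  ),
  data = df_eeg
)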

We can also simplify the call to brms when we assign the same priors to the by-subject and by-item parameters:
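For example, omitting coef and group sets one prior for each whole class:

fit_N400_sih <- brm(n400 ~ c_cloze + (c_cloze | subj) + (c_cloze | item),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 20), class = sd),
    prior(lkj(2), class = cor)
  ),
  data = df_eeg
)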

We have new group-level effects in the summary, but again the estimate of the effect of cloze remains virtually unchanged (Figure 5.13).


FIGURE 5.13: The posterior distributions of the parameters in the model fit_N400_sih.

Box 5.3 The Matrix Formulation of Hierarchical Models (the Laird-Ware form)

We have been writing linear models as follows, where \(n\) refers to the row id in the data frame:

\[\begin{equation} y_n \sim \mathit{Normal}(\alpha + \beta\cdot x_n, \sigma) \end{equation}\]

This simple linear model can be re-written as follows:

\[\begin{equation} y_n = \alpha + \beta\cdot x_n + \varepsilon_n \end{equation}\]

where \(\varepsilon_n \sim \mathit{Normal}(0, \sigma)\).

The model does not change if \(\alpha\) is multiplied by \(1\):

\[\begin{equation} y_n = \alpha\cdot 1 + \beta\cdot x_n + \varepsilon_n \end{equation}\]

The above is actually \(n\) linear equations, and can be written compactly in matrix form:

\[\begin{equation} {\begin{pmatrix} y_1\\ y_2\\ \vdots \\ y_n\\ \end{pmatrix}} = {\begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots \\ 1 & x_n\\ \end{pmatrix}} {\begin{pmatrix} \alpha\\ \beta \\ \end{pmatrix}} + {\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \\ \end{pmatrix}} \end{equation}\]

Consider this matrix in the above equation:

\[\begin{equation} {\begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots \\ 1 & x_n\\ \end{pmatrix}} \end{equation}\]

This matrix is called the model matrix or the design matrix; we will encounter it again in the contrast coding chapters, where it plays a crucial role. If we write the dependent variable \(y\) as an \(n\times 1\) vector, the above matrix as the matrix \(X\) (which has dimensions \(n\times 2\)), the intercept and slope parameters as a \(2\times 1\) matrix \(\zeta\), and the residual errors as an \(n\times 1\) matrix, we can write the linear model very compactly:

\[\begin{equation} y = X \zeta + \varepsilon \end{equation}\]

The above matrix formulation of the linear model extends to the hierarchical model very straightforwardly. For example, consider the model \(M_{sih}\) that we just saw above. This model has the following likelihood:

\[\begin{multline} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} \\ + c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}), \sigma) \end{multline}\]

The terms in the location parameter in the Normal likelihood can be re-written in matrix form, just like the linear model above. To see this, consider the fact that the location term

\[\begin{equation} \alpha + u_{subj[n],1} + w_{item[n],1} + c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}) \end{equation}\]

can be re-written as

\[\begin{multline} \alpha\cdot 1 + u_{subj[n],1}\cdot 1 + w_{item[n],1}\cdot 1 +\\ \beta \cdot c\_cloze_n + u_{subj[n],2}\cdot c\_cloze_n+ w_{item[n],2}\cdot c\_cloze_n \end{multline}\]

The above equation can in turn be written in matrix form:

\[\begin{equation} \begin{aligned} & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} {\begin{pmatrix} \alpha\\ \beta \\ \end{pmatrix}} + \\ & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} {\begin{pmatrix} u_{subj[1],1} & u_{subj[2],1} & \dots & u_{subj[n],1}\\ u_{subj[1],2} & u_{subj[2],2} & \dots & u_{subj[n],2}\\ \end{pmatrix}} +\\ & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} {\begin{pmatrix} w_{item[1],1} & w_{item[2],1} & \dots & w_{item[n],1} \\ w_{item[1],2} & w_{item[2],2} & \dots & w_{item[n],2} \\ \end{pmatrix}} \end{aligned} \end{equation}\]

In this hierarchical model, there are three model matrices:

  • the matrix associated with the intercept \(\alpha\) and the slope \(\beta\); below, we call this the matrix \(X\).
  • the matrix associated with the by-subject varying intercepts and slopes; call this the matrix \(Z_u\).
  • the matrix associated with the by-item varying intercepts and slopes; call this the matrix \(Z_w\).

The model can now be written very compactly in matrix form by writing these three matrices as follows:

\[\begin{equation} \begin{aligned} X = & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}}\\ Z_u = & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} \\ Z_w = & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} \\ \end{aligned} \end{equation}\]

The location part of the model \(M_{sih}\) can now be written very compactly:

\[\begin{equation} X \zeta + Z_u z_u + Z_w z_w \end{equation}\]

Here, \(\zeta\) is a \(2\times 1\) matrix containing the intercept \(\alpha\) and the slope \(\beta\), and \(z_u\) and \(z_w\) are the intercept and slope adjustments by subject and by item:

\[\begin{equation} \begin{aligned} z_u = & {\begin{pmatrix} u_{subj[1],1} & u_{subj[2],1} & \dots & u_{subj[n],1}\\ u_{subj[1],2} & u_{subj[2],2} & \dots & u_{subj[n],2}\\ \end{pmatrix}}\\ z_w = & {\begin{pmatrix} w_{item[1],1} & w_{item[2],1} & \dots & w_{item[n],1} \\ w_{item[1],2} & w_{item[2],2} & \dots & w_{item[n],2} \\ \end{pmatrix}}\\ \end{aligned} \end{equation}\]

In summary, the hierarchical model has a very general matrix formulation, called the Laird-Ware form (Laird and Ware 1982):

\[\begin{equation} signal = X \zeta + Z_u z_u + Z_w z_w + \varepsilon \end{equation}\]

The practical relevance of this matrix formulation is that we can define hierarchical models very compactly and efficiently in Stan by expressing the model in terms of the model matrices (Sorensen, Hohenstein, and Vasishth 2016). As an aside, notice that in the above example, \(X=Z_u=Z_w\); but in principle one could have different model matrices for the fixed vs. random effects.

5.2.6 Beyond the maximal model: Distributional regression models

We can use posterior predictive checks to verify that our last model can capture the entire signal distribution. This is shown in Figure 5.14.
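A sketch of this check (in older versions of brms, the argument ndraws is called nsamples):

pp_check(fit_N400_sih, ndraws = 100, type = "dens_overlay")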


FIGURE 5.14: Overlay of densities from the posterior predictive distributions of the model fit_N400_sih.

However, we know that in ERP studies, high impedance between the recording electrodes and the skin increases the noise in the recordings (Picton et al. 2000). Given that skin tissue differs between subjects, it could be the case that the level of noise varies by subject. It might therefore be a good idea to verify that our model is good enough to capture the by-subject variability. The code below produces Figure 5.15.
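A sketch of the grouped check:

pp_check(fit_N400_sih, ndraws = 100,
         type = "dens_overlay_grouped", group = "subj")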


FIGURE 5.15: One hundred posterior predictive distributions (labeled \(y_{rep}\)) and the distribution of the average signal data (labeled \(y\)), shown as density plots for each of the 37 subjects that participated in the experiment.

Figure 5.15 hints that we might be misfitting some subjects: some of the by-subject observed distributions of the EEG signal averages look much tighter than their corresponding posterior predictive distributions (e.g., subjects 3, 5, 9, 10, 14), whereas some other by-subject observed distributions look wider (e.g., subjects 25, 26, 27). Another way to examine whether we misfit the by-subject noise level is to plot the posterior predictive distributions of the by-subject standard deviations and compare them with the observed standard deviations. This is achieved with the following code, which groups the data by subject and shows the distribution of standard deviations; the result is shown in Figure 5.16. It is now clear that, for some subjects, the observed standard deviation lies outside the distribution of predicted standard deviations.
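A sketch of the check, using the sd statistic grouped by subject:

pp_check(fit_N400_sih, type = "stat_grouped", stat = "sd",
         group = "subj", ndraws = 1000)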


FIGURE 5.16: Distributions of posterior predicted standard deviations (gray) and observed standard deviations (black vertical lines) by subject.

Why is our “maximal” hierarchical model misfitting the by-subject distribution of the data? This is because maximal models are, in general and implicitly, models with the maximal group-level effect structure for the location parameter only (e.g., the mean, \(\mu\), in a normal model). Other parameters (e.g., scale or shape parameters) are estimated as auxiliary parameters and are assumed to be constant across observations and clusters. This assumption is so common that researchers may not even be aware that it is just an assumption. In the Bayesian framework, it is easy to change such default assumptions when necessary. Changing the assumption that all subjects share the same residual standard deviation leads to the distributional regression model. Such models can be fit in brms.23

We are going to change our previous likelihood so that the scale \(\sigma\) also has a group-level effect structure. We exponentiate the linear predictor of \(\sigma\) to make sure that negative adjustments do not lead to a negative \(\sigma\).

\[\begin{equation} \begin{aligned} signal_n &\sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} + \\ & \hspace{2cm} c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}), \sigma_n)\\ \sigma_n &= \exp(\sigma_\alpha + \sigma_{u_{subj[n]}}) \end{aligned} \end{equation}\]

We just need to add priors for our new parameters (these replace the old prior for \(\sigma\)). We set the prior for the intercept of the standard deviation, \(\sigma_\alpha\), to be similar to our previous prior for \(\sigma\). For the variance component of \(\sigma\), \(\tau_{\sigma_u}\), we set a rather uninformative hyperprior. Recall that everything is exponentiated before it goes into the likelihood; that is why we use \(\log(50)\) rather than \(50\) in the prior for \(\sigma_\alpha\).

\[\begin{equation} \begin{aligned} \sigma_\alpha &\sim \mathit{Normal}(0,\log(50))\\ \sigma_u &\sim \mathit{Normal}(0, \tau_{\sigma_u}) \\ \tau_{\sigma_u} &\sim \mathit{Normal}_+(0, 5) \end{aligned} \end{equation}\]

This model can be fit in brms using the function brmsformula() (or its abbreviation bf()). This is a powerful function that extends the formulas we have used so far, allowing us to set up a hierarchical regression for any parameter of a model. Here, it allows us to set a by-subject hierarchical structure on the parameter \(\sigma\). We also need to set new priors; these priors are identified by dpar = sigma.
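A sketch of the call; the second formula line gives \(\sigma\) its own intercept and by-subject adjustments.

fit_N400_s <- brm(
  brmsformula(n400 ~ c_cloze + (c_cloze | subj) + (c_cloze | item),
              sigma ~ 1 + (1 | subj)),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b),
    prior(normal(0, 20), class = sd),
    prior(lkj(2), class = cor),
    prior(normal(0, log(50)), class = Intercept, dpar = sigma),
    prior(normal(0, 5), class = sd, group = subj, dpar = sigma)
  ),
  data = df_eeg
)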

Inspect the output below; notice that our estimate for the effect of cloze remains very similar to that of the model fit_N400_sih.

Compare the two models’ estimates:
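For example (in older versions of brms, the argument of posterior_summary() is pars rather than variable):

posterior_summary(fit_N400_sih, variable = "b_c_cloze")
posterior_summary(fit_N400_s, variable = "b_c_cloze")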

##           Estimate Est.Error Q2.5 Q97.5
## b_c_cloze     2.29     0.699 0.89  3.65
##           Estimate Est.Error  Q2.5 Q97.5
## b_c_cloze     2.28     0.669 0.965  3.58

Nonetheless, Figure 5.17 shows that the fit of the model with respect to the by-subject variability is much better than before. Furthermore, Figure 5.18 shows that the observed standard deviations for each subject are well inside the posterior predictive distributions. The code below produces Figures 5.17 and 5.18.
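A sketch of both checks:

# Figure 5.17: by-subject posterior predictive distributions
pp_check(fit_N400_s, ndraws = 100,
         type = "dens_overlay_grouped", group = "subj")
# Figure 5.18: by-subject distributions of predicted standard deviations
pp_check(fit_N400_s, type = "stat_grouped", stat = "sd",
         group = "subj", ndraws = 1000)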


FIGURE 5.17: The gray density plots show 100 predicted distributions from a model that includes a hierarchical structure for \(\sigma\). The black density plots show the distribution of the average signal data for the 37 subjects in the experiment.


FIGURE 5.18: Distributions of posterior predicted standard deviations (gray) from a model that includes a hierarchical structure for \(\sigma\), together with the observed standard deviations by subject (black vertical lines).

The model fit_N400_s raises the question: how much structure should we add to our statistical model? Should we also assume that \(\sigma\) can vary by items, and also by our experimental manipulation? Should we also fit a maximal model for \(\sigma\)? Unfortunately, there are no clear answers that apply to every situation. The amount of complexity that we can introduce in a statistical model depends on (i) the answers we are looking for (we should include parameters that represent what we want to estimate), (ii) the size of the data at hand (more complex models require more data), (iii) our computing power (as complexity increases, models take increasingly long to converge and require more computing power to finish the computations in a feasible time frame), and (iv) our domain knowledge.

Ultimately, all models are approximations (and that’s in the best case; often they are plainly wrong), and we need to think carefully about which aspects of our data we have to account for and which aspects we can abstract away from.

In the context of cognitive modeling, McClelland (2009) argues that models should not focus on every single detail of the process they intend to explain. In order to understand a model, it needs to be simple enough. However, McClelland (2009) warns us that one must bear in mind that oversimplification does have an impact on what we can conclude from our analysis: a simplification can limit the phenomena that a model addresses, or can even lead to incorrect predictions. There is a continuum between purely statistical models (e.g., a linear regression) and computational cognitive models. For example, we can define “hybrid” models such as the linear ballistic accumulator (Brown and Heathcote 2008; see Nicenboim 2018 for an implementation in Stan), where a great deal of cognitive detail is sacrificed for tractability. The conclusion of McClelland (2009) applies to any type of model in cognitive science: “Simplification is essential, but it comes at a cost, and real understanding depends in part on understanding the effects of the simplification”.

References

Barr, Dale J, Roger Levy, Christoph Scheepers, and Harry J Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and Language 68 (3). Elsevier: 255–78.

Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015b. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Brown, Scott D., and Andrew Heathcote. 2008. “The Simplest Complete Model of Choice Response Time: Linear Ballistic Accumulation.” Cognitive Psychology 57 (3): 153–78. https://doi.org/10.1016/j.cogpsych.2007.12.002.

DeLong, Katherine A, Thomas P Urbach, and Marta Kutas. 2005. “Probabilistic Word Pre-Activation During Language Comprehension Inferred from Electrical Brain Activity.” Nature Neuroscience 8 (8): 1117–21. https://doi.org/10.1038/nn1504.

Frank, Stefan L., Leun J. Otten, Giulia Galli, and Gabriella Vigliocco. 2015. “The ERP Response to the Amount of Information Conveyed by Words in Sentences.” Brain and Language 140: 1–11. https://doi.org/10.1016/j.bandl.2014.10.006.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Kutas, Marta, and Kara D. Federmeier. 2011. “Thirty Years and Counting: Finding Meaning in the N400 Component of the Event-Related Brain Potential (ERP).” Annual Review of Psychology 62 (1): 621–47. https://doi.org/10.1146/annurev.psych.093008.131123.

Kutas, Marta, and Steven A Hillyard. 1980. “Reading Senseless Sentences: Brain Potentials Reflect Semantic Incongruity.” Science 207 (4427): 203–5. https://doi.org/10.1126/science.7350657.

Kutas, Marta, and Steven A Hillyard. 1984. “Brain Potentials During Reading Reflect Word Expectancy and Semantic Association.” Nature 307 (5947): 161–63. https://doi.org/10.1038/307161a0.

Laird, Nan M, and James H Ware. 1982. “Random-Effects Models for Longitudinal Data.” Biometrics. JSTOR, 963–74.

McClelland, James L. 2009. “The Place of Modeling in Cognitive Science.” Topics in Cognitive Science 1 (1): 11–38. https://doi.org/10.1111/j.1756-8765.2008.01003.x.

McElreath, Richard. 2015. Statistical Rethinking: A Bayesian Course with R Examples. Boca Raton, Florida: Chapman; Hall/CRC.

Nicenboim, Bruno. 2018. “The Implementation of a Model of Choice: The (Truncated) Linear Ballistic Accumulator.” In StanCon. Aalto University, Helsinki, Finland. https://doi.org/10.5281/zenodo.1465990.

Nicenboim, Bruno, Shravan Vasishth, and Frank Rösler. 2020a. “Are Words Pre-Activated Probabilistically During Sentence Comprehension? Evidence from New Data and a Bayesian Random-Effects Meta-Analysis Using Publicly Available Data.” Neuropsychologia 142. https://doi.org/10.1016/j.neuropsychologia.2020.107427.

Nieuwland, Mante S, Stephen Politzer-Ahles, Evelien Heyselaar, Katrien Segaert, Emily Darley, Nina Kazanina, Sarah Von Grebmer Zu Wolfsthurn, et al. 2018. “Large-Scale Replication Study Reveals a Limit on Probabilistic Prediction in Language Comprehension.” eLife 7. https://doi.org/10.7554/eLife.33468.

Picton, T.W., S. Bentin, P. Berg, E. Donchin, S.A. Hillyard, R. Johnson JR., G.A. Miller, et al. 2000. “Guidelines for Using Human Event-Related Potentials to Study Cognition: Recording Standards and Publication Criteria.” Psychophysiology 37 (2): 127–52. https://doi.org/10.1111/1469-8986.3720127.

Pinheiro, José C, and Douglas M Bates. 2000. Mixed-Effects Models in S and S-PLUS. New York: Springer-Verlag.

Sorensen, Tanner, Sven Hohenstein, and Shravan Vasishth. 2016. “Bayesian Linear Mixed Models Using Stan: A Tutorial for Psychologists, Linguists, and Cognitive Scientists.” Quantitative Methods for Psychology 12 (3): 175–200.


  1. For simplicity, we assume that they share the same standard deviation.

  2. If we don’t remove the intercept, that is, if we use the formula n400 ~ 1 + factor(subj) + c_cloze:factor(subj), then with factor(subj) we are going to estimate the deviation between the first subject and each of the other subjects.

  3. The analogous frequentist model can be fit using lmer from the package lme4, using (1+c_cloze||subj) or, equivalently, (c_cloze||subj) for the by-subject random effects.

  4. The intercept adjustment is often called \(u_0\) in statistics books, where the intercept might be called \(\alpha\) (or sometimes \(\beta_0\)), and thus \(u_1\) refers to the adjustment to the slope. However, in this book, we start the indexing at 1 to be consistent with the Stan language.

  5. By contrast, in the frequentist lmer model, the adjustments \(u_1\) and \(u_2\) are not parameters; they are called conditional modes; see Bates, Mächler, et al. (2015b).

  6. Another source of confusion here is that hyperparameters is also used in the machine learning literature with a different meaning.

  7. One could in theory keep going deeper and deeper, defining hyper-hyperpriors etc., but the model would quickly become impossible to fit.

  8. The double pipe is also used in the same way in lmer from the package lme4.

  9. This is because an LKJ correlation distribution with a large \(\eta\) corresponds to a correlation matrix with values close to zero in the lower and upper triangles

  10. shorturl.at/brNS8