## 5.2 A hierarchical model with a normal likelihood: The N400 effect

Event-related potentials (ERPs) allow scientists to observe electrophysiological responses in the brain measured by means of electroencephalography (EEG) that are time-locked to a specific event (i.e., the presentation of the stimuli). A very robust ERP effect in the study of language is the N400. It has been shown that words with low predictability are accompanied by an *N400 effect* in comparison with high-predictable words, this is a relative negativity that peaks around 300-500 after word onset over central parietal scalp sites (first reported in Kutas and Hillyard 1980, for semantic anomalies, and in 1984 for low predictable word; for a review, see Kutas and Federmeier 2011). The N400 is illustrated in Figure 5.4.

For example, in (1) below, the continuation *‘paint’* has higher predictability than the continuation *‘dog’*, and thus we would expect a more negative signal, that is, an N400 effect, in *‘dog’* in (b) in comparison with *‘paint’* in (a). It is often the case that predictability is measured with a cloze task (see section 1.4).

- Example from Kutas and Hillyard (1984)
- Don’t touch the wet paint.
- Don’t touch the wet dog.

The EEG data are typically recorded in tens of electrodes every couple of milliseconds, but for our purposes (i.e., for learning about Bayesian hierarchical models), we can safely ignore the complexity of the data. A common way to simplify the high-dimensional EEG data when we are dealing with the N400 is to focus on the average amplitude of the EEG signal at the typical spatio-temporal window of the N400 (for example, see Frank et al. 2015).

For this example, we are going to focus on the N400 effect for critical nouns from a subset of the data of Nieuwland et al. (2018). Nieuwland et al. (2018) presented a replication attempt of an original experiment of DeLong, Urbach, and Kutas (2005) with sentences like (2).

- Example from DeLong, Urbach, and Kutas (2005)
- The day was breezy so the boy went outside to fly a kite.
- The day was breezy so the boy went outside to fly an airplane.

We’ll ignore the goal of original experiment (DeLong, Urbach, and Kutas 2005), and its replication (Nieuwland et al. 2018). We are going to focus on the N400 at the final nouns in the experimental stimuli. In example (2), for example, the final noun *‘kite’* has higher predictability than *‘airplane’*, and thus we would expect a more negative signal in *‘airplane’* in (b) in comparison with *‘kite’* in (a).

To speed up computation, we restrict the data set to the subjects from the Edinburgh lab, using the relevant subset containing the Edinburgh data fro `df_eeg`

in the `bcogsci`

package. Center the cloze probability before using it as a predictor.

```
## # A tibble: 2,863 × 7
## subj cloze item n400 cloze_ans N c_cloze
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1 7.08 0 44 -0.476
## 2 1 0.03 2 -0.68 1 44 -0.446
## 3 1 1 3 1.39 44 44 0.524
## # … with 2,860 more rows
```

```
## # A tibble: 1 × 1
## n
## <int>
## 1 37
```

One convenient aspect of using averages of EEG data is that they are roughly normally distributed. This allows us to use the Normal likelihood. Figure 5.5 shows the distribution of the data.

```
df_eeg %>% ggplot(aes(n400)) +
geom_histogram(
binwidth = 4,
colour = "gray",
alpha = .5,
aes(y = ..density..)
) +
stat_function(fun = dnorm, args = list(
mean = mean(df_eeg$n400),
sd = sd(df_eeg$n400)
)) +
xlab("Average voltage in microvolts for
the N400 spatiotemporal window")
```

### 5.2.1 Complete pooling model (\(M_{cp}\))

We’ll start from the simplest model which is basically the linear regression we encountered in the preceding chapter.

#### 5.2.1.1 Model assumptions

This model, call it \(M_{cp}\), makes the following assumptions.

- The EEG averages for the N400 spatiotemporal window are normally distributed.
- Observations are independent.
- There is a linear relationship between cloze and the EEG signal for the trial.

**This model is incorrect for these data due to assumption (2) being violated.**

With the last assumption, we are saying that the difference in the average signal when we compare nouns with cloze probability of 0 and 0.1 is the same as the difference in the signal when we compare nouns with cloze values of 0.1 and 0.2 (or 0.9 and 1). This is just an assumption, and it may not necessarily be the case in the actual data. This means that we are going to get a posterior for \(\beta\) *conditional* on the assumption that the linear relationship holds. Even if it *approximately* holds, we still don’t know how much we deviate from this assumption.

We can now decide on a likelihood and priors.

#### 5.2.1.2 Likelihood and priors

A normal likelihood seems reasonable for these data:

\[\begin{equation} signal_n \sim \mathit{Normal}( \alpha + c\_cloze_n \cdot \beta,\sigma) \tag{5.1} \end{equation}\]

where \(n =1, \ldots, N\), and \(signal\) is the dependent variable (average signal in the N400 spatiotemporal window in microvolts). The variable \(N\) represents the total number of data points.

As always we need to rely on our previous knowledge and domain expertise to decide on priors. We know that ERPs (signals time-locked to a stimulus) have mean amplitudes of a couple of microvolts: This is easy to see in any plot of the EEG literature. This means that we don’t expect the effect of our manipulation to exceed, say, \(10 \mu V\). As before, a priori we’ll assume that effects can be negative or positive. We can quantify our prior knowledge regarding plausible values of \(\beta\) as normally distributed centered at zero with a standard deviation of \(10 \mu V\). (Other values such as \(5 \mu V\) would have been also reasonable, since it would entail that 95% of the prior mass probability is between -10 and 10 \(\mu V\).)

If the signal for each ERP is *baselined*, that is, the mean signal of a time window before the time window of interest is subtracted from the time window of interest, then the mean signal would be relatively close to 0. Since we know that the ERPs were baselined in this study, we expect that the grand mean of our signal should be relatively close to zero. Our prior for \(\alpha\) is then normally distributed centered in zero with a standard deviation of 10 \(\mu V\).

The standard deviation of our signal distribution is harder to guess. We know that EEG signals are quite noisy, and that the standard deviation must be higher than zero. Our prior for \(\sigma\) is a truncated normal distribution with location zero and scale 50. Recall that since we truncate the distribution, the parameters location and scale do not correspond to the mean and standard deviation of the new distribution; see Box 4.1.

We can draw random samples from this truncated distribution and calculate their mean and standard deviation:

```
## mean sd
## 39.6 29.7
```

So we are essentially saying that we assume a priori that we will find the true standard deviation of the signal in the following interval with 95% probability:

```
## 2.5% 97.5%
## 1.55 110.40
```

To sum up, we are going to use the following priors:

\[\begin{equation} \begin{aligned} \alpha &\sim \mathit{Normal}(0,10)\\ \beta &\sim \mathit{Normal}(0,10)\\ \sigma &\sim \mathit{Normal}_{+}(0,50) \end{aligned} \end{equation}\]

A model such as \(M_{cp}\) is sometimes called a *fixed-effects* model: all the parameters are fixed in the sense that do not vary from subject to subject or from item to item. A similar frequentist model would correspond to fitting a simple linear model using the `lm`

function: `lm(n400 ~ 1 + cloze, data = df_eeg)`

.

We fit this model in `brms`

as follows (the default family is `gaussian()`

so we can omit it). As with the `lm`

function in R, by default an intercept is fitted and thus `n400 ~ c_cloze`

is equivalent to `n400 ~ 1 + c_cloze`

:

```
fit_N400_cp <- brm(n400 ~ c_cloze,
prior =
c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = c_cloze),
prior(normal(0, 50), class = sigma)
),
data = df_eeg
)
```

For now, check the summary, and plot the posteriors of the model (Figure 5.6).

```
## ...
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 3.66 0.22 3.23 4.09 1.00 4060 2945
## c_cloze 2.25 0.54 1.15 3.33 1.00 3906 2929
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 11.82 0.16 11.50 12.14 1.00 3873 2881
##
## ...
```

### 5.2.2 No pooling model (\(M_{np}\))

One of the assumptions of the previous model is clearly wrong: observations are not independent, they are clustered by subject (and also by the specific item, but we’ll ignore this until section 5.2.4). It is reasonable to assume that EEG signals are more similar within subjects than between them. The following model assumes that each subject is completely independent from each other.^{14}

#### 5.2.2.1 Model assumptions

- EEG averages for the N400 spatio-temporal window are normally distributed.
- Every subject’s model is fit independently of the other subjects; the subjects have no parameters in common (an exception is the standard deviation; this is the same for all subjects).
- There is a linear relationship between cloze and the EEG signal for the trial.

What likelihood and priors can we choose here?

#### 5.2.2.2 Likelihood and priors

The likelihood is a normal distribution as before:

\[\begin{equation} signal_n \sim \mathit{Normal}( \alpha_{subj[n]} + c\_cloze_n \cdot \beta_{subj[n]},\sigma) \end{equation}\]

As before, \(n\) represents each observation, that is, the \(n\)th row in the data frame, which has \(N\) rows, and now the index \(i\) identifies the subject. The notation \(subj[n]\), which roughly follows Gelman and Hill (2007), identifies the subject index; for example, if \(subj[10]=3\), then the \(10\)th row of the data frame is from subject \(3\).

We define the priors as follows:

\[\begin{equation} \begin{aligned} \alpha_{i} &\sim \mathit{Normal}(0,10)\\ \beta_{i} &\sim \mathit{Normal}(0,10)\\ \sigma &\sim \mathit{Normal}_+(0,50) \end{aligned} \end{equation}\]

In `brms`

, such a model can be fit by removing the common intercept with the formula `n400 ~ 0 + factor(subj) + c_cloze:factor(subj)`

.

This formula forces the model to estimate one intercept and one slope for *each* level of `subj`

.^{15}
The by-subject intercepts are indicated with `factor(subj)`

and the by-subject slopes with `c_cloze:factor(subj)`

. It’s very important to specify that `subject`

should be treated as a factor and not as a number; we don’t assume that subject number 3 will show 3 times more positive (or negative) average signal than subject number 1! The model fits 37 independent intercepts and 37 independent slopes. By setting a prior to `class = b`

and omitting `coef`

, we are essentially setting identical priors to all the intercepts and slopes of the model. The parameters are independent from each other; it is only our previous knowledge (or prior beliefs) about their possible values (encoded in the priors) that is identical. We can set different priors to each intercept and slope, but that will mean setting 74 priors!

```
fit_N400_np <- brm(n400 ~ 0 + factor(subj) + c_cloze:factor(subj),
prior =
c(
prior(normal(0, 10), class = b),
prior(normal(0, 50), class = sigma)
),
data = df_eeg
)
```

For this model, printing a summary means printing the 75 parameters (\(\alpha_{1,...,37}\), \(\beta_{1,...,37}\), and \(\sigma\)). We could do this as always by printing out the model results: just type `fit_N400_np`

.

It may be easier to understand the output of the model by plotting \(\beta_{1,..,37}\) using `bayesplot`

. (`brms`

also includes a wrapper for this function called `stanplot`

). We can take a look at the internal names that `brms`

gives to the parameters with `variables(fit_N400_np)`

; they are `b_factorsubj`

, then the subject index and then `:c_cloze`

. The code below changes the subject labels back to their original numerical indices and plots them in Figure 5.7. The subjects are ordered by the magnitude of their mean effects.

The model \(M_{np}\) does not estimate a unique population-level effect; instead, there is a different effect estimated for each subject. However, given the posterior means from each subject, it is still possible to calculate the average of these estimates \(\hat\beta_{1,...,I}\), where \(I\) is the total number of subjects:

```
# parameter name of beta by subject:
ind_effects_np <- paste0(
"b_factorsubj",
unique(df_eeg$subj), ":c_cloze"
)
beta_across_subj <- as_draws_df(fit_N400_np) %>%
#removes the meta data from the object
as.data.frame() %>%
select(all_of(ind_effects_np)) %>%
rowMeans()
# Calculate the average of these estimates
(grand_av_beta <- tibble(
mean = mean(beta_across_subj),
lq = quantile(beta_across_subj, c(.025)),
hq = quantile(beta_across_subj, c(.975))
))
```

```
## # A tibble: 1 × 3
## mean lq hq
## <dbl> <dbl> <dbl>
## 1 2.18 1.17 3.17
```

In Figure 5.7, the 95% credible interval of this overall mean effect is plotted as two vertical lines together with the effect of cloze probability for each subject (ordered by effect size). Here, rather than using a plotting function from `brms`

, we can extract the summary of by-subject effects, reorder them by magnitude, and then plot the summary with a custom plot using `ggplot2`

.

```
# make a table of beta's by subject
beta_by_subj <- posterior_summary(fit_N400_np,
variable = ind_effects_np
) %>%
as.data.frame() %>%
mutate(subject = 1:n()) %>%
## reorder plot by magnitude of mean:
arrange(Estimate) %>%
mutate(subject = factor(subject, levels = subject))
```

The code below generates Figure 5.7.

```
ggplot(
beta_by_subj,
aes(x = Estimate, xmin = Q2.5, xmax = Q97.5, y = subject)
) +
geom_point() +
geom_errorbarh() +
geom_vline(xintercept = grand_av_beta$mean) +
geom_vline(xintercept = grand_av_beta$lq, linetype = "dashed") +
geom_vline(xintercept = grand_av_beta$hq, linetype = "dashed") +
xlab("By-subject effect of cloze probability in microvolts")
```

### 5.2.4 Correlated varying intercept varying slopes model (\(M_{h}\))

The model \(M_{v}\) allowed for differences in intercepts (mean voltage) and slopes (effects of cloze) across subjects, but it has the implicit assumption that these varying intercepts and varying slopes are independent. It is in principle possible that subjects showing more negative voltage may also show stronger effects (or weaker effects). Next, we fit a model that allows a correlation between the intercepts and slopes. We model the correlation between varying intercepts and slopes by defining a variance-covariance matrix \(\boldsymbol{\Sigma}\) between the by-subject varying intercepts and slopes, and by assuming that both adjustments (intercept and slope) come from a multivariate (in this case, a bivariate) normal distribution.

- In \(M_h\), we model the EEG data with the following assumptions:

- EEG averages for the N400 spatio-temporal window are normally distributed.
- Some aspects of the mean signal voltage and of the effect of predictability depend on the subject, and these two might be correlated, i.e., we assume group-level intercepts and slopes, and allow a correlation between them by-subject.
- There is a linear relationship between cloze and the EEG signal for the trial.

The likelihood remains identical to the model \(M_v\), which assumes no correlation between group-level intercepts and slopes (section 5.2.3):

\[\begin{equation} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot (\beta + u_{subj[n],2}),\sigma) \end{equation}\]

The correlation is indicated in the priors on the adjustments for intercept \(u_{1}\) and slopes \(u_{2}\).

- Priors: \[\begin{equation} \begin{aligned} \alpha & \sim \mathit{Normal}(0,10) \\ \beta & \sim \mathit{Normal}(0,10) \\ \sigma &\sim \mathit{Normal}_+(0,50)\\ {\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix}} &\sim {\mathcal {N}} \left( {\begin{pmatrix} 0\\ 0 \end{pmatrix}} ,\boldsymbol{\Sigma_u} \right) \end{aligned} \end{equation}\]

In this model, a bivariate normal distribution generates the varying intercepts and varying slopes \(\mathbf{u}\); this is an \(n\times 2\) matrix. The variance-covariance matrix \(\boldsymbol{\Sigma_u}\) defines the standard deviations of the varying intercepts and varying slopes, and the correlation between them. Recall from section 1.6.2 that the diagonals of the variance-covariance matrix contain the variances of the correlated random variables, and the off-diagonals contain the covariances. In this example, the covariance \(Cov(u_1,u_2)\) between two variables \(u_1\) and \(u_2\) is defined as the product of their correlation \(\rho\) and their standard deviations \(\tau_{u_1}\) and \(\tau_{u_2}\). In other words, \(Cov(u_1,u_2) = \rho_u \tau_{u_1} \tau_{u_2}\).

\[\begin{equation} \boldsymbol{\Sigma_u} = {\begin{pmatrix} \tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\ \rho_u \tau_{u_1} \tau_{u_2} & \tau_{u_2}^2 \end{pmatrix}} \end{equation}\]

In order to specify a prior for \(\Sigma_u\), we need priors for the standard deviations, \(\tau_{u_1}\), and \(\tau_{u_2}\), and also for their correlation, \(\rho_u\). We can use the same priors for \(\tau\) as before. For the correlation parameter \(\rho_u\) (and the correlation matrix more generally), we use the LKJ prior. The basic idea of the LKJ prior on the correlation matrix is that as its parameter, \(\eta\) (*eta*), increases, it will favor correlations closer to zero.^{22} At \(\eta = 1\), the LKJ correlation distribution is uninformative (similar to \(Beta(1,1)\)), at \(\eta < 1\), it favors extreme correlations (similar to \(Beta(a<1,b<1)\)). We set \(\eta = 2\) so that we don’t favor extreme correlations, and we still represent our lack of knowledge through the wide spread of the prior between \(-1\) and \(1\). Thus, \(\eta = 2\) gives us a regularizing, relatively uninformative or mildly informative prior.

Figure 5.11 shows a visualization of different parametrizations of the LKJ prior.

\[\begin{equation} \begin{aligned} \tau_{u_1} &\sim \mathit{Normal}_+(0,20)\\ \tau_{u_2} &\sim \mathit{Normal}_+(0,20)\\ \rho_u &\sim \mathit{LKJcorr}(2) \end{aligned} \end{equation}\]

In our `brms`

model, we allow a correlation between the by-subject intercepts and slopes by using a single pipe `|`

instead of the double pipe `||`

that we used previously. This convention follows that in the frequentist `lmer`

function. As before, the varying intercepts are implicitly fit.

Because we have a new parameter, the correlation \(\rho_{u}\), we need to add a new prior for this correlation; in `brms`

, this is achieved by addition a prior for the parameter type `cor`

.

```
prior_h <- c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = c_cloze),
prior(normal(0, 50), class = sigma),
prior(normal(0, 20),
class = sd, coef = Intercept,
group = subj
),
prior(normal(0, 20),
class = sd, coef = c_cloze,
group = subj
),
prior(lkj(2), class = cor, group = subj)
)
fit_N400_h <- brm(n400 ~ c_cloze + (c_cloze | subj),
prior = prior_h,
data = df_eeg
)
```

The estimates do not change much in comparison with the varying intercept/slope model, probably because the estimation of the correlation is quite poor (i.e., there is a lot of uncertainty). As before we show the estimates graphically (Figure 5.12. One can access the complete summary as always with `fit_N400_h`

.

We are now half-way to what is called the maximal hierarchical model (Barr et al. 2013), because everything that we said about subjects is also relevant for items. The next section spells out this type of model.

### 5.2.5 By-subjects and by-items correlated varying intercept varying slopes model (\(M_{sih}\))

Our new model, \(M_{sih}\) will allow for differences in intercepts (mean voltage) and slopes (effects of predictability) across subjects *and* across items. In typical Latin square designs, subjects and items are said to be *crossed random effects*—each subject sees exactly one instance of each item. Here we assume a possible correlation between varying intercepts and slopes by subjects, and another one by items.

- In \(M_{sih}\), we model the EEG data with the following assumptions:

- EEG averages for the N400 spatio-temporal window are normally distributed.
- Some aspects of the mean signal voltage and of the effect of predictability depend on the subject, i.e., we assume group-level intercepts, and slopes, and a correlation between them by-subject.
- Some aspects of the mean signal voltage and of the effect of predictability depend on the item, i.e., we assume group-level intercepts, and slopes, and a correlation between them by-item.
- There is a linear relationship between cloze and the EEG signal for the trial.

- Likelihood:

\[\begin{multline} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} + \\ c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}), \sigma) \end{multline}\]

- Priors: \[\begin{equation} \begin{aligned} \alpha & \sim \mathit{Normal}(0,10) \\ \beta & \sim \mathit{Normal}(0,10) \\ \sigma &\sim \mathit{Normal}_+(0,50)\\ {\begin{pmatrix} u_{i,1} \\ u_{i,2} \end{pmatrix}} &\sim {\mathcal {N}} \left( {\begin{pmatrix} 0\\ 0 \end{pmatrix}} ,\boldsymbol{\Sigma_u} \right) \\ {\begin{pmatrix} w_{j,1} \\ w_{j,2} \end{pmatrix}} &\sim {\mathcal {N}} \left( {\begin{pmatrix} 0\\ 0 \end{pmatrix}} ,\boldsymbol{\Sigma_w} \right) \end{aligned} \end{equation}\]

We have added the index \(j\), which represents each item, as we did with subjects; \(item[n]\) indicates the item that corresponds to the observation in the \(n\)-th row of the data frame.

We have hyperparameters and hyperpriors as before:

\[\begin{equation} \begin{aligned} \boldsymbol{\Sigma_u} & = {\begin{pmatrix} \tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\ \rho_u \tau_{u_1} \tau_{u_2} & \tau_{u_2}^2 \end{pmatrix}}\\ \boldsymbol{\Sigma_w} & = {\begin{pmatrix} \tau_{w_1}^2 & \rho_w \tau_{w_1} \tau_{w_2} \\ \rho_w \tau_{w_1} \tau_{w_2} & \tau_{w_2}^2 \end{pmatrix}} \end{aligned} \end{equation}\]

\[\begin{equation} \begin{aligned} \tau_{u_1} &\sim \mathit{Normal}_+(0,20)\\ \tau_{u_2} &\sim \mathit{Normal}_+(0,20)\\ \rho_u &\sim \mathit{LKJcorr}(2) \\ \tau_{w_1} &\sim \mathit{Normal}_+(0,20)\\ \tau_{w_2} &\sim \mathit{Normal}_+(0,20)\\ \rho_w &\sim \mathit{LKJcorr}(2) \\ \end{aligned} \end{equation}\]

We set identical priors for by-items group-level effects as for by-subject group-level effects, but this only because we don’t have any differentiated prior information about subject-level vs. item-level variation. However, bear in mind that the estimation for items is completely independent from the estimation for subjects. Although we wrote many more equations than before, the `brms`

model is quite straightforward to extend:

```
prior_sih_full <-
c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = c_cloze),
prior(normal(0, 50), class = sigma),
prior(normal(0, 20),
class = sd, coef = Intercept,
group = subj
),
prior(normal(0, 20),
class = sd, coef = c_cloze,
group = subj
),
prior(lkj(2), class = cor, group = subject),
prior(normal(0, 20),
class = sd, coef = Intercept,
group = item
),
prior(normal(0, 20),
class = sd, coef = c_cloze,
group = item
),
prior(lkj(2), class = cor, group = item)
)
fit_N400_sih <- brm(n400 ~ c_cloze + (c_cloze | subj) +
(c_cloze | item),
prior = prior_sih_full,
data = df_eeg
)
```

We can also simplify the call to `brms`

, when we assign the same priors to the by-subject and by-item parameters:

```
prior_sih <-
c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b),
prior(normal(0, 50), class = sigma),
prior(normal(0, 20), class = sd),
prior(lkj(2), class = cor)
)
fit_N400_sih <- brm(n400 ~ c_cloze + (c_cloze | subj) +
(c_cloze | item),
prior = prior_sih,
data = df_eeg
)
```

We have new group-level effects in the summary, but again the estimate of the effect of cloze remains virtually unchanged (Figure 5.13).

**Box 5.3**

**The Matrix Formulation of Hierarchical Models (the Laird-Ware form)**

We have been writing linear models as follows; where \(n\) refers to the row id in the data frame.

\[\begin{equation} y_n \sim \mathit{Normal}(\alpha + \beta\cdot x_n) \end{equation}\]

This simple linear model can be re-written as follows:

\[\begin{equation} y_n = \alpha + \beta\cdot x_n + \varepsilon_n \end{equation}\]

where \(\varepsilon_n \sim \mathit{Normal}(0, \sigma)\).

The model does not change if \(\alpha\) is multiplied by \(1\):

\[\begin{equation} y_n = \alpha\cdot 1 + \beta\cdot x_n + \varepsilon_n \end{equation}\]

The above is actually \(n\) linear equations, and can be written compactly in matrix form:

\[\begin{equation} {\begin{pmatrix} y_1\\ y_2\\ \vdots \\ y_n\\ \end{pmatrix}} = {\begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots \\ 1 & x_n\\ \end{pmatrix}} {\begin{pmatrix} \alpha\\ \beta \\ \end{pmatrix}} + {\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \\ \end{pmatrix}} \end{equation}\]

Consider this matrix in the above equation:

\[\begin{equation} {\begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots \\ 1 & x_n\\ \end{pmatrix}} \end{equation}\]

This matrix is called the model matrix or the design matrix; we will encounter it again in the contrast coding chapters, where it plays a crucial role. If we write the dependent variable \(y\) as a \(n\times 1\) vector, the above matrix as the matrix \(X\) (which has dimensions \(n\times 2\), the intercept and slope parameters as a \(2\times 1\) matrix \(\zeta\), and the residual errors as an \(n\times 1\) matrix, we can write the linear model very compactly:

\[\begin{equation} y = X \zeta + \varepsilon \end{equation}\]

The above matrix formulation of the linear model extends to the hierarchical model very straightforwardly. For example, consider the model \(M_{sih}\) that we just saw above. This model has the following likelihood:

\[\begin{multline} signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} \\ + c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}), \sigma) \end{multline}\]

The terms in the location parameter in the Normal likelihood can be re-written in matrix form, just like the linear model above. To see this, consider the fact that the location term

\[\begin{equation} \alpha + u_{subj[n],1} + w_{item[n],1} + c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}) \end{equation}\]

can be re-written as

\[\begin{multline} \alpha\cdot 1 + u_{subj[n],1}\cdot 1 + w_{item[n],1}\cdot 1 +\\ \beta \cdot c\_cloze_n + u_{subj[n],2}\cdot c\_cloze_n+ w_{item[n],2}\cdot c\_cloze_n \end{multline}\]

The above equation can in turn be written in matrix form:

\[\begin{equation} \begin{aligned} & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} {\begin{pmatrix} \alpha\\ \beta \\ \end{pmatrix}} + \\ & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} {\begin{pmatrix} u_{subj[1],1} & u_{subj[2],1} & \dots & u_{subj[n],1}\\ u_{subj[1],2} & u_{subj[2],2} & \dots & u_{subj[n],2}\\ \end{pmatrix}} +\\ & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} {\begin{pmatrix} w_{item[1],1} & w_{item[2],1} & \dots & w_{item[n],1} \\ w_{item[1],2} & w_{item[2],2} & \dots & w_{item[n],2} \\ \end{pmatrix}} \end{aligned} \end{equation}\]

In this hierarchical model, there are three model matrices:

- the matrix associated with the intercept \(\alpha\) and the slope \(\beta\); below, we call this the matrix \(X\).

- the matrix associated with the by-subject varying intercepts and slopes; call this the matrix \(Z_u\).
- the matrix associated with the by-item varying intercepts and slopes; call this the matrix \(Z_w\).

The model can now be written very compactly in matrix form by writing these three matrices as follows:

\[\begin{equation} \begin{aligned} X = & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}}\\ Z_u = & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} \\ Z_w = & {\begin{pmatrix} 1 & c\_cloze_1\\ 1 & c\_cloze_2\\ \vdots & \vdots \\ 1 & c\_cloze_n\\ \end{pmatrix}} \\ \end{aligned} \end{equation}\]

The location part of the model \(M_{sih}\) can now be written very compactly:

\[\begin{equation} X \zeta + Z_u z_u + Z_w z_w \end{equation}\]

Here, \(\zeta\) is a \(2\times 1\) matrix containing the intercept \(\alpha\) and the slope \(\beta\), and \(z_u\) and \(z_w\) are the intercept and slope adjustments by subject and by item:

\[\begin{equation} \begin{aligned} z_u = & {\begin{pmatrix} u_{subj[1],1} & u_{subj[2],1} & \dots & u_{subj[n],1}\\ u_{subj[1],2} & u_{subj[2],2} & \dots & u_{subj[n],2}\\ \end{pmatrix}}\\ z_w = & {\begin{pmatrix} w_{item[1],1} & w_{item[2],1} & \dots & w_{item[n],1} \\ w_{item[1],2} & w_{item[2],2} & \dots & w_{item[n],2} \\ \end{pmatrix}}\\ \end{aligned} \end{equation}\]

In summary, the hierarchical model has a very general matrix formulation, called the Laird-Ware form (Laird and Ware 1982):

\[\begin{equation} signal = X \zeta + Z_u z_u + Z_w z_w + \varepsilon \end{equation}\]

The practical relevance of this matrix formulation is that we can define hierarchical models very compactly and efficiently in Stan by expressing the model in terms of the model matrices (Sorensen, Hohenstein, and Vasishth 2016). As an aside, notice that in the above example, \(X=Z_u=Z_w\); but in principle one could have different model matrices for the fixed vs. random effects.

### 5.2.6 Beyond the maximal model–Distributional regression models

We can use posterior predictive checks to verify that our last model can capture the entire signal distribution. This is shown in Figure 5.14

However, we know that in ERP studies, large levels of impedance between the recording electrodes and the skin tissue increase the noise in the recordings (Picton et al. 2000). Given that skin tissue is different between subjects, it could be the case that the level of noise varies by subject. It might be a good idea to verify that our model is good enough for capturing the by-subject variability. The code below produces Figure 5.15.

```
ppc_dens_overlay_grouped(df_eeg$n400,
yrep =
posterior_predict(fit_N400_sih,
ndraws = 100
),
group = df_eeg$subj
) +
xlab("Signal in the N400 spatiotemporal window")
```

Figure 5.15 hints that we might be misfitting some subjects: Some of the by-subject observed distributions of the EEG signal averages look much tighter than their corresponding posterior predictive distributions (e.g., subjects 3, 5, 9, 10, 14), whereas some other by-subject observed distributions look wider (e.g., subjects 25, 26, 27). Another approach to examine whether we misfit the by-subject noise level is to plot posterior distributions of the standard deviations and compare them with the observed standard deviation. This is achieved in the following code, which groups the data by subject, and shows the distribution of standard deviations. The result is shown in Figure 5.16. It is clear now that, for some subjects, the observed standard deviation lies outside the distribution of predictive standard deviations.

Why is our “maximal” hierarchical model misfitting the by-subject distribution of data? This is because, the maximal models are, in general and implicitly, models with the maximal group-level effect structure for the location parameter (e.g., the mean, \(\mu\), in a normal model). Other parameters (e.g., scale or shape parameters) are estimated as auxiliary parameters, and are assumed to be constant across observations and clusters. This assumption is so common that researchers may not be aware that it is just an assumption. In the Bayesian framework, it is easy to change such default assumptions if necessary. Changing the assumption that all subjects have the same residual standard deviation leads to the distributional regression model. Such models can be fit in `brms`

.^{23}.

We are going to change our previous likelihood, so that the scale, \(\sigma\) has also a group-level effect structure. We exponentiate \(\sigma\) to make sure that the negative adjustments do not cause \(\sigma\) to become negative.

\[\begin{equation} \begin{aligned} signal_n &\sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} + \\ & \hspace{2cm} c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}), \sigma_n)\\ \sigma_n &= \exp(\sigma_\alpha + \sigma_{u_{subj[n]}}) \end{aligned} \end{equation}\]

We just need to add priors to our new parameters (that replace the old prior for \(\sigma\)). We set the prior to the intercept of the standard deviation, \(\sigma_\alpha\), to be similar to our previous \(\sigma\). For the variance component of \(\sigma\), \(\tau_{\sigma_u}\), we set rather uninformative hyperpriors. Recall that everything is exponentiated when it goes inside the likelihood; that is why we use \(\log(50)\) rather than 50 in \(\sigma\).

\[\begin{equation} \begin{aligned} \sigma_\alpha &\sim \mathit{Normal}(0,log(50))\\ \sigma_u &\sim \mathit{Normal}(0, \tau_{\sigma_u}) \\ \tau_{\sigma_u} &\sim \mathit{Normal}_+(0, 5) \end{aligned} \end{equation}\]

This model can be fit in `brms`

using the internal function `brmsformula()`

. This is a powerful function that extends the formulas that we used so far allowing for setting a hierarchical regression to any parameter of a model. This will allow us to set a by-subject hierarchical structure to the parameter \(\sigma\). We also need to set new priors; these priors are identified by `dpar = sigma`

.

```
prior_s <- c(
prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b),
prior(normal(0, 20), class = sd),
prior(lkj(2), class = cor),
prior(normal(0, log(50)), class = Intercept, dpar = sigma),
prior(normal(0, 5),
class = sd, group = subj,
dpar = sigma
)
)
fit_N400_s <- brm(brmsformula(
n400 ~ c_cloze + (c_cloze | subj) + (c_cloze | item),
sigma ~ 1 + (1 | subj)
),
prior = prior_s,
data = df_eeg
)
```

Inspect the output below; notice that our estimate for the effect of cloze
remains very similar to that of the model `fit_N400_sih`

.

Compare the two models’ estimates:

```
## Estimate Est.Error Q2.5 Q97.5
## b_c_cloze 2.29 0.699 0.89 3.65
```

```
## Estimate Est.Error Q2.5 Q97.5
## b_c_cloze 2.28 0.669 0.965 3.58
```

Nonetheless, Figure 5.17 shows that the fit of the model with respect to the by-subject variability is much better than before. Furthermore, Figure 5.18 shows that the observed standard deviations for each subject are well inside the posterior predictive distributions. The code below produces Figure 5.17.

```
ppc_dens_overlay_grouped(df_eeg$n400,
yrep =
posterior_predict(fit_N400_s,
ndraws = 100
),
group = df_eeg$subj
) +
xlab("Signal in the N400 spatiotemporal window")
```

The model `fit_N400_s`

raises the question: how much structure should we add to our statistical model? Should we also assume that \(\sigma\) can vary by items, and also by our experimental manipulation? Should we also have a maximal model for \(\sigma\)? Unfortunately, there are no clear answers that apply to every situation. The amount of complexity that we can introduce in a statistical model depends on (i) the answers we are looking for (we should include parameters that represent what we want to estimate), (ii) the size of the data at hand (more complex models require more data), (iii) our computing power (as the complexity increases models take increasingly long to converge and require more computer power to finish the computations in a feasible time frame), and (iv) our domain knowledge.

Ultimately, all models are approximations (that’s in the best case; often, they are plainly wrong) and we need to think carefully about which aspects of our data we have to account and which aspects we can abstract away from.

In the context of cognitive modeling, McClelland (2009) argues that models should not focus on a every single detail of the process they intend to explain. In order to understand a model, it needs to be simple enough. However, McClelland (2009) warns us that one must bear in mind that oversimplification does have an impact on what we can conclude from our analysis: A simplification can limit the phenomena that a model addresses, or can even lead to incorrect predictions. There is a continuum between purely statistical models (e.g., a linear regression) and computational cognitive models. For example, we can define “hybrid” models such as the linear ballistic accumulator (Brown and Heathcote 2008; and see Nicenboim 2018 for an implementation in Stan), where a great deal of cognitive detail is sacrificed for tractability. The conclusions of McClelland (2009) apply to any type of model in cognitive science: “Simplification is essential, but it comes at a cost, and real understanding depends in part on understanding the effects of the simplification”.

### References

Barr, Dale J, Roger Levy, Christoph Scheepers, and Harry J Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” *Journal of Memory and Language* 68 (3). Elsevier: 255–78.

Bates, Douglas M, Martin Mächler, Ben Bolker, and Steve Walker. 2015b. “Fitting Linear Mixed-Effects Models Using lme4.” *Journal of Statistical Software* 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Brown, Scott D., and Andrew Heathcote. 2008. “The Simplest Complete Model of Choice Response Time: Linear Ballistic Accumulation.” *Cognitive Psychology* 57 (3): 153–78. https://doi.org/10.1016/j.cogpsych.2007.12.002.

DeLong, Katherine A, Thomas P Urbach, and Marta Kutas. 2005. “Probabilistic Word Pre-Activation During Language Comprehension Inferred from Electrical Brain Activity.” *Nature Neuroscience* 8 (8): 1117–21. https://doi.org/10.1038/nn1504.

Frank, Stefan L., Leun J. Otten, Giulia Galli, and Gabriella Vigliocco. 2015. “The ERP Response to the Amount of Information Conveyed by Words in Sentences.” *Brain and Language* 140: 1–11. https://doi.org/10.1016/j.bandl.2014.10.006.

Gelman, Andrew, and Jennifer Hill. 2007. *Data Analysis Using Regression and Multilevel/Hierarchical Models*. Cambridge University Press.

Kutas, Marta, and Kara D. Federmeier. 2011. “Thirty Years and Counting: Finding Meaning in the N400 Componentof the Event-Related Brain Potential (ERP).” *Annual Review of Psychology* 62 (1): 621–47. https://doi.org/10.1146/annurev.psych.093008.131123.

Kutas, Marta, and Steven A Hillyard. 1980. “Reading Senseless Sentences: Brain Potentials Reflect Semantic Incongruity.” *Science* 207 (4427): 203–5. https://doi.org/10.1126/science.7350657.

Kutas, Marta, and Steven A Hillyard. 1984. “Brain Potentials During Reading Reflect Word Expectancy and Semantic Association.” *Nature* 307 (5947): 161–63. https://doi.org/10.1038/307161a0.

Laird, Nan M, and James H Ware. 1982. “Random-Effects Models for Longitudinal Data.” *Biometrics*. JSTOR, 963–74.

McClelland, James L. 2009. “The Place of Modeling in Cognitive Science.” *Topics in Cognitive Science* 1 (1): 11–38. https://doi.org/10.1111/j.1756-8765.2008.01003.x.

McElreath, Richard. 2015. *Statistical Rethinking: A Bayesian Course with R Examples*. Boca Raton, Florida: Chapman; Hall/CRC.

Nicenboim, Bruno. 2018. “The Implementation of a Model of Choice: The (Truncated) Linear Ballistic Accumulator.” In *StanCon*. Aalto University, Helsinki, Finland. https://doi.org/10.5281/zenodo.1465990.

Nicenboim, Bruno, Shravan Vasishth, and Frank Rösler. 2020a. “Are Words Pre-Activated Probabilistically During Sentence Comprehension? Evidence from New Data and a Bayesian Random-Effects Meta-Analysis Using Publicly Available Data.” *Neuropsychologia* 142. https://doi.org/10.1016/j.neuropsychologia.2020.107427.

Nieuwland, Mante S, Stephen Politzer-Ahles, Evelien Heyselaar, Katrien Segaert, Emily Darley, Nina Kazanina, Sarah Von Grebmer Zu Wolfsthurn, et al. 2018. “Large-Scale Replication Study Reveals a Limit on Probabilistic Prediction in Language Comprehension.” *eLife* 7. https://doi.org/10.7554/eLife.33468.

Picton, T.W., S. Bentin, P. Berg, E. Donchin, S.A. Hillyard, R. Johnson JR., G.A. Miller, et al. 2000. “Guidelines for Using Human Event-Related Potentials to Study Cognition: Recording Standards and Publication Criteria.” *Psychophysiology* 37 (2): 127–52. https://doi.org/10.1111/1469-8986.3720127.

Pinheiro, José C, and Douglas M Bates. 2000. *Mixed-Effects Models in S and S-PLUS*. New York: Springer-Verlag.

Sorensen, Tanner, Sven Hohenstein, and Shravan Vasishth. 2016. “Bayesian Linear Mixed Models Using Stan: A Tutorial for Psychologists, Linguists, and Cognitive Scientists.” *Quantitative Methods for Psychology* 12 (3): 175–200.

For simplicity, we assume that they share the same standard deviation.↩

If we don’t remove the intercept, that is, if we use the formula

`n400 ~ 1 + factor(subj) + c_cloze:factor(subj)`

, with`factor(subj)`

we are going estimate the deviation between the first subject and the each of the other subjects.↩The analogous frequentist model can be fit using

`lmer`

from the package`lme4`

, using`(1+c_cloze||subj)`

or, equivalently,`(c_cloze||subj)`

for the by-subject random effects.↩The intercept adjustment is often called \(u_0\) in statistics books, where the intercept might be called \(\alpha\) or (sometimes also \(\beta_0\)), and thus \(u_1\) refers to the adjustment to the slope. However, in this book, we start the indexing with 1 to be consistent with the Stan language.↩

By contrast, in the frequentist

`lmer`

model, the adjustments \(u_1\) and \(u_2\) are not parameters; they are called conditional modes; see Bates, Mächler, et al. (2015b).↩Another source of confusion here is that

*hyperparameters*is also used in the machine learning literature with a different meaning.↩One could in theory keep going deeper and deeper, defining hyper-hyperpriors etc., but the model would quickly become impossible to fit.↩

The double pipe is also used in the same way in

`lmer`

from the package`lme4`

.↩This is because an LKJ correlation distribution with a large \(\eta\) corresponds to a correlation matrix with values close to zero in the lower and upper triangles↩

shorturl.at/brNS8↩