Chapter 18 Multinomial processing trees

In this chapter, we introduce a widely-used cognitive model that can be implemented in Stan, the multinomial processing tree. This model is useful in situations where the behavioral response from the subject is one of several possible categorical outcomes. As an example, we will look into a word production task, where we ask individuals with aphasia (a language impairment that is usually due to a cerebrovascular accident or head trauma, Damasio 1992), to name the object shown in a picture, e.g., a picture of a cat. The participant (hereafter, subject) could produce the correct name (“cat”), a semantically and phonologically related but incorrect name (“rat”), a semantically unrelated but phonologically related word (“hat”), or a non-word (“cag”). The researcher may have a theory about how each possible outcome ends up being probabilistically produced. Such a theoretical process model can be expressed as a multinomial processing tree. Before we dive into multinomial processing trees, we discuss the distributions that generalize the binomial and Bernoulli distribution for modeling more than two possible outcomes.

18.1 Modeling multiple categorical responses

One way to model categorical responses is using multinomial or categorical distributions. The categorical responses could be “yes” or “no”; “blue”, “red” or “yellow”; “true”, “false”, or “I don’t know”; or more complicated categories, but crucially the response of each observation can be coded as belonging to only one category. The multinomial and the categorical distribution represent two ways of characterizing the underlying generative process for such data.

The multinomial distribution is the generalization of the binomial distribution for more than two possible outcomes. Recall that the binomial works like this: in order to randomly generate the number of successes in observation consisting of $10$ trials, with the probability of success $0.5$, one can type:

rbinom(1, size = 10, prob = 0.5)

## [1] 4

It is possible to repeatedly generate multiple observations as follows. Suppose five simulated observations are needed, each with $10$ trials:

rbinom(5, size = 10, prob = 0.5)

## [1] 6 5 7 7 2

Now, suppose that there are N=3 possible answers to a question (yes, no don’t know), and suppose that the probabilities of producing each answer are:

$P($yes$)=0.1$
$P($no$)=0.1$
$P($don’t know$)=0.8$

The probabilities must sum to $1$, because those are the only three possible outcomes. Given such a situation, it is possible to simulate a single experiment with $10$ trials, where each of the three possibilities appears a certain number of times. We do this with rmnom from the extraDistr package (one could have used rmultinom() in R equally well, but the output would look a bit different):

(random_sample <- rmnom(1, size = 10, prob = c(0.1, 0.1, 0.8)))

##      [,1] [,2] [,3]
## [1,]    0    2    8

The above call returns the result of the random sample: $0$ cases of the first answer type, $2$ cases of the second; and $8$ cases of the third.

Analogously to the binomial function shown above, five observations can be simulated, each having 10 trials:

rmnom(5, size = 10, prob = c(0.1, 0.1, 0.8))

##      [,1] [,2] [,3]
## [1,]    1    2    7
## [2,]    3    0    7
## [3,]    1    2    7
## [4,]    1    1    8
## [5,]    3    1    6

The categorical distribution is the generalization of the Bernoulli distribution for more than two possible outcomes, and it is the special case of the multinomial distribution when we have only one trial. Recall that the Bernoulli distribution can be used as follows. If we carry out a coin toss (each coin toss counts as a single trial), we will either get a heads or a tails:

rbern(5, prob = 0.5)

## [1] 0 0 1 0 1

## equivalent rbinom command:
rbinom(5, size = 1, prob = 0.5)

## [1] 0 0 1 1 1

Thus, what the Bernoulli is to the Binomial, the Categorical is to the Multinomial. For example, one can simulate five observations, each of which will give one of the three responses with the given probabilities. We do this with rcat from the extraDistr package.

rcat(5, prob = c(0.1, 0.1, 0.8), labels = c("yes", "no", "dontknow"))

## [1] dontknow yes      dontknow dontknow dontknow
## Levels: yes no dontknow

The above is analogous to using the multinomial with size = 1 (a single trial in each experiment). In the output below, the rmnom function shows which of the three categories is produced.

rmnom(5, size = 1, prob = c(0.1, 0.1, 0.8))

##      [,1] [,2] [,3]
## [1,]    0    0    1
## [2,]    0    1    0
## [3,]    1    0    0
## [4,]    0    0    1
## [5,]    0    0    1

With these distributions as background, consider now a simulated situation where multiple responses are possible.

18.1.1 A model for multiple responses using the multinomial likelihood

Impaired picture naming ( anomia) is common in most cases of aphasia. It is assessed as part of most comprehensive aphasia test batteries, since picture naming accuracy is a relatively easily obtained and is a reliable test score; in addition, the types of errors that are committed can provide useful information for diagnosis.

In this simulated experiment, the responses are categorized as shown in Table 18.1.

TABLE 18.1: Categorization of responses for the simulated experiment.
Category	Description	Example
Correct	The response matches the target.	cat
Neologism	The response is not a word, but it has a phonological relation to the target.	cag
Formal	The response is a word with only a phonological relation to the target.	hat
Mixed	The response is a word with both a semantic and phonological relation the target.	rat
NR	All other responses, including omissions, descriptions, non-nouns, etc.	–

First, generate data assuming a multinomial distribution. The outcomes will be determined by a vector $\boldsymbol{\theta}$ (called true_theta below in the R code) that indicates the probability of each outcome:

(true_theta <- tibble(theta_NR = .2, 
                     theta_Neologism = .1,
                     theta_Formal = .2,
                     theta_Mixed = .08,
                     theta_Correct =  1 -
  (theta_NR + theta_Neologism + theta_Formal + theta_Mixed)))

## # A tibble: 1 × 5
##   theta_NR theta_Neologism theta_Formal theta_Mixed theta_Correct
##      <dbl>           <dbl>        <dbl>       <dbl>         <dbl>
## 1      0.2             0.1          0.2        0.08          0.42

## The probabilities must sum to 1:
sum(true_theta)

## [1] 1

Given this vector of probabilities $\boldsymbol{\theta}$, generate values assuming a multinomial distribution of responses in 100 trials:

N_trials <- 100
(ans_mn <- rmultinom(1, N_trials, true_theta))

##                 [,1]
## theta_NR          23
## theta_Neologism    7
## theta_Formal      18
## theta_Mixed        6
## theta_Correct     46

Now, we’ll try to recover the probability of each answer with a model with the following likelihood:

\[\begin{equation} ans \sim \mathit{Multinomial}(\boldsymbol{\theta}) \end{equation}\]

where $\boldsymbol{\theta} = \{\theta_{nr}, \theta_{neol.}, \theta_{formal}, \theta_{mix}, \theta_{corr}\}$.

A common prior for multinomial likelihood is the Dirichlet distribution, which extends the Beta distribution to cases where more than two categories are available.

\[\begin{equation} \boldsymbol{\theta} \sim \mathit{Dirichlet}(\boldsymbol{\alpha}) \end{equation}\]

The Dirichlet distribution has a parameter $\alpha$, called concentration parameter, and it is a vector with the same length as $\boldsymbol{\theta}$. If we set $\boldsymbol{\alpha} = \{2,2,2,2,2\}$, this is analogous to $\sim \mathit{Beta}(2,2)$, that is, the intuition behind this concentration parameter is that the prior probability distribution of the vector $\boldsymbol{\theta}$ corresponds to have seen two outcomes of each category in the past.

A Stan model assuming a multinomial likelihood and Dirichlet prior is shown below. Since the elements of $\boldsymbol{\theta}$ should sum to one we declare this vector, theta, as a simplex. The simplex type ensures that its elements sum to one and also constraints them to have non-negative values. In order to generate the vector $\boldsymbol{\alpha}$ that contains five times the value two, we use rep_vector(2, 5) (which is similar to rep(2, 5) in R).

data {
  int<lower = 1> N_trials;
  array[5] int<lower = 0,upper = N_trials> ans;
}
parameters {
  simplex[5] theta;
}
model {
  target += dirichlet_lpdf(theta | rep_vector(2, 5));
  target += multinomial_lpmf(ans | theta);
}
generated quantities{
  array[5] int pred_ans = multinomial_rng(theta, N_trials);
}

Fit the model:

# Create a list:
# c(ans_mn) makes a vector out of the matrix ans_mn
data_mn <-  list(N_trials = N_trials,
                 ans = c(ans_mn))

str(data_mn)

## List of 2
##  $ N_trials: num 100
##  $ ans     : int [1:5] 23 7 18 6 46

multinom <- system.file("stan_models",
                        "multinom.stan",
                        package = "bcogsci")
fit_mn <- stan(multinom, data = data_mn)

Print the posteriors:

print(fit_mn, pars = c("theta"))

##          mean 2.5% 97.5% n_eff Rhat
## theta[1] 0.23 0.16  0.31  4267    1
## theta[2] 0.08 0.04  0.14  4765    1
## theta[3] 0.18 0.11  0.26  4017    1
## theta[4] 0.07 0.03  0.13  4440    1
## theta[5] 0.44 0.35  0.53  4104    1

Next, use mcmc_recover_hist in the code below to confirm that the posterior distributions of the elements of $\boldsymbol{\theta}$ are close to the true values that were set up when simulating the data. See Figure 18.1.

as.data.frame(fit_mn) %>%
  select(starts_with("theta")) %>%
  mcmc_recover_hist(true = unlist(true_theta)) +
  coord_cartesian(xlim = c(0, 1))

FIGURE 18.1: Posterior distributions and true means of theta for the multinomial model defined in multinom.stan.

We evaluate here whether our model is able to “recover” the true value of its parameters. By “recover”, we mean that the true values are somewhere inside the posterior distribution of the model.

The frequentist properties of Bayesian models guarantee that if we simulate data several times, 95% of the true values should be inside of the 95% CrI intervals generated by a “well-calibrated” model. Furthermore, if the true values of some parameters are consistently well above or below their posterior distribution, it may mean that there is some problem with the model specification. We follow Cook, Gelman, and Rubin (2006) here, and for now we are going to verify that our model is roughly correct. A more principled (and computationally demanding) approach uses simulation based calibration (SBC) introduced in section 12.2 of chapter 12 (also see Talts et al. 2018; Schad, Betancourt, and Vasishth 2020).

18.1.2 A model for multiple responses using the categorical distribution

Using the same information as above, we can model each response one at a time, instead of aggregating them. Using the categorical distribution gives us more flexibility to define what happens at every trial. However, we are not using the additional flexibility for now, and hence the next model and the previous one are equivalent.

data {
  int<lower = 1> N_obs;
  array[N_obs] int<lower = 1, upper = 5> w_ans;
}
parameters {
  simplex[5] theta;
}
model {
  target += dirichlet_lpdf(theta | rep_vector(2, 5));
  for(n in 1:N_obs)
    target += categorical_lpmf(w_ans[n] | theta);
}
generated quantities{
  array[N_obs] int pred_w_ans;
  for(n in 1:N_obs)
    pred_w_ans[n] = categorical_rng(theta);
}

Given the same set of probabilities $\boldsymbol{\theta}$ as above, generate $100$ individual observations using rcat() from the extraDistr package:

N_obs <- 100 
ans_cat <- rcat(N_obs, prob = as.matrix(true_theta))

The above output is how Stan expects to see the data. The data fed into the Stan model is defined as a list as usual:

data_cat <-  list(N_obs = N_obs,
                  w_ans = ans_cat)
str(data_cat)

## List of 2
##  $ N_obs: num 100
##  $ w_ans: num [1:100] 1 3 3 3 1 1 2 3 2 5 ...

Fitting the Stan model (categorical.stan) should yield approximately the same $\boldsymbol{\theta}$ as with the multinomial likelihood defined in the model multinom.stan.

categorical <- system.file("stan_models",
                           "categorical.stan",
                           package = "bcogsci")
fit_cat <- stan(categorical, data = data_cat)

print(fit_cat, pars = c("theta"))

##          mean 2.5% 97.5% n_eff Rhat
## theta[1] 0.19 0.12  0.27  4699    1
## theta[2] 0.08 0.04  0.14  4961    1
## theta[3] 0.27 0.20  0.36  4805    1
## theta[4] 0.05 0.02  0.10  5155    1
## theta[5] 0.40 0.31  0.49  4756    1

The above models estimate the posterior distribution for the probability for each possible response. If we had some experimental manipulation, we could even fit regressions to these parameters. This is called a multinomial logistic regression or categorical regression, see further reading for some examples.

18.2 Modeling picture naming abilities in aphasia with MPT models

Multinomial processing tree (MPT) modeling is a method that estimates latent variables that have a psychological interpretation given categorical data (a review is provided in Batchelder and Riefer 1999). In other words, an MPT model is just one way to model categorical responses following a multinomial or categorical distribution. MPT models assume that the observed response categories result from a sequences of underlying cognitive events which are represented as a binary branching tree. Each binary branching is associated with a parameter that represents the probability of going down either branch. Every successive node is assumed to be independent of the preceding node, allowing us to use the product rule from probability theory to compute the probability of going down a particular path. The leaves of the binary branching tree are the observed response in the data. The goal is to derive posterior distributions of the latent probability parameters specified for the binary branching in the model.

Walker, Hickok, and Fridriksson (2018) created an MPT model that specifies a set of possible internal errors that lead to the various possible response types during a picture naming trial for aphasic patients. Here we’ll explore a simplification of the original model.

The model assumes that when an attempt is made to produce a word, errors in production can arise at the whole word level (lexical level) or the segmental level (phonological level). Semantic errors are assumed to arise from the lexical substitutions, and neologism errors from phonological substitutions. Real word responses that are phonologically related to the correct target word can arise from substitutions at the lexical or phonological level.

The task for the subject is to view a picture and name the object represented in the picture. When an attempt is made to retrieve the word from memory, the following possible steps can unfold (this is a simplified version of the original model):

Either the subject will make some lexical selection, or fail to make a lexical selection, returning a non-response (NR). The probability of making some lexical selection is $a$, so the probability of a non-response is $1-a$, as these are only two possibilities at this initial stage of the binary branching tree. Example: the subject sees the picture of a cat, and either produces the response “I don’t know”, or starts the process of producing a word.
If a lexical selection is made, the target word is selected with probability $t$, or some other word is chosen with probability $1-t$.
Once a word is selected, either its phonological representation is selected with probability $f$, or some other (incorrect) phonological representation is selected with probability $1-f$.
Once a word is selected, there can be a phonological change that leads to a real, formally related word with probability $c$, or a neologism with probability $1-c$. Example: the subjects produces either a formally related word “hat,” or a neologism like “cag.”

The end-result of walking down this tree is that the subject either produces a non-response (“I don’t know” or silence), a correct response, a related word, a mixed word, or a neologism. There is more than one way to produce a neologism or a related word, and the posterior probabilities of the various paths will determine the probability of each possible path.

FIGURE 18.2: Representation of a simplification of the MPT used in Walker, Hickok, and Fridriksson (2018).

TABLE 18.2: Psychological interpretation of the parameters of the MPT model.
Param.	Description
a	Probability of initiating an attempt
t	Probability of selecting a target word over competitors
f	Probability of retrieving correct phonemes
c	Probability of a phoneme change in the target word creating a real word

18.2.1 Calculation of the probabilities in the MPT branches

By navigating through the branches of the MPT (Figure 18.2), we can calculate the probabilities of the five responses (the categorical outcomes), based on the four underlying parameters assumed in the MPT:

$P(\mathit{NR}| a,t,f,c)= 1-a$
$P(\mathit{Neologism}| a,t,f,c)= a \cdot (1-t) \cdot (1-f) \cdot (1-c) + a \cdot t \cdot (1-f) \cdot (1-c)$
$P(\mathit{Formal}| a,t,f,c)= a \cdot (1-t) \cdot (1-f) \cdot c + a \cdot t \cdot (1-f) \cdot c$
$P(\mathit{Mixed}| a,t,f,c)= a \cdot (1-t) \cdot f$
$P(\mathit{Correct}| a,t,f,c)= a \cdot t \cdot f$

Given that

\[\begin{multline} P(\mathit{NR}| a,t,f,c) + P(\mathit{Neologism}| a,t,f,c) + P(\mathit{Formal}| a,t,f,c) + \\ P(\mathit{Mixed}| a,t,f,c) + P(\mathit{Correct}| a,t,f,c) = 1 \end{multline}\]

there is no need to characterize every outcome: we can always calculate any one of the remaining responses as one minus the other responses.

18.2.2 A simple MPT model

First, simulate 200 trials assuming no variability between items and subjects. It is convenient to define functions to compute each outcome’s probability, based on the previous MPT. One needs to assign “true values” to the underlying parameters of the MPT; these values are only for illustration. Ideally, one should simulate data using parameter values that are realistic; that is, one should use values based on the literature.

# The probabilities of the different answers:
Pr_NR <- function(a, t, f, c)
    1 - a
Pr_Neologism <- function(a, t, f, c)
    a * (1 - t) * (1 - f) * (1 - c) + a * t * (1 - f) * (1 - c)
Pr_Formal <- function(a, t, f, c)
    a * (1 - t) * (1 - f) * c +  a * t * (1 - f) * c
Pr_Mixed <- function(a, t, f, c)
    a * (1 - t) * f
Pr_Correct <- function(a, t, f, c)
    a * t * f
# The true underlying values for simulated data:
a_true <- .75
t_true <- .9
f_true <- .8
c_true <- .1
# The probability of the different answers:
Theta <- tibble(NR = Pr_NR(a_true, t_true, f_true, c_true),
                Neologism = Pr_Neologism(a_true, t_true, f_true, c_true),
                Formal = Pr_Formal(a_true, t_true, f_true, c_true),
                Mixed = Pr_Mixed(a_true, t_true, f_true, c_true),
                Correct = Pr_Correct(a_true, t_true, f_true, c_true))
N_trials <- 200
(ans <- rmultinom(1, N_trials, c(Theta)))

##           [,1]
## NR          49
## Neologism   26
## Formal       5
## Mixed       17
## Correct    103

The above data can be modeled in Stan as discussed below (see mpt_mnm.stan). The probabilities of the different categories go into the transformed parameters section because they are derived from the probability parameters in the model. The data are modeled as coming from a multinomial likelihood. If priors are not specified, then a Beta distribution with $a=1$ and $b=1$ (a Uniform(0,1) distribution) is assumed for the parameters $a$, $t$, $f$, and $c$. Unlike $\boldsymbol{\theta}$, the values of these parameters are independent of each other and they do not sum to one. For this reason, we should not use a Dirichlet prior here.

We define the following model:

\[\begin{equation} \begin{aligned} \theta_{nr} &= 1-a \\ \theta_{neol.} &= a \cdot (1-t) \cdot (1-f) \cdot (1-c) + a \cdot t \cdot (1-f) \cdot (1-c)\\ \theta_{formal} &= a \cdot (1-t) \cdot (1-f) \cdot c + a \cdot t \cdot (1-f) \cdot c\\ \theta_{mix} &= a \cdot (1-t) \cdot f\\ \theta_{corr} &= a \cdot t \cdot f\\ \boldsymbol{\theta} &= \{\theta_{nr}, \theta_{neol.}, \theta_{formal}, \theta_{mix}, \theta_{corr}\}\\ ans &\sim \mathit{Multinomial}(\theta)\\ a,t,f,c &\sim \mathit{Beta}(2, 2) \end{aligned} \end{equation}\]

This translates to the following code:

data {
  int<lower = 1> N_trials;
  array[5] int<lower = 0, upper = N_trials> ans;
}
parameters {
  real<lower = 0, upper = 1> a;
  real<lower = 0, upper = 1> t;
  real<lower = 0, upper = 1> f;
  real<lower = 0, upper = 1> c;
}
transformed parameters {
  simplex[5] theta;
  theta[1] = 1 - a; //Pr_NR
  theta[2] = a * (1 - t) * (1 - f) * (1 - c) +
    a * t * (1 - f) * (1 - c); //Pr_Neologism
  theta[3] = a * (1 - t) * (1 - f) * c
    + a * t * (1 - f) * c;  //Pr_Formal
  theta[4] = a * (1 - t) * f; //Pr_Mixed
  theta[5] = a * t * f; //Pr_Correct
}
model {
  target += beta_lpdf(a | 2, 2);
  target += beta_lpdf(t | 2, 2);
  target += beta_lpdf(f | 2, 2);
  target += beta_lpdf(c | 2, 2);
  target += multinomial_lpmf(ans | theta);
}
generated quantities{
    array[5] int pred_ans;
  pred_ans = multinomial_rng(theta, N_trials);
}

Fit the model:

data_sMPT <-  list(N_trials = N_trials,
                   ans = c(ans))

mpt_mnm <- system.file("stan_models",
                       "mpt_mnm.stan",
                       package = "bcogsci")
fit_sMPT <- stan(mpt_mnm, data = data_sMPT)

Print out a summary of the posterior of the parameter of interest:

print(fit_sMPT, pars = c("a", "t", "f", "c"))

##   mean 2.5% 97.5% n_eff Rhat
## a 0.75 0.69  0.81  4516    1
## t 0.85 0.78  0.90  4631    1
## f 0.79 0.72  0.85  4600    1
## c 0.20 0.09  0.36  4755    1

What the model gives us is posterior distributions of each of the parameters a, t, f, c. From these we can derive the probabilities of producing the different observed responses, and the posterior predictive distributions, which could be used for model evaluation.

An important sanity check in modeling is checking whether the model can in principle recover the true parameters that generated the data; see Figure 18.3.

as.data.frame(fit_sMPT) %>%
  select(c("a","t","f","c")) %>%
  mcmc_recover_hist(true = c(a_true, t_true, f_true, c_true)) +
  coord_cartesian(xlim = c(0, 1))

FIGURE 18.3: Posterior distributions and true values of the parameters of the simple MPT model (mpt_mnm.stan).

The above figure shows that the model can indeed recover the true parameters fairly accurately.

The posterior distributions of the $\boldsymbol{\theta}$ parameters can also be summarized:

print(fit_sMPT, pars = c("theta"))

##          mean 2.5% 97.5% n_eff Rhat
## theta[1] 0.25 0.19  0.31  4516    1
## theta[2] 0.13 0.09  0.17  4911    1
## theta[3] 0.03 0.01  0.06  4090    1
## theta[4] 0.09 0.06  0.13  4758    1
## theta[5] 0.50 0.44  0.57  3945    1

These posteriors tell us the probability of producing each of the possible responses. This model might be useful for estimating the latent parameters, $a$, $t$, $f$, $c$, but without further constraints it is unfalsifiable.

Recall that for the multinomial likelihood in section 18.1.1, we had a simplex of size five, which means that we had four free parameters (since the fifth can be deduced based on the others). With five possible answers we can always estimate a vector of probabilities $\boldsymbol{\theta}$, that fits the data, in the same way that with two possible answers (e.g., zeros and ones), we can always estimate a single probability $\theta$ that fits the data (using a Bernoulli or Binomial likelihood). The MPT that we present here is just reparameterizing the vector $\boldsymbol{\theta}$ of the multinomial likelihood (with the same number of free parameters). This means that it will always achieve a perfect fit; see exercise 18.2. This doesn’t mean that this MPT model is “useless”: Under the assumption that the model is meaningful, one can estimate its latent parameters and this estimation can have theoretical implications. If we want to be able to falsify this model, we’ll need to constrain it more, as we suggest below.

18.2.3 An MPT model assuming by-item variability

The use of aggregated data implies the assumption that the estimated parameters do not vary too much between subjects and items. If this assumption is incorrect, the analysis of aggregated data may lead to erroneous conclusions: reliance on aggregated data in the presence of parameter heterogeneity may lead to biased parameter estimates and the underestimation of credible intervals.

If it is known that $f$ is affected by the phonological complexity of the individual word (e.g., cat is easier to produce than umbrella), the previous model does not have a way to include that information.

Simulated data can be generated taking into account the complexity of the items. Assume here for simplicity that the complexity of items is scaled and centered; i.e., mean complexity is represented by $0$, and the standard deviation is assumed to be $1$. We will assume a regression model that determines the parameter, $f$, as a function of the phonological complexity of each trial.

One important detail is that f is a probability and needs to be bounded between $0$ and $1$. To make sure that this property is met, the computation of f for each item will be converted to probability space using the logistic function. This is achieved as follows.

Suppose that $f'$ is a linear function of complexity. For example, two parameters $\alpha_f$ and $\beta_f$ (intercept and slope respectively) could determine how $f'$ is affected by complexity:

$f'_j=\alpha_f + complexity_j\cdot \beta_f$.

The parameters $\alpha_f$ and $\beta_f$ are defined in an unconstrained log-odds space (they can be any real number). The model that is fit then yields an $f'_j$ value for each item $j$ in log-odds space. The log-odds value $f'_j$ can be converted to a probability value $f_{true}$ by applying the logistic function (or the inverse logit, $logit^{-1}$) to $f'$. Recall from the generalized linear model discussed earlier that if we have a model in log-odds space:

\[\begin{equation} \log \left(\frac{p_j}{1-p_j}\right) = \alpha + \beta\cdot x_j = \mu_j \end{equation}\]

Then we can recover the probability $p_j$ by solving for $p_j$:

\[\begin{equation} p_j = \frac{\exp(\mu_j)}{1+\exp(\mu_j)} = \frac{1}{1+\exp(-\mu_j)} \end{equation}\]

The above is the logistic or inverse logit function: it takes as input $\mu_j$ and returns the corresponding probability $p_j$. The plogis function in R carries out the calculation shown above.

N_obs <- 50
complexity <- rnorm(N_obs) # by default mean = 0, sd = 1
## choose some hypothetical values:
alpha_f <- .3
# the negative sign indicates that
# increased complexity will lead to a reduced value of f:
beta_f <- -.3
# f' as a linear function of complexity:
f_prime <- alpha_f + complexity * beta_f
head(f_prime)

## [1]  0.468  0.369 -0.168  0.279  0.261 -0.215

## probabilities f for each item:
f_true <- plogis(f_prime)
head(f_true)

## [1] 0.615 0.591 0.458 0.569 0.565 0.447

This change in our assumptions entails that the probability of each response changes depending on the item associated with each observation. The parameters theta now have to be a matrix. This is in R; in Stan, we will code it as an array of simplexes, i.e., an array of non-negative values that sums to 1.

We continue with the functions defined in 18.2.2, and the same values for a_true, t_true, and c_true as defined in that section. Since most of the equations depend on f, and f is a vector now, the outcomes are automatically vectors. But this is not the case for theta_NR_v, and thus we need to repeat the value.

theta_NR_v <- rep(Pr_NR(a_true, t_true, f_true, c_true), N_obs)
theta_Neologism_v <- Pr_Neologism(a_true, t_true, f_true, c_true)
theta_Formal_v <- Pr_Formal(a_true, t_true, f_true, c_true)
theta_Mixed_v <- Pr_Mixed(a_true, t_true, f_true, c_true)
theta_Correct_v <- Pr_Correct(a_true, t_true, f_true, c_true)
theta_item <- matrix(c(theta_NR_v,
                       theta_Neologism_v,
                       theta_Formal_v,
                       theta_Mixed_v,
                       theta_Correct_v),
                     ncol = 5)
dim(theta_item)

## [1] 50  5

head(theta_item,n = 3)

##      [,1]  [,2]   [,3]   [,4]  [,5]
## [1,] 0.25 0.260 0.0289 0.0461 0.415
## [2,] 0.25 0.276 0.0307 0.0443 0.399
## [3,] 0.25 0.366 0.0406 0.0344 0.309

Store this in a data frame:

sim_data_cx <- tibble(item = 1:N_obs,
                      complexity = complexity,
                      w_ans = c(rcat(N_obs,theta_item)))
sim_data_cx

## # A tibble: 50 × 3
##    item complexity w_ans
##   <int>      <dbl> <dbl>
## 1     1     -0.560     5
## 2     2     -0.230     2
## 3     3      1.56      2
## # ℹ 47 more rows

The following model (saved in mpt_cat.stan) is essentially doing the same thing as the previous model but instead of fitting a multinomial to the summary of all the trials, it is fitting a categorical distribution to each individual observation. (This is analogous to the difference between the Bernoulli and Binomial distributions).

This is still not an appropriate model for the generative process that we are assuming in this section, because it still ignores the effect of complexity. But it is a good start.

data {
  int<lower = 1> N_obs;
  array[N_obs] int<lower = 1, upper = 5> w_ans;
}
parameters {
  real<lower=0,upper=1> a;
  real<lower=0,upper=1> t;
  real<lower=0,upper=1> f;
  real<lower=0,upper=1> c;
}
transformed parameters {
  array[N_obs] simplex[5] theta;

  for(n in 1:N_obs){
    //Pr_NR:
    theta[n, 1] = 1 - a;
    //Pr_Neologism:
    theta[n, 2] = a * (1 - t) * (1 - f) * (1 - c) +
      a * t * (1 - f) * (1 - c);
    //Pr_Formal:
    theta[n, 3] = a * (1 - t) * (1 - f) * c +
      a * t * (1 - f) * c;
    //Pr_Mixed:
    theta[n, 4] = a * (1 - t) * f;
    //Pr_Correct:
    theta[n, 5] = a * t * f;
  }
}
model {
  target += beta_lpdf(a | 2, 2);
  target += beta_lpdf(t | 2, 2);
  target += beta_lpdf(f | 2, 2);
  target += beta_lpdf(c | 2, 2);
  for(n in 1:N_obs)
    target += categorical_lpmf(w_ans[n] | theta[n]);
}
generated quantities{
    array[N_obs] int pred_w_ans;
  for(n in 1:N_obs)
    pred_w_ans[n] = categorical_rng(theta[n]);
}

An important aspect of the previous model is that theta is declared as simplex[5] theta[N_obs]. This means that theta is an array of simplexes and thus has now two dimensions: each element of the array (of length N_obs) is a simplex and sums to one. That’s why we iterate over the N_obs. However, one limitation of the previous model is that the latent parameters a, t, f, c are declared as real and they do not vary in each iteration of the loop. Before moving to the next section, you might want to do exercise 18.3, where you are asked to edit the previous chunk of code to incorporate the fact that f is now a transformed parameter that depends on the trial information and two new parameters.

18.2.4 A hierarchical MPT

The previous model doesn’t take into account that subjects might vary (and neither does the modification to this model that is suggested in exercise 18.3). Let’s focus on taking into account the differences between subjects.

Different subjects might not be equally motivated to do the task. This can be accounted for by adding hierarchical structure to the parameter a, the probability of initiating an attempt. Begin by simulating some data that incorporates by-subject variability.

First, define the number of items and subjects, and the number of observations:

N_item <- 20
N_subj <- 30
N_obs <- N_item * N_subj

Then, generate a vector for subjects and for items. Assume here that each subject sees each item.

subj <- rep(1:N_subj, each = N_item)
item <- rep(1:N_item, time = N_subj)

A vector representing complexity is created for the number of items we have, and this vector is repeated as many times as there are subjects:

complexity <- rep(rnorm(N_item), times = N_subj)

Next, create a data frame with all the above information:

(exp_sim <- tibble(subj = subj,
                  item = item,
                  complexity = complexity))

## # A tibble: 600 × 3
##    subj  item complexity
##   <int> <int>      <dbl>
## 1     1     1     -0.560
## 2     1     2     -0.230
## 3     1     3      1.56 
## # ℹ 597 more rows

To create subject-level variability in the data, a between-subject standard deviation needs to be defined. This standard deviation represents the deviations of subjects about the grand mean. We are defining this adjustment in log-odds space.

# New parameters, in log-odds space:
tau_u_a <- 1.1
## generate subject adjustments in log-odds space:
u_a <- rnorm(N_subj, 0, tau_u_a)
str(u_a)

##  num [1:30] -1.175 -0.24 -1.129 -0.802 -0.688 ...

Given the fixed a_true probability value of 0.75, the subject-level values for individual a_true can be derived by (a) first converting the overall a_true value to log-odds space, (b) adding the by-subject adjustment to this converted overall value, and (c) then converting back to probability space using the logistic or inverse logit (plogis) function. Essentially we generate data assuming the following:

\[\begin{equation} \begin{aligned} a_{h,n}' &= \alpha_a + u_{a,subj[n]}\\ a_{h,n} &= \text{logit}^{-1}(a_{h,n}') \end{aligned} \end{equation}\]

Where $u_{a,subj[n]}$ is a vector with the same length as the total number of observations. The meaning of this notation was explained in the section 11.1.

This is done in R as follows:

a_true <- .75 # as before
## convert the intercept to log-odds space:
alpha_a <- qlogis(a_true)
## a_h' in log-odds space:
a_h_prime <-  alpha_a + u_a[subj]
## convert back to probability space
a_true_h <- plogis(a_h_prime)
str(a_true_h)

##  num [1:600] 0.481 0.481 0.481 0.481 0.481 ...

What this achieves mathematically is adding varying intercepts by subjects to alpha_a, and then the values adjusted by subject are saved in probability space.

As before, f_true is computed as a function of complexity:

alpha_f <- .3; beta_f <- -.3
f_true <- plogis(alpha_f + complexity * beta_f)

We continue with the same probability functions and the rest of the true values remain the same as well.

t_true <- .9; c_true <- .1
Pr_NR <- function(a, t, f, c)
    1 - a
Pr_Neologism <- function(a, t, f, c)
    a * (1 - t) * (1 - f) * (1 - c) + a * t * (1 - f) * (1 - c)
Pr_Formal <- function(a, t, f, c)
    a * (1 - t) * (1 - f) * c +  a * t * (1 - f) * c
Pr_Mixed <- function(a, t, f, c)
    a * (1 - t) * f
Pr_Correct <- function(a, t, f, c)
    a * t * f

Now, we can define the probabilities of different outcomes:

# Aux. parameters that define the probabilities:
theta_NR_v_h <- Pr_NR(a_true_h, t_true, f_true, c_true) 
theta_Neologism_v_h <- Pr_Neologism(a_true_h, t_true, f_true, c_true)
theta_Formal_v_h <- Pr_Formal(a_true_h, t_true, f_true, c_true)
theta_Mixed_v_h <- Pr_Mixed(a_true_h, t_true, f_true, c_true)
theta_Correct_v_h <- Pr_Correct(a_true_h, t_true, f_true, c_true)
theta_h <- matrix(
  c(theta_NR_v_h,
    theta_Neologism_v_h,
    theta_Formal_v_h,
    theta_Mixed_v_h,
    theta_Correct_v_h),
  ncol = 5)
dim(theta_h)

## [1] 600   5

The probability specifications shown above can now generate the simulated data:

(sim_data_h <- mutate(exp_sim,
                      w_ans = rcat(N_obs,theta_h)))

## # A tibble: 600 × 4
##    subj  item complexity w_ans
##   <int> <int>      <dbl> <dbl>
## 1     1     1     -0.560     2
## 2     1     2     -0.230     1
## 3     1     3      1.56      1
## # ℹ 597 more rows

Next, define the following model; we omit the steps with $f'$ and $a'$ and directly apply the logistic function to a regression. The parameters $t$, $c$ do not vary by item or subject and therefore do not have the subscript $_n$. We start by defining relatively weak priors for all the parameters in the following model. (See how we decided on the priors of $\alpha$ and $\beta$ in the logistic regression example of section 4.3.2).

\[\begin{equation} \begin{aligned} \alpha_a, \alpha_f &\sim \mathit{Normal}(0, 1.5)\\ \beta_f &\sim \mathit{Normal}(0, 1)\\ t,c &\sim \mathit{Beta}(2, 2)\\ \tau_u &\sim \mathit{Normal}(0, 1)\\ u_a &\sim \mathit{Normal}(0, \tau_{u_a})\\ a_n &= logit^{-1}(\alpha_a + u_{a,subj[n]})\\ f_n &= logit^{-1}(\alpha_f + complexity_n \cdot \beta_f)\\ \theta_{n,nr} &= 1 - a_n \\ \theta_{n,neol.} &= a_n \cdot (1-t) \cdot (1-f_n) \cdot (1-c) + a_n \cdot t \cdot (1-f_n) \cdot (1-c)\\ \theta_{n,formal} &= a_n \cdot (1-t) \cdot (1-f_n) \cdot c + a_n \cdot t \cdot (1-f_n) \cdot c\\ \theta_{n,mix} &= a_n \cdot (1-t) \cdot f_n\\ \theta_{n,corr} &= a_n \cdot t \cdot f_n\\ \theta_n &= \{\theta_{n, nr}, \theta_{n, neol.}, \theta_{n, formal}, \theta_{n, mix}, \theta_{n, corr}\}\\ ans_n &\sim \mathit{Categorical}(\theta_n) \end{aligned} \end{equation}\]

The corresponding Stan model mpt_h.stan will look like this:

data {
  int<lower = 1> N_obs;
  array[N_obs] int<lower = 1, upper = 5> w_ans;
  array[N_obs] real complexity;
  int<lower = 1> N_subj;
  array[N_obs] int<lower = 1, upper = N_subj> subj;
}
parameters {
  real<lower = 0, upper = 1> t;
  real<lower = 0, upper = 1> c;
  real alpha_a;
  real<lower = 0> tau_u_a;
  vector[N_subj] u_a;
  real alpha_f;
  real beta_f;
}
transformed parameters {
  array[N_obs] simplex[5] theta;
  for (n in 1:N_obs){
    real a = inv_logit(alpha_a + u_a[subj[n]]);
    real f = inv_logit(alpha_f + complexity[n] * beta_f);
    //Pr_NR
    theta[n, 1] = 1 - a;
    //Pr_Neologism
    theta[n, 2] = a * (1 - t) * (1 - f) * (1 - c) +
      a * t * (1 - f) * (1 - c);
    //Pr_Formal
    theta[n, 3] = a * (1 - t) * (1 - f) * c
      + a * t * (1 - f) * c;
    //Pr_Mixed
    theta[n, 4] = a * (1 - t) * f;
    //Pr_Correct
    theta[n, 5] = a * t * f;
  }
}
model {
  target += beta_lpdf(t | 2, 2);
  target += beta_lpdf(c | 2, 2);
  target += normal_lpdf(alpha_a | 0, 1.5);
  target += normal_lpdf(alpha_f | 0, 1.5);
  target += normal_lpdf(beta_f | 0, 1);
  target += normal_lpdf(u_a | 0, tau_u_a);
  target += normal_lpdf(tau_u_a | 0, 1) - normal_lccdf(0 | 0, 1);
  for(n in 1:N_obs)
    target +=  categorical_lpmf(w_ans[n] | theta[n]);
}
generated quantities{
  array[N_obs] int<lower = 1, upper = 5> pred_w_ans;
  for(n in 1:N_obs)
    pred_w_ans[n] = categorical_rng(theta[n]);
}

For ease of exposition, we are not using the non-centered parameterization discussed previously in section 11.1.2. We could also apply it here; that will speed up and improve the convergence of the model. See Exercise 18.4.

It would be a good idea to plot prior predictive distributions for this model, but we skip this step here. Next, fit the model to the simulated data, by first defining the data as a list:

sim_list_h <-  list(N_obs = nrow(sim_data_h),
                    w_ans = sim_data_h$w_ans,
                    N_subj = max(sim_data_h$subj),
                    subj = sim_data_h$subj,
                    complexity = sim_data_h$complexity)

mpt_h <- system.file("stan_models",
                     "mpt_h.stan",
                     package = "bcogsci")
fit_mpt_h <- stan(mpt_h, data = sim_list_h)

Print out a summary of the posterior:

print(fit_mpt_h,
      pars = c("t", "c", "tau_u_a", "alpha_a", "alpha_f", "beta_f"))

##          mean  2.5% 97.5% n_eff Rhat
## t        0.91  0.87  0.94  5876    1
## c        0.11  0.07  0.16  5616    1
## tau_u_a  0.99  0.66  1.40  2090    1
## alpha_a  1.02  0.60  1.44  1526    1
## alpha_f  0.25  0.05  0.45  4814    1
## beta_f  -0.22 -0.42 -0.02  4751    1

If we had fit this to real data, we would now conclude that:

given the value of beta_f, complexity has an adverse effect on the probability of retrieving the correct phonemes, and
given the posterior distribution of tau_u_a, there is a great deal of variation in the subjects’ probability of initiating an attempt at each trial. Furthermore, if we had some expectation about $t$ and $c$ based on previous research, we could conclude that our results are in line (or not) with previous findings.

One could inspect how one unit of complexity is affecting the probability of retrieving the correct phoneme ($f$). We first derive the value of f for an item of zero complexity (that is $\alpha_{f} + 0 \times \beta_{f}$) and then the value of f for an item with a complexity of one ($\alpha_{f} + 1 \times \beta_{f}$). We are interested in summarizing the difference between the two:

as.data.frame(fit_mpt_h) %>%
  select(alpha_f, beta_f) %>%
  mutate(f_0 = plogis(alpha_f),
         f_1 = plogis(alpha_f + beta_f),
         diff_f = f_1 - f_0) %>%
  summarize(Estimate = mean(diff_f),
          `2.5%` = quantile(diff_f, 0.025),
          `97.5%` = quantile(diff_f, 0.975))

##   Estimate   2.5%    97.5%
## 1  -0.0542 -0.104 -0.00399

One further interesting step could be to develop a competing model that assumes a different latent process, and then comparing the performance of the MPT with this competing model, using Bayes factors or K-fold-CV (or both).

Since we generated the data based on known latent parameters, we also plot the posteriors together with the true values of the parameters in Figure 18.4. This is something that we can only do with simulated data.

as.data.frame(fit_mpt_h) %>%
  select(tau_u_a, alpha_a, t, alpha_f, beta_f, c) %>%
  mcmc_recover_hist(true = c(tau_u_a,
                             qlogis(a_true),
                             t_true, alpha_f,
                             beta_f, c_true))

FIGURE 18.4: Posterior of the hierarchical MPT with true values as vertical lines (model mpt_h.stan).

If everything is correctly defined in the model, we should be able to generate posterior predictive data based on our estimates that looks quite similar to the simulated data; see Figure 18.5. The error bars in $y_{rep}$ include 90% of the probability mass of the predictive distribution (this is a default of ppc_bars()). In a well-calibrated model, the data ($y$, here the proportion of answers of each type) should be inside the error bars in 90% of the cases.

gen_data <- rstan::extract(fit_mpt_h)$pred_w_ans
ppc_bars(sim_list_h$w_ans, gen_data) +
  ggtitle ("Hierarchical model")

FIGURE 18.5: A posterior predictive check for aggregated data in the hierarchical MPT model.

It is also useful to look at the individual subjects’ posteriors; these are shown in Figure 18.6.

ppc_bars_grouped(sim_list_h$w_ans, 
                 gen_data, group = subj) +
  ggtitle ("By-subject plot for the hierarchical model")

FIGURE 18.6: Individual subjects in the hierarchical MPT model.

But what about the first non-hierarchical MPT model (mpt_cat.stan)?:

mpt_cat <- system.file("stan_models",
                       "mpt_cat.stan",
                       package = "bcogsci")
fit_sh <- stan(mpt_cat, data = sim_list_h)

The aggregated data looks great (Figure 18.7).

gen_data_sMPT <- rstan::extract(fit_sh)$pred_w_ans
ppc_bars(sim_list_h$w_ans, gen_data_sMPT) +
  ggtitle ("Non-hierarchical model")

FIGURE 18.7: Posterior predictive check for aggregated data in a non-hierarchical MPT model (mpt_cat.stan).

However, in the non-hierarchical model, the fit to individual subjects is not as good (Figure 18.8): The error bars of the predicted distribution do not include the observed proportion of answers for many of the subjects.

ppc_bars_grouped(sim_list_h$w_ans, gen_data_sMPT, group = subj) +
  ggtitle ("By-subject plot for the non-hierarchical model")

FIGURE 18.8: Individual subjects in the non-hierarchical MPT model (mpt_cat.stan).

The hierarchical model does a better job of modeling individual-level variability.

18.3 Summary

In this chapter, we learned to fit increasingly complex MPTs to model categorical responses by assuming underlying latent events that might or might not happen with a certain probability. We started with a simple model and ended with a hierarchical model. We saw how to generate simulated data and investigate parameter recovery to verify that the models were correctly implemented. Furthermore, we showed how one can carry out posterior predictive checks to evaluate the fit of the models, and how to interpret the posteriors of the parameters. MPTs have a lot of potential as stand-alone models or as parts of larger cognitive models (as in, for example, Paape and Vasishth 2022; Nicenboim and Vasishth 2016; and Klauer and Kellen 2018).

18.4 Further reading

Koster and McElreath (2017) present a tutorial on multinomial logistic regression/categorical regression in the context of behavioral ecology and anthropology. Another tutorial on MPTs is presented by Matzke et al. (2015). For the complete implementation of an MPT relating to aphasia, see Walker, Hickok, and Fridriksson (2018). Some examples of cognitive models using MPTs are Lee et al. (2020) and Smith and Batchelder (2010). An example of using MPTs to model garden-pathing processes in sentence processing appears in Paape and Vasishth (2022).

18.5 Exercises

Exercise 18.1 Modeling multiple categorical responses.

Re-fit the model presented in section 18.1.2, adding the assumption that you have more information about the probability of giving a correct response in the task. Assume that you know that subjects’ answers have around 60% accuracy. Encode this information in the priors with two different degrees of certainty. (Hint: 1. As with the Beta distribution, you can increase the pseudo-counts to increase the amount of information and reduce the “width” of the distribution; compare $Beta(9,1)$ with $Beta(900,100)$. 2. You’ll need to use a column vector for the Dirichlet concentration parameters. [.., .., ] is a row_vector that can be transposed and converted into a column vector by adding the transposition symbol ' after the right bracket.)
What is the difference between the multinomial and categorical parameterizations?
What can we learn about impaired picture naming from the models in sections 18.1.1 and 18.1.2?

Exercise 18.2 An alternative MPT to model the picture recognition task.

Build any alternative tree with four parameters $w$, $x$, $y$, $z$ to fit the data generated in 18.2.2. Compare the posterior distribution of the auxiliary vector theta (that goes in the multinomial_lpmf) with the one derived in section 18.2.2.

Exercise 18.3 A simple MPT model that incorporates phonological complexity in the picture recognition task.

Edit the Stan code mpt_cat.stan from bcogsci presented in section 18.2.3 to incorporate the fact that f is now a transformed parameter that depends on the trial information and two new parameters, $\alpha_f$ and $\beta_f$. The rest of the latent parameters do not need to vary by trial.

\[\begin{equation} \begin{aligned} f'_j &=\alpha_f + complexity_j\cdot \beta_f\\ f_j &= logit^{-1}(f'_j) \end{aligned} \end{equation}\]

The inverse logit or logistic function is called inv_logit in Stan. Fit the model to the data of 18.2.3 and report the posterior distributions of the latent parameters.

Exercise 18.4 A more hierarchical MPT.

Modify the hierarchical MPT presented in section 18.2.4 so that all the parameters are affected by individual differences. Simulate data and fit it. How well can you recover the parameters? You should use the non-centered parametrization for the by-subject adjustments. (Hint: Convergence will be reached much faster if you don’t assume that the adjustment parameters are correlated as in 11.1.2, but you could also assume a correlation between all (or some of) the adjustments by using the Cholesky factorization discussed in section 11.1.3.)

Exercise 18.5 Advanced: Multinomial processing trees.

The data set df_source_monitoring in bcogsci contains data from the package psychotools coming from a source-monitoring experiment (Batchelder and Riefer 1990) performed by Wickelmaier and Zeileis (2018).

In this type of experiment, subjects study items from (at least) two different sources, A and B. After the presentation of the study items, subjects are required to classify each item as coming from source A, B, or as new: N (that is, a distractor). In their version of the experiment, Wickelmaier and Zeileis used two different A-B pairs: Half of the subjects had to read items either quietly (source A = think) or aloud (source B = say). The other half had to write items down (source A = write) or read them aloud (source B = say).

experiment: write-say or think-say
age: Age of the respondent in years.
gender: Gender of the respondent.
subj: Subject id.
source: Item source, a, b or b (new)
a, b, N: Number of responses for each type of stimuli

Fit a multinomial processing tree following Figures 18.9 and 18.10 to investigate whether experiment type, age and/or gender affects the different processes assumed in the model. As in Batchelder and Riefer (1990), assume that $a = g$ (for identifiability) and that discriminability is equal for both sources ($d_1 = d_2$).

$Multinomial processing tree for the source A items from the source monitoring paradigm (Batchelder and Riefer, 1990). $D_1$ stands for the detectability of source A, $d_1$ stands for the source discriminabilities for source A items, $b$ stands for the bias for responding “old” to a nondetected item, $a$ stands for guessing that a detected but nondiscriminated item belongs to source A, and $g$ stands for guessing that the item is a source A item.$

FIGURE 18.9: Multinomial processing tree for the source A items from the source monitoring paradigm (Batchelder and Riefer, 1990). $D_1$ stands for the detectability of source A, $d_1$ stands for the source discriminabilities for source A items, $b$ stands for the bias for responding “old” to a nondetected item, $a$ stands for guessing that a detected but nondiscriminated item belongs to source A, and $g$ stands for guessing that the item is a source A item.

$Multinomial processing tree for the source B items from source monitoring paradigm (Batchelder and Riefer, 1990). $D_2$ stand for the detectability of source B items, $d_2$ stands for the source discriminabilities for source B, $b$ stands for the bias for responding “old” to a nondetected item, $a$ stands for guessing that a detected but nondiscriminated item belongs to Source A, and $g$ stands for guessing that the item is a source A item.$

FIGURE 18.10: Multinomial processing tree for the source B items from source monitoring paradigm (Batchelder and Riefer, 1990). $D_2$ stand for the detectability of source B items, $d_2$ stands for the source discriminabilities for source B, $b$ stands for the bias for responding “old” to a nondetected item, $a$ stands for guessing that a detected but nondiscriminated item belongs to Source A, and $g$ stands for guessing that the item is a source A item.

$Multinomial processing tree for the new items in the source monitoring paradigm (Batchelder and Riefer, 1990). $b$ stands for the bias for responding “old” to a nondetected item, $a$ stands for guessing that a detected but nondiscriminated item belongs to source A, and $g$ stands for guessing that the item is a source A item.$

FIGURE 18.11: Multinomial processing tree for the new items in the source monitoring paradigm (Batchelder and Riefer, 1990). $b$ stands for the bias for responding “old” to a nondetected item, $a$ stands for guessing that a detected but nondiscriminated item belongs to source A, and $g$ stands for guessing that the item is a source A item.

Notice the following:

The data are aggregated at the level of source, so you should use multinomial_lpmf for every row of the data set rather than categorical_lpmf().
In contrast to the previous example, source determines three different trees, this means that the parameter theta has to be defined in relationship to the item source.
All the predictors are between subject, this means that only a by-intercept adjustment (for every latent process) is possible.

If you want some basis to start with, you can have a look at the incomplete code in source.stan, by typing the following in R:

cat(readLines(system.file("stan_models",
                     "source.stan",
                     package = "bcogsci")),
    sep = "\n")

References

Batchelder, William H., and David M. Riefer. 1990. “Multinomial Processing Models of Source Monitoring.” Psychological Review 97 (4): 548.

1999. “Theoretical and Empirical Review of Multinomial Process Tree Modeling.” Psychonomic Bulletin & Review 6 (1): 57–86.

Cook, Samantha R., Andrew Gelman, and Donald B. Rubin. 2006. “Validation of Software for Bayesian Models Using Posterior Quantiles.” Journal of Computational and Graphical Statistics 15 (3): 675–92. https://doi.org/10.1198/106186006X136976.

Damasio, Antonio R. 1992. “Aphasia.” New England Journal of Medicine 326 (8): 531–39. https://doi.org/10.1056/NEJM199202203260806.

Klauer, Karl Christoph, and David Kellen. 2018. “RT-MPTs: Process Models for Response-Time Distributions Based on Multinomial Processing Trees with Applications to Recognition Memory.” Journal of Mathematical Psychology 82: 111–30. https://doi.org/https://doi.org/10.1016/j.jmp.2017.12.003.

Koster, Jeremy, and Richard McElreath. 2017. “Multinomial Analysis of Behavior: Statistical Methods.” Behavioral Ecology and Sociobiology 71 (9): 138. https://doi.org/10.1007/s00265-017-2363-8.

Lee, Michael D., Jason R. Bock, Isaiah Cushman, and William R. Shankle. 2020. “An Application of Multinomial Processing Tree Models and Bayesian Methods to Understanding Memory Impairment.” Journal of Mathematical Psychology 95: 102328. https://doi.org/https://doi.org/10.1016/j.jmp.2020.102328.

Matzke, Dora, Conor V. Dolan, William H. Batchelder, and Eric-Jan Wagenmakers. 2015. “Bayesian Estimation of Multinomial Processing Tree Models with Heterogeneity in Participants and Items.” Psychometrika 80 (1): 205–35. https://doi.org/10.1007/s11336-013-9374-9.

Nicenboim, Bruno, and Shravan Vasishth. 2016. “Statistical methods for linguistic research: Foundational Ideas - Part II.” Language and Linguistics Compass 10 (11): 591–613. https://doi.org/10.1111/lnc3.12207.

Paape, Dario, and Shravan Vasishth. 2022. “Estimating the True Cost of Garden-Pathing: A Computational Model of Latent Cognitive Processes.” Cognitive Science 46 (8): e13186. https://doi.org/https://doi.org/10.1111/cogs.13186.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2019. “Toward a Principled Bayesian Workflow in Cognitive Science.” arXiv Preprint. https://doi.org/10.48550/ARXIV.1904.12765.

2020. “Toward a Principled Bayesian Workflow in Cognitive Science.” Psychological Methods 26 (1): 103–26. https://doi.org/https://doi.org/10.1037/met0000275.

Smith, Jared B., and William H. Batchelder. 2010. “Beta-MPT: Multinomial Processing Tree Models for Addressing Individual Differences.” Journal of Mathematical Psychology 54 (1): 167–83.

Talts, Sean, Michael J. Betancourt, Daniel P. Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv Preprint arXiv:1804.06788.

Walker, Grant M, Gregory Hickok, and Julius Fridriksson. 2018. “A Cognitive Psychometric Model for Assessment of Picture Naming Abilities in Aphasia.” Psychological Assessment 6: 809–26. https://doi.org/10.1037/pas0000529.

Wickelmaier, Florian, and Achim Zeileis. 2018. “Using Recursive Partitioning to Account for Parameter Heterogeneity in Multinomial Processing Tree Models.” Behavior Research Methods 50 (3): 1217–33.