18.1 Modeling multiple categorical responses

One way to model categorical responses is with the multinomial or the categorical distribution. The categorical responses could be “yes” or “no”; “blue”, “red”, or “yellow”; “true”, “false”, or “I don’t know”; or more complicated categories; crucially, the response of each observation can be coded as belonging to only one category. The multinomial and the categorical distributions represent two ways of characterizing the underlying generative process for such data.

The multinomial distribution is the generalization of the binomial distribution for more than two possible outcomes. Recall that the binomial works like this: to randomly generate the number of successes in an observation consisting of 10 trials, with probability of success 0.5, one can type:
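A call along these lines generates such an observation (a sketch; without the random seed used for the output below, the exact number will vary):

```r
# Number of successes in one observation of 10 trials,
# with probability of success 0.5:
(successes <- rbinom(1, size = 10, prob = 0.5))
```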

## [1] 4

It is possible to repeatedly generate multiple observations as follows. Suppose five simulated observations are needed, each with 10 trials:
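This can be done by increasing the first argument of rbinom() (the sampled values will vary from run to run):

```r
# Five observations, each the number of successes in 10 trials:
(successes <- rbinom(5, size = 10, prob = 0.5))
```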

## [1] 6 5 7 7 2

Now, suppose that there are N = 3 possible answers to a question (yes, no, don’t know), and suppose that the probabilities of producing each answer are:

  • \(P(\)yes\()=0.1\)
  • \(P(\)no\()=0.1\)
  • \(P(\)don’t know\()=0.8\)

The probabilities must sum to 1, because those are the only three possible outcomes. Given such a situation, it is possible to simulate a single experiment with 10 trials, where each of the three possibilities appears a certain number of times. We do this with rmnom from the extraDistr package (one could have used rmultinom() in R equally well, but the output would look a bit different):
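A call of the following form produces such a sample (a sketch; the counts will vary across runs):

```r
library(extraDistr)
# One observation of 10 trials with three possible outcomes;
# the counts across the three columns sum to 10:
(ans <- rmnom(1, size = 10, prob = c(0.1, 0.1, 0.8)))
```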

##      [,1] [,2] [,3]
## [1,]    1    2    7

The above call returns the result of the random sample: 1 case of the first answer type, 2 cases of the second, and 7 cases of the third.

Analogously to the binomial function shown above, five observations can be simulated, each having 10 trials:
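This is done by increasing the first argument of rmnom() (counts will vary across runs):

```r
library(extraDistr)
# Five observations, each with 10 trials over three outcome categories:
(ans <- rmnom(5, size = 10, prob = c(0.1, 0.1, 0.8)))
```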

##      [,1] [,2] [,3]
## [1,]    1    1    8
## [2,]    3    1    6
## [3,]    1    1    8
## [4,]    0    2    8
## [5,]    0    0   10

The categorical distribution is the generalization of the Bernoulli distribution for more than two possible outcomes, and it is the special case of the multinomial distribution when we have only one trial. Recall that the Bernoulli distribution can be used as follows. If we carry out a coin toss (each coin toss counts as a single trial), we will get either heads or tails:
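For example, with a fair coin and rbern() from the extraDistr package (a sketch; each run yields a different sample):

```r
library(extraDistr)
# Five Bernoulli trials (1 = heads, 0 = tails):
(tosses <- rbern(5, prob = 0.5))
# Another call gives a new random sample of five trials:
(tosses2 <- rbern(5, prob = 0.5))
```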

## [1] 1 0 0 0 0
## [1] 1 1 1 1 1

Thus, what the Bernoulli distribution is to the binomial, the categorical distribution is to the multinomial. For example, one can simulate five observations, each of which will give one of the three responses with the given probabilities. We do this with rcat from the extraDistr package.
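A call along these lines produces such responses (a sketch; the sampled categories will vary across runs):

```r
library(extraDistr)
# Five single-trial observations; each response is one of three labels:
(ans <- rcat(5, prob = c(0.1, 0.1, 0.8),
             labels = c("yes", "no", "dontknow")))
```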

## [1] dontknow dontknow yes      yes      dontknow
## Levels: yes no dontknow

The above is analogous to using the multinomial with size = 1 (a single trial in each experiment). In the output of rmnom below, each row is one observation, and the position of the 1 indicates which of the three categories was produced.
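A call of the following form produces this kind of output (a sketch):

```r
library(extraDistr)
# Five observations, each a single trial over three categories;
# exactly one category per row is marked with a 1:
(ans <- rmnom(5, size = 1, prob = c(0.1, 0.1, 0.8)))
```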

##      [,1] [,2] [,3]
## [1,]    0    0    1
## [2,]    0    0    1
## [3,]    0    0    1
## [4,]    0    0    1
## [5,]    0    0    1

With these distributions as background, consider now a simulated situation where multiple responses are possible.

18.1.1 A model for multiple responses using the multinomial likelihood

Impaired picture naming (anomia) is common in most cases of aphasia. It is assessed as part of most comprehensive aphasia test batteries, since picture-naming accuracy is relatively easy to obtain and is a reliable test score; in addition, the types of errors that are committed can provide useful information for diagnosis.

In this simulated experiment, the responses are categorized as shown in Table 18.1.

TABLE 18.1: Categorization of responses for the simulated experiment.

Category  | Description                                                                             | Example
----------|-----------------------------------------------------------------------------------------|--------
Correct   | The response matches the target.                                                        | cat
Neologism | The response is not a word, but it has a phonological relation to the target.           | cag
Formal    | The response is a word with only a phonological relation to the target.                 | hat
Mixed     | The response is a word with both a semantic and a phonological relation to the target.  | rat
NR        | All other responses, including omissions, descriptions, non-nouns, etc.                 |

First, generate data assuming a multinomial distribution. The outcomes will be determined by a vector \(\boldsymbol{\theta}\) (called true_theta below in the R code) that indicates the probability of each outcome:
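The vector can be stored in a tibble as follows (a sketch of code that produces the output below):

```r
library(tibble)
# True probabilities of each of the five response categories:
(true_theta <- tibble(theta_NR = 0.2,
                      theta_Neologism = 0.1,
                      theta_Formal = 0.2,
                      theta_Mixed = 0.08,
                      theta_Correct = 0.42))
# Verify that the probabilities sum to one:
sum(true_theta)
```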

## # A tibble: 1 × 5
##   theta_NR theta_Neologism theta_Formal theta_Mixed theta_Correct
##      <dbl>           <dbl>        <dbl>       <dbl>         <dbl>
## 1      0.2             0.1          0.2        0.08          0.42
## [1] 1

Given this vector of probabilities \(\boldsymbol{\theta}\), generate values assuming a multinomial distribution of responses in 100 trials:
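This can be done with rmultinom(); the probabilities are repeated here as a named vector so that the sketch is self-contained (the counts will vary across runs):

```r
# True probabilities of the five response categories:
true_theta <- c(theta_NR = 0.2, theta_Neologism = 0.1,
                theta_Formal = 0.2, theta_Mixed = 0.08,
                theta_Correct = 0.42)
# One observation of 100 trials; rows are labeled by category:
(ans <- rmultinom(1, size = 100, prob = true_theta))
```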

##                 [,1]
## theta_NR          19
## theta_Neologism    7
## theta_Formal      17
## theta_Mixed        7
## theta_Correct     50

Now, we’ll try to recover the probability of each answer with a model with the following likelihood:

\[\begin{equation} ans \sim \mathit{Multinomial}(\boldsymbol{\theta}) \end{equation}\]

where \(\boldsymbol{\theta} = \{\theta_{nr}, \theta_{neol.}, \theta_{formal}, \theta_{mix}, \theta_{corr}\}\).

A common prior for the multinomial likelihood is the Dirichlet distribution, which extends the Beta distribution to cases where more than two categories are available.

\[\begin{equation} \boldsymbol{\theta} \sim \mathit{Dirichlet}(\boldsymbol{\alpha}) \end{equation}\]

The Dirichlet distribution has a concentration parameter \(\boldsymbol{\alpha}\), a vector with the same length as \(\boldsymbol{\theta}\). Setting \(\boldsymbol{\alpha} = \{2,2,2,2,2\}\) is analogous to setting a \(\mathit{Beta}(2,2)\) prior in the two-category case; the intuition behind this concentration parameter is that the prior probability distribution of the vector \(\boldsymbol{\theta}\) corresponds to having seen 2 outcomes of each category in the past.
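To build intuition about this prior, one can draw a few samples from it with rdirichlet() from the extraDistr package (an illustrative sketch; each row is one draw of \(\boldsymbol{\theta}\)):

```r
library(extraDistr)
# Three draws of theta from Dirichlet(2, 2, 2, 2, 2):
(theta_samples <- rdirichlet(3, alpha = rep(2, 5)))
# Each draw is a valid probability vector; every row sums to one:
rowSums(theta_samples)
```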

A Stan model assuming a multinomial likelihood and Dirichlet prior is shown below. Since the elements of \(\boldsymbol{\theta}\) must sum to one, we declare this vector, theta, as a simplex. A simplex guarantees that its elements sum to one and also constrains them to have non-negative values. To generate the vector \(\boldsymbol{\alpha}\) that contains the value two five times, we use rep_vector(2, 5) (which is similar to rep(2, 5) in R).

data {
  int<lower = 1> N_trials;
  int<lower = 0, upper = N_trials> ans[5];
}
parameters {
  simplex[5] theta;
}
model {
  target += dirichlet_lpdf(theta | rep_vector(2, 5));
  target += multinomial_lpmf(ans | theta);
}
generated quantities{
  int pred_ans[5] = multinomial_rng(theta, N_trials);
}

Fit the model:
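Assuming the Stan program above is saved as multinom.stan and the fitted object is stored as fit_mult (names assumed here for illustration), the data list and the fitting call might look as follows:

```r
library(rstan)
# The observed counts from the simulation above: 100 trials in total.
data_mult <- list(N_trials = 100, ans = c(19, 7, 17, 7, 50))
str(data_mult)
# Compile and fit the model (file name assumed):
fit_mult <- stan("multinom.stan", data = data_mult)
```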

## List of 2
##  $ N_trials: num 100
##  $ ans     : int [1:5] 19 7 17 7 50

Print the posteriors:
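Assuming the fitted object is stored as fit_mult (a name assumed here), the posterior summary shown below can be obtained with:

```r
# Posterior means and 95% credible intervals for each element of theta:
print(fit_mult, pars = "theta", probs = c(0.025, 0.975))
```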

##          mean 2.5% 97.5% n_eff Rhat
## theta[1] 0.19 0.12  0.27  4330    1
## theta[2] 0.08 0.04  0.14  4252    1
## theta[3] 0.17 0.11  0.25  4415    1
## theta[4] 0.08 0.04  0.14  4265    1
## theta[5] 0.47 0.38  0.57  4172    1

Next, we use mcmc_recover_hist in the code below to confirm that the posterior distributions of the elements of \(\boldsymbol{\theta}\) are close to the true values that were set up when simulating the data. See Figure 18.1.
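A sketch of such a call, assuming the fitted object is named fit_mult and using the true probabilities from the simulation:

```r
library(bayesplot)
# Histograms of posterior samples of theta,
# with the true simulation value marked for each element:
mcmc_recover_hist(as.matrix(fit_mult, pars = "theta"),
                  true = c(0.2, 0.1, 0.2, 0.08, 0.42))
```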


FIGURE 18.1: Posterior distributions and true means of theta for the multinomial model defined in multinom.stan.

We evaluate here whether our model is able to “recover” the true values of its parameters. By “recover”, we mean that the true values are somewhere inside the posterior distribution of the model.

The frequentist properties of Bayesian models guarantee that if we simulate data several times, 95% of the true values should lie inside the 95% credible intervals (CrIs) generated by a “well-calibrated” model. Furthermore, if the true values of some parameters are consistently well above or below their posterior distributions, it may mean that there is some problem with the model specification. We follow Cook, Gelman, and Rubin (2006) here, and for now we are going to verify that our model is roughly correct. A more principled (and computationally demanding) approach uses simulation-based calibration (SBC), which is discussed in Talts et al. (2018) and Schad, Betancourt, and Vasishth (2020).

18.1.2 A model for multiple responses using the categorical distribution

Using the same information as above, we can model each response individually, instead of aggregating them. The categorical distribution gives us more flexibility to define what happens at every trial. However, since we are not using this additional flexibility for now, the next model and the previous one are equivalent.

data {
  int<lower = 1> N_obs;
  int<lower = 1, upper = 5> w_ans[N_obs];
}
parameters {
  simplex[5] theta;
}
model {
  target += dirichlet_lpdf(theta | rep_vector(2, 5));
  for(n in 1:N_obs)
    target += categorical_lpmf(w_ans[n] | theta);
}
generated quantities{
  int pred_w_ans[N_obs];
  for(n in 1:N_obs)
    pred_w_ans[n] = categorical_rng(theta);
}

Given the same set of probabilities \(\boldsymbol{\theta}\) as above, generate 100 individual observations:
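This can be done with rcat(); the true probabilities are repeated here so that the sketch is self-contained (sampled values will vary across runs):

```r
library(extraDistr)
# True probabilities of the five response categories, as above:
true_theta <- c(0.2, 0.1, 0.2, 0.08, 0.42)
# 100 individual trials; each response is an integer from 1 to 5:
(ans_cat <- rcat(100, prob = true_theta))
```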

Stan expects each response coded as an integer from 1 to 5. The data fed into the Stan model are defined as a list as usual:
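A sketch of the code (the data are re-simulated here so that the example is self-contained):

```r
library(extraDistr)
# Data list for categorical.stan: number of observations and responses.
data_cat <- list(N_obs = 100,
                 w_ans = rcat(100, prob = c(0.2, 0.1, 0.2, 0.08, 0.42)))
str(data_cat)
```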

## List of 2
##  $ N_obs: num 100
##  $ w_ans: num [1:100] 5 1 3 5 1 4 2 1 5 5 ...

Fitting the Stan model (categorical.stan) should yield approximately the same \(\boldsymbol{\theta}\) as with the multinomial likelihood defined in the model multinom.stan. See Figure ??.
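A sketch of the fitting call, re-creating the data list so that the example is self-contained (names such as data_cat and fit_cat are assumed here):

```r
library(rstan)
library(extraDistr)
# Simulate 100 responses with the true probabilities and fit the model:
data_cat <- list(N_obs = 100,
                 w_ans = rcat(100, prob = c(0.2, 0.1, 0.2, 0.08, 0.42)))
fit_cat <- stan("categorical.stan", data = data_cat)
# Posterior means and 95% credible intervals for theta:
print(fit_cat, pars = "theta", probs = c(0.025, 0.975))
```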

##          mean 2.5% 97.5% n_eff Rhat
## theta[1] 0.20 0.13  0.28  5024    1
## theta[2] 0.09 0.04  0.15  3515    1
## theta[3] 0.25 0.17  0.33  3998    1
## theta[4] 0.05 0.02  0.10  4893    1
## theta[5] 0.41 0.32  0.50  4346    1

The above models estimate the posterior distribution of the probability of each possible response. If there were an experimental manipulation, we could even fit regressions on these parameters. This is called multinomial logistic regression or categorical regression; see the further reading section for some examples.


Cook, Samantha R, Andrew Gelman, and Donald B Rubin. 2006. “Validation of Software for Bayesian Models Using Posterior Quantiles.” Journal of Computational and Graphical Statistics 15 (3). Taylor & Francis: 675–92. https://doi.org/10.1198/106186006X136976.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled Bayesian Workflow in Cognitive Science.” Psychological Methods 26 (1). American Psychological Association: 103–26.

Talts, Sean, Michael J. Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv Preprint arXiv:1804.06788.