# Chapter 12 Custom distributions in Stan

Stan includes a large number of distributions, but what happens if we need a distribution that is not provided? In many cases, we can simply build a custom distribution by combining the ever-growing number of functions available in the Stan language.

## 12.1 A change of variables with the reciprocal normal distribution

In previous chapters, when faced with response times, we assumed a log-normal distribution. The log-normal distribution moves the inferences relating to the location parameter into a multiplicative frame (see Box 4.3). Another alternative, however, is to assume a “reciprocal”-normal distribution (rec-normal) of response times. That is, we may want to assume that the reciprocal of the response times are normally distributed. This can happen if the Box-Cox variance stabilizing transform procedure suggests that a reciprocal transformation is needed (Box and Cox 1964).

\[\begin{equation} \begin{aligned} 1/y &\sim \mathit{Normal}(\mu, \sigma) \\ y &\sim \mathit{RecNormal}(\mu, \sigma) \end{aligned} \tag{12.1} \end{equation}\]

An interesting aspect of the rec-normal is that it affords an interpretation of the location parameter in terms of rate or speed rather than time. Analogously to the case of the log-normal, neither the location \(\mu\) nor the scale \(\sigma\) are in the same scale as the dependent variable \(y\). These parameters are in the same scale as the transformed dependent variable (here, \(1/y\)) that is normally distributed.

By setting \(\mu = 0.002\) and \(\sigma = 0.0004\) we generate data that looks right-skewed and not unlike a distribution of response times, the code below shows some summary statistics of the generated data.

```
N <- 100
mu <- .002
sigma <- .0004
rt <- 1 / rnorm(N, mu, sigma)
c(mean(rt), sd(rt), min(rt), max(rt))
```

`## [1] 520 112 360 926`

Figure 12.1 shows the distribution generated with the code shown above.

We can fit the rec-normal distribution to the response times in a simple model with a normal likelihood, by storing the reciprocal of the response times (`1/RT`

) in the vector variable `recRT`

(rather than storing the “raw” response times):

```
model {
\\ priors go here
target += normal_lpdf(recRT | mu, sigma);
}
```

One issue here is that the parameters of the likelihood, `mu`

and `sigma`

are going to be very far away from the unit scale (\(0.002\) and \(0.0004\) respectively). Due to the way Stan’s sampler is built, parameters that are too small (much smaller than 1) or too large (much larger than 1) can cause convergence problems. A straightforward solution is to fit the normal distribution to parameters in reciprocal of seconds rather than milliseconds; this would make the parameters have values that are \(1000\) times larger (\(2\) and \(0.4\) respectively). We can do this using the `transformed parameters`

block. Although we use `mu`

and `sigma`

to fit the data, the priors are defined on the parameters `mu_s`

and `sigma_s`

(\(\mu_s\) and \(\sigma_s\)).

Unless one can rely on previous estimates, finding good priors for \(\mu_s\) and \(\sigma_s\) is not trivial. To define appropriate priors, we would need to start with relatively arbitrary priors, inspect the prior predictive distributions, adjust the priors, and repeat the inspection of prior predictive distributions until these distributions start to look realistic. In the interest of conserving space, we skip this iterative process here, and assign the following priors:

\[\begin{equation} \begin{aligned} \mu_s & \sim \mathit{Normal}(2, 2)\\ \sigma_s & \sim \mathit{Normal}_+(0.4, 0.2) \end{aligned} \end{equation}\]

```
data {
int<lower = 1> N;
vector[N] recRT;
}
parameters {
real mu_s;
real<lower = 0> sigma_s;
}
transformed parameters {
real mu = mu_s / 1000;
real sigma = sigma_s / 1000;
}
model {
target += normal_lpdf(mu_s | 2, 2);
target += normal_lpdf(sigma_s | 0.4, 0.2) +
- normal_lccdf(0 | 0.4, 0.2);
target += normal_lpdf(recRT | mu, sigma);
}
```

Fit and display the summary of the previous model:

```
normal_recrt <- system.file("stan_models",
"normal_recrt.stan",
package = "bcogsci"
)
fit_rec <- stan(normal_recrt,
data = list(N = N, recRT = 1 / rt)
)
```

```
## mean 2.5% 97.5% n_eff Rhat
## mu 0.0020 0.0019 0.0021 2992 1
## sigma 0.0004 0.0003 0.0005 3134 1
```

Is a rec-normal likelihood more appropriate than a log-normal likelihood? As things stand, we cannot compare the models with these two likelihoods. This is because the dependent variables are different: we cannot compare reciprocal response times with untransformed response times on the millisecond scale. Model comparison with the Bayes factor or cross-validation can only compare models with the same dependent variables; see chapters 14-16.

If we do want to compare the reciprocal-normal likelihood and the log-normal likelihood, we have to set up the models with the two likelihoods in such a way that the dependent measure is on the raw millisecond scale in each model. This means that for the reciprocal normal likelihood, the model will receive as data raw reading times in milliseconds, and these will be treated as a transformed random variable from reciprocal reading times. This approach is discussed next, but requires knowledge of the Jacobian adjustment (see Box 12.1).

**Box 12.1 ****Understanding the Jacobian adjustment in change of variables.**

Some background knowledge is necessary to understand Jacobians. Suppose that \(X\) is a continuous random variable with PDF \(f_X(x)\). Then, the relationship between the PDF \(f_X(x)\) and the CDF \(F_X(x=a)\), where \(a\) is some specific instance of \(X\), is:

\[\begin{equation} F_X(a)= \int_{-\infty}^a f_X(x)\, dx \end{equation}\]

This also means that if we differentiate \(F_X(a)\), we get back \(f_X(x)\); this fact is a consequence of the fundamental theorem of calculus. The fundamental theorem states the following: Let \(f\) be a continuous real-valued function defined on a closed interval \([a, b]\). Let \(F\) be the function defined, for all \(x\) in \([a, b]\), by

\[\begin{equation*} F(c) = \int_a^c f(x)\, dx \end{equation*}\]

Then, \(F\) is continuous on \([a, b]\), differentiable on the open interval \((a, b)\), and

\[\begin{equation*} \frac{d(F(x))}{dx} = f(x) \end{equation*}\]

for all \(x\) in \((a, b)\).

With the above background, we now explain the Jacobian in the case of univariate distributions. In this book, we don’t need to know the Jacobian in the case of a multivariate distribution, but see section 6.7 in Ross (2002) for more on that topic.

Suppose you have a continuous random variable \(X\) that has a particular PDF associated with it: \(X\sim f_X(x)\). Now, if this random variable is transformed such that \(Y = g(X)\), the question arises: what is the PDF of \(Y\)?

Here, we have a situation where we have transformed a random variable \(X\) to \(Y\); this is called a *change of variables*.

There is a theorem in statistics which states the following (the statement of the theorem, and its proof, are adapted from Ross 2002):

Let \(X\) be a continuous random variable with probability density function \(f_X\). Suppose that \(g(x)\) is a strict monotone (increasing or decreasing) function, differentiable and (thus continuous) function of \(x\). Then the random variable \(Y\) defined by \(Y=g(X)\) has a probability density function defined by

\[\begin{equation*} f_Y(y)= \left\{ \begin{array}{l l} f_X(g^{-1}(y)) \mid \frac{d}{dx} g^{-1}(y) \mid & \quad \textrm{if } y = g(x) \textrm{ for some $x$}\\ 0 & \quad \textrm{if } y\neq g(x) \textrm{ for all $x$}.\\ \end{array} \right. \end{equation*}\]

where \(g^{-1}(y)\) is defined to be equal to the value of \(x\) such that \(g(x)=y\).

The proof of this theorem goes as follows. Suppose that \(y=g(x)\) for some \(x\). Then, the cumulative distribution function of \(Y\) is \(F_Y(y)\). This CDF tells us the probability \(P(Y\leq y)\); but since \(Y = g(X)\), we can write this probability as \(P(g(X)\leq y)\). So:

\[\begin{equation} F_Y(y) = P(g(X)\leq y) \end{equation}\]

Now, consider the term \(P(g(X)\leq y)\). In particular, consider the term \(g(X)\); applying the inverse of the function \(g(\cdot)\) to \(g(X)\) gives us back \(X\): \(g^{-1}g(X)=X\). Applying the inverse to both sides of \(\leq\) sign in \(P(g(X)\leq y)\) gives us:

\[\begin{equation} P(X\leq g^{-1}(y)) \end{equation}\]

But the above expression is the probability that the CDF of \(X\) gives us (recall that \(g^{-1}(y) = x\)):

\[\begin{equation} P(X\leq g^{-1}(y)) = F_X(g^{-1}(y)) \end{equation}\]

Now, we know (see the background above) that the PDF \(f_Y(y)\) can be derived from the CDF \(F_Y(y)\): \(f_Y(y) = \frac{d(F_Y(y))}{dy}\). Differentiating the CDF yields:

\[\begin{equation} f_Y(y) = f_X(g^{-1}(y)) \frac{d(g^{-1}(y))}{dy} \end{equation}\]

If you have forgotten your calculus (or didn’t study it in school), it is not at all obvious how the above differentiation comes about. Here, we are using the fact that:

\[\begin{equation} F_Y(y) = F_X(g^{-1}(y)) = \int f_X(g^{-1}(y)) \, dy \end{equation}\]

Recall (the fundamental theorem of calculus) that differentiating the integral \(\int f_X(g^{-1}(y)) \, dy\) will give us \(f_X(g^{-1}(y))\).

We now show how to carry out the following differentiation:

\[\begin{equation} \begin{split} \frac{d(F_Y(y))}{dy}=& \frac{d}{dy}(F_X(g^{-1}(y))) \end{split} \end{equation}\]

We use something called the chain rule from calculus (Salas, Etgen, and Hille 2003). To make things typographically easier to follow, rewrite \(w(y)=g^{-1}(y)\). Then, let

\[\begin{equation*} u = w(y) \end{equation*}\]

Differentiating this function give us (\(w'(y)\) is just a shorthand way of writing the derivative of the function \(w(\cdot)\)):

\[\begin{equation*} \frac{du}{dy} = w'(y) \end{equation*}\]

Also, let:

\[\begin{equation*} z = F_X(u) \end{equation*}\]

Differentiating this function gives us:

\[\begin{equation*} \frac{dz}{du} = F'_X(u) = f_X(u) \end{equation*}\]

By the chain rule:

\[\begin{equation*} \frac{du}{dy} \times \frac{dz}{du} = w'(y) f_X(u) = \frac{d}{dy}(g^{-1}(y)) f_X(g^{-1}(y)) \end{equation*}\]

That is how we obtain the result:

\[\begin{equation} f_Y(y) = f_X(g^{-1}(y)) \frac{d(g^{-1}(y))}{dy} \end{equation}\]

Notice that \(\frac{d(g^{-1}(y))}{dy}\) (this is the rate of change of the function) has to be non-negative because \(g^{-1}(y)\) is non-decreasing. That is why we take the absolute value of the derivative:

\[\begin{equation} f_Y(y) = f_X(g^{-1}(y)) \mid \frac{d(g^{-1}(y))}{dy}\mid \end{equation}\]

Next, consider some examples.

**Example 1**:

Suppose that we have reciprocal reading times (call this random variable \(X\)) in 1/milliseconds, which are assumed to come from some PDF \(f_{X}(x)\) (say, a normal distribution). Suppose we transform this random variable to be on the milliseconds scale:

\[\begin{equation} Y = g(X) = \frac{1}{X} \end{equation}\]

The question now is: what is the PDF \(f_Y(y)\) of the transformed random variable \(Y\).

The inverse of \(g(x)\) is \(x=g^{-1}(y) = \frac{1}{y}\), and its derivative is \(\frac{d(g^{-1}(y))}{dy} = -\frac{1}{y^2}\).

Using the theorem proved above, the PDF of \(Y\) is:

\[\begin{equation} \begin{aligned} f_Y(y) &= f_X(g^{-1}(y)) \mid \frac{d(g^{-1}(y))}{dy}\mid = f_X(g^{-1}(y)) \mid -\frac{1}{y^2}\mid \\ f_Y(y) & =_X(g^{-1}(y)) \frac{1}{y^2} \end{aligned} \end{equation}\]

Rewriting this in terms of \(f_X\), which is the Normal distribution:

\[\begin{equation} f_Y(y) = \frac{1}{\sqrt{2\pi}\sigma y^2}\exp\left( -\frac{1}{2}\left( \frac{1/y - \mu}{\sigma} \right)^2 \right) \end{equation}\]

**Example 2**

Suppose that we have a continuous random variable \(X\) with PDF \(f_X(x)\). Suppose now that we transform this random variable to \(Y = \log(X)\). What is the PDF of \(Y\)?

Here, \(y = g(x) = \log(x)\), and the inverse is \(x = g^{-1}(y) = \exp(y)\). The derivative of the inverse is \(\frac{d(g^{-1}(y))}{dy}\) is \(\exp(y)\).

Using the theorem above, the PDF of \(Y\) is:

\[\begin{equation} \begin{aligned} f_Y(y) &= f_X(g^{-1}(y)) \mid \frac{d(g^{-1}(y))}{dy}\mid = f_X(g^{-1}(y)) \mid \exp(y) \mid\\ f_Y(y) &= f_X(g^{-1}(y)) \exp(y) \end{aligned} \end{equation}\]

**Example 3**

Suppose that we have a continuous random variable \(X\) with PDF \(f_X(x)\) , the normal distribution. Suppose now that we transform this random variable to \(Y = \exp(X)\). What is the PDF of \(Y\)?

Here, \(y=g(x)= \exp(x)\) and \(x=g^{-1}(y)= \log(y)\). The derivative of \(g^{-1}(y)\) is \(\frac{d(g^{-1}(y))}{dy} = \frac{1}{y}\).

Using the theorem,

\[\begin{equation} f_Y(y) = f_X(g^{-1}(y)) \mid \frac{d(g^{-1}(y))}{dy}\mid = f_X(g^{-1}(y)) \frac{1}{y} \end{equation}\]

Rewriting the PDF of \(Y\) in terms of \(f_X(\cdot)\):

\[\begin{equation} f_Y(y) = f_X(\log(y)) \frac{1}{y} = \frac{1}{\sqrt{2\pi}\sigma y} \exp\left(-\frac{1}{2}\left(\frac{\log y-\mu}{\sigma}\right)^2\right) \end{equation}\]

### 12.1.1 Scaling a probability density with the Jacobian adjustment

To work with the original dependent variable (`RT`

rather than `1/RT`

) we need to do a *change of variables*: If the random variable X represents the reciprocal reading times (\(1/RT\)), we can transform this random variable to a new one, \(Y = 1/X = 1/(1/RT) = RT\), yielding a transformed random variable \(Y\) which represents reading time.

This change of variables requires an adjustment to the (unnormalized log) posterior probability to account for the distortion caused by the transform.^{37} The probability must be scaled by a *Jacobian* adjustment, which, in the univariate case as this one, is the absolute value of the derivative of the transform; see Box 12.1.^{38}

\[\begin{equation} \begin{aligned} p(\mathit{RT}_n | \mu, \sigma )& = \mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma) | \frac{d}{d\mathit{RT}_n} 1/\mathit{RT}_n| = \\ & \mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma) \cdot | - 1/\mathit{RT}_n^2| = \\ & \mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma) \cdot 1/\mathit{RT}_n^2 \end{aligned} \end{equation}\]

If we omit the Jacobian adjustment, we are essentially fitting an incorrect likelihood (this will be shown later in this chapter). The discrepancy between the correct and incorrect posterior would depend on how large the Jacobian adjustment for our model is.

Because Stan works in log-space, rather than multiplying the Jacobian adjustment, we add its logarithm to the log-probability density (\(\log(1/\mathit{RT}_n^2) = - 2 \cdot \log(\mathit{RT}_n)\)). The log likelihood (of \(\mu\) and \(\sigma\)) with its adjustment based on an individual observation, \(\mathit{RT}_{n}\), would be the following one.

\[\begin{equation} \log \mathcal{L} = \log(\mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma)) - 2 \cdot \log(\mathit{RT}_n) \end{equation}\]

We obtain the log-likelihood based on all the \(N\) observations by summing the log-likelihood of individual observations.

\[\begin{equation} \log \mathcal{L} = \sum_n^N \log(\mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma)) - \sum_n^N 2 \cdot \log(\mathit{RT}_n) \end{equation}\]

In Stan, this summing up is done as follows. The function `normal_lpdf`

applied to a vector already returns the sum of the individual log-probability densities. The Jacobian adjustment `2*log(RT)`

, which returns a vector of values, has to be summed up and added in manually (the term therefore becomes `-sum(2*log(RT))`

).

```
target += normal_lpdf(1 ./ RT | mu, sigma)
- sum(2 * log(RT));
```

Before fitting the model with the change of variables, we are going to truncate the distribution. We didn’t encounter negative values in our synthetic data, but this was because the distribution was not too spread out; that is, the scale was much smaller than the location (\(\sigma << \mu\)). However, in principle we could end up generating negative values. For this reason, we truncate the underlying normal distribution (for more details, see Box 4.1). The reciprocal truncated normal distribution has been argued to be an appropriate model of response times, neural inter-spike intervals, and latency distributions of saccades in a simple optimality model in which reward is maximized to yield an optimal response rate (Harris et al. 2014; Harris and Waddington 2012).

Because we have \(N\) observations, the truncation consists of adding `- N * normal_lccdf(0 | mu, sigma)`

to `target`

; also see section 10.4.1. The complete model, `recnormal_rt.stan`

, including the truncation is as follows:

```
data {
int<lower = 1> N;
vector[N] RT;
}
parameters {
real mu_s;
real<lower = 0> sigma_s;
}
transformed parameters {
real mu = mu_s / 1000;
real sigma = sigma_s / 1000;
}
model {
target += normal_lpdf(mu_s | 2, 2);
target += normal_lpdf(sigma_s | 0.4, 0.2)
- normal_lccdf(0 | 0.4, 0.2);
target += normal_lpdf(1 ./ RT | mu, sigma)
- N * normal_lccdf(0 | mu, sigma)
- sum(2 * log(RT));
}
```

Next, generate data from a reciprocal truncated normal distribution:

Fit the model to the data:

```
recnormal_rt <- system.file("stan_models",
"recnormal_rt.stan",
package = "bcogsci"
)
fit_rec <- stan(recnormal_rt, data = list(N = N, RT = rt))
```

Print the posterior summary:

```
## mean 2.5% 97.5% n_eff Rhat
## mu 0.0020 0.0020 0.0021 3014 1
## sigma 0.0004 0.0003 0.0004 3187 1
```

We get the same results as before, but now we could potentially compare this model to one that assumes a log-normal likelihood, for example.

An important question is the following: Does every transformation of a random variable require a Jacobian adjustment? Essentially, if we assign a distribution to a random variable and then transform it, this does not require a Jacobian adjustment. This is the case when, in Stan syntax, we only have the random variable at the left of the pipe `|`

in a PDF or PMF.

Alternatively, if we apply a transformation on a random variable first, and then assign a distribution to the transformed random variable afterwards, we have a change of variables. This requires a Jacobian adjustment. This is the case when we have a transformation of a random variable at the left of the pipe `|`

in the PDF or PMF. This is the reason why `1 ./ RT`

requires a Jacobian adjustment, but not `mu_s`

, `mu`

, `sigma_s`

, or `sigma_s`

in the previous model.

As a last step, we can encapsulate our new distribution in a function, and also create a random number generator function. This is done in a block called `functions`

. To create a function, we need to specify the type of every argument and the type that the function returns.

As a simple example, if we would like a function to center a vector, we would do it as follows:

```
functions {
vector center(vector x) {
vector[num_elements(x)] centered;
centered = x - mean(x);
return centered;
}
}
data {
...
}
parameters {
...
}
model {
...
}
```

We want to create a log(PDF) function, similar to the native Stan functions that end in `_lpdf`

. Our function will take as arguments a vector `RT`

, and real numbers `mu`

and `sigma`

. In `_lpdf`

functions, (some of) the arguments (e.g., `RT`

, `mu`

, and `sigma`

) can be vectorized, but the output of the function is always a real number: the sum of the log(PDF) evaluated at every value of the random variable at the left of the pipe. As we show below, to do that we just move the right hand side of `target`

in our original model inside our new function. See further readings at the end of this chapter for more about Stan functions.

For our custom random number generator function (which should end with `_rng`

), we have the added complication that the function is truncated. We opt for a less-than-optimally efficient implementation for the sake of clarity. We simply generate random values from a normal distribution, do the reciprocal transformation and if the number is above zero, this number is returned; otherwise, a new number is drawn. A more efficient but more complex implementation can be found in section 18.10 of the Stan user guide (Stan Development Team 2021).

The complete code can be found in `recnormal_rt_f.stan`

, and is also shown below:

```
functions {
real recnormal_lpdf(vector RT, real mu, real sigma){
real lpdf;
lpdf = normal_lpdf(1 ./ RT | mu, sigma)
- num_elements(RT) * normal_lccdf(0 | mu, sigma)
- sum(2 * log(RT));
return lpdf;
}
real recnormal_rng(real mu, real sigma){
real pred_rt = 0;
while (pred_rt <= 0)
pred_rt = 1 / normal_rng(mu, sigma);
return pred_rt;
}
}
data {
int<lower = 1> N;
vector[N] RT;
}
parameters {
real mu_s;
real<lower = 0> sigma_s;
}
transformed parameters {
real mu = mu_s / 1000;
real sigma = sigma_s / 1000;
}
model {
target += normal_lpdf(mu_s | 2, 2);
target += normal_lpdf(sigma_s | 0.4, 0.2)
- normal_lccdf(0 | 0.4, 0.2);
target += recnormal_lpdf(RT | mu, sigma);
}
generated quantities {
array[N] real rt_pred;
for (n in 1:N)
rt_pred[n] = recnormal_rng(mu, sigma);
}
```

Fit the model to the simulated data:

```
recnormal_rt_f <- system.file("stan_models",
"recnormal_rt_f.stan",
package = "bcogsci"
)
fit_rec_f <- stan(recnormal_rt_f, data = list(N = N, RT = rt))
```

Print the summary:

```
## mean 2.5% 97.5% n_eff Rhat
## mu 0.0020 0.0020 0.0021 2831 1.002
## sigma 0.0004 0.0003 0.0004 3559 0.999
```

## 12.2 Validation of a computed posterior distribution

The model converged, but did it work? At the very least we expect that the simulated data that we used to test the model shows similar distribution to the posterior predictive distributions. This is because we are fitting the data with the same function that we use to generate the data. Figure 12.2 shows that the simulated data and the posterior predictive distributions are similar.

```
ppc_dens_overlay(rt, yrep = extract(fit_rec_f)$rt_pred[1:500, ]) +
coord_cartesian(xlim = c(0, 2000))
```

Our synthetic data set was generated by first defining a *ground truth* vector of parameters \(\{\mu; \sigma \}\). We would expect that the true values of the parameters \(\mu\) and \(\sigma\) should be well inside the posterior distribution of our model. We investigate this by plotting the true values of the parameters together with their posterior distributions using the function `mcmc_recover_hist`

of `bayesplot`

as shown below in Figure 12.3.

```
post_rec <- as.data.frame(fit_rec_f) %>%
select("mu", "sigma")
mcmc_recover_hist(post_rec, true = c(mu, sigma))
```

Even though this approach can capture serious misspecifications in our models, it has three important shortcomings.

First, it’s unclear what we should conclude if the true value of a parameter is not in the bulk of its posterior distribution. Even in a well-specified model, if we inspect enough parameters we will find that some of the marginal posterior distributions are relatively far away from the true values. In fact, a well-specified model should also be well-calibrated, and that means that its credible intervals should also behave like frequentist *confidence intervals* (Cook, Gelman, and Rubin 2006). Frequentist confidence intervals apply to the long-term success rate of the model; that is, how often in the long run, the true value of the parameter will be inside the interval. This means that in, say, 5% of the cases, we should expect that a true value is outside the 95% of the CrI of its corresponding posterior distribution (Cook, Gelman, and Rubin 2006).

The second shortcoming is that if the posteriors that we obtain are very wide, they might end up containing the true values, leading us to believe that the model is working correctly even if the model is highly biased.

The third shortcoming is that regardless of how a successful recovery of parameters is defined, we might be able to recover a posterior based on data generated from some parts of the parameter space, but not based on data generated from other parts of parameter space (Talts et al. 2018). However, inspecting the entire parameter space is infeasible.

An alternative to our approach of plotting the true values of the parameters together with their posterior distributions is simulation-based calibration (Talts et al. 2018). This is discussed next.

### 12.2.1 The simulation-based calibration procedure

Talts et al. (2018) suggest the following procedure for validating a model. Check that the Bayesian computation is *faithful* in the sense that it is not biased: We want to rule out that our model will overestimate, underestimate, or will have the incorrect precision in the parameters given the data and priors.

- Generate true values (i.e., a ground truth) for the parameters of the model from the prior:

\[\begin{equation} \boldsymbol{\tilde{\Theta}} \sim p(\boldsymbol{\Theta}) \end{equation}\]

Here, \(p(\boldsymbol{\Theta})\) represents a joint prior distributions of vector of parameters. In our previous example, the vector of parameters is \(\{\mu; \sigma\}\), and the joint prior distribution is two independent distributions, a normal and a truncated normal. Crucially, the prior distributions should be meaningful; that is, one should use priors that are not too uninformative. The prior space plays a crucial role because it determines the parameter space where we will verify the correctness or faithfulness of the model.

Assuming 120 samples (from the priors), this step for our model would be as follows:

```
N_sim <- 120
mu_tilde <- rnorm(N_sim, 2, 2) / 1000
sigma_tilde <- rtnorm(N_sim, 0.4, 0.2, a = 0) / 1000
```

- Generate multiple (\(N_sim\)) data sets based on the probability density function used as the likelihood:

\[\begin{equation} \boldsymbol{\tilde{D}} \sim p(D | \boldsymbol{\tilde{\Theta}}) \end{equation}\]

The previous equation indicates that each data set \(\tilde{D}_n\) is sampled from each of the generated parameters, \(\tilde{\Theta}_n\). Following our previous example, use the reciprocal truncated normal distribution to generate data sets of response times:

```
## Create place holder for all the simulated datasets
rt_tilde <- vector(mode = "list", length = N_sim)
# Number of observations
N <- 100
for (n in 1:N_sim) {
rt_tilde[[n]] <- 1 / rtnorm(N,
mu_tilde[n],
sigma_tilde[n],
a = 0
)
}
```

Now, `rt_tilde`

, which corresponds to \(\boldsymbol{\tilde{D}}\), consists of a list with 120 vectors of 100 observations each. Each list represents a simulated data set.

- Fit the same Bayesian model to each of the generated data sets.

After fitting the same model to each data set, we should compare the recovered posterior distribution for each parameter of each simulated data set; that is, \(p(\Theta_n | \tilde{D}_{n})\), with the parameters that were used to generate the data, \(\tilde{\Theta}_{n}\). This comparison is represented by `...`

in the code below (that is, it is not yet implemented in the code shown below). This raises the question of how to compare the posterior distribution with the ground truth values of the parameters.

Talts et al. (2018) show that this procedure (1-3) also defines a natural condition for assessing whether the computed posterior distributions match the exact posterior distributions. The average over the ensemble of posteriors, the recovered posteriors of each simulated dataset, should be the same as the prior.

Integrating out the ground posteriors over their joint distribution, we obtain the expectation of the posterior over all possible generated data sets, also called the *data averaged posterior*, which is identical to the prior distribution.

\[\begin{equation} \int_{S_{\boldsymbol{\tilde{D}}}, S_{\boldsymbol{\Theta}}} p(\boldsymbol{\Theta} | \boldsymbol{\tilde{D}}) p(\boldsymbol{\tilde{D}} | \boldsymbol{\tilde{\Theta}}) p(\boldsymbol{\Theta}) \ d\boldsymbol{\tilde{D}} d\boldsymbol{\tilde{\Theta}} = p(\boldsymbol{\Theta}) \end{equation}\]

Any mismatch between the data averaged posterior and the prior distribution indicates some problem in our model (Cook, Gelman, and Rubin 2006).^{39}

The next two steps describe a well-defined comparison of the exact posterior and prior distribution.

- Generate an ensemble of rank statistics of prior samples relative to corresponding posterior samples.

Talts et al. (2018) demonstrate that the match between the exact posterior and prior can be evaluated by examining whether, for each one-dimensional parameter of \(\boldsymbol{\Theta}\) (e. g., \(\{\mu; \sigma\}\)), the *rank statistics* of the prior sample relative to \(S\) posterior samples will be uniformly distributed across \([0, S]\). In other words, for each sample of the prior (e.g., \(\tilde{\mu}_{n}\)), we calculate the number of samples of the posterior that is smaller than the corresponding value of the parameter sampled, and we examine its distribution. A mismatch between the exact posterior and our posterior would show up as a non-uniform distribution.

As a next step, we examine this new distribution visually using a histogram with \(S + 1\) possible ranks (from \(0\) to \(S\)). There are two issues that we need to take into account (Talts et al. 2018):

Regardless of our model, histograms will deviate from uniformity if the posterior samples are dependent. This can be solved by

*thinning*the samples of the posterior, that is removing a number of intermediate samples (Talts et al. 2018 recommend between 6 and 10).To reduce the noise in the histogram, we should have bins of equal size. If \(S + 1\) are divisible by a large power of \(2\), e.g, \(1024\), we will be able to re-bin the histogram easily with bins of equal size. We complete the code shown in step 3 by generating an ensemble of ranks statistics as well.

```
N_s <- 1024 * 6 # Total samples from the posterior
# Thin by 6, and remove one extra sample so that S + 1 = 1024
thinner <- seq(from = 1, to = N_s, by = 6)[-1]
# Placeholders for the ranks
rank_mu <- rep(NA, N_sim)
rank_sigma <- rep(NA, N_sim)
# Loop over the simulations
for (n in 1:N_sim) {
message("Fit number ", n)
# Fit is only stored temporarily
fit <- stan(recnormal_rt,
data = list(
N = N,
RT = rt_tilde[[n]]
),
warmup = 1500,
iter = 1500 + N_s / 4,
control = list(
adapt_delta = .999,
max_treedepth = 14
)
)
# Break the loop if there are divergent transitions
if (get_num_divergent(fit)) break()
post <- extract(fit)
# Number of samples of the posterior that is smaller than
# the corresponding value of the parameter sampled
rank_mu[n] <- sum(post$mu[thinner] < mu_tilde[n])
rank_sigma[n] <- sum(post$sigma[thinner] < sigma_tilde[n])
}
df_rank <- tibble(
sim = rep(1:N_sim, 2),
variable = rep(c("mu", "sigma"),
each = N_sim
),
rank = c(rank_mu, rank_sigma)
)
```

- Graphically assess the correctness of the model using rank histograms

As a last step we use a histogram for each parameter to identify deviations of uniformity. If the posterior estimates are correct, then each of the \(B\) bins has a probability of \(1/B\) that a simulation (i.e. an individual rank) falls into it: \(\mathit{Binomial}(N_{sim}, 1/B)\). This allow us to complement the histograms with confidence intervals, indicating where the variation expected from a uniform histogram should be.

Use the code below to build the plot shown in Figure 12.4. We can conclude that the implementation of our model (in the parameter space determined by the priors) is correct.

```
B <- 16
bin_size <- 1024 / 16
ci_l <- qbinom(0.005, size = N_sim, prob = 1 / B)
ci_u <- qbinom(0.995, size = N_sim, prob = 1 / B)
ggplot(df_rank, aes(x = rank)) +
geom_histogram(
breaks =
seq(0, to = 1023, length.out = B + 1),
closed = "left", colour = "black"
) +
scale_y_continuous("count") +
geom_hline(yintercept = c(ci_l, ci_u), linetype = "dashed") +
facet_wrap(~variable, scales = "free_y")
```

Next, we consider an example where simulation-based calibration can reveal a problem.

### 12.2.2 Simulation-based calibration revealing a problem

Let’s assume that we made an error in the model implementation. We’ll fit an **incorrect** model: rather than the complete likelihood including the normal PDF and the truncation and Jacobian adjustments, we will fit only the normal PDF evaluated at the reciprocal of the response times, that is `target += normal_lpdf(1 ./ RT | mu, sigma);`

.

Assessing the correctness of the model by looking at the recovery of the parameters is misleading in this case, as it is evident from Figure 12.5. One could conclude that the model is fine based on this plot, since the true values of the parameters are inside their posterior distributions.

However, the rank histograms produced by the simulation-based calibration procedure in Figure 12.6 show very tall bins at the left and at the right of each histogram, exceeding the 99% CI. This is a very clear indication that there is a problem with the model specification: our incorrect model is overestimating \(\mu\) and underestimating \(\sigma\). This example illustrates the importance of simulation-based calibration.

### 12.2.3 Issues and limitation of simulation-based calibration

In the previous sections, we have used only rank histograms. Even though histograms are very intuitive means of visualization to assess model correctness, they might not be sensitive enough for small deviations (Talts et al. 2018) and other visualizations might be better suited. Another limitation of histograms is that they are sensitive to the number of bins (Säilynoja, Bürkner, and Vehtari 2022). One could bin the histograms multiple times, but this approach is difficult to interpret when there are many parameters, and can be vulnerable to multiple testing biases (Talts et al. 2018). One alternative to rank histograms is visualizations based on the empirical cumulative density function of the ranks. This is briefly presented in Box 12.2.

**Box 12.2 ****Different rank visualizations and the SBC package.**

Implementing the simulation-based calibration algorithm “by hand” introduces a new source of potential errors. Fortunately, the R package `SBC`

(Kim et al. 2022) provides tools to validate a Stan model (or any sampling algorithm) by allowing us to run simulation-based calibrations easily.
The package is in active development at the moment^{40} and can be installed with the following command.

One of the main advantages of this package is that it provides several ways to visualize the results of the simulation-based calibration procedure; see https://hyunjimoon.github.io/SBC/. Figure 12.7 shows rank histograms produced by `SBC`

of a correct model and several different incorrect models. An alternative to rank histograms is to use an empirical cumulative distribution function (ECDF)-based method, as proposed by Säilynoja, Bürkner, and Vehtari (2022). The idea behind this method is that if the ranks produced by the simulation-based calibration algorithm are uniform the ECDF of the ranks should be close to the CDF of a uniform distribution. Figure 12.8 shows the difference between the ECDF of the ranks and the CDF of a uniform distribution together with 95% confidence bands (this is the default in the `SBC`

package) for a correct model and different incorrect ones.

Even though simulation-based calibration is a comprehensive approach to test model correctness, it has several drawbacks, regardless of the means of visualization:

This approach requires fitting a model many times, which can be too time intensive for complex models. It’s worth bearing in mind that code with errors can sometimes show clear “catastrophic” failures in the recovery of the parameters. For this reason, one could start the verification of model correctness with a simple recovery of parameters (see Talts et al. 2018; Gelman et al. 2020), and then proceed with the simulation-based calibration procedure for at least some suspect aspects of the model (e.g., when working with a custom likelihood).

Another major limitation of simulation-based calibration is that it is concerned exclusively with the computational aspects of the analysis and offers no guarantee for any single observation. A complete Bayesian workflow should include prior and posterior predictive checks, see chapter 7 and Schad, Betancourt, and Vasishth (2020).

Finally, it’s important to apply simulation-based calibration only after appropriate priors are set. This is important because only the priors determine the parameter space that will be inspected. Weak priors that may be good enough for estimation may nevertheless lead to a parameter space that is unrealistically large since, during the calibration stage, there are no independent data to constrain the space.

## 12.3 Another custom distribution: Re-implementing the exponential distribution manually

There are cases when one needs a distribution that is not included in Stan. Many times one can find the PDF (or a CDF) derived in a paper, and one needs to implement it manually by writing the log(PDF) in the Stan language. Even though the exponential distribution is included in Stan, we demonstrate how we would include it step-by-step as if it weren’t available. This example extends what is demonstrated in Stan’s user guide (Stan Development Team 2021, ch. 21).

The exponential distribution is used to model waiting times until a certain event. Its PDF is the following:

\[\begin{equation} f(x | \lambda )={ \begin{cases} \lambda e^{-\lambda x} & x\geq 0,\\0&x<0. \end{cases}} \end{equation}\]

The parameter \(\lambda\) is often called the rate parameter and must be positive. A higher value for the rate parameter leads to shorter waiting times on average. The mean of the distribution is \(1/\lambda\). The exponential distribution has the key property of being *memoryless*. What this means is explained in Ross (2002) as follows: Suppose that \(X\) is a random variable with an exponential distribution as a PDF; the random variable represents the lifetime of an item (e.g., \(X\) could represent the time it takes for a radioactive particle to completely decay). If the item is \(t\) time-units old, then the remaining life \(s\) of that item has the same probability distribution as the life of a new item. Mathematically, this amounts to stating that

\[\begin{equation} P(X>s + t | X> t)= P(X>s) \end{equation}\]

The implication of the memoryless property is that we do not need to know the age of an item to know what the distribution of its remaining life is.

To give another example of memorylessness, the conditional probability that a certain event will happen in the next 100 ms is the same regardless of whether we have been already waiting 1000 ms, 10 ms, or 0 ms. Although the exponential distribution is not commonly used for modeling response times in cognitive science, it has been used in the past (Ashby and Townsend 1980; Ashby 1982). We focus on this distribution because of its simple analytical form.

As a first step, we make our own version of the PDF in R, and then check that it integrates to 1 to verify that this is actually a proper distribution (and that we haven’t introduced a typo in the formula).

To avoid underflow, that is getting a zero instead of a very small number, we’ll work in log-scale. This will also be useful when we implement this function in Stan, thus the log(PDF) is

\[\begin{equation} \log(f(x| \lambda ))= \log(\lambda) -\lambda x \end{equation}\]

where \(x >0\).

Implement this function in R with the same arguments as the `d*`

family of functions: if `log = TRUE`

the output is a log density; call this new function `dexp2`

.

```
dexp2 <- function(x, lambda = 1, log = FALSE) {
log_density <- log(lambda) - lambda * x
if (log == FALSE) {
exp(log_density)
} else {
log_density
}
}
```

Verify that this function integrates to 1 for some point values for the parameter (here, \(\lambda = 1\) and \(\lambda = 20\)):

`## 1 with absolute error < 0.000057`

`## 1 with absolute error < 0.00000001`

To test our function, we’ll also need to generate random values from the exponential distribution. If the quantile function of the distribution exists, the inverse transform sampling is a relatively straightforward way to get pseudo random numbers sampled from a target distribution (for an accessible introduction to inverse sampling, see Lynch 2007). Given a target distribution with a PDF \(f\), and a quantile function \(F^{-1}\) (the inverse of the CDF), the inverse transform sampling method consists of the following:

- Sample one number \(u\) from \(\mathit{Uniform}(0,1)\). Let \(u=F(z)=\int_L^z f(x)\, dx\) (here, \(L\) is the lower bound of the PDF f).
- Then \(z=F^{-1}(u)\) is a draw from \(f(x)\).

In this case, the quantile function (the inverse of the CDF) is the following:

\[\begin{equation} - \log(1-p)/\lambda \end{equation}\]

Here is how one can derive this inverse of the CDF. First, consider the fact that the CDF of the exponential distribution is as follows. The term \(q\) is some quantile of the distribution:

\[\begin{equation} F(q) = \int_0^q \lambda \exp(-\lambda x)\, dx \end{equation}\]

We can solve this integral by using the so-called \(u\)-substitution method (Salas, Etgen, and Hille 2003). First, define

\[\begin{equation} u = -\lambda x \end{equation}\]

Then, the derivative \(du/dx\) is:

\[\begin{equation} \frac{du}{dx} = -\lambda \end{equation}\]

This implies that \(du = - \lambda dx\), or that \(- du = \lambda dx\).

In the CDF, replace the term \(\lambda\), \(dx\) with \(-du\):

\[\begin{equation} F(q) = \int_0^q \lambda \exp(-\lambda x)\, dx = \int_0^q (- \exp(-\lambda x)\, du) \end{equation}\]

Rewriting \(-\lambda x\) as \(u\), the CDF simplifies to:

\[\begin{equation} F(q) = \int_0^q \lambda \exp(-\lambda x)\, dx = \int_0^q (- \exp(u)\, du) \end{equation}\]

We know from calculus that the integral of \(-exp(u)\) is \(-\exp(u)\). So, the integral becomes:

\[\begin{equation} F(q) = \left[ - \exp(u) \right]_0^q \end{equation}\]

Replacing \(u\) with \(-\lambda x\), we get:

\[\begin{equation} F(q) = \left[ - \exp(-\lambda x) \right]_0^q = 1- \exp(-\lambda q) \end{equation}\]

Thus, we know that the CDF is \(F(t) = 1- \exp(-\lambda q) = p\), where \(p\) is the probability of observing the quantile \(q\) or some value smaller than \(q\). To derive the inverse of the CDF, solve the equation below for \(t\):

\[\begin{equation} \begin{aligned} p &= 1- e^{-\lambda q}\\ p-1 &= - e^{-\lambda q}\\ -p+1 &= e^{-\lambda q}\\ \log(1-p) &= -\lambda q\\ - \log(1-p)/\lambda &= q\\ \end{aligned} \end{equation}\]

Write the quantile function (the inverse of the CDF) and the random number generator for the exponential distribution in R. to differentiate it from the built-in function in R, we will call it `qexp2`

:

```
qexp2 <- function(p, lambda = 1) {
-log(1 - p) / lambda
}
rexp2 <- function(n, lambda = 1) {
u <- runif(n, 0, 1)
qexp2(u, lambda)
}
```

The functions that we would use in a Stan model are relatively faithful to the R code, but follow Stan conventions: The function `exp_lpdf`

stores the sum of the log(PDF) evaluated at each value of x, this would be analogous to doing `sum(dexp2(x, lambda, log = TRUE))`

; the function `exp_rng`

implements a non-vectorized version of `rexp2`

and uses the auxiliary function `exp_icdf`

(icdf stands for inverse CDF), which is similar to `qexp2`

.

```
functions {
real exp_lpdf(vector x, real lambda){
vector[num_elements(x)] lpdf = log(lambda) - lambda * x;
return sum(lpdf);
}
real exp_icdf(real p, real lambda){
return - log(1 - p) / lambda;
}
real exp_rng(real lambda){
real u = uniform_rng(0, 1);
return exp_icdf(u, lambda);
}
}
```

We are now ready to generate synthetic data and fit the distribution in Stan. Generate 1000 observations.

Use `exponential.stan`

which includes the function block shown before and the following obligatory blocks:

```
data {
int<lower = 1> N;
vector[N] RT;
}
parameters {
real<lower = 0> lambda;
}
model {
target += normal_lpdf(lambda | 0, .1) -
normal_lccdf(0 | 0, .1);
target += exp_lpdf(RT | lambda);
}
generated quantities {
array[N] real rt_pred;
for (n in 1:N)
rt_pred[n] = exp_rng(lambda);
}
```

Fit the data with Stan.

```
exponential <- system.file("stan_models",
"exponential.stan",
package = "bcogsci")
fit_exp <- stan(exponential, data = list(N = N, RT = rt))
```

Print the summary:

```
## mean 2.5% 97.5% n_eff Rhat
## lambda 0.01 0 0.01 1505 1
```

Carry out a quick check first, verifying that the true value of the parameter \(\lambda\) is reasonably inside its posterior distribution. This is shown in Figure 12.9(a). Figure 12.9(b) shows the results of the simulation-based calibration procedure, and shows that our implementation was correct.

```
# Plot a
post_exp <- as.data.frame(fit_exp) %>%
select("lambda")
mcmc_recover_hist(post_exp, true = lambda) +
# make it similar to plot b:
ggtitle("a") +
xlab("lamda") +
theme(plot.title = element_text(hjust = 0),
legend.position="none",
strip.text.x = element_blank())
# Plot b
B <- 16
bin_size <- 1024 / 16
ci_l <- qbinom(0.005, size = N_sim, prob = 1 / B)
ci_u <- qbinom(0.995, size = N_sim, prob = 1 / B)
ggplot(df_rank_lambda, aes(x = rank)) +
geom_histogram(
breaks =
seq(0, to = 1023, length.out = B + 1),
closed = "left", colour = "black"
) +
scale_y_continuous("count") +
geom_hline(yintercept = c(ci_l, ci_u), linetype = "dashed") +
ggtitle("b") +
theme(plot.title = element_text(hjust = 0))
```

## 12.4 Summary

In this chapter, we learned how to create a PDF that is not provided by Stan. We learned how to do this by doing a change of variables (with its Jacobian adjustment) and by building the distribution from scratch in a function ending in `_lpdf`

. We also learned how to verify the correctness of the new functions by recovering the true values of the parameters and by using simulation-based calibration.

## 12.5 Further reading

Jacobian adjustments in Bayesian models are an ongoing source of confusion and there are several posts in blogs and study cases that try to shed light on them:

- https://rstudio-pubs-static.s3.amazonaws.com/486816_440106f76c944734a7d4c84761e37388.html
- https://betanalpha.github.io/assets/case_studies/probability_theory.html#42_probability_density_functions
- https://jsocolar.github.io/jacobians/#fn1
- https://mc-stan.org/documentation/case-studies/mle-params.html

A complete tutorial of simulation-based calibration using the `SBC`

package was given in the online event Stanconnect 2021, and it is available in https://www.martinmodrak.cz/post/2021-sbc_tutorial/. The ideas that predate this technique can be found in Cook, Gelman, and Rubin (2006) and were extended in Talts et al. (2018). The use of ECDF-based visualizations is discussed in Säilynoja, Bürkner, and Vehtari (2022). The role of simulation-based calibration in the Bayesian workflow is discussed in Schad, Betancourt, and Vasishth (2020). Examples of the use of simulation-based calibration to validate novel models used in cognitive science are Hartmann, Johannsen, and Klauer (2020) and Bürkner and Charpentier (2020). The extension of this procedure to validate Bayes factors is discussed in Schad et al. (2021).

Custom functions in general and custom probability functions are treated in chapters 18 and 19 of the Stan user’s guide (Stan Development Team 2021).

## 12.6 Exercises

**Exercise 12.1 **Fitting a shifted lognormal distribution.

A random variable \(Y\) has a shifted log-normal distribution with shift \(\psi\), location \(\mu\), and scale \(\sigma\), if \(Z = Y-\psi\) and \(Z \sim \mathit{LogNormal}(\mu,\sigma)\).

Implement a

`shifted_lognormal_ldpf`

function in Stan with three parameters,`mu`

,`sigma`

, and`psi`

. Tip: One can use the regular log-normal distribution and apply a change of variable. In this case the adjustment of the Jacobian would be \(|\frac{d}{dY} Y - \psi|=1\), which in log-space is conveniently zero.Verify the correctness of the model by recovering the true values of (your choice) of the parameters of the model and by using simulation-based calibration. In order to use simulation-based calibration, you will need to decide on sensible priors; assume that \(\psi \sim \mathit{Normal}_+(100,50)\), and choose priors for \(\mu\) and \(\sigma\) so that the prior predictive distributions are adequate for response times.

**Exercise 12.2 **Fitting a Wald distribution.

The Wald distribution (or inverse Gaussian distribution) and its variants have been proposed as another useful distribution for response times (see for example Heathcote 2004).

The probability density function of the Wald distribution is the following.

\[\begin{equation} f(x;\mu ,\lambda )={\sqrt {\frac {\lambda }{2\pi x^{3}}}}\exp {\biggl (}-{\frac {\lambda (x-\mu )^{2}}{2\mu ^{2}x}}{\biggr )} \end{equation}\]

- Implement this distribution in Stan as
`wald_lpdf`

. In order to do this, you will need to derive the logarithm of the PDF presented above. You can adapt the code of the following R function.

```
dwald <- function(x, lambda, mu, log = FALSE) {
log_density <- 0.5 * log(lambda / (2 * pi())) -
1.5 * log(x) -
0.5 * lambda * ((x - mu) / (mu * sqrt(x)))^2
if (log == FALSE) {
exp(log_density)
} else {
log_density
}
}
```

- Verify the correctness of the model by recovering the true values of (your choice) of the parameters of the model and by using simulation-based calibration. As with the previous exercise, you will need to decide on sensible priors by deriving prior predictive distributions that are adequate for response times.

### References

Ashby, F Gregory. 1982. “Testing the Assumptions of Exponential, Additive Reaction Time Models.” *Memory & Cognition* 10 (2). Springer: 125–34.

Ashby, F Gregory, and James T Townsend. 1980. “Decomposing the Reaction Time Distribution: Pure Insertion and Selective Influence Revisited.” *Journal of Mathematical Psychology* 21 (2). Elsevier: 93–123.

Box, George E.P., and David R. Cox. 1964. “An Analysis of Transformations.” *Journal of the Royal Statistical Society. Series B (Methodological)*. JSTOR, 211–52.

Bürkner, Paul-Christian, and Emmanuel Charpentier. 2020. “Modelling Monotonic Effects of Ordinal Predictors in Bayesian Regression Models.” *British Journal of Mathematical and Statistical Psychology*. Wiley Online Library.

Cook, Samantha R, Andrew Gelman, and Donald B Rubin. 2006. “Validation of Software for Bayesian Models Using Posterior Quantiles.” *Journal of Computational and Graphical Statistics* 15 (3). Taylor & Francis: 675–92. https://doi.org/10.1198/106186006X136976.

Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020. “Bayesian Workflow.” *arXiv Preprint arXiv:2011.01808*.

Harris, Christopher M., and Jonathan Waddington. 2012. “On the Convergence of Time Interval Moments: Caveat Sciscitator.” *Journal of Neuroscience Methods* 205 (2): 345–56. https://doi.org/https://doi.org/10.1016/j.jneumeth.2012.01.017.

Harris, Christopher M., Jonathan Waddington, Valerio Biscione, and Sean Manzi. 2014. “Manual Choice Reaction Times in the Rate-Domain.” *Frontiers in Human Neuroscience* 8: 418. https://doi.org/10.3389/fnhum.2014.00418.

Hartmann, Raphael, Lea Johannsen, and Karl Christoph Klauer. 2020. “rtmpt: An R Package for Fitting Response-Time Extended Multinomial Processing Tree Models.” *Behavior Research Methods* 52 (3). Springer: 1313–38.

Heathcote, Andrew. 2004. “Fitting Wald and ex-Wald Distributions to Response Time Data: An Example Using Functions for the S-Plus Package.” *Behavior Research Methods, Instruments, & Computers* 36 (4). Springer: 678–94.

Kim, Shinyoung, Hyunji Moon, Martin Modrák, and Teemu Säilynoja. 2022. *SBC: Simulation Based Calibration for Rstan/Cmdstanr Models*.

Lynch, Scott Michael. 2007. *Introduction to Applied Bayesian Statistics and Estimation for Social Scientists*. New York, NY: Springer.

Ross, Sheldon. 2002. *A First Course in Probability*. Pearson Education.

Salas, Saturnino L, Garret J Etgen, and Einar Hille. 2003. *Calculus: One and Several Variables*. Ninth. John Wiley & Sons.

Säilynoja, Teemu, Paul-Christian Bürkner, and Aki Vehtari. 2022. “Graphical Test for Discrete Uniformity and Its Applications in Goodness-of-Fit Evaluation and Multiple Sample Comparison.” *Statistics and Computing* 32 (2). Springer: 1–21.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled Bayesian Workflow in Cognitive Science.” *Psychological Methods* 26 (1). American Psychological Association: 103–26.

Schad, Daniel J., Bruno Nicenboim, Paul-Christian Bürkner, Michael J. Betancourt, and Shravan Vasishth. 2021. “Workflow Techniques for the Robust Use of Bayes Factors.”

Stan Development Team. 2021. “Stan Modeling Language Users Guide and Reference Manual, Version 2.27.” https://mc-stan.org.

Talts, Sean, Michael J. Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” *arXiv Preprint arXiv:1804.06788*.

Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Yes, but Did It Work?: Evaluating Variational Inference.” In *International Conference on Machine Learning*, 5581–90. PMLR.

Not every transformation is valid, univariate changes of variables must be monotonic and differentiable, multivariate changes of variables must be injective and differentiable.↩

In the multivariate case, it is equal to the absolute determinant of the Jacobian, the matrix of all its first-order partial derivatives, of the transform; see section 6.7 of Ross (2002).↩

We are working under the assumption that Stan, which yields the posterior approximation, works correctly. In principle, if we assume that our model is correct, we can also use simulation-based calibration to examine whether our approximation to the posterior distribution is correct (that is, whether Stan’s sampler, or any posterior approximation method works correctly); for example, see Yao et al. (2018). In any case, no mismatch between the data-averaged posterior and the corresponding prior distribution of a particular parameter means that we have support for the correctness of both the model and the approximation to the posterior.↩

Even though the package is already fully functional, function names and arguments might change.↩