Chapter 6 The Art and Science of Prior Elicitation

Nothing strikes fear into the heart of the newcomer to Bayesian methods more than the idea of specifying priors for the parameters in a model. On the face of it, this concern seems like a valid one; how can one know what the plausible parameter values are in a model before one has even seen the data?
In reality, this worry is purely a consequence of the way we are normally taught to carry out data analysis, especially in areas like psychology and linguistics. Model fitting is considered to be a black-box activity, with the primary concern being whether the effect of interest is “significant” or “non-significant.” As a consequence of the training that we receive, we learn to focus on one thing (the \(p\)-value) and we learn to ignore the estimates that we obtain from the model; it becomes irrelevant whether the effect of interest has a mean value of 500 ms (in a reading study, say) or 10 ms; all that matters is whether it is a significant effect or not. In fact, the way many scientists summarize the literature in their field is by classifying studies into two bins: significant and non-significant. There are obvious problems with this classification method; for example, \(p=0.051\) might be counted as “marginally” significant, but \(p=0.049\) is never counted as marginally non-significant. But there will usually not be any important difference between these two borderline values. Real-life examples of such a binary classification approach are seen in Phillips, Wagers, and Lau (2011) and Hammerly, Staub, and Dillon (2019). Because the focus is on significance, we never develop a sense of what the estimates of an effect are likely to be in a future study. This is why, when faced with a prior-distribution specification problem, we are misled into feeling like we know nothing about the quantitative estimates relating to a problem we are studying.

Prior specification has a lot in common with something that physicists call a Fermi problem. As Von Baeyer (1988) describes it: “A Fermi problem has a characteristic profile: Upon first hearing it, one doesn’t have even the remotest notion what the answer might be. And one feels certain that too little information exists to find a solution. Yet, when the problem is broken down into subproblems, each one answerable without the help of experts or reference books, an estimate can be made.” Fermi problems in the physics context are situations where one needs ballpark (approximate) estimates of physical quantities in order to proceed with a calculation. The name comes from the physicist Enrico Fermi, who developed the ability to carry out fairly accurate back-of-the-envelope calculations when working out approximate numerical values needed for a particular computation. Von Baeyer (1988) puts it well: “Prudent physicists—those who want to avoid false leads and dead ends—operate according to a long-standing principle: Never start a lengthy calculation until you know the range of values within which the answer is likely to fall (and, equally important, the range within which the answer is unlikely to fall).” As in physics, so in data analysis: as Bayesians, we need to acquire the ability to work out plausible ranges of values for parameters. This is a learnable skill that improves with practice; with time, we can learn to become prudent data analysts.

As Spiegelhalter, Abrams, and Myles (2004) point out, there is no one “correct” prior distribution. One consequence of this fact is that a good Bayesian analysis always takes a range of prior specifications into account; this is called a sensitivity analysis. We have already seen examples of this, but more examples will be provided in this and later chapters.

Prior specification requires the estimation of probabilities. Human beings are not good at estimating probabilities, because they are susceptible to several kinds of biases (Kadane and Wolfson 1998; Spiegelhalter, Abrams, and Myles 2004). We list the most important ones that are relevant to cognitive science applications:

  • Availability bias: Events that are more salient to the researcher are given higher probability, and events that are less salient are given lower probability.
  • Adjustment and anchoring bias: The initial assessment of the probability of an event influences one’s subsequent judgements. For example, when eliciting a credible interval, a researcher’s judgements about the interval’s bounds tend to be anchored to their initial assessment and adjusted insufficiently.
  • Overconfidence: When eliciting credible intervals from oneself, there is a tendency to specify too tight an interval.
  • Hindsight bias: If one relies on the data to come up with a prior for the analysis of that very same data set, one’s assessment is likely to be biased.

Although training can improve the natural tendency to be biased in these different ways, one must recognize that bias is inevitable when eliciting priors, either from oneself or from other experts; it follows that one should always define “a community of priors” (Kass and Greenhouse 1989): one should consider the effect of informed as well as skeptical or agnostic (uninformative) priors on the posterior distribution of interest. Incidentally, bias is not unique to Bayesian statistics; the same problems arise in frequentist data analysis. Even in frequentist analyses, the researcher always interprets the data in the light of their prior beliefs; the data never really “speak for themselves.” For example, the researcher might remove “outliers” based on a belief that certain values are implausible; or the researcher will choose a particular likelihood based on their belief about the underlying generative process. All these are subjective decisions made by the researcher, and can dramatically impact the outcome of the analyses.

The great advantage that Bayesian methods have is that they allow us to formally take a range of (competing) prior beliefs into account when interpreting the data. We illustrate this point in the present chapter.

6.1 Eliciting priors from oneself for a self-paced reading study: A simple example

In section 3.5, we have already encountered a sensitivity analysis; there, several priors were used to investigate how the posterior is affected. Here is another example of a sensitivity analysis; the problem here is how to elicit priors from oneself for a particular research problem.

6.1.1 An example: English relative clauses

We will work out priors from first principles for a commonly-used experiment design in psycholinguistics. As an example, consider English subject vs. object relative clause processing differences. Relative clauses are sentences like (1a) and (1b):

(1a) The reporter [who the photographer sent to the editor] was hoping for a good story. (ORC)

(1b) The reporter [who sent the photographer to the editor] was hoping for a good story. (SRC)

Sentence (1a) is an object relative clause (ORC): the noun reporter is modified by a relative clause (demarcated in square brackets), and the noun reporter is the object of the verb sent. Sentence (1b) is a subject relative clause (SRC): the noun reporter is modified by a relative clause (demarcated in square brackets), but this time the noun reporter is the subject of the verb sent. Many theories in sentence processing predict that the reading time at the verb sent will be shorter in English subject vs. object relatives; one explanation is that the dependency distance between reporter and sent is shorter in subject vs. object relatives (Grodner and Gibson 2005).

The experimental method we consider here is self-paced reading.26 The self-paced reading method is commonly used in psycholinguistics as a cheaper and faster substitute for eyetracking during reading. The subject is seated in front of a computer screen and is initially shown a series of broken lines that mask the words of a complete sentence. The subject then unmasks the first word (or phrase) by pressing the space bar. Upon pressing the space bar again, the second word/phrase is unmasked and the first word/phrase is masked again; see Figure 6.1. The time in milliseconds that elapses between these two space-bar presses counts as the reading time for the first word/phrase. In this way, the reading time for each successive word/phrase in the sentence is recorded. Usually, at the end of each trial, the subject is also asked a yes/no question about the sentence. This is intended to ensure that the subject is adequately attending to the meaning of the sentence.

FIGURE 6.1: A moving window self-paced reading task for the sentence “The king is dead.” Words are unmasked one by one after each press of the space bar.

A classic example of self-paced reading data appeared in Exercise 5.2. A hierarchical model that we could fit to such data would be the following. In chapter 5, we showed that for reading-time data, the log-normal likelihood is generally a better choice than a normal likelihood. In the present chapter, in order to make it easier for the reader to get started with thinking about priors, we use the normal likelihood instead of the log-normal. In real-life data analysis, the normal likelihood would be a very poor choice for reading-time data.

The model below has varying intercepts and varying slopes for subjects and for items, but assumes no correlation between the varying intercepts and slopes. The correlation is removed in order to compare the posteriors to the estimates from the corresponding frequentist lme4 model. In the model shown below, we use “default” priors that the brm function assumes for all the parameters. We are only using default priors here as a starting point; in practice, we will never use default priors for a reported analysis. In the model output below, for brevity we will only display the summary of the posterior distribution for the slope parameter, which represents the difference between the two condition means.

data("df_gg05_rc")
df_gg05_rc <- df_gg05_rc %>%
  mutate(c_cond = if_else(condition == "objgap", 1 / 2, -1 / 2))
fit_gg05 <- brm(RT ~ c_cond + (1 + c_cond || subj) +
                  (1 + c_cond || item), df_gg05_rc)
(default_b <- posterior_summary(fit_gg05,
                                variable = "b_c_cond"))
##          Estimate Est.Error Q2.5 Q97.5
## b_c_cond      103      35.9   31   173
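
Incidentally, one can check exactly which default priors the brm function assumes for a given model with the get_prior function from brms; this is a good habit even when defaults are used only as a starting point. A minimal sketch is shown below; by default, brms places flat (improper) priors on population-level slopes such as the c_cond effect, which is why the defaults are relatively uninformative.

## inspect the default priors that brm() would assume for this model:
get_prior(RT ~ c_cond + (1 + c_cond || subj) +
            (1 + c_cond || item), data = df_gg05_rc)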

The estimates from this model are remarkably similar to those from a frequentist linear mixed model (Bates, Mächler, et al. 2015a):

fit_lmer <- lmer(RT ~ c_cond + (1 + c_cond || subj) +
                   (1 + c_cond || item), df_gg05_rc)
b <- summary(fit_lmer)$coefficients["c_cond", "Estimate"]
SE <- summary(fit_lmer)$coefficients["c_cond", "Std. Error"]
## estimate of the slope and
## lower and upper bounds of the 95% CI:
(lmer_b <- c(b, b - (2 * SE), b + (2 * SE)))
## [1] 102.3  29.9 174.7

The similarity between the estimates from the Bayesian and frequentist models is due to the fact that the default priors, being relatively uninformative, barely influence the posterior; the likelihood dominates in determining the posterior. In general, such uninformative priors on the parameters will show a similar lack of influence on the posterior (Spiegelhalter, Abrams, and Myles 2004). We can quickly establish this in the above example by using another uninformative prior:

fit_gg05_unif <- brm(RT ~ c_cond + (1 + c_cond || subj) +
                       (1 + c_cond || item),
  prior = c(
    prior(uniform(-2000, 2000), class = Intercept,
          lb = -2000, ub = 2000),
    prior(uniform(-2000, 2000), class = b,
          lb = -2000, ub = 2000),
    prior(normal(0, 500), class = sd),
    prior(normal(0, 500), class = sigma)
  ), df_gg05_rc
  )
(uniform_b <- posterior_summary(fit_gg05_unif, 
                                variable = c("b_c_cond")))
##          Estimate Est.Error Q2.5 Q97.5
## b_c_cond      103        38   28   177

As shown in Table 6.1, the estimates from this model and from the two other models shown above all look very similar.

TABLE 6.1: Estimates of the mean difference (with 95% confidence/credible intervals) between two conditions in a hierarchical model of English relative clause data from Grodner and Gibson, 2005, using (a) the frequentist hierarchical model, (b) a Bayesian model using default priors from the brm function, and (c) a Bayesian model with uniform priors.
model mean lower upper
Frequentist \(102\) \(30\) \(175\)
Default prior \(103\) \(31\) \(173\)
Uniform \(103\) \(28\) \(177\)

It is tempting for the newcomer to Bayesian statistics to conclude from Table 6.1 that default priors used in brms, or uniform priors, are good enough for fitting models. This conclusion would in general be incorrect. There are many reasons why a sensitivity analysis, which includes regularizing, relatively informative priors, is necessary in Bayesian modeling. First, relatively informative, regularizing priors must be considered in many cases to avoid convergence problems (an example is finite mixture models, presented in chapter 19). In fact, in many cases the frequentist model fit in lme4 will return estimates, such as \(\pm 1\) correlation estimates between varying intercepts and varying slopes, that actually represent convergence failures (Bates, Kliegl, et al. 2015; Matuschek et al. 2017). In Bayesian models, unless we use regularizing priors that are at least mildly informative, we will generally face similar convergence problems. Second, when computing Bayes factors, a sensitivity analysis using increasingly informative priors is vital; see chapter 15 for extensive discussion of this point. Third, one of the greatest advantages of Bayesian models is that one can formally take into account conflicting or competing prior beliefs in the model, by eliciting informative priors from competing experts. Although such a use of informative priors is still rare in cognitive science, it can be of great value when trying to interpret a statistical analysis.

Given the importance of regularizing, informative priors, we consider next some informative priors that we could use in the given model. We unpack the process by which we could work these priors out from existing information in the literature.

Initially, when trying to work out some alternative priors for some parameters of interest, we might think that we know absolutely nothing about the seven parameters in this model. But, as in Fermi problems, we actually know more than we realize.

Let’s think about the parameters in the relative clause example one by one. For ease of exposition, we begin by writing out the model in mathematical form. Here, \(n\) is the row index in the data frame, and the variable c_cond is a sum-coded (\(\pm 0.5\)) predictor.

\[\begin{equation} RT_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} + c\_cond_n \cdot (\beta+ u_{subj[n],2}+w_{item[n],2}),\sigma) \end{equation}\]

where

\[\begin{equation} \begin{aligned} u_1 &\sim \mathit{Normal}(0,\tau_{u_1})\\ u_2 &\sim \mathit{Normal}(0,\tau_{u_2})\\ w_1 &\sim \mathit{Normal}(0,\tau_{w_1})\\ w_2 &\sim \mathit{Normal}(0,\tau_{w_2}) \end{aligned} \end{equation}\]

The parameters that we need to define priors for are the following: \(\alpha, \beta, \tau_{u_1}, \tau_{u_2}, \tau_{w_1}, \tau_{w_2}, \sigma\).

6.1.2 Eliciting a prior for the intercept

We will proceed from first principles. Let’s begin with the intercept, \(\alpha\); under the sum-contrast coding used here, it represents the grand mean reading time in the data set.

Ask yourself: What is the absolute minimum possible reading time? The answer is 0 ms; reading time cannot be negative. You have already eliminated half the real-number line as impossible values! Thus, one cannot really say that one knows nothing about the plausible values of mean reading times. Having eliminated half the real-number line, now ask yourself: what is a reasonable upper bound on reading time for an English ditransitive verb? Even after taking into account variations in word length and frequency, one minute (\(60\) seconds) seems like too long; even \(30\) seconds seems unreasonably long to spend on a single word. As a first attempt at an approximation, somewhere between \(2500\) and \(3000\) ms might constitute a reasonable upper bound, with 3000 ms being less likely than \(2500\) ms.

Now consider what an approximate average reading time for a verb might be. One can arrive at such a ballpark number by asking oneself how fast one can read an abstract that has, say, \(500\) words in it. Suppose that we estimate that we can read \(500\) words in \(120\) seconds (two minutes). Then, \(120/500=0.24\) seconds is the time we would spend per word on average; this is \(240\) ms per word. Maybe two minutes for \(500\) words was too optimistic? Let’s adjust the mean to \(300\) ms, instead of \(240\) ms. Such intuition-based judgments can be a valuable starting point for an analysis, as Fermi showed repeatedly in his work (Von Baeyer 1988). If one is uncomfortable consulting one’s intuition about average reading times, or even as a sanity check to independently validate one’s own intuitions, one can look up a review article on reading that gives empirical estimates (e.g., Rayner 1998).

One could express the above guesses as a normal distribution truncated at \(0\) ms on the ms scale, with mean \(300\) ms and standard deviation \(1000\) ms. An essential step in such an estimation procedure is to plot one’s assumed prior distribution graphically to see if it seems reasonable: Figure 6.2 shows a graphical summary of this prior.
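
One way to visualize such a prior is sketched below; this sketch assumes the extraDistr package, which provides the density function dtnorm for the truncated normal (the same package provides the rtnorm function used later in this chapter).

library(extraDistr)
## density of a Normal(300, 1000) prior truncated at 0 from below:
x <- seq(0, 4000, by = 1)
plot(x, dtnorm(x, mean = 300, sd = 1000, a = 0),
     type = "l", xlab = "mean reading time (ms)", ylab = "density")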

FIGURE 6.2: A truncated normal distribution representing a prior distribution on mean reading times.

Once the prior is plotted, one might conclude that this distribution is a bit too widely spread out to represent mean reading time per word. But for estimating the posterior distribution, it will rarely be harmful to allow a broader range of values than we strictly consider plausible (the situation is different when it comes to Bayes factors analyses, as we will see later—there, widely spread out priors for a parameter of interest can have a dramatic impact on the Bayes factor test for whether that parameter is zero or not).

Another way to obtain a better feel for what plausible distributions of word reading times might be is to plot some existing data from published work. Figure 6.3 shows the distribution of mean reading times from ten published studies.

FIGURE 6.3: The distribution of mean reading times from ten self-paced reading studies.

Although our truncated normal distribution, \(\mathit{Normal}_+(300,1000)\), seems like a pretty wild guess, it actually is not terribly unreasonable given what we observe in these ten published self-paced reading studies. As shown in Figure 6.3, the mean reading times in these different self-paced reading studies from different languages (English, Persian, Dutch, Hindi, German, Spanish) all fall within the prior distribution: they range from a minimum of 464 ms to a maximum of 751 ms. These values easily lie within the 95% credible interval for a \(\mathit{Normal}_+(300,1000)\): [40, 2458] ms. These ten studies are not about relative clauses, but that does not matter, because we are just trying to come up with a prior distribution on average reading times for a word; we just want an approximate idea of the range of plausible mean reading times.
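
The interval quoted above can be verified with the quantile function of the truncated normal; this is a quick check, again assuming the extraDistr package.

## 95% interval implied by a Normal(300, 1000) prior truncated at 0:
qtnorm(c(0.025, 0.975), mean = 300, sd = 1000, a = 0)
## approximately 40 and 2458 ms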

The above prior specification for the intercept can (and must!) be evaluated in the context of the model using prior predictive checks. We have already encountered prior predictive checks in previous chapters; we will revisit them in detail in chapter 7. In the above data set on English relative clauses, one could check what the prior on the intercept implies in terms of the data generated by the model (see chapter 5 for examples). As stressed repeatedly throughout this book, sensitivity analysis is an integral component of Bayesian methodology. A sensitivity analysis should be used to work out what the impact is of a range of priors on the posterior distribution.

6.1.3 Eliciting a prior for the slope

Having come up with some potential priors for the intercept, consider next the prior specification for the effect of relative clause type on reading time; this is the slope \(\beta\) in the model above. Recall that c_cond is \(\pm 0.5\) sum coded.

Theory suggests (see Grodner and Gibson 2005 for a review) that subject relatives in English should be easier to process than object relatives, at the relative clause verb. This means that a priori, we expect the difference between object and subject relatives to be positive in sign. What would be a reasonable mean for this effect? We can look at previous research to obtain some ballpark estimates.

For example, Just and Carpenter (1992) carried out a self-paced reading study on English subject and object relatives, and their Figure 2 (p. 130) shows that the difference between the two relative clause types at the relative clause verb ranges from about 10 ms to 100 ms (depending on working memory capacity differences in different groups of subjects). This is already a good starting point, but we can look at some other published data to gain more confidence about the approximate difference between the conditions.

For example, Reali and Christiansen (2007) investigated subject and object relatives in four self-paced reading studies; in their design, the noun phrase inside the relative clause was always a pronoun, and they carried out analyses on the verb plus pronoun, not just the verb as in Grodner and Gibson (2005). We can still use the estimates from this study, because including a pronoun like “I”, “you”, or “they” in a verb region is not going to increase reading times dramatically. The hypothesis in Reali and Christiansen (2007) was that because object relatives containing a pronoun occur more frequently in corpora than subject relatives with a pronoun, the relative clause verb should be processed faster in object relatives than subject relatives (this is the opposite to the prediction for the reading times at the relative clause verb discussed in Grodner and Gibson 2005). The authors report comparisons for the pronoun and relative clause verb taken together (i.e., pronoun+verb in object relatives and verb+pronoun in subject relatives). In experiment 1, they report a \(-57\) ms difference between object and subject relatives, with a 95% confidence interval ranging from \(-104\) to \(-10\) ms. In a second experiment, they report a difference of \(-53.5\) ms with a 95% confidence interval ranging from \(-79\) to \(-28\) ms; in a third experiment, the difference was \(-32\) ms [\(-48,-16\)]; and in a fourth experiment, \(-43\) ms [\(-84, -2\)]. This range of values gives us a good ballpark estimate of the magnitude of the effect.

Yet another study involving English relative clauses is Fedorenko, Gibson, and Rohde (2006). In this self-paced reading study, Fedorenko and colleagues compared reading times within the entire relative clause phrase (the relative pronoun and the noun+verb sequence inside the relative clause). Their data show that object relatives are harder to process than subject relatives; the difference in means is \(460\) ms, with a confidence interval \([299, 621]\) ms. This difference is much larger than in the studies mentioned above, but this is because of the long region of interest considered—it is well known that the longer the reading/reaction time, the larger the standard deviation and therefore the larger the potential difference between means (Wagenmakers and Brown 2007).

One can also look at adjacent, related phenomena in sentence processing to get a feel for what the relative clause processing time difference should be. Research on similarity-based interference is closely related to relative clause processing differences: in both types of phenomenon, the assumption is that intervening nouns can increase processing difficulty. So let’s look at some reading studies on similarity-based interference.

In a recent study (Jäger, Engelmann, and Vasishth 2017), we investigated the estimates from about 80 reading studies on interference. Interference here refers to the difficulty experienced by the comprehender during sentence comprehension (e.g., in reading studies) when they need to retrieve a particular word from working memory but other words with similar features hinder retrieval. The meta-analysis in Jäger, Engelmann, and Vasishth (2017) suggests that interference effects range between \(-50\) and \(50\) ms at most, depending on the phenomenon (some kinds of interference cause speed-ups, others cause slow-downs; see the discussion in Engelmann, Jäger, and Vasishth 2020).

Given that the Grodner and Gibson (2005) design can be seen as falling within the broader class of interference effects (Lewis and Vasishth 2005; Vasishth et al. 2019; Vasishth and Engelmann 2022), it is reasonable to choose informative priors that reflect this observed range of interference effects in the literature.

The above discussion gives us some empirical basis for assuming that the object minus subject relative clause difference in the Grodner and Gibson (2005) study on English could range from 10 to at most 100 ms or so. Although we expect the effect to be positive, perhaps we don’t want to pre-judge this before we see the data. For this reason, we could decide on a \(\mathit{Normal}(0,50)\) prior on the slope parameter in the model; this prior implies that we are 95% certain that the effect lies between \(-100\) and \(+100\) ms. This prior is specifically for the millisecond scale, and specifically for the case where the critical region is one word (the relative clause verb in English).
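
A quick check in R confirms that a \(\mathit{Normal}(0,50)\) prior places roughly 95% of its probability mass in this range:

qnorm(c(0.025, 0.975), mean = 0, sd = 50)
## approximately -98 and +98 ms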

In this particular example, it makes sense to assume that large effects like 100 ms are unlikely; this is so even if we do occasionally see estimates that are even higher than 100 ms in published data. For example, in Gordon, Hendrick, and Johnson (2001), experiments 1-4 show very large OR-SR differences at the relative clause verb: \(450\) ms, \(250\) ms, \(500\) ms, and \(200\) ms, respectively, with an approximate SE of \(50\) ms. The numbers of subjects in the four experiments were 44, 48, 48, and 68, respectively. Given the other estimates mentioned above, we would be unwilling to take such large effects seriously, because a major reason for observing overly large estimates in a one-word region of interest would be publication bias coupled with Type M error (Gelman and Carlin 2014). Published studies in psycholinguistics are often underpowered, which leads to exaggerated estimates being published (Type M error). Because big-news effects are encouraged in major journals, overestimates tend to get published preferentially.27 In recent work (Vasishth, Yadav, et al. 2022), we have shown that even under the optimistic assumption that the effect size is approximately \(50\) ms, achieving 80% power in English relative clause studies would require at least 120 subjects (if one takes the uncertainty of the effect estimate into account, many more subjects would be needed).

Of course, if our experiment is designed so that the critical region constitutes several words, as in the Fedorenko, Gibson, and Rohde (2006) study, then one would have to choose a prior with a larger mean and standard deviation.

Box 6.1 The scale of the parameter must be taken into account when eliciting a prior

A related, important issue to consider when defining priors is the scale on which the parameter is defined. For example, if we were analyzing the Grodner and Gibson (2005) experiment using the log-normal likelihood, then the intercept and slope would be on the log millisecond scale; priors that are reasonable on the millisecond scale then imply rather strange priors. For example, suppose we assume that the intercept, now on the log ms scale, has a \(\mathit{Normal}_+(300,100)\) prior and the slope has a \(\mathit{Normal}(0,50)\) prior. On the millisecond scale, these priors on the intercept and slope imply a very broad range of reading time differences between the two conditions, ranging from a very large negative value to a very large positive value, which obviously makes little sense:

intercept <- rtnorm(100000, mean = 300, sd = 100, a = 0)
slope <- rnorm(100000, mean = 0, sd = 50)
effect <- exp(intercept + slope / 2) -
  exp(intercept - slope / 2)
quantile(effect, prob = c(0.025, 0.975))
##       2.5%      97.5% 
## -6.78e+210  3.03e+211

In this connection, it may be useful to revisit the discussion in section 4.3.2, where we discussed the effect of prior specification on the log-odds scale and what that implies on the probability scale.

Box 6.2 Cromwell’s rule

A frequently asked question from newcomers to Bayes is: what if I define too restrictive a prior? Wouldn’t that bias the posterior distribution? This concern is also raised quite often by critics of Bayesian methods. The key point here is that a good Bayesian analysis always involves a sensitivity analysis, and also includes prior and posterior predictive checks under different priors. One should reject priors that make no sense for the particular research problem one is working on, or that unreasonably bias the posterior. As one gains experience with Bayesian modeling, these concerns recede as one comes to understand how useful and important priors are for interpreting the data. Chapter 7 will elaborate on developing a sensible workflow for understanding and interpreting the results of a Bayesian analysis.

As an extreme example of an overly specific prior, if one were to define a \(\mathit{Normal}(0,10)\) prior for the \(\alpha\) and/or \(\beta\) parameters on the millisecond scale for the Grodner and Gibson (2005) example above, that would definitely bias the posterior for those parameters. Let’s check this. Try running the code below (the output is suppressed here to conserve space). In this model, the correlations between the varying intercepts and varying slopes for subjects and for items are not included; this is done only in order to keep the model simple.

restrictive_priors <-
  c(prior(normal(0, 10), class = Intercept),
    prior(normal(0, 10), class = b),
    prior(normal(0, 500), class = sd),
    prior(normal(0, 500), class = sigma))

fit_restrictive <- brm(RT ~ c_cond + (c_cond || subj) +
                         (c_cond || item),
                       prior = restrictive_priors,
                    # Increase the iterations to avoid warnings
                       iter = 4000,
                       df_gg05_rc)
summary(fit_restrictive)

If you run the above code, you will see that the overly specific (and extremely unreasonable) priors on the intercept and slope will dominate in determining the posterior; such priors obviously make no sense. If there is ever any doubt about the implications of a prior, prior and posterior predictive checks should be used to investigate the implications.
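
If you do fit this model, a quick way to visualize the problem is a posterior predictive check; below is a minimal sketch using the pp_check function from brms (in older versions of brms, the ndraws argument is called nsamples).

## compare the observed reading times to data simulated from the
## posterior of the model with overly restrictive priors:
pp_check(fit_restrictive, ndraws = 50)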

Here, an important Bayesian principle is Cromwell’s rule (Lindley 1991; Jackman 2009): we should generally allow for some uncertainty in our priors. A prior like \(\mathit{Normal}(0,10)\) or \(\mathit{Normal}_{+}(0,10)\) on the millisecond scale is clearly overly restrictive given what we’ve established about plausible values of the relative clause effect from existing data. A more reasonable but still quite tight prior would be \(\mathit{Normal}(0,50)\). In the spirit of Cromwell’s rule, and just to be conservative, we can allow (in a sensitivity analysis) larger possible effect sizes by adopting a prior such as \(\mathit{Normal}(0,75)\), and we can allow the effect to be negative, even if theory suggests otherwise.

Although there are no fixed rules for deciding on a prior, a sensitivity analysis will quickly establish whether the prior or priors chosen are biasing the posterior. One critical thing to remember in connection with Cromwell’s rule is that if we categorically rule out a range of values a priori for a parameter by giving that range a probability of \(0\), the posterior will also never include that range of values, no matter what the data show. For example, in the Reali and Christiansen (2007) experiments, if we had used a truncated prior like \(\mathit{Normal}_{+}(0,50)\), the posterior could never have shown the observed negative sign on the effects reported in the paper. As a general rule, therefore, one should allow the effect to vary in both directions, positive and negative. Sometimes unidirectional priors are justified; in those cases, it is of course legitimate to use them. An example is the prior on standard deviations (which cannot be negative).

6.1.4 Eliciting priors for the variance components

Having defined the priors for the intercept and the slope, we are left with prior specifications for the variance component parameters. At least in psycholinguistics, the residual standard deviation is usually the largest source of variance; the by-subject intercepts’ standard deviation is usually the next-largest value, and if experimental items are designed to have minimal variance, then the item-related variance components are usually the smallest. Here again, we can look at some previous data to get a sense of what the priors should look like.

For example, we could use the estimates for the variance components from existing studies. Figure 6.4 shows the empirical distributions of the variance components from ten published studies. There are four classes of variance component: the subject intercept standard deviations, the item intercept standard deviations, the standard deviations of the slopes, and the residual standard deviations. In each case, we can compute the estimated mean and standard deviation of each type of variance component, and then use these to define normal distributions truncated at \(0\). The estimated means and standard deviations of each type of variance component are as follows:

  • Subject intercept SDs: estimated mean: 165, estimated standard deviation (sd): 55.
  • Item intercept SDs: mean: 49, sd: 52.
  • Slope SDs: mean 39, sd: 58.
  • Residual SDs: mean: 392, sd: 140.

The largest standard deviations are those of the subject intercepts and the residuals, so these are the ones we will focus on. We can and should orient our prior for the group-level (also known as random effects) variance components to subsume these larger values.

FIGURE 6.4: Histograms of empirical distributions of the different variance components from ten published studies. The y-axis shows counts rather than density in order to make it clear that we are working with only a few data sets.

We can now use equations (4.2) and (4.3) shown in Box 4.1 to work out the location and scale of a corresponding truncated normal distribution, that is, the parameters of the parent normal distribution that, once truncated, yields the empirical means and standard deviations listed above. As an example, we could assume a prior distribution truncated at \(0\) from below, and at \(1000\) ms from above. That is, \(a=0\), and \(b=1000\).

We can write a function that takes the estimated means and standard deviations, and returns the mean and standard deviation of the corresponding truncated distribution (see Box 4.1).
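
A minimal sketch of one possible implementation of such a function is shown below (not necessarily the implementation used to produce the output further below). It numerically searches for the location and scale of the parent normal distribution whose truncated version has the target mean and standard deviation, using the standard formulas for the moments of a truncated normal.

compute_meansd_parent <- function(mean_trunc, sd_trunc, a = 0, b = 1000) {
  ## mean and sd of a Normal(mu, sigma) truncated to [a, b]:
  trunc_moments <- function(mu, sigma) {
    alpha <- (a - mu) / sigma
    beta <- (b - mu) / sigma
    Z <- pnorm(beta) - pnorm(alpha)
    m <- mu + sigma * (dnorm(alpha) - dnorm(beta)) / Z
    v <- sigma^2 * (1 + (alpha * dnorm(alpha) - beta * dnorm(beta)) / Z -
                      ((dnorm(alpha) - dnorm(beta)) / Z)^2)
    c(m, sqrt(v))
  }
  ## squared distance between the implied and the target moments:
  obj <- function(par) {
    mom <- trunc_moments(par[1], par[2])
    (mom[1] - mean_trunc)^2 + (mom[2] - sd_trunc)^2
  }
  ## search for the parent location and scale, starting at the target values:
  fit <- optim(c(mean_trunc, sd_trunc), obj)
  paste("location: ", round(fit$par[1]), "scale: ", round(fit$par[2]))
}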

The largest variance component among the group-level effects (that is, all variance components other than the residual standard deviation) is the by-subjects intercept. One can compute the location and scale of the truncated distribution that would generate the observed mean and standard deviation of the by-subject intercept estimates:

## Subject intercept SDs (estimated mean 165 and sd 55; see the list above):
intsubjmean <- 165; intsubjsd <- 55
compute_meansd_parent(mean_trunc = intsubjmean, sd_trunc = intsubjsd)
## [1] "location:  165 scale:  55"

The corresponding truncated distribution is shown in Figure 6.5.

FIGURE 6.5: A truncated normal distribution (with location 165 and scale 55) representing an empirically derived prior distribution for the parameter for the by-subjects intercept adjustment in a hierarchical model.

The prior shown in Figure 6.5 looks a bit too restrictive; it could well happen that in a future study the by-subject intercept standard deviation is closer to 500 ms. Taking Cromwell’s rule into account, one could widen the scale parameter of the truncated normal to, say, 200. The result is shown in Figure 6.6.

FIGURE 6.6: A truncated normal distribution (with location 165 and scale 200) representing an empirically derived prior distribution for the parameter for the by-subjects intercept adjustment in a hierarchical model taking Cromwell’s rule into account.

Figure 6.6 does not look too unreasonable as an informative prior for this variance component. This prior will also serve us well for all the other group-level effects (the random intercept for items, and the random slopes for subject and item), which will have smaller values.

Finally, the prior for the residual standard deviation is going to have to allow for larger values:

## Residual SDs (estimated mean 392 and sd 140; see the list above):
resmean <- 392; ressd <- 140
compute_meansd_parent(mean_trunc = resmean, sd_trunc = ressd)
## [1] "location:  391 scale:  142"

Figure 6.7 shows a plausible informative prior derived from the empirical estimates.

FIGURE 6.7: A truncated normal distribution representing an empirically derived prior distribution for the parameter for the residual standard deviation in a hierarchical model.

We stress again that Cromwell’s rule should generally be kept in mind—it’s usually better to have a little bit more uncertainty than warranted than too tight a prior. An overly tight prior will ensure that the posterior is entirely driven by the prior. Again, prior predictive checks should be an integral part of the process of establishing a sensible set of priors for the variance components. This point about prior predictive checks will be elaborated on with examples in chapter 7.

We now apply the relatively informative priors we came up with above to analyze the Grodner and Gibson (2005) data. Applying Cromwell’s rule, we allow for a bit more uncertainty than our existing empirical data suggest.

Specifically, we could choose the following informative priors for the Grodner and Gibson (2005) data; a brm sketch using these priors is shown after the list:

  • The intercept: \(\alpha \sim \mathit{Normal}(500,100)\)
  • The slope: \(\beta \sim \mathit{Normal}(50,50)\)
  • All group-level standard deviations: \(\tau_{u_1}, \tau_{u_2}, \tau_{w_1}, \tau_{w_2} \sim \mathit{Normal}_{+}(165,200)\)
  • The residual standard deviation: \(\sigma \sim \mathit{Normal}_{+}(391,200)\)
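
These priors could be specified in brm as sketched below (a minimal sketch, assuming the df_gg05_rc data and the model formula used earlier in this chapter; since brms restricts the classes sd and sigma to be non-negative, normal priors on those classes act as the truncated \(\mathit{Normal}_{+}\) distributions listed above). Adding sample_prior = "only" to such a call would yield draws from the prior predictive distribution rather than the posterior.

informative_priors <-
  c(prior(normal(500, 100), class = Intercept),
    prior(normal(50, 50), class = b),
    prior(normal(165, 200), class = sd),
    prior(normal(391, 200), class = sigma))
fit_gg05_inf <- brm(RT ~ c_cond + (c_cond || subj) +
                      (c_cond || item),
                    prior = informative_priors,
                    data = df_gg05_rc)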

The first step is to check whether the prior predictive distribution makes sense. Figure 6.8 shows that the prior predictive distributions are not too implausible, although they could be improved further. One big problem is the normal distribution assumed in the model; a log-normal distribution captures the shape of the distribution of the Grodner and Gibson (2005) data better than a normal distribution. The discrepancy between the Grodner and Gibson (2005) data and our prior predictive distribution implies that we might be using the wrong likelihood. Another problem is that the reading times in the prior predictive distribution can be negative—this is also a consequence of our using the wrong likelihood. As an exercise, fit a model with a log-normal likelihood and informative priors based on previous data. When using a log-normal likelihood, the prior for the slope parameter obviously has to be on the log scale. Therefore, we will need to define an informative prior on the log scale for the slope parameter. For example, consider the following prior on the slope: \(\mathit{Normal}(0.12, 0.04)\). Here is how to interpret this on the millisecond scale: Assuming a mean reading time of 6 log ms, this prior roughly corresponds to an effect size on the millisecond scale that has a 95% credible interval ranging from \(16\) ms to \(81\) ms. Review section 3.7.2 if you have forgotten how this transformation was done.
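
The back-transformation just described can be checked with a few lines of code (a sketch, assuming a grand mean of 6 log ms as stated above):

## 95% interval of the Normal(0.12, 0.04) prior on the log scale:
b_bounds <- qnorm(c(0.025, 0.975), mean = 0.12, sd = 0.04)
## the corresponding effect sizes on the millisecond scale:
exp(6 + b_bounds / 2) - exp(6 - b_bounds / 2)
## roughly the 16 to 81 ms range mentioned above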

For now, because our running example uses a normal likelihood on reading times in milliseconds, we can retain these priors.

FIGURE 6.8: Prior predictive distributions from the model (using a normal likelihood) to be used for the Grodner and Gibson data analysis. The panels show eight prior predictive distributions.

The sensitivity analysis could then be displayed, showing the posteriors under different prior settings. Figures 6.9 and 6.10 show the posteriors under two distinct sets of priors.

FIGURE 6.9: Posterior distributions of parameters for the English relative clause data, using uniform priors (\(\mathit{Uniform}(0,2000)\)) on the intercept and slope.

FIGURE 6.10: Posterior distributions of parameters for the English relative clause data, using relatively informative priors on the intercept and slope.

What can one do if one doesn’t know absolutely anything about one’s research problem? An example is the power posing data that we encountered in chapter 4, in an exercise in section 4.6. There, the question was how testosterone levels change after the subject adopts either a high power pose or a low power pose (a between-subjects design). Not being experts in this domain, we may find ourselves stumped for priors. In such a situation, it could be defensible to use uninformative priors like \(\mathit{Cauchy}(0,2.5)\), at least initially. However, as discussed in a later chapter, if one is committed to doing a Bayes factor analysis, then one is obliged to think carefully about plausible a priori values of the effect. This would require consulting one or more experts or reading the literature on the topic to obtain ballpark estimates. An exercise at the end of this chapter will elaborate on this idea. We turn next to the topic of eliciting priors from experts.

6.2 Eliciting priors from experts

It can happen that one is working on a research problem where one’s own prior knowledge is lacking, or where a range of competing prior beliefs needs to be incorporated into the analysis. In such situations, it becomes important to elicit priors from experts other than oneself. Although informal elicitation can be a perfectly legitimate approach, there does exist a well-developed methodology for systematically eliciting priors in Bayesian statistics (O’Hagan et al. 2006).

The particular method developed by O’Hagan and colleagues comes with an R package called SHELF, which stands for the Sheffield Elicitation Framework; the method was developed by statisticians at the University of Sheffield, UK. SHELF is available from http://www.tonyohagan.co.uk/shelf/. This framework comes with a detailed set of instructions and a fixed procedure for eliciting distributions. It also provides detailed guidance on documenting the elicitation process, thereby allowing a full record of the elicitation process to be created. Creating such a record is important because the elicitation procedure needs to be transparent to a third party reading the final report on the data analysis.

The SHELF procedure works as follows. There is a facilitator and an expert (or a group of experts; we will consider the single expert case here, but one can easily extend the approach to multiple experts).

  • A pre-elicitation form is filled out by the facilitator in consultation with the expert. This form sets the stage for the elicitation exercise and records some background information, such as the nature of the expertise of the assessor.
  • Then, an elicitation method is chosen. Simple methods are the most effective. One good approach is the quartile method. The expert first decides on a lower and upper limit of possible values for the quantity to be estimated. Because the lower and upper bounds are elicited before the median, this minimizes the effects of the “anchoring and adjustment heuristic” (O’Hagan et al. 2006), whereby experts tend to anchor their subsequent estimates of quartiles based on their first judgement of the median. Following this, a median value is decided on, and lower and upper quartiles are elicited. The SHELF package has functions to display these quartiles graphically, allowing the expert to adjust them at this stage if necessary. It is important for the expert to confirm that, in their judgement, the four partitioned regions that result have equal probability.
  • The elicited distribution is then displayed as a density plot (several choices of probability density functions are available, but we will usually use the normal or the truncated normal in this chapter); this graphical summary serves to give feedback to the expert. The parameters of the distribution are also displayed. Once the expert agrees to the final density, the parameters can be considered the expert’s judgement regarding the prior distribution of the quantity of interest. One can consult multiple experts and either combine their judgements into one prior, or consider each expert’s prior separately in a sensitivity analysis.

When eliciting priors from more than one expert, one can elicit the priors separately and then use each prior separately in a sensitivity analysis. This approach takes each individual expert’s opinion into account when interpreting the data and can be a valuable sensitivity analysis (for an example from psycholinguistics, see the discussion surrounding Table 2.2 on p. 47 in Vasishth and Engelmann 2022). Alternatively, one can pool the priors together (see Spiegelhalter, Abrams, and Myles 2004 for discussion) and create a single consensus prior; this would amount to an average of the differing opinions about prior distributions. A third approach is to elicit a consensus prior by bringing all the experts together and eliciting a prior from the group in a single setting. Of course, these approaches are not mutually exclusive. One of the hallmark properties of Bayesian analysis is that the posterior distribution of the parameter of interest can be investigated in light of differing prior beliefs and the data (and of course the model). Box 6.3 illustrates a simple elicitation procedure involving two experts; the example is adapted from the SHELF package’s vignette.

Box 6.3 Example: prior elicitation using SHELF

An example of prior elicitation using SHELF is shown below. This example is adapted from the SHELF vignette.

Suppose that two experts are consulted separately. The question asked of the experts is what they consider to be plausible values for a parameter \(X\). The parameter \(X\) can be seen as a percentage; it therefore ranges from \(0\) to \(100\).

Step 1: Elicit quartiles and median from each expert.

  • Expert A states that \(P(X<30)=0.25, P(X<40)=0.5, P(X<50)=0.75\).
  • Expert B states that \(P(X<20)=0.25, P(X<25)=0.5, P(X<35)=0.75\).

Step 2: Fit the implied distributions for each expert’s judgements and plot the distributions, along with a pooled distribution (the linear pool in the figure) using the plotfit() function from the library SHELF.

elicited <- matrix(c(30, 20, 0.25,
                     40, 25, 0.5,
                     50, 35, 0.75),
                   nrow = 3, ncol = 3, byrow = TRUE)
dist_2expr <- fitdist(vals = elicited[, 1:2],
                      probs = elicited[, 3],
                      lower = 0, upper = 100)
plotfit(dist_2expr, lp = TRUE, returnPlot = TRUE) +
  scale_color_grey() +
  theme_light()

FIGURE 6.11: Visualizing priors elicited from two experts for a parameter \(X\) representing a percentage ranging from \(0\) to \(100\).

Step 3: Then bring the two experts together and elicit a consensus distribution.

Suppose that the experts agree that \(P(X<25)=0.25, P(X<30)=0.5, P(X<40)=0.75\). The consensus distribution is then:

elicited <- matrix(c(25, 0.25,
                     30, 0.5,
                     40, 0.75),
                   nrow = 3, ncol = 2, byrow = TRUE)
dist_cons <- fitdist(vals = elicited[,1],
                     probs = elicited[,2],
                     lower = 0, upper = 100)
plotfit(dist_cons, ql = 0.05, qu = 0.95, returnPlot = TRUE) +
  theme_light()

FIGURE 6.12: Visualizing a consensus prior from two experts for a parameter \(X\) representing a percentage ranging from \(0\) to \(100\).

Step 4: Give feedback to the experts by showing them the 5th and 95th percentiles, and check that these bounds match their beliefs. If not, then repeat the above steps.

feedback(dist_cons, quantiles = c(0.05, 0.95))
## $fitted.quantiles
##      normal     t gamma lognormal logt beta hist
## 0.05   12.5  7.49  16.2      17.3 14.8 15.2    5
## 0.95   50.4 55.10  53.2      55.3 64.1 51.2   88
##      mirrorgamma mirrorlognormal mirrorlogt
## 0.05        10.5            9.18       2.08
## 0.95        49.1           48.60      52.10
## 
## $fitted.probabilities
##    elicited normal     t gamma lognormal  logt  beta hist
## 25     0.25  0.288 0.289 0.279     0.274 0.275 0.283 0.25
## 30     0.50  0.451 0.453 0.461     0.466 0.469 0.456 0.50
## 40     0.75  0.772 0.774 0.769     0.767 0.768 0.770 0.75
##    mirrorgamma mirrorlognormal mirrorlogt
## 25       0.292           0.295      0.296
## 30       0.446           0.444      0.447
## 40       0.772           0.773      0.775

6.3 Deriving priors from meta-analyses

Meta-analysis has been used widely in clinical research (Higgins and Green 2008; Sutton et al. 2012; DerSimonian and Laird 1986; Normand 1999), but it is used relatively rarely in psychology and linguistics. Random-effects meta-analysis (discussed in a later chapter in detail) is an especially useful tool in cognitive science.

Meta-analysis is not a magic bullet; this is because of publication bias—usually only (supposedly) newsworthy results are published, leading to a skewed picture of the effects. As a consequence, a meta-analysis will probably yield biased estimates. Nevertheless, it can still tell us something about what we know so far from published studies, if only that the studies are too noisy to be interpretable. Despite this limitation, some prior information is better than no information. As long as one remains aware of the limitations of meta-analysis, one can still use it effectively to study one’s research questions.

We begin with observed effects \(y_n\) (e.g., estimated difference between two conditions) and their estimated standard errors (SEs); the SEs serve as an indication of the precision of the estimate, with larger SEs implying a low-precision estimate. Once we have collected the observed estimates (e.g., from published studies), we can define an assumed underlying generative process whereby each study \(n=1,\dots,N\) has an unknown true mean \(\zeta_n\):

\(y_n \sim \mathit{Normal}(\zeta_n,SE_n)\)

A further assumption is that each unknown true mean \(\zeta_n\) in each study is generated from a distribution that has some true overall mean \(\zeta\), and standard deviation \(\tau\). The standard deviation \(\tau\) reflects between-study variation, which could be due to different subjects being used in each study, different lab protocols, different methods, different languages being studied, etc.

\(\zeta_n \sim \mathit{Normal}(\zeta,\tau)\)

This kind of meta-analysis is actually the familiar hierarchical model we have already encountered in chapter 5. As in hierarchical models, hyperpriors have to be defined for \(\zeta\) and \(\tau\). A useful application of this kind of meta-analysis is to derive a posterior distribution for \(\zeta\) based on the available evidence; this posterior can be used (e.g., with a normal approximation) as a prior for a future study.
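
To make the structure concrete, the following minimal sketch shows how such a random-effects meta-analysis could be specified in brms (the full treatment is deferred to chapter 13). It assumes a data frame df_ma with one row per study, containing the observed effect y, its standard error se, and a study identifier study; the hyperpriors shown are only illustrative.

fit_ma <- brm(y | se(se) ~ 1 + (1 | study),
              prior = c(prior(normal(0, 100), class = Intercept),
                        prior(normal(0, 100), class = sd)),
              data = df_ma)

In this formulation, the intercept corresponds to \(\zeta\), and the standard deviation of the by-study intercepts corresponds to \(\tau\).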

A simple example is the published data on Chinese relative clauses; the data are adapted from Vasishth (2015). Table 6.2 shows the mean difference between the object and subject relative clause conditions, along with its standard error, derived from published reading studies on Chinese relatives.

TABLE 6.2: The difference between object and subject relative clause reading times (effect), along with their standard errors (SE), from different published reading studies on Chinese relative clauses.
study.id study y se
1 Hsiao et al 03 \(50.0\) \(25.0\)
2 Gibson et al 13 \(-120.0\) \(48.0\)
3 Vas. et al 13, E1 \(148.5\) \(50.9\)
4 Vas. et al 13, E2 \(82.6\) \(41.2\)
5 Vas. et al 13, E3 \(-109.4\) \(54.8\)
6 Jaeg. et al 15, E1 \(55.6\) \(65.1\)
7 Jaeg. et al 15, E2 \(81.9\) \(36.3\)
8 Wu 09 \(50.0\) \(23.0\)
9 Qiao et al 11, E1 \(-70.0\) \(42.0\)
10 Qiao et al 11, E2 \(6.2\) \(19.9\)
11 Lin & Garn. 11, E1 \(-100.0\) \(30.0\)
12 Chen et al 08 \(75.0\) \(35.5\)
13 C Lin & Bev. 06 \(100.0\) \(80.0\)

Suppose that we want to do a new study investigating the difference between object and subject relative clauses, and suppose that in the sensitivity analysis, one of the priors we want is an empirically justifiable informative prior. Of course, the sensitivity analysis will also contain uninformative priors; we have seen examples of such priors in the previous chapters.

We can derive an empirically justified prior by conducting a random-effects meta-analysis. We postpone discussion of how exactly to fit such a model to chapter 13; here, we simply report the posterior distribution of the overall effect \(\zeta\) based on the prior data, ignoring the details of model fitting.

FIGURE 6.13: The posterior distribution of the difference between object and subject relative clause processing in Chinese relative clause data, computed from a random-effects meta-analysis using published Chinese relative clause data from reading studies.

The posterior distribution of \(\zeta\) is shown in Figure 6.13. What we can derive from this posterior distribution of \(\zeta\) is a normal approximation that represents what we know so far about Chinese relatives, based on the available data. The key here is the word “available”; almost certainly there exist studies that were inconclusive and were therefore never published. The published record is always biased because of the nature of the publication game in science (only supposedly newsworthy results get published).

The mean of the posterior is \(27\) ms, and the width of the 95% credible interval is \(72-(-19)=91\) ms. Since the 95% credible interval has a width that is approximately four times the standard deviation (assuming a normal distribution), we can work out the standard deviation by dividing the width by four: \(91/4=22.75\). Given these estimates, we could use a normal distribution with mean \(27\) and standard deviation \(22.75\) as an informative prior in a sensitivity analysis.

As an example, we will analyze a data set from Gibson and Wu (2013) that was not part of the above meta-analysis; for the above meta-analysis, the estimate from the Gibson and Wu (2013) study that was shown in Table 6.2 has been removed. The meta-analysis posterior will then be used as an informative prior. First, load the data and sum-code the predictor:

data("df_gibsonwu")
df_gibsonwu <- df_gibsonwu %>%
  mutate(c_cond = if_else(type == "obj-ext", 1 / 2, -1 / 2))
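
As a quick sanity check, we can verify that each relative clause type is mapped onto exactly one value of the new predictor (distinct() is from dplyr; this check is optional):

## check the sum coding: one value of c_cond per relative clause type
df_gibsonwu %>%
  distinct(type, c_cond)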

Because we will now use a log-normal likelihood for the reading time data, we need to work out what the meta-analysis posterior of \(\zeta\) corresponds to on the log scale. The grand mean reading time of the Gibson and Wu (2013) data on the log scale is \(6.1\). In order to arrive at a mean difference of approximately \(27\) ms on the millisecond scale, the corresponding log-scale value can be worked out as follows.

If we know the grand mean reading time \(\hat\alpha\) on the log scale, then we can find out what the slope \(\hat\beta\) needs to be such that the difference between the two conditions is approximately \(27\) ms. In other words, we need to find \(\hat\beta\) in the equation:

\[\exp(\hat\alpha + \hat\beta/2) - \exp(\hat\alpha - \hat\beta/2) = 27\]

One can find this estimate by trial and error, or solve the equation analytically. Similarly, if the lower and upper bounds of the effect estimate from the meta-analysis are known on the ms scale, we can figure out the lower and upper bounds of \(\hat\beta\), call them \(\hat\beta_{lower}\) and \(\hat\beta_{upper}\):

\[\exp(\hat\alpha + \hat\beta_{lower}/2) - \exp(\hat\alpha - \hat\beta_{lower}/2) = -19\]

\[\exp(\hat\alpha + \hat\beta_{upper}/2) - \exp(\hat\alpha - \hat\beta_{upper}/2) = 72\]

Once we have estimates of \(\hat\beta_{lower}\) and \(\hat\beta_{upper}\), we can figure out the standard deviation estimate of the effect by computing the width of the interval, \(\hat\beta_{upper}-\hat\beta_{lower}\), and dividing it by four (because the 95% credible interval spans approximately four standard deviations). Thus, the end result of our calculation is a mean and a standard deviation (on the log scale) of a normal distribution, which we can use as a relatively informative prior, informed by the meta-analysis, for the future study.

(int <- mean(log(df_gibsonwu$rt)))
## [1] 6.06
## the slope (effect size) on the log scale, found by trial and error:
b <- 0.058
## the difference between conditions in ms implied by this slope:
exp(int + b / 2) - exp(int - b / 2)
## [1] 24.9
## the lower bound of the slope on the log scale:
lower <- -0.052
exp(int + lower / 2) - exp(int - lower / 2)
## [1] -22.3
## the upper bound of the slope on the log scale:
upper <- 0.17
exp(int + upper / 2) - exp(int - upper / 2)
## [1] 73
## the standard deviation: the width of the interval divided by 4:
(SE <- round((upper - lower) / 4, 3))
## [1] 0.056
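
Instead of trial and error, the equation can also be solved in closed form: since \(\exp(\hat\alpha + \hat\beta/2) - \exp(\hat\alpha - \hat\beta/2) = 2\exp(\hat\alpha)\sinh(\hat\beta/2)\), it follows that \(\hat\beta = 2\,\mathrm{asinh}(d/(2\exp(\hat\alpha)))\), where \(d\) is the target difference in ms. A minimal sketch is shown below; the helper function slope_from_ms() is introduced here only for illustration, and the values in the comments are approximate:

## closed-form solution for the log-scale slope, given a target
## difference d (in ms) and the log-scale grand mean alpha:
##   exp(alpha + b/2) - exp(alpha - b/2) = 2 * exp(alpha) * sinh(b/2)
##   =>  b = 2 * asinh(d / (2 * exp(alpha)))
slope_from_ms <- function(d, alpha) 2 * asinh(d / (2 * exp(alpha)))
slope_from_ms(27, int)  ## approximately 0.063
slope_from_ms(-19, int) ## approximately -0.044
slope_from_ms(72, int)  ## approximately 0.168

The resulting standard deviation, \((0.168 - (-0.044))/4 \approx 0.053\), is close to the value of \(0.056\) obtained above by trial and error.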

As always, we will do a sensitivity analysis, using an uninformative prior on the slope parameter (\(\mathit{Normal}(0,1)\)) as well as the meta-analysis prior (\(\mathit{Normal}(0.058, 0.056)\)).

## uninformative priors on the parameters of interest
## and on the variance components:
fit_gibsonwu_log <-
  brm(rt ~ c_cond +
        (c_cond | subj) +
        (c_cond | item),
      family = lognormal(),
      prior = c(prior(normal(6, 1.5), class = Intercept),
                prior(normal(0, 1), class = b),
                prior(normal(0, 1), class = sigma),
                prior(normal(0, 1), class = sd),
                prior(lkj(2), class = cor)),
      data = df_gibsonwu)
## meta-analysis priors:
fit_gibsonwu_ma <-
  brm(rt ~ c_cond +
        (c_cond | subj) +
        (c_cond | item),
      family = lognormal(),
      prior = c(
        prior(normal(6, 1.5), class = Intercept),
        prior(normal(0.058, 0.056), class = b),
        prior(normal(0, 1), class = sigma),
        prior(normal(0, 1), class = sd),
        prior(lkj(2), class = cor)),
      data = df_gibsonwu)
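
The posterior means and 95% credible intervals of the slope summarized in Table 6.3 below can be extracted from the fitted models; one way to do this in brms is with fixef():

## posterior mean, estimation error, and 95% credible interval of the
## slope under the Normal(0, 1) prior and under the meta-analysis prior:
fixef(fit_gibsonwu_log)["c_cond", ]
fixef(fit_gibsonwu_ma)["c_cond", ]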

A summary of the posteriors (means and 95% credible intervals) under the \(\mathit{Normal}(0,1)\) prior and under the meta-analysis prior is shown in Table 6.3. In this particular case, the posterior is visibly influenced by the choice of prior. The differences between the two posteriors are small, but such differences could in principle lead to different outcomes (and conclusions) in a Bayes factor analysis.

TABLE 6.3: A summary of the posteriors under a relatively uninformative prior and an informative prior based on a meta-analysis, for the Chinese relative clause data from Gibson and Wu, 2013.
Priors Mean Lower Upper
\(\mathit{Normal}(0,1)\) \(-0.07\) \(-0.18\) \(0.03\)
\(\mathit{Normal}(0.058, 0.056)\) \(-0.01\) \(-0.08\) \(0.08\)

6.4 Using previous experiments’ posteriors as priors for a new study

In a situation where we are attempting to replicate a previous study’s results, we can derive an informative prior for the analysis of the replication attempt from the previous study’s posterior distribution. In the previous chapter, we encountered this in one of the exercises: given data on Chinese relatives (Gibson and Wu 2013), we want to replicate the effect with a new data set that has the same design but different subjects. The data from the replication attempt are from Vasishth et al. (2013).

The first data set from Gibson and Wu (2013) was analyzed in the previous section using uninformative priors. We can extract the mean and standard deviation of the posterior, and use that to derive an informative prior for the replication attempt.
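
One way to carry out this extraction is to work directly with the posterior draws of the slope from fit_gibsonwu_log, the model fit above with the \(\mathit{Normal}(0,1)\) prior; a minimal sketch:

## extract the posterior draws of the slope (parameter b_c_cond) from the
## model fit with the uninformative prior, and compute their mean and
## standard deviation; these define the normal approximation that serves
## as an informative prior for the replication analysis.
post_slope <- as_draws_df(fit_gibsonwu_log)$b_c_cond
round(c(mean = mean(post_slope), sd = sd(post_slope)), 3)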

Now, for the replication study, we can use this posterior (with a normal approximation) if we want to build on what we learned from the original Gibson and Wu (2013) study. As usual, we will do a sensitivity analysis: one model is fit with an uninformative prior on the parameter of interest, \(\mathit{Normal}(0,1)\), as in the preceding section; another model is fit with the informative directional prior \(\mathit{Normal}(-0.073, 0.04)\). For good measure, we also include a model with a prior derived from the meta-analysis in the preceding section (the posterior of the \(\zeta\) parameter). A sketch of one such model fit is shown below.
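
The replication model can be fit with the same brm() call as before, changing only the prior on the slope. The sketch below assumes, purely for illustration, that the replication data from Vasishth et al. (2013) are available in a data frame called df_gibsonwu_repl (a hypothetical name) with the same column names as df_gibsonwu:

## sketch: the same model as before, fit to the replication data (the
## data frame name df_gibsonwu_repl is hypothetical), with the informative
## prior on the slope derived from the original study's posterior:
df_gibsonwu_repl <- df_gibsonwu_repl %>%
  mutate(c_cond = if_else(type == "obj-ext", 1 / 2, -1 / 2))
fit_repl_inf <-
  brm(rt ~ c_cond +
        (c_cond | subj) +
        (c_cond | item),
      family = lognormal(),
      prior = c(prior(normal(6, 1.5), class = Intercept),
                prior(normal(-0.073, 0.04), class = b),
                prior(normal(0, 1), class = sigma),
                prior(normal(0, 1), class = sd),
                prior(lkj(2), class = cor)),
      data = df_gibsonwu_repl)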

TABLE 6.4: A summary of the posteriors under an uninformative prior (Normal(0,1)), a prior based on previous data, and a meta-analysis prior, for data from a replication attempt of Gibson and Wu, 2013.
Priors Mean Lower Upper
\(\mathit{Normal}(0,1)\) \(-0.08\) \(-0.21\) \(0.05\)
\(\mathit{Normal}(-0.07, 0.2)\) \(-0.08\) \(-0.20\) \(0.04\)
\(\mathit{Normal}(0.041, 0.2)\) \(-0.08\) \(-0.20\) \(0.05\)

Table 6.4 summarizes the different posteriors under the three prior specifications. Again, in this case, the differences in the posteriors are small, but in a Bayes factor analysis, the outcomes under these different priors could be different.

6.5 Summary

Working out appropriate priors for one’s research problem is essentially a Fermi problem. One can use several different strategies for working out priors: introspection, a literature review, computing statistics from existing data, conducting a meta-analysis, using posteriors from existing data as priors for a new, closely related study, or formally eliciting priors from domain experts. If a prior is too vague, this can lead to slow convergence or other convergence problems, and it can bias Bayes factors towards the null hypothesis; if a prior is too informative, it can bias the posterior. This inherent potential for bias in prior specification should be formally investigated using sensitivity analyses (with a collection of uninformative, skeptical, and informative priors of various types) and prior and posterior predictive checks. Although prior specification seems like a daunting task to the beginning student of Bayes, with time and experience one can develop a very well-informed set of priors for one’s research problems.

6.6 Further reading

For interesting (and amusing) examples of Fermi solutions to questions, see https://what-if.xkcd.com/84/. Two important books, Mahajan (2010) and Mahajan (2014), unpack the art of approximation in mathematics and other disciplines; the approach presented in these books is closely related to the art of Fermi-style approximation. Levy (2021) is an important book that develops the analytical skill needed to figure out what your “tacit knowledge” about a particular problem is. Tetlock and Gardner (2015) explains how experts deploy existing knowledge to derive probabilistic predictions (predictions that come with a certain amount of uncertainty) about real-world problems—this skill is closely related to prior (self-)elicitation. An excellent presentation of prior elicitation is in O’Hagan et al. (2006). Useful discussions about priors are provided in Lunn et al. (2012); Spiegelhalter, Abrams, and Myles (2004); Gelman, Simpson, and Betancourt (2017); and Simpson et al. (2017). The Stan website also includes some guidelines: Prior distributions for rstanarm models in https://mc-stan.org/rstanarm/articles/priors.html; and prior choice recommendations in https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations. Browne and Draper (2006) and Gelman (2006) discuss prior specifications in hierarchical models.

References

Bates, Douglas M., Reinhold Kliegl, Shravan Vasishth, and R. Harald Baayen. 2015. “Parsimonious Mixed Models.”

Bates, Douglas M., Martin Mächler, Ben Bolker, and Steve Walker. 2015a. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Browne, William J., and David Draper. 2006. “A Comparison of Bayesian and Likelihood-Based Methods for Fitting Multilevel Models.” Bayesian Analysis 1 (3): 473–514.

DerSimonian, Rebecca, and Nan M. Laird. 1986. “Meta-Analysis in Clinical Trials.” Controlled Clinical Trials 7 (3): 177–88. https://doi.org/10.1016/0197-2456(86)90046-2.

Engelmann, Felix, Lena A. Jäger, and Shravan Vasishth. 2020. “The Effect of Prominence and Cue Association in Retrieval Processes: A Computational Account.” Cognitive Science 43 (12): e12800. https://doi.org/10.1111/cogs.12800.

Fedorenko, Evelina, Edward Gibson, and Douglas Rohde. 2006. “The Nature of Working Memory Capacity in Sentence Comprehension: Evidence Against Domain-Specific Working Memory Resources.” Journal of Memory and Language 54 (4): 541–53. https://doi.org/10.1016/j.jml.2005.12.006.

Gelman, Andrew. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models (Comment on Article by Browne and Draper).” Bayesian Analysis 1 (3): 515–34.

Gelman, Andrew, and John B. Carlin. 2014. “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9 (6): 641–51. https://doi.org/10.1177/1745691614551642.

Gelman, Andrew, Daniel P. Simpson, and Michael J. Betancourt. 2017. “The Prior Can Often Only Be Understood in the Context of the Likelihood.” Entropy 19 (10): 555. https://doi.org/10.3390/e19100555.

Gibson, Edward, and H.-H. Iris Wu. 2013. “Processing Chinese Relative Clauses in Context.” Language and Cognitive Processes 28 (1-2): 125–55. https://doi.org/10.1080/01690965.2010.536656.

Gordon, P. C., Randall Hendrick, and Marcus Johnson. 2001. “Memory Interference During Language Processing.” Journal of Experimental Psychology: Learning, Memory, and Cognition 27 (6): 1411–23. https://doi.org/10.1037/0278-7393.27.6.1411.

Grodner, Daniel, and Edward Gibson. 2005. “Consequences of the Serial Nature of Linguistic Input.” Cognitive Science 29: 261–90. https://doi.org/10.1207/s15516709cog0000_7.

Hammerly, Christopher, Adrian Staub, and Brian W. Dillon. 2019. “The Grammaticality Asymmetry in Agreement Attraction Reflects Response Bias: Experimental and Modeling Evidence.” Cognitive Psychology 110: 70–104. https://doi.org/10.1016/j.cogpsych.2019.01.001.

Higgins, Julian, and Sally Green. 2008. Cochrane Handbook for Systematic Reviews of Interventions. New York: Wiley-Blackwell.

Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Vol. 846. John Wiley & Sons.

Jäger, Lena A., Felix Engelmann, and Shravan Vasishth. 2017. “Similarity-Based Interference in Sentence Comprehension: Literature review and Bayesian meta-analysis.” Journal of Memory and Language 94: 316–39. https://doi.org/10.1016/j.jml.2017.01.004.

Jäger, Lena A., Daniela Mertzen, Julie A. Van Dyke, and Shravan Vasishth. 2020. “Interference Patterns in Subject-Verb Agreement and Reflexives Revisited: A Large-Sample Study.” Journal of Memory and Language 111. https://doi.org/10.1016/j.jml.2019.104063.

Just, Marcel Adam, and Patricia A. Carpenter. 1992. “A Capacity Theory of Comprehension: Individual Differences in Working Memory.” Psychological Review 99 (1): 122–49. https://doi.org/10.1037/0033-295X.99.1.122.

Kadane, Joseph, and Lara J. Wolfson. 1998. “Experiences in Elicitation.” Journal of the Royal Statistical Society: Series D (The Statistician) 47 (1): 3–19. https://doi.org/10.1111/1467-9884.00113.

Kass, Robert E., and Joel B. Greenhouse. 1989. “Investigating Therapies of Potentially Great Benefit: ECMO: Comment: A Bayesian Perspective.” Statistical Science 4 (4): 310–17. https://www.jstor.org/stable/2245831.

Levy, Dan. 2021. Maxims for Thinking Analytically: The Wisdom of Legendary Harvard Professor Richard Zeckhauser. Dan Levy.

Lewis, Richard L., and Shravan Vasishth. 2005. “An Activation-Based Model of Sentence Processing as Skilled Memory Retrieval.” Cognitive Science 29: 1–45. https://doi.org/10.1207/s15516709cog0000_25.

Lindley, Dennis V. 1991. Making Decisions. Second. John Wiley & Sons.

Lunn, David J., Chris Jackson, David J. Spiegelhalter, Nichola G. Best, and Andrew Thomas. 2012. The BUGS Book: A Practical Introduction to Bayesian Analysis. Vol. 98. CRC Press.

Mahajan, Sanjoy. 2010. Street-Fighting Mathematics: The Art of Educated Guessing and Opportunistic Problem Solving. Cambridge, MA: The MIT Press.

Mahajan, Sanjoy. 2014. The Art of Insight in Science and Engineering: Mastering Complexity. Cambridge, MA: The MIT Press.

Matuschek, Hannes, Reinhold Kliegl, Shravan Vasishth, R. Harald Baayen, and Douglas M. Bates. 2017. “Balancing Type I Error and Power in Linear Mixed Models.” Journal of Memory and Language 94: 305–15. https://doi.org/10.1016/j.jml.2017.01.001.

Nicenboim, Bruno, Shravan Vasishth, Felix Engelmann, and Katja Suckow. 2018. “Exploratory and Confirmatory Analyses in Sentence Processing: A case study of number interference in German.” Cognitive Science 42 (S4). https://doi.org/10.1111/cogs.12589.

Normand, S. L. T. 1999. “Tutorial in Biostatistics Meta-Analysis: Formulating, Evaluating, Combining, and Reporting.” Statistics in Medicine 18 (3): 321–59. https://doi.org/10.1002/(SICI)1097-0258(19990215)18:3<321::AID-SIM28>3.0.CO;2-P.

O’Hagan, Anthony, Caitlin E. Buck, Alireza Daneshkhah, J. Richard Eiser, Paul H. Garthwaite, David J. Jenkinson, Jeremy E. Oakley, and Tim Rakow. 2006. Uncertain Judgements: Eliciting Experts’ Probabilities. John Wiley & Sons.

Phillips, Colin, Matthew W. Wagers, and Ellen F. Lau. 2011. “Grammatical Illusions and Selective Fallibility in Real-Time Language Comprehension.” In Experiments at the Interfaces, 37:147–80. Emerald Bingley, UK.

Rayner, K. 1998. “Eye movements in reading and information processing: 20 years of research.” Psychological Bulletin 124 (3): 372–422. https://doi.org/10.1037/0033-2909.124.3.372.

Reali, Florencia, and Morten H. Christiansen. 2007. “Processing of Relative Clauses Is Made Easier by Frequency of Occurrence.” Journal of Memory and Language 57 (1): 1–23. https://doi.org/10.1016/j.jml.2006.08.014.

Simpson, Daniel P., Håvard Rue, Andrea Riebler, Thiago G. Martins, and Sigrunn H. Sørbye. 2017. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statistical Science 32 (1): 1–28. https://doi.org/10.1214/16-STS576.

Spiegelhalter, David J., Keith R. Abrams, and Jonathan P. Myles. 2004. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Vol. 13. John Wiley & Sons.

Sutton, Alexander J., Nicky J. Welton, Nicola Cooper, Keith R. Abrams, and A. E. Ades. 2012. Evidence Synthesis for Decision Making in Healthcare. Vol. 132. John Wiley & Sons.

Tetlock, Philip, and Dan Gardner. 2015. Superforecasting: The Art and Science of Prediction. Crown Publishers.

Vasishth, Shravan. 2015. “A Meta-Analysis of Relative Clause Processing in Mandarin Chinese Using Bias Modelling.” Master’s thesis, Sheffield, UK: School of Mathematics and Statistics, University of Sheffield. http://www.ling.uni-potsdam.de/~vasishth/pdfs/VasishthMScStatistics.pdf.

Vasishth, Shravan, Zhong Chen, Qiang Li, and Gueilan Guo. 2013. “Processing Chinese Relative Clauses: Evidence for the Subject-Relative Advantage.” PLoS ONE 8 (10): 1–14.

Vasishth, Shravan, and Felix Engelmann. 2022. Sentence Comprehension as a Cognitive Process: A Computational Approach. Cambridge, UK: Cambridge University Press. https://books.google.de/books?id=6KZKzgEACAAJ.

Vasishth, Shravan, Daniela Mertzen, Lena A. Jäger, and Andrew Gelman. 2018. “The Statistical Significance Filter Leads to Overoptimistic Expectations of Replicability.” Journal of Memory and Language 103: 151–75. https://doi.org/10.1016/j.jml.2018.07.004.

Vasishth, Shravan, Bruno Nicenboim, Felix Engelmann, and Frank Burchert. 2019. “Computational Models of Retrieval Processes in Sentence Processing.” Trends in Cognitive Sciences 23: 968–82. https://doi.org/10.1016/j.tics.2019.09.003.

Vasishth, Shravan, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. 2022. “Sample Size Determination for Bayesian Hierarchical Models Commonly Used in Psycholinguistics.” Computational Brain and Behavior.

Von Baeyer, Hans Christian. 1988. “How Fermi Would Have Fixed It.” The Sciences 28 (5): 2–4.

Wagenmakers, Eric-Jan, and Scott D. Brown. 2007. “On the Linear Relation Between the Mean and the Standard Deviation of a Response Time Distribution.” Psychological Review 114 (3): 830.


  1. This discussion reuses text from Vasishth, Yadav, et al. (2022).↩︎

  2. See Vasishth et al. (2013); Jäger et al. (2020); Nicenboim, Vasishth, et al. (2018); and Vasishth et al. (2018) for detailed discussion of this point in the context of psycholinguistics.↩︎