Introduction to Bayesian Data Analysis for Cognitive Science
Nicenboim, Schad, Vasishth
https://bruno.nicenboim.me/bayescogsci/
https://github.com/bnicenboim/bayescogsci
Be sure to read the textbook’s chapter 1 before watching this lecture.
This lecture covers some basic ideas in frequentist statistics that everyone should know. These ideas are very useful as background knowledge when studying Bayesian methods.
For the normal distribution, where \(X \sim N(\mu,\sigma)\), and given \(i=1,\dots, n\) independent data points, we can get MLEs of \(\mu\) and \(\sigma\) by computing:
\[\begin{equation} \hat \mu = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x} \end{equation}\]
and
\[\begin{equation} \hat \sigma ^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2 \end{equation}\]
You will sometimes see the “unbiased” estimate (and this is what R’s var() and sd() functions compute), but for large sample sizes the difference is not important:
\[\begin{equation} \hat \sigma ^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2 = s^2 \end{equation}\]
I use \(\bar{x}\) and \(s\) to represent the estimates of the mean and standard deviation from a particular data-set. \(\hat\mu\) and \(\hat\sigma\) are the formulas (analytically derived) for estimating the mean and standard deviation, and are called the estimators.
The significance of these MLEs is that, having assumed a particular underlying pdf, we can estimate the (unknown) parameters (the mean and variance/standard deviation) of the distribution that generated our particular data.
This leads us to the distributional properties of the mean under (hypothetical) repeated sampling.
For large enough sample sizes, the sampling distribution of the means will be approximately normal, regardless of the underlying distribution (as long as this distribution has a mean and variance defined for it).
The standard deviation of this sampling distribution of means is called the standard error: \(SE = \frac{\sigma}{\sqrt{n}}\)
When estimated from data, we will write
\(SE = \frac{s}{\sqrt{n}}\)
I say estimated because we are estimating SE using an estimate of \(\sigma\).
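As a quick check of this formula (a simulation sketch; the values of \(\mu\), \(\sigma\), and \(n\) below are arbitrary choices), the standard deviation of repeatedly sampled means should be close to \(\sigma/\sqrt{n}\):
nsim<-10000
n<-40
mu<-500
sigma<-100
## mean of each of nsim simulated samples:
means<-replicate(nsim,mean(rnorm(n,mean=mu,sd=sigma)))
## simulated SE:
sd(means)
## theoretical SE:
sigma/sqrt(n)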
The estimated standard error allows us to define a so-called 95% confidence interval:
\[\begin{equation} \bar{x} \pm 1.96 SE \end{equation}\]
So, for a given sample mean, we define a 95% confidence interval as follows:
\[\begin{equation} \bar{x} \pm 1.96 \frac{s}{\sqrt{n}} \end{equation}\]
I usually just write:
\[\begin{equation} \bar{x} \pm 2 \frac{s}{\sqrt{n}} \end{equation}\]
Example with simulated data:
n<-100
x<-rnorm(n,mean=500,sd=100)
mu_hat<-mean(x)
hat_sigma<-sd(x)
## lower bound:
mu_hat-(2*hat_sigma/sqrt(n))
## [1] 482.3328
## upper bound:
mu_hat+(2*hat_sigma/sqrt(n))
## [1] 520.8799
If you take repeated samples from a particular distribution, and compute the CI each time, 95% of those repeatedly computed CIs will contain the true population mean.
nsim<-100
mu<-0
sigma<-1
lower<-rep(NA,nsim)
upper<-rep(NA,nsim)
for(i in 1:nsim){
x<-rnorm(n,mean=mu,sd=sigma)
lower[i]<-mean(x) - 2 * sd(x)/sqrt(n)
upper[i]<-mean(x) + 2 * sd(x)/sqrt(n)
}
## check how many CIs contain mu:
CIs<-ifelse(lower<mu & upper>mu,1,0)
table(CIs)
## CIs
## 0 1
## 5 95
## approx. 95% of the CIs contain true mean:
table(CIs)[2]/sum(table(CIs))
## 1
## 0.95
Graphical visualization:
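One way to draw the repeatedly computed CIs (a minimal base-R sketch using the lower and upper vectors computed above; the plotting choices are my own):
## draw each CI as a vertical segment; CIs that miss the
## true mean (mu = 0) are drawn in red:
plot(1:nsim,rep(mu,nsim),type="n",
     ylim=range(c(lower,upper)),
     xlab="sample id",ylab="CI")
misses<-!(lower<mu & upper>mu)
segments(1:nsim,lower,1:nsim,upper,
         col=ifelse(misses,"red","black"))
abline(h=mu,lty=2)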
There is a correspondence between the correctly computed frequentist CI and the hypothesis testing procedure (see below).
Although some people use Bayesian credible intervals to carry out hypothesis tests (I have also done this), this is technically not correct because we do not automatically know the frequentist properties of the Bayesian credible interval. To carry out hypothesis tests in a Bayesian approach, Bayes factors or k-fold cross validation are needed. See the textbook chapters on model comparison for more details.
If you have repeated measures/dependent data, then the correct CI is computed after aggregation such that you have only one data point per subject per condition.
Consider these repeated measures data:
library(bcogsci)
data("df_gg05_rc")
head(df_gg05_rc)
## subj item condition RT residRT qcorrect experiment
## 1 1 1 objgap 320 -21.39 0 tedrg3
## 2 1 2 subjgap 424 74.66 1 tedrg2
## 3 1 3 objgap 309 -40.34 0 tedrg3
## 4 1 4 subjgap 274 -91.24 1 tedrg2
## 5 1 5 objgap 333 -8.39 1 tedrg3
## 6 1 6 subjgap 266 -87.32 1 tedrg2
## 8 data points per subject per condition:
t(xtabs(~subj+condition,df_gg05_rc))
## subj
## condition 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## objgap 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## subjgap 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## subj
## condition 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
## objgap 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## subjgap 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
Incorrectly computed CIs per condition:
(means<-with(df_gg05_rc,tapply(RT,condition,mean)))
## objgap subjgap
## 471.3601 369.0744
(sds<-with(df_gg05_rc,tapply(RT,condition,sd)))
## objgap subjgap
## 464.4060 177.2674
## what should n be?
(n<- length(unique(df_gg05_rc$subj)))
## [1] 42
## wrong lower bound for objgap condition:
means[1]-2*sds[1]/sqrt(n)
## objgap
## 328.0413
## wrong upper bound:
means[1]+2*sds[1]/sqrt(n)
## objgap
## 614.6789
Correctly computed CIs:
agg_gg05<-aggregate(RT ~ subj + condition, data=df_gg05_rc, FUN=mean)
t(xtabs(~subj+condition,agg_gg05))
## subj
## condition 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## objgap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## subjgap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## subj
## condition 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
## objgap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## subjgap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## Correct CIs:
(means<-with(agg_gg05,tapply(RT,condition,mean)))
## objgap subjgap
## 471.3601 369.0744
(sds<-with(agg_gg05,tapply(RT,condition,sd)))
## objgap subjgap
## 259.8924 117.6722
## what should n be?
(n<- length(unique(agg_gg05$subj)))
## [1] 42
## correct lower bound for objgap condition:
means[1]-2*sds[1]/sqrt(n)
## objgap
## 391.1556
## correct upper bound:
means[1]+2*sds[1]/sqrt(n)
## objgap
## 551.5647
Suppose we have a random sample of size \(n\), and the data come from a \(N(\mu,\sigma)\) distribution, and the data are independent and identically distributed (for now).
We can estimate the sample mean \(\bar{x}=\hat \mu\) and sample standard deviation \(s=\hat\sigma\), which in turn allows us to estimate the sampling distribution of the mean under (hypothetical) repeated sampling (thanks to the central limit theorem):
\[\begin{equation} N(\bar{x},\frac{s}{\sqrt{n}}) \end{equation}\]
The NHST approach is to set up a null hypothesis that \(\mu\) has some fixed value. For example:
\[\begin{equation} H_0: \mu = \mu_0 = 0 \end{equation}\]
This amounts to assuming that the true distribution of sample means is (approximately) normally distributed and centered at 0, with the standard error estimated from the data.
The intuitive idea is that if the sample mean is “near” the hypothesized mean, the data are consistent with the null hypothesis; if it is “far” from the hypothesized mean, the data count as evidence against the null hypothesis.
We formalize “near” and “far” by determining the value of the number \(t\), which represents how many standard errors the sample mean is distant from the hypothesized mean:
\[\begin{equation} t \times SE = \bar{x} - \mu \end{equation}\]
The above equation quantifies the distance of sample mean from \(\mu\) in SE units.
So, given a sample and null hypothesis mean \(\mu\), we can compute the quantity:
\[\begin{equation} t = \frac{\bar{x} - \mu}{SE} \end{equation}\]
We will call this the observed t-value.
The random variable T:
\[\begin{equation} T = \frac{\bar{X} - \mu}{SE} \end{equation}\]
has a t-distribution, which is defined in terms of the sample size \(n\). We will express this as: \(T \sim t(n-1)\).
Note also that, as \(n\) approaches infinity, \(T\sim N(0,1)\).
Thus, given a sample size \(n\), and given our null hypothesis, we can draw the t-distribution corresponding to the null hypothesis.
For large \(n\), we could even use N(0,1), although it is traditional to always use the t-distribution no matter how large \(n\) is.
Compare the t-distribution t(20) (solid line) with Normal(0,1) (broken line).
x<-seq(-4,4,by=0.01)
plot(x,dt(x,df=20),type="l")
lines(x,dnorm(x),lty=2)
Now compare the t-distribution t(1000) (solid line) with Normal(0,1) (broken line).
plot(x,dt(x,df=1000),type="l")
lines(x,dnorm(x),lty=2,col="red")
So, the null hypothesis testing procedure is:
- Compute the observed t-value:
\[\begin{equation} t_{observed}=\frac{\bar{x}-\mu}{s/\sqrt{n}} \end{equation}\]
- Reject the null hypothesis if the observed t-value is large (defined below).
So, for large sample sizes, if \(\mid t\mid >2\) (approximately), we can reject the null hypothesis.
For a smaller sample size \(n\) (say 42), you can compute the exact critical t-value:
n<-42
qt(0.025,df=n-1)
## [1] -2.019541
This is the critical t-value on the left-hand side of the t-distribution. The corresponding value on the right-hand side is:
qt(0.975,df=n-1)
## [1] 2.019541
Their absolute values are of course identical (the distribution is symmetric when the t-distribution is centered on 0).
The p-value is the probability of observing a t-value at least as extreme as the one you observed, under the assumption that the null hypothesis is true.
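For example, given an observed t-value (the value 2.5 below is purely illustrative) and \(n=42\), the two-sided p-value can be computed by hand:
n<-42
t_obs<-2.5 ## an illustrative observed t-value
## two-sided p-value under t(n-1):
2*(1-pt(abs(t_obs),df=n-1))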
The only way to establish whether an effect is “real” or not is by actual replication (holds for Bayes as well).
Given iid data (Note: aggregated!):
OR<-subset(agg_gg05,condition=="objgap")$RT
SR<-subset(agg_gg05,condition=="subjgap")$RT
diff<-OR-SR
## one sample t-test:
t.test(diff)
##
## One Sample t-test
##
## data: diff
## t = 3.1093, df = 41, p-value = 0.003404
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 35.85024 168.72119
## sample estimates:
## mean of x
## 102.2857
## paired t-test:
t.test(OR,SR,paired=TRUE)
##
## Paired t-test
##
## data: OR and SR
## t = 3.1093, df = 41, p-value = 0.003404
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 35.85024 168.72119
## sample estimates:
## mean difference
## 102.2857
You should know when to aggregate data to meet the one sample (=paired) t-test’s assumptions.
A very common mistake is to forget or neglect to aggregate the data. The following is wrong:
OR<-subset(df_gg05_rc,condition=="objgap")$RT
SR<-subset(df_gg05_rc,condition=="subjgap")$RT
diff<-OR-SR
## one sample t-test (WRONG):
t.test(diff)
##
## One Sample t-test
##
## data: diff
## t = 3.9997, df = 335, p-value = 7.81e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 51.98059 152.59084
## sample estimates:
## mean of x
## 102.2857
## paired t-test (WRONG):
t.test(OR,SR,paired=TRUE)
##
## Paired t-test
##
## data: OR and SR
## t = 3.9997, df = 335, p-value = 7.81e-05
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 51.98059 152.59084
## sample estimates:
## mean difference
## 102.2857
Look at the degrees of freedom: they are wrong (we have only 42 subjects, so the df should have been 41).
Power, which is calculated before a study is conducted, is a function of three variables: the assumed true effect size, the standard deviation, and the sample size.
A quick way to get a ballpark estimate of power is by using the
power.t.test function in R.
Example: what sample size do we need in a standard within-subjects design (like the two-condition relative clause study mentioned above) to reach 80% power if the true effect size were 15 ms, with a standard deviation of 150 ms?
power.t.test(n=NULL,
delta=15,
sd=150,
sig.level=0.05,
power=0.80,
alternative="two.sided",
type="one.sample",
strict=TRUE)
##
## One-sample t test power calculation
##
## n = 786.8089
## delta = 15
## sd = 150
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
In this example, something close to 800 subjects would be needed to achieve 80% power.
For more complex designs, we use simulation to compute power; see my frequentist textbook draft for more details.
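To give a flavor of the simulation approach for the simple paired design above (a minimal sketch; the function name compute_power and its default values are my own choices):
compute_power<-function(n,effect=15,stddev=150,
                        nsim=1000,alpha=0.05){
  sig<-rep(NA,nsim)
  for(i in 1:nsim){
    ## simulate n difference scores under the assumed true effect:
    d<-rnorm(n,mean=effect,sd=stddev)
    sig[i]<-t.test(d)$p.value<alpha
  }
  ## proportion of significant results = estimated power:
  mean(sig)
}
## e.g., estimated power with 800 subjects:
## compute_power(n=800)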
If the true effect size is believed to be \(D\), then, apart from statistical power, we can compute the Type M error rate, which is defined as follows:
Type M error: the expectation of the ratio of the absolute magnitude of the estimated effect to the hypothesized true effect size, given that the result is significant. Gelman and Carlin also call this the exaggeration ratio, which is perhaps more descriptive than “Type M error.”
Here’s a visualization of Type M error in action, under low statistical power.
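A minimal simulation sketch of the exaggeration ratio (the assumed true effect of 15 ms, standard deviation of 150 ms, and n = 20 below are illustrative values that give low power):
nsim<-10000
n<-20
effect<-15
stddev<-150
est<-pval<-rep(NA,nsim)
for(i in 1:nsim){
  d<-rnorm(n,mean=effect,sd=stddev)
  est[i]<-mean(d)
  pval[i]<-t.test(d)$p.value
}
## mean absolute significant estimate relative to the true effect
## (the exaggeration ratio, i.e., Type M error):
mean(abs(est[pval<0.05]))/effect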
With three independent tests, each with a true null hypothesis, the theoretical probability of rejecting at least one test incorrectly is \(1-0.95^3=0.143\).
Source: Gouvea et al 2010. Language and Cognitive Processes.
Source: Liversedge et al., 2024. Cognition.
All my own work, published during the period 2002 to 2016, has this Type I inflation problem.
Source: Vasishth and Lewis, 2006. Language.
If we do a single t-test when the null is actually true, our Type I error is 0.05:
nsim<-1000
pvals<-rep(NA,nsim)
for(i in 1:nsim){
y<-rnorm(10,mean=0,sd=1)
pvals[i]<-t.test(y)$p.value
}
mean(pvals<0.05)
## [1] 0.06
If we do two t-tests when the null in all the analyses is actually true, our Type I error is no longer 0.05:
nsim<-1000
ntests<-2
pvals<-matrix(rep(NA,nsim*ntests),ncol=ntests)
for(j in 1:ntests){
for(i in 1:nsim){
y<-rnorm(10,mean=0,sd=1)
pvals[i,j]<-t.test(y)$p.value
}
}
head(pvals)
## [,1] [,2]
## [1,] 0.2227763 0.04618541
## [2,] 0.7433136 0.97112086
## [3,] 0.8559875 0.58501021
## [4,] 0.3284245 0.90432025
## [5,] 0.8683764 0.69789140
## [6,] 0.4849970 0.86804439
What is the probability that at least one of the two tests comes out significant despite the null being true in both cases?
sig<-rep(NA,nsim)
for(i in 1:nsim){
if(pvals[i,1]<0.05 | pvals[i,2]<0.05){
sig[i]<-1
} else {
sig[i]<-0
}
}
## The probability that at least
## one t-test comes out significant:
mean(sig>0)
## [1] 0.103
This inflation of Type I error is called the multiple comparisons problem.
The more t-tests/F-tests you do, the higher the Type I error.
Let’s write a function that computes the Type I error when we do n hypothesis tests, where n can be 1,2,3,…
The full function is not visible here (see the R source code).
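A minimal sketch of what such a function might look like (a reconstruction consistent with the results shown below, not necessarily identical to the original source code):
computeTypeI<-function(ntests=1,nsim=1000,n=10,alpha=0.05){
  any_sig<-rep(NA,nsim)
  for(i in 1:nsim){
    pvals<-rep(NA,ntests)
    for(j in 1:ntests){
      ## simulate null data and store the p-value:
      y<-rnorm(n,mean=0,sd=1)
      pvals[j]<-t.test(y)$p.value
    }
    ## "significant" if at least one of the ntests p-values < alpha:
    any_sig[i]<-any(pvals<alpha)
  }
  mean(any_sig)
}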
Let’s test the function with some ntest values.
## ntests=1
computeTypeI(ntests=1)
## [1] 0.048
## ntests=2
computeTypeI(ntests=2)
## [1] 0.105
## ntests=3
computeTypeI(ntests=3)
## [1] 0.142
Let’s plot a figure showing how Type I error will inflate as we increase the number of tests:
n<-150
inflation<-rep(NA,n)
for(i in 1:n){
inflation[i]<-computeTypeI(ntests=i)
}
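The figure itself can be drawn, for example, as follows (the plotting choices are my own):
plot(1:n,inflation,type="l",
     xlab="number of tests",
     ylab="Type I error")
abline(h=0.05,lty=2)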
So, once you have done some 100 statistical tests, you are basically guaranteed to obtain some significant effect or the other, even if the null were in fact true.
If working in the frequentist framework, just do a Bonferroni correction. If you do n tests, the new \(\alpha\) is 0.05/n.
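For example (the p-values below are made up for illustration), with three tests each p-value is compared against \(0.05/3\); equivalently, the p-values can be Bonferroni-adjusted with p.adjust and compared against 0.05:
pvals<-c(0.01,0.04,0.30) ## illustrative p-values
## compare against the corrected alpha:
pvals < 0.05/length(pvals)
## or adjust the p-values themselves:
p.adjust(pvals,method="bonferroni")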
In the classical Bayesian framework, there is no concept of Type I/II error. There, the focus is rather on uncertainty quantification. More on that later.
Of course, one can think about the frequentist properties of Bayesian hypothesis tests; in that approach, you will run into the same Type I error inflation problems as in the classical frequentist approach.
We consider the case where we have two conditions (e.g., subject and object relatives), and a repeated measures design. The dependent measure is reading times in milliseconds.
If you are not familiar with relative clauses, just imagine doing an experiment with two types of sentences, an easy-to-read sentence type and a difficult-to-read sentence type, and measuring reading time difference between the hard and easy conditions.
The alphabetically first condition level is coded 0, and the other condition level is coded 1. E.g., if condition labels are objgap and subjgap, then objgap is coded 0 and subjgap 1. You can change this with the command (not run):
## this code has not been run:
## code subj as 0 and obj as 1:
df_gg05_rc$condition<-
factor(df_gg05_rc$condition,
levels=c("subjgap","objgap"))
In mathematical form, the linear model is:
\[\begin{equation} rt = \beta_0 + \beta_1 condition + \epsilon \end{equation}\]
where condition is coded 0 for object relatives and 1 for subject relatives (R’s default treatment contrasts, shown below), so that \(\beta_0\) is the mean of the objgap condition and \(\beta_1\) is the difference between the subjgap and objgap means:
agg_gg05$condition<-factor(agg_gg05$condition)
contrasts(agg_gg05$condition)
## subjgap
## objgap 0
## subjgap 1
## this model is wrong for these data:
m<-lm(RT ~ condition, agg_gg05)
round(summary(m)$coefficients,2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 471.36 31.13 15.14 0.00
## conditionsubjgap -102.29 44.02 -2.32 0.02
The null hypothesis of interest is that \(\beta_1\), the difference in means between the two relative clause types, is zero:
\(H_0: \beta_1 = 0\)
We will make a distinction between the unknown true mean \(\beta_0, \beta_1\) and the estimated mean from the data \(\hat\beta_0, \hat\beta_1\). These estimated means are maximum likelihood estimates of the parameters.
Alternatively, we can code objgap as \(+1\) and subjgap as \(-1\) (or vice versa).
Equivalently: objgap as \(+1/2\) and subjgap as \(-1/2\) (or vice versa).
With \(\pm 1\) coding:
agg_gg05$so<-ifelse(agg_gg05$condition=="objgap",
1,-1)
## this model is wrong:
m_sum<-lm(RT~so,agg_gg05)
round(summary(m_sum)$coefficients,2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 420.22 22.01 19.09 0.00
## so 51.14 22.01 2.32 0.02
This kind of parameterization is called sum-to-zero contrast or more simply sum contrast coding. This is the coding we will use.
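An equivalent way to obtain \(\pm 1\) coding (a sketch, applied to a copy of the factor so that the data used below are unchanged) is R’s built-in contr.sum:
cond<-factor(agg_gg05$condition)
contrasts(cond)<-contr.sum(2)
contrasts(cond)
## the first level (objgap) is coded +1, the second (subjgap) -1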
The null hypothesis for the slope is
\[\begin{equation} H_0: (\mathbf{+1})\times \mu_{obj} + (\mathbf{-1})\times \mu_{subj} = 0 \end{equation}\]
or:
\[\begin{equation} H_0: \mu_{obj} = \mu_{subj} \end{equation}\]
The sum contrasts refer to the \(\pm 1\) terms in the null hypothesis. The estimated model is:
\[\begin{equation} rt = 420 + 51\times so \end{equation}\]
where \(so\) is \(+1\) for object relatives and \(-1\) for subject relatives.
Estimated object relative reading times:
\[\begin{equation} rt = 420\mathbf{\times 1} + 51\mathbf{\times 1} \end{equation}\]
Estimated subject relative reading times:
\[\begin{equation} rt = 420\mathbf{\times 1} + 51\mathbf{\times (-1)} \end{equation}\]
The \(\epsilon\) has been dropped here because the mean of the random variable \(\epsilon\) is 0.
The model is:
\[\begin{equation} rt = \beta_0 + \beta_1 \times so + \epsilon \hbox{ where } \epsilon\sim Normal(0,\sigma) \end{equation}\]
It is an assumption of the linear model that the residuals are (approximately) normally distributed.
This assumption is not crucial if our goal is only to estimate the parameters.
However, this assumption is crucial if we are doing hypothesis testing.
We can check this assumption in R. This model is wrong for these data.
m<-lm(RT ~ condition, agg_gg05)
car::qqPlot(residuals(m))
## [1] 37 33
If the residuals were approximately normally distributed, the quantiles of the standard normal and the residuals would align, leading to a diagonal line angled at 45 degrees (not the case here).
A log-transform would improve the situation here:
m<-lm(log(RT) ~ condition, agg_gg05)
car::qqPlot(residuals(m))
## [1] 37 33
The correct model for the aggregated data:
library(lme4)
## Loading required package: Matrix
m1<-lmer(RT ~ condition + (1|subj), agg_gg05)
## compare with the paired t-test result above!
## They are exactly the same.
summary(m1)$coefficients
## Estimate Std. Error t value
## (Intercept) 471.3601 31.12777 15.142753
## conditionsubjgap -102.2857 32.89632 -3.109336
OR<-subset(agg_gg05,condition=="objgap")$RT
SR<-subset(agg_gg05,condition=="subjgap")$RT
diff<-SR-OR
## one sample t-test:
t.test(diff)
##
## One Sample t-test
##
## data: diff
## t = -3.1093, df = 41, p-value = 0.003404
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -168.72119 -35.85024
## sample estimates:
## mean of x
## -102.2857
The linear mixed model with varying intercepts (on the aggregated data) is exactly the one-sample (paired) t-test.
Residuals check:
car::qqPlot(residuals(m1))
## [1] 37 33
The correct way to analyze these data is to fit a linear mixed model to the unaggregated data, using log-transformed reading times:
df_gg05_rc$so<-ifelse(df_gg05_rc$condition=="objgap",
1,-1)
m2<-lmer(log(RT) ~ so +
(1+so|subj) + (1+so|item),
df_gg05_rc)
## boundary (singular) fit: see help('isSingular')
summary(m2)$coefficients
## Estimate Std. Error t value
## (Intercept) 5.88305598 0.05202442 113.08258
## so 0.06201673 0.02466207 2.51466
The singular fit message is due to data sparsity: there is not enough information to estimate the correlation between the varying intercepts and varying slopes for items.
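One common option in this situation (a sketch; whether it is appropriate depends on the modeling goals) is to drop the item-level intercept-slope correlation using lme4’s double-bar syntax; m3 below is just an illustrative name:
## suppress the correlation between varying intercepts and slopes for items:
m3<-lmer(log(RT) ~ so +
           (1+so|subj) + (1+so||item),
         df_gg05_rc)
summary(m3)$coefficients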
For a full and formal review of linear models (including linear mixed modeling), see: