Introduction to Bayesian Data Analysis for Cognitive Science
Nicenboim, Schad, Vasishth
Be sure to read the textbook’s chapter 1 in addition to watching this lecture.
It only takes about a day to understand these materials, but the content here will positively impact your ability to carry out statistical modeling and data analysis.
A random variable \(X\) is a function \(X : \Omega \rightarrow \mathbb{R}\) that associates to each outcome \(\omega \in \Omega\) exactly one number \(X(\omega) = x\).
\(S_X\), the support of \(X\), is the set of all possible values that \(X\) can take; i.e., \(x \in S_X\).
An example of a discrete random variable: keep tossing a coin again and again until you get a Heads.
A second example of a discrete random variable: tossing a coin once.
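The second example is the Bernoulli random variable discussed below. The first corresponds to a geometric random variable; here is a minimal sketch in R (note that rgeom counts the number of tails before the first heads, so we add 1 to get the number of tosses):
## number of tosses needed to obtain the first heads, in 5 simulated runs:
rgeom(n = 5, prob = 0.5) + 1
## (output is random and varies from run to run)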
Every discrete (continuous) random variable X has associated with it a probability mass (density) function (PMF, PDF).
Thinking just about discrete random variables for now:
\[\begin{equation} p_X : S_X \rightarrow [0, 1] \end{equation}\]
defined by
\[\begin{equation} p_X(x) = \hbox{Prob}(X(\omega) = x), x \in S_X \end{equation}\]
Example of a PMF: a random variable \(X\) representing tossing a coin once.
The cumulative distribution function (CDF) \(F(X\leq x)\) gives the cumulative probability of observing all the events \(X\leq x\).
\[\begin{equation} \begin{split} F(x=1) & = \hbox{Prob}(X \leq 1)\\ & = \sum_{x=0}^{1} p_X(x) \\ & = p_X(x=0) + p_X(x=1) \\ & = 1 \end{split} \end{equation}\]
\[\begin{equation} \begin{split} F(x=0) & = \hbox{Prob}(X \leq 0)\\ & = \sum_{x=0}^{0} p_X(x)\\ & = p_X(x=0) \\ & = 0.5 \end{split} \end{equation}\]
Do 10 coin-tossing experiments, each with one trial. The probability of heads (which I call \(\theta\) below) is 0.5:
extraDistr::rbern(n = 10, prob = 0.5)
## [1] 1 1 1 0 0 1 0 0 0 0
The probability mass function: Bernoulli
\[p_X(x)= \theta^x (1-\theta)^{(1-x)}\]
where x can have values 0, 1.
What’s the probability of tails or of heads? The d-family of functions:
extraDistr::dbern(0, prob = 0.5)
## [1] 0.5
extraDistr::dbern(1, prob = 0.5)
## [1] 0.5
Notice that these probabilities sum to 1.
The cumulative probability distribution function: the p-family of functions:
\[F(x=1) = Prob(X \leq 1) = \sum_{x=0}^{1} p_X(x) = 1\]
extraDistr::pbern(1, prob = 0.5)
## [1] 1
\[F(x=0) = Prob(X \leq 0) = \sum_{x=0}^{0} p_X(x) = 0.5\]
extraDistr::pbern(0, prob = 0.5)
## [1] 0.5
Bernoulli PMF (the special case of the binomial with a single trial, i.e., size = 1 in R):
\[\theta^{x}(1-\theta)^{1-x}\]
where \[S_X=\{0,1\}\]
Binomial PMF
\[{n \choose x} \theta^{x}(1-\theta)^{n-x}\]
where \[S_X=\{0,1,\dots,n\}\]
choose(10, 2)
## [1] 45
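To connect the binomial PMF formula to R (the values \(n=10\), \(x=2\), \(\theta=0.5\) are my own choice of example), the formula computed by hand agrees with dbinom, which appears in step 2 below:
## the PMF formula by hand: choose(n, x) * theta^x * (1 - theta)^(n - x)
choose(10, 2) * 0.5^2 * (1 - 0.5)^(10 - 2)
## [1] 0.04394531
## the same probability using dbinom:
dbinom(2, size = 10, prob = 0.5)
## [1] 0.04394531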
1. Generate random data: rbinom
Example: 10 separate experiments, each with 1 trial:
rbinom(n = 10, size = 1, prob = 0.5)
## [1] 1 0 1 0 0 0 0 0 0 0
## equivalent to: rbern(10,0.5)
Example: 10 separate experiments, each with 10 trials:
rbinom(n = 10, size = 10, prob = 0.5)
## [1] 5 5 7 6 4 2 5 3 5 7
2. Compute probabilities of particular events (0,1,…,10 successes when n=10): dbinom
x <- 0:10
probs <- round(dbinom(x, size = 10, prob = 0.5), 3)
## display the number of successes and their probabilities:
data.frame(x, probs)
## x probs
## 1 0 0.001
## 2 1 0.010
## 3 2 0.044
## 4 3 0.117
## 5 4 0.205
## 6 5 0.246
## 7 6 0.205
## 8 7 0.117
## 9 8 0.044
## 10 9 0.010
## 11 10 0.001
3. Compute cumulative probabilities: pbinom
4. Compute quantiles using the inverse of the CDF: qbinom
probs <- pbinom(0:10, size = 10, prob = 0.5)
qbinom(probs, size = 10, prob = 0.5)
## [1] 0 1 2 3 4 5 6 7 8 9 10
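A standalone example of step 3 (the particular query, at most 5 successes in 10 trials, is my own choice):
## cumulative probability of observing 0 to 5 successes in 10 trials:
pbinom(5, size = 10, prob = 0.5)
## [1] 0.6230469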
These four functions are the d-p-q-r family of functions, and are available for all the distributions available in R (e.g., Poisson, geometric, normal, beta, uniform, gamma, exponential, Cauchy, etc.).
In coin tosses, H and T are discrete possible outcomes; next, we turn to random variables with continuous outcomes.
\[\begin{equation} X \sim f(\cdot) \end{equation}\]
means that the random variable \(X\) is assumed to have PDF \(f(\cdot)\).
For example, if we say that \(X\sim Normal(\mu,\sigma)\), we are assuming that the PDF is
\[\begin{equation} f(x\mid \mu, \sigma)= \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \end{equation}\]
where \(-\infty<x<+\infty\)
We can truncate the normal distribution such that \(S_X\) is bounded between some lower bound and/or upper bound; this comes later.
The PDF below is associated with the normal distribution that you are probably familiar with:
\[\begin{equation} f(x\mid \mu, \sigma)= \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \end{equation}\]
where \(-\infty<x<+\infty\).
In the discrete RV case, we could compute the probability of an event occurring:
extraDistr::dbern(x = 1, prob = 0.5)
## [1] 0.5
dbinom(x = 2, size = 10, prob = 0.5)
## [1] 0.04394531
The cumulative distribution function (CDF) is
\[\begin{equation} P(X<u) = F(X<u) =\int_{-\infty}^{u} f(x)\, dx \end{equation}\]
In the \(Normal(\mu=0,\sigma=1)\) case, and more generally for any \(Normal(\mu,\sigma)\), this CDF gives the area under the PDF from \(-\infty\) up to some value \(u\).
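As a sketch of what this integral means computationally (the standard normal and \(u=2\) are my own choice of example):
## area under the standard normal PDF from -Inf up to u = 2:
integrate(dnorm, lower = -Inf, upper = 2)$value
## [1] 0.9772499
## the same probability via the CDF function pnorm:
pnorm(2)
## [1] 0.9772499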
The PDF of the normal again:
\[\begin{equation} f(x\mid \mu, \sigma)= \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \end{equation}\]
This part of \(f(x\mid \mu,\sigma)\) (call it \(g(x)\)) is the "kernel" of the normal PDF:
\[\begin{equation} g(x\mid \mu, \sigma) = \exp\left[-\frac{(x-\mu)^2}{2 \sigma^2}\right] \end{equation}\]
For the above function \(g(x)\), the area under the curve is not 1. We can verify this by computing \(\int g(x)\, dx\):
g <- function(x, mu = 0, sigma = 1) {
exp((-(x - mu)^2 / (2 * (sigma^2))))
}
integrate(g, lower = -Inf, upper = +Inf)$value
## [1] 2.506628
Rescaling \(g(x)\) by a constant doesn’t change the shape of the curve, of course; it only changes the area under it.
In simple examples like the one shown here, given the kernel of some PDF like \(g(x)\), we can figure out the normalizing constant by solving for \(k\) in:
\[\begin{equation} k \int g(x)\, dx = 1 \end{equation}\]
Solving for \(k\) just amounts to computing:
\[\begin{equation} k = \frac{1}{\int g(x)\, dx} \end{equation}\]
So, in our example above,
(k<-1/integrate(g, lower = -Inf, upper = +Inf)$value)
## [1] 0.3989423
The above number is just \(\frac{1}{\sqrt{2 \pi \sigma^2}}\), where \(\sigma=1\):
1/(sqrt(2*pi*1))
## [1] 0.3989423
Once we include the normalizing constant, the area under the curve in \(g(x)\) sums to 1:
k * integrate(g, lower = -Inf, upper = +Inf)$value
## [1] 1
We will see the practical implication of this when we move on to chapter 2 of the textbook.
In the continuous case, we also have this family of d-p-q-r functions. In the normal distribution:
1. Generate random data using rnorm
round(rnorm(5, mean = 0, sd = 1),3)
## [1] -1.849 0.018 0.188 0.891 -0.439
For the standard normal, mean=0, and sd=1 can be omitted (these are the default values in R).
round(rnorm(5),3)
## [1] 2.733 0.679 0.918 -0.501 -0.011
2. Compute probabilities using CDF: pnorm
Some examples of usage:
pnorm(2)
## [1] 0.9772499
pnorm(2, lower.tail = FALSE)
## [1] 0.02275013
3. Compute quantiles: qnorm
qnorm(0.9772499)
## [1] 2.000001
4. Compute the probability density: dnorm
dnorm(2)
## [1] 0.05399097
Note: In the continuous case, this is a density, the value \(f(x)\), not a probability. Cf. the discrete examples dbern and dbinom, which give probabilities of a point value \(x\).
The likelihood function refers to the PMF \(p(k|n,\theta)\) (here, the binomial PMF for \(k\) successes in \(n\) trials), treated as a function of the parameter \(\theta\).
For example, suppose that we record \(n=10\) trials, and observe \(k=7\) successes. The likelihood function is:
\[\begin{equation} \mathcal{L}(\theta|k=7,n=10)= \binom{10}{7} \theta^{7} (1-\theta)^{10-7} \end{equation}\]
If we now plot the likelihood function for all possible values of \(\theta\) ranging from \(0\) to \(1\), we get the plot shown below.
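A minimal R sketch of this computation (the grid of \(\theta\) values is my own choice); the value of \(\theta\) that maximizes the likelihood, the maximum likelihood estimate (MLE), is \(k/n = 0.7\):
theta <- seq(0, 1, by = 0.001)
lik <- dbinom(7, size = 10, prob = theta)
## plot(theta, lik, type = "l") would draw the likelihood function
## the value of theta at which the likelihood is highest (the MLE):
theta[which.max(lik)]
## [1] 0.7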
The MLE, computed from a particular sample of data, need not invariably give us an accurate estimate of \(\theta\).
Sample size is key here: as \(n\rightarrow \infty\), we approach the true value of the parameter (here, \(\theta\)).
For continuous data, we can do the same using the normal distribution. Given data \(x\), the likelihood of the parameters \(\mu\) and \(\sigma\) is:
\[\begin{equation} \mathcal{L}(\mu,\sigma|x)=Normal(x,\mu,\sigma) \end{equation}\]
Below, assume that \(\sigma=1\).
## the data:
x<-0
## the likelihood under different values
## of mu:
dnorm(x,mean=0,sd=1)
## [1] 0.3989423
dnorm(x,mean=10,sd=1)
## [1] 7.694599e-23
Assuming that \(\sigma=1\), we can plot the likelihood function of \(\mu\) by evaluating it over a range of values of \(\mu\).
If we have two independent data points, the joint likelihood of \(\mu\) given the data (assuming \(\sigma=1\)) is the product of the likelihoods of the individual data points:
x1<-0
x2<-1.5
dnorm(x1,mean=0,sd=1) *
dnorm(x2,mean=0,sd=1)
## [1] 0.05167004
## log likelihood:
dnorm(x1,mean=0,sd=1,log=TRUE) +
dnorm(x2,mean=0,sd=1,log=TRUE)
## [1] -2.962877
## more compactly:
x<-c(x1,x2)
sum(dnorm(x,mean=0,sd=1,log=TRUE))
## [1] -2.962877
One practical implication: one can use the log likelihood to compare competing models’ fit:
## Model 1:
sum(dnorm(x,mean=0,sd=1,log=TRUE))
## [1] -2.962877
## Model 2:
sum(dnorm(x,mean=10,sd=1,log=TRUE))
## [1] -87.96288
Model 1 has higher likelihood than Model 2, so we’d prefer to assume that the data are better characterized by Model 1 than 2 (neither may be the true model!).
More generally, for independent and identically distributed data \(x=x_1,\dots,x_n\):
\[\begin{equation} \mathcal{L}(\mu,\sigma|x) = \prod_{i=1}^n Normal(x_i,\mu,\sigma) \end{equation}\]
or
\[\begin{equation} \ell (\mu,\sigma|x) = \sum_{i=1}^n \log(Normal(x_i,\mu,\sigma)) \end{equation}\]
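As a small additional sketch (the grid of \(\mu\) values is my own choice), the log likelihood for the two data points above, with \(\sigma=1\), is maximized at the sample mean:
x <- c(0, 1.5) ## the same two data points as above
mu_grid <- seq(-1, 2, by = 0.01)
## log likelihood of the data for each candidate value of mu:
ll <- sapply(mu_grid, function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE)))
mu_grid[which.max(ll)]
## [1] 0.75
mean(x)
## [1] 0.75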
Read section 1.4.1 of chapter 1 of the textbook, and (optionally) chapter 2 of the linear modeling lecture notes here:
Data from: Laurinavichyute, A. (2020). Similarity-based interference and faulty encoding accounts of sentence processing. Doctoral dissertation, University of Potsdam.
X: Likert ratings 1-7.
Y: 0, 1 accuracy responses.
The joint PMF: \(p_{X,Y}(x,y)\)
For each possible pair of values of X and Y, we have a joint probability mass function \(p_{X,Y}(x,y)\).
Two useful quantities that we can compute:
\[\begin{equation} p_{X}(x)=\sum_{y\in S_{Y}}p_{X,Y}(x,y).\label{eq-marginal-pmf2} \end{equation}\]
\[\begin{equation} p_{Y}(y)=\sum_{x\in S_{X}}p_{X,Y}(x,y).\label{eq-marginal-pmf} \end{equation}\]
|  | \(x=1\) | \(x=2\) | \(x=3\) | \(x=4\) | \(x=5\) | \(x=6\) | \(x=7\) | \(p(Y)\) |
|---|---|---|---|---|---|---|---|---|
| \(y=0\) | \(0.018\) | \(0.023\) | \(0.04\) | \(0.043\) | \(0.063\) | \(0.049\) | \(0.055\) | \(0.291\) |
| \(y=1\) | \(0.031\) | \(0.053\) | \(0.086\) | \(0.096\) | \(0.147\) | \(0.153\) | \(0.142\) | \(0.709\) |
| \(p(X)\) | \(0.049\) | \(0.077\) | \(0.126\) | \(0.139\) | \(0.21\) | \(0.202\) | \(0.197\) |  |
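To connect the marginalization formulas above with the table: a minimal sketch in R, entering the joint probabilities from the (rounded) table by hand; the row and column sums reproduce \(p(Y)\) and \(p(X)\), up to rounding error in the table entries:
## joint PMF values: rows are y=0 and y=1, columns are x=1,...,7:
joint <- rbind(c(0.018, 0.023, 0.04, 0.043, 0.063, 0.049, 0.055),
               c(0.031, 0.053, 0.086, 0.096, 0.147, 0.153, 0.142))
rowSums(joint) ## marginal p(Y)
## [1] 0.291 0.708
colSums(joint) ## marginal p(X)
## [1] 0.049 0.076 0.126 0.139 0.210 0.202 0.197
## (0.708 vs. 0.709 and 0.076 vs. 0.077 are rounding artifacts)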
\[\begin{equation} p_{X\mid Y}(x\mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} \end{equation}\]
and
\[\begin{equation} p_{Y\mid X}(y\mid x) = \frac{p_{X,Y}(x,y)}{p_X(x)} \end{equation}\]
Let’s do the calculation for \(p_{X\mid Y}(x\mid y=0)\).
|  | \(x=1\) | \(x=2\) | \(x=3\) | \(x=4\) | \(x=5\) | \(x=6\) | \(x=7\) | \(p(Y)\) |
|---|---|---|---|---|---|---|---|---|
| \(y=0\) | \(0.018\) | \(0.023\) | \(0.04\) | \(0.043\) | \(0.063\) | \(0.049\) | \(0.055\) | \(0.291\) |
| \(y=1\) | \(0.031\) | \(0.053\) | \(0.086\) | \(0.096\) | \(0.147\) | \(0.153\) | \(0.142\) | \(0.709\) |
| \(p(X)\) | \(0.049\) | \(0.077\) | \(0.126\) | \(0.139\) | \(0.21\) | \(0.202\) | \(0.197\) |  |
\[\begin{equation} \begin{split} p_{X\mid Y} (1\mid 0) =& \frac{p_{X,Y}(1,0)}{p_Y(0)}\\ =& \frac{0.018}{0.291}\\ =& 0.062 \end{split} \end{equation}\]
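The same calculation for all values of \(x\) at once, as a small R sketch (entering the \(y=0\) row of the rounded table by hand):
## joint probabilities p(x, y=0) for x = 1,...,7:
joint_y0 <- c(0.018, 0.023, 0.04, 0.043, 0.063, 0.049, 0.055)
## conditional distribution p(x | y=0) = p(x, y=0)/p(y=0):
round(joint_y0 / sum(joint_y0), 3)
## [1] 0.062 0.079 0.137 0.148 0.216 0.168 0.189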
As an exercise, figure out the conditional distribution of X given Y, and the conditional distribution of Y given X.
Next, we turn to continuous bivariate/multivariate distributions.
The variance-covariance matrix:
\[\begin{equation}\label{eq:covmatfoundations} \Sigma = \begin{pmatrix} \sigma _{X}^2 & \rho_{XY}\sigma _{X}\sigma _{Y}\\ \rho_{XY}\sigma _{X}\sigma _{Y} & \sigma _{Y}^2\\ \end{pmatrix} \end{equation}\]
The off-diagonals of this matrix contain the covariance between \(X\) and \(Y\):
\[Cov(X,Y) = \rho_{XY}\sigma _{X}\sigma _{Y}\]
The joint distribution of \(X\) and \(Y\) is defined as follows:
\[\begin{equation}\label{eq:jointpriordistfoundations} \begin{pmatrix} X \\ Y \\ \end{pmatrix} \sim \mathcal{N}_2 \left( \begin{pmatrix} 0 \\ 0 \\ \end{pmatrix}, \Sigma \right) \end{equation}\]
The joint PDF has the property that the volume under the curve sums to 1.
Formally, we would write the volume under the curve as a double integral: we are summing up the volume under the curve for both \(X\) and \(Y\) (hence the two integrals).
\[\begin{equation} \iint_{S_{X,Y}} f_{X,Y}(x,y)\, dx dy = 1 \end{equation}\]
Here, the terms \(dx\) and \(dy\) express the fact that we are computing the volume under the curve along the \(X\) axis and the \(Y\) axis.
The joint CDF would be written as follows. The equation below gives us the probability of observing a pair of values \((u',v')\) such that \(u'<u\) and \(v'<v\).
\[\begin{equation} \begin{split} F_{X,Y}(u,v) =& \hbox{Prob}(X<u,Y<v)\\ =& \int_{-\infty}^u \int_{-\infty}^v f_{X,Y}(x,y)\, dy dx \\ ~& \hbox{ for } (x,y) \in \mathbb{R}^2\\ \end{split} \end{equation}\]
Just as in the discrete case, the marginal distributions can be derived by marginalizing out the other random variable:
\[\begin{equation} f_X(x) = \int_{S_Y} f_{X,Y}(x,y)\, dy \quad f_Y(y) = \int_{S_X} f_{X,Y}(x,y)\, dx \end{equation}\]
Here, \(S_X\) and \(S_Y\) are the respective supports.
## define a variance-covariance matrix
## (sd of X = 5, sd of Y = 10, correlation = 0.6):
Sigma <- matrix(c(5^2, 5 * 10 * 0.6,
                  5 * 10 * 0.6, 10^2),
                byrow = FALSE, ncol = 2)
## generate data:
u <- MASS::mvrnorm(n = 100, mu = c(0, 0),
Sigma = Sigma)
head(u, n = 3)
## [,1] [,2]
## [1,] -1.796916 -6.710635
## [2,] -1.240513 9.394426
## [3,] -3.011623 -12.695913
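As a rough illustration of the joint CDF defined earlier, one can estimate \(F_{X,Y}(0,0) = \hbox{Prob}(X<0, Y<0)\) from the simulated draws (this Monte Carlo estimate varies from run to run):
## proportion of simulated (x, y) pairs with both components below zero:
mean(u[, 1] < 0 & u[, 2] < 0)
## (the value varies across simulations; roughly 0.35 for a correlation of 0.6)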
One practical implication: Such bi/multivariate distributions become critically important to understand when we turn to hierarchical (linear mixed) models.