Textbook

Introduction to Bayesian Data Analysis for Cognitive Science

Nicenboim, Schad, Vasishth

Be sure to read the textbook’s chapter 1 in addition to watching this lecture.

Introduction: Motivation for this lecture

It only takes about a day to understand these materials, but the content here will positively impact your ability to carry out statistical modeling and data analysis.

Discrete random variables

A random variable \(X\) is a function \(X : \Omega \rightarrow \mathbb{R}\) that associates to each outcome \(\omega \in \Omega\) exactly one number \(X(\omega) = x\).

\(S_X\) is the set of all possible values of \(X\), called the support of \(X\); i.e., \(x \in S_X\).

An example of a discrete RV

An example of a discrete random variable: keep tossing a coin again and again until you get a Heads.

  • \(X: \omega \rightarrow x\)
  • \(\omega\): H, TH, TTH, ... (infinitely many possible outcomes)
  • \(X(H)=1, X(TH)=2, X(TTH)=3, \dots\)
  • \(x=1,2,\dots; x \in S_X\)

A second example of a discrete random variable: tossing a coin once.

  • \(X: \omega \rightarrow x\)
  • \(\omega\): H, T
  • \(X(T)=0, X(H)=1\)
  • \(x=0,1; x \in S_X\)

The probability mass function (PMF)

Every discrete (continuous) random variable X has associated with it a probability mass (density) function (PMF, PDF).

  • PMF is used for discrete distributions and PDF for continuous.
  • (Some books use PDF for both discrete and continuous distributions.)

Thinking just about discrete random variables for now:

\[\begin{equation} p_X : S_X \rightarrow [0, 1] \end{equation}\]

defined by

\[\begin{equation} p_X(x) = \hbox{Prob}(X(\omega) = x), x \in S_X \end{equation}\]

Example of a PMF: a random variable \(X\) representing tossing a coin once.

  • In the case of a fair coin, \(x\) can be \(0\) or \(1\), and the probability of each possible event (each event is a subset of the set of possible outcomes) is \(0.5\).
  • Formally: \(p_X(x) = \hbox{Prob}(X(\omega) = x), x \in S_X\)
  • The probability mass function defines the probability of each event: \(p_X(0) = p_X(1) =0.5\).

The cumulative distribution function (CDF)

The cumulative distribution function (CDF) \(F(x)\) gives the probability of observing the event \(X\leq x\), i.e., \(F(x) = \hbox{Prob}(X \leq x)\).

\[\begin{equation} \begin{split} F(x=1) & = \hbox{Prob}(X \leq 1)\\ & = \sum_{x=0}^{1} p_X(x) \\ & = p_X(x=0) + p_X(x=1) \\ & = 1 \end{split} \end{equation}\]

\[\begin{equation} \begin{split} F(x=0) & = \hbox{Prob}(X \leq 0)\\ & = \sum_{x=0}^{0} p_X(x)\\ & = p_X(x=0) \\ & = 0.5 \end{split} \end{equation}\]

Do 10 coin-tossing experiments, each with one trial. The probability of heads (which I call \(\theta\) below) is 0.5:

extraDistr::rbern(n = 10, prob = 0.5)
##  [1] 1 1 1 0 0 1 0 0 0 0

The probability mass function: Bernoulli

\[p_X(x)= \theta^x (1-\theta)^{(1-x)}\]

where x can have values 0, 1.

What is the probability of tails (\(x=0\)) or heads (\(x=1\))? Use the d-family of functions:

extraDistr::dbern(0, prob = 0.5)
## [1] 0.5
extraDistr::dbern(1, prob = 0.5)
## [1] 0.5

Notice that these probabilities sum to 1.

The cumulative probability distribution function: the p-family of functions:

\[F(x=1) = Prob(X \leq 1) = \sum_{x=0}^{1} p_X(x) = 1\]

extraDistr::pbern(1, prob = 0.5)
## [1] 1

\[F(x=0) = Prob(X \leq 0) = \sum_{x=0}^{0} p_X(x) = 0.5\]

extraDistr::pbern(0, prob = 0.5)
## [1] 0.5

Another example of a discrete random variable: The binomial

  • Consider carrying out a single experiment where you toss a coin 10 times (the number of trials, size in R).
  • When the number of trials (size) is 1, we have a Bernoulli; when we have size greater than 1, we have a Binomial.

Recall the Bernoulli PMF (the special case where the number of trials is 1):

\[\theta^{x}(1-\theta)^{1-x}\] where \[S_X=\{0,1\}\]

Binomial PMF

\[{n \choose x} \theta^{x}(1-\theta)^{n-x}\]

where \[S_X=\{0,1,\dots,n\}\]

  • \(n\) is the number of times the coin was tossed (the number of trials; size in R).
  • \({n \choose x}\) is the number of ways that you can get \(x\) successes in \(n\) trials.
choose(10, 2)
## [1] 45
  • \(\theta\) is the probability of success on any single trial (see the check below).
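
As a quick check (using the dbinom function, introduced below), we can compute the binomial PMF by hand for \(x=2\) successes in \(n=10\) trials with \(\theta=0.5\) and compare it with R's built-in function:

choose(10, 2) * 0.5^2 * (1 - 0.5)^(10 - 2)
## [1] 0.04394531
dbinom(2, size = 10, prob = 0.5)
## [1] 0.04394531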

Four critical R functions for the binomial RV

1. Generate random data: rbinom

  • n: number of experiments done (Note: in the binomial PMF, \(n\) stands for the number of trials). In R, n is called the number of observations.
  • size: the number of times the coin was tossed in each experiment (the number of trials)

Example: 10 separate experiments, each with 1 trial:

rbinom(n = 10, size = 1, prob = 0.5)
##  [1] 1 0 1 0 0 0 0 0 0 0
## equivalent to: rbern(10,0.5)

Example: 10 separate experiments, each with 10 trials:

rbinom(n = 10, size = 10, prob = 0.5)
##  [1] 5 5 7 6 4 2 5 3 5 7

2. Compute probabilities of particular events (0,1,…,10 successes when n=10): dbinom

probs <- round(dbinom(0:10, size = 10, 
                      prob = 0.5), 3)
x <- 0:10
data.frame(x, probs)
##     x probs
## 1   0 0.001
## 2   1 0.010
## 3   2 0.044
## 4   3 0.117
## 5   4 0.205
## 6   5 0.246
## 7   6 0.205
## 8   7 0.117
## 9   8 0.044
## 10  9 0.010
## 11 10 0.001

3. Compute cumulative probabilities: pbinom
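
For example, the probability of observing 5 or fewer successes in 10 trials with \(\theta=0.5\):

pbinom(5, size = 10, prob = 0.5)
## [1] 0.6230469
## equivalently, sum up the PMF values for 0 through 5 successes:
sum(dbinom(0:5, size = 10, prob = 0.5))
## [1] 0.6230469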

4. Compute quantiles using the inverse of the CDF: qbinom

probs <- pbinom(0:10, size = 10, prob = 0.5)
qbinom(probs, size = 10, prob = 0.5)
##  [1]  0  1  2  3  4  5  6  7  8  9 10

These four functions constitute the d-p-q-r family of functions, and they are available for all the distributions built into R (e.g., Poisson, geometric, normal, beta, uniform, gamma, exponential, Cauchy, etc.).

Continuous random variables

In coin tosses, H and T are discrete possible outcomes. By contrast, a continuous random variable can take on any value within some continuous range of real numbers. The notation

\[\begin{equation} X \sim f(\cdot) \end{equation}\]

means that the random variable \(X\) is assumed to have PDF \(f(\cdot)\).

For example, if we say that \(X\sim Normal(\mu,\sigma)\), we are assuming that the PDF is

\[\begin{equation} f(x\mid \mu, \sigma)= \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \end{equation}\]

where \(-\infty<x<+\infty\)

We can truncate the normal distribution such that \(S_X\) is bounded by some lower and/or upper bound; this comes later.

The normal random variable

The PDF below is associated with the normal distribution that you are probably familiar with:

\[\begin{equation} f(x\mid \mu, \sigma)= \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \end{equation}\]

where \(-\infty<x<+\infty\).

  • The support of \(X\), i.e., the set \(S_X\), consists of values ranging from \(-\infty\) to \(+\infty\)
  • \(\mu\) is the location parameter (here, mean)
  • \(\sigma\) is the scale parameter (here, standard deviation)
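
As a sanity check (a minimal sketch; the function name f is just illustrative), we can code up this PDF by hand and compare it with R's built-in dnorm:

## hand-coded normal PDF:
f <- function(x, mu = 0, sigma = 1) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}
f(2)
## [1] 0.05399097
dnorm(2, mean = 0, sd = 1)
## [1] 0.05399097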

In the discrete RV case, we could compute the probability of an event occurring:

extraDistr::dbern(x = 1, prob = 0.5)
## [1] 0.5
dbinom(x = 2, size = 10, prob = 0.5)
## [1] 0.04394531
  • In a continuous distribution, probability is defined as the area under the curve.
  • As a consequence, for any particular point value \(x\), where \(X\sim Normal(\mu,\sigma)\), it is always the case that \(\hbox{Prob}(X=x) = 0\).
  • In any continuous distribution, we can compute probabilities like \(\hbox{Prob}(x_1<X<x_2) = ?\), where \(x_1<x_2\), by summing up the area under the curve.
  • To compute probabilities like \(\hbox{Prob}(x_1<X<x_2) = ?\), we need the cumulative distribution function.

The cumulative distribution function (CDF) is

\[\begin{equation} P(X<u) = F(u) =\int_{-\infty}^{u} f(x)\, dx \end{equation}\]

  • The integral sign \(\int\) is just the summation symbol in continuous space.
  • Recall the summation in the CDF of the Bernoulli!
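
For example, for \(X\sim Normal(0,1)\), \(\hbox{Prob}(X<2)\) can be obtained by numerically integrating the PDF; it agrees with the built-in CDF function pnorm (discussed below):

integrate(dnorm, lower = -Inf, upper = 2)$value
## [1] 0.9772499
pnorm(2)
## [1] 0.9772499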

The standard normal distribution

In the \(Normal(\mu=0,\sigma=1)\),

  • Prob\((-1<X<+1) \approx 0.68\)
  • Prob\((-2<X<+2) \approx 0.95\)
  • Prob\((-3<X<+3) \approx 0.997\)

More generally, for any \(Normal(\mu,\sigma)\),

  • Prob\((\mu-1\times \sigma <X<\mu+1\times \sigma) \approx 0.68\)
  • Prob\((\mu-2\times \sigma <X<\mu+2\times \sigma) \approx 0.95\)
  • Prob\((\mu-3\times \sigma <X<\mu+3\times \sigma) \approx 0.997\)
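
These values can be checked with pnorm (the choice of \(\mu=100, \sigma=15\) in the last line is arbitrary):

pnorm(1) - pnorm(-1)
## [1] 0.6826895
pnorm(2) - pnorm(-2)
## [1] 0.9544997
pnorm(3) - pnorm(-3)
## [1] 0.9973002
## the same holds for any Normal(mu, sigma), e.g., mu = 100, sigma = 15:
pnorm(115, mean = 100, sd = 15) - pnorm(85, mean = 100, sd = 15)
## [1] 0.6826895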

The normalizing constant and the kernel

The PDF of the normal again:

\[\begin{equation} f(x\mid \mu, \sigma)= \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \end{equation}\]

This part of \(f(x\mid \mu,\sigma)\) (call it \(g(x)\)) is the "kernel" of the normal PDF:

\[\begin{equation} g(x\mid \mu, \sigma) = \exp\left[-\frac{(x-\mu)^2}{2 \sigma^2}\right] \end{equation}\]

For the above function, the area under the curve doesn’t sum to 1:

Sum up the area under the curve \(\int g(x)\, dx\):

g <- function(x, mu = 0, sigma = 1) {
  exp((-(x - mu)^2 / (2 * (sigma^2))))
}

integrate(g, lower = -Inf, upper = +Inf)$value
## [1] 2.506628

The shape of the curve does not change, of course; only the vertical scaling does.

In simple examples like the one shown here, given the kernel of some PDF like \(g(x)\), we can figure out the normalizing constant by solving for \(k\) in:

\[\begin{equation} k \int g(x)\, dx = 1 \end{equation}\]

Solving for \(k\) just amounts to computing:

\[\begin{equation} k = \frac{1}{\int g(x)\, dx} \end{equation}\]

So, in our example above,

(k<-1/integrate(g, lower = -Inf, upper = +Inf)$value)
## [1] 0.3989423

The above number is just \(\frac{1}{\sqrt{2 \pi \sigma^2}}\), where \(\sigma=1\):

1/(sqrt(2*pi*1))
## [1] 0.3989423

Once we include the normalizing constant \(k\), the area under the curve of \(k \cdot g(x)\) sums to 1:

k * integrate(g, lower = -Inf, upper = +Inf)$value
## [1] 1

We will see the practical implication of this when we move on to chapter 2 of the textbook.

The d-p-q-r functions for the normal distribution

In the continuous case, we also have this family of d-p-q-r functions. In the normal distribution:

1. Generate random data using rnorm

round(rnorm(5, mean = 0, sd = 1),3)
## [1] -1.849  0.018  0.188  0.891 -0.439

For the standard normal, mean=0 and sd=1 can be omitted (these are the default values in R).

round(rnorm(5),3)
## [1]  2.733  0.679  0.918 -0.501 -0.011

2. Compute probabilities using CDF: pnorm

Some examples of usage:

  • \(\hbox{Prob}(X<2)\) (e.g., in \(X\sim Normal(0,1)\))
pnorm(2)
## [1] 0.9772499
  • \(\hbox{Prob}(X>2)\) (e.g., in \(X\sim Normal(0,1)\))
pnorm(2, lower.tail = FALSE)
## [1] 0.02275013

3. Compute quantiles: qnorm

qnorm(0.9772499)
## [1] 2.000001

4. Compute the probability density: dnorm

dnorm(2)
## [1] 0.05399097

Note: In the continuous case, this is a density, the value \(f(x)\), not a probability. Cf. the discrete examples dbern and dbinom, which give probabilities of a point value \(x\).

The likelihood function (Binomial)

The likelihood function refers to the PMF \(p(k|n,\theta)\), treated as a function of \(\theta\).

For example, suppose that we record \(n=10\) trials, and observe \(k=7\) successes. The likelihood function is:

\[\begin{equation} \mathcal{L}(\theta|k=7,n=10)= \binom{10}{7} \theta^{7} (1-\theta)^{10-7} \end{equation}\]

If we now plot the likelihood function for all possible values of \(\theta\) ranging from \(0\) to \(1\), we get the plot shown below.
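
A minimal sketch of how such a plot can be produced (the grid step for \(\theta\) is arbitrary):

theta <- seq(0, 1, by = 0.001)
lik <- dbinom(7, size = 10, prob = theta)
plot(theta, lik, type = "l",
     xlab = expression(theta), ylab = "likelihood")
## the likelihood is maximized at theta = k/n = 0.7 (the MLE):
theta[which.max(lik)]
## [1] 0.7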

The MLE from a particular sample of data need not give us an accurate estimate of the true value of \(\theta\).

Sample size is key here: as \(n\rightarrow \infty\), we approach the true value of the parameter (here, \(\theta\)).
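
A minimal simulation sketch of this point (the random results will vary from run to run): the estimate \(k/n\) from an experiment with many trials will typically be much closer to the true \(\theta=0.5\) than the estimate from an experiment with few trials.

k_small <- rbinom(n = 1, size = 10, prob = 0.5)
k_small / 10      ## a coarse estimate of theta
k_large <- rbinom(n = 1, size = 100000, prob = 0.5)
k_large / 100000  ## typically very close to 0.5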

The likelihood function (Normal)

\[\begin{equation} \mathcal{L}(\mu,\sigma|x)=Normal(x,\mu,\sigma) \end{equation}\]

Below, assume that \(\sigma=1\).

## the data:
x<-0
## the likelihood under different values 
## of mu:
dnorm(x,mean=0,sd=1)
## [1] 0.3989423
dnorm(x,mean=10,sd=1)
## [1] 7.694599e-23

Assuming that \(\sigma=1\), we can plot the likelihood function of \(\mu\) given the data point \(x=0\).
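
A minimal sketch of this plot (the grid of \(\mu\) values is arbitrary):

mu <- seq(-4, 4, by = 0.01)
plot(mu, dnorm(0, mean = mu, sd = 1), type = "l",
     xlab = expression(mu), ylab = "likelihood")
## the likelihood peaks at mu = 0, the observed data point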

If we have two independent data points, the joint likelihood of \(\mu\) given the data (assuming \(\sigma=1\)) is the product of the individual likelihoods:

x1<-0
x2<-1.5
dnorm(x1,mean=0,sd=1) * 
  dnorm(x2,mean=0,sd=1) 
## [1] 0.05167004
## log likelihood:
dnorm(x1,mean=0,sd=1,log=TRUE) + 
  dnorm(x2,mean=0,sd=1,log=TRUE) 
## [1] -2.962877
## more compactly:
x<-c(x1,x2)
sum(dnorm(x,mean=0,sd=1,log=TRUE))
## [1] -2.962877

One practical implication: one can use the log likelihood to compare competing models’ fit:

## Model 1:
sum(dnorm(x,mean=0,sd=1,log=TRUE))
## [1] -2.962877
## Model 2:
sum(dnorm(x,mean=10,sd=1,log=TRUE))
## [1] -87.96288

Model 1 has higher likelihood than Model 2, so we’d prefer to assume that the data are better characterized by Model 1 than 2 (neither may be the true model!).

More generally, for independent and identically distributed data \(x=x_1,\dots,x_n\):

\[\begin{equation} \mathcal{L}(\mu,\sigma|x) = \prod_{i=1}^n Normal(x_i,\mu,\sigma) \end{equation}\]

or

\[\begin{equation} \ell (\mu,\sigma|x) = \sum_{i=1}^n \log(Normal(x_i,\mu,\sigma)) \end{equation}\]
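
As a sketch of how these formulas can be used (the grid of \(\mu\) values is just illustrative), we can evaluate the log likelihood for the two data points above over a range of \(\mu\) values, holding \(\sigma=1\) fixed, and confirm that it peaks at the sample mean:

x <- c(0, 1.5)
mu_grid <- seq(-1, 2, by = 0.01)
loglik <- sapply(mu_grid,
                 function(m) sum(dnorm(x, mean = m, sd = 1, log = TRUE)))
## the maximum is at the sample mean, mean(x) = 0.75:
mu_grid[which.max(loglik)]
## [1] 0.75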

The expectation and variance of an RV

Read section 1.4.1 of chapter 1 of the textbook, and (optionally) chapter 2 of the linear modeling lecture notes here:

https://github.com/vasishth/LM

Bivariate/multivariate distributions

Discrete bivariate distributions

Data from: Laurinavichyute, A. (2020). Similarity-based interference and faulty encoding accounts of sentence processing. Doctoral dissertation, University of Potsdam.

X: Likert ratings 1-7.

Y: 0, 1 accuracy responses.

The joint PMF: \(p_{X,Y}(x,y)\)

For each possible pair of values of X and Y, we have a joint probability mass function \(p_{X,Y}(x,y)\).

Two useful quantities that we can compute:

The marginal distributions (\(p_{X}\) and \(p_Y\))

\[\begin{equation} p_{X}(x)=\sum_{y\in S_{Y}}p_{X,Y}(x,y).\label{eq-marginal-pmf2} \end{equation}\]

\[\begin{equation} p_{Y}(y)=\sum_{x\in S_{X}}p_{X,Y}(x,y).\label{eq-marginal-pmf} \end{equation}\]

The joint PMF for two random variables X and Y, along with the marginal distributions of X and Y.

         x=1     x=2     x=3     x=4     x=5     x=6     x=7     p(Y)
y=0      0.018   0.023   0.040   0.043   0.063   0.049   0.055   0.291
y=1      0.031   0.053   0.086   0.096   0.147   0.153   0.142   0.709
p(X)     0.049   0.077   0.126   0.139   0.210   0.202   0.197
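
A small sketch in R (the matrix below just re-enters the joint probabilities from the table): each marginal is obtained by summing the joint PMF over the other variable.

## joint PMF as a matrix: rows are y = 0, 1; columns are x = 1, ..., 7
joint <- rbind(
  c(0.018, 0.023, 0.040, 0.043, 0.063, 0.049, 0.055),
  c(0.031, 0.053, 0.086, 0.096, 0.147, 0.153, 0.142)
)
rowSums(joint)  ## marginal p(Y); matches the p(Y) column up to rounding
colSums(joint)  ## marginal p(X); matches the p(X) row up to rounding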

The conditional distributions (\(p_{X|Y}\) and \(p_{Y|X}\))

\[\begin{equation} p_{X\mid Y}(x\mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} \end{equation}\]

and

\[\begin{equation} p_{Y\mid X}(y\mid x) = \frac{p_{X,Y}(x,y)}{p_X(x)} \end{equation}\]

Let’s do the calculation for \(p_{X\mid Y}(x\mid y=0)\).


\[\begin{equation} \begin{split} p_{X\mid Y} (1\mid 0) =& \frac{p_{X,Y}(1,0)}{p_Y(0)}\\ =& \frac{0.018}{0.291}\\ =& 0.062 \end{split} \end{equation}\]

As an exercise, figure out the conditional distribution of X given Y, and the conditional distribution of Y given X.

Continuous bivariate distributions

Next, we turn to continuous bivariate/multivariate distributions.

The variance-covariance matrix:

\[\begin{equation}\label{eq:covmatfoundations} \Sigma = \begin{pmatrix} \sigma _{X}^2 & \rho_{XY}\sigma _{X}\sigma _{Y}\\ \rho_{XY}\sigma _{X}\sigma _{Y} & \sigma _{Y}^2\\ \end{pmatrix} \end{equation}\]

The off-diagonals of this matrix contain the covariance between \(X\) and \(Y\):

\[Cov(X,Y) = \rho_{XY}\sigma _{X}\sigma _{Y}\]

For example, assuming that both \(X\) and \(Y\) have mean 0, the joint (bivariate normal) distribution of \(X\) and \(Y\) is defined as follows:

\[\begin{equation}\label{eq:jointpriordistfoundations} \begin{pmatrix} X \\ Y \\ \end{pmatrix} \sim \mathcal{N}_2 \left( \begin{pmatrix} 0 \\ 0 \\ \end{pmatrix}, \Sigma \right) \end{equation}\]

The joint PDF has the property that the volume under the curve sums to 1.

Formally, we would write the volume under the curve as a double integral: we are summing up the volume under the curve for both \(X\) and \(Y\) (hence the two integrals).

\[\begin{equation} \iint_{S_{X,Y}} f_{X,Y}(x,y)\, dx dy = 1 \end{equation}\]

Here, the terms \(dx\) and \(dy\) express the fact that we are computing the volume under the curve along the \(X\) axis and the \(Y\) axis.

The joint CDF would be written as follows. The equation below gives us the probability of observing a value like \((u,v)\) or some value smaller than that (i.e., some \((u',v')\), such that \(u'<u\) and \(v'<v\)).

\[\begin{equation} \begin{split} F_{X,Y}(u,v) =& \hbox{Prob}(X<u,Y<v)\\ =& \int_{-\infty}^u \int_{-\infty}^v f_{X,Y}(x,y)\, dy dx \\ ~& \hbox{ for } (x,y) \in \mathbb{R}^2\\ \end{split} \end{equation}\]
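
As an illustration (this sketch assumes the mvtnorm package, which is not otherwise used in these notes), such a joint CDF can be evaluated numerically for a bivariate normal:

## F(0, 0) = Prob(X < 0, Y < 0) for a bivariate normal with correlation 0.5:
Sigma_ex <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
mvtnorm::pmvnorm(lower = c(-Inf, -Inf), upper = c(0, 0),
                 mean = c(0, 0), sigma = Sigma_ex)
## with correlation 0, this would be pnorm(0) * pnorm(0) = 0.25;
## the positive correlation makes the joint probability larger (about 0.33 here)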

Just as in the discrete case, the marginal distributions can be derived by marginalizing out the other random variable:

\[\begin{equation} f_X(x) = \int_{S_Y} f_{X,Y}(x,y)\, dy \quad f_Y(y) = \int_{S_X} f_{X,Y}(x,y)\, dx \end{equation}\]

Here, \(S_X\) and \(S_Y\) are the respective supports.

Generate simulated bivariate (multivariate) data

## define a variance-covariance matrix:
Sigma <- matrix(c(5^2, 5 * 10 * 0.6, 
                  5 * 10 * 0.6, 10^2),
  byrow = FALSE, ncol = 2
)
## generate data:
u <- MASS::mvrnorm(n = 100, mu = c(0, 0), 
                   Sigma = Sigma)
head(u, n = 3)
##           [,1]       [,2]
## [1,] -1.796916  -6.710635
## [2,] -1.240513   9.394426
## [3,] -3.011623 -12.695913
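
As a quick check (the exact values vary from run to run, since the data are simulated), the sample covariance and correlation matrices of u should be close to the generating parameters:

cov(u)  ## should be close to Sigma
cor(u)  ## off-diagonal entries should be close to 0.6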

One practical implication: Such bi/multivariate distributions become critically important to understand when we turn to hierarchical (linear mixed) models.