1.1 Discrete random variables: An example using the Binomial distribution

Imagine that our data come from a grammaticality judgement task (participants see or hear sentences and have to decide whether these are grammatical or ungrammatical), and that the responses from participants are a sequence of 1’s and 0’s, where 1 represents the judgement “grammatical”, and 0 represents the judgement “ungrammatical”. Assume also that each response, coded as 1 or 0, is generated independently of the others. We can simulate the outcome of such an experiment (i.e., a sequence of 1’s and 0’s) in R. Let’s generate the outcome of 20 such experiments with a sample size of 10 (i.e., each experiment provides us with 10 responses). For each experiment, we count the number of 1’s (or number of “successes”). The R code that generated this output will be explained soon.

##  [1] 7 7 4 7 6 5 6 3 6 6 5 6 7 4 5 7 8 3 5 5
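
Such a sequence of outcomes can be simulated with the function rbinom, which is explained in detail further below; the success probability of \(0.5\) used in this sketch is an assumption chosen for illustration (the value used to generate the output above is not stated at this point).

## simulate 20 experiments, each consisting of 10 trials;
## prob = 0.5 is an assumed probability of success
rbinom(n = 20, size = 10, prob = 0.5)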

Outcomes such as this one (i.e., the number of successes for a variable with two possible outcomes) follow a probability distribution \(p(Y)\). In more technical terms, we can say that the number of successes in each of the \(20\) simulated experiments above is being generated by a discrete random variable \(Y\) which has associated with it a probability distribution \(p(Y)\) called the Binomial distribution.1

As mentioned above, for a discrete random variable, the probability distribution \(p(Y)\) is called a probability mass function (PMF). The PMF defines the probability of each possible outcome. In the above example, with \(n=10\) trials, there are 11 possible outcomes: \(0,\dots,10\) successes. Which of these outcomes is most probable depends on a numerical value in the Binomial distribution, called a parameter, that represents the probability of success. We will call this parameter \(\theta\); because it represents a probability, it must have a value between 0 and 1. The left-hand plot in Figure 1.1 shows an example of a Binomial PMF with \(10\) trials and the parameter \(\theta\) fixed at the value \(0.5\). Setting \(\theta\) to 0.5 leads to a PMF where the most probable outcome is 5 successes out of 10. If we had set \(\theta\) to, say, 0.1, then the most probable outcome would be 1 success out of 10; and if we had set \(\theta\) to 0.9, then the most probable outcome would be 9 successes out of 10.

As we will see later, when we analyze data, a primary goal will be to compute a so-called estimate of the parameter (here, \(\theta\)). In real-life situations, the value of the \(\theta\) parameter will be unknown and unknowable; the data will allow us to compute a “guess” about the unknown value of the parameter. We will call this guess the estimate of the parameter.

FIGURE 1.1: Probability mass functions of a binomial distribution assuming 10 trials, with 50%, 10%, and 90% probability of success.
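
As a sketch, the probabilities displayed in Figure 1.1 can be computed with the dbinom function (introduced below) and plotted, for example, with base R’s barplot; the code actually used to produce the figure may differ.

## probabilities of 0 to 10 successes for three values of theta
k <- 0:10
probs_0.5 <- dbinom(k, size = 10, prob = 0.5)
probs_0.1 <- dbinom(k, size = 10, prob = 0.1)
probs_0.9 <- dbinom(k, size = 10, prob = 0.9)
## display the three PMFs side by side
op <- par(mfrow = c(1, 3))
barplot(probs_0.5, names.arg = k, main = "theta = 0.5")
barplot(probs_0.1, names.arg = k, main = "theta = 0.1")
barplot(probs_0.9, names.arg = k, main = "theta = 0.9")
par(op)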

The probability mass function for the binomial is written as follows.

\[\begin{equation} \hbox{Binomial}(k|n,\theta) = \binom{n}{k} \theta^{k} (1-\theta)^{n-k} \end{equation}\]

Here, \(n\) represents the total number of trials, \(k\) the number of successes, and \(\theta\) the probability of success. The term \(\binom{n}{k}\), pronounced n-choose-k, represents the number of ways in which one can choose \(k\) successes out of \(n\) trials. For example, 1 success out of 10 can occur in 10 possible ways: the very first trial could be a 1, the second trial could be a 1, and so on. The term \(\binom{n}{k}\) expands to \(\frac{n!}{k!(n-k)!}\); the exclamation mark represents the factorial (e.g., \(3!=3\times 2\times 1\)). In R, \(\binom{n}{k}\) is computed using the function choose(n,k), where \(n\) is a positive integer and \(k\) is a non-negative integer no larger than \(n\).
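
For example, we can verify in R that there are indeed 10 ways to obtain 1 success out of 10 trials, and that choose(n,k) agrees with the factorial formula:

choose(10, 1)
## [1] 10
factorial(10) / (factorial(1) * factorial(9))
## [1] 10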

1.1.1 The mean and variance of the Binomial distribution

It is possible to analytically compute the mean and variance of the PMF associated with the Binomial random variable \(Y\). Without getting into the details of how these are derived mathematically, we just state here that the mean of \(Y\) (also called the expectation, conventionally written \(E[Y]\)) and variance of \(Y\) (written \(Var(Y)\)) of a Binomial distribution with parameter \(\theta\) and \(n\) trials are \(E[Y] = n\theta\) and \(Var(Y) = n\theta (1-\theta)\).

Of course, we always know \(n\) (because we decide on the number of trials ourselves), but in real experimental situations we never know the true value of \(\theta\). However, \(\theta\) can be estimated from the observed data by computing \(\hat \theta=k/n\). The quantity \(\hat \theta\) is the observed proportion of successes, and is called the maximum likelihood estimate of the true (but unknown) mean. Once we have estimated \(\theta\) in this way, we can also obtain an estimate (also a maximum likelihood estimate) of the variance by computing \(n\hat\theta (1-\hat\theta)\). These estimates are then used for statistical inference.
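
As a concrete illustration, suppose (hypothetically) that we observed \(k=7\) successes in \(n=10\) trials; the two estimates can then be computed as follows:

n <- 10
k <- 7
## maximum likelihood estimate of theta:
(theta_hat <- k / n)
## [1] 0.7
## estimate of the variance:
(var_hat <- n * theta_hat * (1 - theta_hat))
## [1] 2.1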

What does the term “maximum likelihood estimate” mean? The term likelihood refers to the value of the Binomial distribution function for a particular value of \(\theta\), once we have observed some data. For example, suppose you record \(n=10\) trials, and observe \(k=7\) successes. What is the probability of observing \(7\) successes out of \(10\)? We need the binomial distribution to compute this value:

\[\begin{equation} \hbox{Binomial}(k=7|n=10,\theta) = \binom{10}{7} \theta^{7} (1-\theta)^{10-7} \end{equation}\]

Once we have observed the data, both \(n\) and \(k\) are fixed; the only quantity in the above equation that can still vary is \(\theta\). When the data are fixed in this way, the probability mass function depends only on the value of the parameter \(\theta\), and is then called a likelihood function. It is therefore often expressed as a function of \(\theta\):

\[\begin{equation} p(y | \theta ) = p( k=7, n=10 | \theta) = \mathcal{L}(\theta) \end{equation}\]

The vertical bar notation above should be read as follows: given some data \(y\) (which in the binomial case will be \(k\) “successes” in \(n\) trials), the function returns a likelihood value for each possible value of \(\theta\).

If we now plot this function for all possible values of \(\theta\), we get the plot shown in Figure 1.2. We will show below how to compute this function for different values of \(\theta\).

FIGURE 1.2: The likelihood function for 7 successes out of 10.
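
As a sketch, this likelihood function can be computed by evaluating dbinom over a grid of \(\theta\) values (the code actually used to produce Figure 1.2 may differ):

## likelihood of theta given k = 7 successes in n = 10 trials
theta <- seq(0, 1, by = 0.001)
likelihood <- dbinom(7, size = 10, prob = theta)
plot(theta, likelihood, type = "l",
  xlab = "theta", ylab = "likelihood")
## the grid value at which the likelihood is highest:
theta[which.max(likelihood)]
## [1] 0.7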

What is important about this plot is that it shows that, given the data, the likelihood reaches its maximum at \(\theta = 0.7\), which corresponds to the estimated mean computed using the formula shown above: \(k/n = 7/10\). Thus, the maximum likelihood estimate (MLE) gives us the most likely value of the parameter \(\theta\) given the data. It is crucial to note here that the phrase “most likely” does not mean that the MLE from a particular sample of data invariably gives us an accurate estimate of \(\theta\). For example, if we run our experiment for \(10\) trials and get \(1\) success out of \(10\), the MLE is \(0.10\). We could have happened to observe only one success out of ten even if the true \(\theta\) were \(0.5\). The MLE would, however, give an accurate estimate of the true parameter \(\theta\) as \(n\) approaches infinity.
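
A small simulation can illustrate this point; the true value of \(\theta\) is arbitrarily set to \(0.5\) here, and the function rbinom used below is explained later in this section.

## with only 10 trials, the MLE k/n can be far from the true theta = 0.5:
rbinom(n = 1, size = 10, prob = 0.5) / 10
## with a very large number of trials, the MLE will be
## very close to the true value 0.5:
rbinom(n = 1, size = 1e6, prob = 0.5) / 1e6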

1.1.2 What information does a probability distribution provide?

What good is a probability mass function? We consider this question next. A lot of important information can be extracted from a PMF.

1.1.2.1 Compute the probability of a particular outcome (discrete case only)

The Binomial distribution shown in Figure 1.1 already displays the probability of each possible outcome under three different values of \(\theta\). In R, there is a built-in function, dbinom, that allows us to calculate \(P(Y=k)\), the probability that the random variable \(Y\) takes the value \(k\) successes out of \(n\) trials, given a particular value of \(k\) (this number constitutes our data), the number of trials \(n\), and a particular value of \(\theta\). For example, the probability of 5 successes out of 10 when \(\theta\) is \(0.5\) is:

(prob <- dbinom(5, size = 10, prob = 0.5))
## [1] 0.2461

To be completely explicit, we can write this probability as \(P(k=5|n=10,\theta=0.5)=0.246\); when \(n\) and \(\theta\) are clear from context, we can also write simply \(P(k=5)\).

The probability of this outcome (5 successes out of 10) when \(\theta\) is 0.1 or 0.7 can be computed by replacing 0.5 above with each of these values:

dbinom(5, size = 10, prob = 0.1)
## [1] 0.001488
dbinom(5, size = 10, prob = 0.7)
## [1] 0.1029

Alternatively, one can run the above computation in one shot: give dbinom a vector of \(\theta\) values, and it returns the probability of 5 successes out of 10 for each of these values:

dbinom(5, size = 10, prob = c(0.1, 0.7))
## [1] 0.001488 0.102919

The above command prints out the probability of 5 successes out of 10, when \(\theta=0.1\) and \(\theta=0.7\).

1.1.2.2 Compute the cumulative probability of \(k\) or fewer (more than \(k\)) successes

Instead of the probability of obtaining a particular number of successes, we could be interested in the cumulative probability of obtaining, say, 1 or fewer, or 2 or fewer successes. Formally, we will write this cumulative probability as \(F(2)\), and more generally as \(F(k)\), for a particular value of \(k\). Thus, \(F(k)=P(Y\leq k)\).

We can compute this cumulative probability with the dbinom function, through a simple summation procedure:

## the cumulative probability of obtaining
## 0, 1, or 2 successes out of 10,
## with theta=0.5:
dbinom(0, size = 10, prob = 0.5) + dbinom(1, size = 10, prob = 0.5) +
  dbinom(2, size = 10, prob = 0.5)
## [1] 0.05469

Mathematically, we could write the above summation as:

\[\begin{equation} \sum_{k=0}^2 \binom{n}{k} \theta^{k} (1-\theta)^{n-k} \end{equation}\]

An alternative to the cumbersome addition in the R code above is this more compact statement, which closely mimics the above mathematical expression:

sum(dbinom(0:2, size = 10, prob = 0.5))
## [1] 0.05469

R has a built-in function called pbinom that does this summation for us. If we want to know the probability of \(2\) or fewer successes, as in the above example, we can write:

pbinom(2, size = 10, prob = 0.5, lower.tail = TRUE)
## [1] 0.05469

The specification lower.tail=TRUE ensures that the summation is over \(2\) and all values smaller than \(2\) (which lie in the lower tail of the distribution in Figure 1.1). If we want to know the probability of obtaining more than \(2\) successes out of \(10\) (i.e., \(3\) or more), we set lower.tail to FALSE:

pbinom(2, size = 10, prob = 0.5, lower.tail = FALSE)
## [1] 0.9453

The cumulative distribution function or CDF, \(F(k)\), can be plotted by computing the cumulative probability \(F(k)=P(Y\leq k)\) for each value of \(k\), where \(k\) ranges from \(0\) to \(10\) in our running example. The CDF is shown in Figure 1.3.

FIGURE 1.3: The cumulative distribution function for a binomial distribution assuming 10 trials, with 50% probability of success.
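
As a sketch, the values plotted in Figure 1.3 can be computed with pbinom; the plotting code below is only one possible way to draw the figure.

## cumulative probabilities F(k) = P(Y <= k) for k = 0, ..., 10
k <- 0:10
F_k <- pbinom(k, size = 10, prob = 0.5)
plot(k, F_k, type = "s", xlab = "k", ylab = "F(k)")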

1.1.2.3 Compute the inverse of the cumulative distribution function (the quantile function)

We can also find out the value \(k\) (the quantile) such that the probability of obtaining \(k\) or fewer successes is some specific probability value \(p\). If we switch the x and y axes of Figure 1.3, we obtain another very useful function, the inverse CDF.

The inverse of the CDF (known as the quantile function in R because it returns the quantile, the value \(k\)) is available as the function qbinom. The usage is as follows: to find the smallest value \(k\) such that the probability of obtaining \(k\) or fewer successes is at least \(0.37\), type:

qbinom(0.37, size = 10, prob = 0.5)
## [1] 4
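
We can check this result against the CDF: \(F(4)\) is the smallest cumulative probability that is at least \(0.37\), whereas \(F(3)\) falls below this value.

pbinom(4, size = 10, prob = 0.5)
## [1] 0.377
pbinom(3, size = 10, prob = 0.5)
## [1] 0.1719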

1.1.2.4 Generate random data from a \(\hbox{Binomial}(n,\theta)\) distribution

We can generate random simulated data from a Binomial distribution by specifying the number of trials and the probability of success \(\theta\). In R, we do this in the following way:

rbinom(n = 10, size = 1, prob = 0.5)
##  [1] 1 0 1 1 0 1 0 1 0 1

The above code generates a sequence of 10 randomly generated \(1\)’s and \(0\)’s. Each of the 1’s and 0’s is called an outcome coming from a Bernoulli distribution. In other words, a Bernoulli distribution is just a Binomial distribution with size=1, i.e., a single trial.

Repeatedly run the above code; you will get different sequences of 1’s and 0’s each time. For each generated sequence, one can calculate the number of successes by just summing up the vector, or computing its mean and multiplying by the number of trials, here \(10\):

y <- rbinom(n = 10, size = 1, prob = 0.5)
mean(y) * 10
## [1] 6
sum(y)
## [1] 6

We can think of the above as 10 Bernoulli trials, or as a single Binomial trial of size 10. When we see the data in this way, the syntax and output change:

(y <- rbinom(n = 1, size = 10, prob = 0.5))
## [1] 5

Seen as a single Binomial experiment of size 10, the function returns the number of successes in that experiment.

Next, let’s take a look at an example of a continuous random variable.


  1. When an experiment consists of only a single trial (i.e., the total number of successes can only be 0 or 1), \(p(Y)\) is called a Bernoulli distribution.↩︎