1.5 Likelihood and maximum likelihood estimation

We now turn to an important topic: the idea of likelihood, and of maximum likelihood estimation. Consider as a first example the discrete case, using the Binomial distribution.

Suppose we toss a fair coin 10 times, and count the number of heads; we do this experiment once. Notice below that we set the probability of success to be 0.5. This is because we are assuming that we tossed a fair coin.

(x <- rbinom(1, size = 10, prob = 0.5))
## [1] 3

The probability of obtaining this particular value depends on the value we choose for the parameter \(\theta\) in the PMF of the binomial distribution. Here are some possible values of \(\theta\) and the resulting probabilities:

dbinom(x, size = 10, prob = 0.1)
## [1] 0.0574
dbinom(x, size = 10, prob = 0.3)
## [1] 0.2668
dbinom(x, size = 10, prob = 0.5)
## [1] 0.1172
dbinom(x, size = 10, prob = 0.8)
## [1] 0.0007864
dbinom(x, size = 10, prob = 1.0)
## [1] 0

The value of \(\theta\) that gives us the highest probability for the observed data is called the maximum likelihood estimate. When dbinom is treated as a function of \(\theta\), it is called a likelihood function, and the value of \(\theta\) at which this function reaches its maximum is the maximum likelihood estimate. We can find this value graphically by plotting the likelihood function over all possible values of \(\theta\) and checking where the maximum occurs:

theta <- seq(0, 1, by = 0.01)
plot(theta, dbinom(x, size = 10, prob = theta),
  type = "l",
  ylab = "likelihood/probability"
)
abline(v = x / 10)

It should be clear from the figure that the likelihood reaches its maximum at the observed proportion of heads: 3/10. This value of \(\theta\) is the maximum likelihood estimate (MLE). In general, the MLE of \(\theta\), the value that maximizes the likelihood, is obtained by computing:

\[\begin{equation} \hat \theta = \frac{x}{n} \end{equation}\]

where \(n\) is sample size, and \(x\) is the number of successes. For the analytical derivation of this result, see the Linear Modeling lecture notes: https://github.com/vasishth/LM
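As a quick numerical check (a sketch; any one-dimensional optimizer would do), we can also maximize the likelihood with R's optimize() function and compare the result with the analytical formula:

## numerically maximize the binomial likelihood over theta in (0, 1);
## the location of the maximum should be close to x / n = 3 / 10:
optimize(function(theta) dbinom(x, size = 10, prob = theta),
  interval = c(0, 1), maximum = TRUE
)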

The likelihood function in the continuous case is similar to that of the discrete example above, but there is one crucial difference, which we will get to below.

Consider the following example. Suppose we have one data point from a Normal distribution, with mean 0 and standard deviation 1:

(x <- rnorm(1, mean = 0, sd = 1))
## [1] 0.948

The likelihood function for this data point is going to depend on two parameters, \(\mu\) and \(\sigma\). For simplicity, let’s assume that \(\sigma\) is known to be 1 and that only \(\mu\) is unknown. In this situation, the value of \(\mu\) that maximizes the likelihood will be the MLE. As before, we can graphically find the MLE by plotting the likelihood function:

mu <- seq(-3, 3, by = 0.01)
plot(mu, dnorm(x,
  mean = mu,
  sd = 1
), type = "l")
abline(v = x)

The likelihood function always reaches its maximum at the sample mean of the data; the sample mean is the MLE of \(\mu\). In the above case, with the single data point 0.948, the sample mean is the number itself. If we had two data points from a Normal(0,1) distribution, the likelihood function would be defined as follows. First, let us simulate two data points:

(x1 <- rnorm(1))
## [1] -1.202
(x2 <- rnorm(1))
## [1] -0.4661

These two data points are independent of each other. Hence, to obtain the joint likelihood, we will have to multiply the likelihoods of each of the numbers, given some value for \(\mu\):

mu <- 1
dnorm(x1, mean = mu) * dnorm(x2, mean = mu)

In order to plot the joint likelihood, we need to write a function:

## joint likelihood of x1 and x2 as a function of mu (sd fixed at 1):
normallik <- function(mu) {
  dnorm(x1, mean = mu) * dnorm(x2, mean = mu)
}
## test:
normallik(mu = 1)
## [1] 0.004815

Now, we can plot the likelihood function for these two data points:

mu <- seq(-3, 3, by = 0.01)
plot(mu, normallik(mu), type = "l")
abline(v = mean(c(x1, x2)))

Notice that the joint likelihood reaches its maximum at the mean of the two data points.
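We can verify this numerically with a small sketch, again using optimize(): the maximizing value of \(\mu\) should coincide (up to numerical error) with the sample mean.

## the value of mu that maximizes the joint likelihood ...
optimize(normallik, interval = c(-3, 3), maximum = TRUE)$maximum
## ... should match the sample mean of the two data points:
mean(c(x1, x2))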

For the normal distribution, where \(Y \sim N(\mu,\sigma)\), we can obtain the MLEs of \(\mu\) and \(\sigma^2\) by computing:

\[\begin{equation} \hat \mu = \frac{1}{n}\sum y_i = \bar{y} \end{equation}\]

and

\[\begin{equation} \hat \sigma ^2 = \frac{1}{n}\sum (y_i-\bar{y})^2 \end{equation}\]

As mentioned earlier, for the variance you will sometimes see the unbiased estimate used instead (this is what R's var() function computes), but for large sample sizes the difference is not important:

\[\begin{equation} \hat \sigma ^2 = \frac{1}{n-1}\sum (y_i-\bar{y})^2 \end{equation}\]
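The following sketch illustrates the point: for a reasonably large simulated sample, the MLE of the variance (dividing by \(n\)) and R's var() (dividing by \(n-1\)) give nearly identical answers.

## simulate a reasonably large sample:
y <- rnorm(10000, mean = 0, sd = 1)
## MLE of the variance (divides by n):
mean((y - mean(y))^2)
## unbiased estimate (divides by n - 1), as computed by var():
var(y)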

Now we come to a crucial difference between the discrete and continuous cases discussed above. The dbinom function is the PMF, but it is also a likelihood function when seen as a function of \(\theta\). Once we have fixed the parameter \(\theta\) to a particular value, dbinom gives us the probability of a particular outcome. Given some data, \(k\) successes in \(n\) trials from a Binomial distribution, and treating \(\theta\) as a variable ranging between 0 and 1, dbinom gives us the likelihood.

The situation is slightly different for a continuous PDF. Taking the normal distribution as an example, the dnorm function \(f(y|\mu,\sigma)\) does not give us the probability of a point value, but rather the density at that point. When the function \(f(y|\mu,\sigma)\) is treated as a function of the parameters, it gives us the likelihood. We need to make a careful distinction between the words probability and likelihood; in day-to-day usage the two words are used interchangeably, but here they have different technical meanings.

A final point to note is that a likelihood function is not a PDF: the area under the likelihood curve need not be 1. By contrast, the area under a PDF must be 1.
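As an illustration (a sketch using the binomial example from above, with 3 successes in 10 trials), integrating the binomial likelihood over \(\theta\) gives \(1/(n+1) = 1/11\), not 1:

## area under the binomial likelihood, viewed as a function of theta;
## this works out to 1/11, not 1:
integrate(function(theta) dbinom(3, size = 10, prob = theta),
  lower = 0, upper = 1
)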

1.5.1 The importance of the MLE

The significance of the MLE is that, having assumed a particular underlying PMF/PDF, we can estimate the (unknown) parameters of the distribution that we assume to have generated our data. For example, in the Binomial case we have a formula for computing the MLE of \(\theta\) (and hence of the mean and variance); for the Normal distribution, we have formulas for computing the MLEs of the mean and the variance.

What are the distributional properties of the mean under repeated sampling? This question forms the basis for hypothesis testing in the frequentist paradigm, and we turn to it in the next chapter.

Basic random variable theory: A formal statement

We can summarize the above informal concepts relating to random variables very compactly if we re-state them in mathematical form. A mathematical statement has the advantage not only of brevity but also of reducing ambiguity.

Formally, a random variable \(Y\) is defined as a function from a sample space of possible outcomes \(S\) to the real number system:

\[\begin{equation} Y : S \rightarrow \mathbb{R} \end{equation}\]

The random variable associates with each outcome \(\omega \in S\) exactly one number \(Y(\omega) = y\). The set \(S_Y\) consists of all the \(y\)'s, that is, all the possible values of \(Y\); this set is called the support of \(Y\), and we write \(y \in S_Y\).

Every random variable \(Y\) has associated with it a probability mass function (PMF) or a probability density function (PDF): the PMF is used for discrete distributions and the PDF for continuous distributions. The PMF maps every element of \(S_Y\) to a value between 0 and 1. The PDF assigns a probability between 0 and 1 to any range of values in the support of \(Y\) (e.g., \(P(a \leq Y\leq b) \in [0, 1]\)). For the PMF we can write:

\[\begin{equation} p_Y : S_Y \rightarrow [0, 1] \end{equation}\]

Probability mass functions (discrete case) and probability density functions (continuous case) are functions that assign probabilities or relative frequencies to events in a sample space.
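One basic property (implicit in the above) is that the probabilities over the whole support total 1. As a quick sketch in R, using the Binomial and Normal distributions as examples:

## a PMF sums to 1 over the support of Y:
sum(dbinom(0:10, size = 10, prob = 0.5))
## a PDF integrates to 1 over the support of Y:
integrate(dnorm, lower = -Inf, upper = Inf)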

The expression

\[\begin{equation} Y \sim f(\cdot) \end{equation}\]

will be used to mean that the random variable \(Y\) has PDF/PMF \(f(\cdot)\). For example, if we say that \(Y \sim Binomial(n,\theta)\), then we are asserting that the PMF is:

\[\begin{equation} \hbox{Binomial}(k|n,\theta) = \binom{n}{k} \theta^{k} (1-\theta)^{n-k} \end{equation}\]

If we say that \(Y\sim Normal(\mu,\sigma)\), we are asserting that the PDF is

\[\begin{equation} Normal(y|\mu,\sigma)= \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left(-\frac{(y-\mu)^2}{2\sigma^2} \right) \end{equation}\]
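These are exactly the functions that dbinom and dnorm compute. As a sketch, we can check this by coding the formulas directly (the particular values of \(k\), \(n\), \(\theta\), \(y\), \(\mu\), and \(\sigma\) below are arbitrary choices for illustration):

## Binomial PMF written out, compared with dbinom:
n <- 10; k <- 3; theta <- 0.4
choose(n, k) * theta^k * (1 - theta)^(n - k)
dbinom(k, size = n, prob = theta)
## Normal PDF written out, compared with dnorm:
y <- 1; mu <- 0; sigma <- 1
1 / sqrt(2 * pi * sigma^2) * exp(-(y - mu)^2 / (2 * sigma^2))
dnorm(y, mean = mu, sd = sigma)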

The cumulative distribution function or CDF is defined as follows:

For discrete distributions, the probability that \(Y\) is less than or equal to \(a\) is written:

\[\begin{equation} P(Y\leq a) = F(a) =\sum_{y \leq a} f(y) \end{equation}\]
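In R (a sketch for the Binomial(10, 0.5) example), this sum is exactly what pbinom computes:

## CDF at 3 as the sum of the PMF values for y = 0, 1, 2, 3:
sum(dbinom(0:3, size = 10, prob = 0.5))
## pbinom computes the same quantity directly:
pbinom(3, size = 10, prob = 0.5)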

For continuous distributions, the summation symbol \(\sum\) above is replaced by its continuous counterpart, the integral \(\int\). The lower and upper bounds are marked by adding a subscript and a superscript to the integral sign. For example, if we want the area under the curve of some function \(f(y)\) between the points \(a\) and \(b\) (with \(a < b\)), we write \(\int_a^b f(y)\, dy\). So, if we want the probability that \(Y\) is less than \(a\), we would write:

\[\begin{equation} P(Y<a) = F(Y<a) =\int_{-\infty}^{a} f(y)\, dy \end{equation}\]

The above integral is simply summing up the area under the curve between the points \(-\infty\) and \(a\); this gives us the probability of observing \(a\) or a value smaller than \(a\).

We can use the complementary cumulative distribution function to compute quantities like \(P(Y>a)\) by computing \(1-F(a)\), and the quantity \(P(a\leq Y\leq b)\) by computing \(F(b)-F(a)\), where \(b>a\).
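As a sketch using the standard Normal distribution, pnorm computes the CDF \(F\), and the quantities above follow directly (the values a = 1 and b = 2 are arbitrary):

a <- 1
b <- 2
## P(Y < a), via the CDF and via integrating the PDF:
pnorm(a)
integrate(dnorm, lower = -Inf, upper = a)
## P(Y > a), via the complementary CDF:
1 - pnorm(a)
## P(a <= Y <= b):
pnorm(b) - pnorm(a)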

A final point here is that we can go back and forth between the PDF and the CDF. If the PDF is \(f(y)\), then the CDF that allows us to compute quantities like \(P(Y<b)\) is just the integral:

\[\begin{equation} F(Y<b)=\int_{-\infty}^b f(y)\, dy \end{equation}\]

The above simply computes the area under the curve \(f(y)\) from \(-\infty\) up to \(b\).

Because differentiation is the opposite of integration (this is called the Fundamental Theorem of Calculus), if we differentiate the CDF, we get the PDF back:

\[\begin{equation} \frac{d F(y)}{dy}=f(y) \end{equation}\]
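We can see this numerically with a finite-difference sketch: the slope of pnorm at a point is (approximately) dnorm at that point.

## approximate the derivative of the CDF at y = 0.5 ...
y <- 0.5
h <- 1e-6
(pnorm(y + h) - pnorm(y - h)) / (2 * h)
## ... which should be close to the density at that point:
dnorm(y)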

In bivariate distributions, the joint CDF is written \(F_{X,Y}(a,b)=P(X\leq a, Y\leq b)\), where \(-\infty < a,b<\infty\). The marginal distributions \(F_X\) and \(F_Y\) are the CDFs of the individual random variables. The CDF of \(X\):

\[\begin{equation} F_X(a) = P(X\leq a) = F_{X,Y}(a,\infty) \end{equation}\]

The CDF of \(Y\):

\[\begin{equation} F_Y(b) = P(Y\leq b) = F_{X,Y}(\infty,b) \end{equation}\]

\(f(x,y)\) is the joint PDF of \(X\) and \(Y\). Every joint PDF satisfies

\[\begin{equation} f(x,y)\geq 0\mbox{ for all }(x,y)\in S_{X,Y}, \end{equation}\] and \[\begin{equation} \int \int_{S_{X,Y}}f(x,y)\,\mathrm{d} x\,\mathrm{d} y=1. \end{equation}\]

where \(S_{X,Y}\) is the joint support of the two random variables.

If \(X\) and \(Y\) are jointly continuous, then they are individually continuous as well, and their marginal PDFs can be obtained from the joint PDF. For any set \(A\),

\[\begin{equation} \begin{split} P(X\in A) = & P(X\in A, Y\in (-\infty,\infty)) \\ = & \int_A \int_{-\infty}^{\infty} f(x,y)\,dy\, dx\\ = & \int_A f_X(x)\, dx \end{split} \end{equation}\]

where

\[\begin{equation} f_X(x) = \int_{-\infty}^{\infty} f(x,y)\, dy \end{equation}\]

Similarly:

\[\begin{equation} f_Y(y) = \int_{-\infty}^{\infty} f(x,y)\, dx \end{equation}\]
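As a concluding sketch, assume (purely for illustration) that \(X\) and \(Y\) are independent standard Normal random variables, so that the joint PDF factorizes as dnorm(x) * dnorm(y); integrating out \(y\) then recovers the marginal density of \(X\):

## an assumed joint density: X and Y independent standard Normals
f_xy <- function(x, y) dnorm(x) * dnorm(y)
## marginal density of X at x = 1, obtained by integrating out y:
integrate(function(y) f_xy(x = 1, y = y), lower = -Inf, upper = Inf)
## this should match the standard Normal density at 1:
dnorm(1)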