4.1 The likelihood ratio test: The theory

Suppose that \(X_1,\dots, X_n\) are independent and normally distributed with mean \(\mu\) and standard deviation \(\sigma\) (assume for simplicity that \(\sigma\) is known).

Let the null hypothesis be \(H_0: \mu=\mu_0\) and the alternative be \(H_1: \mu\neq \mu_0\). Here, \(\mu_0\) is a number, such as \(0\).

The likelihood of the data \(y\) can be computed under the null model, in which \(\mu=\mu_0\), and under the alternative model, in which \(\mu\) has some specific alternative value. To make this concrete, imagine 10 data points being generated from a Normal(0,1).

y <- rnorm(10)

We can compute the joint likelihood under a null hypothesis that \(\mu=0\):

likNULL <- prod(dnorm(y, mean = 0, sd = 1))
likNULL
## [1] 9.151e-06

On the log scale, we would need to add the log likelihoods of each data point:

loglikNULL <- sum(dnorm(y, mean = 0, sd = 1, log = TRUE))
loglikNULL
## [1] -11.6
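As a quick check, exponentiating the summed log likelihood recovers (up to floating-point error) the product we computed above:

exp(loglikNULL)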

Similarly, we can compute the log likelihood with \(\mu\) equal to the maximum likelihood estimate of \(\mu\), the sample mean.

loglikALT <- sum(dnorm(y, mean = mean(y), sd = 1, log = TRUE))
loglikALT
## [1] -11.59

Essentially, the likelihood ratio test compares the two models by taking the ratio of their likelihoods; on the log scale, this becomes a difference in log likelihoods, because a ratio turns into a difference on the log scale. The likelihood ratio test then chooses the model with the higher log likelihood, provided that the higher likelihood is high enough (we will make this more precise shortly).
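To make this concrete with the simulated data, we can compute the ratio of the two likelihoods and, equivalently, the difference of the two log likelihoods; this is just a sketch that reuses the objects computed above (the name likALT is introduced here):

likALT <- prod(dnorm(y, mean = mean(y), sd = 1))
# the likelihood ratio of the null model to the alternative model:
likNULL / likALT
# on the log scale, the ratio becomes a difference:
loglikNULL - loglikALT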

One can specify the test in general terms as follows. Suppose that the likelihood is a function of some parameter \(\theta\). We can evaluate the likelihood at \(\theta = \mu_0\), the value of the parameter under the null hypothesis, and we can evaluate the likelihood at the maximum likelihood estimate \(\hat\theta\) of the parameter, as we did above. The likelihood ratio can then be written as follows:

\[\begin{equation} \Lambda = \frac{\max_{\theta\in \omega_0}(lik(\theta))}{\max_{\theta\in \omega_1}(lik(\theta))} \end{equation}\]

where \(\omega_0=\{\mu_0\}\) and \(\omega_1=\{\mu \mid \mu\neq \mu_0\}\). The function \(\max\) simply selects the largest value of the likelihood over the allowed parameter values; in the case of the null hypothesis there is only one possible value, \(\mu_0\). In the case of the alternative model, the likelihood is maximized at the maximum likelihood estimate \(\hat\theta\).
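As an informal illustration of the maximization in the denominator, we can maximize the log likelihood over \(\mu\) numerically and confirm that the maximum is attained (approximately) at the sample mean; the function negll and the search interval below are introduced only for this sketch.

# negative log likelihood as a function of mu (sigma = 1 assumed known):
negll <- function(mu) -sum(dnorm(y, mean = mu, sd = 1, log = TRUE))
# the value of mu that minimizes the negative log likelihood is the MLE:
optimize(negll, interval = c(-10, 10))$minimum
# this should be (approximately) the sample mean:
mean(y)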

Now, assuming for simplicity that the data are coming from a normal distribution, the numerator of the likelihood ratio statistic is:

\[\begin{equation} lik(\theta=\mu_0) = \frac{1}{(\sigma\sqrt{2\pi})^n} exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu_0)^2 \right) \end{equation}\]

For the denominator, \(\mu\) is replaced by its maximum likelihood estimate, the sample mean \(\bar{X}\):

\[\begin{equation} lik(\theta=\bar{X}) =\frac{1}{(\sigma\sqrt{2\pi})^n} exp \left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \bar{X})^2 \right) \end{equation}\]

The likelihood ratio statistic is then:

\[\begin{equation} \Lambda = \frac{lik(\theta=\mu_0)}{lik(\theta=\bar{X})}= \frac{\frac{1}{(\sigma\sqrt{2\pi})^n} exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu_0)^2 \right)}{\frac{1}{(\sigma\sqrt{2\pi})^n} exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \bar{X})^2 \right)} \end{equation}\]

Canceling out common terms:

\[\begin{equation} \Lambda = \frac{exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu_0)^2 \right)}{ exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \bar{X})^2 \right)} \end{equation}\]

Taking logs and pulling out the common term:

\[\begin{equation} \begin{split} \log \Lambda =& \left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu_0)^2 \right)-\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \bar{X})^2 \right)\\ =& -\frac{1}{2\sigma^2} \left( \sum_{i=1}^n (X_i - \mu_0)^2 - \sum_{i=1}^n (X_i - \bar{X})^2 \right)\\ \end{split} \end{equation}\]

Now, it is a standard algebraic trick (see Box 4.1.1) to rewrite \(\sum_{i=1}^n (X_i -\mu_0)^2\) as a sum of two terms:

\[\begin{equation} \sum_{i=1}^n (X_i -\mu_0)^2 = \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu_0)^2 \tag{4.1} \end{equation}\]

Box 4.1.1: An algebraic trick

We are going to establish that equality (4.1) holds.

Consider this expression: \(\sum_{i=1}^n (X_i -\mu_0)^2\).

We can rewrite the expression inside the brackets by adding and subtracting \(\bar{X}\):

\[\begin{equation} \sum_{i=1}^n (X_i - \bar{X} + \bar{X} -\mu_0)^2 \end{equation}\]

Notice that \(-\bar{X} + \bar{X}\) sums to 0, so the expression remains unchanged!

Now, collect together the two pairs of terms inside the brackets as follows:

\[\begin{equation} \sum_{i=1}^n ((X_i - \bar{X}) + (\bar{X} -\mu_0))^2 \end{equation}\]

What we are doing here is treating \(X_i - \bar{X}\) as one term, say \(a_i\) (the index \(i\) is needed because \(X_i\) depends on \(i\)), and \(\bar{X} -\mu_0\) as another term, say \(b\), which does not depend on \(i\). The above expression can then be written simply as:

\[\begin{equation} \sum_{i=1}^n (a_i + b)^2 = \sum_{i=1}^n (a_i^2 + b^2 + 2a_i b) \end{equation}\]

Distributing the summation over the three terms inside the brackets, we can rewrite this as:

\[\begin{equation} \sum_{i=1}^n a_i^2 + \sum_{i=1}^nb^2 + 2\sum_{i=1}^n a_i b \end{equation}\]

Instead of using \(a_i\) and \(b\), we use the full expressions now:

\[\begin{equation} \sum_{i=1}^n ((X_i - \bar{X})^2 + (\bar{X} -\mu_0)^2 + 2 (X_i - \bar{X})(\bar{X} -\mu_0) ) \end{equation}\]

If we remove the outermost brackets, the summation \(\sum_{i=1}^n\) has to be applied to each of the terms inside:

\[\begin{equation} \sum_{i=1}^n (X_i - \bar{X})^2 +\sum_{i=1}^n (\bar{X} -\mu_0)^2 + 2\sum_{i=1}^n (X_i - \bar{X})(\bar{X} -\mu_0) \end{equation}\]

But \(\sum_{i=1}^n (\bar{X} -\mu_0)^2=n (\bar{X} -\mu_0)^2\), because \((\bar{X} -\mu_0)^2\) does not depend on \(i\); for the same reason, the constant factor \((\bar{X} -\mu_0)\) in the last term can be pulled out of its sum. So we can rewrite the expression as:

\[\begin{equation} \sum_{i=1}^n (X_i - \bar{X})^2 +n(\bar{X} -\mu_0)^2 + 2(\bar{X} -\mu_0)\sum_{i=1}^n (X_i - \bar{X}) \end{equation}\]

Also, \(\sum_{i=1}^n (X_i - \bar{X}) = 0\), because the deviations of the data \(X_i\) from their mean always sum to 0. This follows from the definition of the mean: \(\frac{\sum_{i=1}^n X_i}{n} = \bar{X}\) means that \(\sum_{i=1}^n X_i = n\bar{X}\), so that \((X_1-\bar{X}) + (X_2 - \bar{X}) + \dots + (X_n -\bar{X}) = \sum_{i=1}^n X_i - n\bar{X} = 0\).

The fact that \(\sum_{i=1}^n (X_i - \bar{X}) = 0\) means that the entire last term vanishes, which simplifies the expression to:

\[\begin{equation} \sum_{i=1}^n (X_i - \bar{X})^2 +n(\bar{X} -\mu_0)^2 \end{equation}\]

This proves that

\[\begin{equation} \sum_{i=1}^n (X_i -\mu_0)^2 = \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu_0)^2 \end{equation}\]
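This identity is easy to confirm numerically with the simulated data; the sketch below uses \(\mu_0 = 0\), and the names mu0, lhs, and rhs are introduced only for this check.

mu0 <- 0
n <- length(y)
# the deviations from the mean sum to (numerically) zero:
sum(y - mean(y))
lhs <- sum((y - mu0)^2)
rhs <- sum((y - mean(y))^2) + n * (mean(y) - mu0)^2
# the two sides agree up to floating-point error:
all.equal(lhs, rhs)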

Returning to equation (4.1), if we rearrange the terms, we obtain:

\[\begin{equation} \sum_{i=1}^n (X_i -\mu_0)^2 - \sum_{i=1}^n (X_i - \bar{X})^2 = n(\bar{X} - \mu_0)^2 \tag{4.2} \end{equation}\]

Now, we just established above that \(\log \Lambda\) is:

\[\begin{equation} \log \Lambda= -\frac{1}{2\sigma^2} \left( \sum_{i=1}^n (X_i - \mu_0)^2 - \sum_{i=1}^n (X_i - \bar{X})^2 \right) \tag{4.3} \end{equation}\]

Consider the term in the brackets:

\[\begin{equation} \left(\sum_{i=1}^n (X_i - \mu_0)^2 - \sum_{i=1}^n (X_i - \bar{X})^2\right) \end{equation}\]

Given equation (4.2), this can be rewritten as:

\[\begin{equation} n(\bar{X} - \mu_0)^2 \end{equation}\]

Substituting this back into equation (4.3), we obtain:

\[\begin{equation} \log \Lambda = -\frac{1}{2\sigma^2} n(\bar{X} - \mu_0)^2 \end{equation}\]

Rearranging terms:

\[\begin{equation} -2 \log \Lambda = \frac{n(\bar{X} - \mu_0)^2 }{\sigma^2} \end{equation}\]

Or even more transparently:

\[\begin{equation} -2 \log \Lambda = \frac{(\bar{X} - \mu_0)^2 }{\frac{\sigma^2}{n}} \end{equation}\]
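With the simulated data, we can confirm that \(-2\log \Lambda\) computed from the two log likelihoods matches this closed-form expression (here \(\mu_0 = 0\) and \(\sigma = 1\); the name minus2logLambda is introduced only for this sketch):

minus2logLambda <- -2 * (loglikNULL - loglikALT)
minus2logLambda
# the closed-form version, n * (xbar - mu0)^2 / sigma^2:
n * (mean(y) - 0)^2 / 1^2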

This should remind you of the t-test; the right-hand side of the above equation is the square of the t-statistic (with \(\sigma\) assumed known).

Basically, just as in the t-test, this says that we reject the null hypothesis when the squared distance between \(\bar{X}\) and \(\mu_0\), normalized by the squared standard error \(\sigma^2/n\), is large. In other words, we reject the null hypothesis when \(-2\log \Lambda\), minus two times the difference in log likelihood, is large.

Now we will make precise what it means for \(-2\log \Lambda\) to be large. There is a theorem in statistics which states that when the null hypothesis is true and \(n\) is large, the distribution of \(-2\log \Lambda\) approaches the chi-squared distribution, with degrees of freedom equal to the difference in the number of parameters between the two models being compared.
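As an informal check of this result, we can repeatedly simulate data under the null hypothesis and compare the distribution of \(-2\log \Lambda\) with the \(\chi_1^2\) distribution; the two models here differ by exactly one parameter (\(\mu\)), so one degree of freedom is appropriate. This simulation is only a sketch, not part of the derivation.

# simulate the distribution of -2 log Lambda under the null (mu0 = 0, sigma = 1):
nsim <- 10000
stat <- replicate(nsim, {
  ysim <- rnorm(10)
  -2 * (sum(dnorm(ysim, mean = 0, sd = 1, log = TRUE)) -
          sum(dnorm(ysim, mean = mean(ysim), sd = 1, log = TRUE)))
})
# compare, for example, the 95th percentiles:
quantile(stat, probs = 0.95)
qchisq(0.95, df = 1)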

We will define the likelihood ratio test statistic, call it \(LRT\), as follows. Here, \(Lik(\theta)\) refers to the likelihood given some value \(\theta\) for the parameter, and \(logLik(\theta)\) refers to the log likelihood.

\[\begin{equation} \begin{split} LRT =& -2\log \frac{Lik(\theta_0)}{Lik(\theta_1)} \\ =& -2\times \{logLik(\theta_0)-logLik(\theta_1)\}\\ \end{split} \end{equation}\]

where \(\theta_1\) and \(\theta_0\) are the estimates of \(\theta\) under the alternative and null hypotheses, respectively; \(LRT\) is just the \(-2\log \Lambda\) statistic derived above. The likelihood ratio test rejects \(H_0\) if \(LRT\) is sufficiently large. As the sample size approaches infinity, the distribution of \(LRT\) approaches the chi-squared distribution:

\[\begin{equation} LRT \rightarrow \chi_r^2 \hbox{ as } n \rightarrow \infty \end{equation}\]

Here, \(r\) is called the degrees of freedom and is the difference in the number of parameters under the null and alternative hypotheses.

The above result is called Wilks’ theorem. The proof of Wilks’ theorem is fairly involved but you can find it in Lehmann’s textbook Testing Statistical Hypotheses.
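In practice, the observed statistic is compared with the \(\chi_r^2\) distribution to obtain a p-value. For the simulated data, where the two models differ by one parameter (\(r=1\)), a sketch:

# probability of a chi-squared(1) value at least as large as the observed statistic:
pchisq(minus2logLambda, df = 1, lower.tail = FALSE)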

Note that sometimes you will see the equivalent form:

\[\begin{equation} LRT = 2 \{logLik(\theta_1) - logLik(\theta_0)\} \end{equation}\]

It should be clear that both statements are saying the same thing: in the second case, we are subtracting the null hypothesis log likelihood from the alternative hypothesis log likelihood, so the negative sign disappears.
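With the simulated data, both forms give the same number:

2 * (loglikALT - loglikNULL)
minus2logLambda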

That’s the theory. Let’s see how the likelihood ratio test works for (a) simulated data, and (b) our running example, the English relative clause data from Grodner and Gibson (2005).

References

Grodner, Daniel, and Edward Gibson. 2005. “Consequences of the Serial Nature of Linguistic Input.” Cognitive Science 29: 261–90.