2.2 The central limit theorem using simulation

Suppose we collect some data, which can be represented by a vector \(y\); this is a single sample. Given data \(y\), and assuming for concreteness that the underlying likelihood is \(Normal(\mu=500,\sigma=100)\), the sample mean and standard deviation, \(\bar{y}\) and \(s\), give us estimates of the unknown parameters \(\mu\) (the mean) and \(\sigma\) (the standard deviation) of the distribution that we assume our data come from. Figure 2.1 shows the distribution of a particular sample, where the number of data points is \(n=1000\). Note that in this example the parameters are specified by us, so they are not unknown; in a real data-collection situation, the sample mean and standard deviation are all we have as estimates of the parameters.

## sample size:
n<-1000
##  independent and identically distributed sample:
y<-rnorm(n,mean=500,sd=100)
## histogram of data:
hist(y,freq=FALSE)
## true value of mean:
abline(v=500,lwd=2)

FIGURE 2.1: A sample of data y of size n=1000, from the distribution Normal(500,100). The vertical line shows the true mean of the distribution we are sampling from.

Suppose now that you had not a single sample of size 1000 but many repeated samples. This isn’t something one can normally do in real life; we often run a single experiment or, at most, repeat the same experiment once. However, one can simulate repeated sampling easily within R. Let us take 2000 repeated samples like the one above, and save the samples in a matrix containing \(n=1000\) rows and \(k=2000\) columns, each column representing an experiment:

mu <- 500
sigma <- 100
## number of experiments:
k <- 2000
## store for data:
y_matrix <- matrix(rep(NA, n * k), ncol = k)
for (i in 1:k) {
  ## expt result with sample size n:
  y_matrix[, i] <- rnorm(n, mean = mu, sd = sigma)
}
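
Incidentally, the same matrix of simulated experiments can be generated more compactly; the following one-liner is just an alternative formulation of the loop above, not required for what follows:

## equivalent, more compact way to simulate the k experiments:
y_matrix <- replicate(k, rnorm(n, mean = mu, sd = sigma))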

Now, if we compute the mean \(\bar{y}_k\) of each of the \(k=1,\dots,2000\) experiments we just carried out, then, provided certain conditions are met, these means will be normally distributed, with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\). To understand this point, it is useful to first visualize the distribution of the means and to graphically summarize this standard deviation, which, somewhat confusingly, is called the standard error.

## compute means from each replication:
y_means <- colMeans(y_matrix)
## the mean and sd (=standard error) of the means
mean(y_means)
## [1] 500
sd(y_means)
## [1] 3.115

FIGURE 2.2: Sampling from a normal distribution (left); and the sampling distribution of the means under repeated sampling (right). The right-hand plot shows an overlaid normal distribution, and the standard deviation (standard error) as error bars.
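
A plot along the lines of the right-hand panel of Figure 2.2 can be produced as follows; this is only a minimal sketch, and the exact code used to draw the figure may differ:

## histogram of the simulated sample means, on the density scale:
hist(y_means, freq = FALSE)
## overlay the normal density implied by the CLT:
x <- seq(485, 515, by = 0.1)
lines(x, dnorm(x, mean = mu, sd = sigma / sqrt(n)))
## dashed vertical lines at one standard error on either side of the mean:
abline(v = mu + c(-1, 1) * sigma / sqrt(n), lty = 2)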

The sampling distribution of the means has a normal distribution provided two conditions are met: (a) the sample size is “large enough”, and (b) \(\mu\) and \(\sigma\) are defined for the probability density or mass function that generated the data.

The statement in the preceding sentence is called the central limit theorem (CLT). The significance of the CLT for us as researchers is that from the summary statistics computed from a single sample, we can obtain an estimate of the sampling distribution of the sample mean: \(Normal(\bar{y},s/\sqrt{n})\).
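
For example, using only the single sample \(y\) simulated at the start of this section, the estimate of the sampling distribution of the sample mean is \(Normal(\bar{y}, s/\sqrt{n})\); since \(\sigma=100\) and \(n=1000\), the estimated standard error should come out close to \(100/\sqrt{1000} \approx 3.16\):

## sample mean:
y_bar <- mean(y)
## sample standard deviation:
s <- sd(y)
## estimated standard error of the sample mean:
s / sqrt(n)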

Derivation of the sampling distribution of the sample mean

The statement that the sampling distribution of the means will be normal, with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\), can be derived formally through a simple application of random variable theory. Suppose that we gather independent and identically distributed data \(y_1, \dots, y_n\), each of which is assumed to be generated by a random variable \(Y\sim Normal(\mu,\sigma)\).

When we compute the mean \(\bar{y}\) for each sample, we are assuming that each of the means comes from a random variable \(\bar{Y}\), which is just a linear combination of values generated by instances of the random variable \(Y\); \(Y\) itself has a pdf with mean (expectation) \(\mu\) and variance \(\sigma^2\):

\[\begin{equation} \bar{Y}=\frac{1}{n} \sum_{i=1}^n Y_i = \frac{1}{n}Y_1 + \dots + \frac{1}{n}Y_n \end{equation}\]

So, the expectation of \(\bar{Y}\) is

\[\begin{equation} \begin{split} E[\bar{Y}] =& E[\frac{1}{n}Y_1 + \dots + \frac{1}{n}Y_n]\\ =& \frac{1}{n} (E[Y] + \dots + E[Y])\\ =& \frac{1}{n} (\mu + \dots + \mu)\\ =& \frac{1}{n} n\mu \\ =& \mu \\ \end{split} \end{equation}\]

And the variance of \(\bar{Y}\) is

\[\begin{equation} \begin{split} Var(\bar{Y}) =& Var(\frac{1}{n}Y_1 + \dots + \frac{1}{n} Y_n)\\ =& \frac{1}{n^2} Var(Y_1 + \dots + Y_n)\\ \end{split} \end{equation}\]

The last line above arises because the variance of a random variable \(Z\) multiplied by a constant \(a\) is \(Var(aZ) = a^2 Var(Z)\); here, \(a=1/n\), so \(a^2 = 1/n^2\). Because \(Y_1,\dots,Y_n\) are independent, we can compute the variance \(Var(Y_1 + \dots + Y_n)\) by using the fact that the variance of the sum of independent random variables is the sum of their variances. This fact gives us:

\[\begin{equation} \label{sdsmderivation} \begin{split} \frac{1}{n^2} Var(Y_1 + \dots + Y_n) =& \frac{1}{n^2} (Var(Y) + \dots + Var(Y))\\ =& \frac{1}{n^2} n Var(Y)\\ =& \frac{1}{n} Var(Y)\\ =& \frac{\sigma^2}{n}\\ \end{split} \end{equation}\]

This establishes the result stated above: the expectation (i.e., the mean) and variance of the sampling distribution of the sample mean are

\[\begin{equation} E[\bar{Y}] = \mu \quad Var(\bar{Y}) = \frac{\sigma^2}{n} \end{equation}\]

The above means that we can estimate the expectation of \(\bar{Y}\) and the variance of \(\bar{Y}\) from a single sample. (Of course, whether these estimates are accurate or not is subject to variation, as we will see below.)
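
As a quick sanity check of this derivation, the theoretical standard error \(\sigma/\sqrt{n} = 100/\sqrt{1000} \approx 3.16\) should be close to the standard deviation of the simulated sample means computed earlier:

## theoretical standard error:
sigma / sqrt(n)
## standard deviation of the simulated sample means:
sd(y_means)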

One way to state the central limit theorem is as follows. Let \(f(Y)\) be the pdf of a random variable \(Y\), and assume that the pdf has mean \(\mu\) and variance \(\sigma^2\). Then, the distribution of the mean \(\bar{Y}\) has the following form when the sample size \(n\) is large:

\[\begin{equation} \bar{Y} \sim Normal(\mu,\sigma^2/n) \end{equation}\]

For us, the practical implication of this result is huge. From a single sample of \(n\) independent data points \(y_1,\dots, y_n\), we can derive the distribution of hypothetical sample means under repeated sampling. That is, it becomes possible to say something about what the plausible and implausible values of the sample mean are under repeated sampling. This is the basis for all hypothesis testing and statistical inference in the frequentist framework that we will look at in this book.
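
As a rough illustration of this idea (not a formal confidence interval), one can use the single sample \(y\) to compute a range within which most sample means would be expected to fall under repeated sampling, namely about two estimated standard errors on either side of the sample mean:

## estimated standard error from the single sample:
se <- sd(y) / sqrt(n)
## approximate range of plausible sample means under repeated sampling:
mean(y) + c(-2, 2) * se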

Sometimes the central limit theorem is misunderstood to imply that the pdf \(f(Y)\) that is assumed to generate the data is always going to be normal. It is important to understand that there are two pdfs we are talking about here. First, there is the pdf that the data were generated from; this need not be normal. For example, you could get data from a Normal, Exponential, Gamma, or other distribution. Second, there is the sampling distribution of the sample mean under repeated sampling. It is the sampling distribution that the central limit theorem is about, not the distribution that generated the data.
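
This distinction can be checked by simulation: the following sketch generates data from an Exponential distribution (which is strongly skewed), yet the distribution of the resulting sample means still looks approximately normal. The choice of the Exponential distribution and its rate parameter here is just for illustration:

## 2000 experiments, each with 1000 data points from an Exponential(rate = 1):
exp_means <- replicate(2000, mean(rexp(1000, rate = 1)))
## the sample means are approximately normally distributed,
## centered near the true mean of the Exponential(1), which is 1:
hist(exp_means, freq = FALSE)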

Technically, the correct statement of the central limit theorem is as in Box 2.2.1.

The central limit theorem

If \(Y_1,\dots, Y_n\) are independent random samples from a random variable \(Y\) with mean \(\mu\) and standard deviation \(\sigma\), then as the sample size \(n\) approaches infinity, the distribution of the random variable

\[\begin{equation} Z = \frac{\bar{Y}-\mu}{\sigma/\sqrt{n}} \end{equation}\]

approaches the standard normal distribution (\(Normal(\mu=0,\sigma=1)\)).

For a formal proof, see p. 267 of Miller and Miller (2004).

The theorem is stated in terms of \(Z\) rather than \(\bar{Y}\), which is how we presented it above; this is because the variance of \(\bar{Y}\) approaches 0 as \(n\) approaches infinity (because \(\sigma/\sqrt{n}\) approaches 0). But the way that we present the central limit theorem above is fine if we are talking about large \(n\) (and not about \(n\) approaching infinity).
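
This statement can also be checked numerically with the simulated experiments from earlier in this section (a sketch, assuming that y_means, mu, sigma, and n are still available): the standardized sample means should have a mean close to 0 and a standard deviation close to 1.

## standardize the simulated sample means:
z <- (y_means - mu) / (sigma / sqrt(n))
## these should be close to 0 and 1, respectively:
mean(z)
sd(z)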

References

Miller, I., and M. Miller. 2004. John E. Freund’s Mathematical Statistics with Applications. Prentice Hall.