1.2 Continuous random variables: An example using the Normal distribution
We will now revisit the idea of a random variable using a continuous distribution as an example. Imagine that you have reading time data, represented as a vector \(y\). The data \(y\) are measured in milliseconds and are assumed to come from (i.e., assumed to be generated by) a random variable \(Y\) that has as PDF the Normal distribution (this is not a realistic assumption; we will come up with a more realistic assumption later). We will write the above statement compactly as follows:
\[\begin{equation} Y \sim Normal(\mu,\sigma) \end{equation}\]
This expression should be read as: the random variable \(Y\) has PDF \(Normal(\mu,\sigma)\). The random variable \(Y\) refers to an abstract mathematical object; specific instances of observed (or simulated) data will be referred to with the lower-case equivalent of the random variable; here, \(y\) can refer to a particular data point or a vector of data (the context will make it clear whether we are talking about a single data point or a vector of data). A further important notational detail: in statistics textbooks, you will often see the Normal distribution represented in terms of the mean and the variance (\(\sigma^2\)) instead of the standard deviation \(\sigma\) that we use here. We have consciously chosen to use the standard deviation in the Normal distribution; this is because R uses the standard deviation and not the variance to define a normal distribution.
An important assumption we make here is that each data point in the vector of data \(y\) is independent of the others. Often, we express this assumption by saying that we have “independent and identically distributed data”; this is abbreviated as “i.i.d.”. One can be very explicit about this and write:
\[\begin{equation} Y \stackrel{iid}{\sim} Normal(\mu,\sigma) \end{equation}\]
The probability density function (PDF) of the Normal distribution is defined as follows:
\[\begin{equation} Normal(y|\mu,\sigma)=f(y)= \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left(-\frac{(y-\mu)^2}{2\sigma^2} \right) \end{equation}\]
Here, \(\mu\) is the mean, and \(\sigma\) is the standard deviation of the Normal distribution that the reading times have been sampled from. To display this function, we would have to decide on specific values for \(\mu\) and \(\sigma\) and then input the \(y\) value into the function to plot it. It may be instructive to write this function “by hand” and then plot it. We just choose some default values for \(\mu\) and \(\sigma\) here.
my_normal <- function(y = NULL, mu = 500, sigma = 100) {
  (1 / sqrt(2 * pi * sigma^2)) * exp(-(y - mu)^2 / (2 * sigma^2))
}
y <- seq(100, 900, by = 0.01)
plot(y, my_normal(y = y), type = "l")
Just like the dbinom function we saw earlier, there is a built-in function in R called dnorm that will allow us to display the Normal distribution. Using dnorm, we can visualize the Normal distribution for particular values of \(\mu\) and \(\sigma\) as a PDF. Analogously to the discrete example we saw earlier, we can also plot the corresponding CDF (using pnorm), and the inverse CDF (using qnorm). See Figure 1.5.
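As a sketch (not necessarily the exact code behind Figure 1.5), such plots can be produced with dnorm, pnorm, and qnorm, again assuming \(\mu = 500\) and \(\sigma = 100\); the first line also checks that our hand-written my_normal function agrees with dnorm:
# sanity check: the hand-written PDF matches R's built-in dnorm
all.equal(my_normal(y = 600), dnorm(600, mean = 500, sd = 100))
y <- seq(100, 900, by = 0.01)
plot(y, dnorm(y, mean = 500, sd = 100), type = "l")   # PDF
plot(y, pnorm(y, mean = 500, sd = 100), type = "l")   # CDF
p <- seq(0.001, 0.999, by = 0.001)
plot(p, qnorm(p, mean = 500, sd = 100), type = "l")   # inverse CDF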
In the continuous case, the PDF gives us something called the density for each possible value of our data \(y\); the “density” here is not a probability, but rather a non-negative number: the value of the function \(f(y)\) in the equation immediately above. We will return to the concept of density when we do an exercise on maximum likelihood estimation.
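To make the distinction concrete, dnorm returns the density \(f(y)\), not a probability; a density can even exceed 1 when \(\sigma\) is small (the specific values used below are purely illustrative):
dnorm(500, mean = 500, sd = 100)  # density at the mean: approximately 0.004
dnorm(0, mean = 0, sd = 0.1)      # a density can exceed 1: approximately 3.99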
As in the discrete case, the CDF tells us the probability of observing a value like \(y\) or some value less than that (written: \(P(Y<y)\)); and the inverse CDF gives us the quantile \(y\) such that \(P(Y<y)\) is some specific value between 0 and 1. The PDF, CDF, and inverse CDF are three different ways of looking at the same information.
One important fact about the normal distribution is that 95% of the probability mass is covered by approximately plus/minus 1.96 times the standard deviation about the mean. Thus, the range \(\mu\pm 1.96\times \sigma\) will cover approximately 95% of the area under the curve. We will approximate this by talking about \(\mu\pm 2\times \sigma\).
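This can be checked with pnorm, again using the Normal(500, 100) example from above:
# area within mu plus/minus 1.96 sigma
pnorm(500 + 1.96 * 100, mean = 500, sd = 100) -
  pnorm(500 - 1.96 * 100, mean = 500, sd = 100)  # approximately 0.950
# area within the rougher mu plus/minus 2 sigma range
pnorm(700, mean = 500, sd = 100) - pnorm(300, mean = 500, sd = 100)  # approximately 0.954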
As in the discrete example, the PDF, CDF, and inverse of the CDF allow us to ask questions like:
- What is the probability of observing values between \(a\) and \(b\) in a Normal distribution with mean \(\mu\) and standard deviation \(\sigma\)? For example, we can compute the probability that the random variable lies between minus infinity and 1:
pnorm(1, mean = 0, sd = 1)
## [1] 0.8413
Notice here that the probability of any point value in a PDF is always 0. This is because the probability in a continuous probability distribution is defined to be the area under the curve, and the area under the curve at any single point on the x-axis is always 0. The implication here is that we can only ask about probabilities between a range of values; e.g., the probability that \(Y\) lies between \(a\) and \(b\), or \(P(a<Y<b)\), where \(a\neq b\). Also, notice that \(P(a<Y<b)\) and \(P(a\leq Y\leq b)\) are the same probability, because \(P(Y=a)\) and \(P(Y=b)\) both equal 0.
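For instance, using the Normal(500, 100) distribution from before (the bounds 400 and 600 are just illustrative choices), the probability of lying within a range is a difference of two CDF values:
# P(400 < Y < 600) for Y ~ Normal(500, 100)
pnorm(600, mean = 500, sd = 100) - pnorm(400, mean = 500, sd = 100)  # approximately 0.68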
- What is the quantile \(q\) such that the probability of observing \(q\) or something less (or more) than it is some value \(p\)? For example, we can work out the quantile \(q\) such that the probability of observing \(q\) or something less than it is 0.975, in the Normal(500,100) distribution. Formally, we are looking for the quantile \(q\) such that \(F(q)=P(Y<q)=0.975\), i.e., \(q=F^{-1}(0.975)\).
qnorm(0.975, mean = 500, sd = 100)
## [1] 696
The above output says that the probability that the random variable is less than \(q=696\) is 97.5%.
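Because qnorm is the inverse of pnorm, feeding this quantile back into pnorm recovers the probability we started with:
q <- qnorm(0.975, mean = 500, sd = 100)
pnorm(q, mean = 500, sd = 100)  # recovers 0.975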
- Generating simulated data. Given a vector \(y\) of \(n\) independent and identically distributed data points, i.e., given that each data point is generated independently from \(Y \sim Normal(\mu,\sigma)\) for some values of the parameters \(\mu\) and \(\sigma\), the maximum likelihood estimates for the expectation and variance are
\[\begin{equation} \bar{y} = \frac{\sum_{i=1}^n y_i}{n} \end{equation}\]
\[\begin{equation} Var(y) = \frac{\sum_{i=1}^n (y_i- \bar{y})^2}{n} \end{equation}\]
As an aside, keep in mind that the formula for variance in R is slightly different: it divides by \(n-1\) instead of \(n\).
\[\begin{equation} Var(y) = \frac{\sum_{i=1}^n (y_i- \bar{y})^2}{n-1} \end{equation}\]
The reason for this difference is simply that division by \(n-1\) leads to a so-called unbiased estimate of the variance.
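The difference between the two formulas is easy to see by computing the variance “by hand” on a small made-up vector (the three values below are invented purely for illustration):
y_example <- c(480, 520, 610)  # made-up reading times, for illustration only
n <- length(y_example)
sum((y_example - mean(y_example))^2) / n        # MLE: divides by n
sum((y_example - mean(y_example))^2) / (n - 1)  # divides by n - 1; identical to var()
var(y_example)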
Let’s simulate some sample data from a Normal distribution and compute estimates of the mean and variance.
Let’s generate \(10\) data points using the rnorm function, and then compute the mean and variance from the simulated data:
y <- rnorm(10, mean = 500, sd = 100)
mean(y)
## [1] 543.1
var(y)
## [1] 8037
As mentioned earlier, depending on the sample size, the sample mean and sample variance from a particular sample may or may not be close to the true values of the respective parameters, despite the fact that these are maximum likelihood estimates. The estimates of the mean and variance from a particular sample will approach the true values of the parameters as the sample size goes to infinity.
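One informal way to see this convergence is to simulate increasingly large samples and watch the estimates settle near the true values \(\mu = 500\) and \(\sigma^2 = 100^2 = 10000\) (the seed below is an arbitrary choice, used only to make the simulation reproducible):
set.seed(42)  # arbitrary seed, for reproducibility
mean(rnorm(10, mean = 500, sd = 100))      # may be far from 500
mean(rnorm(100000, mean = 500, sd = 100))  # should be very close to 500
var(rnorm(10, mean = 500, sd = 100))       # may be far from 10000
var(rnorm(100000, mean = 500, sd = 100))   # should be very close to 10000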