2.5 Hypothesis testing: The one sample t-test

With the central limit theorem and the idea of hypothetical repeated sampling behind us, we turn now to one of the simplest statistical tests that one can do with continuous data: the t-test.

Due to its simplicity, it is tempting to take only a cursory look at the t-test and move on immediately to the linear (mixed) model. This would be a mistake. The humble t-test is surprising in many ways, and holds several important lessons for us. There are subtleties in this test, and a close connection to the linear mixed model. For these reasons, it is worth slowing down and spending some time understanding this test. Once the t-test is clear, more complex statistical tests will be easier to follow, because the logic of these more complex tests will essentially be more of the same, or variations on this general theme. You will see later that the t-test can be seen as an analysis of variance or ANOVA, and that the paired t-test has an exact analog in the linear mixed model with varying intercepts.

2.5.1 The one-sample t-test

As in our running example, suppose we have a random sample \(y\) of size \(n\), and the data come from a \(N(\mu,\sigma)\) distribution, with unknown parameters \(\mu\) and \(\sigma\). An assumption of the t-test is that the data points are independent in the sense discussed at the beginning of this chapter. We can estimate \(\mu\) from the sample mean \(\bar{y}\), which we will sometimes also write as \(\hat \mu\). We can also estimate \(\sigma\) from the sample standard deviation \(s\), which we can also write as \(\hat\sigma\). These estimates in turn allow us to estimate the sampling distribution of the mean under (hypothetical) repeated sampling:

\[\begin{equation} N(\hat\mu,\frac{\hat \sigma}{\sqrt{n}}) \end{equation}\]

It is important to realize here that the above sampling distribution is only as realistic as the estimates of the mean and standard deviation parameters—if those happen to be inaccurately estimated, then the sampling distribution is not realistic either.

Assume as before that we take an independent random sample of size \(1000\) from a random variable \(Y\) that is normally distributed, with mean 500 and standard deviation 100. As usual, we begin by estimating the mean and SE:

n <- 1000
mu <- 500
sigma <- 100
## generate simulated data:
y <- rnorm(n, mean = mu, sd = sigma)
## compute summary statistics:
y_bar <- mean(y)
SE <- sd(y) / sqrt(n)

The null hypothesis significance testing (NHST) approach as practised in psychology and other areas is to set up a null hypothesis that \(\mu\) has some fixed value. Just as an example, assume that our null hypothesis is:

\[\begin{equation} H_0: \mu = 450 \end{equation}\]

This amounts to assuming that the true sampling distribution of sample means is (approximately) normally distributed and centered around 450, with the standard error estimated from the data.
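
A minimal sketch of how a figure like 2.7 could be drawn, using the estimates y_bar and SE computed above (the plotting details are our own choices, not necessarily those used for the figure):

## a sketch: plot the sampling distribution assumed under the null
## (centered at 450, with the standard error estimated from the data),
## and mark the observed sample mean with a vertical line:
x <- seq(430, 520, by = 0.1)
plot(x, dnorm(x, mean = 450, sd = SE),
  type = "l",
  xlab = "sample mean", ylab = "density",
  main = "Sampling distribution under the null"
)
abline(v = y_bar, lty = 2)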

FIGURE 2.7: The sampling distribution of the mean when the null hypothesis is that the mean is 450. Also shown is the observed sample mean.

The intuitive idea here is that

  • if the sample mean \(\bar{y}\) is “near” the hypothesized \(\mu\) (here, 450), the data are possibly (but by no means necessarily) consistent with the null hypothesis distribution.
  • if the sample mean \(\bar{y}\) is “far” from the hypothesized \(\mu\), the data are inconsistent with the null hypothesis distribution.

The terms “near” and “far” will be quantified by determining how many standard error units the sample mean is from the hypothesized mean. This way of thinking shifts the focus away from the sampling distribution above, towards the distance measured in standard error units from the null hypothesis mean.

The distance between the sample mean and the hypothesized mean can be written in SE units as follows. We will say that the sample mean is \(t\) standard errors away from the hypothesized mean:

\[\begin{equation} t \times SE = \bar{y} - \mu \end{equation}\]

If we divide both sides by the standard error, we obtain what is referred to as the observed t-statistic:

\[\begin{equation} t_{obs} = \frac{\bar{y} - \mu}{SE} \end{equation}\]

This observed t-value, an expression of the distance between the sample mean and the hypothesized mean, becomes the basis for the statistical test.

Notice that the t-value is a random variable: it is a transformation of \(\bar{Y}\), the random variable that generates the sample means. The t-value can therefore be seen as an instance of the following transformed random variable \(T\):

\[\begin{equation} T = \frac{\bar{Y} - \mu}{SE} \end{equation}\]

This random variable has a pdf associated with it, the t-distribution, which is defined in terms of its degrees of freedom \(n-1\), where \(n\) is the sample size; the pdf is written \(t(n-1)\). Under repeated sampling, the t-distribution is generated from this random variable \(T\).

Figure 2.8 visualizes this: sample 10000 times from \(Y\sim Normal(\mu=450,\sigma=100)\), with sample size \(n=5\). Assume that the null hypothesis mean is \(\mu_0=450\); that is, the null hypothesis is in fact true in this case. Compute the t-statistic each time, and plot the distribution of the t-values as a histogram. A \(t(n-1)\) distribution is plotted on top of this histogram so that the two distributions can be compared; a Normal(0,1) distribution is also plotted as a broken line for comparison with the \(t(df=4)\) distribution.

set.seed(4321)
nsim <- 10000
n <- 5
mu <- 450
## null hypothesis mean:
mu0 <- 450
sigma <- 100
tval <- rep(NA, nsim)
se <- sigma / sqrt(n)
for (i in 1:nsim) {
  y <- rnorm(n, mean = mu, sd = sigma)
  xbar <- mean(y)
  tval[i] <- (xbar - mu0) / se
}

hist(tval, freq = FALSE, main = "t-value distribution")
x <- seq(-4, 4, by = 0.001)
lines(x, dt(x, df = n - 1))
lines(x, dnorm(x), lty = 2)

FIGURE 2.8: The distribution of the t-value under repeated sampling assuming that the null hypothesis is true, compared with a t(n-1) distribution (solid line) and a Normal(0,1) distribution (broken line).

We will compactly express the statement that “the observed t-value is assumed to be generated (under repeated sampling) from a t-distribution with n-1 degrees of freedom” as:

\[\begin{equation} T \sim t(n-1) \end{equation}\]

For large \(n\), the pdf of the random variable \(T\) approaches \(N(0,1)\). This is illustrated in Figure 2.9; notice that the t-distribution has fatter tails than the normal for small \(n\), say \(n<20\), but for larger n, the t-distribution and the normal become increasingly similar in shape. Incidentally, when n=2, the t-distribution \(t(1)\) is the Cauchy distribution we saw earlier; this distribution is characterized by fat tails, and has no mean or variance defined for it.
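
A minimal sketch of how such a comparison could be drawn (the particular degrees of freedom shown are our own choice):

## a sketch: overlay t-densities with increasing degrees of freedom
## on the standard normal density (broken line):
x <- seq(-4, 4, by = 0.01)
plot(x, dnorm(x),
  type = "l", lty = 2,
  ylab = "density", main = "t(df) versus N(0,1)"
)
for (df in c(1, 5, 20, 50)) {
  lines(x, dt(x, df = df))
}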

FIGURE 2.9: A visual comparison of the t-distribution (with degrees of freedom ranging from 1 to 50) with the standard normal distribution (N(0,1)). The dashed line represents the standard normal distribution, and the solid line the t-distribution with the relevant degrees of freedom.

Thus, given a sample size \(n\), we can define a t-distribution corresponding to the null hypothesis distribution. For large values of \(n\), we could even use \(N(0,1)\), although it is traditional in psychology and linguistics to always use the t-distribution no matter how large \(n\) is.

The null hypothesis testing procedure proceeds as follows:

  • Define the null hypothesis: in our example, the null hypothesis was that \(\mu = 450\). This amounts to making a commitment about what fixed value we think the true underlying distribution of sample means is centered at.
  • Given data of size \(n\), compute the sample mean \(\bar{y}\) and the standard deviation \(s\), and from these, estimate the standard error \(s/\sqrt{n}\). The standard error will be used as the sampling distribution’s standard deviation.
  • Compute the observed t-value:

\[\begin{equation} t=\frac{\bar{y}-\mu}{s/\sqrt{n}} \end{equation}\]

  • Reject null hypothesis if the observed t-value is “large” (to be made more precise next).
  • Fail to reject the null hypothesis, or (under some conditions, to be made clear later) even go so far as to accept the null hypothesis, if the observed t-value is “small”.

What constitutes a large or small observed t-value? Intuitively, the t-value from the sample is large when we end up far in either tail of the distribution. The two tails of the t-distribution will be referred to as the rejection region. The word region here refers to the real number line along the x-axis, under the tails of the distribution. The rejection region will go off to infinity on the outer sides, and is demarcated by a vertical line on the inner side of each tail. This is shown in Figure 2.10. It goes off to infinity because the support—the range of possible values—of the random variable that the t-distribution belongs to stretches from minus infinity to plus infinity.

FIGURE 2.10: The rejection region in a t-distribution (sample size 20) assuming that the null hypothesis is true. The rejection region is the x-axis under the gray-colored area.

The location of the vertical lines is determined by the absolute critical t-value along the x-axis of the t-distribution. This is the value such that the area under the curve in each tail, beyond the critical value, is 0.025. As discussed in chapter 1, this area under the curve represents the probability of observing a value as extreme as the critical t-value, or some value more extreme. Notice that if we ask ourselves what the probability is of observing some particular t-value (a point value), the answer must necessarily be \(0\) (if you are unclear about why, re-read chapter 1). But we can ask the question: what is the positive value \(t_{crit}\) such that \(P(|T|>t_{crit})=0.05\)? That’s the absolute critical t-value. We call it the critical t-value because it demarcates the rejection region shown in Figure 2.10: we are adopting the convention that any observed t-value that is larger in absolute value than this critical t-value allows us to reject the null hypothesis.

For a given sample size \(n\), we can identify the rejection region by using the qt function, whose usage is analogous to the qnorm function discussed in chapter 1.

Because the shape of the t-distribution depends on the degrees of freedom (\(n-1\)), the absolute critical t-value beyond which we reject the null hypothesis changes with the sample size. For large sample sizes, say \(n>50\), the critical value is approximately 2.

abs(qt(0.025, df = 15))
## [1] 2.131
abs(qt(0.025, df = 50))
## [1] 2.009
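
As a quick check, the two tail areas beyond the critical value do indeed add up to the Type I error probability of 0.05; a small sketch using df = 50 as in the last call above, verified with the pt function:

t_crit <- abs(qt(0.025, df = 50))
## total area in the two tails beyond the critical value:
2 * pt(t_crit, df = 50, lower.tail = FALSE)
## [1] 0.05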

Consider the observed t-value from our sample in our running example:

## null hypothesis mean:
mu <- 450
(t_value <- (y_bar - mu) / SE)
## [1] 17.43

This observed t-value is huge; it expresses the distance of the sample mean from the null hypothesis mean \(\mu_0\) in standard error units:

\[\begin{equation} t=\frac{\bar{y}-\mu_0}{s/\sqrt{n}} \hbox{ or } t\times \frac{s}{\sqrt{n}}=\bar{y}-\mu_0 \end{equation}\]

For large sample sizes, if the absolute t-value \(|t|\) is greater than \(2\), we will reject the null hypothesis.

For a smaller sample size, say \(n=20\), you can compute the exact critical t-value:

qt(0.025, df = 20 - 1)
## [1] -2.093

Why is this critical t-value negative in sign? That is because it is on the left-hand side of the t-distribution, which is symmetric. The corresponding value on the right-hand side is:

qt(0.975, df = 20 - 1)
## [1] 2.093

These values are of course identical if we ignore the sign. This is why we always frame our discussion around the absolute t-value.

In R, the built-in function t.test delivers the observed t-value. Given our running example, with the null hypothesis \(\mu=450\), R returns the following:

## observed t-value from t-test function:
t.test(y, mu = 450)$statistic
##       t 
## -0.8859

The default value for the null hypothesis mean \(\mu\) in this function is 0; so if one doesn’t specify a null hypothesis mean, the statistical test is carried out with reference to the null hypothesis that \(\mu=0\). That is why the t-value shown below does not match our calculation above:

t.test(y)$statistic
##     t 
## 8.399

In the most common usage of the t-test, the null hypothesis mean will be \(0\), because usually one is comparing a difference in means between two conditions or two sets of conditions. So the above line of code will work out correctly in those cases; but if you ever have a different null hypothesis mean than \(0\), then you have to specify it in the t.test function.

So, the way that the t-test is used in psychology and related areas is to implement a binary decision: whenever we do an experiment and carry out a t-test, we either reject the null hypothesis or fail to reject it.

Recall that behind the t-test lies the assumption that the observed t-value is coming from a random variable, \(T\sim t(n-1)\). The particular t-value we observe from a particular data-set belongs to a distribution of t-values under hypothetical repeated sampling. Thus, implicit in the logic of the t-test—and indeed every frequentist statistical test—is the assumption that the experiment is in principle repeatable: the experiment can in principle be re-run as many times as we want, assuming we have the necessary resources and time.

A quick simulation of t-values under repeated sampling makes this clear. Suppose that our null hypothesis mean is \(450\), and our sample size \(n=100\). Assume that the data come from \(Normal(\mu=450,\sigma=100)\). Thus, in this case the null hypothesis is in fact true. Let’s do \(10000\) simulations, compute the sample mean each time, and then store the observed t-value. The t-distribution that results is shown in Figure 2.11.

n <- 100
nsim <- 10000
tvals <- rep(NA, nsim)
for (i in 1:nsim) {
  y <- rnorm(n, mean = 450, sd = 100)
  SE <- sd(y) / sqrt(n)
  tvals[i] <- (mean(y) - 450) / SE
}
plot(density(tvals),
  main = "Simulated t-distribution",
  xlab = "t-values under repeated sampling"
)

FIGURE 2.11: The distribution of t-values under repeated sampling. The null hypothesis is true.

What would the t-distribution look like if the null hypothesis were false? Assume now that the null hypothesis is that \(\mu=450\) as before, but that in fact the true \(\mu\) is 470. Now the null hypothesis is false. Figure 2.12 shows the t-distribution under repeated sampling. The t-distribution is now centered around 2; why? This is because if we plug the hypothesized mean (450), the true mean (470), and the standard error (\(100/\sqrt{100}=10\)) into the equation for computing the t-value, the expected value of the t-distribution (its mean) is \(2\).

\[\begin{equation} \frac{470-450}{10} = 2 \end{equation}\]

n <- 100
nsim <- 10000
tvals <- rep(NA, nsim)
for (i in 1:nsim) {
  y <- rnorm(n, mean = 470, sd = 100)
  SE <- sd(y) / sqrt(n)
  tvals[i] <- (mean(y) - 450) / SE
}
plot(density(tvals),
  main = "Simulated t-distribution",
  xlab = "t-values under repeated sampling"
)

FIGURE 2.12: The distribution of t-values under repeated sampling. The null hypothesis is false, with the true mean being 470 (the null hypothesis is that the true mean is 450).

This implicit idea of the experiment’s repeatability leads to an important aspect of the t-test: once certain assumptions about the null hypothesis and the alternative hypothesis are fixed, we can use simulation to compute the proportion of times that the null hypothesis would be rejected under repeated sampling. One has to consider two alternative possible scenarios: the null is either true, or it is false. In other words, this simulation-based approach allows us to study the t-test’s ability (at least in theory) to lead the researchers to the correct decision under (hypothetical) repeated sampling. We turn to this issue next.

2.5.2 Type I, II error, and power

When we do a hypothesis test using the t-test, the observed t-value will either fall in the rejection region, leading us to reject the null hypothesis, or it will land in the non-rejection region, leading us to fail to reject the null. For a particular experiment, that is a single, one-time event.

So suppose we have made our decision based on the observed t-value. Now, the null hypothesis can be either true or not true; we don’t know which of those two possibilities is the reality. When we decide (based on the observed t-value) that the null is true, we are asserting that the parameter \(\mu\) actually does have the hypothesized value \(\mu_0\); when we decide that the null is false, we are asserting that the parameter \(\mu\) has some specific value \(\mu_{alt}\) other than \(\mu_0\). We can represent these two alternative possible realities in a tabular form, as shown in Table ??. The two columns show the two possible worlds, one in which the null is true, and the other in which it is false. The two rows show the two possible decisions we can take based on the observed t-value: reject the null or fail to reject it.

As the table shows, we can make two kinds of mistakes:

  • Type I error or \(\alpha\): Reject the null when it’s true.
  • Type II error or \(\beta\): Accept the null when it’s false.

In psychology and related areas, the Type I error is usually fixed a priori at 0.05. This stipulated Type I error value is why the absolute critical t-value is kept at approximately \(2\); if, following the recommendations of Benjamin et al. (2018), we were to stipulate that the Type I error be 0.005, then the critical t-value would have to be set at:

abs(qt(0.0025, df = n - 1))
## [1] 2.871

This suggested change in convention hasn’t been taken up yet in cognitive science, but this could well change one day.

Type II error, the probability of incorrectly accepting the null hypothesis when it is false with some particular value for the parameter \(\mu\), is conventionally recommended (Cohen 1988) to be kept at 0.20 or lower. This implies that the probability of correctly rejecting the null hypothesis for some particular true value of \(\mu\), which is 1-Type II error, should be at least 0.80. This probability is called statistical power, or just power. Again, there is nothing special about these stipulations; they are conventions that became the norm over time.

Next, we will consider the trade-off between Type I and Type II error. For simplicity, assume that the standard error is 1 and that the null hypothesis is \(\mu=0\); under these assumptions, the observed t-value is simply the sample mean.

Consider the concrete situation where, in reality, the true value of \(\mu\) is \(2\). Because the standard error is assumed to be 1, the true \(\mu\) is also the expected value of the t-statistic. This allows us to talk about the hypothesis test in terms of the t-distribution.

As mentioned above, the null hypothesis \(H_0\) is that \(\mu=0\). Now the \(H_0\) is false because \(\mu=2\) and not \(0\). Type I and II error can be visualized graphically as shown in Figure 2.13.

FIGURE 2.13: A visualization of Type I and II error. The dark-shaded tails of the left-hand side distribution represent Type I error; and the light-colored shaded region of the right-hand side distribution represents Type II error. Power is the unshaded area under the curve in the right-hand side distribution.

To understand Figure 2.13, one has to consider two distributions side by side. First, consider the null hypothesis distribution, centered at 0. Under the null hypothesis distribution, the rejection region lies below the dark-colored tails of the distribution. The area under the curve in these dark-colored tails is the Type I error (conventionally set at 0.05) that we decide on even before we conduct the experiment and collect the data. Because the Type I error is set at 0.05, and because the t-distribution is symmetric, the area under the curve in each tail is 0.025. The absolute critical t-value helps us demarcate the boundaries of the rejection regions through the vertical lines shown in the figure. These vertical lines play a crucial role in helping us understand Type II error and power. This becomes clear when we consider the distribution representing the alternative possible value of \(\mu\), the distribution centered around 2. In this second distribution, consider now the area under the curve between the vertical lines demarcating the rejection region under the null. This area under the curve is the probability of accepting the null hypothesis when the null hypothesis is in fact false with some specific value (here, when \(\mu\) has value 2); this is the Type II error.
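
To make these quantities concrete, here is a minimal sketch of how Type II error and power could be computed under the simplifying assumptions above (standard error 1, null mean 0, true mean 2), additionally approximating the t-distribution with the standard normal:

## a sketch under the simplifying assumptions above: SE = 1, null mean 0,
## true mean 2; the t-distribution is approximated by N(0,1):
crit <- qnorm(0.975) ## approximately 1.96
## Type II error: the probability of landing between the two critical
## values when the true mean is 2 (approximately 0.48):
(typeII <- pnorm(crit, mean = 2, sd = 1) - pnorm(-crit, mean = 2, sd = 1))
## power (approximately 0.52):
1 - typeII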

Some interesting observations follow. Suppose that the true mean is in fact \(\mu=2\), as in the above illustration. Then,

  • Simply decreasing Type I error to a smaller value like 0.005 will increase Type II error, which means that power (1-Type II error) will fall.
  • Increasing sample size will squeeze the vertical lines closer to each other because standard error will go down with increasing sample size. This will reduce Type II error, and therefore increase power. Decreasing sample size will have the opposite effect.
  • If we design an experiment with a larger effect size, e.g., by setting up a stronger manipulation (concrete examples will be discussed in this book later on), our Type II error will go down, and therefore power will go up. Figure 2.14 shows a graphical visualization of a situation where the true mean is \(\mu=4\). Here, Type II error is much smaller compared to Figure 2.13, where \(\mu=2\).

FIGURE 2.14: The change in Type II error if the true mean is 4.

In summary, when we plan out an experiment, we are also required to specify the Type I and II error associated with the design. Both sources of error are within our control, at least to some extent. The Type I error we decide to use will determine our critical t-value and therefore our decision criterion for rejecting, failing to reject, or even (under certain conditions, to be discussed below) accepting the null hypothesis.

The Type II error we decide on will determine the long-run probability of incorrectly “accepting” the null hypothesis; its inverse (1-Type II error), statistical power, will determine the long-run probability of correctly rejecting the null hypothesis under the assumption that \(\mu\) has some particular value other than the null hypothesis mean.

That’s the theory anyway. In practice, researchers only rarely consider the power properties of their experiment design; the focus is almost exclusively on Type I error. The neglect of power in experiment design has had interesting consequences for theory development, as we will see later in this book. For a case study in psycholinguistics, see S. Vasishth et al. (2018).

2.5.3 How to compute power for the one-sample t-test

Power (which is 1-Type II error) is a function of three variables:

  • the effect size
  • the standard deviation
  • the sample size.

There are two ways that one can compute power in connection with the t-test: either one can use the built-in R function, power.t.test, or one can use simulation.

2.5.3.1 Power calculation using the power.t.test function

Suppose that, based on the predictions of a theoretical model or on prior data, we expect an effect size of 15 ms \(\pm 5\) ms; suppose also that prior experiments show standard deviations ranging from 100 to 300 ms. Given a particular sample size, this is enough information to compute a power curve as a function of effect size and standard deviation. See Figure 2.15 and the associated code below.

sds <- seq(100, 300, by = 1)
## power for the lower bound of the effect estimate (delta = 10):
lower <- power.t.test(
  delta = 15 - 5,
  sd = sds, n = 10,
  type = "one.sample",
  strict = TRUE
)$power
## power for the upper bound of the effect estimate (delta = 20):
upper <- power.t.test(
  delta = 15 + 5,
  sd = sds, n = 10,
  type = "one.sample",
  strict = TRUE
)$power
## power for the mean effect estimate (delta = 15):
meanval <- power.t.test(
  delta = 15,
  sd = sds, n = 10,
  type = "one.sample",
  strict = TRUE
)$power

plot(sds, meanval,
  type = "l",
  main = "Power curve (n=10)\nusing power.t.test",
  xlab = "standard deviation",
  ylab = "power",
  yaxs = "i",
  ylim = c(0.05, 0.12)
)
lines(sds, lower, lty = 2)
lines(sds, upper, lty = 2)
## label the curves with the corresponding effect sizes:
text(125, 0.053, "10")
text(150, 0.054, "15")
text(175, 0.056, "20")

FIGURE 2.15: An illustration of a power curve for 10 participants, as a function of standard deviation, and three estimates of the effect: 15, 10, and 20.

2.5.3.2 Power calculations using simulation

A calculation analogous to the one shown above using the power.t.test function can also be done with simulated data. First, generate simulated data repeatedly for each possible combination of parameter values (here, effect size and standard deviation), and compute the proportion of significant effects for each parameter combination. This can be done by defining a function that takes as input the number of simulations, the sample size, the effect size, and the standard deviation:

compute_power <- function(nsim = 100000, n = 10,
                          effect = NULL,
                          stddev = NULL) {
  ## critical t-value for the given sample size:
  crit_t <- abs(qt(0.025, df = n - 1))
  temp_power <- rep(NA, nsim)
  for (i in 1:nsim) {
    ## simulate data and record whether the null hypothesis is rejected:
    y <- rnorm(n, mean = effect, sd = stddev)
    temp_power[i] <- ifelse(abs(t.test(y)$statistic) > crit_t, 1, 0)
  }
  ## return power calculation:
  mean(temp_power)
}

Then, plot the power curves as a function of effect size and standard deviation, exactly as in Figure 2.15. Power calculations using simulations are shown in Figure 2.16. It is clear that simulation-based power estimation is going to be noisy; this is because each time we are generating simulated data and then carrying out a statistical test on it. This is no longer a closed-form mathematical calculation as done in power.t.test (this function simply implements a formula for power calculation specified for this simple case). Because the power estimates will be noisy, we show a smoothed lowess line for each effect size estimate.
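
A sketch of how these curves could be generated with the compute_power function defined above (the grid of standard deviations and the reduced number of simulations are our own choices, to keep the run time manageable):

## compute simulation-based power over a coarse grid of standard
## deviations, for each of the three effect size estimates:
sds_sim <- seq(100, 300, by = 25)
pow10 <- sapply(sds_sim, function(s) {
  compute_power(nsim = 2000, n = 10, effect = 10, stddev = s)
})
pow15 <- sapply(sds_sim, function(s) {
  compute_power(nsim = 2000, n = 10, effect = 15, stddev = s)
})
pow20 <- sapply(sds_sim, function(s) {
  compute_power(nsim = 2000, n = 10, effect = 20, stddev = s)
})
## plot lowess-smoothed power curves:
plot(lowess(sds_sim, pow15),
  type = "l",
  xlab = "standard deviation", ylab = "power",
  main = "Power curves (n=10)\nusing simulation"
)
lines(lowess(sds_sim, pow10), lty = 2)
lines(lowess(sds_sim, pow20), lty = 2)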

FIGURE 2.16: An illustration of a power curve using simulation, for 10 participants, as a function of standard deviation, and three estimates of the effect: 15, 10, and 20. The power curves are lowess-smoothed.

In the above example, simulation-based power calculation is overkill, and completely unnecessary because we have power.t.test. However, the technique shown above will be extended and will become our bread-and-butter method once we switch to power calculations for complicated linear mixed models. There, no closed form calculation can be done to compute power, at least not without oversimplifying the model; simulation will be the only practical way to calculate power.

It is important to appreciate the fact that power is a function of the effect size, sample size, and standard deviation. As these three variables change, power will change. Because we can never be sure what the true effect size is, or what the true standard deviation is, power functions (power as a function of plausible values for the relevant parameters) are much more useful than single numbers.
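
As a small illustration of this dependence, here is a sketch showing how power changes with sample size; the effect size (15) and standard deviation (150) are assumed values, chosen only for illustration:

## a sketch (assumed values): one-sample power as a function of
## sample size, holding the effect (15) and standard deviation (150) fixed:
ns <- c(10, 50, 100, 200, 400)
sapply(ns, function(nsize) {
  power.t.test(
    delta = 15, sd = 150, n = nsize,
    type = "one.sample", strict = TRUE
  )$power
})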

2.5.4 The p-value

Continuing with our t-test example, the t.test function in R will not only print out a t-value as shown above, but also a probability known as the p-value. This is the probability of obtaining the t-value that we observed, or some value more extreme, conditional on the assumption that the null hypothesis is true.

We can compute the p-value “by hand”: as done earlier, simply calculate the area under the curve that lies beyond the absolute observed t-value in each tail of the t-distribution. It is standard practice to take the tail probability on both sides of the t-distribution.

(abs_t_value <- abs(t.test(y, mu = 450)$statistic))
##     t 
## 2.498
2 * pt(abs_t_value, df = n - 1, lower.tail = FALSE)
##       t 
## 0.03396

The area from both sides of the tail is taken because it is conventional to do what is called a two-sided t-test: our null hypothesis is that \(\mu=450\), and our alternative hypothesis is two-sided: \(\mu\) is either less than \(450\) or larger than \(450\). When we reject the null hypothesis, we are accepting this alternative: that \(\mu\) could be some value other than \(450\). Notice that this alternative hypothesis is remarkably vague; we would reject the null hypothesis regardless of whether the sample mean turns out to be \(600\) or \(-600\), for example. The practical implication is that the p-value gives us the strength of the evidence against the null hypothesis; it doesn’t give us evidence in favor of a specific alternative. The p-value can lead us to reject the null hypothesis regardless of whether our sample mean is positive or negative in sign. In psychology and allied disciplines, whenever the p-value falls below \(0.05\), it is common practice to write something along the lines that “there was reliable evidence for the predicted effect.” This statement is incorrect! We only ever have evidence against the null. By looking at the sample mean and its sign and concluding that we have evidence for that specific estimate, we are making a very big leap. As we will see below, when power is low, the sample mean can be wildly far from the true mean that produced the data.

One need not have a two-sided alternative; one could have defined the alternative to be one-sided (for example, that \(\mu>450\)). In that case, one would compute only one side of the area under the curve. This kind of one-sided test is not normally done, but one can imagine situations where a one-sided test is justified (for example, when only one sign of the effect is possible, or when there is a strong theoretical reason to expect only one particular sign—positive or negative—on an effect). That said, in their scientific careers, only one of the authors of this book has ever had occasion to use a one-sided test. In this book, we will not use one-sided t-tests.
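
For completeness, a one-sided p-value would use only one tail area; a small sketch, using the y vector and the null hypothesis mean 450 from the running example (the alternative argument is a standard option of t.test):

## one-sided test with alternative mu > 450, by hand (upper tail only):
t_value <- t.test(y, mu = 450)$statistic
pt(t_value, df = length(y) - 1, lower.tail = FALSE)
## equivalently, via the alternative argument:
t.test(y, mu = 450, alternative = "greater")$p.value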

The p-value is always interpreted with reference to the pre-defined Type I error. Conventionally, we reject the null if \(p<0.05\). This is because we set the Type I error probability at 0.05. Keep in mind that Type I error probability and the p-value are two distinct things. The Type I error probability is the probability of your incorrectly rejecting the null under repeated sampling. This is not the same thing as your p-value. The latter probability will be obtained from a particular experiment, and will vary from experiment to experiment; it is a random variable. By contrast, the Type I error probability is a value we fix in advance.

2.5.5 The distribution of the p-value under the null hypothesis

We have been talking about a continuous random variable as a dependent measure, and have learnt about the standard two-sided t-test, with a point null hypothesis. When we do such a test, we usually use the p-value to decide whether to reject the null hypothesis or not.

Sometimes, you will hear statisticians (e.g., Andrew Gelman on his blog) criticize p-values by saying that the null hypothesis significance test is a “specific random number generator”. For example, a blog post from May 5, 2016 (shorturl.at/irHS9) quotes Daniel Lakeland: “A p-value is the probability of seeing data as extreme or more extreme than the result, under the assumption that the result was produced by a specific random number generator (called the null hypothesis).”

What does that sentence mean? We explain this point here because it is very helpful in understanding the p-value.

Suppose that the null hypothesis is in fact true. We will now simulate the distribution of the p-value under repeated sampling, given this assumption. See Figure 2.17.
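
Here is a sketch of such a simulation; the sample size of 100 and the standard deviation of 100 are assumed values, and the true mean equals the null hypothesis mean of 450, so the null is true:

## simulate the distribution of the p-value when the null is true:
nsim <- 10000
n <- 100
pvals_null <- rep(NA, nsim)
for (i in 1:nsim) {
  y <- rnorm(n, mean = 450, sd = 100)
  pvals_null[i] <- t.test(y, mu = 450)$p.value
}
hist(pvals_null,
  freq = FALSE,
  main = "p-values under the null",
  xlab = "p-value"
)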

FIGURE 2.17: The p-value has a uniform distribution when the null hypothesis is true; a demonstration using simulation.

When the null hypothesis is actually true, the distribution of the p-value is uniform—every value between 0 and 1 is equally likely. The practical implication of this criticism of p-values is that when we do a single experiment and obtain a p-value, if the null were in fact true, then we are effectively just using a random number generator (a draw from \(Z\sim Uniform(0,1)\)) to decide whether the effect is present or absent.

A broader implication is that we should not place our theory development exclusively at the feet of the p-value. As we discuss in this book, other considerations (such as carrying out replication attempts that yield similar estimates; the uncertainty of the estimates, and power) are more important than just looking at the p-value to make a binary decision about the null hypothesis.

Proof that the p-value is uniformly distributed under the null: The probability integral transform

The fact that the p-value comes from a uniform distribution with bounds 0 and 1 when the null hypothesis is true can be formally derived using the random variable theory in chapter 1.

Consider the fact that the p-value is a random variable; call it \(Z\). The p-value is the cumulative distribution function (CDF) of the random variable \(T\), which itself is a transformation of the random variable that represents the mean of the data \(\bar{Y}\):

\(T=(\bar{Y}-\mu)/(\sigma/\sqrt{n})\)

This random variable \(T\) itself has some CDF \(F(T)\). It is possible to show that if a random variable \(Z=F(T)\), i.e., if \(Z\) is the CDF for the random variable \(T\), then \(Z\) has a uniform distribution ranging from 0 to 1, \(Z \sim Uniform(0,1)\).

This is an amazing fact. To get a grip on this, let’s first think about the fact that when a random variable \(Z\) comes from a \(Uniform(0,1)\) distribution, then \(P(Z<z)=z\). Consider some examples:

  • when \(z=0\), then \(P(Z<0)=0\);
  • when \(z=0.25\), then \(P(Z<0.25)=0.25\);
  • when \(z=0.5\), then \(P(Z<0.5)=0.5\);
  • when \(z=0.75\), then \(P(Z<0.75)=0.75\);
  • when \(z=1\), then \(P(Z<1)=1\).

You can verify this by typing:

z <- c(0, 0.25, 0.5, 0.75, 1)
punif(z, min = 0, max = 1)
## [1] 0.00 0.25 0.50 0.75 1.00

Next, we will prove the above statement, that if a random variable \(Z=F(T)\), i.e., if \(Z\) is the CDF for a random variable \(T\), then \(Z \sim Uniform(0,1)\). The proof is actually quite astonishing and even has a name: it’s called the probability integral transform.

Suppose that \(Z\) is the CDF of a random variable \(T\): \(Z=F(T)\). Then, it follows that \(P(Z\leq z)\) can be rewritten in terms of the CDF of T: \(P(F(T)\leq z)\). Now, if we apply the inverse of the CDF (\(F^{-1}\)) to both the left and right sides of the inequality, we get \(P(F^{-1}F(T)\leq F^{-1}(z))\). But \(F^{-1}F(T)\) gives us back \(T\); this holds because if we have a one-to-one onto function \(f(x)\), then applying the inverse \(f^{-1}\) to this function gives us back \(x\).

The fact that \(F^{-1}F(T)\) gives us back \(T\) means that we can rewrite \(P(F^{-1}F(T)\leq F^{-1}(z))\) as \(P(T\leq F^{-1}(z))\). But this probability is the CDF evaluated at \(F^{-1}(z)\), i.e., \(F(F^{-1}(z))\), which simplifies to \(z\). This shows that \(P(Z\leq z) = z\); i.e., that the p-value has a uniform distribution under the null hypothesis.

The above proof is restated below compactly:

\[\begin{equation} \begin{split} P(Z\leq z) =& P(F(T)\leq z)\\ =& P(F^{-1}F(T)\leq F^{-1}(z))\\ =& P(T\leq F^{-1}(z) ) \\ =& F(F^{-1} (z))\\ =& z\\ \end{split} \end{equation}\]
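
The probability integral transform can also be checked by simulation; the following sketch (with assumed values: n = 100, true and null mean both 450, standard deviation 100) transforms simulated t-values through the CDF of the t-distribution and shows that the result is approximately uniform:

## simulate t-values under the null and apply the CDF of t(n-1) to them;
## the transformed values should be approximately Uniform(0,1):
n <- 100
tvals <- replicate(10000, {
  y <- rnorm(n, mean = 450, sd = 100)
  (mean(y) - 450) / (sd(y) / sqrt(n))
})
z <- pt(tvals, df = n - 1)
hist(z, freq = FALSE, main = "F(T) under the null", xlab = "F(T)")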

It is for this reason that statisticians like Andrew Gelman periodically point out that “the null hypothesis significance test is a specific random number generator”.

2.5.6 Type M and S error in the face of low power

Beyond Type I and II error, there are also two other kinds of error to be aware of. These are Type M(agnitude) and S(ign) error; both sources of error are closely related to statistical power.

The terms Type M and S error were introduced by Gelman et al. (2014), but the ideas have been in existence for some time (Hedges 1984; Lane and Dunlap 1978). Button et al. (2013) refer to Type M and S error as the “winner’s curse” and “the vibration of effects.” In related work, Ioannidis (2008) refers to the vibration ratio in the context of epidemiology.

Type S and M error can be illustrated with the following example. Suppose that the true effect size is believed to be \(D=15\); then, apart from statistical power, we can compute the following error rates, which are defined as follows:

  • Type S error: the probability that the sign of the effect is incorrect, given that the result is statistically significant.
  • Type M error: the expectation of the ratio of the absolute magnitude of the effect to the hypothesized true effect size, given that the result is significant. Gelman and Carlin also call this the exaggeration ratio, which is perhaps more descriptive than “Type M error”.

Suppose that a particular study has standard error \(46\) and sample size \(37\), and suppose that the true \(\mu=15\), as in the example discussed above. Then we can compute statistical power and Type S and M error through simulation in the following manner:

## probable effect size, derived from past studies:
D <- 15
## SE from the study of interest:
se <- 46
stddev <- se * sqrt(37)
nsim <- 10000
drep <- rep(NA, nsim)
for (i in 1:nsim) {
  samp <- rnorm(37, mean = D, sd = stddev)
  drep[i] <- mean(samp)
}

Power can be computed by simply determining the proportion of times that the absolute observed t-value is larger than 2:

## power: the proportion of cases where
## we reject the null hypothesis correctly:
(pow <- mean(ifelse(abs(drep / se) > 2, 1, 0)))
## [1] 0.057

Power is quite low here (we deliberately chose an example with low power to illustrate Type S and M error).

Next, we figure out which of the samples are statistically significant (which simulated values yield \(p<0.05\)). As a criterion, use a t-value of 2 to reject the null; this could have been done more precisely by working out an exact critical t-value for the given sample size.

## which results in drep are significant at alpha=0.05?
signif <- which(abs(drep / se) > 2)

Type S error is the proportion of significant cases with the wrong sign (sign error), and Type M error is the ratio by which the true effect (of \(\mu=15\)) is exaggerated in those simulations that happened to come out significant.

## Type S error rate | signif:
(types_sig <- mean(drep[signif] < 0))
## [1] 0.1754
## Type M error rate | signif:
(typem_sig <- mean(abs(drep[signif]) / D))
## [1] 7.351

In this scenario, where power is approximately 6%, whenever we get a significant effect, the probability that it has the wrong sign is a whopping 18%, and its magnitude will on average be about 7.35 times larger than the true magnitude. The practical implication is as follows.

When power is low, relying on the p-value (statistical significance) to declare an effect as being present will be misleading even if the result is statistically significant. This is because the significant effect will be based on an overestimate of the effect (Type M error), and even the sign of the effect could be wrong. This isn’t just a theoretical point; it has real-world consequences for theory development. For an example from psycholinguistics regarding this point, see S. Vasishth et al. (2018), Lena A. Jäger et al. (2020a), and Mertzen et al. (2021). In all these studies, an attempt was made to re-estimate published estimates of effects using larger sample sizes; in all cases, the larger sample sizes showed smaller estimates. In other words, the original published estimates, which were all based on low-powered studies, were over-estimates that did not replicate. A practical implication of Type M error is that “statistically significant effects” will not be “reliable” in any meaningful sense if power is low.

Another useful way to visualize Type M and S error is through a visualization that is referred to as a funnel plot. As shown in Figure 2.18, estimates obtained from low-powered studies will tend to be exaggerated (the lower part of the funnel), and as power goes up, the effect estimates start to cluster tightly around the true value of the effect.
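
A sketch of how such a funnel-type plot could be generated by simulation (all the values below are assumptions: a true effect of 15, a standard deviation of 100, and a range of sample sizes that serve as a proxy for power):

true_effect <- 15
stddev <- 100
## a range of sample sizes, each repeated many times:
ns <- rep(c(10, 50, 200, 1000, 5000), each = 200)
est <- se <- rep(NA, length(ns))
for (i in seq_along(ns)) {
  y <- rnorm(ns[i], mean = true_effect, sd = stddev)
  est[i] <- mean(y)
  se[i] <- sd(y) / sqrt(ns[i])
}
## flag the significant estimates:
sig <- abs(est / se) > 2
plot(est, ns,
  log = "y", col = ifelse(sig, "gray", "black"),
  xlab = "estimated effect", ylab = "sample size (log scale)"
)
abline(v = true_effect)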

FIGURE 2.18: An illustration of a funnel plot. Shown are repeated samples of an effect estimate under different values of power, where the true value of the effect is 15 (marked by the vertical line). Significant effects are shaded gray. The lower the power, the wider the fluctuation of the effect; under low power, it is the exaggerated effects that end up statistically significant, even though they are very biased relative to the true value. As power goes up, the effect estimates start to cluster around the true value, and significant effects are also accurate estimates of the effect. Thus, low power leads to exaggerated estimates of the effect, especially if the data are filtered by statistical significance.

What is important to appreciate here is the fact that significant effects “point to the truth” only if power is high; when power is low, null results will frequently be found even if the null is false, and those results that do turn out significant will be affected by Type M error. Thus, when power is low, all possible outcomes (significant or non-significant) from a statistical analysis based on a p-value are meaningless in the sense that non-significant effects don’t allow us to accept the null hypothesis (due to low power), and significant effects are going to be based on exaggerated estimates that don’t reflect the true value.

In many fields, it is practically impossible to conduct a high-powered study. What should one do in this situation? When reporting results that are likely based on an underpowered study, the best approach is to (i) simply report estimates with confidence intervals and not make binary decisions like “effect present/absent”, (ii) openly acknowledge the power limitation, (iii) attempt to conduct a direct replication of the effect to establish robustness, and (iv) attempt to synthesize the evidence from existing knowledge (Cumming 2014).

One can focus on reporting estimates in a paper as follows. For example, one can state that “the estimate of the mean effect was 50 ms, with 95% confidence interval [-10,110] ms. This estimate is consistent with the effect being present.” Such a wording does not make any discovery claim; it just reports that the estimate is consistent with the predicted direction. Such a reported estimate can be used in meta-analyses, facilitating cumulative acquisition of knowledge.
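
In R, the estimate and its 95% confidence interval can be read directly off the t.test output; a small sketch, for a data vector y:

## the sample mean and its 95% confidence interval:
t.test(y)$estimate
t.test(y)$conf.int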

By direct replication, we mean that the study should be run multiple times with the same materials and design but new participants, to establish whether effect estimates in the original study and the replication study are consistent with each other. Direct replications stand in contrast to conceptual replications, which are not exact repetitions of the original design, but involve some further or slightly different but related experimental manipulations. Conceptual replications are also a very useful tool for cross-validating the existence of an effect, but direct replications should be a standard way to validate the consistency of an effect.

Of course, truly direct replications are impossible to conduct because repeating a study will always differ from the original one in some way or another—the lab may differ, the protocols might differ slightly, the experimenter may be different, etc. Such between-study variability is obviously unavoidable in direct-replication attempts, but they are still worthwhile for establishing the existence of an effect. To make clearer the idea of establishing robustness through replication attempts, detailed examples of different kinds of replication attempts of published studies will be presented in this book’s example data sets.

Finally, meta-analysis is a very important tool for developing a good understanding of what has been learned in a field about a particular phenomenon. Meta-analysis has been largely neglected in psycholinguistics, with researchers classifying previous work using a voting method: researchers routinely count the number of studies that showed a significant vs. non-significant effect. As an example, Hammerly, Staub, and Dillon (2019) summarize the literature on a particular psycholinguistic phenomenon in this binary manner: “In our review, only 11 of the 22 studies that tested for the interaction indicative of the grammaticality asymmetry found a significant effect. Of the studies that ran contrasts to test for an effect of attractor number in grammatical sentences, 7 of the 20 studies found a significant effect, while all of the 16 studies that tested for effects of attractor number in ungrammatical sentences found a significant effect.” As discussed above, this kind of summary is quite meaningless if the power of the experiments is not known to be high. It is much better to summarize the literature using a meta-analysis, which reports an estimate based on all existing studies, along with an uncertainty interval. Later in the book, concrete examples of meta-analyses will be discussed.

2.5.7 Searching for significance

The NHST procedure is essentially a decision procedure: if \(p<0.05\), we reject the null hypothesis; otherwise, we fail to reject the null. Because significant results are easier to publish than non-significant results, a common approach taken by researchers (including the first author of this book, when he was a graduate student) is to run the experiment and periodically check if statistical significance has been reached. The procedure can be described as follows:

  • The experimenter gathers \(n\) data points, then checks for significance (is \(p<0.05\) or not?).
  • If the result is not significant, they get more data (say, \(n\) more data points). Then they check for significance again.

Since time and money (and patience) are limited, the researcher might decide to stop collecting data after some multiple of \(n\) have been collected.

One can simulate different scenarios here. Suppose that \(n\) is initially \(15\) subjects.
Under the standard assumptions, set Type I error probability to be \(0.05\). Suppose that the null hypothesis that \(\mu=0\) is in fact true, and that the standard deviation is \(250\) ms (assuming a reading study).

## Standard properties of the t-test:
pvals <- NULL
tstat_standard <- NULL
n <- 15
nsim <- 10000
## assume a standard dev of 250:
stddev <- 250
mn <- 0
for (i in 1:nsim) {
  samp <- rnorm(n, mean = mn, sd = stddev)
  pvals[i] <- t.test(samp)$p.value
  tstat_standard[i] <- t.test(samp)$statistic
}

Type I error rate is about 5%, consistent with our expectations:

round(mean(pvals < 0.05), 2)
## [1] 0.05

But the situation quickly deteriorates as soon as we adopt the strategy outlined above. Below, we will also track the distribution of the t-statistic.

pvals <- NULL
tstat <- NULL
## how many subjects can I run?
upper_bound <- n * 6

for (i in 1:nsim) {
  significant <- FALSE
  x <- rnorm(n, mean = mn, sd = stddev) ## take an initial sample
  while (!significant & length(x) < upper_bound) {
    if (t.test(x)$p.value > 0.05) {
      ## not significant: get more data
      x <- append(x, rnorm(n, mean = mn, sd = stddev))
    } else {
      ## significant: stop collecting data
      significant <- TRUE
    }
  }
  pvals[i] <- t.test(x)$p.value
  tstat[i] <- t.test(x)$statistic
}

Now, Type I error rate is much higher than 5%:

round(mean(pvals < 0.05), 2)
## [1] 0.16

Figure 2.19 shows the distributions of the t-statistic in the standard case versus under the above stopping rule:
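
A sketch of how the two distributions could be overlaid, using the tstat_standard and tstat vectors computed above (the plotting details are our own):

plot(density(tstat_standard),
  main = "t-statistic: fixed vs. flexible stopping",
  xlab = "t-value"
)
lines(density(tstat), lty = 2)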

FIGURE 2.19: A comparison of the distribution of t-values with an a priori fixed stopping rule (solid line), versus a flexible stopping rule conditional on finding significance (broken line).

What is important to realize here is that the inflation in the Type I error probability observed above arises because the distribution of the t-statistic is no longer a t-distribution: the flexible stopping rule produces bumps in the tails, and these raise the Type I error. This demonstrates why one should fix one’s sample size in advance, based on a power analysis, and should not deploy a stopping rule like the one above; if one uses such a stopping rule, the probability of incorrectly declaring a result statistically significant is much higher than the nominal Type I error rate of 0.05.

There can be compelling reasons to adopt the above peek-and-run strategy; e.g., if one wants to avoid exposing patients to a treatment that might turn out to be harmful. In such situations, one can run an adaptive experimental trial by correcting for Type I error inflation (Pocock 2013).

In this book, we will aim to develop a workflow whereby the sample size is fixed through power analysis, in advance of running an experiment.

References

Benjamin, Daniel J, James O Berger, Magnus Johannesson, Brian A Nosek, E-J Wagenmakers, Richard Berk, Kenneth A Bollen, et al. 2018. “Redefine Statistical Significance.” Nature Human Behaviour 2 (1): 6.
Button, Katherine S, John PA Ioannidis, Claire Mokrysz, Brian A Nosek, Jonathan Flint, Emma SJ Robinson, and Marcus R Munafò. 2013. “Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience.” Nature Reviews Neuroscience 14 (5): 365–76.
Cohen, Jacob. 1988. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum.
Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–29.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. Third. Boca Raton, FL: Chapman; Hall/CRC.
Hammerly, Christopher, Adrian Staub, and Brian W. Dillon. 2019. “The Grammaticality Asymmetry in Agreement Attraction Reflects Response Bias: Experimental and Modeling Evidence.” Cognitive Psychology 110: 70–104.
Hedges, Larry V. 1984. “Estimation of Effect Size Under Nonrandom Sampling: The Effects of Censoring Studies Yielding Statistically Insignificant Mean Differences.” Journal of Educational Statistics 9 (1): 61–85.
Ioannidis, John PA. 2008. “Why Most Discovered True Associations Are Inflated.” Epidemiology 19 (5): 640–48.
Jäger, Lena A., Daniela Mertzen, Julie A. Van Dyke, and Shravan Vasishth. 2020a. “Interference Patterns in Subject-Verb Agreement and Reflexives Revisited: A Large-Sample Study.” Journal of Memory and Language 111. https://doi.org/10.1016/j.jml.2019.104063.
Lane, David M, and William P Dunlap. 1978. “Estimating Effect Size: Bias Resulting from the Significance Criterion in Editorial Decisions.” British Journal of Mathematical and Statistical Psychology 31 (2): 107–12.
Mertzen, Daniela, Dario Paape, Brian W. Dillon, Ralf Engbert, and Shravan Vasishth. 2021. “Syntactic and Semantic Interference in Sentence Comprehension: Support from English and German eye-tracking data.”
Pocock, Stuart J. 2013. Clinical Trials: A Practical Approach. John Wiley & Sons.
Vasishth, Shravan, Daniela Mertzen, Lena A. Jäger, and Andrew Gelman. 2018. “The Statistical Significance Filter Leads to Overoptimistic Expectations of Replicability.” Journal of Memory and Language 103: 151–75. https://doi.org/10.1016/j.jml.2018.07.004.