2.6 The two-sample t-test vs. the paired t-test

In our running example above, we examined the case where we have a single vector of data \(y\). This led to the one-sample t-test.

Next, we consider a case where we have two vectors of data. The data-set below is from Johnson (2011). Shown below are F1 formant data (in Hertz) for different vowels produced by male and female speakers of different languages. (In a speech wave, different bands of energy centered around particular frequencies are called formants.)

##    female male vowel  language
## 1     391  339     i  W.Apache
## 2     561  512     e  W.Apache
## 3     826  670     a  W.Apache
## 4     453  427     o  W.Apache
## 5     358  291     i CAEnglish
## 6     454  406     e CAEnglish
## 7     991  706     a CAEnglish
## 8     561  439     o CAEnglish
## 9     398  324     u CAEnglish
## 10    334  307     i   Ndumbea
## 11    444  361     e   Ndumbea
## 12    796  678     a   Ndumbea
## 13    542  474     o   Ndumbea
## 14    333  311     u   Ndumbea
## 15    343  293     i      Sele
## 16    520  363     e      Sele
## 17    989  809     a      Sele
## 18    507  367     o      Sele
## 19    357  300     u      Sele

Notice that the male and female values can be seen as dependent or paired: each row belongs to the same vowel and language. Nevertheless, we can compare males’ and females’ F1 frequencies, completely ignoring this paired nature of the data. The t-test does not “know” whether these data are paired or not—it is the researcher’s job to make sure that model assumptions are met. In this case, the assumption of the t-test is that the data are independent.

Let’s ignore the paired nature of the data for now, and treat the two vectors as independent. Suppose that our null hypothesis is that there is no difference between the mean F1 for females (\(\mu_f\)) and the mean F1 for males (\(\mu_m\)). That is, our null hypothesis is \(H_0: \mu_f = \mu_m\), or equivalently, \(H_0: \mu_f - \mu_m = \delta = 0\).

This kind of design calls for a two-sample t-test.

The function call in R for a two-sample t-test is shown below. Note here that we are assuming that both the male and female F1 scores have equal variance.
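Assuming the data frame is called F1data (the name that appears in the output below), the call might look as follows; the header “Two Sample t-test” (rather than “Welch Two Sample t-test”) indicates that equal variances were assumed:

```r
## two-sample t-test, assuming equal variances in the two groups:
t.test(F1data$female, F1data$male, var.equal = TRUE)
```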

## 
##  Two Sample t-test
## 
## data:  F1data$female and F1data$male
## t = 1.5, df = 36, p-value = 0.1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -30.07 217.54
## sample estimates:
## mean of x mean of y 
##     534.6     440.9

This t-test is computing the following t-statistic:

\[\begin{equation} t=\frac{d-(\mu_f - \mu_m)}{SE} = \frac{d-0}{SE} \end{equation}\]

where \(d\) is the difference between the two sample means (here, the female mean minus the male mean), and SE is the standard error of the sampling distribution of the difference between the means; the remaining terms are familiar from earlier.

We will now do this calculation “by hand”. The only new things are the formula for the SE calculation, and the degrees of freedom for the t-distribution: \(2\times n - 2 = 2\times 19 - 2 = 36\).

The standard error for the difference in the means in the two-sample t-test is computed using this formula:

\[\begin{equation} SE_\delta = \sqrt{\frac{\hat\sigma_m^2}{n_m} + \frac{\hat\sigma_f^2}{n_f}} \end{equation}\]

Here, \(\hat\sigma_m\) is the estimated standard deviation for males, and \(\hat\sigma_f\) the estimated standard deviation for females; \(n_m\) and \(n_f\) are the respective sample sizes (here, \(n_m = n_f = 19\)).
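The hand-calculation can be sketched in R as follows (assuming the F1data data frame from above); the three printed values correspond to the three lines of output below:

```r
n <- 19  # number of vowel-language pairs in each group
## difference between the sample means (female minus male):
d <- mean(F1data$female) - mean(F1data$male)
## standard error of the difference between the means:
(SE <- sqrt(sd(F1data$female)^2 / n + sd(F1data$male)^2 / n))
## observed t-statistic:
(t_stat <- d / SE)
## two-sided p-value with 2n - 2 = 36 degrees of freedom:
(2 * pt(-abs(t_stat), df = 2 * n - 2))
```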

## [1] 61.04
## [1] 1.536
## [1] 0.1334

The output of the two-sample t-test and the hand-calculation above match up.

Now consider what will change once we take into account the fact that the data are paired. The two-sample t-test now becomes a so-called paired t-test.

For such paired data, the null hypothesis is as before: \(H_0: \delta=0\). But since each row in the data-frame is paired (the two values come from the same vowel and language), we subtract the two vectors row-wise and obtain a new vector \(d\) (not a single number \(d\) as in the two-sample t-test) containing the row-wise differences. Then, we simply run the familiar one-sample test we saw earlier:
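In R, assuming the F1data data frame from above, this amounts to:

```r
## vector of row-wise differences (female minus male):
d <- F1data$female - F1data$male
## one-sample t-test of the differences against mu = 0:
t.test(d)
```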

## 
##  One Sample t-test
## 
## data:  d
## t = 6.1, df = 18, p-value = 9e-06
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   61.48 125.99
## sample estimates:
## mean of x 
##     93.74

An alternative syntax for the paired t-test explicitly feeds the two paired vectors into the function, but one must explicitly specify that they are paired, otherwise the test is a two-sample (i.e., unpaired) t-test:
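Assuming the F1data data frame from above, the call would be:

```r
## without paired = TRUE, this would run a two-sample (unpaired) t-test:
t.test(F1data$female, F1data$male, paired = TRUE)
```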

## 
##  Paired t-test
## 
## data:  F1data$female and F1data$male
## t = 6.1, df = 18, p-value = 9e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   61.48 125.99
## sample estimates:
## mean of the differences 
##                   93.74

Incidentally, notice that the result of the paired t-test is statistically significant, unlike that of the two-sample t-test above. The null hypothesis is the same in both tests, but the two tests lead to different conclusions.

Which analysis is correct, the two-sample t-test or the paired t-test? It all depends on your assumptions about what the data represent. If you consider the data paired, for the reasons given above, then a paired test is called for. If there is no pairing (deciding this requires domain knowledge), the data can be treated as unpaired.

Next, we look at some perhaps subtle points about the paired t-test.

2.6.1 Common mistakes involving the t-test

The paired t-test assumes that each row in the data-frame is independent of the other rows. This implies that the data-frame cannot have more than one row for a particular pair. In other words, the data-frame cannot have repeated measurements spread out across rows.

For example, doing a paired t-test on this hypothetical data-frame would be incorrect:

Why? Because the assumption is that each row is independent of the others. That assumption is violated here, on the reasonable supposition that two repetitions of the same vowel from the same language will share some commonalities.

Consider another hypothetical example. In the table below, from subject 1 we see two data points each for condition a and for condition b.

Here, we again have repeated measurements from subject 1. The independence assumption is violated.

How to proceed when we have repeated measurements from each subject or each item? The solution is to aggregate the data so that each subject (or item) has only one value for each condition.

This aggregation allows us to meet the independence assumption of the t-test, but it has a potentially huge drawback: it pretends we have one measurement from each subject for each condition. Later on we will learn how to analyze unaggregated data, but if we want to do a paired t-test, we have no choice but to aggregate the data in this way.

A fully worked example will make this clear. We have repeated measures data on subject versus object relative clauses in English. Subject relative clauses are sentences like The man who was standing near the doorway laughed. Here, the phrase who was standing near the doorway (called a relative clause) modifies the noun phrase man; the sentence is called a subject relative because the noun phrase man is the subject of the relative clause. By contrast, object relative clauses are sentences like The man who the woman was talking to near the doorway laughed; here, the man is the grammatical object of the relative clause who the woman was talking to near the doorway.

The data are from a self-paced reading study reported in Grodner and Gibson (2005), their experiment 1. A theoretical prediction is that in English, object relatives are harder to read than subject relatives, in the relative clause verb region. We want to test this prediction.

First, load the data containing reading times from the region of interest (the relative clause verb):
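One way this might look is sketched below; the file name and the data frame name gg05e1 are hypothetical (only the column names are taken from the output that follows), and the file is assumed to contain only the critical-region reading times:

```r
## hypothetical file and object names:
gg05e1 <- read.csv("grodnergibson05data.csv")
head(gg05e1)
```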

##    subject item condition rawRT
## 6        1    1    objgap   320
## 19       1    2   subjgap   424
## 34       1    3    objgap   309
## 49       1    4   subjgap   274
## 68       1    5    objgap   333
## 80       1    6   subjgap   266

We have repeated measurements for each condition from the subjects, and from items. You can establish this by using the xtabs command. Notice that there are no missing data points:
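Assuming the hypothetical data frame name gg05e1 used above, the cross-tabulations can be produced like this:

```r
## counts of data points per condition, for each subject and each item:
xtabs(~ condition + subject, data = gg05e1)
xtabs(~ condition + item, data = gg05e1)
```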

##          subject
## condition 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
##   objgap  8 8 8 8 8 8 8 8 8  8  8  8  8  8  8  8  8  8
##   subjgap 8 8 8 8 8 8 8 8 8  8  8  8  8  8  8  8  8  8
##          subject
## condition 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
##   objgap   8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##   subjgap  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##          subject
## condition 34 35 36 37 38 39 40 41 42
##   objgap   8  8  8  8  8  8  8  8  8
##   subjgap  8  8  8  8  8  8  8  8  8
##          item
## condition  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
##   objgap  21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
##   subjgap 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
##          item
## condition 16
##   objgap  21
##   subjgap 21

It is important to stress once more that it is the researcher’s responsibility to make sure that the t-test’s assumptions are met. For example, nothing stops one from running a two-sample t-test on the data as provided, using the syntax shown below:
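A sketch of such a call, assuming the hypothetical data frame name gg05e1; by default, t.test does not assume equal variances, which is why the output below is labeled a Welch test:

```r
## two-sample (Welch) t-test, ignoring the paired structure of the data:
t.test(rawRT ~ condition, data = gg05e1)
```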

## 
##  Welch Two Sample t-test
## 
## data:  rawRT by condition
## t = 3.8, df = 431, p-value = 2e-04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   48.98 155.59
## sample estimates:
##  mean in group objgap mean in group subjgap 
##                 471.4                 369.1

This t-test is incorrect for several reasons, but the most egregious error here is that the data are paired (each subject delivers data for both conditions), and that property of the data is being ignored.

Another common mistake is to do a paired t-test on the data without checking that the data are independent in the sense discussed above. Again, the t.test function will happily return a meaningless result:
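A sketch of this mistaken call, again assuming the hypothetical data frame name gg05e1 (note that newer versions of R may require the two-vector form rather than the formula interface when paired = TRUE is used):

```r
## incorrect: the rows within each condition are not independent
t.test(rawRT ~ condition, data = gg05e1, paired = TRUE)
```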

## 
##  Paired t-test
## 
## data:  rawRT by condition
## t = 4, df = 335, p-value = 8e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   51.98 152.59
## sample estimates:
## mean of the differences 
##                   102.3

Here, the degrees of freedom indicate that we have fit the incorrect model. There are 42 subjects and 16 items, and the presentation of items to subjects uses a Latin square design (each subject sees only one condition per item). The 335 degrees of freedom come from \(42\times 8=336\) data points, minus one. Why do we say \(42\times 8\) and not \(42\times 16\)? Because each subject yields eight differences in reading time, not sixteen: each subject gives us eight subject-relative data points and eight object-relative data points, and these are subtracted pairwise.

For each of the 42 subjects, the t-test function internally takes the vector of eight object-relative data points and subtracts from it the corresponding vector of eight subject-relative data points. That is how we end up with \(42\times 8=336\) data points.

These 336 data points are assumed by the t-test to be independent of each other; but this cannot be the case because each subject delivers eight data points for each condition; these are obviously dependent (correlated) because they come from the same subject.

What is needed is a single data-point for each subject and condition, and for each item and condition. In order to conduct the t-test, aggregation of the data by subjects and by items is necessary.

Consider the by-subjects aggregation procedure below. Now we have only one data-point for each condition and subject:
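The aggregation can be sketched as follows, assuming the hypothetical data frame name gg05e1; the mean is taken over each subject’s eight data points per condition:

```r
## one mean rawRT per subject and condition:
gg05e1_bysubj <- aggregate(rawRT ~ subject + condition,
                           data = gg05e1, FUN = mean)
xtabs(~ condition + subject, data = gg05e1_bysubj)
```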

##          subject
## condition 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
##   objgap  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1
##   subjgap 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1
##          subject
## condition 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
##   objgap   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   subjgap  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##          subject
## condition 34 35 36 37 38 39 40 41 42
##   objgap   1  1  1  1  1  1  1  1  1
##   subjgap  1  1  1  1  1  1  1  1  1

Notice that the aggregated data are correlated: participants with longer reading times for subject relatives also tend to have longer reading times for object relatives:
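One way to compute this correlation, assuming the hypothetical data frame name gg05e1 (the by-subjects aggregation is recomputed here so that the sketch is self-contained):

```r
## one mean rawRT per subject and condition:
bysubj <- aggregate(rawRT ~ subject + condition, data = gg05e1, FUN = mean)
## correlate each subject's two condition means:
cor(subset(bysubj, condition == "subjgap")$rawRT,
    subset(bysubj, condition == "objgap")$rawRT)
```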

## [1] 0.5876

Returning to the t-test, by aggregating the data the independence assumption of the t-test is met, and the degrees of freedom for this by-subjects analysis are now correct (\(42-1=41\)):
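A sketch of this by-subjects paired t-test, again assuming the hypothetical data frame name gg05e1 (the aggregation step is repeated so that the sketch is self-contained; newer versions of R may require the two-vector form rather than the formula interface when paired = TRUE is used):

```r
## one mean rawRT per subject and condition:
bysubj <- aggregate(rawRT ~ subject + condition, data = gg05e1, FUN = mean)
## paired t-test on the aggregated data (42 pairs, so df = 41):
t.test(rawRT ~ condition, data = bysubj, paired = TRUE)
```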

## 
##  Paired t-test
## 
## data:  rawRT by condition
## t = 3.1, df = 41, p-value = 0.003
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   35.85 168.72
## sample estimates:
## mean of the differences 
##                   102.3

Similar to the by-subjects aggregation done above, one could do a by-items aggregation and then a by-items t-test. (What should the degrees of freedom be for the by-items analysis? There are 16 items in this data-set.) This is left as an exercise for the reader.

The paired t-test illustrated above is actually not the best way to analyze this data-set, because it ignores the fact that each subject delivers not one but eight data points per condition. Each subject’s repeated measurements will introduce a source of variance, but this source of variance is being suppressed in this t-test, leading to a possibly over-enthusiastic t-value. In order to take this variability into account, we must switch to the linear mixed model. But before we get to the linear mixed model, we have to consider the linear model. The next chapter turns to this topic.

References

Grodner, Daniel, and Edward Gibson. 2005. “Consequences of the Serial Nature of Linguistic Input.” Cognitive Science 29: 261–90.

Johnson, Keith. 2011. Quantitative Methods in Linguistics. John Wiley & Sons.