2.8 Common mistakes involving the (paired) t-test

2.8.1 Ignoring the independence assumption

As also discussed above, the paired t-test assumes that each pair of data points that goes into the t-test is independent of the other pairs of data points. This implies that the data-frame cannot have more than one row for a particular pair. In other words, the data-frame cannot have repeated measurements. Why not? Because the assumption is that in each condition, each row is independent of the others.

Consider a hypothetical example. In the table below, from subject 1 we see two data points each for condition a and for condition b.

Here, we have repeated measurements from subject 1 for condition a and condition b. The independence assumption is violated. Erroneous analyses using t-tests abound in experimental science; for examples of these kinds of misuse of the t-test in phonetics, see Nicenboim, Roettger, and Vasishth (2018) and the references cited there.
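A minimal sketch of such a data-frame may help make the problem concrete. The values and the names `dat` and `y` below are invented for illustration; they are not from any real study. The xtabs command counts the rows in each subject-by-condition cell, so any count above 1 flags repeated measurements:

```r
## Hypothetical data: subject 1 contributes two rows per condition,
## subjects 2 and 3 contribute one row per condition.
dat <- data.frame(
  subject   = c(1, 1, 1, 1, 2, 2, 3, 3),
  condition = c("a", "a", "b", "b", "a", "b", "a", "b"),
  y         = c(320, 350, 410, 390, 300, 360, 280, 330)
)
## Count the rows in each subject-by-condition cell:
counts <- xtabs(~ subject + condition, dat)
counts
## Any cell count above 1 signals repeated measurements:
repeated <- any(counts > 1)
repeated
## Which subjects deliver repeated measurements?
who <- rownames(counts)[apply(counts > 1, 1, any)]
who
```

In this invented example only subject 1 is flagged; a paired t-test on `dat` as it stands would treat that subject's two rows per condition as if they came from two independent subjects.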

How should we proceed when we have repeated measurements from each subject or each item? The solution is to aggregate the data so that each subject (or item) has only one value for each condition.

A by-subject aggregation allows us to meet the independence assumption of the t-test, but it has a potentially huge drawback: it pretends we have one measurement from each subject for each condition. Later on we will learn how to analyze unaggregated data, but if we want to do a paired t-test, we have no choice but to aggregate the data in this way.

Although an example of aggregation was already discussed above, it may be helpful to revisit the approach that is necessary using a fully worked example. Consider the following repeated measures data on subject versus object relative clauses in English; the data are from a self-paced reading study, Experiment 1 of Grodner and Gibson (2005).

Subject relative clauses are sentences like The man who was standing near the doorway laughed. Here, the phrase (called a relative clause) who was standing near the doorway modifies the noun phrase man; it is called a subject relative because the noun phrase man is the subject of the relative clause. By contrast, object relative clauses are sentences like The man who the woman was talking to near the doorway laughed; here, the man is the grammatical object of the relative clause who the woman was talking to near the doorway.

A theoretical prediction is that in English, object relatives are harder to read than subject relatives, in the relative clause verb region. We want to test this prediction. Reading time at the region of interest (i.e., the time participants spent reading the relative clause verb) is used to index processing difficulty.

First, load the data containing reading times from the region of interest (the relative clause verb):

## From the library lingpsych:
data("df_gg05e1")
head(df_gg05e1)
##    subject item condition rawRT
## 6        1    1    objgap   320
## 19       1    2   subjgap   424
## 34       1    3    objgap   309
## 49       1    4   subjgap   274
## 68       1    5    objgap   333
## 80       1    6   subjgap   266

We have repeated measurements for each condition from 42 subjects, and from 16 items arranged in a Latin square design. You can establish this by using the xtabs command. There are no missing data points:

t(xtabs(~ subject + condition, df_gg05e1))
##          subject
## condition 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
##   objgap  8 8 8 8 8 8 8 8 8  8  8  8  8  8  8  8  8  8
##   subjgap 8 8 8 8 8 8 8 8 8  8  8  8  8  8  8  8  8  8
##          subject
## condition 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
##   objgap   8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##   subjgap  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##          subject
## condition 34 35 36 37 38 39 40 41 42
##   objgap   8  8  8  8  8  8  8  8  8
##   subjgap  8  8  8  8  8  8  8  8  8
t(xtabs(~ item + condition, df_gg05e1))
##          item
## condition  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
##   objgap  21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
##   subjgap 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
##          item
## condition 16
##   objgap  21
##   subjgap 21

It is important to stress once more that it is the researcher’s responsibility to make sure that the t-test’s assumptions are met. For example, one could fit a two-sample t-test to the data as provided. The two-sample t-test can be implemented using the syntax shown below:

## incorrect t-test!
gg05incorrectres1<-t.test(rawRT ~ condition, df_gg05e1)
summary_ttest(gg05incorrectres1)
## [1] "t(431)=3.77 p=0"
## [1] "est.: 471.36 [48.98,155.59] ms"
## [2] "est.: 369.07 [48.98,155.59] ms"

This t-test is incorrect for several reasons, but the most egregious error here is that the data are paired (each subject delivers data for both conditions), and that property of the data is being ignored.

Another common mistake is to do a paired t-test on the data without checking that the data are independent in the sense discussed above. This kind of mistake happens when researchers neglect to aggregate the data. Again, the t.test function will happily return a meaningless result:

## Incorrect paired t-test!
gg05incorrectres2<-t.test(rawRT ~ condition, 
                          paired = TRUE, df_gg05e1)
summary_ttest(gg05incorrectres2)
## [1] "t(335)=4 p=0"
## [1] "est.: 102.29 [51.98,152.59] ms"

Here, the degrees of freedom indicate that we have fit the incorrect model. As mentioned above, there are 42 subjects and 16 items, and the presentation of items to subjects uses a Latin square design (each subject sees only one condition per item). The 335 degrees of freedom come from \(42\times 8=336\) pairs of data points, minus one. Why do we say \(42\times 8\) and not \(42\times 16\)? That is because each subject contributes eight differences in reading time: each subject gives us eight subject-relative data points and eight object-relative data points, and these are paired up.

For each of the 42 subjects, the t-test function in effect takes the vector of eight object-relative data points and subtracts from it the vector of eight subject-relative data points. That is how we end up with \(42\times 8=336\) differences.

The 336 data points in each condition (and hence the 336 differences) are assumed by the t-test to be independent of each other; but this cannot be the case, because each subject delivers eight data points for each condition, and these are obviously dependent (correlated) because they come from the same subject.
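The degrees-of-freedom arithmetic can be checked on simulated data with the same layout (42 subjects, eight trials per condition). This is only a sketch: the data are invented, not the Grodner and Gibson data, and the names `sim`, `subj_adj`, `wrong`, and `right` are ours. The two-vector form t.test(x, y, paired = TRUE) is used because it makes the pairing explicit.

```r
## Simulated data (not the Grodner and Gibson data): 42 subjects,
## 8 trials per condition per subject.
set.seed(1)
nsubj <- 42
ntrial <- 8
sim <- expand.grid(
  trial = 1:ntrial, subject = 1:nsubj,
  condition = c("objgap", "subjgap")
)
## Each subject has their own baseline speed, which makes the
## data points coming from one subject correlated:
subj_adj <- rnorm(nsubj, mean = 0, sd = 100)
sim$rt <- 400 + subj_adj[sim$subject] +
  ifelse(sim$condition == "objgap", 50, 0) +
  rnorm(nrow(sim), mean = 0, sd = 150)
OR <- subset(sim, condition == "objgap")$rt
SR <- subset(sim, condition == "subjgap")$rt
## Unaggregated (incorrect): 336 differences, so df = 335
wrong <- t.test(OR, SR, paired = TRUE)
## Aggregated by subject: 42 differences, so df = 41
bysubj_sim <- aggregate(rt ~ subject + condition, mean, data = sim)
ORm <- subset(bysubj_sim, condition == "objgap")$rt
SRm <- subset(bysubj_sim, condition == "subjgap")$rt
right <- t.test(ORm, SRm, paired = TRUE)
c(
  unaggregated = unname(wrong$parameter),
  aggregated = unname(right$parameter)
)
```

The degrees of freedom are stored in the `parameter` component of the object returned by t.test; checking them is a quick way to see how many data points the test believes are independent.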

What is needed is a single data point, the mean reading time for each subject and condition. That is the by-subjects t-test: the averaging is done over the items for each subject. Analogously, it is also common to compute a by-items t-test; this time, the averaging is done over the subjects for each item. If we want to use t-tests to analyze our data, two t-tests are necessary: by subjects and by items. This necessity of analyzing data both by subjects and by items is discussed by Clark (1973) in a famous article, and more recently by Yarkoni (2020) and Westfall, Nichols, and Yarkoni (2017).

Consider the by-subjects aggregation procedure below. Once the code below is run, we have only one data point for each condition and subject:

bysubj <- aggregate(rawRT ~ subject + condition,
  mean,
  data = df_gg05e1
)
t(xtabs(~ subject + condition, bysubj))
##          subject
## condition 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
##   objgap  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1
##   subjgap 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1
##          subject
## condition 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
##   objgap   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   subjgap  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##          subject
## condition 34 35 36 37 38 39 40 41 42
##   objgap   1  1  1  1  1  1  1  1  1
##   subjgap  1  1  1  1  1  1  1  1  1

Returning to the t-test: once the data are aggregated, the independence assumption of the t-test is met, and the degrees of freedom for this by-subjects analysis are now correct (\(42-1=41\)):

gg05res<-t.test(rawRT ~ condition, bysubj, 
                paired = TRUE)
summary_ttest(gg05res)
## [1] "t(41)=3.11 p=0.003"
## [1] "est.: 102.29 [35.85,168.72] ms"
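It can be useful to see that the paired t-test on aggregated data is nothing more than a one-sample t-test on one difference score per subject, with \(t = \bar{d} / (s_d/\sqrt{n})\) and \(n-1\) degrees of freedom, where \(\bar{d}\) and \(s_d\) are the mean and standard deviation of the \(n\) difference scores. The sketch below verifies this on simulated by-subject means; the values and the names `SRmeans`, `ORmeans`, and `d` are invented for illustration.

```r
## Simulated by-subject means (invented values, 42 subjects):
set.seed(2)
n <- 42
SRmeans <- rnorm(n, mean = 370, sd = 80)
ORmeans <- SRmeans + rnorm(n, mean = 100, sd = 200)
## One difference score per subject:
d <- ORmeans - SRmeans
## t-value computed by hand:
t_by_hand <- mean(d) / (sd(d) / sqrt(n))
## The same t-value from t.test(), in two equivalent ways:
t_paired <- t.test(ORmeans, SRmeans, paired = TRUE)$statistic
t_onesample <- t.test(d)$statistic
c(
  by_hand = t_by_hand,
  paired = unname(t_paired),
  one_sample = unname(t_onesample)
)
```

All three values agree, which is why the number of independent difference scores (here, one per subject) determines the degrees of freedom.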

One property of the averaged data is that the longer the average subject relative clause reading time for a particular subject, the longer that subject’s average object relative clause reading time (Figure 2.23). This is not surprising, because the mean reading times for the two conditions come from the same subject.

SRdata <- subset(bysubj, condition == "subjgap")$rawRT
ORdata <- subset(bysubj, condition == "objgap")$rawRT
plot(SRdata, ORdata)
abline(lm(ORdata ~ SRdata))

FIGURE 2.23: The correlation between average object and subject relative clause reading times for the 42 subjects in the Grodner and Gibson (2005) experiment.

The correlation is:

cor(SRdata, ORdata)
## [1] 0.5876
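This correlation is the reason the pairing matters. The variance of the difference scores is \(Var(d) = Var(OR) + Var(SR) - 2\,r\,SD(OR)\,SD(SR)\), so a positive correlation \(r\) makes the standard error of the paired comparison smaller than it would be if the pairing were ignored. The following sketch checks this identity on simulated subject means; the numbers are invented and the variable names are ours.

```r
## Simulated, positively correlated subject means (invented values):
set.seed(3)
n <- 42
SRm <- rnorm(n, mean = 370, sd = 80)
ORm <- 470 + 0.6 * (SRm - 370) + rnorm(n, mean = 0, sd = 60)
r <- cor(SRm, ORm)
## Variance of the difference scores, computed directly...
var_direct <- var(ORm - SRm)
## ...and from the two variances and the correlation:
var_formula <- var(ORm) + var(SRm) - 2 * r * sd(ORm) * sd(SRm)
## Standard error of the mean difference with the pairing respected,
## and the value one would get if the correlation term were dropped:
se_paired <- sqrt(var_direct / n)
se_ignoring <- sqrt((var(ORm) + var(SRm)) / n)
c(r = r, se_paired = se_paired, se_ignoring_pairing = se_ignoring)
```

Whenever the sample correlation is positive, `se_paired` is the smaller of the two standard errors.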

Similar to the by-subjects aggregation done above, one could do a by-items aggregation and then a by-items t-test (What should be the degrees of freedom for the by-items analysis? There are 16 items in this data set).

byitem <- aggregate(rawRT ~ item + condition,
  mean,
  data = df_gg05e1
)
t(xtabs(~ item + condition, byitem))
##          item
## condition 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
##   objgap  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1
##   subjgap 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1
gg05itemres<-t.test(rawRT ~ condition, 
                    byitem, paired = TRUE)
summary_ttest(gg05itemres)
## [1] "t(15)=3.75 p=0.002"
## [1] "est.: 102.29 [44.21,160.36] ms"

As in the by-subjects analysis, there seems to be a slight correlation between the average object and subject relative clause reading times among the items, but the sparse data (16 items) lead to a lot of variability around the fitted line (Figure 2.24).

SRdatabyitem <- subset(
  byitem,
  condition == "subjgap"
)$rawRT
ORdatabyitem <- subset(
  byitem,
  condition == "objgap"
)$rawRT
plot(SRdatabyitem, ORdatabyitem)
abline(lm(ORdatabyitem ~ SRdatabyitem))

FIGURE 2.24: The correlation between average object and subject relative clause reading times for the 16 items in the Grodner and Gibson (2005) experiment.

cor(SRdatabyitem, ORdatabyitem)
## [1] 0.1744

2.8.2 Doing a by-subjects and by-items paired t-test is generally dangerous

The paired t-test illustrated above is in general quite a dangerous way to analyze this data-set, because it ignores the fact that each subject delivers not one but eight data points per condition. Each subject’s repeated measurements will introduce a source of variance, but this source of variance is being suppressed in the by-subjects t-test, leading to a possibly over-enthusiastic t-value. Similarly, each item also delivers multiple data points, and therefore introduces a potential source of variance within each item that will be artificially ignored when we aggregate data.
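The following artificial sketch illustrates what is meant by suppressing a source of variance. Two simulated data-sets are constructed to have exactly the same subject-by-condition means but very different trial-to-trial variability; after aggregation, the by-subjects paired t-test cannot tell them apart. Everything here is invented for illustration (the data, the helper function `bysubj_t`, and the noise scaling factors), and the exact centering of the noise is a device to isolate the point, not a property of real data.

```r
set.seed(4)
nsubj <- 42
ntrial <- 8
grid <- expand.grid(
  trial = 1:ntrial, subject = 1:nsubj,
  condition = c("objgap", "subjgap")
)
## A fixed mean for each subject and condition:
subj_base <- rnorm(nsubj, mean = 400, sd = 100)
subj_eff <- rnorm(nsubj, mean = 100, sd = 150)
cellmean <- subj_base[grid$subject] +
  ifelse(grid$condition == "objgap", subj_eff[grid$subject], 0)
## Trial-level noise, centered within each subject-by-condition
## cell so that the cell means are left unchanged:
noise <- rnorm(nrow(grid))
cell <- interaction(grid$subject, grid$condition)
noise <- noise - ave(noise, cell)
quiet <- transform(grid, rt = cellmean + 10 * noise)
noisy <- transform(grid, rt = cellmean + 300 * noise)
## By-subjects paired t-test on aggregated data:
bysubj_t <- function(dat) {
  m <- aggregate(rt ~ subject + condition, mean, data = dat)
  t.test(subset(m, condition == "objgap")$rt,
    subset(m, condition == "subjgap")$rt,
    paired = TRUE
  )
}
t_quiet <- bysubj_t(quiet)
t_noisy <- bysubj_t(noisy)
## Identical t-values (up to floating point error)...
c(quiet = unname(t_quiet$statistic), noisy = unname(t_noisy$statistic))
## ...despite very different trial-level variability:
c(quiet = sd(quiet$rt - cellmean), noisy = sd(noisy$rt - cellmean))
```

The aggregated analysis returns the same t-value whether the eight measurements behind each mean are tightly clustered or widely scattered, so the uncertainty attached to each subject's mean never enters the test.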

In order to take this variability into account simultaneously, we must switch to the linear mixed model. For example, in the Fedorenko, Gibson, and Rohde (2006) data, the interaction between Noun Type and RC type is significant when we aggregate over items; a linear mixed model that takes the item-level variability into account yields a t-value that is smaller in absolute value, -1.883, with estimate -103.208 ms, 95% CI [-212.844, 6.428] ms. Compare this with the t-value we obtained with the paired t-test earlier:

summary_ttest(hardINTres)
## [1] "t(43)=-2.13 p=0.039"
## [1] "est.: -404.81 [-787.91,-21.71] ms"

Thus, if we use the p-value to check whether the effect is “significant” (\(p<0.05\)) or not (\(p>0.05\)), the conclusion can change quite dramatically depending on whether one ignores sources of variance in one’s analysis or not.

The next chapter explains how to obtain the more realistic estimates from the linear mixed model by taking both subject and item variability into account simultaneously.

2.8.3 The difference between a significant and a non-significant result need not itself be significant

A very common error in psychology and psycholinguistics is to find a significant effect in one study, then find a non-significant effect in another study that changes one variable from the first study. The conclusion then drawn is that the variable that differs between the two studies is “significant.” Gelman and Hill (2007) summarize this mistake as follows: “the difference between significant and non-significant is itself not significant.” Indeed, Nieuwenhuis, Forstmann, and Wagenmakers (2011) show that this mistake is widespread in areas like neuroscience.

A real-life example will illustrate the problem. Recall that in the Fedorenko, Gibson, and Rohde (2006) design, the hard conditions showed a significant interaction but the easy conditions showed a non-significant interaction:

summary_ttest(hardINTres)
## [1] "t(43)=-2.13 p=0.039"
## [1] "est.: -404.81 [-787.91,-21.71] ms"
summary_ttest(easyINTres)
## [1] "t(43)=-0.1 p=0.922"
## [1] "est.: -15.16 [-326.05,295.74] ms"

Is this difference between the easy and hard conditions itself significant? Not necessarily; one has to check by testing a higher-order interaction: the difference between (i) the RC type by Noun Type interaction in the hard condition, and (ii) the RC type by Noun Type interaction in the easy condition. In the present case, the higher-order interaction is not significant (\(p=0.118\) in the analysis below). If we base our conclusion on the p-value, as is the norm, the main claim in the Fedorenko, Gibson, and Rohde (2006) study is not warranted.

diff_hard<-diff_name_hard-diff_occ_hard
diff_easy<-diff_name_easy-diff_occ_easy
RCxNTxLD_INTres<-t.test(diff_hard,diff_easy,paired=TRUE)
summary_ttest(RCxNTxLD_INTres)
## [1] "t(43)=-1.59 p=0.118"
## [1] "est.: -389.65 [-882.6,103.3] ms"
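The logic of this check can be sketched with simulated data. Below, each of 44 hypothetical subjects has one interaction score (a difference of differences) in each of two load conditions; the values and the names `int_hard` and `int_easy` are invented and are not the Fedorenko, Gibson, and Rohde (2006) data. The first two tests ask whether each interaction differs from zero; only the third test asks whether the two interactions differ from each other.

```r
set.seed(5)
nsubj <- 44
## Hypothetical per-subject interaction scores in two conditions:
int_hard <- rnorm(nsubj, mean = -300, sd = 1000)
int_easy <- rnorm(nsubj, mean = -100, sd = 1000)
## Two separate tests, one per condition:
p_hard <- t.test(int_hard)$p.value
p_easy <- t.test(int_easy)$p.value
## The test that addresses the actual question: is the interaction
## in the hard condition different from that in the easy condition?
test_diff <- t.test(int_hard, int_easy, paired = TRUE)
p_diff <- test_diff$p.value
round(c(hard = p_hard, easy = p_easy, difference = p_diff), 3)
```

Whatever the first two p-values turn out to be, a claim that the two conditions differ can only rest on the third one.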

Incidentally, the above analysis can also be done with a repeated measures ANOVA, or with a linear mixed model (the linear mixed model is discussed in detail in the next chapter).

The repeated measures ANOVA yields the following:

## anova_test() and get_anova_table() are from the package rstatix:
library(rstatix)
bysubj2 <- aggregate(RT ~ subj + rctype + nountype + load,
  mean,
  data = df_fedorenko06
)
subj_anova <- anova_test(
  data = bysubj2,
  dv = RT,
  wid = subj,
  within = c(rctype, nountype, load)
)
get_anova_table(subj_anova)
## ANOVA Table (type III tests)
## 
##                 Effect DFn DFd      F        p p<.05
## 1               rctype   1  43 29.255 2.63e-06     *
## 2             nountype   1  43  5.007 3.00e-02     *
## 3                 load   1  43  2.117 1.53e-01      
## 4      rctype:nountype   1  43  2.942 9.40e-02      
## 5          rctype:load   1  43  0.098 7.56e-01      
## 6        nountype:load   1  43  3.464 7.00e-02      
## 7 rctype:nountype:load   1  43  2.541 1.18e-01      
##        ges
## 1 4.30e-02
## 2 3.00e-03
## 3 6.00e-03
## 4 3.00e-03
## 5 8.75e-05
## 6 3.00e-03
## 7 2.00e-03

Notice that the square of the t-value above is the same as the F-score (up to rounding error) in the ANOVA output: \((-1.59)^2 = 2.53\).
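The correspondence between the two tests can be checked directly: a t-distributed statistic with \(\nu\) degrees of freedom, when squared, follows an F-distribution with 1 and \(\nu\) degrees of freedom, so the two-sided p-value from the t-test equals the p-value from the F-test. A small sketch using the rounded t-value reported above (the variable names are ours):

```r
## Rounded t-value and degrees of freedom from the paired t-test
## on the three-way interaction:
t_value <- -1.59
dof <- 43
## The squared t-value, approximately 2.53:
F_from_t <- t_value^2
F_from_t
## Two-sided p-value from the t-distribution:
p_t <- 2 * pt(-abs(t_value), dof)
## p-value from the F-distribution with 1 and 43 degrees of freedom:
p_F <- pf(F_from_t, 1, dof, lower.tail = FALSE)
c(p_t = p_t, p_F = p_F)
```

The two p-values coincide; the small gap between 2.53 and the F-score of 2.541 in the ANOVA table arises only because the t-value was rounded to two decimal places before squaring.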

The linear mixed model analysis shows a slightly different t-value for the three-way interaction: -1.43, with estimate -95.765 ms, 95% CI [-229.671, 38.141] ms. But it is quite close to the statistics computed using the paired t-test.

This mistake (not explicitly checking for the interaction) is very common in psychology and psycholinguistics.

References

Clark, Herbert H. 1973. “The Language-as-Fixed-Effect Fallacy: A Critique of Language Statistics in Psychological Research.” Journal of Verbal Learning and Verbal Behavior 12 (4): 335–59.
Fedorenko, Evelina, Edward Gibson, and Douglas Rohde. 2006. “The Nature of Working Memory Capacity in Sentence Comprehension: Evidence Against Domain-Specific Working Memory Resources.” Journal of Memory and Language 54 (4): 541–53.
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press.
Grodner, Daniel, and Edward Gibson. 2005. “Consequences of the Serial Nature of Linguistic Input.” Cognitive Science 29: 261–90.
Nicenboim, Bruno, Timo B. Roettger, and Shravan Vasishth. 2018. “Using Meta-Analysis for Evidence Synthesis: The Case of Incomplete Neutralization in German.” Journal of Phonetics 70: 39–55. https://doi.org/10.1016/j.wocn.2018.06.001.
Nieuwenhuis, Sander, Birte U. Forstmann, and Eric-Jan Wagenmakers. 2011. “Erroneous Analyses of Interactions in Neuroscience: A Problem of Significance.” Nature Neuroscience 14 (9): 1105–7.
Westfall, Jacob, Thomas E. Nichols, and Tal Yarkoni. 2017. “Fixing the Stimulus-as-Fixed-Effect Fallacy in Task fMRI.” Wellcome Open Research 1 (23). https://doi.org/10.12688/wellcomeopenres.10298.2.
Yarkoni, Tal. 2020. “The Generalizability Crisis.” The Behavioral and Brain Sciences, 1–37.