2.6 The two-sample t-test vs. the paired t-test

In our running example above, we examined the case where we have a single vector of data \(y\). This led to the one-sample t-test.

Next, we consider a case where we have two vectors of data. The data set below is from Johnson (2011). It shows F1 (first formant) values, in Hertz, for different vowels produced by male and female speakers of different languages. (In a speech wave, the bands of energy centered around particular frequencies are called formants.)

library(lingpsych)
data("df_F1data")
F1data <- df_F1data
F1data
##    female male vowel  language
## 1     391  339     i  W.Apache
## 2     561  512     e  W.Apache
## 3     826  670     a  W.Apache
## 4     453  427     o  W.Apache
## 5     358  291     i CAEnglish
## 6     454  406     e CAEnglish
## 7     991  706     a CAEnglish
## 8     561  439     o CAEnglish
## 9     398  324     u CAEnglish
## 10    334  307     i   Ndumbea
## 11    444  361     e   Ndumbea
## 12    796  678     a   Ndumbea
## 13    542  474     o   Ndumbea
## 14    333  311     u   Ndumbea
## 15    343  293     i      Sele
## 16    520  363     e      Sele
## 17    989  809     a      Sele
## 18    507  367     o      Sele
## 19    357  300     u      Sele

Notice that each row pairs a female and a male value for the same vowel in the same language, and that each vowel and each language occurs in several rows. However, the males’ and females’ F1 frequencies can be treated as independent if we completely ignore these repeated instances of vowel and language. The t-test does not “know” whether the rows in the data are repeated or not; it is the researcher’s job to make sure that model assumptions are met. In this case, when analyzing the male vs. female data, the assumption of the t-test is that each number in the male and female vectors is independent of all the others.

So, we will treat the male and female vectors as independent. Suppose that our null hypothesis is that there is no difference between the mean F1 values for females (\(\mu_f\)) and males (\(\mu_m\)). Our null hypothesis is then \(H_0: \mu_f = \mu_m\), or equivalently \(H_0: \mu_f - \mu_m = \delta = 0\).

This kind of design calls for a two-sample t-test. The two-sample t-test stands in contrast to the paired t-test, discussed below.

Because the formatting of the t-test in R is somewhat verbose, here is a function that extracts the essential information from a t-test:

summary_ttest <- function(res, paired = TRUE, units = "ms") {
  ## extract and round the key quantities from the t.test result
  obs_t <- round(res$statistic, 2)
  dfs <- round(res$parameter)
  pval <- round(res$p.value, 3)
  ci <- round(res$conf.int, 2)
  est <- round(res$estimate, 2)
  ## the t-value, degrees of freedom, and p-value are
  ## reported in the same way for every kind of t-test
  print(paste(
    paste("t(", dfs, ")=", obs_t, sep = ""),
    paste("p=", pval, sep = "")
  ))
  if (paired) {
    ## one-sample or paired test: one estimate with its CI
    print(paste("est.: ", est, " [",
      ci[1], ",", ci[2], "] ",
      units,
      sep = ""
    ))
  } else {
    ## two-sample test: both sample means, and the CI
    ## of the difference between them
    print(paste(
      paste("est. 1: ", est[1], sep = ""),
      paste("est. 2: ", est[2], sep = ""),
      paste("CI of diff. in means: [", ci[1], ",", ci[2], "]", sep = "")
    ))
  }
}

The function call in R for a two-sample t-test is shown below. Setting var.equal = TRUE encodes the assumption that the male and female F1 values have equal variance.

res<-t.test(F1data$female, F1data$male,
  paired = FALSE,
  var.equal = TRUE
)
summary_ttest(res,paired=FALSE)
## [1] "t(36)=1.54 p=0.133"
## [1] "est. 1: 534.63 est. 2: 440.89 CI of diff. in means: [-30.07,217.54]"
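
As a side note on the equal-variance assumption: if var.equal is not specified, R's t.test function defaults to var.equal = FALSE and carries out Welch's t-test, which does not assume equal variances. The following minimal sketch compares the two versions. To keep it self-contained, the vectors female and male are typed in by hand from the table above instead of being loaded from the lingpsych package; res_pooled and res_welch are names introduced here for illustration only.

```r
## F1 values (Hz), copied by hand from the table above
female <- c(391, 561, 826, 453, 358, 454, 991, 561, 398, 334,
            444, 796, 542, 333, 343, 520, 989, 507, 357)
male <- c(339, 512, 670, 427, 291, 406, 706, 439, 324, 307,
          361, 678, 474, 311, 293, 363, 809, 367, 300)

## pooled-variance two-sample t-test, as in the text:
res_pooled <- t.test(female, male, paired = FALSE, var.equal = TRUE)
## Welch's t-test (the default when var.equal is not given):
res_welch <- t.test(female, male, paired = FALSE)

## t-values of the two tests:
unname(res_pooled$statistic)
unname(res_welch$statistic)
## degrees of freedom of the two tests:
unname(res_pooled$parameter)
unname(res_welch$parameter)
```

Because the two samples are of equal size, both tests produce the same t-value; they differ only in the degrees of freedom, which in Welch's test are adjusted downwards when the sample variances differ.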

This t-test is computing the following t-statistic:

\[\begin{equation} t=\frac{d-(\mu_f - \mu_m)}{SE} = \frac{d-0}{SE} \end{equation}\]

where \(d\) is the difference between the two sample means (here, the female mean minus the male mean); the rest of the terms are familiar from the one-sample case. SE is the estimated standard error of the sampling distribution of the difference between the means.

We will now do this calculation “by hand.” The only new things are the formula for the SE calculation and the degrees of freedom for the t-distribution, \((2\times n - 2)=36\), where \(n=19\) is the size of each of the two samples.

The standard error for the difference in the means in the two-sample t-test is computed using this formula:

\[\begin{equation} SE_\delta = \sqrt{\frac{\hat\sigma_m^2}{n_m} + \frac{\hat\sigma_f^2}{n_f}} \end{equation}\]

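Here, \(\hat\sigma_m\) is an estimate of the standard deviation for males, and \(\hat\sigma_f\) for the females; \(n_m\) and \(n_f\) are the respective sample sizes for males and females. Strictly speaking, with var.equal = TRUE the t.test function uses a pooled estimate of the variance; because the two samples are of equal size here (\(n_m = n_f = 19\)), the pooled standard error is identical to the one given by this formula.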

n_m <- n_f <- 19
## difference of sample means:
d <- mean(F1data$female) - mean(F1data$male)
(SE <- sqrt(var(F1data$male) / n_m + var(F1data$female) / n_f))
## [1] 61.04
(observed_t <- (d - 0) / SE)
## [1] 1.536
## two-sided p-value (abs() keeps this correct for negative t-values):
2 * pt(abs(observed_t), df = n_m + n_f - 2, lower.tail = FALSE)
## [1] 0.1334

The output of the two-sample t-test and the hand-calculation above match up.
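
The SE formula used in the hand-calculation does not pool the two sample variances, whereas a two-sample t-test with equal variances assumed does. The sketch below checks that the two computations agree for these data, where the samples are of equal size. It is self-contained: the vectors female and male are typed in by hand from the table above, and s2_pooled, SE_pooled, and SE_unpooled are names introduced here for illustration only.

```r
## F1 values (Hz), copied by hand from the table above
female <- c(391, 561, 826, 453, 358, 454, 991, 561, 398, 334,
            444, 796, 542, 333, 343, 520, 989, 507, 357)
male <- c(339, 512, 670, 427, 291, 406, 706, 439, 324, 307,
          361, 678, 474, 311, 293, 363, 809, 367, 300)
n_f <- length(female)
n_m <- length(male)

## pooled variance: average of the two sample variances,
## weighted by their degrees of freedom
s2_pooled <- ((n_f - 1) * var(female) + (n_m - 1) * var(male)) /
  (n_f + n_m - 2)
## SE based on the pooled variance:
SE_pooled <- sqrt(s2_pooled * (1 / n_f + 1 / n_m))
## SE from the formula in the text (no pooling):
SE_unpooled <- sqrt(var(female) / n_f + var(male) / n_m)

## the two agree when the samples are of equal size:
all.equal(SE_pooled, SE_unpooled)
```

With unequal sample sizes the two standard errors would generally differ, and the pooled version would be the one matching a t.test call with var.equal = TRUE.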

Now consider what changes once we take into account the paired nature of the data (same vowel and same language). The appropriate test is now a paired t-test instead of a two-sample t-test.

For such paired data, the null hypothesis is that the F1 value produced for a given vowel-and-language combination is, on average, the same for males and females: \(H_0: \mu_f-\mu_m=\delta=0\). Since each row in the data frame is paired (same vowel and language), we subtract the male vector from the female vector row-wise, and get a new vector \(d\) of row-wise differences (not a single number \(d\) as in the two-sample t-test). Then, we just carry out the familiar one-sample t-test we saw earlier on \(d\):

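d <- F1data$female - F1data$male
res <- t.test(d)
## the F1 values are in Hz, not in the default unit (ms):
summary_ttest(res, units = "Hz")
## [1] "t(18)=6.11 p=0"
## [1] "est.: 93.74 [61.48,125.99] Hz"

(The p-value is printed as 0 only because summary_ttest rounds it to three decimal places; the actual p-value is below 0.0005.)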
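
The paired t-test can also be computed “by hand,” just like the two-sample t-test earlier. The sketch below is self-contained: the vectors female and male are typed in by hand from the table above, and SE_d, t_paired, and p_paired are names introduced here for illustration only.

```r
## F1 values (Hz), copied by hand from the table above
female <- c(391, 561, 826, 453, 358, 454, 991, 561, 398, 334,
            444, 796, 542, 333, 343, 520, 989, 507, 357)
male <- c(339, 512, 670, 427, 291, 406, 706, 439, 324, 307,
          361, 678, 474, 311, 293, 363, 809, 367, 300)

## row-wise differences:
d <- female - male
n <- length(d)
## standard error of the mean difference:
SE_d <- sd(d) / sqrt(n)
## observed t-value under the null hypothesis delta = 0:
(t_paired <- (mean(d) - 0) / SE_d)
## two-sided p-value, with n - 1 degrees of freedom:
(p_paired <- 2 * pt(abs(t_paired), df = n - 1, lower.tail = FALSE))
```

The degrees of freedom are now \(n-1=18\), because the test is based on the 19 differences and not on the 38 individual values.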

An alternative syntax for the paired t-test explicitly feeds the two paired vectors into the function, but one must explicitly specify that they are paired, otherwise the test is a two-sample (i.e., unpaired) t-test:

res <- t.test(F1data$female, F1data$male, paired = TRUE)
summary_ttest(res, units = "Hz")
## [1] "t(18)=6.11 p=0"
## [1] "est.: 93.74 [61.48,125.99] Hz"

Incidentally, if one flips the order in which the vectors are entered into the t.test function, the sign of the t-value and of the estimate will of course flip.

res <- t.test(F1data$male, F1data$female, paired = TRUE)
summary_ttest(res, units = "Hz")
## [1] "t(18)=-6.11 p=0"
## [1] "est.: -93.74 [-125.99,-61.48] Hz"

Generally, one should ensure that the order in which one enters the vectors leads to an estimate that has an easy-to-interpret sign. For example, if object relatives (ORs) are expected to take longer to read than subject relatives (SRs), it would be better to place the vectors of reading times in the order OR, SR. Of course, no harm can come from reversing the order; it’s just that the meaning of the estimate could be misinterpreted by the reader.

A crucial assumption in the above paired t-test is that all the vowel-and-language combinations are independent of each other. Whether this assumption is correct or not (or approximately correct) depends on domain knowledge.

Incidentally, the paired t-test yields a statistically significant result, unlike the two-sample t-test above. The null hypothesis and the significance level (0.05) are the same in both tests, but the p-values, and therefore the conclusions, differ. The reason is that the paired test works with the row-wise differences, which removes the large vowel-to-vowel variability in F1: the estimated difference is 93.74 Hz in both analyses, but its standard error drops from 61.04 in the two-sample t-test to roughly 15 (that is, 93.74/6.11) in the paired t-test. Which analysis is correct, the two-sample t-test or the paired t-test? It all depends on your assumptions about what the data represent. If you consider the data to be paired by vowel-and-language, for the reasons given above, then a paired test is called for. If it is more reasonable to assume that each of the 19 male data points is independent of the others and of the 19 female data points, we would treat this as unpaired data. Students are often tempted to choose the test that yields a significant effect, just because a p-value below 0.05 would render the analysis publishable. The choice should instead be based on what the model assumptions are, and whether the model makes sense, even in the simple case of the t-test.
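
To see where the difference between the two analyses comes from, it helps to recall that the variance of a difference is \(Var(f-m)=Var(f)+Var(m)-2\,Cov(f,m)\). The two-sample t-test in effect treats the covariance term as zero, whereas the paired t-test takes it into account. The sketch below illustrates this with the F1 data. It is self-contained: the vectors female and male are typed in by hand from the table above, and var_d, SE_two_sample, and SE_paired are names introduced here for illustration only.

```r
## F1 values (Hz), copied by hand from the table above
female <- c(391, 561, 826, 453, 358, 454, 991, 561, 398, 334,
            444, 796, 542, 333, 343, 520, 989, 507, 357)
male <- c(339, 512, 670, 427, 291, 406, 706, 439, 324, 307,
          361, 678, 474, 311, 293, 363, 809, 367, 300)
n <- length(female)

## the female and male values are strongly correlated
## across vowel-and-language combinations:
cor(female, male)
## variance of the row-wise differences, via the identity above:
var_d <- var(female) + var(male) - 2 * cov(female, male)
all.equal(var_d, var(female - male))
## SE ignoring the pairing (two-sample t-test):
(SE_two_sample <- sqrt(var(female) / n + var(male) / n))
## SE taking the pairing into account (paired t-test):
(SE_paired <- sqrt(var_d / n))
```

Because the covariance is large and positive here, the standard error in the paired analysis is much smaller than in the two-sample analysis, which is why the same estimated difference leads to a much larger t-value.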

In the data sets we will encounter in the remainder of this book, we will never use one-sample, two-sample, or paired t-tests. Instead, we will be using linear models or linear mixed models. The different varieties of t-test are discussed here because they are frequently used (and misused) in published papers, and because (as shown later) t-tests are in fact simplified linear mixed models in disguise.
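
As a small preview of this connection, the t-values obtained above can be reproduced with R's lm function. This is only a sketch under simplifying assumptions: it uses ordinary linear models, not linear mixed models, and is not meant to stand in for the fuller treatment later on. The vectors female and male are typed in by hand from the table above, and F1long, fit_paired, and fit_two are names introduced here for illustration only.

```r
## F1 values (Hz), copied by hand from the table above
female <- c(391, 561, 826, 453, 358, 454, 991, 561, 398, 334,
            444, 796, 542, 333, 343, 520, 989, 507, 357)
male <- c(339, 512, 670, 427, 291, 406, 706, 439, 324, 307,
          361, 678, 474, 311, 293, 363, 809, 367, 300)

## paired t-test as an intercept-only linear model
## fit to the row-wise differences:
d <- female - male
fit_paired <- lm(d ~ 1)
(t_lm_paired <- summary(fit_paired)$coefficients["(Intercept)", "t value"])

## two-sample t-test (equal variances) as a linear model
## with sex as a two-level predictor:
F1long <- data.frame(
  F1 = c(female, male),
  sex = factor(rep(c("female", "male"), each = length(female)))
)
fit_two <- lm(F1 ~ sex, data = F1long)
## the coefficient "sexmale" estimates the male minus female difference,
## so the sign is flipped relative to t.test(female, male):
(t_lm_two <- summary(fit_two)$coefficients["sexmale", "t value"])
```

The intercept of the first model is the mean of the differences, and its t-value is the paired t-value; the slope of the second model is the difference between the two group means, and its t-value is the two-sample t-value with the sign reversed.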

Next, we look at some applications of the t-test in more complex settings than simple two-condition designs, and some subtle points relating to this statistical test. The essential point to take away from the following sections is that although the t-test has the wonderful property of simplicity, hidden dangers lurk which can mislead even highly experienced researchers.

References

Johnson, Keith. 2011. Quantitative Methods in Linguistics. John Wiley & Sons.