3.1 From the t-test to the linear (mixed) model

We begin with the now-familiar Grodner and Gibson (2005) self-paced reading data; this design was first presented in chapter 2. Load the data and compute the means for the raw reading times by condition:

library(lingpsych)
data("df_gg05e1")
gg05e1 <- df_gg05e1
means <- round(with(gg05e1, tapply(rawRT,
  IND = condition,
  mean
)))
means
##  objgap subjgap 
##     471     369

As predicted by theory (Grodner and Gibson 2005), object relatives (labeled objgap here) are read slower than subject relatives (labeled subjgap).

As discussed in the previous chapter, a paired t-test can be done to evaluate whether we have evidence against the null hypothesis that object relatives and subject relatives have identical reading times. However, we have to aggregate the data by subjects and by items first.

bysubj <- aggregate(rawRT ~ subject +
  condition,
mean,
data = gg05e1
)
byitem <- aggregate(rawRT ~ item +
  condition,
mean,
data = gg05e1
)

Figure 3.1 shows the distribution of the data in the two conditions (aggregated data by subjects and by items). Clearly, there are some extreme, possibly influential values in the by-subjects data.

## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
The distribution of the object and subject relative clause data at the critical region (relative clause verb) in the Grodner and Gibson 2005 data set (their Experiment 1).

FIGURE 3.1: The distribution of the object and subject relative clause data at the critical region (relative clause verb) in the Grodner and Gibson 2005 data set (their Experiment 1).

summary_ttest(t.test(rawRT ~ condition,
  paired = TRUE, bysubj
))
## [1] "t(41)=3.11 p=0.003"
## [1] "est.: 102.29 [35.85,168.72] ms"
summary_ttest(t.test(rawRT ~ condition,
  paired = TRUE, byitem
))
## [1] "t(15)=3.75 p=0.002"
## [1] "est.: 102.29 [44.21,160.36] ms"

What these two t-tests show is that both by subjects and by items, there is strong evidence against the null hypothesis that the object and relatives have identical reading times.

Interestingly, exactly the same t-values can be obtained by running the following commands, which implement a kind of linear model called the linear mixed model:

m0rawsubj <- lmer(rawRT ~ condition + (1 | subject), bysubj)
summary(m0rawsubj)$coefficients
##                  Estimate Std. Error t value
## (Intercept)         471.4      31.13  15.143
## conditionsubjgap   -102.3      32.90  -3.109
m0rawitem <- lmer(rawRT ~ condition + (1 | item), byitem)
summary(m0rawitem)$coefficients
##                  Estimate Std. Error t value
## (Intercept)         471.4      20.20  23.336
## conditionsubjgap   -102.3      27.25  -3.754

The sign of the t-value is the opposite to that of the paired t-test above; the reason for that is the contrast coding used; this will be explained below.

Our goal in this chapter is to understand the above model involving the lmer function, using the familiar paired t-test as a starting point.

For now, consider only the by-subject analysis. Given the sample means shown above for the two conditions, we can rewrite our best guess about how the object and subject relative clause reading time distributions were generated:

  • Object relative reading time: \(Normal(471,\hat\sigma)\)
  • Subject relative reading time: \(Normal(369,\hat\sigma)\)

Here, \(\hat\sigma\) is the estimated standard deviation; it is assumed to be the same for the two relative clause conditions. Right now the focus is only on the two estimates mean reading times for the two conditions.

The above two statements about the underlying pdf that produced the two relative clause conditions’ reading times can also be rewritten with respect to the mean reading time of the object relative and the difference between the two conditions (the reasons for this will become clear presently):

  • Object relative: \(Normal(471-102\times 0,\hat\sigma)\)
  • Subject relative: \(Normal(471-102\times 1,\hat\sigma)\)

Note that the two distributions for object and subject relative clauses (RCs) are assumed to be independent. This assumed independence is expressed by the fact that we define two separate Normal distributions, one for object relatives and the other for subject relatives. We saw earlier that this assumption of independence does not hold in our data because, in the aggregated data, each subject delivers a data point for subject relatives and another data point for object relatives. However, for now we will ignore this detail; we will fix this shortcoming later.

The interesting point to notice here is that the mean for the object and subject relatives’ distributions can be rewritten as a sum of two terms. An equivalent way to express the fact that object relatives are coming from a \(Normal(471,\hat\sigma)\) is to say that each object relative data point can be described by the following equation:

\[\begin{equation} y = 471 + -102 \times 0 + \varepsilon \hfill \hbox{ where } \varepsilon \sim Normal(0,\hat\sigma) \end{equation}\]

Similarly, the subject relative’s distribution can be written as being generated from:

\[\begin{equation} y = 471 - 102 \times 1 + \varepsilon \hbox{ where } \varepsilon \sim Normal(0,\hat\sigma) \end{equation}\]

In these data, the parameter \(\hat\sigma\) is estimated to be \(213\) ms. How do we know what this estimate is? If, as in the present case, the data are available, this is easy to do: just extract the vector of differences between the object and subject relative data from the aggregated data, and then compute the standard deviation of that vector:

objgapdata<-subset(bysubj,condition=="objgap")$rawRT
subjgapdata<-subset(bysubj,condition=="subjgap")$rawRT
d<-objgapdata-subjgapdata
sd(d)
## [1] 213.2

But even if we didn’t have access to the data, and only knew what the observed t-value, the mean difference between the two conditions, and the subject sample size is, we can still get the same standard deviation as the one we computed above. The standard deviation estimate can be derived from the by-subject t-test output as follows. Let the mean difference between the means of the object and subject relative conditions be \(\bar{d}\). The observed t-value is

\[\begin{equation} t_{obs}= \frac{\bar{d}-0}{s/\sqrt{n}} \end{equation}\]

Solving for \(s\):

\[\begin{equation} s = \bar{d} \times \sqrt{n}/t_{obs} = -102 \times \sqrt{42}/-3.109 = 213 \end{equation}\]

So, based on the data, our model for the relative clause data consists of two equations:

Object relatives:

\[\begin{equation} y = 471 -102\times 0 + \varepsilon \hfill \hbox{ where } \varepsilon \sim Normal(0,213) \end{equation}\]

Subject relatives:

\[\begin{equation} y = 471 - 102\times 1 + \varepsilon \hbox{ where } \varepsilon \sim Normal(0,213) \end{equation}\]

The above statements describe a generative process for the data.

Given such a statement about the generative process, we can express the estimated mean reading times for each RC type as follows. The term \(\varepsilon\) has mean 0; we stipulate this when we specify that \(\varepsilon \sim Normal(0,\sigma)\).

Mean object relative reading times:

\[\begin{equation} \hbox{Mean OR RT}= 471 -102\times 0 \end{equation}\]

Mean subject relative reading times:

\[\begin{equation} \hbox{Mean SR RT} = 471 - 102\times 1 \end{equation}\]

There is a function in R, the lm() function, which expresses the above statistical model, and prints out exactly the same numerical values that are shown above:

summary(m0raw <- lm(rawRT ~ condition, bysubj))$coefficients
##                  Estimate Std. Error t value  Pr(>|t|)
## (Intercept)         471.4      31.13  15.143 1.795e-25
## conditionsubjgap   -102.3      44.02  -2.324 2.263e-02

The linear model function lm() prints out two coefficients, \(471\) and \(-102\), that express the mean reading times for object and subject relative data, using a simple coding scheme: object relatives are coded as 0, and subject relatives are coded as 1. This coding scheme is not visible to the user, but is represented internally in R. However, the user can see the coding for each condition level by typing:

## make sure that the condition column is of type factor:
bysubj$condition <- factor(bysubj$condition)
contrasts(bysubj$condition)
##         subjgap
## objgap        0
## subjgap       1

We will discuss contrast coding in detail in a later chapter, but right now the simple 0,1 coding above—called treatment contrasts—is enough for our purposes.

Thus, what the linear model above gives us is two numbers: the estimated mean object relative reading time (471), and the estimated difference between object and subject relative (-102). We can extract the two coefficients by typing:

round(coef(m0raw))
##      (Intercept) conditionsubjgap 
##              471             -102

In the vocabulary of linear modeling, the first number is called the intercept, and the second one is called the slope. Note that the meaning of the intercept and slope depends on the ordering of the factor levels. We can make the sample mean of the subject relative represent the intercept:

## reverse the factor level ordering:
bysubj$condition <- factor(bysubj$condition,
  levels = c("subjgap", "objgap")
)
contrasts(bysubj$condition)
##         objgap
## subjgap      0
## objgap       1

Now, the intercept is the mean of the subject relatives, and the slope is the difference between object and subject relatives reading times. Note that the sign of the t-value has changed—the sign depends on the contrast coding.

m0raw <- lm(rawRT ~ condition, bysubj)
summary(m0raw)$coefficients
##                 Estimate Std. Error t value  Pr(>|t|)
## (Intercept)        369.1      31.13  11.857 1.819e-19
## conditionobjgap    102.3      44.02   2.324 2.263e-02

Let’s switch back to the original factor level ordering:

bysubj$condition <- factor(bysubj$condition,
  levels = c("objgap", "subjgap")
)
contrasts(bysubj$condition)
##         subjgap
## objgap        0
## subjgap       1
m0raw <- lm(rawRT ~ condition, bysubj)
summary(m0raw)$coefficients
##                  Estimate Std. Error t value  Pr(>|t|)
## (Intercept)         471.4      31.13  15.143 1.795e-25
## conditionsubjgap   -102.3      44.02  -2.324 2.263e-02

In mathematical form, the model can now be stated as a single equation:

\[\begin{equation} rawRT = \hat\beta_0 + \hat\beta_1 condition + \varepsilon \end{equation}\]

where

  • condition is a 0,1 coded vector, with object relatives coded as 0, and subject relatives coded as 1.
  • \(\hat\beta_0\) is the estimated mean for the object relative (which is coded as 0)
  • \(\hat\beta_1\) is the amount by which the object relative mean must be changed to obtain the estimated mean for the subject relative.
  • \(\varepsilon\) is the noisy variation from trial to trial around the means for the two conditions, represented by \(Normal(0,213)\). That is, \(\hat\sigma\) is 213 ms.

The null hypothesis of scientific interest here is always with reference to the slope, that the difference in means between the two relative clause types \(\beta_1\) is 0:

\(H_0: \beta_1 = 0\)

The t-test value printed out in the linear model is simply the familiar t-test formula in action:

\[\begin{equation} t_{obs} = \frac{\hat\beta_1 - 0}{SE} \end{equation}\]

The intercept also has a null hypothesis associated with it, namely that \(H_0: \beta_0 = 0\). However, this null hypothesis test is of absolutely no interest for us. This hypothesis test is reported by the lm() function only because the intercept is needed for technical reasons, to be discussed in the contrast coding chapter.

The contrast coding mentioned above determines the meaning of the \(\beta_0\) and \(\beta_1\) parameters:

bysubj$condition <- factor(bysubj$condition,
  levels = c("objgap", "subjgap")
)
contrasts(bysubj$condition)
##         subjgap
## objgap        0
## subjgap       1

When discussing linear models, we will make a distinction between the unknown true means \(\beta_0, \beta_1\) and the estimated mean from the data \(\hat\beta_0, \hat\beta_1\). The estimates that we have from the data are:

  • Estimated mean object relative processing time: \(\hat\beta_0=471\) ms .
  • Estimated mean subject relative processing time: \(\hat\beta_0+\hat\beta_1=471+-102=369\) ms.

The underlying statistical model expressed in terms of the unknown parameters is stated in terms of \(\beta_0, \beta_1, \sigma\):

\[\begin{equation} rawRT = \beta_0 + \beta_1 condition + \varepsilon \end{equation}\]

where \(\varepsilon \sim Normal(0,\sigma)\).

The null hypotheses are always stated in terms of these unknown parameters. Thus, a careful distinction should be made between the true unknown parameters, and their estimates from the data; the latter are designated by a hat over the parameter: \(\hat\cdot\).

References

Grodner, Daniel, and Edward Gibson. 2005. “Consequences of the Serial Nature of Linguistic Input.” Cognitive Science 29: 261–90.