3.2 Sum coding

We have established so far that the mathematical form of the model is:

\[\begin{equation} rawRT = \beta_0 + \beta_1 condition + \varepsilon \end{equation}\]

An important change that we will make now is to the contrast coding of the condition vector. First, recode the levels of the condition column as shown below.

## new contrast coding:
bysubj$cond <- ifelse(bysubj$condition == "objgap", 1, -1)

Now, the two conditions are coded not as 0 and 1 but as -1 and +1. This new coding is stored in a new column called cond (of course, one can call it anything one likes). It is always a good idea to carry out a quick sanity check that the coding is correct. One way to do this is to cross-tabulate the new cond column against the original condition column. Many a data analysis has gone awry because the researcher neglected to check that the coding was set up as intended!

xtabs(~ cond + condition, bysubj)
##     condition
## cond objgap subjgap
##   -1      0      42
##   1      42       0

With this coding, the \(\beta\) parameters have a different meaning:

m1raw <- lm(rawRT ~ cond, bysubj)
round(coef(m1raw))
## (Intercept)        cond 
##         420          51

  • The intercept now represents the grand mean reading time: the estimate from the data is \(\hat \beta_0=420\).
  • The estimated mean object relative reading time is: \(\hat\beta_0+\hat\beta_1\times 1=420+51=471\).
  • The estimated mean subject relative reading time is: \(\hat\beta_0+\hat\beta_1\times (-1)=420-51=369\).
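
These relationships can be verified directly from the condition means. The following sketch recomputes the intercept and slope by hand from the bysubj data used above:

## condition means of the raw reading times:
means <- tapply(bysubj$rawRT, bysubj$condition, mean)
## the average of the two condition means recovers the intercept (about 420):
mean(means)
## half the object-subject difference recovers the slope (about 51):
(means["objgap"] - means["subjgap"]) / 2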

This kind of parametrization is called sum-to-zero contrast or more simply sum contrast coding. This is the coding we will use most frequently in this book. Contrast coding will be elaborated on in a later chapter; there, the advantages of sum coding over treatment coding will become clear. For now, it is sufficient to understand that one can reparametrize the model using different kinds of contrast coding, and that such a reparametrization impacts the interpretation of the parameters.
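
As an aside, the same parametrization can be set up without manually creating a new column, by assigning R's built-in sum contrasts to a factor; the sketch below assumes the two condition levels shown earlier:

## equivalent sum coding via contr.sum:
bysubj$condition <- factor(bysubj$condition, levels = c("objgap", "subjgap"))
contrasts(bysubj$condition) <- contr.sum(2)
## objgap is now coded +1 and subjgap -1; this fit matches m1raw:
lm(rawRT ~ condition, bysubj)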

With sum coding, the null hypothesis for the slope is

\[\begin{equation} H_0: \mathbf{1}\times \mu_{obj} + (\mathbf{-1})\times \mu_{subj} = 0 \end{equation}\]

The \(+1\) and \(-1\) in the sum contrast coding (standing for object and subject relatives, respectively) correspond directly to the \(\pm 1\) coefficients in the null hypothesis above. Written out with the estimates from the data, the model is as follows.

Object relative reading times:

\[\begin{equation} rt = 420\mathbf{\times 1} + 51\mathbf{\times 1} + \varepsilon \end{equation}\]

Subject relative reading times:

\[\begin{equation} rt = 420\mathbf{\times 1} + 51\mathbf{\times (-1)} + \varepsilon \end{equation}\]

The two equations can be written in a single line as:

\[\begin{equation} rt = 420 + 51\times condition + \varepsilon \end{equation}\]

The numerical values in the vector \(\varepsilon\) are called the residuals; these numbers represent the amount by which the observed values (the rawRT column in the data) deviate from the values predicted by the above model.

To see what the residuals are, first compute the reading times predicted by the intercept and slope estimates from the model output:

## extract the estimated intercept and slope:
b0 <- coef(m1raw)[1]
b1 <- coef(m1raw)[2]
## predicted reading time for each row:
bysubj$predicted <- b0 + b1 * bysubj$cond
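
Incidentally, these hand-computed predictions should match the fitted values that R stores in the model object:

## predicted values extracted directly from the model:
head(fitted(m1raw))
## should agree with the hand-computed column:
head(bysubj$predicted)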

Now compute the difference between the observed and predicted values. These are the residuals.

head(bysubj$rawRT - bysubj$predicted, n = 6)
## [1] -136.36  -51.99 -224.11  145.01  -75.61 -141.11

Notice that the fitted model’s residuals are identical to the values we computed above:

residuals(m1raw)[1:6]
##       1       2       3       4       5       6 
## -136.36  -51.99 -224.11  145.01  -75.61 -141.11
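
As a further sanity check, the residuals of a least-squares fit that includes an intercept always sum to zero (up to floating-point error):

## residuals sum to (numerically) zero:
sum(residuals(m1raw))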

Whenever a residual is negative, the corresponding observed data point in the rawRT column is faster (i.e., smaller) than the value predicted by the linear model.
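
To see this concretely, one can inspect the observed and predicted values for a few rows with negative residuals, using the columns created above:

## rows where the model overpredicts (observed RT < predicted RT):
neg <- residuals(m1raw) < 0
head(bysubj[neg, c("rawRT", "predicted")])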