3.3 Checking model assumptions

The linear model assumes that the residuals are (approximately) normally distributed; that is what the statement \(\varepsilon\sim Normal(0,\sigma)\) implies. When carrying out hypothesis testing, it is important to check that this assumption is approximately satisfied, because the null hypothesis significance testing procedure crucially relies on the normality of the residuals.

Here is how we can check whether this normality assumption is met. First, extract the vector of residuals:

## extract residuals:
res_m1raw <- residuals(m1raw)

Then, graphically compare the residuals to the quantiles of the standard normal distribution (\(Normal(0,1)\)); see Figure 3.2.

qqnorm(res_m1raw)

FIGURE 3.2: The residuals of the model m1raw plotted against the quantiles of a standard normal distribution.

When the normality assumption is met, the points will align with the quantiles of the standard normal distribution, falling along a straight diagonal line. When the assumption is not met, the points will tend to curve away from the diagonal.
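
To make this visual comparison easier, a reference line can be overlaid on the QQ-plot; a minimal sketch using base R's qqline(), which draws a line through the first and third quartiles of the data:

## redraw the QQ-plot with a reference line through the
## first and third quartiles:
qqnorm(res_m1raw)
qqline(res_m1raw)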

There are formal statistical tests for checking whether a sample comes from a normal distribution, but these tests are so stringent that even a sample drawn from a standard normal distribution may fail them. This is why statistics textbooks such as Venables and Ripley (2002) suggest using a graphical comparison of quantiles to establish approximate normality.
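
For reference, one such formal test is the Shapiro-Wilk test, available in base R as shapiro.test(); it can be applied to the residuals above (output not shown), though for the reasons just mentioned the graphical check is preferred:

## Shapiro-Wilk normality test on the residuals:
shapiro.test(res_m1raw)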

In the above case, a log transform of the data improves the normality of the residuals; see Figure 3.3. We will discuss transformations in detail later in this book; for now, it is sufficient to note that for continuous data consisting of all-positive values (here, reading times), a log transform is often the appropriate transform.

m1log <- lm(log(rawRT) ~ cond, bysubj)
qqnorm(residuals(m1log))

FIGURE 3.3: The residuals of the model m1log plotted against the quantiles of a standard normal distribution.

The estimates of the parameters are now on the log ms scale:

  • The estimated grand mean reading time: \(\hat\beta_0=5.9488\).
  • The estimated mean object relative reading time: \(\hat\beta_0+\hat\beta_1=5.9488+0.0843=6.0331\).
  • The estimated mean subject relative reading time: \(\hat\beta_0-\hat\beta_1=5.9488-0.0843=5.8645\).
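
These estimates can be read directly off the fitted model; for instance, assuming the model m1log fit above (output not shown here):

## coefficients of the log-scale model:
summary(m1log)$coefficients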

The model does not change; only the scale does:

\[\begin{equation} \log rt = \beta_0 + \beta_1 condition + \varepsilon \end{equation}\]

where \(\varepsilon \sim Normal(0,\sigma)\).

Now, the intercept and slope can be used to compute the estimated reading times in the two conditions. Because \(\exp(\log(rt))=rt\), to get the estimates on the raw ms scale, we just need to exponentiate both sides of the equation (dropping the residual term):

\[\begin{equation} \exp(\log rt) = \exp( \beta_0 + \beta_1 condition) \end{equation}\]

This approach gives us the following estimates on the ms scale:

  • Estimated object relative reading time: \(\exp(\hat\beta_0+\hat\beta_1)=\exp(5.9488+0.0843)=417\).
  • Estimated subject relative reading time: \(\exp(\hat\beta_0-\hat\beta_1)=\exp(5.9488-0.0843)=352\).
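
The same numbers can be computed from the fitted model object; a minimal sketch, assuming the model m1log from above:

## raw-scale (ms) estimates from the log-scale model:
b_log <- coef(m1log)
exp(b_log[1] + b_log[2])  ## object relative: about 417 ms
exp(b_log[1] - b_log[2])  ## subject relative: about 352 ms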

The exponentiated values are medians, not means. This is because if \(\log rt\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma\), then \(rt\) is lognormally distributed with median \(\exp(\mu)\) but mean \(\exp(\mu + \sigma^2/2)\): the mean on the raw scale depends on the standard deviation, whereas the median does not.
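
A quick simulation illustrates this point (the specific meanlog and sdlog values below are arbitrary choices for this demonstration):

## simulate lognormal data: log(x) ~ Normal(6, 0.5)
set.seed(1)
x <- rlnorm(100000, meanlog = 6, sdlog = 0.5)
exp(mean(log(x)))  ## recovers the median: exp(6), about 403
median(x)          ## about 403
mean(x)            ## larger: exp(6 + 0.5^2/2), about 457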

The difference in reading time between the two conditions is \(417-352=65\) ms. If we had fit the model to the raw reading times, the difference would have been larger, 102 ms:

m1raw <- lm(rawRT ~ cond, bysubj)
summary(m1raw)$coefficients
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   420.22      22.01  19.092 6.469e-32
## cond           51.14      22.01   2.324 2.263e-02

  • Estimated mean object relative reading time: \(\hat\beta_0+\hat\beta_1=420.22+51.14=471.36\).
  • Estimated mean subject relative reading time: \(\hat\beta_0-\hat\beta_1=420.22-51.14=369.08\).

The difference in the means on the raw scale is 102 ms. The larger estimate based on the raw scale is less realistic, and we will see later that the large difference between the two conditions is driven by a few extreme, influential values.
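
Incidentally, because the condition predictor is coded \(\pm 1\) (the object relative mean is \(\hat\beta_0+\hat\beta_1\) and the subject relative mean is \(\hat\beta_0-\hat\beta_1\)), the difference between the two condition means is simply twice the slope; a quick check using the model m1raw fit above:

## difference between condition means = 2 * slope under +/-1 coding:
2 * coef(m1raw)[2]  ## about 102 ms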

References

Venables, William N., and Brian D. Ripley. 2002. Modern Applied Statistics with S. 4th ed. New York: Springer.