3.6 Shrinkage in linear mixed models

Compared to fitting a separate linear model to each subject’s data, the effect estimate for each participant computed from a linear mixed model “gravitates” towards the grand mean effect. This is called shrinkage in linear mixed models: the individual-level estimates are shrunk towards the estimates computed over the entire data set (the estimates of the intercept and slope parameters \(\beta_0\) and \(\beta_1\)). The more extreme a subject’s data are relative to the overall estimates, or the less data we have from a subject, the greater the shrinkage. Let’s look at these two situations in turn.
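
To get some intuition for why this happens, consider the simplest case of a model with only varying intercepts. There, the model’s estimate of a subject’s deviation from the grand mean is (approximately) the subject’s observed deviation multiplied by the weight \(\tau_0^2/(\tau_0^2 + \sigma^2/n_j)\), where \(\tau_0\) is the between-subject standard deviation, \(\sigma\) the residual standard deviation, and \(n_j\) the number of data points from subject \(j\). The sketch below uses made-up variance components and subject means (not estimates from gg05e1) purely to illustrate how this weight produces more shrinkage when \(n_j\) is small:

## a minimal sketch of the shrinkage weight in a varying-intercepts model;
## all numbers below are made up for illustration:
shrinkage_weight <- function(tau0sq, sigmasq, n) {
  tau0sq / (tau0sq + sigmasq / n)
}
tau0sq <- 0.10    ## between-subject variance (illustrative)
sigmasq <- 0.25   ## residual variance (illustrative)
grand_mean <- 5.9 ## grand mean log RT (illustrative)
subj_mean <- 6.6  ## an extreme subject mean (illustrative)
## with 16 data points, the subject's estimate is shrunk only mildly:
grand_mean + shrinkage_weight(tau0sq, sigmasq, 16) * (subj_mean - grand_mean)
## with only 2 data points, it is pulled much closer to the grand mean:
grand_mean + shrinkage_weight(tau0sq, sigmasq, 2) * (subj_mean - grand_mean)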

3.6.1 Shrinkage of extreme estimates from individual subjects

Figure 3.16 shows the data from three subjects who exhibit implausibly large OR-SR effects. Subjects 28, 36, and 37’s estimates from the lmList model are shown below; these subjects’ intercept and slope estimates lie at or near the upper end of the respective distributions.

## fit a separate linear model to each subject's data:
lmlist_m1 <- lmList(logrt ~ so | subject, gg05e1)
coefs <- coef(lmlist_m1)
intercepts <- coefs[1]
summary(intercepts)
##   (Intercept)  
##  Min.   :5.14  
##  1st Qu.:5.71  
##  Median :5.84  
##  Mean   :5.88  
##  3rd Qu.:6.05  
##  Max.   :6.62
intercepts$`(Intercept)`[28];
## [1] 6.1
intercepts$`(Intercept)`[36];
## [1] 6.009
intercepts$`(Intercept)`[37]
## [1] 6.617
slopes <- coefs[2]
summary(slopes)
##        so         
##  Min.   :-0.1610  
##  1st Qu.:-0.0218  
##  Median : 0.0503  
##  Mean   : 0.0620  
##  3rd Qu.: 0.1281  
##  Max.   : 0.4481
slopes$so[28];slopes$so[36];slopes$so[37]
## [1] 0.4481
## [1] 0.3825
## [1] 0.3554

On the millisecond scale, these subjects’ estimated relative clause (OR-SR) effects are implausibly large:

## subject 28's RC effect estimate:
exp(intercepts$`(Intercept)`[28]+
      slopes$so[28])-
  exp(intercepts$`(Intercept)`[28]-
        slopes$so[28])
## [1] 413.2
## subject 36's RC effect estimate:
exp(intercepts$`(Intercept)`[36]+
      slopes$so[36])-
  exp(intercepts$`(Intercept)`[36]-
        slopes$so[36])
## [1] 319.1
## subject 37's RC effect estimate:
exp(intercepts$`(Intercept)`[37]+
      slopes$so[37])-
  exp(intercepts$`(Intercept)`[37]-
        slopes$so[37])
## [1] 542.7

Subject 37’s estimate is the largest of all on the millisecond scale because that subject has the slowest average reading time of all the 42 subjects: a median of 748 ms. When we back-transform the estimates from the log ms scale to the ms scale, the non-linearity of the exponential transform means that a larger intercept translates into a much larger OR-SR effect in ms. For example, a slope of 0.3 on the log ms scale translates to a much larger effect on the ms scale when the intercept is larger:

exp(5+0.3)-exp(5-0.3)
## [1] 90.39
exp(6+0.3)-exp(6-0.3)
## [1] 245.7
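
Incidentally, the 748 ms median quoted above can be recovered from subject 37’s intercept on the log scale: because the reading times were log-transformed, exponentiating the mean log RT gives (approximately) the estimated median RT for that subject.

## subject 37's estimated median RT in ms
## (the exponent of the mean log RT estimates the median of a lognormal);
## this evaluates to approximately 748:
exp(intercepts$`(Intercept)`[37])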

Interestingly, if these three extreme subjects’ data are removed and a one-sample t-test is carried out on the remaining slopes from the repeated measures regression, the effect is only just significant. Compare the t-test on the full data with the t-test on the data excluding these three subjects:

## full data:
summary_ttest(t.test(slopes$so))
## [1] "t(41)=2.81 p=0.008"
## [1] "est.: 0.06 [0.02,0.11] ms"
## without subjects 28, 36, and 37:
summary_ttest(t.test(slopes$so[-c(28,36,37)]))
## [1] "t(38)=2.03 p=0.049"
## [1] "est.: 0.04 [0,0.07] ms"

The linear mixed model likewise shows that the relative clause effect is no longer significant once the data from these three subjects are removed:

gg05e1_a<-subset(gg05e1,
                 subject!=28 & 
                 subject!=36 & 
                 subject!=37)
m4_lmer_a<-lmer(logrt~so +
                  (1+so|subject)+
                  (1+so||item),gg05e1_a)
summary(m4_lmer_a)$coefficients
##             Estimate Std. Error t value
## (Intercept)  5.85543    0.05134 114.048
## so           0.03642    0.01987   1.832

The advantage of shrinkage is that these extreme subjects’ estimates are shrunk towards the grand mean in the linear mixed model. Compare the estimates from the lmList model (the repeated measures regression that analyzes each subject separately) with those from the linear mixed model:

## slope estimates from lmList:
slopes$so[28];slopes$so[36];slopes$so[37]
## [1] 0.4481
## [1] 0.3825
## [1] 0.3554
## extract slope estimates from m4_lmer:
b1<-fixef(m4_lmer)[2]
lmerslopes<-b1+ranef(m4_lmer)$subject[,2]
lmerslopes[28];lmerslopes[36];lmerslopes[37]
## [1] 0.2757
## [1] 0.234
## [1] 0.2804

The above example illustrates one important property of the linear mixed model: it conservatively down-weights extreme estimates of the slope for particular subjects, pulling them toward the grand mean slope. This down-weighting is sometimes referred to as “shrinkage”, and sometimes as “borrowing strength from the mean”. Figure 3.16 illustrates the above discussion graphically.
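
To see the shrinkage for all 42 subjects at once, one can plot each subject’s slope from the separate-regressions (lmList) fit against the corresponding subject-level slope from the linear mixed model. The sketch below assumes the lmlist_m1 and m4_lmer objects fit above; the vertical distance of each point from the dashed identity line shows how much that subject’s estimate has been pulled towards the grand mean slope (the gray horizontal line):

## slope estimates from the separate-regressions (lmList) fit:
lmlist_slopes <- coef(lmlist_m1)$so
## subject-level slope estimates from the linear mixed model:
lmer_slopes <- fixef(m4_lmer)[2] + ranef(m4_lmer)$subject$so
## plot one set of estimates against the other:
plot(lmlist_slopes, lmer_slopes,
     xlab = "lmList slope estimate",
     ylab = "lmer slope estimate")
abline(a = 0, b = 1, lty = 2)                ## identity line: no shrinkage
abline(h = fixef(m4_lmer)[2], col = "gray")  ## fixed-effect (grand mean) slope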


FIGURE 3.16: The figures show linear model fits (the grand mean estimates) for three subjects with unusually large intercepts and slopes; shown are the simple linear model fit on all the data (gray line), the lmList model fit to the individual subject’s data (black line), and the linear mixed model fit (the broken line). In all three subjects’ models, the linear mixed model estimates are shrunk towards the grand mean (gray line) estimates.

3.6.2 Shrinkage when data are missing

The value of shrinkage becomes even clearer once we simulate a situation with missing data. Missing data can arise in experiments due to lost measurements (e.g., computer or programming errors) or for other reasons. To see what happens when data are missing, let’s randomly delete 50% of subject 37’s data:

set.seed(4321)
## randomly choose which of subject 37's 16 data points to remove
## (16 Bernoulli(0.5) draws; a 1 means the data point will be deleted):
rand <- rbinom(n = 16, size = 1, prob = 0.5)

Here are subject 37’s reading times (16 data points):

gg05e1[which(gg05e1$subject == 37), ]$rawRT
##  [1]  770  536  686  578  457  487 2419  884 3365  233
## [11]  715  671 1104  281 1081  971

Now, we randomly delete half the data:

gg05e1$deletedRT <- gg05e1$rawRT
gg05e1[which(gg05e1$subject == 37), ]$deletedRT <-
  ifelse(rand, NA,
    gg05e1[which(gg05e1$subject == 37), ]$rawRT
  )

Subject 37’s estimates will now be even more inaccurate than with the full 16 data points, because they are based on less data (even a single extreme value can strongly influence the mean):

subset(gg05e1, subject == 37)$deletedRT
##  [1]  770   NA  686  578   NA   NA   NA   NA 3365  233
## [11]   NA  671 1104   NA   NA  971

Here are the original estimates of the intercept and slope for subject 37:

lmlist_m1_orig <- lmList(log(rawRT) ~ so | subject, gg05e1)
intercepts_orig<-coef(lmlist_m1_orig)[1]
slopes_orig<-coef(lmlist_m1_orig)[2]
(intercept_orig37<-intercepts_orig$`(Intercept)`[37])
## [1] 6.617
(slope_orig37<-slopes_orig$so[37])
## [1] 0.3554

Now look at the estimates based on the deleted data:

## on deleted data:
lmlist_m1_deleted <- lmList(log(deletedRT) ~ so | subject, gg05e1)
intercepts_del <- coef(lmlist_m1_deleted)[1]
slopes_del <- coef(lmlist_m1_deleted)[2]
## subject 37's new estimates on deleted data:
(intercept_del37<-intercepts_del$`(Intercept)`[37])
## [1] 6.688
(slope_del37<-slopes_del$so[37])
## [1] 0.3884

The estimates from subject 37 have become even more extreme once half the data from this subject have been deleted. The intercept goes from 6.62 in the original data to 6.69 in the deleted data; and the slope from 0.36 in the original data to 0.39 in the deleted data.

Now fit two linear mixed models, one on the original data and one on the deleted data, and compare subject 37’s estimates from the two fits.

m1_lmer_orig<-lmer(log(rawRT)~so + (1+so|subject)+
                     (1+so||item),gg05e1)
b0_orig<-summary(m1_lmer_orig)$coefficients[1,1]
b1_orig<-summary(m1_lmer_orig)$coefficients[2,1]
u_orig<-ranef(m1_lmer_orig)$subject
u0_orig<-u_orig$`(Intercept)`
u1_orig<-u_orig$`so`

## subject 37's estimates:
b0_orig + u0_orig[37]
## [1] 6.576
b1_orig + u1_orig[37]
## [1] 0.2804
## LMM on the deleted data:
m4_lmer_del<-lmer(log(deletedRT)~so + (1+so|subject)+
                    (1+so||item),gg05e1)
b0_del<-summary(m4_lmer_del)$coefficients[1,1]
b1_del<-summary(m4_lmer_del)$coefficients[2,1]
u_del<-ranef(m4_lmer_del)$subject
u0_del<-u_del$`(Intercept)`
u1_del<-u_del$`so`

## subject 37's estimates:
b0_del + u0_del[37] 
## [1] 6.605
b1_del + u1_del[37]
## [1] 0.2654

The two intercept and slope estimates for subject 37 are pretty similar despite half the data from that subject being missing in the second analysis.
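
To see the comparison at a glance, we can collect subject 37’s intercept and slope estimates from the four fits above (lmList and lmer, each on the full and on the deleted data) into a small table; this sketch simply reuses the objects computed above:

## subject 37's estimates from all four models:
data.frame(
  model = c("lmList full", "lmList deleted",
            "lmer full", "lmer deleted"),
  intercept = round(c(intercept_orig37, intercept_del37,
                      b0_orig + u0_orig[37], b0_del + u0_del[37]), 3),
  slope = round(c(slope_orig37, slope_del37,
                  b1_orig + u1_orig[37], b1_del + u1_del[37]), 3)
)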

Figure 3.17 illustrates the difference between the different models: there is hardly any difference between the two linear mixed model estimates; by contrast, the lmList estimates on the full vs. missing data differ quite a bit.


FIGURE 3.17: The figure shows linear model fits (the grand mean estimates) for subject 37. When using lmList, deleting data leads to very different estimates; but using lmer, deleting half the data from this subject hardly affects the individual subject’s estimates.

The conclusion to take away from Figure 3.17 is that the estimates from the hierarchical model are almost unaffected by the missing data, whereas the estimates from the lmList model change substantially because they are based on less data.

Thus, shrinkage in linear mixed models has the beneficial effect that it yields more robust estimates (think Type M error!) compared to models which fit data separately for each subject. This property of shrinkage is one reason why linear mixed models are so important in cognitive science, and why they should be the standard workhorse for data analysis.