9.2 Type I error inflation due to multiple comparisons

Next, we consider a case where the design is more complex than a two-condition experiment. The data are from an eyetracking reading study conducted by Dillon et al. (2013); this is a \(2\times 2 \times 2\) repeated measures design. The dependent measure is total reading time at a particular word in the sentence. Because there are eight conditions, at most seven (i.e., \(8-1\)) comparisons can be carried out.

The question we want to answer here is: if we carry out multiple (here, seven) hypothesis tests, what is the probability of incorrectly rejecting at least one of them? Analytically, the answer is easy to compute. Assume that the Type I error probability for each individual hypothesis test is 0.05. To keep things simple, suppose first that we are testing only two null hypotheses. The probability that both null hypotheses are correctly accepted is \(0.95^2\); this assumes that the tests are independent. It follows that the probability of incorrectly rejecting at least one of the two is \(1-0.95^2=0.0975\). That is higher than our desired Type I error of 0.05. With seven independent tests, the probability rises to \(1-0.95^7\approx 0.30\).
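This calculation is easy to verify numerically; a minimal sketch in R:

## probability of incorrectly rejecting at least one of k
## independent null hypothesis tests, each with alpha = 0.05:
alpha<-0.05
1-(1-alpha)^2 ## two tests: 0.0975
1-(1-alpha)^7 ## seven tests: about 0.30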

Next, let’s see what simulation gives us when we fit a varying intercepts and slopes model for the case where we are testing seven hypotheses.

First, load the data and obtain the estimates from a fitted linear mixed model with full variance-covariance matrices for subjects and items. Because fitting such a model with lmer leads to convergence errors, we fit the model using a Bayesian linear mixed model instead, and use the estimates from this model to simulate data.

library(lingpsych)
data("dillonE1ttnested")
library(brms)
postD_mE1ttnested<-posterior_samples(dillonE1ttnested)
## Warning: Method 'posterior_samples' is deprecated.
## Please see ?as_draws for recommended alternatives.
beta1<-postD_mE1ttnested$b_Intercept
beta2<-postD_mE1ttnested$"b_dep"
beta3<-postD_mE1ttnested$"b_gram"
beta4<-postD_mE1ttnested$"b_dep:gram"
beta5<-postD_mE1ttnested$"b_int.rg"
beta6<-postD_mE1ttnested$"b_int.ag"
## the two estimates of interest:
beta7<-postD_mE1ttnested$"b_int.ru"
beta8<-postD_mE1ttnested$"b_int.au"
## posterior means of the eight fixed effects:
betameans<-c(mean(beta1),mean(beta2),mean(beta3),mean(beta4),
             mean(beta5),mean(beta6),mean(beta7),mean(beta8))

cor<-posterior_samples(dillonE1ttnested,"^cor")
sd<-posterior_samples(dillonE1ttnested,"^sd")
sigma<-posterior_samples(dillonE1ttnested,"sigma")
item_re<-posterior_samples(dillonE1ttnested,"^r_item")
subj_re<-posterior_samples(dillonE1ttnested,"^r_subj")
library(SIN)
sds<-colMeans(sd)
cors<-colMeans(cor)
sig<-mean(sigma$sigma)
## assume an intermediate correlation of 0.5 between all pairs
## of random effects (1's on the diagonal, 0.5 off-diagonal):
cormat<-matrix(rep(0.5,8*8),ncol=8)
diagmat<-diag(rep(0.5,8))
cormat<-cormat+diagmat
Sigma_u<-SIN::sdcor2cov(stddev=sds[8+(c(1,2,3,8,7,5,6,4))],
                          corr=cormat)
Sigma_w<-SIN::sdcor2cov(stddev=sds[c(1,2,3,8,7,5,6,4)],
                          corr=cormat)
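
A quick, optional sanity check (using base R's cov2cor): converting the covariance matrices back to correlation matrices should recover the assumed correlation matrix.

## the covariance matrices should reproduce cormat:
all.equal(cov2cor(Sigma_u),cormat,check.attributes=FALSE)
all.equal(cov2cor(Sigma_w),cormat,check.attributes=FALSE)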

Next, we set all seven slopes to zero, so that every null hypothesis is true, and compute the Type I error under different model specifications.

## set all effects to 0:
betameans[2:8]<-0
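
We can confirm that only the intercept (the grand mean) remains non-zero:

## the seven effects are now all zero:
betameans[2:8]
## [1] 0 0 0 0 0 0 0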

We will fit a model with varying intercepts and varying slopes for all seven effects, by subjects and by items. We do not estimate the correlation parameters, because these will be difficult to estimate with the sample size (40 subjects and 48 items) used in the Dillon et al. (2013) study. To illustrate the effect of mis-specification of the likelihood function, we will fit the simulated data to both log-transformed and raw reading times.

## full model:
library(lme4)
nsim<-100
significantlog<-significantraw<-matrix(0,nsim,ncol=7)
## record failed convergences, remove these later:
failedlog<-failedraw<-rep(0,nsim)

for(i in 1:nsim){
##  print(paste("********simulation ",i,"********",sep=""))
  simdat<-gen_fake_lnorm2x2x2(beta = betameans,
                               Sigma_u=Sigma_u,Sigma_w=Sigma_w,
                               sigma_e=sig,nitem=48,
                               nsubj=40)

  m_fakelog<-lmer(log(rt)~Dep+Gram+DepxGram+Int_gram_refl+Int_gram_agr+
                 Int_ungram_refl+Int_ungram_agr+
                 (1+Dep+Gram+DepxGram+Int_gram_refl+Int_gram_agr+
                    Int_ungram_refl+Int_ungram_agr||subj)+
                 (1+Dep+Gram+DepxGram+Int_gram_refl+Int_gram_agr+
                    Int_ungram_refl+Int_ungram_agr||item),simdat,
               control=lmerControl(optimizer="nloptwrap", calc.derivs = FALSE))

    m_fakeraw<-lmer(rt~Dep+Gram+DepxGram+Int_gram_refl+Int_gram_agr+
                 Int_ungram_refl+Int_ungram_agr+
                 (1+Dep+Gram+DepxGram+Int_gram_refl+Int_gram_agr+
                    Int_ungram_refl+Int_ungram_agr||subj)+
                 (1+Dep+Gram+DepxGram+Int_gram_refl+Int_gram_agr+
                    Int_ungram_refl+Int_ungram_agr||item),simdat,
               control=lmerControl(optimizer="nloptwrap", 
                                   calc.derivs = FALSE))

  ## ignore runs that failed to converge (log rts model):
  if(any(grepl("failed to converge", m_fakelog@optinfo$conv$lme4$messages)) ||
     any(grepl("singular", m_fakelog@optinfo$conv$lme4$messages))){
    failedlog[i]<-1
  } else {
    ## record, for each of the seven slopes, whether |t| > 2:
    tlog<-summary(m_fakelog)$coefficients[2:8,"t value"]
    significantlog[i,]<-ifelse(abs(tlog)>2,1,0)
  }

  ## ignore runs that failed to converge (raw rts model):
  if(any(grepl("failed to converge", m_fakeraw@optinfo$conv$lme4$messages)) ||
     any(grepl("singular", m_fakeraw@optinfo$conv$lme4$messages))){
    failedraw[i]<-1
  } else {
    traw<-summary(m_fakeraw)$coefficients[2:8,"t value"]
    significantraw[i,]<-ifelse(abs(traw)>2,1,0)
  }
}
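
As a supplementary check (this is an addition, not part of the original analysis), we can confirm that each individual test is roughly calibrated: the per-comparison Type I error rates, computed over the converged runs only, should each be close to the nominal 0.05.

## per-comparison Type I error for each of the seven tests,
## using only the converged runs:
colMeans(significantlog[failedlog==0,])
colMeans(significantraw[failedraw==0,])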

## In 100 simulations, what is the proportion of times that
## *at least one* of the seven effects is significant?
## Save the simulation results:
save(significantlog,file="maxmodelsiglog.rda")
save(significantraw,file="maxmodelsigraw.rda")

What we see below is that (a) with the log reading time model, Type I error is inflated when we carry out multiple comparisons, and (b) model mis-specification (using a Normal likelihood when a log-normal likelihood should have been used) inflates Type I error even further.

countslog<-table(rowSums(significantlog)>0)
## Type I error under multiple comparisons:
countslog[2]/sum(countslog)
## TRUE 
## 0.21
countsraw<-table(rowSums(significantraw)>0)
## Type I error under multiple comparisons + model mis-specification:
countsraw[2]/sum(countsraw)
## TRUE 
##  0.3
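
Note that runs flagged as non-converged enter the tables above as all-zero rows, which slightly understates the error rates; excluding them, as the comment in the simulation code anticipates, gives a cleaner estimate (a sketch):

## exclude non-converged runs before computing familywise Type I error:
countslog<-table(rowSums(significantlog[failedlog==0,])>0)
countslog[2]/sum(countslog)
countsraw<-table(rowSums(significantraw[failedraw==0,])>0)
countsraw[2]/sum(countsraw)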

References

Dillon, Brian W., Alan Mishler, Shayne Sloggett, and Colin Phillips. 2013. “Contrasting Intrusion Profiles for Agreement and Anaphora: Experimental and Modeling Evidence.” Journal of Memory and Language 69 (2): 85–103.