Chapter 9 Contrast coding for designs with two predictor variables

Chapter 8 provides a basic introduction into contrast coding in situations where there is one predictor variable, i.e., one factor, for which its effect can be estimated using a specified contrast matrix. Here, we will investigate how contrast coding generalizes to situations where there is more than one predictor variable. This could either be a situation where two factors are present or where one factor is paired with a continuous predictor variable, i.e., a covariate. We first discuss contrast coding for the case of two factors (for $2 \times 2$ designs; section 9.1) and then go on to investigate situations where one predictor is a factor and the other predictor is a covariate (section 9.2). Moreover, one problem in the analysis of interactions occurs in situations where the model is not linear, but has some non-linear link function, such as e.g., in logistic models or when assuming a log-normally distributed dependent variable. In these situations, the model makes predictions for each condition (i.e., design cell) at the latent level of the linear model. Sometimes it is important to translate these model predictions to the level of the observations (e.g., to probabilities in a logistic regression model). We will discuss how this can be implemented in section 9.3. We begin by treating contrast coding in a factorial $2 \times 2$ design.

9.1 Contrast coding in a factorial $2 \times 2$ design

In chapter 8 in section 8.3, we used a data set with one 4-level factor. Here, we assume that the same four means come from an $A(2) \times B(2)$ between-subject-factor design rather than an F(4) between-subject-factor design. Load the simulated data and show summary statistics in Table 9.1 and in Figure 9.1. The means and standard deviations are exactly the same as in Figure 8.2 and in Table 8.3.

FIGURE 9.1: Means and error bars (showing standard errors) for a simulated data set with a two-by-two between-subjects factorial design.

TABLE 9.1: Summary statistics per condition for the simulated data.
Factor A	Factor B	N data	Means	Std. dev.	Std. errors
A1	B1	$5$	$10$	$10$	$4.5$
A1	B2	$5$	$20$	$10$	$4.5$
A2	B1	$5$	$10$	$10$	$4.5$
A2	B2	$5$	$40$	$10$	$4.5$

In order to carry out a $2\times 2$ ANOVA-type ( main effects and interaction) analysis, one needs sum contrasts in the linear model. (This is true for factors with two levels, but does not generalize to factors with more levels.) The results of such an analysis are shown in Table 9.2.

# define sum contrasts:
contrasts(df_contrasts4$A) <- contr.sum(2)
contrasts(df_contrasts4$B) <- contr.sum(2)

# Bayesian LM
fit_AB.sum <- brm(DV ~ 1 + A * B,
  data = df_contrasts4,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

TABLE 9.2: Bayesian linear model of a 2 x 2 design with sum contrasts.
Predictor	Estimate	Est. Error	2.5%	97.5%
Intercept	$19.99$	$2.43$	$15.29$	$24.83$
A1	$-4.95$	$2.45$	$-9.92$	$-0.15$
B1	$-9.95$	$2.59$	$-15.00$	$-4.79$
A1:B1	$4.98$	$2.57$	$-0.07$	$10.00$

Next, we reproduce the $A(2) \times B(2)$ - ANOVA with contrasts specified for the corresponding one-way $F(4)$ ANOVA, that is by treating the $2 \times 2 = 4$ condition means as four levels of a single factor $F$. In other words, we go back to the data frame simulated for the analysis of repeated contrasts (see chapter 8, section 8.3). We first define weights for condition means according to our hypotheses, invert this matrix, and use it as the contrast matrix for factor $F$. We define weights of $1/4$ and $-1/4$. We do so because (a) we want to compare the mean of two conditions to the mean of two other conditions (e.g., factor $A$ compares $\frac{F1 + F2}{2}$ to $\frac{F3 + F4}{2}$). Moreover, (b) we want coefficients to code half the difference between condition means, reflecting sum contrasts. Together $(a+b)$, this yields weights of $1/2 \cdot 1/2 = 1/4$. The resulting contrast matrix contains contrast coefficients of $+1$ or $-1$, showing that we successfully implemented sum contrasts. The results are identical to the previous models.

t(fractions(HcInt <- rbind(
  A = c(F1 = 1 / 4, F2 = 1 / 4, F3 = -1 / 4, F4 = -1 / 4),
  B = c(F1 = 1 / 4, F2 = -1 / 4, F3 = 1 / 4, F4 = -1 / 4),
  AxB = c(F1 = 1 / 4, F2 = -1 / 4, F3 = -1 / 4, F4 = 1 / 4)
)))

##    A    B    AxB 
## F1  1/4  1/4  1/4
## F2  1/4 -1/4 -1/4
## F3 -1/4  1/4 -1/4
## F4 -1/4 -1/4  1/4

(XcInt <- ginv2(HcInt))

##    A  B  AxB
## F1  1  1  1 
## F2  1 -1 -1 
## F3 -1  1 -1 
## F4 -1 -1  1

contrasts(df_contrasts3$F) <- XcInt

fit_F4.sum <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_F4.sum)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    20.02      2.34  15.40 24.55
## FA           -4.94      2.41  -9.60 -0.17
## FB           -9.95      2.43 -14.75 -5.23
## FAxB          5.01      2.50   0.02  9.94

This shows that it is possible to specify the contrasts not only for each factor (e.g., here in the $2 \times 2$ design) separately. Instead, one can also pool all experimental conditions (or design cells) into one large factor (here factor $F$ with $4$ levels), and specify the contrasts for the main effects and for the interactions in the resulting one large contrast matrix simultaneously.

In this approach, it can again be very useful to apply the hypr package to construct contrasts for a $2 \times 2$ design. The first parameter estimates the main effect $A$, i.e., it compares the average of $F1$ and $F2$ to the average of $F3$ and $F4$. The second parameter estimates the main effect $B$, i.e., it compares the average of $F1$ and $F3$ to the average of $F2$ and $F4$. We code direct differences between the averages, i.e., we implement scaled sum contrasts instead of sum contrasts. This is shown below: the contrast matrix contains coefficients of $+1/2$ and $-1/2$ instead of $+1$ and $-1$. The interaction term estimates the difference between differences, i.e., the difference between $F1 - F2$ and $F3 - F4$.

hAxB <- hypr(
  A = (F1 + F2) / 2 ~ (F3 + F4) / 2,
  B = (F1 + F3) / 2 ~ (F2 + F4) / 2,
  AxB = (F1 - F2) ~ (F3 - F4)
)
hAxB

## hypr object containing 3 null hypotheses:
##   H0.A: 0 = (F1 + F2 - F3 - F4)/2
##   H0.B: 0 = (F1 + F3 - F2 - F4)/2
## H0.AxB: 0 = F1 - F2 - F3 + F4
## 
## Call:
## hypr(A = ~1/2 * F1 + 1/2 * F2 - 1/2 * F3 - 1/2 * F4, B = ~1/2 * 
##     F1 + 1/2 * F3 - 1/2 * F2 - 1/2 * F4, AxB = ~F1 - F2 - F3 + 
##     F4, levels = c("F1", "F2", "F3", "F4"))
## 
## Hypothesis matrix (transposed):
##    A    B    AxB 
## F1  1/2  1/2    1
## F2  1/2 -1/2   -1
## F3 -1/2  1/2   -1
## F4 -1/2 -1/2    1
## 
## Contrast matrix:
##    A    B    AxB 
## F1  1/2  1/2  1/4
## F2  1/2 -1/2 -1/4
## F3 -1/2  1/2 -1/4
## F4 -1/2 -1/2  1/4

contrasts(df_contrasts3$F) <- contr.hypothesis(hAxB)

fit_F4hypr <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_F4hypr)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.97      2.48  15.16 24.80
## FA           -9.85      4.97 -19.50  0.22
## FB          -19.83      4.93 -29.62 -9.89
## FAxB         19.24      9.67   0.17 37.49

The results show that the estimates for the main effects have double the size as compared to the sum contrasts–this is the result of the scaling that we applied. I.e., the main effects now directly estimate the difference between averages. The interaction estimates the difference between differences. Nevertheless, if adequate priors are used, then both contrasts would lead to the same hypothesis tests if one were doing hypothesis testing using Bayes factors. Thus, the hypr package can be used to code hypotheses in a $2 \times 2$ design.

An alternative way to code main effects and interactions is to use the ifelse command in R. For example, if we want to use $\pm 1$ sum contrasts in the above example, we can specify the contrasts for the main effects as vectors:

A <- ifelse(df_contrasts3$F %in% c("F1", "F2"), -1, 1)
B <- ifelse(df_contrasts3$F %in% c("F1", "F3"), -1, 1)

Now, defining the interaction is simply a matter of multiplying the two vectors:

AxB <- A * B

An alternative is to use $\pm 1/2$ coding when using this approach:

A <- ifelse(df_contrasts3$F %in% c("F1", "F2"), -1 / 2, 1 / 2)
B <- ifelse(df_contrasts3$F %in% c("F1", "F3"), -1 / 2, 1 / 2)
## interaction:
(AxB <- A * B)

##  [1]  0.25  0.25  0.25  0.25  0.25 -0.25 -0.25 -0.25 -0.25 -0.25 -0.25
## [12] -0.25 -0.25 -0.25 -0.25  0.25  0.25  0.25  0.25  0.25

Now the main effects and interaction can be directly interpreted as differences between averages and as differences between differences. If one wants the interaction term to be on the same scale as the main effects, it would need to be multiplied by 2, however, then its interpretation would not be straightforward (“i.e., difference between differences”) any more, but a scaled variant of this.

## rescale:
(AxB <- A * B * 2)

##  [1]  0.5  0.5  0.5  0.5  0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5
## [14] -0.5 -0.5  0.5  0.5  0.5  0.5  0.5

This kind of vector-based contrast coding is convenient for more complex designs, such as $2\times 2\times 2$ factorial designs.

9.1.1 Nested effects

One can estimate effects that do not correspond directly to main effects and interaction of the traditional ANOVA. For example, in a $2 \times 2$ experimental design, where factor $A$ codes word frequency (low/high) and factor $B$ is part of speech (noun/verb), one can estimate the effect of word frequency within nouns and the effect of word frequency within verbs. Formally, $A_{B1}$ versus $A_{B2}$ are nested within levels of $B$. Said differently, simple effects of factor $A$ are estimated for each of the levels of factor $B$. In this version, we estimate the main effect of part of speech ($B$; as in traditional ANOVA). Instead of also estimating the second main effect word frequency, $A$, and the interaction, we estimate (1) whether the two levels of word frequency $A$ differ for the first level of $B$ (i.e., nouns), and (2) whether the two levels of word frequency, $A$, differ for the second level of $B$ (i.e., verbs). In other words, we estimate whether there are differences for $A$ in each of the levels of $B$. Often researchers have hypotheses about these differences, and not about the interaction.

t(fractions(HcNes <- rbind(
  B = c(F1 = 1 / 2, F2 = -1 / 2, F3 = 1 / 2, F4 = -1 / 2),
  B1xA = c(F1 = -1, F2 = 0, F3 = 1, F4 = 0),
  B2xA = c(F1 = 0, F2 = -1, F3 = 0, F4 = 1)
)))

##    B    B1xA B2xA
## F1  1/2   -1    0
## F2 -1/2    0   -1
## F3  1/2    1    0
## F4 -1/2    0    1

(XcNes <- ginv2(HcNes))

##    B    B1xA B2xA
## F1  1/2 -1/2    0
## F2 -1/2    0 -1/2
## F3  1/2  1/2    0
## F4 -1/2    0  1/2

contrasts(df_contrasts3$F) <- XcNes

fit_Nest <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_Nest)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.96      2.45  15.04  24.8
## FB          -19.78      4.88 -29.56 -10.2
## FB1xA         0.05      7.09 -13.84  14.1
## FB2xA        19.55      6.84   5.87  32.9

The regression coefficients estimate the grand mean, the difference for the main effect of part of speech ($B$) and the two differences (for $A$; i.e., simple main effects) within the two levels (noun and verb) of part of speech ($B$).

These custom nested contrasts’ columns are scaled versions of the corresponding hypothesis matrix. This is the case because the columns are orthogonal. It illustrates the advantage of orthogonal contrasts for the interpretation of regression coefficients: the underlying comparisons being estimated are already clear from the contrast matrix.

There is also a built-in R formula specification for nested designs. The order of factors in the formula from left to right specifies a top-down order of nesting within levels, i.e., here factor $A$ (word frequency) is nested within levels of the factor $B$ (part of speech). This yields the same result as our previous result based on custom nested contrasts:

contrasts(df_contrasts4$A) <- c(-0.5, +0.5)
contrasts(df_contrasts4$B) <- c(+0.5, -0.5)
fit_Nest2 <- brm(DV ~ 1 + B / A,
  data = df_contrasts4,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_Nest2)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    20.02      2.52  15.01 25.13
## B1          -19.80      4.98 -30.04 -9.45
## BB1:A1       -0.09      6.99 -13.73 13.54
## BB2:A1       19.47      6.82   5.68 33.14

In cases such as these, where $A_{B1}$ vs. $A_{B2}$ are nested within levels of $B$, it is necessary to include the effect of $B$ (part of speech) in the model, even if one is only interested in the effect of $A$ (word frequency) within levels of $B$ (part of speech). Leaving out factor $B$ in this case would increase posterior uncertainty in the case of fully balanced data, and can lead to biases in parameter estimation in the case the data are not fully balanced.

Again, we show how nested contrasts can be easily implemented using hypr:

hNest <- hypr(
  B = (F1 + F3) / 2 ~ (F2 + F4) / 2,
  B1xA = F3 ~ F1,
  B2xA = F4 ~ F2
)
hNest

## hypr object containing 3 null hypotheses:
##    H0.B: 0 = (F1 + F3 - F2 - F4)/2
## H0.B1xA: 0 = F3 - F1
## H0.B2xA: 0 = F4 - F2
## 
## Call:
## hypr(B = ~1/2 * F1 + 1/2 * F3 - 1/2 * F2 - 1/2 * F4, B1xA = ~F3 - 
##     F1, B2xA = ~F4 - F2, levels = c("F1", "F2", "F3", "F4"))
## 
## Hypothesis matrix (transposed):
##    B    B1xA B2xA
## F1  1/2   -1    0
## F2 -1/2    0   -1
## F3  1/2    1    0
## F4 -1/2    0    1
## 
## Contrast matrix:
##    B    B1xA B2xA
## F1  1/2 -1/2    0
## F2 -1/2    0 -1/2
## F3  1/2  1/2    0
## F4 -1/2    0  1/2

contrasts(df_contrasts3$F) <- contr.hypothesis(hNest)

fit_NestHypr <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_NestHypr)

##           Estimate Est.Error  Q2.5 Q97.5
## Intercept    20.03      2.48  15.1  25.0
## FB          -19.96      5.06 -30.2 -10.0
## FB1xA        -0.07      6.76 -13.6  13.2
## FB2xA        19.55      6.80   6.2  33.0

Of course, we can also ask the reverse question: Are there differences for part of speech ($B$) in the levels of word frequency ($A$; in addition to estimating the main effect of word frequency, $A$)? That is, do nouns differ from verbs for low-frequency words ($B_{A1}$) and do nouns differ from verbs for high-frequency words ($B_{A2}$)?

hNest2 <- hypr(
  A = (F1 + F2) / 2 ~ (F3 + F4) / 2,
  A1xB = F2 ~ F1,
  A2xB = F4 ~ F3
)
hNest2

## hypr object containing 3 null hypotheses:
##    H0.A: 0 = (F1 + F2 - F3 - F4)/2
## H0.A1xB: 0 = F2 - F1
## H0.A2xB: 0 = F4 - F3
## 
## Call:
## hypr(A = ~1/2 * F1 + 1/2 * F2 - 1/2 * F3 - 1/2 * F4, A1xB = ~F2 - 
##     F1, A2xB = ~F4 - F3, levels = c("F1", "F2", "F3", "F4"))
## 
## Hypothesis matrix (transposed):
##    A    A1xB A2xB
## F1  1/2   -1    0
## F2  1/2    1    0
## F3 -1/2    0   -1
## F4 -1/2    0    1
## 
## Contrast matrix:
##    A    A1xB A2xB
## F1  1/2 -1/2    0
## F2  1/2  1/2    0
## F3 -1/2    0 -1/2
## F4 -1/2    0  1/2

contrasts(df_contrasts3$F) <- contr.hypothesis(hNest2)

fit_Nest2Hypr <- brm(DV ~ 1 + F,
  data = df_contrasts3,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_Nest2Hypr)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept    19.98      2.48  15.10 24.94
## FA           -9.90      4.96 -19.65  0.12
## FA1xB         9.72      6.95  -4.19 23.57
## FA2xB        29.29      6.99  14.99 42.98

Regression coefficients estimate the grand mean, the difference for the main effect of word frequency ($A$) and the two part of speech effects (for $B$; i.e., simple main effects) within levels of word frequency ($A$).

9.1.2 Interactions between contrasts

In a $2 \times 2$ experimental design, the results from sum contrasts are equivalent to typical ANOVA results that we see in frequentist analyses. This means that sum contrasts assess the main effects and the interactions. One interesting question arises here is: what would happen in a $2 \times 2$ design if we had used treatment contrasts instead of sum contrasts? Is it still possible to meaningfully interpret the results from the treatment contrasts in a simple $2 \times 2$ design?

This leads us to a very important principle in interpreting results from contrasts: When interactions between contrasts are included in a model, then the results for one contrast actually depend on the specification of the other contrast(s) in the analysis! This may be counter-intuitive at first, but it is very important and essential to keep in mind when interpreting results from contrasts. How does this work in detail?

The general rule to remember is that the effect of one contrast measures its effect at the location $0$ of the other contrast(s) in the analysis.This can be seen in the regression equation of a $2 \times 2$ design with factors $A$ and $B$:

\[\begin{equation} E[Y] = \alpha + \beta_A A + \beta_B B + \beta_{A \times B} A \times B \end{equation}\]

If we set the predictor $B$ to zero, then the equation simplifies to:

\[\begin{equation} E[Y] = \alpha + \beta_A A \end{equation}\]

Thus, now we can see the “pure” effect of $A$.

What does that mean practically? Let us consider the example that we use two treatment contrasts in a $2 \times 2$ design. Here are the results from the linear model

contrasts(df_contrasts4$A) <- c(0, 1)
contrasts(df_contrasts4$B) <- c(0, 1)
fit_treatm <- brm(DV ~ 1 + B * A,
  data = df_contrasts4,
  family = gaussian(),
  prior = c(
    prior(normal(20, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_treatm)

##           Estimate Est.Error   Q2.5 Q97.5
## Intercept     9.87      4.97   0.12  20.0
## B1           10.25      6.92  -3.05  24.3
## A1            0.35      6.97 -13.18  14.3
## B1:A1        19.20      9.64   0.16  37.7

Let’s take a look at the effect of factor $A$. How can we interpret what this measures? This effect actually estimates the effect of factor $A$ at the “location” where factor $B$ is coded as $0$. Factor $B$ is coded as a treatment contrast, that is, it codes a zero at its baseline condition, which is $B1$. Thus, the effect of factor $A$ estimates the effect of $A$ nested within the baseline condition of $B$, i.e., a simple effect. We take a look at the data presented in Figure 9.1, what this nested effect should be. Figure 9.1 shows that the effect of factor $A$ nested in $B1$ is $0$. If we now compare this to the results from the linear model, it is indeed clear that the effect of factor $A$ is exactly estimated as $0$. As expected, when factor $B$ is coded as a treatment contrast, the effect of factor $A$ estimates the effect of $A$ nested within the baseline level of factor $B$.

Next, consider the effect of factor $B$. According to the same logic, this effect estimates the effect of factor $B$ at the “location” where factor $A$ is $0$. Factor $A$ is also coded as a treatment contrast, that is, it codes its baseline condition $A1$ as $0$. The effect of factor $B$ estimates the effect of $B$ nested within the baseline condition of $A$. Figure 9.1 shows that this effect should be $10$.

How do we know what the “location” is, where a contrast applies? For the treatment contrasts discussed here, it is possible to reason this through because all contrasts are coded as $0$ or $1$. How can one derive the “location” in general? What we can do is to look at the comparisons that are estimated when using the treatment contrasts (or in case we use Bayes factors, which hypotheses are tested) in the presence of an interaction between them by using the generalized matrix inverse. We go back to the default treatment contrasts. Then we extract the contrast matrix from the design matrix:

contrasts(df_contrasts4$A) <- contr.treatment(2)
contrasts(df_contrasts4$B) <- contr.treatment(2)
XcTr <- df_contrasts4 %>%
  group_by(A, B) %>%
  summarise() %>%
  model.matrix(~ 1 + A * B, .) %>%
  as.data.frame() %>%
  as.matrix()
rownames(XcTr) <- c("A1_B1", "A1_B2", "A2_B1", "A2_B2")
XcTr

##       (Intercept) A2 B2 A2:B2
## A1_B1           1  0  0     0
## A1_B2           1  0  1     0
## A2_B1           1  1  0     0
## A2_B2           1  1  1     1

This shows the treatment contrast for factors $A$ and $B$, and their interaction. We can now assign this contrast matrix to a hypr object. hypr automatically converts the contrast matrix into a hypothesis matrix, such that we can read from the hypothesis matrix which comparison are being estimated by the different contrasts.

htr <- hypr() # initialize empty hypr object
cmat(htr) <- XcTr # assign contrast matrix to hypr object
htr # look at the resulting hypothesis matrix

## hypr object containing 4 null hypotheses:
## H0.(Intercept): 0 = A1_B1                          (Intercept)
##          H0.A2: 0 = -A1_B1 + A2_B1
##          H0.B2: 0 = -A1_B1 + A1_B2
##       H0.A2:B2: 0 = A1_B1 - A1_B2 - A2_B1 + A2_B2
## 
## Call:
## hypr(`(Intercept)` = ~A1_B1, A2 = ~-A1_B1 + A2_B1, B2 = ~-A1_B1 + 
##     A1_B2, `A2:B2` = ~A1_B1 - A1_B2 - A2_B1 + A2_B2, levels = c("A1_B1", 
## "A1_B2", "A2_B1", "A2_B2"))
## 
## Hypothesis matrix (transposed):
##       (Intercept) A2 B2 A2:B2
## A1_B1  1          -1 -1  1   
## A1_B2  0           0  1 -1   
## A2_B1  0           1  0 -1   
## A2_B2  0           0  0  1   
## 
## Contrast matrix:
##       (Intercept) A2 B2 A2:B2
## A1_B1 1           0  0  0    
## A1_B2 1           0  1  0    
## A2_B1 1           1  0  0    
## A2_B2 1           1  1  1

The same result is obtained by applying the generalized inverse to the contrast matrix (this is what hypr does as well). An important fact is that when we apply the generalized inverse to the contrast matrix, we obtain the corresponding hypothesis matrix (for details, see Schad, Vasishth, et al. 2020).

t(ginv2(XcTr))

##       (Intercept) A2 B2 A2:B2
## A1_B1  1          -1 -1  1   
## A1_B2  0           0  1 -1   
## A2_B1  0           1  0 -1   
## A2_B2  0           0  0  1

As discussed above, the effect of factor $A$ estimates its effect nested within the baseline level of factor $B$. Likewise, the effect of factor $B$ estimates its effect nested within the baseline level of factor $A$.

How does this work for sum contrasts? They do not have a baseline condition that is coded as $0$. In sum contrasts, the average of the contrast coefficients is $0$. Therefore, effects estimate the average effect across factor levels, i.e., they estimate a main effect. This is what is typically also tested in standard ANOVA. Let’s look at the example shown in Table 9.2: given that factor $B$ has a sum contrast, the main effect of factor $A$ is estimated as the average across levels of factor $B$. Figure 9.1 shows that the effect of factor $A$ in level B1 is $10 - 10 = 0$, and in level $B2$ it is $20 - 40 = -20$. The average effect across both levels is $(0 - 20)/2 = -10$. Due to the sum contrast coding, we have to divide this by two, yielding an expected effect of $-10 / 2 = -5$. This is exactly what the effect of factor $A$ measures (see Table 9.2, Estimate for $A1$).

Similarly, factor $B$ estimates its effect at the location $0$ of factor $A$. Again, $0$ is exactly the mean of the contrast coefficients from factor $A$, which is coded as a sum contrast. Therefore, factor $B$ estimates the effect of $B$ averaged across factor levels of $A$, i.e., the main effect of $B$. For factor level $A1$, factor $B$ has an effect of $10 - 20 = -10$. For factor level $A2$, factor $B$ has an effect of $10 - 40 = -30$. The average effect is $(-10 - 30)/2 = -20$, which again needs to be divided by $2$ due to the sum contrast. This yields exactly the estimate of $-10$ that is also reported in Table 9.2 (Estimate for B1).

Again, we look at the hypothesis matrix for the main effects and the interaction:

contrasts(df_contrasts4$A) <- contr.sum(2)
contrasts(df_contrasts4$B) <- contr.sum(2)
XcSum <- df_contrasts4 %>%
  group_by(A, B) %>%
  summarise() %>%
  model.matrix(~ 1 + A * B, .) %>%
  as.data.frame() %>%
  as.matrix()
rownames(XcSum) <- c("A1_B1", "A1_B2", "A2_B1", "A2_B2")

hsum <- hypr() # initialize empty hypr object
cmat(hsum) <- XcSum # assign contrast matrix to hypr object
hsum # look at the resulting hypothesis matrix

## hypr object containing 4 null hypotheses:
## H0.(Intercept): 0 = (A1_B1 + A1_B2 + A2_B1 + A2_B2)/4  (Intercept)
##          H0.A1: 0 = (A1_B1 + A1_B2 - A2_B1 - A2_B2)/4
##          H0.B1: 0 = (A1_B1 - A1_B2 + A2_B1 - A2_B2)/4
##       H0.A1:B1: 0 = (A1_B1 - A1_B2 - A2_B1 + A2_B2)/4
## 
## Call:
## hypr(`(Intercept)` = ~1/4 * A1_B1 + 1/4 * A1_B2 + 1/4 * A2_B1 + 
##     1/4 * A2_B2, A1 = ~1/4 * A1_B1 + 1/4 * A1_B2 - 1/4 * A2_B1 - 
##     1/4 * A2_B2, B1 = ~1/4 * A1_B1 - 1/4 * A1_B2 + 1/4 * A2_B1 - 
##     1/4 * A2_B2, `A1:B1` = ~1/4 * A1_B1 - 1/4 * A1_B2 - 1/4 * 
##     A2_B1 + 1/4 * A2_B2, levels = c("A1_B1", "A1_B2", "A2_B1", 
## "A2_B2"))
## 
## Hypothesis matrix (transposed):
##       (Intercept) A1   B1   A1:B1
## A1_B1  1/4         1/4  1/4  1/4 
## A1_B2  1/4         1/4 -1/4 -1/4 
## A2_B1  1/4        -1/4  1/4 -1/4 
## A2_B2  1/4        -1/4 -1/4  1/4 
## 
## Contrast matrix:
##       (Intercept) A1 B1 A1:B1
## A1_B1  1           1  1  1   
## A1_B2  1           1 -1 -1   
## A2_B1  1          -1  1 -1   
## A2_B2  1          -1 -1  1

This shows that each of the effects now does not compute nested comparisons any more, but that they rather estimate their effect averaged across conditions of the other factor. The averaging involves using weights of $1/2$. Moreover, the regression coefficients in the sum contrast measure half the distance between conditions, leading to weights of $1/2 \cdot 1/2 = 1/4$.

The general rule to remember from these examples is that when interactions between contrasts are estimated, what an effect of a factor estimates depends on the contrast coding of the other factors in the design! The effect of a factor estimates the effect nested within the location zero of the other contrast(s) in an analysis. If another contrast is centered, and zero is the average of this other contrasts’ coefficients, then the contrast of interest estimates the average or main effect, averaged across the levels of the other factor. Importantly, this property holds only when the interaction between two contrasts is included into a model. If the interaction is omitted and only effects are estimated, then there is no such “action at a distance”.

This may be a very surprising result for interactions of contrasts. However, it is also essential to interpreting contrast coefficients involved in interactions. It is particularly relevant for the analysis of the default treatment contrast, where the main effects estimate nested effects rather than average effects.

9.2 One factor and one covariate

9.2.1 Estimating a group difference and controlling for a covariate

In this section we treat the case where there are again two predictor variables for one dependent variable, but where one predictor variable is a discrete factor, and the other is a continuous covariate. Let’s assume we have measured some response time (RT), e.g., in a lexical decision task. We want to predict the response time based on each subject’s IQ, and we expect that higher IQ leads to shorter response times. Moreover, we have two groups of each 30 subjects. These are coded as factor $F$, with factor levels $F1$ and $F2$. We assume that these two groups have obtained different training programs to optimize their response times on the task. Group $F1$ obtained a control training, whereas group $F2$ obtained training to improve lexical decisions. We want to estimate in how far the training for better lexical decisions in group $F2$ leads to shorter response times compared to the control group $F1$. This is our main question of interest here, i.e., in how far the training program in F2 leads to faster response times compared to the control group $F1$. We load the data, which is a simulated data set.

data("df_contrasts5")

Our main effect of interest is the factor $F$. We want to estimate its effect on response times and code it using scaled sum contrasts, such that negative parameter estimates would yield support for our hypothesis that response times are faster in the training group $F2$:

(contrasts(df_contrasts5$F) <- c(-0.5, +0.5))

## [1] -0.5  0.5

We run a brms model to estimate the effect of factor $F$, i.e., how strongly the response times in the two groups differ from each other.

fit_RT_F <- brm(RT ~ 1 + F,
  data = df_contrasts5,
  family = gaussian(),
  prior = c(
    prior(normal(200, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_RT_F)

##           Estimate Est.Error  Q2.5  Q97.5
## Intercept    212.5      5.06 202.4 222.30
## F1           -24.2     10.09 -43.4  -4.41

$Means and error bars (showing standard errors) for a simulated data set of response times for two different groups of subjects, who have obtained a training in lexical decisions ($F2$) versus have obtained a control training ($F1$).$

FIGURE 9.2: Means and error bars (showing standard errors) for a simulated data set of response times for two different groups of subjects, who have obtained a training in lexical decisions ($F2$) versus have obtained a control training ($F1$).

We find (see model estimates and data shown in Figure 9.2) that response times in group $F2$ are roughly $25$ ms faster than in group $F1$ (Estimate of $-24$). This suggests that as expected, the training program that group $F2$ obtained seems to be successful in speeding up response times. Recall that one cannot just look at the 95% credible interval and check whether zero is outside the interval to declare that we have found an effect. To make a discovery claim, we need to run a Bayes factor analysis on this data set to directly test this hypothesis, and this may or may not provide evidence for a difference in response times between groups.

Let’s assume we have allocated subjects to the two groups randomly. Let’s say that we also measured the IQ of each person using an IQ test. We did so because we expected that IQ could have a strong influence on response times, and we wanted to control for this influence. We now can check whether the two groups had the same average IQ.

df_contrasts5 %>%
  group_by(F) %>%
  summarize(M.IQ = mean(IQ))

## # A tibble: 2 × 2
##   F      M.IQ
##   <fct> <dbl>
## 1 F1       85
## 2 F2      115

Group $F2$ not only obtained additional training and had faster response times, group $F2$ also had a higher IQ (mean of 115) on average than group $F1$ (mean IQ = 85). Thus, the random allocation of subjects to the two groups seems to have created–by chance–a difference in IQs. Now we can ask the question: why might response times in group $F2$ be faster than in group $F1$? Is this because of the training program in $F2$? Or is this simply because the average IQ in group $F2$ was higher than in group $F1$? To investigate this question, we add both predictor variables simultaneously in a brms model. Before we enter the continuous IQ variable, we center it, by subtracting its mean. Centering covariates is generally good practice. Moreover, it is often important to $z$-transform the covariate, i.e., to not only subtract the mean, but also to divide by its standard deviation (this can be done as follows: df_contrasts5$IQ.s <- scale(df_contrasts5$IQ)). The reason why this is often important is that the sampler doesn’t work well if predictors have different scales. For the simple models we use here, the sampler works without $z$-transformation. However, for more realistic and more complex models, $z$-transformation of covariates is often very important.

df_contrasts5$IQ.c <- df_contrasts5$IQ - mean(df_contrasts5$IQ)
fit_RT_F_IQ <- brm(RT ~ 1 + F + IQ.c,
  data = df_contrasts5,
  family = gaussian(),
  prior = c(
    prior(normal(200, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_RT_F_IQ)

##           Estimate Est.Error   Q2.5  Q97.5
## Intercept   212.35      4.78 203.18 222.06
## F1            6.98     13.25 -18.97  32.45
## IQ.c         -1.07      0.32  -1.69  -0.45

The results from the brms model now show that the difference in response times between groups (i.e., factor $F$) is not estimated to be $-25$ ms any more, but instead, the estimate is about $+7$ ms, and the 95% credible interval spans the range $-20$ to $33$. Thus, there doesn’t seem to be much reason to believe any more that the groups would differ. At the same time, we see that the predictor variable IQ shows a negative effect (Estimate = $-1$ with 95% credible interval: $-1.7$ to $-0.4$), suggesting that–as expected–response times seem to be faster in subjects with higher IQ.

$Response times as a function of individual IQ for two groups with a lexical decision training ($F2$) versus a control training ($F1$). Points indicate individual subjects, and lines with error bands indicate linear regression lines.$

FIGURE 9.3: Response times as a function of individual IQ for two groups with a lexical decision training ($F2$) versus a control training ($F1$). Points indicate individual subjects, and lines with error bands indicate linear regression lines.

This result can also be seen in Figure 9.3, which shows that response times decrease with increasing IQ, as suggested by the brms model. However, the heights of the two regression lines do not differ from each other, consistent with the observation in the brms model that the effect of factor $F$ did not seem to differ from zero. That is, factor $F$ in the brms model estimates the difference in height of the regression line between both groups. That the height does not differ and the effect of $F$ is estimated to be close to zero suggests that in fact group $F2$ showed faster response times not because of their additional training program. Instead, they had faster response times simply because their IQ was by chance higher on average compared to the control group $F1$. This analysis is the Bayesian equivalence of the frequentist “analysis of covariance” (ANCOVA), where it’s possible to estimate a group difference after “controlling for” the influence of a covariate.

We can also see in Figure 9.3 that the two regression lines for the two groups are exactly parallel to each other. That is, the influence of IQ on response times seems to be exactly the same in both groups. This is actually an assumption in the ANCOVA analysis that needs to be checked in the data. That is, if we want to estimate the difference between groups after controlling for a covariate (here IQ), we have to investigate whether the influence of the covariate is the same in both groups. We can estimate this by including an interaction term between the factor and the covariate in the brms model:

fit_RT_FxIQ <- brm(RT ~ 1 + F * IQ.c,
  data = df_contrasts5,
  family = gaussian(),
  prior = c(
    prior(normal(200, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_RT_FxIQ)

##           Estimate Est.Error   Q2.5  Q97.5
## Intercept   212.41      6.91 198.67 225.98
## F1            6.45     13.82 -20.79  33.59
## IQ.c         -1.06      0.33  -1.73  -0.40
## F1:IQ.c      -0.02      0.66  -1.28   1.26

The estimate for the interaction (the term F1:IQ.c) is very small here (close to $0$) and the 95% credible intervals clearly overlap with zero, showing that the two regression lines are estimated to be very similar, or parallel, to each other. If this is the case, then it is possible to correct for IQ when estimating the group difference.

9.2.2 Estimating differences in slopes

We now take a look at a different data set.

data("df_contrasts6")
levels(df_contrasts6$F) <- c("simple", "complex")

This again contains data from response times (RT) in two groups. Let’s assume the two groups have performed two different response time tasks, where one simple RT task doesn’t rely on much cognitive processing (group “simple”), whereas the other task is more complex and depends on complex cognitive operations (group “complex”). We therefore expect that RTs in the simple task should be independent of IQ, whereas in the complex task, individuals with a high IQ should be faster in responding compared to individuals with low IQ. Thus, our primary hypothesis of interest states that the influence of IQ on RT differs between conditions. This means that we are interested in the difference between slopes. A slope in a linear regression assesses how strongly the dependent variable (here RT) changes with an increase of one unit on the covariate (here IQ), it thus assesses how “steep” the regression line is. Our hypothesis thus states that the regression lines differ between groups.

Response times as a function of individual IQ for two groups performing a simple versus a complex task. Points indicate individual subjects' responses, and lines with error bands show the fitted linear regression lines.

FIGURE 9.4: Response times as a function of individual IQ for two groups performing a simple versus a complex task. Points indicate individual subjects’ responses, and lines with error bands show the fitted linear regression lines.

The results, displayed in Figure 9.4, suggest that the data seem to support our hypothesis. For the subjects performing the complex task, response times seem to decrease with increasing IQ, whereas for subjects performing the simple task, response times seem to be independent of IQ. As stated before, our primary hypothesis relates to the difference in slopes. Statistically speaking, this is assessed in the interaction between the factor and the covariate. Thus, we run a brms model where the interaction is included. Importantly, we first use scaled sum contrasts for the group effect, and again center the covariate IQ.

contrasts(df_contrasts6$F) <- c(-0.5, +0.5)
df_contrasts6$IQ.c <- df_contrasts6$IQ - mean(df_contrasts6$IQ)
fit_RT_FxIQ2 <- brm(RT ~ 1 + F * IQ.c,
  data = df_contrasts6,
  family = gaussian(),
  prior = c(
    prior(normal(200, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_RT_FxIQ2)

##           Estimate Est.Error   Q2.5  Q97.5
## Intercept   209.88      4.84 200.40 219.58
## F1           19.36      9.43   1.43  38.00
## IQ.c         -0.80      0.33  -1.45  -0.16
## F1:IQ.c      -1.58      0.65  -2.85  -0.29

We can see that the main effect of IQ (term IQ.c) is negative ($-0.8$) with 95% credible intervals $-1.5$ to $-0.2$, suggesting that overall response times decrease with increasing IQ. This is qualified by the interaction term, which is estimated to be negative ($-1.6$), with 95% credible intervals $-2.9$ to $-0.3$. This suggests that the slope in the complex group (which was coded as $+0.5$ in the scaled sum contrast) is more negative than the slope in the simple group (which was coded as $-0.5$ in the scaled sum contrast). Thus, the interaction assesses the difference between slopes.

We can also run a model, where the nested slopes are estimated, i.e., the slope of IQ in the simple group and the slope of IQ in the complex group. This can be implemented by using the nested coding that we learned about in the previous section:

fit_RT_FnIQ2 <- brm(RT ~ 1 + F / IQ.c,
  data = df_contrasts6,
  family = gaussian(),
  prior = c(
    prior(normal(200, 50), class = Intercept),
    prior(normal(0, 50), class = sigma),
    prior(normal(0, 50), class = b)
  )
)

fixef(fit_RT_FnIQ2)

##               Estimate Est.Error   Q2.5  Q97.5
## Intercept       209.98      4.74 200.43 219.16
## F1               19.33      9.56   0.68  37.87
## Fsimple:IQ.c     -0.01      0.47  -0.95   0.93
## Fcomplex:IQ.c    -1.60      0.46  -2.50  -0.70

Now we see that the slope of IQ in the simple group (Fsimple:IQ.c) is estimated to be $0$, with credible intervals clearly including zero. By contrast, the slope in the complex group (Fcomplex:IQ.c) is estimated as $-1.6$ (95% CrI = $[-2.5,-0.7]$). This is consistent with our hypothesis that high IQ speeds up response times for the complex but not for the simple task. (To obtain evidence for this effect, we need Bayes factors, see chapter 15.) We can also see from the nested analysis that the difference in slopes between conditions is $-1.6 - 0.0 = -1.6$. This is exactly the value for the interaction term that we estimated in the previous model, demonstrating that interaction terms assess the difference between slopes; i.e., they estimate in how far the regression lines in the two conditions are parallel, with an estimate of $0$ indicating perfectly parallel lines.

Notice that we can compute posterior samples for the nested slopes from the model with the interaction. That is, we can take the model that estimates main effects and the interaction, and compute posterior samples for the slope of IQ in the simple task and the slope of IQ in the complex task. First, we extract the posterior samples from the model.

df_postSamp_RT_FxIQ2 <- as_draws_df(fit_RT_FxIQ2)

Then, we take a look at the contrast coefficients for the group factor:

contrasts(df_contrasts6$F)

##         [,1]
## simple  -0.5
## complex  0.5

They show a value of $-0.5$ for the simple group. Thus, to compute the slope for the simple group we have to take the overall slope for IQ.c and subtract $-0.5$ times the estimate for the interaction:

df_postSamp_RT_FxIQ2 <- df_postSamp_RT_FxIQ2 %>%
  mutate(b_IQ.c_simple = b_IQ.c - 0.5 * `b_F1:IQ.c`)

Likewise, to estimate the slope for the complex group we have to take the overall slope for IQ.c and add $+0.5$ times the estimate for the interaction:

df_postSamp_RT_FxIQ2 <-df_postSamp_RT_FxIQ2 %>%
  mutate(b_IQ.c_complex = b_IQ.c + 0.5 * `b_F1:IQ.c`)

c(
  IQc.c_simple = mean(df_postSamp_RT_FxIQ2$b_IQ.c_simple),
  IQc.c_complex = mean(df_postSamp_RT_FxIQ2$b_IQ.c_complex)
)

##  IQc.c_simple IQc.c_complex 
##       -0.0136       -1.6069

The results show that the posterior means for the slope of IQ.c are $0$ and $-1.6$ for the simple and the complex groups, as we had found above in the nested analysis.

In most situations one should always center covariates before including them into a model. If covariates are not centered, then the effects (here the effect for the factor) cannot be interpreted as main effects any more.

One can also do analyses with interactions between a covariate and a factor, but by using different contrast codings. For example, if we use treatment contrasts for the factor, then the main effect of IQ.c assesses not the average slope of IQ.c across conditions, but instead the nested slope of IQ.c within the baseline group of the treatment contrast. The interaction still assesses the difference in slopes between groups. In a situation where there are more than two groups, when one estimates the interaction of contrasts with a covariate, then the contrasts define which slopes are compared with each other in the interaction terms. For example, when using sum contrasts in an example where the influence of IQ is measured on response times for nouns, verbs, and adjectives, then there are two interaction terms: these assess (1) whether the slope of IQ for nouns is different from the average slope across conditions, and (2) whether the slope of IQ for verbs is different from the average slope across conditions. If one uses repeated contrasts in a situation where the influence of IQ on response times is estimated for word frequency conditions “low”, “medium-low”, “medium-high”, and “high”, then there are three interaction terms (one for each contrast). The first interaction term estimates the difference in slopes between “low” and “medium-low” word frequencies, the second interaction term estimates the difference in slopes between “medium-low” and “medium-high” word frequencies, and the third interaction term estimates the difference in slopes between “medium-high” and “high” word frequency conditions. Thus, the logic of how contrasts specify certain comparisons between conditions extends directly to the situation where differences in slopes are estimated.

9.3 Interactions in generalized linear models (with non-linear link functions) and non-linear models

Next, we look at generalized linear models, where a linear predictor is passed through a non-linear link function to predict the dependent variable. Examples for generalized linear models include logistic regression models and models assuming a Poisson distribution. Even though a log-normal model is a linear model on a log-transformed dependent variable, the same techniques apply to this type of model since the logarithm transform is not linear. Here, we treat an example with a logistic model in a $2 \times 2$ factorial between-subject design. The logistic model has the following non-linear link function called the logistic function: $P(y=1 \mid \eta) = \frac{1}{1 + \exp(-\eta)}$, where $\eta$ is the latent linear predictor. For example, in our $2 \times 2$ factorial design with main effects A and B and their interaction, $\eta$ is computed as a linear combination of the intercept plus the main effects and their interaction: $\eta = 1 + \beta_A x_A + \beta_B x_B + \beta_{A \times B} x_{A \times B}$.

Thus, there is a latent level of linear predictions ($\eta$), which are then passed through a non-linear link function to predict the probability that the observed data is a success ($P(y = 1)$). We will use this logistic model to analyze an example data set where the dependent variable is dichotomous, coded as either a $1$ (indicating success) or a $0$ (indicating failure).

We load a simulated data set where the dependent variable codes whether a subject performed a task successfully ($pDV = 1$) or not ($pDV = 0$). Moreover, the data set has two between-subject factors A and B. The means for each of the four conditions are shown in Table 9.3.

data("df_contrasts7")

TABLE 9.3: Summary statistics per condition for the simulated data.
Factor A	Factor B	N data	Means
A1	B1	$50$	$0.2$
A1	B2	$50$	$0.5$
A2	B1	$50$	$0.2$
A2	B2	$50$	$0.8$

To analyze this data, we use scaled sum contrasts, as we had done above for the $2 \times 2$ design with response times as the dependent variable; this allows us to interpret the coefficients directly as main effects. Next, we fit a brms model. The model specification is the same as the model with response times–with two differences: First, the family argument is now specified as family = bernoulli(link = "logit") to indicate the logistic model. We do not specify a prior for sigma, since there is no residual standard deviation in a logistic model.

contrasts(df_contrasts7$A) <- c(-0.5, +0.5)
contrasts(df_contrasts7$B) <- c(-0.5, +0.5)
fit_pDV_AB.sum <- brm(pDV ~ 1 + A * B,
  data = df_contrasts7,
  family = bernoulli(link = "logit"),
  prior = c(
    prior(normal(0, 3), class = Intercept),
    prior(normal(0, 3), class = b)
  )
)

fixef(fit_pDV_AB.sum)

##           Estimate Est.Error  Q2.5 Q97.5
## Intercept    -0.35      0.17 -0.68 -0.03
## A1            0.71      0.34  0.07  1.38
## B1            2.09      0.34  1.45  2.77
## A1:B1         1.34      0.67  0.05  2.64

The results from this analysis show that the estimates for the two main effects ($A1$ and $B1$) as well as the interaction ($A1:B1$) are positive and the 95% credible intervals do not include zero. If we want to make a discovery claim, we would need to perform Bayes factor analyses to investigate the evidence that there is for each of the effects.

Next, we discuss how we can obtain model predictions for each of the four experimental conditions for this generalized linear model. To obtain such predictions, we first take a look at the contrast matrix. We simultaneously have contrasts for two main effects and one interaction:

We obtain the posterior samples for the estimates from the model:

df_postSamp_pDV <- as.data.frame(fit_pDV_AB.sum)

From these, we can compute the posterior samples for the linear predictions for each group. We see in the contrast matrix how we have to combine the posterior samples for the intercept, main effects, and interaction to obtain latent linear predictions for each condition. The first condition (design cell $A1$, $B1$) has a weight of $1$ for the intercept, and then weights of $-0.5$ (for the main effect of $A$), $-0.5$ (for the main effect of $B$), and $0.25$ (for the interaction). The posterior samples for the other conditions are computed accordingly.

df_postSamp_pDV <- df_postSamp_pDV %>%
  mutate(A1_B1 = 1 * b_Intercept - 0.5 * b_A1 - 0.5 * b_B1 + 
           0.25 * `b_A1:B1`,
         A1_B2 = 1 * b_Intercept - 0.5 * b_A1 + 0.5 * b_B1 - 
           0.25 * `b_A1:B1`,
         A2_B1 = 1 * b_Intercept + 0.5 * b_A1 - 0.5 * b_B1 - 
           0.25 * `b_A1:B1`,
         A2_B2 = 1 * b_Intercept + 0.5 * b_A1 + 0.5 * b_B1 + 
           0.25 * `b_A1:B1`)

Now, we have computed posterior samples for estimates of the latent linear predictor $\eta$ for each experimental condition. We can look at the posterior means:

df_postSamp_pDV %>%
  select(A1_B1, A1_B2, A2_B1, A2_B2) %>%
  colMeans()

##    A1_B1    A1_B2    A2_B1    A2_B2 
## -1.41606  0.00627 -1.37819  1.38533

This shows that these values are not on the probability scale. Instead, they are on the (log-odds) scale of the latent linear predictor $\eta$. For presentation and interpretation of the results, it might be much more informative to look at the condition means in terms of the probabilities of success in each of the four conditions. Given that we have the linear predictions for each condition, this can be easily computed by sending all posterior samples for the linear predictions through the link function. Applying the logistic function (plogis() in R) transforms the linear predictors to the probability scale:³³

df_postSamp_pDV <- df_postSamp_pDV %>%
  mutate(p_A1_B1 = plogis(A1_B1),
         p_A1_B2 = plogis(A1_B2),
         p_A2_B1 = plogis(A2_B1),
         p_A2_B2 = plogis(A2_B2))

Now, we have posterior samples for each condition on the probability scale. We can take a look at the posterior means, and see that these closely correspond to the probabilities in the data that we have seen above in Table 9.3.

df_postSamp_pDV %>%
  select(p_A1_B1, p_A1_B2, p_A2_B1, p_A2_B2) %>%
  colMeans()

## p_A1_B1 p_A1_B2 p_A2_B1 p_A2_B2 
##   0.201   0.502   0.207   0.794

Of course, the advantage is that we now have posterior samples for these conditions available, and can compute posterior 95% credible intervals (also see Figure 9.5). Rather than do it manually, the function conditional_effects() can do this for us. By default, all main effects and two-way interactions estimated in the model are shown (this can be changed by including, for example, effects = "A:B").

conditional_effects(fit_pDV_AB.sum, robust = FALSE)[]

## $A
##    A   pDV  B cond__ effect1__ estimate__   se__ lower__ upper__
## 1 A1 0.425 B1      1        A1      0.201 0.0564   0.103   0.319
## 2 A2 0.425 B1      1        A2      0.207 0.0556   0.110   0.326
## 
## $B
##    B   pDV  A cond__ effect1__ estimate__   se__ lower__ upper__
## 1 B1 0.425 A1      1        B1      0.201 0.0564   0.103   0.319
## 2 B2 0.425 A1      1        B2      0.502 0.0694   0.366   0.636
## 
## $`A:B`
##    A  B   pDV cond__ effect1__ effect2__ estimate__   se__ lower__
## 1 A1 B1 0.425      1        A1        B1      0.201 0.0564   0.103
## 2 A1 B2 0.425      1        A1        B2      0.502 0.0694   0.366
## 3 A2 B1 0.425      1        A2        B1      0.207 0.0556   0.110
## 4 A2 B2 0.425      1        A2        B2      0.794 0.0553   0.680
##   upper__
## 1   0.319
## 2   0.636
## 3   0.326
## 4   0.892

We plot the two-way interaction using brms embedding the conditional_effects() call in plot(.)[[1]]. This allows us to select the first (and here the only) ggplot2 element and to customize it.

plot(conditional_effects(fit_pDV_AB.sum,
                         effects = "A:B",
                         robust = FALSE),
     plot = FALSE)[[1]] +
  labs(y = "Success probability")

$Means and 95 percent posterior credible intervals for a simulated data set of successful task performance in a $2 \times 2$ design.$

FIGURE 9.5: Means and 95 percent posterior credible intervals for a simulated data set of successful task performance in a $2 \times 2$ design.

9.4 Summary

To summarize, we have seen interesting results for contrasts in the context of $2 \times 2$ designs, where depending on the contrast coding, the factors estimated nested effects (treatment contrasts) or main effects (sum contrasts). We also saw that it is possible to code contrasts for a $2 \times 2$ design, by creating one factor comprising all design cells, and by specifying all effects of interest in one large contrast matrix. In designs with one factor and one covariate it is possible to control group-differences for differences in the covariate (ANCOVA), or to estimate in how far regression slopes are parallel in different experimental conditions. Last, in generalized linear models with non-linear link functions it is possible to obtain posterior samples not only on the latent scale of linear predictors, but also on the scale of the response.

9.5 Further reading

Analysis of variance is discussed in detail in Maxwell, Delaney, and Kelley (2017). A practical book on ANOVA using R is Faraway (2002).

9.6 Exercises

Exercise 9.1 ANOVA coding for a four-condition design.

Load the following data. These data are from Experiment 1 in a set of reading studies on Persian (Safavi, Husain, and Vasishth 2016); we encountered these data in the preceding chapter’s exercises.

library(bcogsci)
data("df_persianE1")
dat1 <- df_persianE1
head(dat1)

##     subj item   rt distance   predability
## 60     4    6  568    short   predictable
## 94     4   17  517     long unpredictable
## 146    4   22  675    short   predictable
## 185    4    5  575     long unpredictable
## 215    4    3  581     long   predictable
## 285    4    7 1171     long   predictable

The four conditions are:

Distance=short and Predictability=unpredictable
Distance=short and Predictability=predictable
Distance=long and Predictability=unpredictable
Distance=long and Predictability=predictable

For the data given above, define an ANOVA-style contrast coding, and compute main effects and interactions. Check with hypr what the estimated comparisons are with an ANOVA coding.

Exercise 9.2 ANOVA and nested comparisons in a $2\times 2\times 2$ design

Load the following data set. This is a $2\times 2\times 2$ design from Jäger et al. (2020), with the factors Grammaticality (grammatical vs. ungrammatical), Dependency (Agreement vs. Reflexives), and Interference (Interference vs. no interference). The experiment is a replication attempt of Experiment 1 reported in Dillon et al. (2013).

library(bcogsci)
data("df_dillonrep")

The grammatical conditions are a,b,e,f. The rest of the conditions are ungrammatical.
The agreement conditions are a,b,c,d. The other conditions are reflexives.
The interference conditions are a,d,e,h, and the others are the no-interference conditions.

The dependent measure of interest is TFT (total fixation time, in milliseconds).

Using a linear model, do a main effects and interactions ANOVA contrast coding, and obtain an estimate of the main effects of Grammaticality, Dependency, and Interference, and all interactions. You may find it easier to code the contrasts coding the main effects as +1, -1, using ifelse() in R to code vectors corresponding to each main effect. This will make the specification of the interactions easy.

The researchers had a further research hypothesis: in ungrammatical sentences only, agreement would show an interference effect but reflexives would not. In grammatical sentences, both agreement and reflexives are expected to show interference effects. This kind of research question can be answered with nested contrast coding.

To carry out the relevant nested contrasts, define contrasts that estimate the effects of

grammaticality
dependency type
the interaction between grammaticality and dependency type
reflexives interference within grammatical conditions
agreement interference within grammatical conditions
reflexives interference within ungrammatical conditions
agreement interference within ungrammatical conditions

Do the estimates match expectations? Check this by computing the condition means and checking that the estimates from the models match the relevant differences between conditions or clusters of conditions.

References

Dillon, Brian W., Alan Mishler, Shayne Sloggett, and Colin Phillips. 2013. “Contrasting Intrusion Profiles for Agreement and Anaphora: Experimental and Modeling Evidence.” Journal of Memory and Language 69 (2): 85–103. https://doi.org/https://doi.org/10.1016/j.jml.2013.04.003.

Faraway, Julian James. 2002. Practical Regression and ANOVA using R. Vol. 168. Citeseer.

Jäger, Lena A., Daniela Mertzen, Julie A. Van Dyke, and Shravan Vasishth. 2020. “Interference Patterns in Subject-Verb Agreement and Reflexives Revisited: A Large-Sample Study.” Journal of Memory and Language 111. https://doi.org/https://doi.org/10.1016/j.jml.2019.104063.

Maxwell, Scott E, Harold D Delaney, and Ken Kelley. 2017. Designing Experiments and Analyzing Data: A Model Comparison Perspective. New York, NY: Routledge.

Safavi, Molood Sadat, Samar Husain, and Shravan Vasishth. 2016. “Dependency Resolution Difficulty Increases with Distance in Persian Separable Complex Predicates: Implications for Expectation and Memory-Based Accounts.” Frontiers in Psychology 7 (403). https://doi.org/10.3389/fpsyg.2016.00403.

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2019. “How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and Language 110. https://doi.org/10.1016/j.jml.2019.104038.

2020. “How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and Language 110 (February): 104038. https://doi.org/10/gf9tjp.

The same (with lower precision) can be achieved using 1/(1+exp(-.)).↩︎

Factor A	Factor B	N data	Means	Std. dev.	Std. errors
A1	B1	\(5\)	\(10\)	\(10\)	\(4.5\)
A1	B2	\(5\)	\(20\)	\(10\)	\(4.5\)
A2	B1	\(5\)	\(10\)	\(10\)	\(4.5\)
A2	B2	\(5\)	\(40\)	\(10\)	\(4.5\)

Predictor	Estimate	Est. Error	2.5%	97.5%
Intercept	\(19.99\)	\(2.43\)	\(15.29\)	\(24.83\)
A1	\(-4.95\)	\(2.45\)	\(-9.92\)	\(-0.15\)
B1	\(-9.95\)	\(2.59\)	\(-15.00\)	\(-4.79\)
A1:B1	\(4.98\)	\(2.57\)	\(-0.07\)	\(10.00\)

An Introduction to Bayesian Data Analysis for Cognitive Science