6.1 Basic concepts illustrated using a two-level factor

We first consider the simplest case: suppose we want to compare the means of a dependent variable (DV) such as response times between two groups of subjects. R can be used to simulate data for such an example. Such simulated data is available in the R package lingpsych accompanying this book as the data set df_contrasts1. The simulations assumed longer response times in condition F1 (\(\mu_1 = 0.8\) sec) than F2 (\(\mu_2 = 0.4\) sec). The data from the \(10\) simulated subjects are aggregated and summary statistics are computed for the two groups.

library(lingpsych)
data("df_contrasts1")
df_contrasts1
## # A tibble: 10 × 3
##    F        DV    id
##    <fct> <dbl> <int>
##  1 F1    0.636     1
##  2 F1    0.841     2
##  3 F1    0.555     3
##  4 F1    1.03      4
##  5 F1    0.938     5
##  6 F2    0.123     6
##  7 F2    0.304     7
##  8 F2    0.659     8
##  9 F2    0.469     9
## 10 F2    0.444    10
str(df_contrasts1)
## tibble [10 × 3] (S3: tbl_df/tbl/data.frame)
##  $ F : Factor w/ 2 levels "F1","F2": 1 1 1 1 1 2 2 2 2 2
##  $ DV: num [1:10] 0.636 0.841 0.555 1.029 0.938 ...
##  $ id: int [1:10] 1 2 3 4 5 6 7 8 9 10

TABLE 6.1: Summary statistics per condition for the simulated data.

Factor   N data   Est. means   Std. dev.   Std. errors
F1       5        0.8          0.2         0.1
F2       5        0.4          0.2         0.1

FIGURE 6.1: Means and standard errors of the simulated dependent variable (e.g., response times in seconds) in two conditions F1 and F2.

The results, displayed in Figure 6.1 and shown in Table 6.1, show that the assumed true condition means are exactly realized in the simulated data. The numbers are exact because the mvrnorm() function used to generate the data (see the documentation for df_contrasts1) ensures that the sample mean matches the true mean for each level. In real data sets, of course, the sample means will vary from experiment to experiment.
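
As an aside, here is a minimal sketch of how such exact data could be generated (this is not the package's own simulation code; the group size of 5 and the standard deviation of 0.2 are taken from Table 6.1, and the object names are only for illustration). Setting empirical = TRUE in MASS::mvrnorm() forces the sample moments to match the specified values exactly.

library(MASS)
# Five observations per group; empirical = TRUE forces the sample mean
# (and sd) to equal the specified values exactly.
y_F1 <- mvrnorm(5, mu = 0.8, Sigma = matrix(0.2^2), empirical = TRUE)
y_F2 <- mvrnorm(5, mu = 0.4, Sigma = matrix(0.2^2), empirical = TRUE)
sim <- data.frame(
  F = factor(rep(c("F1", "F2"), each = 5)),
  DV = c(y_F1, y_F2),
  id = 1:10
)
tapply(sim$DV, sim$F, mean) # exactly 0.8 and 0.4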

A simple linear model of DV on F, fit using the function lm, yields a straightforward estimate of the difference between the group means. The estimated coefficients are presented below:

fit_F <- lm(DV ~ 1 + F,
  data = df_contrasts1
)
round(summary(fit_F)$coefficients, 3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)      0.8      0.089   8.944    0.000
## FF2             -0.4      0.126  -3.162    0.013

Comparing the means for each condition with the coefficients (Estimates) reveals that (i) the intercept (\(0.8\)) is the mean for condition F1, \(\hat\mu_1\); and (ii) the slope (FF2: \(-0.4\)) is the difference between the estimated means for the two groups, \(\hat\mu_2 - \hat\mu_1\) (Bolker 2018):

\[\begin{equation} \begin{array}{lcl} \text{Intercept} = & \hat{\mu}_1 & = \text{estimated mean for F1} \\ \text{Slope (FF2)} = & \hat{\mu}_2 - \hat{\mu}_1 & = \text{estim. mean for F2} - \text{estim. mean for F1} \end{array} \label{def:beta} \end{equation}\]

The new information is the standard error for the difference between the two groups, and the t- and p-values corresponding to the null hypothesis test of no difference.
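
As a quick check of where these numbers come from, the standard error of the difference between two independent group means can be computed from the per-group standard deviations and group sizes in Table 6.1, and the t-value is the estimate divided by this standard error. The following sketch reproduces the values (assuming, as lm does, a common residual standard deviation):

# Standard error of the difference between two group means,
# with sd = 0.2 and n = 5 per group (values from Table 6.1):
se_diff <- sqrt(0.2^2 / 5 + 0.2^2 / 5)
se_diff        # approximately 0.126
-0.4 / se_diff # t value: approximately -3.16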

6.1.1 Default contrast coding: Treatment contrasts

How does the function lm arrive at these particular values for the intercept and slope? That is, why does the intercept assess the mean of condition F1, and how do we know that the slope measures the difference in means, F2 \(-\) F1? This result is a consequence of the default contrast coding of the factor F. R assigns treatment contrasts to factors and orders their levels alphabetically. The first factor level (here: F1) is coded as \(0\) and the second level (here: F2) is coded as \(1\). This becomes clear when we inspect the current contrast attribute of the factor using the contrasts command:

contrasts(df_contrasts1$F)
##    F2
## F1  0
## F2  1
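
The coding that lm() actually uses can also be seen in the design matrix it constructs. The following sketch (not part of the original analysis) shows the intercept column of \(1\)s and the \(0\)/\(1\) indicator column produced by the treatment contrast:

# Design matrix for the treatment contrast: a column of 1s for the
# intercept and a 0/1 indicator column (FF2) for level F2.
model.matrix(~ 1 + F, data = df_contrasts1)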

Why does this contrast coding yield these particular regression coefficients? Let’s take a look at the regression equation. Let \(\alpha\) represent the intercept, and \(\beta_1\) the slope. Then, the simple regression above expresses the belief that the expected response time \(y\) is a linear function of the factor F. In a more general formulation, this is written as follows: \(y\) is a linear function of some predictor \(x\) with regression coefficients for the intercept, \(\alpha\), and for the factor, \(\beta_1\):

\[\begin{equation} y = \alpha + \beta_1x \label{eq:lm1} \end{equation}\]

This equation is part of the likelihood in a statistical model. So, if \(x = 0\) (condition F1), \(y\) is \(\alpha + \beta_1 \cdot 0 = \alpha\); and if \(x = 1\) (condition F2), \(y\) is \(\alpha + \beta_1 \cdot 1 = \alpha + \beta_1\).

Expressing the above in terms of the estimated coefficients:

\[\begin{equation} \begin{array}{lccll} \text{estim. value for F1} = & \hat{\mu}_1 = & \hat{\beta}_0 = & \text{Intercept} \\ \text{estim. value for F2} = & \hat{\mu}_2 = & \hat{\beta}_0 + \hat{\beta}_1 = & \text{Intercept} + \text{Slope (FF2)} \end{array} \label{eq:predVal} \end{equation}\]

It is useful to think of such unstandardized regression coefficients as difference scores; they express the increase in the dependent variable \(y\) associated with a change in the independent variable \(x\) of \(1\) unit, such as going from \(0\) to \(1\) in this example. The difference between condition means is \(0.4 - 0.8 = -0.4\), which is the estimated regression coefficient \(\hat{\beta}_1\). The sign of the slope is negative because we have chosen to subtract the larger mean F1 score from the smaller mean F2 score.
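
The mapping between coefficients and condition means shown above can be verified directly from the fitted model; a small sketch (the object name coefs is only for illustration):

# Recover the condition means from the treatment-contrast coefficients:
coefs <- coef(fit_F)
coefs["(Intercept)"]                # estimated mean of F1: 0.8
coefs["(Intercept)"] + coefs["FF2"] # estimated mean of F2: 0.8 - 0.4 = 0.4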

6.1.2 Defining hypotheses

The analysis of the regression equation demonstrates that in the treatment contrast the intercept assesses the average response in the baseline condition, whereas the slope estimates the difference between condition means. However, these are just verbal descriptions of what each coefficient assesses. Is it also possible to formally write down what each coefficient assesses? Moreover, is it possible to relate each coefficient to a formal null hypothesis? In frequentist statistics, this relation to null hypothesis tests is direct.

From the perspective of parameter estimation and formal hypothesis tests, the slope represents the main test of interest, so we consider this first. The treatment contrast specifies that the slope \(\beta_1\) estimates the difference in means between the two levels of the factor F. This can formally be written as:

\[\begin{equation} \beta_1 = \mu_{F2} - \mu_{F1} \end{equation}\]

or equivalently:

\[\begin{equation} \beta_1 = - 1 \cdot \mu_{F1} + 1 \cdot \mu_{F2} \end{equation}\]

This can express the null hypothesis that the difference in means between the two levels of the factor F is \(0\); formally, the null hypothesis \(H_0\) is that \(H_0: \; \beta_1 = 0\):

\[\begin{equation} \label{eq:f2minusf1} H_0: \beta_1 = \mu_{F2} - \mu_{F1} = 0 \end{equation}\]

or equivalently:

\[\begin{equation} H_0: \beta_1 = - 1 \cdot \mu_{F1} + 1 \cdot \mu_{F2} = 0 \end{equation}\]

The \(\pm 1\) weights in the parameter estimation and null hypothesis statements directly express which means are compared by the treatment contrast.
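
These weights can be applied directly to the observed condition means to reproduce the estimated slope; a small sketch (not part of the original analysis):

# Condition means, and the comparison defined by the weights (-1, +1):
means <- tapply(df_contrasts1$DV, df_contrasts1$F, mean)
means                  # 0.8 and 0.4
sum(c(-1, +1) * means) # -0.4, the estimate for the slope FF2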

The intercept in the treatment contrast estimates a quantity and expresses a null hypothesis that is usually of little interest: it estimates the mean in condition F1, and can be used to test whether this mean of F1 is \(0\). Formally, the parameter \(\alpha\) estimates the following quantity:

\[\begin{equation} \alpha = \mu_{F1} \end{equation}\]

or equivalently:

\[\begin{equation} \alpha = 1 \cdot \mu_{F1} + 0 \cdot \mu_{F2} . \end{equation}\]

This can also be written as a formal null hypothesis, which is \(H_0: \; \alpha = 0\):

\[\begin{equation} H_0: \alpha = \mu_{F1} = 0 \end{equation}\]

or equivalently:

\[\begin{equation} \label{eq:trmtcontrfirstmention} H_0: \alpha = 1 \cdot \mu_{F1} + 0 \cdot \mu_{F2} = 0 . \end{equation}\]

The fact that the intercept term formally tests the null hypothesis that the mean of condition F1 is zero is in line with our earlier derivation, which showed that the intercept equals the estimated mean for condition F1.

In R, factor levels are ordered alphabetically and by default the first level is used as the baseline in treatment contrasts. Obviously, this default mapping will only be correct for a given data-set if the levels’ alphabetical ordering matches the desired contrast coding. When it does not, it is possible to re-order the levels. Here is one way of re-ordering the levels in R:

df_contrasts1$Fb <- factor(df_contrasts1$F,
  levels = c("F2", "F1")
)
contrasts(df_contrasts1$Fb)
##    F1
## F2  0
## F1  1

This re-ordering did not change any data associated with the factor, only one of its attributes. With this new contrast attribute a simple linear model yields the following result.

fit_Fb <- lm(DV ~ 1 + Fb,
  data = df_contrasts1
)
round(summary(fit_Fb)$coefficients, 3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)      0.4      0.089   4.472    0.002
## FbF1             0.4      0.126   3.162    0.013

The model now estimates different quantities. The intercept now codes the mean of condition F2, and the slope estimates the difference in means, F1 minus F2. This represents an alternative coding of the treatment contrast.
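
A shorter way to achieve the same change of baseline is the function relevel(), which moves a chosen reference level into first position; a small sketch (the column name Fb2 is only for illustration):

# relevel() makes "F2" the first (baseline) level; the default treatment
# contrast then compares F1 against the F2 baseline, as above.
df_contrasts1$Fb2 <- relevel(df_contrasts1$F, ref = "F2")
contrasts(df_contrasts1$Fb2)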

6.1.3 Sum contrasts

Treatment contrasts are only one of many options. It is also possible to use sum contrasts, which code one of the conditions as \(-1\) and the other as \(+1\), effectively 'centering' the effects at the grand mean (GM, i.e., the mean of the two group means). Here, we rescale the contrast to values of \(-0.5\) and \(+0.5\), which makes the estimated treatment effect the same as for treatment coding and easier to interpret.
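
The relation between the built-in sum contrast and the scaled version used here can be inspected directly; a small sketch (note that contr.sum() assigns \(+1\) to the first level, so the sign is flipped relative to the coding used below, where F1 receives \(-0.5\)):

# Built-in sum contrast for a two-level factor: first level +1, second -1.
contr.sum(2)
# Flipping the sign and halving gives the -0.5 / +0.5 coding used below.
-contr.sum(2) / 2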

To use this contrast in a linear regression, use the contrasts function:

contrasts(df_contrasts1$F) <- c(-0.5, +0.5)
fit_mSum <- lm(DV ~ 1 + F,
  data = df_contrasts1
)
round(summary(fit_mSum)$coefficients, 3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)      0.6      0.063   9.487    0.000
## F1              -0.4      0.126  -3.162    0.013

Here, the slope (F1) again codes the difference between the groups associated with the first and second factor levels, and it has the same value as in the treatment contrast. However, the intercept now represents the estimate of the average of the condition means for F1 and F2, that is, the GM. This differs from the treatment contrast. For the scaled sum contrast:

\[\begin{equation} \begin{array}{lcl} \text{Intercept} = & (\hat{\mu}_1 + \hat{\mu}_2)/2 & = \text{estimated average of the means for F1 and F2} \\ \text{Slope (F1)} = & \hat{\mu}_2 - \hat{\mu}_1 & = \text{estim. mean for F2} - \text{estim. mean for F1} \end{array} \label{def:beta2} \end{equation}\]

Why does the intercept assess the GM and why does the slope test the group difference? This is the result of rescaling the sum contrast. The first factor level (F1) was coded as \(-0.5\), and the second factor level (F2) as \(+0.5\):

contrasts(df_contrasts1$F)
##    [,1]
## F1 -0.5
## F2  0.5

Let’s again look at the regression equation to better understand what computations are performed. Again, \(\alpha\) represents the intercept, \(\beta_1\) represents the slope, and the predictor variable \(x\) represents the factor F. The regression equation is written as:

\[\begin{equation} y = \alpha + \beta_1x \label{eq:lm2} \end{equation}\]

The group of F1 subjects is coded as \(x = -0.5\), so the estimated response time for the F1 group is \(\alpha + \beta_1 \cdot (-0.5) = 0.6 + (-0.4) \cdot (-0.5) = 0.8\). The F2 group, by contrast, is coded as \(x = +0.5\), so the estimated mean of the F2 group is \(\alpha + \beta_1 \cdot 0.5 = 0.6 + (-0.4) \cdot 0.5 = 0.4\). Expressed in terms of the estimated coefficients:

\[\begin{equation} \begin{array}{lccll} \text{estim. value for F1} = & \hat{\mu}_1 = & \hat{\beta}_0 - 0.5 \cdot \hat{\beta}_1 = & \text{Intercept} - 0.5 \cdot \text{Slope (F1)}\\ \text{estim. value for F2} = & \hat{\mu}_2 = & \hat{\beta}_0 + 0.5 \cdot \hat{\beta}_1 = & \text{Intercept} + 0.5 \cdot \text{Slope (F1)} \end{array} \label{eq:predVal2} \end{equation}\]

The unstandardized regression coefficient is a difference score: Taking a step of one unit on the predictor variable \(x\), e.g., from \(-0.5\) to \(+0.5\), reflects a step from condition F1 to F2, and changes the dependent variable from \(0.8\) (for condition F1) to \(0.4\) (condition F2), giving us a difference of \(0.4 - 0.8 = -0.4\). This is again the estimated regression coefficient \(\hat{\beta}_1\). Moreover, as mentioned above, the intercept now assesses the GM of conditions F1 and F2: it is in the middle between condition means for F1 and F2.
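
As before, this mapping can be checked against the fitted coefficients; a small sketch (the object names a and b1 are only for illustration):

# Recover the condition means from the sum-contrast coefficients:
a  <- coef(fit_mSum)["(Intercept)"] # grand mean: 0.6
b1 <- coef(fit_mSum)["F1"]          # difference F2 - F1: -0.4
a - 0.5 * b1 # estimated mean of F1: 0.8
a + 0.5 * b1 # estimated mean of F2: 0.4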

So far, we have given verbal statements about what the intercept and the slope estimate and test in the case of the scaled sum contrast. It is possible to write these statements as formal parameter estimates and formal null hypotheses that are tested by each regression coefficient. In sum contrasts, the slope parameter \(\beta_1\) assesses the following quantity:

\[\begin{equation} \beta_1 = -1 \cdot \mu_{F1} + 1 \cdot \mu_{F2} \end{equation}\]

This can be formulated into the null hypothesis that the difference in means between the two levels of factor F is 0; formally, the null hypothesis \(H_0\) is that

\[\begin{equation} H_0: \beta_1 = -1 \cdot \mu_{F1} + 1 \cdot \mu_{F2} = 0 \end{equation}\]

This estimates the same quantity and tests the same null hypothesis as the slope in the treatment contrast. The intercept, however, now assesses a different quantity and expresses a different hypothesis about the data: it estimates the average of the two conditions F1 and F2, and tests the null hypothesis that this average is 0:

\[\begin{equation} \alpha = 1/2 \cdot \mu_{F1} + 1/2 \cdot \mu_{F2} = \frac{\mu_{F1} + \mu_{F2}}{2} \end{equation}\]

And for the null hypothesis:

\[\begin{equation} H_0: \alpha = 1/2 \cdot \mu_{F1} + 1/2 \cdot \mu_{F2} = \frac{\mu_{F1} + \mu_{F2}}{2} = 0 \end{equation}\]

In balanced data, i.e., in data sets with the same number of observations in each condition, the GM (the average of the two condition means for F1 and F2) is identical to the mean computed over all data points. In unbalanced data sets, where the group sizes differ (e.g., because of missing values), the two quantities diverge: the intercept still estimates the average of the condition means, with each condition weighted equally, and this can differ from the mean across all observations. To illustrate this point, consider an example with fully balanced data and two equal group sizes of \(5\) subjects for each group F1 and F2. Here, the GM is also the mean across all subjects. Next, consider a highly simplified unbalanced data set, where in condition F1 two observations of the dependent variable are available, with values of \(2\) and \(3\), and where in condition F2 only one observation of the dependent variable is available, with a value of \(4\). In this data set, the mean across all subjects is \(\frac{2 + 3 + 4}{3} = \frac{9}{3} = 3\). However, the GM as assessed by the intercept in a model using sum contrasts for factor F first computes the mean for each group separately (i.e., \(\frac{2 + 3}{2} = 2.5\), and \(4\)), and then computes the mean across conditions, \(\frac{2.5 + 4}{2} = \frac{6.5}{2} = 3.25\). The GM of \(3.25\) is different from the mean across subjects of \(3\).
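
This small unbalanced example can be reproduced in a few lines; a sketch (the data frame dat is constructed here only for illustration):

# Three observations: two in condition F1 (values 2 and 3), one in F2 (value 4).
dat <- data.frame(
  F  = factor(c("F1", "F1", "F2")),
  DV = c(2, 3, 4)
)
mean(dat$DV)                      # mean over all observations: 3
mean(tapply(dat$DV, dat$F, mean)) # mean of the condition means: 3.25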

To summarize, treatment contrasts and sum contrasts are two possible ways to parameterize the difference between two groups; they estimate different quantities and test different hypotheses (there are cases, however, where the estimates / hypotheses are equivalent). Treatment contrasts compare one or more means against a baseline condition, whereas sum contrasts allow us to determine whether we can reject the null hypothesis that a condition’s mean is the same as the GM (which in the two-group case also implies a hypothesis test that the two group means are the same). One question that comes up here is how one knows or formally derives what quantities are estimated by a given set of contrasts, and what hypotheses it can be used to test. This question will be discussed in detail below for the general case of any arbitrary contrasts.

6.1.4 Cell means parameterization

One alternative option is to use the so-called cell means parameterization. In this approach, one does not estimate an intercept and then differences between factor levels. Instead, each degree of freedom in the design is used to estimate the mean of one of the factor levels; no comparisons between condition means are estimated. The cell means parameterization is specified by explicitly removing the intercept term (which R adds by default) with a \(-1\) in the regression formula:

fit_mCM <- lm(DV ~ -1 + F,
  data = df_contrasts1
)
round(summary(fit_mCM)$coefficients, 3)
##     Estimate Std. Error t value Pr(>|t|)
## FF1      0.8      0.089   8.944    0.000
## FF2      0.4      0.089   4.472    0.002

Now, the regression coefficients (see the column labeled ‘Estimate’) estimate the mean of the first factor level (\(0.8\)) and the mean of the second factor level (\(0.4\)). This cell means parameterization usually does not allow a test of the hypotheses of interest, as these hypotheses usually relate to differences between conditions rather than to whether each condition differs from zero.
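
The reason each coefficient corresponds to one condition mean can again be seen in the design matrix; a small sketch (not part of the original analysis):

# With the intercept removed, the design matrix contains one 0/1 indicator
# column per factor level (FF1 and FF2), so each coefficient estimates
# the mean of the corresponding condition.
model.matrix(~ -1 + F, data = df_contrasts1)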

References

Bolker, Ben. 2018. https://github.com/bbolker/mixedmodels-misc/blob/master/notes/contrasts.rmd.