2.1 Some terminology surrounding typical experiment designs in linguistics and psychology

An experiment typically involves taking a random sample from some population. For example, we could take a sample of participants and record their reading times for two types of sentences, say easy and difficult sentences. How one operationalizes easy and difficult is not important at this point. However, to make the discussion concrete, consider a well-known example from psycholinguistics, a comparison of reading times in active vs. passive sentences:

Active sentence: The man threw the ball.

Passive sentence: The ball was thrown by the man.

One claim in the literature is that, due to their simpler structure, active sentences are easier to process than passive sentences. So this is an example of a simple vs. a complex sentence. If we want to investigate whether actives are easier to process than passives, we can, for instance, compare the total reading times for each sentence type. (Some problems arise here because the sentences don’t have the same length, so the comparison is not really fair; these issues will be discussed later.)

Technically, we are supposed to randomly choose participants because randomization allows us to generalize beyond the sample we have to the population we are interested in studying. In practice, this randomization is imperfectly implemented at best. Instead we take whatever participants we get, such as university students who happen to apply to participate in an experiment! So, random sampling is an assumption in all the discussions in this chapter, but in practice this assumption may or may not be perfectly met.

One design decision that the researcher must make is whether to collect only one response from each participant in each condition, or multiple responses. If we collect only one data point from each participant, we will say that the data points are independent of each other. When we collect multiple measurements from each participant, we will say that we have dependent data; such multiple measurements are also called repeated measures. With repeated measures coming from the same subject, there will be some dependency between the data points because the multiple measurements come from a common source.
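The dependency in repeated measures can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the reading-time values, the standard deviations, and the variable names (`subj_mean`, `rt1`, `rt2`) are hypothetical and do not come from any real experiment.

```r
set.seed(123)  # arbitrary seed, only so that the sketch is reproducible
n_subj <- 1000

# Assumed, made-up values: each subject has their own typical reading time (ms).
subj_mean <- rnorm(n_subj, mean = 500, sd = 100)

# Two measurements from the SAME subject share that subject's mean.
rt1 <- subj_mean + rnorm(n_subj, mean = 0, sd = 50)
rt2 <- subj_mean + rnorm(n_subj, mean = 0, sd = 50)
cor(rt1, rt2)          # clearly positive: common source

# Pairing measurements from DIFFERENT subjects removes the common source.
cor(rt1, sample(rt2))  # close to zero
```

The first correlation is high because both measurements inherit the subject's own baseline; once the pairing by subject is broken, the correlation essentially disappears.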

In psycholinguistics, repeated measures experiments are the norm. Furthermore, it is common to use what is called a Latin Square design. The term Latin Square refers to the fact that the ordering of the conditions forms a square (Table 2.1); the word Latin apparently only refers to the fact that the symbols for each condition were from the Latin script.

TABLE 2.1: The Latin Square with two conditions labeled a and b.
Group 1   Group 2
a         b
b         a

A characteristic property of the Latin Square is that, as shown in Table 2.1, each condition appears exactly once in each row and in each column. The Latin Square generalizes beyond two conditions easily. Suppose we have four conditions; then, the Latin Square is as in Table 2.2. Similarly, if there are eight conditions, the Latin Square table would have the form shown in Table 2.3.

TABLE 2.2: A Latin Square design with four conditions labeled a, b, c, and d.
Group 1   Group 2   Group 3   Group 4
a         b         c         d
b         c         d         a
c         d         a         b
d         a         b         c

TABLE 2.3: A Latin Square design with eight conditions labeled a, b, c, d, e, f, g, and h. The column names representing the eight groups are abbreviated as G1-G8.
G1  G2  G3  G4  G5  G6  G7  G8
a   b   c   d   e   f   g   h
b   c   d   e   f   g   h   a
c   d   e   f   g   h   a   b
d   e   f   g   h   a   b   c
e   f   g   h   a   b   c   d
f   g   h   a   b   c   d   e
g   h   a   b   c   d   e   f
h   a   b   c   d   e   f   g
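
The cyclic pattern in Tables 2.2 and 2.3 can be generated for any number of conditions. The following is a minimal sketch; the function name `latin_square` is our own hypothetical helper, not something provided by R, and it assumes at most 26 conditions because it labels them with `letters`.

```r
# Build a k x k cyclic Latin Square: row i is the list of conditions
# shifted to the left by i - 1 positions.
latin_square <- function(k) {
  conds <- letters[seq_len(k)]
  sq <- t(sapply(seq_len(k), function(i) {
    conds[((seq_len(k) + i - 2) %% k) + 1]
  }))
  colnames(sq) <- paste0("G", seq_len(k))
  sq
}

sq4 <- latin_square(4)
sq4

# Check the defining property: every condition occurs exactly once
# in each row and in each column.
all(apply(sq4, 1, function(r) setequal(r, letters[1:4])))
all(apply(sq4, 2, function(cl) setequal(cl, letters[1:4])))
```

Calling `latin_square(8)` reproduces the layout of Table 2.3 in the same way.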

As an aside, it is rarely a good idea to design experiments as complex as the one shown in Table 2.3. The more conditions you have in an experiment, the more decisions you will have to make about how to analyze the data, the more complex the model will get, and the harder the interpretation of the data. Later in this course, we will demonstrate the price one has to pay for setting up overly complex designs. This point is mentioned here because it is a common beginner error to want to have as many conditions as possible in a single experiment. In fact, the first experiment a student of ours once proposed to us was a 180-condition design.

The Latin Square design is attractive for psycholinguistics and psychology because it has some important optimality properties. Experiment design and the properties of different designs form a whole field in themselves in statistics, but we won’t get into these details.

The Latin Square design can be extended so that each participant sees each condition several times. For example, in a two-condition design, we may decide to create multiple (say 4) items, each of which comes in two versions, one for condition a and one for condition b. The Latin Square then looks like the one shown in Table 2.4.

TABLE 2.4: The Latin Square with two conditions labeled a and b, and four items.
item   Group 1   Group 2
1      a         b
2      b         a
3      a         b
4      b         a

Now, when each incoming participant is randomly assigned to Group 1 or Group 2, they will see two instances of each condition but, crucially, they will see each item only once. This has the advantage that we obtain repeated measures from each participant for each condition, leading to more accurate estimates from each participant for each condition; we also obtain measurements from multiple items. Because each participant sees each item only once, we avoid cases where showing the same item to a participant more than once could bias their response. For example, if we show the same word twice to a participant, they might process the second instance of that word more easily, biasing the average response to that word. There can, however, be situations where it is appropriate or necessary to show the same item multiple times to a participant. In most of the designs considered in this book, participants will see each item only once.
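
The assignment of participants to the two groups of Table 2.4 can be sketched as follows. This is an illustration under our own assumptions: the helper `conditions_for_group`, the number of participants (six), and the choice of a balanced random assignment (equal group sizes) are all hypothetical and are not prescribed by the design itself.

```r
# Hypothetical helper: the conditions a group sees for items 1..n_items,
# following Table 2.4 (Group 1: a, b, a, b; Group 2: b, a, b, a).
# Shifting the item index by the group number flips the a/b pattern.
conditions_for_group <- function(group, n_items = 4) {
  conds <- c("a", "b")
  conds[((seq_len(n_items) + group) %% 2) + 1]
}

set.seed(1)  # arbitrary seed, only so that the sketch is reproducible
n_subj <- 6
# Assumption: balanced random assignment, three participants per group.
group <- sample(rep(1:2, length.out = n_subj))

design <- do.call(rbind, lapply(seq_len(n_subj), function(s) {
  data.frame(subj = s, group = group[s], item = 1:4,
             condition = conditions_for_group(group[s]))
}))

xtabs(~ subj + condition, design)  # two a's and two b's per participant
xtabs(~ subj + item, design)       # each item seen once per participant
```

Whatever group a participant lands in, the two tables confirm that they see each condition twice and each item once.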

Having repeated measurements from participants and from items allows us to generalize beyond the specific participants and items that we have in the experiment.

To come closer to a realistic experiment design, suppose that we have two participants, one in each group (this is just for illustration; normally we will have many more than two participants). Then, the data frame will look like this:

condition <- c(rep(letters[1:2], 2), rep(letters[2:1], 2))
item <- rep(1:4, 2)
subj <- rep(1:2, each = 4)
df_example <- data.frame(subj, item, condition)
df_example
##   subj item condition
## 1    1    1         a
## 2    1    2         b
## 3    1    3         a
## 4    1    4         b
## 5    2    1         b
## 6    2    2         a
## 7    2    3         b
## 8    2    4         a

As a consequence of this design, we say that we have a fully crossed subjects (participants) and items design: each participant sees each item exactly once:

xtabs(~ subj + item, df_example)
##     item
## subj 1 2 3 4
##    1 1 1 1 1
##    2 1 1 1 1

You can also check that each subject sees two instances of each condition:

xtabs(~ subj + condition, df_example)
##     condition
## subj a b
##    1 2 2
##    2 2 2

Notice also that there is exactly one measurement from each item for each condition. This is because only two subjects are involved here. If there were 10 subjects, with five in each group, there would be five instances of each condition for each item.

xtabs(~ item + condition, df_example)
##     condition
## item a b
##    1 1 1
##    2 1 1
##    3 1 1
##    4 1 1
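
The statement about 10 subjects can be checked directly. In this sketch we assume, purely for illustration, that odd-numbered subjects are in Group 1 and even-numbered subjects are in Group 2; in a real experiment the assignment would be random. The name `df_ten` is our own.

```r
n_subj <- 10
subj <- rep(seq_len(n_subj), each = 4)
item <- rep(1:4, n_subj)

# Assumption: odd-numbered subjects get the Group 1 order (a, b, a, b),
# even-numbered subjects get the Group 2 order (b, a, b, a).
condition <- unlist(lapply(seq_len(n_subj), function(s) {
  if (s %% 2 == 1) rep(c("a", "b"), 2) else rep(c("b", "a"), 2)
}))

df_ten <- data.frame(subj, item, condition)
xtabs(~ item + condition, df_ten)  # every cell is 5
```

With five subjects in each group, every item is seen five times in condition a and five times in condition b.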

So, one consequence of the above repeated measurements design is that there will, in general, be repeated measurements not just from subjects but also from items. Thus, items can also be seen as producing dependent data. Later on, we will see that this design has far-reaching consequences regarding the type of data analysis one can and should do.

This is a very brief introduction to experiment design, but the reader should keep in mind that these are the kinds of designs we will primarily focus on in this book. We now turn to the idea of hypothetical repeated sampling and the central limit theorem.