8.5 Define a function for generating data

Next, we define a function that generates repeated measures data given the parameter estimates. The basic idea here is the following.

  • First, create a data-frame that represents a Latin-square design.
  • Then, given the condition id, and the subject and item ids in each row of the data frame, generate data row-by-row.

We explain these steps next.

8.5.1 Generate a Latin-square design

First, consider how one can create a Latin-square design. Suppose we have four items and four subjects. For such an experiment, we would create two groups, g1 and g2, with the following layout.

nitem <- 4
nsubj <- 4

g1 <- data.frame(
  item = 1:nitem,
  cond = rep(c("a", "b"), nitem / 2)
)
g2 <- data.frame(
  item = 1:nitem,
  cond = rep(c("b", "a"), nitem / 2)
)
g1
##   item cond
## 1    1    a
## 2    2    b
## 3    3    a
## 4    4    b
g2
##   item cond
## 1    1    b
## 2    2    a
## 3    3    b
## 4    4    a

Half the total number of subjects will be assigned to group 1 and half to group 2:

## assemble data frame:
gp1 <- g1[rep(
  seq_len(nrow(g1)),
  nsubj / 2
), ]
gp2 <- g2[rep(
  seq_len(nrow(g2)),
  nsubj / 2
), ]

simdat <- rbind(gp1, gp2)

## add subject column:
simdat$subj <- rep(1:nsubj, each = nitem)

Finally, the contrast coding for each row in the data-frame is set up:

## add contrast coding:
simdat$so <- ifelse(simdat$cond == "a", -1 / 2, 1 / 2)

8.5.2 Generate data row-by-row

Then, we proceed row-by-row in this data frame, and generate data for each subject, item, and condition. For example, the first row of our simulated data-set has subject id:

simdat[1, ]$subj
## [1] 1

Similarly, the first row has item id:

simdat[1, ]$item
## [1] 1

The first row’s condition coding is:

simdat[1, ]$so
## [1] -0.5

These three pieces of information are what we need to generate data for the first row. Recall that the model for subject \(i\) and item \(j\) in condition so is

\[\begin{equation} \beta_0 + u_{0i} + w_{0j} + (\beta + u_{1i} + w_{1j})\times so + \varepsilon \hbox{ where } \varepsilon \sim N(0,\sigma) \end{equation}\]

The terms u0, w0, and u1, w1, which are the adjustments to the intercepts and slopes by subject and item, are stored in two matrices that are generated randomly each time that we simulate new subjects/items. Recall from chapter 1 how bivariate data are generated. The intercept and slope adjustments will be generated using the mvrnorm function in the MASS library. For example, given the variance covariance matrix for subjects Sigma_u that we created above, the subject random effects (intercept and slope adjustments) for 10 subjects can be generated in a matrix as follows:

library(MASS)
u <- mvrnorm(
  n = 10,
  mu = c(0, 0), Sigma = Sigma_u
)
u
##       (Intercept)       so
##  [1,]     0.11016  0.28122
##  [2,]     0.83391  0.21912
##  [3,]    -0.12487  0.06087
##  [4,]    -0.02720 -0.06286
##  [5,]     0.42886  0.24292
##  [6,]    -0.09001 -0.24314
##  [7,]     0.06948  0.12048
##  [8,]     0.21241  0.04330
##  [9,]    -0.48465 -0.39614
## [10,]    -0.04833  0.35370

Each row in this matrix is the intercept and slope adjustment for a subject; the row number indexes the subject id. For example, subject 1’s intercept adjustment is:

u[1, 1]
## (Intercept) 
##      0.1102

Subject 1’s slope adjustment is:

u[1, 2]
##     so 
## 0.2812

Analogously to the subject random effects matrix, a matrix for items random effects is also generated. As an example, we generate simulated random effects for 10 items:

w <- mvrnorm(
  n = 10,
  mu = c(0, 0), Sigma = Sigma_w
)
w
##            [,1]       [,2]
##  [1,]  0.020043  0.0733144
##  [2,]  0.040632  0.0099160
##  [3,]  0.002638 -0.1848623
##  [4,]  0.006454 -0.0252776
##  [5,]  0.024081  0.0068770
##  [6,] -0.068960 -0.1846818
##  [7,] -0.009174  0.0097644
##  [8,]  0.020528 -0.0007004
##  [9,] -0.047864  0.1347658
## [10,] -0.014866  0.0425107

Now, to generate simulated data for subject 1 and item 1 for object relatives, we simply need to run this line of code:

rlnorm(
  1, beta[1] + u[1, 1] + w[1, 1] +
    (beta[2] + u[1, 2] + w[1, 2]) * (0.5),
  sigma_e
)
## [1] 423.2

For subject 2, all that would change is the subject id: instead of u[1,1], and u[1,2] we would write u[2,1] and u[2,2]. The for-loop below works through the simdat data-frame row by row, looks up the subject id, the item id for that row, and the condition coding for that row, and fills in the simulated reading time using the above code. This is how the simulated data are generated.

Here is the complete function for generating simulated data. It uses all the bits of code we discussed above.

library(MASS)
## assumes that no. of subjects and
## no. of items is divisible by 2.
gen_sim_lnorm2 <- function(nitem = 16,
                           nsubj = 42,
                           beta = NULL,
                           Sigma_u = NULL, # subject vcov matrix
                           Sigma_w = NULL, # item vcov matrix
                           sigma_e = NULL) {
  ## prepare data frame for a two-condition latin square:
  g1 <- data.frame(
    item = 1:nitem,
    cond = rep(c("a", "b"), nitem / 2)
  )
  g2 <- data.frame(
    item = 1:nitem,
    cond = rep(c("b", "a"), nitem / 2)
  )


  ## assemble data frame:
  gp1 <- g1[rep(
    seq_len(nrow(g1)),
    nsubj / 2
  ), ]
  gp2 <- g2[rep(
    seq_len(nrow(g2)),
    nsubj / 2
  ), ]

  simdat <- rbind(gp1, gp2)

  ## add subject column:
  simdat$subj <- rep(1:nsubj, each = nitem)

  ## add contrast coding:
  simdat$so <- ifelse(simdat$cond == "a", -1 / 2, 1 / 2)

  ## subject random effects:
  u <- mvrnorm(
    n = length(unique(simdat$subj)),
    mu = c(0, 0), Sigma = Sigma_u
  )

  ## item random effects
  w <- mvrnorm(
    n = length(unique(simdat$item)),
    mu = c(0, 0), Sigma = Sigma_w
  )

  ## generate data row by row:
  N <- dim(simdat)[1]
  rt <- rep(NA, N)
  for (i in 1:N) {
    rt[i] <- rlnorm(
      1, beta[1] +
        u[simdat[i, ]$subj, 1] +
        w[simdat[i, ]$item, 1] +
        (beta[2] + u[simdat[i, ]$subj, 2] +
          w[simdat[i, ]$item, 2]) * simdat$so[i],
      sigma_e
    )
  }
  simdat$rt <- rt
  simdat$subj <- factor(simdat$subj)
  simdat$item <- factor(simdat$item)
  simdat
}

Let’s generate some simulated data and check what the data look like:

dat <- gen_sim_lnorm2(
  nitem = 16,
  nsubj = 42,
  beta = beta,
  Sigma_u = Sigma_u,
  Sigma_w = Sigma_w,
  sigma_e = sigma_e
)

The data have the expected structure:

## fully  crossed subjects and items:
head(t(xtabs(~ subj + item, dat)))
##     subj
## item 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
##    1 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1
##    2 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1
##    3 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1
##    4 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1
##    5 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1
##    6 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1
##     subj
## item 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
##    1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##    2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##    3  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##    4  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##    5  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##    6  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##     subj
## item 36 37 38 39 40 41 42
##    1  1  1  1  1  1  1  1
##    2  1  1  1  1  1  1  1
##    3  1  1  1  1  1  1  1
##    4  1  1  1  1  1  1  1
##    5  1  1  1  1  1  1  1
##    6  1  1  1  1  1  1  1
##  8 measurements per condition:
head(t(xtabs(~ subj + cond, dat)))
##     subj
## cond 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
##    a 8 8 8 8 8 8 8 8 8  8  8  8  8  8  8  8  8  8  8
##    b 8 8 8 8 8 8 8 8 8  8  8  8  8  8  8  8  8  8  8
##     subj
## cond 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
##    a  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##    b  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
##     subj
## cond 36 37 38 39 40 41 42
##    a  8  8  8  8  8  8  8
##    b  8  8  8  8  8  8  8
##  contrast coding check:
xtabs(~ so + cond, dat)
##       cond
## so       a   b
##   -0.5 336   0
##   0.5    0 336
## condition b is slower than a:
round(with(dat, tapply(rt, cond, mean)))
##   a   b 
## 371 459

Everything checks out.