The following examples serves to illustrate the structure of experimental designs with both crossed and nested effects. The models are fitted speciying both random and fixed effects. We use the R package lme4 to estimate the coefficients of the resulting linear mixed effect model and lmerTest for additional testing features.

A word of caution: correctly formulating mixed models for complex experimental designs is far from trivial, and the correct interpretation of the output requires subject-matter expertise. Thus, rather than see the following sections as a definitive guide to doing such calculations on your own, it should be viewed as a way to acquire some terminology and intuition for speaking with a statistical consultant and formulating your needs, and gain familiarity for peer-reviewing purposes.

Example 1

We consider data from Danish company of electronics Bang and Olufsen from the lmerTest package (under data(TVbo)). The setup of the experiment is as follows: eight experienced raters (Assessor) were asked to rate 15 characteristics of televisions to compare two attributes, Picture and TVset.

Determining model structure

Based on this description alone, it is somewhat unclear what the structure of the experiment is. We can inspect the database.

Code
library(lmerTest)
data(TVbo)
str(TVbo)
'data.frame':   192 obs. of  19 variables:
 $ Assessor            : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ TVset               : Factor w/ 3 levels "TV1","TV2","TV3": 3 2 1 3 2 1 3 2 1 3 ...
 $ Repeat              : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 1 1 1 2 ...
 $ Picture             : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 2 2 2 2 ...
 $ Coloursaturation    : num  10.4 9.9 7 9.8 10.6 7.5 7.1 9.9 5 10 ...
 $ Colourbalance       : num  5 4.1 9.8 4.8 4.3 9.4 8 7.5 6.4 7.5 ...
 $ Noise               : num  13.1 10.9 13.1 13.2 13.3 13.3 12 8.7 12.8 13.4 ...
 $ Depth               : num  3.1 7.4 6 5.7 6.7 5.8 8.2 6.3 3.9 8.3 ...
 $ Sharpness           : num  8.3 5.3 7.9 9.3 4.7 6.6 10.7 7.5 6 8.8 ...
 $ Lightlevel          : num  9.5 9.9 6.9 9.8 10.3 6.7 10.3 9.5 5.8 9.2 ...
 $ Contrast            : num  6.4 6.9 5.9 6.3 7.5 6.4 9.1 7 4.1 9.8 ...
 $ Sharpnessofmovement : num  11.6 11.6 11.5 8.1 8.1 8.1 4 8.1 5.1 4.5 ...
 $ Flickeringstationary: num  13.5 13.5 13.5 12.1 8.8 13.2 12.1 5.6 11.8 13.5 ...
 $ Flickeringmovement  : num  2 1.8 1.7 13.5 13.5 13.5 1.9 1.8 1.7 13.5 ...
 $ Distortion          : num  1.8 1.9 1.9 1.5 1.5 1.5 1.8 2 1.9 1.4 ...
 $ Dimglasseffect      : num  1.6 1.6 1.6 1.7 1.7 1.7 1.6 1.6 1.5 1.7 ...
 $ Cutting             : num  1.5 2.2 3.3 1.4 3.9 3.7 1.5 2.7 4.1 1.6 ...
 $ Flossyedges         : num  3.4 3.3 5.4 7.9 4.4 4.6 9.9 3.1 4.5 11.6 ...
 $ Elasticeffect       : num  4.9 2.7 1.7 4.4 2.8 1.7 4.2 2.3 1.7 4.9 ...

We have three factors for the model: Assessor, TVset and Picture. There are seemingly two repetitions for ratings based on the factor level. Let us investigate the two manipulated factors, TVset and Picture, and look at the number of occurences for each combination

Code
xtabs(~ TVset + Picture, data = TVbo)
     Picture
TVset  1  2  3  4
  TV1 16 16 16 16
  TV2 16 16 16 16
  TV3 16 16 16 16

There are 16 measurements for each combination, something compatible with a repeated measure design in which each of the eight assessor gets to see all factors. Thus, Picture and TVset are crossed within-subject factors and the dataset is balanced. The contingency table shows that we can estimate interactions between TVset and Picture. Since we are not interested in individual assessors scores, we treat the factor as a random effect and any interaction with Assessor will also be treated as random.

Given that we have two replications per combination of TVset and Picture, we could in theory fit the three-way (random) effect by including the term (1 | Assessor:TVset:Picture). If we do so, the software gives a warning: boundary (singular) fit which indicates an estimated variance is exactly zero, a synonym in this example for overfitting. We thus remove the three-way interaction. However, the numerical routine fails to converge, showing zero estimated variance for Assessor:Picture. Thus, we refit the model without the three-way interaction (since it does not make sense to have a three-way interaction, but remove one of the two-way), and later refit another model without Assessor:Picture whose variability is still estimated to be zero. Sometimes, this may be due to poor starting value and we may try to change the default optimization algorithm, the problem persist.

Model formulation

Since there are 15 different response variables, we start with a single one, here Sharpness of the display. The model we fit is of the form

\[\begin{align*} Y_{ijkl} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \gamma_k + (\beta\gamma)_{jk} + \varepsilon_{ijkl}, \end{align*}\] where

  • \(Y_{ijkl}\) is the Sharpness rating for the \(i\)th TVset, the \(j\)th Picture, the \(k\)th assessor and the \(l\)th replication.
  • \(\mu\) is the global average of all ratings
  • \(\alpha_i\) is the mean different for the \(i\)th TVset
  • \(\beta_j\) is the mean different for the \(j\)th Picture
  • \((\alpha\beta)_{ij}\) is the interaction for the \(i\)th TVset and \(j\)th Picture.
  • \(\gamma_k \sim \mathsf{No}(0, \sigma^2_\gamma)\) is the random effect of assessors
  • \((\alpha\gamma)_{ik} \sim \mathsf{No}(0, \sigma^2_{\alpha\gamma})\) are random effects for assessors for television sets
  • \(\varepsilon_{ijkl} \sim \mathsf{No}(0, \sigma^2_\varepsilon)\) are measurements-specific error terms

All of the random effects, including error, are independent of one another.

Code
mmod <- lmer(
  Sharpness ~ TVset * Picture + 
       (1 | Assessor) + 
       # (1 | Assessor:Picture) +
       # (1 | Assessor:TVset:Picture) +
       (1 | Assessor:TVset) ,
  # control = lmerControl(optimizer = "bobyqa"),
  ## 'control' above for the optimization routine
  data = TVbo)
# summary(mmod)

Correlation structure

The first section of the summary table gives estimations of the standard deviation of each error term. There is little variability due to Assessor:Picture interaction. The most important part of this output is perhaps to look at the number of instances to make sure our design specification is correct. The model assumes (correctly) that there are 32 combinations of Assessor and Picture (8 by 4), 24 combinations of Assessor and TVset (8 by 3) and 8 combinations for Assessor.

Random effects:
 Groups         Name        Variance Std.Dev.
 Assessor:TVset (Intercept) 0.4909   0.7006  
 Assessor       (Intercept) 1.0837   1.0410  
 Residual                   1.9366   1.3916  
Number of obs: 192, groups:  
Assessor:TVset, 24; Assessor, 8

One of the main reason for using mixed models is to automatically account for the correlation between observations. If we fail to account for the dependence between ratings from the same assessor and their personal preferences, we will get standard errors for the estimated mean difference for the two fixed effect that are too small. Another popular option in finance and economics is use of so-called sandwich variance estimators to handle unequal variance and clustering. Among other, the sandwich package in R (Zeileis et al., 2020) offers multiple options should one decide on this avenue.

Figure 1 shows the structure of the correlation matrix of the ratings: the blocks corresponds to the eight assessors: measurements from different assessors are independent, but they are correlated within assessors as a result of specifying (multiple) random effects with assessors. Since there is more than one source of variability here, the pattern is different from equicorrelation, with additional correlations between if data are from the same TV set or same picture level.

Figure 1: Correlation structure for the linear mixed model fitted to the TVbo data

Testing

Before looking more in details at the output, let us see if the rating depends on the setting (TVset and Picture combination): if it did, we would have to basically compare all 12 combinations as a one-way analysis of variance model. In this context, the logical comparison would be to compute pairwise differences and check which are equivalent to the best combination.

Code
anova(mmod, ddf = "Kenward-Roger")
Type III Analysis of Variance Table with Kenward-Roger's method
               Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
TVset          90.869  45.435     2    14 23.4615 3.384e-05 ***
Picture        26.304   8.768     3   159  4.5276  0.004478 ** 
TVset:Picture 105.183  17.531     6   159  9.0524 1.634e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is clear that there are differences all across the board. Constructing tests for mixed models in general is complicated and we need to correctly set up the model and compare the variability of the differences relative to the mean for each factor relative to different measures of variability. These will be the next random term that includes this factor and random effects (here either Assessor or the Error term). The correct source of variability are Assessor:TVset for TVset, Assessor:Picture for Picture and the error term for the two-way interaction between both fixed effects. Likewise, the degrees of freedom for the denominator are determined by the number of instances of the random term that involves the factor, so \(8 \times 3 - (1 - 2 - 7) = 14\) degrees of freedom since there are 24 instances of TVset:Assessor and we have to estimate one grand mean, two mean differences for TVset and likewise 7 effects for the Assessor.

This decomposition is better seen looking directly at the model output when fitted using aov (since this is a balanced within-subject design).

Code
fm <- aov(
  Coloursaturation ~ TVset * Picture +
    Error(Assessor/(Picture*TVset)), 
  data = TVbo)

Pairwise comparisons

We can next look at the estimated marginal means to find the television sets and the picture with the best ratings.

Code
library(emmeans)
emm <- mmod |> 
  emmeans(spec = c("TVset","Picture"))

All four best ratings are for TV set 2, with picture 4, 1, 2 and 3 (in decreasing order). We can see if these differences are significant by looking at all pairwise differences between combinations to find the best value. All of the differences in rating for TV set 2 could be attributed to noise after correcting for multiple testing using Tukey’s correction. The fifth top position is TV set 1, picture 4 and this one has a significantly lower rating (adjusted \(p\)-value of 0.0088) than the leading choice.

Code
emm
 TVset Picture emmean    SE   df lower.CL upper.CL
 TV1   1         8.01 0.564 23.7     6.85     9.18
 TV2   1         6.92 0.564 23.7     5.75     8.08
 TV3   1        10.38 0.564 23.7     9.22    11.55
 TV1   2         6.97 0.564 23.7     5.81     8.14
 TV2   2         8.18 0.564 23.7     7.02     9.35
 TV3   2        11.39 0.564 23.7    10.23    12.56
 TV1   3         8.27 0.564 23.7     7.10     9.43
 TV2   3         7.57 0.564 23.7     6.40     8.73
 TV3   3         8.30 0.564 23.7     7.14     9.46
 TV1   4         8.82 0.564 23.7     7.65     9.98
 TV2   4         7.25 0.564 23.7     6.09     8.41
 TV3   4        10.91 0.564 23.7     9.74    12.07

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
Code
emm |> 
  pairs()
 contrast                    estimate    SE    df t.ratio p.value
 TV1 Picture1 - TV2 Picture1   1.0938 0.604  51.1   1.811  0.8045
 TV1 Picture1 - TV3 Picture1  -2.3687 0.604  51.1  -3.922  0.0126
 TV1 Picture1 - TV1 Picture2   1.0375 0.492 159.0   2.109  0.6174
 TV1 Picture1 - TV2 Picture2  -0.1688 0.604  51.1  -0.279  1.0000
 TV1 Picture1 - TV3 Picture2  -3.3813 0.604  51.1  -5.598  0.0001
 TV1 Picture1 - TV1 Picture3  -0.2562 0.492 159.0  -0.521  1.0000
 TV1 Picture1 - TV2 Picture3   0.4437 0.604  51.1   0.735  0.9998
 TV1 Picture1 - TV3 Picture3  -0.2875 0.604  51.1  -0.476  1.0000
 TV1 Picture1 - TV1 Picture4  -0.8063 0.492 159.0  -1.639  0.8920
 TV1 Picture1 - TV2 Picture4   0.7625 0.604  51.1   1.262  0.9803
 TV1 Picture1 - TV3 Picture4  -2.8937 0.604  51.1  -4.791  0.0008
 TV2 Picture1 - TV3 Picture1  -3.4625 0.604  51.1  -5.733  <.0001
 TV2 Picture1 - TV1 Picture2  -0.0563 0.604  51.1  -0.093  1.0000
 TV2 Picture1 - TV2 Picture2  -1.2625 0.492 159.0  -2.566  0.3081
 TV2 Picture1 - TV3 Picture2  -4.4750 0.604  51.1  -7.409  <.0001
 TV2 Picture1 - TV1 Picture3  -1.3500 0.604  51.1  -2.235  0.5333
 TV2 Picture1 - TV2 Picture3  -0.6500 0.492 159.0  -1.321  0.9752
 TV2 Picture1 - TV3 Picture3  -1.3813 0.604  51.1  -2.287  0.4985
 TV2 Picture1 - TV1 Picture4  -1.9000 0.604  51.1  -3.146  0.0993
 TV2 Picture1 - TV2 Picture4  -0.3312 0.492 159.0  -0.673  0.9999
 TV2 Picture1 - TV3 Picture4  -3.9875 0.604  51.1  -6.602  <.0001
 TV3 Picture1 - TV1 Picture2   3.4062 0.604  51.1   5.640  <.0001
 TV3 Picture1 - TV2 Picture2   2.2000 0.604  51.1   3.643  0.0279
 TV3 Picture1 - TV3 Picture2  -1.0125 0.492 159.0  -2.058  0.6531
 TV3 Picture1 - TV1 Picture3   2.1125 0.604  51.1   3.498  0.0412
 TV3 Picture1 - TV2 Picture3   2.8125 0.604  51.1   4.657  0.0013
 TV3 Picture1 - TV3 Picture3   2.0812 0.492 159.0   4.230  0.0023
 TV3 Picture1 - TV1 Picture4   1.5625 0.604  51.1   2.587  0.3137
 TV3 Picture1 - TV2 Picture4   3.1313 0.604  51.1   5.184  0.0002
 TV3 Picture1 - TV3 Picture4  -0.5250 0.492 159.0  -1.067  0.9956
 TV1 Picture2 - TV2 Picture2  -1.2063 0.604  51.1  -1.997  0.6930
 TV1 Picture2 - TV3 Picture2  -4.4188 0.604  51.1  -7.316  <.0001
 TV1 Picture2 - TV1 Picture3  -1.2937 0.492 159.0  -2.630  0.2725
 TV1 Picture2 - TV2 Picture3  -0.5938 0.604  51.1  -0.983  0.9975
 TV1 Picture2 - TV3 Picture3  -1.3250 0.604  51.1  -2.194  0.5614
 TV1 Picture2 - TV1 Picture4  -1.8438 0.492 159.0  -3.747  0.0129
 TV1 Picture2 - TV2 Picture4  -0.2750 0.604  51.1  -0.455  1.0000
 TV1 Picture2 - TV3 Picture4  -3.9312 0.604  51.1  -6.509  <.0001
 TV2 Picture2 - TV3 Picture2  -3.2125 0.604  51.1  -5.319  0.0001
 TV2 Picture2 - TV1 Picture3  -0.0875 0.604  51.1  -0.145  1.0000
 TV2 Picture2 - TV2 Picture3   0.6125 0.492 159.0   1.245  0.9843
 TV2 Picture2 - TV3 Picture3  -0.1187 0.604  51.1  -0.197  1.0000
 TV2 Picture2 - TV1 Picture4  -0.6375 0.604  51.1  -1.056  0.9953
 TV2 Picture2 - TV2 Picture4   0.9313 0.492 159.0   1.893  0.7620
 TV2 Picture2 - TV3 Picture4  -2.7250 0.604  51.1  -4.512  0.0020
 TV3 Picture2 - TV1 Picture3   3.1250 0.604  51.1   5.174  0.0002
 TV3 Picture2 - TV2 Picture3   3.8250 0.604  51.1   6.333  <.0001
 TV3 Picture2 - TV3 Picture3   3.0938 0.492 159.0   6.288  <.0001
 TV3 Picture2 - TV1 Picture4   2.5750 0.604  51.1   4.263  0.0045
 TV3 Picture2 - TV2 Picture4   4.1437 0.604  51.1   6.861  <.0001
 TV3 Picture2 - TV3 Picture4   0.4875 0.492 159.0   0.991  0.9977
 TV1 Picture3 - TV2 Picture3   0.7000 0.604  51.1   1.159  0.9898
 TV1 Picture3 - TV3 Picture3  -0.0312 0.604  51.1  -0.052  1.0000
 TV1 Picture3 - TV1 Picture4  -0.5500 0.492 159.0  -1.118  0.9935
 TV1 Picture3 - TV2 Picture4   1.0188 0.604  51.1   1.687  0.8658
 TV1 Picture3 - TV3 Picture4  -2.6375 0.604  51.1  -4.367  0.0032
 TV2 Picture3 - TV3 Picture3  -0.7312 0.604  51.1  -1.211  0.9857
 TV2 Picture3 - TV1 Picture4  -1.2500 0.604  51.1  -2.070  0.6454
 TV2 Picture3 - TV2 Picture4   0.3187 0.492 159.0   0.648  1.0000
 TV2 Picture3 - TV3 Picture4  -3.3375 0.604  51.1  -5.526  0.0001
 TV3 Picture3 - TV1 Picture4  -0.5188 0.604  51.1  -0.859  0.9992
 TV3 Picture3 - TV2 Picture4   1.0500 0.604  51.1   1.738  0.8418
 TV3 Picture3 - TV3 Picture4  -2.6063 0.492 159.0  -5.297  <.0001
 TV1 Picture4 - TV2 Picture4   1.5688 0.604  51.1   2.597  0.3081
 TV1 Picture4 - TV3 Picture4  -2.0875 0.604  51.1  -3.456  0.0459
 TV2 Picture4 - TV3 Picture4  -3.6562 0.604  51.1  -6.054  <.0001

Degrees-of-freedom method: kenward-roger 
P value adjustment: tukey method for comparing a family of 12 estimates 

Checking model assumptions

When we set up random effect terms, these are specified as being normal random variables. This has a shrinkage effect: predictions of the random effect model will tend to be closer to the mean than corresponding fixed, more so for larger values. In linear mixed models, we can retrieve predictions of random effects and plot them against theoretical normal plotting positions to see if they are in line with the assumption (as we did for residuals).

Code
library(lattice) # plots
# plot residuals against fitted value (additivity / linearity)
plot(mmod)
# Normal quantile-quantile plot of residuals
qqmath(mmod)
# Normal quantile-quantile plots of random effect predictions
random <- ranef(mmod)
plot(random)

As with other quantile-quantile plots, we expect the values to fall in line, which they seem to do here. There is one residual corresponding to a rating much below the average, but nothing to worry given the large sample size.

Example 2

We consider the data from Curley et al. (2022), who investigate the effect of anchoring and perception in a judicial system having three potential verdicts: guitly, not guilty and not proven.

The authors used a linear mixed effect model

For the final guilt scores, we conducted linear regressions with pre-trial bias (PJAQ) as a covariate and repeated measures for evidence anchor and vignette. We also included by-subject intercepts as a random effect.

Model structure

Code
# Load packages
library(lmerTest)
library(emmeans)
library(ggplot2)
# Set up sum-to-zero constraint for factors
options(contrasts = c("contr.sum", "contr.poly"))
data(C22, package = "hecedsm") # Load data
str(C22)
tibble [256 × 6] (S3: tbl_df/tbl/data.frame)
 $ id         : Factor w/ 128 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ anchor     : Factor w/ 2 levels "strong-first",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ vignette   : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ verdictsyst: Factor w/ 2 levels "three","two": 2 2 2 2 2 2 2 2 1 1 ...
 $ guilt      : num [1:256] 103.5 81.8 87.8 82.7 76.8 ...
 $ pjaq       : int [1:256] 67 101 65 60 75 73 66 86 59 93 ...
Code
# How are conditions manipulated?
with(C22, xtabs(~ vignette + id))
        id
vignette 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
       1 2 2 2 2 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
       2 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
        id
vignette 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
       1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
       2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
        id
vignette 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
       1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  0  0  0  0  0  0  0  0
       2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2  2  2  2  2  2  2  2
        id
vignette 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
       1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
       2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
        id
vignette 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
       1  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
       2  2  2  2  2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
        id
vignette 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
       1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
       2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
Code
with(C22, xtabs(~ vignette + anchor))
        anchor
vignette strong-first weak-first
       1           64         64
       2           64         64
Code
with(C22, xtabs(~ anchor + id))
              id
anchor         1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
  strong-first 1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  weak-first   1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
              id
anchor         25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
  strong-first  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  weak-first    1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
              id
anchor         46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
  strong-first  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  weak-first    1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
              id
anchor         67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
  strong-first  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  weak-first    1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
              id
anchor         88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106
  strong-first  1  1  1  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1
  weak-first    1  1  1  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1
              id
anchor         107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
  strong-first   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
  weak-first     1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
              id
anchor         123 124 125 126 127 128
  strong-first   1   1   1   1   1   1
  weak-first     1   1   1   1   1   1

For each individual we see each pair of anchor once, but people are assigned to a single vignette. With only two measurements, no interaction is possible within participants. Likewise, we cannot have interaction between subject-effect and with anchor and verdict system (why?)

We thus include a single random effect, that for subject. Curley et al. (2022) fitted two models: one with all three way interactions between PJAQ, vignette and anchor, the other with additive effects.

Since PJAQ is a continuous covariate, the interaction with vignette and anchor means that we expect the slope for PJAQ to be different for each group. This isn’t necessarily justified by any of the author’s hypotheses. If any interaction with the total pre-trial bias (PJAQ) is present, we would need to perform comparisons for different values of PJAQ.

Model fitting and tests

We fit the model without the interaction with PJAQ. Note that, using emmeans, it is better to center the covariate so that it’s average is zero, in which case comparisons are performed for this average value.

Code
C22_c <- C22 |>
  dplyr::mutate(pjaq = scale(pjaq, scale = FALSE),
                guilt = guilt / 10) 
# scale from 0 to 14
model1 <- lmer(
  guilt ~ anchor*vignette*pjaq + (1 | id),
  data = C22_c)
model2 <- lmer(
  guilt ~ anchor*vignette + pjaq + (1 | id),
  data = C22_c)

Now that we have model coefficients, we can test the models. First, we check if the slope is the same for each of the four subgroups

Code
# Type III ANOVA from lmerTest
anova(model1, model2)
refitting model(s) with ML (instead of REML)
Data: C22_c
Models:
model2: guilt ~ anchor * vignette + pjaq + (1 | id)
model1: guilt ~ anchor * vignette * pjaq + (1 | id)
       npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
model2    7 1119.8 1144.7 -552.92   1105.8                     
model1   10 1122.9 1158.3 -551.43   1102.9 2.9693  3     0.3964

There appear to be no two-way or three-way interactions between any of the two factors and the covariate.

We can create an interaction plot to see what the estimated marginal means are for each of the four subconditions when the total pre-trial bias score is average.

Code
library(ggplot2)
# Interaction plot
emmip(model2, 
      formula = anchor ~ vignette, 
      CIs = TRUE) +
  theme_classic() +
  theme(legend.position = "bottom")

There seems to be differences, but note the very large uncertainty (and that of difference would be larger than for the individual means). We can inspect the estimated parameters for the model coefficient and compute marginal means.

Code
summary(model2)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: guilt ~ anchor * vignette + pjaq + (1 | id)
   Data: C22_c

REML criterion at convergence: 1120.8

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.00918 -0.52985  0.02713  0.52806  2.48233 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 3.300    1.817   
 Residual             2.271    1.507   
Number of obs: 256, groups:  id, 128

Fixed effects:
                   Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)         8.18232    0.18616 125.00000  43.954  < 2e-16 ***
anchor1            -0.25008    0.09419 126.00000  -2.655  0.00895 ** 
vignette1           0.31827    0.18794 125.00000   1.693  0.09287 .  
pjaq                0.08710    0.01924 125.00000   4.527 1.37e-05 ***
anchor1:vignette1  -0.13371    0.09419 126.00000  -1.420  0.15817    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) anchr1 vgntt1 pjaq 
anchor1     0.000                     
vignette1   0.000  0.000              
pjaq        0.000  0.000  0.138       
anchr1:vgn1 0.000  0.000  0.000  0.000
Code
emmeans(model2, spec = "anchor") |>
  contrast(method = "pairwise")
NOTE: Results may be misleading due to involvement in interactions
 contrast                      estimate    SE  df t.ratio p.value
 (strong-first) - (weak-first)     -0.5 0.188 126  -2.655  0.0089

Results are averaged over the levels of: vignette 
Degrees-of-freedom method: kenward-roger 

There is a strong correlation, 0.59 between the responses from the same individuals so failing to account for the subject-specific random effect would lead to misleading conclusions, as reported \(p\)-values would be much smaller than they should be.

The correlation between pre-trial bias score (PJAQ) and the guilt score is 0.32. The coefficient in the additive model is 0.09 with a standard error of 0.19, indicating a strong positive effect of pre-trial bias on the guilt score.

The anchor has a large impact, with the estimated difference weak-first vs strong-first of 5 points for the guilt score overall.

Example 3

We consider data from a study at Tech3Lab (Labonté-LeMoyne et al., 2020) on standing desks. The description from the abstract reads

Thirty-seven young and healthy adults performed eight cognitive tasks in a 2 × 2 × 2 within-subject design of the following independent variables: posture (sitting/standing), task difficulty (easy/hard), and input device (computer mouse/tactile screen) in a counterbalanced order.

The database LJLSFBM20 in package hecedsm contains the measurements. There is a total of 296 measurements, or \(37 \times 8\), meaning each participant is assigned to a single task.

Code
data(LJLSFBM20, package = "hecedsm")
str(LJLSFBM20)
tibble [296 × 14] (S3: tbl_df/tbl/data.frame)
 $ id            : Factor w/ 37 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ order         : int [1:296] 1 2 3 4 5 6 7 8 1 2 ...
 $ position      : Factor w/ 2 levels "standing","sitting": 2 2 2 2 1 1 1 1 1 1 ...
 $ phys_demand   : Factor w/ 2 levels "mouse","touchpad": 1 2 2 1 2 2 1 1 2 2 ...
 $ task_diff     : Factor w/ 2 levels "easy","difficult": 1 2 1 2 2 1 2 1 2 1 ...
 $ ies           : num [1:296] 2486 24034 4691 11727 10874 ...
 $ central_alpha : num [1:296] 0.3 0.227 0.25 0.224 0.235 ...
 $ parietal_alpha: num [1:296] 0.388 0.348 0.345 0.333 0.298 ...
 $ central_beta  : num [1:296] 0.1103 0.0866 0.0902 0.0863 0.0972 ...
 $ parietal_beta : num [1:296] 0.169 0.159 0.141 0.147 0.141 ...
 $ bmi           : num [1:296] 22.8 22.8 22.8 22.8 22.8 ...
 $ sex           : Factor w/ 2 levels "man","woman": 1 1 1 1 1 1 1 1 1 1 ...
 $ attention     : num [1:296] 6.83 6.67 6.83 6.83 6.83 ...
 $ satisfaction  : num [1:296] 70 85 80 76 86 100 60 80 70 85 ...

Model structure

The id variable is a dummy for the participant, with associated covariate sex and bmi. We also have the three within-subject factors, position, phys_demand and task_diff and the order in which the tasks were given (counterbalanced). There are in total seven response variables: the inverse efficiency score global stimulus (ies), measures of brain activity (central_alpha, parietal_alpha, central_beta, parietal_beta) and two scales obtained from answers to a questionaire, attention and satisfaction.

The three manipulated factors are nested within subject and crossed, so we can estimate the three-way and two-way interactions with the experimental factors. The only logical random effect here is for subject, and we cannot have further sources of variability given the lack of replication.

The authors are not transparent as to what their model is and earns a failure grade for reproducibility: we have no idea of the specification, coefficients are not reported. There is clearly 8 coefficiens corresponding to the average of the subgroups, plus a random effect for subject and sex and body mass index as covariates. It appears from the output of Table 1 that there was an interaction term added for BMI and position (as standing may be more strenuous on overweight participants). Finally, we include the order of the tasks as a covariate to account for potential fatigue effects.

Model adjustment and testing

Our linear mixed model would take the form, here with ies as response

Code
library(lmerTest)
mod <- lmer(ies ~ position*phys_demand*task_diff +
  (1 | id) + sex + bmi*position + order, 
           data = LJLSFBM20) 

Given the number of response variables and coefficients, it is clear one must account for testing multiplicity. Figuring out the size of the family, \(m\), is not trivial: testing interactions and main effects leads to 7 tests. We could also be interested in the interaction between body mass index and position if it affects performance, bringing the number of tests to 8 and throw the covariate effect for order (up to 11 if we include all controls). Further contrasts may inflate this number (for example, there are 28 pairwise comparisons if the three-way interaction is significant). With seven responses and linear mixed models parametrized in the same way, we have at least 56 tests and potentially up to 350! Controlling the type-I error requires using Bonferroni–Holm type correction, since the different responses are not part of the same models.

Code
anova(mod)
Type III Analysis of Variance Table with Satterthwaite's method
                                   Sum Sq    Mean Sq NumDF DenDF  F value
position                           574983     574983     1   250   0.0361
phys_demand                    1280635446 1280635446     1   250  80.3464
task_diff                      4927991208 4927991208     1   250 309.1796
sex                              79185574   79185574     1    34   4.9681
bmi                              10365621   10365621     1    34   0.6503
order                           105580125  105580125     1   250   6.6240
position:phys_demand              9127601    9127601     1   250   0.5727
position:task_diff                1825943    1825943     1   250   0.1146
phys_demand:task_diff           327914741  327914741     1   250  20.5732
position:bmi                      1224778    1224778     1   250   0.0768
position:phys_demand:task_diff    2299196    2299196     1   250   0.1443
                                  Pr(>F)    
position                         0.84952    
phys_demand                    < 2.2e-16 ***
task_diff                      < 2.2e-16 ***
sex                              0.03254 *  
bmi                              0.42560    
order                            0.01064 *  
position:phys_demand             0.44992    
position:task_diff               0.73530    
phys_demand:task_diff          8.913e-06 ***
position:bmi                     0.78185    
position:phys_demand:task_diff   0.70441    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we focus on the sole results for ies, there is a significant two-way interaction (and main effects) for phys_demand and task_diff, but not for position. Further investigation would reveal better performance on the easy task and overall with a mouse.

Code
emmeans::emmeans(mod, spec = c("phys_demand","task_diff"))
NOTE: Results may be misleading due to involvement in interactions
 phys_demand task_diff emmean  SE   df lower.CL upper.CL
 mouse       easy        2668 672 79.3     1332     4005
 touchpad    easy        4723 672 79.2     3387     6060
 mouse       difficult   8733 672 79.3     7396    10070
 touchpad    difficult  14998 672 79.2    13662    16335

Results are averaged over the levels of: position, sex 
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 

Model assumptions

One may wonder whether there are any effect of spillover, learning or fatigue due to the repeated measures: tasks were much longer than is typical. The coefficient for order is -268.14, so the decrease for the inverse efficiency score global stimulus for each additional task, ceteris paribus, with a \(p\)-value of 0.011 suggesting participants improve over time.

Code
library(ggplot2)
g1 <- ggplot(
  data = data.frame(residuals = resid(mod), 
                    id = LJLSFBM20$id),
  mapping = aes(x = id, 
                y = residuals)) + 
  geom_point() +
  theme_classic()
g2 <- ggplot(
  data = data.frame(residuals = resid(mod), 
                    order = LJLSFBM20$order),
  mapping = aes(x = order, 
                y = residuals)) + 
  geom_point() +
  theme_classic()
library(patchwork) # combine results
g1 + g2
Figure 2: Scatterplots of residuals from the linear mixed model against participant identifier (left) and against order of task (right).

Looking at the residuals of the model per participants is also quite insightful. It is clear that measurements from participants 30 and 31 are abnormal, and these correspond to the values we see.

We can check the other model assumptions: the quantile-quantile plot of random effects suggests some large outliers, unsurprisingly due to the participants identified. The plot of fitted values vs residuals suggests our model is wholly inadequate: there is a clear trend in the residuals and strong evidence of heterogeneity.

Code
plot(ranef(mod))
plot(mod)
Figure 3: Normal quantile-quantile plot of predicted random effects (left) and Tukey’s plot of residual vs fitted values (right).

We could fit a more complex model to tackle the heterogeneity issue by including different variance per individual, at the expense of lower power.

References

Curley, L. J., Murray, J., MacLean, R., Munro, J., Lages, M., Frumkin, L. A., Laybourn, P., & Brown, D. (2022). Verdict spotting: Investigating the effects of juror bias, evidence anchors and verdict system in jurors. Psychiatry, Psychology and Law, 29(3), 323–344. https://doi.org/10.1080/13218719.2021.1904450
Labonté-LeMoyne, E., Jutras, M.-A., Léger, P.-M., Sénécal, S., Fredette, M., Begon, M., & Mathieu, M.-È. (2020). Does reducing sedentarity with standing desks hinder cognitive performance? Human Factors, 62(4), 603–612. https://doi.org/10.1177/0018720819879310
Zeileis, A., Köll, S., & Graham, N. (2020). Various versatile variances: An object-oriented implementation of clustered covariances in R. Journal of Statistical Software, 95(1), 1–36. https://doi.org/10.18637/jss.v095.i01