Session 4
MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal
Contrasts
Contrasts
Multiple testing
What is the scientific question of interest?
Global test
Contrasts
Image source: PNGAll.com, CC-BY-NC 4.0
Analogy here is that the global test is a dim light: it shows there are differences, but does not tell us where. Contrasts are more specific.
With K groups, null hypothesis of the form
H0: C = c1μ1 + ⋯ + cKμK = a, where C is a weighted sum of the subpopulation means.
Linear combination of
weighted group averages
Global mean larger than a?
H0: (n1/n)μ1 + ⋯ + (nK/n)μK ≤ a
Pairwise comparison
H0: μi = μj, for i ≠ j
If c1+⋯+cK=0, the contrast encodes
differences between treatments
rather than information about the overall mean.
Setup
group 1 (control)
group 2 (control)
groups 3, 4 and 5 (praise, reprove, ignore)
Hypotheses of interest
This is post hoc, but suppose we had a particular interest in the following hypotheses (for illustration purposes).
With placeholders for each group, write H01: μpraised = μreproved as
0·μcontrol1 + 0·μcontrol2 + 1·μpraised − 1·μreproved + 0·μignored = 0
The sum of the coefficient vector, c = (0, 0, 1, −1, 0), is zero.
Similarly, for H02: ½(μcontrol1 + μcontrol2) = μpraised:
½·μcontrol1 + ½·μcontrol2 − 1·μpraised + 0·μreproved + 0·μignored = 0
The contrast vector is c = (½, ½, −1, 0, 0); entries again sum to zero.
An equivalent formulation is obtained by picking c = (1, 1, −2, 0, 0).
emmeans
```r
library(emmeans)
linmod <- lm(score ~ group, data = arithmetic)
linmod_emm <- emmeans(linmod, specs = 'group')
contrast_specif <- list(
  controlvspraised = c(0.5, 0.5, -1, 0, 0),
  praisedvsreproved = c(0, 0, 1, -1, 0)
)
contrasts_res <- contrast(object = linmod_emm, method = contrast_specif)
# Obtain confidence intervals instead of p-values
confint(contrasts_res)
```
| contrast | null.value | estimate | std.error | df | statistic | p.value |
|---|---|---|---|---|---|---|
| control vs praised | 0 | -8.44 | 1.40 | 40 | -6.01 | < 1e-04 |
| praised vs reproved | 0 | 4.00 | 1.62 | 40 | 2.47 | 0.018 |
Confidence intervals
| contrast | lower | upper |
|---|---|---|
| control vs praised | -11.28 | -5.61 |
| praised vs reproved | 0.72 | 7.28 |
Suppose we postulate that the contrast C is bigger than some value a, i.e., the alternative is Ha: C > a.
It suffices to consider the endpoint C=a (why?)
Rejection regions for a one-sided test (left) and a two-sided test (right).
In principle, one-sided tests are more powerful (the rejection region is larger on one side).
If you postulate Ha: C > a and the data show the opposite, with Ĉ ≤ a, the one-sided p-value is at least 0.5 (and close to 1 if Ĉ falls well below a): there is no hope of rejecting H0.
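A quick sketch of this in R, reusing the statistic of 2.47 on 40 degrees of freedom reported for the praised vs reproved contrast above:

```r
tstat <- 2.47; df <- 40   # praised vs reproved contrast from the emmeans output
2 * pt(abs(tstat), df, lower.tail = FALSE)  # two-sided p-value, about 0.018
pt(tstat, df, lower.tail = FALSE)           # one-sided p-value for Ha: C > 0, about 0.009
pt(-tstat, df, lower.tail = FALSE)          # estimate on the 'wrong' side: p-value about 0.99
```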
Suppose you decide to look at all pairwise differences
Comparing all pairwise differences: m = (K choose 2) = K(K−1)/2 tests.
The recommendation for ANOVA is to limit the number of tests to the number of subgroups
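The number of pairwise comparisons grows quickly with the number of groups; with the K = 5 conditions of the arithmetic example there are already 10:

```r
K <- c(3, 5, 10)
choose(K, 2)  # 3, 10 and 45 pairwise comparisons
```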
Read the fine print: if you do a single hypothesis test and your testing procedure is well calibrated (meaning the model assumptions are met), there is a probability α of making a type I error if the null hypothesis is true.
Dr. Yoav Benjamini looked at the number of tests performed in the Psychology replication project
The number of tests performed ranged from 4 to 700, with an average of 72.
Most studies did not account for selection.
Benjamini reported that only 11 of the 100 studies engaged with selection, and even then only cursorily.
This highlights two problems: failure to account for multiple testing, and selective reporting.
Bring students to realize the multiple testing problem: quiz them on potential consequences
Gone fishing
The more tests you perform, the larger the probability of making at least one type I error.
If we do m independent comparisons, each one at the level α, the probability of making at least one type I error, say α⋆, is
α⋆ = 1 − probability of making no type I error = 1 − (1 − α)^m.
With α=0.05
Tests need not be independent... but one can show α⋆≤mα.
The equality α⋆ = 1 − (1 − α)^m holds under the assumption that observations (and thus tests) are independent; the bound α⋆ ≤ mα follows from Boole's inequality and does not require independence.
It is an upper bound on the probability of making at least one type I error.
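A minimal sketch of the calculation in R: the exact formula assumes independent tests, while the Boole (Bonferroni) bound does not.

```r
alpha <- 0.05
m <- c(1, 2, 5, 10, 20, 50)              # number of tests
fwer_indep <- 1 - (1 - alpha)^m          # P(at least one type I error), independent tests
fwer_bound <- pmin(1, m * alpha)         # Boole's inequality bound, no independence needed
round(rbind(m, fwer_indep, fwer_bound), 3)
```

With α = 0.05, the probability already exceeds 40% for m = 10 independent tests.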
Why α=5%? Essentially arbitrary...
If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty or one in a hundred. Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level.
Fisher, R.A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain, 33:503-513.
Consider m tests with the corresponding null hypotheses H01,…,H0m.
Should be chosen a priori and pre-registered
Keep it small: the number of planned comparisons for a one-way ANOVA should be less than the number of groups K.
Not all researchers agree with this “liberal” approach (i.e., not correcting for multiplicity even for pre-planned comparisons); some recommend always controlling the familywise or experimentwise type I error rate (dixit F. Bellavance).
Define indicators Ri = 1 if we reject H0i and Ri = 0 if we fail to reject H0i, and Vi = 1 if a type I error is committed for H0i (Ri = 1 and H0i is true) and Vi = 0 otherwise, with R = R1 + ⋯ + Rm the total number of rejections and V = V1 + ⋯ + Vm the number of type I errors.
Definition: the familywise error rate is the probability of making at least one type I error per family
FWER=Pr(V≥1)
If we use a procedure that controls for the family-wise error rate, we talk about simultaneous inference (or simultaneous coverage for confidence intervals).
Consider a family of m hypothesis tests and perform each test at level α/m.
If the (raw) p-values are reported, reject H0i if m×pi≤α (i.e., multiply reported p-values by m)
Often incorrectly applied to a set of significant contrasts, rather than for preplanned comparisons only
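In R, the Bonferroni adjustment is available through p.adjust; a sketch with hypothetical raw p-values:

```r
pvals <- c(0.001, 0.02, 0.04)   # hypothetical raw p-values for m = 3 pre-planned tests
alpha <- 0.05
p_bonf <- p.adjust(pvals, method = "bonferroni")  # same as pmin(1, length(pvals) * pvals)
p_bonf <= alpha                 # which null hypotheses are rejected at level alpha
```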
Order the p-values of the family of m tests from smallest to largest, p(1) ≤ ⋯ ≤ p(m), associated with null hypotheses H0(1), …, H0(m).
Idea: use a different level for each test, more stringent for smaller p-values.
Coupling Holm's method with Bonferroni's procedure: compare p(1) to α(1)=α/m, p(2) to α(2)=α/(m−1), etc.
The Holm–Bonferroni procedure is always more powerful than Bonferroni's.
Step down from the smallest p-value, rejecting as long as each p-value falls below its cutoff; once one exceeds its cutoff, fail to reject it and all remaining hypotheses.
If p(j) ≥ α(j) but p(i) < α(i) for i = 1, …, j−1 (all smaller p-values), reject H0(1), …, H0(j−1) and fail to reject the rest.
If all p-values fall below their respective cutoffs, i.e., p(i) ≤ α(i) for every test i = 1, …, m, reject all m null hypotheses.
Consider m=3 tests with raw p-values 0.01, 0.04, 0.02.
| i | p(i) | Bonferroni | Holm–Bonferroni |
|---|---|---|---|
| 1 | 0.01 | 3 × 0.01 = 0.03 | 3 × 0.01 = 0.03 |
| 2 | 0.02 | 3 × 0.02 = 0.06 | 2 × 0.02 = 0.04 |
| 3 | 0.04 | 3 × 0.04 = 0.12 | 1 × 0.04 = 0.04 |
Reminder (Holm–Bonferroni): multiply the ith smallest p-value p(i) by (m − i + 1) and compare the product to α.
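The same adjusted p-values are returned by p.adjust, which also enforces monotonicity of the Holm adjustments:

```r
pvals <- c(0.01, 0.04, 0.02)               # raw p-values from the example
p.adjust(pvals, method = "bonferroni")     # 0.03, 0.12, 0.06
p.adjust(pvals, method = "holm")           # 0.03, 0.04, 0.04
```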
Careful: adjust for the real number of comparisons made (reports often correct only the 'significant' tests, which is wrong).
The FWER does not make a distinction between one or multiple type I errors.
We can also look at a more stringent criterion, the per-family error rate (PFER), i.e., the expected number of false positives.
Since FWER = Pr(V ≥ 1) ≤ E(V) = PFER, any procedure that controls the per-family error rate also controls the familywise error rate. Bonferroni controls both the per-family and the family-wise error rates.
Given a linear contrast of the form C = c1μ1 + ⋯ + cKμK with c1 + ⋯ + cK = 0, we build confidence intervals as usual:
Ĉ ± critical value × se(Ĉ)
Different methods provide control for FWER by modifying the critical value.
All methods valid with equal group variances and independent observations.
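With the emmeans output from earlier (contrasts_res), the critical value is modified through the adjust argument; a sketch (the argument accepts methods such as "bonferroni" and "scheffe"):

```r
# Simultaneous confidence intervals for the two pre-specified contrasts
confint(contrasts_res, adjust = "bonferroni")
confint(contrasts_res, adjust = "scheffe")
```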
Assuming we care only about mean differences between experimental conditions
Described in more detail in Dean, Voss and Draguljić (2017), Section 4.4.
Control for all pairwise comparisons
Idea: controlling the range max{μ1, …, μK} − min{μ1, …, μK} automatically controls the FWER for the other pairwise differences.
Critical values are based on the Studentized range distribution.
Assumptions: equal variance, equal number of observations in each experimental condition.
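A minimal sketch with base R, reusing the arithmetic data and the one-way model from the earlier slide (TukeyHSD expects an aov fit):

```r
anova_fit <- aov(score ~ group, data = arithmetic)
TukeyHSD(anova_fit, conf.level = 0.95)   # all pairwise differences, Tukey-adjusted
# Equivalently with emmeans: pairs(linmod_emm, adjust = "tukey")
```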
Control for all possible linear contrasts.
Critical value is √{(K − 1)F}, where F is the (1 − α) quantile of the F(K − 1, n − K) distribution.
Allows for data snooping
(post-hoc hypothesis)
But not powerful...
Take home message:
Everything is obtained using software.
With K = 5 groups and n = 9 individuals per group (arithmetic example), the critical values for a two-sided test of zero difference with the standardized t-test statistic and α = 5% can be recovered from the output of the following functions, sometimes by reverse-engineering:
Scheffé: agricolae::scheffe.test
Tukey: TukeyHSD, agricolae::HSD.test
Dunnett: DescTools::DunnettTest
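These critical values can also be computed directly from the corresponding quantile functions; a sketch assuming K = 5 groups and n = 45 observations in total (9 per group), as in the arithmetic example:

```r
K <- 5; n <- 45; alpha <- 0.05
qt(1 - alpha/2, df = n - K)                          # unadjusted two-sided t critical value
qt(1 - alpha/(2 * choose(K, 2)), df = n - K)         # Bonferroni, all 10 pairwise comparisons
qtukey(1 - alpha, nmeans = K, df = n - K) / sqrt(2)  # Tukey (studentized range scale)
sqrt((K - 1) * qf(1 - alpha, df1 = K - 1, df2 = n - K))  # Scheffé
```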
A simultaneous procedure that controls the family-wise error rate (FWER) ensures that any selected test has type I error α.
With thousands of tests, this is too stringent a criterion.
The false discovery rate (FDR) provides a guarantee for the proportion among selected discoveries (tests for which we reject the null hypothesis).
Why use it? The false discovery rate is scalable:
Suppose that m0 out of m null hypotheses are true.
The false discovery rate is the proportion of false discoveries among rejected nulls,
FDR = V/R if R > 0 (one or more rejections), and 0 if R = 0 (no rejection).
The false discovery rate offers weak FWER control: the property is only guaranteed in the scenario where all null hypotheses are true.
The Benjamini–Hochberg (1995) procedure for controlling the false discovery rate is:
1. Order the p-values from smallest to largest, p(1) ≤ ⋯ ≤ p(m) (the smallest p-value has rank 1, the largest has rank m).
2. Find the largest rank k such that p(k) ≤ kα/m, i.e., the largest ordered p-value lying below the line with zero intercept and slope α/m.
3. Reject the null hypotheses associated with the k smallest p-values.
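A minimal sketch of the procedure with hypothetical raw p-values; p.adjust(..., method = "BH") returns the equivalent adjusted p-values:

```r
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216)  # hypothetical
alpha <- 0.05
m <- length(pvals)
below <- sort(pvals) <= (1:m) * alpha / m       # ordered p-values under the line of slope alpha/m
k <- if (any(below)) max(which(below)) else 0   # largest rank below the line
order(pvals)[seq_len(k)]                        # indices of rejected null hypotheses
which(p.adjust(pvals, method = "BH") <= alpha)  # same rejections via adjusted p-values
```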
ANOVA-specific solutions (assuming equal variances, balanced large samples, ...)
Outside of ANOVA, some more general recipes:
Pick the one that controls the FWER but penalizes the least!
An example of the last point is the comparison between Bonferroni and Scheffé: with a large number of tests, the latter may be less stringent and lead to more discoveries while still guaranteeing control of the FWER.