Session 4
MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal
Contrasts
Contrasts
Multiple testing
What is the scientific question of interest?
Global test
Contrasts
Image source: PNGAll.com, CC-BY-NC 4.0
Analogy here is that the global test is a dim light: it shows there are differences, but does not tell us where. Contrasts are more specific.
With K groups, null hypothesis of the form
H0: C = c1μ1 + ⋯ + cKμK = a, where C is a weighted sum of the subpopulation means.
Linear combination of
weighted group averages
Global mean larger than a?
H0: (n1/n)μ1 + ⋯ + (nK/n)μK ≤ a
Pairwise comparison
H0: μi = μj, for i ≠ j
If c1+⋯+cK=0, the contrast encodes
differences between treatments
rather than information about the overall mean.
Setup
group 1 (control)
group 2 (control)
groups 3, 4 and 5 (praise, reprove, ignore)
Hypotheses of interest
This is post hoc, but suppose we had a particular interest in the following hypotheses (for illustration purposes).
With placeholders for each group, write H01: μpraised = μreproved as
0·μcontrol1 + 0·μcontrol2 + 1·μpraised − 1·μreproved + 0·μignored = 0
The sum of the coefficient vector, c = (0, 0, 1, −1, 0), is zero.
Similarly, for H02: ½(μcontrol1 + μcontrol2) = μpraised:
½·μcontrol1 + ½·μcontrol2 − 1·μpraised + 0·μreproved + 0·μignored = 0
The contrast vector is c = (½, ½, −1, 0, 0); entries again sum to zero.
An equivalent formulation is obtained by picking c = (1, 1, −2, 0, 0).
emmeans
```r
library(emmeans)
linmod <- lm(score ~ group, data = arithmetic)
linmod_emm <- emmeans(linmod, specs = 'group')
contrast_specif <- list(
  controlvspraised = c(0.5, 0.5, -1, 0, 0),
  praisedvsreproved = c(0, 0, 1, -1, 0)
)
contrasts_res <- contrast(object = linmod_emm, method = contrast_specif)
# Obtain confidence intervals instead of p-values
confint(contrasts_res)
```
| contrast | null.value | estimate | std.error | df | statistic | p.value |
|---|---|---|---|---|---|---|
| control vs praised | 0 | -8.44 | 1.40 | 40 | -6.01 | < 1e-04 |
| praised vs reproved | 0 | 4.00 | 1.62 | 40 | 2.47 | 0.018 |
Confidence intervals
| contrast | lower | upper |
|---|---|---|
| control vs praised | -11.28 | -5.61 |
| praised vs reproved | 0.72 | 7.28 |
Suppose we postulate that the contrast C is bigger than some value a, i.e., the alternative is Ha: C > a.
It suffices to consider the endpoint C=a (why?)
Rejection regions for a one-sided test (left) and a two-sided test (right).
In principle, one-sided tests are more powerful (the rejection region is larger on one side).
If you postulate Ha: C > a and the data show the opposite, with Ĉ ≤ a, the one-sided p-value is at least 0.5 (and close to 1 if Ĉ falls well below a): there is no hope of rejecting H0.
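A quick sketch of this in R, reusing the statistic of 2.47 on 40 degrees of freedom reported for the praised vs reproved contrast above:

```r
tstat <- 2.47; df <- 40   # praised vs reproved contrast from the emmeans output
2 * pt(abs(tstat), df, lower.tail = FALSE)  # two-sided p-value, about 0.018
pt(tstat, df, lower.tail = FALSE)           # one-sided p-value for Ha: C > 0, about 0.009
pt(-tstat, df, lower.tail = FALSE)          # estimate on the 'wrong' side: p-value about 0.99
```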
Suppose you decide to look at all pairwise differences
Comparing all pairwise differences: m = (K choose 2) = K(K−1)/2 tests.
The recommendation for ANOVA is to limit the number of tests to the number of subgroups
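The number of pairwise comparisons grows quickly with the number of groups; with the K = 5 conditions of the arithmetic example there are already 10:

```r
K <- c(3, 5, 10)
choose(K, 2)  # 3, 10 and 45 pairwise comparisons
```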
Read the fine print: if you do a single hypothesis test and your testing procedure is well calibrated (meaning the model assumptions are met), there is a probability α of making a type I error if the null hypothesis is true.
Dr. Yoav Benjamini looked at the number of tests performed in the Psychology replication project
The number of tests performed ranged from 4 to 700, with an average of 72.
Most studies did not account for selection.
Benjamini reported that only 11 of the 100 studies engaged with selection, and even then only cursorily.
This highlights two problems: failure to account for multiple testing, and selective reporting.
Bring students to realize the multiple testing problem: quiz them on potential consequences
Gone fishing
The more tests you perform, the larger the probability of making at least one type I error.
If we do m independent comparisons, each one at the level α, the probability of making at least one type I error, say α⋆, is
α⋆ = 1 − probability of making no type I error = 1 − (1 − α)^m.
With α=0.05
Tests need not be independent... but one can show α⋆≤mα.
The equality α⋆ = 1 − (1 − α)^m holds under the assumption that observations (and thus tests) are independent; the bound α⋆ ≤ mα follows from Boole's inequality and does not require independence.
It is an upper bound on the probability of making at least one type I error.
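A minimal sketch of the calculation in R: the exact formula assumes independent tests, while the Boole (Bonferroni) bound does not.

```r
alpha <- 0.05
m <- c(1, 2, 5, 10, 20, 50)              # number of tests
fwer_indep <- 1 - (1 - alpha)^m          # P(at least one type I error), independent tests
fwer_bound <- pmin(1, m * alpha)         # Boole's inequality bound, no independence needed
round(rbind(m, fwer_indep, fwer_bound), 3)
```

With α = 0.05, the probability already exceeds 40% for m = 10 independent tests.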
Why α=5%? Essentially arbitrary...
If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty or one in a hundred. Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level.
Fisher, R.A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain, 33:503-513.
Consider m tests with the corresponding null hypotheses H01,…,H0m.
Should be chosen a priori and pre-registered
Keep it small: the number of planned comparisons for a one-way ANOVA should be less than the number of groups K.
Not all researchers agree with this “liberal” approach (i.e., not correcting for multiplicity even for pre-planned comparisons); some recommend always controlling the familywise or experimentwise type I error rate (dixit F. Bellavance).
Define indicators Ri = 1 if we reject H0i and Ri = 0 if we fail to reject H0i, and Vi = 1 if a type I error is committed for H0i (Ri = 1 and H0i is true) and Vi = 0 otherwise, with R = R1 + ⋯ + Rm the total number of rejections and V = V1 + ⋯ + Vm the number of type I errors.
Definition: the familywise error rate is the probability of making at least one type I error per family
FWER=Pr(V≥1)
If we use a procedure that controls for the family-wise error rate, we talk about simultaneous inference (or simultaneous coverage for confidence intervals).
Consider a family of m hypothesis tests and perform each test at level α/m.
If the (raw) p-values are reported, reject H0i if m×pi≤α (i.e., multiply reported p-values by m)
Often incorrectly applied to a set of significant contrasts, rather than for preplanned comparisons only
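In R, the Bonferroni adjustment is available through p.adjust; a sketch with hypothetical raw p-values:

```r
pvals <- c(0.001, 0.02, 0.04)   # hypothetical raw p-values for m = 3 pre-planned tests
alpha <- 0.05
p_bonf <- p.adjust(pvals, method = "bonferroni")  # same as pmin(1, length(pvals) * pvals)
p_bonf <= alpha                 # which null hypotheses are rejected at level alpha
```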
Order the p-values of the family of m tests from smallest to largest, p(1) ≤ ⋯ ≤ p(m), associated with null hypotheses H0(1), …, H0(m).
Idea: use a different level for each test, more stringent for smaller p-values.
Coupling Holm's method with Bonferroni's procedure: compare p(1) to α(1)=α/m, p(2) to α(2)=α/(m−1), etc.
The Holm–Bonferroni procedure is always more powerful than Bonferroni's.
Step down from the smallest p-value, rejecting as long as each p-value falls below its cutoff; once one exceeds its cutoff, fail to reject it and all remaining hypotheses.
If p(j) ≥ α(j) but p(i) < α(i) for i = 1, …, j−1 (all smaller p-values), reject H0(1), …, H0(j−1) and fail to reject the rest.
If all p-values fall below their respective cutoffs, i.e., p(i) ≤ α(i) for every test i = 1, …, m, reject all m null hypotheses.
Consider m=3 tests with raw p-values 0.01, 0.04, 0.02.
| i | p(i) | Bonferroni | Holm–Bonferroni |
|---|---|---|---|
| 1 | 0.01 | 3 × 0.01 = 0.03 | 3 × 0.01 = 0.03 |
| 2 | 0.02 | 3 × 0.02 = 0.06 | 2 × 0.02 = 0.04 |
| 3 | 0.04 | 3 × 0.04 = 0.12 | 1 × 0.04 = 0.04 |
Reminder (Holm–Bonferroni): multiply the ith smallest p-value p(i) by (m − i + 1) and compare the product to α.
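The same adjusted p-values are returned by p.adjust, which also enforces monotonicity of the Holm adjustments:

```r
pvals <- c(0.01, 0.04, 0.02)               # raw p-values from the example
p.adjust(pvals, method = "bonferroni")     # 0.03, 0.12, 0.06
p.adjust(pvals, method = "holm")           # 0.03, 0.04, 0.04
```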
Careful: adjust for the real number of comparisons made (reports often correct only the 'significant' tests, which is wrong).
The FWER does not make a distinction between one or multiple type I errors.
We can also look at a more stringent criterion, the per-family error rate (PFER), i.e., the expected number of false positives.
Since FWER = Pr(V ≥ 1) ≤ E(V) = PFER, any procedure that controls the per-family error rate also controls the familywise error rate. Bonferroni controls both the per-family and the family-wise error rates.
Given a linear contrast of the form C = c1μ1 + ⋯ + cKμK with c1 + ⋯ + cK = 0, we build confidence intervals as usual:
Ĉ ± critical value × se(Ĉ)
Different methods provide control for FWER by modifying the critical value.
All methods valid with equal group variances and independent observations.
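With the emmeans output from earlier (contrasts_res), the critical value is modified through the adjust argument; a sketch (the argument accepts methods such as "bonferroni" and "scheffe"):

```r
# Simultaneous confidence intervals for the two pre-specified contrasts
confint(contrasts_res, adjust = "bonferroni")
confint(contrasts_res, adjust = "scheffe")
```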
Assuming we care only about mean differences between experimental conditions
Described in more detail in Dean, Voss and Draguljić (2017), Section 4.4.
Control for all pairwise comparisons
Idea: controlling the range max{μ1, …, μK} − min{μ1, …, μK} automatically controls the FWER for the other pairwise differences.
Critical values are based on the Studentized range distribution.
Assumptions: equal variance, equal number of observations in each experimental condition.
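A minimal sketch with base R, reusing the arithmetic data and the one-way model from the earlier slide (TukeyHSD expects an aov fit):

```r
anova_fit <- aov(score ~ group, data = arithmetic)
TukeyHSD(anova_fit, conf.level = 0.95)   # all pairwise differences, Tukey-adjusted
# Equivalently with emmeans: pairs(linmod_emm, adjust = "tukey")
```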
Control for all possible linear contrasts.
Critical value is √{(K − 1)F}, where F is the (1 − α) quantile of the F(K − 1, n − K) distribution.
Allows for data snooping
(post-hoc hypothesis)
But not powerful...
Take home message:
Everything is obtained using software.
With K = 5 groups and n = 9 individuals per group (arithmetic example), the critical values for a two-sided test of zero difference with the standardized t-test statistic and α = 5% can be recovered from the output of the following functions, sometimes by reverse-engineering:
Scheffé: agricolae::scheffe.test
Tukey: TukeyHSD, agricolae::HSD.test
Dunnett: DescTools::DunnettTest
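These critical values can also be computed directly from the corresponding quantile functions; a sketch assuming K = 5 groups and n = 45 observations in total (9 per group), as in the arithmetic example:

```r
K <- 5; n <- 45; alpha <- 0.05
qt(1 - alpha/2, df = n - K)                          # unadjusted two-sided t critical value
qt(1 - alpha/(2 * choose(K, 2)), df = n - K)         # Bonferroni, all 10 pairwise comparisons
qtukey(1 - alpha, nmeans = K, df = n - K) / sqrt(2)  # Tukey (studentized range scale)
sqrt((K - 1) * qf(1 - alpha, df1 = K - 1, df2 = n - K))  # Scheffé
```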
A simultaneous procedure that controls the family-wise error rate (FWER) ensures that any selected test has type I error α.
With thousands of tests, this is too stringent a criterion.
The false discovery rate (FDR) provides a guarantee for the proportion among selected discoveries (tests for which we reject the null hypothesis).
Why use it? The false discovery rate is scalable:
Suppose that m0 out of m null hypotheses are true.
The false discovery rate is the proportion of false discoveries among rejected nulls,
FDR = V/R if R > 0 (one or more rejections), and 0 if R = 0 (no rejection).
The false discovery rate offers weak FWER control: the property is only guaranteed in the scenario where all null hypotheses are true.
The Benjamini–Hochberg (1995) procedure for controlling the false discovery rate is:
1. Order the p-values from smallest to largest, p(1) ≤ ⋯ ≤ p(m) (the smallest p-value has rank 1, the largest has rank m).
2. Find the largest rank k such that p(k) ≤ kα/m, i.e., the largest ordered p-value lying below the line with zero intercept and slope α/m.
3. Reject the null hypotheses associated with the k smallest p-values.
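A minimal sketch of the procedure with hypothetical raw p-values; p.adjust(..., method = "BH") returns the equivalent adjusted p-values:

```r
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216)  # hypothetical
alpha <- 0.05
m <- length(pvals)
below <- sort(pvals) <= (1:m) * alpha / m       # ordered p-values under the line of slope alpha/m
k <- if (any(below)) max(which(below)) else 0   # largest rank below the line
order(pvals)[seq_len(k)]                        # indices of rejected null hypotheses
which(p.adjust(pvals, method = "BH") <= alpha)  # same rejections via adjusted p-values
```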
ANOVA-specific solutions (assuming equal variances, balanced large samples, ...)
Outside of ANOVA, some more general recipes:
Pick the one that controls the FWER but penalizes the least!
An example of the last point is the comparison between Bonferroni and Scheffé: with a large number of tests, the latter may be less stringent and lead to more discoveries while still guaranteeing control of the FWER.