Session 2
MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal
Variability
Hypothesis tests
Pairwise comparisons
Interested in impacts of intervention or policy
Population distribution (describing possible outcomes and their frequencies) encodes everything we could be interested in.
Histograms for 10 random samples of size 20 from a discrete uniform distribution.
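A display of this kind can be reproduced along the following lines (the support $\{1, \dots, 10\}$ and the seed are illustrative assumptions, not taken from the original figure):

```r
# draw 10 independent samples of size 20 from a discrete uniform on {1, ..., 10}
set.seed(80667)
samples <- replicate(10, sample(1:10, size = 20, replace = TRUE))
par(mfrow = c(2, 5))  # 2 x 5 grid of panels, one histogram per sample
for (i in 1:10) {
  hist(samples[, i], breaks = 0:10, main = paste("sample", i), xlab = "value")
}
```

Even though all ten samples come from the same population, the histograms look quite different: this is sampling variability.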
- Data collection is costly → limited information available about the population.
- Samples are too small to reliably estimate the full distribution.
- Focus instead on particular summaries → mean, variance, odds, etc.
- mean / expectation $\mu$
- standard deviation $\sigma = \sqrt{\text{variance}}$ (same scale as the observations)
Do not confuse the standard error (variability of a statistic) with the standard deviation (variability of an observation drawn from the population).
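A small simulation makes the distinction concrete (the normal distribution and the sample size are arbitrary choices for illustration):

```r
set.seed(80667)
x <- rnorm(100, mean = 10, sd = 2)  # one sample of n = 100 observations
sd(x)                    # standard deviation: spread of individual observations
sd(x) / sqrt(length(x))  # standard error of the sample mean: spread of the statistic
```

The standard error shrinks as the sample size grows; the standard deviation does not.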
Not all samples are born alike
Can you spot the differences?
statistic = numerical summary of the data.
- requires a benchmark / standardization
- typically a unitless quantity
- requires a measure of the statistic's uncertainty
Wald statistic
$$W = \frac{\text{estimated qty} - \text{postulated qty}}{\text{std. error}(\text{estimated qty})}$$

- standard error = measure of variability (same units as the observations)
- the resulting ratio is unitless!

The standard error is typically a function of the sample size and of the standard deviation $\sigma$ of the observations.
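As a minimal sketch, the formula translates into a one-line helper (the function name `wald` is ours, not from any package):

```r
# Wald statistic: (estimate - postulated value) / standard error of the estimate
wald <- function(estimate, postulated, se) {
  (estimate - postulated) / se
}
wald(estimate = 4, postulated = 0, se = 1.6216)  # 2.467, cf. the example below
```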
From Davison (2008), Example 9.2
In an investigation on the teaching of arithmetic, 45 pupils were divided at random into five groups of nine. Groups A and B were taught in separate classes by the usual method. Groups C, D, and E were taught together for a number of days. On each day C were praised publicly for their work, D were publicly reproved and E were ignored. At the end of the period all pupils took a standard test.
```r
data(arithmetic, package = "hecedsm")
# categorical variable = factor
# Look up data
str(arithmetic)
```

```
## 'data.frame': 45 obs. of 2 variables:
##  $ group: Factor w/ 5 levels "control 1","control 2",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ score: num 17 14 24 20 24 23 16 15 24 21 ...
```
```r
library(dplyr)  # for group_by() and summarize()
# compute summary statistics
summary_stat <- arithmetic |>
  group_by(group) |>
  summarize(mean = mean(score), sd = sd(score))
knitr::kable(summary_stat, digits = 2)
```
| group     | mean  | sd   |
|-----------|-------|------|
| control 1 | 19.67 | 4.21 |
| control 2 | 18.33 | 3.57 |
| praise    | 27.44 | 2.46 |
| reprove   | 23.44 | 3.09 |
| ignore    | 16.11 | 3.62 |
```r
library(ggplot2)
# Boxplot with jittered data
ggplot(data = arithmetic, aes(x = group, y = score)) +
  geom_boxplot() +
  geom_jitter(width = 0.3, height = 0) +
  theme_bw()
```
Let $\mu_C$ and $\mu_D$ denote the population average (expectation) score for praise and reprove, respectively.

Our null hypothesis is $H_0: \mu_C = \mu_D$ against the alternative $H_a$ that they differ (two-sided test).

Equivalently, $\delta_{CD} = \mu_C - \mu_D = 0$.
The value of the Wald statistic is
$$t = \frac{\widehat{\delta}_{CD} - 0}{\mathsf{se}(\widehat{\delta}_{CD})} = \frac{4}{1.6216} = 2.467.$$
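These numbers can be recovered from the data: with equal group sizes, the pooled variance estimate averages the within-group variances over $n - K = 45 - 5 = 40$ degrees of freedom. A sketch, reusing the `arithmetic` data loaded above:

```r
library(dplyr)
stats <- arithmetic |>
  group_by(group) |>
  summarize(mean = mean(score), var = var(score), n = n())
# pooled variance estimate with n - K = 45 - 5 = 40 degrees of freedom
sp2 <- sum((stats$n - 1) * stats$var) / (sum(stats$n) - nrow(stats))
delta_hat <- stats$mean[stats$group == "praise"] -
  stats$mean[stats$group == "reprove"]   # 4
se_delta <- sqrt(sp2 * (1 / 9 + 1 / 9))  # 1.6216, each group has 9 pupils
delta_hat / se_delta                     # 2.467
```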
How 'extreme' is this number?
Could it have happened by chance if there were no difference between the groups?
Benchmarking
The null distribution tells us which values of the statistic are plausible, and with what relative frequency, if the null hypothesis holds.
What can we expect to see by chance if there is no difference between groups?
Oftentimes, the null distribution comes with the test statistic; when it does not, it can be approximated (e.g., by simulation). Null distributions differ from one statistic to the next, which makes comparisons difficult.

The $p$-value puts every test on the same scale: it has a uniform distribution under $H_0$.
- Fix the level $\alpha$ before the experiment.
- Choose a small $\alpha$ (typical value: 5%).
- Reject $H_0$ if the $p$-value is less than $\alpha$.
Question: why can't we fix $\alpha = 0$?
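A quick simulation illustrates the uniformity claim: under $H_0$, a test at level $\alpha$ rejects in roughly $100\alpha\%$ of replications by construction (the settings below are illustrative assumptions):

```r
set.seed(80667)
pvals <- replicate(1000, {
  y1 <- rnorm(9)  # two groups drawn from the same distribution: H0 holds
  y2 <- rnorm(9)
  t.test(y1, y2, var.equal = TRUE)$p.value
})
hist(pvals, breaks = 10)  # approximately flat: uniform under H0
mean(pvals < 0.05)        # proportion of (spurious) rejections, near 0.05
```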
The American Statistical Association (ASA) published a statement on (mis)interpretation of p-values.
(2) $P$-values do not measure the probability that the studied hypothesis is true.
(3) Scientific conclusions and business or policy decisions should not be based only on whether a $p$-value passes a specific threshold.
(4) $P$-values and related analyses should not be reported selectively.
(5) The $p$-value, or statistical significance, does not measure the size of an effect or the importance of a result.
Nature's checklist
The $p$-values and confidence intervals for pairwise differences between groups $j$ and $k$ are based on the $t$-statistic:
$$t = \frac{\text{estimated} - \text{postulated difference}}{\text{uncertainty}} = \frac{(\widehat{\mu}_j - \widehat{\mu}_k) - (\mu_j - \mu_k)}{\mathsf{se}(\widehat{\mu}_j - \widehat{\mu}_k)}.$$

In large samples, this statistic behaves like a Student-$t$ variable with $n - K$ degrees of freedom, denoted $\mathsf{St}(n-K)$ hereafter.

Note: in an analysis of variance model, the standard error $\mathsf{se}(\widehat{\mu}_j - \widehat{\mu}_k)$ is based on the pooled variance estimate (computed using all observations).
Consider the pairwise average difference in scores between the praise group (C) and the reprove group (D) in the arithmetic data.
If we postulate $\delta_{jk} = \mu_j - \mu_k = 0$, the test statistic becomes
$$t = \frac{\widehat{\delta}_{jk} - 0}{\mathsf{se}(\widehat{\delta}_{jk})}.$$
The $p$-value is $p = 1 - \Pr(-|t| \leq T \leq |t|)$ for $T \sim \mathsf{St}(n-K)$.

Recall: the larger the value of the statistic $t$ in absolute terms (whether positive or negative), the more evidence against the null hypothesis.
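In R, for the arithmetic example (statistic and degrees of freedom from the earlier computation):

```r
# two-sided p-value for t = 2.467 with n - K = 45 - 5 = 40 degrees of freedom
2 * pt(2.467, df = 40, lower.tail = FALSE)  # approximately 0.018
```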
For a two-sided test at level $\alpha$, we fail to reject the null hypothesis for all values of the test statistic $t$ in the interval
$$t_{n-K}(\alpha/2) \leq t \leq t_{n-K}(1 - \alpha/2),$$
where $t_{n-K}(\cdot)$ denotes a quantile of the $\mathsf{St}(n-K)$ distribution. Because of the symmetry around zero, $t_{n-K}(1-\alpha/2) = -t_{n-K}(\alpha/2)$.
The critical value can be obtained in R as

```r
qt(1 - alpha/2, df = n - K)
```

where `n` is the number of observations and `K` the number of groups. In the accompanying figure, the blue area defines the set of values for which we fail to reject the null $H_0$; all values of $t$ falling in the red area lead to rejection at level 5%.
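For the arithmetic data at level 5%, evaluating both tails:

```r
alpha <- 0.05
qt(c(alpha / 2, 1 - alpha / 2), df = 45 - 5)  # -2.021 and 2.021
```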
Let $\delta_{jk} = \mu_j - \mu_k$ denote the population difference, $\widehat{\delta}_{jk}$ the estimated difference (difference of sample averages), and $\mathsf{se}(\widehat{\delta}_{jk})$ the estimated standard error.

The region over which we fail to reject the null is
$$-t_{n-K}(1-\alpha/2) \leq \frac{\widehat{\delta}_{jk} - \delta_{jk}}{\mathsf{se}(\widehat{\delta}_{jk})} \leq t_{n-K}(1-\alpha/2),$$
which, rearranged, gives the $(1-\alpha)$ confidence interval for the (unknown) difference $\delta_{jk}$:
$$\widehat{\delta}_{jk} - \mathsf{se}(\widehat{\delta}_{jk})\, t_{n-K}(1-\alpha/2) \leq \delta_{jk} \leq \widehat{\delta}_{jk} + \mathsf{se}(\widehat{\delta}_{jk})\, t_{n-K}(1-\alpha/2).$$
The reported confidence interval is of the form
$$\text{estimate} \pm \text{critical value} \times \text{standard error}$$
and is given as [lower, upper], in the same units as the observations.
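A hand computation for the praise vs reprove difference, plugging in $\widehat{\delta}_{CD} = 4$ and $\mathsf{se}(\widehat{\delta}_{CD}) = 1.622$ from the earlier calculations:

```r
estimate <- 4
se <- 1.622
crit <- qt(0.975, df = 45 - 5)   # 2.021
estimate + c(-1, 1) * crit * se  # [0.72, 7.28], matching the emmeans output below
```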
If we replicate the experiment and compute a confidence interval each time, a proportion $(1-\alpha)$ of the intervals will, on average, contain the true value.

Each individual interval either contains the true value (black horizontal line) or doesn't.
Test statistics are standardized, so their values are not interpretable on the scale of the data. Two options for reporting: a $p$-value or a confidence interval.

For the arithmetic example, the 95% critical value is

```r
qt(0.975, df = 45 - 5)
```

The postulated value $\delta_{CD} = 0$ is not in the interval: reject $H_0$.
```r
library(emmeans)  # marginal means and contrasts
model <- aov(score ~ group, data = arithmetic)
margmeans <- emmeans(model, specs = "group")
contrast(margmeans,
         method = "pairwise",
         adjust = "none",
         infer = TRUE) |>
  as_tibble() |>
  filter(contrast == "praise - reprove") |>
  knitr::kable(digits = 3)
```
| contrast         | estimate | SE    | df | lower.CL | upper.CL | t.ratio | p.value |
|------------------|----------|-------|----|----------|----------|---------|---------|
| praise - reprove | 4        | 1.622 | 40 | 0.723    | 7.277    | 2.467   | 0.018   |
The output is either a p-value or a confidence interval