Session 2
MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal
Variability
Hypothesis tests
Pairwise comparisons
Interested in impacts of intervention or policy
Population distribution (describing possible outcomes and their frequencies) encodes everything we could be interested in.
Histograms for 10 random samples of size 20 from a discrete uniform distribution.
Data collection costly
\(\to\) limited information available about population.
Sample too small to reliably estimate distribution
Focus instead on particular summaries
\(\to\) mean, variance, odds, etc.
mean / expectation
$$\mu$$
standard deviation
$$\sigma= \sqrt{\text{variance}}$$
same scale as observations
Do not confuse the standard error (variability of a statistic) with the standard deviation (variability of an observation from the population).
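A small simulation sketch of the distinction (the seed, sample size and number of replications are arbitrary choices for illustration): the standard deviation of the observations stays near \(\sigma\), while the standard error of the sample mean shrinks as \(\sigma/\sqrt{n}\).

```r
# Standard deviation (spread of observations) versus
# standard error (spread of the sample mean across replications)
set.seed(80667)
n <- 20        # sample size
nrep <- 10000  # number of simulated samples
# sample means of nrep standard normal samples of size n
means <- replicate(nrep, mean(rnorm(n)))
sd(rnorm(n * nrep))  # close to sigma = 1
sd(means)            # close to sigma / sqrt(n) = 0.2236
```

The second quantity is the (simulated) standard error of the mean, not the standard deviation of the data.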
Not all samples are born alike
Can you spot the differences?
Histograms of data from uniform (top) and non-uniform (bottom) distributions with increasing sample sizes.
statistic = numerical summary of the data.
requires benchmark / standardization
typically a unitless quantity
need measure of uncertainty of statistic
Wald statistic
\begin{align*} W = \frac{\text{estimated qty} - \text{postulated qty}}{\text{std. error (estimated qty)}} \end{align*}
standard error = measure of variability (same units as obs.)
resulting ratio is unitless!
The standard error is typically a function of the sample size and of the standard deviation \(\sigma\) of the observations.
From Davison (2008), Example 9.2
In an investigation on the teaching of arithmetic, 45 pupils were divided at random into five groups of nine. Groups A and B were taught in separate classes by the usual method. Groups C, D, and E were taught together for a number of days. On each day C were praised publicly for their work, D were publicly reproved and E were ignored. At the end of the period all pupils took a standard test.
data(arithmetic, package = "hecedsm")
# categorical variable = factor
# Look up data
str(arithmetic)
## 'data.frame': 45 obs. of 2 variables:
##  $ group: Factor w/ 5 levels "control 1","control 2",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ score: num 17 14 24 20 24 23 16 15 24 21 ...
# compute summary statistics
summary_stat <- arithmetic |>
  group_by(group) |>
  summarize(mean = mean(score), sd = sd(score))
knitr::kable(summary_stat, digits = 2)
| group     | mean  | sd   |
|-----------|-------|------|
| control 1 | 19.67 | 4.21 |
| control 2 | 18.33 | 3.57 |
| praise    | 27.44 | 2.46 |
| reprove   | 23.44 | 3.09 |
| ignore    | 16.11 | 3.62 |
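Because the five groups all have \(n_g = 9\) pupils, the pooled standard deviation is simply the root mean of the group variances, and the standard error of a difference in means follows from it. A sketch using the rounded values from the table above:

```r
# Pooled standard deviation from the group-level summaries
# (equal group sizes n_g = 9, so the pooled variance is the
#  plain average of the five group variances)
group_sd <- c(4.21, 3.57, 2.46, 3.09, 3.62)  # sd column of the table
n_g <- 9
sd_pool <- sqrt(mean(group_sd^2))
sd_pool                                   # roughly 3.44
se_diff <- sd_pool * sqrt(1 / n_g + 1 / n_g)  # std. error of a difference in means
se_diff                                   # roughly 1.62
```

This reproduces (up to rounding) the standard error used in the Wald statistic below.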
# Boxplot with jittered data
ggplot(data = arithmetic, aes(x = group, y = score)) +
  geom_boxplot() +
  geom_jitter(width = 0.3, height = 0) +
  theme_bw()
Let \(\mu_{C}\) and \(\mu_{D}\) denote the population average (expectation) score for praise and reprove, respectively.
Our null hypothesis is $$\mathscr{H}_0: \mu_{C} = \mu_{D}$$ against the alternative \(\mathscr{H}_a\) that they are different (two-sided test).
Equivalent to \(\delta_{CD} = \mu_C - \mu_D = 0\).
The value of the Wald statistic is $$t=\frac{\widehat{\delta}_{CD} - 0}{\mathsf{se}(\widehat{\delta}_{CD})} = \frac{4}{1.6216}=2.467$$
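The value above can be reproduced directly from the estimated difference in sample means and its standard error (both taken from the summary statistics and the software output shown elsewhere in the deck):

```r
# Wald (t) statistic for the praise vs reprove comparison
delta_hat <- 27.44 - 23.44  # difference in sample means (praise - reprove)
se_delta  <- 1.6216         # standard error of the difference
t_stat <- (delta_hat - 0) / se_delta
t_stat                      # approximately 2.467
```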
How 'extreme' is this number?
Could it have happened by chance if there was no difference between groups?
Benchmarking
The null distribution tells us what the plausible values of the statistic are, and their relative frequency, if the null hypothesis holds.
What can we expect to see by chance if there is no difference between groups?
Oftentimes, the null distribution comes with the test statistic
Alternatives include
Null distributions differ from one statistic to another, which makes comparisons difficult.
Uniform distribution under \(\mathscr{H}_0\)
Fix level \(\alpha\) before the experiment.
Choose small \(\alpha\) (typical value is 5%)
Reject \(\mathscr{H}_0\) if the p-value is less than \(\alpha\)
Question: why can't we fix \(\alpha=0\)?
The American Statistical Association (ASA) published a statement on (mis)interpretation of p-values.
(2) P-values do not measure the probability that the studied hypothesis is true
(3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
(4) P-values and related analyses should not be reported selectively
(5) P-value, or statistical significance, does not measure the size of an effect or the importance of a result
Nature's checklist
The p-values and confidence intervals for pairwise differences between groups \(j\) and \(k\) are based on the t-statistic:
\begin{align*} t = \frac{\text{estimated} - \text{postulated difference}}{\text{uncertainty}}= \frac{(\widehat{\mu}_j - \widehat{\mu}_k) - (\mu_j - \mu_k)}{\mathsf{se}(\widehat{\mu}_j - \widehat{\mu}_k)}. \end{align*}
In large samples, this statistic behaves like a Student-t variable with \(n-K\) degrees of freedom, denoted \(\mathsf{St}(n-K)\) hereafter.
Note: in an analysis of variance model, the standard error \(\mathsf{se}(\widehat{\mu}_j - \widehat{\mu}_k)\) is based on the pooled variance estimate (estimated using all observations).
Consider the pairwise average difference in scores between the praise (group C) and the reprove (group D) groups of the arithmetic data.
If we postulate \(\delta_{jk} = \mu_j - \mu_k = 0\), the test statistic becomes
\begin{align*} t = \frac{\widehat{\delta}_{jk} - 0}{\mathsf{se}(\widehat{\delta}_{jk})} \end{align*}
The \(p\)-value is \(p = 1- \Pr(-|t| \leq T \leq |t|)\) for \(T \sim \mathsf{St}(n-K)\).
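The formula can be evaluated directly; with \(t = 2.467\) and \(n - K = 45 - 5 = 40\) degrees of freedom:

```r
# Two-sided p-value from the Student-t null distribution
t_stat <- 2.467
df <- 45 - 5  # n - K degrees of freedom
# 2 * Pr(T <= -|t|) equals 1 - Pr(-|t| <= T <= |t|) by symmetry
p_val <- 2 * pt(-abs(t_stat), df = df)
p_val         # approximately 0.018
```

This matches the p-value reported by emmeans later in the deck.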
Recall: the larger the values of the statistic \(t\) (either positive or negative), the more evidence against the null hypothesis.
For a test at level \(\alpha\) (two-sided), we fail to reject the null hypothesis for all values of the test statistic \(t\) in the interval
$$\mathfrak{t}_{n-K}(\alpha/2) \leq t \leq \mathfrak{t}_{n-K}(1-\alpha/2)$$
Because of the symmetry around zero, \(\mathfrak{t}_{n-K}(1-\alpha/2) = -\mathfrak{t}_{n-K}(\alpha/2)\).
qt(1 - alpha/2, df = n - K)
where n is the number of observations and K the number of groups.
The blue area defines the set of values for which we fail to reject the null \(\mathscr{H}_0\).
All values of \(t\) falling in the red area lead to rejection at level \(5\)%.
Let \(\delta_{jk}=\mu_j - \mu_k\) denote the population difference, \(\widehat{\delta}_{jk}\) the estimated difference (difference in sample averages) and \(\mathsf{se}(\widehat{\delta}_{jk})\) the estimated standard error.
The region for which we fail to reject the null is \begin{align*} -\mathfrak{t}_{n-K}(1-\alpha/2) \leq \frac{\widehat{\delta}_{jk} - \delta_{jk}}{\mathsf{se}(\widehat{\delta}_{jk})} \leq \mathfrak{t}_{n-K}(1-\alpha/2) \end{align*} which rearranged gives the \((1-\alpha)\) confidence interval for the (unknown) difference \(\delta_{jk}\).
\begin{align*} \widehat{\delta}_{jk} - \mathsf{se}(\widehat{\delta}_{jk})\mathfrak{t}_{n-K}(1-\alpha/2) \leq \delta_{jk} \leq \widehat{\delta}_{jk} + \mathsf{se}(\widehat{\delta}_{jk})\mathfrak{t}_{n-K}(1-\alpha/2) \end{align*}
The reported confidence interval is of the form
$$ \text{estimate} \pm \text{critical value} \times \text{standard error}$$
confidence interval = [lower, upper] units
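For the praise vs reprove difference, the recipe gives (estimate and standard error taken from earlier in the deck):

```r
# 95% confidence interval: estimate ± critical value × standard error
delta_hat <- 4        # estimated difference (praise - reprove)
se_delta  <- 1.6216   # standard error of the difference
crit <- qt(0.975, df = 45 - 5)          # critical value, approx. 2.021
delta_hat + c(-1, 1) * crit * se_delta  # approximately [0.723, 7.277]
```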
If we replicate the experiment and compute confidence intervals each time, each interval either contains the true value (black horizontal line) or it doesn't.
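This frequentist interpretation can be checked by simulation, a sketch under assumed settings (normal data with known true mean 0; the seed, sample size and number of replications are arbitrary): roughly 95% of the intervals cover the true value.

```r
# Coverage sketch: proportion of 95% t-intervals containing the true mean
set.seed(80667)
nrep <- 2000  # number of replicated experiments
n <- 20       # observations per experiment
mu <- 0       # true (unknown in practice) mean
covered <- replicate(nrep, {
  x <- rnorm(n, mean = mu)
  ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)  # close to 0.95
```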
Test statistics are standardized.
Two options for reporting:
qt(0.975, df = 45 - 5) # critical value, approximately 2.021
The postulated value \(\delta_{CD}=0\) is not in the interval: reject \(\mathscr{H}_0\).
library(emmeans) # marginal means and contrasts
model <- aov(score ~ group, data = arithmetic)
margmeans <- emmeans(model, specs = "group")
contrast(margmeans,
         method = "pairwise",
         adjust = "none",
         infer = TRUE) |>
  as_tibble() |>
  filter(contrast == "praise - reprove") |>
  knitr::kable(digits = 3)
| contrast         | estimate | SE    | df | lower.CL | upper.CL | t.ratio | p.value |
|------------------|----------|-------|----|----------|----------|---------|---------|
| praise - reprove | 4        | 1.622 | 40 | 0.723    | 7.277    | 2.467   | 0.018   |
The output reports both the p-value and the confidence interval.