
Hypothesis testing

Session 2

MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal

1 / 43

Outline

Variability

Hypothesis tests

Pairwise comparisons

2 / 43

Sampling variability

3 / 43

Studying a population

Interested in impacts of intervention or policy

Population distribution (describing possible outcomes and their frequencies) encodes everything we could be interested in.

4 / 43

Sampling variability

Histograms for 10 random samples of size 20 from a discrete uniform distribution.
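A minimal R sketch of this simulation (the exact support of the uniform distribution in the figure is an assumption; here the integers 1 to 10):

# draw 10 random samples of size 20 from a discrete uniform on {1, ..., 10}
set.seed(80667)
samples <- replicate(10, sample(1:10, size = 20, replace = TRUE))
# one histogram per sample to show how much they vary by chance alone
par(mfrow = c(2, 5))
for (i in 1:10) {
  hist(samples[, i], breaks = 0:10, main = paste("sample", i), xlab = "value")
}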

5 / 43

Decision making under uncertainty

  • Data collection costly

    limited information available about population.

  • Sample too small to reliably estimate distribution

  • Focus instead on particular summaries

    mean, variance, odds, etc.

6 / 43

Population characteristics

mean / expectation: $\mu$

standard deviation: $\sigma = \sqrt{\text{variance}}$

(same scale as the observations)

7 / 43

Do not confuse the standard error (variability of a statistic) with the standard deviation (variability of an observation drawn from the population).
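A small simulation sketch of the distinction (the normal population and its parameters are hypothetical):

set.seed(80667)
n <- 20
x <- rnorm(n, mean = 10, sd = 3)  # one sample of n observations
sd(x)            # standard deviation: spread of individual observations
sd(x) / sqrt(n)  # standard error of the sample mean: spread of the statistic over repeated samples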

Sampling variability

8 / 43

Not all samples are born alike

  • Analogy: comparing kids (or siblings): not everyone looks alike (except twins...)
  • Chance and haphazard variability mean that we may have a good idea of the truth, but never know it exactly.

The signal and the noise

Can you spot the differences?

9 / 43

Information accumulates

Histograms of data from uniform (top) and non-uniform (bottom) distributions with increasing sample sizes.

10 / 43

Hypothesis tests

11 / 43

The general recipe of hypothesis testing

  1. Define variables
  2. Write down hypotheses (null/alternative)
  3. Choose and compute a test statistic
  4. Compare the value to the null distribution (benchmark)
  5. Compute the p-value
  6. Conclude (reject/fail to reject)
  7. Report findings
12 / 43

Hypothesis tests versus trials

Scene from "12 Angry Men" by Sidney Lumet

  • Binary decision: guilty/not guilty
  • Summarize the evidence (proof)
  • Assess evidence in light of presumption of innocence
  • Verdict: either guilty or not guilty
  • Potential for judicial mistakes
13 / 43


How to assess evidence?

statistic = numerical summary of the data.

requires benchmark / standardization

typically a unitless quantity

need measure of uncertainty of statistic

14 / 43

General construction principles

Wald statistic

$$W = \frac{\text{estimated qty} - \text{postulated qty}}{\text{std. error (estimated qty)}}$$

standard error = measure of variability (same units as obs.)

resulting ratio is unitless!

15 / 43

The standard error is typically a function of the sample size and of the standard deviation $\sigma$ of the observations.
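For illustration only (hypothetical data, not the arithmetic example that follows), a one-sample Wald statistic for a postulated mean, using the standard error of the sample mean:

x <- c(12.1, 9.4, 10.8, 11.5, 8.9, 10.2, 11.9, 9.7)  # hypothetical measurements
mu0 <- 10                                            # postulated value of the mean
se_mean <- sd(x) / sqrt(length(x))                   # standard error of the sample mean
wald <- (mean(x) - mu0) / se_mean                    # (estimated - postulated) / std. error
wald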

Impact of encouragement on teaching

From Davison (2008), Example 9.2

In an investigation on the teaching of arithmetic, 45 pupils were divided at random into five groups of nine. Groups A and B were taught in separate classes by the usual method. Groups C, D, and E were taught together for a number of days. On each day C were praised publicly for their work, D were publicly reproved and E were ignored. At the end of the period all pupils took a standard test.

16 / 43

Basic manipulations in R: load data

data(arithmetic, package = "hecedsm")
# categorical variable = factor
# Look up data
str(arithmetic)
## 'data.frame': 45 obs. of 2 variables:
## $ group: Factor w/ 5 levels "control 1","control 2",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ score: num 17 14 24 20 24 23 16 15 24 21 ...
17 / 43

Basic manipulations in R: summary statistics

# compute summary statistics
library(dplyr) # data manipulation: group_by() and summarize()
summary_stat <- arithmetic |>
  group_by(group) |>
  summarize(mean = mean(score),
            sd = sd(score))
knitr::kable(summary_stat, digits = 2)

group      mean    sd
control 1  19.67  4.21
control 2  18.33  3.57
praise     27.44  2.46
reprove    23.44  3.09
ignore     16.11  3.62
18 / 43

Basic manipulations in R: plot

# Boxplot with jittered data
library(ggplot2) # plotting
ggplot(data = arithmetic,
       aes(x = group, y = score)) +
  geom_boxplot() +
  geom_jitter(width = 0.3, height = 0) +
  theme_bw()

19 / 43

Formulating a hypothesis

Let $\mu_C$ and $\mu_D$ denote the population average (expectation) score for the praise and reprove groups, respectively.

Our null hypothesis is $H_0: \mu_C = \mu_D$, against the alternative $H_a$ that they differ (two-sided test).

Equivalently, $\delta_{CD} = \mu_C - \mu_D = 0$.
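As a quick check, one could compare these two groups directly with a two-sample t-test; note that t.test() pools the variance over these two groups only, so the standard error differs slightly from the ANOVA-based pairwise comparison used in the following slides:

data(arithmetic, package = "hecedsm")
with(arithmetic,
     t.test(x = score[group == "praise"],
            y = score[group == "reprove"],
            var.equal = TRUE))  # two-sided test of equal means with pooled variance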

20 / 43

Test statistic

The value of the Wald statistic is $$t = \frac{\widehat{\delta}_{CD} - 0}{\mathrm{se}(\widehat{\delta}_{CD})} = \frac{4}{1.6216} = 2.467$$

How 'extreme' is this number?

21 / 43

Could it have happened by chance if there was no difference between groups?

Assessing evidence

Is 1 big?

Benchmarking

  • The same number can have different meanings
    • units matter!
  • Meaningful comparisons require some reference.
22 / 43

Possible, but not plausible

The null distribution tells us which values of the statistic are plausible, and how frequently they occur, if the null hypothesis holds.

What can we expect to see by chance if there is no difference between groups?

23 / 43

Oftentimes, the null distribution comes with the test statistic

Alternatives include

  • Large sample behaviour (asymptotic distribution)
  • Resampling/bootstrap/permutation.

P-value

Null distributions differ from one test to another, which makes comparing raw statistics difficult.

  • The p-value gives the probability, computed assuming the null hypothesis is true, of observing an outcome at least as extreme as the one obtained.

24 / 43

Uniform distribution under H0

Level = probability of condemning an innocent person, i.e., of rejecting $H_0$ when it is true.

Fix the level $\alpha$ before the experiment.

Choose a small $\alpha$ (typical value is 5%).

Reject $H_0$ if the p-value is less than $\alpha$.

25 / 43

Question: why can't we fix α=0?

What is really a p-value?

The American Statistical Association (ASA) published a statement on (mis)interpretation of p-values.

(2) P-values do not measure the probability that the studied hypothesis is true

(3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

(4) P-values and related analyses should not be reported selectively

(5) P-value, or statistical significance, does not measure the size of an effect or the importance of a result

26 / 43

Reporting results of a statistical procedure

Nature's checklist

27 / 43

Pairwise comparisons

28 / 43

Pairwise differences and t-tests

The p-values and confidence intervals for pairwise differences between groups $j$ and $k$ are based on the t-statistic

$$t = \frac{\text{estimated} - \text{postulated difference}}{\text{uncertainty}} = \frac{(\widehat{\mu}_j - \widehat{\mu}_k) - (\mu_j - \mu_k)}{\mathrm{se}(\widehat{\mu}_j - \widehat{\mu}_k)}.$$

In large samples, this statistic behaves like a Student-t variable with $n - K$ degrees of freedom, denoted $\mathsf{St}(n-K)$ hereafter.

Note: in an analysis of variance model, the standard error $\mathrm{se}(\widehat{\mu}_j - \widehat{\mu}_k)$ is based on the pooled variance estimate (computed using all observations).

29 / 43

Pairwise differences

Consider the pairwise average difference in scores between the praise (group C) and reprove (group D) groups of the arithmetic data.

  • The group sample averages are $\widehat{\mu}_C = 27.4$ and $\widehat{\mu}_D = 23.4$.
  • The estimated average difference between groups C and D is $\widehat{\delta}_{CD} = 4$.
  • The estimated pooled standard deviation for the five groups is approximately $3.44$.
  • The standard error for the pairwise difference is $\mathrm{se}(\widehat{\delta}_{CD}) = 1.6216$.
  • There are $n = 45$ observations and $K = 5$ groups.
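These quantities can be recovered in R from the per-group summary statistics (a sketch; it assumes the equal group sizes of nine described in the experiment):

n_g <- 9; K <- 5; n <- n_g * K
group_sd <- c(4.21, 3.57, 2.46, 3.09, 3.62)  # standard deviations from the summary table
sd_pooled <- sqrt(mean(group_sd^2))          # pooled standard deviation (equal group sizes)
se_diff <- sd_pooled * sqrt(1/n_g + 1/n_g)   # standard error of a pairwise difference
delta_hat <- 27.44 - 23.44                   # praise minus reprove
c(sd_pooled = sd_pooled, se_diff = se_diff, t_stat = delta_hat / se_diff)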
30 / 43

t-tests: null distribution is Student-t

If we postulate $\delta_{jk} = \mu_j - \mu_k = 0$, the test statistic becomes

$$t = \frac{\widehat{\delta}_{jk} - 0}{\mathrm{se}(\widehat{\delta}_{jk})}$$

The p-value is $p = 1 - \Pr(-|t| \leq T \leq |t|)$ for $T \sim \mathsf{St}(n-K)$.

  • probability of the statistic being more extreme than $t$

Recall: the larger the values of the statistic t (either positive or negative), the more evidence against the null hypothesis.
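In R, with the numbers from the example ($t = 2.467$, $n = 45$, $K = 5$):

t_stat <- 2.467
n <- 45; K <- 5
2 * pt(abs(t_stat), df = n - K, lower.tail = FALSE)  # two-sided p-value, roughly 0.018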

31 / 43

Critical values

For a test at level $\alpha$ (two-sided), we fail to reject the null hypothesis for all values of the test statistic $t$ in the interval

$$t_{n-K}(\alpha/2) \leq t \leq t_{n-K}(1-\alpha/2)$$

Because of the symmetry around zero, $t_{n-K}(1-\alpha/2) = -t_{n-K}(\alpha/2)$.

  • We call $t_{n-K}(1-\alpha/2)$ a critical value.
  • in R, the quantiles of the Student t distribution are obtained from qt(1-alpha/2, df = n - K) where n is the number of observations and K the number of groups.
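For the arithmetic example, with $n = 45$, $K = 5$ and $\alpha = 0.05$:

alpha <- 0.05
qt(1 - alpha/2, df = 45 - 5)  # critical value, approximately 2.021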
32 / 43

Null distribution

The blue area defines the set of values for which we fail to reject the null hypothesis $H_0$.

All values of t falling in the red area lead to rejection at level 5%.

33 / 43

Example

  • If $H_0: \delta_{CD} = 0$, the t statistic is $t = \frac{\widehat{\delta}_{CD} - 0}{\mathrm{se}(\widehat{\delta}_{CD})} = \frac{4}{1.6216} = 2.467$
  • The p-value is p=0.018.
  • We reject the null at level α=5% since 0.018<0.05.
  • Conclude that there is a significant difference at level α=0.05 between the average scores of subpopulations C and D.
34 / 43

Confidence interval

Let $\delta_{jk} = \mu_j - \mu_k$ denote the population difference, $\widehat{\delta}_{jk}$ the estimated difference (difference in sample averages), and $\mathrm{se}(\widehat{\delta}_{jk})$ the estimated standard error.

The region for which we fail to reject the null is
$$-t_{n-K}(1-\alpha/2) \leq \frac{\widehat{\delta}_{jk} - \delta_{jk}}{\mathrm{se}(\widehat{\delta}_{jk})} \leq t_{n-K}(1-\alpha/2),$$
which, rearranged, gives the $(1-\alpha)$ confidence interval for the (unknown) difference $\delta_{jk}$:

$$\widehat{\delta}_{jk} - \mathrm{se}(\widehat{\delta}_{jk})\, t_{n-K}(1-\alpha/2) \leq \delta_{jk} \leq \widehat{\delta}_{jk} + \mathrm{se}(\widehat{\delta}_{jk})\, t_{n-K}(1-\alpha/2)$$
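A sketch of this computation for the praise versus reprove difference, with the numbers from the example:

delta_hat <- 4
se_delta <- 1.6216
crit <- qt(0.975, df = 45 - 5)          # critical value for alpha = 5%
delta_hat + c(-1, 1) * crit * se_delta  # 95% confidence interval, about [0.72, 7.28]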

35 / 43

Interpretation of confidence intervals

The reported confidence interval is of the form

estimate ± critical value × standard error

confidence interval = [lower, upper] units

If we replicate the experiment and compute confidence intervals each time

  • on average, 95% of those intervals will contain the true value if the assumptions underlying the model are met.
36 / 43

Interpretation in a picture: coin toss analogy

Each interval either contains the true value (black horizontal line) or doesn't.

100 confidence intervals
37 / 43

Why confidence intervals?

Test statistics are standardized:

  • good for comparisons with a benchmark
  • but typically meaningless on their own (standardized = unitless quantities)

Two options for reporting:

  • p-value: probability of more extreme outcome if no mean difference
  • confidence intervals: set of all values for which we fail to reject the null hypothesis at level α for the given sample
38 / 43

Example

  • Mean difference of $\widehat{\delta}_{CD} = 4$, with $\mathrm{se}(\widehat{\delta}_{CD}) = 1.6216$.
  • The critical values for a test at level $\alpha = 5$% are $-2.021$ and $2.021$
    • qt(0.975, df = 45 - 5)
  • Since $|t| > 2.021$, we reject $H_0$: the difference between the two population averages is statistically significant at level $\alpha = 5$%.
  • The confidence interval is $[4 - 1.6216 \times 2.021,\; 4 + 1.6216 \times 2.021] = [0.723, 7.277]$.

The postulated value $\delta_{CD} = 0$ is not in the interval: reject $H_0$.

39 / 43

Pairwise differences in R

library(emmeans) # marginal means and contrasts
library(dplyr)   # for as_tibble() and filter()
model <- aov(score ~ group, data = arithmetic)
margmeans <- emmeans(model, specs = "group")
contrast(margmeans,
         method = "pairwise",
         adjust = "none",
         infer = TRUE) |>
  as_tibble() |>
  filter(contrast == "praise - reprove") |>
  knitr::kable(digits = 3)

contrast          estimate     SE  df  lower.CL  upper.CL  t.ratio  p.value
praise - reprove         4  1.622  40     0.723     7.277    2.467    0.018
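The contrast() call computes all pairwise differences between the estimated marginal means; filtering keeps only the praise - reprove row, whose estimate (4), standard error (1.622), t ratio (2.467), p-value (0.018) and confidence limits [0.723, 7.277] match the hand calculations of the previous slides.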
40 / 43

Recap 1

  • Testing procedures factor in the uncertainty inherent to sampling.
  • Adopt a particular viewpoint: assume the null hypothesis (the simpler model, e.g., no difference between groups) is true, and weigh the evidence under that lens.
41 / 43

Recap 2

  • p-values measure compatibility with the null model (relative to an alternative)
  • Test statistics are standardized values.

The output is either a p-value or a confidence interval

  • confidence interval: on scale of data (meaningful interpretation)
  • p-values: uniform on [0,1] if the null hypothesis is true
42 / 43

Recap 3

  • All hypothesis tests share common ingredients
  • Many approaches, models, and tests can lead to the same conclusion.
  • Transparent reporting is important!
43 / 43
