Session 2
MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal
Variability
Hypothesis tests
Pairwise comparisons
Interested in impacts of intervention or policy
Population distribution (describing possible outcomes and their frequencies) encodes everything we could be interested in.
Histograms for 10 random samples of size 20 from a discrete uniform distribution.
Data collection costly
\(\to\) limited information available about population.
Sample too small to reliably estimate distribution
Focus instead on particular summaries
\(\to\) mean, variance, odds, etc.
mean / expectation
$$\mu$$
standard deviation
$$\sigma= \sqrt{\text{variance}}$$
same scale as observations
Do not confuse the standard error (variability of a statistic) with the standard deviation (variability of an observation from the population).
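A small simulation sketch of the distinction (the seed, sample size and number of replications are arbitrary choices for illustration): the standard deviation of the observations stays near \(\sigma\), while the standard error of the sample mean shrinks as \(\sigma/\sqrt{n}\).

```r
# Standard deviation (spread of observations) versus
# standard error (spread of the sample mean across replications)
set.seed(80667)
n <- 20        # sample size
nrep <- 10000  # number of simulated samples
# sample means of nrep standard normal samples of size n
means <- replicate(nrep, mean(rnorm(n)))
sd(rnorm(n * nrep))  # close to sigma = 1
sd(means)            # close to sigma / sqrt(n) = 0.2236
```

The second quantity is the (simulated) standard error of the mean, not the standard deviation of the data.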
Not all samples are born alike
Can you spot the differences?
Histograms of data from uniform (top) and non-uniform (bottom) distributions with increasing sample sizes.
statistic = numerical summary of the data.
requires benchmark / standardization
typically a unitless quantity
need measure of uncertainty of statistic
Wald statistic
\begin{align*} W = \frac{\text{estimated qty} - \text{postulated qty}}{\text{std. error (estimated qty)}} \end{align*}
standard error = measure of variability (same units as obs.)
resulting ratio is unitless!
The standard error is typically a function of the sample size and of the standard deviation \(\sigma\) of the observations.
From Davison (2008), Example 9.2
In an investigation on the teaching of arithmetic, 45 pupils were divided at random into five groups of nine. Groups A and B were taught in separate classes by the usual method. Groups C, D, and E were taught together for a number of days. On each day C were praised publicly for their work, D were publicly reproved and E were ignored. At the end of the period all pupils took a standard test.
data(arithmetic, package = "hecedsm")
# categorical variable = factor
# Look up data
str(arithmetic)
## 'data.frame': 45 obs. of 2 variables:
##  $ group: Factor w/ 5 levels "control 1","control 2",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ score: num 17 14 24 20 24 23 16 15 24 21 ...
# compute summary statistics
summary_stat <- arithmetic |>
  group_by(group) |>
  summarize(mean = mean(score), sd = sd(score))
knitr::kable(summary_stat, digits = 2)
| group     | mean  | sd   |
|-----------|-------|------|
| control 1 | 19.67 | 4.21 |
| control 2 | 18.33 | 3.57 |
| praise    | 27.44 | 2.46 |
| reprove   | 23.44 | 3.09 |
| ignore    | 16.11 | 3.62 |
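Because the five groups all have \(n_g = 9\) pupils, the pooled standard deviation is simply the root mean of the group variances, and the standard error of a difference in means follows from it. A sketch using the rounded values from the table above:

```r
# Pooled standard deviation from the group-level summaries
# (equal group sizes n_g = 9, so the pooled variance is the
#  plain average of the five group variances)
group_sd <- c(4.21, 3.57, 2.46, 3.09, 3.62)  # sd column of the table
n_g <- 9
sd_pool <- sqrt(mean(group_sd^2))
sd_pool                                   # roughly 3.44
se_diff <- sd_pool * sqrt(1 / n_g + 1 / n_g)  # std. error of a difference in means
se_diff                                   # roughly 1.62
```

This reproduces (up to rounding) the standard error used in the Wald statistic below.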
# Boxplot with jittered data
ggplot(data = arithmetic, aes(x = group, y = score)) +
  geom_boxplot() +
  geom_jitter(width = 0.3, height = 0) +
  theme_bw()
Let \(\mu_{C}\) and \(\mu_{D}\) denote the population average (expectation) score for praise and reprove, respectively.
Our null hypothesis is $$\mathscr{H}_0: \mu_{C} = \mu_{D}$$ against the alternative \(\mathscr{H}_a\) that they are different (two-sided test).
Equivalent to \(\delta_{CD} = \mu_C - \mu_D = 0\).
The value of the Wald statistic is $$t=\frac{\widehat{\delta}_{CD} - 0}{\mathsf{se}(\widehat{\delta}_{CD})} = \frac{4}{1.6216}=2.467$$
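The value above can be reproduced directly from the estimated difference in sample means and its standard error (both taken from the summary statistics and the software output shown elsewhere in the deck):

```r
# Wald (t) statistic for the praise vs reprove comparison
delta_hat <- 27.44 - 23.44  # difference in sample means (praise - reprove)
se_delta  <- 1.6216         # standard error of the difference
t_stat <- (delta_hat - 0) / se_delta
t_stat                      # approximately 2.467
```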
How 'extreme' is this number?
Could it have happened by chance if there was no difference between groups?
Benchmarking
The null distribution tells us what the plausible values of the statistic are, and their relative frequency, if the null hypothesis holds.
What can we expect to see by chance if there is no difference between groups?
Oftentimes, the null distribution comes with the test statistic
Alternatives include
Null distributions differ from one statistic to another, which makes comparisons difficult.
Uniform distribution under \(\mathscr{H}_0\)
Fix level \(\alpha\) before the experiment.
Choose small \(\alpha\) (typical value is 5%)
Reject \(\mathscr{H}_0\) if the p-value is less than \(\alpha\)
Question: why can't we fix \(\alpha=0\)?
The American Statistical Association (ASA) published a statement on (mis)interpretation of p-values.
(2) P-values do not measure the probability that the studied hypothesis is true
(3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
(4) P-values and related analyses should not be reported selectively
(5) P-value, or statistical significance, does not measure the size of an effect or the importance of a result
Nature's checklist
The p-values and confidence intervals for pairwise differences between groups \(j\) and \(k\) are based on the t-statistic:
\begin{align*} t = \frac{\text{estimated} - \text{postulated difference}}{\text{uncertainty}}= \frac{(\widehat{\mu}_j - \widehat{\mu}_k) - (\mu_j - \mu_k)}{\mathsf{se}(\widehat{\mu}_j - \widehat{\mu}_k)}. \end{align*}
In large samples, this statistic behaves like a Student-t variable with \(n-K\) degrees of freedom, denoted \(\mathsf{St}(n-K)\) hereafter.
Note: in an analysis of variance model, the standard error \(\mathsf{se}(\widehat{\mu}_j - \widehat{\mu}_k)\) is based on the pooled variance estimate (estimated using all observations).
Consider the pairwise average difference in scores between the praise (group C) and the reprove (group D) groups of the arithmetic data.
If we postulate \(\delta_{jk} = \mu_j - \mu_k = 0\), the test statistic becomes
\begin{align*} t = \frac{\widehat{\delta}_{jk} - 0}{\mathsf{se}(\widehat{\delta}_{jk})} \end{align*}
The \(p\)-value is \(p = 1- \Pr(-|t| \leq T \leq |t|)\) for \(T \sim \mathsf{St}(n-K)\).
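The formula can be evaluated directly; with \(t = 2.467\) and \(n - K = 45 - 5 = 40\) degrees of freedom:

```r
# Two-sided p-value from the Student-t null distribution
t_stat <- 2.467
df <- 45 - 5  # n - K degrees of freedom
# 2 * Pr(T <= -|t|) equals 1 - Pr(-|t| <= T <= |t|) by symmetry
p_val <- 2 * pt(-abs(t_stat), df = df)
p_val         # approximately 0.018
```

This matches the p-value reported by emmeans later in the deck.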
Recall: the larger the values of the statistic \(t\) (either positive or negative), the more evidence against the null hypothesis.
For a test at level \(\alpha\) (two-sided), we fail to reject the null hypothesis for all values of the test statistic \(t\) in the interval
$$\mathfrak{t}_{n-K}(\alpha/2) \leq t \leq \mathfrak{t}_{n-K}(1-\alpha/2)$$
Because of the symmetry around zero, \(\mathfrak{t}_{n-K}(1-\alpha/2) = -\mathfrak{t}_{n-K}(\alpha/2)\).
qt(1 - alpha/2, df = n - K)
where n is the number of observations and K the number of groups.
The blue area defines the set of values for which we fail to reject the null \(\mathscr{H}_0\).
All values of \(t\) falling in the red area lead to rejection at level \(5\)%.
Let \(\delta_{jk}=\mu_j - \mu_k\) denote the population difference, \(\widehat{\delta}_{jk}\) the estimated difference (difference in sample averages) and \(\mathsf{se}(\widehat{\delta}_{jk})\) the estimated standard error.
The region for which we fail to reject the null is \begin{align*} -\mathfrak{t}_{n-K}(1-\alpha/2) \leq \frac{\widehat{\delta}_{jk} - \delta_{jk}}{\mathsf{se}(\widehat{\delta}_{jk})} \leq \mathfrak{t}_{n-K}(1-\alpha/2) \end{align*} which rearranged gives the \((1-\alpha)\) confidence interval for the (unknown) difference \(\delta_{jk}\).
\begin{align*} \widehat{\delta}_{jk} - \mathsf{se}(\widehat{\delta}_{jk})\mathfrak{t}_{n-K}(1-\alpha/2) \leq \delta_{jk} \leq \widehat{\delta}_{jk} + \mathsf{se}(\widehat{\delta}_{jk})\mathfrak{t}_{n-K}(1-\alpha/2) \end{align*}
The reported confidence interval is of the form
$$ \text{estimate} \pm \text{critical value} \times \text{standard error}$$
confidence interval = [lower, upper] units
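For the praise vs reprove difference, the recipe gives (estimate and standard error taken from earlier in the deck):

```r
# 95% confidence interval: estimate ± critical value × standard error
delta_hat <- 4        # estimated difference (praise - reprove)
se_delta  <- 1.6216   # standard error of the difference
crit <- qt(0.975, df = 45 - 5)          # critical value, approx. 2.021
delta_hat + c(-1, 1) * crit * se_delta  # approximately [0.723, 7.277]
```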
If we replicate the experiment and compute confidence intervals each time, each interval either contains the true value (black horizontal line) or it doesn't.
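This frequentist interpretation can be checked by simulation, a sketch under assumed settings (normal data with known true mean 0; the seed, sample size and number of replications are arbitrary): roughly 95% of the intervals cover the true value.

```r
# Coverage sketch: proportion of 95% t-intervals containing the true mean
set.seed(80667)
nrep <- 2000  # number of replicated experiments
n <- 20       # observations per experiment
mu <- 0       # true (unknown in practice) mean
covered <- replicate(nrep, {
  x <- rnorm(n, mean = mu)
  ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)  # close to 0.95
```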
Test statistics are standardized.
Two options for reporting:
qt(0.975, df = 45 - 5) # critical value, approximately 2.021
The postulated value \(\delta_{CD}=0\) is not in the interval: reject \(\mathscr{H}_0\).
library(emmeans) # marginal means and contrasts
model <- aov(score ~ group, data = arithmetic)
margmeans <- emmeans(model, specs = "group")
contrast(margmeans,
         method = "pairwise",
         adjust = "none",
         infer = TRUE) |>
  as_tibble() |>
  filter(contrast == "praise - reprove") |>
  knitr::kable(digits = 3)
| contrast         | estimate | SE    | df | lower.CL | upper.CL | t.ratio | p.value |
|------------------|----------|-------|----|----------|----------|---------|---------|
| praise - reprove | 4        | 1.622 | 40 | 0.723    | 7.277    | 2.467   | 0.018   |
The output reports both the p-value and the confidence interval.