class: center middle main-title section-title-1

# Effect size and power

.class-info[

**Session 9**

.light[MATH 80667A: Experimental Design and Statistical Methods
<br>
HEC Montréal
]

]

---
name: outline
class: title title-inv-1

# Outline

--

.box-1.large.sp-after-half[Effect sizes]

.box-3.large.sp-after-half[Power]

---
layout: false
name: effect
class: center middle section-title section-title-1 animated fadeIn

# Effect size

---
layout: true
class: title title-1

---

# Motivating example

Quote from the OSC psychology replication:

> The key statistics provided in the paper to test the “depletion” hypothesis is the main effect of a one-way ANOVA with three experimental conditions and confirmatory information processing as the dependent variable; `\(F(2, 82) = 4.05\)`, `\(p = 0.02\)`, `\(\eta^2 = 0.09\)`. Considering the original effect size and an alpha of `\(0.05\)` the sample size needed to achieve `\(90\)`% power is `\(132\)` subjects.

.small[
Replication report of Fischer, Greitemeyer, and Frey (2008, JPSP, Study 2) by E.M. Galliani
]

---

# Translating statement into science

.box-inv-1.medium.sp-after-half[Q: How many observations should <br>I gather to reliably detect an effect?]

.box-inv-1.medium.sp-after-half[Q: How big is this effect?]

---

# Does it matter?

With a large enough sample size, **any** difference between treatments, however small, becomes statistically significant.

.box-inv-1.medium.sp-before-half[
Statistical significance `\(\neq\)` practical relevance
]

Whether such a difference matters depends on the scientific question.

---

# Example

- What is the minimum difference between two treatments that would be large enough to justify commercializing a drug?
- There is a tradeoff between the efficacy of the new treatment relative to the status quo, the cost of the drug, etc.

---

# Using statistics to measure effects

Test statistics and `\(p\)`-values are not good summaries of the magnitude of an effect:

- the larger the sample size, the bigger the statistic and the smaller the `\(p\)`-value.

Instead, use

.pull-left[
.box-inv-1[standardized differences]
]
.pull-right[
.box-inv-1[percentage of variability explained]
]

Estimators popularized in the handbook

> Cohen, Jacob. Statistical Power Analysis for the Behavioral Sciences, 2nd ed., Routledge, 1988.

---

# Illustrating effect size (differences)

<img src="09-slides_files/figure-html/effectsize-1.png" width="90%" style="display: block; margin: auto;" />

.tiny[
The plot shows the null (thick) and true (dashed) sampling distributions for the same difference in sample means, with small (left) and large (right) samples.
]

---

# Estimands, estimators, estimates

- `\(\mu_i\)` is the (unknown) population mean of group `\(i\)` (a parameter, or estimand).
- `\(\widehat{\mu}_i\)` is a formula (an estimator) that takes data as input and returns a numerical value (an estimate).
- Throughout, we use hats to denote estimated quantities.

.pull-left-3[
<img src="img/08/estimand.jpg" width="90%" style="display: block; margin: auto;" />
]
.pull-middle-3[
<img src="img/08/estimator.jpg" width="90%" style="display: block; margin: auto;" />
]
.pull-right-3[
<img src="img/08/estimate.jpg" width="90%" style="display: block; margin: auto;" />
]

.tiny[
Left to right: parameter `\(\mu\)` (the target), estimator `\(\widehat{\mu}\)` (the recipe), and estimate `\(\widehat{\mu}=10\)` (a numerical value, the proxy)
]

???

From Twitter, @simongrund89
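
---

# Any difference becomes significant

A quick simulation to back up the earlier claim (the 0.2 standard deviation shift, the sample sizes, and the seed are arbitrary choices for illustration): the same difference in means yields ever smaller `\(p\)`-values as the sample size grows.

``` r
set.seed(80667) # arbitrary seed, for reproducibility
for (n in c(20, 200, 2000)) {
  x <- rnorm(n)             # control group
  y <- rnorm(n, mean = 0.2) # same mean shift at every sample size
  cat("n =", n, "p-value =", format.pval(t.test(x, y)$p.value), "\n")
}
```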
---

# Cohen's _d_

A standardized, dimensionless (unit-free) measure of effect.

Assuming equal variance `\(\sigma^2\)`, compare the means of two groups `\(i\)` and `\(j\)`:

$$ d = \frac{\mu_i - \mu_j}{\sigma} $$

- The usual estimator of Cohen's `\(d\)`, `\(\widehat{d}\)`, uses the sample averages of the groups and the pooled variance estimator `\(\widehat{\sigma}\)`.

Cohen's classification: small (_d=0.2_), medium (_d=0.5_) or large (_d=0.8_) effect size.

???

Note: this is not the `\(t\)`-statistic (the denominator is the estimated standard deviation, not the standard error of the mean).

Note that there are multiple versions of Cohen's coefficients. These are the effects of the pwr package. The small/medium/large effect size varies depending on the test! See the vignette of pwr for defaults.

---

# Cohen's _f_

For a one-way ANOVA (equal variance `\(\sigma^2\)`) with more than two groups,

$$ f^2 = \frac{1}{\sigma^2} \sum_{j=1}^k \frac{n_j}{n}(\mu_j - \mu)^2, $$

a weighted sum of squared differences relative to the overall mean `\(\mu\)`.

For `\(k=2\)` groups, Cohen's `\(f\)` and Cohen's `\(d\)` are related via `\(f=d/2\)`.

---

# Effect size: proportion of variance

If there is a single experimental factor, use the **total** effect size.

Break down the variability as

`$$\sigma^2_{\text{total}} = \sigma^2_{\text{resid}} + \sigma^2_{\text{effect}}$$`

and define the proportion of variability explained by the `\(\text{effect}\)`,

`$$\eta^2 = \frac{\text{explained variability}}{\text{total variability}}= \frac{\sigma^2_{\text{effect}}}{\sigma^2_{\text{total}}}.$$`

---

# Coefficient of determination estimator

For the balanced one-way between-subject ANOVA, the typical estimator is the **coefficient of determination**

$$ \widehat{R}{}^2 = \frac{F\nu_1}{F\nu_1 + \nu_2} $$

where `\(\nu_1 = K-1\)` and `\(\nu_2 = n-K\)` are the degrees of freedom of the one-way ANOVA with `\(n\)` observations and `\(K\)` groups.

- `\(\widehat{R}{}^2\)` is an upward-biased estimator (too large on average).
- People frequently write `\(\eta^2\)` when they mean `\(\widehat{R}{}^2\)`.
- For the replication, `\(\widehat{R}{}^2 = (4.05\times 2)/(4.05\times 2 + 82) = 0.09\)`.

---

# `\(\omega^2\)` estimator

Another estimator of `\(\eta^2\)`, recommended in Keppel & Wickens (2004) for power calculations, is `\(\widehat{\omega}^2\)`.

For the one-way between-subject ANOVA, it is obtained from the `\(F\)`-statistic as

`$$\widehat{\omega}^2 = \frac{\nu_1 (F-1)}{\nu_1(F-1)+n}.$$`

- For the replication, with `\(n = \nu_1 + \nu_2 + 1 = 85\)` subjects, `\(\widehat{\omega}^2 = (3.05\times 2)/(3.05\times 2 + 85) = 0.067.\)`
- If the value returned is negative, report zero.

???

Since the `\(F\)` statistic is approximately 1 on average under the null, this measure removes the average.

---

# Converting `\(\eta^2\)` to Cohen's `\(f\)`

Software usually takes Cohen's `\(f\)` (or `\(f^2\)`) as input for the effect size.

Convert from `\(\eta^2\)` to `\(f^2\)` via the relationship

`$$f^2=\frac{\eta^2}{1-\eta^2}.$$`

If we plug in estimated values,

- with `\(\widehat{R}{}^2\)`, we get `\(\widehat{f} = 0.314\)`;
- with `\(\widehat{\omega}^2\)`, we get `\(\widetilde{f} = 0.27\)`.
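
---

# Numerical example in **R**

A minimal sketch (plain arithmetic, implementing the formulas from the preceding slides) recovering the effect size estimates for the replication example, `\(F(2, 82) = 4.05\)`:

``` r
Fstat <- 4.05
nu1 <- 2           # K - 1, with K = 3 groups
nu2 <- 82          # n - K
n <- nu1 + nu2 + 1 # overall sample size, 85
# Coefficient of determination (upward biased): 0.09
R2 <- (Fstat * nu1) / (Fstat * nu1 + nu2)
# Omega-squared, truncated below at zero: roughly 0.067
omega2 <- max(0, nu1 * (Fstat - 1) / (nu1 * (Fstat - 1) + n))
# Convert both to Cohen's f: 0.314 and 0.27
c(f_hat = sqrt(R2 / (1 - R2)), f_tilde = sqrt(omega2 / (1 - omega2)))
```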
---

# Effect sizes for multiway ANOVA

For a completely randomized design with only experimental factors, use the **partial** effect size

`$$\eta^2_{\langle \text{effect} \rangle} = \sigma^2_{\text{effect}} / (\sigma^2_{\text{effect}} + \sigma^2_{\text{resid}}).$$`

.small[
In **R**, use `effectsize::omega_squared(model, partial = TRUE)`.
]

---

# Partial effects and variance decomposition

Consider a completely randomized balanced design with two factors `\(A\)`, `\(B\)` and their interaction `\(AB\)`. In a balanced design, we can decompose the total variance as

`$$\sigma^2_{\text{total}} = \sigma^2_A + \sigma^2_B + \sigma^2_{AB} + \sigma^2_{\text{resid}}.$$`

Cohen's partial `\(f^2\)` measures the variability explained by a main effect or an interaction relative to the residual variability, e.g.,

`$$f^2_{\langle A \rangle}= \frac{\sigma^2_A}{\sigma^2_{\text{resid}}}, \qquad f^2_{\langle AB \rangle} = \frac{\sigma^2_{AB}}{\sigma^2_{\text{resid}}}.$$`

???

These variance quantities are **unknown**, so they need to be estimated somehow.

---

# Partial effect size (variance)

Effect sizes are often reported in terms of variability via the ratio

`$$\eta^2_{\langle \text{effect} \rangle} = \frac{\sigma^2_{\text{effect}}}{\sigma^2_{\text{effect}} + \sigma^2_{\text{resid}}}.$$`

- Both `\(\widehat{\eta}^2_{\langle \text{effect} \rangle}\)` (aka `\(\widehat{R}^2_{\langle \text{effect} \rangle}\)`) and `\(\widehat{\omega}^2_{\langle \text{effect} \rangle}\)` are **estimators** of this quantity, obtained from the `\(F\)` statistic and the degrees of freedom of the effect.

???

`\(\widehat{\omega}^2_{\langle \text{effect} \rangle}\)` is presumed less biased than `\(\widehat{\eta}^2_{\langle \text{effect} \rangle}\)`, as is `\(\widehat{\epsilon}_{\langle \text{effect} \rangle}\)`.

---

# Estimation of partial `\(\omega^2\)`

The formula is similar to the one-way case for between-subject experiments, with

`$$\widehat{\omega}^2_{\langle \text{effect} \rangle} = \frac{\text{df}_{\text{effect}}(F_{\text{effect}}-1)}{\text{df}_{\text{effect}}(F_{\text{effect}}-1) + n},$$`

where `\(n\)` is the overall sample size.

In **R**, `effectsize::omega_squared` reports these estimates with one-sided confidence intervals.

.small[Reference for confidence intervals: Steiger (2004), Psychological Methods]

???

The confidence intervals are based on the F distribution, by changing the non-centrality parameter and inverting the distribution function (pivot method). There is a one-to-one correspondence with Cohen's f, and a bijection between the latter and omega_sq_partial or eta_sq_partial. This yields asymmetric intervals.

---

# Converting `\(\omega^2\)` to Cohen's `\(f\)`

Given an estimate of `\(\eta^2_{\langle \text{effect} \rangle}\)`, convert it into an estimate of Cohen's partial `\(f^2_{\langle \text{effect} \rangle}\)`, e.g.,

`$$\widehat{f}^2_{\langle \text{effect} \rangle} = \frac{\widehat{\omega}^2_{\langle \text{effect} \rangle}}{1-\widehat{\omega}^2_{\langle \text{effect} \rangle}}.$$`

The function `effectsize::cohens_f` returns `\(\widetilde{f}^2 = n^{-1}F_{\text{effect}}\text{df}_{\text{effect}}\)`, a transformation of `\(\widehat{\eta}^2_{\langle \text{effect}\rangle}\)`.
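
---

# Example: partial effect sizes in **R**

A sketch of the workflow on a fitted model; the data frame `dat`, its response `y`, and the factors `A` and `B` are placeholders, not course data.

``` r
# Hypothetical balanced two-way design stored in 'dat'
model <- aov(y ~ A * B, data = dat)
# Partial omega-squared for each effect, with one-sided confidence intervals
effectsize::omega_squared(model, partial = TRUE)
# Cohen's (partial) f, the scale that power software expects
effectsize::cohens_f(model)
```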
---

# Summary

- Effect sizes can be recovered using information found in the ANOVA table.
- There are multiple estimators of the same quantity:
    - report the one used, along with confidence or tolerance intervals;
    - some estimators are preferred (less biased): this matters for power studies.
- The correct measure may depend on the design:
    - partial vs total effects;
    - different formulas for within-subject (repeated measures) designs!

---
layout: false
name: power
class: center middle section-title section-title-3 animated fadeIn

# Power

---
layout: true
class: title title-3

---

# Power and sample size calculations

Journals and grant agencies often require an estimate of the sample size needed for a study:

- large enough to pick up effects of scientific interest (good signal-to-noise ratio);
- an efficient allocation of resources (don't waste time or money).

The same question arises for replication studies: how many participants are needed?

---

# I cried power!

.medium[

- **Power** is the ability to detect when the null is false, for a given alternative.
- It is the *probability* of correctly rejecting the null hypothesis under an alternative.
- The larger the power, the better.

]

---

# Living in an alternative world

How does the _F_-test behave under an alternative?

<img src="09-slides_files/figure-html/unnamed-chunk-1-1.png" width="80%" style="display: block; margin: auto;" />

---

# Thinking about power

What do you think is the effect on **power** of an increase in

- the group sample sizes `\(n_1, \ldots, n_K\)`?
- the variability `\(\sigma^2\)`?
- the true mean differences `\(\mu_j - \mu\)`?

---

# What happens under the alternative?

The peak of the distribution shifts to the right. Why? On average, the numerator of the `\(F\)`-statistic is

`$$\begin{align*}
\mathsf{E}(\text{between-group variability}) = \sigma^2+ \frac{\sum_{j=1}^K n_j(\mu_j - \mu)^2}{K-1}.
\end{align*}$$`

Under the null hypothesis, `\(\mu_j=\mu\)` for `\(j=1, \ldots, K\)`, so the rightmost term is 0.

---

# Noncentrality parameter and power

The alternative distribution is the `\(F(\nu_1, \nu_2, \Delta)\)` distribution with degrees of freedom `\(\nu_1\)` and `\(\nu_2\)` and noncentrality parameter

`$$\begin{align*}
\Delta = \dfrac{\sum_{j=1}^K n_j(\mu_j - \mu)^2}{\sigma^2}.
\end{align*}$$`

---

# I cried power!

The null hypothesis corresponds to a single value (equality of means), whereas there are infinitely many alternatives...

.pull-left[

<img src="09-slides_files/figure-html/powercurve1-1.png" width="80%" style="display: block; margin: auto;" />

.small.center[Power is the ability to detect when the null is false, for a given alternative (dashed).]

]
.pull-right[

<img src="09-slides_files/figure-html/powercurve2-1.png" width="80%" style="display: block; margin: auto;" />

.small.center[
Power is the area in white under the dashed curve, beyond the cutoff.
]

]

???

In which of the two figures is power the largest?

---

# What determines power?

Think of the factors that could impact power in a factorial design.

--

1. The size of the effects, `\(\delta_1 = \mu_1-\mu\)`, `\(\ldots\)`, `\(\delta_K = \mu_K-\mu\)`
2. The background noise (intrinsic variability, `\(\sigma^2\)`)
3. The level of the test, `\(\alpha\)`
4. The sample size in each group, `\(n_j\)`
5. The choice of experimental design
6. The choice of test statistic

--

We focus on the interplay between

.box-3.wide[
`\(\quad\)` effect size `\(\quad\)` | `\(\quad\)` power `\(\quad\)` | `\(\quad\)` sample size `\(\quad\)`
]

???

The level is fixed, but we may consider multiplicity corrections within the power function. The noise level is often intrinsic to the measurement.

---

# Living in an alternative world

In a one-way ANOVA, the alternative distribution of the `\(F\)` test has an additional parameter `\(\Delta\)`, which depends on both the sample and the effect sizes:

$$ \Delta = \dfrac{\sum_{j=1}^K n_j(\mu_j - \mu)^2}{\sigma^2} = nf^2. $$

Under the null hypothesis, `\(\mu_j=\mu\)` for `\(j=1, \ldots, K\)` and `\(\Delta=0\)`.

The greater `\(\Delta\)`, the further the mode (the peak of the distribution) is from unity.

---

# Noncentrality parameter and power

$$ \Delta = \dfrac{\sum_{j=1}^K n_j(\mu_j - \mu)^2}{\sigma^2}. $$

.box-inv-3.medium[When does power increase?]

What is the effect of an increase in

- the group sample sizes `\(n_1, \ldots, n_K\)`?
- the variability `\(\sigma^2\)`?
- the true mean differences `\(\mu_j - \mu\)`?
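
---

# Computing power from `\(\Delta\)`

A minimal sketch in base **R** of the calculation just described: power is the probability that a noncentral `\(F(\nu_1, \nu_2, \Delta)\)` variate exceeds the null rejection cutoff (the group sizes and effect size in the example call are arbitrary illustrations).

``` r
# Power of the one-way ANOVA F-test with noncentrality Delta = n * f^2
power_F <- function(Delta, nu1, nu2, alpha = 0.05) {
  cutoff <- qf(1 - alpha, df1 = nu1, df2 = nu2) # cutoff under the null
  pf(cutoff, df1 = nu1, df2 = nu2, ncp = Delta, lower.tail = FALSE)
}
# e.g., K = 3 groups of 20 observations each, with Cohen's f = 0.314
power_F(Delta = 60 * 0.314^2, nu1 = 3 - 1, nu2 = 60 - 3)
```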
---

# Noncentrality parameter

The alternative distribution is the `\(F(\nu_1, \nu_2, \Delta)\)` distribution with degrees of freedom `\(\nu_1\)` and `\(\nu_2\)` and noncentrality parameter `\(\Delta\)`.

<img src="09-slides_files/figure-html/power_curve-1.png" width="80%" style="display: block; margin: auto;" />

???

For other tests, parameters vary but the story is the same. The plot shows the null and alternative distributions. The noncentral F is shifted to the right (mode = peak) and right-skewed. The power is shaded in blue, the null distribution is shown in dashed lines.

---

# Power for factorial experiments

- `\(\mathrm{G}^{\star}\mathrm{Power}\)` and **R** packages take Cohen's `\(f\)` (or `\(f^2\)`) as input.
- Calculations are based on the `\(F\)` distribution with
    - `\(\nu_1=\text{df}_{\text{effect}}\)` degrees of freedom,
    - `\(\nu_2 = n - n_g\)`, where `\(n_g\)` is the number of mean parameters estimated,
    - noncentrality parameter `\(\Delta = nf^2_{\langle \text{effect}\rangle}\)`.

---

# Example

Consider a completely randomized design with two crossed factors `\(A\)` and `\(B\)`. We are interested in the interaction, `\(\eta^2_{\langle AB \rangle}\)`, and we want 80% power:

``` r
# Estimate Cohen's f from the partial omega-squared estimate
fhat <- sqrt(omega.sq.part / (1 - omega.sq.part))
# na and nb are the number of levels of factors A and B
WebPower::wp.kanova(power = 0.8,
                    f = fhat,
                    ndf = (na - 1) * (nb - 1), # df of the interaction
                    ng = na * nb)              # number of groups (cells)
```

---

# Power curves

.pull-left[

``` r
library(pwr)
power_curve <- pwr.anova.test(
  f = 0.314, # from R-squared
  k = 3,
  power = 0.9,
  sig.level = 0.05)
plot(power_curve)
```

.tiny[

Recall: convert `\(\eta^2\)` to Cohen's `\(f\)` (the effect size reported in `pwr`) via `\(f^2=\eta^2/(1-\eta^2)\)`.

Using `\(\widetilde{f}\)` instead (from `\(\widehat{\omega}^2\)`) yields `\(n=59\)` observations per group!

]

]
.pull-right[

<img src="09-slides_files/figure-html/powercurvefig-1.png" width="504" style="display: block; margin: auto;" />

]

---

# Effect size estimates

.box-inv-3.large[WARNING!]

Most effects reported in the literature are severely inflated.

.box-3[Publication bias & the file-drawer problem]

- Estimates reported in meta-analyses, etc. are not reliable.
- Run a pilot study, or provide educated guesses.
- Estimated effect sizes are uncertain (report confidence intervals).

???

Recall the file-drawer problem: most studies with small effects lead to *non-significant results* and are not published. So the reported effects are larger than expected.

---

# Beware of small samples

Better to run one large replication than multiple small studies. Otherwise, you risk being in this situation:

<img src="img/08/you-have-no-power-here.jpg" width="50%" style="display: block; margin: auto;" />

---

# Observed (post-hoc) power

Sometimes, the estimated values of the effect size, etc., are used as plug-ins.

- The (estimated) effect sizes in studies are noisy!
- Post-hoc power estimates are also noisy and typically overoptimistic.
- Not recommended, but a useful pointer if the observed difference seems important (large), yet there isn't enough evidence (the signal-to-noise ratio is too low).

.box-inv-3[Statistical fallacy]

Rejecting the null hypothesis does not mean the alternative is true!

Power is a long-term frequency property: in a given experiment, we either reject or we don't.

???

Not recommended unless the observed differences among the means seem important in practice but are not statistically significant.
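
---

# Post-hoc power is noisy: a sketch

A small simulation backing up the warning above. The setup is illustrative (two groups of 30 observations, true `\(d = 0.5\)`, so the true power is just under 0.5): the "observed" power obtained by plugging the estimated effect size into the power function varies wildly across replications of the same experiment.

``` r
library(pwr)
set.seed(80667) # arbitrary seed, for reproducibility
obs_power <- replicate(1000, {
  x <- rnorm(30)             # control group
  y <- rnorm(30, mean = 0.5) # same true effect in every replication
  # plug-in estimate of Cohen's d, using the pooled standard deviation
  d_hat <- (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)
  pwr.t.test(n = 30, d = abs(d_hat), sig.level = 0.05)$power
})
quantile(obs_power, probs = c(0.1, 0.5, 0.9)) # wide spread around the truth
```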