# Effect size and power

## Code

```
library(effectsize)
data(BSJ92, package = "hecedsm")
mod <- aov(posttest2 - pretest2 ~ group,
           data = BSJ92)
omega_squared(anova(mod), partial = FALSE)
```

This example is devoted to the correct calculation and reporting of effect sizes across a variety of experimental designs, including multi-way analysis of variance with and without blocking factors. We make use of the comprehensive `effectsize` package throughout.^{1}

We also showcase how effect sizes can be recovered from the output of an analysis of variance table or from the reported test statistics and degrees of freedom, thus highlighting the impact of accurately reporting the latter.

We finally consider calculation of power in various replication studies using `WebPower` and G*Power.

## Effect sizes

Effect sizes typically serve three purposes:

- inform readers of the magnitude of the effect,
- provide a standardized quantity that can be combined with others in a meta-analysis, or
- serve as a proxy in a power study to estimate the minimum number of observations needed.

If you report the exact value of the test statistic, the null distribution and (in short) all elements of an analysis of variance table in a complex design, suitable formulae make it possible to recover effect sizes, since they are functions of the test statistic, the degrees of freedom and, in the case of repeated measures, the correlation between observations.
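As a sketch of such a recovery, the textbook conversion formulas for a one-way design can be applied by hand; the reported values below are made up for illustration, not taken from any of the papers discussed here:

```r
# Hypothetical reported values for a one-way F-test
F_stat <- 4.5   # F(df1, df2) statistic
df1 <- 2        # numerator degrees of freedom
df2 <- 57       # denominator (residual) degrees of freedom
N <- 60         # total sample size

# eta-squared: proportion of variance explained by the factor
eta_sq <- df1 * F_stat / (df1 * F_stat + df2)
# omega-squared: the less biased counterpart
omega_sq <- df1 * (F_stat - 1) / (df1 * (F_stat - 1) + N)
round(c(eta_sq = eta_sq, omega_sq = omega_sq), 3)
```

This is exactly why reporting the statistic and both degrees of freedom (rather than only a *p*-value) matters: without them, the conversion is impossible.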

The `effectsize` package includes a variety of estimators for standardized differences or ratios of variance. For example, for the latter, we can retrieve Cohen's \(f\) via `cohens_f`, \(\widehat{\epsilon}^2\) via `epsilon_squared` or \(\widehat{\omega}^2\) via `omega_squared`.

By default, in a design with more than one factor, the partial effects are returned (argument `partial = TRUE`) — if there is a single factor, these coincide with the total effects and the distinction is immaterial.

The `effectsize` package reports confidence intervals^{2} calculated using the pivot method described in Steiger (2004). Check the documentation at `?effectsize::effectsize_CIs` for more technical details.^{3}

In general, confidence intervals for effect sizes are very wide, including a large range of potential values and sometimes zero. This reflects the large uncertainty surrounding their estimation and should not be taken to mean that the estimated effect is null.

## Example 1 - one-way ANOVA

We begin with the result of our Example on one-way ANOVA in Baumann et al. (1992). If we consider the global \(F\)-test of equality in means, we can report as corresponding effect size the percentage of variance that is explained by the experimental condition, `group`.

**Effect Size for ANOVA (Type I)**

| Parameter | Omega2 | 95% CI |
|-----------|--------|--------------|
| group     | 0.16   | [0.03, 1.00] |

One-sided CIs: upper bound fixed at [1.00].

The estimator employed is \(\widehat{\omega}^2\); it could be obtained directly using the formula provided in the slides. For a proportion of variance, the number is medium according to Cohen's (1988) definition. Using \(\widehat{R}^2 \equiv \widehat{\eta}^2\) as estimator instead would give an estimated proportion of variance of 0.188, a slightly higher number.

Having found a significant difference in mean between groups, one could be interested in computing estimated marginal means and contrasts based on the latter. The `emmeans` package has a method (`eff_size`) for computing effect sizes (Cohen's \(d\)) for pairwise differences if provided with the denominator standard deviation \(\sigma\) and the degrees of freedom associated with the latter (i.e., how many observations were left from the total sample size after subtracting the number of subgroup means).
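A minimal base-R sketch of this construction, using the built-in `PlantGrowth` data as a stand-in (the text itself works with the `BSJ92` data and `emmeans::eff_size`):

```r
# One-way ANOVA on a stand-in dataset
mod2 <- aov(weight ~ group, data = PlantGrowth)
# Group means and the pooled (residual) standard deviation
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
sd_pooled <- sigma(mod2)   # denominator of Cohen's d
# Cohen's d for one pairwise difference of subgroup means
d_trt <- unname((means["trt1"] - means["trt2"]) / sd_pooled)
```

The degrees of freedom passed to `eff_size` would here be `df.residual(mod2)`, i.e., the total sample size minus the number of subgroup means.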

The confidence intervals reported by `emmeans` for \(t\)-tests are symmetric and different in nature from the ones obtained previously.

Technical aside: while it is possible to create a \(t\)-statistic for a contrast by dividing the contrast estimator by its standard error, the construction of Cohen's \(d\) here for the contrast consisting of, e.g., the pairwise difference between `DRTA` and `TA` would take the form \[
d_{\text{DRTA}- \text{TA}} = \frac{\mu_{\text{DRTA}}- \mu_{\text{TA}}}{\sigma},
\] where the denominator stands for the standard deviation of the observations.^{4}

## Example 2: Sample size for replication studies

Armed with an effect size and a desired level of power, it is possible to determine the minimum number of observations needed to detect such an effect.

Johnson et al. (2014) perform a replication study of Schnall, Benton, and Harvey (2008), who conjectured that physical cleanliness reduces the severity of moral judgments.

The following excerpt from the paper explains how the sample size for the replication was calculated.

> In Experiment 2, the critical test of the cleanliness manipulation on ratings of morality was significant, \(F(1, 41) = 7.81\), \(p=0.01\), \(d=0.87\), \(N=44\). Assuming \(\alpha=0.05\), the achieved power in this experiment was \(0.80\). Our proposed research will attempt to replicate this experiment with a level of power = \(0.99\). This will require a minimum of 100 participants (assuming equal sized groups with \(d=0.87\)) so we will collect data from 115 participants to ensure a properly powered sample in case of errors.

The first step is to try and compute the effect size, here Cohen’s \(d\), from the reported \(F\) statistic to make sure it matches the quoted value.
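For a one-degree-of-freedom test, \(t = \sqrt{F}\) and \(d = 2t/\sqrt{\nu}\), where \(\nu\) is the residual degrees of freedom. A base-R check of this conversion (the `effectsize` package offers `F_to_d` for the same purpose):

```r
# Reported values from the excerpt
F_stat <- 7.81
df_error <- 41
# For a one-df test, t = sqrt(F); Cohen's d = 2 * t / sqrt(df_error)
d_hat <- 2 * sqrt(F_stat) / sqrt(df_error)
round(d_hat, 2)  # 0.87, matching the reported value
```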

This indeed coincides with the reported value of Cohen's \(d\). We can then plug this value into the power function with the desired power level of \(0.99\) to find a minimum of 50 participants per group, for a total of 100 if we do a pairwise comparison using a two-sample \(t\)-test.

```
Two-sample t-test
n d alpha power
49.53039 0.87 0.05 0.99
NOTE: n is number in *each* group
URL: http://psychstat.org/ttest
```
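The per-group sample size above can be cross-checked in base R with `power.t.test`, which solves the same noncentral-\(t\) power equation:

```r
# Solve for n per group given d = 0.87, alpha = 0.05, power = 0.99
pw <- power.t.test(delta = 0.87, sd = 1, sig.level = 0.05,
                   power = 0.99, type = "two.sample")
ceiling(pw$n)  # about 50 participants per group
```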

The `effectsize` package includes many functions to convert \(F\) and \(t\) statistics to effect sizes.^{5}

## Example 3: two-way ANOVA with unbalanced data

While software can easily compute effect sizes, the user should not blindly rely on the output, but rather think about various elements using the following guiding principles:

- we are interested in partial effects when there are multiple factors;
- the denominator should consist of the variance of the effect of interest (say factor \(A\)), the variance of blocking factors and random effects, and that of all interactions associated with them.

Consider next the unbalanced two-way ANOVA example with Study 1 of Maglio & Polman (2014).

We pass here directly the output of the model. We use the `lm` function with the mean-to-zero parametrization, since we have unbalanced data.
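A sketch of this setup on simulated unbalanced data (the text itself fits `distance ~ station * direction` to the `MP14_S1` data from the `hecedsm` package):

```r
set.seed(2014)
# Simulated unbalanced two-way layout mimicking the structure of MP14_S1
sim <- data.frame(
  station = factor(sample(1:4, 100, replace = TRUE)),
  direction = factor(sample(c("east", "west"), 100, replace = TRUE)),
  distance = rnorm(100)
)
# Mean-to-zero (sum-to-zero) parametrization, needed for type III tests
options(contrasts = c("contr.sum", "contr.poly"))
model <- lm(distance ~ station * direction, data = sim)
# car::Anova(model, type = 3) would then produce the type III ANOVA table
```

Note that `options(contrasts = ...)` changes the parametrization globally, so it must be set before fitting the model.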

By default, the variance terms for each factor and interaction are estimated using the `anova` call. When the data aren't balanced and there are multiple factors in the mean equation, these are the sequential sum of squares estimates (type I). This means that the resulting effect size would depend on the order in which the terms are specified, an unappealing feature. The function can alternatively take as argument the analysis of variance table produced by the `Anova` function in package `car`, e.g., `car::Anova(..., type = 3)`. Note that it is of paramount importance to pass the correct arguments and to use the mean-to-zero parametrization in order to get sensible results. The package warns users about this.

**Effect Size for ANOVA (Type III)**

| Parameter         | Omega2 (partial) | 95% CI |
|-------------------|------------------|--------------|
| station           | 0.26             | [0.17, 1.00] |
| direction         | -3.39e-03        | [0.00, 1.00] |
| station:direction | 0.18             | [0.10, 1.00] |

One-sided CIs: upper bound fixed at [1.00].

The estimated effect size \(\widehat{\omega}^2_{\langle \text{direction}\rangle}\) for the main effect of `direction` is negative: one can either report the negative value or truncate it to zero. This reflects the fact that the estimated effect is far from significant.

Equipped with the estimated effect size, we can now transform our partial \(\widehat{\omega}^2_{\langle\text{AB}\rangle}\) measure into an estimated Cohen's \(f\) via \[\widetilde{f} = \left( \frac{\widehat{\omega}^2}{1-\widehat{\omega}^2}\right)^{1/2},\] which is then fed into the `WebPower` package functionality to compute the *post-hoc* power. Since we are dealing with a two-way ANOVA with 8 subgroups, we set `ng = 8` and choose `ndf` corresponding to the degrees of freedom of the estimated interaction (here \((n_a-1)\times (n_b-1)=3\), the number of coefficients needed to capture the interaction).
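Plugging the (rounded) table value \(\widehat{\omega}^2_{\langle\text{AB}\rangle} \approx 0.18\) into this formula gives, in base R:

```r
omega_sq <- 0.18  # rounded partial omega-squared from the table
cohens_f <- sqrt(omega_sq / (1 - omega_sq))
round(cohens_f, 2)  # about 0.47
```

The unrounded estimate used in the power calculations below is slightly larger (0.4764), since it is computed from the unrounded \(\widehat{\omega}^2\).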

Given all but one of the following quantities

- the power,
- the number of groups and degrees of freedom from the design,
- the effect size and
- the sample size,

it is possible to deduce the last one assuming a balanced sample. Below, we use the information to compute the so-called *post-hoc* power. Such terminology is misleading because there is no guarantee that we are under the alternative, and effect sizes are really noisy proxies, so the range of potential values for the missing ingredient is oftentimes quite large. Because studies in the literature have inflated effect sizes, the power measures are more often than not misleading.

```
Multiple way ANOVA analysis
n ndf ddf f ng alpha power
202 3 194 0.4764222 8 0.05 0.9999821
NOTE: Sample size is the total sample size
URL: http://psychstat.org/kanova
```
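The power reported above can be reproduced in base R from the noncentral \(F\) distribution, assuming the `wp.kanova` conventions (denominator degrees of freedom \(n - n_g\), noncentrality parameter \(nf^2\)):

```r
n <- 202            # total sample size
ndf <- 3            # numerator df (interaction)
ng <- 8             # number of subgroups
f <- 0.4764222      # Cohen's f estimate
alpha <- 0.05
ddf <- n - ng                      # denominator degrees of freedom
ncp <- n * f^2                     # noncentrality parameter
crit <- qf(1 - alpha, ndf, ddf)    # critical value under the null
power <- 1 - pf(crit, ndf, ddf, ncp = ncp)
round(power, 5)  # essentially 1
```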

Here, the interaction is unusually strong (a fifth of the variance is explained by it!) and we have an extremely large *post-hoc* power estimate. This is rather unsurprising given the way the experiment was set up.

We can use the same function to determine how many observations the study would minimally need to achieve a given power, below 99% — the number reported must be rounded up to the nearest integer. Depending on the design or function, this number may be the overall sample size or the sample size per group.

```
Multiple way ANOVA analysis
n ndf ddf f ng alpha power
107.8 3 99.80004 0.4764222 8 0.05 0.99
NOTE: Sample size is the total sample size
URL: http://psychstat.org/kanova
```

```
Multiple way ANOVA analysis
n ndf ddf f ng alpha power
97.61394 3 89.61394 0.5017988 8 0.05 0.99
NOTE: Sample size is the total sample size
URL: http://psychstat.org/kanova
```

The total sample size using \(\widehat{\omega}^2\) is 108, whereas using the biased estimator \(\widehat{f}\) directly (itself obtained from \(\widehat{\eta}^2\)) gives 98: this difference of 10 individuals can have practical implications.
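Both totals can be recovered in base R by solving the noncentral-\(F\) power equation for \(n\) with `uniroot` (a sketch under the same `wp.kanova` conventions as above):

```r
# Power of the interaction F-test as a function of the total sample size
power_kanova <- function(n, f, ndf = 3, ng = 8, alpha = 0.05) {
  ddf <- n - ng
  1 - pf(qf(1 - alpha, ndf, ddf), ndf, ddf, ncp = n * f^2)
}
# Total n for 99% power with each Cohen's f estimate
n_omega <- uniroot(function(n) power_kanova(n, f = 0.4764222) - 0.99,
                   interval = c(10, 1000))$root
n_eta <- uniroot(function(n) power_kanova(n, f = 0.5017988) - 0.99,
                 interval = c(10, 1000))$root
ceiling(c(n_omega, n_eta))  # 108 and 98
```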

Baumann, J. F., Seifert-Kessell, N., & Jones, L. A. (1992). Effect of think-aloud instruction on elementary students’ comprehension monitoring abilities. *Journal of Reading Behavior*, *24*(2), 143–172. https://doi.org/10.1080/10862969209547770

Cohen, J. (1988). *Statistical power analysis for the behavioral sciences* (2nd ed.). Routledge. https://doi.org/10.4324/9780203771587

Johnson, D. J., Cheung, F., & Donnellan, M. B. (2014). Does cleanliness influence moral judgments? *Social Psychology*, *45*(3), 209–215. https://doi.org/10.1027/1864-9335/a000186

Maglio, S. J., & Polman, E. (2014). Spatial orientation shrinks and expands psychological distance. *Psychological Science*, *25*(7), 1345–1352. https://doi.org/10.1177/0956797614530571

Steiger, J. H. (2004). Beyond the \(F\) test: Effect size confidence intervals and tests of close fit in the analysis of variance and contrast analysis. *Psychological Methods*, *9*, 164–182. https://doi.org/10.1037/1082-989X.9.2.164

There are many other alternatives.↩︎

Really, these are fiducial intervals based on confidence distributions.↩︎

Note that, when the test statistic representing the proportion of variance explained is strictly positive, like an \(F\) or \(\chi^2\) statistic, the corresponding effect size is an estimated percentage of variance returned by, e.g., \(\widehat{\omega}^2\). To ensure consistency, the confidence intervals are one-sided, giving a lower bound (for the **minimum** effect size compatible with the data), while the upper bound is set to the maximum value, e.g., 1 for a proportion.↩︎

It isn't always obvious, when marginalizing out a one-way ANOVA from a complex design or when we have random effects or blocking factors, what the estimated standard deviation should be, so it is left to the user to specify the correct quantity.↩︎

As the example of Baumann et al. (1992) showed, however, not all statistics can be meaningfully converted to effect size.↩︎
