Effect size and power

Notebook

This example is devoted to correct calculation and reporting of effect sizes across a variety of experimental designs, including multi-way analysis of variance with and without blocking factors. We make use of the comprehensive effectsize package throughout.1

We also showcase how effect sizes can be recovered from the output of an analysis of variance table or from the reported test statistics and degrees of freedom, thus highlighting the impact of accurately reporting the latter.

We finally consider calculation of power in various replication studies using WebPower and G*Power.

Effect sizes

Effect size typically serve three purpose:

  1. inform readers of the magnitude of the effect,
  2. provide a standardized quantity that can be combined with others in a meta-analysis, or
  3. serve as proxy in a power study to estimate the minimum number of observations needed.

If you report the exact value of the test statistic, the null distribution and (in short) all elements of an analysis of variance table in a complex design, it is possible by using suitable formulae to recover effect sizes, as they are functions of the test statistic summaries, degrees of freedom and correlation between observations (in the case of repeated measures).

The effectsize package includes a variety of estimators for standardized difference or ratio of variance. For example, for the latter, we can retrieve Cohen’s \(f\) via cohens_f, \(\widehat{\epsilon}^2\) via epsilon_squared or \(\widehat{\omega}^2\) via omega_squared.

By default, in a design with more than one factor, the partial effects are returned (argument partial = TRUE) — if there is a single factor, these coincide with the total effects and the distinction is immaterial.

The effectsize package reports confidence intervals2 calculated using the pivot method described in Steiger (2004). Check the documentation at ?effectsize::effectsize_CIs for more technical details.3

In general, confidence intervals for effect sizes are very wide, including a large range of potential values and sometimes zero. This reflects the large uncertainty surrounding their estimation and should not be taken to mean that the estimated effect is null.

Example 1 - one-way ANOVA

We begin with the result of our Example on one-way ANOVA in Baumann et al. (1992). If we consider the global \(F\)-test of equality in means, we can report as corresponding effect size the percentage of variance that is explained by the experimental condition, group.

Code
library(effectsize)
data(BSJ92, package = "hecedsm")
mod <- aov(posttest2 - pretest2 ~ group,
           data = BSJ92)
omega_squared(anova(mod), partial = FALSE)
Effect Size for ANOVA (Type I)
Parameter Omega2 95% CI
group 0.16 [0.03, 1.00]
One-sided CIs: upper bound fixed at [1.00].

The estimator employed is \(\widehat{\omega}^2\) and could be obtained directly using the formula provided in the slides. For a proportion of variance, the number is medium according to Cohen (1988) definition. Using \(\widehat{R}^2 \equiv \widehat{\eta}^2\) as estimator instead would give an estimated proportion of variance of 0.188, a slightly higher number.

Having found a significant difference in mean between groups, one could be interested in computing estimated marginal means and contrasts based on the latter. The emmeans function has a method for computing effect size (Cohen’s \(d\)) for pairwise differences if provided with the denominator standard deviation \(\sigma\) and the degrees of freedom associated with the latter (i.e., how many observations were left from the total sample size after subtracting the number of subgroup means).

Code
library(emmeans)
pair_diff <- emmeans(
  mod, spec = "group") |> 
  eff_size(sigma = sigma(mod), 
           edf = df.residual(mod))

The confidence intervals reported by emmeans for \(t\)-tests are symmetric and different in nature from the one obtained previously.

Technical aside: while it is possible to create a \(t\)-statistic for a constrast by dividing the contrast estimator by it’s standard error, the construction of Cohen’s \(d\) here for the contrast consisting of, e.g., the pairwise difference between DRTA and TA would take the form \[ d_{\text{DRTA}- \text{TA}} = \frac{\mu_{\text{DRTA}}- \mu_{\text{TA}}}{\sigma}, \] where the denominator stands for the standard deviation of the observations.4

Example 2: Sample size for replication studies

Armed with effect sizes and a desired level of power, it is possible to determine the minimum number of observations that would yield such effect.

Johnson et al. (2014) performs a replication study of Schnall, Benton, and Harvey (2008) who conjectured that physical cleanliness reduces the severity of moral judgments.

The following excerpt from the paper explain how sample size for the replication were calculated.

In Experiment 2, the critical test of the cleanliness manipulation on ratings of morality was significant, \(F(1, 41) = 7.81\), \(p=0.01\), \(d=0.87\), \(N=44\). Assuming \(\alpha=0.05\), the achieved power in this experiment was \(0.80\). Our proposed research will attempt to replicate this experiment with a level of power = \(0.99\). This will require a minimum of 100 participants (assuming equal sized groups with \(d=0.87\)) so we will collect data from 115 participants to ensure a properly powered sample in case of errors.

The first step is to try and compute the effect size, here Cohen’s \(d\), from the reported \(F\) statistic to make sure it matches the quoted value.

Code
dhat <- effectsize::F_to_d(
  f = 7.81,
  df = 1, 
  df_error = 41)

This indeed coincides with the value reported for Cohen’s \(d\) estimator. We can then plug-in this value in the power function with the desired power level \(0.99\) to find out a minimal number of 50 participants in each group, for a total of 100 if we do a pairwise comparison using a two-sample \(t\)-test.

Code
WebPower::wp.t(
  d = 0.87,
  power = 0.99,
  type = "two.sample")
Two-sample t-test

           n    d alpha power
    49.53039 0.87  0.05  0.99

NOTE: n is number in *each* group
URL: http://psychstat.org/ttest

The effectsize package includes many functions to convert \(F\) and \(t\) statistics to effect sizes.5

Example 3: two-way ANOVA with unbalanced data

While software can easily compute effect sizes, the user should not blindly rely on the output, but rather think about various elements using the following guiding principles:

  • we are interested in partial effects when there are multiple factors
  • the denominator should consist of the variance of the effect of interest (say factor \(A\)), the variance of blocking factors and random effects and that of all interactions associated with them.

Consider next the unbalanced two-way ANOVA example with Study 1 of Maglio & Polman (2014).

We pass here directly the output of the model. We use the lm function with the mean-to-zero parametrization, since we have unbalanced data.

Code
data(MP14_S1, package = 'hecedsm')
# Force mean-to-zero parametrization
options(contrasts = c("contr.sum", "contr.poly"))
# Estimate two-way ANOVA with interaction
model <- lm(distance ~ station*direction, 
            data = MP14_S1)
# Test only the interaction
out <- car::Anova(model, type = 3)

By default, the variance terms for each factor and interaction are estimated using the anova call. When the data aren’t balanced and you have multiple factors in the mean equation, these are the sequential sum of square estimates (type I). This means that the resulting effect size would depend on the order in which you specify the terms, an unappealing feature. The model can alternatively take as argument the analysis of variance table produced by the Anova function in package car, e.g., car::Anova(..., type = 3). Note that it is of paramount importance to pass the correct arguments and to use the mean-to-zero parametrization in order to get sensible results. The package warns user about this.

Code
omsq <- omega_squared(out)
print_html(omsq)
Effect Size for ANOVA (Type III)
Parameter Omega2 (partial) 95% CI
station 0.26 [0.17, 1.00]
direction -3.39e-03 [0.00, 1.00]
station:direction 0.18 [0.10, 1.00]
One-sided CIs: upper bound fixed at [1.00].

The estimated effect size for the main effect of direction is negative with \(\widehat{\omega}^2_{\langle \text{direction}\rangle}\): either reporting a negative value or zero. This reflects that the estimated effect is very insignificant.

Equipped with the estimated effect size, we can now transform our partial \(\widehat{\omega}^2_{\langle\text{AB}\rangle}\) measure into an estimated Cohen’s \(f\) via \[\widetilde{f} = \left( \frac{\widehat{\omega}^2}{1-\widehat{\omega}^2}\right)^{1/2},\] which is then fed into WebPower package functionality to compute the post-hoc power. Since we are dealing with a two-way ANOVA with 8 subgroups, we set ng=8 and then ndf corresponding to the degrees of freedom of the estimated interaction (here \((n_a-1)\times (n_b-1)=3\), the number of coefficients needed to capture the interaction).

Given all but one of the following collection

  1. the power,
  2. the number of groups and degrees of freedom from the design,
  3. the effect size and
  4. the sample size,

it is possible to deduce the last one assuming a balanced sample. Below, we use the information to compute the so-called post-hoc power. Such terminology is misleading because there is no guarantee that we are under the alternative, and effect sizes are really noisy proxy so the range of potential values for the missing ingredient is oftentimes quite large. Because studies in the literature have inflated effect size, the power measures are more often than not misleading.

Code
# Power calculations
# Convert omega-squared to Cohen's f
cohensf1 <- effectsize::eta2_to_f(
  omsq$Omega2_partial[3])
WebPower::wp.kanova(
  n = 202, # sample size, assumed *balanced*
  ndf = 3, # degrees of freedom
  f = cohensf1, # Cohen's f estimate
  ng = 8) # number of subgroups of ANOVA
Multiple way ANOVA analysis

      n ndf ddf         f ng alpha     power
    202   3 194 0.4764222  8  0.05 0.9999821

NOTE: Sample size is the total sample size
URL: http://psychstat.org/kanova

Here, the interaction is unusually strong (a fifth of the variance is explained by it!) and we have an extremely large post-hoc power estimate. This is rather unsurprising given the way the experiment was set up.

We can use the same function to determine how many observations the study would need to minimally achieve a certain power, below of 99% — the number reported must be rounded up to the nearest integer. Depending on the design or function, this number may be the overall sample size or the sample size per group.

Code
# Sample size calculations
cohensf2 <- cohens_f(out)$Cohens_f_partial[3]
WebPower::wp.kanova(
  ndf = 3, 
  f = cohensf1, 
  ng = 8, 
  power = 0.99)
Multiple way ANOVA analysis

        n ndf      ddf         f ng alpha power
    107.8   3 99.80004 0.4764222  8  0.05  0.99

NOTE: Sample size is the total sample size
URL: http://psychstat.org/kanova
Code
WebPower::wp.kanova(
  ndf = 3, 
  f = cohensf2, 
  ng = 8,
  power = 0.99)
Multiple way ANOVA analysis

           n ndf      ddf         f ng alpha power
    97.61394   3 89.61394 0.5017988  8  0.05  0.99

NOTE: Sample size is the total sample size
URL: http://psychstat.org/kanova

The total sample size using \(\widehat{\omega}^2\) is 108, whereas using the biased estimator \(\widehat{f}\) directly (itself obtained from \(\widehat{\eta}^2\)) gives 98: this difference of 10 individuals can have practical implications.

References

Baumann, J. F., Seifert-Kessell, N., & Jones, L. A. (1992). Effect of think-aloud instruction on elementary students’ comprehension monitoring abilities. Journal of Reading Behavior, 24(2), 143–172. https://doi.org/10.1080/10862969209547770
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Routledge. https://doi.org/10.4324/9780203771587
Johnson, D. J., Cheung, F., & Donnellan, M. B. (2014). Does cleanliness influence moral judgments? Social Psychology, 45(3), 209–215. https://doi.org/10.1027/1864-9335/a000186
Maglio, S. J., & Polman, E. (2014). Spatial orientation shrinks and expands psychological distance. Psychological Science, 25(7), 1345–1352. https://doi.org/10.1177/0956797614530571
Steiger, J. H. (2004). Beyond the \(F\) test: Effect size confidence intervals and tests of close fit in the analysis of variance and contrast analysis. Psychological Methods, 9, 164–182. https://doi.org/10.1037/1082-989X.9.2.164

Footnotes

  1. There are many other alternatives.↩︎

  2. Really, these are fiducial intervals based on confidence distributions.↩︎

  3. Note that, when the test statistic representing the proportion of variance explained is strictly positive, like a \(F\) or \(\chi^2\) statistic, the corresponding effect size is an estimated percentage of variance returned by, e.g., \(\widehat{\omega}^2\). To ensure consistency, the confidence intervals are one-sided, giving a lower bound (for the minimum effect size compatible with the data), while the upper bound is set to the maximum value, e.g., 1 for a proportion.↩︎

  4. It isn’t always obvious when marginalizing out a one-way ANOVA from a complex design or when we have random effects or blocking factor what the estimated standard deviation should be, so it is left to the user to specify the correct quantity.↩︎

  5. As the example of Baumann et al. (1992) showed, however, not all statistics can be meaningfully converted to effect size.↩︎