Session 13
MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal
Count data
Nonparametric tests
Aggregating binary responses gives counts.
Duke and Amir (2023) investigated the impact on sales of presenting customers with an integrated versus a sequential selling format. The observed purchase counts are:

purchased | integrated | sequential
---|---|---
no | 100 | 133
yes | 66 | 26
Question: does the selling format increase sales?
Assume Y_ij ∼ Poisson(μ_ij), an integer-valued random variable. We compare two nested models: the main-effects model (independence between factors) and the saturated model (with interaction).
We can compare the fitted counts under the independence model with the observed counts.

Expected counts under independence:

purchased | integrated | sequential
---|---|---
no | 119.01 | 113.99
yes | 46.99 | 45.01

Observed counts:

purchased | integrated | sequential
---|---|---
no | 100 | 133
yes | 66 | 26
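Under independence, the expected count in cell (i, j) is the product of the row and column totals divided by the overall total. A minimal base-R sketch, rebuilding the table from the counts above, reproduces the fitted values:

```r
# Observed counts from Duke and Amir (2023): rows = purchased, cols = format
obs <- matrix(c(100, 66, 133, 26), nrow = 2,
              dimnames = list(purchased = c("no", "yes"),
                              format = c("integrated", "sequential")))
# Expected counts under independence: (row total x column total) / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 2)  # 119.01, 113.99 / 46.99, 45.01 -- matches the fitted table
```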
Consider an I×J contingency table.
Denote the observed counts in the (i,j)th cell Oij.
We compare these with expected counts under the null hypothesis, Eij's.
The test statistic is P = ∑_{i=1}^{I} ∑_{j=1}^{J} (O_{ij} − E_{ij})² / E_{ij}.
Yates's correction for 2×2 tables replaces the numerator by (|O_{ij} − E_{ij}| − 0.5)².
In large samples (if min_{i,j} E_{ij} > 5), the statistic P approximately follows a chi-square distribution with ν degrees of freedom, χ²_ν.
The degrees of freedom ν encode the difference in the number of parameters between the alternative and the null model.
For example, comparing the saturated model (with interaction) to the independence model, the degrees of freedom for a two-way table with I rows and J columns equal the number of interaction parameters, ν = (I−1)×(J−1).
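Using the counts from the Duke and Amir (2023) table, the Pearson statistic and its approximate p-value can also be computed by hand; a base-R sketch:

```r
# Observed counts: rows = purchased (no/yes), cols = format
obs <- matrix(c(100, 66, 133, 26), nrow = 2)
# Expected counts under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
# Pearson chi-squared statistic
P <- sum((obs - expected)^2 / expected)
# Approximate p-value, (I-1) x (J-1) = 1 degree of freedom
pval <- pchisq(P, df = 1, lower.tail = FALSE)
round(P, 2)  # 21.92
```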
```r
data(DA23_E2, package = "hecedsm")
tabs <- with(DA23_E2, table(purchased, format))
# Chi-square test for independence
# (no continuity correction, to match the reported statistic)
chisq <- chisq.test(tabs, correct = FALSE)
```
The test statistic is 21.92 with 1 degree of freedom. The p-value is less than 10⁻⁴, so there is strong evidence that purchase rates differ between selling formats.
Effect sizes for contingency tables range from 0 (no association) to 1 (perfect association).
Measures include Cramér's V.
Small sample (bias) corrections are often employed.
We obtain V=0.2541, a moderate effect size.
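The value 0.2541 is consistent with a bias-corrected Cramér's V; a base-R sketch of the correction (computed here directly from the counts in the table, as an illustration):

```r
obs <- matrix(c(100, 66, 133, 26), nrow = 2)  # observed counts
n <- sum(obs); I <- nrow(obs); J <- ncol(obs)
expected <- outer(rowSums(obs), colSums(obs)) / n
X2 <- sum((obs - expected)^2 / expected)      # Pearson statistic
phi2 <- X2 / n
# Small-sample bias correction for Cramer's V
phi2c <- max(0, phi2 - (I - 1) * (J - 1) / (n - 1))
Ic <- I - (I - 1)^2 / (n - 1)
Jc <- J - (J - 1)^2 / (n - 1)
V <- sqrt(phi2c / min(Ic - 1, Jc - 1))
round(V, 4)  # 0.2541
```

Without the correction, V = sqrt(X2/n) ≈ 0.26 here; the adjustment shrinks the estimate slightly, as noted above.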
We consider the multi-lab replication study of Elliot et al. (2021) on the spontaneous verbalization of children asked to identify pictures of objects.
```r
data(MULTI21_D1, package = "hecedsm")
contingency <- xtabs( # pool data
  count ~ age + frequency,
  data = MULTI21_D1)
# No correction, to get the same result as the Poisson regression model
(chisqtest <- chisq.test(contingency, correct = FALSE))
```
```
## Pearson's Chi-squared test
##
## data:  contingency
## X-squared = 87.467, df = 6, p-value < 2.2e-16
```
We compare nested models
```r
MULTI21_D1_long <- MULTI21_D1 |> # pool data by age and frequency
  dplyr::group_by(age, frequency) |> # I = 4 age groups, J = 3 frequencies
  dplyr::summarize(total = sum(count)) # aggregate counts
mod_main <- glm(total ~ age + frequency, # null model, no interaction
                family = poisson, data = MULTI21_D1_long)
mod_satur <- glm(total ~ age * frequency, # saturated model
                 family = poisson, data = MULTI21_D1_long)
```
The null model is the main effect model (no interaction, "independence between factors").
```r
# Both tests have (I-1) x (J-1) = 6 degrees of freedom
# Likelihood ratio / deviance test
anova(mod_main, mod_satur, test = "LRT")
# Score test, aka Pearson X2 statistic
anova(mod_main, mod_satur, test = "Rao")
```
We consider a study from Bertrand and Mullainathan (2004), who studied racial discrimination in hiring based on the consonance of applicants' names.
The authors created curriculum vitae for four applicants and randomly allocated them a name, either one typical of a white person or a black person.
The response is a count of how many applicants were called back (out of four profiles: two with black-sounding and two with white-sounding names).
Under the null hypothesis of symmetry, the off-diagonal entries of the table have equal frequency.
```r
data(BM04_T2, package = "hecedsm")
# Symmetric model with 6 parameters (3 diagonal + 3 upper-triangular)
mod_null <- glm(count ~ gnm::Symm(black, white),
                data = BM04_T2, family = poisson)
# Compare the two nested models using a likelihood ratio test
pchisq(deviance(mod_null), lower.tail = FALSE,
       df = mod_null$df.residual) # 9 cells - 6 parameters = 3 df
# Score test via the Pearson residuals
PearsonX2 <- sum(residuals(mod_null, type = "pearson")^2)
pchisq(PearsonX2, df = mod_null$df.residual, lower.tail = FALSE)
```
Nonparametric tests are procedures that make minimal assumptions about the distribution of the data (e.g., they do not require normality).
Rather than working with the ordered responses Y_(1) ≤ ⋯ ≤ Y_(n), we substitute the ranks 1, …, n (assuming no ties), where R_i = rank(Y_i) = #{j : Y_i ≥ Y_j, j = 1, …, n}.
Many tests can be interpreted (roughly) as linear regression or ANOVA models fitted to the ranks.
Ranks are not affected by outliers (more robust)
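A quick illustration of the rank transformation in base R (toy numbers, chosen here only to show the effect of an outlier):

```r
y <- c(4.2, -1.3, 7.7, 0.5, 120)  # note the outlier, 120
rank(y)                           # 3 1 4 2 5
# The outlier contributes only its rank (5), not its magnitude
```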
For paired data with differences D_i = Y_{i2} − Y_{i1}, we wish to know whether the differences are centered at zero.
The latter is analogous to a one-sample t-test for μD=0.
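Both tests are available in base R; a sketch on simulated paired differences (toy data generated here, not from any study):

```r
set.seed(80667)
d <- rnorm(20, mean = 0.5)  # simulated paired differences (toy data)
# Wilcoxon signed-rank test of H0: differences centered at zero
wtest <- wilcox.test(d, mu = 0)
# Analogous parametric test: one-sample t-test for mu_D = 0
ttest <- t.test(d, mu = 0)
```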
Roughly speaking
For K = 2 groups, the test is called the Mann–Whitney–Wilcoxon, Mann–Whitney U, or Wilcoxon rank-sum test.
Analogous to running two-sample t-test or one-way ANOVA with ranks.
Since ranks are discrete (assuming no ties), we can derive explicit expressions for the values the statistic can take in small samples.
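Base R tabulates this exact null distribution via dwilcox and pwilcox; a sketch for two samples of sizes m = 4 and n = 5 (an illustrative choice):

```r
# Exact null distribution of the Mann-Whitney U statistic, m = 4 and n = 5
m <- 4; n <- 5
u <- 0:(m * n)              # the statistic takes values 0, ..., m*n
probs <- dwilcox(u, m, n)   # exact point probabilities under H0
sum(probs)                  # probabilities sum to one
pwilcox(5, m, n)            # exact P(U <= 5), an exact small-sample tail probability
```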
Brucks and Levav (2022) measured the attention of participants during exchanges using an eye tracker, comparing in-person and videoconference conditions.
Data suggest that videoconferencing translates into longer time spent gazing at the partner than in-person meetings.
The coin package function reports the Hodges–Lehmann estimate of location. Intervals and estimates of the difference in location are in seconds (here, −37 seconds).
```r
data(BL22_E, package = "hecedsm")
(mww <- coin::wilcox_test( # rank-based test
  partner_time ~ cond, data = BL22_E,
  conf.int = TRUE)) # values and intervals are times in seconds
# Compare results with the two-sample t-test
welch <- t.test(partner_time ~ cond, data = BL22_E,
                conf.int = TRUE)
# Effect size
eff <- effectsize::rank_eta_squared(
  partner_time ~ cond, data = BL22_E)
```
We consider a within-subject design from Tech3Lab (Brodeur et al., 2021).
Each of the 31 participants performed four distraction tasks while using a driving simulator.
Task order was randomized and the data are balanced.
The response is the number of road safety violations committed on the driving segment.
We can use Friedman's or Quade's test; both rank the four responses of each participant (1, 2, 3, 4) separately.
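The within-participant ranking step can be sketched in base R (toy scores for two hypothetical participants; the study has 31):

```r
# Rank the four task responses within each participant (row)
scores <- matrix(c(3, 1, 2, 5,
                   4, 2, 2, 6),
                 nrow = 2, byrow = TRUE)  # toy data: 2 ids x 4 tasks
t(apply(scores, 1, rank))  # within-participant ranks; ties get average ranks
```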
```r
data(BRLS21_T3, package = "hecedsm")
coin::friedman_test(nviolation ~ task | id,
                    data = BRLS21_T3)
```
```
## Asymptotic Friedman Test
##
## data:  nviolation by
##   task (phone, watch, speaker, texting)
##   stratified by id
## chi-squared = 18.97, df = 3, p-value = 0.0002774
```
```r
coin::quade_test(nviolation ~ task | id,
                 data = BRLS21_T3)
```
```
## Asymptotic Quade Test
##
## data:  nviolation by
##   task (phone, watch, speaker, texting)
##   stratified by id
## chi-squared = 21.512, df = 3, p-value = 8.241e-05
```
The repeated measures must be recorded on comparable scales (e.g., the same Likert scale).
Since there are overall differences, we can follow up by testing all pairwise differences using Wilcoxon signed-rank tests.
```r
# Transform to wide format - one line per id
smartwatch <- tidyr::pivot_wider(
  data = BRLS21_T3,
  names_from = task,
  values_from = nviolation)
# Wilcoxon signed-rank test
coin::wilcoxsign_test(phone ~ watch, data = smartwatch)
```
There are (4 choose 2) = 6 pairwise comparisons, so we should adjust the p-values for multiple testing using, e.g., Holm–Bonferroni.
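The adjustment is a one-liner with p.adjust; a sketch with hypothetical raw p-values for the six comparisons (placeholder numbers, not the study's results):

```r
# Hypothetical raw p-values for the 6 pairwise signed-rank tests
pvals <- c(0.001, 0.004, 0.012, 0.020, 0.150, 0.410)
# Holm-Bonferroni: multiply the i-th smallest p-value by (6 - i + 1),
# then enforce monotonicity
p.adjust(pvals, method = "holm")  # 0.006 0.020 0.048 0.060 0.300 0.410
```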