Session 13
MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal
Count data
Nonparametric tests
Aggregating binary responses gives counts.
Duke and Amir (2023) investigated the impact on sales of presenting customers with an integrated versus a sequential selling format. The observed purchase counts are:

purchased | integrated | sequential
---|---|---
no | 100 | 133
yes | 66 | 26
Question: does the selling format increase sales?
Assume Y_ij ∼ Poisson(μ_ij), an integer-valued random variable. We compare two nested models: the main-effects model (independence between factors) and the saturated model (with interaction).
We can compare the fitted counts under the independence model with the observed counts.

Expected counts under independence:

purchased | integrated | sequential
---|---|---
no | 119.01 | 113.99
yes | 46.99 | 45.01

Observed counts:

purchased | integrated | sequential
---|---|---
no | 100 | 133
yes | 66 | 26
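Under independence, the expected count in cell (i, j) is the product of the row and column totals divided by the overall total. A minimal base-R sketch, rebuilding the table from the counts above, reproduces the fitted values:

```r
# Observed counts from Duke and Amir (2023): rows = purchased, cols = format
obs <- matrix(c(100, 66, 133, 26), nrow = 2,
              dimnames = list(purchased = c("no", "yes"),
                              format = c("integrated", "sequential")))
# Expected counts under independence: (row total x column total) / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 2)  # 119.01, 113.99 / 46.99, 45.01 -- matches the fitted table
```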
Consider an I×J contingency table.
Denote the observed counts in the (i,j)th cell Oij.
We compare these with expected counts under the null hypothesis, Eij's.
The test statistic is P = ∑_{i=1}^{I} ∑_{j=1}^{J} (O_{ij} − E_{ij})² / E_{ij}.
Yates's correction for 2×2 tables replaces the numerator by (|O_{ij} − E_{ij}| − 0.5)².
In large samples (if min_{i,j} E_{ij} > 5), the statistic P approximately follows a chi-square distribution with ν degrees of freedom, χ²_ν.
The degrees of freedom ν encode the difference in the number of parameters between the alternative and the null model.
For example, comparing the saturated model (with interaction) to the independence model, the degrees of freedom for a two-way table with I rows and J columns equal the number of interaction parameters, ν = (I−1)×(J−1).
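Using the counts from the Duke and Amir (2023) table, the Pearson statistic and its approximate p-value can also be computed by hand; a base-R sketch:

```r
# Observed counts: rows = purchased (no/yes), cols = format
obs <- matrix(c(100, 66, 133, 26), nrow = 2)
# Expected counts under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
# Pearson chi-squared statistic
P <- sum((obs - expected)^2 / expected)
# Approximate p-value, (I-1) x (J-1) = 1 degree of freedom
pval <- pchisq(P, df = 1, lower.tail = FALSE)
round(P, 2)  # 21.92
```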
```r
data(DA23_E2, package = "hecedsm")
tabs <- with(DA23_E2, table(purchased, format))
# Chi-square test for independence
# (no continuity correction, to match the reported statistic)
chisq <- chisq.test(tabs, correct = FALSE)
```
The test statistic is 21.92 with 1 degree of freedom. The p-value is less than 10⁻⁴, so there is strong evidence that purchase rates differ between selling formats.
Effect sizes for contingency tables range from 0 (no association) to 1 (perfect association).
Measures include Cramér's V.
Small sample (bias) corrections are often employed.
We obtain V=0.2541, a moderate effect size.
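The value 0.2541 is consistent with a bias-corrected Cramér's V; a base-R sketch of the correction (computed here directly from the counts in the table, as an illustration):

```r
obs <- matrix(c(100, 66, 133, 26), nrow = 2)  # observed counts
n <- sum(obs); I <- nrow(obs); J <- ncol(obs)
expected <- outer(rowSums(obs), colSums(obs)) / n
X2 <- sum((obs - expected)^2 / expected)      # Pearson statistic
phi2 <- X2 / n
# Small-sample bias correction for Cramer's V
phi2c <- max(0, phi2 - (I - 1) * (J - 1) / (n - 1))
Ic <- I - (I - 1)^2 / (n - 1)
Jc <- J - (J - 1)^2 / (n - 1)
V <- sqrt(phi2c / min(Ic - 1, Jc - 1))
round(V, 4)  # 0.2541
```

Without the correction, V = sqrt(X2/n) ≈ 0.26 here; the adjustment shrinks the estimate slightly, as noted above.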
We consider the multi-lab replication study of Elliot et al. (2021) on the spontaneous verbalization of children asked to identify pictures of objects.
```r
data(MULTI21_D1, package = "hecedsm")
contingency <- xtabs( # pool data
  count ~ age + frequency,
  data = MULTI21_D1)
# No correction, to get the same result as the Poisson regression model
(chisqtest <- chisq.test(contingency, correct = FALSE))
```
```
## Pearson's Chi-squared test
##
## data:  contingency
## X-squared = 87.467, df = 6, p-value < 2.2e-16
```
We compare nested models
```r
MULTI21_D1_long <- MULTI21_D1 |> # pool data by age and frequency
  dplyr::group_by(age, frequency) |> # I = 4 age groups, J = 3 frequencies
  dplyr::summarize(total = sum(count)) # aggregate counts
mod_main <- glm(total ~ age + frequency, # null model, no interaction
                family = poisson, data = MULTI21_D1_long)
mod_satur <- glm(total ~ age * frequency, # saturated model
                 family = poisson, data = MULTI21_D1_long)
```
The null model is the main effect model (no interaction, "independence between factors").
```r
# Both tests have (I-1) x (J-1) = 6 degrees of freedom
# Likelihood ratio / deviance test
anova(mod_main, mod_satur, test = "LRT")
# Score test, aka Pearson X2 statistic
anova(mod_main, mod_satur, test = "Rao")
```
We consider a study from Bertrand and Mullainathan (2004), who studied racial discrimination in hiring based on the consonance of applicants' names.
The authors created curriculum vitae for four applicants and randomly allocated them a name, either one typical of a white person or a black person.
The response is a count of how many applicants were called back (out of four profiles: two with black-sounding and two with white-sounding names).
Under the null hypothesis of symmetry, the off-diagonal entries of the table have equal frequency.
```r
data(BM04_T2, package = "hecedsm")
# Symmetric model with 6 parameters (3 diagonal + 3 upper-triangular)
mod_null <- glm(count ~ gnm::Symm(black, white),
                data = BM04_T2, family = poisson)
# Compare the two nested models using a likelihood ratio test
pchisq(deviance(mod_null), lower.tail = FALSE,
       df = mod_null$df.residual) # 9 cells - 6 parameters = 3 df
# Score test via the Pearson residuals
PearsonX2 <- sum(residuals(mod_null, type = "pearson")^2)
pchisq(PearsonX2, df = mod_null$df.residual, lower.tail = FALSE)
```
Nonparametric tests are procedures that make minimal assumptions about the distribution of the data (e.g., they do not require normality).
Rather than working with the ordered responses Y_(1) ≤ ⋯ ≤ Y_(n), we substitute the ranks 1, …, n (assuming no ties), where R_i = rank(Y_i) = #{j : Y_i ≥ Y_j, j = 1, …, n}.
Many tests can be interpreted (roughly) as linear regression or ANOVA models fitted to the ranks.
Ranks are not affected by outliers (more robust)
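A quick illustration of the rank transformation in base R (toy numbers, chosen here only to show the effect of an outlier):

```r
y <- c(4.2, -1.3, 7.7, 0.5, 120)  # note the outlier, 120
rank(y)                           # 3 1 4 2 5
# The outlier contributes only its rank (5), not its magnitude
```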
For paired data with differences D_i = Y_{i2} − Y_{i1}, we wish to know whether the differences are centered at zero.
The latter is analogous to a one-sample t-test for μD=0.
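Both tests are available in base R; a sketch on simulated paired differences (toy data generated here, not from any study):

```r
set.seed(80667)
d <- rnorm(20, mean = 0.5)  # simulated paired differences (toy data)
# Wilcoxon signed-rank test of H0: differences centered at zero
wtest <- wilcox.test(d, mu = 0)
# Analogous parametric test: one-sample t-test for mu_D = 0
ttest <- t.test(d, mu = 0)
```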
Roughly speaking
For K = 2 groups, the test is called the Mann–Whitney–Wilcoxon, Mann–Whitney U, or Wilcoxon rank-sum test.
Analogous to running two-sample t-test or one-way ANOVA with ranks.
Since ranks are discrete (assuming no ties), we can derive explicit expressions for the values the statistic can take in small samples.
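Base R tabulates this exact null distribution via dwilcox and pwilcox; a sketch for two samples of sizes m = 4 and n = 5 (an illustrative choice):

```r
# Exact null distribution of the Mann-Whitney U statistic, m = 4 and n = 5
m <- 4; n <- 5
u <- 0:(m * n)              # the statistic takes values 0, ..., m*n
probs <- dwilcox(u, m, n)   # exact point probabilities under H0
sum(probs)                  # probabilities sum to one
pwilcox(5, m, n)            # exact P(U <= 5), an exact small-sample tail probability
```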
Brucks and Levav (2022) measured the attention of participants during exchanges using an eye tracker, comparing in-person and videoconference conditions.
Data suggest that videoconferencing translates into longer time spent gazing at the partner than in-person meetings.
The coin package function reports the Hodges–Lehmann estimate of location. Intervals and estimates of the difference in location are in seconds (here, −37 seconds).
```r
data(BL22_E, package = "hecedsm")
(mww <- coin::wilcox_test( # rank-based test
  partner_time ~ cond, data = BL22_E,
  conf.int = TRUE)) # values and intervals are times in seconds
# Compare results with the two-sample t-test
welch <- t.test(partner_time ~ cond, data = BL22_E,
                conf.int = TRUE)
# Effect size
eff <- effectsize::rank_eta_squared(
  partner_time ~ cond, data = BL22_E)
```
We consider a within-subject design from Tech3Lab (Brodeur et al., 2021).
Each of the 31 participants performed four distraction tasks while using a driving simulator.
Task order was randomized and the data are balanced.
The response is the number of road safety violations committed on the driving segment.
We can use Friedman's or Quade's test; both rank the four responses of each participant (1, 2, 3, 4) separately.
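The within-participant ranking step can be sketched in base R (toy scores for two hypothetical participants; the study has 31):

```r
# Rank the four task responses within each participant (row)
scores <- matrix(c(3, 1, 2, 5,
                   4, 2, 2, 6),
                 nrow = 2, byrow = TRUE)  # toy data: 2 ids x 4 tasks
t(apply(scores, 1, rank))  # within-participant ranks; ties get average ranks
```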
```r
data(BRLS21_T3, package = "hecedsm")
coin::friedman_test(nviolation ~ task | id,
                    data = BRLS21_T3)
```
```
## Asymptotic Friedman Test
##
## data:  nviolation by
##   task (phone, watch, speaker, texting)
##   stratified by id
## chi-squared = 18.97, df = 3, p-value = 0.0002774
```
```r
coin::quade_test(nviolation ~ task | id,
                 data = BRLS21_T3)
```
```
## Asymptotic Quade Test
##
## data:  nviolation by
##   task (phone, watch, speaker, texting)
##   stratified by id
## chi-squared = 21.512, df = 3, p-value = 8.241e-05
```
The repeated measures must be recorded on comparable scales (e.g., the same Likert scale).
Since there are overall differences, we can follow up by testing all pairwise differences using Wilcoxon signed-rank tests.
```r
# Transform to wide format - one line per id
smartwatch <- tidyr::pivot_wider(
  data = BRLS21_T3,
  names_from = task,
  values_from = nviolation)
# Wilcoxon signed-rank test
coin::wilcoxsign_test(phone ~ watch, data = smartwatch)
```
There are (4 choose 2) = 6 pairwise comparisons, so we should adjust the p-values for multiple testing using, e.g., Holm–Bonferroni.
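The adjustment is a one-liner with p.adjust; a sketch with hypothetical raw p-values for the six comparisons (placeholder numbers, not the study's results):

```r
# Hypothetical raw p-values for the 6 pairwise signed-rank tests
pvals <- c(0.001, 0.004, 0.012, 0.020, 0.150, 0.410)
# Holm-Bonferroni: multiply the i-th smallest p-value by (6 - i + 1),
# then enforce monotonicity
p.adjust(pvals, method = "holm")  # 0.006 0.020 0.048 0.060 0.300 0.410
```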