Many experiments have categorical outcomes, which are themselves functions of one or several experimental factors. These data can be understood in much the same way as other ANOVA models. Such data are often found in papers in the form of contingency tables, giving the total count per combination of factor levels.

The analogy with analysis of variance doesn’t stop there. If we have a two-factor design, the most complicated model is the saturated model, which has one parameter per cell (since, with count data, we have a single count per cell). The model without the interaction constrains the cell proportions to be products of the row and column proportions, and we can compare the difference in goodness of fit arising from dropping the interaction.

If we have more than two factors, we could pool and aggregate the counts, ignoring one dimension, to run a series of tests on the resulting two-dimensional tables; this amounts to marginalization. Much like in ANOVA, this only makes sense if there is no interaction between the factors. If there is an interaction, we must look at simple effects and fix the level of one factor while comparing the others.

The following section is an introduction to the topic, showcasing examples of tests and their applications. It highlights the unifying theme of the course: statistics as summaries of evidence, and decision-making in the presence of uncertainty. We use examples drawn from published articles in the management sciences.

Setup

We consider for simplicity a bivariate contingency table with \(I\) rows and \(J\) columns and \(n\) observations overall; the count in the \((i,j)\) entry of the matrix is \(n_{ij}\).

Pearson’s \(X^2\) goodness-of-fit test examines discrepancies between postulated proportions \(p_{ij0}\) in each cell (with the probabilities summing to one, \(\sum_{i=1}^I \sum_{j=1}^J p_{ij0}=1\)). The test compares the expected counts \(E_{ij}=n \cdot p_{ij0}\) with the observed counts \(O_{ij}=n_{ij}\). As a summary of evidence, we take the statistic \[ X^2 = \sum_{i,j} \frac{(E_{ij}-O_{ij})^2}{E_{ij}},\] the squared differences between expected and observed counts, divided by the expected counts.
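As a small illustration, the statistic can be computed directly from its definition; the counts and null proportions below are made up for the sketch.

Code
# Toy 2 x 3 table of observed counts (made-up numbers)
O <- matrix(c(20, 30, 50,
              25, 35, 40), nrow = 2, byrow = TRUE)
n <- sum(O)
# Postulated null proportions, summing to one
p0 <- matrix(c(0.10, 0.15, 0.25,
               0.10, 0.20, 0.20), nrow = 2, byrow = TRUE)
E <- n * p0                  # expected counts E_ij = n * p_ij0
X2 <- sum((E - O)^2 / E)     # Pearson statistic
# Null distribution is chi-squared with IJ - 1 degrees of freedom
pchisq(X2, df = length(O) - 1, lower.tail = FALSE)
# Same statistic via the built-in goodness-of-fit test on the flattened table
chisq.test(as.vector(O), p = as.vector(p0))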

The contingency table described above represents a two-way factorial design on which we impose a single constraint, leaving \(IJ-1\) free parameters: the constraint comes from the fact that the overall counts must sum to \(n\), so one of the numbers is predetermined by the others.

Rather than specify a full set of postulated proportions, we can compare nested models for the cell probabilities.

The model with the two-way interaction has \(IJ\) parameters, one for each cell. There, the estimated proportions are simply the observed counts in each cell, divided by the overall sample size. Such a model has as many parameters as observations and is said to be saturated. The first departure one can consider is thus a model with different marginal proportions for each row and column, but no interaction. The hypothesis of independence between factors thus compares the model with the interaction to the one without.
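To see what the independence model implies, the expected counts are obtained from the estimated row and column margins; a small sketch with made-up counts:

Code
# Made-up 2 x 3 table of counts
tab <- matrix(c(20, 30, 50,
                25, 35, 40), nrow = 2, byrow = TRUE)
n <- sum(tab)
# Under independence, p_ij = p_i. * p_.j, with margins estimated from the data
p_row <- rowSums(tab) / n
p_col <- colSums(tab) / n
E <- n * outer(p_row, p_col)   # expected counts under independence
# These match the expected counts used by the test of independence
all.equal(E, chisq.test(tab)$expected, check.attributes = FALSE)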

Study 1 - Lee and Choi (2019)

Lee & Choi (2019) study the perception of consumers when faced with inconsistent descriptions of items (when the text description doesn’t match the image). The dataset LC19_T2 contains the counts for the expected number of toothbrushes for each combination of image and text.

We compute the \(X^2\) test of independence between

  1. the text description (text) and the expected number of toothbrushes (expected).
  2. the image and expected number.

In R, the chisq.test function will compute the test of independence between rows and columns if you provide a matrix with the cross-counts.

Code
data(LC19_T2, package = "hecedsm")
contingency_tab <- 
  with(LC19_T2, 
       xtabs(count ~ text + expected))
# Score test, chi-square (2) null
chisq.test(contingency_tab)

    Pearson's Chi-squared test

data:  contingency_tab
X-squared = 81.071, df = 2, p-value < 2.2e-16

We can check that our summaries match those reported by the authors, so the results are reproducible.

Models for count data are often obtained by specifying a Poisson distribution for the response and treating the factors as explanatory variables. This makes it perhaps clearer what the \(\chi^2\) test of independence is computing. Note that this isn’t the only statistic for the comparison: below, I fit both the saturated model and the model without the interaction using standard regression routines. The regression model specifies that the response consists of counts (family = poisson).

Code
# Fit Poisson model
cmod1 <- glm(
  count ~ text * expected,
  data = LC19_T2, 
  family = poisson)
cmod0 <- glm(
  count ~ text + expected,
  data = hecedsm::LC19_T2, 
  family = poisson)
# Likelihood ratio test, chi-square (2) null
car::Anova(cmod1, type = 3)
Analysis of Deviance Table (Type III tests)

Response: count
              LR Chisq Df Pr(>Chisq)    
text             0.182  1     0.6696    
expected        89.150  2     <2e-16 ***
text:expected   88.705  2     <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Score test - the "chi-squared test of independence"
anova(cmod0, cmod1, test = "Rao")
Analysis of Deviance Table

Model 1: count ~ text + expected
Model 2: count ~ text * expected
  Resid. Df Resid. Dev Df Deviance    Rao  Pr(>Chi)    
1         8     96.372                                 
2         6      7.667  2   88.705 81.071 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results of the likelihood ratio test and of the score test are slightly different, but they test the same hypothesis. Here, we reject the null hypothesis of independence between text and expected: there is indeed an interaction present.

These results are, however, of limited use on their own. First, one would need to check that there is no interaction between text and image before marginalizing over the image, which is unlikely because part of the confusion stems precisely from a description that is inconsistent with the photo: if the display is a picture of six toothbrushes and the text description is 1 pack of 1, this is confusing. If the image agrees with the text, we expect people to answer correctly.
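That check could be performed by fitting the saturated three-way Poisson model and testing the interactions before pooling; a sketch, assuming the photograph factor in LC19_T2 is named image:

Code
# Saturated three-way Poisson model (assumes a column named `image`)
cmod_full <- glm(count ~ text * image * expected,
                 data = hecedsm::LC19_T2,
                 family = poisson)
# Likelihood ratio tests for the interactions, before any marginalization
car::Anova(cmod_full, type = 3)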

Rather than perform the test of Lee & Choi (2019), the comparison of interest, in my humble opinion, is whether the proportion of consistent answers is the same for both text descriptions (text one vs expected one in the same proportion as text six vs expected six), and similarly for inconsistent answers. To test this, it suffices to permute entries and relabel the levels of the expected factor according to whether the answer is correct, incorrect, or unsure given the text.

Table 1: Contingency table for the null hypothesis of independence, looking at the correspondence between the text description and customers’ expectations.

text   not sure   incorrect   correct
1            10          12        81
6            12          17        68

The test statistic for this hypothesis is 2 with a \(p\)-value of 0.37. There is no evidence here that people have different levels of confusion if the text mentions a different quantity.
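A quick sketch reproducing these numbers from the counts in Table 1:

Code
# Counts from Table 1: rows = text description (1 or 6),
# columns = not sure / incorrect / correct
confusion <- matrix(c(10, 12, 81,
                      12, 17, 68),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(text = c("1", "6"),
                                    answer = c("not sure", "incorrect", "correct")))
# Test of independence: X-squared of about 2, df = 2, p-value of about 0.37
chisq.test(confusion)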

Study 2 - Bertrand and Mullainathan (2004)

While the test of independence is by far the most common, more specialized hypotheses can be considered, depending on the design. The following example showcases a test of symmetry for a square contingency table.

We consider a study from Bertrand & Mullainathan (2004), who study racial discrimination in hiring based on the consonance of applicants’ names. The authors created curricula vitae for four applicants and randomly allocated them names, either typical of a white person or of a Black person. The response is a count indicating how many of the applicants were called back (out of two Black and two white) depending on their origin.

If there was no racial discrimination (the null hypothesis), we would expect the number of times only white applicants were called back to be the same as the number of times only Black applicants were called back. Only the entries with different numbers of call-backs per race (either 0 vs 2, 0 vs 1, or 1 vs 2) are informative about our question of interest. The data are reported in Table 2.

Table 2: Contingency table for racial discrimination in the labor market; rows give the number of call-backs for Black-named applicants, columns for white-named applicants.

        no    1W    2W
no    1103    74    19
1B      33    46    18
2B       6     7    17

The hypothesis of symmetry postulates that the proportions on either side of the diagonal are the same, so \(p_{ij}=p_{ji}\). Under the null model, our best estimate of the expected count is thus \(E_{ij} = (n_{ij} + n_{ji})/2\), the sample average of the two mirrored cells. The statistic is analogous to Pearson’s goodness-of-fit test, except that the expected counts are estimated here. There are \(J^2\) entries and the symmetry hypothesis imposes \(J(J-1)/2\) constraints, which give the degrees of freedom of the test.

The test statistic reduces to \[X^2 = \sum_{i,j} \frac{(E_{ij}-O_{ij})^2}{E_{ij}} = \sum_{i<j} \frac{(n_{ij} - n_{ji})^2}{n_{ij}+n_{ji}}.\] The statistic is 27.31, to be compared with a \(\chi^2_3\) benchmark (three off-diagonal pairs). The \(p\)-value is \(5\times 10^{-6}\), highly suggestive of racial discrimination.
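This is Bowker’s test of symmetry, which the base R function mcnemar.test computes for a square table; a sketch with the counts of Table 2:

Code
# Counts from Table 2: rows = call-backs for Black-named applicants,
# columns = call-backs for white-named applicants
callback <- matrix(c(1103, 74, 19,
                       33, 46, 18,
                        6,  7, 17),
                   nrow = 3, byrow = TRUE)
# Bowker's test of symmetry: X-squared = 27.31, df = 3, p-value of about 5e-06
mcnemar.test(callback)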

References

Bertrand, M., & Mullainathan, S. (2004). Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review, 94(4), 991–1013. https://doi.org/10.1257/0002828042002561
Lee, K., & Choi, J. (2019). Image-text inconsistency effect on product evaluation in online retailing. Journal of Retailing and Consumer Services, 49, 279–288. https://doi.org/10.1016/j.jretconser.2019.03.015