Analyzing One Factor by Another: The Contingency Table
Another type of bivariate analysis comes about when you arrange to analyze one categorical variable by another. Recall that along with most sources of information about statistical analysis, R uses the word factor to refer to a variable that’s measured on a nominal basis, in categories such as make of car or political party. When you count the number of male and female Republicans and Democrats, the table that results is termed a contingency table.
Let’s take a look at the results of running the Desc function on two factors from the Pizza data set, Driver, and Area. See Figure 2.28.
Figure 2.28 This tabulation usually follows the inferential statistics in the Desc results.
(It’s almost eerie to note how closely the table produced by Desc in Figure 2.28 resembles the results of the old Crosstabs procedure in SPSS.)
This portion of the results is fairly straightforward. The values taken on by one factor head the table’s columns and the values of the other factor label the table’s rows. In each cell you get the cell’s count, as well as each cell’s percent of the entire table’s count. You also get the cell’s count as a percent of all the records for its row, as a percent of all the records for its column.
These raw counts are often intrinsically interesting. They can reveal unexpected patterns in the data or confirm a researcher’s expectations. They can also mislead seriously—see the discussion of Simpson’s Paradox later in this chapter for more about that problem.
It’s easy enough to get the same results in Excel as are shown using Desc in Figure 2.28. Figure 2.29 shows how you might use a pivot table to get the cross tabulation. The pivot table’s data source is the list shown in Figure 2.4.
Figure 2.29 This pivot table requires a moderate amount of customizing.
To create the pivot table shown in Figure 2.29, you could take these general steps:
Insert a pivot table on an otherwise blank worksheet using the link in the Insert tab’s Tables group.
Supply the address of the underlying data—in this case, the list shown in Figure 2.4.
On the PivotTable Fields window, drag the area field down into the Rows area. Use the Row Labels drop-down menu to suppress the NA values in the pivot table’s rows.
Still in the PivotTable Fields window, drag the driver field down into the Columns area. Use the Column Labels drop-down menu to suppress NA values in the pivot table’s columns.
Drag the driver field down into the Σ Values area. You will now have a raw count of the number of records for each driver in each area.
Drag the driver field down into the Σ Values area again. Click the drop-down menu next to the second instance of the driver field in the Σ Values area and choose Value Field Settings. Change the field’s custom name to Percent of Total. Click the Show Values As tab and select Percent of Grand Total. Click OK.
Notice that you now have a Σ Values entry in the Columns area. Click the drop-down menu immediately to its right and choose Move To Row Labels. (You could instead drag the Σ Values entry from the Column Labels area to the Row Labels area.)
Repeat steps 4 through 7 twice, once to insert a Percent of Row (see step 6) and a Percent of Column (also see step 6) summary into the pivot table.
If you want to suppress the grand totals at the bottom of the pivot table, to more closely emulate the layout of R’s table in Figure 2.28, you can right-click any cell in the pivot table and choose PivotTable Options from the shortcut menu. Click the Totals & Filters tab in the PivotTable Options dialog box and clear either the Show Grand Totals for Rows or the Show Grand Totals for Columns checkbox, or both. You can’t choose to show the grand totals for a subset of the value fields, so you can’t follow R’s approach and show only the count and the percent of table totals, suppressing the totals for percent of row and percent of column.
The Desc function reports several additional statistics to accompany the contingency table. They appear in Figure 2.30.
Figure 2.30 Desc reports three statistics that help you evaluate the reliability of the finding and three that quantify the strength of the relationship between the table’s dimensions.
Suppose that instead of the names of drivers and delivery areas, you had a table that showed a breakdown of a sample of 100 adults by sex and by preference of political party. If those two variables, sex and political preference, were independent of one another, you would expect the number of observations in each cell to reflect the frequencies in the table’s margins. That is, if 50 of the sampled subjects were female and 30 of the sampled subjects preferred the Republican party, and if sex and political preference were independent of one another, you would expect to find 50 times 30 divided by 100, or 15 subjects, in the cell that represents Republican women.
But if there were something about the Republican party that drew women disproportionately, you might find 30 instead of 15 Republican women in your sample (equally, you might find 5 instead of 15 if women were disproportionately drawn to the Democratic party). In either case, your observation would have departed from the expected frequency by a considerable amount. That might constitute good evidence that the two variables, sex and political preference, are not independent of one another, and that something about the parties draws or distances women disproportionately. You would have counted many more (or many fewer) people than you would expect purely on the basis that half the population is female and 30% of the population prefer the Republican party.
The three inferential statistics that R’s Desc function reports help you decide whether a departure from expected frequencies in a contingency table is likely a random event that might well not show up in a subsequent sample, or is large enough that you might well expect it to repeat in that subsequent sample. These statistics are Pearson’s chi-square, the likelihood ratio, and the Mantel-Haenszel chi-square, as shown in Figure 2.30. I’ll discuss these statistics later in the chapter in the Excel context, where it’s easier to see how the calculations come together.
The Desc function also reports the Phi coefficient, the Contingency coefficient and Cramer’s V. These are three measures of the strength of the relationship between the table’s variables, analogous to the Pearson correlation coefficient but used with nominal variables instead of with interval or ratio variables. Again, I’ll show the details of their calculations shortly.
These three inferential statistics and relationship coefficients are somewhat more complex than those routinely used with variables measured on interval or ratio scales. Each statistic returns a chi-square value, which you can test using one of Excel’s chi-square worksheet functions. Excel does not routinely provide the statistics themselves, whether via worksheet functions, the Data Analysis add-in or any command that resides on the Ribbon. It’s necessary to assemble a particular set of worksheet functions as described in the next few sections.
The Pearson Chi-square
This statistic tests whether the dimensions of a contingency table such as the one shown in Figures 2.28 and 2.29 are independent of one another, or whether the dimensions have a joint effect. The question is addressed by determining the expected frequency for each cell. That expected frequency is found by multiplying the total frequency for a cell’s row by the total frequency for the cell’s column, and dividing by the total frequency for the table.
I’ll show that procedure shortly, but first Figure 2.31 uses an arithmetically equivalent but quicker way to reach the same result.
Figure 2.31 A quicker way to find the chi-square value but not as helpful conceptually.
Figure 2.31 shows how to reach the chi-square value for the Pizza data set’s Area and Driver variables in just a couple of steps. You can start with the contingency table itself, with the value fields restricted to the raw counts only—no percentages.
To get the matrix of values in the range B10:H12, enter this formula in cell B10:
That is, square the count in cell B4. Then divide the result by the product of the total frequency for row 4, in cell I4, and the total frequency for column B, in cell B7.
Notice the use of dollar signs in the addresses of the marginal cells B7 and I4. They make mixed references of the two addresses, so that you can drag cell B7 to the right without disturbing references to column I, and drag it down without disturbing references to row 7.
Now, copy and paste, or drag using the fill handle, from cell B10 into H10. Select B10:H10 and drag down into B12:H12. You now have the matrix in B10:H12.
Enter this formula in cell C14:
That returns the value 1009.50319. Compare it to the Pearson chi-square value returned by Desc in Figure 2.30, row 10.
The analysis shown in Figure 2.32 returns the same result with an extra step, but it shows the comparison of the original to the expected values more clearly.
Figure 2.32 The range B16:H18 shows how the observed counts are compared to the expected counts.
The expected counts are calculated explicitly in the range B10:H12. Each expected count is calculated by taking the product of the marginals and dividing by the table’s grand total. For example, the formula in cell B10 is
The dollar signs in the formula serve the same function as in Figure 2.31: They enable you to copy it and paste it through column H and down into row 12. Notice that the resulting marginals for the expected counts, in the ranges B13:H13 and I10:I12 are identical to the marginals shown for the observed counts.
Then the observed and expected counts are compared using this formula in cell B16:
No dollar signs are used because we don’t need to constrain any addressing to the marginal cells. The formula simply squares the difference between the observed and the expected values, and divides by the expected value.
Clearly, the greater the discrepancies between the observed and the expected values, the greater the values calculated by the formula—and, therefore, the greater the sum of the values calculated in B16:H18. And the greater that total, the greater the resulting value of chi-square, which is calculated in cell C21 with this formula:
Notice that the resulting chi-square value is identical to the one reported in Figure 2.31. You can test the reliability of that value by using Excel’s CHISQ.DIST.RT( ) function. It’s used in Figure 2.31, cell C16, and Figure 2.32, cell C23. The two results are identical because the values of chi-square are identical. The formula as used in Figure 2.32 is
The second argument, C22, is the degrees of freedom for the test. When used in this fashion, to test the independence of classifications in a contingency table, the degrees of freedom equals the number of categories in one variable, minus 1, times the number of categories in the other variable, minus 1. With seven categories for Driver and three for Area, the degrees of freedom is (7−1)*(3−1), or 12.
You’ll note that the probability of getting so large a chi-square if the two variables are actually independent of one another is reported by Excel as approximately 1 in 2 times 10 to the 208th power. It’s reported by Desc as approximately 1 in 2 times 10 to the 16th power.
I cannot explain the discrepancy. I can say that such discrepancies tend not to turn up with smaller, and thus much more likely, values for chi-square. Comparing such ridiculously small probabilities is akin to asking whether the volume of 208 atoms is meaningfully greater than the volume of 16 atoms.
The Likelihood Ratio
When you use it in the same sort of setup—assessing a contingency table—as you use Pearson’s chi-square, the likelihood ratio provides you the same sort of inference as I discussed in the prior section. The likelihood ratio is probably used more frequently in the context of logistic regression, where it helps assess the reliability of an equation developed using maximum likelihood techniques.
Figure 2.33 shows how the likelihood ratio is calculated when used with contingency tables. The basic difference between the likelihood ratio and Pearson’s chi-square is that the likelihood ratio evaluates the ratio of the observed frequencies to the expected frequencies, whereas Pearson’s chi-square evaluates the result of subtracting one from the other.
Figure 2.33 Both Pearson’s chi-square and the likelihood ratio depend on contrasting the observed to the expected frequencies.
Notice that the likelihood ratio of 1020.859 in Figure 2.33 is very close to the chi-square value of 1009.503 in Figure 2.32. In fact both statistics follow the theoretical chi-square distribution when the contingency table’s variables are mutually independent. (And therefore, if you get a highly unusual chi-square or likelihood ratio, as here, you have to consider that the variables are not independent.) The distribution of the two statistics comes closer and closer to the theoretical chi-square distribution as the sample size increases: Both statistics are consistent estimators.
However, Pearson’s chi-square converges to the theoretical distribution more quickly (that is, sooner as the sample size increases) than does the likelihood ratio. Furthermore, when the average number of observations per cell is less than 5, the likelihood ratio is usually a less accurate estimate of chi-square than Pearson’s formulation.
Although you can doubtless find counterexamples, a good general approach is to regard Pearson’s chi-square as your primary criterion and the likelihood ratio as a confirming measure. Because the Desc function supplies both statistics by default, you might as well use both, and it’s pretty easy to calculate both using Excel worksheet formulas as shown in Figures 2.32 and 2.33.
The Mantel-Haenszel Chi-square
The third test statistic provided by the Desc function when you specify two factors such as Area and Driver is the Mantel-Haenszel chi-square. It differs in several important ways from the Pearson chi-square and the likelihood ratio. Of the differences, the two most important are these:
The variables that define the table are assumed to be based on an ordinal scale. This often means that the variables take on values such as “First,” “Second,” and “Third.” Then, Desc regards the variable as a factor (and it takes two factors for Desc to return the Mantel-Haenszel chi-square). The Desc function cannot distinguish between an ordinal “First” and a nominal “Camden,” so it’s up to you to decide whether the nature of the variables’ scales complies with the test’s assumptions.
The Mantel-Haenszel chi-square is normally used when a third variable is in play, in addition to the two that define a two-dimensional, row-by-column contingency table. The levels of this third variable are often termed strata.
It’s when strata exist—and when you can account for them—that analysis using the Mantel-Haenszel approach becomes most valuable. The classic study dates to the 1970s, when the University of California at Berkeley came under fire for apparent sex discrimination in admission to its graduate programs. A review of admission statistics at Berkeley for 1973 showed that 44% of male applicants were admitted, but only 35% of female applicants. The chi-square value for the 2-by-2 (admission status by sex) contingency table was significant at the 0.001 level. That appears to be prima facie evidence of sex discrimination.
On closer examination, however, it turned out that the individual departments’ admissions decisions tended to favor female applicants. (In a breakdown of admission by sex on a department by department basis, the 2-by-2 tables for each department are the strata in Mantel-Haenszel analysis.) That’s an apparent contradiction, one that occurs widely enough in different studies that it has its own name: Simpson’s Paradox.
It’s not really a paradox, though. What happened is that some departments admitted a smaller percentage of applicants, regardless of their sex. But females applied for admission to those departments more often than they did to the departments that admitted a higher percentage of applicants.
Viewed from that perspective, the apparent sex bias disappears. Consider a (hypothetical) program in thermonuclear physics. It might be that 50 men and 100 women applied for admission and 20% of each sex were admitted. In a different program on biostatistics, 50 men and 50 women applied and 50% of each sex were admitted. Combining the records for the two courses of study, 35 of 100 men were admitted, or 35%, whereas 45 of 150 women were admitted, or 30%. But viewed at the department level, 20% and 50% of each sex were admitted.
The Mantel-Haenszel chi-square test takes account of differences in the strata (which some writers have referred to as “lurkers”). As R’s Desc function implements it, when only two dimensions such as Area and Driver are supplied, the test uses only that single table. Figure 2.34 shows how the statistic is calculated in that case.
Figure 2.34 It’s necessary first to convert the variables’ categories to ordinal, numeric values.
The values for the variables Area and Driver in the Pizza data set are given in the range A2:B1195 in Figure 2.34. Just for illustration, assume for the moment that those values are members of an ordinal scale. The first task is to convert the text values to numerics: 1 through 3 for Area and 1 through 7 for Driver. The corresponding pairs of values for the two variables are shown in D1:E4 and D7:E14.
Now Figure 2.34 uses Excel’s VLOOKUP( ) function to convert the text values in A2:B1195 to numerics. For example, the formula in cell G2 is
and in cell H2 the formula is
Cells G2:H2 are copied and pasted down through row 1195.
The formula for the Mantel-Haenszel chi-square test is:
(n – 1) * R2
where n is the number of paired observations in the contingency table (here, that’s 1194) and R2 is the R squared for the paired values. Excel has a worksheet function, RSQ( ), that returns the R squared value for exactly two variables, so the Mantel-Haenszel chi-square value is given in cell K5. The degrees of freedom for the test is 1, so the probability of the value in cell K5 is returned in cell K9 by
Compare the values in cells K5 and K9 with those returned by Desc in Figure 2.30.
Estimating the Strength of the Relationships
The versions of chi-square that Desc reports—Pearson’s chi-square, the likelihood ratio, and the Mantel-Haenszel chi-square—are meant primarily to enable a comparison of the tabled values with a distribution, chi-square, whose characteristics are as thoroughly studied and well known as those of the normal curve. But the value of chi-square is heavily influenced by the degrees of freedom and the sample size involved, which makes it very difficult to use chi-square to assess the strength of the relationship between the variables.
That’s what statistics such as the phi coefficient, the contingency coefficient and Cramer’s V are for, and they’re the final three statistics reported routinely by Desc when you request the bivariate analysis of two factors.
Figure 2.35 shows how all three are calculated.
The phi coefficient is generally used in situations that involve two dichotomous variables, therefore a 2-by-2 contingency table. The Desc function calculates and returns the phi coefficient regardless of the size of the table’s dimensions, though, so it’s up to you choose the proper coefficient based on your knowledge of the variables that define the table. If they’re both dichotomies, the phi coefficient will quantify the strength of their relationship for you.
As usual with these statistics, the phi coefficient relies on the comparison between the observed cell frequencies and the cell frequencies that are expected, given the marginal frequencies.
Figure 2.35 shows how the phi coefficient is calculated. The Pearson chi-square is calculated in cell C15, just as is done in cell C14 of Figure 2.31. In cell E15, that chi-square value is divided by the total number of observations in the table, and the square root of the ratio is taken to return the phi coefficient.
Bear in mind that the phi coefficient is an appropriate measure of the strength of the relationship between the variables that define the table represents the cross tabulation of two dichotomies. It’s interesting to note that if the table’s entries were represented as an Excel list or an R data frame with each observation representing a different record with a 0 or a 1 on each variable, the ordinary Pearson correlation between the two variables would equal the phi coefficient. But the dichotomous nature of both variables imposes additional constraints on the phi coefficient, so it would be misleading to think of the phi coefficient as an exact analog of the Pearson correlation coefficient.
When you have more than two values for either or both variables, as in Figures 2.30 and 2.35, you usually prefer the contingency coefficient to the phi coefficient. The contingency coefficient is calculated very easily once you have the chi-square value for the table. In Figure 2.35, cell E17 returns the contingency coefficient with this formula:
That is, the square root of the ratio of the chi-square value to the sum of chi-square and the number of observations.
Cramer’s V, shown in cell E19 of Figure 2.35, again measures the strength of the association between two variables measured on nominal scales. Many analysts prefer it to the phi or the contingency coefficient because its upper limit, like Pearson’s correlations, is 1.0 without any special constraints set on how the values are distributed. The formula for Cramer’s V as used in cell E19 is
where cell C15 contains the value for chi-square, cell I6 contains the number of observations in the contingency table, and 2 is the smaller of
The number of rows in the table, minus 1
The number of columns in the table, minus 1