# Descriptive Statistics

• Print
This chapter is from the book

## Analyzing One Factor by Another: The Contingency Table

Another type of bivariate analysis comes about when you arrange to analyze one categorical variable by another. Recall that along with most sources of information about statistical analysis, R uses the word factor to refer to a variable that’s measured on a nominal basis, in categories such as make of car or political party. When you count the number of male and female Republicans and Democrats, the table that results is termed a contingency table.

Let’s take a look at the results of running the Desc function on two factors from the Pizza data set, Driver, and Area. See Figure 2.28.

(It’s almost eerie to note how closely the table produced by Desc in Figure 2.28 resembles the results of the old Crosstabs procedure in SPSS.)

This portion of the results is fairly straightforward. The values taken on by one factor head the table’s columns and the values of the other factor label the table’s rows. In each cell you get the cell’s count, as well as each cell’s percent of the entire table’s count. You also get the cell’s count as a percent of all the records for its row, as a percent of all the records for its column.

These raw counts are often intrinsically interesting. They can reveal unexpected patterns in the data or confirm a researcher’s expectations. They can also mislead seriously—see the discussion of Simpson’s Paradox later in this chapter for more about that problem.

It’s easy enough to get the same results in Excel as are shown using Desc in Figure 2.28. Figure 2.29 shows how you might use a pivot table to get the cross tabulation. The pivot table’s data source is the list shown in Figure 2.4.

To create the pivot table shown in Figure 2.29, you could take these general steps:

1. Insert a pivot table on an otherwise blank worksheet using the link in the Insert tab’s Tables group.

2. Supply the address of the underlying data—in this case, the list shown in Figure 2.4.

3. On the PivotTable Fields window, drag the area field down into the Rows area. Use the Row Labels drop-down menu to suppress the NA values in the pivot table’s rows.

4. Still in the PivotTable Fields window, drag the driver field down into the Columns area. Use the Column Labels drop-down menu to suppress NA values in the pivot table’s columns.

5. Drag the driver field down into the Σ Values area. You will now have a raw count of the number of records for each driver in each area.

6. Drag the driver field down into the Σ Values area again. Click the drop-down menu next to the second instance of the driver field in the Σ Values area and choose Value Field Settings. Change the field’s custom name to Percent of Total. Click the Show Values As tab and select Percent of Grand Total. Click OK.

7. Notice that you now have a Σ Values entry in the Columns area. Click the drop-down menu immediately to its right and choose Move To Row Labels. (You could instead drag the Σ Values entry from the Column Labels area to the Row Labels area.)

Repeat steps 4 through 7 twice, once to insert a Percent of Row (see step 6) and a Percent of Column (also see step 6) summary into the pivot table.

If you want to suppress the grand totals at the bottom of the pivot table, to more closely emulate the layout of R’s table in Figure 2.28, you can right-click any cell in the pivot table and choose PivotTable Options from the shortcut menu. Click the Totals & Filters tab in the PivotTable Options dialog box and clear either the Show Grand Totals for Rows or the Show Grand Totals for Columns checkbox, or both. You can’t choose to show the grand totals for a subset of the value fields, so you can’t follow R’s approach and show only the count and the percent of table totals, suppressing the totals for percent of row and percent of column.

The Desc function reports several additional statistics to accompany the contingency table. They appear in Figure 2.30.

Suppose that instead of the names of drivers and delivery areas, you had a table that showed a breakdown of a sample of 100 adults by sex and by preference of political party. If those two variables, sex and political preference, were independent of one another, you would expect the number of observations in each cell to reflect the frequencies in the table’s margins. That is, if 50 of the sampled subjects were female and 30 of the sampled subjects preferred the Republican party, and if sex and political preference were independent of one another, you would expect to find 50 times 30 divided by 100, or 15 subjects, in the cell that represents Republican women.

But if there were something about the Republican party that drew women disproportionately, you might find 30 instead of 15 Republican women in your sample (equally, you might find 5 instead of 15 if women were disproportionately drawn to the Democratic party). In either case, your observation would have departed from the expected frequency by a considerable amount. That might constitute good evidence that the two variables, sex and political preference, are not independent of one another, and that something about the parties draws or distances women disproportionately. You would have counted many more (or many fewer) people than you would expect purely on the basis that half the population is female and 30% of the population prefer the Republican party.

The three inferential statistics that R’s Desc function reports help you decide whether a departure from expected frequencies in a contingency table is likely a random event that might well not show up in a subsequent sample, or is large enough that you might well expect it to repeat in that subsequent sample. These statistics are Pearson’s chi-square, the likelihood ratio, and the Mantel-Haenszel chi-square, as shown in Figure 2.30. I’ll discuss these statistics later in the chapter in the Excel context, where it’s easier to see how the calculations come together.

The Desc function also reports the Phi coefficient, the Contingency coefficient and Cramer’s V. These are three measures of the strength of the relationship between the table’s variables, analogous to the Pearson correlation coefficient but used with nominal variables instead of with interval or ratio variables. Again, I’ll show the details of their calculations shortly.

These three inferential statistics and relationship coefficients are somewhat more complex than those routinely used with variables measured on interval or ratio scales. Each statistic returns a chi-square value, which you can test using one of Excel’s chi-square worksheet functions. Excel does not routinely provide the statistics themselves, whether via worksheet functions, the Data Analysis add-in or any command that resides on the Ribbon. It’s necessary to assemble a particular set of worksheet functions as described in the next few sections.

### The Pearson Chi-square

This statistic tests whether the dimensions of a contingency table such as the one shown in Figures 2.28 and 2.29 are independent of one another, or whether the dimensions have a joint effect. The question is addressed by determining the expected frequency for each cell. That expected frequency is found by multiplying the total frequency for a cell’s row by the total frequency for the cell’s column, and dividing by the total frequency for the table.

I’ll show that procedure shortly, but first Figure 2.31 uses an arithmetically equivalent but quicker way to reach the same result.

Figure 2.31 shows how to reach the chi-square value for the Pizza data set’s Area and Driver variables in just a couple of steps. You can start with the contingency table itself, with the value fields restricted to the raw counts only—no percentages.

To get the matrix of values in the range B10:H12, enter this formula in cell B10:

=B4^2/(B\$7*\$I4)

That is, square the count in cell B4. Then divide the result by the product of the total frequency for row 4, in cell I4, and the total frequency for column B, in cell B7.

Notice the use of dollar signs in the addresses of the marginal cells B7 and I4. They make mixed references of the two addresses, so that you can drag cell B7 to the right without disturbing references to column I, and drag it down without disturbing references to row 7.

Now, copy and paste, or drag using the fill handle, from cell B10 into H10. Select B10:H10 and drag down into B12:H12. You now have the matrix in B10:H12.

Enter this formula in cell C14:

=I7*(SUM(B10:H12)-1)

That returns the value 1009.50319. Compare it to the Pearson chi-square value returned by Desc in Figure 2.30, row 10.

The analysis shown in Figure 2.32 returns the same result with an extra step, but it shows the comparison of the original to the expected values more clearly.

The expected counts are calculated explicitly in the range B10:H12. Each expected count is calculated by taking the product of the marginals and dividing by the table’s grand total. For example, the formula in cell B10 is

=\$I4*B\$7/\$I\$7

The dollar signs in the formula serve the same function as in Figure 2.31: They enable you to copy it and paste it through column H and down into row 12. Notice that the resulting marginals for the expected counts, in the ranges B13:H13 and I10:I12 are identical to the marginals shown for the observed counts.

Then the observed and expected counts are compared using this formula in cell B16:

=(B4-B10)^2/B10

No dollar signs are used because we don’t need to constrain any addressing to the marginal cells. The formula simply squares the difference between the observed and the expected values, and divides by the expected value.

Clearly, the greater the discrepancies between the observed and the expected values, the greater the values calculated by the formula—and, therefore, the greater the sum of the values calculated in B16:H18. And the greater that total, the greater the resulting value of chi-square, which is calculated in cell C21 with this formula:

=SUM(B16:H18)

Notice that the resulting chi-square value is identical to the one reported in Figure 2.31. You can test the reliability of that value by using Excel’s CHISQ.DIST.RT( ) function. It’s used in Figure 2.31, cell C16, and Figure 2.32, cell C23. The two results are identical because the values of chi-square are identical. The formula as used in Figure 2.32 is

=CHISQ.DIST.RT(C21,C22)

The second argument, C22, is the degrees of freedom for the test. When used in this fashion, to test the independence of classifications in a contingency table, the degrees of freedom equals the number of categories in one variable, minus 1, times the number of categories in the other variable, minus 1. With seven categories for Driver and three for Area, the degrees of freedom is (7−1)*(3−1), or 12.

You’ll note that the probability of getting so large a chi-square if the two variables are actually independent of one another is reported by Excel as approximately 1 in 2 times 10 to the 208th power. It’s reported by Desc as approximately 1 in 2 times 10 to the 16th power.

I cannot explain the discrepancy. I can say that such discrepancies tend not to turn up with smaller, and thus much more likely, values for chi-square. Comparing such ridiculously small probabilities is akin to asking whether the volume of 208 atoms is meaningfully greater than the volume of 16 atoms.

### The Likelihood Ratio

When you use it in the same sort of setup—assessing a contingency table—as you use Pearson’s chi-square, the likelihood ratio provides you the same sort of inference as I discussed in the prior section. The likelihood ratio is probably used more frequently in the context of logistic regression, where it helps assess the reliability of an equation developed using maximum likelihood techniques.

Figure 2.33 shows how the likelihood ratio is calculated when used with contingency tables. The basic difference between the likelihood ratio and Pearson’s chi-square is that the likelihood ratio evaluates the ratio of the observed frequencies to the expected frequencies, whereas Pearson’s chi-square evaluates the result of subtracting one from the other.

Notice that the likelihood ratio of 1020.859 in Figure 2.33 is very close to the chi-square value of 1009.503 in Figure 2.32. In fact both statistics follow the theoretical chi-square distribution when the contingency table’s variables are mutually independent. (And therefore, if you get a highly unusual chi-square or likelihood ratio, as here, you have to consider that the variables are not independent.) The distribution of the two statistics comes closer and closer to the theoretical chi-square distribution as the sample size increases: Both statistics are consistent estimators.

However, Pearson’s chi-square converges to the theoretical distribution more quickly (that is, sooner as the sample size increases) than does the likelihood ratio. Furthermore, when the average number of observations per cell is less than 5, the likelihood ratio is usually a less accurate estimate of chi-square than Pearson’s formulation.

Although you can doubtless find counterexamples, a good general approach is to regard Pearson’s chi-square as your primary criterion and the likelihood ratio as a confirming measure. Because the Desc function supplies both statistics by default, you might as well use both, and it’s pretty easy to calculate both using Excel worksheet formulas as shown in Figures 2.32 and 2.33.

### The Mantel-Haenszel Chi-square

The third test statistic provided by the Desc function when you specify two factors such as Area and Driver is the Mantel-Haenszel chi-square. It differs in several important ways from the Pearson chi-square and the likelihood ratio. Of the differences, the two most important are these:

• The variables that define the table are assumed to be based on an ordinal scale. This often means that the variables take on values such as “First,” “Second,” and “Third.” Then, Desc regards the variable as a factor (and it takes two factors for Desc to return the Mantel-Haenszel chi-square). The Desc function cannot distinguish between an ordinal “First” and a nominal “Camden,” so it’s up to you to decide whether the nature of the variables’ scales complies with the test’s assumptions.

• The Mantel-Haenszel chi-square is normally used when a third variable is in play, in addition to the two that define a two-dimensional, row-by-column contingency table. The levels of this third variable are often termed strata.

It’s when strata exist—and when you can account for them—that analysis using the Mantel-Haenszel approach becomes most valuable. The classic study dates to the 1970s, when the University of California at Berkeley came under fire for apparent sex discrimination in admission to its graduate programs. A review of admission statistics at Berkeley for 1973 showed that 44% of male applicants were admitted, but only 35% of female applicants. The chi-square value for the 2-by-2 (admission status by sex) contingency table was significant at the 0.001 level. That appears to be prima facie evidence of sex discrimination.

On closer examination, however, it turned out that the individual departments’ admissions decisions tended to favor female applicants. (In a breakdown of admission by sex on a department by department basis, the 2-by-2 tables for each department are the strata in Mantel-Haenszel analysis.) That’s an apparent contradiction, one that occurs widely enough in different studies that it has its own name: Simpson’s Paradox.

It’s not really a paradox, though. What happened is that some departments admitted a smaller percentage of applicants, regardless of their sex. But females applied for admission to those departments more often than they did to the departments that admitted a higher percentage of applicants.

Viewed from that perspective, the apparent sex bias disappears. Consider a (hypothetical) program in thermonuclear physics. It might be that 50 men and 100 women applied for admission and 20% of each sex were admitted. In a different program on biostatistics, 50 men and 50 women applied and 50% of each sex were admitted. Combining the records for the two courses of study, 35 of 100 men were admitted, or 35%, whereas 45 of 150 women were admitted, or 30%. But viewed at the department level, 20% and 50% of each sex were admitted.

The Mantel-Haenszel chi-square test takes account of differences in the strata (which some writers have referred to as “lurkers”). As R’s Desc function implements it, when only two dimensions such as Area and Driver are supplied, the test uses only that single table. Figure 2.34 shows how the statistic is calculated in that case.

The values for the variables Area and Driver in the Pizza data set are given in the range A2:B1195 in Figure 2.34. Just for illustration, assume for the moment that those values are members of an ordinal scale. The first task is to convert the text values to numerics: 1 through 3 for Area and 1 through 7 for Driver. The corresponding pairs of values for the two variables are shown in D1:E4 and D7:E14.

Now Figure 2.34 uses Excel’s VLOOKUP( ) function to convert the text values in A2:B1195 to numerics. For example, the formula in cell G2 is

=VLOOKUP(A2,\$D\$2:\$E\$4,2)

and in cell H2 the formula is

=VLOOKUP(B2,\$D\$8:\$E\$14,2)

Cells G2:H2 are copied and pasted down through row 1195.

The formula for the Mantel-Haenszel chi-square test is:

(n – 1) * R2

where n is the number of paired observations in the contingency table (here, that’s 1194) and R2 is the R squared for the paired values. Excel has a worksheet function, RSQ( ), that returns the R squared value for exactly two variables, so the Mantel-Haenszel chi-square value is given in cell K5. The degrees of freedom for the test is 1, so the probability of the value in cell K5 is returned in cell K9 by

=CHISQ.DIST.RT(K5,1)

Compare the values in cells K5 and K9 with those returned by Desc in Figure 2.30.

### Estimating the Strength of the Relationships

The versions of chi-square that Desc reports—Pearson’s chi-square, the likelihood ratio, and the Mantel-Haenszel chi-square—are meant primarily to enable a comparison of the tabled values with a distribution, chi-square, whose characteristics are as thoroughly studied and well known as those of the normal curve. But the value of chi-square is heavily influenced by the degrees of freedom and the sample size involved, which makes it very difficult to use chi-square to assess the strength of the relationship between the variables.

That’s what statistics such as the phi coefficient, the contingency coefficient and Cramer’s V are for, and they’re the final three statistics reported routinely by Desc when you request the bivariate analysis of two factors.

Figure 2.35 shows how all three are calculated.

The phi coefficient is generally used in situations that involve two dichotomous variables, therefore a 2-by-2 contingency table. The Desc function calculates and returns the phi coefficient regardless of the size of the table’s dimensions, though, so it’s up to you choose the proper coefficient based on your knowledge of the variables that define the table. If they’re both dichotomies, the phi coefficient will quantify the strength of their relationship for you.

As usual with these statistics, the phi coefficient relies on the comparison between the observed cell frequencies and the cell frequencies that are expected, given the marginal frequencies.

Figure 2.35 shows how the phi coefficient is calculated. The Pearson chi-square is calculated in cell C15, just as is done in cell C14 of Figure 2.31. In cell E15, that chi-square value is divided by the total number of observations in the table, and the square root of the ratio is taken to return the phi coefficient.

Bear in mind that the phi coefficient is an appropriate measure of the strength of the relationship between the variables that define the table represents the cross tabulation of two dichotomies. It’s interesting to note that if the table’s entries were represented as an Excel list or an R data frame with each observation representing a different record with a 0 or a 1 on each variable, the ordinary Pearson correlation between the two variables would equal the phi coefficient. But the dichotomous nature of both variables imposes additional constraints on the phi coefficient, so it would be misleading to think of the phi coefficient as an exact analog of the Pearson correlation coefficient.

When you have more than two values for either or both variables, as in Figures 2.30 and 2.35, you usually prefer the contingency coefficient to the phi coefficient. The contingency coefficient is calculated very easily once you have the chi-square value for the table. In Figure 2.35, cell E17 returns the contingency coefficient with this formula:

=SQRT(C15/(C15+I6))

That is, the square root of the ratio of the chi-square value to the sum of chi-square and the number of observations.

Cramer’s V, shown in cell E19 of Figure 2.35, again measures the strength of the association between two variables measured on nominal scales. Many analysts prefer it to the phi or the contingency coefficient because its upper limit, like Pearson’s correlations, is 1.0 without any special constraints set on how the values are distributed. The formula for Cramer’s V as used in cell E19 is

=SQRT(C15/(I6*2))

where cell C15 contains the value for chi-square, cell I6 contains the number of observations in the contingency table, and 2 is the smaller of

• The number of rows in the table, minus 1

• The number of columns in the table, minus 1

• 🔖 Save To Your Account

### InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

## Overview

Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

## Collection and Use of Information

To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

### Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

### Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

### Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

### Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

### Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

### Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

## Other Collection and Use of Information

### Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

### Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

### Do Not Track

This site currently does not respond to Do Not Track signals.

## Security

Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

## Children

This site is not directed to children under the age of 13.

## Marketing

Pearson may send or direct marketing communications to users, provided that

• Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
• Such marketing is consistent with applicable law and Pearson's legal obligations.
• Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
• Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

## Correcting/Updating Personal Information

If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

## Choice/Opt-out

Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

## Sale of Personal Information

Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

## Supplemental Privacy Statement for California Residents

California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

## Sharing and Disclosure

Pearson may disclose personal information, as follows:

• As required by law.
• With the consent of the individual (or their parent, if the individual is a minor)
• In response to a subpoena, court order or legal process, to the extent permitted or required by law
• To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
• In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
• To investigate or address actual or suspected fraud or other illegal activities
• To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
• To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
• To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.