- Descriptive Statistics in Excel
- Using R's DescTools Package
- Entering Some Useful Commands
- Running Bivariate Analyses with Desc
- Analyzing One Factor by Another: The Contingency Table
Running Bivariate Analyses with Desc
The Desc function can handle bivariate analyses as well as the univariate analyses discussed so far in this chapter. For example, analyzing two numeric variables gets you correlations between the two variables, and analyzing a numeric variable and a factor gets you a breakdown of the numeric variable by each level of the factor.
In Excel, and in most applications that are intended specifically for statistical analysis, you decide which analysis you want (ANOVA, multiple regression, correlation, factor analysis and, so on) and supply the names of the variables that you want the analysis to work with.
In the DescTools package, you can supply the names of two variables separated by a tilde (~) to inform the Desc function that you want to analyze the first variable by the second. For example:
Desc(temperature ~ delivery_min, data=d.pizza, plotit=TRUE)
Here, the command instructs the Desc function to analyze Temperature by Delivery_min (minutes required to deliver the order). You also see an alternative way to identify the data frame and the variables of interest. It was easy enough to use this sort of structure:
in earlier examples of univariate analysis in this chapter. That gets a little clumsy when two or more variables are involved, so the Temperature by Delivery_min analysis simply names the variables, followed by “data =” and the name of the data frame. If you’re really in a hurry, you could even omit the “data =” portion, along these lines:
Desc(temperature ~ delivery_min, d.pizza, plotit=TRUE)
Two Numeric Variables
In the previous example, the Desc function notices that the two variables specified in the arguments are both numeric variables, and therefore selects correlations as the type of analysis to carry out. This is a very different approach than an experienced Excel user is accustomed to. Excel requires that the user choose a type of analysis—for example, CORREL( ) for correlation or LINEST( ) for regression. Then the user hands the worksheet addresses of the variables of interest to the function. For example:
But using DescTools, the user chooses the sort of analysis by naming the variables to analyze. The Desc function notices the type of each variable and chooses accordingly the sort of results to return. In the prior example, Desc notices that both variables are numeric and therefore correlations are called for.
Figure 2.18 shows the statistical analysis of Temperature with Delivery_min.
Figure 2.18 The user’s knowledge of the nature of the variables determines which correlation is the one to attend to.
Figure 2.19 This is the approach that R uses to calculate Kendall’s tau.
Figure 2.19 shows how R’s DescTools package (as well as R’s supporting Kendall package) calculates Kendall’s tau. Recommendations for how best to calculate tau have changed over the years, but they all are based on counting agreements and inversions (termed concordant and discordant pairs in some sources). Most other efforts at developing innovative measures of correlation have depended on the original Pearson product-moment approach, pairing a record’s distance from the mean on one variable with its distance from the mean on the other variable. For example, Spearman’s correlation coefficient for ranks is little more than Pearson’s correlation calculated on ranks rather than on interval or ratio variables.
One way to conceptualize Kendall’s tau is to sort the records in descending order on, say, Variable A. Then, for each instance of Variable B, count the number of times each instance is greater than the remaining instances; that’s the number of agreements. Also count the number of times each instance is less than the remaining instances; that’s the number of inversions. The number of agreements less the number of inversions is often termed S. If there are no tied ranks, divide S by n(n − 1)/2 to get Kendall’s tau.
If you look in statistics texts dating back to the 1970s you’ll find some wildly idiosyncratic calculation methods. The one I got stuck with in college had you sort the records according to the rankings on one of the two variables. Then you drew a line between the two 1’s, another between the two 2’s, and so on. Finally, you counted the number of intersecting lines. That gave you the number of inversions. And yes, we had personal computers back then, but that method seemed to be more geared to an Etch-a-Sketch than to an HP.
Matters are more sophisticated now, although I admit that you get a better sense of the nature of the difference between Kendall’s approach and Pearson’s product-moment approach by thinking through the old agreement/inversion dichotomy. The approach taken by Desc, and shown in Figure 2.19, begins (as do most of the approaches to calculating Kendall’s tau) by sorting the records with one of the two sets of ranks as the sort key. Then, Figure 2.19 makes use of this array-formula in cell D2:
In close to 30 years of using Excel, and looking over Excel worksheets prepared by others, I don’t believe that I’ve encountered the SIGN( ) worksheet function more than two or three times. It takes a number as its argument and it returns a 1 if the number is positive, 0 if the number is 0 and −1 if the number is negative.
As used here, the SIGN( ) function subtracts each value in A3:A13 from the value in A2. Because the process started by sorting the records into ascending order by the values in column A, we expect all the values in A3:A13 to exceed or equal the value in A2. The results of the subtractions will therefore be either negative or zero, and the SIGN( ) function will therefore return either −1 or 0.
The array formula performs a similar operation on the differences between the values in B2 and those in B3:B13. When both instances of the SIGN( ) function return either +1 or −1, their product is +1, and we have an “agreement” or “concordance.” Setting aside the issue of ties for the moment, both the A2 value and the B2 value are either less than or larger than a pair of values in a subsequent row. And enclosing the SIGN( ) functions within the SUM( ) function adds up the results.
In Figure 2.19, notice that the values in cells A2 and B2 are less than their counterparts in rows 4, 5, and 7 through 13, resulting in 9 instances of +1 as the product of the two SIGN( ) functions. Only in row 6 does one SIGN( ) return a 1 and the other a −1, for a product of −1. And the result for cell D2 is 8: that is, 9 plus −1.
Dragging the array formula in cell D2 down through D12 causes the cells that are compared to those in subsequent rows to increment to D3, D4, and so on. By making mixed references to the final cells in row 13 (A$13 and B$13), the comparison ranges change from A3:A$13 to A4:A$13, then A4:A$13 to A5:A$13, and so on.
Cell D14 totals the results of the array formulas in D2:D12 and returns S, the numerator for tau. When there are no tied ranks, tau is given by this formula:
Tau = S / (n*(n − 1)/2)
But ranks are often tied. Therefore, another series of calculations is needed, shown in the range F2:I12 of Figure 2.19. We need to know the total number of records that belong to tied ranks. In Excel, the quickest way to do that is by way of Excel’s Advanced Filter. I discuss that process in detail in the next section of this chapter. Briefly, though, here’s how to get the necessary calculations in Figure 2.19:
Make sure that you have no data in columns F and H. (The Advanced Filter overwrites existing data and you can’t get it back with Undo.)
Click the Advanced link in the Sort & Filter group of the Ribbon’s Data tab.
Choose Copy to Another Location.
Click in the List Range edit box and drag through A1:A13.
Click in the Copy To edit box and then click in cell F1.
Fill the Unique Records Only checkbox.
Click OK. You’ll get a list of the unique ranks found in column A.
Repeat steps 2 through 7 for the ranks in column B, placing the unique ranks in column H.
Array-enter this formula in cell G2:
Array-enter this formula in cell I2:
Array-enter this formula in cell G11 to get the total ties for Variable A, termed Kx:
Array-enter this formula in cell G12 to get the total ties for Variable B, termed Ky:
Get the total number of possible comparisons for the twelve records on one variable, returned by n*(n − 1)/2 where n is the number of records. Enter this formula in cell G14:
Calculate tau with this formula in cell G16:
Drag it down through G3:G9.
(Notice that subtracting 1 in the expression (G2:G9-1)) causes any non-tied rank to return a value of 0.)
To demonstrate that the fourteen steps just given return the same outcome as does Desc, see Figure 2.20. That figure carries out the calculations on Temperature and Delivery Minutes from the Pizza data set. Compare the tau shown in cell M12 of Figure 2.20 with that shown in cell B11 of Figure 2.18 and calculated by Desc.
Figure 2.20 Kendall’s tau applied to the relationship between the Pizza data set’s Temperature and Delivery Minutes.
Figure 2.20 also shows the standard Pearson correlation coefficient, calculated on the original interval variables in columns A and B, and also Spearman’s rank correlation coefficient, which is simply the Pearson correlation applied to the variables’ ranks. Excel’s CORREL( ) worksheet function is used in both cases. Compare the results to those returned by Desc in the range B9:B10 of Figure 2.18.
Figure 2.21 shows the scatter chart that you get when you set the plotit argument to TRUE in an instance of the Desc function that names two numeric variables.
Figure 2.21 Notice the presence of confidence intervals around the smoother.
The line that resembles an Excel polynomial trendline, with shaded borders, is often termed a smoother. It is not a polynomial trendline but a line produced by locally weighted scatterplot smoothing, or LOWESS (that’s the original acronym; in recent years the tendency is to put it in lower case and abbreviate it as loess). The technique uses local values (that is, a subset of X values near one another) on the X variable to calculate a smoothing fraction for the Y variable. The value of the fraction typically varies with the X values.
If you call for plotit=TRUE via the Desc function, the confidence intervals around the smoother are 95% by default. If you want to change the default values, there’s just a little additional work:
> plot(temperature ~ delivery_min, data=d.pizza)
> lines(loess(temperature ~ delivery_min, data=d.pizza), conf.level = 0.99,args.band = list(col=SetAlpha(“blue,” 0.4), border=“black”))
Note in the third command, the confidence interval has been set to the 0.99 confidence level. The plot command enables the creation of the scatterplot itself. The lines command specifies the variables used in the calculation of the smoother and provides arguments that you can use to specify the size of the confidence intervals and the colors of chart components. The results appear in Figure 2.22.
Breaking Down a Numeric Variable by a Factor
Suppose that you were interested in whether the temperature of delivered pizzas varied as a function of the area they were delivered to. You might start out by adopting an alpha level to use if you decided eventually on an inferential test such as the analysis of variance (ANOVA). You would want to do this beforehand in order that your choice of alpha could not be based on before-the-fact knowledge of how the mean temperatures turned out.
That choice made, you could get preliminary information about the mean temperatures using the Desc function:
> Desc(temperature ~ area, d.pizza)
The results appear in Figure 2.23. Again, the type of analysis that Desc returns depends on the scale of measurement used for each variable. In this case, it’s a numeric variable viewed as a function of a factor.
Figure 2.23 The Kruskal-Wallis test is a nonparametric version of a one-factor ANOVA.
Figure 2.24 Excel’s pivot tables provide much less information about quantiles than about means, standard deviations, and counts.
If you include a request for charts in the command, you get a box-and-whiskers plot as well as a means plot for each grouping level. The command might look like this:
> Desc(temperature ~ area, data = d.pizza, plotit = TRUE)
See Figure 2.25.
Figure 2.25 The box-and-whiskers plots give you a quick idea of whether any serious skewness exists within a grouping level.
As the start of this section notes, the Desc function can break down a numeric variable such as Temperature by a factor such as Area. It provides information that’s useful for assessing the reliability of differences in group means.
When you think about testing the reliability of differences in group means, you think of t tests when there are just two group means to consider, and you think of ANOVA when three or more group means are involved.
In some cases, you might also think of the Kruskal-Wallis analysis of variance by ranks. This is a nonparametric test that’s designed with these considerations in mind:
Important assumptions made by the standard ANOVA may not be met by the available data.
The data can provide at least ordinal information. In other words, if you have three groups of cars and all you know about any given car is its make (a nominal variable), this test won’t help. But if you know each car’s ranking on miles per gallon, you might be in business.
The data (in the case of the Kruskal-Wallis test, the rankings) meet various other assumptions. For example, it’s assumed that the rankings can be partitioned into groups that are equally likely if the grouping factor actually has no effect on the rankings.
Now, if the data do not violate any important assumptions made by traditional ANOVA techniques to any important degree, you’re better off running ANOVA than Kruskal-Wallis. Other things equal, parametric tests such as ANOVA have more statistical power than do their nonparametric counterparts.
That being so, it baffles me that the Desc function goes to the trouble of calculating and displaying a Kruskal-Wallis outcome but does not bother with an ANOVA.
Whatever the rationale, it’s important to see how the Kruskal-Wallis test is implemented so that you’re in a position to compare the outcome in R with the outcome in Excel. As usual, it’s much easier to understand the calculations in Excel than in R. Also as usual, the steps to get to the results are more tedious using Excel than using R. But in the case of the Kruskal-Wallis test, using Excel is much more tedious than using R.
Figure 2.26 shows how you would start to go about a Kruskal-Wallis test in Excel.
Figure 2.26 The Kruskal-Wallis test is much less onerous if there are no tied ranks, but such ties are the norm.
To prepare for the analysis in Figure 2.26, I first deleted some records from the Pizza data file. I removed all the records with the value N/A for Area (10), and also those with the value NA for Temperature (38). Of the original 1,209 records, 1,161 remained with valid values for both Area and Temperature. Those records are in the range A2:B1162 in Figure 2.26.
Column C contains the worksheet function RANK.AVG( ), which returns the rank of each record. For example, this formula is in cell C2:
That formula returns the value 724.0. That means that the value 53, in cell B2, is the 724th lowest value in the temperatures recorded in B2:B1162: there are 723 values lower than 53.
Notice the third argument to the RANK.AVG function, the numeral 1. It tells Excel to act as though the data in B2:B1162 were sorted in ascending order. If that argument were 0 or omitted, Excel would act as though the data in B2:B1162 were in descending order. In that case, a Temperature value of 53 would be reported as the 438th highest temperature value instead of the 724th lowest.
The AVG tag to the RANK function tells Excel what value to assign in the event of identical values on the ranked variable: that is, what to do when, for example, there are two instances of a Temperature value of 27. Notice in Figure 2.26 that the value in cell B6 is a Temperature of 27 and in cell C6 a Rank of 65.5.
If the Temperature values in column B were sorted, two records with the value 27 would be adjacent. If the sort order were ascending, their ranks would be 65 and 66. The AVG tag tells Excel to assign the tied records the average of their ranks—in this case, 65.5.
If the RANK.EQ( ) function were used instead of RANK.AVG( ), tied records would also have the same rank. But instead of averaging two ranks as just described, Excel assigns the first available rank to both. So the numbers 21, 22, 22, and 23 would be ranked as 1, 2, 2, and 4.
The next step is to get a list of the unique ranks returned by the RANK.AVG( ) function. Because ties exist in the list of temperatures, resulting in average ranks such as 65.5, we can’t simply list each rank as an integer running from 1 to 1161. Instead, we rely on the Advanced Filter to return unique values from column C. Here’s how:
Copy the label in cell C1 to cell E1, or make sure that E1 is empty.
Go to the Ribbon’s Data tab and click the Advanced icon in the Sort & Filter group.
Choose the Copy to Another Location option button.
Enter E1 in the Copy To box.
Fill the Unique Records Only checkbox and click OK.
The result in column E is one value for each rank, tied or not. That is, suppose that in column C, rank 5 referred to exactly one record, and rank 6.5 is shared by the records that come in 6th and 7th. In column E, we want to find one instance of rank 5 and one instance of rank 6.5. The reason we need a list of unique ranks is that we’re getting ready to count the number of times that each rank occurs in column C. If we had multiple instances of the same rank in column E, such as two instances of rank 6.5, we would wind up with an over-count of the records for each rank. To continue the example, we’d count the number of records in column C associated with a rank of 6.5 the first time it occurred, and then again (erroneously) the second time it occurred.
The next step is to count the number of instances of each rank, whether tied or not. Array-enter, using Ctrl + Shift + Enter, this formula in cell F2:
If you prefer, you could instead use the following formula, entered normally:
The formulas return the same result and the choice is a matter of personal preference. (I like the array formula approach. It’s a little costlier in terms of RAM but I think it gives me better control over the conditions I set.)
Either formula counts the number of instances of each rank in column C. So, for example, cell F2 shows that column C contains one instance of rank 1.0, and cell F5 shows that column C contains two instances of rank 4.5. Those two latter cases would have ranks 4 and 5, but because they share the same value for Temperature, the RANK.AVG( ) function assigned each record the average of 4 and 5, or 4.5.)
At this point, our task is to get a count of the number of 2-record tied ranks, the number of 3-record tied ranks, and so on. Those counts are needed to calculate the basic inferential statistic of the Kruskal-Wallis test, denoted H. The counts are also needed to calculate a correction factor, denoted C, that is necessary when, as here, tied ranks exist.
To get those counts of 2-record ties, 3-record ties and others, begin by using the Advanced filter once again, this time with the counts in column F as the basis. The result appears in column G, which shows that column F contains at least one 1-record “tie,” at least one 2-record tie, on up to at least one 14-record tie.
With the types of ties identified in column G, we can get a count of each type in column H, again using an array formula that combines a conditional IF with an arithmetic SUM. In cell H2, array-enter this formula:
or this formula, entered normally:
Copy the formula down through column H until you reach the final unique type of tie—in Figure 2.26, that’s cell H14.
Cell H2 now shows the number of 1-record “ties” to be 128, the number of 2-record ties to be 63, the number of 3-record ties to be 57, and so on. I put the word “ties” in quotes because the associated ranks are single-record ranks: There are 128 ranks that involve one record only, such as the one record with a Temperature of 53 and a rank of 724. For the moment it’s useful to treat them as legitimate ties because it helps to confirm the count of the numbers so far.
For example, notice the number 1161 in cell H16. It’s calculated by summing the products of the values in G2:G14 and the values in cells H2:H14. The result is the number of records that are ranked in column C. By summing H2:H14 we are summing the number of records represented by each type of tie (1-record, 2-record, 3-record, and so on). If the result of that summation differed from the number of records ranked in column C, it would tell you that an error in counting had occurred somewhere along the line.
The Kruskal-Wallis analysis finishes up in Figure 2.27.
In Figure 2.27, column A is based on column G in Figure 2.26; I have omitted the 1-record “ties” in Figure 2.27 because they no longer provide useful information after the confirmation of the counts has been made. The counts of each type of tie appear in column B of Figure 2.27.
Column C begins the process of calculating a correction for the presence of ties. The formula in cell C2 is
In words, take the number of records involved in each 2-record tie (obviously, that’s 2) and cube it. Subtract the number of records in each 2-record tie (again, 2, obviously). In cell C2, multiply the difference by the number of such ties in the original data set, found in cell B2. Repeat this process for the remaining types of ties, found in A3:A13.
Calculate the main statistic, H. For each area (shown in the range E2:E4), calculate the sum of the ranks found in that area and square the sum. The array formula to manage that for the Brent area is in cell F2:
The formula is an array formula and must be entered using Ctrl + Shift + Enter. It tests whether a value in A2:A1162 on the worksheet named “Fig 2.26” equals the value in cell E2, which is the area named Brent. If the value in column A equals Brent, the formula adds the corresponding value in column C; otherwise, it adds zero. More briefly, the formula sums the ranks of all the records for the Brent area. At the end of the formula, that sum is squared.
The formula is copied down into F3 and F4. That’s the reason for the absolute references to the values in columns A and C: so that the formula can be copied and pasted down two rows without the references to those two columns adjusting.
Now the three squared sums in F2:F4 are each divided by the number of records that they comprise. For example, the Brent area includes 467 of the 1161 records, so the formula in cell G2 is
Similar formulas, adjusted to correspond to the other two areas, are in G3:G4. I could have written a formula to calculate the number of records in each area, and that’s probably the safer method in case any of the original records change. But at some point we all have to decide between the simplicity of a number and the additional safety of a formula. In this sort of case, where I don’t expect to conduct a particular inferential test more than once, I opt for simplicity.
Finally we’re in a position to calculate H. Cell F8 contains this formula:
The formula contains several values that do not change, regardless of the number of original records you have. Let N represent the number of original records, which here is 1161. Also, let S represent the sum of the squared rank totals divided by the number of records in each group—represented in the prior formula by SUM(G2:G4). Then the formula for H is
H = 12/(N*(N+1))*S-(3*(N+1))
The only two quantities that can vary in that last formula are N, the total count, and S, the sum of the squared sums of ranks divided by the record count in the groups. The 12, the 3, and the 1s are all invariant. If there were no ties in the rankings you would compare the value of H to the chi-square distribution with G−1 degrees of freedom, where G is the number of groups. In this case, G equals 3, so the degrees of freedom for the chi-square test is 2, and the value of H is 115.827, shown in cell F8.
Suppose you had started out this analysis by deciding to reject the null hypothesis of no difference between the group medians if the obtained value of H exceeded the 95th percentile of the chi-square distribution. You can obtain that value in Excel with this formula:
where 2 represents the degrees of freedom for the distribution. That formula returns the value 5.99 (see cell F12 in Figure 2.27). The obtained value of H is almost 20 times that value. You would get a value of 5.99 only 5% of the time when the populations represented by the groups have the same median temperatures. A value of 115.827 is vastly less likely than that, if the medians are equal, so you can confidently reject the null hypothesis. (I’ll return to the topic of the chi-square value shortly.)
But in this case tied ranks exist, and that means that the distribution of H does not follow the chi-square distribution as closely as it does when there are no tied ranks. A correction factor, usually designated as C, is needed. It’s calculated in cell F9 with this formula:
The value 1162 is of course the number of original observations plus 1. The use of the values in C2:C13 is limited to calculating C, so if you have no tied ranks you can skip calculating those values. After you have C, divide H by C to obtain H’, a version of H corrected for the presence of tied ranks. With so many cases—1161—the correction made by C is very small.
However, notice that the value of H’ of 115.83, in cell F10 is precisely equal to the value returned by Desc as shown in Figure 2.23.
Notice also in Figure 2.23 that R states that the likelihood of observing an H of 115.83 with 2 degrees of freedom is less than 2.22e-16. This differs from the value returned by Excel’s chi-square function:
which results in 7.04426E-26. That’s an extremely large difference between two extremely small numbers. I chalk it up to differences in the algorithms used by Excel and R, and I don’t worry about it. Both values mean that the obtained H is overwhelmingly unlikely in the presence of equal medians for the three groups. And I remind myself that neither value is likely to be correct. It’s very difficult indeed to measure the area under a curve when you’re working so far from the center of a theoretical distribution.
One other point to bear in mind: The data in the Pizza data set, first, is probably fictional, and, second, would be a grab sample if it were genuine. One of the assumptions involved in this sort of test, whether you run an ANOVA or a Kruskal-Wallis one-way sum-of-ranks test, is that the groups constitute independent samples. I don’t know where Brent is, or Camden or Westminster, but it strikes me that if they’re all served by the same pizza delivery service they must be close together. As such, the samples are spacially dependent. They may well have similar traffic patterns, crime rates, ambient external temperatures, and so on, rendering the samples much more similar than would be the case with other areas that represent the same populations.
Given that sort of threat to the validity of the statistical test, worrying about the difference between two extremely small probability measures is worry misplaced. I don’t mean to ding the Pizza data set here: It serves its purpose of demonstrating the results you can expect to get from the DescTools package.