# Descriptive Statistics

Drawing on his immense experience helping organizations gain value from statistical methods, Conrad Carlberg shows when and how to use Excel, when and how to use R instead, and how to use them together to get the best from both. Here he discusses how descriptive statistics tools in Excel and R can help you understand the distribution of the variables in your data set.


Regardless of the sort of analysis you have in mind for a particular data set, you want to understand the distribution of the variables in that set. The reasons vary from the mundane (someone entered an impossible value for a variable) to the technical (different sample sizes accompanying different variances).

Any of those events could happen, whether the source of the data is a sales ledger, a beautifully designed medical experiment or a study of political preferences. No matter what the cause, if your data set contains any unexpected values you want to know about it. Then you can take steps to correct data entered in error, or to adjust your decision rules if necessary, or even to replicate the experiment if it looks like something might have gone wrong with the methodology.

The point is that sophisticated multivariate analyses such as factor analysis with Varimax rotation or Cox Proportional Hazards Regression do not alert you when someone entered a patient’s body temperature on Wednesday morning as 986 degrees instead of 98.6 degrees. In that case, the results of your sophisticated procedure might turn out cockeyed, but you would have no special reason to suspect a missing decimal point as the cause of your findings.

You can save yourself a lot of subsequent grief if you just look over some preliminary descriptive statistics based on your data set. If a mean value, the range of the observed values, or their standard deviation looks unusual, you probably should verify and validate the way the data is collected, entered and stored before too much time passes.

## Descriptive Statistics in Excel

If you’re using Excel to analyze the data, either as a preliminary check or as your principal numeric application, one way to carry out this sort of work is to point Excel’s various worksheet functions at the data set. MIN( ) gets you the minimum in a set of values, MAX( ) gets you the maximum, and MAX( ) − MIN( ) gets you the range. COUNT( ) counts the numeric values, AVERAGE( ) returns the mean, and either STDEV.S( ) or STDEV.P( ) gets you the standard deviation.
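If you want to cross-check those worksheet functions in R, a minimal sketch follows. The values in the vector named mpg are hypothetical (they are not the data behind Figure 2.1), and the name descriptives is just a label chosen for this example.

```r
# Hypothetical sample values; not the data behind Figure 2.1.
mpg <- c(21, 18, 30, 25, 33, 27, 22, 29, 35, 24, 22)

descriptives <- c(
  minimum = min(mpg),              # MIN( )
  maximum = max(mpg),              # MAX( )
  range   = max(mpg) - min(mpg),   # MAX( ) - MIN( )
  count   = length(mpg),           # COUNT( ), when no values are missing
  mean    = mean(mpg),             # AVERAGE( )
  sd      = sd(mpg)                # STDEV.S( ): n - 1 in the denominator
)

print(round(descriptives, 2))
```

Note that R's own range( ) function returns the minimum and the maximum rather than their difference, which is why the sketch subtracts one from the other.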

Excel comes equipped with a Descriptive Statistics tool in the Data Analysis add-in (which was at one time termed the Analysis ToolPak or ATP). The Descriptive Statistics tool is good news and bad news.

The good news is that using the tool on an existing data set gets you up to 16 different descriptive statistics, and you don’t have to enter a single function on the worksheet. See Figure 2.1.

Figure 2.1 shows 11 values of two different variables in the range A1:B12. A univariate analysis of the variable named MPG appears in the range E3:F18, and a similar analysis of IPS appears in G3:H18. Notice that each statistic reported pertains to one variable only: None of the statistics correlates, for example, MPG with IPS, or reports the means of IPS according to specific values of MPG. The reported statistics are exclusively univariate.

You get them without having to know that Excel has a STDEV.S( ) function that reports the standard deviation of a sample of records, or that the standard errors reported in the fourth row are the standard error of the mean of each variable. The reported statistics are largely useful when you’re getting acquainted with a data set. You’ll normally want to know a lot more than that, of course, but it’s a good place to start.

The bad news is that the reported statistics represent static values. For example, in Figure 2.1, cell F3 contains the value 26 rather than this formula:

=AVERAGE(A2:A12)

Therefore, if you want to change the input values in A2:A12 by adding or deleting a record, or editing an existing value, none of the statistics would change in response. A worksheet formula recalculates when the values it points to change, but the results calculated by the Descriptive Statistics tool do not: They don’t point to anything because they’re just numbers.

### Using the Descriptive Statistics Tool

Using the tool is easy enough. You’ll want your active worksheet to have one or more lists or tables, such as the two shown in columns A and B of Figure 2.1. Go to the Ribbon’s Data tab and locate the Data Analysis link in the Analyze group. When you click it, you see the list box shown in Figure 2.2.

Click Descriptive Statistics and click OK. You’ll get the dialog box shown in Figure 2.3.

Then take these steps:

1. Click in the Input Range box and drag through the range your data occupies. In Figure 2.1, that’s A1:B12. If you have one variable, the range might be A1:A25, or if you have three variables, the range might be B1:D85. Don’t worry if your variables have different numbers of records, but leave unoccupied cells empty and don’t substitute #N/A for an empty cell.

2. Accept the default of Grouped By Columns.

3. If the first row in the input range has labels such as variable names, fill the Labels in First Row checkbox. This instructs Excel not to treat a label as a legitimate data value, which would usually result in an error message that the input range contains non-numeric data.

4. Click the Output Range option button, unless you want the results to be written to a new workbook or a new worksheet. (Worksheet Ply is just an old term for worksheet.) Careful! If you click the Output Range option button, the Input Range box is re-activated and gets filled with any cell or range that you click next. First, click the Output Range edit box and only then indicate where you want the output to begin.

5. Fill the Summary Statistics checkbox if you want the statistics shown in row 3 through row 15 in Figure 2.1.

6. Fill the Confidence Level for Mean checkbox if you want to put a confidence interval around the mean. Enter the confidence level you want in the edit box (often that will be 90, 95, or 99, taken as a percent).

7. Fill the Kth Largest and the Kth Smallest checkboxes if you want that information. Also supply a value for K. That is, if you want the 5th largest value, fill the Kth Largest checkbox and enter 5 in its edit box. (I can’t recall the last time I needed either of these two statistics.)

8. Click OK. Within a few seconds you should see results such as those shown in columns E through H of Figure 2.1.

### Understanding the Results

Here’s a closer look at some of the statistics shown in Figure 2.1. Most of them are precisely what you would expect (mean, median, mode, range, minimum, maximum, sum, count, kth largest, and smallest values). The following may require a little additional information.

#### Standard Error of the Mean

Suppose that the 11 values in A2:A12 of Figure 2.1 are a sample from a population. Now suppose that you took hundreds or even thousands of similar 11-record samples from the same population. Each of those samples would have its own mean value, such as the one shown in cell F3 of Figure 2.1. If you calculated the standard deviation of all those mean values, you would have a statistic called the standard error of the mean. The value in cell F4 of Figure 2.1 estimates that value, so that you don’t have to actually take hundreds or thousands of additional samples. You can calculate that estimate using this formula:

$$s_{\bar{X}} = \frac{S}{\sqrt{n}}$$

where:

• $s_{\bar{X}}$ is the standard error of the mean.

• S is the sample standard deviation.

• n is the number of records in the sample (the count).

The standard error of the mean is often useful when you want to test the difference between an obtained sample mean and a hypothesized value. It is also an integral part of a confidence interval placed around a sample mean.
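Here is a small R sketch of both ideas: the estimate S divided by the square root of n, and the thought experiment of taking many 11-record samples. The sample values, the normal population with mean 26 and standard deviation 5, and the number of simulated samples are all assumptions made for illustration.

```r
# Hypothetical sample of 11 values (not the Figure 2.1 data).
x <- c(21, 18, 30, 25, 33, 27, 22, 29, 35, 24, 22)

# Estimated standard error of the mean: S / sqrt(n)
se_mean <- sd(x) / sqrt(length(x))

# The definition, simulated: draw many 11-record samples from an assumed
# normal population and take the standard deviation of their means.
set.seed(1)
sample_means <- replicate(5000, mean(rnorm(11, mean = 26, sd = 5)))
sd_of_means <- sd(sample_means)

cat("Estimated SE from the one sample:", round(se_mean, 3), "\n")
cat("SD of the simulated sample means:", round(sd_of_means, 3), "\n")
cat("Population SD / sqrt(n):         ", round(5 / sqrt(11), 3), "\n")
```

The second and third figures should land close together, because the standard deviation of the simulated means is estimating the population standard deviation divided by the square root of n.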

#### Standard Deviation

Excel (and the general field of statistics) offers two types of standard deviation:

• Your data constitutes a population. For example, you might have 100 items made in a special production run, after which the mold was broken.

• Your data constitutes a sample. You might have 100 items sampled randomly from an ongoing process that yields millions of products per year.

In the first case, where you have an entire population of values, the formula for the standard deviation uses the actual count of values, n, in its denominator. In the second case you are estimating the population’s standard deviation on the basis of your sample, and you use n − 1 instead of n in the denominator.

In Excel, you use the worksheet function STDEV.P( ) when your data is the population. You use STDEV.S( ) when you have a sample. Possession of a sample occurs much more frequently than possession of a population, so the Descriptive Statistics tool returns the value that you would get if you were using the STDEV.S( ) function.

#### Sample Variance

The variance is the square of the standard deviation, and it’s subject to the same conditions as the standard deviation: That is, the variance has a population form (n) and a sample form (n − 1). The Descriptive Statistics tool returns the sample form of the variance.
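If you want to see the two forms side by side in R, here is a minimal sketch. R's sd( ) and var( ) functions use n − 1, as STDEV.S( ) and VAR.S( ) do. Base R has no direct counterpart to STDEV.P( ) or VAR.P( ), so the population forms are computed by hand here. The data values are hypothetical.

```r
x <- c(21, 18, 30, 25, 33, 27, 22, 29, 35, 24, 22)  # hypothetical values
n <- length(x)

# Sample forms: n - 1 in the denominator
var_sample <- var(x)
sd_sample  <- sd(x)

# Population forms: n in the denominator, computed by hand
var_population <- sum((x - mean(x))^2) / n
sd_population  <- sqrt(var_population)

print(round(c(sd_sample = sd_sample, sd_population = sd_population,
              var_sample = var_sample, var_population = var_population), 3))
```

The population forms are always a little smaller than the sample forms, and the difference shrinks as n grows.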

#### Skewness

Skewness measures the degree of asymmetry in a distribution of values. An asymmetric distribution is said to be skewed. One popular way of calculating skewness is the average of the cubed standard scores (also termed z-scores):

$$\text{Skewness} = \frac{\sum_{i=1}^{n} z_i^3}{n}$$

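The formula that Excel uses approaches the value of the average cubed z-score as n increases:

$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$$

Here s is the sample standard deviation.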

A symmetric distribution is expected to have a skewness of 0.
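Here is a hedged R sketch of the two skewness calculations: the average cubed z-score, and the adjusted form that Microsoft documents for Excel's SKEW( ) function. The function names skew_avg_z3 and skew_excel are made up for this example, the data are hypothetical, and the use of the population (n) standard deviation in the first function is an assumption.

```r
skew_avg_z3 <- function(x) {
  # Average cubed z-score; z is based on the population (n) standard deviation
  n <- length(x)
  z <- (x - mean(x)) / sqrt(sum((x - mean(x))^2) / n)
  mean(z^3)
}

skew_excel <- function(x) {
  # Adjusted form documented for Excel's SKEW( ); z is based on the sample SD
  n <- length(x)
  z <- (x - mean(x)) / sd(x)
  n / ((n - 1) * (n - 2)) * sum(z^3)
}

x <- c(1, 1, 2, 2, 3, 10)   # hypothetical, with a long right tail
print(c(average_z3 = skew_avg_z3(x), excel_form = skew_excel(x)))
```

Both values are positive for these data because of the long right tail, and the two differ noticeably with only six records.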

#### Kurtosis

Kurtosis measures the degree to which a distribution of scores is taller or flatter with respect to its width. Here’s one textbook definition of kurtosis:

$$\text{Kurtosis} = \frac{\sum_{i=1}^{n} z_i^4}{n} - 3$$

It’s very similar to one definition of skewness, except here the z-scores are raised to the fourth instead of the third power. A distribution such as the normal curve would have an average z-score, raised to the fourth power, of 3. Therefore, 3 is subtracted in the formula so that a normal curve would have kurtosis of 0.

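Again, Excel’s formula for kurtosis is slightly different and attempts to remove bias from the calculation of kurtosis based on a sample:

$$\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$

As with skewness, s is the sample standard deviation.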

I’m inflicting these formulas on you so that if R returns different values for skewness and kurtosis than Excel does, you’ll be better placed to understand why.
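The same comparison can be sketched in R for kurtosis: the average z-score raised to the fourth power, less 3, and the adjusted form that Microsoft documents for Excel's KURT( ) function. The function names kurt_avg_z4 and kurt_excel are made up for this example, the data are hypothetical, and basing the first function on the population (n) standard deviation is an assumption.

```r
kurt_avg_z4 <- function(x) {
  # Average z-score to the fourth power, minus 3; z uses the population (n) SD
  n <- length(x)
  z <- (x - mean(x)) / sqrt(sum((x - mean(x))^2) / n)
  mean(z^4) - 3
}

kurt_excel <- function(x) {
  # Adjusted form documented for Excel's KURT( ); z uses the sample SD
  n <- length(x)
  z <- (x - mean(x)) / sd(x)
  n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z^4) -
    3 * (n - 1)^2 / ((n - 2) * (n - 3))
}

x <- c(21, 18, 30, 25, 33, 27, 22, 29, 35, 24, 22)  # hypothetical values
print(c(average_z4 = kurt_avg_z4(x), excel_form = kurt_excel(x)))
```

If an R function you call returns the first value while Excel returns the second, the two tools are using different formulas. Neither is miscalculating.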

#### Confidence Level

The Descriptive Statistics tool’s results refer to the confidence interval value as the Confidence Level (see cell E18 in Figure 2.1). That’s a misnomer. The confidence level is the percentage of sample confidence intervals that you expect to capture the population mean: typically, 90%, 95%, or 99%.

In contrast, the Descriptive Statistics tool reports the quantity that you add to and subtract from the calculated mean so as to arrive at the confidence interval. That quantity is calculated, using Excel function syntax, as

=T.INV.2T(0.05,10)*F4

where cell F4 contains the standard error of the mean. Excel’s T.INV.2T( ) worksheet function returns the positive t value that cuts off some percentage of the area under the t-distribution, such that the remaining percentage is divided evenly between the two tails. So, this use of the function

=T.INV.2T(.05,10)

returns 2.23. That means

• Take a t-distribution, which is very similar to a normal curve but is a little flatter in its center and a little thicker in its tails. Its shape depends partly on the number of observations in the samples used to build the distribution. In this example, the number of observations is 11 and therefore the degrees of freedom is 10.

• The mean of that t-distribution is 0. If you cut the distribution at two points, −2.23 and +2.23, you’ll find that 95% of the area under the curve is between those two cuts.

• And of course, 5% is outside the cuts, with 2.5% in the distribution’s right tail and 2.5% in its left tail.

But we’re not working in a scale of t values, which has a standard deviation that’s slightly larger than 1.

Instead, we’re working with a scale that in Figure 2.1 describes whatever MPG is, with a standard error of 2.63. So, to cut off 2.5% of the distribution at each tail, we multiply ±2.23 by 2.63 to get ±5.86. Adding ±5.86 to the mean of 26 shown in cell F3 of Figure 2.1 gives a 95% confidence interval of from 20.14 to 31.86. Another way of saying this is that we want to go up from the mean by 2.23 standard errors. In this scale, a standard error is 2.63 units, so we go up from the mean by 2.23 × 2.63 units. We go down from the mean by the same amount. The resulting range of values is the 95% confidence interval for this data.
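The same arithmetic can be sketched in R. The mean of 26, the standard error of 2.63, and the count of 11 are the values reported for MPG in Figure 2.1. The variable names are just labels for this example. R's qt( ) function takes a cumulative probability, so the two-tailed 5% becomes 1 − 0.05/2.

```r
mean_mpg <- 26      # mean reported in Figure 2.1
se_mean  <- 2.63    # standard error of the mean reported in Figure 2.1
n        <- 11      # number of records

t_crit     <- qt(1 - 0.05 / 2, df = n - 1)  # counterpart of T.INV.2T(0.05,10)
half_width <- t_crit * se_mean              # what the tool labels Confidence Level
ci <- c(lower = mean_mpg - half_width, upper = mean_mpg + half_width)

print(round(t_crit, 2))      # 2.23
print(round(half_width, 2))  # 5.86
print(round(ci, 2))          # 20.14 and 31.86
```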

You interpret the confidence interval as follows: If you were to take 100 random samples, each from the same population, and put a 95% confidence interval around each of the 100 sample means, you would expect about 95 of those confidence intervals to capture the mean of the population. It is more rational to suppose that the one confidence interval that you did construct is one of the 95 that capture the population mean than one of the 5 that do not. That’s what’s meant by saying that a confidence interval makes you 95% confident that it captures the true population mean.
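You can watch that interpretation play out with a short simulation in R. This is a sketch under stated assumptions: the population is taken to be normal with a mean of 26 and a standard deviation of 5, each sample has 11 records, and 2,000 samples are drawn. None of those values comes from the book's data.

```r
set.seed(42)
pop_mean <- 26   # assumed population mean, for illustration only
pop_sd   <- 5    # assumed population standard deviation
n        <- 11   # records per sample

# For each simulated sample, build a 95% confidence interval around its mean
# and record whether the interval captures the population mean.
covers <- replicate(2000, {
  s <- rnorm(n, mean = pop_mean, sd = pop_sd)
  half_width <- qt(0.975, df = n - 1) * sd(s) / sqrt(n)
  abs(mean(s) - pop_mean) <= half_width
})

coverage <- mean(covers)   # proportion of intervals that capture the mean
print(coverage)
```

The proportion printed should be near 0.95, though not exactly 0.95, because any finite set of samples varies a little around the long-run rate.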

### Using the Excel Descriptive Statistics Tool on R’s Pizza File

Let’s take a look at what Excel’s Descriptive Statistics tool has to say about the numeric variables in R’s pizza delivery database. The first step is to export that database so it’s available to Excel. To arrange that, take these steps:

1. Start R.

2. Load the DescTools package using this R command:

   library(DescTools)

3. Export the data frame named d.pizza by entering this command:

   write.csv(d.pizza, file="C:/Users/Jean/Desktop/pizza.csv")

You can use any legitimate destination name and location for the file. It will be formatted as a comma-separated values file, or csv file, so you might as well use csv as the filename extension. Also notice the use of forward slashes (/) rather than backslashes (\) in the file path: R interprets a single backslash as the start of an escape sequence. In a file path, use either forward slashes (/) or two consecutive backslashes (\\).
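If you would rather not type a full path, here is a minimal sketch of the same export that builds the path with file.path( ) and writes to R's temporary directory. The data frame named deliveries is a small hypothetical stand-in for d.pizza, so the sketch runs without DescTools. The row.names = FALSE argument is optional and simply omits the column of row numbers.

```r
# Small hypothetical data frame standing in for d.pizza
deliveries <- data.frame(
  temperature  = c(53.0, 56.4, NA, 47.1),
  delivery_min = c(20.0, 19.6, 17.8, 37.3)
)

# file.path() joins the pieces of a path with forward slashes
out_file <- file.path(tempdir(), "pizza_sample.csv")
write.csv(deliveries, file = out_file, row.names = FALSE)

# Missing values are written as the text NA, which is what Excel will see
print(readLines(out_file))
```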

When control returns to the R interface—it only takes a second or two—quit R and start Excel. Open the csv file from the location where you stored it. It looks something like the worksheet shown in Figure 2.4.

In Figure 2.4, notice the NA values sprinkled among the legitimate values. The NA values mean not available and, because they’re text, the Descriptive Statistics tool has difficulty dealing with them when they’re supposed to be numbers.

If you want to use Excel’s worksheet functions, you’re in the clear. When a function such as SUM( ) or AVERAGE( ) or STDEV.S( ) encounters a text value such as NA, it simply ignores it. See Figure 2.5 for some examples.

The first ten records in the Pizza database from DescTools appear in cells A2:A11 of Figure 2.5, where the delivery temperature is recorded. This formula is entered in cell A13:

=AVERAGE(A2:A11)

Notice that the formula’s argument includes the text value NA in cell A5. The formula returns the result 46.00.

This formula is entered in cell A16:

=AVERAGE(A2,A3,A4,A6,A7,A8,A9,A10,A11)

The second AVERAGE formula includes every cell in A2:A11 except A5. Both AVERAGE( ) formulas return the same result, 46.00, and you can depend on Excel worksheet functions such as AVERAGE( ) and SUM( ) to ignore text values such as NA when they expect numeric values.
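R is less forgiving by default. Here is a short sketch, using ten hypothetical temperatures with a missing value in the fourth position. These are not the actual values from the Pizza data; they were chosen so that the nine available values average 46.

```r
# Hypothetical temperatures; the fourth record is missing
temps <- c(53, 56, 47, NA, 50, 27, 33, 58, 45, 45)

print(mean(temps))                # NA: one missing value makes the result missing
print(mean(temps, na.rm = TRUE))  # 46: the missing value is dropped first
```

So where AVERAGE( ) skips the text value on its own, in R you ask for that behavior with na.rm = TRUE.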

Still in Figure 2.5, the range D2:E11 contains the same records as A2:A11, but both the temperature and the minutes to delivery appear. If you want to calculate the correlation between those two variables on the basis of those records, Excel manages the missing data by ignoring the entire fourth record because it has a missing value for one of the variables. Notice the correlation between the values in D2:D11 and E2:E11, −0.7449. That value is returned by this formula:

=CORREL(D2:D11,E2:E11)

In the range G2:H10 I have omitted the record with one missing value. The formula that calculates the correlation for that range of data is

=CORREL(G2:G10,H2:H10)

Both formulas return the same result, so the CORREL( ) function ignores the full record when either of the values is text.
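The R counterpart of that behavior can be sketched with the cor( ) function and its use argument. Both vectors below are hypothetical, so the correlation they produce is not the −0.7449 shown in Figure 2.5.

```r
# Hypothetical records; the fourth temperature is missing
temperature <- c(53, 56, 47, NA, 50, 27, 33, 58, 45, 45)
minutes     <- c(20, 19, 25, 30, 22, 41, 35, 18, 27, 26)

# Drop any record with a missing value on either variable, as CORREL( ) does
r_complete <- cor(temperature, minutes, use = "complete.obs")

# The same result, after removing the incomplete record by hand
keep     <- !is.na(temperature) & !is.na(minutes)
r_manual <- cor(temperature[keep], minutes[keep])

print(c(complete_obs = r_complete, by_hand = r_manual))
```

Without the use argument, cor( ) returns NA when either variable contains a missing value.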

Therefore, you’re in good shape if you decide to get the summary statistics from a data set such as the Pizza database using worksheet functions. Things aren’t as straightforward if you decide you want to use the Descriptive Statistics tool in Excel’s Data Analysis add-in.

If you point the Descriptive Statistics tool at a worksheet range that contains text data, it returns an error message to the effect that the range contains text data. See Figure 2.6.

To get the Descriptive Statistics tool to process the data, you have to replace the text data with something else. Obviously, you can’t just make up numbers and enter them in place of the NA values. As it turns out, though, the Descriptive Statistics tool can manage if you replace the text values with empty cells.

So, one possibility is to do a global replace of the NA values with nothing—that is, in the Find and Replace dialog box, enter NA in the Find What box and enter nothing in the Replace With edit box. Excel treats a truly empty cell as missing data, and functions such as AVERAGE( ) and SUM( ) return the correct summary values for those cells that contain numeric values.
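If you still have the data in R, an alternative to the global replace is to write empty fields in the first place, using the na argument of write.csv( ). The following is a minimal sketch with a small hypothetical data frame, not the Pizza data itself.

```r
# Small hypothetical data frame with missing values
deliveries <- data.frame(
  temperature  = c(53.0, NA, 47.1),
  delivery_min = c(20.0, 17.8, NA)
)

out_file <- file.path(tempdir(), "pizza_blank_na.csv")

# na = "" writes each missing value as an empty field instead of the text NA
write.csv(deliveries, file = out_file, row.names = FALSE, na = "")

csv_lines <- readLines(out_file)
print(csv_lines)
```

When Excel opens a file written this way, the missing values arrive as empty cells, so there is nothing to find and replace.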

Unfortunately, with a data set as full of NA values as R’s Pizza data set, those replacement empty cells cause another problem. With as many as 1209 records in the worksheet, you’re apt to want to select a variable for analysis by clicking its label (for example, cell M1 in Figure 2.4), then holding down the Ctrl and Shift keys and pressing the down arrow. This sequence selects all the cells from the active one to the final contiguous non-empty cell below the one that’s active.

But that probably just takes you partway through the data set—perhaps even just 10 records. As soon as you replace an NA value, making (say) M5 an empty cell, your search for the actual final value in the column ends at M4.

An acceptable workaround is to select the entire column by clicking its column header to enter, for example, M:M in the Input Range box (see Figure 2.3). This slows processing down just a bit, but it’s better than hitting the down arrow once for every blank cell in the data’s column.

Another possibility is to use a worksheet function that is designed to handle text values in an otherwise numeric field. AVERAGEA( ) treats text values as legitimate zero values, however, and in averages there’s a difference between a missing value and a zero value. That is, this formula:

=AVERAGEA(1,2,"NA",4,5)

returns 2.4. Treating NA as a zero value means that the sum is 12 and the count is 5, so the result is 2.4. It’s likely that the result you’re after completely ignores the NA value, so the sum would still be 12 but the count would be 4, and the result would be 3.
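The difference between the two treatments is easy to reproduce in R. This sketch uses the same five values, with R's NA standing in for the text value.

```r
values <- c(1, 2, NA, 4, 5)

# Treat the missing value as a zero: sum 12, count 5
as_zero <- ifelse(is.na(values), 0, values)
print(mean(as_zero))               # 2.4

# Ignore the missing value: sum 12, count 4
print(mean(values, na.rm = TRUE))  # 3
```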

There’s no entirely satisfactory solution to these problems in Excel. You could define names based on the top row of the selection, and use a defined name instead of a worksheet address in the Input Range edit box. But Excel does not include the column headers as part of a range that’s named using that method, and so Descriptive Statistics cannot show the variable’s label as part of its results. If only from the standpoint of convenience, then, R is probably the better bet than Excel in this sort of situation.
