# SQL Server Reference Guide

Hosted by

## Design Elements Part 7: Statistical Functions

Last updated Mar 28, 2003.

We are surrounded with statistics. Used correctly, they form the bedrock of successful decisions. Used incorrectly, they can destroy a company by causing management to make bad decisions, or cause a faulty conclusion in a lab test evaluation.

The ability to properly deliver statistics is a great career skill. No programming arsenal of knowledge is complete without a sound understanding of statistical methods and their proper use.

In this week's tutorial, we'll learn the most important various statistical methods and how to apply T-SQL functions and algorithms to solve them.

If heavy statistics and graphics are necessary in a project, the proper choice is normally to house the data in SQL Server and process it with a specialized package. T-SQL, however, offers useful commands and functions to deal with general statistics.

#### The Use of Statistics

One of the most basic uses of statistics is to summarize data. Business or scientific requirements often involve simple summaries of large quantities of data, which answer questions such as "How many?", "What's the average?" and "How many times does this occur?"

Another function statistical methods can help us with is grouping data. This use is called representing data since it takes the original order and rearranges it to show patterns or groups. The question here is "Do we have a meaningful group of data here or just random values?"

Statistical methods also help us to compare data so that we can view one thing versus another. Typical comparison questions are "How many workers are at plant X versus plant Y?" or "Do African honeybees fly farther in a day than European honeybees?"

Related to the comparison questions are statistical results that show a difference, such as "How much more protein is in peanut butter than in peanuts?"

Finally, statistics can be used to show relationships. This type of statistic is the most interesting, and the most dangerous. It seeks to answer questions such as "Does spending more money per student yield better grades overall?" This type of question is often difficult to answer, and one sampling of data or formula is adequate to answer it.

With these uses and cautions in mind, let's take a look at the methods used to gather the data our algorithms will use.

#### Statistical Method

The statistical method begins with asking questions. We've seen a few of those already, but there is a specific way to ask these kinds of questions. If the focus is a business system, the business users should be contacted with the questions and interpretations. If the data sets are scientific in nature, then the end users will help formulate the questions and relationships they are looking for. The point is, as the DBA or developer, you shouldn't try to guess what the questions are.

After we determine the questions, we need to collect the data. We covered the subject of database design earlier, so we use those concepts to create a model and store the data. This covers the question of how we store the data. The larger question is what we store.

Data collected for statistical use is called a sample. There are some important concepts to keep in mind when discussing statistical data, such as the sample size, sample period, and distribution of the sample.

The sample size is the amount of data that makes the results valid. For instance, if we asked the first three people we met whether they liked a recent movie, we can't really call that a representative size. Instead, we need to determine the number that allows us to have confidence in the results. For minor, less-important questions, this size might be quite small, but generally the more wide-spread the result set (kinds of food, house size, etc.), the larger the sample size needs to be.

The sample period determines how long the sample data collection should be taken. For instance, to check the average temperature in an area, several years of data collection is needed.

Finally, the sample distribution of data is important to gain the widest audience or participants of the event. For our movie question, it might be important to ask people of many ages and social demographics to see if it was widely liked. On the other hand, if we're only asking to determine if we want to see the movie, the distribution needs to include only those people who share our tastes.

After we collect the data, we analyze it (or at least provide it to someone else to analyze). This is where we apply the methods we're about to learn to derive meaning or information from the raw sample data.

Now that we have our methods down, let's get right to the Transact-SQL that we can use to put it all together. For many of the algorithms we need, the statistical functions are part of the aggregate functions we've studied before. We'll just layer these concepts to get at what we need.

#### Summarizing Data

We'll begin with the basics: summarizing data. The primary functions we'll use here are SUM and COUNT. Here's an example:

```USE pubs
GO
SELECT 'Total Sales, all stores:' = SUM(qty)
FROM sales
GO
----------------------
Total Sales, all stores:
493```

This simple statement returns the sum of the sales made at all stores. As we learned in our aggregate studies, we can also provide data that is broken down by store number:

```SELECT stor_id, 'Sales' = SUM(qty)
FROM sales
GROUP BY stor_id
ORDER BY Sales DESC
GO
-----------------------```
 stor_id Sales 7131 130 7066 125 7067 90 8042 80 7896 60 6380 8

This data is a bit more useful. It's often also helpful to show not only a ranking of the data such as we have here, but also the sample size. We do that with the COUNT function. It's pretty simple:

```SELECT COUNT(*)
FROM sales
------------------------
21```

The answer here is that we don't have a great many stores reporting – or does it? Actually, this shows how many lines of data were collected, which might very well mean something else entirely. It's important to keep these facts in mind when analyzing numerical data.

In both the SUM and COUNT functions, as with most functions, the WHERE condition can limit the results.

In addition to adding and counting the result sets, the average or mean of the data is part of the summarization step. Averages are probably one of the most misused functions in statistics. If I tell you that I have several coins in my pocket and the average value of them is 5 cents, what do I have in my pocket? Well, it all depends on the sample size and its distribution. I could have a few coins or many, and no nickels whatsoever, or they could all be nickels. For that reason, the average is often taken with several other measures. Also for this reason there are several kinds of averages, such as weighted or binomial averages.

In T-SQL, what we have is the numerical average, or the function that represents the formula:

`Average = The sum of the units divided by the number of units`

The format of the command to get the numeric average looks like this:

```SELECT 'Average Store Sales' = AVG(qty)
FROM sales
GO```

Again, we could use a WHERE clause or a GROUP BY clause to display the desired result.

Next week we'll take a look at some statistical functions that SQL Server doesn't natively provide: we'll have to write our own. We'll cover medians, modes, and more. We'll explain when to use each and how we can implement them. In the meantime, there's a little light reading in the references.

#### Online Resources

Really into statistics? Here's a great statistics site.

#### InformIT Tutorials and Sample Chapters

Just for fun: James Cortada explores statistics and sports.