SQL Server Reference Guide

Hosted by

Design Elements Part 8: Summarization Statistical Algorithms

Last updated Mar 28, 2003.

Design Elements Part 8: Summarization Statistical Algorithms

In last week's tutorial, we learned the importance of applying statistical methods properly. We covered the basics of statistics and where they are used.

To properly use statistical data, we need to collect, store and analyze it correctly. The statistical method we're using is to ask the questions, collect the data, and analyze the results. While this might sound obvious, each steps has a specific discipline to follow.

The primary uses of statistical analysis are summarizing data, grouping (or representing) data, comparing data sets, showing the differences between data sets, and finding relationships between data sets. We began the topic of summarizing data, in the context of SQL databases (and SWL Server in particular), which we'll finish today.

The first two functions we used for summarizing data were the SUM() and COUNT() functions. These are both quite basic, and the only real twist related to either of them is that we can use limiting queries to get only the results we're looking for.

SQL Server provides a built-in function for this formula: AVG(). The proper statistical name for this type of average is the mean. The problem is that a mean isn't really perfect. What we're trying to show is a representation of a data set. The simple formula above tends to do that – sometimes. For instance, if the average of a set of coins is five cents, we don't really know if we have a quarter and five pennies or six nickels.

For this reason, the mean is best used in conjunction with other tools that summarize the data. This is an important concept to learn about applying statistical methods. When you see a single statistic provided from an unqualified data sample, don't implicitly trust the conclusions.

Another type of summarization method to use is to show the middle number in a set, called the median. That gives us more confidence in the characterization of the data. It's a simple enough formula: we just take the data set, order it from lowest to highest, and then divide the set in half. The problem is, although it's a simple formula, there's no direct formula for it in T-SQL. We have to do a little work to get there, but it's not that hard.

To find the median, we just follow the formula. You'll need to make use of one new construct – the temporary table. A temporary table is a table constructed in memory, and is available only to the connection that creates it. When the connection is broken, the table is destroyed.

NOTE

There's another type of temporary table which is available to more than one connection, called a global temporary table.

Youe create a temporary table the same way as a normal table, except that you preface the name with a # sign for the local type, and ## for the global ones.

So we'll create a temporary table, insert the data in the table using a subquery, sorted in the right order. We've seen all these constructs before, but the twist is that we'll use an identity data type to provide a numerical reference for the middle value. Then we'll create variables to hold the numbers and calculations, and then use them to locate the center value, or at least the one closest to it.

Here's the code:

```USE pubs
GO

-- Create a temporary table to hold the values
-- and an identity field to give us a row-holder.
-- Note - Median would need to have another type
-- if the value isn't numeric
CREATE TABLE #Median(Median int, ID int identity)

-- Fill the temporary table with an ordered list from
-- the source base table
INSERT INTO #Median (Median)
SELECT qty
FROM sales
ORDER BY qty ASC

-- Set up variables to hold the calculations
DECLARE @Count AS INT, @Median AS INT

-- Get the number of rows from the ordered temporary table
SET @Count = (SELECT COUNT(*)FROM #Median)

-- Find the middle (roughly)
SET @Median = (@Count/2)

-- Use that value to find the proper value
-- from the ordered list
SELECT Median from #Median where ID = @Median

-- Clean up
DROP TABLE #Median

GO```

We now have four formulas to work with to help summarize the data, SUM(), COUNT(), AVG(), and our new median algorithm. But that's not all. To truly understand these numbers, we have a few more formulas to work out.

The first is the Mode. A mode is the number found most often in the data set. The mode, along with the average and median, can help us have confidence in the numbers closest to the average. The more numbers that are around the average, the more confidence we can have in it.

Once again, T-SQL does not provide a formula for this statistical function. We can, however, use a simple GROUP BY aggregate with the TOP function to get at it:

```SELECT TOP 1 COUNT(*) as 'Number of repeats'
, qty as 'Number'
FROM sales
GROUP BY qty
ORDER BY COUNT(*) DESC```

The final three functions have the formulas built-in.

The next two formulas show our range, or the spread of the values. The ranges of the numbers show us the high and low values of the data set.

Here are both of them in code:

```SELECT MAX(qty) as 'Maximum value'
, MIN(qty) as 'Minimum value'
FROM sales```

Finally, we need to know how far the numbers are spread around the average. If the spread is quite large, we have less confidence in the average. The function T-SQL provides for this measurement is STDEV(), and it takes a column as a parameter. Here's how that looks:

```SELECT STDEV(qty) as 'Standard Deviation'
FROM sales ```

With a standard deviation, about 65% of the values in the set are within one standard deviations of the mean. For example, if the average is 12, and the standard deviation is 1, then 65% of the values are between 11 and 13. More than 90% of the sets' values are within two standard deviations, so most of the data in this example is between 10 and 14! That gives us a great reassurance that the mean represents the data.

So, having the average is fine – provided we also know the standard deviation.

Finally, we can look at the variance of the distribution of the data. The variance also shows how tightly the data is clustered around the mean; it's just the square of the standard deviation. Here's that function:

``` SELECT VAR(qty) as 'Variance'
FROM sales ```

All these functions and algorithms show us the summarization functions. Next week, we'll recap with a combination of all these numbers, and a handy table that shows when to use each.

Online Resources

If you're really interested in the formulas behind these algorithms, this statistics glossary can't be beat.