
SQL Server Reference Guide


Design Elements Part 9: Representing Data with Statistical Algorithms

Last updated Mar 28, 2003.

Last week, we covered the summary statistical algorithms that give additional confidence in the data's credibility. We learned that the most abused statistical measure is the average.

Which brings up a story: it seems that three statisticians were hunting in the woods. A rabbit appeared, and two of them fired their weapons. The first shot far in front of the rabbit and the second shot far behind. "We got him!" exclaimed the third.

While this is a humorous story, it makes a good point: without a measure of spread, the average alone tells us very little.

The last tutorial promised a handy chart to tell which statistical method to use in which situation, and here it is:

Method    Categories          Individual Numbers        Series Numbers
Mode      OK                  OK (if multiple values)   Intervals only
Mean      Bad                 OK                        OK
Median    Ordered sets only   OK                        OK

This chart shows where each measure inspires the most confidence. In most cases, however, the data is mixed, which is why we don't rely on a single value to describe it.
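The chart's guidance is easy to demonstrate outside of SQL. Here is a minimal Python sketch, using an invented, skewed sample, showing how the three measures can disagree on the same data:

```python
from statistics import mean, median, mode

# Hypothetical, skewed sample: one large value pulls the mean upward
qty = [1, 2, 2, 3, 3, 3, 4, 50]

print(mean(qty))    # 8.5  -- arithmetic mean, distorted by the outlier
print(median(qty))  # 3.0  -- middle value of the ordered set
print(mode(qty))    # 3    -- most frequent value
```

The mean suggests a "typical" quantity of 8.5, yet no value in the sample is anywhere near it; the median and mode describe this set far better.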

To finish out the summary functions, here's a code sample to show all the algorithms we've covered so far:

USE pubs
GO
CREATE TABLE #Median(Median int, ID int identity) 
INSERT INTO #Median (Median) 
  SELECT qty 
  FROM sales
  ORDER BY qty ASC
DECLARE @Count AS INT, @MiddleRow AS INT, @Median AS INT 
SET @Count = (SELECT COUNT(*) FROM #Median) 
/* (Count + 1) / 2 lands on the true middle row for an odd count;
   for an even count it picks the lower of the two middle rows */
SET @MiddleRow = (@Count + 1) / 2 
SET @Median = (SELECT Median FROM #Median WHERE ID = @MiddleRow)

SELECT 
 COUNT(qty) as 'Count'
, SUM(qty) as 'Sum' 
, AVG(qty) as 'Average'
, @Median as 'Median' 
, STDEV(qty) as 'Std Deviation'
, VAR(qty) as 'Variance'
, MAX(qty) as 'Maximum'
, MIN(qty) as 'Minimum'
FROM sales
DROP TABLE #Median
GO
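The middle-row technique above works cleanly when the row count is odd; for an even count, the conventional median is the average of the two middle values. A short Python sketch of that rule (with hypothetical data, not the pubs tables):

```python
def median(values):
    """Median of a list: the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle row
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the pair

print(median([10, 20, 30]))      # 20
print(median([10, 20, 30, 40]))  # 25.0
```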

Representing Data

The next set of algorithms we'll examine are those that represent data. These measures describe the shape of the data we are working with.

We've seen some of these functions and algorithms before, but this time we'll use them to build on our example. The first function we'll examine is COUNT. We covered this example in the last tutorial, so we won't spend a lot of time on it here other than a quick refresher:

SELECT COUNT(qty) as 'Count'
FROM sales
GO

Remember that we can use any of the WHERE constructs we learned to limit the result set.

We've also seen the next measure before, in our tutorial on aggregates, but not named as such: GROUP BY. In a SELECT statement, the GROUP BY clause groups the rows by a column's values; combined with COUNT, it shows the number of times each value occurs:

SELECT title_id AS 'Title', COUNT(title_id) as 'Count'
FROM sales
GROUP BY title_id
------------------------------------
Title Count
BU1032 2
BU1111 1
BU2075 1
BU7832 1
MC2222 1
MC3021 2
PC1035 1
PC8888 1
PS1372 1
PS2091 4
PS2106 1
PS3333 1
PS7777 1
TC3218 1
TC4203 1
TC7777 1
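The same group-and-count operation can be expressed in Python with collections.Counter. This sketch uses a few made-up title IDs standing in for the sales table:

```python
from collections import Counter

# Hypothetical rows, standing in for sales.title_id
titles = ["BU1032", "PS2091", "BU1032", "PS2091", "PS2091", "MC3021"]

freq = Counter(titles)                   # value -> number of occurrences
for title, count in freq.most_common():  # sorted by count, descending
    print(title, count)
```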

We can also use the ORDER BY clause to order the list on either column:

SELECT title_id AS 'Title', COUNT(title_id) as 'Count'
FROM sales
GROUP BY title_id
ORDER BY [Count] DESC

This type of output is called a "frequency distribution," and it is also often presented as a chart. While T-SQL isn't built for graphics, it's interesting to see what we can accomplish with a little code:

SELECT title_id AS 'Title'
, REPLICATE('*', COUNT(title_id)) AS 'Count'
FROM sales
GROUP BY title_id
GO
--------------------------------------
Title Count
BU1032 **
BU1111 *
BU2075 *
BU7832 *
MC2222 *
MC3021 **
PC1035 *
PC8888 *
PS1372 *
PS2091 ****
PS2106 *
PS3333 *
PS7777 *
TC3218 *
TC4203 *
TC7777 *

This is the same basic query we saw earlier, but we've added one new construct: REPLICATE. This function simply repeats a character (an asterisk, in this case) a given number of times. To get that number, we used the COUNT aggregate on each group.

Since most people are visually oriented, this view can be helpful to describe the data set. Of course, this view isn't practical in SQL for large numeric values, unless we break the numbers down into groups of tens or hundreds.
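The scaling idea mentioned above, one mark per ten units rather than one per unit, looks like this in a Python sketch (the per-store figures echo the store totals shown later in this tutorial):

```python
# Hypothetical units sold per store
sold = {"6380": 8, "7066": 125, "7067": 90, "7131": 130, "7896": 60, "8042": 80}

scale = 10  # one asterisk per ten units, so large values stay readable
bars = {store: "*" * (qty // scale) for store, qty in sold.items()}
for store, bar in bars.items():
    # Note: store 6380 sold fewer units than one full scale step,
    # so its bar is empty -- the price of coarser grouping
    print(store, bar)
```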

As we can see, this type of data is basically a bar chart on its side. Bar charts are used quite frequently in statistical measures, and once again we have to check them out with a "weather eye." The problem with a bar chart arises with a depiction of scale.

Let's take a look at the scores from the schools in Shelbyville and those in Springfield:

Figure 106

We can see here that the students in Shelbyville had scores around the 98 range, and those in Springfield had scores around 95. Not bad at all, for either city. Not bad, that is, unless we want to show a disparity between the cities, perhaps to increase school funding. In that case, all we have to do is change the scale, starting the bottom value at 94, rather than 0. Take a look at this graph:

Figure 107

Notice that we clearly need more funding in Springfield!

This subject brings us into the next set of descriptive measurements: comparisons. There really isn't a formula or an algorithm for comparing data; what we're after in this measurement is showing two or more data sets side by side. The graphs shown above are examples of that.

The simple joins and unions that we covered earlier are enough for this activity; nothing complex is called for here. It can also be helpful to group the values to show the disparity between the samples:

SELECT stor_id AS 'Store:'
, SUM(qty) AS 'Titles Sold:'
FROM sales
GROUP BY stor_id
---------------------------------------
Store: Titles Sold:
6380 8
7066 125
7067 90
7131 130
7896 60
8042 80

The only caveat is that the data should always be "apples to apples," that is, drawn from a similar sample distribution. The danger is that the samples can show a disproportionate relationship if they aren't from the same kind of measurement.

One more: here's a script (although it's a bit more complex than I'd like) that provides the percentages for the values:

/* Let's get an aggregated table */
SELECT 
  stor_id AS 'store'
  , SUM(qty) AS quantity
  INTO #testtable
  FROM sales
  GROUP BY stor_id
/* Now let's set aside a decimal variable so the math will work */
DECLARE @var1 AS DECIMAL (5,2)
SET @var1 = (SELECT SUM(quantity) FROM #testtable)
/* Now let's get the values, one line at a 
time, and then divide them by the aggregate */
SELECT store
, quantity
, (quantity/@var1)*100 AS 'Percentage'
FROM #testtable
/* Cleanup */
DROP TABLE #testtable
---------------------------------------
Store  Quantity  Percentage
6380   8         1.622700
7066   125       25.354900
7067   90        18.255500
7131   130       26.369100
7896   60        12.170300
8042   80        16.227100

Want to extend this a bit further? Just multiply each raw (not multiplied by 100) percentage by 360 to determine the angle for a pie chart! (OK, it might actually be better to do that one in Excel.)
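The arithmetic is straightforward. Here is a Python sketch using the store totals from the result set above; the raw fraction of each store's sales times 360 gives its slice of the pie in degrees:

```python
# Store totals from the percentage query's result set
store_qty = {"6380": 8, "7066": 125, "7067": 90, "7131": 130, "7896": 60, "8042": 80}
total = sum(store_qty.values())  # 493

for store, qty in store_qty.items():
    fraction = qty / total        # raw percentage, before the *100
    print(store, round(fraction * 100, 4), round(fraction * 360, 1))
```

The angles necessarily sum to 360 degrees, just as the percentages sum to 100.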

Online Resources

I didn't have a lot of time to spend on the charts in this tutorial, but Matthew Pinkney shows them (along with other interesting math functions) in a better format here.

InformIT Tutorials and Sample Chapters

Need a bit more real-world statistics? Check out Applied Statistics for Software Managers, by Katrina Maxwell.