 Introduction

Table of Contents
 Microsoft SQL Server Defined
 Microsoft SQL Server Features
 Microsoft SQL Server Administration

Microsoft SQL Server Programming
 An Outline for Development
 Database
 Database Services
 Database Objects: Databases
 Database Objects: Tables
 Database Objects: Table Relationships
 Database Objects: Keys
 Database Objects: Constraints
 Database Objects: Data Types
 Database Objects: Views
 Database Objects: Stored Procedures
 Database Objects: Indexes
 Database Objects: User Defined Functions
 Database Objects: Triggers
 Database Design: Requirements, Entities, and Attributes
 Business Process Model Notation (BPMN) and the Data Professional
 Business Questions for Database Design, Part One
 Business Questions for Database Design, Part Two
 Database Design: Finalizing Requirements and Defining Relationships
 Database Design: Creating an Entity Relationship Diagram
 Database Design: The Logical ERD
 Database Design: Adjusting The Model
 Database Design: Normalizing the Model
 Creating The Physical Model
 Database Design: Changing Attributes to Columns
 Database Design: Creating The Physical Database
 Database Design Example: Curriculum Vitae
 NULLs
 The SQL Server Sample Databases
 The SQL Server Sample Databases: pubs
 The SQL Server Sample Databases: NorthWind
 The SQL Server Sample Databases: AdventureWorks
 The SQL Server Sample Databases: AdventureWorks Derivatives
 UniversalDB: The Demo and Testing Database, Part 1
 UniversalDB: The Demo and Testing Database, Part 2
 UniversalDB: The Demo and Testing Database, Part 3
 UniversalDB: The Demo and Testing Database, Part 4
 Getting Started with Transact-SQL
 Transact-SQL: Data Definition Language (DDL) Basics
 Transact-SQL: Limiting Results
 Transact-SQL: More Operators
 Transact-SQL: Ordering and Aggregating Data
 Transact-SQL: Subqueries
 Transact-SQL: Joins
 Transact-SQL: Complex Joins – Building a View with Multiple JOINs
 Transact-SQL: Inserts, Updates, and Deletes
 An Introduction to the CLR in SQL Server 2005
 Design Elements Part 1: Programming Flow Overview, Code Format and Commenting your Code
 Design Elements Part 2: Controlling SQL's Scope
 Design Elements Part 3: Error Handling
 Design Elements Part 4: Variables
 Design Elements Part 5: Where Does The Code Live?
 Design Elements Part 6: Math Operators and Functions
 Design Elements Part 7: Statistical Functions
 Design Elements Part 8: Summarization Statistical Algorithms
 Design Elements Part 9: Representing Data with Statistical Algorithms
 Design Elements Part 10: Interpreting the Data—Regression
 Design Elements Part 11: String Manipulation
 Design Elements Part 12: Loops
 Design Elements Part 13: Recursion
 Design Elements Part 14: Arrays
 Design Elements Part 15: Event-Driven Programming vs. Scheduled Processes
 Design Elements Part 16: Event-Driven Programming
 Design Elements Part 17: Program Flow
 Forming Queries Part 1: Design
 Forming Queries Part 2: Query Basics
 Forming Queries Part 3: Query Optimization
 Forming Queries Part 4: SET Options
 Forming Queries Part 5: Table Optimization Hints
 Using SQL Server Templates
 Transact-SQL Unit Testing
 Index Tuning Wizard
 Unicode and SQL Server
 SQL Server Development Tools
 The SQL Server Transact-SQL Debugger
 The Transact-SQL Debugger, Part 2
 Basic Troubleshooting for Transact-SQL Code
 An Introduction to Spatial Data in SQL Server 2008
 Performance Tuning
 Practical Applications
 Professional Development
 Application Architecture Assessments
 Business Intelligence
 Tips and Troubleshooting
 Additional Resources
Design Elements Part 8: Summarization Statistical Algorithms
Last updated Mar 28, 2003.
In last week's tutorial, we learned the importance of applying statistical methods properly. We covered the basics of statistics and where they are used.
To use statistical data properly, we need to collect, store, and analyze it correctly. The statistical method we're using is to ask the questions, collect the data, and analyze the results. While this might sound obvious, each step has a specific discipline to follow.
The primary uses of statistical analysis are summarizing data, grouping (or representing) data, comparing data sets, showing the differences between data sets, and finding relationships between data sets. We began the topic of summarizing data, in the context of SQL databases (and SQL Server in particular), which we'll finish today.
The first two functions we used for summarizing data were the SUM() and COUNT() functions. These are both quite basic, and the only real twist related to either of them is that we can use limiting queries to get only the results we're looking for.
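As a quick refresher, here's a sketch of both functions against the pubs sample database's sales table (the same table the rest of this article uses), with a limiting WHERE clause narrowing the results to just the rows we're interested in:

```sql
USE pubs
GO

-- Total and count of order quantities, limited to the larger orders
-- (the cutoff of 20 is arbitrary, just to show the limiting clause)
SELECT SUM(qty) AS 'Total quantity'
     , COUNT(*) AS 'Number of orders'
FROM sales
WHERE qty > 20
```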
SQL Server provides a built-in function for the average: AVG(). The proper statistical name for this type of average is the mean. The problem is that a mean isn't really perfect. What we're trying to show is a representation of a data set. A simple average tends to do that – sometimes. For instance, if the average value of a set of six coins is five cents, we don't really know if we have a quarter and five pennies or six nickels.
For this reason, the mean is best used in conjunction with other tools that summarize the data. This is an important concept in applying statistical methods: when you see a single statistic quoted from an unqualified data sample, don't implicitly trust the conclusions.
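Here's a minimal sketch of AVG() against the same sales table:

```sql
-- The mean order quantity; like the other aggregates,
-- AVG() ignores NULLs in the column it's given
SELECT AVG(qty) AS 'Mean quantity'
FROM sales
```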
Another summarization method is to show the middle number in a set, called the median. That gives us more confidence in the characterization of the data. It's a simple enough procedure: we just take the data set, order it from lowest to highest, and then divide the set in half. The problem is, although the procedure is simple, there's no built-in function for it in T-SQL. We have to do a little work to get there, but it's not that hard.
To find the median, we just follow that procedure. You'll need to make use of one new construct – the temporary table. A temporary table is a table created in the tempdb system database, and is available only to the connection that creates it. When the connection is broken, the table is destroyed.
NOTE
There's another type of temporary table which is available to more than one connection, called a global temporary table.
You create a temporary table the same way as a normal table, except that you preface the name with a # sign for the local type, and ## for the global type.
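As a sketch (the table and column names here are illustrative, not from the article):

```sql
-- Local temporary table: visible only to this connection
CREATE TABLE #LocalScratch (Value int)

-- Global temporary table: visible to all connections until the
-- creating connection closes and no other connection is using it
CREATE TABLE ##GlobalScratch (Value int)

-- Temporary tables are dropped automatically, but cleaning up
-- explicitly is good practice
DROP TABLE #LocalScratch
DROP TABLE ##GlobalScratch
```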
So we'll create a temporary table and fill it with data from the source table, sorted in the right order. We've seen all these constructs before, but the twist is that we'll use an identity column to provide a numerical reference for the middle value. Then we'll create variables to hold the counts and calculations, and use them to locate the center value, or at least the one closest to it.
Here's the code:
USE pubs
GO

-- Create a temporary table to hold the values
-- and an identity field to give us a row-holder.
-- Note: Median would need to have another type
-- if the value isn't numeric
CREATE TABLE #Median (Median int, ID int IDENTITY)

-- Fill the temporary table with an ordered list from
-- the source base table
INSERT INTO #Median (Median)
SELECT qty FROM sales ORDER BY qty ASC

-- Set up variables to hold the calculations
DECLARE @Count AS INT, @Median AS INT

-- Get the number of rows from the ordered temporary table
SET @Count = (SELECT COUNT(*) FROM #Median)

-- Find the middle (roughly)
SET @Median = (@Count / 2)

-- Use that value to find the proper value
-- from the ordered list
SELECT Median FROM #Median WHERE ID = @Median

-- Clean up
DROP TABLE #Median
GO
We now have four formulas to work with to help summarize the data, SUM(), COUNT(), AVG(), and our new median algorithm. But that's not all. To truly understand these numbers, we have a few more formulas to work out.
The first is the Mode. A mode is the number found most often in the data set. The mode, along with the average and median, can help us have confidence in the numbers closest to the average. The more numbers that are around the average, the more confidence we can have in it.
Once again, T-SQL does not provide a built-in function for this statistic. We can, however, use a simple GROUP BY aggregate with TOP to get at it (note that if two values tie for the most repeats, TOP 1 returns only one of them):
SELECT TOP 1 COUNT(*) AS 'Number of repeats'
     , qty AS 'Number'
FROM sales
GROUP BY qty
ORDER BY COUNT(*) DESC
The remaining summarization measurements – range, standard deviation, and variance – have built-in functions.
The next two functions show our range, or the spread of the values. The range shows us the high and low values of the data set.
Here are both of them in code:
SELECT MAX(qty) AS 'Maximum value'
     , MIN(qty) AS 'Minimum value'
FROM sales
Finally, we need to know how far the numbers are spread around the average. If the spread is quite large, we have less confidence in the average. The function T-SQL provides for this measurement is STDEV(), and it takes a column as a parameter. Here's how that looks:
SELECT STDEV(qty) as 'Standard Deviation' FROM sales
For data that's roughly normally distributed (the familiar bell curve), about 68% of the values in the set fall within one standard deviation of the mean. For example, if the average is 12, and the standard deviation is 1, then about 68% of the values are between 11 and 13. Roughly 95% of the set's values fall within two standard deviations, so most of the data in this example is between 10 and 14! That gives us great reassurance that the mean represents the data.
So, having the average is fine – provided we also know the standard deviation.
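Putting the two together in one query (a sketch, against the same sales table as before) makes the pairing concrete:

```sql
-- Report the mean alongside its spread, so that neither
-- number gets read in isolation
SELECT AVG(qty)   AS 'Mean'
     , STDEV(qty) AS 'Standard Deviation'
FROM sales
```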
Finally, we can look at the variance of the distribution of the data. The variance also shows how tightly the data is clustered around the mean; it's just the square of the standard deviation. Here's that function:
SELECT VAR(qty) as 'Variance' FROM sales
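Because the variance is just the standard deviation squared, you can sanity-check the two functions against each other (a sketch; tiny floating-point differences between the two columns are expected):

```sql
-- VAR() should match STDEV() multiplied by itself
SELECT VAR(qty)                 AS 'Variance'
     , STDEV(qty) * STDEV(qty) AS 'StDev squared'
FROM sales
```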
All these functions and algorithms make up our summarization toolkit. Next week, we'll recap with a combination of all these numbers, and a handy table that shows when to use each.
Online Resources
If you're really interested in the formulas behind these algorithms, this statistics glossary can't be beat.