- Table of Contents
- Microsoft SQL Server Defined
- Microsoft SQL Server Features
- Microsoft SQL Server Administration
- Microsoft SQL Server Programming
- An Outline for Development
- Database Services
- Database Objects: Databases
- Database Objects: Tables
- Database Objects: Table Relationships
- Database Objects: Keys
- Database Objects: Constraints
- Database Objects: Data Types
- Database Objects: Views
- Database Objects: Stored Procedures
- Database Objects: Indexes
- Database Objects: User Defined Functions
- Database Objects: Triggers
- Database Design: Requirements, Entities, and Attributes
- Business Process Model Notation (BPMN) and the Data Professional
- Business Questions for Database Design, Part One
- Business Questions for Database Design, Part Two
- Database Design: Finalizing Requirements and Defining Relationships
- Database Design: Creating an Entity Relationship Diagram
- Database Design: The Logical ERD
- Database Design: Adjusting The Model
- Database Design: Normalizing the Model
- Creating The Physical Model
- Database Design: Changing Attributes to Columns
- Database Design: Creating The Physical Database
- Database Design Example: Curriculum Vitae
- The SQL Server Sample Databases
- The SQL Server Sample Databases: pubs
- The SQL Server Sample Databases: NorthWind
- The SQL Server Sample Databases: AdventureWorks
- The SQL Server Sample Databases: Adventureworks Derivatives
- UniversalDB: The Demo and Testing Database, Part 1
- UniversalDB: The Demo and Testing Database, Part 2
- UniversalDB: The Demo and Testing Database, Part 3
- UniversalDB: The Demo and Testing Database, Part 4
- Getting Started with Transact-SQL
- Transact-SQL: Data Definition Language (DDL) Basics
- Transact-SQL: Limiting Results
- Transact-SQL: More Operators
- Transact-SQL: Ordering and Aggregating Data
- Transact-SQL: Subqueries
- Transact-SQL: Joins
- Transact-SQL: Complex Joins - Building a View with Multiple JOINs
- Transact-SQL: Inserts, Updates, and Deletes
- An Introduction to the CLR in SQL Server 2005
- Design Elements Part 1: Programming Flow Overview, Code Format and Commenting your Code
- Design Elements Part 2: Controlling SQL's Scope
- Design Elements Part 3: Error Handling
- Design Elements Part 4: Variables
- Design Elements Part 5: Where Does The Code Live?
- Design Elements Part 6: Math Operators and Functions
- Design Elements Part 7: Statistical Functions
- Design Elements Part 8: Summarization Statistical Algorithms
- Design Elements Part 9: Representing Data with Statistical Algorithms
- Design Elements Part 10: Interpreting the Data—Regression
- Design Elements Part 11: String Manipulation
- Design Elements Part 12: Loops
- Design Elements Part 13: Recursion
- Design Elements Part 14: Arrays
- Design Elements Part 15: Event-Driven Programming Vs. Scheduled Processes
- Design Elements Part 16: Event-Driven Programming
- Design Elements Part 17: Program Flow
- Forming Queries Part 1: Design
- Forming Queries Part 2: Query Basics
- Forming Queries Part 3: Query Optimization
- Forming Queries Part 4: SET Options
- Forming Queries Part 5: Table Optimization Hints
- Using SQL Server Templates
- Transact-SQL Unit Testing
- Index Tuning Wizard
- Unicode and SQL Server
- SQL Server Development Tools
- The SQL Server Transact-SQL Debugger
- The Transact-SQL Debugger, Part 2
- Basic Troubleshooting for Transact-SQL Code
- An Introduction to Spatial Data in SQL Server 2008
- Performance Tuning
- Practical Applications
- Professional Development
- Application Architecture Assessments
- Business Intelligence
- Tips and Troubleshooting
- Additional Resources
Design Elements Part 8: Summarization Statistical Algorithms
Last updated May 13, 2004.
In last week's tutorial, we learned the importance of applying statistical methods properly, covering the basics of statistics and where they are used.
To use statistical data properly, we need to collect, store, and analyze it correctly. The statistical method we're following is to ask the questions, collect the data, and analyze the results. While this might sound obvious, each step has a specific discipline to follow.
The primary uses of statistical analysis are summarizing data, grouping (or representing) data, comparing data sets, showing the differences between data sets, and finding relationships between data sets. Last week we began the topic of summarizing data in the context of SQL databases (and SQL Server in particular), which we'll finish today.
The first two functions we used for summarizing data were the SUM() and COUNT() functions. These are both quite basic, and the only real twist with either of them is that we can use limiting queries (a WHERE clause) to return only the results we're looking for.
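For example, here's a limited summary against the sales table in the pubs sample database (a minimal sketch; the store ID shown is just an illustrative value from pubs):

```sql
-- Total quantity sold and number of orders, limited to one store
SELECT SUM(qty) AS 'Total quantity',
       COUNT(*) AS 'Number of orders'
FROM sales
WHERE stor_id = '7066'
```

Remove the WHERE clause to summarize the whole table.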
SQL Server provides a built-in function for this calculation: AVG(). The proper statistical name for this type of average is the mean. The problem is that the mean isn't perfect on its own. What we're trying to show is a representation of a data set, and the mean only sometimes accomplishes that. For instance, if the average value of a set of six coins is five cents, we don't know whether we have a quarter and five pennies or six nickels.
For this reason, the mean is best used in conjunction with other tools that summarize the data. This is an important concept in applying statistical methods: when you see a single statistic drawn from an unqualified data sample, don't implicitly trust the conclusions.
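Here's the mean of the same column (a minimal sketch against the pubs sales table):

```sql
-- The built-in mean
SELECT AVG(qty) AS 'Average quantity'
FROM sales
```

Note that AVG() over an integer column returns an integer, truncating any fraction; cast first if you want a fractional mean, as in AVG(CAST(qty AS float)).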
Another summarization method is to show the middle number in a set, called the median. That gives us more confidence in the characterization of the data. It's a simple enough procedure: order the data set from lowest to highest, then take the value that divides the set in half. The problem is that, simple as it is, there's no built-in function for it in T-SQL. We have to do a little work to get there, but it's not that hard.
To find the median, we just follow that procedure. You'll need to make use of one new construct: the temporary table. A temporary table is a table created in the tempdb system database, and a local temporary table is available only to the connection that creates it. When the connection is broken, the table is destroyed.
There's another type of temporary table, called a global temporary table, which is visible to all connections.
You create a temporary table the same way as a normal table, except that you preface the name with a single # sign for the local type and ## for the global type.
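As a quick sketch (the table and column names here are arbitrary):

```sql
-- Local temporary table: visible only to this connection
CREATE TABLE #LocalDemo (ID int, Val int)

-- Global temporary table: visible to all connections until the
-- creating connection closes and no other connection is using it
CREATE TABLE ##GlobalDemo (ID int, Val int)

INSERT INTO #LocalDemo (ID, Val) VALUES (1, 10)
SELECT * FROM #LocalDemo

-- Clean up
DROP TABLE #LocalDemo
DROP TABLE ##GlobalDemo
```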
So we'll create a temporary table and insert the data into it with a SELECT statement, sorted in the right order. We've seen all of these constructs before; the twist is that we'll use an IDENTITY column to provide a numerical reference for the middle value. Then we'll create variables to hold the counts and calculations, and use them to locate the center value, or at least the one closest to it.
Here's the code:
USE pubs
GO

-- Create a temporary table to hold the values,
-- with an identity column to number the rows.
-- Note: the Median column would need another type
-- if the value weren't numeric.
CREATE TABLE #Median (Median int, ID int IDENTITY)

-- Fill the temporary table with an ordered list
-- from the source base table
INSERT INTO #Median (Median)
SELECT qty
FROM sales
ORDER BY qty ASC

-- Set up variables to hold the calculations
DECLARE @Count AS INT, @Median AS INT

-- Get the number of rows in the ordered temporary table
SET @Count = (SELECT COUNT(*) FROM #Median)

-- Find the middle row: exact for an odd count,
-- the lower of the two middle rows for an even count
SET @Median = (@Count + 1) / 2

-- Use that row number to find the proper value
-- from the ordered list
SELECT Median FROM #Median WHERE ID = @Median

-- Clean up
DROP TABLE #Median
GO
We now have four formulas to work with to help summarize the data, SUM(), COUNT(), AVG(), and our new median algorithm. But that's not all. To truly understand these numbers, we have a few more formulas to work out.
The first is the mode. The mode is the value found most often in the data set. The mode, along with the mean and median, helps us gauge confidence in the average: the more values that cluster around the average, the more representative it is.
Once again, T-SQL does not provide a built-in function for this statistical measure. We can, however, use a simple GROUP BY aggregate with the TOP clause to get at it:
SELECT TOP 1
    COUNT(*) AS 'Number of repeats',
    qty AS 'Number'
FROM sales
GROUP BY qty
ORDER BY COUNT(*) DESC
The final three measures all have built-in functions.
The next two functions show our range, or the spread of the values: the highest and lowest values in the data set.
Here are both of them in code:
SELECT MAX(qty) AS 'Maximum value',
       MIN(qty) AS 'Minimum value'
FROM sales
Finally, we need to know how far the numbers are spread around the average. If the spread is quite large, we have less confidence in the average. The function T-SQL provides for this measurement is STDEV(), which takes a column as a parameter. Here's how that looks:
SELECT STDEV(qty) as 'Standard Deviation' FROM sales
For data that is roughly normally distributed, about 68% of the values in the set fall within one standard deviation of the mean. For example, if the average is 12 and the standard deviation is 1, then about 68% of the values are between 11 and 13. Roughly 95% of the set's values fall within two standard deviations, so most of the data in this example is between 10 and 14! That gives us great reassurance that the mean represents the data.
So, having the average is fine provided we also know the standard deviation.
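We can put the two together and compute the one-standard-deviation band around the mean in a single query (a sketch against the pubs sales table; the cast avoids integer truncation of the mean):

```sql
-- The mean and the band one standard deviation around it.
-- Values of qty between LowerBound and UpperBound are the
-- "typical" values (about 68%, if qty is roughly normal).
SELECT AVG(CAST(qty AS float))              AS 'Mean',
       STDEV(qty)                           AS 'Standard Deviation',
       AVG(CAST(qty AS float)) - STDEV(qty) AS 'LowerBound',
       AVG(CAST(qty AS float)) + STDEV(qty) AS 'UpperBound'
FROM sales
```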
Finally, we can look at the variance of the distribution of the data. The variance also shows how tightly the data is clustered around the mean; it's just the square of the standard deviation. Here's that function:
SELECT VAR(qty) as 'Variance' FROM sales
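Since the variance is just the square of the standard deviation, we can verify the relationship directly using the built-in SQUARE() function; the two columns below should match to within floating-point rounding:

```sql
-- The variance should equal the standard deviation squared
SELECT VAR(qty)           AS 'Variance',
       SQUARE(STDEV(qty)) AS 'StDev squared'
FROM sales
```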
All these functions and algorithms make up our summarization toolkit. Next week, we'll recap with a combination of all these numbers and a handy table that shows when to use each.
If you're really interested in the formulas behind these algorithms, this statistics glossary can't be beat.