- Table of Contents
- Microsoft SQL Server Defined
- Microsoft SQL Server Features
- Microsoft SQL Server Administration
- Microsoft SQL Server Programming
- An Outline for Development
- Database Services
- Database Objects: Databases
- Database Objects: Tables
- Database Objects: Table Relationships
- Database Objects: Keys
- Database Objects: Constraints
- Database Objects: Data Types
- Database Objects: Views
- Database Objects: Stored Procedures
- Database Objects: Indexes
- Database Objects: User Defined Functions
- Database Objects: Triggers
- Database Design: Requirements, Entities, and Attributes
- Business Process Model Notation (BPMN) and the Data Professional
- Business Questions for Database Design, Part One
- Business Questions for Database Design, Part Two
- Database Design: Finalizing Requirements and Defining Relationships
- Database Design: Creating an Entity Relationship Diagram
- Database Design: The Logical ERD
- Database Design: Adjusting The Model
- Database Design: Normalizing the Model
- Creating The Physical Model
- Database Design: Changing Attributes to Columns
- Database Design: Creating The Physical Database
- Database Design Example: Curriculum Vitae
- The SQL Server Sample Databases
- The SQL Server Sample Databases: pubs
- The SQL Server Sample Databases: NorthWind
- The SQL Server Sample Databases: AdventureWorks
- The SQL Server Sample Databases: Adventureworks Derivatives
- UniversalDB: The Demo and Testing Database, Part 1
- UniversalDB: The Demo and Testing Database, Part 2
- UniversalDB: The Demo and Testing Database, Part 3
- UniversalDB: The Demo and Testing Database, Part 4
- Getting Started with Transact-SQL
- Transact-SQL: Data Definition Language (DDL) Basics
- Transact-SQL: Limiting Results
- Transact-SQL: More Operators
- Transact-SQL: Ordering and Aggregating Data
- Transact-SQL: Subqueries
- Transact-SQL: Joins
- Transact-SQL: Complex Joins - Building a View with Multiple JOINs
- Transact-SQL: Inserts, Updates, and Deletes
- An Introduction to the CLR in SQL Server 2005
- Design Elements Part 1: Programming Flow Overview, Code Format and Commenting your Code
- Design Elements Part 2: Controlling SQL's Scope
- Design Elements Part 3: Error Handling
- Design Elements Part 4: Variables
- Design Elements Part 5: Where Does The Code Live?
- Design Elements Part 6: Math Operators and Functions
- Design Elements Part 7: Statistical Functions
- Design Elements Part 8: Summarization Statistical Algorithms
- Design Elements Part 9: Representing Data with Statistical Algorithms
- Design Elements Part 10: Interpreting the Data—Regression
- Design Elements Part 11: String Manipulation
- Design Elements Part 12: Loops
- Design Elements Part 13: Recursion
- Design Elements Part 14: Arrays
- Design Elements Part 15: Event-Driven Programming Vs. Scheduled Processes
- Design Elements Part 16: Event-Driven Programming
- Design Elements Part 17: Program Flow
- Forming Queries Part 1: Design
- Forming Queries Part 2: Query Basics
- Forming Queries Part 3: Query Optimization
- Forming Queries Part 4: SET Options
- Forming Queries Part 5: Table Optimization Hints
- Using SQL Server Templates
- Transact-SQL Unit Testing
- Index Tuning Wizard
- Unicode and SQL Server
- SQL Server Development Tools
- The SQL Server Transact-SQL Debugger
- The Transact-SQL Debugger, Part 2
- Basic Troubleshooting for Transact-SQL Code
- An Introduction to Spatial Data in SQL Server 2008
- Performance Tuning
- Practical Applications
- Professional Development
- Application Architecture Assessments
- Business Intelligence
- Tips and Troubleshooting
- Additional Resources
Design Elements Part 7: Statistical Functions
Last updated May 7, 2004.
We are surrounded with statistics. Used correctly, they form the bedrock of successful decisions. Used incorrectly, they can destroy a company by causing management to make bad decisions, or cause a faulty conclusion in a lab test evaluation.
The ability to properly deliver statistics is a great career skill. No programming arsenal of knowledge is complete without a sound understanding of statistical methods and their proper use.
In this week's tutorial, we'll learn the most important statistical methods and how to apply T-SQL functions and algorithms to implement them.
If heavy statistics and graphics are necessary in a project, the proper choice is normally to house the data in SQL Server and process it with a specialized package. T-SQL, however, offers useful commands and functions to deal with general statistics.
The Use of Statistics
One of the most basic uses of statistics is to summarize data. Business or scientific requirements often involve simple summaries of large quantities of data, which answer questions such as "How many?", "What's the average?" and "How many times does this occur?"
Another function statistical methods can help us with is grouping data. This use is called representing data since it takes the original order and rearranges it to show patterns or groups. The question here is "Do we have a meaningful group of data here or just random values?"
Statistical methods also help us to compare data so that we can view one thing versus another. Typical comparison questions are "How many workers are at plant X versus plant Y?" or "Do African honeybees fly farther in a day than European honeybees?"
Related to the comparison questions are statistical results that show a difference, such as "How much more protein is in peanut butter than in peanuts?"
Finally, statistics can be used to show relationships. This type of statistic is the most interesting, and the most dangerous. It seeks to answer questions such as "Does spending more money per student yield better grades overall?" This type of question is often difficult to answer, and rarely is a single sample of data or a single formula adequate to answer it.
With these uses and cautions in mind, let's take a look at the methods used to gather the data our algorithms will use.
The statistical method begins with asking questions. We've seen a few of those already, but there is a specific way to ask these kinds of questions. If the focus is a business system, the business users should be contacted with the questions and interpretations. If the data sets are scientific in nature, then the end users will help formulate the questions and relationships they are looking for. The point is, as the DBA or developer, you shouldn't try to guess what the questions are.
After we determine the questions, we need to collect the data. We covered the subject of database design earlier, so we use those concepts to create a model and store the data. This covers the question of how we store the data. The larger question is what we store.
Data collected for statistical use is called a sample. There are some important concepts to keep in mind when discussing statistical data, such as the sample size, sample period, and distribution of the sample.
The sample size is the amount of data required to make the results valid. For instance, if we asked the first three people we met whether they liked a recent movie, we can't really call that a representative size. Instead, we need to determine the number that allows us to have confidence in the results. For minor, less-important questions, this size might be quite small, but generally the more varied the possible results (kinds of food, house sizes, and so on), the larger the sample size needs to be.
The sample period is the length of time over which the data should be collected. For instance, checking the average temperature in an area requires several years of data collection.
Finally, the sample distribution of data is important to gain the widest audience or participants of the event. For our movie question, it might be important to ask people of many ages and social demographics to see if it was widely liked. On the other hand, if we're only asking to determine if we want to see the movie, the distribution needs to include only those people who share our tastes.
After we collect the data, we analyze it (or at least provide it to someone else to analyze). This is where we apply the methods we're about to learn to derive meaning or information from the raw sample data.
Now that we have our methods down, let's get right to the Transact-SQL that we can use to put it all together. For many of the algorithms we need, the statistical functions are part of the aggregate functions we've studied before. We'll just layer these concepts to get at what we need.
We'll begin with the basics: summarizing data. The primary functions we'll use here are SUM and COUNT. Here's an example:
```sql
USE pubs
GO
SELECT 'Total Sales, all stores:' = SUM(qty)
FROM sales
GO
```

```
Total Sales, all stores:
------------------------
493
```
This simple statement returns the sum of the sales made at all stores. As we learned in our aggregate studies, we can also provide data that is broken down by store number:
```sql
SELECT stor_id, 'Sales' = SUM(qty)
FROM sales
GROUP BY stor_id
ORDER BY Sales DESC
GO
```
This data is a bit more useful. It's often also helpful to show not only a ranking of the data such as we have here, but also the sample size. We do that with the COUNT function. It's pretty simple:
```sql
SELECT COUNT(*)
FROM sales
```

```
-----------
21
```
At first glance, the answer suggests that we don't have a great many stores reporting. Or does it? Actually, this number shows how many rows of data were collected, which might very well mean something else entirely. It's important to keep such distinctions in mind when analyzing numerical data.
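If what we actually want is the number of stores rather than the number of rows, COUNT accepts the DISTINCT keyword. A minimal sketch against the same pubs sales table:

```sql
-- Count the distinct stores that reported sales,
-- rather than the number of rows collected.
SELECT COUNT(DISTINCT stor_id) AS ReportingStores
FROM sales
```

The two counts differ whenever a store appears on more than one row, which is exactly the ambiguity described above.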
In both the SUM and COUNT functions, as with most functions, the WHERE condition can limit the results.
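For example, to restrict the summary to a single store (store ID 7066 is used here purely as an illustration from the pubs data):

```sql
-- Sum only the rows for one store; the WHERE clause
-- filters the rows before the aggregate is computed.
SELECT 'Sales for store 7066:' = SUM(qty)
FROM sales
WHERE stor_id = '7066'
```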
In addition to adding and counting the result sets, the average or mean of the data is part of the summarization step. Averages are probably one of the most misused functions in statistics. If I tell you that I have several coins in my pocket and the average value of them is 5 cents, what do I have in my pocket? Well, it all depends on the sample size and its distribution. I could have a few coins or many, and no nickels whatsoever, or they could all be nickels. For that reason, the average is often taken with several other measures. Also for this reason there are several kinds of averages, such as weighted or binomial averages.
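T-SQL has no built-in weighted average, but one can be assembled from the aggregates we already know. Here's a sketch using the pubs tables, weighting each title's price by the quantity sold (this assumes the sales and titles tables join on title_id, as they do in pubs):

```sql
-- Weighted average price: each title's price counts in proportion
-- to the quantity sold, unlike a plain AVG over the titles table.
-- Rows with a NULL price are excluded so the numerator and
-- denominator cover the same rows.
SELECT 'Weighted average price' = SUM(s.qty * t.price) / SUM(s.qty)
FROM sales s
INNER JOIN titles t
   ON s.title_id = t.title_id
WHERE t.price IS NOT NULL
```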
In T-SQL, what we have is the numerical average, or the function that represents the formula:
Average = The sum of the units divided by the number of units
The format of the command to get the numeric average looks like this:
```sql
SELECT 'Average Store Sales' = AVG(qty)
FROM sales
GO
```
Again, we could use a WHERE clause or a GROUP BY clause to display the desired result.
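For instance, the same average broken down by store, largest first:

```sql
-- Average order quantity per store; note that AVG on an
-- integer column performs integer division in T-SQL.
SELECT stor_id, 'Average Sale' = AVG(qty)
FROM sales
GROUP BY stor_id
ORDER BY AVG(qty) DESC
```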
Next week we'll take a look at some statistical functions that SQL Server doesn't natively provide: we'll have to write our own. We'll cover medians, modes, and more. We'll explain when to use each and how we can implement them. In the meantime, there's a little light reading in the references.
Really into statistics? Here's a great statistics site.
InformIT Tutorials and Sample Chapters
Just for fun: James Cortada explores statistics and sports.