 Introduction

Table of Contents
 Microsoft SQL Server Defined
 Microsoft SQL Server Features
 Microsoft SQL Server Administration

Microsoft SQL Server Programming
 An Outline for Development
 Database
 Database Services
 Database Objects: Databases
 Database Objects: Tables
 Database Objects: Table Relationships
 Database Objects: Keys
 Database Objects: Constraints
 Database Objects: Data Types
 Database Objects: Views
 Database Objects: Stored Procedures
 Database Objects: Indexes
 Database Objects: User Defined Functions
 Database Objects: Triggers
 Database Design: Requirements, Entities, and Attributes
 Business Process Model Notation (BPMN) and the Data Professional
 Business Questions for Database Design, Part One
 Business Questions for Database Design, Part Two
 Database Design: Finalizing Requirements and Defining Relationships
 Database Design: Creating an Entity Relationship Diagram
 Database Design: The Logical ERD
 Database Design: Adjusting The Model
 Database Design: Normalizing the Model
 Creating The Physical Model
 Database Design: Changing Attributes to Columns
 Database Design: Creating The Physical Database
 Database Design Example: Curriculum Vitae
 NULLs
 The SQL Server Sample Databases
 The SQL Server Sample Databases: pubs
 The SQL Server Sample Databases: NorthWind
 The SQL Server Sample Databases: AdventureWorks
 The SQL Server Sample Databases: Adventureworks Derivatives
 UniversalDB: The Demo and Testing Database, Part 1
 UniversalDB: The Demo and Testing Database, Part 2
 UniversalDB: The Demo and Testing Database, Part 3
 UniversalDB: The Demo and Testing Database, Part 4
 Getting Started with TransactSQL
 TransactSQL: Data Definition Language (DDL) Basics
 TransactSQL: Limiting Results
 TransactSQL: More Operators
 TransactSQL: Ordering and Aggregating Data
 TransactSQL: Subqueries
 TransactSQL: Joins
 TransactSQL: Complex Joins  Building a View with Multiple JOINs
 TransactSQL: Inserts, Updates, and Deletes
 An Introduction to the CLR in SQL Server 2005
 Design Elements Part 1: Programming Flow Overview, Code Format and Commenting your Code
 Design Elements Part 2: Controlling SQL's Scope
 Design Elements Part 3: Error Handling
 Design Elements Part 4: Variables
 Design Elements Part 5: Where Does The Code Live?
 Design Elements Part 6: Math Operators and Functions
 Design Elements Part 7: Statistical Functions
 Design Elements Part 8: Summarization Statistical Algorithms
 Design Elements Part 9:Representing Data with Statistical Algorithms
 Design Elements Part 10: Interpreting the Data—Regression
 Design Elements Part 11: String Manipulation
 Design Elements Part 12: Loops
 Design Elements Part 13: Recursion
 Design Elements Part 14: Arrays
 Design Elements Part 15: EventDriven Programming Vs. Scheduled Processes
 Design Elements Part 16: EventDriven Programming
 Design Elements Part 17: Program Flow
 Forming Queries Part 1: Design
 Forming Queries Part 2: Query Basics
 Forming Queries Part 3: Query Optimization
 Forming Queries Part 4: SET Options
 Forming Queries Part 5: Table Optimization Hints
 Using SQL Server Templates
 TransactSQL Unit Testing
 Index Tuning Wizard
 Unicode and SQL Server
 SQL Server Development Tools
 The SQL Server TransactSQL Debugger
 The TransactSQL Debugger, Part 2
 Basic Troubleshooting for TransactSQL Code
 An Introduction to Spatial Data in SQL Server 2008
 Performance Tuning
 Practical Applications
 Professional Development
 Application Architecture Assessments
 Business Intelligence
 Tips and Troubleshooting
 Additional Resources
Design Elements Part 10: Interpreting the Data—Regression
Last updated Mar 28, 2003.
Last week, we learned about a simple framework to follow to build programs. This week, we'll use that framework to create a complex statistical algorithm.
You'll recall from last week's chapter that, while a more formal approach is called for in larger projects, our simplified development framework has four main phases:
Understand the goal or problem as completely as possible
Comment the process
Code the comments
Optimize the code
We'll use these phases to create today's algorithm.
Understand the goal or problem as completely as possible
Let's begin the process by understanding, as completely as possible, what we want to accomplish. There may not be time to become an expert in a particular area, but try to understand as far as you can. When you reach your limit, bring topic experts into the design process.
Today's concept is regression statistics. Regression analysis shows relationships between sets of data. In using this analysis method, we're trying to show if one thing is related to another. Be careful here; we can't exactly say that since we brush our teeth each morning and the sun comes up each morning, they are related!
In a simple example of a regression, we can examine one dependent variable in relation to one independent variable. The formula looks like this:
Y = a + bX
Where Y is the value of the dependent variable, X is the value of the independent variable, a is the intercept of the regression line on the Y axis when X = 0, and b is the slope of the regression line. We can take the data, plug it into the formula, and plot the points.
But there's a simpler way to look at this kind of data. Statistical algorithms allow us to understand the data without having to process it all. This is the "sample" we've learned about.
Let's take a look at a concrete example. Below we have the chart of data for rainfall and the inches of the jalapeño pepper plant's growth in my garden:
Growth  Days 
10  6 
7  4 
12  7 
12  8 
9  10 
16  7 
12  10 
18  15 
8  5 
12  6 
14  11 
16  13 
Rather than solving the formula above, we can plot this data graphically. Once we have the data plotted on an X and Y axis, we can see a pattern emerge, which seems to indicate that growth follows rain. Here's the way that looks:
SELECT REPLICATE(' ',Growth) + '*' FROM test ORDER BY days DESC
And here's the output:
* * * * * * * * * * * *
As we can see, drawing a line that hits as close to all the points we've created forms a rather distinct line, sloping upward, showing a possible relationship between the number of days of rain and the growth in inches.
In the code snippet above we've used a space to indent the result, and print an asterisk to mark the spot. We then ordered the data by day to see if it "scattered" around a particular point.
OK, this is probably not very useful  problems arise with large data sets or values above 256  but it is fun to see what can be accomplished with simple TSQL! The point of this exercise is that sometimes we have to look at the graphical movement of data rather than the discrete numbers. It's important to pick the best method to examine the data at hand. While the graphing method is useful, it's really a number that we're after. We'd like to find some sort of number that will show us whether one set of data could be related to another, and how strongly.
There is a formula that will help us  it's called a regression coefficient. While the exact formula is a bit on the symbolic side, here's what it looks like:
Wait! It's actually not as bad as it looks. What we're given by this monster is a number between 1 and 1. If the number is closer to 1, then items are inversely related. That is, the less rain, the higher the growth. If the number is closer to +1, then more rain means more growth. Closer to 0 and the two aren't considered to be related.
It's not really all that bad. Let's just break down the formula into comments one part at a time. The only statistical symbol you have to learn is the big "E" looking letter, called sigma. It simply means "the sum of."
Comment the process
To create our algorithm, we'll treat it as a simple program. Let's take the formula and rewrite it as a word problem:
/* First, we examine the whole formula. We see that there are several sums required, and a few of them are sums of numbers we don't have, such as x*y and x^2 or y^2. It might be easier to construct a temporary table that has those numbers already computed. We see that we need the x's raised to a power of two as well as the y's, and we also need each x times each y. */ /* While we're at it, let's declare some variables to hold the parts of the formula. */ /* Now, we take on the numerator. The first part is the sum of the x's times the y's. Notice that it's not the sum of x times the sum of y, but each x and y multiplied, and then summed. */ /* Next, we take that number and subtract the following: the sum of all the x's times the sums of all the y's divided by the number of items in the set. We now have our numerator. */ /* We move to the denominator. We need to take the square root of the whole denominator. We'll leave that for the last step. We notice that the two parts of the denominator are the same, except that the numbers are x's and y's. We may be able to use that later in the optimization step. First, we need to get the sum of all the x's which have been raised to the power of two. Note that we don't take the sum of x's and then raise that value to a power of two, it's every value that we're after. Then we need to take that number, and subtract from it the very number we weren't looking for a moment ago (the sum of x's and then raise that value to a power of two), divided by the number of values in the set. */ /* Now we multiply those two values together, and then take the whole denominator and take the square root. We can also take the completed numerator and divide it by the completed denominator. */ /* And we're there. Now we select the answers. */ /* Don't forget to clean up! */
Notice the level of granularity we've chosen here. We think that we've broken down a complex process as simply as we can. We'll find out if the comments need to be "tweaked" in the next step.
Code the comments
Now it's just a matter of applying some of our programming knowhow to those comments. Here we go:
/* First, we examine the whole formula. We see that there are several sums required, and a few of them are sums of numbers we don't have, such as x*y and x^2 or y^2. It might be easier to construct a temporary table that has those numbers already computed. We see that we need the x's raised to a power of two as well as the y's, and we also need each x times each y. */ USE pubs GO CREATE TABLE #Regression(x int, y int, x2 int, y2 int, xy int) INSERT INTO #Regression(x, y, x2, y2, xy) SELECT Growth, Days, POWER(Growth, 2), POWER(Days, 2), Growth * Days FROM test /* While we're at it, let's declare some variables to hold the parts of the formula. */ DECLARE @Exy as int , @Ex as int , @Ey as int , @n as int , @Ex2 as int , @Ey2 as int , @r as decimal(10, 5) , @a as int , @b as int , @c as int /* Now we take on the numerator. The first part is the sum of the x's times the y's. Notice that it's not the sum of x times the sum of y, but each x and y multiplied, and then summed. */ SET @Exy = (SELECT SUM(xy) FROM #Regression) /* Next, we take that number and subtract the following: the sum of all the x's times the sums of all the y's divided by the number of items in the set. We now have our numerator. */ SET @Ex = (SELECT SUM(x) FROM #Regression) SET @Ey = (SELECT SUM(y) FROM #Regression) SET @n = (SELECT COUNT(x) FROM #Regression) SET @a = @Exy  ((@Ex*@Ey)/@n) /* Next we move to the denominator. We notice that we need to take the square root of the whole denominator. We'll leave that for the last step. We notice that the two parts of the denominator are the same, except that the numbers are x's and y's. We may be able to use that later in the optimization step. First, we need to get the sum of all the x's which have been raised to the power of two. Note that we don't take the sum of x's and then raise that value to a power of two, it's every value that we're after. Then we need to take that number, and subtract from it the very number we weren't looking for a moment ago (the sum of x's and then raise that value to a power of two), divided by the number of values in the set. */ SET @Ex2 = (SELECT SUM(POWER(x, 2)) FROM #Regression) SET @b = @Ex2  ((POWER(@Ex, 2))/@n) /* We just need to repeat the two steps above for y. */ SET @Ey2 = (SELECT SUM(POWER(y ,2)) FROM #Regression) SET @c = @Ey2  ((POWER(@Ey, 2))/@n) /* Now we multiply those two values together, and then take the whole denominator and take the square root. We can also take the completed numerator and divide it by the completed denominator. */ SET @r = (@a/(SQRT(@b*@c))) /* And we're there. Now we select the answers. */ SELECT @Ex AS 'Ex' , @Ey AS 'Ey' , @Ex2 AS 'Ex2' , @Ey2 AS 'Ey2' , @Exy AS 'Exy' , @n AS 'n' , @a AS 'a' , @b AS 'b' , @c AS 'c' , @r AS 'r' /* Don't forget to clean up! */ DROP TABLE #Regression GO  (12 row(s) affected) Ex Ey Ex2 Ey2 Exy n a b c r           146 102 1902 990 1334 12 93 126 123 .74704 (1 row(s) affected)
There we have it. The number is pretty close to +1, so we can say that rain tends to cause more growth in our garden. (I bet you already knew that!)
Why not just do all this in one or two lines? Well, we could, but that wouldn't make for a very good tutorial! Plus, that's what the optimization step is for.
Optimize the code
To optimize  no, wait. Let's hear from you. Send me an email (woodyb@hotmail.com) with the subject line of "Statistical Regression." I'll post a few of the responses in our upcoming articles.
Online Resources
Will Hopkins has a good article with an explanation of the statistical formulas around the regression coefficient.
InformIT Tutorials and Sample Chapters
Katrina Maxwell has a worthwhile application of statistical analysis in her sample chapter, A Data Analysis Methodology for Software Managers.