Home > Articles > Data > SQL Server

SQL Server Reference Guide

Hosted by

Toggle Open Guide Table of ContentsGuide Contents

Close Table of ContentsGuide Contents

Close Table of Contents

Design Elements Part 10: Interpreting the Data—Regression

Last updated Mar 28, 2003.

Last week, we learned about a simple framework to follow to build programs. This week, we'll use that framework to create a complex statistical algorithm.

You'll recall from last week's chapter that, while a more formal approach is called for in larger projects, our simplified development framework has four main phases:

  1. Understand the goal or problem as completely as possible

  2. Comment the process

  3. Code the comments

  4. Optimize the code

We'll use these phases to create today's algorithm.

Understand the goal or problem as completely as possible

Let's begin the process by understanding, as completely as possible, what we want to accomplish. There may not be time to become an expert in a particular area, but try to understand as far as you can. When you reach your limit, bring topic experts into the design process.

Today's concept is regression statistics. Regression analysis shows relationships between sets of data. In using this analysis method, we're trying to show if one thing is related to another. Be careful here; we can't exactly say that since we brush our teeth each morning and the sun comes up each morning, they are related!

In a simple example of a regression, we can examine one dependent variable in relation to one independent variable. The formula looks like this:

Y = a + bX

Where Y is the value of the dependent variable, X is the value of the independent variable, a is the intercept of the regression line on the Y axis when X = 0, and b is the slope of the regression line. We can take the data, plug it into the formula, and plot the points.

But there's a simpler way to look at this kind of data. Statistical algorithms allow us to understand the data without having to process it all. This is the "sample" we've learned about.

Let's take a look at a concrete example. Below we have the chart of data for rainfall and the inches of the jalapeño pepper plant's growth in my garden:

Growth Days
10 6
7 4
12 7
12 8
9 10
16 7
12 10
18 15
8 5
12 6
14 11
16 13

Rather than solving the formula above, we can plot this data graphically. Once we have the data plotted on an X and Y axis, we can see a pattern emerge, which seems to indicate that growth follows rain. Here's the way that looks:

SELECT 
REPLICATE(' ',Growth) + '*'
FROM test 
ORDER BY days DESC

And here's the output:

         *
        *
       *
      *
     *
      *
      *
        *
     *
      *
    *
    *

As we can see, drawing a line that hits as close to all the points we've created forms a rather distinct line, sloping upward, showing a possible relationship between the number of days of rain and the growth in inches.

In the code snippet above we've used a space to indent the result, and print an asterisk to mark the spot. We then ordered the data by day to see if it "scattered" around a particular point.

OK, this is probably not very useful - problems arise with large data sets or values above 256 - but it is fun to see what can be accomplished with simple T-SQL! The point of this exercise is that sometimes we have to look at the graphical movement of data rather than the discrete numbers. It's important to pick the best method to examine the data at hand. While the graphing method is useful, it's really a number that we're after. We'd like to find some sort of number that will show us whether one set of data could be related to another, and how strongly.

There is a formula that will help us - it's called a regression coefficient. While the exact formula is a bit on the symbolic side, here's what it looks like:

Wait! It's actually not as bad as it looks. What we're given by this monster is a number between -1 and 1. If the number is closer to -1, then items are inversely related. That is, the less rain, the higher the growth. If the number is closer to +1, then more rain means more growth. Closer to 0 and the two aren't considered to be related.

It's not really all that bad. Let's just break down the formula into comments one part at a time. The only statistical symbol you have to learn is the big "E" looking letter, called sigma. It simply means "the sum of."

Comment the process

To create our algorithm, we'll treat it as a simple program. Let's take the formula and re-write it as a word problem:

/* First, we examine the whole formula. We see that there are several sums required,
and a few of them are sums of numbers we don't have, such as x*y and x^2 or y^2.
It might be easier to construct a temporary table that has those numbers already computed.
 We see that we need the x's raised to a power of two as well as the y's, and we also need
 each x times each y. */
/* While we're at it, let's declare some variables to hold the parts of the formula. */
/* Now, we take on the numerator. The first part is the sum of the x's times the y's.
 Notice that it's not the sum of x times the sum of y, but each x and y multiplied,
 and then summed. */
/* Next, we take that number and subtract the following: the sum of all the x's times
 the sums of all the y's divided by the number of items in the set.
 We now have our numerator. */
/* We move to the denominator. We need to take the square root of the whole denominator.
 We'll leave that for the last step. We notice that the two parts of the denominator
 are the same, except that the numbers are x's and y's. We may be able to use that
 later in the optimization step. First, we need to get the sum of all the x's which
 have been raised to the power of two. Note that we don't take the sum of x's and then
 raise that value to a power of two, it's every value that we're after. Then we need to
 take that number, and subtract from it the very number we weren't looking for a moment 
ago (the sum of x's and then raise that value to a power of two), divided by the number
 of values in the set. */
/* Now we multiply those two values together, and then take the whole denominator and take
 the square root. We can also take the completed numerator and divide it by the
 completed denominator. */
/* And we're there. Now we select the answers. */
/* Don't forget to clean up! */

Notice the level of granularity we've chosen here. We think that we've broken down a complex process as simply as we can. We'll find out if the comments need to be "tweaked" in the next step.

Code the comments

Now it's just a matter of applying some of our programming know-how to those comments. Here we go:

/* First, we examine the whole formula. We see that there are several sums required, and a few of them are sums of numbers we don't have, such as x*y and x^2 or y^2. It might be easier to construct a temporary table that has those numbers already computed. We see that we need the x's raised to a power of two as well as the y's, and we also need each x times each y. */
USE pubs
GO
CREATE TABLE #Regression(x int, y int, x2 int, y2 int, xy int) 
INSERT INTO #Regression(x, y, x2, y2, xy) 
	SELECT Growth, Days, POWER(Growth, 2), POWER(Days, 2), Growth * Days 
FROM test
/* While we're at it, let's declare some variables to hold the parts of the formula. */
DECLARE 
@Exy as int
, @Ex as int
, @Ey as int
, @n as int
, @Ex2 as int
, @Ey2 as int
, @r as decimal(10, 5)
, @a as int
, @b as int
, @c as int
/* Now we take on the numerator. The first part is the sum of the x's times the y's.
 Notice that it's not the sum of x times the sum of y, but each x and y multiplied,
 and then summed. */
SET @Exy = (SELECT SUM(xy) FROM #Regression)
/* Next, we take that number and subtract the following: the sum of all the x's times
 the sums of all the y's divided by the number of items in the set.
 We now have our numerator. */
SET @Ex = (SELECT SUM(x) FROM #Regression)
SET @Ey = (SELECT SUM(y) FROM #Regression)
SET @n = (SELECT COUNT(x) FROM #Regression)
SET @a = @Exy - ((@Ex*@Ey)/@n)
/* Next we move to the denominator. We notice that we need to take the square root of
 the whole denominator. We'll leave that for the last step. We notice that the two parts
 of the denominator are the same, except that the numbers are x's and y's. We may be able
 to use that later in the optimization step. First, we need to get the sum of all the x's
 which have been raised to the power of two. Note that we don't take the sum of x's and
 then raise that value to a power of two, it's every value that we're after. Then we need
 to take that number, and subtract from it the very number we weren't looking for a moment
 ago (the sum of x's and then raise that value to a power of two), divided by the number of
 values in the set. */
SET @Ex2 = (SELECT SUM(POWER(x, 2)) FROM #Regression)
SET @b = @Ex2 - ((POWER(@Ex, 2))/@n)
/* We just need to repeat the two steps above for y. */
SET @Ey2 = (SELECT SUM(POWER(y ,2)) FROM #Regression)
SET @c = @Ey2 - ((POWER(@Ey, 2))/@n)
/* Now we multiply those two values together, and then take the whole denominator and take
 the square root. We can also take the completed numerator and divide it by the
 completed denominator. */
SET @r = (@a/(SQRT(@b*@c)))
/* And we're there. Now we select the answers. */
SELECT @Ex AS 'Ex'
, @Ey AS 'Ey'
, @Ex2 AS 'Ex2'
, @Ey2 AS 'Ey2'
, @Exy AS 'Exy'
, @n AS 'n'
, @a AS 'a'
, @b AS 'b'
, @c AS 'c'
, @r AS 'r'	
/* Don't forget to clean up! */
DROP TABLE #Regression
GO
-----------
(12 row(s) affected)

Ex          Ey          Ex2         Ey2         Exy         n           a           b           c           r      
----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- ------------ 
146         102         1902        990         1334        12          93          126         123         .74704

(1 row(s) affected)

There we have it. The number is pretty close to +1, so we can say that rain tends to cause more growth in our garden. (I bet you already knew that!)

Why not just do all this in one or two lines? Well, we could, but that wouldn't make for a very good tutorial! Plus, that's what the optimization step is for.

Optimize the code

To optimize - no, wait. Let's hear from you. Send me an e-mail (woodyb@hotmail.com) with the subject line of "Statistical Regression." I'll post a few of the responses in our upcoming articles.

Online Resources

Will Hopkins has a good article with an explanation of the statistical formulas around the regression coefficient.

InformIT Tutorials and Sample Chapters

Katrina Maxwell has a worthwhile application of statistical analysis in her sample chapter, A Data Analysis Methodology for Software Managers.