Home > Articles > Data > SQL Server

SQL Server Reference Guide

Hosted by

Data Mining

Last updated Mar 28, 2003.

So far in the last few tutorials in this series I've described reporting, Data Marts, Data Warehousing and On-Line Analytical Processing (OLAP) systems, which are components of a Business Intelligence landscape. In this tutorial I'll complete the conceptual parts of the back-end systems with a discussion of Data Mining. In future articles I'll explain presentation methods and then I'll demonstrate the physical implementations of these concepts.

Data Mining is the process of looking through large sets of data for patterns or statistically relevant information. Until the advent of the computer, this was very difficult to do. With a computer, the analysis of the data is easier to come by. Perhaps too easy.

Let's take a simple example. Assume for a moment that a store sells clothing in various parts of the U.S. Several advertising methods are employed, and management wants to know if the advertising is effective. They could query the data set to see how sales were affected after a particular marketing campaign. From their queries they find that sales of jackets did indeed increase after the last television commercials. They deduce that television commercials are an effective way to increase sales of jackets.

This sounds perfectly plausible, until you take into account that the commercials ran at the beginning of the fall season. On further investigation, we find that the sales of jackets always go up in the fall in the U.S. since it's the start of colder weather. The television commercials may have had no effect on that result at all. And that's where the danger of having a computer to help analyze data – you can make mistakes much faster with a more powerful system. At least for now, they can't think for you.

True Data Mining software helps you avoid these kinds of mistakes. It has formulas that try to find the best fit for an answer to these patterns. A lot of industries, especially the financial kind, benefit from finding patterns. Search engines, such as Google also using pattern matching, not by crawling through large data sets, but by watching how often one site is linked to by others. The developers there felt that if more people were linking to a site on a particular subject, it must be a good place to find out more about that subject. That's a example of using other methods to locate a pattern.

It is important that both the financial and searching systems don't find patterns where they don't exist. Data Mining software can account for more variables than spreadsheets or other semi-automated methods.

A lot of OLAP systems include the words Data Mining, whether they really do it or not. To truly qualify as having Data Mining, a separate set of algorithms are needed to cull the data and present patterns. At the base of these systems are powerful statistical and numeric evaluation engines. These engines are built to do "historical prediction", which means they use statistics like regression analysis. Regression analysis takes a large set of historical data and fits it to an average, accounting for spikes and valleys. It then "extends" the average with a formula.

To show why this is needed, consider a group of workers operating at varying, slowly increasing speeds. In other words, one day they can produce 30 items, the next day 25, and the next three days 37 items. You would like to predict how many items they might produce a year from now. You could just average the numbers and assume that they grow their production rate by around 2 or so each work week. The problem with a straight average, however, is that it loses accuracy over time, especially when the numbers vary greatly. Statistics like Regression Analysis take those differences into account, creating a curve rather than a straight-line prediction, since that's closer to reality.

But Data Mining isn't just about statistics. The software still has to account for false patterns. For instance, almost every day since I've been a child, I've brushed my teeth in the morning. Several mornings the sun has come out shortly after this process. Those two events are facts – but they aren't related. A simple "after this, therefore because of this" isn't correct in this case. Although this might be a trivial example, I've seen the same logic in a more complex fashion used by business to justify a course of action.

This type of error is rather easy to spot, but the real power of Data Mining software lies in the ability to derive previously unknown patterns from the data. The way that the Data Mining engine works is by developing a model from the data patterns and then testing them against real data. If the model accurately predicts what happened, it is possible that it will accurately predict the future.

The big variables in Data Mining systems are how much historical data you can provide and how far into the future the predictions run. In general, having twenty years worth of data and wanting to know more about how the next two weeks will look is far more accurate than the other way around.

The primary difference between an OLAP system and Data Mining is the level of detail required. Because an OLAP system is based on aggregates, the last level of storage it needs involves losing a great deal of detail. Data Mining systems, on the other hand, require a great deal of detail all the way through the process.

As we'll see in the implementation tutorials, Microsoft Analysis Services provides both OLAP and Data Mining features. Each of these is a truly separate set of capabilities, even though both use some common software components.

In the next few tutorials we'll leave the back-end concepts and begin to focus on what the user sees.

Informit Articles and Sample Chapters

There's a good book on the entire topic of Data Warehousing that also covers some Data Mining techniques called Data Warehousing in the Real World: A practical guide for building Decision Support Systems by S. Anahory and D. Murray in the Bookstore.

Online Resources

Kurt Thearling has been involved in several Data Mining software efforts, and has a great set of whitepapers and other information on the process.