
SQL Server Reference Guide


The Enterprise Data Warehouse — Aggregations and the Star Schema

Last updated Mar 28, 2003.

After covering the requirements gathering phase, data sources, the Operational Data Store (ODS), the Extract, Transform and Load process, Data Marts and Designing Data Elements, we now move on to the final step in data storage for the Business Intelligence landscape: Data Warehouses and their big brother, the Enterprise Data Warehouse. This is the last concept for storing data, but not the last concept for the landscape. We'll also explore analytical and presentation methods before we dive into the mechanics of each of these systems.

I'll dispense with the difference between a Data Warehouse and a Data Mart first. Recall that a Data Mart stores regional or functional strategic, analytical data. It's a step up from the Operational Data Store, which houses more line-item, tactical data for a single data source or set of data sources. A Data Mart provides the strategic data that the regional or functional manager cares about, and because each Data Mart is located in the same geographical area as the people who use it, you'll get good performance. The data in a Data Mart is also less detailed, with more aggregations, where you sum the data over time and roll it up into single entries. You also begin to have information in the data, since your business analysts are telling you how to transform source data into new elements, as you saw in the last tutorial.

Because you normally have a Data Mart for each region or function, odds are you'll have multiple Data Marts. A Data Warehouse is the next level of both the aggregation and the strategic transformations of the data, taking data from multiple Data Marts. A Data Warehouse is also the next level up in storage requirements. Although you'll aggregate that data, you still have a lot of it to bring in.

You can have multiple Data Warehouses as well, if your organization is very large. If you do, you may need to aggregate the data from the Data Warehouses into an Enterprise Data Warehouse. The only real difference between a Data Warehouse and an Enterprise Data Warehouse then is where it sits in the chain and how large your firm is.

When your firm is small, you might begin with one source system, one ODS, one Data Mart, and one Data Warehouse. The advantage of breaking out your systems this way is that when you grow you can begin to add more and more of each system and the layers and strategies don't have to change as often.

The other difference between a Data Warehouse and the lower-level systems is how the data is stored. In the On-line Transaction Processing (OLTP) workloads characteristic of source systems, the data is stored in a relational format. This type of storage is the focus of my tutorials on Database Design. The fundamentals of this type of storage are to store data only once, to provide a mechanism for joining the tables together to form relationships using primary and foreign keys, and to make sure that each column holds a discrete value of data. This type of design is called normalization, and it follows a concrete set of rules. Here's an example of data stored relationally for a source system at a university:

Student Table

| StudentKey | StudentName | OtherStudentInfo... |
|---|---|---|
| 1 | Jane Doe | Freshman |
| 2 | John Smith | Freshman |
| 3 | John Doe | Junior |

Professor Table

| ProfessorKey | ProfessorName | OtherProfessorInfo |
|---|---|---|
| 1 | Thomas Jefferson | Degrees |
| 2 | George Washington | Degrees |
| 3 | Ben Franklin | Degrees |

ClassSchedule Table

| ClassKey | StudentKey | ProfessorKey | Date | Subject |
|---|---|---|---|---|
| 1 | 1 | 1 | 01/01/01 | Integrity |
| 2 | 1 | 2 | 01/01/01 | Honesty |
| 3 | 2 | 3 | 01/01/01 | Intelligence |
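To make the primary and foreign key relationships concrete, here is a minimal sketch of these three tables. It assumes Python's built-in sqlite3 module as a stand-in for SQL Server so that it runs anywhere, and the ClassLevel column name is my own placeholder for the "OtherStudentInfo..." column. Treat it as an illustration of the idea rather than a production design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE Student (
    StudentKey  INTEGER PRIMARY KEY,
    StudentName TEXT NOT NULL,
    ClassLevel  TEXT              -- placeholder for "OtherStudentInfo..."
);
CREATE TABLE Professor (
    ProfessorKey  INTEGER PRIMARY KEY,
    ProfessorName TEXT NOT NULL
);
-- Each schedule row stores only keys; names live once, in their own tables.
CREATE TABLE ClassSchedule (
    ClassKey     INTEGER PRIMARY KEY,
    StudentKey   INTEGER NOT NULL REFERENCES Student(StudentKey),
    ProfessorKey INTEGER NOT NULL REFERENCES Professor(ProfessorKey),
    Date         TEXT NOT NULL,
    Subject      TEXT NOT NULL
);
""")

conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [(1, "Jane Doe", "Freshman"), (2, "John Smith", "Freshman"), (3, "John Doe", "Junior")],
)
conn.executemany(
    "INSERT INTO Professor VALUES (?, ?)",
    [(1, "Thomas Jefferson"), (2, "George Washington"), (3, "Ben Franklin")],
)
conn.executemany(
    "INSERT INTO ClassSchedule VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, "01/01/01", "Integrity"),
        (2, 1, 2, "01/01/01", "Honesty"),
        (3, 2, 3, "01/01/01", "Intelligence"),
    ],
)

# A schedule row that points at a student who does not exist is rejected.
try:
    conn.execute("INSERT INTO ClassSchedule VALUES (4, 99, 1, '01/01/01', 'Integrity')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

schedule_rows = conn.execute("SELECT COUNT(*) FROM ClassSchedule").fetchone()[0]
print(rejected, schedule_rows)
```

The foreign keys are what keep the schedule honest: the database itself refuses a row that refers to a missing student or professor.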

In the ODS much of this design carries through, but since it is optimized for reporting, many of the tables are de-normalized, which means that the individual elements are combined into a single row, and much of the data repeats itself. This provides rapid reporting and low locking, but takes more room:

ClassReport Table

| StudentName | Subject | Date | Professor |
|---|---|---|---|
| Jane Doe | Integrity | 01/01/01 | Thomas Jefferson |
| Jane Doe | Honesty | 01/01/01 | George Washington |
| John Smith | Intelligence | 01/01/01 | Ben Franklin |

In this kind of storage, elements are repeated, but that's OK because the data is read-only to the clients and is historical in nature. It's also a bit less detailed, since the status of a student or the degrees a professor holds might not be important at the strategic level.
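One way to picture the de-normalization step is as a join that is run once and saved. The sketch below assumes Python's built-in sqlite3 module rather than SQL Server, and it builds the ClassReport table from the three relational tables with a single CREATE TABLE ... AS SELECT statement. A real ODS load would normally go through your ETL process, so take this only as an illustration of the shape of the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (StudentKey INTEGER PRIMARY KEY, StudentName TEXT);
CREATE TABLE Professor (ProfessorKey INTEGER PRIMARY KEY, ProfessorName TEXT);
CREATE TABLE ClassSchedule (
    ClassKey INTEGER PRIMARY KEY, StudentKey INTEGER,
    ProfessorKey INTEGER, Date TEXT, Subject TEXT
);

INSERT INTO Student VALUES (1, 'Jane Doe'), (2, 'John Smith'), (3, 'John Doe');
INSERT INTO Professor VALUES
    (1, 'Thomas Jefferson'), (2, 'George Washington'), (3, 'Ben Franklin');
INSERT INTO ClassSchedule VALUES
    (1, 1, 1, '01/01/01', 'Integrity'),
    (2, 1, 2, '01/01/01', 'Honesty'),
    (3, 2, 3, '01/01/01', 'Intelligence');

-- Flatten the three tables into one wide reporting table.
-- Names are copied into every row, so they repeat.
CREATE TABLE ClassReport AS
SELECT s.StudentName, c.Subject, c.Date, p.ProfessorName AS Professor
FROM ClassSchedule AS c
JOIN Student   AS s ON s.StudentKey   = c.StudentKey
JOIN Professor AS p ON p.ProfessorKey = c.ProfessorKey;
""")

# Reports now read one table with no joins at query time.
report = conn.execute(
    "SELECT StudentName, Subject, Date, Professor "
    "FROM ClassReport ORDER BY StudentName, Subject"
).fetchall()

for row in report:
    print(row)
```

Notice that "Jane Doe" is now stored twice. That is the trade the ODS makes: more room used in exchange for simpler, faster reads.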

In the last tutorial I mentioned a new way of storing data for the Data Mart and the Data Warehouse. This schema is optimized for aggregate data rather than detailed reporting or OLTP transactions. The data is placed into a star schema, which is named for the shape that the tables make when you arrange them.

The star schema is based on two basic elements: dimensions and facts. A dimension is a category that you want to analyze the data by, and a fact is an aggregated number associated with it. In the example we've been using here, the school might care about how many students have attended each kind of class. In this case the dimension is a class name, and the fact is the number of students who attended it.

But it goes a bit further than that. Most managers want to know how many people attended the class during a certain period of time, perhaps last semester or last year. That adds another dimension called Time. In addition, perhaps they want to know the classes broken out by teachers, since more than one teacher might teach a particular class. That makes another dimension, called Teacher. There may be even more information desired, such as the number of students who enrolled after a type of recruitment drive, such as local school visits or mail-outs. That makes another dimension called DriveType. You might want to break out information by season, sports ranking or any other measurement. All of these would make new dimensions.
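The roll-up from detail rows to a fact can be sketched with a GROUP BY. The example below assumes Python's built-in sqlite3 module in place of SQL Server, and it counts students per subject, date, and professor from ClassSchedule-style rows. The last two rows are hypothetical additions of mine so that one count is greater than one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ClassSchedule (
        ClassKey INTEGER PRIMARY KEY,
        StudentKey INTEGER, ProfessorKey INTEGER,
        Date TEXT, Subject TEXT
    )
""")
conn.executemany(
    "INSERT INTO ClassSchedule VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, "01/01/01", "Integrity"),
        (2, 1, 2, "01/01/01", "Honesty"),
        (3, 2, 3, "01/01/01", "Intelligence"),
        # Hypothetical extra rows, not part of the original example data:
        (4, 2, 1, "01/01/01", "Integrity"),
        (5, 3, 1, "01/01/01", "Integrity"),
    ],
)

# Each GROUP BY column plays the role of a dimension;
# the count of students is the fact.
attendance = conn.execute("""
    SELECT Subject, Date, ProfessorKey,
           COUNT(DISTINCT StudentKey) AS Students
    FROM ClassSchedule
    GROUP BY Subject, Date, ProfessorKey
    ORDER BY Subject
""").fetchall()

for row in attendance:
    print(row)
```

Adding another dimension, such as DriveType, would mean adding another column to the GROUP BY, provided the source rows carry that information.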

Let's look at a practical example of a star schema using only four dimensions and one fact:

StudentDimension

| StudentKey | StudentName |
|---|---|
| 1 | Freshmen |
| 2 | Juniors |
| 3 | Seniors |

ClassDimension

| ClassKey | ClassName |
|---|---|
| 1 | Integrity |
| 2 | Honesty |
| 3 | Intelligence |

TimeDimension

| TimeKey | Semester |
|---|---|
| 1 | 01/01/01 |
| 2 | 01/04/01 |
| 3 | 01/08/01 |

ProfessorDimension

| ProfessorKey | ProfessorName |
|---|---|
| 1 | Thomas Jefferson |
| 2 | George Washington |
| 3 | Ben Franklin |

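AttendanceFacts

| StudentKey | ClassKey | TimeKey | ProfessorKey | AttendanceFact |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 120 |
| 1 | 2 | 1 | 1 | 234 |
| 2 | 1 | 1 | 1 | 50 |
| 3 | 1 | 1 | 1 | 25 |
| 1 | 3 | 1 | 3 | 230 |
| 3 | 2 | 2 | 2 | 750 |
| 1 | 3 | 3 | 3 | 276 |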

By joining this type of information together, you have rapid answers to questions such as "Which types of students take the most courses? In which semesters? By which professors?" This allows the board of directors to decide when and where to recruit, which teachers to retain, and what classes to expand. Used together with the On-Line Analytical Processing system, your users can navigate through this data in surprising new ways.
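Here is a rough sketch of how those questions might be asked of the star schema. It assumes Python's built-in sqlite3 module rather than SQL Server, and it loads the dimension and fact rows from the example. The total_by helper is a name I made up for illustration: it joins the fact table to one dimension and sums the attendance for each member of that dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StudentDimension   (StudentKey INTEGER PRIMARY KEY, StudentName TEXT);
CREATE TABLE ClassDimension     (ClassKey INTEGER PRIMARY KEY, ClassName TEXT);
CREATE TABLE TimeDimension      (TimeKey INTEGER PRIMARY KEY, Semester TEXT);
CREATE TABLE ProfessorDimension (ProfessorKey INTEGER PRIMARY KEY, ProfessorName TEXT);

-- The fact table sits in the middle of the star: one key per dimension,
-- plus the aggregated number.
CREATE TABLE AttendanceFacts (
    StudentKey     INTEGER REFERENCES StudentDimension(StudentKey),
    ClassKey       INTEGER REFERENCES ClassDimension(ClassKey),
    TimeKey        INTEGER REFERENCES TimeDimension(TimeKey),
    ProfessorKey   INTEGER REFERENCES ProfessorDimension(ProfessorKey),
    AttendanceFact INTEGER NOT NULL,
    PRIMARY KEY (StudentKey, ClassKey, TimeKey, ProfessorKey)
);

INSERT INTO StudentDimension VALUES (1, 'Freshmen'), (2, 'Juniors'), (3, 'Seniors');
INSERT INTO ClassDimension VALUES (1, 'Integrity'), (2, 'Honesty'), (3, 'Intelligence');
INSERT INTO TimeDimension VALUES (1, '01/01/01'), (2, '01/04/01'), (3, '01/08/01');
INSERT INTO ProfessorDimension VALUES
    (1, 'Thomas Jefferson'), (2, 'George Washington'), (3, 'Ben Franklin');
INSERT INTO AttendanceFacts VALUES
    (1, 1, 1, 1, 120), (1, 2, 1, 1, 234), (2, 1, 1, 1, 50), (3, 1, 1, 1, 25),
    (1, 3, 1, 3, 230), (3, 2, 2, 2, 750), (1, 3, 3, 3, 276);
""")


def total_by(dimension, key, label):
    """Sum attendance per member of one dimension, largest total first.

    The table and column names come from the fixed schema above,
    never from user input, so formatting them into the SQL is safe here.
    """
    sql = f"""
        SELECT d.{label}, SUM(f.AttendanceFact) AS Total
        FROM AttendanceFacts AS f
        JOIN {dimension} AS d ON d.{key} = f.{key}
        GROUP BY d.{label}
        ORDER BY Total DESC
    """
    return conn.execute(sql).fetchall()


by_student = total_by("StudentDimension", "StudentKey", "StudentName")
by_semester = total_by("TimeDimension", "TimeKey", "Semester")
by_professor = total_by("ProfessorDimension", "ProfessorKey", "ProfessorName")

print(by_student)
print(by_semester)
print(by_professor)
```

Each question needs only one join from the fact table out to a single dimension, which is a large part of why this layout answers aggregate questions quickly.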

In the next tutorial I'll continue this discussion on the star schema and explain the systems that use it.

Informit Articles and Sample Chapters

Not all systems use a star schema the same way I've shown you here. Henry Fu and Biao Fu explain the SAP way of doing BI in their article called Data Warehousing and SAP BW.

Online Resources

Learn Data Modeling has a good description of the star schema and how you can design one.