Table of Contents
- Microsoft SQL Server Defined
- Microsoft SQL Server Features
- Microsoft SQL Server Administration
- Microsoft SQL Server Programming
- Performance Tuning
- Practical Applications
- Professional Development
- Application Architecture Assessments
- BI Explained
- Developing a Data Dictionary
- BI Security
- Gathering BI Requirements
- Source System Extracts and Transforms
- ETL Mechanisms
- Business Intelligence Landscapes
- Business Intelligence Layouts and the Build or Buy Decision
- A Single Version of the Truth
- The Operational Data Store (ODS)
- Data Marts – Combining and Transforming Data
- Designing Data Elements
- The Enterprise Data Warehouse — Aggregations and the Star Schema
- On-Line Analytical Processing (OLAP)
- Data Mining
- Key Performance Indicators
- BI Presentation - Client Tools
- BI Presentation - Portals
- Implementing ETL - Introduction to SQL Server 2005 Integration Services
- Building a Business Intelligence Solution, Part 1
- Building a Business Intelligence Solution, Part 2
- Building a Business Intelligence Solution, Part 3
- Tips and Troubleshooting
- Additional Resources
Building a Business Intelligence Solution, Part 3
Last updated Mar 28, 2003.
In a previous tutorial, I explained the general process I follow to implement a Business Intelligence project. In this series of tutorials, I'm implementing that process using a need I now have for some intelligence data. If you're just joining me, start with this article first, since you'll need that background for this tutorial. You've got a front-row seat to my experiment as I work through it.
I'm at the stage where I've identified the general answers I want to get from the system, located the source for the data (which, happily, is only one or two tables from a single database for this example), and last week spelunked around a bit to find out what kind of answers I can find in my data.
With the scripts I ran, I found that the Management Data Warehouse (MDW) system tracks several data points for the I/O subsystem as it relates to performance. I’m going to use the data points I found last week to begin laying out my source for the analysis I’ll do in the final steps of this project.
This brings me to what I view as the most difficult part of designing a Business Intelligence system — designing and creating the Dimension and Fact table layouts. Let me stop a moment and define those terms a little more completely.
A “fact” is something that you want to measure and analyze. Most of the time, these are the numbers in your source tables. You may not want all of them, mind you, but the items called “facts” are the numbers themselves. By the way, some texts or systems refer to these numbers as “measures,” so “facts” and “measures” are the same thing; I'll mostly stick to the term “facts” for this series. If you look back at the kinds of things the MDW tracks in the last tutorial, you'll find numbers such as “reads,” “writes” and so on. These numbers represent the activity of the monitored system, and the counts of read and write operations are the facts I'm after.
A “dimension” is something that describes or is related to a fact. In the MDW, I have multiple pieces of information that describe the read or write operations, such as the drive letter the data was written on, or the Instance of SQL Server that owned the database where the I/O operation happened. This information is called a dimension because you look at the facts (numbers) based on it. For instance, you don't just want to know how many writes and reads are happening; you want to know which drive, which database, and which server they are happening on.
Having this information spread out into dimensions is very useful. When I’m done with this design, I’ll be able to see all write operations, then trim it down to specific drives, files or databases, or all of those things at once. Imagine taking all of the data I am talking about and placing the “dimensions” on the squares of a Rubik’s cube. Now imagine placing the “facts” (numbers) in the middle of the cube. If you look at one “face” or side of the cube, you could see all of the write operations on a Server. Then you could spin one of the sides of the cube to also see the write operations of the databases on that server, and you could spin it again to add the drive letters the databases use to store the data. You could come up with all kinds of combinations, and that’s exactly what I’m building the foundation for in this tutorial.
There’s another kind of very special dimension, which involves the time that a fact happened. While that is a kind of number, it’s a dimension, since you will probably want to know the information sliced on the time it happened. It answers questions like “when is the busiest time for my I/O subsystem?” and “when does each server do the most I/O?” It’s a special kind of dimension because you store its data based on the kinds of questions you will ask — and I’ll explain that further when I create it.
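To make the time dimension concrete, here's a sketch of the kind of question it answers, using the fact and time dimension tables built later in this tutorial (the query itself is my own illustration, not part of the MDW):

```sql
/* Sketch: "when is the busiest time for my I/O subsystem?"
   Totals all read and write operations by the hour they occurred. */
SELECT t.IOCollectionTimeHour,
       SUM(f.Reads + f.Writes) AS TotalIOOperations
FROM dbo.FactIOPerformanceMetrics AS f
JOIN dbo.dimIOCollectionTime AS t
    ON f.IOCollectionTimePK = t.IOCollectionTimePK
GROUP BY t.IOCollectionTimeHour
ORDER BY TotalIOOperations DESC;
```

Because the time dimension stores the hour as its own column, this query is a simple GROUP BY rather than a pile of date arithmetic against a raw timestamp.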
Everything is in place — I know where the data is, which tables and columns hold the data, and what kind of reports I want from it. I just need to put the data into the format that the Analysis Services engine can use to do all this work. I need a place to put all this data, so I'll create a database called mdwDW (for MDW Data Warehouse) on my system:
CREATE DATABASE mdwDW;
GO
Easy enough. Since I’ll want to show you a database diagram a little later, I need to assign an owner to the database. Yes, I’m already the owner, but it’s using my domain account, and I’m sitting here in the Public Library typing this article up, so I need to set an owner that the database designer can see right now. I’ll do that with a simple stored procedure, setting the owner to “sa,” an account the system can always get to:
USE mdwDW;
GO
EXEC dbo.sp_changedbowner @loginame = N'sa', @map = false;
GO
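To confirm the change took, a quick check against the system catalog (my own verification query, not part of the original steps):

```sql
/* Verify the new database owner */
SELECT name, SUSER_SNAME(owner_sid) AS OwnerName
FROM sys.databases
WHERE name = N'mdwDW';
```

The `OwnerName` column should now show `sa`.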
The table structure that Analysis Services uses to do its work is called a “star” schema, and getting the data from the source servers, databases and tables into this format is one of the most difficult parts of creating a BI solution, at least in my opinion. It's the step that takes the most time, needs the most planning, and has the most places to go wrong. If you take it one step at a time, however, you can get there.
The first thing to do is to identify the facts and dimensions from your source data. In my case, that was pretty simple, since I have only a single server, a single database, and a single table to work with. Even when it is more complicated, however, it’s the same process.
I looked at my source (you might have to look through several of them), and identified the numbers that hold the data I want to report on. That turned out to be these columns from the snapshots.io_virtual_file_stats table in the MDW database:
Although these numbers are self-descriptive in this case, in a more complex example you might have to document what each number actually means.
The next step is to identify the data that describes those bits of data — the dimensions. Remember, this is the data that you want to “slice” the numbers on. In my case, that comes from a couple of tables. The first is the one I just used, snapshots.io_virtual_file_stats. The second table is the one I joined it on in the last tutorial, core.source_info_internal. You’ll recall that there is a join with a third table to get those two things together, but I’ll come to that later. Here are the columns I found that I wanted to “slice” my view of the numbers on:
The next step is to build the tables that will hold these dimensions and facts. The fact table (the table that holds all of the numbers) will be joined to all the other tables using a set of Primary and Foreign Keys, and when you lay that out it looks like a star — hence the name “star schema.”
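Once the star is in place, slicing the facts by any combination of dimensions is just a matter of joining the fact table out to the dimension tables you care about. Here's a sketch of that pattern, using the tables built later in this tutorial:

```sql
/* Sketch: total write activity per server and database file.
   Each JOIN adds one "face of the cube" to slice on. */
SELECT di.InstanceName,
       dlf.LogicalFileName,
       SUM(f.Writes)       AS TotalWrites,
       SUM(f.bytesWritten) AS TotalBytesWritten
FROM dbo.FactIOPerformanceMetrics AS f
JOIN dbo.dimInstance        AS di  ON f.InstancePK        = di.InstancePK
JOIN dbo.dimLogicalFileName AS dlf ON f.LogicalFileNamePK = dlf.LogicalFileNamePK
GROUP BY di.InstanceName, dlf.LogicalFileName
ORDER BY TotalWrites DESC;
```

Adding or dropping a dimension is just adding or dropping a JOIN and a GROUP BY column, which is exactly the “spin the cube” behavior described above.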
I’ll begin with the dimension tables. They will have a Primary Key, and then the text that defines the dimension. Here’s the one I created for the Instance name, which tells me which server is involved:
/* Create Dimension Tables */
CREATE TABLE [dbo].[dimInstance] (
    [InstancePK] [int] IDENTITY(1,1) NOT NULL,
    [InstanceName] [nchar](10) NULL,
    CONSTRAINT [PK_dimInstance] PRIMARY KEY CLUSTERED ([InstancePK] ASC)
);
GO
Pretty simple — and pretty small. There will be only a few items in here, since I’m only monitoring a few servers. I could also add things like the location of the server, the admin, the primary use of the server and so on, but I’m not collecting that right now. Since I’m not collecting it, I can’t analyze it.
The fact table will have a relationship back to this table, so that I’ll be able to enter a single number there and Analysis Services will be able to get to the text I need.
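Here's a small sketch of how that surrogate key works in practice (the instance name shown is a made-up example):

```sql
/* Add one row to the dimension; the IDENTITY column
   generates the surrogate key automatically. */
INSERT INTO dbo.dimInstance (InstanceName)
VALUES (N'SQLPROD01');

/* The fact table will store only this generated number;
   Analysis Services follows the key back to get the text. */
SELECT InstancePK
FROM dbo.dimInstance
WHERE InstanceName = N'SQLPROD01';
```

Storing the small integer in the fact table instead of repeating the text on every row keeps the fact table narrow and fast to scan.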
I’ll follow the same process for the rest of the dimensions:
CREATE TABLE [dbo].[dimLogicalDisk] (
    [LogicalDiskPK] [int] IDENTITY(1,1) NOT NULL,
    [LogicalDiskName] [nchar](10) NULL,
    CONSTRAINT [PK_dimLogicalDisk] PRIMARY KEY CLUSTERED ([LogicalDiskPK] ASC)
);
GO

CREATE TABLE [dbo].[dimLogicalFileName] (
    [LogicalFileNamePK] [int] IDENTITY(1,1) NOT NULL,
    [LogicalFileName] [nchar](10) NULL,
    CONSTRAINT [PK_dimLogicalFileName] PRIMARY KEY CLUSTERED ([LogicalFileNamePK] ASC)
);
GO

CREATE TABLE [dbo].[dimIOType] (
    [IOTypePK] [int] IDENTITY(1,1) NOT NULL,
    [IOType] [nchar](10) NULL,
    CONSTRAINT [PK_dimIOType] PRIMARY KEY CLUSTERED ([IOTypePK] ASC)
);
GO
I’ll stop here a moment, however, and talk a little about the time dimension table. I’m going to build the whole table out of a single field, but I’ll break the field down into the new SQL Server 2008 DATE and TIME types to have a column for year, month, day, hour, minute and so on. This will allow me to get a very fine “grain” on the times I care about.
I’ll also create a very special Primary Key here. I’ll explain the reason for that in the next tutorial:
CREATE TABLE [dbo].[dimIOCollectionTime] (
    [IOCollectionTimePK] [bigint] NOT NULL,
    [IOCollectionTimeYear] [date] NULL,
    [IOCollectionTimeMonth] [date] NULL,
    [IOCollectionTimeDay] [date] NULL,
    [IOCollectionTimeHour] [time](0) NULL,
    [IOCollectionTimeMinute] [time](0) NULL,
    [IOCollectionTimeSecond] [time](0) NULL,
    CONSTRAINT [PK_dimIOCollectionTime] PRIMARY KEY CLUSTERED ([IOCollectionTimePK] ASC)
);
GO
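As a sketch of how a single collection time might break down into those columns, here's an illustration of my own. The packed-bigint key shown is only an assumption about what that “very special Primary Key” could look like, until the next tutorial explains the real scheme:

```sql
/* Sketch only: breaking one collection time into the dimension's
   columns using SQL Server 2008 date/time types and CONVERT styles
   (112 = yyyymmdd, 108 = hh:mi:ss). The bigint key format is an
   assumption, not the confirmed design. */
DECLARE @collected datetime2(0) = '2009-06-15 14:32:07';

SELECT
    CAST(CONVERT(char(8), @collected, 112)
         + REPLACE(CONVERT(char(8), @collected, 108), ':', '')
         AS bigint)                                             AS IOCollectionTimePK,
    CAST(CONVERT(char(4), @collected, 112) + '0101' AS date)    AS IOCollectionTimeYear,
    CAST(CONVERT(char(6), @collected, 112) + '01'   AS date)    AS IOCollectionTimeMonth,
    CAST(@collected AS date)                                    AS IOCollectionTimeDay,
    CAST(CONVERT(char(2), @collected, 108) + ':00:00' AS time(0)) AS IOCollectionTimeHour,
    CAST(CONVERT(char(5), @collected, 108) + ':00'    AS time(0)) AS IOCollectionTimeMinute,
    CAST(@collected AS time(0))                                 AS IOCollectionTimeSecond;
```

Each column truncates the timestamp to a coarser grain, which is what lets a query group by year, month, day, hour, minute or second without any date math.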
Now for the interesting part: the fact table. It holds all the numbers, as I mentioned earlier, but it has two other features. The first is that its Primary Key is made up of the Primary Keys from the dimension tables:

/* Fact Table */
CREATE TABLE [dbo].[FactIOPerformanceMetrics] (
    InstancePK int NOT NULL,
    LogicalDiskPK int NOT NULL,
    LogicalFileNamePK int NOT NULL,
    IOTypePK int NOT NULL,
    IOCollectionTimePK bigint NOT NULL,
    Reads int NULL,
    Writes int NULL,
    bytesRead int NULL,
    bytesWritten int NULL,
    readsStalled int NULL,
    writesStalled int NULL,
    CONSTRAINT [PK_FactIOPerformanceMetrics] PRIMARY KEY CLUSTERED (
        [InstancePK] ASC,
        [LogicalDiskPK] ASC,
        [LogicalFileNamePK] ASC,
        [IOTypePK] ASC,
        [IOCollectionTimePK] ASC
    )
);
GO

The second interesting thing is that the fact table has Foreign Keys to all of the dimension tables. Here's that code:

/* Set up Foreign Keys */
SET QUOTED_IDENTIFIER ON;
SET ARITHABORT ON;
SET NUMERIC_ROUNDABORT OFF;
SET CONCAT_NULL_YIELDS_NULL ON;
SET ANSI_NULLS ON;
SET ANSI_PADDING ON;
SET ANSI_WARNINGS ON;
GO

ALTER TABLE dbo.dimLogicalFileName SET (LOCK_ESCALATION = TABLE);
GO
ALTER TABLE dbo.dimLogicalDisk SET (LOCK_ESCALATION = TABLE);
GO
ALTER TABLE dbo.dimIOType SET (LOCK_ESCALATION = TABLE);
GO
ALTER TABLE dbo.dimIOCollectionTime SET (LOCK_ESCALATION = TABLE);
GO
ALTER TABLE dbo.dimInstance SET (LOCK_ESCALATION = TABLE);
GO

ALTER TABLE [dbo].[FactIOPerformanceMetrics]
    ADD CONSTRAINT [FK_FactIOPerformanceMetrics_dimInstance]
    FOREIGN KEY (InstancePK) REFERENCES dbo.dimInstance (InstancePK)
    ON UPDATE NO ACTION ON DELETE NO ACTION;
GO
ALTER TABLE [dbo].[FactIOPerformanceMetrics]
    ADD CONSTRAINT [FK_FactIOPerformanceMetrics_dimIOCollectionTime]
    FOREIGN KEY (IOCollectionTimePK) REFERENCES dbo.dimIOCollectionTime (IOCollectionTimePK)
    ON UPDATE NO ACTION ON DELETE NO ACTION;
GO
ALTER TABLE [dbo].[FactIOPerformanceMetrics]
    ADD CONSTRAINT [FK_FactIOPerformanceMetrics_dimLogicalDisk]
    FOREIGN KEY (LogicalDiskPK) REFERENCES dbo.dimLogicalDisk (LogicalDiskPK)
    ON UPDATE NO ACTION ON DELETE NO ACTION;
GO
ALTER TABLE [dbo].[FactIOPerformanceMetrics]
    ADD CONSTRAINT [FK_FactIOPerformanceMetrics_dimLogicalFileName]
    FOREIGN KEY (LogicalFileNamePK) REFERENCES dbo.dimLogicalFileName (LogicalFileNamePK)
    ON UPDATE NO ACTION ON DELETE NO ACTION;
GO
ALTER TABLE [dbo].[FactIOPerformanceMetrics]
    ADD CONSTRAINT [FK_FactIOPerformanceMetrics_dimIOType]
    FOREIGN KEY (IOTypePK) REFERENCES dbo.dimIOType (IOTypePK)
    ON UPDATE NO ACTION ON DELETE NO ACTION;
GO

ALTER TABLE [dbo].[FactIOPerformanceMetrics] SET (LOCK_ESCALATION = TABLE);
GO
And the source for my analysis is complete. Here’s the database diagram of the structure:
Now I’m ready for the next step: figuring out how the data gets from the source to this new database and the star schema. I’ll cover that process next week.
InformIT Articles and Sample Chapters
There’s a lot more here on the MDW feature in The SQL Server 2008 Management Data Warehouse and Data Collector.
Books and eBooks
Before we're through with this project, you'll need a good background on SQL Server Integration Services. I've got the reference for that right here: Microsoft SQL Server 2008 Integration Services Unleashed.
To make sure I give credit where it is due, here is the reference from Books Online that I used in this tutorial.