SQL Server Reference Guide

Building a Business Intelligence Solution, Part 1

Last updated Mar 28, 2003.

In a previous tutorial, I explained the general process I follow to implement a Business Intelligence project. Starting with this tutorial, I’ll implement that process in a series of articles, using a need I now have for some intelligence data as the working example.

Before I get started, take a look at that process outline, with a couple of additions that I’ll explain as I go along:

  1. Description
  2. Scope
  3. Logical Flow
  4. Inputs
  5. Source Systems and Mechanisms
  6. Outputs
    • Reports
    • Exports
    • Interfaces to other systems
  7. Staging
  8. Growth and Archival Strategy
  9. Impacts
  10. Schedule
  11. Security

I’ll start the process by defining what I want the system to accomplish when I’m through. Along the way, I’ll also talk a bit about the source system I’m using for the data.

SQL Server 2008 introduces a new feature called the Management Data Warehouse. This feature uses yet another new feature, called the Data Collector, to store performance data from various systems in a single location. Together with a new set of reports, these features show you how your systems are performing and help you correct many of the problems you find.
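
As a quick orientation, here’s one way to see which collection sets the Data Collector has defined on an instance, using the syscollector_collection_sets catalog view in msdb. This is just a sketch; Books Online documents the full column list.

-- Run on a monitored instance: lists the Data Collector's collection sets
-- and whether each one is currently running.
SELECT collection_set_id,
       name,
       is_running,
       collection_mode   -- cached vs. non-cached collection
FROM msdb.dbo.syscollector_collection_sets
ORDER BY name;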

The issue with the Management Data Warehouse database (which I’ll just refer to as MDW from here on out) is that because it stores data from so many “client” SQL Server Instances, it gets large quickly. It holds a lot of detail: for instance, you can look back at a poorly performing query, all the way down to the statement itself and its query plan. Because you encounter those performance issues only so often, you normally delete that data every couple of weeks, “rolling” the window forward, which lets you monitor more servers with a single MDW database.
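
If you want to see where that space is going, the standard catalog views will show you the largest tables in the warehouse. Nothing here is specific to the MDW; it’s just a rough sizing sketch you can run in that database.

-- Run in the MDW database: approximate space used by each table, largest first.
SELECT s.name AS schema_name,
       t.name AS table_name,
       SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb,
       SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS row_count
FROM sys.dm_db_partition_stats AS ps
JOIN sys.tables  AS t ON t.object_id = ps.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
GROUP BY s.name, t.name
ORDER BY reserved_mb DESC;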

While that fine-grained level of detail is essential for day-to-day performance troubleshooting, you don’t normally need to know what a given query was doing a month or two ago. Of course, you can adjust that window to hold however much data you think you need, but in my case two weeks is adequate.
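
The window is controlled per collection set, so a quick way to check the current retention on each monitored instance is:

-- Retention (the "rolling window") for each collection set, in days.
SELECT name, days_until_expiration
FROM msdb.dbo.syscollector_collection_sets
ORDER BY name;

If two weeks isn’t the right number for you, you can change the value from the collection set properties in Management Studio or, I believe, script the change with msdb.dbo.sp_syscollector_update_collection_set.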

But I don’t want to lose the performance data completely from my system. In fact, some of the performance data, such as the sizes of database tables and filegroups, will be interesting for years to come. They allow you to perform complex analysis and projections, and that’s exactly the kind of thing I’m after.

So the first part of the project is to define its purpose. In my case, I’m after the performance data in my system that will be useful to me in the future, specifically for growth and trend projections. Some of the data collected lets me watch file growth, row counts, and the number of connections to the system. I can then plan to buy or rearrange storage, add server capacity, and so on. I can also use the information to balance the load across all of the servers I am monitoring.

There’s also another benefit that I am after — I want to see if my applications are performing properly, and if not, why. If I can pin down the poorly performing applications, and then get a sense for why they are performing poorly, I can correct the issue.

For instance, if a particular application is performing poorly, it could be due to poor design, implementation or maintenance. If the application is poorly designed, I can use that information to make a decision to go with another vendor or to re-write the application. If the application is poorly implemented, I can correct that with a migration to a better layout. If the application is poorly maintained, I can speak with the DBA in charge to correct that problem. No matter what the issue, I can find and correct it.

That’s still a fairly vague description of the project. In fact, if I were the IT manager in charge of picking projects to go forward with, I wouldn’t pick this one — yet. As the person proposing the work, I have some more investigation to do.

And that brings up an interesting point. Although I have started on the process at the top of the order, I need to skip down a few steps to get more information. I need to examine the source data a little more to see if I can gain all of the value from it that I need. You certainly can’t analyze data that you don’t have!

So I’ll leave the purpose vaguely defined at the moment. The Scope step is pretty obvious for this project — I’m staying with the single MDW database and the data it contains, and I’ll consume the information myself. That’s a pretty limited scope, and not very typical in a more comprehensive Business Intelligence (BI) project. For those systems, the business will drive the information they want to have, and your team and the business team will come to an agreement on scope. They will want everything all the time for everyone, and you’ll want to provide a limited set of data to certain people sometimes. That’s the nature of a BI project. But with a good-faith effort, you’ll come up with something both of you can agree on.

So I’ll need to do a little more digging. The place to start is with a “data dictionary,” which is simply a description of where the data is, what it is, and what it means. I used the word “simple,” because in this case, it is. Once again, in larger implementations, there’s no such thing as “simple.” The data is spread throughout the organization’s various systems, is not homogenous, and has gaps and duplications.

In my case, not only is the data located in a single SQL Server database, but the schema is already defined for me. Looking in SQL Server Books Online, I find the entire MDW database schema described, which lets me do a little more discovery. The first set of tables is under the “core” schema, and those serve as the parent links to the actual payload performance and status data. Here’s what Books Online says about those:

core.performance_counter_report_group_items — Stores information about how the management data warehouse reports should group and aggregate performance counters.

core.snapshots_internal — Identifies each new snapshot. A new row is inserted into this table whenever an upload package starts uploading a new batch of data.

core.snapshot_timetable_internal — Stores information about the snapshot times. The snapshot time is stored in a separate table because many snapshots can happen at nearly the same time.

core.source_info_internal — This table stores information about the data source. This table is updated whenever a new collection set starts uploading data to the data warehouse.

core.supported_collector_types_internal — Contains the IDs of registered collector types that can upload data to the management data warehouse. This table is only updated when the schema of the warehouse is updated to support a new collector type. When the management data warehouse is created, this table is populated with the IDs of the collector types provided by the data collector.

core.wait_categories — Contains the categories used to group wait types according to wait_type characteristic.

core.wait_types — Contains the wait types recognized by the data collector.

core.purge_info_internal — Indicates that a request has been made to stop the removal of data from the management data warehouse.

The second set of tables is under the “snapshots” schema. These are sort of the “payload” data that link back to the “core” parents:

System-level resource tables:

  • snapshots.os_wait_stats
  • snapshots.os_latch_stats
  • snapshots.os_schedulers
  • snapshots.os_memory_clerks
  • snapshots.os_memory_nodes
  • snapshots.sql_process_and_system_memory

System activity tables:

  • snapshots.active_sessions_and_requests

Query statistics tables:

  • snapshots.query_stats

I/O statistics tables:

  • snapshots.io_virtual_file_stats

Query text and plan tables:

  • snapshots.notable_query_text
  • snapshots.notable_query_plan

Normalized query statistics tables:

  • snapshots.distinct_queries
  • snapshots.distinct_query_to_handle
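
To get a feel for how those payload tables hang off the “core” parents, here’s a sketch of a join from the query statistics back to the instance and snapshot time. The column names are my assumptions, based on how these tables appear to mirror the corresponding dynamic management views, so verify them against your own MDW before building on this.

-- A first look at the payload-to-parent linkage: recent query statistics,
-- labeled with the instance they came from and when the snapshot was taken.
-- Column names are assumptions; confirm them against your MDW schema.
SELECT si.instance_name,
       st.snapshot_time,
       qs.sql_handle,
       qs.execution_count,
       qs.total_worker_time
FROM snapshots.query_stats AS qs
JOIN core.snapshots_internal AS sn
  ON sn.snapshot_id = qs.snapshot_id
JOIN core.snapshot_timetable_internal AS st
  ON st.snapshot_time_id = sn.snapshot_time_id
JOIN core.source_info_internal AS si
  ON si.source_id = sn.source_id
ORDER BY st.snapshot_time DESC;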

There’s one more set of tables to deal with. If you “extend” the MDW and Data Collector, the system creates a set of tables under the “custom_snapshots” schema. If I find that the standard collectors don’t capture the data I need, I’ll add a custom collector to get it for me.
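
Since the whole schema lives in one database, a simple catalog query shows everything there is to work with across all three schemas (the custom_snapshots schema holds whatever custom collectors you add).

-- Run in the MDW database: every table in the Data Collector schemas.
SELECT s.name AS schema_name,
       t.name AS table_name
FROM sys.tables AS t
JOIN sys.schemas AS s
  ON s.schema_id = t.schema_id
WHERE s.name IN ('core', 'snapshots', 'custom_snapshots')
ORDER BY s.name, t.name;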

With all this info, I’m ready to start exploring my data a little to see what I can find. I know from the reports that come with the MDW feature that I can get amazing amounts of information on queries, I/O, and system resources like CPU and memory. I’ll begin (next time) by exploring these reports and their source data to see what I can use for my analysis project.

InformIT Articles and Sample Chapters

There’s a lot more on the MDW feature in The SQL Server 2008 Management Data Warehouse and Data Collector elsewhere in this Reference Guide.

Books and eBooks

Before we get through with this project, you’ll need a good background on SQL Server Integration Services. I’ve got the reference for that right here: Microsoft SQL Server 2008 Integration Services Unleashed.

Online Resources

To make sure I give credit where it is due, here is the reference from Books Online that I used in this tutorial.