Home > Articles > Data > SQL Server

SQL Server Reference Guide

Hosted by

Source System Extracts and Transforms

Last updated Mar 28, 2003.

One of the most sensitive areas that you will have to tackle in creating your Business Intelligence landscape is extracting data from source systems. The main difficulties you'll have in this process are with the security aspects and politics. Not to fear though; in this tutorial I'll explain how to begin to put everyone's mind at ease and show you how to create a comprehensive plan for getting data from their system to yours. I'll focus first on creating the plan and the meta-data, and in the next couple of articles I'll explain the mechanics of that process.

You might be fortunate enough to own all the source systems in question, in which case the primary danger is that you won't be thorough enough when you design your extracts. If you do have control of all the source systems, pretend that you don't and treat each system as I'll explain here. You'll have a safer, more supportable infrastructure that way, and if you ever include a foreign system in your schema you'll have all the documentation set up, ready to go.

I'll show you the documentation you need to create, but we should take a moment and discuss the distribution methods for that documentation. In several cases you'll need to have various people sign-off that they agree with the plan, to make sure they know about it and just for a little protection for yourself. Trust me, corporate memories can be fraught with amnesia from time to time, and you don't want to be the only one that remembers that someone agreed to your scheme.

I personally like portals for distributing the plan. It's a central location, securable, and you can even architect it so that it provides a web service or e-mail capabilities. It's ultimately portable as well, and works across all platforms. I've found it to be the best place to store all the documentation for my Business Intelligence landscape.

You can also use a plain old drive share as well. This can also be secured, but it's a little more closed from the distribution standpoint. There's no easy mechanism to publish the fact that changes have been made to a document.

You can even use plain old paper documents in a binder. This is my least favorite choice of all, since you're killing trees, it's a single-use format, and can be physically destroyed or lost. Not to mention that's SO twentieth-century! But if it's paper or nothing, do the paper.

Plan for the fact that these documents will be living, that is, changing quite frequently. That's not only OK, it's to be expected. In a Business Intelligence landscape, several source systems will evolve as time goes by. New features will be added, upgrades will occur, and you'll even add new sources to the mix. Your document storage and distribution system (portal, if you will) that you'll use for your Business Intelligence landscape should account for that.

Up to now we've only been discussing the planning stages. When you begin to extract data from source systems the final planning documents need to be made public. This is called the meta-data of the system, the data about the data. This meta-data includes where the data comes from, what it means, how it is changed, and how it is moved. Finally, the meta-data explains where that data ends up.

You can create the meta-data for the system automatically, or you can purchase third-party tools and products. Even if it is created for you, you should consider putting it out on a portal for all to see. The reason for this is that most systems (Extract, Transform and Load programs or entire BI suites) don't include all the documentation I'll describe below. Even if they do, they usually aren't linked so that you can search them from several angles.

If you're designing your own BI tools using only database and OLAP engines, then you'll definitely want to spend a great deal of time creating a system that will manage the meta-data for you. I'll return to this theme throughout this series.

The first item in your meta-data schema is the source data locations and descriptions. What you need here is the name of the location of the source data, with all pertinent information such as data type (character, numeric, date, etc.), the name and type of the source system table, how you extract the data (flat file, direct-connect, B2B, etc.) and the name of the contact for that system, with the administrator. The first contact is the business owner of the system, and the second contact is the name, e-mail, IM address and phone number of the technical representative. You'll need all of these at one time or another.

The next part of the source meta-data information is the data dictionary. This term goes all the way back to the age of the dinosaurs (mainframes) when you actually had time to document what you were doing. A data dictionary is simply a grid that shows a field's name, its structure, its transformations, its meaning and its location. That is exactly what you'll need to base your entire search schema on in your meta-data product. Most of the users I've dealt with want to point to a screen in their source application and say to you "where is that field located in your BI system, and what did you do to it?"

You'll want to develop a chronology document showing when each extract happens and how long it takes from the source, to the staging area (if applicable), and how long it takes to make it into your system. Understand that the entire length of the time it takes before a purchase order is updated into the BI system. That's because you may need to wait on all the PO systems to send their data so that you can maintain data integrity. This timeline will also help you design your maintenance and down-time windows, and what kind of slack you have when an extract is late or missed. It will, by the way, do both.

Next, develop a security matrix. This shows who has rights to the data in the source system and ho vulnerable it is along the way. If you extract salaries into a flat file on a share, you should know about who has access to that share. Here in the US, there are laws about that sort of thing. This also helps you develop your security chain in your BI system.

You'll also want an ownership chain on the data. This helps everyone understand who's got the data at which stage.

No matter how you intend to transfer the data from the source system to your BI system, you are still transferring it across the network. Make sure you include your network staff so that you can estimate the amount of traffic you're about to put on their network. They'll be surprised that you consulted them, and you'll be glad you did. Many designers forget this step and find it out too late when their China link bogs down under the traffic to the US.

Finally, create a support matrix. This is an "on-call" list that shows who gets contacted when something doesn't work. If you're in a worldwide environment as I am, make sure you design a "follow the sun" support scheme and train the technicians in your region on your system so that they can help you during their daylight hours. Or, you can always get up at 2 in the morning to find out someone shut down the FTP server in Paris.

Publish all of this documentation where everyone can get to it and suggest changes. When you create this meta-data system, make sure you make it searchable. Every part – from security to field types – should be linked to every other part. If you design this meta-data system properly from the outset, you'll find it helps you throughout the entire project design.

Informit Articles and Sample Chapters

Larissa T. Moss and Shaku Atre have a great book on BI that also discusses security called Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications.(Preview this book in Safari)

Online Resources

The ETL guy no longer keeps his site up to date, but it's a wealth of information on Extracts, Transforms and Loads. He's got a bunch of links from ETL products as well.