SQL Server Reference Guide

ETL Mechanisms

Last updated Mar 28, 2003.

In this series on Business Intelligence (BI) landscapes, I've explained what BI is, offered some information to help you develop a security plan, and shown how to create your requirements. In the last tutorial I explained how to prepare your source systems to deliver data. And we haven't even used any software yet!

I say that half-humorously, but it does show that creating a Business Intelligence infrastructure requires an enormous amount of planning. In the current "do it now" workplace, planning isn't often held in high esteem. But I can promise you this: if you don't thoroughly plan your BI implementation, it will fail. Oh, everything will load, process and create reports, but the information won't be correct. Guaranteed. The reason for this sad fact is that IT simply can't beat the solution out of the technology – Business Information (even when used in science or other disciplines) is information, not data.

Think of a bank. Banks will keep your money, tell you how much you have, move it into other accounts for you, or even distribute it as you say. But they can't tell you what that means for your financial situation. The bank doesn't call you and say "you're not spending enough on Buck's database books this month!"

That's just like IT. Our part of the organization stores the organization's Microsoft Excel spreadsheets, moves them around, backs them up, secures them, and even allows users to add and delete them. But we don't know what they mean, or how the user will interpret them. So the only planning we normally do involves the mechanics of storing, moving and securing the organization's data. The users know what they want to do with the data, so they assign the meaning to it. They make data into information in their heads.

Not so with the BI system. In the BI system, IT assigns meaning to the data. We have to do this because we will extract data from various source systems and then transform the data prior to loading it into a series of larger systems. All this transforming and loading is to present the data in a coherent way, which involves knowing what the data means. You'll see what I mean by the end of this article.

To move the data from the source systems (see the last tutorial in this series) through to the end reports, you have a variety of mechanisms to choose from.

File Store

The first choice is to extract the data using source system tools into files. The main advantage here is that there is nothing to buy. Most enterprise software packages have mechanisms to pump out data from their databases using the included platform tools. If they don't, you can use the export tools built into every commercial database engine on the market.
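As a rough sketch of that approach, here is what a file export from a SQL Server source might look like using the bcp utility that ships with the engine. The server, database, table and share names below are placeholders for illustration, not anything from a real system:

    -- Run at a command prompt on the source system; exports the (hypothetical)
    -- Sales.dbo.Orders table to a delimited file on a share the BI team can read.
    -- -c = character mode, -t"|" = pipe delimiter, -T = Windows authentication.
    bcp Sales.dbo.Orders out \\staging\extracts\Orders.dat -c -t"|" -S SOURCESRV -T

    -- If the full table is too wide, export from a view instead, so the source team
    -- controls exactly which columns leave their system:
    CREATE VIEW dbo.vw_Orders_Extract
    AS
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM dbo.Orders;
    -- then: bcp Sales.dbo.vw_Orders_Extract out \\staging\extracts\Orders.dat -c -t"|" -S SOURCESRV -T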

Another advantage in exporting data from the source system to a set of files is that you don't have to ask for permission to access a source system to get the data. Many administrators face licensing costs for creating a new login, and would rather hand you an export than let you come in and get it. They're also leery of letting a foreign system into their carefully balanced software.

Still another advantage is that you have a lot of control over how and when the data comes out of the system. You can "trickle" the data out frequently, or wait and "batch" the whole process at a time when the system isn't as busy.

Using a file store isn't perfect, however. If you use this method you should prepare for the inevitable issues when the export program on the source system doesn't run. Do you stop all imports until the problem on the source system is fixed, or do you continue importing from the other systems and let the data "catch up"? What if the source system that failed holds the parts information, and the next import to run is the purchase orders that depend on the parts information in the first place?
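One way to handle that situation, sketched below with made-up table and extract names, is a small load-control table on the BI side: each successful import records itself, and a dependent import checks for the rows it needs before it runs.

    -- Hypothetical load-control table on the BI side.
    CREATE TABLE dbo.ExtractControl (
        ExtractName varchar(50) NOT NULL,
        BatchDate   char(8)     NOT NULL,   -- YYYYMMDD
        Loaded      bit         NOT NULL DEFAULT (0),
        CONSTRAINT PK_ExtractControl PRIMARY KEY (ExtractName, BatchDate)
    );
    GO

    -- Before the purchase order import runs, confirm today's parts extract arrived.
    IF NOT EXISTS (SELECT 1
                   FROM dbo.ExtractControl
                   WHERE ExtractName = 'Parts'
                     AND BatchDate   = CONVERT(char(8), GETDATE(), 112)
                     AND Loaded      = 1)
    BEGIN
        RAISERROR('Parts extract has not arrived; holding the purchase order import.', 16, 1);
        RETURN;
    END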

Recall that in the last tutorial I mentioned that you'll want to get an "on call" list for the administrators of the source systems. You'll need that in case the exports don't work. It's no good finding out that the lady you dealt with when you designed the extract system is on vacation when the source extract fails.

Also, leaving files lying around on an FTP site or some other share comes with its own set of problems, not to mention the complexity of the various steps of moving the data out, storing it, picking it up, and importing it. Remember, the fewer steps you have, the easier it is to comprehend the system, diagnose problems, and correct them.

But there are times when a file store is a great solution. If you decide to use a file store as your extract method, make sure that you keep the files as generic as possible. They should be a "straight throw" from the original tables or views in the source system. Resist the urge to build data-combining logic into this level of the process. This is because you want experts in the source system to be able to help you when changes in their systems occur; you don't want to have to educate them on all of the changes you made to the extract process. Leave that for your side of the equation. Just get the data, agree on a format, and ensure that you have a good method to handle interruptions in the data flow.

That often means naming the files (or rows in tables, if you're extracting data into a client-side database) with a unique identifier, usually involving a date/time stamp. This will keep the extracts from stepping on one another if they do fail and then catch up.
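For example, a nightly extract might build its file name like this (the share and table names are invented for illustration):

    -- Build a file name that carries the source system and a date/time stamp,
    -- so a failed extract that runs late never overwrites an earlier file.
    DECLARE @FileName varchar(200)
    SET @FileName = '\\staging\extracts\HR_Employees_'
                  + CONVERT(char(8), GETDATE(), 112)                     -- YYYYMMDD
                  + '_'
                  + REPLACE(CONVERT(char(8), GETDATE(), 108), ':', '')   -- HHMMSS
                  + '.dat'
    SELECT @FileName AS ExtractFileName
    -- Produces something like \\staging\extracts\HR_Employees_20030328_231500.dat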

Make sure that the staging area where you receive the files is reliable and secure. Document that security, along with the timeline for when the files will be ready for pickup, in a meta-data tool or document.

ETL Tool

Some BI systems have built-in extraction tools. You can also purchase tools that can reach into a source system natively and move the data to your system.

The advantage of this kind of software is that the intelligence for dealing with changes and failures is built right into the functionality it provides. This kind of software often uses a simple point-and-click paradigm that is easy to understand, explain and document.

Another advantage is that these tools normally have a good meta-data store built right in. This allows you to enter lots of information about the data structure, how it is changed in the process, what stage it is in at a given moment, and more. You'd have to code all this yourself if you use a file store.

But there are disadvantages here as well. These tools cost money, and they aren't always cheap. There's also a learning curve involved in installing yet another software package (YASP). The source system may require another license, and security becomes an issue.

If you are using same-to-same platforms, this choice becomes a little easier. SQL Server can talk directly to another SQL Server, and you can even set up replication between the systems to keep them up to date all the time. SQL Server can also talk to Oracle by using distributed queries or Data Transformation Services.
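In the SQL Server-to-SQL Server case, a distributed query can be as simple as this sketch. The server, database and table names are placeholders, and you would still need the source administrator's blessing and a login on their side:

    -- Register the source server once on the BI server.
    EXEC sp_addlinkedserver
         @server     = 'SOURCESRV',
         @srvproduct = 'SQL Server'
    GO

    -- Pull the raw rows across into a local staging table with a four-part name.
    SELECT OrderID, CustomerID, OrderDate, Amount
    INTO   Staging.dbo.Orders_Raw
    FROM   SOURCESRV.Sales.dbo.Orders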

That brings up the choice of where to transform the data, and how much. You should work very hard to leave the data as generic as possible for as long as you can. The first time you change a "True" to "T" or "1", you dilute the original meaning of the data a little. To be sure, at some point you're going to have to combine data from system A with data from system B, and unless they are the same kind of software, odds are good that they won't store the same data the same way. But leave the data alone during the transfer process. Get it into a series of tables in the destination system (more on those next time) as virgin as possible. Once you've transferred the data you can begin to transform it, and if you do that on your side, you're the only one affected if the transformation isn't correct.

I don't advocate ETL as defined by Extract, Transform and Load; I actually prefer to define it as Extract, Transfer and Load. Transform the data using the rich set of tools in your destination system, and the process will be much smoother.
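Here's a small sketch of that Extract, Transfer and Load idea, with invented table names: land the rows exactly as the source sent them, then apply the "True" versus "1" decision in a separate step that runs entirely on the BI side.

    -- Raw landing table: loose types, no rules, exactly what the source sent.
    CREATE TABLE Staging.dbo.Customers_Raw (
        CustomerID varchar(20),
        IsActive   varchar(10),   -- the source system sends 'True' / 'False'
        LoadBatch  char(8)        -- YYYYMMDD stamp added at load time
    );
    GO

    -- Transform step, run after the transfer completes, on the BI side only.
    INSERT INTO Warehouse.dbo.Customers (CustomerID, IsActive)
    SELECT CustomerID,
           CASE WHEN IsActive IN ('True', 'T', '1', 'Y') THEN 1 ELSE 0 END
    FROM   Staging.dbo.Customers_Raw
    WHERE  LoadBatch = CONVERT(char(8), GETDATE(), 112);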

Finally, you'll need to develop a timeline for when the data is considered current. The data is current when all systems have reported in and you've moved the data beyond the staging tables (the landing places for the data in your BI system) and up into the first levels where it can be used.

In the next tutorial I'll explain those levels and where the ETL process should place the data. You have a few choices in your landscape, depending on your needs. I'll explain those choices and how you select the one for your situation.

Informit Articles and Sample Chapters

Larissa T. Moss and Shaku Atre have a great book on BI, Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications, which also discusses security.

Online Resources

The ETL guy no longer keeps his site up to date, but it's a wealth of information on extracts, transforms and loads. He's also got a bunch of links to ETL products.