
SQL Server Reference Guide


Migrating Departmental Data Stores to SQL Server: Design the ETL

Last updated Mar 28, 2003.

This is the fourth article in a series describing a formal process you can follow to migrate data stored in “departmental data stores” (such as Excel, text files, XML documents and so on) into a Relational Database Management System (RDBMS) like SQL Server. The first article in this series is here, and you can click the “Next” link at the bottom of each article to move through the series.

In the first installment, I explained what these data stores are, what they mean to your organization, and when they should be considered for migration. Some data doesn’t need to be stored in an RDBMS, and other data does. I also explained a few methods you can use to locate that data. That’s the first step.

In the second installment I explained how to take that data and model it, so that you can tease out the requirements from the discovery you’ve done and so that everyone agrees on the data’s final format. I wasn’t able to complete that entire step there, so I finished it in the article that followed.

In the third installment I covered a Business Requirements document in more depth and I explained how to normalize the model into tables and columns. And in last week's installment I cleaned up the model and explained how I decided to pick the location for the data.

Here’s the outline of the process and where we are so far:

  1. Locate the data
  2. Model the system
  3. Decide on the Destination
  4. Design the ETL and Migrate the Data
  5. Attach the front-ends
  6. Test and Monitor

In this article I’ll explain how to select the design for the Extract, Transform and Load (ETL) process and the considerations for migrating the data. I’ll spend quite a bit of time on the process, so this warrants its own discussion. I’ll follow that in the next article by detailing the options for actually running the ETL process.

Designing the Extract, Transform and Load

The first part of an ETL process is to select the sources and destination for the data. In my case, there’s only a single source, and it happens to be a Microsoft Excel spreadsheet, version 2007. That’s really important, because no matter what tool I use, I’ll need to know how that tool reads the source format. Every method handles the source in a very specific way, so it’s important to get that information.

In fact, I didn’t just ask the businessperson to tell me the version of Excel they were using. I went to their desk, opened the spreadsheet and clicked on “Help” and then “About” to verify the exact version. Then I copied the spreadsheet to start my experimentation with the different methods, and right-clicked that file to verify the version. You should document this part of the process carefully — it really does matter.

Of equal importance is the destination. I’m using SQL Server 2008 Enterprise Edition for this exercise, and that will make a difference in the data types, field structures and so on. The point is that you should understand both the source and destination carefully before you start.

You may have more than one source. If so, document those the same way.

Mapping Fields

The next step is to map out the fields you want to bring in. In my case, the spreadsheet is quite de-normalized. That means that several elements of data are “mixed” in with each other — there are Vendor names in the same “record” (line) as the Client name in the spreadsheet, and so on.

What I’ll do is map out each element I need in the tables, and work backwards to the source. Remember that I built the database from the Business Requirements, so the database is now the “gold standard” of the data I need. I can do this electronically, or just on paper, but in any case I basically draw a line from the cell in the spreadsheet to the table and column in the database. It should end up as a one-to-one mapping — with the relationships defined in a subsequent step.
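One simple way to capture that one-to-one mapping electronically is a lookup table from source column to destination table and column. This is only a sketch; the column and table names below are hypothetical stand-ins, not the actual names from the example spreadsheet or database.

```python
# Hypothetical field map: spreadsheet column -> (table, column) in the
# destination database. Names are illustrative only.
FIELD_MAP = {
    "Vendor Name": ("Vendor", "VendorName"),
    "Vendor Code": ("Vendor", "VendorCode"),
    "Client Name": ("Client", "ClientName"),
    "Contact":     ("VendorContact", "ContactName"),
    # One source cell can feed several destination columns; the actual
    # split is handled in a later transformation step.
    "City State Zip": ("VendorAddress", ("City", "State", "ZipCode")),
}

def destination_for(source_column):
    """Return the (table, column) a source column maps to, or None."""
    return FIELD_MAP.get(source_column)
```

Even when the tool draws this map for you graphically, keeping a plain-text version like this makes the mapping easy to review and diff later.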

This means in some cases I’ll have to break apart some cells in the spreadsheet into multiple discrete values. If you’ll refer back to the example spreadsheet I posted in an earlier article, you’ll see that City, State and Zip Code are actually all in one cell. Depending on how I extract the data and which tool I use, this might actually be done graphically for me, but if not, I draw a line from the single cell to a box showing the transformation, then on to the database tables and columns they belong to.
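The splitting transformation itself can be sketched in a few lines. This assumes the common US format “City, ST 12345” (optionally with a ZIP+4 suffix); rows that don’t match are returned as None so they can be routed somewhere for manual review rather than silently loaded.

```python
import re

def split_city_state_zip(cell):
    """Split a combined 'City, ST 12345' cell into discrete values.

    Returns (city, state, zip) on success, or None when the cell
    doesn't match the assumed format, so non-conforming rows can be
    set aside for review instead of being loaded incorrectly.
    """
    match = re.match(r"^\s*(.+?),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)\s*$", cell)
    if match is None:
        return None
    return match.groups()
```

A graphical ETL tool would represent this same step as a transformation box between the source cell and the three destination columns.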

I do this for each “task” or step in the ETL process. In some cases this is so simple that paper and pencil will suffice, in other cases, the tool you pick will create this graphic for you. In any case, no matter how “simple” the transformation may appear, you need to follow this process.

Determining Differences

A spreadsheet is arranged much differently than a database. In fact, even from one database product to another (such as Oracle versus SQL Server), there are differences. This step involves analyzing those differences from every angle — from something generic, such as normalized versus de-normalized data, to the actual data types and structure. I’ll start with the generic task of making the relationships work.

As I normalize the data, I’ll have to keep the keys or relating values from the “parent” table stored somewhere as I break out the “child” data into its respective tables. For instance, I have in my diagram the Vendor separated from the Vendor Type, since a Vendor might supply my company with more than one service or good. But in the spreadsheet, the users have placed commas between the “types” values, so I have to transform them just as I did a moment ago with the addresses. But while the address goes into the same table, the Vendor Types get their own table — and I’ll need to tie those back to the Vendor. So wherever I need to do that, I’ll make a mark on the map indicating the pairing key or value. In my case it’s the Vendor Code.
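That parent-to-child break-out can be sketched as a small transformation that carries the parent key along with each discrete value. The Vendor Code as the pairing key comes from the example above; the function name and row shape are hypothetical.

```python
def split_vendor_types(vendor_code, types_cell):
    """Break a comma-separated 'types' cell into child rows.

    Each output row pairs the parent's key (the Vendor Code) with one
    discrete type value, ready to load into a separate VendorType table.
    """
    return [
        (vendor_code, value.strip())
        for value in types_cell.split(",")
        if value.strip()  # skip empty fragments from stray commas
    ]
```

The key point is that the parent key travels with every child row, so the relationship can be rebuilt in the destination tables.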

The relationship part of ETL is possibly the hardest. It requires that you thoroughly understand the source data and the destination design. This is when those Business Requirements really help.

The next part of analyzing the differences is checking the data types. Excel is especially vulnerable in this area: it has data formats the user can choose, but it picks its own (Unicode text in most cases) if the user doesn’t. This is where the Entity Relationship Diagram comes in handy — I’ve taken the time to show the data types on my diagram, and I work through the spreadsheet to understand two things: what types the user chose for the columns, and what types the system chose when the user did not. Once again, this goes on my paper or electronic data map.
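Since most values arrive from Excel as text, a quick validation pass can flag cells that won’t fit the destination column types before the load runs. This is a minimal sketch; the simplified type names stand in for the actual SQL Server types on the ER diagram, and the date format is an assumption.

```python
from datetime import datetime

def fits_type(value, expected_type):
    """Check whether a text value from the spreadsheet can be converted
    to a simplified destination type ('int', 'decimal', 'date', 'text').

    Assumes dates arrive as ISO 'YYYY-MM-DD' strings; adjust the format
    to whatever the source spreadsheet actually uses.
    """
    try:
        if expected_type == "int":
            int(value)
        elif expected_type == "decimal":
            float(value)
        elif expected_type == "date":
            datetime.strptime(value, "%Y-%m-%d")
        # anything else is treated as text and always accepted
        return True
    except ValueError:
        return False
```

Running a check like this over every mapped column gives you a list of problem cells to resolve before the real ETL process runs.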

Finally there is the “optionality” of the data. This simply means noting where data is required in the destination versus where it is actually present in the source. For instance, based on the Business Requirements, I want to enforce that any time there is a contact, that contact must be tied back to a Vendor. In some cases, that might not be true in the source data. So what should you do?

There are a few choices. One is to ask the business to enter the missing data in the source system. This will give you the richest, most correct data, but in some cases it’s just not possible — perhaps there isn’t time, the business doesn’t want to do it, or some other reason applies.

The second option is to leave that record out, or to place it in another table to deal with later. This best protects the purity of your data, but might not be acceptable to the business.

The third option is to generate the data. This can be dangerous, but if you are careful it can work. I normally opt for this approach (depending on the situation of course) and enter some static value like “Not Yet Available” or “TO DO: Enter contact name” or some value that maintains the integrity of the system, but enters the parent record anyway.
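That generate-the-data option can be sketched as a small step in the load: rows with the required value load as-is, and rows without it get a clearly labeled placeholder and a flag so they can be followed up on. The field names and placeholder text here are illustrative.

```python
PLACEHOLDER_CONTACT = "Not Yet Available"  # static value, per the text above

def resolve_contact(row):
    """Fill a missing required Contact with a labeled placeholder.

    Returns (row, generated): the row to load, and a flag indicating
    whether a placeholder was generated, so flagged rows can be
    reported back to the business for correction.
    """
    contact = (row.get("Contact") or "").strip()
    if contact:
        return row, False  # data present; load as-is
    fixed = dict(row)
    fixed["Contact"] = PLACEHOLDER_CONTACT
    return fixed, True  # placeholder generated; flag for follow-up
```

Keeping the flag alongside the loaded row is what makes this approach safe: the parent record loads, but nothing pretends the generated value is real data.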

You’ll have to figure out what makes sense for your situation.

In the next installment I’ll explain how to choose and create the ETL process and implement the extraction and load.

InformIT Articles and Sample Chapters

My article series on database design starts here.

Books and eBooks

Database Design for Mere Mortals is a great book on database design.

Online Resources

You can learn more about Business Process Model Notation here.