- Introduction
-
Table of Contents
- Microsoft SQL Server Defined
- Microsoft SQL Server Features
- Microsoft SQL Server Administration
- Microsoft SQL Server Programming
- Performance Tuning
-
Practical Applications
- Choosing the Back End
- The DBA's Toolbox, Part 1
- The DBA's Toolbox, Part 2
- Scripting Solutions for SQL Server
- Building a SQL Server Lab
- Using Graphics Files with SQL Server
- Enterprise Resource Planning
- Customer Relationship Management (CRM)
- Building a Reporting Data Server
- Building a Database Documenter, Part 1
- Building a Database Documenter, Part 2
- Data Management Objects
- Data Management Objects: The Server Object
- Data Management Objects: Server Object Methods
- Data Management Objects: Collections and the Database Object
- Data Management Objects: Database Information
- Data Management Objects: Database Control
- Data Management Objects: Database Maintenance
- Data Management Objects: Logging the Process
- Data Management Objects: Running SQL Statements
- Data Management Objects: Multiple Row Returns
- Data Management Objects: Other Database Objects
- Data Management Objects: Security
- Data Management Objects: Scripting
- Powershell and SQL Server - Overview
- PowerShell and SQL Server - Objects and Providers
- Powershell and SQL Server - A Script Framework
- Powershell and SQL Server - Logging the Process
- Powershell and SQL Server - Reading a Control File
- Powershell and SQL Server - SQL Server Access
- Powershell and SQL Server - Web Pages from a SQL Query
- Powershell and SQL Server - Scrubbing the Event Logs
- SQL Server 2008 PowerShell Provider
- SQL Server I/O: Importing and Exporting Data
- SQL Server I/O: XML in Database Terms
- SQL Server I/O: Creating XML Output
- SQL Server I/O: Reading XML Documents
- SQL Server I/O: Using XML Control Mechanisms
- SQL Server I/O: Creating Hierarchies
- SQL Server I/O: Using HTTP with SQL Server XML
- SQL Server I/O: Using HTTP with SQL Server XML Templates
- SQL Server I/O: Remote Queries
- SQL Server I/O: Working with Text Files
- Using Microsoft SQL Server on Handheld Devices
- Front-Ends 101: Microsoft Access
- Comparing Two SQL Server Databases
- English Query - Part 1
- English Query - Part 2
- English Query - Part 3
- English Query - Part 4
- English Query - Part 5
- RSS Feeds from SQL Server
- Using SQL Server Agent to Monitor Backups
- Reporting Services - Creating a Maintenance Report
- SQL Server Chargeback Strategies, Part 1
- SQL Server Chargeback Strategies, Part 2
- SQL Server Replication Example
- Creating a Master Agent and Alert Server
- The SQL Server Central Management System: Definition
- The SQL Server Central Management System: Base Tables
- The SQL Server Central Management System: Execution of Server Information (Part 1)
- The SQL Server Central Management System: Execution of Server Information (Part 2)
- The SQL Server Central Management System: Collecting Performance Metrics
- The SQL Server Central Management System: Centralizing Agent Jobs, Events and Scripts
- The SQL Server Central Management System: Reporting the Data and Project Summary
- Time Tracking for SQL Server Operations
- Migrating Departmental Data Stores to SQL Server
- Migrating Departmental Data Stores to SQL Server: Model the System
- Migrating Departmental Data Stores to SQL Server: Model the System, Continued
- Migrating Departmental Data Stores to SQL Server: Decide on the Destination
- Migrating Departmental Data Stores to SQL Server: Design the ETL
- Migrating Departmental Data Stores to SQL Server: Design the ETL, Continued
- Migrating Departmental Data Stores to SQL Server: Attach the Front End, Test, and Monitor
- Tracking SQL Server Timed Events, Part 1
- Tracking SQL Server Timed Events, Part 2
- Patterns and Practices for the Data Professional
- Managing Vendor Databases
- Consolidation Options
- Connecting to a SQL Azure Database from Microsoft Access
- SharePoint 2007 and SQL Server, Part One
- SharePoint 2007 and SQL Server, Part Two
- SharePoint 2007 and SQL Server, Part Three
- Querying Multiple Data Sources from a Single Location (Distributed Queries)
- Importing and Exporting Data for SQL Azure
- Working on Distributed Teams
- Professional Development
- Application Architecture Assessments
- Business Intelligence
- Tips and Troubleshooting
- Additional Resources
Migrating Departmental Data Stores to SQL Server: Design the ETL
Last updated Mar 28, 2003.
This is the fourth article in a series of a formal process you can follow to migrate data stored in “departmental data stores” (such as Excel, text files, XML documents and so on) into a Relational Database Management System (RDBMS) like SQL Server. The first article in this series is here, and you can click the “Next” link at the bottom of each subsequent article to bring you here.
In the first installment, I explained what these data stores are, what they mean to your organization, and when they should be considered for migration. Some data doesn’t need to be stored in an RDBMS, and other data does. I also explained a few methods you can use to locate that data. That’s the first step.
In the second installment I explained how to take that data and model it so that you can tease out the requirements from the discovery you’ve done and how to model that data so that everyone agrees on its final format. I wasn’t able to complete the entire part of that process there, so I finished that step in the article that followed.
In the third installment I covered a Business Requirements document in more depth and I explained how to normalize the model into tables and columns. And in last week's installment I cleaned up the model and explained how I decided to pick the location for the data.
Here’s the outline of the process and where we are so far:
- Locate the data
- Model the system
- Decide on the Destination
- Design the ETL and Migrate the Data
- Attach the front-ends
- Test and Monitor
In this article I’ll explain how to select the design for the Extract, Transform and Load (ETL) process and the considerations for migrating the data. I’ll spend quite a bit of time on the process, so this warrants its own discussion. I’ll follow that in the next article by detailing the options for actually running the ETL process.
Designing the Extract, Transform and Load
The first part of an ETL process is to select the sources and destination for the data. In my case, there’s only a single source, and it happens to be a Microsoft Excel Spreadsheet, version 2007. That’s really important, because no matter what tool I use I’ll need to know the way that the source deals with the data. Every method will handle the source in a very specific way, so it’s important to get that information.
In fact, I didn’t even ask the businessperson to tell me the version of Excel they were using. I went to their desk, opened the spreadsheet and clicked on “Help” and then “About” to verify the exact version they were using. Then I copied the spreadsheet to start my experimentation with the different methods, and right-clicked that file to verify the version. You should document carefully this part of the process it really does matter.
Of equal importance is the destination. I’m using SQL Server 2008 Enterprise Edition for this exercise, and that will make a difference in the data types, field structures and so on. The point is that you should understand both the source and destination carefully before you start.
You may have more than one source. If so, document those the same way.
Mapping Fields
The next step is to map out the fields you want to bring in. In my case, the spreadsheet is quite de-normalized. That means that several elements of data are “missed” in with each other there are Vendor names in the same “record” (line) as the Client name in the spreadsheet and so on.
What I’ll do is map out each element I need in the tables, and work backwards to the source. Remember that I built the database from the Business Requirements, so the database is now the “gold standard” of the data I need. I can do this electronically, or just on paper, but in any case I basically draw a line from the cell in the spreadsheet to the table and column in the database. It should end up as a one-to-one mapping with the relationships defined in a subsequent step.
This means in some cases I’ll have to break apart some cells in the spreadsheet into multiple discrete values. If you’ll refer back to the example spreadsheet I posted in an earlier article, you’ll see that City, State and Zip Code are actually all in one cell. Depending on how I extract the data and which tool I use, this might actually be done graphically for me, but if not, I draw a line from the single cell to a box showing the transformation, then on to the database tables and columns they belong to. For instance:
I do this for each “task” or step in the ETL process. In some cases this is so simple that paper and pencil will suffice, in other cases, the tool you pick will create this graphic for you. In any case, no matter how “simple” the transformation may appear, you need to follow this process.
Determining Differences
A spreadsheet is arranged much differently than a database. In fact, one database (such as Oracle) there are differences. This step involves analyzing those differences from every angle from something generic such as normalized versus de-normalized data to the actual data types and the structure. I’ll start with the generic task of making the relationships work.
As I normalize the data, I’ll have to keep the keys or relating values from the “parent” table stored somewhere as I break out the “child” data into its respective tables. For instance, I have in my diagram the Vendor separated from the Vendor Type, since a Vendor might supply my company with more than one service or good. But in the spreadsheet, the users have placed commas between the “types” values, so I have to transform them just as I did a moment ago with the addresses. But while the address goes into the same table, the Vendor Types gets its own table and I’ll need to tie those back out. So I’ll make a mark on the map when I need to do that indicating the pairing key or value. In my case it’s the Vendor Code.
The relationship part of ETL is possibly the hardest. It requires that you thoroughly understand the source data and the destination design. This is when those Business Requirements really help.
The next part of analyzing the differences is checking the data types. Excel is especially vulnerable in this area, since it does have strong data types that the user can choose, and Excel chooses its own (Unicode text in most cases) if the user doesn’t. This is where the Entity Relationship Diagram comes in handy I’ve taken the time to show the data types on my diagram, and working through the spreadsheet to understand two things: What types for the columns the user chose and what types the system chose when the user did not. Once again, this goes on my paper or electronic data map.
Finally there is the “optionality” of the data. This simply means where the data is required, versus where it is actually available in the source. For instance, based on the Business Requirements, I want to enforce that any time there is a contact, that contact must be tied back to a Vendor. In some cases, that might not be true in the source data. So what should you do?
There are a couple of choices. One is to ask the business to enter that data in the source system. This will be the richest, most correct data you can get, but in some cases it’s just not possible. Either there isn’t time, the business doesn’t want to do that or some other reason.
In that case, you can leave that record out, or place it in another table to deal with later. This protects the purity of your data the best, but might not be acceptable to the business.
The third option is to generate the data. This can be dangerous, but if you are careful it can work. I normally opt for this approach (depending on the situation of course) and enter some static value like “Not Yet Available” or “TO DO: Enter contact name” or some value that maintains the integrity of the system, but enters the parent record anyway.
You’ll have to figure out what makes sense for your situation.
In the next installment I’ll explain how to choose and create the ETL process and implement the extraction and load.
InformIT Articles and Sample Chapters
My article series on database design starts here.
Books and eBooks
Database Design for Mere Mortals is a great book on database design.
Online Resources
You can learn more about Business Process Model Notation here.
