
SQL Server Reference Guide

Migrating Departmental Data Stores to SQL Server: Design the ETL, Continued

Last updated Mar 28, 2003.

This is the fifth article in a series that describes a formal process you can follow to migrate data stored in “departmental data stores” (such as Excel files, text files, XML documents, and so on) into a Relational Database Management System (RDBMS) like SQL Server. The first article in the series is here, and you can follow the “Next” link at the bottom of each subsequent article to work your way to this one.

In the first article, I explained what these data stores are, what they mean to your organization, and when they should be considered for migration. Some data doesn’t need to be stored in an RDBMS, and other data does. I also explained a few methods you can use to locate that data. That’s the first step.

In the second installment I explained how to take that data and model it, teasing out the requirements from the discovery you’ve done so that everyone agrees on the data’s final format. I wasn’t able to complete that entire process in one article, so I finished that step in the one that followed.

In the third tutorial I covered the Business Requirements document in more depth, and I explained how to normalize the model into tables and columns. And in the last article I cleaned up the model and explained how I decided on a location for the data.

In the fourth tutorial I explained how to design your Extract, Transform and Load (ETL) process, and what to consider for data type conversions and so on.

Here’s the outline of the process and where we are so far:

  1. Locate the data
  2. Model the system
  3. Decide on the Destination
  4. Design the ETL and Migrate the Data
  5. Attach the front-ends
  6. Test and Monitor

In this article I’ll explain how to run the ETL process. Although I used a single process (which I’ll describe) for my project, I’ll give you other options for your own. It’s also not atypical to see multiple processes used, depending on what you need to do.

You have a lot of choices here, so there are a few decision points to think about. I’ll stick with what you have right in the box, assuming that you have SQL Server 2005 or higher, PowerShell, and Excel somewhere in the mix. Of course, there are other options if you have third-party solutions as well.

Use Excel

The first, and perhaps easiest (at least for me), solution is simply to insert the data into SQL Server directly with Transact-SQL statements. But writing those statements by hand for a lot of data would be a real pain, so I decided to have Excel do it for me.

You can have Excel talk to SQL Server quite easily, at least in a read-only fashion. While Excel can be coded to write data back into SQL Server, it isn’t trivial. But there’s an easier way: you can have Excel create the code for you.

That’s right: using simple concatenations in Excel, I designed a complete insertion script to take care of all of the work. It’s actually quite easy, if a bit tedious. Here’s the process I followed:

Design the Insert Statements

If you’re doing straight insert statements, you can use the INSERT syntax. If you’ve created stored procedures or functions to do your data insertions, you use the EXEC syntax. Either way, the code ends up looking something like this for the Vendor table in my example:

/* Straight Inserts */
INSERT  [dbo].[Vendor]
        (
         [VendorCode]
       , [VendorName]
       , [Description]
       , [Address1]
       , [Address2]
       , [City]
       , [State]
       , [PostalCode]
        )
VALUES  (
         N'21342'
       , N'Hot wheels Catering'
       , N'Great food, fast'
       , N'221 B Baker Street'
       , N''
       , N'Bellevue'
       , N'WA'
       , N'98042    ')

Really the only parts of the INSERT statement that change are the actual values themselves.
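
If you’ve wrapped your inserts in stored procedures instead, the EXEC form follows the same pattern. Here’s a sketch; the procedure name matches the sp_InsertVendors call that shows up in the PowerShell example later, but the parameter names are my own invention:

/* Sketch: the stored procedure form, one parameter per column */
EXEC dbo.sp_InsertVendors
         @VendorCode = N'21342'
       , @VendorName = N'Hot wheels Catering'
       , @Description = N'Great food, fast'
       , @Address1 = N'221 B Baker Street'
       , @Address2 = N''
       , @City = N'Bellevue'
       , @State = N'WA'
       , @PostalCode = N'98042'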

Write the Formula

So I moved down to an empty spot on my user’s spreadsheet, and then created the following formula:

="INSERT Vendor (VendorCode, VendorName, Description, Address1, Address2, City, State, PostalCode)
 VALUES ('" & $A2 & "','" & $C2 & "','" & $D2 & "','" & $E2 & "','" & $F2 & "','" & $R2 & "','"
 & $S2 & "','" & $T2 & "')"

OK, it looks kind of intense, but it really isn’t. It’s a simple matter of finding the cells I wanted for the values and typing those in where you see the “$” references. That’s it.
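
One caveat before you copy the formula down: a value with an embedded single quote (a vendor named O’Brien’s Catering, say) will break the generated statement. If your data carries that risk, you can wrap each cell reference in Excel’s SUBSTITUTE() function to double up the quotes, which is how T-SQL escapes them. A version of the same formula with that guard looks like this:

="INSERT Vendor (VendorCode, VendorName, Description, Address1, Address2, City, State, PostalCode)
 VALUES ('" & SUBSTITUTE($A2,"'","''") & "','" & SUBSTITUTE($C2,"'","''") & "','" & SUBSTITUTE($D2,"'","''")
 & "','" & SUBSTITUTE($E2,"'","''") & "','" & SUBSTITUTE($F2,"'","''") & "','" & SUBSTITUTE($R2,"'","''")
 & "','" & SUBSTITUTE($S2,"'","''") & "','" & SUBSTITUTE($T2,"'","''") & "')"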

Use the INSERT Statements

From there I copied the formula for all of the rows, and Excel wrote this for me (I’m only showing one line of output here):

INSERT Vendor (VendorCode, VendorName, Description, Address1, Address2, City, State, PostalCode)
 VALUES ('25364','Acme Office Supplies','Downtown Bellevue','123 7th Avenue South',
'PO Box 34122A','Bellevue','WA','98042')

Easy Peasy. I pasted all of the rows into Notepad as a script, and now I can run that whenever and however I want, from SQL Server Management Studio, osql, SQLCMD, PowerShell, or any other command-query tool.
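
For example, with the script saved as c:\temp\vendor_inserts.sql (a file name I made up), a SQLCMD run using Windows authentication would look something like this; the server and database names are placeholders:

sqlcmd -S MyServer -d VendorDB -E -i c:\temp\vendor_inserts.sql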

Use OpenQuery or OpenRowSet

You might want to transfer the data more often, have more control, or avoid copying the Excel spreadsheet. Or perhaps you’re not allowed to copy the user’s spreadsheet or add anything to it. No problem; have SQL Server reach across and read the data directly. From there you can create T-SQL statements (as above), insert the data directly, or place it in staging tables if you want to go that route.

The trick is a “remote query.” You have a couple of ways to do this: you can use a linked server (not my preference, for several reasons) or just use a single query that opens a connection to Excel and reads the data. If you want to use a linked server (more permanent), read up on the OPENQUERY syntax. If you’re like me and just want to open the query and then close it, you can use the OPENROWSET syntax. Either way, SQL Server treats the Excel spreadsheet as a database, and you can even use rudimentary SQL statements against it.
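
For completeness, here’s a sketch of the linked server route; the linked server name is mine, and you should review the security implications before creating one:

/* Sketch: register the workbook as a linked server, then query it */
EXEC sp_addlinkedserver
     @server = N'EXCEL_VENDORS'
   , @srvproduct = N'Excel'
   , @provider = N'Microsoft.ACE.OLEDB.12.0'
   , @datasrc = N'c:\temp\VendorList.xlsx'
   , @provstr = N'Excel 12.0;HDR=YES';

SELECT * FROM OPENQUERY(EXCEL_VENDORS, 'SELECT * FROM [Vendors$]');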

Before you do this, again let me remind you to read and understand all of the statements I’m showing you here, and test first on a development server. To use these statements you need to allow “Ad Hoc Distributed Queries,” and that might have security implications at your shop. To be safe, bring the data into a test server, and then transfer it to your “real” SQL Server Instance.

Here are the statements you can use to open up the Ad Hoc Query functions on your test server:

/* Enable Ad Hoc Distributed Queries */
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'Ad Hoc Distributed Queries', 1
RECONFIGURE WITH OVERRIDE
GO
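
And when you’ve finished testing, it’s a good idea to flip the setting back off; the reverse looks like this:

/* Turn Ad Hoc Distributed Queries back off when you're done */
EXEC sp_configure 'Ad Hoc Distributed Queries', 0
RECONFIGURE
EXEC sp_configure 'show advanced options', 0
RECONFIGURE
GO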

With that done, and assuming an Excel 2007 spreadsheet in the c:\temp directory with a tab called “Vendors”, here’s a sample query from my SQL Server Instance that reads all of the data for me:

SELECT * 
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0'
, 'Excel 12.0;Database=c:\temp\VendorList.xlsx;HDR=YES', 
'SELECT *
FROM [Vendors$]')

Now I’ll make it interesting. I’ll take a subset of the data and put it in a temporary table so that you can see the basic syntax for an insert operation with this method:

/* An example using SELECT and INSERT INTO */
CREATE TABLE #Test1 (c1 VARCHAR(255), c2 VARCHAR(255))
INSERT INTO #Test1
SELECT * 
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0'
, 'Excel 12.0;Database=c:\temp\VendorList.xlsx;HDR=YES', 
'SELECT [Vendor Code]
, [Vendor Name] 
FROM [Vendors$] 
WHERE [Vendor Name] = "Microsoft"')

SELECT * FROM #Test1;
GO

The key is the pair of arguments that follows the OPENROWSET keyword: the provider name and the connection string. You’ll need to look those up if you’re not using Excel, or even Excel 2007, since they differ based on what you want to do. But this method works with sources as varied as Oracle, text files, XML, Access, and more.
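
To close the loop on this method, here’s a sketch that loads the worksheet straight into the Vendor table from earlier, rather than a temporary table. The worksheet column names are my guesses, and the RTRIM deals with padded values like the PostalCode in the first sample row:

/* Sketch: load the worksheet directly into dbo.Vendor */
INSERT  [dbo].[Vendor]
        ([VendorCode], [VendorName], [Description], [Address1]
       , [Address2], [City], [State], [PostalCode])
SELECT  [Vendor Code]
      , [Vendor Name]
      , [Description]
      , [Address 1]
      , [Address 2]
      , [City]
      , [State]
      , RTRIM([Postal Code])  -- strip trailing spaces like those shown earlier
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0'
, 'Excel 12.0;Database=c:\temp\VendorList.xlsx;HDR=YES', 
'SELECT * FROM [Vendors$]');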

Use PowerShell

This one was an interesting choice. I’ve been using PowerShell for just about everything these days, so I wondered if I could combine the two approaches: generating the insert syntax (this time as stored procedure calls) while reading the Excel file the way I did with OPENROWSET. Once again, the magic here is in that connection string. Here’s the code I used:

# Open the workbook with the same ACE OLE DB provider the OPENROWSET examples use
$ExcelConnection = New-Object -ComObject "ADODB.Connection"
$ExcelFile = "c:\temp\VendorList.xlsx"
$strQuery = 'SELECT * FROM [Vendors$]'

$ExcelConnection.Open("Provider=Microsoft.ACE.OLEDB.12.0;Data Source=$ExcelFile;Extended Properties=Excel 12.0;")

$ExcelRecordSet = $ExcelConnection.Execute($strQuery)

# Emit one EXEC statement per row; the $() subexpression keeps the quotes
# tight around the value, and the while loop is safe on an empty sheet
while (-not $ExcelRecordSet.EOF) {
    Write-Host "EXEC sp_InsertVendors '$($ExcelRecordSet.Fields.Item('Vendor Code').Value)'"
    $ExcelRecordSet.MoveNext()
}

$ExcelConnection.Close()

There are actually other ways to do this with PowerShell. In fact, you can use PowerShell to open the spreadsheet, pick up the changes in it, and update the rows in SQL Server from there, using something called a DataSet. I’m thinking of writing a separate article on that process, but this should get you started.
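
As one more option, if you have the snap-in or module that provides the Invoke-Sqlcmd cmdlet loaded (it ships with SQL Server 2008 and later), you could run each statement as you generate it rather than printing it. A sketch, swapping out the Write-Host loop above; the server and database names are placeholders:

# Sketch: execute each generated statement directly instead of printing it.
# Assumes Invoke-Sqlcmd is available and $ExcelRecordSet is open as above.
while (-not $ExcelRecordSet.EOF) {
    $stmt = "EXEC sp_InsertVendors '$($ExcelRecordSet.Fields.Item('Vendor Code').Value)'"
    Invoke-Sqlcmd -ServerInstance "MyServer" -Database "VendorDB" -Query $stmt
    $ExcelRecordSet.MoveNext()
}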

Use SQL Server Integration Services (SSIS)

Possibly the richest, most powerful way to move the data into SQL Server is to use SQL Server Integration Services, or SSIS. I won’t cover the complete process I followed to make this package, for two reasons: one is that it is pretty complex, and the other is that a friend helped me do this! (Thanks, Alex Viera.)

If you’re not familiar with SSIS to begin with, start here and read up on that first. Alex helped me create a package whose data flow runs from the Excel source through Data Conversion, Derived Column, and Multicast tasks.

The interesting parts are the “Data Conversion” task, the “Derived Column” task, and the “Multicast” task. I’ll briefly explain why we chose this route.

Excel has various data types in the fields, but in my experience I’ve found that users don’t normally set them unless they are dates or times. It looks fine on the screen, but it isn’t always the data type they think it is. Excel will happily add numbers even when they are text, for instance.

For that reason, you must assume that the data is not in the proper type. Whenever I work with SSIS and Excel, I always start with a Data Conversion task, which basically lists the columns (which in Excel are often “Unicode”) and allows you to put them in the proper data types. Once again, that ERD you created earlier comes into play.

From there, the data streams into a “Derived Column” task. This takes data from the source and can create new columns (which I needed in order to break out City, State and Zip), combine them, and so on. I always name those derived columns so that I can tell later how they were created.

Now I pass the data on to the “Multicast” task, so that I can take the entire data set from one place and send it to several tables at once. This handles all of the Primary and Foreign Key relationships for me, since this is a completely new insert operation. If I already had data in the tables and wanted to tie out those keys using new data from Excel, I would have to use a Lookup task. You can see why this gets a little more complicated than the process I followed in Excel!

One final choice: you could also use SQL Server’s bcp utility to move the data in, but that’s a discussion all its own. As I played around with it, I found it more difficult than the methods I’ve shown you above.
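
For reference, a bcp load of a comma-delimited export of the worksheet would look roughly like this; the file, server, and database names are placeholders, and you would still have to sort out the data type issues discussed above:

bcp VendorDB.dbo.Vendor in c:\temp\vendors.csv -c -t, -S MyServer -T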

In the next article I’ll show you your choices for a front end for the new data source.

InformIT Articles and Sample Chapters

My article series on database design starts here.

Books and eBooks

Database Design for Mere Mortals is a great book on database design.

Online Resources

You can learn more about Business Process Model Notation here.