Home > Articles

Gathering Data for a Data Warehouse

The term data warehousing is popular these days, despite the fact that many people don’t know what it stands for. Read this article by Baya Pavliashvili for an overview of the first step of your data warehousing projects—compiling the data that will construct the DW.
Like this article? We recommend

Like this article? We recommend

Compiling the data that will be part of your data warehouse (DW) is one of the first and most difficult steps of building a DW. This article will give you an overview of what's involved in extracting data and some tips on how to complete this task successfully.

Let's imagine that your company or client decides they need a DW. What's the first thing you do? Perhaps the same thing as in any other project: gather the requirements. Believe it or not, the users of your data warehouse will be the ones that do most of the work. As a database architect, you're there to provide the technical expertise, but you can't teach the business users how to do their job, at least not before you learn their business.

By the time the DW project is completed, you'll probably know more about the business than most of the business owners themselves (whether you want to or not). Some companies have a luxury of having their own Business Analysts (BAs) or hiring consultants to do BA work. Either way, as a database architect and administrator you need to know the business as well as any BA.

Be sure to determine who your DW users will be. Most of the time, the users of the DW will be the upper-level managers, who tend to be difficult and intimidating to work with. Indeed, be prepared to get the door shut in front of your nose at least a few times. However, don't worry: After you can demonstrate the power and capabilities of the DW, you'll get a warm welcome and perhaps even a smile from the same managers. If they're convinced that you can generate better reports in 20 seconds than the ones they currently use and take four days to run, the managers will gladly allocate a good portion of their precious time to conversing with you.

The type of questions you ask the business users will vary from one project to the next. However, you'll invariably need to find out the following:

  • What are all the transactional systems used by the company? In other words, how does the company collect the data? Whether the management wishes to use any of these systems for the DW effort does not matter. Often, they don't realize where the most valuable data of the company reside. Nor should they—that's your job.

  • Who owns the data? This is a tough question because some of the transactional systems could have been purchased from a third-party vendor who does not wish to give out its data structure or the data format. Besides, there might be several people responsible for data upkeep within an organization. When they see a data warehouse architect, the first thing they do is update their resumes; they wonder if you're a sign of things to come. In any case, this is a very important step, so don't take it lightly. Be sure to get not only the names, but also the job titles and duties of each person or organizational unit responsible for data.

  • What types of reports are expected to be derived from a data warehouse? This could be by far the most exciting process in building a DW. This is where you get to interview all the future users of the DW and get to know all of their expectations. Do not make the mistake of telling them what a DW is supposed to do for them! You're just now gathering the requirements, so the final deliverable of the project has not been determined yet. Be sure to document all requirements with the name and title of the person who requested the report.

After you have documented the data sources and types of reports you need, it's time to go after the data. At this point, it's a good idea to generate a list of all the data fields (columns) that you will need to satisfy reporting requirements. After you have a list of fields, look at the source systems.

Fortunately, most of the database systems provide some form of data security. On the other hand, the secure data means that you have to get the permissions to look at it. Don't expect your client or supervisor to grant you access. They probably don't have a clue of what database permissions mean. Instead, grab your notes from Question 2 and speak with the data owners—the folks responsible for daily operation of the transactional databases. How easy it is to get access to the data depends on the organization. The larger the organization, (usually) the tighter the security control, and therefore the longer it will take to get appropriate permissions. In smaller companies, the security might not be as tight. Even so, be sure to inform the data owners of your intentions prior to digging into their data. After you explain your mission to the data owners, they should gladly grant you data-reading permissions. If that's not the case, you might have to get an approval from the upper-level managers, who happen to be your customers.

After you glance at the source systems, you're likely to find out that certain fields requested on the reports do not appear in any of the source systems. Now what? Time to go back to the business users and get the name of the missing data source. If they asked for a data field, it's either available directly from one of the systems or is a result of a calculation. The business owners might not know exactly which system contains the missing fields, but they should know all the systems they work with.

After you know all the systems involved, it's a good idea to build a Data Flow Diagram (DFD), and present it to the users for their approval. The DFD does not have to be very detailed, but it helps to know how the data moves from one system to the next and what kind of manipulations happen during such data exchange. That way, if the same data is contained in multiple systems, you only have to extract it from the single most appropriate source.

It's a good practice to get at least a small sample from each data source to analyze the data format and shape. That's where you find out if you need to perform any data cleansing prior to transforming it into the dimensional model. The data format will also give you an idea of what's involved in writing the transformation routines.

Data cleansing might involve removing invalid characters, deleting invalid records, and much more. Sometimes, it's more cost-effective to assign a data entry clerk to clean the data, rather than developing data cleansing code. For instance, no code will tell you that "Dr. Anderson R Starkey" and "Andy Starkey, M.D." are one and the same. However, the clerk that has entered such data for the past five years can match such anomalies in her sleep.

Next, you will have to relate data in multiple systems, and try to come up with some sort of mapping of data fields. For instance, the rental information might be kept in one source, and the inventory data might be in another data store. If the company needs to analyze their expenses against their revenues, you'll have to grab data from both systems and come up with a way to compare inventory costs to the revenue that was generated against that cost.

When trying to relate the data from various sources, a useful exercise is building an Entity Relationship Diagram (ERD). You might be familiar with the ERD concept if you've modeled any relational databases. A dimensional ERD isn't concerned with Primary and Foreign Keys as much as it is concerned with the cardinality rules. For example, you might have a store that gets rental vehicles from multiple suppliers; on the other hand each supplier might work with multiple stores. Therefore, you have a many-to-many relationship between the entities of supplier and store.

Getting familiar with all the data sources and their format may take from a few weeks to a few months, depending on the size of the project. After that, you're ready to build your data extraction utilities.

In some cases, if you're lucky, extracting data will be a piece of cake—all you have to do is some formatting and the data is ready to be loaded into your staging area. But more than likely, this won't be the case. If you have text files with mainframe extracts or a relational data that can be easily transferred to your staging area, consider yourself lucky. More often, you'll find yourself in need of interfacing with some proprietary tool developed by a third party. If you don't know programming languages other than SQL, it's time to call your fellow programmers or hire a consultant having prior experience with the proprietary system(s).

Let me stress again that gathering requirements thoroughly is extremely important. The data warehouse needs only those columns that need to appear on the reports. If you import every column from each source system, you might run out of storage space on your server. If you don't have enough columns from each system, your users won't be happy with the DW. Therefore, be sure to get all the requirements ironed out prior to developing any data extraction routines.

After you have extracted all the data elements needed, you'll have to place such data in the staging area. The staging database does not have to be normalized. In fact, it might be beneficial to denormalize the staging database because it'll be used for querying and populating the dimensional database. Nor should your users attempt to get the reports from the staging database, although this idea might be tempting. You should still set up all the appropriate keys and indexes in the staging database to ensure the efficient execution of queries executed against this database.

Some IT professionals prefer generating Analysis Services cubes straight from the staging database. This is a bad practice because the staging area isn't optimized for reporting purposes. The staging database contains data in the same or similar format as it is in the source systems. Instead, the staging database should be used only to get the data extracts from various sources. Later, you transform these data and load them into the dimensional model, which is optimized for building cubes.

Although not necessary, it's a good idea to place the staging database on the same server and the same database engine where the dimensional database will reside. Although you could set up ODBC connections from the dimensional database to the staging area, it'll be more efficient to have both databases on the same server and in the same format.

I recommend using Microsoft SQL Server 2000 for all your data warehousing needs for a couple of reasons. First and foremost, SQL Server makes data warehousing affordable for many small to midsize companies that cannot afford the cost of maintaining other database management systems. When you purchase Microsoft SQL Server, the data warehousing tools are provided to you at no additional cost. In addition, SQL Server is much easier to use than any other major database engine on the market.

Summary

In this article, I gave you an overview of the first step of your data warehousing projects—compiling the data that will construct the DW. I also gave you some tips on how to be successful in extracting data and populating the staging area.

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020