Home > Articles > Data > SQL Server

SQL Server Reference Guide

Hosted by

Developing a Data Dictionary

Last updated Mar 28, 2003.

A data dictionary is, at its most simple, a mapping of data elements your organization stores, where it stores them, and what each element means. While it’s simple to describe, it’s a bit more complicated and time-consuming to implement.

There are multiple reasons to take the time to document the data elements in your organization. The first is for High-Availability and Disaster Recovery (HA/DR) or what is often called “Business Continuity.” If you don’t know where your data is or how it is recorded, you certainly can’t back it up or restore it in a timely manner when needed.

Another reason for a data dictionary is that you need to determine authoritative sources for your data elements when you need systems to interact. At one location I worked as a data professional, we had no less than three “authoritative” inventory reports — all of which displayed a different number for “inventory on hand”. This was a shop desperately in need of a good data dictionary. The report data discrepancy only came to light when a vendor got two of the reports from different parts of our company — and asked which number she should use to send us new material.

Whenever you implement a Business Intelligence (BI) system in your organization, you will need a data dictionary to gather the disparate elements and cull them into a single reporting entity. In other words, you need a single source of data documented somewhere so that you can put all of it together to report and analyze it.

Another interesting reason you need a data dictionary is for systems optimization. You’ll find in your discovery efforts that I detail below that you are more than likely storing the same data multiple times. That may be entirely acceptable, but in some cases you’ll find a department re-recording data from a database for their own downstream use. While that’s acceptable from a read-only standpoint, it becomes problematic if the department modifies that original data, and doesn’t feed it back into the general data flow. In this case, you may be able to integrate their department-level system into the larger one, cleaning the data and de-duplicating it at the same time. That saves backup and optimization time, since the department no longer has to maintain a set of maintenance processes for themselves. By cleaning the data real-time, you save the process of re-importing back into the main data system. It reduces size, time and effort all the way around.

Creating a data dictionary goes beyond simply identifying broad data elements. And you’re not able to do this alone. From the very first step, you’ll need to involve multiple teams to get an accurate data dictionary. No one group knows where all of the data is, which of it is authoritative, and which you should track. This will definitely be a place where you want to involve business teams, development teams and data teams, however those are laid out in your organization.

So a data dictionary has multiple uses in an organization, and it will be a cooperative effort to create one. Here are some practical steps on implementing the effort.

Selecting a Format for the Data

This might not seem like the obvious place to start. In fact, I normally caution against picking a tool or some other implement as the starting place for a technical challenge, but this is an exception.

The reason you want to start with the format of data — often persisted using a particular tool or methodology — is that it will define the rest of your efforts. For instance, if you are dealing with another organization, and they have standardized on ISO 11179, for instance, you will need to ensure that your efforts match that standard. Using that standard as an example, there are well-recognized elements and descriptions for laying out the data.

It’s best if you define the “meta-data” (data about your data) that you want to collect at the outset of the effort. Since this work will not be trivial to do, you want to originate the effort once and then maintain the system and adapt it as you move along. This isn’t to say that you can’t create your own standards for the data dictionary. The point is to have a single referencing, maintainable system so that you can query and locate where data is stored in your organization.

Once you’ve defined the tool or standard you plan to use, it’s on to the next step of defining the data elements for your data dictionary.

Defining Data Elements for the Data Dictionary

A data dictionary is normally restricted to a particular set of data. After all, every column in every spreadsheet on every user’s desk, every cell in a word-processing document, even lines in a text file is a data element. It doesn’t make a lot of sense to try and track every single one of those elements, but there are some that you should track.

There are several methods to use to define the elements for your data dictionary. If you have picked an automated tool for your system, you may find that it can “watch” the data flowing through your systems and help you identify them very quickly. If you’re using a manual process, follow these tips to help as you go along.

First, keep the elements you want to collect as small as possible. Start broadly — have the Business Analysts help you define the “mission critical” systems (meaning logical systems, not physical hardware or software systems) you have, and tease out the large elements from there, such as “Purchase Order”, “Inventory”, “Time Spent on Project”, Billing” and so on. At this point the task won’t be as daunting, because you’re only working with a few scenarios — things that the organization tells you they must know in order to do business. And keep it at a high-level at this point, assuming your tool or process allows you to do that.

Second, let the business drive the original requirements. They should understand the need for this data, for all of the reasons I outlined above. It will take time and money to perform this audit, so they need to understand why you’re doing this work instead of other work. Once they understand the reason for the effort, they can drive the data requirements for you. Keep in mind that creating the data dictionary is not always the most difficult part of this exercise — maintaining it is.

Finally, make sure you have buy-in from as many teams as possible. You cannot do this work alone, even when “alone” means the entire technical team. Your focus is on technology and where it is applied, but there is a surprising amount of data that is not captured in formal IT systems — they are in spreadsheets, on Access databases and “in the cloud”. You’ll need help locating these. The development teams also need this information, since they are often asked to include information on a single screen from multiple systems. The temptation is to re-create those data systems, rather than looking to see if they already exist.

With the broad areas defined, It’s time to discover where the authoritative bits and bytes live, and who owns them.

Locating Sources for the Data Elements

If you’re using an automated Master Data Management (MDM) system such as SQL Server’s Master Data Services, you may perform this step a bit out of order. Those automated systems normally have a checklist of tasks that identify the source systems by watching data patterns, and where they flow to and from.

You can simulate this yourself, by starting at the user’s end of the data flow. Once the business identifies the critical systems, it’s a simple matter to find out what the users (or systems) do to get that data into those systems. Find the application owners for those programs, and trace the data being entered back to their source systems.

Repeat this process, but keep in mind that the data will probably be very normalized at first. You’ll need to find out what data store (or data stores) reach element ends up in, and then look for views or stored procedures that bring the elements back into a meaningful whole. This is usually the granularity you’re more interested in. For instance, a Purchase Order object is usually represented by several elements joined together, such as the Purchase Order header, line items and so on. Unless these are coalesced from several systems, you normally only need to know the PO object, not each detail breakout, at least for the data dictionary.

So at this point you have the following defined:

  • What you want to track
  • Where it is

Next, you need to define the meaning to each of these larger elements.

Assigning Meaning to the Elements

It is at this point that the data professional must involve the business. A database is similar to a bank. A bank can store your money, allow you to put money in and take it out, or even covert it to another currency. What a bank can’t do is tell you if a certain amount of money is “good” or “bad”, without having more background. In fact, even with that it’s usually not the place to ask that question.

Where the database is similar is that it stores data. As data professionals we make it possible to store data, edit it, remove it, report on it, even convert it to other formats. But we can’t give it meaning — nor should we. Our guarantee is that what goes into the database is what we store.

So this is where you need a representative from the business, usually a Business Analyst that will give the elements meaning. You might think that if a data element is marked “Purchase Order” then you can state that it’s a Purchase Order — no reason to involve a Business Analyst at that point, correct?

But you’d be wrong. Recall that company I worked at that had multiple numbers for the word “inventory”. In fact, the issue revolved around the fact that it was “on-hand” inventory. It turns out that each of the reports with the different numbers on them were correct! Each was using a different definition of the word “on-hand”. In one country, anything on the shelf of the company was considered (and taxed) as on-hand inventory. In another country, the source manufacturer would stage parts on our company’s shelves, but we actually only entered them into on-hand inventory when we paid for them for our use. Reporting came from multiple systems, and each was correct, hence the different definitions for on-hand inventory.

And as the data professional, you can’t really be expected to track international tax law. So the point is that you need to make sure a business representative evaluates your list and gives each data element meaning.

Developing a System for Updating the Data Dictionary

Remember, the Data Dictionary you’re creating is not the process of using it. In fact, the more difficult work is yet to come — maintaining and using the Data Dictionary is actually far more time consuming.

This is where a Master Data Management system is superior to using your own manually created one. In an ideal setup, developers no longer even use the database system — they point directly at the MDM system and feed and take data from that. Of course, this isn’t always possible if the application is “canned” or not controlled by your organization.

In that case, you need to ensure that as data systems are added, removed or altered within your organization that your team is made aware of the changes. Otherwise the data dictionary will be out of date, and could even cause errors.

The keys to keeping the system up to date are cooperation and keeping the meta-data set as small as possible. Whenever a new system is put into place or an older one is changed, make sure there’s a tight process for notifying your team. Again, the entire data dictionary process feeds important business processes like HA/DR, BI and integration, so the importance is there.

Also, remember that it is comparatively easy (and tempting) to add more and more elements to track in a data dictionary. Resist this urge. Every element you add brings an exponential level of maintenance and upkeep.