Home > Articles > Data > SQL Server

  • Print
  • + Share This
Like this article? We recommend

Changing Dimensions

Perhaps one of the biggest challenges in maintaining a data warehouse is managing historical data values. For example, suppose that you have an employee table that tracks organizational hierarchy by recording the employee ID of the supervisor (as Supervisor_ID):

Employee

Employee_ID

Last_name

First_name

... (other columns)

Supervisor_ID


What happens when the organizational structure changes? Suppose that Jim White worked for Jason Green from 1999 to 2002, and Jason Green found greener pastures and left the company in 2002. James Bond became the new supervisor for Jim White. In a transactional system, this change is likely to recorded as simply overwriting the supervisor ID—changing it from Jason Green's employee ID to that of James Bond. This way, all new reports show James as Jim's supervisor. But what about the historical reports? Would James Bond like to get credit for Jason Green's work? Depends on whether Jason had done a good job or not, right?

There are several ways of managing changing dimensions, and each of them has merit in certain situations depending on business needs. For example, if FINANCE department changes its name to FINANCES, you are safe in presuming that overwriting the existing value with the new value won't hurt many feelings. So the easiest way to handle the change is to discard the historical value and replace it with the new value.

The next idea is to record the old value, the new value, and the date of the change. To support such functionality, the employee dimension would have to change slightly:

Employee

Sequence_number

Employee_ID

Last_name

First_name

... (other columns)

Supervisor_ID

Supervisor_Start_date

Supervisor_End_date


This way, the reports would have to check the dates for which the data is retrieved and report either Jason Green or James Bond as Jim White's supervisor. Notice that in such cases employee_id can no longer be used to uniquely identify records—each employee that has had more than one supervisor will have more than one record in the employee dimension. Therefore, I added the sequence_number column to uniquely identify each record. Doing so allows me to run reports by employee (regardless of who managed the employee at the time the report was generated) or by supervisor.

Yet another way to treat the changing dimension is to create a new record when certain values change and do not associate the old record with the new one. Suppose that we're building a data warehouse for a video store chain. And suppose that renting habits of single women are much different from those of married women. When one of our customers, Lacey Smith, gets married and changes her name to Lacey Smith-Butler, we no longer care to associate her rental history of a single woman to her new trends as a married woman. So we keep her old record, but we also create a new record in which Lacey's marital status appears as married:

Customer_ID

Last_name

First_name

Marital_status

Sex

Record_creation_date

12345

Smith

Lacey

Single

Female

06-20-1999

349500

Smith-Butler

Lacey

Married

Female

06-20-2003


We examined various ways in which we can approach the maintenance of changing dimensions. This seems fairly straightforward if we were to examine each dimension record manually. However, unless you're building a data warehouse for a simple classroom exercise, manual examination of each record is not an option. Some advanced (and expensive) data warehousing tools allow automatic maintenance of changing dimensions. But in most cases, changing dimensions must be maintained by using the routines that import data from the transactional sources into the data warehouse.

Code in the routines for populating changing dimensions will differ depending on which method you use for maintaining the change; if you are replacing the existing value with the new value, you can simply empty out the existing dimension table (by using the TRUNCATE TABLE statement) and populate it with new values.

If you are maintaining the history of when changes occurred, you have to compare values of each column for which you're tracking history and populate the start_date and end_date columns accordingly. In addition, you have to create a new record and copy values of columns that are the same in the old and new record.

Finally, you might be creating a new record without associating it with the old record if values of a particular column change. If so, you need to examine the contents of the column in question and add a new record to the dimension.

  • + Share This
  • 🔖 Save To Your Account