As an instructor, I teach classes in system administration, data modeling, and report authoring. I'm a huge advocate of business analytics technology; I see the value to the organization of providing self-service reporting and analysis. To help provide perspective, I like to start classes with an understanding of data warehousing concepts and how these ideas apply to data sources and data models. This example provides the proper perspective for designing environments, models, and reports.
While a number of great texts look at data warehouses (for example, Data Warehousing: Architecture and Implementation and Data Warehouse: From Architecture to Implementation), many of my students don't understand the rationale for data warehouse design, and they look for a simple explanation of what a data warehouse is. With that information in mind, learners can apply these concepts to their designs.
Reporting Data Versus Analyzing Data
The difference between reporting information and analyzing data is huge. I like to think of reporting as "I know what I want. Just show me the numbers." Typically, when managers are looking for specific feedback about the performance of an organization, they turn to reports for this information. For example, a sales manager may ask, "How were our sales numbers for last week?" Based on the results of the report, the manager can see whether the team is going to make its monthly quota. This manager has a specific report design in mind and knows how to react based on the numbers presented in that report.
Analysts typically are confronted with a problem such as "I don't know what I want, but I'll know it when I see it." The approach to reviewing information changes perspectives as the needs change. The sales manager may ask a question like, "Why did we miss our quota last month?" The analysis of the answer will require looking at different customers, products, sales representatives, and time periods. The process of reviewing the data is more than just looking at a list of numbers. Perspectives on those numbers need to change as the analysis continues.
A Little Personal History
When I first started working in information technology, the goal of database design was normalizing data to remove redundancy and to minimize overall size. This process leads to complex databases with many tables joined together in complex ways. These databases were very efficient for the entry and maintenance of data, but information technology experts were needed to create efficient queries to extract information from that data. As a new programmer, I was taught how to design queries based on report requests.
Data warehouses were touted as a way to improve the ability of report authors to extract information from a database. As an IT professional, it took me a while to really understand the importance of these types of databases: "I can get all the information I need to create my reports, so why do we need to look to a new type of database?"
After reading many great books on data warehouses and online analytic processing design, the answer finally made sense to me. It comes back to looking at a report and designing a database specifically for writing such reports. Normalized database are designed for easy data entry and maintenance, but not for easy reporting.
The First Thing You Need to Know: Numbers
With all of the reports that I have read or created, two pieces of information are key. First are the numbers. I can think of only a handful of reports I've written that didn't have numbers that managers wanted to see. I don't mean codes and identifiers; I'm talking about the numbers that drive the business. Managers will often look for the numbers first and then see the other piece of information. If a report has no numbers, the managers won't know whether they're meeting their performance metric targets. When you create a reporting database, you want to have a grouping of the key performance indicators used by the organization.
The Second Thing You Need to Know: Perspectives and Insights
After looking at the numbers, managers want to provide perspective on the numbers that are being presented. These attributes will organize the numbers in ways that make sense to the user. As I would build reports, the managers would ask for more information about the customers, products, time periods being reported, and the salespeople.
So, when creating a reporting database, you want to organize the attributes logically to allow for added perspectives and insights into why the numbers are what they are. In class, we describe these sets of attributes as providing the "who," "what," "where," "when," and "why" for the numbers.
The Star Schema
This was the link I needed to help me understand the design of reporting databases. I need to create a database that allows me to link the numbers to the attributes. With that, I finally understood what a star schema is.
First, reporting databases should have a table for the key performance indicators that you want to include in the reports. This is known as a fact table. As a database designer, you should include all the numbers that the business will use. Sometimes that means that the numbers will be similar, but calculated differently. (An example is the differences between profit margin as currency and as a percentage of the business).
The goal for the fact table is that your report authors can perform fewer calculations to generate the desired results. This strategy allows for a simple solution for the writers of queries, as well as consistency in the use of the calculated fields.
Along with the key performance indicators, you'll need to include the key fields that will link to the tables of attributes. The second type of table contains all of the common attributes that report writers will use to provide context to the facts. This table is commonly known as a dimension table. A number of dimension tables will help to answer the "who," "what," "when," "where," and "why" of the data. Each table should contain all of the appropriate information in a specific area. For example, the employee dimension table should have all of the information that could be used in a report for employees. Ideally, dimension tables should only be linked to fact tables, and facts should only be connected to dimensions. Why? If you begin to have sets of attributes pointing to other sets of attributes, you are beginning to normalize the data. This approach reduces the simplicity of the star schema and increases the complexity of the queries to retrieve the data.
I hope you see value in this discussion, and I encourage you to read more about data warehousing and online analytical processing. My goal in this article is simply to start the discussion about these types of databases through this simple presentation.
Roger Johnson is a learning consultant on IBM Cognos technologies, delivering a wide variety of courses focused on the needs of his learners. His background includes application development and technical education across many industries and technologies. He is coauthor of IBM Cognos 10 Report Studio: Practical Examples. He can be reached by email (firstname.lastname@example.org) or through LinkedIn (rogerljohnson). You can find classes taught by Roger and other technical training experts at IBM Business Analytics Training.