Data Warehousing Architectures and Skill Sets
To ensure that we are working from a common understanding, here is a very brief summary of data warehouse architectures and requisite skill sets. This is not a substitute for any of the more comprehensive technical books on data warehousing but should be sufficient as a baseline for the remainder of the book.
Data Warehousing Conceptual Architectures
Figure 1.1 depicts an abstracted classical data warehousing architecture and is suitable to convey either a Kimball-style (Kimball and Ross 2002) or an Inmon-style (Inmon 2005) architecture. This is a high-level conceptual architecture containing multiple layers, each of which includes a complex integration of commercial technologies, data modeling and manipulation, and some custom code.
Figure 1.1 Classical data warehouse architecture
The data warehouse architecture includes one or more operational source systems from which data is extracted, transformed, and loaded into the data warehouse repositories. These systems are optimized for the daily transactional processing required to run the business operations. Most DW/BI systems source data from multiple operational systems, some of which are legacy systems that may be several decades old and reside on older technologies.
Data from these sources is extracted into an integration tier in the architecture that acts as a "holding pen" where data can be merged, manipulated, transformed, cleansed, and validated without placing an undue burden on the operational systems. This tier may include an operational data store or an enterprise information integration (EII) repository that acts as a system of record for all relevant operational data. The integration database is typically based on a relational data model and may have multiple subcomponents, including pre-staging, staging, and an integration repository, each serving a different purpose relating to the consolidation and preprocessing of data from disparate source systems. Common technologies for staging databases are Oracle, IBM DB2, Microsoft SQL Server, and NCR Teradata. The DW/BI community is beginning to see increasing use of the open-source database MySQL for this architectural component.
Data is extracted from the staging database, transformed, and loaded into a presentation tier in the architecture that contains appropriate structures for optimized multidimensional and analytical queries. This system is designed to support the data slicing and dicing that define the power of a data warehouse. There are a variety of alternatives for the implementation of the presentation database, including normalized relational schemas and denormalized schemas like star, snowflake, and even "starflake." Moreover, the presentation tier may include a single enterprise data warehouse or a collection of subject-specific data marts. Some architectures include a hybrid of both of these. Presentation repositories are typically implemented in the same technologies as the integration database.
Finally, data is presented to the business users at the analysis tier in the architecture. This conceptual layer in the system represents the variety of applications and tools that provide users with access to the data, including report writers, ad hoc querying, online analytical processing (OLAP), data visualization, data mining, and statistical analysis. BI tool vendors such as Pentaho, Cognos, MicroStrategy, Business Objects, Microsoft, Oracle, IBM, and others produce commercial products that enable data from the presentation database to be aggregated and presented within user applications.
This is a generalized architecture, and actual implementations vary in the details. One major variation on the Kimball architecture is the Inmon architecture (Inmon 2005), which inserts a layer of subject-specific data marts that contain subsets of the data from the main warehouse. Each data mart supports only the specific end-user applications that are relevant to the business subject area for which that mart was designed. Regardless of your preferences for Kimball- versus Inmon-style architectures, and of the variations found in implementation detail, Figure 1.1 will serve as reference architecture for the discussions in this book. The Agile DW/BI principles and practices that are introduced here are not specific to any particular architecture.
Diverse and Disparate Technical Skills
Inherent in the implementation of this architecture are the following aspects of development, each requiring a unique set of development skills:
- Data modeling. Design and implementation of data models are required for both the integration and presentation repositories. Relational data models are distinctly different from dimensional data models, and each has unique properties. Moreover, relational data modelers may not have dimensional modeling expertise and vice versa.
- ETL development. ETL refers to the extraction of data from source systems into staging, the transformations necessary to recast source data for analysis, and the loading of transformed data into the presentation repository. ETL includes the selection criteria to extract data from source systems, performing any necessary data transformations or derivations needed, data quality audits, and cleansing.
- Data cleansing. Source data is typically not perfect. Furthermore, merging data from multiple sources can inject new data quality issues. Data hygiene is an important aspect of data warehouse that requires specific skills and techniques.
- OLAP design. Typically data warehouses support some variety of online analytical processing (HOLAP, MOLAP, or ROLAP). Each OLAP technique is different but requires special design skills to balance the reporting requirements against performance constraints.
- Application development. Users commonly require an application interface into the data warehouse that provides an easy-to-use front end combined with comprehensive analytical capabilities, and one that is tailored to the way the users work. This often requires some degree of custom programming or commercial application customization.
- Production automation. Data warehouses are generally designed for periodic automated updates when new and modified data is slurped into the warehouse so that users can view the most recent data available. These automated update processes must have built-in fail-over strategies and must ensure data consistency and correctness.
- General systems and database administration. Data warehouse developers must have many of the same skills held by the typical network administrator and database administrator. They must understand the implications of efficiently moving possibly large volumes of data across the network, and the issues of effectively storing changing data.