1.3 What Is a Master Data Management System?
Master Data Management Systems provide authoritative data to an organization. But what kind of data? How do we work with the MDM System? How do we integrate the MDM System with existing systems? These questions frame a solution space within which MDM Systems can be deployed and used in a wide variety of ways, according to the needs and goals of the enterprise.
In this section, we describe the three primary dimensions of this MDM solution space. As shown in Figure 1.6, the three dimensions are the domains of master data that are managed, the methods by which the system is to be used, and the styles of implementation that are needed for a particular deployment. It is important to note that MDM implementations are typically not deployed in a "big bang" approach where all domains are managed across all methods of use. Organizations generally start with a limited scope that provides the highest return on investment in a relatively short time frame. As MDM implementations are rolled out over several phases, the scope of the implementation may grow. Additional domains are added, the method of use may expand, or the implementation style may change to deliver additional business value. The term Multiform MDM is sometimes used to describe MDM Systems that support all three of these dimensions. The following sections describe these dimensions in greater detail.
Figure 1.6 Dimensions of Master Data Management.
1.3.1 Master Data Domains
Master Data Management has emerged over the last few years from the recognition that the existing markets of Customer Data Integration (CDI) and Product Information Management (PIM) had key similarities as well as differences. CDI focuses on managing people and organizations—which we will collectively call parties. A CDI system can aggregate party information from many preexisting systems, manage the use of the party data, and distribute the information out to downstream systems such as billing systems, campaign management systems, or CRM systems.
PIM systems manage the definition and lifecycle of a finished good or service—collecting product information from multiple sources, getting agreement on the definition of products, and then publishing this information to Web sites, marketing systems, merchandising systems, and so on. PIM systems are distinct from Product Lifecycle Management (PLM) systems, which focus on the design and development of products rather than the preparation of product information to support sales and distribution. There is a natural flow of information from a PLM system to a PIM system as a product transitions from engineering into marketing and sales.
CDI and PIM both represent a common pattern—that of aggregating data from existing systems, cleaning and augmenting that data, and then distributing that data to downstream systems. PIM and CDI systems differ in the most common ways in which the data is used after it has been loaded into the MDM System—we discuss the different methods of use in the following section. It is important to note that MDM Systems do more than just store and retrieve data—they incorporate business logic to reflect the proper management and handling of the master data. The rules for handling a product lifecycle are different from those for managing the lifecycle of a customer. The MDM System may also be configured to issue alerts when interesting things happen. For example, billing systems may need to be notified immediately when a customer address changes. This business logic can be customized for a particular deployment to reflect the needs of a particular industry as well as the unique characteristics of the implementing organization.
As CDI and PIM products have matured, it was also observed that while CDI systems focused on the customer, it was often convenient for such systems to include references to the products or accounts that a customer has. Similarly, PIM systems often need to store or reference the suppliers of the products or services. Supporting and using these cross-domain relationships has become a significant aspect of MDM Systems.
The kinds of information treated as master data vary from industry to industry and from one organization to another. An insurance company may wish to treat information about customers, policies, and accounts as master data, while a telecommunications company may be concerned with customers, accounts, location (of cell phone towers), and services. A manufacturer may be focused on managing suppliers, customers, distributors, and products. A government agency may want to focus only on citizens and non-citizens. In these examples, we see a lot of commonality as well as differences. In general, master data can be categorized according to the kinds of questions it addresses; three of the most common questions—"Who?," "What?," and "How?"—are addressed by the party, product, and account domains of master data. Each of these domains represents a class of things—for example, the party domain can represent any kind of person or organization, including customers, suppliers, employees, citizens, distributors, and organizations. Each of these kinds of party shares a common set of attributes—such as the name of the party, where it is located (a party may have multiple locations such as home, work, vacation home, etc.), how to contact it, what kind of relationship the organization has with the party, and so forth. Similarly, the product domain can represent all kinds of things that you sell or use—from tangible consumer goods to service products such as mortgages, telephone services, or insurance policies. The account domain describes how a party is related to a product or service that the organization offers. What are the relations of the parties to this account, and who owns the account? Which accounts are used for which products? What are the terms and conditions associated with the products and the accounts? And how are products bundled?
Location information is often associated with one of the other domains. When we talk about where a product is sold, where a customer lives, and the address at which an insurance policy is in effect, we are referring to location information. Location information is tied to a product, a party, or an account—it does not have an independent existence. There are, of course, cases where location does exist independently, but those situations seem to be less common. Another interesting facet of location is that it can be described in many different ways (by postal address, by latitude and longitude, by geopolitical boundaries)—we need a particular context in order to define what we mean. A location can be a sales territory, a city, a campus with many buildings, a store, or even a spot on a shelf in an aisle within a store. For these reasons, we will treat location as a subordinate domain of master data.
Figure 1.7 shows how the three primary domains of party, product, and account overlap. These areas of overlap are particularly interesting, because they indicate fundamental relationships between the domains. For example, when we define a product, we often need to specify the party that supplies that product and the location(s) in which the product may be sold. Explicitly capturing these relationships within the same environment allows us to address business questions that may be otherwise difficult to resolve. Building on the previous example, if we record the party that supplies a product as well as the parties that we sell products to, then we can determine which of our suppliers are also our customers. Understanding the full set of linkages that an organization has with a partner can be valuable in all aspects of working with that partner—from establishing mutually beneficial agreements to ensuring an appropriate level of support. Indeed, perhaps the key benefit of supporting multiple domains of master data within the same system is that it clarifies these cross-domain relationships.
Figure 1.7 Domains of Master Data.
Master data domains can be made specific to a particular industry through the application of industry standards or widely accepted industry models.3 Typically, standards and models can be used to drive not just the definition of the data model within an MDM Solution but the services that work with the master data as well. In particular, use of standards and models aligns the services exposed by an MDM Solution with accepted industry-specific definitions, which reduces the cost of integration.
Gaining agreement on the definition of an MDM domain can be challenging when different stakeholders within an organization have different requirements or look at the same requirements from different points of view. If well-accepted industry models or standards exist, they can serve as a foundation for further customization, eliminating the need to laboriously gain agreement on every term or service definition. Table 1.1 provides a list of some of the standards and models that are available within a range of industries. Some of these standards and models could be used to guide the definition of data structures and access services for MDM domains.4
Table 1.1. Some Industry Standards and Models
| Industry | Standard or Industry Model | Web Resource |
| --- | --- | --- |
| Banking | IBM Information FrameWork (IFW) | |
| Banking | Interactive Financial eXchange (IFX) | |
| Insurance | IBM Insurance Application Architecture (IAA) | |
| Insurance | Association for Cooperative Operations Research and Development (ACORD) | |
| Telecoms | Shared Information/Data Model (SID) | |
| Telecoms | IBM Telecommunications Data Warehouse | http://www-306.ibm.com/software/data/ips/products/industrymodels/telecomm.html |
| Retail | Association for Retail Technology Standards (ARTS) | |
| Retail | IBM Retail Data Warehouse | http://www-306.ibm.com/software/data/ips/products/industrymodels/retail.html |
| Healthcare | Health Level 7 (HL7) | |
In summary, an MDM System supports one or more domains of master data. The domains provided are often industry-neutral but can be subsequently tailored (and/or mapped) to different industry standards or models. The domain definitions can be further customized during the design and implementation of an MDM Solution for a specific environment.
1.3.2 Methods of Use
As we look at the roles that master data plays within an organization, we find three key methods or patterns of use: Collaborative Authoring, Operational, and Analytical, shown in Figure 1.8. The simplest way to think about these methods of use is to consider who will be the primary consumers of the master data. Under the Collaborative Authoring5 pattern, the MDM System coordinates a group of users and systems in order to reach agreement on a set of master data. Under the Operational pattern, the MDM System participates in the operational transactions and business processes of the enterprise, interacting with other application systems and people. Finally, under the Analytical pattern, the MDM System is a source of authoritative information for downstream analytical systems, and sometimes is a source of insight itself.
Figure 1.8 Multiple MDM domains and multiple methods of use.
A particular element of master data such as a product or an account may be initially authored using a collaborative style, managed operationally through the operational style, and then published to other operational and analytical systems. Because MDM Systems may be optimized to one or more of the methods of use, more than one MDM System may be needed to support the full breadth of usage. Where multiple MDM Systems are used to support multiple usage patterns, careful attention to the integration, management, and governance of the combined system is required to ensure that the master data of the combined system is consistent and authoritative.
It is important to note that the style of usage is completely independent from the domain of information managed. Although Product Information Management systems are often associated with a Collaborative Authoring style of use, and Customer Data Integration systems are often associated with an Operational usage style, this alignment is not necessary or exclusive. There are an increasing number of cases where organizations seek an operational usage of product information as well as a range of use cases for collaborative authoring of customer information.
1.3.2.1 Collaborative MDM
Collaborative MDM deals with the processes supporting collaborative authoring of master data, including the creation, definition, augmentation, and approval of master data. Collaborative MDM is about achieving agreement on a complex topic among a group of people. The process of getting to agreement is often encapsulated in a workflow that may incorporate both automated and manual tasks, both of which are supported by collaborative capabilities. Information about the master data being processed is passed from task to task within the workflow and is governed throughout its lifecycle.
As a consequence of the complexity of product development and management, PIM systems commonly support a collaborative style of usage. Perhaps the most common process implemented by PIM systems is New Product Introduction (NPI)—the process for introducing a new product to the market. An in-depth discussion of NPI can be found in Chapter 6. A typical NPI process is shown in Figure 1.9.
Figure 1.9 Simplified New Product Introduction process.
Here we can see that information about new products (or items) is received from one or more external sources and then incrementally extended, augmented, validated, and approved by a number of different end users with different user roles and responsibilities.6
The collaborative steps within a New Product Introduction process are used to define the kinds of properties that describe the product. A given product will be described by dozens, and often hundreds, of properties depending on how the product is classified and where it is sold. In the New Product Introduction process, product specialists, buyers, and other stakeholders describe all of the characteristics of the product that are necessary to bring it to market. These characteristics may include product specifications, marketing information, ingredients, safety information, recycling information, cost, and so on. Large retailers may have more than a million products that they sell, spanning categories from food to clothing to furniture to appliances. The kinds of properties that are relevant to a product depend on the kind of product it is. For clothing, examples include color, size, and material; for electronic appliances, examples might be specifications, color, warranty, and so on. The Collaborative MDM System helps users to capture all of the different relevant properties of the product, validate the properties, categorize the product, and coordinate the approval of the product. As buyers and product specialists come up with new ways to describe products, new properties are created to hold these new descriptions. In retail environments, the structure of the product information is constantly evolving.
Collaboration is a common pattern and can be found beyond the PIM domain. Indeed, we find that many of the tasks performed by a product specialist in the PIM environment are also performed in the management of Customer and Account information. A key role that spans all domains of master data is that of data steward. A data steward looks after the quality and management of the data. For example, when we believe that two or more party records in a data store may really refer to the same individual, data stewards may need to manually combine information from the party records together and then validate the proposed changes with supervisors. Similarly, where questions requiring human intervention arise about the accuracy of information, a request for attention may be made visible to all data stewards who are capable of handling the issue, which can result in a collaborative pattern to resolve data quality issues.
The Collaborative style of usage requires a core set of capabilities within the MDM environment. A combination of workflow, task management, and state management is needed to guide and coordinate the collaborative tasks and the master data being collaborated on. Workflow controls the execution of a sequence of tasks by people and automated processes. Task management prioritizes and displays pending work for individuals to perform, while state management helps us to model and then enforce the lifecycle of the master data.
Because many concurrent users and workflows may be executing in parallel, the integrity of the master data needs to be protected with a check-in/check-out or similar locking technique. To improve efficiency, master data records are often processed in batches within the same workflow, which results in the concept of a "workbasket" of master data records that is passed from task to task within the workflow. Tasks within a workflow may be automated actions (such as import, export, or data validation) or manual tasks that allow users to work directly with the master data. Typically, this workflow will involve business users and data stewards, a process that, in turn, has implications for the design of the UIs (user interfaces) for collaborative authoring of master data. User interfaces must be both efficient and comfortable to use, and must rely on a set of underlying services that create, query, update, and delete the master data itself, the relationships between the master data, and other related information, such as lookup tables. Tooling to support the flexible creation and customization of collaborative workflows and even user screens may also be provided.
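To make the interplay of check-in/check-out locking, lifecycle (state) management, and workbaskets more concrete, the following sketch is a minimal illustration in Python. The class names, lifecycle states, and user roles are assumptions made for this example only; they are not taken from any particular MDM product.

```python
from enum import Enum, auto

class LifecycleState(Enum):
    DRAFT = auto()
    IN_REVIEW = auto()
    APPROVED = auto()

# Legal lifecycle transitions enforced by the (hypothetical) state manager.
TRANSITIONS = {
    LifecycleState.DRAFT: {LifecycleState.IN_REVIEW},
    LifecycleState.IN_REVIEW: {LifecycleState.APPROVED, LifecycleState.DRAFT},
    LifecycleState.APPROVED: set(),
}

class MasterRecord:
    def __init__(self, record_id, attributes):
        self.record_id = record_id
        self.attributes = dict(attributes)
        self.state = LifecycleState.DRAFT
        self.checked_out_by = None   # check-in/check-out lock

    def check_out(self, user):
        if self.checked_out_by is not None:
            raise RuntimeError(f"{self.record_id} already checked out by {self.checked_out_by}")
        self.checked_out_by = user

    def update(self, user, **changes):
        if self.checked_out_by != user:
            raise RuntimeError("record must be checked out by the updating user")
        self.attributes.update(changes)

    def check_in(self, user, new_state):
        if self.checked_out_by != user:
            raise RuntimeError("record must be checked out by the updating user")
        if new_state not in TRANSITIONS[self.state]:
            raise RuntimeError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.checked_out_by = None

# A "workbasket" is simply a batch of records routed through the same workflow task.
workbasket = [MasterRecord("SKU-001", {"color": "red"}),
              MasterRecord("SKU-002", {"color": "blue"})]

for record in workbasket:
    record.check_out("product_specialist")
    record.update("product_specialist", size="M")
    record.check_in("product_specialist", LifecycleState.IN_REVIEW)
```

In a real workflow engine, the task routing and user interfaces would sit on top of services like these; the sketch only shows the locking and lifecycle enforcement that protect concurrent collaborative work.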
Finally, a common set of services are typically also provided to enforce security and privacy, and to support administration, validation, and import/export of master data. These services are needed across all kinds of MDM Systems.
1.3.2.2 Operational MDM
In the Operational style of MDM, the MDM server acts as an Online-Transaction Processing (OLTP) system that responds to requests from multiple applications and users. Operational MDM focuses on providing stateless services in a high-performance environment. These stateless services can be invoked from an enterprise business process or directly from a business application or user interface. Operational MDM services are often designed to fit within a Service-Oriented Architecture as well as in traditional environments. Integration of an Operational MDM System with existing systems calls for the support of a wide variety of communications styles and protocols, including synchronous and asynchronous styles, global transactions, and one-way communications.
A good example of Operational MDM usage is a New Account Opening business process. In this process, a person or organization wants to open a new account—perhaps a bank account, a cable TV account, or any other kind of account. As shown in Figure 1.10, MDM services are invoked to check what information about the customer is already known and to determine if product policy is being complied with before an offer of a new account is made. If the customer isn't already known, then the new customer is added to the MDM System and a new account is created (presuming that the new customer meets the appropriate requirements). Each of the tasks within this workflow is implemented by a service, and many of these services are implemented by an Operational MDM System.
Figure 1.10 Example New Account Opening process.
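As a rough illustration of how such a process might invoke operational MDM services, the sketch below chains the lookup, policy check, party creation, and account creation steps. The function names (find_party, add_party, check_product_policy, add_account) and data structures are hypothetical stand-ins, not the actual service interface of any MDM product.

```python
# Hypothetical, simplified orchestration of a New Account Opening process.
# Each function stands in for a stateless MDM or business service.

PARTIES = {"TAX-123": {"party_id": "P-1", "name": "Jane Doe"}}
ACCOUNTS = []

def find_party(tax_id):
    """Check whether the customer is already known to the MDM System."""
    return PARTIES.get(tax_id)

def add_party(tax_id, name):
    party = {"party_id": f"P-{len(PARTIES) + 1}", "name": name}
    PARTIES[tax_id] = party
    return party

def check_product_policy(party, product_code):
    """Stand-in for a policy/eligibility check before an offer is made."""
    return product_code in {"CHECKING", "SAVINGS"}

def add_account(party, product_code):
    account = {"account_id": f"A-{len(ACCOUNTS) + 1}",
               "owner": party["party_id"],
               "product": product_code}
    ACCOUNTS.append(account)
    return account

def open_new_account(tax_id, name, product_code):
    party = find_party(tax_id) or add_party(tax_id, name)
    if not check_product_policy(party, product_code):
        raise ValueError("product policy not satisfied; no offer made")
    return add_account(party, product_code)

print(open_new_account("TAX-456", "John Smith", "CHECKING"))
```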
Operational MDM is also commonly used in the PIM domain. For retailers, after products have been defined, the approved product information may be published to an operational MDM System that then serves as a hub of MDM information that interacts with merchandising, distribution, or e-commerce applications. As such applications become more open and able to interact within an SOA environment, the need for such an operational MDM hub increases.
A wide range of capabilities is required for the Operational usage style. There can be hundreds of services that provide access and management of MDM data. Specific sets of services for each kind of MDM object managed provide for creation, reading, updating, and deletion of the MDM objects. Services are also provided to relate, group, and organize MDM objects. As with the Collaborative style of MDM, services are also needed for cleansing and validation of the data, for detection and processing of duplicates, and for managing the security and privacy of the information.
1.3.2.3 Analytical MDM
Analytical MDM is about the intersection between Business Intelligence (BI) and MDM. BI is a broad field that includes business reporting, data warehouses, data marts, data mining, scoring, and many other fields. To be useful, all forms of BI require meaningful, trusted data. Increasingly, analytical systems are also transitioning from purely decision support to more operational involvement. As BI systems have begun to take on this broader role, the relationship between MDM Systems and Analytical systems has also begun to change.
There are three primary intersections between MDM and BI.
- MDM as a trusted data source: A key role of an MDM System is to be a provider of clean and consistent data to BI systems.
- Analytics on MDM data: MDM Systems themselves may integrate reporting and analytics in support of providing insight over the data managed within the MDM System.
- Analytics as a key function of an MDM System: Specialized kinds of analytics, such as identity resolution, may be a key feature of some MDM Systems.
One of the common drivers for clean and consistent master data is the need to improve the quality of decision making. Using an MDM System to feed downstream BI systems is an important and common pattern. The data that drives a BI system must be of a high quality if the results of the analytical processing are to be trusted. For this reason, MDM Systems are often a key source of information to data warehouses, data marts, Online Analytical Processing (OLAP) cubes, and other BI structures. The common data models for data warehouses use what are called star schemas or snowflake schemas to represent the relationship between the facts to be analyzed and the dimensions by which the analysis is done.7 For example, a business analyst in a retail environment would be interested in understanding the number or value of sales by product or perhaps by manufacturer. Here, the sales transaction data is stored in fact tables. Product and manufacturer represent dimensions of the analysis. We can observe that master data domains often align with dimensions within an analytical environment, which makes the MDM System a natural source of data for BI systems.
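The following is a minimal sketch of that alignment, using an invented product dimension: master records from the MDM System are flattened into a dimension table keyed by a surrogate key, while sales transactions populate the fact table that references it. The record layouts are illustrative assumptions only.

```python
# Illustrative only: master data records become a dimension table of a star schema,
# transaction data becomes the fact table.

master_products = [
    {"product_id": "SKU-001", "name": "Espresso Machine", "manufacturer": "Acme"},
    {"product_id": "SKU-002", "name": "Coffee Grinder", "manufacturer": "Brewster"},
]

sales_transactions = [
    {"product_id": "SKU-001", "amount": 299.00},
    {"product_id": "SKU-001", "amount": 279.00},
    {"product_id": "SKU-002", "amount": 89.00},
]

# Build the product dimension with surrogate keys.
product_dim = {}
for surrogate_key, product in enumerate(master_products, start=1):
    product_dim[product["product_id"]] = {"product_key": surrogate_key, **product}

# Build the fact table, replacing natural keys with dimension keys.
sales_fact = [{"product_key": product_dim[t["product_id"]]["product_key"],
               "amount": t["amount"]} for t in sales_transactions]

# Sales by manufacturer: the analysis is done "by" the dimension attributes.
totals = {}
for fact in sales_fact:
    dim_row = next(d for d in product_dim.values() if d["product_key"] == fact["product_key"])
    totals[dim_row["manufacturer"]] = totals.get(dim_row["manufacturer"], 0) + fact["amount"]
print(totals)  # {'Acme': 578.0, 'Brewster': 89.0}
```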
The insight gained from a data warehouse or OLAP cube may also be fed back into the MDM System. For example, in the travel and entertainment industry, some companies build analytical models that can project the likely net lifetime revenue potential of a customer. To build these projections, they will source the master data from an MDM System and transactional details from other systems. After the revenue potential is computed, the MDM System is updated to reflect this information, which may now be used as part of each customer's profile. Reservation systems can then use this profile to tailor offers specifically to each customer.8
Insight may also be derived from data maintained by the MDM System itself. An MDM System contains all of the information needed to report on key performance indicators such as the number of new customers per week, the number of new accounts per day, or the average time to introduce a new product. Reporting and dashboarding tools can operate directly over the master data to provide these kinds of domain-specific insights. Some MDM Systems also incorporate a combination of rules and event subsystems that allow interesting events to be detected and actions to be taken based on these events as they happen. For example, if a customer changes addresses five times in three months, that may trigger an alert that notifies event subscribers to contact the customer to validate his or her address on a periodic basis. Analytics may also be executed as an MDM transaction is taking place, using architected integration points that allow external functions to be invoked as part of an MDM service. A good example is the use of scoring functions to predict the likelihood of a customer canceling accounts at an institution. Such scoring functions can be developed by gaining a deep understanding of an issue, such as customer retention, through data mining and building a model of recurring customer retention patterns based on the combination of customer and transaction data maintained within a data warehouse. While it is time-consuming to develop and validate such a model, the scoring model that results can be efficiently executed as part of an MDM service. This kind of analytics is called in-line analytics or operational analytics and is an important new way in which MDM Systems can work together with BI systems to provide additional value to an enterprise.
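The sketch below shows one way such an in-line scoring hook and event subscription could be wired into an MDM update service. The churn-score formula, the five-changes-in-three-months trigger condition (simplified here to a count of changes), and the service names are invented for illustration, not taken from any product.

```python
# Illustrative in-line (operational) analytics: a pre-built scoring model
# is invoked as part of an MDM update service, and an event is raised
# when an interesting condition is detected.

def churn_score(customer, recent_address_changes):
    """Hypothetical scoring model produced offline via data mining."""
    score = 0.1
    if recent_address_changes >= 5:
        score += 0.4          # frequent moves correlate with churn in this toy model
    if customer.get("accounts", 0) <= 1:
        score += 0.2
    return min(score, 1.0)

def update_customer_address(customer, new_address, event_subscribers):
    customer.setdefault("address_history", []).append(new_address)
    customer["address"] = new_address

    changes = len(customer["address_history"])
    customer["churn_score"] = churn_score(customer, changes)

    # Event subsystem: notify subscribers when an interesting event is detected.
    if changes >= 5:
        for notify in event_subscribers:
            notify(customer)

def contact_center_alert(customer):
    print(f"Verify address for {customer['name']} (churn score {customer['churn_score']:.2f})")

cust = {"name": "Jane Doe", "accounts": 1}
for addr in ["A St", "B St", "C St", "D St", "E St"]:
    update_customer_address(cust, addr, [contact_center_alert])
```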
The final kind of MDM analytics is where the MDM System provides some key analytic capabilities. One particular kind of insight that can be derived from the information within an MDM System is the discovery of both obvious and non-obvious relationships between the master entities managed. An obvious kind of relationship would be one that discovered households based on a set of rules around names, addresses, and other common information. A non-obvious kind of relationship might find relationships between people or organizations by looking for shared fragments of information, such as a common phone number, in an effort to determine that people may be roommates. Searching for non-obvious relationships may also require rules that look for combinations of potentially obfuscated information—for example, transposed Social Security numbers and phone numbers—to identify potential relationships where people may be trying to hide their identities. Identity resolution and relationship discovery are important for both looking for questionable dealings9 and understanding a social network that a person is part of—and therefore are important for predicting the overall value of a person's influence.
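A toy sketch of non-obvious relationship discovery follows: parties that share a phone number (compared after simple normalization) are linked as candidate relationships for a data steward or analyst to review. Real identity-resolution engines use far richer rules and probabilistic matching; this deliberately simplistic rule only illustrates the "shared fragment of information" idea.

```python
from collections import defaultdict
from itertools import combinations

parties = [
    {"party_id": "P-1", "name": "A. Jones",  "phone": "555-0100"},
    {"party_id": "P-2", "name": "B. Smith",  "phone": "(555) 0100"},
    {"party_id": "P-3", "name": "C. Nguyen", "phone": "555-0199"},
]

def normalize_phone(phone):
    return "".join(ch for ch in phone if ch.isdigit())

# Index parties by a shared fragment of information (the normalized phone).
by_phone = defaultdict(list)
for party in parties:
    by_phone[normalize_phone(party["phone"])].append(party["party_id"])

# Any two parties sharing the same fragment become a candidate relationship.
candidate_relationships = []
for phone, ids in by_phone.items():
    for a, b in combinations(ids, 2):
        candidate_relationships.append((a, b, "shared phone"))

print(candidate_relationships)  # [('P-1', 'P-2', 'shared phone')]
```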
The analytical style of usage encompasses a variety of capabilities. Populating external analytical environments such as data warehouses with data from an MDM System requires information integration tools to efficiently transfer and transform information from the MDM System into the star or snowflake schemas needed by the data warehouse. Integration with reporting tools is required in order to display key performance indicators and how they change over time. Rules, scoring, and event management are important capabilities for in-line analytics within the MDM environment.
In practice, MDM usage will often cross the boundaries between collaborative, operational, and analytical usage. For example, collaborative MDM processes can be very useful in managing the augmentation of complex operational structures such as organizational hierarchies. On the other hand, there is valuable analytical information that can be gathered around the nature of the collaborative processes. An MDM implementation may start with the usage style that is most important to achieving the organization's business needs and then later extend the environment to incorporate additional styles to meet further requirements.
1.3.3 System of Record vs. System of Reference
The goal of an MDM System is to provide authoritative master data to an enterprise. Ideally, a single copy of key master data would be managed by an MDM System—all applications that needed master data would be serviced by this system, and all updates to master data would be made through the MDM System. The master data in the ideal MDM implementation can be considered a system of record. That is, the data managed by the MDM System is the best source of truth. If applications want to be sure that they are getting the most current and highest-quality information, then they consult this source of truth. Achieving this ideal MDM System can be difficult, at best, due to several confounding factors, such as:
- The complexity and investment in the existing IT environment
- Master data locked into packaged applications
- Requirements for performance, availability, and scalability in a complex and geographically distributed world
- Legal constraints that limit the movement of data across geopolitical boundaries
All of these factors contribute to the need for copies of master data—sometimes partial subsets, sometimes completely redundant replicas. These copies can be well-managed, integrated, and synchronized extensions of an MDM System. When the replica of the master data is known to be synchronized with the system of record—in a managed way that maintains the quality and integrity of the data within both the replica and the system of record—we can call this copy a system of reference. Although it is synchronized with the system of record, it may not always be completely current. Changes to the system of record are often batched together and then applied to the systems of reference on a periodic basis. In some cases, the copy may represent a special-purpose MDM implementation that has been specifically tuned to the needs of a particular style of usage. A system of reference is a source of authoritative master data because it is managed and synchronized with the system of record. It can therefore be used as a trusted source of master data by other applications and other systems.
An MDM system of reference is best used as a read-only source of information, with all of the updates going through the system of record. Figure 1.11 illustrates a simple environment where the system of record aggregates master information from multiple sources, is responsible for cleansing and managing the data, and then provides this data to both managed systems of reference as well as directly to other consumers of the managed information. We use the terms Managed and Unmanaged to define the scope of the overall MDM environment. In a managed environment, each source should only feed one system, and each consumer should receive information from only one system. Within the managed environment, we can track the movement of information between systems and audit the transactions that use the system.
Figure 1.11 System of Record vs. System of Reference.
It is important to note that the system of record may, in fact, be made up of multiple physical subsystems—but this arrangement should be transparent to all of the other systems and people that interact with it. For example, if legal requirements in a country dictate that personal information may not leave the country, then different MDM Systems may be required. In this situation, the different MDM Systems can be logically brought together through the technique of federation.10 The fact that there is more than one MDM System can be hidden from the consumers. Consumers issue requests to the MDM System as a whole and receive a response without having to know (or care) which particular system responded.
1.3.4 Consistency of Data
When data is replicated in an environment, questions of consistency among the replicas immediately arise. For example, as we discussed in the previous section, a system of reference may not always be completely consistent with a system of record. It is useful to think about two basic approaches towards consistency. The first we can call Absolute Consistency, and the second we can call Convergent Consistency. In a distributed system with absolute consistency, information will be identical among all replicas at all times that the systems are available (for simplicity, let's ignore the case where a system is recovering after a failure). In a distributed environment, we commonly achieve absolute consistency by following a two-phase commit transaction protocol that is provided in most distributed databases, messaging systems, and transaction systems. The basic idea is that we can define a unit of work containing several actions that must either all complete or all fail. We can use this approach to write applications that update multiple databases and guarantee that either all of the updates worked or they all failed. We can also use this approach to update master data records in multiple repositories simultaneously; with the two-phase commit transaction protocol, either all of these databases will be updated or none of them will be. In either case, all of the databases contain the same information and are therefore absolutely consistent. Two-phase commit can be costly from a performance, complexity, and availability point of view. For example, if we have three systems that all normally participate in a transaction, all three of the systems must be available for the transaction to complete successfully—in other words, if any one of the systems is not available, then none of the systems can be updated. So while two-phase commit is an important and widely used technique, it is not always the right approach. When we balance the needs for performance, availability, and consistency, we find that there are a range of options for each. Providing absolute consistency can decrease performance and availability. There are many excellent books devoted to transaction processing that explore this topic further—see [3] as an example.
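The miniature sketch below mimics the two-phase commit idea: a coordinator first asks every participant to prepare and vote, and only if all vote yes does it instruct them to commit; otherwise everything is rolled back. Real transaction managers add durable logging and failure recovery, which are omitted here, and the class and method names are assumptions for this example.

```python
class Participant:
    """Stand-in for a database or messaging resource in a distributed transaction."""
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.staged = None
        self.committed = False

    def prepare(self, update):
        # Phase 1: validate and durably stage the update, then vote.
        self.staged = update
        return self.will_succeed

    def commit(self):
        # Phase 2a: make the staged update permanent.
        self.committed = True

    def rollback(self):
        # Phase 2b: discard the staged update.
        self.staged = None

def two_phase_commit(participants, update):
    votes = [p.prepare(update) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return "committed everywhere"
    for p in participants:
        p.rollback()
    return "rolled back everywhere"

replicas = [Participant("MDM hub"), Participant("CRM copy"), Participant("Billing copy")]
print(two_phase_commit(replicas, {"customer": "C-42", "address": "1 Main St"}))
```

Note how the availability cost shows up directly: if any participant cannot vote yes (for example, because it is down), the whole unit of work is rolled back.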
Convergent consistency is an alternative way to think about providing consistency across systems. The basic idea of convergent consistency is that if we have a distributed set of systems that we want to keep synchronized, whenever we apply an update to one system, that update gets forwarded to all of the other systems. There are a variety of ways to do this—we could do this as each change occurs (which can result in a lot of communications traffic), or we could accumulate a set of changes and process them a batch at a time. Passing along the changes as they happen allows the receiving systems to be only a few updates behind the system that was directly updated—but it can be costly in terms of resources. Processing a batch at a time means that the changes will be delayed in getting to the other systems, but fewer resources will be consumed. With either approach, if new updates stop arriving, all of the systems will eventually have the same data. That is, the information in the different systems converges, and all of the systems become consistent with one another. The benefit of following a convergent consistency strategy is that the systems can operate independently of each other so that processing of forwarded updates can happen at the convenience of the recipient. This fact means that we can achieve higher availability and potentially higher performance at the expense of consistency. The lag-time for changes to propagate across all of the systems can be tuned by increasing or decreasing the rate at which the changes are forwarded. One other consideration is that if new updates are being applied to multiple systems concurrently, significant care must be taken to prevent anomalous behaviors such as update conflicts.
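A small sketch of the batched-forwarding variant of convergent consistency follows: updates applied to one system are queued and later applied to the replicas, so the replicas lag but eventually converge once the batch is processed. Conflict handling between concurrently updated systems is not shown, and the structure is an illustrative assumption rather than a product design.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, change):
        key, value = change
        self.data[key] = value

primary = Replica("MDM hub")
replicas = [Replica("warehouse copy"), Replica("regional copy")]
pending_changes = []   # changes accumulated between synchronization runs

def update_primary(key, value):
    primary.apply((key, value))
    pending_changes.append((key, value))

def propagate_batch():
    """Forward accumulated changes to every replica, then clear the queue."""
    for change in pending_changes:
        for replica in replicas:
            replica.apply(change)
    pending_changes.clear()

update_primary("C-42.address", "1 Main St")
update_primary("C-42.phone", "555-0100")
# Replicas lag until the batch is processed...
propagate_batch()
assert all(r.data == primary.data for r in replicas)   # ...then converge
```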
Absolute and convergent consistency are both important strategies for managing replicated data across multiple systems. Absolute consistency is not always technically possible or pragmatic. Many systems do not expose interfaces that support two-phase commit. Convergent consistency can be quite pragmatic and can yield better performance and availability, but it also has its share of complexity. System architects implementing MDM Systems need to be well versed in these techniques to properly select the right combinations of techniques that will balance the requirements and constraints dictated by an implementation.
1.3.5 MDM Implementation Styles
MDM Systems are implemented to improve the quality of master data and to provide consistent, managed use of this information in what is often a complex and somewhat tangled environment. There are a variety of ways to support these requirements in ways that accommodate a range of methods of use (as described earlier) and implementation requirements. Implementation requirements can dictate:
- If the MDM System is to be used as a system of record or a system of reference
- If the system is to support operational environments, decision support environments, or both
- If it is important for the MDM System to push clean data back into existing systems
- If the system is to be part of an SOA fabric
- If geographic distribution is required
Different combinations of implementation and usage requirements have led to the evolution of a number of MDM implementation styles. Hybrid implementations that combine multiple implementation styles are common. Because some styles are simpler than others, organizations may start with a simpler implementation style that addresses the most urgent business needs and then subsequently address additional business needs by extending the implementation to enable additional styles.
In this section, we introduce four common implementation styles:11
- Consolidation Implementation Style
- Registry Implementation Style
- Coexistence Implementation Style
- Transactional Hub Implementation Style
As the styles progress from Consolidation Implementation Style to Transactional Hub Implementation Style, they provide increasing functionality and also tend to require more sophisticated deployments.
1.3.5.1 Consolidation Implementation Style
The consolidation implementation style brings together master data from a variety of existing systems, both databases and application systems, into a single managed MDM hub. Along the way, the data is transformed, cleansed, matched, and integrated in order to provide a complete golden record for one or more master data domains. This golden record serves as a trusted source to downstream systems for reporting and analytics, or as a system of reference to other operational applications. Changes to the data primarily come in from the systems that feed it; this is a read-only system. Figure 1.12 illustrates the basic consolidation style, with reads and writes going directly against the existing systems and the MDM System (in the middle) receiving updates from these existing systems. The integrated and cleansed information is then distributed to downstream systems (such as data warehouses) that use, but don't update, the master data.
Figure 1.12 Consolidation Implementation Style.
There is a strong similarity between the consolidation implementation style and an operational data store (ODS). An ODS is also an aggregation point and staging area for analytical systems such as data warehouses—see [1,4] for more details. The distinction between them lies in the set of platform capabilities that an MDM System offers, which go beyond the storage and management of data that an ODS provides. An operational data store is a database that is used in a particular way for a particular purpose, while an MDM System provides access, governance, and stewardship services to retrieve and manage the master data and to support data stewards as they investigate and resolve potential data quality issues.
Implementing the consolidation style is a natural early phase in the multiphase roll-out of an MDM System. A consolidation style MDM system serves as a valuable resource for analytical applications and at the same time provides a foundation for the coexistence and transactional hub implementation styles.
The drawbacks of the consolidation style mirror its advantages. Because it is fed by upstream systems, it does not always contain the most current information. If batch imports are performed only once a day, then the currency requirements for a decision support system would likely be met—but those for a downstream operational system may not be. Because the consolidation style represents a read-only system, all of the information about a master data object must already be present in the systems that feed the MDM System. Thus, if additional information needs to be collected to address new business needs, one or more of the existing source applications need to be changed as well as the MDM System—this lack of flexibility is addressed by the coexistence and transactional hub implementation styles.
1.3.5.2 Registry Implementation Style
The registry implementation style (as shown in Figure 1.13) can be useful for providing a read-only source of master data as a reference to downstream systems with a minimum of data redundancy. In the figure, the two outside systems are existing sources of master data. The MDM System in the middle holds the minimum amount of information required to uniquely identify a master data record; it also provides cross-references to detailed information that is managed within other systems and databases. The registry is able to clean and match just this identifying information and assumes that the source systems are able to adequately manage the quality of their own data. A registry style of MDM implementation serves as a read-only system of reference to other applications.
Figure 1.13 Registry Implementation Style.
Queries against the registry style MDM System dynamically assemble the required information in two steps. First, the identifying information is looked up within the MDM System. Then, using that identity and the cross-reference information, relevant pieces of information are retrieved from other source systems. Figure 1.14 shows a simple example where the MDM System holds enough master data to uniquely identify a customer (in this case, the Name, TaxID, and Primary address information) and then provides cross-references to additional customer information stored in System A and System B. When a service request for customer information is received (getCustInfo()), the MDM System looks up the information that it keeps locally, as well as the cross-references, to return the additional information from Systems A and B. The MDM System brings together the information desired as it is needed—through federation. Federation can be done at the database layer or by dynamically invoking services to retrieve the needed data in each of the source systems.
Figure 1.14 MDM Registry Federation.
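The following is a schematic rendering of the two-step lookup described above: the registry holds only the identifying data and cross-references, and getCustInfo assembles the rest from Systems A and B at request time. The data layout and function name mirror the figure, but the field names and values are otherwise invented for illustration.

```python
# Registry-style MDM: only identifying data plus cross-references are held locally.
registry = {
    "CUST-1": {
        "name": "Jane Doe",
        "tax_id": "TAX-123",
        "primary_address": "1 Main St",
        "xrefs": {"system_a": "A-77", "system_b": "B-9001"},
    }
}

# Existing source systems remain the owners of the detailed data.
system_a = {"A-77": {"privacy_preferences": {"email_opt_in": False}}}
system_b = {"B-9001": {"credit_limit": 5000}}

def get_cust_info(customer_id):
    """Dynamically assemble the customer view through federation."""
    # Step 1: look up the identifying information and cross-references locally.
    identifying = registry[customer_id]
    xrefs = identifying["xrefs"]
    federated = {
        "name": identifying["name"],
        "tax_id": identifying["tax_id"],
        "primary_address": identifying["primary_address"],
    }
    # Step 2: fetch the remaining details from the source systems as needed.
    federated.update(system_a[xrefs["system_a"]])
    federated.update(system_b[xrefs["system_b"]])
    return federated

print(get_cust_info("CUST-1"))
```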
Federation has several advantages. Because the majority of information remains in the source systems and is fetched when needed, the information returned is always current. This style of MDM System is therefore suitable to meet transactional inquiry needs in an operational environment. The registry implementation style can also be useful in complex organizational environments where one group may not be able to provide all of its data to another. The registry style can be relatively quick to implement, because responsibility for most of the data remains within the source systems.
There is, however, a corresponding set of issues with this implementation style. One fundamental issue is that a registry implementation is not useful in remediating quality issues that go beyond basic identity. A registry implementation can only manage the quality of the data that it holds—so while it can match and cleanse the core identifying data, it cannot, in itself, provide a completely standardized and cleansed view of the master data. Because the complexities of updating federated information lead most registry style implementations to be read-only, the cleansed identifying information is not typically sent back to the source systems. If the data in the source systems is clean, the composite view served by the MDM System will also be clean. Thus, a registry implementation can act as an authoritative source of master data for the key identifying information that it maintains.
A registry implementation style is also more sensitive to the availability and performance of the existing systems. If one of the source systems slows down or fails, the MDM System will be directly affected. Similarly, the registry style also requires strong governance practices between the MDM System and the source systems, because a unilateral change in a source system could immediately cause problems for users of the MDM System. For example, in the scenario shown in Figure 1.14, suppose a change is made in the structure of the Privacy Preferences information in System A. If this change occurs without making corresponding changes in the MDM System, then a request such as getCustInfo() will likely cause the MDM system to fail with an internal error because of the assumptions it makes about the structure of the data it federates.
1.3.5.3 Coexistence Implementation Style
The coexistence style of MDM implementation involves master data that may be authored and stored in numerous locations and that includes a physically instantiated golden record in the MDM System that is synchronized with source systems. The golden record is constructed in the same manner as the consolidation style, typically through batch imports, and can be both queried and updated within the MDM System. Updates to the master data can be fed back to source systems as well as published to downstream systems. In a coexistence style, the MDM System can interact with other applications or users, as shown in Figure 1.15.
Figure 1.15 Coexistence Implementation Style.
An MDM System implemented in the coexistence style is not a system of record, because it is not the single place where master data is authored and updated. It is a key participant in a loosely distributed environment that can serve as an authoritative source of master data to other applications and systems. Because the master data is physically instantiated within the system, the quality of the data can be managed as the data is imported into the system. If the MDM System does a bidirectional synchronization with source systems, care must be taken to avoid update cycles where changes from one system conflict with changes from another—such conflicts can be handled through a combination of automated and manual conflict detection and resolution.
The advantage of the coexistence style is that it can provide a full set of MDM capabilities without causing significant change in the existing environment. The disadvantage is that because it is not the only place where master data may be authored or changed, it is not always up to date. As with the consolidation style, the coexistence style is an excellent system of reference but is not a system of record.
1.3.5.4 Transactional Hub Implementation Style
A transactional hub implementation style is a centralized, complete set of master data for one or more domains (see Figure 1.16). It is a system of record, serving as the single version of truth for the master data it manages. A transactional hub is part of the operational fabric of an IT environment, receiving and responding to requests in a timely manner. This style often evolves from the consolidation and coexistence implementations. The fundamental difference is the change from a system of reference to a system of record. As a system of record, updates to master data happen directly to this system using the services provided by the hub. As update transactions take place, the master data is cleansed, matched, and augmented in order to maintain the quality of the master data. After updates are accepted, the system distributes these changes to interested applications and users. Changes can be distributed as they happen via messaging, or the changes can be aggregated and distributed as a batch.
Figure 1.16 Transactional Hub Implementation Style.
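One way to picture the update path through a transactional hub is as a small pipeline: validate and cleanse the incoming change, match it against existing records, persist it as the golden record, and then publish the accepted change to subscribers. The stages and names below are an illustrative sketch under those assumptions, not a product's actual service decomposition.

```python
golden_records = {}      # the system of record
subscribers = []         # downstream systems interested in changes

def cleanse(record):
    """Toy standardization step: normalize whitespace and capitalization of names."""
    record = dict(record)
    if "name" in record:
        record["name"] = " ".join(record["name"].split()).title()
    return record

def match(record):
    """Return an existing record id if this looks like a duplicate (toy rule)."""
    for record_id, existing in golden_records.items():
        if existing.get("tax_id") == record.get("tax_id"):
            return record_id
    return None

def publish(change):
    for notify in subscribers:
        notify(change)

def upsert_party(record):
    record = cleanse(record)
    record_id = match(record) or f"P-{len(golden_records) + 1}"
    golden_records[record_id] = {**golden_records.get(record_id, {}), **record}
    publish({"id": record_id, **golden_records[record_id]})
    return record_id

subscribers.append(lambda change: print("notify billing:", change))
upsert_party({"name": "  jane doe ", "tax_id": "TAX-123"})
upsert_party({"name": "Jane Doe", "tax_id": "TAX-123", "phone": "555-0100"})  # matched and merged
```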
Sometimes data extensions are needed in the MDM System to accommodate information that is not already stored in the source systems. For example, in the product domain, a food retailer might find that consumers are interested in knowing the distance that food has traveled to a store (there is a growing interest in purchasing locally grown products). Rather than augment all of the source systems, the MDM System would be extended to support this new information and would become the only place where such information is managed.
Governance and security are key aspects of all MDM implementation styles. Access to the master data must be tightly controlled and audited. Auditing can be used to track both queries and changes to the data. Visibility of the information may be controlled to the attribute value level to ensure that the right people and applications are restricted to seeing the right information in the right context. Because a transactional hub implementation is a system of record, security and governance play an especially critical role in maintaining the integrity of the master data.
The benefits of a transactional hub implementation are significant. As the system of record, it is the repository of current, clean, authoritative master data providing both access and governance. Any of the methods of use can be implemented (collaborative, operational, and analytical) to meet the MDM needs of an organization. The primary difficulty in a transactional hub implementation is achieving the transition from system of reference to system of record. As a system of record, all updates should be funneled to the MDM System—this means that existing applications, business processes, and perhaps organizational structures may need to be altered to use the MDM System. Although potentially costly, the overall organization generally benefits as more comprehensive data governance policies are established to manage the master data.
The primary disadvantages of the transactional hub style are cost and complexity. The implementation of a transactional hub often means that existing systems and business processes have to be altered when the transactional hub becomes the single point of update within the environment. The transition to a transactional hub can be performed incrementally to minimize disruption. The significant benefits of a transactional hub implementation make it the ultimate goal of many MDM projects.
The different implementation styles introduced in this section are complementary and additive. Table 1.2 provides an overview that compares the implementation styles and shows the individual benefits and drawbacks. Different MDM domains may be implemented with different styles within the same MDM System. As we have mentioned, it is common for an MDM deployment to start with one style, such as the consolidation style, achieve success with that implementation by publishing authoritative master data to downstream systems, and then extend the system with a coexistence style. With the completion of the coexistence phase, the MDM System could then be used to support the master data needs of new applications while continuing to publish snapshots of master data to downstream systems. Over time, the existing systems could be altered to leverage the MDM System, which would become a system of record.
Table 1.2. MDM Implementation Styles
| Style | Consolidation | Registry | Coexistence | Transactional Hub |
| --- | --- | --- | --- | --- |
| What | Aggregate master data into a common repository for reporting and reference | Maintain thin system of record with links to more complete data spread across systems; useful for realtime reference | Manage single view of master data, synchronizing changes with other systems | Manage single view of master data, providing access via services |
| Benefits | Good for preparing data to feed downstream systems | Complete view is assembled as needed; fast to build | Assumes existing systems unchanged, yet provides read-write management | Support new and existing transactional applications; the system of record |
| Drawbacks | Read-only; not always current with operational systems | Read-mostly; may be more complex to manage | Not always consistent with other systems | May require changes to existing systems to exploit |
| Methods of use | Analytical | Operational | Collaborative, Operational, Analytical | Collaborative, Operational, Analytical |
| System of | Reference | Reference | Reference | Record |
In an MDM System supporting multiple domains of master data such as customer, product, and supplier, we may find that the MDM System may appear as a consolidation style for one domain, a registry style for the second domain, and a transactional hub for the third domain.
1.3.6 Categorizing Data
There are many ways to characterize how data is stored, managed, and used. Because these characterizations can sometimes be confusing, we put forward a set of working definitions for five key categories of data that we discuss in this book. The five key kinds of data that we discuss in this section are:
- Metadata
- Reference Data
- Master Data
- Transaction Data
- Historical Data
Each of these categories of data has important roles to play within an enterprise's information architecture.
1.3.6.1 Metadata
The distinctions between metadata, master data, and reference data can be particularly confusing. In this book, we use the term metadata to refer to descriptive information that is useful for people or systems who seek to understand something. Metadata is a very broad topic—there are thousands of different kinds of metadata. It is beyond the scope of this book to provide an in-depth review of the topic; however, we can describe a few key characteristics.12 Different kinds of metadata are defined and used pervasively throughout the software industry because it is useful to be able to have one kind of information describe another kind of information. A database catalog describes the data managed within a database, an XML schema13 describes how an XML document that conforms to the schema should be structured, and a WSDL14 file describes how a Web service is defined. Metadata is used in both runtimes and in tools. For example, a relational database uses metadata (the database catalog) to define the legal data types for a column of data, to recognize whether a column of information is a primary key, and to indicate whether values in a column of data can be null. Similarly, database tooling uses this same metadata to allow database administrators to author and manage these database structures.
In general, it is considered appropriate to hide the existence of metadata by making the creation, management, and use of metadata part of the systems and tools that need to use it. For example, many different tools and runtimes are involved in collecting data quality information. This metadata helps users determine how much they should trust the data and systems monitored. It is important that the collection and processing of this information be automated and transparent to users. If it is something that requires user involvement, then it is difficult to guarantee that the collection of the quality metadata has been done in an accurate and consistent manner. Similarly, most users should not have to explicitly recognize when they are using metadata—it should just be a natural part of their work environment.
Metadata is also stored and managed in a wide variety of ways—from files to specialized metadata repositories. Metadata repositories often provide additional benefits by allowing different kinds of metadata to be linked together to promote better understanding and to support impact analysis and data lineage across a range of different systems. For example, because there are many places where data quality information can be collected and exploited, it is useful to aggregate this information into a Metadata Repository where the information can be combined, related, and accessed by multiple tools and systems. An increasingly important new kind of metadata repository is a Service Registry and Repository (SRR), which specializes in storing information about the services deployed in an SOA environment to support both the operation and management of a services infrastructure.
Because metadata is data that describes other kinds of information, there is metadata for each of the other kinds of data. For example, in the case of master data, the information model and services provided by an MDM System are described by a set of metadata that is used at both design and execution. Users of the MDM System rely on this metadata to accurately describe how to interpret and use the master data. When metadata plays such a critical role, governance of the metadata is important to users' confidence that the metadata accurately reflects the MDM System.15
1.3.6.2 Reference Data
Where metadata often describes the structure, origin, and meaning of things, reference data is focused on defining and distributing collections of common values. Reference data enables accurate and efficient processing of operational and analytical activities by enabling processes to use the same defined values of information for common abbreviations, for codes, and for validation. Reference data can be simple lists of common values to be used in lookups to ensure the consistent use of a code such as the abbreviation for a state, of product codes that uniquely identify a product, or of transaction codes that specify if a checking transaction is a deposit, withdrawal, or transfer. In all of these cases, the reference data represents an agreed-upon set of values that should be used throughout an organization.
Reference data is used throughout an IT system—including the processing of financial transactions, the analysis of data in a warehouse, and the management of the systems themselves. Wherever we want to guarantee a common value of a simple object, we are using reference data. When the values of reference data are able to change during the processing of long-running business processes, the management of reference data becomes particularly important. For example, when two companies merge, the stock symbol representing them may change, and so all in-flight transactions that referenced that stock symbol might fail unless the reference data is managed appropriately.
It is often the case that different applications have different values for the same object. For example, one application may refer to a state by its abbreviation, and another application may refer to it by its full name (e.g., TX and Texas). A reference data management system helps to translate from one set of values to another.
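A minimal sketch of that kind of translation between two applications' code sets follows; the mapping values are illustrative, and a real reference data management system would manage many such code sets along with their versions and effective dates.

```python
# Two applications use different values for the same state; the reference data
# translation functions map between the agreed-upon code sets.
state_codes = {"TX": "Texas", "NY": "New York", "CA": "California"}
full_name_to_code = {name: code for code, name in state_codes.items()}

def to_full_name(value):
    return state_codes.get(value, value)        # "TX" -> "Texas"

def to_code(value):
    return full_name_to_code.get(value, value)  # "New York" -> "NY"

print(to_full_name("TX"), "|", to_code("New York"))
```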
The management of reference data can happen in multiple places. Many applications have their own built-in management for reference data. Dedicated reference data management systems are sometimes used for specialized forms of reference data, and in particular for the reference data used in financial investment transactions. Some MDM Systems can also be used to manage reference data in addition to master data.
1.3.6.3 Master Data
As we described earlier, master data represents the common business objects that need to be agreed on and shared throughout an enterprise. In previous sections, we described the domains of master data and the methods of use. Master data is most often managed within a specialized MDM System. An MDM System often uses both reference data and metadata in its processing. Reference data is used to ensure common and consistent values for attributes of master data such as a country name or a color. MDM Systems may either store metadata internally or leverage an external metadata repository to describe the structure of the information managed by an MDM System and the services it provides.
1.3.6.4 Transaction Data
The business transactions that run an organization produce transaction data. Transaction data is the fine-grained information that represents the details of any enterprise—in the commercial world this information includes sales transactions, inventory information, and invoices and bills. In noncommercial organizations, transaction data might represent passport applications, logistics, or court cases. Transaction data describes what is happening in an organization.
There is often a relationship between transaction data and master data. For example, if a person applies for a passport, the application and the processing of the application is transaction information that refers to the master data representing people. The master data contains citizenship status, existing passport details, and address information that is needed in the processing of the passport request. The processing of the passport itself is handled by other applications.
Transaction data is usually maintained in databases associated with the applications that drive the business and that may also be geographically and organizationally distributed. An organization can have a very large number of transaction databases—each holding a large amount of data. Transaction data is commonly stored in relational databases according to schemas that have been optimized for the combination of query and update patterns required.
1.3.6.5 Historical Data
Historical data represents the accumulation of transaction and master data over time. It is used for both analytical processing and for regulatory compliance and auditing. Data integration tooling is typically used to extract transaction data from the existing application systems and load it into an ODS.16 Along the way, it is often transformed to reorganize the data by subject, making it easier for the data to be subsequently loaded into a data warehouse for reporting and analysis. The ODS may be updated periodically or continuously.
The historical data loaded into the warehouse is used to gain insight into the functioning of the business. Many different kinds of analyses may be performed that can support a wide range of uses, including basic reporting, dashboards (which show key performance indicators for a specific part of the organization), and predictive analytics that can drive operational decisions.
Historical data is also required to conform to the wide variety of regulations and standards that organizations have to comply with. As described throughout many sections of this book,17 the regulatory environment drives the need for the management of historical as well as master data.
Table 1.3 summarizes some of the key characteristics of these different kinds of information.
Table 1.3. Key Data Characteristics
| | What Kind of Information? | Examples | How Is It Used? | How Is It Managed? |
| --- | --- | --- | --- | --- |
| Metadata | Descriptive information | XML schemas, database catalogs, WSDL descriptions; data lineage information; impact analysis; data quality | Wide variety of uses in tooling and runtimes | Metadata repositories, by tools, within runtimes |
| Reference Data | Commonly used values | State codes, country codes, accounting codes | Consistent domain of values for common objects | Multiple strategies |
| Master Data | Key business objects used across an organization | Customer data, product definitions | Collaborative, Operational, and Analytical usages | Master Data Management System |
| Transactional Data | Detailed information about individual business transactions | Sales receipts, invoices, inventory data | Operational transactions in applications such as ERP or Point of Sales | Managed by application systems |
| Historical Data | Historical information about both business transactions and master data | Data warehouses, Data Marts, OLAP systems | Used for analysis, planning, and decision making | Managed by information integration and analytical tools |