16.5 Storage and Management of the Data Warehouse
This section presents the data warehouse implementation. First, we enumerate different solutions to store XML data. Next, we present the mapping we propose for storing XML data using a relational DBMS. Finally, we present the solution we have implemented to store the mapping rules concerning views in the data warehouse.
16.5.1 The Different Approaches to Storing XML Data
We briefly present here the different approaches to storing XML data. We can distinguish at least three categories:
Flat Streams: In this approach XML data are stored in their textual form, by means of files or BLOB attributes in a DBMS. This method is very fast and easy for storing or retrieving whole documents. On the other hand, querying the data on structure (i.e., metadata) is not efficient because parsing all the data is mandatory for each query.
Metamodeling: In this approach, XML data are shredded and stored in a DBMS using its data model. The main issue of this approach is to define a schema mapping from XML data to the target DBMS data model. This mapping may be generic (i.e., valid for all XML documents), or schema driven (i.e., valid for documents that are instances of a DTD or XML Schema). These mappings improve query response time on XML data, but storage is more difficult because a parsing phase is necessary. In the database literature, many mapping schemes have been studied for relational DBMSs (e.g., Florescu and Kossmann 1999a; Manolescu et al. 2000; Yoshikawa et al. 2001; Sha et al. 1999).
Mixed: Finally, the two previous approaches could be merged to use the best of each one. A hybrid approach consists of defining a certain level of data granularity. Structures coarser than this granularity are stored using the metamodeling approach and structures finer are stored using the flat streams approach. A special-purpose XML DBMS has been proposed by C.-C. Kanne and G. Moerkotte (Kanne and Moerkotte 1999), using this technique. Another approach is to store data in two redundant repositories, one flat and one metamodeled.
16.5.2 Mapping XML to Relational
We chose a relational DBMS to store XML data of the warehouse. The mapping schema that we used is presented in Listing 16.10. Primary keys are in bold characters and foreign keys are in italic characters.
Listing 16.10 Mapping Schema for XML Data
Document (d_docID, d_url) Element (e_elemID, e_type) Attribute (a_attID, a_name) XmlNode (xn_nodeID, xn_type, xn_elemID, xn_attID, xn_value, xn_docID) Children (c_father, c_child, c_rank) AllChildren (ac_father, ac_child)
The "Document" table contains source URLs. The "Element" and "Attribute" tables are dictionaries containing all element types or attributes names of XML data in the data warehouse. These dictionaries will accelerate queries. The "XmlNode" table contains XML nodes. The "xn_type" attribute indicates the node type: element, attribute, or text. The foreign keys "xn_elemID" or "xn_attID" indicate the element type or the attribute name. The "xn_value" attribute gives the value of an attribute node or a text node. Finally, the "xn_docID" foreign key indicates the source from where the node came. This information is useful for warehouse maintenance. The "Children" table indicates parent-child relationships between nodes, and the "AllChildren" table indicates all parent-child relationships between nodes. This last table introduces redundancies in XML data but is useful for the query manager.
16.5.3 View Storage
As depicted in Figure 16.1, data warehouse storage is performed with two main components: 0(1) the XML data component (used to store XML data), and (2) the Datawarehouse component (used to store mapping rules).
The XML data component is organized according to the relational schema presented in Listing 16.10. Each XML node is identified by a "nodeID" attribute. This identifier is used to reference XML data in the Datawarehouse component.
We will now describe the organization of the Datawarehouse component. As for XML data, we use a relational DBMS to store mapping rules between the variables and XML data. The base relations are a result of patterns, and the other nodes of the graph are defined with relational operations to create fragments and views.
Patterns: A table is created for each pattern. The name of this table is P-pid with "pid" being the identifier of the pattern. For each variable of the pattern, a column is created in the pattern table. This column is named by the variable name and contains the identifier of the XML node in the XML data component.
Fragments: Tables are created for fragments. The name of this table is F-fid with "fid" being the identifier of the fragment. This table uses relational operators to compute the fragment result with the appropriate pattern tables.
Views: Tables are created for views. The name of this table is V-vid with "vid" being the identifier of the view. This table uses relational operators to perform joins between the different fragment tables used by the view.
16.5.4 Extraction of Data
This section explains how data are extracted from source. For storage space optimization, we store the XML data component once in the XML nodes that match several pattern variables.
For data extraction, we consider all patterns that have the same data source. The challenge is to avoid introducing redundancies in the XML data component. For this purpose, we process as follows:
All patterns are grouped by sources, so that patterns having the same source are evaluated together.
For a group of patterns, the source is parsed, and an object document model is generated. Each XML node has an object identifier assigned by the system.
Each pattern of the group is evaluated, and nodes matching the pattern specification are stored by the way of an Xml2Sql component.
The Xml2Sql component ensures that each XML element will be stored only once in the datawarehouse. For this purpose, we use a hash table associating the identifier of the parsed node and the value of the "nodeID" attribute in the XML data component. Before adding the XML data, the Xml2Sql component checks if the node has already been stored. If the node is already stored, the Xml2Sql component retrieves the "nodeID" attribute value in the hash table. In the case where the node is not already stored, the node is added in the XML data component and in the hash table. During the extraction phase, the fragment tables are populated, and the "nodeID" attribute is necessary to reference XML data.
At this time, we propose only a basic maintenance strategy for data. When a source is updated, we maintain the warehouse by recomputing patterns that use this source. Views using modified patterns are also recomputed. This strategy is possible thanks to our storage technique that separates storage of each pattern in a table. We plan to investigate a more sophisticated strategy: incremental maintenance.