16.4 Managing the Metadata
The specification of the data warehouse is stored in an XML document. This document contains the metadata of the warehouse, including:
Information about the JDBC connection for XML data storage
Data source URLs
View definitions
We chose an XML format for metadata storage because XML is portable and easy to parse. We now present the DTD that validates the warehouse metadata.
16.4.1 Data Warehouse
The root element of the data warehouse specification is declared as follows:
<!-- datawarehouse element -->
<!ELEMENT datawarehouse (connection, source*, view*) >
The "connection" element contains data about the JDBC connection. This data is used to connect the data warehouse manager to the DBMS (MySQL) that stores the warehouse data.
Each "source" element contains data about one XML source.
Each "view" element contains a view definition.
The element describing a source is shown in Listing 16.2.
Listing 16.2 Element Describing an XML Source
<!-- source element -->
<!ELEMENT source EMPTY >
<!ATTLIST source
    id  ID    #REQUIRED
    url CDATA #REQUIRED >
A "source" element has two attributes: "id" identifies the source, and "url" gives the URL of the XML source. Source elements are not nested inside pattern elements; declaring them at the top level makes it easy to recognize a source that is used in several patterns.
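As an illustration of how such metadata might be read, the following minimal Python sketch parses a hypothetical metadata document and maps each source id to its URL. The ids and URLs are invented for the example, and the "connection" element is left empty because its declaration is not shown in this section. Note that the standard-library parser used here checks well-formedness only; it does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata document: ids and URLs are invented for illustration.
# <connection> is left empty because its declaration is not shown in this section.
METADATA = """
<datawarehouse>
  <connection/>
  <source id="s1" url="http://example.org/publications.xml"/>
  <source id="s2" url="http://example.org/authors.xml"/>
</datawarehouse>
"""

def source_urls(metadata_xml):
    """Map each source id to the URL declared for it."""
    root = ET.fromstring(metadata_xml)
    return {src.get("id"): src.get("url") for src in root.findall("source")}

sources = source_urls(METADATA)
print(sources)
```

Because every source is declared once at the top level, a pattern only needs to carry the source id, and a lookup table like this one is enough to resolve it.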
16.4.2 View Definition
This section describes the part of the DTD that defines a view. The "view" element is described as shown in Listing 16.3.
Listing 16.3 Element Describing a View
<!-- view element -->
<!ELEMENT view (fragment+, join*, result) >
<!ATTLIST view
    id ID #REQUIRED >
A "view" element is composed of one or more fragments, zero or more join conditions, and one result pattern.
A "fragment" element describes data to match in one or more XML sources. The part of the DTD shown in Listing 16.4 describes a fragment definition.
Listing 16.4 Element Describing a Fragment
<!-- fragment element -->
<!ELEMENT fragment (pattern+) >
<!ATTLIST fragment
    id ID #REQUIRED >
A fragment is composed of one or more "pattern" subelements.
The "pattern" element describes data to match in an XML source. A pattern is linked to a source through its "source" attribute, which references a defined source by its id. A "pattern" element is composed of one "pattern.node" element, which is the pattern root, and zero or more "condition" elements. A condition restricts the values of a variable matched by the pattern. Listing 16.5 shows the part of the DTD describing a pattern definition.
Listing 16.5 Elements Describing a Pattern
<!-- pattern element -->
<!ELEMENT pattern (pattern.node, condition*) >
<!ATTLIST pattern
    id     ID    #REQUIRED
    source IDREF #REQUIRED >
<!-- pattern node -->
<!ELEMENT pattern.node (pattern.node*) >
<!ATTLIST pattern.node
    name CDATA #REQUIRED
    type CDATA #REQUIRED
    bind CDATA #IMPLIED >
A pattern is a tree of "pattern.node" elements that mirrors the structure to match in the XML source. Each "pattern.node" element has two required attributes: "type" indicates whether the node matches an element or an attribute in the XML source, and "name" gives the name of the element or attribute to match. The optional "bind" attribute, when present, gives the name of the variable bound to the matched element or attribute.
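To make the role of "name", "type", and "bind" concrete, here is a small Python sketch that walks a hypothetical pattern tree and reports, for each bound variable, the path of the node it is bound to. The sample pattern, the variable names, and the attribute values "element" and "attribute" for "type" are assumptions made for the example; the DTD declares "type" as CDATA and does not list its permitted values.

```python
import xml.etree.ElementTree as ET

# Hypothetical pattern. The type values "element"/"attribute" and the
# variable names a, t, y are assumptions made for this illustration.
PATTERN = """
<pattern id="p1" source="s1">
  <pattern.node name="publication" type="element">
    <pattern.node name="author" type="element" bind="a"/>
    <pattern.node name="title" type="element" bind="t"/>
    <pattern.node name="year" type="attribute" bind="y"/>
  </pattern.node>
</pattern>
"""

def child_nodes(node):
    """Return the direct pattern.node children of `node`."""
    return [child for child in node if child.tag == "pattern.node"]

def bindings(node, prefix=""):
    """Return {variable: path} for every bound pattern.node at or below `node`."""
    step = ("@" if node.get("type") == "attribute" else "") + node.get("name")
    path = prefix + "/" + step
    found = {}
    if node.get("bind") is not None:
        found[node.get("bind")] = path
    for child in child_nodes(node):
        found.update(bindings(child, path))
    return found

pattern_root = child_nodes(ET.fromstring(PATTERN))[0]
variables = bindings(pattern_root)
print(variables)
```

The paths are written in an XPath-like notation purely for readability; the source does not state how the warehouse manager represents matched nodes internally.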
The "condition" element allows us to add a condition on the variables defined in the pattern. The part of the DTD shown in Listing 16.6 describes the condition element definition.
Listing 16.6 Element Describing a Condition
<!-- condition node -->
<!ELEMENT condition EMPTY >
<!ATTLIST condition
    left     CDATA #REQUIRED
    operator CDATA #REQUIRED
    right    CDATA #REQUIRED >
A "join" element contains the join condition between the fragments defined in the view. The part of the DTD shown in Listing 16.7 describes the join element definition.
Listing 16.7 Element Describing a Join
<!-- join node -->
<!ELEMENT join EMPTY >
<!ATTLIST join
    leftfragment  IDREF #REQUIRED
    leftvariable  CDATA #REQUIRED
    rightfragment IDREF #REQUIRED
    rightvariable CDATA #REQUIRED >
The "join" element has four attributes that identify the fragments and variables defining the join condition. The "leftfragment" and "rightfragment" attributes are IDREF attributes referencing the left and right fragments to join. The "leftvariable" and "rightvariable" attributes contain the names of the variables, in the left and right fragments respectively, that are used in the join condition.
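A DTD validator can check that "leftfragment" and "rightfragment" refer to existing ids, but because the variable names are plain CDATA it cannot check that each variable is actually bound in the fragment it is paired with. The following Python sketch shows one way such a consistency check could be written. The sample view, its ids, and its variable names are hypothetical, and the check itself is an illustration, not a description of what the warehouse manager does.

```python
import xml.etree.ElementTree as ET

# Hypothetical view definition; ids, names, and variables are invented.
VIEW = """
<view id="v1">
  <fragment id="f1">
    <pattern id="p1" source="s1">
      <pattern.node name="publication" type="element">
        <pattern.node name="author" type="element" bind="a1"/>
        <pattern.node name="title" type="element" bind="t"/>
      </pattern.node>
    </pattern>
  </fragment>
  <fragment id="f2">
    <pattern id="p2" source="s2">
      <pattern.node name="person" type="element">
        <pattern.node name="name" type="element" bind="a2"/>
      </pattern.node>
    </pattern>
  </fragment>
  <join leftfragment="f1" leftvariable="a1"
        rightfragment="f2" rightvariable="a2"/>
  <result>
    <result.node name="authorspublications" type="element"/>
  </result>
</view>
"""

def fragment_variables(view):
    """Map each fragment id to the set of variables bound in its patterns."""
    bound = {}
    for fragment in view.iter("fragment"):
        names = {n.get("bind") for n in fragment.iter("pattern.node")}
        names.discard(None)  # nodes without a bind attribute
        bound[fragment.get("id")] = names
    return bound

def join_errors(view):
    """List the join attributes that point at unknown fragments or variables."""
    bound = fragment_variables(view)
    errors = []
    for join in view.iter("join"):
        for side in ("left", "right"):
            fragment_id = join.get(side + "fragment")
            variable = join.get(side + "variable")
            if fragment_id not in bound:
                errors.append(f"unknown fragment {fragment_id!r}")
            elif variable not in bound[fragment_id]:
                errors.append(
                    f"variable {variable!r} is not bound in fragment {fragment_id!r}"
                )
    return errors

view = ET.fromstring(VIEW)
errors = join_errors(view)
print(errors)
```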
Finally, the "result" element contains the definition of the view result pattern (see Listing 16.8).
Listing 16.8 Elements Describing a Result and Grouping Constraints
<!-- result -->
<!ELEMENT result (result.node, groupby*) >
<!-- result node -->
<!ELEMENT result.node (result.node*) >
<!ATTLIST result.node
    name CDATA #REQUIRED
    type CDATA #REQUIRED >
<!-- group by -->
<!ELEMENT groupby EMPTY >
<!ATTLIST groupby
    variable CDATA #REQUIRED >
The "result" element contains the description of the view result structure. It is composed of a "result.node" element containing the result pattern and zero or more "groupby" elements indicating how result data will be organized.
16.4.3 Mediated Schema Definition
The main role of a data warehouse is to provide integrated and uniform access to heterogeneous and distributed data. For this purpose, users are given a mediated schema against which they formulate their queries. This schema is created from the metadata. The rest of this section describes how it is generated.
To provide an integrated view of heterogeneous sources, the data warehouse is treated as a single XML document containing the results of all the views. The DTD fragment describing a data warehouse with "N" views is as follows:
<!ELEMENT datawarehouse (view1, view2, . . . , viewN) >
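As a minimal sketch of how this root declaration could be assembled from the view names found in the metadata, consider the Python function below. The function name and the second view name, "recentbooks", are hypothetical and serve only to illustrate the string that results.

```python
def datawarehouse_declaration(view_names):
    """Build the root <!ELEMENT> declaration from an ordered list of view names."""
    if not view_names:
        # An empty sequence "()" is not a valid DTD content model.
        raise ValueError("at least one view name is required")
    return "<!ELEMENT datawarehouse (" + ", ".join(view_names) + ") >"

declaration = datawarehouse_declaration(["authorspublications", "recentbooks"])
print(declaration)
```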
The view model allows us to generate a DTD from a view specification. This DTD is derived from the result pattern and can, where needed, be completed using the source definitions. The DTD generated for the view specified in Listing 16.1 is shown in Listing 16.9.
Listing 16.9 DTD Generated for the View in Listing 16.1
<!ELEMENT authorspublications (author*) >
<!ELEMENT author (title+) >
<!ATTLIST author
    name CDATA #REQUIRED >
<!ELEMENT title (#PCDATA) >
The root of the view result is an element whose type is the view name, "authorspublications". Because of the group-by clause, each "author" element is composed of one or more "title" elements; it also has one attribute containing the author's name.
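To show what a view result shaped by this DTD might look like, the Python sketch below parses a hypothetical instance and groups titles by author, rejecting any "author" element that lacks a name or has no title. The author names and titles are invented placeholders, and the structural checks are written by hand because the Python standard library does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical view result; author names and titles are placeholders.
RESULT = """
<authorspublications>
  <author name="A. Author">
    <title>First Title</title>
    <title>Second Title</title>
  </author>
  <author name="B. Author">
    <title>Third Title</title>
  </author>
</authorspublications>
"""

def titles_by_author(result_xml):
    """Return {author name: [titles]}, checking the shape the DTD requires."""
    root = ET.fromstring(result_xml)
    if root.tag != "authorspublications":
        raise ValueError("unexpected root element: " + root.tag)
    grouped = {}
    for author in root.findall("author"):
        titles = [title.text for title in author.findall("title")]
        if "name" not in author.attrib:
            raise ValueError("author element without a name attribute")
        if not titles:
            raise ValueError("author element without any title")
        grouped[author.get("name")] = titles
    return grouped

grouped = titles_by_author(RESULT)
print(grouped)
```

Each author appears once with all of that author's titles nested beneath it, which is the grouping that the "title+" content model expresses.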
The mediated schema is intended for querying the XML data in the warehouse. Currently, the query manager evaluates the XML views from the database system. In the future, the query manager's capabilities will be extended to process XPath queries with a DTD-driven tool.