Home > Articles > Web Services > XML

  • Print
  • + Share This
This chapter is from the book

pureXML

v_1_9.jpg

As of DB2 V9, XML data can be stored natively in DB2 databases. This implementation is known as pureXML. With pureXML, you can treat XML as another data type that can be managed by DB2. This means that you can CREATE tables with XML columns, ALTER existing tables to add XML columns, INSERT XML data (optionally validated against XML schemas), CREATE indexes on XML data, search XML data, and generally manipulate XML as part of your DB2 databases.

DB2’s support for XML with pureXML is novel in that the XML data is integrated into the DB2 database system enabling access and management of the XML using DB2 functions and capabilities.

Creating a Table with an XML Column

Similar to LOB data, XML data is physically stored in separate table spaces from the base tables that contain the XML columns. But unlike with LOBs, the storage structures are transparent to the user. You do not need to define and manage the storage mechanisms used by DB2 for XML data.

For example, the following SQL creates a new table with an XML column, which can be used to store the customer XML example shown previously:

CREATE TABLE CUST
(CUSTNO   INTEGER NOT NULL,
 STATUS   CHAR(1),
 XMLCUST  XML)
IN DB.TS;

When a table is defined with an XML column, DB2 generates a hidden column in the base table called a DOCID, which contains a unique identifier for the XML column. There is a single DOCID column even if the table contains more than one XML column. The DOCID column is purely internal; it does not show up in a SELECT *, for example. In addition, DB2 automatically creates a document ID index for the XML column, too.

The XML data is not stored directly in the XML column. Instead, a separate internal table in a separate table space is created for each XML column in the base table. A single XML document may be physically split across multiple rows in the XML table, but it logically still belongs to a single row in the base table, which is the table created with the XML column definition. The internal XML table is composed of three columns:

  • DOCID (BIGINT)
  • MIN_NODEID (VARCHAR)
  • XMLDATA (VARBINARY)

The DOCID column is used as a pointer between the base table and the XML table. In any given row, the MIN_NODEID column stores the lowest node stored in the XMLDATA column of the row. This information optimizes DB2’s capability to process XML documents. The XMLDATA column contains a region of an XML document formatted as a parsed tree. The internal table is clustered by DOCID and MIN_NODEID.

The internal XML table always has a 16-KB page size. This is so regardless of the page size of the base table containing the XML column. The table space used by the internal XML table is a Universal table space. If the base table resides in a partitioned table space, the XML table space will be range-partitioned; if not, the XML table space will be partition-by-growth.

The TBSBPXML DSNZPARM system parameter is available to specify the default buffer pool for XML table spaces. The default is BP16K0; however, because the DB2 Catalog uses this buffer pool, it is a good idea to use different 16-K buffer pools for your XML table spaces.

XML Document Trees

DB2’s pureXML implementation follows the XPath 2.0 and the XQuery 1.0 data model, which provides an abstract representation of XML documents. By using the data model, all permissible values of expressions in XPath can be defined, including values used during intermediate calculations. The pureXML data model is described in terms of sequences and items, atomic values, and nodes.

The primary difference between XPath and XQuery is that XPath is a subset of XQuery for addressing parts of a document. XPath cannot be used to construct new XML documents, nor does it support complex join, grouping, and ordering.

XML is hierarchic in nature and, as such, every XML document can be represented as a node tree. When XML data is queried or modified, the hierarchical structure of the document must be traversed. To assure that XML data is accessed as efficiently as possible, DB2 physically stores XML documents in hierarchical format, as trees of nodes with parent-child relationships between the nodes.

To better understand this concept of hierarchic trees, work through an example. To start, examine the following sample XML:

<customer>
   <custname>
       <first_name>Craig</first_name>
       <last_name>Mullins</last_name>
   </custname>
   <addr country="US">
       <street>100 Easy St</street>
       <city>Pittsburgh</city>
       <state>PA</state>
       <zip_code>15215</zip_code>
   </addr>
   <phone type="work">412-555-1000</phone>
   <phone type="mobile">972-555-8174</phone>
</customer>

This basic XML document contains customer data. At the root of the tree is the root element, customer. There are various direct children elements as well: first_name, last_name, addr, and two occurrences of phone.

The element addr is composed of multiple elements as well: street, city, state, and zip_code. And addr also has an attribute for the country. The element phone has an attribute, type, associated with it, as well.

Figure 10.1 illustrates a representation of this XML document as a hierarchical tree. You can build these trees by parsing an XML document using an XML parser.

Figure 10.1

Figure 10.1. An XML document tree.

XML data is stored in DB2 as an XML document tree. When XML is inserted or loaded into an XML column, DB2 parses the XML document to produce the hierarchical format, and that is what is stored in the XML table space. If one or more columns are of type XML, and you select the entire XML column, DB2 retrieves the XML document tree and converts it back into the text XML document. This process is called serialization.

When the XML data is larger than a page, the document tree is divided into groups of nodes; each group is usually a subtree of nodes. The divided data is stored in the XMLDATA column of the internal XML table. Nodes are grouped bottom up, up to the largest row size of a page. One document can be physically stored across many rows and pages.

DB2 stores XML data as an XML document tree to optimize performance when accessing XML data. The document tree is further optimized by replacing tag names with 4-byte identifiers. So the internal storage does not look exactly like the tree shown in Figure 10.1. Tag names are mapped to stringIDs and stored in the DB2 Catalog in the table SYSIBM.SYSXMLSTRINGS. DB2 caches the mapping table to optimize performance. Furthermore, XML query evaluation and traversal of XML documents can operate on integers, which is typically much faster than operating on strings. There will be only one 4-byte identifier for a particular string, even if that string (that is, “name”) is used in many different XML columns.

The exact shape of the document tree can vary from document to document; it depends upon the content of each XML document. The document tree is not built or predefined based on the XML Schema. The actual XML data is stored as VARBINARY data, composed of the XML document tree (or a sequence of subtrees) with context path.

Serializing and Parsing XML Data

An XML value can be transformed into a textual XML value that represents the XML document by using the XMLSERIALIZE function or by retrieving the value into an application variable of an XML, string, or binary data type.

The inverse can also be achieved. The textual XML document can be turned into the XML value using the XMLPARSE function or by storing a value from a string, binary, or XML data type to an XML column.

Altering a Table to Contain XML Data

Modifying a table to add an XML column is also a simple endeavor. All that is necessary is a simple ALTER, such as the following:

ALTER TABLE DSN81010.EMP
   ADD JOBHISTORY XML;

This statement adds the JOBHISTORY column as XML to the EMP table. Of course, when an XML column is added to an existing table, DB2 creates the same internal table and index to support the XML data as it would for a new table.

Schema Validation

As you have seen, XML is flexible. XML has no rigorously enforced schema applied when data is added, such as with a DBMS such as DB2. But you can validate XML against a schema with a process known as XML schema validation. XML schema validation can determine whether the structure, content, and data types of an XML document are valid according to an XML schema. In addition, XML schema validation removes whitespace that can be ignored from the XML document being validated.

You can validate an XML document in DB2 in two ways:

  • Automatically, by specifying an XML type modifier in the XML column definition of the CREATE or ALTER TABLE statement. When a column has an XML type modifier, DB2 implicitly validates documents that are inserted into the column or documents in the column that are updated.
  • Manually, by executing the DSN_XMLVALIDATE built-in function when you INSERT a document into an XML column or UPDATE a document in an XML column.

If you perform XML schema validation using DB2, you need to set up an XML Schema Repository. A DB2 for z/OS XML schema repository (XSR) is a set of DB2 tables where you can store XML schemas. XSR requires additional software: WLM, z/OS XML System Services, Java 2 Technology Edition 31 bit (V5 or later), and IBM Data Server Driver for JDBC and SQLJ.

Refer to the IBM manual, pureXML Guide (SC19-2981) for additional information on using an XML Schema Repository, as well as additional information on using type modifiers.

XML Namespaces

XML namespaces are a W3C XML standard for providing uniquely named elements and attributes in an XML document. XML documents may contain elements and attributes that have the same name but belong to different domains. A namespace can be used to remove the ambiguity when referencing such elements and attributes.

Namespaces are supported by all DB2 pureXML features including SQL/XML, XML indexes, and XML Schema validation.

A namespace must be declared with a prefix assigned to a Universal Resource Identifier (URI) before it can be used. Now augment your example XML with a namespace:

<c:customer xmlns:c="http://ddgsample.org">
   <c:custname>
       <c:first_name>Craig</c:first_name>
       <c:last_name>Mullins</c:last_name>
   </c:custname>
   <c:addr country="US">
       <c:street>100 Easy St</c:street>
       <c:city>Pittsburgh</c:city>
       <c:state>PA</c:state>
       <c:zip_code>15215</c:zip_code>
   </c:addr>
   <c:phone type="work">412-555-1000</c:phone>
   <c:phone type="mobile">972-555-8174</c:phone>
</c:customer>

The attribute xmlns:c declares that c is a namespace prefix bound to the URI http://ddgsample.org. The prefix c can be used for the customer element and all other elements or attributes in the document that are descendants of customer.

Not every node in a document must belong to the same namespace. Furthermore, an XML document can contain multiple namespaces. Specify namespace declarations as needed within your XML documents according to your needs for accessing elements of the XML document.

If all elements in a document belong to the same namespace, you can declare a default namespace and avoid the use of prefixes:

<customer xmlns="http://ddgsample.org">
   <custname>
        <first_name>Craig</first_name>
        <last_name>Mullins</last_name>
   </custname>
   <addr country="US">
        <street>100 Easy St</street>
        <city>Pittsburgh</city>
        <state>PA</state>
        <zip_code>15215</zip_code>
   </addr>
   <phone type="work">412-555-1000</phone>
   <phone type="mobile">972-555-8174</phone>
</customer>

In this case, all elements belong to the declared http://ddgsample.org namespace.

Indexing XML Data

You can build indexes on data stored in XML columns to improve the efficiency of queries against XML documents. Indexes on XML data differ from DB2 indexes on relational data. A typical DB2 index lists a column, or series of columns, and the index is built upon those columns. An XML index is based upon a part of the data in the XML column, not the entire column.

A typical DB2 index has one entry for each row in the table, even if the value is NULL, whereas an XML index does not have an entry for a row if the XML document in that row does not contain that element.

An XML index uses an XML pattern expression to index paths and values in XML documents stored within a single XML column. The index entries in XML indexes provide access to nodes within the XML document. Because multiple parts of a XML document can satisfy an XML pattern, DB2 might generate multiple index keys when it inserts values for a single document into the index.

XML indexes are built using the CREATE INDEX statement (and dropped using DROP INDEX). Instead of listing columns, the GENERATE KEY USING XMLPATTERN clause is used to indicate what portion of the XML document you want to index:

CREATE INDEX CSTLNMX1
   ON   CUST(XMLCUST)
GENERATE KEY USING XMLPATTERN '/customerinfo/custname/last_name'
AS SQL VARCHAR(20)

The GENERATE KEY USING XMLPATTERN clause provides information about what you want to index. This clause is called an XML index specification. The XML index specification contains an XML pattern clause. The XML pattern clause in this example indicates that you want to index the values of the last_name attribute of each customer element. The index entries are to be stored as VARCHAR(20).

Every XML pattern expression specified in index CREATE statement must be associated with a data type. The only supported data types are VARCHAR, DECFLOAT, DATE, and TIMESTAMP.

v_1_10.jpg

Only one index specification clause is allowed in each CREATE INDEX statement, but it is permissible to create multiple XML indexes on each XML column.

To identify the portion of the XML to be indexed, you specify an XML pattern to identify a set of nodes in the XML document. This pattern expression is similar to an XPath expression. (But only a subset of the XPath language is supported.)

Namespace Declarations in XML Indexes

In the XMLPATTERN clause of the CREATE INDEX statement, you can specify an optional namespace declaration that maps a URI to a namespace prefix. Then you can use the namespace prefix when you refer to element and attribute names in the XML pattern expression.

For example, consider the following index creation statement:

CREATE INDEX CSTPHNX2
   ON   CUST(XMLCUST)
GENERATE KEY USING XMLPATTERN 'declare namespace s="http://ddgsample.org/";
/s:customer/s:phone/@s:type' AS SQL VARCHAR(12)

The namespace declaration maps the namespace URI http://ddgsample.org to the character s. It can then be used to qualify all elements and attributes with that namespace prefix.

XML Indexes and UNIQUE

Specifying UNIQUE in an XML index definition is a little bit different than in traditional, relational indexes. For a traditional index, the UNIQUE keyword enforces uniqueness across all rows in the table. For XML indexes, the UNIQUE keyword enforces uniqueness across all documents in an XML column. This means the index can ensure uniqueness. not only across all rows in the table, but also within a single document within a row.

For an XML index, DB2 enforces uniqueness for the following:

  • Data type of the index.
  • XML path to a node.
  • Value of the node after the XML value has been cast to the SQL data type specified for the index.

Querying XML Data

Of course, after you create tables with XML data in them, you want to access that data in your DB2 applications. Doing so requires a basic understanding of XPath.

XPath

XPath is a programming language designed by the World Wide Web Consortium (W3C) for querying and modifying XML data. DB2 supports a subset of the language constructs in the XPath 2.0 specification. In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes.

XML documents are treated as trees of nodes. The top element of the tree is called the root element. Now look at some sample XML again:

<customer>
   <custname>
       <first_name>Craig</first_name>
       <last_name>Mullins</last_name>
   </custname>
   <addr country="US">
       <street>100 Easy St</street>
       <city>Pittsburgh</city>
       <state>PA</state>
       <zip_code>15215</zip_code>
   </addr>
   <phone type="work">412-555-1000</phone>
   <phone type="mobile">972-555-8174</phone>
</customer>

Examples of nodes in this XML document include <customer>, which is the root element node, <city>Pittsburgh</city>, which is an element node, and Type="work", which is an attribute node.

There is a relationship among the nodes that you need to understand for XML processing. Each element and attribute has one parent. In the example, XML the addr element is the parent of street, city, state, and zip_code. Element nodes can have zero, one, or more children. Nodes that have the same parent are siblings. The street, city, state, and zip_code nodes are siblings of each other, as well as children of addr.

Descendants are any children of a node as well as the children’s children, and so on. So in your example, the descendants of customer include custname, first_name, last_name, addr, street, city, state, zip_code, and phone.

XPath uses path expressions to select nodes or node-sets in an XML document. The node is selected by following a path or steps. If you are familiar with Windows® or UNIX® file structure, an XPath specification will look familiar to you. Table 10.1 outlines some useful path expressions.

Table 10.1. XPath Expressions

Expression

Description

Nodename

Selects all child nodes of the named node

/

Selects from the root node

//

Selects nodes in the document from the current node that match the selection no matter where they are

.

Selects the current node

..

Selects the parent of the current node

@

Selects attributes

To select the last_name element of the customer XML document using XPath you would specify the following:

customer/addr/last_name,

Or to select all attributes that are named phone using XPath, you can specify the following:

//@phone

Use DB2 XPath in the following contexts:

  • As an argument to the XMLQUERY built-in function, which extracts data from an XML column.
  • As an argument to the XMLEXISTS predicate, which is used for evaluation of data in an XML column.
  • In an XML index, to determine the nodes in an XML document to index. (Only a subset of XPath, called an XML pattern, is valid in this context.)

The XMLQUERY Function

You can use the XMLQUERY function to execute an XPath expression from within SQL. You can pass variables to the XPath expression specified in XMLQUERY. XMLQUERY returns an XML value, which is an XML sequence. This sequence can be empty or can contain one or more items.

When you execute XPath expressions from within an XMLQUERY function, you can allow XML data to participate in SQL queries. Furthermore, you can retrieve parts of XML documents, instead of the entire XML document. This gives you the ability to operate on both relational and XML data in the same SQL statement. And you can apply additional SQL to the returned XML values after using XMLCAST to cast the results to a non-XML type.

Consider the following query, which returns the phone number information extracted from the XML for each customer in the CUST table:

SELECT CUSTNO,
   XMLQUERY ('declare default element namespace "http://ddgsample.org";
   /customer/phone' passing XMLCUST)
   AS "PHONE FROM XMLCUST"
FROMCUST

You can use the XMLEXISTS predicate to restrict the result set based on values in XML columns. To do so, an XPath expression is specified to the XMLEXISTS predicate. If the XPath expression returns an empty sequence, the value of the XMLEXISTS predicate is false. Otherwise, XMLEXISTS returns true and the row is returned.

For example, consider this SQL statement:

SELECT CUSTNO, STATUS, XMLCUST
FROM   CUST
WHERE  XMLEXISTS ('declare default element namespace "http://ddgsample.org";
  //addr[city="Pittsburgh"]' passing XMLCUST)
AND XMLEXISTS ('declare default element namespace "http://ddgsample.org";
/customer[last_name="Mullins"]' passing XMLCUST)

This matches your sample XML document in which customer city is Pittsburgh and last_name is Mullins, so that row will be returned (as would any other where both of these conditions are true).

The XMLTABLE() Function

The XMLTABLE() function though can be used to produce XML query results comparable to XQuery. The following example uses the XMLTABLE() function to query your sample table and XML document, returning the city, street, state, and zip_code as columns in a result table:

SELECT X.*
FROM   CUST,
       XMLTABLE (XMLNAMESPACES(DEFAULT 'http://ddgsample.org'),
         '$x/customer/addr[@zip_code=15215]'
         PASSING XMLCUST as "x"
       COLUMNS
         ZIP_CODE INT         PATH '@zip_code',
         STREET   VARCHAR(50) PATH 'street',
         CITY     VARCHAR(30) PATH 'city',
         STATE    VARCHAR(2)  PATH 'state') AS X

Recall that the XML column in the CUST table is named XMLCUST, therefore you code PASSING XMLCUST.

To enable this query to use an XML index, can add a predicate using XMLEXISTS, for example:

WHERE XMLEXISTS('$x/customer/addr[@zip=15215]'PASSING XMLCUST AS "x")

The XMLTABLE() function can be used when converting XML data into a relational result set. It can be beneficial to use XMLTABLE() in views where the SQL referencing the view does not have to code XPath expressions. The optimizer can still use XML indexes when view is referenced.

Access Methods

DB2 supports several access methods for XML data. The basic access method is the DocScan, which traverses XML data and evaluates XPath expressions using an IBM-patented technique called QuickXScan (see Note). There is no access type indicator for DocScan in the PLAN_TABLE because it is part of a scan if there is a predicate on an XML column involved.

XML indexes are used only for the XMLEXISTS predicate and XMLTABLE function evaluation. There are three access types for XML index-based access. Similar to RID list access, ANDing, and ORing, they include the following:

  • DocID list access (DX)
  • DocID list ANDing (DI for DocID list Intersection)
  • DocID list ORing (DU for DocID list Union)

Inserting XML Data

Inserting XML data is as simple as inserting any other type of data. For example, assume that you want to use the CUSTOMER table described earlier in this chapter to INSERT an XML document. The following SQL achieves this goal:

INSERT INTO CUSTOMER
  (CUSTNO, STATUS, XMLCUST)
VALUES (1000, 'Y',
'<customer>
    <custname>
        <first_name>Craig</first_name>
        <last_name>Mullins</last_name>
    </custname>
    <addr country="US">
        <street>100 Easy St</street>
        <city>Pittsburgh</city>
        <state>PA</state>
        <zip_code>15215</zip_code>
      </addr>
    <phone type="work">412-555-1000</phone>
   <phone type="mobile">972-555-8174</phone>
</customer>'
);

This SQL statement inserts a new customer into the CUSTOMER table, giving it a CUSTNO of 1000 and a STATUS of Y. The XML data is inserted into the XML column defined as XMLCUST.

Deleting XML Data

You can use the SQL DELETE statement to delete rows that contain XML documents. Nothing special is required, just code the DELETE as normal including any WHERE clause that you want. When the row is deleted, the XML data for that row is deleted, too.

Of course, you might want to DELETE based upon values in the XML document. You can do this using XPath expressions within XMLEXISTS, for example, to DELETE rows from CUST table for which the value of the city element is Pittsburgh (in the XMLCUST column):

DELETE FROM CUST
WHERE XMLEXISTS ('declare default element namespace "http://ddgsample.org";
//addr[city="Pittsburgh"]' passing XMLCUST)

Updating XML Data

You can also update XML data, either the entire XML document, or a portion of the XML document.

Updating an Entire XML Document

To update an entire XML document in an XML column, supply the XML data to the UPDATE statement, being sure to specify a WHERE clause for the rows to be updated. The input to the XML column must be a well-formed XML document (as defined in the XML 1.0 specification).

Updating a Portion of an XML Document

v_1_10.jpg

You can also use the UPDATE statement with the XMLMODIFY function to update a portion of an XML document in an XML column. The XMLMODIFY function specifies a basic updating expression that you can use to insert nodes, delete nodes, replace nodes, or replace the values of a node in XML documents stored in XML columns.

The following UPDATE statement modifies the document for CUSTNO 100 changing the value of the city element to “Houston”:

UPDATE CUST
SET XMLCUST=XMLMODIFY('replace value of node/customer/addr/city with "Houston" ')
WHERE CUSTNO = 100

The following update expressions are supported in the XMLMODIFY function:

  • delete expressions: To remove elements or attributes from a document
  • insert expressions: To add elements or attributes to a document
  • replace value of node expressions: To change the value of an element or attribute
  • replace node expressions: To replace an existing element or attribute with a different one
  • + Share This
  • 🔖 Save To Your Account