Home > Articles > Web Services > XML

  • Print
  • + Share This
This chapter is from the book

Implementation Concepts

Generally, details of implementation are hidden within the database or middleware that implements the XML view of relational data. Nevertheless, it is valuable to understand the basic approaches in order to gain insight into performance or for debugging. A thorough discussion of implementing the different kinds of applications introduced in the second section, using different XML representations and different mapping techniques, is beyond the scope of this chapter. Our aim in this section is to outline the basic requirements for implementation. We begin with the simple application type of emitting XML documents, which allows us to introduce several implementation strategies for composed techniques—most important, the varieties of techniques for generating hierarchical structure through joins. We also include a very high-level outline of implementation techniques for the other application types.

Emitting XML Documents

Our simplest application type consisted of emitting XML documents that are stored in a relational database. When the document is stored in LOB form, this is quite simple: The column value is simply accessed and returned. The more interesting case is emitting a composed XML document (an XML view of relational data).

Most aspects of generating an XML document from a composed representation are also not very complex. The basic concept is to create a standard SQL statement (one that does not involve any extended features). This will, of course, return relational data. Then, in a post-processing step, for each row of the relational data, XML “decoration” is added: XML element tags, additional static content, and so on. Of course if the implementation is directly within the database engine, it need not actually be implemented in these two steps, but conceptually (and often in practice) this is representative of the process.

The basic step is to generate an SQL statement that retrieves content from a single table. This SQL statement simply names the table and whichever columns are needed. If the composition technique only refers to a single table, this is the only step that is required. For example, we can convert the annotated XSD shown in Listing 6.15 into the SQL shown in the last three lines of the listing. All the work of adding element tags is handled by the post-processor.

Listing 6.15 Creating an SQL Statement for an Annotated Schema

<xs:element name="Employee" sql:relation="Persons" 
             sql:limit-field="Type"
             sql:limit-value="Emp">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="FName" type="xs:string"
                   sql:field="FirstName"/> 
      <xs:element name="LName" type="xs:string" 
                   sql:field="LastName"/>
    </xs:sequence>
    <xs:attribute name="EmpID" type="xs:integer" 
                   sql:field="EmployeeID"/>
  </xs:complexType>
</xs:element>


SELECT e.FirstName, e.LastName, e.EmployeeID
FROM Employees e
WHERE e.Type="Emp";

Generation of hierarchy through grouping is also relatively straightforward. Instead of using the SQL GROUP BY construct, we will SORT BY the grouping keys instead. Then, during post-processing, a new grouping-level element is created each time a new value of the group keys is encountered, and new subordinate elements are created for each row of the data, as above. For example, we would convert an SQL/XML statement as shown in Listing 6.16:

Listing 6.16 Translating an SQL/XML XMLAGG Query

         SQL/XML
   SELECT XMLELEMENT
   ( NAME "department"
   XMLELEMENT ( NAME "name" e.dept )
   XMLELEMENT ( NAME "employees" 
   XMLAGG ( 
   XMLELEMENT ( NAME "emp", e.lname )
   )
   )
   ) AS "dept_list"
   FROM employees e
   GROUP BY e.dept;
   
   SQL
   SELECT e.dept, e.lname
   FROM employees e
   ORDER BY e.dept;
   

The SQL statement often becomes much simpler: All of the XML-oriented tasks are moved to the post-processing step. The part of the post-processing that handles the grouping would look something like the pseudocode shown in Listing 6.17:

Listing 6.17 Pseudocode for Reconstructing Grouped XML Data from SQL

groupkey ← null;
cursor ← execute the sql query;
   while ( cursor not empty ) {
   newgroupkey ← getgroupkey(cursor);
   if ( groupkey != newgroupkey ) {
   if ( groupkey != null )
   emit close tag for old group element;
   emit open tag for new group element;
   groupkey ← newgroupkey;
   }
   process non-grouping fields in row;
   advance cursor;
   }
   if ( groupkey != null )
   emit close tag for old group element
   }
   

The most complex aspect of emitting XML from composed representations comes from the generation of hierarchy though joins. The basic premise is still the same: We want to create standard SQL, with post-processing that can generate XML from the relational results. Depending on the database vendor, the database might itself already internally support nested tables or some similar concept that can be used for this purpose. But if the database does not provide those capabilities, or if the application is being implemented in middleware that lacks access to those capabilities, then another approach is required. There are three commonly used techniques, which we call simple join, dependent join, and sorted outer union. We introduce each of these below.

Simple Join

Perhaps the most obvious approach is to try to first compute a join using the ordinary SQL join syntax and then use post-processing to reconstruct the hierarchy from the join results. The idea is similar to the method for reconstructing hierarchy for grouping: First join the appropriate relations together, sort the results so that rows that represent children of the same element will occur consecutively in the results, and use the same “Did the key change?” logic that we used for grouping to control when new container elements are opened and closed. The left outer join must be used in order to accurately reconstruct cases where an element has no children (e.g., a department with no employees).

This approach can work in certain limited cases. The following must be true:

  • There must be a key associated with each element that is not a leaf element. If there is no such key, then duplicate elements will look alike and be merged. For example, in the following sample join result, if the employee's name is not a key, then there is no way to tell whether or not this represents a single employee named Mary with two phone numbers, or two people named Mary, each with a single phone number:

    deptid  deptname  empname  empphone
    _____________________________________________________________________
    21      Physics   Mary     x2017
    21      Physics   Mary     x4007
    

    As noted earlier, if the relational data is the result of shredding, it is possible to automatically add node-ids for elements, thereby guaranteeing that the required keys will exist.

  • The XML hierarchy must be linear; that is, the complete set of joins must form a linear path: A left-outer-join B left-outer-join C, and so forth. The hierarchy depicted in Listing 6.5 does not satisfy this condition: in this case Table A is joined to both B and D. The problem with this arrangement is that when a complete join of A, B, C, and D is done, the result will be a full cross-product between B and D. This has two effects: The data size is exploded (consider what can happen if there are more than two nonlinear hierarchies), and the logic required to reconstruct the intended hierarchy becomes considerably more complex.

The second limitation is significant, because it seriously restricts the type of XML that can be generated with the simple join technique.

Dependent Join

This approach works without any restrictions. In this case we do not attempt to create a single SQL query; instead, the “post-processing” step becomes a master control that creates and executes multiple SQL queries. We form one SQL query for the topmost relation, and then, for each row in that relation, invoke a separate SQL query for each related child (and in turn, for each of those rows, a separate SQL query for their children, etc., recursively).

Thus for the structure that was depicted in Listing 6.5, the processing would look something like what is shown in Listing 6.18:

Listing 6.18 Code Structure for a Dependent Join

cursorA ← sql for table A
   while ( cursorA not empty ) {
   process elements for row in A
   cursorB ← sql for table B
   while ( cursorB not empty ) {
   process elements for row in B
   cursorC ← sql for table C
   while ( cursorC not empty ) {
   process elements for row in C
   advance cursorC
   }
   advance cursorB
   }
   cursorD ← sql for table D
   while ( cursorD not empty ) {
   process elements for row in D
   advance cursorD
   }
   advance cursorA
   }
   

In actual practice, of course, we would use a generic recursive algorithm that would operate over any hierarchy, but this way it is easier to see the relationships, both for linear (B, C) and nonlinear (B, D) relationships. Each SQL statement except the outermost will be a parameterized SQL statement (i.e., a prepared statement), and the join condition will be represented by a parameterized condition. For example, the SQL for Table B might contain lines like the following:

SELECT ...
FROM B
WHERE B.key = &;

At run-time, the parameter will be substituted with the corresponding value from table(s) higher in the hierarchy.

It is easy to see that this technique can create numerous SQL statements and can also have many cursors open simultaneously, even though only one is in use at a time. The algorithm does not use database resources very efficiently. And indeed, while the dependent join is adequate for small or light-load use cases, it does not by itself scale effectively to larger application demands.

The Sorted Outer Union

If the simple join only works in limited cases, and the dependent join does not scale, what are we to do? A very clever, and fairly complicated, technique was introduced by Jayavel Shanmugasundaram et al. [EFFICIENT]. It is known as the Sorted Outer Union (SOU). The Sorted Outer Union is, in effect, an improved version of the simple join approach, but it removes the restriction against nonlinear hierarchy (though keys are still required).

The fundamental question is this: If the simple join technique works for a linear join hierarchy, can we somehow pack together multiple queries, one for each linear hierarchy, in one query, in such a way that the results are neither a cross-product nor ambiguous? This is what the SOU accomplishes.

The output of the SOU contains columns for all of the tables involved—even tables that are “parallel” to each other in the XML (e.g., B and D in the example from Listing 6.5). However, in any one row of the output, only columns from a single table (and the chain of keys from the root to get to that table) will be filled; all other columns will be null. The entire SOU query consists of the union over a set of smaller SQL queries, one for each interesting join path in the XML, where each of these smaller SQL queries is in essence the simple join query for that join path. The effect is to fill the output with “blocks” of data: a set of rows to retrieve values from Table A, then another for Table B (given a key from A), and so forth. In a final bit of magic, the entire result set is sorted by the keys in such a way as to interleave the rows of the data in exactly the same order in which the fields are used in the XML document.

We illustrate this pattern with the structure from Listing 6.5, which involved Tables A, B, C and D. Thus the output of the SOU query will have columns from each of these tables, as shown below:

A-key  A-cols  B-key  B-cols  C-key  C-cols  D-key  D-cols
_____________________________________________________________________

According to the SOU algorithm, there are four interesting join paths: A itself, A-B, A-B-C and A-D. If we look at just one of these paths, say A-B, the SQL query for that path will create the join path to B and retrieve the columns of B. All the other columns will be filled with nulls. The pattern of the query, and of its output, appear as shown in Listing 6.19:

Listing 6.19 An Example of an SOU Join Block, and the Result It Generates

SELECT A-key, null as A-col1, null as A-col2, ..., 
   B-key, B-col1, B-col2, ...,
   null as C-key, null as C-col1, null as C-col2, ..., 
   null as D-key, null as D-col1, null as D-col2, ...
   FROM   A, B
   WHERE  A-key = B-key;
   
   A-key  A-cols  B-key  B-cols  C-key  C-cols  D-key  D-cols
   _____________________________________________________________________
   
   a1     nulls     b1     bvals   nulls    nulls     nulls    nulls
   

Notice the use of null, cast to the appropriate datatype, to fill the unused columns. Also notice that we used an inner join here, not an outer join: the outer join is not required because its effect will be achieved by separate queries for each join path (in this case a separate query to retrieve values from A by itself).

One query of this form is generated for each interesting join path, the individual queries are unioned together, and the results sorted by the table keys:

sql-query-for-A
   UNION ALL
   sql-query-for-A-B
   UNION ALL
   sql-query-for-A-B-C
   UNION ALL
   sql-query-for-A-D
   ORDER BY A-key, B-key, C-key, D-key;
   

The effect of the sort operation is to interleave the results from the different queries so that all the rows for a particular element appear together, in order. The algorithm assumes that the key order is the same as document order. (If it is not, then separate columns are required to represent the proper ordinal values, and the sort will be on those columns instead.) Also, nulls are assumed to sort low, which assures that the shorter join paths appear before the longer ones; thus, information about A is retrieved before information about A-B, and so on. (If nulls sort high, the algorithm can be modified to sort keys in the opposite order.)

The SOU query relies on the existence of keys for each table for the same reason as the simple join: Without them it is impossible to reliably handle duplicates or to generate the correct ordering. Beyond that, it places no inherent restrictions on the hierarchy of the XML, and can be used to generate single queries that make effective use of database connections. And while the database results are “wide” (containing many columns), many of the columns will be null, so the actual result size is reasonable; if the database driver can compress nulls, it is close to as efficient as possible. By comparison, the simple join technique, even when it is applicable, will replicate values in data columns many times (e.g., the columns from the higher levels of the hierarchy), which can greatly increase the overall size of the data returned.

The SOU may still have problems scaling, however. The SQL queries generated by the SOU technique can be quite large, and large XML structures may push the limits of what databases (or their optimizers) can handle.

Avoiding Unnecessary Dependent Joins or Unions

Both dependent join and SOU pay a price for each “join block.” One way to improve their performance is to try to limit the number of join blocks they require. For example, we can improve the performance of the dependent join by using it across larger queries. Analyzing the structure of the XML hierarchy and verifying the existence of keys may allow us to identify when it is legal to join multiple tables in a single query using the simple join technique, and only use the dependent join technique between the resulting query blocks. This approach, when applicable, can reduce the number of SQL statements and cursors required. When the structure of the XML hierarchy is actually linear, it reduces to the simple join technique.

A way to improve the performance of both dependent joins and SOU queries is to recognize situations in which the outer join is from one table (A) to a key in another table (B). We can call this kind of join a “lookup,” because that is its effect: Given information about some entity/element in Table A, we are “looking up” the value of some other attribute in Table B. Since the lookup occurs on a key, the join will match exactly once, returning either null or the single matching value. In this situation, it is possible to inline that particular join into a single SQL query (either a query in a dependent join, or a single query within an SOU)—and this is true regardless of the structure of the XML hierarchy. This optimization is easy to apply if the field in B is declared as a key; if it is not declared in the database metadata, then it might be identified by additional information in the mapping directly (e.g., an additional annotation in an XML template).

Querying and Updating XML Documents

Querying and transforming applications require adding the capability to select parts of XML documents and to create new XML structure dynamically. A rich language such as XQuery or XSLT has powerful methods for selection, based on many possible properties of the XML input, and can create essentially any output structure for the results.

Conceptually, we think of implementing a query or transformation using the same approach we have followed for emitting XML documents from composed sources: First translate as much of the query or transformation as possible to one or more SQL statements and then add an extra post-processor to finish any additional necessary processing. When emitting documents, our post-processing step was pretty much limited to translating the SQL results to XML for composed representations (LOB representations required no additional work at all). Now, however, the post-processor may do a lot more than that in order to to handle operations that are not translatable to SQL.

For LOB implementations, we generally must start by extracting the required information from the LOB, using functions that process the XML string content. An entire XML query processor might be implemented within these functions, or the functions might be more limited to extracting field data, using SQL to do the rest of the work. Either way, one challenge that must be surmounted is that many natural operations on XML will return multiple results, which will require the use of arrays, nested tables, or similar structures. The only real alternative is to limit the query functions to those that can only return single results, which is a strong limitation indeed, one that is appropriate only if the structure of the XML and the application happen to satisfy that requirement and are unlikely to change over time.

Performance is also a serious issue for general-purpose query processing of LOB XML data. General querying usually requires complete scans of all the LOB data, which exacts a large performance cost compared to normal query processing in relational databases. Text-indexing or special purpose XML-indexing tools can create indices of particular elements or attributes inside the LOB data, which helps to limit the scanning required. Also, this cost can be reduced if the XML LOB data can reside in memory (in a database cache, or by using an in-memory database processor). Query processing must still stream through all the data, but the dominant cost of reading the data from disk is avoided.

Implementation of querying for composed representations is naturally quite different. In this case the document we are querying is a virtual document that is already computed by a SQL query or queries (plus post-processing). Thus our goal is to merge the application's query or transform with the composition query that generates the document. We illustrate this with a very simple example. The following XQuery

for $x in doc("purchases.xml")/PurchaseOrder/originator
where $x/contactAddress/zip-four = "98102-1234"
return $x/contactName

translates to SQL such as the following:

SELECT contactName
FROM purchaseorder, originator
WHERE purchaseorder.id = originator.poid AND originator.zip4 = "98102";

We can paraphrase an algorithm that would produce this query as follows:

  1. Start with the SQL query that would be used to emit the entire PurchaseOrder element list (from the root of the document).

  2. Translate the XQuery selection conditions into equivalent SQL selection conditions (if possible).

  3. Prune the SQL query so that it will reference only the tables and columns necessary either to evaluate the selection conditions or to return required results.

Of course this merely brushes the surface: We must also recognize and handle joins between multiple documents, and so forth. We will not delve into further details in this chapter.

One issue is worth highlighting because it can be controllable. One of the most significant differences between the XML and relational models is that the order of elements is important in an XML document, so results must be generated in a particular order, and XQuery and other languages can query the document based on that order. Since the relational model is unordered, we cannot guarantee that data will be returned from a relational database in any particular order unless an ORDER BY clause is used in the query. Note that strictly speaking it is not even sufficient to say that “whatever order comes out of the database” is the order of the document, because the order is not guaranteed to be stable: changing the query, for example by adding new constraints, could cause the order of the results to be changed (e.g., because the database optimizer changes the join order). We can enforce a stable order by adding ORDER BY clauses to the SQL; in the case of shredded XML, we can add a separate ordinal value to the relational data for this purpose.

But ordering XML is expensive. If the application does not require stable order in the XML, it is very advantageous to avoid the unnecessary sorting steps. This is the motivation for the unordered operator in XQuery, which allows the application to declare that it is not interested in order. As an alternative, the mapping technique could allow the declaration that order of certain elements is never meaningful and therefore not supported.

  • + Share This
  • 🔖 Save To Your Account