Implementation Concepts

Generally, details of implementation are hidden within the database or middleware that implements the XML view of relational data. Nevertheless, it is valuable to understand the basic approaches in order to gain insight into performance or for debugging. A thorough discussion of implementing the different kinds of applications introduced in the second section, using different XML representations and different mapping techniques, is beyond the scope of this chapter. Our aim in this section is to outline the basic requirements for implementation. We begin with the simple application type of emitting XML documents, which allows us to introduce several implementation strategies for composed techniques—most important, the varieties of techniques for generating hierarchical structure through joins. We also include a very high-level outline of implementation techniques for the other application types.

Emitting XML Documents

Our simplest application type consisted of emitting XML documents that are stored in a relational database. When the document is stored in LOB form, this is quite simple: The column value is simply accessed and returned. The more interesting case is emitting a composed XML document (an XML view of relational data).

Most aspects of generating an XML document from a composed representation are also not very complex. The basic concept is to create a standard SQL statement (one that does not involve any extended features). This will, of course, return relational data. Then, in a post-processing step, for each row of the relational data, XML “decoration” is added: XML element tags, additional static content, and so on. Of course if the implementation is directly within the database engine, it need not actually be implemented in these two steps, but conceptually (and often in practice) this is representative of the process.

The basic step is to generate an SQL statement that retrieves content from a single table. This SQL statement simply names the table and whichever columns are needed. If the composition technique only refers to a single table, this is the only step that is required. For example, we can convert the annotated XSD shown in Listing 6.15 into the SQL shown in the last three lines of the listing. All the work of adding element tags is handled by the post-processor.

Listing 6.15 Creating an SQL Statement for an Annotated Schema

<xs:element name="Employee" sql:relation="Persons" 
             sql:limit-field="Type"
             sql:limit-value="Emp">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="FName" type="xs:string"
                   sql:field="FirstName"/> 
      <xs:element name="LName" type="xs:string" 
                   sql:field="LastName"/>
    </xs:sequence>
    <xs:attribute name="EmpID" type="xs:integer" 
                   sql:field="EmployeeID"/>
  </xs:complexType>
</xs:element>


SELECT p.FirstName, p.LastName, p.EmployeeID
FROM Persons p
WHERE p.Type = 'Emp';

Generation of hierarchy through grouping is also relatively straightforward. Instead of using the SQL GROUP BY construct, we ORDER BY the grouping keys. Then, during post-processing, a new grouping-level element is created each time a new value of the group keys is encountered, and new subordinate elements are created for each row of the data, as above. For example, we would convert an SQL/XML statement as shown in Listing 6.16:

Listing 6.16 Translating an SQL/XML XMLAGG Query

SQL/XML

SELECT XMLELEMENT
  ( NAME "department",
    XMLELEMENT ( NAME "name", e.dept ),
    XMLELEMENT ( NAME "employees",
      XMLAGG (
        XMLELEMENT ( NAME "emp", e.lname )
      )
    )
  ) AS "dept_list"
FROM employees e
GROUP BY e.dept;

SQL

SELECT e.dept, e.lname
FROM employees e
ORDER BY e.dept;
   

The SQL statement often becomes much simpler: All of the XML-oriented tasks are moved to the post-processing step. The part of the post-processing that handles the grouping would look something like the pseudocode shown in Listing 6.17:

Listing 6.17 Pseudocode for Reconstructing Grouped XML Data from SQL

groupkey ← null;
cursor ← execute the sql query;
while ( cursor not empty ) {
    newgroupkey ← getgroupkey(cursor);
    if ( groupkey != newgroupkey ) {
        if ( groupkey != null )
            emit close tag for old group element;
        emit open tag for new group element;
        groupkey ← newgroupkey;
    }
    process non-grouping fields in row;
    advance cursor;
}
if ( groupkey != null )
    emit close tag for old group element;
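The pseudocode in Listing 6.17 can be made concrete in a few lines. The sketch below assumes the rows arrive already sorted by the group key, as the ORDER BY in Listing 6.16 guarantees; the element names follow the SQL/XML query, while the function name and sample data are hypothetical.

```python
# Executable version of the grouping pseudocode in Listing 6.17,
# reconstructing the <department>/<employees>/<emp> hierarchy of Listing 6.16.
def regroup(rows):
    """rows: (dept, lname) tuples sorted by dept -> list of XML fragments."""
    out = []
    groupkey = None
    for dept, lname in rows:
        if dept != groupkey:               # "Did the key change?"
            if groupkey is not None:
                out.append("</employees></department>")
            out.append(f"<department><name>{dept}</name><employees>")
            groupkey = dept
        out.append(f"<emp>{lname}</emp>")  # non-grouping fields in this row
    if groupkey is not None:               # close the final open group
        out.append("</employees></department>")
    return out

rows = [("Math", "Chen"), ("Math", "Diaz"), ("Physics", "Okafor")]
print("".join(regroup(rows)))
```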
   

The most complex aspect of emitting XML from composed representations comes from the generation of hierarchy through joins. The basic premise is still the same: We want to create standard SQL, with post-processing that can generate XML from the relational results. Depending on the database vendor, the database might itself already internally support nested tables or some similar concept that can be used for this purpose. But if the database does not provide those capabilities, or if the application is being implemented in middleware that lacks access to those capabilities, then another approach is required. There are three commonly used techniques, which we call simple join, dependent join, and sorted outer union. We introduce each of these below.

Simple Join

Perhaps the most obvious approach is to try to first compute a join using the ordinary SQL join syntax and then use post-processing to reconstruct the hierarchy from the join results. The idea is similar to the method for reconstructing hierarchy for grouping: First join the appropriate relations together, sort the results so that rows that represent children of the same element will occur consecutively in the results, and use the same “Did the key change?” logic that we used for grouping to control when new container elements are opened and closed. The left outer join must be used in order to accurately reconstruct cases where an element has no children (e.g., a department with no employees).

This approach can work in certain limited cases. The following must be true:

  • There must be a key associated with each element that is not a leaf element. If there is no such key, then duplicate elements will look alike and be merged. For example, in the following sample join result, if the employee's name is not a key, then there is no way to tell whether this represents a single employee named Mary with two phone numbers or two employees named Mary, each with a single phone number:

    deptid  deptname  empname  empphone
    _____________________________________________________________________
    21      Physics   Mary     x2017
    21      Physics   Mary     x4007
    

    As noted earlier, if the relational data is the result of shredding, it is possible to automatically add node-ids for elements, thereby guaranteeing that the required keys will exist.

  • The XML hierarchy must be linear; that is, the complete set of joins must form a linear path: A left-outer-join B left-outer-join C, and so forth. The hierarchy depicted in Listing 6.5 does not satisfy this condition: in this case Table A is joined to both B and D. The problem with this arrangement is that when a complete join of A, B, C, and D is done, the result will be a full cross-product between B and D. This has two effects: The data size is exploded (consider what happens when there are more than two such parallel branches), and the logic required to reconstruct the intended hierarchy becomes considerably more complex.

The second limitation is significant, because it seriously restricts the type of XML that can be generated with the simple join technique.
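For a linear hierarchy, the reconstruction logic extends the grouping pseudocode to one "Did the key change?" test per level. The sketch below is a minimal illustration using the department/employee/phone example from the table above; the key columns (deptid, empid), function name, and sample rows are hypothetical, and the input is assumed to be a left-outer-join result sorted by the keys.

```python
# Sketch of hierarchy reconstruction from a sorted left-outer-join result.
# Keys at each non-leaf level are assumed; without them, the two Marys
# below could not be distinguished from one Mary with three phones.
def reconstruct(rows):
    """rows: (deptid, deptname, empid, empname, phone), sorted by the keys."""
    out, cur_dept, cur_emp = [], None, None
    for deptid, deptname, empid, empname, phone in rows:
        if deptid != cur_dept:                 # department key changed
            if cur_emp is not None:
                out.append("</employee>")
            if cur_dept is not None:
                out.append("</department>")
            out.append(f"<department><name>{deptname}</name>")
            cur_dept, cur_emp = deptid, None
        if empid != cur_emp:                   # employee key changed
            if cur_emp is not None:
                out.append("</employee>")
            if empid is not None:              # outer join: dept may have no employees
                out.append(f"<employee><name>{empname}</name>")
            cur_emp = empid
        if phone is not None:
            out.append(f"<phone>{phone}</phone>")
    if cur_emp is not None:
        out.append("</employee>")
    if cur_dept is not None:
        out.append("</department>")
    return "".join(out)

rows = [(21, "Physics", 7, "Mary", "x2017"),
        (21, "Physics", 7, "Mary", "x4007"),   # same empid: one Mary, two phones
        (21, "Physics", 9, "Mary", "x1001")]   # different empid: a second Mary
print(reconstruct(rows))
```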

Dependent Join

This approach works without any restrictions. In this case we do not attempt to create a single SQL query; instead, the “post-processing” step becomes a master control that creates and executes multiple SQL queries. We form one SQL query for the topmost relation, and then, for each row in that relation, invoke a separate SQL query for each related child (and in turn, for each of those rows, a separate SQL query for their children, etc., recursively).

Thus for the structure that was depicted in Listing 6.5, the processing would look something like what is shown in Listing 6.18:

Listing 6.18 Code Structure for a Dependent Join

cursorA ← sql for table A
while ( cursorA not empty ) {
    process elements for row in A
    cursorB ← sql for table B
    while ( cursorB not empty ) {
        process elements for row in B
        cursorC ← sql for table C
        while ( cursorC not empty ) {
            process elements for row in C
            advance cursorC
        }
        advance cursorB
    }
    cursorD ← sql for table D
    while ( cursorD not empty ) {
        process elements for row in D
        advance cursorD
    }
    advance cursorA
}
   

In actual practice, of course, we would use a generic recursive algorithm that would operate over any hierarchy, but this way it is easier to see the relationships, both for linear (B, C) and nonlinear (B, D) relationships. Each SQL statement except the outermost will be a parameterized SQL statement (i.e., a prepared statement), and the join condition will be represented by a parameterized condition. For example, the SQL for Table B might contain lines like the following:

SELECT ...
FROM B
WHERE B.key = &;

At run-time, the parameter will be substituted with the corresponding value from table(s) higher in the hierarchy.
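A working miniature of the dependent join, using SQLite, looks like the following. The table and column names are illustrative, not from the chapter's schema; SQLite's `?` placeholder plays the role of the `&` parameter shown above, with the inner statement executed once per outer row.

```python
# Dependent-join sketch: an outer query over departments, with a
# parameterized inner query executed once for each department row.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dept(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp(deptid INTEGER, lname TEXT);
    INSERT INTO dept VALUES (1, 'Math'), (2, 'Physics');
    INSERT INTO emp VALUES (1, 'Chen'), (2, 'Okafor'), (2, 'Diaz');
""")

out = []
for dept_id, dept_name in con.execute("SELECT id, name FROM dept ORDER BY id"):
    out.append(f"<department><name>{dept_name}</name>")
    # One inner SQL execution per outer row: the cost the text warns about.
    for (lname,) in con.execute(
            "SELECT lname FROM emp WHERE deptid = ? ORDER BY lname", (dept_id,)):
        out.append(f"<emp>{lname}</emp>")
    out.append("</department>")
print("".join(out))
```

Even in this tiny example, one inner query runs per department; with deeper hierarchies and more rows, the query count grows multiplicatively, which is the scaling problem discussed below.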

It is easy to see that this technique can create numerous SQL statements and can also have many cursors open simultaneously, even though only one is in use at a time. The algorithm does not use database resources very efficiently. And indeed, while the dependent join is adequate for small or light-load use cases, it does not by itself scale effectively to larger application demands.

The Sorted Outer Union

If the simple join only works in limited cases, and the dependent join does not scale, what are we to do? A very clever, and fairly complicated, technique was introduced by Jayavel Shanmugasundaram et al. [EFFICIENT]. It is known as the Sorted Outer Union (SOU). The Sorted Outer Union is, in effect, an improved version of the simple join approach, but it removes the restriction against nonlinear hierarchy (though keys are still required).

The fundamental question is this: If the simple join technique works for a linear join hierarchy, can we somehow pack together multiple queries, one for each linear hierarchy, in one query, in such a way that the results are neither a cross-product nor ambiguous? This is what the SOU accomplishes.

The output of the SOU contains columns for all of the tables involved—even tables that are “parallel” to each other in the XML (e.g., B and D in the example from Listing 6.5). However, in any one row of the output, only columns from a single table (and the chain of keys from the root to get to that table) will be filled; all other columns will be null. The entire SOU query consists of the union over a set of smaller SQL queries, one for each interesting join path in the XML, where each of these smaller SQL queries is in essence the simple join query for that join path. The effect is to fill the output with “blocks” of data: a set of rows to retrieve values from Table A, then another for Table B (given a key from A), and so forth. In a final bit of magic, the entire result set is sorted by the keys in such a way as to interleave the rows of the data in exactly the same order in which the fields are used in the XML document.

We illustrate this pattern with the structure from Listing 6.5, which involved Tables A, B, C and D. Thus the output of the SOU query will have columns from each of these tables, as shown below:

A-key  A-cols  B-key  B-cols  C-key  C-cols  D-key  D-cols
_____________________________________________________________________

According to the SOU algorithm, there are four interesting join paths: A itself, A-B, A-B-C and A-D. If we look at just one of these paths, say A-B, the SQL query for that path will create the join path to B and retrieve the columns of B. All the other columns will be filled with nulls. The pattern of the query, and of its output, appear as shown in Listing 6.19:

Listing 6.19 An Example of an SOU Join Block, and the Result It Generates

SELECT A-key, null as A-col1, null as A-col2, ...,
       B-key, B-col1, B-col2, ...,
       null as C-key, null as C-col1, null as C-col2, ...,
       null as D-key, null as D-col1, null as D-col2, ...
FROM   A, B
WHERE  A-key = B-key;

A-key  A-cols  B-key  B-cols  C-key  C-cols  D-key  D-cols
_____________________________________________________________________
a1     nulls   b1     bvals   nulls  nulls   nulls  nulls
   

Notice the use of null, cast to the appropriate datatype, to fill the unused columns. Also notice that we used an inner join here, not an outer join: the outer join is not required because its effect will be achieved by separate queries for each join path (in this case a separate query to retrieve values from A by itself).

One query of this form is generated for each interesting join path, the individual queries are unioned together, and the results sorted by the table keys:

sql-query-for-A
UNION ALL
sql-query-for-A-B
UNION ALL
sql-query-for-A-B-C
UNION ALL
sql-query-for-A-D
ORDER BY A-key, B-key, C-key, D-key;
   

The effect of the sort operation is to interleave the results from the different queries so that all the rows for a particular element appear together, in order. The algorithm assumes that the key order is the same as document order. (If it is not, then separate columns are required to represent the proper ordinal values, and the sort will be on those columns instead.) Also, nulls are assumed to sort low, which assures that the shorter join paths appear before the longer ones; thus, information about A is retrieved before information about A-B, and so on. (If nulls sort high, the algorithm can be modified to sort keys in the opposite order.)
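An end-to-end miniature of the SOU, again in SQLite (where nulls do sort low), is sketched below for a two-branch hierarchy A→B and A→D. All table, column, and key names are illustrative. Note that with this key order the null-keyed parent rows sort first, and the D branch happens to interleave before the B branch; real implementations add explicit ordinal columns when sibling order in the document matters.

```python
# Sorted Outer Union sketch for a nonlinear hierarchy: A -> B and A -> D.
# Each branch query fills only its own columns; the final ORDER BY
# interleaves parents before children because NULLs sort low in SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A(akey INTEGER, aval TEXT);
    CREATE TABLE B(akey INTEGER, bkey INTEGER, bval TEXT);
    CREATE TABLE D(akey INTEGER, dkey INTEGER, dval TEXT);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO B VALUES (1, 10, 'b10'), (1, 11, 'b11');
    INSERT INTO D VALUES (1, 50, 'd50'), (2, 51, 'd51');
""")

sou = """
SELECT akey, aval, NULL AS bkey, NULL AS bval, NULL AS dkey, NULL AS dval
FROM A
UNION ALL
SELECT A.akey, NULL, B.bkey, B.bval, NULL, NULL
FROM A, B WHERE A.akey = B.akey
UNION ALL
SELECT A.akey, NULL, NULL, NULL, D.dkey, D.dval
FROM A, D WHERE A.akey = D.akey
ORDER BY akey, bkey, dkey
"""
for row in con.execute(sou):
    print(row)
```

A single pass over this result, with one "Did the key change?" test per key column, is then enough to emit the whole document.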

The SOU query relies on the existence of keys for each table for the same reason as the simple join: Without them it is impossible to reliably handle duplicates or to generate the correct ordering. Beyond that, it places no inherent restrictions on the hierarchy of the XML, and can be used to generate single queries that make effective use of database connections. And while the database results are “wide” (containing many columns), many of the columns will be null, so the actual result size is reasonable; if the database driver can compress nulls, it is close to as efficient as possible. By comparison, the simple join technique, even when it is applicable, will replicate values in data columns many times (e.g., the columns from the higher levels of the hierarchy), which can greatly increase the overall size of the data returned.

The SOU may still have problems scaling, however. The SQL queries generated by the SOU technique can be quite large, and large XML structures may push the limits of what databases (or their optimizers) can handle.

Avoiding Unnecessary Dependent Joins or Unions

Both dependent join and SOU pay a price for each “join block.” One way to improve their performance is to try to limit the number of join blocks they require. For example, we can improve the performance of the dependent join by using it across larger queries. Analyzing the structure of the XML hierarchy and verifying the existence of keys may allow us to identify when it is legal to join multiple tables in a single query using the simple join technique, and only use the dependent join technique between the resulting query blocks. This approach, when applicable, can reduce the number of SQL statements and cursors required. When the structure of the XML hierarchy is actually linear, it reduces to the simple join technique.

A way to improve the performance of both dependent joins and SOU queries is to recognize situations in which the outer join is from one table (A) to a key in another table (B). We can call this kind of join a "lookup," because that is its effect: Given information about some entity/element in Table A, we are "looking up" the value of some other attribute in Table B. Since the lookup occurs on a key, the join will match at most once, returning either null or the single matching value. In this situation, it is possible to inline that particular join into a single SQL query (either a query in a dependent join, or a single query within an SOU)—and this is true regardless of the structure of the XML hierarchy. This optimization is easy to apply if the field in B is declared as a key; if it is not declared in the database metadata, then it might be identified by additional information directly in the mapping (e.g., an additional annotation in an XML template).
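The lookup optimization can be seen in a small example (schema and names illustrative): because the join target is a key, each parent row matches at most one lookup row, so the join can be folded into the parent's query rather than issued as a separate dependent query.

```python
# "Lookup" inlining sketch: dept.mgr_id joins to the key of person,
# so each dept row matches at most one person row and the left outer
# join can be inlined into the parent query without duplicating rows.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dept(id INTEGER PRIMARY KEY, name TEXT, mgr_id INTEGER);
    INSERT INTO person VALUES (7, 'Mary');
    INSERT INTO dept VALUES (1, 'Physics', 7), (2, 'Math', NULL);
""")

rows = list(con.execute("""
    SELECT d.name, p.name          -- still exactly one row per dept
    FROM dept d LEFT OUTER JOIN person p ON d.mgr_id = p.id
    ORDER BY d.id
"""))
print(rows)   # [('Physics', 'Mary'), ('Math', None)]
```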

Querying and Updating XML Documents

Querying and transforming applications require adding the capability to select parts of XML documents and to create new XML structure dynamically. A rich language such as XQuery or XSLT has powerful methods for selection, based on many possible properties of the XML input, and can create essentially any output structure for the results.

Conceptually, we think of implementing a query or transformation using the same approach we have followed for emitting XML documents from composed sources: First translate as much of the query or transformation as possible to one or more SQL statements and then add an extra post-processor to finish any additional necessary processing. When emitting documents, our post-processing step was pretty much limited to translating the SQL results to XML for composed representations (LOB representations required no additional work at all). Now, however, the post-processor may do a lot more than that in order to handle operations that are not translatable to SQL.

For LOB implementations, we generally must start by extracting the required information from the LOB, using functions that process the XML string content. An entire XML query processor might be implemented within these functions, or the functions might be more limited to extracting field data, using SQL to do the rest of the work. Either way, one challenge that must be surmounted is that many natural operations on XML will return multiple results, which will require the use of arrays, nested tables, or similar structures. The only real alternative is to limit the query functions to those that can only return single results, which is a strong limitation indeed, one that is appropriate only if the structure of the XML and the application happen to satisfy that requirement and are unlikely to change over time.

Performance is also a serious issue for general-purpose query processing of LOB XML data. General querying usually requires complete scans of all the LOB data, which exacts a large performance cost compared to normal query processing in relational databases. Text-indexing or special-purpose XML-indexing tools can create indices of particular elements or attributes inside the LOB data, which helps to limit the scanning required. Also, this cost can be reduced if the XML LOB data can reside in memory (in a database cache, or by using an in-memory database processor). Query processing must still stream through all the data, but the dominant cost of reading the data from disk is avoided.

Implementation of querying for composed representations is naturally quite different. In this case the document we are querying is a virtual document that is already computed by a SQL query or queries (plus post-processing). Thus our goal is to merge the application's query or transform with the composition query that generates the document. We illustrate this with a very simple example. The following XQuery

for $x in doc("purchases.xml")/PurchaseOrder/originator
where $x/contactAddress/zip-four = "98102-1234"
return $x/contactName

translates to SQL such as the following:

SELECT contactName
FROM purchaseorder, originator
WHERE purchaseorder.id = originator.poid
  AND originator.zip4 = '98102-1234';

We can paraphrase an algorithm that would produce this query as follows:

  1. Start with the SQL query that would be used to emit the entire PurchaseOrder element list (from the root of the document).

  2. Translate the XQuery selection conditions into equivalent SQL selection conditions (if possible).

  3. Prune the SQL query so that it will reference only the tables and columns necessary either to evaluate the selection conditions or to return required results.

Of course this merely brushes the surface: We must also recognize and handle joins between multiple documents, and so forth. We will not delve into further details in this chapter.

One issue is worth highlighting because it can be controlled. One of the most significant differences between the XML and relational models is that the order of elements is important in an XML document, so results must be generated in a particular order, and XQuery and other languages can query the document based on that order. Since the relational model is unordered, we cannot guarantee that data will be returned from a relational database in any particular order unless an ORDER BY clause is used in the query. Note that strictly speaking it is not even sufficient to say that "whatever order comes out of the database" is the order of the document, because the order is not guaranteed to be stable: changing the query, for example by adding new constraints, could cause the order of the results to be changed (e.g., because the database optimizer changes the join order). We can enforce a stable order by adding ORDER BY clauses to the SQL; in the case of shredded XML, we can add a separate ordinal value to the relational data for this purpose.
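The ordinal-column approach for shredded XML can be sketched as follows (schema illustrative): an ordinal recorded at shred time preserves document order regardless of physical row order, and the SQL sorts on it rather than on any data value.

```python
# Stable document order via an explicit ordinal column added at shred time.
# Rows are inserted out of order; ORDER BY ordinal restores document order.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE item(doc_id INTEGER, ordinal INTEGER, val TEXT);
    INSERT INTO item VALUES (1, 2, 'second'), (1, 1, 'first'), (1, 3, 'third');
""")
vals = [v for (v,) in con.execute(
    "SELECT val FROM item WHERE doc_id = 1 ORDER BY ordinal")]
print(vals)   # ['first', 'second', 'third']
```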

But ordering XML is expensive. If the application does not require stable order in the XML, it is very advantageous to avoid the unnecessary sorting steps. This is the motivation for the unordered operator in XQuery, which allows the application to declare that it is not interested in order. As an alternative, the mapping technique could allow the declaration that order of certain elements is never meaningful and therefore not supported.
