LOB or Compose?
The most basic dichotomy in mapping between XML and relational data is the choice between representing an XML document in a single column or in multiple columns: LOB representation or composed representation. Additionally, relational databases that have been extended to add XML functionality may implement a native XML type.
In an LOB representation, a column in a database table contains a literal XML string, usually in a CLOB (character large object) or similar datatype—hence the name. A single application might store multiple types of XML documents in different columns or even in the same column. Listing 6.1 shows an example of LOB XML. In this example, we have a database table that stores purchase orders as XML documents in one column and includes two other columns: id and receivedate, that are part of the management of the purchase order document. (Note: in this and all other examples of XML used in this chapter, whitespace has been added for clarity; it is not necessarily part of the actual XML data.)
Listing 6.1 Example of LOB-Encoded XML
id receivedate purchaseorder _____________________________________________________________________ 4023 2001-12-01 <purchaseOrder xmlns="http://po.example.com"> <originator billId="0013579"> <contactName> Fred Allen </contactName> <contactAddress> <street>123 2nd Ave. NW</street> <city>Anytown</city> <state>OH</state> <zip-four>99999-1234</zip-four> </contactAddress> <phone>(330) 555-1212</phone> <originatorReferenceNumber> AS 1132 </originatorReferenceNumber> </originator> <order> <item code="34xdl 1278 12ct" quantity="1"/> <item code="57xdl 7789" quantity="1" colorcode="012"/> </order> <shipAddress sameAsContact="true"/> <shipCode carrier="02"/> </purchaseOrder> 5327 2002-04-23 <purchaseOrder xmlns="http://po.example.com"> <originator ...
The other major approach to mapping between XML and relational data employs a composed representation of XML data. In a composed representation, the XML document does not exist in its XML form in the relational database. Instead, an XML document is composed from individual data items that are stored in many columns, and often in multiple tables, in the relational database. Typically each XML element that has simple data corresponds to the value of some column in some row of the relational data.
For example, the same data that we saw in Listing 6.1 might look as shown in Listing 6.2 in a composed representation. A key observation for the composed representation is that the same relational data could be used to represent many different XML documents by choosing different columns, arranged in different ways, and different corresponding element names. There must be some explicit or implicit mechanism to determine what the XML for a set of relational tables is supposed to look like. We call such a mechanism a composition technique. A variety of composition techniques exist, and we explore a few of the variations in the next section.
Listing 6.2 Example of Composed XML
purchaseorder table id receivedate _____________________________________________________________________ 4023 2001-12-01 5327 2002-04-23 originator table poid billId contactName street ... _____________________________________________________________________ 4023 0013579 Fred Allen 123 2nd Ave. NW order table poid code quant color size _____________________________________________________________________ 4023 34xdl 1278 12ct 1 null null 4023 57xdl 7789 1 012 null ship table poid carrier ... _____________________________________________________________________ 4023 02
An important aspect of the composed representation concerns the use case where the original data is in XML form. In this case, the XML document must be decomposed into values that can be placed into the relational tables—a process often called by the evocative name “shredding.” This shredded, or decomposed, representation is then recomposed on the fly to generate the final XML. The shredding, or decomposition, process can also be configurable; ideally it uses the same information that is used in the composition technique that will be used to recompose the data. Shredding creates a variety of additional issues and choices that we discuss in a separate section below.
LOB and composed approaches can coexist, of course: part of an XML document can be (de)composed, while certain subelements are maintained in LOB form. For example, a table that mixes composed and LOB data might look like the one shown in Listing 6.3. In this case, the document data is divided between decomposed, or shredded, elements and XML LOB contents. In other scenarios, the division might be redundant: Certain elements of the LOB information might be duplicated as individual database columns.
Listing 6.3 A Table with Both Composed and LOB Data
originator table poid billId contactName address _____________________________________________________________________ 4023 0013579 Fred Allen <contactAddress> <street>1234 2nd Ave. NW</street> <city>Anytown</city> <state>OH</state> <zip-four>99999-1234</zip-four> </contactAddress>
Finally, when the database itself is extended to include native XML functionality, then another representation option is a native XML type. That is, the SQL language is extended to allow “XML” to be a primitive datatype. If we look behind the implementation of such a type, we will generally find a dichotomy similar to the one we have outlined here: either an LOB data object or some structured representation is used. But of course the SQL database may also have additional options, especially with regard to details like indexing. Options for an XML native datatype are discussed in more detail in Chapter 7.
Different vendors use different approaches to support mapping between XML and relational data: Some use an LOB representation, some use a composed representation, and some allow either, or a mixture of both. (The same holds true when the database supports a native XML datatype: The database designer may have a choice of underlying representations.)
With respect to the kinds of applications supported by each representation, we can make the following observations:
Clearly if the source data is relational, a composed representation is used.
LOB representations are the most flexible with respect to wide varieties of XML inputs, including marked-up text and variable- or no-schema documents. Composed representations with shredding can manage structured or semi-structured XML documents with a given schema much more easily than they can the other types of XML sources.
LOB representations obviously make it easy to implement the “emit documents” application type. All the other application types, however, require some ability to look inside the document, which must be implemented as an additional layer of processing that cannot take advantage of the database's capabilities, unless implemented as part of a native implementation. This is a significant restriction. By comparison, the composed technique exposes the XML data in such a way that it is possible to use the SQL engine's own capabilities to implement important operations such as selection, at the cost of requiring additional complexity to generate XML output.
Hybrid and redundant representations can help obtain the benefits of both approaches by enabling indexing and easy access to some elements, while preserving complex XML representations. The cost is a more complex representation of the data, essentially requiring that any access or update methods be able to handle both cases. Redundant representations also add the overhead of keeping multiple representations of the same data in sync.