1.4 Patterns in XML
In order to effectively model information using XML, we must learn how to identify the natural patterns inherent to it. First, we must determine whether we have used XML elements properly. To do this we will analyze the XML fragment shown in Listing 1.2.
Listing 1.2 Example XML Fragment
<colorimeter_reading>
  <device> X-Rite Digital Swatchbook </device>
  <patch> cyan </patch>
  <RGB resolution="8">
    <red> 0 </red>
    <green> 255 </green>
    <blue> 255 </blue>
  </RGB>
</colorimeter_reading>
We examine each data element and ask the following question:
Is this data, or is it actually metadata (information about another data element)?
We examine every attribute and ask the following questions:
Does the attribute tell us something about or describe how to interpret, use, or present data elements?
Is the attribute truly metadata, and not actually a data element?
Does it apply to all data elements in its scope?
We examine every tag and ask the following question:
Does this tag help describe what all data elements in its scope are?
We examine the groupings we have created (the sibling relationships) and ask:
Are all members of the group related in a way the parent nodes describe?
Is the relationship between siblings unambiguous?
If the answer to any of the preceding questions is "no," then we need to cast the offending components differently.
After ensuring that the information has been expressed using the components of XML appropriately, we examine how everything has been stitched together. To do this we create an information context list from the XML fragment: we take each data element and write down every tag and attribute leading up to it. The resulting lines give us a flattened view of the information items contained in the XML fragment. A context list for the example XML fragment in Listing 1.2 would look like the one shown in Listing 1.3.
Listing 1.3 Context List for Example XML Fragment
<colorimeter_reading><device> X-Rite Digital Swatchbook
<colorimeter_reading><patch> cyan
<colorimeter_reading><RGB resolution="8"><red> 0
<colorimeter_reading><RGB resolution="8"><green> 255
<colorimeter_reading><RGB resolution="8"><blue> 255
If we convert these lines to what they mean in English, we can see that each information item, and its context, makes sense and is contextually complete:
This colorimeter reading is from an X-Rite Digital Swatchbook.
This colorimeter reading is for a patch called cyan.
This colorimeter reading is RGB-red and has an 8-bit value of 0.
This colorimeter reading is RGB-green and has an 8-bit value of 255.
This colorimeter reading is RGB-blue and has an 8-bit value of 255.
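The construction of a context list is mechanical, so it can be sketched in a few lines of Python. What follows is a minimal illustration under stated assumptions, not part of any product: the fragment is the one from Listing 1.2 with the attribute value quoted so that a standard parser accepts it, and the helper names open_tag and context_list are our own.

```python
import xml.etree.ElementTree as ET

# The fragment from Listing 1.2 (attribute value quoted for the parser).
FRAGMENT = """
<colorimeter_reading>
  <device> X-Rite Digital Swatchbook </device>
  <patch> cyan </patch>
  <RGB resolution="8">
    <red> 0 </red>
    <green> 255 </green>
    <blue> 255 </blue>
  </RGB>
</colorimeter_reading>
"""


def open_tag(elem):
    """Render an element's opening tag, attributes included."""
    attrs = "".join(f' {name}="{value}"' for name, value in elem.attrib.items())
    return f"<{elem.tag}{attrs}>"


def context_list(elem, prefix=""):
    """Yield (tag_context, data) for every data element at or below elem."""
    context = prefix + open_tag(elem)
    data = (elem.text or "").strip()
    if data:  # elements holding only whitespace carry no data of their own
        yield context, data
    for child in elem:
        yield from context_list(child, context)


if __name__ == "__main__":
    for context, data in context_list(ET.fromstring(FRAGMENT)):
        print(context, data)
```

Each printed line pairs one data element with the full chain of tags and attributes leading up to it, which is the flattened view we read back as an English sentence.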
Next we examine the groupings implied by the tag hierarchy:
"<colorimeter_reading>" contains "<device>", "<patch>", and "<RGB>" (plus its children).
"<RGB>" contains "<red>", "<green>", and "<blue>".
"<colorimeter_reading>" is the root tag, so everything else is related to it. The only other implied grouping falls under "<RGB>". These are the actual readings, and the only entries that are, so they are logically related in an unambiguous way.
Finally, we examine the scope for each attribute:
resolution="8" has the items "<red>", "<green>", and "<blue>" in its scope.
resolution="8" logically applies to every item in its scope and to none of the items outside its scope, so it has been appropriately applied.
A self-constructing XML information system (such as NeoCore XMS) uses the structure of XML, and the natural patterns it contains, to determine automatically what to index. Simple queries are serviced by direct lookups. Complex queries are serviced by a combination of direct lookups, convergences against selected parent nodes, and targeted substring searches. With NeoCore XMS, no database design or indexing instructions are necessary; the behavior of XMS is driven entirely by the structure of the XML documents posted to it. Index entries are determined by inference and are built from the natural patterns contained in XML documents. NeoCore XMS creates index entries according to the following rules:
An index entry is created for each data element.
An index entry is created for each complete tag context for each data element; that is, the concatenation of every tag leading up to the data element.
An index entry is created for the concatenation of the two preceding items (tag context plus data element).
For the XML fragment in Listing 1.2, the following items would be added to the pattern indices (the list is not complete, because partial tag context index entries are also created; a discussion of those is beyond the scope of this chapter):
1. X-Rite Digital Swatchbook
2. cyan
3. 0
4. 255
5. 255
6. <colorimeter_reading><device>
7. <colorimeter_reading><patch>
8. <colorimeter_reading><RGB><red>
9. <colorimeter_reading><RGB><green>
10. <colorimeter_reading><RGB><blue>
11. <colorimeter_reading><RGB resolution="8">
12. <colorimeter_reading><device> X-Rite Digital Swatchbook
13. <colorimeter_reading><patch> cyan
14. <colorimeter_reading><RGB resolution="8"><red> 0
15. <colorimeter_reading><RGB resolution="8"><green> 255
16. <colorimeter_reading><RGB resolution="8"><blue> 255
Entries 1 through 5 are data only, entries 6 through 11 are tag context only, and entries 12 through 16 are both.
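To make the three rules concrete, the sketch below builds a toy pattern index and answers a simple query by direct lookup. It is not how NeoCore XMS is implemented: the dictionary of sets, the function names, and the document identifier "doc-1" are all assumptions made for illustration. It creates no partial tag context entries, and it keeps attributes inline in every tag context, so its keys differ slightly from the list above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The fragment from Listing 1.2 (attribute value quoted for the parser).
FRAGMENT = """
<colorimeter_reading>
  <device> X-Rite Digital Swatchbook </device>
  <patch> cyan </patch>
  <RGB resolution="8">
    <red> 0 </red>
    <green> 255 </green>
    <blue> 255 </blue>
  </RGB>
</colorimeter_reading>
"""


def open_tag(elem):
    """Render an element's opening tag, attributes included."""
    attrs = "".join(f' {name}="{value}"' for name, value in elem.attrib.items())
    return f"<{elem.tag}{attrs}>"


def context_list(elem, prefix=""):
    """Yield (tag_context, data) for every data element at or below elem."""
    context = prefix + open_tag(elem)
    data = (elem.text or "").strip()
    if data:
        yield context, data
    for child in elem:
        yield from context_list(child, context)


def build_index(xml_text, doc_id):
    """Build a toy index: key -> set of ids of documents containing it."""
    index = defaultdict(set)
    for context, data in context_list(ET.fromstring(xml_text)):
        index[data].add(doc_id)                 # rule 1: the data element
        index[context].add(doc_id)              # rule 2: the complete tag context
        index[f"{context} {data}"].add(doc_id)  # rule 3: tag context plus data
    return index


if __name__ == "__main__":
    index = build_index(FRAGMENT, "doc-1")  # "doc-1" is a made-up identifier
    # A simple query ("which documents are for a patch called cyan?")
    # becomes a single dictionary lookup.
    print(index["<colorimeter_reading><patch> cyan"])
```

Because the two data values of 255 share a single key, the toy index holds fewer keys than the numbered list has entries. The point of the sketch is only that every key is derived from the document itself, with no schema or indexing instructions supplied.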
At this point it is important to consider how performance will be affected by the structure of the XML document. Because the inherent patterns inferred from the XML itself can be used to automatically build a database, the degree to which those patterns match likely queries will have a big effect on performance, especially in data-centric applications where single data elements or subdocuments need to be accessed without having to process an entire XML document.