Programming XML: SAX and DOM
- Simple API For XML Version 2 (SAX2)
- Auxiliary SAX Interfaces
- SAX and I/O
- SAX Error Handling
- The Glue of SAX: XMLReader
- The Document Object Model
- The Object Model
- The DOM and Factories
- The Node Interface
- Parents and Children
- Nonhierarchical Nodes
- Text Nodes
- Element and Attribute Nodes
- Document, Document Type, and Entity Nodes
- Bulk Insertion Using Document Fragment
- DOM Error Handling
- Implementation vs Interface
- DOM Traversal
- Where Are We?
printf("<%s %s=\"%s\"/>\n", elemName, attName, attVal);
Anonymous, 1998
The previous chapter described the XML Information Set, which is the normative definition of an XML document's abstract data model. The chapter presented a variety of example XML documents and document fragments in their serialized form as part of the discussion of various Infoset information items. One reason this approach was used was to avoid the complete alienation of readers already familiar with XML's serialization format. The primary reason, however, was to demonstrate that there is an isomorphic translation between the data model of the Infoset and the serialization format known as XML 1.0 + namespaces.
This chapter builds on this translation and describes two common techniques for translating between the Infoset and a form suitable for use by computer programs. Both techniques take the abstractions of the Infoset and project them onto an object model that allows programmers to work in terms of abstract information items rather than the angle brackets and character references of XML's serialization format. The two techniques are known as the Simple API for XML (SAX) and the Document Object Model (DOM).
Both SAX and DOM are sets of abstract programmatic interfaces that model the XML information set. The SAX and DOM approaches differ in two fundamental ways. First, the DOM is a W3C Candidate Recommendation and carries with it the weight of the W3C (this is both good and bad). In contrast, SAX is a de facto standard developed by a group of developers on the XML-DEV mailing list and supervised by David Megginson.1 The lack of "official" endorsement of SAX has not prevented most major XML products from supporting SAX and DOM as peer technologies.
The more important distinction between SAX and DOM is the difference between their basic technical approaches. SAX is a set of streaming interfaces that decompose the Infoset of an XML document into a linear sequence of well-known method calls. DOM is a set of traversal interfaces that decompose the Infoset of an XML document into a hierarchical tree of generic objects/nodes. DOM is best suited to applications that need to retain an XML document in memory for generic traversal or manipulation. SAX-based applications typically have no need to retain a generic view of the XML Infoset in memory.2 Fortunately, since both SAX and DOM are isomorphisms of the Infoset, one can typically mix the two with very little trouble (in fact, many DOM-based implementations are built in terms of an underlying SAX-based code base).
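The contrast between the streaming and tree-based approaches can be sketched by counting artist elements both ways. This is only an illustration: the JAXP factory classes (SAXParserFactory, DocumentBuilderFactory) and the DefaultHandler helper are not discussed in the text and are assumed here as one convenient way to obtain a parser, and the class name SaxVersusDom is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxVersusDom {
    static final String XML = "<period><artist/><artist/></period>";

    // SAX: the parser pushes a stream of calls at the handler;
    // the only state retained is the counter.
    static int countWithSax(String xml) {
        final int[] count = {0};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if ("artist".equals(localName)) count[0]++;
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    // DOM: the whole document is built in memory first, then queried.
    static int countWithDom(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("artist").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SAX count: " + countWithSax(XML));
        System.out.println("DOM count: " + countWithDom(XML));
    }
}
```

Both methods arrive at the same count; the difference is that the DOM version holds a generic tree in memory while the SAX version sees each element only once as it streams past.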
Simple API For XML Version 2 (SAX2)
SAX is a set of abstract programmatic interfaces that project the Infoset of an XML document onto a stream of well-known method calls. At the time of this writing, version 2 of SAX was in beta form and will likely become finalized by the time this book is published. Version 1 of SAX was standardized prior to namespaces or the Infoset and requires proprietary extensions to be useful for modern XML applications. For that reason, this book ignores version 1 of SAX and uses the term SAX as a synonym for SAX2.3 At the time of this writing, SAX had only been defined for the Java programming language. However, efforts to map SAX to C++, Perl, Python, and COM were all in various stages of development. Figure 2.1 presents the UML model of the SAX2 interface suite.
Figure 2.1. The UML model of the SAX2 interface suite.
The primary interface of SAX is ContentHandler. The ContentHandler interface models most of the information set core as an ordered sequence of method calls. The remaining information set items are modeled by the DTDHandler, DeclHandler and LexicalHandler interfaces, which are described later in this chapter. The following is the Java version of ContentHandler:
    package org.xml.sax;
    public interface ContentHandler {
        // signals the beginning/end of a document
        void startDocument() throws SAXException;
        void endDocument() throws SAXException;
        // signals the beginning/end of an element
        void startElement(String namespaceURI, String localName,
                          String qName, Attributes atts)
            throws SAXException;
        void endElement(String namespaceURI, String localName,
                        String qName) throws SAXException;
        // signals a namespace declaration entering/leaving scope
        void startPrefixMapping(String prefix, String uri)
            throws SAXException;
        void endPrefixMapping(String prefix) throws SAXException;
        // signals character data in element content
        void characters(char ch[], int start, int length)
            throws SAXException;
        // signals ignorable whitespace in element content
        void ignorableWhitespace(char ch[], int start, int length)
            throws SAXException;
        // signals a processing instruction
        void processingInstruction(String target, String data)
            throws SAXException;
        // signals a skipped entity reference
        void skippedEntity(String name) throws SAXException;
        // supplies context information about the caller
        void setDocumentLocator(Locator locator);
    }
This interface is implemented by code that wishes to "receive" an XML document and consumed by code that wishes to "send" an XML document. A component that emits serialized XML would implement ContentHandler. A component that parses serialized XML would consume ContentHandler. Since the typical application both consumes and emits XML documents, an application programmer will likely wind up both implementing and consuming this interface.
The protocol of the ContentHandler interface implies that a certain amount of context information will be retained between method calls. In particular, for information items that have a [children] property (for example, document and element information items), a given information item will be represented by at least two method invocations, one signaling the "beginning" of the item and another signaling the "end." Any intermediate method invocations that may occur between these two signals correspond to [children] property of the "current" information item. For example, a document information item will be represented by a call to startDocument followed by a call to endDocument. In between these two calls, there will be at least one startElement/endElement pair representing the lone element information item in the document's [children] property. There may also be calls to ContentHandler.processingInstruction (or LexicalHandler.comment) representing additional information items that are also in the document's [children] property. Implementations of ContentHandler are expected to retain some notion of context in order to properly interpret the method invocations issued by the caller.
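The need to retain context between calls can be illustrated with a handler that tracks how deeply nested the current element is. This is a minimal sketch rather than a prescribed pattern: the class name DepthTracker is hypothetical, DefaultHandler is used only to avoid writing empty bodies for the other ContentHandler methods, and the JAXP SAXParserFactory is assumed as the source of a parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DepthTracker extends DefaultHandler {
    // context retained between method calls: how many elements are open
    private int depth = 0;
    final List<String> log = new ArrayList<>();

    @Override public void startDocument() { log.add("startDocument"); }
    @Override public void endDocument()   { log.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        log.add("depth " + depth + ": start " + localName);
        depth++; // everything until the matching endElement is a child
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
        log.add("depth " + depth + ": end " + localName);
    }

    // parses the given text and returns the recorded call sequence
    static List<String> trace(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            DepthTracker handler = new DepthTracker();
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), handler);
            return handler.log;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        trace("<period><artist/><artist/></period>")
            .forEach(System.out::println);
    }
}
```

The handler never sees an element as a single object; it reconstructs the parent/child relationship purely from the order of the start and end calls.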
The most heavily utilized methods of ContentHandler are startElement and endElement. The startElement method signals the beginning of a new element information item. The endElement method signals the ending of the current element information item. All methods invoked between startElement/endElement correspond to the [children] property of the corresponding element information item. Both methods have a similar set of parameters.
    void startElement(String namespaceURI, String localName,
                      String qName, Attributes atts)
        throws SAXException;
    void endElement(String namespaceURI, String localName,
                    String qName) throws SAXException;
The namespaceURI and localName parameters correspond directly to the [namespace URI] and [local name] Infoset properties. The atts parameter corresponds to the [attributes] property. Finally, the qName parameter corresponds to the QName of the element. Depending on which SAX features are supported by the caller, this parameter may simply be the empty string. The configuration of SAX features is discussed later in this chapter, but the default behavior is to not report the QName of the element.
Consider the following Java code that consumes a ContentHandler interface:
    void emit(org.xml.sax.ContentHandler handler)
        throws org.xml.sax.SAXException {
        org.xml.sax.Attributes a =
            new org.xml.sax.helpers.AttributesImpl();
        handler.startElement("", "period", "", a);
        handler.startElement("", "artist", "", a);
        handler.endElement("", "artist", "");
        handler.startElement("", "artist", "", a);
        handler.endElement("", "artist", "");
        handler.endElement("", "period", "");
    }
This set of method invocations corresponds directly to the following XML fragment:
<period xmlns=""><artist/><artist/></period>
In fact, one could easily imagine a simple implementation of ContentHandler that emitted XML as its methods are invoked.
    class Emitter implements org.xml.sax.ContentHandler {
        CharacterStream out;
        public void startElement(String namespaceURI, String localName,
                                 String qName,
                                 org.xml.sax.Attributes atts) {
            out.write("<" + localName + " xmlns=\"" + namespaceURI + "\">");
        }
        public void endElement(String namespaceURI, String localName,
                               String qName) {
            // an end tag begins with "</"
            out.write("</" + localName + ">");
        }
        // other ContentHandler methods elided for clarity
    }
Note that this overly simplistic implementation makes no attempt to collapse start and end tags for empty elements, nor does it do anything reasonable with namespace declarations or prefixes.4
It is difficult to look at the ContentHandler interface without also looking at one of the interfaces that it relies on: Attributes. The Attributes interface models the [attributes] property of an element information item. It exposes an element's attributes as an unordered property bag that can be traversed by name or position. The following is the Java version of Attributes:
    package org.xml.sax;
    public interface Attributes {
        // return the number of attributes in the list
        int getLength();
        // look up an attribute's Namespace URI, local name or raw
        // XML 1.0 name by index
        String getURI(int index);
        String getLocalName(int index);
        String getQName(int index);
        // look up an attribute's index by Namespace or raw name
        int getIndex(String uri, String localPart);
        int getIndex(String qName);
        // look up an attribute's value
        String getValue(String uri, String localName);
        String getValue(int index);
        String getValue(String qName);
        // look up an attribute's type
        String getType(String uri, String localName);
        String getType(int index);
        String getType(String qName);
    }
For convenience, the Java version of SAX provides a default implementation of this interface (AttributesImpl) that allows populating the collection via the following method:
public void addAttribute(String uri, String localName, String qName, String type, String value);
The following Java code fragment demonstrates how to create an attribute collection that contains three attributes:
    org.xml.sax.Attributes create() {
        org.xml.sax.helpers.AttributesImpl atts =
            new org.xml.sax.helpers.AttributesImpl();
        atts.addAttribute("", "a", "", "CDATA", "Hello, World");
        atts.addAttribute("", "b", "", "NMTOKEN", "Hello");
        atts.addAttribute("http://www.w3.org/1999/xlink", "href", "",
                          "CDATA", "#foo");
        return (org.xml.sax.Attributes)atts;
    }
Note that in this example, the qName parameter is the empty string. This is consistent with the default behavior of ContentHandler.
Implementations of ContentHandler.startElement receive an Attributes implementation as the last parameter. This is the one chance that the ContentHandler implementation gets to see the attribute names, values, and types. The following startElement handler prints out the value of the href attribute that is qualified by the XLink namespace URI:
    void startElement(String namespaceURI, String localName,
                      String qName, Attributes atts) {
        // look up attribute for this element
        String val = atts.getValue("http://www.w3.org/1999/xlink", "href");
        // test for presence and act accordingly
        if (val != null)
            System.out.println("Link to " + val);
        else
            System.out.println("No link attribute present");
    }
Attributes can also be accessed by position, but because the [attributes] Infoset property is an unordered collection, the actual order in which the attributes appear is insignificant.
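Positional access is mostly useful for enumerating every attribute without knowing the names in advance. The following sketch walks an Attributes collection by index; the class name AttributeDump and its methods are hypothetical, and the sample collection simply repeats the three attributes created earlier.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

class AttributeDump {
    // formats each attribute as "{namespaceURI}localName (type) = value"
    static String describe(Attributes atts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < atts.getLength(); i++) {
            sb.append('{').append(atts.getURI(i)).append('}')
              .append(atts.getLocalName(i))
              .append(" (").append(atts.getType(i)).append(") = ")
              .append(atts.getValue(i))
              .append('\n');
        }
        return sb.toString();
    }

    // the same three attributes used in the earlier create() example
    static Attributes sample() {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "a", "", "CDATA", "Hello, World");
        atts.addAttribute("", "b", "", "NMTOKEN", "Hello");
        atts.addAttribute("http://www.w3.org/1999/xlink", "href", "",
                          "CDATA", "#foo");
        return atts;
    }

    public static void main(String[] args) {
        System.out.print(describe(sample()));
    }
}
```

Code written this way should not attach meaning to the index at which a given attribute happens to appear; lookups that care about a specific attribute are better expressed by namespace URI and local name.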
SAX treats namespace declarations as distinct facets of an element information item. Because XML documents are increasingly using the QName datatype in element and attribute content, the actual namespace prefix-to-URI mappings that are in scope need to be known by the ContentHandler implementation. Acknowledging the fact that namespace declarations and attributes are distinct Infoset information items, SAX models namespace declarations as a distinct pair of ContentHandler methods and does not deliver them as part of the Attributes collection at startElement-time. The startPrefixMapping method is called just prior to the startElement and corresponds to the namespace declarations of the element about to be processed. Once all of the element content has been processed, the endPrefixMapping method is called after issuing the endElement method call. Consider the following serialized element information item:
<artist xmlns='uri-one' xmlns:two='uri-two' xmlns:three='uri-three' />
This element information item corresponds to the following sequence of Java method invocations:
    void emit2(org.xml.sax.ContentHandler handler)
        throws org.xml.sax.SAXException {
        org.xml.sax.Attributes a =
            new org.xml.sax.helpers.AttributesImpl();
        // indicate namespace declarations coming into scope
        handler.startPrefixMapping("", "uri-one");
        handler.startPrefixMapping("two", "uri-two");
        handler.startPrefixMapping("three", "uri-three");
        // indicate element start and finish
        handler.startElement("uri-one", "artist", "", a);
        handler.endElement("uri-one", "artist", "");
        // indicate namespace declarations leaving scope
        handler.endPrefixMapping("three");
        handler.endPrefixMapping("two");
        handler.endPrefixMapping("");
    }
Note that the protocol of ContentHandler does not require the startPrefixMapping/endPrefixMapping calls to occur in the same order (or reverse order). The only ordering requirement is that all startPrefixMapping calls occur immediately prior to the corresponding startElement call and that all endPrefixMapping calls occur immediately after the corresponding endElement call.
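The ordering rule can be observed by recording the calls a namespace-aware parser makes for the artist element. The sketch below is illustrative only: PrefixMappingTrace is a hypothetical class, the JAXP SAXParserFactory is assumed as the parser source, and the relative order of the three startPrefixMapping calls (and of the three endPrefixMapping calls) may differ from one parser to another.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PrefixMappingTrace extends DefaultHandler {
    static final String XML =
        "<artist xmlns='uri-one' xmlns:two='uri-two' xmlns:three='uri-three' />";

    final List<String> events = new ArrayList<>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        events.add("startPrefixMapping [" + prefix + "] " + uri);
    }

    @Override
    public void endPrefixMapping(String prefix) {
        events.add("endPrefixMapping [" + prefix + "]");
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        events.add("startElement {" + uri + "}" + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement {" + uri + "}" + localName);
    }

    // parses the given text and returns the recorded call sequence
    static List<String> trace(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            PrefixMappingTrace handler = new PrefixMappingTrace();
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        trace(XML).forEach(System.out::println);
    }
}
```

Whatever order the individual mappings arrive in, the startElement call follows all of the startPrefixMapping calls, and the endPrefixMapping calls follow the endElement call.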
To lighten the load of ContentHandler implementers, SAX provides a built-in class called NamespaceSupport that provides most of the default processing one would need to properly deal with QNames in attribute/element content. The following is the public interface to NamespaceSupport:
    package org.xml.sax.helpers;
    public class NamespaceSupport {
        // the XML Namespace URI as a constant
        public final static String XMLNS =
            "http://www.w3.org/XML/1998/namespace";
        // reset this NamespaceSupport object for reuse
        public void reset();
        // enter/leave a new Namespace scope
        public void pushContext();
        public void popContext();
        // add a namespace declaration to the current scope
        public boolean declarePrefix(String prefix, String uri);
        // process a raw XML 1.0 name
        public String[] processName(String qName, String parts[],
                                    boolean isAttribute);
        // resolve prefix against in-scope namespaces
        public String getURI(String prefix);
        // return all in-scope namespace prefixes
        public java.util.Enumeration getPrefixes();
        // return prefixes declared specifically in current scope
        public java.util.Enumeration getDeclaredPrefixes();
    }
The NamespaceSupport class keeps a stack of namespace declaration scopes. Calling pushContext starts a new scope; calling popContext reverts back to the previous scope. Assuming that each namespace declaration has been inserted using declarePrefix, the getURI method will return the namespace URI that corresponds to a given NCName-based prefix.
ContentHandler implementations typically use the NamespaceSupport class as follows:
    class MyHandler implements org.xml.sax.ContentHandler {
        org.xml.sax.helpers.NamespaceSupport ns =
            new org.xml.sax.helpers.NamespaceSupport();
        public void startPrefixMapping(String prefix, String uri) {
            ns.pushContext();
            ns.declarePrefix(prefix, uri);
        }
        // endPrefixMapping receives only the prefix
        public void endPrefixMapping(String prefix) {
            ns.popContext();
        }
        // other ContentHandler methods elided for clarity
    }
Given this implementation of startPrefixMapping/endPrefixMapping, one can now look up the correct mapping of a namespace prefix by calling the getURI method. Additionally, the processName method can be used to crack a QName into its constituent components.
    String[] ss = new String[3];
    ss = ns.processName("two:LName", ss, false);
This would result in the following three-tuple if called against the namespace declarations from the artist element shown earlier in this chapter.
{ "uri-two", "LName", "two:LName" }
If called using the string "LName", one would have gotten
{ "uri-one", "LName", "LName" }
assuming the isAttribute parameter was false (note that the default namespace of the artist element was uri-one). Had the isAttribute parameter been set to true, the QName would have been interpreted according to the rules of attribute names, which means that a name with no prefix belongs to no namespace and thus would have yielded the following three-tuple:
{ "", "LName", "LName" }
Note that the first string is the empty string.
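The three results above can be reproduced directly against NamespaceSupport without involving a parser. In this sketch the class name ProcessNameDemo and its helper methods are hypothetical; the declarations mirror those of the artist element.

```java
import java.util.Arrays;
import org.xml.sax.helpers.NamespaceSupport;

class ProcessNameDemo {
    // builds a scope holding the artist element's namespace declarations
    static NamespaceSupport artistScope() {
        NamespaceSupport ns = new NamespaceSupport();
        ns.pushContext();
        // all declarations are made before any name is processed
        ns.declarePrefix("", "uri-one");
        ns.declarePrefix("two", "uri-two");
        ns.declarePrefix("three", "uri-three");
        return ns;
    }

    // cracks a QName into { namespace URI, local name, raw QName }
    static String[] crack(String qName, boolean isAttribute) {
        return artistScope().processName(qName, new String[3], isAttribute);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(crack("two:LName", false)));
        System.out.println(Arrays.toString(crack("LName", false)));
        System.out.println(Arrays.toString(crack("LName", true)));
        // a prefix with no declaration in scope cannot be resolved
        System.out.println(crack("nope:LName", false) == null);
    }
}
```

Note that processName returns null rather than throwing when the prefix has no declaration in scope, so callers that accept QNames from untrusted content should check for that case.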
The discussion so far has focused on the basic structure of a document's elements and has ignored the content model of each element. SAX defines four additional ContentHandler methods that are used to signal the presence of nonelement [children] facets of the current element. The simplest of these methods is the processingInstruction method.
void processingInstruction(String target, String data) throws SAXException;
Consider the following serialized processing instruction:
<?hack Magnum PI?>
This PI would be conveyed in SAX as follows:
    void emit3(org.xml.sax.ContentHandler handler)
        throws org.xml.sax.SAXException {
        handler.processingInstruction("hack", "Magnum PI");
    }
As processing instructions are also valid [children] of the document information item, calls to processingInstruction may occur prior to the first startElement and after the final endElement. However, all processingInstruction and startElement calls will be surrounded by a pair of calls to startDocument and endDocument that signal the beginning and end of the document information item.
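To see processing instructions arriving outside the root element, one can record the call sequence for a small document that has a PI on either side of its only element. As before, this is only a sketch: DocumentChildrenTrace is a hypothetical class and the JAXP SAXParserFactory is assumed as the parser source.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DocumentChildrenTrace extends DefaultHandler {
    static final String XML = "<?hack Magnum PI?><period/><?hack after?>";

    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument"); }
    @Override public void endDocument()   { events.add("endDocument"); }

    @Override
    public void processingInstruction(String target, String data) {
        events.add("pi " + target + ": " + data);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        events.add("start " + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + localName);
    }

    // parses the given text and returns the recorded call sequence
    static List<String> trace(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentChildrenTrace handler = new DocumentChildrenTrace();
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        trace(XML).forEach(System.out::println);
    }
}
```

The first PI is reported before the startElement call for period and the second after its endElement call, with everything bracketed by startDocument and endDocument.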
For elements whose content model is mixed or text only, the characters method must be called to convey the character data that appears as element content. For elements whose content model is known to be element only, any interleaving whitespace between child elements may be delivered using the ignorableWhitespace method.5 Both methods take an array of characters as a parameter. An initial offset and a length are provided to indicate which subset of the array contains the actual content. Consider the following element information item:
<x xmlns='uri-one'>Hello, World</x>
The following Java code shows the corresponding SAX ContentHandler call sequence:
    void emit4(org.xml.sax.ContentHandler handler)
        throws org.xml.sax.SAXException {
        org.xml.sax.Attributes a =
            new org.xml.sax.helpers.AttributesImpl();
        handler.startElement("uri-one", "x", "", a);
        char[] rgch = "Hello, World".toCharArray();
        handler.characters(rgch, 0, rgch.length);
        handler.endElement("uri-one", "x", "");
    }
The offset and length parameters are Java-isms that allow Java-based XML parsers to avoid excessive memory movement.
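On the receiving side, a handler must honor the offset and length rather than assume the whole array is content. The sketch below accumulates text from a caller that hands over slices of a larger buffer, as a parser might when it passes its internal read buffer; the class name TextCollector and the two-call split are illustrative assumptions.

```java
import org.xml.sax.helpers.DefaultHandler;

class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // copy only the indicated slice; the rest of the array is not content
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // simulates a caller that delivers the content of <x> in two pieces
    // taken from one shared buffer
    static String demo() {
        TextCollector handler = new TextCollector();
        char[] buffer = "<x>Hello, World</x>".toCharArray();
        handler.characters(buffer, 3, 7);   // "Hello, "
        handler.characters(buffer, 10, 5);  // "World"
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Accumulating into a buffer matters because a caller is free to deliver one run of character data as several consecutive characters calls.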
The final [children]-related method is skippedEntity, whose signature looks like the following:
void skippedEntity(String name) throws SAXException;
This method corresponds to the reference to a skipped entity information item as a child of the current element. It signals the presence of an entity reference that will not be expanded by the caller. This method exists primarily due to a loophole in the XML 1.0 specification that allows nonvalidating parsers to skip external parsed entities.
Because SAX is commonly used to interface with XML parsers, it is occasionally useful for a ContentHandler implementation to discover exactly what portion of which document the parser is currently working on. To support this functionality, SAX defines the Locator interface, which is typically implemented by SAX-aware parsers to allow implementations of ContentHandler to discover where in the underlying serialized form the current method call originates. The following is the Java version of Locator:
    package org.xml.sax;
    public interface Locator {
        String getPublicId();
        String getSystemId();
        int getLineNumber();
        int getColumnNumber();
    }
For convenience, the Java version of SAX provides a default implementation of this interface (LocatorImpl) that has four corresponding "setter" methods to allow setting of the various location properties. SAX parsers make this interface available to ContentHandler implementations by calling the setDocumentLocator method prior to calling any other ContentHandler methods.
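A handler that wants location information typically stores the Locator it is handed and consults it inside later callbacks. The following sketch records the line number reported at each startElement call; LineReporter is a hypothetical class, the JAXP SAXParserFactory is assumed as the parser source, and the handler tolerates a caller that never supplies a locator.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class LineReporter extends DefaultHandler {
    private Locator locator; // stays null if the caller never provides one
    final Map<String, Integer> lines = new LinkedHashMap<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // called before any other ContentHandler method
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        int line = (locator != null) ? locator.getLineNumber() : -1;
        lines.put(localName, line);
    }

    // parses the given text and returns element name -> reported line
    static Map<String, Integer> scan(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            LineReporter handler = new LineReporter();
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), handler);
            return handler.lines;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(scan("<period>\n  <artist/>\n</period>"));
    }
}
```

The values returned by a Locator are meaningful only during a callback, so the handler reads the line number inside startElement rather than caching it ahead of time.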