Programming XML: SAX and DOM

The Infoset is a nice idea, but it is barely useful if software cannot be written against it. Fortunately, there are two widely accepted programmatic interfaces based on the Infoset that allow documents to be manipulated and [de]serialized at the Infoset level, not at the character-stream/markup level: SAX and DOM.
This chapter is from the book

printf("<%s %s=\"%s\"/>\n", elemName, attName, attVal);

Anonymous, 1998

The previous chapter described the XML Information Set, which is the normative definition of an XML document's abstract data model. The chapter presented a variety of example XML documents and document fragments in their serialized form as part of the discussion of various Infoset information items. One reason this approach was used was to avoid the complete alienation of readers already familiar with XML's serialization format. The primary reason, however, was to demonstrate that there is an isomorphic translation between the data model of the Infoset and the serialization format known as XML 1.0 + namespaces.

This chapter builds on this translation and describes two common techniques for translating between the Infoset and a format suitable for use by computer programs. Both techniques project the abstractions of the Infoset onto an object model that allows programmers to work in terms of the abstract information items, not the angle brackets and character references of XML's serialization format. They are known as the Simple API for XML (SAX) and the Document Object Model (DOM).

Both SAX and DOM are sets of abstract programmatic interfaces that model the XML information set. The SAX and DOM approaches differ in two fundamental ways. First, the DOM is a W3C Candidate Recommendation and carries with it the weight of the W3C (this is both good and bad). In contrast, SAX is a de facto standard developed by a group of developers on the XML-DEV mailing list and supervised by David Megginson.1 The lack of "official" endorsement of SAX has not prevented most major XML products from supporting SAX and DOM as peer technologies.

The more important distinction between SAX and DOM is the difference between their basic technical approaches. SAX is a set of streaming interfaces that decompose the Infoset of an XML document into a linear sequence of well-known method calls. DOM is a set of traversal interfaces that decompose the Infoset of an XML document into a hierarchical tree of generic objects/nodes. DOM is best suited to applications that need to retain an XML document in memory for generic traversal or manipulation. SAX-based applications typically have no need to retain a generic view of the XML Infoset in memory.2 Fortunately, since both SAX and DOM are isomorphisms of the Infoset, one can typically mix the two with very little trouble (in fact, many DOM-based implementations are built in terms of an underlying SAX-based code base).
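
As a rough illustration of the streaming/tree distinction, the following sketch counts the elements of one small document twice: once from SAX callbacks and once by walking a DOM tree. It is not taken from the SAX or DOM specifications. It assumes the JAXP factory classes (javax.xml.parsers) and the DefaultHandler convenience class that ship with current Java platforms; the class name CountBothWays and the sample document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of SAX or DOM
class CountBothWays {
  static final String XML = "<period><artist/><artist/></period>";

  // SAX: the parser pushes events; only the counter is retained
  static int countWithSax() throws Exception {
    final int[] count = {0};
    SAXParserFactory f = SAXParserFactory.newInstance();
    f.setNamespaceAware(true);
    f.newSAXParser().parse(new InputSource(new StringReader(XML)),
      new DefaultHandler() {
        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
          count[0]++;
        }
      });
    return count[0];
  }

  // DOM: the whole document is built in memory, then traversed
  static int countWithDom() throws Exception {
    DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
    f.setNamespaceAware(true);
    Document doc = f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
    return countElements(doc.getDocumentElement());
  }

  // recursive walk over the retained tree
  static int countElements(Node n) {
    int total = (n.getNodeType() == Node.ELEMENT_NODE) ? 1 : 0;
    for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling())
      total += countElements(c);
    return total;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(countWithSax() + " " + countWithDom());
  }
}
```

The SAX version never holds more than one integer of state, while the DOM version can revisit any node after parsing has finished; which trade-off is appropriate depends on the application.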

Simple API For XML Version 2 (SAX2)

SAX is a set of abstract programmatic interfaces that project the Infoset of an XML document onto a stream of well-known method calls. At the time of this writing, version 2 of SAX was in beta form and will likely become finalized by the time this book is published. Version 1 of SAX was standardized prior to namespaces or the Infoset and requires proprietary extensions to be useful for modern XML applications. For that reason, this book ignores version 1 of SAX and uses the term SAX as a synonym for SAX2.3 At the time of this writing, SAX had only been defined for the Java programming language. However, efforts to map SAX to C++, Perl, Python, and COM were all in various stages of development. Figure 2.1 presents the UML model of the SAX2 interface suite.

Figure 2.1. The UML model of the SAX2 interface suite.


The primary interface of SAX is ContentHandler. The ContentHandler interface models most of the information set core as an ordered sequence of method calls. The remaining information set items are modeled by the DTDHandler, DeclHandler and LexicalHandler interfaces, which are described later in this chapter. The following is the Java version of ContentHandler:

package org.xml.sax;
public interface ContentHandler {
// signals the beginning/end of a document
  void startDocument () throws SAXException;
  void endDocument() throws SAXException;

// signals the beginning/end of an element
  void startElement(String namespaceURI, String localName,
                    String qName, Attributes atts)
            throws SAXException;
  void endElement(String namespaceURI, String localName,
                   String qName) throws SAXException;

// signals a namespace declaration entering/leaving scope
  void startPrefixMapping(String prefix, String uri)
            throws SAXException;
  void endPrefixMapping(String prefix) throws SAXException;

// signals character data in element content
  void characters(char ch[], int start, int length)
            throws SAXException;

// signals ignorable whitespace in element content
  void ignorableWhitespace(char ch[], int start, int length)
                           throws SAXException;

// signals a processing instruction
  void processingInstruction(String target, String data)
            throws SAXException;
// signals a skipped entity reference
  void skippedEntity (String name) throws SAXException;

// supplies context information about the caller
  void setDocumentLocator (Locator locator);
}

This interface is implemented by code that wishes to "receive" an XML document and consumed by code that wishes to "send" an XML document. A component that emits serialized XML would implement ContentHandler. A component that parses serialized XML would consume ContentHandler. Since the typical application both consumes and emits XML documents, an application programmer will likely wind up both implementing and consuming this interface.

The protocol of the ContentHandler interface implies that a certain amount of context information will be retained between method calls. In particular, for information items that have a [children] property (for example, document and element information items), a given information item will be represented by at least two method invocations, one signaling the "beginning" of the item and another signaling the "end." Any intermediate method invocations that occur between these two signals correspond to the [children] property of the "current" information item. For example, a document information item will be represented by a call to startDocument followed by a call to endDocument. In between these two calls, there will be at least one startElement/endElement pair representing the lone element information item in the document's [children] property. There may also be calls to ContentHandler.processingInstruction (or LexicalHandler.comment) representing additional information items that are also in the document's [children] property. Implementations of ContentHandler are expected to retain some notion of context in order to properly interpret the method invocations issued by the caller.
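
One way to retain the context this protocol implies is to keep a stack of the currently open elements. The sketch below is an illustration only: PathTracker is a hypothetical class, it extends the DefaultHandler convenience class from org.xml.sax.helpers so that only the relevant callbacks need to be written, and it obtains a parser through JAXP, which is an assumption that goes beyond the interfaces described in this chapter.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records the path of every element it sees
class PathTracker extends DefaultHandler {
  // context retained between calls: the chain of open elements
  private final Deque<String> open = new ArrayDeque<>();
  final List<String> paths = new ArrayList<>();

  @Override
  public void startElement(String uri, String localName,
                           String qName, Attributes atts) {
    open.addLast(localName);                 // element becomes "current"
    paths.add("/" + String.join("/", open));
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    open.removeLast();                       // parent is "current" again
  }

  static List<String> pathsOf(String xml) throws Exception {
    SAXParserFactory f = SAXParserFactory.newInstance();
    f.setNamespaceAware(true);               // so localName is populated
    PathTracker h = new PathTracker();
    f.newSAXParser().parse(new InputSource(new StringReader(xml)), h);
    return h.paths;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(pathsOf("<period><artist/><artist/></period>"));
  }
}
```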

The most heavily utilized methods of ContentHandler are startElement and endElement. The startElement method signals the beginning of a new element information item. The endElement method signals the ending of the current element information item. All methods invoked between startElement/endElement correspond to the [children] property of the corresponding element information item. Both methods have a similar set of parameters.

void startElement(String namespaceURI,
                  String localName,
                  String qName,
                  Attributes atts) throws SAXException;
void endElement(  String namespaceURI,
                  String localName,
                  String qName) throws SAXException;

The namespaceURI and localName parameters correspond directly to the [namespace URI] and [local name] Infoset properties. The atts parameter corresponds to the [attributes] property. Finally, the qName parameter corresponds to the QName of the element. Depending on which SAX features are supported by the caller, this parameter may simply be the empty string. The configuration of SAX features is discussed later in this chapter, but the default behavior is to not report the QName of the element.
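
Because the qName parameter may be empty, a handler that wants a printable name for an element cannot rely on it. The following is a minimal sketch of one possible fallback; ElementNames and displayName are hypothetical names, and the "{uri}local" notation is just a common informal convention, not something SAX defines.

```java
// Hypothetical helper; not part of SAX
class ElementNames {
  // Prefer the QName when the caller reports one; otherwise build a
  // name from the namespace URI and local name, which are the values
  // that correspond to the Infoset properties.
  static String displayName(String namespaceURI, String localName,
                            String qName) {
    if (qName != null && !qName.isEmpty())
      return qName;
    if (namespaceURI == null || namespaceURI.isEmpty())
      return localName;
    return "{" + namespaceURI + "}" + localName;
  }

  public static void main(String[] args) {
    System.out.println(displayName("uri-one", "artist", ""));
    System.out.println(displayName("", "period", ""));
    System.out.println(displayName("uri-two", "x", "two:x"));
  }
}
```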

Consider the following Java code that consumes a ContentHandler interface:

void emit(org.xml.sax.ContentHandler handler)
           throws org.xml.sax.SAXException {
  org.xml.sax.Attributes a =
                new org.xml.sax.helpers.AttributesImpl();
  handler.startElement("", "period", "", a);
    handler.startElement("", "artist", "", a);
    handler.endElement("", "artist", "");
    handler.startElement("", "artist", "", a);
    handler.endElement("", "artist", "");
  handler.endElement("", "period", "");
}

This set of method invocations corresponds directly to the following XML fragment:

<period xmlns=""><artist/><artist/></period>

In fact, one could easily imagine a simple implementation of ContentHandler that emitted XML as its methods are invoked.

class Emitter implements org.xml.sax.ContentHandler {
  CharacterStream out;
  public void startElement(String namespaceURI,
                String localName, String qName,
                org.xml.sax.Attributes atts) {
    out.write("<" + localName
              + " xmlns=\"" + namespaceURI + "\">");
  }
  public void endElement(String namespaceURI,
                String localName, String qName) {
    out.write("</" + localName + ">");
  }
// other ContentHandler methods elided for clarity
}

Note that this overly simplistic implementation makes no attempt to collapse start and end tags for empty elements, nor does it do anything reasonable with namespace declarations or prefixes.4
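
For readers who want to run the idea, here is a self-contained variant of such an emitter. It is a sketch under stated assumptions: StringEmitter is a hypothetical class, it writes to a StringBuilder instead of the CharacterStream placeholder, and it extends DefaultHandler (from org.xml.sax.helpers) so the remaining ContentHandler methods do not have to be spelled out. It is just as simplistic about namespaces and empty elements as the version above.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical emitter that serializes the calls it receives
class StringEmitter extends DefaultHandler {
  final StringBuilder out = new StringBuilder();

  @Override
  public void startElement(String namespaceURI, String localName,
                           String qName, Attributes atts) {
    out.append("<").append(localName)
       .append(" xmlns=\"").append(namespaceURI).append("\">");
  }

  @Override
  public void endElement(String namespaceURI, String localName,
                         String qName) {
    out.append("</").append(localName).append(">");
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // escape the two characters that cannot appear literally in content
    for (int i = start; i < start + length; i++) {
      if (ch[i] == '<') out.append("&lt;");
      else if (ch[i] == '&') out.append("&amp;");
      else out.append(ch[i]);
    }
  }

  // issues the same call sequence as the emit method shown earlier
  static String demo() {
    StringEmitter h = new StringEmitter();
    Attributes a = new AttributesImpl();
    h.startElement("", "period", "", a);
    h.startElement("", "artist", "", a);
    h.endElement("", "artist", "");
    h.startElement("", "artist", "", a);
    h.endElement("", "artist", "");
    h.endElement("", "period", "");
    return h.out.toString();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```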

It is difficult to look at the ContentHandler interface without also looking at one of the interfaces that it relies on: Attributes. The Attributes interface models the [attributes] property of an element information item. It exposes an element's attributes as an unordered property bag that can be traversed by name or position. The following is the Java version of Attributes:

package org.xml.sax;
public interface Attributes {
// return the number of attributes in the list
  int getLength ();
// look up an attribute's Namespace URI, local name or raw
// XML 1.0 name by index
  String getURI (int index);
  String getLocalName (int index);
  String getQName (int index);
// look up an attribute's index by Namespace or raw name
  int getIndex (String uri, String localPart);
  int getIndex (String qName);
// Look up an attribute's value
  String getValue (String uri, String localName);
  String getValue (int index);
  String getValue (String qName);
// Look up an attribute's type
  String getType (String uri, String localName);
  String getType (int index);
  String getType (String qName);
}

For convenience, the Java version of SAX provides a default implementation of this interface (AttributesImpl) that allows populating the collection via the following method:

public void addAttribute(String uri, String localName,
                String qName, String type, String value);

The following Java code fragment demonstrates how to create an attribute collection that contains three attributes:

org.xml.sax.Attributes create( ) {
  org.xml.sax.helpers.AttributesImpl atts =
                  new org.xml.sax.helpers.AttributesImpl( );
  atts.addAttribute("", "a", "", "CDATA", "Hello, World");
  atts.addAttribute("", "b", "", "NMTOKEN", "Hello");
  atts.addAttribute("http://www.w3.org/1999/xlink", "href",
                    "", "CDATA", "#foo");
  return (org.xml.sax.Attributes)atts;
}

Note that in this example, the qName parameter is the empty string. This is consistent with the default behavior of ContentHandler.

Implementations of ContentHandler.startElement receive an Attributes implementation as the last parameter. This is the one chance that the ContentHandler implementation gets to see the attribute names, values, and types. The following startElement handler prints out the value of the href attribute that is qualified by the XLink namespace URI:

void startElement(String namespaceURI, String localName,
                  String qName, Attributes atts) {
// lookup attribute for this element
  String val = atts.getValue("http://www.w3.org/1999/xlink",
                             "href");
// test for presence and act accordingly
  if (val != null)
    System.out.println("Link to " + val);
  else
    System.out.println("No link attribute present");
}

Attributes can also be accessed by position, but because the [attributes] Infoset property is an unordered collection, the actual order in which the attributes appear is insignificant.
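
Positional access is mostly useful for visiting every attribute once. The sketch below walks an Attributes collection by index using only the methods listed above; AttributeDump, describe, and sample are hypothetical names chosen for this illustration.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper that lists every attribute in a collection
class AttributeDump {
  static String describe(Attributes atts) {
    StringBuilder sb = new StringBuilder();
    // iterate by position; the order itself carries no meaning
    for (int i = 0; i < atts.getLength(); i++) {
      sb.append("{").append(atts.getURI(i)).append("}")
        .append(atts.getLocalName(i))
        .append(" [").append(atts.getType(i)).append("] = ")
        .append(atts.getValue(i)).append("\n");
    }
    return sb.toString();
  }

  // a small collection built the same way as in the create method
  static Attributes sample() {
    AttributesImpl atts = new AttributesImpl();
    atts.addAttribute("", "a", "", "CDATA", "Hello, World");
    atts.addAttribute("http://www.w3.org/1999/xlink", "href",
                      "", "CDATA", "#foo");
    return atts;
  }

  public static void main(String[] args) {
    System.out.print(describe(sample()));
  }
}
```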

SAX treats namespace declarations as distinct facets of an element information item. Because XML documents are increasingly using the QName datatype in element and attribute content, the actual namespace prefix-to-URI mappings that are in scope need to be known by the ContentHandler implementation. Acknowledging the fact that namespace declarations and attributes are distinct Infoset information items, SAX models namespace declarations as a distinct pair of ContentHandler methods and does not deliver them as part of the Attributes collection at startElement-time. The startPrefixMapping method is called just prior to the startElement and corresponds to the namespace declarations of the element about to be processed. Once all of the element content has been processed, the endPrefixMapping method is called after issuing the endElement method call. Consider the following serialized element information item:

<artist
    xmlns='uri-one'
    xmlns:two='uri-two'
    xmlns:three='uri-three'
/>

This element information item corresponds to the following sequence of Java method invocations:

void emit2(org.xml.sax.ContentHandler handler)
           throws org.xml.sax.SAXException {
  org.xml.sax.Attributes a =
                new org.xml.sax.helpers.AttributesImpl();
// indicate namespace declarations coming into scope
  handler.startPrefixMapping("", "uri-one");
  handler.startPrefixMapping("two", "uri-two");
  handler.startPrefixMapping("three", "uri-three");
// indicate element start and finish
  handler.startElement("uri-one", "artist", "", a);
  handler.endElement("uri-one", "artist", "");
// indicate namespace declarations leaving scope
  handler.endPrefixMapping("three");
  handler.endPrefixMapping("two");
  handler.endPrefixMapping("");
}

Note that the protocol of ContentHandler does not require the startPrefixMapping/endPrefixMapping calls to occur in the same order (or reverse order). The only ordering requirement is that all startPrefixMapping calls occur immediately prior to the corresponding startElement call and that all endPrefixMapping calls occur immediately after the corresponding endElement call.
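
The ordering rule can be observed by logging the calls a parser makes for the artist element. The following is a sketch rather than a definitive test: EventLog is a hypothetical class, and it assumes a namespace-aware parser obtained through JAXP. Because the relative order of the prefix-mapping calls is not guaranteed, any code built on this log should check only where those calls fall relative to startElement and endElement.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records the calls it receives, in order
class EventLog extends DefaultHandler {
  final List<String> events = new ArrayList<>();

  @Override
  public void startPrefixMapping(String prefix, String uri) {
    events.add("startPrefixMapping " + prefix + "=" + uri);
  }

  @Override
  public void endPrefixMapping(String prefix) {
    events.add("endPrefixMapping " + prefix);
  }

  @Override
  public void startElement(String uri, String localName,
                           String qName, Attributes atts) {
    events.add("startElement {" + uri + "}" + localName);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    events.add("endElement {" + uri + "}" + localName);
  }

  static List<String> record(String xml) throws Exception {
    SAXParserFactory f = SAXParserFactory.newInstance();
    f.setNamespaceAware(true);   // required for prefix-mapping events
    EventLog log = new EventLog();
    f.newSAXParser().parse(new InputSource(new StringReader(xml)), log);
    return log.events;
  }

  public static void main(String[] args) throws Exception {
    for (String e : record("<artist xmlns='uri-one' "
        + "xmlns:two='uri-two' xmlns:three='uri-three'/>"))
      System.out.println(e);
  }
}
```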

To lighten the load of ContentHandler implementers, SAX provides a built-in class called NamespaceSupport that provides most of the default processing one would need to properly deal with QNames in attribute/element content. The following is the public interface to NamespaceSupport:

package org.xml.sax.helpers;
public class NamespaceSupport {
// The XML Namespace URI as a constant
  public final static String XMLNS =
                     "http://www.w3.org/XML/1998/namespace";
// reset this NamespaceSupport object for reuse
  public void reset( );
// enter/leave a new Namespace scope
  public void pushContext( );
  public void popContext( );
// add a namespace declaration to the current scope
  public boolean declarePrefix(String prefix, String uri);
// Process a raw XML 1.0 name.
  public String [] processName(String qName,
                     String parts[], boolean isAttribute);
// resolve prefix against in-scope namespaces
  public String getURI(String prefix);
// return all in-scope namespace prefixes
  public java.util.Enumeration getPrefixes( );
// return prefixes declared specifically in current scope
  public java.util.Enumeration getDeclaredPrefixes( );
}

The NamespaceSupport class keeps a stack of namespace declaration scopes. Calling pushContext starts a new scope; calling popContext reverts back to the previous scope. Assuming that each namespace declaration has been inserted using declarePrefix, the getURI method will return the namespace URI that corresponds to a given NCName-based prefix.
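
The scope stack can be exercised directly, without a parser. In this small sketch (ScopeDemo and the URI strings are made up for illustration), an inner scope redeclares a prefix, and popping the scope restores the outer binding.

```java
import org.xml.sax.helpers.NamespaceSupport;

// Hypothetical demo of nested namespace scopes
class ScopeDemo {
  // returns the URI bound to "two" inside the inner scope, after
  // leaving it, and after leaving the outer scope as well
  static String[] run() {
    NamespaceSupport ns = new NamespaceSupport();
    ns.pushContext();                      // outer element scope
    ns.declarePrefix("two", "uri-two");
    ns.pushContext();                      // inner element scope
    ns.declarePrefix("two", "uri-inner");  // shadows the outer binding
    String inner = ns.getURI("two");
    ns.popContext();                       // back to the outer scope
    String outer = ns.getURI("two");
    ns.popContext();                       // no declarations remain
    String none = ns.getURI("two");        // null: prefix is unbound
    return new String[] { inner, outer, none };
  }

  public static void main(String[] args) {
    String[] r = run();
    System.out.println(r[0] + " " + r[1] + " " + r[2]);
  }
}
```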

ContentHandler implementations typically use the NamespaceSupport class as follows:

class MyHandler implements org.xml.sax.ContentHandler {
  org.xml.sax.helpers.NamespaceSupport ns =
                new org.xml.sax.helpers.NamespaceSupport( );
  public void startPrefixMapping(String prefix, String uri){
    ns.pushContext( );
    ns.declarePrefix(prefix, uri);
  }
  public void endPrefixMapping(String prefix) {
    ns.popContext( );
  }
}

Given this implementation of startPrefixMapping/endPrefixMapping, one can now look up the correct mapping of a namespace prefix by calling the getURI method. Additionally, the processName method can be used to crack a QName into its constituent components.

String[] ss = new String[3];
ss = ns.processName("two:LName", ss, false);

This would result in the following three-tuple if called against the namespace declarations from the artist element shown earlier in this chapter.

{ "uri-two", "LName", "two:LName" }

If called using the string "LName", one would have gotten

{ "uri-one", "LName", "LName" }

assuming the isAttribute parameter was false (note that the default namespace of the artist element was uri-one). Had the isAttribute parameter been set to true, the QName would have been interpreted according to the rules of attribute names, which means that a name with no prefix belongs to no namespace and thus would have yielded the following three-tuple:

{ "", "LName", "LName" }

Note that the first string is the empty string.
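
The three cases above can be reproduced with a few lines of code. This is a minimal sketch: ProcessNameDemo, artistScope, and crack are hypothetical names, and the scope is populated by hand with the declarations from the artist element rather than by a parser. The last call in main uses a prefix that was never declared, for which processName returns null.

```java
import java.util.Arrays;
import org.xml.sax.helpers.NamespaceSupport;

// Hypothetical demo of NamespaceSupport.processName
class ProcessNameDemo {
  // a scope holding the declarations of the artist element
  static NamespaceSupport artistScope() {
    NamespaceSupport ns = new NamespaceSupport();
    ns.pushContext();
    ns.declarePrefix("", "uri-one");       // default namespace
    ns.declarePrefix("two", "uri-two");
    ns.declarePrefix("three", "uri-three");
    return ns;
  }

  // returns { namespace URI, local name, QName }, or null when the
  // prefix is not declared in scope
  static String[] crack(String qName, boolean isAttribute) {
    return artistScope().processName(qName, new String[3], isAttribute);
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(crack("two:LName", false)));
    System.out.println(Arrays.toString(crack("LName", false)));
    System.out.println(Arrays.toString(crack("LName", true)));
    System.out.println(Arrays.toString(crack("nope:LName", false)));
  }
}
```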

The discussion so far has focused on the basic structure of a document's elements and has ignored the content model of each element. SAX defines four additional ContentHandler methods that are used to signal the presence of nonelement [children] facets of the current element. The simplest of these methods is the processingInstruction method.

void processingInstruction(String target, String data)
               throws SAXException;

Consider the following serialized processing instruction:

<?hack Magnum PI?>

This PI would be conveyed in SAX as follows:

void emit3(org.xml.sax.ContentHandler handler)
           throws org.xml.sax.SAXException {
  handler.processingInstruction("hack",
                                "Magnum PI");
}

As processing instructions are also valid [children] of the document information item, calls to processingInstruction may occur prior to the first startElement and after the final endElement. However, all processingInstruction and startElement calls will be surrounded by a pair of calls to startDocument and endDocument that signal the beginning and end of the document information item.
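
To see where processing instructions fall relative to the document and element signals, one can log the calls made for a document that has a processing instruction on each side of its root element. The sketch below assumes a JAXP-supplied parser; DocumentLog and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that logs document-level structure
class DocumentLog extends DefaultHandler {
  final List<String> events = new ArrayList<>();

  @Override
  public void startDocument() { events.add("startDocument"); }

  @Override
  public void endDocument() { events.add("endDocument"); }

  @Override
  public void processingInstruction(String target, String data) {
    events.add("pi " + target + " [" + data + "]");
  }

  @Override
  public void startElement(String uri, String localName,
                           String qName, Attributes atts) {
    events.add("start " + localName);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    events.add("end " + localName);
  }

  static List<String> record(String xml) throws Exception {
    SAXParserFactory f = SAXParserFactory.newInstance();
    f.setNamespaceAware(true);
    DocumentLog log = new DocumentLog();
    f.newSAXParser().parse(new InputSource(new StringReader(xml)), log);
    return log.events;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(
        record("<?hack Magnum PI?><doc/><?hack again?>"));
  }
}
```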

For elements whose content model is mixed or text only, the characters method must be called to convey the character data that appears as element content. For elements whose content model is known to be element only, any interleaving whitespace between child elements may be delivered using the ignorableWhitespace method.5 Both methods take an array of characters as a parameter. An initial offset and length are provided to indicate which subset of the array contains the actual content. Consider the following element information item:

<x xmlns='uri-one'>Hello, World</x>

The following Java code shows the corresponding SAX ContentHandler call sequence:

void emit4(org.xml.sax.ContentHandler handler)
           throws org.xml.sax.SAXException {
  org.xml.sax.Attributes a =
                new org.xml.sax.helpers.AttributesImpl();
  handler.startElement("uri-one", "x", "", a);
    char[] rgch = "Hello, World".toCharArray();
    handler.characters(rgch, 0, rgch.length);
  handler.endElement("uri-one", "x", "");
}

The offset and length parameters are Java-isms that allow Java-based XML parsers to avoid excessive memory movement.
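
On the receiving side, a handler must read only the indicated slice of the array, and it should be prepared for a caller to deliver one run of text in more than one characters call. The following sketch accumulates text in a buffer to cope with both; TextCollector is a hypothetical class, and the JAXP-based textOf helper is an assumption added so the example can be run against a parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that gathers all character data it receives
class TextCollector extends DefaultHandler {
  private final StringBuilder text = new StringBuilder();

  @Override
  public void characters(char[] ch, int start, int length) {
    // only ch[start] .. ch[start + length - 1] belong to this call;
    // the rest of the array may hold unrelated parser data
    text.append(ch, start, length);
  }

  String text() { return text.toString(); }

  static String textOf(String xml) throws Exception {
    SAXParserFactory f = SAXParserFactory.newInstance();
    f.setNamespaceAware(true);
    TextCollector h = new TextCollector();
    f.newSAXParser().parse(new InputSource(new StringReader(xml)), h);
    return h.text();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(textOf("<x xmlns='uri-one'>Hello, World</x>"));
  }
}
```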

The final [children]-related method is skippedEntity, whose signature looks like the following:

void skippedEntity(String name) throws SAXException;

This method corresponds to the reference to a skipped entity information item as a child of the current element. It signals the presence of an entity reference that will not be expanded by the caller. This method exists primarily due to a loophole in the XML 1.0 specification that allows nonvalidating parsers to skip external parsed entities.

Because SAX is commonly used to interface with XML parsers, it is occasionally useful for a ContentHandler implementation to discover exactly what portion of which document the parser is currently working on. To support this functionality, SAX defines the Locator interface, which is typically implemented by SAX-aware parsers so that implementations of ContentHandler can discover exactly which location in the underlying serialized form the current method call corresponds to. The following is the Java version of Locator:

package org.xml.sax;
public interface Locator {
  String getPublicId( );
  String getSystemId( );
  int getLineNumber( );
  int getColumnNumber( );
}

For convenience, the Java version of SAX provides a default implementation of this interface (LocatorImpl) that has four corresponding "setter" methods to allow setting of the various location properties. SAX parsers make this interface available to ContentHandler implementations by calling the setDocumentLocator method prior to calling any other ContentHandler methods.
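
A handler typically stores the Locator it is given and consults it from inside later callbacks. The sketch below shows that pattern; LineReporter is a hypothetical class, the parser comes from JAXP (an assumption), and the handler tolerates a caller that never supplies a locator, since callers are not required to provide one. The exact line and column values reported are up to the parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that tags each element with a line number
class LineReporter extends DefaultHandler {
  private Locator locator;               // null until the caller sets it
  final List<String> seen = new ArrayList<>();

  @Override
  public void setDocumentLocator(Locator locator) {
    this.locator = locator;              // keep for use in later calls
  }

  @Override
  public void startElement(String uri, String localName,
                           String qName, Attributes atts) {
    // -1 marks "unknown" when no locator was supplied
    int line = (locator != null) ? locator.getLineNumber() : -1;
    seen.add(localName + "@" + line);
  }

  static List<String> report(String xml) throws Exception {
    SAXParserFactory f = SAXParserFactory.newInstance();
    f.setNamespaceAware(true);
    LineReporter h = new LineReporter();
    f.newSAXParser().parse(new InputSource(new StringReader(xml)), h);
    return h.seen;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(report("<a>\n<b/>\n</a>"));
  }
}
```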
