Home > Articles > Web Services > XML

  • Print
  • + Share This
This chapter is from the book

8.3 Processing Performance

One of XML's great strengths is the enormous selection of libraries, applications, toolkits, and frameworks available to developers in nearly every programming language and on nearly every platform. Sometimes, especially in the case of such low-level components as parsers, this software has been heavily optimized for size or performance. In other cases, however, publicly available XML software is optimized instead for ease of use; when that happens, the software can hide design choices that trigger enormous size or performance hits while processing XML.

When a high-level component, such as an XSLT engine, becomes an efficiency bottleneck, developers need to understand how their tools work and where the costs come from. This section describes some of the optimizations available to developers to improve XML performance for parsing, querying, and transforming XML and then introduces the technique of XML pipelines.

8.3.1 Parsing Interfaces

Although people have come up with some innovative ideas for XML parsing interfaces, such as pull parsers and query-based parsers, nearly all XML parsing interfaces in common use fall into one of two groups:

  1. Tree-based interfaces, such as the Document Object Model [DOM]

  2. Event-based interfaces, such as the Simple API for XML [SAX]

DOM-like interfaces allow applications to navigate and modify an XML document in much the same way that they navigate and modify a file system, by moving up and down various branches. As a matter of fact, some XML user interfaces even present XML documents using file system iconography, with folder icons for elements and file icons for text.

Unfortunately, the ease of use of DOM-like interfaces comes at a high price: An XML document's entire structure must be available for random access. In practice, that means that the structure must be in a database or, most commonly, in computer memory. During batch processing, building that in-memory data structure requires a time for allocating objects [1] and memory for keeping them, typically five to ten times the size of the original XML document. For a desktop application, these requirements are not usually a problem: A user editing a 2MB XML document will not mind waiting a few seconds for the document to load and can easily spare 20MB of memory to edit it. However, on a system handling hundreds or thousands of XML documents simul taneously and quickly, such as a server or high-speed information system—or even worse, a network appliance—the memory and time overhead are entirely unacceptable.

In these cases, switching to an event-based interface, such as SAX, may be a good choice. SAX-like interfaces do not allow random access to an XML document—although a DOM-like interface is similar to a hard drive, a SAX-like interface is more similar to a tape drive—but the SAX-like interface uses near-constant memory no matter how large an XML document may be [2] and tends to use very little processing time, as it does not need to allocate new objects during its run.

Unfortunately, programming with a SAX-like parsing interface is considerably more complicated than programming with a DOM-like interface. The application needs to capture and store any information it requires as the information streams past, and there is no way to follow backward references in a document. When ease of implementation is the priority, a DOM-like interface makes sense; when performance is the priority, a SAX-like interface makes sense.

Often, applications parsing data-oriented XML documents build their own, in-memory data trees.

  • An accounting application might build a tree of ledgers, accounts, and entries.

  • A geodata application might build a tree of polygons and points.

  • A graphics program might build a tree of shapes and lines.

An application that builds a specialized data tree using a DOM-like XML parsing interface takes a double hit: First, the parser uses a lot of time and memory to build a parse tree of the XML document, and then the application uses more time and memory to build its own tree before discarding the parser's tree. In such a case, simply switching to an event-based interface can reduce an application's time and memory requirements for XML processing to a fraction of what they originally were.

Another place that tree-based interfaces often hide performance is in transformations, discussed in Section 8.3.3.

8.3.2 Queries

Sometimes, an application reads an XML document simply to extract a small bit of information hidden somewhere inside it. A bibliographical application might need to extract only the title and the author's name from an XML-encoded research paper, like the one in Listing 8-3.

Example 8-3. Bibliographical Information in a Research Paper

<article>
  <articleinfo>
   <title>Frog populations in Frontenac County</title>
   <author>
     <honorific>Dr.</honorific>
     <firstname>Jose</firstname>
     <surname>Schmidt</surname>
     <affiliation>
       <orgdiv>Department of Biology</orgdiv>
       <orgname>King's University</orgname>
     </affiliation>
   </author>
 </articleinfo>

 <sect1>
 <title>Introduction</title>

 <para>During the summers of 1999 and 2000, a team of researchers
   [...]</para>

 [...]

 </sect1>

 [...]

</article>

If the paper were 100 pages long, building a DOM tree of the entire XML document simply to extract the title and author information would be extremely wasteful. It is much better to write a short, perhaps 50-line, program in Perl, Java, Python, C, or C++ to extract the required information from an event stream provided by a SAX-like interface. In fact, the application can terminate parsing as soon as it has all the required information, so the parser will not need to read most of the document at all, much less build it into an in-memory tree.

Unfortunately, custom programming is not always a practical solution: Many people need to extract information from XML documents in a more generic way that does not require hard-core computer programming skills. For situations like these, the XML community has largely settled on XPath [XPATH] as a simple query syntax for XML documents. [3] For example, the following XPath expressions select the element containing the article title and the subtree containing the author's name:

/article/articleinfo/title
/article/articleinfo/author

A general-purpose query application can use these expressions to return matching fragments from an XML document, the same way that the UNIX grep command uses regular expressions to return matching fragments from a text file. An interactive XML editor or browser can use these expressions to bring the cursor to the matching points in a document.

Unfortunately, it is necessary to have random access to the original XML document to support all of XPath, and that, once again, requires a DOM-like interface. In a high-speed environment, in which predictable, consistent performance is important, it may be necessary to restrict queries to the subset of XPath that can be supported by a streaming interface. Both of the previous XPath expressions, for example, could be resolved by an application using a SAX-like parsing interface, as they do not require any backward references.

The XPath 1.0 expressions that can be matched most efficiently—for time and memory—are the ones that limit themselves mostly to context on the element stack, particularly parent, child, ancestor, and descendant relationships. For example, all the following XPath expressions could be matched against output from an event-based XML parser, such as a SAX parser, storing no context except for a stack of elements and their attributes up to the root element and the index of each element in the stack relative to its siblings:

//caution
/book//chapter/title[contains(string(), "Crimson")]
//section[@id="preface"]//character[string()="Joseph"]
  • The first XPath expression simply requires matching the caution element, with no additional context.

  • The second expression requires matching the current element against title and examining the element stack to ensure that the parent element is chapter and that the root element is book, then testing whether the string "Crimson" appears in the element's content.

  • The third expression requires matching the current element against character, testing whether that section element with the id attribute equal to "preface" appears anywhere in the element stack, and testing that the element's content is exactly the string Joseph.

These are all queries that will use near-constant memory no matter how long an XML doc ument is, as none requires random access. The following inefficient expressions, on the other hand, do require random access to a document or else the caching of a large part of a document's contents:

//caution[../para[contains(string(), "oil")]]
/book//chapter[following-sibling:chapter/title[string()="Crimson"]]
//link[id(@ref)/@status="modified"]
  • The first XPath expression matches any caution element that has a sibling named para containing oil in its content. As a result, a query engine will have to save the content of every caution element until the parser has finished reporting all its siblings.

  • The second expression matches any chapter that appears before a chapter with the title "Crimson". As a result, the query engine will have to cache the content of all chapters until the parser is finished processing the parent element.

  • The third expression is the most inefficient: In the worst case, it will require the query engine to cache every link element until the end of the document.

Fortunately, the efficiency concerns for XPath expressions are not so serious, because many technical specialists, not to mention ordinary users, will find such expressions anywhere from excessively complicated to baffling. Limiting a high-performance query engine to efficient expressions still provides a respectable amount of query capability while also allowing an application all the advantages of a SAX-like interface (see Section 8.3.1): Modifying an application to use only this subset, together with a SAX-like parser, can bring significant speed and memory improvements to an XML application.

8.3.3 Transformations

Performance problems with transformations are closely related to the problems with parsing interfaces (Section 8.3.1) and queries (Section 8.3.2). The most popular specification for transforming XML documents into XML or another format is XSL Transformations [XSLT], but because XSLT supports the full XPath specification, XSLT also requires random access to the original XML document. Therefore, XSLT engines typically use a DOM-like interface with all the extra time and memory overhead.

XSLT's expressive power, and the fact that it is a template language, make it ideal for complex structural transformations; in those cases, extra time and memory might be a reasonable tradeoff for rapid development. On the other hand, many transformations are small and do not require the ability to restructure a document extensively. In these cases, the significant overhead of using XSLT processors other tree-based transformation tools is unacceptable.

Before considering alternatives, this section takes a quick look at the efficiency of XSLT engines. XSLT engines read an XML document as input and create a new document, possibly XML, as output. Although the engines require random access to the input document, they have no such requirement for output, so there is no need to build a second in-memory document tree. Command line XSLT utilities are generally well-enough designed that they do not build a second DOM tree, or the equivalent, but simply write out the result as a stream, However, if an XML project happens to use an XSLT library, a developer could end up using the library to generate a DOM tree, for XML, before writing any output. This is, obviously, a bad idea when memory is at a premium, and fixing this problem is an easy way to halve the memory consumption of any project that uses XSLT.

A second optimization is possible with XSLT, but to date, no one seems to have taken advantage of it. XSLT requires random access to the input document but does not modify the original document; as a result, it is possible to parallelize XSLT processing, assigning different processors to perform different parts of the transformations on the input document. [4] This approach will not solve the memory problem but could bring significant speed improvements for large input documents or complex transformation rules.

As mentioned earlier, however, XSLT's ability to access the input document randomly using XPath expressions is not necessary for many kinds of transformations, and an application working with an event stream, such as SAX, can run much faster, using very little memory. SAX-like transformations are complicated or impossible when the transformation requires major structural changes, but such transformation works well in three cases:

  1. Adding information to a document

  2. Removing information to a document

  3. Modifying information in place, such as renaming elements and attributes or changing text

Many typical transformations fall into these categories. For example, one common transformation for technical and sales publications is adding live data from a database at publication time. The original XML markup might look like that in Listing 8-4.

Example 8-4. Catalog Entry before Transformation

<item>
 <source>Henrickson</source>
 <name>Acoustic Guitar</name>
 <description>This guitar includes a mahogany fret board, gold-plated
  tuning pegs, and a solid top.</description>
 <price dbref="g3905778/price"/>
</item>

The transformation engine executes a database query based on the value of the dbref attribute in the price element and replaces it with the latest price information, as shown in Listing 8-5.

Example 8-5. Catalog Entry after Transformation

<item>
 <source>Henrickson</source>
 <name>Acoustic Guitar</name>
 <description>This guitar includes a mahogany fret board, gold-plated
  tuning pegs, and a solid top.</description>
 <price>
  <status>sale</status>
  <expiry-date>2005-01-01</expiry-date>
  <currency>USD</currency>
  <amount>1999.00</amount>
 </price>
</item>

A tree-based transformation engine, such as XSLT, brings little advantage to this kind of work. In fact, because XSLT does not specify a standard method for database access, any XSLT template would not be portable anyway, even if the engine did support database access. On the other hand, using Perl, Python, or Java, a transformation like this is trivial for an experienced developer.

8.3.4 Pipelines

Another performance problem that can creep into an XML processing system is duplicated parsing. An XML document will often pass through several components in a chain from the point it enters a system to the point it leaves; if the XML is written to disk and then reparsed into memory each time a component touches it, a system might slow down. [5] The solution to this problem is to use a processing pipeline like the one in Figure 8-1.

08fig01.gifFigure 8-1 XML processing in a pipeline

The XML document enters the pipeline through the parser on the left, which converts the document to a series of SAX or SAX-like events. The parser then passes the events on to the first component, which manipulates them as necessary—say, by adding or removing information—then passes the modified events on to another component. The second component knows nothing about the first component: As far as it is concerned, the events could be coming directly from a parser. The second component makes further changes to the event stream and then passes the modified events on to the third component, and so on. At the end of the chain is a component that simply writes the events back out to disk or sends them out over the network in XML format. The alternative to this pipeline would be to parse and rewrite the XML document for each processing component, as in Figure 8-2: Note all the extra work required for processing.

08fig02.gifFigure 8-2 XML processing without a pipeline

The pipeline concept works with both DOM-like interfaces and SAX-like interfaces. For DOM-like interfaces, the components all work on the same in-memory tree in sequence, making changes and then passing the tree on to the next component. The beginning of the pipeline is also a parser—and tree build, in this case—and the last component is also a writer.

Aside from the fact that, for efficiency, the components need to run on the same system, this approach has one problem: Without XML to examine between each component, it can be difficult to find and fix problems. Fortunately, however, the pipeline components are modular, so it is easy to fit in extra debugging components to write out the events as XML at various points in the processing.

The pipeline model is the most common one used by skilled XML developers working with SAX and other event-based interfaces, and it has proved itself in high-speed, high-demand environments. The model is not quite as common with the DOM and other tree-based interfaces, as people tend not to use those in high-speed environments in the first place, but it can work equally well.

  • + Share This
  • 🔖 Save To Your Account

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020