.NET for Java Developers: Processing XML
Like HTML, Extensible Markup Language (XML) consists of tagged, human-readable text. Unlike HTML, an XML document must follow a few strict syntax rules, the most important being that every opening tag <tag> has a matching closing tag </tag> and that elements are properly nested. An XML document that follows these rules is said to be well formed.
As long as the XML document is well formed, you can invent whatever tags you want. An XML document is typically parsed by an XML parser, which creates an in-memory logical data structure for navigating the document. There are different types of XML parsers. The most common ones do not care what the tags are as long as the document is well formed. Some parsers can also validate an XML document against a set of rules that restrict it to a certain set of tags and a certain structure. Such parsers are called validating parsers.
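The difference between a well-formed and a malformed document can be seen with a nonvalidating parser. The following is a minimal sketch in Java, assuming the JAXP DocumentBuilder that ships with the JDK; the class name WellFormedCheck, the helper isWellFormed, and the sample <order> documents are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, used here only for illustration.
class WellFormedCheck {

    // Returns true if the text parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) throws Exception {
        // The default factory produces a nonvalidating parser: it checks
        // syntax only and does not care which tags appear.
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // A well-formedness violation is reported as a SAXException.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Every opening tag is closed and properly nested.
        System.out.println(isWellFormed("<order><item>pen</item></order>"));
        // <item> is never closed, so the document is not well formed.
        System.out.println(isWellFormed("<order><item>pen</order>"));
    }
}
```

Run as written, the first call should report true and the second false. The tag names themselves are irrelevant to the result, which is the point made above.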
The two most popular mechanisms for parsing XML documents are to create a Document Object Model (DOM) tree or to use the event-based Simple API for XML (SAX) model. An XML document can be validated against a document type definition (DTD), which is a set of rules that define the type and structure of the XML tags, or against an XML schema.
This chapter looks at C#'s API for DOM and SAX parsing of XML documents. We look at validating an XML document against a DTD. We also look at other utilities, such as XPath and Extensible Stylesheet Language Transformations (XSLT), that are built into the .NET API.
20.1 XML Support in Java
For a long time, XML was not built into the Java API. Support for XML came primarily from third-party libraries (such as Apache Xerces or JDOM). Fortunately, that has changed, and now you can get the Java XML Pack, a toolset for dealing with everything XML in Java. The XML Pack brings together several of the key industry standards for XML, such as SAX, DOM, XSLT, SOAP, Universal Description, Discovery, and Integration (UDDI), Electronic Business using Extensible Markup Language (ebXML), and Web Services Description Language (WSDL). The two common programmatic XML APIs (SAX and DOM) are now built into the core Java API (as of J2SE 1.4.0).
The SAX parser is an event-driven parser in which the parser fires off events when it encounters XML elements. Users write content handlers, which they can register with the parser. A content handler is like an event listener and can take appropriate action upon encountering, say, a particular XML tag. The SAX parser is based on a push model, wherein the parser pushes events to content handlers.
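The push model can be sketched with the SAX classes bundled with the JDK. This is an illustrative example only: the class SaxTitleCollector, its nested TitleHandler, and the <title> element it listens for are assumptions made for the sketch, not names from any particular application.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the text of every <title> element.
class SaxTitleCollector {

    // The content handler. The parser calls these methods as it reads.
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder current; // non-null only while inside <title>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attrs) {
            if ("title".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so append rather than assign.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(current.toString());
                current = null;
            }
        }
    }

    static List<String> collectTitles(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TitleHandler handler = new TitleHandler();
        // Registering the handler: the parser pushes events to it.
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library><book><title>Dune</title></book>"
                   + "<book><title>Emma</title></book></library>";
        System.out.println(collectTitles(xml));
    }
}
```

Note that the handler never sees the document as a whole. It keeps only the state it needs (here, a list of titles), which is why SAX suits large documents that would be expensive to hold in memory.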
The DOM parser parses the XML into an in-memory tree data structure (also known as a DOM tree). The Document Object Model is an API for valid HTML and well-formed XML documents. It defines the logical structure of documents and the way a document is accessed and manipulated. In the DOM specification, the term "document" is used in the broad sense; increasingly, XML is being used as a way to represent many kinds of information that may be stored in diverse systems. Much of this has traditionally been seen as data rather than as documents. Nevertheless, XML presents this data as documents, and the DOM can be used to manage this data.
With the Document Object Model, programmers can build documents, navigate their structure, and add, modify, or delete elements and content. Anything found in an HTML or XML document can be accessed, changed, deleted, or added using the DOM. The DOM is a W3C specification (http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/). JDOM (http://www.jdom.org) is a Java-specific alternative that offers one of the easier APIs for working with an XML document tree.
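To make the navigate, modify, add, and delete operations concrete, here is a minimal sketch that uses the W3C DOM interfaces bundled with the JDK (org.w3c.dom), not JDOM. The class DomEditSketch, the method edit, and the <library>, <book>, <title>, and <draft> element names are all hypothetical and chosen only for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class showing basic DOM tree manipulation.
class DomEditSketch {

    static Document edit(String xml) throws Exception {
        // Parse the text into an in-memory DOM tree.
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();

        // Navigate and modify: change the text of the first <title>.
        // (This sketch assumes the input contains at least one <title>.)
        NodeList titles = root.getElementsByTagName("title");
        titles.item(0).setTextContent("Revised");

        // Add: append a new <book><title>New</title></book> to the root.
        Element book = doc.createElement("book");
        Element title = doc.createElement("title");
        title.setTextContent("New");
        book.appendChild(title);
        root.appendChild(book);

        // Delete: remove every <draft> element. The NodeList is live,
        // so it shrinks as nodes are removed from the tree.
        NodeList drafts = root.getElementsByTagName("draft");
        while (drafts.getLength() > 0) {
            Node draft = drafts.item(0);
            draft.getParentNode().removeChild(draft);
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = edit(
                "<library><book><title>Old</title></book><draft/></library>");
        NodeList titles = doc.getElementsByTagName("title");
        for (int i = 0; i < titles.getLength(); i++) {
            System.out.println(titles.item(i).getTextContent());
        }
    }
}
```

Unlike the SAX approach, the whole document stays in memory, so the code can revisit and change any node in any order.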