In this chapter, you learn:
What an XML parser is
How to interface a parser with an application
What DOM, the Document Object Model, is
How to write Java applications that use DOM
Which other applications use DOM
What Is a Parser?
A parser is the most basic yet most important XML tool. Every XML application includes a parser. For example, the XSL processors (Xalan and FOP) from the last chapters were based on the Xerces parser.
A parser is a software component that sits between the application and XML files. Its goal is to shield the developer from the intricacies of the XML syntax.
Parsers are confusing because they have received a lot of publicity: There are dozens of parsers freely available on the Internet. When Microsoft shipped Internet Explorer 4.0 as the first browser with XML support, they really meant they had bundled two XML parsers with it.
Yet, if you ask for a demo of a parser, you won't see much. The parser is a low-level tool that is almost invisible to everybody but programmers. The confusion arises because the tool that has so much visibility in the marketplace turns out to be a very low-level library.
Why do you need parsers? Imagine you are given an XML file with product descriptions, including prices. Your job is to write an application to convert the prices from dollars to euros.
It looks like a simple assignment: Loop through the price list and multiply each price by the exchange rate. How long would that take? A quarter of a day's work, including tests.
Yet, remember the prices are in an XML file. To loop through the prices means to read and interpret the XML syntax. It doesn't look difficult: basically, elements are in angle brackets. Let's say the quarter-of-a-day assignment is now a one-day assignment.
Do you remember entities? The XML syntax is not just about angle brackets. There might be entities in the price list. Therefore, the application must read and interpret the DTD to be able to resolve entities. While it's reading the DTD, it might as well read element definitions and validate the document.
» For more information on how the DTD influences the document, see the section "Standalone Documents" in Chapter 4.
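To make the entity problem concrete, here is a minimal sketch that compares a naive angle-bracket scan with a real parser. It assumes the standard JAXP API (javax.xml.parsers) that ships with the JDK. The class name EntityDemo, the entity name standard, and the element names are hypothetical, chosen only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class EntityDemo {
    // Hypothetical price list: the price is written as an entity reference
    // that is declared in the internal DTD subset.
    public static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE products [\n"
        + "  <!ENTITY standard \"19.99\">\n"
        + "]>\n"
        + "<products><product><price>&standard;</price></product></products>";

    /** Naive approach: cut out whatever sits between the price tags. */
    public static String naivePrice(String xml) {
        int start = xml.indexOf("<price>") + "<price>".length();
        int end = xml.indexOf("</price>");
        return xml.substring(start, end);
    }

    /** Parser approach: the parser reads the DTD and resolves the entity. */
    public static String parsedPrice(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("price").item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the price list", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(naivePrice(XML));   // prints &standard;
        System.out.println(parsedPrice(XML));  // prints 19.99
    }
}
```

The naive scan returns the raw reference, which is useless for arithmetic. The parser returns the replacement text because it has read the entity declaration.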
What about other XML features: character encodings, namespaces, parameter entities? And did you consider errors? How does your software recover from a missing closing tag?
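As an illustration of the error question, the following sketch hands a document with a missing closing tag to a parser and reports what comes back. It assumes the JAXP API bundled with the JDK. The class name WellFormedCheck and the sample documents are hypothetical, and the exact wording of the error message depends on the parser implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    /**
     * Returns null if the text is well formed, otherwise a short
     * description of the first fatal error the parser reports.
     */
    public static String firstError(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and ignores the rest,
            // so the parser does not write its own messages to the console.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("parser unavailable", e);
        }
    }

    public static void main(String[] args) {
        // The price element is never closed in the second document.
        System.out.println(firstError("<products><price>10.00</price></products>"));
        System.out.println(firstError("<products><price>10.00</products>"));
    }
}
```

The application never inspects a single angle bracket. It only decides what to do once the parser has rejected the document.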
The XML syntax is simple. Yet it is an extensible syntax, so XML applications have to be ready to cope with many options. As it turns out, writing a software library to decode XML files is a one-month assignment. If you were to write such a library, after one month, you would have written your own parser.
Is it productive to spend one month writing a parser library when you need only a quarter of a day's work to process the data? Of course not. It is more sensible to download a parser from the Internet or use one that ships with your favorite development tool.
Admittedly, this example is oversimplified, but it illustrates the definition of a parser: an off-the-shelf component that isolates programmers from the specifics of the XML syntax.
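With an off-the-shelf parser, the price conversion shrinks back to the loop described at the start. The following is a minimal sketch, not a definitive implementation. It assumes the JAXP API bundled with the JDK, a hypothetical price list made of products, product, name, and price elements, and an exchange rate invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class PriceConverter {
    // Hypothetical exchange rate, chosen only for illustration.
    public static final BigDecimal RATE = new BigDecimal("0.90");

    /** Parses the price list and returns every price multiplied by the rate. */
    public static List<BigDecimal> convert(String xml, BigDecimal rate) {
        try {
            // The parser deals with the XML syntax...
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // ...and the application only loops through the prices.
            NodeList prices = doc.getElementsByTagName("price");
            List<BigDecimal> converted = new ArrayList<>();
            for (int i = 0; i < prices.getLength(); i++) {
                BigDecimal dollars =
                    new BigDecimal(prices.item(i).getTextContent().trim());
                converted.add(dollars.multiply(rate)
                                     .setScale(2, RoundingMode.HALF_UP));
            }
            return converted;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the price list", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<products>"
            + "<product><name>XML Editor</name><price>499.00</price></product>"
            + "<product><name>DTD Editor</name><price>199.00</price></product>"
            + "</products>";
        System.out.println(convert(xml, RATE));  // prints [449.10, 179.10]
    }
}
```

Everything specific to XML is confined to the two lines that create the parser and call parse(). The rest is the quarter-of-a-day loop.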
If you are not convinced yet or if you would rather write your own XML parser, consider this: No programmer in his or her right mind (except those working for Oracle, Sybase, Informix, and the like) would write low-level database drivers. It makes more sense to use the drivers that ship with the database.
Likewise, no programmer should spend time decoding XML files; it makes more sense to turn to existing parsers.
The word parser comes from compilers. In a compiler, a parser is the module that reads and interprets the source code.
In a compiler, the parser creates a parse tree, which is an in-memory representation of the source code.
The second half of the compiler, known as the backend, uses parse trees to generate object files (compiled modules).
Validating and Nonvalidating Parsers
You will remember that XML documents can be either well formed or valid. Well-formed documents respect the syntactic rules. Valid documents not only respect the syntactic rules but also conform to a structure as described in a DTD or a schema.
Likewise, there are validating and nonvalidating parsers. Both kinds enforce the syntactic rules, but only validating parsers know how to validate documents against their DTDs or schemas.
Lest there be any confusion, there is no direct mapping between well-formed documents and nonvalidating parsers. Nonvalidating parsers can read valid documents (that is, documents with a DTD or a schema), but they won't validate them. To a nonvalidating parser, every document is a well-formed document.
Similarly, some validating parsers accept well-formed documents (others consider it an error not to have a DTD or a schema). Of course, when working on well-formed documents, they behave like nonvalidating parsers.
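The difference can be sketched with a single document read twice, once without and once with validation. This example assumes the JAXP API bundled with the JDK, where DocumentBuilderFactory.setValidating() turns on DTD validation. The class name ValidationDemo and the small DTD are hypothetical, and the error text itself depends on the parser implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class ValidationDemo {
    // Hypothetical DTD: every product needs a name followed by a price.
    static final String DTD =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE products [\n"
        + "  <!ELEMENT products (product+)>\n"
        + "  <!ELEMENT product (name, price)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT price (#PCDATA)>\n"
        + "]>\n";

    // Well formed and valid.
    public static final String VALID = DTD
        + "<products><product><name>XML Editor</name>"
        + "<price>499.00</price></product></products>";

    // Well formed but not valid: the price element is missing.
    public static final String INVALID = DTD
        + "<products><product><name>XML Editor</name></product></products>";

    /** Parses the document and returns the validity errors that were reported. */
    public static List<String> validationErrors(String xml, boolean validating) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored here */ }
                // Validity problems are reported as recoverable errors.
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                // Well-formedness problems are fatal and stop the parse.
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the document", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validationErrors(INVALID, false)); // nonvalidating: no complaint
        System.out.println(validationErrors(INVALID, true));  // validating: missing price
        System.out.println(validationErrors(VALID, true));    // validating: no complaint
    }
}
```

In nonvalidating mode the parser still reads the DTD, but it treats the document as merely well formed and reports nothing.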
As a programmer, you will like the combination of validating parsers and valid documents. The parser catches most of the structural errors for you. And you don't have to write a single line of code to benefit from the service: The parser figures it out by reading the DTD or the schema. In short, it means less work for you.