XML Document Structure and Syntax
XML documents must adhere to a standard syntax so that automated parsers can read them. Fortunately, the syntax is pretty simple to understand, especially if you've developed Web pages in HTML. The XML syntax is a bit more rigorous than that of HTML, but as you'll see, that's a good thing. There are a million ways to put together a bogus, sloppy HTML document, but the structure required by XML means that you get a higher level of consistency.
The XML declaration looks much the same in all XML documents. Following is an XML declaration:
<?xml version="1.0"?>
The declaration says two things: This is an XML document, and this document conforms to the XML 1.0 W3C recommendation (which you can get straight from the horse's mouth at http://www.w3.org/TR/REC-xml). XML 1.0 is the version you'll meet almost everywhere (a later XML 1.1 recommendation exists but is rarely used), so you shouldn't often see an XML declaration that's different from this example, but you might in the future as the specification is revised into new versions.
A W3C recommendation isn't quite the same as a bona fide internet standard, but it's close enough for our purposes.
The XML declaration, when it exists, must be the very first thing in the document; nothing, not even a blank line or a space, may come before it. The declaration does not have to exist, however; it is an optional part of an XML document. The idea behind a declaration is that you may have some automated tool that trawls document folders looking for XML. If your XML files contain declarations, it'll be much easier for such an automated process to locate XML documents (as well as differentiate them from other marked-up documents such as HTML Web pages).
Don't sweat it too much if you don't include a declaration line in the XML documents you create. Leaving it out doesn't affect how data in the document is parsed.
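To see this for yourself, here is a minimal sketch that feeds the same document to a parser with and without the declaration. It assumes Python and its standard xml.etree.ElementTree module, neither of which this chapter otherwise relies on; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The same document, with and without the optional declaration.
with_declaration = '<?xml version="1.0"?>\n<ORDERS>\n</ORDERS>'
without_declaration = '<ORDERS>\n</ORDERS>'

root_a = ET.fromstring(with_declaration)
root_b = ET.fromstring(without_declaration)
print(root_a.tag, root_b.tag)  # both parse to the same top-level element

# A declaration that follows a blank line is no longer at the start
# of the document, so the parser rejects it.
misplaced = '\n<?xml version="1.0"?>\n<ORDERS>\n</ORDERS>'
try:
    ET.fromstring(misplaced)
    misplaced_ok = True
except ET.ParseError:
    misplaced_ok = False
print(misplaced_ok)
```

Either of the first two strings yields the same ORDERS element; only the misplaced declaration causes trouble.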
An element is a part of an XML document that contains data. If you're accustomed to database programming or working with delimited documents, you can think of an element as a column or a field. XML elements are sometimes also referred to as nodes.
XML documents must have exactly one top-level element to be parsable; every other element must be nested inside it. The following code shows an XML document with a declaration and a single top-level element (but no actual data).
<?xml version="1.0"?>
<ORDERS>
</ORDERS>
This document can be parsed, even though it contains no data. Note one important thing about the markup of this document: It contains both an open tag and a close tag. The close tag is differentiated by the slash (/) character in front of the element name.
This is an important difference between XML and HTML. In HTML, some elements require close tags, but many don't. Even for those elements that don't contain proper closing tags, the browser often attempts to correctly render the page (sometimes with quirky results).
XML, on the other hand, is the shrewish librarian of the data universe. It's not nearly as forgiving as HTML and will rap you on the knuckles if you cross it. If your XML document contains an element that's missing a close tag, the document won't parse. This is a common source of frustration among developers who use XML. Another kicker is the fact that (unlike HTML) tag names in XML are case-sensitive. This means that <ORDERS> and <orders> are considered to be two different and distinct tags.
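The following sketch shows how unforgiving a parser is about close tags and case. It assumes Python's standard xml.etree.ElementTree module; is_well_formed is a hypothetical helper written for this example, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Every open tag has a matching close tag.
good = is_well_formed("<ORDERS><ORDER></ORDER></ORDERS>")

# The ORDER element is never closed.
missing_close = is_well_formed("<ORDERS><ORDER></ORDERS>")

# ORDERS and orders are different names, so the tags do not match.
wrong_case = is_well_formed("<ORDERS></orders>")

print(good, missing_close, wrong_case)
```

Only the first document parses; the other two are rejected outright rather than being patched up the way a browser might patch up HTML.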
The whole purpose of an XML element is to contain pieces of data. In the previous example, we left out the data. Code Listing 11.1 shows an evolved version of this document, this time with data in it.
Listing 11.1 An XML Document with Elements That Contain Data
<?xml version="1.0"?>
<ORDERS>
  <ORDER>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <ID>33849</ID>
    <CUSTOMER>Steve Farben</CUSTOMER>
    <TOTALAMOUNT>3456.92</TOTALAMOUNT>
  </ORDER>
</ORDERS>
If you were to describe Listing 11.1 in English, you'd say that it contains a top-level ORDERS element and a single ORDER element, or node. The ORDER node is a child of the ORDERS element. The ORDER element itself contains four child nodes of its own: DATETIME, ID, CUSTOMER, and TOTALAMOUNT.
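That parent/child description maps directly onto what a parser hands back. As a rough illustration, assuming Python's standard xml.etree.ElementTree module (the variable names are invented for this example), you could walk the document from Listing 11.1 like this:

```python
import xml.etree.ElementTree as ET

listing_11_1 = """<?xml version="1.0"?>
<ORDERS>
  <ORDER>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <ID>33849</ID>
    <CUSTOMER>Steve Farben</CUSTOMER>
    <TOTALAMOUNT>3456.92</TOTALAMOUNT>
  </ORDER>
</ORDERS>"""

root = ET.fromstring(listing_11_1)   # the top-level ORDERS element
order = root[0]                      # its single ORDER child

# Collect the names of the ORDER element's own children, in document order.
child_names = [child.tag for child in order]
customer = order.find("CUSTOMER").text

print(root.tag, order.tag, child_names, customer)
```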
Adding a few additional orders to this document might give you something like Listing 11.2.
Listing 11.2 An XML Document with Multiple Child Elements Beneath the Top-Level Element
<?xml version="1.0"?>
<ORDERS>
  <ORDER>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <ID>33849</ID>
    <CUSTOMER>Steve Farben</CUSTOMER>
    <TOTALAMOUNT>3456.92</TOTALAMOUNT>
  </ORDER>
  <ORDER>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <ID>33856</ID>
    <CUSTOMER>Jane Colson</CUSTOMER>
    <TOTALAMOUNT>401.19</TOTALAMOUNT>
  </ORDER>
  <ORDER>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <ID>33872</ID>
    <CUSTOMER>United Disc, Incorporated</CUSTOMER>
    <TOTALAMOUNT>74.28</TOTALAMOUNT>
  </ORDER>
</ORDERS>
Here's where developers sometimes get nervous about XML. With a document like Listing 11.2, you can see that there's far more markup than data. Does this mean that all those extra bytes will squish your application's performance?
Maybe, but not necessarily. Consider an Internet application that uses XML on the server side. When this application needs to send data to the client, it first opens and parses the XML document (we'll discuss how XML parsing works later). Then some sort of result (in all likelihood, a tiny subset of the data, stripped of markup) will be sent to the client Web browser. The fact that there's a bunch of markup there doesn't slow things down significantly.
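Here is one way that server-side step might look, as a sketch only. It assumes Python's standard xml.etree.ElementTree module and the document from Listing 11.2; order_summary is a hypothetical function invented for this illustration, not something a real server framework provides.

```python
import xml.etree.ElementTree as ET

ORDERS_XML = """<?xml version="1.0"?>
<ORDERS>
  <ORDER>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <ID>33849</ID>
    <CUSTOMER>Steve Farben</CUSTOMER>
    <TOTALAMOUNT>3456.92</TOTALAMOUNT>
  </ORDER>
  <ORDER>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <ID>33856</ID>
    <CUSTOMER>Jane Colson</CUSTOMER>
    <TOTALAMOUNT>401.19</TOTALAMOUNT>
  </ORDER>
  <ORDER>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <ID>33872</ID>
    <CUSTOMER>United Disc, Incorporated</CUSTOMER>
    <TOTALAMOUNT>74.28</TOTALAMOUNT>
  </ORDER>
</ORDERS>"""


def order_summary(xml_text, order_id):
    """Parse the whole document, then return only the data for one order."""
    root = ET.fromstring(xml_text)
    for order in root.findall("ORDER"):
        if order.findtext("ID") == order_id:
            return {
                "customer": order.findtext("CUSTOMER"),
                "total": order.findtext("TOTALAMOUNT"),
            }
    return None  # no order with that ID


summary = order_summary(ORDERS_XML, "33856")
print(summary)
```

The full document, markup and all, stays on the server; what would travel to the client is just the small dictionary of values.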
At the same time, there is a way to express data more succinctly in an XML document, without the need for as many open and closing markup tags. You can do this through the use of attributes.
An attribute is another way to enclose a piece of data in an XML document. An attribute is always part of an element; it typically modifies or is related to the information in the node. In a relational database application that emits XML, it's common to see foreign key data expressed in the form of attributes.
For example, a document that contains information about a sales transaction might use attributes as shown in Listing 11.3.
Listing 11.3 An XML Document with Elements and Attributes
<?xml version="1.0"?>
<ORDERS>
  <ORDER id="33849" custid="406">
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <TOTALAMOUNT>3456.92</TOTALAMOUNT>
  </ORDER>
</ORDERS>
As you can see from Listing 11.3, attribute values are always enclosed in quotation marks. Using attributes tends to reduce the total number of bytes in the document, trimming some markup at the expense of readability (in some cases). Note that you are allowed to use either single or double quotation marks anywhere XML requires quotes, as long as the opening and closing marks match.
This element/attribute syntax may look familiar from HTML, which uses attributes to assign values to elements the same way XML does. But remember that XML is a bit more rigid than HTML; a bracket out of place or a mismatched close tag will cause the entire document to be unparsable.
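To show how attribute data comes out of a parsed document, here is a short sketch based on Listing 11.3. It assumes Python's standard xml.etree.ElementTree module; the custid value is deliberately written with single quotes to show that both quote styles are read the same way.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<ORDERS>
  <ORDER id="33849" custid='406'>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <TOTALAMOUNT>3456.92</TOTALAMOUNT>
  </ORDER>
</ORDERS>"""

order = ET.fromstring(doc).find("ORDER")

# Attributes are exposed as a dictionary of strings on the element.
order_id = order.get("id")
customer_id = order.get("custid")
print(order_id, customer_id, order.attrib)

# An attribute value without quotation marks makes the document unparsable.
try:
    ET.fromstring("<ORDER id=33849></ORDER>")
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False
print(unquoted_ok)
```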
Enclosing Character Data
At the beginning of this chapter, we discussed the various dilemmas involved with delimited files. One of the problems with delimiters is the fact that if the delimiter character exists within the data, it's difficult if not impossible for a parser to know how to parse the data.
This problem is not confined to delimited files; XML has a similar problem when data contains its delimiter characters. The problem arises because the de facto XML delimiter character (in actuality, the markup character) is the left angle bracket (also known as the less-than symbol). In XML, the ampersand character (&) can also throw the parser off.
You've got two ways to deal with this problem in XML: Either replace the forbidden characters with character entities or use a CDATA section as a way to delimit the entire data field.
Using Character Entities
You might be familiar with character entities from working with HTML. The idea is to take a character that might be interpreted as a part of markup and replace it with an escape sequence to prevent the parser from going haywire. Listing 11.4 provides an example of this.
Listing 11.4 An XML Document with Escape Sequences
<?xml version="1.0"?>
<ORDERS>
  <ORDER id="33849">
    <NAME>Jones &amp; Williams Certified Public Accountants</NAME>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <TOTALAMOUNT>3456.92</TOTALAMOUNT>
  </ORDER>
</ORDERS>
Take a look at the data in the NAME element in the code example. Instead of an ampersand, the &amp; character entity is used. (If a data element contains a left bracket, it should be escaped with the &lt; character entity.)
When you use an XML parser to extract data with escape characters, the parser will automatically convert the escaped characters to their correct representation.
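The round trip can be sketched as follows: escape the raw data before it goes into the document, then let the parser turn it back. This assumes Python, with the escape function from the standard xml.sax.saxutils module and the standard xml.etree.ElementTree parser; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_name = "Jones & Williams Certified Public Accountants"

# escape() replaces & with &amp; (and < and > with &lt; and &gt;).
escaped_name = escape(raw_name)
doc = "<ORDER><NAME>" + escaped_name + "</NAME></ORDER>"

# The parser converts the entity back to a plain ampersand.
parsed_name = ET.fromstring(doc).findtext("NAME")

print(escaped_name)
print(parsed_name)
```

The value you read back from the parser is the original string, so the escaping is invisible to code that consumes the data.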
Using CDATA Sections
An alternative to replacing delimiter characters is to use a CDATA section. A CDATA section tells the XML parser not to interpret or parse the characters that appear inside it.
Listing 11.5 is an example of the same XML document from the previous example, delimited with a CDATA section rather than a character entity.
Listing 11.5 An XML Document with a CDATA Section
<?xml version="1.0"?>
<ORDERS>
  <ORDER id="33849">
    <NAME><![CDATA[Jones & Williams Certified Public Accountants]]></NAME>
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <TOTALAMOUNT>3456.92</TOTALAMOUNT>
  </ORDER>
</ORDERS>
In this example, the original data in the NAME element does not need to be changed, as it was in the previous example. Here, the data is wrapped in a CDATA section. The document is parsable, even though it contains a character that would otherwise be unparsable (the ampersand).
Which technique should you use? It's really up to you. I prefer using the CDATA method because it doesn't require altering the original data, but it has the disadvantage of adding a dozen or so bytes to each element.
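If you want to compare the two techniques side by side, the sketch below parses the same name written both ways and measures the extra bytes each one adds. It assumes Python's standard xml.etree.ElementTree module, and the shortened NAME values are chosen just for this example.

```python
import xml.etree.ElementTree as ET

unescaped = "<NAME>Jones & Williams</NAME>"          # not parsable as-is
with_entity = "<NAME>Jones &amp; Williams</NAME>"
with_cdata = "<NAME><![CDATA[Jones & Williams]]></NAME>"

# Both techniques produce exactly the same data after parsing.
text_from_entity = ET.fromstring(with_entity).text
text_from_cdata = ET.fromstring(with_cdata).text
print(text_from_entity == text_from_cdata)

# Extra characters each technique adds compared with the raw text.
entity_overhead = len(with_entity) - len(unescaped)
cdata_overhead = len(with_cdata) - len(unescaped)
print(entity_overhead, cdata_overhead)

# Without either technique, the bare ampersand stops the parser.
try:
    ET.fromstring(unescaped)
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False
print(unescaped_ok)
```

The CDATA wrapper costs a fixed 12 characters per element, while an entity costs a few characters per escaped character, so which one is smaller depends on how many special characters the data holds.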
Abbreviated Close-Tag Syntax
For elements that contain no data, you can use an abbreviated syntax for element tags to reduce the amount of markup overhead contained in your document. Listing 11.6 demonstrates this.
Listing 11.6 An XML Document with Empty Elements
<?xml version="1.0"?>
<ORDERS>
  <ORDER id="33849" custid="406">
    <DATETIME>1/4/2000 9:32 AM</DATETIME>
    <TOTALAMOUNT />
  </ORDER>
</ORDERS>
You can see from the example that the TOTALAMOUNT element contains no data. As a result, we can express it as <TOTALAMOUNT /> instead of <TOTALAMOUNT></TOTALAMOUNT>. (It's perfectly legal to use either syntax in your XML documents; the abbreviated syntax is generally better, though, because it reduces the size of your XML document.)
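A quick check, assuming Python's standard xml.etree.ElementTree module, suggests that a parser treats the two spellings of an empty element identically. The variable names are invented for this sketch.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<ORDER><TOTALAMOUNT /></ORDER>").find("TOTALAMOUNT")
long_form = ET.fromstring(
    "<ORDER><TOTALAMOUNT></TOTALAMOUNT></ORDER>"
).find("TOTALAMOUNT")

# Neither element has any text or any children.
print(short_form.text, long_form.text)
print(len(short_form), len(long_form))

# A space between the tags is data, so this element is not truly empty.
spaced = ET.fromstring("<ORDER><TOTALAMOUNT> </TOTALAMOUNT></ORDER>").find(
    "TOTALAMOUNT"
)
print(repr(spaced.text))
```

With this parser an empty element reports its text as None in both forms, which is worth remembering before you try to convert a missing total to a number.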