XML Document Structure
The XML Recommendation states that an XML document has both logical and physical structure. Physically, it is composed of storage units called entities, each of which may refer to other entities, similar to the way that #include works in the C language. Logically, an XML document consists of declarations, elements, comments, character references, and processing instructions, collectively known as the markup.
Although throughout this book we refer to an "XML document," it is crucial to understand that XML may not exist as a physical file on disk. XML is sometimes used to convey messages between applications, such as from a Web server to a client. The XML content may be generated on the fly, for example by a Java application that accesses a database. It may be formed by combining pieces of several files, possibly mixed with output from a program. However, in all cases, the basic structure and syntax of XML are invariant.
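As a rough illustration of XML that never exists as a file on disk, the following Python sketch builds a small document in memory and serializes it to bytes, much as a server might before sending a response. The element names, the id attribute, and the build_employee_message function are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET


def build_employee_message(employee_id: str, name: str) -> bytes:
    """Build a small XML message in memory (hypothetical element names)."""
    root = ET.Element("Employees")
    employee = ET.SubElement(root, "Employee", id=employee_id)
    employee.text = name
    # Serialize to bytes with an XML declaration (requires Python 3.8+).
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


message = build_employee_message("1", "Ada")
```

The resulting bytes can be handed to any XML parser exactly as if they had been read from a file.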
An XML document consists of three parts, in the order given:
An XML declaration (which is technically optional, but recommended in most normal cases)
A document type declaration that refers to a DTD (which is optional, but required if you want validation)
A body or document instance (which is required)
Collectively, the XML declaration and the document type declaration are called the XML prolog.
The XML declaration is a piece of markup (which may span multiple lines of a file) that identifies this as an XML document. The declaration also indicates whether the document can be validated by referring to an external Document Type Definition (DTD). DTDs are the subject of chapter 4; for now, just think of a DTD as a set of rules that describes the structure of an XML document.
The minimal XML declaration is:
<?xml version="1.0" ?>
XML is case-sensitive (more about this in the next subsection), so it's important that you use lowercase for xml and version. The quotes around the value of the version attribute are required, as are the ? characters. At the time of this writing, "1.0" is the only acceptable value for the version attribute, but this is certain to change when a subsequent version of the XML specification appears.
Do not include a space before the string xml or between the question mark and the angle brackets. The strings <?xml and ?> must appear exactly as indicated. The space before the ?> is optional. No blank lines or space may precede the XML declaration; adding white space here can produce strange error messages.
In most cases, this XML declaration is present. If so, it must be the very first line of the document and must not have leading white space. This declaration is technically optional; cases where it may be omitted include when combining XML storage units to create a larger, composite document.
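The effect of leading white space can be checked with a few lines of Python. This is only a sketch using the standard library's expat-based parser; the is_well_formed helper and the note element are hypothetical names, and other parsers may word their errors differently.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: bytes) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


declaration_first = b'<?xml version="1.0" ?><note/>'
leading_space = b' <?xml version="1.0" ?><note/>'
leading_blank_line = b'\n<?xml version="1.0" ?><note/>'
no_declaration = b'<note/>'
```

Under these assumptions, the parser accepts the first and last documents and rejects the two in which white space precedes the declaration.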
Actually, the formal definition of an XML declaration, according to the XML 1.0 specification, is as follows:
XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
This Extended Backus-Naur Form (EBNF) notation, characteristic of many W3C specifications, means that an XML declaration consists of the literal sequence '<?xml', followed by the required version information, followed by optional encoding and standalone declarations, followed by an optional amount of white space, and terminating with the literal sequence '?>'. In this notation, a question mark not contained in quotes means that the term that precedes it is optional.
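One way to get a feel for the production is to translate it into a regular expression. The Python sketch below is a simplified approximation, not a replacement for a real parser: it assumes "1.0" is the only version, and the names XML_DECL and is_xml_decl are hypothetical.

```python
import re

# Approximate translation of the XMLDecl production (a sketch, not a parser).
XML_DECL = re.compile(
    r"""
    <\?xml
    \s+ version \s* = \s* (?P<vq>["']) 1\.0 (?P=vq)        # VersionInfo (required)
    (?: \s+ encoding \s* = \s*
        (?P<eq>["']) [A-Za-z][A-Za-z0-9._-]* (?P=eq) )?    # EncodingDecl (optional)
    (?: \s+ standalone \s* = \s*
        (?P<sq>["']) (?:yes|no) (?P=sq) )?                 # SDDecl (optional)
    \s* \?>                                                # optional space, then the closing literal
    """,
    re.VERBOSE,
)


def is_xml_decl(text: str) -> bool:
    """Return True if text matches the simplified XMLDecl pattern."""
    return XML_DECL.fullmatch(text) is not None
```

Because the production lists VersionInfo, EncodingDecl, and SDDecl in a fixed sequence, the pattern accepts them only in that order, and each value must be closed by the same kind of quote that opened it.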
The following declaration means that there is an external DTD on which this document depends. See the next subsection for the DTD that this negative standalone value implies.
<?xml version="1.0" standalone="no" ?>
On the other hand, if your XML document has no associated DTD, the correct XML declaration is:
<?xml version="1.0" standalone="yes" ?>
The XML 1.0 Recommendation states: "If there are external markup declarations but there is no standalone document declaration, the value 'no' is assumed."
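To see how a parser reports these values, the following Python sketch uses the XmlDeclHandler callback of the standard library's xml.parsers.expat module. The read_xml_decl helper and the note element are hypothetical. Note that expat reports only what the declaration literally says (1 for "yes", 0 for "no", -1 when the standalone part is omitted); applying the assumed default is left to the application.

```python
import xml.parsers.expat


def read_xml_decl(document: bytes):
    """Return (version, encoding, standalone) as reported by expat,
    or None if the document has no XML declaration."""
    seen = {}

    def on_decl(version, encoding, standalone):
        # standalone is 1 for "yes", 0 for "no", -1 if omitted.
        seen["decl"] = (version, encoding, standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_decl
    parser.Parse(document, True)
    return seen.get("decl")


declared_no = read_xml_decl(b'<?xml version="1.0" standalone="no" ?><note/>')
declared_yes = read_xml_decl(b'<?xml version="1.0" standalone="yes" ?><note/>')
omitted = read_xml_decl(b'<?xml version="1.0" encoding="UTF-8" ?><note/>')
```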
The optional encoding part of the declaration tells the XML processor (parser) how to interpret the bytes based on a particular character set. The default encoding is UTF-8, which is one of seven character-encoding schemes defined by the Unicode standard. In UTF-8, one byte represents each ASCII character, and two to four bytes represent every other character. UTF-8 is therefore an efficient form of Unicode for ASCII-based documents. In fact, UTF-8 is a superset of ASCII.
<?xml version="1.0" encoding="UTF-8" ?>
For Asian languages, however, an encoding of UTF-16 is often more appropriate because most of their characters take two bytes in UTF-16 but three bytes in UTF-8. It is also possible to specify an ISO character encoding, such as in the following example, which refers to ASCII plus Greek characters. Note, however, that some XML processors may not handle ISO character sets correctly since the specification requires only that they handle UTF-8 and UTF-16.
<?xml version="1.0" encoding="ISO-8859-7" ?>
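The size trade-offs among these encodings are easy to measure. The Python sketch below counts the bytes needed for a few sample characters; the encoded_length helper is a hypothetical name, and "utf-16-be" is used so that no byte-order mark is added to the count.

```python
def encoded_length(text: str, encoding: str) -> int:
    """Return the number of bytes text occupies in the given encoding."""
    return len(text.encode(encoding))


latin_a = "A"           # ASCII letter
greek_alpha = "\u03b1"  # Greek small letter alpha
cjk_middle = "\u4e2d"   # CJK ideograph meaning "middle"

utf8_sizes = [encoded_length(ch, "utf-8")
              for ch in (latin_a, greek_alpha, cjk_middle)]
utf16_sizes = [encoded_length(ch, "utf-16-be")
               for ch in (latin_a, greek_alpha, cjk_middle)]
greek_iso_size = encoded_length(greek_alpha, "iso-8859-7")

# ASCII text produces identical bytes in ASCII and in UTF-8.
ascii_is_subset = "xml".encode("ascii") == "xml".encode("utf-8")
```

For these samples, UTF-8 needs one, two, and three bytes respectively, UTF-16 needs two bytes for each, and ISO-8859-7 stores the Greek letter in a single byte.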
Both the encoding and standalone information may be supplied:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
Is the next example valid?
<?xml version="1.0" encoding='UTF-8' standalone='no'?>
Yes, it is. Single and double quotes can be used interchangeably, provided they are of matching kind around any particular attribute value. (Although there is no good reason in this example to use double quotes for version and single quotes for the others, you may need to do so if the attribute value already contains the kind of quotes you prefer.) The lack of a blank space between 'no' and ?> is not a problem either. The order of the attributes, however, does matter: as the EBNF production indicates, version must come first, followed by encoding (if present) and then standalone (if present).
Neither of the following XML declarations is valid.
<?XML VERSION="1.0" STANDALONE="no"?>
<?xml version="1.0" standalone="No"?>
The first is invalid because these particular attribute names must be lowercase, as must "xml". The problem with the second declaration is that the value of the standalone attribute must be literally "yes" or "no", not "No". (Do I dare call this a "no No"?)
Document Type Declaration
The document type declaration follows the XML declaration. The purpose of this declaration is to announce the root element (sometimes called the document element) and to provide the location of the DTD. The general syntax is:
<!DOCTYPE RootElement (SYSTEM | PUBLIC) ExternalDeclarations? [InternalDeclarations]? >
where <!DOCTYPE is a literal string, RootElement is whatever you name the outermost element of your hierarchy, followed by either the literal keyword SYSTEM or PUBLIC. The optional ExternalDeclarations portion is typically the relative path or URL to the DTD that describes your document type. (It is really only optional if the entire DTD appears as an InternalDeclaration, which is neither likely nor desirable.) If there are InternalDeclarations, they must be enclosed in square brackets. In general, you'll encounter far more cases with ExternalDeclarations than InternalDeclarations, so let's ignore the latter for now. They constitute the internal subset, which is described in chapter 4.
Let's start with a simple but common case. In this example, we are indicating that the DTD and the XML document reside in the same directory (i.e., the ExternalDeclarations are contained in the file employees.dtd) and that the root element is Employees:
<!DOCTYPE Employees SYSTEM "employees.dtd">
Similarly, the declaration
<!DOCTYPE PriceList SYSTEM "prices.dtd">
indicates that the root element is PriceList and that the DTD is in the local file prices.dtd.
In the next example, we use normal directory path syntax to indicate a different location for the DTD.
<!DOCTYPE Employees SYSTEM "../dtds/employees.dtd">
As is often the case, we might want to specify a URL for the DTD since the XML file may not even be on the same host as the DTD. This case also applies when you are using an XML document for message passing or data transmission across servers and still want the validation by referencing a common DTD.
<!DOCTYPE Employees SYSTEM "http://somewhere.com/dtds/employees.dtd">
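A parser exposes the pieces of the document type declaration to the application. The following Python sketch reads them with the standard library's xml.dom.minidom, which records the DOCTYPE but does not fetch or validate against the DTD. The doctype_info helper is a hypothetical name.

```python
from xml.dom import minidom


def doctype_info(document: bytes):
    """Return (root name, public ID, system ID) from the DOCTYPE,
    or None if the document has no document type declaration."""
    doctype = minidom.parseString(document).doctype
    if doctype is None:
        return None
    return (doctype.name, doctype.publicId, doctype.systemId)


local_dtd = doctype_info(
    b'<!DOCTYPE Employees SYSTEM "employees.dtd"><Employees/>'
)
remote_dtd = doctype_info(
    b'<!DOCTYPE Employees SYSTEM "http://somewhere.com/dtds/employees.dtd">'
    b'<Employees/>'
)
```

With a SYSTEM identifier there is no public ID, so the middle field is None in both results.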
Next, we have the case of the PUBLIC identifier. This is used in formal environments to declare that a given DTD is available to the public for shared use. Recall that XML's true power as a syntax relates to developing languages that permit exchange of structured data between applications and across company boundaries. The syntax is a little different:
<!DOCTYPE RootElement PUBLIC PublicID URI>
The new aspect here is the notion of a PublicID, which is a slightly involved formatted string that identifies the source of the DTD whose path follows as the URI. This is sometimes known as the Formal Public Identifier (FPI).
For example, I was part of a team that developed (Astronomical) Instrument Markup Language (AIML, IML) for NASA Goddard Space Flight Center. We wanted our DTD to be available to other astronomers. Our document type declaration (with a root element named Instrument) was:
<!DOCTYPE Instrument PUBLIC "-//NASA//Instrument Markup Language 0.2//EN" "http://pioneer.gsfc.nasa.gov/public/iml/iml.dtd">
In this case the PublicID is:
"-//NASA//Instrument Markup Language 0.2//EN"
The URI that locates the DTD is:
"http://pioneer.gsfc.nasa.gov/public/iml/iml.dtd"
Let's decompose the PublicID. The leading hyphen indicates that NASA is not a standards body. If it were, a plus sign would replace the hyphen, except if the standards body were ISO, in which case the string "ISO" would appear. Next we have the name of the organization responsible for the DTD (NASA, in this case), surrounded with double slashes, then a short free-text description of the DTD ("Instrument Markup Language 0.2"), double slashes, and a two-character language identifier ("EN" for English, in this case).
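The decomposition can be sketched in a few lines of Python. This is a simplification that assumes the four-field form described above, with the fields separated by double slashes; the PublicId type and parse_public_id function are hypothetical names, and identifiers in other forms are simply rejected.

```python
from typing import NamedTuple


class PublicId(NamedTuple):
    prefix: str       # "-" or "+", as described in the text
    owner: str        # organization responsible for the DTD
    description: str  # short free-text description of the DTD
    language: str     # two-character language identifier


def parse_public_id(public_id: str) -> PublicId:
    """Split a four-field public identifier on its double slashes."""
    parts = public_id.split("//")
    if len(parts) != 4:
        raise ValueError(f"expected four fields, got {len(parts)}")
    return PublicId(*parts)


nasa = parse_public_id("-//NASA//Instrument Markup Language 0.2//EN")
```

Here nasa.owner is "NASA" and nasa.language is "EN".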
Since the XML prolog is the combination of the XML declaration and the document type declaration, for our NASA example the complete prolog is:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE Instrument PUBLIC "-//NASA//Instrument Markup Language 0.2//EN" "http://pioneer.gsfc.nasa.gov/public/iml/iml.dtd">
As another example, let's consider a common case involving DTDs from the W3C, such as those for XHTML 1.0.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
W3C is identified as the organization, "DTD XHTML 1.0 Transitional" is the name of the DTD; it is in English; and the actual DTD is located by the URI http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd. Similarly, the prolog for XHTML Basic 1.0 is:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
The XHTML Basic 1.0 PublicID is similar but not identical to the XHTML 1.0 case and of course the DTD is different since it's a different language.
If you noticed that the NASA example uses uppercase for the encoding value UTF-8 and the W3C examples use lowercase, you may have been bothered because that is inconsistent with what we learned about the case-sensitive value for the standalone attribute. The explanation is that although element and attribute names are always case-sensitive, attribute values may or may not be. In this case, the XML 1.0 Recommendation says that processors should match encoding names in a case-insensitive way, whereas the grammar for standalone allows only the literal values "yes" and "no". As a rule of thumb, if the possible attribute values are easily enumerated (i.e., "yes" or "no", or other relatively short lists of choices), then case probably matters.
DTD-related keywords such as DOCTYPE, PUBLIC, and SYSTEM must be uppercase. XML-related attribute names such as version, encoding, and standalone must be lowercase.
The document body, or instance, is the bulk of the information content of the document. Whereas across multiple instances of a document of a given type (as identified by the DOCTYPE) the XML prolog will remain constant, the document body changes with each document instance (in general). This is because the prolog defines (either directly or indirectly) the overall structure while the body contains the real instance-specific data. Comparing this to data structures in computer languages, the DTD referenced in the prolog is analogous to a struct in the C language or a class definition in Java, and the document body is analogous to a runtime instance of the struct or class.
Because the document type declaration specifies the root element, this must be the first element the parser encounters. If any other element but the one identified by the DOCTYPE line appears first, the document is immediately invalid.
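A non-validating parser will not enforce this rule, but the check itself is simple. The Python sketch below compares the name given in the DOCTYPE with the actual root element using xml.dom.minidom; the root_matches_doctype helper is a hypothetical name, and treating a document without a DOCTYPE as a match is an assumption made only for this example.

```python
from xml.dom import minidom


def root_matches_doctype(document: bytes) -> bool:
    """Return True if the root element has the name the DOCTYPE declares."""
    doc = minidom.parseString(document)
    if doc.doctype is None:
        # Assumption: with no DOCTYPE there is nothing to contradict.
        return True
    return doc.documentElement.tagName == doc.doctype.name


matching = root_matches_doctype(
    b'<!DOCTYPE Employees SYSTEM "employees.dtd"><Employees/>'
)
mismatched = root_matches_doctype(
    b'<!DOCTYPE Employees SYSTEM "employees.dtd"><PriceList/>'
)
```

A validating parser would report the second document as invalid; this sketch only detects the name mismatch.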
Listing 3-1 shows a very simple XHTML 1.0 document. The DOCTYPE is "html" (not "xhtml"), so the document body begins with <html ....> and ends with </html>.
Listing 3-1 Simple XHTML 1.0 Document with XML Prolog and Document Body
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>XHTML 1.0</title>
  </head>
  <body>
    <h1>Simple XHTML 1.0 Example</h1>
    <p>See the <a href="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">DTD</a>.</p>
  </body>
</html>