Sams Teach Yourself XML in 21 Days

By Steven Holzner

Creating XML Documents Piece by Piece

Yesterday, you created this example XML document:

<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This is an XML document!
    </message>
</document>

That's a fully functional XML document, but it's only an example. Today, we're going to be more systematic about what goes into an XML document, discussing all the possible parts of such documents in the coming sections.

W3C defines everything that can go into XML documents in the XML 1.0 and 1.1 specifications, right down to our starting point—the character set you use.

Character Encodings: ASCII, Unicode, and UCS

The characters in an XML document are stored using numeric codes. That can be an issue, because different character sets use different codes, which means an XML processor might have problems trying to read an XML document that uses a character set—called a character encoding—other than what it's used to.

For example, a common character encoding used by text editors is the American Standard Code for Information Interchange (ASCII). ASCII is the default for plain text files created with Windows WordPad. ASCII codes extend from 0 to 127. For example, the ASCII code for A is 65, for B is 66, and so on. So, if you stored the word CAT in an XML document written in ASCII, the numbers 67, 65, and 84 are what would actually be stored. On the other hand, the World Wide Web is just that: worldwide. Plenty of character sets can't fit into the 128 characters of ASCII, such as Cyrillic, Armenian, Hebrew, Thai, Tibetan, and so on.
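The numeric codes behind a short piece of text can be checked directly. The following is a minimal sketch in Python (the variable names are illustrative, not from the book); note that uppercase and lowercase letters have different codes, and that text outside the ASCII range cannot be stored as ASCII at all.

```python
# Look up the numeric codes that are stored for a short piece of text.
lower_codes = list("cat".encode("ascii"))  # codes for the lowercase letters
upper_codes = list("CAT".encode("ascii"))  # codes for the uppercase letters

print(lower_codes)  # [99, 97, 116]
print(upper_codes)  # [67, 65, 84]

# A Cyrillic word has no ASCII codes, so encoding it as ASCII fails.
try:
    "Привет".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(fits_in_ascii)  # False
```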

For that reason, W3C turned to Unicode (http://www.unicode.org), which was originally designed around 16-bit codes and so had room for 65,536 characters, not just 128 (the standard has since grown beyond that 16-bit range). To make things easier, the first 128 Unicode characters correspond to the ASCII character set.

There's another character encoding available that has even more space than 16-bit Unicode: the Universal Character Set (UCS, also called ISO 10646) uses 32 bits (four bytes) per character. This gives it a range of two billion symbols, and a good thing, too, since there are more Chinese characters alone than there is space in 16-bit Unicode. UCS also encompasses the smaller Unicode character set: each Unicode character is represented by the same code in UCS, in much the same way that Unicode encompasses the smaller ASCII character set.

So which character sets are supported in XML? ASCII? Unicode? UCS? Stored as 16-bit codes, Unicode uses two bytes for each character, so a Unicode file would be twice as long as an ASCII file. For that and other reasons, it's difficult to convert much of the available software to Unicode. XML actually supports a compressed version of Unicode created by the UCS group called UCS Transformation Format-8 (UTF-8). UTF-8 includes all the ASCII codes unchanged, storing each in a single byte. Any other Unicode characters need more than one byte (the original design allowed up to six; the current definition of UTF-8 stops at four). For example, the Unicode for π is 03C0 in hexadecimal (960 in decimal), which you need to store in two bytes.
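To make the byte counts concrete, here is a small sketch in Python (variable names are illustrative) that encodes the Greek letter pi and an ASCII word as UTF-8 and inspects the result.

```python
pi = "\u03c0"  # the Greek small letter pi

code_point = ord(pi)             # the Unicode code as an integer
utf8_bytes = pi.encode("utf-8")  # the bytes actually stored in a UTF-8 file

print(code_point, hex(code_point))  # 960 0x3c0
print(len(utf8_bytes))              # 2

# ASCII text is stored byte for byte the same way in UTF-8.
same_as_ascii = "cat".encode("utf-8") == "cat".encode("ascii")
print(same_as_ascii)  # True
```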

To make it easier to handle, UCS itself has also been compressed in the same way into an encoding named UTF-16, which uses two bytes (instead of the normal four that UCS uses) for the most common characters, and four bytes for the less common characters.
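The same kind of check works for UTF-16. This sketch assumes a hypothetical helper named utf16_length, and uses Python's big-endian UTF-16 codec so that no byte order mark is counted.

```python
def utf16_length(text):
    """Return how many bytes the text occupies in UTF-16 (no byte order mark)."""
    return len(text.encode("utf-16-be"))

latin_letter = utf16_length("A")           # a common character
greek_letter = utf16_length("\u03c0")      # pi, also a common character
rare_symbol = utf16_length("\U0001F600")   # a character beyond the 16-bit range

print(latin_letter, greek_letter, rare_symbol)  # 2 2 4
```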

W3C requires all XML processors to support both UTF-8 (compressed Unicode, including the full ASCII set), and UTF-16 (compressed UCS, including the full ASCII set), and those are the only two W3C requires. The UTF-8 encoding is the most popular one today in XML documents, because you can store documents in ASCII using a text editor and they can be treated, without any changes, as UTF-8 by an XML processor (ASCII uses one byte for each character, and UTF-8 uses the same single byte for every character in the ASCII set). In fact, we've been using UTF-8 since our first XML example, as you can see where we've specified the character encoding for a document with the encoding attribute in the XML declaration:


<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This is an XML document!
    </message>
</document>

UTF-8 is so widespread that an XML processor will assume you're using it if you omit the encoding attribute (unless the document begins with a UTF-16 byte order mark). Although W3C requires all XML processors to support UTF-16 and UTF-8 (so you can assign these values to the encoding attribute), most don't support UTF-16 yet.
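As a rough illustration of how the encoding attribute is used, the sketch below feeds two byte strings to the XML parser in Python's standard library (one possible XML processor, not the one used in the book). The first has no encoding attribute and is read as UTF-8; the second declares ISO-8859-1, an encoding this particular parser happens to support.

```python
import xml.etree.ElementTree as ET

# No encoding attribute: the parser falls back to UTF-8.
no_declaration = "<message>\u03c0</message>".encode("utf-8")
default_text = ET.fromstring(no_declaration).text

# An explicit encoding attribute tells the parser how to read the bytes.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<message>caf\u00e9</message>"
).encode("iso-8859-1")
latin1_text = ET.fromstring(latin1_doc).text

print(default_text == "\u03c0")    # True
print(latin1_text == "caf\u00e9")  # True
```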

Although only UTF-8 and UTF-16 are required, there are many character encodings that an XML processor can support, such as ISO-8859-1, ISO-2022-JP, Shift_JIS, and EUC-JP.

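The increasing adoption of Unicode is the main driving force behind XML 1.1. There are three main areas in which XML 1.1 differs from XML 1.0, all having to do with characters: which characters are allowed in names, which control characters can be entered as character references, and which line endings are recognized.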

You'll see these various points in more depth today. However, note that most of these differences are technical, and won't concern you a great deal. For example, XML 1.0 and 1.1 differ slightly in what character references you can use. As in HTML, a character reference stands for a Unicode character; it begins with &#, followed by a numeric code specifying a character, and ends with ;. You can either enter a Unicode character in an XML document as the character itself or as a character reference, which the XML processor will convert into the corresponding character.

For example, the Unicode for π is 960 in decimal, so you can embed π in your XML document by entering π (if your text editor supports Unicode), or as the character reference &#960; (if your text editor doesn't support Unicode). The XML processor will replace the character reference with π. (You can also give the Unicode in hexadecimal if you preface it with an x, which would be &#x03C0; in this case.)
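Here is a short sketch of that replacement, using the XML parser in Python's standard library as a stand-in for "the XML processor". Both the decimal and the hexadecimal form of the reference come back as the same character.

```python
import xml.etree.ElementTree as ET

# The same character written as a decimal and as a hexadecimal reference.
doc = "<message>&#960; and &#x03C0;</message>"
text = ET.fromstring(doc).text

print(text == "\u03c0 and \u03c0")  # True
```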

The difference between XML 1.0 and XML 1.1 as far as character references go is that XML 1.1 allows the use of character references &#x1; through &#x1F;, most of which are forbidden in XML 1.0. Conversely, the characters &#x7F; through &#x9F;, which were allowed as characters or character references in XML 1.0 documents, may only appear as character references in XML 1.1. These kinds of relatively small differences aren't going to concern us a great deal. For all these details, check the XML 1.1 recommendation itself.
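The XML 1.0 side of this rule can be observed with a parser that follows XML 1.0, which is the assumption made here about the parser in Python's standard library. The helper is_well_formed is a hypothetical name introduced for this sketch.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# &#x1; is one of the control character references forbidden in XML 1.0.
control_ref_ok = is_well_formed("<message>&#x1;</message>")

# &#x9; (tab) is one of the few references in that range XML 1.0 does allow.
tab_ref_ok = is_well_formed("<message>&#x9;</message>")

print(control_ref_ok, tab_ref_ok)  # False True
```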

That's given us a handle on the character encodings you can use to create XML documents. The next step is to see just how you put those characters to work in XML as you create markup and text data.

Understanding XML Markup and XML Data

At their most basic level, XML documents are combinations of markup and text data. They might also include binary data one day, but there's no way to include binary data in an XML document at the moment. (If you want to associate binary data with an XML document, you keep that data external to the document and use an entity reference, as you'll see later today and in Day 5 in detail.)

The markup in a document gives it its structure. Markup includes start tags, end tags, empty element tags, entity references, character references, comments, CDATA section delimiters (more about CDATA sections in a few pages), document type declarations, and processing instructions. What about the data in an XML document? All the text in an XML document that is not markup is data.

Although the markup we've seen has mostly consisted of tags up to this point, there's another type of markup that doesn't use tags: general entity references and parameter entity references. Whereas tags begin with < and end with >, general entity references start with & and end with ; (much like the character references we've already seen; for example, &#x03C0; is a character reference for π, whatever encoding the document uses). General entity references are replaced by the entity they refer to when the document is parsed. Parameter entity references, which start with % and end with ;, are used in DTDs, as we'll see in Days 4 and 5.

For example, the markup &lt; is a general entity reference that is turned into a < (less than) symbol when parsed by an XML processor, and the general entity reference &gt; is turned into a > (greater than) symbol when parsed by an XML processor. You can see an example using these general entity references in Example 2.1.

Example 2.1. Using an Entity Reference (ch02_01.xml)

<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This text is inside a &lt;message&gt; element.
    </message>
</document>

You can see ch02_01.xml in Internet Explorer in Figure 2.9. As you can see in the figure, the markup &lt; was turned into a <, and the markup &gt; was turned into a > by the XML processor.

Figure 2.9 Using markup in Internet Explorer.
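If you don't have Internet Explorer at hand, the same substitution can be seen with any XML processor. The sketch below uses the parser in Python's standard library (an assumption, not the tool used in the book) on the contents of ch02_01.xml.

```python
import xml.etree.ElementTree as ET

# The contents of ch02_01.xml, held as bytes just as they would be on disk.
ch02_01 = b"""<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This text is inside a &lt;message&gt; element.
    </message>
</document>"""

root = ET.fromstring(ch02_01)

# strip() only removes the indentation around the text for display.
message_text = root.find("message").text.strip()

print(message_text)  # This text is inside a <message> element.
```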

Besides character references, where a character code is replaced by the character it stands for, there are five predefined general entity references in XML, which are used when XML processors might otherwise assume that they're part of markup to be interpreted: &amp; (&), &lt; (<), &gt; (>), &apos; ('), and &quot; (").

You can also create your own general entity references, which we'll do in Day 5.

When an XML processor parses your XML, it replaces general entity references like &gt; with the entity those references stand for, which is > in this case. Before it's parsed, text data is called character data; after it's been parsed and general entity references have been replaced with the entities they refer to, the text data is called parsed character data.
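The step from character data to parsed character data can be sketched as follows, again with Python's standard library standing in for the XML processor. The escape function from xml.sax.saxutils goes the other way, turning the characters &, <, and > back into entity references.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing replaces each predefined entity reference with its character.
parsed = ET.fromstring("<data>&amp; &lt; &gt; &apos; &quot;</data>").text
print(parsed)  # & < > ' "

# Going the other way: prepare text so it can be placed inside an element.
escaped = escape("a < b & c > d")
print(escaped)  # a &lt; b &amp; c &gt; d
```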

Using Whitespace and Ends of Lines

Spaces, carriage returns, line feeds, and tabs are all treated as whitespace in XML. That means that to an XML processor, this XML document:

<?xml version="1.0" encoding="UTF-8"?>
<document>
<heading>
Hello From XML
</heading>
<message>
This is an XML document!
</message>
</document>

is the same as this one, in terms of content:

<?xml version="1.0" encoding="UTF-8"?>
<document><heading>Hello From XML</heading>
<message>This is an XML document!</message></document>
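One way to check that two layouts carry the same content is sketched below. It assumes that "same content" means the same elements with the same text once runs of whitespace are collapsed; the parser itself (Python's standard library here) hands the whitespace through, and the hypothetical helper normalized_content does the collapsing.

```python
import xml.etree.ElementTree as ET

spread_out = """<document>
<heading>
Hello From XML
</heading>
<message>
This is an XML document!
</message>
</document>"""

compact = """<document><heading>Hello From XML</heading>
<message>This is an XML document!</message></document>"""

def normalized_content(xml_text):
    """List each child element's tag and text with whitespace runs collapsed."""
    root = ET.fromstring(xml_text)
    return [(child.tag, " ".join((child.text or "").split())) for child in root]

same_content = normalized_content(spread_out) == normalized_content(compact)
raw_heading = ET.fromstring(spread_out).find("heading").text

print(same_content)       # True
print(repr(raw_heading))  # '\nHello From XML\n'
```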

You can use a special attribute named xml:space in an element to indicate that you want whitespace to be preserved by XML processors (not all XML processors will support this attribute). You can set this attribute to "default" to indicate that the default handling of whitespace is OK for the current element and all contained elements, or "preserve" to indicate that you want all applications to preserve whitespace as it is in the document. This is useful if the XML processor is going to display the XML document visually:

<?xml version="1.0" encoding="UTF-8"?>
<document xml:space="preserve">
    <heading>
        Hello From XML
    </heading>
    <message>
        This is an XML document!
    </message>
</document>
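How an application might honor xml:space is sketched below. This is only one possible policy, with a hypothetical display_text helper: the parser in Python's standard library reports the attribute under the XML namespace name, and the application decides whether to keep or collapse whitespace.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the predefined XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<document xml:space="preserve">
    <heading>
        Hello From XML
    </heading>
</document>"""

def display_text(text, mode):
    """Keep whitespace as written for "preserve", otherwise collapse it."""
    return text if mode == "preserve" else " ".join(text.split())

root = ET.fromstring(doc)
mode = root.get(XML_SPACE, "default")

heading = root.find("heading")
# The setting applies to contained elements unless they override it.
heading_mode = heading.get(XML_SPACE, mode)
shown = display_text(heading.text, heading_mode)

print(mode)         # preserve
print(repr(shown))  # '\n        Hello From XML\n    '
```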

In XML 1.0, lines officially end with a linefeed character (ASCII and UTF-8 code 10, the Unix way of ending lines). In MS-DOS and some Windows programs, lines can end with a carriage return (ASCII and UTF-8 code 13) linefeed pair, but when parsed by an XML processor, that pair (codes 13 and 10) is converted into a single linefeed (ASCII and UTF-8 code 10), as is a carriage return on its own. In XML 1.1, which is mostly about expanding the character sets you can use, XML 1.0 was considered to discriminate against the conventions used on IBM and IBM-compatible mainframes. That means that in XML 1.1, the acceptable line endings that XML processors are supposed to convert to &#xA; are expanded to include the next line character (&#x85;) used on those mainframes, a carriage return followed by that character (&#xD; &#x85;), and the Unicode line separator (&#x2028;).
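The XML 1.0 line-ending conversion can be seen in a short sketch. It assumes the parser in Python's standard library, which follows XML 1.0, so the additional XML 1.1 line endings are not covered here.

```python
import xml.etree.ElementTree as ET

# One line ends with a carriage return linefeed pair (13, 10),
# the next with a carriage return (13) on its own.
raw_doc = b"<message>line one\r\nline two\rline three</message>"
text = ET.fromstring(raw_doc).text

# After parsing, every line ending is a single linefeed (10).
print(repr(text))  # 'line one\nline two\nline three'
```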

That brings us up through the basic structure of an XML document—markup and data. Now it's time to actually start putting markup and data to work as you start creating XML documents.
