This section describes some ways to work around XML's size problems, both with XML markup and with internal representations of parsed XML documents.
8.4.1 Unicode and Character Size
XML documents can use many different character encodings, and those affect the size of a document on disk, through the network, and in memory. Choosing the right character encoding for a project can be the cheapest and easiest kind of compression.
Because the W3C designed XML for international use, all XML parsers are required to support the standard Unicode encodings UTF-8 and UTF-16. Many XML parsers also accept other popular encodings, such as ASCII, ISO-8859-1 (the ISO Latin 1 alphabet, sometimes informally called 8-bit ASCII), and Shift-JIS. Other than the Unicode UTF and UCS encodings, no character encodings support the entire Unicode character set: For example, ISO-8859-1 does not support Japanese characters, and Shift-JIS does not support accented European characters. In XML document instances, however, it is possible to include characters outside the current character set by using character references; the following example includes the Russian word мир in an ASCII-encoded XML document:
<?xml version="1.0" encoding="US-ASCII"?>
<trivia>The Russian word <q>&#1084;&#1080;&#1088;</q> means both <q>world</q> and <q>peace</q>.</trivia>
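As a minimal sketch of how such character references can be produced, the following Python fragment uses the standard xmlcharrefreplace error handler to turn non-ASCII characters into decimal character references, then parses the resulting ASCII document to check that the original word comes back. The variable names and the shortened sentence are illustrative assumptions, not part of any particular XML toolkit.

```python
import xml.etree.ElementTree as ET

word = "мир"

# Characters that ASCII cannot represent become decimal character references.
refs = word.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# Build a pure-ASCII document (hypothetical, shortened sentence).
doc = (
    '<?xml version="1.0" encoding="US-ASCII"?>\n'
    f"<trivia>The Russian word <q>{refs}</q> means both world and peace.</trivia>"
).encode("ascii")

# The parser resolves the references back into the original characters.
root = ET.fromstring(doc)
first_q = root.find("q").text

print(refs)
print(first_q)
```

Every octet in the document is plain ASCII, yet the parsed text still contains the Cyrillic word.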
If space and bandwidth are at a premium, then, it is possible to save some space by choosing the right XML character encoding.
With UTF-16, all characters (at least, all the ones you're likely to use) require 2 octets, so the phrase "Hello, world!" will use 26 octets to store 13 characters.
With UTF-8, basic ASCII characters require only 1 octet, so "Hello, world!" would use only 13 octets to store 13 characters; however, Cyrillic or accented European characters require 2 octets each, and characters for other languages can require 3 or 4 octets (the original UTF-8 design allowed sequences of up to 6 octets, but the current standard limits them to 4).
With the ISO-8859 character encodings, each character requires only 1 octet, even accented European characters, but any characters not in the set require XML character references, such as &#1084;.
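The arithmetic above can be checked with a short Python sketch. It measures the encoded size of a few sample strings, falling back to XML character references where an encoding lacks a character. The helper name encoded_size and the sample strings are assumptions chosen for illustration; utf-16-be is used so that no byte-order mark is added to the count.

```python
def encoded_size(text, encoding):
    """Return the number of octets needed to store text in the given encoding.

    Characters the encoding cannot represent are written as XML
    character references (for example &#1084;), as an XML writer would do.
    """
    return len(text.encode(encoding, errors="xmlcharrefreplace"))


samples = {
    "English": "Hello, world!",
    "French": "café",
    "Russian": "мир",
    "Chinese": "中文",
}

for label, text in samples.items():
    sizes = {
        enc: encoded_size(text, enc)
        for enc in ("utf-8", "utf-16-be", "iso-8859-1")
    }
    print(label, sizes)
```

With ISO-8859-1, the three-letter Russian word costs 21 octets because each letter becomes a seven-character reference, which is why a single-octet encoding only pays off when nearly all of the text falls inside its character set.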
Of course, many other character encodings are available, but the tradeoffs should be obvious. XML documents containing English-language text and computer programming code will be smallest with UTF-8 and generally twice as large with UTF-16; non-English European languages will be smallest with one of the ISO-8859 alphabets, if available, and only slightly larger with UTF-8. Many other world languages, such as Chinese, will be smallest with UTF-16 or a specialized encoding and can grow considerably larger with UTF-8. In most cases, where storage space is not severely constrained and every millisecond of bandwidth doesn't count, choosing either UTF-8 or UTF-16 and sticking with it may be the best choice; if size is critical, choosing the right encoding for storage can bring considerable savings.
The second Unicode-related size issue arises once the XML document has been parsed and read into memory: Typically, parsers will convert the original XML document's characters into a standard internal encoding for processing, almost always UTF-8 or UTF-16, so that any Unicode characters can be included. The same size advantages and disadvantages for various languages apply here: English and European languages will generally use less space with UTF-8, whereas nonalphabetic languages, such as Chinese, will use less space with UTF-16. There are also processing considerations: UTF-8 works better with older libraries in some programming languages, such as C and C++, but string manipulation can be tricky because characters do not have a fixed width. UTF-16 gives all commonly used characters a fixed width (only characters outside the Basic Multilingual Plane need two 16-bit units) and works well with modern programming languages, such as Java, but it can be difficult to use in older libraries and programming languages.
Internalization is a useful trick from older programming languages, such as LISP. Because the same symbol, such as list, can appear frequently in a LISP program, implementers can save space if they put exactly one copy of each symbol in a lookup table and always point to that one location, rather than copying the symbol over and over again and wasting memory.
Many, if not most, XML parsers internalize element and attribute names in the same way, as a name may appear hundreds or thousands of times in an XML document. In Example 8-6, the element name list appears twice, and the name item appears eight times, but the XML parser would allocate only a single instance of each in memory.
Example 8-6. Repeated Names
<list>
  <item>100</item>
  <item>200</item>
  <item>300</item>
  <item>400</item>
</list>
In general, parsers do not internalize other strings, such as attribute values and character data, as they are much less likely to be repeated, and internalizing adds some processing overhead. However, an attribute value or character data content in XML documents is sometimes limited to a set of enumerated values: For example, the status of a task entry in a dictionary might be "draft", "pending", "approved", or "released". Because the value must be one of these and no other, an application can internalize the value of the element or attribute holding the information, as the values are very likely to repeat. Sometimes, XML schemas and DTDs can provide hints to XML processors about what values are enumerated, but applications should pay attention to internalization opportunities as well, as they can provide important memory savings.
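An application-level version of this idea can be sketched in a few lines of Python. The InternTable class below is a hypothetical helper, not part of any XML library: it keeps one canonical copy of each string it has seen and hands that copy back for every later equal string. The byte strings stand in for status values arriving from a parser as separate objects.

```python
class InternTable:
    """Keep exactly one shared copy of each distinct string."""

    def __init__(self):
        self._table = {}

    def intern(self, value):
        # setdefault stores value the first time it is seen and
        # returns the previously stored copy on every later call.
        return self._table.setdefault(value, value)

    def __len__(self):
        return len(self._table)


# Simulated parser output: each decode() call builds a new string object.
raw_chunks = [b"draft", b"pending", b"draft", b"approved", b"draft", b"pending"]
raw_values = [chunk.decode("ascii") for chunk in raw_chunks]

table = InternTable()
interned = [table.intern(value) for value in raw_values]

# Count how many distinct string objects are held after interning.
distinct_objects = len({id(value) for value in interned})
print(len(interned), "values share", distinct_objects, "string objects")
```

Python also offers sys.intern for the same purpose; a hand-rolled table is shown here only because it makes the lookup step explicit and can be discarded when the document is no longer needed.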
If an XML document is going out over the network or being kept in long-term storage, brute-force compression is sometimes a much better choice than any of the other techniques in this section. XML is text with a lot of repetition, so it compresses surprisingly well. For example, at the point that I'm writing this paragraph, the XML manuscript for this book is 440,320 bytes long. Simply running the manuscript through the UNIX bzip2 -9 command reduces its size to 98,815 bytes, or only 22 percent of the original. If you are willing to trade a bit of processing time for the sake of reducing storage or network bandwidth requirements, straightforward compression will save far more space than any of the other techniques in this section and will outperform almost any optimized binary format as well.
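A minimal sketch of this kind of brute-force compression, using the bz2 module from the Python standard library at the same level as bzip2 -9. The generated document is a made-up stand-in for a real file, so the ratio it prints is not comparable to the manuscript figures quoted above.

```python
import bz2

# Build a repetitive XML document as a stand-in for a real file.
items = "".join(f"  <item>{n}</item>\n" for n in range(1000))
xml_bytes = f"<list>\n{items}</list>\n".encode("utf-8")

# Level 9 corresponds to the largest block size, as with bzip2 -9.
compressed = bz2.compress(xml_bytes, compresslevel=9)

# The data must be restored in full before a parser can read it.
restored = bz2.decompress(compressed)

ratio = len(compressed) / len(xml_bytes)
print(f"{len(xml_bytes)} -> {len(compressed)} octets ({ratio:.0%} of original)")
```

Because the compressed form has to be expanded completely before parsing, the saving applies only to storage and transmission, not to the working copy in memory.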
Unfortunately, brute-force compression is not always the answer. Compressed XML needs to be uncompressed before it can be used in any way, so it is virtually useless for in-memory processing. In these cases, specialized compression techniques, such as internalization, can bring some savings, but they are much less dramatic.