8.2 Disadvantages of XML for Size and Performance
Despite the advantages mentioned in Section 8.1, XML does sometimes cause a significant increase in data size and processing time. These disadvantages are the result of design decisions and tradeoffs made by XML's original designers. For example, to make XML fully internationalized, the designers chose to require Unicode support, which can increase the memory required for processing and storing information from XML documents. The designers also chose the robustness of redundant labels in start and end tags, increasing the amount of space XML requires in disk storage or the amount of bandwidth for moving it over a network. The most serious performance risk, however, is one that people do not often worry about: XML's ability to include external resources.
XML repeats every element and attribute name for every element and attribute instance. In fact, it repeats the element name twice for every nonempty element: once in the start tag and once in the end tag. If a long XML document contains 20,000 nonempty elements named maintenance-entry, the 17-character string maintenance-entry will appear in the document 40,000 times, consuming between 680,000 and 2,720,000 bytes of storage space, depending on the character encoding.
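The arithmetic behind these figures can be checked with a short sketch. The function name tag_name_overhead is a hypothetical helper introduced only for illustration, and the calculation assumes a fixed number of bytes per character (1 for ASCII text in UTF-8, 4 for UCS-4).

```python
def tag_name_overhead(name, instances, bytes_per_char):
    """Bytes spent on an element name that appears in a start and an end tag.

    Hypothetical helper: counts only the name itself, not the angle
    brackets, the slash, or any attributes.
    """
    occurrences = 2 * instances  # one start tag plus one end tag per element
    return len(name) * occurrences * bytes_per_char


name = "maintenance-entry"  # 17 characters
low = tag_name_overhead(name, 20_000, 1)   # 1 byte per character (UTF-8, ASCII)
high = tag_name_overhead(name, 20_000, 4)  # 4 bytes per character (UCS-4)
print(low, high)  # prints "680000 2720000"
```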
For loosely structured XML, such as human-readable documents (see Chapter 3), this overhead is often not a problem, but for highly structured XML, such as a database dump, these repeated names represent a significant overhead. There is a temptation to use short, cryptic element and attribute names, such as c183, instead of workflow-approval, destroying XML's advantage of transparency. There is also a temptation to reduce the amount of tagging, using whitespace and line ends to delimit some fields. These solutions are not particularly good, but they do show the desperation people face when dealing with enormous XML data files.
Sometimes, text can be more efficient than binary representations: For example, the long integer "1" requires 1 byte to represent in text using UTF-8 text encoding but 4 bytes to represent in a typical binary encoding. More often, however, the XML textual representation is longer: For example, the short integer "15,383" requires between 5 and 20 bytes in text, depending on character encoding, but only 2 bytes in binary form.
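These sizes can be measured directly. The following is a minimal sketch, assuming that a "long integer" is a 32-bit value and a "short integer" is a 16-bit value, and that the text form omits the thousands separator; the helper names text_size and binary_size are illustrative only.

```python
import struct


def text_size(value, encoding):
    """Bytes needed to store the decimal digits of value in the given encoding."""
    return len(str(value).encode(encoding))


def binary_size(value, fmt):
    """Bytes needed to store value in a fixed-width binary layout."""
    return len(struct.pack(fmt, value))


# The long integer 1: shorter as text than as a 32-bit binary value.
one_text = text_size(1, "utf-8")    # 1 byte
one_binary = binary_size(1, "<i")   # 4 bytes

# The short integer 15383: longer as text than as a 16-bit binary value.
short_text_min = text_size(15383, "utf-8")      # 5 bytes
short_text_max = text_size(15383, "utf-32-be")  # 20 bytes
short_binary = binary_size(15383, "<h")         # 2 bytes

print(one_text, one_binary, short_text_min, short_text_max, short_binary)
```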
In fact, character encoding itself can cause an enormous size increase for XML documents, both in memory and on disk. The Unicode UCS-4 encoding, which, fortunately, almost no one uses, requires 4 bytes for each character, so 100,000 characters become 400,000 bytes of storage. UTF-16, which is more common, requires 2 bytes for most characters and 4 bytes for the rest. UTF-8 requires only 1 byte for ASCII characters but 3 bytes for most Asian characters and 4 bytes for characters outside the Basic Multilingual Plane.
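The per-character cost of each encoding can be observed with a few sample characters. This sketch uses Python's utf-32-be codec as a stand-in for UCS-4 and picks three arbitrary sample characters; note that under the current definition of UTF-8, no Unicode character needs more than 4 bytes.

```python
def encoded_length(text, encoding):
    """Number of bytes the text occupies in the given encoding (no byte-order mark)."""
    return len(text.encode(encoding))


# Arbitrary sample characters chosen for illustration.
samples = {
    "ascii": "A",                   # U+0041
    "cjk": "\u4e2d",                # U+4E2D, a common Chinese character
    "supplementary": "\U00020000",  # U+20000, outside the Basic Multilingual Plane
}

sizes = {
    label: {enc: encoded_length(ch, enc) for enc in ("utf-8", "utf-16-be", "utf-32-be")}
    for label, ch in samples.items()
}

for label, row in sizes.items():
    print(label, row)

# 100,000 characters at 4 bytes each.
ucs4_total = encoded_length("x" * 100_000, "utf-32-be")
print(ucs4_total)  # prints 400000
```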
8.2.3 External References
The biggest performance risk for XML comes not from the fact that it is text based, that it is parsed, or that it can use Unicode but from the fact that XML documents can include external files. To make things worse, the inclusion can take place in the lowest-level XML parsing layer, where it is completely hidden from (and sometimes outside the control of) the application developer. Consider Example 8-2.
Example 8-2. Referencing an External DTD
<!DOCTYPE doc SYSTEM "http://www.example.org/dtds/doc.dtd">
<doc>
...
</doc>
By default, almost all validating XML parsers will go to www.example.org and download doc.dtd every time they parse this document, leading to some serious performance problems:
Even with a fast network connection, each document will likely require seconds rather than milliseconds to parse.
If www.example.org is slow, possibly because of a heavy network load, parsing will slow down even further, possibly on the order of minutes for each document.
If www.example.org goes offline, parsing will fail completely.
If www.example.org has a security breach, an intruder could modify the DTD to cause denial of service or include false information in XML documents referencing it.
External DTD subsets are the greatest danger, but they are not the only way XML documents can cause files to be downloaded automatically during processing: The documents can also use external parameter entities in the DTD subset and external text entities in the document itself. Some XML parsers will also automatically download schemas, such as XML Schema [XML-SCHEMA] or RELAX NG [RELAXNG], referenced from inside a document.
Organizations can work around this problem by always parsing inside a sandbox that prohibits or limits external network access, providing local copies of required files, such as schemas. Many developers do not consider this step at first, however, and when looking over HTTP server logs, it is not uncommon to see some sites hitting the same online DTD file hundreds or thousands of times a day, almost certainly because of automatic downloading by XML parsers.
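One way to provide local copies is to intercept external references in the parser itself. The following is a minimal sketch using the SAX interface in Python's standard library, not a definitive implementation. The LOCAL_COPIES catalog, the one-line DTD it contains, and the names LocalOnlyResolver, ExternalResourceBlocked, ElementNames, and parse_offline are all assumptions introduced for illustration. The resolver serves known system identifiers from memory and refuses everything else, so the parser never opens a network connection on its own.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical local catalog: system identifier -> locally stored copy.
# The DTD content here is a placeholder, not the real doc.dtd.
LOCAL_COPIES = {
    "http://www.example.org/dtds/doc.dtd": b"<!ELEMENT doc (#PCDATA)>\n",
}


class ExternalResourceBlocked(Exception):
    """Raised when a document references a resource with no local copy."""


class LocalOnlyResolver(EntityResolver):
    """Serve external entities from a local catalog instead of the network."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.requested = []  # every system identifier the parser asked for

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        if systemId not in self.catalog:
            # Fail loudly rather than let the parser fetch the resource.
            raise ExternalResourceBlocked(systemId)
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(self.catalog[systemId]))
        return source


class ElementNames(ContentHandler):
    """Collect element names so the caller can see that parsing completed."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_offline(document, catalog=LOCAL_COPIES):
    """Parse document bytes, resolving external entities only from catalog."""
    parser = xml.sax.make_parser()
    # Ask the parser to hand external entities to the resolver below.
    parser.setFeature(feature_external_ges, True)
    resolver = LocalOnlyResolver(catalog)
    handler = ElementNames()
    parser.setEntityResolver(resolver)
    parser.setContentHandler(handler)
    source = InputSource()
    source.setByteStream(io.BytesIO(document))
    parser.parse(source)
    return handler.names, resolver.requested


doc = b'<!DOCTYPE doc SYSTEM "http://www.example.org/dtds/doc.dtd"><doc>text</doc>'
names, requested = parse_offline(doc)
print(names, requested)
```

Raising an exception for unknown identifiers is a deliberate choice in this sketch: a missing local copy becomes an immediate, visible error instead of a silent download. Simply leaving feature_external_ges disabled is a blunter alternative when the external files are not needed at all.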