Why HTML Is Not the Answer
I hear you saying to yourself, "Ah, Dan, but what about HTML? I can use HTML for managing information, and I get Web publishing for free (because HTML is the language of the Web). Isn't HTML also derived from SGML, and isn't it also a great, standardized way of storing documents?" Well, yes on one, no on two. HTML is wonderful, but for all its beauty, HTML is really good only at describing layout: it's a display-oriented markup. Using HTML, you can make a word bold or italic, but as to the reason that word might be bold or italic, HTML remains mute. With XML, because you define the markup you want to use in your documents, you can mark a certain word as a person's name or the title of a book. When the document is represented, the word will appear bold or italic; but with XML, because your documents know all the locations of people's names or book titles, you can capriciously decide that you want to underline book titles across the board. You have to make this change only once, wherever your XML documents are being represented. And that's just the beginning. Your documents are magically transformed from a bunch of relatively dumb HTML files to documents with intelligence, documents with muscle.
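To make the "change it only once" point concrete, here is a minimal sketch in Python. The element names (para, person, booktitle), the sample sentence, and the style table are all made up for illustration; a real system would more likely use a stylesheet language, but the idea is the same: the document says what a word is, and a separate table decides how it looks.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical semantic markup: the tags say what the words ARE, not how they look.
SOURCE = (
    "<para><person>Herman Melville</person> wrote "
    "<booktitle>Moby-Dick</booktitle>.</para>"
)

def render(elem, style):
    """Recursively turn a semantic element into display-oriented HTML."""
    tag = style[elem.tag]  # look up the display tag for this kind of content
    parts = [f"<{tag}>", escape(elem.text or "")]
    for child in elem:
        parts.append(render(child, style))
        parts.append(escape(child.tail or ""))  # text that follows the child
    parts.append(f"</{tag}>")
    return "".join(parts)

def represent_as_html(xml_text, style):
    return render(ET.fromstring(xml_text), style)

# The presentation decision lives in one place (an assumed, illustrative table).
style = {"para": "p", "person": "b", "booktitle": "i"}
html_before = represent_as_html(SOURCE, style)

# Capriciously decide to underline book titles: one change, no document edits.
style_underlined = {**style, "booktitle": "u"}
html_after = represent_as_html(SOURCE, style_underlined)

print(html_before)  # <p><b>Herman Melville</b> wrote <i>Moby-Dick</i>.</p>
print(html_after)   # <p><b>Herman Melville</b> wrote <u>Moby-Dick</u>.</p>
```

Notice that SOURCE never changes. Had the sentence been stored as HTML with an italic tag around the title, there would be no reliable way to tell book titles apart from everything else that happened to be italic.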
If I hadn't already learned this lesson, I learned it again when migrating TheStreet.com (the online financial news service that I referred to in the Introduction) from a relatively dumb HTML-based publishing system to a relatively smart XML-based content management system. When I joined TheStreet.com, it had been running for over two years with archived content (articles) that needed to be migrated to the new system. This mass of content was stored only as HTML files on disk. A certain company (which shall remain nameless) had built the old system, apparently assuming that no one would ever have to do anything with this data in the future besides spit it out in exactly the same format. With a lot of Perl (then the lingua franca of programming languages for the Web and an excellent tool for writing data translation scripts) and one developer's hard-working and largely unrecognized efforts over the course of six months, we managed to get most of it converted to XML. Would it have been easier to start with a content management system built from the ground up for repurposing content? Undoubtedly!
If this tale doesn't motivate you sufficiently, consider the problem of the wireless applications market. Currently, wireless devices (such as mobile phones, Research In Motion's BlackBerry pager, and the Palm VII wireless personal digital assistant) are springing up all over, and content providers are hot to trot out their content onto these devices. These devices do not all implement the same markup language. Many wireless devices use WML (Wireless Markup Language, the markup language component of WAP, Wireless Application Protocol), which is built on top of XML. Any content providers who are already working with XML are uniquely positioned to get their content onto these devices. Anyone who isn't is going to be left holding the bag.
So HTML or WML or whatever you like becomes an output format (the display-oriented markup) for our XML documents. In building a Web publishing system, display-oriented markup happens at the representation stage, the very last stage. When our XML document is represented, it is represented in HTML (either on the fly or in a batch mode). Thus HTML is a "representation" of the root XML document. Just as a music CD or tape is a representation of a master recording made with much more high-fidelity equipment, the display-oriented markup (HTML, WML, or whatever) is a representation for use by a consumer. As a consumer, you probably don't have an 18-track digital recording deck in your living room (or pocket). The CD or tape (or MP3 audio file, for that matter) is a representation of the original recording for you to take with you. But the music publisher retains the original master recording so that when a new medium comes out (like Super Audio CD, for instance), the publisher can convert the high-quality master to this new format. In the case of XML, you retain your XML data forever in your database, but what you send to consumers is markup specific to their current needs.
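The master-recording idea can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the article, headline, and body elements, the sample story, and the two renderer functions. The WML output is deliberately simplified (a real deck also needs an XML prolog and DOCTYPE). The point is only that one stored XML document feeds several display-oriented formats.

```python
import xml.etree.ElementTree as ET
from html import escape

# The "master recording": a hypothetical XML article kept in the database.
MASTER = """<article>
  <headline>Markets Rally</headline>
  <body>Stocks closed higher on Friday.</body>
</article>"""

def to_html(article):
    """Represent the article as a (minimal) HTML page for Web browsers."""
    headline = escape(article.findtext("headline", default=""))
    body = escape(article.findtext("body", default=""))
    return (
        f"<html><head><title>{headline}</title></head>"
        f"<body><h1>{headline}</h1><p>{body}</p></body></html>"
    )

def to_wml(article):
    """Represent the same article as a (simplified) WML deck for wireless devices."""
    headline = escape(article.findtext("headline", default=""))
    body = escape(article.findtext("body", default=""))
    return f'<wml><card id="story" title="{headline}"><p>{body}</p></card></wml>'

# Supporting a new medium means adding a renderer here, not touching the master.
RENDERERS = {"html": to_html, "wml": to_wml}

def represent(xml_text, output_format):
    """Produce the requested representation at the very last stage."""
    return RENDERERS[output_format](ET.fromstring(xml_text))

html_page = represent(MASTER, "html")
wml_deck = represent(MASTER, "wml")
print(html_page)
print(wml_deck)
```

When the next device comes along with its own markup language, the stored XML stays exactly as it is; you write one more renderer and register it.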