Designing an XML Dialect

Although XML is described as a language and is compared with Hypertext Markup Language (HTML), it’s actually much larger in scope than that. XML is a markup language that defines how to define a markup language.

That’s an odd distinction to make, and it sounds like the kind of thing you’d encounter in a philosophy textbook. This concept is important to understand, though, because it explains how XML can be used to define data as varied as health-care claims, genealogical records, newspaper articles, and molecules.

The “X” in XML stands for Extensible, and it refers to organizing data for your own purposes. Data that’s organized using the rules of XML can represent anything you want:

  • A programmer at a telemarketing company can use XML to store data on each outgoing call, saving the time of the call, the number, the operator who made the call, and the result.
  • A lobbyist can use XML to keep track of the annoying telemarketing calls she receives, noting the time of the call, the company, and the product being peddled.
  • A programmer at a government agency can use XML to track complaints about telemarketers, saving the name of the marketing firm and the number of complaints.

Each of these examples uses XML to define a new language that suits a specific purpose. Although you could call them XML languages, they’re more commonly described as XML dialects or XML document types.
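To make the first example concrete, here is a minimal sketch of what a telemarketing call log might look like as an XML dialect, read with Python's standard xml.etree.ElementTree module. The element and attribute names (CallLog, Call, Result, time, number, operator) and the sample values are invented for illustration and do not come from any real dialect.

```python
import xml.etree.ElementTree as ET

# Hypothetical dialect for the telemarketing example. Every element and
# attribute name here is an assumption made for illustration.
CALL_LOG = """<?xml version="1.0"?>
<CallLog>
  <Call time="2024-03-01T09:15:00" number="555-0100" operator="jsmith">
    <Result>no answer</Result>
  </Call>
  <Call time="2024-03-01T09:17:30" number="555-0101" operator="jsmith">
    <Result>sale</Result>
  </Call>
</CallLog>
"""


def read_calls(xml_text):
    """Return one dictionary per <Call> element in the document."""
    root = ET.fromstring(xml_text)
    calls = []
    for call in root.findall("Call"):
        calls.append({
            "time": call.get("time"),
            "number": call.get("number"),
            "operator": call.get("operator"),
            "result": call.findtext("Result"),
        })
    return calls


calls = read_calls(CALL_LOG)
for call in calls:
    print(call["time"], call["number"], call["operator"], call["result"])
```

The lobbyist and the government agency would choose different element names for their own data, which is all that separates one dialect from another at this level.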

An XML dialect can be designed using a Document Type Definition (DTD), which lists the elements and attributes the dialect allows.
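As a rough illustration of what a DTD contains, the sketch below embeds a small, hypothetical DTD directly in a document and uses Python's xml.parsers.expat module to list the elements and attributes it declares. The Library, Book, Title, and Author names and the isbn attribute are assumptions made for this example, not part of any published dialect.

```python
import xml.parsers.expat

# Hypothetical DTD for a small library dialect, written inline between
# the square brackets so the example is self-contained.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE Library [
  <!ELEMENT Library (Book*)>
  <!ELEMENT Book (Title, Author)>
  <!ELEMENT Title (#PCDATA)>
  <!ELEMENT Author (#PCDATA)>
  <!ATTLIST Book isbn CDATA #REQUIRED>
]>
<Library>
  <Book isbn="0000000000">
    <Title>Example Title</Title>
    <Author>Example Author</Author>
  </Book>
</Library>
"""


def declared_names(xml_text):
    """Collect the element and attribute declarations found in the DTD."""
    elements = []
    attributes = []

    def on_element_decl(name, model):
        elements.append(name)

    def on_attlist_decl(elname, attname, type_, default, required):
        attributes.append((elname, attname, bool(required)))

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.AttlistDeclHandler = on_attlist_decl
    parser.Parse(xml_text, True)
    return elements, attributes


elements, attributes = declared_names(DOCUMENT)
print("Elements:", elements)
print("Attributes:", attributes)
```

Expat is a non-validating parser, so this only reports what the DTD declares. It does not check the document body against those declarations.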

A special !DOCTYPE declaration can be placed in XML data, right after the initial ?xml declaration, to identify its DTD. Here’s an example:

<!DOCTYPE Library SYSTEM "librml.dtd">

When a document identifies its DTD this way, many XML tools can read the data and determine whether it follows all the DTD’s rules. If it doesn’t, the tool rejects the data with a reference to the line that caused the error. This process is called validating the XML.
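Before a tool can validate anything, it has to find out which DTD a document names. The sketch below shows that first step using Python's xml.parsers.expat module. Because expat is a non-validating parser, it reports the declaration but never opens librml.dtd or checks the data against it. The read_doctype function name and the one-element sample document are assumptions made for this example.

```python
import xml.parsers.expat

# A minimal document that uses the !DOCTYPE declaration from the text.
# The empty <Library/> body is a placeholder for illustration.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE Library SYSTEM "librml.dtd">
<Library/>
"""


def read_doctype(xml_text):
    """Return the root element name and DTD identifiers from !DOCTYPE."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["root"] = name
        found["system_id"] = system_id
        found["public_id"] = public_id

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found


doctype = read_doctype(SAMPLE)
print(doctype)
```

A validating tool would go on to load the file named by the system identifier and compare every element and attribute in the data against it.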

One thing you’ll run into as you work with XML is data that has been structured as XML but wasn’t defined using a DTD. Most versions of RSS, for example, do not require a DTD. This data can be parsed (presuming it’s well-formed), so you can read it into a program and do something with it, but you can’t validate it to make sure that it’s organized correctly according to the rules of its dialect.
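The difference between well-formed and valid can be sketched in a few lines. The example below uses Python's standard xml.etree.ElementTree module, which checks only that data is well-formed and never consults a DTD. The check_well_formed function name and both sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, line number)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, _column = err.position
        return False, line
    return True, None


# Well-formed: every opening tag has a matching closing tag.
GOOD = '<rss version="2.0"><channel><title>News</title></channel></rss>'

# Not well-formed: <channel> is never closed.
BAD = "<rss><channel></rss>"

print(check_well_formed(GOOD))
print(check_well_formed(BAD))
```

A document that passes this check could still break the rules of its dialect, for instance by omitting an element the dialect requires. Without a DTD or some other schema, a parser has no way to notice.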

