Seven Steps to XML Mastery, Step 1: Read Before You Write
Reading and understanding the structure of an XML document is the first step toward XML mastery. In this first step, we’ll examine RSS, a simple XML vocabulary for describing news stories, weblogs, and just about anything else on the Web that can be summarized and linked to. The power of RSS comes from the growing availability of web-based tools that present summaries and links to users. We’ll also get into SVG, a popular vocabulary for drawing scalable graphics. Our look at SVG takes us into the realm of declarative programming as we expand our horizons and explore more XML, including entities, DTDs, and CDATA.
Before we jump into those topics, let’s get some XML basics out of the way.
Where Did XML Come From?
Although XML is a relatively recent phenomenon, blessed by the W3C in 1998 as a full-fledged recommendation, the ideas behind it had been percolating in the data community for some time. XML’s origins can be traced as far back as the 1960s, when Charles Goldfarb and others at IBM were seeking ways to add semantic meaning to large data repositories. The result was a tag language called Generalized Markup Language (GML). In the 1970s and 1980s, GML evolved into SGML, a widely used, industrial-strength standard for document markup. SGML was big and comprehensive, much like Ada, its 1980s counterpart among programming languages. And, like Ada, SGML proved too hefty for everyday use, so developers were soon looking for a leaner, meaner alternative. The end product of that search, of course, was XML.
In XML, tags serve the same role as field names in a data structure. As an XML developer, you’re free to use any XML tag names that make sense for your application. Here’s an example of some XML data that might be used for a messaging application:
<message>
  <to>Bob Kirin</to>
  <from>Roger Rabbit</from>
  <subject>XML is hot</subject>
  <text>XML seems to be everywhere</text>
</message>
The tags in this example identify the basic parts of a message: the recipient, the sender, the subject, and the text. As in HTML, the <to> start tag has a matching end tag: </to>.
A start tag, its matching end tag, and the content between them together make up an XML element. Note that the <to> element is completely contained within the <message> and </message> tags. This ability to nest elements inside one another gives XML its capacity to represent data hierarchies, and herein lies its power.
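To make the idea of nesting concrete, here is a minimal sketch that reads the message document and lists the elements contained in <message>. It assumes Python and its standard xml.etree.ElementTree module, which this article does not otherwise use; the helper name child_fields is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The message document from the text, with every start tag properly closed.
MESSAGE_XML = """
<message>
  <to>Bob Kirin</to>
  <from>Roger Rabbit</from>
  <subject>XML is hot</subject>
  <text>XML seems to be everywhere</text>
</message>
"""


def child_fields(xml_text):
    """Return (tag, text) pairs for the direct children of the root element.

    Hypothetical helper, named here only for illustration.
    """
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


root = ET.fromstring(MESSAGE_XML)
print(root.tag)  # the outermost element
for tag, text in child_fields(MESSAGE_XML):
    # Each child element is nested inside <message>.
    print(f"  {tag}: {text}")
```

Iterating over an element yields its nested child elements in document order, which is how the hierarchy described above shows up in code.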
Tags can also contain attributes: additional information included within the start tag’s angle brackets. When designing an XML vocabulary, you must decide whether to represent each piece of data as an element or as an attribute.
The following XML example shows how the same message could also be defined using attributes named to, from, and subject:
<message to="Bob Kirin" from="Roger Rabbit" subject="XML is hot">
  <text>XML seems to be everywhere</text>
</message>
As in HTML, the attribute name is followed by an equal sign (=) and the attribute value; in XML the value must always be enclosed in quotes. Since you could design a data structure like <message> equally well using either attributes or elements, you’ll need to decide which is best for your application. A widely accepted rule of thumb is to use elements for core data and attributes for metadata. Of course, one person’s metadata is another’s core data, so there’s some wiggle room here. For an insightful discussion of elements versus attributes, read Uche Ogbuji’s article "Principles of XML Design: When to Use Elements Versus Attributes."
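The design choice also affects how a program reads the data. The sketch below, again assuming Python's standard xml.etree.ElementTree module, pulls the recipient out of both versions of the message; the function name recipient is a hypothetical helper, not part of any XML vocabulary.

```python
import xml.etree.ElementTree as ET

# Element style: each field is a child element of <message>.
ELEMENT_STYLE = (
    "<message>"
    "<to>Bob Kirin</to>"
    "<from>Roger Rabbit</from>"
    "<subject>XML is hot</subject>"
    "<text>XML seems to be everywhere</text>"
    "</message>"
)

# Attribute style: the same fields carried on the <message> start tag.
ATTRIBUTE_STYLE = (
    '<message to="Bob Kirin" from="Roger Rabbit" subject="XML is hot">'
    "<text>XML seems to be everywhere</text>"
    "</message>"
)


def recipient(xml_text):
    """Return the recipient whether it is stored as an attribute or an element."""
    root = ET.fromstring(xml_text)
    if "to" in root.attrib:
        # Attribute form: values live in the element's attribute mapping.
        return root.get("to")
    # Element form: look up the child element and take its text.
    return root.findtext("to")


print(recipient(ELEMENT_STYLE))
print(recipient(ATTRIBUTE_STYLE))
```

Both calls yield the same recipient, but note that an attribute value is always a flat string, while an element can later grow nested structure of its own.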
Keeping It Well-Formed
A big difference between XML and HTML is that an XML document must always be well-formed. Several rules determine whether a document is well-formed, but one of the most important is that every start tag must have a matching end tag. Thus, in our XML, the </to> tag is not optional, and the <to> element can never be terminated by any tag other than </to>. Occasionally, however, it makes sense for a tag to stand by itself, either carrying only attributes or acting as a bare marker element.
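One practical way to see the rule in action is to hand a document to a parser and watch whether it is accepted. The following is a rough sketch assuming Python's standard xml.etree.ElementTree module; is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Every start tag has its matching end tag.
print(is_well_formed("<message><to>Bob Kirin</to></message>"))  # True

# <to> is never closed, so </message> arrives while <to> is still open.
print(is_well_formed("<message><to>Bob Kirin</message>"))  # False

# A second start tag where the end tag should be is just as fatal.
print(is_well_formed("<message><from>Roger Rabbit<from></message>"))  # False
```

Unlike a forgiving HTML browser, an XML parser is expected to stop at the first well-formedness error rather than guess what the author meant.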
When an element has no content, we can write it in a shorthand form known as an "empty tag." An empty tag ends with /> instead of just >. For example, the following book element has no element content:
<book></book>
So we could use this shorthand form and still have well-formed XML:
<book/>
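As a quick check that the long and shorthand forms mean the same thing, here is a small sketch assuming Python's standard xml.etree.ElementTree module. The bare book element carries no attributes here, and the helper is_empty is hypothetical.

```python
import xml.etree.ElementTree as ET

# Long form: a start tag immediately followed by its end tag.
long_form = ET.fromstring("<book></book>")

# Shorthand form: a single empty tag ending in "/>".
short_form = ET.fromstring("<book/>")


def is_empty(element):
    """True when the element has no child elements and no text."""
    return len(element) == 0 and not element.text


# Both spellings produce an empty <book> element.
print(long_form.tag, is_empty(long_form))
print(short_form.tag, is_empty(short_form))

# Serializing either one gives identical output.
print(ET.tostring(long_form, encoding="unicode")
      == ET.tostring(short_form, encoding="unicode"))
```

Once parsed, nothing distinguishes the two spellings, so the choice between them is purely one of convenience.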
XML Prolog and Comments
Understanding elements and attributes is key to being able to read and compose your own XML. However, there are some other parts of XML that you’ll encounter and need to know about. Let’s use the same message document but add an XML prolog and a comment:
<?xml version="1.0"?>
<message to="Bob Kirin" from="Roger Rabbit" subject="XML is hot">
  <!-- This is a comment -->
  <text>XML is finding its way into numerous apps</text>
</message>
The first line of our revised document is now an XML prolog. The prolog is technically optional, but many XML documents begin with one. If present, it must be the very first thing in the document, with nothing before it, not even whitespace.
A minimal prolog looks like this:
<?xml version="1.0"?>
The prolog declaration may contain additional attributes, such as encoding and standalone:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
The optional encoding attribute names the character encoding used for the document. The optional standalone attribute specifies whether the document depends on declarations held outside the document, such as an external DTD or external entities. If there are no such external references, then "yes" is appropriate.
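The sketch below illustrates two points about the prolog: a parser given raw bytes uses the declared encoding to decode them, and a prolog that is not at the very start of the document is rejected. It assumes Python's standard xml.etree.ElementTree module; the one-element document and the helper parses are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document containing a non-ASCII character.
DOC = (
    '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>'
    "<text>café</text>"
)

# Encode the document with the same character set the prolog declares.
raw = DOC.encode("iso-8859-1")

# Given bytes, the parser reads the encoding attribute to decode them.
root = ET.fromstring(raw)
print(root.text == "café")


def parses(data):
    """Return True if the parser accepts data, False on a parse error."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# The prolog is accepted only as the very first thing in the document.
print(parses(raw))         # True
print(parses(b" " + raw))  # False: a space precedes the prolog
```

If the bytes were written in one encoding while the prolog declared another, the text could be decoded incorrectly or rejected, which is why the two should always agree.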
If you’re familiar with HTML comments, you already know how to write XML comments. XML comments begin with <!-- and end with -->, as in the following example:
<!-- This is a comment -->
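Comments are meant for human readers, and many parsers drop them by default. The sketch below assumes Python 3.8 or later and its standard xml.etree.ElementTree module, whose TreeBuilder accepts an insert_comments option; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<message to="Bob Kirin" from="Roger Rabbit" subject="XML is hot">'
    "<!-- This is a comment -->"
    "<text>XML is finding its way into numerous apps</text>"
    "</message>"
)

# Default parser: the comment is skipped, leaving only the <text> child.
plain_root = ET.fromstring(DOC)
plain_tags = [child.tag for child in plain_root]
print(plain_tags)

# Opt in to comments by building the tree with insert_comments=True.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
full_root = ET.fromstring(DOC, parser=parser)

# Comment nodes are marked by having ET.Comment as their tag.
comments = [node.text for node in full_root if node.tag is ET.Comment]
print(comments)
```

The comment text is everything between <!-- and -->, including the surrounding spaces.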