
Creating Well-Formed XML Documents

Making your XML well-formed is integral to creating XML documents. Steve Holzner covers all aspects of well-formedness, including constraints, namespaces, infosets, and canonical XML.

Yesterday, you took a look at the various parts of XML documents—prologs, elements and attributes, processing instructions, and so forth. Today, you're going to start putting those items to work as you create well-formed documents.

Why is it so important to make an XML document well-formed? For one thing, W3C doesn't consider an XML document to be XML unless it's well-formed. For another, XML processors won't read XML documents unless those documents are well-formed. All of which is to say that making your XML well-formed is integral to creating XML documents—software isn't even going to be able to read your documents unless they are. Here's an overview of today's topics:

  • Well-formed XML documents

  • The W3C well-formedness constraints

  • Nesting constraints

  • Element and attribute constraints

  • Namespaces

  • Local and default namespaces

  • XML Infosets

  • Canonical XML

To some extent, the current loose state of HTML documents is responsible for the great emphasis W3C puts on making sure XML documents are well-formed. HTML browsers have become more and more friendly to HTML pages as time has gone on, which means a Web page can have dozens of errors and still be displayed by a browser. That's not such a problem when it comes to simply displaying a Web page, but when it comes to handling what might be crucial data, it's a different story.

So W3C changed the rules from HTML to XML—unlike an HTML browser, an XML processor is never supposed to guess when it reads an XML document. If it finds an error (if the document is not well-formed, or if it uses a DTD or XML schema and it's not valid), the XML processor is supposed to inform you of the error, but then it can quit immediately. Ideally, according to W3C, a validating XML processor should list all the errors in an XML document and then quit; a non-validating one doesn't even have to do that—it can quit the first time it sees an error.
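As a minimal sketch of this fail-fast behavior, the following uses the non-validating parser in Python's standard-library xml.etree.ElementTree module. The helper name first_error and the two sample documents are assumptions made up for illustration. The parser reports the first well-formedness error it meets and stops, rather than guessing what the author meant:

```python
import xml.etree.ElementTree as ET


def first_error(xml_text):
    """Return None if xml_text parses, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser stops at the first problem and says where it was found.
        line, column = err.position
        return (line, column, str(err))
    return None


# Hypothetical sample documents, used only for this illustration.
GOOD = "<document><item>one</item></document>"
# The <item> element is never closed, so parsing fails at </document>.
BAD = "<document>\n  <item>one\n</document>"

if __name__ == "__main__":
    print(first_error(GOOD))
    print(first_error(BAD))
```

An HTML browser would typically display something for the second document; an XML processor refuses to build any result from it at all.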

This enforced precision has two sides to it. There's no doubt that your data is transferred more faithfully using XML, but because XML processors make no guesses as to what you're trying to do, XML and XML processors can come across as unfriendly, and not as generous or as easy to work with as HTML. On the other hand, you don't end up with the many possible errors that can creep into HTML, and that's important. XML authors have to be aware of the constraints on what they write, which is why we spend time in this book on document well-formedness and validity. In fact, in the XML 1.0 specification, W3C says that you can't even call a data object an XML document unless it's well-formed:

A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further constraints.

What Makes an XML Document Well-Formed?

The W3C, which is responsible for the term well-formedness, defines it this way in the XML 1.0 recommendation:

    A textual object is a well-formed XML document if:

    • Taken as a whole, it matches the production labeled document.

    • It meets all the well-formedness constraints given in this specification (that is, the XML 1.0 specification, http://www.w3.org/TR/REC-xml).

    • Each of the parsed entities, which is referenced directly or indirectly within the document, is well-formed.

Because the major differences between XML 1.0 and XML 1.1 have to do with what characters are legal, you probably won't be surprised to learn that a well-formed XML 1.0 document is also a well-formed XML 1.1 document, as long as it avoids certain characters. From the XML 1.1 specification:

    If a document is well-formed or valid XML 1.0, and provided it does not contain any characters in the range [#x7F-#x9F] other than as character escapes, it may be made well-formed or valid XML 1.1 respectively simply by changing the version number.
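The condition in that passage can be sketched as a simple check. This is an illustration only: the function names are hypothetical, and the relabeling helper naively assumes the declaration is written exactly as version="1.0" with double quotes and no extra spaces. Character references such as &#x85; are plain ASCII text, so they are not flagged, which matches the "other than as character escapes" wording:

```python
import re

# Literal characters in the range #x7F-#x9F, as named in the quoted passage.
RESTRICTED = re.compile("[\x7f-\x9f]")


def has_restricted_literals(xml_text):
    """True if the text contains a literal character in #x7F-#x9F."""
    return RESTRICTED.search(xml_text) is not None


def relabel_as_xml11(xml_text):
    """Change the version number, refusing if restricted literals appear.

    Assumption: the XML declaration contains exactly version="1.0".
    """
    if has_restricted_literals(xml_text):
        raise ValueError("escape characters in #x7F-#x9F first")
    return xml_text.replace('version="1.0"', 'version="1.1"', 1)


if __name__ == "__main__":
    print(relabel_as_xml11('<?xml version="1.0"?><doc/>'))
```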

Let's look at the three conditions that make an XML document well-formed, starting with the requirement that the document must match the production labeled document.

Matching the Production Labeled document

W3C calls the individual grammar rules within a working draft or recommendation productions. In this case, to be well-formed, a document must follow the document production, which means that the document itself must have three parts:

  • a prolog (which can be empty)

  • a root element (which can contain other elements)

  • a miscellaneous part (unlike the preceding two parts, this part is optional)

You saw XML prologs yesterday; they can contain an XML declaration (such as <?xml version = "1.0"?>), as well as comments, processing instructions, and doctype declarations (that is, DTDs).

You've also seen root elements; the root element is the XML element that contains all the other elements in your document. Each well-formed XML document must have one, and only one, root element.

The optional miscellaneous part can be made up of XML comments, processing instructions, and whitespace, all items you saw yesterday.

In other words, this first requirement says that an XML document must be made up of the parts you saw yesterday. So far, so good.
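To make the three parts concrete, here is a small sketch that uses Python's standard-library xml.dom.minidom module to list the top-level nodes of a document. The sample document and the helper names are assumptions for illustration. It also shows that a second root element is rejected:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical document: a prolog (XML declaration plus a comment),
# one root element, and a miscellaneous part (a trailing comment).
DOC = """<?xml version="1.0"?>
<!-- a comment in the prolog -->
<document>
  <greeting>Hello</greeting>
</document>
<!-- a comment in the miscellaneous part -->
"""


def top_level_nodes(xml_text):
    """Return the node names found directly under the document node."""
    dom = minidom.parseString(xml_text)
    return [node.nodeName for node in dom.childNodes]


def parses(xml_text):
    """True if the parser accepts the text, False if it reports an error."""
    try:
        minidom.parseString(xml_text)
    except ExpatError:
        return False
    return True


if __name__ == "__main__":
    print(top_level_nodes(DOC))
    print(parses("<first/><second/>"))  # two root elements
```

The XML declaration does not show up as a node of its own, so the list holds only the two comments and the single root element between them.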

Meeting the Well-Formedness Constraints

The next requirement is a little more difficult to track down, because it says that to be well-formed, XML documents must also satisfy the well-formedness constraints in the XML 1.0 specification. This means that your XML documents should adhere to the syntax rules specified in the XML 1.0 recommendation. We'll discuss those rules, which are sprinkled throughout the XML 1.0 specification, in a few pages.
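As a preview, the sketch below feeds a parser three documents, each of which breaks one constraint that the XML 1.0 specification labels as a well-formedness constraint (Element Type Match, Unique Att Spec, and No < in Attribute Values). The sample documents and the helper name is_well_formed are assumptions for illustration, and the check relies on the non-validating parser in Python's standard-library xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the parser accepts xml_text, False on any parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical documents, each violating one constraint.
VIOLATIONS = {
    "end tag does not match start tag": "<doc><a></b></doc>",
    "attribute name repeated in one tag": '<doc id="1" id="2"/>',
    "'<' used inside an attribute value": '<doc note="a < b"/>',
}

# The same ideas written so that they satisfy the constraints.
CORRECTED = '<doc id="1" note="a &lt; b"><a></a></doc>'

if __name__ == "__main__":
    for label, text in VIOLATIONS.items():
        print(label, "->", is_well_formed(text))
    print("corrected ->", is_well_formed(CORRECTED))
```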

Each Parsed Entity Must Be Well-Formed

The final requirement is that each parsed entity in a well-formed document must itself be well-formed. When an XML document is parsed by an XML processor, references are replaced by what they stand for: a character reference such as &#x03C0; is replaced by the character it names (π in this case), and an entity reference is replaced by the entity it names. The requirement that all parsed entities must be well-formed simply means that when you replace entity references with the entities they stand for, the result must be well-formed.
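A short sketch of this replacement step, again using Python's standard-library xml.etree.ElementTree module. The entity name note and the sample documents are assumptions for illustration. The first document uses a character reference, the second declares an internal entity whose replacement text is well-formed, and the third declares one whose replacement text opens an element and never closes it:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents.
CHAR_REF = "<doc>&#x03C0;</doc>"
GOOD_ENTITY = (
    '<!DOCTYPE doc [<!ENTITY note "<b>fine</b>">]>'
    "<doc>&note;</doc>"
)
BAD_ENTITY = (
    '<!DOCTYPE doc [<!ENTITY note "<b>never closed">]>'
    "<doc>&note;</doc>"
)


def is_well_formed(xml_text):
    """True if the parser accepts xml_text, False on any parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # The character reference is replaced by the character it names.
    print(ET.fromstring(CHAR_REF).text)
    # The entity's replacement text becomes part of the parsed document.
    print(ET.fromstring(GOOD_ENTITY).find("b").text)
    # The badly formed entity makes the whole document fail.
    print(is_well_formed(BAD_ENTITY))
```

The third document looks balanced if you read only its element tags, but it fails once the entity is expanded, which is exactly what this requirement guards against.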

That's the W3C's definition of a well-formed document, but you still need more information. What are the well-formedness constraints given throughout the XML specification? You're going to go over these constraints today; to start, you'll create an XML document that you'll use as we discuss what it means for a document to be well-formed.
