Declaring a NOTATION

A notation identifies the format of data that the XML processor can't understand and parse; the data itself is referred to through an unparsed entity. Although this conjures up the idea of binary data, it can also be text that XML doesn't understand. A chunk of JavaScript, for example, could be kept in an external file and identified by a notation.

TIP

Notations can refer only to external files. There's no way to hide information inside a document and pass it on to a user agent for special processing unless you put it inside an XML tag and instruct the agent to do something with the contents of the tag. If you want to include text data containing special characters inside the document, you should escape it inside a CDATA section as described in the later section, "Escaping a Text Block in a CDATA Section."
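
As a minimal sketch of the CDATA approach mentioned in this tip, the fragment below keeps a line of script inside the document itself. The element names script-holder and script are hypothetical and chosen only for illustration:

```xml
<?xml version="1.0"?>
<script-holder>
  <!-- Everything between the CDATA markers is passed through as text,
       so the < and & characters need no escaping. -->
  <script><![CDATA[
    if (a < b && b > 0) { alert("a & b are in range"); }
  ]]></script>
</script-holder>
```

The XML processor treats the contents as ordinary character data; it is still up to the application to decide that the text is a script and do something with it.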

The problem with notations is that using them requires access to the DTD. Although you might use the internal subset of the DTD to make notation information available locally to the document, a non-validating parser is only required to process the internal subset up to the first external parameter entity reference it chooses not to read, and it is not required to read the external DTD subset at all.

NOTATION syntax is simple:

<!NOTATION name identifier "helper" >

The identifier is a catalog entry, as used in SGML. Many XML processors are recycled SGML processors, so they support a catalog by default. This is slightly safer than pointing to a helper application that may or may not be there, but XML requires the helper application to be referenced in any case, which can lead to anomalous behavior. You might reference Adobe Photoshop, for example, as the helper application for viewing a GIF image, but the browser is likely to know how to display GIFs on its own. The browser is also far more likely to be able to integrate the image properly into the rendered document on a display device or printer, a task that Adobe Photoshop is quite incapable of. Using both the identifier and the "name" of the helper allows you to compromise between just telling the user agent what sort of file is being passed and telling it explicitly how to display it while knowing nothing whatsoever about the environment the document is being displayed in.

Although some people may encourage you to behave as if Microsoft Windows is the center of the universe and do something like this

<!NOTATION gif SYSTEM "gif" >

which actually works on Windows systems, I'm not one of them. Use this syntax at your peril. The above example assumes the presence of the Windows system Registry to resolve the reference. The Registry is not exactly universally available, and relying on it short-circuits the standard system identifier completely.

A safer course is to enter the entire catalog entry and identifier sequence. This method gives a stronger hint to the eventual application that will deal with the processed XML file about what might be done with it but doesn't actually process the notation reference. The bare system identifier, "gif", works on Windows systems because Windows knows about GIFs already. But a handheld device or even a computer using another operating system may not have the knowledge of how to handle GIF images at its beck and call.

So it would be best to recast the previously shown reference as

<!NOTATION gif89a PUBLIC "-//CompuServe//NOTATION Graphics Interchange Format 89a//EN" "gif">

Notations can't be used in isolation. They have to be referenced from an entity declaration as well. A complete declaration sequence might look like this:

<!NOTATION gif89a  PUBLIC "-//CompuServe//NOTATION Graphics Interchange Format 89a//EN" "gif">
<!ENTITY gif89a SYSTEM "gif89a.gif" NDATA gif89a>
<!ELEMENT image EMPTY>
<!ATTLIST image source CDATA #REQUIRED
        alt  CDATA #IMPLIED
        type  NOTATION (gif89a) #IMPLIED >

In your document, your tag would look like this:

<image source="uri"
    alt="[image of something]" />

You can also create an enumerated list of notation types, which uses a slightly different syntax to describe the notation type:

<!NOTATION gif87a  PUBLIC "-//CompuServe//NOTATION Graphics Interchange Format 87a//EN" "gif">
<!NOTATION gif89a  PUBLIC "-//CompuServe//NOTATION Graphics Interchange Format 89a//EN" "gif">
<!ENTITY gif87a SYSTEM "gif87a.gif" NDATA gif87a>
<!ENTITY gif89a SYSTEM "gif89a.gif" NDATA gif89a>
<!ELEMENT image EMPTY>
<!ATTLIST image source CDATA #REQUIRED
        alt  CDATA #IMPLIED
        type  NOTATION (gif87a | gif89a) "gif89a" >

In your document, your tag would look like this

<image source="uri"
    alt="[image of something]" />

but you could override the default type given in the attribute list like this:

<image source="uri"
    alt="[image of something]"
    type="gif87a" />

This lets the user agent know that the image is in GIF87a format, if that matters at all.

NOTATIONs Are Awkward Solutions

NOTATIONs are interesting devices because they allow you to isolate binary data as well as character-ish data, such as scripts or interpreted source code, which you don't want the XML processor to have to deal with. In that sense, they're a good thing. But the specificity required becomes quickly tiresome. XML is being used here as a dispatcher to choose among alternatives that quickly become obsolescent as new binary technologies emerge.

The NOTATION syntax is something of an anachronism left over from SGML, which uses notations extensively. It's possible that they had to be included for compatibility with the older standard, but on the Web it's almost inconceivable that you would know the permanent location of anything, even on your own system. File systems evolve and change; nothing is constant. NOTATION declarations tacitly assume an environment that never alters, or alters so glacially that tweaking a catalog entry isn't a chore.

The Web is different. Changes propagate overnight. Before you can wink an eye someone has a new multimedia widget out there and everybody is using it. The plug-in mechanism already used by browsers might have been a better way of doing this. Or maybe we could just declare a plug-in NOTATION. Even better would be to let the browser and the server negotiate the best format for a particular situation.

Unfortunately, the better way has yet to emerge. So until it does, using the official method is an interim solution. Hopefully, whatever new mechanism supplants this awkward hack will be able to use the older method to extrapolate from the hard-coded location to where an appropriate processor might be found. In any case, you should think long and hard before including notations in an XML document and be prepared for something better to take the place of notations in the near future.

ENTITY Declaration

The ENTITY declaration is one of the simplest, in spite of the fact that there are so many different types of entities. The list of restrictions placed on entities is confusing, however, and so tersely described in the XML 1.0 Recommendation that it takes a thorough reading or two before you get it. You'll learn more about entities and have the opportunity to compare and contrast their various uses in the "Understanding the W3C Entity Table" section later in this chapter.

If you think of an XML document as a collection of entities, which are, roughly speaking, objects in the object-oriented sense, the entity declaration is a way of pointing at one instance of a particular object. There aren't that many options and it's not that hard to learn. Following is the basic format:

<!ENTITY name value >

The name part has two options, one with no modifier as you see it here and one with a preceding percent sign (%) followed by a space that marks a parameter entity. The value part has three basic options: either some form of quoted string, an identifier that points to a catalog entry or location external to the file, or a notation reference. The following are some variations for parsed entities:

<!ENTITY name PUBLIC "catalog entry" "uri" >
<!ENTITY name SYSTEM "uri" > <!-- External ID -->
<!ENTITY name "&#0000;" > <!-- General entity mnemonic for character entity -->

The following is how you declare parameter entities:

<!ENTITY % name PUBLIC "catalog entry" "uri" >
<!ENTITY % name SYSTEM "uri" > <!-- External ID -->
<!ENTITY % name "&#0000;" > <!-- Parameter entity mnemonic for character entity -->

Parameter entities behave differently than general entities because they were designed for different purposes. The difference in their declarations is designed to be obvious when you, or the XML processor, see one.

The way you refer to them is different as well. A general entity is referred to like this:

    &name;

A parameter entity, by contrast, is referred to like this:

    %name;

This is a relatively trivial difference that masks a huge difference in usage.
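
To make the two reference styles concrete, here is a small sketch of a complete document that uses one of each. The names memo, common-decls, and company, and the company name itself, are made up for this illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
<!-- Parameter entity: declared with %, referenced with %name; inside the DTD -->
<!ENTITY % common-decls "<!ELEMENT memo (#PCDATA)>">
%common-decls;
<!-- General entity: declared without %, referenced with &name; in the document -->
<!ENTITY company "Example Widgets, Inc.">
]>
<memo>Prepared by &company;.</memo>
```

The parameter entity reference is resolved while the DTD is being read and contributes a declaration; the general entity reference is resolved in the document content and contributes text.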

Parameter Entities

The parameter entity is a bit like a C macro designed for use in the DTD itself, so that you can identify objects—collections of data—for use during the process of building the DTD. After this is done, parameter entities have no meaning anywhere else. So if your document happens to contain a string that looks like the name of a parameter entity, it will be ignored, or rather, treated as the simple text that it is.

Inside the DTD however, a parameter entity has great utility. You can use it to store chunks of markup text for later use, or point to external collections of markup text for convenience.

In the internal DTD subset, parameter entities can only be used to insert complete markup. So, the following declaration and use is legal:

<!DOCTYPE name [
<!ENTITY % myname "<!ELEMENT e1 ANY>">
%myname;
...
]>

The following example, using external references pointed to by a URI and, in the second entity, a public ID (or catalog entry), is also legal:

<!DOCTYPE name [
<!ENTITY % myname SYSTEM "uri">
%myname;
<!ENTITY % myname2 PUBLIC "catalog entry" "uri">
%myname2;
 ...
]>

However, this example, which tries to define parts of markup that will be resolved later, is not legal in the internal DTD subset:

<!DOCTYPE name [
<!ENTITY % mypart "ANY">
<!ENTITY % myname "<!ELEMENT e1 %mypart;>">
%myname;
...
]>

Note that this code would be legal in the external DTD subset or an external entity.
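
As a sketch of that legal variant, the same declarations could be moved into an external file. The filename parts.dtd is an assumption made for this example:

```xml
<!-- Contents of a hypothetical external DTD file, parts.dtd -->
<!ENTITY % mypart "ANY">
<!-- In the external subset, a parameter entity reference may appear
     inside another declaration, so this is allowed here. -->
<!ENTITY % myname "<!ELEMENT e1 %mypart;>">
%myname;
```

A document would then pull the file in with a declaration such as <!DOCTYPE e1 SYSTEM "parts.dtd">, keeping in mind that a non-validating processor is not obliged to fetch it.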

In a non-validating XML processor, the external references may or may not be fetched and incorporated into the document, but this is not an error whichever way it goes.

NOTE

The distinction between validating and non-validating XML processors may seem trivial but almost all browsers are and will be non-validating. On the Web, it will be possible for a DTD to reference dozens of locations, any of which may reference dozens more. A validating browser must read in everything, potentially the entire Web, before it displays anything. The wait can become tiresome.

This brings up a subtle point. If the internal DTD subset contains a mixture of internal and external parameter entity references, a non-validating processor must stop processing them as soon as it doesn't interpret one, which it is permitted to do. The reason for this is that the reference may become undefined:

<!DOCTYPE name [
<!ENTITY % myname1 "<!ELEMENT element1 ANY>">
%myname1;
<!ENTITY % myname2 SYSTEM "uri"> <!-- external file contains <!ELEMENT element2 ANY> -->
%myname2;
<!ENTITY % myname3 "<!ELEMENT element3 ANY>">
%myname3;
 ...
]>

If the non-validating XML processor reads the external parameter entities, which it is permitted to do, all three elements are declared. If it doesn't read any external parameter entities, which it is also allowed to do, then only element1 is declared. The reason is that the processor doesn't know whether the external reference contained a declaration of element3 among its text, in which case the declaration of element3 would have been whatever the external file said, because the processor would have seen that one first if it had read it. Because it doesn't know for sure, it must ignore all succeeding entity and attribute list declarations. To make matters even more complicated, a non-validating XML processor is permitted to skip all parameter entities, in which case none of the elements are defined.

Up to that point, however, a non-validating XML processor is required to read and process the internal subset of the DTD, if any such DTD subset exists. So any other declarations inside the internal subset, including setting the replacement text of internal general entities, setting default attribute values, and normalizing attributes must be processed and performed. Figure 3.2 shows a metaphorical representation of the difference between the views seen by validating and non-validating XML parsers.

Figure 3.2 The difference is shown between the view of an XML document seen by a validating parser and a non-validating parser.

The validating parser sees everything clearly. The non-validating parser may or may not be able to see the entities and definitely doesn't see the DTD although it knows that the DTD exists.

Even though the non-validating XML processor must read and process the internal subset of the DTD until, and if, it's required to stop processing, it can't validate the document on that basis. If it did, it would be a validating processor and would be required to read everything.
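
A small sketch shows what a non-validating processor is still expected to do with the internal subset. The names note, author, and status are hypothetical:

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
<!-- Internal general entity: its replacement text must be used -->
<!ENTITY author "L. Example">
<!-- Attribute default: must be supplied even though nothing is validated -->
<!ATTLIST note status CDATA "draft">
]>
<note>Written by &author;.</note>
```

There are no external parameter entity references here, so even a non-validating processor should expand &author; and report a status attribute with the value draft on the note element. It does not complain that note has no ELEMENT declaration, because checking that is a validation task.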

NOTE

It's surprising how many people get parameter entities wrong and it's one of the problems with the EBNF that forms a part of the XML 1.0 standard. Programmers are familiar with EBNF and think it must define everything, but a large part of the specification is actually contained in the often-obscure accompanying text. You have to read both the EBNF and the text to fully capture the meaning of the W3C Recommendation.

General Entities

General entities can occur in the document itself, at least potentially. They're identified by a particular syntax in the declaration:

<!ENTITY name {stuff} >

The big distinction in general entities is whether they're internal, in which case stuff is a quoted string, or external, when it's a catalog entry or a URL.

<!-- General Internal Entity -->
<!ENTITY name "text of some sort" >
<!-- General External Entities -->
<!ENTITY name PUBLIC "-//LeeAnne.com//My XML Stuff//EN" "my-dtd=stuff.dtd" >
<!ENTITY name SYSTEM "http://www.leeanne.com/xml/my-xml-stuff.dtd" >

External entities might not be included in the document if the XML processor is not validating. Internal general entities may appear in the DTD, but only as the values in an attribute list or an entity. Basically this means that they have to appear within quoted strings.

Unparsed Entities

The unparsed entity has already been treated earlier in the explanation of NOTATIONs. Unparsed entities can only be external, but the externality of the entity is taken care of by the notation declaration referred to in an NDATA attribute. Their use, like that of notations in general, is somewhat controversial. There's no particular reason that the designer of a DTD has to know what particular sort of multimedia file is going to sit inside a document and then dispatch it to the proper handler sight unseen. Instead of being a generic document template, then, the DTD is limited by the types of unparsed files foreseen from the beginning and accounted for.

This is unlike the existing case with HTML. Within limits, you just point to a binary file and the application figures out what it is and how to display it. It's unlike the case in the UNIX environment that many of the designers of XML came from. Within limits, in UNIX you just use a file. Executable files are self-identifying and behave properly on their own. That would have seemed a much more robust approach, in my humble opinion.

Be that as it may, you're stuck with the difficult necessity of updating your DTDs whenever a new graphics or audio format is invented. Your alternative is to leave the DTD alone and fall by the wayside as video supplants still images, as interactive video supplants spectator video, and 3D virtual reality supplants mere 2D interaction.

The ENTITY declaration for an unparsed entity works hand in hand with the NOTATION declaration. The NOTATION must be declared before using it in an ENTITY.

Here's what a declaration for an unparsed entity would look like in your DTD along with the element declaration that is necessary to actually instantiate a particular example:

<!ENTITY gif89a SYSTEM "gif89a.gif" NDATA gif89a>
<!ELEMENT image EMPTY>
<!ATTLIST image source CDATA #REQUIRED
        alt  CDATA #IMPLIED
        type  NOTATION (gif89a) #IMPLIED >

In your document, your tag would look like:

<image source="uri"
    alt="[image of something]" />
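
For comparison, XML also provides the ENTITY attribute type, whose value is the name of a declared unparsed entity rather than a URI. The following is a sketch only; the gallery element, the logo entity, and the file logo.gif are assumptions made for this example:

```xml
<?xml version="1.0"?>
<!DOCTYPE gallery [
<!NOTATION gif89a PUBLIC "-//CompuServe//NOTATION Graphics Interchange Format 89a//EN" "gif">
<!-- The unparsed entity ties the external file to its notation -->
<!ENTITY logo SYSTEM "logo.gif" NDATA gif89a>
<!ELEMENT gallery (image+)>
<!ELEMENT image EMPTY>
<!-- An ENTITY attribute must name a declared unparsed entity -->
<!ATTLIST image source ENTITY #REQUIRED
                alt    CDATA  #IMPLIED>
]>
<gallery>
  <image source="logo" alt="[image of something]" />
</gallery>
```

With this arrangement the document names the entity, and the processor hands the application the entity's system identifier and notation; what the application does with them is still entirely its own business.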

You can also create an enumerated list of notation types, which uses a slightly different syntax to describe the notation type, as you learned in the discussion about NOTATION declarations:

<!NOTATION gif87a  PUBLIC "-//CompuServe//NOTATION Graphics Interchange Format 87a//EN" "gif">
<!NOTATION gif89a  PUBLIC "-//CompuServe//NOTATION Graphics Interchange Format 89a//EN" "gif">
<!ENTITY gif87a SYSTEM "gif87a.gif" NDATA gif87a>
<!ENTITY gif89a SYSTEM "gif89a.gif" NDATA gif89a>
<!ELEMENT image EMPTY>
<!ATTLIST image source CDATA #REQUIRED
        alt  CDATA #IMPLIED
        type  NOTATION (gif87a | gif89a) "gif89a" >

In your document, your tag would look like this:

<image source="uri"
    alt="[image of something]" />

but you could override the default type given in the attribute list like this if your gif was in the older gif87a format:

<image source="uri"
    alt="[image of something]"
    type="gif87a" />

This tiny example points out the folly of this approach. How many people know offhand which of the two formats their gif files adhere to? How many care? Yet XML as it exists today makes this and many other trivialities a matter of pressing import. In the immortal words of Tim Bray, one of the XML 1.0 design team, "This is completely bogus."

ELEMENT Declaration

The element declaration is the part of XML you see most clearly in the final product. It represents the actual tags you'll use in your documents, and you have to have at least one or your document isn't valid XML. A non-validating XML processor may never read your external DTD, but the tags and attributes contained in your document will describe it fairly completely anyway. Along with an associated style sheet, you can display the document correctly without any DTD at all.

If you don't care what the document looks like, you may not even need a style sheet. This might be the case for a document that was essentially a database, or was designed as a transport mechanism to transfer structured data between two applications.

The ELEMENT declaration looks like this:

<!ELEMENT name content-model >

The name is the name of your tag in use. The content model is where things start to get interesting.

The content model can contain an arbitrary mixture of terminal and non-terminal elements. Non-terminal elements are the names of other elements while terminal elements are text or other content. This is the syntax that forms nodes in your document. There are two general content models. The first model describes sequential—or ordered—content which uses a comma-separated list to indicate that one element has to follow another to the end of the ordered list. The second model uses a vertical "or" bar as a list separator to indicate a selection between alternatives. With these two mechanisms, you can construct almost anything.

Ordered Content

Element names separated by commas are sequentially ordered. The first in the list is first, the second second, and so on. The items on the list must be surrounded by parentheses, even for a pure ordered list:

<!ELEMENT name ( sub-element1, sub-element2, sub-element3, ... ) >
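
As a quick sketch of an ordered model in use, with element names invented for the example:

```xml
<?xml version="1.0"?>
<!DOCTYPE address [
<!-- street, then city, then postcode, each exactly once and in this order -->
<!ELEMENT address (street, city, postcode)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
]>
<address>
  <street>1 Example Road</street>
  <city>Sampletown</city>
  <postcode>00000</postcode>
</address>
```

Swapping city and street, or leaving out postcode, would make the document invalid.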

Selection Content

Element names are separated by "or" bars (|), the vertical bars that should be available on your national-language–specific keyboard, often above the backslash (\). They must be surrounded by parentheses. In use, they look like this:

<!ELEMENT name ( sub-element1 | sub-element2 | sub-element3 | ... ) >
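
A comparable sketch for a selection model, again with hypothetical element names:

```xml
<?xml version="1.0"?>
<!DOCTYPE contact [
<!-- Exactly one of phone, email, or fax -->
<!ELEMENT contact (phone | email | fax)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
]>
<contact><email>someone@example.com</email></contact>
```

A contact holding both a phone and an email element would be invalid, because no occurrence indicator follows the group.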

Repeating XML Content Elements

Content names, or groups of content names surrounded by parentheses, can be followed by a question mark (?), a plus sign (+), or an asterisk (*) to indicate a repetition factor, sometimes called an occurrence indicator. No repetition mark means that the element must appear once and once only. A question mark means that the item can appear zero or one time. A plus sign means that the element appears at least one time and repeats as needed. An asterisk means that the element repeats as needed but is optional. In other words, an asterisk means zero or more. These signs can be combined with parentheses and the previous sequence or alternation to form structures of arbitrary complexity.

Content models are so important to XML that it might pay to write these down somewhere until you have them memorized. Table 3.1 lists the symbols used to indicate repetition factors and the two types of content model.

Table 3.1 Occurrence Indicators Used in XML DTDs

Syntax       Meaning
?            Zero or one occurrence
+            One or more occurrences
*            Zero or more occurrences
( a | b )    Either a or b but not both
( a , b )    a followed by b


If you're familiar with regular expressions in Vi, Emacs, and other UNIX-style editors, the syntax will be fairly familiar. The following example shows several uses:

<!ELEMENT name (( sub-element1 | sub-element2)? , (sub-element3))+ >

This says that the element contains one or more substructures, each consisting of an optional sub-element1 or sub-element2 followed by exactly one instance of sub-element3. So, the following are all valid productions:

<name><sub-element3></sub-element3></name>
<name><sub-element3></sub-element3><sub-element3></sub-element3></name>
<name><sub-element1></sub-element1><sub-element3></sub-element3></name>
<name><sub-element2></sub-element2><sub-element3></sub-element3></name>

Even this simple example can generate an infinite number of productions, although it might become boring to list them. The ways in which these simple elements can combine can become confusing quickly. One of the user-friendliest uses of parameter entities is to encapsulate subsets of these behaviors so you can think about them separately.

TIP

Look-ahead is a term from the compiler/parser world that simply means the parser can look ahead in the input stream and backtrack to resolve ambiguities. Because this implies the ability to buffer the entire document in memory if necessary, anything more than the one-character look-ahead—so common that it's not usually dignified with the name look-ahead—needed to resolve tokens was dropped from the language definition. Avoiding arbitrary levels of look-ahead means that an XML parser can be small and lightweight, suitable for handheld and other devices with limited memory and power.

Because XML processors don't do look-ahead, you have to guarantee that your content model can be successfully parsed without backtracking before handing it over. A good strategy is to structure a content model with a lot of optional elements as an alternative between cascading models with optional elements dropped off the beginning of the model subset:

(a,b,c,d) | (b,c,d) | (c,d) | (d)

You can't drop off elements from the end or put alternatives in the middle because then the parser would have to backtrack to parse them. So content models that look anything like the following ambiguous examples probably don't do what the designer intends them to do:

(a,b,c,d) | (a,b,c) | (a,b) | (a)
(a,b,c,d) | (a,b,d) | (a,d) | (a)

In both these examples, when the parser encounters element a, the first alternative, (a,b,c,d), is chosen. The other alternatives can never be considered and may be ignored. Some XML parsers may generate an error when encountering a non-deterministic content model, however, so you're required to ensure that all content models are unambiguous.
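
One way out, offered here as a sketch rather than the only possible rewrite, is to factor out the shared leading element and nest the optional tails. The element names list1 and list2 are hypothetical:

```xml
<!-- Intended: (a,b,c,d) | (a,b,c) | (a,b) | (a) -->
<!ELEMENT list1 (a, (b, (c, d?)?)?)>

<!-- Intended: (a,b,c,d) | (a,b,d) | (a,d) | (a) -->
<!ELEMENT list2 (a, ((b, c?, d) | d)?)>
```

Each rewritten model accepts the same four sequences as the one it replaces, but at every step the next element in the input decides which branch to take, so the parser never has to back up.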

NOTE

Technically, the ambiguous content models shown here are non-deterministic, which means a parser can't process them with a deterministic finite state automaton built directly from the model; it would need look-ahead or backtracking. It may be possible to convert a non-deterministic content model to one that is deterministic algorithmically, but this is not guaranteed.

By nesting known bits of combining logic into larger ones, what might be daunting when viewed in its entirety can be broken down into component parts. Unfortunately, because of limitations on the internal DTD subset, this facility is only available in the external DTD.
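
A brief sketch of the idea follows. The entity and element names are invented for this example, and the fragment is assumed to live in an external DTD file:

```xml
<!-- Fragment of a hypothetical external DTD -->
<!ENTITY % heading "(title, subtitle?)">
<!ENTITY % body    "(para | list)+">

<!-- Each model reads as two named pieces instead of one long expression -->
<!ELEMENT chapter (%heading;, %body;)>
<!ELEMENT section (%heading;, %body;)>
```

Keeping each entity's parentheses balanced within its own replacement text, as here, keeps the pieces safe to combine.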

Terminal content

The leaves of our document tree are represented by terminal content, of which there is one type: #PCDATA. Parsed Character Data is mixed content that can contain both text and markup. This is the most general type of leaf. When you use a mixed content model you cannot control the order or occurrence of elements, although you can constrain the elements to be of a certain type.

It would be used in an element like

<!ELEMENT name (#PCDATA | el1 | el2 | el3 | ... )* >

or like this, when the element may contain only text:

<!ELEMENT name (#PCDATA)* >

It's fairly straightforward.
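
A short sketch of a mixed content model in use, with element names chosen only for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
<!-- #PCDATA must come first in the list, and the group must end in * -->
<!ELEMENT para (#PCDATA | em | code)*>
<!ELEMENT em (#PCDATA)>
<!ELEMENT code (#PCDATA)>
]>
<para>Use <code>#PCDATA</code> <em>first</em> in the list, then <em>any</em> mix follows.</para>
```

The em and code elements can appear any number of times and in any order among the text, which is exactly the lack of control described above.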

TIP

You could use this type of element content to contain XML tags from another namespace, for example, in an XML document without violating the DTD of the base document. Although the document DTD would have no idea what the inserted tags meant, the governing DTD of the namespace and the designer of the page presumably would.

EMPTY content

If the element is declared as EMPTY, there is no content. So if you use the start and end tag convention, you have to guarantee that the end tag immediately follows the start tag like

DTD declaration: <!ELEMENT anyname EMPTY>
XML document instance: <anyname></anyname>

or you'll generate a validation error.

You'll also probably break any browser that runs into it if the name happens to look like an HTML empty tag. So this is unsafe although perfectly legal in XML:

<img></img>

All in all, it's probably better to use the special empty tag syntax like this:

<anyname />

Notice there is a space between the element name and the forward slash, and the slash is followed immediately by the closing angle bracket.

NOTE

Because empty elements are, by definition, empty, the only possible content they can carry is in the attributes associated with each empty element. In general, any text content can easily be included in an attribute. The only real limitation is that you can't typically extend the document itself by means of an attribute. There are two important exceptions: A notation could theoretically call a process that inserted more content, much as one can do using Dynamic HTML, and it is possible to transclude content from an external file using attributes on an XLink anchor element. See Chapter 8, "XPath," and Chapter 9, "XLink and XPointer," for more information on XLink, XPath, and XPointer. Indirectly, it would also be possible to use XSLT to transform a document based on the value contained in an attribute, but that will have to wait until we discover XSL in Chapter 13, "Cascading Style Sheets and XML/XHTML."

ANY Content

This is the ultimate in loose declarations and means exactly what it says. An element so defined can contain anything at all:

<!ELEMENT anyname ANY>

It's the equivalent of listing every element in your DTD in any order but saves a lot of typing time.

ANY content is handy primarily for developing a DTD or for debugging a broken DTD if you have an example document but no DTD. If you have validity errors in your first cut you can try changing the content model to ANY in likely spots until the DTD is valid so you can load it into a DTD design tool. At that point, you can start tightening up your content model until it just begins to pinch. Then you have a valid DTD. With a large document model, this can be tedious, but it's the DTD designer's equivalent of knitting: after a while the process becomes so mechanical you hardly think about it.
