
Generic Markup Makes Natural Languages More Formal

Starting in 1969, a research effort within IBM began to focus on generic markup in the context of integrated law office information systems.6 By 1986, Charles Goldfarb had chaired an ANSI/ISO process that resulted in the adoption of Standard GML, also known as Standard Generalized Markup Language (SGML, ISO 8879:1986). Today, SGML is the gold standard for nonproprietary information representation and management; XML, the eXtensible Markup Language of the Web, corresponds closely to a Web-oriented ISO-standard profile of SGML called WebSGML. The Web's traditional language for Web pages, HTML, is basically a specific SGML tag set or markup vocabulary. XML, like SGML, allows users to define their own markup vocabularies.

SGML was based on the notion that natural language text could be marked up in a generalized fashion, so that different markup vocabularies (or tag sets) could be used to mark up different kinds of information in different ways, for different applications, and yet still be parsable using exactly the same software, regardless of the markup vocabulary. Since interchangeable information always takes the form of a sequence of characters, the ability to mark up sequences of characters in a way that is both standard (one piece of software works for everything) and user-specifiable (users can invent their own markup vocabularies) has turned out to be a key part of the answer to the question, "How can global knowledge interchange be supported?"
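The "one piece of software works for everything" idea can be illustrated with a minimal sketch. It assumes Python's standard xml.etree.ElementTree parser, which handles XML rather than full SGML, and two made-up documents whose tag names (contract, party, clause, recipe, ingredient, step) are hypothetical vocabularies invented for this illustration. The same function parses both without knowing either vocabulary in advance.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that use unrelated, user-invented vocabularies.
LEGAL_DOC = (
    "<contract><party>Acme Corp.</party>"
    "<clause>Payment is due in 30 days.</clause></contract>"
)
RECIPE_DOC = (
    "<recipe><ingredient>flour</ingredient>"
    "<step>Mix and bake.</step></recipe>"
)


def element_names(document):
    """Parse any well-formed document and list its element names in document order."""
    root = ET.fromstring(document)
    return [element.tag for element in root.iter()]


# The same parser and the same function serve both vocabularies.
print(element_names(LEGAL_DOC))
print(element_names(RECIPE_DOC))
```

Nothing in element_names refers to contracts or recipes; the vocabulary belongs to the documents, not to the parsing software.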

The SGML and XML languages that ultimately grew out of the early GML work now dominate most of the world's thinking about the problem of global information interchange. These languages represent an elegant and powerful solution to the problem of making the structure of any interchangeable information easily and cheaply detectable, processable, and validatable by any application.

Perhaps the most fundamental insight that led to the predominance of SGML and XML is the notion of generic markup, as opposed to procedural markup. Procedural markup is exemplified by tag sets that tell applications what to do with the characters that appear between any specific pair of tags (an element start tag and an element end tag). For example, imagine a start tag that says, in effect, "Render the following characters in italics," followed by the name of a ship, such as Queen Mary, followed by an end tag that says, in effect, "This is the end of the character string to be rendered in italics; stop using the italic font now." This set of instructions is indicated by the following syntax:

<italics>Queen Mary</italics>

These font-changing instructions are very helpful for a rendering application, but they are virtually useless for applications that are looking for the names of ships, because many things are italicized for many reasons, not just the names of oceangoing ships. Generic markup, by contrast, offers significant economic benefits to the owners of information assets. A start tag such as "ship-name" says, in effect, "The next few characters are the name of a ship"; that is, it says what kind of thing the character string is. Such a tag is just as useful for rendering purposes as one that says, "Italics start here," but it can also support many more kinds of applications, including applications that weren't even imagined when the information asset was originally created. Generic markup is not application-oriented; it is information-oriented. It provides information (metadata) about the information that is being marked up.
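The contrast between the two kinds of markup can be sketched in a few lines of Python. The sample sentences, the newspaper-name tag, and the RENDER_AS_ITALIC table are assumptions made up for this illustration; only the italics and ship-name tags come from the discussion above. The sketch uses the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# The same hypothetical sentence, marked up procedurally and generically.
PROCEDURAL = (
    "<p>The <italics>Queen Mary</italics> sailed, as reported in "
    "<italics>The Times</italics>.</p>"
)
GENERIC = (
    "<p>The <ship-name>Queen Mary</ship-name> sailed, as reported in "
    "<newspaper-name>The Times</newspaper-name>.</p>"
)

# Hypothetical style table: a renderer maps generic tags to fonts.
RENDER_AS_ITALIC = {"ship-name", "newspaper-name"}


def text_of(document, tag):
    """Return the text of every element with the given tag name."""
    root = ET.fromstring(document)
    return [element.text for element in root.iter(tag)]


def italicized(document):
    """Return the text a renderer would italicize under RENDER_AS_ITALIC."""
    root = ET.fromstring(document)
    return [e.text for e in root.iter() if e.tag in RENDER_AS_ITALIC]


# Procedural markup cannot tell a ship from a newspaper.
print(text_of(PROCEDURAL, "italics"))   # ['Queen Mary', 'The Times']
# Generic markup answers the ship question directly...
print(text_of(GENERIC, "ship-name"))    # ['Queen Mary']
# ...and still supports rendering.
print(italicized(GENERIC))              # ['Queen Mary', 'The Times']
```

The generic version loses nothing for the renderer, which only needs a table that maps tag names to fonts, while the procedural version cannot answer the ship-finding question at all.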

A start tag is a piece of formal, computer-understandable data that can appear in the midst of natural language data that the computer does not understand. Because of generic markup, we can now use computers to help us manage and interchange information in a hybrid fashion: the computer understands the computer-oriented formal information, and the rest is often explicitly rendered for human consumption.7

But problems remain.

  • How are computers supposed to understand what the tags mean? The "ship-name" tag, by itself, could easily be misunderstood as marking the name of the recipient of a shipment of merchandise. Let's forget about computers for a moment and consider human beings instead. No matter which natural language you choose for your tag names, most of the people on this planet can't read it. Even those who can read English may use a local dialect that misleads them about the significance of a tag name. In general, how are human beings supposed to understand that this particular tag's intended purpose is limited to marking up the names of oceangoing ships? It is difficult to see how the dream of global knowledge interchange can be realized in the absence of a rigorous way to provide metadata about any kind of metadata, including markup.

  • What about information that isn't marked up very well (or at all) to begin with?

  • What about information whose structure is arguable or ambiguous? It can only be marked up one way at a time, unless you're willing to maintain two versions of the same source information—a strategy that can often be more than twice as expensive as maintaining a single source.

  • What if you need to regard information as having a structure that is different from the structure its markup thrusts upon you, and you don't have the right or ability to change it, copy it, or reformat it?

As you can see, generic markup is only part of the answer to the problem of supporting global knowledge interchange. Much of the rest of the answer has to do with other kinds of metadata—kinds of metadata that are not internal to the information assets but are information assets in their own right. Although they are strikingly and subtly different from other kinds of metadata, topic maps are, among other things, just one of many kinds of such external metadata information assets.
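To make the idea of external metadata concrete, here is a minimal sketch in Python. It is not a topic map and does not follow any topic map standard; it only illustrates metadata that lives outside the information asset it describes. The sample text, the Annotation class, the character offsets, and the port-name kind are all assumptions invented for this illustration.

```python
from dataclasses import dataclass

# The source text carries no markup and is never modified.
SOURCE_TEXT = "The Queen Mary sailed from Southampton."


@dataclass(frozen=True)
class Annotation:
    """External metadata: a typed claim about a character range of the source."""
    start: int  # offset of the first character in the range
    end: int    # offset one past the last character in the range
    kind: str   # what kind of thing the range is (hypothetical vocabulary)


# Kept apart from SOURCE_TEXT; other parties could maintain different sets.
ANNOTATIONS = [
    Annotation(4, 14, "ship-name"),
    Annotation(27, 38, "port-name"),
]


def spans_of(kind):
    """Return the source substrings that the annotations claim are of this kind."""
    return [SOURCE_TEXT[a.start:a.end] for a in ANNOTATIONS if a.kind == kind]


print(spans_of("ship-name"))  # ['Queen Mary']
print(spans_of("port-name"))  # ['Southampton']
```

Because the annotations are a separate asset, several competing sets can describe the same unmodified source, which speaks to the cases above in which the source cannot be changed or its structure is arguable.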
