Home > Articles > Web Services > XML

This chapter is from the book

This chapter is from the book

3.4 Text markup

The "HT" in HTML stands for HyperText, and the early historical Web was very much textual. Despite all the graphic and multimedia advances of recent years, this textual foundation has not eroded. The advance of XML has, if anything, only strengthened it.

Any text markup language must provide a sufficient inventory of markup constructs for in-flow text fragments that for some reason must be differentiated from their context. Examples of such fragments include emphasized words or phrases, names or identifiers, quotes, and foreign language citations.

Block and inline elements. HTML (as well as other presentation-oriented vocabularies, for instance XSL-FO) differentiates between block-level and inline-level objects. This distinction has to do mostly with visual formatting, as block-level elements are supposed to be stacked vertically, while inline elements are part of the horizontal flow of text.4 Therefore, it is not really relevant for your semantic XML markup, which must reflect content structure, not formatting. Still, since HTML is your primary target format, the block/inline distinction may sometimes have repercussions for your source definition. markupsemantic semantic XML

Thus, it may be difficult to handle situations where a source element that normally transforms into an inline-level target element has to apply to a larger fragment of a document (3.4.3). From the XSLT viewpoint (4.5.1), block-level elements are more often generated by pull-style trunk templates, while inline-level elements are the exclusive domain of push-style branch templates.

Existing vocabularies. DocBook5 is an established standard dating from 1991 that is used mostly for technical books and documentation. It may well be the most widely used XML vocabulary after XHTML; when somebody tells you, "My documents are in XML," chances are it's actually DocBook. Software support for this vocabulary is also quite good.

DocBook is vast but not too deep, so it is simple to learn despite its large number of element types (epigraphs, bibliographies, programming code, glossaries, and so on). If you don't understand what a particular element type is supposed to do, probably you don't need it (yet). For those constructs you do need, however, DocBook may be a rich source of text markup and structuring wisdom.

TEI6 (Text Encoding Initiative) is an older and bigger beast, developed for markup of all kinds of scientific and humanities texts. Compared to DocBook, it is focused more on low-level text markup than on high-level book structures. The TEI DTD offers many modules that cover everything from verse to graph theory, so it is highly recommended if you need to mark up specialized texts. The TEI Guidelines7 is a very comprehensive and detailed guidebook explaining the use of the TEI DTD as well as many finer points of marking up complex text constructs.

3.4.1 Mark up the meaning

Your source XML must be semantic; that is, it must reflect the meaning of text-level constructs, not their presentation. The em and i element types, both present in HTML, provide a canonic illustration of this principle. While an i element dictates using an italic face in visual media, an em only designates an emphasis, which is a semantic concept rendered differently in different media. For example, a fragment of text inside em can be set in italic in a graphic browser, but it can also be highlighted in a text-mode browser or read aloud emphatically by a speech browser.

Modern HTML deprecates i and other presentation-oriented element types; instead, you are supposed to use appropriate semantic element types such as em, possibly in combination with CSS. In your XML source, however, deprecating anything is not an option - you have to make sure that with your schema, no presentation-oriented markup is possible at all. Formatting hints (3.6.2) can only be used in your XML when absolutely unavoidable.

3.4.2 Rich markup

The same visible formatting may result from different source markup. For example, you may use the same italic font face for both emphasis and citations, but they must be marked up differently in your source. What only a human reader can distinguish in the formatted result should, ideally, be automatically distinguishable in the source.

In general, semantic markup in the source should be richer and more detailed than the resulting HTML markup after transformation. For example, it is often a good idea to use special element types to mark up all dates, person names, or company names in your source, even though in the resulting web pages they are not formatted in any special way.

Why mark up what you don't need right here and now? Because your XML source is more than just an undeveloped (as in "undeveloped film") version of the web site. Rather, it is the start of a project that will keep growing and changing, sprouting new connections and renditions over time. For example, you may want to reuse your web site material in PDF brochures, interactive CDs, archival and search applications, and more. source documentsreusing materials from

This means that your XML source must be able to serve as the semantic foundation not only for your current site but also for everything it can potentially become. You may not need any extra markup right now, but it may come in very handy when you extend your site or reuse the source documents for anything beyond the web site pages.

Imagine that one day you need to convert all dates on your site from one format to another (e.g., from MM/DD/YY to DD/MM/YY). Dealing with dates scattered in the text is so much easier if all of them are marked up consistently - for example,

... which happened on 
<date><month>09</month><day>04</day><year>2003</year></date>.

instead of simply

... which happened on 09/04/2003.

With rich markup, you can change dates' rendition (e.g., reorder date components or use a different separator character) without touching the source at all, simply by modifying the stylesheet.

On another occasion, you may decide to paint all company names (or only your own company's name) green on your web pages. Or, you may find it a good idea to automatically compile an index of all persons' names mentioned on your site. All of these tasks are only possible if your source XML has these elements consistently and unambiguously marked up.

The need for rich text markup obviously depends on the quality, value, and planned longevity of your material. You don't need rich markup for short-lived stuff, but if you want your material to remain useful in the long term, you should always try to think in terms of "what markup is perfect for this content" rather than "what markup is sufficient for the task at hand." Examples of long-lived or otherwise valuable content include standards, specifications, historical texts, etc.

Existing vocabularies. As an example (and a good source of ideas), consider NITF8 (News Industry Text Format), which is a standard vocabulary for rich markup of news stories. Only a necessary minimum of NITF markup may be used in a story that goes directly to press; however, for exchange, syndication, or archival use, a complete enriched NITF markup is required. A properly prepared NITF news story uses rich markup to answer questions such as who the story is about, when and where the described event occurred, and even why it is considered newsworthy by the story author.

3.4.3 Transcending levels

The text elements we've discussed in this section would be termed inline in HTML, meaning they are only allowed within block elements such as paragraphs. However, this limitation does not always make sense. For example, a rich markup element such as emphasis may need to be applied to more than one complete paragraph.

Usually, this is an indication that these paragraphs constitute some logical entity, such as a quotation, which (rather than the emphasis itself) you need to mark up. However, there may be situations where no such element exists, but inline text markup still has to spread across one or more block elements. What are we to do in such cases?

Inserting a separate inline markup element within each paragraph is the least elegant solution:

<p><em>This is the first paragraph using emphasis throughout.</em></p>
<p><em>And this is the second emphasized paragraph.</em></p>

This leads to unnecessary duplication of markup, poor maintainability, and just plain ugliness. This is the only option, however, if your emphasis spans one paragraph and a half.

The simplest approach is to just do away with the inline/block distinction and allow any text markup to be applied at any level of the hierarchy, both below and above the paragraph level. This will allow you to enclose all affected paragraphs into a common parent element specifying emphasis:

<em>
<p>This is the first paragraph using emphasis throughout.</p>
<p>And this is the second emphasized paragraph. Note that we can use
<em>nested emphasis</em>.</p>
</em>

This might make sense, especially in contexts where you want to allow both paragraphs and short non-paragraph text fragments (3.3). The problem with this approach is that it blurs your hierarchy of element types, thereby making your documents harder to maintain and more prone to errors.

It might be argued, on the other hand, that the emphasis spanning one or more paragraphs is semantically different from the emphasis that spans one or more words. Therefore, they could use different element types:

<emphasis>
<p>This is the first paragraph using emphasis throughout.</p>
<p>And this is the second emphasized paragraph. Note that we can use
<em>nested emphasis</em>, but this time it is a different element
type for the inline level.</p>
</emphasis>

If the paragraph-level emphasis is semantically connected with the paragraph element, you can instead add an attribute to those paragraphs that fall within its scope:

<p type="emphasis">This is the first paragraph using emphasis 
throughout.</p>
<p type="emphasis">And this is the second emphasized paragraph. 
Again, <em>nested emphasis</em> is possible.</p>

Among these options, there is perhaps no single winner suitable for all situations. Your choice will depend on the semantics of the element in question, the frequency of its use at inline and block levels, and the possible connections between its semantics and that of the standard block-level element (paragraph).

3.4.4 Nested markup

Another issue with text markup is whether nesting elements of one type is to be allowed. Presentation-oriented markup never uses, for instance, i within i — but for semantic markup, a similar structure may be meaningful. Thus, emphasis within emphasis or a quote within a quote are both perfectly valid semantically, even though in an HTML rendition, nesting of the corresponding formatting elements may have no visible effect.

Therefore, to properly transform nested semantic markup, you must use different formatting depending on the nesting level of the semantic element. For example, if you use italic face for emphasis, nested emphasis can be rendered either as regular face ("toggle" approach, where you switch between regular and italic faces for each new nesting level) or as bold italic face ("additive" approach, in which the italic rendition of the parent is augmented by the bold formatting of the child).

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020