Home > Articles > Web Services > XML

  • Print
  • + Share This
This chapter is from the book

This chapter is from the book

3.9 Master document

In the previous chapter (2.1.2.1, page 49), we found that any web site consisting of more than one page must have a master document providing shared content and a site directory. In this section, we'll look at some practical examples of constructs in a typical web site's master document.

You may find the sample master document described here (see Example 3.2, page 143, for a complete listing) somewhat eclectic. This eclecticism, however, stems from the real-world practice of XML web sites. In fact, the master document is more of a database than a document (1.2). The layout of components in this database is rarely important, as they are not processed sequentially but accessed in arbitrary order. For lots of ideas on how to access and use the master document content from the stylesheet, see Chapter 5.

A master document represents a new document type, with its root element type different from that of a page document, and most other element types usable only in a master document. However, if you don't use DTDs (2.2.4) or XSDL, this distinction has little practical value, and you can use one schema to validate all of your XML (both page documents and the master document). Such a schema written in Schematron is shown in Example 3.3, page 149 (see also 5.1.3 for advanced Schematron checks).

3.9.1 Site structure

The role of the master document is that of a hub that all other documents refer to when they need to figure out a wider context of the web site or establish mutual links. Whenever the stylesheet needs some information that is not supplied by the currently processed document, it will consult the master document to find either that information or a link to it.

Therefore, the most important part of a master document is the site directory — a collection of information about all pages of the site and their organization. This directory is used for building the site's navigation as well as for resolving abbreviated internal links (3.5.3).

Besides pages, other components of the site may also be mentioned in the master document, such as all Flash animations you have or all images of a specific kind used on the site. Units of orthogonal content must be listed in the master document as well (3.9.1.3) so that pages can reference and incorporate them. Finally, sources of dynamic content must be registered for the stylesheet to know what to insert into static page templates (3.9.1.4).

3.9.1.1 Menu structure

A flat list of all pages is not sufficient for building a usable site. We also need to represent the structure of the site's menu and the correspondence between menu items and pages.

A simple site's menu may be little more than a linear list of links to each of its pages. However, most sites require more complex menu structures. Common are hierarchical menus where some of the top-level items encompass multiple subpages and/or nested submenus. Such a structure is straightforward to express in XML.

Some sites may have more than one menu. For example, there may be a menu of topics (content sections) and another independent menu of tools (pages that help navigate the site, such as search and site map). Such orthogonal menu hierarchies can be stored in independent XML subtrees within the master document.

3.9.1.2 Menu items and pages

What do we need to store in the master document for each menu item? To build a clickable menu element, we must know at least its label (the visible text displayed in the menu) and the page that it is linked to. A label may contain inline markup and should therefore be stored in a child element. As for the link, it is natural to use the general linking attributes with abbreviated addresses that we've developed for in-flow links on site pages (3.5.1).

Items vs. pages. A menu item is not the same as a page of the site. Some pages may not be available through the menu, while others may be linked from more than one menu item. Therefore, the page itself must be represented by a separate element that the menu item element will link to.

However, that does not mean that these page elements must be stored in a different part of the master document. You can still categorize all your pages under the branches of the menu tree: Even if a page is not linked from the menu, usually you can find a branch where it logically belongs (unless it is orthogonal content, 3.9.1.3). The stylesheet will thus be able to read the menu structure both hierarchically (when looking for menu items) and sequentially (when looking for pages).

Here's a possible representation of a menu item:

<item link="products">  
  <label>Products</label>
  <page id="products" title="Our products" 
        src="products/"/>
  <page id="software" title="Our software" 
        src="products/software/"/>
  <page id="hardware" title="Our hardware" 
        src="products/hardware"/>
</item>

In addition to a label and one or more pages, an item may also contain other item children. A complete menu description would thus consist of a hierarchy of items under one parent, e.g. menu. Note that in each page element, the id attribute provides a unique identifier of not only that element, but of the page itself. It is these identifiers that are used as abbreviated addresses (3.5.3) in internal links.

How unabbreviation works. When resolving a link, the stylesheet translates the page identifier into the location of that page taken from the src attribute. However, that attribute's value is also somewhat "abbreviated" in that it omits irrelevant technical information such as the filename extension and the default filename (usually index.html) in a directory. These omitted parts are easy to restore by applying simple rules, so the three page elements in the above example would yield these page locations:

/products/index.html
/products/software/index.html
/products/hardware.html

Note that a location ending with a "/" is considered a directory and has "index.html" appended; other locations only receive the ".html" extension.

Accessing the source. There is one more reason to store page pathnames without extensions. When locations are resolved for the purpose of accessing the source XML documents rather than creating an HTML link, the same src values are transformed into *.xml file locations (assuming the directory structure of the site source is similar to that of the transformed site, 3.9.3). For stylesheet code examples to access this menu structure, see Chapter 5 (5.1.1, 5.7).

Storing page metadata. Sometimes, a more complex layout for the page elements may be necessary. For example, if your bilingual site provides two language versions of each page, a page element could hold both metadata that is common to all language versions of the page (e.g., the page's identifier and source location) and language-specific metadata (e.g., title):

<page id="software" src="products/software/">
  <translation lang="en">>Our software</translation>
  <translation lang="fr">>Nos logiciels</translation>
</page>

Some of the metadata (3.1.1) may also be moved from page documents into the master document for convenient access. For example, if you want to control which pages of the site are to be seen by search engine spiders and which are hidden from them, you could add a corresponding value to each page's source document. However, since this information will be pulled from all pages of the site simultaneously, it is more convenient to add a spider control attribute to the page element in the master document. This way, the stylesheet will be able to produce a site-wide robots.txt file for external spiders and/or a configuration update for a local search engine spider without accessing all page documents.

3.9.1.3 Orthogonal content

Along with all pages, a master document should also list all the units of orthogonal content that your site will use (2.1.2.2, page 51). However, unlike pages, orthogonal content references cannot be categorized under the menu hierarchy (that is why this content is orthogonal, after all). You'll need to create a separate construct to associate orthogonal content identifiers with corresponding (abbreviated) source locations - for example,

<blocks>
  <block id="news" src="news/latest"/>
  <block id="subscribe" src="scripts/subscribe"/>
  <block id="donate" src="scripts/donate"/>
</blocks>

Now if the stylesheet processing a page document encounters a block that has no content of its own but references some orthogonal content unit - for example, by specifying idref="news" — the document at news/latest.xml will be retrieved and inserted into the current document, formatted as appropriate for an orthogonal content block.

It is important that the id and src attributes of a master document's block element have the same names and semantics as the attributes of page elements (3.9.1.2). We will use this when writing stylesheet code to unabbreviate links or search through all pages of the site (), since every page must be registered as either a page in the menu or a source of an orthogonal block (or both).

Extracting orthogonal content. In the last example, each orthogonal block was stored in its own file - but this is not always the best approach. You may want to reuse parts of regular pages as orthogonal content.

For instance, the news page of a site is often a list of news items in reverse chronological order. You may want to automatically extract the most recent news item and display it in an orthogonal content block on other pages of the site. Another example is a "featured product" blurb extracted from that product's own page and reused on the front page of the site.

For these situations, what we need is a way to specify what part of the original page document is to be reused as orthogonal content on other pages. Since this part will most likely also be a block, we only need to indicate the id of the block we are interested in. Thus, if the most recent news block on the news page always has id="last", we could write in the master document:

<block id="last-news" src="news/" select="last"/>

Now any page can place a copy of the latest news item by referencing the corresponding orthogonal block by its identifier, last-news. For example, your page document might contain

<block idref="last-news"/>

Likewise, the featured product blurb could be extracted from the block with id="blurb" on that product's page:

<block id="feature" src="products/foobar" select="blurb"/>

Here, the featured product is identified by the path to the corresponding document (products/foobar.xml). When you want to feature a different product, all you need to do is change this value so it points to another product's page (assuming each product page has exactly one block with id="blurb"; see also ). After that, all pages that use

<block idref="feature"/>

will (after you rerun the transformation) display the blurb for the new product.

Logically, without the select attribute, a master document's block will reference the entire content of the document pointed to by the src attribute. Your Schematron schema could also check that the referenced elements actually exist in the referenced documents (see 5.3.3.1, page 224 for how to code this).

No perfection in this world. It would be even more natural to use XPath expressions for extracting orthogonal blocks. Then we could use not only the id attribute value but any XPath test for identifying the block we need. For instance, for the first block on the page, we would write

<block id="news" src="news/" xpath="//block[1]"/>

Selecting the last block that has a section inside would be as simple as

<block id="lastsection" src="dir/page" 
       xpath="//block[section][last()]"/>

There's only one problem with this kind of selector: In XSLT, you can't take a string and treat it as an XPath expression - and what the master document (or any other document) stores in its attributes is always just strings from the XSLT processor viewpoint.

Saxon offers the saxon:evaluate() extension function (4.4.2.1) that might save the idea, but its implementation is quite limited, not to mention nonportable to other XSLT processors. Much better is the dyn:evaluate() function16 from EXSLT (4.4.1) which is currently supported by several processors but not by Saxon.

3.9.1.4 Registering dynamic content

Recall our discussion of dynamic sites in 1.5. We found that a dynamic web page is produced from two main parts - static templates and dynamic values - and that both can (and should) use XML markup. It's now time to see how these concepts fit into the source definition we are building.

One way of many. There exist different ways to aggregate dynamic content and static templates. Some of them come before XSLT transformation, which is usually the last stage in a dynamic XML web site workflow; in these cases, you don't need any special source markup because your stylesheet will get complete seamless page source with both static and dynamic content. However, in some situations (notably offline XSLT processing, 1.4.1) implementing dynamic content aggregation in XSLT is convenient. This section shows one approach to organizing such transformation-time incorporation of dynamic content.

Reusing blocks. An orthogonal content block that the stylesheet extracts from another document may be considered a special case of a composite dynamic value. Therefore, it makes sense to extend our blocks' markup constructs so that they cover the "truly dynamic" content as well - content that is calculated or compiled by some external process and not just stored in a static document.

We can define a number of block conventions that will allow us to use blocks not only for enveloping independent bits of content but also as links to external sources of information. Once again, our guiding principle is: Let the page author use short mnemonic identifiers and hide all the gory details of accessing data in the master document and/or stylesheet.

Calling a process. Suppose we want to build a site map page that automatically compiles a hierarchical list of all pages of the site. The first thing we need is the static part of that page - a document that stores all the static bits unique to the page, such as an introductory paragraph and heading(s). This is a normal page that is listed in the menu hierarchy in the master, just like any other page.

Wherever we want to insert our dynamic content into that static frame, we place a block reference, e.g.:

<block idref="sitemap"/>

In the master, however, we cannot associate the sitemap identifier with any source file, since no such file exists - the list of pages is generated dynamically.

Instead, we must associate our dynamic block identifier (sitemap) with an identifier of some abstract process that generates its data. You can think of a process as a kind of a script or application; it may accept some parameters that affect its output. Thus, if we write in the master document (within the same blocks envelope used for orthogonal blocks)

<block-process id="sitemap" process="sitemap" mode="text" depth="2"/>

then the stylesheet will know that a sitemap block needs to be filled in with data generated by the sitemap process with parameters mode="text" and depth="2". This process can be, for example, a callable template within the stylesheet (4.5.1) or an external program. With this approach, document authors don't need to know anything about processes or parameters; they use identifiers to refer to data sources, and the master document associates each source with a process and its set of parameters.

Watching a directory. A stylesheet can access external files even if the list of these files is changing dynamically. For example, an external process (which may or may not be another stylesheet) might be dropping its output XML documents into a directory. Your stylesheet would then read the list of files in that directory (5.3.2) and do what it pleases with their content - such as dump all available content from all files into one page or perform some elaborate selection, filtering, or rotation.

If, for example, your stylesheet implements a list-titles process that takes a directory as a parameter and returns the list of title elements from all XML documents in the directory, then you could define a block to perform this operation on all (dynamically updated) documents in the news directory by writing in the master document

<block-process id="news-list" process="list-titles" dir="news/"/>

In a page document that wants to use this list, you would then write simply

<block idref="news-list"/>

XML, not HTML. Note that processes similar to sitemap or list-titles should only aggregate content, not format it. This means that the corresponding templates or functions in your stylesheet must produce valid XML data (nodesets), not HTML renditions. You would then feed these nodesets to the regular formatting templates in the same stylesheet (see for ideas on how to chain templates together). If a process is implemented as an external program, it should return serialized XML data or plain text that the stylesheet will be able to convert to nodesets.

3.9.2 Common content and site metadata

On a typical web site, all pages contain bits of information that either remain the same or change predictably from page to page. Some of this repeating data, such as the company logo or tag line, actually belongs to the domain of presentation rather than content and therefore needs to be filled in by the stylesheet rather than stored in the source. Other components, such as webmaster email links, "designed by" signatures, copyright or legal notices, etc., are natural to store in the master document.

It is recommended that you envelop all such bits of content in one or more umbrella elements, each containing data with similar roles or positions on the pages. Here's a master document fragment defining the footer to be placed at the bottom of each page:

<page-footer>
  <designed-by>Site design: <ext link="www.kirsanov.com">Dmitry
    Kirsanov Studio</ext></designed-by>
  <legal linktype="internal" link="legal">Legal notices</legal>
  <contact linktype="internal" link="contact">Contact us</contact>
</page-footer>

Note that the elements inside page-footer may have mixed content with any of the text markup, linking, or other elements that were developed for page documents. In particular, we see internal and external links used in this example, each with its own address abbreviation scheme (3.5.3).

The page-footer parent element makes the stylesheet simpler and more bullet-proof: Instead of providing templates for each of the individual footer elements, you can program the stylesheet to process all items within a page-footer in turn, and only provide separate templates for those that differ from others in formatting. With this approach, you'll be able to add a new element type for a new footer object even without changing the stylesheet.

Similarly, we can create an envelope for storing metadata that applies to the entire site. Examples of such metadata include site-wide keyword lists (which could be merged with page-specific keywords supplied by the page documents, 3.1.1) and extended credits (which could be put in comments in the HTML code of the site's front page).

3.9.3 Processing parameters

Your stylesheet will need to know some parameters of the environment in which it is run as well as the environment where its HTML output will be placed. The most frequently required processing parameter is the base URI that the stylesheet will prepend to all the image and link pathnames. By changing this parameter, you can turn all internal link URIs from relative to absolute with an arbitrary base, which is useful for testing the site in different environments. Other parameters may provide the path to the source tree and the operating system under which the stylesheet is run (which, in turn, may affect the syntax of pathnames).

Grouping parameters into environments. It is important that the same set of source files may be processed on different computers - for example, on a developer's personal system, then in a temporary (staging) location on the server, and finally in the publicly accessible area on the target server. Each of these environments will require its own set of processing parameters. It is therefore convenient to define several groups of parameter values, one for each environment, and select only one of the groups by its identifier when running the transformation.

Where to store the environment groups? Obviously, the need to group parameters and assign a unique identifier to each group makes using XML very convenient - as opposed to, say, storing the values within scripts used to run the site build process (6.5.1). Note also that scripts are the most OS-dependent part of the site setup, so it is best to keep them as simple and therefore as portable as possible. And of all the XML documents of a web site, the two most likely choices are the XSLT stylesheet and the master document.

Your stylesheet is more likely to be shared (in whole or in part) among different projects, so it is not wise to use it for storing information that is too project-specific. Also, even though you can use XSLT variables for storing processing parameters, it is more convenient to use custom element hierarchies for structuring and accessing this data. For these reasons, the master document emerges as the most natural storage for processing parameters.

This does not mean that your master document will differ among environments. Instead, all identical copies of it will have information on all environments, and each environment will extract the relevant set of data by passing a parameter to the stylesheet.

Here's an example group of parameters that define the processing environment called staging (see 3.10.2 for the meanings of the elements):

<environment id="staging">
  <os>Linux</os>
  <src-path>/var/website/src/</src-path>
  <out-path>/var/website/out/</out-path>
  <target-path>/test/</target-path>
  <img-path>img</img-path>
</environment>

3.9.4 Site-wide content and formatting

Normally, formatting of web pages is created by the stylesheet. Sometimes, however, formatting is dependent on certain parameters that, being more content than style, belong in the site's source and not in the stylesheet. Also, sometimes the stylesheet may need to create objects that are used on many pages but do not belong to any one page in particular. In both these situations, the master document is a convenient place to store data.

Site-wide buttons. An example of such an object is a pair of graphic buttons - "next" and "prev" - used on sequential pages (such as chapters in an online book). If your stylesheet generates other graphic buttons on the site (5.5.2), design consistency and maintainability will be much better if all buttons are done in the same way.

These buttons are not specific to any particular page; moreover, pages that use them don't even need to mention the buttons in the source because the stylesheet can automatically create the page sequence, including appropriate navigation. All we need is to store the button labels somewhere so the stylesheet can generate the buttons. It makes sense to use the master document for this.

You can store the button labels in a separate element in the master and program the stylesheet to regenerate the buttons when run with the corresponding parameter. For example,

<buttons>
  <button>prev</button>
  <button>next</button>
</buttons>
  • + Share This
  • 🔖 Save To Your Account