PHP and XML

Note: If you're already familiar with the basic concepts of XML, you can safely skip the next sections giving a short introduction to XML, and continue directly with PHP and Expat.

What Is XML?

XML (Extensible Markup Language) is a meta markup language for documents containing structured information. Let's try to explain it word by word in plain English:

  • XML is extensible. Take HTML: the tag <h1> always denotes a first-level heading. In XML, by contrast, the tag means nothing until you give it a meaning with an accompanying rule, the Document Type Definition (DTD).

  • XML is a markup language. Like HTML (at least in theory), XML does not provide layout information to the processing application.

  • XML is a meta language. XML doesn't have a fixed tag set—it provides a facility to define tags.

  • XML works with documents. Documents. As in not limited to files! Documents can come from a database, over the network, or indeed from files.

  • XML defines structured information. It arranges single parts of data in a larger body and gives them a contextual meaning and a structural relationship.

Structured Information

There's one key concept you'll need to understand when talking about XML: structured documents or, more eloquently, structured information markup. Structured markup explicitly defines the structure and semantic content (the contextual meaning) of a document. It doesn't influence the way in which the document will appear to the reader; the interpretation of the data (parsing, layout, etc.) is left entirely to the processing application. Take the HTML <p> (paragraph) tag: It denotes multiple sentences belonging together to form a logical unit. The tag per se doesn't imply how the paragraph should be rendered in the browser; the browser could insert a blank line before or after, indent the first line in the paragraph, or add ornamental borders around it. This is logical markup; the style information is hardcoded into the browser.

XML documents are composed of such logical markup. As in HTML, tags are used to identify the markup information. But XML has no visual elements like HTML's <font>; it's restricted to logical markup. There's no built-in way to specify a word as italic in XML. You can only mark it for its semantic meaning, for example with <emphasis>.

So much for the markup—where's the structure? XML tags can be nested and have a contextual state—that is, it's important where they appear in a document. A tag combination <chapter><title> is treated differently than <book><title>. There's no limitation on the number of nested elements in the XML specification—the only requirement is that all elements must originate in one root element.

XML's Relatives

The ancestor of XML is SGML. Since it became an ISO standard in 1986, SGML (Standard Generalized Markup Language) has been used to maintain structured documents by large corporations in all industries. However, SGML is a complex standard that's difficult to support in applications. Most SGML applications—editors, storage servers, transformation tools—are therefore quite expensive, often costing well above $10,000.

HTML, on the other hand, has wide industry support and is used on millions of Web sites. It defines a simple type of document for a common class of short articles, with headings, paragraphs, lists, illustrations, and some provision for hypertext and multimedia. But it's very limited regarding flexibility and extensibility. The tags and semantics are fixed—you can't define your own tag for an entry in a Table of Contents. Neither is it suited for media other than computer interfaces—if you ever tried to print articles distributed to multiple files, you know what this means. The very open specification led to a fragmentation with multiple different implementations. As you know, it's an art per se to write browser-neutral HTML.

So there was a need to create a new format allowing structured documents to be used over the Web. XML was designed to overcome the limitations in the only viable alternatives, SGML and HTML.

The design goals of XML had some clearly defined points:

  • It must be easy to use—both for users and for developers implementing XML parsers. The complexity of SGML is a constraint that needs to be removed.

  • XML must be open to support a wide variety of applications and subprotocols. The dependence on a single, inflexible document type as with HTML needed to be eliminated.

  • It requires a strict syntax. Optional features lead to compatibility problems when users want to share documents. There was the constant fear that the same could happen as happened with HTML—multiple competing and incompatible implementations.

  • It must be compatible with SGML. Members of the development committee were also involved in SGML efforts and had legacy data contained in SGML systems.

The development resulted in a clear specification approved by the World Wide Web Consortium (W3C) as the recommendation Extensible Markup Language (XML) 1.0 from February 10, 1998.

XML is different from SGML: XML strips out a large number of SGML's more complex and less-used features and creates a new reduced SGML-based application. Because it's a subset of SGML, you can read an XML document with any SGML-compliant system. Every valid XML document is a valid SGML document.

XML is different from HTML: Apart from removing HTML misconceptions, it has important syntactical differences. Plus, XML is fully Unicode-ready; tags, attributes and contents can be in any string encoding defined by Unicode.

Let's look at a short excerpt from the source code of this book:

<title>Cutting-Edge Applications</title>
<abstract>
    <para>
        If you realize that all things change,
         there is nothing you will try to hold on to.
    </para>
</abstract>

Here you see the tags in use, providing for structured and logical markup. In contrast to HTML,

  • Tags are case sensitive.

  • Whitespace is significant.

  • Opening tags must always have a matching closing tag or be self-closing (for example, <xref/>).

  • Documents can have an arbitrary valid Document Type Definition.

Thus we can happily summarize: XML removes the enormous complexity of SGML, while still providing all necessary features for structural markup, including the definition of custom document types.

XML's Advantages

But why XML? With all those formal definitions and fact sheets, developers sometimes don't see the usefulness for their daily activities at first. Indeed, why use XML and not Word or Notes? Or your own proprietary storage format? Or a relational database?

The main argument against proprietary formats is just that: They're proprietary. Data that's designed to be used on a heterogeneous network such as the Internet has to be usable on all types of computers connected to it. XML is built out of plain text (as opposed to the binary format of most proprietary applications), making it supportable by all current computing platforms. Besides, proprietary data formats are often (for example in public bodies) just not an option: You don't want to rely on the mercy of a single vendor who could change the format at will, or even abandon it altogether. XML is license-free, vendor-neutral, and platform-independent.

While XML provides means for structured content, it presents a different (but not necessarily opposing) view on content than relational database systems. XML doesn't provide a relational model. It allows unlimited nested levels, which could not be handled by a database system. On the other hand, it misses features found in an RDBMS, such as stringent field types, constraints, keys, and so on. Of course, there are similarities in the two concepts and there is indeed development going on to create a SQL-like query language for XML documents. Anyway, the success of XML shouldn't make you forget the usefulness of the traditional RDBMS; they provide many important processing features that could hardly be modeled in XML, and they're optimized for speed from the ground up.

The overall and killer advantage of XML is the separation of logical structure from layout. By having your documents in XML, you can transform them into any representation you want: HTML, PostScript, PDF, RTF, plain text, audio, Braille—from one single source. And as XML (plain text) documents can be parsed with your favorite scripting language, it's easy to change hyperlinks dynamically, change element contents, or associate structures with a database.

And if you're still not convinced, review all those Document Type Definitions that are being developed or are already in use. XML itself is mostly an "under the hood" technology—the meat is the applications that use XML.

What Is XML Used For?

As a structured information markup language, XML is of course used in content management systems, archiving solutions, and corporate document repositories. But plenty of other XML applications and subprotocols exist. Due to the open nature of the standard, DTDs have been developed at a fast pace.

DocBook

The DocBook DTD is a very popular set of tags for describing books, articles, and other prose documents, particularly technical documentation. It was originally developed in 1991 by the publisher O'Reilly as an SGML DTD for in-house use. It soon became popular with authors and spread to other publishing houses, a change embraced by O'Reilly, which handed over further development to the Davenport Group. In mid-1998, OASIS (Organization for the Advancement of Structured Information Standards) officially took over the maintenance of DocBook. When XML became increasingly popular, an unofficial XML version (3.1) was created by Norman Walsh; work is currently underway to transform this into an official release, and DocBook 5 will most probably come in SGML and XML flavors.

When we started writing this book, it was clear that we wanted to use an open format such as XML. The DocBook DTD was consequently chosen because it offered all the features we would ever need. All the elements typically used in technical writing are present and, to tell you the truth, even very esoteric ones are included—or have you ever seen a MouseButton element (from the quick reference: The conventional name of a mouse button) in your word processor?

XML and DocBook offer some clear advantages to us. We can use CVS as a version control tool for both the PHP examples and the book files. Transformation to HTML is easy, either with PHP or using a style sheet processor like James Clark's XT. And editing is very comfortable, thanks to SoftQuad's XMetaL, which allows intuitive visual editing by using Cascading Style Sheets (CSS) for the display in the authoring environment, as shown in Figure 7.3.

Figure 7.3. SoftQuad's XMetaL XML authoring environment, used for writing this book.

WML—Wireless Markup Language

WML is another Document Type Definition that has quickly become an industry standard. It's intended for use in specifying content and user interfaces for wireless devices such as mobile phones or Personal Digital Assistants. These devices have some common constraints, which make HTML a bad choice for a markup language:

  • Small and low-resolution graphical displays

  • Limited user interaction

  • Narrow-band network access (for now)

  • Limited computational resources

WML addresses these issues. It divides content into small pieces ("cards") and organizes them in larger information units ("decks"). To avoid continuous network access, WML defines a set of client-side scripting procedures in XML, for example the ability to set and access variables on the client computer. Because of limited screen real estate, creating meaningful navigation paths is especially difficult on portable devices. WML explicitly requires the user agent—the WAP browser—to have a navigation history and enables WML documents to make use of it, thus freeing the author from some of the responsibility and delegating it to the user agent.

RDF—Resource Description Framework

The RDF specification defines a language to store meta information about Web resources in an XML format. The Web as it is, with its millions of HTML pages, is very difficult to process by automated machines like spiders or robots. Search engines are hitting their limit every day, and even the most clever algorithms don't guarantee meaningful search results, as anyone using the Web for professional research knows. Web pages can only be full-text searched—which is a very limited searching method.

Current HTML allows primitive storing of meta data about a document. As you may know, meta tags can be used to denote keywords for a document, a short summary, and author information. But what if you want to store the publication history of the document? Information about the editors? Any bibliographer will laugh at HTML's meta tags.

In 1998, the W3C formed a committee to research a format for defining meta data and released the Resource Description Framework (RDF) as a recommendation on February 22, 1999.

RDF extends the format originally used for PICS, a content rating system, and is more and more replacing the Dublin Core Metadata for Resource Discovery standard, another methodology for classifying meta data. RDF has quickly become accepted as a standard mechanism for the global exchange of meta data on the Internet.

XML Documents

XML documents consist of markup and content (called character data in XML terms) in the Unicode character set. There are different types of markup, which we'll introduce in the following overview.

Elements

Elements will look familiar to anyone who has worked with HTML. They denote the meaning of a content section. XML cannot contain elements with no closing tag (HTML's <img>, for example), but has a distinct notation to identify empty tags:

<xref   linkend="end"/>

Keep in mind that the nesting of tags is significant—improperly nested tags will lead to badly formed documents.

Attributes

Elements can have attributes. Attributes are name/value pairs that occur within the tags after the element name and specify a property of an element. Attribute values must be contained in quotes. No attribute name may appear more than once in the same tag.

Any XML document can optionally (and regardless of the Document Type Definition) have two standard attributes: xml:lang and xml:space. The xml:lang attribute was defined because language independence is one of XML's most important goals.

Without knowing what language a text is written in, it's impossible for an application to display, spell-check, or index it. XML's great Unicode support wouldn't be of any help if the author couldn't assign a language tag to a particular part of a document. Thus the xml:lang attribute was introduced:

<p>Worldwide declarations of love</p>
<p xml:lang="It">Ti amo.</p>
<p xml:lang="De">Ich liebe Dich.</p>
<p xml:lang="X-Klingon">qabang</p>

The language identifier is one of the following:

  • A two-letter ISO 639 language code

  • A language code registered with the Internet Assigned Numbers Authority (IANA); these begin with the prefix "i-" (or "I-")

  • A user-defined code, prefixed with "x-" (or "X-")

The other standard attribute, xml:space, isn't as straightforward to understand and use. As mentioned earlier, whitespace is significant in XML—it will be passed to the processing application. But after having read our Coding Style guidelines, you know that whitespace is important to structure and indent code to improve readability. This way it's used for laying out the markup, but it's of no importance for the markup itself or for the character data. On the other hand, an author may well intend whitespace to be preserved.

Because there are these two conflicting views on the subject, the XML committee introduced the xml:space attribute, which controls the handling of whitespace. It can take only two values: preserve or default. On any element that includes the attribute xml:space="preserve", whitespace is treated as "significant" and passed to the processing application as is. The value default tells the application to apply its own default whitespace processing. Both standard attributes are inherited by sub-elements until they are explicitly reset in an element.
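The inheritance rule can be illustrated with a small sketch. It assumes PHP's Expat-based XML extension is available; the sample document and the variable names ($langStack, $effective) are made up for illustration. The parser itself does not resolve inheritance, so the sketch keeps a stack of the language in effect for each open element:

```php
<?php
// Hypothetical sample document: <doc> sets a language, one <p> overrides it.
$xml = '<doc xml:lang="en"><p>Hello</p>'
     . '<p xml:lang="de"><em>Hallo</em></p><p>Bye</p></doc>';

$langStack = [];  // language in effect for each currently open element
$effective = [];  // "element=language" in document order

$parser = xml_parser_create('UTF-8');
// Keep element and attribute names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attributes) use (&$langStack, &$effective) {
        // Inherit from the parent unless this element resets xml:lang.
        $inherited = $langStack ? end($langStack) : null;
        $lang = $attributes['xml:lang'] ?? $inherited;
        $langStack[] = $lang;
        $effective[] = $name . '=' . ($lang ?? 'none');
    },
    function ($parser, $name) use (&$langStack) {
        array_pop($langStack);
    }
);

xml_parse($parser, $xml, true);
echo implode("\n", $effective), "\n";
```

The same stack approach would work for xml:space; only the attribute name changes.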

Note: An XML processor is the program used to read XML documents. The XML processor makes it possible for an application to access the structure and content of an XML document. Throughout this book, the terms XML processor and XML parser refer to the same kind of software.

Processing Instructions

Another "element" type you'll find in XML documents is the processing instruction, or PI. PIs mark parts of a document that should be interpreted not by the regular parser engine but by a specialized processing handler. A PI starts with <?, followed by a target name that identifies the application to which the instruction is directed, and ends with ?>. The long PHP tag (<?php) is of course such a PI and can be used in XML documents to mark PHP code.

Note: In order to be XML-compliant, you have to set the short_open_tag directive in your PHP configuration to Off and use the long opening tag <?php consistently. The short opening tag <? would confuse XML parsers, as it isn't a valid processing instruction. Conversely, with short tags enabled, an XML declaration starting with <?xml would interfere with PHP: PHP would treat the xml as code and produce a parse error.
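As a rough sketch of how an application sees processing instructions, the following registers a PI handler with PHP's XML extension and records each target together with its data. The sample document and the render target are invented for this illustration:

```php
<?php
// Hypothetical document containing two processing instructions.
$xml = '<?xml version="1.0"?>'
     . '<page><?php echo "hi"; ?><?render mode="print"?></page>';

$instructions = [];

$parser = xml_parser_create('UTF-8');
xml_set_processing_instruction_handler(
    $parser,
    function ($parser, $target, $data) use (&$instructions) {
        // $target names the application; $data is the rest of the PI.
        $instructions[] = [$target, trim($data)];
    }
);

xml_parse($parser, $xml, true);

foreach ($instructions as [$target, $data]) {
    echo $target, ' => ', $data, "\n";
}
```

Note that the XML declaration on the first line is not reported as a processing instruction; only the php and render targets reach the handler.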

Entities

Any text that's not markup constitutes the character data of the document. Within this content, an author needs a way to include special characters like < or > that normally would introduce start or end markup sections. Similarly to HTML, XML knows the notation of entities. Five entities are predefined:

Entity   Character   Name
&lt;     <           less than
&gt;     >           greater than
&amp;    &           ampersand
&quot;   "           double quote
&apos;   '           single quote (apostrophe)

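Note: Every XML processor recognizes these five entities whether or not they are declared. For interoperability, however, a document that uses a Document Type Definition should still declare them before using them.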
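When generating XML from PHP, the five predefined entities can be produced with the built-in htmlspecialchars() function. This is only a sketch: the helper name escapeXmlText() is an assumption made for this example, and the ENT_XML1 flag requires a reasonably recent PHP version.

```php
<?php
// Hypothetical helper: replaces <, >, &, " and ' with the five
// predefined XML entities.
function escapeXmlText(string $text): string
{
    // ENT_QUOTES escapes both quote styles; ENT_XML1 selects &apos;
    // for the single quote instead of a numeric reference.
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo escapeXmlText('<a href="x">Tom & Jerry\'s</a>'), "\n";
```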

Using character references, you can insert any arbitrary Unicode character into your document. They use the normal notation of references, but with a pound sign (#) following the ampersand. After that, either a decimal or a hexadecimal reference (the latter prefixed with x) to the Unicode position is inserted. For example, both &#8734; and &#x221E; refer to the infinity sign (∞). Entities are not limited to a single character, though; they can be of any length. For example, a DTD could define an entity &footer; to contain "Copyright (c) 2000 New Riders."
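To check that a decimal and a hexadecimal character reference name the same character, you can let the parser resolve both. The sketch below assumes PHP 7 or later (for the "\u{...}" string escape); the element name sign is arbitrary. Decimal 8734 and hexadecimal 221E are the same code point, the infinity sign:

```php
<?php
$text = '';

$parser = xml_parser_create('UTF-8');
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$text) {
        // Character data may arrive in several chunks, so append.
        $text .= $data;
    }
);

// One decimal and one hexadecimal reference to the same code point.
xml_parse($parser, '<sign>&#8734;&#x221E;</sign>', true);

var_dump($text === "\u{221E}\u{221E}"); // both resolved to U+221E
var_dump(hexdec('221E'));               // the decimal form of 221E
```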

Comments

XML uses the same notation for comments as HTML: <!--comment-->. Comments can contain any data except the literal string -- and can be placed between markup entries anywhere in your document. The XML specification explicitly states that comments are not part of a document's contents—a parser is not required to pass them to the processing application. This means you can't use comments for hidden instructions or the like, as you might be used to doing from HTML (think of using comment tags for hiding JavaScript from older browsers).

CDATA Sections

One special type of content is CDATA sections. As soon as you try to embed larger sections of code (containing many occurrences of < or &) into an XML document, you'll find the standard method of referencing special characters through entities awkward. HTML has the <pre> tag to turn off markup interpretation for a section—but as XML doesn't know any built-in tags, that's out of our reach. To overcome this, you can mark sections in XML as CDATA, using this construct:

<![CDATA[
   print("<a href=\"script.php3?foo=bar&baz=foobar\">");
]]>

Within a CDATA section, all characters can occur, except for the ]]> sequence.
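If the text you want to embed may itself contain ]]>, one common workaround is to end the section in the middle of that sequence and immediately open a new one. The helper below is a sketch of that idea; wrapInCdata() and the listing element are names invented for this example:

```php
<?php
// Hypothetical helper: wraps arbitrary text in one or more CDATA sections.
function wrapInCdata(string $text): string
{
    // "]]>" would end the section early, so split it across two sections:
    // the first ends after "]]", the second starts with ">".
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

$code = 'if ($a[$b[0]]>1 && $x < 2) print("<b>");';
$xml  = '<listing>' . wrapInCdata($code) . '</listing>';

// Round trip: the parser should hand back exactly the original text.
$text = '';
$parser = xml_parser_create('UTF-8');
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$text) {
        $text .= $data;
    }
);
$ok = xml_parse($parser, $xml, true);

var_dump($ok === 1 && $text === $code);
```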

Document Prologue

Note: Although prolog is the spelling in the official specs, our editor prefers the Americanized (and possibly arcane) spelling prologue.

XML documents should (but don't have to) begin with an XML declaration that specifies the version of XML being used. This version information is part of the document prologue:

<?xml version="1.0"?>
<greeting>Hello, world!</greeting>

By having this information at the top of a document, a processor can decide whether it can handle the document's version of XML. It's also useful as a method to identify the document's type; just as #!/bin/sh in the head of a file declares it to be a shell script, the XML declaration identifies an XML document.

The second important part of the document prologue is the document type declaration. Don't confuse this with Document Type Definition (DTD)—the document type declaration contains or points to a DTD! The DTD consists of markup declarations that provide a "grammar" for XML documents. The document type declaration can point to an external DTD, contain the markup declarations directly, or both. The DTD for a document consists of both subsets taken together. Here's an example of a document type declaration:

<!DOCTYPE book SYSTEM "docbookx.dtd">

This document type declaration has the name book and points to an external DTD named docbookx.dtd. It has no inline DTD.

If a document contains the full DTD and no external entities, it's called a stand-alone document and marked as such in the XML declaration:

<?xml version="1.0" standalone='yes'?>

This can be useful for some applications; for example, for delivery of documents over a network, when you want to open only a single document stream. Note that even XML documents with external DTDs can be converted to stand-alone documents by importing the DTD and external entities into the document prologue.

Document Structure

Now you know all the pieces that form an XML document: elements (with attributes), processing instructions, entities, comments, and CDATA sections. But how are these pieces grouped together to form a meaningful XML document?

The XML specification only defines a very generic document structure. It says that each well-formed document has these qualities (more about what "well-formed" means later):

  • May have a document prologue identifying the XML version and DTD.

  • Must have exactly one root element and an arbitrary number of elements below the root.

  • May have miscellaneous stuff after that.

The last part, "miscellaneous stuff," is referenced in a wry tone here—it's considered by many people to be a design error of XML. It makes parsing XML documents potentially much harder, because you can't rely on the document end being the closing root element. When parsing a document over a network connection, for example, you can't close the connection after having received the closing root element—you must wait until the server closes the connection on its own, as there may still be more "miscellaneous" content to consider.

But nothing was said yet about the syntax and structure of the thing that supposedly is responsible for the whole magic of XML: the Document Type Definition. Indeed, it's the DTD that gives meaning to an XML document; it defines its syntax, the sequence and nesting of tags, possible element attributes, entities—in short, the whole grammar. Writing complex DTDs is no easy task and whole books have been written to cover the subject. Because as an XML user you usually don't need to deal with this task directly, we won't cover this topic here. Instead, we'd like to look at another XML concept that may be more important in your daily work.

XML Namespaces

You've seen some different XML applications (Document Type Definitions) and what they're used for. But what if you want to create a single XML document containing elements from two different DTDs? For example, the <part> element could mean a book part in one DTD and a manufacturing part in another. Without a way to separate these two namespaces, the two element names would clash. How could these distinct elements be identified? You need to associate an identifier with the element, for example <part namespace = "book"> or, if you want to avoid attributes, <book:part> and <manufacturing:part>.

The W3C learned early about this shortcoming in XML and introduced a new specification: Namespaces in XML, published as a Recommendation on January 14, 1999.

XML namespaces provide a method for having multiple namespaces, identified by Uniform Resource Identifiers (URI), in one XML document. The Resource Description Framework DTD uses this method. Look at the following example from the RDF specification:

<?xml version="1.0"?>
<rdf:RDF   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:s="http://description.org/schema/">
    <rdf:Description about="http://www.w3.org/Home/Lassila">
        <s:Creator>Ora Lassila</s:Creator>
    </rdf:Description>
</rdf:RDF>

This defines two namespaces, one named rdf and one named s. After the definition, a namespace is referenced by prefixing it (concatenated with a colon) to an element name, thus effectively avoiding the collision of different logical meanings and syntactical definitions.
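To see how a namespace-aware parser reports these elements, here is a sketch using xml_parser_create_ns() from PHP's XML extension. The "|" separator is an arbitrary choice for this example. Each element name arrives as the namespace URI, the separator, and the local name; the prefixes rdf and s themselves are not what identifies the namespace.

```php
<?php
$rdf = <<<'XML'
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:s="http://description.org/schema/">
    <rdf:Description about="http://www.w3.org/Home/Lassila">
        <s:Creator>Ora Lassila</s:Creator>
    </rdf:Description>
</rdf:RDF>
XML;

$names = [];

// Namespace-aware parser; "|" separates the URI from the local name.
$parser = xml_parser_create_ns('UTF-8', '|');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attributes) use (&$names) {
        $names[] = $name;
    },
    function ($parser, $name) {
        // Closing tags are not needed for this illustration.
    }
);

xml_parse($parser, $rdf, true);
echo implode("\n", $names), "\n";
```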

Note: The URI in a namespace identifier is not a DTD. It would of course be nice to be able to point to different DTDs using XML namespaces, but there are currently many technical problems with this approach—this is being addressed by the W3C in the XML Schema definition, which is under development at the time of this writing.

EBNF—Or "What the Heck Is That Again?"

As a Web developer, you'll frequently be faced with the task of reading specifications–whether project specs, formal language definitions, or standards whitepapers. When reading some of the specifications from the W3C (the most well-known are the HTML and XML documents, probably), you'll stumble across a strange mixture of characters that presumably form a grammar definition.

document ::= prolog element Misc*

This is the very first syntax definition in the XML specification and defines the basic structure of an XML document. The notation used is called Extended Backus-Naur Form, or EBNF for short. Understanding the formal specifications will get a lot easier once you understand the basics of EBNF.

EBNF is a formal way to define the syntax of a programming language so that there's no ambiguity left as to what's valid or allowed. It's also used in many other standards, such as protocol or data formats and markup languages like XML and SGML. As EBNF makes for a very rigorous grammar definition, there are software tools available that automatically transform a set of EBNF rules into a parser. Programs that do this are called compiler compilers. The most famous of these is YACC (Yet Another Compiler Compiler), but there are of course many more.

You can see EBNF as a set of rules, called productions or production rules. Every rule describes a part of the overall syntax. You start with one start symbol (called S, by tradition) and then you define rules for what you can replace this symbol with. Gradually, this will form a complex language grammar composed by the set of strings you can produce when following these rules.

If you look at the example from above again, you see that this is an assignment; there's a symbol on the left, an assignment operator (which can also be written as :=), and a list of values on the right. You play the game by following the symbol definition down to the last occurrence–then on the right side of the assignment no symbols are given, but a final string called terminal, which is an atomic value.

EBNF defines three operators, which will look familiar to you from regular expressions:

Operator   Meaning
?          Optional
+          Must occur one or more times
*          Must occur zero or more times

To define the grammar of a language that lets you express floating-point numbers, this EBNF notation would be used:

S := SIGN? D+ (. D+)?
D := [0-9]
SIGN := "+"|"-"

The first line defines the start symbol, with the following sequence:

  • An optional sign, consisting either of + or -

  • One or more elements of the D production

  • Optionally, a dot, and again one or more elements of D production

Notice that EBNF allows operators to work on groups of symbols: (. D+)? means that this expression is optional.

The second line lists the finals (atoms) for the D production, the digits 0 to 9 in this case. The syntax used is the same as with regular expressions; a set is defined in a bracket expression. The third line defines the two possible signs. The pipe character (|) is used to denote alternatives: A|B means "A or B but not both."
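Because this small grammar uses only optional parts, repetition, and alternatives, it can be written as a single regular expression. The following sketch does so in PHP; the function name isFloatLiteral() is made up for this example, and a full EBNF grammar (with recursive rules) generally cannot be reduced to a regular expression this way.

```php
<?php
// Hypothetical helper mirroring the three productions:
//   SIGN?      ->  [+-]?
//   D+         ->  [0-9]+
//   (. D+)?    ->  (\.[0-9]+)?
function isFloatLiteral(string $input): bool
{
    // The D modifier makes $ match only at the very end of the string.
    return preg_match('/^[+-]?[0-9]+(\.[0-9]+)?$/D', $input) === 1;
}

foreach (['3.14', '-42', '+0.5', '.5', '1.', 'abc'] as $candidate) {
    echo $candidate, ': ', isFloatLiteral($candidate) ? 'valid' : 'invalid', "\n";
}
```

The last three candidates are rejected: the grammar requires at least one digit before the dot, at least one digit after it, and digits only.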

That's a very basic explanation of EBNF. The XML specification defines additional syntax; for example, validity constraints and well-formedness constraints–it's explained in the Notation section of the spec, so we won't go into details here. More information about EBNF can be found in any modern compiler book.

Validity and Well-Formedness

There are two types of compliant XML documents: valid documents and well-formed documents. Any XML document is well-formed if it matches XML's basic syntax guidelines:

  • It contains one root element and an arbitrary number of elements below that element.

  • Elements are properly nested.

  • Attributes appear only once per element and are enclosed by single or double quotes. Attribute values cannot contain direct or indirect references to external entities, nor can they contain an opening angle bracket (<).

  • Entities must be declared before they're used, except for the standard entities.

  • Entities must not refer to themselves recursively.

For example, the following is a well-formed XML document:

<greeting>Hello world.</greeting>

But it's not a valid document. The XML specification defines it this way: An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it. This means that any valid XML document is also well-formed. A well-formed document may be invalid if it doesn't adhere to the syntax laid out in the associated DTD. An ill-formed document can never be valid. An ill-formed document is not an XML document: It contains fatal errors and XML parsers are instructed to stop processing at this point.

The distinction between valid and well-formed has two very important consequences for XML. First, it brings along two classes of XML parsers: those that care about the validity of an XML document and those that don't, that is, validating and non-validating parsers. The XML specification lists ease of use for developers as a design goal, and indeed it's quite easy for any medium-level programmer to write a non-validating parser. Writing a validating parser is a different matter, though.
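The well-formedness half of this distinction can be checked with a non-validating parser alone. The sketch below uses PHP's XML extension; isWellFormed() is a name chosen for this example, and the function says nothing about validity, since no DTD is consulted.

```php
<?php
// Hypothetical helper: true if the parser accepts the whole document.
function isWellFormed(string $xml): bool
{
    $parser = xml_parser_create('UTF-8');
    // xml_parse() returns 1 on success and 0 when it hits a fatal error;
    // the third argument marks this as the final chunk of data.
    return xml_parse($parser, $xml, true) === 1;
}

var_dump(isWellFormed('<greeting>Hello world.</greeting>')); // properly nested
var_dump(isWellFormed('<a><b></a></b>'));                    // improperly nested
var_dump(isWellFormed('<greeting>Hello world.'));            // missing closing tag
```

When the check fails, xml_get_error_code() and xml_error_string() can be used on the parser to find out what went wrong.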

Second, the validity versus well-formedness concept divides XML applications into two categories. One range of applications treats XML as an extended data-storage format. Well-formed documents are used for data storage and display. For this task, a DTD is not necessary; a well-formed document is sufficient. You would achieve some level of code reuse with this approach; for example, you could reuse the code for parsing data and generating tags in later applications. But as soon as you want to exchange information as information (as opposed to treating it as pure data), you need to give the document a meaning and associate it with a DTD. In applications dealing with information processing and exchange, only valid documents are appropriate.

Now that you've learned about the basics of XML and related topics, let's put the gained knowledge into practice by looking at Expat, a non-validating parser built into PHP.

PHP and Expat

Expat is the parser that is responsible for XML processing in Mozilla, Apache, Perl, and many other projects. It has been possible to compile it into PHP since version 3.0.6, and it has been part of the official Apache distribution since Apache 1.3.9. Since Expat is a non-validating parser, it's fast and small, which makes it well suited for Web applications.

Event-Based API

There are two types of XML parser APIs: tree-based parsers that usually provide an interface to the Document Object Model (more about this later) and those that process XML documents with an event-based approach. Expat makes an event-based API available.

Event-based parsers have a data-centric view of XML documents. They parse the document from top to bottom and report events (such as the start of an element, the end of an element, or the occurrence of character data) to the application, usually through callback functions. The "Hello World" example document from earlier in the chapter would be reported by an event-based parser as a series of these events:

  1. Open Element: greeting

  2. Character data, value: Hello World

  3. Close Element: greeting

Unlike tree-based parsers, event-based parsers don't create a structural representation of the document. This provides lower-level access and is much more efficient in terms of speed and resource usage. There's no need to hold the entire document in memory; indeed, documents can be much larger than your system's memory. Of course, it's still completely possible to create a native tree structure if you need to do so. Prior to parsing a document, event-based parsers generally require you to register callback functions that will get invoked when a certain event occurs. Expat is no exception. It defines six possible events plus one default handler:

Target                      Function                                  Description
elements                    xml_set_element_handler()                 Opening and closing of elements
character data              xml_set_character_data_handler()          Beginning of character data
external entities           xml_set_external_entity_ref_handler()     Occurrence of an external entity
unparsed external entities  xml_set_unparsed_entity_decl_handler()    Occurrence of an unparsed external entity
processing instructions     xml_set_processing_instruction_handler()  Occurrence of a processing instruction
notation declarations       xml_set_notation_decl_handler()           Occurrence of a notation declaration
default                     xml_set_default_handler()                 All events that have no assigned handler
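
To make the event sequence concrete, the following sketch registers two of these handlers and records what the parser reports for the Hello World document. It assumes a current PHP version with the XML extension; the handlers are written as anonymous functions, which are newer than the PHP versions this chapter targets, and the $events array is purely illustrative.

```php
<?php
$events = array();   // human-readable trace of the reported events

$parser = xml_parser_create();

// Keep element names as written instead of folding them to uppercase
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// One call registers both the start-element and the end-element handler
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "Open Element: $name";
    },
    function ($parser, $name) use (&$events) {
        $events[] = "Close Element: $name";
    }
);

// Character data may arrive in more than one piece for longer texts
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$events) {
        $events[] = "Character data, value: $data";
    }
);

xml_parse($parser, "<greeting>Hello World</greeting>", true);

print(implode("\n", $events) . "\n");
```

Note that nothing resembling a tree is built here: the trace is all the application ever sees unless it stores more itself.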

Let's start with a really basic example. The source code in Listing 7.2 forms a program to extract all comments from an XML document (remember, comments have the form <!-- … -->). The example registers only one handler that gets called for all events during the parsing. If you register another handler, for example using xml_set_character_data_handler(), the default handler would not be invoked for this specific event—the default handler processes only "free" events with no assigned handler.

Listing 7.2. Extracting comments from an XML document.

require("xml.php3");

function default_handler($p, $data)
{
    global $count;  // count of comments found

    // Check if the current data contains a comment
    if (strstr($data, "<!--"))
    {
        $line = xml_get_current_line_number($p);

        // Insert a tab before new lines
        $data = str_replace("\n", "\n\t", $data);

        // Output line number and comment
        print "$line:\t$data\n";

        // Increase count of comments found
        $count++;
    }

}

// Process the file passed as first argument to the script
$file = $argv[1];

$count = 0;

// Create the XML parser
$parser = xml_parser_create();

// Set the default handler for all events
xml_set_default_handler($parser, "default_handler");

// Parse file and check the return code
$ret = xml_parse_from_file($parser, $file);
if(!$ret)
{
    // Print error message and die
    die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($parser)),
                    xml_get_current_line_number($parser)));
}

// Free the parser instance
xml_parser_free($parser);

The example works in a pretty straightforward way. First, the XML parser instance is created using xml_parser_create(). In all subsequent functions, you'll use the parser identifier you created this way—in a similar fashion to the result-identifier in the MySQL functions. Then the default handler is registered and the file is parsed. xml_parse_from_file() is a custom function we provide in a library; this function simply opens the file specified as the argument and parses it in blocks of 4KB. PHP's original XML functions xml_parse() and xml_parse_into_struct() operate on strings—by using wrappers for opening, reading, and closing a file and passing its contents to the respective functions, you can save time and code.
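
The library function itself is on the CD-ROM; as a rough idea of what such a wrapper might look like, here is a sketch under our own assumptions. The name parse_file_in_blocks() and the $block_size parameter are hypothetical, and the code targets a current PHP version rather than PHP 3.0.

```php
<?php
// Hypothetical stand-in for a file-parsing wrapper: feeds a file to an
// existing parser in blocks (4KB by default) and reports success.
function parse_file_in_blocks($parser, $file, $block_size = 4096)
{
    $fp = @fopen($file, "r");
    if (!$fp) {
        return false;
    }

    $ok = 1;
    while ($ok && !feof($fp)) {
        $data = fread($fp, $block_size);
        if ($data === false) {
            $ok = 0;
            break;
        }

        // The third argument tells the parser whether this is the last block
        $ok = xml_parse($parser, $data, feof($fp));
    }

    fclose($fp);
    return $ok == 1;
}

// Demonstration with a throwaway file
$file = tempnam(sys_get_temp_dir(), "xml");
file_put_contents($file, "<greeting>Hello World</greeting>");

$parser = xml_parser_create();
$result = parse_file_in_blocks($parser, $file);
print($result ? "parsed\n" : "failed\n");

unlink($file);
```

Because xml_parse() accepts a document piece by piece, the file never has to be held in memory as a whole.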

The default handler checks whether the current data section is a comment and outputs it if this is the case. Along with each comment, the current line number (returned by xml_get_current_line_number()) is also printed.
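
To see which events actually reach the default handler once another handler is registered, consider this small sketch. It assumes a current PHP version with the XML extension and uses anonymous functions for brevity; the two arrays are illustrative names of our own.

```php
<?php
$default_seen = array();   // data passed to the default handler
$chars_seen   = array();   // data passed to the character data handler

$parser = xml_parser_create();

xml_set_default_handler(
    $parser,
    function ($parser, $data) use (&$default_seen) {
        $default_seen[] = $data;
    }
);

xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$chars_seen) {
        $chars_seen[] = $data;
    }
);

xml_parse($parser, "<note><!-- reminder -->milk</note>", true);

// The text has its own handler, so it bypasses the default handler;
// the comment has no dedicated handler and ends up in the default one.
print_r($chars_seen);
print_r($default_seen);
```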

Now, while this example shows off the basic concepts of invoking the XML parser, registering callback functions, and processing data, it doesn't exactly demonstrate the common usage of an XML parser. It doesn't process information; raw data is just read in and scanned for a string—nothing that couldn't be done with traditional regular expressions. In most situations where you process XML, you'll want to keep at least a basic representation of the document structure.

Stacks, Depths, and Lists

Our second example illustrates how to remember the element depth the parser is currently processing. In the start-element handler the global $depth variable is increased by four; in the stop-element handler it's decreased by the same figure. This is the most reduced case of a parser stack—no structure other than depth information is being kept. As an XML pretty printer, the example uses the depth to properly indent code. The handler functions simply apply a Cascading Style Sheet to the current data to produce nicely formatted output. The only other noteworthy part of the code is this line:

xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

This disables case folding for the parser, telling it that the case of element names should be preserved. If this option is enabled, all element names are transformed to uppercase. Usually, you'll want to turn this off, as case is important for element names in XML.
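
The effect of this option can be checked with a few lines of code. The sketch below assumes a current PHP version with the XML extension; reported_names() is a hypothetical helper written only for this demonstration.

```php
<?php
// Hypothetical helper: returns the element names the parser reports
// for $xml, with case folding switched on or off.
function reported_names($xml, $case_folding)
{
    $names = array();

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $case_folding ? 1 : 0);
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$names) {
            $names[] = $name;
        },
        function ($parser, $name) {
            // End tags are not needed for this demonstration
        }
    );
    xml_parse($parser, $xml, true);

    return $names;
}

// With case folding the names arrive in uppercase ...
print_r(reported_names("<Greeting><toWhom/></Greeting>", true));

// ... without it they arrive exactly as written in the document
print_r(reported_names("<Greeting><toWhom/></Greeting>", false));
```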

We won't print the source code of the example here because of its simplicity; you can find it on the CD-ROM. Figure 7.4 shows a screen shot of the output.

Figure 7.4. Output of the XML pretty printer.
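
The depth-tracking idea can be sketched in plain text, without the style sheet. This is not the source from the CD-ROM, only an illustration under our own assumptions: it targets a current PHP version, uses anonymous functions as handlers, and prints element tags only.

```php
<?php
$step   = 4;     // spaces per nesting level
$depth  = 0;     // current indentation, in spaces
$output = "";    // collected outline

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$depth, &$output, $step) {
        $output .= str_repeat(" ", $depth) . "<$name>\n";
        $depth  += $step;   // everything inside is one level deeper
    },
    function ($parser, $name) use (&$depth, &$output, $step) {
        $depth  -= $step;   // back to the level of the start tag
        $output .= str_repeat(" ", $depth) . "</$name>\n";
    }
);

xml_parse($parser, "<book><title>PHP</title><para>Text</para></book>", true);
print($output);
```

The single $depth variable is the only state kept between events, which is why this approach cannot answer questions such as "what is the parent of the current element?".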

Usually, this naive approach of maintaining just one depth variable is not enough. With event-based parsers, you'll usually end up using your own stacks or lists to maintain information about the document's structure. This is evidenced quite well by the next example, shown in Listing 7.3.

Listing 7.3. XMLStats—collecting statistical information about an XML document.

require("xml.php3");

// The first argument is the file to process
$file = $argv[1];

// Initialize variables
$elements = $stack = array();
$total_elements = $total_chars = 0;

// The base class for an element
class element
{
    var $count = 0;
    var $chars = 0;
    var $parents = array();
    var $childs = array();
}

// Utility function to print a message in a box
function print_box($title, $value)
{
    printf("\n+%'-60s+\n", "");
    printf("|%20s", "$title:");
    printf("%14s", $value);
    printf("%26s|\n", "");
    printf("+%'-60s+\n", "");
}

// Utility function to print a line
function print_line($title, $value)
{
    printf("%20s", "$title:");
    printf("%15s\n", $value);
}

// Sort function for uasort()
function my_sort($a, $b)
{
    return(is_object($a) && is_object($b) ? $b->count - $a->count: 0);
}

function start_element($parser, $name, $attrs)
{
    global $elements, $stack;
    // Does this element already exist in the global $elements array?
    if(!isset($elements[$name]))
    {
        // No - add a new instance of class element
        $element = new element;
        $elements[$name] = $element;
    }

    // Increase this element's count
    $elements[$name]->count++;

    // Is there a parent element?
    if(isset($stack[count($stack)-1]))
    {
        // Yes - set $last_element to the parent
        $last_element = $stack[count($stack)-1];

        // If there is no entry for the parent element in the current
        // element's parents array, initialize it to 0
        if(!isset($elements[$name]->parents[$last_element]))
        {
            $elements[$name]->parents[$last_element] = 0;
        }

        // Increase the count for this element's parent
        $elements[$name]->parents[$last_element]++;

        // If there is no entry for this element in the parent's
        // elements' child array, initialize it to 0
        if(!isset($elements[$last_element]->childs[$name]))
        {
            $elements[$last_element]->childs[$name] = 0;
        }

        // Increase the count for this element in the parent's
        // childs array
        $elements[$last_element]->childs[$name]++;
    }

    // Add current element to the stack
    array_push($stack, $name);
}

function stop_element($parser, $name)
{
    global $stack;

    // Remove last element from the stack
    array_pop($stack);
}

function char_data($parser, $data)
{
    global $elements, $stack;

    // Increase character count for the current element
    $elements[$stack[count($stack)-1]]->chars += strlen(trim($data));
}

// Create Expat parser
$parser = xml_parser_create();

// Set handler functions
xml_set_element_handler($parser, "start_element", "stop_element");
xml_set_character_data_handler($parser, "char_data");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// Parse the file
$ret = xml_parse_from_file($parser, $file);
if(!$ret)
{
    die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($parser)),
                    xml_get_current_line_number($parser)));
}

// Free parser
xml_parser_free($parser);

// Free helper elements
unset($elements["current_element"]);
unset($elements["last_element"]);

// Sort $elements array by element count
uasort($elements, "my_sort");

// Loop through all elements collected in $elements
while(list($name, $element) = each($elements))
{
    print_box("Element name", $name);

    print_line("Element count", $element->count);
    print_line("Character count", $element->chars);

    printf("\n%20s\n", "* Parent elements");

    // Loop through the parents of this element, output them
    while(list($key, $value) = each($element->parents))
    {
        print_line($key, $value);
    }
    if(count($element->parents) == 0)
    {
        printf("%35s\n", "[root element]");
    }

    // Loop through the childs of this element, output them
    printf("\n%20s\n", "* Child elements");
    while(list($key, $value) = each($element->childs))
    {
        print_line($key, $value);
    }
    if(count($element->childs) == 0)
    {
        printf("%35s\n", "[no childs]");
    }

    $total_elements += $element->count;
    $total_chars += $element->chars;
}

// Final summary
print_box("Total elements", $total_elements);
print_box("Total characters", $total_chars);

This application uses Expat to collect statistical data about an XML document. For each element, it prints a bunch of information:

  • How many times it occurred within the document

  • How much character data was found within this element

  • All parent elements encountered

  • All child elements

To achieve this, the script needs at the very least to know the parent element of the current element. The XML parser itself doesn't provide this: you only get events for the current element, and no contextual information is recorded. Thus we needed to set up our own stack structure. We could have used a FIFO (First In, First Out) queue with two elements, but to give you a better example of keeping element nesting information within a data structure, we opted for a FILO (First In, Last Out) stack. This stack, which is a normal array, holds all currently open elements. In the start-element handler, the current element is pushed on top of the stack using array_push(). Accordingly, the end-element handler removes the top element with array_pop().
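
Stripped of the statistics, the stack technique boils down to the following sketch. It assumes a current PHP version with the XML extension and anonymous functions as handlers; the $pairs array is an illustrative name of our own.

```php
<?php
$stack = array();   // names of all currently open elements
$pairs = array();   // "parent > child" strings, in document order

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$stack, &$pairs) {
        // The top of the stack, if any, is the parent of this element
        if (count($stack) > 0) {
            $pairs[] = $stack[count($stack) - 1] . " > " . $name;
        }
        array_push($stack, $name);
    },
    function ($parser, $name) use (&$stack) {
        // The element is closed; its parent becomes the top again
        array_pop($stack);
    }
);

xml_parse($parser, "<book><chapter><para/><para/></chapter></book>", true);
print_r($pairs);
```

Since the stack holds every open element, not just the last one, the full path from the root to the current element is available at any time.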

A note on array_pop() and array_push(): These and many other useful array functions were added only in PHP 4.0. We wanted to port them back to PHP 3.0, but it's difficult to implement them efficiently in native PHP because of the way unset() works. To pop an element off the stack, you would use a snippet like this:

unset($array[count($array) - 1]);

If this worked well, it would be trivial to implement array_pop(). However, it doesn't: in PHP, unset() leaves holes in the array because it doesn't reset the "index counter." You can easily verify this yourself:

$array = array("a");
unset($array[0]);
$array[] = "a";
var_dump($array);

The element a will now have the key 1, instead of the expected 0. This leads to fragmented arrays, which are unsuitable for a stack. The behavior has its reasons: if the hole were eliminated, every other element in the array would have to be renumbered, which would be undesirable in many situations. To work around this problem, we'd need an array_compact() function, which doesn't exist in PHP at the time of this writing. The only conclusion to draw is this: Use PHP 4.0. In the PHP 3.0 implementation of the example (see the CD-ROM), we had to use the $depth variable to keep track of the element nesting manually. This introduces another global variable and is not as elegant as array_pop() and array_push(), but it works.
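
The difference between the two ways of removing the top element can be seen side by side. This sketch reflects the behavior of current PHP versions as far as we know it; the variable names are ours.

```php
<?php
// Removing the last element with unset() leaves the next index untouched
$with_unset = array("a");
unset($with_unset[count($with_unset) - 1]);
$with_unset[] = "b";

// array_pop() removes the element and also winds back the next index
$with_pop = array("a");
array_pop($with_pop);
$with_pop[] = "b";

var_dump(array_keys($with_unset));   // the new element sits behind the hole
var_dump(array_keys($with_pop));     // the new element reuses the freed key
```

Only the second array can be used as a stack with the count($array) - 1 idiom, because its keys stay contiguous.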

To collect information about each element, the script needs to remember all occurrences of each element. We use a global array variable, $elements, to hold all distinct elements of the document. The array entries are instances of the element class, which has four properties (class variables):

Property   Description
count      The number of times the element was found in the document.
chars      Bytes of character data within this element.
parents    Parent elements.
childs     Child elements.

As you see, it's no problem to keep class instances within an array.

Tip: A peculiar language feature of PHP is that you can traverse class structures just like you would traverse associative arrays, using the while(list() = each()) loop shown in Chapter 1, "Development Concepts." It will show you all class variables and method names as strings.

Each time an element is found, the count property of the corresponding entry in the $elements array is incremented. In the parent's entry (parent meaning the last opened element), the counter for the current element's name in the childs array is incremented; likewise, the counter for the parent's name is incremented in the current element's parents array. The rest of the code loops through the $elements array and its subarrays to display the statistics. While this produces nice output, the code itself is neither particularly elegant nor full of clever tricks: It's a loop like you probably use every day simply to get the job done.

DOM—Document Object Model

The other main family of XML parsers consists of those that enable access to a Document Object Model (DOM) structure. As you've seen, with event-based parsers you often have to set up your own data structures. The DOM approach avoids that requirement by building its own structure in main memory. Rather than responding to specific events, you work with this structure to process the document. While event-based parsers read an XML document in small chunks, reducing parsing memory usage and increasing performance, DOM parsers need to create an in-memory representation of the whole document. This uses more memory, so keep it in mind when working with large documents.

The DOM Level 1.0 was defined as a standard (W3C Recommendation) in October 1998 by the (by now probably well known) W3C organization. You may have heard of the DOM standard already in another context: The term is also commonly used to describe the object model of HTML pages that can be accessed with JavaScript. For example, to read the value of a form field, you could use the following JavaScript snippet:

fieldvalue = document.myform.myfield.value;

Notice the hierarchy expressed in the statement. document is the root element and myform denotes an HTML form, within which myfield is a text field. Indeed, the HTML DOM is an extension of the core Document Object Model defined by the W3C. The DOM core represents the functionality used for XML documents, and also serves as the basis for the HTML DOM. It's a collection of objects that you use to access and manipulate the data and markup stored in an XML document. It defines the following:

  • A set of objects for representing the complete structure of an XML document

  • A model of how these objects can be combined

  • An interface for accessing and manipulating these objects

By abstracting the document, the DOM exposes a tree, with parent and child nodes, and methods like getAttribute() for the nodes. In short, the DOM provides you with a standard, object-oriented, tree-like interface to XML documents.

The DOM specification is programming-language-independent. The specification recommends an object-oriented implementation, thus requiring a language with at least basic object-orientation features. It defines a set of node types (interfaces), which taken together form the complete document. Some types of nodes may have child nodes; others are leaf nodes that cannot have anything below them. We'll continue by describing these node types, as they're outlined in the original W3C specification. Please refer to the specification for a detailed description of all methods and attributes of each interface.

Document

The Document interface is the root node of the structure tree. This interface can contain only one element, which is the XML document's root element. It can also contain the document type declaration associated with this document (organized in a DocumentType interface), and, if available, processing instructions or comments from outside the root element.

Since the other nodes are all placed below the Document node, the Document interface contains a number of methods to create subnodes. Using these functions, it's possible to construct a complete XML document programmatically. The specification also defines a method getElementsByTagName() to retrieve all elements with a given tag name in the document.

DocumentFragment

A DocumentFragment node is a portion of a complete XML document. It's often necessary to rearrange parts of a document or to extract part of it; for this, a lightweight object is needed to hold the resulting fragment. For example, imagine you want to construct a single book file out of many different chapter files—each chapter could be read into the DocumentFragment object and inserted into the book's document structure. Without a way to organize fragments of documents, you'd have to add each element of each chapter one by one to the book document.

To make it even easier, the specification defines that when a DocumentFragment is inserted into a node, only the children of the DocumentFragment, and not the DocumentFragment itself, are inserted into the node.
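
This insertion rule can be tried out with the DOM extension that ships with PHP versions released after this chapter was written. The sketch below assumes that extension (the DOMDocument class) is available; it is not the LibXML API described later in this chapter.

```php
<?php
$doc  = new DOMDocument("1.0");
$book = $doc->appendChild($doc->createElement("book"));

// Collect two chapters in a fragment first
$fragment = $doc->createDocumentFragment();
$fragment->appendChild($doc->createElement("chapter", "One"));
$fragment->appendChild($doc->createElement("chapter", "Two"));

// Inserting the fragment moves its children into <book>;
// the fragment node itself does not become part of the tree
$book->appendChild($fragment);

print($doc->saveXML($book) . "\n");
```

After the call, the two chapter elements are direct children of book, and the fragment is left empty.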

DocumentType

The DocumentType node holds the document type declaration of a document, if present. This interface is read-only; it cannot be altered through the DOM at this time.

Element

Each element in a document is represented by an Element node. To get the name of the element, the tagName property can be used. This interface also defines a series of functions to set and get element attributes, and to access sub-elements.

Attr

An Attr node represents an element attribute in an Element object. The name and value of the attribute can be read from the name and value properties of the interface. The specified property tells you whether the user specified a value for this Attr or the value is the default string specified in the DTD.

EntityReference

This node represents an entity reference found in the XML document. Note that character references and references to predefined entities (for example, &lt;) are expanded by the XML parser and are thus not made available as EntityReference nodes.

Entity

This node represents an entity, either parsed or unparsed.

ProcessingInstruction

The ProcessingInstruction node represents a processing instruction (PI) in a document. It has only two attributes, namely target (the PI target) and data (the contents).

Comment

The Comment interface, which is derived from CharacterData, represents the content of a comment, that is, all the characters between <!-- and -->. It has no further attributes or methods.

Text

The Text interface, which is derived from CharacterData, represents the character data (textual content) of an Element or Attr node. The Text interface has no attributes of its own, and only one method, namely splitText(). This method splits one Text node into two, which can be useful for rearranging content.
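
As an illustration of splitText(), here is a short sketch using the DOM extension of later PHP versions. That extension is an assumption on our part, since it postdates this chapter.

```php
<?php
$doc  = new DOMDocument("1.0");
$para = $doc->appendChild($doc->createElement("para"));
$text = $para->appendChild($doc->createTextNode("Hello World"));

// Split after the fifth character: the original node keeps the head,
// the returned node holds the tail and becomes the next sibling
$tail = $text->splitText(5);

printf("%d text nodes: '%s' and '%s'\n",
       $para->childNodes->length, $text->data, $tail->data);
```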

CDATASection

The CDATASection interface inherits from the Text interface (and with it from the CharacterData interface) and holds the content of a CDATA section.

Notation

This node represents a notation declared in the document type declaration.

Basic Interfaces

All these objects inherit the Node interface, which is the primary basic datatype for the DOM. It represents a single node in the document tree structure. The Node interface defines the attributes and methods you'll use most often when dealing with the DOM. To traverse a document, for example, you would use the childNodes attribute containing all children and the nextSibling attribute containing the next node on the same level. Methods like appendChild() and removeChild() can be used to alter the tree structure.

The only objects not directly derived from a Node interface are CDATASection, Text, and Comment. Text and Comment are derived from the CharacterData interface; CDATASection inherits Text. The CharacterData interface extends Node with a set of attributes and methods for accessing character data. For example, you can use substringData() to extract part of the character data.
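
The attributes and methods named in this section can be exercised in a few lines. The sketch assumes the DOM extension of later PHP versions, and the sample document is made up for the occasion.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML("<book><title>DOM</title><!-- draft version --><para>Text</para></book>");

// Walk the root element's children via firstChild and nextSibling
$names = array();
for ($node = $doc->documentElement->firstChild; $node !== null; $node = $node->nextSibling) {
    $names[] = $node->nodeName;
}
print(implode(", ", $names) . "\n");

// Comment nodes inherit substringData() from CharacterData:
// extract five characters starting at offset 1
$comment = $doc->documentElement->childNodes->item(1);
print($comment->substringData(1, 5) . "\n");

// removeChild() alters the tree structure
$doc->documentElement->removeChild($comment);
print($doc->documentElement->childNodes->length . "\n");
```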

Example: Analyzing a Short Document with the DOM

The easiest way to get an idea about the concrete implementation of the DOM is by seeing how a sample XML document would be handled by a DOM-compliant processor. Let's create a short book document:

<?xml version="1.0"?>
<!DOCTYPE book SYSTEM "docbookx.dtd">
<book>
    <title>
        Cutting-Edge Applications
    </title>
    <para language="en">
        Sample paragraph.
    </para>
</book>

A DOM representation of this document will be organized in a hierarchical structure like the one shown in Figure 7.5. In a DOM-compliant API, code could be similar to the following pseudocode:

// Construct Document class instance
$doc = new Document("file.xml");

// Output the root element's name
printf("Root element: %s<p>", $doc->documentElement->tagName);

// Get all elements in the document
$node_list = $doc->getElementsByTagName("*");

// Traverse the returned node list
for($i=0; $i<$node_list->length; $i++)
{
    // Fetch the current node
    $node = $node_list->item($i);

    // Output node name and value
    printf("Node name: %s<br>", $node->nodeName);
    printf("Node value: %s<br>", $node->nodeValue);
}

Figure 7.5. DOM structure.
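
For readers who want to run something close to the pseudocode, here is a sketch based on the DOM extension found in later PHP versions. This is an assumption on our part and not the API available at the time of writing. The DOCTYPE line is left out so that no DTD file is needed, and the output is plain text instead of HTML.

```php
<?php
$xml = '<?xml version="1.0"?>'
     . '<book>'
     . '<title>Cutting-Edge Applications</title>'
     . '<para language="en">Sample paragraph.</para>'
     . '</book>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// Output the root element's name
printf("Root element: %s\n", $doc->documentElement->tagName);

// Get all elements in the document, including the root element
$node_list = $doc->getElementsByTagName("*");

// Traverse the returned node list
for ($i = 0; $i < $node_list->length; $i++) {
    $node = $node_list->item($i);
    printf("Node name: %s, language attribute: '%s'\n",
           $node->nodeName, $node->getAttribute("language"));
}
```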

LibXML—A DOM-Based XML Parser

Since version 4.0, PHP has included a new XML parser: LibXML. Daniel Veillard originally created this parser for the Gnome project to offer a DOM-ready parser for managing complex data exchange, and Uwe Steinmann integrated it into PHP.

While LibXML's internal document representation is very close to the DOM interfaces, it's misleading to call LibXML a DOM parser: Parsing and DOM usage really happen at different times in a document's life. It would be feasible to create an API above Expat to provide a DOM interface. The LibXML library makes this much easier, though—it's merely a matter of changing the API to match the DOM specification. Indeed, there is a GDome module in Gnome, which implements a DOM interface for LibXML.

Note: At the time of this writing, the LibXML API in PHP was being finalized. It was unstable and contained bugs—nonetheless it already showed the tremendous benefits the finished LibXML API will offer. Therefore, we decided to document the basic principles here and provide some examples; if changes occur, we'll document them on the book's Web site.

Overview

Most developers will agree that an XML document is best represented in a tree structure. LibXML provides a nice API to construct trees and DOM-like data structures from an XML file. When you parse a document with LibXML, PHP constructs a set of classes, and you'll work with them directly. By invoking functions on these classes, you can access all levels of the structure and modify the document.

The two most important objects you'll spot when working with LibXML are document and node objects.

XML Documents

The abstract XML document is represented in a document object. Such objects are created by the functions xmldoc(), xmldocfile(), and new_xmldoc().

The function xmldoc() takes as its only argument a string containing an XML document. The xmldocfile() function behaves very similarly, but takes a filename as argument. To construct a new, blank XML document, you can use new_xmldoc().

All three functions return a document object, which has four associated methods and one class variable:

  • root()

  • add_root()

  • dtd()

  • dumpmem()

  • version

The function root() returns a node object containing the root element of the document. On empty documents as created by new_xmldoc(), you can add a root element using add_root(), which will return a node object as well. The function add_root() expects the name of the element as its first argument when called as a class method. You can also call it as a global function, but then you need to pass a document class instance as the first argument, and the name of the root element as the second argument.

The dtd() function returns a DTD object with no methods, and the class variables name, sysid, and extid. The name of a DTD is always the name of the root element. The variable sysid contains the system identifier (for example, docbookx.dtd); the extid variable contains the external or public identifier. To convert the in-memory structure to a string, you can use the dumpmem() function. The version class variable contains the document's XML version, usually 1.0 today.

With these explanations, you're ready for a first, simple example. Let's construct a Hello World XML document with LibXML:

$doc = new_xmldoc("1.0");
$root = $doc->add_root("greeting");
$root->content = "Hello World!";
print(htmlspecialchars($doc->dumpmem()));

This will result in a well-formed XML document:

<?xml version="1.0"?>
<greeting>Hello World!</greeting>

The example also shows one property you don't know yet—accessing the contents of a node object.

Nodes

The Tao Te King says everything is Tao. In XML parsing, everything is a node. Elements, attributes, text, PIs, and so forth—from a programmer's point of view, you can treat them all in a very similar way, because they're nodes.

As we've already mentioned, nodes are the most basic, atomic structure in an XML document. A node object has the following associated functions and variables:

  • parent()

  • children()

  • new_child()

  • getattr()

  • setattr()

  • attributes()

  • type

  • name

  • if available, content

With these functions and properties, you can get all available information about a node. You can access its attributes, child nodes (if any), and parent node. And you can modify the tree by adding children or setting attributes. Listing 7.4 shows the functions in action. This is the XML pretty printer mentioned earlier in the Expat section, ported to LibXML—instead of registering handler functions, it applies different formatting according to the node's type. Each node has an associated type. The type identifier is a PHP constant, and you can see the complete list in the example's source. Using the children() function, which returns the node's child elements (as node objects), it's easy to loop through the document. The example performs the loop recursively by calling the output_node() function again.

Listing 7.4. XML pretty printer—example using the LibXML functions.

// Define tab width
define("INDENT", 4);

function output_node($node, $depth)
{
    // Different action per node type
    switch($node->type)
    {
        case XML_ELEMENT_NODE:
            for($i=0; $i<$depth; $i++) print("&nbsp;");

            // Print start element
            print("<span class='element'>&lt;");
            print($node->name);

            // Get attribute names and values
            $attribs = $node->attributes();
            if(is_array($attribs))
            {
                while(list($key, $value) = each($attribs))
                {
                    print(" $key = <span class='attribute'>$value</span>");
                }
            }

            print("&gt;</span><br>");

            // Process children, if any
            $children = $node->children();
            for($i=0; $i < count($children); $i++)
            {
                output_node($children[$i], $depth+INDENT);
            }

            // Print end element
            for($i=0; $i<$depth; $i++) print("&nbsp;");
            print("<span class='element'>&lt;/");
            print($node->name);
            print("&gt;</span><br>");
            break;
        case XML_PI_NODE:
            for($i=0; $i<$depth; $i++) print("&nbsp;");
            printf("<span class='pi'>&lt;?%s %s?&gt;</span><br>", $node->name, $node->content);
            break;
        case XML_COMMENT_NODE:
            for($i=0; $i<$depth; $i++) print("&nbsp;");
            print("<span class='element'>&lt;!-- </span>");
            print($node->content);
            print("<span class='element'> --&gt;</span><br>");
            break;
        case XML_TEXT_NODE:
        case XML_ENTITY_REF_NODE:
        case XML_ENTITY_NODE:
        case XML_DOCUMENT_NODE:
        case XML_DOCUMENT_TYPE_NODE:
        case XML_DOCUMENT_FRAG_NODE:
        case XML_CDATA_SECTION_NODE:
        case XML_NOTATION_NODE:
        case XML_GLOBAL_NAMESPACE:
        case XML_LOCAL_NAMESPACE:
        default:
            for($i=0; $i<$depth; $i++) print("&nbsp;");
            printf("%s<br>", isset($node->content) ? $node->content : "");
    }
}

// Output stylesheet
?>
<style type="text/css">
<!--
.xml {  font-family: "Courier New", Courier, mono;
        font-size: 10pt; color: #000000}
.element {  color: #0033CC}
.attribute {  color: #000099}
.pi {  color: #990066}
-->
</style>
<span class="xml">
<?

// File to process
$file = "test.xml";

// Initial indenting
$depth = 0;

// Check if file exists
if(!file_exists($file))
{
    die("Can't find file \"$file\".");
}

// Create xmldoc object from file
$doc = xmldocfile($file) or die("XML error while parsing file \"$file\"");

// Access root node
$root = $doc->root();

// Start traversal
output_node($root, $depth);

// End stylesheet span
print("</span>");

One of the great advantages of LibXML over Expat is that you can also use it to construct XML documents. This avoids messing around with custom XML creation routines and frees you from tasks like remembering the nesting level to properly close tags. Listing 7.5 takes our earlier Hello World example a step further and constructs a complete RSS document (RSS stands for Rich Site Summary, an XML format to provide content information for Web sites). It uses setattr() to add attributes to an element and new_child() to add elements to a node. Have you noticed the way new_child() is used? The function returns a node object, and you can simply discard that return value if you don't need it; you only need to assign it to a variable if you want to add child elements to the node you've just created.

Listing 7.5. Using LibXML routines to construct XML documents.

$doc = new_xmldoc("1.0");

$root = $doc->add_root("rss");
$root->setattr("version", "0.91");

$channel = $root->new_child("channel", "");
$channel->new_child("title", "XML News and Features from XML.com");
$channel->new_child("description", "XML.com features a rich mix of information and services for the XML community.");
$channel->new_child("language", "en-us");
$channel->new_child("link", "http://xml.com/pub");
$channel->new_child("copyright", "Copyright 1999, O'Reilly and Associates and Seybold Publications");
$channel->new_child("managingEditor", "dale@xml.com (Dale Dougherty)");
$channel->new_child("webMaster", "peter@xml.com (Peter Wiggin)");

$image = $channel->new_child("image", "");
$image->new_child("title", "XML News and Features from XML.com");
$image->new_child("url", "http://xml.com/universal/images/xml_tiny.gif");
$image->new_child("link", "http://xml.com/pub");
$image->new_child("width", "88");
$image->new_child("height", "31");

print(htmlspecialchars($doc->dumpmem()));
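
For contrast, here is a rough sketch of the kind of hand-written creation routine that the node-based calls in Listing 7.5 make unnecessary. The TagWriter class and its method names are hypothetical and not part of any extension; the point is only that such a routine has to keep its own stack of open elements in order to close them correctly.

```php
<?php
// Hypothetical hand-written XML writer, shown for contrast only
class TagWriter
{
    public $stack = array();   // names of elements not yet closed
    public $out = "";          // XML text produced so far

    // Indentation matching the current nesting level
    function indent()
    {
        return str_repeat("  ", count($this->stack));
    }

    // Write an opening tag and remember that it is still open
    function open($name)
    {
        $this->out .= $this->indent() . "<$name>\n";
        $this->stack[] = $name;
    }

    // Write an element that contains only text
    function text($name, $content)
    {
        $this->out .= $this->indent() . "<$name>"
                    . htmlspecialchars($content) . "</$name>\n";
    }

    // Close the most recently opened element
    function close()
    {
        $name = array_pop($this->stack);
        $this->out .= $this->indent() . "</$name>\n";
    }
}

$w = new TagWriter;
$w->open("rss");
$w->open("channel");
$w->text("language", "en-us");
$w->close();   // closes channel
$w->close();   // closes rss

print(htmlspecialchars($w->out));
```

Forgetting one close() call here silently produces a document that is not well-formed, which is exactly the bookkeeping that add_root() and new_child() take off your hands.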

XML Trees

The methods outlined above construct separate objects for the document and for each node. While this is great for looping through the document as shown in the XML pretty printer, accessing single elements tends to get a bit cumbersome. Do you remember our sample Hello World document from earlier in the chapter?

<?xml version="1.0"?>
<greeting>Hello World!</greeting>

To access the contents of the root element, you'd have to use the following code:

// Create xmldoc object from file
$file = "test.xml";
$doc = xmldocfile($file) or die("XML error while parsing file \"$file\"");

// Access root node
$root = $doc->root();

// Access root's children
$children = $root->children();

// Print first child's content
print($children[0]->content);

And that's for a depth of one; imagine how you'd have to continue with deeper nested elements. If you think that this is a bit too much work, we agree. Fortunately, Uwe Steinmann agrees too, and has provided a more elegant method of random access to document elements: xmltree(). This function creates a structure of PHP objects representing the whole XML document. When you pass it a string containing an XML document as its first argument, the function returns a document object. The object is a bit different from the one described earlier, though: It doesn't allow functions to be called, but exposes the same information as properties. Instead of getting a list of child elements with a children() call, the children are already present in the structure (in the children class variable), making it easy to access elements at any depth. Accessing the contents of the greeting element would therefore be done with the following code:

// Create document structure from the file's contents
$file = "test.xml";
$doc = xmltree(join("", file($file))) or die("XML error while parsing file \"$file\"");

print($doc->root->children[0]->content);

That looks infinitely better now. When you dump the structure returned by xmltree() with var_dump(), you get the following output:

object(Dom document)(2) {
  ["version"]=>
  string(3) "1.0"

  ["root"]=>
  object(Dom node)(3) {
    ["type"]=>
    int(1)

    ["name"]=>
    string(8) "greeting"

    ["children"]=>
    array(1) {
      [0]=>
      object(Dom node)(3) {
        ["name"]=>
        string(4) "text"

        ["content"]=>
        string(12) "Hello World!"

        ["type"]=>
        int(3)
      }
    }
  }
}

You see that this is one large structure, with the whole document ready in place. The actual parts of the structure are still document or node objects; indeed, internally the same class definitions are used. In contrast to objects created with xmldoc() and friends, though, you can't invoke functions on these structures. Consequently, the structure returned by xmltree() is read-only at this time; to construct XML documents, you need to use the other methods.
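
Because the structure consists of plain properties, a few lines of recursion are enough to reach content at any depth. The following is a minimal sketch, not part of the extension: collect_text() is a hypothetical helper, and the document is rebuilt by hand from generic objects that mirror the dump above, so the example does not depend on xmltree() being available. It assumes, as in the dump, that only text nodes carry a content property.

```php
<?php
// Hypothetical helper: gathers the text found beneath a node,
// reading properties only (no function calls on the nodes)
function collect_text($node)
{
    // Text nodes carry their data in the content property
    if (isset($node->content))
    {
        return $node->content;
    }

    // Element nodes: descend into the children array, if present
    $text = "";
    if (isset($node->children))
    {
        foreach ($node->children as $child)
        {
            $text .= collect_text($child);
        }
    }
    return $text;
}

// Hand-built stand-in for the structure shown in the dump above
$text = new stdClass;
$text->name = "text";
$text->content = "Hello World!";
$text->type = 3;

$root = new stdClass;
$root->type = 1;
$root->name = "greeting";
$root->children = array($text);

$doc = new stdClass;
$doc->version = "1.0";
$doc->root = $root;

print(collect_text($doc->root));
```

The same call works unchanged for more deeply nested documents, since each level is just another children array to descend into.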
