This chapter is from the book

Text Nodes

To ensure interoperability, the DOM defines a standard data type for representing character data in a source document: the DOMString, a sequence of 16-bit units encoded using UTF-16. The DOM also defines a generic CharacterData interface, derived from Node, that encapsulates a DOMString and provides operations for inserting, appending, replacing, and deleting portions of its value. All other DOM interfaces that deal directly with character data extend the CharacterData interface.
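The editing operations that CharacterData provides can be sketched in Java as follows. This is a minimal illustration, assuming the JAXP DocumentBuilderFactory bundled with the JDK is used to obtain a Document; the class name CharacterDataDemo and its helper methods are hypothetical and do not come from the original text. Offsets and counts are measured in 16-bit units.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;

public class CharacterDataDemo {
    // Hypothetical helper: creates an empty Document to act as a node factory.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Applies each CharacterData editing operation in turn and returns the result.
    static String edit() {
        Document doc = newDocument();
        // Text extends CharacterData, so the node can be held by the base type.
        CharacterData data = doc.createTextNode("quick fox");
        data.appendData(" jumped");      // "quick fox jumped"
        data.insertData(0, "The ");      // "The quick fox jumped"
        data.replaceData(4, 5, "brown"); // "The brown fox jumped"
        data.deleteData(13, 7);          // "The brown fox"
        return data.getData();
    }

    public static void main(String[] args) {
        System.out.println(edit());
    }
}
```

Holding the node as CharacterData rather than Text shows that these operations belong to the base interface and are therefore available on every node type that extends it.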

The CharacterData interface provides basic string manipulation operations. It is never implemented by itself; it is always implemented in tandem with an extended interface. The Text interface is the most common extended interface, and it is used to model collections of character information items that appear in an element information item's [children] property. Consider the following XML document:

<foo>The <bar>quick</bar> brown <bar>fox <baz>jumped</baz>
over <baz>the</baz> lazy</bar> dog</foo>

When this document is loaded into the DOM, contiguous text not separated by markup will be contained within a Text node as shown in Figure 2.11.

Figure 2.11. Text nodes
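The way contiguous text is grouped into Text nodes can be checked by parsing the document above and listing the children of the root element. The following is a sketch only, assuming the JDK's default JAXP parser; the class TextNodeDemo and its methods are hypothetical names introduced for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class TextNodeDemo {
    // The sample document from the text, including its line break.
    static final String XML =
        "<foo>The <bar>quick</bar> brown <bar>fox <baz>jumped</baz>\n"
      + "over <baz>the</baz> lazy</bar> dog</foo>";

    // Parses the sample document and returns its root element (foo).
    static Node parseRoot() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Describes each direct child of parent as either a Text node or an element.
    static List<String> describeChildren(Node parent) {
        List<String> out = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid instanceof Text) {
                out.add("Text[" + ((Text) kid).getData() + "]");
            } else {
                out.add("Element<" + kid.getNodeName() + ">");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [Text[The ], Element<bar>, Text[ brown ], Element<bar>, Text[ dog]]
        System.out.println(describeChildren(parseRoot()));
    }
}
```

Each run of text between two pieces of markup becomes one Text child, so the root element has three Text children interleaved with its two bar elements.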


The text of an element is considered normalized when it contains no two adjacent Text nodes, as was shown above. In general, deserializing an XML document into a DOM will yield normalized elements. However, when new Text nodes are inserted into the hierarchy, one can wind up with a denormalized element. Although denormalized elements are completely legal, various XML technologies have a difficult time handling them. XPath, for example, depends on a normalized document tree structure to behave properly, so performing an XPath traversal against a document with denormalized elements can yield unexpected results. This can be prevented using the Node.normalize method, which merges adjacent Text nodes throughout the subtree beneath the node on which it is called, that is, in the node's descendants rather than its ancestors. Consider the following Java code:

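import org.w3c.dom.*;

// Assumes elem is already normalized and its last child (if any) is not a Text node.
void appendText(Document doc, Node elem) {
  int nChildren = elem.getChildNodes().getLength();
  Text text1 = doc.createTextNode("hello ");
  // Declared as Text rather than Node, because splitText is defined on Text.
  Text text2 = doc.createTextNode("world");
  elem.appendChild(text1);
  elem.appendChild(text2);
  // Splits "world" into two adjacent siblings, "wo" and "rld".
  text2.splitText(2);
  assert(elem.getChildNodes().getLength() == nChildren + 3);
  // Merges the three adjacent Text nodes into one.
  elem.normalize();
  assert(elem.getChildNodes().getLength() == nChildren + 1);
}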

As shown in Figure 2.12, after the call to Text.splitText, there are three new Text node children. However, after the call to Node.normalize, the three adjacent Text nodes are folded into a single node containing the string "hello world".

Figure 2.12. Text node normalization


The DOM defines two other CharacterData-related interfaces: Comment and CDATASection. The Comment interface (and corresponding concrete node type) extends CharacterData and is used to represent comment information items. The CDATASection interface (and corresponding concrete node type) extends Text and is used to signal the presence of CDATA start and end information items. Neither of these interfaces adds any operations beyond those present in their base interfaces.
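The inheritance relationships described above can be confirmed with a few instanceof checks. This is a rough sketch, assuming the JDK's default JAXP implementation; the class CharacterDataKinds and the method interfacesOf are hypothetical names used only for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.CDATASection;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

public class CharacterDataKinds {
    // Hypothetical helper: creates an empty Document to act as a node factory.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Lists which of the character data interfaces a node implements.
    static String interfacesOf(Node n) {
        StringBuilder sb = new StringBuilder();
        if (n instanceof CharacterData) sb.append("CharacterData ");
        if (n instanceof Text) sb.append("Text ");
        if (n instanceof CDATASection) sb.append("CDATASection ");
        if (n instanceof Comment) sb.append("Comment ");
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        // Prints "CharacterData Comment"
        System.out.println(interfacesOf(doc.createComment("a note")));
        // Prints "CharacterData Text CDATASection"
        System.out.println(interfacesOf(doc.createCDATASection("a < b")));
        // Prints "CharacterData Text"
        System.out.println(interfacesOf(doc.createTextNode("plain")));
    }
}
```

Because CDATASection extends Text, code that processes Text nodes also handles CDATA sections without special cases, whereas a Comment is character data but not Text.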
