  1. Sams Teach Yourself XML in 21 Days, Third Edition

Creating XML Documents Piece by Piece

Yesterday, you created this example XML document:

<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This is an XML document!
    </message>
</document>

That's a fully-functional XML document, but it's only an example. Today, we're going to be more systematic about what goes into an XML document, discussing all the possible parts of such documents. You'll take a look at these parts of an XML document in the coming sections:

  • Prologs
  • XML declarations
  • Processing instructions
  • Elements and attributes
  • Comments
  • CDATA sections
  • Entities

W3C defines everything that can go into XML documents in the XML 1.0 and 1.1 specifications, right down to our starting point—the character set you use.

Character Encodings: ASCII, Unicode, and UCS

The characters in an XML document are stored using numeric codes. That can be an issue, because different character sets use different codes, which means an XML processor might have problems trying to read an XML document that uses a character set—called a character encoding—other than what it's used to.

For example, a common character encoding used by text editors is the American Standard Code for Information Interchange (ASCII). ASCII is the default for plain text files created with Windows WordPad. ASCII is a 7-bit code, so its codes extend from 0 to 127 (8-bit extensions of ASCII go up to 255). For example, the ASCII code for A is 65, for B is 66, and so on. So, if you stored the word CAT in an XML document written in ASCII, the numbers 67, 65, and 84 are what would actually be stored. On the other hand, the World Wide Web is just that: worldwide. Plenty of scripts have no place among the 128 characters of ASCII, such as Cyrillic, Armenian, Hebrew, Thai, Tibetan, and so on.

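For that reason, W3C turned to Unicode (http://www.unicode.org). Unicode was originally designed around 16-bit codes, which allow 65,536 characters rather than just 128 (although only about 40,000 of those codes were assigned in early versions of the standard). Unicode has since grown beyond 16 bits; its code space now runs from U+0000 through U+10FFFF. To make things easier, the first 128 Unicode characters correspond to the ASCII character set.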

There's another character set with even more space than the original 16-bit Unicode: the Universal Character Set (UCS, also called ISO 10646), which uses up to 32 bits (four bytes) per character. This gives it a range of two billion symbols, and a good thing, too, since there are more Chinese characters alone than fit in 16 bits. UCS also encompasses the smaller Unicode character set: each Unicode character is represented by the same code in UCS, in much the same way that Unicode encompasses the smaller ASCII character set.

So which character sets are supported in XML? ASCII? Unicode? UCS? Stored as raw 16-bit codes, Unicode uses two bytes for each character, so a Unicode file would be twice as long as an ASCII file. For that and other reasons, it's difficult to convert much of the available software to Unicode. XML actually supports a compressed version of Unicode created by the UCS group called UCS Transformation Format-8 (UTF-8). UTF-8 includes all the ASCII codes unchanged, using a single byte for each of them. Any other Unicode character needs more than one byte (the original UTF-8 design allowed up to six; the current definition uses at most four). For example, the Unicode code for π is 03C0 in hexadecimal (960 in decimal), which UTF-8 stores in two bytes.

To make it easier to handle, UCS itself has also been compressed in a similar way into an encoding named UTF-16, which uses two bytes (instead of the four that UCS normally uses) for the most common characters, and four bytes for the less common characters.
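As a rough illustration of the byte counts discussed above, the following Python sketch encodes a few characters as UTF-8 and UTF-16 and reports how much space each one takes. The sample characters and the helper name byte_lengths are chosen here for illustration only; they don't come from the original text.

```python
# Illustrative sketch: how many bytes one character occupies in UTF-8 and UTF-16.
# The helper name and the sample characters are assumptions made for this example.

def byte_lengths(ch):
    """Return (code point, UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" encodes without a byte order mark, so only the character is counted.
    return ord(ch), len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

# An ASCII letter, Greek small letter pi, a CJK ideograph, and a musical symbol
# that lies beyond the 16-bit range.
samples = ["A", "\u03c0", "\u4e2d", "\U0001d11e"]

for ch in samples:
    code, utf8_len, utf16_len = byte_lengths(ch)
    print(f"U+{code:04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")
```

ASCII text is the case where UTF-8 is most compact, which is one reason it became the usual choice for XML documents written mostly in English.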

W3C requires all XML processors to support both UTF-8 (compressed Unicode, including the full ASCII set), and UTF-16 (compressed UCS, including the full ASCII set), and those are the only two W3C requires. The UTF-8 encoding is the most popular one today in XML documents, because you can store documents in ASCII using a text editor and they can be treated, without any changes, as UTF-8 by an XML processor (ASCII uses one byte for characters, and UTF-8 uses one byte for the most common characters, including all the characters in the ASCII set). In fact, we've been using UTF-8 since our first XML example, as you can see where we've specified the character encoding for a document with the encoding attribute in the XML declaration:


<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This is an XML document!
    </message>
</document>

UTF-8 is so widespread that an XML processor will assume you're using it if you omit the encoding attribute. Although W3C requires all XML processors to support UTF-16 and UTF-8 (so you can assign these values to the encoding attribute), most don't support UTF-16 yet.
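To see how one particular processor treats the encoding attribute, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. It parses the same document three ways: as UTF-8 with no XML declaration, as UTF-8 with a declaration, and as UTF-16 with a declaration. The sample document and the helper name heading_text are assumptions made for this example, and other XML processors may behave differently.

```python
import xml.etree.ElementTree as ET

def heading_text(xml_bytes):
    """Parse raw bytes and return the text of the <heading> element."""
    root = ET.fromstring(xml_bytes)
    return root.findtext("heading").strip()

# Hypothetical sample document containing one non-ASCII character (Greek pi).
body = "<document><heading>Hello From XML \u03c0</heading></document>"

# No XML declaration at all: the parser falls back to UTF-8.
no_declaration = body.encode("utf-8")

# Explicit UTF-8 declaration, as in the chapter's example.
utf8_declared = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")

# Python's "utf-16" codec prepends a byte order mark, which the parser uses
# to detect the encoding before it reads the declaration.
utf16_declared = ('<?xml version="1.0" encoding="UTF-16"?>' + body).encode("utf-16")

for label, data in [("no declaration", no_declaration),
                    ("UTF-8", utf8_declared),
                    ("UTF-16", utf16_declared)]:
    print(label, "->", heading_text(data), f"({len(data)} bytes)")
```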

Although only UTF-8 and UTF-16 are required, there are many character encodings that an XML processor can support, such as the following:

  • US-ASCII— U.S. ASCII
  • UTF-8— Compressed Unicode
  • UTF-16— Compressed UCS
  • ISO-10646-UCS-2— Unicode
  • ISO-10646-UCS-4— UCS
  • ISO-2022-JP— Japanese
  • ISO-2022-CN— Chinese
  • ISO-8859-5— ASCII and Cyrillic

The increasing adoption of Unicode is the main driving force behind XML 1.1. There are three main areas in which XML 1.1 differs from XML 1.0, all having to do with characters:

  • XML 1.1 accepts more Unicode characters than were available when XML 1.0 was created. (XML 1.0 was created when Unicode version 2.0 was current; now version 4.0 is being tested.)
  • XML 1.1 relaxes some rules for creating names (as used for elements and attributes) to allow more Unicode characters, and to allow for Unicode expansion in the future.
  • XML 1.1 recognizes more characters as legal ways to end a line.

You'll see these various points in more depth today. However, note that most of these differences are technical, and won't concern you a great deal. For example, XML 1.0 and 1.1 differ slightly in what character references you can use. As in HTML, a character reference stands for a Unicode character; it begins with &#, followed by a numeric code specifying the character, and ends with ;. You can enter a Unicode character in an XML document either as the character itself or as a character reference, which the XML processor will convert into the corresponding character.

For example, the Unicode code for π is 960 in decimal, so you can embed π in your XML document by entering π itself (if your text editor supports Unicode), or as the character reference &#960; (if your text editor doesn't support Unicode). The XML processor will replace the character reference with π. (You can also give the code in hexadecimal if you preface it with an x, which would be &#x03C0; in this case.)

The difference between XML 1.0 and XML 1.1 as far as character references go is that XML 1.1 allows the use of character references &#x1; through &#x1F;, most of which are forbidden in XML 1.0. Conversely, the characters with codes x7F through x9F, which were allowed either as literal characters or as character references in XML 1.0 documents, may generally appear only as character references in XML 1.1. These kinds of relatively small differences aren't going to concern us a great deal. For all these details, check the XML 1.1 specification itself.
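The character references described above can be tried out with a short Python sketch. It relies on the standard library's xml.etree.ElementTree module, whose underlying parser implements XML 1.0, so it demonstrates only the XML 1.0 rules. The helper names message_text and is_accepted are hypothetical.

```python
import xml.etree.ElementTree as ET

def message_text(xml_source):
    """Parse a one-element document and return its text content."""
    return ET.fromstring(xml_source).text

def is_accepted(xml_source):
    """Return True if the parser accepts the document, False if it reports an error."""
    try:
        ET.fromstring(xml_source)
        return True
    except ET.ParseError:
        return False

# The same character written three ways: decimal reference, hexadecimal
# reference, and the literal character (Greek small letter pi, code 960).
decimal_ref = message_text("<message>&#960;</message>")
hex_ref = message_text("<message>&#x03C0;</message>")
literal = message_text("<message>\u03c0</message>")
print(decimal_ref == hex_ref == literal)

# Under XML 1.0, a reference to control character x1 is not allowed,
# while a reference to the tab character (x9) is.
print(is_accepted("<message>&#x1;</message>"))
print(is_accepted("<message>&#x9;</message>"))
```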

That's given us a handle on the character encodings you can use to create XML documents. The next step is to see just how you put those characters to work in XML as you create markup and text data.

Understanding XML Markup and XML Data

At their most basic level, XML documents are combinations of markup and text data. They might also include binary data one day, but there's no way to include binary data in an XML document at the moment. (If you want to associate binary data with an XML document, you keep that data external to the document and use an entity reference, as you'll see later today and in Day 5 in detail.)

The markup in a document gives it its structure. Markup includes start tags, end tags, empty element tags, entity references, character references, comments, CDATA section delimiters (more about CDATA sections in a few pages), document type declarations, and processing instructions. What about the data in an XML document? All the text in an XML document that is not markup is data.

Although the markup we've seen has mostly consisted of tags up to this point, there's another type of markup that doesn't use tags: general entity references and parameter entity references. Whereas tags begin with < and end with >, general entity references start with & and end with ;. The character references we've already seen use the same delimiters; for example, &#x03C0; is a character reference for π, whichever encoding the document is stored in. General entity references are replaced by the entity they refer to when the document is parsed. Parameter entity references, which start with % and end with ;, are used in DTDs, as we'll see in Days 4 and 5.

For example, the markup &lt; is a general entity reference that is turned into a < (less than) symbol when parsed by an XML processor, and the general entity reference &gt; is turned into a > (greater than) symbol. You can see an example using these general entity references in Example 2.1.

Example 2.1. Using an Entity Reference (ch02_01.xml)

<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This text is inside a &lt;message&gt; element.
    </message>
</document>

You can see ch02_01.xml in Internet Explorer in Figure 2.9. As you can see in the figure, the markup &lt; was turned into a <, and the markup &gt; was turned into a > by the XML processor.

Figure 2.9 Using markup in Internet Explorer.

Besides character references, where a character code is replaced by the character it stands for, there are five predefined general entity references in XML, which are used for characters that browsers might otherwise assume are part of markup to be interpreted:

  • &lt;— Replaced with <
  • &gt;— Replaced with >
  • &amp;— Replaced with &
  • &quot;— Replaced with "
  • &apos;— Replaced with '

You can also create your own general entity references, which we'll do in Day 5.

When an XML processor parses your XML, it replaces general entity references like &gt; with the entity those references stand for, which is > in this case. Before it's parsed, text data is called character data; after it's been parsed and general entity references have been replaced with the entities they refer to, the text data is called parsed character data.
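The replacement of predefined entity references can be observed with a small Python sketch based on the standard library. The first part parses text like that in ch02_01.xml and shows the resulting parsed character data; the second part goes the other way with xml.sax.saxutils.escape, which turns raw text into a form that is safe to place inside an element. The sample strings are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing: the processor replaces &lt; and &gt; with the characters they stand for.
parsed = ET.fromstring(
    "<message>This text is inside a &lt;message&gt; element.</message>"
).text
print(parsed)

# All five predefined general entity references in a single element.
all_five = ET.fromstring("<m>&lt; &gt; &amp; &quot; &apos;</m>").text
print(all_five)

# Writing: escape() replaces &, <, and > in raw text with entity references
# so that the text is not mistaken for markup.
escaped = escape("AT&T <rocks>")
print(escaped)

# Round trip: escaped text parses back to the original string.
round_trip = ET.fromstring("<m>" + escaped + "</m>").text
print(round_trip)
```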

Using Whitespace and Ends of Lines

Spaces, carriage returns, line feeds, and tabs are all treated as whitespace in XML. That means that to an XML processor, this XML document:

<?xml version="1.0" encoding="UTF-8"?>
<document>
<heading>
Hello From XML
</heading>
<message>
This is an XML document!
</message>
</document>

is the same as this one, in terms of content:

<?xml version="1.0" encoding="UTF-8"?>
<document><heading>Hello From XML</heading>
<message>This is an XML document!</message></document>
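The comparison above can be sketched in Python. Note that Python's xml.etree.ElementTree keeps the whitespace it finds in element text, so this sketch trims it before comparing; whether an application treats the two documents as equivalent is up to the application. The helper name trimmed_content is hypothetical.

```python
import xml.etree.ElementTree as ET

INDENTED = b"""<?xml version="1.0" encoding="UTF-8"?>
<document>
<heading>
Hello From XML
</heading>
<message>
This is an XML document!
</message>
</document>
"""

COMPACT = b"""<?xml version="1.0" encoding="UTF-8"?>
<document><heading>Hello From XML</heading>
<message>This is an XML document!</message></document>
"""

def trimmed_content(xml_bytes):
    """Return (tag, text) pairs for every element, with surrounding whitespace removed."""
    root = ET.fromstring(xml_bytes)
    return [(el.tag, (el.text or "").strip()) for el in root.iter()]

# The raw text differs: the parser hands over the line breaks it found.
print(repr(ET.fromstring(INDENTED).find("heading").text))
print(repr(ET.fromstring(COMPACT).find("heading").text))

# Once the surrounding whitespace is trimmed, the two documents match.
print(trimmed_content(INDENTED) == trimmed_content(COMPACT))
```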

You can use a special attribute named xml:space in an element to indicate that you want whitespace to be preserved by XML processors (not all XML processors will support this attribute). You can set this attribute to "default" to indicate that the default handling of whitespace is OK for the current element and all contained elements, or "preserve" to indicate that you want all applications to preserve whitespace as it is in the document. This is useful if the XML processor is going to display the XML document visually:

<?xml version="1.0" encoding="UTF-8"?>
<document xml:space="preserve">
    <heading>
        Hello From XML
    </heading>
    <message>
        This is an XML document!
    </message>
</document>
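An application that wants to honor xml:space has to look the attribute up itself. The following Python sketch shows one possible way to do that with xml.etree.ElementTree, which reports the attribute under the reserved XML namespace. The helper name effective_space and the sample note and p elements are hypothetical; the sketch assumes that a setting applies to contained elements until one of them overrides it.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_space(element, target_tag, inherited="default"):
    """Return the xml:space setting that applies to the first element named target_tag.

    The setting is inherited from enclosing elements unless an element
    overrides it. Returns None if no such element is found.
    """
    current = element.get(XML_SPACE, inherited)
    if element.tag == target_tag:
        return current
    for child in element:
        found = effective_space(child, target_tag, current)
        if found is not None:
            return found
    return None

# Hypothetical document: <note> switches back to default handling.
root = ET.fromstring(
    '<document xml:space="preserve">'
    "<heading>  Hello From XML  </heading>"
    '<note xml:space="default"><p> aside </p></note>'
    "</document>"
)

print(effective_space(root, "heading"))
print(effective_space(root, "p"))
```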

In XML 1.0, lines officially end with a linefeed character (ASCII and UTF-8 code 10, the Unix way of ending lines). In MS-DOS and some Windows programs, lines end with a carriage return (ASCII and UTF-8 code 13) and linefeed pair, but when parsed by an XML processor, that pair (codes 13 and 10) is converted into a single linefeed (ASCII and UTF-8 code 10). XML 1.1 is mostly about expanding the character sets you can use, and its authors considered XML 1.0 to discriminate against the conventions used on IBM and IBM-compatible mainframes. That means that in XML 1.1, the acceptable line endings that XML processors are supposed to convert to &#xA; are expanded to include the following:

  • The two-character sequence &#xD; &#xA;
  • The two-character sequence &#xD; &#x85; (&#x85; is the New Line (NEL) character in many mainframes.)
  • The single character &#x85;
  • The single character &#x2028; (This is the Unicode line separator character.)
  • Any &#xD; character not immediately followed by &#xA; or &#x85;.
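Line-ending normalization can be observed with a short Python sketch. Python's xml.etree.ElementTree is built on an XML 1.0 parser, so the sketch shows only the XML 1.0 behavior (carriage return plus linefeed pairs and lone carriage returns become a single linefeed), not the XML 1.1 additions listed above. The sample text is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# The same two-line text with three different literal line endings.
windows = ET.fromstring(b"<message>line one\r\nline two</message>").text
lone_cr = ET.fromstring(b"<message>line one\rline two</message>").text
unix = ET.fromstring(b"<message>line one\nline two</message>").text

# After parsing, every variant uses a single linefeed.
print(repr(windows), repr(lone_cr), repr(unix))

# A carriage return written as a character reference is not a literal line
# ending in the file, so it is passed through unchanged.
referenced = ET.fromstring(b"<message>line one&#xD;line two</message>").text
print(repr(referenced))
```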

That brings us up through the basic structure of an XML document—markup and data. Now it's time to actually start putting markup and data to work as you start creating XML documents.
