Accessing XML Data
Now that you've seen how to create an XML document, we get to the fun part: how to write code to extract and manipulate data from an XML document using classes found in the .NET Framework. There's no one right way to do this; in fact, before .NET came along, there were two predominant ways to parse an XML document: the XML DOM and Simple API for XML (SAX).
There is an implementation of the XML DOM in the .NET Framework. In this chapter we'll primarily focus on the DocumentNavigator, XmlTextReader, and XmlTextWriter objects. These objects are the standard .NET way to access XML data; they provide a good combination of high performance, .NET integration, and ease of programming. But you should know about the other ways to deal with XML, too, particularly since the specialized .NET reader and writer objects are designed to interact with the Internet-standard DOM objects. So for the remainder of this chapter, we'll include brief examples of how to work with the DOM model as well.
About Simple API for XML (SAX)
Simple API for XML was designed to provide a higher level of performance and a simpler programmability model than XML DOM. It uses a fundamentally different programmability model: instead of reading in the entire document at once and exposing the elements of the document as nodes, SAX provides an event-driven model for parsing XML.
SAX is not yet supported in .NET. In fact, it's not even an official Internet standard. It's a programming interface for XML that was created by developers who wanted an XML parser with higher performance and a smaller memory footprint, especially when parsing very large documents.
Microsoft supports an event-driven model for XML parsing known as the DocumentNavigator. This model is similar in principle to SAX, but with different implementation details. We'll cover the DocumentNavigator later in this chapter.
Although it is not yet supported in the .NET Framework, SAX is supported in Microsoft's COM-based XML parser implementation. For more information on this tool, see http://msdn.microsoft.com/xml/.
Using the XML Document Object Model in .NET
The XML Document Object Model (DOM) is a programming interface used to parse XML documents. It was the first programming interface provided for XML by Microsoft; XML DOM implementations that target other languages and other operating systems are available.
The original Microsoft XML DOM implementation is COM-based, so it is accessible from any COM-compliant language. The XML parsers in .NET are, naturally, accessible from any .NET-compliant language.
Fortunately, the number of objects you need to work with on a regular basis in the XML DOM is minimal. In fact, the XML DOM recommendation segregates the objects in the DOM into two groups, fundamental classes and extended classes. Fundamental classes are the ones that application developers will find most useful; the extended classes are primarily useful to tool developers and people who like to pummel themselves with detail.
The fundamental classes of the XML DOM as implemented in the .NET Framework are XmlNode, XmlNodeList, and XmlNamedNodeMap. These classes, as well as the parent XmlDocument class, are illustrated in Figure 11.1.
Figure 11.1 Fundamental XML DOM objects.
Note that the XmlDocument object is technically an extended class, not a fundamental class, because it inherits from XmlNode. We're including discussion of it in this chapter because it's rather tricky to do useful stuff in XML without it. The class adds some useful file and URL-handling capabilities to XmlNode.
The XmlNode and XmlDocument classes are found in the System.Xml namespace. The XmlDocument class inherits from System.Xml.XmlNode. A reference to the classes, properties and methods introduced in this chapter is included at the end of this chapter.
In general, to work with an XML document using the DOM, you first open the document (using the .Load() or .LoadXml() method of the XmlDocument object). The .Load() method is overloaded and can take any one of three arguments: a string, a System.IO.TextReader object, or a System.Xml.XmlReader object.
The easiest way to demonstrate how to load an XML document from a file on disk is to pass the .Load() method a string. The string can either be the path of a local file on disk or a URL. If the string is a URL, the XmlDocument retrieves the document from a Web server. This is pretty handy; it makes you wish that every file-handling object worked this way.
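The other ways of populating an XmlDocument mentioned above can be sketched as follows. This is a minimal console sketch, not one of the chapter's listings: the module name, the function names, and the sample markup are hypothetical, and the Using block assumes a version of Visual Basic that supports it. It parses markup that is already in memory, first with .LoadXml() and then through a TextReader passed to .Load().

```vb
Imports System
Imports System.IO
Imports System.Xml

Module LoadFromMemorySketch
    ' Hypothetical sample markup; not the chapter's actual books.xml.
    Const SampleXml As String = "<BOOKS><BOOK><TITLE>Sample Title</TITLE></BOOK></BOOKS>"

    ' Parses markup held in a string and returns the root element's name.
    Function RootNameFromString(markup As String) As String
        Dim xd As New XmlDocument()
        xd.LoadXml(markup)
        Return xd.DocumentElement.Name
    End Function

    ' Parses the same markup through a TextReader, using a Load overload.
    Function RootNameFromReader(markup As String) As String
        Dim xd As New XmlDocument()
        Using sr As New StringReader(markup)
            xd.Load(sr)
        End Using
        Return xd.DocumentElement.Name
    End Function

    Sub Main()
        Console.WriteLine(RootNameFromString(SampleXml))
        Console.WriteLine(RootNameFromReader(SampleXml))
    End Sub
End Module
```

Both calls should report the same root element name, because the two methods differ only in where the markup comes from.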
Code Listing 11.7 shows an example of loading an XML document from disk using an XmlDocument object.
Listing 11.7 Loading a Local XML File Using the XmlDocument's .Load() Method
<%@ Import Namespace="System.Xml" %>
<SCRIPT runat='server'>
Sub Page_Load(Sender As Object, e As EventArgs)
    Dim xd As New XmlDocument()
    ' Load the document from the local file system
    xd.Load("c:\data\books.xml")
    ' Write the document's markup to the page
    Response.Write(xd.OuterXml)
    xd = Nothing
End Sub
</SCRIPT>
This code works for any XML document accessible to the local file system. Listing 11.8 demonstrates how to load an XML document that resides on a Web server.
Listing 11.8 Loading an XML File That Resides on a Web Server
<%@ Import Namespace="System.Xml" %>
<SCRIPT runat='server'>
Sub Page_Load(Sender As Object, e As EventArgs)
    Dim xd As New XmlDocument()
    ' Load the document over HTTP
    xd.Load("http://www.myserver.com/books.xml")
    Response.Write(xd.OuterXml)
    xd = Nothing
End Sub
</SCRIPT>
As you can see, the syntax is nearly identical whether you're loading the file from the local file system or over HTTP. Both of these examples are extremely simple; they demonstrate how easy it is to open and view an XML document using the DOM. The next step is to start doing things with the data in the document you've retrieved.
Viewing Document Data Using the XmlNode Object
Once you've loaded a document, you need some way to programmatically visit each of its nodes in order to determine what's inside. In the XML DOM, there are several ways to do this, all of which are centered around the XmlNode object.
The XmlNode object represents a node in the XML document. It exposes an object hierarchy that exposes attributes and child nodes, as well as every other part of an XML document.
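When you've loaded an XML document to parse it (as we demonstrated in the previous code examples), your next step will usually involve retrieving that document's top-level node. Use the FirstChild property to do this. (If the document begins with an XML declaration, FirstChild returns the declaration rather than the root element; the DocumentElement property always returns the root element.)
Listing 11.9 shows an example of retrieving and displaying the name of the top-level node in the document using FirstChild.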
Listing 11.9 Displaying the Name of the Top-Level Node Using the FirstChild Property
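<%@ Import Namespace="System.Xml" %>
<SCRIPT runat='server'>
Sub Page_Load(Sender As Object, e As EventArgs)
    Dim xd As New XmlDocument()
    xd.Load("c:\data\books.xml")
    ' Display the name of the document's first node
    ' (Response.Write is used because MsgBox cannot display in a server-side page)
    Response.Write(xd.FirstChild.Name)
    xd = Nothing
End Sub
</SCRIPT>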
The code demonstrates how the FirstChild property returns an XmlNode object with its own set of properties and methods. In the example, we read the Name property of the XmlNode object returned by FirstChild.
You can do more useful and interesting things with the XmlNode object. One common operation is drilling down and retrieving data from the ChildNodes object owned by XmlNode. Two features of ChildNodes make this possible: its status as an enumerable class, and the InnerText property of each child node.
Enumerable classes implement the .NET IEnumerable interface. This is the same interface definition that arrays, collections, and more complex constructs like ADO.NET DataSets support. (You may think of ChildNodes as just another collection, but in .NET, Collection is a distinct data type.)
When an object supports IEnumerable, it exposes functionality (through a behind-the-scenes object called an enumerator) that enables other processes to visit each of its child members. In the case of ChildNodes, the enumerator lets your code visit the object's child XmlNode objects. The For Each...Next block in Visual Basic is the construct that is most commonly used to traverse an enumerable class. Listing 11.10 shows an example of this.
Listing 11.10 Traversing the Enumerable ChildNodes Class
<%@ Import Namespace="System.Xml" %>
<SCRIPT runat='server'>
Sub Page_Load(Sender As Object, e As EventArgs)
    Dim xd As New XmlDocument()
    Dim ndBook As XmlNode
    Dim nd As XmlNode
    xd.Load("c:\data\books.xml")
    ' Get the first BOOK element under the top-level node
    ndBook = xd.FirstChild.Item("BOOK")
    ' Visit each child of BOOK and display the authors
    For Each nd In ndBook.ChildNodes
        If nd.Name = "AUTHOR" Then
            Response.Write("The author's name is " & nd.InnerText & "<BR>")
        End If
    Next
End Sub
</SCRIPT>
In this code example, the For Each...Next loop goes through the set of XmlNode objects found in ChildNodes. When it finds one whose Name property is AUTHOR, it writes that node's text to the page. Note that for the example file books.xml, two lines will be displayed, because the example book has two authors.
Note also that the value contained in an XML node is returned by the InnerText property in .NET, not by the .text property as it was in the COM-based MSXML library. Making a more granular distinction between a simple "text" property versus inner and outer text or inner and outer XML gives you a greater degree of power and flexibility. Use the "outer" properties when you want to preserve the markup of the node itself; the "inner" properties return what the node contains.
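To make the difference between these properties concrete, here is a minimal console sketch, assuming the InnerText, InnerXml, and OuterXml properties of XmlNode. The module name, the function name, and the sample markup (a TITLE element containing a nested I element) are hypothetical and are not taken from books.xml.

```vb
Imports System
Imports System.Xml

Module InnerOuterSketch
    ' Returns InnerText, InnerXml, and OuterXml (in that order) for the
    ' TITLE element of a small hypothetical document.
    Function DescribeTitle() As String()
        Dim xd As New XmlDocument()
        xd.LoadXml("<BOOK><TITLE>Sample <I>Title</I></TITLE></BOOK>")
        Dim nd As XmlNode = xd.DocumentElement.Item("TITLE")
        Return New String() {nd.InnerText, nd.InnerXml, nd.OuterXml}
    End Function

    Sub Main()
        For Each s As String In DescribeTitle()
            Console.WriteLine(s)
        Next
        ' Prints:
        ' Sample Title
        ' Sample <I>Title</I>
        ' <TITLE>Sample <I>Title</I></TITLE>
    End Sub
End Module
```

InnerText strips all markup, InnerXml keeps the markup of the node's children, and OuterXml also includes the node's own tags.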
With the few aspects of the XmlDocument and XmlNode objects we've discussed so far, you now have the ability to perform rudimentary retrieval of data in an XML document using the DOM. However, looping through a collection of nodes using For Each...Next leaves something to be desired. For example, what happens when your book node contains a set of 50 child nodes, and you're only interested in extracting a single child node from that?
Fortunately, .NET provides several objects that enable you to easily navigate the hierarchical structure of an XML document. These include the XmlTextReader and DocumentNavigator objects.
Using the XmlTextReader Object
The XmlTextReader object provides a method of accessing XML data that is both easier to code and more efficient than using the full-blown XML DOM. At the same time, the XmlTextReader understands DOM objects in a way that lets you use both types of access cooperatively.
XmlTextReader is found in the System.Xml namespace. It inherits from System.Xml.XmlReader, an abstract class. A reference to the classes, properties, and methods introduced in this chapter is included at the end of this chapter.
If you've used the XML DOM in the past, the XmlTextReader will change the way you think about XML parsing in general. The XmlTextReader doesn't load an entire XML document and expose its various nodes and attributes to you in the form of a large hierarchical tree; that process causes a large performance hit as data is parsed and buffered. Instead, think of the XmlTextReader object as a truck that bounces along the road from one place to another. Each time the truck moves across another interesting aspect of the landscape, you have the ability to take some kind of interesting action based on what's there.
Parsing an XML document using the XmlTextReader object involves a few steps. First, you create the object, optionally passing in a file name or URL that represents the source of XML to parse. Next, execute the Read method of the XmlTextReader object until that method returns the value False. (You'll typically set up a loop to do this so you can move from the beginning to the end of the document.)
Each time you execute the XmlTextReader object's Read method, the XmlTextReader object's properties are populated with fragments of information from the XML document you're parsing. This information includes the type of the data the object just read, and the value of the data itself (if any).
The type of data is exposed through the XmlTextReader object's NodeType property. The value of the data can be retrieved in an untyped format (through the Value property of the XmlTextReader object) or in a typed format (through such methods as ReadDateTime(), ReadInt32(), ReadString(), and so forth).
Most of the time, the NodeType property will be XmlNodeType.Element (an element tag), XmlNodeType.Text (the data contained in a tag), or XmlNodeType.Attribute.
Listing 11.11 shows an example of how this works. The objective of this example is to retrieve the title of a book from an XML file that is known to contain any one of a number of nodes pertaining to the book itself.
Listing 11.11 Extracting a Book Title Using the XmlTextReader Object
<%@ Import Namespace="System.Xml" %>
<SCRIPT runat='server'>
Sub Page_Load(Sender As Object, e As EventArgs)
    Dim xr As New XmlTextReader(Server.MapPath("books.xml"))
    Dim bTitle As Boolean
    ' Read one chunk of the document at a time
    While xr.Read()
        Select Case xr.NodeType
            Case XmlNodeType.Element
                ' Flag that the next text chunk is a book title
                If xr.Name = "TITLE" Then
                    bTitle = True
                End If
            Case XmlNodeType.Text
                If bTitle Then
                    Response.Write("Book title: " & xr.ReadString)
                    bTitle = False
                End If
        End Select
    End While
End Sub
</SCRIPT>
This code example can be found in the downloadable code examples under the XML section in the package XmlTextReader.zip.
The example opens the XML file by passing the name of the XML file to the constructor of the XmlTextReader object. It then reads one chunk of the document at a time (through successive calls to the XmlTextReader object's Read method). If the current data represents the element name "TITLE", the code sets a flag, bTitle.
When the bTitle flag is set to True, it means "get ready, a book title is coming next." The book title itself is extracted in the next few lines of code. When the code encounters the text chunk, it extracts it from the XML document in the form of a string.
Note that the values XmlNodeType.Element and XmlNodeType.Text are predefined members of the XmlNodeType enumeration. You can set up more involved parsing structures based on any XML type found in the DOM if you wish. For example, if you included a case to process based on the type XmlNodeType.XmlDeclaration, you could process the XML declaration that appears (but is not required to appear) as the first line of the XML document.
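A case for the XML declaration might look something like the following minimal console sketch. It assumes the XmlTextReader constructor that accepts a TextReader, so the markup can come from a string rather than a file; the module name, the function name, and the sample markup are hypothetical.

```vb
Imports System
Imports System.IO
Imports System.Xml

Module DeclarationSketch
    ' Returns the name and content of the XML declaration,
    ' or a placeholder when the document has none.
    Function DescribeDeclaration(markup As String) As String
        Dim xr As New XmlTextReader(New StringReader(markup))
        Dim result As String = "(no declaration)"
        While xr.Read()
            Select Case xr.NodeType
                Case XmlNodeType.XmlDeclaration
                    ' Name is the declaration's target; Value is its content
                    result = xr.Name & ": " & xr.Value
            End Select
        End While
        xr.Close()
        Return result
    End Function

    Sub Main()
        ' Hypothetical sample markup, with and without a declaration.
        Console.WriteLine(DescribeDeclaration("<?xml version='1.0'?><BOOKS/>"))
        Console.WriteLine(DescribeDeclaration("<BOOKS/>"))
    End Sub
End Module
```

Because the declaration is optional, the function falls back to a placeholder instead of assuming the case will always be reached.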
As you can see from these examples, a beautiful thing about XML is the fact that if the structure of the document changes, your parsing code will still work correctly, as long as the document contains a TITLE node. (In the previous code example, if for some reason the document contains no book title, no action is taken.) So the problems we discussed at the beginning of this chapter go away in the new world of XML parsing.
The XmlTextReader works well for both large and small documents. Under most circumstances (particularly for large documents), it should perform better than the XML DOM parser. However, like the DOM, it too has its own set of limitations. The XmlTextReader object doesn't have the ability to scroll, that is, to jump around between various areas in the document. (If you're a database developer, you can think of an XmlTextReader as being analogous to a cursorless or forward-only result set.) Also, as its name implies, the XmlTextReader object only permits you to read data; you can't use it to make changes in existing node values or add new nodes to an existing document.
To provide a richer set of features, including the ability to scroll backward and forward in a document, the .NET Framework provides another object, the DocumentNavigator object.
Using the DocumentNavigator Object
So far in this chapter you've seen two distinct ways provided by the .NET Framework to access XML data: the XML Document Object Model and the XmlTextReader object. Both have their advantages and drawbacks.
In many ways, the DocumentNavigator object represents the best of all worlds. It provides a simpler programmability model than the XmlDocument object, yet it integrates with the standard DOM objects nicely. In fact, in most cases when you're working with XML data in .NET, you'll typically create a DocumentNavigator by creating a DOM XmlDocument object first.
The DocumentNavigator class is found in the System.Xml namespace. It inherits from System.Xml.XmlNavigator, an abstract class. A reference to the classes, properties, and methods introduced in this chapter is included at the end of this chapter.
Listing 11.12 shows an example of creating a DocumentNavigator object from an existing XmlDocument object that has been populated with data.
Listing 11.12 Creating a DocumentNavigator Object from an XmlDocument Object
<%@ Import Namespace="System.Xml" %>
<SCRIPT runat='server'>
Sub Page_Load(Sender As Object, e As EventArgs)
    Dim xd As New XmlDocument()
    ' Populate the document before creating the navigator
    xd.Load(Server.MapPath("books.xml"))
    Dim xn As DocumentNavigator = New DocumentNavigator(xd)
    ' Code to work with the DocumentNavigator goes here
End Sub
</SCRIPT>
Navigating Through the Document Using the DocumentNavigator Object
After you've created and populated the DocumentNavigator, you can move through the document. You begin by moving to the beginning of the document by executing the DocumentNavigator's MoveToDocument method. This method is handy because it always gets you to the beginning of the document. In a way, MoveToDocument is the DocumentNavigator version of the ADO.old Recordset object's MoveFirst method.
The .NET Framework SDK documentation suggests that you must execute the MoveToDocument method first, before you can begin working with a document using a DocumentNavigator. We didn't find this to be the case. Go figure.
Unfortunately, the similarities between the Recordset and the XML DocumentNavigator end there. This is because the Recordset represents a nice, tidy two-dimensional array of data; the DocumentNavigator, in contrast, provides access to XML documents that can contain complex hierarchies. So rather than starting at the top and working your way down as with a Recordset, the DocumentNavigator must provide a way for you to access subordinate child nodes of any given XML node, in addition to letting you move up and down within the node you're in currently.
The navigation methods of the DocumentNavigator object make a distinction between parent-child node relationships and sibling node relationships in an XML document. For example, in our example books.xml document, the BOOKS node is the parent of the BOOK node, and the BOOK node is the parent of the TITLE and AUTHOR nodes. You use MoveToChild to navigate between these nodes. TITLE and AUTHOR, on the other hand, are at the same level in the hierarchy; they're sibling nodes. You use MoveToNext to navigate from one sibling node to the next. Figure 11.2 illustrates this.
Figure 11.2 Navigating through an XML hierarchy using the DocumentNavigator object
When navigating in any given XML document, much hinges on whether the document's structure is known to you. If you can assume that a given parent node will have child nodes 100% of the time, it can save you headaches. However, before you use navigational methods such as MoveToChild, you may wish to first test to see whether children actually exist or not. You use the HasChildren method to do this.
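The same defensive check can be sketched against the DOM classes. Note that this minimal console sketch uses the HasChildNodes property of XmlNode as a stand-in, not the DocumentNavigator itself; the module name, the function name, and the sample markup are hypothetical.

```vb
Imports System
Imports System.Xml

Module ChildCheckSketch
    ' Returns the name of the node's first child, or "(none)" when the
    ' node has no children to descend into.
    Function FirstChildName(nd As XmlNode) As String
        If nd.HasChildNodes Then
            Return nd.FirstChild.Name
        End If
        Return "(none)"
    End Function

    Sub Main()
        Dim xd As New XmlDocument()
        ' Hypothetical sample markup: one populated BOOK and one empty BOOK.
        xd.LoadXml("<BOOKS><BOOK><TITLE>Sample Title</TITLE></BOOK><BOOK/></BOOKS>")
        Dim root As XmlNode = xd.DocumentElement
        Console.WriteLine(FirstChildName(root.FirstChild)) ' TITLE
        Console.WriteLine(FirstChildName(root.LastChild))  ' (none)
    End Sub
End Module
```

Testing first costs one extra line and avoids a failure when a document turns out to be sparser than expected.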
MoveToChild is an indexed method; it takes an integer value that represents the number of the subordinate node you want to move to. (In our simple example we'll assume that each parent node only has one child node.) Because all indexes in .NET begin with zero, you pass the value zero to the MoveToChild method to move to the first child of a node.
The first time you execute MoveToChild, you're taken to the root node of the document. After that, each successive call to MoveToChild takes you deeper in the hierarchy. So for a document like books.xml that contains BOOKS, BOOK, and TITLE nodes, you'd have to execute MoveToChild three times before you landed at the first TITLE node. Listing 11.13 demonstrates this.
Listing 11.13 Using MoveToChild to Drill Down into the Hierarchy of an XML Document
<%@ Import Namespace="System.Xml" %>
<SCRIPT runat='server'>
Sub Page_Load(Sender As Object, e As EventArgs)
    Dim xd As New XmlDocument()
    xd.Load(Server.MapPath("books.xml"))
    Dim xn As DocumentNavigator = New DocumentNavigator(xd)
    xn.MoveToChild(0) ' Go to BOOKS
    xn.MoveToChild(0) ' Go to BOOK
    xn.MoveToChild(0) ' Go to TITLE
    Response.Write(xn.Name & " - " & xn.Value & "<BR>")
End Sub
</SCRIPT>
Once you've moved to a node that contains data, you can move to the next sibling node using the MoveToNext method. Listing 11.14 shows an example of this, outputting all the node names and values for the book stored in the document.
Listing 11.14 Using MoveToNext to Navigate Between Sibling Nodes
<%@ Import Namespace="System.Xml" %>
<SCRIPT runat='server'>
Sub Page_Load(Sender As Object, e As EventArgs)
    Dim xd As New XmlDocument()
    xd.Load(Server.MapPath("books.xml"))
    Dim xn As DocumentNavigator = New DocumentNavigator(xd)
    xn.MoveToChild(0) ' Go to BOOKS
    xn.MoveToChild(0) ' Go to BOOK
    xn.MoveToChild(0) ' Go to TITLE
    ' Output book title
    Response.Write("<B>" & xn.InnerText & "</B><BR>")
    Do While xn.MoveToNext()
        ' Output all authors
        Response.Write(xn.Name & " - " & xn.InnerText & "<BR>")
    Loop
End Sub
</SCRIPT>
This is a good demonstration of how MoveToNext serves to control the looping structure, similar to the Read method of the XmlTextReader object we discussed earlier in this chapter. Because MoveToNext returns True when it successfully navigates to a sibling node and False when there are no more nodes left to navigate to, it's easy to set up a While loop that displays data for all the nodes owned by a book.
So you've seen with the previous few examples that navigating using MoveToChild and MoveToNext works well enough. But if the process of repeatedly executing the MoveToChild and MoveToNext methods to drill down into the document hierarchy seems a little weak to you, you're right. For example, how do you go directly to a node when you know the name of the node and can be reasonably sure that the node exists? And how do you get rid of the inelegant process of calling MoveToChild repeatedly to drill down to the place in the document where useful data exists?
Fortunately, the DocumentNavigator provides a number of more sophisticated techniques for drilling into the document hierarchy, which we'll discuss in more detail in the next few sections.
Using the Select and SelectSingle Methods to Retrieve Nodes Using an XPath Query
The Select method of the DocumentNavigator object enables you to filter and retrieve subsets of XML data from any XML document. You do this by constructing an XPath expression and passing it to either the Select or the SelectSingle method of the DocumentNavigator object. An XPath expression is a compact way of querying an XML document without going to the trouble of writing code to walk through the whole thing yourself. Using XPath, it's possible to retrieve very useful subsets of information from an XML document, often with only a single line of code.
XPath syntax is described in more detail in the section "Querying XML Documents Using XPath Expressions" later in this chapter.
Listing 11.15 shows a very simple example of using an XPath expression passed to the Select method to move to and display the first author of the book in the document books.xml.
Listing 11.15 Using the Select Method of the DocumentNavigator Object to Retrieve a Subset of Nodes
<%@ Import Namespace="System.Xml" %>
<SCRIPT runat='server'>
Sub Page_Load(Sender As Object, e As EventArgs)
    Dim xd As New XmlDocument()
    xd.Load(Server.MapPath("books.xml"))
    Dim xn As DocumentNavigator = New DocumentNavigator(xd)
    xn.MoveToDocument()
    ' Select every AUTHOR owned by a BOOK under BOOKS
    xn.Select("BOOKS/BOOK/AUTHOR")
    ' Move to the first node in the selection
    xn.MoveToNextSelected()
    Response.Write(xn.Name & " - " & xn.InnerText)
End Sub
</SCRIPT>
When the Select method in this example is executed, you're telling the DocumentNavigator object to retrieve all of the AUTHOR nodes owned by BOOK nodes contained in the BOOKS root node. The XPath expression "BOOKS/BOOK/AUTHOR" means "all the authors owned by BOOK nodes under the BOOKS root node." Any AUTHOR nodes in the document owned by parent nodes other than BOOK won't be retrieved, although you could construct an XPath expression to retrieve AUTHOR nodes anywhere in the document regardless of their parentage.
The product of this operation is a selection, a subset of XML nodes that can then be manipulated independently of the main document. After you've retrieved a selection, you then execute the MoveToNextSelected method to move to the first selected node. From there you can retrieve and display the data from the selected nodes (potentially calling MoveToNextSelected again to loop through all the selected nodes).
This example is useful, but it has a flaw: It only displays the first author, and this book has two authors! In this case, the Select method did indeed retrieve all the AUTHOR nodes owned by the BOOK node; we just didn't display them. To display all of them, we need to create a loop. Listing 11.16 demonstrates how to do this.
Listing 11.16 Using the Select Method of the DocumentNavigator Object to Display All Book Authors
<%@ Import Namespace="System.Xml" %>
<SCRIPT runat='server'>
Sub Page_Load(Sender As Object, e As EventArgs)
    Dim xd As New XmlDocument()
    xd.Load(Server.MapPath("books.xml"))
    Dim xn As DocumentNavigator = New DocumentNavigator(xd)
    xn.MoveToDocument()
    xn.Select("BOOKS/BOOK/AUTHOR")
    ' Loop until there are no more selected nodes
    While xn.MoveToNextSelected()
        Response.Write(xn.Name & " - " & xn.InnerText & "<BR>")
    End While
End Sub
</SCRIPT>
In this example, we're taking advantage of the fact that the MoveToNextSelected method returns a Boolean True/False value based on whether there are any more nodes in the selection to retrieve. If there is no next node, the method returns False and your loop exits. This code will work for zero, one, or many authors in a document.
The DocumentNavigator object also gives you a way to explicitly retrieve only the first match returned by an XPath query. The SelectSingle method retrieves a single node that matches the XPath expression. Be careful when using SelectSingle, though. You'll want to ensure that the document doesn't have more than one instance of the node you're looking for, because the method will only select the first node that matches your expression.
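The "first match only" behavior is easy to see with a document that has two authors. The following minimal console sketch uses the SelectSingleNode and SelectNodes methods of the DOM XmlNode class as stand-ins; it does not call the DocumentNavigator's Select or SelectSingle methods. The module name, the function names, and the sample markup are hypothetical.

```vb
Imports System
Imports System.Xml

Module SingleMatchSketch
    ' Hypothetical sample markup: one book with two authors.
    Const SampleXml As String = "<BOOKS><BOOK><TITLE>Sample Title</TITLE>" &
        "<AUTHOR>First Author</AUTHOR><AUTHOR>Second Author</AUTHOR></BOOK></BOOKS>"

    ' Returns the text of the first node that matches the expression.
    Function FirstAuthor() As String
        Dim xd As New XmlDocument()
        xd.LoadXml(SampleXml)
        Return xd.SelectSingleNode("BOOKS/BOOK/AUTHOR").InnerText
    End Function

    ' Returns how many nodes match the same expression.
    Function AuthorCount() As Integer
        Dim xd As New XmlDocument()
        xd.LoadXml(SampleXml)
        Return xd.SelectNodes("BOOKS/BOOK/AUTHOR").Count
    End Function

    Sub Main()
        Console.WriteLine(FirstAuthor()) ' First Author
        Console.WriteLine(AuthorCount()) ' 2
    End Sub
End Module
```

The second author is silently ignored by the single-node query, which is exactly the situation the warning above describes.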
Querying XML Documents Using XPath Expressions
XPath is a set-based query syntax for extracting data from an XML document. If you're accustomed to database programming using Structured Query Language (SQL), you can think of XPath as being somewhat equivalent to SQL. But as with so many analogies between relational and XML data, the similarities run out quickly. XPath demands a completely different implementation to handle the processing of hierarchical data.
The XPath syntax is a World Wide Web Consortium (W3C) recommendation. You can get more information about XPath from the W3C site at http://www.w3.org/TR/xpath. Information on the Microsoft XML 3.0 (COM) implementation of XPath is at http://msdn.microsoft.com/library/psdk/xmlsdk/xslr0fjs.htm.
The idea behind XPath is that you should be able to extract data from an XML document using a compact expression, ideally on a single line of code. Using XPath is generally a more concise way to extract information buried deep within an XML document. (The alternative to using XPath is to write loops or recursive functions, as most of the examples used earlier in this chapter did.) The compactness of XPath comes at a price, though: readability. Unless you're well versed in the XPath syntax, you may have trouble figuring out what the author of a complicated XPath expression was trying to look up. Bear this in mind as you utilize XPath in your applications.
While the complete XPath syntax is quite involved (and beyond the scope of this book), there are certain commonly used operations you should know about as you approach XML processing using the .NET Framework classes. The three most common XPath scenarios are:
Retrieving a subset of nodes that match a certain value (for example, all of the orders associated with customers)
Retrieving one or more nodes based on the value of an attribute (such as retrieving all of the orders for customer ID 1006)
Retrieving all the parent and child nodes where an attribute of a child node matches a certain value (such as retrieving all the customers and orders where the Item attribute of the order node equals "Tricycle")
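The three scenarios above can be sketched with one XPath expression each. This is a minimal console sketch under stated assumptions: the CUSTOMERS, CUSTOMER, and ORDER element names, the ID and Item attribute names, and the sample data are all hypothetical, and the queries are run through the SelectNodes method of the DOM XmlDocument class rather than through the DocumentNavigator.

```vb
Imports System
Imports System.Xml

Module XPathScenarioSketch
    ' Hypothetical sample data: two customers with three orders in total.
    Const SampleXml As String =
        "<CUSTOMERS>" &
        "<CUSTOMER ID='1006'><ORDER Item='Tricycle'/><ORDER Item='Wagon'/></CUSTOMER>" &
        "<CUSTOMER ID='1007'><ORDER Item='Scooter'/></CUSTOMER>" &
        "</CUSTOMERS>"

    ' Returns the number of nodes that match the XPath expression.
    Function CountMatches(xpath As String) As Integer
        Dim xd As New XmlDocument()
        xd.LoadXml(SampleXml)
        Return xd.SelectNodes(xpath).Count
    End Function

    Sub Main()
        ' 1. All of the orders associated with customers
        Console.WriteLine(CountMatches("CUSTOMERS/CUSTOMER/ORDER"))                   ' 3
        ' 2. All of the orders for customer ID 1006
        Console.WriteLine(CountMatches("CUSTOMERS/CUSTOMER[@ID='1006']/ORDER"))       ' 2
        ' 3. Customers (with their orders) that have an order for a Tricycle
        Console.WriteLine(CountMatches("CUSTOMERS/CUSTOMER[ORDER/@Item='Tricycle']")) ' 1
    End Sub
End Module
```

In each expression the square brackets hold a predicate that filters the nodes named just before it, and the @ sign refers to an attribute rather than a child element.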