Parsing an XML Document and Extracting Statistics
Now that you've got a good understanding of what it means to parse an XML document, let's take a look at a short example that uses one of the tools that were just discussed. For this example, we'll use the XML::Parser Perl module to parse a small XML document. Listing 2.1 shows a small XML document that contains statistics from two great baseball players. As you can see, the XML document is very simple. It has two <player> elements, and each <player> element has four child elements (<name>, <team>, <home_runs>, and <batting_average>). Granted, this is a small and simple XML document, but it is perfect for illustrating simple XML parsing. Our goal for this example is to parse this document and extract the statistics for each player.
Listing 2.1 Career statistics for two Hall of Fame baseball players stored in an XML document. (Filename: ch2_baseball_stats.xml)
<?xml version="1.0"?>
<career_statistics>
   <player>
      <name>Mickey Mantle</name>
      <team>NY Yankees</team>
      <home_runs>536</home_runs>
      <batting_average>.298</batting_average>
   </player>
   <player>
      <name>Joe DiMaggio</name>
      <team>NY Yankees</team>
      <home_runs>361</home_runs>
      <batting_average>.325</batting_average>
   </player>
</career_statistics>
We wrote a small application to parse this XML document. Don't worry about the source code yet; we'll start off slowly and begin showing Perl applications a little later in this chapter. For now, just focus on how an XML parser works, and don't be too concerned with the small details.
We built the application with the XML::Parser Perl module, an event-based parser that is discussed in Chapter 3, "Event-Driven Parser Modules." The output generated by the Perl XML parsing application is shown here:
career_statistics =
player =
name = Mickey Mantle
team = NY Yankees
home_runs = 536
batting_average = .298
player =
name = Joe DiMaggio
team = NY Yankees
home_runs = 361
batting_average = .325
Note that in the output, each element is printed in the order in which its opening tag appears in the XML document. As you look at the output listing, you're probably asking why the <career_statistics> and <player> elements are empty, right? Well, take another look at the original XML document in Listing 2.1. Notice that the original XML file contains a <career_statistics> element that is made up of two child elements named <player>. Each <player> element is made up of four child elements: <name>, <team>, <home_runs>, and <batting_average>. So, when the XML document is parsed, the <career_statistics> and <player> elements don't contain any character data themselves; they contain only other elements (their child elements). When the <name>, <team>, <home_runs>, and <batting_average> elements are encountered, the parser can extract the character data contained by each element. Of course, more complex scenarios exist in which elements contain attributes, character data, and child elements, but let's go one step at a time. Later examples cover all these scenarios.
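For readers who would like a preview, the following is a minimal sketch of how an event-based parse of Listing 2.1 might look with XML::Parser. It is not the application that produced the output above; the variable names (@lines, @open) are hypothetical, and the document is embedded in the script rather than read from ch2_baseball_stats.xml so that the sketch is self-contained. The Start, Char, and End handlers are the standard XML::Parser handler types: Start fires at each opening tag, Char delivers character data, and End fires at each closing tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Assumption: the document from Listing 2.1, embedded here for convenience.
my $xml = <<'XML';
<?xml version="1.0"?>
<career_statistics>
   <player>
      <name>Mickey Mantle</name>
      <team>NY Yankees</team>
      <home_runs>536</home_runs>
      <batting_average>.298</batting_average>
   </player>
   <player>
      <name>Joe DiMaggio</name>
      <team>NY Yankees</team>
      <home_runs>361</home_runs>
      <batting_average>.325</batting_average>
   </player>
</career_statistics>
XML

my @lines;   # one { name, text } record per element, in opening-tag order
my @open;    # stack of indexes into @lines for the elements currently open

my $parser = XML::Parser->new(
    Handlers => {
        # Opening tag: record the element and mark it as the innermost open one.
        Start => sub {
            my ($expat, $element) = @_;
            push @lines, { name => $element, text => '' };
            push @open, $#lines;
        },
        # Character data: append it to the innermost open element.
        # The parser may deliver one run of text in several pieces.
        Char => sub {
            my ($expat, $string) = @_;
            $lines[ $open[-1] ]{text} .= $string if @open;
        },
        # Closing tag: the element is no longer open.
        End => sub { pop @open },
    },
);

$parser->parse($xml);

for my $entry (@lines) {
    # Strip surrounding whitespace, such as the indentation between child elements.
    (my $text = $entry->{text}) =~ s/^\s+|\s+$//g;
    print "$entry->{name} = $text\n";
}
```

In this sketch, the only character data that <career_statistics> and <player> receive is the whitespace between their child elements, which is trimmed before printing. That is why those two elements show nothing after the equals sign, whereas <name>, <team>, <home_runs>, and <batting_average> show their values.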
Now that we've given a quick introduction to parsing XML, the first question is, how do we do it? That's the easy part: use any of a number of Perl XML modules. It's only natural to try to use Perl and XML together to solve complex problems. A large number of Perl modules are available that perform just about every conceivable task related to XML. As of this writing, there are nearly 500 Perl modules related to XML. We're going to explain the most popular modules, show you how to use them, and demonstrate them using real-world examples that can easily be extended to support more complex applications.