
This chapter is from the book

The Problem: Generating and Parsing an XML Document

The two most common XML programming tasks are XML document generation and XML document parsing. In this section, we will present two applications that demonstrate how to perform these tasks, which form the basis for the more advanced applications throughout the book.

The first program reads a CSV file that contains user information and transforms it into XML. The second program parses the XML file that was generated by the first program, retrieves the user information, and transforms it into HTML for display in a web browser.

Generating an XML Document

Let's assume you are given the task of converting the information in an electronic mailing list from CSV to XML because a new email application supports only XML input. The fields in the CSV file will be formatted as follows, with each user record on a separate line:

First Name, Last Name, Email Address

Simple enough? That's why the generation step needs nothing more than the Perl print function, and why we decided to use the Perl XML::Simple module for the parsing step later in this section. XML::Simple is designed especially for simple tasks, although it can be used for more complex ones. The rest of the book focuses on other Perl XML modules that are better suited to more difficult tasks. After you become familiar with the other XML modules, you'll be able to make a sound judgment and pick the right tool for the task at hand.

NOTE

Additional information about the XML::Simple Perl module can be found by typing perldoc XML::Simple. Also, XML::Simple will be discussed in greater detail in Chapter 4, "Tree-Based Parser Modules."

Now that you have an understanding of the program's requirements, you can really get your hands dirty. The CSV is a simple file with only three fields: first name, last name, and email address. The input CSV file is shown in Listing 2.6.

Listing 2.6 The input CSV file containing the source data for the XML document. (Filename: ch2_xmlusers.csv)

Ilya,Sterin,isterin@cpan.org
Mark,Riehl,mark_riehl@hotmail.com
John,Smith,j.smith@xmlproj.com

Granted, this is a very small input file because it only has three records. However, the techniques used in this example can be applied to input data files with 3,000 or 30,000,000 records.

Now that we have the format of the input information, let's take a quick look at the required format of the output XML file. Let's assume that the output file is required to have the following format:

<users>
  <user>
   <first_name> </first_name>
   <last_name> </last_name>
   <email_address> </email_address>
  </user>
  ...
</users>

Note that we have a <users> element at the root of the document that is made up of multiple <user> elements. Each <user> element has one <first_name>, <last_name>, and <email_address> element.

The data for each user is stored in a predefined order: <first_name>, <last_name>, <email_address>. Let's take a minute and list the steps that we would need to perform if we were doing this conversion by hand.

  1. Read each line of the input file.

  2. Parse the line into columns based on location of the delimiters.

  3. Print each column surrounded by the proper start and end tags.

Because this is such a simple example, this is exactly how our program will work. Listing 2.7 shows the Perl program used to parse the CSV file and generate the desired output XML document. This example doesn't use any XML modules; it just parses the input file and then creates an XML document using the Perl print function. The example is explained in detail in the following sections.

Listing 2.7 Perl application that converts the CSV input file to XML. (Filename: ch2_csv_to_xml_app.pl)

1.  use strict;
2.  
3.  # Open the ch2_xmlusers.csv file for input
4.  open(CSV_FILE, "ch2_xmlusers.csv") || 
5.   die "Can't open file: $!";
6.  
7.  # Open the ch2_xmlusers.xml file for output
8.  open(XML_FILE, ">ch2_xmlusers.xml") || 
9.   die "Can't open file: $!";
10. 
11. # Print the initial XML header and the root element
12. print XML_FILE "<?xml version=\"1.0\"?>\n";
13. print XML_FILE "<users>\n";
14. 
15. # The while loop to traverse through each line in ch2_xmlusers.csv
16. while(<CSV_FILE>) {
17.   chomp; # Delete the new line char for each line
18. 
19.   # Split each field, on the comma delimiter, into an array
20.   my @fields = split(/,/, $_);
21. 
22. print XML_FILE<<"EOF";
23.   <user>
24.     <first_name>$fields[0]</first_name>
25.     <last_name>$fields[1]</last_name>
26.     <email_address>$fields[2]</email_address>
27.   </user>
28. EOF
29.  }
30. 
31. # Close the root element
32. print XML_FILE "</users>";
33. 
34. # Close all open files
35. close CSV_FILE;
36. close XML_FILE;

1–13 The opening section of the program starts with the use strict pragma, which is good practice in all Perl programs. The file ch2_xmlusers.csv is opened for input and ch2_xmlusers.xml is opened for output. After both files have been opened, we print the XML declaration and the opening tag of the <users> element to the output file by using the XML_FILE filehandle.

1.  use strict;
2.  
3.  # Open the ch2_xmlusers.csv file for input
4.  open(CSV_FILE, "ch2_xmlusers.csv") || 
5.   die "Can't open file: $!";
6.  
7.  # Open the ch2_xmlusers.xml file for output
8.  open(XML_FILE, ">ch2_xmlusers.xml") || 
9.   die "Can't open file: $!";
10. 
11. # Print the initial XML header and the root element
12. print XML_FILE "<?xml version=\"1.0\"?>\n";
13. print XML_FILE "<users>\n";

NOTE

In your program, use strict disallows several error-prone constructs (for example, using a variable that has not been declared), and use diagnostics expands the messages that Perl prints when it encounters an error or a questionable construct.

15–29 This portion of the program performs a majority of the work in this example. The input file (ch2_xmlusers.csv) is read line by line within the while loop, and each line is broken up into columns by the delimiter, which in our case is a comma. The columns are then printed to ch2_xmlusers.xml, surrounded by the appropriate tags. To print this example, we're using a Perl "Here" document, which allows us to print a block of text without having to use a print statement for each line. In this case, we're printing everything between print XML_FILE<<"EOF" and EOF. Note that variables are being evaluated inside this block. For additional information on Perl "Here" documents, please see perldoc perldata.

Because we know the order of the input columns (first_name, last_name, email_address), we can print the input contents directly to the output XML file. This procedure is repeated for each line (user) in the ch2_xmlusers.csv file, and a <user> element is created each time through the while loop.

15. # The while loop to traverse through each line in ch2_xmlusers.csv
16. while(<CSV_FILE>) {
17.   chomp; # Delete the new line char for each line
18. 
19.   # Split each field, on the comma delimiter, into an array
20.   my @fields = split(/,/, $_);
21. 
22. print XML_FILE<<"EOF";
23.   <user>
24.     <first_name>$fields[0]</first_name>
25.     <last_name>$fields[1]</last_name>
26.     <email_address>$fields[2]</email_address>
27.   </user>
28. EOF
29.  }
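The loop above interpolates each field directly into the output, which works for the sample data but would produce malformed XML if a field contained a character such as & or <. The following is a minimal sketch of one way to guard against that. The helper names xml_escape and user_element are hypothetical and are not part of the book's code, and the sample record is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the characters that are special in XML
# character data with their predefined entity references.
sub xml_escape {
    my ($text) = @_;
    $text = '' unless defined $text;   # tolerate a missing column
    $text =~ s/&/&amp;/g;              # must run before the others
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: turn one CSV line into a <user> element.
sub user_element {
    my ($line) = @_;
    chomp $line;
    my @fields = split /,/, $line;
    my ($first, $last, $email) = map { xml_escape($fields[$_]) } 0 .. 2;
    return <<"EOF";
  <user>
    <first_name>$first</first_name>
    <last_name>$last</last_name>
    <email_address>$email</email_address>
  </user>
EOF
}

# Made-up record containing a character that needs escaping.
print user_element('John,Smith & Sons,j.smith@xmlproj.com');
```

Inside the while loop of Listing 2.7, a call such as print XML_FILE user_element($_) could stand in for the "Here" document, assuming these two helpers are defined earlier in the program.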

31–36 The final section of our Perl application just prints the closing root </users> tag to the output file and closes both files. Note that the opening and closing <users> element tags are printed outside the while loop. The code inside the while loop creates each <user> element.

31. # Close the root element
32. print XML_FILE "</users>";
33. 
34. # Close all open files
35. close CSV_FILE;
36. close XML_FILE;

The Perl application will generate the output that is shown in Listing 2.8.

Listing 2.8 Generated XML file containing the data from the input CSV file. (Filename: ch2_xmlusers.xml)

<?xml version="1.0"?>
<users>
  <user>
   <first_name>Ilya</first_name>
   <last_name>Sterin</last_name>
   <email_address>isterin@cpan.org</email_address>
  </user>
  <user>
   <first_name>Mark</first_name>
   <last_name>Riehl</last_name>
   <email_address>mark_riehl@hotmail.com</email_address>
  </user>
  <user>
   <first_name>John</first_name>
   <last_name>Smith</last_name>
   <email_address>j.smith@xmlproj.com</email_address>
  </user>
</users>

As you can see in Listing 2.8, the XML file that was generated by our program matches the required output file format. Don't worry, there are better ways to generate an XML file. We won't always be using the Perl print function to generate XML files. The goal was to illustrate the concept of generating an XML document using another file (in this case, CSV) as the input data source.

Now that we've generated our first XML file, how do we parse it? That is discussed in the next section.

Parsing an XML Document

We've accomplished the first part of the application, which showed the basics of generating a simple XML document. Now we can move on to the next task, in which we transform the ch2_xmlusers.xml file in Listing 2.8 into an HTML file so that we can display the contents of the XML document in a browser.

NOTE

This chapter demonstrates the very basic concepts of XML generation, parsing, and transformation. Quite a few XML-related Perl modules are available to deal with each of these tasks. You'll see the benefits of the XML-related modules and when to use them (as well as when not to) as we move ahead and discuss more complicated examples.

Converting the input file from CSV to XML provides us with a document that is in a structured, clear, and easy-to-read format. Because XML is self-describing (because of the element names), anyone can look at the contents of an XML document and instantly see the layout of the data as well as the relationships between different elements. To display this information to someone not familiar with XML (or just for nicer formatting into tables), we need to convert it to another format. For this example, we have decided that we want this XML document to be displayed in a browser, so we must convert the XML document to HTML.

NOTE

A majority of the web browsers available today are able to display HTML in a consistent way. However, XML support among browsers isn't consistent and varies greatly among the most popular browsers (for example, some browsers will display XML whereas others require stylesheets). At the time of this writing, the most portable way to display an XML document with a browser is to convert the XML document to HTML.

To transform the XML document from XML to HTML, we need to parse it first. Using the XML::Simple module, this task is easily accomplished with its XMLin function, which reads XML from a file, a string, or an IO object, parses it, and returns a Perl data structure containing the information that was stored in the XML. We can then traverse the data structure, retrieve the required values, and print out the HTML file.

Additional XML::Simple Information

Note that XML::Simple also provides an XMLout function. It performs the reverse of XMLin and returns XML based on the provided Perl data structure. Therefore, if XMLout is fed the same, unmodified data structure returned by XMLin, it will regenerate the content of the original XML document. One potential issue is that the returned XML will be semantically equivalent, but not necessarily identical. This is because Perl stores hash keys in no particular order. It's usually not a problem because many applications look elements up by name and have no use for the element order, but it might pose a problem for applications that require a particular sequential order of elements (for example, sorted by a particular element). For additional information on XML::Simple, please see perldoc XML::Simple.
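The round trip described here can be tried out with a short script. This is only a sketch: it assumes XML::Simple is installed, parses a one-user document from a string rather than a file, and compares the two structures by dumping them with Data::Dumper and sorted keys, which is just one convenient way to compare them.

```perl
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # make the comparison independent of hash order

my $xml = <<'EOF';
<users>
  <user>
    <first_name>Ilya</first_name>
    <last_name>Sterin</last_name>
    <email_address>isterin@cpan.org</email_address>
  </user>
</users>
EOF

my $simple = XML::Simple->new();

# XML text -> Perl data structure
my $struct = $simple->XMLin($xml, forcearray => 1, keeproot => 1);

# Perl data structure -> XML text; child element order may differ from $xml
my $generated = $simple->XMLout($struct, keeproot => 1);

# Parse the generated XML again and compare the two structures
my $reparsed = $simple->XMLin($generated, forcearray => 1, keeproot => 1);

print $generated;
print "Same data after the round trip\n"
    if Dumper($struct) eq Dumper($reparsed);
```

The generated text is not compared with the input string directly, because the order of the child elements and the whitespace may legitimately differ.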

A number of options can be set to control XML::Simple's behavior. For example, we will use the forcearray option, which forces nested elements to be represented as arrays, even if there is only one instance present. Also, we'll use the keeproot option, which causes the root element to be retained by XML::Simple (it is usually discarded).
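To see what these two settings change, it can help to parse the same small document with and without them. The sketch below assumes XML::Simple is installed and uses a made-up one-user document passed in as a string.

```perl
use strict;
use warnings;
use XML::Simple;

# A document with a single <user>, the case where forcearray matters most.
my $xml = '<users><user><first_name>Ilya</first_name></user></users>';

my $simple = XML::Simple->new();

# Defaults: the root element is discarded and a single nested element
# collapses to a plain hash or scalar.
my $plain = $simple->XMLin($xml);
print $plain->{user}->{first_name}, "\n";

# forcearray and keeproot: every level is kept, and every nested element
# is an array reference, so the same access path works for one user or many.
my $forced = $simple->XMLin($xml, forcearray => 1, keeproot => 1);
print $forced->{users}->[0]->{user}->[0]->{first_name}->[0], "\n";
```

Both print statements should output Ilya. The second access path is longer, but it does not change shape when a second <user> element is added, which is why the program in this section turns both settings on.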

Now that we have a high-level understanding of the XML::Simple module, let's take a detailed look at the program shown in Listing 2.9 that converts the input XML file to HTML.

Listing 2.9 Program that converts an input XML file to HTML. (Filename: ch2_xml_to_html_app.pl)

1.  use strict;
2.  
3.  # Load the XML::Simple module
4.  use XML::Simple;
5.  
6.  # Call XML::Simple's new constructor function to create an 
7.  # XML::Simple object and get the reference.
8.  my $simple = XML::Simple->new();
9.  
10. # Read in ch2_xmlusers.xml and return a data structure. 
11. # Note that forcearray and keeproot are both turned on
12. my $struct = $simple->XMLin("./ch2_xmlusers.xml", forcearray => 1, keeproot => 1);
13. 
14. # Open ch2_xmlusers.html file for output
15. # It will create it if it doesn't exist
16. open(HTML_FILE, ">ch2_xmlusers.html") || 
17.  die "Can't open file: $!\n";
18. 
19. # Print the initial HTML tags
20. print HTML_FILE "<html>\n<body>\n";
21. 
22. # The for loop traverses over each user.
23. # $_ points to the user
24. for (@{$struct->{users}->[0]->{user}}) {
25.  # Print each field in the user structure
26. print HTML_FILE <<"EOF";
27.  First Name: $_->{first_name}->[0]<br>
28.  Last Name: $_->{last_name}->[0]<br>
29.  Email Address: $_->{email_address}->[0]<br>
30.  ----------------------------------------<br>
31. EOF
32. }
33. 
34. # Print the ending HTML tags
35. print HTML_FILE "</body>\n</html>\n";
36. 
37. # Close ch2_xmlusers.html
38. close (HTML_FILE);

1–12 This initial block of the program starts with the standard use strict pragma. Because we're using the XML::Simple module, we must also include the use XML::Simple statement to load it. After these initialization calls, we create a new XML::Simple object and then parse the XML file by calling the XMLin function. The scalar $struct contains a reference to a data structure that holds the information from the XML file.

1.  use strict;
2.  
3.  # Load the XML::Simple module
4.  use XML::Simple;
5.  
6.  # Call XML::Simple's new constructor function to create an 
7.  # XML::Simple object and get the reference.
8.  my $simple = XML::Simple->new();
9.  
10. # Read in ch2_xmlusers.xml and return a data structure. 
11. # Note that forcearray and keeproot are both turned on
12. my $struct = $simple->XMLin("./ch2_xmlusers.xml", forcearray => 1, keeproot => 1);

14–38 The second half of the program actually performs all the work. First, we open the output HTML file and print the initial required HTML tags. Then, we loop through an array of references to data structures that contain the information from the XML file. The information from the XML file is then printed to the HTML file in the proper order.

We're taking advantage of the "Here" document construct again. This way, we don't need multiple print statements. Note that each time through the for() loop corresponds to one row in the CSV file and one <user> element in the XML document. After looping through the array, we print the required closing tags to the HTML file. The resulting HTML file is shown in Listing 2.10.

Listing 2.10 Contents of the dynamically generated HTML file. (Filename: ch2_xmlusers.html)

<html>
<body>
First Name: Ilya<br>
Last Name: Sterin<br>
Email Address: isterin@cpan.org<br>
----------------------------------------<br>
First Name: Mark<br>
Last Name: Riehl<br>
Email Address: mark_riehl@hotmail.com<br>
----------------------------------------<br>
First Name: John<br>
Last Name: Smith<br>
Email Address: j.smith@xmlproj.com<br>
----------------------------------------<br>
</body>
</html>

This HTML document is shown in a browser in Figure 2.6. As you can see, we didn't apply any formatting to this document, just the required headers and footers so that it displays properly. Examples later in the book will demonstrate how to dynamically generate HTML documents containing tables of data.

Figure 2.6 The ch2_xmlusers.html file displayed in a browser.
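The expression in the for loop of Listing 2.9 packs several dereferencing steps into one line. The sketch below unpacks them one at a time. To stay self-contained, it uses a hand-written stand-in for the structure that XMLin returns (two users only), and the variable names $root and $users are our own.

```perl
use strict;
use warnings;

# Hand-written stand-in for the structure XMLin returns with forcearray
# and keeproot turned on (two users only, to keep the sketch short).
my $struct = {
    users => [
        {
            user => [
                { first_name    => ['Ilya'],
                  last_name     => ['Sterin'],
                  email_address => ['isterin@cpan.org'] },
                { first_name    => ['Mark'],
                  last_name     => ['Riehl'],
                  email_address => ['mark_riehl@hotmail.com'] },
            ],
        },
    ],
};

my $root  = $struct->{users}->[0];   # the single <users> element
my $users = $root->{user};           # reference to the array of <user> hashes

my @lines;
for my $user (@$users) {
    # Each child element is a one-item array because of forcearray.
    my ($first, $last, $email) =
        map { $user->{$_}->[0] } qw(first_name last_name email_address);
    push @lines, "$first $last <$email>";
}

print "$_\n" for @lines;
```

Naming the intermediate references makes it easier to see which level of the structure each arrow operator is stepping into.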

Viewing the Contents of Data Structures

This section discusses a Perl module that allows us to visualize the data structure that is returned by the XML::Simple function XMLin. All we need is a simple script and the Perl Data::Dumper module. The Perl Data::Dumper module is very useful in situations like this, when we want to verify that we understand a particular data structure. It prints out the contents of a particular data structure in such a way that it enables us to see the hierarchy and relationships inside the data structure.

What we're going to do is follow the same steps of the last example: instantiate an XML::Simple object and then pass it the name of the input XML document (ch2_xmlusers.xml). The XML document will be parsed by the XML::Simple module and stored in a Perl data structure. At this point, we'll take a look at the contents of the Perl data structure by using the Perl Data::Dumper module. The Perl application that performs these steps is shown in Listing 2.11.

NOTE

For additional information on the Perl Data::Dumper module, look at perldoc Data::Dumper.

Listing 2.11 Perl application that uses the Data::Dumper module to visualize a complex data structure. (Filename: ch2_data_dumper_app.pl)

1.  use strict;
2.  use XML::Simple;
3.  use Data::Dumper;
4.  
5.  my $simple = XML::Simple->new();
6.  my $struct = $simple->XMLin("./ch2_xmlusers.xml", forcearray => 1, keeproot => 1);
7.  
8.  # Use Data::Dumper Dumper function to return a nicely 
9.  # formatted stringified structure
10. print Dumper($struct);

1–10 This program is basically the same as the last example, with a few exceptions. Note that we need the use Data::Dumper statement to load the Data::Dumper module. After creating a new XML::Simple object, we parse the XML document using the XMLin function, which returns a reference to a Perl data structure containing the parsed XML document. We store that reference in $struct.

All we need to do is pass the Perl data structure to the Dumper function that is provided by the Perl Data::Dumper module. That's all there is to it. The output is a nicely formatted report that can be used to study the data structure. The output from the Data::Dumper module is shown in Listing 2.12. Notice that the output from Data::Dumper is a mirror of the input XML file.

Listing 2.12 Output from the Data::Dumper module showing the hierarchy of the parsed XML file from XML::Simple. (Filename: ch2_data_dumper_output.txt)

$VAR1 = {
'users' => [
    {
     'user' => [
       {
        'first_name' => [
                 'Ilya'
                ],
        'email_address' => [
                  'isterin@cpan.org'
                 ],
        'last_name' => [
                'Sterin'
               ]
       },
       {
        'first_name' => [
                 'Mark'
                ],
        'email_address' => [
                  'mark_riehl@hotmail.com'
                 ],
        'last_name' => [
                'Riehl'
               ]
       },
       {
        'first_name' => [
                 'John'
                ],
        'email_address' => [
                  'j.smith@xmlproj.com'
                 ],
        'last_name' => [
                'Smith'
               ]
       }
          ]
    }
       ]
    };

As you can see, the Perl data structure follows the same structure as the original XML document (which we would expect). We recommend using the Data::Dumper module often. It is useful in situations where you need to access data that is stored in a complex Perl data structure. By looking at the output from the Perl Data::Dumper module, it will be easier to write the code to access different parts of the data structure and extract the desired information.
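Because hash keys come back in no particular order, two runs of a dump can list the same keys differently. Data::Dumper has configuration variables that make the output easier to compare between runs. The sketch below uses two of them, $Data::Dumper::Sortkeys and $Data::Dumper::Indent, on a small hand-written record. The choice of settings is only a suggestion.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # print hash keys in sorted order
$Data::Dumper::Indent   = 1;   # fixed indentation instead of column alignment

# Hand-written record shaped like one <user> entry from the parsed document.
my $user = {
    first_name    => ['John'],
    last_name     => ['Smith'],
    email_address => ['j.smith@xmlproj.com'],
};

my $dump = Dumper($user);
print $dump;
```

With Sortkeys turned on, email_address is listed before first_name and last_name on every run, so dumps taken at different times can be compared line by line.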
