This chapter is from the book

The Problem: Generating and Parsing an XML Document

The two most common XML programming tasks are XML document generation and XML document parsing. In this section, we will present two applications that demonstrate how to perform these tasks. Both of these tasks are commonly performed during XML processing and form the basis for more advanced applications throughout the book.

The first program reads a CSV file that contains user information and transforms it into XML. The second program parses the XML file that was generated by the first program, retrieves the user information, and transforms it into HTML for display in a web browser.

Generating an XML Document

Let's assume you are given the task of converting the information in an electronic mailing list from CSV to XML because a new email application supports only XML input. The fields in the CSV file will be formatted as follows, with each user record on a separate line:

First Name, Last Name, Email Address

Simple enough? That's why we decided to use the Perl XML::Simple module. It's designed especially for simple tasks, although it can be used for more complex tasks. The rest of the book focuses on other Perl XML modules that are better suited for more difficult tasks. After you become familiar with the other XML modules, you'll be able to make a sound judgment and pick the right tool for the task at hand.

NOTE

Additional information about the XML::Simple Perl module can be found by typing perldoc XML::Simple. Also, XML::Simple will be discussed in greater detail in Chapter 4, "Tree-Based Parser Modules."

Now that you have an understanding of the program's requirements, you can really get your hands dirty. The CSV is a simple file with only three fields: first name, last name, and email address. The input CSV file is shown in Listing 2.6.

Listing 2.6 Input CSV file containing the source data for the XML document. (Filename: ch2_user_email.csv)

Ilya,Sterin,isterin@cpan.org
Mark,Riehl,mark_riehl@hotmail.com
John,Smith,j.smith@xmlproj.com

Granted, this is a very small input file because it only has three records. However, the techniques used in this example can be applied to input data files with 3,000 or 30,000,000 records.

Now that we have the format of the input information, let's take a quick look at the required format of the output XML file. Let's assume that the output file is required to have the following format:

<users>
  <user>
   <first_name> </first_name>
   <last_name> </last_name>
   <email_address> </email_address>
  </user>
  ...
</users>

Note that we have a <users> element at the root of the document that is made up of multiple <user> elements. Each <user> element has one <first_name>, <last_name>, and <email_address> element.

The data for each user is stored in a predefined order: <first_name>, <last_name>, <email_address>. Let's take a minute and list the steps that we would need to perform if we were doing this conversion by hand.

  1. Read each line of the input file.

  2. Parse the line into columns based on location of the delimiters.

  3. Print each column surrounded by the proper start and end tags.

Because this is such a simple example, this is exactly how our program will work. Listing 2.7 shows the Perl program used to parse the CSV file and generate the desired output XML document. This example doesn't use any XML modules; it just parses the input file and then creates an XML document using the Perl print function. The example is explained in detail in the following sections.

Listing 2.7 Perl application that converts the CSV input file to XML. (Filename: ch2_csv_to_xml_app.pl)

1.  use strict;
2.  
3.  # Open the ch2_xmlusers.csv file for input
4.  open(CSV_FILE, "ch2_xmlusers.csv") || 
5.   die "Can't open file: $!";
6.  
7.  # Open the ch2_xmlusers.xml file for output
8.  open(XML_FILE, ">ch2_xmlusers.xml") || 
9.   die "Can't open file: $!";
10. 
11. # Print the initial XML header and the root element
12. print XML_FILE "<?xml version=\"1.0\"?>\n";
13. print XML_FILE "<users>\n";
14. 
15. # The while loop traverses each line in ch2_xmlusers.csv
16. while(<CSV_FILE>) {
17.   chomp; # Delete the new line char for each line
18. 
19.   # Split each field, on the comma delimiter, into an array
20.   my @fields = split(/,/, $_);
21. 
22. print XML_FILE<<"EOF";
23.   <user>
24.     <first_name>$fields[0]</first_name>
25.     <last_name>$fields[1]</last_name>
26.     <email_address>$fields[2]</email_address>
27.   </user>
28. EOF
29.  }
30. 
31. # Close the root element
32. print XML_FILE "</users>";
33. 
34. # Close all open files
35. close CSV_FILE;
36. close XML_FILE;

1–13 The opening section of the program starts with the use strict pragma, which is good practice in all Perl programs. The file ch2_xmlusers.csv is opened for input and ch2_xmlusers.xml is opened for output. After both files have been opened, we print the XML declaration and the opening tag for the <users> element to the output file by using the XML_FILE filehandle.

1.  use strict;
2.  
3.  # Open the ch2_xmlusers.csv file for input
4.  open(CSV_FILE, "ch2_xmlusers.csv") || 
5.   die "Can't open file: $!";
6.  
7.  # Open the ch2_xmlusers.xml file for output
8.  open(XML_FILE, ">ch2_xmlusers.xml") || 
9.   die "Can't open file: $!";
10. 
11. # Print the initial XML header and the root element
12. print XML_FILE "<?xml version=\"1.0\"?>\n";
13. print XML_FILE "<users>\n";

NOTE

In your program, use strict enforces error checking, and use diagnostics expands the diagnostics if the Perl compiler encounters an error or any questionable constructs.

15–29 This portion of the program performs a majority of the work in this example. The input file (ch2_xmlusers.csv) is read line by line within the while loop, and each line is broken up into columns by the delimiter, which in our case is a comma. The columns are then printed to ch2_xmlusers.xml, surrounded by the appropriate tags. To print this example, we're using a Perl "Here" document, which allows us to print a block of text without having to use a print statement for each line. In this case, we're printing everything between print XML_FILE<<"EOF" and EOF. Note that variables are being evaluated inside this block. For additional information on Perl "Here" documents, please see perldoc perldata.

Because we know the order of the input columns (first_name, last_name, email_address), we can print the input contents directly to the output XML file. This procedure is repeated for each line (user) in the ch2_xmlusers.csv file, and a <user> element is created each time through the while loop.

15. # The while loop traverses each line in ch2_xmlusers.csv
16. while(<CSV_FILE>) {
17.   chomp; # Delete the new line char for each line
18. 
19.   # Split each field, on the comma delimiter, into an array
20.   my @fields = split(/,/, $_);
21. 
22. print XML_FILE<<"EOF";
23.   <user>
24.     <first_name>$fields[0]</first_name>
25.     <last_name>$fields[1]</last_name>
26.     <email_address>$fields[2]</email_address>
27.   </user>
28. EOF
29.  }
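
The "Here" document behavior described above can be illustrated with a minimal sketch, assuming a hard-coded record in place of a line read from the CSV file. It contrasts a double-quoted terminator, which interpolates variables as the listing relies on, with a single-quoted terminator, which does not. The variable names $interpolated and $literal are illustrative only and do not appear in the original program.

```perl
use strict;
use warnings;

# Hypothetical record, standing in for one parsed CSV line
my @fields = ('Ilya', 'Sterin', 'isterin@cpan.org');

# Double-quoted terminator: variables inside the block are interpolated
my $interpolated = <<"EOF";
<first_name>$fields[0]</first_name>
EOF

# Single-quoted terminator: the block is taken literally
my $literal = <<'EOF';
<first_name>$fields[0]</first_name>
EOF

print $interpolated;  # prints <first_name>Ilya</first_name>
print $literal;       # prints <first_name>$fields[0]</first_name>
```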

31–36 The final section of our Perl application prints the closing </users> root tag to the output file and then closes both files. Note that the opening and closing <users> tags are printed outside the while loop; the code inside the while loop creates each <user> element.

31. # Close the root element
32. print XML_FILE "</users>";
33. 
34. # Close all open files
35. close CSV_FILE;
36. close XML_FILE;

The Perl application will generate the output that is shown in Listing 2.8.

Listing 2.8 Generated XML file containing the data from the input CSV file. (Filename: ch2_xmlusers.xml)

<?xml version="1.0"?>
<users>
  <user>
   <first_name>Ilya</first_name>
   <last_name>Sterin</last_name>
   <email_address>isterin@cpan.org</email_address>
  </user>
  <user>
   <first_name>Mark</first_name>
   <last_name>Riehl</last_name>
   <email_address>mark_riehl@hotmail.com</email_address>
  </user>
  <user>
   <first_name>John</first_name>
   <last_name>Smith</last_name>
   <email_address>j.smith@xmlproj.com</email_address>
  </user>
</users>

As you can see in Listing 2.8, the XML file that was generated by our program matches the required output file format. Don't worry, there are better ways to generate an XML file. We won't always be using the Perl print function to generate XML files. The goal was to illustrate the concept of generating an XML document using another file (in this case, CSV) as the input data source.
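
One caveat when generating XML with print is that the characters &, <, and > have special meaning in XML, so field values that contain them need to be escaped before they are written. The sample data in this chapter contains none of these characters, so the listing works as is. The following is only a sketch of one way to handle it; escape_xml is a hypothetical helper that is not part of the original program, and the input line is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original listing): replace the
# characters that have special meaning in XML with entity references.
sub escape_xml {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # must run first so later entities are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Made-up CSV line containing characters that need escaping
my $line   = 'Tom & Jerry,Smith <Jr>,tj@example.com';
my @fields = map { escape_xml($_) } split /,/, $line;

print "<first_name>$fields[0]</first_name>\n";  # <first_name>Tom &amp; Jerry</first_name>
print "<last_name>$fields[1]</last_name>\n";    # <last_name>Smith &lt;Jr&gt;</last_name>
```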

Now that we've generated our first XML file, how do we parse it? That is discussed in the next section.

Parsing an XML Document

We've accomplished the first part of the application that showed the basics of generating a simple XML document. Now we can move on to the next task in which we transform the ch2_xmlusers.xml file in Listing 2.8 into an HTML file, so we can display the contents of the XML document in a browser.

NOTE

This chapter demonstrates the very basic concepts of XML generation, parsing, and transformation. Quite a few XML-related Perl modules are available to deal with each of these tasks. You'll see the benefits of the XML-related modules and when to use them (as well as when not to) as we move ahead and discuss more complicated examples.

Converting the input file from CSV to XML provides us with a document that is in a structured, clear, and easy-to-read format. Because XML is self-describing (because of the element names), anyone can look at the contents of an XML document and instantly see the layout of the data as well as the relationships between different elements. To display this information to someone not familiar with XML (or just for nicer formatting into tables), we need to convert it to another format. For this example, we have decided that we want this XML document to be displayed in a browser, so we must convert the XML document to HTML.

NOTE

Most of the web browsers available today are able to display HTML in a consistent way. However, XML support among browsers isn't consistent and varies greatly even among the most popular browsers (for example, some browsers will display XML whereas others require stylesheets). At the time of this writing, the most portable way to display an XML document with a browser is to convert the XML document to HTML.

To transform the XML document from XML to HTML, we need to parse it first. Using the XML::Simple module, this task is easily accomplished with its XMLin function, which reads XML from a file, a string, or an IO object, parses it, and returns a Perl data structure containing the information that was stored in XML. We can then traverse the data structure, retrieve the required values, and print out the HTML file.

Additional XML::Simple Information

Note that XML::Simple also provides an XMLout function. It performs the reverse of XMLin and returns XML based on the provided Perl data structure. Therefore, if XMLout is fed the same, unmodified data structure returned by XMLin, it will generate an XML document equivalent to the original. One potential issue is that the returned XML will be semantically equivalent, but not necessarily identical. This is because Perl hashes do not preserve the order of their keys. That is usually not a problem, because many applications look up data by element name and have no use for the element order, but it might pose a problem to applications that require a particular sequential order of elements (for example, sorted by a particular element). For additional information on XML::Simple, please see perldoc XML::Simple.

A number of options can be set to control XML::Simple's behavior. For example, we will use the forcearray option, which forces nested elements to be represented as arrays, even if there is only one instance present. Also, we'll use the keeproot option, which causes the root element to be retained by XML::Simple (it is discarded by default).
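
To make the effect of these two settings concrete, here is a small sketch that parses the same one-user document twice, once with the defaults and once with forcearray and keeproot turned on. The inline XML string and the variable names $plain and $forced are assumptions made for illustration; the sketch also assumes that XML::Simple is installed.

```perl
use strict;
use warnings;
use XML::Simple;

# A document with a single <user>, passed to XMLin as a string
my $xml = '<users><user><first_name>Ilya</first_name></user></users>';

my $simple = XML::Simple->new();

# Defaults: the root element is dropped and the lone <user> becomes a hash
my $plain = $simple->XMLin($xml);
print ref($plain->{user}), "\n";                          # HASH
print $plain->{user}{first_name}, "\n";                   # Ilya

# Same settings as the chapter's program: root kept, every element an array
my $forced = $simple->XMLin($xml, forcearray => 1, keeproot => 1);
print ref($forced->{users}[0]{user}), "\n";               # ARRAY
print $forced->{users}[0]{user}[0]{first_name}[0], "\n";  # Ilya
```

With forcearray turned on, the code that walks the structure does not need a special case for files that happen to contain only one <user> element.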

Now that we have a high-level understanding of the XML::Simple module, let's take a detailed look at the program shown in Listing 2.9 that converts the input XML file to HTML.

Listing 2.9 Program that converts an input XML file to HTML. (Filename: ch2_xml_to_html_app.pl)

1.  use strict;
2.  
3.  # Load the XML::Simple module
4.  use XML::Simple;
5.  
6.  # Call XML::Simple's new constructor function to create an 
7.  # XML::Simple object and get the reference.
8.  my $simple = XML::Simple->new();
9.  
10. # Read in ch2_xmlusers.xml and return a data structure. 
11. # Note that forcearray and keeproot are both turned on
12. my $struct = $simple->XMLin("./ch2_xmlusers.xml", forcearray => 1, keeproot => 1);
13. 
14. # Open ch2_xmlusers.html file for output
15. # It will create it if it doesn't exist
16. open(HTML_FILE, ">ch2_xmlusers.html") || 
17.  die "Can't open file: $!\n";
18. 
19. # Print the initial HTML tags
20. print HTML_FILE "<html>\n<body>\n";
21. 
22. # The for loop traverses over each user.
23. # $_ points to the user
24. for (@{$struct->{users}->[0]->{user}}) {
25.  # Print each field in the user structure
26. print HTML_FILE <<"EOF";
27.  First Name: $_->{first_name}->[0]<br>
28.  Last Name: $_->{last_name}->[0]<br>
29.  Email Address: $_->{email_address}->[0]<br>
30.  ----------------------------------------<br>
31. EOF
32. }
33. 
34. # Print the ending HTML tags
35. print HTML_FILE "</body>\n</html>\n";
36. 
37. # Close ch2_xmlusers.html
38. close (HTML_FILE);

1–12 This initial block of the program starts with the standard use strict pragma. Because we're using the XML::Simple module, we must also include the use XML::Simple statement to load it. After these initialization calls, we create a new XML::Simple object and then parse the XML file by calling the XMLin function. The scalar $struct contains a reference to a data structure that contains the information from the XML file.

1.  use strict;
2.  
3.  # Load the XML::Simple module
4.  use XML::Simple;
5.  
6.  # Call XML::Simple's new constructor function to create an 
7.  # XML::Simple object and get the reference.
8.  my $simple = XML::Simple->new();
9.  
10. # Read in ch2_xmlusers.xml and return a data structure. 
11. # Note that forcearray and keeproot are both turned on
12. my $struct = $simple->XMLin("./ch2_xmlusers.xml", forcearray => 1, keeproot => 1);

14–38 The second half of the program actually performs all the work. First, we open the output HTML file and print the initial required HTML tags. Then, we loop through an array of references to data structures that contain the information from the XML file. The information from the XML file is then printed to the HTML file in the proper order.

We're taking advantage of the "Here" document construct again. This way, we don't need multiple print statements. Note that each pass through the for loop corresponds to one row in the CSV file and one <user> element in the XML document. After looping through the array, we print the required closing tags to the HTML file. The resulting HTML file is shown in Listing 2.10.

Listing 2.10 Contents of the dynamically generated HTML file. (Filename: ch2_xmlusers.html)

<html>
<body>
First Name: Ilya<br>
Last Name: Sterin<br>
Email Address: isterin@cpan.org<br>
----------------------------------------<br>
First Name: Mark<br>
Last Name: Riehl<br>
Email Address: mark_riehl@hotmail.com<br>
----------------------------------------<br>
First Name: John<br>
Last Name: Smith<br>
Email Address: j.smith@xmlproj.com<br>
----------------------------------------<br>
</body>
</html>

This HTML document is shown in a browser in Figure 2.6. As you can see, we didn't apply any formatting to this document, just the required headers and footers so that it displays properly. Examples later in the book will demonstrate how to dynamically generate HTML documents containing tables of data.

Figure 2.6 The ch2_xmlusers.html file displayed in a browser.

Viewing the Contents of Data Structures

This section discusses a Perl module that allows us to visualize the data structure that is returned by the XML::Simple function XMLin. All we need is a simple script and the Perl Data::Dumper module. The Perl Data::Dumper module is very useful in situations like this, when we want to verify that we understand a particular data structure. It prints out the contents of a particular data structure in such a way that it enables us to see the hierarchy and relationships inside the data structure.

What we're going to do is follow the same steps of the last example: instantiate an XML::Simple object and then pass it the name of the input XML document (ch2_xmlusers.xml). The XML document will be parsed by the XML::Simple module and stored in a Perl data structure. At this point, we'll take a look at the contents of the Perl data structure by using the Perl Data::Dumper module. The Perl application that performs these steps is shown in Listing 2.11.

NOTE

For additional information on the Perl Data::Dumper module, look at perldoc Data::Dumper.

Listing 2.11 Perl application that uses the Data::Dumper module to visualize a complex data structure. (Filename: ch2_data_dumper_app.pl)

1.  use strict;
2.  use XML::Simple;
3.  use Data::Dumper;
4.  
5.  my $simple = XML::Simple->new();
6.  my $struct = $simple->XMLin("./ch2_xmlusers.xml", forcearray => 1, keeproot => 1);
7.  
8.  # Use Data::Dumper Dumper function to return a nicely 
9.  # formatted stringified structure
10. print Dumper($struct);

1–10 This program is basically the same as the last example, with a few exceptions. Note that we need the use Data::Dumper statement to load the Data::Dumper module. After creating a new XML::Simple object, we parse the XML document using the XMLin function, which returns a reference to a Perl data structure containing the parsed XML document. We store that reference in $struct.

All we need to do is pass the Perl data structure to the Dumper function that is provided by the Perl Data::Dumper module. That's all there is to it. The output is a nicely formatted report that can be used to study the data structure. The output from the Data::Dumper module is shown in Listing 2.12. Notice that the output from Data::Dumper is a mirror of the input XML file.

Listing 2.12 Output from the Data::Dumper module showing the hierarchy of the parsed XML file from XML::Simple. (Filename: ch2_data_dumper_output.txt)

$VAR1 = {
'users' => [
    {
     'user' => [
       {
        'first_name' => [
                 'Ilya'
                ],
        'email_address' => [
                  'isterin@cpan.org'
                 ],
        'last_name' => [
                'Sterin'
               ]
       },
       {
        'first_name' => [
                 'Mark'
                ],
        'email_address' => [
                  'mark_riehl@hotmail.com'
                 ],
        'last_name' => [
                'Riehl'
               ]
       },
       {
        'first_name' => [
                 'John'
                ],
        'email_address' => [
                  'j.smith@xmlproj.com'
                 ],
        'last_name' => [
                'Smith'
               ]
       }
          ]
    }
       ]
    };
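
The keys in the dump above appear in hash order (first_name, email_address, last_name) rather than document order. If a stable, repeatable dump is wanted, Data::Dumper's Sortkeys setting can help. The following is a sketch only, using a hand-built stand-in for a single <user> entry instead of the parsed file; the variable name $user is an assumption for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hand-built stand-in for one <user> entry from the structure above
my $user = {
    first_name    => ['Ilya'],
    last_name     => ['Sterin'],
    email_address => ['isterin@cpan.org'],
};

# Sort hash keys and use a simpler indent style so repeated runs
# produce the same text
local $Data::Dumper::Sortkeys = 1;
local $Data::Dumper::Indent   = 1;

# Keys now print alphabetically: email_address, first_name, last_name
print Dumper($user);
```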

As you can see, the Perl data structure follows the same structure as the original XML document (which we would expect). We recommend using the Data::Dumper module often. It is useful in situations where you need to access data that is stored in a complex Perl data structure. Looking at the output from Data::Dumper makes it easier to write the code that accesses different parts of the data structure and extracts the desired information.
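
As a brief sketch of how the dump translates into access code, the example below walks a hand-built copy of part of the structure from Listing 2.12 (two users only, so that it runs without the XML file). The variable names $users and @last_names are illustrative assumptions.

```perl
use strict;
use warnings;

# Hand-built copy of part of the structure in Listing 2.12 (two users only)
my $struct = {
    users => [ {
        user => [
            { first_name    => ['Ilya'],
              last_name     => ['Sterin'],
              email_address => ['isterin@cpan.org'] },
            { first_name    => ['Mark'],
              last_name     => ['Riehl'],
              email_address => ['mark_riehl@hotmail.com'] },
        ],
    } ],
};

# Follow the nesting shown in the dump: hash key, array index, hash key, ...
my $users = $struct->{users}[0]{user};
print scalar(@$users), "\n";                 # 2
print $users->[1]{email_address}[0], "\n";   # mark_riehl@hotmail.com

# Collect the last name of every user
my @last_names = map { $_->{last_name}[0] } @$users;
print "@last_names\n";                       # Sterin Riehl
```

Each pair of braces in the dump corresponds to a hash lookup and each pair of square brackets to an array index, which is why every leaf value ends in [0].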
