xMail: E-mail as XML

Take the first steps toward converting your e-mail to XML for more effective processing, archiving, and searching. This sample chapter walks through converting mailboxes stored in the Unix mbox and Eudora formats.
This chapter is from the book

E-mail is a good example of a structured text format that can usefully be converted to XML for processing, archiving, and searching. In this chapter, we develop xMail, a Python application that converts e-mail to XML.

It is an unfortunate fact of life that e-mail systems differ in the way they store e-mail. Some store it in proprietary binary formats. The two e-mail notations we deal with in this chapter (Unix mbox and Eudora) are, thankfully, text based. On Linux, e-mail messages are stored so that each message begins with a line starting with From followed by a space (not to be confused with the From: header). If that sequence of characters happens to occur at the start of a line within the body of a message, it is escaped by being prefixed with a > character. The Eudora e-mail client begins each message with a sentinel string of the form From ???@???.

Although there are differences in the way Linux and Eudora store e-mail messages, there is a lot of commonality we can exploit in the conversion code. In particular, we can take advantage of the Python standard rfc822 module to do most of the work in parsing e-mail headers.
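As a rough illustration of that commonality, the sketch below splits mailbox text into individual messages using a separator test that is passed in as a function, so the same loop serves both notations. It is written for Python 3 rather than the Python 2 used in this chapter, and the names split_messages, is_mbox_separator, is_eudora_separator, and unquote_from are hypothetical helpers, not part of xMail. The sample mailbox, including its dates, is made up for the example.

```python
def split_messages(text, is_separator):
    """Split mailbox text into a list of raw message strings.

    is_separator is a function that returns True for a line that
    starts a new message. The separator line itself is discarded.
    """
    messages = []
    current = None  # stays None until the first separator is seen
    for line in text.splitlines(keepends=True):
        if is_separator(line):
            if current is not None:
                messages.append("".join(current))
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append("".join(current))
    return messages


def is_mbox_separator(line):
    # Assumed rule: a line beginning with "From " (note the space).
    return line.startswith("From ")


def is_eudora_separator(line):
    # Assumed rule: a line beginning with Eudora's sentinel string.
    return line.startswith("From ???@???")


def unquote_from(message):
    # Undo the mbox escaping: ">From " at the start of a line
    # becomes "From " again.
    lines = message.splitlines(keepends=True)
    return "".join(l[1:] if l.startswith(">From ") else l for l in lines)


# Made-up sample mailbox in mbox notation, for illustration only.
sample = (
    "From paul@digitome.com Fri Feb  5 11:20:00 1999\n"
    "To: sean@digitome.com\n"
    "Subject: First\n"
    "\n"
    ">From the top, this line was escaped.\n"
    "From sean@digitome.com Fri Feb  5 11:25:00 1999\n"
    "To: paul@digitome.com\n"
    "Subject: Second\n"
    "\n"
    "Short reply.\n"
)

messages = [unquote_from(m) for m in split_messages(sample, is_mbox_separator)]
print(len(messages))
```

Each string in messages starts at the first header line, so it can be handed directly to a header parser. Passing is_eudora_separator instead of is_mbox_separator is the only change needed to split a Eudora mailbox under the assumptions above.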

14.1 | The rfc822 Module

The term rfc822 refers to the standard for the header information used in Internet e-mail messages. The full specification can be found at http://www.ietf.org/rfcs/rfc0822.txt. The bulk of rfc822 is concerned with specifying the syntax for the headers that accompany the body of e-mail messages; headers such as from, to, subject, and so on. Python's rfc822 module takes a file object and puts as much of the content as it can parse into headers, according to the rules of rfc822.

The following program illustrates how the rfc822 module is used.

CD-ROM reference=14001.txt
Simple program to illustrate use of Python's rfc822 module


import rfc822, StringIO

# A sample message: headers, a blank line, then the body
email = """To: sean@digitome.com
From: paul@digitome.com
Subject: Parsing e-mail headers
Reply-To: Majordomo@allrealgood.com
Message-Id: <199902051120.EAA14648@digitome.com>


Can Python parse this?
"""

# Wrap the string in a file-like object, as rfc822.Message expects
fo = StringIO.StringIO(email)
# Parsing consumes the headers and leaves fo positioned at the body
m = rfc822.Message(fo)
print "<headers>"
for (k, v) in m.items():
    print "<%s>%s</%s>" % (k, v, k)
print "</headers>"
print "<body>"
print fo.read()
print "</body>"

The result of running this program is shown below.

CD-ROM reference=14002.txt
<subject>Parsing e-mail headers</subject>

Can Python parse this?
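The rfc822 module exists only in Python 2; it was removed in Python 3, where the standard email package takes over its role. For readers on Python 3, here is a minimal sketch of the same program under that substitution. It also escapes header values and body text with xml.sax.saxutils.escape, because a value such as the Message-Id above contains angle brackets that would otherwise be mistaken for markup. The function name email_to_xml is a hypothetical helper, not part of xMail.

```python
from email import message_from_string
from xml.sax.saxutils import escape

raw = """To: sean@digitome.com
From: paul@digitome.com
Subject: Parsing e-mail headers
Reply-To: Majordomo@allrealgood.com
Message-Id: <199902051120.EAA14648@digitome.com>

Can Python parse this?
"""


def email_to_xml(raw_text):
    """Return the headers and body of one message as XML-style text."""
    msg = message_from_string(raw_text)
    lines = ["<headers>"]
    for name, value in msg.items():
        # Lowercase the header name to use it as the element name.
        tag = name.lower()
        # escape() replaces &, < and > with entity references.
        lines.append("<%s>%s</%s>" % (tag, escape(value), tag))
    lines.append("</headers>")
    lines.append("<body>")
    lines.append(escape(msg.get_payload()))
    lines.append("</body>")
    return "\n".join(lines)


xml_text = email_to_xml(raw)
print(xml_text)
```

Unlike rfc822.Message.items(), which returns headers in dictionary order, the email package returns them in the order they appear in the message, so the element order here follows the input.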

