xMail: E-mail as XML
- The rfc822 Module
- A Simple DTD for E-mail
- An Example of an E-mail Message in XML
- Processing a Eudora Mailbox
- Processing a Linux Mailbox
- Processing an E-mail Message by Using the rfc822 Module
- Sending E-mail by Using xMail
- Source Code for the SendxMail Application
- Source Code for the xMail Application
E-mail is a good example of a structured text format that can usefully be converted to XML for processing, archiving, and searching. In this chapter, we develop xMail, a Python application that converts e-mail to XML.
It is an unfortunate fact of life that e-mail systems differ in the way they store e-mail. Some store it in proprietary binary formats. The two e-mail notations we deal with in this chapter (Unix mbox and Eudora) are, thankfully, text based. On Linux, e-mail messages are stored in an mbox file in which each message begins with a line starting with From followed by a space. If that sequence of characters happens to occur at the start of a line within the body of a message, it is escaped by being prefixed with a > character. The Eudora e-mail client begins each message with a sentinel string of the form From ???@???.
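The storage conventions just described can be sketched in a few lines. The following is a minimal illustration, not the xMail implementation: the function name split_mbox and the sample data are hypothetical, the code is written for Python 3, and it assumes that a separator is any line starting with the word From followed by a space (which also matches the Eudora sentinel From ???@???) and that escaped body lines start with >From followed by a space.

```python
def split_mbox(text):
    """Split mbox-style text into a list of raw message strings.

    Assumption: a message starts at any line beginning with "From "
    (the word followed by a space); the separator line itself is dropped.
    """
    messages = []
    current = None  # lines of the message being collected, if any
    for line in text.splitlines():
        if line.startswith("From "):
            # Separator line: close the previous message, start a new one.
            if current is not None:
                messages.append("\n".join(current))
            current = []
        elif current is not None:
            # Undo the ">" escaping applied to body lines.
            if line.startswith(">From "):
                line = line[1:]
            current.append(line)
    if current is not None:
        messages.append("\n".join(current))
    return messages


# Hypothetical sample: one mbox-style separator, one Eudora-style sentinel.
sample = (
    "From alice@example.com Mon Jan  1 00:00:00 2001\n"
    "Subject: one\n"
    "\n"
    ">From here on, this line was escaped\n"
    "From ???@??? Mon Jan  1 00:00:00 2001\n"
    "Subject: two\n"
    "\n"
    "body two\n"
)

for number, message in enumerate(split_mbox(sample), 1):
    print("--- message %d ---" % number)
    print(message)
```

Because both notations mark a new message with a line starting with From and a space, one splitting routine can serve both in this sketch; a real converter would also need to deal with details such as line endings and attachments.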
Although there are differences in the way Linux and Eudora store e-mail messages, there is a lot of commonality we can exploit in the conversion code. In particular, we can take advantage of the Python standard rfc822 module to do most of the work in parsing e-mail headers.
14.1 | The rfc822 Module
The name rfc822 refers to RFC 822, the standard for the header information used in Internet e-mail messages. The full specification can be found at http://www.ietf.org/rfcs/rfc0822.txt. The bulk of RFC 822 is concerned with specifying the syntax for the headers that accompany the body of e-mail messages; headers such as from, to, subject, and so on. Python's rfc822 module takes a file object and parses as many of the leading lines as it can into headers, according to the rules of RFC 822. Once the headers have been read, the file object is left positioned at the start of the message body, so the body can be read from it in the usual way.
The following program illustrates how the rfc822 module is used.
CD-ROM reference=14001.txt
"""
Simple program to illustrate use of Python's rfc822 module
"""
import rfc822,StringIO

email = """To: firstname.lastname@example.org
From: email@example.com
Subject: Parsing e-mail headers
Reply-To: Majordomo@allrealgood.com
Message-Id: <199902051120.EAA14648@digitome.com>

Sean,
Can Python parse this?
regards,
Paul
"""

fo = StringIO.StringIO(email)
m = rfc822.Message(fo)
print "<headers>"
for (k,v) in m.items():
    print "<%s>%s</%s>" % (k,v,k)
print "</headers>"
print "<body>"
print fo.read()
print "</body>"
The result of running this program is shown below.
CD-ROM reference=14002.txt
<headers>
<subject>Parsing e-mail headers</subject>
<from>email@example.com</from>
<message-id><199902051120.EAA14648@digitome.com></message-id>
<reply-to>Majordomo@allrealgood.com</reply-to>
<to>firstname.lastname@example.org</to>
</headers>
<body>
Sean,
Can Python parse this?
regards,
Paul

</body>
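The rfc822 module was removed in Python 3, where the standard email package takes over its role. For readers on a current interpreter, the following is a rough Python 3 counterpart of the program above, offered as a sketch rather than as part of xMail. The function name message_to_xml is hypothetical. The sketch also escapes header values and body text with xml.sax.saxutils.escape, which the original program does not do, so that a value such as the Message-Id (which contains < and >) does not break the markup. It assumes that every header name is usable as an XML element name.

```python
import email
from xml.sax.saxutils import escape

raw = """To: firstname.lastname@example.org
From: email@example.com
Subject: Parsing e-mail headers
Reply-To: Majordomo@allrealgood.com
Message-Id: <199902051120.EAA14648@digitome.com>

Sean,
Can Python parse this?
regards,
Paul
"""


def message_to_xml(text):
    """Return a simple XML-like rendering of one plain-text message."""
    msg = email.message_from_string(text)
    lines = ["<headers>"]
    for name, value in msg.items():
        # Lowercase the header name to use it as the element name.
        # Assumption: the name is a legal XML element name.
        tag = name.lower()
        lines.append("<%s>%s</%s>" % (tag, escape(value), tag))
    lines.append("</headers>")
    lines.append("<body>")
    # For a non-multipart message, get_payload() returns the body text.
    lines.append(escape(msg.get_payload()))
    lines.append("</body>")
    return "\n".join(lines)


print(message_to_xml(raw))
```

Unlike the dictionary-based listing produced by the rfc822 program, this sketch emits the headers in the order in which they appear in the message.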