
xMail: E-mail as XML
Date: Jun 27, 2003
Sample Chapter is provided courtesy of Prentice Hall Professional.
E-mail is a good example of a structured text format that can usefully be converted to XML for processing, archiving, and searching. In this chapter, we develop xMail, a Python application to convert e-mail to XML.
It is an unfortunate fact of life that e-mail systems differ in the way they store e-mail. Some store it in proprietary binary formats. The two e-mail notations we deal with in this chapter (Unix mbox and Eudora) are, thankfully, text based. On Linux, e-mail messages are stored so that each message begins with a line starting with From followed by a space. If that sequence of characters happens to occur at the start of a line within the body of a message, it is escaped by being prefixed with a > character. The Eudora e-mail client begins each message with a sentinel string of the form From ???@???.
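The escaping rule for body lines can be sketched in a few lines of Python 3. This is only an illustration of the convention described above, not code from xMail; the function names escape_from_lines and unescape_from_lines are hypothetical, and the simple scheme shown cannot tell an escaped line from a body line that genuinely began with >From.

```python
def escape_from_lines(body):
    """Prefix every body line that starts with 'From ' with a '>' character."""
    out = []
    for line in body.splitlines(keepends=True):
        if line.startswith("From "):
            line = ">" + line
        out.append(line)
    return "".join(out)


def unescape_from_lines(body):
    """Strip one leading '>' from every line that starts with '>From '."""
    out = []
    for line in body.splitlines(keepends=True):
        if line.startswith(">From "):
            line = line[1:]
        out.append(line)
    return "".join(out)


body = "Hi\nFrom here on it gets odd\nbye\n"
escaped = escape_from_lines(body)
print(escaped)
```

Because only lines that begin with the five characters "From " are touched, a header such as From: (with a colon) is never affected.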
Although there are differences in the way Linux and Eudora store e-mail messages, there is a lot of commonality we can exploit in the conversion code. In particular, we can take advantage of the Python standard rfc822 module to do most of the work in parsing e-mail headers.
14.1 | The rfc822 Module
The term rfc822 refers to the standard for the header information used in Internet e-mail messages. The full specification can be found at http://www.ietf.org/rfcs/rfc0822.txt. The bulk of rfc822 is concerned with specifying the syntax for the headers that accompany the body of e-mail messages; headers such as from, to, subject, and so on. Python's rfc822 module takes a file object and puts as much of the content as it can parse into headers, according to the rules of rfc822.
The following program illustrates how the rfc822 module is used.
CD-ROM reference=14001.txt
"""
Simple program to illustrate use of Python's rfc822 module
"""
import rfc822,StringIO

email = """To: sean@digitome.com
From: paul@digitome.com
Subject: Parsing e-mail headers
Reply-To: Majordomo@allrealgood.com
Message-Id: <199902051120.EAA14648@digitome.com>

Sean,
Can Python parse this?
regards,
Paul
"""

# Wrap the string in a file object for the rfc822 module.
fo = StringIO.StringIO(email)
m = rfc822.Message(fo)
print "<headers>"
for (k,v) in m.items():
    print "<%s>%s</%s>" % (k,v,k)
print "</headers>"
# The file object is now positioned at the start of the body.
print "<body>"
print fo.read()
print "</body>"
The result of running this program is shown below.
CD-ROM reference=14002.txt
<headers>
<subject>Parsing e-mail headers</subject>
<from>paul@digitome.com</from>
<message-id><199902051120.EAA14648@digitome.com></message-id>
<reply-to>Majordomo@allrealgood.com</reply-to>
<to>sean@digitome.com</to>
</headers>
<body>
Sean,
Can Python parse this?
regards,
Paul
</body>
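The rfc822 module exists only in Python 2. For readers on Python 3, the same header and body split can be sketched with the standard email package. This is a substitute technique rather than the chapter's own code; the helper name headers_and_body is hypothetical. Unlike rfc822, the email package preserves the original case and order of the header names.

```python
from email import message_from_string

RAW = """To: sean@digitome.com
From: paul@digitome.com
Subject: Parsing e-mail headers

Sean,
Can Python parse this?
"""


def headers_and_body(raw):
    """Return (list of (name, value) header pairs, body text) for a simple message."""
    msg = message_from_string(raw)
    return msg.items(), msg.get_payload()


headers, body = headers_and_body(RAW)
for name, value in headers:
    print("%s = %s" % (name, value))
print(body)
```

The sketch assumes a plain, single-part message; for multipart mail, get_payload returns a list of parts instead of a string.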
14.2 | A Simple DTD for E-mail
Before going any further with parsing Linux or Eudora mailboxes, we need to settle on an XML representation of a mailbox. We will use the following simple DTD.
CD-ROM reference=14003.txt
<!--
XMail = A simple DTD for a collection of e-mail messages

An xMail file consists of zero or more message elements.
A message has a headers element that contains fields
such as from, to, subject, and so on. The body element houses
the text of the e-mail
-->
<!ELEMENT xmail (message)*>
<!ELEMENT message (headers,body)>
<!ELEMENT headers (field)+>
<!ELEMENT field (name,value)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT value (#PCDATA)>
<!ELEMENT body (#PCDATA)>
14.3 | An Example of an E-mail Message in XML
Here is an example of an e-mail message that conforms to the xMail DTD.
CD-ROM reference=14004.txt
<?xml version="1.0"?>
<!DOCTYPE xmail SYSTEM "xmail.dtd">
<xmail>
<message>
<headers>
<field>
<name>subject</name>
<value>Greetings</value>
</field>
</headers>
<body>
Hello World
</body>
</message>
</xmail>
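As an aside, a document with this shape can also be built with an XML library instead of string formatting, which takes care of escaping automatically. The following Python 3 sketch uses the standard xml.etree.ElementTree module; it is not part of xMail, and the function name build_xmail and its argument layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_xmail(messages):
    """Build an xmail element tree.

    messages is a list of (headers, body) pairs, where headers is a
    list of (name, value) pairs and body is a string.
    """
    root = ET.Element("xmail")
    for headers, body in messages:
        message = ET.SubElement(root, "message")
        headers_el = ET.SubElement(message, "headers")
        for name, value in headers:
            field = ET.SubElement(headers_el, "field")
            ET.SubElement(field, "name").text = name
            ET.SubElement(field, "value").text = value
        ET.SubElement(message, "body").text = body
    return root


doc = build_xmail([([("subject", "Greetings")], "Hello World")])
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

The serialized text carries no XML declaration or DOCTYPE line; those would have to be written separately if the file is to be validated against xmail.dtd.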
14.4 | Processing a Eudora Mailbox
The following code fragment shows the control structure required to process a Eudora mailbox into individual e-mail messages. The processing of each message has been delegated to the ProcessMessage function. This function is used by both the Linux and Eudora converters. Note how the sentinel string "From ???@???" is used to chop the contents of the mailbox into individual messages.
CD-ROM reference=14005.txt
def DoEudoraMailBox(f):
    # f is a file object.
    # Chop the contents of a Eudora mailbox
    # into individual messages for processing
    # by the ProcessMessage subroutine.
    # out is the XML output file object; the full program in
    # section 14.9 opens it and consumes the first sentinel line
    # before entering this loop.
    Message = []
    L = f.readline()
    while L:
        if string.find(L,"From ???@???")!=-1:
            # Full message accumulated in the Message
            # list, so process it to XML.
            ProcessMessage(Message,out)
            Message = []
        else:
            # Accumulate e-mail contents line by line in
            # Message list.
            Message.append(L)
        L = f.readline()
    if Message:
        # Last message in the mailbox
        ProcessMessage(Message,out)
14.5 | Processing a Linux Mailbox
To process a Linux-style mailbox into individual messages, a different control structure is required. Note, however, that the processing of each individual e-mail is handled by ProcessMessage, which is common to both Linux and Eudora converters.
CD-ROM reference=14006.txt
def DoLinuxMailBox(f):
    # f is a file object.
    # out is the XML output file object, opened by the full
    # program in section 14.9.
    L = f.readline()[:-1]
    if string.find(L,"From ")!=0:
        print 'Expected mailbox to start with "From "'
        return
    Message = []
    L = f.readline()
    while L:
        if string.find(L,"From ")==0:
            # Full message accumulated in the Message
            # list, so process it to XML
            ProcessMessage(Message,out)
            Message = []
        else:
            # Accumulate e-mail contents line by line in
            # Message list.
            Message.append(L)
        L = f.readline()
    if Message:
        # Last message in the mailbox
        ProcessMessage(Message,out)
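The Eudora and Linux loops differ only in the test that recognizes a separator line, so the shared control structure can be factored out. The following Python 3 generator is a sketch of that idea, not the chapter's code; split_mailbox, eudora_separator, and linux_separator are hypothetical names.

```python
import io


def split_mailbox(lines, is_separator):
    """Yield each message as a list of lines; separator lines are dropped."""
    message = []
    for line in lines:
        if is_separator(line):
            if message:          # skip the empty run before the first message
                yield message
            message = []
        else:
            message.append(line)
    if message:                  # last message in the mailbox
        yield message


def eudora_separator(line):
    # Eudora sentinel may appear anywhere in the line.
    return "From ???@???" in line


def linux_separator(line):
    # mbox separator must be at the very start of the line.
    return line.startswith("From ")


mbx = io.StringIO(
    "From ???@??? Mon Sep 06 14:07:14 1999\n"
    "Subject: Hello\n\nWorld\n"
    "From ???@??? Mon Sep 06 14:13:41 1999\n"
    "Subject: Message 2\n\nFrom sean@digitome.com\nHello\n"
)
messages = list(split_mailbox(mbx, eudora_separator))
print(len(messages))
```

Note that a body line such as "From sean@digitome.com" does not split a Eudora mailbox, whereas linux_separator would treat it as the start of a new message unless it had been escaped with >.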
14.6 | Processing an E-mail Message by Using the rfc822 Module
The two functions DoLinuxMailBox and DoEudoraMailBox chop up mailboxes into individual messages that are processed by the ProcessMessage function. This function uses the rfc822 module to separate the headers from the body of the message.
CD-ROM reference=14007.txt
def ProcessMessage(lines,out):
    """
    Given the lines that make up an e-mail message,
    create an XML message element. Uses the rfc822
    module to parse the e-mail headers.
    """
    out.write("<message>\n")
    # Create a single string from these lines.
    MessageString = string.joinfields(lines,"")
    # Create a file object from the string for use
    # by the rfc822 module.
    fo = StringIO.StringIO(MessageString)
    m = rfc822.Message(fo)
    # The m object now contains all the headers.
    # The headers can be accessed as a Python dictionary.
    out.write("<headers>\n")
    for (h,v) in m.items():
        out.write("<field>\n")
        out.write("<name>%s</name>\n" % XMLEscape(h))
        out.write("<value>%s</value>\n" % XMLEscape(v))
        out.write("</field>\n")
    out.write("</headers>\n")
    # Whatever remains in the file object is the message body.
    out.write("<body>\n")
    out.write(XMLEscape(fo.read()))
    out.write("</body>\n")
    out.write("</message>\n")
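For comparison, here is a Python 3 sketch of the same step. It swaps in the standard email package for rfc822 and xml.sax.saxutils.escape for XMLEscape (which also escapes the > character), so it is an approximation of ProcessMessage rather than a port of it. The name process_message is hypothetical, header names are lowercased by hand to mimic rfc822, and the sketch assumes plain single-part messages.

```python
import io
from email import message_from_string
from xml.sax.saxutils import escape


def process_message(lines, out):
    """Write one xMail message element for the given message lines."""
    msg = message_from_string("".join(lines))
    out.write("<message>\n<headers>\n")
    for name, value in msg.items():
        out.write("<field>\n")
        out.write("<name>%s</name>\n" % escape(name.lower()))
        out.write("<value>%s</value>\n" % escape(value))
        out.write("</field>\n")
    out.write("</headers>\n<body>\n")
    # For a single-part message the payload is the body text.
    out.write(escape(msg.get_payload()))
    out.write("</body>\n</message>\n")


buf = io.StringIO()
process_message(
    ["From: Sean Mc Grath <sean@digitome.com>\n", "Subject: Hello\n", "\n", "World\n"],
    buf,
)
print(buf.getvalue())
```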
Time to illustrate the program in action. The -l (Linux) or -e (Eudora) command-line switch tells the program what type of mailbox to process.
Here is a small Eudora mailbox.
CD-ROM reference=14008.txt
C>type test.mbx
From ???@??? Mon Sep 06 14:07:14 1999
To: sean@p13
From: Sean Mc Grath <sean@digitome.com>
Subject: Hello
Cc:
Bcc:
X-Attachments:
In-Reply-To:
References:
X-Eudora-Signature: <Standard>

World
From ???@??? Mon Sep 06 14:07:31 1999
To: sean@p13
From: Sean Mc Grath <sean@digitome.com>
Subject: Message 2
Cc:
Bcc:
X-Attachments:
In-Reply-To:
References:
X-Eudora-Signature: <Standard>

Hello
From ???@??? Mon Sep 06 14:13:41 1999
To: sean@p13
From: Sean Mc Grath <sean@digitome.com>
Subject: Message 2
Cc:
Bcc:
X-Attachments:
In-Reply-To:
References:
X-Eudora-Signature: <Standard>

From sean@digitome.com
Hello
The file can be converted to XML as follows.
CD-ROM reference=14009.txt
C>python xmail.py -e test.mbx
<?xml version="1.0"?>
<!DOCTYPE xmail SYSTEM "xmail.dtd">
<xmail>
<message>
<headers>
<field>
<name>subject</name>
<value>Hello</value>
</field>
<field>
<name>references</name>
<value></value>
</field>
<field>
<name>bcc</name>
<value></value>
</field>
<field>
<name>x-attachments</name>
<value></value>
</field>
<field>
<name>cc</name>
<value></value>
</field>
<field>
<name>in-reply-to</name>
<value></value>
</field>
<field>
<name>x-eudora-signature</name>
<value>&lt;Standard></value>
</field>
<field>
<name>from</name>
<value>Sean Mc Grath &lt;sean@digitome.com></value>
</field>
<field>
<name>to</name>
<value>sean@p13</value>
</field>
</headers>
<body>
World
</body>
</message>
<message>
<headers>
<field>
<name>subject</name>
<value>Message 2</value>
</field>
<field>
<name>references</name>
<value></value>
</field>
<field>
<name>bcc</name>
<value></value>
</field>
<field>
<name>x-attachments</name>
<value></value>
</field>
<field>
<name>cc</name>
<value></value>
</field>
<field>
<name>in-reply-to</name>
<value></value>
</field>
<field>
<name>x-eudora-signature</name>
<value>&lt;Standard></value>
</field>
<field>
<name>from</name>
<value>Sean Mc Grath &lt;sean@digitome.com></value>
</field>
<field>
<name>to</name>
<value>sean@p13</value>
</field>
</headers>
<body>
Hello
</body>
</message>
<message>
<headers>
<field>
<name>subject</name>
<value>Message 2</value>
</field>
<field>
<name>references</name>
<value></value>
</field>
<field>
<name>bcc</name>
<value></value>
</field>
<field>
<name>x-attachments</name>
<value></value>
</field>
<field>
<name>cc</name>
<value></value>
</field>
<field>
<name>in-reply-to</name>
<value></value>
</field>
<field>
<name>x-eudora-signature</name>
<value>&lt;Standard></value>
</field>
<field>
<name>from</name>
<value>Sean Mc Grath &lt;sean@digitome.com></value>
</field>
<field>
<name>to</name>
<value>sean@p13</value>
</field>
</headers>
<body>
From sean@digitome.com
Hello
</body>
</message>
</xmail>
Notice how the < character has been escaped to &lt; whenever it occurs in a header or the body of an e-mail message. The & character is handled the same way and becomes &amp;.
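The order of the two replacements matters: the ampersand must be escaped first, otherwise the ampersands introduced by escaping < would themselves be escaped. A small Python 3 sketch of the idea follows; xml_escape mirrors what XMLEscape is described as doing, and wrong_order is a deliberately broken variant included only for contrast. Both names are hypothetical.

```python
def xml_escape(s):
    """Escape & and < for use in XML character data."""
    s = s.replace("&", "&amp;")   # must come first
    s = s.replace("<", "&lt;")
    return s


def wrong_order(s):
    """Broken variant: escaping < first double-escapes the result."""
    s = s.replace("<", "&lt;")
    s = s.replace("&", "&amp;")
    return s


print(xml_escape("<Standard> & co"))
print(wrong_order("<Standard>"))
```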
Here is a small, Linux-style mailbox.
CD-ROM reference=14010.txt
$cat test
From sean@digitome.com Mon Sep 6 13:58:36 1999
Return-Path: <sean@digitome.com>
Received: from gateway ([100.100.100.105]) by p13.digitome.com (8.9.3/8.8.7) with SMTP id NAA07403 for <sean@p13>; Mon, 6 Sep 1999 13:58:36 GMT
Message-Id: <3.0.6.32.19990906140714.009b0ac0@p13>
X-Sender: sean@p13
X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.6 (32)
Date: Mon, 06 Sep 1999 14:07:14 +0100
To: sean@p13.digitome.com
From: Sean Mc Grath <sean@digitome.com>
Subject: Hello
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"

World
From sean@digitome.com Mon Sep 6 13:58:53 1999
Return-Path: <sean@digitome.com>
Received: from gateway ([100.100.100.105]) by p13.digitome.com (8.9.3/8.8.7) with SMTP id NAA07407 for <sean@p13>; Mon, 6 Sep 1999 13:58:52 GMT
Message-Id: <3.0.6.32.19990906140731.009b6a40@p13>
X-Sender: sean@p13
X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.6 (32)
Date: Mon, 06 Sep 1999 14:07:31 +0100
To: sean@p13.digitome.com
From: Sean Mc Grath <sean@digitome.com>
Subject: Message 2
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"

Hello
It can be converted to XML with the following command.
CD-ROM reference=14011.txt
$python xmail.py -l test
<?xml version="1.0"?>
<!DOCTYPE xmail SYSTEM "xmail.dtd">
<xmail>
<message>
<headers>
<field>
<name>subject</name>
<value>Hello</value>
</field>
<field>
<name>x-sender</name>
<value>sean@p13</value>
</field>
<field>
<name>x-mailer</name>
<value>QUALCOMM Windows Eudora Light Version 3.0.6 (32)</value>
</field>
<field>
<name>content-type</name>
<value>text/plain; charset="us-ascii"</value>
</field>
<field>
<name>message-id</name>
<value>&lt;3.0.6.32.19990906140714.009b0ac0@p13></value>
</field>
<field>
<name>to</name>
<value>sean@p13.digitome.com</value>
</field>
<field>
<name>date</name>
<value>Mon, 06 Sep 1999 14:07:14 +0100</value>
</field>
<field>
<name>mime-version</name>
<value>1.0</value>
</field>
<field>
<name>return-path</name>
<value>&lt;sean@digitome.com></value>
</field>
<field>
<name>from</name>
<value>Sean Mc Grath &lt;sean@digitome.com></value>
</field>
<field>
<name>received</name>
<value>from gateway ([100.100.100.105]) by p13.digitome.com (8.9.3/8.8.7) with SMTP id NAA07403 for &lt;sean@p13>; Mon, 6 Sep 1999 13:58:36 GMT</value>
</field>
</headers>
<body>
World
</body>
</message>
<message>
<headers>
<field>
<name>subject</name>
<value>Message 2</value>
</field>
<field>
<name>x-sender</name>
<value>sean@p13</value>
</field>
<field>
<name>x-mailer</name>
<value>QUALCOMM Windows Eudora Light Version 3.0.6 (32)</value>
</field>
<field>
<name>content-type</name>
<value>text/plain; charset="us-ascii"</value>
</field>
<field>
<name>message-id</name>
<value>&lt;3.0.6.32.19990906140731.009b6a40@p13></value>
</field>
<field>
<name>to</name>
<value>sean@p13.digitome.com</value>
</field>
<field>
<name>date</name>
<value>Mon, 06 Sep 1999 14:07:31 +0100</value>
</field>
<field>
<name>mime-version</name>
<value>1.0</value>
</field>
<field>
<name>return-path</name>
<value>&lt;sean@digitome.com></value>
</field>
<field>
<name>from</name>
<value>Sean Mc Grath &lt;sean@digitome.com></value>
</field>
<field>
<name>received</name>
<value>from gateway ([100.100.100.105]) by p13.digitome.com (8.9.3/8.8.7) with SMTP id NAA07407 for &lt;sean@p13>; Mon, 6 Sep 1999 13:58:52 GMT</value>
</field>
</headers>
<body>
Hello
</body>
</message>
</xmail>
14.7 | Sending E-mail by Using xMail
Having converted the e-mail to XML, we can process it in a variety of ways by using any XML-aware databases, editors, search engines, and so on. We can also process it with Python by using Pyxie or SAX- or DOM-style processing. One useful form of processing would be to send e-mail from this XML notation. In this section, we develop a Pyxie application, sendxMail, to do that.
Sending e-mail to a group of people at the same time is common, so we start by defining an XML notation for a mailing list. Here is a sample document conforming to a contacts DTD.
CD-ROM reference=14012.txt
<!DOCTYPE contacts SYSTEM "contacts.dtd">
<contacts>
<contact>
<name>Neville Bagnall</name>
<email>neville@digitome.com</email>
</contact>
<contact>
<name>Noel Duffy</name>
<email>noel@digitome.com</email>
</contact>
<contact>
<name>Sean Mc Grath</name>
<email>sean@digitome.com</email>
</contact>
</contacts>
The DTD for this is, of course, trivial.
CD-ROM reference=14013.txt
C>type contacts.dtd
<!--
Trivial DTD for a mailing list
-->
<!ELEMENT contacts (contact)*>
<!ELEMENT contact (name,email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
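Extracting the addresses from such a contacts document takes only a few lines. The chapter does this with Pyxie's tree API; the sketch below uses Python 3's standard xml.etree.ElementTree instead, purely as an illustration. The function name read_addresses is hypothetical, and the sample document omits the DOCTYPE line for brevity.

```python
import xml.etree.ElementTree as ET

CONTACTS = """<contacts>
<contact>
<name>Neville Bagnall</name>
<email>neville@digitome.com</email>
</contact>
<contact>
<name>Noel Duffy</name>
<email>noel@digitome.com</email>
</contact>
</contacts>"""


def read_addresses(xml_text):
    """Return the text of every email element, in document order."""
    root = ET.fromstring(xml_text)
    return [e.text.strip() for e in root.iter("email") if e.text]


addresses = read_addresses(CONTACTS)
print(addresses)
```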
The full source code for sendxMail is given at the end of this chapter. The program uses the smtplib module from the Python standard library. This module allows Python programs to send e-mail messages by talking to an SMTP server. Here is a small test program that illustrates how the smtplib module works.
CD-ROM reference=14014.txt
"""
Small test program to illustrate Python's standard smtplib library
"""
import smtplib

SMTPServer = "gpo.iol.ie"

# Create an SMTP server object.
server = smtplib.SMTP(SMTPServer)
# Turn debugging on.
server.set_debuglevel(1)
# Send an e-mail. First argument is the sender. Second argument
# is a list of recipients. Third argument is the text of the
# message.
server.sendmail(
    "sean@digitome.com",
    ["paul@digitome.com"],
    "Hello World")
To execute this program, change SMTPServer to point to a suitable SMTP server. The program will produce a lot of output because debugging has been turned on. The abridged output from an execution of this program is shown here.
CD-ROM reference=14015.txt
send: 'ehlo GATEWAY\015\012'
reply: '250-gpo2.mail.iol.ie Hello dialup-024.ballina.iol.ie [194.125.48.152], pleased to meet you\015\012'
reply: retcode (250); Msg: gpo2.mail.iol.ie Hello dialup-024.ballina.iol.ie [194.125.48.152], pleased to meet you
send: 'mail FROM:<sean@digitome.com> size=11\015\012'
reply: '250 <sean@digitome.com>... Sender ok\015\012'
reply: retcode (250); Msg: <sean@digitome.com>... Sender ok
send: 'rcpt TO:<paul@digitome.com>\015\012'
reply: '250 <paul@digitome.com>... Recipient ok\015\012'
reply: retcode (250); Msg: <paul@digitome.com>... Recipient ok
send: 'data \015\012'
reply: '354 Enter mail, end with "." on a line by itself\015\012'
reply: retcode (354); Msg: Enter mail, end with "." on a line by itself
data: (354, 'Enter mail, end with "." on a line by itself')
send: 'Hello World'
send: '\015\012.\015\012'
reply: '250 QAA03299 Message accepted for delivery\015\012'
reply: retcode (250); Msg: QAA03299 Message accepted for delivery
data: (250, 'QAA03299 Message accepted for delivery')
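On Python 3, a message with proper headers is usually assembled with email.message.EmailMessage and handed to smtplib's send_message method. The following is a hedged sketch of that approach, not the chapter's code: build_message and send are hypothetical helper names, and send is defined but not called because it needs a reachable SMTP server.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipients, subject, body):
    """Assemble a plain-text message with From, To, and Subject headers."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send(msg, smtp_server):
    """Deliver msg through smtp_server (requires network access)."""
    with smtplib.SMTP(smtp_server) as server:
        server.set_debuglevel(1)   # echo the SMTP conversation, as above
        server.send_message(msg)


msg = build_message("sean@digitome.com", ["paul@digitome.com"], "Hi", "Hello World")
print(msg.as_string())
```

To actually deliver the message, one would call send(msg, "gpo.iol.ie") with the server name replaced by a suitable SMTP server.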
To execute sendxMail, you provide it with four parameters:
- The e-mail address of the sender.
- The XML document containing the list of recipients. This file should conform to the contacts DTD.
- The e-mail message document. This document should conform to the xMail DTD.
- The name of the SMTP server to use.
A sample invocation is shown below.
CD-ROM reference=14016.txt
C>python sendxMail.py sean@digitome.com PyxieList.xml Welcome.xml gpo.iol.ie
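Before sending, sendxMail has to turn the name/value fields of the message document back into "name: value" header lines followed by a blank line and the body. That step can be sketched in Python 3 with xml.etree.ElementTree; this is an illustration under that substitution, not the Pyxie-based code listed in the next section, and message_text is a hypothetical name.

```python
import xml.etree.ElementTree as ET

XMAIL = """<xmail>
<message>
<headers>
<field><name>subject</name><value>Greetings</value></field>
</headers>
<body>
Hello World
</body>
</message>
</xmail>"""


def message_text(message_el):
    """Rebuild header lines, a blank separator line, and the body text."""
    header_lines = []
    for field in message_el.find("headers").findall("field"):
        name = field.findtext("name", "").strip()
        value = field.findtext("value", "").strip()
        header_lines.append("%s: %s" % (name, value))
    body = message_el.findtext("body", "").strip()
    return "\n".join(header_lines) + "\n\n" + body + "\n"


root = ET.fromstring(XMAIL)
texts = [message_text(m) for m in root.findall("message")]
print(texts[0])
```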
14.8 | Source Code for the sendxMail Application
CD-ROM reference=14017.txt
"""
sendxMail

XML Processing with Python
Sean Mc Grath

Send e-mail over the Internet to a group of e-mail accounts, using
the xmail XML representation. The program connects to the specified
SMTP server and uses Python's smtplib library. The list of addresses
is also in XML.

A simple message file looks like this:

<xmail>
<message>
<headers>
<field><name>subject</name><value>Greetings</value></field>
</headers>
<body>
Hello World
</body>
</message>
</xmail>

A simple address file looks like this:

<contacts>
<contact>
<name>Neville Bagnall</name>
<email>neville@digitome.com</email>
</contact>
</contacts>

Sample invocation:
python sendxmail.py sean@digitome.com contacts.xml email.xml gpo.iol.ie
"""
import string
import smtplib
from pyxie import *

# Class uses event-driven XML processing style to send messages
# one at a time and so inherits from xDispatch.
class xMailSender(xDispatch):
    def __init__(self,Sender,MailingListFile,MessageFile,SMTPServer):
        # PYX source for later event dispatching is the
        # message file.
        xDispatch.__init__(self,File2PYX(MessageFile))
        # The Gathered variable is used to gather characters
        # arriving in the data handler method between certain
        # start- and end-tags.
        self.Gathered = []
        self.Sender = Sender
        self.Addresses = []
        self.MailingListFile = MailingListFile
        self.MessageFile = MessageFile
        # Accumulated message header
        self.MessageHeader = ""
        # Accumulated message body
        self.MessageText = ""
        self.Recipients = []
        self.server = smtplib.SMTP(SMTPServer)
        self.server.set_debuglevel(1)
        # Use tree-processing style to assemble list of
        # recipients.
        T = File2Tree(self.MailingListFile)
        for n in T:
            T.Seek(n)
            if T.AtElement("email"):
                email = T.JoinData(" ")
                self.Addresses.append(email)
        # Invoke event dispatching to handler methods.
        # PYX source is the message file.
        self.Dispatch()

    def start_body(self,etn,attrs):
        # Reset gathered data for each message body.
        self.Gathered = []

    def start_name(self,etn,attrs):
        # Reset gathered data for each name element.
        self.Gathered = []

    def start_value(self,etn,attrs):
        # Reset gathered data for each value element.
        self.Gathered = []

    def end_name(self,etn):
        # Save gathered data as header field name.
        self.fieldname = string.join(self.Gathered)

    def end_value(self,etn):
        # Save gathered data as header field value.
        self.fieldvalue = string.join(self.Gathered)
        # Add the new name/value pair to the end of
        # the message header.
        self.MessageHeader = self.MessageHeader + self.fieldname + ": " + self.fieldvalue + "\n"

    def characters(self,str):
        # Handler for character data. Accumulate
        # data in the Gathered variable. Various
        # end-tag handlers copy out the accumulated
        # contents as needed.
        self.Gathered.append(PYXDecoder(str))

    def end_body(self,etn):
        # At this point, we have everything we need
        # to send e-mail.
        self.MessageText = string.join(self.Gathered)
        self.server.sendmail(self.Sender,
                             self.Addresses,
                             self.MessageHeader+"\n"+self.MessageText)
        # Close down the SMTP connection.
        self.server.quit()

if __name__ == '__main__':
    import sys
    if len(sys.argv)==1:
        xMailSender("sean@digitome.com",
                    "contacts.xml",
                    "email.xml",
                    "gpo.iol.ie")
    else:
        xMailSender(sys.argv[1],
                    sys.argv[2],
                    sys.argv[3],
                    sys.argv[4])
14.9 | Source Code for the xMail Application
CD-ROM reference=14018.txt
"""
xMail

Convert mailboxes to a simple XML form for e-mail messages.

XML Processing with Python
Sean Mc Grath

The Eudora e-mail client stores e-mail messages in mailboxes.
The file format is plain text. Individual messages are
separated by the string "From ???@???". This program processes
a mailbox creating an XML file that conforms to the xmail DTD.
"""
# Import some standard modules
# rfc822 is the module for e-mail header parsing
import string,rfc822,StringIO

LINUX = 0
EUDORA = 1

def XMLEscape(s):
    """
    Escape XML's two special characters which may
    occur within an e-mail message.
    """
    # The ampersand must be escaped first.
    s = string.replace(s,"&","&amp;")
    s = string.replace(s,"<","&lt;")
    return s

def ProcessMessage(lines,out):
    """
    Given the lines that make up an e-mail message,
    create an XML message element. Uses the rfc822
    module to parse the e-mail headers.
    """
    out.write("<message>\n")
    # Create a single string from these lines.
    MessageString = string.joinfields(lines,"")
    # Create a file object from the string for use
    # by the rfc822 module.
    fo = StringIO.StringIO(MessageString)
    m = rfc822.Message(fo)
    # The m object now contains all the headers.
    # The headers can be accessed as a Python dictionary.
    out.write("<headers>\n")
    for (h,v) in m.items():
        out.write("<field>\n")
        out.write("<name>%s</name>\n" % XMLEscape(h))
        out.write("<value>%s</value>\n" % XMLEscape(v))
        out.write("</field>\n")
    out.write("</headers>\n")
    out.write("<body>\n")
    out.write(XMLEscape(fo.read()))
    out.write("</body>\n")
    out.write("</message>\n")

def DoEudoraMailBox(MailBox):
    """
    Given a Eudora mailbox, convert its contents to XML
    conforming to the xmail DTD.
    """
    f = open(MailBox,"r")
    l = f.readline()[:-1]
    if string.find(l,"From ???@???")==-1:
        # Sentinel that separates e-mail messages in the
        # Eudora mbx notation.
        print 'Expected mailbox "%s"' % MailBox,
        print 'to start with "From ???@???"'
        return
    if MailBox[-4:] != ".mbx":
        print "Expected mailbox to have .mbx file extension", MailBox
        return
    # Output file has same base name but .xml extension.
    out = open(MailBox[:-3]+"xml","w")
    out.write('<?xml version="1.0"?>\n')
    out.write('<!DOCTYPE xmail SYSTEM "xmail.dtd">\n')
    out.write('<xmail>\n')
    Message = []
    l = f.readline()
    while l:
        if string.find(l,"From ???@???")!=-1:
            # Full message accumulated in the Message list,
            # so process it to XML.
            ProcessMessage(Message,out)
            Message = []
        else:
            # Accumulate e-mail contents line by line in
            # Message list.
            Message.append(l)
        l = f.readline()
    if Message:
        # Last message in the mailbox
        ProcessMessage(Message,out)
    out.write('</xmail>\n')
    f.close()
    out.close()

def DoLinuxMailBox(MailBox):
    """
    Given a Unix mbox style mailbox, convert its contents to XML
    conforming to the xmail DTD.
    """
    f = open(MailBox,"r")
    l = f.readline()[:-1]
    if string.find(l,"From ")!=0:
        print 'Expected mailbox "%s" to start with "From "' % MailBox
        return
    # Output file has same name as mailbox but with ".xml" added.
    out = open(MailBox+".xml","w")
    out.write('<?xml version="1.0"?>\n')
    out.write('<!DOCTYPE xmail SYSTEM "xmail.dtd">\n')
    out.write('<xmail>\n')
    Message = []
    l = f.readline()
    while l:
        if string.find(l,"From ")==0:
            # Full message accumulated in the Message list,
            # so process it to XML.
            ProcessMessage(Message,out)
            Message = []
        else:
            # Accumulate e-mail contents line by line in
            # Message list.
            Message.append(l)
        l = f.readline()
    if Message:
        # Last message in the mailbox
        ProcessMessage(Message,out)
    out.write('</xmail>\n')
    f.close()
    out.close()

if __name__=="__main__":
    import sys,getopt
    format = LINUX
    (options,remainder) = getopt.getopt(sys.argv[1:],"le")
    for (option,value) in options:
        if option == "-l":
            format = LINUX
        elif option == "-e":
            format = EUDORA
    if len(remainder)!=1:
        print "Usage: %s -l|-e mailbox" % sys.argv[0]
        sys.exit()
    if format==EUDORA:
        DoEudoraMailBox(remainder[0])
    elif format==LINUX:
        DoLinuxMailBox(remainder[0])