The Characteristics of Spam Email
It is easy for a person to look at a piece of email and say, "This isn't something I asked for. It looks like an advertisement, and I don't want it, so it must be spam." Recognizing spam is much harder for software than for humans, yet the whole point of spam-blocking software is to spare humans that task.
In this chapter we discuss spam as it appears to software and not as it appears to a human. For example, Figure 2.1 shows how a piece of spam email might appear to you when viewed in your email reader.
Figure 2.1 How spam email might appear to a human.
The message in Figure 2.1, however, might look like the following to software:
<a href="dvds.example.com"><img src="dvds.example.com/images/em1"></a>
This is HTML markup for a clickable link, here an image wrapped in an anchor tag. Although software lacks human eyes, it can easily detect that this is a clickable link.
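To make the point concrete, the following is a minimal sketch of how software might detect such a link, using Python's standard html.parser module. The names LinkFinder and find_links are hypothetical and chosen only for illustration. Real spam is usually far messier than this one-line sample, so treat this as a starting point rather than a working filter.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect each <a href=...> target and any images nested inside it."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, [img src, ...]) tuples
        self._open = None    # the link currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # A new clickable link begins here.
            self._open = (attrs["href"], [])
            self.links.append(self._open)
        elif tag == "img" and self._open is not None:
            # An image inside an open link makes the image itself clickable.
            self._open[1].append(attrs.get("src"))

    def handle_endtag(self, tag):
        if tag == "a":
            self._open = None


def find_links(html):
    """Return every clickable link found in an HTML fragment."""
    finder = LinkFinder()
    finder.feed(html)
    finder.close()
    return finder.links


sample = '<a href="dvds.example.com"><img src="dvds.example.com/images/em1"></a>'
print(find_links(sample))
```

Run against the sample line, the sketch reports one link whose only content is an image, a pattern that a filter could choose to treat as a hint of advertising.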
In this chapter we reveal, at the code level, how spam email is often internally structured. But bear in mind that spam email is constantly changing and evolving, and the lessons you learn in this chapter, although forward-looking, are not the total picture. Thus, we encourage you to collect spam to examine for yourself. By doing so, you will note trends not covered here.
One way to do this is to collect spam in your normal mailbox, but because this can get a bit messy, we instead advise you to set up a bait machine (see Chapter 3). [1]
2.1 Connection Behavior
Spam email is sent like all other email. The sending site connects to the local sendmail daemon and waits for the local sendmail to say it is ready.
Sender connects and waits for:
220 local.hostname...
Here, the line beginning with 220 is from the local sendmail and tells the sending client that it can begin to send messages.
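As a rough illustration of what waiting for the 220 means for a client, the sketch below checks whether a line received from the server is the final line of a 220 greeting. The function name is_ready_greeting is hypothetical. The treatment of "220-" as a continuation line comes from the SMTP specification, not from the text above.

```python
def is_ready_greeting(line):
    """Return True if line is the final line of a 220 greeting.

    Per the SMTP specification, "220-" marks a continuation line of a
    multiline greeting, and "220 " (or a bare "220") marks the last line.
    """
    line = line.rstrip("\r\n")
    if line[:3] != "220":
        return False
    # Anything other than a space after the code means more lines follow
    # (a hyphen) or the line is malformed.
    return len(line) == 3 or line[3] == " "
```

A client that respects the protocol would keep reading lines until this check succeeds, and only then send its first command.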
A well-behaved client will wait for the 220 before sending anything, but some spamming software will not. To speed up the spamming process, some spamming software will send the entire message without waiting for the 220.
Sender connects and immediately sends the whole message.
Then it disconnects and moves on to the next host in line.
220 local.hostname...
This works for spammers because email runs on top of the Transmission Control Protocol/Internet Protocol (TCP/IP) suite. With TCP/IP, the operating system sets up a buffer to hold incoming data, and because sendmail reads only from that buffer, rather than watching the actual connection, it does not see that the inbound sender has already sent everything. Thus, sendmail will obligingly accept the message and probably deliver it.
Fortunately, version 8.13 of sendmail adds a new feature to catch just this sort of ill-mannered behavior. Called the greet_pause FEATURE, it causes sendmail to sleep briefly before sending the 220. After the sleep and before sending the 220, sendmail checks to see whether input from the sender is already present in the buffer. If it is, sendmail thereafter rejects all SMTP commands from that sender, for that connection, issuing a 554 error.
Naturally it would seem better to just drop the connection, but sendmail dares not do that because doing so could leave unread information in the TCP buffer, eventually filling that buffer.
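The pause-then-check idea can be sketched in a few lines of Python. This is not sendmail's implementation. The function name greet_after_pause, its parameters, and the wording of the reply lines are assumptions made for illustration. The sketch also swaps in select with a timeout in place of a sleep followed by a check, so it returns as soon as early input appears instead of waiting out the full pause.

```python
import select
import socket


def greet_after_pause(conn, pause_seconds=5.0, hostname="local.hostname"):
    """Greet a newly connected client only if it stays silent during the pause.

    Returns True if the 220 greeting was sent, or False if the client sent
    data before being greeted. The connection is left open in both cases.
    """
    # Wait up to pause_seconds for the client to become readable, meaning
    # input is already sitting in the buffer before any greeting was sent.
    readable, _, _ = select.select([conn], [], [], pause_seconds)
    if readable:
        # Reply text is illustrative only, not sendmail's actual wording.
        conn.sendall(b"554 Command rejected\r\n")
        return False
    conn.sendall(b"220 " + hostname.encode("ascii") + b" ESMTP\r\n")
    return True
```

A caller that gets False back would go on answering every later command on that connection with an error rather than closing the socket, in keeping with the behavior described above.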
In the future, spamming sites will probably adapt to this form of detection, modify their behavior, and start sending only after receiving the 220 greeting, after the EHLO, after the MAIL FROM, or after a RCPT TO. To adapt, spamming software will need to move its asynchronous attack further into the synchronous protocol. As this competition continues, spamming software will slowly be forced to evolve into normal, well-behaved email sending software.