
Stopping Spam at the Server: Part III

The best approach to developing an anti-spam solution for your network is to use a combination of techniques. The LeBlancs give a darned good overview of what you need to know, with lots of specific examples.

In the second article in this series, we looked at the various tests a smart anti-spam solution can make on E-mail contents. Adding this technique to those we discussed in the first article (testing the E-mail envelopes) gives us a wide range of options to choose from.

However, the best approach of all is to combine as many spam-identification techniques as possible. Doing so maximizes your chances of sorting the spam from the ham, so that your users can go on with their lives with minimal concerns. And if you live by the mail administrator credo of Lose No Mail, your anti-spam solution should also let users handle their own quarantines and train their own filters.

The Safety (and Wisdom) of Combined Solutions

A combined filtering approach makes sense for a number of reasons. For one thing, spammers are always looking for a clever way to foil the latest spam-blocking techniques. Consider the Bayesian poison pills we discussed in the previous article in this series, and the Distributed Denial of Service (DDoS) attacks that not only shut down a couple of DNSBLs but also crippled the anti-spam software that relied on those external sources to make its decisions. When these attacks occurred, some administrators found their E-mail servers unable to deliver mail until their spam-fighting measures were disabled, or until new ones were hastily installed.

A combined approach relies on scoring techniques similar to those used by feature recognizers. Such tools support a range of tests, from envelope tests to content tests and beyond. When configuring a combined tool, the administrator tells it which tests to use, and each test is assigned a numeric value according to how much it is trusted to spot spam accurately. The more reliably a test identifies spam, the higher (more positive) its score; the more reliably it identifies ham, the lower (more negative) its score. Once the battery of tests has run on a particular message, the tool adds up the positive and negative values of every test that was triggered to calculate an overall score.
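The scoring mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule names, the weights, and the dictionary layout of a "message" are all hypothetical and are not taken from any particular anti-spam product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    score: float                     # positive = spam indicator, negative = ham indicator
    matches: Callable[[dict], bool]  # test run against a (simplified) parsed message


# Hypothetical rules and weights, chosen only to illustrate the mechanism.
RULES = [
    Rule("FROM_LACKS_DISPLAY_NAME", 0.2, lambda m: "<" not in m.get("from", "")),
    Rule("SUBJECT_HAS_PRICE", 1.5, lambda m: "$" in m.get("subject", "")),
    Rule("SENDER_IN_ADDRESS_BOOK", -3.0,
         lambda m: m.get("from", "") in m.get("known_senders", ())),
]


def score_message(message: dict, rules=RULES):
    """Run every rule and return (total score, names of the rules that fired)."""
    fired = [rule for rule in rules if rule.matches(message)]
    total = round(sum(rule.score for rule in fired), 3)
    return total, [rule.name for rule in fired]


spam_like = {"from": "someone@example.net", "subject": "ONLY $99.00"}
ham_like = {
    "from": "friend@example.org",
    "subject": "Lunch on Friday?",
    "known_senders": {"friend@example.org"},
}

print(score_message(spam_like))
print(score_message(ham_like))
```

Note that no single rule decides the outcome. The ham-like message still trips one mildly suspicious rule, but the negative score from the address-book rule outweighs it.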

Rather than making you try to imagine what's happening, we'll take a look at an example. Let's say that we have the following piece of E-mail, which arrives from a host claiming to be "mail.aol.com":

Received: from xx.xx.xx.xx (bogus [xx.xx.xx.xx])
 by mail.example.com (8.12.10/8.12.10/1.1) with SMTP id i1B5ktbr020202
 for <recipient@example.com>; Tue, 10 Feb 2004 21:47:02 -0800
Message-Id: <200402110547.i1B5ktbr020202@mail.example.com>
From: EHFVMTDRRXHKDU@aol.com
To: recipient@example.com
Subject: 100% Verified E-mail Addresses: 525 million (5cdS) ONLY $99.00
Date: Wed, 11 Feb 2004 18:47:25 -0500

MLM Marketing Opportunities!

535 million Email Addresses in a 5-disk set REGULARLY $637.00
NOW ONLY $99.00

When this E-mail message arrives at our anti-spam filter, it faces both envelope tests and content tests. The following specific tests might be triggered in this case:

| Rule | Why it was Triggered |
| --- | --- |
| SPF_FAIL | The host that connected to the mail server (xx.xx.xx.xx) was checked against the SPF records for its stated domain (aol.com), and was not found on the list of authorized mail servers for that domain. This is an envelope test, but rather than basing our diagnosis solely on this fact, we just assign a positive score (+1.604) to this suspicious occurrence. |
| NO_REAL_NAME | The "From:" header usually also contains the sender's real name, as in "Real Name <address>". If the sender didn't enter a real name when he configured his mail client, only the E-mail address will be shown. This isn't technically invalid, but it's suspicious, so our feature recognizer has a rule to spot this and adds a positive score (+0.160) to the total. |
| DATE_IN_FUTURE_12_24 | Spammers often like to mess with the "Date:" header in the hope that your mail client will sort their mail closer to the top of the inbox. In this case, our feature recognizer detects that the date in the "Date:" header is 12-24 hours ahead of the date in the "Received:" header. Again, this is not conclusive by itself, since a drifting system clock could be to blame, but based on how often this rule is triggered in spam, and how rarely time differences this exaggerated appear in ham, we add a score of +3.332 to the total. |
| MLM | A feature recognizer looking for a common spam keyword or phrase like "MLM" and variations on "Multi-Level Marketing" would find a match in the body of this E-mail, adding a score of +1.787 to the total. |
| MILLION_EMAIL | Spam that offers CDs of "millions" of E-mail addresses is so common these days that there are hard-coded feature recognizer rules to identify patterns like "million(s) (of) (e-mail) addresses". Triggering this rule adds +1.999 to the total score for this message. |
| DCC_CHECK | By comparing the contents of this E-mail with samples submitted by others at the Distributed Checksum Clearinghouse (DCC), we find that many, many others have already received this particular spam and classified it as such. That earns this message another +2.907 points. |
| RCVD_IN_SBL | The connecting host's address (xx.xx.xx.xx) was looked up against the Spamhaus Block List (SBL), one of the more popular DNSBLs, and was listed there as a known spam source. This envelope test adds +1.113 to our total score. |
| BAYES_60 | This particular E-mail message does not contain a lot of text, so it doesn't offer a lot of tokens for the Bayes learning engine to work with. It finds some suspicious tokens, but also a number of tokens that are found often in legitimate mail, so its overall confidence level for this mail is between 60% and 70%. This adds +1.592 to the total. |
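Two of the tests listed above are simple enough to sketch directly. The Python below is a rough approximation, not the actual rule definitions used by any filter: the function names and the regular expression are our own assumptions, and the sample is a trimmed copy of the example message. It compares the "Date:" header with the timestamp at the end of the topmost "Received:" header, and searches the body for the "million(s) (of) (e-mail) addresses" pattern.

```python
import re
from email import message_from_string
from email.utils import parsedate_to_datetime

# Trimmed copy of the example message from this article.
RAW = """\
Received: from xx.xx.xx.xx (bogus [xx.xx.xx.xx])
 by mail.example.com (8.12.10/8.12.10/1.1) with SMTP id i1B5ktbr020202
 for <recipient@example.com>; Tue, 10 Feb 2004 21:47:02 -0800
From: EHFVMTDRRXHKDU@aol.com
Date: Wed, 11 Feb 2004 18:47:25 -0500

535 million Email Addresses in a 5-disk set REGULARLY $637.00
"""

# Assumed approximation of "million(s) (of) (e-mail) addresses".
MILLION_EMAIL = re.compile(
    r"\bmillions?\s+(?:of\s+)?(?:e-?mail\s+)?addresses\b", re.IGNORECASE
)


def hours_ahead(msg) -> float:
    """Hours by which the Date: header is ahead of the topmost Received: stamp."""
    # The topmost Received: header is the one added by our own server;
    # its timestamp follows the last semicolon.
    stamp = msg["Received"].rsplit(";", 1)[1].strip()
    received_at = parsedate_to_datetime(stamp)
    claimed = parsedate_to_datetime(msg["Date"])
    # Both values carry a UTC offset, so the subtraction is timezone-safe.
    return (claimed - received_at).total_seconds() / 3600


def date_in_future_12_24(msg) -> bool:
    return 12 <= hours_ahead(msg) < 24


def million_email(body: str) -> bool:
    return MILLION_EMAIL.search(body) is not None


msg = message_from_string(RAW)
print(f"{hours_ahead(msg):.2f} hours ahead")
print(date_in_future_12_24(msg), million_email(msg.get_payload()))
```

Once both timestamps are normalized for their time zones (-0800 and -0500), the "Date:" header turns out to be about 18 hours ahead of the time the server actually received the message, which is why the 12-24 hour rule fires.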


The final score for this example message is 14.494, but the score alone still does not tell us whether this mail is spam or ham. That has to be determined per recipient, according to each recipient's threshold score, which can be set by individual users (we'll get into user quarantines in a moment). If one recipient sets his spam threshold at 5.0, and another recipient doesn't consider E-mail to be spam unless its score is 15.0 or higher, this item would clearly be spam to the first recipient and ham to the second.
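The arithmetic is easy to check. The short Python sketch below adds up the eight scores from the example and applies the two thresholds just mentioned; the recipient labels and the `classify` helper are our own illustrative names.

```python
# Scores of the rules triggered by the example message.
TRIGGERED = {
    "SPF_FAIL": 1.604,
    "NO_REAL_NAME": 0.160,
    "DATE_IN_FUTURE_12_24": 3.332,
    "MLM": 1.787,
    "MILLION_EMAIL": 1.999,
    "DCC_CHECK": 2.907,
    "RCVD_IN_SBL": 1.113,
    "BAYES_60": 1.592,
}

# Round to three decimals to hide floating-point noise in the sum.
total = round(sum(TRIGGERED.values()), 3)

# Per-recipient thresholds from the example above.
THRESHOLDS = {"first recipient": 5.0, "second recipient": 15.0}


def classify(score: float, threshold: float) -> str:
    """A message is spam for a recipient once it reaches that recipient's threshold."""
    return "spam" if score >= threshold else "ham"


verdicts = {who: classify(total, limit) for who, limit in THRESHOLDS.items()}

print(total)     # 14.494
print(verdicts)  # {'first recipient': 'spam', 'second recipient': 'ham'}
```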

Setting spam threshold scores can be a bit of an art. The various test scores are usually calibrated against a fixed value, such as 5.0, based on analyzing a huge sample of spam and ham, so it makes sense to start with that score. If too much spam is slipping through the filter, lower the threshold score a bit; if too much legitimate mail is ending up in the quarantine, raise the threshold score a bit.
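One way to see the trade-off is to count both kinds of mistake at a few candidate thresholds. The following Python sketch assumes you have a small set of messages that someone has already labelled by hand; the scores and labels here are invented purely for illustration.

```python
# Hypothetical hand-labelled (score, label) pairs; real figures would come
# from your own users' quarantines and inboxes.
SAMPLE = [
    (14.5, "spam"), (6.2, "spam"), (4.1, "spam"),
    (5.6, "ham"), (1.3, "ham"), (-2.0, "ham"),
]


def error_counts(sample, threshold):
    """Return (spam that slipped through, ham that was quarantined)."""
    missed_spam = sum(
        1 for score, label in sample if label == "spam" and score < threshold
    )
    quarantined_ham = sum(
        1 for score, label in sample if label == "ham" and score >= threshold
    )
    return missed_spam, quarantined_ham


for threshold in (4.0, 5.0, 6.0):
    missed, quarantined = error_counts(SAMPLE, threshold)
    print(f"threshold {threshold}: {missed} spam missed, {quarantined} ham quarantined")
```

With this made-up sample, lowering the threshold to 4.0 catches every spam but quarantines one legitimate message, while raising it to 6.0 does the reverse.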

Hopefully, even this brief example helps to illustrate the power of a combined envelope and content testing solution. The more tests the better! Next in this series, we will address how you can combine all of this data to help your users keep the spam out of their inbox, without breaking that all-important mail administrator rule: Lose No Mail.
