Free Outlook Express Spam FilterAnti-Spam Blocker Software For Microsoft

The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It

01 | 02 | 03

William S. Yerazunis, PhD*

Abstract: Bayesian filters have now become the standard for spam filtering; unfortunately most Bayesian filters seem to reach a plateau of accuracy at 99.9%. We experimentally compare the training methods TEFT, TOE, and TUNE, as well as pure Bayesian, token-bag, token-sequence, SBPH, and Markovian discriminators. The results demonstrate that TUNE is indeed best for training, but computationally exorbitant, and that Markovian discrimination is considerably more accurate than Bayesian, but not sufficient to reach four-nines accuracy, and that other techniques such as inoculation are needed.

Keywords: spam, Bayesian, filter, email, e-mail, Markovian

Introduction: The original heuristic spamfilters such as SpamAssassin and Brightmail are weighted filters whose features are generated by humans working in a reactive way. These filters use a fixed set of feature detectors, and set a threshold score for rejection of an email as “too spammy”. The first generation of these filters also had a human-generated weight set; now most heuristic filters use a neural net or other relaxation-based system to optimize the feature weighting. For example, SpamAssassin uses a set of over three hundred hand-coded Perl routines as feature recognizers.

In the recent past, these heuristic filters been eclipsed by the Bayesian antispam filters. Paul Graham described these filters initially in A Plan For Spam. These Bayesian filters have essentially dominated spam filtering, because of their significantly greater accuracy and greatly increased difficulty of subversion. Instead of using human-coded ad-hoc “feature recognizers”, a Bayesian filter is trained on a pair of text corpi, one of spam, and one of nonspam. The Bayesian filter counts the word frequencies in each of the spam and nonspam corpi, and from that ratio can determine a local probability on any incoming text given only that it contains a particular word.

In any case, the Bayesian filter's first big advantage is already evident- there is no human intervention required to generate the feature recognizers. A simple whitespace delimiter detector can break the incoming text into words, and each word is considered a feature in the database.

Some Bayesian filters incorporate a database clipping function that extracts only a few hundred or thousand of the most spammy or non-spammy words from the database for testing during the execution phase. Most Bayesian filters also use a testing clip, in that only the top ten or twenty words in the spammy or nonspammy category are considered significant.

Whether or not the Bayesian filter does database or test clipping, the next phase is to apply a statistical combining function to the local probabilities of each of the word features. Typically the local probability is something of the form:

Plocal-spam = Nspam / ( Nspam+Nnonspam)

with appropriate bounds testing to prevent divide-by-zero errors. A more advanced local probability formula takes into account that a local probability certainty should be shaded toward uncertainty (P=0.5), especially when there are low experience counts for that feature in the example texts. For example:

(Nspam - Nnonspam )

Plocal-spam = 0.5 +_______________________________________________

C1 *( Nspam+Nnonspam + C2)

which biases the local probability estimates strongly toward 0.5 for low numbers of event counts. For example, if C1 = 2 and C2 = 1, and if there is only one example of a word being spammy, the local probability of an incoming text containing that word being spam is considered to be 0.75, which is only a moderate likelihood of spam If there had been ten examples of the word in spam, and none in nonspam, the local probability of the incoming text found to contain that word being spam would be 95.4%, a reasonably high likelihood for spam.

Each word feature generates one such local probability. These local probabilities are then used in a Bayesian chain rule to calculate an overall probability that an incoming text is spam. As noted above, some Bayesian filters will do database or testing clipping, to minimize the total computation. The Bayesian chain rule is:

P(feat | in class) * P (in class)

P (in class | feature) =
_____________________________________

P(feat | class) * P (in class) + P(feat | not in class) * P (not in class)

which is aplied iteratively to chain each local probability feature into an overall probability for the text in question.

One failing of the Bayesian chain rule is that strictly speaking it is only valid in the case of all features being statistically independent. This is emphatically not the case in text analysis; the words are not chosen randomly but with a very significant correllation. What is remarkable about Bayesian text classifiers is that the Bayesian classifiers work so well even with this gaping fundamental error.

To avoid the error of presumed decorrellation, is possible to use a chi-squared or other combining rules. SpamBayes uses a modified Chi-squared combining rule.

The overall success of Bayesian filters is visible in two ways- first, the large number of freely available filters. At last check, over twenty different Bayesian-style spamfilters were available on Sourceforge.net. Almost all of these filters quoted demonstrable accuracies on the order of 99.9% accuracy. By comparison, a human is only about 99.84% accurate in filtering spam and nonspam, so any of these filters is more effective than a human “correspondence secretary”.

The extreme effectiveness of Bayesian filters has not been lost on the one group whose livelihood is most affected by the filters: the spammers themselves. While spammers used to evade simple checksumming filters by adding a few randomly chosen words or gibberish strings to the text and subject, spammers now typically use subterfuges such as:

  • “Long Story” spams, where the first page or two of a spam seems to be actual innocent email.

  • “Dictionary Salad” spams, where large sections of the text are randomly chosen dictionary words. The spammers apparently are hoping to drown their signal in the random dictionary noise.
  • “News Story” spam, where the initial text is copied verbatim from a news report of a current popular news event completely unrelated to the spam “payload”.
  • “Habeas Haiku” spams. The Habeas Haiku is a short poem whose copyright is owned by Habeas Inc, a group of lawyers. Initially use of the Habeas Haiku was to be under license only to companies agreeing to obey Habeas' code of email conduct, but the haiku has now been co-opted by spammers to the extent that, statistically speaking, the presence of the haiku is a strong statistical predictor of spam. This switchover from “nonspam” to “spam” indication occurred on one test stream in less than one week of time.

Spammers have also begun another very interesting tactic- penetration of well-credentialled and widely-distributed mailing lists, such as alumni and special-interest mailing lists. These well-credentialled lists are often whitelisted by their subscribers and hence the spam is delivered without significant filtering.

Comparison of Methods and Variations of Bayesian Antispam Filtering

A consistent comparison experiment for different filtering techniques needs to operate against a consistent incoming data stream. Because of the significant variation in spams from day to day, for our experimentation we used the standard testing corpus available from the SpamAssassin web site.

This test set contains 4,147 messages, pre-classified into spam, easy nonspam, and hard nonspam. Of the messages, 1400 are spam, and the remainder are nonspam.

This test is a severe torture test of filtering- the author can rarely score 90% accuracy on this torture test, while in real life the author can achieve about 99.84% accuracy.

To provide a larger test set, the set of 4147 messages was shuffled ten times, with each shuffle providing an effectively random sequencing of the messages. Each of the ten shuffles was preserved so that every experimental configuration would get the same ten sequences of the messages.

Except as noted, the last 500 messages of each shuffle were “reserved” as the final evaluation testing set, and errors were summed, so each experimental configuration had 5,000 test attempts. Thus, our raw output statistics are in errors per 5,000 in the test.

Just as in real life, the testing procedure ran for the complete set of 4147 messages, and training was performed as specified even during the final 500 (evaluation) messages. After a complete run of 4147 messages were done, the learning databases were deleted so the next run would not only execute against a different shuffle, but also without any a priori training.

Next >

DISCLAIMER
Although we do our best to provide our users with useful and accurate information on our web site, we do not update this information which is derived from sources believed to be accurate. Users must understand that information presented does not serve as an endorsement of any particular company or individual and that this information changes frequently and is subject to differing interpretations. Users are hereby advised that they are responsible for ensuring that the facts and general advice obtained from our site are applicable to their specific situations and should discuss their specific tax, business, financial, and legal matters with pertinent professionals.

 

» Welcome
» SPAMarkov Guided Tour
» Features and Download
» Store
» Reviews
» Forum
» Help and Support
» About Us
» Contact Us
   
» Link Exchange Instructions
» Our Partners
» Site Map

 

Subscribe to Our Newsletter

Name:
Email:
Country:
Your question or comment:

 

Free Outlook Express Spam Filter
© Copyright 2005, SPAMarkov.com. | Free Outlook Express Spam Filter
Live Support! Download FREE Trial Here!