Free Outlook Express Spam FilterAnti-Spam Blocker Software For Microsoft

The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It

01 | 02 | 03

Training Methods Comparison

The first experimental set was to determine the relative merits of various training methods. As we have considerable experience with Sparse Binary Polynomial Hash features and a Bayesian Chain Rule, we set up a SBPH/BCR classifier and ran it with three different training methods:

  • TEFT - Train Every Thing – for each member of the shuffle set, classify the text, record the output (correct or incorrect), and then force train that piece of text into the database in the correct category.

  • TOE – Train Only Errors – for each member of the shuffle set, classify the text, record the output (correct or incorrect), and if the text was incorrectly classified, train that text into the database in the correct category.
  • TUNE – Train Until No Errors – for the first N-500 messages, repeatedly classify and train the messages if incorrect. After this intensive training test and record the final 500 texts, doing training if an error occurred. (Because there is no gaurantee of convergence, training was terminated if only a few messages in the training set were not trained. Of all 42470 messages in the ten training sections, only three were not correctly classified.)

The results are summarized below in Table 1. The execution processor was a Transmeta 666 with Red Hat Linux 7.3 and 128 megabytes of memory. The feature hashtable size was limited to 1,000,000 slots, and slots were randomly decremented when there were no unused slots available.

Training Method

Errors per 5000

Approximate time for all 10 shuffles

TEFT

149

6 hrs

TOE

69

3 hrs

TUNE

54

20 hrs

Table 1 Comparison of TEFT, TOE, and TUNE training methods

Note that TOE training is almost as good as TUNE training, with only a small fraction of the CPU time expended.

Even though TUNE is somewhat more accurate than TOE training, TUNE requires keeping the entire prior history of both spam and nonspam. In a testing situation with repeated execution, this is acceptable, but in a real-life deployment this would require a huge amount of storage per user. A rough estimate might be as much as

250 megabytes per year per user, even if duplicate spams are dropped from the database.

The additional CPU time needed for TUNE processing is nontrivial. With a back-catalog of 4147 messages, it took approximately two hours to re-classify and retrain sort; this process would need to be repeated whenever any training occurred. For an individual user with a dedicated CPU, this is acceptable, but even a small ISP (with, say, 1000 accounts) could not deploy such a system.

Comparison of Extensions to Bayesian Filtering

Taking the results of table 1 to indicate that TOE (train only errors) training is acceptable in performance and accuracy, we can now consider the relative accuracies of different extensions of a Bayesian antispam filter.

For these experiments, we again use the 4147 corpus messages, in the same ten shuffles, clearing out any learned memory between each shuffle. Each extension to Bayesian filtering will see the same ten shuffles, and each will be trained with in the TOE method. Each method will be allowed at most 1,000,000 feature slots, and the same random deletion of old slots to make room for new data was activated, however none of these extensions needed to purge older data.

The different variations tested were:

  • Standard Bayesian - each word is a feature. This is the method proposed by Graham and used in most antispam filters. There was no clipping in either the database or the text; every feature was retained and used in the evaluation.

  • Token Grab Bag – A sliding window of five words is moved across the input text. All combinations of those five words (but requiring the first word in the window) are taken in an order-insensitive way. Every such order-insensitive combination is a feature.
  • Token Sequence Sensitive – A sliding window of five words is moved across the input text. All combinations of word deletions are applied (except that the first word in the window is never deleted) and the resulting sequence-sensitive set of words is used as a feature.
  • Sparse Binary Polynomial Hashing with Bayesian Chain Rule (SBPH/BCR) – As described in Yerazunis a sliding window of five words is moved across the input. A feature is the sequence-and-spacing-sensitive set of all possible combinations of those five words, except that the first word in the window is always included.
  • Peaking Sparse Binary Polynomial Hashing – this is similar to SBPH/BCR, except that for each window position, only the feature with the most extreme probability (furthest from 0.5) is used. The other features generated at that window position are disregarded. This is an attempt to 'decouple” the sequences of words in a text and make the Bayesian chain rule more appropriate.
  • Markovian matching – This is similar to Sparse Binary Polynomial Hashing but the individual features are given variable weights. These weights are assigned in a superincreasing way, so that a feature that contains more words than any of it's subfeatures can outweigh all of it's sub-features combined.

These tests were run on the same system as the training tests. Run times did not differ substantially from TOE-trained SBPH/BCR in table 1;

Evaluator

Errors per 5000

Pure Bayesian matching

92

Peak Window Value

80

Token Sequence Sensitive

78

Token Grab Bag

71

Sparse Binary Polynomal Hash

69

Markovian matching

56

Table 2 Accuracies of various Evaluators

Interestingly, Token Sequence Sensitive (TSS) scored as a weaker system than Token Grab Bag (TGB), by a statistically significant amount. One might assume that the extra information available to a TSS filter would give higher performance, but apparently in the face of spammer countermeasures such as random word insertion, we find that TSS does not perform as well as TGB. Interestingly, SBPH and TGB are very close in performance.

The best performance of all of the filters was found to be the Markovian matching filter, so some further details are in order.

The Markovian matching filter does not attempt to generate the actual Markov model of the incoming texts, for computational tractability. Instead, it uses a hashed representation of the known example texts in each corpus as local sections of the large (and mostly unknown) Markov models of spam and nonspam.

The implementation of this is quite simple. Instead of chaining tokens (as described in Zdziarski, the features are weighted to give greatly increased credence to evidence found in longer features.

For example, in the text “The quick brown fox jumped”, the features and their associated weights would be

Feature

Weight

The

1

The quick

4

The <skip> brown

4

The quick brown

16

The <skip> <skip> fox

4

The quick <skip> fox

16

The <skip> brown fox

16

The quick brown fox

64

The <skip> <skip> <skip> jumped

4

The quick <skip> <skip> jumped

16

The <skip> brown <skip> jumped

16

The quick brown <skip> jumped

64

The <skip> <skip> fox jumped

16

The quick <skip> fox jumped

64

The <skip> brown fox jumped

64

The quick brown fox jumped

256

Table 3 Example Features and Weights for Markovian matching

< Previous        Next >

DISCLAIMER
Although we do our best to provide our users with useful and accurate information on our web site, we do not update this information which is derived from sources believed to be accurate. Users must understand that information presented does not serve as an endorsement of any particular company or individual and that this information changes frequently and is subject to differing interpretations. Users are hereby advised that they are responsible for ensuring that the facts and general advice obtained from our site are applicable to their specific situations and should discuss their specific tax, business, financial, and legal matters with pertinent professionals.

 

» Welcome
» SPAMarkov Guided Tour
» Features and Download
» Store
» Reviews
» Forum
» Help and Support
» About Us
» Contact Us
   
» Link Exchange Instructions
» Our Partners
» Site Map

 

Subscribe to Our Newsletter

Name:
Email:
Country:
Your question or comment:

 

Free Outlook Express Spam Filter
© Copyright 2005, SPAMarkov.com. | Free Outlook Express Spam Filter
Live Support! Download FREE Trial Here!