The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It
01 | 02 | 03
Training Methods Comparison
The first experimental set was to determine the relative merits of various training methods. As we have considerable experience with Sparse Binary Polynomial Hash features and a Bayesian Chain Rule, we set up a SBPH/BCR classifier and ran it with three different training methods:
-
TEFT - Train Every Thing – for each member of the shuffle set, classify the text, record the output (correct or incorrect), and then force train that piece of text into the database in the correct category.
- TOE – Train Only Errors – for each member of the shuffle set, classify the text, record the output (correct or incorrect), and if the text was incorrectly classified, train that text into the database in the correct category.
- TUNE – Train Until No Errors – for the first N-500 messages, repeatedly classify and train the messages if incorrect. After this intensive training test and record the final 500 texts, doing training if an error occurred. (Because there is no gaurantee of convergence, training was terminated if only a few messages in the training set were not trained. Of all 42470 messages in the ten training sections, only three were not correctly classified.)
The results are summarized below in Table 1. The execution processor was a Transmeta 666 with Red Hat Linux 7.3 and 128 megabytes of memory. The feature hashtable size was limited to 1,000,000 slots, and slots were randomly decremented when there were no unused slots available.
| Training Method |
Errors per 5000 |
Approximate time for all 10 shuffles |
TEFT |
149 |
6 hrs |
TOE |
69 |
3 hrs |
TUNE |
54 |
20 hrs |
Table 1 Comparison of TEFT, TOE, and TUNE training methods
Note that TOE training is almost as good as TUNE training, with only a small fraction of the CPU time expended.
Even though TUNE is somewhat more accurate than TOE training, TUNE requires keeping the entire prior history of both spam and nonspam. In a testing situation with repeated execution, this is acceptable, but in a real-life deployment this would require a huge amount of storage per user. A rough estimate might be as much as
250 megabytes per year per user, even if duplicate spams are dropped from the database.
The additional CPU time needed for TUNE processing is nontrivial. With a back-catalog of 4147 messages, it took approximately two hours to re-classify and retrain sort; this process would need to be repeated whenever any training occurred. For an individual user with a dedicated CPU, this is acceptable, but even a small ISP (with, say, 1000 accounts) could not deploy such a system.
Comparison of Extensions to Bayesian Filtering
Taking the results of table 1 to indicate that TOE (train only errors) training is acceptable in performance and accuracy, we can now consider the relative accuracies of different extensions of a Bayesian antispam filter.
For these experiments, we again use the 4147 corpus messages, in the same ten shuffles, clearing out any learned memory between each shuffle. Each extension to Bayesian filtering will see the same ten shuffles, and each will be trained with in the TOE method. Each method will be allowed at most 1,000,000 feature slots, and the same random deletion of old slots to make room for new data was activated, however none of these extensions needed to purge older data.
The different variations tested were:
These tests were run on the same system as the training tests. Run times did not differ substantially from TOE-trained SBPH/BCR in table 1;
-
| Evaluator |
Errors per 5000 |
Pure Bayesian matching |
92 |
Peak Window Value |
80 |
Token Sequence Sensitive |
78 |
Token Grab Bag |
71 |
Sparse Binary Polynomal Hash |
69 |
Markovian matching |
56 |
Table 2 Accuracies of various Evaluators
Interestingly, Token Sequence Sensitive (TSS) scored as a weaker system than Token Grab Bag (TGB), by a statistically significant amount. One might assume that the extra information available to a TSS filter would give higher performance, but apparently in the face of spammer countermeasures such as random word insertion, we find that TSS does not perform as well as TGB. Interestingly, SBPH and TGB are very close in performance.
The best performance of all of the filters was found to be the Markovian matching filter, so some further details are in order.
The Markovian matching filter does not attempt to generate the actual Markov model of the incoming texts, for computational tractability. Instead, it uses a hashed representation of the known example texts in each corpus as local sections of the large (and mostly unknown) Markov models of spam and nonspam.
The implementation of this is quite simple. Instead of chaining tokens (as described in Zdziarski, the features are weighted to give greatly increased credence to evidence found in longer features.
For example, in the text “The quick brown fox jumped”, the features and their associated weights would be
-
| Feature |
Weight |
The |
1 |
The quick |
4 |
The <skip> brown |
4 |
The quick brown |
16 |
The <skip> <skip> fox |
4 |
The quick <skip> fox |
16 |
The <skip> brown fox |
16 |
The quick brown fox |
64 |
The <skip> <skip> <skip> jumped |
4 |
The quick <skip> <skip> jumped |
16 |
The <skip> brown <skip> jumped |
16 |
The quick brown <skip> jumped |
64 |
The <skip> <skip> fox jumped |
16 |
The quick <skip> fox jumped |
64 |
The <skip> brown fox jumped |
64 |
The quick brown fox jumped |
256 |
Table 3 Example Features and Weights for Markovian matching
< Previous Next >
DISCLAIMER
Although we do our best to provide our users with useful and accurate information on our web site, we do not update this information which is derived from sources believed to be accurate. Users must understand that information presented does not serve as an endorsement of any particular company or individual and that this information changes frequently and is subject to differing interpretations. Users are hereby advised that they are responsible for ensuring that the facts and general advice obtained from our site are applicable to their specific situations and should discuss their specific tax, business, financial, and legal matters with pertinent professionals.
|