Free Outlook Express Spam FilterAnti-Spam Blocker Software For Microsoft

The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It


In this experiment, we used superincreasing weights as determined by the formula

$$\text{Weight} = 2^{2(N-1)}$$

Thus, for features containing N = 1, 2, 3, 4, and 5 words, the weights of those features would be 1, 4, 16, 64, and 256 respectively. These weights are used to bias the local probabilities of short features toward 0.5, as in

$$P_{\text{local-spam}} = 0.5 + \frac{(N_{\text{spam}} - N_{\text{nonspam}}) \cdot \text{Weight}}{C_1 \cdot (N_{\text{spam}} + N_{\text{nonspam}} + C_2) \cdot \text{WeightMax}}$$
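The weighting and the local-probability formula above can be sketched in a few lines of Python. This is only an illustration, not the filter's actual code: the weights follow the listed values 1, 4, 16, 64, 256 (that is, 4 to the power N-1), WeightMax is assumed to be the weight of the longest (5-word) feature, and the constants C1 = 2 and C2 = 1 are assumptions chosen so that the result stays between 0 and 1. The function names are hypothetical.

```python
def feature_weight(n_words: int) -> int:
    """Superincreasing weight for a feature of n_words words: 1, 4, 16, 64, 256."""
    return 4 ** (n_words - 1)


# Assumption: WeightMax is the weight of the longest feature in a 5-word window.
WEIGHT_MAX = feature_weight(5)


def local_spam_probability(n_spam: int, n_nonspam: int, n_words: int,
                           c1: float = 2.0, c2: float = 1.0) -> float:
    """Local spam probability of one feature.

    n_spam and n_nonspam are the feature's occurrence counts in each corpus.
    c1 and c2 are assumed values, not taken from the article.
    """
    weight = feature_weight(n_words)
    numerator = (n_spam - n_nonspam) * weight
    denominator = c1 * (n_spam + n_nonspam + c2) * WEIGHT_MAX
    return 0.5 + numerator / denominator


if __name__ == "__main__":
    # Same counts, different feature lengths: the short feature stays near 0.5.
    print(local_spam_probability(10, 0, n_words=1))
    print(local_spam_probability(10, 0, n_words=5))
```

With these assumed constants, a feature seen ten times in spam and never in nonspam gets a local probability of about 0.502 if it is one word long and about 0.95 if it is five words long, which is the biasing effect described above.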

A Theoretical Justification for the Observed Markovian Accuracy

One might ask why a Markovian filter would be significantly more accurate than the very closely related Sparse Binary Polynomial Hash (SBPH) filter. One hypothesis is that the SBPH filter is still a linear filter, while a Markovian filter with superincreasing weights is nonlinear. One can envision the filtering domain of SBPH as a high-dimensional hyperspace (one hyperspatial dimension per feature). However, the partition between spam and nonspam is still a flat plane in this hyperspace; texts on one side of the plane are judged to be spam, and those on the other side are judged to be nonspam.

Minsky and Papert showed that such a linear filter cannot implement anything even as complex as a mathematical XOR operation. That is, linear filters cannot learn (or even express!) anything as complex as “A or B but NOT BOTH”. The classic Bayesian spam filter is still a linear filter; as an example, a Bayesian filter cannot be trained to allow “viagra” or “pharmacy” but not both.

The Markovian filter with superincreasing weights is no longer a linear filter: because any feature of length N can override all features of length 1, 2, ..., N-1, a Markovian filter can easily learn situations where A or B but NOT BOTH is the correct partitioning. As implemented with a sliding window of length 5, the Markovian partitions the feature hyperspace along a quintic (fifth-order) curved surface, which may not even be fully connected.
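The override argument can be illustrated with a toy scorer. This is a deliberately simplified sketch and not the filter's probability combination: features are single words and unordered word pairs (the real filter uses ordered windows of up to five words), each learned feature casts a hypothetical vote of +1 for spam or -1 for nonspam, and the score is the weighted sum of the votes. The VOTES table and all function names are made up for the example.

```python
from itertools import combinations

# Hypothetical learned votes: +1 means "spam evidence", -1 means "nonspam evidence".
VOTES = {
    ("viagra",): -1,
    ("pharmacy",): -1,
    ("pharmacy", "viagra"): +1,
}


def features(words):
    """All single words and unordered word pairs present in the message."""
    unique = sorted(set(words))
    singles = [(w,) for w in unique]
    pairs = list(combinations(unique, 2))
    return singles + pairs


def flat(n_words):
    """Every feature counts equally, regardless of length."""
    return 1


def superincreasing(n_words):
    """Longer features outweigh all shorter ones: 1, 4, 16, ..."""
    return 4 ** (n_words - 1)


def score(words, weight):
    """Weighted sum of the votes of every known feature in the message."""
    return sum(weight(len(f)) * VOTES.get(f, 0) for f in features(words))


def is_spam(words, weight):
    return score(words, weight) > 0


if __name__ == "__main__":
    for message in (["viagra"], ["pharmacy"], ["viagra", "pharmacy"]):
        print(message, is_spam(message, flat), is_spam(message, superincreasing))
```

With flat weights, the two single-word votes outnumber the pair vote, so the message containing both words is still accepted. With superincreasing weights, the pair feature (weight 4) overrides the two single-word features (weight 1 each), so either word alone is allowed but both together are flagged.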

This theoretical justification is, unfortunately, not proven. It may be some other issue of implementation that causes a Markovian filter to have significantly better accuracy in these tests than a Bayesian. Therefore, the matter should be considered “likely”, but by no means “closed” in a scientific sense.

Also, we should note that the accuracy advantage of a Markovian over a Bayesian is significant: the Markovian made about 40% fewer errors on the same data streams. But a 40% improvement in filtering accuracy is only a few months' respite at the current month-to-month rate of increase in spam. Even a Markovian is no defense against the attack in which a spammer joins a well-credentialed list; this attack is becoming more and more common.

Inoculation and Minefielding

The next generations of spam filtering can take advantage of the DeSade observation, that is, that one man's pain is another man's pleasure. In this case, we use the pain of one person receiving a spam to train not only their own filters, but also the filters of a number of friends. In this situation, the error rate of the system for discriminatable spam scales inversely with the number of subscribers; with ten friends participating, you achieve a 10x improvement in filtering. However, this is still human-mediated training and hence subject to human error.

Despite this, we have programmed such inoculation-based systems, and preliminary results are that they do function well.
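The propagation step of inoculation might look roughly like the sketch below. It is not the system described above: the class and method names are hypothetical, and exact-match SHA-256 fingerprints stand in for real statistical training purely to keep the example short.

```python
import hashlib


class InoculatedFilter:
    """One user's filter, which also forwards its training to a list of friends."""

    def __init__(self, owner):
        self.owner = owner
        self.friends = []        # other InoculatedFilter instances
        self.known_spam = set()  # fingerprints of messages learned as spam

    @staticmethod
    def _fingerprint(message):
        # Stand-in for real training: an exact-match hash of the message text.
        return hashlib.sha256(message.encode("utf-8")).hexdigest()

    def learn(self, message):
        """Train this filter only."""
        self.known_spam.add(self._fingerprint(message))

    def report_spam(self, message):
        """The owner marks a message as spam; the training is shared with friends."""
        self.learn(message)
        for friend in self.friends:
            friend.learn(message)

    def is_spam(self, message):
        return self._fingerprint(message) in self.known_spam


if __name__ == "__main__":
    alice, bob = InoculatedFilter("alice"), InoculatedFilter("bob")
    alice.friends.append(bob)
    alice.report_spam("Cheap pills, click here")
    print(bob.is_spam("Cheap pills, click here"))
```

In this sketch only the first recipient has to read the spam; every friend's filter is already trained by the time the same message reaches them.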

A second observation is a site-wide rather than a per-user observation. If one observes the behavior of most spam campaigns, the same spammer will spam all of the accounts known to it on a given site in a very short period of time. In one case, this was every account on a small site in less than ten seconds.

Because of this rapid propagation of the spam, human reaction time is too slow: by the time the first human reads the spam, all of the mailbox filters will have already either accepted or refused the spam.

To counter this style of attack, one can add “email minefield” defenses. An email minefield is constructed by adding a large set of dummy email addresses into the address space of a site. These email addresses are then intentionally leaked to spammers.

Since no human would send email to those extra addresses, any email to those addresses is known a priori to be spam.

More usefully, spammers usually attempt to falsify their headers and hide their IP addresses. However, during the SMTP transaction from the spammer to the minefield address, the spammer must reveal their actual IP address. The spammer cannot spoof this address, as the SMTP transaction depends on at least the RCPT OK section of the transaction being delivered correctly to the spammer, and that can only happen if the spammer reveals a correct IP address during the socket setup phase.

At this point, the targeted site and any site cooperating with the targeted site can immediately blacklist the offending IP address. This blacklist can be either a “receive and discard” or a “refuse connection” situation.

Unfortunately, as John Graham-Cumming points out, it is possible to use a Bayesian filter against a Bayesian filter. In this process, the “evil Bayesian” filter is supplied with a large dictionary of words; it repeatedly sends spam with included random words to the “good” Bayesian filter and receives back negative acknowledgements when the “good” Bayesian filter correctly detects the spam. Graham-Cumming reports that the “evil Bayesian” will converge on the set of random dictionary words most likely to thwart the “good” Bayesian's filtering ability and allow spam to penetrate.

This result is extendable to any filtering algorithm; Graham-Cumming points out that the only true defense against it is to never return any feedback to the sender. Thus, our minefielding system must also behave in the same way to avoid disclosing which accounts are minefield accounts and which are real humans: incoming mail to either kind of account must be accepted, and only afterwards deleted if it is known to come from a spamming IP address.
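A minimal sketch of this accept-then-discard behavior follows. It is an in-memory model rather than an SMTP server: the class name, the method name, and the fixed "250 OK" reply string are illustrative assumptions, and unknown recipients are silently dropped for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class MinefieldSite:
    real_accounts: set
    minefield_accounts: set
    blacklist: set = field(default_factory=set)
    mailboxes: dict = field(default_factory=dict)

    def receive(self, peer_ip: str, recipient: str, message: str) -> str:
        """Handle one delivery attempt from the connecting peer_ip."""
        if recipient in self.minefield_accounts:
            # Mail to a minefield address is spam by construction:
            # remember the connecting IP address and discard the message.
            self.blacklist.add(peer_ip)
        elif recipient in self.real_accounts and peer_ip not in self.blacklist:
            self.mailboxes.setdefault(recipient, []).append(message)
        # Otherwise the message is accepted and quietly discarded.
        # The reply is identical in every case, so the sender gets no feedback.
        return "250 OK"


if __name__ == "__main__":
    site = MinefieldSite({"alice@example.com"}, {"trap1@example.com"})
    print(site.receive("203.0.113.7", "trap1@example.com", "Buy now"))
    print(site.receive("203.0.113.7", "alice@example.com", "Buy now"))
    print(site.mailboxes)
```

Because every delivery attempt gets the same reply, the sender cannot tell a minefield address from a real one, and cannot tell whether a message to a real account was delivered or discarded.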

Fortunately, the same rapidity of communication that allows a spammer to hit all of the accounts on a system also allows one system to even more rapidly communicate the source IP address to other systems. In this kind of shared realtime minefield, multiple cooperating sites dynamically blacklist individual IP addresses for short periods of time (a few hours to days) in response to spamming activity to a minefield account.

Further, because the dynamic minefield spans multiple sites for both sensing a spam attack and transmitting blacklist IP addresses, a very large number of accounts can be protected relatively easily and without human intervention to either add IP addresses to the blacklist or remove them from the blacklist.
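The short-lived, automatically maintained blacklist could be modeled as below. This is a single-process sketch under stated assumptions: the class name and the six-hour default are hypothetical, and the network protocol by which cooperating sites would exchange reports is left out entirely. Entries expire on their own, so no human has to add or remove addresses.

```python
import time


class SharedBlacklist:
    """IP blacklist whose entries expire after a fixed time-to-live."""

    def __init__(self, ttl_seconds=6 * 3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so the expiry logic can be tested
        self.expiry = {}     # ip -> timestamp at which the block ends

    def report(self, ip):
        """Called by any cooperating site when a minefield account is hit."""
        self.expiry[ip] = self.clock() + self.ttl

    def is_blocked(self, ip):
        expires = self.expiry.get(ip)
        if expires is None:
            return False
        if self.clock() >= expires:
            # The block has run out: drop the entry automatically.
            del self.expiry[ip]
            return False
        return True


if __name__ == "__main__":
    blacklist = SharedBlacklist(ttl_seconds=3600)
    blacklist.report("203.0.113.7")
    print(blacklist.is_blocked("203.0.113.7"))
    print(blacklist.is_blocked("198.51.100.2"))
```

A repeated report simply pushes the expiry time forward, so an address that keeps hitting minefield accounts stays blocked, while one that stops is released after the time-to-live.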

Dynamic minefielding in this style is currently in testing and is a subject for future work.

Conclusions:

Bayesian filtering may have reached a limit of accuracy; enhancements may be useful but the amount of information in a particular email is limited and an ever-increasing quality of filtering may be impossible.

Fortunately, correlating mail from multiple accounts, either with or without human intervention, will provide a significantly larger source of information, and one with a significantly higher limit.

Thanks and Credits

The author would like to thank all of the members of the CRM114 development mailing list in general, as well as Darren Leigh and Erik Piip

Graham, Paul, “A Plan for Spam”

Yerazunis, William S., “Sparse Binary Polynomial Hashing and the CRM114 Discriminator”, MIT Spam Conference

Zdziarski, Jonathan, “Advanced Language Classification using Chained Tokens”, MIT Spam Conference

Minsky, Marvin, and Papert, Seymour, “Perceptrons”, 1969

Graham-Cumming, John, “How to Beat a Bayesian Spam Filter”, MIT Spam Conference

© Copyright 2005, SPAMarkov.com.