Naive Bayesian Filtering
The Naive Bayesian Filtering classifier (also known as Idiot's Bayes) is a simple probabilistic classifier, and it is used in many spam filters for Outlook. Naive Bayesian Filtering classifiers are based on probability models that incorporate strong independence assumptions. These assumptions often have no bearing on reality, which is why the classifiers are called (deliberately) naive. A more descriptive term for the underlying probability model would be "independent feature model". The probability model can be derived using Bayes' theorem (credited to Thomas Bayes), and it is used in many free spam blockers for Outlook Express.
Depending on the precise nature of the probability model, Naive Bayesian Filtering classifiers can be trained efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.
Despite their naive design and apparently over-simplified assumptions, Naive Bayesian Filtering classifiers often work much better in complex real-world situations than their simple design would suggest, although the CRM114 filter is more advanced still. Careful analysis of the Bayesian classification problem has shown that there are sound theoretical reasons for the apparently unreasonable efficacy of Naive Bayesian Filtering classifiers.
In probability and statistics, the term Bayesian refers to one of the following:
- methods associated with the Reverend Thomas Bayes (ca. 1702–1761); or
- the degree-of-belief interpretation of probability, as opposed to frequency or proportion or propensity interpretations; or
- Bayes' theorem on conditional probability.
Most of the development of these ideas took place after Bayes' death, and includes:
• Bayesian probability
• Bayesian inference
• Bayesian network
• Bayes factor
• Bayesian model comparison
• Bayesian filtering
• Empirical Bayes method
• Naive Bayes classifier
• Bayesian game
Bayesian filtering is the process of using Bayesian statistical methods to classify documents into categories. This method is being used today in anti-spam software for Outlook.
Bayesian filtering gained attention when it was described in the paper A Plan for Spam by Paul Graham, and has become a popular mechanism to distinguish illegitimate spam email from legitimate "ham" email. Many modern mail programs, such as Mozilla Thunderbird, and spam filters for Microsoft Outlook implement Bayesian spam filtering. Server-side email filters, such as SpamAssassin and ASSP, make use of Bayesian spam filtering techniques, and the functionality is sometimes embedded within mail server software itself.
Bayesian email filters and Outlook Express spam blockers take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email.
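Written as a formula, the statement above reads as follows. The symbols are notation introduced here for illustration, not taken from the original text: S stands for "the email is spam", H for "the email is ham", and W for "the email contains these words". The second line expands the denominator over the two classes, assuming every email is either spam or ham.

```latex
P(S \mid W) = \frac{P(W \mid S)\, P(S)}{P(W)},
\qquad
P(W) = P(W \mid S)\, P(S) + P(W \mid H)\, P(H)
```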
Particular words have particular probabilities of occurring in spam email and in ham email, and this holds for spam received in Outlook 2003 as much as anywhere else. For instance, most email users will often meet the word Viagra in spam email, but will seldom see it in ham email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or ham. For all words in each training email, the filter will then adjust the words' spam and ham probabilities in its database. For instance, Bayesian spam filters, including those for Outlook Express, will typically have learned a high spam probability for the words "Viagra" and "refinance", but a low spam probability (and a high ham probability) for words seen only in ham email, such as the names of friends and family members.
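The training step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular filter: the names `tokenize` and `WordStats` are hypothetical, the filter counts how many emails of each class contain a word (rather than total occurrences), it assumes spam and ham are equally likely a priori, and it uses add-one smoothing so that unseen words get a neutral probability of 0.5.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class WordStats:
    """Hypothetical word database built from user-labelled emails."""

    def __init__(self):
        self.spam_counts = Counter()  # number of spam emails containing each word
        self.ham_counts = Counter()   # number of ham emails containing each word
        self.n_spam = 0               # spam emails seen so far
        self.n_ham = 0                # ham emails seen so far

    def train(self, text, is_spam):
        """Update the counts from one email the user has labelled."""
        words = set(tokenize(text))   # count each word once per email
        if is_spam:
            self.n_spam += 1
            self.spam_counts.update(words)
        else:
            self.n_ham += 1
            self.ham_counts.update(words)

    def spam_probability(self, word):
        """Estimate P(spam | word), assuming equal priors for spam and ham.

        Add-one smoothing (an assumption of this sketch) keeps the
        estimate away from exactly 0 or 1.
        """
        p_word_given_spam = (self.spam_counts[word] + 1) / (self.n_spam + 2)
        p_word_given_ham = (self.ham_counts[word] + 1) / (self.n_ham + 2)
        return p_word_given_spam / (p_word_given_spam + p_word_given_ham)


# Toy training data, invented for illustration only.
stats = WordStats()
stats.train("Cheap Viagra, refinance now", is_spam=True)
stats.train("Refinance your mortgage today", is_spam=True)
stats.train("Lunch with Alice on Friday?", is_spam=False)
stats.train("Alice sent the meeting notes", is_spam=False)

print(stats.spam_probability("refinance"))  # above 0.5: seen only in spam
print(stats.spam_probability("alice"))      # below 0.5: seen only in ham
```

With real mail the counts would come from thousands of messages, and the same `train` call serves for corrective training when the user relabels a misclassified email.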
After training, the spam and ham word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either the spam or ham class. Each word in the email contributes to the email's spam probability. This contribution is called the posterior probability and is computed using Bayes' theorem. The contributions of all words in the email are then combined into a single spam probability for the email, and if that probability exceeds a certain threshold (say 95%), the filter will mark the email as spam. Email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright.
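One common way to combine the per-word probabilities is sketched below. It assumes the words are independent (the naive assumption) and that spam and ham are equally likely a priori, which gives p = (p1 × ... × pn) / (p1 × ... × pn + (1 − p1) × ... × (1 − pn)). The function names, the word table, and the default of 0.5 for unknown words are all hypothetical choices made for this illustration; real filters differ in the details.

```python
import math


def combine(word_probs, eps=1e-6):
    """Combine per-word P(spam | word) values into one P(spam | email).

    Computes prod(p) / (prod(p) + prod(1 - p)) in log space, so that
    long emails do not underflow to zero.
    """
    log_spam = 0.0
    log_ham = 0.0
    for p in word_probs:
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    diff = log_ham - log_spam
    # Numerically stable form of 1 / (1 + exp(diff)).
    if diff >= 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


def classify(words, word_spam_prob, threshold=0.95, default=0.5):
    """Return (spam probability, is_spam) for a list of words."""
    probs = [word_spam_prob.get(w, default) for w in words]
    score = combine(probs)
    return score, score > threshold


# Per-word spam probabilities, invented for illustration only.
table = {"viagra": 0.99, "refinance": 0.90, "alice": 0.05, "lunch": 0.10}

spam_score, spam_flag = classify(["viagra", "refinance", "now"], table)
ham_score, ham_flag = classify(["lunch", "alice"], table)
print(spam_score, spam_flag)  # high score, marked as spam
print(ham_score, ham_flag)    # low score, not marked
```

Working with logarithms is a design choice rather than a requirement: multiplying hundreds of small probabilities directly would round to zero in floating point, whereas the sum of their logarithms stays well within range.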
The advantage of Bayesian spam filtering in Outlook Express spam filters is that it can be trained on a per-user basis. The spam that a user receives often has some relevance to her, and defines the characteristic spam likelihood function for her filter. For example, placing a personal ad may increase the amount of personal-ad-related spam that she gets. Her Bayesian spam filter would then learn a higher spam probability for words common to personal-ad-related spam than it would if it were trained on some other user's email. The ham that she gets will also tend to be relevant to her. Many of her coworkers, friends, and family members will talk about related subjects and so use similar words, generating a characteristic ham likelihood function. These two likelihood functions are unique to each user and can evolve over time with corrective training whenever the filter incorrectly classifies an email. As a result, Bayesian spam filtering accuracy can be excellent, often superior to pre-defined rules. SpamAssassin can combine the results from both Bayesian spam filtering and pre-defined rules, leading to even higher filtering accuracy.
Recent spammer tactics include inserting random words that are not normally associated with spam, thereby decreasing the email's spam score and increasing its ham score, which makes it more likely to slip past a Bayesian spam filter.
While Naive Bayesian Filtering classifiers are widely used to detect spam email, including spam in Outlook 2000, the technique can classify (or "cluster") almost any sort of data. It has uses in science, medicine, and engineering. One example is a general-purpose classification program called AutoClass, which was originally used to classify stars according to spectral characteristics that were otherwise too subtle to notice. There is even speculation that the brain uses Naive Bayesian Filtering classifier methods to classify sensory stimuli and decide on behavioral responses.