
January 9, 2007

Statistical models and spam

In the old days, internet technologies were developed for ethical, well-behaved people. When the hordes were unleashed on the internet, the old technologies could not cope with the bad behavior, and it has proved very hard to change the underlying fabric of internet standards and protocols. Spam is one of the most annoying of the resulting problems. Spam filtering is the automated classification of messages (e-mails, but also blog comments, blog trackbacks, instant messages and so on) into the good (ham) and the bad (spam).

So-called Bayesian filtering was popularized by Paul Graham a few years ago in his essays "A Plan for Spam" and "Better Bayesian Filtering", but it goes back to work at Microsoft Research, first on detecting insults in 1997 and then on junk mail in 1998. The traditional way of dealing with spam has been to identify individual words that seem to be overrepresented in spam (Viagra, free, money, casino, games). The appearance of each such word increases the log-odds that the email is spam. On the other hand, the appearance of words related to one's interests and work decreases the log-odds that the email is spam. When we sum up all these log-odds, we obtain a score, which is used to classify the email. This approach is known as naive Bayesian classification. Of course, "Bayesian" here is as in Bayes rule, not as in Bayesian statistics. Although models such as logistic regression and support vector machines would yield better accuracy for spam filtering, practitioners tell me that naive Bayes is still heavily used in practice because it scales to huge collections of data: given a list of spam emails and a list of non-spam emails, we can figure out how much log-odds to add or subtract for each word (or for some other aspect of the message, such as the number of all-uppercase words, or the maximum length of a sequence of exclamation marks).
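The log-odds bookkeeping can be sketched in a few lines of Python. This is only a minimal illustration, not the code of any particular filter: the function names, the toy messages and the add-one smoothing constant `alpha` are all assumptions made for the example.

```python
import math
from collections import Counter


def train_log_odds(spam_docs, ham_docs, alpha=1.0):
    """Return per-word log-odds weights and the prior log-odds of spam.

    Assumes both lists are non-empty; alpha is an add-alpha smoothing
    constant so that unseen words never get a zero probability.
    """
    spam_counts = Counter(w for d in spam_docs for w in d.lower().split())
    ham_counts = Counter(w for d in ham_docs for w in d.lower().split())
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + alpha * len(vocab)
    ham_total = sum(ham_counts.values()) + alpha * len(vocab)
    weights = {
        w: math.log((spam_counts[w] + alpha) / spam_total)
        - math.log((ham_counts[w] + alpha) / ham_total)
        for w in vocab
    }
    prior = math.log(len(spam_docs) / len(ham_docs))
    return weights, prior


def spam_score(message, weights, prior):
    """Sum the prior and the log-odds of every known word; > 0 leans spam."""
    return prior + sum(weights.get(w, 0.0) for w in message.lower().split())


# Toy training data, invented for this example.
SPAM = ["free money casino", "viagra free casino games"]
HAM = ["meeting about regression model", "regression paper draft attached"]

weights, prior = train_log_odds(SPAM, HAM)
print(spam_score("free casino money", weights, prior))       # positive
print(spam_score("regression model draft", weights, prior))  # negative
```

Training is a single pass of counting, which is why the approach scales so easily. Other aspects of a message, such as the number of all-uppercase words, could be added as extra tokens in the same way.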

My pet peeve with spam filtering is that it doesn't root the problem out. It merely provides an efficient broom to sweep it under the rug. Innocent users pay for filtering and wasted internet bandwidth, have to keep separating emails into spam and ham, and risk losing important emails to spam filters, whereas the spammers incur practically no penalties, only profits. Furthermore, the adaptive aspect of filtering has been exploited in the adversarial strategy of Bayesian poisoning, where messages that will be classified as spam are made up of purely legitimate words, and where legitimate words are injected into the spam messages. Moreover, spammy messages are now stored in images, which cannot be easily filtered automatically. For these reasons, the effectiveness of spam filters has gone down over the past year or two. Until we get internet postage stamps, internet taxes and internet police, I would prefer vigilante approaches such as the notorious Make Love not Spam. But pessimists like to use the following cannot-solve-spam form.

It is still refreshing to observe new developments. I have come across a paper on the next generation of spam filtering techniques, "Spam Filtering Using Statistical Data Compression Models", by some of my former colleagues. Andrej Bratko et al. have found that models based on individual letters outperform models based on word counts. For example, their method can exploit the indentation pattern "> ", which is far more frequent in legitimate emails than in spam. With sufficient training, such models would also be able to detect misspellings and foreign languages. Although they do not root the problem out, they can still buy some time.
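To give a flavor of letter-level modeling, here is a rough Python sketch. It is not the method from the paper: as a stand-in for a real adaptive compression model, it uses a fixed-order character model with add-one smoothing, trains one model per class, and assigns a message to the class whose model would encode it in fewer bits. The class name `CharModel`, the order of 2, the 256-symbol alphabet and the toy texts are all assumptions made for the example.

```python
import math
from collections import defaultdict


class CharModel:
    """Order-k character model with add-one smoothing (illustrative only)."""

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.context_counts = defaultdict(int)
        self.pair_counts = defaultdict(int)

    def train(self, text):
        # Pad with spaces so the first characters also have a context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            ctx = padded[i - self.order:i]
            self.context_counts[ctx] += 1
            self.pair_counts[(ctx, padded[i])] += 1

    def bits(self, text):
        """Approximate code length of text, in bits, under this model."""
        padded = " " * self.order + text
        total = 0.0
        for i in range(self.order, len(padded)):
            ctx = padded[i - self.order:i]
            seen = self.pair_counts.get((ctx, padded[i]), 0)
            p = (seen + 1) / (self.context_counts.get(ctx, 0) + self.alphabet_size)
            total -= math.log2(p)
        return total


def classify(message, spam_model, ham_model):
    """Pick the class whose model encodes the message more cheaply."""
    if spam_model.bits(message) < ham_model.bits(message):
        return "spam"
    return "ham"


# Toy training texts, invented for this example.
spam_model, ham_model = CharModel(), CharModel()
spam_model.train("FREE!!! casino money, click now!!!")
ham_model.train("> I agree with your comment on the model\n> see the attached draft")

print(classify("casino money now!!!", spam_model, ham_model))
print(classify("> thanks for the draft", spam_model, ham_model))
```

Because the model works on characters, patterns such as the quoting prefix "> " or runs of exclamation marks are picked up without any tokenization or hand-built features.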

Posted by Aleks at January 9, 2007 10:24 AM


Comments

Because spam almost always has a goal, usually to sell a product, it can be detected fairly well with a naive Bayes filter. On my 600 emails a day, a naive Bayes filter might produce one false negative a week and even fewer false positives (treating ham as spam). Perhaps the problems you mention with a naive Bayes filter arise from using the same Bayes filter for 5 years -- you must retrain (update your Bayes priors) once a year with emails representing perhaps 100 spam and 100 ham. Overtraining on ham can be preferable, since the prior can prefer the spam/ham on which it was most trained.

But in some other arenas, e.g., 300,000 internal reports from various automobile manufacturers, a naive Bayes filter correctly detects only 27 percent of the critical reports deserving of a possible automobile recall.

The best Bayes filter for spam is considered to be the spam filter crm114. When applied to those 300,000 internal reports, it correctly identifies 87 percent of the critical reports, better than the 50 percent correctness achieved by having each report read by 12 contractors costing $1 million each year. Pure naive Bayes presumes independence between words. Crm114 greatly improves results by filtering on all combinations of words, 5 words at a time; e.g., "send money to injured son" and "money to injured son inlaw". Adding the heuristic TOE (train on error) prevents cluttering the priors with unhelpful data. Adding the heuristic TUNE (train until no error) reconsiders each report as if it were not already incorporated in the Bayes prior, and if that report would get classified incorrectly, then it is again incorporated into the prior -- recycling through all 300,000 reports several times. It is these heuristics -- a wrong solution to the right problem rather than the right solution (naive Bayes) to the wrong problem -- that make crm114 the best Bayes filter.

Posted by: Jameson Burt at January 17, 2007 2:53 PM.
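The word-window and TOE/TUNE ideas from this comment can be sketched in Python as follows. This is not crm114's code or its actual feature scheme: `window_features`, `CountingFilter` and `train_until_no_error` are hypothetical names, the scoring rule is a simple smoothed count ratio, and the four reports are invented. The loop also simplifies TUNE by re-classifying each item with the current model instead of first removing it from the counts.

```python
import math
from collections import Counter
from itertools import combinations


def window_features(text, window=5):
    """Yield word tuples: each word joined with every subset of the
    up-to-(window - 1) words before it, with word order preserved."""
    words = text.lower().split()
    for i, word in enumerate(words):
        previous = words[max(0, i - window + 1):i]
        for r in range(len(previous) + 1):
            for subset in combinations(previous, r):
                yield subset + (word,)


class CountingFilter:
    """Toy two-class filter over window features (hypothetical, not crm114)."""

    def __init__(self):
        self.counts = {"critical": Counter(), "routine": Counter()}

    def learn(self, text, label):
        self.counts[label].update(window_features(text))

    def classify(self, text):
        score = 0.0
        for f in window_features(text):
            score += math.log((self.counts["critical"][f] + 1)
                              / (self.counts["routine"][f] + 1))
        return "critical" if score > 0 else "routine"


def train_until_no_error(model, corpus, max_passes=10):
    """TOE inside a TUNE loop: learn only the misclassified items, and
    repeat passes over the corpus until one pass makes no mistakes."""
    for passes in range(1, max_passes + 1):
        errors = 0
        for text, label in corpus:
            if model.classify(text) != label:
                model.learn(text, label)
                errors += 1
        if errors == 0:
            return passes
    return max_passes


# Invented reports, for illustration only.
corpus = [
    ("brake failure caused crash", "critical"),
    ("customer asked about paint color", "routine"),
    ("brake pedal squeaks slightly", "routine"),
    ("engine fire after crash", "critical"),
]
model = CountingFilter()
passes = train_until_no_error(model, corpus)
print(passes, [model.classify(text) for text, _ in corpus])
```

Note that items the filter already gets right are never added to the counts, which is the point of TOE: the stored statistics stay focused on the cases that were actually hard.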

Aleks

Hey, what's the deal? Can we get a better spam filter at Columbia? Also, I think there's a way to use a blacklist to clean up our blog comment spam, but I can't remember how to do it.

Posted by: Andrew [TypeKey Profile Page] at January 17, 2007 4:51 PM.

Jameson's comment seems to be some sort of advertisement for crm114. I don't know about its quality, but thanks for the TOE/TUNE description; it might be useful.

Posted by: Aleks [TypeKey Profile Page] at January 18, 2007 11:07 AM.

Make Love not Spam no longer works.

Apparently someone created a computer virus with that name, and the MLNS people decided to stop distributing the legitimate software to avoid confusion.

Posted by: Janet at January 29, 2007 9:10 AM.
