The mysteries of the spam filter

I just received an email from “[email protected]” with the subject line “Your email just won £500,000 British Pounds in our anniversary promo.” This email went into my inbox; the spam filter did not catch it.

What I wanna know is, if “Your email just won £500,000 British Pounds in our anniversary promo” isn’t spam, what is???

5 thoughts on “The mysteries of the spam filter”

  1. As a Bayesian statistician, you might be interested in some issues in Bayesian spam filtering. One known issue is over-training.

    I'll assume you use Mozilla Thunderbird, but similar techniques may work for other systems.

    In Thunderbird, I've found that lowering the spamminess threshold below the default works much better. Specifically, I set the threshold to 70 via the following sequence:

    Edit -> Preferences -> Advanced -> Config Editor -> mail.adaptivefilters.junk_threshold

    I believe this does two things:

    1) It increases the chance that marginal stuff will be classified as spam, which catches more things.

    2) Early on during the training process, it increases the chance that marginal non-spam will be classified as spam. This gives me the chance to catch non-spam that has been misclassified and reverse the classification, which helps the spam filter avoid over-training.

    After a week or two, I almost never get anything misclassified in either direction. With the default setting of 90, I found that the filter let too many spam messages through and also became more likely to classify non-spam as spam if it contained spammy words (since it never saw any misclassifications early in the training process).

    If you make this change, I suggest you reset the training data and allow a week or so for retraining. To reset the training data, go to:

    Edit -> Preferences -> Privacy -> Junk tab -> Reset Training Data

    For a general discussion of this issue, and a cute reference to Dr. Strangelove, you may be interested in:

    http://crm114.sourceforge.net/wiki/doku.php

    Somewhere on that site there used to be a discussion of accuracy testing vs. different training methods.

  2. Daniel,
    I do use Thunderbird and I appreciate your suggestions, but I'm stuck on your first step. When I click on Edit, there's no Preferences option.

  3. You might be using Thunderbird on Windows then. :) Assuming you are, try Tools -> Options and then you should be able to see various tabs like Privacy, Advanced, etc.

  4. On the Windows version of Thunderbird, it's

    Tools -> Options

    rather than

    Edit -> Preferences.

    Everything else is as described in Daniel's comment.
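
To make the threshold effect from the first comment concrete, here is a minimal sketch of a word-count Bayesian filter that scores a message on a 0-100 scale and compares it with a junk threshold. This is a toy illustration, not Thunderbird's actual algorithm: the class `TinyBayesFilter`, the helper `is_junk`, and the training messages are all hypothetical, and only the two threshold values (70 and 90) come from the comment.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy naive Bayes filter (hypothetical; not Thunderbird's implementation)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one message under 'spam' or 'ham'."""
        self.words[label].update(text.lower().split())
        self.messages[label] += 1

    def score(self, text):
        """Return the estimated spam probability on a 0-100 scale."""
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total_msgs = sum(self.messages.values())
        log_odds = 0.0
        for label, sign in (("spam", 1), ("ham", -1)):
            total = sum(self.words[label].values())
            # Class prior, smoothed so an untrained filter does not divide by zero.
            log_odds += sign * math.log((self.messages[label] + 1) / (total_msgs + 2))
            for word in text.lower().split():
                # Add-one smoothing: unseen words get a small nonzero probability.
                count = self.words[label][word]
                log_odds += sign * math.log((count + 1) / (total + vocab + 1))
        # Clamp to keep exp() well within floating-point range.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 100.0 / (1.0 + math.exp(-log_odds))


def is_junk(score, threshold):
    """A message is treated as junk when its score reaches the threshold."""
    return score >= threshold


filt = TinyBayesFilter()
filt.train("you won a cash prize claim now", "spam")
filt.train("claim your free prize money now", "spam")
filt.train("meeting notes attached for review", "ham")
filt.train("lunch at noon tomorrow", "ham")

# A marginal message: two spammy words and one word seen only in normal mail.
marginal = filt.score("claim prize tomorrow")
print(f"score={marginal:.1f}",
      "junk at 70:", is_junk(marginal, 70),
      "junk at 90:", is_junk(marginal, 90))
```

With this toy training set the marginal message scores between the two thresholds, so it is flagged at 70 but passes at 90. That is the kind of borderline message the lower setting surfaces for manual correction early in training.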

Comments are closed.