This post originated from an RSS feed registered with Java Buzz
by Russell Beattie.
Original Post: Un-Bayesing SpamAssassin
Feed Title: Russell Beattie Notebook
Feed URL: http://www.russellbeattie.com/notebook/rss.jsp?q=java,code,mobile
Feed Description: My online notebook with thoughts, comments, links and more.
It was getting crazy last week. I was getting more and more and more spam. I would go to bed, wake up 8 hours later and have 50+ messages waiting for me. My SpamAssassin was just completely falling down on the job.
At first I lowered the hit level, and that didn't seem to help. Then I went through the .spamassassin/user_prefs and added points for the following:
Nothing seemed to help. I spent all day, while I was working, with a tail -f of .procmail.log in a window, trying to monitor what was happening and comparing the spam headers I got to what was supposed to happen, but I couldn't figure it out. *Then* I noticed that many of the headers had a BAYES_0 in them. What that means is that the Bayesian filter had determined there was a 0% chance that the email was spam. Unlike what I first thought, instead of just leaving the score alone, a BAYES_0 hit actually *subtracted points* from the hit count, thus putting the message under my spam limit.
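To make the arithmetic concrete, here is a rough sketch of how a negative Bayes score can drag a message back under the limit. The rule names (other than BAYES_0), the score values, and the threshold are all made up for illustration; they are not SpamAssassin's real defaults or anything from my user_prefs.

```python
# Hypothetical scores for illustration only; not SpamAssassin's real values.
RULE_SCORES = {
    "EXAMPLE_SUBJECT_RULE": 3.0,  # made-up rule name
    "EXAMPLE_BODY_RULE": 2.5,     # made-up rule name
    "BAYES_0": -4.0,              # negative: "Bayes thinks this is not spam"
}
SPAM_THRESHOLD = 5.0  # assumed hit level


def total_score(hits, scores=RULE_SCORES):
    """Add up the scores of every rule that matched the message."""
    return sum(scores[name] for name in hits)


def is_spam(hits, threshold=SPAM_THRESHOLD, scores=RULE_SCORES):
    """A message counts as spam once its total reaches the threshold."""
    return total_score(hits, scores) >= threshold


without_bayes = ["EXAMPLE_SUBJECT_RULE", "EXAMPLE_BODY_RULE"]
with_bayes = without_bayes + ["BAYES_0"]

print(total_score(without_bayes), is_spam(without_bayes))  # 5.5 True
print(total_score(with_bayes), is_spam(with_bayes))        # 1.5 False
```

The same two rules fire in both cases. The only difference is the BAYES_0 hit, which is enough to let the message through.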
Ahh. So first I modified the scores for the BAYES_xx rules, but I started getting false positives, which is bad. I was stumped for a bit until my coworker Vineet told me that what had probably happened was that the Bayes filtering had "learned badly". Ahhhh. *That* made sense.
So instead of trying to untrain it, or whatever, I whacked the bayes_seen and bayes_toks files. I'm *sure* there are more elegant ways of doing it, but I figured I'd start from scratch and see if that helped. It definitely helped. I'm still getting *a ton* of spam, but Thunderbird is also helping.
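If you want to do the same thing a bit more carefully, something like the sketch below moves the Bayes files into a backup folder instead of deleting them outright. It assumes the database files are named bayes_seen and bayes_toks and sit directly in the directory you pass in; the function name and the backup folder name are my own inventions, not part of SpamAssassin.

```python
from pathlib import Path
import shutil

# Assumed file names for the Bayes database; adjust if your setup differs.
BAYES_FILES = ("bayes_seen", "bayes_toks")


def reset_bayes_db(sa_dir, backup_name="bayes_backup"):
    """Move the Bayes files out of sa_dir into a backup subfolder.

    Returns the names of the files that were actually moved.
    """
    sa_dir = Path(sa_dir)
    backup_dir = sa_dir / backup_name
    moved = []
    for name in BAYES_FILES:
        src = sa_dir / name
        if src.is_file():
            # Create the backup folder only when there is something to move.
            backup_dir.mkdir(exist_ok=True)
            shutil.move(str(src), str(backup_dir / name))
            moved.append(name)
    return moved
```

Calling `reset_bayes_db(Path.home() / ".spamassassin")` would be the equivalent of what I did by hand, except the old files stick around in case the fresh start turns out worse.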
Does anyone know the right score modifier for attachments in general? I'm *sooo* fucking sick of that virus or whatever it is with insanely stupid text and a .zip file attachment. "I hate cleartext. Password is 21341254". AAAAhhh.
Anyways, that's my suggestion. I cannot wait until there are some solid solutions for Spam. It's just gotten to a crazy level lately.