[TriLUG] Re: spam solutions - spamassassin

Tue Jun 8 22:09:18 EDT 2004

On Tue, 8 Jun 2004, Myrhillion wrote:

> I'm already noticing that the default hits of 5 seems to miss a lot of 
> spam (from the default config), so
> I'll probably need to figure out this spam / ham feeding business next 
> and then lower the hits value in config.

I wouldn't recommend making the threshold any lower.  I've received only a 
very few false positives over tens of thousands of messages in the past 
couple of years, but a lot more legit messages come in at 4, 4.2, etc., 
and a lower threshold would start catching those.  The scores are all 
worked out with the assumption that the hit threshold is 5 (or higher).
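
To see where legitimate mail actually lands before touching the threshold, 
something like the following could tally scores by their integer part.  
This is only a sketch: the function name score_histogram is made up for 
illustration, and it assumes every message in the mbox has already been 
through SpamAssassin and carries an X-Spam-Status header with "hits=" or 
"score=" on its first line.

```shell
# Hypothetical helper: read an mbox on stdin and print "score count"
# pairs, where score is the integer part of each message's SA score.
score_histogram() {
    awk '/^X-Spam-Status: / {
        # Accept both spellings of the score field.
        if (match($0, /(hits|score)=-?[0-9]+/)) {
            s = substr($0, RSTART, RLENGTH)
            sub(/^[a-z]+=/, "", s)      # drop the "hits=" or "score=" prefix
            count[s]++
        }
    }
    END { for (k in count) print k, count[k] }' | sort -n
}
```

Run against a folder of known-good mail (for example 
"score_histogram < $HOME/mail/saved", where the path is a placeholder), a 
cluster of messages in the 4 bucket would suggest leaving the threshold 
alone.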

However, the Bayesian learning really helps.  Here's my system for spam:

1)  Procmail sorts all mailing list messages first, except for the few 
that accept non-subscriber e-mail and thus get spam

2)  Remaining messages run through spamc (just like all the examples 
you've seen)

3)  All spam that's been autolearned by the Bayesian system gets discarded 
unconditionally.  I believe this works out to be a score of 12 or higher, 
though not everything with that score gets autolearned; see the man page 
for more info.  I do this with the following rule:
	:0
	* ^X-Spam-Status: Yes.*autolearn=spam
	/dev/null

Since this mail is already learned, and is definitely spam, no need to 
keep it.

4)  All spam with a score of 10 or higher gets saved for Bayesian 
learning.  I consider SA good enough that I want all this spam to be 
discarded without me looking at it, but only after it has been run through 
the Bayesian filters and learned as spam.  Here's the rule (the ten 
escaped asterisks match a level of at least 10):
	:0:
	* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*
	$HOME/mail/tenspam

Then the following short shell script runs every night from my crontab:
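	TENSPAM=$HOME/mail/tenspam
	# Nothing to do if no new high-scoring spam has arrived
	[ -s "$TENSPAM" ] || exit 0
	mv "$TENSPAM" "$TENSPAM.tmp"
	# Only remove the temporary copy if learning succeeded
	sa-learn --showdots --spam --mbox "$TENSPAM.tmp" && rm "$TENSPAM.tmp"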

5) All remaining spam (score of 5 or higher) gets saved to a spam folder.  
Because I use the TriLUG IMAP system, the rule for this looks like:
	:0
	* ^X-Spam-Status: Yes
	| $DELIVERTO +spam/sa-spam
This will probably be different on your system.

6) Procmail rules for the remaining mailing lists (the ones that get 
spam, which has now hopefully been filtered out).
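
The order of steps 3 to 5 matters, because procmail stops at the first 
recipe that delivers a message.  As a rough way to check that ordering 
against a saved message without touching a live .procmailrc, the same 
three tests can be approximated in plain shell.  This is a sketch, not 
procmail itself: the function name classify and the labels it prints are 
invented for illustration, and it unfolds continuation lines by hand 
because X-Spam-Status is usually wrapped across several lines.

```shell
# Hypothetical helper: read one message on stdin and print which of
# steps 3-5 would claim it (discard, tenspam, sa-spam) or "pass".
classify() {
    # Keep only the header block and join folded continuation lines.
    headers=$(sed '/^$/q' | awk '
        NR > 1 && /^[ \t]/ { line = line $0; next }
        NR > 1             { print line }
                           { line = $0 }
        END                { if (NR) print line }')

    if printf '%s\n' "$headers" | grep -Eq '^X-Spam-Status: Yes.*autolearn=spam'; then
        echo discard        # step 3: already autolearned as spam
    elif printf '%s\n' "$headers" | grep -Eq '^X-Spam-Level: \*{10}'; then
        echo tenspam        # step 4: ten or more stars
    elif printf '%s\n' "$headers" | grep -Eq '^X-Spam-Status: Yes'; then
        echo sa-spam        # step 5: everything else tagged as spam
    else
        echo pass           # not tagged; falls through to later recipes
    fi
}
```

For example, "classify < message.txt" (with message.txt standing in for 
any single saved message) prints the label of the first matching step.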

Next, I go through the "sa-spam" folder from time to time to check for 
false positives.  I also move any false negatives from my INBOX to this 
folder.  After checking carefully for false positives, I run the following 
script, which uses the 'mailutil' program to download from the IMAP 
folders (there are other ways to do this; if you have local folders it's 
simpler):
	# Empty the file left over from the previous run
	cat /dev/null > "$HOME/tmp/spam.txt"
	# Move the messages from the IMAP folder into a local mbox file
	mailutil appenddelete -verbose \
		"{moya.trilug.org/user=jeremy/ssl/novalidate-cert}spam/sa-spam" \
		\#driver.unix/$HOME/tmp/spam.txt
	echo -n "Running sa-learn "
	sa-learn --showdots --spam --mbox "$HOME/tmp/spam.txt"

Note that my script doesn't delete the temporary file spam.txt until the 
next time it's run, which is an extra safeguard: I can still retrieve 
false positives from it if I need to look for them later.  I use a similar 
script periodically on my INBOX and saved mail folders to learn ham 
(learning similar amounts of spam and ham is critical to proper Bayes 
operation).
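
For local mbox folders, the ham side could look roughly like the sketch 
below.  It is not the script described above: the function name learn_ham 
and the SA_LEARN override are made up for illustration, and the folder 
paths in the usage line are placeholders.  The only SpamAssassin options 
used are --ham and --mbox.

```shell
# Hypothetical override so a different binary can be substituted;
# defaults to the sa-learn found on the PATH.
SA_LEARN=${SA_LEARN:-sa-learn}

# Hypothetical helper: learn every non-empty mbox file given as an
# argument as ham, skipping anything missing or empty.
learn_ham() {
    for mbox in "$@"; do
        if [ -s "$mbox" ]; then
            "$SA_LEARN" --ham --mbox "$mbox"
        else
            echo "skipping $mbox (missing or empty)" >&2
        fi
    done
}
```

It would be called as, say, 
learn_ham "$HOME/mail/saved" "$HOME/mail/friends".  Only folders that have 
been checked by hand belong here, since a stray spam learned as ham works 
against the filter.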

I think this is a fairly decent setup.  It reduces the amount of spam I 
have to manually check, by discarding high-scoring spam, and harnesses the 
power of the Bayesian system.  I get anywhere from 100 to 250 spams a day 
(it seems to vary widely), but only 30-40 get added to the spam folder.  I 
normally check the folder (and run the script) once every couple of weeks.
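
Since the spam and ham counts need to stay roughly in step, it may be 
worth glancing at them after each learning run.  The sketch below pulls 
the two totals out of "sa-learn --dump magic".  The function name 
bayes_counts is invented for illustration, and the parsing assumes the 
usual layout of that output, where the value is in the third column and 
the line ends in "nspam" or "nham"; check it against what your version 
prints.

```shell
# Hypothetical helper: read "sa-learn --dump magic" output on stdin and
# print the number of messages learned as spam and as ham.
bayes_counts() {
    awk '$NF == "nspam" { s = $3 }
         $NF == "nham"  { h = $3 }
         END { printf "spam=%d ham=%d\n", s, h }'
}
```

Typical use would be "sa-learn --dump magic | bayes_counts".  If one 
number runs far ahead of the other, that is a hint to feed the filter more 
of the other kind.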

Note that all scripts and procmail recipes in this e-mail have been 
indented one tab.

Hope this helps,
Jeremy

-- 
/---------------------------------------------------------------------\
| Jeremy Portzer        jeremyp at pobox.com      trilug.org/~jeremy     |
| GPG Fingerprint: 712D 77C7 AB2D 2130 989F  E135 6F9F F7BC CC1A 7B92 |
\---------------------------------------------------------------------/