[TriLUG] Re: spam solutions - spamassassin
Jeremy Portzer
jeremyp at pobox.com
Tue Jun 8 22:09:18 EDT 2004
On Tue, 8 Jun 2004, Myrhillion wrote:
> I'm already noticing that the default hits of 5 seems to miss a lot of
> spam (from the default config), so
> I'll probably need to figure out this spam / ham feeding business next
> and then lower the hits value in config.
I wouldn't recommend making the threshold any lower. I've received a few
false positives (only a very few, over tens of thousands of messages over
the past couple years), but there are a lot more legit messages that come
in at 4, 4.2, etc. The scores are all worked out with the assumption that
the hit threshold is 5 (or higher).
However, the Bayesian learning really helps. Here's my system for spam:
1) Procmail sorts all mailing list messages first, except for the few
that accept non-subscriber e-mail and thus get spam
2) Remaining messages run through spamc (just like all the examples
you've seen)
3) All spam that's been autolearned by the Bayesian system gets discarded
outright. I believe this works out to a score of 12 or higher, though not
everything with that score gets autolearned; see the man page for more
info. I do this with the following rule:
	:0
	* ^X-Spam-Status: Yes.*autolearn=spam
	/dev/null
Since this mail is already learned, and is definitely spam, no need to
keep it.
4) All spam with a score of 10 or higher gets saved for Bayesian learning.
I consider SA good enough that I want all this spam to be discarded
without me looking at it, but first run through the Bayesian filters and
learned as spam. Here's the rule:
	:0:
	* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*
	$HOME/mail/tenspam
Then the following short shell script runs every night from my crontab:
	TENSPAM="$HOME/mail/tenspam"
	# Nothing to learn if no spam was saved since the last run
	[ -s "$TENSPAM" ] || exit 0
	mv "$TENSPAM" "$TENSPAM.tmp"
	sa-learn --showdots --spam --mbox "$TENSPAM.tmp"
	rm "$TENSPAM.tmp"
5) All remaining spam (score of 5 or higher) gets saved to a spam folder.
Because I use the TriLUG IMAP system, the rule for this looks like:
	:0
	* ^X-Spam-Status: Yes
	| $DELIVERTO +spam/sa-spam
This will probably be different on your system.
6) Procmail rules for the remaining mailing lists (the ones that get
spam, which has now hopefully been filtered out).
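The order of the recipes in steps 3 to 5 matters, since the first match wins. As a rough sketch of that ordering (not part of the actual setup), the shell function below reads one message's headers and prints which step would claim it. The names unfold and route_message, the labels they print, and the sample header values are all made up for illustration; procmail does the real work, and it joins folded header lines itself, which the awk helper only imitates.

```shell
#!/bin/sh
# Illustrative only: mimic the order of the procmail recipes in steps 3-5.

# Join folded (continued) header lines into one line each,
# roughly as procmail does before matching.
unfold() {
    awk 'NR > 1 && /^[ \t]/ { sub(/^[ \t]+/, " "); buf = buf $0; next }
         NR > 1 { print buf }
         { buf = $0 }
         END { if (NR > 0) print buf }'
}

# Read headers on stdin; print a hypothetical label for the destination.
route_message() {
    headers=$(unfold)
    if printf '%s\n' "$headers" | grep -q '^X-Spam-Status: Yes.*autolearn=spam'; then
        echo discard          # step 3: already autolearned as spam
    elif printf '%s\n' "$headers" | grep -q '^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*'; then
        echo tenspam          # step 4: ten or more stars
    elif printf '%s\n' "$headers" | grep -q '^X-Spam-Status: Yes'; then
        echo sa-spam          # step 5: everything else tagged as spam
    else
        echo inbox            # not tagged; falls through to later rules
    fi
}

# Made-up sample: tagged as spam, seven stars, not autolearned.
printf 'X-Spam-Status: Yes, hits=7.3 required=5.0\nX-Spam-Level: *******\n' | route_message
```

Swapping the first two tests would send autolearned spam to the tenspam folder instead of /dev/null, which is why the discard rule comes first.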
Next, I go through the "sa-spam" folder periodically to check for false
positives. I also move any false negatives from my INBOX to this folder.
After checking carefully for false positives, I run the following script.
It uses the 'mailutil' program to download from the IMAP folders; there
are other ways to do this, and if you have local folders it's simpler:
	# Empty the copy left over from the previous run
	cat /dev/null > "$HOME/tmp/spam.txt"
	# Move the messages from the IMAP folder into a local mbox file
	mailutil appenddelete -verbose \
	  "{moya.trilug.org/user=jeremy/ssl/novalidate-cert}spam/sa-spam" \
	  "#driver.unix/$HOME/tmp/spam.txt"
	printf "Running sa-learn "
	sa-learn --showdots --spam --mbox "$HOME/tmp/spam.txt"
Note that my script doesn't delete the temporary file spam.txt until the
next time it's run. That's an extra safeguard: I can still retrieve a
false positive from it later, at least until the next run. I use a
similar script periodically on my INBOX and saved mail folders to learn
ham (learning similar amounts of spam and ham is critical to proper Bayes
operation).
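One way to keep an eye on that spam/ham balance is to look at the message counts that "sa-learn --dump magic" reports. The sketch below is only an illustration: bayes_counts is a made-up helper name, and it assumes the dump prints lines ending in "nspam" and "nham" with the count in the third column, which may differ between SpamAssassin versions.

```shell
#!/bin/sh
# Hypothetical helper: read "sa-learn --dump magic" output on stdin and
# print how many spam and ham messages the Bayes database has learned.
# Assumes lines such as:
#   0.000          0       1200          0  non-token data: nspam
bayes_counts() {
    awk '$NF == "nspam" { spam = $3 }
         $NF == "nham"  { ham = $3 }
         END { printf "spam=%d ham=%d\n", spam, ham }'
}

# Intended usage (not run here, since it needs SpamAssassin installed):
#   sa-learn --dump magic | bayes_counts
```

If the two numbers drift far apart, that is a hint to feed in more of whichever kind is lagging.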
I think this is a fairly decent setup. It reduces the amount of spam I
have to check manually, by discarding high-scoring spam, and harnesses the
power of the Bayesian system. I get anywhere from 100 to 250 spams a day
(it seems to vary widely), but only 30-40 get added to the spam folder. I
normally check the folder (and run the script) once every couple of weeks.
Note that all scripts and procmail recipes in this e-mail have been
indented one tab.
Hope this helps,
Jeremy
--
/---------------------------------------------------------------------\
| Jeremy Portzer jeremyp at pobox.com trilug.org/~jeremy |
| GPG Fingerprint: 712D 77C7 AB2D 2130 989F E135 6F9F F7BC CC1A 7B92 |
\---------------------------------------------------------------------/