Forum: >>> Magnum BBS <<<

[slrn] experiment: can bayesian filtering score usenet posts?

From Tavis Ormandy@21:1/5 to All on Mon Dec 20 03:32:55 2021

The problem with training spam filters over NNTP is that the protocol is designed around offering headers and bodies separately.

Sure, in theory you could just download everything at once, but then you
lose all the performance benefits of the protocol. If you could just
score on the XOVER headers, then you would still have all the protocol benefits, but is that enough data?
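
To make "is that enough data?" a bit more concrete, here is a rough Python sketch of what a single XOVER response line gives you. The field order is the standard tab-separated overview format (article number, Subject, From, Date, Message-ID, References, bytes, lines). The helper name and the sample line are made up for illustration, and this is not taken from the macro.

```python
# Field order of the standard NNTP overview format (tab-separated,
# preceded by the article number).
OVERVIEW_FIELDS = ["Subject", "From", "Date", "Message-ID",
                   "References", "Bytes", "Lines"]

def overview_to_headers(line):
    """Turn one XOVER response line into header-style text.

    Hypothetical helper: returns (article_number, text), where text looks
    like a message header block that a mail filter can tokenize.
    """
    parts = line.rstrip("\r\n").split("\t")
    number = int(parts[0])
    values = parts[1:1 + len(OVERVIEW_FIELDS)]
    headers = [f"{name}: {value}"
               for name, value in zip(OVERVIEW_FIELDS, values)
               if value]  # skip empty fields, e.g. no References
    # A blank line terminates the header block, as in a real article.
    return number, "\n".join(headers) + "\n\n"

# Invented sample line, not a real article.
sample = ("3000234\tI am just a test article\t"
          "\"Demo User\" <nobody@example.com>\t"
          "6 Oct 1998 04:38:40 -0500\t<45223423@example.com>\t"
          "<45454@example.net>\t1234\t17")
number, text = overview_to_headers(sample)
print(text)
```

So the filter only ever sees a handful of short fields per article, which may be part of why it needs so much training.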

I decided to try it, and the answer is that it works! *But* it took a lot of training before it started to work.

I used bogofilter (https://bogofilter.sourceforge.io/) and wrote a macro
to pipe just the overview headers into it. It then auto-generates a
scorefile.
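
For anyone who wants the general idea without reading S-Lang, the pipeline could be sketched in Python roughly like this. It is a guess at the shape, not the macro itself. It assumes bogofilter's documented exit codes (0 = spam, 1 = non-spam, 2 = unsure) and slrn's usual scorefile layout of a `[group]` line followed by `Score:` and a header pattern. The function names and the +100/-100 scores are invented for the example.

```python
import subprocess

def bogofilter_verdict(header_text, runner=subprocess.run):
    """Pipe header text into bogofilter and map its exit code to a label.

    The runner argument exists only so the call can be swapped out
    when bogofilter is not installed.
    """
    result = runner(["bogofilter"], input=header_text.encode(), check=False)
    return {0: "spam", 1: "ham", 2: "unsure"}.get(result.returncode, "error")

def score_entry(group, message_id, verdict, good=100, bad=-100):
    """Format one scorefile entry, or return "" if there is no clear verdict.

    Assumption: the entry matches on Message-ID. slrn treats the value as
    a regular expression, so a real macro may need to escape special
    characters in the ID.
    """
    if verdict == "ham":
        score = good
    elif verdict == "spam":
        score = bad
    else:
        return ""
    return f"[{group}]\nScore: {score}\nMessage-ID: {message_id}\n"

# Example with a hard-coded verdict, so nothing is executed here.
print(score_entry("alt.test", "<45223423@example.com>", "spam"))
```

Leaving "unsure" articles unscored means the filter only moves articles it has a clear opinion about.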

For the last few months, it has been really accurate at identifying the messages I want to read and I've been finding it really useful. If
anyone else wants to try it out, here is the macro I used:

https://lock.cmpxchg8b.com/files/bogofilter.sl

The macro automatically learns from any articles you have read when you
leave a group. If an article had a positive score, it is learned as good.
If it had a very low score, it is learned as bad.
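
The learning rule above could be written out roughly like this in Python. The cutoff for "very low" is not stated here, so the -50 below is purely a placeholder, and the function name is made up. The flags are bogofilter's registration options: -n registers a message as non-spam and -s registers it as spam.

```python
def learn_flag(score, bad_threshold=-50):
    """Pick the bogofilter registration flag for an article that was read.

    bad_threshold is an assumed value standing in for "very low".
    Returns "-n" (learn as good), "-s" (learn as bad), or None (skip).
    """
    if score > 0:
        return "-n"
    if score <= bad_threshold:
        return "-s"
    return None  # scores in between are not used for training

for score in (100, 0, -10, -9999):
    print(score, learn_flag(score))
```

Skipping the middle range keeps borderline articles out of the training data.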

Tavis.

--
_o) $ lynx lock.cmpxchg8b.com
/\\ _o) _o) $ finger taviso@sdf.org
_\_V _( ) _( ) @taviso

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
