Forum: >>> Magnum BBS <<<

Spamhalter getting overwhelmed by HTML meta tags

From Marco Old@21:1/5 to All on Fri Aug 6 16:25:19 2021

Spamhalter has not been working well; it has degraded over the past
couple of years. There was a long thread about this back in 2019.

I performed all of the hints from that thread but Spamhalter still
misses many Spam emails.

In looking at the "Explain Spam Classification", I see that almost all
of the words used to classify the email are HTML meta tags. Words
like "style", "arial", "margin", "sans-serif", "font-family",
"text-align" and so on.
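
As a rough illustration of why those words show up: if a filter tokenizes the raw HTML source, the markup and inline CSS become tokens alongside the visible text. The sketch below is NOT Spamhalter's tokenizer (its internals are not described in this thread); `raw_tokens`, `visible_tokens` and the sample message are hypothetical names and data, using only Python's standard library.

```python
import re
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect only reader-visible text, skipping tags, attributes
    and the bodies of <style>/<script> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def raw_tokens(body):
    # Naive tokenizer: every lowercase word (hyphens allowed) in the raw source.
    return set(re.findall(r"[a-z][a-z-]+", body.lower()))


def visible_tokens(body):
    # Same tokenizer, applied only to the text left after stripping markup.
    parser = VisibleText()
    parser.feed(body)
    parser.close()
    return raw_tokens(" ".join(parser.parts))


# Hypothetical sample message, made up for this illustration.
html_mail = (
    '<p style="font-family: arial, sans-serif; text-align: left; margin: 0">'
    "Cheap watches, buy now</p>"
)

print(sorted(raw_tokens(html_mail)))      # includes style, arial, margin, ...
print(sorted(visible_tokens(html_mail)))  # only the words from the visible text
```

With the raw source, seven of the eleven tokens are formatting words; after stripping the markup only the four content words remain.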

So I train an email as Spam and those words get into the
classification for Spam messages and then on the next email, I train
the email as not Spam and those words are removed from the
classification. Then the next email is not considered Spam.
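
The back-and-forth described above can be sketched with a toy per-token count model. This is a simplified assumption about how a statistical filter might weigh tokens, not Spamhalter's actual statistics; `train`, `spam_ratio` and the token lists are hypothetical.

```python
from collections import Counter

spam_counts, ham_counts = Counter(), Counter()


def train(tokens, is_spam):
    # Count each distinct token once per trained message.
    (spam_counts if is_spam else ham_counts).update(set(tokens))


def spam_ratio(token):
    # Fraction of trained messages containing this token that were spam;
    # 0.5 means the token carries no evidence either way.
    s, h = spam_counts[token], ham_counts[token]
    return s / (s + h) if s + h else 0.5


FORMATTING = ["style", "arial", "margin", "font-family"]

train(FORMATTING + ["watches", "cheap"], is_spam=True)    # one spam mail
train(FORMATTING + ["meeting", "agenda"], is_spam=False)  # one good mail

for token in ["style", "font-family", "watches", "meeting"]:
    print(token, spam_ratio(token))
```

In this toy model the formatting tokens, which occur in both kinds of mail, end up at 0.5 after one spam and one good message, while the content words stay at 1.0 or 0.0. If formatting tokens dominate the token list, each new training step mostly just shifts those shared tokens.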

Has anyone noticed this?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Steve Hayes@21:1/5 to All on Sat Aug 7 09:43:54 2021

On Fri, 06 Aug 2021 16:25:19 -0700, Marco Old <notme@silXogicX.com>
wrote:

> Spamhalter has not been working well; it has degraded over the past
> couple of years. There was a long thread about this back in 2019.
>
> I performed all of the hints from that thread but Spamhalter still
> misses many Spam emails.
>
> In looking at the "Explain Spam Classification", I see that almost all
> of the words used to classify the email are HTML meta tags. Words
> like "style", "arial", "margin", "sans-serif", "font-family",
> "text-align" and so on.
>
> So I train an email as Spam and those words get into the
> classification for Spam messages and then on the next email, I train
> the email as not Spam and those words are removed from the
> classification. Then the next email is not considered Spam.
>
> Has anyone noticed this?

That's probably the reason why most HTML e-mail ends up in my "Junk"
queue, and as most of it is junk, I don't bother to fish it out.

--
Steve Hayes from Tshwane, South Africa
Web: http://www.khanya.org.za/stevesig.htm
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk


From Euler German@21:1/5 to All on Sat Aug 7 15:27:46 2021

On article <5ogrgg99d7l9nucvu8u1968iepbc1ob35r@4ax.com>, Marco Old
wrote (at least in part):

> So I train an email as Spam and those words get into the
> classification for Spam messages and then on the next email, I train
> the email as not Spam and those words are removed from the
> classification. Then the next email is not considered Spam.

Maybe you're not "training" SpamHalter correctly. There's a big
difference between selecting one or more misclassified messages and
MOVING them to the Suspicious or junk mail folder, and picking
Spamhalter classification > Train message(s) as Spam from the menu.
The same applies the other way around, that is, MOVING message(s)
from the Suspicious or junk mail folder to any other folder is much
more effective than Train message(s) as Not-Spam. There's a technical
explanation for each method, but in a nutshell that's how it works.

OTOH, if that is not your case, you may benefit from SpamHalter's
database cleanup, which will remove deprecated data from the corpus.
Pick it from Tools > Spam and content controls > Spamhalter... > Cleanup...

My current Spamhalter training strategy and settings:

(*) Train on classification errors only (smaller database)
( ) Train always (larger database, self-trained) <- no need if you
    run standalone or on a small LAN.

Spam level (%): 50 Not-spam boost: 1

SpamHalter has been running flawlessly here since version 1.0 with
these settings.
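
For readers wondering why "train on classification errors only" keeps the database smaller, here is a minimal sketch of the two strategies on a toy filter. Everything in it is an assumption made for illustration: `ToyFilter`, `run` and the sample messages are hypothetical, the scoring is a plain average of per-token ratios, `spam_level=0.5` only loosely mirrors the "Spam level (%): 50" setting, and the "Not-spam boost" setting is not modelled at all.

```python
from collections import Counter


class ToyFilter:
    def __init__(self, spam_level=0.5):
        self.spam = Counter()
        self.ham = Counter()
        self.spam_level = spam_level  # assumed analogue of "Spam level (%)"

    def score(self, tokens):
        # Average spam ratio over the tokens the filter has seen before.
        ratios = []
        for t in set(tokens):
            s, h = self.spam[t], self.ham[t]
            if s + h:
                ratios.append(s / (s + h))
        return sum(ratios) / len(ratios) if ratios else 0.0  # unknown: not spam

    def is_spam(self, tokens):
        return self.score(tokens) >= self.spam_level

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def size(self):
        # Total number of stored token observations.
        return sum(self.spam.values()) + sum(self.ham.values())


def run(messages, errors_only):
    f = ToyFilter()
    for tokens, label in messages:
        predicted = f.is_spam(tokens)
        # Train always, or only when the filter got this message wrong.
        if not errors_only or predicted != label:
            f.train(tokens, label)
    return f


# Hypothetical mail stream: (tokens, user says it is spam)
MAIL = [
    (["cheap", "watches"], True),
    (["meeting", "agenda"], False),
    (["cheap", "pills"], True),
    (["agenda", "notes"], False),
    (["watches", "pills"], True),
]

print(run(MAIL, errors_only=True).size())
print(run(MAIL, errors_only=False).size())
```

On this toy stream the errors-only run stores 2 token observations and the train-always run stores 10, yet both flag a "cheap watches" message as spam. Real mail is far noisier, so treat this as a picture of the idea rather than a prediction of Spamhalter's behaviour.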

--
Kind regards,
Euler German

Please, reply preferably to the list.
Reply-To: partially ROT13, invalid=com
Due to spam I'm filtering-out GoogleGroups. Sorry. :(


From Marco Old@21:1/5 to rstrezna.hfrarg@znvyahyy.invalid on Wed Aug 25 15:05:48 2021

Euler,

Thanks for the hints. I already had that training strategy selected,
but I had the default settings for Spam level and Not-spam boost.

I changed them to your recommendation and we will see
what happens.

I will be sure to drag the spam messages into the Junk folder.

Marco

On Sat, 7 Aug 2021 15:27:46 -0300, Euler German <rstrezna.hfrarg@znvyahyy.invalid> wrote:

> My current Spamhalter training strategy and settings:
>
> (*) Train on classification errors only (smaller database)
> ( ) Train always (larger database, self-trained) <- no need if you
>     run standalone or on a small LAN.
>
> Spam level (%): 50 Not-spam boost: 1
>
> SpamHalter has been running flawlessly here since version 1.0 with
> these settings.


From Euler German@21:1/5 to All on Thu Aug 26 09:32:04 2021

On article <0ffdig5vmtk4ct0bmn9eads9bt6sjj7gq3@4ax.com>, Marco Old
wrote (at least in part):

> I will be sure to drag the spam messages into the Junk folder.

You may also use Quick Actions for this (I'm a keyboard guy). Look at
Folder > Quick actions > Define quick actions...

--
Kind regards,
Euler German

Please, reply preferably to the list.
Reply-To: partially ROT13, invalid=com
Due to spam I'm filtering-out GoogleGroups. Sorry. :(


From Marco Old@21:1/5 to All on Sun Oct 24 13:44:29 2021

Update:

With help from the residents of this group, I've got Spamhalter
working much better now.

I cleared out all of the previous cached data, clicked on the

(o) Train on classification errors only

set "Spam Level %" to 50

and set "Not-spam boost" to 1

as recommended in other posts.

Then I made sure to ONLY drag spam emails into the spam folder and to
NEVER use the right-click menu item "Train message(s) as Spam".

After a few weeks of dragging spam emails, Spamhalter is now working
very well, with almost 100% accuracy in detecting Spam and not-Spam.

Thanks to all.

