I have a custom mail filter in python that uses the mailbox package to
open a mail message and give me access to the headers.
So I have the following code to open each mail message:-
    import mailbox
    import sys

    #
    # Read the message from standard input and make a message object from it
    #
    msg = mailbox.MaildirMessage(sys.stdin.buffer.read())
and then later I have (among many other bits and pieces):-
    #
    # test for string in Subject:
    #
    if searchTxt in str(msg.get("subject", "unknown")):
        pass  # do various things
This works exactly as intended most of the time, but occasionally a
message whose subject should match the test is missed. I have just
realised that this happens when the Subject: has accented
characters in it (this is from a mailing list about canals in France).
So, for example, the latest case of this happening has:-
Subject: aka Marne à la Saône (Waterways Continental Europe)
where the searchTxt in the code above is "Waterways Continental Europe".
Is there any way I can work round this issue? E.g. is there a way to
strip out all extended characters from a string? Or maybe it's
msg.get() that isn't managing to handle the accented string correctly?
Yes, I know that accented characters probably aren't allowed in
Subject: but I'm not going to get that changed! :-)
A bit more information: msg.get("subject", "unknown") does return a
string, as follows:-
Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
So it's the 'searchTxt in msg.get("subject", "unknown")' that's
failing. I.e. for some reason 'in' isn't working when the searched
string has utf-8 characters.
Surely there's a way to handle this.
--
Chris Green
Chris Green <cl@isbd.net> wrote:
> A bit more information, msg.get("subject", "unknown") does return a
> string, as follows:-
>
> Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
>
> So it's the 'searchTxt in msg.get("subject", "unknown")' that's
> failing. I.e. for some reason 'in' isn't working when the searched
> string has utf-8 characters.
>
> Surely there's a way to handle this.

... and of course I now see the issue! The Subject: with utf-8
characters in it gets spaces changed to underscores. So searching for
'(Waterways Continental Europe)' fails.
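
For anyone hitting the same thing: the underscores are part of the
RFC 2047 "encoded-word" format (the =?utf-8?Q?...?= wrapper), in which
Q encoding writes a space as "_" and a non-ASCII byte as =XX. Below is
a minimal sketch of decoding such a header with the standard library's
email.header module before searching it. It assumes the header value
is the encoded string quoted above; decode_subject is a hypothetical
helper name, not part of the original filter.

```python
from email.header import decode_header, make_header


def decode_subject(raw_subject):
    """Return a Subject: value with any RFC 2047 encoded-words decoded."""
    # str() covers the case where the header arrives as a Header object.
    return str(make_header(decode_header(str(raw_subject))))


# Header value taken from the example message in this thread.
raw = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
search_txt = "Waterways Continental Europe"

decoded = decode_subject(raw)
found_raw = search_txt in raw          # the test as originally written
found_decoded = search_txt in decoded  # the same test on the decoded text
```

In the filter this would amount to testing searchTxt against
decode_subject(msg.get("subject", "unknown")) rather than the raw value.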
I'll either need to test for both versions of the string or I'll need
to change underscores to spaces in the Subject: returned by msg.get().
It's a long enough string that I'm searching for that I won't get any
false positives.
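
The underscore replacement could be sketched roughly as follows.
subject_contains is a hypothetical helper name. The sketch assumes the
sender used Q encoding, as in the example above, and that searchTxt is
plain ASCII; a base64-encoded Subject: (the other form RFC 2047
allows) would still not match, and any genuine underscores in a
subject would also be turned into spaces.

```python
def subject_contains(raw_subject, search_txt):
    """Search a still-encoded Subject: with Q-encoding underscores undone."""
    # Only the "_" -> " " substitution is reversed here; the =XX escapes
    # used for accented characters are left as they are.
    return search_txt in str(raw_subject).replace("_", " ")


# Header value taken from the example message in this thread.
raw = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
matched = subject_contains(raw, "Waterways Continental Europe")
```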
Sorry for the noise everyone, it's a typical case of explaining the
problem shows one how to fix it! :-)