Forum: >>> Magnum BBS <<<

Problem with accented characters in mailbox.Maildir()

From Chris Green@21:1/5 to All on Sat May 6 11:13:03 2023

I have a custom mail filter in python that uses the mailbox package to
open a mail message and give me access to the headers.

So I have the following code to open each mail message:-

import mailbox
import sys

#
# Read the message from standard input and make a message object from it
#
msg = mailbox.MaildirMessage(sys.stdin.buffer.read())

and then later I have (among many other bits and pieces):-

#
# test for string in Subject:
#
if searchTxt in str(msg.get("subject", "unknown")):
    pass  # do various things

This works exactly as intended most of the time but occasionally a
message whose subject should match the test is missed. I have just
realised when this happens, it's when the Subject: has accented
characters in it (this is from a mailing list about canals in France).

So, for example, the latest case of this happening has:-

Subject: aka Marne à la Saône (Waterways Continental Europe)

where the searchTxt in the code above is "Waterways Continental Europe".

Is there any way I can work round this issue? E.g. is there a way to
strip out all extended characters from a string? Or maybe it's
msg.get() that isn't managing to handle the accented string correctly?

Yes, I know that accented characters probably aren't allowed in
Subject: but I'm not going to get that changed! :-)

--
Chris Green

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jak@21:1/5 to All on Sat May 6 13:38:49 2023

Hi,
you could try extracting the charset from the "Content-Type:" header and
then using it to convert the subject:

subj = str(raw_subj, encoding='...')


From Chris Green@21:1/5 to All on Sat May 6 12:51:37 2023

A bit more information, msg.get("subject", "unknown") does return a
string, as follows:-

Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=

So it's the 'searchTxt in msg.get("subject", "unknown")' that's
failing. I.e. for some reason 'in' isn't working when the searched
string has utf-8 characters.

Surely there's a way to handle this.

--
Chris Green

From Chris Green@21:1/5 to Chris Green on Sat May 6 13:46:55 2023

... and of course I now see the issue! The Subject: with utf-8
characters in it gets spaces changed to underscores. So searching for
'(Waterways Continental Europe)' fails.

I'll either need to test for both versions of the string or I'll need
to change underscores to spaces in the Subject: returned by msg.get().
It's a long enough string that I'm searching for that I won't get any
false positives.

Sorry for the noise everyone, it's a typical case of explaining the
problem shows one how to fix it! :-)
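
A minimal sketch of the workaround described above, using the Subject value quoted earlier in the thread. It shows why the plain 'in' test misses and that swapping underscores for spaces is enough for this ASCII search string, although the accented characters stay encoded (=C3=A0 and so on). The variable names are illustrative only, not taken from the actual filter.

```python
# Raw header value as returned by msg.get("subject") in this thread.
raw_subj = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
searchTxt = "Waterways Continental Europe"

# The encoded word uses "_" in place of spaces, so a direct test misses.
direct_match = searchTxt in raw_subj

# Workaround: turn underscores back into spaces before testing.
patched_subj = raw_subj.replace("_", " ")
patched_match = searchTxt in patched_subj

print(direct_match, patched_match)
```

This only helps when the subject is Q-encoded as above. A base64 (?B?) encoded subject would still not match, and a genuine underscore in a subject would also be turned into a space.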

--
Chris Green

From jak@21:1/5 to All on Sat May 6 16:27:04 2023

This is probably what you need:

>>> import email.header
>>> raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='
>>> subj = email.header.decode_header(raw_subj)[0]
>>> subj[0].decode(subj[1])
'aka Marne à la Saône (Waterways Continental Europe)'
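
The snippet above decodes only the first part of the header and assumes that part is bytes with a known charset. As a hedged sketch, the same decode_header() call can be combined with email.header.make_header() so that plain ASCII subjects and subjects made of several parts go through one code path. The helper name decode_subject is made up for illustration.

```python
from email.header import decode_header, make_header


def decode_subject(raw_subj):
    """Return a header value with any encoded words decoded to text."""
    # decode_header() splits the value into (data, charset) pairs;
    # make_header() joins them into a Header, and str() gives the text.
    return str(make_header(decode_header(raw_subj)))


raw_subj = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
subj = decode_subject(raw_subj)
found = "Waterways Continental Europe" in subj

# A subject with no encoded words comes back unchanged.
plain = decode_subject("Plain ASCII subject")

print(subj, found, plain)
```

In the filter, the test would then presumably become something like 'searchTxt in decode_subject(str(msg.get("subject", "unknown")))', which matches on the decoded text regardless of how the sender's mailer encoded it.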
