• Problem with accented characters in mailbox.Maildir()

    From Chris Green@21:1/5 to All on Sat May 6 11:13:03 2023
    I have a custom mail filter in python that uses the mailbox package to
    open a mail message and give me access to the headers.

    So I have the following code to open each mail message:-

    #
    #
    # Read the message from standard input and make a message object from it
    #
    msg = mailbox.MaildirMessage(sys.stdin.buffer.read())

    and then later I have (among many other bits and pieces):-

    #
    #
    # test for string in Subject:
    #
    if searchTxt in str(msg.get("subject", "unknown")):
        do
        various
        things


    This works exactly as intended most of the time but occasionally a
    message whose subject should match the test is missed. I have just
    realised when this happens, it's when the Subject: has accented
    characters in it (this is from a mailing list about canals in France).

    So, for example, the latest case of this happening has:-

    Subject: aka Marne à la Saône (Waterways Continental Europe)

    where the searchTxt in the code above is "Waterways Continental Europe".


    Is there any way I can work round this issue? E.g. is there a way to
    strip out all extended characters from a string? Or maybe it's
    msg.get() that isn't managing to handle the accented string correctly?

    Yes, I know that accented characters probably aren't allowed in
    Subject: but I'm not going to get that changed! :-)


    --
    Chris Green

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jak@21:1/5 to All on Sat May 6 13:38:49 2023
    Hi,
    you could try extracting the charset from the message's Content-Type
    header and then using it to convert the subject:

    subj = str(raw_subj, encoding='...')

  • From Chris Green@21:1/5 to All on Sat May 6 12:51:37 2023
    A bit more information, msg.get("subject", "unknown") does return a
    string, as follows:-

    Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=

    So it's the 'searchTxt in msg.get("subject", "unknown")' that's
    failing. I.e. for some reason 'in' isn't working when the searched
    string has utf-8 characters.

    Surely there's a way to handle this.

    --
    Chris Green
  • From Chris Green@21:1/5 to Chris Green on Sat May 6 13:46:55 2023
    ... and of course I now see the issue! The Subject: with utf-8
    characters in it gets spaces changed to underscores. So searching
    for '(Waterways Continental Europe)' fails.

    I'll either need to test for both versions of the string or I'll need
    to change underscores to spaces in the Subject: returned by msg.get().
    It's a long enough string that I'm searching for that I won't get any
    false positives.
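A minimal sketch of that workaround, for anyone following along. The helper name subject_matches is made up for illustration and is not part of the mailbox package; it simply tests the header both as returned and with underscores turned back into spaces.

```python
# The header exactly as msg.get("subject") returned it in this thread.
raw_subject = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
search_txt = "Waterways Continental Europe"


def subject_matches(subject, needle):
    """Hypothetical helper: match the header as-is, or with '_' read as ' '."""
    return needle in subject or needle in subject.replace("_", " ")


print(search_txt in raw_subject)                 # False: spaces arrive as underscores
print(subject_matches(raw_subject, search_txt))  # True
```

This only helps when the header uses the Q (quoted-printable style) encoding and the search text is plain ASCII. A header sent with the B (base64) encoding would not contain the search text in any readable form, so it would still be missed.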


    Sorry for the noise everyone, it's a typical case of explaining the
    problem shows one how to fix it! :-)

    --
    Chris Green
  • From jak@21:1/5 to All on Sat May 6 16:27:04 2023
    This is probably what you need:

    import email.header

    raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='

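    # decode_header() returns a list of (decoded_bytes, charset) pairs;
    # this subject is a single encoded word, so take the first pair
    subj = email.header.decode_header(raw_subj)[0]

    # decode the bytes using the charset named in the header (utf-8 here)
    subj[0].decode(subj[1])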

    'aka Marne à la Saône (Waterways Continental Europe)'
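A slightly more general sketch of the same idea, in case the subject mixes plain text with one or more encoded words. The function name decode_subject is invented for this example; decode_header() and make_header() are the standard email.header functions, and make_header() joins all the decoded pieces so there is no need to index into the list by hand.

```python
from email.header import decode_header, make_header


def decode_subject(raw_subject):
    """Hypothetical helper: return the subject as text, decoding any encoded words."""
    return str(make_header(decode_header(raw_subject)))


raw_subj = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
subject = decode_subject(raw_subj)

print(subject)
print("Waterways Continental Europe" in subject)  # True
```

In the filter from the original post this would presumably be used as decode_subject(str(msg.get("subject", "unknown"))) before the 'in' test. Subjects with no encoded words pass through unchanged, and base64-encoded subjects are decoded as well.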
