• mailbox misbehavior with non-ASCII

    From Peter Pearson@21:1/5 to All on Fri Jul 29 23:24:57 2022
    The following code produces a nonsense result with the input
    described below:

    import mailbox
    box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
    x = box.values()[0]
    h = x.get("X-DSPAM-Factors")
    print(type(h))
    # <class 'email.header.Header'>

    The output is the desired "str" when the message file contains this:

    To: recipient@example.com
    Message-ID: <123>
    Date: Sun, 24 Jul 2022 15:31:19 +0000
    Subject: Blah blah
    From: from@from.com
    X-DSPAM-Factors: a'b

    xxx

    ... but if the apostrophe in "a'b" is replaced with a
    RIGHT SINGLE QUOTATION MARK, the returned h is of type
    "email.header.Header", and seems to contain inscrutable garbage.

    I realize that one should not put non-ASCII characters in
    message headers, but of course I didn't put it there, it
    just showed up, pretty much beyond my control. And I realize
    that when software is given input that breaks the rules, one
    cannot expect optimal results, but I'd think an exception
    would be the right answer.

    Is this worth a bug report?

    --
    To email me, substitute nowhere->runbox, invalid->com.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ethan Furman@21:1/5 to Peter Pearson on Fri Jul 29 16:59:29 2022
    On 7/29/22 16:24, Peter Pearson wrote:


    ... but if the apostrophe in "a'b" is replaced with a
    RIGHT SINGLE QUOTATION MARK, the returned h is of type
    "email.header.Header", and seems to contain inscrutable garbage.

    I'd think an exception would be the right answer.

    Is this worth a bug report?

    I would say yes.

    --
    ~Ethan~

  • From 2QdxY4RzWzUUiLuE@potatochowder.com@21:1/5 to Peter Pearson on Fri Jul 29 19:53:09 2022
    On 2022-07-29 at 23:24:57 +0000,
    Peter Pearson <pkpearson@nowhere.invalid> wrote:

    The following code produces a nonsense result with the input
    described below:

    import mailbox
    box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
    x = box.values()[0]
    h = x.get("X-DSPAM-Factors")
    print(type(h))
    # <class 'email.header.Header'>

    The output is the desired "str" when the message file contains this:

    To: recipient@example.com
    Message-ID: <123>
    Date: Sun, 24 Jul 2022 15:31:19 +0000
    Subject: Blah blah
    From: from@from.com
    X-DSPAM-Factors: a'b

    xxx

    ... but if the apostrophe in "a'b" is replaced with a
    RIGHT SINGLE QUOTATION MARK, the returned h is of type
    "email.header.Header", and seems to contain inscrutable garbage.

    I realize that one should not put non-ASCII characters in
    message headers, but of course I didn't put it there, it
    just showed up, pretty much beyond my control. And I realize
    that when software is given input that breaks the rules, one
    cannot expect optimal results, but I'd think an exception
    would be the right answer.

    Be strict in what you send, but generous in what you receive.

    I agree that email headers are supposed to be ASCII (RFC 822, 2822, and
    now 5322 all say that), but always throwing an exception seems a little
    harsh, and arguably (I'm not arguing for or against) breaks backwards
    compatibility. At least let the exception contain, in its own
    attribute, the inscrutable garbage after the space after the colon and
    before the next CR/LF pair.

    Is this worth a bug report?

    If nothing else, the documentation could specify or disclaim the
    existing behavior.

  • From Barry@21:1/5 to All on Sat Jul 30 08:55:24 2022
    

    On 30 Jul 2022, at 00:30, Peter Pearson <pkpearson@nowhere.invalid> wrote:

    The following code produces a nonsense result with the input
    described below:

    import mailbox
    box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
    x = box.values()[0]
    h = x.get("X-DSPAM-Factors")
    print(type(h))
    # <class 'email.header.Header'>

    The output is the desired "str" when the message file contains this:

    To: recipient@example.com
    Message-ID: <123>
    Date: Sun, 24 Jul 2022 15:31:19 +0000
    Subject: Blah blah
    From: from@from.com
    X-DSPAM-Factors: a'b

    xxx

    ... but if the apostrophe in "a'b" is replaced with a
    RIGHT SINGLE QUOTATION MARK, the returned h is of type
    "email.header.Header", and seems to contain inscrutable garbage.

    Include in any bug report the exact bytes that are in the header.
    It may not be UTF-8 encoded; it may be Windows cp1252, etc.
    The repr of the header bytes will show this.
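
    A minimal sketch of how those bytes could be captured, assuming the
    message is available as raw bytes. The helper name raw_header_lines and
    both sample messages are hypothetical, made up for illustration; the
    samples carry the same RIGHT SINGLE QUOTATION MARK, once as UTF-8 and
    once as cp1252. Folded (continuation) header lines are not handled.

```python
# Hypothetical sample messages: the same header value in two encodings.
utf8_message = b"Subject: Blah blah\nX-DSPAM-Factors: a\xe2\x80\x99b\n\nxxx\n"
cp1252_message = b"Subject: Blah blah\nX-DSPAM-Factors: a\x92b\n\nxxx\n"


def raw_header_lines(raw_message, name):
    """Return the undecoded header lines whose field name matches `name`.

    Hypothetical helper: it only looks at the first line of each header
    and ignores folded continuation lines.
    """
    prefix = name.lower() + b":"
    found = []
    for line in raw_message.splitlines():
        if not line:  # the first blank line ends the header block
            break
        if line.lower().startswith(prefix):
            found.append(line)
    return found


for sample in (utf8_message, cp1252_message):
    for line in raw_header_lines(sample, b"X-DSPAM-Factors"):
        # repr() makes every non-ASCII byte visible as a \x.. escape
        print(repr(line))
```

    For a real Maildir entry, the bytes would come from reading the message
    file in binary mode, e.g. open(path, "rb").read(). In the output,
    \xe2\x80\x99 indicates UTF-8 and a lone \x92 indicates cp1252.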

    Barry



  • From Peter J. Holzer@21:1/5 to Peter Pearson on Sat Jul 30 22:19:18 2022
    On 2022-07-29 23:24:57 +0000, Peter Pearson wrote:
    The following code produces a nonsense result with the input
    described below:

    import mailbox
    box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
    x = box.values()[0]
    h = x.get("X-DSPAM-Factors")
    print(type(h))
    # <class 'email.header.Header'>

    The output is the desired "str" when the message file contains this:

    To: recipient@example.com
    Message-ID: <123>
    Date: Sun, 24 Jul 2022 15:31:19 +0000
    Subject: Blah blah
    From: from@from.com
    X-DSPAM-Factors: a'b

    xxx

    ... but if the apostrophe in "a'b" is replaced with a
    RIGHT SINGLE QUOTATION MARK, the returned h is of type
    "email.header.Header", and seems to contain inscrutable garbage.

    It's not inscrutable to me, but then I remember when RFC 1522 was the
    relevant RFC.

    Calling h.encode() returns

    =?unknown-8bit?b?YeKAmWI=?=

    which is about the best result you can get. The character set is unknown
    and the content (when decoded) is the bytes

    61 e2 80 99 62

    which is what your file contained (assuming you used UTF-8).
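
    The round trip described above can be sketched as follows. This is an
    illustration under stated assumptions, not a tested recipe for every
    Python version: the raw message is a hypothetical stand-in for the file,
    it is parsed with email.message_from_bytes() rather than through
    mailbox.Maildir, and the final step assumes the bytes really are UTF-8.

```python
import email
from email.header import Header, decode_header

# Hypothetical stand-in for the message file, with U+2019 stored as UTF-8.
raw = (
    b"Subject: Blah blah\n"
    b"X-DSPAM-Factors: a\xe2\x80\x99b\n"
    b"\n"
    b"xxx\n"
)

# message_from_bytes() uses the legacy compat32 policy by default.
msg = email.message_from_bytes(raw)
h = msg.get("X-DSPAM-Factors")
print(type(h))  # a Header object rather than str, as in the original report

# RFC 2047 encoded-word form of the header value.
encoded = h.encode()
print(encoded)

# decode_header() undoes the encoded-word wrapping without interpreting
# the character set, so the original bytes come back unchanged.
[(payload, charset)] = decode_header(encoded)
print(payload.hex(" "), charset)

# Assumption: the sender used UTF-8. For cp1252 or another encoding,
# this call would need a different codec name.
print(payload.decode("utf-8"))
```

    Going through decode_header() avoids touching h._chunks, at the cost of
    an encode/decode round trip.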

    What would be nice is if you could get at that content directly. There
    doesn't seem to be a documented method to do that. You can use h._chunks,
    but as the _ in the name implies, that's an implementation detail which
    might change in future versions (and it's not quite straightforward
    either, although consistent with other parts of Python, I think).

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"
