Forum: >>> Magnum BBS <<<

mailbox misbehavior with non-ASCII

From Peter Pearson@21:1/5 to All on Fri Jul 29 23:24:57 2022

The following code produces a nonsense result with the input
described below:

import mailbox
box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
x = box.values()[0]
h = x.get("X-DSPAM-Factors")
print(type(h))
# <class 'email.header.Header'>

The output is the desired "str" when the message file contains this:

To: recipient@example.com
Message-ID: <123>
Date: Sun, 24 Jul 2022 15:31:19 +0000
Subject: Blah blah
From: from@from.com
X-DSPAM-Factors: a'b

xxx

... but if the apostrophe in "a'b" is replaced with a
RIGHT SINGLE QUOTATION MARK, the returned h is of type
"email.header.Header", and seems to contain inscrutable garbage.

I realize that one should not put non-ASCII characters in
message headers, but of course I didn't put it there, it
just showed up, pretty much beyond my control. And I realize
that when software is given input that breaks the rules, one
cannot expect optimal results, but I'd think an exception
would be the right answer.

Is this worth a bug report?

--
To email me, substitute nowhere->runbox, invalid->com.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ethan Furman@21:1/5 to Peter Pearson on Fri Jul 29 16:59:29 2022

On 7/29/22 16:24, Peter Pearson wrote:

... but if the apostrophe in "a'b" is replaced with a
RIGHT SINGLE QUOTATION MARK, the returned h is of type "email.header.Header", and seems to contain inscrutable garbage.

I'd think an exception would be the right answer.

Is this worth a bug report?

I would say yes.

--
~Ethan~


From 2QdxY4RzWzUUiLuE@potatochowder.com@21:1/5 to Peter Pearson on Fri Jul 29 19:53:09 2022

On 2022-07-29 at 23:24:57 +0000,
Peter Pearson <pkpearson@nowhere.invalid> wrote:

The following code produces a nonsense result with the input
described below:

import mailbox
box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
x = box.values()[0]
h = x.get("X-DSPAM-Factors")
print(type(h))
# <class 'email.header.Header'>

The output is the desired "str" when the message file contains this:

To: recipient@example.com
Message-ID: <123>
Date: Sun, 24 Jul 2022 15:31:19 +0000
Subject: Blah blah
From: from@from.com
X-DSPAM-Factors: a'b

xxx

... but if the apostrophe in "a'b" is replaced with a
RIGHT SINGLE QUOTATION MARK, the returned h is of type
"email.header.Header", and seems to contain inscrutable garbage.

I realize that one should not put non-ASCII characters in
message headers, but of course I didn't put it there, it
just showed up, pretty much beyond my control. And I realize
that when software is given input that breaks the rules, one
cannot expect optimal results, but I'd think an exception
would be the right answer.

Be strict in what you send, but generous in what you receive.

I agree that email headers are supposed to be ASCII (RFC 822, 2822, and
now 5322 all say that), but always throwing an exception seems a little
harsh, and arguably (I'm not arguing for or against) breaks backwards
compatibility. At least let the exception contain, in its own
attribute, the inscrutable garbage after the space after the colon and
before the next CR/LF pair.
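
To make the "exception with its own attribute" idea concrete, here is a rough sketch of how a caller could get that behavior today without changing the library. NonAsciiHeaderError and strict_get are hypothetical names invented for this illustration, not part of the email package, and the sketch assumes the message was parsed with the default (compat32) policy, which is what produces the Header object described above.

```python
import email
from email.header import Header, decode_header


class NonAsciiHeaderError(ValueError):
    """Hypothetical exception for a header holding undeclared 8-bit data."""

    def __init__(self, name, raw):
        super().__init__(f"header {name!r} contains non-ASCII bytes: {raw!r}")
        self.name = name
        self.raw = raw  # the undecoded header value, as bytes


def strict_get(msg, name):
    """Return the header as str, or raise with the raw bytes attached."""
    value = msg.get(name)
    if isinstance(value, Header):
        # decode_header() yields (bytes, charset) pairs for a Header object.
        raw = b"".join(data for data, _charset in decode_header(value))
        raise NonAsciiHeaderError(name, raw)
    return value


# Usage with an in-memory message instead of a Maildir file.
msg = email.message_from_bytes(
    b"Subject: Blah blah\nX-DSPAM-Factors: a\xe2\x80\x99b\n\nxxx\n"
)
try:
    strict_get(msg, "X-DSPAM-Factors")
except NonAsciiHeaderError as exc:
    print(exc.raw)
```

Callers who prefer the generous behavior can keep calling msg.get() directly; the wrapper only changes what happens for the one lookup that opts in.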

Is this worth a bug report?

If nothing else, the documentation could specify or disclaim the
existing behavior.


From Barry@21:1/5 to All on Sat Jul 30 08:55:24 2022

On 30 Jul 2022, at 00:30, Peter Pearson <pkpearson@nowhere.invalid> wrote:

The following code produces a nonsense result with the input
described below:

import mailbox
box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
x = box.values()[0]
h = x.get("X-DSPAM-Factors")
print(type(h))
# <class 'email.header.Header'>

The output is the desired "str" when the message file contains this:

To: recipient@example.com
Message-ID: <123>
Date: Sun, 24 Jul 2022 15:31:19 +0000
Subject: Blah blah
From: from@from.com
X-DSPAM-Factors: a'b

xxx

... but if the apostrophe in "a'b" is replaced with a
RIGHT SINGLE QUOTATION MARK, the returned h is of type "email.header.Header", and seems to contain inscrutable garbage.

Include in any bug report the exact bytes that are in the header.
It may not be UTF-8 encoded; it may be Windows cp1252, etc.
The repr of the header bytes will show this.
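
One way to capture those bytes, as a minimal sketch: scan the raw message for the header and print its repr without going through the email parser at all. The helper name raw_header_lines is made up for this example, and the sample message is built in memory from the headers quoted above rather than read from the real Maildir file.

```python
def raw_header_lines(raw_message, name):
    """Return the undecoded lines (bytes) of the first header called `name`."""
    prefix = name.lower().encode("ascii") + b":"
    found = []
    for line in raw_message.splitlines():
        if not line:
            break                      # blank line ends the header block
        if found and line[:1] in (b" ", b"\t"):
            found.append(line)         # folded continuation of our header
        elif found:
            break                      # the next header has started
        elif line.lower().startswith(prefix):
            found.append(line)
    return found


sample = (
    b"From: from@from.com\n"
    b"X-DSPAM-Factors: a\xe2\x80\x99b\n"
    b"\n"
    b"xxx\n"
)
for line in raw_header_lines(sample, "X-DSPAM-Factors"):
    print(repr(line))   # b'X-DSPAM-Factors: a\xe2\x80\x99b'
```

For a real message, the raw bytes could come from opening the file in "rb" mode, or from the mailbox's get_bytes(key) method. The bytes e2 80 99 would indicate UTF-8, while a single byte 92 would point to cp1252.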

Barry

I realize that one should not put non-ASCII characters in
message headers, but of course I didn't put it there, it
just showed up, pretty much beyond my control. And I realize
that when software is given input that breaks the rules, one
cannot expect optimal results, but I'd think an exception
would be the right answer.

Is this worth a bug report?



From Peter J. Holzer@21:1/5 to Peter Pearson on Sat Jul 30 22:19:18 2022

On 2022-07-29 23:24:57 +0000, Peter Pearson wrote:

The following code produces a nonsense result with the input
described below:

import mailbox
box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
x = box.values()[0]
h = x.get("X-DSPAM-Factors")
print(type(h))
# <class 'email.header.Header'>

The output is the desired "str" when the message file contains this:

To: recipient@example.com
Message-ID: <123>
Date: Sun, 24 Jul 2022 15:31:19 +0000
Subject: Blah blah
From: from@from.com
X-DSPAM-Factors: a'b

xxx

... but if the apostrophe in "a'b" is replaced with a
RIGHT SINGLE QUOTATION MARK, the returned h is of type "email.header.Header", and seems to contain inscrutable garbage.

It's not inscrutable to me, but then I remember when RFC 1522 was the
relevant RFC.

Calling h.encode() returns

=?unknown-8bit?b?YeKAmWI=?=

which is about the best result you can get. The character set is unknown
and the content (when decoded) is the bytes

61 e2 80 99 62

which is what your file contained (assuming you used UTF-8).

What would be nice is if you could get at that content directly. There
doesn't seem to be a documented method to do that. You can use h._chunks,
but as the _ in the name implies, that's an implementation detail which
might change in future versions (and it's not quite straightforward
either, although consistent with other parts of Python, I think).
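
A possible way to reach the bytes without touching h._chunks, offered as a sketch rather than a recommendation: email.header.decode_header is a public function, and its docstring says it accepts a Header object as well as a string. Whether that counts as documented for this particular case is a judgment call. The example parses an in-memory message with email.message_from_bytes, on the assumption that it goes through the same default (compat32) policy as the mailbox code in the original post. Decoding the result as UTF-8 is a guess about the sender's charset.

```python
import email
from email.header import Header, decode_header

raw_message = (
    b"From: from@from.com\n"
    b"X-DSPAM-Factors: a\xe2\x80\x99b\n"
    b"\n"
    b"xxx\n"
)
msg = email.message_from_bytes(raw_message)
h = msg.get("X-DSPAM-Factors")

if isinstance(h, Header):
    # Each pair is (bytes, charset); the charset here is 'unknown-8bit'.
    chunks = decode_header(h)
    raw_value = b"".join(data for data, _charset in chunks)
    print(raw_value.hex(" "))          # 61 e2 80 99 62
    # Guessing UTF-8; errors="replace" keeps this from raising on a bad guess.
    text = raw_value.decode("utf-8", errors="replace")
else:
    # Pure-ASCII headers come back as a plain str.
    text = h
```

Checking isinstance(h, Header) first keeps the common ASCII case on the plain str path.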

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

