XPost: comp.mail.mime
In comp.mail.mime, Grant Taylor <gtaylor@tnetconsulting.net> wrote:
On 10/24/2018 03:53 PM, Eli the Bearded wrote:
=?CHARSET?Q?Quoted=2Dprintable_content?=
So, I think I am more interested in this than I was when I originally
read your post.
3596 utf-8
262 windows-1252
246 iso-8859-1
104 koi8-r
...
I like the idea, but I get the impression that I need to specify the
source encoding. Which as you can see above, there are a number of them.
Or am I misunderstanding you? Do you mean that you specify the target
encoding? Like I would want something like all of the above to be
decoded to utf-8 for processing in formail?
There's no misunderstanding. It extracts only when the encoding matches
because it really doesn't make sense to extract an encoding you can't
understand. In all likelihood, /you/ understand (some) utf-8,
windows-1252, and iso-8859-1 when they are displayed by a compatible
terminal, but your terminal window will only understand one of those,
and koi8-r will not be understood by you or your terminal.
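As a rough illustration of that gating (a sketch only, not the actual
mime.c; the name decode_q_word() and its signature are hypothetical),
a Q-encoded MIME word can be decoded only when its charset label
matches the one the caller asked for:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive match of the n bytes at a against charset name b. */
static int same_charset(const char *a, size_t n, const char *b)
{
    size_t i;
    if (strlen(b) != n)
        return 0;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = toupper(c);
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode one "=?charset?Q?payload?=" word into out, but only when its
 * charset equals `want`. Returns the decoded length, or -1 if the word
 * is malformed, uses another charset or encoding, or does not fit.
 */
long decode_q_word(const char *word, const char *want, char *out, size_t outsz)
{
    const char *cs, *enc, *p, *end;
    size_t n = 0;

    assert(word != NULL && want != NULL && out != NULL);
    if (outsz == 0 || strncmp(word, "=?", 2) != 0)
        return -1;
    cs = word + 2;
    enc = strchr(cs, '?');
    if (!enc || !same_charset(cs, (size_t)(enc - cs), want))
        return -1;                      /* charset gate: leave word alone */
    if ((enc[1] != 'Q' && enc[1] != 'q') || enc[2] != '?')
        return -1;                      /* only Q encoding handled here */
    p = enc + 3;
    end = strstr(p, "?=");
    if (!end)
        return -1;
    for (; p < end; p++) {
        int c = (unsigned char)*p;
        if (c == '_') {
            c = ' ';                    /* underscore stands for a space */
        } else if (c == '=') {
            int hi, lo;
            if (end - p < 3)
                return -1;
            hi = hexval((unsigned char)p[1]);
            lo = hexval((unsigned char)p[2]);
            if (hi < 0 || lo < 0)
                return -1;
            c = hi * 16 + lo;
            p += 2;
        }
        if (n + 1 >= outsz)
            return -1;
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return (long)n;
}
```

A caller that gets -1 back would simply keep the original encoded
text, which is the "extract only when it matches" behaviour described
above.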
If your terminal understands some form of Unicode (UTF-8, UTF-16,
UTF-32, UTF-7) then any charset you encounter can be converted to
your preferred Unicode charset. But if your terminal is ISO-8859-1
(or Windows-1252), then chances are the content in any of the other
encodings /cannot/ be converted to your charset. Windows-1252 in
particular is ISO-8859-1 plus some other characters, so ISO-8859-1
will display as Windows-1252, but not always the other way around.
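The one-way relationship between those two charsets comes down to the
byte range 0x80..0x9F, where Windows-1252 puts printable characters
(curly quotes, the euro sign and so on) and ISO-8859-1 has only C1
control codes. A minimal sketch of the check, with a hypothetical
function name:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 when every byte of a Windows-1252 string means the same
 * thing in ISO-8859-1, i.e. none falls in 0x80..0x9F; returns 0
 * otherwise.
 */
int cp1252_fits_latin1(const unsigned char *s, size_t len)
{
    size_t i;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++)
        if (s[i] >= 0x80 && s[i] <= 0x9F)
            return 0;
    return 1;
}
```

Text that passes this check displays identically under either
charset; text that fails it needs a real conversion (or a
Windows-1252 or Unicode terminal).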
This is beta quality code and I'm looking for people to test it and
flush out any problems with it.
Fair enough.
I wouldn't say no to someone wiring in iconv to convert between
charsets. To someone who wanted to do that, I'd suggest building it
into the self-contained mime.c that I wrote.
I can see two useful ways to do this:
a) A decoder function that takes a target charset and decodes MIME
   words that can be safely re-encoded into the target charset. If
   a MIME word uses characters outside the target charset, the
   original string is left unmodified.
b) A decoder function that takes a target charset and lossily, if
   needed, decodes MIME words, somehow indicating when characters
   have been omitted as untranslatable.
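To make the difference between (a) and (b) concrete, here is a
deliberately tiny sketch that hard-codes one charset pair
(ISO-8859-1 to US-ASCII) so it stays self-contained. A real version
would presumably call iconv instead; the function name, the `lossy`
flag and the '?' substitute are all assumptions made for illustration:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert ISO-8859-1 bytes to US-ASCII.
 * lossy == 0: option (a); return -1 on the first unmappable byte so
 *             the caller can leave the original MIME word unmodified.
 * lossy != 0: option (b); write '?' for each unmappable byte.
 * On success returns the number of substituted bytes (0 means exact).
 * out must have room for len + 1 bytes.
 */
long latin1_to_ascii(const unsigned char *in, size_t len, char *out, int lossy)
{
    size_t i;
    long dropped = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[i] = (char)in[i];       /* plain ASCII passes through */
        } else if (lossy) {
            out[i] = '?';               /* mark the untranslatable byte */
            dropped++;
        } else {
            out[0] = '\0';              /* strict mode: give up entirely */
            return -1;
        }
    }
    out[len] = '\0';
    return dropped;
}
```

The return value doubles as the "somehow indicating" part of (b): a
positive count tells the caller that characters were replaced.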
The procmail / formail code is a nightmare to edit. I don't know if you
are familiar with C, but here's how main() starts for formail:
int main(lastm,argv)int lastm;const char*const argv[];
{ int i,split=0,force=0,bogus=1,every=0,headreply=0,digest=0,nowait=0,keepb=0,
minfields=(char*)progid-(char*)progid,conctenate=0,babyl=0,babylstart,
berkeley=0,forgetclen;
long maxlen,ctlength;FILE*idcache=0;pid_t thepid;
size_t j,lnl,escaplen;char*chp,*namep,*escap=ESCAP;
struct field*fldp,*fp2,**afldp,*fdate,*fcntlength,*fsubject,*fFrom_;
charset = NULL;
if(lastm) /* sanity check, any argument at all? */
#define Qnext_arg() if(!*chp&&!(chp=(char*)*++argv))goto usg
while(chp=(char*)*++argv)
{ if((lastm= *chp++)==FM_SKIP)
goto number;
else if(lastm!=FM_TOTAL)
goto usg;
for(;;)
{ switch(lastm= *chp++)
{ case FM_TRUST:headreply|=1;
continue;
case FM_REPLY:areply=1;
White space is eschewed at every opportunity, and ugly C-isms abound:
"chp=(char*)*++argv" and the like. This makes adding features a bit
tricky. (That line with "charset = NULL;" is an addition of mine.)
My next version of this project puts the decoder into procmail proper,
again gated by a charset. There is no sense in trying to match an
ISO-8859-1 regular expression against a KOI8-R header.
I personally use a UTF-8 terminal which can display basically anything
in Unicode, but the content that is in koi8-r (Cyrillic) I probably
can't read and won't want to see anyway. Checking my mail log, I have
several dozen koi8-r entries whose base64 payload begins
7sUg1cTBxdTT0SDEz9PUwdfJ1NgK, which decodes to "Не удается доставить"
(shown here in UTF-8) and translates to "unable to deliver". All of
them were joe-job bounce messages.
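For anyone who wants to check a log entry like that by hand, a small
base64 decoder is enough to get at the raw bytes. This is a sketch
with hypothetical names, not code from formail. Applied to the string
above it yields 21 bytes starting 0xEE 0xC5 0x20, which are "Не" and a
space in KOI8-R, and ending in a newline:

```c
#include <assert.h>
#include <stddef.h>

/* Map one base64 character to its 6-bit value, or -1 if it is not one. */
static int b64val(int c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/*
 * Decode base64 text into raw bytes. Stops at '=' padding or the end
 * of the string and skips characters outside the alphabet (such as
 * newlines). Returns the byte count, or -1 if out is too small.
 */
long b64_decode(const char *in, unsigned char *out, size_t outsz)
{
    unsigned long acc = 0;      /* bit accumulator */
    int bits = 0;               /* number of unread bits held in acc */
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (; *in && *in != '='; in++) {
        int v = b64val((unsigned char)*in);
        if (v < 0)
            continue;
        acc = (acc << 6) | (unsigned long)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            if (n >= outsz)
                return -1;
            out[n++] = (unsigned char)((acc >> bits) & 0xFF);
        }
    }
    return (long)n;
}
```

The bytes still have to be interpreted as KOI8-R (for example by
piping them through iconv -f KOI8-R -t UTF-8) before they read as
Cyrillic.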
Another option is to not modify the C code at all but handle this
in procmail if you care.
:0
* ^Subject:.*=\?[a-z0-9][a-z0-9.-]+\?[qb]\?[^? ]*\?=
* ^Subject:.*=\?\/[a-z0-9][a-z0-9.-]+
{
CHARSET=$MATCH
Subject=`formail -x Subject: -M $CHARSET | iconv -f $CHARSET -t UTF-8`
}
:0 E
{
Subject=`formail -x Subject:`
}
That still leaves the allowed but "hard to imagine why except for
pathological tests" case of multiple different charsets used in
different MIME words in one header, but it's probably rare enough to
not worry about.
Elijah
------
seriously considering running a pretty-printer over this code, diffs be damned