• do you use the formail program?

    From Jorgen Grahn@21:1/5 to Eli the Bearded on Fri Oct 26 05:57:15 2018
    XPost: comp.mail.mime

    On Wed, 2018-10-24, Eli the Bearded wrote:
    > Do you use the formail program? If so maybe you could help test
    > something.
    >
    > In a fit of frustration with RFC-2047 MIME encoded words, I added
    > code to formail that will decode them.
    ...
    > This is beta quality code and I'm looking for people to test it and
    > flush out any problems with it.
    >
    > https://github.com/Eli-the-Bearded/procmail-formail

    I don't use formail (yet), but I'm fairly interested in running this
    on my mailboxes to see if they trigger any problems. If I find the
    time, I'll let you know the result.

    Note that mailing list archives are good test data.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Eli the Bearded@21:1/5 to gtaylor@tnetconsulting.net on Thu Nov 1 18:54:42 2018
    XPost: comp.mail.mime

    In comp.mail.mime, Grant Taylor <gtaylor@tnetconsulting.net> wrote:
    > On 10/24/2018 03:53 PM, Eli the Bearded wrote:
    >> =?CHARSET?Q?Quoted=2Dprintable_content?=
    > So, I think I am more interested in this than I was when I originally
    > read your post.
    >
    >    3596 utf-8
    >     262 windows-1252
    >     246 iso-8859-1
    >     104 koi8-r
    >     ...
    > I like the idea, but I get the impression that I need to specify the
    > source encoding. Which as you can see above, there are a number of them.
    >
    > Or am I misunderstanding you? Do you mean that you specify the target
    > encoding? Like I would want something like all of the above to be
    > decoded to utf-8 for processing in formail?

    There's no misunderstanding. It extracts only when the encoding matches
    because it really doesn't make sense to extract an encoding you can't
    understand. In all likelihood, /you/ understand (some) utf-8,
    windows-1252, and iso-8859-1 when they are displayed by a compatible
    terminal, but your terminal window will only understand one of those,
    and koi8-r will not be understood by you or your terminal.

    If your terminal understands some form of Unicode (UTF-8, UTF-16,
    UTF-32, UTF-7) then any charset you encounter can be converted to
    your preferred Unicode charset. But if your terminal is ISO-8859-1
    (or Windows-1252), then chances are the content in any of the other
    encodings /cannot/ be converted to your charset. Windows-1252 in
    particular is ISO-8859-1 with printable characters in place of the
    0x80-0x9F control codes, so ISO-8859-1 text will display correctly as
    Windows-1252, but not always the other way around.
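
    The convertibility point above can be checked with a few lines of
    Python. This is only an illustrative sketch, not formail code; the
    helper name convertible() is made up for the example, and Python's
    codec names (cp1252 for Windows-1252) are assumed.

```python
def convertible(raw, source, target):
    """Return True if bytes in charset `source` survive re-encoding to `target`."""
    try:
        raw.decode(source).encode(target)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return True


cyrillic = "Не удается".encode("koi8-r")
cafe = "café".encode("iso-8859-1")
euro = "€".encode("cp1252")  # byte 0x80, a control code in ISO-8859-1

print(convertible(cyrillic, "koi8-r", "utf-8"))       # Unicode target: True
print(convertible(cyrillic, "koi8-r", "iso-8859-1"))  # no Cyrillic in Latin-1: False
print(convertible(cafe, "iso-8859-1", "cp1252"))      # True
print(convertible(euro, "cp1252", "iso-8859-1"))      # False
```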

    >> This is beta quality code and I'm looking for people to test it and
    >> flush out any problems with it.
    > Fair enough.

    I wouldn't say no to someone wiring in iconv to convert between
    charsets. To someone who wanted to do that, I'd suggest building it
    into the self-contained mime.c that I wrote.

    I can see two useful ways to do this:

    a) A decoder function that takes a target charset and decodes MIME
       words that can be safely re-encoded into the target charset. If
       a MIME word uses characters outside the target charset, the
       original string is left unmodified.

    b) A decoder function that takes a target charset and decodes MIME
       words lossily if needed, somehow indicating when characters
       have been omitted as untranslatable.
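
    To make the difference between (a) and (b) concrete, here is a rough
    Python sketch of both behaviours. It is not the mime.c code and does
    not use iconv; decode_strict(), decode_lossy() and the "?" marker are
    assumptions made for the example.

```python
import base64
import quopri
import re

# One RFC 2047 encoded word: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([bBqQ])\?([^? ]*)\?=")


def _word_text(match):
    """Decode one encoded word to a string; raises on bad input."""
    charset, kind, payload = match.group(1), match.group(2), match.group(3)
    data = payload.encode("ascii")
    if kind in "bB":
        raw = base64.b64decode(data)
    else:
        raw = quopri.decodestring(data, header=True)  # "_" means space
    return raw.decode(charset)


def decode_strict(header, target):
    """Option (a): decode only words that fit the target charset."""
    def repl(match):
        try:
            text = _word_text(match)
            text.encode(target)  # raises if any character does not fit
        except (LookupError, ValueError):
            return match.group(0)  # leave the original word untouched
        return text
    return ENCODED_WORD.sub(repl, header)


def decode_lossy(header, target):
    """Option (b): decode anyway, marking untranslatable characters with "?"."""
    def repl(match):
        try:
            text = _word_text(match)
        except (LookupError, ValueError):
            return match.group(0)
        return text.encode(target, errors="replace").decode(target)
    return ENCODED_WORD.sub(repl, header)


word = "=?iso-8859-1?Q?caf=E9_au_lait?="
print(decode_strict("Subject: " + word, "utf-8"))
print(decode_strict("Subject: " + word, "ascii"))
print(decode_lossy("Subject: " + word, "ascii"))
```

    The sketch ignores one RFC 2047 detail: whitespace between two
    adjacent encoded words is supposed to be dropped when both are decoded.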

    The procmail / formail code is a nightmare to edit. I don't know if you
    are familiar with C, but here's how main() starts for formail:

    int main(lastm,argv)int lastm;const char*const argv[];
    { int i,split=0,force=0,bogus=1,every=0,headreply=0,digest=0,nowait=0,keepb=0,
    minfields=(char*)progid-(char*)progid,conctenate=0,babyl=0,babylstart,
    berkeley=0,forgetclen;
    long maxlen,ctlength;FILE*idcache=0;pid_t thepid;
    size_t j,lnl,escaplen;char*chp,*namep,*escap=ESCAP;
    struct field*fldp,*fp2,**afldp,*fdate,*fcntlength,*fsubject,*fFrom_;
    charset = NULL;
    if(lastm) /* sanity check, any argument at all? */
    #define Qnext_arg() if(!*chp&&!(chp=(char*)*++argv))goto usg
    while(chp=(char*)*++argv)
    { if((lastm= *chp++)==FM_SKIP)
    goto number;
    else if(lastm!=FM_TOTAL)
    goto usg;
    for(;;)
    { switch(lastm= *chp++)
    { case FM_TRUST:headreply|=1;
    continue;
    case FM_REPLY:areply=1;

    White space is eschewed at every opportunity, and ugly C-isms abound:
    "chp=(char*)*++argv" and the like. This makes adding features a bit
    tricky. (That line with "charset = NULL;" is an addition of mine.)

    My next version of this project puts the decoder into procmail proper,
    again gated by a charset. There is no sense in trying to match an
    ISO-8859-1 regular expression against a KOI8-R header.
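
    A small Python illustration of why the gate matters (an assumed
    example, not procmail behaviour): the same byte means different
    letters in the two charsets, so a byte-level pattern written for one
    can match unrelated text in the other.

```python
import re

# Pattern written by someone thinking in ISO-8859-1: the single byte 0xE9.
pattern = re.compile("é".encode("iso-8859-1"))

# Raw header bytes as they would arrive in KOI8-R.
header = "Subject: ИНФО".encode("koi8-r")

# In KOI8-R, byte 0xE9 is the Cyrillic letter И, so the pattern "matches".
false_hit = pattern.search(header) is not None
same_byte = "é".encode("iso-8859-1") == "И".encode("koi8-r")
print(false_hit, same_byte)
```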

    I personally use a UTF-8 terminal which can display basically anything
    in Unicode, but the content that is in koi8-r (Cyrillic) I probably
    can't read and won't want to see anyway. Checking my mail log, I have
    several dozen koi8-r entries that begin 7sUg1cTBxdTT0SDEz9PUwdfJ1NgK,
    which is the Base64 form of "Не удается доставить" and translates to
    "unable to deliver". All of them were joe-job bounce messages.

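    The logged string can be checked by hand. A minimal Python sketch,
    assuming the log entry is the Base64 payload of a koi8-r encoded word
    (the variable names are just for the example):

```python
import base64

logged = "7sUg1cTBxdTT0SDEz9PUwdfJ1NgK"

raw = base64.b64decode(logged)  # 21 bytes of KOI8-R
text = raw.decode("koi8-r")     # now a Unicode string

print(repr(text))               # the final byte decodes to a newline
print(text.encode("utf-8"))     # the same text as UTF-8 bytes
```
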
    Another option is to not modify the C code at all but handle this
    in procmail if you care.

    # RFC 2047 allows only two encodings: B (base64) and Q
    :0
    * ^Subject:.*=\?[a-z0-9][a-z0-9.-]+\?[bq]\?[^? ]*\?=
    * ^Subject:.*=\?\/[a-z0-9][a-z0-9.-]+
    {
      CHARSET=$MATCH
      Subject=`formail -x Subject: -M $CHARSET | iconv -f $CHARSET -t UTF-8`
    }
    :0 E
    {
      Subject=`formail -x Subject:`
    }

    That still leaves the allowed but "hard to imagine why except for
    pathological tests" case of multiple different charsets used in
    different MIME words in one header, but it's probably rare enough to
    not worry about.
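
    For anyone who does want to handle the mixed-charset case outside
    procmail, Python's standard email.header.decode_header() returns one
    (bytes, charset) pair per run of encoded words, so each run can be
    converted separately. A sketch only; decode_mixed() is a made-up
    helper name, and an unknown charset name would still raise LookupError.

```python
from email.header import decode_header


def decode_mixed(value):
    """Decode a header value whose encoded words may use different charsets."""
    pieces = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Unencoded stretches come back with charset None; treat as ASCII.
            chunk = chunk.decode(charset or "ascii", errors="replace")
        pieces.append(chunk)
    return "".join(pieces)


subject = "=?iso-8859-1?Q?caf=E9?= =?koi8-r?B?7sUg?="
print(decode_mixed(subject))
```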

    Elijah
    ------
    seriously considering running a pretty-printer over this code, diffs be damned
