• 40tude: Exporting folder of messages to .TXT or MBOX always inserts CRL

    From Sqwertz@21:1/5 to All on Tue Oct 5 23:45:17 2021
    I'm trying to filter out extraneous headers from this text file
    which I've exported using File->Save Selected Messages->MBOX|TXT.
    There are a couple thousand of messages in here and I'm trying to
    make it more legible without all the visual noise of the headers.

    No other long headers seem to do the CRLF thing except for
    X-Received: Is this my obnoxious newserver (highwinds) doing this
    and Dialog doesn't care?

    I've been working on this for days off and on. Can anybody help me
    delete all headers except for:

    Newsgroups:
    Date:
    From:
    Subject:
    Message-ID:
    (in their natural order, not how I've listed)

    From the text file at:

    https://drive.google.com/file/d/1ElDcN7rUvmy7kn6f3Sn78jz6YXwz-WhJ/view?usp=sharing

    It's for a very good cause (the Missouri Board of Nursing in regards
    to a paedo pediatric HOME CARE nurse)

    Using notepad++ I've gotten rid of everything EXCEPT for those nasty
    X-Received: second lines and there's no pattern that won't remove
    other context that I can figure - but my grepping and regex's are
    really rusty.

    Here's a sample of the text file to show my/our problem (more at the
    link).

    Thanks IA.

    -sw

    From jwk6680@bjc.org Mon Oct 04 05:37:30 2021
    X-Folder: Kuthe
    X-Received: by 2002:a37:688b:: with SMTP id d133mr9895201qkc.352.1633351051221;
     Mon, 04 Oct 2021 05:37:31 -0700 (PDT)
    X-Received: by 2002:a25:b84e:: with SMTP id b14mr15395553ybm.348.1633351051055;
     Mon, 04 Oct 2021 05:37:31 -0700 (PDT)
    Path: not-for-mail
    Newsgroups: rec.food.cooking
    Date: Mon, 4 Oct 2021 05:37:30 -0700 (PDT)
    Injection-Info: google-groups.googlegroups.com; posting-host=35.129.9.50; posting-account=ja_j6woAAABJv24pt7Dxx6icnyi92ahF
    NNTP-Posting-Host: 35.129.9.50
    User-Agent: G2/1.0
    MIME-Version: 1.0
    Message-ID: <4c1d3472-5f99-4672-be69-3020138c3fefn@googlegroups.com>
    Subject: And I had a GREAT ORGASM yesterday!
    From: John Kuthe <jwk6680@bjc.org>
    Injection-Date: Mon, 04 Oct 2021 12:37:31 +0000
    Content-Type: text/plain; charset="UTF-8"
    X-Received-Bytes: 1007

    On Sunday, my DAY OFF! :-) Complete with ejaculation! Wow!

    At 61 Years old! And it felt SO GOOD! :-)


    John Kuthe, RN, BSN...

    From jwk6680@bjc.org Sun Oct 03 18:34:10 2021
    X-Folder: Kuthe
    X-Received: by 2002:a0c:e381:: with SMTP id a1mr20159752qvl.42.1633311251669;
     Sun, 03 Oct 2021 18:34:11 -0700 (PDT)
    X-Received: by 2002:a25:3620:: with SMTP id d32mr12272072yba.46.1633311251515;
     Sun, 03 Oct 2021 18:34:11 -0700 (PDT)
    Path: not-for-mail
    Newsgroups: rec.food.cooking
    Date: Sun, 3 Oct 2021 18:34:11 -0700 (PDT)
    In-Reply-To: <sjdl8i$9tg$1@dont-email.me>
    Injection-Info: google-groups.googlegroups.com; posting-host=35.129.9.50; posting-account=ja_j6woAAABJv24pt7Dxx6icnyi92ahF
    NNTP-Posting-Host: 35.129.9.50
    References: <c5d1ff82-d941-44de-b125-8e22ce08f555n@googlegroups.com> <sjdl8i$9tg$1@dont-email.me>
    User-Agent: G2/1.0

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bernd Rose@21:1/5 to Sqwertz on Wed Oct 6 07:50:47 2021
    On Tue, 5th Oct 2021 23:45:17 -0500, Sqwertz wrote:

    Using notepad++ I've gotten rid of everything EXCEPT for those nasty
    X-Received: second lines and there's no pattern that won't remove
    other context that I can figure - but my grepping and regex's are
    really rusty.

    In Notepad++ replace (RegEx) the following with an empty string:
    ^X-Received: [^\r\n]+\r\n(\h[^\r\n]+\r\n)*

    HTH.
    Bernd
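
For anyone who would rather script this than use Notepad++, here is a rough Python sketch of the same replacement. The function name drop_header is made up for illustration. Python's re module has no \h class, so [ \t] stands in for it, and \r?\n is used so the pattern also works on text whose line endings were already converted to plain \n. Like the Notepad++ pattern, it would also match a body line that happens to start with the header name.

```python
import re

def drop_header(text: str, name: str) -> str:
    """Remove every header called `name`, together with its continuation lines."""
    pattern = re.compile(
        # header line itself, then zero or more lines starting with space/tab
        r"^" + re.escape(name) + r": [^\r\n]+\r?\n(?:[ \t][^\r\n]+\r?\n)*",
        re.MULTILINE,
    )
    return pattern.sub("", text)
```

Calling drop_header(text, "X-Received") leaves X-Received-Bytes: alone, because the pattern requires the colon directly after the name.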

  • From VanguardLH@21:1/5 to Sqwertz on Wed Oct 6 05:42:05 2021
    Continuation lines are allowed for headers to accommodate those that are
    long, sometimes exceeding the 998-character maximum per physical line.

    headerName: string1
     string2
     string3

    string2 and string3 are continuation lines.

    Continuation lines are denoted by leading whitespace. That is, a header
    line is a continuation line only if it starts, in column 1, with 1 or
    more whitespace characters (space or tab).
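
As a small illustration of that rule, the sketch below joins continuation lines back onto their header line. The function name unfold is hypothetical, and it assumes it is handed only the header section of a message, not the body.

```python
import re

def unfold(header_block: str) -> str:
    """Join each continuation line onto the header line before it."""
    # A line break that is immediately followed by a space or tab marks a
    # folded header; removing just the line break restores one long line.
    return re.sub(r"\r?\n(?=[ \t])", "", header_block)
```

The leading whitespace of the continuation line is kept, so the joined header still has a separator between its two halves.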

    Nothing wrong with the Received header. It obeys the RFC standard for
    Internet messages. The header section ends with the first blank line;
    i.e., a line break (\n or \r\n) in column 1. Before that, your script
    would need to copy and paste every continuation line to the preceding
    line to compose 1 long header line as 1 physical line. Since you're
    throwing away the headers, why keep anything before the blank line
    delimiting the header section? Scan (parse through) the message, ignore
    everything up to the first blank line your parser encounters, and keep
    everything after it.
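
A minimal sketch of that "skip the headers entirely" approach, assuming the input is the text of one single message (not the whole exported file). The name body_only is made up for the example.

```python
def body_only(message: str) -> str:
    """Return everything after the first blank line of one message."""
    # Normalize CRLF so the blank line is always two consecutive "\n".
    normalized = message.replace("\r\n", "\n")
    _headers, separator, body = normalized.partition("\n\n")
    # If no blank line was found, the message has no body at all.
    return body if separator else ""
```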

    If you want to keep some headers, you'll have to test each line on a
    read to see if the header's name matches one of those you want to keep.
    If so, you have to keep that line, and every continuation line
    thereafter (every following line with a space in column 1), until the
    next line in the format:

    headerName: string
    ^          ^
    |          |__ one whitespace minimum for parsing name from value
    |__ must be in column 1 of a physical line

    You'll need to write a parser script checking if each line is a header
    line (headername:<space>), if that's one you want to keep, and if
    following lines are continuation lines, or another header line, and
    terminating the parsing upon reaching the first blank line.
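
The parser described above could look roughly like this in Python. It is only a sketch under a few assumptions: the export is mbox-style, each message starts with a "From " separator line, and body lines beginning with "From " have been escaped by the exporter. The names KEEP and filter_headers are invented for illustration.

```python
KEEP = {"newsgroups", "date", "from", "subject", "message-id"}

def filter_headers(text: str, keep=KEEP) -> str:
    """Keep only the wanted headers (plus their continuation lines)."""
    out = []
    in_headers = False  # True between a "From " separator and the blank line
    keeping = False     # whether the current header is one we keep
    for line in text.splitlines(keepends=True):
        stripped = line.rstrip("\r\n")
        if stripped.startswith("From "):
            # mbox separator: a new message's header section begins
            in_headers, keeping = True, False
            out.append(line)
        elif not in_headers:
            out.append(line)      # body text passes through untouched
        elif stripped == "":
            in_headers = False    # first blank line ends the header section
            out.append(line)
        elif stripped[0] in " \t":
            if keeping:           # continuation line shares its header's fate
                out.append(line)
        else:
            name = stripped.partition(":")[0].lower()
            keeping = name in keep
            if keeping:
                out.append(line)
    return "".join(out)
```

The kept headers come out in the order they appear in the file, and the "From " separator lines are left in place so the messages stay distinguishable.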

    Regex is handy, but I don't think you can get it to handle continuation
    lines as part of the preceding header line.

  • From Sqwertz@21:1/5 to Bernd Rose on Wed Oct 6 23:30:36 2021
    On Wed, 6 Oct 2021 07:50:47 +0200, Bernd Rose wrote:

    On Tue, 5th Oct 2021 23:45:17 -0500, Sqwertz wrote:

    Using notepad++ I've gotten rid of everything EXCEPT for those nasty
    X-Received: second lines and there's no pattern that won't remove
    other context that I can figure - but my grepping and regex's are
    really rusty.

    In Notepad++ replace (RegEx) the following with an empty string:
    ^X-Received: [^\r\n]+\r\n(\h[^\r\n]+\r\n)*

    HTH.
    Bernd

    Thanks Bernd! That worked perfectly. I also used it on the
    multiple Received: lines as well as they got pretty extensive, too.

    Thank you too, VanguardLH. I had definitely seen the multi-line
    headers in SMTP email (especially google and MS), but I guess I
    never really noticed them in NNTP. I swear I never saw a line break
    in References: and Path: especially (and now we have no Path: header
    with Highwinds <grrrr>).

    -sw
