• mbox archive for a news server

    From Jason Evans@21:1/5 to All on Mon Aug 23 07:07:10 2021
    Hi all,

    I would like to know if anyone here knows about an archival tool or
    script that can be used to back up new articles into mbox format. Similar
    to what exists in archive.org. I'm not much of a programmer but before I
    spend the time and effort to cobble together a bash script to do this, I
    want to see if someone here already has something that can do that
    already. No need to re-invent the wheel, etc. Thanks.

    __
    JE

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Matija Nalis@21:1/5 to Jason Evans on Mon Aug 23 16:45:42 2021
    On Mon, 23 Aug 2021 07:07:10 -0000 (UTC), Jason Evans <jsevans@mailfence.com> wrote:
    I would like to know if anyone here knows about an archival tool or
    script that can be used to back up new articles into mbox format. Similar
    to what exists in archive.org. I'm not much of a programmer but before I
    spend the time and effort to cobble together a bash script to do this, I
    want to see if someone here already has something that can do that
    already. No need to re-invent the wheel, etc. Thanks.

    I'm using slrn as news agent, and simply use '#' to tag messages/threads
    I want to archive, and then press 'o' to select mbox file in which to
    save them (could be more automated with slang macros in slrn, if one wants)

    That is for archive.org-alike saving (you choose what you want to save).

    --
    Opinions above are GNU-copylefted.

  • From Jason Evans@21:1/5 to Matija Nalis on Mon Aug 23 15:15:42 2021
    On Mon, 23 Aug 2021 16:45:42 +0200, Matija Nalis wrote:

    I'm using slrn as news agent, and simply use '#' to tag messages/threads
    I want to archive, and then press 'o' to select mbox file in which to
    save them (could be more automated with slang macros in slrn, if one
    wants)

    That is for archive.org-alike saving (you choose what you want to save).

    Hi Matija,

    Thanks for the input.

    However I'm running my own INN news server using the tradpool storage
    method. I want to be able to create an automated monthly archive of every
    article on my server and dump that to mbox files for each newsgroup.
    Doing it through slrn would be way too much of a headache.

  • From Frank Slootweg@21:1/5 to Jason Evans on Mon Aug 23 15:16:04 2021
    Jason Evans <jsevans@mailfence.com> wrote:
    Hi all,

    I would like to know if anyone here knows about an archival tool or
    script that can be used to back up new articles into mbox format. Similar
    to what exists in archive.org. I'm not much of a programmer but before I
    spend the time and effort to cobble together a bash script to do this, I
    want to see if someone here already has something that can do that
    already. No need to re-invent the wheel, etc. Thanks.

    If I understand your question correctly, then the '-S' option of the
    'tin' newsreader can do what you want.

    Here are some relevant sections from the tin(1) manpage:

    -S Save unread articles for later reading by the ''-R''
    option. For more information read section "AUTOMATIC MAILING
    AND SAVING NEW NEWS".

    [...]

    AUTOMATIC MAILING AND SAVING NEW NEWS
    tin allows new/unread news articles to be mailed (''-M'' and ''-N''
    option) or saved (''-S'' option) in batch mode for later reading. Useful
    when going on holiday and you don't want to return and find that
    expire has removed a whole load of unread articles. Best to run via
    cron(1) everyday while away, after which you will be mailed a report of
    which articles were mailed/saved from which newsgroups and the total
    number of articles mailed/saved. Articles are saved in a private news
    structure under your savedir directory (default is
    ${TIN_HOMEDIR:-"$HOME"}/News). Be careful of using this option if you read
    a lot of groups because you could overflow your file system.

    When using ''-S'' together with a given directory to save to (''-s''
    option), the same directory must be specified when reading the articles
    by ''-R''.

    If you only want to save some of your groups use the batch_save tinrc
    variable. Set to ON or OFF in tinrc to enable/disable saving of all
    groups and then use the batch_save attribute to fine tune which groups
    you want to have saved. For example, if you want to save most of your
    groups, then set batch_save to ON in tinrc and selectively turn off the
    ones you don't want using attributes.

    tin -M iain -c -f newsrc.mail
    (mail any unread articles in newsgroups specified
    in file newsrc.mail to the local user iain and mark
    them as read)

    tin -S -c -f newsrc.save
    (save any unread articles in newsgroups specified
    in file newsrc.save and mark them as read)

    tin -R (read any articles saved by tin -S)

  • From Russ Allbery@21:1/5 to Jason Evans on Mon Aug 23 13:05:50 2021
    Jason Evans <jsevans@mailfence.com> writes:

    I would like to know if anyone here knows about an archival tool or
    script that can be used to back up new articles into mbox
    format. Similar to what exists in archive.org. I'm not much of a
    programmer but before I spend the time and effort to cobble together a
    bash script to do this, I want to see if someone here already has
    something that can do that already. No need to re-invent the wheel, etc.

    The archive program that comes with INN and can be configured as a feed in
    newsfeeds is fairly close to what you want except that when you configure
    it to store multiple messages in a single file, it uses a custom separator
    rather than doing From escaping and inserting a mailbox From.

    You could run it in its default mode where it saves each individual
    message to a file and then separately run some other program to convert a
    directory full of files to a mailbox. I suspect most of the Google hits
    for "maildir2mbox" would do it, since a maildir is essentially a directory
    full of messages.
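
    As a rough illustration of that second step, here is a minimal sketch
    that glues a directory of one-article-per-file messages into a single
    mbox. The function name dir_to_mbox, the placeholder sender
    news@localhost, and the use of the current time in the separator line are
    all assumptions made for the example; a real converter would normally
    take the sender and date from each article.

```shell
#!/bin/sh
# Sketch only: dir_to_mbox ARTICLE_DIR MBOX_FILE
# Appends every regular file in ARTICLE_DIR to MBOX_FILE as one mbox message.
dir_to_mbox() {
    dir=$1
    mbox=$2
    for article in "$dir"/*; do
        [ -f "$article" ] || continue
        # Separator line: placeholder sender plus the current UTC time in
        # asctime-like form (an assumption; not taken from the article).
        printf 'From %s %s\n' 'news@localhost' \
            "$(LC_ALL=C date -u '+%a %b %e %H:%M:%S %Y')"
        # Escape any line starting with "From " so that it cannot be
        # mistaken for a separator.
        sed 's/^From />From /' "$article"
        # A blank line ends each message.
        printf '\n'
    done >> "$mbox"
}
```

    Usage would be along the lines of: dir_to_mbox /path/to/articles
    /path/to/archive.mbox (both paths hypothetical). Keep the output file
    outside the input directory so it is not picked up as an article.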

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Grant Taylor@21:1/5 to Jason Evans on Mon Aug 23 13:45:33 2021
    On 8/23/21 9:15 AM, Jason Evans wrote:
    However I'm running my own INN news server using the tradpool storage
    method. I want to be able to create an automated monthly archive
    of every article on my server and dump that to mbox files for each
    newsgroup. Doing it through slrn would be way too much of a headache.

    I think I would approach this a slightly different way.

    I'd think about configuring a peer with a news feed similar to
    news-to-email such that every (selected) incoming article is sent to the
    archiving system. I'd then rely on the archiving system to manage the
    monthly archives. Even if the archiving system is something as simple
    as a script that appends the message to an mbox formatted file with a
    file name derived from the newsgroup name and month (and likely year).
    That way the month to month rotation would be implicit. This would also
    avoid the complexity of interfacing with NNTP or the news spool thus
    eliminating the need to match formats.
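
    A minimal sketch of such an appending script, assuming the article
    arrives on standard input and the caller already knows which group it
    belongs to. The function name archive_article, the ROOT/GROUP/YYYY-MM.mbox
    layout, and the placeholder sender news@localhost are assumptions for
    illustration, and nothing here does file locking.

```shell
#!/bin/sh
# Sketch only: archive_article ARCHIVE_ROOT GROUP < article
# Appends the article on stdin to ARCHIVE_ROOT/GROUP/YYYY-MM.mbox, so a new
# file starts automatically when the month changes.
archive_article() {
    root=$1
    group=$2
    month=$(date -u '+%Y-%m')
    mkdir -p "$root/$group" || return 1
    {
        # Separator line with a placeholder sender and the current time.
        printf 'From %s %s\n' 'news@localhost' \
            "$(LC_ALL=C date -u '+%a %b %e %H:%M:%S %Y')"
        # Escape lines that would otherwise look like a separator.
        sed 's/^From />From /'
        # Blank line to end the message.
        printf '\n'
    } >> "$root/$group/$month.mbox"
}
```

    Because the file name is computed at delivery time, there is no separate
    rotation step to schedule.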

    The biggest unknown for me at the moment is how to deal with all the
    newsgroups and different files. I'm sure that you could create
    different mbox archive files for each newsgroup, but you probably want
    something scalable that doesn't require manual configuration. I would
    initially wonder about just extracting the contents of the Newsgroups:
    header and append the message to the file for each of the listed
    newsgroups. But I've seen some questionable content in the Newsgroups:
    header. So you'll likely want to do something to sanitize the user
    provided content and not blindly accept it. Maybe do a string filter /
    case fold / comparison against the contents of the active (newsgroups)
    file.
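
    One way that filtering might look, as a sketch: read the header block,
    split the Newsgroups: value on commas, case fold, and keep only names
    that appear as the first field of a line in the active file. The function
    name valid_groups is made up for the example, and folded (continued)
    header lines are not handled.

```shell
#!/bin/sh
# Sketch only: valid_groups ACTIVE_FILE < article
# Prints, one per line, the groups from the Newsgroups: header that are also
# listed in ACTIVE_FILE. Anything else (typos, path tricks) is dropped.
valid_groups() {
    active=$1
    sed '/^$/q' |                                  # header block only
        sed -n 's/^[Nn]ewsgroups:[[:space:]]*//p' | # value of the header
        head -n 1 |
        tr ',' '\n' |                               # one candidate per line
        tr -d '[:blank:]\r' |
        tr '[:upper:]' '[:lower:]' |                # case fold
        while IFS= read -r group; do
            [ -n "$group" ] || continue
            # Exact string comparison against field 1 of the active file.
            if awk -v g="$group" '$1 == g { found = 1 } END { exit !found }' "$active"
            then
                printf '%s\n' "$group"
            fi
        done
}
```

    Since only names that already exist in the active file get through, the
    result should be safe to use as a directory or file name component.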

    You will probably want something to find older archives and do some sort
    of maintenance on them, compression or removal of really old archives.
    I feel like a cron job working on files > 2 intervals (months) old (to
    avoid the possibility of a race condition) would suffice.
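
    A sketch of what that cron job could run, assuming archives end in .mbox
    and live under one root directory. The function name
    compress_old_archives and the 62-day default (roughly two months) are
    assumptions; file age is judged by modification time rather than by the
    month in the file name.

```shell
#!/bin/sh
# Sketch only: compress_old_archives ARCHIVE_ROOT [DAYS]
# gzips every *.mbox under ARCHIVE_ROOT that was last modified more than
# DAYS days ago, leaving recent archives alone so nothing that may still be
# appended to gets compressed.
compress_old_archives() {
    root=$1
    days=${2:-62}
    find "$root" -type f -name '*.mbox' -mtime +"$days" -exec gzip -- {} +
}

# Hypothetical crontab entry, run at 03:00 on the first of each month:
#   0 3 1 * * /path/to/compress-archives.sh
```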

    You might want to store the archives in a directory per group in its
    own directory structure. YMMV

    I suspect if you back up and look at the LEGO pieces from a different
    direction you can probably come up with a workable solution. -- Don't
    maintain state that must be synchronized / compared a la polling. Do
    this stateless and use push from the news server itself. }:-)



    --
    Grant. . . .
    unix || die

  • From Ted Heise@21:1/5 to Russ Allbery on Tue Aug 24 00:29:03 2021
    On Mon, 23 Aug 2021 13:05:50 -0700,
    Russ Allbery <eagle@eyrie.org> wrote:
    Jason Evans <jsevans@mailfence.com> writes:

    I would like to know if anyone here knows about an archival
    tool or script that can be used to back up new articles into
    mbox format. Similar to what exists in archive.org. I'm not
    much of a programmer but before I spend the time and effort to
    cobble together a bash script to do this, I want to see if
    someone here already has something that can do that already.
    No need to re-invent the wheel, etc.

    The archive program that comes with INN and can be configured
    as a feed in newsfeeds is fairly close to what you want except
    that when you configure it to store multiple messages in a
    single file, it uses a custom separator rather than doing From
    escaping and inserting a mailbox From.

    You could run it in its default mode where it saves each
    individual message to a file and then separately run some other
    program to convert a directory full of files to a mailbox. I
    suspect most of the Google hits for "maildir2mbox" would do it,
    since a maildir is essentially a directory full of messages.

    The formail tool that comes with procmail may be worth looking at
    in this context too.

    --
    Ted Heise <theise@panix.com> West Lafayette, IN, USA

  • From Grant Taylor@21:1/5 to Ted Heise on Mon Aug 23 22:24:30 2021
    On 8/23/21 6:29 PM, Ted Heise wrote:
    The formail tool that comes with procmail may be worth looking at in
    this context too.

    formail is a very nice tool. I use the crap out of it, particularly in
    procmail recipes and commands querying messages. But I thought that it
    split mbox / archives into multiple discrete messages, not the other way
    around which is my understanding of the OP's need. If I'm mistaken,
    please correct me.



    --
    Grant. . . .
    unix || die

  • From Ted Heise@21:1/5 to Grant Taylor on Tue Aug 24 17:24:50 2021
    On Mon, 23 Aug 2021 22:24:30 -0600,
    Grant Taylor <gtaylor@tnetconsulting.net> wrote:
    On 8/23/21 6:29 PM, Ted Heise wrote:
    The formail tool that comes with procmail may be worth looking
    at in this context too.

    formail is a very nice tool. I use the crap out of it,
    particularly in procmail recipes and commands querying
    messages. But I thought that it split mbox / archives into
    multiple discrete messages, not the other way around which is
    my understanding of the OP's need. If I'm mistaken, please
    correct me.

    Ah Grant, I think you are correct. Thanks for setting things
    straight!

    --
    Ted Heise <theise@panix.com> West Lafayette, IN, USA

  • From Todd Michel McComb@21:1/5 to theise@panix.com on Tue Aug 24 17:29:20 2021
    In article <slrnsiaar2.68d.theise@panix2.panix.com>,
    Ted Heise <theise@panix.com> wrote:
    On Mon, 23 Aug 2021 22:24:30 -0600,
    Grant Taylor <gtaylor@tnetconsulting.net> wrote:
    On 8/23/21 6:29 PM, Ted Heise wrote:
    The formail tool that comes with procmail may be worth looking
    at in this context too.
    formail is a very nice tool. I use the crap out of it,
    particularly in procmail recipes and commands querying
    messages. But I thought that it split mbox / archives into
    multiple discrete messages, not the other way around which is
    my understanding of the OP's need. If I'm mistaken, please
    correct me.
    Ah Grant, I think you are correct. Thanks for setting things
    straight!

    formail can be used to add mail headers such as From.... You could
    process through formail and then just 'cat' straight to an mbox
    file.

  • From Russ Allbery@21:1/5 to Grant Taylor on Wed Aug 25 08:50:32 2021
    Grant Taylor <gtaylor@tnetconsulting.net> writes:

    I believe that formail habitually /appends/ new headers to the existing
    headers. Seeing as how the From line is used as a message separator in
    mbox format, it /MUST/ be the first header. I'm not sure that formail
    in and of itself can /prepend/ / insert a header at the start of a
    message.

    Another problem is that the line used as a message separator is not a
    header (it doesn't have a colon). It starts with "From " and is a weird
    special case left over from mbox's odd legacy format.

    A more subtle problem is that Usenet messages don't escape lines that
    start with "From " in the body of the message, but this is mandatory when
    storing messages in mbox format or a body line might be mistaken for the
    start of a new message. Conventionally this is done by prepending > to
    the line starting with "From ". There are other approaches, but one needs
    to do something about this. The maildir2mbox programs will handle this
    case (or should).
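
    For illustration, a sketch of one of those other approaches: the
    reversible quoting used by the mboxrd variant of the format, where a
    line that is "From " preceded by any number of > gets one more >, so
    that removing one > on reading restores the original exactly. The
    function names mbox_quote and mbox_unquote are made up for the example.

```shell
#!/bin/sh
# Sketch only: reversible "From " quoting in the mboxrd style.
# mbox_quote adds one '>' to any line matching ^>*From ; mbox_unquote removes
# exactly one, so lines that already began with ">From " survive a round trip.
mbox_quote() {
    sed 's/^\(>*From \)/>\1/'
}

mbox_unquote() {
    sed 's/^>\(>*From \)/\1/'
}
```

    The plain convention of only rewriting "From " to ">From " cannot be
    undone reliably, because a reader cannot tell whether the > was there in
    the original article.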

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Grant Taylor@21:1/5 to Todd Michel McComb on Wed Aug 25 09:44:15 2021
    On 8/24/21 11:29 AM, Todd Michel McComb wrote:
    formail can be used to add mail headers such as From.... You could
    process through formail and then just 'cat' straight to an mbox file.

    I want to agree. But I have concerns.

    I believe that formail habitually /appends/ new headers to the existing
    headers. Seeing as how the From line is used as a message separator in
    mbox format, it /MUST/ be the first header. I'm not sure that formail
    in and of itself can /prepend/ / insert a header at the start of a
    message.

    I suspect that it might be better to use formail to extract the From:
    header, mung it to fabricate a From separator line contents, and then
    prepend it to the message. E.g. do something like the following:

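    for message in /path/to/desired/articles/*; do
        # $CommandToReformat is a placeholder for whatever turns the From:
        # header into the "address date" text of an mbox separator line.
        NewFrom=$(formail -x From: < "$message" | $CommandToReformat)
        {
            echo "From $NewFrom"
            cat "$message"
            echo ""    # blank line terminates the message
        } >> /path/to/archive
    done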



    --
    Grant. . . .
    unix || die

  • From Adam H. Kerman@21:1/5 to Russ Allbery on Wed Aug 25 19:54:05 2021
    Russ Allbery <eagle@eyrie.org> wrote:
    Grant Taylor <gtaylor@tnetconsulting.net> writes:

    I believe that formail habitually /appends/ new headers to the existing
    headers. Seeing as how the From line is used as a message separator in
    mbox format, it /MUST/ be the first header. I'm not sure that formail
    in and of itself can /prepend/ / insert a header at the start of a
    message.

    Another problem is that the line used as a message separator is not a
    header (it doesn't have a colon). It starts with "From " and is a weird
    special case left over from mbox's odd legacy format.

    A more subtle problem is that Usenet messages don't escape lines that
    start with "From " in the body of the message, but this is mandatory when
    storing messages in mbox format or a body line might be mistaken for the
    start of a new message. Conventionally this is done by prepending > to
    the line starting with "From ". There are other approaches, but one needs
    to do something about this. The maildir2mbox programs will handle this
    case (or should).

    The conventional mbox separator line is created from ENVELOPE FROM.

    I've been using alpine/pine forever. It doesn't parse for "nlFrom_" but
    a line resembling ENVELOPE FROM. All these years of use, I can't say
    it's ever mistaken a line in the body for a separator line.

    If I archive an article from Usenet, it's to an mbox just so I can read
    it with a mail client. If I have to do it manually, I just copy and
    paste a separator line that is recognized.

    A parser that looks for "nlFrom_" shouldn't be acceptable.

    And, yeah, I've seen Usenet articles that escape "nlFrom_". It messes up
    the quote level.

  • From Russ Allbery@21:1/5 to Adam H. Kerman on Wed Aug 25 13:38:50 2021
    "Adam H. Kerman" <ahk@chinet.com> writes:

    I've been using alpine/pine forever. It doesn't parse for "nlFrom_" but
    a line resembling ENVELOPE FROM. All these years of use, I can't say
    it's ever mistaken a line in the body for a separator line.

    I agree that the chances of this being a problem given a sufficiently
    picky parser are low, but they're still non-zero since there is no
    protocol reason why a Usenet article cannot contain a line like:

    From foo@example.com Wed Aug 25 13:33:52 2021

    in, for example, a discussion of envelope From lines. :) So when
    contemplating archive software that one wants to just work and not have to
    think about, ideally it should cope with this.
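
    To make the point concrete, here is a sketch of a fairly strict
    separator test. The pattern is an assumption about what a picky parser
    might require (an address followed by an asctime-style date); it is not
    the rule that pine or any other client actually uses. The example line
    above still passes it.

```shell
#!/bin/sh
# Sketch only: is_from_line LINE
# Succeeds if LINE looks like "From <address> <asctime date>". The regular
# expression is illustrative, not any particular mail client's rule.
is_from_line() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq \
        '^From [^ ]+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}$'
}

if is_from_line 'From foo@example.com Wed Aug 25 13:33:52 2021'; then
    echo 'strict parser would still treat this body line as a separator'
fi
```

    An ordinary sentence such as "From the look of things" fails the test,
    which is why strict parsers rarely misfire, but a pasted separator line
    does not.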

    The old Babyl format solves this problem, but alas never caught on in the
    UNIX world.

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Ted Heise@21:1/5 to Adam H. Kerman on Wed Aug 25 20:46:24 2021
    On Wed, 25 Aug 2021 19:54:05 -0000 (UTC),
    Adam H. Kerman <ahk@chinet.com> wrote:
    Russ Allbery <eagle@eyrie.org> wrote:
    Grant Taylor <gtaylor@tnetconsulting.net> writes:

    I believe that formail habitually /appends/ new headers to the
    existing headers. Seeing as how the From line is used as a
    message separator in mbox format, it /MUST/ be the first
    header. I'm not sure that formail in and of itself can
    /prepend/ / insert a header at the start of a message.

    Another problem is that the line used as a message separator is
    not a header (it doesn't have a colon). It starts with "From "
    and is a weird special case left over from mbox's odd legacy
    format.

    A more subtle problem is that Usenet messages don't escape
    lines that start with "From " in the body of the message, but
    this is mandatory when storing messages in mbox format or a
    body line might be mistaken for the start of a new message.
    Conventionally this is done by prepending > to the line
    starting with "From ". There are other approaches, but one
    needs to do something about this. The maildir2mbox programs
    will handle this case (or should).

    The conventional mbox separator line is created from ENVELOPE
    FROM.

    I've been using alpine/pine forever. It doesn't parse for
    "nlFrom_" but a line resembling ENVELOPE FROM. All these years
    of use, I can't say it's ever mistaken a line in the body for a
    separator line.

    If I archive an article from Usenet, it's to an mbox just so I
    can read it with a mail client. If I have to do it manually, I
    just copy and paste a separator line that is recognized.

    This is all very interesting, and jogging some very old memories.
    I too have used Pine for decades, and have slrn configured to save
    Usenet posts straight into my pine mbox mail structure. It works
    wonderfully well.

    What I was thinking of when I offered my original suggestion was a
    time many years ago that I converted a boatload of e-mail from
    some other structure for use in pine. My recollection is I just
    fed that large file into formail and it generated a flood of
    individual messages that all ended up in my inbox. I suppose this
    would have depended on the mail processing system in place at the
    time as well, but I'm pretty sure they ended up in mbox format.

    Sorry this is not better informed, and please don't roast me too
    badly if I'm off track.

    --
    Ted Heise <theise@panix.com> West Lafayette, IN, USA

  • From Ted Heise@21:1/5 to Russ Allbery on Wed Aug 25 20:49:44 2021
    On Wed, 25 Aug 2021 13:38:50 -0700,
    Russ Allbery <eagle@eyrie.org> wrote:
    "Adam H. Kerman" <ahk@chinet.com> writes:

    I've been using alpine/pine forever. It doesn't parse for "nlFrom_" but
    a line resembling ENVELOPE FROM. All these years of use, I can't say
    it's ever mistaken a line in the body for a separator line.

    I agree that the chances of this being a problem given a sufficiently
    picky parser are low, but they're still non-zero since there is no
    protocol reason why a Usenet article cannot contain a line like:

    From foo@example.com Wed Aug 25 13:33:52 2021

    in, for example, a discussion of envelope From lines. :) So when
    contemplating archive software that one wants to just work and not have to
    think about, ideally it should cope with this.

    The old Babyl format solves this problem, but alas never caught on in the
    UNIX world.

    For what it's worth, I saved the above message to my pine mbox
    file and got the below. So somewhere, somehow, the bare From line is
    getting changed in that process.

    Ted



    Date: Wed, 25 Aug 2021 13:38:50 -0700
    From: Russ Allbery <eagle@eyrie.org>
    Newsgroups: news.software.nntp
    Subject: Re: mbox archive for a news server

    "Adam H. Kerman" <ahk@chinet.com> writes:

    I've been using alpine/pine forever. It doesn't parse for "nlFrom_" but
    a line resembling ENVELOPE FROM. All these years of use, I can't say
    it's ever mistaken a line in the body for a separator line.

    I agree that the chances of this being a problem given a
    sufficiently picky parser are low, but they're still non-zero
    since there is no protocol reason why a Usenet article cannot
    contain a line like:

    >From foo@example.com Wed Aug 25 13:33:52 2021

    in, for example, a discussion of envelope From lines. :) So when
    contemplating archive software that one wants to just work and not
    have to think about, ideally it should cope with this.

    The old Babyl format solves this problem, but alas never caught on
    in the UNIX world.

    --
    Russ Allbery (eagle@eyrie.org)
    <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Adam H. Kerman@21:1/5 to Ted Heise on Wed Aug 25 21:11:44 2021
    Ted Heise <theise@panix.com> wrote:

    For what it's worth, I saved the above message to my pine mbox
    file and got the below. So somewhere, somehow, the bare From line is
    getting changed in that process.

    Escaping "nlFrom_" is an option in .pinerc. I keep it unset. I don't
    recall which setting is default.

  • From Russ Allbery@21:1/5 to Ted Heise on Wed Aug 25 13:59:13 2021
    Ted Heise <theise@panix.com> writes:

    For what it's worth, I saved the above message to my pine mbox file and
    got the below. So somewhere, somehow, the bare From line is getting
    changed in that process.

    Yeah, the normal approach is that when you save a post to an mbox, the
    client does From line escaping like that.

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From John Levine@21:1/5 to All on Wed Aug 25 23:52:55 2021
    According to Russ Allbery <eagle@eyrie.org>:
    Grant Taylor <gtaylor@tnetconsulting.net> writes:

    I believe that formail habitually /appends/ new headers to the existing
    headers. ...

    I am getting the strange feeling that I am the only person here who has run a news
    message through formail to see what happens.

    So as not to leave you in suspense, it puts a From_ line at the top, using the address
    in the regular From: header and the current timestamp. If you glom those together,
    you'll get an mbox.

    It does >From escapes, too.
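
    A sketch of the "glom those together" step, assuming formail (from
    procmail) is installed and each article is in its own file. The function
    name articles_to_mbox_formail and the paths are hypothetical. It relies
    only on the default behaviour described above, so it is worth checking
    the result with a mail client before trusting it, in particular that
    consecutive messages end up separated by a blank line.

```shell
#!/bin/sh
# Sketch only: articles_to_mbox_formail ARTICLE_DIR MBOX_FILE
# Pipes each article through formail with no options, letting it add the
# From_ line and do the >From escaping, and appends the result to MBOX_FILE.
articles_to_mbox_formail() {
    dir=$1
    mbox=$2
    command -v formail >/dev/null 2>&1 || {
        echo 'formail not found' >&2
        return 127
    }
    for article in "$dir"/*; do
        [ -f "$article" ] || continue
        formail < "$article"
    done >> "$mbox"
}
```

    Note that the timestamp in each separator line is the time of
    conversion, not the article's own Date: header.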

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Todd Michel McComb@21:1/5 to johnl@taugh.com on Thu Aug 26 00:09:34 2021
    In article <sg6l4n$jn2$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote:
    I am getting the strange feeling that I am the only person here
    who has run a news message through formail to see what happens.

    Ha, well, I don't normally post here, but that's why I did! I'm
    not enthusiastic enough about it as a solution to insist, though....
    :-)
