• INN2: importing archival messages and threads

    From ejs@21:1/5 to All on Sun Oct 23 17:57:10 2022
    Hi All,

    We are running a local instance of a Usenet server.

    Sometimes we get batches of historical messages and, in order not to
    obstruct the active groups, we moved them to a different hierarchy.

    The current problem is that the messages we have to process will
    interfere with the messages on the server, as they cover the same time
    period.

    Could someone explain how the messages are stored and referenced on the
    server? I can see the numerical ID, corresponding to the physical file
    on the server, and the Message-ID, which appears from ...?

    My current idea is to query the INN server for the Message-IDs in a
    specific newsgroup; if they are found, we have duplicates. If not, I
    can feed the message and the entire thread to the server. The file name
    and the history entry will be created by the server.
    Am I right? Or maybe there are messages in different groups and I may
    have a clash there?
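A minimal sketch of the check described above, assuming one raw article per file. A Message-ID is meant to be unique across all of Usenet rather than per newsgroup, so the sketch compares IDs across the whole batch regardless of group. The names `message_id`, `find_duplicates` and `known_ids` are hypothetical and are not part of INN.

```python
from email import message_from_bytes
from typing import Iterable, Iterator, Optional, Tuple


def message_id(raw_article: bytes) -> Optional[str]:
    """Return the Message-ID header of a raw article, or None if it is absent."""
    msg = message_from_bytes(raw_article)
    value = msg.get("Message-ID")  # header lookup is case-insensitive
    return str(value).strip() if value is not None else None


def find_duplicates(
    articles: Iterable[Tuple[str, bytes]],
    known_ids: Iterable[str] = (),
) -> Iterator[Tuple[str, str]]:
    """Yield (name, message_id) for every article whose ID was already seen.

    `articles` holds (name, raw_bytes) pairs; `known_ids` are IDs assumed to be
    on the server already. Newsgroups are deliberately ignored.
    """
    seen = set(known_ids)
    for name, raw in articles:
        mid = message_id(raw)
        if mid is None:
            continue  # articles without a Message-ID are left for manual review
        if mid in seen:
            yield name, mid
        else:
            seen.add(mid)
```

Passing the IDs already present on the server as `known_ids` makes the same loop report clashes between the import batch and the live spool.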

    What we did until now - just placed the files into spool and recreated
    the history. But we were sure there will be no clashes neither in file
    names, nor Message-IDs; this was performed with vgrep and oh, boy ...
    It works for small batches and short threads and į'm not sure if it will
    scale easilly.

    I have the nearly 3M messages to be fed in the database and i can
    perform alomst any adjustments on-the-fly.

    --
    ejs
    news://news.rkm.lt

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Henning Hucke@21:1/5 to ejs on Mon Oct 24 08:31:14 2022
    ejs <Usernet.eternal-september@seniejitrakai.net> wrote:
    Hi All,

    Hi stranger,

    We are running a local instance of a Usenet server.
    [...]

    Honestly, this is a somewhat unstructured request for help, or at least
    it has no structure I recognise.

    You want to import historic/archived postings into a running inn
    instance?
    In which format are the postings you have available? One article per
    file? Batch files containing multiple postings?
    And do you want to know whether the method you use is viable, or do you
    want to know a / the appropriate way to import the postings?
    (I think the latter would be the sensible way.)

    Be aware that article numbers are heavily problematic since they may
    be, and quite likely are, already in use on the actual server. They are
    also only partially helpful if you use storage methods other than
    "tradspool".

    Duplicate detection and the like are performed if you "feed" the
    postings into inn. This is processor intensive but from my point of
    view the most secure method to feed - historic as well as current -
    postings into the message base.
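At the protocol level, the duplicate check shows up in the reply to the NNTP IHAVE command. The sketch below only maps the reply codes defined in RFC 3977 to a decision. It opens no connection, and the names `IHAVE_OFFER`, `IHAVE_TRANSFER` and `classify` are hypothetical.

```python
# Reply codes for IHAVE as defined in RFC 3977.
# Stage 1: reply to "IHAVE <message-id>".
IHAVE_OFFER = {
    "335": "send",         # server wants the article, send it now
    "435": "not-wanted",   # e.g. the server already has this Message-ID
    "436": "retry-later",  # transfer not possible right now
}
# Stage 2: reply after the article body has been sent.
IHAVE_TRANSFER = {
    "235": "accepted",     # article transferred OK
    "436": "retry-later",  # transfer failed, try again later
    "437": "rejected",     # transfer rejected, do not retry
}


def classify(reply_line: str, table: dict) -> str:
    """Map the three-digit code at the start of a reply line to a decision."""
    return table.get(reply_line[:3], "unexpected")
```

For example, `classify("435 Duplicate", IHAVE_OFFER)` gives `"not-wanted"`. The text after the code is made up here, since servers are free to word it differently.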

    Best regards
    Henning
    --
    How many bits would a BitBlit blit if a BitBlit could blit bits?
    -- macanespie@waves.pas.ti.com in <1993Nov16.130625.1@waves.pas.ti.com>

  • From ejs@21:1/5 to All on Mon Oct 24 14:07:26 2022
    On 2022-10-24 11:31, Henning Hucke wrote:
    You want to import historic/archived postings into a running inn
    instance?

    Yes. I've done it in the off-line mode, but that was for the new hierarchy.

    In which format are the postings you have available? One article per
    file? Batch files containing multiple postings?

    I can export either as one file per message or feed them using a Python
    NNTP library.

    And do you want to know whether the method you use is viable, or do you
    want to know a / the appropriate way to import the postings?
    (I think the latter would be the sensible way.)

    I need to do it in the proper way.

    Be aware that article numbers are heavily problematic since they may
    be, and quite likely are, already in use on the actual server. They are
    also only partially helpful if you use storage methods other than
    "tradspool".

    Right now I assume there will be Message-ID duplicates.

    Duplicate detection and the like are performed if you "feed" the
    postings into inn. This is processor intensive but from my point of
    view the most secure method to feed - historic as well as current -
    postings into the message base.

    So, for consistency, I could fetch all the headers, build a list of the
    Message-IDs in use, and alter the Message-ID as well as the
    'References:' and 'In-Reply-To:' fields of each imported message.
    I need to have proper threading and no duplicate messages on the server.

    --
    ejs
    news://news.rkm.lt

  • From Henning Hucke@21:1/5 to ejs on Thu Oct 27 10:48:54 2022
    ejs <Usernet.eternal-september@seniejitrakai.net> wrote:
    [...]
    So, for consistency, I could fetch all the headers, build a list of the
    Message-IDs in use, and alter the Message-ID as well as the
    'References:' and 'In-Reply-To:' fields of each imported message.
    I need to have proper threading and no duplicate messages on the server.

    Either I still misunderstand what you want to achieve or you yourself
    misunderstand how INN and NNTP work...

    Posting headers or extracted message IDs aren't helpful if you want to
    write historic postings into an INN message base, since you want to
    write the articles in whole and you will very seldom have duplicate
    postings (otherwise this would mean that you already have a lot of
    these "historic" postings in your message base).

    INN already does duplicate detection by itself. That's nothing you need
    to do. And you also need no separate Python-implemented NNTP feeder
    since such tools already exist.
    Maybe it's a good idea to use two computers so that the INN-powered news
    server can do its work while the other computer manages the I/O load
    of grabbing and posting all the single posting files.

    It's possibly also not a good idea to use the tradspool storage, and
    so writing the postings directly into the storage would not be a good
    idea either.

    The more relevant part of this task is possibly the INN configuration.
    To be able to write the historic postings into the message base you need
    to adapt the "artcutoff" (and some other) settings, and the "expire.ctl"
    if you use a storage method affected by expire.
    After having imported the historic postings you should certainly reset
    the "artcutoff" setting (and some others).


    It would possibly be helpful if you simply described where /you/ see
    problems in writing the postings via NNTP (tools) and /why/ you want to
    write directly into a tradspool storage.


    Regards
    Henning
    --
    Can't open /usr/fortunes. Lid stuck on cookie jar.

  • From Russ Allbery@21:1/5 to Henning Hucke on Thu Oct 27 08:46:56 2022
    Henning Hucke <h_hucke+spam.news@newsmail.aeon.icebear.org> writes:

    INN already does duplicate detection by itself. That's nothing you need
    to do. And you also need no separate Python-implemented NNTP feeder
    since such tools already exist.

    The big problem with injecting old posts is that INN uses strictly
    increasing article numbers, so they'll get larger article numbers than
    existing (more recent) posts.

    How this will look to the user will vary by newsreader, but in general all
    the historic posts will show up as new, and they may or may not be sorted
    correctly when people view the group depending on whether the newsreader
    sorts by article date or by article number. Sorting by article number is
    quite common.
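A toy illustration of the two orderings, with made-up article numbers and dates: the imported posts are older but receive the highest numbers, so a reader sorting by number shows them last.

```python
from datetime import date

# (article number, posting date) pairs; every value here is invented.
existing = [(101, date(2022, 10, 1)), (102, date(2022, 10, 20))]
# Historic posts fed later get the next free numbers despite their age.
imported = [(103, date(2009, 5, 4)), (104, date(2009, 5, 6))]

overview = existing + imported
by_number = [num for num, _ in sorted(overview)]
by_date = [num for num, _ in sorted(overview, key=lambda item: item[1])]

print(by_number)  # [101, 102, 103, 104]
print(by_date)    # [103, 104, 101, 102]
```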

    If you want this to look as if all the articles had arrived in normal
    order, unfortunately you (speaking to the original poster here) have to do
    major surgery. You'll have to assemble an article tree, probably with
    manually assigned article numbers, that has all the articles you want
    numbered in the right order. I think you'll have to use tradspool and
    tradindexed overview, and then use the tdx-util program from tradindexed
    to rebuild overview for that group. You'll also have to inject the
    articles into history, probably by rebuilding history.

    This is unfortunately not going to be easy to do and is going to be
    disruptive for any existing readers of the group on that server (because
    you'll end up renumbering the articles in that group). INN doesn't
    provide any tools out of the box for doing this, although I have done
    things like this before (many years ago) manually.

    You may find it easier to set up a second INN server, assemble a list of
    all the articles you want on that server in correct date sorted order, and
    then feed all the articles to that server in that order using innxmit or
    some similar tool. This will also require some work to assemble all the
    pieces and build the batch file pointing to all the article files, so it
    will take some manual experimentation. (Since I haven't done that
    experimentation in over ten years, I unfortunately can't give you
    step-by-step instructions.) But it may mean less fiddling than manually
    assembling a tradspool structure and rebuilding history and overview. Or
    it may not! I'm not sure which is easier.
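One way the date sorted list could be assembled, as a rough sketch: it assumes one article per file under a directory and orders the files by their Date header. The names `posting_time` and `date_sorted` are hypothetical, and turning the resulting list into a batch file that innxmit accepts is not covered here.

```python
from datetime import datetime, timezone
from email.parser import BytesParser
from email.utils import parsedate_to_datetime
from pathlib import Path
from typing import List, Optional, Tuple


def posting_time(path: Path) -> Optional[datetime]:
    """Return the article's Date header as an aware datetime, or None."""
    with path.open("rb") as handle:
        msg = BytesParser().parse(handle, headersonly=True)
    header = msg["Date"]
    if header is None:
        return None
    try:
        parsed = parsedate_to_datetime(str(header))
    except (TypeError, ValueError):
        return None  # unparseable Date header
    if parsed.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def date_sorted(spool_dir: Path) -> Tuple[List[Path], List[Path]]:
    """Return (files ordered by Date header, files with no usable Date)."""
    dated, undated = [], []
    for path in sorted(Path(spool_dir).rglob("*")):
        if not path.is_file():
            continue
        when = posting_time(path)
        if when is None:
            undated.append(path)
        else:
            dated.append((when, path))
    dated.sort(key=lambda pair: pair[0])
    return [path for _, path in dated], undated
```

Articles without a usable Date header are returned separately so they can be inspected by hand rather than fed at an arbitrary position.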

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Jesse Rehmer@21:1/5 to Russ Allbery on Thu Oct 27 17:24:12 2022
    On Oct 27, 2022 at 10:46:56 AM CDT, "Russ Allbery" <eagle@eyrie.org> wrote:

    You may find it easier to set up a second INN server, assemble a list of
    all the articles you want on that server in correct date sorted order, and
    then feed all the articles to that server in that order using innxmit or
    some similar tool. This will also require some work to assemble all the
    pieces and build the batch file pointing to all the article files, so it
    will take some manual experimentation. (Since I haven't done that
    experimentation in over ten years, I unfortunately can't give you
    step-by-step instructions.) But it may mean less fiddling than manually
    assembling a tradspool structure and rebuilding history and overview. Or
    it may not! I'm not sure which is easier.

    This is the way to go - there was a thread I started some months back
    where Julien helped provide the syntax necessary to generate the list of
    messages sorted by posting date, which you can then transmit to another
    server. This is the path of least resistance to get a large number of
    articles in a sane order without several large operations in place.

  • From Julien ÉLIE@21:1/5 to All on Thu Oct 27 20:14:27 2022
    Hi Jesse,

    You may find it easier to set up a second INN server, assemble a list of
    all the articles you want on that server in correct date sorted order, and
    then feed all the articles to that server in that order using innxmit or
    some similar tool. This will also require some work to assemble all the
    pieces and build the batch file pointing to all the article files, so it
    will take some manual experimentation. (Since I haven't done that
    experimentation in over ten years, I unfortunately can't give you
    step-by-step instructions.) But it may mean less fiddling than manually
    assembling a tradspool structure and rebuilding history and overview. Or
    it may not! I'm not sure which is easier.

    This is the way to go - there was a thread I started some months back where Julien helped provide the syntax necessary to generate the list of messages sorted by posting date which you can then transmit to another server.

    Yup, and I added that information in the FAQ as I thought it may be
    useful to other people :-)

    https://www.eyrie.org/~eagle/faqs/inn.html#S6.4

    """
    [generating the "<pathoutgoing>/list" file]

    The result file contains tokens ordered by arrival time on the old
    server (which is usually roughly the same as the posting time). In case
    the history file was not populated chronologically, it is better to sort
    it by posting time so that articles are fed in the right order. This
    can be achieved with the following command:

    sort -t '~' -k3n < history > history.sorted

    And then, consider history.sorted instead of history for the next steps.
    """
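For readers who would rather do the same step in Python, here is a rough equivalent of the `sort -t '~' -k3n` command quoted above. It assumes history lines of the form `[hash]<TAB>arrival~expires~posted<TAB>@token@`, and like `sort -n` it treats a missing or non-numeric third field as 0. The names `posted_key` and `sort_history` are hypothetical. One known difference: Python's sort is stable, while `sort` breaks ties by comparing whole lines unless told otherwise.

```python
import re
from typing import Iterable, List

# Leading integer of a field, as "sort -n" would read it.
_LEADING_INT = re.compile(r"\s*(-?\d+)")


def posted_key(line: str) -> int:
    """Numeric value at the start of the third '~'-separated field, else 0."""
    fields = line.split("~")
    if len(fields) < 3:
        return 0  # no posting time recorded on this line
    match = _LEADING_INT.match(fields[2])
    return int(match.group(1)) if match else 0


def sort_history(lines: Iterable[str]) -> List[str]:
    """Return the history lines ordered by posting time, oldest first."""
    return sorted(lines, key=posted_key)
```

Reading `history` line by line, passing the lines to `sort_history`, and writing the result out produces a file that plays the role of history.sorted in the FAQ steps.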

    --
    Julien ÉLIE

    « 21.1.1 How to convert mSQL tools for MySQL?
    1. Run the shell script msql2mysql on the source. This requires the
    replace program, which is distributed with MySQL.
    2. Compile.
    3. Fix all compiler errors. » (MySQL online manual)
