How can I best renumber the article numbers of all newsgroups, sorted by post date?
Can someone help me with this?
Eli:
How can I best renumber the article numbers of all newsgroups, sorted by post date?
Why?! Renumbering articles is generally a bad idea. Newsreaders rely on
them to tell what's new and what to mark as read.
If a server renumbers its articles, its readers would need to throw out
their newsrc and show everything as unread.
Can someone help me with this?
- stop accepting articles on the existing server
- make a sorted list of storage tokens (you'll probably need to write a script for this: loop over the history file, collect the tokens and their posting dates, then sort the list; see the sketch after this list)
- set up a new empty server
- use the sorted list to feed articles into the new server
- swap servers
- start accepting articles again
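The history loop could be something along these lines (untested sketch; it assumes the usual history layout of three tab-separated fields, i.e. hashed Message-ID, arrival~expiry~posting times and storage token, and the history path is just an example):

#!/usr/bin/env python3
# Untested sketch: list storage tokens sorted by posting date, read from
# an INN history file.  Assumes three tab-separated fields per line:
# "[hash] <tab> arrival~expiry~posted <tab> @token@".  Lines without a
# token (expired or cancelled articles) are skipped.

entries = []
with open("/var/lib/news/history", encoding="latin-1") as hist:   # example path
    for line in hist:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2]:
            continue                     # no storage token any more
        times = fields[1].split("~")
        posted = times[2] if len(times) > 2 and times[2] != "-" else times[0]
        if not posted.isdigit():
            continue
        entries.append((int(posted), fields[2]))

entries.sort()
for posted, token in entries:
    print(posted, token)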
On 23 Apr 2023 at 18:08:22 CEST, "Frank" <franky@xxx.yyy> wrote:
If a server renumbers its articles, its readers would need to throw out
their newsrc and show everything as unread.
It is a new server and has no readers yet, so that isn't a problem.
- use the sorted list to feed articles into the new server
With 300,000,000 articles, the server is quite large and transferring these articles to another server over a single connection takes forever.
Sorting the history file by post date is no problem, but is it possible to rebuild the overview databases based on this sorted history file?
On Tue, 25 Apr 2023 09:34:09 GMT, Eli <eliistheman@gmail.com> wrote:
On 23 Apr 2023 at 18:08:22 CEST, "Frank" <franky@xxx.yyy> wrote:
If a server renumbers its articles, its readers would need to throw out
their newsrc and show everything as unread.
It is a new server and has no readers yet, so that isn't a problem.
So if you do not want to involve another server, to do such a renumbering
one would need to (hopefully I'm not forgetting an important step):
With 300,000,000 articles, the server is quite large and transferring these articles to another server over a single connection takes forever.
Using another server is the alternative, yes. You can run it in parallel over multiple connections, but only with per-group parallelism (not at the article level!).
On 28 Apr 2023 at 15:11:26 CEST, "Matija Nalis" <mnalis-news@voyager.hr> wrote:
So if you do not want to involve another server, to do such a renumbering
one would need to (hopefully I'm not forgetting an important step):
Thank you very much for the detailed explanation. However, I wonder if manually renumbering the article files works, since crossposts are stored as symbolic links. But it might be worth a try.
With 300,000,000 articles, the server is quite large and transferring these
articles to another server over a single connection takes forever.
Using another server is the alternative, yes. You can run it in parallel over multiple connections, but only with per-group parallelism (not at the article level!).
With single or multiple connections, things will probably go wrong again due to the crossposts.
So it seems that renumbering by posted date is not possible at all due to the crossposts.
As an example: Suppose there are two newsgroups, named A and B.
Both newsgroups have articles from the years 2003 to 2023.
First, newsgroup A is transferred to the new server.
Newsgroup A has an article from 2022 that has been crossposted to newsgroup B.
Since newsgroup B does not yet have articles on the new server, this article will get article number 1 in newsgroup B. So the same problem arises again on the new server.
Newsgroup B (new server):
Article number 1: 2022
Article number 2: 2003
So it seems that renumbering by posted date is not possible at all due to the crossposts.
- for each group, rename files so their numbers sequentially follow the
chronological order of `Date` headers in their content (you might need
to write a relatively simple script for that; I don't know if any exist
already)
Hi Matija,
- for each group, rename files so their numbers sequentially follow the
chronological order of `Date` headers in their content (you might need
to write a relatively simple script for that; I don't know if any exist already)
FWIW, in <patharticles>, the dates can be obtained with something like:
grep -m 1 '^Date: ' *
and the header field values converted to epoch with the convdate tool,
like in:
convdate -n 'Fri, 28 Apr 2023 15:11:26 +0200'
You'll also need to update the Xref header fields in the articles.
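Putting those two hints together, a possible dry-run sketch (Python instead of convdate, with a made-up group directory; it only prints the old number -> new number mapping and doesn't rename anything):

#!/usr/bin/env python3
# Untested sketch: for one tradspool group directory, read the Date header
# of every article file, sort by posting time, and print the mapping from
# old article number to new sequential number.  Dry run only.

import os
from email.utils import parsedate_to_datetime

groupdir = "/var/spool/news/articles/news/software/nntp"   # example path

def posting_time(path):
    """Epoch of the Date header of one article file (0 if not found)."""
    with open(path, "rb") as art:
        for raw in art:
            line = raw.decode("latin-1").rstrip("\r\n")
            if not line:                 # end of headers, no Date found
                return 0.0
            if line.lower().startswith("date:"):
                try:
                    return parsedate_to_datetime(line[5:].strip()).timestamp()
                except (TypeError, ValueError):
                    return 0.0
    return 0.0

files = [f for f in os.listdir(groupdir) if f.isdigit()]
files.sort(key=lambda f: posting_time(os.path.join(groupdir, f)))

for newnum, oldname in enumerate(files, start=1):
    print(f"{oldname} -> {newnum}")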
On Fri, 28 Apr 2023 18:32:45 GMT, Eli <eliistheman@gmail.com> wrote:
On 28 Apr 2023 at 15:11:26 CEST, "Matija Nalis" <mnalis-news@voyager.hr> wrote:
So if you do not want to involve another server, to do such a renumbering
one would need to (hopefully I'm not forgetting an important step):
Thank you very much for the detailed explanation. However, I wonder if
manually renumbering the article files works, since crossposts are stored as symbolic links. But it might be worth a try.
I'm not sure, but I think crossposts may have been stored as hardlinks instead?
If that is true, then they wouldn't mind such renaming.
But if they are indeed symlinks, then yes, your script would need to fix them too (by looking at the Newsgroups header, doing readdir() in each group, and following readlink(2) to see where each link points, until it finds the ones that need to be fixed). That would obviously make it even slower, yes.
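Checking which of the two the spool actually uses is easy enough, e.g. (made-up article path):

#!/usr/bin/env python3
# Quick check whether a crosspost in a tradspool directory is a symlink or
# a hardlink: a symlink answers to islink(), a hardlink just shows an
# st_nlink count greater than 1.

import os

path = "/var/spool/news/articles/alt/test/1234"   # example article file

if os.path.islink(path):
    print("symlink ->", os.readlink(path))
else:
    st = os.lstat(path)
    print(f"regular file, {st.st_nlink} hard link(s)")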
With 300,000,000 articles, the server is quite large and transferring these
articles to another server over a single connection takes forever.
Using another server is the alternative, yes. You can run it in parallel over multiple connections, but only with per-group parallelism (not at the article level!).
With single or multiple connections, things will probably go wrong again due to the crossposts.
Ah yes, you are correct, crossposts would break parallelism with multiple connections.
But it should still work for a single connection, given good preparation
(see below).
So it seems that renumbering by posted date is not possible at all due to the
crossposts.
You'd first have to create a list of all messages sorted by date (a sorted history file would be great for that, were it not for the fact that it contains ONLY articles that arrived in the last xx days, and not ALL of them).
And then you would simply feed the articles from that sorted list to the new (empty) server.
They would arrive on the new server just like they did in real life, in chronological order, to one group or the other, and even crossposts would arrive correctly (as message "X" would in all cases arrive after all older ones but before all newer ones, regardless of the group(s) it was posted to).
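For the feeding part, one possibility (untested sketch) is to turn the sorted token list into an rnews batch, where each article is preceded by a "#! rnews <size>" line, and let rnews on the new server inject it. The file names are just examples and it assumes plain "sm <token>" prints the stored article:

#!/usr/bin/env python3
# Untested sketch: turn a date-sorted list of "posted token" lines (as
# produced by the history script earlier in the thread) into an rnews
# batch by asking INN's sm utility for each article.

import subprocess

with open("tokens.sorted") as toklist, open("sorted.batch", "wb") as batch:
    for line in toklist:
        parts = line.split()
        if not parts:
            continue
        token = parts[-1]
        art = subprocess.run(["sm", token], capture_output=True).stdout
        if not art:
            continue                              # article not retrievable
        art = art.replace(b"\r\n", b"\n")         # be safe about line endings
        batch.write(b"#! rnews %d\n" % len(art))  # size of what we actually write
        batch.write(art)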
Hi Eli,
As an example: Suppose there are two newsgroups, named A and B.
Both newsgroups have articles from the years 2003 to 2023.
First, newsgroup A is transferred to the new server.
Newsgroup A has an article from 2022 that has been crossposted to newsgroup B.
Since newsgroup B does not yet have articles on the new server, this article will get article number 1 in newsgroup B. So the same problem arises again on the new server.
Newsgroup B (new server):
Article number 1: 2022
Article number 2: 2003
So it seems that renumbering by posted date is not possible at all due to the
crossposts.
If you're renumbering the articles like Matija suggested for tradspool,
you won't encounter that problem, as you do not transfer articles from one
server to another but rebuild the history file and overview data
from your renumbered tradspool.
What you are describing is a pullnews-like scenario ("newsgroup A is transferred to the new server").
If you're renumbering the articles like Matija suggested for tradspool,
you won't encounter that problem, as you do not transfer articles from one
server to another but rebuild the history file and overview data
from your renumbered tradspool.
Is there a way to do the same when using the timecaf storage?
Hi Eli,
If you're renumbering the articles like Matija suggested for tradspool,
you won't encounter that problem, as you do not transfer articles from one
server to another but rebuild the history file and overview data
from your renumbered tradspool.
Is there a way to do the same when using the timecaf storage?
Renumbering articles stored in place in timecaf buffers? No, that's not simple at all; you'd need to rewrite the whole CAF file (index + articles).
Only tradspool can be handled with "rudimentary" grep/sed commands.
So it would actually be better if pullnews downloaded all articles per newsgroup and ignored the crossposts: just download everything first, save the articles in their folders, and add the Xref field. Nothing more.
Filtering might not work in this case?
I'm wondering whether you could just:
- Download all the articles with "pullnews -r" (it will write a file
with all the articles within). You may run several instances of
pullnews to have several files.
- Parse the articles within these files (they are separated with "#!
rnews <size>" lines) to take the dates and write the articles in a new
batch file, ordered by posting date.
- Inject these batch files into innd (with rnews). No need to change
any Xref header fields. The articles will be treated in order, assigned
new Xref, and you'll have article numbers and history file sorted as you want.
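For the middle (sorting) step, an untested sketch in Python, assuming each article in the batch is preceded by a "#! rnews <size>" line giving its length in bytes (file names are examples):

#!/usr/bin/env python3
# Untested sketch: split a "pullnews -r" batch on its "#! rnews <size>"
# separator lines, read the Date header of each article, and write a new
# batch ordered by posting date.

from email.utils import parsedate_to_datetime

def read_batch(path):
    """Yield (separator_line, article_bytes) pairs from an rnews batch."""
    with open(path, "rb") as f:
        while True:
            sep = f.readline()
            if not sep:
                break
            if not sep.startswith(b"#! rnews "):
                continue                        # tolerate stray blank lines
            yield sep, f.read(int(sep.split()[2]))

def posted(article):
    """Posting time taken from the Date header (0 if missing/unparsable)."""
    headers = article.split(b"\n\n", 1)[0].decode("latin-1", "replace")
    for line in headers.splitlines():
        if line.lower().startswith("date:"):
            try:
                return parsedate_to_datetime(line[5:].strip()).timestamp()
            except (TypeError, ValueError):
                break
    return 0.0

if __name__ == "__main__":
    articles = sorted(read_batch("rnews01.batch"), key=lambda p: posted(p[1]))
    with open("rnews01.sorted", "wb") as out:
        for sep, art in articles:
            out.write(sep)
            out.write(art)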
On 5 May 2023 at 21:37:05 CEST, "Julien ÉLIE" <iulius@nom-de-mon-site.com.invalid> wrote:
I'm wondering whether you could just:
- Download all the articles with "pullnews -r" (it will write a file
with all the articles within). You may run several instances of
pullnews to have several files.
- Parse the articles within these files (they are separated with "#!
rnews <size>" lines) to take the dates and write the articles in a new
batch file, ordered by posting date.
- Inject these batch files into innd (with rnews). No need to change
any Xref header fields. The articles will be treated in order, assigned
new Xref, and you'll have article numbers and history file sorted as you
want.
Hi Julien,
I let pullnews -r export about 4000 articles to the batch file named 'rnews01.batch'.
The articles in the batch file are complete, i.e. headers and bodies, each separated by a '#! rnews <bytes>' line.
Then I used 'rnews -v rnews01.batch'
But unfortunately INN doesn't accept the articles.
Each article is refused with the error:
"rnews01.batch: rejected 437 No body [Path: not-for-mail ...]"
Any suggestion?
I've tried to convert the batch file using 'dos2unix' and also 'sed -e "s/\r//g"',
but other than the errors being gone, the articles are not
transferred at all. The news log and the other logs remain completely empty.
Hi Eli,
I see something strange in the news log.
For each of the above articles it says:
"May 5 22:36:04.708 - not-for-mail <msg-id>^M 437 No body"
Note the '^M'. It seems INN doesn't understand this line ending?
The 'rnews01.batch' file contains these '^M' characters at the end of each line.
Indeed, I'll have a look. Either by having pullnews write articles with
mere LF, and/or having rnews understand CRLF.
Unfortunately, if you change CRLF by hand, <size> becomes wrong in "#!
rnews <size>"...
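In the meantime, a small untested sketch that strips the CR characters and recomputes the sizes at the same time (file names are examples):

#!/usr/bin/env python3
# Untested sketch: convert each article in an rnews batch from CRLF to LF
# and rewrite its "#! rnews <size>" line so the byte count stays correct.

with open("rnews01.batch", "rb") as inp, open("rnews01.lf", "wb") as out:
    while True:
        sep = inp.readline()
        if not sep:
            break
        if not sep.startswith(b"#! rnews "):
            out.write(sep)               # stray blank line etc., keep as-is
            continue
        art = inp.read(int(sep.split()[2])).replace(b"\r\n", b"\n")
        out.write(b"#! rnews %d\n" % len(art))
        out.write(art)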
On 5 May 2023 at 22:58:59 CEST, "Eli" <eliistheman@gmail.com> wrote:
On 5 May 2023 at 21:37:05 CEST, "Julien ÉLIE"
<iulius@nom-de-mon-site.com.invalid> wrote:
I'm wondering whether you could just:
- Download all the articles with "pullnews -r" (it will write a file
with all the articles within). You may run several instances of
pullnews to have several files.
- Parse the articles within these files (they are separated with "#!
rnews <size>" lines) to take the dates and write the articles in a new
batch file, ordered by posting date.
- Inject these batch files into innd (with rnews). No need to change
any Xref header fields. The articles will be treated in order, assigned new Xref, and you'll have article numbers and history file sorted as you want.
Hi Julien,
I let pullnews -r export about 4000 articles to the batch file named 'rnews01.batch'.
The articles in the batch file are complete, i.e. headers and bodies, each separated by a '#! rnews <bytes>' line.
Then I used 'rnews -v rnews01.batch'
But unfortunately INN doesn't accept the articles.
Each article is refused with the error:
"rnews01.batch: rejected 437 No body [Path: not-for-mail ...]"
Any suggestion?
I see something strange in the news log.
For each of the above articles it says:
"May 5 22:36:04.708 - not-for-mail <msg-id>^M 437 No body"
Note the '^M'. It seems INN doesn't understand this line ending? The 'rnews01.batch' file contains these '^M' characters at the end of each line.
Hi Eli,
I've tried to convert the batch file using 'dos2unix' and also 'sed -e "s/\r//g"',
It won't work because the <size> changes...
but other than the errors being gone, the articles are not
transferred at all. The news log and the other logs remain completely empty.
Aren't these articles already in your spool?
If the Message-IDs are already in the history, rnews won't try to send them.
The 'rnews01.batch' file contains these '^M' characters at the end of each line.
Indeed, I'll have a look. Either by having pullnews write articles with
mere LF, and/or having rnews understand CRLF.
I look forward to your solution.
Could you please test the following patch?
I've tested it with 2 articles in an rnews batch generated with rnews,
and it was imported fine.
(The first 2 fixes for Xref and Bytes are not needed in your case, but
should be fixed in the final commit as well as how $tx_len is computed.)
As an example: Suppose there are two newsgroups, named A and B.
Both newsgroups have articles from the years 2003 to 2023.
First, newsgroup A is downloaded using 'pullnews -r'.
Then, newsgroup B is downloaded using 'pullnews -r'.
Both groups are downloaded into two separated batchfiles.
When the downloads are finished, the batch file created for newsgroup A is fed to INN using rnews.
Hi Eli,
So it would actually be better if pullnews downloaded all articles per newsgroup and ignored the crossposts: just download everything first, save the articles in their folders, and add the Xref field. Nothing more.
I'm wondering whether you could just:
- Download all the articles with "pullnews -r" (it will write a file
with all the articles within). You may run several instances of
pullnews to have several files.
- Parse the articles within these files (they are separated with "#!
rnews <size>" lines) to take the dates and write the articles in a new
batch file, ordered by posting date.
- Inject these batch files into innd (with rnews). No need to change
any Xref header fields. The articles will be treated in order, assigned
new Xref, and you'll have article numbers and history file sorted as you want.
Hi Eli,
The 2 batch files have to be merged in one, ordered by posting date, and
not fed separately to INN.
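If each batch has already been sorted with the earlier sketch (saved here as sort_batch.py, a made-up file name), merging them while keeping the date order could look like this untested sketch:

#!/usr/bin/env python3
# Untested sketch: merge two date-sorted rnews batches into one ordered
# batch, reusing read_batch() and posted() from the earlier sketch.

import heapq

from sort_batch import read_batch, posted   # the earlier sketch, saved as sort_batch.py

def dated(path):
    """Yield (posting_time, separator_line, article_bytes) for one batch."""
    for sep, art in read_batch(path):
        yield posted(art), sep, art

with open("merged.batch", "wb") as out:
    for _, sep, art in heapq.merge(dated("groupA.sorted"), dated("groupB.sorted")):
        out.write(sep)
        out.write(art)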
Hi Eli,
I see in the source code that there is a hard-coded number for the
maximum number of articles a single CAF file has room for:
[storage/timecaf/caf.h]
/*
** Number of slots to put in TOC by default. Can be raised if we ever get
** more than 256*1024=262144 articles in a file (frightening thought).
*/
The "262145: CAF_ERR_ARTWONTFIT" error corresponds to it.
I see in the source code that there is a hard-coded number for the
maximum number of articles a single CAF file has room for:
[storage/timecaf/caf.h]
/*
** Number of slots to put in TOC by default. Can be raised if we ever get
** more than 256*1024=262144 articles in a file (frightening thought).
*/
Is it possible to rebuild the history and overview data with just the timecaf files?
This is in case the system crashes and I have only backed up the timecaf files.
Hi Eli,
Incidentally, in case you start any other thread about INN or any other
news server, please do that in the news.software.nntp newsgroup.