Hi,
This question has probably been asked before, but I couldn't find it, so here it is.
I will soon be setting up a text-only news server for public access, but what is the minimum storage capacity my server needs for all non-binary newsgroups?
Thanks, Eli.
I will soon be setting up a text-only news server for public access, but what is the minimum storage capacity my server needs for all non-binary newsgroups?
SSD (NVMe) disks are not that cheap, but are they necessary for a
news server or are HDD disks fine?
I was thinking to start with 2x 1.92 TB SSD or is that not enough
for all non-binary groups?
I really have no idea how much data these newsgroups take up.
On 10/03/2023 21:04, Eli wrote:
I will soon be setting up a text-only news server for public access, but what is the minimum storage capacity my server needs for all non-binary newsgroups?
Anything you can afford. Hard disk prices are getting cheaper. Are you going to require registration, or will it be an open server like Paganini, aioe, and mixmin? Open servers are quite popular and you'll get more users.
Hello!
Depends on how much spam and low-value articles you can filter out.
20-30GB per year is comfortable.
You can do with way less, if you have a curated list of groups and a good spam filter.
Best regards,
U.ee
Hi,
This question has probably been asked before, but I couldn't find it, so here it is.
I will soon be setting up a text-only news server for public access, but what is the minimum storage capacity my server needs for all non-binary newsgroups?
Thanks, Eli.
On 10/03/2023 21:04, Eli wrote:
I will soon be setting up a text-only news server for public
access, but what is the minimum storage capacity my server needs
for all non-binary newsgroups?
Anything you can afford. Hard disk prices are getting cheaper. Are you going to require registration, or will it be an open server like Paganini, aioe, and mixmin? Open servers are quite popular and you'll get more users.
Make sure you don't filter or censor anything or block anybody on it, otherwise you will become a hate figure and a target for hackers.
SSD (NVMe) disks are not that cheap, but are they necessary for a
news server or are HDD disks fine?
I will soon be setting up a text-only news server for public access,
but what is the minimum storage capacity my server needs for all
non-binary newsgroups?
Hi,
This question has probably been asked before, but I couldn't find it, so here it is.
I will soon be setting up a text-only news server for public access, but what is the minimum storage capacity my server needs for all non-binary newsgroups?
Thanks, Eli.
If you set up a server based on Debian or Ubuntu, plan around 15-20 GB, because the log files will quickly fill up your disk if there are errors.
On 3/11/23 6:11 PM, Timo wrote:
If you set up a server based on Debian or Ubuntu, plan around 15-20
GB, because the log files will quickly fill up your disk if there are
errors.
I would *STRONGLY* suggest checking out log-rotate or the likes if
you're not using it.
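For what it's worth, a minimal logrotate stanza for this could look like the following (a sketch only; the /usr/local/news/log path assumes a default source install of INN, and the rotation counts are arbitrary). Note that INN already ships its own rotation via scanlogs, which news.daily runs, so this is just one option:
/usr/local/news/log/*.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}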
I would rather just set up inotify scripts to truncate or delete log
files to prevent them from filling up a lot of space.
Does INN2 require any of the data in the log files for operation?
Is it safe to delete the log files once they reach a certain size?
What about truncating the log files to X lines every Y hours or when
inotify reports a size limit?
Does INN automatically populate the database with all existing articles from a NEW peer, or only with new articles that come in?
If not, is there a way to download all existing articles from a
(commercial) news server via INN?
I know this is a lot of data to download :)
If not, is there a way to download all existing articles from a
(commercial) news server via INN?
As said above, there are some tools that can be used to pull messages. I believe that `suck` is one such tool.
The download isn't the hard part. The hard part will be getting those messages into your local INN instance. You'll need to (temporarily)
disable default protections which reject older articles.
What about truncating the log files to X lines every Y hours or when
inotify reports a size limit?
Simply truncating files without doing anything else is likely to cause
some corruption and / or uncontrolled disk consumption. You can reduce
the size of the file on disk, but anything with an open file handle may
not know that the file size has shrunk and may therefore do the wrong
thing the next time it writes to the file.
I'm curious why you want to go the inotify route as opposed to simply a
cron job that periodically checks the size of file(s) and takes proper
action if they are over a threshold (size and / or age).
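As a rough illustration of the cron approach (a sketch only; the 100 MB threshold and the log path are made-up examples, and as noted above a writer holding the file open may still misbehave after a truncation):
# crontab entry: every 30 minutes, truncate any log over 100 MB
*/30 * * * * find /usr/local/news/log -name '*.log' -size +100M -exec sh -c ': > "$1"' _ {} \;
Truncating with ': >' at least keeps the same inode that open writers already hold, whereas deleting the file leaves them writing to an unlinked file.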
These commands should be run before starting the pull. The first one deactivates the rejection of old articles, and the other two deactivate spam & abuse filtering.
ctlinnd param c 0
ctlinnd perl n
ctlinnd python n
After pullnews or suck have completed, then re-activate these protections:
ctlinnd param c 10
ctlinnd perl y
ctlinnd python y
Hi San,
If not, is there a way to download all existing articles from a
(commercial) news server via INN?
As said above, there are some tools that can be used to pull messages. I
believe that `suck` is one such tool.
Yes, suck (an external program) does the job.
There's also pullnews, shipped with INN:
https://www.eyrie.org/~eagle/software/inn/docs/pullnews.html
Thank you for this information Julien. I'm copying it to my INN /
Usenet tips & tricks collection.
I think I'd pay some good money or at least a few chocolate fish to
read those notes Grant :)
Can multiple pullnews instances be launched side by side?
Or does this corrupt the INN databases?
Just a quick question about the settings in expire.ctl.
I never want the old messages from any newsgroup to be automatically deleted (expired), even if they are 20 years old.
I have 'groupbaseexpiry' set to 'false' (or is 'true' better?).
Is '0:1:99990:never' in expire.ctl the correct setting for this?
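For comparison, here is a sketch of an expire.ctl that never expires anything when groupbaseexpiry is false; it assumes your articles land in storage class 0, and uses the 'never' keyword that expire.ctl accepts in any of the three time fields:
/remember/:11
0:never:never:never
That avoids the 99990-day workaround, and /remember/ only controls how long message-IDs stay in the history file, not article retention.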
On 2023-03-16, Neodome Admin <admin@neodome.net> wrote:
There are no meaningful text articles bigger than 64 Kb. Actually,
maximum size is probably 32 Kb or less.
There are several regularly posted FAQs, etc, which are larger than
that.
Cheers,
Tom
I said "meaningful", Tom :-)
Seriously, it's not 1995, 2001, or even 2008. No one reads those FAQs. At least on Usenet. We might pretend all we want, but that's just the way things are. Those FAQs are nothing more than regular spam in most of the newsgroups where they are posted. How many times have you visited a group and found nothing except Google Groups drug scam and those FAQs? Probably a lot of times, huh?
Seriously, it's not 1995, 2001, or even 2008. No one reads those FAQs.
I was thinking to start with 2x 1.92 TB SSD or is that not enough for all non-binary groups?
There are no meaningful text articles bigger than 64 Kb. Actually,
maximum size is probably 32 Kb or less.
For people getting into retro computing (Atari, Amiga, etc.), some of those 700+KB FAQ articles are gold.
I look at the server stats and even though there is no open posting
anymore I still see hundreds of people reading via my servers. And after
all these years not a single one of them ever complained that they can't
read some article. And it's not like I was running the server for a year
or two.
DV <dv@reply-to.not.invalid> writes:
I do.
Like I said, we can pretend all we want, but old Usenet is gone. No one cares. I'm sorry guys. No one needs those FAQs.
On 2023-03-16, Neodome Admin <admin@neodome.net> wrote:
I look at the server stats and even though there is no open posting
anymore I still see hundreds of people reading via my servers. And after
all these years not a single one of them ever complained that they can't
read some article. And it's not like I was running the server for a year
or two.
That's your server; run it however you like. The person you suggested should limit article sizes to 64 or 32k might like to know that there are larger articles which may be considered of interest; that's their call to make.
I do.
Neodome Admin wrote:
DV <dv@reply-to.not.invalid> writes:
I do.
Like I said, we can pretend all we want, but old Usenet is gone. No one
cares. I'm sorry guys. No one needs those FAQs.
I say it again: I do. You should stop repeating that *no one* needs them, unless you think I don't exist.
"Neodome Admin" <admin@neodome.net> wrote:
I said "meaningful", Tom :-)
Seriously, it's not 1995, 2001, or even 2008. No one reads those FAQs. At least on Usenet. We might pretend all we want, but that's just the way things are. Those FAQs are nothing more than regular spam in most of the newsgroups where they are posted. How many times have you visited a group and found nothing except Google Groups drug scam and those FAQs? Probably a lot of times, huh?
I've found many useful over the years. If you're not being fed them it
seems difficult to judge their value.
For people getting into retro computing (Atari, Amiga, etc.), some of
those 700+KB FAQ articles are gold.
Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:
"Neodome Admin" <admin@neodome.net> wrote:
I said "meaningful", Tom :-)
Seriously, it's not 1995, 2001, or even 2008. No one reads those FAQs. At least on Usenet. We might pretend all we want, but that's just the way things are. Those FAQs are nothing more than regular spam in most of the newsgroups where they are posted. How many times have you visited a group and found nothing except Google Groups drug scam and those FAQs? Probably a lot of times, huh?
I've found many useful over the years. If you're not being fed them it
seems difficult to judge their value.
For people getting into retro computing (Atari, Amiga, etc.), some of
those 700+KB FAQ articles are gold.
How many of them contain any information that can’t be found on the web?
I don't doubt that.
I doubt that regular posting of 700+KB FAQ is doing any good.
I doubt that anything in those FAQs is more useful than information that can be found with Google or DuckDuckGo. We're not living in the era of Altavista, after all. And if there is some kind of gem hidden there, one simply doesn't need to post it to a newsgroup regularly along with 700+KB of irrelevant text.
Plus, I'm pretty sure that if there are any questions, one can just
ask a question in retro-computing group and expect an answer... unless
that group is dead, of course.
I think you belong to binary Usenet, and you're free to read and post anything you want as long as all parties involved agree on that.
No, seriously, I have no problems with 700+KB posts. If you want, I can
set up a script posting any 700+KB FAQ you want, to any newsgroup, using
your name, as often as you want, and even more often. What do you say?
On 3/16/23 3:43 AM, Neodome Admin wrote:
I doubt that anything in those FAQs is more useful than information that can be found with Google or DuckDuckGo. We're not living in the era of Altavista, after all. And if there is some kind of gem hidden there, one simply doesn't need to post it to a newsgroup regularly along with 700+KB of irrelevant text.
I think that there is some value in having some unrequested
information put in front of you.
I've seen many things that I didn't know that I wanted to know
put in front of me.
I've also been mildly interested in something and seen something new (to me) done with it that really piques my interest and causes me to actively investigate it.
I believe there is some value in things being put in front of
me for my perusal.
Plus, I'm pretty sure that if there are any questions, one can
just ask a question in retro-computing group and expect an
answer... unless that group is dead, of course.
It's really hard to ask a question about something if you don't
know that said something exists.
I don't mind quarterly or even monthly posting of FAQs. I do
have an objection to super large FAQs. -- I think I have my
server configured to accept 128 kB articles.
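For anyone wanting the same cap, the knobs live in inn.conf; a sketch (131072 bytes = 128 kB, adjust to taste):
maxartsize:       131072
localmaxartsize:  131072
maxartsize limits what innd accepts from peers, while localmaxartsize limits what nnrpd accepts from local posters.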
DV <dv@reply-to.not.invalid> writes:
I do.
Like I said, we can pretend all we want, but old Usenet is gone. No one cares. I'm sorry guys. No one needs those FAQs.
I look at the server stats and even though there is no open posting
anymore I still see hundreds of people reading via my servers. And after
all these years not a single one of them ever complained that they can't
read some article. And it's not like I was running the server for a year
or two.
Hi Eli,
Can multiple pullnews instances be launched side by side?
Yes, though you have to use a different set of newsgroups for each instance. Otherwise, they would do the same thing and it wouldn't run any faster.
For instance:
pullnews -t 3 -c pullnews.marks1
pullnews -t 3 -c pullnews.marks2
Is it possible in pullnews to pre-skip articles above a certain number of bytes, instead of downloading the whole article first?
Maybe by making a small change in the perl script?
pullnews -t 3 -c pullnews.marks1
pullnews -t 3 -c pullnews.marks2
In the pullnews logs I see many of these lines:
x
DEBUGGING 55508 421
x
DEBUGGING 55509 421
What does this mean and what causes it?
After each such line it takes about 2 minutes until the next article is downloaded, which slows down the download enormously.
I may add a dedicated option to pullnews and integrate it in a future release, if that may prove to be useful for others.
I may add a dedicated option to pullnews and integrate it in a future
release, if that may prove to be useful for others.
I would likely use it. :)
In the pullnews logs I see many of these lines:
x
DEBUGGING 55508 421
x
DEBUGGING 55509 421
What does this mean and what causes it?
After each such line it takes about 2 minutes until the next article is downloaded, which slows down the download enormously.
It means that article numbers 55508 and 55509 were not found on the server (x). My guess is that the connection has timed out (421 is a special internal code).
Jesse reported a bug which sounds like that a few months ago.
Could you please download the latest version of pullnews and try it?
https://raw.githubusercontent.com/InterNetNews/inn/main/frontends/pullnews.in
Just grab that file, rename it without .in, and change the first 2 lines
to fit what your current pullnews script has (it is the path to Perl and
the INN::Config module).
Then you can run that script. It will work with your version of INN.
Hi Eli,
Is it possible in pullnews to pre-skip articles above a certain number of
bytes, instead of downloading the whole article first?
Currently not.
Maybe by making a small change in the perl script?
I would suggest these additional lines:
@@ -928,6 +928,13 @@
push @{$article}, "\n" if not $is_control_art;
}
}
+
+ my $overview = $fromServer->xover($i);
+ # Skip the article if its size is more than 100,000 bytes.
+ if ($$overview{$i}[5] and $$overview{$i}[5] > 100000) {
+ $skip_article = 1;
+ }
+
if (not $skip_article
and (not $header_only or $is_control_art or
$add_bytes_header))
{
I've quickly tested it, and I believe it works.
I may add a dedicated option to pullnews and integrate it in a future release, if that may prove to be useful for others.
The latest version seems to have fixed it.
I may add a dedicated option to pullnews and integrate it in a future
release, if that may prove to be useful for others.
This also works perfectly.
Hi Eli and Jesse,
Hope it suits your needs :-)
When running with -d 1, sometimes when I hit CTRL-C to stop the process it wipes out the pullnews.marks file. It does not do this every time, but it seems to happen if I stop the process while it is retrieving overview information.
Hi Eli and Jesse,
I may add a dedicated option to pullnews and integrate it in a future
release, if that may prove to be useful for others.
I've just committed a proper patch, which will be shipped with INN
2.7.1. (You can grab it at the same URL as provided before.)
-L size
Specify the largest wanted article size in bytes. The default is to
download all articles, whatever their size. When this option is
used, pullnews will first retrieve overview data (if available) of
each newsgroup to process, so as to obtain article sizes, before
deciding which articles to actually download.
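So, following the earlier examples in this thread, a run capping articles at roughly 100 kB would look like:
pullnews -L 100000 -t 3 -c pullnews.marks1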
However, I have my doubts about the fact that this new version first downloads
the overview data of the entire newsgroup. That can take a while for a newsgroup
with a few million messages. I don't know if I like that.
On 2023-03-20, Eli <eliistheman@gmail.com> wrote:
However, I have my doubts about the fact that this new version first downloads
the overview data of the entire newsgroup. That can take a while for a
newsgroup
with a few million messages. I don't know if I like that.
You need the overview to get the article size before downloading,
whether that's all at once or per article. I imagine it's more efficient
to get it all up front and then filter out the unwanted articles.
Cheers,
Tom
Hi Jesse,
When running with -d 1, sometimes when I hit CTRL-C to stop the process it wipes out the pullnews.marks file. It does not do this every time, but it seems to happen if I stop the process while it is retrieving overview information.
When you say "wipe out", does it mean you have an empty pullnews.marks
file? Or a pullnews.marks file with wrong article numbers?
Does it happen only with "-d 1"?
I'm unsure what could cause that, as I've not changed the way the configuration file is handled :-/
It is saved when pullnews receives a SIGINT (Ctrl+C for instance), and
it writes the last article number processed.
I've tried to reproduce it with "-d 1", but do not see anything
suspicious in pullnews.marks. The last line in standard output is
"Saving config" after Ctrl+C.
Here is what I'm seeing in a session that I did not kill, but which was killed off after the upstream host cut me off for its time limit:
. DEBUGGING 361445 -- not downloading already existing message <45af67cb$0$20803$5fc30a8@news.tiscali.it> code=223
. DEBUGGING 361449 -- not downloading already existing message <45af6a64$0$20803$5fc30a8@news.tiscali.it> code=223
. DEBUGGING 361451 -- not downloading already existing message <xn0f1cwlp7d8yw000IdSub@news.individual.net> code=223
Transfer to server failed (436): Flushing log and syslog files
When I start the command again:
[news@spool1 ~]$ pullnews -d 1 -O -c pullnews4.marks -L 200000 -t 3 -G it.sport.calcio,it.sport.calcio.estero,it.sport.calcio.fiorentina,it.sport.calcio.genoa,it.sport.calcio.inter
Mon Mar 20 11:00:14 2023 start
No servers!
[news@spool1 ~]$ cat pullnews4.marks
[news@spool1 ~]$
However, I have my doubts about the fact that this new version first downloads
the overview data of the entire newsgroup. That can take a while for a
newsgroup with a few million messages. I don't know if I like that.
You need the overview to get the article size before downloading,
whether that's all at once or per article. I imagine it's more efficient
to get it all up front and then filter out the unwanted articles.
It may be a bit more efficient, but I still see more disadvantages than advantages. For example, if I want to change the maximum article size or the regex for option -m, that is no longer possible once the entire overview has been downloaded and the filtering has already been applied.
Hi Eli,
However, I have my doubts about the fact that this new version first downloads the overview data of the entire newsgroup. That can take a while for a newsgroup with a few million messages. I don't know if I like that.
You need the overview to get the article size before downloading, whether that's all at once or per article. I imagine it's more efficient to get it all up front and then filter out the unwanted articles.
Indeed, downloading the overview in a single command is overall faster than doing it article by article.
Suppose the group contains article numbers 1 to 1,000,000 and the last
time pullnews ran, it retrieved article 800,000.
Then on a new run, it will first ask for the overview of articles 800,001 to 1,000,000 in a single command, and it will get a single (long) answer.
Then pullnews will actually download only articles known to be smaller
than the maximum size wanted.
Otherwise, the easiest way would be to retrieve overview data article by article. It would take more time overall, but I agree the user experience is better, as the user does not get the feeling of a "hang" during the download of the whole overview.
I could tweak it to download in batches of 100 articles, for instance, but it's more work to do :( I may have to do that eventually...
Hi Jesse,
Here is what I'm seeing in a session that I did not kill, but which was killed off after the upstream host cut me off for its time limit:
. DEBUGGING 361445 -- not downloading already existing message
<45af67cb$0$20803$5fc30a8@news.tiscali.it> code=223
. DEBUGGING 361449 -- not downloading already existing message
<45af6a64$0$20803$5fc30a8@news.tiscali.it> code=223
. DEBUGGING 361451 -- not downloading already existing message
<xn0f1cwlp7d8yw000IdSub@news.individual.net> code=223
Transfer to server failed (436): Flushing log and syslog files
Hmm, this log line does not correspond to a time limit enforced by the upstream host. It is generated by the downstream server to which you
are sending articles. The "Flushing log and syslog files" message
appears during log rotation (INN is paused a very short moment).
When I start the command again:
[news@spool1 ~]$ pullnews -d 1 -O -c pullnews4.marks -L 200000 -t 3 -G it.sport.calcio,it.sport.calcio.estero,it.sport.calcio.fiorentina,it.sport.calcio.genoa,it.sport.calcio.inter
Mon Mar 20 11:00:14 2023 start
No servers!
[news@spool1 ~]$ cat pullnews4.marks
[news@spool1 ~]$
Gosh!
Don't you have anything else after "Transfer to server failed (436):
Flushing log and syslog files"?
No "can't open pullnews4.marks" error?
I'm a bit surprised, the configuration file is saved this way:
open(FILE, ">$groupFile") || die "can't open $groupFile: $!\n";
print LOG "\nSaving config\n" unless $quiet;
print FILE "# Format: (date is epoch seconds)\n";
print FILE "# hostname[:port][_tlsmode] [username password]\n";
print FILE "# group date high\n";
foreach $server ( ... )
print [...]
close FILE;
You don't even have the "Saving config" debug line in your console, nor
the 3 initial # lines written in the new pullnews4.marks file...
Sounds like open() failed, or close() failed...
Could you try to add an explicit error message?
close(FILE) or die "can't close $groupFile: $!\n";
Can you think of anything that could explain why the file couldn't be written? (Lack of disk space, wrong permissions on the file because pullnews was not started as the right user, etc.)
I could tweak it to download in batches of 100 articles, for instance, but it's more work to do :( I may have to do that eventually...
Maybe it would be sufficient to print a message indicating that it is retrieving overview data, so the user knows what is happening during the pause?
Hello Julien,
In some newsgroups I get the following error while using pullnews:
DEBUGGING 560 Post 436: Msg: <Can't store article>
Then pullnews quits.
Can this be avoided as it is very annoying.
Hi Eli,
In some newsgroups I get the following error while using pullnews:
DEBUGGING 560 Post 436: Msg: <Can't store article>
Then pullnews quits.
Can this be avoided as it is very annoying.
Do you happen to have other logs in <pathlog>/news.err or news.notice?
It would be useful to understand why innd did not manage to store the
article provided by pullnews.
It is an unusual error. Do all the
newsgroups match an entry in storage.conf?
In the latest version of pullnews (the one from the link you posted earlier) it quits with the error:
Transfer to server failed (436): Can't store article
Didn't previous versions of pullnews report the same error?
It seems that this only happens with some old posted articles. But still very
annoying.
Only old posts in some newsgroups? Do they have something special?
(article number > 2^31, unusual headers, etc.)
In some newsgroups I get the following error while using pullnews:
DEBUGGING 560 Post 436: Msg: <Can't store article>
Then pullnews quits.
Can this be avoided as it is very annoying.
In the latest version of pullnews (the one from the link you posted earlier) it quits with the error:
Transfer to server failed (436): Can't store article
It seems that this only happens with some old posted articles. But still very annoying.
Here are 3 that I have on hand at the moment:
<42258b3e_2@127.0.0.1>
<Xns9735CDDF6F44CYouandmeherecom@216.113.192.29>
<1186867877_2111@sp6iad.superfeed.net>
If you cannot access the articles then let me know and I'll post the headers here.
Hi Eli,
I've tried to inject the first one on my news server, and do not see any problem... I don't know why it cannot be stored on yours.
(I've only added "trigofacile.test" to the list of newsgroups as I do
not carry alt.*)
235 Article transferred OK
Transfer to server failed (436): Can't store article
I'll keep looking for a cause.
On 3/16/23 4:19 AM, Neodome Admin wrote:
I think you belong to binary Usenet, and you're free to read and
post anything you want as long as all parties involved agree on
that.
Wait a minute.
We're talking about a /text/ post consisting entirely of printable ASCII meant to be read by a human. That's very much /text/. It's not a binary encoded as text.
No, seriously, I have no problems with 700+KB posts. If you want, I can
set up a script posting any 700+KB FAQ you want, to any newsgroup, using
your name, as often as you want, and even more often. What do you say?
Stop it.
I know that you know that would be a form of abuse.
On 3/16/23 3:43 AM, Neodome Admin wrote:
I don't doubt that.
So you agree that the content of the articles does have some value to
some people.
I doubt that regular posting of 700+KB FAQ is doing any good.
What's your primary objection? The frequency or the size of the posts?
I doubt that anything in those FAQs is more useful than information that can be found with Google or DuckDuckGo. We're not living in the era of Altavista, after all. And if there is some kind of gem hidden there, one simply doesn't need to post it to a newsgroup regularly along with 700+KB of irrelevant text.
I think that there is some value in having some unrequested
information put in front of you.
I've seen many things that I didn't know that I wanted to know put in
front of me.
I've also been mildly interested in something and seen something new (to me) done with it that really piques my interest and causes me to actively investigate it.
I believe there is some value in things being put in front of me for
my perusal.
Plus, I'm pretty sure that if there are any questions, one can just
ask a question in retro-computing group and expect an
answer... unless that group is dead, of course.
It's really hard to ask a question about something if you don't know
that said something exists.
I don't mind quarterly or even monthly posting of FAQs. I do have an objection to super large FAQs. -- I think I have my server
configured to accept 128 kB articles.
Even at 1 MB, this is only a few seconds' worth of audio / video as -- purportedly -- admin@Neodome pointed out in a different message. These messages really are not much to worry about. -- My news server sees 50 or more of these messages' worth of traffic per day. So, one of these per month, much less per quarter, is not even worth complaining about.
Hi Eli,
Transfer to server failed (436): Can't store article
I'll keep looking for a cause.
As it seems you are using the tradspool storage system, could you please
try:
scanspool -n -v
Though probably not related to overview, could you also try:
tdx-util -A
(if you're using tradindexed)
On 22 Mar 2023 at 07:47:21 CET, "Julien ÉLIE" <iulius@nom-de-mon-site.com.invalid> wrote:
Hi Eli,
Transfer to server failed (436): Can't store article
I'll keep looking for a cause.
As it seems you are using the tradspool storage system, could you please
try:
scanspool -n -v
Though probably not related to overview, could you also try:
tdx-util -A
(if you're using tradindexed)
Since these commands take quite a long time, I will wait until all pullnews sessions are done and then let you know.
Something else:
How can I reset a newsgroup that has already been fully downloaded so that pullnews starts downloading all posts again?
Can this be done by:
1) 'ctlinnd rmgroup <group>'
2) 'ctlinnd newgroup <group>'
or is there a better way?
Thanks again, and apologies for all my questions.
They are not posts *created* by humans, and this is my problem with them.
Of course, if we try to be completely logical about this, there can be posts created by humans with binary files attached, etc., and no one cares about those.
You are correct, Grant. It was sarcasm.
Same as binary MIME attachments to legit Usenet messages written
by real people. They have some value for me if they add to the
conversation.
Is there really a reason to avoid them now, when I literally use more memory on my 256 GB iPhone to store pictures of random dogs and cats than I use on my server to store 2 years of unfiltered text Usenet? By unfiltered I mean completely unfiltered, all Google Groups spam and other junk included.
I just find it technically much simpler to differentiate by the
article size. Bigger than some value - binary. Smaller - text Usenet.
Thus my advice.
FAQs are a little bit different story than other messages. Like I said, my main problem with them is that they're not written by people, and thus I don't see the need to treat them any differently than spam and binaries. After all, all those binary messages can also be useful for someone; maybe an even bigger number of people will find them more useful than FAQs.
I think that legit text conversations in binary newsgroups bring more to Usenet as a communication platform than bi-weekly FAQs in dead text newsgroups, thus they are the ones that deserve to be preserved for future readers.
BTW, currently it's not being done by text Usenet servers.
You are correct. If there are FAQs bigger than 64 Kb, the amount of data they consume is minuscule compared even to the Google Groups spam. Actually, thinking of it, I might receive them anyway from one of the peers who set their newsfeeds incorrectly, and probably still didn't fix it. I just never complained about it because it's not a problem from a technical point of view.
Hi Eli,
Something else:
How can I reset a newsgroup that has already been fully downloaded so that pullnews starts downloading all posts again?
Can this be done by:
1) 'ctlinnd rmgroup <group>'
2) 'ctlinnd newgroup <group>'
or is there a better way?
+1 for Jesse's way.
I have a question about these ctlinnd rmgroup/newgroup commands. Do you happen to have already used them to "reset" a newsgroup?
It would explain the "Can't store" errors if you also did not purge the tradspool files in <pathspool> for some newsgroups. Files named with
article numbers "1", "2", "3", etc. will still be present in your spool.
If you recreate a newsgroup with ctlinnd rmgroup/newgroup, it just recreates it in the active file, without wiping the spool. Article numbering is reset to 1, and INN will try to store articles in the already existing "1", "2", etc. files.
On Mar 22, 2023 at 6:54:40 AM CDT, "Eli" <eliistheman@gmail.com> wrote:
On 22 Mar 2023 at 07:47:21 CET, "Julien ÉLIE"
<iulius@nom-de-mon-site.com.invalid> wrote:
Hi Eli,
Transfer to server failed (436): Can't store article
I'll keep looking for a cause.
As it seems you are using the tradspool storage system, could you please try:
scanspool -n -v
Though probably not related to overview, could you also try:
tdx-util -A
(if you're using tradindexed)
Since these commands take quite a long time, I will wait until all pullnews sessions are done and then let you know.
Something else:
How can I reset a newsgroup that has already been fully downloaded so that pullnews starts downloading all posts again?
Can this be done by:
1) 'ctlinnd rmgroup <group>'
2) 'ctlinnd newgroup <group>'
or is there a better way?
Thanks again, and apologies for all my questions.
If you've already made a full pass over the group with pullnews and want to make another full pass, I think the easiest is to modify the pullnews.marks count for that group and set it to 1. That should cause pullnews to start from the beginning.
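Concretely, a pullnews.marks file follows the commented format quoted earlier in this thread (a server line, then an indented "group date high" line), so resetting boils down to editing the last number; a hypothetical example, with made-up host and group:
news.upstream.example.com
	alt.test 1679313600 1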
Though, I wonder if we are now in the day & age that we could create
filters that either:
- detect multiple strings of text with white space between them, thus words.
- detect the standard encoding methods; e.g. 76 x [A-Za-z0-9+/=] for base64 (a rough sketch follows below)
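In Perl (the language Cleanfeed itself is written in), a crude version of that second check might look like this; the 60-78 character range and the 0.8 ratio are arbitrary assumptions, not tuned values:
# Guess whether a body (passed as a list of lines) is an encoded binary.
sub looks_like_binary {
    my @lines = @_;
    return 0 unless @lines;
    my $encoded = grep {
        /^[A-Za-z0-9+\/=]{60,78}$/      # base64-style lines
        or /^M[\x20-\x60]{60}$/         # full uuencode data lines start with 'M'
    } @lines;
    return ($encoded / @lines) > 0.8;   # mostly encoded lines => treat as binary
}
Real filters (Cleanfeed, pyClean, Diablo's built-in type detection mentioned below) are of course far more thorough than this.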
Something else:
How can I reset a newsgroup that has already been fully downloaded so that pullnews starts downloading all posts again?
Can this be done by:
1) 'ctlinnd rmgroup <group>'
2) 'ctlinnd newgroup <group>'
or is there a better way?
In the context of detecting encoded binary attachments, I feel like that should be relatively easy to do.
I've been wondering if it might be possible to use something like spamassassin with bayesian learning on a newsfeed, though I haven't got to the point of trying to implement anything yet.
I don't know what SpamAssassin will think of news articles.
I wonder if it would be possible to leverage something like the milter interface to SpamAssassin so that you don't need to integrate and/or fork SpamAssassin.
While Cleanfeed is effective enough at what it does, there's no
"smarts" to it and it can be a chore coming up with effective patterns
that work but don't get in the way of legitimate posts that happen to
contain some of the "trouble" words or phrases.
I've been wondering if it might be possible to use something like spamassassin with bayesian learning on a newsfeed though I haven't
got to the point of trying to implement anything yet.
On Mar 22, 2023 at 1:09:28 PM CDT, "Grant Taylor" <gtaylor@tnetconsulting.net>
wrote:
Though, I wonder if we are now in the day & age that we could create
filters that either:
- detect multiple strings of text with white space between them, thus
words.
- detect the standard encoding methods; e.g. 76 x [A-Za-z0-9+/=] for base64
Diablo has this article type detection built in and allows you to filter based
on types in newsfeed definitions. Cleanfeed and pyClean do the same for INN. It's not perfect, but pretty damn effective.
On 2023-03-22, Jesse Rehmer <jesse.rehmer@blueworldhosting.com> wrote:
On Mar 22, 2023 at 1:09:28 PM CDT, "Grant Taylor" <gtaylor@tnetconsulting.net>
wrote:
Though, I wonder if we are now in the day & age that we could create
filters that either:
- detect multiple strings of text with white space between them, thus
words.
- detect the standard encoding methods; e.g. 76 x [A-Za-z0-9+/=] for base64
Diablo has this article type detection built in and allows you to filter based
on types in newsfeed definitions. Cleanfeed and pyClean do the same for INN. It's not perfect, but pretty damn effective.
While Cleanfeed is effective enough at what it does, there's no "smarts"
to it and it can be a chore coming up with effective patterns that work
but don't get in the way of legitimate posts that happen to contain some
of the "trouble" words or phrases. I've been wondering if it might be possible to use something like spamassassin with bayesian learning on a newsfeed though I haven't got to the point of trying to implement
anything yet.
Cheers,
Tom
I don't know what SpamAssassin will think of news articles.
Seems like I remember efforts in the past, perhaps not specific to INN or Diablo, but other tools to implement SpamAssassin for filtering articles; offhand I can't recall where that conversation occurred.
I wonder if it would be possible to leverage something like the milter interface to SpamAssassin so that you don't need to integrate and/or fork SpamAssassin.
On 3/22/23 1:33 PM, Tom Furie wrote:
While Cleanfeed is effective enough at what it does, there's no
"smarts" to it and it can be a chore coming up with effective patterns
that work but don't get in the way of legitimate posts that happen to
contain some of the "trouble" words or phrases.
Please elaborate and share some examples.
This is in particular for the newsgroup 'news.lists.filters'. This group contains the references to the 'spam' messages that NoCeM then deletes. I want to reset this newsgroup 'news.lists.filters' so that all messages are checked locally again and, in case of spam, removed.
As for NoCeM, you can directly refeed your notices to perl-nocem without resetting anything.
perl-nocem expects storage tokens on its standard input.
Example:
echo '@020162BEB132016300000000000000000000@' | perl-nocem
As you're running tradindexed overview, I would suggest to have a look
at the output of:
tdx-util -g -n news.lists.filters
It dumps the overview data of this newsgroup. The last field is a
storage token.
You could replay NoCeM notices with this information :)
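Combining those two hints, something along these lines might replay the whole group in one go (an untested sketch; it relies on the storage token being the last whitespace-separated field of each dumped overview line):
tdx-util -g -n news.lists.filters | awk '{print $NF}' | perl-nocem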
Hi Eli,
That's what I thought at first too, but this prevents all existing files in the spool from being downloaded again and all messages are treated as
'-- not downloading already existing message'.
My question is therefore how you can completely reset a newsgroup so that
everything is downloaded again.
Ah yes, that's a bit tricky, as what you want is to remove all traces of the articles in the spool, overview, and history.
The proper method would be to:
- ctlinnd rmgroup xxx
- remove the <pathspool>/articles/.../xxx directory of the group
- set /remember/ to 0 in expire.ctl
- run the expireover and expire process (for instance via news.daily
called with the same parameters as in crontab, plus "notdaily")
- undo the change in expire.ctl (/remember/ set to 11)
- ctlinnd newgroup xxx
- reset the last downloaded article in pullnews.marks for this group
- deactivate Perl and Python filters, and set the artcutoff to 0
- run pullnews
- reactivate the filters, and artcutoff to 10
I think INN will happily accept being refed these articles.
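As a rough command-level sketch of that procedure (the group name and spool path are placeholders; the news.daily keywords should mirror your own crontab entry, plus notdaily):
ctlinnd rmgroup alt.example
rm -r /usr/local/news/spool/articles/alt/example
# set /remember/ to 0 in expire.ctl, then:
news.daily notdaily expireover lowmark
# restore /remember/ to 11 in expire.ctl, then:
ctlinnd newgroup alt.example
# reset the group's line in pullnews.marks, then:
ctlinnd param c 0 ; ctlinnd perl n ; ctlinnd python n
pullnews -t 3 -c pullnews.marks
ctlinnd param c 10 ; ctlinnd perl y ; ctlinnd python y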
Hi Eli,
In some newsgroups I get the following error while using pullnews:
DEBUGGING 560 Post 436: Msg: <Can't store article>
Then pullnews quits.
Can this be avoided as it is very annoying.
Do you happen to have other logs in <pathlog>/news.err or news.notice?
It would be useful to understand why innd did not manage to store the
article provided by pullnews. It is an unusual error. Do all the
newsgroups match an entry in storage.conf?
Another question: is it possible to limit the maximum number of connections per authenticated user? I know this is possible for peers, but can this also be set up for authenticated users? Maybe there is a setting in readers.conf or nnrpd that I'm overlooking?
From: yamo' <yamo@beurdin.invalid>
Newsgroups: news.software.nntp
Subject: Re: Google Groups spam - INN/Cleanfeed/etc solutions?
Date: Sun, 19 Sep 2021 10:11:24 -0000 (UTC)
Message-ID: <si72cc$ko9$1@pi2.pasdenom.info>
If you've already made a full pass over the group with pullnews and want to make another full pass, I think the easiest is to modify the pullnews.marks count for that group and set it to 1. That should cause pullnews to start from the beginning.
That's what I thought at first too, but this prevents all existing files in the spool from being downloaded again and all messages are treated as
'-- not downloading already existing message'.
My question is therefore how you can completely reset a newsgroup so that everything is downloaded again.
This is in particular for the newsgroup 'news.lists.filters'. This group contains the references to the 'spam' messages that NoCeM then deletes. I want to reset this newsgroup 'news.lists.filters' so that all messages are checked locally again and, in case of spam, removed.
I probably found the problem.
The errlog gives the following error:
==
innd: tradspool: could not symlink /usr/local/news/spool/articles/alabama/politics/11365 to /usr/local/news/spool/articles/alt/2600/414/78: Not a directory
==
/usr/local/news/spool/articles/alt/2600/414 is a file, but for some reason INND wants to create a folder in that path with the same name as the file name.
Any ideas how this is possible and how to fix?
Unfortunately, the response is no. There's no native way of limiting
users' connections.
It's pretty simple to run nnrpd via other utilities that will do the limiting for you, though; most UNIX/Linux systems have at least two or three tools that accomplish more or less the same thing.
That said, it would be nice to have that ability directly in nnrpd.
Hi Eli,
Another question: is it possible to limit the maximum number of connections per authenticated user? I know this is possible for peers, but can this also be set up for authenticated users? Maybe there is a setting in readers.conf or nnrpd that I'm overlooking?
Unfortunately, the response is no. There's no native way of limiting
users' connections.
You may want to write a custom authentication hook (perl_auth or python_auth in readers.conf) that would do the job by counting how many connections are open for a given user, and denying access if the count exceeds the limit. I am not aware of existing scripts that do that :-(
It could be worthwhile to have, though, as you're not the first one to ask (but nobody has written one, or shared what they came up with).
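As a very rough illustration of that route, the sketch below limits by client IP rather than by username (a stock nnrpd only puts the client host in its process title), and it deliberately dodges the hard part of reliably decrementing a per-user count on disconnect; it is untested and the limit of 4 is arbitrary:
# max_conn.pl -- hypothetical script referenced from readers.conf as: perl_auth: "max_conn.pl"
use strict;
use warnings;

our %attributes;    # filled in by nnrpd before authenticate() is called
my $limit = 4;

sub authenticate {
    my $client = $attributes{ipaddress};
    # Count nnrpd processes already serving this client, based on the
    # "nnrpd: <host> COMMAND" titles nnrpd sets (see its man page).
    my $count = grep { /\bnnrpd\b/ and /\Q$client\E/ } `ps -eo args`;
    return (481, "too many connections from $client") if $count > $limit;
    # Real username/password checking would go here.
    return (281, "");
}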
Oh, there's no problem with it catching binaries; that's a non-issue. I'm talking about methods for catching the still ever-prevalent text spam.
I don't imagine it will have any problem with the bodies, but the
headers will likely be a different matter since I doubt spamassassin
knows anything about them. Maybe some custom rulesets to inform it
what to look at...
Yes, I was thinking of interfacing that way, or feeding everything
off to spamd.
Neodome Admin wrote:
DV <dv@reply-to.not.invalid> writes:
I do.
Like I said, we can pretend all we want, but old Usenet is gone. No one
cares. I'm sorry guys. No one needs those FAQs.
I say it again: I do. You should stop repeating that *no one* needs them, unless you think I don't exist.
There's no native way of limiting users' connections.
Here are a few that I think illustrate the "effective pattern" problem.
Now, this sample is all Google - which is already tagged as a known spam source - but still they made it through. Sure, I could just block the sender, but that seems a bit of a "blunt instrument" approach to me. And what happens in the potential situation where a spammer forges an otherwise legitimate poster's email address, etc.?
There are also the posts where the originals get caught by the filter, but the fully quoted replies, including full headers posted into the body of the "complaint", make it through. That's one poster I'm incredibly close to outright banning, since he's effectively just a reflector of the original spam.
Sure, you exist.
But who wants an obsolete FAQ? To do what?
Ya. I'm not a fan of blocking Google carte blanche like some advocate for.
Ah yes, exactly. That's the reason why this was never implemented in
INN. It's not seen as a priority at all, and it's also not trivial to do.
Issue #23
"nnrpd currently has no way of limiting connections per IP address other
than using the custom auth hooks. In its daemon mode, it could in
theory keep track of this and support throttling. It's probably not
worth trying to support this when invoked via inetd, since at that point
one could just use xinetd and its built-in support for things like this.
When started from innd, this is a bit harder. innd has some basic rate limiting stuff, but nothing for tracking number of simultaneous
connections over time. It may be fine to say that if you want to use
this feature, you need to have nnrpd be invoked separately, not run from innd."
So the answer is to use something like "per_source = 5" in xinetd.conf
and start nnrpd by xinetd.
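A minimal xinetd stanza along those lines might look like this (the paths, bind address, and limits are examples; per_source caps simultaneous connections per client IP, not per authenticated user):
service nntp
{
        socket_type     = stream
        protocol        = tcp
        wait            = no
        user            = news
        server          = /usr/local/news/bin/nnrpd
        bind            = 192.0.2.2
        per_source      = 5
        instances       = 100
        disable         = no
}
The bind line is there because, as discussed just below, innd usually already owns port 119 on the primary address.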
Hi Eli,
Any ideas how this is possible and how to fix?
Ah, OK, I understand.
The article I tested last day was posted to a newsgroup named
"alt.2600.414". It did not produce any error on my news server because
I do not have a newsgroup named alt.2600.
Phuque gewghul.
2) nnrpd sends the login data to perl_auth, which consults the database for authorization, as well as checking whether the user has already reached his maximum connections.
So point 2 is where the problem lies.
Each time the authorization is successful, a 'session' record can be added to the database. The number of records determines how many sessions are running for this user.
But as soon as a session disconnects, the record must be removed from the database. However, nnrpd does not know that the session has been disconnected.
Only xinetd knows this, but it doesn't have the user data, nor can it access the database.
Perhaps something can be done with the xinetd PID, but even then it will have to be passed to the perl_auth script.
Do you have a suggestion?
That problem is probably why nobody has yet written what was requested.
So here's the problem I can't solve.
Starting nnrpd by xinetd on port 119 requires a second IP (since
innd is already bound to 119). But how does this affect peers? Do
they connect to the IP of innd or nnrpd?
Here are a few that I think illustrate the "effective pattern" problem.
Thank you for the message IDs. Unfortunately Thunderbird is treating
them as email addresses. I'll have to find a way to look them up.
1) xinetd starts nnrpd, and after this the authentication takes place.
Hi Grant,
http://al.howardknight.net/
Hi,
The messages I download via pullnews are sent to the peers. How can that be prevented?
I can temporarily block the newsgroup in "newsfeeds", but that has its drawbacks. Can this also be done differently?
Without disabling newsfeeds entirely while using pullnews, an option is to have pullnews add a fake Path entry that you configure each of your newsfeeds to exclude.
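On the newsfeeds side, the exclusion goes after the site name, so anything whose Path already contains that label is never queued for the peer. A sketch (the peer name and the pulled.invalid label are hypothetical, and check the pullnews documentation for how to add the extra Path hop on its side; keep whatever patterns and flags your existing entry already uses):
peer.example.com/pulled.invalid\
        :*\
        :Tf,Wnm:peer.example.com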
Could you please download the latest version of pullnews and try it?
https://raw.githubusercontent.com/InterNetNews/inn/main/frontends/pullnews.in
Just grab that file, rename it without .in, and change the first 2 lines
to fit what your current pullnews script has (it is the path to Perl and
the INN::Config module).
Then you can run that script. It will work with your version of INN.
I think there is a bug in this latest version.
When pullnews starts, it writes a PID file.
If pullnews is accidentally restarted, it will report that pullnews is already
running and that's perfect.
But after this message pullnews deletes the PID file :(
Another question:
How can I delete all messages up to a certain date in a newsgroup immediately?
The messages were received via pullnews, so the date received is the same for all messages.
The issue you're facing is only triggered when you run pullnews outside
an INN installation (which is a possible use of pullnews, as it could be
run from a separate server).
There are 2 branches at several places in the code. PID file handling
was wrong in one of these branches, as you noticed.
I believe the issue is now fixed, and you can download the updated
pullnews script from the Git repository.
When you say "up to a certain date", do you mean before or after?
If it is before, then just use expire.ctl and run news.daily so as to
expire articles.
Verify that you do not use the "-p" option in the expireoverflags and
flags options of the news.daily command. (By default, arrival time is
used, and "-p" switches to the actual date in Date header fields).
If it is after, then it is a bit more complicated...
Your history file in <pathdb> contains lines with several timestamps.
You can find the storage tokens of articles arrived after March, 1st
with the following commands:
% convdate -n '01 Mar 2023 00:00 +0000'
1677628800
% perl -ne 'chomp; our ($hash, $timestamps, $_) = split " "; my
($arrived, $expires, $posted) = split("~", $timestamps); print "$_\n" if
$_ and $arrived >= 1677628800' history
Pipe the result to "sm -r" to remove these articles.
Hello,
Once you have edited the inn-secrets.conf file (<https://www.eyrie.org/~eagle/software/inn/docs/inn-secrets.conf.html>) and the "cancels" group, each article posted on your server will have:
...
For me, Gencancel is the only tool to do the job, but as I am not an INN expert, maybe Julien or Russ can give you more information (even if I think the documentation is written well enough).
Have a nice day.
Franck
For me, Gencancel is the only tool to do the job
su news -s /bin/sh -c "/usr/local/news/bin/gencancel -n '<newsgroup>' '<msg-id>'" | su news -s /bin/sh -c "/usr/local/news/bin/inews -h -P -D"
However, using the above command without the -D option gives:
inews: warning: What server?
inews: article will be spooled
Any ideas what goes wrong?
Eli, is there something unclear in the gencancel documentation that
should be improved? If that's the case, what should be written and where?
inews tries to connect to the server set in the "server" parameter in inn.conf. I guess this parameter is unset.
I'll add the name of the parameter in the inews manual page. It
currently just mentions "inews sends the article to the local news
server as specified in inn.conf".
And inews is not listed in the names of the programs for which the
"server" parameter is used in inn.conf... Also added.
It was the 'server' setting indeed.
I'm getting close, but now inews is trying to connect using the
IPv6 address of the server, instead of IPv4.
I'll have to figure out how to make a workaround for this, especially since I'm using xinetd for nnrpd.
Hi Eli,
It was the 'server' setting indeed.
I'm getting close, but now inews is trying to connect using the
IPv6 address of the server, instead of IPv4.
Did you try to just put the IPv4 address of your server in the "server" setting? (instead of its hostname)
It may work (I have not tested it).
I have another question about the settings in expire.ctl.
What settings should I use if I want to delete all posts older than 90 days for 1 specific newsgroup (e.g. linux.debian.bugs.dist).
As far as I can read in the documentation, this mainly concerns the middle field (the default value), and the first and last fields (keep and purge) are only important for messages that have an Expires header. But if I set the <purge> field lower than the <default> field, inncheck still throws a warning.
So it seems that the <keep> and <purge> fields still affect the <default> value.
linux.debian.bugs.dist:AX:0:90:11
inncheck returns: purge `11' younger than default `90'
linux.debian.bugs.dist:AX:0:90:90
seems good.
What I would like is to have all messages older than 90 days deleted immediately, and messages with an Expires header deleted immediately after the expiration date.
With 0:90:11, suppose that you have an article with an Expires header
field corresponding to 30 days, it would be deleted after 11 days which
is not what you were expecting. That's the reason for the warning from inncheck; it looks unusual to force the deletion of articles with an
Expires header field sooner than other articles. (The date in the
Expires header field will still be respected.)
The nnrpd manual states:
"As each command is received, nnrpd tries to change its "argv" array so that ps(1) will print out the command being executed."
This will then look like this:
nnrpd: <xxx.xxx.xxx.xxx> GROUP
nnrpd: <xxx.xxx.xxx.xxx> XOVER
Is it perhaps also possible to add the authenticated user to this?
Something like:
nnrpd: <xxx.xxx.xxx.xxx> Eli GROUP
nnrpd: <xxx.xxx.xxx.xxx> Eli XOVER
This would make it possible to limit the number of connections per user via a perl script.
Hi Eli,
The nnrpd manual states:
"As each command is received, nnrpd tries to change its "argv" array so that >> ps(1) will print out the command being executed."
This will then look like this:
nnrpd: <xxx.xxx.xxx.xxx> GROUP
nnrpd: <xxx.xxx.xxx.xxx> XOVER
Is it perhaps also possible to add the authenticated user to this?
Something like:
nnrpd: <xxx.xxx.xxx.xxx> Eli GROUP
nnrpd: <xxx.xxx.xxx.xxx> Eli XOVER
This would make it possible to limit the number of connections per user via a
perl script.
It is indeed possible to use that "feature".
If you can rebuild INN from sources, just change the following command
in nnrpd/nnrpd.c:
- setproctitle("%s %s", Client.host, av[0]);
+
+ setproctitle("%s %s %s", Client.host,
+ PERMuser[0] != '\0' ? PERMuser : "-", av[0]);
I am unsure if this would be worth having in an official release; there
may be privacy concerns. Maybe it should be configurable with a
readers.conf option (like addprocesstitleuser which would enable that behaviour when set to true in an access group).
I am unsure if this would be worth having in an official release; there
may be privacy concerns. Maybe it should be configurable with a
readers.conf option (like addprocesstitleuser which would enable that
behaviour when set to true in an access group).
I don't know if many people will use this feature, but it is nice if INN supports it. Making it configurable is a good idea.
About pullnews:
In pullnews I use the '-w -1000000' option to download the 1 million most recent articles per newsgroup.
This works fine, but when pullnews is restarted (for example after a server timeout), pullnews will redownload all already downloaded articles.
It does come with the message that the articles already exist, but when there are almost a million per newsgroup, it is not very pleasant.
It would be nice if pullnews continued downloading where it left off; for example, only if the high-water mark is > 0.
Could you improve this in pullnews?
About pullnews:
In pullnews I use the '-w -1000000' option to download the 1 million most
recent articles per newsgroup.
This works fine, but when pullnews is restarted (for example after a server timeout), pullnews will redownload all already downloaded articles.
It does come with the message that the articles already exist, but when there
are almost a million per newsgroup, it is not very pleasant.
It would be nice if pullnews continued downloading where it left off. For
example, only if the high water mark >0.
Could you improve this on pullnews?
I assume you do not use '-w -1000000' for another run of pullnews (manually or out of cron); otherwise it would be normal for it to download a million articles again.
But I modified pullnews with the line below and this seems to work fine:
    if (defined $watermark) {
        printf LOG "\tOur previous highest: %d\n", $prevHigh if not $quiet;
        $high = $watermark;
        $high = $last + $watermark if substr($watermark, 0, 1) eq '-';
+       $high = $prevHigh if $prevHigh > 0;
        $high = 0 if $high < 0;
        $shash->{$group} = [time, $high];
    }
[news@spool1 ~]$ cat pullnews4.marks
[news@spool1 ~]$
Could you try to add an explicit error message?
close(FILE) or die "can't close $groupFile: $!\n";
Sure, it may be a few days before I reply back with results!
I really have no idea how much data these newsgroups take up.
From 1982 to 1991 is 191 tapes or 10 gigs, so multiply that and you'll...
Depends on how much spam and low-value articles you can filter out.
20-30GB per year is comfortable.
You can do with way less, if you have a curated list of groups and a good spam filter.
I am starting a server later with 4 TB (that is the only spare HDD I have).
Storage capacity wise, I've got 20 years of the Big8 consuming ~750GB.
On a server with ZFS using CNFS buffers with INN, this can compress down
to about 300GB using default ZFS compression.
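For anyone replicating that setup, turning compression on and checking what it achieves is a one-liner each (the dataset name is an example):
zfs set compression=on tank/news/spool
zfs get compressratio tank/news/spool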