• New Usenet Archive

    From Jason Evans@21:1/5 to All on Mon Feb 7 14:05:36 2022
    Hi all,

    For the past month, I have been downloading and sorting Usenet archives from
    a news server (with their permission) of everything from 2003 until today.
    My next step is to decide how to upload them to archive.org.

    Here is the current archive that runs from the 80's and 90's until around
    2003: https://archive.org/details/usenethistorical

    Each newsgroup hierarchy has its own entry. I'm thinking about something
    different, and I want your input on how to do that.

    Here is my plan. The following newsgroup hierarchies will have their own
    entries:

    Big-8:
    comp
    sci
    news
    misc
    talk
    humanities
    soc

    uk

    de

    alt will be broken down into subgroups because it's so huge.

    alt-a-e
    alt-f-j
    alt-k-o
    alt-p-t
    alt-u-z

    For example, alt.folklore.computers would be found in alt-f-j.

    The rest of the hierarchies will be grouped together since they are
    generally smaller and more likely to be nothing but spam.

    Misc Newsgroup hierarchies-a-e
    Misc Newsgroup hierarchies-f-j
    Misc Newsgroup hierarchies-k-o
    Misc Newsgroup hierarchies-p-t
    Misc Newsgroup hierarchies-u-z
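
    The grouping rule above can be sketched as a small function. This is only
    an illustration: the names `STANDALONE`, `RANGES`, `letter_range` and
    `archive_entry` are hypothetical, and sending names that start with a
    digit or symbol to the first range is an assumption, not part of the plan.

```python
# Hierarchies that get their own archive.org entry, as listed in the plan.
STANDALONE = {"comp", "sci", "news", "misc", "talk", "humanities", "soc", "uk", "de"}

# Letter ranges used for both the alt-* and Misc-* entries.
RANGES = ["a-e", "f-j", "k-o", "p-t", "u-z"]


def letter_range(name: str) -> str:
    """Return the range that contains the first character of `name`."""
    first = name[:1].lower()
    for rng in RANGES:
        low, high = rng.split("-")
        if low <= first <= high:
            return rng
    # Assumption: names starting with a digit or symbol go into the first range.
    return RANGES[0]


def archive_entry(newsgroup: str) -> str:
    """Map a newsgroup name to the archive.org entry it would be filed under."""
    parts = newsgroup.lower().split(".")
    top = parts[0]
    if top in STANDALONE:
        return top
    if top == "alt":
        # alt is split by the first letter of the second name component.
        second = parts[1] if len(parts) > 1 else ""
        return f"alt-{letter_range(second)}"
    return f"Misc Newsgroup hierarchies-{letter_range(top)}"


print(archive_entry("alt.folklore.computers"))  # alt-f-j
```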

    These are questions to you folks:

    1. Does this make sense, or would breaking everything down by individual
       hierarchy be better?

    2. If I do it this way, are there any other hierarchies that should stand
       alone rather than being grouped with the misc. groups?

    One final note. In case you're wondering, I am not archiving any binary
    groups or any group that I think could get deleted because of extremely
    distasteful subject matter. I think you can get the gist of what I mean.
    Everything else is here. Even the stupid spammy revenge froups.

    Jason

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Adam H. Kerman@21:1/5 to Jason Evans on Mon Feb 7 16:03:05 2022
    Jason Evans <jsevans@mailfence.com> wrote:

    > For the past month, I have been downloading and sorting Usenet archives
    > from a news server (with their permission) of everything from 2003 until
    > today. My next step is to decide how to upload them to archive.org.

    So you'd be relying upon their indexing and its likely inability to tell
    the difference between the article body, the .sig, and headers?

    We've already got that. Google indexed Usenet articles as if they were
    posted on the Web in the first place as the lousy Google Groups Web
    interface was treated like a real Web page. Within Google Groups itself,
    searching became seriously hideous because Google stopped devoting staff
    resources to making sure the indexes were being maintained. The indexing
    services weren't great but they were better than what they became.

    An extremely serious problem with Google Groups indexing of the article
    body, when it was working, was it didn't do a great job distinguishing
    between the author's own text and the quoted text if it was a followup.

    Usenet archives lack decent indexes. Is there a way for you to upload a
    very small archive, then work on the indexing and presentation of the
    articles so it in some way resembles walking the thread tree? Can the
    index be developed along with the archive, and then tested tested tested
    to avoid another Google Groups?
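
    As a rough illustration of "walking the thread tree": a thread can be
    rebuilt from each article's Message-ID and References headers. The
    sketch below assumes the articles have already been reduced to
    (message_id, references) pairs; `build_thread_tree` and `walk` are
    hypothetical names, and real archives would also need to handle missing
    or broken References headers more carefully.

```python
from collections import defaultdict


def build_thread_tree(articles):
    """Group articles into threads.

    `articles` is a list of (message_id, references_header) pairs.
    Returns (roots, children), where children maps a Message-ID to its replies.
    """
    known = {mid for mid, _ in articles}
    children = defaultdict(list)
    roots = []
    for mid, refs in articles:
        # The parent is the last referenced article that is present in the archive.
        parent = next(
            (r for r in reversed(refs.split()) if r in known and r != mid), None
        )
        if parent is None:
            roots.append(mid)
        else:
            children[parent].append(mid)
    return roots, dict(children)


def walk(roots, children, depth=0):
    """Yield (depth, message_id) in depth-first thread order."""
    for mid in roots:
        yield depth, mid
        yield from walk(children.get(mid, []), children, depth + 1)


articles = [
    ("<a@x>", ""),
    ("<b@x>", "<a@x>"),
    ("<c@x>", "<a@x> <b@x>"),
]
roots, children = build_thread_tree(articles)
for depth, mid in walk(roots, children):
    print("  " * depth + mid)
```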

    . . .

    > One final note. In case you're wondering, I am not archiving any binary
    > groups or any group that I think could get deleted because of extremely
    > distasteful subject matter. I think you can get the gist of what I mean.
    > Everything else is here. Even the stupid spammy revenge froups.

    Are you literally saying that you're archiving cancellable spam and
    those various smaller-scale attacks on Usenet with articles uploaded by
    the thousands from anonymizing servers that aren't preventing abuse?

    Revenge froups weren't any more spammy than any other part of Usenet.
    Spam is spam regardless of the newsgroup.

  • From Jason Evans@21:1/5 to Adam H. Kerman on Mon Feb 7 19:14:27 2022
    Adam H. Kerman wrote:

    > So you'd be relying upon their indexing and its likely inability to tell
    > the difference between the article body, the .sig, and headers?
    >
    > We've already got that. Google indexed Usenet articles as if they were
    > posted on the Web in the first place as the lousy Google Groups Web
    > interface was treated like a real Web page. Within Google Groups itself,
    > searching became seriously hideous because Google stopped devoting staff
    > resources to making sure the indexes were being maintained. The indexing
    > services weren't great but they were better than what they became.


    There are two differences between what I'm doing and what Google is doing.

    First, I am archiving the raw source articles in the same format as the
    ones already on archive.org: plain-text MBOX files. If you're doing
    research, download the newsgroup that you want and let your mail client,
    or whatever you use for MBOX files, do the heavy lifting when it comes to
    sorting and searching.

    Second, Google no longer provides headers, which are important for
    research. I am providing everything.
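
    For anyone who would rather script this than use a mail client, a
    downloaded archive can be read with Python's standard `mailbox` module.
    This is a minimal sketch: `find_articles` is a hypothetical helper, and
    the path in the usage note is a placeholder, not a real file name from
    the archive.

```python
import mailbox


def find_articles(mbox_path, needle):
    """Return (Message-ID, Date, Subject) for articles whose subject contains `needle`."""
    needle = needle.lower()
    hits = []
    # create=False avoids silently creating an empty file if the path is wrong.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            # str() also covers headers that come back as Header objects.
            subject = str(msg.get("Subject", ""))
            if needle in subject.lower():
                hits.append((msg.get("Message-ID"), msg.get("Date"), subject))
    finally:
        box.close()
    return hits
```

    Calling `find_articles("some.newsgroup.mbox", "archive")` would list the
    matching articles with their full Message-ID and Date headers intact.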

    > An extremely serious problem with Google Groups indexing of the article
    > body, when it was working, was it didn't do a great job distinguishing
    > between the author's own text and the quoted text if it was a followup.
    >
    > Usenet archives lack decent indexes. Is there a way for you to upload a
    > very small archive, then work on the indexing and presentation of the
    > articles so it in some way resembles walking the thread tree? Can the
    > index be developed along with the archive, and then tested tested tested
    > to avoid another Google Groups?

    I don't have the time or energy to create a website to host this stuff
    that would also do a good job of indexing everything. What I'm doing is
    providing the files free of charge to archive.org, so if someone else
    wants to do that, they can.

  • From Thomas Hochstein@21:1/5 to Adam H. Kerman on Mon Feb 7 18:28:54 2022
    Adam H. Kerman wrote:

    > So you'd be relying upon their indexing and its likely inability to tell
    > the difference between the article body, the .sig, and headers?

    AFAIS, <https://archive.org/details/usenethistorical> has just zip'ed
    mbox archives, one per group, with no way to browse, search or index
    anything.

  • From Adam H. Kerman@21:1/5 to Jason Evans on Mon Feb 7 18:49:29 2022
    Jason Evans <jsevans@mailfence.com> wrote:
    > Thomas Hochstein wrote:
    >> Adam H. Kerman wrote:

    >>> So you'd be relying upon their indexing and its likely inability to
    >>> tell the difference between the article body, the .sig, and headers?

    >> AFAIS, <https://archive.org/details/usenethistorical> has just zip'ed
    >> mbox archives, one per group, with no way to browse, search or index
    >> anything.

    > That is exactly what I have. My question is, is it better to have them
    > on archive.org with one entry per hierarchy or to group them like I
    > suggested?

    I didn't mean to volunteer you to perform work you weren't willing to
    do. I apologize for that. My comment, stating the obvious, was pointing
    out what we don't have.

    I don't have an opinion on whether your proposed grouping is better or
    worse.

  • From Adam H. Kerman@21:1/5 to Thomas Hochstein on Mon Feb 7 18:43:45 2022
    Thomas Hochstein <thh@thh.name> wrote:
    > Adam H. Kerman wrote:

    >> So you'd be relying upon their indexing and its likely inability to
    >> tell the difference between the article body, the .sig, and headers?

    > AFAIS, <https://archive.org/details/usenethistorical> has just zip'ed
    > mbox archives, one per group, with no way to browse, search or index
    > anything.

    I saw that they were zipped. Jason stated he's doing something different.

    So if he's merely presenting Usenet articles as text files, or
    digestified somehow but still as text files, I was questioning how he
    was going to rely upon archive.org's own indexing processes.

  • From Jason Evans@21:1/5 to Thomas Hochstein on Mon Feb 7 19:16:21 2022
    Thomas Hochstein wrote:

    > Adam H. Kerman wrote:

    >> So you'd be relying upon their indexing and its likely inability to
    >> tell the difference between the article body, the .sig, and headers?

    > AFAIS, <https://archive.org/details/usenethistorical> has just zip'ed
    > mbox archives, one per group, with no way to browse, search or index
    > anything.

    That is exactly what I have. My question is, is it better to have them on
    archive.org with one entry per hierarchy or to group them like I
    suggested?

  • From Julien ÉLIE@21:1/5 to All on Tue Feb 8 20:46:39 2022
    Hi Jason,

    > Here is the current archive that runs from the 80's and 90's until
    > around 2003: https://archive.org/details/usenethistorical

    As noted by another person (who spoke about that archive in a French
    newsgroup), the encoding of bodies is wrong. All non-ASCII characters
    are munged :-/
    Seen in fr.* and de.*, and I bet it is the same for all hierarchies.

    --
    Julien ÉLIE

    « J'oubliais qu'Assurancetourix a une nouvelle corde à sa harpe ! »
    (Astérix)

  • From Jason Evans@21:1/5 to All on Wed Feb 9 08:07:19 2022
    Julien ÉLIE wrote:


    > Hi Jason,
    >
    >> Here is the current archive that runs from the 80's and 90's until
    >> around 2003: https://archive.org/details/usenethistorical
    >
    > As noted by another person (who spoke about that archive in a French
    > newsgroup), the encoding of bodies is wrong. All non-ASCII characters
    > are munged :-/
    > Seen in fr.* and de.*, and I bet it is the same for all hierarchies.

    Hi Julien,

    This doesn't really answer the question that I asked in my original
    article about organizing Usenet hierarchies for archive.org.

    However, to respond to your comment, I picked this article at random from
    fr.usenet.distribution. This is a screenshot
    (https://pasteboard.co/YA9d6r01LUnP.png), taken in Thunderbird, from one
    of the archives that I created. You can see that the French letters can
    be read correctly because this article is from last year and encoded in
    UTF-8. Even some of the old articles in this particular archive that are
    encoded in iso-8859-15 appear correctly.

    The problem is that when you go back far enough, either plain ASCII or
    some non-standard encoding is used, and then the non-English characters
    are munged. My colleague, Tristan, has been doing some work on this issue
    with Esperanto on the early Usenet.
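
    To illustrate why old articles are harder: when an article declares no
    charset, a reader has to guess. The sketch below tries the declared
    charset first and then a fixed list of fallbacks. `decode_body` is a
    hypothetical helper, and the fallback order (UTF-8, then iso-8859-15) is
    an assumption for illustration; a wrong guess still decodes without
    error and simply shows the wrong characters.

```python
def decode_body(raw, declared_charset=None, fallbacks=("utf-8", "iso-8859-15")):
    """Decode article body bytes, returning (text, charset_used).

    The declared charset is tried first; unknown or failing charsets are skipped.
    """
    candidates = ([declared_charset] if declared_charset else []) + list(fallbacks)
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            # LookupError: Python does not know this charset name.
            continue
    # Last resort: keep the ASCII part and mark everything else as unknown.
    return raw.decode("ascii", errors="replace"), "ascii"


# A body with no declared charset that is not valid UTF-8.
text, used = decode_body(b"\xe9t\xe9")
print(text, used)
```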

    Jason

  • From Julien ÉLIE@21:1/5 to All on Wed Feb 9 18:34:46 2022
    Hi Jason,

    > The problem is that when you go back far enough, either plain ASCII is
    > used or some non-standard encoding and then the non-English characters
    > are munged. My colleague, Tristan, has been doing some work on this when
    > it comes to this issue with Esperanto on the early Usenet.

    Yes, apparently, the problem affects only old archives (from the last
    century or so). When no encoding is specified, non-ASCII characters get
    munged. Thanks for the screenshot and the information that recent
    articles are correctly archived.

    > This doesn't really answer the question that I asked in my original
    > article about organizing Usenet hierarchies for archive.org.

    I don't have a strong opinion about that. I would tend to prefer a
    breakdown by individual hierarchy, as any kind of mixing of
    hierarchies may not be what users want.

    --
    Julien ÉLIE

    « You know what I did before I married? Anything I wanted to. »
