• Encoding madness

    From Nigel Reed@21:1/5 to All on Tue Apr 11 01:44:37 2023
    Hi all,

    I'm trying to sync up the active and newsgroups file from 15 peers and
    it's proving to be a bit of a challenge.

    The first bit is done, which is mainly getting rid of groups that have
    invalid names (those that end in a period, contain illegal characters,
    and the like).
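
A minimal Python sketch of that first cleanup step. The naming rule used here (dot-separated components made of ASCII letters, digits, "+", "-" and "_") is an assumption for illustration, not the exact rule applied above, and the helper names are hypothetical.

```python
import re

# Assumed rule: dot-separated components, each made only of ASCII letters,
# digits, "+", "-" or "_". Adjust to whatever your server really accepts.
_COMPONENT = r"[A-Za-z0-9+_-]+"
_VALID_NAME = re.compile(rf"{_COMPONENT}(?:\.{_COMPONENT})*")


def is_valid_group_name(name: str) -> bool:
    """Return True if the name matches the assumed newsgroup-name pattern."""
    return _VALID_NAME.fullmatch(name) is not None


def filter_active_lines(lines):
    """Keep only active-file lines whose first field is a valid group name."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and is_valid_group_name(fields[0]):
            kept.append(line)
    return kept
```

Names ending in a period or containing an empty component fail the pattern because every component must have at least one character.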

    Next is a little more of a challenge. Trying to sync the descriptions.
    It wouldn't be so bad if everyone used the same encoding, however the
    majority are using ISO-8859-1, a couple are using UTF-8, some using
    ASCII and one is Non-ISO extended-ASCII.

    This becomes a problem when trying to do a diff or other operations
    trying to match group names.
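
One rough way to see which encoding a given peer's file uses is to classify each raw line. This is only a heuristic sketch: the function name is hypothetical, and it cannot tell ISO-8859-1 apart from other 8-bit encodings.

```python
def classify_encoding(raw: bytes) -> str:
    """Roughly classify a raw newsgroups line as ascii, utf-8 or other-8bit."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Text in a legacy 8-bit encoding is rarely valid UTF-8 by accident.
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other-8bit"
```

Lines classified as "other-8bit" still need a per-hierarchy guess or manual inspection before they can be compared with lines from other peers.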

    I know there isn't a standard encoding for the newsgroup file but that
    may have been a bit of an oversight now that some people are trying to
    run a clean server.

    Going forward, maybe the powers that be can get their heads together
    and enforce a certain coding standard for innd (and whatever else is
    out there) that is at least maintained. Personally, I don't care which
    one we end up with, ISO-8859 seems to be the far more popular (7
    servers) followed by ASCII (4) then UTF-8 (3) .

    I guess we'd need all the servers to want to agree and update their
    files accordingly.

    Somehow, I feel I'll be shot down here since it's been like this since
    1986.

    Thoughts?
    --
    End Of The Line BBS - Plano, TX
    telnet endofthelinebbs.com 23

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Kettlewell@21:1/5 to Nigel Reed on Tue Apr 11 08:49:41 2023
    Nigel Reed <sysop@endofthelinebbs.com> writes:
    I know there isn't a standard encoding for the newsgroup file but that
    may have been a bit of an oversight now that some people are trying to
    run a clean server.

    Going forward, maybe the powers that be can get their heads together
    and enforce a certain coding standard for innd (and whatever else is
    out there) that is at least maintained. Personally, I don't care which
    one we end up with, ISO-8859 seems to be the far more popular (7
    servers) followed by ASCII (4) then UTF-8 (3) .

    If there’s going to be a global choice of encoding then it has to be
    UTF-8.

    I guess we'd need all the servers to want to agree and update their
    files accordingly.

    That’s the hard bit...

    --
    https://www.greenend.org.uk/rjk/

  • From Franck@21:1/5 to All on Tue Apr 11 09:51:12 2023
    Hello,

    I'm trying to sync up the active and newsgroups file from 15 peers and
    it's proving to be a bit of a challenge.

    Hmm, I would rather say: "It's a f***ing challenge"!

    Next is a little more of a challenge. Trying to sync the descriptions.

    I know there isn't a standard encoding for the newsgroup file but that
    may have been a bit of an oversight now that some people are trying to
    run a clean server.
    Since my server has a graphical interface that displays group names and descriptions, I need to be able to know which encoding is used.

    To solve the problem, I use a small file that records each hierarchy
    together with the encoding it uses. Either I get the encoding from the
    checkgroups control messages of the hierarchies that mention it
    (notably fr), or I set it by hand.

    This file is very simple. It mentions the hierarchy and the encoding
    used, separated by a TAB.

    Maybe adding an identical file to "active" and "newsgroups" ones would
    do the trick?

    ##
    ## This file lists the charsets used by hierarchies.
    ##
    ## Format: hierarchy<TAB>charset
    cn Big5
    fr UTF-8

    And so on.
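
A small Python sketch of how such a file could be read. The function names and the UTF-8 default for unlisted hierarchies are assumptions for illustration; they are not taken from SNS.

```python
def load_charsets(text: str) -> dict:
    """Parse 'hierarchy<TAB>charset' lines, skipping comments and blanks."""
    charsets = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hierarchy, _, charset = line.partition("\t")
        if charset:
            charsets[hierarchy.strip()] = charset.strip()
    return charsets


def charset_for(group: str, charsets: dict, default: str = "utf-8") -> str:
    """Look up a group's charset by its top-level hierarchy."""
    return charsets.get(group.split(".", 1)[0], default)
```

For example, with the two entries above, `charset_for("fr.comp.lang", table)` gives the charset recorded for fr.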

    ?

    Franck

  • From Franck@21:1/5 to All on Tue Apr 11 10:04:12 2023
    Hello,

    Since my server has a graphical interface that displays group names and descriptions, I need to be able to know which encoding is used.

    It looks like this: https://i.ibb.co/LYcsQVQ/Console.png

  • From Franck@21:1/5 to All on Tue Apr 11 09:52:41 2023
    Hello,

    If there’s going to be a global choice of encoding then it has to be
    UTF-8.

    +1

  • From Julien ÉLIE@21:1/5 to All on Tue Apr 11 12:12:03 2023
    Hi Nigel,

    Next is a little more of a challenge. Trying to sync the descriptions.
    It wouldn't be so bad if everyone used the same encoding, however the majority are using ISO-8859-1, a couple are using UTF-8, some using
    ASCII and one is Non-ISO extended-ASCII.

    FWIW, the descriptions encoded in UTF-8 from the ftp.isc.org newsgroup
    file are here:
    http://usenet.trigofacile.com/hierarchies/data/newsgroups.utf8

    It may facilitate your life :-)

    The conversions I found out to work are:
    - cn.* and han.* are encoded in gb18030;
    - fido7.*, medlux.* and relcom.* in koi8-r;
    - ukr.* in koi8-u;
    - nctu.*, ncu.* and tw.* in big5;
    - scout.forum.chinese and scout.forum.korean in big5;
    - eternal-september.*, fido.* and fr.* in utf-8;
    - all the others fit well in cp1252.
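
The list above can be turned into a lookup table. The following Python sketch re-encodes one raw newsgroups line as UTF-8; the function and variable names are hypothetical, and undecodable bytes are replaced rather than rejected, which is a choice made here for illustration.

```python
# Encodings per hierarchy, taken from the list above; anything not listed
# falls back to cp1252.
HIERARCHY_ENCODINGS = {
    "cn": "gb18030", "han": "gb18030",
    "fido7": "koi8-r", "medlux": "koi8-r", "relcom": "koi8-r",
    "ukr": "koi8-u",
    "nctu": "big5", "ncu": "big5", "tw": "big5",
    "eternal-september": "utf-8", "fido": "utf-8", "fr": "utf-8",
}
# Individual groups that differ from the rest of their hierarchy.
GROUP_ENCODINGS = {
    "scout.forum.chinese": "big5",
    "scout.forum.korean": "big5",
}


def encoding_for(group: str) -> str:
    """Return the encoding assumed for a group's description."""
    if group in GROUP_ENCODINGS:
        return GROUP_ENCODINGS[group]
    return HIERARCHY_ENCODINGS.get(group.split(".", 1)[0], "cp1252")


def line_to_utf8(raw: bytes) -> bytes:
    """Re-encode one raw 'group<whitespace>description' line as UTF-8."""
    if not raw.strip():
        return raw
    name = raw.split(None, 1)[0].decode("ascii", errors="replace")
    text = raw.decode(encoding_for(name), errors="replace")
    return text.encode("utf-8")
```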


    Going forward, maybe the powers that be can get their heads together
    and enforce a certain coding standard for innd (and whatever else is
    out there) that is at least maintained.

    UTF-8 is the expected encoding for the descriptions returned by a LIST NEWSGROUPS command in the NNTP protocol.

    --
    Julien ÉLIE

    « Et maintenant, la balle est dans le camp des slalomeurs. »

  • From Franck@21:1/5 to All on Tue Apr 11 14:34:50 2023
    Hi Julien.

    FWIW, the descriptions encoded in UTF-8 from the ftp.isc.org newsgroup
    file are here:
      http://usenet.trigofacile.com/hierarchies/data/newsgroups.utf8

    And you're only telling us now!?!?!? ;-)

    Why not put it at ftp.isc.org?

    I had looked on your site, notably at the List of Usenet public managed
    hierarchies, but I had not found this list, which is why I coded it in
    SNS. Maybe I did not look carefully enough...

    It may facilitate your life :-)

    I think so, I'll use it instead of the one listed at ftp.isc.org and
    will remove some lines of code in SNS.

    Franck

  • From Adam W.@21:1/5 to iulius@nom-de-mon-site.com.invalid on Tue Apr 11 15:13:09 2023
    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> wrote:

    The conversions I found out to work are:

    In Poland (pl.*) we traditionally used iso-8859-2 for posting (now I think utf-8 has become a de facto standard, but iso-8859-2 is still accepted),
    but I can see that group descriptions for pl.* are just transliterated
    (there are no national characters used, all are 7-bit, or us-ascii).

    The main question is if currently used readers can handle utf-8 in group descriptions. If yes, I'd stick with utf-8. If not, then I think it would
    be safest to transliterate the descriptions to us-ascii (if it can be done
    for all encodings; for Polish national characters it's perfectly fine, but
    I don't know how it works with non-Latin alphabets like Russian or
    Chinese).

  • From Olivier Miakinen@21:1/5 to All on Tue Apr 11 17:38:16 2023
    Hello Adam,

    On 11/04/2023 17:15, Adam H. Kerman wrote:

    If there’s going to be a global choice of encoding then it has to be UTF-8.

    ASCII's advantage over UTF-8 is its universality.

    I would have said exactly the opposite : UTF-8's advantage over ASCII
    is its universality, because UTF-8 can express any character from any
    language.

    But of course ASCII's advantage over UTF-8 is that it is recognized by
    all Usenet software.

    [...]

    Gee whiz. The ASCII apostrophe was used ambiguously as single close quote
    AND as the combining diacritical mark for the acute accent since 1967.

    Oh? Which software does that weird thing? Surely it is not a standard use
    of ASCII.


    --
    Olivier Miakinen

  • From Adam H. Kerman@21:1/5 to Richard Kettlewell on Tue Apr 11 15:15:50 2023
    Richard Kettlewell <invalid@invalid.invalid> wrote:
    Nigel Reed <sysop@endofthelinebbs.com> writes:

    I know there isn't a standard encoding for the newsgroup file but that
    may have been a bit of an oversight now that some people are trying to
    run a clean server.

    Going forward, maybe the powers that be can get their heads together
    and enforce a certain coding standard for innd (and whatever else is
    out there) that is at least maintained. Personally, I don't care which
    one we end up with, ISO-8859 seems to be the far more popular (7
    servers) followed by ASCII (4) then UTF-8 (3) .

    If there’s going to be a global choice of encoding then it has to be
    UTF-8.

    ASCII's advantage over UTF-8 is its universality.

    You yourself just used the UTF-8 character code for single close quote ambiguously as an apostrophe. The character has been used ambiguously
    in the current version of Unicode, replacing another character code that
    was used to indicate a glottal stop as a letter modifier.

    Gee whiz. The ASCII apostrophe was used ambiguously as single close quote
    AND as the combining diacritical mark for the acute accent since 1967.
    Where is the UTF-8 advantage if there continues to be ambiguously-used character codes for such common punctuation marks?

    If there's going to be a global choice, then stop using UTF-8 character
    codes to substitute for ASCII in plain text communication. Use open and
    close single and double quotes ONLY in typography, not email and not
    Usenet. This thwarts communication. It makes a difference as ASCII is
    universal and UTF-8 is not.

    I'm pointing out again that ASCII had combining characters but it didn't include all possible diacritical marks like umlaut, but it has acute,
    grave, circumflex, tilde, slash, cedilla, and I'm sure I've forgotten.

    Teletypewriters could perform the combining action with a backspace/
    overstrike sequence but terminals didn't usually display them.

    I guess we'd need all the servers to want to agree and update their
    files accordingly.

    That’s the hard bit...

  • From Adam H. Kerman@21:1/5 to Olivier Miakinen on Tue Apr 11 16:44:25 2023
    Olivier Miakinen <om+news@miakinen.net> wrote:
    04/11/2023 17:15, Adam H. Kerman wrote:

    If there’s going to be a global choice of encoding then it has to be UTF-8.

    ASCII's advantage over UTF-8 is its universality.

    I would have said exactly the opposite : UTF-8's advantage over ASCII
    is its universality, because UTF-8 can express any character from any language.

    Well, yes, if one's set up displays UTF-8, but every setup can use
    ASCII.

    But of course ASCII's advantage over UTF-8 is that it is recognized by
    all Usenet software.

    [...]

    Gee whiz. The ASCII apostrophe was used ambiguously as single close quote AND as the combining diacritical mark for the acute accent since 1967.

    Oh? Which software does that weird thing? Surely it is not a standard use
    of ASCII.

    ASCII was the 7-bit encoding used for teletypewriters, an improvement
    over 5-bit Baudot code. Backspace/overstrike sequences were the way
    diacritic marks were combined with the alphabetic character on a teletypewriter. No, generally this wasn't implemented in computer
    software.

    But the notion that ASCII wasn't intended for Latin alphabets beyond
    English used in America was always wrong.

  • From Nigel Reed@21:1/5 to iulius@nom-de-mon-site.com.invalid on Tue Apr 11 11:43:57 2023
    On Tue, 11 Apr 2023 12:12:03 +0200
    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> wrote:


    Going forward, maybe the powers that be can get their heads together
    and enforce a certain coding standard for innd (and whatever else is
    out there) that is at least maintained.

    UTF-8 is the expected encoding for the descriptions returned by a
    LIST NEWSGROUPS command in the NNTP protocol.

    That's good to know. I've been converting everything to fit into my
    newsgroups file which is ISO-8859 so it looks like I've been going the
    wrong way. Back to the drawing board now that my scripts are almost
    done lol.



    --
    End Of The Line BBS - Plano, TX
    telnet endofthelinebbs.com 23

  • From Richard@21:1/5 to All on Tue Apr 11 17:07:40 2023
    [Please do not mail me a copy of your followup]

    Richard Kettlewell <invalid@invalid.invalid> spake the secret code <wwvh6tmg8oa.fsf@LkoBDZeT.terraraq.uk> thusly:

    If there's going to be a global choice of encoding then it has to be
    UTF-8.

    You mean, so you can use a gratuitously fancy apostrophe character
    instead of the ASCII ' character that serves exactly the same purpose
    with fewer problems?

    UTF-8 is great for non-Latin codepoints like Asian languages and
    Klingon.

    Where UTF-8 fails is in using fancy codepoints for the functional
    equivalent of the same ASCII character.
    --
    "The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
    The Terminals Wiki <http://terminals-wiki.org>
    The Computer Graphics Museum <http://computergraphicsmuseum.org>
    Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

  • From Tom Furie@21:1/5 to Nigel Reed on Tue Apr 11 17:11:21 2023
    On 2023-04-11, Nigel Reed <sysop@endofthelinebbs.com> wrote:
    On Tue, 11 Apr 2023 12:12:03 +0200
    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> wrote:

    UTF-8 is the expected encoding for the descriptions returned by a
    LIST NEWSGROUPS command in the NNTP protocol.

    That's good to know. I've been converting everything to fit into my newsgroups file which is ISO-8859 so it looks like I've been going the
    wrong way. Back to the drawing board now that my scripts are almost
    done lol.

    At least the conversion from ISO-8859 to UTF-8 will be much more straightforward than the conversion from <whatever encoding> to ISO-8859
    ;)
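
To illustrate why that direction is the easy one, here is a short Python sketch (function names are hypothetical). Every byte value is a valid ISO-8859-1 character, so decoding cannot fail, whereas the reverse conversion raises an error as soon as a character falls outside ISO-8859-1.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8; decoding never raises."""
    return raw.decode("iso-8859-1").encode("utf-8")


def utf8_to_latin1(raw: bytes) -> bytes:
    """Raises UnicodeEncodeError for characters outside ISO-8859-1."""
    return raw.decode("utf-8").encode("iso-8859-1")
```

Note that "never raises" only means the conversion is mechanical; if the input was not really ISO-8859-1, the output will still be wrong text.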

    Cheers,
    Tom

  • From Adam H. Kerman@21:1/5 to Russ Allbery on Tue Apr 11 17:32:03 2023
    Russ Allbery <eagle@eyrie.org> wrote:
    Nigel Reed <sysop@endofthelinebbs.com> writes:

    Going forward, maybe the powers that be can get their heads together and enforce a certain coding standard for innd (and whatever else is out
    there) that is at least maintained. Personally, I don't care which one
    we end up with, ISO-8859 seems to be the far more popular (7 servers) followed by ASCII (4) then UTF-8 (3) .

    It's been on my list for years to encode the ftp.isc.org newsgroups file uniformly in UTF-8, which I think is a prerequisite for enforcing
    something in innd, but it's a bunch of tedious work and I haven't found
    the time yet.

    If you do that, may I request that ASCII equivalents be substituted for
    UTF-8 punctuation in brief descriptions? Pretty please?

  • From Russ Allbery@21:1/5 to Nigel Reed on Tue Apr 11 10:19:08 2023
    Nigel Reed <sysop@endofthelinebbs.com> writes:

    Going forward, maybe the powers that be can get their heads together and enforce a certain coding standard for innd (and whatever else is out
    there) that is at least maintained. Personally, I don't care which one
    we end up with, ISO-8859 seems to be the far more popular (7 servers) followed by ASCII (4) then UTF-8 (3) .

    It's been on my list for years to encode the ftp.isc.org newsgroups file uniformly in UTF-8, which I think is a prerequisite for enforcing
    something in innd, but it's a bunch of tedious work and I haven't found
    the time yet.

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Russ Allbery@21:1/5 to Adam W. on Tue Apr 11 10:20:32 2023
    gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.) writes:

    The main question is if currently used readers can handle utf-8 in group descriptions. If yes, I'd stick with utf-8. If not, then I think it would
    be safest to transliterate the descriptions to us-ascii (if it can be done for all encodings;

    It definitely cannot. It's rare to find a language where that can be done without losing information (only in Europe, essentially).

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Russ Allbery@21:1/5 to Adam H. Kerman on Tue Apr 11 12:47:41 2023
    "Adam H. Kerman" <ahk@chinet.com> writes:

    If you do that, may I request that ASCII equivalents be substituted for
    UTF-8 punctuation in brief descriptions? Pretty please?

    The goal of all of that machinery is that the hierarchy administrators
    should be canonical for the newsgroups entries for their hierarchy.
    Encoding is one of those things where we need to standardize in order to,
    say, comply with the NNTP standard, but I'm not willing to make any other editorial judgments because it gets into too much annoying work. So this
    is something you should take up with the hierarchy administrators.

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Adam H. Kerman@21:1/5 to Russ Allbery on Tue Apr 11 20:02:38 2023
    Russ Allbery <eagle@eyrie.org> wrote:
    "Adam H. Kerman" <ahk@chinet.com> writes:

    If you do that, may I request that ASCII equivalents be substituted for UTF-8 punctuation in brief descriptions? Pretty please?

    The goal of all of that machinery is that the hierarchy administrators
    should be canonical for the newsgroups entries for their hierarchy.
    Encoding is one of those things where we need to standardize in order to, say, comply with the NNTP standard, but I'm not willing to make any other editorial judgments because it gets into too much annoying work. So this
    is something you should take up with the hierarchy administrators.

    I apologize for suggesting additional programming work for you. I change
    my request to asking for an amendment to your README in which you might
    urge a proponent or hierarchy administrator not to use UTF-8 punctuation
    for which ASCII punctuation would suffice, to avoid needlessly turning a description into UTF-8.

  • From news@zzo38computer.org.invalid@21:1/5 to Nigel Reed on Tue Apr 11 15:00:29 2023
    Nigel Reed <sysop@endofthelinebbs.com> wrote:
    This becomes a problem when trying to do a diff or other operations
    trying to match group names.

    My opinion is that newsgroup names should be purely ASCII (there are many benefits to this, and using non-ASCII characters in newsgroup names and
    domain names and commands and configuration files can cause many problems, including security issues (especially if any Unicode-based encoding is
    used; non-Unicode has less security issues, but still is not worth it to
    use non-ASCII in these cases), comparisons, input, etc).

    Descriptions for non-English newsgroups, and non-English articles in such newsgroups, should probably use the appropriate encodings for those
    languages, rather than ASCII only (it is not as problematic to use
    non-ASCII characters in newsgroup descriptions, and there are clearly
    benefits to doing so in some cases, so it should be permitted in
    descriptions and in articles; however, if the text is English only then
    it is almost always more beneficial to stay to ASCII only, I think).

    Going forward, maybe the powers that be can get their heads together
    and enforce a certain coding standard for innd (and whatever else is
    out there) that is at least maintained. Personally, I don't care which
    one we end up with, ISO-8859 seems to be the far more popular (7
    servers) followed by ASCII (4) then UTF-8 (3) .

    Well, fortunately, ISO-8859-1 and UTF-8 are both supersets of ASCII, so
    if you use ASCII as much as possible then it will still work. (But, I
    really hate Unicode; it is full of problems, including Han unification
    and other complications; and it is a stateful character set even though
    the encoding is stateless. TRON character code is better in some ways (especially for Japanese text), and I have done some work using this.)

    However, also, enforcing a certain coding standard (regardless of what it
    might be, whether it is Unicode or TRON or something else) can be a problem when you will need other encodings for a reason not previously known by
    whoever enforced them. Making recommendations can be helpful though, but I think that ASCII should be used when possible, and in some contexts (e.g.
    the names of the commands, etc, in the computer programming) should be
    required to be ASCII only.

    It might also be worth to mention what character encodings it uses in the CAPABILITIES, on servers where that is applicable. (The RFC says that it
    should be UTF-8, but I think that this is a mistake in the design of the protocol. Capabilities and commands should be pure ASCII, but this should
    not mean that any text in articles, descriptions, MOTD, etc has to be pure ASCII; it can use other character sets, including the possibility of ones
    which might be incompatible with Unicode, and including TRON codes too.)

    Many people, they just want to put Unicode in everything, without actually understanding Unicode or international text or security or anything else,
    and this just makes a mess (especially since Unicode itself is messy, but
    even if using something else, just putting it in without any consideration, does not substitute for actual understanding). So, don't do that, please.

    (And, for Usenet client programs intended for PC (if using DOS or other text-mode programs), use of PC character set may be beneficial.)

    (I run a NNTP server with my own newsgroups, which are not (currently) considered part of Usenet, and currently have no need for non-ASCII descriptions, but in future if it does, then I will consider what to do. However, I also don't use INN, anyways.)

    --
    Don't laugh at the moon when it is day time in France.

  • From Adam W.@21:1/5 to Russ Allbery on Tue Apr 11 21:35:15 2023
    Russ Allbery <eagle@eyrie.org> wrote:

    It definitely cannot. It's rare to find a language where that can be done without losing information (only in Europe, essentially).

    Well, to be honest, you lose some information, but it's very rare and can usually be deduced from context.

  • From Adam H. Kerman@21:1/5 to Adam W. on Tue Apr 11 22:26:11 2023
    Adam W. <gof-cut-this-news@cut-this-chmurka.net.invalid> wrote:
    Russ Allbery <eagle@eyrie.org> wrote:

    It definitely cannot. It's rare to find a language where that can be done without losing information (only in Europe, essentially).

    Well, to be honest, you lose some information, but it's very rare and can usually be deduced from context.

    In a language that doesn't use the Latin alphabet? C'mon.

  • From Julien ÉLIE@21:1/5 to All on Wed Apr 12 10:17:18 2023
    Hi all,

    The RFC says that it
    should be UTF-8, but I think that this is a mistake in the design of the protocol. Capabilities and commands should be pure ASCII, but this should
    not mean that any text in articles, descriptions, MOTD, etc has to be pure ASCII; it can use other character sets, including the possibility of ones which might be incompatible with Unicode, and including TRON codes too.

    We need an interoperable way to provide texts.
    Please note RFC 2277 (BCP 18) about charsets:

    Protocols MUST be able to use the UTF-8 charset, which consists of
    the ISO 10646 coded character set combined with the UTF-8 character
    encoding scheme, as defined in [10646] Annex R (published in
    Amendment 2), for all text.

    Protocols MAY specify, in addition, how to use other charsets or
    other character encoding schemes for ISO 10646, such as UTF-16, but
    lack of an ability to use UTF-8 is a violation of this policy; such a
    violation would need a variance procedure ([BCP9] section 9) with
    clear and solid justification in the protocol specification document
    before being entered into or advanced upon the standards track.

    For existing protocols or protocols that move data from existing
    datastores, support of other charsets, or even using a default other
    than UTF-8, may be a requirement. This is acceptable, but UTF-8
    support MUST be possible.


    (I run a NNTP server with my own newsgroups, which are not (currently) considered part of Usenet, and currently have no need for non-ASCII descriptions, but in future if it does, then I will consider what to do. However, I also don't use INN, anyways.)

    FWIW, INN does not enforce UTF-8 in the descriptions of newsgroups. You
    can use any encoding you want for them.

    --
    Julien ÉLIE

    « Ils ont refusé une offre de Normand ?!? » (Astérix)

  • From Adam W.@21:1/5 to Adam H. Kerman on Wed Apr 12 12:06:06 2023
    Adam H. Kerman <ahk@chinet.com> wrote:

    Well, to be honest, you lose some information, but it's very rare and can usually be deduced from context.

    In a language that doesn't use the Latin alphabet? C'mon.

    No, I'm only talking about Polish.

  • From Adam H. Kerman@21:1/5 to Julien on Wed Apr 12 12:41:07 2023
    Julien <iulius@nom-de-mon-site.com.invalid> wrote:

    . . .

    FWIW, INN does not enforce UTF-8 in the descriptions of newsgroups. You
    can use any encoding you want for them.

    The newgroup or checkgroups messages could have MIME headers specifying
    the character set but these won't survive processing, so a big text file
    will have multiple unspecified encodings. Aargh.
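
If the original control message is still at hand, its declared charset can be read before the information is lost. A minimal Python sketch using the standard `email` package follows; the function name and the UTF-8 fallback are assumptions, and it only handles single-part messages.

```python
from email import message_from_bytes


def body_text(raw_message: bytes, fallback: str = "utf-8") -> str:
    """Decode a single-part message body using its declared MIME charset."""
    msg = message_from_bytes(raw_message)
    # get_content_charset() returns None when no charset parameter is present.
    charset = msg.get_content_charset() or fallback
    payload = msg.get_payload(decode=True)  # bytes, transfer encoding undone
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # The declared charset name is unknown to Python; use the fallback.
        return payload.decode(fallback, errors="replace")
```

The decoded text could then be written out as UTF-8, so that the resulting newsgroups file has one known encoding.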

    Just stating the obvious here.

  • From Adam H. Kerman@21:1/5 to Adam W. on Wed Apr 12 12:42:04 2023
    Adam W. <gof-cut-this-news@cut-this-chmurka.net.invalid> wrote:
    Adam H. Kerman <ahk@chinet.com> wrote:

    Well, to be honest, you lose some information, but it's very rare and can usually be deduced from context.

    In a language that doesn't use the Latin alphabet? C'mon.

    No, I'm only talking about Polish.

    You are. Russ wasn't.

  • From Russ Allbery@21:1/5 to Adam H. Kerman on Wed Apr 12 07:43:17 2023
    "Adam H. Kerman" <ahk@chinet.com> writes:
    Adam W. <gof-cut-this-news@cut-this-chmurka.net.invalid> wrote:
    Adam H. Kerman <ahk@chinet.com> wrote:

    Well, to be honest, you lose some information, but it's very rare and
    can usually be deduced from context.

    In a language that doesn't use the Latin alphabet? C'mon.

    No, I'm only talking about Polish.

    You are. Russ wasn't.

    Yeah, but I understood Adam was only talking about Polish in his reply.

    It's common in a lot of European languages to be able to transliterate to
    ASCII using various schemes without losing *too* much information. German
    has a standard scheme, some Scandinavian languages have an old scheme that
    used to be used when it was hard to find anything other than ASCII, some
    other European languages are still comprehensible if all the diacritic
    marks are stripped even though it looks weird, etc. Apparently Polish is
    one of those (I know very little about Polish, sadly).

    Arguably, English is itself a case of being able to transliterate to ASCII without losing too much information, depending on how you feel about the correct spelling of Zoë and naïve (I think everyone but the New Yorker has given up on coöperate), or how much you care about reproducing English
    poetry containing words like learnèd.

    But if one gets too far beyond Europe, or even farther into eastern Europe
    and non-Romance languages, the transliterations get more and more dubious
    or simply nonexistent.

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Adam W.@21:1/5 to Russ Allbery on Wed Apr 12 15:23:26 2023
    Russ Allbery <eagle@eyrie.org> wrote:

    Yeah, but I understood Adam was only talking about Polish in his reply.

    Yes, exactly. That's the only non-English language I know.

    German has a standard scheme,

    Do you mean substituting umlauts with their Latin equivalents and adding
    "e"?

    ä = ae
    ö = oe
    ü = ue

    At least that's what I found:

    https://blogs.transparent.com/german/writing-the-letters-%E2%80%9Ca%E2%80%9D-%E2%80%9Co%E2%80%9D-and-%E2%80%9Cu%E2%80%9D-without-a-german-keyboard/

    I also know that their ß (scharfes S) can be substituted with ss.

    some other European languages are still comprehensible if all the
    diacritic marks are stripped even though it looks weird, etc.
    Apparently Polish is one of those (I know very little about Polish,
    sadly).

    It is. There are some word plays, because some words have different
    meanings with and without diacritics (for example, "łaska" and "laska"
    mean different things), but they're rare and correct meaning can be
    deduced from context.

  • From Russ Allbery@21:1/5 to Adam W. on Wed Apr 12 09:34:18 2023
    gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.) writes:
    Russ Allbery <eagle@eyrie.org> wrote:

    German has a standard scheme,

    Do you mean substituting umlauts with their Latin equivalents and adding
    "e"?

    ä = ae
    ö = oe
    ü = ue

    At least that's what I found:

    Yeah, exactly.

    I know there's a similar one for Scandinavian languages that uses
    characters like { and } to stand in for characters that don't exist in
    ASCII (I think because those keys on an English keyboard were in the
    same location as the real letters on a Scandinavian keyboard), but this
    is now obscure enough that my Google skills are failing me. Old-timers
    would probably still recognize that encoding, but I think everyone just
    uses UTF-8 now.

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Michael =?ISO-8859-1?Q?B=E4uerle?=@21:1/5 to Adam W. on Wed Apr 12 19:53:16 2023
    Adam W. wrote:

    [German has a standard scheme]
    ä = ae
    ö = oe
    ü = ue

    This can be used in all cases for German.

    Same for capital umlauts:

    Ä = Ae (or AE)
    Ö = Oe (or OE)
    Ü = Ue (or UE)

    At least that's what I found:

    https://blogs.transparent.com/german/writing-the-letters-%E2%80%9Ca%E2%80%9D-%E2%80%9Co%E2%80%9D-and-%E2%80%9Cu%E2%80%9D-without-a-german-keyboard/

    | Bräuche – Braeuche (costumes) and Bäuche – Baeuche (bellies)
                          ^^^^^^^^
    This is wrong. "Bräuche" means something like "conventions".
    My dictionary says "customs" (sounds a bit similar compared to
    "costumes").

    I also know that their ß (scharfes S) can be substituted with ss.

    There are some cases for which this alters the meaning (sometimes "sz"
    is used for them). Normally no problem if context is available.
    If in doubt, use "ss".
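
The substitution scheme discussed above can be sketched in a few lines of Python. This is only an illustration: the names GERMAN_ASCII and german_to_ascii are made up for this example, and the table always uses "ss" for ß, so it ignores the rare cases where "sz" would preserve the meaning better.

```python
# Mapping taken from the scheme described in this thread.
# The table name and function name are illustrative only.
GERMAN_ASCII = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def german_to_ascii(text: str) -> str:
    """Replace German umlauts and ß with their ASCII digraphs."""
    return "".join(GERMAN_ASCII.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(german_to_ascii("Bräuche und Straße"))
```

The mapping is one-way: "ue" in the output cannot be turned back into "ü" reliably, since plain "ue" also occurs in ordinary German words.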

  • From Olivier Miakinen@21:1/5 to All on Wed Apr 12 21:30:33 2023
    On 12/04/2023 10:17, Julien ÉLIE wrote:

    We need an interoperable way to provide texts.
    Please note RFC 2277 (BCP 18) about charsets:

    Protocols MUST be able to use the UTF-8 charset, which consists of
    the ISO 10646 coded character set combined with the UTF-8 character
    encoding scheme, as defined in [10646] Annex R (published in
    Amendment 2), for all text.

    Protocols MAY specify, in addition, how to use other charsets or
    other character encoding schemes for ISO 10646, such as UTF-16, but
    lack of an ability to use UTF-8 is a violation of this policy; such a
    violation would need a variance procedure ([BCP9] section 9) with
    clear and solid justification in the protocol specification document
    before being entered into or advanced upon the standards track.

    For existing protocols or protocols that move data from existing
    datastores, support of other charsets, or even using a default other
    than UTF-8, may be a requirement. This is acceptable, but UTF-8
    support MUST be possible.

    And RFC 2277 is a quarter of a century old (January 1998).


    --
    Olivier Miakinen

  • From Urs =?UTF-8?Q?Jan=C3=9Fen?=@21:1/5 to Russ Allbery on Wed Apr 12 23:17:46 2023
    In <87fs95oy9h.fsf@hope.eyrie.org> on Wed, 12 Apr 2023 18:34:18,
    Russ Allbery wrote:
    I know there's a similar one for Scandinavian languages that uses
    characters like { and } to stand in for characters that don't exist in
    ASCII (I think because those keys on an English keyboard were in the
    same location as the real letters on a Scandinavian keyboard), but this
    is now obscure enough that my Google skills are failing me. Old-timers
    would probably still recognize that encoding, but I think everyone just
    uses UTF-8 now.

    JFTR, see "Table 3" from http://bzr.tin.org/doc/iso2asc.txt

  • From =?UTF-8?Q?Julien_=c3=89LIE?=@21:1/5 to All on Thu Apr 13 08:59:12 2023
    Hi Adam,

    The goal of all of that machinery is that the hierarchy administrators
    should be canonical for the newsgroups entries for their hierarchy.
    Encoding is one of those things where we need to standardize in order to,
    say, comply with the NNTP standard, but I'm not willing to make any other
    editorial judgments because it gets into too much annoying work. So this
    is something you should take up with the hierarchy administrators.

    I apologize for suggesting additional programming work for you. I change
    my request to asking for an amendment to your README in which you might
    urge a proponent or hierarchy administrator not to use UTF-8 punctuation
    for which ASCII punctuation would suffice, to avoid needlessly turning a
    description into UTF-8.

    Wouldn't a 100% ASCII-encoded file fit your needs?
    I've just generated this one with the Text::Unidecode Perl module:
    http://usenet.trigofacile.com/hierarchies/data/newsgroups.ascii

    Punctuation such as French quotation marks («»), non-breaking spaces,
    etc. is converted to ASCII, as are all other non-ASCII characters.

    ftp.isc.org could then make available both files (.utf8 and .ascii).
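
For readers without Perl at hand, a rough approximation of this kind of ASCII folding can be written with Python's standard library. This is a sketch only and is not equivalent to Text::Unidecode: the PUNCT table is a hypothetical, hand-picked list, and characters that have no Unicode decomposition (for example "ł" or "ß") are simply dropped rather than transliterated.

```python
import unicodedata

# Hypothetical punctuation table for this example; extend as needed.
PUNCT = {
    "\u00ab": '"', "\u00bb": '"',   # French quotation marks
    "\u00a0": " ",                  # no-break space
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # dashes
    "\u2026": "...",                # ellipsis
}


def to_ascii(text: str) -> str:
    """Fold text to ASCII: map punctuation, then strip combining marks."""
    text = "".join(PUNCT.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII (combining marks, unmapped letters)
    # is discarded by the "ignore" error handler.
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(to_ascii("« Élie »"))
```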

    --
    Julien ÉLIE

    « Ils ont refusé une offre de Normand ?!? » (Astérix)

  • From =?UTF-8?Q?Julien_=c3=89LIE?=@21:1/5 to All on Thu Apr 13 09:08:14 2023
    Hi Adam,

    FWIW, INN does not enforce UTF-8 in the descriptions of newsgroups. You
    can use any encoding you want for them.

    The newgroup or checkgroups messages could have MIME headers specifying
    the character set but these won't survive processing, so a big text file
    will have multiple unspecified encodings. Aargh.

    My sentence was not about the processing of control messages (for which
    the encoding in the MIME headers is correctly parsed, and the
    descriptions are actually converted to UTF-8 for the sake of
    homogeneity).
    The descriptions of newsgroups for which control articles are sent end
    up in UTF-8.

    Besides, there's a /localencoding/ setting in control.ctl to
    parameterize the resulting encoding. The default is UTF-8 but one may
    change it to another encoding if desired.
    https://www.eyrie.org/~eagle/software/inn/docs/control.ctl.html


    My sentence was just about the encoding of the newsgroups file; INN will
    provide its contents as-is when the descriptions are requested. If it
    has multiple unspecified encodings (big5, iso-8859-xx, utf8, cp1252...),
    it will provide them as-is. It won't try to convert them on the fly.
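
Since the file is served byte-for-byte, one way to see how mixed a given newsgroups file is would be to check each line for UTF-8 validity. The following is a minimal Python sketch; the function name non_utf8_lines and the sample data are invented for illustration and are not part of INN.

```python
def non_utf8_lines(data: bytes):
    """Yield (line_number, raw_line) for lines that are not valid UTF-8."""
    for number, line in enumerate(data.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            yield number, line


if __name__ == "__main__":
    # Made-up sample: one UTF-8 line followed by one ISO-8859-1 line.
    sample = (
        "fr.test\tGroupe d'essai é\n".encode("utf-8")
        + "de.test\tTestgruppe ä\n".encode("iso-8859-1")
    )
    for number, line in non_utf8_lines(sample):
        print(number, line)
```

Pure ASCII lines always pass this check, so it only flags lines that contain bytes from some other 8-bit or multi-byte encoding; it cannot say which encoding that is.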

    --
    Julien ÉLIE

    « – Laissons-lui notre char et prenons le sien…
    – Oui, ça nous dépannera… » (Astérix)

  • From Russ Allbery@21:1/5 to iulius@nom-de-mon-site.com.invalid on Thu Apr 13 08:12:22 2023
    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> writes:

    My sentence was just about the encoding of the newsgroups file; INN will
    provide its contents as-is when being requested the descriptions. If it
    has multiple unspecified encodings (big5, iso-8859-xx, utf8, cp1252...),
    it will provide them as-is. It won't try to convert them on the fly.

    The fundamental protocol problem here is that the LIST NEWSGROUPS command
    has no way to convey an encoding, let alone a different encoding for every
    line. It's all well and good for different hierarchies to use different
    encodings and use appropriate MIME headers for their control messages to
    convey that encoding; all of that would in theory work as expected. But,
    at the end of that process, a given news server returns the whole thing in
    response to LIST NEWSGROUPS, and it has to pick a single encoding for that
    response.

    It's not even about the storage, really. Yes, right now INN uses a single
    big file, but it doesn't need to do that. In theory, it could use some
    smarter storage mechanism that preserved the original encoding. But that
    doesn't help because of the protocol; it still has to respond to LIST
    NEWSGROUPS commands, and at that point the separate encodings don't help.

    The only workable choices for a single encoding are ASCII and UTF-8;
    everything else is much worse in terms of interoperability. ASCII is not
    generally sufficient as soon as one gets too far from western Europe and,
    truly, is not really sufficient for western European languages either;
    while it may be possible to read French with stripped accent marks or
    Spanish without tildes, it's annoying, sometimes ambiguous, and there's no
    reason to put up with it in 2023. Hence UTF-8.

    Given that, working backwards, sending hierarchy control messages in a
    different encoding than UTF-8 (or ASCII, which is a UTF-8 subset) is
    probably not the best approach. Even if the news software understands the
    MIME headers properly and knows the encoding (which can be dubious), now
    the content has to be recoded into UTF-8 by the news server anyway. While
    this is a well-defined operation for most encodings, it adds another step
    that can fail and another opportunity for something to go wrong.

    The best results are likely to come from using UTF-8 end-to-end. This
    also has the advantage of being the direction that computing is going
    anyway. My understanding is that even Chinese domestic use is
    increasingly UTF-8, although support for other encodings is still required
    in some situations. (Chinese was a potential sticking point due to issues
    with how Chinese, Japanese, and Korean were encoded in Unicode that are
    more complex than is worth getting into.)

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Russ Allbery@21:1/5 to news@zzo38computer.org.invalid on Thu Apr 13 08:28:40 2023
    news@zzo38computer.org.invalid writes:

    My opinion is that newsgroup names should be purely ASCII (there are
    many benefits to this, and using non-ASCII characters in newsgroup names
    and domain names and commands and configuration files can cause many
    problems, including security issues (especially if any Unicode-based
    encoding is used; non-Unicode has less security issues, but still is not
    worth it to use non-ASCII in these cases), comparisons, input, etc).

    It's very easy for someone who speaks English to say that newsgroup names
    should be purely ASCII, but what we would be saying is that people who
    only speak Japanese (or Chinese, or Russian, or...) should put up with
    newsgroup names being opaque, incomprehensible blobs of foreign
    characters. Imagine what Usenet would be like for you if every newsgroup
    name was in Arabic (or, if you happen to read Arabic, Korean, or some
    other language).

    Historically, that is exactly what we have said. But I think that's sad.

    The security problem is real, but honestly that's largely because Usenet
    software is very old and is often written in languages that, if not
    actually dying, are at least very stagnant. Handling encodings properly
    in C is a pain, but that's because doing anything properly in C is a pain.
    Every modern language comes with extremely well-tested libraries, and most
    of them now make *not* dealing with Unicode very difficult; it just
    happens automatically. The remaining non-coding problems are mostly about
    homograph attacks, and that's not much of an issue with newsgroup names.

    Using multiple encodings, as you say, definitely makes the problem worse,
    since you can't simply reject all invalid UTF-8 very early on, since you
    may instead be dealing with ISO-8859-1 or some other encoding.
    Thankfully, there's no real reason to support anything other than UTF-8
    now. The remaining question is whether Usenet software can cope in
    practice, or whether, like DNS and email, we'll be forced into using
    complicated ASCII-compatible encoding schemes. Experiments so far seemed
    to indicate that native Usenet software support for UTF-8 newsgroup names
    wasn't that bad.

    I can't think of any other major Internet protocol, not even domain names,
    that is still limited to ASCII. Newsgroup names are a sad outlier.

    (But, I really hate Unicode; it is full of problems, including Han
    unification and other complications; and it is a stateful character set
    even though the encoding is stateless. TRON character code is better in
    some ways (especially for Japanese text), and I have done some work
    using this.)

    I hate the email message format (it should be something much less
    ambiguous and machine-parsable), the RFC 2822 Date format, and RFC 2047
    header encoding. The price of implementing protocols is that there will
    always be parts of them you don't like because life is compromise.

    The RFC says that it should be UTF-8, but I think that this is a mistake
    in the design of the protocol.

    Mistake or not, it's not going to change now.

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Adam H. Kerman@21:1/5 to Russ Allbery on Thu Apr 13 16:52:00 2023
    Russ Allbery <eagle@eyrie.org> wrote:

    . . . (Chinese was a potential sticking point due to issues
    with how Chinese, Japanese, and Korean were encoded in Unicode that's
    more complex than is worth getting into.)

    Are you talking about character codes for the glyphs common to all three
    languages, the CJK set, or something else entirely? Also, wasn't there
    something about China modernizing glyphs that were still being used by
    the other two languages?

  • From Russ Allbery@21:1/5 to Adam H. Kerman on Thu Apr 13 10:50:53 2023
    "Adam H. Kerman" <ahk@chinet.com> writes:
    Russ Allbery <eagle@eyrie.org> wrote:

    . . . (Chinese was a potential sticking point due to issues with how
    Chinese, Japanese, and Korean were encoded in Unicode that's more
    complex than is worth getting into.)

    Are you talking about character codes for the glyphs common to all three
    languages, the CJK set, or something else entirely? Also, wasn't there
    something about China modernizing glyphs that were still being used by
    the other two languages?

    Yeah, I'm talking about the glyph unification problem. I forget how much
    impact the traditional vs. simplified Chinese distinction has on the
    Unicode encoding and whether some of those distinctions are also unified.

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Russ Allbery@21:1/5 to Adam H. Kerman on Thu Apr 13 11:49:38 2023
    "Adam H. Kerman" <ahk@chinet.com> writes:

    I didn't sit in on these meeting years ago like you, but the little I
    know about Chinese is that the traditional glyphs represented words and
    not letters; they represent letters in the other two languages.

    I'm fairly sure this isn't true in general. To the extent that the same
    basic glyphs are used in Japanese kanji, I believe that they are also
    words, or at least not letters in the sense of the Latin alphabet.
    Japanese *kana* uses some Chinese characters to represent syllables
    instead of words, but kana is a supplemental writing system used in
    addition to kanji.

    (Disclaimer that I do not speak or read any of these languages. I just
    have a long-standing amateur interest in character sets.)

    Hangul for Korean is different, but I don't think Hangul characters were
    unified with Chinese and Japanese. I believe the impact on Korean was on
    hanja, which is not used for most words. Hangul doesn't look anything
    like Chinese or Japanese characters, to such an extent that I, as someone
    who doesn't know any of these languages, can distinguish between Hangul
    and the other languages on sight.

    I'm aware that the glyphs are combinations of strokes that are common to
    other glyphs, and I often wondered if the strokes themselves and not the
    final result should have been what was encoded.

    There was a fairly extensive discussion of this at the time, but they
    decided against it for a bunch of reasons that I don't remember. I think
    one of them was that the existing encodings of those languages did not do
    this, and one of the goals of Unicode was to allow easy conversion from
    and to existing character encodings.

    The coding plane would have been a hell of a lot smaller.

    Yes, but the software would have been a hell of a lot more complicated,
    and it's not clear that's a good tradeoff. Arabic is already a
    substantial challenge to support, and its combining characters are much
    simpler than the system that would be required for stroke encoding, IIRC.

    (Admittedly, most of the challenge with Arabic is the right-to-left
    directionality.)

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

  • From Adam H. Kerman@21:1/5 to Russ Allbery on Thu Apr 13 18:31:36 2023
    Russ Allbery <eagle@eyrie.org> wrote:
    "Adam H. Kerman" <ahk@chinet.com> writes:
    Russ Allbery <eagle@eyrie.org> wrote:

    . . . (Chinese was a potential sticking point due to issues with how
    Chinese, Japanese, and Korean were encoded in Unicode that's more
    complex than is worth getting into.)

    Are you talking about character codes for the glyphs common to all three
    languages, the CJK set, or something else entirely? Also, wasn't there
    something about China modernizing glyphs that were still being used by
    the other two languages?

    Yeah, I'm talking about the glyph unification problem. I forget how much
    impact the traditional vs. simplified Chinese distinction has on the
    Unicode encoding and whether some of those distinctions are also unified.

    I didn't sit in on these meetings years ago like you, but the little I
    know about Chinese is that the traditional glyphs represented words and
    not letters; they represent letters in the other two languages.

    I'm aware that the glyphs are combinations of strokes that are common to
    other glyphs, and I often wondered if the strokes themselves and not the
    final result should have been what was encoded. I don't know if in
    handwriting the student was taught to draw the strokes in a certain order,
    since order becomes important in representing the combined strokes for
    the glyph.

    I realize strokes aren't letter-equivalents in other languages.

    The coding plane would have been a hell of a lot smaller.

  • From Adam H. Kerman@21:1/5 to Russ Allbery on Thu Apr 13 19:17:45 2023
    Russ Allbery <eagle@eyrie.org> wrote:

    . . .

    I'm fairly sure this isn't true in general. . . .

    All right; I'll look it up. Thanks.

  • From =?UTF-8?Q?Julien_=c3=89LIE?=@21:1/5 to All on Sun Apr 16 22:55:18 2023
    Hi Nigel,

    I'm trying to sync up the active and newsgroups file from 15 peers and
    it's proving to be a bit of a challenge.

    Apart from encoding issues, are there special cases that you would have
    liked to achieve for your sync and merge?

    We have 2 old scripts needing a bit of refresh. I had planned to have a
    look at them for the INN 2.7.2 release (in late 2023 or 2024).
    https://github.com/InterNetNews/inn/issues/39


    # mkngfile - make a newsgroup description file from multiple sources
    #
    # Jeremy Nixon <jeremy@exit109.com>
    # $Id: mkngfile,v 1.1 1999/04/17 09:19:25 jeremy Exp $
    #
    # This program creates a newsgroup description file, using one or
    # several input files containing group descriptions. The resulting
    # file will contain a description line for each group in your active
    # file.
    #
    # If the input contains multiple different descriptions for a group,
    # the program will prompt interactively for which one to use; or, if
    # the --noask option is given, one will be chosen arbitrarily. If a
    # group has no description, $default_desc (below) will be used.
    #
    # The output will be sent to stdout, or to the file specified with
    # the --output (or -o) option.
    #
    # Example - to run with your existing newsgroups file, a local copy
    # of the ISC newsgroups file, and a directory containing checkgroups
    # files with names like *.check, creating the new file as 'newfile':
    # mkngfile -o newfile /news/etc/newsgroups newsgroups checkgroups/*.check
    #
    # You can set the location of your active file below so you don't
    # have to specify it on the command line.
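
The merging behaviour described in that header can be sketched roughly as follows. This is a minimal Python illustration, not the actual mkngfile code (which is not shown here): the function names are invented, input is assumed to be "group<TAB>description" lines, and instead of prompting interactively it keeps the first description seen and reports the conflicts separately.

```python
from collections import defaultdict


def parse_newsgroups(text):
    """Yield (group, description) pairs from 'group<TAB>description' lines."""
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("#"):
            continue
        group, _, desc = line.partition("\t")
        yield group, desc.strip()


def merge_descriptions(active_groups, sources, default_desc="No description."):
    """Return (merged, conflicts) for the groups listed in the active file.

    merged maps each active group to one description (the first one seen,
    or default_desc); conflicts maps a group to all differing descriptions.
    """
    candidates = defaultdict(list)
    for text in sources:
        for group, desc in parse_newsgroups(text):
            if desc and desc not in candidates[group]:
                candidates[group].append(desc)

    merged, conflicts = {}, {}
    for group in active_groups:
        found = candidates.get(group, [])
        merged[group] = found[0] if found else default_desc
        if len(found) > 1:
            conflicts[group] = found
    return merged, conflicts
```

Returning the conflicts rather than resolving them leaves room for either an interactive prompt or a --noask style arbitrary choice on top of the same data.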


    Besides files as input, I would also add the possibility to sync from
    hostnames (the program will then download their newsgroups files).

    We'll also need a similar tool to merge several active files (note that
    INN already has the actmerge utility, without any documentation, that
    merges 2 active files).




    Descriptions are then cleaned with:

    # cleannewsgroups.pl
    # Copyright 1997-1999 Arthur Hagen
    # Remove duplicate (Moderated) comments
    # Strip trailing spaces
    # Keep only one description for a newsgroup
    # Option to remove extra tabs or to pretty-print with several tabs
    # Option to either sort the newsgroups file alphabetically or to have it
    # in the same order as the active file.

    It can warn when the encoding is not UTF-8 :-)


    The first bit is done, which is mainly getting rid of groups that have
    invalid names (those that end in a period, contain illegal characters,
    and the like).

    Such checks can also be added to the script (with an option).
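
As an illustration of what such a check might look like, here is a small Python sketch. The pattern is an assumption based on a reading of the newsgroup-name syntax in RFC 5536 (dot-separated components made of ASCII letters, digits, "+", "-" and "_"); it is not taken from either script, and it would reject the UTF-8 newsgroup names discussed earlier in the thread.

```python
import re

# Assumed syntax: one or more components separated by single dots.
COMPONENT = r"[A-Za-z0-9+_-]+"
VALID_NAME = re.compile(rf"{COMPONENT}(?:\.{COMPONENT})*")


def is_valid_group_name(name: str) -> bool:
    """Return True if name matches the assumed newsgroup-name syntax."""
    return VALID_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ("comp.lang.c++", "alt.test.", "alt..test", "fr.général"):
        print(name, is_valid_group_name(name))
```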

    --
    Julien ÉLIE

    « Si, si, si… Avec des si, on mettrait Lutèce en amphore ! » (Vacancier)

  • From Billy G. (go-while)@21:1/5 to All on Sun Sep 24 10:56:31 2023
    On 11.04.23 12:12, Julien ÉLIE wrote:
    FWIW, the descriptions encoded in UTF-8 from the ftp.isc.org newsgroup
    file are here:
      http://usenet.trigofacile.com/hierarchies/data/newsgroups.utf8

    It may facilitate your life :-)

    The conversions I found out to work are:
    - cn.* and han.* are encoded in gb18030;
    - fido7.*, medlux.* and relcom.* in koi8-r;
    - ukr.* in koi8-u;
    - nctu.*, ncu.* and tw.* in big5;
    - scout.forum.chinese and scout.forum.korean in big5;
    - eternal-september.*, fido.* and fr.* in utf-8;
    - all the others fit well in cp1252.
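
The list above translates fairly directly into a lookup table. The following Python sketch shows one way to recode a single newsgroups line to UTF-8 using it; the table and function names are invented for this example, the group name is assumed to be pure ASCII, and undecodable bytes are replaced rather than rejected.

```python
# Prefix-to-encoding table transcribed from the list above.
HIERARCHY_ENCODINGS = [
    (("cn.", "han."), "gb18030"),
    (("fido7.", "medlux.", "relcom."), "koi8-r"),
    (("ukr.",), "koi8-u"),
    (("nctu.", "ncu.", "tw."), "big5"),
    (("scout.forum.chinese", "scout.forum.korean"), "big5"),
    (("eternal-september.", "fido.", "fr."), "utf-8"),
]
DEFAULT_ENCODING = "cp1252"


def encoding_for(group: str) -> str:
    """Pick the source encoding for a group name from the table."""
    for prefixes, encoding in HIERARCHY_ENCODINGS:
        if group.startswith(prefixes):
            return encoding
    return DEFAULT_ENCODING


def line_to_utf8(raw: bytes) -> bytes:
    """Recode one non-empty 'group<whitespace>description' line to UTF-8."""
    group = raw.split(None, 1)[0].decode("ascii")
    text = raw.decode(encoding_for(group), errors="replace")
    return text.encode("utf-8")
```

Note that "fido7." and "fido." are distinct prefixes, so fido7.* lines are decoded as koi8-r while fido.* lines are treated as UTF-8.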

    thanks!
