I know there isn't a standard encoding for the newsgroup file but that
may have been a bit of an oversight now that some people are trying to
run a clean server.
Going forward, maybe the powers that be can get their heads together
and enforce a certain coding standard for innd (and whatever else is
out there) that is at least maintained. Personally, I don't care which
one we end up with, ISO-8859 seems to be the far more popular (7
servers) followed by ASCII (4) then UTF-8 (3) .
I guess we'd need all the servers to want to agree and update their
files accordingly.
I'm trying to sync up the active and newsgroups file from 15 peers and
it's proving to be a bit of a challenge.
Next is a little more of a challenge. Trying to sync the descriptions. Since my server has a graphical interface that displays group names and descriptions, I need to be able to know which encoding is used.
I know there isn't a standard encoding for the newsgroup file but that
may have been a bit of an oversight now that some people are trying to
run a clean server.
Since my server has a graphical interface that displays group names and descriptions, I need to be able to know which encoding is used.
If there’s going to be a global choice of encoding then it has to be
UTF-8.
Next is a little more of a challenge. Trying to sync the descriptions.
It wouldn't be so bad if everyone used the same encoding, however the majority are using ISO-8859-1, a couple are using UTF-8, some using
ASCII and one is Non-ISO extended-ASCII.
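For what it's worth, a first pass at telling these apart can be sketched in a few lines of Python. This is only a rough illustration, not how any of the servers mentioned do it: the function name `guess_encoding` and the label `unknown-8bit` are made up for the example.

```python
def guess_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8' or 'unknown-8bit'.

    Hypothetical helper: it only separates the cases that can be
    detected reliably by attempting a strict decode.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some legacy 8-bit encoding (ISO-8859-x, cp1252, koi8-r, big5...).
        return "unknown-8bit"
```

Anything that lands in the last bucket still needs outside knowledge, because every byte sequence decodes without error as ISO-8859-1, so a successful decode proves nothing there.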
Going forward, maybe the powers that be can get their heads together
and enforce a certain coding standard for innd (and whatever else is
out there) that is at least maintained.
FWIW, the descriptions encoded in UTF-8 from the ftp.isc.org newsgroup
file are here:
 http://usenet.trigofacile.com/hierarchies/data/newsgroups.utf8
It may facilitate your life :-)
The conversions I found out to work are:
If there’s going to be a global choice of encoding then it has to be UTF-8.
ASCII's advantage over UTF-8 is its universality.
[...]
Gee whiz. The ASCII apostrophe was used ambiguously as single close quote
AND as the combining diacritical mark for the acute accent since 1967.
Nigel Reed <sysop@endofthelinebbs.com> writes:
I know there isn't a standard encoding for the newsgroup file but that
may have been a bit of an oversight now that some people are trying to
run a clean server.
Going forward, maybe the powers that be can get their heads together
and enforce a certain coding standard for innd (and whatever else is
out there) that is at least maintained. Personally, I don't care which
one we end up with, ISO-8859 seems to be the far more popular (7
servers) followed by ASCII (4) then UTF-8 (3) .
If there’s going to be a global choice of encoding then it has to be
UTF-8.
I guess we'd need all the servers to want to agree and update their
files accordingly.
That’s the hard bit...
04/11/2023 17:15, Adam H. Kerman wrote:
If there’s going to be a global choice of encoding then it has to be UTF-8.
ASCII's advantage over UTF-8 is its universality.
I would have said exactly the opposite: UTF-8's advantage over ASCII
is its universality, because UTF-8 can express any character from any language.
But of course ASCII's advantage over UTF-8 is that it is recognized by
all Usenet software.
[...]
Gee whiz. The ASCII apostrophe was used ambiguously as single close quote AND as the combining diacritical mark for the acute accent since 1967.
Oh? Which software does that weird thing? Surely it is not a standard use
of ASCII.
Going forward, maybe the powers that be can get their heads together
and enforce a certain coding standard for innd (and whatever else is
out there) that is at least maintained.
UTF-8 is the expected encoding for the descriptions returned by a
LIST NEWSGROUPS command in the NNTP protocol.
If there's going to be a global choice of encoding then it has to be
UTF-8.
On Tue, 11 Apr 2023 12:12:03 +0200
Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> wrote:
UTF-8 is the expected encoding for the descriptions returned by a
LIST NEWSGROUPS command in the NNTP protocol.
That's good to know. I've been converting everything to fit into my newsgroups file which is ISO-8859 so it looks like I've been going the
wrong way. Back to the drawing board now that my scripts are almost
done lol.
Nigel Reed <sysop@endofthelinebbs.com> writes:
Going forward, maybe the powers that be can get their heads together and enforce a certain coding standard for innd (and whatever else is out
there) that is at least maintained. Personally, I don't care which one
we end up with, ISO-8859 seems to be the far more popular (7 servers) followed by ASCII (4) then UTF-8 (3) .
It's been on my list for years to encode the ftp.isc.org newsgroups file uniformly in UTF-8, which I think is a prerequisite for enforcing
something in innd, but it's a bunch of tedious work and I haven't found
the time yet.
Going forward, maybe the powers that be can get their heads together and enforce a certain coding standard for innd (and whatever else is out
there) that is at least maintained. Personally, I don't care which one
we end up with, ISO-8859 seems to be the far more popular (7 servers) followed by ASCII (4) then UTF-8 (3) .
The main question is if currently used readers can handle utf-8 in group descriptions. If yes, I'd stick with utf-8. If not, then I think it would
be safest to transliterate the descriptions to us-ascii (if it can be done for all encodings;
If you do that, may I request that ASCII equivalents be substituted for
UTF-8 punctuation in brief descriptions? Pretty please?
"Adam H. Kerman" <ahk@chinet.com> writes:
If you do that, may I request that ASCII equivalents be substituted for UTF-8 punctuation in brief descriptions? Pretty please?
The goal of all of that machinery is that the hierarchy administrators
should be canonical for the newsgroups entries for their hierarchy.
Encoding is one of those things where we need to standardize in order to, say, comply with the NNTP standard, but I'm not willing to make any other editorial judgments because it gets into too much annoying work. So this
is something you should take up with the hierarchy administrators.
This becomes a problem when trying to do a diff or other operations
trying to match group names.
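One way around this when diffing, sketched below in Python, is to treat each line as raw bytes and compare only the group names, which are ASCII, leaving the descriptions undecoded. The helper name `group_names` and the assumption that each line is "name, whitespace, description" are mine, for illustration only.

```python
def group_names(newsgroups_data: bytes) -> set[str]:
    """Return the set of group names found in a newsgroups file.

    Works on bytes so that descriptions in mixed encodings never
    have to be decoded; only the first field of each line is used.
    """
    names = set()
    for line in newsgroups_data.splitlines():
        fields = line.split(None, 1)  # name, then the rest of the line
        if fields:
            # Names are expected to be ASCII; anything else is replaced
            # rather than raising, so bad entries still show up in a diff.
            names.add(fields[0].decode("ascii", errors="replace"))
    return names
```

With two peers' files loaded as bytes, `group_names(mine) - group_names(theirs)` then gives the groups only one side carries, regardless of how either side encoded its descriptions.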
Going forward, maybe the powers that be can get their heads together
and enforce a certain coding standard for innd (and whatever else is
out there) that is at least maintained. Personally, I don't care which
one we end up with, ISO-8859 seems to be the far more popular (7
servers) followed by ASCII (4) then UTF-8 (3) .
It definitely cannot. It's rare to find a language where that can be done without losing information (only in Europe, essentially).
Russ Allbery <eagle@eyrie.org> wrote:
It definitely cannot. It's rare to find a language where that can be done without losing information (only in Europe, essentially).
Well, to be honest, you lose some information, but it's very rare and can usually be deduced from context.
The RFC says that it
should be UTF-8, but I think that this is a mistake in the design of the protocol. Capabilities and commands should be pure ASCII, but this should
not mean that any text in articles, descriptions, MOTD, etc has to be pure ASCII; it can use other character sets, including the possibility of ones which might be incompatible with Unicode, and including TRON codes too.
(I run a NNTP server with my own newsgroups, which are not (currently) considered part of Usenet, and currently have no need for non-ASCII descriptions, but in future if it does, then I will consider what to do. However, I also don't use INN, anyways.)
Well, to be honest, you lose some information, but it's very rare and can usually be deduced from context.
In a language that doesn't use the Latin alphabet? C'mon.
. . .
FWIW, INN does not enforce UTF-8 in the descriptions of newsgroups. You
can use any encoding you want for them.
Adam H. Kerman <ahk@chinet.com> wrote:
Well, to be honest, you lose some information, but it's very rare and can usually be deduced from context.
In a language that doesn't use the Latin alphabet? C'mon.
No, I'm only talking about Polish.
Adam W. <gof-cut-this-news@cut-this-chmurka.net.invalid> wrote:
Adam H. Kerman <ahk@chinet.com> wrote:
Well, to be honest, you lose some information, but it's very rare and
can usually be deduced from context.
In a language that doesn't use the Latin alphabet? C'mon.
No, I'm only talking about Polish.
You are. Russ wasn't.
Yeah, but I understood Adam was only talking about Polish in his reply.
German has a standard scheme,
some other European languages are still comprehensible if all the
diacritic marks are stripped even though it looks weird, etc.
Apparently Polish is one of those (I know very little about Polish,
sadly).
Russ Allbery <eagle@eyrie.org> wrote:
German has a standard scheme,
Do you mean substituting umlauts with their Latin equivalents and adding
"e"?
ä = ae
ö = oe
ü = ue
At least that's what I found:
https://blogs.transparent.com/german/writing-the-letters-%E2%80%9Ca%E2%80%9D-%E2%80%9Co%E2%80%9D-and-%E2%80%9Cu%E2%80%9D-without-a-german-keyboard/
I also know that their ß (scharfes S) can be substituted with ss.
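As a rough sketch, that substitution scheme is easy to express in Python. The lowercase pairs and ß come from the list above; the uppercase pairs (Ä, Ö, Ü) are my own assumption, and the name `transliterate_german` is just for illustration.

```python
# Mapping from the scheme above; uppercase variants are an assumption.
GERMAN_TO_ASCII = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate_german(text: str) -> str:
    """Replace German umlauts and ß with their ASCII digraphs."""
    return text.translate(GERMAN_TO_ASCII)
```

Note that this only works on text that has already been decoded correctly, so it does not remove the need to know the source encoding first.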
We need an interoperable way to provide texts.
Please note RFC 2277 (BCP 18) about charsets:
Protocols MUST be able to use the UTF-8 charset, which consists of
the ISO 10646 coded character set combined with the UTF-8 character
encoding scheme, as defined in [10646] Annex R (published in
Amendment 2), for all text.
Protocols MAY specify, in addition, how to use other charsets or
other character encoding schemes for ISO 10646, such as UTF-16, but
lack of an ability to use UTF-8 is a violation of this policy; such a
violation would need a variance procedure ([BCP9] section 9) with
clear and solid justification in the protocol specification document
before being entered into or advanced upon the standards track.
For existing protocols or protocols that move data from existing
datastores, support of other charsets, or even using a default other
than UTF-8, may be a requirement. This is acceptable, but UTF-8
support MUST be possible.
I know there's a similar one for Scandinavian languages that uses
characters like { and } to stand in for characters that don't exist in
ASCII (I think because those keys on an English keyboard were in the same location as the real letters on a Scandinavian keyboard), but this is now obscure enough that my Google skills are failing me. Old-timers would probably still recognize that encoding, but I think everyone just uses
UTF-8 now.
The goal of all of that machinery is that the hierarchy administrators
should be canonical for the newsgroups entries for their hierarchy.
Encoding is one of those things where we need to standardize in order to,
say, comply with the NNTP standard, but I'm not willing to make any other
editorial judgments because it gets into too much annoying work. So this
is something you should take up with the hierarchy administrators.
I apologize for suggesting additional programming work for you. I change
my request to asking for an amendment to your README in which you might
urge a proponent or hierarchy administrator not to use UTF-8 punctuation
for which ASCII punctuation would suffice, to avoid needlessly turning a description into UTF-8.
FWIW, INN does not enforce UTF-8 in the descriptions of newsgroups. You
can use any encoding you want for them.
The newgroup or checkgroups messages could have MIME headers specifying
the character set but these won't survive processing, so a big text file
will have multiple unspecified encodings. Aargh.
My sentence was just about the encoding of the newsgroups file; INN will provide its contents as-is when being requested the descriptions. If it
has multiple unspecified encodings (big5, iso-8859-xx, utf8, cp1252...),
it will provide them as-is. It won't try to convert them on the fly.
My opinion is that newsgroup names should be purely ASCII (there are
many benefits to this, and using non-ASCII characters in newsgroup names
and domain names and commands and configuration files can cause many problems, including security issues (especially if any Unicode-based
encoding is used; non-Unicode has less security issues, but still is not worth it to use non-ASCII in these cases), comparisons, input, etc).
(But, I really hate Unicode; it is full of problems, including Han unification and other complications; and it is a stateful character set
even though the encoding is stateless. TRON character code is better in
some ways (especially for Japanese text), and I have done some work
using this.)
The RFC says that it should be UTF-8, but I think that this is a mistake
in the design of the protocol.
. . . (Chinese was a potential sticking point due to issues
with how Chinese, Japanese, and Korean were encoded in Unicode that's more complex than is worth getting into.)
Russ Allbery <eagle@eyrie.org> wrote:
. . . (Chinese was a potential sticking point due to issues with how
Chinese, Japanese, and Korean were encoded in Unicode that's more
complex than is worth getting into.)
Are you talking about character codes for the glyphs common to all three languages, the CJK set, or something else entirely? Also, wasn't there something about China modernizing glyphs that were still being used by
the other two languages?
I didn't sit in on these meetings years ago like you, but the little I
know about Chinese is that the traditional glyphs represented words and
not letters; they represent letters in the other two languages.
I'm aware that the glyphs are combinations of strokes that are common to other glyphs, and I often wondered if the strokes themselves and not the final result should have been what was encoded.
The coding plane would have been a hell of a lot smaller.
"Adam H. Kerman" <ahk@chinet.com> writes:
Russ Allbery <eagle@eyrie.org> wrote:
. . . (Chinese was a potential sticking point due to issues with how Chinese, Japanese, and Korean were encoded in Unicode that's more
complex than is worth getting into.)
Are you talking about character codes for the glyphs common to all three languages, the CJK set, or something else entirely? Also, wasn't there something about China modernizing glyphs that were still being used by
the other two languages?
Yeah, I'm talking about the glyph unification problem. I forget how much impact the traditional vs. simplified Chinese distinction has on the
Unicode encoding and whether some of those distinctions are also unified.
. . .
I'm fairly sure this isn't true in general. . . .
I'm trying to sync up the active and newsgroups file from 15 peers and
it's proving to be a bit of a challenge.
Besides files as input, I would also add the possibility to sync from hostnames (the program will then download their newsgroups files).
The first bit is done, which is mainly getting rid of groups that have invalid names (those that end in a period, contain illegal characters,
and the like).
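A name check along those lines might look like the following Python sketch. The exact rules used above aren't stated, so the allowed character set here (ASCII letters, digits, "+", "-" and "_", in dot-separated non-empty components) is an assumption, and `is_valid_group_name` is a hypothetical helper.

```python
import re

# Assumed rule: one or more non-empty components separated by single dots,
# each made only of ASCII letters, digits, '+', '-' or '_'.
_GROUP_NAME = re.compile(r"[A-Za-z0-9+_-]+(?:\.[A-Za-z0-9+_-]+)*")


def is_valid_group_name(name: str) -> bool:
    """Return True if the name matches the assumed component rules."""
    return _GROUP_NAME.fullmatch(name) is not None
```

This rejects names that end in a period, contain an empty component, or use characters outside the assumed set; a stricter or looser policy would only need a different pattern.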
FWIW, the descriptions encoded in UTF-8 from the ftp.isc.org newsgroup
file are here:
 http://usenet.trigofacile.com/hierarchies/data/newsgroups.utf8
It may facilitate your life :-)
The conversions I found out to work are:
- cn.* and han.* are encoded in gb18030;
- fido7.*, medlux.* and relcom.* in koi8-r;
- ukr.* in koi8-u;
- nctu.*, ncu.* and tw.* in big5;
- scout.forum.chinese and scout.forum.korean in big5;
- eternal-september.*, fido.* and fr.* in utf-8;
- all the others fit well in cp1252.
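For anyone scripting this, the list above can be turned into a small lookup table. The following Python sketch assumes each line of the newsgroups file starts with the group name followed by whitespace; the names `HIERARCHY_ENCODINGS`, `encoding_for` and `line_to_utf8` are made up for the example, and undecodable bytes are replaced rather than treated as errors, which may or may not be what you want.

```python
# Prefix -> codec table taken from the list above (Python codec names).
HIERARCHY_ENCODINGS = [
    (("cn.", "han."), "gb18030"),
    (("fido7.", "medlux.", "relcom."), "koi8_r"),
    (("ukr.",), "koi8_u"),
    (("nctu.", "ncu.", "tw."), "big5"),
    (("scout.forum.chinese", "scout.forum.korean"), "big5"),
    (("eternal-september.", "fido.", "fr."), "utf-8"),
]
DEFAULT_ENCODING = "cp1252"  # "all the others"


def encoding_for(group: str) -> str:
    """Pick the source encoding for a group name by hierarchy prefix."""
    for prefixes, encoding in HIERARCHY_ENCODINGS:
        if group.startswith(prefixes):
            return encoding
    return DEFAULT_ENCODING


def line_to_utf8(raw: bytes) -> bytes:
    """Re-encode one non-empty newsgroups line (name + description) as UTF-8."""
    name = raw.split(None, 1)[0].decode("ascii", errors="replace")
    # errors="replace" is an assumption: bad bytes become U+FFFD
    # instead of aborting the whole conversion.
    text = raw.decode(encoding_for(name), errors="replace")
    return text.encode("utf-8")
```

For example, `line_to_utf8(b"de.test\tGr\xfc\xdfe")` falls through to cp1252 and yields the UTF-8 bytes for "de.test	Grüße".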