Forum: >>> Magnum BBS <<<

Encoding madness

From Nigel Reed@21:1/5 to All on Tue Apr 11 01:44:37 2023

Hi all,

I'm trying to sync up the active and newsgroups file from 15 peers and
it's proving to be a bit of a challenge.

The first bit is done, which is mainly getting rid of groups that have
invalid names (those that end in a period, contain illegal characters,
and the like).
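
A minimal Python sketch of that first pass. The syntax rule is an assumption modelled on the newsgroup-name grammar of RFC 5536 (dot-separated components made of ASCII letters, digits, "+", "-" and "_"); the function names are hypothetical and not taken from any existing tool.

```python
import re

# Assumed rule: one or more non-empty components separated by single dots,
# each component limited to ASCII letters, digits, "+", "-" and "_".
COMPONENT = r"[A-Za-z0-9+_-]+"
VALID_NAME = re.compile(rf"{COMPONENT}(?:\.{COMPONENT})*")


def is_valid_group_name(name):
    """Return True if the whole name matches the assumed syntax."""
    return VALID_NAME.fullmatch(name) is not None


def split_valid(names):
    """Partition an iterable of names into (kept, rejected) lists."""
    kept, rejected = [], []
    for name in names:
        (kept if is_valid_group_name(name) else rejected).append(name)
    return kept, rejected
```

Names ending in a period, containing an empty component, or using non-ASCII characters all land in the rejected list, which can then be reviewed by hand.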

Next is a little more of a challenge. Trying to sync the descriptions.
It wouldn't be so bad if everyone used the same encoding, however the
majority are using ISO-8859-1, a couple are using UTF-8, some using
ASCII and one is Non-ISO extended-ASCII.

This becomes a problem when trying to do a diff or other operations
trying to match group names.
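
One way to see which encoding each peer's file is in, sketched in Python under the assumption that the files are read as raw bytes. The helper names are hypothetical. The check can only separate pure ASCII, valid UTF-8 and "some other 8-bit encoding"; it cannot tell ISO-8859-1 from cp1252 or KOI8-R.

```python
from collections import Counter


def classify(raw):
    """Return 'ascii', 'utf-8' or '8-bit' for one raw line (bytes)."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")  # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "8-bit"


def summarize(raw_lines):
    """Count how many lines fall into each class."""
    return Counter(classify(line) for line in raw_lines)
```

A short 8-bit line can occasionally be valid UTF-8 by coincidence, so the per-file totals are more trustworthy than any single line's verdict.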

I know there isn't a standard encoding for the newsgroup file but that
may have been a bit of an oversight now that some people are trying to
run a clean server.

Going forward, maybe the powers that be can get their heads together
and enforce a certain coding standard for innd (and whatever else is
out there) that is at least maintained. Personally, I don't care which
one we end up with, ISO-8859 seems to be the far more popular (7
servers) followed by ASCII (4) then UTF-8 (3) .

I guess we'd need all the servers to want to agree and update their
files accordingly.

Somehow, I feel I'll be shot down here since it's been like this since
1986.

Thoughts?
--
End Of The Line BBS - Plano, TX
telnet endofthelinebbs.com 23

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Richard Kettlewell@21:1/5 to Nigel Reed on Tue Apr 11 08:49:41 2023

Nigel Reed <sysop@endofthelinebbs.com> writes:

I know there isn't a standard encoding for the newsgroup file but that
may have been a bit of an oversight now that some people are trying to
run a clean server.

Going forward, maybe the powers that be can get their heads together
and enforce a certain coding standard for innd (and whatever else is
out there) that is at least maintained. Personally, I don't care which
one we end up with, ISO-8859 seems to be the far more popular (7
servers) followed by ASCII (4) then UTF-8 (3) .

If there’s going to be a global choice of encoding then it has to be
UTF-8.

I guess we'd need all the servers to want to agree and update their
files accordingly.

That’s the hard bit...

--
https://www.greenend.org.uk/rjk/


From Franck@21:1/5 to All on Tue Apr 11 09:51:12 2023

Hello,

I'm trying to sync up the active and newsgroups file from 15 peers and
it's proving to be a bit of a challenge.

Hmm, I'd rather say: "It's a f***ing challenge"!

Next is a little more of a challenge. Trying to sync the descriptions.

I know there isn't a standard encoding for the newsgroup file but that
may have been a bit of an oversight now that some people are trying to
run a clean server.

Since my server has a graphical interface that displays group names and descriptions, I need to be able to know which encoding is used.

To solve the problem, I use a small file that records the hierarchies
with the encoding used. Either I get it from the checkgroups control messages of the hierarchies that mention it (notably fr), or I set it by hand.

This file is very simple. It mentions the hierarchy and the encoding
used, separated by a TAB.

Maybe adding an identical file to "active" and "newsgroups" ones would
do the trick?

##
## This file lists the charsets used by hierarchies.
##
## Format: hierarchy<TAB>charset
cn Big5
fr UTF-8

And so on.

?

Franck
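
A file in that format could be read with a few lines of Python. This is only a sketch: the function names and the UTF-8 default are assumptions, and it does not reflect how SNS actually loads the file.

```python
def parse_charsets(text):
    """Parse 'hierarchy<TAB>charset' lines, skipping blanks and '#' comments."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hierarchy, _, charset = line.partition("\t")
        if charset:  # ignore lines that have no TAB-separated charset
            table[hierarchy.strip()] = charset.strip()
    return table


def charset_for(group, table, default="UTF-8"):
    """Look up the charset by the first component of the group name."""
    return table.get(group.split(".", 1)[0], default)
```

Keying on the top-level hierarchy keeps the file short, at the cost of not handling single groups whose encoding differs from the rest of their hierarchy.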


From Franck@21:1/5 to All on Tue Apr 11 10:04:12 2023

Hello,

Since my server has a graphical interface that displays group names and descriptions, I need to be able to know which encoding is used.

It looks like this: https://i.ibb.co/LYcsQVQ/Console.png


From Franck@21:1/5 to All on Tue Apr 11 09:52:41 2023

Hello,

If there’s going to be a global choice of encoding then it has to be
UTF-8.

+1


From Julien ÉLIE@21:1/5 to All on Tue Apr 11 12:12:03 2023

Hi Nigel,

Next is a little more of a challenge. Trying to sync the descriptions.
It wouldn't be so bad if everyone used the same encoding, however the majority are using ISO-8859-1, a couple are using UTF-8, some using
ASCII and one is Non-ISO extended-ASCII.

FWIW, the descriptions encoded in UTF-8 from the ftp.isc.org newsgroup
file are here:
http://usenet.trigofacile.com/hierarchies/data/newsgroups.utf8

It may facilitate your life :-)

The conversions I found out to work are:
- cn.* and han.* are encoded in gb18030;
- fido7.*, medlux.* and relcom.* in koi8-r;
- ukr.* in koi8-u;
- nctu.*, ncu.* and tw.* in big5;
- scout.forum.chinese and scout.forum.korean in big5;
- eternal-september.*, fido.* and fr.* in utf-8;
- all the others fit well in cp1252.
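
The list above can be turned into a per-line converter. The following Python sketch assumes each newsgroups line is "name<TAB>description" read as raw bytes; the rule table mirrors the list, while the function names and the use of errors="replace" for undecodable bytes are assumptions.

```python
# (prefix, Python codec name), checked in order; the first match wins.
RULES = [
    ("scout.forum.chinese", "big5"),
    ("scout.forum.korean", "big5"),
    ("cn.", "gb18030"), ("han.", "gb18030"),
    ("fido7.", "koi8-r"), ("medlux.", "koi8-r"), ("relcom.", "koi8-r"),
    ("ukr.", "koi8-u"),
    ("nctu.", "big5"), ("ncu.", "big5"), ("tw.", "big5"),
    ("eternal-september.", "utf-8"), ("fido.", "utf-8"), ("fr.", "utf-8"),
]
DEFAULT = "cp1252"  # "all the others"


def source_charset(group):
    """Return the codec assumed for a group name."""
    for prefix, charset in RULES:
        if group.startswith(prefix):
            return charset
    return DEFAULT


def to_utf8(raw_line):
    """Re-encode the description part of one raw line as UTF-8."""
    name, sep, description = raw_line.partition(b"\t")
    charset = source_charset(name.decode("ascii", errors="replace"))
    # Bytes that are undefined in the source codec become U+FFFD.
    text = description.decode(charset, errors="replace")
    return name + sep + text.encode("utf-8")
```

Because "fido7." and "fido." differ at the fifth character, the two prefixes cannot shadow each other regardless of their order in the table.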

Going forward, maybe the powers that be can get their heads together
and enforce a certain coding standard for innd (and whatever else is
out there) that is at least maintained.

UTF-8 is the expected encoding for the descriptions returned by a LIST NEWSGROUPS command in the NNTP protocol.

--
Julien ÉLIE

"And now, the ball is in the slalom skiers' court."


From Franck@21:1/5 to All on Tue Apr 11 14:34:50 2023

Hi Julien.

FWIW, the descriptions encoded in UTF-8 from the ftp.isc.org newsgroup
file are here:
http://usenet.trigofacile.com/hierarchies/data/newsgroups.utf8

And you're only telling us now!?!?!? ;-)

Why not put it at ftp.isc.org?

I had looked on your site, notably at the list of Usenet public managed hierarchies, but I did not find this list, which is why I coded it in
SNS. Maybe I did not look carefully enough...

It may facilitate your life :-)

I think so, I'll use it instead of the one listed at ftp.isc.org and
will remove some lines of code in SNS.

Franck


From Adam W.@21:1/5 to iulius@nom-de-mon-site.com.invalid on Tue Apr 11 15:13:09 2023

Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> wrote:

The conversions I found out to work are:

In Poland (pl.*) we traditionally used iso-8859-2 for posting (now I think utf-8 has become a de facto standard, but iso-8859-2 is still accepted),
but I can see that group descriptions for pl.* are just transliterated
(there are no national characters used, all are 7-bit, or us-ascii).

The main question is if currently used readers can handle utf-8 in group descriptions. If yes, I'd stick with utf-8. If not, then I think it would
be safest to transliterate the descriptions to us-ascii (if it can be done
for all encodings; for Polish national characters it's perfectly fine, but
I don't know how it works with non-Latin alphabets like Russian or
Chinese).


From Olivier Miakinen@21:1/5 to All on Tue Apr 11 17:38:16 2023

Hello Adam,

On 11/04/2023 17:15, Adam H. Kerman wrote:

If there’s going to be a global choice of encoding then it has to be UTF-8.

ASCII's advantage over UTF-8 is its universality.

I would have said exactly the opposite: UTF-8's advantage over ASCII
is its universality, because UTF-8 can express any character from any
language.

But of course ASCII's advantage over UTF-8 is that it is recognized by
all Usenet software.

[...]

Gee whiz. The ASCII apostrophe was used ambiguously as single close quote
AND as the combining diacritical mark for the acute accent since 1967.

Oh? Which software does that weird thing? Surely it is not a standard use
of ASCII.

--
Olivier Miakinen


From Adam H. Kerman@21:1/5 to Richard Kettlewell on Tue Apr 11 15:15:50 2023

Richard Kettlewell <invalid@invalid.invalid> wrote:

Nigel Reed <sysop@endofthelinebbs.com> writes:

I know there isn't a standard encoding for the newsgroup file but that
may have been a bit of an oversight now that some people are trying to
run a clean server.

Going forward, maybe the powers that be can get their heads together
and enforce a certain coding standard for innd (and whatever else is
out there) that is at least maintained. Personally, I don't care which
one we end up with, ISO-8859 seems to be the far more popular (7
servers) followed by ASCII (4) then UTF-8 (3) .

If there’s going to be a global choice of encoding then it has to be
UTF-8.

ASCII's advantage over UTF-8 is its universality.

You yourself just used the UTF-8 character code for single close quote ambiguously as an apostrophe. The character has been used ambiguously
in the current version of Unicode, replacing another character code that
was used to indicate a glottal stop as a letter modifier.

Gee whiz. The ASCII apostrophe was used ambiguously as single close quote
AND as the combining diacritical mark for the acute accent since 1967.
Where is the UTF-8 advantage if there continue to be ambiguously used character codes for such common punctuation marks?

If there's going to be a global choice, then stop using UTF-8 character
codes to substitute for ASCII in plain text communication. Use open and
close single and double quotes ONLY in typography, not email and not
Usenet. This thwarts communication. It makes a difference as ASCII is
universal and UTF-8 is not.

I'm pointing out again that ASCII had combining characters but it didn't include all possible diacritical marks like umlaut, but it has acute,
grave, circumflex, tilde, slash, cedilla, and I'm sure I've forgotten.

Teletypewriters could perform the combining action with a backspace/
overstrike sequence but terminals didn't usually display them.

I guess we'd need all the servers to want to agree and update their
files accordingly.

That’s the hard bit...


From Adam H. Kerman@21:1/5 to Olivier Miakinen on Tue Apr 11 16:44:25 2023

Olivier Miakinen <om+news@miakinen.net> wrote:

04/11/2023 17:15, Adam H. Kerman wrote:

If there’s going to be a global choice of encoding then it has to be
UTF-8.

ASCII's advantage over UTF-8 is its universality.

I would have said exactly the opposite: UTF-8's advantage over ASCII
is its universality, because UTF-8 can express any character from any
language.

Well, yes, if one's setup displays UTF-8, but every setup can use
ASCII.

But of course ASCII's advantage over UTF-8 is that it is recognized by
all Usenet software.

[...]

Gee whiz. The ASCII apostrophe was used ambiguously as single close quote
AND as the combining diacritical mark for the acute accent since 1967.

Oh? Which software does that weird thing? Surely it is not a standard use
of ASCII.

ASCII was the 7-bit encoding used for teletypewriters, an improvement
over 5-bit Baudot code. Backspace/overstrike sequences were the way
diacritic marks were combined with the alphabetic character on a teletypewriter. No, generally this wasn't implemented in computer
software.

But the notion that ASCII wasn't intended for Latin alphabets beyond
English used in America was always wrong.


From Nigel Reed@21:1/5 to iulius@nom-de-mon-site.com.invalid on Tue Apr 11 11:43:57 2023

On Tue, 11 Apr 2023 12:12:03 +0200
Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> wrote:

Going forward, maybe the powers that be can get their heads together
and enforce a certain coding standard for innd (and whatever else is
out there) that is at least maintained.

UTF-8 is the expected encoding for the descriptions returned by a
LIST NEWSGROUPS command in the NNTP protocol.

That's good to know. I've been converting everything to fit into my
newsgroups file which is ISO-8859 so it looks like I've been going the
wrong way. Back to the drawing board now that my scripts are almost
done lol.

--
End Of The Line BBS - Plano, TX
telnet endofthelinebbs.com 23


From Richard@21:1/5 to All on Tue Apr 11 17:07:40 2023

[Please do not mail me a copy of your followup]

Richard Kettlewell <invalid@invalid.invalid> spake the secret code <wwvh6tmg8oa.fsf@LkoBDZeT.terraraq.uk> thusly:

If there's going to be a global choice of encoding then it has to be
UTF-8.

You mean, so you can use a gratuitously fancy apostrophe character
instead of the ASCII ' character that serves exactly the same purpose
with fewer problems?

UTF-8 is great for non-Latin codepoints like Asian languages and
Klingon.

Where UTF-8 fails is in using fancy codepoints for the functional
equivalent of the same ASCII character.
--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Terminals Wiki <http://terminals-wiki.org>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>


From Tom Furie@21:1/5 to Nigel Reed on Tue Apr 11 17:11:21 2023

On 2023-04-11, Nigel Reed <sysop@endofthelinebbs.com> wrote:

On Tue, 11 Apr 2023 12:12:03 +0200
Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> wrote:

UTF-8 is the expected encoding for the descriptions returned by a
LIST NEWSGROUPS command in the NNTP protocol.

That's good to know. I've been converting everything to fit into my newsgroups file which is ISO-8859 so it looks like I've been going the
wrong way. Back to the drawing board now that my scripts are almost
done lol.

At least the conversion from ISO-8859 to UTF-8 will be much more straightforward than the conversion from <whatever encoding> to ISO-8859
;)

Cheers,
Tom
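
To illustrate why that direction is the easy one, here is a small Python sketch, assuming the ISO-8859 file in question is ISO-8859-1 (the function names are hypothetical). Every byte is a valid ISO-8859-1 character, so decoding can never fail; the reverse direction raises as soon as a character falls outside that repertoire.

```python
def latin1_to_utf8(raw):
    """Always succeeds: each of the 256 byte values maps to one code point."""
    return raw.decode("iso-8859-1").encode("utf-8")


def utf8_to_latin1(raw):
    """Raises UnicodeEncodeError for characters ISO-8859-1 cannot represent."""
    return raw.decode("utf-8").encode("iso-8859-1")
```

The catch is that a conversion that "never fails" also never complains when the input was actually in some other 8-bit encoding, so the source charset still has to be known beforehand.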


From Adam H. Kerman@21:1/5 to Russ Allbery on Tue Apr 11 17:32:03 2023

Russ Allbery <eagle@eyrie.org> wrote:

Nigel Reed <sysop@endofthelinebbs.com> writes:

Going forward, maybe the powers that be can get their heads together and
enforce a certain coding standard for innd (and whatever else is out
there) that is at least maintained. Personally, I don't care which one
we end up with, ISO-8859 seems to be the far more popular (7 servers)
followed by ASCII (4) then UTF-8 (3) .

It's been on my list for years to encode the ftp.isc.org newsgroups file
uniformly in UTF-8, which I think is a prerequisite for enforcing
something in innd, but it's a bunch of tedious work and I haven't found
the time yet.

If you do that, may I request that ASCII equivalents be substituted for
UTF-8 punctuation in brief descriptions? Pretty please?


From Russ Allbery@21:1/5 to Nigel Reed on Tue Apr 11 10:19:08 2023

Nigel Reed <sysop@endofthelinebbs.com> writes:

Going forward, maybe the powers that be can get their heads together and enforce a certain coding standard for innd (and whatever else is out
there) that is at least maintained. Personally, I don't care which one
we end up with, ISO-8859 seems to be the far more popular (7 servers) followed by ASCII (4) then UTF-8 (3) .

It's been on my list for years to encode the ftp.isc.org newsgroups file uniformly in UTF-8, which I think is a prerequisite for enforcing
something in innd, but it's a bunch of tedious work and I haven't found
the time yet.

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.


From Russ Allbery@21:1/5 to Adam W. on Tue Apr 11 10:20:32 2023

gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.) writes:

The main question is if currently used readers can handle utf-8 in group descriptions. If yes, I'd stick with utf-8. If not, then I think it would
be safest to transliterate the descriptions to us-ascii (if it can be done for all encodings;

It definitely cannot. It's rare to find a language where that can be done without losing information (only in Europe, essentially).

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.


From Russ Allbery@21:1/5 to Adam H. Kerman on Tue Apr 11 12:47:41 2023

"Adam H. Kerman" <ahk@chinet.com> writes:

If you do that, may I request that ASCII equivalents be substituted for
UTF-8 punctuation in brief descriptions? Pretty please?

The goal of all of that machinery is that the hierarchy administrators
should be canonical for the newsgroups entries for their hierarchy.
Encoding is one of those things where we need to standardize in order to,
say, comply with the NNTP standard, but I'm not willing to make any other editorial judgments because it gets into too much annoying work. So this
is something you should take up with the hierarchy administrators.

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.


From Adam H. Kerman@21:1/5 to Russ Allbery on Tue Apr 11 20:02:38 2023

Russ Allbery <eagle@eyrie.org> wrote:

"Adam H. Kerman" <ahk@chinet.com> writes:

If you do that, may I request that ASCII equivalents be substituted for
UTF-8 punctuation in brief descriptions? Pretty please?

The goal of all of that machinery is that the hierarchy administrators
should be canonical for the newsgroups entries for their hierarchy.
Encoding is one of those things where we need to standardize in order to,
say, comply with the NNTP standard, but I'm not willing to make any other
editorial judgments because it gets into too much annoying work. So this
is something you should take up with the hierarchy administrators.

I apologize for suggesting additional programming work for you. I change
my request to asking for an amendment to your README in which you might
urge a proponent or hierarchy administrator not to use UTF-8 punctuation
for which ASCII punctuation would suffice, to avoid needlessly turning a description into UTF-8.
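
For anyone who wants to apply that suggestion to their own descriptions, a Python sketch follows. The mapping is an assumption covering only a handful of common typographic marks, and the function name is hypothetical; letters with diacritics are deliberately left alone.

```python
# Typographic punctuation mapped to ASCII stand-ins (assumed, not exhaustive).
PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark, often used as apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2026": "...", # horizontal ellipsis
})


def ascii_punctuation(text):
    """Replace the mapped punctuation; every other character is untouched."""
    return text.translate(PUNCTUATION)
```

A description that is ASCII apart from such punctuation becomes pure ASCII after this step, which is the case the request is about.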


From news@zzo38computer.org.invalid@21:1/5 to Nigel Reed on Tue Apr 11 15:00:29 2023

Nigel Reed <sysop@endofthelinebbs.com> wrote:

This becomes a problem when trying to do a diff or other operations
trying to match group names.

My opinion is that newsgroup names should be purely ASCII (there are many benefits to this, and using non-ASCII characters in newsgroup names and
domain names and commands and configuration files can cause many problems, including security issues (especially if any Unicode-based encoding is
used; non-Unicode has fewer security issues, but still is not worth it to
use non-ASCII in these cases), comparisons, input, etc).

Descriptions for non-English newsgroups, and non-English articles in such newsgroups, should probably use the appropriate encodings for those
languages, rather than ASCII only (it is not as problematic to use
non-ASCII characters in newsgroup descriptions, and there are clearly
benefits to doing so in some cases, so it should be permitted in
descriptions and in articles; however, if the text is English only then
it is almost always more beneficial to stick to ASCII only, I think).

Going forward, maybe the powers that be can get their heads together
and enforce a certain coding standard for innd (and whatever else is
out there) that is at least maintained. Personally, I don't care which
one we end up with, ISO-8859 seems to be the far more popular (7
servers) followed by ASCII (4) then UTF-8 (3) .

Well, fortunately, ISO-8859-1 and UTF-8 are both supersets of ASCII, so
if you use ASCII as much as possible then it will still work. (But, I
really hate Unicode; it is full of problems, including Han unification
and other complications; and it is a stateful character set even though
the encoding is stateless. TRON character code is better in some ways (especially for Japanese text), and I have done some work using this.)

However, also, enforcing a certain coding standard (regardless of what it
might be, whether it is Unicode or TRON or something else) can be a problem when you will need other encodings for a reason not previously known by
whoever enforced them. Making recommendations can be helpful though, but I think that ASCII should be used when possible, and in some contexts (e.g.
the names of the commands, etc, in the computer programming) should be
required to be ASCII only.

It might also be worth mentioning which character encodings it uses in the CAPABILITIES, on servers where that is applicable. (The RFC says that it
should be UTF-8, but I think that this is a mistake in the design of the protocol. Capabilities and commands should be pure ASCII, but this should
not mean that any text in articles, descriptions, MOTD, etc has to be pure ASCII; it can use other character sets, including the possibility of ones
which might be incompatible with Unicode, and including TRON codes too.)

Many people just want to put Unicode in everything, without actually understanding Unicode or international text or security or anything else,
and this just makes a mess (especially since Unicode itself is messy, but
even if using something else, just putting it in without any consideration, does not substitute for actual understanding). So, don't do that, please.

(And, for Usenet client programs intended for PC (if using DOS or other text-mode programs), use of PC character set may be beneficial.)

(I run an NNTP server with my own newsgroups, which are not (currently) considered part of Usenet, and currently have no need for non-ASCII descriptions, but if that changes in the future, then I will consider what to do. However, I also don't use INN, anyway.)

--
Don't laugh at the moon when it is day time in France.


From Adam W.@21:1/5 to Russ Allbery on Tue Apr 11 21:35:15 2023

Russ Allbery <eagle@eyrie.org> wrote:

It definitely cannot. It's rare to find a language where that can be done without losing information (only in Europe, essentially).

Well, to be honest, you lose some information, but it's very rare and can usually be deduced from context.


From Adam H. Kerman@21:1/5 to Adam W. on Tue Apr 11 22:26:11 2023

Adam W. <gof-cut-this-news@cut-this-chmurka.net.invalid> wrote:

Russ Allbery <eagle@eyrie.org> wrote:

It definitely cannot. It's rare to find a language where that can be done
without losing information (only in Europe, essentially).

Well, to be honest, you lose some information, but it's very rare and can
usually be deduced from context.

In a language that doesn't use the Latin alphabet? C'mon.


From Julien ÉLIE@21:1/5 to All on Wed Apr 12 10:17:18 2023

Hi all,

The RFC says that it
should be UTF-8, but I think that this is a mistake in the design of the protocol. Capabilities and commands should be pure ASCII, but this should
not mean that any text in articles, descriptions, MOTD, etc has to be pure ASCII; it can use other character sets, including the possibility of ones which might be incompatible with Unicode, and including TRON codes too.

We need an interoperable way to provide texts.
Please note RFC 2277 (BCP 18) about charsets:

Protocols MUST be able to use the UTF-8 charset, which consists of
the ISO 10646 coded character set combined with the UTF-8 character
encoding scheme, as defined in [10646] Annex R (published in
Amendment 2), for all text.

Protocols MAY specify, in addition, how to use other charsets or
other character encoding schemes for ISO 10646, such as UTF-16, but
lack of an ability to use UTF-8 is a violation of this policy; such a
violation would need a variance procedure ([BCP9] section 9) with
clear and solid justification in the protocol specification document
before being entered into or advanced upon the standards track.

For existing protocols or protocols that move data from existing
datastores, support of other charsets, or even using a default other
than UTF-8, may be a requirement. This is acceptable, but UTF-8
support MUST be possible.

(I run an NNTP server with my own newsgroups, which are not (currently) considered part of Usenet, and currently have no need for non-ASCII descriptions, but if that changes in the future, then I will consider what to do. However, I also don't use INN, anyway.)

FWIW, INN does not enforce UTF-8 in the descriptions of newsgroups. You
can use any encoding you want for them.

--
Julien ÉLIE

"They turned down a Norman's offer?!?" (Astérix)


From Adam W.@21:1/5 to Adam H. Kerman on Wed Apr 12 12:06:06 2023

Adam H. Kerman <ahk@chinet.com> wrote:

Well, to be honest, you lose some information, but it's very rare and can
usually be deduced from context.

In a language that doesn't use the Latin alphabet? C'mon.

No, I'm only talking about Polish.


From Adam H. Kerman@21:1/5 to Julien on Wed Apr 12 12:41:07 2023

Julien <iulius@nom-de-mon-site.com.invalid> wrote:

. . .

FWIW, INN does not enforce UTF-8 in the descriptions of newsgroups. You
can use any encoding you want for them.

The newgroup or checkgroups messages could have MIME headers specifying
the character set but these won't survive processing, so a big text file
will have multiple unspecified encodings. Aargh.

Just stating the obvious here.
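
The charset declaration can at least be read before it is thrown away. A Python sketch using the standard email package, assuming the control message is available as raw bytes and carries an ordinary MIME Content-Type header; the function name and the us-ascii fallback are assumptions.

```python
from email import message_from_bytes


def declared_charset(raw_article, default="us-ascii"):
    """Return the charset parameter of Content-Type, lowercased, or the default."""
    message = message_from_bytes(raw_article)
    return message.get_content_charset() or default
```

Storing that value next to the hierarchy name when the checkgroups is processed would keep the information that the merged newsgroups file otherwise loses.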


From Adam H. Kerman@21:1/5 to Adam W. on Wed Apr 12 12:42:04 2023

Adam W. <gof-cut-this-news@cut-this-chmurka.net.invalid> wrote:

Adam H. Kerman <ahk@chinet.com> wrote:

Well, to be honest, you lose some information, but it's very rare and can
usually be deduced from context.

In a language that doesn't use the Latin alphabet? C'mon.

No, I'm only talking about Polish.

You are. Russ wasn't.


From Russ Allbery@21:1/5 to Adam H. Kerman on Wed Apr 12 07:43:17 2023

"Adam H. Kerman" <ahk@chinet.com> writes:

Adam W. <gof-cut-this-news@cut-this-chmurka.net.invalid> wrote:

Adam H. Kerman <ahk@chinet.com> wrote:

Well, to be honest, you lose some information, but it's very rare and
can usually be deduced from context.

In a language that doesn't use the Latin alphabet? C'mon.

No, I'm only talking about Polish.

You are. Russ wasn't.

Yeah, but I understood Adam was only talking about Polish in his reply.

It's common in a lot of European languages to be able to transliterate to
ASCII using various schemes without losing *too* much information. German
has a standard scheme, some Scandinavian languages have an old scheme that
used to be used when it was hard to find anything other than ASCII, some
other European languages are still comprehensible if all the diacritic
marks are stripped even though it looks weird, etc. Apparently Polish is
one of those (I know very little about Polish, sadly).

Arguably, English is itself a case of being able to transliterate to ASCII without losing too much information, depending on how you feel about the correct spelling of Zoë and naïve (I think everyone but the New Yorker has given up on coöperate), or how much you care about reproducing English
poetry containing words like learnèd.

But if one gets too far beyond Europe, or even farther into eastern Europe
and non-Romance languages, the transliterations get more and more dubious
or simply nonexistent.

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.


From Adam W.@21:1/5 to Russ Allbery on Wed Apr 12 15:23:26 2023

Russ Allbery <eagle@eyrie.org> wrote:

Yeah, but I understood Adam was only talking about Polish in his reply.

Yes, exactly. That's the only non-English language I know.

German has a standard scheme,

Do you mean substituting umlauts with their Latin equivalents and adding
"e"?

ä = ae
ö = oe
ü = ue

At least that's what I found:

https://blogs.transparent.com/german/writing-the-letters-%E2%80%9Ca%E2%80%9D-%E2%80%9Co%E2%80%9D-and-%E2%80%9Cu%E2%80%9D-without-a-german-keyboard/

I also know that their ß (scharfes S) can be substituted with ss.

some other European languages are still comprehensible if all the
diacritic marks are stripped even though it looks weird, etc.
Apparently Polish is one of those (I know very little about Polish,
sadly).

It is. There are some word plays, because some words have different
meanings with and without diacritics (for example, "łaska" and "laska"
mean different things), but they're rare and correct meaning can be
deduced from context.


From Russ Allbery@21:1/5 to Adam W. on Wed Apr 12 09:34:18 2023

gof-cut-this-news@cut-this-chmurka.net.invalid (Adam W.) writes:

Russ Allbery <eagle@eyrie.org> wrote:

German has a standard scheme,

Do you mean substituting umlauts with their Latin equivalents and adding
"e"?

ä = ae
ö = oe
ü = ue

At least that's what I found:

Yeah, exactly.

I know there's a similar one for Scandinavian languages that uses
characters like { and } to stand in for characters that don't exist in
ASCII (I think because those keys on an English keyboard were in the same location as the real letters on a Scandinavian keyboard), but this is now obscure enough that my Google skills are failing me. Old-timers would
probably still recognize that encoding, but I think everyone just uses
UTF-8 now.

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.


From Michael Bäuerle@21:1/5 to Adam W. on Wed Apr 12 19:53:16 2023

Adam W. wrote:

[German has a standard scheme]
ä = ae
ö = oe
ü = ue

Can be used in all cases for German.

Same for capital umlauts:

Ä = Ae (or AE)
Ö = Oe (or OE)
Ü = Ue (or UE)

At least that's what I found:

https://blogs.transparent.com/german/writing-the-letters-%E2%80%9Ca%E2%80%9D-%E2%80%9Co%E2%80%9D-and-%E2%80%9Cu%E2%80%9D-without-a-german-keyboard/

| Bräuche – Braeuche (costumes) and Bäuche – Baeuche (bellies)
                      ^^^^^^^^
This is wrong. "Bräuche" means something like "conventions".
My dictionary says "customs" (sounds a bit similar compared to
"costumes").

I also know that their ß (scharfes S) can be substituted with ss.

There are some cases for which this alters the meaning (sometimes "sz"
is used for them). Normally no problem if context is available.
If in doubt, use "ss".
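
The scheme described above fits in a small translation table. A Python sketch, assuming mixed-case output for capital umlauts ("Ae" rather than "AE") and "ss" for ß; the function name is hypothetical.

```python
import unicodedata

# Umlauts and sharp s mapped to their usual ASCII spellings.
GERMAN = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate_german(text):
    """Transliterate umlauts and sharp s; other characters are kept as they are."""
    # Normalize first so that "a" + combining diaeresis is treated like "ä".
    return unicodedata.normalize("NFC", text).translate(GERMAN)
```

The mapping is one-way: "Masse" could come from either "Masse" or "Maße", which is the kind of ambiguity mentioned above.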


From Olivier Miakinen@21:1/5 to All on Wed Apr 12 21:30:33 2023

On 12/04/2023 10:17, Julien ÉLIE wrote:

We need an interoperable way to provide texts.
Please note RFC 2277 (BCP 18) about charsets:

Protocols MUST be able to use the UTF-8 charset, which consists of
the ISO 10646 coded character set combined with the UTF-8 character
encoding scheme, as defined in [10646] Annex R (published in
Amendment 2), for all text.

Protocols MAY specify, in addition, how to use other charsets or
other character encoding schemes for ISO 10646, such as UTF-16, but
lack of an ability to use UTF-8 is a violation of this policy; such a
violation would need a variance procedure ([BCP9] section 9) with
clear and solid justification in the protocol specification document
before being entered into or advanced upon the standards track.

For existing protocols or protocols that move data from existing
datastores, support of other charsets, or even using a default other
than UTF-8, may be a requirement. This is acceptable, but UTF-8
support MUST be possible.

And RFC 2277 is a quarter of a century old (January 1998).

--
Olivier Miakinen

From Urs Janßen@21:1/5 to Russ Allbery on Wed Apr 12 23:17:46 2023

In <87fs95oy9h.fsf@hope.eyrie.org> on Wed, 12 Apr 2023 18:34:18,
Russ Allbery wrote:

I know there's a similar one for Scandinavian languages that uses
characters like { and } to stand in for characters that don't exist in
ASCII (I think because those keys on an English keyboard were in the same
location as the real letters on a Scandinavian keyboard), but this is now
obscure enough that my Google skills are failing me. Old-timers would
probably still recognize that encoding, but I think everyone just uses
UTF-8 now.

JFTR, see "Table 3" from http://bzr.tin.org/doc/iso2asc.txt
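
For readers who have not seen such text, here is a small Python sketch of the convention being described. It assumes the usual ISO 646 national-variant assignments for the six positions `[ \ ] { | }`; whether this matches "Table 3" of the linked document exactly has not been checked here, and the helper name is made up.

```python
# Assumed ISO 646 national-variant assignments for the bracket positions.
SWEDISH = str.maketrans("[\\]{|}", "ÄÖÅäöå")
DANISH_NORWEGIAN = str.maketrans("[\\]{|}", "ÆØÅæøå")


def decode_7bit(text: str, table: dict) -> str:
    """Read bracket characters as the national letters they stood in for."""
    return text.translate(table)


print(decode_7bit("r|dgr|d med fl|de", DANISH_NORWEGIAN))  # rødgrød med fløde
print(decode_7bit("G|teborg", SWEDISH))                    # Göteborg
```

The mapping is lossy in the other direction: a literal brace or pipe cannot be told apart from a letter, which is one reason the scheme fell out of use.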

From Julien ÉLIE@21:1/5 to All on Thu Apr 13 08:59:12 2023

Hi Adam,

The goal of all of that machinery is that the hierarchy administrators
should be canonical for the newsgroups entries for their hierarchy.
Encoding is one of those things where we need to standardize in order to,
say, comply with the NNTP standard, but I'm not willing to make any other
editorial judgments because it gets into too much annoying work. So this
is something you should take up with the hierarchy administrators.

I apologize for suggesting additional programming work for you. I change
my request to asking for an amendment to your README in which you might
urge a proponent or hierarchy administrator not to use UTF-8 punctuation
for which ASCII punctuation would suffice, to avoid needlessly turning a
description into UTF-8.

Wouldn't a 100% ASCII-encoded file fit your needs?
I've just generated this one with the Text::Unidecode Perl module:
http://usenet.trigofacile.com/hierarchies/data/newsgroups.ascii

Punctuation such as French quotation marks («»), non-breaking spaces, etc.
is converted into ASCII, as is of course any other character.

ftp.isc.org could then make available both files (.utf8 and .ascii).
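
As a rough illustration of what such an ASCII fold does, here is a Python sketch. It is not Text::Unidecode: it swaps in Unicode NFKD decomposition plus a small hand-written punctuation table (an assumption made for this example), so letters with no decomposition, such as ø or Cyrillic, are simply dropped rather than transliterated.

```python
import unicodedata

# Illustrative, non-exhaustive table for punctuation with no decomposition.
PUNCTUATION = {
    "«": '"', "»": '"',
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # horizontal ellipsis
}


def to_ascii(text: str) -> str:
    """Fold text to ASCII: map punctuation, then strip combining marks."""
    text = "".join(PUNCTUATION.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII (accents, unmapped letters) is dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("« Débats généraux »"))  # " Debats generaux "
```

A no-break space (U+00A0) decomposes to a plain space under NFKD, so it needs no entry in the table.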

--
Julien ÉLIE

« Ils ont refusé une offre de Normand ?!? » (Astérix)

From Julien ÉLIE@21:1/5 to All on Thu Apr 13 09:08:14 2023

Hi Adam,

FWIW, INN does not enforce UTF-8 in the descriptions of newsgroups. You
can use any encoding you want for them.

The newgroup or checkgroups messages could have MIME headers specifying
the character set but these won't survive processing, so a big text file
will have multiple unspecified encodings. Aargh.

My sentence was not about the processing of control messages (for which
the encoding in the MIME headers is correctly parsed, and the descriptions
are actually converted to UTF-8 for the sake of homogeneity).
The descriptions of newsgroups for which control articles are sent end
up in UTF-8.

Besides, there's a /localencoding/ setting in control.ctl to
parameterize the resulting encoding. The default is UTF-8 but one may
change it to another encoding if he wants.
https://www.eyrie.org/~eagle/software/inn/docs/control.ctl.html

My sentence was just about the encoding of the newsgroups file; INN will
provide its contents as-is when the descriptions are requested. If it
has multiple unspecified encodings (big5, iso-8859-xx, utf8, cp1252...),
it will provide them as-is. It won't try to convert them on the fly.
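
Since the file is served as-is, one way to see how mixed it is would be to check each line for UTF-8 validity. The Python sketch below assumes the file has been read as raw bytes; the function name and the sample lines are invented for illustration.

```python
def find_non_utf8_lines(raw: bytes):
    """Return (line_number, line_bytes) for each line that is not valid UTF-8."""
    bad = []
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


# Made-up sample: the first line is UTF-8, the second is ISO-8859-1.
sample = (
    "fr.test\tGroupe de test (modéré)\n".encode("utf-8")
    + "de.test\tTestgruppe für alle\n".encode("iso-8859-1")
)
for number, line in find_non_utf8_lines(sample):
    print(number, line)
```

This only says which lines are not UTF-8. It cannot say which legacy encoding they are in, and pure ASCII lines pass regardless of what the author intended.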

--
Julien ÉLIE

« – Laissons-lui notre char et prenons le sien…
– Oui, ça nous dépannera… » (Astérix)

From Russ Allbery@21:1/5 to iulius@nom-de-mon-site.com.invalid on Thu Apr 13 08:12:22 2023

Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> writes:

My sentence was just about the encoding of the newsgroups file; INN will
provide its contents as-is when the descriptions are requested. If it
has multiple unspecified encodings (big5, iso-8859-xx, utf8, cp1252...),
it will provide them as-is. It won't try to convert them on the fly.

The fundamental protocol problem here is that the LIST NEWSGROUPS command
has no way to convey an encoding, let alone a different encoding for every
line. It's all well and good for different hierarchies to use different
encodings and use appropriate MIME headers for their control messages to
convey that encoding; all of that would in theory work as expected. But,
at the end of that process, a given news server returns the whole thing in
response to LIST NEWSGROUPS, and it has to pick a single encoding for that
response.

It's not even about the storage, really. Yes, right now INN uses a single
big file, but it doesn't need to do that. In theory, it could use some
smarter storage mechanism that preserved the original encoding. But that
doesn't help because of the protocol; it still has to respond to LIST
NEWSGROUPS commands, and at that point the separate encodings don't help.

The only workable choices for a single encoding are ASCII and UTF-8;
everything else is much worse in terms of interoperability. ASCII is not
generally sufficient as soon as one gets too far from western Europe and,
truly, is not really sufficient for western European languages either;
while it may be possible to read French with stripped accent marks or
Spanish without tildes, it's annoying, sometimes ambiguous, and there's no
reason to put up with it in 2023. Hence UTF-8.

Given that, working backwards, sending hierarchy control messages in a
different encoding than UTF-8 (or ASCII, which is a UTF-8 subset) is
probably not the best approach. Even if the news software understands the
MIME headers properly and knows the encoding (which can be dubious), now
the content has to be recoded into UTF-8 by the news server anyway. While
this is a well-defined operation for most encodings, it adds another step
that can fail and another opportunity for something to go wrong.
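
To make the "another step that can fail" point concrete, here is a minimal Python sketch of the recoding step. It is not INN's code; `recode_to_utf8` and the sample strings are assumptions for illustration only.

```python
def recode_to_utf8(raw: bytes, declared_charset: str) -> bytes:
    """Recode a description from its declared charset to UTF-8.

    Raises ValueError if the charset is unknown or the bytes do not fit it.
    """
    try:
        text = raw.decode(declared_charset)
    except (UnicodeDecodeError, LookupError) as exc:
        raise ValueError(f"cannot decode as {declared_charset!r}: {exc}") from exc
    return text.encode("utf-8")


raw = "Обсуждение".encode("koi8-r")

# Correct label: the round trip works.
print(recode_to_utf8(raw, "koi8-r").decode("utf-8"))

# Wrong but "valid" label: no error is raised, the result is just garbage.
print(recode_to_utf8(raw, "iso-8859-1").decode("utf-8"))

# Unknown label: this fails loudly.
try:
    recode_to_utf8(raw, "no-such-charset")
except ValueError as exc:
    print("rejected:", exc)
```

The middle case is the awkward one: a mislabelled single-byte encoding recodes without complaint, so the server cannot notice that the MIME header was wrong.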

The best results are likely to come from using UTF-8 end-to-end. This
also has the advantage of being the direction that computing is going
anyway. My understanding is that even Chinese domestic use is
increasingly UTF-8, although support for other encodings is still required
in some situations. (Chinese was a potential sticking point due to issues
with how Chinese, Japanese, and Korean were encoded in Unicode that's more
complex than is worth getting into.)

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

From Russ Allbery@21:1/5 to news@zzo38computer.org.invalid on Thu Apr 13 08:28:40 2023

news@zzo38computer.org.invalid writes:

My opinion is that newsgroup names should be purely ASCII (there are
many benefits to this, and using non-ASCII characters in newsgroup names
and domain names and commands and configuration files can cause many
problems, including security issues (especially if any Unicode-based
encoding is used; non-Unicode has fewer security issues, but still is not
worth it to use non-ASCII in these cases), comparisons, input, etc).

It's very easy for someone who speaks English to say that newsgroup names
should be purely ASCII, but what we would be saying is that people who
only speak Japanese (or Chinese, or Russian, or...) should put up with
newsgroup names being opaque, incomprehensible blobs of foreign
characters. Imagine what Usenet would be like for you if every newsgroup
name was in Arabic (or, if you happen to read Arabic, Korean, or some
other language).

Historically, that is exactly what we have said. But I think that's sad.

The security problem is real, but honestly that's largely because Usenet
software is very old and is often written in languages that, if not
actually dying, are at least very stagnant. Handling encodings properly
in C is a pain, but that's because doing anything properly in C is a pain.
Every modern language comes with extremely well-tested libraries, and most
of them now make *not* dealing with Unicode very difficult; it just
happens automatically. The remaining non-coding problems are mostly about
homograph attacks, and that's not much of an issue with newsgroup names.

Using multiple encodings, as you say, definitely makes the problem worse,
since you can't simply reject all invalid UTF-8 very early on, since you
may instead be dealing with ISO-8859-1 or some other encoding.
Thankfully, there's no real reason to support anything other than UTF-8
now. The remaining question is whether Usenet software can cope in
practice, or whether, like DNS and email, we'll be forced into using
complicated ASCII-compatible encoding schemes. Experiments so far seemed
to indicate that native Usenet software support for UTF-8 newsgroup names
wasn't that bad.

I can't think of any other major Internet protocol, not even domain names,
that is still limited to ASCII. Newsgroup names are a sad outlier.

(But, I really hate Unicode; it is full of problems, including Han
unification and other complications; and it is a stateful character set
even though the encoding is stateless. TRON character code is better in
some ways (especially for Japanese text), and I have done some work
using this.)

I hate the email message format (it should be something much less
ambiguous and machine-parsable), the RFC 2822 Date format, and RFC 2047
header encoding. The price of implementing protocols is that there will
always be parts of them you don't like because life is compromise.

The RFC says that it should be UTF-8, but I think that this is a mistake
in the design of the protocol.

Mistake or not, it's not going to change now.

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

From Adam H. Kerman@21:1/5 to Russ Allbery on Thu Apr 13 16:52:00 2023

Russ Allbery <eagle@eyrie.org> wrote:

. . . (Chinese was a potential sticking point due to issues
with how Chinese, Japanese, and Korean were encoded in Unicode that's more
complex than is worth getting into.)

Are you talking about character codes for the glyphs common to all three
languages, the CJK set, or something else entirely? Also, wasn't there
something about China modernizing glyphs that were still being used by
the other two languages?

From Russ Allbery@21:1/5 to Adam H. Kerman on Thu Apr 13 10:50:53 2023

"Adam H. Kerman" <ahk@chinet.com> writes:

Russ Allbery <eagle@eyrie.org> wrote:

. . . (Chinese was a potential sticking point due to issues with how
Chinese, Japanese, and Korean were encoded in Unicode that's more
complex than is worth getting into.)

Are you talking about character codes for the glyphs common to all three
languages, the CJK set, or something else entirely? Also, wasn't there
something about China modernizing glyphs that were still being used by
the other two languages?

Yeah, I'm talking about the glyph unification problem. I forget how much
impact the traditional vs. simplified Chinese distinction has on the
Unicode encoding and whether some of those distinctions are also unified.

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

From Russ Allbery@21:1/5 to Adam H. Kerman on Thu Apr 13 11:49:38 2023

"Adam H. Kerman" <ahk@chinet.com> writes:

I didn't sit in on these meetings years ago like you, but the little I
know about Chinese is that the traditional glyphs represented words and
not letters; they represent letters in the other two languages.

I'm fairly sure this isn't true in general. To the extent that the same
basic glyphs are used in Japanese kanji, I believe that they are also
words, or at least not letters in the sense of the Latin alphabet.
Japanese *kana* uses some Chinese characters to represent syllables
instead of words, but kana is a supplemental writing system used in
addition to kanji.

(Disclaimer that I do not speak or read any of these languages. I just
have a long-standing amateur interest in character sets.)

Hangul for Korean is different, but I don't think Hangul characters were
unified with Chinese and Japanese. I believe the impact on Korean was on
hanja, which is not used for most words. Hangul doesn't look anything
like Chinese or Japanese characters, to such an extent that I, as someone
who doesn't know any of these languages, can distinguish between Hangul
and the other languages on sight.

I'm aware that the glyphs are combinations of strokes that are common to
other glyphs, and I often wondered if the strokes themselves and not the
final result should have been what was encoded.

There was a fairly extensive discussion of this at the time, but they
decided against it for a bunch of reasons that I don't remember. I think
one of them was that the existing encodings of those languages did not do
this, and one of the goals of Unicode was to allow easy conversion from
and to existing character encodings.

The coding plane would have been a hell of a lot smaller.

Yes, but the software would have been a hell of a lot more complicated,
and it's not clear that's a good tradeoff. Arabic is already a
substantial challenge to support, and its combining characters are much
simpler than the system that would be required for stroke encoding, IIRC.

(Admittedly, most of the challenge with Arabic is the right-to-left directionality.)

--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

From Adam H. Kerman@21:1/5 to Russ Allbery on Thu Apr 13 18:31:36 2023

Russ Allbery <eagle@eyrie.org> wrote:

"Adam H. Kerman" <ahk@chinet.com> writes:

Russ Allbery <eagle@eyrie.org> wrote:

. . . (Chinese was a potential sticking point due to issues with how
Chinese, Japanese, and Korean were encoded in Unicode that's more
complex than is worth getting into.)

Are you talking about character codes for the glyphs common to all three
languages, the CJK set, or something else entirely? Also, wasn't there
something about China modernizing glyphs that were still being used by
the other two languages?

Yeah, I'm talking about the glyph unification problem. I forget how much
impact the traditional vs. simplified Chinese distinction has on the
Unicode encoding and whether some of those distinctions are also unified.

I didn't sit in on these meetings years ago like you, but the little I
know about Chinese is that the traditional glyphs represented words and
not letters; they represent letters in the other two languages.

I'm aware that the glyphs are combinations of strokes that are common to
other glyphs, and I often wondered if the strokes themselves and not the
final result should have been what was encoded. I don't know if in
handwriting the student was taught to draw the strokes in a certain order,
since order becomes important in representing the combined strokes for
the glyph.

I realize strokes aren't letter-equivalents in other languages.

The coding plane would have been a hell of a lot smaller.

From Adam H. Kerman@21:1/5 to Russ Allbery on Thu Apr 13 19:17:45 2023

Russ Allbery <eagle@eyrie.org> wrote:

. . .

I'm fairly sure this isn't true in general. . . .

All right; I'll look it up. Thanks.

From Julien ÉLIE@21:1/5 to All on Sun Apr 16 22:55:18 2023

Hi Nigel,

I'm trying to sync up the active and newsgroups file from 15 peers and
it's proving to be a bit of a challenge.

Apart from encoding issues, are there special cases that you would have
liked to achieve for your sync and merge?

We have 2 old scripts needing a bit of refresh. I had planned to have a
look at them for the INN 2.7.2 release (in late 2023 or 2024).
https://github.com/InterNetNews/inn/issues/39

# mkngfile - make a newsgroup description file from multiple sources
#
# Jeremy Nixon <jeremy@exit109.com>
# $Id: mkngfile,v 1.1 1999/04/17 09:19:25 jeremy Exp $
#
# This program creates a newsgroup description file, using one or
# several input files containing group descriptions. The resulting
# file will contain a description line for each group in your active
# file.
#
# If the input contains multiple different descriptions for a group,
# the program will prompt interactively for which one to use; or, if
# the --noask option is given, one will be chosen arbitrarily. If a
# group has no description, $default_desc (below) will be used.
#
# The output will be sent to stdout, or to the file specified with
# the --output (or -o) option.
#
# Example - to run with your existing newsgroups file, a local copy
# of the ISC newsgroups file, and a directory containing checkgroups
# files with names like *.check, creating the new file as 'newfile':
# mkngfile -o newfile /news/etc/newsgroups newsgroups checkgroups/*.check
#
# You can set the location of your active file below so you don't
# have to specify it on the command line.
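
The merge step that header describes might look roughly like the Python sketch below. It is not the mkngfile code: the function name, the placeholder default description, and the "first source wins" rule (standing in for the interactive prompt or the arbitrary --noask choice) are all assumptions.

```python
def merge_descriptions(active_groups, sources, default_desc="No description."):
    """Pick one description per active group.

    sources is a list of dicts (group name -> description), in priority order.
    Returns (merged, conflicts); conflicts lists groups with differing texts.
    """
    merged, conflicts = {}, {}
    for group in active_groups:
        seen = []
        for source in sources:
            desc = source.get(group)
            if desc and desc not in seen:
                seen.append(desc)
        # Assumption: the first source listed wins when descriptions differ.
        merged[group] = seen[0] if seen else default_desc
        if len(seen) > 1:
            conflicts[group] = seen
    return merged, conflicts


active = ["a.b", "c.d", "e.f"]
sources = [
    {"a.b": "First", "c.d": "One"},
    {"a.b": "First", "c.d": "Two"},
]
merged, conflicts = merge_descriptions(active, sources)
print(merged)
print(conflicts)
```

Returning the conflicts separately leaves room for a caller to prompt for them, which is what the interactive mode described above does.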

Besides files as input, I would also add the possibility to sync from
hostnames (the program will then download their newsgroups files).

We'll also need a similar tool to merge several active files (note that
INN already has the actmerge utility, without any documentation, that
merges 2 active files).

Descriptions are then cleaned with:

# cleannewsgroups.pl
# Copyright 1997-1999 Arthur Hagen
# Remove duplicate (Moderated) comments
# Strip trailing spaces
# Keep only one description for a newsgroup
# Option to remove extra tabs or to pretty-print with several tabs
# Option to either sort the newsgroups file alphabetically or to have it
# in the same order as the active file.

It can warn when the encoding is not UTF-8 :-)
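
A few of the clean-up rules listed above can be sketched in Python as follows. This is an illustration rather than a port of cleannewsgroups.pl: it assumes tab-separated "group, description" lines, keeps the first description seen for each group, and always sorts alphabetically.

```python
MODERATED = " (Moderated)"


def clean_newsgroups(lines):
    """Return cleaned 'group<TAB>description' lines, sorted by group name."""
    seen = {}
    for line in lines:
        line = line.rstrip()                  # strip trailing spaces
        if not line or line.startswith("#"):
            continue
        group, _, desc = line.partition("\t")
        desc = desc.strip()                   # drop extra tabs before the text
        # Collapse a repeated "(Moderated)" suffix into a single one.
        while desc.endswith(MODERATED + MODERATED):
            desc = desc[: -len(MODERATED)]
        seen.setdefault(group, desc)          # keep only one description
    return [f"{group}\t{desc}" for group, desc in sorted(seen.items())]


raw_lines = [
    "b.group\t\tSecond group (Moderated) (Moderated)  ",
    "a.group\tFirst group",
    "a.group\tDuplicate entry",
]
for cleaned in clean_newsgroups(raw_lines):
    print(cleaned)
```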

The first bit is done, which is mainly getting rid of groups that have
invalid names (those that end in a period, contain illegal characters,
and the like).

Such checks can also be added to the script (with an option).

--
Julien ÉLIE

« Si, si, si… Avec des si, on mettrait Lutèce en amphore ! » (Vacancier)

From Billy G. (go-while)@21:1/5 to All on Sun Sep 24 10:56:31 2023

On 11.04.23 12:12, Julien ÉLIE wrote:

FWIW, the descriptions encoded in UTF-8 from the ftp.isc.org newsgroup
file are here:
http://usenet.trigofacile.com/hierarchies/data/newsgroups.utf8

It may facilitate your life :-)

The conversions I found out to work are:
- cn.* and han.* are encoded in gb18030;
- fido7.*, medlux.* and relcom.* in koi8-r;
- ukr.* in koi8-u;
- nctu.*, ncu.* and tw.* in big5;
- scout.forum.chinese and scout.forum.korean in big5;
- eternal-september.*, fido.* and fr.* in utf-8;
- all the others fit well in cp1252.
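
The list above translates fairly directly into a lookup table. The Python sketch below is only one way to apply it: the function names are invented, group names are assumed to be ASCII, and lines are assumed to be tab-separated.

```python
from fnmatch import fnmatchcase

# Patterns and encodings taken from the list above; the first match wins.
HIERARCHY_ENCODINGS = [
    (("cn.*", "han.*"), "gb18030"),
    (("fido7.*", "medlux.*", "relcom.*"), "koi8-r"),
    (("ukr.*",), "koi8-u"),
    (("nctu.*", "ncu.*", "tw.*",
      "scout.forum.chinese", "scout.forum.korean"), "big5"),
    (("eternal-september.*", "fido.*", "fr.*"), "utf-8"),
]
FALLBACK = "cp1252"  # "all the others"


def encoding_for(group: str) -> str:
    """Return the encoding assumed for a group's description."""
    for patterns, encoding in HIERARCHY_ENCODINGS:
        if any(fnmatchcase(group, pattern) for pattern in patterns):
            return encoding
    return FALLBACK


def line_to_text(raw_line: bytes) -> str:
    """Decode one raw newsgroups line using its hierarchy's encoding."""
    name = raw_line.split(b"\t", 1)[0].decode("ascii")
    return raw_line.decode(encoding_for(name))


print(encoding_for("relcom.test"))    # koi8-r
print(encoding_for("fido.ger.test"))  # utf-8
print(encoding_for("comp.lang.c"))    # cp1252
```

Writing the decoded text back out with `.encode("utf-8")` gives the single-encoding file; a line whose bytes do not fit the assumed encoding raises `UnicodeDecodeError`, which is a useful signal that the table needs another entry.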

thanks!
