• OT: Why use UTF-16 for simple text?

    From Lorem Ipsum@21:1/5 to All on Sat Mar 18 11:09:58 2023
    I'm posting this here, because this group seems to have fairly intelligent members who have more than basic knowledge on things computerish.

    I'm working with LTspice and started using command scripts to generate measurement output files. But they display in my editor as all the characters being separated by nulls. On asking in the LTspice group about this, it seems that while most of the
    textual output files spit out something viewable in a text editor, (namely UTF-8), this one file is generated in UTF-16!
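
    A minimal sketch (Python, not part of the original post) of why such a file shows up as characters separated by nulls: for ASCII text, UTF-16 stores each character as two bytes, one of which is zero. The sample string is made up, and little-endian byte order is an assumption about what the program writes.

```python
# Encode a made-up line of measurement output both ways and compare the bytes.
sample = "vout: 1.25"

as_utf8 = sample.encode("utf-8")
as_utf16 = sample.encode("utf-16-le")  # assumption: little-endian, no BOM

# Every ASCII character becomes one byte in UTF-8 ...
print(as_utf8)
# ... and two bytes in UTF-16, the second one being a NUL for ASCII.
print(as_utf16)

nul_count = as_utf16.count(b"\x00")
print(f"{len(as_utf8)} bytes as UTF-8, {len(as_utf16)} bytes as UTF-16, {nul_count} NULs")
```

    An editor that assumes a single-byte encoding renders each of those NUL bytes as a separate character.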

    The whole point of using the command script is to facilitate getting the results expediently; now I have to convert the durn files before I can usefully view them.

    There's no facility to write anything into this file other than simple text. Given the other file formats this program generates are either UTF-8 or Western (ISO-8859-1), can anyone think of a reason why they would spit out UTF-16 for this one file
    format???

    LTspice is free, but it's not cheap. Every time I use it, I run into problems like this, that waste my time trying to work around them. It's like the user interface was designed by asylum inmates, *for* asylum inmates.

    I know there's no real fix for this. I'm not looking for ways to convert the file and I can't change LTspice. I'm mostly just venting my frustration for the last week of dealing with the poor documentation and the religious fanaticism of the support
    group.

    --

    Rick C.

    - Get 1,000 miles of free Supercharging
    - Tesla referral code - https://ts.la/richard11209

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to Lorem Ipsum on Sat Mar 18 11:30:52 2023
    Lorem Ipsum wrote on Saturday, March 18, 2023 at 19:09:59 UTC+1:
    I'm posting this here, because this group seems to have fairly intelligent members who have more than basic knowledge on things computerish.

    I'm working with LTspice and started using command scripts to generate measurement output files. But they display in my editor as all the characters being separated by nulls. On asking in the LTspice group about this, it seems that while most of the
    textual output files spit out something viewable in a text editor, (namely UTF-8), this one file is generated in UTF-16!

    The whole point of using the command script is to facilitate getting the results expediently, I now have to convert the durn files before I can usefully view them.

    There's no facility to write anything into this file other than simple text. Given the other file formats this program generates are either UTF-8 or Western (ISO-8859-1), can anyone think of a reason why they would spit out UTF-16 for this one file
    format???

    LTspice is free, but it's not cheap. Everytime I use it, I run into problems like this, that waste my time trying to work around them. It's like the user interface was designed by asylum inmates, *for* asylum inmates.

    I know there's no real fix for this. I'm not looking for ways to convert the file and I can't change LTspice. I'm mostly just venting my frustration for the last week of dealing with the poor documentation and the religious fanaticism of the support
    group.

    FWIW the free Notepad++ text editor has a menu item Encoding for such conversions.
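
    For scripted use, the same conversion can be sketched in a few lines of Python. This is an illustration only: the file names are hypothetical, and it assumes the UTF-16 file starts with a byte-order mark.

```python
from pathlib import Path
import tempfile

def utf16_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a UTF-16 text file as UTF-8."""
    # The "utf-16" codec uses the byte-order mark (assumed present) to pick the byte order.
    text = src.read_text(encoding="utf-16")
    dst.write_text(text, encoding="utf-8")

# Demo with a throwaway file; "meas.log" is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "meas.log"
    dst = Path(tmp) / "meas-utf8.log"
    src.write_text("gain: 20.1dB\n", encoding="utf-16")
    utf16_to_utf8(src, dst)
    converted = dst.read_bytes()

print(converted)
```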

  • From Lorem Ipsum@21:1/5 to minforth on Sat Mar 18 11:35:14 2023
    On Saturday, March 18, 2023 at 2:30:54 PM UTC-4, minforth wrote:
    Lorem Ipsum wrote on Saturday, March 18, 2023 at 19:09:59 UTC+1:
    I'm posting this here, because this group seems to have fairly intelligent members who have more than basic knowledge on things computerish.

    I'm working with LTspice and started using command scripts to generate measurement output files. But they display in my editor as all the characters being separated by nulls. On asking in the LTspice group about this, it seems that while most of the
    textual output files spit out something viewable in a text editor, (namely UTF-8), this one file is generated in UTF-16!

    The whole point of using the command script is to facilitate getting the results expediently, I now have to convert the durn files before I can usefully view them.

    There's no facility to write anything into this file other than simple text. Given the other file formats this program generates are either UTF-8 or Western (ISO-8859-1), can anyone think of a reason why they would spit out UTF-16 for this one file
    format???

    LTspice is free, but it's not cheap. Everytime I use it, I run into problems like this, that waste my time trying to work around them. It's like the user interface was designed by asylum inmates, *for* asylum inmates.

    I know there's no real fix for this. I'm not looking for ways to convert the file and I can't change LTspice. I'm mostly just venting my frustration for the last week of dealing with the poor documentation and the religious fanaticism of the support
    group.
    FWIW the free Notepad++ text editor has a menu item Encoding for such conversions.

    Yes, I can convert the file many ways. But that is a silly step. Here's a file you can't use, but you can use this other program to convert it to a format that works for.

    This is not the only issue with LTspice. Most of it has to do with the terse documentation. The guy who designed the program is a bit of a genius, really. But he knows F**K ALL about UIs. I understand it's managed by a committee now. Oh, the horror!

    --

    Rick C.

    + Get 1,000 miles of free Supercharging
    + Tesla referral code - https://ts.la/richard11209

  • From dxforth@21:1/5 to Lorem Ipsum on Sun Mar 19 12:43:57 2023
    On 19/03/2023 5:09 am, Lorem Ipsum wrote:
    ...
    I know there's no real fix for this. I'm not looking for ways to convert the file and I can't change LTspice. I'm mostly just venting my frustration for the last week of dealing with the poor documentation and the religious fanaticism of the support
    group.

    You've come to the right place to vent. We always knew you were one of us.
    I liked the appeal at the beginning of your post. It was different from your usual :)

  • From Zbig@21:1/5 to All on Sun Mar 19 09:17:17 2023
    The whole point of using the command script is to facilitate getting the results expediently, I now have to convert the durn files before I can usefully view them.

    Googling around I've found this thread:
    https://www.quora.com/When-should-UTF-16-encoding-be-preferred-over-UTF-8

    To me the conclusion is:
    „UTF-16 should only be used for interoperability with existing APIs that are incompatible with UTF-8. Absent such requirements, UTF-8 should be preferred to UTF-16. UTF-8 has a few clear advantages over UTF-16, such as:

    * compatibility with ASCII
    * self-synchronizing property
    * endianness-independence

    On the other hand, UTF-16 has zero clear advantages over UTF-8. While UTF-16 does take up less space than UTF-8 for some Asian languages, you can always just compress the UTF-8 encoding. The case for using UTF-8 everywhere is so compelling that this
    should be considered a minor inconvenience.”
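
    The three properties listed in the quote can be illustrated with a short Python sketch. The sample string and the helper names (`is_continuation`, `char_start`) are made up for this example.

```python
text = "A\u20ac"  # one ASCII character ("A") and one non-ASCII character (euro sign)

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# ASCII compatibility: the ASCII character is the same single byte in UTF-8.
ascii_compatible = utf8[:1] == b"A"

# Endianness: UTF-8 has one byte sequence; UTF-16 has two possible byte orders.
orders_differ = utf16_le != utf16_be

def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80

def char_start(data: bytes, pos: int) -> int:
    """Self-synchronization: step back from any byte offset to a character start."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos

print(ascii_compatible, orders_differ, char_start(utf8, 3))
```

    Landing in the middle of the three-byte euro sign, `char_start` walks back to the byte where the character begins without having to scan from the start of the string.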

  • From Lorem Ipsum@21:1/5 to Zbig on Sun Mar 19 11:09:03 2023
    On Sunday, March 19, 2023 at 12:17:19 PM UTC-4, Zbig wrote:
    The whole point of using the command script is to facilitate getting the results expediently, I now have to convert the durn files before I can usefully view them.
    Googling around I've found this thread: https://www.quora.com/When-should-UTF-16-encoding-be-preferred-over-UTF-8

    To me the conclusion is:
    „UTF-16 should only be used for interoperability with existing APIs that are incompatible with UTF-8. Absent such requirements, UTF-8 should be preferred to UTF-16. UTF-8 has a few clear advantages over UTF-16, such as:

    * compatibility with ASCII
    * self-synchronizing property
    * endianness-independence

    On the other hand, UTF-16 has zero clear advantages over UTF-8. While UTF-16 does take up less space than UTF-8 for some Asian languages, you can always just compress the UTF-8 encoding. The case for using UTF-8 everywhere is so compelling that this
    should be considered a minor inconvenience.”

    Meanwhile, in the LTspice group, I'm being labeled a troll for talking about this.

    I get that various groups have a common interest and may not be very interested in hearing about issues with a tool. But the LTspice group seems to really come down on people for even mentioning that problems exist.

    People don't have that shortcoming here. They mostly just come down on people for not much at all. lol

    But thanks for the reference. I may share that with the LTspice developers.

    --

    Rick C.

    -- Get 1,000 miles of free Supercharging
    -- Tesla referral code - https://ts.la/richard11209

  • From Anton Ertl@21:1/5 to Zbig on Sun Mar 19 18:07:19 2023
    Zbig <zbigniew2011@gmail.com> writes:
    While UTF-16 does take up less space than UTF-8 for some Asian languages

    Often claimed, but often not true. E.g., consider the web page

    https://ctee.com.tw/news/tech/823656.html

    This is encoded in UTF-8. Let's see how big it would be in UTF-16:

    wget https://ctee.com.tw/news/tech/823656.html
    recode utf8..utf16 <823656.html >823656-utf16.html
    ls -l 823656*

    This shows:

    -rw-r--r-- 1 anton users 175148 Mar 19 19:06 823656-utf16.html
    -rw-r--r-- 1 anton users 92601 Mar 19 19:05 823656.html

    So for this Taiwanese web page UTF-16 is *bigger* by a factor 1.89
    than UTF-8.
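
    A rough Python sketch of why a markup-heavy page can come out bigger in UTF-16: the ASCII markup doubles in size, which can outweigh the saving on the CJK characters. The fragment below is made up and is not taken from the page measured above.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded size of text in bytes for UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),
    }

# Made-up fragment: 20 ASCII markup characters around 6 CJK characters.
fragment = '<p class="news">' + "\u4e2d\u6587" * 3 + "</p>"

sizes = encoded_sizes(fragment)
print(sizes, round(sizes["utf16"] / sizes["utf8"], 2))
```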

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From S Jack@21:1/5 to Lorem Ipsum on Sun Mar 19 11:19:59 2023
    On Saturday, March 18, 2023 at 1:35:16 PM UTC-5, Lorem Ipsum wrote:
    Yes, I can convert the file many ways. But that is a silly step. Here's a file you can't use, but you can use this other program to convert it to a format that works for.

    UTF was new toy just as color was decades ago when one could go
    to an office and see women who changed their display background
    to magenta and print many color memos so that 100 dollar ink jets
    replaced 2 dollar ribbons.

    Like the 5 year old after getting into her mother's makeup stands
    in front of a mirror and proudly gazes at her new visage, heavily
    powdered face and rouged cheeks with smeared crimson lips and eyes
    darkened almost black.
    --
    me

  • From Lorem Ipsum@21:1/5 to S Jack on Sun Mar 19 14:54:57 2023
    On Sunday, March 19, 2023 at 2:20:00 PM UTC-4, S Jack wrote:
    On Saturday, March 18, 2023 at 1:35:16 PM UTC-5, Lorem Ipsum wrote:
    Yes, I can convert the file many ways. But that is a silly step. Here's a file you can't use, but you can use this other program to convert it to a format that works for.
    UTF was new toy just as color was decades ago when one could go
    to an office and see women who changed their display background
    to magenta and print many color memos so that 100 dollar ink jets
    replaced 2 dollar ribbons.

    Like the 5 year old after getting into her mother's makeup stands
    in front of a mirror and proudly gazes at her new visage, heavily
    powdered face and rouged cheeks with smeared crimson lips and eyes
    darkened almost black.

    Someone please explain that to the crowd at the groups.io LTspice group.

    --

    Rick C.

    -+ Get 1,000 miles of free Supercharging
    -+ Tesla referral code - https://ts.la/richard11209

  • From dxforth@21:1/5 to Lorem Ipsum on Mon Mar 20 11:16:55 2023
    On 20/03/2023 5:09 am, Lorem Ipsum wrote:
    On Sunday, March 19, 2023 at 12:17:19 PM UTC-4, Zbig wrote:
    The whole point of using the command script is to facilitate getting the results expediently, I now have to convert the durn files before I can usefully view them.
    Googling around I've found this thread:
    https://www.quora.com/When-should-UTF-16-encoding-be-preferred-over-UTF-8

    To me the conclusion is:
    „UTF-16 should only be used for interoperability with existing APIs that are incompatible with UTF-8. Absent such requirements, UTF-8 should be preferred to UTF-16. UTF-8 has a few clear advantages over UTF-16, such as:

    * compatibility with ASCII
    * self-synchronizing property
    * endianness-independence

    On the other hand, UTF-16 has zero clear advantages over UTF-8. While UTF-16 does take up less space than UTF-8 for some Asian languages, you can always just compress the UTF-8 encoding. The case for using UTF-8 everywhere is so compelling that this
    should be considered a minor inconvenience.”

    Meanwhile, in the LTspice group, I'm being labeled a troll for talking about this.

    I get that various groups have a common interest and may not be very interested in hearing about issues with a tool. But the LTspice group seems to really come down on people for even mentioning that problems exist.

    People don't have that shortcoming here. They mostly just come down on people for not much at all. lol

    Forth tells each programmer he can be a genius. This can result in over-achievers.
    Every system has a way of self-regulation. Other languages tell users they exist to
    be seen, not heard. Forth has tried that with mixed results.

  • From Ron AARON@21:1/5 to Zbig on Mon Mar 20 06:40:39 2023
    On 19/03/2023 18:17, Zbig wrote:
    The whole point of using the command script is to facilitate getting the results expediently, I now have to convert the durn files before I can usefully view them.

    Googling around I've found this thread:
    https://www.quora.com/When-should-UTF-16-encoding-be-preferred-over-UTF-8

    To me the conclusion is:
    „UTF-16 should only be used for interoperability with existing APIs that are incompatible with UTF-8. Absent such requirements, UTF-8 should be preferred to UTF-16. UTF-8 has a few clear advantages over UTF-16, such as:

    * compatibility with ASCII
    * self-synchronizing property
    * endianness-independence

    On the other hand, UTF-16 has zero clear advantages over UTF-8. While UTF-16 does take up less space than UTF-8 for some Asian languages, you can always just compress the UTF-8 encoding. The case for using UTF-8 everywhere is so compelling that this
    should be considered a minor inconvenience.”

    Not entirely. UTF-16 has an advantage that seeking to a specific
    character offset is O(1) whereas it's O(n) for UTF-8. Likewise seeking backwards through a string is easier for UTF-16.

    That said, 8th uses UTF-8 because it takes up less space in general
    (especially when most code and text is not multibyte), and modern
    systems understand it perfectly well. Only Windows (among the popular
    OSes) insists on UTF-16, and at least has conversion routines for it.

  • From Anton Ertl@21:1/5 to Ron AARON on Mon Mar 20 06:28:43 2023
    Ron AARON <clf@8th-dev.com> writes:
    UTF-16 has an advantage that seeking to a specific
    character offset is O(1) whereas it's O(n) for UTF-8.

    Wrong. Even seeking to a specific code point offset is O(n) for
    UTF-16. Even UTF-32 does not give us O(1) character seeking, because
    a character can be composed of several code points; UTF-32 does give
    us O(1) code-point seeking, but why would one want that?
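
    Both halves of this argument can be checked with a small Python sketch (the helper name `utf16_units` is made up for the example): a code point outside the Basic Multilingual Plane takes two UTF-16 code units, and a single visible character may consist of more than one code point.

```python
import unicodedata

def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

# A code point outside the Basic Multilingual Plane needs a surrogate pair,
# so the n-th code point is not at a fixed offset in UTF-16.
emoji = "\U0001F600"
print(len(emoji), utf16_units(emoji))  # 1 code point, 2 code units

# One user-perceived character can span several code points, so even
# fixed-width UTF-32 does not give O(1) character seeking.
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 code points vs. 1
```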

    Likewise seeking
    backwards through a string is easier for UTF-16.

    In what way?

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From Ron AARON@21:1/5 to Anton Ertl on Mon Mar 20 09:36:02 2023
    On 20/03/2023 8:28, Anton Ertl wrote:
    Ron AARON <clf@8th-dev.com> writes:
    UTF-16 has an advantage that seeking to a specific
    character offset is O(1) whereas it's O(n) for UTF-8.

    Wrong. Even seeking to a specific code point offset is O(n) for
    UTF-16. Even UTF-32 does not give us O(1) character seeking, because
    a character can be composed of several code points; UTF-32 does give
    us O(1) code-point seeking, but why would one want that?

    Ah, you are correct; I was thinking of UCS-2.

    As for seeking in O(1) it's useful if you're splitting strings on X
    characters. Admittedly less frequently useful for most people.

  • From dxforth@21:1/5 to albert on Mon Mar 20 23:11:06 2023
    On 20/03/2023 10:53 pm, albert wrote:
    In article <2023Mar19.190719@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Zbig <zbigniew2011@gmail.com> writes:
    While UTF-16 does take up less space than UTF-8 for some Asian languages

    Often claimed, but often not true. E.g., consider the web page

    https://ctee.com.tw/news/tech/823656.html

    This is encoded in UTF-8. Let's see how big it would be in UTF-16:

    wget https://ctee.com.tw/news/tech/823656.html
    recode utf8..utf16 <823656.html >823656-utf16.html
    ls -l 823656*

    This shows:

    -rw-r--r-- 1 anton users 175148 Mar 19 19:06 823656-utf16.html
    -rw-r--r-- 1 anton users 92601 Mar 19 19:05 823656.html

    So for this Taiwanese web page UTF-16 is *bigger* by a factor 1.89
    than UTF-8.

    Viewing the ridiculous waste of website bandwidth for pictures,
    I think size is hardly relevant.

    Working with D1 I come across source files with comment in Chinese. I
    can decipher it with my youdoa pen (or google) and I prefer this
    situation over no comment.
    While at the moment English is the "lingua franca" of the Internet
    and science, Chinese will become more important.

    Back to ideograms?

    What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.

  • From albert@21:1/5 to Anton Ertl on Mon Mar 20 12:53:06 2023
    In article <2023Mar19.190719@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Zbig <zbigniew2011@gmail.com> writes:
    While UTF-16 does take up less space than UTF-8 for some Asian languages

    Often claimed, but often not true. E.g., consider the web page

    https://ctee.com.tw/news/tech/823656.html

    This is encoded in UTF-8. Let's see how big it would be in UTF-16:

    wget https://ctee.com.tw/news/tech/823656.html
    recode utf8..utf16 <823656.html >823656-utf16.html
    ls -l 823656*

    This shows:

    -rw-r--r-- 1 anton users 175148 Mar 19 19:06 823656-utf16.html
    -rw-r--r-- 1 anton users 92601 Mar 19 19:05 823656.html

    So for this Taiwanese web page UTF-16 is *bigger* by a factor 1.89
    than UTF-8.

    Viewing the ridiculous waste of website bandwidth for pictures,
    I think size is hardly relevant.

    Working with D1 I come across source files with comment in Chinese. I
    can decipher it with my youdoa pen (or google) and I prefer this
    situation over no comment.
    While at the moment English is the "lingua franca" of the Internet
    and science, Chinese will become more important.


    - anton

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -

  • From Lorem Ipsum@21:1/5 to none albert on Mon Mar 20 07:03:11 2023
    On Monday, March 20, 2023 at 7:53:10 AM UTC-4, none albert wrote:
    In article <2023Mar1...@mips.complang.tuwien.ac.at>,
    Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
    Zbig <zbigni...@gmail.com> writes:
    While UTF-16 does take up less space than UTF-8 for some Asian languages

    Often claimed, but often not true. E.g., consider the web page

    https://ctee.com.tw/news/tech/823656.html

    This is encoded in UTF-8. Let's see how big it would be in UTF-16:

    wget https://ctee.com.tw/news/tech/823656.html
    recode utf8..utf16 <823656.html >823656-utf16.html
    ls -l 823656*

    This shows:

    -rw-r--r-- 1 anton users 175148 Mar 19 19:06 823656-utf16.html
    -rw-r--r-- 1 anton users 92601 Mar 19 19:05 823656.html

    So for this Taiwanese web page UTF-16 is *bigger* by a factor 1.89
    than UTF-8.
    Viewing the ridiculous waste of website bandwidth for pictures,
    I think size is hardly relevant.

    It's not always about the Internet. My problem is compatibility. I use ASCII tools, such as the text editor. I don't know a single reason why a program would output files in UTF-8, UTF-16 and Western (ISO-8859-1) (an 8 bit extension of ASCII), but not
    by the user's choice. It's based on which file is being output. Of 8 different file types, the one that is generated as a textual record of measurements taken, i.e. very likely to be read by another program, is in UTF-16, not compatible with ASCII bytes.
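
    A program that has to read such a mix of files could sniff the encoding from a leading byte-order mark. The Python sketch below is only an illustration: whether the files carry a BOM is an assumption, the ISO-8859-1 fallback is a guess based on the encodings named above, and `sniff_encoding` / `read_any` are made-up helper names.

```python
import codecs

# Checked in this order because the UTF-32 LE mark starts with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(data: bytes, default: str = "iso-8859-1") -> str:
    """Guess a codec name from a leading byte-order mark, else fall back to default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default

def read_any(data: bytes) -> str:
    """Decode bytes using the sniffed codec; the BOM-aware codecs drop the mark."""
    return data.decode(sniff_encoding(data))

print(read_any("vout: 1.25".encode("utf-16")))
```

    A file without a BOM falls through to the default, so plain ASCII and ISO-8859-1 files still decode, while UTF-8 without a BOM would need a separate check.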



    Working with D1 I come across source files with comment in Chinese. I
    can decipher it with my youdoa pen (or google) and I prefer this
    situation over no comment.
    While at the moment English is the "lingua franca" of the Internet
    and science, Chinese will become more important.

    Does UTF-16 work with Chinese better than UTF-8? As others have said, UTF-8 is a superset of 7-bit ASCII and inter-workable.

    --

    Rick C.

    +- Get 1,000 miles of free Supercharging
    +- Tesla referral code - https://ts.la/richard11209

  • From Anton Ertl@21:1/5 to albert@cherry. on Mon Mar 20 12:34:19 2023
    albert@cherry.(none) (albert) writes:
    In article <2023Mar19.190719@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Zbig <zbigniew2011@gmail.com> writes:
    While UTF-16 does take up less space than UTF-8 for some Asian languages

    Often claimed, but often not true. E.g., consider the web page

    https://ctee.com.tw/news/tech/823656.html

    This is encoded in UTF-8. Let's see how big it would be in UTF-16:

    wget https://ctee.com.tw/news/tech/823656.html
    recode utf8..utf16 <823656.html >823656-utf16.html
    ls -l 823656*

    This shows:

    -rw-r--r-- 1 anton users 175148 Mar 19 19:06 823656-utf16.html
    -rw-r--r-- 1 anton users 92601 Mar 19 19:05 823656.html

    So for this Taiwanese web page UTF-16 is *bigger* by a factor 1.89
    than UTF-8.

    Viewing the ridiculous waste of website bandwidth for pictures,
    I think size is hardly relevant.

    So even if there happened to be a case where UTF-16 was smaller, it
    would be hardly relevant according to your argument.

    While at the moment English is the "lingua franca" of the Internet
    and science, Chinese will become more important.

    That may or may not be the case, but does not make UTF-16 more
    relevant than it is now, because pictures will still take more space
    and because there will be just as much ASCII in Chinese text as now
    (as demonstrated in the HTML page above).

    Lest one think that I cherry-picked a web page to demonstrate my
    point, here's the numbers for Daniel Lemire's unicode_lipsum <https://github.com/lemire/unicode_lipsum>:

        utf8   utf16   utf32   16/8
       81685   91530  183056  1.120  Arabic-Lipsum.$u.txt
       69840   46922   93840  0.671  Chinese-Lipsum.$u.txt
       65542   65542   65544  1.000  Emoji-Lipsum.$u.txt
       66495   74612  149220  1.122  Hebrew-Lipsum.$u.txt
       87997   65532  131060  0.744  Hindi-Lipsum.$u.txt
       67808   46750   93496  0.689  Japanese-Lipsum.$u.txt
       66600   54290  108576  0.815  Korean-Lipsum.$u.txt
       86940  173882  347760  2.000  Latin-Lipsum.$u.txt
      104770  115962  231920  1.106  Russian-Lipsum.$u.txt

    The first three columns are in bytes, the fourth gives the size
    disadvantage factor of UTF-16 compared to UTF-8 (<1 means UTF-16 has
    an advantage).
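
    The pattern in these ratios follows from the per-character cost in each encoding. A small Python sketch (the sample characters are chosen for illustration and are not taken from the Lipsum files):

```python
# One sample character per script, plus the ASCII space.
samples = {
    "Latin": "a",
    "Cyrillic": "\u044f",
    "Arabic": "\u0628",
    "Devanagari": "\u0915",
    "CJK": "\u4e2d",
    "Emoji": "\U0001F600",
    "ASCII space": " ",
}

def bytes_per_char(ch: str) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes) for a single character, without BOM."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for script, ch in samples.items():
    u8, u16 = bytes_per_char(ch)
    print(f"{script:12} utf8={u8} utf16={u16}")
```

    Cyrillic and Arabic letters cost two bytes either way, so every ASCII space or punctuation mark tips the total in favour of UTF-8; Devanagari and CJK characters drop from three bytes to two in UTF-16, and emoji cost four bytes in both.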

    Lemire also gives Wikipedia entries on Mars (the utf* numbers are from conversion to "text" (looks like Markdown to me)):

        html    utf8   utf16   utf32
      954430  533857  849690 1699376  Mar 20 13:33  arabic
      382079  181321  274418  548832  Mar 20 13:33  chinese
      368442  152721  287666  575328  Mar 20 13:33  czech
     1005060  390368  775020 1550036  Mar 20 13:33  english
      192461   86963  168252  336500  Mar 20 13:33  esperanto
     1032638  446908  869736 1739468  Mar 20 13:33  french
      397376  205779  402432  804860  Mar 20 13:33  german
      326722  181348  286000  571996  Mar 20 13:33  greek
      327412  190114  292704  585404  Mar 20 13:33  hebrew
      712465  396593  547918 1095832  Mar 20 13:33  hindi
      304786  164355  237784  475564  Mar 20 13:33  japanese
      193001   97859  145838  291672  Mar 20 13:33  korean
      293677  156209  249390  498776  Mar 20 13:33  persan
      692409  280660  547232 1094456  Mar 20 13:33  portuguese
      713817  407095  624076 1248148  Mar 20 13:33  russian
     1088085  593589  809518 1619032  Mar 20 13:33  thai
      387007  195078  370886  741768  Mar 20 13:33  turkish
      674255  319029  564840 1129676  Mar 20 13:33  vietnamese

    For the latter files the UTF-16 variants are always bigger than the
    UTF-8 variants. Looking at chinese.utf8.txt, there is a lot of ASCII
    there in the links/URLs (where non-ASCII is encoded in ASCII,
    e.g. "/wiki/Wikipedia:%E6%B6%88%E6%AD%A7%E4%B9%89"), and also a bit in
    the form of Markdown markup (e.g., []() in links, or **...**). There
    are also numbers and percentages shown in ASCII, although temperatures
    use a combined "degrees-C" sign rather than the degree sign followed by
    "C". There are also references to sources that are predominantly
    ASCII.

    The Lipsum Chinese text, OTOH, contains just ideograms and newlines,
    not even a blank or an ASCII digit in sight (which, looking at the
    Taiwanese web page linked to above seems to have become customary at
    least in Taiwan). The Russian text also contains no digits, but it
    does contain spaces and punctuation marks in ASCII.

    So the Lipsum texts seem to be the best case for UTF-16. And indeed,
    there the Chinese, Hindi, Japanese, and Korean UTF-16 texts are
    smaller than their UTF-8 counterparts, but for Arabic, Hebrew, and
    Russian UTF-8 is smaller. And of course UTF-8 is smaller for the
    pseudo-Latin Lorem Ipsum, where the UTF-16 version is more than twice
    as big as the UTF-8 version (More? How so? My guess is that the BOM
    causes two extra bytes for UTF-16).
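
    The BOM guess is easy to check against the numbers. A short Python sketch, using the Latin-Lipsum sizes quoted in the table above; the arithmetic is consistent with the guess, though it does not prove how the files were produced:

```python
import codecs

encoded = "abc".encode("utf-16")  # the generic codec prepends a byte-order mark

has_bom = encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom, len(encoded))  # 2 bytes of BOM + 2 bytes per ASCII character

# Sizes quoted for Latin-Lipsum in the table above.
latin_utf8, latin_utf16 = 86940, 173882
extra = latin_utf16 - 2 * latin_utf8
print(extra)  # bytes left over after doubling the UTF-8 size
```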

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From S Jack@21:1/5 to dxforth on Mon Mar 20 09:02:42 2023
    On Monday, March 20, 2023 at 7:11:08 AM UTC-5, dxforth wrote:
    Back to ideograms?

    Pure ideograms are very nice in an environment comprised of many
    dialects. Knowing the ideograms, one sounds in his mind his
    own dialect without translation to disturb the mind's harmony.

    But Chinese got corrupted long ago when some progressive
    "improved" it by adding phoneme elements to characters and
    the LC (language committee) got carried away and produced thousands
    of characters providing job security for the scribes.

    Romaji works in Japan so would assume it should work in China
    and elsewhere.
    --
    me

  • From Paul Rubin@21:1/5 to dxforth on Mon Mar 20 13:13:08 2023
    dxforth <dxforth@gmail.com> writes:
    What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.

    Touché.

  • From dxforth@21:1/5 to Paul Rubin on Tue Mar 21 10:42:00 2023
    On 21/03/2023 7:13 am, Paul Rubin wrote:
    dxforth <dxforth@gmail.com> writes:
    What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.

    Touché.

    Precisely :)

  • From dxforth@21:1/5 to S Jack on Tue Mar 21 10:29:44 2023
    On 21/03/2023 3:02 am, S Jack wrote:
    On Monday, March 20, 2023 at 7:11:08 AM UTC-5, dxforth wrote:
    Back to ideograms?

    Pure ideograms are very nice in an environment comprised of many
    dialects. Knowing the ideograms, one sounds in his mind his
    own dialect without translation to disturb the mind's harmony.

    But Chinese got corrupted long ago when some progressive
    "improved" it by adding phoneme elements to characters and
    the LC (language committee) got carried away and produced thousands
    of characters providing job security for the scribes.

    Romaji works in Japan so would assume it should work in China
    and elsewhere.

    Humans do have a habit of romanticising language and culture - especially
    if they view it as being on the ascendency or as possessing something they don't. The expression of the human condition in any language is fine by me
    but let's keep it simple :)

  • From S Jack@21:1/5 to dxforth on Tue Mar 21 07:32:44 2023
    On Monday, March 20, 2023 at 6:29:47 PM UTC-5, dxforth wrote:
    On 21/03/2023 3:02 am, S Jack wrote:
    On Monday, March 20, 2023 at 7:11:08 AM UTC-5, dxforth wrote:
    Humans do have a habit of romanticising language and culture - especially
    if they view it as being on the ascendency or as possessing something they don't. The expression of the human condition in any language is fine by me but let's keep it simple :)

    Contrary to what many think, major languages were designed, not
    something that evolved naturally. What was natural was their erosion
    to better suit the general speaking public, the prime example being
    English. For a couple hundred years it was spoken by the uneducated;
    the educated spoke and wrote in French, the language used by the
    court. As a result inflections were replaced by prepositions. The
    former were more efficient for the court scribes who had to write many
    official documents, and the latter had fewer rules to learn and
    were easier on the ear of the general speaking public, many of
    whom didn't know how to write. The same natural trend is seen in
    vulgate Latin, vulgate ancient Egyptian and vulgate anything (Koo in
    the extreme).

    One could think that the designers lacked skill but anyone involved
    in standard Forth would be very sympathetic to those attempting such
    task. Don't think it's prudent to take on such an endeavor for real;
    imagine the push back from all the world. But long ago for fun made
    an attempt. The idea was to come up with a simple workable grammar
    and use different sets of vocabularies. One set with words rich in
    labials giving the pleasant lilt of an African language, another set
    using gutturals for harsh sounds of Klingon. The grammar would be
    the same and words of different vocabulary sets would have one to
    one correspondence but from their sound and looks it wouldn't be
    obvious. Hollywood actors would have a simple grammar and some basic
    words to learn which sounds could easily be modified to fit the
    scenes they were performing.

    Esperanto, Ido and such had the daunting task of having to create
    very large vocabularies to be accepted. But for the above that task
    could be circumvented by picking Latin roots that correspond to
    basic English and adapting them. (Basic English used for selecting a
    complete working set of words. The words wouldn't appear as English
    nor Latin in the vocabularies.)
    --
    me

  • From dxforth@21:1/5 to Doug Hoffman on Wed Mar 22 21:06:54 2023
    On 22/03/2023 8:20 pm, Doug Hoffman wrote:
    On Monday, March 20, 2023 at 7:42:00 PM UTC-4, dxforth wrote:
    On 21/03/2023 7:13 am, Paul Rubin wrote:
    dxforth <dxf...@gmail.com> writes:
    What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.

    Touché.
    Precisely :)

    Users in various countries may differ. For example, the euro glyph is common ( € ) .

    Assuming an ASCII world, one byte should be plenty - 128 slots for ASCII
    and 128 slots for whatever else one believes is important.

  • From Doug Hoffman@21:1/5 to dxforth on Wed Mar 22 02:20:01 2023
    On Monday, March 20, 2023 at 7:42:00 PM UTC-4, dxforth wrote:
    On 21/03/2023 7:13 am, Paul Rubin wrote:
    dxforth <dxf...@gmail.com> writes:
    What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.

    Touché.
    Precisely :)

    Users in various countries may differ. For example, the euro glyph is common ( € ) .

    OT, but:
    This epitaph was taken from a real-life tombstone found in Tombstone, Arizona. The headstone reads, "Here lies Lester Moore, Four slugs from a .44, No Les No more."

    -Doug

  • From S Jack@21:1/5 to dxforth on Wed Mar 22 05:57:08 2023
    On Wednesday, March 22, 2023 at 5:06:58 AM UTC-5, dxforth wrote:
    Assuming an ASCII world, one byte should be plenty - 128 slots for ASCII
    and 128 slots for whatever else one believes is important.

    That's what I do.
    I'm on a UTF hterm, not my choice, and I use code page to map
    128 characters to the upper register.
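
    A minimal sketch of that kind of mapping, in Python for illustration. The table contents are hypothetical (the post does not say which characters are mapped); only the idea of 128 ASCII slots plus 128 custom slots comes from the thread. Each custom byte is translated to a character, which can then be sent to a UTF-8 terminal.

```python
# Hypothetical upper half of a single-byte code page.
# Only three of the 128 available slots are filled in for this sketch.
UPPER = {0x80: "\u20ac", 0x81: "\u00a3", 0x82: "\u00e9"}  # euro, pound, e-acute

def decode_byte(b):
    """Map one byte to a character: ASCII below 0x80, table lookup above."""
    if b < 0x80:
        return chr(b)
    # Unassigned slots become U+FFFD REPLACEMENT CHARACTER.
    return UPPER.get(b, "\ufffd")

def decode(data):
    """Translate a byte string in the custom code page to a Python str."""
    return "".join(decode_byte(b) for b in data)

text = decode(b"10 \x80")
# What would actually be written to a UTF-8 terminal:
terminal_bytes = text.encode("utf-8")
print(text, terminal_bytes)
```

    The stored text stays one byte per character; the multi-byte sequences only appear at the terminal boundary.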
    --
    me

  • From Brian Fox@21:1/5 to none albert on Wed Mar 22 11:05:26 2023
    On Monday, March 20, 2023 at 7:53:10 AM UTC-4, none albert wrote:

    While at the moment English is the "lingua franca" of the Internet
    and science, Chinese will become more important.

    <sidebar>
    As famous American baseball player, Yogi Berra, was reputed to say:
    "It's hard to tell what's gonna happen, especially when it's in the future"

    I won't be around to see it but China is on a demographic precipice.
    Some are saying by early in the 3rd quarter of this century the population
    will be half of current number.

    Who knows what that does? It might put Hindi in ascent.
    </sidebar>

  • From Doug Hoffman@21:1/5 to dxforth on Wed Mar 22 11:25:30 2023
    On Wednesday, March 22, 2023 at 6:06:58 AM UTC-4, dxforth wrote:
    On 22/03/2023 8:20 pm, Doug Hoffman wrote:
    On Monday, March 20, 2023 at 7:42:00 PM UTC-4, dxforth wrote:
    On 21/03/2023 7:13 am, Paul Rubin wrote:
    dxforth <dxf...@gmail.com> writes:
    What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.

    Touché.
    Precisely :)

    Users in various countries may differ. For example, the euro glyph is common ( € ) .
    Assuming an ASCII world, one byte should be plenty - 128 slots for ASCII
    and 128 slots for whatever else one believes is important.

    I think I'll use shift-option-2 for € (whatever of the remaining 128 slots that is). Wonder
    what others will use?

    -Doug

  • From Lorem Ipsum@21:1/5 to Brian Fox on Wed Mar 22 12:04:39 2023
    On Wednesday, March 22, 2023 at 2:05:28 PM UTC-4, Brian Fox wrote:
    On Monday, March 20, 2023 at 7:53:10 AM UTC-4, none albert wrote:
    While at the moment English is the "lingua franca" of the Internet
    and science, Chinese will become more important.
    <sidebar>
    As famous American baseball player, Yogi Berra, was reputed to say:
    "It's hard to tell what's gonna happen, especially when it's in the future"

    I won't be around to see it but China is on a demographic precipice.
    Some are saying by early in the 3rd quarter of this century the population will be half of current number.

    Who knows what that does? It might put Hindi in ascent.
    </sidebar>

    I don't think the selection of the language for business is done by a popular vote of the world's population. Until India changes course and finds another dominant export, other than telemarketing phone calls purporting to be selling medical insurance or
    credit card services, they will remain other than a first world country.

    I have to give them credit for ingenuity though. Who would have thought you could turn fraud into an export?

    --

    Rick C.

    --- Get 1,000 miles of free Supercharging
    --- Tesla referral code - https://ts.la/richard11209

  • From Lorem Ipsum@21:1/5 to dxforth on Wed Mar 22 11:59:15 2023
    On Wednesday, March 22, 2023 at 6:06:58 AM UTC-4, dxforth wrote:
    On 22/03/2023 8:20 pm, Doug Hoffman wrote:
    On Monday, March 20, 2023 at 7:42:00 PM UTC-4, dxforth wrote:
    On 21/03/2023 7:13 am, Paul Rubin wrote:
    dxforth <dxf...@gmail.com> writes:
    What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.

    Touché.
    Precisely :)

    Users in various countries may differ. For example, the euro glyph is common ( € ) .
    Assuming an ASCII world, one byte should be plenty - 128 slots for ASCII
    and 128 slots for whatever else one believes is important.

    UTF-8 is code compatible with ASCII, while supporting as many characters as you would like. If you use ASCII, it is also UTF-8 encoded, automagically. I'm pretty sure UTF-8 includes the euro glyph without machinations.
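
    A quick check of that compatibility claim, sketched in Python. The sample string is made up for illustration; the point is only that ASCII text yields the same bytes under both encodings, while a non-ASCII character becomes a multi-byte sequence whose bytes all have the high bit set.

```python
# Pure ASCII text produces identical bytes under both encodings.
text = "V(out) = 3.3"  # arbitrary sample string
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same = ascii_bytes == utf8_bytes

# A non-ASCII character needs a multi-byte sequence in UTF-8.
# Every byte of that sequence is >= 0x80, so none of them can be
# mistaken for an ASCII character by byte-oriented software.
euro_bytes = "\u20ac".encode("utf-8")
high_bits = all(b >= 0x80 for b in euro_bytes)

print(same, len(euro_bytes), high_bits)
```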

    --

    Rick C.

    ++ Get 1,000 miles of free Supercharging
    ++ Tesla referral code - https://ts.la/richard11209

  • From Paul Rubin@21:1/5 to Lorem Ipsum on Wed Mar 22 12:52:29 2023
    Lorem Ipsum <gnuarm.deletethisbit@gmail.com> writes:
    I'm pretty sure UTF-8 includes the euro glyph without machinations.

    The codepoint is U+20AC so the utf-8 encoding is 3 bytes long. In
    Windows-1252 it has a single byte encoding (0x80). It doesn't seem to
    exist in ISO-8859-1. In ISO-8859-15 it is 0xa4. Especially in the
    Forth milieu on limited systems, I can understand the attraction of
    having a single byte encoding for every character, even if that limits
    the character set. I think Unicode was originally intended to be a 16
    bit character set corresponding to the Unicode BMP (basic multilingual
    plane), but the BMP ran out of characters and now we have a contorted
    mess with slightly over 20 bits but plenty of literally crap characters
    (viz. U+1F4A9, the poop emoji).
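
    The byte values above can be checked with Python's standard codecs. This is only a verification sketch; the helper name encoded_hex is made up for the example.

```python
EURO = "\u20ac"        # U+20AC EURO SIGN
POOP = "\U0001F4A9"    # U+1F4A9, outside the BMP

def encoded_hex(ch, encoding):
    """Return the encoded bytes as a hex string, or None if the
    encoding has no slot for the character."""
    try:
        return ch.encode(encoding).hex()
    except UnicodeEncodeError:
        return None

encodings = ("utf-8", "cp1252", "iso-8859-1", "iso-8859-15")
results = {enc: encoded_hex(EURO, enc) for enc in encodings}

# A code point beyond the BMP takes 4 bytes in UTF-8 and a
# surrogate pair (two 16-bit units) in UTF-16.
poop_utf8 = POOP.encode("utf-8").hex()
poop_utf16 = POOP.encode("utf-16-be").hex()

for enc, value in results.items():
    print(enc, value)
print(poop_utf8, poop_utf16)
```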

  • From Marcel Hendrix@21:1/5 to Doug Hoffman on Wed Mar 22 12:15:39 2023
    On Wednesday, March 22, 2023 at 7:25:32 PM UTC+1, Doug Hoffman wrote:
    [..]
    I think I'll use shift-option-2 for € (whatever of the remaining 128 slots that is). Wonder
    what others will use?

    What about "Euro" ? It definitely looks better in a sentence, column heading or caption.

    -marcel

  • From Zbig@21:1/5 to All on Wed Mar 22 14:28:40 2023
    I think I'll use shift-option-2 for € (whatever of the remaining 128 slots that is). Wonder
    what others will use?

    In Linux Alt-5 is used.

  • From dxforth@21:1/5 to Lorem Ipsum on Thu Mar 23 12:05:50 2023
    On 23/03/2023 5:59 am, Lorem Ipsum wrote:
    On Wednesday, March 22, 2023 at 6:06:58 AM UTC-4, dxforth wrote:
    On 22/03/2023 8:20 pm, Doug Hoffman wrote:
    On Monday, March 20, 2023 at 7:42:00 PM UTC-4, dxforth wrote:
    On 21/03/2023 7:13 am, Paul Rubin wrote:
    dxforth <dxf...@gmail.com> writes:
    What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.
    Touché.
    Precisely :)

    Users in various countries may differ. For example, the euro glyph is common ( € ) .
    Assuming an ASCII world, one byte should be plenty - 128 slots for ASCII
    and 128 slots for whatever else one believes is important.

    UTF-8 is code compatible with ASCII, while supporting as many characters as you would like. If you use ASCII, it is also UTF-8 encoded, automagically. I'm pretty sure UTF-8 includes the euro glyph without machinations.


    Sure but who is going to implement UTF-8 when ASCII will do? AFAIK
    for every currency there is corresponding ASCII abbreviation e.g. AUD

  • From Ron AARON@21:1/5 to Paul Rubin on Thu Mar 23 07:03:40 2023
    On 22/03/2023 21:52, Paul Rubin wrote:
    Lorem Ipsum <gnuarm.deletethisbit@gmail.com> writes:
    I'm pretty sure UTF-8 includes the euro glyph without machinations.

    The codepoint is U+20AC so the utf-8 encoding is 3 bytes long. In Windows-1252 it has a single byte encoding (0x80). It doesn't seem to
    exist in ISO-8859-1. In ISO-8859-15 it is 0xa4. Especially in the
    Forth milieu on limited systems, I can understand the attraction of
    having a single byte encoding for every character, even if that limits
    the character set. I think Unicode was originally intended to be a 16
    bit character set corresponding to the Unicode BMP (basic multilingual plane), but the BMP ran out of characters and now we have a contorted
    mess with slightly over 20 bits but plenty of literally crap characters
    (viz. U+1F4A9, the poop emoji).

    And they'll keep adding garbage characters because we've got all that
    space now. And then the space aliens will show up and we'll need
    characters for their language, but we won't have any space left, so
    they'll zap us.

  • From Lorem Ipsum@21:1/5 to dxforth on Wed Mar 22 23:50:23 2023
    On Wednesday, March 22, 2023 at 9:05:55 PM UTC-4, dxforth wrote:
    On 23/03/2023 5:59 am, Lorem Ipsum wrote:
    On Wednesday, March 22, 2023 at 6:06:58 AM UTC-4, dxforth wrote:
    On 22/03/2023 8:20 pm, Doug Hoffman wrote:
    On Monday, March 20, 2023 at 7:42:00 PM UTC-4, dxforth wrote:
    On 21/03/2023 7:13 am, Paul Rubin wrote:
    dxforth <dxf...@gmail.com> writes:
    What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.
    Touché.
    Precisely :)

    Users in various countries may differ. For example, the euro glyph is common ( € ) .
    Assuming an ASCII world, one byte should be plenty - 128 slots for ASCII
    and 128 slots for whatever else one believes is important.

    UTF-8 is code compatible with ASCII, while supporting as many characters as you would like. If you use ASCII, it is also UTF-8 encoded, automagically. I'm pretty sure UTF-8 includes the euro glyph without machinations.

    Sure but who is going to implement UTF-8 when ASCII will do? AFAIK
    for every currency there is corresponding ASCII abbreviation e.g. AUD

    Didn't you read the post that started this thread???

    --

    Rick C.

    --- Get 1,000 miles of free Supercharging
    --- Tesla referral code - https://ts.la/richard11209

  • From Anton Ertl@21:1/5 to dxforth on Thu Mar 23 06:50:11 2023
    dxforth <dxforth@gmail.com> writes:
    Sure but who is going to implement UTF-8 when ASCII will do? AFAIK
    for every currency there is corresponding ASCII abbreviation e.g. AUD

    For AUD there is even an ASCII character: $

    However, this demonstrates the advantage of currency codes over
    currency signs: Currency signs are ambiguous.

    The currency code for the Euro is EUR.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From Anton Ertl@21:1/5 to Ron AARON on Thu Mar 23 07:02:50 2023
    Ron AARON <clf@8th-dev.com> writes:
    And they'll keep adding garbage characters because we've got all that
    space now. And then the space aliens will show up and we'll need
    characters for their language, but we won't have any space left, so
    they'll zap us.

    Unicode has grown from 110,117 code points in September 2012 to
    149,186 characters in September 2022. UTF-16 supports 1,112,064 code
    points, while UTF-8 would straightforwardly support 2G code points,
    but software should complain about code points outside the 1,112,064
    ones, so a lot of UTF-8 software probably does not support more code
    points. UTF-32 obviously can support 4G code points, and eliminating
    the limit of 1,112,064 code points will be pretty easy for UTF-32.
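
    The limits quoted above follow from a little arithmetic, sketched here in Python. The 2G figure assumes the original UTF-8 design with sequences of up to 6 bytes (31 payload bits); current UTF-8 is restricted to the same range as UTF-16.

```python
# Unicode code space: 17 planes of 65,536 code points each (U+0000..U+10FFFF).
PLANES = 17
code_space = PLANES * 0x10000          # 1,114,112

# The surrogate range U+D800..U+DFFF is reserved for UTF-16 pairs
# and cannot encode characters of its own.
surrogates = 0xDFFF - 0xD800 + 1       # 2,048
usable = code_space - surrogates       # 1,112,064

# Original 6-byte UTF-8: 1 payload bit in the lead byte plus
# 5 continuation bytes of 6 bits each = 31 bits.
utf8_original = 2 ** (1 + 5 * 6)       # 2G
utf32_raw = 2 ** 32                    # 4G

# Share of the usable space assigned as of September 2022 (figure from the post).
assigned_2022 = 149_186
share = assigned_2022 / usable

print(usable, utf8_original, utf32_raw, round(share * 100, 1))
```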

    As for "garbage characters", looking at the notes in <https://en.wikipedia.org/wiki/Unicode>, the vast majority of
    additions are not for stuff like new emojis (which probably would not
    have been introduced if space was tight), with 4,192 out of 4,489 code
    points added in the most recent version of Unicode being CJK
    ideographs, but also adding 20 emojis. Interestingly, the emojis are
    making the headlines, and they are probably more widely used than the
    newly added CJK ideographs or control characters for Egyptian
    hieroglyphs, so who are we to say that they are garbage characters.

    It's interesting that, while in Forth standardization we have strong
    resistance against new optional features from implementors, Unicode standardization seems to have little resistance to adding stuff
    (probably due to its roots: they want to support all writing systems
    rather than computer-established practice). And there can be quite
    substantial implementation cost. As a result, implementation tends to
    lag behind (at one point I wanted to run the program I had used to
    produce Figure 1 of <https://www.complang.tuwien.ac.at/anton/euroforth2005/papers/ertl%26paysan05.pdf>,
    but had trouble finding a font that supported all the scripts I had
    used). But over time, more stuff seems to be supported.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From dxforth@21:1/5 to Lorem Ipsum on Thu Mar 23 20:04:30 2023
    On 23/03/2023 5:50 pm, Lorem Ipsum wrote:
    On Wednesday, March 22, 2023 at 9:05:55 PM UTC-4, dxforth wrote:
    On 23/03/2023 5:59 am, Lorem Ipsum wrote:
    On Wednesday, March 22, 2023 at 6:06:58 AM UTC-4, dxforth wrote:
    On 22/03/2023 8:20 pm, Doug Hoffman wrote:
    On Monday, March 20, 2023 at 7:42:00 PM UTC-4, dxforth wrote:
    On 21/03/2023 7:13 am, Paul Rubin wrote:
    dxforth <dxf...@gmail.com> writes:
    What can't be done in 7-bit ASCII isn't worth doing. Less is Moore.
    Touché.
    Precisely :)

    Users in various countries may differ. For example, the euro glyph is common ( € ) .
    Assuming an ASCII world, one byte should be plenty - 128 slots for ASCII
    and 128 slots for whatever else one believes is important.

    UTF-8 is code compatible with ASCII, while supporting as many characters as you would like. If you use ASCII, it is also UTF-8 encoded, automagically. I'm pretty sure UTF-8 includes the euro glyph without machinations.

    Sure but who is going to implement UTF-8 when ASCII will do? AFAIK
    for every currency there is corresponding ASCII abbreviation e.g. AUD

    Didn't you read the post that started this thread???

    Yes - I loved it.

    On 19/03/2023 5:09 am, Lorem Ipsum wrote:
    ...
    I know there's no real fix for this. I'm not looking for ways to convert the file and I can't change LTspice. I'm mostly just venting my frustration for the last week of dealing with the poor documentation and the religious fanaticism of the support
    group.

  • From Anton Ertl@21:1/5 to Anton Ertl on Thu Mar 23 08:57:54 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    (at one point I wanted to run the program I had used to
    produce Figure 1 of <https://www.complang.tuwien.ac.at/anton/euroforth2005/papers/ertl%26paysan05.pdf>,
    but had trouble finding a font that supported all the scripts I had
    used. But over time, more stuff seems to be supported.

    I just tried it again. It works fine on the xterm setup I normally
    use (with the face "Noto Sans Mono"), and the program also displays
    fine on the emacs setup I use. In Emacs I looked at various
    characters (using C-u C-x =) to see what fonts are used, and they are:

    ASCII: x:-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso8859-1
    Runic: x:-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
    Thai: ftcrhb:-PfEd-Tlwg Typist-normal-normal-normal-*-15-*-*-*-*-0-iso10646-1
    Hebrew: x:-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
    Cyrillic: x:-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
    Greek: x:-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From Doug Hoffman@21:1/5 to Anton Ertl on Thu Mar 23 04:51:55 2023
    On Thursday, March 23, 2023 at 3:00:17 AM UTC-4, Anton Ertl wrote:
    dxforth <dxf...@gmail.com> writes:
    Sure but who is going to implement UTF-8 when ASCII will do? AFAIK
    for every currency there is corresponding ASCII abbreviation e.g. AUD
    For AUD there is even an ASCII character: $

    However, this demonstrates the advantage of currency codes over
    currency signs: Currency signs are ambiguous.

    The currency code for the Euro is EUR.
    - anton

    Interesting to know that the world has bent its glyph usage to ASCII (7-bit). I did not know that. I guess that makes things much simpler as dxforth suggests.
    Makes me wonder why we bothered with the XCHAR extension in the Forth standard.

    -Doug

  • From dxforth@21:1/5 to Anton Ertl on Fri Mar 24 00:27:28 2023
    On 23/03/2023 5:50 pm, Anton Ertl wrote:
    dxforth <dxforth@gmail.com> writes:
    Sure but who is going to implement UTF-8 when ASCII will do? AFAIK
    for every currency there is corresponding ASCII abbreviation e.g. AUD

    For AUD there is even an ASCII character: $

    Unless it's ANS-Forth or 200x

  • From Anton Ertl@21:1/5 to Paul Rubin on Sun Apr 2 17:13:12 2023
    Paul Rubin <no.email@nospam.invalid> writes:
    Especially in the
    Forth milieu on limited systems, I can understand the attraction of
    having a single byte encoding for every character, even if that limits
    the character set.

    The limited system does not have a display. It sends its output to a
    computer that knows how to display UTF-8. There is no technical
    reason for limiting yourself to single-byte encodings.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

  • From none) (albert@21:1/5 to no.email@nospam.invalid on Sun Apr 2 20:45:00 2023
    In article <87h6ucimwy.fsf@nightsong.com>,
    Paul Rubin <no.email@nospam.invalid> wrote:
    Lorem Ipsum <gnuarm.deletethisbit@gmail.com> writes:
    I'm pretty sure UTF-8 includes the euro glyph without machinations.

    The codepoint is U+20AC so the utf-8 encoding is 3 bytes long. In
    Windows-1252 it has a single byte encoding (0x80). It doesn't seem to
    exist in ISO-8859-1. In ISO-8859-15 it is 0xa4. Especially in the
    Forth milieu on limited systems, I can understand the attraction of
    having a single byte encoding for every character, even if that limits
    the character set. I think Unicode was originally intended to be a 16
    bit character set corresponding to the Unicode BMP (basic multilingual
    plane), but the BMP ran out of characters and now we have a contorted
    mess with slightly over 20 bits but plenty of literally crap characters
    (viz. U+1F4A9, the poop emoji).

    ciforth may be the simplest simple Forth around. It has no
    problems with huge character strings with whatever encoding,
    provided EMIT is adapted a little bit.
    ciforth follows linux that the primitive for char output is the
    string. As long as the length is known in bytes the terminal can
    take care of it. I can see a Chinese VT100, automatically displaying
    the Chinese character for DROP.

    So EMIT is defined as
    : EMIT DSP@ 1 TYPE DROP ;
    It could equally well be
    : EMIT DSP@ 2 TYPE DROP ;
    (As long as a single character is at most 8 bytes.)
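
    A rough Python model of what those definitions do, under stated assumptions: a 64-bit little-endian cell, and a character passed on the stack as its encoded bytes packed with the first byte lowest. The names pack_char and emit_bytes are invented for this sketch and are not part of ciforth.

```python
CELL_BYTES = 8  # assumption: 64-bit Forth cell

def pack_char(ch):
    """Pack the UTF-8 bytes of one character into a cell value,
    first byte in the lowest position (little-endian layout)."""
    raw = ch.encode("utf-8")
    if len(raw) > CELL_BYTES:
        raise ValueError("character does not fit in one cell")
    return int.from_bytes(raw, "little")

def emit_bytes(cell, n):
    """Model of 'DSP@ n TYPE DROP': send the first n bytes of the
    top-of-stack cell as they sit in memory."""
    return cell.to_bytes(CELL_BYTES, "little")[:n]

ascii_out = emit_bytes(pack_char("A"), 1)
euro_out = emit_bytes(pack_char("\u20ac"), 3)
padded_out = emit_bytes(pack_char("A"), 2)

print(ascii_out, euro_out, padded_out)
```

    In this model a fixed count larger than the character's encoding also sends the zero padding bytes, so the count would have to match the encoded length of each character.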

    I have had no complaints from Chinese users, thus far.

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat spinning. - the Wise from Antrim -
