• Re: UTF-8 and latin1

    From Stefan Ram@21:1/5 to Tobiah on Wed Aug 17 15:33:05 2022
    Tobiah <toby@tobiah.org> writes:
    I get data from various sources; client emails, spreadsheets, and
    data from web applications. I find that I can do some_string.decode('latin1')

    Strings have no "decode" method. ("bytes" objects do.)

    to get unicode that I can use with xlsxwriter,
    or put <meta charset="latin1"> in the header of a web page to display European characters correctly.

    |You should always use the UTF-8 character encoding. (Remember
    |that this means you also need to save your content as UTF-8.)
    World Wide Web Consortium (W3C) (2014)

    am using data from the wild. It's frustrating that I have to play
    a guessing game to figure out how to use incoming text. I'm just wondering

    You can let Python guess the encoding of a file.

    import pathlib

    def encoding_of( name ):
        path = pathlib.Path( name )
        # latin_1 maps every byte, so it always succeeds as the last resort.
        for encoding in( "utf_8", "cp1252", "latin_1" ):
            try:
                with path.open( encoding=encoding, errors="strict" )as file:
                    text = file.read()
                return encoding
            except UnicodeDecodeError:
                pass
        return None

    if there are any thoughts. What if we just globally decided to use utf-8? Could that ever happen?

    That decision was made long ago.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jon Ribbens@21:1/5 to Tobiah on Wed Aug 17 15:48:53 2022
    On 2022-08-17, Tobiah <toby@tobiah.org> wrote:
    I get data from various sources; client emails, spreadsheets, and
    data from web applications. I find that I can do some_string.decode('latin1')
    to get unicode that I can use with xlsxwriter,
    or put <meta charset="latin1"> in the header of a web page to display European characters correctly. But normally UTF-8 is recommended as
    the encoding to use today. latin1 works correctly more often when I
    am using data from the wild. It's frustrating that I have to play
    a guessing game to figure out how to use incoming text. I'm just wondering if there are any thoughts. What if we just globally decided to use utf-8? Could that ever happen?

    That has already been decided, as much as it ever can be. UTF-8 is
    essentially always the correct encoding to use on output, and almost
    always the correct encoding to assume on input absent any explicit
    indication of another encoding. (e.g. the HTML "standard" says that
    all HTML files must be UTF-8.)

    If you are finding that your specific sources are often encoded with
    latin-1 instead then you could always try something like:

    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        text = data.decode('latin-1')

    (I think latin-1 text will almost always fail to be decoded as utf-8,
    so this would work fairly reliably assuming those are the only two
    encodings you see.)
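
    A minimal sketch of why this fallback tends to work: in latin-1 an accented letter is a single byte at or above 0x80, and UTF-8 accepts such bytes only as part of a well-formed multi-byte sequence. The helper name `decode_guess` is hypothetical, not part of any library.

```python
def decode_guess(data: bytes) -> str:
    """Try UTF-8 first, then fall back to latin-1 (which accepts any byte)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")

latin1_bytes = "Montréal".encode("latin-1")  # é is the single byte 0xE9
utf8_bytes = "Montréal".encode("utf-8")      # é is the two bytes 0xC3 0xA9

print(decode_guess(latin1_bytes))
print(decode_guess(utf8_bytes))
```

    The guess can still be wrong: latin-1 text whose high bytes happen to form a valid UTF-8 sequence would be decoded as UTF-8.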

    Or you could use something fancy like https://pypi.org/project/chardet/

  • From Tobiah@21:1/5 to All on Wed Aug 17 08:18:57 2022
    I get data from various sources; client emails, spreadsheets, and
    data from web applications. I find that I can do some_string.decode('latin1') to get unicode that I can use with xlsxwriter,
    or put <meta charset="latin1"> in the header of a web page to display
    European characters correctly. But normally UTF-8 is recommended as
    the encoding to use today. latin1 works correctly more often when I
    am using data from the wild. It's frustrating that I have to play
    a guessing game to figure out how to use incoming text. I'm just wondering
    if there are any thoughts. What if we just globally decided to use utf-8? Could that ever happen?

  • From Stefan Ram@21:1/5 to Tobiah on Wed Aug 17 18:10:43 2022
    Tobiah <toby@tobiah.org> writes:
    On 8/17/22 08:33, Stefan Ram wrote:
    Tobiah <toby@tobiah.org> writes:
    I get data from various sources; client emails, spreadsheets, and
    data from web applications. I find that I can do some_string.decode('latin1')
    Strings have no "decode" method. ("bytes" objects do.)
    I'm using 2.7. Maybe that's why.

    Oh, I see.

  • From Tobiah@21:1/5 to All on Wed Aug 17 11:14:42 2022
    That has already been decided, as much as it ever can be. UTF-8 is essentially always the correct encoding to use on output, and almost
    always the correct encoding to assume on input absent any explicit
    indication of another encoding. (e.g. the HTML "standard" says that
    all HTML files must be UTF-8.)

    I got an email from a client with blast text that
    was in French with stuff like: Montréal, Quebéc.
    latin1 did the trick.
    Also, whenever I get a spreadsheet from a client and save as .csv,
    or take browser data through PHP, it always seems
    to work with latin1, but not UTF-8.

  • From Tobiah@21:1/5 to Stefan Ram on Wed Aug 17 10:55:11 2022
    On 8/17/22 08:33, Stefan Ram wrote:
    Tobiah <toby@tobiah.org> writes:
    I get data from various sources; client emails, spreadsheets, and
    data from web applications. I find that I can do some_string.decode('latin1')

    Strings have no "decode" method. ("bytes" objects do.)

    I'm using 2.7. Maybe that's why.


    Toby

  • From dn@21:1/5 to Stefan Ram on Thu Aug 18 08:53:17 2022
    On 18/08/2022 03.33, Stefan Ram wrote:
    Tobiah <toby@tobiah.org> writes:
    I get data from various sources; client emails, spreadsheets, and
    data from web applications. I find that I can do some_string.decode('latin1')

    Strings have no "decode" method. ("bytes" objects do.)

    to get unicode that I can use with xlsxwriter,
    or put <meta charset="latin1"> in the header of a web page to display
    European characters correctly.

    |You should always use the UTF-8 character encoding. (Remember
    |that this means you also need to save your content as UTF-8.)
    World Wide Web Consortium (W3C) (2014)

    am using data from the wild. It's frustrating that I have to play
    a guessing game to figure out how to use incoming text. I'm just wondering

    You can let Python guess the encoding of a file.

    def encoding_of( name ):
        path = pathlib.Path( name )
        for encoding in( "utf_8", "cp1252", "latin_1" ):
            try:
                with path.open( encoding=encoding, errors="strict" )as file:
                    text = file.read()
                return encoding
            except UnicodeDecodeError:
                pass
        return None

    if there are any thoughts. What if we just globally decided to use utf-8? Could that ever happen?

    That decision was made long ago.

    Unfortunately, much of our data was collected long before then - and as
    we've discovered, the OP is still living in Python 2 times.

    What about if the path "name" (above) is not in utf-8?
    eg the OP's Montréal in Latin1, as Montréal.txt or Montréal.rpt
    --
    Regards,
    =dn

  • From Barry@21:1/5 to All on Wed Aug 17 21:18:52 2022
    On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list <python-list@python.org> wrote:

    On 2022-08-17, Tobiah <toby@tobiah.org> wrote:
    I get data from various sources; client emails, spreadsheets, and
    data from web applications. I find that I can do some_string.decode('latin1')
    to get unicode that I can use with xlsxwriter,
    or put <meta charset="latin1"> in the header of a web page to display
    European characters correctly. But normally UTF-8 is recommended as
    the encoding to use today. latin1 works correctly more often when I
    am using data from the wild. It's frustrating that I have to play
    a guessing game to figure out how to use incoming text. I'm just wondering if there are any thoughts. What if we just globally decided to use utf-8? Could that ever happen?

    That has already been decided, as much as it ever can be. UTF-8 is essentially always the correct encoding to use on output, and almost
    always the correct encoding to assume on input absent any explicit
    indication of another encoding. (e.g. the HTML "standard" says that
    all HTML files must be UTF-8.)

    If you are finding that your specific sources are often encoded with
    latin-1 instead then you could always try something like:

    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        text = data.decode('latin-1')

    (I think latin-1 text will almost always fail to be decoded as utf-8,
    so this would work fairly reliably assuming those are the only two
    encodings you see.)

    Only if a reserved byte is used in the string.
    It will often work in either.

    For web pages it cannot be assumed that markup saying it’s utf-8 is
    correct. Many pages are in fact cp1252. Usually you find out because
    of a smart quote that is 0xa0 in cp1252 and illegal in utf-8.

    Barry



    Or you could use something fancy like https://pypi.org/project/chardet/

    --
    https://mail.python.org/mailman/listinfo/python-list


  • From Jon Ribbens@21:1/5 to Tobiah on Thu Aug 18 00:11:28 2022
    On 2022-08-17, Tobiah <toby@tobiah.org> wrote:
    That has already been decided, as much as it ever can be. UTF-8 is
    essentially always the correct encoding to use on output, and almost
    always the correct encoding to assume on input absent any explicit
    indication of another encoding. (e.g. the HTML "standard" says that
    all HTML files must be UTF-8.)

    I got an email from a client with blast text that
    was in French with stuff like: Montréal, Quebéc.
    latin1 did the trick.

    There's no accounting for the Québécois. They think they speak French.

    Also, whenever I get a spreadsheet from a client and save as .csv,
    or take browser data through PHP, it always seems to work with latin1,
    but not UTF-8.

    That depends on how you "saved as .csv" and what you did with PHP.
    Generally speaking browser submissions were/are supposed to be sent
    using the same encoding as the page, so if you're sending the page
    as "latin1" then you'll see that a fair amount I should think. If you
    send it as "utf-8" then you'll get 100% utf-8 back.

  • From Jon Ribbens@21:1/5 to Barry on Thu Aug 18 00:20:53 2022
    On 2022-08-17, Barry <barry@barrys-emacs.org> wrote:
    On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list <python-list@python.org> wrote:
    On 2022-08-17, Tobiah <toby@tobiah.org> wrote:
    I get data from various sources; client emails, spreadsheets, and
    data from web applications. I find that I can do some_string.decode('latin1')
    to get unicode that I can use with xlsxwriter,
    or put <meta charset="latin1"> in the header of a web page to display
    European characters correctly. But normally UTF-8 is recommended as
    the encoding to use today. latin1 works correctly more often when I
    am using data from the wild. It's frustrating that I have to play
    a guessing game to figure out how to use incoming text. I'm just wondering
    if there are any thoughts. What if we just globally decided to use utf-8? Could that ever happen?

    That has already been decided, as much as it ever can be. UTF-8 is
    essentially always the correct encoding to use on output, and almost
    always the correct encoding to assume on input absent any explicit
    indication of another encoding. (e.g. the HTML "standard" says that
    all HTML files must be UTF-8.)

    If you are finding that your specific sources are often encoded with
    latin-1 instead then you could always try something like:

    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        text = data.decode('latin-1')

    (I think latin-1 text will almost always fail to be decoded as utf-8,
    so this would work fairly reliably assuming those are the only two
    encodings you see.)

    Only if a reserved byte is used in the string.
    It will often work in either.

    Because it's actually ASCII and hence there's no difference between interpreting it as utf-8 or iso-8859-1? In which case, who cares?

    For web pages it cannot be assumed that markup saying it’s utf-8 is correct. Many pages are in fact cp1252. Usually you find out because
    of a smart quote that is 0xa0 in cp1252 and illegal in utf-8.

    Hence what I said above. But if a source explicitly states an encoding
    and it's false then these days I see little need for sympathy.

  • From Tobiah@21:1/5 to All on Thu Aug 18 07:45:09 2022
    Generally speaking browser submissions were/are supposed to be sent
    using the same encoding as the page, so if you're sending the page
    as "latin1" then you'll see that a fair amount I should think. If you
    send it as "utf-8" then you'll get 100% utf-8 back.

    The only trick I know is to use <meta charset="utf-8">. Would
    that 'send' the post as utf-8? I always expected it had more
    to do with the way the user entered the characters. How do
    they by the way, enter things like Montréal, Quebéc. When they
    enter that into a text box on a web page can we say it's in
    a particular encoding at that time? At submit time?

  • From Stefan Ram@21:1/5 to Tobiah on Thu Aug 18 15:01:48 2022
    Tobiah <toby@tobiah.org> writes:
    The only trick I know is to use <meta charset="utf-8">.

    When you have your own web server or access to the settings
    of the web server used, you would configure your web server
    to send an HTTP header such as

    Content-Type: text/html; charset=utf-8

    . If you are not able to do this, the meta element would be
    a makeshift solution. Maybe some servers use this meta
    element to then adjust the Content-Type header to it.
    Maybe some browsers ignore all specifications and try to
    "sniff" the charset from the content.

    If you are forced to receive latin1 documents, there's
    nothing you can do about it. But when you create new
    documents and have the choice, I suggest utf-8.

  • From Stefan Ram@21:1/5 to Stefan Ram on Thu Aug 18 15:25:18 2022
    ram@zedat.fu-berlin.de (Stefan Ram) writes:
    When you have your own web server or access to the settings

    With Python 3.9:

    import socket as socket_module
    server = socket_module.socket()
    server.bind( ( "0.0.0.0", 80 ))
    server.listen()
    print( "Listening." )
    while True:
        client, address = server.accept()
        print( "accepted from: ", client, address[ 0 ], address[ 1 ])
        request = client.recv( 1024 )
        print( "Received: ", request )
        body = "<html><head><title>page</title></head>".encode()
        body += "<body><p>hi!</p></body></html>".encode()
        # Every header line ends with CRLF; a blank line precedes the body.
        client.sendall( b"HTTP/1.1 200 OK\r\n" +
            b"Content-Type: text/html; charset=UTF-8\r\nContent-Length: "+
            str( len( body )).encode( "ASCII" )+ b"\r\n\r\n" )
        client.sendall( body )
        client.close()

  • From Jon Ribbens@21:1/5 to Tobiah on Thu Aug 18 16:53:31 2022
    On 2022-08-18, Tobiah <toby@tobiah.org> wrote:
    Generally speaking browser submissions were/are supposed to be sent
    using the same encoding as the page, so if you're sending the page
    as "latin1" then you'll see that a fair amount I should think. If you
    send it as "utf-8" then you'll get 100% utf-8 back.

    The only trick I know is to use <meta charset="utf-8">. Would
    that 'send' the post as utf-8? I always expected it had more
    to do with the way the user entered the characters. How do
    they by the way, enter things like Montréal, Quebéc. When they
    enter that into a text box on a web page can we say it's in
    a particular encoding at that time? At submit time?

    You configure the web server to send:

    Content-Type: text/html; charset=...

    in the HTTP header when it serves HTML files. Another way is to put:

    <meta http-equiv="content-type" content="text/html; charset=...">

    or:

    <meta charset="...">

    in the <head> section of your HTML document. The HTML "standard"
    nowadays says that you are only allowed to use the "utf-8" encoding,
    but if you use another encoding then browsers will generally use that
    as both the encoding to use when reading the HTML file and the encoding
    to use when submitting form data.
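
    On the receiving side, the declared charset can be read back out of such a header. A minimal sketch using the standard library's `email.message.Message`; the function name `charset_of` and the UTF-8 default are assumptions made for illustration.

```python
from email.message import Message

def charset_of(content_type: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset returns the fallback when no charset is declared.
    return msg.get_content_charset(default)

print(charset_of("text/html; charset=ISO-8859-1"))
print(charset_of("text/html"))
```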

  • From Tobiah@21:1/5 to All on Thu Aug 18 11:33:59 2022
    You configure the web server to send:

    Content-Type: text/html; charset=...

    in the HTTP header when it serves HTML files.

    So how does this break down? When a person enters
    Montréal, Quebéc into a form field, what are they
    doing on the keyboard to make that happen? As the
    string sits there in the text box, is it latin1, or utf-8
    or something else? How does the browser know what
    sort of data it has in that text box?

  • From Stefan Ram@21:1/5 to Tobiah on Thu Aug 18 19:01:44 2022
    Tobiah <toby@tobiah.org> writes:
    When a person enters
    Montréal, Quebéc into a form field, what are they
    doing on the keyboard to make that happen?

    Depends on the OS and its configuration. Some devices might
    not even have a keyboard as hardware.

    As the
    string sits there in the text box, is it latin1, or utf-8
    or something else?

    This is an internal implementation detail of the browser.

    How does the browser know what
    sort of data it has in that text box?

    This is an internal implementation detail of the browser.

    You usually do not need to know this internal information
    about the browser in order to use it.

  • From Jon Ribbens@21:1/5 to Tobiah on Thu Aug 18 18:56:58 2022
    On 2022-08-18, Tobiah <toby@tobiah.org> wrote:
    You configure the web server to send:

    Content-Type: text/html; charset=...

    in the HTTP header when it serves HTML files.

    So how does this break down? When a person enters
    Montréal, Quebéc into a form field, what are they
    doing on the keyboard to make that happen?

    It depends on what keyboard they have. Using a standard UK or US
    ("qwerty") keyboard and Windows you should be able to type "é" by
    holding down the 'Alt' key to the right of the spacebar, and typing
    'e'. If they're using a French ("azerty") keyboard then I think they
    can enter it by holding 'shift' and typing '2'.

    As the string sits there in the text box, is it latin1, or utf-8
    or something else?

    That depends on which browser you're using. I think it's quite likely
    it will use UTF-32 (i.e. fixed-width 32 bits per character).

    How does the browser know what sort of data it has in that text box?

    It's a text box, so it knows it's text.

  • From Chris Angelico@21:1/5 to Tobiah on Fri Aug 19 08:20:16 2022
    On Fri, 19 Aug 2022 at 08:15, Tobiah <toby@tobiah.org> wrote:

    You configure the web server to send:

    Content-Type: text/html; charset=...

    in the HTTP header when it serves HTML files.

    So how does this break down? When a person enters
    Montréal, Quebéc into a form field, what are they
    doing on the keyboard to make that happen? As the
    string sits there in the text box, is it latin1, or utf-8
    or something else? How does the browser know what
    sort of data it has in that text box?


    As it sits there in the text box, it is *a text string*.

    When it gets sent to the server, the encoding is defined by the
    browser (with reference to the server's specifications) and identified
    in a request header.

    The server should then receive that and interpret it as a text string.

    Encodings should ONLY be relevant when data is stored in files or
    transmitted across a network etc, and the rest of the time, just think
    in Unicode.
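
    A small sketch of that rule in Python 3: decode once where bytes come in, work on `str` in between, and encode once where bytes go out. The function name `shout` and its parameters are hypothetical.

```python
def shout(raw: bytes, source_encoding: str) -> bytes:
    """Decode at the input boundary, process text, encode at the output boundary."""
    text = raw.decode(source_encoding)   # bytes -> str, the only place the input encoding matters
    result = text.upper()                # pure text processing, no encoding involved
    return result.encode("utf-8")        # str -> bytes, always UTF-8 on output

incoming = "Montréal".encode("latin-1")
print(shout(incoming, "latin-1").decode("utf-8"))
```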

    Also - migrate to Python 3, your life will become a lot easier.

    ChrisA

  • From Dennis Lee Bieber@21:1/5 to All on Thu Aug 18 21:38:55 2022
    On Thu, 18 Aug 2022 11:33:59 -0700, Tobiah <toby@tobiah.org> declaimed the following:


    So how does this break down? When a person enters
    Montréal, Quebéc into a form field, what are they
    doing on the keyboard to make that happen? As the
    string sits there in the text box, is it latin1, or utf-8
    or something else? How does the browser know what
    sort of data it has in that text box?


    If this were my ancient Amiga -- most of the accented characters in ISO-Latin-1 were entered by using one of the meta/alt keys simultaneously
    with one of five or six designated "dead keys" (in days of typewriters, a
    dead key was one that did not advance the carriage to the next character space). The dead key indicated which accent mark was to be applied to the subsequent "regular" character.

    On Windows, many of the characters might be entered using <alt>#### (where #### are keys on the numeric pad!) (such as <alt>1254 => µ).

    As for what the browser receives? Unless the browser is asking for raw key codes and translating them internally to some encoding, it is likely receiving characters in whatever encoding has been defined for the
    computer/OS (Windows, most likely CP1252, which is a superset of latin-1 as
    I recall). Whether the browser then re-encodes that to UTF-8 is something I can't answer.



    --
    Wulfraed Dennis Lee Bieber AF6VN
    wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/

  • From Gilmeh Serda@21:1/5 to Tobiah on Sat Aug 20 16:51:43 2022
    On Wed, 17 Aug 2022 08:18:57 -0700, Tobiah wrote:

    if there are any thoughts. What if we just globally decided to use
    utf-8?
    Could that ever happen?

    No! Not for as long as Mi¢ro$oft exists!

    --
    Gilmeh

    Westheimer's Discovery: A couple of months in the laboratory can
    frequently save a couple of hours in the library.

  • From Stefan Ram@21:1/5 to Stefan Ram on Tue Oct 25 10:16:58 2022
    ram@zedat.fu-berlin.de (Stefan Ram) writes:
    You can let Python guess the encoding of a file.
    def encoding_of( name ):
        path = pathlib.Path( name )
        for encoding in( "utf_8", "cp1252", "latin_1" ):
            try:
                with path.open( encoding=encoding, errors="strict" )as file:

    I also read a book which claimed that the tkinter.Text
    widget would accept bytes and guess whether these are
    encoded in UTF-8 or "ISO 8859-1" and decode them
    accordingly. However, today I found that here it does
    accept bytes but it always guesses "ISO 8859-1".

    main.py

    import tkinter

    text = tkinter.Text()
    text.insert( tkinter.END, "AÄäÖöÜüß".encode( encoding='ISO 8859-1' ))
    text.insert( tkinter.END, "AÄäÖöÜüß".encode( encoding='UTF-8' ))
    text.pack()
    print( text.get( "1.0", "end" ))

    output

    AÄäÖöÜüßAÄäÖöÜüß
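
    What reading UTF-8 bytes as ISO 8859-1 does to a string can be reproduced without tkinter. This is a plain-Python sketch of the misreading itself and says nothing about the widget's internals.

```python
utf8_bytes = "ä".encode("utf-8")            # two bytes: 0xC3 0xA4
misread = utf8_bytes.decode("iso-8859-1")   # each byte becomes one character
print(misread)

# The damage is reversible as long as no bytes were lost along the way.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired)
```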

  • From Barry Scott@21:1/5 to All on Tue Oct 25 18:59:06 2022
    On 25 Oct 2022, at 11:16, Stefan Ram <ram@zedat.fu-berlin.de> wrote:

    ram@zedat.fu-berlin.de (Stefan Ram) writes:
    You can let Python guess the encoding of a file.
    def encoding_of( name ):
        path = pathlib.Path( name )
        for encoding in( "utf_8", "cp1252", "latin_1" ):
            try:
                with path.open( encoding=encoding, errors="strict" )as file:

    I also read a book which claimed that the tkinter.Text
    widget would accept bytes and guess whether these are
    encoded in UTF-8 or "ISO 8859-1" and decode them
    accordingly. However, today I found that here it does
    accept bytes but it always guesses "ISO 8859-1".

    The best you can do is assume that if the text cannot decode as utf-8 it may be 8859-1.

    Barry


    main.py

    import tkinter

    text = tkinter.Text()
    text.insert( tkinter.END, "AÄäÖöÜüß".encode( encoding='ISO 8859-1' ))
    text.insert( tkinter.END, "AÄäÖöÜüß".encode( encoding='UTF-8' ))
    text.pack()
    print( text.get( "1.0", "end" ))

    output

    AÄäÖöÜüßAÄäÖöÜüß


  • From Chris Angelico@21:1/5 to Barry Scott on Wed Oct 26 08:05:09 2022
    On Wed, 26 Oct 2022 at 05:09, Barry Scott <barry@barrys-emacs.org> wrote:



    On 25 Oct 2022, at 11:16, Stefan Ram <ram@zedat.fu-berlin.de> wrote:

    ram@zedat.fu-berlin.de (Stefan Ram) writes:
    You can let Python guess the encoding of a file.
    def encoding_of( name ):
        path = pathlib.Path( name )
        for encoding in( "utf_8", "cp1252", "latin_1" ):
            try:
                with path.open( encoding=encoding, errors="strict" )as file:

    I also read a book which claimed that the tkinter.Text
    widget would accept bytes and guess whether these are
    encoded in UTF-8 or "ISO 8859-1" and decode them
    accordingly. However, today I found that here it does
    accept bytes but it always guesses "ISO 8859-1".

    The best you can do is assume that if the text cannot decode as utf-8 it may be 8859-1.


    Except when it's Windows-1252.
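
    The difference shows up in the byte range 0x80 to 0x9F, where Windows-1252 has printable characters such as curly quotes and ISO 8859-1 has control codes. A sketch of a three-step fallback on bytes; the function name `decode_with_fallbacks` is hypothetical, and the order of attempts is an assumption, not a rule.

```python
def decode_with_fallbacks(data):
    """Return (text, encoding) for the first codec that decodes the data strictly."""
    for encoding in ("utf-8", "cp1252", "latin-1"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so the loop always returns.
    raise AssertionError("unreachable")

quoted = b"\x93hi\x94"   # curly quotes in Windows-1252, invalid as UTF-8
text, used = decode_with_fallbacks(quoted)
print(used, text)
```

    Python's cp1252 codec leaves a few bytes (0x81, for example) undefined, so only those inputs fall through to latin-1.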

    ChrisA

  • From moi@21:1/5 to All on Thu Oct 27 08:26:02 2022
    Latin-1 - Windows-1252

    Today in good software, latin-1 is an alias for Windows-1252.

    Latin-1 was badly designed and is unusable.
    In "unicode" latin-1 deliberately does not exist.

    That’s why Monsieur Adrian MUŸ can have a working mailing address
    and can order a train ticket from his desktop.
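
    In Python, at least, the two codecs stay distinct, and a capital Ÿ shows it: Windows-1252 has a byte for it, ISO 8859-1 does not. A quick check, which says nothing about how a given browser or mail system treats the "latin1" label:

```python
name = "MUŸ"

cp1252_bytes = name.encode("cp1252")   # Ÿ is the single byte 0x9F in Windows-1252

try:
    name.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    # ISO 8859-1 has no code point for Ÿ (U+0178).
    latin1_ok = False

print(cp1252_bytes, latin1_ok)
```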
