• Character encoding conversion in wide string literals

    From Juha Nieminen@21:1/5 to All on Tue Dec 7 11:37:11 2021
    Recently I stumbled across a problem where I had wide string literals
    with non-ascii characters UTF-8 encoded. In other words, I had code like
    this (I'm using non-ascii in the code below, I hope it doesn't get
    mangled up, but even if it does, it should nevertheless be clear what
    I'm trying to express):

    std::wstring str = L"non-ascii chars: ???";

    The C++ source file itself uses UTF-8 encoding, meaning that that line
    of code is likewise UTF-8 encoded. If it were a narrow string literal
    (being assigned to a std::string) then it works just fine (primarily
    because the compiler doesn't need to do anything to it, it can simply
    take those bytes from the source file as is).

    However, since it's a wide string literal (being assigned to a std::wstring) it's not as clear-cut anymore. What does the standard say about this
    situation?

    The thing is that it works just fine in Linux using gcc. The compiler will re-encode the UTF-8 encoded characters in the source file inside the parentheses into whatever encoding wide char strings use, so the correct
    content will end up in the executable binary (and thus in the wstring).

    Apparently it does not work correctly in (some recent version of)
    Visual Studio, where apparently it just takes the byte values from the
    source file within the parentheses as-is, and just assigns those values
    as-is to the wide chars that end up in the binary. (Or something like that.)

    Does the standard specify what the compiler should do in this situation?
    If not, then what is the proper way of specifying wide string literals
    that contain non-ascii characters?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alf P. Steinbach@21:1/5 to Juha Nieminen on Tue Dec 7 16:44:37 2021
    On 7 Dec 2021 12:37, Juha Nieminen wrote:
    Recently I stumbled across a problem where I had wide string literals
    with non-ascii characters UTF-8 encoded. In other words, I had code like
    this (I'm using non-ascii in the code below, I hope it doesn't get
    mangled up, but even if it does, it should nevertheless be clear what
    I'm trying to express):

    std::wstring str = L"non-ascii chars: ???";

    The C++ source file itself uses UTF-8 encoding, meaning that that line
    of code is likewise UTF-8 encoded. If it were a narrow string literal
    (being assigned to a std::string) then it works just fine (primarily
    because the compiler doesn't need to do anything to it, it can simply
    take those bytes from the source file as is).

    However, since it's a wide string literal (being assigned to a std::wstring) it's not as clear-cut anymore. What does the standard say about this situation?

    The thing is that it works just fine in Linux using gcc. The compiler will re-encode the UTF-8 encoded characters in the source file inside the parentheses into whatever encoding wide char strings use, so the correct content will end up in the executable binary (and thus in the wstring).

    Apparently it does not work correctly in (some recent version of)
    Visual Studio, where apparently it just takes the byte values from the
    source file within the parentheses as-is, and just assigns those values
    as-is to the wide chars that end up in the binary. (Or something like that.)

    The Visual C++ compiler assumes that source code is Windows ANSI encoded
    unless

    * you use an encoding option such as `/utf-8`, or
    * the source is UTF-8 with BOM, or
    * the source is UTF-16.

    Independently of that, Visual C++ assumes that the execution character
    set (the byte-based encoding that should be used for text data in the
    executable) is Windows ANSI, unless it's specified as something else.
    The `/utf-8` option also specifies that. It's a combo option that
    specifies both the source encoding and the execution character set as UTF-8.

    Unfortunately, as of VS 2022 `/utf-8` is not set by default in a VS
    project, and unfortunately there's nothing you can just click to set it.
    You have to type it in (right-click the project, then Properties -> C/C++
    -> Command Line). I usually set "/utf-8 /Zc:__cplusplus".
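
    For illustration, a minimal sketch (the characters åäö stand in for the
    ones that were mangled above): save the file as UTF-8 and build with
    MSVC's /utf-8, while g++ and clang++ already treat the source as UTF-8
    by default.

    // wide_literal.cpp -- save as UTF-8 (no BOM needed when /utf-8 is used).
    // MSVC:  cl /utf-8 /EHsc wide_literal.cpp
    #include <string>

    int main() {
        // With /utf-8 the compiler decodes the UTF-8 source bytes and then
        // re-encodes the characters into the wide execution encoding
        // (UTF-16 on Windows, UTF-32 on typical Linux setups).
        std::wstring str = L"non-ascii chars: åäö";
        return str.size() == 20 ? 0 : 1;   // 17 ASCII chars + 3 non-ASCII ones
    }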


    Does the standard specify what the compiler should do in this situation?
    If not, then what is the proper way of specifying wide string literals
    that contain non-ascii characters?

    I'll let others discuss that, but (1) it does, and (2) just so you're
    aware: the main problem is that the C and C++ standards do not conform
    to reality in their requirement that a `wchar_t` value should suffice to
    encode all possible code points in the wide character set.

    In Windows wide text is UTF-16, with 16-bit `wchar_t`. Which means that
    some emojis etc. that appear as a single character and constitute one
    21-bit code point, can become a pair of two `wchar_t` values, an UTF-16 "surrogate pair".

    That's probably not your problem though, but it is a/the problem.


    - Alf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Juha Nieminen on Tue Dec 7 11:38:44 2021
    On 12/7/21 6:37 AM, Juha Nieminen wrote:
    Recently I stumbled across a problem where I had wide string literals
    with non-ascii characters UTF-8 encoded. In other words, I had code like
    this (I'm using non-ascii in the code below, I hope it doesn't get
    mangled up, but even if it does, it should nevertheless be clear what
    I'm trying to express):

    std::wstring str = L"non-ascii chars: ???";

    The C++ source file itself uses UTF-8 encoding, meaning that that line
    of code is likewise UTF-8 encoded. If it were a narrow string literal
    (being assigned to a std::string) then it works just fine (primarily
    because the compiler doesn't need to do anything to it, it can simply
    take those bytes from the source file as is).

    However, since it's a wide string literal (being assigned to a std::wstring) it's not as clear-cut anymore. What does the standard say about this situation?

    The thing is that it works just fine in Linux using gcc. The compiler will re-encode the UTF-8 encoded characters in the source file inside the parentheses into whatever encoding wide char strings use, so the correct content will end up in the executable binary (and thus in the wstring).

    Apparently it does not work correctly in (some recent version of)
    Visual Studio, where apparently it just takes the byte values from the
    source file within the parentheses as-is, and just assigns those values
    as-is to the wide chars that end up in the binary. (Or something like that.)

    Does the standard specify what the compiler should do in this situation?

    The standard says a great many things about it, but the most important
    things it says are that the relevant character sets and encodings are implementation-defined. If an implementation uses utf-8 for its native character encoding, your code should work fine. The most likely
    explanation why it doesn't work is that your utf-8 encoded source code
    file is being interpreted using some other encoding, probably ASCII or
    one of its many variants.

    I have relatively little experience programming for Windows, and
    essentially none with internationalization. Therefore, the following
    comments about Windows all convey second or third-hand information, and
    should be treated accordingly. Many people posting on this newsgroup
    know more than I do about such things - hopefully someone will correct
    any errors I make:

    * When Unicode first came out, Windows chose to use UCS-2 to support
    it, and made that its default character encoding.
    * When Unicode expanded beyond the capacity of UCS-2, Windows decided to transition over to using UTF-16. There was an annoyingly long transition
    period during which some parts of Windows used UTF-16, while other parts
    still used UCS-2. I cannot confirm whether or not that transition period
    has completed yet.
    * I remember hearing rumors that modern versions of Windows do provide
    some support for UTF-8, but that support is neither complete nor the
    default. You have to know what you need to do to enable such support - I don't.

    If not, then what is the proper way of specifying wide string literals
    that contain non-ascii characters?

    The most portable way of doing it is to use what the standard calls
    Universal Character Names, or UCNs for short. "\u" followed by 4
    hexadecimal digits represents the character whose code point is
    identified by those digits. "\U" followed by eight hexadecimal digits represents the character whose Unicode code point is identified by those digits.
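
    For example (the code points below are stand-ins for the mangled
    originals: \u00E5, \u00E4 and \u00F6 are å, ä and ö):

    #include <string>

    // Independent of how the compiler interprets the bytes of the source file,
    // because the code points are spelled out as universal-character-names.
    std::wstring str = L"non-ascii chars: \u00E5\u00E4\u00F6";

    // Code points outside the Basic Multilingual Plane need the 8-digit form;
    // with a 16-bit wchar_t (Windows) this one becomes a surrogate pair.
    std::wstring emoji = L"\U0001F600";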
    Here's some key things to keep in mind when using UCNs:

    5.2p1: during translation phase 1, the implementation is required to
    convert any source file character that is not in the basic source
    character set into the corresponding UCN.
    5.2p2: Interrupting a UCN with an escaped new-line has undefined behavior.
    5.2p4: Creating something that looks like a UCN by using the ## operator
    has undefined behavior.
    5.2p5: During translation phase 5, UCNs are converted to the execution character set.
    5.3p2: A UCN whose hexadecimal digits don't represent a code point or
    which represents a surrogate code point renders the program ill-formed.
    A UCN that represents a control character or a member of the basic
    character set renders the program ill-formed unless it occurs in a
    character literal or string literal.
    5.4p3: The conversion to UCNs is reverted in raw string literals.
    5.10p1: UCNs are allowed in identifiers, but only if they fall into one
    of the ranges listed in Table 2 of the standard.
    5.13.3p8: Any UCN for which there is no corresponding member of the
    execution character set is translated to an implementation-defined encoding.
    5.13.5p13: A UCN occurring in a UTF-16 string literal may yield a
    surrogate pair. A UCN occurring in a narrow string literal may map to
    one or more char or char8_t elements.

    Here's a more detailed explanation of what the standard says about this situation:
    The standard talks about three different implementation-defined
    character sets:
    * The physical source character set which is used in your source code file.
    * The source character set which is used internally by the compiler
    while processing your code.
    * The execution character set used by your program when it is executed.

    The standard talks about 5 different character encodings:
    * The implementation-defined narrow and wide native encodings used by
    character constants and string literals with no prefix, or with the "L"
    prefix, respectively. These are stored in arrays of char and wchar_t,
    respectively.
    * The UTF-8, UTF-16, and UTF-32 encodings used by character constants with
    u8, u, and U prefixes, respectively. These are stored in arrays of
    char8_t, char16_t, and char32_t, respectively.
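
    In code, the pairing of prefixes and element types looks like this (note
    that u8 literals have had elements of type char8_t only since C++20;
    before that they were plain const char):

    const char*     narrow = "text";   // narrow native (execution) encoding
    const wchar_t*  wide   = L"text";  // wide native encoding
    const char8_t*  utf8   = u8"text"; // UTF-8 (char8_t since C++20)
    const char16_t* utf16  = u"text";  // UTF-16
    const char32_t* utf32  = U"text";  // UTF-32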

    Virtually every standard library template that handles characters is
    required to support specializations for wchar_t, char8_t, char16_t, and char32_t.

    The standard mandates support for std::codecvt facets enabling
    conversion between the narrow and wide native encodings, and facets for converting between UTF-8 and either UTF-16 or UTF-32.
    The standard specifies the <cuchar> header which incorporates routines
    from the C standard library header <uchar.h> for converting between the
    narrow native encoding and either UTF-16 or UTF-32.
    Therefore, conversion between wchar_t and either char16_t or char32_t
    requires three conversion steps.
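
    As a rough sketch of what the <cuchar> route looks like, here is a
    conversion from the narrow native encoding to UTF-16 using std::mbrtoc16
    (the helper name narrow_to_utf16 is made up, error handling is minimal,
    and the result depends on the current locale):

    #include <cstddef>  // std::size_t
    #include <cuchar>   // std::mbrtoc16
    #include <cwchar>   // std::mbstate_t
    #include <string>

    std::u16string narrow_to_utf16(const std::string& in)
    {
        std::u16string out;
        std::mbstate_t state{};
        const char* p = in.data();
        const char* end = p + in.size();
        while (p < end) {
            char16_t c16;
            std::size_t rc = std::mbrtoc16(&c16, p, end - p, &state);
            if (rc == std::size_t(-1) || rc == std::size_t(-2)) break; // invalid or incomplete input
            if (rc == std::size_t(-3)) { out += c16; continue; }       // trailing surrogate of the previous character, no bytes consumed
            out += c16;
            p += (rc == 0 ? 1 : rc);  // rc == 0 means a null character was converted
        }
        return out;
    }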

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Alf P. Steinbach on Tue Dec 7 11:48:59 2021
    On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
    ...
    aware: the main problem is that the C and C++ standards do not conform
    to reality in their requirement that a `wchar_t` value should suffice to encode all possible code points in the wide character set.

    The purpose of the C and C++ standards is prescriptive, not descriptive.
    It's therefore missing the point to criticize them for not conforming to reality. Rather, you should say that some popular implementations fail
    to conform to the standards.

    In Windows wide text is UTF-16, with 16-bit `wchar_t`. Which means that
    some emojis etc. that appear as a single character and constitute one
    21-bit code point, can become a pair of two `wchar_t` values, an UTF-16 "surrogate pair".

    The C++ standard explicitly addresses that point, though the C standard
    does not.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alf P. Steinbach@21:1/5 to Keith Thompson on Tue Dec 7 19:07:00 2021
    On 7 Dec 2021 18:41, Keith Thompson wrote:
    "Alf P. Steinbach" <alf.p.steinbach@gmail.com> writes:
    On 7 Dec 2021 12:37, Juha Nieminen wrote:
    Recently I stumbled across a problem where I had wide string literals
    with non-ascii characters UTF-8 encoded. In other words, I had code like this (I'm using non-ascii in the code below, I hope it doesn't get
    mangled up, but even if it does, it should nevertheless be clear what
    I'm trying to express):
    std::wstring str = L"non-ascii chars: ???";
    The C++ source file itself uses UTF-8 encoding, meaning that that
    line
    of code is likewise UTF-8 encoded. If it were a narrow string literal
    (being assigned to a std::string) then it works just fine (primarily
    because the compiler doesn't need to do anything to it, it can simply
    take those bytes from the source file as is).
    However, since it's a wide string literal (being assigned to a
    std::wstring)
    it's not as clear-cut anymore. What does the standard say about this
    situation?
    The thing is that it works just fine in Linux using gcc. The
    compiler will
    re-encode the UTF-8 encoded characters in the source file inside the
    parentheses into whatever encoding wide char strings use, so the correct
    content will end up in the executable binary (and thus in the wstring).
    Apparently it does not work correctly in (some recent version of)
    Visual Studio, where apparently it just takes the byte values from the
    source file within the parentheses as-is, and just assigns those values
    as-is to the wide chars that end up in the binary. (Or something like that.)

    The Visual C++ compiler assumes that source code is Windows ANSI
    encoded unless

    * you use an encoding option such as `/utf-8`, or
    * the source is UTF-8 with BOM, or
    * the source is UTF-16.

    What exactly do you mean by "Windows ANSI"? Windows-1252 or something
    else? (Microsoft doesn't call it "ANSI", because it isn't.)

    [...]

    "Windows ANSI" is the encoding specified by the `GetACP` API function,
    which (though as I recall this is more or less undocumented) just serves up
    the codepage number specified by the registry value

    Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage@ACP

    This means that "Windows ANSI" is a pretty dynamic thing. Not just system-dependent, but at-the-moment-configuration dependent. Though in English-speaking countries it's Windows 1252 by default.
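
    A tiny Windows-only sketch that asks at run time which code page GetACP
    reports (1252 is the usual default on US and Western-European systems,
    65001 if the system-wide UTF-8 option is enabled):

    #include <windows.h>
    #include <cstdio>

    int main() {
        UINT acp = GetACP();             // the "ANSI" code page currently in effect
        std::printf("ACP = %u\n", acp);  // e.g. 1252, or 65001 for UTF-8
    }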

    And that in turn means that using the defaults with Visual C++, you can
    end up with pretty much any encoding whatsoever of narrow literals.

    Which means that it's a good idea to take charge.

    Option `/utf-8` is one way to take charge.


    - Alf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Alf P. Steinbach on Tue Dec 7 09:41:44 2021
    "Alf P. Steinbach" <alf.p.steinbach@gmail.com> writes:
    On 7 Dec 2021 12:37, Juha Nieminen wrote:
    Recently I stumbled across a problem where I had wide string literals
    with non-ascii characters UTF-8 encoded. In other words, I had code like
    this (I'm using non-ascii in the code below, I hope it doesn't get
    mangled up, but even if it does, it should nevertheless be clear what
    I'm trying to express):
    std::wstring str = L"non-ascii chars: ???";
    The C++ source file itself uses UTF-8 encoding, meaning that that
    line
    of code is likewise UTF-8 encoded. If it were a narrow string literal
    (being assigned to a std::string) then it works just fine (primarily
    because the compiler doesn't need to do anything to it, it can simply
    take those bytes from the source file as is).
    However, since it's a wide string literal (being assigned to a
    std::wstring)
    it's not as clear-cut anymore. What does the standard say about this
    situation?
    The thing is that it works just fine in Linux using gcc. The
    compiler will
    re-encode the UTF-8 encoded characters in the source file inside the
    parentheses into whatever encoding wide char strings use, so the correct
    content will end up in the executable binary (and thus in the wstring).
    Apparently it does not work correctly in (some recent version of)
    Visual Studio, where apparently it just takes the byte values from the
    source file within the parentheses as-is, and just assigns those values
    as-is to the wide chars that end up in the binary. (Or something like that.)

    The Visual C++ compiler assumes that source code is Windows ANSI
    encoded unless

    * you use an encoding option such as `/utf-8`, or
    * the source is UTF-8 with BOM, or
    * the source is UTF-16.

    What exactly do you mean by "Windows ANSI"? Windows-1252 or something
    else? (Microsoft doesn't call it "ANSI", because it isn't.)

    [...]

    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    Working, but not speaking, for Philips
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Alf P. Steinbach@21:1/5 to James Kuyper on Tue Dec 7 18:59:53 2021
    On 7 Dec 2021 17:48, James Kuyper wrote:
    On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
    ...
    aware: the main problem is that the C and C++ standards do not conform
    to reality in their requirement that a `wchar_t` value should suffice to
    encode all possible code points in the wide character set.

    The purpose of the C and C++ standards is prescriptive, not descriptive.
    It's therefore missing the point to criticize them for not conforming to reality. Rather, you should say that some popular implementations fail
    to conform to the standards.

    No, in this case it's the standard's fault. They failed to standardize
    existing practice and instead standardized a completely unreasonable requirement, given that 16-bit `wchar_t` was established as the API
    foundation in the most widely used OS on the platform, something that
    could not easily be changed. In particular this was the C standard
    committee: their choice here was as reasonable and practical as their
    choice of not supporting pointers outside of original (sub-) array.

    It was idiotic. It was simple blunders. But in both cases, as I recall,
    they tried to cover up the blunder by writing a rationale; they took the blunders to heart and made them into great obstacles, to not lose face.


    In Windows wide text is UTF-16, with 16-bit `wchar_t`. Which means that
    some emojis etc. that appear as a single character and constitute one
    21-bit code point, can become a pair of two `wchar_t` values, an UTF-16
    "surrogate pair".

    The C++ standard explicitly addresses that point, though the C standard
    does not.

    Happy to hear that but some more specific information would be welcome.


    - Alf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Manfred@21:1/5 to James Kuyper on Tue Dec 7 19:07:18 2021
    On 12/7/2021 5:38 PM, James Kuyper wrote:
    * I remember hearing rumors that modern versions of Windows do provide
    some support for UTF-8, but that support is neither complete nor the default. You have to know what you need to do to enable such support - I don't.

    One relevant addition that is relatively recent is support for converting
    to/from UTF-8 in the APIs WideCharToMultiByte and MultiByteToWideChar.
    These make it possible to handle UTF-8 programmatically in code.
    Windows itself still uses UTF-16 internally.
    I don't know how filenames are stored on disk.
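
    For reference, a sketch of the usual two-call pattern with these APIs,
    converting UTF-8 to the UTF-16 that Windows uses internally (the helper
    name utf8_to_wide is made up; real code should also check GetLastError
    on failure):

    #include <windows.h>
    #include <string>

    std::wstring utf8_to_wide(const std::string& utf8)
    {
        if (utf8.empty()) return {};
        // First call: ask how many wide characters are needed.
        int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                    utf8.data(), (int)utf8.size(), nullptr, 0);
        if (n == 0) return {};  // invalid UTF-8, or some other failure
        std::wstring wide(n, L'\0');
        // Second call: perform the conversion into the buffer.
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8.data(), (int)utf8.size(), wide.data(), n);
        return wide;
    }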

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paavo Helde@21:1/5 to All on Tue Dec 7 20:32:59 2021
    On 07.12.2021 19:41, Keith Thompson wrote:

    What exactly do you mean by "Windows ANSI"? Windows-1252 or something
    else? (Microsoft doesn't call it "ANSI", because it isn't.)

    It does. From https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp

    "Retrieves the current Windows ANSI code page identifier for the
    operating system."

    This is in contrast to the GetOEMCP() function which is said to return
    "OEM code page", not "ANSI code page". Both terms are misnomers from the previous century.

    Both these codepage settings traditionally refer to some narrow char
    codepage identifiers, which will vary depending on the user regional
    settings and are thus unpredictable and unusable for basically anything
    related to internationalization.

    The only meaningful strategy is to set these both to UTF-8 which now
    finally has some (beta stage?) support in Windows 10, and to upgrade all affected software to properly support this setting.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Alf P. Steinbach on Tue Dec 7 13:55:32 2021
    On 12/7/21 12:59 PM, Alf P. Steinbach wrote:
    On 7 Dec 2021 17:48, James Kuyper wrote:
    On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
    ...
    The purpose of the C and C++ standards is prescriptive, not descriptive.
    It's therefore missing the point to criticize them for not conforming to
    reality. Rather, you should say that some popular implementations fail
    to conform to the standards.

    No, in this case it's the standard's fault. They failed to standardize existing practice and instead standardized a completely unreasonable requirement, given that 16-bit `wchar_t` was established as the API foundation in the most widely used OS on the platform, something that
    could not easily be changed. In particular this was the C standard
    committee: their choice here was as reasonable and practical as their
    choice of not supporting pointers outside of original (sub-) array.

    It was existing practice. From the very beginning, wchar_t was supposed
    to be "an integral type whose range of values can represent distinct
    codes for all members of the largest extended character set specified
    among the supported locales". When char32_t was added to the language,
    moving that specification to char32_t might have been a reasonable thing
    to do, but continuing to apply that specification to wchar_t was NOT an innovation. The same version of the standard that added char32_t also
    added char16_t, which is what should now be used for UTF-16 encoding,
    not wchar_t.

    It's an abuse of what wchar_t was intended for, to use it for a
    variable-length encoding. None of the functions in the C or C++ standard library for dealing with wchar_t values has ever had the right kind of interface to allow it to be used as a variable-length encoding. To see
    what I'm talking about, look at the mbrto*() and *tomb() functions from
    the C standard library, that have been incorporated by reference into
    the C++ standard library. Those functions do have interfaces designed to
    handle a variable-length encoding.
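
    To make that concrete, a small sketch of the interface shape (the helper
    name decode_one is made up; behavior depends on the current locale):

    #include <cstddef>  // std::size_t
    #include <cwchar>   // std::mbrtowc, std::mbstate_t

    std::size_t decode_one(const char* bytes, std::size_t avail,
                           wchar_t& out, std::mbstate_t& state)
    {
        std::size_t rc = std::mbrtowc(&out, bytes, avail, &state);
        // rc == (size_t)-2 : the 'avail' bytes form only part of a character
        // rc == (size_t)-1 : invalid sequence in the current locale's encoding
        // rc == 0          : a null character was decoded
        // otherwise        : number of bytes consumed for one complete character
        return rc;
    }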

    ...
    In Windows wide text is UTF-16, with 16-bit `wchar_t`. Which means that
    some emojis etc. that appear as a single character and constitute one
    21-bit code point, can become a pair of two `wchar_t` values, an UTF-16
    "surrogate pair".

    The C++ standard explicitly addresses that point, though the C standard
    does not.

    Happy to hear that but some more specific information would be welcome.

    5.3p2:
    "A universal-character-name designates the character in ISO/IEC 10646
    (if any) whose code point is the hexadecimal number represented by the
    sequence of hexadecimal-digits in the universal-character-name. The
    program is ill-formed if that number ... is a surrogate code point. ...
    A surrogate code point is a value in the range [D800, DFFF] (hexadecimal)."

    5.13.5p8: "[Note: A single c-char may produce more than one char16_t
    character in the form of surrogate pairs. A surrogate pair is a
    representation for a single code point as a sequence of two 16-bit code
    units. — end note]"

    5.13.5p13: "a universal-character-name in a UTF-16 string literal may
    yield a surrogate pair. ... The size of a UTF-16 string literal is the
    total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus
    one for the terminating u'\0'."

    Note that it's UTF-16, which should be encoded using char16_t, for which
    this issue is acknowledged. wchar_t is not, and never was, supposed to
    be a variable-length encoding like UTF-8 and UTF-16.
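
    A small compile-time illustration of that difference (U+1F600 lies outside
    the BMP, so a UTF-16 literal needs two code units for it, a UTF-32 literal
    only one):

    constexpr char16_t s16[] = u"\U0001F600";
    constexpr char32_t s32[] = U"\U0001F600";

    static_assert(sizeof(s16) / sizeof(char16_t) == 3, "surrogate pair + terminator");
    static_assert(sizeof(s32) / sizeof(char32_t) == 2, "one code unit + terminator");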

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Paavo Helde on Tue Dec 7 13:55:55 2021
    On 12/7/21 1:32 PM, Paavo Helde wrote:
    On 07.12.2021 19:41, Keith Thompson wrote:

    What exactly do you mean by "Windows ANSI"? Windows-1252 or something
    else? (Microsoft doesn't call it "ANSI", because it isn't.)

    It does. From https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp

    "Retrieves the current Windows ANSI code page identifier for the
    operating system."

    This is in contrast to the GetOEMCP() function which is said to return
    "OEM code page", not "ANSI code page". Both terms are misnomers from the previous century.

    Both these codepage settings traditionally refer to some narrow char
    codepage identifiers, which will vary depending on the user regional
    settings and are thus unpredictable and unusable for basically anything related to internationalization.

    The only meaningful strategy is to set these both to UTF-8 which now
    finally has some (beta stage?) support in Windows 10, and to upgrade all affected software to properly support this setting.

    Note that it was referred to as "ANSI" because Microsoft proposed it for
    ANSI standardization, but that proposal was never approved. Continuing
    to refer to it as "ANSI" decades later is a rather sad failure to
    acknowledge that rejection.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Alf P. Steinbach on Tue Dec 7 12:21:47 2021
    "Alf P. Steinbach" <alf.p.steinbach@gmail.com> writes:
    On 7 Dec 2021 18:41, Keith Thompson wrote:
    "Alf P. Steinbach" <alf.p.steinbach@gmail.com> writes:
    On 7 Dec 2021 12:37, Juha Nieminen wrote:
    Recently I stumbled across a problem where I had wide string literals
    with non-ascii characters UTF-8 encoded. In other words, I had code like this (I'm using non-ascii in the code below, I hope it doesn't get
    mangled up, but even if it does, it should nevertheless be clear what
    I'm trying to express):
    std::wstring str = L"non-ascii chars: ???";
    The C++ source file itself uses UTF-8 encoding, meaning that that
    line
    of code is likewise UTF-8 encoded. If it were a narrow string literal
    (being assigned to a std::string) then it works just fine (primarily
    because the compiler doesn't need to do anything to it, it can simply
    take those bytes from the source file as is).
    However, since it's a wide string literal (being assigned to a
    std::wstring)
    it's not as clear-cut anymore. What does the standard say about this
    situation?
    The thing is that it works just fine in Linux using gcc. The
    compiler will
    re-encode the UTF-8 encoded characters in the source file inside the
    parentheses into whatever encoding wide char strings use, so the correct
    content will end up in the executable binary (and thus in the wstring).
    Apparently it does not work correctly in (some recent version of)
    Visual Studio, where apparently it just takes the byte values from the
    source file within the parentheses as-is, and just assigns those values
    as-is to the wide chars that end up in the binary. (Or something like that.)

    The Visual C++ compiler assumes that source code is Windows ANSI
    encoded unless

    * you use an encoding option such as `/utf-8`, or
    * the source is UTF-8 with BOM, or
    * the source is UTF-16.
    What exactly do you mean by "Windows ANSI"? Windows-1252 or
    something
    else? (Microsoft doesn't call it "ANSI", because it isn't.)
    [...]

    "Windows ANSI" is the encoding specified by the `GetACP` API function,
    which (though as I recall this is more or less undocumented) just serves
    up the codepage number specified by the registry value

    Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage@ACP

    This means that "Windows ANSI" is a pretty dynamic thing. Not just system-dependent, but at-the-moment-configuration dependent. Though in English-speaking countries it's Windows 1252 by default.

    And that in turn means that using the defaults with Visual C++, you
    can end up with pretty much any encoding whatsoever of narrow
    literals.

    Which means that it's a good idea to take charge.

    Option `/utf-8` is one way to take charge.

    It appears my previous statement was incorrect. At least some Microsoft documentation does still (incorrectly) refer to "Windows ANSI".

    https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp

    The history, as I recall, is that Microsoft proposed one or more 8-bit extensions of the 7-bit ASCII character set as ANSI standards.
    Windows-1252, which has various accented letters and other symbols in
    the range 128-255, is the best known variant. But Microsoft's proposal
    was never adopted by ANSI, leaving us with a bunch of incorrect
    documentation. Instead, ISO created the 8859-* 8-bit character sets,
    including 8859-1, or Latin-1. Latin-1 differs from Windows-1252 in that Latin-1 has control characters in the range 128-159, while
    Windows-1252 has printable characters.

    https://en.wikipedia.org/wiki/Windows-1252

    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    Working, but not speaking, for Philips
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Paavo Helde on Tue Dec 7 12:26:38 2021
    Paavo Helde <eesnimi@osa.pri.ee> writes:
    On 07.12.2021 19:41, Keith Thompson wrote:
    What exactly do you mean by "Windows ANSI"? Windows-1252 or
    something
    else? (Microsoft doesn't call it "ANSI", because it isn't.)

    It does. From https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp

    "Retrieves the current Windows ANSI code page identifier for the
    operating system."

    Yes, I had missed that.

    But Microsoft has also said:

    The term ANSI as used to signify Windows code pages is a historical
    reference, but is nowadays a misnomer that continues to persist in
    the Windows community.

    https://en.wikipedia.org/wiki/Windows-1252
    https://web.archive.org/web/20150204175931/http://download.microsoft.com/download/5/6/8/56803da0-e4a0-4796-a62c-ca920b73bb17/21-Unicode_WinXP.pdf

    Microsoft's mistake was to start using the term "ANSI" before it
    actually became an ANSI standard. Once that mistake was in place,
    cleaning it up was very difficult.

    This is in contrast to the GetOEMCP() function which is said to return
    "OEM code page", not "ANSI code page". Both terms are misnomers from
    the previous century.

    Both these codepage settings traditionally refer to some narrow char
    codepage identifiers, which will vary depending on the user regional
    settings and are thus unpredictable and unusable for basically
    anything related to internationalization.

    The only meaningful strategy is to set these both to UTF-8 which now
    finally has some (beta stage?) support in Windows 10, and to upgrade
    all affected software to properly support this setting.

    Yes, I advocate using UTF-8 whenever practical.

    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    Working, but not speaking, for Philips
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Öö Tiib@21:1/5 to Alf P. Steinbach on Tue Dec 7 16:39:07 2021
    On Tuesday, 7 December 2021 at 20:00:13 UTC+2, Alf P. Steinbach wrote:
    On 7 Dec 2021 17:48, James Kuyper wrote:
    On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
    ...
    aware: the main problem is that the C and C++ standards do not conform
    to reality in their requirement that a `wchar_t` value should suffice to encode all possible code points in the wide character set.

    The purpose of the C and C++ standards is prescriptive, not descriptive. It's therefore missing the point to criticize them for not conforming to reality. Rather, you should say that some popular implementations fail
    to conform to the standards.
    No, in this case it's the standard's fault. They failed to standardize existing practice and instead standardized a completely unreasonable requirement, given that 16-bit `wchar_t` was established as the API foundation in the most widely used OS on the platform, something that
    could not easily be changed. In particular this was the C standard
    committee: their choice here was as reasonable and practical as their
    choice of not supporting pointers outside of original (sub-) array.

    It was idiotic. It was simple blunders. But in both cases, as I recall,
    they tried to cover up the blunder by writing a rationale; they took the blunders to heart and made them into great obstacles, to not lose face.

    If the C and/or C++ committee had standardized that wchar_t means
    precisely "UTF-16 LE code unit" and nothing else, then it would be
    something different on Windows by now.

    In the case of Microsoft, the only way to make it change its idiotic
    "existing practices" appears to be to standardize those. Once an idiotic
    practice of Microsoft's is standardized, Microsoft finds the resources
    to switch from it to some reasonable one (as their "innovative"
    extension).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Alf P. Steinbach on Wed Dec 8 09:18:14 2021
    On 07/12/2021 18:59, Alf P. Steinbach wrote:
    On 7 Dec 2021 17:48, James Kuyper wrote:
    On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
    ...
    aware: the main problem is that the C and C++ standards do not conform
    to reality in their requirement that a `wchar_t` value should suffice to encode all possible code points in the wide character set.

    The purpose of the C and C++ standards is prescriptive, not descriptive.
    It's therefore missing the point to criticize them for not conforming to
    reality. Rather, you should say that some popular implementations fail
    to conform to the standards.

    No, in this case it's the standard's fault. They failed to standardize existing practice and instead standardized a completely unreasonable requirement, given that 16-bit `wchar_t` was established as the API foundation in the most widely used OS on the platform, something that
    could not easily be changed. In particular this was the C standard
    committee: their choice here was as reasonable and practical as their
    choice of not supporting pointers outside of original (sub-) array.

    It was idiotic. It was simple blunders. But in both cases, as I recall,
    they tried to cover up the blunder by writing a rationale; they took the blunders to heart and made them into great obstacles, to not lose face.


    In Windows wide text is UTF-16, with 16-bit `wchar_t`. Which means that
    some emojis etc. that appear as a single character and constitute one
    21-bit code point, can become a pair of two `wchar_t` values, an UTF-16
    "surrogate pair".

    The C++ standard explicitly addresses that point, though the C standard
    does not.

    Happy to hear that but some more specific information would be welcome.


    My understanding is that at that time, the Windows wide character set
    was UCS2, not UTF-16. Thus a 16-bit wchar_t was sufficient to encode
    all wide characters.

    It turned out that UCS2 was a dead-end, and now UTF-16 is a hack-job
    that combines all the disadvantages of UTF-8 with all the disadvantages
    of UTF-32, and none of the benefits of either. We can't blame MS for
    going for UCS2 - they were early adopters and Unicode was 16-bit, so it
    was a good choice at the time. They, and therefore their users, were
    unlucky (along with Java, QT, Python, and no doubt others). Changing is
    not easy - you have to make everything UTF-8 and yet still support a
    horrible mix of wchar_t, char16_t, UCS2, and UTF-16 for legacy.

    But as far as I can see, the C and C++ standards were fine with 16-bit
    wchar_t when they were written. I have heard, but have no reference or
    source, that the inclusion of 16-bit wchar_t in the standards was
    promoted by MS in the first place.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)