Recently I stumbled across a problem where I had wide string literals
containing non-ASCII characters, UTF-8 encoded. In other words, I had code
like this (I'm using non-ASCII characters in the code below; I hope they
don't get mangled, but even if they do, it should nevertheless be clear
what I'm trying to express):
std::wstring str = L"non-ascii chars: ???";
The C++ source file itself uses UTF-8 encoding, meaning that that line
of code is likewise UTF-8 encoded. If it were a narrow string literal
(being assigned to a std::string) it would work just fine (primarily
because the compiler doesn't need to do anything to it; it can simply
take those bytes from the source file as-is).
However, since it's a wide string literal (being assigned to a std::wstring) it's not as clear-cut anymore. What does the standard say about this situation?
The thing is that it works just fine in Linux using gcc. The compiler
re-encodes the UTF-8 encoded characters inside the quotes into whatever
encoding wide character strings use, so the correct content ends up in
the executable binary (and thus in the wstring).
Apparently it does not work correctly in (some recent version of)
Visual Studio, which seems to take the byte values within the quotes
from the source file as-is and assign them directly to the wide chars
that end up in the binary. (Or something like that.)
Command line: I usually set "/utf-8 /Zc:__cplusplus".
Does the standard specify what the compiler should do in this situation?
If not, then what is the proper way of specifying wide string literals
that contain non-ascii characters?
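For reference, one way to sidestep the source-encoding question entirely is
to spell the non-ASCII characters as universal character names, so the
compiler never has to guess how the source bytes are encoded. A minimal
sketch (the characters ä, ö, ü here are only stand-ins for whatever the
mangled ??? above originally contained):

    #include <iostream>
    #include <string>

    int main()
    {
        // Universal character names are independent of the source file's
        // encoding: the compiler converts the named code points into the
        // wide execution encoding itself (UTF-16 on Windows, UTF-32 on
        // typical Linux/GCC setups).
        std::wstring str = L"non-ascii chars: \u00e4\u00f6\u00fc";  // "äöü"

        // The same literal written with the raw characters works too,
        // provided the compiler knows the source is UTF-8 (GCC's default;
        // MSVC needs /utf-8 or a BOM):
        //   std::wstring str2 = L"non-ascii chars: äöü";

        std::wcout << L"length in code units: " << str.size() << L'\n';
    }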
Be aware: the main problem is that the C and C++ standards do not conform
to reality in their requirement that a `wchar_t` value should suffice to
encode all possible code points in the wide character set.
In Windows, wide text is UTF-16, with 16-bit `wchar_t`. Which means that
some emojis etc. that appear as a single character and constitute one
21-bit code point can become a pair of `wchar_t` values, a UTF-16
"surrogate pair".
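A small sketch of that effect, assuming a compiler that accepts \U
universal character names (all the major ones do); U+1F600 is just one
example of a code point above U+FFFF:

    #include <cstdio>
    #include <cwchar>

    int main()
    {
        // U+1F600 (an emoji) is above U+FFFF, so it cannot fit in a
        // single 16-bit code unit.
        const wchar_t* s = L"\U0001F600";

        // On Windows (16-bit wchar_t, UTF-16) the literal typically holds
        // 2 code units -- a surrogate pair.  On Linux with GCC (32-bit
        // wchar_t, UTF-32) it holds 1.
        std::printf("wchar_t: %zu bytes, literal: %zu code units\n",
                    sizeof(wchar_t), std::wcslen(s));
    }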
"Alf P. Steinbach" <alf.p.steinbach@gmail.com> writes:
On 7 Dec 2021 12:37, Juha Nieminen wrote:
[...]
The Visual C++ compiler assumes that source code is Windows ANSI
encoded unless
* you use an encoding option such as `/utf-8`, or
* the source is UTF-8 with BOM, or
* the source is UTF-16.
What exactly do you mean by "Windows ANSI"? Windows-1252 or something
else? (Microsoft doesn't call it "ANSI", because it isn't.)
[...]
On 7 Dec 2021 12:37, Juha Nieminen wrote:
[...]
The Visual C++ compiler assumes that source code is Windows ANSI
encoded unless
* you use an encoding option such as `/utf-8`, or
* the source is UTF-8 with BOM, or
* the source is UTF-16.
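A quick way to see what the compiler actually did with a wide literal is to
dump its code units and compare against the same character spelled as an
escape; a sketch (using ä, U+00E4, as the test character):

    #include <cstdio>

    int main()
    {
        const wchar_t raw[] = L"ä";       // depends on how the source bytes are decoded
        const wchar_t ucn[] = L"\u00e4";  // always U+00E4, independent of source encoding

        std::printf("raw:");
        for (wchar_t c : raw) std::printf(" 0x%04X", static_cast<unsigned>(c));
        std::printf("\nucn:");
        for (wchar_t c : ucn) std::printf(" 0x%04X", static_cast<unsigned>(c));
        std::printf("\n");
        // If the two lines differ (e.g. raw shows 0x00C3 0x00A4), the
        // compiler decoded the UTF-8 source as if it were Windows-1252.
    }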
On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
...
Be aware: the main problem is that the C and C++ standards do not conform
to reality in their requirement that a `wchar_t` value should suffice to
encode all possible code points in the wide character set.
The purpose of the C and C++ standards is prescriptive, not descriptive.
It's therefore missing the point to criticize them for not conforming to reality. Rather, you should say that some popular implementations fail
to conform to the standards.
In Windows wide text is UTF-16, with 16-bit `wchar_t`. Which means that
some emojis etc. that appear as a single character and constitute one
21-bit code point, can become a pair of two `wchar_t` values, an UTF-16
"surrogate pair".
The C++ standard explicitly addresses that point, though the C standard
does not.
* I remember hearing rumors that modern versions of Windows do provide
some support for UTF-8, but that support is neither complete nor the
default. You have to know what you need to do to enable such support - I
don't.
On 7 Dec 2021 17:48, James Kuyper wrote:
On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
[...]
The purpose of the C and C++ standards is prescriptive, not descriptive.
It's therefore missing the point to criticize them for not conforming to
reality. Rather, you should say that some popular implementations fail
to conform to the standards.
No, in this case it's the standard's fault. They failed to standardize
existing practice and instead standardized a completely unreasonable
requirement, given that 16-bit `wchar_t` was already established as the
API foundation in the most widely used OS, something that could not
easily be changed. In particular this was the C standard committee: their
choice here was as reasonable and practical as their choice of not
supporting pointers outside of the original (sub-)array.
It was idiotic. These were simple blunders. But in both cases, as I
recall, they tried to cover up the blunder by writing a rationale; they
took the blunders to heart and made them into great obstacles, so as not
to lose face.
In Windows wide text is UTF-16, with 16-bit `wchar_t`. Which means that
some emojis etc. that appear as a single character and constitute one
21-bit code point, can become a pair of two `wchar_t` values, an UTF-16
"surrogate pair".
The C++ standard explicitly addresses that point, though the C standard
does not.
Happy to hear that but some more specific information would be welcome.
On 07.12.2021 19:41, Keith Thompson wrote:
What exactly do you mean by "Windows ANSI"? Windows-1252 or something
else? (Microsoft doesn't call it "ANSI", because it isn't.)
It does. From https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
"Retrieves the current Windows ANSI code page identifier for the
operating system."
This is in contrast to the GetOEMCP() function which is said to return
"OEM code page", not "ANSI code page". Both terms are misnomers from the previous century.
Both these codepage settings traditionally refer to some narrow char
codepage identifiers, which will vary depending on the user regional
settings and are thus unpredictable and unusable for basically anything related to internationalization.
The only meaningful strategy is to set these both to UTF-8 which now
finally has some (beta stage?) support in Windows 10, and to upgrade all affected software to properly support this setting.
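A minimal Windows-only check of what those two settings currently are on a
given machine (both are documented Win32 functions):

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        std::printf("ANSI code page (GetACP):   %u\n", GetACP());
        std::printf("OEM  code page (GetOEMCP): %u\n", GetOEMCP());
        // Typical output on a US/Western-European system: 1252 and 437
        // (or 850).  With the Windows 10 "Use Unicode UTF-8 for worldwide
        // language support" option enabled, both report 65001, i.e. UTF-8.
    }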
On 7 Dec 2021 18:41, Keith Thompson wrote:
"Alf P. Steinbach" <alf.p.steinbach@gmail.com> writes:
[...]
What exactly do you mean by "Windows ANSI"? Windows-1252 or something
else? (Microsoft doesn't call it "ANSI", because it isn't.)
[...]
"Windows ANSI" is the encoding specified by the `GetACP` API function,
which, but as I recall that's more or less undocumented, just serves
up the codepage number specified by registry value
Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage@ACP
This means that "Windows ANSI" is a pretty dynamic thing. Not just system-dependent, but at-the-moment-configuration dependent. Though in English-speaking countries it's Windows 1252 by default.
And that in turn means that using the defaults with Visual C++, you
can end up with pretty much any encoding whatsoever of narrow
literals.
Which means that it's a good idea to take charge.
Option `/utf-8` is one way to take charge.
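Another way to take charge in code is to make the build fail loudly if the
narrow execution character set is not UTF-8; a sketch using a compile-time
size check of an escape-spelled literal (the escape keeps the test itself
independent of how the source file is saved):

    #include <cstdio>

    int main()
    {
        // "\u00e4" (ä) is encoded into the narrow execution character
        // set: two bytes (0xC3 0xA4) if that set is UTF-8, one byte
        // (0xE4) if it is Windows-1252.  The array size therefore tells
        // us which one is in effect.
        constexpr char auml[] = "\u00e4";

        static_assert(sizeof auml == 3,
                      "narrow execution charset is not UTF-8 - "
                      "consider compiling with /utf-8 on MSVC");

        std::printf("bytes:");
        for (unsigned char c : auml)
            std::printf(" 0x%02X", static_cast<unsigned>(c));
        std::printf("\n");
    }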