C++ has been changing a little faster than I can easily keep up with. I
only recently noticed that C++ now (since 2017?) seems to require
support for UTF-8, UTF-16, and UTF-32 encodings, which used to be
optional (and even earlier, was non-existent). Investigating further, I
was surprised by something that seems to be missing.
The standard describes five different character encodings used at
execution time. Two of them are implementation-defined native encodings
for narrow and wide characters, stored in char and wchar_t respectively.
The other three are Unicode encodings, UTF-8, UTF-16, and UTF-32, stored
in char8_t, char16_t, and char32_t respectively. The native encodings
could both also be Unicode encodings, but the following question is specifically about implementations where that is not the case.
There are codecvt facets (28.3.1.1.1) for converting between the native encodings, and between char8_t and the other Unicode encodings, but as
far as I can tell, the only way to convert between native and Unicode encodings is via the character conversion functions in <cuchar> (21.5.5) incorporated from the C standard library.
Is it correct that the <cuchar> routines do in fact perform such
conversions? It's hard to be sure, because the detailed description is
only cross-referenced from the C standard, which doesn't use the term
"native encoding", and allows __STDC_UTF_16__ and __STDC_UTF_32__ to not
be predefined.
Is it correct that the <cuchar> routines are the only way to perform
such conversions? It seems odd to me that the only way to perform such conversions uses a C style interface.
On 20 Nov 2021 21:43, James Kuyper wrote:...
...as far as I can tell, the only way to convert between native and Unicode encodings are the character conversion functions in <cuchar> (21.5.5) incorporated from the C standard library.
The std::codecvt stuff is probably/maybe what you're looking for.
For conversion UTF-8 -> UTF-16 MSVC and MinGW g++ yield different
results wrt. endianness, and wrt. state after conversion failure.
The std::codecvt stuff ...
For conversion UTF-8 -> UTF-16 MSVC and MinGW g++ yield different
results wrt. endianness, and wrt. state after conversion failure.
When you have to compensate for compiler differences in order to get
portable code that uses standard library stuff, for the same platform,
then you know it's really BAD.
The UTF-8 specializations were deprecated in C++17, and one would
naturally think that was in order to replace them with something better, not suffering from all that badness, in e.g. C++20.