• conversions between native and unicode encodings

    From James Kuyper@21:1/5 to All on Sat Nov 20 15:43:43 2021
    C++ has been changing a little faster than I can easily keep up with. I
    only recently noticed that C++ now (since 2017?) seems to require
    support for UTF-8, UTF-16, and UTF-32 encodings, which used to be
    optional (and even earlier, was non-existent). Investigating further, I
    was surprised by something that seems to be missing.

    The standard describes five different character encodings used at
    execution time. Two of them are implementation-defined native encodings
    for narrow and wide characters, stored in char and wchar_t respectively.
    The other three are Unicode encodings, UTF-8, UTF-16, and UTF-32, stored
    in char8_t, char16_t, and char32_t respectively. The native encodings
    could both also be Unicode encodings, but the following question is
    specifically about implementations where that is not the case.

    There are codecvt facets (28.3.1.1.1) for converting between the native
    encodings, and between char8_t and the other Unicode encodings, but as
    far as I can tell, the only way to convert between native and Unicode
    encodings is to use the character conversion functions in <cuchar>
    (21.5.5), incorporated from the C standard library.

    Is it correct that the <cuchar> routines do in fact perform such
    conversions? It's hard to be sure, because the detailed description is
    only cross-referenced from the C standard, which doesn't use the term
    "native encoding", and allows __STDC_UTF_16__ and __STDC_UTF_32__ to not
    be predefined.

    Is it correct that the <cuchar> routines are the only way to perform
    such conversions? It seems odd to me that the only way to perform such conversions uses a C style interface.
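
    A minimal sketch of what such a conversion could look like with the
    <cuchar> routines. It assumes, as the question above does without
    confirming it, that std::mbrtoc32 and std::c32rtomb translate between
    the current locale's narrow multibyte encoding and UTF-32; the C
    standard only promises UTF-32 values in char32_t when __STDC_UTF_32__
    is defined. The helper names narrow_to_utf32 and utf32_to_narrow are
    hypothetical.

```cpp
#include <climits>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a narrow string (current locale's multibyte
// encoding) into char32_t values, one call to mbrtoc32 per character.
std::u32string narrow_to_utf32(const std::string& in) {
    std::u32string out;
    std::mbstate_t state = std::mbstate_t();
    const char* p = in.data();
    const char* end = p + in.size();
    while (p < end) {
        char32_t c = 0;
        std::size_t n = std::mbrtoc32(&c, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1))
            throw std::runtime_error("invalid multibyte sequence");
        if (n == static_cast<std::size_t>(-2))
            throw std::runtime_error("truncated multibyte sequence");
        if (n == static_cast<std::size_t>(-3)) {
            // A character left over from a previous call; no input consumed.
            out.push_back(c);
            continue;
        }
        if (n == 0)
            n = 1; // null character decoded; assumes it occupies one byte
        out.push_back(c);
        p += n;
    }
    return out;
}

// Hypothetical helper: encode char32_t values into the current locale's
// narrow multibyte encoding, one call to c32rtomb per character.
std::string utf32_to_narrow(const std::u32string& in) {
    std::string out;
    std::mbstate_t state = std::mbstate_t();
    char buf[MB_LEN_MAX];
    for (char32_t c : in) {
        std::size_t n = std::c32rtomb(buf, c, &state);
        if (n == static_cast<std::size_t>(-1))
            throw std::runtime_error("character not representable in this locale");
        out.append(buf, n);
    }
    return out;
}
```

    Both helpers use whatever locale is current, so a program that wants
    the user's encoding rather than the "C" locale would call
    std::setlocale(LC_ALL, "") from <clocale> first.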

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Sam@21:1/5 to James Kuyper on Sat Nov 20 16:35:30 2021
    James Kuyper writes:

    > Is it correct that the <cuchar> routines are the only way to perform
    > such conversions? It seems odd to me that the only way to perform such
    > conversions uses a C style interface.

    The C++ library's support for transcoding between Unicode and various
    character sets has always sucked. This still remains the case.

  • From Alf P. Steinbach@21:1/5 to James Kuyper on Sun Nov 21 15:27:37 2021
    On 20 Nov 2021 21:43, James Kuyper wrote:
    > Is it correct that the <cuchar> routines are the only way to perform
    > such conversions? It seems odd to me that the only way to perform such
    > conversions uses a C style interface.

    The std::codecvt stuff is probably/maybe what you're looking for.

    For conversion UTF-8 -> UTF-16 MSVC and MinGW g++ yield different
    results wrt. endianness, and wrt. state after conversion failure.

    When you have to compensate for compiler differences in order to get
    portable code that uses standard library stuff, for the same platform,
    then you know it's really BAD.

    The UTF-8 specializations were deprecated in C++17, and one would
    naturally think it was in order to replace with something better, not
    suffering from all that badness, in e.g. C++20.

    But no, the idiots (pardon the expression) only wanted to introduce
    overloads with `char8_t` instead of `char`; that's what C++20 offered.
    I haven't tested, because I refuse to "upgrade" to C++20. But I presume
    these academic overloads suffer from all the badness of the old.


    - Alf

  • From James Kuyper@21:1/5 to Alf P. Steinbach on Mon Nov 22 00:05:40 2021
    On 11/21/21 9:27 AM, Alf P. Steinbach wrote:
    > On 20 Nov 2021 21:43, James Kuyper wrote:
    >> There are codecvt facets (28.3.1.1.1) for converting between the native
    >> encodings, and between char8_t and the other Unicode encodings, but as
    >> far as I can tell, the only way to convert between native and Unicode
    >> encodings is to use the character conversion functions in <cuchar>
    >> (21.5.5), incorporated from the C standard library.
    ...
    > The std::codecvt stuff is probably/maybe what you're looking for.

    As indicated above, I'm quite aware of the existence of codecvt.
    However, in the latest draft version of the standard that I have,
    n4860.pdf, the codecvt facets listed in table 102 (28.3.1.1.1p2) are:

    codecvt<char, char, mbstate_t>
    codecvt<char16_t, char8_t, mbstate_t>
    codecvt<char32_t, char8_t, mbstate_t>
    codecvt<wchar_t, char, mbstate_t>

    Which one should I use to convert between native and Unicode encodings?
    None of them seem suitable, which was the point of my message.
    The change from char to char8_t occurred between n4659.pdf (2017-03-21)
    and n4849.pdf (2020-01-04).
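
    For comparison, a minimal sketch of how the one facet in that table
    that crosses character types, codecvt<wchar_t, char, mbstate_t>, is
    typically driven. It converts between the two native encodings only,
    so it does not answer the question asked here. narrow_to_wide is a
    hypothetical helper name, and the output sizing assumes that every
    wide character consumes at least one input byte.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: native narrow encoding -> native wide encoding,
// using the codecvt facet of the given locale (global locale by default).
std::wstring narrow_to_wide(const std::string& in,
                            const std::locale& loc = std::locale()) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    if (in.empty())
        return std::wstring();

    const Facet& cvt = std::use_facet<Facet>(loc);
    std::mbstate_t state = std::mbstate_t();

    // Assumption: at most one wide character is produced per input byte.
    std::wstring out(in.size(), L'\0');
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;

    std::codecvt_base::result r =
        cvt.in(state,
               in.data(), in.data() + in.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);

    // Treat error, partial (input ends mid-character), and noconv alike.
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("narrow-to-wide conversion failed");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

    Nothing in this interface says what encoding either side uses; that
    is decided by the locale, which is why it does not help with the
    Unicode types.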

    If I'm correct about the routines in <cuchar> converting between the
    native encoding for narrow characters and unicode encodings, it should
    have been straightforward to implement corresponding codecvt facets -
    why didn't they mandate them?

    > For conversion UTF-8 -> UTF-16 MSVC and MinGW g++ yield different
    > results wrt. endianness, and wrt. state after conversion failure.

    Well, Unicode leaves the endianness for UTF-16 unspecified, provides a
    BOM to clarify the ambiguity, and recommends assuming big-endian if no
    BOM is present. Windows decided to go for little-endian. This is a
    problem, but
    it's a Unicode problem, not a C++ problem; C++ is doing nothing more
    than failing to resolve the ambiguity that Unicode left ambiguous.
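
    To illustrate the BOM rule just described, a minimal sketch that
    decodes a serialized UTF-16 byte stream into char16_t code units and
    treats the stream as big-endian when no BOM is present.
    decode_utf16_bytes is a hypothetical helper; the sketch does not
    validate surrogate pairs and silently drops a trailing odd byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: turn raw UTF-16 bytes into code units.
// A leading FE FF selects big-endian, FF FE selects little-endian;
// with no BOM the bytes are read as big-endian.
std::u16string decode_utf16_bytes(const std::vector<std::uint8_t>& bytes) {
    bool big_endian = true; // default when no BOM is present
    std::size_t i = 0;

    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            big_endian = true;
            i = 2; // skip the BOM
        } else if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            big_endian = false;
            i = 2; // skip the BOM
        }
    }

    std::u16string out;
    for (; i + 1 < bytes.size(); i += 2) {
        unsigned hi = big_endian ? bytes[i] : bytes[i + 1];
        unsigned lo = big_endian ? bytes[i + 1] : bytes[i];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

    The byte order only matters at this serialization boundary; once the
    data is held as char16_t values, each code unit is an ordinary integer
    in the machine's own representation.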

  • From danielaparker@gmail.com@21:1/5 to Alf P. Steinbach on Mon Nov 22 05:30:23 2021
    On Sunday, November 21, 2021 at 9:27:57 AM UTC-5, Alf P. Steinbach wrote:
    > The std::codecvt stuff ...
    >
    > For conversion UTF-8 -> UTF-16 MSVC and MinGW g++ yield different
    > results wrt. endianness, and wrt. state after conversion failure.
    >
    > When you have to compensate for compiler differences in order to get
    > portable code that uses standard library stuff, for the same platform,
    > then you know it's really BAD.
    >
    > The UTF-8 specializations were deprecated in C++17, and one would
    > naturally think it was in order to replace with something better, not
    > suffering from all that badness, in e.g. C++20.

    But if the committee spent time on practical things like Unicode
    encoding conversion and validation, which have massive prior experience
    to draw on, where would they find time to spend on, say, ranges?

    Daniel
