• Unicode support in C++ 17

    From stg@21:1/5 to All on Wed Aug 12 13:00:26 2015
    I would really like to see improved Unicode support in C++17. After
    reading the following discussion, I thought maybe I might be able to
    participate in the discussion:
    [https://groups.google.com/a/isocpp.org/forum/?fromgroups=#!searchin/std-proposals/unicode/std-proposals/SGFtQkKE0bU/overview]

    Everything in this document reflects my best understanding
    about Unicode, and C++. I would be delighted to have that
    understanding improved or corrected.

    I was hoping the knowledgeable folk in this newsgroup might help me
    evaluate some ideas. Please find my thoughts below, and be both
    critical and kind:


    1.2 Desired functionality
    ~~~~~~~~~~~~~~~~~~~~~~~~~

    1. composed-character awareness -- a single display character may be
    composed of multiple codepoints, or may consist of ligatures.
    2. multi-byte codepoint awareness.
    3. char_t indexing -- This is the current default behavior, and I
    suppose we must keep it for the sake of backward compatibility, and
    for the implementation of 1 & 2.

    Currently 3 is the default, but we can get 1&2 compliant behavior for
    much string handling by specifying a locale. We can steer the default
    behavior by setting the global locale, and a great deal of work has
    been done to improve C++'s locale handling (see boost::locale).

    I consider that 1 is in fact the usual use-case, and 2 and 3 are
    typically only of interest to library implementers.
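
    As a minimal illustration of the three levels above (the helper
    function and the test string are mine, not part of any proposal):
    counting char_t units, codepoints and display characters gives three
    different answers for the same UTF-8 std::string.

    ,----
    | #include <cassert>
    | #include <string>
    |
    | // Count codepoints in a UTF-8 string by skipping continuation
    | // bytes; a sketch only, not a full UTF-8 validator.
    | std::size_t count_codepoints(const std::string& utf8)
    | {
    |     std::size_t n = 0;
    |     for (unsigned char c : utf8)
    |         if ((c & 0xC0) != 0x80)   // not a continuation byte
    |             ++n;
    |     return n;
    | }
    |
    | int main()
    | {
    |     // "e" U+0065 followed by COMBINING ACUTE ACCENT U+0301:
    |     // one display character, two codepoints, three bytes.
    |     std::string s = "e\xCC\x81";
    |     assert(s.size() == 3);             // 3: char_t indexing
    |     assert(count_codepoints(s) == 2);  // 2: codepoint awareness
    |     // 1: display characters would be 1, but grapheme
    |     // segmentation is exactly what the library lacks today.
    | }
    `----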


    1.3 Current behavior
    ~~~~~~~~~~~~~~~~~~~~

    Let's consider a concrete example which is likely to be a very common
    use case in the future: migrating legacy code from latin1 to utf-8, or
    a developer who is used to thinking in terms of ascii wants to write a
    new application as a utf-8 application. I think this specific example
    generalizes (e.g. to utf-16 or 32) in a trivial way, but I welcome
    further insight.

    The developer may start by setting the global locale. If she wants
    numbers to behave like the c-locale, except when given specific
    context instructions, she might use a boost::locale, or perhaps she
    rolls her own locale, composing it out of existing facets that suit
    her needs. The relevant detail is that the locale specifies that she
    will be working with a utf-8 character set.
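
    A minimal sketch of that setup step, assuming boost::locale is
    available (the locale name "en_US.UTF-8" is just an example):

    ,----
    | #include <boost/locale.hpp>
    | #include <locale>
    |
    | int main()
    | {
    |     boost::locale::generator gen;
    |     std::locale loc = gen("en_US.UTF-8"); // UTF-8 character set
    |     std::locale::global(loc);             // steer default behavior
    |     // Streams keep a copy of the locale, so imbue them as needed.
    | }
    `----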

    If there is a legacy application being modernized or replaced, she'll
    have to convert data sources and sinks to utf-8, but that's likely to
    be a pretty trivial task.

    Streaming operations will work as expected, so she won't have to
    modify the std::iostream and std::stringstream stuff.

    std::string will work fine as a container. That's where the good news
    ends.


    1.3.1 sorting
    -------------

    To use std::sort she would have to specify that the comparison use
    the locale's () operator, e.g. for a container of strings:

    ,----
    | std::sort(words.begin(), words.end(), std::locale());
    `----

    The default sort uses the numeric < operator -- i.e. it's a
    byte-order sort that is efficient, but not humanly meaningful. The
    above code works but isn't parsimonious.
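
    A self-contained version of the snippet, assuming the strings live in
    a std::vector (std::locale() names the current global locale and is
    itself usable as a comparison object):

    ,----
    | #include <algorithm>
    | #include <locale>
    | #include <string>
    | #include <vector>
    |
    | int main()
    | {
    |     std::vector<std::string> words = { "peach", "Apple", "banana" };
    |     // Collation comes from whatever the global locale provides.
    |     std::sort(words.begin(), words.end(), std::locale());
    | }
    `----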


    1.3.2 find and substr
    ---------------------

    Consider:
    ,----
    | auto pos1 = foo.find(someChar);
    | // sanity check...
    | auto bar = foo.substr(pos1, 3);
    `----

    The determination of pos1 can fail because it might find a match
    inside a composite character. The determination of bar will fail
    whenever there's a composite or multi-byte character within the next
    three positions.
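
    A concrete instance of that failure, assuming the string holds UTF-8
    (the word "crème" is my example): find() and substr() count char
    positions, so the slice ends in the middle of the two-byte encoding
    of 'è'.

    ,----
    | #include <cassert>
    | #include <string>
    |
    | int main()
    | {
    |     std::string foo = "cr\xC3\xA8me"; // "crème" in UTF-8, 6 bytes
    |     auto pos1 = foo.find('c');        // == 0
    |     auto bar  = foo.substr(pos1, 3);  // "cr" plus half of 'è' --
    |                                       // no longer valid UTF-8
    |     assert(bar == "cr\xC3");
    | }
    `----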


    1.4 My naive proposal:
    ~~~~~~~~~~~~~~~~~~~~~~

    - A std::basic_string has a locale awareness, either NONE (default,
    current implementation), CODEPOINT (mainly for library implementers
    who want to investigate codepoints, not composed characters), or
    COMPOSITE (alternatively DISPLAY, or CHARACTER -- a displayable
    character).
    - std::locale gets a cc_iterator (composed-character iterator --
    iterates over displayable characters).
    - std::locale gets a cp_iterator (codepoint iterator -- iterates over
    codepoints; for utf-32 locales this is just the ordinary char32_t
    iterator).
    - std::string methods use the locale-aware iterators if the string is
    locale-aware. So size() returns the number of displayable characters
    for a std::string<COMPOSITE>, the number of codepoints for a
    std::string<CODEPOINT>, and the number of bytes for a
    std::string<NONE>.

    For a locale-aware string, the following behavior would change:
    - std::sort would use the locale's () operator by default. Maps with
    a la_string key would work in a locale-aware way, maps with a
    std::string would work with the old byte <.
    - integer positional arguments would refer to *composed characters*.
    So s.substr(pos, 3) would give the 3 display characters starting at
    pos, regardless of whether they are ligatures, composed, or simply
    1-byte ascii codepoints. That would apply to str[i] and str.size()
    as well.
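
    A rough sketch of how such an interface might look (all names here
    are invented for illustration, not proposed wording):

    ,----
    | #include <cstddef>
    |
    | enum class awareness { NONE, CODEPOINT, COMPOSITE };
    |
    | template <awareness A = awareness::NONE>
    | class la_string
    | {
    | public:
    |     // Counts char_t units for NONE, codepoints for CODEPOINT,
    |     // composed (display) characters for COMPOSITE.
    |     std::size_t size() const;
    |
    |     // Positions and counts are interpreted in the same unit.
    |     la_string substr(std::size_t pos, std::size_t n) const;
    | };
    `----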


    1.4.1 Pros
    ----------

    - updating legacy code should be almost-trivial -- change the
    string construction to create locale-aware strings, and everything
    should work as desired.

    - Minimal language pollution. Seems consistent with current language
    design.


    1.4.2 Cons
    ----------

    - What to do when comparing a std::string<locale_aware==false> with a
    std::string<locale_aware==true>? I suggest the default behavior is
    byte comparison, but compilers should generate a warning. May need
    to introduce a cast operation to avoid the warning.
    - I don't see a way to prevent a developer from setting an
    incompatible locale, and using an incompatible string. I suppose
    this would have to throw an exception.
    - std::string<locale_aware> or std::la_string is clunky.


    1.5 Questions
    ~~~~~~~~~~~~~

    - change locale awareness via typecasting?


  • From Jakob Bohm@21:1/5 to stg on Fri Aug 14 13:33:14 2015
    On 12/08/2015 21:00, stg wrote:



    I would really like to see improved Unicode support in C++17. After
    reading the following discussion, I thought maybe I might be able to
    participate in the discussion:
    [https://groups.google.com/a/isocpp.org/forum/?fromgroups=#!searchin/std-proposals/unicode/std-proposals/SGFtQkKE0bU/overview]

    Everything in this document reflects my best understanding
    about Unicode, and C++. I would be delighted to have that
    understanding improved or corrected.

    I was hoping the knowledgeable folk in this newsgroup might help me
    evaluate some ideas. Please find my thoughts below, and be both
    critical and kind:


    1.2 Desired functionality
    ~~~~~~~~~~~~~~~~~~~~~~~~~

    1. composed-character awareness -- a single display character may be
    composed of multiple codepoints, or may consist of ligatures.

    The subset of programs which care about this consists
    mostly of those programs which do additional text
    formatting (e.g. columns, word line breaks etc.)
    and/or control cursor navigation in text input (like
    a C++ equivalent of GNU readline etc.). Such programs
    are generally more concerned with whether a sequence of
    codepoints represents a single screen location (and how
    big that is) on the actual output device in use, not
    whether a hypothetical mega-implementation of all Unicode
    formatting features would treat it as one.

    For instance, some display systems will artificially
    cause multi-codepoint (and sometimes even multi-char_t)
    characters to occupy as much space as their encoding,
    while others will not. Some display systems will do
    the right-to-left vs. left-to-right direction shifts
    automatically, while others expect applications to
    reorder displayed characters before output.

    One thing that is of general interest, but is too
    big/slow to be done implicitly during every string
    operation, is to convert Unicode strings to one of
    the official normalization forms (NFC, NFD, NFKC, NFKD,
    plus any future standard form that prevents visually
    equivalent display strings from having different
    encodings, to avoid security attacks that depend on
    fooling humans into accepting a made-up name that looks
    just like a different name they trust, such as 0bama vs.
    Obama or V1adimir vs. Vladimir). These things are
    already available in libraries such as IBM's ICU.
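
    For reference, a sketch of NFC normalization with ICU (the function
    name to_nfc is mine; the ICU calls are the library's documented C++
    API, and real code should check the UErrorCode):

    ,----
    | #include <unicode/normalizer2.h>
    | #include <unicode/unistr.h>
    | #include <string>
    |
    | std::string to_nfc(const std::string& utf8)
    | {
    |     UErrorCode status = U_ZERO_ERROR;
    |     const icu::Normalizer2* nfc =
    |         icu::Normalizer2::getNFCInstance(status);
    |     icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);
    |     icu::UnicodeString normalized = nfc->normalize(s, status);
    |     std::string out;
    |     normalized.toUTF8String(out);
    |     return out;
    | }
    `----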

    2. multi-byte codepoint awareness.

    This is important for UTF-8 and the higher codepoints in
    UTF-16, and has always been important for non-Unicode
    encodings of East Asian alphabets. Thus where possible,
    standard library features for this should be done as
    natural extensions / bugfixes for the existing library
    functions that have always done this for traditional
    encodings.

    For UTF-8 and to a lesser degree UTF-16, the Unicode
    standard designers did extra work to ensure that things
    like sorting and searching would work in most cases
    when naively using routines that only use char_t
    values, specifically:

    1. No UTF-8 or UTF-16 encoding of a codepoint will
    match at half-character locations when using a char_t
    based string search algorithm.

    2. Comparing the UTF-8, UTF-32 or plain UCS-4 encodings
    of two strings using code that treats them simply as
    arrays of unsigned char or char32_t will get
    the same result and ordering as comparing those strings
    codepoint by codepoint using the equivalent codepoint
    numbers in the Unicode standard.

    3. Comparing the UTF-16 encodings of two strings using
    code that treats them simply as arrays of char16_t
    values will get the same result as codepoint by
    codepoint comparisons, except that codepoints
    U+0000E000 to U+0000FFFF sort after U+0010FFFF rather
    than between U+0000DFFF and U+00010000. However
    this odd result is often needed for compatibility
    with existing systems that were originally designed
    for UCS-2, where that was the correct algorithm due
    to the historic non-existence of codepoints above
    U+00010000.

    These nice properties do not hold for traditional East
    Asian encodings, though some of those encodings may
    happen to match some locale specific lexicographic
    orderings in a similar way.
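
    Property 2 can be checked with plain std::string comparison; the two
    test strings below are my own example:

    ,----
    | #include <cassert>
    | #include <string>
    |
    | int main()
    | {
    |     std::string e_acute = "\xC3\xA9";     // U+00E9 in UTF-8
    |     std::string han     = "\xE6\xBC\xA2"; // U+6F22 in UTF-8
    |
    |     // Byte-wise comparison of the UTF-8 encodings ...
    |     assert(e_acute < han);
    |     // ... matches the order of the codepoint values.
    |     assert(char32_t(0x00E9) < char32_t(0x6F22));
    | }
    `----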

    3. char_t indexing -- This is the current default behavior, and I
    suppose we must keep it for the sake of backward compatibility, and
    for the implementation of 1 & 2.

    Also because this is the most relevant form in the following
    cases:

    1. When processing text strings for purposes of storage or
    transmission, since most storage/transmission systems
    store/transmit bits and bytes, not abstract characters.

    2. When using the string class as an efficient and convenient
    container for arrays of non-text bytes. Such code often gains
    great benefits from the ways string classes differ from
    vectors/lists of bytes, but would fail horribly if the string
    classes started having opinions on what bytes can be stored
    there.

    The computer industry has a long history of the insane costs
    imposed when interfaces are defined to process characters (in
    any character set) rather than sequences of bits and bytes.
    For instance because the Internet e-mail protocols were
    historically defined to operate on sequences of human-readable
    English characters from a common subset of ASCII and
    EBCDIC, even though actual transmission was always ASCII bytes,
    every e-mail containing attachments, pictures or non-English
    text needs to be transmitted using clunky Base64 and Hex
    encodings just in case some mail gateway on the way might
    temporarily process the e-mail using arcane character
    representations (e.g. on older IBM operating systems). And
    this is just one instance of how such a decision in the past
    has come back to haunt us.

    Thus it is best if most standard library classes, methods,
    types and functions are defined to be what some people call
    "8-bit clean", meaning that they won't mangle or damage
    arbitrary binary data given to them, if at all possible
    (the classic std::strxxx() and std::wcsxxx() functions
    obviously need to treat a char_t value of 0 specially as
    per their definitions, but must refrain from mistreating
    other values).




    Currently 3 is the default, but we can get 1&2 compliant behavior for
    much string handling by specifying a locale. We can steer the default
    behavior by setting the global locale, and a great deal of work has
    been done to improve C++'s locale handling (see boost::locale).

    I consider that 1 is in fact the usual use-case, and 2 and 3 are
    typically only of interest to library implementers.


    In my experience, 3 is the most common use case where strings
    are not treated as opaque blobs (then there is no difference),
    the one exception being country-specific lexicographic ordering,
    which is never the same as any sorting done purely for
    computational efficiency.

    Real world situations that truly care about codepoints or display
    characters often also care about words and sentences. For
    instance in many locales a list sorted for human consumption
    should ideally go like this:

    has one
    hás one
    hat on
    hât on
    have not

    Which requires processing at the word and sentence level, not
    just the code point level. Such rules tend to reflect the way
    written text is usually pronounced (and thus memorized) amongst
    native speakers in that culture/language combination.
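
    A sketch of producing such an ordering with the existing library,
    assuming a suitable UTF-8 locale is installed on the system (the
    std::locale constructor throws if the name is unknown, and how the
    accented forms interleave depends entirely on that locale's
    collation rules):

    ,----
    | #include <algorithm>
    | #include <iostream>
    | #include <locale>
    | #include <string>
    | #include <vector>
    |
    | int main()
    | {
    |     std::vector<std::string> words =
    |         { "have not", "hât on", "hat on", "hás one", "has one" };
    |
    |     std::locale loc("en_US.UTF-8");
    |     std::sort(words.begin(), words.end(), loc);
    |
    |     for (const auto& w : words)
    |         std::cout << w << '\n';
    | }
    `----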

    I have heard rumors that some schools teach computing the other
    way round, but that is mostly an artifact of those educators
    lacking experience and/or deeper technical understanding before
    overconfidently instilling superficial misunderstandings into
    their pupils.


    1.3 Current behavior
    ~~~~~~~~~~~~~~~~~~~~

    Let's consider a concrete example which is likely to be a very common
    use case in the future: migrating legacy code from latin1 to utf-8, or
    a developer who is used to thinking in terms of ascii wants to write a
    new application as a utf-8 application. I think this specific example
    generalizes (e.g. to utf-16 or 32) in a trivial way, but I welcome
    further insight.

    The developer may start by setting the global locale. If she wants
    numbers to behave like the c-locale, except when given specific
    context instructions, she might use a boost::locale, or perhaps she
    rolls her own locale, composing it out of existing facets that suit
    her needs. The relevant detail is that the locale specifies that she
    will be working with a utf-8 character set.

    If there is a legacy application being modernized or replaced, she'll
    have to convert data sources and sinks to utf-8, but that's likely to
    be a pretty trivial task.

    Streaming operations will work as expected, so she won't have to
    modify the std::iostream and std::stringstream stuff.

    std::string will work fine as a container. That's where the good news
    ends.


    1.3.1 sorting
    -------------

    To use std::sort she would have to specify that the comparison use
    the locale's () operator, e.g. for a container of strings:

    ,----
    | std::sort(words.begin(), words.end(), std::locale());
    `----

    The default sort uses the numeric < operator -- i.e. it's a
    byte-order sort that is efficient, but not humanly meaningful. The
    above code works but isn't parsimonious.

    This depends on the purpose of the sort:

    If the sort is used for a purpose where an ASCII application
    would be happy to sort lowercase a after uppercase Z, then
    sorting by (32 bit) Unicode code point is the natural
    equivalent, and utf-8 was specifically designed (this is
    explicitly stated in the original standards) such that the
    naive byte comparison will yield the correct result with no
    extra effort.

    If the sort is used for a purpose where an ASCII application
    would want upper and lower case A/a to sort in close proximity,
    then the application will already need to use a more
    intelligent string comparison function. For ASCII a simple
    case-insensitive string compare function would do the trick,
    while for anything else, the application would need a highly
    locale-sensitive non-trivial comparison function such as the
    parametrized string comparison function from the Unicode
    standard (that function takes a bunch of parameters
    specifying most of the commonly occurring locale
    oddities, such as rules for the treatment of accents,
    uppercase/lowercase, multiple spaces and even punctuation),
    or more practically a truly locale specific comparison
    function that can take into account locale-specific issues
    not covered by such a generic function. In practice this
    would simply involve delegating the comparison operation
    to a virtual method of the locale object, of which there
    can be several depending on usage context, for instance
    some locales have different rules for sorting dictionaries
    versus phone books.
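
    In today's library that delegation already has a home: the
    std::collate facet of a locale. A minimal sketch (the function name
    locale_less is mine):

    ,----
    | #include <locale>
    | #include <string>
    |
    | bool locale_less(const std::string& a, const std::string& b,
    |                  const std::locale& loc)
    | {
    |     const auto& coll = std::use_facet<std::collate<char>>(loc);
    |     return coll.compare(a.data(), a.data() + a.size(),
    |                         b.data(), b.data() + b.size()) < 0;
    | }
    `----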



    1.3.2 find and substr
    ---------------------

    Consider:
    ,----
    | auto pos1 = foo.find(someChar);
    | // sanity check...
    | auto bar = foo.substr(pos1, 3);
    `----

    The determination of pos1 can fail because it might find a match
    inside a composite character. The determination of bar will fail
    whenever there's a composite or multi-byte character within the next
    three positions.

    For all the standard UNICODE encodings (except UTF-7, a
    victim of the e-mail design mistake previously mentioned),
    the encoding has been designed to guarantee that
    searching for a valid encoding of a string or character
    in a valid encoding of a string will not result in false
    matches.

    However for any encoding that uses multiple char_t-s to
    represent a single code point, code point operations
    must be treated as substring operations, never as
    character operations.

    In your example above, if someChar is of type char_t,
    then it can only be a single-char_t codepoint, if it
    is a codepoint at all. If someChar is of type string,
    then extracting text where it was found should already
    account for someChar.length(), whatever unit that
    function measures its result in. pos1 can use any unit
    of measurement: Inches of paper, microliters of ink,
    count of codepoints etc., but count of char_t-s is
    just as useful for values that are treated simply as
    abstract non-iterable iterators.

    As for the second step of extracting a known character
    plus the next two characters, then such an operation
    makes sense only when the context makes clear why
    exactly two extra characters are requested, and if
    that reason refers to two display characters, two
    codepoints or two char_t-s. This semantic problem
    cannot be defaulted away without leading to lots of
    malfunctioning applications (namely those that
    needed either of the other two semantics in that
    particular code line, unrelated to what the rest
    of the application needs in unrelated code lines).

    For instance if we are looking for a marker sign
    followed by a two-letter abbreviation in some
    human-originated convention, then one must look at
    that convention to see if these abbreviations are
    defined to consist of two display characters, two
    codepoints or two char_t-s, taking into account
    that many real world human-written documents will
    use those words to refer to any of the other two
    meanings.

    If the relevant specification is unclear, then
    the conversion of this program from ASCII to
    utf-8 is the perfect time to settle that ambiguity
    before failing to interoperate with another
    application whose author would otherwise have
    interpreted the convention differently.

    If on the other hand we are looking to display the
    beginning of a text in a narrow indicator field,
    then we obviously want 3 display character cells,
    using whichever definition of that concept matches
    the actual properties of the intended output device;
    we might even want to change this to the first "3em"
    of the text using a specific font such as
    "Helvetica", or the first 3 6-point cells in braille.




    1.4 My naive proposal:
    ~~~~~~~~~~~~~~~~~~~~~~

    - A std::basic_string has a locale awareness, either NONE (default,
    current implementation), CODEPOINT (mainly for library implementers
    who want to investigate codepoints, not composed characters), or
    COMPOSITE (alternatively DISPLAY, or CHARACTER -- a displayable
    character).
    - std::locale gets a cc_iterator (composed-character iterator --
    iterates over displayable characters).
    - std::locale gets a cp_iterator (codepoint iterator -- iterates over
    codepoints; for utf-32 locales this is just the ordinary char32_t
    iterator).
    - std::string methods use the locale-aware iterators if the string is
    locale-aware. So size() returns the number of displayable characters
    for a std::string<COMPOSITE>, the number of codepoints for a
    std::string<CODEPOINT>, and the number of bytes for a
    std::string<NONE>.

    For a locale-aware string, the following behavior would change:
    - std::sort would use the locale's () operator by default. Maps with
    a la_string key would work in a locale-aware way, maps with a
    std::string would work with the old byte <.
    - integer positional arguments would refer to *composed characters*.
    So s.substr(pos, 3) would give the 3 display characters starting at
    pos, regardless of whether they are ligatures, composed, or simply
    1-byte ascii codepoints. That would apply to str[i] and str.size()
    as well.


    1.4.1 Pros
    ----------

    - updating legacy code should be almost-trivial -- change the
    string construction to create locale-aware strings, and everything
    should work as desired.


    Only if that is the desired behavior, which often it is not
    once one starts looking at the code details.

    - Minimal language pollution. Seems consistent with current language
    design.


    1.4.2 Cons
    ----------

    - What to do when comparing a std::string<locale_aware==false> with a
    std::string<locale_aware==true>? I suggest the default behavior is
    byte comparison, but compilers should generate a warning. May need
    to introduce a cast operation to avoid the warning.
    - I don't see a way to prevent a developer from setting an
    incompatible locale, and using an incompatible string. I suppose
    this would have to throw an exception.
    - std::string<locale_aware> or std::la_string is clunky.


    1.5 Questions
    ~~~~~~~~~~~~~

    - change locale awareness via typecasting?


    Having all that locale-aware code in std::basic_string will
    seriously bloat any application wanting only the non-locale
    aware form.

    It is thus better to have std::basic_lstring as a subclass
    of std::basic_string, such that all the extra code will not
    be linked into statically linked utility programs that don't
    need this extra library code.

    Making std::basic_string a protected base class of
    std::basic_lstring will have additional benefits:

    - accidentally mixing string and lstring types will cause type
    errors except where std::basic_lstring provides overloaded
    operations to handle the combination.

    - functions that need to be much more complex in
    std::basic_lstring can do this without forcing their
    simpler cousins in std::basic_string to be virtual and
    incur the resulting call overhead, which may easily
    exceed the low cost of the trivial non-locale
    implementations.
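
    A rough sketch of that layering (all names are illustrative, not a
    worked-out design):

    ,----
    | #include <cstddef>
    | #include <string>
    |
    | template <class CharT>
    | class basic_lstring : protected std::basic_string<CharT>
    | {
    |     using base = std::basic_string<CharT>;
    | public:
    |     using base::base;      // keep the cheap constructors
    |
    |     // Locale-aware operations live only here, so the base class
    |     // needs no virtual functions and pays no call overhead.
    |     std::size_t vlength() const; // codepoints, see below
    | };
    `----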

    As an alternative to hiding the basic_string properties of a
    basic_lstring, one could use different names for the non-basic
    operations while keeping the basic operations from the base
    class available. For example

    ,----
    | size_t length() const;  // Length in char_t units, usually quick,
    |                         // inherited from basic_string.
    | size_t vlength() const; // Number of codepoints in string. Often
    |                         // expensive and charset dependent, but
    |                         // may be cached for speed.
    | size_t tlength() const; // Text length in ideal screen character
    |                         // cells, assuming a semi-ideal display
    |                         // which merges all accents etc. into the
    |                         // main cell and uses no space for any
    |                         // occurrence of formatting specials such
    |                         // as the BOM. Expensive.
    | size_t hlength() const; // Text length in ideal screen character
    |                         // halfwidth cells, assuming an ideal East
    |                         // Asian display which merges all accents
    |                         // etc. into the main cell and treats
    |                         // western characters as half-width unless
    |                         // explicitly marked full-width in the
    |                         // character standard. Also counts no
    |                         // space for non-spacing and formatting
    |                         // characters. Expensive.
    | size_t flength() const; // Text length in ideal screen character
    |                         // fullwidth cells, assuming an ideal East
    |                         // Asian display which merges all accents
    |                         // etc. into the main cell and treats
    |                         // western characters as full-width unless
    |                         // explicitly marked half-width in the
    |                         // character standard. Also counts no
    |                         // space for non-spacing and formatting
    |                         // characters. Expensive.
    `----

    Similarly for the various substring and indexing operations.


    P.S.

    In the above document I distinguish explicitly between:

    UCS-4: 4-byte/31-bit char32_t encoding of the full potential of the
    Unicode Character Set, allowing codepoints from U+00000000 to
    U+7FFFFFFF. Note that the sign bit is still reserved, just as
    it was in 1-byte/7-bit ASCII.

    UTF-32: 4-byte/31-bit char32_t encoding of the subset of the Unicode
    Character Set which can be encoded using the current UTF-16
    encoding, i.e. the codepoints U+00000000 to U+0010FFFF
    inclusive. This is the subset that will be assigned meanings
    first, just as the codepoints from 0 to 127 were the first to
    be assigned in ASCII-derived character sets.

    UCS-2: Historic 2-byte/16-bit char16_t encoding of the first 64K
    code points in the Unicode Character Set. More than
    20 years ago some believed this, and not UCS-4, would become
    the final standard and thus designed protocols and systems
    accordingly; this includes the designs of Java, Microsoft
    Windows, and mobile text messaging (SMS) standards of 160
    7-bit chars or 70 16-bit chars.

    UTF-16: An encoding of the first about 1 million Unicode codepoints
    which is the same as UCS-2 for the common codepoints and a
    special char16_t[2] encoding of codepoints from U+00010000 to
    U+0010FFFF. This is mostly used when retrofitting UCS-2
    systems to support a larger number of Unicode codepoints.

    UTF-8: An encoding of the full Unicode character range from
    U+00000000 to U+7FFFFFFF using a variable number of 8-bit
    chars such that the ASCII subset U+00000000 to U+0000007F
    encodes as itself and having many other practical properties.
    Many official documents have changed the original UTF-8
    definition to formally prohibit the encoding of codepoints
    that cannot be encoded using UTF-16, but I view this as
    short sighted and potentially subject to future reversal.
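
    The size differences are easy to see with standard C++ string
    literals; U+1F600 (an emoji outside the BMP) is my example codepoint:

    ,----
    | int main()
    | {
    |     // UTF-32/UCS-4: one char32_t code unit.
    |     static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) - 1 == 1,
    |                   "one code unit");
    |     // UTF-16: a surrogate pair, 0xD83D 0xDE00.
    |     static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) - 1 == 2,
    |                   "two code units");
    |     // UTF-8: four bytes, 0xF0 0x9F 0x98 0x80.
    |     static_assert(sizeof(u8"\U0001F600") - 1 == 4, "four bytes");
    | }
    `----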



    Enjoy

    Jakob
    --
    Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
    Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
    This public discussion message is non-binding and may contain errors.
    WiseMo - Remote Service Management for PCs, Phones and Embedded

