• Re: Ada and Unicode

    From Thomas@21:1/5 to Vadim Godunko on Sun Apr 3 18:51:56 2022
    In article <f9d91cb0-c9bb-4d42-a1a9-0cd546da436cn@googlegroups.com>,
    Vadim Godunko <vgodunko@gmail.com> wrote:

    On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote:

    What's the way to manage Unicode correctly ?


Ada doesn't have good Unicode support. :( So, you need to find a suitable set of "workarounds".

There are a few different aspects of Unicode support that need to be considered:

1. Representation of string literals. If you want to use non-ASCII characters in source code, you need to use the -gnatW8 switch, and it will require use of Wide_Wide_String everywhere.
2. Internal representation during application execution. You were forced to use Wide_Wide_String in the previous step, so it will be UCS4/UTF32.

It is hard to say that this is a reasonable set of features for the modern world.

I don't think Ada is lacking that much for good UTF-8
support.

the cardinal point is to be able to fill an Ada.Strings.UTF_Encoding.UTF_8_String with a literal.
(once you have that, when you try to fill a Standard.String with a
non-Latin-1 character, it will raise an error; I think that's fine :-) )

    does Ada 202x allow it ?

    if not, it would probably be easier if it was
    type UTF_8_String is new String;
    instead of
    subtype UTF_8_String is String;
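For illustration, here is a minimal compilable sketch of what that distinction amounts to (the names Subtype_UTF_8 and Distinct_UTF_8 are invented for this example, not a language proposal):

procedure Subtype_Vs_Type_Demo is

   subtype Subtype_UTF_8  is String;       --  what Ada.Strings.UTF_Encoding does today
   type    Distinct_UTF_8 is new String;   --  what is suggested above

   S : constant String         := "plain text";
   A : constant Subtype_UTF_8  := S;                    --  silently compatible
   B : constant Distinct_UTF_8 := Distinct_UTF_8 (S);   --  explicit conversion required

begin
   null;   --  only the declarations matter here
end Subtype_Vs_Type_Demo;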


for all subprograms it's quite easy:
we just have to duplicate them with the new type, and mark the old
ones as Obsolescent.

    but, now that "subtype UTF_8_String" exists, i don't know what we can do
    for types.
    (is the only way to choose a new name?)


To fix some of the drawbacks of the current situation we are developing a new text processing library, known as VSS.

    https://github.com/AdaCore/VSS

    (are you working at AdaCore ?)

    --
    RAPID maintainer
    http://savannah.nongnu.org/projects/rapid/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas@21:1/5 to All on Sun Apr 3 20:04:36 2022
    In article <s5k0ai$bb5$1@dont-email.me>, "J-P. Rosen" <rosen@adalog.fr>
    wrote:

    Le 19/04/2021 à 15:00, Luke A. Guest a écrit :
They're different types and should be incompatible, because, well, they are. What does Ada have that allows for this that other languages
don't? Oh yeah! Types!

    They are not so different. For example, you may read the first line of a
    file in a string, then discover that it starts with a BOM, and thus
    decide it is UTF-8.

could you give me an example of something that you can do now, but could
not do if UTF_8_String were private, please?
(to discover that it starts with a BOM, you must look at it.)



    BTW, the very first version of this AI had different types, but the ARG
    felt that it would just complicate the interface for the sake of abusive "purity".

    could you explain "abusive purity" please?

I guess it is because of ASCII.
I guess a lot of developers use only ASCII in a lot of situations, and
they would find it annoying to need Ada.Strings.UTF_Encoding.Strings every
time.

but I think a simple explicit conversion is acceptable for a type that is not fully compatible and requires some attention.


the best would be to be required to use ASCII_String as an intermediate,
but I don't know how it could be designed at the language level:

    UTF_8_Var := UTF_8_String (ASCII_String (Latin_1_Var));
Latin_1_Var := String (ASCII_String (UTF_8_Var));

    and this would be forbidden :
    UTF_8_Var := UTF_8_String (Latin_1_Var);

this would ensure that Constraint_Error is raised when there are some
non-ASCII characters.
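For illustration, a minimal compilable sketch of the intermediate-type idea using a Dynamic_Predicate (the name ASCII_String is hypothetical, nothing like it exists in the standard library; and with subtypes the direct, unchecked conversion cannot actually be forbidden -- only a distinct type could do that):

with Ada.Strings.UTF_Encoding;

procedure ASCII_Intermediate_Demo is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   --  Only the 7-bit range that Latin-1 and UTF-8 have in common.
   subtype ASCII_String is String
     with Dynamic_Predicate =>
       (for all C of ASCII_String => Character'Pos (C) < 128);

   Latin_1_Var : constant String := "plain ASCII text";

   --  Going through the intermediate subtype checks the predicate; with
   --  predicate checks enabled, a non-ASCII character raises
   --  Assertion_Error (not Constraint_Error) here.
   UTF_8_Var : constant UTF_8_String :=
     UTF_8_String (ASCII_String (Latin_1_Var));

begin
   null;
end ASCII_Intermediate_Demo;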

    --
    RAPID maintainer
    http://savannah.nongnu.org/projects/rapid/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas@21:1/5 to Dmitry A. Kazakov on Sun Apr 3 19:24:11 2022
    In article <s5k0ne$opv$1@gioia.aioe.org>,
    "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote:

    On 2021-04-19 15:15, Luke A. Guest wrote:
    On 19/04/2021 14:10, Dmitry A. Kazakov wrote:

    They're different types and should be incompatible, because, well,
they are. What does Ada have that allows for this that other languages
don't? Oh yeah! Types!

    They are subtypes, differently constrained, like Positive and Integer.

    No they're not. They're subtypes only and therefore compatible. The UTF string isn't constrained in any other ways.

Of course it is. There could be string encodings that have no Unicode counterparts and are thus missing from UTF-8/16.

1
there is a validity function missing, to tell whether a given
UTF_8_String is valid or not,
and a Dynamic_Predicate on the subtype UTF_8_String connected to that
function (see the sketch after point 2 below).

2
more importantly, a valid UTF_8_String (when non-ASCII) does *not* represent
the same thing as itself converted to String.
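For point 1, a sketch of what such a validity function and predicate could look like (Is_Valid_UTF_8 and Valid_UTF_8_String are invented names, not part of Ada.Strings.UTF_Encoding; the validation below is simplified and does not reject overlong or surrogate encodings):

with Ada.Strings.UTF_Encoding;

package UTF_8_Validity is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   --  True if S is a well-formed UTF-8 octet sequence.
   function Is_Valid_UTF_8 (S : UTF_8_String) return Boolean;

   --  The predicate "connected to the function", checked on assignments
   --  and conversions when predicate checks are enabled.
   subtype Valid_UTF_8_String is UTF_8_String
     with Dynamic_Predicate => Is_Valid_UTF_8 (Valid_UTF_8_String);

end UTF_8_Validity;

package body UTF_8_Validity is

   function Is_Valid_UTF_8 (S : UTF_8_String) return Boolean is
      I             : Natural := S'First;
      Continuations : Natural;
   begin
      while I <= S'Last loop
         declare
            Lead : constant Natural := Character'Pos (S (I));
         begin
            if    Lead < 16#80#            then Continuations := 0;
            elsif Lead in 16#C2# .. 16#DF# then Continuations := 1;
            elsif Lead in 16#E0# .. 16#EF# then Continuations := 2;
            elsif Lead in 16#F0# .. 16#F4# then Continuations := 3;
            else
               return False;   --  stray continuation octet or invalid lead octet
            end if;

            for J in 1 .. Continuations loop
               if I + J > S'Last
                 or else Character'Pos (S (I + J)) not in 16#80# .. 16#BF#
               then
                  return False;
               end if;
            end loop;

            I := I + Continuations + 1;
         end;
      end loop;
      return True;
   end Is_Valid_UTF_8;

end UTF_8_Validity;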


Operations are the same, values are differently constrained. It does not
make sense to consider ASCII 'a', Latin-1 'a', UTF-8 'a' different. It
is the same glyph, differently encoded. Encoding is a representation
aspect, ergo out of the interface!

    it works because 'a' is ASCII.
    if you try it with a non-ASCII character, all goes wrong.

    --
    RAPID maintainer
    http://savannah.nongnu.org/projects/rapid/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas@21:1/5 to Randy Brukardt on Sun Apr 3 20:37:10 2022
    In article <s5n8nj$cec$1@franka.jacob-sparre.dk>,
    "Randy Brukardt" <randy@rrsoftware.com> wrote:

    "Luke A. Guest" <laguest@archeia.com> wrote in message news:s5jute$1s08$1@gioia.aioe.org...


    On 19/04/2021 13:52, Dmitry A. Kazakov wrote:

It is a practical solution. The Ada type system cannot express differently
represented/constrained string/array/vector subtypes. Ignoring Latin-1 and using String as if it were an array of octets is the best available solution.


They're different types and should be incompatible, because, well, they are. What does Ada have that allows for this that other languages don't? Oh yeah! Types!

    If they're incompatible, you need an automatic way to convert between representations, since these are all views of the same thing (an abstract string type). You really don't want 35 versions of Open each taking a different string type.

I don't need 35 versions of Open.
I need a version of Open with a Unicode string type (not Latin-1 -
preferably UTF-8), which will use Ada.Strings.UTF_Encoding.Conversions
as far as needed, depending on the underlying API.
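For illustration, a minimal sketch of such a wrapper, under the assumption that the underlying Ada.Text_IO.Open on the target expects Latin-1 file names (the package name UTF_8_Text_IO is invented; under that assumption the standard Decode raises Encoding_Error for names outside Latin-1):

with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Strings;

package UTF_8_Text_IO is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   procedure Open
     (File : in out Ada.Text_IO.File_Type;
      Mode : in     Ada.Text_IO.File_Mode;
      Name : in     UTF_8_String);

end UTF_8_Text_IO;

package body UTF_8_Text_IO is

   procedure Open
     (File : in out Ada.Text_IO.File_Type;
      Mode : in     Ada.Text_IO.File_Mode;
      Name : in     UTF_8_String) is
   begin
      --  The library, not the user, converts the UTF-8 name to whatever
      --  the underlying API expects (assumed here to be Latin-1).
      Ada.Text_IO.Open
        (File => File,
         Mode => Mode,
         Name => Ada.Strings.UTF_Encoding.Strings.Decode (Name));
   end Open;

end UTF_8_Text_IO;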



    It's the fact that Ada can't do this that makes Unbounded_Strings unusable (well, barely usable).

    knowing Ada, i find it acceptable.
    i don't say the same about Ada.Strings.UTF_Encoding.UTF_8_String.

Ada 202x fixes the literal problem at least, but we'd
have to completely abandon Unbounded_Strings and use a different library design in order for it to allow literals. And if you're going to do
that, you might as well do something about UTF-8 as well -- but now you're going to need even more conversions. Yuck.

as I said to Vadim Godunko, I need to fill a string type with a UTF-8 literal.
but I don't think this string type has to manage various conversions.

from my point of view, each library has to accept 1 kind of string type (preferably UTF-8 everywhere),
and then this library has to make the needed conversions for the
underlying API, not the user.



    I think the only true solution here would be based on a proper abstract Root_String type. But that wouldn't work in Ada, since it would be incompatible with all of the existing code out there. Probably would have to wait for a follow-on language.

of course, it would be very nice to have a thicker language with a
garbage collector, only 1 String type which allows everything we need, etc.

    --
    RAPID maintainer
    http://savannah.nongnu.org/projects/rapid/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas@21:1/5 to Simon Wright on Sun Apr 3 21:20:19 2022
    In article <lyfszm5xv2.fsf@pushface.org>,
    Simon Wright <simon@pushface.org> wrote:

    But don't use unit names containing international characters, at any
    rate if you're (interested in compiling on) Windows or macOS:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

if I understand correctly, Eric Botcazou is a GNU admin who decided to reject your bug? I find that very "low portability thinking"!

it is the responsibility of compilers and other underlying tools to manage the various underlying OSes and file systems,
not of the user to avoid those that the compiler devs find too bad!
(or to use the right encoding. I heard that Windows uses UTF-16, do you know about that?)


clearly, To_Lower takes Latin-1.
and this kind of problem would be easier to avoid if string types were stronger ...


    after:

    package Ada.Strings.UTF_Encoding
    ...
    type UTF_8_String is new String;
    ...
    end Ada.Strings.UTF_Encoding;

    i would have also made:

    package Ada.Directories
    ...
    type File_Name_String is new Ada.Strings.UTF_Encoding.UTF_8_String;
    ...
    end Ada.Directories;

    with probably a validity check and a Dynamic_Predicate which allows "".

    then, i would use File_Name_String in all Ada.Directories and Ada.*_IO.

    --
    RAPID maintainer
    http://savannah.nongnu.org/projects/rapid/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vadim Godunko@21:1/5 to Thomas on Sun Apr 3 23:10:25 2022
    On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote:

    But don't use unit names containing international characters, at any
    rate if you're (interested in compiling on) Windows or macOS:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

    and this kind of problems would be easier to avoid if string types were stronger ...


Your suggestion is unable to resolve this issue on Mac OS X. Like case sensitivity, a binary compare of two strings can't compare strings in different normalization forms. The right solution is to use the right type to represent any path, and even that doesn't
resolve some issues, like relative paths and changes of rules at mount points.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Simon Wright@21:1/5 to Thomas on Mon Apr 4 15:33:21 2022
    Thomas <fantome.forums.tDeContes@free.fr.invalid> writes:

    In article <lyfszm5xv2.fsf@pushface.org>,
    Simon Wright <simon@pushface.org> wrote:

    But don't use unit names containing international characters, at any
    rate if you're (interested in compiling on) Windows or macOS:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

    if i understand, Eric Botcazou is a gnu admin who decided to reject
    your bug? i find him very "low portability thinking"!

    To be fair, he only suspended it - you can tell I didn't want to press
    very far.

    We could remove the part where the filename is smashed to lower-case as
    if it were ASCII[1][2][3] (OK, perhaps Latin-1?) if the machine is
    Windows or (Apple if not on aarch64!!!), but that still leaves the
    filesystem name issue. Windows might be OK (code pages???)

    [1] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/adaint.c#L620
    [2] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/lib-writ.adb#L812
[3] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/lib-writ.adb#L1490

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Simon Wright@21:1/5 to Vadim Godunko on Mon Apr 4 15:19:16 2022
    Vadim Godunko <vgodunko@gmail.com> writes:

    On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote:

    But don't use unit names containing international characters, at
    any rate if you're (interested in compiling on) Windows or macOS:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

    and this kind of problems would be easier to avoid if string types
    were stronger ...


    Your suggestion is unable to resolve this issue on Mac OS X. Like case sensitivity, binary compare of two strings can't compare strings in
    different normalization forms. Right solution is to use right type to represent any paths, and even it doesn't resolve some issues, like
    relative paths and change of rules at mounting points.

    I think that's a macOS problem that Apple aren't going to resolve* any
    time soon! While banging my head against PR81114 recently, I found
    (can't remember where) that (lower case a acute) and (lower case a,
    combining acute) represent the same concept and it's up to
    tools/operating systems etc to recognise that.

    Emacs, too, has a problem: it doesn't recognise the 'combining' part of
    (lower case a, combining acute), so what you see on your screen is "a'".

    * I don't know how/whether clang addresses this.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Simon Wright@21:1/5 to Simon Wright on Mon Apr 4 16:11:23 2022
    Simon Wright <simon@pushface.org> writes:

    I think that's a macOS problem that Apple aren't going to resolve* any
    time soon! While banging my head against PR81114 recently, I found
    (can't remember where) that (lower case a acute) and (lower case a,
    combining acute) represent the same concept and it's up to
    tools/operating systems etc to recognise that.
    [...]
    * I don't know how/whether clang addresses this.

It doesn't, so far as I can tell; it has the exact same problem.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Randy Brukardt@21:1/5 to All on Mon Apr 4 18:52:32 2022
    "Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message news:fantome.forums.tDeContes-5E3B70.20370903042022@news.free.fr...
    ...
as I said to Vadim Godunko, I need to fill a string type with a UTF-8 literal. but I don't think this string type has to manage various conversions.

    from my point of view, each library has to accept 1 kind of string type (preferably UTF-8 everywhere),
    and then, this library has to make needed conversions regarding the underlying API. not the user.

    This certainly is a fine ivory tower solution, but it completely ignores two practicalities in the case of Ada:

    (1) You need to replace almost all of the existing Ada language defined packages to make this work. Things that are deeply embedded in both implementations and programs (like Ada.Exceptions and Ada.Text_IO) would
    have to change substantially. The result would essentially be a different language, since the resulting libraries would not work with most existing programs. They'd have to have different names (since if you used the same names, you change the failures from compile-time to runtime -- or even undetected -- which would be completely against the spirit of Ada), which
    means that one would have to essentially start over learning and using the resulting language. Calling it Ada would be rather silly, since it would be practically incompatible (and it would make sense to use this point to eliminate a lot of the cruft from the Ada design).

    (2) One needs to be able to read and write data given whatever encoding the project requires (that's often decided by outside forces, such as other hardware or software that the project needs to interoperate with). That
    means that completely hiding the encoding (or using a universal encoding) doesn't fully solve the problems faced by Ada programmers. At a minimum, you have to have a way to specify the encoding of files, streams, and hardware interfaces (this sort of thing is not provided by any common target OS, so
    it's not in any target API). That will greatly complicate the interface and implementation of the libraries.

    ... of course, it would be very nice to have a more thicker language with
    a garbage collector ...

    I doubt that you will ever see that in the Ada family, as analysis and therefore determinism is a very important property for the language. Ada has lots of mechanisms for managing storage without directly doing it yourself
    (by calling Unchecked_Deallocation), yet none of them use any garbage collection in a traditional sense. I could see more such mechanisms (an ownership option on the line of Rust could easily manage storage at the same time, since any object that could be orphaned could never be used again and thus should be reclaimed), but standard garbage collection is too non-deterministic for many of the uses Ada is put to.

    Randy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vadim Godunko@21:1/5 to Simon Wright on Tue Apr 5 00:59:54 2022
    On Monday, April 4, 2022 at 5:19:20 PM UTC+3, Simon Wright wrote:
    I think that's a macOS problem that Apple aren't going to resolve* any
    time soon! While banging my head against PR81114 recently, I found
    (can't remember where) that (lower case a acute) and (lower case a,
    combining acute) represent the same concept and it's up to
    tools/operating systems etc to recognise that.

And it will not. It is the application's responsibility to convert file names to NFD before passing them to the OS. Also, the application must compare any paths after conversion to NFD; it is important to handle the more complicated cases where canonical reordering is applied.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From J-P. Rosen@21:1/5 to All on Wed Apr 6 21:57:01 2022
    Le 03/04/2022 à 21:04, Thomas a écrit :
    They are not so different. For example, you may read the first line of a
    file in a string, then discover that it starts with a BOM, and thus
    decide it is UTF-8.

    could you give me an example of sth that you can do yet, and you could
    not do if UTF_8_String was private, please?
    (to discover that it starts with a BOM, you must look at it.)
Just what I said above, since a BOM is not valid UTF-8 (otherwise, it
could not be recognized).


    BTW, the very first version of this AI had different types, but the ARG
    felt that it would just complicate the interface for the sake of abusive
    "purity".

    could you explain "abusive purity" please?

It was felt that in practice, being too strict in separating the types
would make things more difficult, without any practical gain. This has
been discussed - you may not agree with the outcome, but it was not made
out of pure laziness.

    --
    J-P. Rosen
    Adalog
    2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
    Tel: +33 1 45 29 21 52
    https://www.adalog.fr

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Randy Brukardt@21:1/5 to J-P. Rosen on Wed Apr 6 20:30:58 2022
    "J-P. Rosen" <rosen@adalog.fr> wrote in message news:t2knpr$s26$1@dont-email.me...
    ...
    It was felt that in practice, being too strict in separating the types
    would make things more difficult, without any practical gain. This has
    been discussed - you may not agree with the outcome, but it was not made
    out of pure lazyness

The problem with that, of course, is that it sends the wrong message
vis-a-vis strong typing and interfaces. If we abandon it at the first sign
of trouble, then we are saying that it isn't really that important.

    In this particular case, the reason really came down to practicality: if you want to do anything string-like with a UTF-8 string, making it a separate
    type becomes painful. It wouldn't work with anything in Ada.Strings, Ada.Text_IO, or Ada.Directories, even though most of the operations are
    fine. And there was no political will to replace all of those things with versions to use with proper universal strings.

    Moreover, if you really want to do that, you have to hide much of the array behavior of the Universal string. For instance, you can't allow willy-nilly slicing or replacement: cutting a character representation in half or
    setting an illegal representation has to be prohibited (operations that
    would turn a valid string into an invalid string should always raise an exception). That means you can't (directly) use built-in indexing and
    slicing -- those have to go through some sort of functions. So you do pretty much have to use a private type for universal strings (similar to Ada.Strings.Bounded would be best, I think).

    If you had an Ada-like language that used a universal UTF-8 string
    internally, you then would have a lot of old and mostly useless operations supported for array types (since things like slices are mainly useful for string operations). So such a language should simplify the core
    substantially by dropping many of those obsolete features (especially as
    little of the library would be directly compatible anyway). So one should
    end up with a new language that draws from Ada rather than something in Ada itself. (It would be great if that language could make strings with
    different capacities interoperable - a major annoyance with Ada. And modernizing access types, generalizing resolution, and the like also would
    be good improvements IMHO.)

    Randy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Simon Wright@21:1/5 to Randy Brukardt on Fri Apr 8 09:56:19 2022
    "Randy Brukardt" <randy@rrsoftware.com> writes:

    If you had an Ada-like language that used a universal UTF-8 string internally, you then would have a lot of old and mostly useless
    operations supported for array types (since things like slices are
    mainly useful for string operations).

    Just off the top of my head, wouldn't it be better to use UTF32-encoded Wide_Wide_Character internally? (you would still have trouble with
    e.g. national flag emojis :)
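As a small illustration of the two forms under discussion, using only the standard Ada.Strings.UTF_Encoding conversions (the accented text is built with 'Val so the source file stays plain ASCII):

with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure UTF32_Internal_Demo is

   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  "déjà vu" spelled out code point by code point.
   Text : constant Wide_Wide_String :=
     "d" & Wide_Wide_Character'Val (16#E9#)   --  é
         & "j"
         & Wide_Wide_Character'Val (16#E0#)   --  à
         & " vu";

   --  Exchange form: UTF-8 octets.  Internal form: one code point per element.
   Octets : constant Ada.Strings.UTF_Encoding.UTF_8_String := Conv.Encode (Text);
   Points : constant Wide_Wide_String := Conv.Decode (Octets);

begin
   Ada.Text_IO.Put_Line
     ("octets:" & Natural'Image (Octets'Length)
      & "  code points:" & Natural'Image (Points'Length));
end UTF32_Internal_Demo;

The two counts differ (9 octets versus 7 code points here), and, as the smiley above hints, even code points are not "characters": a national flag emoji is two code points.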

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dmitry A. Kazakov@21:1/5 to Simon Wright on Fri Apr 8 11:26:05 2022
    On 2022-04-08 10:56, Simon Wright wrote:
    "Randy Brukardt" <randy@rrsoftware.com> writes:

    If you had an Ada-like language that used a universal UTF-8 string
    internally, you then would have a lot of old and mostly useless
    operations supported for array types (since things like slices are
    mainly useful for string operations).

    Just off the top of my head, wouldn't it be better to use UTF32-encoded Wide_Wide_Character internally?

Yep, that is exactly the problem, a confusion between interface and implementation.

Encoding /= interface, e.g. the interface of a string viewed as an array
of characters. That interface is just the same for ASCII, Latin-1, EBCDIC,
RADIX50, UTF-8, etc. strings. Why do you care what is inside?

The Ada type system's inability to implement this interface is another
issue. The usefulness of this interface is yet another. For immutable
strings it is quite useful. For mutable strings it might appear too constrained, e.g. for packed encodings like UTF-8 and UTF-16.

    Also this interface should have nothing to do with the interface of an
    UTF-8 string as an array of octets or the interface of an UTF-16LE
    string as an array of little endian words.

    Since Ada cannot separate these interfaces, for practical purposes,
    Strings are arrays of octets considered as UTF-8 encoding. The rest goes
    into coding guidelines under the title "never ever do this."

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Simon Wright@21:1/5 to Dmitry A. Kazakov on Fri Apr 8 20:19:08 2022
    "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

    On 2022-04-08 10:56, Simon Wright wrote:
    "Randy Brukardt" <randy@rrsoftware.com> writes:

    If you had an Ada-like language that used a universal UTF-8 string
    internally, you then would have a lot of old and mostly useless
    operations supported for array types (since things like slices are
    mainly useful for string operations).

    Just off the top of my head, wouldn't it be better to use
    UTF32-encoded Wide_Wide_Character internally?

    Yep, that is the exactly the problem, a confusion between interface
    and implementation.

Don't understand. My point was that *when you are implementing this* it
might be easier to deal with 32-bit characters/code points/whatever the
proper jargon is than with UTF-8.

    Encoding /= interface, e.g. an interface of a string viewed as an
    array of characters. That interface just same for ASCII, Latin-1,
    EBCDIC, RADIX50, UTF-8 etc strings. Why do you care what is inside?

    With a user's hat on, I don't. Implementers might have a different point
    of view.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Simon Wright@21:1/5 to Vadim Godunko on Fri Apr 8 10:01:33 2022
    Vadim Godunko <vgodunko@gmail.com> writes:

    On Monday, April 4, 2022 at 5:19:20 PM UTC+3, Simon Wright wrote:
    I think that's a macOS problem that Apple aren't going to resolve* any
    time soon! While banging my head against PR81114 recently, I found
    (can't remember where) that (lower case a acute) and (lower case a,
    combining acute) represent the same concept and it's up to
    tools/operating systems etc to recognise that.

    And will not. It is application responsibility to convert file names
    to NFD to pass to OS. Also, application must compare any paths after conversion to NFD, it is important to handle more complicated cases
    when canonical reordering is applied.

    Isn't the compiler a tool? gnatmake? gprbuild? (gnatmake handles ACATS
    c250002 provided you tell the compiler that the fs is case-sensitive,
    gprbuild doesn't even manage that)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dmitry A. Kazakov@21:1/5 to Simon Wright on Fri Apr 8 21:45:18 2022
    On 2022-04-08 21:19, Simon Wright wrote:
    "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

    On 2022-04-08 10:56, Simon Wright wrote:
    "Randy Brukardt" <randy@rrsoftware.com> writes:

    If you had an Ada-like language that used a universal UTF-8 string
    internally, you then would have a lot of old and mostly useless
    operations supported for array types (since things like slices are
    mainly useful for string operations).

    Just off the top of my head, wouldn't it be better to use
    UTF32-encoded Wide_Wide_Character internally?

    Yep, that is the exactly the problem, a confusion between interface
    and implementation.

    Don't understand. My point was that *when you are implementing this* it
    mught be easier to deal with 32-bit charactrs/code points/whatever the
    proper jargon is than with UTF8.

I think it would be more difficult, because you will have to convert
from and to UTF-8 under the hood or explicitly. UTF-8 is the de-facto
interface standard and I/O standard. That covers 60-70% of all cases
where you need a string. Most string operations like search, comparison,
slicing are isomorphic between code points and octets. So you would win
nothing from keeping strings internally as arrays of code points.

    The situation is comparable to Unbounded_Strings. The implementation is relatively simple, but the user must carry the burden of calling
    To_String and To_Unbounded_String all over the application and the
    processor must suffer the overhead of copying arrays here and there.
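The burden in question, in its smallest form (standard library only):

with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;

procedure Unbounded_Burden is
   Name : Unbounded_String := To_Unbounded_String ("initial");
begin
   Name := Name & " and more";               --  "&" is provided for the type...
   Ada.Text_IO.Put_Line (To_String (Name));  --  ...but I/O still needs To_String
end Unbounded_Burden;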

    Encoding /= interface, e.g. an interface of a string viewed as an
    array of characters. That interface just same for ASCII, Latin-1,
    EBCDIC, RADIX50, UTF-8 etc strings. Why do you care what is inside?

    With a user's hat on, I don't. Implementers might have a different point
    of view.

Sure, but in the Ada philosophy their opinion should carry less weight
than, say, in C.

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Randy Brukardt@21:1/5 to Dmitry A. Kazakov on Fri Apr 8 23:05:38 2022
    "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message news:t2q3cb$bbt$1@gioia.aioe.org...
    On 2022-04-08 21:19, Simon Wright wrote:
    "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

    On 2022-04-08 10:56, Simon Wright wrote:
    "Randy Brukardt" <randy@rrsoftware.com> writes:

    If you had an Ada-like language that used a universal UTF-8 string
    internally, you then would have a lot of old and mostly useless
    operations supported for array types (since things like slices are
    mainly useful for string operations).

    Just off the top of my head, wouldn't it be better to use
    UTF32-encoded Wide_Wide_Character internally?

    Yep, that is the exactly the problem, a confusion between interface
    and implementation.

    Don't understand. My point was that *when you are implementing this* it
    mught be easier to deal with 32-bit charactrs/code points/whatever the
    proper jargon is than with UTF8.

    I think it would be more difficult, because you will have to convert from
    and to UTF-8 under the hood or explicitly. UTF-8 is de-facto interface standard and I/O standard. That would be 60-70% of all cases you need a string. Most string operations like search, comparison, slicing are isomorphic between code points and octets. So you would win nothing from keeping strings internally as arrays of code points.

    I basically agree with Dmitry here. The internal representation is an implementation detail, but it seems likely that you would want to store
    UTF-8 strings directly; they're almost always going to be half the size
    (even for languages using their own characters like Greek) and for most of
us, they'll be just a bit more than a quarter the size. The number of bytes
    you copy around matters; the number of operations where code points are
    needed is fairly small.

The main problem with UTF-8 is representing the code point positions in a
way that they (a) aren't abused and (b) don't cost too much to calculate.
Just using character indexes is too expensive for UTF-8 and UTF-16 representations, and using octet indexes is unsafe (since splitting a character representation is a possibility). I'd probably use an abstract character position type that was implemented with an octet index under the covers (a sketch of such a type follows the example below).

    I think that would work OK as doing math on those is suspicious with a UTF representation. We're spoiled from using Latin-1 representations, of course, but generally one is interested in 5 characters, not 5 octets. And the
    number of octets in 5 characters depends on the string. So most of the sorts
    of operations that I tend to do (for instance from some code I was fixing earlier today):

if Font'Length > 6 and then
   Font (2 .. 6) = "Arial" then

This would be a bad idea if one is using any sort of universal
representation -- you don't know how many octets are in the string literal so you can't assume a number in the test string. So the slice is dangerous
(even though in this particular case it would be OK since the test string is all ASCII characters -- but I wouldn't want users to get in the habit of assuming such things).

    [BTW, the above was a bad idea anyway, because it turns out that the
    function in the Ada library returned bounds that don't start at 1. So the
    slice was usually out of range -- which is why I was looking at the code. Another thing that we could do without. Slices are evil, since they *seem*
    to be the right solution, yet rarely are in practice without a lot of
    hoops.]
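A rough sketch of that kind of abstract position type (all names here -- UTF_8_Cursors, Cursor, Next, Element -- are invented for illustration, not an existing library's API; the cursor holds an octet offset, but clients can only move it a whole code point at a time):

with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

package UTF_8_Cursors is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   type Cursor is private;   --  an octet offset under the covers

   function First (S : UTF_8_String) return Cursor;
   function Has_Element (S : UTF_8_String; C : Cursor) return Boolean;
   function Next (S : UTF_8_String; C : Cursor) return Cursor;
   function Element (S : UTF_8_String; C : Cursor) return Wide_Wide_Character;

private

   type Cursor is record
      Offset : Natural := 0;   --  octets before the current code point
   end record;

end UTF_8_Cursors;

package body UTF_8_Cursors is

   --  Octet length of the code-point representation whose lead octet is B.
   --  Assumes S is valid UTF-8; no validity checking here.
   function Sequence_Length (B : Character) return Positive is
      P : constant Natural := Character'Pos (B);
   begin
      if    P < 16#80# then return 1;
      elsif P < 16#E0# then return 2;
      elsif P < 16#F0# then return 3;
      else                  return 4;
      end if;
   end Sequence_Length;

   function First (S : UTF_8_String) return Cursor is
      pragma Unreferenced (S);
   begin
      return (Offset => 0);
   end First;

   function Has_Element (S : UTF_8_String; C : Cursor) return Boolean is
     (C.Offset < S'Length);

   function Next (S : UTF_8_String; C : Cursor) return Cursor is
   begin
      return (Offset => C.Offset + Sequence_Length (S (S'First + C.Offset)));
   end Next;

   function Element (S : UTF_8_String; C : Cursor) return Wide_Wide_Character is
      From : constant Positive := S'First + C.Offset;
      WW   : constant Wide_Wide_String :=
        Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode
          (S (From .. From + Sequence_Length (S (From)) - 1));
   begin
      return WW (WW'First);
   end Element;

end UTF_8_Cursors;

Iteration is then a First / Has_Element / Next loop, and index arithmetic such as Font (2 .. 6) is simply not expressible on Cursor values.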

    The situation is comparable to Unbounded_Strings. The implementation is relatively simple, but the user must carry the burden of calling To_String and To_Unbounded_String all over the application and the processor must suffer the overhead of copying arrays here and there.

    Yes, but that happens because Ada doesn't really have a string abstraction,
    so when you try to build one, you can't fully do the job. One presumes that
    a new language with a universal UTF-8 string wouldn't have that problem. (As previously noted, I don't see much point in trying to patch up Ada with a
    bunch of UTF-8 string packages; you would need an entire new set of
    Ada.Strings libraries and I/O libraries, and then you'd have all of the old stuff messing up resolution, using the best names, and confusing everything.
    A cleaner slate is needed.)

    Randy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Simon Wright@21:1/5 to Randy Brukardt on Sat Apr 9 08:43:34 2022
    "Randy Brukardt" <randy@rrsoftware.com> writes:

    "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message news:t2q3cb$bbt$1@gioia.aioe.org...
    On 2022-04-08 21:19, Simon Wright wrote:
    "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

    On 2022-04-08 10:56, Simon Wright wrote:
    "Randy Brukardt" <randy@rrsoftware.com> writes:

If you had an Ada-like language that used a universal UTF-8 string internally, you then would have a lot of old and mostly useless
operations supported for array types (since things like slices are mainly useful for string operations).

    Just off the top of my head, wouldn't it be better to use
    UTF32-encoded Wide_Wide_Character internally?

    Yep, that is the exactly the problem, a confusion between interface
    and implementation.

    Don't understand. My point was that *when you are implementing this* it
    mught be easier to deal with 32-bit charactrs/code points/whatever the
    proper jargon is than with UTF8.

    I think it would be more difficult, because you will have to convert from
    and to UTF-8 under the hood or explicitly. UTF-8 is de-facto interface
    standard and I/O standard. That would be 60-70% of all cases you need a
    string. Most string operations like search, comparison, slicing are
    isomorphic between code points and octets. So you would win nothing from
    keeping strings internally as arrays of code points.

    I basically agree with Dmitry here. The internal representation is an implementation detail, but it seems likely that you would want to store
    UTF-8 strings directly; they're almost always going to be half the size
    (even for languages using their own characters like Greek) and for most of us, they'll be just a bit more than a quarter the size. The amount of bytes you copy around matters; the number of operations where code points are needed is fairly small.

    Well, I don't have any skin in this game, so I'll shut up at this point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From DrPi@21:1/5 to All on Sat Apr 9 12:27:04 2022
    Le 09/04/2022 à 06:05, Randy Brukardt a écrit :
    "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message news:t2q3cb$bbt$1@gioia.aioe.org...
    On 2022-04-08 21:19, Simon Wright wrote:
    "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

    On 2022-04-08 10:56, Simon Wright wrote:
    "Randy Brukardt" <randy@rrsoftware.com> writes:

If you had an Ada-like language that used a universal UTF-8 string internally, you then would have a lot of old and mostly useless
operations supported for array types (since things like slices are mainly useful for string operations).

    Just off the top of my head, wouldn't it be better to use
    UTF32-encoded Wide_Wide_Character internally?

    Yep, that is the exactly the problem, a confusion between interface
    and implementation.

    Don't understand. My point was that *when you are implementing this* it
    mught be easier to deal with 32-bit charactrs/code points/whatever the
    proper jargon is than with UTF8.

    I think it would be more difficult, because you will have to convert from
    and to UTF-8 under the hood or explicitly. UTF-8 is de-facto interface
    standard and I/O standard. That would be 60-70% of all cases you need a
    string. Most string operations like search, comparison, slicing are
    isomorphic between code points and octets. So you would win nothing from
    keeping strings internally as arrays of code points.

    I basically agree with Dmitry here. The internal representation is an implementation detail, but it seems likely that you would want to store
    UTF-8 strings directly; they're almost always going to be half the size
    (even for languages using their own characters like Greek) and for most of us, they'll be just a bit more than a quarter the size. The amount of bytes you copy around matters; the number of operations where code points are needed is fairly small.

    The main problem with UTF-8 is representing the code point positions in a
    way that they (a) aren't abused and (b) don't cost too much to calculate. Just using character indexes is too expensive for UTF-8 and UTF-16 representations, and using octet indexes is unsafe (since the splitting a character representation is a possibility). I'd probably use an abstract character position type that was implemented with an octet index under the covers.

    I think that would work OK as doing math on those is suspicious with a UTF representation. We're spoiled from using Latin-1 representations, of course, but generally one is interested in 5 characters, not 5 octets. And the
    number of octets in 5 characters depends on the string. So most of the sorts of operations that I tend to do (for instance from some code I was fixing earlier today):

if Font'Length > 6 and then
   Font (2 .. 6) = "Arial" then

    This would be a bad idea if one is using any sort of universal
    representation -- you don't know how many octets is in the string literal so you can't assume a number in the test string. So the slice is dangerous
    (even though in this particular case it would be OK since the test string is all Ascii characters -- but I wouldn't want users to get in the habit of assuming such things).

    [BTW, the above was a bad idea anyway, because it turns out that the
    function in the Ada library returned bounds that don't start at 1. So the slice was usually out of range -- which is why I was looking at the code. Another thing that we could do without. Slices are evil, since they *seem*
    to be the right solution, yet rarely are in practice without a lot of
    hoops.]

    The situation is comparable to Unbounded_Strings. The implementation is
relatively simple, but the user must carry the burden of calling To_String and To_Unbounded_String all over the application and the processor must
    suffer the overhead of copying arrays here and there.

    Yes, but that happens because Ada doesn't really have a string abstraction, so when you try to build one, you can't fully do the job. One presumes that
    a new language with a universal UTF-8 string wouldn't have that problem. (As previously noted, I don't see much point in trying to patch up Ada with a bunch of UTF-8 string packages; you would need an entire new set of Ada.Strings libraries and I/O libraries, and then you'd have all of the old stuff messing up resolution, using the best names, and confusing everything. A cleaner slate is needed.)

    Randy.



In Python-2, there is the same kind of problem. A string is a byte
array. It is the programmer's responsibility to encode/decode to/from UTF-8/Latin-1/... and to manage everything correctly. Literal strings can
be considered as encoded or decoded depending on the notation ("" or u"").

In Python-3, a string is a character (glyph?) array. The internal representation is hidden from the programmer.
UTF-8/Latin-1/... encoded "strings" are of type bytes (byte array). Writing/reading to/from a file is done with the bytes type.
When writing/reading to/from a file in text mode, you have to specify
the encoding to use. The encoding/decoding is then internally managed.
As a general rule, all "external communications" are done with bytes
(byte array). It is the programmer's responsibility to encode/decode
where needed to convert from/to strings.
The source files (.py) are considered to be UTF-8 encoded by default, but
one can declare the actual encoding at the top of the file in a special
comment tag. When a badly encoded character is found, an exception is
raised at parsing time. So, literal strings are real strings, not bytes.

    I think the Python-3 way of doing things is much more understandable and
    really usable.

    On the Ada side, I've still not understood how to correctly deal with
    all this stuff.


Note: In Python-3, the bytes type is not reserved for encoded "strings". It
is a versatile type for exactly what its name says: a byte array.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dennis Lee Bieber@21:1/5 to All on Sat Apr 9 12:46:04 2022
    On Sat, 9 Apr 2022 12:27:04 +0200, DrPi <314@drpi.fr> declaimed the
    following:


In Python-3, a string is a character(glyph ?) array. The internal representation is hidden to the programmer.

    <SNIP>

    On the Ada side, I've still not understood how to correctly deal with
    all this stuff.

    One thing to take into account is that Python strings are immutable. Changing the contents of a string requires constructing a new string from
    parts that incorporate the change.

    That allows for the second aspect -- even if not visible to a programmer, Python (3) strings are not a fixed representation: If all characters in the string fit in the 8-bit UTF range, that string is stored using one byte per character. If any character uses a 16-bit UTF representation, the entire string is stored as 16-bit characters (and
    similar for 32-bit UTF points). Thus, indexing into the string is still
    fast -- just needing to scale the index by the character width of the
    entire string.




    --
    Wulfraed Dennis Lee Bieber AF6VN
    wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From DrPi@21:1/5 to All on Sat Apr 9 20:59:59 2022
    Le 09/04/2022 à 18:46, Dennis Lee Bieber a écrit :
    On Sat, 9 Apr 2022 12:27:04 +0200, DrPi <314@drpi.fr> declaimed the following:


    In Python-3, a string is a character(glyph ?) array. The internal
    representation is hidden to the programmer.

    <SNIP>

    On the Ada side, I've still not understood how to correctly deal with
    all this stuff.

    One thing to take into account is that Python strings are immutable. Changing the contents of a string requires constructing a new string from parts that incorporate the change.


    Right. I forgot to mention it.

    That allows for the second aspect -- even if not visible to a programmer, Python (3) strings are not a fixed representation: If all characters in the string fit in the 8-bit UTF range, that string is stored using one byte per character. If any character uses a 16-bit UTF representation, the entire string is stored as 16-bit characters (and
    similar for 32-bit UTF points). Thus, indexing into the string is still
    fast -- just needing to scale the index by the character width of the
    entire string.


    Thanks for clarifying.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vadim Godunko@21:1/5 to DrPi on Sat Apr 9 22:58:49 2022
    On Saturday, April 9, 2022 at 1:27:08 PM UTC+3, DrPi wrote:

    On the Ada side, I've still not understood how to correctly deal with
    all this stuff.

    Take a look at https://github.com/AdaCore/VSS

The ideas behind this library are close to the type-separation ideas of Python 3. A string is a Virtual_String, a byte sequence is a Stream_Element_Vector. If you need to convert a byte stream to a string or back, use Virtual_String_Encoder/Virtual_String_Decoder.

I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and programming languages; cleaner types and APIs are a requirement now. The only case where the old character/string types really have value is low-resource embedded systems; in
other cases their use generates a lot of hidden issues, which are very hard to detect.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From DrPi@21:1/5 to All on Sun Apr 10 20:59:20 2022
    Le 10/04/2022 à 07:58, Vadim Godunko a écrit :
    On Saturday, April 9, 2022 at 1:27:08 PM UTC+3, DrPi wrote:

    On the Ada side, I've still not understood how to correctly deal with
    all this stuff.

    Take a look at https://github.com/AdaCore/VSS

    Ideas behind this library is close to ideas of types separation in Python3. String is a Virtual_String, byte sequence is Stream_Element_Vector. Need to convert byte stream to string or back - use Virtual_String_Encoder/Virtual_String_Decoder.

    I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and programming languages; more cleaner types and API is a requirement now. The only case when old character/string types is really makes value is low resources embedded systems;
    in other cases their use generates a lot of hidden issues, which is very hard to detect.

    That's an interesting solution.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Randy Brukardt@21:1/5 to All on Tue Apr 12 01:13:08 2022
    "Vadim Godunko" <vgodunko@gmail.com> wrote in message news:3962d55d-10e8-4dff-9ad3-847d69c3c337n@googlegroups.com...
    ...
I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and programming languages; more cleaner types and API is a requirement now.

    ...which essentially means Ada is obsolete in your view, as String in particular is way too embedded in the definition and the language-defined
    units to use anything else. You'd end up with a mass of conversions to get anything done (the main problem with Ada.Strings.Unbounded).

    Or I suppose you could replace pretty much the entire library with a new
    one. But now you have two of everything to confuse newcomers and you still
    have a mass of old nonsense weighing down the language and complicating implementations.

The only case when old character/string types is really makes value is low resources embedded systems; ...

    ...which of course is at least 50% of the use of Ada, and probably closer to 90% of the money. Any solution for Ada has to continue to meet the needs of embedded programmers. For instance, it would need to support fixed, bounded, and unbounded versions (solely having unbounded strings would not work for
    many applications, and indeed not just embedded systems need to restrict
    those -- any long-running server has to control dynamic allocation)

...in other cases their use generates a lot of hidden issues, which is very hard to detect.

    At least some of which occur because a string is not an array, and the
    forcible mapping to them never worked very well. The Z-80 Pascals that we
    used to implement the very earliest versions of Ada had more functional
    strings than Ada does (by being bounded and using a library for most operations) - they would have been way easier to extend (as the Python ones were, as an example).

    Randy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas@21:1/5 to Stephen Leake on Sat Apr 16 04:32:19 2022
    In article <86mttuk5f0.fsf@stephe-leake.org>,
    Stephen Leake <stephen_leake@stephe-leake.org> wrote:

    DrPi <314@drpi.fr> writes:

    Any way to use source code encoded in UTF-8 ?


    from the gnat user guide, 4.3.1 Alphabetical List of All Switches:

-gnatic
Identifier character set (c = 1/2/3/4/8/9/p/f/n/w). For details
of the possible selections for c, see the Character Set Control
section of the GNAT User's Guide.

    This applies to identifiers in the source code

-gnatWe
Wide character encoding method (e = n/h/u/s/e/8).

    This applies to string and character literals.


    afaik, -gnati is deactivated when -gnatW is not n or h (from memory)

so you can't both ask to check that identifiers are in ASCII and have literals in UTF-8.


(if it's resolved in newer versions, that's good news :-) )

    --
    RAPID maintainer
    http://savannah.nongnu.org/projects/rapid/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas@21:1/5 to Vadim Godunko on Fri Mar 31 01:35:48 2023
    sorry for the delay.


    In article <48309745-aa2a-47bd-a4f9-6daa843e0771n@googlegroups.com>,
    Vadim Godunko <vgodunko@gmail.com> wrote:

    On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote:

    But don't use unit names containing international characters, at any
    rate if you're (interested in compiling on) Windows or macOS:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

    and this kind of problems would be easier to avoid if string types were stronger ...


    Your suggestion is unable to resolve this issue on Mac OS X.

    i said "easier" not "easy".

    don't forget that Unicode has 2 levels :
    - octets <-> code points
    - code points <-> characters/glyphs

    and you can't expect the upper to work if the lower doesn't.


    Like case
    sensitivity, binary compare of two strings can't compare strings in different normalization forms. Right solution is to use right type to represent any paths,

    what would be the "right type", according to you?


    In fact, here the first question to ask is:
    what's the expected encoding for Ada.Text_IO.Open.Name?
- is it Latin-1 because the type is String, not UTF_8_String?
- is it undefined because it depends on the underlying FS?

    --
    RAPID maintainer
    http://savannah.nongnu.org/projects/rapid/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas@21:1/5 to Randy Brukardt on Fri Mar 31 05:06:22 2023
    In article <t2g0c1$eou$1@dont-email.me>,
    "Randy Brukardt" <randy@rrsoftware.com> wrote:

    "Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message news:fantome.forums.tDeContes-5E3B70.20370903042022@news.free.fr...
    ...
    as i said to Vadim Godunko, i need to fill a string type with an UTF-8 litteral.but i don't think this string type has to manage various conversions.

    from my point of view, each library has to accept 1 kind of string type (preferably UTF-8 everywhere),
    and then, this library has to make needed conversions regarding the underlying API. not the user.

    This certainly is a fine ivory tower solution,

    I like to think from an ivory tower,
    and then look at the reality to see what's possible to do or not. :-)



    but it completely ignores two
    practicalities in the case of Ada:

    (1) You need to replace almost all of the existing Ada language defined packages to make this work. Things that are deeply embedded in both implementations and programs (like Ada.Exceptions and Ada.Text_IO) would
    have to change substantially. The result would essentially be a different language, since the resulting libraries would not work with most existing programs.

    - in Ada, of course we can't delete what's existing, and there are many packages which are already in 3 versions (S/WS/WWS).
    imho, it would be consistent to make a 4th version of them for a new UTF_8_String type.

- in a new language close to Ada, it would not necessarily be a good
idea to remove some of them; depending on industrial needs, it may be better to keep them
with us.

    They'd have to have different names (since if you used the same
    names, you change the failures from compile-time to runtime -- or even undetected -- which would be completely against the spirit of Ada), which means that one would have to essentially start over learning and using the resulting language.

    i think i don't understand.

    (and it would make sense to use this point to
    eliminate a lot of the cruft from the Ada design).

    could you give an example of cruft from the Ada design, please? :-)



    (2) One needs to be able to read and write data given whatever encoding the project requires (that's often decided by outside forces, such as other hardware or software that the project needs to interoperate with).

    At a minimum, you
    have to have a way to specify the encoding of files, streams, and hardware interfaces

    That will greatly complicate the interface and
    implementation of the libraries.

I don't think so.
it's a matter for interfacing libraries, for the purpose of communicating
with the outside (not for internal libraries, nor for the choice of the internal type used in the implementation).

Ada.Text_IO.Open.Form already allows (a part of?) this (for the content
of the files, not for their names), see ARM A.10.2 (6-8).
(did I write the reference to the ARM correctly?)




    ... of course, it would be very nice to have a more thicker language with
    a garbage collector ...

    I doubt that you will ever see that in the Ada family,

    as analysis and
    therefore determinism is a very important property for the language.

    I completely agree :-)

    Ada has
    lots of mechanisms for managing storage without directly doing it yourself (by calling Unchecked_Deallocation), yet none of them use any garbage collection in a traditional sense.

    sorry, i meant "garbage collector" in a generic sense, not in a
    traditional sense.
    that is, as Ada users we could program with pointers and pool, without
    memory leaks nor calling Unchecked_Deallocation.

    for example Ada.Containers.Indefinite_Holders.

    i already wrote one for constrained limited types.
    do you know if it's possible to do it for unconstrained limited types,
    like the class of a limited tagged type?

    --
    RAPID maintainer
    http://savannah.nongnu.org/projects/rapid/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Randy Brukardt@21:1/5 to Thomas on Sat Apr 1 05:18:11 2023
I'm not going to answer this point-by-point, as it would take much too long, and there is a similar thread going on the ARG's Github (which needs
my attention more than comp.lang.ada).

But my opinion is that Ada got strings completely wrong, and the best thing
to do with them is to completely nuke them and start over. But one cannot do that in the context of Ada; one would have to at least leave a way to use the
old mechanisms for compatibility with older code. That would leave a hodge-podge of mechanisms that would make Ada very much harder (rather than easier) to use.

As far as the cruft goes, I wrote up a 20+ page document on that during the pandemic, but I could never interest anyone knowledgeable in reviewing it, and
    I don't plan to make it available without that. Most of the things are
    caused by interactions -- mostly because of too much generality. And of
    course there are features that Ada would be better off without (like
    anonymous access types).

    Randy.

    "Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message news:64264e2f$0$25952$426a74cc@news.free.fr...
    In article <t2g0c1$eou$1@dont-email.me>,
    "Randy Brukardt" <randy@rrsoftware.com> wrote:

    "Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message
    news:fantome.forums.tDeContes-5E3B70.20370903042022@news.free.fr...
    ...
    as i said to Vadim Godunko, i need to fill a string type with an UTF-8
    litteral.but i don't think this string type has to manage various
    conversions.

    from my point of view, each library has to accept 1 kind of string type
    (preferably UTF-8 everywhere),
    and then, this library has to make needed conversions regarding the
    underlying API. not the user.

    This certainly is a fine ivory tower solution,

    I like to think from an ivory tower,
    and then look at the reality to see what's possible to do or not. :-)



    but it completely ignores two
    practicalities in the case of Ada:

    (1) You need to replace almost all of the existing Ada language defined
    packages to make this work. Things that are deeply embedded in both
    implementations and programs (like Ada.Exceptions and Ada.Text_IO) would
    have to change substantially. The result would essentially be a different
    language, since the resulting libraries would not work with most existing
    programs.

    - in Ada, of course we can't delete what's existing, and there are many packages which are already in 3 versions (S/WS/WWS).
    imho, it would be consistent to make a 4th version of them for a new UTF_8_String type.

    - in a new language close to Ada, it would not necessarily be a good
    idea to remove some of them, depending on industrial needs, to keep them
    with us.

    They'd have to have different names (since if you used the same
    names, you change the failures from compile-time to runtime -- or even
    undetected -- which would be completely against the spirit of Ada), which
    means that one would have to essentially start over learning and using
    the
    resulting language.

    i think i don't understand.

    (and it would make sense to use this point to
    eliminate a lot of the cruft from the Ada design).

    could you give an example of cruft from the Ada design, please? :-)



    (2) One needs to be able to read and write data given whatever encoding
    the
    project requires (that's often decided by outside forces, such as other
    hardware or software that the project needs to interoperate with).

    At a minimum, you
    have to have a way to specify the encoding of files, streams, and
    hardware
    interfaces

    That will greatly complicate the interface and
    implementation of the libraries.

    i don't think so.
    it's a matter of interfacing libraries, for the purpose of communicating
    with the outside (neither of internal libraries nor of the choice of the internal type for the implementation).

    Ada.Text_IO.Open.Form already allows (a part of?) this (on the content
    of the files, not on their name), see ARM A.10.2 (6-8).
    (write i the reference to ARM correctly?)




    ... of course, it would be very nice to have a more thicker language
    with
    a garbage collector ...

    I doubt that you will ever see that in the Ada family,

    as analysis and
    therefore determinism is a very important property for the language.

    I completely agree :-)

    Ada has
    lots of mechanisms for managing storage without directly doing it
    yourself
    (by calling Unchecked_Deallocation), yet none of them use any garbage
    collection in a traditional sense.

    sorry, i meant "garbage collector" in a generic sense, not in a
    traditional sense.
    that is, as Ada users we could program with pointers and pool, without
    memory leaks nor calling Unchecked_Deallocation.

    for example Ada.Containers.Indefinite_Holders.

    i already wrote one for constrained limited types.
    do you know if it's possible to do it for unconstrained limited types,
    like the class of a limited tagged type?

    --
    RAPID maintainer
    http://savannah.nongnu.org/projects/rapid/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas@21:1/5 to Thomas on Tue Apr 4 02:02:03 2023
    In article
    <fantome.forums.tDeContes-079FD6.18515603042022@news.free.fr>,
    Thomas <fantome.forums.tDeContes@free.fr.invalid> wrote:

    In article <f9d91cb0-c9bb-4d42-a1a9-0cd546da436cn@googlegroups.com>,
    Vadim Godunko <vgodunko@gmail.com> wrote:

    On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote:

    What's the way to manage Unicode correctly ?


    Ada doesn't have good Unicode support. :( So, you need to find suitable set of "workarounds".

    There are few different aspects of Unicode support need to be considered:

    1. Representation of string literals. If you want to use non-ASCII characters
    in source code, you need to use -gnatW8 switch and it will require use of Wide_Wide_String everywhere.
    2. Internal representation during application execution. You are forced to use Wide_Wide_String at previous step, so it will be UCS4/UTF32.

    It is hard to say that it is reasonable set of features for modern world.

    I don't think Ada would be lacking that much, for having good UTF-8
    support.

    the cardinal point is to be able to fill a Ada.Strings.UTF_Encoding.UTF_8_String with a litteral.
    (once you got it, when you'll try to fill a Standard.String with a non-Latin-1 character, it'll make an error, i think it's fine :-) )

    does Ada 202x allow it ?


    hi !

    I think I found a quite nice solution!
    (reading <t3lj44$fh5$1@dont-email.me> again)
    (not tested yet)


it's not perfect according to the rules of the art,
    but it is:

    - Ada 2012 compatible
    - better than writing UTF-8 Ada code and then telling gnat it is Latin-1
    (in this way it would take UTF_8_String for what it is:
    an array of octets, but it would not detect an invalid UTF-8 string,
    and if someone tells it's really UTF-8 all goes wrong)
    - better than being limited to ASCII in string literals
- never need to explicitly declare Wide_Wide_String:
it's always implicit, for a very short time,
and AFAIK eligible for optimization



with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

package UTF_Encoding is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   --  A renaming of Encode is not legal here (Encode has a second,
   --  defaulted Output_BOM parameter), so an expression function is
   --  used instead.
   function "+" (A : in Wide_Wide_String) return UTF_8_String is
     (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (A));

end UTF_Encoding;


    then we can do:


with UTF_Encoding;

package User is

   use UTF_Encoding;

   My_String : UTF_8_String := + "Greek characters + smileys";

end User;


    if you want to avoid "use UTF_Encoding;",
    i think "use type UTF_Encoding.UTF_8_String;" doesn't work,
    but this should work:


with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

package UTF_Encoding is

   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   type Literals_For_UTF_8_String is new Wide_Wide_String;

   --  Again an expression function rather than a renaming, with a
   --  conversion back to Wide_Wide_String for Encode.
   function "+" (A : in Literals_For_UTF_8_String) return UTF_8_String is
     (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Wide_Wide_String (A)));

end UTF_Encoding;


with UTF_Encoding;

package User is

   use type UTF_Encoding.Literals_For_UTF_8_String;

   My_String : UTF_Encoding.UTF_8_String
     := + "Greek characters + smileys";

end User;



    what do you think about that ? good idea or not ? :-)

    --
    RAPID maintainer
    http://savannah.nongnu.org/projects/rapid/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)