On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote:
What's the way to manage Unicode correctly ?
Ada doesn't have good Unicode support. :( So you need to find a suitable set of "workarounds".
There are a few different aspects of Unicode support that need to be considered:
1. Representation of string literals. If you want to use non-ASCII characters in source code, you need to use the -gnatW8 switch, and it will require use of Wide_Wide_String everywhere.
2. Internal representation during application execution. You are forced to use Wide_Wide_String at the previous step, so it will be UCS-4/UTF-32.
It is hard to say that this is a reasonable set of features for the modern world.
To fix some of the drawbacks of the current situation, we are developing a new text processing library, known as VSS.
https://github.com/AdaCore/VSS
On 19/04/2021 at 15:00, Luke A. Guest wrote:
They're different types and should be incompatible, because, well, they are. What does Ada have that allows for this that other languages
don't? Oh yeah! Types!
They are not so different. For example, you may read the first line of a
file in a string, then discover that it starts with a BOM, and thus
decide it is UTF-8.
BTW, the very first version of this AI had different types, but the ARG
felt that it would just complicate the interface for the sake of abusive "purity".
On 2021-04-19 15:15, Luke A. Guest wrote:
On 19/04/2021 14:10, Dmitry A. Kazakov wrote:
They're different types and should be incompatible, because, well, they are. What does Ada have that allows for this that other languages
don't? Oh yeah! Types!
They are subtypes, differently constrained, like Positive and Integer.
No they're not. They're subtypes only and therefore compatible. The UTF string isn't constrained in any other ways.
Of course it is. There could be string encodings that have no Unicode counterparts and thus missing in UTF-8/16.
The operations are the same; the values are differently constrained. It does not
make sense to consider ASCII 'a', Latin-1 'a', and UTF-8 'a' different. It
is the same glyph, differently encoded. Encoding is a representation
aspect, ergo out of the interface!
"Luke A. Guest" <laguest@archeia.com> wrote in message news:s5jute$1s08$1@gioia.aioe.org...
On 19/04/2021 13:52, Dmitry A. Kazakov wrote:
It is a practical solution. The Ada type system cannot express differently represented/constrained string/array/vector subtypes. Ignoring Latin-1 and using String as if it were an array of octets is the best available solution.
They're different types and should be incompatible, because, well, they are. What does Ada have that allows for this that other languages don't? Oh yeah! Types!
If they're incompatible, you need an automatic way to convert between representations, since these are all views of the same thing (an abstract string type). You really don't want 35 versions of Open each taking a different string type.
It's the fact that Ada can't do this that makes Unbounded_Strings unusable (well, barely usable).
Ada 202x fixes the literal problem at least, but we'd have to completely abandon Unbounded_Strings and use a different library design in order for it to allow literals. And if you're going to do that, you might as well do something about UTF-8 as well -- but now you're going to need even more conversions. Yuck.
I think the only true solution here would be based on a proper abstract Root_String type. But that wouldn't work in Ada, since it would be incompatible with all of the existing code out there. Probably would have to wait for a follow-on language.
But don't use unit names containing international characters, at any
rate if you're (interested in compiling on) Windows or macOS:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114
and this kind of problems would be easier to avoid if string types were stronger ...
In article <lyfszm5xv2.fsf@pushface.org>,
Simon Wright <simon@pushface.org> wrote:
But don't use unit names containing international characters, at any
rate if you're (interested in compiling on) Windows or macOS:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114
if i understand, Eric Botcazou is a gnu admin who decided to reject
your bug? i find him very "low portability thinking"!
On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote:
But don't use unit names containing international characters, at
any rate if you're (interested in compiling on) Windows or macOS:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114
and this kind of problems would be easier to avoid if string types
were stronger ...
Your suggestion is unable to resolve this issue on Mac OS X. As with case sensitivity, a binary comparison of two strings can't correctly compare strings in different normalization forms. The right solution is to use the right type to represent paths, and even that doesn't resolve some issues, like relative paths and changes of rules at mount points.
I think that's a macOS problem that Apple aren't going to resolve* any
time soon! While banging my head against PR81114 recently, I found
(can't remember where) that (lower case a acute) and (lower case a,
combining acute) represent the same concept and it's up to
tools/operating systems etc to recognise that.
* I don't know how/whether clang addresses this.
as i said to Vadim Godunko, i need to fill a string type with a UTF-8 literal. but i don't think this string type has to manage various conversions.
from my point of view, each library has to accept 1 kind of string type (preferably UTF-8 everywhere),
and then, this library has to make needed conversions regarding the underlying API. not the user.
... of course, it would be very nice to have a more thicker language with
a garbage collector ...
They are not so different. For example, you may read the first line of a
file in a string, then discover that it starts with a BOM, and thus
decide it is UTF-8.
could you give me an example of something that you can do now, and that you could
not do if UTF_8_String was private, please?
(to discover that it starts with a BOM, you must look at it.)
Just what I said above, since a BOM is not a valid UTF-8 (otherwise, it
BTW, the very first version of this AI had different types, but the ARG
felt that it would just complicate the interface for the sake of abusive
"purity".
could you explain "abusive purity" please?
It was felt that in practice, being too strict in separating the types
would make things more difficult, without any practical gain. This has
been discussed - you may not agree with the outcome, but it was not made
out of pure laziness.
If you had an Ada-like language that used a universal UTF-8 string internally, you then would have a lot of old and mostly useless
operations supported for array types (since things like slices are
mainly useful for string operations).
"Randy Brukardt" <randy@rrsoftware.com> writes:
If you had an Ada-like language that used a universal UTF-8 string
internally, you then would have a lot of old and mostly useless
operations supported for array types (since things like slices are
mainly useful for string operations).
Just off the top of my head, wouldn't it be better to use UTF32-encoded Wide_Wide_Character internally?
On 2022-04-08 10:56, Simon Wright wrote:
"Randy Brukardt" <randy@rrsoftware.com> writes:
If you had an Ada-like language that used a universal UTF-8 string
internally, you then would have a lot of old and mostly useless
operations supported for array types (since things like slices are
mainly useful for string operations).
Just off the top of my head, wouldn't it be better to use
UTF32-encoded Wide_Wide_Character internally?
Yep, that is exactly the problem, a confusion between interface
and implementation.
Encoding /= interface, e.g. an interface of a string viewed as an
array of characters. That interface is just the same for ASCII, Latin-1,
EBCDIC, RADIX50, UTF-8 etc. strings. Why do you care what is inside?
On Monday, April 4, 2022 at 5:19:20 PM UTC+3, Simon Wright wrote:
I think that's a macOS problem that Apple aren't going to resolve* any
time soon! While banging my head against PR81114 recently, I found
(can't remember where) that (lower case a acute) and (lower case a,
combining acute) represent the same concept and it's up to
tools/operating systems etc to recognise that.
And will not. It is the application's responsibility to convert file names to NFD before passing them to the OS. Also, the application must compare paths only after conversion to NFD; this is important for handling the more complicated cases where canonical reordering is applied.
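To make the normalization point concrete, here is a minimal sketch in Python (not Ada; it is used only because its standard `unicodedata` module exposes the Unicode normalization forms directly). The helper name `same_name` is hypothetical, and the code says nothing about what any particular file system does; it only shows why a binary comparison fails and why comparing after NFD also covers the canonical reordering case.

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT


def same_name(left: str, right: str) -> bool:
    """Compare two names after converting both to NFD (hypothetical helper)."""
    return unicodedata.normalize("NFD", left) == unicodedata.normalize("NFD", right)


# A plain (binary) comparison sees two different code point sequences.
binary_equal = precomposed == decomposed

# After normalization both spell the same canonical sequence.
normalized_equal = same_name(precomposed, decomposed)

# Canonical reordering: the same two combining marks typed in either order.
dot_below_then_acute = "a\u0323\u0301"
acute_then_dot_below = "a\u0301\u0323"
reordered_equal = same_name(dot_below_then_acute, acute_then_dot_below)

print(binary_equal, normalized_equal, reordered_equal)
```

The last case is the one a naive "decompose and compare" routine tends to miss: the two inputs are already fully decomposed, yet they differ until the combining marks are put into canonical order.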
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
On 2022-04-08 10:56, Simon Wright wrote:
"Randy Brukardt" <randy@rrsoftware.com> writes:
If you had an Ada-like language that used a universal UTF-8 string
internally, you then would have a lot of old and mostly useless
operations supported for array types (since things like slices are
mainly useful for string operations).
Just off the top of my head, wouldn't it be better to use
UTF32-encoded Wide_Wide_Character internally?
Yep, that is exactly the problem, a confusion between interface
and implementation.
Don't understand. My point was that *when you are implementing this* it
might be easier to deal with 32-bit characters/code points/whatever the
proper jargon is than with UTF8.
Encoding /= interface, e.g. an interface of a string viewed as an
array of characters. That interface is just the same for ASCII, Latin-1,
EBCDIC, RADIX50, UTF-8 etc. strings. Why do you care what is inside?
With a user's hat on, I don't. Implementers might have a different point
of view.
On 2022-04-08 21:19, Simon Wright wrote:
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
On 2022-04-08 10:56, Simon Wright wrote:
"Randy Brukardt" <randy@rrsoftware.com> writes:
If you had an Ada-like language that used a universal UTF-8 string
internally, you then would have a lot of old and mostly useless
operations supported for array types (since things like slices are
mainly useful for string operations).
Just off the top of my head, wouldn't it be better to use
UTF32-encoded Wide_Wide_Character internally?
Yep, that is exactly the problem, a confusion between interface
and implementation.
Don't understand. My point was that *when you are implementing this* it
might be easier to deal with 32-bit characters/code points/whatever the
proper jargon is than with UTF8.
I think it would be more difficult, because you will have to convert from and to UTF-8 under the hood or explicitly. UTF-8 is the de facto interface standard and I/O standard. That would be 60-70% of all cases where you need a string. Most string operations like search, comparison, and slicing are isomorphic between code points and octets. So you would win nothing from keeping strings internally as arrays of code points.
The situation is comparable to Unbounded_Strings. The implementation is relatively simple, but the user must carry the burden of calling To_String and To_Unbounded_String all over the application and the processor must suffer the overhead of copying arrays here and there.
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message news:t2q3cb$bbt$1@gioia.aioe.org...
On 2022-04-08 21:19, Simon Wright wrote:
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
On 2022-04-08 10:56, Simon Wright wrote:
"Randy Brukardt" <randy@rrsoftware.com> writes:
If you had an Ada-like language that used a universal UTF-8 string
internally, you then would have a lot of old and mostly useless
operations supported for array types (since things like slices are
mainly useful for string operations).
Just off the top of my head, wouldn't it be better to use
UTF32-encoded Wide_Wide_Character internally?
Yep, that is exactly the problem, a confusion between interface
and implementation.
Don't understand. My point was that *when you are implementing this* it
might be easier to deal with 32-bit characters/code points/whatever the
proper jargon is than with UTF8.
I think it would be more difficult, because you will have to convert from
and to UTF-8 under the hood or explicitly. UTF-8 is de-facto interface
standard and I/O standard. That would be 60-70% of all cases you need a
string. Most string operations like search, comparison, slicing are
isomorphic between code points and octets. So you would win nothing from
keeping strings internally as arrays of code points.
I basically agree with Dmitry here. The internal representation is an implementation detail, but it seems likely that you would want to store
UTF-8 strings directly; they're almost always going to be half the size
(even for languages using their own characters like Greek) and for most of us, they'll be just a bit more than a quarter the size. The amount of bytes you copy around matters; the number of operations where code points are needed is fairly small.
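The size argument is easy to check. The following is a small Python sketch (Python rather than Ada, purely for illustration; `encoded_sizes` is a hypothetical helper and the sample words are arbitrary) that counts the octets needed for the same text in UTF-8 and in UTF-32 without a BOM.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 octets, UTF-32 octets) for text, both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))


ascii_only = "Arial"        # every character is one octet in UTF-8
mostly_ascii = "caf\u00e9"  # one accented letter takes two octets
greek = "λογος"             # unaccented Greek letters take two octets each

for sample in (ascii_only, mostly_ascii, greek):
    utf8, utf32 = encoded_sizes(sample)
    print(f"{sample!r}: UTF-8 {utf8} octets, UTF-32 {utf32} octets")
```

For these samples the ASCII word is a quarter of its UTF-32 size, the mostly-ASCII word is slightly more than a quarter, and the Greek word is half. Scripts whose characters need three UTF-8 octets would give a ratio of three quarters, so the "half the size" figure is a rule of thumb rather than a bound.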
The main problem with UTF-8 is representing the code point positions in a way that they (a) aren't abused and (b) don't cost too much to calculate. Just using character indexes is too expensive for UTF-8 and UTF-16 representations, and using octet indexes is unsafe (since splitting a character representation is a possibility). I'd probably use an abstract character position type that was implemented with an octet index under the covers.
I think that would work OK as doing math on those is suspicious with a UTF representation. We're spoiled from using Latin-1 representations, of course, but generally one is interested in 5 characters, not 5 octets. And the number of octets in 5 characters depends on the string. So consider the sort of operation that I tend to do (for instance, from some code I was fixing earlier today):
if Font'Length > 6 and then
Font(2..6) = "Arial" then
This would be a bad idea if one is using any sort of universal representation -- you don't know how many octets are in the string literal, so you can't assume a number in the test string. So the slice is dangerous (even though in this particular case it would be OK since the test string is all ASCII characters -- but I wouldn't want users to get in the habit of assuming such things).
[BTW, the above was a bad idea anyway, because it turns out that the
function in the Ada library returned bounds that don't start at 1. So the slice was usually out of range -- which is why I was looking at the code. Another thing that we could do without. Slices are evil, since they *seem*
to be the right solution, yet rarely are in practice without a lot of
hoops.]
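The hazard with octet indexes, and the idea of a character position that is an octet index under the covers, can be sketched as follows. This is Python, not Ada, and it is only an illustration under stated assumptions: `advance` is a hypothetical helper, the input is assumed to be valid UTF-8, and the starting position is assumed to sit on a character boundary with enough characters remaining.

```python
name = "\u00e9Arial"            # 'é' followed by "Arial"
octets = name.encode("utf-8")   # 'é' occupies two octets in UTF-8

by_code_point = name[1:6]       # characters 2..6 in 1-based terms
by_octet = octets[1:6]          # octets 2..6 start in the middle of 'é'

try:
    by_octet.decode("utf-8")
    split_detected = False
except UnicodeDecodeError:
    # The slice began on a continuation octet, so it is not valid UTF-8.
    split_detected = True


def advance(data: bytes, pos: int, count: int) -> int:
    """Move an octet index forward by `count` characters (hypothetical helper).

    Assumes `data` is valid UTF-8 and `pos` is on a character boundary.
    """
    for _ in range(count):
        pos += 1
        # Skip continuation octets, which have the bit pattern 10xxxxxx.
        while pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos += 1
    return pos


start = advance(octets, 0, 1)    # position of the second character
stop = advance(octets, start, 5) # five characters further on
print(by_code_point, split_detected, octets[start:stop])
```

A position obtained only through a helper like `advance` always lands on a character boundary, which is the property an opaque position type would enforce; arithmetic on raw octet indexes gives no such guarantee.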
The situation is comparable to Unbounded_Strings. The implementation is
relatively simple, but the user must carry the burden of calling To_String
and To_Unbounded_String all over the application and the processor must
suffer the overhead of copying arrays here and there.
Yes, but that happens because Ada doesn't really have a string abstraction, so when you try to build one, you can't fully do the job. One presumes that
a new language with a universal UTF-8 string wouldn't have that problem. (As previously noted, I don't see much point in trying to patch up Ada with a bunch of UTF-8 string packages; you would need an entire new set of Ada.Strings libraries and I/O libraries, and then you'd have all of the old stuff messing up resolution, using the best names, and confusing everything. A cleaner slate is needed.)
Randy.
In Python-3, a string is a character(glyph ?) array. The internal representation is hidden to the programmer.
On the Ada side, I've still not understood how to correctly deal with
all this stuff.
On Sat, 9 Apr 2022 12:27:04 +0200, DrPi <314@drpi.fr> declaimed the following:
In Python-3, a string is a character(glyph ?) array. The internal
representation is hidden to the programmer.
<SNIP>
On the Ada side, I've still not understood how to correctly deal with
all this stuff.
One thing to take into account is that Python strings are immutable. Changing the contents of a string requires constructing a new string from parts that incorporate the change.
That allows for the second aspect -- even if not visible to a programmer, Python (3) strings are not a fixed representation: If all characters in the string fit in the 8-bit UTF range, that string is stored using one byte per character. If any character uses a 16-bit UTF representation, the entire string is stored as 16-bit characters (and
similar for 32-bit UTF points). Thus, indexing into the string is still
fast -- just needing to scale the index by the character width of the
entire string.
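The per-string storage width can be observed from within Python itself. The sketch below is specific to CPython 3.3 or later, where (per PEP 393) the narrowest form covers code points up to U+00FF, i.e. the Latin-1 range; `sys.getsizeof` is an implementation detail and may not be available on other Python implementations. The helper name `octets_per_character` is hypothetical.

```python
import sys


def octets_per_character(sample: str, count: int = 1000) -> float:
    """Estimate storage per character from how sys.getsizeof grows.

    Comparing two lengths cancels out the fixed object header.
    """
    short = sys.getsizeof(sample * count)
    longer = sys.getsizeof(sample * (2 * count))
    return (longer - short) / count


latin1_width = octets_per_character("\u00e9")     # fits in one octet
bmp_width = octets_per_character("\u20ac")        # euro sign, needs two
astral_width = octets_per_character("\U0001F600") # outside the BMP, needs four

# Indexing is by code point regardless of the storage width chosen.
mixed = "a\U0001F600b"
middle = mixed[1]

print(latin1_width, bmp_width, astral_width, len(mixed))
```

Because the width is fixed per string and strings are immutable, indexing stays a constant-time scaled lookup, at the cost of widening the whole string when a single wide character is present.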
On Saturday, April 9, 2022 at 1:27:08 PM UTC+3, DrPi wrote:
On the Ada side, I've still not understood how to correctly deal with
all this stuff.
Take a look at https://github.com/AdaCore/VSS
The ideas behind this library are close to the ideas of type separation in Python 3. A string is a Virtual_String; a byte sequence is a Stream_Element_Vector. If you need to convert a byte stream to a string or back, use Virtual_String_Encoder/Virtual_String_Decoder.
I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and programming languages; cleaner types and APIs are a requirement now. The only case where the old character/string types really have value is low-resource embedded systems; in other cases their use generates a lot of hidden issues, which are very hard to detect.
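For readers who have not used it, the Python 3 separation referred to here looks roughly like the sketch below. It shows only the Python side (`bytes` versus `str` with explicit `decode`/`encode` steps); it is not VSS code and makes no claim about the VSS API beyond what the post says.

```python
raw = b"caf\xc3\xa9"              # octets as read from a file or socket
text = raw.decode("utf-8")        # explicit decoding step: bytes -> str
round_trip = text.encode("utf-8") # explicit encoding step: str -> bytes

try:
    text + raw                    # mixing the two types is rejected
    mixing_rejected = False
except TypeError:
    mixing_rejected = True

print(len(raw), len(text), round_trip == raw, mixing_rejected)
```

The point of the design is that the encoding is named exactly once, at the boundary, and the type system keeps decoded text and raw octets from being confused afterwards.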
DrPi <314@drpi.fr> writes:
Any way to use source code encoded in UTF-8 ?
from the gnat user guide, 4.3.1 Alphabetical List of All Switches:
`-gnati`c''
Identifier character set (`c' = 1/2/3/4/8/9/p/f/n/w). For details
of the possible selections for `c', see *note Character Set
Control: 4e.
This applies to identifiers in the source code
`-gnatW`e''
Wide character encoding method (`e'=n/h/u/s/e/8).
This applies to string and character literals.
"Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message news:fantome.forums.tDeContes-5E3B70.20370903042022@news.free.fr...
...
as i said to Vadim Godunko, i need to fill a string type with a UTF-8 literal. but i don't think this string type has to manage various conversions.
from my point of view, each library has to accept 1 kind of string type (preferably UTF-8 everywhere),
and then, this library has to make needed conversions regarding the underlying API. not the user.
This certainly is a fine ivory tower solution, but it completely ignores two practicalities in the case of Ada:
(1) You need to replace almost all of the existing Ada language defined packages to make this work. Things that are deeply embedded in both implementations and programs (like Ada.Exceptions and Ada.Text_IO) would have to change substantially. The result would essentially be a different language, since the resulting libraries would not work with most existing programs. They'd have to have different names (since if you used the same names, you change the failures from compile-time to runtime -- or even undetected -- which would be completely against the spirit of Ada), which means that one would have to essentially start over learning and using the resulting language. (And it would make sense to use this point to eliminate a lot of the cruft from the Ada design.)
(2) One needs to be able to read and write data given whatever encoding the project requires (that's often decided by outside forces, such as other hardware or software that the project needs to interoperate with). At a minimum, you have to have a way to specify the encoding of files, streams, and hardware interfaces. That will greatly complicate the interface and implementation of the libraries.
... of course, it would be very nice to have a more thicker language with
a garbage collector ...
I doubt that you will ever see that in the Ada family, as analysis and therefore determinism is a very important property for the language. Ada has lots of mechanisms for managing storage without directly doing it yourself (by calling Unchecked_Deallocation), yet none of them use any garbage collection in a traditional sense.
In article <t2g0c1$eou$1@dont-email.me>,
"Randy Brukardt" <randy@rrsoftware.com> wrote:
"Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message
news:fantome.forums.tDeContes-5E3B70.20370903042022@news.free.fr...
...
as i said to Vadim Godunko, i need to fill a string type with a UTF-8
literal. but i don't think this string type has to manage various
conversions.
from my point of view, each library has to accept 1 kind of string type
(preferably UTF-8 everywhere),
and then, this library has to make needed conversions regarding the
underlying API. not the user.
This certainly is a fine ivory tower solution,
I like to think from an ivory tower,
and then look at the reality to see what's possible to do or not. :-)
but it completely ignores two
practicalities in the case of Ada:
(1) You need to replace almost all of the existing Ada language defined
packages to make this work. Things that are deeply embedded in both
implementations and programs (like Ada.Exceptions and Ada.Text_IO) would
have to change substantially. The result would essentially be a different
language, since the resulting libraries would not work with most existing
programs.
- in Ada, of course we can't delete what's existing, and there are many packages which are already in 3 versions (S/WS/WWS).
imho, it would be consistent to make a 4th version of them for a new UTF_8_String type.
- in a new language close to Ada, it would not necessarily be a good
idea to remove some of them, depending on industrial needs, to keep them
with us.
They'd have to have different names (since if you used the same
names, you change the failures from compile-time to runtime -- or even
undetected -- which would be completely against the spirit of Ada), which
means that one would have to essentially start over learning and using
the resulting language.
i think i don't understand.
(and it would make sense to use this point to
eliminate a lot of the cruft from the Ada design).
could you give an example of cruft from the Ada design, please? :-)
(2) One needs to be able to read and write data given whatever encoding
the project requires (that's often decided by outside forces, such as other
hardware or software that the project needs to interoperate with).
At a minimum, you have to have a way to specify the encoding of files,
streams, and hardware interfaces. That will greatly complicate the
interface and implementation of the libraries.
i don't think so.
it's a matter of interfacing libraries, for the purpose of communicating
with the outside (neither of internal libraries nor of the choice of the internal type for the implementation).
Ada.Text_IO.Open.Form already allows (a part of?) this (for the content
of the files, not for their names), see ARM A.10.2 (6-8).
(did i write the reference to the ARM correctly?)
... of course, it would be very nice to have a more thicker language
with a garbage collector ...
I doubt that you will ever see that in the Ada family,
as analysis and
therefore determinism is a very important property for the language.
I completely agree :-)
Ada has lots of mechanisms for managing storage without directly doing it
yourself (by calling Unchecked_Deallocation), yet none of them use any
garbage collection in a traditional sense.
sorry, i meant "garbage collector" in a generic sense, not in a
traditional sense.
that is, as Ada users we could program with pointers and pool, without
memory leaks nor calling Unchecked_Deallocation.
for example Ada.Containers.Indefinite_Holders.
i already wrote one for constrained limited types.
do you know if it's possible to do it for unconstrained limited types,
like the class of a limited tagged type?
--
RAPID maintainer
http://savannah.nongnu.org/projects/rapid/
In article <f9d91cb0-c9bb-4d42-a1a9-0cd546da436cn@googlegroups.com>,
Vadim Godunko <vgodunko@gmail.com> wrote:
On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote:
What's the way to manage Unicode correctly ?
Ada doesn't have good Unicode support. :( So you need to find a suitable set of "workarounds".
There are a few different aspects of Unicode support that need to be considered:
1. Representation of string literals. If you want to use non-ASCII characters in source code, you need to use the -gnatW8 switch, and it will require use of Wide_Wide_String everywhere.
2. Internal representation during application execution. You are forced to use Wide_Wide_String at the previous step, so it will be UCS-4/UTF-32.
It is hard to say that this is a reasonable set of features for the modern world.
I don't think Ada would be lacking that much, for having good UTF-8
support.
the cardinal point is to be able to fill an Ada.Strings.UTF_Encoding.UTF_8_String with a literal.
(once you got it, when you'll try to fill a Standard.String with a non-Latin-1 character, it'll make an error, i think it's fine :-) )
does Ada 202x allow it ?