As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.
Now, Unicode files usually are in UTF-8.
One solution is to read the entire file in one gulp into a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks, e.g. real-time processing of lines of text.
If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.
So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?
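A minimal sketch of one way to do this with only the standard library, assuming Ada 2012: read the lead byte from a stream, work out the sequence length from it, read the remaining bytes, and hand the 1 to 4 byte slice to Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode. The file name "input.txt" and the names Count_Code_Points, Sequence_Length and Next_Code_Point are made up for illustration. A leading BOM and truncated input are not handled.

```ada
with Ada.Streams.Stream_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Count_Code_Points is
   use Ada.Streams.Stream_IO;
   package UTF renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Number of bytes in the UTF-8 sequence introduced by Lead.
   function Sequence_Length (Lead : Character) return Positive is
      B : constant Natural := Character'Pos (Lead);
   begin
      if B < 16#80# then
         return 1;
      elsif B in 16#C0# .. 16#DF# then
         return 2;
      elsif B in 16#E0# .. 16#EF# then
         return 3;
      elsif B in 16#F0# .. 16#F7# then
         return 4;
      else
         --  Continuation byte or invalid lead byte.
         raise Ada.Strings.UTF_Encoding.Encoding_Error;
      end if;
   end Sequence_Length;

   --  Read exactly one code point from the stream S.
   --  Assumption: the stream is positioned at the start of a sequence
   --  and does not begin with a BOM (Decode would strip it and return
   --  an empty string).
   function Next_Code_Point (S : Stream_Access) return Wide_Wide_Character is
      Buffer : String (1 .. 4);
      Length : Positive;
   begin
      Character'Read (S, Buffer (1));
      Length := Sequence_Length (Buffer (1));
      for I in 2 .. Length loop
         Character'Read (S, Buffer (I));
      end loop;
      declare
         Decoded : constant Wide_Wide_String :=
           UTF.Decode (Buffer (1 .. Length));
      begin
         return Decoded (Decoded'First);
      end;
   end Next_Code_Point;

   File  : File_Type;
   S     : Stream_Access;
   C     : Wide_Wide_Character;
   Count : Natural := 0;
begin
   Open (File, In_File, "input.txt");  --  hypothetical file name
   S := Stream (File);
   while not End_Of_File (File) loop
      C := Next_Code_Point (S);
      if C /= Wide_Wide_Character'Val (16#0A#) then
         Count := Count + 1;  --  count everything except line feeds
      end if;
   end loop;
   Close (File);
   Ada.Text_IO.Put_Line ("code points:" & Natural'Image (Count));
end Count_Code_Points;
```

This keeps memory use constant regardless of file size and does not depend on the text having lines. Note that it yields code points, not user-perceived characters.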
Great libraries, thanks.
It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.
if C = '±' then ...
And Wide_Wide_Character'Pos should give the codepoint.
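A small sketch of that idea. To avoid depending on how the compiler is told to read the source file encoding (with GNAT a literal like '±' typically needs a switch such as -gnatW8), the constant here is built with 'Val instead of a literal. The names Code_Point_Demo and Plus_Minus are made up for illustration.

```ada
with Ada.Text_IO;

procedure Code_Point_Demo is
   --  U+00B1 PLUS-MINUS SIGN, written via 'Val so the example does not
   --  depend on the source-file encoding setting.
   Plus_Minus : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#B1#);

   C : Wide_Wide_Character := Plus_Minus;
begin
   if C = Plus_Minus then
      --  'Pos of a Wide_Wide_Character is its code point.
      Ada.Text_IO.Put_Line
        ("code point:" & Integer'Image (Wide_Wide_Character'Pos (C)));
   end if;
end Code_Point_Demo;
```

With the value above, the program prints the decimal code point 177, i.e. U+00B1.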
Characters no longer exist as a thing, as one can even be represented as multiple UTF-32 code points.
You're alluding to combining characters?
$ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
gcc -c páck3.ads
páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"
The reason for this apparently-bizarre message is that macOS takes the composed form (lowercase a acute) and converts it under the hood to
what HFS+ insists on, the fully decomposed form (lowercase a,
combining acute); thus the names are actually different even though
they _look_ the same.
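The same effect can be reproduced inside a program. A minimal sketch, with made-up names, comparing the composed form (U+00E1) with the decomposed form (U+0061 followed by U+0301 COMBINING ACUTE ACCENT):

```ada
with Ada.Text_IO;

procedure Composed_Vs_Decomposed is
   --  Lowercase a acute as a single code point.
   Composed : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#E1#));

   --  Lowercase a followed by a combining acute accent.
   Decomposed : constant Wide_Wide_String :=
     ('a', Wide_Wide_Character'Val (16#301#));
begin
   --  Predefined "=" compares code point by code point, so the two
   --  forms are unequal even though they render the same.
   Ada.Text_IO.Put_Line
     ("equal: " & Boolean'Image (Composed = Decomposed));
   Ada.Text_IO.Put_Line
     ("lengths:" & Natural'Image (Composed'Length)
      & Natural'Image (Decomposed'Length));
end Composed_Vs_Decomposed;
```

The comparison yields FALSE and the lengths are 1 and 2, which is the mismatch behind the file name warning. Making the two compare equal needs Unicode normalization, which the standard Ada library does not provide as far as I know.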
I'm worried. I need the concept of character, for proper text processing.
For example, I want to reference characters in a text file by their position.
(For me, a combining character is not a character, the combination is. Unicode agrees, right?)
Simply ignore or reject decomposed characters.
1. Fixed font representation. Reduce everything to normal glyphs, use the string position corresponding to the beginning of a UTF-8 sequence.
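A sketch of what "position corresponding to the beginning of a UTF-8 sequence" can look like in code, assuming the text is held in a plain String of UTF-8 bytes. A byte starts a sequence unless it is a continuation byte (bit pattern 10xxxxxx, i.e. 16#80# .. 16#BF#). The names UTF8_Positions and Is_Sequence_Start are made up for illustration.

```ada
with Ada.Text_IO;

procedure UTF8_Positions is
   --  "a±b" encoded in UTF-8: U+00B1 takes two bytes (C2 B1).
   Text : constant String :=
     'a' & Character'Val (16#C2#) & Character'Val (16#B1#) & 'b';

   --  True unless C is a continuation byte (10xxxxxx).
   function Is_Sequence_Start (C : Character) return Boolean is
     (Character'Pos (C) not in 16#80# .. 16#BF#);
begin
   for I in Text'Range loop
      if Is_Sequence_Start (Text (I)) then
         Ada.Text_IO.Put_Line
           ("sequence starts at byte" & Integer'Image (I));
      end if;
   end loop;
end UTF8_Positions;
```

For this input the sequence starts are bytes 1, 2 and 4. This does not validate the encoding, and it does not detect combining characters; rejecting those would need an additional check on the decoded code points.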
I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.
(For me, a combining character is not a character, the combination is. Unicode agrees, right?)
On 2021-11-16 12:55, Marius Amado-Alves wrote:
I'm worried. I need the concept of character, for proper text processing.
Simply ignore or reject decomposed characters.