• How to read in a (long) UTF-8 file, incrementally?

    From Marius Amado-Alves@21:1/5 to All on Tue Nov 2 10:42:37 2021
    As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.

    Now, Unicode files usually are in UTF-8.

    One solution is to read the entire file in one gulp into a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks, e.g. real-time processing of lines of text.

    If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.

    So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?

    Thanks a lot.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dmitry A. Kazakov@21:1/5 to Marius Amado-Alves on Tue Nov 2 19:17:58 2021
    On 2021-11-02 18:42, Marius Amado-Alves wrote:

    > So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?

    You simply read a stream of Characters into a buffer. Never ever use
    Wide or Wide_Wide, they are useless. Inside the buffer you must have 4
    Characters ahead (the length of the longest UTF-8 sequence) unless the
    end of the file is reached. Usually you read until some separator like
    a line end.

    Then you call this:

    http://www.dmitry-kazakov.de/ada/strings_edit.htm#Strings_Edit.UTF8.Get

    That will give you a code point and advance the cursor to the next UTF-8 character.
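
    As a rough illustration of the read-and-decode idea described above, here is a minimal sketch that uses only Ada.Streams.Stream_IO instead of Strings_Edit.UTF8.Get. The names Read_UTF8_Incrementally, Next_Byte and Next_Code_Point, and the file name input.txt, are made up for the example. For simplicity it reads one byte at a time (a real program would probably read larger blocks into a buffer), and it does not reject overlong encodings or surrogates.

    ```ada
    with Ada.Streams.Stream_IO;
    with Ada.Text_IO;

    procedure Read_UTF8_Incrementally is
       use Ada.Streams;

       --  Wide enough for any value a 4-byte sequence can encode.
       type Code_Point is range 0 .. 16#1F_FFFF#;

       File : Stream_IO.File_Type;

       --  Read a single byte; the caller must check End_Of_File first.
       function Next_Byte return Code_Point is
          Buffer : Stream_Element_Array (1 .. 1);
          Last   : Stream_Element_Offset;
       begin
          Stream_IO.Read (File, Buffer, Last);
          return Code_Point (Buffer (1));
       end Next_Byte;

       --  Decode the next UTF-8 sequence (1 to 4 bytes) into a code point.
       function Next_Code_Point return Code_Point is
          Value : Code_Point := Next_Byte;
          Extra : Natural;  --  number of continuation bytes expected
       begin
          if Value < 16#80# then
             return Value;                 --  plain ASCII, one byte
          elsif Value in 16#C2# .. 16#DF# then
             Extra := 1;
             Value := Value - 16#C0#;
          elsif Value in 16#E0# .. 16#EF# then
             Extra := 2;
             Value := Value - 16#E0#;
          elsif Value in 16#F0# .. 16#F4# then
             Extra := 3;
             Value := Value - 16#F0#;
          else
             raise Constraint_Error with "invalid UTF-8 lead byte";
          end if;

          for I in 1 .. Extra loop
             if Stream_IO.End_Of_File (File) then
                raise Constraint_Error with "truncated UTF-8 sequence";
             end if;
             declare
                Byte : constant Code_Point := Next_Byte;
             begin
                if Byte not in 16#80# .. 16#BF# then
                   raise Constraint_Error with "invalid continuation byte";
                end if;
                --  Each continuation byte contributes its low 6 bits.
                Value := Value * 64 + (Byte - 16#80#);
             end;
          end loop;
          return Value;
       end Next_Code_Point;

    begin
       --  "input.txt" is a placeholder file name for this sketch.
       Stream_IO.Open (File, Stream_IO.In_File, "input.txt");
       while not Stream_IO.End_Of_File (File) loop
          --  Print each code point as a decimal number, one per line.
          Ada.Text_IO.Put_Line (Code_Point'Image (Next_Code_Point));
       end loop;
       Stream_IO.Close (File);
    end Read_UTF8_Incrementally;
    ```

    Memory use stays constant regardless of file size, because only the bytes of the current sequence are held at any time.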

    However, normally, no text processing task needs that. Whatever you want
    to do, you can accomplish it using normal String operations and normal
    String-based data structures like maps and tables. You never need to care
    about UTF-8 character boundaries.

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

  • From Vadim Godunko@21:1/5 to amado...@gmail.com on Wed Nov 3 00:43:02 2021
    On Tuesday, November 2, 2021 at 8:42:38 PM UTC+3, amado...@gmail.com wrote:
    > As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.
    >
    > Now, Unicode files usually are in UTF-8.
    >
    > One solution is to read the entire file in one gulp into a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks, e.g. real-time processing of lines of text.
    >
    > If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.
    >
    > So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?

    There is a special library for processing Unicode text, see https://github.com/AdaCore/VSS; its 'contrib' directory contains the VSS.Utils.File_IO package, which loads a file into a Virtual_String. However, loading the whole file into memory is usually a bad decision.

  • From Luke A. Guest@21:1/5 to Marius Amado-Alves on Wed Nov 3 08:48:58 2021
    On 02/11/2021 17:42, Marius Amado-Alves wrote:
    > As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.

    You can take a look at my simple lib: https://github.com/Lucretia/uca

    > Now, Unicode files usually are in UTF-8.
    >
    > One solution is to read the entire file in one gulp into a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks, e.g. real-time processing of lines of text.

    It can read into a large string buffer.

    > If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.

    And can break it up into lines. There are no Unicode consistency checks.

    The lib is a bit hacky, but seems to work for now. There's nothing more
    than what I've mentioned so far.

  • From Marius Amado-Alves@21:1/5 to All on Thu Nov 4 04:43:22 2021
    Great libraries, thanks.

    It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.

    if C = '±' then ...

    And Wide_Wide_Character'Pos should give the codepoint.

  • From Dmitry A. Kazakov@21:1/5 to Marius Amado-Alves on Thu Nov 4 13:13:12 2021
    On 2021-11-04 12:43, Marius Amado-Alves wrote:
    > Great libraries, thanks.
    >
    > It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.
    >
    > if C = '±' then ...

    If the source supports Unicode, it should do UTF-8 as well. So, you
    would write

    if C = "±" then ...

    where C is String.

    > And Wide_Wide_Character'Pos should give the codepoint.

    Yes, but you need no Wide_Wide to get an integer value and if your
    objective is Unicode categorization, that is too complicated for manual comparisons. Use a library function [generated from UnicodeData.txt]
    instead.
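
    To make the two viewpoints in this exchange concrete, the sketch below tests for "±" both ways: as a String comparison on UTF-8 bytes, and as a code point obtained with Wide_Wide_Character'Pos after decoding with the standard Ada 2012 package Ada.Strings.UTF_Encoding.Wide_Wide_Strings. The procedure name Plus_Minus_Demo and the sample text are invented for the example, and "±" is spelled out as bytes so that the result does not depend on how the source file is encoded.

    ```ada
    with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
    with Ada.Text_IO;

    procedure Plus_Minus_Demo is
       use Ada.Strings.UTF_Encoding;

       --  "±" (U+00B1) written as its two UTF-8 bytes, C2 B1.
       Plus_Minus : constant UTF_8_String :=
         Character'Val (16#C2#) & Character'Val (16#B1#);

       --  Sample text "5 ± 2" in UTF-8: six bytes, five code points.
       Text : constant UTF_8_String := "5 " & Plus_Minus & " 2";

       --  Decoding yields one Wide_Wide_Character per code point.
       Decoded : constant Wide_Wide_String :=
         Wide_Wide_Strings.Decode (Text);

       --  The third code point of the decoded text.
       Third : constant Wide_Wide_Character := Decoded (Decoded'First + 2);
    begin
       --  String-based test: compare a UTF-8 slice with a UTF-8 constant.
       pragma Assert (Text (3 .. 4) = Plus_Minus);

       --  Code-point-based test: 'Pos gives the Unicode code point.
       pragma Assert (Wide_Wide_Character'Pos (Third) = 16#B1#);

       Ada.Text_IO.Put_Line
         (Integer'Image (Wide_Wide_Character'Pos (Third)));
    end Plus_Minus_Demo;
    ```

    The assertions are only checked when assertions are enabled (for example with -gnata under GNAT).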

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

  • From Luke A. Guest@21:1/5 to Marius Amado-Alves on Thu Nov 4 14:30:25 2021
    On 04/11/2021 11:43, Marius Amado-Alves wrote:
    > Great libraries, thanks.
    >
    > It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.
    >
    > if C = '±' then ...
    >
    > And Wide_Wide_Character'Pos should give the codepoint.

    Characters no longer exist as a thing, as one can even be represented by multiple UTF-32 code points.

  • From Marius Amado-Alves@21:1/5 to All on Fri Nov 5 03:56:42 2021
    > Characters no longer exist as a thing, as one can even be represented by multiple UTF-32 code points.

    You're alluding to combining characters?

  • From Simon Wright@21:1/5 to Marius Amado-Alves on Fri Nov 5 19:55:33 2021
    Marius Amado-Alves <amado.alves@gmail.com> writes:

    >> Characters no longer exist as a thing, as one can even be represented by
    >> multiple UTF-32 code points.
    >
    > You're alluding to combining characters?

    Fun & games on macOS[1]:

    $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
    gcc -c páck3.ads
    páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"

    The reason for this apparently-bizarre message is that macOS takes the composed form (lowercase a acute) and converts it under the hood to
    what HFS+ insists on, the fully decomposed form (lowercase a,
    combining acute); thus the names are actually different even though
    they _look_ the same.

    [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114#c1
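
    The difference between the two forms can be seen at the byte level. The following sketch (the procedure name Composed_Vs_Decomposed is invented for the example) encodes both spellings of "á" with the standard Ada.Strings.UTF_Encoding.Wide_Wide_Strings package and shows that the resulting UTF-8 strings differ in length and content, even though they render identically.

    ```ada
    with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
    with Ada.Text_IO;

    procedure Composed_Vs_Decomposed is
       use Ada.Strings.UTF_Encoding;

       --  U+00E1 LATIN SMALL LETTER A WITH ACUTE (composed form).
       Composed : constant Wide_Wide_String :=
         (1 => Wide_Wide_Character'Val (16#E1#));

       --  U+0061 LATIN SMALL LETTER A followed by
       --  U+0301 COMBINING ACUTE ACCENT (decomposed form).
       Decomposed : constant Wide_Wide_String :=
         (Wide_Wide_Character'Val (16#61#),
          Wide_Wide_Character'Val (16#301#));

       Composed_UTF8 : constant UTF_8_String :=
         Wide_Wide_Strings.Encode (Composed);
       Decomposed_UTF8 : constant UTF_8_String :=
         Wide_Wide_Strings.Encode (Decomposed);
    begin
       --  Composed form: bytes C3 A1. Decomposed form: bytes 61 CC 81.
       pragma Assert (Composed_UTF8'Length = 2);
       pragma Assert (Decomposed_UTF8'Length = 3);
       pragma Assert (Composed_UTF8 /= Decomposed_UTF8);

       Ada.Text_IO.Put_Line
         ("composed:" & Integer'Image (Composed_UTF8'Length) & " bytes");
       Ada.Text_IO.Put_Line
         ("decomposed:" & Integer'Image (Decomposed_UTF8'Length) & " bytes");
    end Composed_Vs_Decomposed;
    ```

    A byte-wise comparison of file names therefore fails as soon as one side has been converted to the other form.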

  • From Marius Amado-Alves@21:1/5 to All on Tue Nov 16 03:55:05 2021
    I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.

    (For me, a combining character is not a character, the combination is. Unicode agrees, right?)

  • From Dmitry A. Kazakov@21:1/5 to Marius Amado-Alves on Tue Nov 16 13:36:00 2021
    On 2021-11-16 12:55, Marius Amado-Alves wrote:
    > I'm worried. I need the concept of character, for proper text processing.

    Simply ignore or reject decomposed characters.

    > For example, I want to reference characters in a text file by their position.

    That is no problem either. There are two alternatives:

    1. Fixed font representation. Reduce everything to normal glyphs, use
    string position corresponding to the beginning of a UTF-8 sequence.

    2. Proportional font. Use a graphical user interface like GTK. The GTK
    text buffer has a data type (iterator) to indicate a place in the
    buffer, e.g. when a selection happens. These iterators are consistent
    with the glyphs as rendered on the screen and you can convert between
    them and string position.
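
    For the first alternative, finding where a UTF-8 sequence begins needs no decoding: every byte that is not a continuation byte (16#80# .. 16#BF#) starts a sequence. A minimal sketch follows; the names UTF8_Positions, Starts_Sequence and Position_Of are invented for the example, and the input is assumed to be valid UTF-8.

    ```ada
    with Ada.Text_IO;

    procedure UTF8_Positions is

       --  A byte starts a UTF-8 sequence unless it is a continuation
       --  byte (binary 10xxxxxx, i.e. 16#80# .. 16#BF#).
       function Starts_Sequence (C : Character) return Boolean is
       begin
          return Character'Pos (C) not in 16#80# .. 16#BF#;
       end Starts_Sequence;

       --  Byte index in Text of the N-th code point (counting from 1),
       --  or 0 when Text holds fewer than N code points.
       function Position_Of (Text : String; N : Positive) return Natural is
          Count : Natural := 0;
       begin
          for I in Text'Range loop
             if Starts_Sequence (Text (I)) then
                Count := Count + 1;
                if Count = N then
                   return I;
                end if;
             end if;
          end loop;
          return 0;
       end Position_Of;

       --  "a±b" as UTF-8 bytes: 61, C2 B1, 62.
       Sample : constant String :=
         'a' & Character'Val (16#C2#) & Character'Val (16#B1#) & 'b';
    begin
       pragma Assert (Position_Of (Sample, 1) = 1);  --  'a'
       pragma Assert (Position_Of (Sample, 2) = 2);  --  first byte of "±"
       pragma Assert (Position_Of (Sample, 3) = 4);  --  'b'
       pragma Assert (Position_Of (Sample, 4) = 0);  --  no fourth code point

       Ada.Text_IO.Put_Line (Natural'Image (Position_Of (Sample, 3)));
    end UTF8_Positions;
    ```

    Note that this counts code points, so it matches "characters" only under the simplification that decomposed sequences are absent or rejected.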

    > (For me, a combining character is not a character, the combination is. Unicode agrees, right?)

    No, Unicode disagrees, e.g. É can be composed from E and acute accent.
    But it is advised just to ignore all this nonsense and consider:

    code point = character

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

  • From Marius Amado-Alves@21:1/5 to All on Tue Nov 16 05:52:59 2021
    > Simply ignore or reject decomposed characters.

    Brilliant!

    > 1. Fixed font representation. Reduce everything to normal glyphs, use
    > string position corresponding to the beginning of a UTF-8 sequence.

    I am indeed resorting to byte position in UTF-8 files as the character position. Treating UTF-8 entities as the strings that they are:-)

    (Not dealing with fonts nor graphics yet, just plain text.)

    Thanks a lot.

  • From Luke A. Guest@21:1/5 to Marius Amado-Alves on Tue Nov 16 15:25:10 2021
    On 16/11/2021 11:55, Marius Amado-Alves wrote:
    > I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.
    >
    > (For me, a combining character is not a character, the combination is. Unicode agrees, right?)

    You can't. The concept of character is dead; the new concept is the
    grapheme cluster.

  • From Vadim Godunko@21:1/5 to amado...@gmail.com on Tue Nov 16 09:38:13 2021
    On Tuesday, November 16, 2021 at 2:55:06 PM UTC+3, amado...@gmail.com wrote:
    > I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.
    >
    > (For me, a combining character is not a character, the combination is. Unicode agrees, right?)

    You can use VSS and its Grapheme_Cluster_Iterator to look up the grapheme cluster at a given position and to obtain the position of a grapheme cluster in the string (as well as in UTF-8/UTF-16 code units).

    However, the concept of grapheme clusters doesn't handle special cases like tabulation stops; TAB is just a single grapheme cluster.

  • From Randy Brukardt@21:1/5 to Dmitry A. Kazakov on Tue Nov 16 14:23:28 2021
    "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message news:sn08jf$pkq$1@gioia.aioe.org...
    > On 2021-11-16 12:55, Marius Amado-Alves wrote:
    >> I'm worried. I need the concept of character, for proper text processing.
    >
    > Simply ignore or reject decomposed characters.

    Unicode calls that "requiring Normalization Form C". ("Form D" is all
    decomposed characters.) You'll note that what Ada compilers do with text
    not in Normalization Form C is implementation-defined; in particular, a
    compiler could reject such text.

    My understanding is that various Internet standards also require
    Normalization Form C. For instance, web pages are supposed to always be in
    that format. Whether browsers actually enforce that is unknown (they should enforce a lot of stuff about web pages, but generally just try to muddle through, which causes all kinds of security issues).
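
    A very rough way to "reject decomposed characters" is to refuse any text that contains a code point from the Combining Diacritical Marks block (U+0300 .. U+036F). The sketch below does only that; it is not a real Normalization Form C check, which would need the Unicode normalization data and covers many more combining characters. The names Reject_Combining_Marks and Has_Combining_Mark are invented for the example.

    ```ada
    with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
    with Ada.Text_IO;

    procedure Reject_Combining_Marks is
       use Ada.Strings.UTF_Encoding;

       --  True if Text contains a code point in U+0300 .. U+036F.
       --  Decode raises Encoding_Error if Text is not valid UTF-8.
       function Has_Combining_Mark (Text : UTF_8_String) return Boolean is
          Decoded : constant Wide_Wide_String :=
            Wide_Wide_Strings.Decode (Text);
       begin
          for C of Decoded loop
             if Wide_Wide_Character'Pos (C) in 16#300# .. 16#36F# then
                return True;
             end if;
          end loop;
          return False;
       end Has_Combining_Mark;

       --  "É" as one code point, U+00C9 (UTF-8 bytes C3 89).
       Composed : constant UTF_8_String :=
         Character'Val (16#C3#) & Character'Val (16#89#);

       --  "E" followed by U+0301 COMBINING ACUTE ACCENT
       --  (UTF-8 bytes 45 CC 81).
       Decomposed : constant UTF_8_String :=
         'E' & Character'Val (16#CC#) & Character'Val (16#81#);
    begin
       pragma Assert (not Has_Combining_Mark (Composed));
       pragma Assert (Has_Combining_Mark (Decomposed));

       Ada.Text_IO.Put_Line
         ("decomposed input flagged: "
          & Boolean'Image (Has_Combining_Mark (Decomposed)));
    end Reject_Combining_Marks;
    ```

    A program taking this route would run the check on each chunk of input and report an error instead of trying to process the text.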

    Randy.
