Forum: >>> Magnum BBS <<<

How to read in a (long) UTF-8 file, incrementally?

From Marius Amado-Alves@21:1/5 to All on Tue Nov 2 10:42:37 2021

As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.

Now, Unicode files usually are in UTF-8.

One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text.

If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.

So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?

Thanks a lot.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Dmitry A. Kazakov@21:1/5 to Marius Amado-Alves on Tue Nov 2 19:17:58 2021

On 2021-11-02 18:42, Marius Amado-Alves wrote:

So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?

You simply read a stream of Characters into a buffer. Never ever use Wide or Wide_Wide, they are useless. Inside the buffer you must have 4 Characters ahead (the longest UTF-8 sequence) unless the file end is reached. Usually you read until some separator like line end.

Then you call this:

http://www.dmitry-kazakov.de/ada/strings_edit.htm#Strings_Edit.UTF8.Get

That will give you a code point and advance the cursor to the next UTF-8 character.
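
A minimal sketch of that decoding step, assuming the buffer already holds well-formed UTF-8. The Get below is a hypothetical stand-in written for illustration, not the Strings_Edit.UTF8.Get from the link above; it does not validate continuation bytes or reject overlong forms.

```ada
with Ada.Text_IO;

procedure Decode_Demo is

   subtype Code_Point is Natural range 0 .. 16#10FFFF#;

   --  Hypothetical helper (not the Strings_Edit API): decode the UTF-8
   --  sequence starting at Buffer (Pointer) and advance Pointer past it.
   --  Raises Constraint_Error on a bad lead byte or a truncated sequence.
   procedure Get
     (Buffer  : String;
      Pointer : in out Positive;
      Value   : out Code_Point)
   is
      Lead   : constant Natural := Character'Pos (Buffer (Pointer));
      Length : Positive;
      Accum  : Natural;
   begin
      if Lead < 16#80# then
         Length := 1;
         Accum  := Lead;
      elsif Lead in 16#C0# .. 16#DF# then
         Length := 2;
         Accum  := Lead mod 16#20#;
      elsif Lead in 16#E0# .. 16#EF# then
         Length := 3;
         Accum  := Lead mod 16#10#;
      elsif Lead in 16#F0# .. 16#F7# then
         Length := 4;
         Accum  := Lead mod 16#08#;
      else
         raise Constraint_Error with "invalid UTF-8 lead byte";
      end if;

      --  Each continuation byte contributes its low 6 bits
      for Index in Pointer + 1 .. Pointer + Length - 1 loop
         Accum := Accum * 64 + Character'Pos (Buffer (Index)) mod 16#40#;
      end loop;

      Value   := Accum;
      Pointer := Pointer + Length;
   end Get;

   --  "a±b" spelled out as UTF-8 bytes (U+00B1 is C2 B1)
   Text    : constant String :=
     'a' & Character'Val (16#C2#) & Character'Val (16#B1#) & 'b';
   Pointer : Positive := Text'First;
   Value   : Code_Point;

begin
   while Pointer <= Text'Last loop
      Get (Text, Pointer, Value);
      Ada.Text_IO.Put_Line (Code_Point'Image (Value));
   end loop;
end Decode_Demo;
```

In an incremental reader the same loop would run over whatever part of the buffer has been filled from the stream, refilling once fewer than 4 Characters remain ahead of Pointer.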

However, normally, no text processing task needs that. Whatever you want to do, you can accomplish it using normal String operations and normal String-based data structures like maps and tables. You need not care about any UTF-8 character boundaries ever.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

From Vadim Godunko@21:1/5 to amado...@gmail.com on Wed Nov 3 00:43:02 2021

On Tuesday, November 2, 2021 at 8:42:38 PM UTC+3, amado...@gmail.com wrote:

As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.

Now, Unicode files usually are in UTF-8.

One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text.

If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.

So it should be possible to read a single UTF-8 character, right? Which might be 1, 2, 3, or 4 bytes long, so it must be read into a String, right? Or directly to Wide_Wide. Are there such functions?

There is a special library for processing Unicode text, see https://github.com/AdaCore/VSS; its 'contrib' directory contains the VSS.Utils.File_IO package to load a file into a Virtual_String. However, attempting to load the whole file into memory is usually a bad decision.

From Luke A. Guest@21:1/5 to Marius Amado-Alves on Wed Nov 3 08:48:58 2021

On 02/11/2021 17:42, Marius Amado-Alves wrote:

As I understand it, to work with Unicode text inside the program it is better to use the Wide_Wide (UTF-32) variants of everything.

You can take a look at my simple lib: https://github.com/Lucretia/uca

Now, Unicode files usually are in UTF-8.

One solution is to read the entire file in one gulp to a String, then convert to Wide_Wide. This solution is not memory efficient, and it may not be possible in some tasks e.g. real time processing of lines of text.

It can read into a large string buffer.

If the file has lines, I guess we can also work line by line (Text_IO). But the text may not have lines. It can be a long XML object, for example.

And can break it up into lines. There are no Unicode consistency checks.

The lib is a bit hacky, but seems to work for now. There's nothing more than what I've mentioned so far.

From Marius Amado-Alves@21:1/5 to All on Thu Nov 4 04:43:22 2021

Great libraries, thanks.

It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.

if C = '±' then ...

And Wide_Wide_Character'Pos should give the codepoint.
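
A minimal sketch of that idea, assuming an Ada 2012 compiler: the standard package Ada.Strings.UTF_Encoding.Wide_Wide_Strings decodes a UTF-8 String, and 'Pos then yields the code point. The character is written with 'Val rather than as a '±' literal so that the example does not depend on how the compiler reads the source file (with GNAT, as far as I know, a UTF-8 source needs the -gnatW8 switch for the literal form).

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Pos_Demo is

   --  "±" (U+00B1) as the two UTF-8 bytes found in a file
   UTF8 : constant String :=
     Character'Val (16#C2#) & Character'Val (16#B1#);

   --  Decode the UTF-8 bytes into one Wide_Wide_Character per code point
   Wide : constant Wide_Wide_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (UTF8);

   Plus_Minus : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#B1#);

begin
   for C of Wide loop
      if C = Plus_Minus then
         Ada.Text_IO.Put_Line
           ("plus-minus, code point"
            & Integer'Image (Wide_Wide_Character'Pos (C)));
      end if;
   end loop;
end Pos_Demo;
```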

From Dmitry A. Kazakov@21:1/5 to Marius Amado-Alves on Thu Nov 4 13:13:12 2021

On 2021-11-04 12:43, Marius Amado-Alves wrote:

Great libraries, thanks.

It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.

if C = '±' then ...

If the source supports Unicode, it should do UTF-8 as well. So, you would write

if C = "±" then ...

where C is String.
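
A small sketch of the String-only approach, assuming the text is held as UTF-8 in an ordinary String. The names Plus_Minus, Line and Where are made up for the example, and the bytes are spelled out with 'Val so the result does not depend on the source file encoding. Searching works with the ordinary Ada.Strings.Fixed.Index because a valid UTF-8 sequence cannot match in the middle of another one.

```ada
with Ada.Text_IO;
with Ada.Strings.Fixed;

procedure Find_Demo is

   --  "±" (U+00B1) encoded as UTF-8: two Characters
   Plus_Minus : constant String :=
     Character'Val (16#C2#) & Character'Val (16#B1#);

   --  The UTF-8 text "5 ± 2"
   Line : constant String := "5 " & Plus_Minus & " 2";

   --  Plain substring search; the result is a byte (Character) index
   Where : constant Natural := Ada.Strings.Fixed.Index (Line, Plus_Minus);

begin
   if Where /= 0
     and then Line (Where .. Where + Plus_Minus'Length - 1) = Plus_Minus
   then
      Ada.Text_IO.Put_Line ("found at index" & Natural'Image (Where));
   end if;
end Find_Demo;
```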

And Wide_Wide_Character'Pos should give the codepoint.

Yes, but you need no Wide_Wide to get an integer value, and if your objective is Unicode categorization, that is too complicated for manual comparisons. Use a library function [generated from UnicodeData.txt] instead.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

From Luke A. Guest@21:1/5 to Marius Amado-Alves on Thu Nov 4 14:30:25 2021

On 04/11/2021 11:43, Marius Amado-Alves wrote:

Great libraries, thanks.

It still seems to me that Wide_Wide_Character is useful. It allows representing the character directly in the source code, e.g.

if C = '±' then ...

And Wide_Wide_Character'Pos should give the codepoint.

Characters no longer exist as a thing as one can even be represented as multiple utf-32 code points.

From Marius Amado-Alves@21:1/5 to All on Fri Nov 5 03:56:42 2021

Characters no longer exist as a thing as one can even be represented as multiple utf-32 code points.

You're alluding to combining characters?

From Simon Wright@21:1/5 to Marius Amado-Alves on Fri Nov 5 19:55:33 2021

Marius Amado-Alves <amado.alves@gmail.com> writes:

Characters no longer exist as a thing as one can even be represented as
multiple utf-32 code points.

You're alluding to combining characters?

Fun & games on macOS[1]:

$ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
gcc -c páck3.ads
páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"

The reason for this apparently-bizarre message is that macOS takes the composed form (lowercase a acute) and converts it under the hood to what HFS+ insists on, the fully decomposed form (lowercase a, combining acute); thus the names are actually different even though they _look_ the same.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114#c1
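
To make the difference concrete, here is a small sketch, assuming both forms are stored as UTF-8 in ordinary Strings. The names Composed and Decomposed are made up for the example. It only shows that the two spellings of "á" are different byte sequences; it does not normalize anything.

```ada
with Ada.Text_IO;

procedure Forms_Demo is

   --  U+00E1 (a with acute, composed form) in UTF-8: C3 A1
   Composed : constant String :=
     Character'Val (16#C3#) & Character'Val (16#A1#);

   --  'a' followed by U+0301 (combining acute accent) in UTF-8: 61 CC 81
   Decomposed : constant String :=
     'a' & Character'Val (16#CC#) & Character'Val (16#81#);

begin
   Ada.Text_IO.Put_Line
     ("composed length:  " & Natural'Image (Composed'Length));
   Ada.Text_IO.Put_Line
     ("decomposed length:" & Natural'Image (Decomposed'Length));
   Ada.Text_IO.Put_Line
     ("equal: " & Boolean'Image (Composed = Decomposed));
end Forms_Demo;
```

The lengths differ (2 versus 3 Characters) and the comparison is False, which is why a plain file name comparison reports a mismatch.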

From Marius Amado-Alves@21:1/5 to All on Tue Nov 16 03:55:05 2021

I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.

(For me, a combining character is not a character, the combination is. Unicode agrees, right?)

From Dmitry A. Kazakov@21:1/5 to Marius Amado-Alves on Tue Nov 16 13:36:00 2021

On 2021-11-16 12:55, Marius Amado-Alves wrote:

I'm worried. I need the concept of character, for proper text processing.

Simply ignore or reject decomposed characters.

For example, I want to reference characters in a text file by their position.

That is no problem either. There are two alternatives:

1. Fixed font representation. Reduce everything to normal glyphs, use the string position corresponding to the beginning of a UTF-8 sequence.

2. Proportional font. Use a graphical user interface like GTK. The GTK text buffer has a data type (iterator) to indicate a place in the buffer, e.g. when a selection happens. These iterators are consistent with the glyphs as rendered on the screen and you can convert between them and string position.

(For me, a combining character is not a character, the combination is. Unicode agrees, right?)

No, Unicode disagrees, e.g. É can be composed from E and acute accent.
But it is advised just to ignore all this nonsense and consider:

code point = character

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

From Marius Amado-Alves@21:1/5 to All on Tue Nov 16 05:52:59 2021

Simply ignore or reject decomposed characters.

Brilliant!

1. Fixed font representation. Reduce everything to normal glyphs, use the string position corresponding to the beginning of a UTF-8 sequence.

I am indeed resorting to byte position in UTF-8 files as the character position. Treating UTF-8 entities as the strings that they are:-)

(Not dealing with fonts nor graphics yet, just plain text.)

Thanks a lot.

From Luke A. Guest@21:1/5 to Marius Amado-Alves on Tue Nov 16 15:25:10 2021

On 16/11/2021 11:55, Marius Amado-Alves wrote:

I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.

(For me, a combining character is not a character, the combination is. Unicode agrees, right?)

You can't. The concept of character is dead; the new concept is grapheme clusters.

From Vadim Godunko@21:1/5 to amado...@gmail.com on Tue Nov 16 09:38:13 2021

On Tuesday, November 16, 2021 at 2:55:06 PM UTC+3, amado...@gmail.com wrote:

I'm worried. I need the concept of character, for proper text processing. For example, I want to reference characters in a text file by their position. Any tips/references on how to deal with combining characters, or any other perturbing feature of Unicode, greatly appreciated.

(For me, a combining character is not a character, the combination is. Unicode agrees, right?)

You can use VSS and its Grapheme_Cluster_Iterator to look up the grapheme cluster at a given position and to obtain the position of a grapheme cluster in the string (as well as in UTF-8/UTF-16 code units).

However, the concept of grapheme clusters doesn't handle special cases like tabulation stops; TAB is just a single grapheme cluster.

From Randy Brukardt@21:1/5 to Dmitry A. Kazakov on Tue Nov 16 14:23:28 2021

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message news:sn08jf$pkq$1@gioia.aioe.org...

On 2021-11-16 12:55, Marius Amado-Alves wrote:

I'm worried. I need the concept of character, for proper text processing.

Simply ignore or reject decomposed characters.

Unicode calls that "requiring Normalization Form C". ("Form D" is all decomposed characters.) You'll note that what Ada compilers do with text not in Normalization Form C is implementation-defined; in particular, a compiler could reject such text.

My understanding is that various Internet standards also require Normalization Form C. For instance, web pages are supposed to always be in that format. Whether browsers actually enforce that is unknown (they should enforce a lot of stuff about web pages, but generally just try to muddle through, which causes all kinds of security issues).

Randy.
