• Lexing Unicode strings?

    From Johann 'Myrkraverk' Oskarsson@21:1/5 to All on Wed Apr 21 16:20:40 2021
    Dear c.compilers,

    For context, I have been reading the old book Compiler design in C
    by Allen Holub; available here

    https://holub.com/compiler/

    and it goes into the details of the author's own LeX implementation.

    Just like the dragon book [which I admit I haven't read for some number
    of years], this uses lookup tables indexed by the individual characters,
    which is fine for ASCII, but does seem excessive for all 0x110000 code
    points (U+0000 through U+10FFFF) in Unicode.

    I am interested in this, using plain old C, without using external tools
    like ICU, for my own reasons[1]. What data structures are appropriate
    for this exercise? Are there resources out there I can study, other
    than the ICU source code? [Which for other reasons of my own, I'm not
    too keen on studying.]

    [1] Let's leave out the question of whether I'll be successful or not.


    Thanks,
    --
    Johann
    [The obvious approach if you're scanning UTF-8 text is to keep treating the input as
    a sequence of bytes. UTF-8 was designed so that no character representation is a prefix or suffix
    of any other character, so it should work without having to be clever. -John]

  • From Johann 'Myrkraverk' Oskarsson@21:1/5 to Johann 'Myrkraverk' Oskarsson on Mon May 3 19:58:20 2021
    On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:

    Dear c.compilers,

    For context, I have been reading the old book Compiler design in C
    by Allen Holub; available here

    https://holub.com/compiler/

    and it goes into the details of the author's own LeX implementation.


    Snip.

    [The obvious approach if you're scanning UTF-8 text is to keep treating the input as
    a sequence of bytes. UTF-8 was designed so that no character representation is a
    prefix or suffix of any other character, so it should work without having to be clever.
    -John]

    That's not always feasible, nor the right approach. Let's consider the
    range of all lowercase Greek letters. In the source file, that range
    will look something like [\xCE\xB1-\xCF\x89], and clearly the intent is
    not to match the bytes \xCE, \xB1..\xCF, and \x89.
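
    For concreteness, that range expands at the byte level into two
    lead-byte alternatives, \xCE[\xB1-\xBF]|\xCF[\x80-\x89]. A minimal
    hand-written sketch of the same test in plain C (the name
    is_greek_lower is just illustrative, not from Holub's book):

        #include <stddef.h>

        /* Match one code point in [U+03B1, U+03C9] at the byte level.
           The range splits into two UTF-8 alternatives:
           CE B1..BF (U+03B1..U+03BF) and CF 80..89 (U+03C0..U+03C9). */
        static int is_greek_lower(const unsigned char *p, size_t n)
        {
            if (n < 2) return 0;
            if (p[0] == 0xCE && p[1] >= 0xB1 && p[1] <= 0xBF) return 1;
            if (p[0] == 0xCF && p[1] >= 0x80 && p[1] <= 0x89) return 1;
            return 0;
        }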

    There is also the question of validating the input. It seems more
    natural to put the overlong-sequence validator and the legal-code-point
    validator into the lexer rather than to preprocess the source file.

    But maybe that's "premature optimization?"
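
    As a concrete illustration of what that validation involves, here is a
    minimal hand-rolled decoder in plain C that rejects overlong sequences,
    surrogates, and out-of-range values; it is only a sketch of the checks,
    not code from Holub's book or from any particular lexer:

        #include <stddef.h>

        /* Decode one code point from s[0..n); return the number of bytes
           consumed, or -1 if the sequence is truncated, overlong, a
           surrogate, beyond U+10FFFF, or otherwise malformed. */
        static int utf8_decode(const unsigned char *s, size_t n, unsigned *out)
        {
            static const unsigned min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
            unsigned cp;
            int len;

            if (n == 0) return -1;
            if (s[0] < 0x80)                { cp = s[0];        len = 1; }
            else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
            else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
            else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
            else return -1;                 /* stray continuation or bad lead */

            if ((size_t)len > n) return -1; /* truncated sequence */
            for (int i = 1; i < len; i++) {
                if ((s[i] & 0xC0) != 0x80) return -1; /* missing continuation */
                cp = (cp << 6) | (s[i] & 0x3F);
            }
            if (cp < min_cp[len]) return -1;             /* overlong encoding */
            if (cp >= 0xD800 && cp <= 0xDFFF) return -1; /* surrogate */
            if (cp > 0x10FFFF) return -1;                /* outside Unicode */
            *out = cp;
            return len;
        }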

    UTF-8 has more problems too. There are bytes that can only appear after
    other bytes, and detecting that early would be nice[0]; there is also
    the fact that some glyphs can be constructed from either a single code
    point or from two.

    Also, Unicode input is not always utf-8.

    When Holub is constructing his state machines for the regular
    expressions, he uses bitmasks for the ranges[1]. ASCII, at 128 bits
    per mask, is reasonable; so is 256 bits for 8-bit character sets.

    0x110000 bits per mask feels excessive to me, though.
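
    One compact alternative, sketched below rather than taken from Holub's
    book, is to store each character class as a sorted list of inclusive
    code point intervals and test membership with a binary search; a class
    like the lowercase Greek range then costs two integers rather than
    0x110000 bits:

        #include <stddef.h>

        typedef struct { unsigned lo, hi; } cp_range;

        /* Membership test for a class stored as sorted, disjoint,
           inclusive code point intervals. */
        static int in_class(const cp_range *r, size_t n, unsigned cp)
        {
            size_t lo = 0, hi = n;
            while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;
                if      (cp < r[mid].lo) hi = mid;
                else if (cp > r[mid].hi) lo = mid + 1;
                else return 1;
            }
            return 0;
        }

        /* Example: the lowercase Greek letters as a single interval;
           in_class(greek_lower, 1, 0x03C0) == 1. */
        static const cp_range greek_lower[] = { { 0x03B1, 0x03C9 } };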

    Maybe I was naive to think Unicode was considered a "solved
    problem." I guess everyone is using either libraries like ICU or
    run time environments. These can have the following "problem":
    invalid input may not be signalled to the higher level code, but
    quietly replaced with the replacement character, U+FFFD, which is,
    in my opinion, not good for a compiler.

    [0] This can be part of the output lexer tables, and is harder to do
    when bootstrapping the lexer by hand.

    [1] His implementation of sets.
    --
    Johann
    [I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding
    is a prefix or suffix of any other UTF-8 encoding, you can lex them
    the same way you'd lex strings of ASCII. In that example above, \xCE, \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of
    a multi-byte sequence, so if they do, you can put a wildcard . at the
    end to match bogus bytes and complain about an invalid character. Dunno
    what you mean about not always UTF-8; I realize there are mislabeled
    files of UTF-16 that you have to sort out by sniffing the BOM at the
    front, but you do that and turn whatever you're getting into UTF-8
    and then feed it to the lexer.

    I agree that lexing Unicode is not a solved problem, and I'm not
    aware of any really good ways to limit the table sizes. -John]

  • From Christopher F Clark@21:1/5 to All on Tue May 4 14:39:54 2021
    I don't have much to personally add on this topic.

    However, if you are considering how to compress lexer tables indexed
    by Unicode code points, I would recommend you look at this paper:

    https://dl.acm.org/doi/10.1145/1780.1802

    by Dencker, Duerre, and Heuft
    on Optimization of Parsing Tables for Portable Compilers.

    They investigated the main techniques for compressing said tables,
    with particular interest in ways of using "coloring" (assigning
    multiple entries to the same location by indexing into a color table).
    From my experience with lexing Unicode (which is admittedly quite
    limited), most grammars have long sequential runs of Unicode code
    points that all fall in the same set, e.g. they are alphabetic
    characters, digits, operators, punctuation, etc., or disallowed, and
    once the code points are grouped into those sets, the actual lexing
    tables are mostly compact.

    Now, that doesn't necessarily help you map your UTF-8 (et al.) into
    those sets, although my guess is that it is simpler than it seems, as
    there is regularity there.

    And, if you look at the techniques the authors presented, you can
    combine one or two of them and get a method that is relatively both
    space and time efficient.
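
    To make that concrete, one common shape for such a compressed table is
    a two-stage lookup, sketched below in C. This is not the scheme from
    the paper, just the familiar shared-block variant of the "coloring"
    idea, and the classify() function standing in for "which set does this
    code point belong to" is hypothetical:

        #include <string.h>

        /* Two-stage lookup: stage1 maps the high bits of a code point to a
           block number; stage2 holds one 256-entry block of class numbers
           per distinct block.  Identical blocks are shared, which is what
           keeps the tables small when long runs of code points fall into
           the same class. */
        #define NBLOCKS (0x110000 / 256)

        extern unsigned char classify(unsigned cp);   /* hypothetical */

        static unsigned short stage1[NBLOCKS];
        static unsigned char  stage2[NBLOCKS][256];   /* worst case; real
                                                         grammars share far
                                                         fewer blocks */
        static int nblocks;

        static void build_tables(void)
        {
            unsigned char block[256];
            for (int b = 0; b < NBLOCKS; b++) {
                for (int i = 0; i < 256; i++)
                    block[i] = classify((unsigned)b * 256 + i);
                int found = -1;
                for (int j = 0; j < nblocks; j++)
                    if (memcmp(stage2[j], block, 256) == 0) { found = j; break; }
                if (found < 0) { found = nblocks++; memcpy(stage2[found], block, 256); }
                stage1[b] = (unsigned short)found;
            }
        }

        static unsigned char char_class(unsigned cp)
        {
            return stage2[stage1[cp >> 8]][cp & 0xFF];
        }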

    If you do some experiments and want review, critique, or suggestions,
    you may either post here or email me directly. I would be interested in
    a space- and time-efficient solution myself, as I intend to make my
    next lexer generator for Yacc++ Unicode aware and perhaps even Unicode
    centric.
    --
    ******************************************************************************
    Chris Clark                  email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
    23 Bailey Rd                 voice: (508) 435-5016
    Berlin, MA 01503 USA         twitter: @intel_chris

  • From gah4@21:1/5 to All on Tue May 4 01:11:51 2021
    On Monday, May 3, 2021 at 4:58:22 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote:
    On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:


    Snip.

    [The obvious approach if you're scanning UTF-8 text is to keep treating the input as
    a sequence of bytes. UTF-8 was designed so that no character representation is a
    prefix or suffix of any other character, so it should work without having to be clever.
    -John]

    That's not always feasible, nor the right approach. Let's consider the
    range of all lowercase greek letters. In the source file, that range
    will look something like [\xCE\xB1-\xCF\x89] and clearly the intent is
    not to match the bytes \xCE, \xB1..\xCF, and \x89.

    Ranges that are contiguous byte ranges in ASCII won't necessarily be
    contiguous byte ranges in Unicode. You need some Boolean logic in your
    matching, though that shouldn't be so hard.

    There is also the question of validating the input. It seems more
    natural to put the overlong sequence validator, and legal code point validator into the lexer, rather than preprocess the source file.

    This reminds me of the question, which I don't remember where I saw,
    about applying Boyer-Moore to Unicode. One answer is, as John notes,
    to apply it to the UTF-8 bytes. But the whole idea behind Boyer-Moore
    is to use the unequal probability distribution of characters to quickly
    skip over impossible matches, and the bytes of UTF-8 don't have the
    same (im)probability as the individual characters.

    It seems to me that you lose some of the speed advantage of Boyer-Moore,
    but maybe it is still plenty fast enough.
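
    For reference, the byte-level variant is short enough to sketch; this
    is plain Boyer-Moore-Horspool over the raw UTF-8 bytes, with nothing
    Unicode-specific in it. Because a valid UTF-8 needle never begins with
    a continuation byte, any match it finds in valid UTF-8 text is
    automatically aligned on a character boundary:

        #include <stddef.h>

        /* Boyer-Moore-Horspool over raw UTF-8 bytes.  Returns the offset
           of the first match, or (size_t)-1 if there is none. */
        static size_t bmh_search(const unsigned char *hay, size_t n,
                                 const unsigned char *needle, size_t m)
        {
            size_t shift[256];

            if (m == 0 || m > n) return (size_t)-1;
            for (size_t i = 0; i < 256; i++) shift[i] = m;
            for (size_t i = 0; i + 1 < m; i++) shift[needle[i]] = m - 1 - i;

            for (size_t pos = 0; pos + m <= n; ) {
                size_t i = m;
                while (i > 0 && hay[pos + i - 1] == needle[i - 1]) i--;
                if (i == 0) return pos;
                pos += shift[hay[pos + m - 1]];
            }
            return (size_t)-1;
        }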

    On the other hand, Java uses a 16 bit char, and one should be able to apply Boyer-Moore just as well in that case. The tables will be bigger, but then
    so are computer memories today.

    So, as with many problems, there are trade-offs between speed and size,
    and one has to choose the best case for the specific problem.

    Note that in addition to having a 16 bit Unicode char, the Java
    language itself is defined in terms of Unicode. Variable names can be
    any Unicode letter, followed by Unicode letters and digits.
    Presumably, then, the designers of Java compilers have figured this
    out, I suspect using the 16 bit char.

    One can, for example, have variables named A and Α in the same program.
    (In case you can't see it, the second one is an Alpha.)

    Yes, Unicode can be fun!
    [Remember that Unicode is a 21 bit code and for characters outside the
    first 64K, Java's UTF-16 uses pairs of 16 bit chars known as surrogates
    that make UTF-8 seem clean and beautiful. -John]

  • From gah4@21:1/5 to All on Tue May 4 14:47:08 2021
    On Tuesday, May 4, 2021 at 8:33:50 AM UTC-7, gah4 wrote:

    (snip, I wrote)

    Note that in addition to having a 16 bit Unicode char, the Java
    language itself is defined in terms of Unicode. Variable names can be
    any Unicode letter, followed by Unicode letters and digits.
    Presumably, then, the designers of Java compilers have figured this
    out, I suspect using the 16 bit char.

    (snip)

    Yes, Unicode can be fun!
    [Remember that Unicode is a 21 bit code and for characters outside the
    first 64K, Java's UTF-16 uses pairs of 16 bit chars known as surrogates
    that make UTF-8 seem clean and beautiful. -John]

    I did know that Java used 16 bits, but never tried to figure out what they did with the rest of the characters. There should be enough in the first 64K for writing
    programs.

    I did once use π for a variable name, with the obvious value. It seems it is \u03c0.
    I even found an editor that allowed entering such characters, and then would write
    out the file with \u escapes. As far as I know, that is more usual than UTF-8.

    I believe that the Java parser converts from \u escapes fairly
    early, such that you can quote strings with \u0022, and then you
    should be able to put \uu0022 inside the strings.
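
    Out of curiosity, here is what such an early pre-pass can look like in
    C. This is only a toy, not what any Java compiler actually does: it
    ignores the extra-u form, surrogate pairs, and Java's rules about
    preceding backslashes, and the function names are made up:

        #include <ctype.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Emit one code point (BMP only, which is all \uXXXX can name)
           as UTF-8. */
        static void emit_utf8(unsigned cp, FILE *out)
        {
            if (cp < 0x80)       { fputc(cp, out); }
            else if (cp < 0x800) { fputc(0xC0 | (cp >> 6), out);
                                   fputc(0x80 | (cp & 0x3F), out); }
            else                 { fputc(0xE0 | (cp >> 12), out);
                                   fputc(0x80 | ((cp >> 6) & 0x3F), out);
                                   fputc(0x80 | (cp & 0x3F), out); }
        }

        /* Copy src to out, translating \uXXXX escapes before lexing. */
        static void unescape(const char *src, FILE *out)
        {
            for (size_t i = 0; src[i]; ) {
                if (src[i] == '\\' && src[i + 1] == 'u'
                    && isxdigit((unsigned char)src[i + 2])
                    && isxdigit((unsigned char)src[i + 3])
                    && isxdigit((unsigned char)src[i + 4])
                    && isxdigit((unsigned char)src[i + 5])) {
                    char hex[5] = { src[i + 2], src[i + 3],
                                    src[i + 4], src[i + 5], 0 };
                    emit_utf8((unsigned)strtoul(hex, NULL, 16), out);
                    i += 6;
                } else {
                    fputc(src[i++], out);
                }
            }
        }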

    [If you're only going to allow the lower 64K, your users will be sad
    when they try to use quoted strings with uncommon Chinese characters
    or with emoji, or more likely your compiler will barf since they will
    be encoded as two surrogate characters and your lexer won't know what
    to do with them. If you're going to deal with Unicode, better bite the
    bullet and deal with the whole mess. -John]

  • From Hans Aberg@21:1/5 to John Levine on Wed Jul 14 15:39:25 2021
    On 2021-05-04 01:58, John Levine wrote:
    [I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding is a prefix or suffix of any other UTF-8 encoding, you can lex them
    the same way you'd lex strings of ASCII. In that example above, \xCE, \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of
    a multi-byte sequence, so if they do, you can put a wildcard . at the
    end to match bogus bytes and complain about an invalid character. Dunno
    what you mean about not always UTF-8; I realize there are mislabeled
    files of UTF-16 that you have to sort out by sniffing the BOM at the
    front, but you do that and turn whatever you're getting into UTF-8
    and then feed it to the lexer.

    I agree that lexing Unicode is not a solved problem, and I'm not
    aware of any really good ways to limit the table sizes. -John]

    I wrote code, in Haskell and C++, that translates Unicode character
    classes into byte classes. From a theoretical standpoint, a Unicode
    regular language mapped under UTF-8 is a byte regular language, so it
    is possible. So the 2^8 = 256 entry tables that Flex uses are enough.
    The Flex manual has an example of how to make a regular expression
    that replaces its dot '.' so as to pick up all legal UTF-8 bytes.
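
    The two-byte case shows the flavor of that translation. The sketch
    below is mine, not the Haskell or C++ code mentioned above: it prints
    the byte-level alternatives for a code point range that lies entirely
    in U+0080..U+07FF, and the general case just applies the same split
    per UTF-8 sequence length:

        #include <stdio.h>

        /* Print the byte-level pattern for a code point range lying
           entirely in the two-byte UTF-8 zone, U+0080..U+07FF.  For
           [U+03B1, U+03C9] this prints \xCE[\xB1-\xBF]|\xCF[\x80-\x89]. */
        static void print_2byte_range(unsigned lo, unsigned hi)
        {
            unsigned lead_lo = 0xC0 | (lo >> 6), cont_lo = 0x80 | (lo & 0x3F);
            unsigned lead_hi = 0xC0 | (hi >> 6), cont_hi = 0x80 | (hi & 0x3F);

            if (lead_lo == lead_hi) {                  /* one lead byte only */
                printf("\\x%02X[\\x%02X-\\x%02X]\n", lead_lo, cont_lo, cont_hi);
                return;
            }
            printf("\\x%02X[\\x%02X-\\xBF]|", lead_lo, cont_lo);
            if (lead_hi - lead_lo > 1)                 /* full middle leads */
                printf("[\\x%02X-\\x%02X][\\x80-\\xBF]|",
                       lead_lo + 1, lead_hi - 1);
            printf("\\x%02X[\\x80-\\x%02X]\n", lead_hi, cont_hi);
        }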
