• Lexing Unicode strings?

    From Johann 'Myrkraverk' Oskarsson@21:1/5 to All on Wed Apr 21 16:20:40 2021
    Dear c.compilers,

    For context, I have been reading the old book Compiler design in C
    by Allen Holub; available here

    https://holub.com/compiler/

    and it goes into the details of the author's own LeX implementation.

    Just like the dragon book [which I admit I haven't read for some number
    of years], this uses lookup tables indexed by the individual characters,
    which is fine for ASCII, but does seem excessive for all 0x110000 code
    points (U+0000 through U+10FFFF) in Unicode.

    I am interested in this, using plain old C, without using external tools
    like ICU, for my own reasons[1]. What data structures are appropriate
    for this exercise? Are there resources out there I can study, other
    than the ICU source code? [Which for other reasons of my own, I'm not
    too keen on studying.]

    [1] Let's leave out the question of whether I'll be successful or not.


    Thanks,
    --
    Johann
    [The obvious approach if you're scanning UTF-8 text is to keep treating the input as
    a sequence of bytes. UTF-8 was designed so that no character representation is a prefix or suffix
    of any other character, so it should work without having to be clever. -John]

  • From Johann 'Myrkraverk' Oskarsson@21:1/5 to Johann 'Myrkraverk' Oskarsson on Mon May 3 19:58:20 2021
    On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:

    Dear c.compilers,

    For context, I have been reading the old book Compiler design in C
    by Allen Holub; available here

    https://holub.com/compiler/

    and it goes into the details of the author's own LeX implementation.


    Snip.

    [The obvious approach if you're scanning UTF-8 text is to keep treating the input as
    a sequence of bytes. UTF-8 was designed so that no character representation is a
    prefix or suffix of any other character, so it should work without having to be clever.
    -John]

    That's not always feasible, nor the right approach. Let's consider the
    range of all lowercase Greek letters. In the source file, that range
    will look something like [\xCE\xB1-\xCF\x89], and clearly the intent is
    not to match the bytes \xCE, \xB1..\xCF, and \x89.
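
    For concreteness, that range expands at the byte level into two
    lead-byte alternatives, \xCE[\xB1-\xBF]|\xCF[\x80-\x89]. A minimal
    hand-written sketch of the same test in plain C (the name
    is_greek_lower is just illustrative, not from Holub's book):

        #include <stddef.h>

        /* Match one code point in [U+03B1, U+03C9] at the byte level.
           The range splits into two UTF-8 alternatives:
           CE B1..BF (U+03B1..U+03BF) and CF 80..89 (U+03C0..U+03C9). */
        static int is_greek_lower(const unsigned char *p, size_t n)
        {
            if (n < 2) return 0;
            if (p[0] == 0xCE && p[1] >= 0xB1 && p[1] <= 0xBF) return 1;
            if (p[0] == 0xCF && p[1] >= 0x80 && p[1] <= 0x89) return 1;
            return 0;
        }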

    There is also the question of validating the input. It seems more
    natural to put the overlong-sequence validator and the legal-code-point
    validator into the lexer rather than to preprocess the source file.

    But maybe that's "premature optimization?"
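
    As a concrete illustration of what that validation involves, here is a
    minimal hand-rolled decoder in plain C that rejects overlong sequences,
    surrogates, and out-of-range values; it is only a sketch of the checks,
    not code from Holub's book or from any particular lexer:

        #include <stddef.h>

        /* Decode one code point from s[0..n); return the number of bytes
           consumed, or -1 if the sequence is truncated, overlong, a
           surrogate, beyond U+10FFFF, or otherwise malformed. */
        static int utf8_decode(const unsigned char *s, size_t n, unsigned *out)
        {
            static const unsigned min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
            unsigned cp;
            int len;

            if (n == 0) return -1;
            if (s[0] < 0x80)                { cp = s[0];        len = 1; }
            else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
            else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
            else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
            else return -1;                 /* stray continuation or bad lead */

            if ((size_t)len > n) return -1; /* truncated sequence */
            for (int i = 1; i < len; i++) {
                if ((s[i] & 0xC0) != 0x80) return -1; /* missing continuation */
                cp = (cp << 6) | (s[i] & 0x3F);
            }
            if (cp < min_cp[len]) return -1;             /* overlong encoding */
            if (cp >= 0xD800 && cp <= 0xDFFF) return -1; /* surrogate */
            if (cp > 0x10FFFF) return -1;                /* outside Unicode */
            *out = cp;
            return len;
        }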

    UTF-8 has more problems too. There are bytes that can only appear after
    other bytes, and detecting that early would be nice[0]; there is also
    the fact that some glyphs can be constructed from either a single code
    point or from two.

    Also, Unicode input is not always utf-8.

    When Holub is constructing his state machines for the regular
    expressions, he uses bitmasks for the ranges[1]. ASCII, at 128 bits
    per mask, is reasonable; so is 256 bits for 8-bit character sets.

    0x110000 bits per mask feels excessive to me, though.
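
    One compact alternative, sketched below rather than taken from Holub's
    book, is to store each character class as a sorted list of inclusive
    code point intervals and test membership with a binary search; a class
    like the lowercase Greek range then costs two integers rather than
    0x110000 bits:

        #include <stddef.h>

        typedef struct { unsigned lo, hi; } cp_range;

        /* Membership test for a class stored as sorted, disjoint,
           inclusive code point intervals. */
        static int in_class(const cp_range *r, size_t n, unsigned cp)
        {
            size_t lo = 0, hi = n;
            while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;
                if      (cp < r[mid].lo) hi = mid;
                else if (cp > r[mid].hi) lo = mid + 1;
                else return 1;
            }
            return 0;
        }

        /* Example: the lowercase Greek letters as a single interval;
           in_class(greek_lower, 1, 0x03C0) == 1. */
        static const cp_range greek_lower[] = { { 0x03B1, 0x03C9 } };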

    Maybe I was naive to think Unicode was considered a "solved
    problem." I guess everyone is using either libraries like ICU or
    run time environments. These can have the following "problem":
    invalid input may not be signalled to the higher level code, but
    quietly replaced with the replacement character, U+FFFD, which is,
    in my opinion, not good for a compiler.

    [0] This can be part of the output lexer tables, and is harder to do
    when bootstrapping the lexer by hand.

    [1] His implementation of sets.
    --
    Johann
    [I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding
    is a prefix or suffix of any other UTF-8 encoding, you can lex them
    the same way you'd lex strings of ASCII. In that example above, \xCE, \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of
    a multi-byte sequence, so if they do, you can put a wildcard . at the
    end to match bogus bytes and complain about an invalid character. Dunno
    what you mean about not always UTF-8; I realize there are mislabeled
    files of UTF-16 that you have to sort out by sniffing the BOM at the
    front, but you do that and turn whatever you're getting into UTF-8
    and then feed it to the lexer.

    I agree that lexing Unicode is not a solved problem, and I'm not
    aware of any really good ways to limit the table sizes. -John]

  • From Christopher F Clark@21:1/5 to All on Tue May 4 14:39:54 2021
    I don't have much to personally add on this topic.

    However, if you are considering how to compress lexer tables indexed
    by Unicode code points, I would recommend you look at this paper:

    https://dl.acm.org/doi/10.1145/1780.1802

    by Dencker, Duerre, and Heuft
    on Optimization of Parsing Tables for Portable Compilers.

    They investigated the main techniques for compressing said tables,
    with particular interest in ways of using "coloring" (assigning
    multiple entries to the same location by indexing into a color table).
    From my experience with lexing Unicode (which is admittedly quite
    limited), most grammars have long sequential runs of Unicode code
    points that all fall in the same set, e.g. they are alphabetic
    characters, digits, operators, punctuation, etc., or disallowed, and
    once the code points are grouped into those sets, the actual lexing
    tables are mostly compact.

    Now, that doesn't necessarily help you map your UTF-8 (et al.) into
    those sets, although my guess is that it is simpler than it seems, as
    there is regularity there.

    And, if you look at the techniques the authors presented, you can
    combine one or two of them and get a method that is relatively both
    space and time efficient.
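
    To make that concrete, one common shape for such a compressed table is
    a two-stage lookup, sketched below in C. This is not the scheme from
    the paper, just the familiar shared-block variant of the "coloring"
    idea, and the classify() function standing in for "which set does this
    code point belong to" is hypothetical:

        #include <string.h>

        /* Two-stage lookup: stage1 maps the high bits of a code point to a
           block number; stage2 holds one 256-entry block of class numbers
           per distinct block.  Identical blocks are shared, which is what
           keeps the tables small when long runs of code points fall into
           the same class. */
        #define NBLOCKS (0x110000 / 256)

        extern unsigned char classify(unsigned cp);   /* hypothetical */

        static unsigned short stage1[NBLOCKS];
        static unsigned char  stage2[NBLOCKS][256];   /* worst case; real
                                                         grammars share far
                                                         fewer blocks */
        static int nblocks;

        static void build_tables(void)
        {
            unsigned char block[256];
            for (int b = 0; b < NBLOCKS; b++) {
                for (int i = 0; i < 256; i++)
                    block[i] = classify((unsigned)b * 256 + i);
                int found = -1;
                for (int j = 0; j < nblocks; j++)
                    if (memcmp(stage2[j], block, 256) == 0) { found = j; break; }
                if (found < 0) { found = nblocks++; memcpy(stage2[found], block, 256); }
                stage1[b] = (unsigned short)found;
            }
        }

        static unsigned char char_class(unsigned cp)
        {
            return stage2[stage1[cp >> 8]][cp & 0xFF];
        }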

    If you do some experiments and want review, critique, or suggestions,
    you may either post here or email me directly. I would be interested in
    a space- and time-efficient solution myself, as I intend to make my
    next lexer generator for Yacc++ Unicode aware and perhaps even Unicode
    centric.
    --
    ******************************************************************************
    Chris Clark                  email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
    23 Bailey Rd                 voice: (508) 435-5016
    Berlin, MA 01503 USA         twitter: @intel_chris

  • From gah4@21:1/5 to All on Tue May 4 01:11:51 2021
    On Monday, May 3, 2021 at 4:58:22 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote:
    On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:


    Snip.

    [The obvious approach if you're scanning UTF-8 text is to keep treating the input as
    a sequence of bytes. UTF-8 was designed so that no character representation is a
    prefix or suffix of any other character, so it should work without having to be clever.
    -John]

    That's not always feasible, nor the right approach. Let's consider the
    range of all lowercase greek letters. In the source file, that range
    will look something like [\xCE\xB1-\xCF\x89] and clearly the intent is
    not to match the bytes \xCE, \xB1..\xCF, and \x89.

    Ranges that are contiguous byte ranges in ASCII won't necessarily be
    contiguous byte ranges in Unicode. You need some Boolean logic in your
    matching, though that shouldn't be so hard.

    There is also the question of validating the input. It seems more
    natural to put the overlong sequence validator, and legal code point validator into the lexer, rather than preprocess the source file.

    This reminds me of the question, which I don't remember where I saw,
    about applying Boyer-Moore to Unicode. One answer is, as John notes,
    to apply it to the UTF-8 bytes. But the whole idea behind Boyer-Moore
    is to use the unequal probability distribution of characters to quickly
    skip over impossible matches, and the bytes of UTF-8 don't have the
    same (im)probability as the individual characters.

    It seems to me that you lose some of the speed advantage of Boyer-Moore,
    but maybe it is still plenty fast enough.
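
    For reference, the byte-level variant is short enough to sketch; this
    is plain Boyer-Moore-Horspool over the raw UTF-8 bytes, with nothing
    Unicode-specific in it. Because a valid UTF-8 needle never begins with
    a continuation byte, any match it finds in valid UTF-8 text is
    automatically aligned on a character boundary:

        #include <stddef.h>

        /* Boyer-Moore-Horspool over raw UTF-8 bytes.  Returns the offset
           of the first match, or (size_t)-1 if there is none. */
        static size_t bmh_search(const unsigned char *hay, size_t n,
                                 const unsigned char *needle, size_t m)
        {
            size_t shift[256];

            if (m == 0 || m > n) return (size_t)-1;
            for (size_t i = 0; i < 256; i++) shift[i] = m;
            for (size_t i = 0; i + 1 < m; i++) shift[needle[i]] = m - 1 - i;

            for (size_t pos = 0; pos + m <= n; ) {
                size_t i = m;
                while (i > 0 && hay[pos + i - 1] == needle[i - 1]) i--;
                if (i == 0) return pos;
                pos += shift[hay[pos + m - 1]];
            }
            return (size_t)-1;
        }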

    On the other hand, Java uses a 16 bit char, and one should be able to apply Boyer-Moore just as well in that case. The tables will be bigger, but then
    so are computer memories today.

    So, as with many problems, there are trade-offs between speed and size,
    and one has to choose the best case for the specific problem.

    Note that in addition to having a 16 bit Unicode char, the Java
    language itself is defined in terms of Unicode. Variable names can be
    any Unicode letter, followed by Unicode letters and digits.
    Presumably, then, the designers of Java compilers have figured this
    out, I suspect using the 16 bit char.

    One can, for example, have variables named A and Α in the same program.
    (In case you can't see it, the second one is an Alpha.)

    Yes, Unicode can be fun!
    [Remember that Unicode is a 21 bit code and for characters outside the
    first 64K, Java's UTF-16 uses pairs of 16 bit chars known as surrogates
    that make UTF-8 seem clean and beautiful. -John]

  • From gah4@21:1/5 to All on Tue May 4 14:47:08 2021
    On Tuesday, May 4, 2021 at 8:33:50 AM UTC-7, gah4 wrote:

    (snip, I wrote)

    Note that in addition to having a 16 bit Unicode char, the Java
    language itself is defined in terms of Unicode. Variable names can be
    any Unicode letter, followed by Unicode letters and digits.
    Presumably, then, the designers of Java compilers have figured this
    out, I suspect using the 16 bit char.

    (snip)

    Yes, Unicode can be fun!
    [Remember that Unicode is a 21 bit code and for characters outside the
    first 64K, Java's UTF-16 uses pairs of 16 bit chars known as surrogates
    that make UTF-8 seem clean and beautiful. -John]

    I did know that Java used 16 bits, but never tried to figure out what they did with the rest of the characters. There should be enough in the first 64K for writing
    programs.

    I did once use π for a variable name, with the obvious value. It seems it is \u03c0.
    I even found an editor that allowed entering such characters, and then would write
    out the file with \u escapes. As far as I know, that is more usual than UTF-8.

    I believe that the Java parser converts from \u escapes fairly
    early, such that you can quote strings with \u0022, and then you
    should be able to put \uu0022 inside the strings.
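
    Out of curiosity, here is what such an early pre-pass can look like in
    C. This is only a toy, not what any Java compiler actually does: it
    ignores the extra-u form, surrogate pairs, and Java's rules about
    preceding backslashes, and the function names are made up:

        #include <ctype.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Emit one code point (BMP only, which is all \uXXXX can name)
           as UTF-8. */
        static void emit_utf8(unsigned cp, FILE *out)
        {
            if (cp < 0x80)       { fputc(cp, out); }
            else if (cp < 0x800) { fputc(0xC0 | (cp >> 6), out);
                                   fputc(0x80 | (cp & 0x3F), out); }
            else                 { fputc(0xE0 | (cp >> 12), out);
                                   fputc(0x80 | ((cp >> 6) & 0x3F), out);
                                   fputc(0x80 | (cp & 0x3F), out); }
        }

        /* Copy src to out, translating \uXXXX escapes before lexing. */
        static void unescape(const char *src, FILE *out)
        {
            for (size_t i = 0; src[i]; ) {
                if (src[i] == '\\' && src[i + 1] == 'u'
                    && isxdigit((unsigned char)src[i + 2])
                    && isxdigit((unsigned char)src[i + 3])
                    && isxdigit((unsigned char)src[i + 4])
                    && isxdigit((unsigned char)src[i + 5])) {
                    char hex[5] = { src[i + 2], src[i + 3],
                                    src[i + 4], src[i + 5], 0 };
                    emit_utf8((unsigned)strtoul(hex, NULL, 16), out);
                    i += 6;
                } else {
                    fputc(src[i++], out);
                }
            }
        }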

    [If you're only going to allow the lower 64K, your users will be sad
    when they try to use quoted strings with uncommon Chinese characters
    or with emoji, or more likely your compiler will barf since they will
    be encoded as two surrogate characters and your lexer won't know what
    to do with them. If you're going to deal with Unicode, better bite the
    bullet and deal with the whole mess. -John]

  • From Hans Aberg@21:1/5 to John Levine on Wed Jul 14 15:39:25 2021
    On 2021-05-04 01:58, John Levine wrote:
    [I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding is a prefix or suffix of any other UTF-8 encoding, you can lex them
    the same way you'd lex strings of ASCII. In that example above, \xCE, \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of
    a multi-byte sequence, so if they do, you can put a wildcard . at the
    end to match bogus bytes and complain about an invalid character. Dunno
    what you mean about not always UTF-8; I realize there are mislabeled
    files of UTF-16 that you have to sort out by sniffing the BOM at the
    front, but you do that and turn whatever you're getting into UTF-8
    and then feed it to the lexer.

    I agree that lexing Unicode is not a solved problem, and I'm not
    aware of any really good ways to limit the table sizes. -John]

    I wrote code, in Haskell and C++, that translates Unicode character
    classes into byte classes. From a theoretical standpoint, a Unicode
    regular language mapped under UTF-8 is a byte regular language, so it
    is possible. So the 2^8 = 256 entry tables that Flex uses are enough.
    The Flex manual has an example of how to make a regular expression
    that replaces its dot '.' so as to pick up all legal UTF-8 bytes.
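
    The two-byte case shows the flavor of that translation. The sketch
    below is mine, not the Haskell or C++ code mentioned above: it prints
    the byte-level alternatives for a code point range that lies entirely
    in U+0080..U+07FF, and the general case just applies the same split
    per UTF-8 sequence length:

        #include <stdio.h>

        /* Print the byte-level pattern for a code point range lying
           entirely in the two-byte UTF-8 zone, U+0080..U+07FF.  For
           [U+03B1, U+03C9] this prints \xCE[\xB1-\xBF]|\xCF[\x80-\x89]. */
        static void print_2byte_range(unsigned lo, unsigned hi)
        {
            unsigned lead_lo = 0xC0 | (lo >> 6), cont_lo = 0x80 | (lo & 0x3F);
            unsigned lead_hi = 0xC0 | (hi >> 6), cont_hi = 0x80 | (hi & 0x3F);

            if (lead_lo == lead_hi) {                  /* one lead byte only */
                printf("\\x%02X[\\x%02X-\\x%02X]\n", lead_lo, cont_lo, cont_hi);
                return;
            }
            printf("\\x%02X[\\x%02X-\\xBF]|", lead_lo, cont_lo);
            if (lead_hi - lead_lo > 1)                 /* full middle leads */
                printf("[\\x%02X-\\x%02X][\\x80-\\xBF]|",
                       lead_lo + 1, lead_hi - 1);
            printf("\\x%02X[\\x80-\\x%02X]\n", lead_hi, cont_hi);
        }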
