Dear c.compilers,
For context, I have been reading the old book Compiler Design in C
by Allen Holub, available here:
https://holub.com/compiler/
It goes into the details of the author's own LeX implementation.
On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:
[The obvious approach if you're scanning UTF-8 text is to keep treating the input as
a sequence of bytes. UTF-8 was designed so that no character representation is a
prefix or suffix of any other character, so it should work without having to be clever
-John]
That's not always feasible, nor the right approach. Let's consider the
range of all lowercase Greek letters. In the source file, that range
will look something like [\xCE\xB1-\xCF\x89], and clearly the intent is
not to match the bytes \xCE, \xB1..\xCF, and \x89.
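To make the two readings concrete, here is a sketch (Python purely for
illustration; the thread is about lex-style scanners) of what the naive
byte class matches versus a correct byte-level expansion of the range:

```python
import re

# The byte class as written in the pattern: it matches the *bytes*
# 0xCE, 0xB1..0xCF, and 0x89 individually -- not Greek letters.
naive = re.compile(rb'[\xCE\xB1-\xCF\x89]')

# A correct byte-level expansion of [U+03B1..U+03C9], built from the
# UTF-8 encodings: CE B1..CE BF for alpha..omicron+, CF 80..CF 89 for
# pi..omega.  This is what the writer of the pattern actually intends.
greek_lower = re.compile(rb'\xCE[\xB1-\xBF]|\xCF[\x80-\x89]')

text = 'αβγω'.encode('utf-8')
print(naive.findall(text))        # every byte matches, split mid-character
print(greek_lower.findall(text))  # whole two-byte Greek letters
```

A lexer generator that understands UTF-8 ranges would have to perform
this kind of expansion internally, splitting one source-level range into
an alternation of lead-byte/continuation-byte classes.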
There is also the question of validating the input. It seems more
natural to put the overlong-sequence validator and the legal-code-point
validator into the lexer than to preprocess the source file.
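A minimal sketch of such a validator (my illustration in Python, not
code from Holub's book) that rejects overlong forms, surrogates, and
out-of-range values while decoding:

```python
def validate_utf8(data: bytes):
    """Decode one code point at a time, rejecting overlong forms,
    surrogates, and values above U+10FFFF -- the checks suggested
    for folding into the lexer.  Returns the code points."""
    i, out = 0, []
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, n = b, 1
        elif 0xC2 <= b <= 0xDF:          # 0xC0/0xC1 are always overlong
            cp, n = b & 0x1F, 2
        elif 0xE0 <= b <= 0xEF:
            cp, n = b & 0x0F, 3
        elif 0xF0 <= b <= 0xF4:          # 0xF5+ would exceed U+10FFFF
            cp, n = b & 0x07, 4
        else:
            raise ValueError(f'bad lead byte at {i}')
        if i + n > len(data):
            raise ValueError('truncated sequence')
        for j in range(1, n):
            c = data[i + j]
            if c & 0xC0 != 0x80:
                raise ValueError(f'bad continuation byte at {i + j}')
            cp = (cp << 6) | (c & 0x3F)
        # Overlong: the value would have fit in a shorter sequence.
        if (n == 2 and cp < 0x80) or (n == 3 and cp < 0x800) \
                or (n == 4 and cp < 0x10000):
            raise ValueError(f'overlong encoding at {i}')
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f'illegal code point at {i}')
        out.append(cp)
        i += n
    return out
```

Folding these checks into the scanner's own byte-driven automaton is
exactly the kind of table growth discussed below.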
Note that in addition to having a 16-bit Unicode char, the Java
language itself is defined in terms of Unicode: variable names can be
any Unicode letter, followed by Unicode letters and digits.
Presumably, then, the designers of Java compilers have figured this
out, I suspect using the 16-bit char.
Yes, Unicode can be fun!

[Remember that Unicode is a 21 bit code and for characters outside the
first 64K, Java's UTF-16 uses pairs of 16 bit chars known as surrogates
that make UTF-8 seem clean and beautiful. -John]
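For illustration, the surrogate pair for a character outside the first
64K can be computed like this (a Python sketch of the standard UTF-16
encoding step):

```python
# U+1F600 lies outside the first 64K, so UTF-16 carries it as a
# surrogate pair.  Subtracting 0x10000 leaves a 20-bit value that is
# split across the two 16-bit code units.
cp = 0x1F600
v = cp - 0x10000                 # 20-bit remainder
high = 0xD800 | (v >> 10)        # high (lead) surrogate
low  = 0xDC00 | (v & 0x3FF)      # low (trail) surrogate
print(hex(high), hex(low))       # 0xd83d 0xde00
```

A lexer working on 16-bit chars has to treat such a pair as one
character, which is its own small version of the multi-byte problem.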
[I still think doing UTF-8 as bytes would work fine. Since no UTF-8
encoding is a prefix or suffix of any other UTF-8 encoding, you can lex
them the same way you'd lex strings of ASCII. In the example above,
\xCE, \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as
part of a multi-byte sequence, so you can put a wildcard . at the end
of your rules to match such bogus bytes and complain about an invalid
character. Dunno what you mean about the input not always being UTF-8;
I realize there are mislabeled files of UTF-16 that you have to sort
out by sniffing the BOM at the front, but you do that, turn whatever
you're getting into UTF-8, and then feed it to the lexer.

I agree that lexing Unicode is not a solved problem, and I'm not aware
of any really good ways to limit the table sizes. -John]