Hi Folks,[snip]
For months I have been immersed in learning and using Flex. Great fun indeed.
But recently I have been reading a book, Crafting a Compiler with C, and reading its chapter on lexers. The chapter describes two lexer-generators: ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden assumptions in Lex/Flex. Without learning ScanGen I would have continued to think that the way things are done in Lex/Flex is the only way.
Below I have documented some of the differences between Lex/Flex and ScanGen.
Difference:
- Flex regexes use juxtaposition for specifying concatenation.
- ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls it 'catenation' not 'concatenation'
...
Difference:
- Flex allows overlapping regexes. It is up to Flex to use the 'correct' regex. Flex has rules for picking the correct one: longest match wins, regex listed first wins.
- ScanGen does not allow overlapping regexes. Instead, you create one regex and then, if needed, you create "Except" clauses. E.g., the token is an Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'
...
Difference:
- Flex deals with individual characters
- ScanGen lumps characters into character classes and deals with classes. Use of character classes decreases (quite significantly) the size of the transition table
...
I think this difference in word choice possibly has some etymological significance.
Both words come from "catenary", which is the shape a rope or cord makes when you drape it over some spokes or frames or hooks or whatever. So, to *catenate*
is to hoist the string or rope up onto some hooks or poles so it makes that dangling *garland* kind of curve. So, it's focused on the *rope* as an entity.
*Concatenate* adds the prefix "con" meaning "with". I interpret this as embellishing
the rope with beads or light bulbs or something. So now we're stringing up
a bunch of beads *together*, focusing on the hanging objects.
[Lots of people agree with that etymology. Where do you think the Unix "cat"
command came from? -John]
Below I have documented some of the differences between Lex/Flex and ScanGen.<snip>
Difference:
- Flex deals with individual characters
- ScanGen lumps characters into character classes and deals with classes. Use of character classes decreases (quite significantly) the size of the transition table
Hmm, from flex manual:
: -Ce, --ecs
: construct equivalence classes
:
: -Cm, --meta-ecs
: construct meta-equivalence classes
If you want smaller tables use options above and flex DFA will
work on character classes.
Hi Folks,
For months I have been immersed in learning and using Flex. Great fun indeed.
But recently I have been reading a book, Crafting a Compiler with C, and reading its chapter on lexers. The chapter describes two lexer-generators: ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden assumptions in Lex/Flex. Without learning ScanGen I would have continued to think that the way things are done in Lex/Flex is the only way.
Below I have documented some of the differences between Lex/Flex and ScanGen.
Difference:
- Flex allows overlapping regexes. It is up to Flex to use the 'correct' regex. Flex has rules for picking the correct one: longest match wins, regex listed first wins.
- ScanGen does not allow overlapping regexes. Instead, you create one regex and then, if needed, you create "Except" clauses. E.g., the token is an Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'
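To make the "Except" idea concrete, here is a rough C sketch of what such a scanner does in effect: match one Identifier pattern, then check the lexeme against a short exception list. This is only an illustration, not ScanGen's actual output; the token codes and the names scan_identifier and classify are made up for the example, and I am assuming exact, case-sensitive matching of the reserved words. In Flex you would get the same result by listing the keyword rules ahead of the identifier rule.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes, invented for this sketch. */
enum token { TOK_IDENT, TOK_BEGIN, TOK_END, TOK_READ, TOK_WRITE };

/* The "Except" list: spellings that match the Identifier pattern
   but are really reserved words. */
static const struct { const char *text; enum token tok; } exceptions[] = {
    { "Begin", TOK_BEGIN }, { "End",   TOK_END   },
    { "Read",  TOK_READ  }, { "Write", TOK_WRITE },
};

/* Length of the identifier at the start of s (a letter followed by
   letters or digits), or 0 if s does not start with a letter. */
static size_t scan_identifier(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

/* Classify a lexeme that scan_identifier has already matched:
   an Identifier, except when it is one of the reserved spellings. */
static enum token classify(const char *lexeme, size_t len)
{
    size_t i;
    assert(lexeme != NULL);
    for (i = 0; i < sizeof exceptions / sizeof exceptions[0]; i++) {
        if (strlen(exceptions[i].text) == len &&
            memcmp(exceptions[i].text, lexeme, len) == 0)
            return exceptions[i].tok;
    }
    return TOK_IDENT;
}
```

Note that "Beginner" still comes out as an Identifier, because the whole lexeme is compared against the exception list, not just a prefix.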
Difference:
- Flex regexes use juxtaposition for specifying concatenation.
- ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls
it 'catenation' not 'concatenation'
Difference:
- Flex regexes use | for specifying alternation in regexes
- ScanGen uses ',' to specify alternation
Difference:
- With Flex, tossing out characters (e.g., toss out the quotes surrounding a string) may involve writing C code to reprocess the token
- ScanGen has a 'Toss' command to toss out a character, e.g., Quote(Toss). No token reprocessing needed
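For comparison, this is roughly the kind of C helper a Flex action might call to reprocess a quoted-string token. It is a minimal sketch under my own assumptions: strip_quotes is a made-up name, the string is delimited by double quotes, and escape sequences are ignored. In a real Flex action you would pass it yytext and yyleng, the matched text and its length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a quoted lexeme such as "abc" into out without the surrounding
   double quotes. Returns the number of characters copied, or -1 if the
   lexeme is not quoted or out is too small. */
static long strip_quotes(const char *lexeme, size_t len,
                         char *out, size_t out_size)
{
    size_t body;
    assert(lexeme != NULL && out != NULL);
    if (len < 2 || lexeme[0] != '"' || lexeme[len - 1] != '"')
        return -1;
    body = len - 2;              /* characters between the quotes */
    if (body + 1 > out_size)     /* need room for the terminating NUL */
        return -1;
    memcpy(out, lexeme + 1, body);
    out[body] = '\0';
    return (long)body;
}
```

With Toss, the quote characters never make it into the token text in the first place, so no helper like this is needed.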
Difference:
- Flex deals with individual characters
- ScanGen lumps characters into character classes and deals with classes. Use of character classes decreases (quite significantly) the size of the transition table
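Here is a small C sketch of why classes shrink the table. It is a toy DFA of my own (identifiers and integers only), not anything generated by ScanGen or Flex, and all the names in it are invented. The transition table is indexed by state and character class, so it has 4 x 3 = 12 entries; indexed by state and raw 8-bit character it would need 4 x 256 = 1024.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical character classes and DFA states for this sketch. */
enum char_class { C_LETTER, C_DIGIT, C_OTHER, NUM_CLASSES };
enum state { S_START, S_IDENT, S_INT, S_ERROR, NUM_STATES };

/* Map a raw character to its class. */
static enum char_class class_of(unsigned char c)
{
    if (isalpha(c)) return C_LETTER;
    if (isdigit(c)) return C_DIGIT;
    return C_OTHER;
}

/* Transition table indexed by [state][class]. Rows are in the same
   order as enum state, columns in the same order as enum char_class. */
static const unsigned char next_state[NUM_STATES][NUM_CLASSES] = {
    /*              LETTER   DIGIT    OTHER   */
    /* S_START */ { S_IDENT, S_INT,   S_ERROR },
    /* S_IDENT */ { S_IDENT, S_IDENT, S_ERROR },
    /* S_INT   */ { S_ERROR, S_INT,   S_ERROR },
    /* S_ERROR */ { S_ERROR, S_ERROR, S_ERROR },
};

/* Run the DFA over a whole NUL-terminated string and return the
   state it ends in. */
static enum state run(const char *s)
{
    enum state st = S_START;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        st = (enum state)next_state[st][class_of((unsigned char)*s)];
    return st;
}
```

The price is one extra lookup (class_of) per input character before the table is consulted.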
On Wed, 13 Jul 2022 19:52:45 -0000 (UTC), antispam@math.uni.wroc.pl
wrote:
Hmm, from flex manual:
: -Ce, --ecs
: construct equivalence classes
:
: -Cm, --meta-ecs
: construct meta-equivalence classes
If you want smaller tables use options above and flex DFA will
work on character classes.
But note that Flex /may/ run considerably slower if you make heavy use
of equivalence classes. IIRC, that results in (moral equivalent of)
NFA rather than DFA.
[On modern computers it's hard to imagine a scanner so big that the space savings from those two options are worth it. 64K PDP-11 and all that. -John]