• Learning only one lexer made me blind to its hidden assumptions

    From Roger L Costello@21:1/5 to All on Thu Jul 7 17:49:44 2022
    Hi Folks,

    For months I have been immersed in learning and using Flex. Great fun indeed.

    But recently I have been reading a book, Crafting a Compiler with C, specifically its chapter on lexers. The chapter describes two lexer-generators: ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden assumptions in Lex/Flex. Without learning ScanGen I would have continued to think that the Lex/Flex way of doing things is the only way.

    Below I have documented some of the differences between Lex/Flex and ScanGen.

    Difference:
    - Flex allows overlapping regexes. It is up to Flex to use the 'correct'
    regex. Flex has rules for picking the correct one: the longest match wins; among equal-length matches, the regex listed first wins (see the flex sketch below).
    - ScanGen does not allow overlapping regexes. Instead, you create one regex
    and then, if needed, you create "Except" clauses. E.g., the token is an Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'.
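
    For what it's worth, the Flex side of that looks roughly like this (a minimal sketch of my own; the token codes in the enum are hypothetical). A keyword such as 'begin' also matches the identifier pattern with the same length, so the rule order breaks the tie, while "<=" versus "<" shows longest-match-wins:

    %{
    /* Hypothetical token codes, just for this sketch. */
    enum { IDENTIFIER = 258, KW_BEGIN, KW_END, LT, LE };
    %}
    %option noyywrap
    %%
    begin                 { return KW_BEGIN;    /* same length as the identifier match,  */ }
    end                   { return KW_END;      /* but listed first, so the keyword wins */ }
    [A-Za-z][A-Za-z0-9]*  { return IDENTIFIER; }
    "<="                  { return LE;          /* on "<=", the longer match wins ...    */ }
    "<"                   { return LT;          /* ... over this single-character rule   */ }
    [ \t\n]+              { /* skip whitespace */ }
    %%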

    Difference:
    - Flex regexes use juxtaposition for specifying concatenation.
    - ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls it 'catenation' not 'concatenation'

    Difference:
    - Flex regexes use | for specifying alternation
    - ScanGen uses ',' to specify alternation

    Difference:
    - With Flex, tossing out characters (e.g., toss out the quotes surrounding a string) may involve writing C code to reprocess the token
    - ScanGen has a 'Toss' command to toss out a character, e.g., Quote(Toss). No token reprocessing needed

    Difference:
    - Flex regexes use ^ for specifying 'not', e.g., [^ab] means any char except a and b
    - ScanGen regexes use 'Not', e.g., Not(Quote)

    Difference:
    - Flex deals with individual characters
    - ScanGen lumps characters into character classes and deals with classes. Use of character classes decreases (quite significantly) the size of the
    transition table
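
    To put rough numbers on that (my own back-of-the-envelope illustration, not figures from the book): a 100-state DFA over all 256 byte values needs 100 x 256 = 25,600 transition entries, whereas if those bytes collapse into, say, 20 character classes, the table shrinks to 100 x 20 = 2,000 entries plus a single 256-entry byte-to-class map.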

    Difference:
    - Flex regexes use the ? meta-symbol
    - ScanGen doesn't have that. Instead, it has 'Epsilon'

    Difference:
    - ScanGen has something called a Major number and a Minor number for each
    token
    - Flex doesn't have that concept
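
    A rough way to picture the Major/Minor idea (a hypothetical C sketch of the concept, not ScanGen's actual interface):

    /* Hypothetical sketch: the major number names the token class, and the
       minor number distinguishes members within that class, e.g. which
       reserved word an identifier-shaped lexeme turned out to be. */
    struct token {
        int major;   /* e.g. TOK_IDENTIFIER, TOK_LITERAL, TOK_DELIMITER */
        int minor;   /* e.g. KW_BEGIN, KW_END, or 0 for a plain identifier */
    };
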
    [For the same reason, I don't think it's a good idea to learn only one programming language. -John]

  • From luser droog@21:1/5 to Roger L Costello on Tue Jul 12 19:49:31 2022
    On Monday, July 11, 2022 at 7:26:08 PM UTC-5, Roger L Costello wrote:
    Hi Folks,

    For months I have been immersed in learning and using Flex. Great fun indeed.

    But recently I have been reading a book, Crafting a Compiler with C, specifically its chapter on lexers. The chapter describes two lexer-generators: ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden assumptions in Lex/Flex. Without learning ScanGen I would have continued to think that the Lex/Flex way of doing things is the only way.

    Below I have documented some of the differences between Lex/Flex and ScanGen.
    [snip]
    Difference:
    - Flex regexes use juxtaposition for specifying concatenation.
    - ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls it 'catenation' not 'concatenation'

    I think this difference in word choice possibly has some etymological significance.
    Both words come from "catenary", which is the shape a rope or cord makes when
    you drape it over some spokes or frames or hooks or whatever. So, to *catenate* is to hoist the string or rope up onto some hooks or poles so it makes that dangling *garland* kind of curve. So, it's focused on the *rope* as an entity.

    *Concatenate* adds the prefix "con" meaning "with". I interpret this as embellishing
    the rope with beads or light bulbs or something. So now we're stringing up
    a bunch of beads *together*, focusing on the hanging objects.

    The original APL book uses "catenate" in a way that I think is consistent with my interpretation here. But I could also be wrong. I have not actually researched
    this beyond having run into it a few times and attempted to come up with a plausible reason.
    [Lots of people agree with that etymology. Where do you think the Unix "cat" command came from? -John]

  • From Ev. Drikos@21:1/5 to Roger L Costello on Wed Jul 13 14:58:50 2022
    On 07/07/2022 20:49, Roger L Costello wrote:
    ...

    Difference:
    - Flex allows overlapping regexes. It is up to Flex to use the 'correct' regex. Flex has rules for picking the correct one: longest match wins, regex listed first wins.
    - ScanGen does not allow overlapping regexes. Instead, you create one regex and then, if needed, you create "Except" clauses. E.g., the token is an Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'

    ...

    As you can imagine, there are many such options. A DFA builder may have
    options a) to behave like Flex, b) to treat only some tokens as reserved
    and others as non-reserved, and c) to allow you to examine shorter matches.

    Who knows what else is out there! (I don't claim to be an expert.)

    Difference:
    - Flex deals with individual characters
    - ScanGen lumps characters into character classes and deals with classes. Use of character classes decreases (quite significantly) the size of the transition table


    FYI, there is also a related controversial issue that may spark a flame war!

    Bison also doesn't support character classes, and this could be one reason
    that scannerless parsing sounds weird to several people. Of course, one
    may drive Bison down to the character level, but with many more states.

    Also, if the grammar allows two consecutive identifiers, a lookahead
    operator is likely necessary. (Admittedly, a better alternative to
    scannerless parsing may be different start states, as supported by Flex.)
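
    For reference, flex start conditions (start states) look roughly like this (a minimal sketch with made-up token codes, not tied to the SQL experiment below): an exclusive state STR is entered on a double quote, and a different set of rules applies until the closing quote.

    %{
    /* Hypothetical token codes, just for this sketch. */
    enum { WORD = 258, STRING };
    %}
    %option noyywrap
    %x STR
    %%
    \"              { BEGIN(STR);      /* switch to the string start state */ }
    <STR>[^"\n]+    { /* string body is in yytext; accumulation omitted */ }
    <STR>\"         { BEGIN(INITIAL);  /* back to the default start state */
                      return STRING; }
    [A-Za-z]+       { return WORD; }
    [ \t\n]+        { /* skip whitespace */ }
    %%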

    When I played in the past with a scannerless GLR parser for SQL, I didn't
    see dramatic runtime slowdowns with a few single/multi-line commands.
    Yet, I wouldn't try (or suggest) such an approach for XML processing.

    ...

  • From Juan Miguel Vilar Torres@21:1/5 to All on Wed Jul 13 01:46:17 2022
    On Wednesday, July 13, 2022 at 5:25:39 UTC+2, luser droog
    wrote:
    I think this difference in word choice possibly has some etymological
    significance. Both words come from "catenary", which is the shape a rope
    or cord makes when you drape it over some spokes or frames or hooks or
    whatever. So, to *catenate* is to hoist the string or rope up onto some
    hooks or poles so it makes that dangling *garland* kind of curve. So,
    it's focused on the *rope* as an entity.

    *Concatenate* adds the prefix "con" meaning "with". I interpret this as
    embellishing the rope with beads or light bulbs or something. So now
    we're stringing up a bunch of beads *together*, focusing on the hanging
    objects.

    [Lots of people agree with that etymology. Where do you think the Unix "cat"
    command came from? -John]

    Not wanting to be pedantic, but according to Merriam-Webster, catenate comes from "Latin catenatus, past participle of catenare, from catena"; concatenate from "Middle English, from Late Latin concatenatus, past participle of concatenare to link together, from Latin com- + catena chain"; and catenary from "New Latin catenaria, from Latin, feminine of catenarius of a chain, from catena". So, the three words derive ultimately from catena (chain in Latin) and
    catenary is not the root of the others. Moreover, Merriam-Webster dates the first known use of catenate to circa 1623, the first known use of concatenate to 1598, and the first use of catenary to 1788.
    [Thanks. I think we're done with this thread. -John]

  • From antispam@math.uni.wroc.pl@21:1/5 to Roger L Costello on Wed Jul 13 19:52:45 2022
    Roger L Costello <costello@mitre.org> wrote:
    Below I have documented some of the differences between Lex/Flex and ScanGen.
    <snip>

    Difference:
    - Flex deals with individual characters
    - ScanGen lumps characters into character classes and deals with classes. Use of character classes decreases (quite significantly) the size of the transition table

    Hmm, from the flex manual:

    : -Ce, --ecs
    : construct equivalence classes
    :
    : -Cm, --meta-ecs
    : construct meta-equivalence classes

    If you want smaller tables, use the options above and the flex DFA will
    work on character classes.
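
    Concretely, with equivalence classes the generated scanner does a two-level
    lookup per input character, along the lines of this sketch (my own
    illustration with made-up names and sizes; the tables flex actually emits
    are packed and named differently depending on the compression options):

    /* Map each byte to its equivalence class first, then index the
       transition table by [state][class] instead of [state][byte].
       The cost is one extra array access per character. */
    #define N_STATES  100   /* hypothetical */
    #define N_CLASSES  20   /* hypothetical */

    static unsigned char byte_to_class[256];               /* filled in by the generator */
    static int           next_state[N_STATES][N_CLASSES];  /* states x classes entries   */

    static int step(int state, unsigned char c)
    {
        return next_state[state][byte_to_class[c]];
    }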

    --
    Waldek Hebisch

  • From George Neuner@21:1/5 to All on Thu Jul 14 16:46:36 2022
    On Wed, 13 Jul 2022 19:52:45 -0000 (UTC), antispam@math.uni.wroc.pl
    wrote:

    Hmm, from flex manual:

    : -Ce, --ecs
    : construct equivalence classes
    :
    : -Cm, --meta-ecs
    : construct meta-equivalence classes

    If you want smaller tables use options above and flex DFA will
    work on character classes.

    But note that Flex /may/ run considerably slower if you make heavy use
    of equivalence classes. IIRC, that results in (the moral equivalent of)
    an NFA rather than a DFA.

    George
    [On modern computers it's hard to imagine a scanner so big that the space savings from those two options are worth it. 64K PDP-11 and all that. -John]

  • From Kaz Kylheku@21:1/5 to Roger L Costello on Fri Jul 15 14:16:59 2022
    On 2022-07-07, Roger L Costello <costello@mitre.org> wrote:
    Hi Folks,

    For months I have been immersed in learning and using Flex. Great fun indeed.

    But recently I have been reading a book, Crafting a Compiler with C, specifically its chapter on lexers. The chapter describes two lexer-generators: ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden assumptions in Lex/Flex. Without learning ScanGen I would have continued to think that the Lex/Flex way of doing things is the only way.

    It seems hard to find information about ScanGen. I found someone's
    slides claiming that it's something written by a Gary Sevitsky in 1979
    and then worked on over the next couple of years by Robert Gray
    and Charles Fischer.

    By "learning ScanGen" do you mean that you have an implementation
    and have created some scanners?

    Below I have documented some of the differences between Lex/Flex and ScanGen.

    Difference:
    - Flex allows overlapping regexes. It is up to Flex to use the 'correct' regex. Flex has rules for picking the correct one: longest match wins, regex listed first wins.
    - ScanGen does not allow overlapping regexes. Instead, you create one regex and then, if needed, you create "Except" clauses. E.g., the token is an Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'

    Difference:
    - Flex regexes use juxtaposition for specifying concatenation.
    - ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls it 'catenation' not 'concatenation'

    That's verbose. Juxtaposition is used for catenation so that you can
    simply use "foo" if you want to match the text "foo".

    It behooves pattern matching notations to look like the problem space;
    i.e., expressions which match just single terms rather than sets should
    simply look like the terms they are matching.

    As for 'catenation' rather than 'concatenation': careful speakers and
    writers of the English language avoid the "concatenate" pleonasm.

    "catena" is Latin for "chain" (e.g. "catenary curve: the curve formed by
    a chain supported horizontally at its endpoints, under gravity"). To
    catenate literally means to chain; the "con" semantics is implicit,
    since chaining is always "together".

    The Unix command is called "cat" not "concat"; the function is "strcat"
    not "strconcat".

    If you could think of a way to chain things apart, that could be
    grounds for a functional prefix. For instance, chaining a pair of
    aggressive dogs to opposite walls of a yard so they cannot get at each
    other could plausibly be called "discatenation". :)

    Difference:
    - Flex regexes use | for specifying alternation
    - ScanGen uses ',' to specify alternation

    That's just gratuitously weird. Commas are widely used for separating
    groups in sequences: domain names, object member access,
    day.month.year, integer.fraction, filename.suffix, ...

    The | character is used in BNF.

    I can see that in 1979, stringent character set limitations would still
    have to be taken into account when developing something; there would
    still be reasons to stick to a six bit character set.

    Difference:
    - With Flex, tossing out characters (e.g., toss out the quotes surrounding a string) may involve writing C code to reprocess the token
    - ScanGen has a 'Toss' command to toss out a character, e.g., Quote(Toss). No token reprocessing needed

    That C code is just:

    /* inside the rule's action: skip the opening quote, and
       overwrite the closing quote with a terminator */
    char *interior = yytext + 1;
    yytext[yyleng - 1] = 0;

    You could write a library function like this, which you can reuse in
    future lex jobs:

    /* chop "front" characters from the front, and "back" characters
       from the back, of yytext, and return a pointer to the chopped
       lexeme. (Needs <assert.h>.) */

    char *yytrim(int front, int back)
    {
        assert (front >= 0);
        assert (back >= 0);
        assert (front <= yyleng);
        assert (back <= yyleng);

        {
            char *f = yytext + front;
            char *b = yytext + yyleng - back;
            if (b < f)
                b = f;
            *b = 0;
            return f;
        }
    }

    If the string literal syntax supports escaped quotes and other features, merely stripping the quotes becomes insufficient.

    There are plenty of situations in which it is handy to strip something
    from either end of a lexeme, though; the above function would be useful
    in the "-ll" lex library.

    Difference:
    - Flex deals with individual characters
    - ScanGen lumps characters into character classes and deals with classes. Use of character classes decreases (quite significantly) the size of the transition table

    Obviously, Lex has explicit character classes, like [cxb]: the class
    containing 'c', 'x' and 'b'.

    Even without explicit classes in the syntax, the DFA subset construction implicitly identifies character classes. A regex like (c|x|b) should not generate more states than [cxb].

    Subset construction will squeeze out things like, say, (abcdx|abcdy),
    which should produce the same DFA as abcd[xy].

    A state in a DFA can have many transitions leading out of it, triggered
    by different input symbols; when several of those symbols lead to the
    same successor state, they effectively constitute a character class.
    Such classes are essentially discovered by the algorithm.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

  • From antispam@math.uni.wroc.pl@21:1/5 to George Neuner on Fri Jul 15 20:14:10 2022
    George Neuner <gneuner2@comcast.net> wrote:
    On Wed, 13 Jul 2022 19:52:45 -0000 (UTC), antispam@math.uni.wroc.pl
    wrote:

    Hmm, from flex manual:

    : -Ce, --ecs
    : construct equivalence classes
    :
    : -Cm, --meta-ecs
    : construct meta-equivalence classes

    If you want smaller tables use options above and flex DFA will
    work on character classes.

    But note that Flex /may/ run considerably slower if you make heavy use
    of equivalence classes. IIRC, that results in (moral equivalent of)
    NFA rather than DFA.

    IIUC equivalence classes are reasonably cheap: a single extra
    array access per input character. Meta-equivalence classes are more
    expensive.

    If I wanted ultimate speed I would probably use a hand-written
    scanner; such scanners tend to be significantly faster than anything
    flex can generate. But in most cases flex scanner speed is
    adequate.

    [On modern computers it's hard to imagine a scanner so big that the space savings from those two options are worth it. 64K PDP-11 and all that. -John]

    Well, L1 cache on many processors is just 16K. L2 caches are
    bigger, but it is not hard to imagine a scanner which wastes
    a lot of time due to L2 misses. If smaller tables can fit
    in L2, then there may be a net speed gain. Even if there is a
    speed loss on some machines, smaller tables are likely to
    lead to more predictable/uniform performance.

    Also, if the scanner is fast enough with smaller tables, then
    using bigger tables is just waste. Of course it does not
    make sense to spend a lot of effort eliminating a small waste,
    but with flex the effort is just an extra command-line switch.

    --
    Waldek Hebisch
