• Learning only one lexer made me blind to its hidden assumptions

    From Roger L Costello@21:1/5 to All on Thu Jul 7 17:49:44 2022
    Hi Folks,

    For months I have been immersed in learning and using Flex. Great fun indeed.

    But recently I have been reading a book, Crafting a Compiler with C, specifically its chapter on lexers. The chapter describes two lexer-generators: ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden assumptions in Lex/Flex. Without learning ScanGen I would have continued to think that the Lex/Flex way of doing things is the only way.

    Below I have documented some of the differences between Lex/Flex and ScanGen.

    Difference:
    - Flex allows overlapping regexes. It is up to Flex to use the 'correct'
    regex. Flex has rules for picking the correct one: the longest match wins; among equal-length matches, the regex listed first wins (see the flex sketch below).
    - ScanGen does not allow overlapping regexes. Instead, you create one regex
    and then, if needed, you create "Except" clauses. E.g., the token is an Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'.
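
    For what it's worth, the Flex side of that looks roughly like this (a minimal sketch of my own; the token codes in the enum are hypothetical). A keyword such as 'begin' also matches the identifier pattern with the same length, so the rule order breaks the tie, while "<=" versus "<" shows longest-match-wins:

    %{
    /* Hypothetical token codes, just for this sketch. */
    enum { IDENTIFIER = 258, KW_BEGIN, KW_END, LT, LE };
    %}
    %option noyywrap
    %%
    begin                 { return KW_BEGIN;    /* same length as the identifier match,  */ }
    end                   { return KW_END;      /* but listed first, so the keyword wins */ }
    [A-Za-z][A-Za-z0-9]*  { return IDENTIFIER; }
    "<="                  { return LE;          /* on "<=", the longer match wins ...    */ }
    "<"                   { return LT;          /* ... over this single-character rule   */ }
    [ \t\n]+              { /* skip whitespace */ }
    %%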

    Difference:
    - Flex regexes use juxtaposition for specifying concatenation.
    - ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls it 'catenation' not 'concatenation'

    Difference:
    - Flex regexes use | for specifying alternation
    - ScanGen uses ',' to specify alternation

    Difference:
    - With Flex, tossing out characters (e.g., toss out the quotes surrounding a string) may involve writing C code to reprocess the token
    - ScanGen has a 'Toss' command to toss out a character, e.g., Quote(Toss). No token reprocessing needed

    Difference:
    - Flex regexes use ^ for specifying 'not', e.g., [^ab] means any char except a and b
    - ScanGen regexes use 'Not', e.g., Not(Quote)

    Difference:
    - Flex deals with individual characters
    - ScanGen lumps characters into character classes and deals with classes. Use of character classes decreases (quite significantly) the size of the
    transition table
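
    To put rough numbers on that (my own back-of-the-envelope illustration, not figures from the book): a 100-state DFA over all 256 byte values needs 100 x 256 = 25,600 transition entries, whereas if those bytes collapse into, say, 20 character classes, the table shrinks to 100 x 20 = 2,000 entries plus a single 256-entry byte-to-class map.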

    Difference:
    - Flex regexes use the ? meta-symbol
    - ScanGen doesn't have that. Instead, it has 'Epsilon'

    Difference:
    - ScanGen has something called a Major number and a Minor number for each
    token
    - Flex doesn't have that concept
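
    A rough way to picture the Major/Minor idea (a hypothetical C sketch of the concept, not ScanGen's actual interface):

    /* Hypothetical sketch: the major number names the token class, and the
       minor number distinguishes members within that class, e.g. which
       reserved word an identifier-shaped lexeme turned out to be. */
    struct token {
        int major;   /* e.g. TOK_IDENTIFIER, TOK_LITERAL, TOK_DELIMITER */
        int minor;   /* e.g. KW_BEGIN, KW_END, or 0 for a plain identifier */
    };
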
    [For the same reason, I don't think it's a good idea to learn only one programming language. -John]

  • From luser droog@21:1/5 to Roger L Costello on Tue Jul 12 19:49:31 2022
    On Monday, July 11, 2022 at 7:26:08 PM UTC-5, Roger L Costello wrote:
    Hi Folks,

    For months I have been immersed in learning and using Flex. Great fun indeed.

    But recently I have been reading a book, Crafting a Compiler with C, specifically its chapter on lexers. The chapter describes two lexer-generators: ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden assumptions in Lex/Flex. Without learning ScanGen I would have continued to think that the Lex/Flex way of doing things is the only way.

    Below I have documented some of the differences between Lex/Flex and ScanGen.
    [snip]
    Difference:
    - Flex regexes use juxtaposition for specifying concatenation.
    - ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls it 'catenation' not 'concatenation'

    I think this difference in word choice possibly has some etymological significance.
    Both words come from "catenary", which is the shape a rope or cord makes when
    you drape it over some spokes or frames or hooks or whatever. So, to *catenate* is to hoist the string or rope up onto some hooks or poles so it makes that dangling *garland* kind of curve. So, it's focused on the *rope* as an entity.

    *Concatenate* adds the prefix "con" meaning "with". I interpret this as embellishing
    the rope with beads or light bulbs or something. So now we're stringing up
    a bunch of beads *together*, focusing on the hanging objects.

    The original APL book uses "catenate" in a way that I think is consistent with my interpretation here. But I could also be wrong. I have not actually researched
    this beyond having run into it a few times and attempted to come up with a plausible reason.
    [Lots of people agree with that etymology. Where do you think the Unix "cat" command came from? -John]

  • From Ev. Drikos@21:1/5 to Roger L Costello on Wed Jul 13 14:58:50 2022
    On 07/07/2022 20:49, Roger L Costello wrote:
    ...

    Difference:
    - Flex allows overlapping regexes. It is up to Flex to use the 'correct' regex. Flex has rules for picking the correct one: longest match wins, regex listed first wins.
    - ScanGen does not allow overlapping regexes. Instead, you create one regex and then, if needed, you create "Except" clauses. E.g., the token is an Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'

    ...

    As you can imagine, there are many such options. A DFA builder may have
    options a) to behave like Flex, b) to treat only some tokens as reserved
    and others as non-reserved, and c) to allow you to examine shorter matches.

    Who knows what else is out there! (I don't claim to be an expert.)

    Difference:
    - Flex deals with individual characters
    - ScanGen lumps characters into character classes and deals with classes. Use of character classes decreases (quite significantly) the size of the transition table


    FYI, there is also a related controversial issue that may spark a flame war!

    Bison also doesn't support character classes, and this could be one reason
    that scannerless parsing sounds weird to several people. Of course, one
    may drive Bison down to the character level, but with many more states.

    Also, if the grammar allows two consecutive identifiers, a lookahead
    operator is likely necessary. (Admittedly, a better alternative to
    scannerless parsing may be different start states, as supported by Flex.)
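
    For reference, flex start conditions (start states) look roughly like this (a minimal sketch with made-up token codes, not tied to the SQL experiment below): an exclusive state STR is entered on a double quote, and a different set of rules applies until the closing quote.

    %{
    /* Hypothetical token codes, just for this sketch. */
    enum { WORD = 258, STRING };
    %}
    %option noyywrap
    %x STR
    %%
    \"              { BEGIN(STR);      /* switch to the string start state */ }
    <STR>[^"\n]+    { /* string body is in yytext; accumulation omitted */ }
    <STR>\"         { BEGIN(INITIAL);  /* back to the default start state */
                      return STRING; }
    [A-Za-z]+       { return WORD; }
    [ \t\n]+        { /* skip whitespace */ }
    %%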

    When I played in the past with a scannerless GLR parser for SQL, I didn't
    see dramatic runtime slowdowns with a few single/multi-line commands.
    Yet, I wouldn't try (or suggest) such an approach for XML processing.

    ...

  • From Juan Miguel Vilar Torres@21:1/5 to All on Wed Jul 13 01:46:17 2022
    On Wednesday, July 13, 2022 at 5:25:39 UTC+2, luser droog
    wrote:
    I think this difference in word choice possibly has some etymological
    significance. Both words come from "catenary", which is the shape a rope
    or cord makes when you drape it over some spokes or frames or hooks or
    whatever. So, to *catenate* is to hoist the string or rope up onto some
    hooks or poles so it makes that dangling *garland* kind of curve. So,
    it's focused on the *rope* as an entity.

    *Concatenate* adds the prefix "con" meaning "with". I interpret this as
    embellishing the rope with beads or light bulbs or something. So now
    we're stringing up a bunch of beads *together*, focusing on the hanging
    objects.

    [Lots of people agree with that etymology. Where do you think the Unix "cat"
    command came from? -John]

    Not wanting to be pedantic, but according to Merriam-Webster, catenate comes from "Latin catenatus, past participle of catenare, from catena"; concatenate from "Middle English, from Late Latin concatenatus, past participle of concatenare to link together, from Latin com- + catena chain"; and catenary from "New Latin catenaria, from Latin, feminine of catenarius of a chain, from catena". So, the three words derive ultimately from catena (chain in Latin) and
    catenary is not the root of the others. Moreover, Merriam-Webster dates the first known use of catenate to circa 1623, the first known use of concatenate to 1598, and the first use of catenary to 1788.
    [Thanks. I think we're done with this thread. -John]

  • From antispam@math.uni.wroc.pl@21:1/5 to Roger L Costello on Wed Jul 13 19:52:45 2022
    Roger L Costello <costello@mitre.org> wrote:
    Below I have documented some of the differences between Lex/Flex and ScanGen.
    <snip>

    Difference:
    - Flex deals with individual characters
    - ScanGen lumps characters into character classes and deals with classes. Use of character classes decreases (quite significantly) the size of the transition table

    Hmm, from the flex manual:

    : -Ce, --ecs
    : construct equivalence classes
    :
    : -Cm, --meta-ecs
    : construct meta-equivalence classes

    If you want smaller tables, use the options above and the flex DFA will
    work on character classes.
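
    Concretely, with equivalence classes the generated scanner does a two-level
    lookup per input character, along the lines of this sketch (my own
    illustration with made-up names and sizes; the tables flex actually emits
    are packed and named differently depending on the compression options):

    /* Map each byte to its equivalence class first, then index the
       transition table by [state][class] instead of [state][byte].
       The cost is one extra array access per character. */
    #define N_STATES  100   /* hypothetical */
    #define N_CLASSES  20   /* hypothetical */

    static unsigned char byte_to_class[256];               /* filled in by the generator */
    static int           next_state[N_STATES][N_CLASSES];  /* states x classes entries   */

    static int step(int state, unsigned char c)
    {
        return next_state[state][byte_to_class[c]];
    }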

    --
    Waldek Hebisch

  • From George Neuner@21:1/5 to All on Thu Jul 14 16:46:36 2022
    On Wed, 13 Jul 2022 19:52:45 -0000 (UTC), antispam@math.uni.wroc.pl
    wrote:

    Hmm, from flex manual:

    : -Ce, --ecs
    : construct equivalence classes
    :
    : -Cm, --meta-ecs
    : construct meta-equivalence classes

    If you want smaller tables use options above and flex DFA will
    work on character classes.

    But note that Flex /may/ run considerably slower if you make heavy use
    of equivalence classes. IIRC, that results in (the moral equivalent of)
    an NFA rather than a DFA.

    George
    [On modern computers it's hard to imagine a scanner so big that the space savings from those two options are worth it. 64K PDP-11 and all that. -John]

  • From Kaz Kylheku@21:1/5 to Roger L Costello on Fri Jul 15 14:16:59 2022
    On 2022-07-07, Roger L Costello <costello@mitre.org> wrote:
    Hi Folks,

    For months I have been immersed in learning and using Flex. Great fun indeed.

    But recently I have been reading a book, Crafting a Compiler with C, specifically its chapter on lexers. The chapter describes two lexer-generators: ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden assumptions in Lex/Flex. Without learning ScanGen I would have continued to think that the Lex/Flex way of doing things is the only way.

    It seems hard to find information about ScanGen. I found someone's
    slides claiming that it's something written by a Gary Sevitsky in 1979
    and then worked on over the next couple of years by Robert Gray
    and Charles Fischer.

    By "learning ScanGen" do you mean that you have an implementation
    and have created some scanners?

    Below I have documented some of the differences between Lex/Flex and ScanGen.

    Difference:
    - Flex allows overlapping regexes. It is up to Flex to use the 'correct' regex. Flex has rules for picking the correct one: longest match wins, regex listed first wins.
    - ScanGen does not allow overlapping regexes. Instead, you create one regex and then, if needed, you create "Except" clauses. E.g., the token is an Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'

    Difference:
    - Flex regexes use juxtaposition for specifying concatenation.
    - ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls it 'catenation' not 'concatenation'

    That's verbose. Juxtaposition is used for catenation so that you can
    simply use "foo" if you want to match the text "foo".

    It behooves pattern matching notations to look like the problem space;
    i.e., expressions which match just single terms rather than sets should
    simply look like the terms they are matching.

    As for 'catenation' rather than 'concatenation': careful speakers and
    writers of the English language avoid the "concatenate" pleonasm.

    "catena" is Latin for "chain" (e.g. "catenary curve: the curve formed by
    a chain supported horizontally at its endpoints, under gravity"). To
    catenate literally means to chain; the "con" semantics is implicit,
    since chaining is always "together".

    The Unix command is called "cat" not "concat"; the function is "strcat"
    not "strconcat".

    If you could think of a way to chain things apart, that could be
    grounds for a functional prefix. For instance, chaining a pair of
    aggressive dogs to opposite walls of a yard so they cannot get at each
    other could plausibly be called "discatenation". :)

    Difference:
    - Flex regexes use | for specifying alternation
    - ScanGen uses ',' to specify alternation

    That's just gratuitously weird. Commas are widely used for separating
    groups in sequences: domain names, object member access,
    day.month.year, integer.fraction, filename.suffix, ...

    The | character is used in BNF.

    I can see that in 1979, stringent character set limitations would still
    have to be taken into account when developing something; there would
    still be reasons to stick to a six bit character set.

    Difference:
    - With Flex, tossing out characters (e.g., toss out the quotes surrounding a string) may involve writing C code to reprocess the token
    - ScanGen has a 'Toss' command to toss out a character, e.g., Quote(Toss). No token reprocessing needed

    That C code is just:

    /* inside the rule's action: skip the opening quote, and
       overwrite the closing quote with a terminator */
    char *interior = yytext + 1;
    yytext[yyleng - 1] = 0;

    You could write a library function like this, which you can reuse in
    future lex jobs:

    /* chop "front" characters from the front, and "back" characters
       from the back, of yytext, and return a pointer to the chopped
       lexeme. (Needs <assert.h>.) */

    char *yytrim(int front, int back)
    {
        assert (front >= 0);
        assert (back >= 0);
        assert (front <= yyleng);
        assert (back <= yyleng);

        {
            char *f = yytext + front;
            char *b = yytext + yyleng - back;
            if (b < f)
                b = f;
            *b = 0;
            return f;
        }
    }

    If the string literal syntax supports escaped quotes and other features, merely stripping the quotes becomes insufficient.

    There are plenty of situations in which it is handy to strip something
    from either end of a lexeme, though; the above function would be useful
    in the "-ll" lex library.

    Difference:
    - Flex deals with individual characters
    - ScanGen lumps characters into character classes and deals with classes. Use of character classes decreases (quite significantly) the size of the transition table

    Obviously, Lex has explicit character classes, like [cxb]: the class
    containing 'c', 'x' and 'b'.

    Even without explicit classes in the syntax, the DFA subset construction implicitly identifies character classes. A regex like (c|x|b) should not generate more states than [cxb].

    Subset construction will squeeze out things like, say, (abcdx|abcdy),
    which should produce the same DFA as abcd[xy].

    A state in a DFA can have many transitions leading out of it, triggered
    by different input symbols; when several of those symbols lead to the
    same successor state, they effectively constitute a character class.
    Such classes are essentially discovered by the algorithm.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

  • From antispam@math.uni.wroc.pl@21:1/5 to George Neuner on Fri Jul 15 20:14:10 2022
    George Neuner <gneuner2@comcast.net> wrote:
    On Wed, 13 Jul 2022 19:52:45 -0000 (UTC), antispam@math.uni.wroc.pl
    wrote:

    Hmm, from flex manual:

    : -Ce, --ecs
    : construct equivalence classes
    :
    : -Cm, --meta-ecs
    : construct meta-equivalence classes

    If you want smaller tables use options above and flex DFA will
    work on character classes.

    But note that Flex /may/ run considerably slower if you make heavy use
    of equivalence classes. IIRC, that results in (moral equivalent of)
    NFA rather than DFA.

    IIUC equivalence classes are reasonably cheap: a single extra
    array access per input character. Meta-equivalence classes are more
    expensive.

    If I wanted ultimate speed I would probably use a hand-written
    scanner; such scanners tend to be significantly faster than anything
    flex can generate. But in most cases flex scanner speed is
    adequate.

    [On modern computers it's hard to imagine a scanner so big that the space savings from those two options are worth it. 64K PDP-11 and all that. -John]

    Well, L1 cache on many processors is just 16K. L2 caches are
    bigger, but it is not hard to imagine a scanner which wastes
    a lot of time due to L2 misses. If smaller tables can fit
    in L2, then there may be a net speed gain. Even if there is a
    speed loss on some machines, smaller tables are likely to
    lead to more predictable/uniform performance.

    Also, if the scanner is fast enough with smaller tables, then
    using bigger tables is just waste. Of course it does not
    make sense to spend a lot of effort eliminating a small waste,
    but with flex the effort is just an extra command-line switch.

    --
    Waldek Hebisch
