• Please provide a learning path for mastering lexical analysis languages

    From Roger L Costello@21:1/5 to All on Fri May 6 11:59:17 2022
    Hi Folks,

    I want to master lexical analysis.

    In Chris Christopher's last post he said:

    [Flex] is not even as powerful as some other lexical analysis languages
    and even exploiting its power often requires things that cannot be
    expressed in Flex alone nor can they be done in ways that are simple
    and easy to reason about.

    I am currently in the process of mastering Flex. I feel that mastering Flex will give me a foundation for learning other more advanced lexical analysis languages. "This XYZ lexical analysis language does it this way, Flex does it this other way. Ah, yes, I can see how XYZ's way is simpler and more
    powerful."

    From Chris's post I see that there is much to learn beyond Flex. Thank you Chris.

    Can you provide a learning path, please? A learning path for mastering lexical analysis languages.

    After I master Flex, what lexical analysis language should I then master? And after mastering that, what is the next lexical analysis language that I should master?

    /Roger

  • From George Neuner@21:1/5 to costello@mitre.org on Sun May 8 12:52:14 2022
    On Fri, 6 May 2022 11:59:17 +0000, Roger L Costello
    <costello@mitre.org> wrote:

    I want to master lexical analysis.

    In Chris Christopher's last post he said:

    [Flex] is not even as powerful as some other lexical analysis languages
    and even exploiting its power often requires things that cannot be
    expressed in Flex alone nor can they be done in ways that are simple
    and easy to reason about.

    I am currently in the process of mastering Flex. I feel that mastering Flex
    will give me a foundation for learning other more advanced lexical analysis
    languages. "This XYZ lexical analysis language does it this way, Flex does it
    this other way. Ah, yes, I can see how XYZ's way is simpler and more
    powerful."

    From Chris's post I see that there is much to learn beyond Flex. Thank you
    Chris.

    Can you provide a learning path, please? A learning path for mastering
    lexical analysis languages.

    It's not the grammars (languages) used by the tools that you need to
    learn - it's the methods used by the tools that are important.


    After I master Flex, what lexical analysis language should I then master?
    And after mastering that, what is the next lexical analysis language that I
    should master?

    /Roger

    Lexical analysis is not terribly interesting ... it really just is
    about tokenizing the input. There are 2 main techniques: regular
    expressions, and recursive descent code. In practice, these may be
    combined.

    Flex essentially is regex plus a pushdown stack. The stack offers
    some limited ability to recognize simple 'inclusion' languages that
    regex can't handle alone, but in the Chomsky sense it does not add
    very much.
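
    For a concrete picture of what that stack buys you, here is a minimal
    hand-rolled sketch in C (the function names and the fixed-depth table
    are invented for the example; in flex itself this is what the
    yypush_buffer_state()/yypop_buffer_state() calls are typically used
    for): a stack of open inputs lets the lexer follow nested
    #include-style inclusion, which a single regex automaton cannot track
    on its own.

    /* Sketch only: a stack of input files for nested inclusion. */
    #include <stdio.h>

    #define MAX_INCLUDE_DEPTH 16

    static FILE *input_stack[MAX_INCLUDE_DEPTH];
    static int   input_depth = 0;              /* current nesting level */

    /* Called when the lexer matches an include directive. */
    int push_input(const char *path)
    {
        if (input_depth >= MAX_INCLUDE_DEPTH) return -1;   /* too deep */
        FILE *f = fopen(path, "r");
        if (!f) return -1;
        input_stack[input_depth++] = f;
        return 0;
    }

    /* Called on end-of-file: pop back to the including file. */
    int pop_input(void)
    {
        if (input_depth == 0) return 0;        /* nothing left to read */
        fclose(input_stack[--input_depth]);
        return input_depth;                    /* > 0 means keep lexing */
    }

    /* The character source always reads from the top of the stack. */
    int next_char(void)
    {
        while (input_depth > 0) {
            int c = fgetc(input_stack[input_depth - 1]);
            if (c != EOF) return c;
            if (pop_input() == 0) break;       /* outermost EOF reached */
        }
        return EOF;
    }

    int main(void)
    {
        if (push_input("input.txt") != 0)      /* any file name will do */
            return 1;
        for (int c = next_char(); c != EOF; c = next_char())
            putchar(c);
        return 0;
    }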

    You can look at ANTLR for an example of RD lexing. There are others.

    The main difference between regex and RD is that regex is an unguided
    method which simply changes state in the match automaton in response
    to incoming characters. In contrast, the incoming character in RD
    guides the path taken through the matching code: e.g.,

    ch = next_char();
    if ('h' == ch) {
        ch = next_char();
        if ('e' == ch) {
            ch = next_char();
            if ('l' == ch) {
                ch = next_char();
                if ('p' == ch) {
                    // recognized "help"
                } else if ('l' == ch) {
                    // continue toward "hell", "hello", ...
                }
            }
        }
    }

    In practice real RD code often matches whole words or word prefixes
    rather than individual characters as shown here, but the principle is
    the same. And sometimes regexes are used to perform the word matches.
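
    For contrast, here is a minimal sketch of the unguided style (the state
    names and driver are made up for this example; a real generator such as
    flex builds a genuine table covering every token of the language, but
    the driver loop looks the same): the automaton only changes state in
    response to each incoming character and checks at the end whether it
    landed in an accepting state.

    /* Sketch: a tiny DFA for the single keyword "help". */
    #include <stdio.h>

    enum { START, H, HE, HEL, HELP, DEAD };

    /* Transition function; a generated lexer uses a table indexed by
       (state, character) instead of a switch, but it plays the same role. */
    static int next_state(int state, int ch)
    {
        switch (state) {
        case START: return ch == 'h' ? H    : DEAD;
        case H:     return ch == 'e' ? HE   : DEAD;
        case HE:    return ch == 'l' ? HEL  : DEAD;
        case HEL:   return ch == 'p' ? HELP : DEAD;
        default:    return DEAD;
        }
    }

    /* The driver is "unguided": it never branches through recognition
       code the way the RD version above does; it just steps the state. */
    static int matches_help(const char *s)
    {
        int state = START;
        for (; *s && state != DEAD; s++)
            state = next_state(state, (unsigned char)*s);
        return state == HELP;
    }

    int main(void)
    {
        printf("%d %d\n", matches_help("help"), matches_help("hello"));
        return 0;
    }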



    Grammatical syntax analysis - i.e. parsing - is where things get
    interesting.

    There are 3 typical methods of parsing: LL, LR, and PEG.

    LL and PEG typically are associated with RD code. LR typically is
    associated with table driven pushdown automata.
    [I do recall seeing a paper in which the LR automaton was /implemented/
    using RD code in CPS, but naturally I can't find that paper just now.
    It really just proves that there are more ways to implement an FA than
    many people realize 8-). I have not similarly seen an LL parser
    implemented with a table- or graph-driven FA/PDA, but I suspect that
    could be done as well. Chris Clark probably knows better.]

    You also may see references to 'combinator' parsing, which some
    people consider to be a unique method, but which more typically is
    associated with PEG. IMO it is simply an implementation technique that
    is equally applicable to LL, but YMMV on this point because - although
    their implementations look similar - there are some not-insignificant
    technical differences between LL and PEG.


    Bison is an LR based tool associated strongly with Flex. Bison offers
    a few different variants of LR (LALR,SLR,GLR) that offer different
    limitations on the recognized language and (where they overlap)
    differences in the sizes of the generated parsers. Parser size was
    much more important in the past - nowadays it's only relevant to
    particular programming niches such as embedded.

    ANTLR is an LL based tool that generates RD code for both parsers and
    lexers. Since v3, ANTLR is LL(*) - prior to v3, ANTLR was LL(k). Both
    LL(k) and LL(*) are interesting in their own right and you may wish to
    look at both.
    [Many LL tools are only LL(1). LL(k) and LL(*) both deal with
    management of lookahead. LL(k) was created to deal with the limitations
    of LL(1), and LL(*) relieves the programmer of having to determine the
    proper 'k' for the grammar.]
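
    To make "managing lookahead" concrete, here is a rough C sketch in the
    spirit of an LL(k) recursive-descent parser (the token names, the
    canned input, and the LA()/consume() names are all invented for the
    example): the parser keeps a small buffer of upcoming tokens and peeks
    at them before committing to a production.

    /* Sketch of fixed k-token lookahead; K would be chosen (or, for
       LL(*), discovered adaptively) from the grammar. */
    #include <stdio.h>

    #define K 2

    enum { ID = 1, COLON, ASSIGN, END };

    static const int canned[] = { ID, ASSIGN, ID, END };   /* fake lexer output */
    static int canned_pos = 0;
    static int next_token(void) { return canned[canned_pos++]; }

    static int lookahead[K];
    static int pos = 0;

    static void fill(void)    { for (int i = 0; i < K; i++) lookahead[i] = next_token(); }
    static int  LA(int i)     { return lookahead[(pos + i - 1) % K]; }   /* peek ahead */
    static void consume(void) { lookahead[pos] = next_token(); pos = (pos + 1) % K; }

    int main(void)
    {
        fill();
        /* LL(1) cannot choose between "id :" (a label) and "id = ..."
           (an assignment) on the first token alone; with K = 2 it can. */
        if (LA(1) == ID && LA(2) == COLON)
            puts("label definition");
        else if (LA(1) == ID && LA(2) == ASSIGN)
            puts("assignment");
        consume();
        return 0;
    }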

    The biggest difference is in how you deal with recursive structure in
    the language: in LL (or PEG) you need to eliminate left recursion and
    write your rules right-recursively; in LR you typically write them
    left-recursively instead. The direction of recursion determines the
    order in which recursive structures are recognized.
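
    A small, hedged illustration in C (the digit-only "term" and the toy
    input are mine, not anything from a real tool): the left-recursive
    rule one would write for an LR tool, and the right-recursive/iterative
    form that recursive-descent code can execute directly.

    /* For LR (e.g. bison) the natural rule is left recursive:
     *     expr : expr '+' term | term ;
     * Fed to a naive RD parser that form never terminates, so for LL/PEG
     * it is rewritten as
     *     expr : term ('+' term)* ;
     * which maps directly onto code such as this: */
    #include <stdio.h>

    static const char *p = "1+2+3";            /* toy input */

    static int term(void)                      /* a "term" is one digit here */
    {
        return *p ? *p++ - '0' : 0;
    }

    static int expr(void)
    {
        int value = term();
        while (*p == '+') {                    /* iterate instead of recursing left */
            p++;
            value += term();
        }
        return value;
    }

    int main(void)
    {
        printf("%d\n", expr());                /* prints 6 */
        return 0;
    }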


    I personally have little experience with existing PEG tools ...
    hopefully someone can give you some suggestions.

    Combinator parsers usually are hand written using a library of
    matching (or match creating) functions. Most often combinator parsers
    are seen in [higher order] functional languages where the functions
    easily can be strung together, but there are at least a few libraries
    available for C and C++.
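
    For a very small taste of the combinator style in C (everything here,
    from the Parser struct to the lit/seq/alt names, is invented for the
    example; real libraries are considerably richer): a "parser" is a
    function from input position to the position after a match, or NULL on
    failure, and combinators just glue such functions together.

    #include <stdio.h>

    typedef const char *(*parse_fn)(const char *in, void *arg);
    typedef struct { parse_fn fn; void *arg; } Parser;
    typedef struct { Parser p, q; } Pair;

    static const char *run(Parser p, const char *in) { return p.fn(in, p.arg); }

    /* primitive: match one literal character */
    static const char *lit_fn(const char *in, void *arg)
    {
        char c = *(const char *)arg;
        return (*in == c) ? in + 1 : NULL;
    }
    static Parser lit(const char *c) { Parser p = { lit_fn, (void *)c }; return p; }

    /* combinator: p then q */
    static const char *seq_fn(const char *in, void *arg)
    {
        Pair *pq = arg;
        const char *mid = run(pq->p, in);
        return mid ? run(pq->q, mid) : NULL;
    }
    static Parser seq(Pair *pq) { Parser r = { seq_fn, pq }; return r; }

    /* combinator: p, or else q */
    static const char *alt_fn(const char *in, void *arg)
    {
        Pair *pq = arg;
        const char *out = run(pq->p, in);
        return out ? out : run(pq->q, in);
    }
    static Parser alt(Pair *pq) { Parser r = { alt_fn, pq }; return r; }

    int main(void)
    {
        /* grammar: ('a' 'b') | 'c' -- built by plugging parsers together */
        Pair ab     = { lit("a"), lit("b") };
        Pair choice = { seq(&ab), lit("c") };
        Parser g    = alt(&choice);

        printf("%s %s %s\n",
               run(g, "ab") ? "ab:yes" : "ab:no",
               run(g, "c")  ? "c:yes"  : "c:no",
               run(g, "x")  ? "x:yes"  : "x:no");
        return 0;
    }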

    Hope this helps.
    George

  • From luser droog@21:1/5 to Roger L Costello on Sun May 8 18:08:24 2022
    On Friday, May 6, 2022 at 11:13:52 AM UTC-5, Roger L Costello wrote:
    Hi Folks,

    I want to master lexical analysis.

    In Chris Christopher's last post he said:

    [Flex] is not even as powerful as some other lexical analysis languages
    and even exploiting its power often requires things that cannot be expressed in Flex alone nor can they be done in ways that are simple
    and easy to reason about.

    I am currently in the process of mastering Flex. I feel that mastering Flex will give me a foundation for learning other more advanced lexical analysis languages. "This XYZ lexical analysis language does it this way, Flex does it this other way. Ah, yes, I can see how XYZ's way is simpler and more powerful."

    From Chris's post I see that there is much to learn beyond Flex. Thank you Chris.

    Can you provide a learning path, please? A learning path for mastering lexical
    analysis languages.

    After I master Flex, what lexical analysis language should I then master? And after mastering that, what is the next lexical analysis language that I should
    master?

    /Roger

    I'd like to echo George Neuner's remark:

    it's the methods used [by] the tools that are important.

    In my opinion the best way to truly "master"* such tools is to step
    away from the tools per se and delve into the algorithms and data
    structures involved. The best book for this (again, in my opinion) is
    Marvin Minsky's "Computation: Finite and Infinite Machines". He has a
    whole chapter about Regular Languages and their association with
    Finite State Automata and the algorithms for translating back and
    forth between a Regular Expression and its corresponding Finite State Automaton.

    And to really get a handle on when to use flex vs bison vs whatever
    other tool, you'll also need to know something about where Regular
    Languages sit in the Chomsky Hierarchy of Phrase-Structure Languages.
    I'm not sure what the best resource is for this part. Chomsky's
    writings on linguistics are formidable for the non-specialist (even
    for very motivated amateurs) so I don't really recommend going
    straight to the source. Maybe the intro chapter from the Dragon Book.
    Or whatever Aho or Ullman has written on it.

    * of course using my own private definition of mastery.

  • From Paul B Mann@21:1/5 to All on Sun May 8 22:27:55 2022
    /* Token Rules */

    <eof>          -> \z

    <constant>     -> literal
                   -> integer
                   -> decimal
                   -> real

    <identifier>   -> letter (letter|digit)*

    integer        -> digit+

    real           -> integer exp
                   -> decimal exp

    decimal        -> digit+ '.'
                   -> '.' digit+
                   -> digit+ '.' digit+

    exp            -> 'e' digit+
                   -> 'E' digit+
                   -> 'e' '-' digit+
                   -> 'E' '-' digit+
                   -> 'e' '+' digit+
                   -> 'E' '+' digit+

    literal        -> ''' lchar '''

    lchar          -> lany
                   -> '\' '\'
                   -> '\' '''
                   -> '\' '"'
                   -> '\' 'n'
                   -> '\' 't'
                   -> '\' 'a'
                   -> '\' 'b'
                   -> '\' 'f'
                   -> '\' 'r'
                   -> '\' 'v'
                   -> '\' '0'

    <string>       -> '"' schar* '"'

    schar          -> sany
                   -> '\' '\'
                   -> '\' '''
                   -> '\' '"'
                   -> '\' 'n'
                   -> '\' 't'
                   -> '\' 'a'
                   -> '\' 'b'
                   -> '\' 'f'
                   -> '\' 'r'
                   -> '\' 'v'
                   -> '\' '0'

    {whitespace}   -> whitechar+

    {commentline}  -> '/' '/' neol*

    {commentblock} -> '/' '*' na* '*'+ (nans na* '*'+)* '/'

    /* Character Sets */

    any       = 0..255 - \z
    lany      = any - ''' - '\' - \n
    sany      = any - '"' - '\' - \n

    letter    = 'a'..'z' | 'A'..'Z' | '_'
    digit     = '0'..'9'

    whitechar = \t | \n | \r | \f | \v | ' '

    na        = any - '*'         // not asterisk
    nans      = any - '*' - '/'   // not asterisk, not slash
    neol      = any - \n          // not end of line

    \t = 9    // tab
    \n = 10   // newline
    \v = 11   // vertical tab
    \f = 12   // form feed
    \r = 13   // carriage return
    \z = 26   // end of file
    \b = 32   // blank/space

    /* End */

    The above lexical rules define C-language symbols.
    It's just a lexical grammar, not too hard to figure out.
    This is input to the DFA lexer generator, which is provided
    with the LRSTAR parser generator on SourceForge.net.

    DFA creates lexers that run 80% faster than "flex" lexers
    and are about the same size.

    If you need more language power to define a lexer ...
    that's what parsers are for.

    BTW, LRSTAR creates parsers in C++ that were running
    140 times faster than those created by ANTLR, using the
    C++ target, the last time I did a comparison, 2 years ago.

  • From gah4@21:1/5 to Roger L Costello on Fri May 13 13:42:30 2022
    On Friday, May 6, 2022 at 9:13:52 AM UTC-7, Roger L Costello wrote:

    I want to master lexical analysis.

    There is a story that there is someone at Microsoft whose only job is the "START" button in the corner.

    Does a large compiler company, or compiler group of a large company, have one person specializing in lexical analysis?

    I would have thought that people would work on both lexical analysis
    and syntax analysis (parsers), and not specialize in just one.

    I might believe that optimization and code generation are different enough
    that someone could specialize in one or the other.

    Compiler courses normally teach all parts of compiler writing,
    and some compilers are written by one person, or a small group.

    Much of the story from Brooks' "Mythical Man-Month" is the problem of how
    to divide up the work for a large software project. OS/360 was close to
    the transition between writing OSes (including their compilers) in
    assembler and writing them in high-level languages.

    There is the story (I believe from Brooks) about the OS/360 linkage editor,
    the finest linker ever designed, with the ability to edit already-linked
    load modules, just at the time when everyone recompiled programs every time
    and didn't need the ability to partially recompile them.
    [As we've seen in recent discussions, the line between the lexer and
    the parser can vary a lot and it doesn't make sense to think about one
    without the other.

    Brooks was referring to the static overlays the linker created, which were indeed very lovely and were completely obsolete when large physical memory
    and virtual addressing came along. "Mythical Man-Month", pages 56-57.
    In about 1970 they added a loader that could load and go without writing
    a linked module on disk. -John]

  • From Christopher F Clark@21:1/5 to All on Sun May 22 12:12:32 2022
    My apologies for this delayed reply. The startup I am working for is
    nearing its alpha release and I had things on my plate to which I needed
    to give priority.

    Let me start my answer with two simple rules of thumb, in order of importance.

    1) Keep your lexer simple. Follow Einstein's famous suggestion (on a
    slightly different topic). Make it as simple as possible, but no
    simpler.

    2) Respect the lexer/parser boundary. Try to avoid mixing the two
    phases. That will help with item 1.

    And in that regard, don't look for a more powerful lexing mechanism
    when the job should be done in the parser. The exception is when it
    makes the combination simpler (i.e. you make your lexer just a little
    more complicated, but it makes the parser much simpler, or vice
    versa).

    So, I wouldn't go looking for lots of ways to make the lexer more
    powerful. Don't go looking for a flame thrower to light your barbecue
    grill.

    If you are seriously interested in regular expressions, I would
    recommend Friedl's book: Mastering Regular Expressions. That talks
    about nearly all the variations at a detailed level. But realize that
    the PCRE extensions (e.g. backreferences and zero-width assertions)
    are not generally available in lexical analysis tools.

    Those mechanisms are often overkill. In Snort, someone used them for
    matching quotes around expressions. Simpler would have been to write
    two rules, one for matched 's and one for matched "s. It probably
    would have run faster too. But someone saw that Snort rules allowed
    PCRE expressions and thought they would be clever.
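
    To make that concrete, here is roughly what "two simple rules" amounts
    to when written out by hand in C (the function names are mine, and
    escape sequences are deliberately left out so the point stays visible);
    in a lexer generator each of these would just be one short pattern per
    quote style.

    /* One tiny matcher per quote style, instead of one clever PCRE that
       tries to handle both.  Returns the matched length, or 0. */
    #include <stdio.h>
    #include <stddef.h>

    static size_t match_quoted(const char *s, char quote)
    {
        if (s[0] != quote) return 0;
        size_t i = 1;
        while (s[i] && s[i] != quote && s[i] != '\n')
            i++;
        return (s[i] == quote) ? i + 1 : 0;    /* must see the closing quote */
    }

    static size_t match_single(const char *s) { return match_quoted(s, '\''); }
    static size_t match_double(const char *s) { return match_quoted(s, '"'); }

    int main(void)
    {
        printf("%zu\n", match_single("'abc' rest"));     /* 5 */
        printf("%zu\n", match_double("\"hi\" rest"));    /* 4 */
        printf("%zu\n", match_double("\"unterminated")); /* 0 */
        return 0;
    }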

    That goes to point 1. Write simple rules for each token. And, it is
    often better to have more rules that all generate the same token than
    to try and make one complicated regular expression that somehow
    combines them all. And, if you find yourself writing a complicated
    regular expression or a list of more than 5 different rules for the
    same token, think twice about the problem you are solving. You might
    be doing the wrong thing at the wrong level. (Worth noting here that
    this point is true for language design. If your language design
    starts forcing you to invent new technologies or use complicated
    existing ones, ask if you are actually making the language simple to
    use; often that's an indication of a mistake in your language design.
    Now, if you are implementing an existing language, you may have no
    choice, but realize that the original designer may have made a mistake
    and you are just dealing with it.)

    By the way, don't be afraid to borrow or steal good ideas and reuse
    them. I have essentially written one lexer in Yacc++ and use it for
    all languages with only minor tweaks. Now, I'm not trying to parse
    Fortran 66 with it, but I have used it for Yacc++ itself, C/C++/C#
    dialects, Verilog, Pascal, Javascript, BASIC, and lots of tiny tools.
    I will likely use it for Rust with only the smallest of tweaks. I
    reuse that lexer for nearly everything.

    The structure of that lexer is quite simple.

    There is only a small set of rules that are actual regular
    expressions: identifiers, numbers (integers and floats distinguished),
    strings, whitespace, and comments.

    Punctuation is done by simple rules listing the legal "operators".
    Note, if I had operators like "&amp;" I would make a regular
    expression for that in the first set.

    Note that keywords I handle separately. This is based upon advice
    Frank Deremer once gave: the lexer should be a scanner (the regular
    expression part) and a screener (looking for keywords and the like).
    Those are two separate tasks and splitting them often makes both
    simpler. (In fact, it's one of my few gripes about ANTLR4, as it
    doesn't make doing that easy. We built that model, split
    scanner/screener, into Yacc++ and it was one of our better choices.
    You can make a more powerful screener without making the scanner more
    complex and vice-versa. That goes back to the idea of making things
    simple by doing only one thing at a time.)
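
    A minimal sketch of such a screener in C (the token codes and keyword
    list are invented; the scanner side is assumed to have already matched
    the identifier with its one regular-expression rule):

    /* Deremer-style split: the scanner knows only "identifier"; the
       screener reclassifies the ones that are really keywords. */
    #include <stdio.h>
    #include <string.h>

    enum { IDENT = 1, KW_IF, KW_WHILE, KW_RETURN };

    static const struct { const char *text; int token; } keywords[] = {
        { "if",     KW_IF     },
        { "while",  KW_WHILE  },
        { "return", KW_RETURN },
    };

    /* Called on every identifier the scanner hands over. */
    static int screen_identifier(const char *lexeme)
    {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(lexeme, keywords[i].text) == 0)
                return keywords[i].token;      /* it was a keyword after all */
        return IDENT;                          /* plain identifier */
    }

    int main(void)
    {
        printf("%d %d\n", screen_identifier("while"), screen_identifier("whale"));
        return 0;
    }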

    --------

    Now, since I know you are working on HTML or something similar, I will
    give one piece of advice where you want to use something a little more
    powerful. In markdown-type languages, you often have two distinct
    portions: the markdown, which is a little language that you want to
    tokenize, and the rest of the text, which you don't want to tokenize (or
    at least not the same way). That does suggest using lexer states (or
    some similar mechanism). You are really writing two lexers. One for
    inside markdown and another for the rest of the text. Don't try to
    mix them. The markdown is usually clearly marked, e.g. between < and >
    or on lines that start with a special character (#) or something
    similar. So, markdown is one lexer state (or complete lexer) with one
    set of rules, while the rest of the text is another lexer with a
    different set of rules, and the only tricky part is knowing when to
    switch between them (and in a good language, it is actually obvious).

    But, notice how that is a different implementation of separation of
    concerns. You have two lexers (or two lexer states). Each one doing
    one job and doing it simply. The hard part is the switching between
    them and that's not usually a complex regular expression, but the
    requirement to do something that is more "infrastructure" related.
    And, see how it answers your question about how to not lex "&amp;" in
    text outside of markdown. You are simply using a different lexer
    [state]. It doesn't know about "&amp;" and doesn't attempt to
    tokenize it.
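
    A rough sketch of that two-lexer structure in C, with everything (the
    mode names, the token codes, the choice of < and > as the switch
    points) standing in for whatever the real language uses:

    /* Two rule sets sharing one input: raw text is barely tokenized at
       all, markup between '<' and '>' gets its own stricter rules. */
    #include <stdio.h>

    enum mode  { TEXT, MARKUP };
    enum token { TOK_EOF, TOK_TEXT, TOK_TAG_WORD, TOK_TAG_OPEN, TOK_TAG_CLOSE };

    static const char *p  = "hello <b>world</b>!";   /* toy input */
    static enum mode mode = TEXT;

    static enum token next_token(void)
    {
        if (*p == '\0') return TOK_EOF;

        if (mode == TEXT) {                   /* the "rest of the text" lexer */
            if (*p == '<') { p++; mode = MARKUP; return TOK_TAG_OPEN; }
            while (*p && *p != '<') p++;      /* raw text: no tokenizing at all */
            return TOK_TEXT;
        }

        /* the markup lexer: a different, smaller set of rules */
        if (*p == '>') { p++; mode = TEXT; return TOK_TAG_CLOSE; }
        while (*p && *p != '>') p++;          /* toy: one word per tag body */
        return TOK_TAG_WORD;
    }

    int main(void)
    {
        for (enum token t = next_token(); t != TOK_EOF; t = next_token())
            printf("%d ", (int)t);
        printf("\n");
        return 0;
    }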

    ------

    Since I mentioned screeners above, I will talk a bit about them. The
    simplest model is a function you can call upon matching an identifier
    (or another token you want to screen, e.g. maybe you have comments
    that can be "Javadocs" or decorations and you want to process them).
    This function can look the token up (or lex and parse it) and decide
    to change the token type (or even insert multiple tokens) before
    handing the token from the lexer to the parser. Note, if the function
    is complicated, you might be essentially implementing a preprocessor.
    (So, "#.*\n" could be recognized as a token, but the screener for it
    is the "C preprocessor"). And, the point here is that it is
    sometimes useful to insert something between the lexer and the parser
    (and maybe only for specific tokens.) Another use of a screener is
    the C "typedef hack" where you look your identifier up in the symbol
    table and see if it is a typedef and return a different token for it
    in that case. That allows your screener to take feedback from the
    parser without exposing it down to the scanner. Still, notice the
    separation of concerns to keep each part simpler.
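
    A sketch of that typedef hack in C (the table, token codes, and
    function names are simplified inventions): the parser records typedef
    names as it sees them, and the screener consults that table, so the
    scanner itself never needs any feedback.

    #include <stdio.h>
    #include <string.h>

    enum { IDENTIFIER = 1, TYPE_NAME };

    #define MAX_TYPEDEFS 64
    static const char *typedef_names[MAX_TYPEDEFS];
    static int n_typedefs = 0;

    /* Called by the parser when it reduces a typedef declaration. */
    static void declare_typedef(const char *name)
    {
        if (n_typedefs < MAX_TYPEDEFS)
            typedef_names[n_typedefs++] = name;
    }

    /* Called between scanner and parser for every identifier token. */
    static int screen_identifier(const char *lexeme)
    {
        for (int i = 0; i < n_typedefs; i++)
            if (strcmp(lexeme, typedef_names[i]) == 0)
                return TYPE_NAME;
        return IDENTIFIER;
    }

    int main(void)
    {
        declare_typedef("size_t");
        /* "size_t x;" now lexes as TYPE_NAME IDENTIFIER instead of
           IDENTIFIER IDENTIFIER, which is what a C grammar needs. */
        printf("%d %d\n", screen_identifier("size_t"), screen_identifier("x"));
        return 0;
    }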

    -----

    Now, once you have a base simple lexer you are often done (or at least
    mostly done). Then, you can worry about the "mistakes". Comment
    processing often needs a more complicated expression than one would like.
    It's worse if you allow nested comments that need balanced delimiters,
    e.g. "/*" and "*/". Balanced delimiters form a Dyck language and take
    you out of the regular expression world. You can do them with lexer
    states (if you can stack them), but it is easier if you simply have a
    CFG language (LL, LR, or PEG) to do so. Captures and backreferences
    can sometimes do it, provided they have a stack-like quality.
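
    For the nested-comment case specifically, a depth counter is the
    minimal amount of "stack" the Dyck structure needs, since the only
    bracketing construct is the comment delimiter itself. A hedged C
    sketch (the function and its return convention are mine):

    /* Skips a nested block comment.  Returns a pointer just past it,
       the original pointer if there is no comment here, or NULL if the
       comment is unterminated. */
    #include <stdio.h>

    static const char *skip_nested_comment(const char *p)
    {
        int depth = 0;

        if (p[0] != '/' || p[1] != '*') return p;    /* not a comment */

        while (*p) {
            if (p[0] == '/' && p[1] == '*')      { depth++; p += 2; }
            else if (p[0] == '*' && p[1] == '/') {
                depth--; p += 2;
                if (depth == 0) return p;        /* matched the outermost pair */
            }
            else p++;
        }
        return NULL;                             /* ran off the end of input */
    }

    int main(void)
    {
        const char *after = skip_nested_comment("/* a /* nested */ one */x");
        printf("%s\n", after ? after : "unterminated");
        return 0;
    }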

    We had a similar issue in Yacc++: we handle nested braces as a single
    token, "code". That makes the parser simpler. That token has matched
    brackets, strings, and comments inside. It is much more complicated
    than one really wants. It is essentially a mini-parser inside our
    lexers. And, it means we can't really use braces as tokens
    themselves. It's a sign that as language designers we made a mistake.
    It probably would have been better to use a lexical state model: see
    a brace "{" and go into a separate lexer to handle it. Sometimes
    having a flame thrower is not the right solution.

    ------

    A slightly different mistake was made in the original yacc design.
    They made semicolons optional at the end of a rule. This means you
    [sometimes] need "identifier:" to recognize the start of a rule. That
    makes the lexer more complex. Other languages have played with
    treating newline as a semicolon, which has its own issues, but these are
    often at the parser level, not the lexer.

    I'm sure I have more to say on this topic, but also I have said more
    than enough....

    Kind regards,
    Chris

    ******************************************************************************
    Chris Clark                  email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
    23 Bailey Rd                 voice: (508) 435-5016
    Berlin, MA 01503 USA         twitter: @intel_chris
    ------------------------------------------------------------------------------
