• Has lexing and parsing theory advanced since the 1970's?

    From Roger L Costello@21:1/5 to All on Tue Sep 14 13:16:01 2021
    Lately I have been learning to use Flex & Bison. As I understand it, Flex & Bison (and other parser generators) are built on solid theory. As a result, those parser generators were created quickly and cheaply using structured, well-known techniques. Contrast that with parsers developed prior to the theory: the earliest parsers back in the 1950s used utterly ad hoc techniques to analyze the syntax of the source code of the programs they were parsing. During the 1960s the field got a lot of academic attention, and by the early 1970s parsing was no longer an arcane art. In the 1970s Aho, Ullman, Knuth, and many others put parsing techniques solidly on their theoretical feet.

    One of the first techniques they (Aho, Ullman, Knuth, and others) espoused was to separate lexing from parsing. Lexing built upon regular expressions, which in turn built upon the theory of finite automata (FA) and nondeterministic finite automata (NFA); the two were brilliantly melded together by the famous Kleene Theorem. Parsing built on top of a rich theory of grammars -- Context-Free Grammars, Context-Sensitive Grammars, etc. -- that Chomsky formulated. Neat!

    That said, Flex & Bison is old. Has lexing/parsing theory advanced since the 1970’s? If yes, are there parser generators available today which are based on those advances in lexing/parsing theory? Or does Flex & Bison still represent the state-of-the-art in terms of the underlying theory it uses?

    [Having been there in the 1970s, I can report that the linguists and the computer
    scientists were dimly aware of each other but didn't work together and used
    different terms; e.g., linguists say phrase structure grammars where we would say CFGs.
    Flex is a rewrite of lex, which was a Bell Labs summer project by Eric Schmidt to
    make it easier to turn regular expressions into DFAs (not NFAs). The code was
    awful; flex was a cleaner rewrite that avoided the AT&T software license. Bison
    was originally a GPL rewrite of yacc, but it has since added reentrant parsers
    and GLR parsing for ambiguous grammars. They knew about GLR in the 1970s, but
    Yacc had to run on a 16-bit PDP-11 and the data structures were too big. -John]

  • From Anton Ertl@21:1/5 to Roger L Costello on Thu Sep 16 17:09:02 2021
    Roger L Costello <costello@mitre.org> writes:
    That said, Flex & Bison is old. Has lexing/parsing theory advanced since the 1970s? If yes, are there parser generators available today which are based on those advances in lexing/parsing theory? Or does Flex & Bison still represent the state-of-the-art in terms of the underlying theory it uses?

    It seems to me that after the success of BNF (context-free grammars)
    with Algol 60, people wanted to go further, and Algol 68 used a Van
    Wijngaarden grammar (a two-level grammar). But this formalism has not
    been used for describing and implementing later languages, and even
    Algol 60 put more in the grammar (in particular, the type difference
    between flags and numbers) than later languages do. Most languages use
    just BNF and describe the rest of the language in prose; a few use a
    formal semantics, but I guess that's not what the question was about.
    So on the language-definition side, there have been no advances in
    grammar formalisms, because none are needed.

    On the implementation side, there have been many attribute grammar
    generators (e.g., Ox), but few uses. In the back end various more or
    less formal techniques have been used for instruction selection; e.g.,
    lcc uses the tree-parser generator lburg.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

  • From Kaz Kylheku@21:1/5 to Roger L Costello on Fri Sep 17 05:51:32 2021
    On 2021-09-14, Roger L Costello <costello@mitre.org> wrote:
    Lately I have been learning to use Flex & Bison. As I understand it, Flex & Bison (and other parser generators) are built on solid theory. As a result, those parser generators were created quickly and cheaply using structured, well-known techniques. Contrast with parsers developed prior to the theory: The earliest parser back in the 1950s used utterly ad hoc techniques to analyze the syntax of the source code of programs they were parsing. During the 1960s, the field got a lot of academic attention and by the early 1970s, parsing was no longer an arcane art. In the 1970s Aho, Ullman, Knuth, and many
    others put parsing techniques solidly on their theoretical feet.

    One of the first techniques they (Aho, Ullman, Knuth, and others) espoused was
    to separate lexing from parsing.

    This is for one particular pragmatic reason. The LALR(1) parsing
    technique uses one token of lookahead to make key parsing decisions.

    If you combine lexing with parsing, then some of your original lookahead symbols may turn into sequences of symbols, which share the same initial element.

    E.g., suppose you have a "++" token and a "+=" token, either of which can
    appear as the lookahead symbol that determines which way to reduce.

    If we don't separate lexing and parsing, so that we have a
    character-level grammar, then we no longer have "++" and "+="
    tokens; each of these is two grammar symbols.

    And so in that same reduce situation, we no longer have two different
    lookahead tokens for "++" and "+="; the lookahead symbol is just "+".

    Fixing this requires techniques like multiple symbols of lookahead:
    LR(k). (Not that that is new or anything.)
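
    By way of a rough illustration (a made-up Python sketch, not anyone's actual
    lexer): a separate lexer applies maximal munch, so the parser's single token
    of lookahead already distinguishes "++" from "+=":

        TWO_CHAR_TOKENS = {"++", "+="}

        def tokens(src):
            i = 0
            while i < len(src):
                if src[i:i+2] in TWO_CHAR_TOKENS:
                    yield src[i:i+2]        # longest match wins
                    i += 2
                elif src[i].isspace():
                    i += 1                  # whitespace discarded between tokens
                else:
                    yield src[i]
                    i += 1

        assert list(tokens("a ++ b += c")) == ["a", "++", "b", "+=", "c"]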

    There are also certain efficiencies to be had from lexing as a separate pass:
    a dedicated lexical analyzer can recognize tokens blindingly fast,
    using buffering techniques integrated with the recognizer.

    Lexing built upon regular expressions, which
    built upon Finite Automata (FA) theory and Nondeterministic Finite Automata (NFA) theory. FA and NFA were brilliantly melded together with the famous Kleene Theorem. Parsing built on top of a rich theory of grammars -- Context Free Grammars, Context Sensitive Grammars, etc. -- that Chomsky formulated. Neat!

    However, LR parsing is built on regular expressions again, with
    some sophistication thrown in to handle nesting.

    You know from ad hoc programming with regexes that you can handle some
    forms of nesting if you invoke a regex more than once on a string.

    For instance, balanced parentheses can be recognized by a regex-driven
    algorithm if we use a regex to recognize every instance of an opening
    parenthesis, followed by non-parenthesis characters [^()], followed by a
    closing parenthesis, then reduce the input by removing each such match,
    and iterate.
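
    A rough Python sketch of that idea (a toy illustration only):

        import re

        def balanced_parens(s):
            # Repeatedly delete every innermost "( non-parens )" group;
            # each re.sub pass is one round of "reduction".
            prev = None
            while prev != s:
                prev = s
                s = re.sub(r"\([^()]*\)", "", s)
            # balanced iff no stray parenthesis survives the reductions
            return "(" not in s and ")" not in s

        assert balanced_parens("a(b(c)d)e")
        assert not balanced_parens("(()")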

    That sort of gives us an intuition behind LR parsing. At the core of an
    LR parser is a state machine, and if we ignore its push-down aspects
    (how its actions maintain a stack), it's just a finite state machine
    that can be described by a regex. The terminating states of the regex
    basically rewrite something in the stack of symbols, or push something
    onto it, and then choose some state to jump to in order to iterate.

    That said, Flex & Bison is old. Has lexing/parsing theory advanced since the 1970’s? If yes, are there parser generators available today which are based on
    those advances in lexing/parsing theory? Or does Flex & Bison still represent the state-of-the-art in terms of the underlying theory it uses?

    Firstly, Flex and Bison have enough power in them to handle
    state-of-the-art *languages*.

    Parsing theory has long outstripped the practical realm.

    Loosely speaking, if you design a computer language whose syntax is so
    complicated, with such difficult ambiguities, that it benefits from being
    parsed by some technique only available from a paper dated 2020, it's
    probably an awful language from a syntactic point of view, one which
    ill serves its users in that regard.

    (Look at C++, by the way; it's famous for having a large, awful grammar.
    Yet the GNU C++ implementation is a hand-maintained recursive-descent
    parser that's about a megabyte and a half of C++ code, all in one file.
    So for all the complexity, C++ can be handled using approaches that take
    advantage of approximately zero advances in parsing since 1965.)

    Now, Bison is actively developed. There are avenues of improvement in
    dimensions other than parsing theory. Firstly, Bison does have some
    theoretical muscle in it in the way of GLR parsing for ambiguous grammars.

    There are other pragmatic improvements in it like support for push
    parsers, and re-entrant parsers.

    It has support for multiple programming languages. Its C++ support goes
    as far as supporting AST nodes that are objects which can have
    destructors.

    Fairly recently, support was added to Bison for multiple start symbols
    in the same grammar, so you can parse just a subset of a language.

    Another thing I believe I saw on the mailing list recently is that
    contributions were merged into Bison which allow it to generate
    counterexamples for ambiguities, so that it's easier to debug bad
    grammars. Previous implementations of Yacc-like parsers just give you
    diagnostics like "state 123: reduce-reduce conflict", whose root cause
    can be hard to unravel.

    [Having been there in the 1970s, I can report that the linguists and the computer
    scientists were dimly aware of each other but didn't work together and used
    different terms; e.g., linguists say phrase structure grammars where we would say CFGs.
    Flex is a rewrite of lex, which was a Bell Labs summer project by Eric Schmidt to
    make it easier to turn regular expressions into DFAs (not NFAs). The code was
    awful; flex was a cleaner rewrite that avoided the AT&T software license. Bison
    was originally a GPL rewrite of yacc, but it has since added reentrant parsers
    and GLR parsing for ambiguous grammars. They knew about GLR in the 1970s, but
    Yacc had to run on a 16-bit PDP-11 and the data structures were too big. -John]

    Bison was originally written by the same person, Robert Corbett, who
    went on to write Berkeley Yacc.

    https://en.wikipedia.org/wiki/Berkeley_Yacc

    Berkeley Yacc is far less actively maintained than Bison and has fewer
    features. It does support reentrant parsing in a way that is roughly
    compatible with Bison 2.5.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    [No coincidence that it was the same guy, since bison was a fork of
    byacc.
    The lookahead example of ++ vs += isn't very good, since an LALR
    parser only needs to look ahead when it reduces, but in general you're
    right, it'd be a problem. A much worse problem in combined parsers is
    getting rid of noise like comments and white space. In a separate lexer
    you can easily discard them between tokens; in a combined lex/parse you
    have to include them everywhere they can occur. -John]

  • From Christopher F Clark@21:1/5 to Roger L Costello on Sat Sep 18 15:54:39 2021
    Roger L Costello <costello@mitre.org> asked:

    That said, Flex & Bison is old. Has lexing/parsing theory advanced since the 1970’s? If yes, are there parser generators available today which are based on
    those advances in lexing/parsing theory? Or does Flex & Bison still represent
    the state-of-the-art in terms of the underlying theory it uses?

    Yes, compiler theory has advanced since the 1970s.

    Multiple token lookahead is now supported by a variety of compiler
    generators.

    And many modern parser generators allow regular expressions on the
    right-hand side of rules. Conversely, you see lexer generators that
    accept context-free (recursive) rules to define tokens (while still
    allowing regular expressions).

    One of the key ways you often see multiple-token lookahead implemented
    is in the form of predicated grammars. These are of the form: if you
    see this, try applying this rule first, and if it doesn't match, try
    other rules. Often the lookahead for them is the complete rule, i.e.
    try this rule and if it matches use it, which is often a form of
    backtrack parsing. This has evolved into "Parsing Expression Grammars"
    (PEGs), where you list the rules in the order you want them to match.
    And "packrat parsing" has turned this into a practical method with
    [near?] linear performance. Thus, in most cases, you aren't paying an
    N-squared penalty for backtracking.
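
    As a rough sketch of the packrat idea (a made-up toy grammar, in Python,
    purely for illustration): ordered choice plus memoization of
    (rule, position) pairs is what keeps the backtracking roughly linear.

        from functools import lru_cache

        # Toy PEG:  Expr <- Term ('+' Term)*    Term <- '(' Expr ')' / [0-9]+
        # Each function returns the position after a match, or None on failure.
        TEXT = "(1+2)+3"

        @lru_cache(maxsize=None)            # memoization = "packrat"
        def parse_expr(pos):
            pos = parse_term(pos)
            while pos is not None and pos < len(TEXT) and TEXT[pos] == "+":
                nxt = parse_term(pos + 1)
                if nxt is None:
                    break
                pos = nxt
            return pos

        @lru_cache(maxsize=None)
        def parse_term(pos):
            # ordered choice: try the parenthesized form first, then digits
            if pos < len(TEXT) and TEXT[pos] == "(":
                inner = parse_expr(pos + 1)
                if inner is not None and inner < len(TEXT) and TEXT[inner] == ")":
                    return inner + 1
            end = pos
            while end < len(TEXT) and TEXT[end].isdigit():
                end += 1
            return end if end > pos else None

        assert parse_expr(0) == len(TEXT)   # the whole input is an Expr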

    You see a particularly interesting evolution of that in ANTLR4, where
    you can apply precedence and associativity rules to grammars with left
    recursion and still, in most cases, get an LL-like parser out as the
    generated result. However, it interprets most rules in a PEG-like
    fashion, giving an implicit precedence to them. The result is grammars
    that are relatively easy to read and mostly do the obvious correct
    thing.

    The GLR approach of building a parse forest is an alternative to that.
    It won't be long before someone combines the two approaches: you could
    specify the order in which to try productions, but the generator would
    recognize cases where two or more apply and keep enough information
    around to allow the choice to be made at reduction time, while letting
    one say, in the case of ambiguity, "I want this choice to apply", or
    make the choice be "both" and let the semantics disambiguate it.

    Beyond that, there are extensions to what language class the parser
    generators allow.

    To my mind, the most notable one is adaptive grammars. These are
    extensions that allow one to add to or modify the rules the grammar
    supports while the parser is running. The most obvious example is
    adding new "types" or "keywords" to the grammar. This is a more
    sophisticated solution to the typedef problem, and it can result in
    grammars where you have properly scoped type analysis done entirely as
    part of parsing.
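
    For a feel of what that buys you, here is a deliberately tiny sketch
    (invented Python names, not any real tool's API) of the classic feedback
    trick that adaptive grammars generalize: the set of type names grows
    while parsing, and the classification of later tokens changes accordingly.

        type_names = {"int", "char"}          # grows as declarations are seen

        def classify(word):
            # the lexer consults parser-maintained state to pick a token kind
            return "TYPE_NAME" if word in type_names else "IDENTIFIER"

        def handle_typedef(decl):
            # e.g. "typedef int size_t;" introduces a new type name mid-parse
            parts = decl.replace(";", "").split()
            if parts and parts[0] == "typedef":
                type_names.add(parts[-1])

        assert classify("size_t") == "IDENTIFIER"
        handle_typedef("typedef int size_t;")
        assert classify("size_t") == "TYPE_NAME"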

    Another variant is conjunctive grammars. These allow you to take the
    intersection of two rules and say: match only if the intersection
    matches. This allows one to make finer-grained decisions on what the
    legal sentences are. Its effect is similar to (but often more general
    than) that of predicated grammars, and similar techniques can be used
    to implement it.
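
    As a very loose illustration (a hand-rolled Python check, not a real
    conjunctive-grammar tool), the idea is that an input is accepted only if
    every conjunct accepts it:

        import re

        def is_identifier(s):
            return re.fullmatch(r"[A-Za-z_]\w*", s) is not None

        def is_not_keyword(s):
            return s not in {"if", "while", "return"}

        def conjunction(s, *conjuncts):
            # "A & B": the string must be in the language of every conjunct
            return all(c(s) for c in conjuncts)

        assert conjunction("count", is_identifier, is_not_keyword)
        assert not conjunction("while", is_identifier, is_not_keyword)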

    Finally, the regular expression model is amenable to extension with
    the PCRE operators, in particular captures and back-references, but
    also assertions (which are essentially predicates) and greedy versus
    non-greedy matching. I have not yet seen a parser generator that
    implements that, but it is actually a fairly small tweak to LR
    matching: captures are simply a special form of non-terminal where all
    occurrences (within a scope) need to match the same "string".
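
    In ordinary PCRE-style regexes the same-string constraint looks like this
    (a throwaway Python example); the suggestion above is to lift that
    behavior into a grammar non-terminal:

        import re

        # the back-reference \1 must match exactly the string the capture matched
        pat = re.compile(r"<(\w+)>.*</\1>", re.S)

        assert pat.fullmatch("<b>bold</b>")
        assert not pat.fullmatch("<b>bold</i>")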

    ---------

    However, beyond all of that, which amounts to tweaks to the lexing and
    parsing engines and generators, there is a bigger change. Parser
    generator writers have realized that simply getting a correct parse is
    only a small fragment of building a front-end.

    In particular, one doesn't simply want a raw parse tree. One wants an
    AST that discards irrelevant information and, more particularly,
    highlights relevant information and makes transformations and
    traversals easier.

    There have been several advancements along that line.

    Steve Johnson (the original yacc author) wrote a paper, "Yacc Meets
    C++", which shows how to turn an AST into C++ classes with relevant
    methods. The result is a very object-oriented view of what an AST is.

    Alternatively, ANTLR incorporates the Visitor and Listener design
    patterns into the parse trees it builds. This makes semantic
    traversals much easier.
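
    A minimal visitor sketch (hypothetical Python classes, nothing
    ANTLR-specific) shows the shape of such a traversal:

        # each node type dispatches to a matching visitor method
        class Num:
            def __init__(self, value):
                self.value = value
            def accept(self, visitor):
                return visitor.visit_num(self)

        class Add:
            def __init__(self, left, right):
                self.left, self.right = left, right
            def accept(self, visitor):
                return visitor.visit_add(self)

        class Evaluator:
            def visit_num(self, node):
                return node.value
            def visit_add(self, node):
                return node.left.accept(self) + node.right.accept(self)

        assert Add(Num(1), Add(Num(2), Num(3))).accept(Evaluator()) == 6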

    There are also tools like sourcer, which treat parse trees as
    s-expressions and work something like hygienic macros on them,
    allowing one to write transformation rules to reshape the tree as
    desired.

    Even more along those lines are the "pattern matching" (destructuring) operators found in many functional languages. Those also allow one to
    transform trees at the semantic level.
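
    For instance, a tree rewrite with structural pattern matching might look
    like this (a small Python sketch, with tuples standing in for AST nodes):

        # Python 3.10+ match statement used for destructuring tree nodes
        def simplify(node):
            match node:
                case ("+", left, 0) | ("+", 0, left):
                    return simplify(left)          # x + 0  ->  x
                case ("*", left, 1) | ("*", 1, left):
                    return simplify(left)          # x * 1  ->  x
                case (op, left, right):
                    return (op, simplify(left), simplify(right))
                case _:
                    return node                    # leaves pass through

        assert simplify(("+", ("*", "x", 1), 0)) == "x"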

    Finally, much of this has gone to formalizing more standardized IRs.
    Much of the parsing work is now simply taking the input and reshaping
    it to match an IR used by a sophisticated backend.

    --
    Chris Clark                    email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.       Web Site: http://world.std.com/~compres
    23 Bailey Rd                   voice: (508) 435-5016
    Berlin, MA 01503 USA           twitter: @intel_chris

  • From Ev Drikos@21:1/5 to Roger L Costello on Wed Sep 29 05:07:52 2021
    On Thursday, September 16, 2021 at 7:56:25 PM UTC+3, Roger L Costello wrote:

    That said, Flex & Bison is old. Has lexing/parsing theory advanced since the 1970’s? If yes, are there parser generators available today which are based on
    those advances in lexing/parsing theory? Or does Flex & Bison still represent the state-of-the-art in terms of the underlying theory it uses?

    Hello,

    The routines that recognize tokens are still called scanners, and those
    that parse the input are still called parsers. That said, this long
    list may give a clue of what an answer to your last question might be: https://en.wikipedia.org/wiki/Comparison_of_parser_generators

    Yet, I haven't used most of them. So, I'll give you an example with
    Syntaxis, a tool I've coded that isn't included in the above list.

    If we try to parse this erroneous Fortran line with the command 'fcheck' (binary available at https://github.com/drikosev/Fortran) we see that
    the expected tokens in the error message contain both a '=' and a name:

    program ? ; end

    Note that the command 'fcheck' uses a deterministic parser (built by
    Syntaxis) and the expected tokens in an error message are pre-computed.

    To my knowledge, the ability of a parser to shift two distinct terminals
    simultaneously in one transition isn't an advancement in theory, but
    I guess several tools mentioned in the Wikipedia list above possibly
    provide similar or better goodies (e.g., Chris Clark described an ANTLR4
    feature that advances the theory directly).

    Regards,
    Ev. Drikos
