• About finding the start symbol of a grammar

    From Eduardo Costa@21:1/5 to All on Fri May 21 03:49:43 2021
    Hey guys,

    I've been lately dealing with a parser generator for LL grammars, and since it's inception I've always been blindy assuming the first element read from within the input file is going to be the start symbol or starting rule.

    So I've been wondering all this time, just out of curiosity, if there exists a method or algorithm to find out the start symbol of a given grammar?

    I guess the answer is no.

    While there would exist grammars we could recursively check to find out which it's start symbol is (i.e.: it's the only rule that used the rest of them, where checking every other resulted in dangling rules that weren't even called in), there might be other grammars for which more than one rule yields full coverage (all of these obviously defining different languages) and so leading to ambiguity.

    I only contemplate a simple coverage test, even though other techniques could exist, again, all of them leading to a point where we couldn't ascertain if
    one or the other is what the user meant.

    So I'm wondering if this is even an issue in production-grade
    parser-generators out there?

    Regards,
    [yacc and its descendants have an explicit %start declaration, usually defaulting to
    the first rule in the file. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Hans-Peter Diettrich@21:1/5 to Eduardo Costa on Fri May 21 17:02:34 2021
    On 5/21/21 12:49 PM, Eduardo Costa wrote:
    I've been lately dealing with a parser generator for LL grammars, and since it's inception I've always been blindy assuming the first element read from within the input file is going to be the start symbol or starting rule.

    So I've been wondering all this time, just out of curiosity, if there exists a
    method or algorithm to find out the start symbol of a given grammar?

    Graph analysis methods exist to find unreachable nodes which can become
    start symbols. In short any node that is a predecessor of *all* nodes
    can be a start symbol. If no such node exists then the grammar is faulty (discontiguous).

    While there would exist grammars we could recursively check to find out which it's start symbol is (i.e.: it's the only rule that used the rest of them, where checking every other resulted in dangling rules that weren't even called
    in), there might be other grammars for which more than one rule yields full coverage (all of these obviously defining different languages) and so leading to ambiguity.

    IMO this problem can be solved by introduction of an artificial start
    symbol that allows to reach all other symbols but can not be reached
    itself. Please note that this solution solves a syntactic problem but
    may not prevent or even cause semantic problems.

    I only contemplate a simple coverage test, even though other techniques could exist, again, all of them leading to a point where we couldn't ascertain if one or the other is what the user meant.

    So I'm wondering if this is even an issue in production-grade parser-generators out there?

    A useful parser generator should include checks for grammar sanity.

    DoDi

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Eduardo Costa on Fri May 21 14:14:37 2021
    On 2021-05-21, Eduardo Costa <ecosta.tmp@gmail.com> wrote:
    Hey guys,

    I've been lately dealing with a parser generator for LL grammars, and since it's inception I've always been blindy assuming the first element read from within the input file is going to be the start symbol or starting rule.

    So I've been wondering all this time, just out of curiosity, if there exists a
    method or algorithm to find out the start symbol of a given grammar?

    I guess the answer is no.

    Surely, the start symbol of a context-free grammar is one which appears
    only in the left hand side of a rule. If there is such a unique symbol,
    it must be /the/ start symbol.

    While there would exist grammars we could recursively check to find out which it's start symbol is (i.e.: it's the only rule that used the rest of them, where checking every other resulted in dangling rules that weren't even called
    in), there might be other grammars for which more than one rule yields full coverage (all of these obviously defining different languages) and so leading to ambiguity.

    Ambiguity doesn't imply there is no algorithm to find a start symbol,
    but that the algorithm must be able to report situations like the
    presence of multiple start symbols, or no start symbols.

    On the face of it, this problem does not smell of undecidability, or
    even NP completeness. Where do you suspect is the difficulty?

    It seems like this is a fairly trivial property of a graph, type of
    thing.

    Whether rules are dangling is also a graph property: is the graph
    connected.

    I only contemplate a simple coverage test, even though other techniques could exist, again, all of them leading to a point where we couldn't ascertain if one or the other is what the user meant.

    But tha seems like an identifiable point where the algorithm can stop
    and announce a decision. Then diagnostics can be issued.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    [I have seen useful grammars where the start symbol also appears in the RHS of a rule.
    Think of the standard expression grammar.

    You pick the start symbol that gives you the language you want to parse. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Christopher F Clark@21:1/5 to All on Sat May 22 12:14:08 2021
    As Dodi noted, this is basically a graph analysis problem and the
    graph may be disconnected (a forest). And our moderator has added
    several insightful comments. E.g. you can "declare" a start symbol
    and if not present default to some symbol, either the first one in the
    grammar, or some symbol from which all other symbols are reachable
    (presuming the graph isn't disconnected), and the start symbol can be recursively defined, etc.

    However, there is one particular curious aspect if you are writing a
    translator to a recursive descent parser, one generally makes a
    function of each rule, as a result one can consider each symbol a
    start symbol for whatever sub-graph is reachable from it. With a
    table driven parser, one has to make a table of entries into the
    parsing table to achieve the same effect, but that is not difficult to
    do, although that may require additional table rows if the symbol
    behaves slightly differently when used as a start symbol rather than
    in the context of other rules (e.g. follow symbols).

    So, in that sense, a start symbol is simply what one wants to parse.

    -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres
    23 Bailey Rd voice: (508) 435-5016
    Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ev. Drikos@21:1/5 to Eduardo Costa on Sat May 22 06:52:17 2021
    On 21/05/2021 13:49, Eduardo Costa wrote:
    While there would exist grammars we could recursively check to find out which it's start symbol is (i.e.: it's the only rule that used the rest of them, where checking every other resulted in dangling rules that weren't even called
    in), there might be other grammars for which more than one rule yields full coverage (all of these obviously defining different languages) and so leading to ambiguity.

    IMHO, it can be so simple as you describe here without important overhead. Typically, a parser will reduce the start symbol and finish. All rules
    that yield full coverage can be ie alternatives of a single root symbol:

    RootSymbol -> R1 | R2 | ... | Rn

    Ev. Drikos

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Kaz Kylheku on Fri May 21 15:32:22 2021
    Kaz Kylheku <563-365-8930@kylheku.com> writes:
    Surely, the start symbol of a context-free grammar is one which appears
    only in the left hand side of a rule. If there is such a unique symbol,
    it must be /the/ start symbol.

    It could be a now-unused nonterminal, while the start symbol is part
    of a strongly connected component of the graph.

    If you have one nonterminal that dominates all other nonterminals (the
    other nonterminals can be derived from it, but not the other way
    round), it probably is the start symbol. Why "probably"? There is
    still the possibility that there is a wrapper rule around the real
    start symbol that was used for debugging and is left in the grammar.

    On the face of it, this problem does not smell of undecidability, or
    even NP completeness. Where do you suspect is the difficulty?

    It's easy to find nodes with particular properties in a graph. But
    there is no guarantee that the result really is the start symbol.

    There is a reason why you specify the start symbol in some way.

    Whether rules are dangling is also a graph property: is the graph
    connected.

    "Connected" is an undirected-graph property. If a nonterminal is
    unreachable from the start symbol, it can still be connected to the
    reachable graph through a RHS-nonterminal.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to Christopher F Clark on Sat May 22 13:17:25 2021
    On Saturday, May 22, 2021 at 10:24:24 AM UTC-7, Christopher F Clark wrote:
    As Dodi noted, this is basically a graph analysis problem and the
    graph may be disconnected (a forest). And our moderator has added
    several insightful comments. E.g. you can "declare" a start symbol
    and if not present default to some symbol, either the first one in the grammar, or some symbol from which all other symbols are reachable
    (presuming the graph isn't disconnected), and the start symbol can be recursively defined, etc.

    Seems to me that this should be related to the problem of finding the
    root of a phylogenetic tree.

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7149615/

    Unlike CS trees, there is no necessary directionality to the links, and so finding the root is more difficult. Yet biologists have some desire to
    know where the root is.

    But as also noted above, in the case of recursion, it depends on the language.

    In most languages, <expression> is recursive, allowing for

    '(' <expression> ')'

    however, a language (though I don't know of any) could require all expressions to be parenthesized, in which case the start would be the parenthesized form.

    [I think previous messagees have made it clear that while this is an interesting exercise, its only practical use is to see if the start
    symbol declared in the grammar is different from the computed one, in
    which case the grammar is broken. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)