Forum: >>> Magnum BBS <<<

About finding the start symbol of a grammar

From Eduardo Costa@21:1/5 to All on Fri May 21 03:49:43 2021

Hey guys,

I've been lately dealing with a parser generator for LL grammars, and since it's inception I've always been blindy assuming the first element read from within the input file is going to be the start symbol or starting rule.

So I've been wondering all this time, just out of curiosity, if there exists a method or algorithm to find out the start symbol of a given grammar?

I guess the answer is no.

While there would exist grammars we could recursively check to find out which it's start symbol is (i.e.: it's the only rule that used the rest of them, where checking every other resulted in dangling rules that weren't even called in), there might be other grammars for which more than one rule yields full coverage (all of these obviously defining different languages) and so leading to ambiguity.

I only contemplate a simple coverage test, even though other techniques could exist, again, all of them leading to a point where we couldn't ascertain if
one or the other is what the user meant.

So I'm wondering if this is even an issue in production-grade
parser-generators out there?

Regards,
[yacc and its descendants have an explicit %start declaration, usually defaulting to
the first rule in the file. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Hans-Peter Diettrich@21:1/5 to Eduardo Costa on Fri May 21 17:02:34 2021

On 5/21/21 12:49 PM, Eduardo Costa wrote:

I've been lately dealing with a parser generator for LL grammars, and since it's inception I've always been blindy assuming the first element read from within the input file is going to be the start symbol or starting rule.

So I've been wondering all this time, just out of curiosity, if there exists a
method or algorithm to find out the start symbol of a given grammar?

Graph analysis methods exist to find unreachable nodes which can become
start symbols. In short any node that is a predecessor of *all* nodes
can be a start symbol. If no such node exists then the grammar is faulty (discontiguous).

While there would exist grammars we could recursively check to find out which it's start symbol is (i.e.: it's the only rule that used the rest of them, where checking every other resulted in dangling rules that weren't even called
in), there might be other grammars for which more than one rule yields full coverage (all of these obviously defining different languages) and so leading to ambiguity.

IMO this problem can be solved by introduction of an artificial start
symbol that allows to reach all other symbols but can not be reached
itself. Please note that this solution solves a syntactic problem but
may not prevent or even cause semantic problems.

I only contemplate a simple coverage test, even though other techniques could exist, again, all of them leading to a point where we couldn't ascertain if one or the other is what the user meant.

So I'm wondering if this is even an issue in production-grade parser-generators out there?

A useful parser generator should include checks for grammar sanity.

DoDi

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kaz Kylheku@21:1/5 to Eduardo Costa on Fri May 21 14:14:37 2021

On 2021-05-21, Eduardo Costa <ecosta.tmp@gmail.com> wrote:

Hey guys,

I've been lately dealing with a parser generator for LL grammars, and since it's inception I've always been blindy assuming the first element read from within the input file is going to be the start symbol or starting rule.

So I've been wondering all this time, just out of curiosity, if there exists a
method or algorithm to find out the start symbol of a given grammar?

I guess the answer is no.

Surely, the start symbol of a context-free grammar is one which appears
only in the left hand side of a rule. If there is such a unique symbol,
it must be /the/ start symbol.

While there would exist grammars we could recursively check to find out which it's start symbol is (i.e.: it's the only rule that used the rest of them, where checking every other resulted in dangling rules that weren't even called
in), there might be other grammars for which more than one rule yields full coverage (all of these obviously defining different languages) and so leading to ambiguity.

Ambiguity doesn't imply there is no algorithm to find a start symbol,
but that the algorithm must be able to report situations like the
presence of multiple start symbols, or no start symbols.

On the face of it, this problem does not smell of undecidability, or
even NP completeness. Where do you suspect is the difficulty?

It seems like this is a fairly trivial property of a graph, type of
thing.

Whether rules are dangling is also a graph property: is the graph
connected.

I only contemplate a simple coverage test, even though other techniques could exist, again, all of them leading to a point where we couldn't ascertain if one or the other is what the user meant.

But tha seems like an identifiable point where the algorithm can stop
and announce a decision. Then diagnostics can be issued.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[I have seen useful grammars where the start symbol also appears in the RHS of a rule.
Think of the standard expression grammar.

You pick the start symbol that gives you the language you want to parse. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christopher F Clark@21:1/5 to All on Sat May 22 12:14:08 2021

As Dodi noted, this is basically a graph analysis problem and the
graph may be disconnected (a forest). And our moderator has added
several insightful comments. E.g. you can "declare" a start symbol
and if not present default to some symbol, either the first one in the
grammar, or some symbol from which all other symbols are reachable
(presuming the graph isn't disconnected), and the start symbol can be recursively defined, etc.

However, there is one particular curious aspect if you are writing a
translator to a recursive descent parser, one generally makes a
function of each rule, as a result one can consider each symbol a
start symbol for whatever sub-graph is reachable from it. With a
table driven parser, one has to make a table of entries into the
parsing table to achieve the same effect, but that is not difficult to
do, although that may require additional table rows if the symbol
behaves slightly differently when used as a start symbol rather than
in the context of other rules (e.g. follow symbols).

So, in that sense, a start symbol is simply what one wants to parse.

-- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ev. Drikos@21:1/5 to Eduardo Costa on Sat May 22 06:52:17 2021

On 21/05/2021 13:49, Eduardo Costa wrote:

While there would exist grammars we could recursively check to find out which it's start symbol is (i.e.: it's the only rule that used the rest of them, where checking every other resulted in dangling rules that weren't even called
in), there might be other grammars for which more than one rule yields full coverage (all of these obviously defining different languages) and so leading to ambiguity.

IMHO, it can be so simple as you describe here without important overhead. Typically, a parser will reduce the start symbol and finish. All rules
that yield full coverage can be ie alternatives of a single root symbol:

RootSymbol -> R1 | R2 | ... | Rn

Ev. Drikos

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Kaz Kylheku on Fri May 21 15:32:22 2021

Kaz Kylheku <563-365-8930@kylheku.com> writes:

Surely, the start symbol of a context-free grammar is one which appears
only in the left hand side of a rule. If there is such a unique symbol,
it must be /the/ start symbol.

It could be a now-unused nonterminal, while the start symbol is part
of a strongly connected component of the graph.

If you have one nonterminal that dominates all other nonterminals (the
other nonterminals can be derived from it, but not the other way
round), it probably is the start symbol. Why "probably"? There is
still the possibility that there is a wrapper rule around the real
start symbol that was used for debugging and is left in the grammar.

On the face of it, this problem does not smell of undecidability, or
even NP completeness. Where do you suspect is the difficulty?

It's easy to find nodes with particular properties in a graph. But
there is no guarantee that the result really is the start symbol.

There is a reason why you specify the start symbol in some way.

Whether rules are dangling is also a graph property: is the graph
connected.

"Connected" is an undirected-graph property. If a nonterminal is
unreachable from the start symbol, it can still be connected to the
reachable graph through a RHS-nonterminal.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@21:1/5 to Christopher F Clark on Sat May 22 13:17:25 2021

On Saturday, May 22, 2021 at 10:24:24 AM UTC-7, Christopher F Clark wrote:

As Dodi noted, this is basically a graph analysis problem and the
graph may be disconnected (a forest). And our moderator has added
several insightful comments. E.g. you can "declare" a start symbol
and if not present default to some symbol, either the first one in the grammar, or some symbol from which all other symbols are reachable
(presuming the graph isn't disconnected), and the start symbol can be recursively defined, etc.

Seems to me that this should be related to the problem of finding the
root of a phylogenetic tree.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7149615/

Unlike CS trees, there is no necessary directionality to the links, and so finding the root is more difficult. Yet biologists have some desire to
know where the root is.

But as also noted above, in the case of recursion, it depends on the language.

In most languages, <expression> is recursive, allowing for

'(' <expression> ')'

however, a language (though I don't know of any) could require all expressions to be parenthesized, in which case the start would be the parenthesized form.

[I think previous messagees have made it clear that while this is an interesting exercise, its only practical use is to see if the start
symbol declared in the grammar is different from the computed one, in
which case the grammar is broken. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	293
Nodes:	16 (2 / 14)
Uptime:	242:06:45
Calls:	6,624
Files:	12,175
Messages:	5,320,145

About finding the start symbol of a grammar

Who's Online

System Info