I've got my project successfully parsing the circa-1975 C syntax
from that old manual. I'd like to add parsers for K&R1 and c90
syntaxes.
How separate should these be? Should they be complete
separate grammars, or more piecewise selection?
My feeling is that separating them will be less headache, but maybe
there's some advantage to changing out smaller pieces of the grammar
in that it might be easier to make sure that they produce the same
structure compatible with the backend.
Any guidance in this area?
https://github.com/luser-dr00g/pcomb/blob/master/pc9syn.c
On Thursday, 13 August 2020 at 00:32:56 UTC+2, luser droog wrote:
I've got my project successfully parsing the circa-1975 C syntax
from that old manual. I'd like to add parsers for K&R1 and c90
syntaxes.
How separate should these be? Should they be complete
separate grammars, or more piecewise selection? ...
Why not settle for one master dialect and use awk to translate between dialects?
[Probably because there is a great deal of C code written to comply with
the various versions of the standard, users want error messages that match the code they wrote rather than some intermediate code, and it would be quite an awk program that could reconcile all the differences among C flavors. -John]
On 13 August 2020 at 00:20, luser droog wrote:
I've got my project successfully parsing the circa-1975 C syntax
from that old manual. I'd like to add parsers for K&R1 and c90
syntaxes.
How separate should these be? Should they be complete
separate grammars, or more piecewise selection?
IMO this depends largely on the use made of the parser output
(diagnostics, backend...). C90 is much stricter than K&R and requires
more checks. Do you need extensive error diagnostics, or do you assume
that all source code is free of errors?
https://github.com/luser-dr00g/pcomb/blob/master/pc9syn.c
You seem to implement an LL(1) parser? My C98 parser is LL(2), i.e. an
LL(1) parser with one or two locations where more lookahead is required.
Also, identifiers are classified as typenames or other identifiers
before they are used.
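Classifying identifiers as typenames before the parser sees them is the classic "lexer hack" for C. A minimal self-contained sketch of the idea, with hypothetical names that are not taken from pc9syn.c:

```c
/* Sketch of the "lexer hack": the lexer consults a symbol table so
   that identifiers already declared via typedef come back as TYPENAME
   tokens, keeping most of the grammar LL(1). Illustrative only. */
#include <string.h>

enum token { IDENT, TYPENAME };

#define MAX_TYPEDEFS 64
static const char *typedef_names[MAX_TYPEDEFS];
static int n_typedefs;

/* Called by the parser when it reduces a typedef declaration. */
static void declare_typedef(const char *name) {
    if (n_typedefs < MAX_TYPEDEFS)
        typedef_names[n_typedefs++] = name;
}

/* Called by the lexer to classify an identifier before handing
   it to the parser. */
static enum token classify(const char *name) {
    for (int i = 0; i < n_typedefs; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TYPENAME;
    return IDENT;
}
```

Once the parser registers a typedef, every later occurrence of that spelling lexes as TYPENAME, which resolves the declaration-versus-expression ambiguity without extra lookahead.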
For real-world testing (recommended!) a preprocessor is required and a
copy of the standard libraries of existing compiler(s).
Your test_syntax() source omits "=" in the variable declarations
(initializers). What about pointer syntax/semantics? If you add these
(and other) syntax differences conditionally (version-specific) to your
code, which way would look better to you? Which way would be safer to
maintain?
Nice code BTW :-)
It may be useful to consider what you would like to happen when you encounter syntax that is ambiguous, that works differently, or that belongs to a dialect other than the one you are parsing: produce a warning or an error, handle it quietly, fall over, or not care.
My friend, reporting the furthest position examined by the parser is something I have found useful in error cases, as a simple stopgap when using a combinator approach. Thinking about it, what you really want to see is the furthest failed position and the stack of rules above it. That requires meta-information when the code is written in the most natural way. For this reason and others, I believe it is good to represent your grammar in data structures, which is a further step in the direction of a compiler-compiler tool (or compiler-interpreter tool).
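The furthest-failed-position idea can be retrofitted onto combinators with a simple high-water mark. A hedged sketch in C; the names and the toy literal-matching "parser" are illustrative, not how pcomb records its state:

```c
/* Sketch of "furthest position" error reporting: every failing
   parser records the high-water mark of input it examined, plus
   the rule name active at that point. Hypothetical names. */
#include <stddef.h>

static size_t furthest;            /* furthest offset reached   */
static const char *furthest_rule;  /* rule active at that point */

static void note_failure(size_t pos, const char *rule) {
    if (pos >= furthest) {
        furthest = pos;
        furthest_rule = rule;
    }
}

/* A toy "parser": match a literal string at *pos, advancing on
   success and recording the failure point otherwise. */
static int lit(const char *input, size_t *pos, const char *s,
               const char *rule) {
    size_t i = *pos;
    for (; *s; s++, i++)
        if (input[i] != *s) {
            note_failure(i, rule);
            return 0;
        }
    *pos = i;
    return 1;
}
```

On a parse failure, reporting `furthest` and `furthest_rule` gives the "furthest failed position plus the rule above it" with almost no bookkeeping; a full rule stack would need each combinator to push and pop its name around the calls it makes.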
On Sunday, August 16, 2020 at 10:53:24 AM UTC-5, davidl...@gmail.com wrote:
My friend, reporting the furthest position examined by the parser is something I have found useful in error cases, as a simple stopgap when using a combinator approach. Thinking about it, what you really want to see is the furthest failed position and the stack of rules above it. That requires meta-information when the code is written in the most natural way. For this reason and others, I believe it is good to represent your grammar in data structures, which is a further step in the direction of a compiler-compiler tool (or compiler-interpreter tool).
Thanks. I've done some further investigation. I built my parsers following two papers: Hutton and Meijer, Monadic Parser Combinators
https://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf
and Hutton, Higher-Order Functions for Parsing
https://pdfs.semanticscholar.org/6669/f223fba59edaeed7fabe02b667809a5744d9.pdf
The first adds error reporting using monad transformers. [...]
But the second paper does it differently, and maybe in a way I can adopt
more easily. It redefines the parsers to no longer produce a list of
results, so there is no longer support for ambiguity. Then it defines
them to return a Maybe,

maybe * ::= Fail [char] | Error [char] | OK *

where the OK branch holds the parse tree, and Fail and Error both carry
an error message. It describes how a Fail can be transformed into an
Error, but it isn't entirely clear where the messages get injected.
I still need to do some thinking on it, but I believe I can rewrite the
parsers to follow this model and then decorate my grammar with possible
errors at each node.
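For reference, the Fail/Error/OK scheme from the paper could be transcribed into C (the project's implementation language) roughly as follows. All names are hypothetical; this is a sketch of the idea, not pcomb's design:

```c
/* Three-valued parse result after Hutton's second paper: FAIL is a
   recoverable failure (alternation may try another branch), ERROR is
   a committed failure carrying a message, OK carries the result.
   Illustrative names only. */

enum status { FAIL, ERROR, OK };

struct reply {
    enum status status;
    const char *msg;   /* for FAIL / ERROR */
    int value;         /* for OK; a real parser would hold a tree */
};

static struct reply ok(int v) {
    struct reply r = { OK, 0, v };
    return r;
}

static struct reply fail(const char *msg) {
    struct reply r = { FAIL, msg, 0 };
    return r;
}

/* Promote a recoverable FAIL to a hard ERROR, injecting the
   grammar-supplied message. This is one plausible place where
   "decorating the grammar with possible errors" hooks in: each
   decorated node wraps its body in a commit with its own message. */
static struct reply commit(struct reply r, const char *msg) {
    if (r.status == FAIL) {
        r.status = ERROR;
        r.msg = msg;
    }
    return r;
}
```

A choice combinator would then try its second alternative on FAIL but propagate ERROR unchanged, so a committed branch reports its decorated message instead of silently backtracking.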
Thanks to everyone for the help, esp. Kaz with the brilliant suggestion
to pass a language id token between tokenizer and parser.
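The language-id idea might look roughly like this in C. The keyword lists below are abbreviated, and the dialect assignments (e.g. "const", "volatile", "signed", and "void" lexing as keywords only under C90) are my approximation for illustration, not a checked history:

```c
/* Sketch of a shared dialect tag between tokenizer and parser:
   words that became keywords only in later dialects lex as plain
   identifiers under the earlier grammars. Lists abbreviated;
   hypothetical names throughout. */
#include <string.h>

enum dialect { C1975, KR1, C90 };
enum token   { IDENT, KEYWORD };

static int in_list(const char **list, const char *w) {
    for (; *list; list++)
        if (strcmp(*list, w) == 0)
            return 1;
    return 0;
}

static enum token classify_word(enum dialect d, const char *w) {
    /* keywords common to all three dialects (abbreviated) */
    static const char *common[]    = { "int", "char", "if", "while",
                                       "return", 0 };
    /* assumed C90-only additions (abbreviated, approximate) */
    static const char *c90_extra[] = { "const", "volatile", "signed",
                                       "void", 0 };
    if (in_list(common, w))
        return KEYWORD;
    if (d == C90 && in_list(c90_extra, w))
        return KEYWORD;
    return IDENT;
}
```

With this arrangement the three grammars can share most rules, while the token stream itself differs per dialect; a 1975-mode program is free to use `const` as a variable name.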
P.S. The prototype is written in PostScript extended with function syntax.
https://github.com/luser-dr00g/pcomb/blob/master/ps/pc11.ps
https://codereview.stackexchange.com/questions/193520/an-enhanced-syntax-for-defining-functions-in-postscript
--
l droog
[Why Postscript? I realize it's Turing complete, but it seems odd to run one's parser on a printer. -John]
But the language itself I just really enjoy. It's my "Lego blocks"
language. The RPN syntax removes all ambiguity about precedence and sequencing.
[Why Postscript? I realize it's Turing complete, but it seems odd to run one's parser on a printer. -John]
I discovered PostScript around '97 or '98. I was taking Computer Graphics,
and it was in an appendix to the textbook (Salman). At the same time
I was editor of the Honors College student magazine, so it really piqued
my interest as a graphics and typography language. ...
[Take a look at Forth. Many of the same advantages, runs a lot more places. -John]
luser droog <mijoryx@yahoo.com.dmarc.email> wrote:
[PostScript]
But the language itself I just really enjoy. It's my "Lego blocks" language. The RPN syntax removes all ambiguity about precedence and sequencing.
I recently had the doubtful pleasure of evaluating the formula
x = ((a-b)*c^2+(-d^2+e^2-a^2+b^2)*c+a^2*b+(f^2-e^2-b^2)*a
+(-f^2+d^2)*b)/((-2*d+2*e)*c+(2*f-2*e)*a-2*b*(f-d))
in Postscript. (Yes, really. Don't ask.)
How separate should these be? Should they be complete
separate grammars, or more piecewise selection? ...
[Really, it's up to you. My inclination would be to make them
separate but use some sort of macro setup so you can insert
common pieces into each of the grammars. -John]
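John's macro suggestion has a function-pointer analogue in C: keep one table of rule pointers per dialect, where most slots point at shared rules and only the slots that genuinely differ are overridden. A hypothetical sketch, not pcomb's actual structure:

```c
/* Sketch of piecewise grammar selection: each dialect is a table of
   rule pointers, sharing common rules and overriding the rest. The
   stub "rules" below just return a tag so the wiring is visible;
   real entries would be parser combinators. Hypothetical names. */

typedef int (*rule_fn)(const char *input);

/* Shared and dialect-specific rule stubs. */
static int expr_common(const char *in) { (void)in; return 1; }
static int decl_kr(const char *in)     { (void)in; return 2; }
static int decl_c90(const char *in)    { (void)in; return 3; }

struct grammar {
    rule_fn expr;   /* identical across dialects */
    rule_fn decl;   /* differs: K&R vs C90 declarations */
};

static const struct grammar kr1 = { expr_common, decl_kr  };
static const struct grammar c90 = { expr_common, decl_c90 };
```

Because both tables reference the same `expr_common`, a fix to the shared rule benefits every dialect at once, while the differing declaration rules stay cleanly separated; this gives much of the maintainability of "one grammar with conditionals" without interleaving version tests through the rule bodies.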
My impression is that the variation among Pascal dialects is larger than among C dialects, which argues for a unified parser in the C case, IMHO.
On Thu, 11 Feb 2021, antispam@math.uni.wroc.pl wrote:
My impression is that the variation among Pascal dialects is larger than among C dialects, which argues for a unified parser in the C case, IMHO.
Pascal is more fragmented, but it's also much easier to parse than C. I think it's a wash.
(I also think the whole idea is horrifying and ought not to be pursued;
but.)
Elijah Stone <elr...@elronnd.net> wrote:
I did a C parser; it was not hard at all. In C (as in standard
Pascal) there are conflicts, but those conflicts can be resolved
easily using semantic info. Alternatively, for C one can use two-token
lookahead.
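The two-token lookahead mentioned here needs only a small peek buffer over the token stream. A minimal illustrative sketch (the token source is just an array here, and all names are made up):

```c
/* Sketch of two-token lookahead: the parser can inspect token k
   (k = 0 or 1) ahead of the cursor without consuming it. */

struct tokstream {
    const int *toks;   /* stand-in token codes */
    int len, pos;
};

/* Peek at the k-th pending token; -1 plays the role of EOF. */
static int peek(struct tokstream *s, int k) {
    int i = s->pos + k;
    return i < s->len ? s->toks[i] : -1;
}

/* Consume and return the next token. */
static int next(struct tokstream *s) {
    return s->pos < s->len ? s->toks[s->pos++] : -1;
}
```

With `peek(s, 1)` available, a rule can, for instance, look past an identifier to the following token before deciding between the declaration and expression alternatives, instead of threading semantic information into the lexer.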