• syntax complexity

    From gah4@21:1/5 to All on Wed Feb 15 15:08:12 2023
    I started this in another thread, but I think it deserves its own.

    The question is, how does one measure syntax complexity, with the
    specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics,
    for now.)

Even back in the 1970s, I always found that Fortran had strange and (seemingly) arbitrary rules in its syntax.

Some of them made sense on the small, early computers, but they didn't
go away later. Even worse, as new features were added and some of the
old rules were relaxed, new ones appeared.

    On the other hand, PL/I from the beginning had more general rules.
    And, it seems to me, more general rules lead to simpler syntax. That
    is especially true for expressions, where PL/I allows pretty much the
    same expression syntax everywhere it allows expressions.

PL/I had internal procedures from the beginning, though they were only
added to Fortran much later. And when they were, Fortran didn't allow
them to be nested: only one level deep! That might have changed more
recently, or maybe it is now two levels.

One complication I see is that syntax complexity as seen by people
might be different from syntax complexity as seen by programs.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to gah4@u.washington.edu on Thu Feb 16 06:32:50 2023
    gah4 <gah4@u.washington.edu> schrieb:
    I started this in another thread, but I think it deserves its own.

    The question is, how does one measure syntax complexity, with the
    specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics,
    for now.)

    A reasonable zeroth-order approximation is the length of the language
    standard (which would also include semantics).

    Fortran 2018: 599 pages, without the index.

    C++ 2020: 1663 pages, without cross-references and indices.

    C (2020 draft): 393 pages without annexes, 572 pages including
    the non-normative annexes.

    It is probably not fair to compare document sizes of early standards
    like the very first programming language standard, Fortran 66.
    That was very short at 36 pages, but people were still learning
    how to write language standards at the time.

    Another very rough measure would be the size of a front end in the
    same compiler. Using the even rougher estimate of text lines
    for gcc, I get

    $ cat fortran/*.cc fortran/*.h | wc -l
    226682
    $ cat c/*.cc c/*.h | wc -l
    60548
    $ cat d/*.cc | wc -l
    24850
    $ cat ada/*.adb ada/*.ads ada/*.c ada/*.h | wc -l
    731635

    (Right now, I'm not sure how to identify all the files in the C++
    front end. Ada may be misrepresented because the compiler is mostly
    written in Ada instead of C++, and it is a more verbose language than
    C++. The Fortran front end is nominally C++, but due to its history
    mostly sticks to the common subset of C and C++. This is text lines,
    not statements. I'm not sure if I included the Ada runtime library in
    that estimate or not. Like I wrote above, a rough estimate).

  • From Hans-Peter Diettrich@21:1/5 to All on Thu Feb 16 12:03:02 2023
    On 2/16/23 12:08 AM, gah4 wrote:
    I started this in another thread, but I think it deserves its own.

    The question is, how does one measure syntax complexity,

    In the original thread I was missing more specific distinctions:
    - tools: flex/bison, ANTLR, CoCo/R...
    - parser types: LL, LR, LF, LALR...
    - or(?): recursive-descent, shift-reduce...
    - grammar syntax: BNF, EBNF...
    - grammar types: context-free, -sensitive...

    with the
    specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics,
    for now.)
    [...]

    On that aspect I can not contribute anything, sorry :-(


One complication I see is that syntax complexity as seen by people
might be different from syntax complexity as seen by programs.

    How do programs recognize syntax complexity?
    Number of rules and exceptions?

    DoDi

  • From gah4@21:1/5 to Hans-Peter Diettrich on Thu Feb 16 11:33:01 2023
    On Thursday, February 16, 2023 at 9:57:53 AM UTC-8, Hans-Peter Diettrich wrote:
    On 2/16/23 12:08 AM, gah4 wrote:

    (snip)

One complication I see is that syntax complexity as seen by people
might be different from syntax complexity as seen by programs.

    How do programs recognize syntax complexity?
    Number of rules and exceptions?

    When I wrote that one, I was thinking about the places where Fortran uses special characters and PL/I uses words.

    DO I=1,10,3

    DO I = 1 TO 10 BY 3;

I think about them in different ways, so the perceived complexity is different.

    A compiler doesn't "think" in that way.

I suppose I agree with the above: the length of the standard (with some assumptions on how it is written), or the length of the front end.
    [Having written a couple of Fortran parsers, I can say that while the hacks
    to deal with ignored spaces were ugly, they weren't that hard. PL/I has a separate issue that the same token might be a keyword or a variable depending on context, and the kinds of parsers you build with bison et al don't deal
    very well with that. -John]
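The moderator's point about context-dependent keywords can be illustrated with a toy sketch (this is not the real PL/I rule set; the `classify` helper, its keyword list, and the token shapes are all mine). Because PL/I reserves no words, a parser has to look at the shape of the statement before it can decide what the first token even is:

```python
# Toy sketch, not real PL/I: keywords are not reserved, so whether "IF"
# at the start of a statement is a keyword or a variable depends on the
# shape of the statement as a whole.
def classify(tokens):
    # If the first token is directly followed by "=", the statement is
    # an assignment and the leading token is a variable name.
    if len(tokens) > 1 and tokens[1] == "=":
        return "assignment"            # "IF = 1;" assigns to a variable IF
    if tokens[0] in {"IF", "DO", "PUT", "READ"}:
        return tokens[0] + "-statement"
    return "assignment-or-call"

print(classify(["IF", "=", "1", ";"]))           # IF used as a variable
print(classify(["IF", "X", ">", "1", "THEN"]))   # IF used as a keyword
```

A bison-style setup, where the lexer commits to each token's kind before the parser sees it, has no natural place for this look-at-the-whole-statement decision; hand-written or backtracking parsers handle it more gracefully.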

  • From gah4@21:1/5 to and our moderator on Thu Feb 16 16:08:16 2023
    On Thursday, February 16, 2023 at 2:46:25 PM UTC-8, gah4 wrote:

    (snip)

    When I wrote that one, I was thinking about the places where Fortran uses special characters and PL/I uses words.

    DO I=1,10,3

    DO I = 1 TO 10 BY 3;

I think about them in different ways, so the perceived complexity is different.

    A compiler doesn't "think" in that way.

I suppose I agree with the above: the length of the standard (with some assumptions on how it is written), or the length of the front end.

    (and our moderator wrote)
    [Having written a couple of Fortran parsers, I can say that while the hacks to deal with ignored spaces were ugly, they weren't that hard. PL/I has a separate issue that the same token might be a keyword or a variable depending on context, and the kinds of parsers you build with bison et al don't deal very well with that. -John]

Well, there is that, but so far I was only thinking about the
difference between commas and keywords.

    WRITE(2,*) A, B, C
    PUT FILE(TWO) SKIP LIST(A, B, C);

    or

    READ(3'K) X
    READ FILE(THREE) INTO(X) KEY(K);

Note the funny IBM use of a single apostrophe for direct-access files.

As part of a summer undergrad project, which I wrote about some time
ago, I worked with the STEP macro processor, writing a parser for IBM
Fortran (not including the ignoring of spaces between words). But
there was no way to match a single apostrophe! Parsing string
constants had to be done at a very low level.
    [Like I said the hacks were ugly. For example, this statement
    contains a Hollerith constant:

    123 FORMAT(4HELLO)

    but this one does not:

    REAL*4HELLO

    I was doing F77 which didn't do that strange quote thing but it's easy
    enough to tell from context, since a quoted string has to follow a
    punctuation mark that is not a close paren. -John]

  • From Roger L Costello@21:1/5 to All on Mon Feb 20 15:09:18 2023
    Hello Compiler Experts!

    Scenario: you have a language that has a BNF. You write a statement in the language. It is a relatively simple, basic statement. The statement conforms to the BNF. To show its conformance, you write the derivation of the statement. Surprisingly, deriving
    the statement takes many, many rules. Does that signify that the language's syntax is too complex?

    /Roger

  • From gah4@21:1/5 to Roger L Costello on Mon Feb 20 09:57:20 2023
    On Monday, February 20, 2023 at 9:46:04 AM UTC-8, Roger L Costello wrote:

    Scenario: you have a language that has a BNF. You write a statement in the language.
    It is a relatively simple, basic statement. The statement conforms to the BNF.
    To show its conformance, you write the derivation of the statement. Surprisingly, deriving the statement takes many, many rules.
    Does that signify that the language's syntax is too complex?

    One suggestion above based it on the size of the standard. That isn't
    quite as good as it could be, as some standard writers are wordier
    than others, but maybe not so bad.

    Another choice is on the size of the BNF.

I don't think I would rate it on one statement, but on the whole BNF
of the language. One statement, such as an assignment statement, could
use a lot of rules; whole-language syntax complexity should be judged
on the whole BNF.

One complication depends on how you write the BNF. Some statements,
such as in PL/I, have items that can appear in any order, but with at
most one of each. That is complicated to write in BNF, but I don't
believe it complicates the syntax as seen by people.
    [There are definitely things that are hard to say in BNF, even though
    they're intuitively simple. Another is floating point numbers with
    optional "." and "E" but you need at least one of them. -John]

  • From George Neuner@21:1/5 to costello@mitre.org on Mon Feb 20 13:49:19 2023
    On Mon, 20 Feb 2023 15:09:18 +0000, Roger L Costello
    <costello@mitre.org> wrote:

    Hello Compiler Experts!

    Scenario: you have a language that has a BNF. You write a statement in
    the language. It is a relatively simple, basic statement. The
statement conforms to the BNF. To show its conformance, you write the derivation of the statement. Surprisingly, deriving the statement
    takes many, many rules. Does that signify that the language's syntax
    is too complex?

    /Roger

    Not necessarily.

    What makes a language complicated (not "complex") is ambiguity, not
    the number of grammar rules needed to recognize some particular
    expression.

    If you are restricted to BNF ... i.e. your tool does not allow
    specifying precedence ... then recognizing even relatively simple
    arithmetic expressions can (perhaps recursively) involve several
    rules.
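The point can be seen in the textbook expression grammar (a sketch; the nonterminal names are mine):

```
expr   ::= expr "+" term   | term
term   ::= term "*" factor | factor
factor ::= NUMBER | "(" expr ")"
```

Deriving even the single token `42` takes three rule applications (expr → term → factor → NUMBER), yet nobody would call this syntax complex; the extra rules exist only to encode precedence, which a tool with precedence declarations lets you state directly.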

    George

  • From Anton Ertl@21:1/5 to Our moderator on Tue Feb 21 08:14:10 2023
    Our moderator writes:

    [There are definitely things that are hard to say in BNF, even though
    they're intuitively simple.

    An example is the solution to the dangling-else problem that we
    discussed some time ago. At least one of the language standards I
    looked at during this discussion specified an ambiguous grammar for
    the IF-statement, with the disambiguation coming from prose, rather
    than specifying an unambiguous grammar.
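The standard unambiguous reformulation (a sketch in BNF-like notation; nonterminal names are mine) splits statements into "matched" and "unmatched":

```
stmt      ::= matched | unmatched
matched   ::= IF expr THEN matched ELSE matched | other-stmt
unmatched ::= IF expr THEN stmt
            | IF expr THEN matched ELSE unmatched
```

This forces every ELSE to pair with the nearest unmatched THEN, which is exactly what the prose rule says, but it is noticeably harder to read than the ambiguous grammar plus a line of prose.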

    Of course, better than either solution is to design the language to
    require that an IF-statement is terminated with, e.g., fi (Algol 68) or END (Modula-2).

    Another is floating point numbers with
    optional "." and "E" but you need at least one of them. -John]

    Regular expression syntax is missing an operator that signifies the intersection of the sets recognized by the operand regexps. Let's
    call this operator "&". Then this requirement for an FP number can be expressed as:

    ([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].*

    "[0-9.]*" specifies a lenient form of the mantissa part; ".*[0-9].*"
    specifies a string that has at least one digit; so
    "([0-9.]*&.*[0-9].*)" says that the mantissa part can contain digits
    and "." and must contain one digit. "(E[0-9]+)?" specifies the
optional exponent part. So "([0-9.]*&.*[0-9].*)(E[0-9]+)?" is a
    lenient form of an FP number that may miss both "." and "E".
    ".*[.E].*" specifies the requirement stated above: The number must
    contain a "." or an "E". Again the "&" combines these requirements.

    You can translate regexps with & to a DFA, and don't lose any
    performance there. For an NFA-based implementation, the only way I
    see is that you process both sub-NFAs and check if they both accept
    the string, so it's somewhat slower; I guess that this is a major
    reason why such an operator has not been added to widely-used regexp
    syntaxes.

    Has anybody implemented such an operator at all?
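One way to get the effect in practice: engines with lookahead can emulate "&", since "(?=A$)(?=B$)" requires both A and B to match the entire string, much like running the product DFA. A Python sketch (the `intersect` helper and the exact conjuncts below are mine; I made the digits-after-"E" requirement conditional on "E" being present, so numbers without an exponent still pass):

```python
import re

def intersect(*patterns):
    # Emulate the "&" operator: wrap each operand regexp in a lookahead
    # anchored at both ends, so every operand must accept the same
    # whole string.
    return re.compile("".join(f"(?={p}$)" for p in patterns) + ".*$")

fp = intersect(
    r"[0-9]*[.]?[0-9]*E?[0-9]*",    # at most one "." in front of at most one "E"
    r".*[.E].*",                    # at least one of "." or "E"
    r"[^E]*[0-9][^E]*(E[0-9]+)?",   # a mantissa digit; exponent digits if "E" present
)

for s in ["1.5", "1E3", ".5", "1.5E3", "1.2.3", "123", "1E", "E3"]:
    print(s, "matches" if fp.fullmatch(s) else "no")
```

As Anton notes for NFAs, each lookahead is effectively a separate scan of the string, so this costs more than a single product DFA would.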

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/
    [Wouldn't that pattern allow 1.2.3 ? -John]

  • From Christopher F Clark@21:1/5 to George Neuner on Tue Feb 21 20:54:11 2023
    I want to expand on what
    George Neuner <gneuner2@comcast.net> wrote:

    If you are restricted to BNF ... i.e. your tool does not allow
    specifying precedence ... then recognizing even relatively simple
    arithmetic expressions can (perhaps recursively) involve several
    rules.

    Without precedence specifications, strict BNF does require nested
    rules to be created and named to implement the "natural" hierarchy of expression rules, e.g. that one does the multiplications before the
additions. This was already recognized in 1973, when Steve Johnson et
al. described how such specifications are implemented in yacc; see
"Deterministic Parsing of Ambiguous Grammars"
(https://dl.acm.org/doi/epdf/10.1145/360933.360969). Of course, the
    precedence parsing methods predate that paper, but it showed how one
    could easily annotate a BNF grammar to incorporate the idea.

You can take the idea further if you use numeric precedences for
operators. That gives one a simple way of expressing the ordering of
the operators: one "shifts" operators with higher or equal precedence
and "reduces" when seeing an operator of lower precedence.
    (Note you can specify higher by either larger or smaller numbers. E.g.
    on a scale of 1 to 10, 1 can be highest, 1st place, or lowest, last
    place. One simply has to decide which way one wants it.)
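The shift-or-reduce decision described above can be sketched as a precedence-climbing parser (a toy sketch; the function names, the precedence table, and the tiny tokenizer are mine):

```python
import re

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}   # larger number binds tighter

def tokenize(s):
    return re.findall(r"\d+|[-+*/()]", s)

def parse(tokens, min_prec=1):
    # Precedence climbing: keep "shifting" while the next operator's
    # precedence is at least min_prec; a lower-precedence operator
    # ends (i.e. "reduces") this level.
    lhs = parse_primary(tokens)
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        rhs = parse(tokens, PREC[op] + 1)  # +1 makes the operator left-associative
        lhs = (op, lhs, rhs)
    return lhs

def parse_primary(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens, 1)
        tokens.pop(0)                      # discard ")"
        return inner
    return int(tok)

print(parse(tokenize("1+2*3")))    # multiplication groups first
```

The whole operator hierarchy lives in the one table, instead of one named grammar rule per precedence level.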

    Similarly, in more recent times, the idea of ordering productions to
    specify precedence is used in Terence Parr's ANTLR4 to good effect.
    Again, the idea is not new. LEX used rule ordering to disambiguate
    which regular expression to use (and I suspect it was used before
    that). The same idea also is used in most Parsing Expression
    Grammar (PEG) implementations.

    Next, there was a paper in the 1980s or 90s whose title and author
    I have unfortunately forgotten, although I know it was published in
    TOPLAS, that suggested using examples to express preferences.
    And, that clearly goes back to the operators used in precedence
    parsing, e.g. ".<." which expressed which grouping had higher
    precedence and thus who should shift and who should reduce.

    But, no matter how one specifies it, the idea is the same.
--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
------------------------------------------------------------------------------

  • From Nils M Holm@21:1/5 to Anton Ertl on Tue Feb 21 21:26:00 2023
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Of course, better than either solution is to design the language to
    require that an IF-statement is terminated with, e.g., fi (Algol 68) or END (Modula-2).

    Or you can have two different statement types for IF with and
    without an alternative branch. For example, BCPL has

    IF expression THEN statement

    and

    TEST expression THEN statement ELSE statement

    --
    Nils M Holm < n m h @ t 3 x . o r g > http://t3x.org

  • From Anton Ertl@21:1/5 to Anton Ertl on Tue Feb 21 18:39:55 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Regular expression syntax is missing an operator that signifies the
intersection of the sets recognized by the operand regexps. Let's
call this operator "&". Then this requirement for an FP number can be
expressed as:

    ([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].*
    ...
    [Wouldn't that pattern allow 1.2.3 ? -John]

    Good point. The & operator makes it easy to fix this bug (once you
    have found it):

    ([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].*&[^.]*[.]?[^.]*

    But it suggests that I tried to put too many requirements into the
    first part, so let's try again:

At most one "." in front of at most one "E": [0-9]*[.]?[0-9]*E?[0-9]*
At least one of "." or "E": .*[.E].*
At least one digit in front of "E", and digits after "E" only when "E"
is present (so the exponent stays optional): [^E]*[0-9][^E]*(E[0-9]+)?

In total:

[0-9]*[.]?[0-9]*E?[0-9]*&.*[.E].*&[^E]*[0-9][^E]*(E[0-9]+)?

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/
