• syntax complexity

    From gah4@21:1/5 to All on Wed Feb 15 15:08:12 2023
    I started this in another thread, but I think it deserves its own.

    The question is, how does one measure syntax complexity, with the
    specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics,
    for now.)

Even back in the 1970s, I always found that Fortran had strange and (seemingly) arbitrary rules in its syntax.

Some of them made sense on the small, early computers, but they didn't
go away later. Even worse, as new features were added and some of the
old rules were relaxed, new ones appeared.

    On the other hand, PL/I from the beginning had more general rules.
    And, it seems to me, more general rules lead to simpler syntax. That
    is especially true for expressions, where PL/I allows pretty much the
    same expression syntax everywhere it allows expressions.

PL/I had internal procedures from the beginning, though they were only
added to Fortran much later. And when they were, Fortran didn't allow
them to be nested: only one level deep! That might have changed more
recently, or maybe it is now two levels.

One complication I see is that syntax complexity as seen by people
might be different from syntax complexity as seen by programs.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to gah4@u.washington.edu on Thu Feb 16 06:32:50 2023
    gah4 <gah4@u.washington.edu> schrieb:
    I started this in another thread, but I think it deserves its own.

    The question is, how does one measure syntax complexity, with the
    specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics,
    for now.)

    A reasonable zeroth-order approximation is the length of the language
    standard (which would also include semantics).

    Fortran 2018: 599 pages, without the index.

    C++ 2020: 1663 pages, without cross-references and indices.

    C (2020 draft): 393 pages without annexes, 572 pages including
    the non-normative annexes.

    It is probably not fair to compare document sizes of early standards
    like the very first programming language standard, Fortran 66.
    That was very short at 36 pages, but people were still learning
    how to write language standards at the time.

    Another very rough measure would be the size of a front end in the
    same compiler. Using the even rougher estimate of text lines
    for gcc, I get

    $ cat fortran/*.cc fortran/*.h | wc -l
    226682
    $ cat c/*.cc c/*.h | wc -l
    60548
    $ cat d/*.cc | wc -l
    24850
    $ cat ada/*.adb ada/*.ads ada/*.c ada/*.h | wc -l
    731635

    (Right now, I'm not sure how to identify all the files in the C++
    front end. Ada may be misrepresented because the compiler is mostly
    written in Ada instead of C++, and it is a more verbose language than
    C++. The Fortran front end is nominally C++, but due to its history
    mostly sticks to the common subset of C and C++. This is text lines,
    not statements. I'm not sure if I included the Ada runtime library in
    that estimate or not. Like I wrote above, a rough estimate).

  • From Hans-Peter Diettrich@21:1/5 to All on Thu Feb 16 12:03:02 2023
    On 2/16/23 12:08 AM, gah4 wrote:
    I started this in another thread, but I think it deserves its own.

    The question is, how does one measure syntax complexity,

    In the original thread I was missing more specific distinctions:
    - tools: flex/bison, ANTLR, CoCo/R...
    - parser types: LL, LR, LF, LALR...
    - or(?): recursive-descent, shift-reduce...
    - grammar syntax: BNF, EBNF...
    - grammar types: context-free, -sensitive...

    with the
    specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics,
    for now.)
    [...]

    On that aspect I can not contribute anything, sorry :-(


One complication I see is that syntax complexity as seen by people
might be different from syntax complexity as seen by programs.

    How do programs recognize syntax complexity?
    Number of rules and exceptions?

    DoDi

  • From gah4@21:1/5 to Hans-Peter Diettrich on Thu Feb 16 11:33:01 2023
    On Thursday, February 16, 2023 at 9:57:53 AM UTC-8, Hans-Peter Diettrich wrote:
    On 2/16/23 12:08 AM, gah4 wrote:

    (snip)

One complication I see is that syntax complexity as seen by people
might be different from syntax complexity as seen by programs.

    How do programs recognize syntax complexity?
    Number of rules and exceptions?

    When I wrote that one, I was thinking about the places where Fortran uses special characters and PL/I uses words.

    DO I=1,10,3

    DO I = 1 TO 10 BY 3;

I think about them in different ways, so the perceived complexity is different.

    A compiler doesn't "think" in that way.

I suppose I agree with the above: the length of the standard (with some assumptions on how it is written), or the length of the front end.
    [Having written a couple of Fortran parsers, I can say that while the hacks
    to deal with ignored spaces were ugly, they weren't that hard. PL/I has a separate issue that the same token might be a keyword or a variable depending on context, and the kinds of parsers you build with bison et al don't deal
    very well with that. -John]
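The moderator's point about context-dependent keywords can be illustrated with a toy sketch (this is not the real PL/I rule set; the `classify` helper, its keyword list, and the token shapes are all mine). Because PL/I reserves no words, a parser has to look at the shape of the statement before it can decide what the first token even is:

```python
# Toy sketch, not real PL/I: keywords are not reserved, so whether "IF"
# at the start of a statement is a keyword or a variable depends on the
# shape of the statement as a whole.
def classify(tokens):
    # If the first token is directly followed by "=", the statement is
    # an assignment and the leading token is a variable name.
    if len(tokens) > 1 and tokens[1] == "=":
        return "assignment"            # "IF = 1;" assigns to a variable IF
    if tokens[0] in {"IF", "DO", "PUT", "READ"}:
        return tokens[0] + "-statement"
    return "assignment-or-call"

print(classify(["IF", "=", "1", ";"]))           # IF used as a variable
print(classify(["IF", "X", ">", "1", "THEN"]))   # IF used as a keyword
```

A bison-style setup, where the lexer commits to each token's kind before the parser sees it, has no natural place for this look-at-the-whole-statement decision; hand-written or backtracking parsers handle it more gracefully.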

  • From gah4@21:1/5 to and our moderator on Thu Feb 16 16:08:16 2023
    On Thursday, February 16, 2023 at 2:46:25 PM UTC-8, gah4 wrote:

    (snip)

    When I wrote that one, I was thinking about the places where Fortran uses special characters and PL/I uses words.

    DO I=1,10,3

    DO I = 1 TO 10 BY 3;

I think about them in different ways, so the perceived complexity is different.

    A compiler doesn't "think" in that way.

I suppose I agree with the above: the length of the standard (with some assumptions on how it is written), or the length of the front end.

    (and our moderator wrote)
    [Having written a couple of Fortran parsers, I can say that while the hacks to deal with ignored spaces were ugly, they weren't that hard. PL/I has a separate issue that the same token might be a keyword or a variable depending on context, and the kinds of parsers you build with bison et al don't deal very well with that. -John]

Well, there is that, but so far I was only thinking about the
difference between commas and keywords.

    WRITE(2,*) A, B, C
    PUT FILE(TWO) SKIP LIST(A, B, C);

    or

    READ(3'K) X
    READ FILE(THREE) INTO(X) KEY(K);

Note the funny IBM use of a single apostrophe for direct-access files.

As part of a summer undergrad project, which I wrote about some time
ago, I worked with the STEP macro processor, writing a parser for IBM
Fortran (not including the ignoring of spaces between words). But
there was no way to match a single apostrophe! Parsing string
constants had to be done at a very low level.
    [Like I said the hacks were ugly. For example, this statement
    contains a Hollerith constant:

    123 FORMAT(4HELLO)

    but this one does not:

    REAL*4HELLO

    I was doing F77 which didn't do that strange quote thing but it's easy
    enough to tell from context, since a quoted string has to follow a
    punctuation mark that is not a close paren. -John]

  • From Roger L Costello@21:1/5 to All on Mon Feb 20 15:09:18 2023
    Hello Compiler Experts!

    Scenario: you have a language that has a BNF. You write a statement in the language. It is a relatively simple, basic statement. The statement conforms to the BNF. To show its conformance, you write the derivation of the statement. Surprisingly, deriving
    the statement takes many, many rules. Does that signify that the language's syntax is too complex?

    /Roger

  • From gah4@21:1/5 to Roger L Costello on Mon Feb 20 09:57:20 2023
    On Monday, February 20, 2023 at 9:46:04 AM UTC-8, Roger L Costello wrote:

    Scenario: you have a language that has a BNF. You write a statement in the language.
    It is a relatively simple, basic statement. The statement conforms to the BNF.
    To show its conformance, you write the derivation of the statement. Surprisingly, deriving the statement takes many, many rules.
    Does that signify that the language's syntax is too complex?

    One suggestion above based it on the size of the standard. That isn't
    quite as good as it could be, as some standard writers are wordier
    than others, but maybe not so bad.

    Another choice is on the size of the BNF.

I don't think I would rate it on one statement, but on the whole BNF
of the language. One statement, such as an assignment statement, could
use a lot of rules; whole-language syntax complexity should be judged
on the whole BNF.

One complication depends on how you write the BNF. Some statements,
such as in PL/I, have items that can appear in any order, but with at
most one of each. That is complicated to write in BNF, but I don't
believe it complicates the syntax as seen by people.
    [There are definitely things that are hard to say in BNF, even though
    they're intuitively simple. Another is floating point numbers with
    optional "." and "E" but you need at least one of them. -John]

  • From George Neuner@21:1/5 to costello@mitre.org on Mon Feb 20 13:49:19 2023
    On Mon, 20 Feb 2023 15:09:18 +0000, Roger L Costello
    <costello@mitre.org> wrote:

    Hello Compiler Experts!

    Scenario: you have a language that has a BNF. You write a statement in
    the language. It is a relatively simple, basic statement. The
statement conforms to the BNF. To show its conformance, you write the derivation of the statement. Surprisingly, deriving the statement
    takes many, many rules. Does that signify that the language's syntax
    is too complex?

    /Roger

    Not necessarily.

    What makes a language complicated (not "complex") is ambiguity, not
    the number of grammar rules needed to recognize some particular
    expression.

    If you are restricted to BNF ... i.e. your tool does not allow
    specifying precedence ... then recognizing even relatively simple
    arithmetic expressions can (perhaps recursively) involve several
    rules.
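The point can be seen in the textbook expression grammar (a sketch; the nonterminal names are mine):

```
expr   ::= expr "+" term   | term
term   ::= term "*" factor | factor
factor ::= NUMBER | "(" expr ")"
```

Deriving even the single token `42` takes three rule applications (expr → term → factor → NUMBER), yet nobody would call this syntax complex; the extra rules exist only to encode precedence, which a tool with precedence declarations lets you state directly.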

    George

  • From Anton Ertl@21:1/5 to Our moderator on Tue Feb 21 08:14:10 2023
    Our moderator writes:

    [There are definitely things that are hard to say in BNF, even though
    they're intuitively simple.

    An example is the solution to the dangling-else problem that we
    discussed some time ago. At least one of the language standards I
    looked at during this discussion specified an ambiguous grammar for
    the IF-statement, with the disambiguation coming from prose, rather
    than specifying an unambiguous grammar.
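The standard unambiguous reformulation (a sketch in BNF-like notation; nonterminal names are mine) splits statements into "matched" and "unmatched":

```
stmt      ::= matched | unmatched
matched   ::= IF expr THEN matched ELSE matched | other-stmt
unmatched ::= IF expr THEN stmt
            | IF expr THEN matched ELSE unmatched
```

This forces every ELSE to pair with the nearest unmatched THEN, which is exactly what the prose rule says, but it is noticeably harder to read than the ambiguous grammar plus a line of prose.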

    Of course, better than either solution is to design the language to
    require that an IF-statement is terminated with, e.g., fi (Algol 68) or END (Modula-2).

    Another is floating point numbers with
    optional "." and "E" but you need at least one of them. -John]

    Regular expression syntax is missing an operator that signifies the intersection of the sets recognized by the operand regexps. Let's
    call this operator "&". Then this requirement for an FP number can be expressed as:

    ([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].*

    "[0-9.]*" specifies a lenient form of the mantissa part; ".*[0-9].*"
    specifies a string that has at least one digit; so
    "([0-9.]*&.*[0-9].*)" says that the mantissa part can contain digits
    and "." and must contain one digit. "(E[0-9]+)?" specifies the
optional exponent part. So "([0-9.]*&.*[0-9].*)(E[0-9]+)?" is a
    lenient form of an FP number that may miss both "." and "E".
    ".*[.E].*" specifies the requirement stated above: The number must
    contain a "." or an "E". Again the "&" combines these requirements.

    You can translate regexps with & to a DFA, and don't lose any
    performance there. For an NFA-based implementation, the only way I
    see is that you process both sub-NFAs and check if they both accept
    the string, so it's somewhat slower; I guess that this is a major
    reason why such an operator has not been added to widely-used regexp
    syntaxes.

    Has anybody implemented such an operator at all?
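One way to get the effect in practice: engines with lookahead can emulate "&", since "(?=A$)(?=B$)" requires both A and B to match the entire string, much like running the product DFA. A Python sketch (the `intersect` helper and the exact conjuncts below are mine; I made the digits-after-"E" requirement conditional on "E" being present, so numbers without an exponent still pass):

```python
import re

def intersect(*patterns):
    # Emulate the "&" operator: wrap each operand regexp in a lookahead
    # anchored at both ends, so every operand must accept the same
    # whole string.
    return re.compile("".join(f"(?={p}$)" for p in patterns) + ".*$")

fp = intersect(
    r"[0-9]*[.]?[0-9]*E?[0-9]*",    # at most one "." in front of at most one "E"
    r".*[.E].*",                    # at least one of "." or "E"
    r"[^E]*[0-9][^E]*(E[0-9]+)?",   # a mantissa digit; exponent digits if "E" present
)

for s in ["1.5", "1E3", ".5", "1.5E3", "1.2.3", "123", "1E", "E3"]:
    print(s, "matches" if fp.fullmatch(s) else "no")
```

As Anton notes for NFAs, each lookahead is effectively a separate scan of the string, so this costs more than a single product DFA would.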

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/
    [Wouldn't that pattern allow 1.2.3 ? -John]

  • From Christopher F Clark@21:1/5 to George Neuner on Tue Feb 21 20:54:11 2023
    I want to expand on what
    George Neuner <gneuner2@comcast.net> wrote:

    If you are restricted to BNF ... i.e. your tool does not allow
    specifying precedence ... then recognizing even relatively simple
    arithmetic expressions can (perhaps recursively) involve several
    rules.

    Without precedence specifications, strict BNF does require nested
    rules to be created and named to implement the "natural" hierarchy of expression rules, e.g. that one does the multiplications before the
additions. This was already recognized in 1973, when Steve Johnson et
al. described how such specifications are implemented in yacc; see
"Deterministic Parsing of Ambiguous Grammars"
(https://dl.acm.org/doi/epdf/10.1145/360933.360969). Of course, the
    precedence parsing methods predate that paper, but it showed how one
    could easily annotate a BNF grammar to incorporate the idea.

You can take the idea further if you use numeric precedences for
operators. That gives one a simple way of expressing the ordering of
the operators: one "shifts" operators with higher or equal precedence
and "reduces" when seeing an operator of lower precedence.
    (Note you can specify higher by either larger or smaller numbers. E.g.
    on a scale of 1 to 10, 1 can be highest, 1st place, or lowest, last
    place. One simply has to decide which way one wants it.)
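The shift-or-reduce decision described above can be sketched as a precedence-climbing parser (a toy sketch; the function names, the precedence table, and the tiny tokenizer are mine):

```python
import re

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}   # larger number binds tighter

def tokenize(s):
    return re.findall(r"\d+|[-+*/()]", s)

def parse(tokens, min_prec=1):
    # Precedence climbing: keep "shifting" while the next operator's
    # precedence is at least min_prec; a lower-precedence operator
    # ends (i.e. "reduces") this level.
    lhs = parse_primary(tokens)
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        rhs = parse(tokens, PREC[op] + 1)  # +1 makes the operator left-associative
        lhs = (op, lhs, rhs)
    return lhs

def parse_primary(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens, 1)
        tokens.pop(0)                      # discard ")"
        return inner
    return int(tok)

print(parse(tokenize("1+2*3")))    # multiplication groups first
```

The whole operator hierarchy lives in the one table, instead of one named grammar rule per precedence level.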

    Similarly, in more recent times, the idea of ordering productions to
    specify precedence is used in Terence Parr's ANTLR4 to good effect.
    Again, the idea is not new. LEX used rule ordering to disambiguate
    which regular expression to use (and I suspect it was used before
    that). The same idea also is used in most Parsing Expression
    Grammar (PEG) implementations.

    Next, there was a paper in the 1980s or 90s whose title and author
    I have unfortunately forgotten, although I know it was published in
    TOPLAS, that suggested using examples to express preferences.
    And, that clearly goes back to the operators used in precedence
    parsing, e.g. ".<." which expressed which grouping had higher
    precedence and thus who should shift and who should reduce.

    But, no matter how one specifies it, the idea is the same.
--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
------------------------------------------------------------------------------

  • From Nils M Holm@21:1/5 to Anton Ertl on Tue Feb 21 21:26:00 2023
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Of course, better than either solution is to design the language to
    require that an IF-statement is terminated with, e.g., fi (Algol 68) or END (Modula-2).

    Or you can have two different statement types for IF with and
    without an alternative branch. For example, BCPL has

    IF expression THEN statement

    and

    TEST expression THEN statement ELSE statement

    --
    Nils M Holm < n m h @ t 3 x . o r g > http://t3x.org

  • From Anton Ertl@21:1/5 to Anton Ertl on Tue Feb 21 18:39:55 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Regular expression syntax is missing an operator that signifies the
intersection of the sets recognized by the operand regexps. Let's
call this operator "&". Then this requirement for an FP number can be
expressed as:

    ([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].*
    ...
    [Wouldn't that pattern allow 1.2.3 ? -John]

    Good point. The & operator makes it easy to fix this bug (once you
    have found it):

    ([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].*&[^.]*[.]?[^.]*

    But it suggests that I tried to put too many requirements into the
    first part, so let's try again:

At most one "." in front of at most one "E": [0-9]*[.]?[0-9]*E?[0-9]*
At least one of "." or "E": .*[.E].*
At least one digit in front of "E", and digits after "E" only when "E"
is present (so the exponent stays optional): [^E]*[0-9][^E]*(E[0-9]+)?

In total:

[0-9]*[.]?[0-9]*E?[0-9]*&.*[.E].*&[^E]*[0-9][^E]*(E[0-9]+)?

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/
