• Re: Why are ambiguous grammars usually a bad idea? Why are languages usually defined and implemented with ambiguous grammars?

    From Anton Ertl@21:1/5 to Roger L Costello on Mon Dec 27 09:49:29 2021
    Roger L Costello <costello@mitre.org> writes:
    Question: Beyond the fact that bison doesn't like them, why are
    ambiguous grammars usually bad?

    An ambiguous grammar means that the same input can be parsed in at
    least two different ways. In many cases the two different ways have
    different meanings.

    The classical example is the dangling else. In EBNF:

    S -> if E then S [ else S ]

    if you see

    if E then if E then S else S

    which if does the else belong to? The two parses usually have very
    different meanings.
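
    Concretely, the two parses group in these two ways (brackets added
    only to make the grouping visible):

        if E then [ if E then S else S ]     (else binds to the inner if)
        if E then [ if E then S ] else S     (else binds to the outer if)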

    My answer: Ambiguous grammars make it hard for users of the grammar
    to write correct instances of the grammar. Stated another way,
    ambiguous grammars make it easy to introduce errors into instances of
    the grammar.

    Sounds wrong to me. The if-Statement above is a correct Pascal
    statement (apart from the unexpanded Es and Ss) and not erroneous (and
    likewise for other languages with similar syntaxes, e.g., Algol 60 and C).

    Of course, one way to deal with ambiguities would be to make the input
    a syntax error, but that's usually not done (and I know of no parser
    generator that supports this approach).

    Question: Opine about why languages are usually defined and
    implemented with ambiguous grammars.

    My answer: Writing a grammar that completely avoids ambiguities might
    result in a grammar that is nightmarishly complex. It is easier/better
    to create a simple grammar and then add rules/descriptions which
    explain and resolve the ambiguities.

    Writing the grammar above in an unambiguous way to match up the else
    with the closest unresolved if is more complex, but not nightmarishly
    complex. But the better solution was used in Modula-2: require an
    ending for if:

    S -> if E then S [ else S ] end

    [Modula-2 also uses a Statement-Sequence instead of a Statement in the then-branch and else-branch, avoiding the need for begin...end.]

    This grammar is unambiguous and the programmer can write either of

    if E then if E then S else S end end
    if E then if E then S end else S end

    depending on what is intended.

    Hmm, makes me wonder where that style comes from; Lisp has had a
    closed COND from the beginning (i.e., 1960), Forth also had a closed
    IF from early on (i.e., early 1970s), but I guess neither of those
    inspired Wirth when he did Modula-2. Dijkstra's guarded command
    language (which closed the if with fi) was published in 1975, and
    maybe that was the inspiration, but OTOH, the Modula-2 if syntax was
    already present in Modula, developed 1975, report in March 1976 <https://www.research-collection.ethz.ch/handle/20.500.11850/68669>.
    Wirth writes in his HOPL-III paper:

    |In planning Modula-2, I saw it as a new version of Pascal, updated to
    |the requirements of the time, and I seized the opportunity to correct
    |various mistakes in Pascal’s design, such as, for example, the
    |syntactic anomaly of the dangling “else”

    but he does not give any background on where the new syntax is coming
    from.

    I recommend that my students get rid of conflicts, because the
    default resolution of a conflict by Bison may be other than what is
    required. But I also design the language they have to implement such
    that avoiding conflicts is not too hard; e.g., I give them
    Modula-2-like rather than Algol-like syntax to implement.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/
    [Algol 68 had if ... then ... else ... fi in 1968. -John]

  • From Fernando@21:1/5 to All on Mon Dec 27 03:56:35 2021
    Hi Roger,

    One problem with ambiguous grammars is that ambiguities might confuse the semantics of the programming language. One example is with the associativity and precedence of operators. Concerning associativity, the ambiguous grammar below does not specify if subtraction is left or right associative:

    E ::= E - E | num
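
    For example, under this grammar the string 5 - 3 - 1 has two parse
    trees: the left-associative reading (5 - 3) - 1, which evaluates to 1,
    and the right-associative reading 5 - (3 - 1), which evaluates to 3.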

    And, there is this classic example from C-like languages involving conditional statements:

    <cmd> ::= if <bool_expr> then <cmd>
            | if <bool_expr> then <cmd> else <cmd>

    What would be the meaning of a statement like the one below? Depending on how you fix the ambiguity, it's possible to make the else refer to the innermost
    or the outermost conditional.

    if (a > b) then if (c > d) then print(1) else print(2)

    Regards,

    Fernando

  • From gah4@21:1/5 to Roger L Costello on Mon Dec 27 04:18:18 2021
    On Sunday, December 26, 2021 at 7:19:08 PM UTC-8, Roger L Costello wrote:

    I am reading the (excellent) book "flex & bison" by John Levine. Chapter 7 talks about conflicts in grammars: shift/reduce and reduce/reduce conflicts.

    At the end of the chapter are a few exercises/questions. I'd like to check with you on whether my answers to the questions are accurate and complete.

    As far as I know, the main cause of ambiguous grammars in programming
    languages is the nested if-then-optional-else structure. If you
    require the else, then it isn't ambiguous, but people like the
    optional else. That usually comes out as a shift/reduce conflict, and
    parser generators know how to handle that.

    Otherwise, the usual regular expression has an ambiguity which is often cured by taking the longest of the possible matches. It seems to me that more often I want the shorter match, though.

    But you already have the reply from John Levine ...
    [See previous message, where we fixed that with "fi" in 1968. -John]

  • From Jan Ziak@21:1/5 to Roger L Costello on Tue Dec 28 09:39:04 2021
    On Monday, December 27, 2021 at 4:19:08 AM UTC+1, Roger L Costello wrote:
    Question: Opine about why languages are usually defined and implemented with ambiguous grammars.

    My answer: Writing a grammar that completely avoids ambiguities might result in a grammar that is nightmarishly complex. It is easier/better to create a simple grammar and then add rules/descriptions which explain and resolve the ambiguities.

    Another issue is that creating an unambiguous grammar (without extending the parser generator with new features) would require inventing new abstract names in grammar rules to handle cases resolved by compilation passes executed after the parsing phase - but inventing new abstract names is very hard.

    For example, in C/C++ the meaning of the statement "a*b;" depends on
    the code parsed before the statement. If the code before is
    "int a, b;" then "a*b;" is a statement containing an expression - if
    the code before is "typedef int a;" then "a*b;" is a variable
    declaration.

    Another C/C++ example: "(a)+1.23" is a binary expression if "a" is an integer variable - but it is a type conversion if "a" has been defined as "typedef double a;".
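
    To make the first example concrete, here is a minimal sketch (two
    separate translation units, using the names from the example):

        /* Translation unit 1: "a" names a variable, so "a*b;" is an
           expression statement that multiplies a by b and discards
           the result. */
        int a, b;
        void f(void) { a * b; }

        /* Translation unit 2: "a" names a type, so "a*b;" declares b
           as a pointer to a, i.e. an int *. */
        typedef int a;
        void g(void) { a *b; }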

    In order to handle the above examples, the left-hand side of the [bison] grammar rules might want to use names such as STMT__MUL_EXP_OR_VAR_DECL and EXP__ADD_OR_CONVERT.

    In some languages derived from C/C++ but different from C/C++, declarations in the top-level scope and the package-level scope are order-independent. The meaning of the statement "a*b;" might depend on the code parsed before the statement or parsed after the statement.

    The other option (that is: extending the parser generator with new features, and thus avoiding abstract names in the rules of the grammar, and thus
    avoiding a nightmarishly complex grammar) means that the parser
    postpones the resolution of ambiguities until information computed in a subsequent compilation phase resolves the ambiguity. This basically requires both [the code generated by the parser generator] and [the programming
    language used to implement the compiler] to be concurrent programming languages.

    In summary: In my opinion, the answer to the question "Why are
    languages usually defined and implemented with ambiguous grammars?"
    comes down to how well the parser generator integrates with the
    compilation stages executed after the parsing stage.

    -atom

  • From Christopher F Clark@21:1/5 to All on Tue Dec 28 16:18:13 2021
    You already have lots of relevant commentary on your answer.

    However, I want to offer some counterpoint.

    The 1973 paper by Aho, Johnson, and Ullman, "Deterministic Parsing of
    Ambiguous Grammars", gives the background for how this is handled in
    yacc:

    https://www.researchgate.net/publication/234791287_Deterministic_parsing_of_ambiguous_grammars

    Thus, the use of the various directives is a long-known solution to
    the issue. However, LR conflicts frighten people. Their resolution
    hides the fact that the grammar is ambiguous and at least two parses
    are possible. But, with practice, one can get good at using them
    judiciously and knowing when they are simply expressing the desired
    intent or hiding a potential source of confusion.
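
    As a concrete sketch, the usual yacc/bison idiom for the dangling
    else uses exactly such directives (the token names here are
    illustrative, and the rules for expr and the other statement forms
    are omitted):

        %token IF THEN ELSE
        %nonassoc LOWER_THAN_ELSE   /* declared first, so lower than ELSE */
        %nonassoc ELSE
        %%
        stmt : IF expr THEN stmt %prec LOWER_THAN_ELSE
             | IF expr THEN stmt ELSE stmt
             ;

    Bison's default resolution (shift) already attaches the else to the
    innermost if; the declarations just make that choice explicit and
    silence the shift/reduce report.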

    More recently, GLR parsing has deferred the resolution and builds a
    parse forest when the grammar is truly ambiguous and not the result of
    the limitations of the LR parsing methodology. When that happens, one
    can use semantics to determine which parse was correct or declare a
    relevant error.

    By the way, an LR conflict doesn't guarantee that the grammar is
    actually ambiguous, just that resolving whether it is or not is beyond
    the capacity of the LR algorithm.

    So, while knowing when your grammar is ambiguous (or even just
    potentially ambiguous) is useful for telling whether your grammar
    describes the language you think it describes, ambiguity is not
    always the worst thing.

    Ambiguous grammars can have valuable properties. As you noted, writing
    an unambiguous grammar can be quite complicated (and in some cases not
    even possible), and the comparable ambiguous grammar can be quite
    concise, simple, and easy to read. And, putting the disambiguation
    outside the grammar can be a more maintainable solution.

    This is particularly true if you want a language with user-defined
    operators, or one where the user might want to adjust the precedence
    or associativity of the existing operators. Using numbers to indicate
    precedence is a well-known technique. It maps well onto things people
    understand. And the ability to compare the precedence of two operators
    based upon an associated number is easy to implement (while strictly
    outside the capacity of most parser generators).
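
    Indeed, here is a minimal sketch of that compare-by-number technique
    (precedence climbing) in C; the hard-coded prec() table stands in for
    one that a language with user-defined operators would fill in at run
    time:

        #include <ctype.h>
        #include <stdio.h>

        static const char *p;             /* input cursor */

        /* Larger number = binds tighter; 0 = not an operator. */
        static int prec(char op) {
            switch (op) {
            case '+': case '-': return 1;
            case '*': case '/': return 2;
            default:            return 0;
            }
        }

        static int parse_primary(void) {  /* unsigned integer literals */
            int v = 0;
            while (isdigit((unsigned char)*p))
                v = v * 10 + (*p++ - '0');
            return v;
        }

        /* Precedence climbing: consume operators of precedence >= min. */
        static int parse_expr(int min) {
            int lhs = parse_primary();
            while (prec(*p) >= min) {
                char op = *p++;
                /* Left associativity: the right operand may contain
                   only operators that bind strictly tighter than op. */
                int rhs = parse_expr(prec(op) + 1);
                switch (op) {
                case '+': lhs += rhs; break;
                case '-': lhs -= rhs; break;
                case '*': lhs *= rhs; break;
                case '/': lhs /= rhs; break;
                }
            }
            return lhs;
        }

        int main(void) {
            p = "2+3*4-5";
            printf("%d\n", parse_expr(1));   /* prints 9 */
            return 0;
        }

    Changing an operator's binding strength is one table entry, and
    making one right-associative just means recursing with prec(op)
    instead of prec(op) + 1.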

    Kind regards,
    Chris

    --
    ******************************************************************************
    Chris Clark                  email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
    23 Bailey Rd                 voice: (508) 435-5016
    Berlin, MA 01503 USA         twitter: @intel_chris
    ------------------------------------------------------------------------------

  • From gah4@21:1/5 to then I on Mon Dec 27 19:45:06 2021
    (snip about conflicts in grammars)
    (then I wrote)
    As far as I know, the main cause of ambiguous grammars in programming
    languages is the nested if-then-optional-else structure. If you
    require the else, then it isn't ambiguous, but people like the
    optional else. That usually comes out as a shift/reduce conflict, and
    parser generators know how to handle that.

    (snip)

    But you already have the reply from John Levine ...
    [See previous message, where we fixed that with "fi" in 1968. -John]

    I first learned if-then-else from PL/I, where it was designed in
    about 1963. I now suspect that people get used to the one they
    learned first, and find it more obvious.

    There is also some language, and I forget now which, with an ELIF
    construct, such that you can make a nested if-then-else sequence
    without the increase of nesting level, and so need only one ENDIF.

    IF e THEN s; ELIF e THEN s; ELIF e THEN s; ENDIF;

    (No comment on my preference for ENDIF vs. FI.)
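
    In EBNF, such a closed form looks roughly like this (a sketch; the
    keyword spelling varies by language: ELIF, ELSIF, elif):

        S -> if E then SS { elif E then SS } [ else SS ] end

    so any number of alternatives chain at a single nesting level, closed
    by one end keyword.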

    [That would also be Algol 68, using "fi"
    It didn't need the semicolons because they are separators,
    not terminators as in PL/I and C.
    If you've written scripts in the Bourne shell or its descendants
    such as bash and zsh, it deliberately looks a lot like Algol 68.
    A request from your moderator: if anyone is planning another round
    in the semicolon terminator vs. separator war, please don't.
    -John]

  • From Kaz Kylheku@21:1/5 to Roger L Costello on Wed Dec 29 18:48:46 2021
    On 2021-12-16, Roger L Costello <costello@mitre.org> wrote:
    Question: Opine about why languages are usually defined and implemented with ambiguous grammars.

    But they aren't.

    Languages are processed as a stream of characters or tokens, with hidden
    rules about how those relate together and the meaning that emerges.
    All of the rules are hidden, including the entire grammar.

    If you're only aware of some of the hidden rules, but not others, then
    you see ambiguity.

    But if you're only aware of some of the hidden rules, but not others,
    then you are not working with the correct language.

    For instance, I don't know of any mainstream language in which if/else
    is actually ambiguous. They have a hidden rule like that the else goes
    with the closest preceding if statement. This is no more or less hidden
    than the rule which says that the token "if" heads a phrase structure
    called an if statement.

    I think what the question is really asking is why computer languages are designed in layers, such as an ambiguous grammar, with rules added to
    it.

    That simply has to do with the convenience of specification in relation
    to the available tooling.

    If you have a parser generator which lets you write an ambiguous grammar
    like:

    E := E + E | E - E | E * E | E / E | E ** E | (E) | id | num

    and then add precedence/associativity specifications, then it behooves
    you to take advantage of it, rather than breaking out separate rules
    like "additive expression", "multiplicative expression", ...

    When you add those rules, though, you no longer have an ambiguous
    grammar.
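
    For contrast, the declaration approach keeps the ambiguous
    productions; in bison it might look like this sketch (token names are
    made up):

        %token NUM ID
        %left '+' '-'
        %left '*' '/'
        %right POW              /* the ** operator */
        %%
        e : e '+' e | e '-' e | e '*' e | e '/' e
          | e POW e | '(' e ')' | ID | NUM ;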

    There is another effect at play which is that designers are infatuated
    with complicated grammars that have lots of hidden rules. Thus we have languages whose programs can look ambiguous to someone who isn't an
    expert in all their rules. Keeping up the full expertise can require
    regular practice: constantly working with the language. (Use it or lose
    it).

    Thus, even though two languages we may be looking at are formally
    unambiguous, one may be informally more ambiguous than the other, due
    to being more loaded with hidden rules of syntax that one must
    internalize to read the code.

    So we can interpret the question as, why do we have all these languages
    with baroque syntax which give rise to ambiguity the moment you forget
    any of it?

    Languages are designed this way because of the belief that there is a
    notational advantage in it. If you have some hidden rule which causes
    some symbols to be related in such and such a way, it means that you
    have omitted the need for additional symbols which would otherwise
    indicate that structure. For instance in C, we can deduce from the
    hidden rules that A << B | C means (A << B) | C, which is obvious to
    someone who has memorized the precedence rules and works with this
    stuff daily. Yet, we tend to more or less reject that philosophy in
    our coding standards; we call for disambiguating parentheses. The GNU
    C compiler complains if you write A && B || C with -Wall warnings
    enabled: you get the "suggest parentheses" warning.

    (It's a kind of ironic situation: why do we have hidden rules that
    allow parentheses to be omitted, only to turn around and write
    tooling and coding standards which ask for them to be put back in?)
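
    A minimal example of that warning:

        /* gcc -Wall: "suggest parentheses around '&&' within '||'",
           even though the parse (A && B) || C is unambiguous. */
        int f(int A, int B, int C) { return A && B || C; }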

    Novice programmers have historically been attracted to cryptic-looking languages. It is one of the main reasons for the success of languages
    like C and Perl.

    For novice programmers, syntax is a barrier before semantics, and if
    you make the barrier sufficiently, though not impossibly, high, that
    creates motivation. Novices feel they are really learning something
    and getting ahead when all they are doing is absorbing the rules of
    syntax. Simply being able to work out the syntax of some code
    example, or write one that has no errors, is an accomplishment.

    If you give most people a language in which the syntax is easy, with
    few opportunities for informal ambiguity, they will just rush through
    the syntax and hit the brick wall of semantics: confronting the fact
    that programming is semantically hard. Of course, because people most
    often blame external factors for their failings, they will blame the
    language. Since they are not heavily invested in it, they can easily
    move on to something else. Maybe they will return to programming
    later, using a different language, and then pin their better success
    on that language rather than their own improved maturity.

    Informally ambiguous languages are needed to create a kind of tar pit
    to slow down newbies and keep them motivated. Then by the time they
    hit the real difficulties, they are too invested in it to quit.

    "But I know all this syntax after months of learning! How can it be that
    my program doesn't work? I'm too far along not to stick with it and get
    it working. Doggone it, I now have a self-image as a programmer to
    defend!"

    I also believe there is one more element at play: mathematics. People
    study mathematics in school, and those who go on to do programming tend
    to be ones who were more exposed to it or paid more attention.

    People who are programmers actually had their first contact with
    formal syntax in mathematics.

    The conflation between syntax and semantics may ultimately come from
    that place. Mathematicians design their notations deliberately, in such
    ways that when they manipulate symbols, while observing certain rules,
    they are actually preserving semantics. The notation directly enables semantically meaningful manipulation, as a tool of thought.

    There is a psychological effect at play that a programming language
    designed with lots of syntactic rules will somehow also serve as a
    tool of thought, similarly to math notation. It cannot be denied
    that, to some extent, that plan pans out. Programmers play with the
    symbols and discover idioms similar to algebraic rules. You look at C
    code and recognize "Duff's device" similarly to how you might
    recognize some Lagrangian thing in a math formula.


    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

  • From Jan Ziak@21:1/5 to Kaz Kylheku on Wed Dec 29 16:05:19 2021
    On Wednesday, December 29, 2021 at 11:28:34 PM UTC+1, Kaz Kylheku wrote:
    On 2021-12-16, Roger L Costello wrote:
    Question: Opine about why languages are usually defined and implemented with
    ambiguous grammars.
    But they aren't.

    Languages are processed as a stream of characters or tokens, with hidden rules about how those relate together and the meaning that emerges.
    All of the rules are hidden, including the entire grammar.

    If you're only aware of some of the hidden rules, but not others, then
    you see ambiguity.

    But if you're only aware of some of the hidden rules, but not others,
    then you are not working with the correct language.

    For instance, I don't know of any mainstream language in which if/else
    is actually ambiguous. They have a hidden rule like that the else goes
    with the closest preceding if statement.

    When designing a grammar and implementing a parser: the grammar can
    either be unambiguous by design or unambiguous by accident. The
    viewpoint that "there isn't any mainstream language in which if/else
    is actually ambiguous" is actually the latter option: unambiguous by
    accident.

    A primary reason why grammars in many mainstream languages (that don't
    have a parser generated straight from a verified grammar) are
    unambiguous isn't intentional design, but rather it is a consequence
    of the fact that those parsers are directly implemented in a language
    that is executing statements/expressions/instructions without
    verifying consequences of the executions. Some examples of languages
    with such execution properties: assembly language (such as: i386,
    ARM), C, Haskell. Contrary to the accidental approach, a parser
    generator by design cares about consequences and it is verifying that
    the specification actually is unambiguous despite the fact that in the
    end the parser gets compiled down into machine instructions.
    Verification means to search the whole search space (or at least a
    reasonably large subspace of it) - but asm/C/Haskell will run a search
    only if it is explicitly (step by step) forced by the programmer to
    perform a search.

    -atom

  • From gah4@21:1/5 to Kaz Kylheku on Wed Dec 29 18:41:17 2021
    On Wednesday, December 29, 2021 at 2:28:34 PM UTC-8, Kaz Kylheku wrote:

    (snip)

    I also believe there is one more element at play: mathematics. People
    study mathematics in school, and those who go on to do programming tend
    to be ones who were more exposed to it or paid more attention.

    This reminds me of learning the associativity of exponentiation (**)
    in Fortran IV (I believe it isn't in the Fortran 66 standard) before
    I learned it in algebra class. I suspect that there are others I
    learned from programming before learning them in math class.

    People who are programmers actually had a first contact with formal
    syntax in mathematics.

    The conflation between syntax and semantics may ultimately come from
    that place. Mathematicians design their notations deliberately, in such
    ways that when they manipulate symbols, while observing certain rules,
    they are actually preserving semantics. The notation directly enables semantically meaningful manipulation, as a tool of thought.

    I suspect that people learn some things in the first programming language
    that they learn, and then expect it to be the same in others. When it isn't, people get surprised or confused.

    When I started with unix, I learned csh programming, and mostly
    avoided sh (and successors). One reason for that is, as well as I
    knew at the time, differences in the syntax and semantics of them.
    [Fortran has always had ** exponentiation, starting with the original
    version in 1956. It always bound tighter than +-*/ but wasn't
    associative, A**B**C not allowed, -John]

  • From Kaz Kylheku@21:1/5 to gah4@u.washington.edu on Thu Dec 30 18:14:50 2021
    On 2021-12-30, gah4 <gah4@u.washington.edu> wrote:
    On Wednesday, December 29, 2021 at 2:28:34 PM UTC-8, Kaz Kylheku wrote:

    (snip)

    I also believe there is one more element at play: mathematics. People
    study mathematics in school, and those who go on to do programming tend
    to be ones who were more exposed to it or paid more attention.

    This reminds me of learning the associativity of exponentiation (**)
    in Fortran IV (I believe it isn't in the Fortran 66 standard) before
    I learned it in algebra class. I suspect that there are others I
    learned from programming before learning them in math class.

    In Common Lisp, the expt function is strictly binary, so it
    eliminates the question of associativity. Some basic arithmetic
    functions are n-ary, like (+ a b c d e f), where it is documented
    (and readily understood) that in cases where it matters, evaluation
    is a left-to-right reduction.
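
    For example, (- 10 1 2) reduces left to right as (- (- 10 1) 2),
    giving 7, rather than (- 10 (- 1 2)), which would give 11.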

    In TXR Lisp, I made expt n-ary, so you can write

    (expt x y z w)

    But! The associativity is right-to-left, making that equivalent to:

    (expt x (expt y (expt z w)))

    This is for two reasons. One is math: (expt x y z w) defined this
    way follows the usual superscript-tower convention, which associates
    from the top down:

        x^(y^(z^w))

    secondly, it is more useful, because the left-to-right interpretation
    is

        ((x^y)^z)^w

    and that is just

        (expt x (* y z w))

    which is easy enough to write if that's what you want! It's not much
    more verbiage than the (expt x y z w) you may have wanted. Regardless of
    the number of operands, it's just an extra set of parentheses and a *
    operator.

    Whereas if you want (expt x (expt y (expt z w))) and (expt x y z w)
    doesn't give it to you, that *is* a lot of verbiage, whose nesting grows
    with each additional argument.

    The associativity rule that saves the most verbiage is the better one,
    even if it is opposite to many other arithmetic functions.

    People who are programmers actually had a first contact with formal
    syntax in mathematics.

    The conflation between syntax and semantics may ultimately come from
    that place. Mathematicians design their notations deliberately, in such
    ways that when they manipulate symbols, while observing certain rules,
    they are actually preserving semantics. The notation directly enables
    semantically meaningful manipulation, as a tool of thought.

    I suspect that people learn some things in the first programming language that they learn, and then expect it to be the same in others. When it isn't, people get surprised or confused.

    When I started with unix, I learned csh programming, and mostly
    avoided sh (and successors). One reason for that is, as well as I
    knew at the time, differences in the syntax and semantics of them.
    [Fortran has always had ** exponentiation, starting with the original
    version in 1956. It always bound tighter than +-*/ but wasn't
    associative, A**B**C not allowed, -John]

    When I started programming from nothing, I saw BASIC examples in a
    book which was doing things like:

    10 X = 2
    20 X = X + 1

    The only language with formulas that I was coming from was math.
    (Though I was only in grade 6, I knew how to solve systems of linear
    equations from school, because I had recently come to Canada from
    Slovakia.)

    So, I thought, what? How can X be equal to X + 1; you cannot solve
    this absurdity!

    From then I knew that the people who program computers to understand
    symbols are free thinkers who make them mean anything they want.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

  • From Kaz Kylheku@21:1/5 to Jan Ziak on Thu Dec 30 18:00:47 2021
    On 2021-12-30, Jan Ziak <0xe2.0x9a.0x9b@gmail.com> wrote:
    On Wednesday, December 29, 2021 at 11:28:34 PM UTC+1, Kaz Kylheku wrote:
    On 2021-12-16, Roger L Costello wrote:
    Question: Opine about why languages are usually defined and implemented with
    ambiguous grammars.
    But they aren't.

    Languages are processed as a stream of characters or tokens, with hidden
    rules about how those relate together and the meaning that emerges.
    All of the rules are hidden, including the entire grammar.

    If you're only aware of some of the hidden rules, but not others, then
    you see ambiguity.

    But if you're only aware of some of the hidden rules, but not others,
    then you are not working with the correct language.

    For instance, I don't know of any mainstream language in which if/else
    is actually ambiguous. They have a hidden rule like that the else goes
    with the closest preceding if statement.

    When designing a grammar and implementing a parser: the grammar can
    either be unambiguous by design or unambiguous by accident. The
    viewpoint that "there isn't any mainstream language in which if/else
    is actually ambiguous" is actually the latter option: unambiguous by accident.

    Languages can be ambiguous at the specification level, even if a
    given implementation behaves unambiguously:

    E.g.:

    1. The implementation is completely unambiguous and says that else goes
    with closest preceding if.
    2. The documentation says something else. (So the users figure this out
    and get their code working based on the implementation.)
    3. (But) a new implementation appears, based on the documentation.

    Or:

    1. The language spec doesn't say anything about which way something
    is parsed.
    2. Multiple implementations do it willy-nilly.

    Clearly, for instance in C, we have semantic ambiguities like
    a[i] = i++. (A parser depending on the behavior of something like that
    could have a grammar ambiguity: something is parsed in two or more
    different ways based on which way the undefined construct behaves. That
    would be a defect, of course.)

    A primary reason why grammars in many mainstream languages (that don't
    have a parser generated straight from a verified grammar) are
    unambiguous isn't intentional design, but rather it is a consequence
    of the fact that those parsers are directly implemented in a language
    that is executing statements/expressions/instructions without
    verifying consequences of the executions. Some examples of languages
    with such execution properties: assembly language (such as: i386,
    ARM), C, Haskell.

    I don't quite follow this; you seem to be saying that Turing machines
    are deterministic, and so if we implement a parser as a Turing process,
    it will be "accidentally" unambiguous because of determinism.

    However, it may so happen that the lack of ambiguity depends on a whole
    lot of context. For instance, a Turing process parsing a language,
    proceeding left to right, could decide the rule for "if/then/else" on
    a case-by-case basis, influenced by everything it has parsed before.

    Suppose you have a language which parses a sequence of top-level
    expressions from a terminal or file, and executes each one before moving
    to the next. Those expressions could be used to invoke an API in the
    language implementation to change the treatment of subsequent syntax.

    Sure, everything is still unambiguous, if we take into account the
    entire stream of expressions from the beginning and understand its
    effect on the parsing machine.

    If you don't do anything of this sort, and just write, say, a
    recursive descent parser which isn't influenced by any weird state
    flags (whether purely internal or external) that change the parsing,
    and write it in a safe, very high-level language in which there are
    few risks of nonportable or undefined behaviors, then you end up with
    a "simply unambiguous" grammar. You should be able to investigate the
    behavior of if/else with several test cases and be confident that the
    observations hold in all relevant contexts where that construct can
    appear.
    You might not exactly know what the grammar is for the entire language,
    but if you figure it out from the code and accurately document it,
    then you're good. (Unless the thing is required to conform to some
    external specification and fails.)

    Contrary to the accidental approach, a parser
    generator by design cares about consequences and it is verifying that
    the specification actually is unambiguous despite the fact that in the
    end the parser gets compiled down into machine instructions.
    Verification means to search the whole search space (or at least a
    reasonably large subspace of it) - but asm/C/Haskell will run a search
    only if it is explicitly (step by step) forced by the programmer to
    perform a search.

    Yes, e.g. a LALR(1) shift-reduce parser generator generates the entire
    space of LR(0) items that drive the stack machine, and then when it
    populates tables, it discovers conflicts there.

    With someone's hand-written parser, we cannot be sure without
    searching with test cases, which are language instances, which is
    intractable to do exhaustively.

    Just because if/else is behaving in certain ways in certain test cases
    doesn't mean that a new, more complex test case with more/different
    context cannot be found in which the if/else behaves differently.

    The procedural/recursive parser code doesn't declare to us what the
    grammar is. It could be hiding multiple rules for if/else active in
    different contexts which differ from each other.

    That's not an actual ambiguity, but for the purposes of working with
    the language, it's an informal ambiguity (related to my earlier
    observation about the user not knowing what all the hidden rules are).

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    [If you write a hand-written recursive descent parser, which was
    typical before yacc made it easy to use a LALR parser, it was common
    for the language the parser accepted to be somewhat different from
    the one the programmer intended to accept, due to less than complete
    checking for unexpected inputs.
    On the other hand, that C example isn't ambiguous, it's deliberately indeterminate. Early Fortran allowed the compiler to compile any mathematically equivalent, not just numerically equivalent, version
    of an expression so A*B+A*C could turn into A*(B+C) which was great
    for optimizing, not so much for predictable results. -John]

  • From Kaz Kylheku@21:1/5 to All on Thu Dec 30 20:08:26 2021
    On 2021-12-30, Kaz Kylheku <480-992-1380@kylheku.com> wrote:
    John> On the other hand, that C example isn't ambiguous, it's
    John> deliberately indeterminate.

    The C example isn't ambiguous in its parse, but the semantics is
    ambiguous. Expressions can be evaluated in multiple orders in C,
    and if we follow different orders for that expression, we get different
    results (in such a way that it's deemed undefined).

    Here is something merely unspecified:

    a() + b() + c()

    Suppose a prints "a" to stdout, b prints "b" and c prints "c":

    int a() { putchar('a'); return 0; }
    int b() { putchar('b'); return 0; }
    int c() { putchar('c'); return 0; }

    The behavior is not undefined, but unspecified: we don't know
    which of six permutations is printed, but we know it's one of them:
    abc, acb, bac, bca, cab, cba.

    That's a kind of ambiguity in the language (syntax + semantics).

    We know that the parse is

    (a() + b()) + c()

    But the meaning requires that a, b and c are executed in order
    to produce the operands to the + operator, the order in which
    those calls take place is not specified.

    That's a clear ambiguity.

    Just like if I say "pick up the dry cleaning and fill up the gas
    tank", there is no grammar ambiguity; yet we don't know whether the
    sequencing is required, or whether the gas tank can be filled first.
    The request describes multiple possible scenarios, including going to
    a gas station that does dry-cleaning, and picking up while an
    attendant fills the tank.

    John> ... indeterminate. Early Fortran allowed the compiler to
    John> compile any mathematically equivalent, not just numerically
    John> equivalent, version of an expression so A*B+A*C could turn
    John> into A*(B+C) which was great for optimizing, not so much for
    John> predictable results. -John]

    Why wait for early Fortran to arrive? If you write in C, you can use
    gcc -ffast-math (an umbrella option for turning on a whole lot of
    stuff, among them -fassociative-math and -freciprocal-math).

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

  • From Jan Ziak@21:1/5 to Kaz Kylheku on Thu Dec 30 13:47:36 2021
    On Thursday, December 30, 2021 at 7:56:15 PM UTC+1, Kaz Kylheku wrote:
    When I started programming from nothing, I saw BASIC examples in a
    book which was doing things like:

    10 X = 2
    20 X = X + 1

    The only language with formulas that I was coming from was math.

    So, I thought, what? How can X be equal to X + 1; you cannot solve
    this absurdity!

    From then I knew that the people who program computers to understand
    symbols are free thinkers who make them mean anything they want.

    "X = X + Y" means "X[t+1] = X[t] + Y[t]" where t is time. Time had to be omitted from the notation of the BASIC programming language because otherwise the source code would consume a much larger amount of computer memory and it would complicate GOTO and FOR/NEXT statements.

    -atom
    [Interesting take. In reality, of course, BASIC borrowed that from
    Fortran. Algol used := for assignment, different from = for equality
    comparison. -John]

  • From gah4@21:1/5 to All on Thu Dec 30 13:40:32 2021
    On Wednesday, December 29, 2021 at 7:28:33 PM UTC-8, gah4 wrote:

    (snip, I wrote)

    This reminds me of learning the associativity of exponentiation (**)
    in Fortran IV (I believe it isn't in the Fortran 66 standard) before
    I learned it in algebra class. I suspect that there are others I
    learned from programming before learning them in math class.

    (snip)

    [Fortran has always had ** exponentiation, starting with the original
    version in 1956. It always bound tighter than +-*/ but wasn't
    associative, A**B**C not allowed, -John]

    It was, at least, in Fortran IV for IBM 360/370:

    https://atariwiki.org/wiki/attach/Fortran/IBM_FORTRAN_IV-Language_1973.pdf

    My 8th grade graduation present was the above manual, though maybe
    one year earlier. I used to read IBM reference manuals like books,
    from start to finish. By the end of summer, I had run many Fortran programs.

    As well as I know it, IBM Fortran IV was the input to the 1966 standard,
    but not all features were included. It might also be that extensions were added later.
    [I used Fortran H on Princeton's 360/91 in a summer job I had in
    college in about 1973. -John]

  • From mac@21:1/5 to All on Mon Jan 3 19:51:53 2022
    [Interesting take. In reality, of course, BASIC borrowed that from
    Fortran. Algol used := for assignment, different from = for equality
    comparison. -John]

    Indeed.
    Unfortunately, assignment is probably the single most common operator.
    The ASCII committee should have kept the left-arrow character instead of replacing it with underscore.
