• Deep expression chain performance

    From Andy@21:1/5 to All on Tue Apr 25 06:51:55 2023
    I am writing C grammar.
    Grammar speed may be down caused by deep expression chain.
    For example, simple "n=0" has 20 levels:
    assignmentExpression
    conditionalExpression
    logicalOrExpression
    logicalAndExpression
    inclusiveOrExpression
    exclusiveOrExpression
    andExpression
    eqaulityExpression
    relationalExpression
    shiftExpression
    additiveExpression
    multiplicativeExpression
    castExpression
    unaryExpression
    postfixExpression
    postfixExpressionLeft
    atom
    literal
    IntegerConstant
    DeciamlConstant

    In other hand, Rust grammar for Antlr4 https://github.com/antlr/grammars-v4/blob/master/rust/RustParser.g4
    has precedence definition:
    expression
    : outerAttribute+ expression # AttributedExpression // technical, remove left recursive
    | literalExpression # LiteralExpression_
    | pathExpression # PathExpression_
    | expression '.' pathExprSegment '(' callParams? ')' # MethodCallExpression // 8.2.10
    | expression '.' identifier # FieldExpression // 8.2.11
    | expression '.' tupleIndex # TupleIndexingExpression // 8.2.7
    | expression '.' 'await' # AwaitExpression // 8.2.18
    ..............................
    If I do similar in C grammar:
    expr
    : atom # atom_
    | '__extension__'? '(' compoundStatement ')' # groupedExpression
    | expr '(' argumentExpressionList? ')' # postfixExpression
    | expr '.' Identifier # postfixExpression
    | expr '->' Identifier # postfixExpression
    | expr '[' expr ']' # postfixExpression

    parsing time increased by orders of magnitude: short .c files was processed longer a long file I didn't wait for it to finish (probably increasing nonlinear)

    My ask:
    form expr: expr operator expr
    is always ambiguous and waste time or I something bad have written?
    How is possible to reduce depth of chain?
    Parsing time with Antlr is much longer than parsing + semantic actions with GCC. Probably big part of this time is long depth of expressions.
    [If your parser is taking a lot of time, something is wrong. In the
    compilers I know, the pareer is always a tiny part of the work, which
    is dominated by the lexer which has to look at each character in the
    input, and optimizers that have to make multiple passes over the whole intermediate representation. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Andy on Wed Apr 26 16:33:54 2023
    Andy <borucki.andrzej@gmail.com> writes:
    [ANTLR]
    If I do similar in C grammar:
    expr
    : atom # atom_
    | '__extension__'? '(' compoundStatement ')' # groupedExpression
    | expr '(' argumentExpressionList? ')' # postfixExpression
    | expr '.' Identifier # postfixExpression
    | expr '->' Identifier # postfixExpression
    | expr '[' expr ']' # postfixExpression

    parsing time increased by orders of magnitude: short .c files was processed longer a long file I didn't wait for it to finish (probably increasing nonlinear)

    I am not an expert on ANTLR, but it is based on LL-parsing, and
    normally LL-based parsers endlessly recurse on left recursion.
    However, given that your Rust grammar has left recursion, too, ANTLR
    seems to be able to cope with that problem at least in some cases.

    There are other features in ANTLR that cause superlinear parsing
    times: syntactic predicates and semantic predicates. IIRC there can
    also be slowdowns if a long lookahead is necessary.

    My ask:
    form expr: expr operator expr
    is always ambiguous

    Yes.

    and waste time

    Depends on how the ambiguity is dealt with. E.g., yacc resolves the
    ambiguity in some way, and suppresses the alternative parse, and
    therefore has linear performance. However, the suppressed alternative
    may be the one you were interested in, so better define the grammar unambiguously. I don't see such a rule in the example above, though.

    By contrast, a GLR parser or Early parser keeps the alternatives in
    some way, resulting in superexponential performance (n^3 for Early,
    not sure for GLR).

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to Andy on Wed Apr 26 13:06:51 2023
    On Tuesday, April 25, 2023 at 9:21:44 AM UTC-7, Andy wrote:
    I am writing C grammar.
    Grammar speed may be down caused by deep expression chain.
    For example, simple "n=0" has 20 levels:
    assignmentExpression
    conditionalExpression
    logicalOrExpression
    logicalAndExpression
    inclusiveOrExpression
    exclusiveOrExpression
    andExpression
    eqaulityExpression
    relationalExpression
    shiftExpression
    additiveExpression
    multiplicativeExpression
    castExpression
    unaryExpression
    postfixExpression
    postfixExpressionLeft
    atom
    literal
    IntegerConstant
    DeciamlConstant

    A not unusual compiler design in the olden (small memory) days, is recursive descent
    for statements, and operator precedence for expressions. If you do
    expressions with
    recursive descent, you get, as you note, 20 levels on the stack. Operator precedence
    avoids all those levels, and works well for expressions for most languages.

    Others have explained how parser generators do it.

    Otherwise, the design of small-memory compilers is a lost art.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Andy on Thu Apr 27 00:28:55 2023
    On 2023-04-25, Andy <borucki.andrzej@gmail.com> wrote:
    I am writing C grammar.
    Grammar speed may be down caused by deep expression chain.
    For example, simple "n=0" has 20 levels:
    assignmentExpression
    conditionalExpression
    logicalOrExpression
    logicalAndExpression
    inclusiveOrExpression
    exclusiveOrExpression
    andExpression
    eqaulityExpression
    relationalExpression
    shiftExpression
    additiveExpression
    multiplicativeExpression
    castExpression
    unaryExpression
    postfixExpression
    postfixExpressionLeft
    atom
    literal
    IntegerConstant
    DeciamlConstant

    ISO C specifies its grammar in an obtusely verbose way, similar to the
    above. The reason is that it makes the grammar unambiguous, meaning
    that it does not require any annotation or external remarks about
    ambiguities having to be resolved by a certain associativity or
    precedence.

    If you have some grammar processor which literally follows all those productions, doing all of their reductions, then it will likely
    not perform well.

    I suspect even a LALR(1) compiled grammar will have an issue
    there; I'm not aware that Yacc-like parser generators do any
    collapsing of cascaded reductions which do nothing but { $$ = $1; }.

    Here is a concrete example. Attached below is over 200 lines of
    y.output from a Yacc implementation (Bison). In that y.output
    we can trace the actions of the state machine and see what happens
    when a the simple token LIT is reduced to an expression.

    In State 0, when the parser sees a LIT, it will shift the token
    and go to State 1. There, a reduction takes place via Rule 9,
    and so we have a parser_expression on the top of the stack.

    Then, back to State 0, where now have a primary_expression,
    and so we go to state 8.

    State 8, similarly to State 1, performs a reduction:
    primary expression to mult_expression.

    Back to State 0, where we have to dispatch a similar
    thing again for multi_expression, then additive_expression,
    then expression and finally top.

    Thanks for raising this issue; I suspect it's a real performance
    problem. Not only is a factored-out grammar hard to read but the
    performance is bad due to useless back and forth transitions between superfluous states to propagate reductions.

    Above-referenced material follows:

    Grammar

    0 $accept: top $end

    1 top: expression ';'
    2 | expression ';' expression

    3 expression: additive_expression

    4 additive_expression: mult_expression
    5 | mult_expression '+' additive_expression

    6 mult_expression: primary_expression
    7 | mult_expression '*' primary_expression

    8 primary_expression: '(' expression ')'
    9 | LIT
    10 | IDENT


    Terminals, with rules where they appear

    $end (0) 0
    '(' (40) 8
    ')' (41) 8
    '*' (42) 7
    '+' (43) 5
    ';' (59) 1 2
    error (256)
    LIT (258) 9
    IDENT (259) 10


    Nonterminals, with rules where they appear

    $accept (10)
    on left: 0
    top (11)
    on left: 1 2, on right: 0
    expression (12)
    on left: 3, on right: 1 2 8
    additive_expression (13)
    on left: 4 5, on right: 3 5
    mult_expression (14)
    on left: 6 7, on right: 4 5 7
    primary_expression (15)
    on left: 8 9 10, on right: 6 7


    state 0

    0 $accept: . top $end

    LIT shift, and go to state 1
    IDENT shift, and go to state 2
    '(' shift, and go to state 3

    top go to state 4
    expression go to state 5
    additive_expression go to state 6
    mult_expression go to state 7
    primary_expression go to state 8


    state 1

    9 primary_expression: LIT .

    $default reduce using rule 9 (primary_expression)


    state 2

    10 primary_expression: IDENT .

    $default reduce using rule 10 (primary_expression)


    state 3

    8 primary_expression: '(' . expression ')'

    LIT shift, and go to state 1
    IDENT shift, and go to state 2
    '(' shift, and go to state 3

    expression go to state 9
    additive_expression go to state 6
    mult_expression go to state 7
    primary_expression go to state 8


    state 4

    0 $accept: top . $end

    $end shift, and go to state 10


    state 5

    1 top: expression . ';'
    2 | expression . ';' expression

    ';' shift, and go to state 11


    state 6

    3 expression: additive_expression .

    $default reduce using rule 3 (expression)


    state 7

    4 additive_expression: mult_expression .
    5 | mult_expression . '+' additive_expression
    7 mult_expression: mult_expression . '*' primary_expression

    '+' shift, and go to state 12
    '*' shift, and go to state 13

    $default reduce using rule 4 (additive_expression)


    state 8

    6 mult_expression: primary_expression .

    $default reduce using rule 6 (mult_expression)


    state 9

    8 primary_expression: '(' expression . ')'

    ')' shift, and go to state 14


    state 10

    0 $accept: top $end .

    $default accept


    state 11

    1 top: expression ';' .
    2 | expression ';' . expression

    LIT shift, and go to state 1
    IDENT shift, and go to state 2
    '(' shift, and go to state 3

    $default reduce using rule 1 (top)

    expression go to state 15
    additive_expression go to state 6
    mult_expression go to state 7
    primary_expression go to state 8


    state 12

    5 additive_expression: mult_expression '+' . additive_expression

    LIT shift, and go to state 1
    IDENT shift, and go to state 2
    '(' shift, and go to state 3

    additive_expression go to state 16
    mult_expression go to state 7
    primary_expression go to state 8


    state 13

    7 mult_expression: mult_expression '*' . primary_expression

    LIT shift, and go to state 1
    IDENT shift, and go to state 2
    '(' shift, and go to state 3

    primary_expression go to state 17


    state 14

    8 primary_expression: '(' expression ')' .

    $default reduce using rule 8 (primary_expression)


    state 15

    2 top: expression ';' expression .

    $default reduce using rule 2 (top)


    state 16

    5 additive_expression: mult_expression '+' additive_expression .

    $default reduce using rule 5 (additive_expression)


    state 17

    7 mult_expression: mult_expression '*' primary_expression .

    $default reduce using rule 7 (mult_expression)


    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Kaz Kylheku on Thu Apr 27 06:17:51 2023
    Kaz Kylheku <864-117-4973@kylheku.com> writes:
    ISO C specifies its grammar in an obtusely verbose way, similar to the
    above. The reason is that it makes the grammar unambiguous, meaning
    that it does not require any annotation or external remarks about
    ambiguities having to be resolved by a certain associativity or
    precedence.

    They were not that keen on eliminating ambiguity. The classical
    ambiguous grammar rule is there:

    |selection-statement:
    | if ( expression ) statement
    | if ( expression ) statement else statement
    | switch ( expression ) statement

    They resolve that ambiguity in prose:

    |An else is associated with the lexically nearest preceding if that is
    |allowed by the syntax.

    C is not alone in this approach.

    As for why they chose to deal with expressions through BNF, my guess
    is that it allowed them to split the expression syntax into different
    sections, each with their own piece of the grammar, rather than having
    a section on the syntax that presents an ambuguous BNF and resolves
    the ambiguity in prose (and by necessity has to cover the whole
    expression syntax), and then having sections on the semantics of the
    various expression sub-syntaxes.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)