Ambiguity when Parsing Preceding List - parsing

While writing parser code in Menhir, I keep coming across this design pattern that is becoming very frustrating. I'm trying to build a parser that accepts either "a*ba" or "bb". To do this, I'm using the following syntax (note that A* is the same as list(A)):
exp:
| A*; B; A; {1}
| B; B; {2}
However, this code fails to parse the string "ba". The menhir compiler also indicates that there are shift-reduce conflicts in the parser, specifically as follows:
** In state 0, looking ahead at B, shifting is permitted
** because of the following sub-derivation:
. B B
** In state 0, looking ahead at B, reducing production
** list(A) ->
** is permitted because of the following sub-derivation:
list(A) B A // lookahead token appears
So | B A requires a shift, while | A* B A requires a reduce when the first token is B. I can resolve this ambiguity manually and get the expected behavior by changing the expression to read as follows (note that A+ is the same as nonempty_list(A)):
exp2:
| B; A; {1}
| A+; B; A; {1}
| B; B; {2}
In my mind, exp and exp2 read the same, but are clearly treated differently. Is there a way to write exp to do what I want without code duplication (which can cause other problems)? Is this a design pattern I should be avoiding entirely?

exp and exp2 parse the same language, but they're definitely not the same grammar. exp requires a two-symbol lookahead to correctly parse a sentence starting with B, for precisely the reason you've noted: the parser can't decide whether to insert an empty A* into the parse before it sees the symbol after the B, but it needs to do that insertion before it can handle the B. By contrast, exp2 does not need an empty production to create an empty list of As before B A, so no decision is necessary.
You don't need a list to produce this conflict. Replacing A* with A? would produce exactly the same conflict.
You've already found the usual solution to this shift-reduce conflict for LALR(1) parser generators: a little bit of redundancy. As you note, though, that solution is not ideal.
Another common solution (but maybe not for menhir) involves using a right-recursive definition of a terminated list:
prefix:
| B;
| A; prefix;
exp:
| prefix; A; { 1 }
| B; B; { 2 }
As far as I know, menhir's standard library doesn't include a terminated list macro, but it would easy enough to write. It might look something like this:
%public terminated_list(X, Y):
| y = Y;
{ ( [], y ) }
| x = X; xsy = terminated_list(X, Y);
{ ( x :: (fst xsy), (snd xsy) ) }
There's probably a more idiomatic way to write that; I don't pretend to be an OCAML coder.

Related

Preferring shift over reduce in parser for language without statement terminators

I'm parsing a language that doesn't have statement terminators like ;. Expressions are defined as the longest sequence of tokens, so 5-5 has to be parsed as a subtraction, not as two statements (literal 5 followed by a unary negated -5).
I'm using LALRPOP as the parser generator (despite the name, it is LR(1) instead of LALR, afaik). LALRPOP doesn't have precedence attributes and doesn't prefer shift over reduce by default like yacc would do. I think I understand how regular operator precedence is encoded in an LR grammar by building a "chain" of rules, but I don't know how to apply that to this issue.
The expected parses would be (individual statements in brackets):
"5 - 5" → 5-5 instead of 5, -5
"5 (- 5)" → 5, -5
"- 5" → -5
"5 5" → 5, 5
How do I change the grammar such that it always prefers the longer parse?
Going through the first few pages of google results as well as stack overflow didn't yield any results for this specific problem. Most related questions need more lookahead or the result is to not allow consecutive statements without terminators.
I created a minimal sample grammar that reproduces the shift/reduce conflict (a statement in this grammar is just an expression, in the full grammar there would also be "if", "while", etc. and more levels of operator precedence, but I've omitted them for brevity). Besides unary minus, there are also other conflicts in the original grammar like print(5), which could be parsed as the identifier print and a parenthesized number (5) or a function call. There might be more conflicts like this, but all of them have the same underlying issue, that the longer sequence should be preferred, but both are currently valid, though only the first should be.
For convenience, I created a repo (checkout and cargo run). The grammar is:
use std::str::FromStr;
grammar;
match {
"+",
"-",
"(",
")",
r"[0-9]+",
// Skip whitespace
r"\s*" => { },
}
Expr: i32 = {
<l:Expr> "+" <r:Unary> => l + r,
<l:Expr> "-" <r:Unary> => l - r,
Unary,
};
Unary: i32 = {
"-" <r:Unary> => -r,
Term,
}
Term: i32 = {
Num,
"(" <Expr> ")",
};
Num: i32 = {
r"[0-9]+" => i32::from_str(<>).unwrap(),
};
Stmt: i32 = {
Expr
};
pub Stmts: Vec<i32> = {
Stmt*
};
Part of the error (full error message):
/lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8: Local ambiguity detected
The problem arises after having observed the following symbols in the input:
Stmt+ Expr
At that point, if the next token is a `"-"`, then the parser can proceed in two different ways.
First, the parser could execute the production at
/lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8, which would consume
the top 1 token(s) from the stack and produce a `Stmt`. This might then yield a parse tree like
Expr ╷ Stmt
├─Stmt──┤ │
├─Stmt+─┘ │
└─Stmt+──────┘
Alternatively, the parser could shift the `"-"` token and later use it to construct a `Expr`. This might
then yield a parse tree like
Stmt+ Expr "-" Unary
│ ├─Expr───────┤
│ └─Stmt───────┤
└─Stmt+────────────┘
See the LALRPOP manual for advice on making your grammar LR(1).
The issue you're going to have to confront is how to deal with function calls. I can't really give you any concrete advice based on your question, because the grammar you provide lacks any indication of the intended syntax of functions calls, but the hint that print(5) is a valid statement makes it clear that there are two distinct situations, which need to be handled separately.
Consider:
5 - 5 One statement 5 ( - 5 ) Two statements
print(-5) One statement print - 5 Two statements (presumably)
a - 5 ???
The ambiguity of the third expression could be resolved if the compiler knew whether a is a function or a variable (if we assume that functions are not first-class values, making print an invalid statement). But there aren't many ways that the parser could know that, and none of them seem very likely:
There might not be any user-defined functions. Then the lexer could be built to recognise identifier-like tokens which happen to be built-in functions (like print) and then a(-5) would be illegal since a is not a built-in function.
The names of functions and identifiers might differ in some way that the lexer can detect. For example, the language might require functions to start with a capital letter. I presume this is not the case since you wrote print rather than Print but there might be some other simple distinction, such as requiring identifiers to be a single character.
Functions must be declared as such before the first use of the function, and the parser shares the symbol table with the lexer. (I didn't search the rather inadequate documentation for the generator you're using to see if lexical feedback is practical.)
If there were an optional statement delimiter (as with Lua, for example), then you could simply require that statements which start with parentheses (usually a pretty rare case) be explicitly delimited unless they are the first statement in a block. Or there might be an optional keyword such as compute which can be used as an unambiguous statement starter and whose use is required for statements which start with a parenthesis. I presume that neither of these is the case here, since you could have used that to force 5 - 5 to be recognised as two statements (5; -5 or 5 compute - 5.)
Another unlikely possibility, again based on the print(5) example, is that function calls use a different bracket than expression grouping. In that case, a[5] (for example) would be a function call and a(5) would unambiguously be two statements.
Since I don't know the precise requirements here, I'll show a grammar (in yacc/bison syntax, although it should be easy enough to translate it) which attempts to illustrate a representative sample. It implements one statement (return) in addition to expression statements, and expressions include multiplication, subtraction, negation and single argument function calls. To force "greedy" expressions, it prohibits certain statement sequences:
statements starting with a unary operator
statements starting with an open parenthesis if the previous statement ends with an identifier. (This effectively requires that the function to be applied in a call expression be a simple identifier. Without that restriction, it becomes close to impossible to distinguish two consecutive parenthesized expressions from a single function call expression, and you then need some other way to disambiguate.)
Those rules are easy to state, but the actual implementation is annoyingly repetitive because it requires various different kinds of expressions, depending on what the first and last token in the expression is, and possibly different kinds of statements, if you have statements which might end with an expression. (return x, for example.) The formalism used by ECMAScript would be useful here, but I suspect that your parser-generator doesn't implement it -- although it's possible that its macro facility could be used to that effect, if it came with something resembling documentation. Without that, there is a lot of duplication.
In a vague attempt to generate the grammar, I used the following suffixes:
_un / _pr / _oth: starts with unary / parenthesis / other token
_id / _nid: ends / does not end with an id
The absence of a suffix is used for the union of different possibilities. There are probably more unit productions than necessary. It has not been thoroughly debugged, but it worked on a few test cases (see below):
program : block
block_id : stmt_id
| block_id stmt_oth_id
| block_nid stmt_pr_id
| block_nid stmt_oth_id
block_nid : stmt_nid
| block_id stmt_oth_nid
| block_nid stmt_pr_nid
| block_nid stmt_oth_nid
block : %empty
| block_id | block_nid
stmt_un_id : expr_un_id
stmt_un_nid : expr_un_nid
stmt_pr_id : expr_pr_id
stmt_pr_nid : expr_pr_nid
stmt_oth_id : expr_oth_id
| return_id
stmt_oth_nid : expr_oth_nid
| return_nid
stmt_id : stmt_un_id | stmt_pr_id | stmt_oth_id
stmt_nid : stmt_un_nid | stmt_pr_nid | stmt_oth_nid
return_id : "return" expr_id
return_nid : "return" expr_nid
expr_un_id : sum_un_id
expr_un_nid : sum_un_nid
expr_pr_id : sum_pr_id
expr_pr_nid : sum_pr_nid
expr_oth_id : sum_oth_id
expr_oth_nid : sum_oth_nid
expr_id : expr_un_id | expr_pr_id | expr_oth_id
expr_nid : expr_un_nid | expr_pr_nid | expr_oth_nid
expr : expr_id | expr_nid
sum_un_id : mul_un_id
| sum_un '-' mul_id
sum_un_nid : mul_un_nid
| sum_un '-' mul_nid
sum_un : sum_un_id | sum_un_nid
sum_pr_id : mul_pr_id
| sum_pr '-' mul_id
sum_pr_nid : mul_pr_nid
| sum_pr '-' mul_nid
sum_pr : sum_pr_id | sum_pr_nid
sum_oth_id : mul_oth_id
| sum_oth '-' mul_id
sum_oth_nid : mul_oth_nid
| sum_oth '-' mul_nid
sum_oth : sum_oth_id | sum_oth_nid
mul_un_id : unary_un_id
| mul_un '*' unary_id
mul_un_nid : unary_un_nid
| mul_un '*' unary_nid
mul_un : mul_un_id | mul_un_nid
mul_pr_id : mul_pr '*' unary_id
mul_pr_nid : unary_pr_nid
| mul_pr '*' unary_nid
mul_pr : mul_pr_id | mul_pr_nid
mul_oth_id : unary_oth_id
| mul_oth '*' unary_id
mul_oth_nid : unary_oth_nid
| mul_oth '*' unary_nid
mul_oth : mul_oth_id | mul_oth_nid
mul_id : mul_un_id | mul_pr_id | mul_oth_id
mul_nid : mul_un_nid | mul_pr_nid | mul_oth_nid
unary_un_id : '-' unary_id
unary_un_nid : '-' unary_nid
unary_pr_nid : term_pr_nid
unary_oth_id : term_oth_id
unary_oth_nid: term_oth_nid
unary_id : unary_un_id | unary_oth_id
unary_nid : unary_un_nid | unary_pr_nid | unary_oth_nid
term_oth_id : IDENT
term_oth_nid : NUMBER
| IDENT '(' expr ')'
term_pr_nid : '(' expr ')'
Here's a little test:
> 5-5
{ [- 5 5] }
> 5(-5)
{ 5; [~ -- 5] }
> a-5
{ [- a 5] }
> a(5)
{ [CALL a 5] }
> -7*a
{ [* [~ -- 7] a] }
> a*-7
{ [* a [~ -- 7]] }
> a-b*c
{ [- a [* b c]] }
> a*b-c
{ [- [* a b] c] }
> a*b(3)-c
{ [- [* a [CALL b 3]] c] }
> a*b-c(3)
{ [- [* a b] [CALL c 3]] }
> a*b-7(3)
{ [- [* a b] 7]; 3 }

How to deal with the implicit 'cat' operator while building a syntax tree for RE(use stack evaluation)

I am trying to build a syntax tree for regular expression. I use the strategy similar to arithmetic expression evaluation (i know that there are ways like recursive descent), that is, use two stack, the OPND stack and the OPTR stack, then to process.
I use different kind of node to represent different kind of RE. For example, the SymbolExpression, the CatExpression, the OrExpression and the StarExpression, all of them are derived from RegularExpression.
So the OPND stack stores the RegularExpression*.
while(c || optr.top()):
if(!isOp(c):
opnd.push(c)
c = getchar();
else:
switch(precede(optr.top(), c){
case Less:
optr.push(c)
c = getchar();
case Equal:
optr.pop()
c = getchar();
case Greater:
pop from opnd and optr then do operation,
then push the result back to opnd
}
But my primary question is, in typical RE, the cat operator is implicit.
a|bc represents a|b.c, (a|b)*abb represents (a|b)*.a.b.b. So when meeting an non-operator, how should i do to determine whether there's a cat operator or not? And how should i deal with the cat operator, to correctly implement the conversion?
Update
Now i've learn that there is a kind of grammar called "operator precedence grammar", its evaluation is similar to arithmetic expression's. It require that the pattern of the grammar cannot have the form of S -> ...AB...(A and B are non-terminal). So i guess that i just cannot directly use this method to parse the regular expression.
Update II
I try to design a LL(1) grammar to parse the basic regular expression.
Here's the origin grammar.(\| is the escape character, since | is a special character in grammar's pattern)
E -> E \| T | T
T -> TF | F
F -> P* | P
P -> (E) | i
To remove the left recursive, import new Variable
E -> TE'
E' -> \| TE' | ε
T -> FT'
T' -> FT' | ε
F -> P* | P
P -> (E) | i
now, for pattern F -> P* | P, import P'
P' -> * | ε
F -> PP'
However, the pattern T' -> FT' | ε has problem. Consider case (a|b):
E => TE'
=> FT' E'
=> PT' E'
=> (E)T' E'
=> (TE')T'E'
=> (FT'E')T'E'
=> (PT'E')T'E'
=> (iT'E')T'E'
=> (iFT'E')T'E'
Here, our human know that we should substitute the Variable T' with T' -> ε, but program will just call T' -> FT', which is wrong.
So, what's wrong with this grammar? And how should i rewrite it to make it suitable for the recursive descendent method.
1. LL(1) grammar
I don't see any problem with your LL(1) grammar. You are parsing the string
(a|b)
and you have gotten to this point:
(a T'E')T'E' |b)
The lookahead symbol is | and you have two possible productions:
T' ⇒ FT'
T' ⇒ ε
FIRST(F) is {(, i}, so the first production is clearly incorrect, both for the human and the LL(1) parser. (A parser without lookahead couldn't make the decision, but parsers without lookahead are almost useless for practical parsing.)
2. Operator precedence parsing
You are technically correct. Your original grammar is not an operator grammar. However, it is normal to augment operator precedence parsers with a small state machine (otherwise algebraic expressions including unary minus, for example, cannot be correctly parsed), and once you have done that it is clear where the implicit concatenation operator must go.
The state machine is logically equivalent to preprocessing the input to insert an explicit concatenation operator where necessary -- that is, between a and b whenever a is in {), *, i} and b is in {), i}.
You should take note that your original grammar does not really handle regular expressions unless you augment it with an explicit ε primitive to represent the empty string. Otherwise, you have no way to express optional choices, usually represented in regular expressions as an implicit operand (such as (a|), also often written as a?). However, the state machine is easily capable of detecting implicit operands as well because there is no conflict in practice between implicit concatenation and implicit epsilon.
I think just keeping track of the previous character should be enough. So if we have
(a|b)*abb
^--- we are here
c = a
pc = *
We know * is unary, so 'a' cannot be its operand. So we must have concatentation. Similarly at the next step
(a|b)*abb
^--- we are here
c = b
pc = a
a isn't an operator, b isn't an operator, so our hidden operator is between them. One more:
(a|b)*abb
^--- we are here
c = b
pc = |
| is a binary operator expecting a right-hand operand, so we do not concatenate.
The full solution probably involves building a table for each possible pc, which sounds painful, but it should give you enough context to get through.
If you don't want to mess up your loop, you could do a preprocessing pass where you insert your own concatenation character using similar logic. Can't tell you if that's better or worse, but it's an idea.

Bison: GLR-parsing of valid expression fails without error message

I'm working on a GLR-parser in GNU bison and I have the following problem:
the language I'm trying to parse allows boolean expressions including relations (<,>,<=,...) and boolean composition (and, or, not). Now the problem is that the language also allows to have multiple arithmetic expressions on the right side of a relation... and they are composed using the same AND token that is used for boolean composition! This is a very dumb language-design, but I can't change it.
So you can have a > b and c which is supposed to be equivalent to (a > b) and (a > c) and you can also have a > b and c > d which is supposed to be equivalent to (a > b) and (c > d)
The S/R conflict this causes is already obvious in this example: after reading a > b with lookahead and you could either reduce the a > b to a boolean expression and wait for another boolean expression or you could shift the and and wait for another arithmetic expression.
My grammar currently looks like this:
booleanexpression
: relation
| booleanexpression TOK_AND booleanexpression
...
;
relation
: arithmeticexpression TOK_GT maxtree
...
;
maxtree
: arithmeticexpression
| maxtree TOK_AND maxtree
...
;
The language is clearly not LR(k) for any k, since the S/R conflict can't be resolved using any constant k-lookahead, because the arithmeticexpression in between can have arbitrarily many tokens. Because of that, I turned GLR-parsing on.
But when I try to parse a > b and c with this, I can see in my debug outputs, that the parser behaves like this:
it reads the a and at lookahead > it reduces the a to an arithmeticexpression
it reads the b and at lookahead and it reduces the b to an arithmeticexpression and then already to a maxtree
it reduces the a > b to a relation
it reads the c and reduces it to an arithmeticexpression
then nothing happens! The and c are apparently discarded - the debug outputs don't show any action for these tokens. Not even an error message. The corresponding if-statement doesn't exist in my AST (I still get an AST because I have error recovery).
I would think that, after reading the b, there should be 2 stacks. But then the b shouldn't be reduced. Or at least it should give me some error message ("language is ambiguous" would be okay and I have seen that message before - I don't see why it wouldn't apply here). Can anyone make sense of this?
From looking at the grammar for a while, you can tell that the main question here is whether after the next arithmeticexpression there comes
another relation token (then you should reduce)
another boolean composition (then you should shift)
a token outside of the boolean/arithmetic -expression syntax (like THEN) which would terminate the expression and you should also shift
Can you think of a different grammar that captures the situation in a better / more deterministic way? How would you approach the problem? I'm currently thinking about making the grammar more right-to-left, like
booleanexpression : relation AND booleanexpression
maxtree : arithmeticexpression AND maxtree
etc.
I think that would make bison prefer shifting and only reduce on the right first. Maybe by using different non-terminals it would allow a quasi-"lookahead" behind the arithmeticexpression...
Side note: GnuCOBOL handles this problem by just collecting all the tokens, pushing them on an intermediate stack and manually building the expression from there. That discourages me, but I cling to the hope that they did it this way because bison didn't support GLR-parsing when they started...
EDIT:
a small reproducible example
%{
#include <stdio.h>
int yylex ();
void yyerror(const char* msg);
%}
%glr-parser
%left '&'
%left '>'
%%
input: %empty | input bool '\n' {printf("\n");};
arith : 'a' | 'b' | 'c';
maxtree : arith { printf("[maxtree : arith] "); }
| maxtree '&' maxtree { printf("[maxtree : maxtree & maxtree] "); } ;
rel : arith '>' maxtree { printf("[rel : arith > maxtree] "); } ;
bool : rel { printf("[bool : rel] "); }
| bool '&' bool { printf("[bool : bool & bool] "); } ;
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex () {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
return yyparse();
}
this one strangely does print the error message "syntax error" on input a>b&c.
Being able to simplify grammars by using precedence declarations is really handy (sometimes) [Note 1] but it doesn't play well with using GLR parsers because it can lead to early rejection of an unambiguous parse.
The idea behind precedence declarations is that they resolve ambiguities (or, more accurately, shift/reduce conflicts) using a simple one-token lookahead and a configured precedence between the possible reduction and the possible shift. If a grammar has no shift/reduce conflict, the precedence declarations won't be used, but if they are used they will be used to suppress either the shift or the reduce, depending on the (static) precedence relationship.
A Bison-generated GLR parser does not actually resolve ambiguity, but it allows possibly incorrect parses to continue to be developed until the ambiguity is resolved by the grammar. Unlike the use of precedence, this is a delayed resolution; a bit slower but a lot more powerful. (GLR parsers can produce a "parse forest" containing all possible parses. But Bison doesn't implement this feature, since it expects to be parsing programming languages and unlike human languages, programming languages cannot be ambiguous.)
In your language, it is impossible to resolve the non-determinism of the shift/reduce conflict statically, as you note yourself in the question. Your grammar is simply not LR(1), much less operator precedence, and GLR parsing is therefore a practical solution. But you have to allow GLR to do its work. Prematurely eliminating one of the plausible parses with a precedence comparison will prevent the GLR algorithm from considering it later. This will be particularly serious if you manage to eliminate the only parse which could have been correct.
In your grammar, it is impossible to define a precedence relationship between the rel productions and the & symbol, because no precedence relationship exists. In some sentences, the rel reduction needs to win; in other sentences, the shift should win. Since the grammar is not ambiguous, GLR will eventually figure out which is which, as long as both the shift and the reduce are allowed to proceed.
In your full language, both boolean and arithmetic expressions have something akin to operator precedence, but only within their respective domains. An operator precedence parser (and, equivalently, yacc/bison's precedence declarations) works by erasing the difference between different non-terminals; it cannot handle a grammar like yours in which some operator has different precedences in different domains (or between different domains).
Fortunately, this particular use of precedence declarations is only a shortcut; it does not give any additional power to the grammar and can easily and mechanically be implemented by creating new non-terminals, one for each precedence level. The alternative grammar will not be ambiguous. The classic example, which you can find in pretty well any textbook or tutorial which includes parsing arithmetic expressions, is the expr/term/factor grammar. Here I've also provided the precedence grammar for comparison:
%left '+' '-'
%left '*' '/'
%% %%
expr : term
| expr '+' term expr: expr '+' expr
| expr '-' term | expr '-' expr
term : factor
| term '*' factor | expr '*' expr
| term '/' factor | expr '/' expr
factor: ID | ID
| '(' expr ')' | '(' expr ')'
In your minimal example, there are already enough non-terminals that no new ones need to be invented, so I've just rewritten it according to the above model.
I've left the actions as I wrote them, in case the style is useful to you. Note that this style leaks memory like a sieve, but that's ok for quick tests:
%code top {
#define _GNU_SOURCE 1
}
%{
#include <ctype.h>
#include <stdio.h>
#include <string.h>
int yylex(void);
void yyerror(const char* msg);
%}
%define api.value.type { char* }
%glr-parser
%token ID
%%
input : %empty
| input bool '\n' { puts($2); }
arith : ID
maxtree : arith
| maxtree '&' arith { asprintf(&$$, "[maxtree& %s %s]", $1, $3); }
rel : arith '>' maxtree { asprintf(&$$, "[COMP %s %s]", $1, $3); }
bool : rel
| bool '&' rel { asprintf(&$$, "[AND %s %s]", $1, $3); }
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex(void) {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
if (isalpha(c)) {
*(yylval = strdup(" ")) = c;
return ID;
}
else return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
#if YYDEBUG
if (argc > 1 && strncmp(argv[1], "-d", 2) == 0) yydebug = 1;
#endif
return yyparse();
}
Here's a sample run. Note the warning from bison about a shift/reduce conflict. If there had been no such warning, the GLR parser would probably be unnecessary, since a grammar without conflicts is deterministic. (On the other hand, since bison's GLR implementation optimises for determinism, there is not too much cost for using a GLR parser on a deterministic language.)
$ bison -t -o glr_prec.c glr_prec.y
glr_prec.y: warning: 1 shift/reduce conflict [-Wconflicts-sr]
$ gcc -Wall -o glr_prec glr_prec.c
$ ./glr_prec
a>b
[COMP a b]
a>b & c
[COMP a [maxtree& b c]]
a>b & c>d
[AND [COMP a b] [COMP c d]]
a>b & c & c>d
[AND [COMP a [maxtree& b c]] [COMP c d]]
a>b & c>d & e
[AND [COMP a b] [COMP c [maxtree& d e]]]
$
Notes
Although precedence declarations are handy when you understand what's actually going on, there is a huge tendency for people to just cargo-cult them from some other grammar they found on the internet, and not infrequently a grammar which was also cargo-culted from somewhere else. When the precedence declarations don't work as expected, the next step is to randomly modify them in the hopes of finding a configuration which works. Sometimes that succeeds, often leaving behind unnecessary detritus which will go on to be cargo-culted again.
So, although there are circumstances in which precedence declarations really simplify grammars and the unambiguous implementation would be quite a lot more complicated (such as dangling-else resolution in languages which have many different compound statement types), I've still found myself recommending against their use.
In a recent answer to a different question, I wrote what I hope is a good explanation of the precedence algorithm (and if it isn't, please let me know how it falls short).
Welcome to the wonderful world of COBOL. I could be wrong, but you may have a few
additional problems here. An expression such as A > B AND C in COBOL is ambiguous
until you know how C was declared. Consider the following program:
IDENTIFICATION DIVISION.
PROGRAM-ID EXAMPLE.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 A PIC 9 VALUE 2.
01 B PIC 9 VALUE 1.
01 W PIC 9 VALUE 3.
88 C VALUE 3.
PROCEDURE DIVISION.
IF A > B AND C
DISPLAY 'A > B AND 88 LEVEL C is TRUE because W = ' W
ELSE
DISPLAY 'A not > B or 88 LEVEL C is not TRUE'
END-IF
DISPLAY 'A: ' A ' B: ' B ' W:' W
GOBACK
.
Output from this program is:
A > B AND 88 LEVEL C is TRUE because W = 3
A: 2 B: 1 W: 3
In essence the expression: A > B AND C is equivalent to: A > B AND W = 3. Had C
been defined in a manner similar to A and B, the semantics would
have been: A > B AND A > C, which in this case, is FALSE.
The code mentioned above works well, but I had never gotten it to work in my real project, even though I couldn't see a difference between my real project and this code.
This drove me crazy, but I just found another problem in my code, which prevented this method from working:
I had an (admittedly cargo-culted) %skeleton "lalr1.cc" in my prologue, which disabled the GLR parsing again!
I needed to replace this with
%skeleton "glr.cc"

Left recursion, associativity and AST evaluation

So I have been reading a bit on lexers, parser, interpreters and even compiling.
For a language I'm trying to implement I settled on a Recrusive Descent Parser. Since the original grammar of the language had left-recursion, I had to slightly rewrite it.
Here's a simplified version of the grammar I had (note that it's not any standard format grammar, but somewhat pseudo, I guess, it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
And then go through my tokens using the following sub-routines (simplified pseudo-code-ish):
public string Expression()
{
string term = ExpressionTerm();
if (term != null)
{
while (PeekToken() == OperatorToken)
{
term += ReadToken() + Expression();
}
}
return term;
}
public string ExpressionTerm()
{
//PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: If I would create AST nodes rather than a string in these subroutines, and evaluate the AST using an infix evaluator (which also keeps in mind associativity and precedence of operators, etcetera), won't I get the same result?
And if I do, then why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve or even a non-problem as it seems? Or is it really the structure of the resulting AST people are concerned about (rather than what it evaluates to)? Could anyone shed a light, I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
Sum
|
+---+---+
| |
Var Prod
| |
a +---+---+
| |
Var Const
| |
b 3
If you language obeys usual mathematical notation conventions, so will the AST for a+b*3.
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse or abstract trees be directly evaluatable. You can expand the rule hierarchy to include power and bitwise operators, or make it part of the hierarchy for logical expressions with and or and comparisons.

Producing Expressions from This Grammar with Recursive Descent

I've got a simple grammar. Actually, the grammar I'm using is more complex, but this is the smallest subset that illustrates my question.
Expr ::= Value Suffix
| "(" Expr ")" Suffix
Suffix ::= "->" Expr
| "<-" Expr
| Expr
| epsilon
Value matches identifiers, strings, numbers, et cetera. The Suffix rule is there to eliminate left-recursion. This matches expressions such as:
a -> b (c -> (d) (e))
That is, a graph where a goes to both b and the result of (c -> (d) (e)), and c goes to d and e. I'm trying to produce an abstract syntax tree for these expressions, but I'm running into difficulty because all of the operators can accept any number of operands on each side. I'd rather keep the logic for producing the AST within the recursive descent parsing methods, since it avoids having to duplicate the logic of extracting an expression. My current strategy is as follows:
If a Value appears, push it to the output.
If a From or To appears:
Output a separator.
Get the next Expr.
Create a Link node.
Pop the first set of operands from output into the Link until a separator appears.
Erase the separator discovered.
Pop the second set of operands into the Link until a separator.
Push the Link to the output.
If I run this through without obeying steps 2.3–2.7, I get a list of values and separators. For the expression quoted above, a -> b (c -> (d) (e)), the output should be:
A sep_1 B sep_2 C sep_3 D E
Applying the To rule would then yield:
A sep_1 B sep_2 (link from C to {D, E})
And subsequently:
(link from A to {B, (link from C to {D, E})})
The important thing to note is that sep_2, crucial to delimit the left-hand operands of the second ->, does not appear, so the parser believes that the expression was actually written:
a -> (b c -> (d) (e))
In order to solve this with my current strategy, I would need a way to produce a separator between adjacent expressions, but only if the current expression is a From or To expression enclosed in parentheses. If that's possible, then I'm just not seeing it and the answer ought to be simple. If there's a better way to go about this, however, then please let me know!
I haven't tried to analyze it in detail, but: "From or To expression enclosed in parentheses" starts to sound a lot like "context dependent", which recursive descent can't handle directly. To avoid context dependence you'll probably need a separate production for a From or To in parentheses vs. a From or To without the parens.
Edit: Though it may be too late to do any good, if my understanding of what you want to match is correct, I think I'd write it more like this:
Graph :=
| List Sep Graph
;
Sep := "->"
| "<-"
;
List :=
| Value List
;
Value := Number
| Identifier
| String
| '(' Graph ')'
;
It's hard to be certain, but I think this should at least be close to matching (only) the inputs you want, and should make it reasonably easy to generate an AST that reflects the input correctly.

Resources