Arithmetic expression grammar in prefix notation (Java Cup) - parsing

I'm writing a grammar for arithmetic expressions in prefix notation. However, I have an issue when parsing negative numbers or subtraction. The grammar looks like this:
precedence right +, -;
precedence right *, /;
precedence right uminus;
E ::= + E E
| - E E
| * E E
| / E E
| ( E )
| - E %prec uminus
| id
| digit
;
But if my input is - 5 4, it reduces 5 as E, next it reduces - E (negative), and then the parser gives me a syntax error at 4. The correct sequence would be 5 as E, next 4 as E, and then - E E as E. How can I solve this problem using associativity? Or do I need to rewrite my grammar?

(Promoted from comment)
Your grammar really is ambiguous, and precedence declarations won't help you a bit.
Consider the input consisting of N - tokens, followed by M 1 tokens.
- - - - - - - ... - 1 1 1 ... 1
In order for this to be an expression, M-1 of the - tokens must be binary, and the remaining N-(M-1) unary, but there is no way to tell which is which (unless they are all binary).
Even if you arbitrarily say that the first N-(M-1) -s are unary, you can't tell what the value of N-(M-1) is until you read the entire input, which means you can't parse with a finite lookahead.
But the whole point of prefix notation is to avoid the need for parentheses. Arbitrary declarations like the above make it impossible to represent alternative interpretations, so that some expressions would be impossible to represent in prefix notation. That's just plain wrong.
Here's a simple case:
- 5 - - - 4 3 1
could be any of
5 - (- ((4 - 3) - 1))
5 - ((- (4 - 3)) - 1)
5 - (((- 4) - 3) - 1)
In prefix notation, you need to declare the "arity" of every operator, either implicitly (every operator has a known number of arguments), or explicitly using a notation like this, borrowed from Prolog:
-/2 5 -/2 -/2 -/1 4 3 1
Alternatively, you can delimit the arguments with mandatory parentheses, as with Lisp/Scheme "s-exprs":
(- 5 (- (- (- 4) 3) 1))
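The ambiguity can be checked mechanically. The sketch below (Python, with a hypothetical helper named `parses`) enumerates every way of reading a token list as a single prefix expression when - may be either unary or binary. It is a brute-force illustration only, not something a Cup parser with finite lookahead could do.

```python
def parses(tokens):
    """Return every reading of `tokens` as ONE prefix expression, as infix strings."""

    def expr(i):
        # Yield (infix_string, next_index) for each expression starting at index i.
        if i >= len(tokens):
            return
        tok = tokens[i]
        if tok != "-":
            yield tok, i + 1                        # an operand such as 5 or id
            return
        for operand, j in expr(i + 1):
            yield f"(- {operand})", j               # read this "-" as unary
            for right, k in expr(j):
                yield f"({operand} - {right})", k   # read this "-" as binary

    # Keep only the readings that consume the whole input.
    return [text for text, end in expr(0) if end == len(tokens)]


if __name__ == "__main__":
    for reading in parses("- 5 - - - 4 3 1".split()):
        print(reading)
    print(parses("- 5 4".split()))
```

For - 5 - - - 4 3 1 this finds three readings. For - 5 4 it finds only one, so that particular input is not ambiguous in itself; the parser simply cannot know which - it is looking at without reading further ahead.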

First, remove all precedence declarations. They are not needed in prefix grammars. In fact, that should be enough to solve the issue in any parser generator. Which one are you using, BTW?
Cup has a finite lookahead. As @rici points out, the ambiguity can't be resolved in this case. What you can do is restrict the grammar so that only one consecutive unary - can be used.
B ::= E
| - E
;
E ::= + B B
| - B B
| * B B
| / B B
| ( B )
| id
| digit
;
Please check the above several times as I'm pretty rusty.

Related

Preferring shift over reduce in parser for language without statement terminators

I'm parsing a language that doesn't have statement terminators like ;. Expressions are defined as the longest sequence of tokens, so 5-5 has to be parsed as a subtraction, not as two statements (literal 5 followed by a unary negated -5).
I'm using LALRPOP as the parser generator (despite the name, it is LR(1) instead of LALR, afaik). LALRPOP doesn't have precedence attributes and doesn't prefer shift over reduce by default like yacc would do. I think I understand how regular operator precedence is encoded in an LR grammar by building a "chain" of rules, but I don't know how to apply that to this issue.
The expected parses would be (individual statements in brackets):
"5 - 5" → 5-5 instead of 5, -5
"5 (- 5)" → 5, -5
"- 5" → -5
"5 5" → 5, 5
How do I change the grammar such that it always prefers the longer parse?
Going through the first few pages of google results as well as stack overflow didn't yield any results for this specific problem. Most related questions need more lookahead or the result is to not allow consecutive statements without terminators.
I created a minimal sample grammar that reproduces the shift/reduce conflict (a statement in this grammar is just an expression, in the full grammar there would also be "if", "while", etc. and more levels of operator precedence, but I've omitted them for brevity). Besides unary minus, there are also other conflicts in the original grammar like print(5), which could be parsed as the identifier print and a parenthesized number (5) or a function call. There might be more conflicts like this, but all of them have the same underlying issue, that the longer sequence should be preferred, but both are currently valid, though only the first should be.
For convenience, I created a repo (checkout and cargo run). The grammar is:
use std::str::FromStr;

grammar;

match {
    "+",
    "-",
    "(",
    ")",
    r"[0-9]+",
    // Skip whitespace
    r"\s*" => { },
}

Expr: i32 = {
    <l:Expr> "+" <r:Unary> => l + r,
    <l:Expr> "-" <r:Unary> => l - r,
    Unary,
};

Unary: i32 = {
    "-" <r:Unary> => -r,
    Term,
}

Term: i32 = {
    Num,
    "(" <Expr> ")",
};

Num: i32 = {
    r"[0-9]+" => i32::from_str(<>).unwrap(),
};

Stmt: i32 = {
    Expr
};

pub Stmts: Vec<i32> = {
    Stmt*
};
Part of the error (full error message):
/lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8: Local ambiguity detected
The problem arises after having observed the following symbols in the input:
Stmt+ Expr
At that point, if the next token is a `"-"`, then the parser can proceed in two different ways.
First, the parser could execute the production at
/lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8, which would consume
the top 1 token(s) from the stack and produce a `Stmt`. This might then yield a parse tree like
Expr ╷ Stmt
├─Stmt──┤ │
├─Stmt+─┘ │
└─Stmt+──────┘
Alternatively, the parser could shift the `"-"` token and later use it to construct a `Expr`. This might
then yield a parse tree like
Stmt+ Expr "-" Unary
│ ├─Expr───────┤
│ └─Stmt───────┤
└─Stmt+────────────┘
See the LALRPOP manual for advice on making your grammar LR(1).
The issue you're going to have to confront is how to deal with function calls. I can't really give you any concrete advice based on your question, because the grammar you provide lacks any indication of the intended syntax of function calls, but the hint that print(5) is a valid statement makes it clear that there are two distinct situations, which need to be handled separately.
Consider:
5 - 5        One statement        5 ( - 5 )    Two statements
print(-5)    One statement        print - 5    Two statements (presumably)
a - 5        ???
The ambiguity of the third expression could be resolved if the compiler knew whether a is a function or a variable (if we assume that functions are not first-class values, making print an invalid statement). But there aren't many ways that the parser could know that, and none of them seem very likely:
There might not be any user-defined functions. Then the lexer could be built to recognise identifier-like tokens which happen to be built-in functions (like print) and then a(-5) would be illegal since a is not a built-in function.
The names of functions and identifiers might differ in some way that the lexer can detect. For example, the language might require functions to start with a capital letter. I presume this is not the case since you wrote print rather than Print but there might be some other simple distinction, such as requiring identifiers to be a single character.
Functions must be declared as such before the first use of the function, and the parser shares the symbol table with the lexer. (I didn't search the rather inadequate documentation for the generator you're using to see if lexical feedback is practical.)
If there were an optional statement delimiter (as with Lua, for example), then you could simply require that statements which start with parentheses (usually a pretty rare case) be explicitly delimited unless they are the first statement in a block. Or there might be an optional keyword such as compute which can be used as an unambiguous statement starter and whose use is required for statements which start with a parenthesis. I presume that neither of these is the case here, since you could have used that to force 5 - 5 to be recognised as two statements (5; -5 or 5 compute - 5.)
Another unlikely possibility, again based on the print(5) example, is that function calls use a different bracket than expression grouping. In that case, a[5] (for example) would be a function call and a(5) would unambiguously be two statements.
Since I don't know the precise requirements here, I'll show a grammar (in yacc/bison syntax, although it should be easy enough to translate it) which attempts to illustrate a representative sample. It implements one statement (return) in addition to expression statements, and expressions include multiplication, subtraction, negation and single argument function calls. To force "greedy" expressions, it prohibits certain statement sequences:
statements starting with a unary operator
statements starting with an open parenthesis if the previous statement ends with an identifier. (This effectively requires that the function to be applied in a call expression be a simple identifier. Without that restriction, it becomes close to impossible to distinguish two consecutive parenthesized expressions from a single function call expression, and you then need some other way to disambiguate.)
Those rules are easy to state, but the actual implementation is annoyingly repetitive because it requires various different kinds of expressions, depending on what the first and last tokens in the expression are, and possibly different kinds of statements, if you have statements which might end with an expression. (return x, for example.) The formalism used by ECMAScript would be useful here, but I suspect that your parser-generator doesn't implement it -- although it's possible that its macro facility could be used to that effect, if it came with something resembling documentation. Without that, there is a lot of duplication.
In a vague attempt to generate the grammar, I used the following suffixes:
_un / _pr / _oth: starts with unary / parenthesis / other token
_id / _nid: ends / does not end with an id
The absence of a suffix is used for the union of different possibilities. There are probably more unit productions than necessary. It has not been thoroughly debugged, but it worked on a few test cases (see below):
program : block
block_id : stmt_id
| block_id stmt_oth_id
| block_nid stmt_pr_id
| block_nid stmt_oth_id
block_nid : stmt_nid
| block_id stmt_oth_nid
| block_nid stmt_pr_nid
| block_nid stmt_oth_nid
block : %empty
| block_id | block_nid
stmt_un_id : expr_un_id
stmt_un_nid : expr_un_nid
stmt_pr_id : expr_pr_id
stmt_pr_nid : expr_pr_nid
stmt_oth_id : expr_oth_id
| return_id
stmt_oth_nid : expr_oth_nid
| return_nid
stmt_id : stmt_un_id | stmt_pr_id | stmt_oth_id
stmt_nid : stmt_un_nid | stmt_pr_nid | stmt_oth_nid
return_id : "return" expr_id
return_nid : "return" expr_nid
expr_un_id : sum_un_id
expr_un_nid : sum_un_nid
expr_pr_id : sum_pr_id
expr_pr_nid : sum_pr_nid
expr_oth_id : sum_oth_id
expr_oth_nid : sum_oth_nid
expr_id : expr_un_id | expr_pr_id | expr_oth_id
expr_nid : expr_un_nid | expr_pr_nid | expr_oth_nid
expr : expr_id | expr_nid
sum_un_id : mul_un_id
| sum_un '-' mul_id
sum_un_nid : mul_un_nid
| sum_un '-' mul_nid
sum_un : sum_un_id | sum_un_nid
sum_pr_id : mul_pr_id
| sum_pr '-' mul_id
sum_pr_nid : mul_pr_nid
| sum_pr '-' mul_nid
sum_pr : sum_pr_id | sum_pr_nid
sum_oth_id : mul_oth_id
| sum_oth '-' mul_id
sum_oth_nid : mul_oth_nid
| sum_oth '-' mul_nid
sum_oth : sum_oth_id | sum_oth_nid
mul_un_id : unary_un_id
| mul_un '*' unary_id
mul_un_nid : unary_un_nid
| mul_un '*' unary_nid
mul_un : mul_un_id | mul_un_nid
mul_pr_id : mul_pr '*' unary_id
mul_pr_nid : unary_pr_nid
| mul_pr '*' unary_nid
mul_pr : mul_pr_id | mul_pr_nid
mul_oth_id : unary_oth_id
| mul_oth '*' unary_id
mul_oth_nid : unary_oth_nid
| mul_oth '*' unary_nid
mul_oth : mul_oth_id | mul_oth_nid
mul_id : mul_un_id | mul_pr_id | mul_oth_id
mul_nid : mul_un_nid | mul_pr_nid | mul_oth_nid
unary_un_id : '-' unary_id
unary_un_nid : '-' unary_nid
unary_pr_nid : term_pr_nid
unary_oth_id : term_oth_id
unary_oth_nid: term_oth_nid
unary_id : unary_un_id | unary_oth_id
unary_nid : unary_un_nid | unary_pr_nid | unary_oth_nid
term_oth_id : IDENT
term_oth_nid : NUMBER
| IDENT '(' expr ')'
term_pr_nid : '(' expr ')'
Here's a little test:
> 5-5
{ [- 5 5] }
> 5(-5)
{ 5; [~ -- 5] }
> a-5
{ [- a 5] }
> a(5)
{ [CALL a 5] }
> -7*a
{ [* [~ -- 7] a] }
> a*-7
{ [* a [~ -- 7]] }
> a-b*c
{ [- a [* b c]] }
> a*b-c
{ [- [* a b] c] }
> a*b(3)-c
{ [- [* a [CALL b 3]] c] }
> a*b-c(3)
{ [- [* a b] [CALL c 3]] }
> a*b-7(3)
{ [- [* a b] 7]; 3 }
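For comparison, a hand-written recursive-descent evaluator is greedy by construction: the loop in `expr` keeps consuming + and - for as long as it can, so a statement only ends when the next token cannot continue it. The sketch below is an assumption-laden illustration in Python covering only the question's minimal grammar (numbers, +, -, unary minus, parentheses); it is not a translation of the yacc grammar above, and `parse_statements` is a made-up name.

```python
import re


def parse_statements(src):
    """Evaluate a terminator-free sequence of statements, longest match first."""
    tokens = re.findall(r"[0-9]+|[-+()]", src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        return tok

    def term():
        tok = take()
        if tok == "(":
            value = expr()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    def unary():
        if peek() == "-":
            take()
            return -unary()
        return term()

    def expr():
        value = unary()
        # Greedy step: a following "+" or "-" always extends this expression.
        while peek() in ("+", "-"):
            op = take()
            rhs = unary()
            value = value + rhs if op == "+" else value - rhs
        return value

    results = []
    while peek() is not None:
        results.append(expr())
    return results


if __name__ == "__main__":
    for text in ["5 - 5", "5 (- 5)", "- 5", "5 5"]:
        print(repr(text), "->", parse_statements(text))
```

Each result list holds one value per statement, so the four inputs from the question split the way the question asks for.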

Parsing + and * in boolean expressions by recursive descent

I am writing a recursive descent parser for Boolean expressions, for example:
(1 * 0)
(0 + ~1)
(0 * (1 + c))
Where 1 is 'True', 0 is 'False', + is 'or', * is 'and', ~ is 'not' and 'c' is just some variable name (it could be any single alphabetic letter). I plan on using parentheses rather than implementing some kind of order of operations.
My current parser can recognize the following form of expression
Expression ::= 1
| 0
| Character
| ~ Expression
But I am unsure as to how I would implement + and * on top of this. I am fairly certain from what I have read the obvious implementation of
Expression ::= 1
| 0
| Character
| ( Expression + Expression )
| ( Expression * Expression )
Would cause an infinite loop as it is 'left-recursive'. I am unsure how to change this to remove such infinite recursion.
With the parentheses in place, what you have there is not left recursive. Left recursion is when a production can reach itself (directly or indirectly) with no tokens consumed in between. Such grammars do indeed cause infinite recursion in recursive descent parsers, but that can't happen with yours.
You do have the issue that the grammar as it stands is not left-factored: after an opening parenthesis, it isn't known whether the + or the * form is being parsed until the entire left-hand expression has been parsed.
One way of getting around that issue is by pulling up the common parts in a shared prefix/suffix production:
Expression ::= 1
| 0
| Character
| ParExpr
ParExpr ::= ( Expression ParOp Expression )
ParOp ::= +
| *
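A minimal recursive-descent sketch of this factored grammar, in Python, might look like the following. It also includes the ~ Expression alternative from the question. The tuple-based tree shape and the name `parse_bool` are assumptions made for the example.

```python
def parse_bool(text):
    """Parse a fully parenthesised Boolean expression into a nested tuple."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def take(expected=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        if expected is not None and tok not in expected:
            raise SyntaxError(f"expected one of {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def expression():
        tok = take()
        if tok in "01":
            return tok == "1"                 # literal True / False
        if tok.isalpha():
            return ("var", tok)               # single-letter variable
        if tok == "~":
            return ("not", expression())
        if tok == "(":
            # ParExpr ::= ( Expression ParOp Expression )
            left = expression()
            op = take("+*")                   # ParOp is decided only here
            right = expression()
            take(")")
            return ("or" if op == "+" else "and", left, right)
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree


if __name__ == "__main__":
    print(parse_bool("(0 + ~1)"))
    print(parse_bool("(0 * (1 + c))"))
```

Every call to `expression` consumes at least one token before recursing, which is why the recursion always terminates.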
Let me search that for you ...
https://en.wikipedia.org/wiki/Recursive_descent_parser
The leading LPAREN keeps this from being left-recursive.
If you want to generalize the expressions and have some operator precedence, follow the expression portion of the BNF in the Wikipedia article.
However, you have a syntax ambiguity in the grammar you've chosen. When you have operators of the same precedence, combine them into a non-terminal, such as
LogOp ::= + | *
Label similar operands to allow for expansion:
UnaryOp ::= ~
Now you can ... never mind, @500 just posted a good answer that covers my final point.

Operator precedence with LR(0) parser

A typical BNF defining arithmetic operations:
E :- E + T
| T
T :- T * F
| F
F :- ( E )
| number
Is there any way to re-write this grammar so it could be implemented with an LR(0) parser, while still retaining the precedence and left-associativity of the operators?
I'm thinking it should be possible by introducing some sort of disambiguation non-terminals, but I can't figure out how to do it.
Thanks!
A language can only have an LR(0) grammar if it's prefix-free, meaning that no string in the language is a prefix of another. In this case, the language you're describing isn't prefix-free. For example, the string number + number is a prefix of number + number + number.
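A common workaround to address this would be to "endmark" your language by requiring all strings generated to end in a special "done" character. For example, you could require that all strings generated end in a semicolon. That makes the language prefix-free, which is the necessary condition described above. Note, however, that the straightforward endmarked grammar below is still not LR(0) as written: after reducing to T, a parser with no lookahead cannot choose between reducing E → T and shifting a *, so this particular grammar needs one token of lookahead (it is SLR(1)):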
S → E;
E → E + T | T
T → T * F | F
F → number | (E)
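Whether a particular grammar is LR(0) can be checked by building its LR(0) item sets and looking for a state that contains a completed item together with any other action. The Python sketch below does that for small grammars; the function name `lr0_conflicts` and the dictionary encoding of grammars are assumptions made for this example. Run on the endmarked grammar above, it reports shift/reduce conflicts involving E → T, which suggests that this grammar needs a token of lookahead even though its language is prefix-free.

```python
from collections import deque


def lr0_conflicts(grammar, start):
    """grammar maps each nonterminal to a list of right-hand sides (tuples)."""

    def closure(items):
        items = set(items)
        work = list(items)
        while work:
            _, rhs, dot = work.pop()
            if dot < len(rhs) and rhs[dot] in grammar:
                for prod in grammar[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        work.append(item)
        return frozenset(items)

    def goto(state, symbol):
        moved = {(l, r, d + 1) for (l, r, d) in state
                 if d < len(r) and r[d] == symbol}
        return closure(moved)

    initial = closure({("S'", (start,), 0)})   # augmented start production
    seen, queue, conflicts = {initial}, deque([initial]), []
    while queue:
        state = queue.popleft()
        complete = [(l, r) for (l, r, d) in state if d == len(r)]
        shifts = {r[d] for (l, r, d) in state
                  if d < len(r) and r[d] not in grammar}
        if len(complete) > 1:
            conflicts.append(("reduce/reduce", complete))
        elif complete and shifts:
            conflicts.append(("shift/reduce", complete[0], sorted(shifts)))
        for symbol in {r[d] for (l, r, d) in state if d < len(r)}:
            nxt = goto(state, symbol)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return conflicts


# The endmarked expression grammar from the answer.
EXPR = {
    "S": [("E", ";")],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("number",), ("(", "E", ")")],
}
# A small grammar that really is LR(0), for contrast: S -> ( S ) | x
PARENS = {"S": [("(", "S", ")"), ("x",)]}

if __name__ == "__main__":
    for conflict in lr0_conflicts(EXPR, "S"):
        print(conflict)
    print(lr0_conflicts(PARENS, "S"))
```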

How to remove left-recursion in the following grammar?

Unfortunately, it is not possible for ANTLR to support direct-left recursion when the rule has parameters passed. The only viable option is to remove the left recursion. Is there a way to remove the left-recursion in the following grammar ?
a[int x]
: b a[$x] c
| a[$x - 1]
(
c a[$x - 1]
| b c
)
;
The problem is in the second alternative involving left recursion. Any kind of help would be much appreciated.
Without the parameters and easier formatting, it would look like this:
a
: b a c
| a (c a | b c)
;
When a's left recursive alternative is matched n times, it would just mean that (c a | b c) will be matched n times, pre-pended with the terminating b a c (the first alternative). That means that this rule will always start with b a c, followed by zero or more occurrences of (c a | b c):
a
: b a c (c a | b c)*
;
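The same rewrite can be expressed without the EBNF star, using the textbook scheme for immediate left recursion: A → β | A α becomes A → β A' and A' → α A' | ε. The Python sketch below applies that scheme to one rule given as lists of symbols; the function name and the `_tail` suffix for the helper rule are invented for the example, and rule parameters such as [$x - 1] are ignored.

```python
def remove_immediate_left_recursion(name, alternatives):
    """Rewrite A -> beta | A alpha as A -> beta A_tail, A_tail -> alpha A_tail | (empty)."""
    # Alternatives that start with the rule itself, with that leading symbol stripped.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    # Everything else is a possible starting alternative (beta).
    others = [alt for alt in alternatives if not alt or alt[0] != name]
    if not recursive:
        return {name: alternatives}
    tail = name + "_tail"
    return {
        name: [beta + [tail] for beta in others],
        # The empty list stands for the empty alternative.
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }


if __name__ == "__main__":
    rule = [["b", "a", "c"], ["a", "c", "a"], ["a", "b", "c"]]
    for lhs, alts in remove_immediate_left_recursion("a", rule).items():
        print(lhs, "->", " | ".join(" ".join(alt) or "(empty)" for alt in alts))
```

Here `a_tail` plays the role of (c a | b c)* in the rule above.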

Shift-reduce: when to stop reducing?

I'm trying to learn about shift-reduce parsing. Suppose we have the following grammar, using recursive rules that enforce order of operations, inspired by the ANSI C Yacc grammar:
S: A;
P
: NUMBER
| '(' S ')'
;
M
: P
| M '*' P
| M '/' P
;
A
: M
| A '+' M
| A '-' M
;
And we want to parse 1+2 using shift-reduce parsing. First, the 1 is shifted as a NUMBER. My question is, is it then reduced to P, then M, then A, then finally S? How does it know where to stop?
Suppose it does reduce all the way to S, then shifts '+'. We'd now have a stack containing:
S '+'
If we shift '2', the reductions might be:
S '+' NUMBER
S '+' P
S '+' M
S '+' A
S '+' S
Now, on either side of the last line, S could be P, M, A, or NUMBER, and it would still be valid in the sense that any combination would be a correct representation of the text. How does the parser "know" to make it
A '+' M
So that it can reduce the whole expression to A, then S? In other words, how does it know to stop reducing before shifting the next token? Is this a key difficulty in LR parser generation?
Edit: An addition to the question follows.
Now suppose we parse 1+2*3. Some shift/reduce operations are as follows:
Stack    | Input | Operation
---------+-------+----------------------------------------------
         | 1+2*3 |
NUMBER   | +2*3  | Shift
A        | +2*3  | Reduce (looking ahead, we know to stop at A)
A+       | 2*3   | Shift
A+NUMBER | *3    | Shift (looking ahead, we know to stop at M)
A+M      | *3    | Reduce (looking ahead, we know to stop at M)
The problem you're describing is an issue with creating LR(0) parsers - that is, bottom-up parsers that don't do any lookahead to symbols beyond the current one they are parsing. The grammar you've described doesn't appear to be an LR(0) grammar, which is why you run into trouble when trying to parse it without lookahead. It does appear to be LR(1), however, so by looking 1 symbol ahead in the input you could easily determine whether to shift or reduce. In this case, an LR(1) parser would look ahead when it had the 1 on the stack, see that the next symbol is a +, and realize that it shouldn't reduce past A (since that is the only thing it could reduce to that would still match a rule with + in the second position).
An interesting property of LR grammars is that for any grammar which is LR(k) for k>1, it is possible to construct an LR(1) grammar which is equivalent. However, the same does not extend all the way down to LR(0) - there are many grammars which cannot be converted to LR(0).
See here for more details on LR(k)-ness:
http://en.wikipedia.org/wiki/LR_parser
I'm not exactly sure of the Yacc / Bison parsing algorithm and when it prefers shifting over reducing, however I know that Bison supports LR(1) parsing which means it has a lookahead token. This means that tokens aren't passed to the stack immediately. Rather they wait until no more reductions can happen. Then, if shifting the next token makes sense it applies that operation.
First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token to an A because the '+' lookahead token indicates that it's the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will shift the 2 and reduce to an M since A + M produces an A and the expression is complete.
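To make the role of the lookahead concrete, here is a small shift-reduce simulator in Python for the grammar in the question. The decisions are hand-coded to mirror what a one-token-lookahead table would say for this specific grammar; a real generator such as Yacc or Bison derives them from item sets instead, so treat this as a sketch with made-up names rather than how those tools work internally.

```python
import re


def shift_reduce(text):
    """Return the list of shift/reduce steps taken for an arithmetic string."""
    tokens = re.findall(r"\d+|[-+*/()]", text) + ["$"]   # "$" marks end of input
    stack, trace, i = [], [], 0

    def reduce(lhs, *rhs):
        if stack[-len(rhs):] != list(rhs):
            raise SyntaxError(f"cannot reduce to {lhs} with stack {stack}")
        del stack[-len(rhs):]
        stack.append(lhs)
        trace.append(f"reduce {lhs} <- {' '.join(rhs)}")

    while True:
        la = tokens[i]                                   # the lookahead token
        top = stack[-1] if stack else None
        below = stack[-2] if len(stack) >= 2 else None
        if top == "NUMBER":
            reduce("P", "NUMBER")
        elif top == ")":
            reduce("P", "(", "S", ")")
        elif top == "P":
            if below in ("*", "/"):
                reduce("M", "M", below, "P")
            else:
                reduce("M", "P")
        elif top == "M" and la not in ("*", "/"):
            # Lookahead is not * or /, so this M is complete.
            if below in ("+", "-"):
                reduce("A", "A", below, "M")
            else:
                reduce("A", "M")
        elif top == "A" and la in (")", "$"):
            # Only reduce to S when nothing can extend the sum.
            reduce("S", "A")
        elif top == "S" and la == "$" and len(stack) == 1:
            return trace
        elif la == "$":
            raise SyntaxError(f"unexpected end of input with stack {stack}")
        else:
            stack.append("NUMBER" if la.isdigit() else la)
            trace.append(f"shift {la}")
            i += 1


if __name__ == "__main__":
    for step in shift_reduce("1+2*3"):
        print(step)
```

For 1+2*3 the trace shows the parser stopping at M when the lookahead is *, and reducing A + M only once the end of input is the lookahead.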
