This is an arithmetic expression for my language: ADD 100 MUL 5 DIV 10 SUB 7 MOD 10 4
Where ADD = addition, SUB = subtraction, MUL = multiplication, DIV = division, MOD = modulo.
The expression above can also be rewritten in standard notation as 100 + (5 * (10 / (7 - (10 % 4)))), with parentheses included to mark the order of operations.
This is quite different from the standard notation because evaluation starts with the rightmost operation, MOD 10 4. Its result is then used to evaluate the next operation, SUB 7 2, where 2 is the result of the modulo operation. Parentheses are not required for this grammar.
I got the grammar for the standard notation from https://ruslanspivak.com/lsbasi-part6/. Here it is:
<expr> := <term> ((ADD|SUB) <term>)*
<term> := <factor> ((MUL|DIV|MOD) <factor>)*
<factor> := integer
I have no idea how to write the grammar for arithmetic operations in my language. Does the grammar above need modifications, or do I need to write a completely new grammar?
I managed to solve this by analyzing the execution of each production in my code. It turned out that I had forgotten to include an <expr> production in <factor>. After changing my code a bit by moving certain conditions, I was able to parse the sample expression above. This is the grammar for the arithmetic expression in my language:
<expr> := ((ADD|SUB) <term> <term>)* | <term>
<term> := ((MUL|DIV|MOD) <factor> <factor>)* | <factor>
<factor> := INTEGER | <expr>
The <expr> production in <factor> makes it possible to have multiple operations because it goes back to the start to parse the next operation.
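To make the right-to-left evaluation order concrete, here is a minimal Python sketch that evaluates the sample expression. It assumes the input is plain prefix notation (every operator is followed by exactly two operands) and that DIV is integer division; neither is stated in the question, and the names OPS, evaluate and evaluate_text are made up for illustration.

```python
import operator

# Operator names come from the question; integer division for DIV is an assumption.
OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "MUL": operator.mul,
    "DIV": operator.floordiv,
    "MOD": operator.mod,
}

def evaluate(tokens):
    """Evaluate a prefix expression, consuming tokens from the front of the list."""
    token = tokens.pop(0)
    if token in OPS:
        # An operator is followed by two operands, each of which may itself
        # be another operation, so the innermost (rightmost) one finishes first.
        left = evaluate(tokens)
        right = evaluate(tokens)
        return OPS[token](left, right)
    return int(token)

def evaluate_text(text):
    tokens = text.split()
    value = evaluate(tokens)
    if tokens:
        raise SyntaxError(f"unexpected trailing tokens: {tokens}")
    return value

print(evaluate_text("ADD 100 MUL 5 DIV 10 SUB 7 MOD 10 4"))  # 110
```

The recursion returns from MOD 10 4 first, then SUB 7 2, and so on outward, which matches the evaluation order described in the question.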
I am creating the simplest grammar possible that basically recognizes arithmetic expressions. The grammar needs to correctly follow arithmetic operators precedence rules (PEMDAS), and for that I placed expr ('*'|'/') term before expr ('+'|'-') term to ensure this precedence.
This is the arithmetic.g4 file that I have:
/*Productions */
expr: expr ('*'|'/') term
| expr ('+'|'-') term
| term
;
term: '('expr')'
| ID
| NUM
;
/*Tokens */
ID: [a-z]+;
NUM: [0-9]+;
WS: [\t\r\n]+->skip;
The output of the grammar is however not what it should be. For example for the arithmetic expression 4 * (3 + 10) I get the below parse tree (which is absolutely not correct):
Any suggestions on how I can change the grammar to get what I am looking for? I am new to ANTLR and am not sure what mistake I am making. (BTW, my OS is Windows.)
(I'm assuming that you've made a mistake in your example (which looks fine) and you really meant that you're getting the wrong tree for the input 4 + 3 * 10, so that's what I'm going to answer. If that's not what you meant, please clarify.)
You're right that ANTLR resolves ambiguities based on the order of rules, but that does not apply to your grammar because your grammar is not ambiguous. For an input like 4 + 3 * 10, there's only one way to parse it according to your grammar: with * being the outer operator, with 4 + 3 as its left and 10 as its right operand. The correct way (+ as the outer operator with 3 * 10 as the right operand) doesn't work with your grammar because 3 * 10 is not a valid term and the right operand needs to be a term according to your grammar.
In order to get an ambiguity that's resolved in the way you want, you'll need to make both operands of your operators exprs.
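The difference between the two trees can be seen in a small Python sketch. This is not ANTLR's algorithm: parse_flat mimics a grammar in which every right operand is a single term, and parse_prec uses precedence climbing as a stand-in for "both operands are exprs, with precedence deciding". Both function names are hypothetical, and parentheses are left out to keep it short.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_flat(tokens):
    """Every right operand is a single term, so operators simply nest to the left."""
    tokens = list(tokens)
    tree = tokens.pop(0)
    while tokens:
        op = tokens.pop(0)
        tree = (op, tree, tokens.pop(0))
    return tree

def parse_prec(tokens, min_prec=1):
    """Precedence climbing: the right operand may itself be a tighter-binding expression."""
    tree = tokens.pop(0)
    while tokens and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        right = parse_prec(tokens, PRECEDENCE[op] + 1)
        tree = (op, tree, right)
    return tree

print(parse_flat("4 + 3 * 10".split()))  # ('*', ('+', '4', '3'), '10')
print(parse_prec("4 + 3 * 10".split()))  # ('+', '4', ('*', '3', '10'))
```

The first tree has * as the outer operator, which is the only parse the original grammar allows; the second is the tree that PEMDAS calls for.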
I am trying to write a top-down recursive-descent parser for a small language, and I am facing some issues with the assignment statements. Here is the grammar from the language specifications:
<assign_stmt> ::= <lvalue> <l_tail> ":=" <expr> ";"
<l_tail> ::= ":=" <lvalue> <l_tail>
| ""
<expr> ::= ....
#multiple layers between <expr> and <lvalue>, like <term>, <factor>, etc.
#in the end, <expr> can be a <lvalue>
| <lvalue>
so that the assignments can look like
a := b := 3;
c := d := e := f;
The grammar does not seem to be ambiguous, but it is causing me issues because <expr> can itself be a <lvalue>. When parsing <l_tail>, both production rules are equally valid and I don't know which one to pick. I tried various left-factorizations (see below), but so far, I have not been able to find an LL(1) grammar that works for me. Is it even possible here?
<assign_stmt> ::= <lvalue> ":=" <rest>
<rest> ::= <expr> ";"
| <lvalue> ":=" <l_tail>
Note that I could get around this issue by parsing <l_tail> and then looking for the ";" token. Depending on the result, I would know whether the last <lvalue> was actually an <expr> or not (without having to backtrack). However, I am learning here, and I would like to know the "right" way to overcome this problem.
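One common variation on the workaround described above is to always parse an <expr>, and only when the next token turns out to be ":=" check that what was parsed is an lvalue. The Python sketch below illustrates that idea under loud assumptions: the expression grammar (operands joined by "+") and the tuple node shapes are invented for the example, since the real <expr> rules are elided in the question. It is one option, not necessarily the "right" way.

```python
import re

def tokenize(text):
    return re.findall(r":=|[A-Za-z_]\w*|\d+|[+;]", text)

def parse_operand(tokens, pos):
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token.isdigit():
        return ("num", int(token)), pos + 1
    if token[0].isalpha() or token[0] == "_":
        return ("var", token), pos + 1  # a bare identifier doubles as an lvalue
    raise SyntaxError(f"unexpected token {token!r}")

def parse_expr(tokens, pos):
    # Hypothetical stand-in for the real <expr>: operand ('+' operand)*
    node, pos = parse_operand(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_operand(tokens, pos + 1)
        node = ("+", node, right)
    return node, pos

def parse_assign(tokens):
    """Return (targets, value) for: lvalue ':=' ... ':=' expr ';'"""
    targets = []
    node, pos = parse_expr(tokens, 0)
    while pos < len(tokens) and tokens[pos] == ":=":
        # Only now do we know the expression just parsed must be an lvalue.
        if node[0] != "var":
            raise SyntaxError("left side of ':=' must be an lvalue")
        targets.append(node)
        node, pos = parse_expr(tokens, pos + 1)
    if not targets:
        raise SyntaxError("expected ':='")
    if pos >= len(tokens) or tokens[pos] != ";":
        raise SyntaxError("expected ';'")
    return targets, node

print(parse_assign(tokenize("a := b := 3;")))
```

The parser never backtracks: it uses one token of lookahead after each expression and moves the lvalue restriction from the grammar into a check on the parsed node.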
I am writing a recursive descent parser for Boolean expressions, for example:
(1 * 0)
(0 + ~1)
(0 * (1 + c))
Where 1 is 'True', 0 is 'False', + is 'or', * is 'and', ~ is 'not' and 'c' is just some variable name (it could be any single alphabetic letter). I plan on using parentheses rather than implementing some kind of order of operations.
My current parser can recognize the following form of expression
Expression ::= 1
| 0
| Character
| ~ Expression
But I am unsure as to how I would implement + and * on top of this. I am fairly certain from what I have read the obvious implementation of
Expression ::= 1
| 0
| Character
| ( Expression + Expression )
| ( Expression * Expression )
Would cause an infinite loop as it is 'left-recursive'. I am unsure how to change this to remove such infinite recursion.
With the parentheses in place, what you have there is not left recursive. Left recursion is when a production can reach itself (directly or indirectly) with no tokens consumed in between. Such grammars do indeed cause infinite recursion in recursive descent parsers, but that can't happen with yours.
You do have the issue that the grammar as it stands cannot be parsed predictively: after an opening parenthesis, it isn't known whether the + or the * form is being parsed until the entire left-hand expression has been parsed.
One way of getting around that issue is by pulling up the common parts in a shared prefix/suffix production:
Expression ::= 1
| 0
| Character
| ParExpr
ParExpr ::= ( Expression ParOp Expression )
ParOp ::= +
| *
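A minimal Python sketch of a recursive descent evaluator for this factored grammar follows. It also includes the ~ Expression alternative from the question's own grammar. The function name evaluate and the variables dictionary are assumptions made for the example.

```python
def evaluate(text, variables=None):
    """Evaluate a fully parenthesized Boolean expression such as "(0 + ~1)"."""
    variables = variables or {}
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def next_token():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        return token

    def expression():
        token = next_token()
        if token == "1":
            return True
        if token == "0":
            return False
        if token == "~":
            return not expression()
        if token == "(":
            # ParExpr ::= ( Expression ParOp Expression )
            # The "(" is consumed before recursing, so there is no left recursion.
            left = expression()
            op = next_token()
            if op not in ("+", "*"):
                raise SyntaxError(f"expected '+' or '*', got {op!r}")
            right = expression()
            if next_token() != ")":
                raise SyntaxError("expected ')'")
            return (left or right) if op == "+" else (left and right)
        if token.isalpha():
            return bool(variables[token])
        raise SyntaxError(f"unexpected token {token!r}")

    result = expression()
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return result

print(evaluate("(1 * 0)"))                      # False
print(evaluate("(0 + ~1)"))                     # False
print(evaluate("(1 * (0 + c))", {"c": True}))   # True
```

The operator is read only after the left operand has been parsed, which is exactly what the shared ParExpr production makes possible.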
Let me search that for you ...
https://en.wikipedia.org/wiki/Recursive_descent_parser
The leading LPAREN keeps this from being left-recursive.
If you want to generalize the expressions and have some operator precedence, follow the expression portion of the BNF in the Wikipedia article.
However, you have a syntax ambiguity in the grammar you've chosen. When you have operators of the same precedence, combine them into a non-terminal, such as
LogOp ::= + | *
Label similar operands to allow for expansion:
UnaryOp ::= ~
Now you can ... never mind, #500 just posted a good answer that covers my final point.
I'm trying to learn about shift-reduce parsing. Suppose we have the following grammar, using recursive rules that enforce order of operations, inspired by the ANSI C Yacc grammar:
S: A;
P
: NUMBER
| '(' S ')'
;
M
: P
| M '*' P
| M '/' P
;
A
: M
| A '+' M
| A '-' M
;
And we want to parse 1+2 using shift-reduce parsing. First, the 1 is shifted as a NUMBER. My question is, is it then reduced to P, then M, then A, then finally S? How does it know where to stop?
Suppose it does reduce all the way to S, then shifts '+'. We'd now have a stack containing:
S '+'
If we shift '2', the reductions might be:
S '+' NUMBER
S '+' P
S '+' M
S '+' A
S '+' S
Now, on either side of the last line, S could be P, M, A, or NUMBER, and it would still be valid in the sense that any combination would be a correct representation of the text. How does the parser "know" to make it
A '+' M
So that it can reduce the whole expression to A, then S? In other words, how does it know to stop reducing before shifting the next token? Is this a key difficulty in LR parser generation?
Edit: An addition to the question follows.
Now suppose we parse 1+2*3. Some shift/reduce operations are as follows:
Stack    | Input | Operation
---------+-------+----------------------------------------------
         | 1+2*3 |
NUMBER   | +2*3  | Shift
A        | +2*3  | Reduce (looking ahead, we know to stop at A)
A+       | 2*3   | Shift
A+NUMBER | *3    | Shift (looking ahead, we know to stop at M)
A+M      | *3    | Reduce (looking ahead, we know to stop at M)
Is this correct (granted, it's not fully parsed yet)? Moreover, does lookahead by 1 symbol also tell us not to reduce A+M to A, as doing so would result in an inevitable syntax error after reading *3 ?
The problem you're describing is an issue with creating LR(0) parsers - that is, bottom-up parsers that don't do any lookahead to symbols beyond the current one they are parsing. The grammar you've described doesn't appear to be an LR(0) grammar, which is why you run into trouble when trying to parse it w/o lookahead. It does appear to be LR(1), however, so by looking 1 symbol ahead in the input you could easily determine whether to shift or reduce. In this case, an LR(1) parser would look ahead when it had the 1 on the stack, see that the next symbol is a +, and realize that it shouldn't reduce past A (since that is the only thing it could reduce to that would still match a rule with + in the second position).
An interesting property of LR grammars is that for any grammar which is LR(k) for k>1, it is possible to construct an LR(1) grammar which is equivalent. However, the same does not extend all the way down to LR(0) - there are many grammars which cannot be converted to LR(0).
See here for more details on LR(k)-ness:
http://en.wikipedia.org/wiki/LR_parser
I'm not exactly sure of the Yacc / Bison parsing algorithm and when it prefers shifting over reducing, however I know that Bison supports LR(1) parsing which means it has a lookahead token. This means that tokens aren't passed to the stack immediately. Rather they wait until no more reductions can happen. Then, if shifting the next token makes sense it applies that operation.
First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token to an A because the '+' lookahead token indicates that it's the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will shift the 2 and reduce to an M since A + M produces an A and the expression is complete.
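The role of the lookahead token can be illustrated with a small hand-written shift-reduce loop in Python. This is only a sketch: a real Yacc/Bison parser is driven by generated tables, whereas here the shift/reduce decisions are hard-coded for the question's grammar, parentheses are omitted, and the function name shift_reduce is made up.

```python
def shift_reduce(tokens):
    """Return the (action, stack) steps taken while parsing a list of token kinds."""
    tokens = list(tokens) + ["$"]  # "$" marks the end of input
    stack, trace, pos = [], [], 0

    def shift():
        nonlocal pos
        stack.append(tokens[pos])
        trace.append((f"shift {tokens[pos]}", list(stack)))
        pos += 1

    def reduce_to(count, symbol):
        del stack[len(stack) - count:]
        stack.append(symbol)
        trace.append((f"reduce to {symbol}", list(stack)))

    while True:
        lookahead = tokens[pos]
        top = stack[-1] if stack else None
        if top == "NUMBER":
            reduce_to(1, "P")                      # P: NUMBER
        elif top == "P" and stack[-3:-1] in (["M", "*"], ["M", "/"]):
            reduce_to(3, "M")                      # M: M '*' P | M '/' P
        elif top == "P":
            reduce_to(1, "M")                      # M: P
        elif top == "M" and lookahead in ("*", "/"):
            shift()                                # lookahead says: do not reduce M yet
        elif top == "M" and stack[-3:-1] in (["A", "+"], ["A", "-"]):
            reduce_to(3, "A")                      # A: A '+' M | A '-' M
        elif top == "M":
            reduce_to(1, "A")                      # A: M
        elif top == "A" and lookahead == "$":
            reduce_to(1, "S")                      # S: A, only at end of input
        elif stack == ["S"] and lookahead == "$":
            return trace                           # accept
        elif lookahead != "$":
            shift()
        else:
            raise SyntaxError(f"cannot continue with stack {stack}")

for action, stack in shift_reduce(["NUMBER", "+", "NUMBER", "*", "NUMBER"]):
    print(f"{action:14} {' '.join(stack)}")
```

For 1+2*3 the loop reaches the stack A + M with * as the lookahead and shifts instead of reducing to A, which is the decision the question's edit asks about.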
The following simple "calculator expression" grammar (BNF) can be easily parsed with a trivial recursive-descent parser, which is predictive LL(1):
<expr> := <term> + <term>
| <term> - <term>
| <term>
<term> := <factor> * <factor>
| <factor> / <factor>
| <factor>
<factor> := <number>
| <id>
| ( <expr> )
<number> := \d+
<id> := [a-zA-Z_]\w*
Because it is always enough to see the next token in order to know the rule to pick. However, suppose that I add the following rule:
<command> := <expr>
| <id> = <expr>
For the purpose of interacting with the calculator on the command line, with variables, like this:
calc> 5+5
=> 10
calc> x = 8
calc> 6 * x + 1
=> 49
Is it true that I can not use a simple LL(1) predictive parser to parse <command> rules ? I tried to write the parser for it, but it seems that I need to know more tokens forward. Is the solution to use backtracking, or can I just implement LL(2) and always look two tokens forward ?
How do RD parser generators handle this problem (ANTLR, for instance)?
The problem with
<command> := <expr>
| <id> = <expr>
is that when you "see" <id> you can't tell whether it's the beginning of an assignment (second rule) or a <factor>. You will only know when you read the next token.
AFAIK ANTLR is LL(*) (and is also able to generate packrat parsers, if I'm not mistaken), so it will probably handle this grammar by considering two tokens at once.
If you can play with the grammar I would suggest to either add a keyword for the assignment (e.g. let x = 8) :
<command> := <expr>
| "let" <id> "=" <expr>
or use the = to signify evaluation:
<command> := "=" <expr>
| <id> "=" <expr>
I think there are two ways to solve this with a recursive descent parser: either by using (more) lookahead or by backtracking.
Lookahead
command() {
if (currentToken() == id && lookaheadToken() == '=') {
return assignment();
} else {
return expr();
}
}
Backtracking
command() {
savedLocation = scanLocation();
if (accept( id )) {
identifier = acceptedTokenValue();
if (!accept( '=' )) {
setScanLocation( savedLocation );
return expr();
}
return new assignment( identifier, expr() );
} else {
return expr();
}
}
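The lookahead variant can be turned into a small runnable Python sketch. The Calculator class and its method names are assumptions made for illustration; expr and term loop over repeated operators and "/" is integer division, which goes slightly beyond the grammar in the question.

```python
import re

def tokenize(text):
    # Sketch-level tokenizer: characters that match nothing are silently skipped.
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()=]", text)

def is_id(token):
    return token is not None and (token[0].isalpha() or token[0] == "_")

class Calculator:
    def __init__(self):
        self.variables = {}

    def command(self, text):
        self.tokens, self.pos = tokenize(text), 0
        # Two tokens of lookahead decide between assignment and expression.
        if is_id(self.peek(0)) and self.peek(1) == "=":
            name = self.advance()
            self.advance()  # consume "="
            self.variables[name] = self.expr()
            result = None
        else:
            result = self.expr()
        if self.pos != len(self.tokens):
            raise SyntaxError("unexpected trailing input")
        return result

    def peek(self, offset=0):
        index = self.pos + offset
        return self.tokens[index] if index < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.advance() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.advance() == "*":
                value *= self.factor()
            else:
                value //= self.factor()
        return value

    def factor(self):
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token.isdigit():
            return int(token)
        if token == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if is_id(token):
            return self.variables[token]
        raise SyntaxError(f"unexpected token {token!r}")

calc = Calculator()
print(calc.command("5+5"))        # 10
calc.command("x = 8")
print(calc.command("6 * x + 1"))  # 49
```

Only the command rule needs the second token; every other rule still decides on a single token, so the rest of the parser stays LL(1).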
The problem is that the grammar:
<command> := <expr>
| <id> = <expr>
is not a mutually-recursive procedure. For a recursive descent parser you will need to determine a non-recursive equivalent.
rdentato's post shows how to fix this, assuming you can play with the grammar. This PowerPoint spells out the problem in a bit more detail and shows how to correct it:
http://www.google.com/url?sa=t&source=web&ct=res&cd=7&url=http%3A%2F%2Fxml.cs.nccu.edu.tw%2Fcourses%2Fcompiler%2Fcp2006%2Fslides%2Flec3-Parsing%26TopDownParsing.ppt&ei=-YLaSPrWGaPwhAK5ydCqBQ&usg=AFQjCNGAFrODJxoxkgJEwDMQ8A8594vn0Q&sig2=nlYKQVfakmqy_57137XzrQ
ANTLR 3 uses an "LL(*)" parser as opposed to an LL(k) parser, so it will look ahead until it reaches the end of the input if it has to, without backtracking, using a specially optimized deterministic finite automaton (DFA).