Converting given ambiguous arithmetic expression grammar to unambiguous LL(1) - parsing

In this term, I have course on Compilers and we are currently studying syntax - different grammars and types of parsers. I came across a problem which I can't exactly figure out, or at least I can't make sure I'm doing it correctly. I already did 2 attempts and counterexamples were found.
I am given this ambiguous grammar for arithmetic expressions:
E → E+E | E-E | E*E | E/E | E^E | -E | (E)| id | num , where ^ stands for power.
I figured out what the priorities should be. Highest priority are parenthesis, followed by power, followed by unary minus, followed by multiplication and division, and then there is addition and substraction. I am asked to convert this into equivalent LL(1) grammar. So I wrote this:
E → E+A | E-A | A
A → A*B | A/B | B
B → -C | C
C → D^C | D
D → (E) | id | num
What seems to be the problem with this is not equivalent grammar to the first one, although it's non-ambiguous. For example: Given grammar can recognize input: --5 while my grammar can't. How can I make sure I'm covering all cases? How should I modify my grammar to be equivalent with the given one? Thanks in advance.
Edit: Also, I would of course do elimination of left recursion and left factoring to make this LL(1), but first I need to figure out this main part I asked above.

Here's one that should work for your case
E = E+A | E-A | A
A = A*C | A/C | C
C = C^B | B
B = -B | D
D = (E) | id | num
As a sidenote: pay also attention to the requirements of your task since some applications might assign higher priority to the unary minus operator with respect to the power binary operator.

Related

Parse grammar alternating and repeating

I was able to add support to my parser's grammar for alternating characters (e.g. ababa or baba) by following along with this question.
I'm now looking to extend that by allowing repeats of characters.
For example, I'd like to be able to support abaaabab and aababaaa as well. In my particular case, only the a is allowed to repeat but a solution that allows for repeating b's would also be useful.
Given the rules from the other question:
expr ::= A | B
A ::= "a" B | "a"
B ::= "b" A | "b"
... I tried extending it to support repeats, like so:
expr ::= A | B
# support 1 or more "a"
A_one_or_more = A_one_or_more "a" | "a"
A ::= A_one_or_more B | A_one_or_more
B ::= "b" A | "b"
... but that grammar is ambiguous. Is it possible for this to be made unambiguous, and if so could anyone help me disambiguate it?
I'm using the lemon parser which is an LALR(1) parser.
The point of parsing, in general, is to parse; that is, determine the syntactic structure of an input. That's significantly different from simply verifying that an input belongs to a language.
For example, the language which consists of arbitrary repetitions of a and b can be described with the regular expression (a|b)*, which can be written in BNF as
S ::= /* empty */ | S a | S b
But that probably does not capture the syntactic structure you are trying to defind. On the other hand, since you don't specify that structure, it is hard to know.
Here are a couple more possibilities, which build different parse trees:
S ::= E | S E
E ::= A b | E b
A ::= a | A a
S ::= E | S E
E ::= A B
A ::= a | A a
B ::= b | B b
When writing a grammar to parse a language, it is useful to start by drawing your proposed parse trees. Usually, you can write the grammar directly from the form of the trees, which shows that a formal grammar is primarily a documentation tool, since it clearly describes the language in a way that informal descriptions cannot. Using a parser generator to turn that grammar into a parser ensures that the parser implements the described language. Or, at least, that is the goal.
Here is a nice tool for checking your grammar online http://smlweb.cpsc.ucalgary.ca/start.html. It actually accepts the grammar you provided as a valid LALR(1) grammar.
A different LALR(1) grammar, that allows reapeating a's, would be:
S ::= "a" S | "a" | "b" A | "b"
A ::= "a" S .

How to eliminate this Left Recursion for LL Parser

How do you eliminate a left recursion of the following type. I can't seem to be able to apply the general rule on this particular one.
A -> A | a | b
By using the elimination rule you get:
A -> aA' | bA'
A' -> A' | epsilon
Which still has left recursion.
Does this say anything about the grammar being/not being LL(1)?
Thank you.
Notice that the rule
A → A
is, in a sense, entirely useless. It doesn't do anything to a derivation to apply this rule. As a result, we can safely remove it from the grammar without changing what the grammar produces. This leaves
A → a | b
which is LL(1).

Is it possible to transform this grammar to be LR(1)?

The following grammar generates the sentences a, a, a, b, b, b, ..., h, b. Unfortunately it is not LR(1) so cannot be used with tools such as "yacc".
S -> a comma a.
S -> C comma b.
C -> a | b | c | d | e | f | g | h.
Is it possible to transform this grammar to be LR(1) (or even LALR(1), LL(k) or LL(1)) without the need to expand the nonterminal C and thus significantly increase the number of productions?
Not as long as you have the nonterminal C unchanged preceding comma in some rule.
In that case it is clear that a parser cannot decide, having seen an "a", and having lookahead "comma", whether to reduce or shift. So with C unchanged, this grammar is not LR(1), as you have said.
But the solution lies in the two phrases, "having seen an 'a'" and "C unchanged". You asked if there's fix that doesn't expand C. There isn't, but you could expand C "a little bit" by removing "a" from C, since that's the source of the problem:
S -> a comma a .
S -> a comma b .
S -> C comma b .
C -> b | c | d | e | f | g | h .
So, we did not "significantly" increase the number of productions.

Can a table-based LL parser handle repetition without right-recursion?

I understand how an LL recursive descent parser can handle rules of this form:
A = B*;
with a simple loop that checks whether to continue looping or not based on whether the lookahead token matches a terminal in the FIRST set of B. However, I'm curious about table based LL parsers: how can rules of this form work there? As far as I know, the only way to handle repetition like this in one is through right-recursion, but that messes up associativity in cases where a right-associative parse tree is not desired.
I'd like to know because I'm currently attempting to write an LL(1) table-based parser generator and I'm not sure how to handle a case like this without changing the intended parse tree shape.
The Grammar
Let's expand your EBNF grammar to simple BNF and assume, that b is a terminal and <e> is an empty string:
A -> X
X -> BX
X -> <e>
B -> b
This grammar produces strings of terminal b's of any length.
The LL(1) Table
To construct the table, we will need to generate the first and follow sets (constructing an LL(1) parsing table).
First sets
First(α) is the set of terminals that begin strings derived from any string of grammar symbols α.
First(A) : b, <e>
First(X) : b, <e>
First(B) : b
Follow sets
Follow(A) is the set of terminals a that can
appear immediately to the right of a nonterminal A.
Follow(A) : $
Follow(X) : $
Follow(B) : b$
Table
We can now construct the table based on the sets, $ is the end of input marker.
+---+---------+----------+
| | b | $ |
+---+---------+----------+
| A | A -> X | A -> X |
| X | X -> BX | X -> <e> |
| B | B -> b | |
+---+---------+----------+
The parser action always depends on the top of the parse stack and the next input symbol.
Terminal on top of the parse stack:
Matches the input symbol: pop stack, advance to the next input symbol
No match: parse error
Nonterminal on top of the parse stack:
Parse table contains production: apply production to stack
Cell is empty: parse error
$ on top of the parse stack:
$ is the input symbol: accept input
$ is not the input symbol: parse error
Sample Parse
Let us analyze the input bb. The initial parse stack contains the start symbol and the end marker A $.
+-------+-------+-----------+
| Stack | Input | Action |
+-------+-------+-----------+
| A $ | bb$ | A -> X |
| X $ | bb$ | X -> BX |
| B X $ | bb$ | B -> b |
| b X $ | bb$ | consume b |
| X $ | b$ | X -> BX |
| B X $ | b$ | B -> b |
| b X $ | b$ | consume b |
| X $ | $ | X -> <e> |
| $ | $ | accept |
+-------+-------+-----------+
Conclusion
As you can see, rules of the form A = B* can be parsed without problems. The resulting concrete parse tree for input bb would be:
Yes, this is definitely possible. The standard method of rewriting to BNF and constructing a parse table is useful for figuring out how the parser should work – but as far as I can tell, what you're asking is how you can avoid the recursive part, which would mean that you'd get the slanted binary tree/linked list form of AST.
If you're hand-coding the parser, you can simply use a loop, using the lookaheads from the parse table that indicate a recursive call to decide to go around the loop once more. (I.e., you could just use while with those lookaheads as the condition.) Then for each iteration, you simply append the constructed subtree as a child of the current parent. In your case, then, A would get several direct B-children.
Now, as I understand it, you're building a parser generator, and it might be easiest to follow the standard procedure, going via plan BNF. However, that's not really an issue; there is no substantive difference between iteration and recursion, after all. You simply have to have a class of “helper rules” that don't introduce new AST nodes, but that rather append their result to the node of the nonterminal that triggered them. So when turning the repetition into X -> BX, rather than constructing X nodes, you have your X rule extend the child-list of the A or X (whichever triggered it) by its own children. You'll still end up with A having several B children, and no X nodes in sight.

Shift-reduce: when to stop reducing?

I'm trying to learn about shift-reduce parsing. Suppose we have the following grammar, using recursive rules that enforce order of operations, inspired by the ANSI C Yacc grammar:
S: A;
P
: NUMBER
| '(' S ')'
;
M
: P
| M '*' P
| M '/' P
;
A
: M
| A '+' M
| A '-' M
;
And we want to parse 1+2 using shift-reduce parsing. First, the 1 is shifted as a NUMBER. My question is, is it then reduced to P, then M, then A, then finally S? How does it know where to stop?
Suppose it does reduce all the way to S, then shifts '+'. We'd now have a stack containing:
S '+'
If we shift '2', the reductions might be:
S '+' NUMBER
S '+' P
S '+' M
S '+' A
S '+' S
Now, on either side of the last line, S could be P, M, A, or NUMBER, and it would still be valid in the sense that any combination would be a correct representation of the text. How does the parser "know" to make it
A '+' M
So that it can reduce the whole expression to A, then S? In other words, how does it know to stop reducing before shifting the next token? Is this a key difficulty in LR parser generation?
Edit: An addition to the question follows.
Now suppose we parse 1+2*3. Some shift/reduce operations are as follows:
Stack | Input | Operation
---------+-------+----------------------------------------------
| 1+2*3 |
NUMBER | +2*3 | Shift
A | +2*3 | Reduce (looking ahead, we know to stop at A)
A+ | 2*3 | Shift
A+NUMBER | *3 | Shift (looking ahead, we know to stop at M)
A+M | *3 | Reduce (looking ahead, we know to stop at M)
Is this correct (granted, it's not fully parsed yet)? Moreover, does lookahead by 1 symbol also tell us not to reduce A+M to A, as doing so would result in an inevitable syntax error after reading *3 ?
The problem you're describing is an issue with creating LR(0) parsers - that is, bottom-up parsers that don't do any lookahead to symbols beyond the current one they are parsing. The grammar you've described doesn't appear to be an LR(0) grammar, which is why you run into trouble when trying to parse it w/o lookahead. It does appear to be LR(1), however, so by looking 1 symbol ahead in the input you could easily determine whether to shift or reduce. In this case, an LR(1) parser would look ahead when it had the 1 on the stack, see that the next symbol is a +, and realize that it shouldn't reduce past A (since that is the only thing it could reduce to that would still match a rule with + in the second position).
An interesting property of LR grammars is that for any grammar which is LR(k) for k>1, it is possible to construct an LR(1) grammar which is equivalent. However, the same does not extend all the way down to LR(0) - there are many grammars which cannot be converted to LR(0).
See here for more details on LR(k)-ness:
http://en.wikipedia.org/wiki/LR_parser
I'm not exactly sure of the Yacc / Bison parsing algorithm and when it prefers shifting over reducing, however I know that Bison supports LR(1) parsing which means it has a lookahead token. This means that tokens aren't passed to the stack immediately. Rather they wait until no more reductions can happen. Then, if shifting the next token makes sense it applies that operation.
First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token to an A because the '+' lookahead token indicates that its the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will shift the 2 and reduce to an M since A + M produces an A and the expression is complete.

Resources