Can a table-based LL parser handle repetition without right-recursion? - parsing

I understand how an LL recursive descent parser can handle rules of this form:
A = B*;
with a simple loop that checks whether to continue looping or not based on whether the lookahead token matches a terminal in the FIRST set of B. However, I'm curious about table based LL parsers: how can rules of this form work there? As far as I know, the only way to handle repetition like this in one is through right-recursion, but that messes up associativity in cases where a right-associative parse tree is not desired.
I'd like to know because I'm currently attempting to write an LL(1) table-based parser generator and I'm not sure how to handle a case like this without changing the intended parse tree shape.

The Grammar
Let's expand your EBNF grammar to simple BNF and assume, that b is a terminal and <e> is an empty string:
A -> X
X -> BX
X -> <e>
B -> b
This grammar produces strings of terminal b's of any length.
The LL(1) Table
To construct the table, we will need to generate the first and follow sets (constructing an LL(1) parsing table).
First sets
First(α) is the set of terminals that begin strings derived from any string of grammar symbols α.
First(A) : b, <e>
First(X) : b, <e>
First(B) : b
Follow sets
Follow(A) is the set of terminals a that can
appear immediately to the right of a nonterminal A.
Follow(A) : $
Follow(X) : $
Follow(B) : b$
Table
We can now construct the table based on the sets, $ is the end of input marker.
+---+---------+----------+
| | b | $ |
+---+---------+----------+
| A | A -> X | A -> X |
| X | X -> BX | X -> <e> |
| B | B -> b | |
+---+---------+----------+
The parser action always depends on the top of the parse stack and the next input symbol.
Terminal on top of the parse stack:
Matches the input symbol: pop stack, advance to the next input symbol
No match: parse error
Nonterminal on top of the parse stack:
Parse table contains production: apply production to stack
Cell is empty: parse error
$ on top of the parse stack:
$ is the input symbol: accept input
$ is not the input symbol: parse error
Sample Parse
Let us analyze the input bb. The initial parse stack contains the start symbol and the end marker A $.
+-------+-------+-----------+
| Stack | Input | Action |
+-------+-------+-----------+
| A $ | bb$ | A -> X |
| X $ | bb$ | X -> BX |
| B X $ | bb$ | B -> b |
| b X $ | bb$ | consume b |
| X $ | b$ | X -> BX |
| B X $ | b$ | B -> b |
| b X $ | b$ | consume b |
| X $ | $ | X -> <e> |
| $ | $ | accept |
+-------+-------+-----------+
Conclusion
As you can see, rules of the form A = B* can be parsed without problems. The resulting concrete parse tree for input bb would be:

Yes, this is definitely possible. The standard method of rewriting to BNF and constructing a parse table is useful for figuring out how the parser should work – but as far as I can tell, what you're asking is how you can avoid the recursive part, which would mean that you'd get the slanted binary tree/linked list form of AST.
If you're hand-coding the parser, you can simply use a loop, using the lookaheads from the parse table that indicate a recursive call to decide to go around the loop once more. (I.e., you could just use while with those lookaheads as the condition.) Then for each iteration, you simply append the constructed subtree as a child of the current parent. In your case, then, A would get several direct B-children.
Now, as I understand it, you're building a parser generator, and it might be easiest to follow the standard procedure, going via plan BNF. However, that's not really an issue; there is no substantive difference between iteration and recursion, after all. You simply have to have a class of “helper rules” that don't introduce new AST nodes, but that rather append their result to the node of the nonterminal that triggered them. So when turning the repetition into X -> BX, rather than constructing X nodes, you have your X rule extend the child-list of the A or X (whichever triggered it) by its own children. You'll still end up with A having several B children, and no X nodes in sight.

Related

Does an empty column in an LL(1) parsing table mean that it is wrong?

I have a grammar and I am asked to build a parsing table and to verify that it is an LL(1) grammar. I noticed after building the parsing table that some of the columns were empty and had no production rules in them. Does this mean that it is wrongfully built? Or that it is not LL(1)? Or maybe something is missing?
Thank you.
That’s nothing to worry about. An empty column indicates that there’s a terminal symbol that never starts a production (among other things). For example, take this simple LL(1) grammar:
S → abcdefg
Here, in the row for S, the columns for b, c, d, e, f, and g will all be empty, and the column for a will be the rule S → abcdefg. More specifically, the augmented grammar looks like this:
S' → S
S → abcdefg
The parsing table looks like this:
| a | b | c | d | e | f | g
---+---------+---+---+---+---+---+---
S'| S | | | | | |
S | abcdefg | | | | | |
Notice that most columns are empty.

Can a follow-follow conflict exist in a grammar?

I know that First/First and First/Follow conflicts exist in a grammar which makes the grammar "not LL(1)". I was just wondering if Follow/Follow conflict exist in a grammar.
Yes, this is possible, but it requires an unusual configuration to make it happen. Consider the following grammar, which has been augmented with a new start symbol:
S' → S$
S → tT
T → A | B
A → ε
B → ε
Now, let's imagine trying to fill in our LL(1) parse table, which is shown here:
$ t
+----------+----------+
S' | | S' -> S$ |
+----------+----------+
S | | S -> tT |
+----------+----------+
T | T -> A | |
| T -> B | |
+----------+----------+
A | A -> e | |
+----------+----------+
B | B -> e | |
+----------+----------+
Notice that there are two items in the entry for (T, $). And that makes sense: if we have the active nonterminal T and see a $, we know that we need to select a production that's going to expand out to the empty string. And we have two different ways of doing this: we could use T → A or T → B, with the ultimate goal of expanding each of those nonterminals out to the empty string. This is a problem - we can't predict which route to take.
Now, what sort of conflict is this? It can't be a FIRST/FIRST conflict, because FIRST(A) = {ε} and FIRST(B) = {ε}, so neither A nor B has any terminals in its first set. It can't be a FIRST/FOLLOW conflict for the same reason.
That means that it's the rare FOLLOW/FOLLOW conflict - we know that we'd choose the production based on what's in the FOLLOW sets of A and B, and yet they're exactly identical to one another and so the parser can't choose what to do next unambiguously.
This is prehaps a simpler example
S → A a
A → B | C
B → ε
C → ε
Here, since a is both in the FOLLOW of B and C, on (A, a) there will be a conflict between A → B and A → C. Note that there are no other conflicts.

Converting given ambiguous arithmetic expression grammar to unambiguous LL(1)

In this term, I have course on Compilers and we are currently studying syntax - different grammars and types of parsers. I came across a problem which I can't exactly figure out, or at least I can't make sure I'm doing it correctly. I already did 2 attempts and counterexamples were found.
I am given this ambiguous grammar for arithmetic expressions:
E → E+E | E-E | E*E | E/E | E^E | -E | (E)| id | num , where ^ stands for power.
I figured out what the priorities should be. Highest priority are parenthesis, followed by power, followed by unary minus, followed by multiplication and division, and then there is addition and substraction. I am asked to convert this into equivalent LL(1) grammar. So I wrote this:
E → E+A | E-A | A
A → A*B | A/B | B
B → -C | C
C → D^C | D
D → (E) | id | num
What seems to be the problem with this is not equivalent grammar to the first one, although it's non-ambiguous. For example: Given grammar can recognize input: --5 while my grammar can't. How can I make sure I'm covering all cases? How should I modify my grammar to be equivalent with the given one? Thanks in advance.
Edit: Also, I would of course do elimination of left recursion and left factoring to make this LL(1), but first I need to figure out this main part I asked above.
Here's one that should work for your case
E = E+A | E-A | A
A = A*C | A/C | C
C = C^B | B
B = -B | D
D = (E) | id | num
As a sidenote: pay also attention to the requirements of your task since some applications might assign higher priority to the unary minus operator with respect to the power binary operator.

Is it possible to transform this grammar to be LR(1)?

The following grammar generates the sentences a, a, a, b, b, b, ..., h, b. Unfortunately it is not LR(1) so cannot be used with tools such as "yacc".
S -> a comma a.
S -> C comma b.
C -> a | b | c | d | e | f | g | h.
Is it possible to transform this grammar to be LR(1) (or even LALR(1), LL(k) or LL(1)) without the need to expand the nonterminal C and thus significantly increase the number of productions?
Not as long as you have the nonterminal C unchanged preceding comma in some rule.
In that case it is clear that a parser cannot decide, having seen an "a", and having lookahead "comma", whether to reduce or shift. So with C unchanged, this grammar is not LR(1), as you have said.
But the solution lies in the two phrases, "having seen an 'a'" and "C unchanged". You asked if there's fix that doesn't expand C. There isn't, but you could expand C "a little bit" by removing "a" from C, since that's the source of the problem:
S -> a comma a .
S -> a comma b .
S -> C comma b .
C -> b | c | d | e | f | g | h .
So, we did not "significantly" increase the number of productions.

Confused at transferring an ambiguous grammar to an unambiguous one

An ambiguous grammar is given and I am asked to rewrite the grammar to make it unambiguous. In fact, I don't know why the given grammar is ambiguous, let alone rewriting it to an unambiguous one.
The given grammar is S -> SS | a | b , and I have four choices:
A: S -> Sa | Sb | epsilon
B: S -> SS’
S’-> a | b
C: S -> S | S’
S’-> a | b
D: S -> Sa | Sb.
For each choice, I have already know that D is incorrect because it generates no strings at all,C is incorrect because it only matches the strings 'a' and 'b'.
However, I think the answer is A while the correct answer is B.I think B is wrong because it just generates S over and over again, and B can't deal with empty strings.
Why is the given grammar ambiguous?
Why is A incorrect while B is correct?
The original grammar is ambiguous because multiple right-most (or left-most) derivations are possible for any string of at least three letters. For example:
S -> SS -> SSS -> SSa -> Saa -> aaa
S -> SS -> Sa -> SSa -> Saa -> aaa
The first one corresponds, roughly speaking, to the parse a(aa) while the second to the parse (aa)a.
None of the alternatives is correct. A incorrectly matches ε while B does not match anything (like D). If B were, for example,
S -> SS' | S'
S' -> a | b
it would be correct. (This grammar is left-associative.)

Resources