How to make this grammar LL(1) - parsing

I want to know if it would be possible to transform this grammar to LL(1). This is the grammar:
A -> B
| C
B -> a
| a ';'
C -> a D
| a D ';'
D -> ';' a
| D ';' a

Since this language is regular ( a; | a(;a)+;? ), then yes, it would be possible.
Not sure if I'm using the right syntax, but the language is basically a; (using A->B) or any string that starts with an a, followed one or more ;a pairs, optionally adding another ; on the end.

This is the same grammar but simpler:
A -> a | a ';' | a ';' A
It still not LL(1). But removing left factor now it is LL(1):
A -> a B B -> ε | ';' C C -> ε | A

Related

This context-free grammar is ambiguous and I'm not sure why. The SLR(1) compiler I'm building doesn't work the way I expect it to

I'm building a syntax parser. It's going good to be SLR(1) but I believe there are some reduce/shift conflicts or some kind of conflict that is making the parser reject strings too early . Here is the grammar:
Note: I did left factor the grammar to see if that was the problem, but that doesn't get rid of ambiguity. However this is the original grammar without left factoring
P'' -> P'$
P' -> P
P -> C | C;D
D -> R | RD
R -> pu{P}
C -> I | I;C
I -> h | O | A | R | Z
O -> i(V) | z(V)
Y -> u
V -> S | N
S -> u
N -> u
A -> S=s | S=S | N=X
X -> N | b | L
L -> d(X,X) | s(X,X) | m(X,X)
R -> f(B)t{C} | f(B)t{C}1{C}
B -> e(V,V) | (N<N) | (N>N) | nB | a(B,B) | o(B,B)
Z -> w(B){C} | r(N=0;N<N;N=a(N,1)){C}
I understand this grammar is quite big, but if you could help me here you would be a life saver. Thank you in advance!
Having recognized an I, and with ; as the next symbol, there's a shift-reduce conflict:
The production C -> I;C says to shift the ;.
The production P -> C;D says to reduce via C -> I.
So the grammar is not SLR(1).

Convert C- grammar to LL(1)

I am currently building a compiler for C-. I am currently working on the parser, and for some reason, I can't seem to resolve the first-sets collision (of terminal id) origination from the EXPRESSION production. Below, Is a subset of the grammar I have now, could someone point me in the right direction as to how to resolve the collision (or convert to a an equivalent LL(1) parseable grammar).
EXPRESSION -> id VAR eq EXPRESSION | SIMPLEEXPRESSION
VAR -> lbracket EXPRESSION rbracket | empty
SIMPLEEXPRESSION -> ADDITIVEEXPRESSION FADDITIVEEXPRESSION
FADDITIVEEXPRESSION -> RELOP ADDITIVEEXPRESSION | empty
RELOP -> ltoreq | lt | gt | gtoreq | doubleeq | noteq
ADDITIVEEXPRESSION -> TERM ADDITIVEEXPRESSION1
ADDITIVEEXPRESSION1 -> ADDOP TERM ADDITIVEEXPRESSION1 | empty
ADDOP -> plus | minus
TERM -> FACTOR TERM1
TERM1 -> MULOP FACTOR TERM1 | empty
MULOP -> times | divide
FACTOR -> lparen EXPRESSION rparen | id FACTOR1 | num
FACTOR1 -> a | b
C does not lend itself very well to LL(1) parsing, so what you are trying to do here could be quite difficult to achieve and may not even be possible for the full grammar.
But for the problem at hand, for the top-level production
EXPRESSION -> id VAR eq EXPRESSION | SIMPLEEXPRESSION
it's easy to see that id can be the start of either alternative, so an LL(1) parser will not know which alternative to pick.
One solution to the immediate problem would be to split the EXPRESSION production into two alternatives, one that always starts with an id terminal, and one that never does:
EXPRESSION -> EXPRESSION_id | EXPRESSION_non_id
For the id alternative, we would require the id terminal up front and then create id-only versions of the productions that follow:
EXPRESSION_id -> id (VAR eq EXPRESSION | SIMPLEEXPRESSION_id)
Similarly, for the non-id side, we would create non-id versions of the productions that follow:
EXPRESSION_non_id -> SIMPLEEXPRESSION_non_id
The required sub-productions to complete the grammar would look something like this:
SIMPLEEXPRESSION_id -> ADDITIVEEXPRESSION_id FADDITIVEEXPRESSION
ADDITIVEEXPRESSION_id -> TERM_id ADDITIVEEXPRESSION1
TERM_id -> FACTOR_id TERM1
FACTOR_id -> FACTOR1
SIMPLEEXPRESSION_non_id -> ADDITIVEEXPRESSION_non_id FADDITIVEEXPRESSION
ADDITIVEEXPRESSION_non_id -> TERM_non_id ADDITIVEEXPRESSION1
TERM_non_id -> FACTOR_non_id TERM1
FACTOR_non_id -> lparen EXPRESSION rparen | num
You can make similar transformations for other conflicts, but the resulting grammar can become quite unwieldy.

Is it possible to transform this grammar to be LR(1)?

The following grammar generates the sentences a, a, a, b, b, b, ..., h, b. Unfortunately it is not LR(1) so cannot be used with tools such as "yacc".
S -> a comma a.
S -> C comma b.
C -> a | b | c | d | e | f | g | h.
Is it possible to transform this grammar to be LR(1) (or even LALR(1), LL(k) or LL(1)) without the need to expand the nonterminal C and thus significantly increase the number of productions?
Not as long as you have the nonterminal C unchanged preceding comma in some rule.
In that case it is clear that a parser cannot decide, having seen an "a", and having lookahead "comma", whether to reduce or shift. So with C unchanged, this grammar is not LR(1), as you have said.
But the solution lies in the two phrases, "having seen an 'a'" and "C unchanged". You asked if there's fix that doesn't expand C. There isn't, but you could expand C "a little bit" by removing "a" from C, since that's the source of the problem:
S -> a comma a .
S -> a comma b .
S -> C comma b .
C -> b | c | d | e | f | g | h .
So, we did not "significantly" increase the number of productions.

can removing left recursion introduce ambiguity?

Let's assume we have the following CFG G:
A -> A b A
A -> a
Which should produce the strings
a, aba, ababa, abababa, and so on. Now I want to remove the left recursion to make it suitable for predictive parsing. The dragon book gives the following rule to remove immediate left recursions.
Given
A -> Aa | b
rewrite as
A -> b A'
A' -> a A'
| ε
If we simply apply the rule to the grammar from above, we get grammar G':
A -> a A'
A' -> b A A'
| ε
Which looks good to me, but apparently this grammar is not LL(1), because of some ambiguity. I get the following First/Follow sets:
First(A) = { "a" }
First(A') = { ε, "b" }
Follow(A) = { $, "b" }
Follow(A') = { $, "b" }
From which I construct the parsing table
| a | b | $ |
----------------------------------------------------
A | A -> a A' | | |
A' | | A' -> b A A' | A' -> ε |
| | A' -> ε | |
and there is a conflict in T[A',b], so the grammar isn't LL(1), although there are no left recursions any more and there are also no common prefixes of the productions so that it would require left factoring.
I'm not completely sure where the ambiguity comes from. I guess that during parsing the stack would fill with S'. And you can basically remove it (reduce to epsilon), if it isn't needed any more. I think this is the case if another S' is below on on the stack.
I think the LL(1) grammar G'' that I try to get from the original one would be:
A -> a A'
A' -> b A
| ε
Am I missing something? Did I do anything wrong?
Is there a more general procedure for removing left recursion that considers this edge case? If I want to automatically remove left recursions I should be able to handle this, right?
Is the second grammar G' a LL(k) grammar for some k > 1?
The original grammar was ambiguous, so it is not surprising that the new grammar is also ambiguous.
Consider the string a b a b a. We can derive this in two ways from the original grammar:
A ⇒ A b A
⇒ A b a
⇒ A b A b a
⇒ A b a b a
⇒ a b a b a
A ⇒ A b A
⇒ A b A b A
⇒ A b A b a
⇒ A b a b a
⇒ a b a b a
Unambiguous grammars are, of course possible. Here are right- and left-recursive versions:
A ⇒ a A ⇒ a
A ⇒ a b A A ⇒ A b a
(Although these represent the same language, they have different parses: the right-recursive version associates to the right, while the left-recursive one associates to the left.)
Removing left recursion cannot introduce ambiguity. This kind of transformation preserves ambiguity. If the CFG is already ambiguous, the result will be ambiguous too, and if the original is not, the resulting neither.

Making a Grammar LL(1)

I have the following grammar:
S → a S b S | b S a S | ε
Since I'm trying to write a small compiler for it, I'd like to make it LL(1). I see that there seems to be a FIRST/FOLLOW conflict here, and I know I have to use substitution to resolve it, but I'm not exactly sure how to go about it. Here is my proposed grammar, but I'm not sure if it's correct:
S-> aSbT | epsilon
T-> bFaF| epsilon
F-> epsilon
Can someone help out?
In his original paper on LR parsing, Knuth gives the following grammar for this language, which he conjectures "is the briefest possible unambiguous grammar for this language:"
S → ε | aAbS | bBaS
A → ε | aAbA
B → ε | bBaB
Intuitively, this tries to break up any string of As and Bs into blocks that balance out completely. Some blocks start with a and end with b, while others start with b and end with a.
We can compute FIRST and FOLLOW sets as follows:
FIRST(S) = { ε, a, b }
FIRST(A) = { ε, a }
FIRST(B) = { ε, b }
FOLLOW(S) = { $ }
FOLLOW(A) = { b }
FOLLOW(B) = { a }
Based on this, we get the following LL(1) parse table:
| a | b | $
--+-------+-------+-------
S | aAbS | bBaS | e
A | aAbA | e |
B | e | bBaB |
And so this grammar is not only LR(1), but it's LL(1) as well.
Hope this helps!

Resources