Parse the expression: IF i > i THEN i = i + i * i
using the following CFG definition of a small programming language,
S → ASSIGNMENT$ | GOTO$ | IF$ | IO$
ASSIGNMENT$ → i = ALEX
GOTO$ → GOTO NUMBER
IF$ → IF CONDITION THEN S
| IF CONDITION THEN S ELSE S
CONDITION → ALEX = ALEX | ALEX ≠ ALEX | ALEX > ALEX
| CONDITION AND CONDITION
| CONDITION OR CONDITION
| NOT CONDITION
IO$ → READ i | PRINT i
HINTS:
ALEX stands for algebraic expression
the names ending in $ are classes
the terminals are { = GOTO IF THEN ELSE ≠ > AND OR NOT READ PRINT },
plus whatever terminals are introduced in the definitions of i, ALEX, and NUMBER.
It looks to me like '$' is an abbreviation for 'STATEMENT'. E.g., "IF$" means "IF STATEMENT".
But regardless of what it "means", you can treat it as if it were an ordinary letter for the purpose of the exercise: "IF$" is just the name of a non-terminal that occurs in the grammar.
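Given that reading, one possible derivation of the exercise's input is shown below. The productions for i, ALEX and NUMBER are not given in the exercise, so the steps expanding ALEX to i and to i + i * i are assumptions about how ALEX is defined:

S => IF$
  => IF CONDITION THEN S
  => IF ALEX > ALEX THEN S
  => IF i > i THEN S
  => IF i > i THEN ASSIGNMENT$
  => IF i > i THEN i = ALEX
  => IF i > i THEN i = i + i * i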
Related
I have a grammar with one production rule:
S → aSbS | bSaS | ε
This is ambiguous. Is it allowed to remove the ambiguity this way?
S → A | B | ε
A → aS
B → bS
This makes it unambiguous.
Another grammar:
S → A | B
A → aAb | ab
B → abB | ε
Correction to make it unambiguous
S → A | B
A → aA'b
A' → ab
B → abB | ε
I am not following any formal rules to make the grammars unambiguous. If it is wrong to remove ambiguity this way, can anyone point me to a proper set of rules for removing ambiguity from ambiguous grammars?
As #kaby76 points out in a comment, in your first example you haven't just removed the ambiguity. You have also changed the language recognised by the grammar.
S → a S b S
| b S a S
| ε
recognises only strings with the same number of a's and b's, while
S → A
| B
| ε
A → a S
B → b S
recognises any string made up of a's and b's.
So that's certainly not a legitimate disambiguation.
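This difference is easy to confirm empirically by enumerating every terminal string each grammar derives up to a given length and comparing the sets. This is a sketch; the dictionary encoding of the grammars and the helper name are mine, not from the post:

```python
def language(grammar, start, maxlen):
    """All terminal strings of length <= maxlen derivable from `start`."""
    seen, results, frontier = set(), set(), [list(start)]
    while frontier:
        nxt = []
        for form in frontier:
            nt = next((s for s in form if s in grammar), None)
            if nt is None:                       # all terminals: a derived string
                results.add(''.join(form))
                continue
            i = form.index(nt)
            for rhs in grammar[nt]:              # expand the leftmost nonterminal
                cand = form[:i] + rhs + form[i + 1:]
                terminals = sum(1 for s in cand if s not in grammar)
                if terminals <= maxlen and tuple(cand) not in seen:
                    seen.add(tuple(cand))
                    nxt.append(cand)
        frontier = nxt
    return results

g1 = {'S': [['a', 'S', 'b', 'S'], ['b', 'S', 'a', 'S'], []]}          # original
g2 = {'S': [['A'], ['B'], []], 'A': [['a', 'S']], 'B': [['b', 'S']]}  # "disambiguated"

print(sorted(language(g1, 'S', 2)))                         # ['', 'ab', 'ba']
print(sorted(language(g2, 'S', 2) - language(g1, 'S', 2)))  # ['a', 'aa', 'b', 'bb']
```

Already at length 2, the second grammar accepts strings like a and aa that the first one rejects.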
By the way, your second grammar could have been simplified; A and B serve no useful purpose.
S → a S
| b S
| ε
There are unambiguous grammars for this language. One example:
S → a B S
| b A S
| ε
A → a
| b A A
B → b
| a B B
See this post on the Computer Science StackExchange for an explanation.
In your second example, the grammar
S → A
| B
A → a A b
| a b
B → a b B
| ε
is ambiguous, but only because A and B both match ab. Every other recognised string has exactly one possible parse.
In this grammar, A matches strings which consist of some number of as followed by the same number of bs, while B matches strings which consist of any number of repetitions of ab.
ab fits both criteria: it is a repetition of ab (consisting of just one copy) and it is a sequence of as followed by the same number of bs (in this case, one of each). The empty string also matches both criteria (with repetition count 0), but it has been excluded from A by making the starting rule A → a b. An easy way to make the grammar unambiguous would be to change that base rule to A → a a b b.
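Concretely, ab has these two derivations, one through A and one through B:

S => A => ab
S => B => abB => ab   (using B → ε)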
Again, your disambiguated grammar does not recognise the same language as the original. Your change makes A non-recursive, so it matches only aabb; strings like aaabbb, aaaabbbb, and so on are no longer recognised. So once again, it is not simply an unambiguous version of the original.
Note that this grammar only matches strings with an equal number of as and bs, but it does not match all such strings. There are many other strings with an equal number of as and bs which are not matched, such as ba or aabbab. So its language is a subset of the language of the first grammar, but it is not the same language.
Finally, you ask if there is a mechanical procedure which can create an unambiguous grammar given an ambiguous grammar. The answer is no. It can be proven that there is no algorithm which can even decide whether a particular context-free grammar is ambiguous. Nor is there an algorithm which can decide whether two context-free grammars have the same language. So it shouldn't be surprising that there is no algorithm which can construct an equivalent unambiguous grammar from an ambiguous grammar.
That doesn't mean that the task cannot be done for certain grammars. But it's not a simple mechanical procedure, and it might not work for a particular grammar.
I am trying to build a syntax tree for a regular expression. I use a strategy similar to arithmetic expression evaluation (I know there are other ways, like recursive descent): two stacks, the OPND stack and the OPTR stack, processed as below.
I use different kinds of node to represent different kinds of RE: SymbolExpression, CatExpression, OrExpression and StarExpression, all derived from RegularExpression.
So the OPND stack stores RegularExpression*.
while (c || optr.top()):
    if (!isOp(c)):
        opnd.push(c)
        c = getchar()
    else:
        switch (precede(optr.top(), c)):
            case Less:
                optr.push(c)
                c = getchar()
            case Equal:
                optr.pop()
                c = getchar()
            case Greater:
                pop from opnd and optr, apply the operator,
                then push the result back to opnd
But my primary question is: in typical REs, the cat (concatenation) operator is implicit.
a|bc represents a|b.c, and (a|b)*abb represents (a|b)*.a.b.b. So when meeting a non-operator, how can I determine whether there is an implicit cat operator, and how should I handle that operator to implement the conversion correctly?
Update
Now I've learned that there is a kind of grammar called an "operator precedence grammar", whose evaluation is similar to an arithmetic expression's. It requires that no production have the form S -> ...AB... (where A and B are non-terminals). So I guess I cannot directly use this method to parse regular expressions.
Update II
I tried to design an LL(1) grammar to parse basic regular expressions.
Here's the original grammar. (\| is the escape character, since | is a special character in the grammar's patterns.)
E -> E \| T | T
T -> TF | F
F -> P* | P
P -> (E) | i
To remove the left recursion, introduce new variables:
E -> TE'
E' -> \| TE' | ε
T -> FT'
T' -> FT' | ε
F -> P* | P
P -> (E) | i
Now, for the pattern F -> P* | P, introduce P':
P' -> * | ε
F -> PP'
However, the pattern T' -> FT' | ε has a problem. Consider the case (a|b):
E => TE'
=> FT' E'
=> PT' E'
=> (E)T' E'
=> (TE')T'E'
=> (FT'E')T'E'
=> (PT'E')T'E'
=> (iT'E')T'E'
=> (iFT'E')T'E'
Here, we humans know that we should expand T' using T' -> ε, but the program will just call T' -> FT', which is wrong.
So, what's wrong with this grammar? And how should I rewrite it to make it suitable for the recursive descent method?
1. LL(1) grammar
I don't see any problem with your LL(1) grammar. You are parsing the string
(a|b)
and you have gotten to this point:
(a T'E')T'E' |b)
The lookahead symbol is | and you have two possible productions:
T' ⇒ FT'
T' ⇒ ε
FIRST(F) is {(, i}, so the first production is clearly incorrect, both for the human and the LL(1) parser. (A parser without lookahead couldn't make the decision, but parsers without lookahead are almost useless for practical parsing.)
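A recursive-descent recogniser for the disambiguated grammar above (E → TE', E' → \|TE' | ε, T → FT', T' → FT' | ε, F → PP', P' → * | ε, P → (E) | i) makes exactly this decision: it takes T' → FT' only when the lookahead is in FIRST(F) = {(, i}. A minimal sketch, with class and function names of my own choosing and 'i' taken to be any letter:

```python
class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def eat(self, ch):
        if self.peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {self.pos}")
        self.pos += 1

    def E(self):                      # E -> T E'
        self.T(); self.Eprime()

    def Eprime(self):                 # E' -> | T E'  or  epsilon
        if self.peek() == '|':
            self.eat('|'); self.T(); self.Eprime()

    def T(self):                      # T -> F T'
        self.F(); self.Tprime()

    def Tprime(self):                 # T' -> F T'  only if lookahead in FIRST(F)
        c = self.peek()
        if c == '(' or (c is not None and c.isalpha()):
            self.F(); self.Tprime()

    def F(self):                      # F -> P P',  P' -> *  or  epsilon
        self.P()
        if self.peek() == '*':
            self.eat('*')

    def P(self):                      # P -> ( E )  or  i
        c = self.peek()
        if c == '(':
            self.eat('('); self.E(); self.eat(')')
        elif c is not None and c.isalpha():
            self.pos += 1
        else:
            raise SyntaxError(f"unexpected {c!r} at position {self.pos}")

def recognises(text):
    p = Parser(text)
    try:
        p.E()
        return p.pos == len(text)
    except SyntaxError:
        return False

print(recognises('(a|b)*abb'))   # True
print(recognises('a||b'))        # False
```

On (a|b), when the parser reaches the | it correctly chooses T' → ε, just as described.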
2. Operator precedence parsing
You are technically correct. Your original grammar is not an operator grammar. However, it is normal to augment operator precedence parsers with a small state machine (otherwise algebraic expressions including unary minus, for example, cannot be correctly parsed), and once you have done that it is clear where the implicit concatenation operator must go.
The state machine is logically equivalent to preprocessing the input to insert an explicit concatenation operator where necessary -- that is, between a and b whenever a is in {), *, i} and b is in {(, i}.
You should take note that your original grammar does not really handle regular expressions unless you augment it with an explicit ε primitive to represent the empty string. Otherwise, you have no way to express optional choices, usually represented in regular expressions as an implicit operand (such as (a|), also often written as a?). However, the state machine is easily capable of detecting implicit operands as well because there is no conflict in practice between implicit concatenation and implicit epsilon.
I think just keeping track of the previous character should be enough. So if we have
(a|b)*abb
^--- we are here
c = a
pc = *
We know * is unary, so 'a' cannot be its operand. So we must have concatenation. Similarly at the next step
(a|b)*abb
^--- we are here
c = b
pc = a
a isn't an operator, b isn't an operator, so our hidden operator is between them. One more:
(a|b)*abb
^--- we are here
c = b
pc = |
| is a binary operator expecting a right-hand operand, so we do not concatenate.
The full solution probably involves building a table for each possible pc, which sounds painful, but it should give you enough context to get through.
If you don't want to mess up your loop, you could do a preprocessing pass where you insert your own concatenation character using similar logic. Can't tell you if that's better or worse, but it's an idea.
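Such a preprocessing pass might look like the sketch below: it inserts an explicit '.' between any token that can end an operand ( ')', '*', or a symbol) and any token that can start one ( '(' or a symbol). The function name and the choice of '.' as the explicit operator are mine:

```python
def insert_concat(regex):
    out = []
    prev = None
    for ch in regex:
        if prev is not None:
            ends_operand = prev in ')*' or prev.isalnum()
            starts_operand = ch == '(' or ch.isalnum()
            if ends_operand and starts_operand:
                out.append('.')     # the implicit cat operator, made explicit
        out.append(ch)
        prev = ch
    return ''.join(out)

print(insert_concat('a|bc'))        # a|b.c
print(insert_concat('(a|b)*abb'))   # (a|b)*.a.b.b
```

After this pass the ordinary two-stack operator precedence loop can treat '.' like any other binary operator.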
I am reading the famous purple dragon book (Compilers: Principles, Techniques, and Tools, 2nd edition), and can't follow the example on page 65 about computing the FIRST set:
We have the following grammar (terminals are bolded):
stmt → expr;
| if ( expr ) stmt
| for ( optexpr ; optexpr ; optexpr ) stmt
| other
optexpr → ε
| expr
And the book suggests that the following is the correct calculation of FIRST:
FIRST(stmt) → {expr, if, for, other} // agree on this
FIRST(expr ;) → {expr} // Where does this come from?
As the comment suggests, where does the second line come from?
There is no error in the textbook.
The FIRST function is defined (on page 64, emphasis added):
Let α be a string of grammar symbols (terminals and/or nonterminals). We define FIRST(α) to be the set of terminals that appear as the first symbols of one or more strings of terminals generated from α.
In this example, expr ; is a string of grammar symbols consisting of two terminals, so it is a possible value of α. Since it includes no nonterminals, it can only generate itself; thus the only string of terminals generated from that value of α is precisely expr ;, and the only terminal that will appear in FIRST(α) is the first symbol of that string, expr.
This may all seem to be belabouring the obvious, but it leads up to the important note just under the example you cite:
The FIRST sets must be considered if there are two productions A → α and A → β. Ignoring ε-productions for the moment, predictive parsing requires FIRST(α) and FIRST(β) to be disjoint.
Since expr ; is one of the possible right-hand sides for stmt, we need to compute its FIRST set (even though the computation in this case is trivial) in order to test this prerequisite.
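The usual fixed-point computation of FIRST can be sketched for this grammar as follows, treating multi-character names like expr and if as single terminal tokens (as the book does) and using None to stand for ε. The helper names are mine:

```python
def first_of_string(alpha, grammar, first):
    """FIRST of a string of grammar symbols alpha."""
    out = set()
    for sym in alpha:
        if sym not in grammar:            # terminal: it begins every derivation
            out.add(sym)
            return out
        out |= first[sym] - {None}
        if None not in first[sym]:        # nonterminal not nullable: stop here
            return out
    out.add(None)                         # every symbol was nullable
    return out

def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                f = first_of_string(rhs, grammar, first)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
    return first

grammar = {
    'stmt': [['expr', ';'],
             ['if', '(', 'expr', ')', 'stmt'],
             ['for', '(', 'optexpr', ';', 'optexpr', ';', 'optexpr', ')', 'stmt'],
             ['other']],
    'optexpr': [[], ['expr']],
}
first = compute_first(grammar)
print(sorted(first['stmt']))                                   # ['expr', 'for', 'if', 'other']
print(sorted(first_of_string(['expr', ';'], grammar, first)))  # ['expr']
```

The second print is exactly the book's FIRST(expr ;) = {expr}: a FIRST set computed for a right-hand side, not for a nonterminal.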
This term I am taking a course on Compilers, and we are currently studying syntax: different grammars and types of parsers. I came across a problem which I can't quite figure out, or at least I can't be sure I'm doing it correctly. I have already made two attempts, and counterexamples were found for both.
I am given this ambiguous grammar for arithmetic expressions:
E → E+E | E-E | E*E | E/E | E^E | -E | (E) | id | num, where ^ stands for power.
I figured out what the priorities should be: highest is parentheses, followed by power, then unary minus, then multiplication and division, and finally addition and subtraction. I am asked to convert this into an equivalent LL(1) grammar. So I wrote this:
E → E+A | E-A | A
A → A*B | A/B | B
B → -C | C
C → D^C | D
D → (E) | id | num
The problem is that this grammar, although unambiguous, is not equivalent to the first one. For example, the given grammar can recognize the input --5, while mine can't. How can I make sure I'm covering all cases? How should I modify my grammar to make it equivalent to the given one? Thanks in advance.
Edit: Also, I would of course eliminate left recursion and left-factor to make this LL(1), but first I need to figure out the main part I asked about above.
Here's one that should work for your case
E = E+A | E-A | A
A = A*C | A/C | C
C = C^B | B
B = -B | D
D = (E) | id | num
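For instance, --5 is now derivable because B refers to itself:

E => A => C => B => -B => --B => --D => --num   (num matching 5)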
As a side note: also pay attention to the requirements of your task, since some applications assign higher priority to the unary minus operator than to the binary power operator.
A typical BNF defining arithmetic operations:
E :- E + T
| T
T :- T * F
| F
F :- ( E )
| number
Is there any way to re-write this grammar so it could be implemented with an LR(0) parser, while still retaining the precedence and left-associativity of the operators?
I'm thinking it should be possible by introducing some sort of disambiguation non-terminals, but I can't figure out how to do it.
Thanks!
A language can only have an LR(0) grammar if it's prefix-free, meaning that no string in the language is a prefix of another. In this case, the language you're describing isn't prefix-free. For example, the string number + number is a prefix of number + number + number.
A common workaround to address this would be to "endmark" your language by requiring all strings generated to end in a special "done" character. For example, you could require that all strings generated end in a semicolon. If you do that, you can build an LR(0) parser for the language with this grammar:
S → E;
E → E + T | T
T → T * F | F
F → number | (E)
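With that grammar, the parser still reduces the T * F handle before E + T, preserving precedence and left-associativity. For example, the reduction sequence (a rightmost derivation in reverse) for number + number * number ; is:

number + number * number ;
F + number * number ;
T + number * number ;
E + number * number ;
E + F * number ;
E + T * number ;
E + T * F ;
E + T ;
E ;
S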