How can I simplify a recursive-descent parser? - parsing

I have the following simple LL(1) grammar, which describes a language with only three valid sentences: "", "x y" and "z x y":
S -> A x y | ε .
A -> z | ε .
I have constructed the following parsing table, and from it a "naive" recursive-descent parser:
  | x          | y | z          | $
S | S -> A x y |   | S -> A x y | S -> ε
A | A -> ε     |   | A -> z     |
func S():
    if next() in ['x', 'z']:
        A()
        expect('x')
        expect('y')
        expect('$')
    elif next() == '$':
        pass
    else:
        error()
func A():
    if next() == 'x':
        pass
    elif next() == 'z':
        expect('z')
    else:
        error()
However, the function A seems to be more complicated than necessary. All of my tests still pass if it's simplified to:
func A():
    if next() == 'z':
        expect('z')
Is this a valid simplification of A? If so, are there any general rules regarding when it's valid to make simplifications like this one?

That simplification is certainly valid (and quite common).
The main difference is that there is no code associated with the production A→ε. If there are some semantics to implement, you will need to test for the condition. If you only need to ignore the nullable production, you can certainly just return.
Coalescing errors and epsilon productions has one other difference: the error (for example, in the input y) is detected later, after A() returns. Sometimes that makes it harder to produce good error messages (and sometimes it doesn't).
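As a rough illustration, here is a minimal Python sketch of the parser from the question using the simplified A(). The Parser class, peek, and accepts are names invented for this example; peek plays the role of next() in the pseudocode. Note that in this particular S(), A() is only called when the lookahead is x or z, so the removed error branch could never have fired.

```python
class ParseError(Exception):
    pass


class Parser:
    """Sketch of a recursive-descent parser for S -> A x y | ε, A -> z | ε."""

    def __init__(self, tokens):
        # '$' is appended as the end-of-input marker, as in the parsing table.
        self.tokens = list(tokens) + ["$"]
        self.pos = 0

    def peek(self):
        # Corresponds to next() in the question: look without consuming.
        return self.tokens[self.pos]

    def expect(self, tok):
        if self.peek() != tok:
            raise ParseError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def S(self):
        if self.peek() in ("x", "z"):
            self.A()
            self.expect("x")  # a bad token after A() is reported here
            self.expect("y")
            self.expect("$")
        elif self.peek() == "$":
            pass
        else:
            raise ParseError(f"unexpected {self.peek()!r}")

    def A(self):
        # Simplified A: the ε alternative is just "do nothing".
        if self.peek() == "z":
            self.expect("z")


def accepts(tokens):
    """Return True if the token list is a sentence of the grammar."""
    try:
        Parser(tokens).S()
        return True
    except ParseError:
        return False
```

With this sketch, accepts([]), accepts(["x", "y"]) and accepts(["z", "x", "y"]) succeed, while an input such as ["z", "y"] is rejected by expect("x") in S() after A() has already returned.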

Related

What is the main point of Active Patterns?

When we pattern match on a value using Active Patterns, there is a "convert" function implicitly called. So instead of writing:
match value with
| Tag1 -> ...
| Tag2 -> ...
I can explicitly write:
match convert value with
| Tag1 -> ...
| Tag2 -> ...
This way, I don't have to use Active Patterns here. Of course, I have to explicitly call the convert function, and have to explicitly declare a union type. But those are minor things to me.
So what is the main point of Active Patterns?
The primary power of pattern matching is not the funny syntax. The primary power of patterns is that they can be nested.
Take a look at this:
match value with
| Foo (Bar, Baz [First 42; Second "hello!"]) -> "It's a thing"
| Qux [42; 42; 42] -> "Triple forty-two"
| _ -> "No idea"
Assuming all capitalized words are active patterns, let's try to rewrite the first pattern in terms of calling convert explicitly:
match convertFoo value with
| Foo (x, y) ->
    match convertBar x, convertBaz y with
    | (Bar, Baz [z1; z2]) ->
        match convertFirst z1, convertSecond z2 with
        | First 42, Second "hello!" -> "It's a thing"
Too long and convoluted? But wait, we didn't even get to write all the non-matching branches!
match convertFoo value with
| Foo (x, y) ->
    match convertBar x, convertBaz y with
    | (Bar, Baz [z1; z2]) ->
        match convertFirst z1, convertSecond z2 with
        | First 42, Second "hello!" -> "It's a thing"
        | _ -> "No idea"
    | _ -> "No idea"
| Qux [42; 42; 42] -> "Triple forty-two"
| _ -> "No idea"
See how the "No idea" branch is triplicated? Isn't copy&paste wonderful? :-)
Incidentally, this is why C#'s feeble attempt at what they have the audacity to call "pattern matching" isn't really pattern matching: it can't be nested, and therefore, as you very astutely observe, it is no better than just calling classifier functions.

How to deal with the implicit 'cat' operator while building a syntax tree for RE(use stack evaluation)

I am trying to build a syntax tree for a regular expression. I use a strategy similar to arithmetic expression evaluation (I know there are other ways, such as recursive descent): two stacks, the OPND (operand) stack and the OPTR (operator) stack, drive the processing.
I use a different kind of node for each kind of RE: SymbolExpression, CatExpression, OrExpression and StarExpression, all derived from RegularExpression.
So the OPND stack stores RegularExpression* values.
while (c || optr.top()):
    if (!isOp(c)):
        opnd.push(c)
        c = getchar()
    else:
        switch (precede(optr.top(), c)) {
        case Less:
            optr.push(c)
            c = getchar()
            break
        case Equal:
            optr.pop()
            c = getchar()
            break
        case Greater:
            pop from opnd and optr, then do the operation,
            then push the result back to opnd
            break
        }
But my primary question is this: in a typical RE, the cat operator is implicit.
a|bc represents a|b.c, and (a|b)*abb represents (a|b)*.a.b.b. So when I meet a non-operator, how do I determine whether there is a cat operator before it? And how should I handle the cat operator so that the conversion is implemented correctly?
Update
I have now learned that there is a kind of grammar called an "operator precedence grammar", which is evaluated much like arithmetic expressions. It requires that no production have the form S -> ...AB... (where A and B are non-terminals). So I guess I cannot use this method directly to parse regular expressions.
Update II
I tried to design an LL(1) grammar to parse basic regular expressions.
Here is the original grammar. (\| denotes the literal | character, since | is a special character in the grammar notation.)
E -> E \| T | T
T -> TF | F
F -> P* | P
P -> (E) | i
To remove the left recursion, introduce new variables:
E -> TE'
E' -> \| TE' | ε
T -> FT'
T' -> FT' | ε
F -> P* | P
P -> (E) | i
Now, to left-factor the production F -> P* | P, introduce P':
P' -> * | ε
F -> PP'
However, the production T' -> FT' | ε has a problem. Consider the input (a|b):
E => TE'
=> FT' E'
=> PT' E'
=> (E)T' E'
=> (TE')T'E'
=> (FT'E')T'E'
=> (PT'E')T'E'
=> (iT'E')T'E'
=> (iFT'E')T'E'
Here, we humans know that T' should be replaced using T' -> ε, but the program will just apply T' -> FT', which is wrong.
So, what's wrong with this grammar? And how should I rewrite it to make it suitable for the recursive-descent method?
1. LL(1) grammar
I don't see any problem with your LL(1) grammar. You are parsing the string
(a|b)
and you have gotten to this point:
(a T'E')T'E' |b)
The lookahead symbol is | and you have two possible productions:
T' ⇒ FT'
T' ⇒ ε
FIRST(F) is {(, i}, so the first production is clearly incorrect, both for the human and the LL(1) parser. (A parser without lookahead couldn't make the decision, but parsers without lookahead are almost useless for practical parsing.)
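To make the lookahead decision concrete, here is a minimal Python sketch of a recursive-descent parser for the transformed grammar. It is only an illustration: the function names, the tuple-based tree nodes ("or", "cat", "star", "sym") and the use of None as the end marker are all assumptions of this example, and the tail-recursive E' and T' are written as loops. The T' loop continues only while the lookahead is in FIRST(F), which is exactly the test that selects T' -> ε when the next character is |.

```python
def parse_regex(text):
    """Parse a basic regular expression into a nested-tuple syntax tree."""
    tokens = list(text) + [None]  # None marks end of input
    pos = 0

    def peek():
        return tokens[pos]

    def advance():
        nonlocal pos
        pos += 1

    def is_symbol(c):
        # 'i' in the grammar: any character that is not an operator.
        return c is not None and c not in "()|*"

    def parse_E():
        # E -> T E' ; E' -> '|' T E' | ε
        node = parse_T()
        while peek() == "|":
            advance()
            node = ("or", node, parse_T())
        return node

    def parse_T():
        # T -> F T' ; T' -> F T' | ε
        node = parse_F()
        # Apply T' -> F T' only if the lookahead is in FIRST(F) = { '(', i }.
        while peek() == "(" or is_symbol(peek()):
            node = ("cat", node, parse_F())
        return node

    def parse_F():
        # F -> P P' ; P' -> '*' | ε
        node = parse_P()
        if peek() == "*":
            advance()
            node = ("star", node)
        return node

    def parse_P():
        # P -> '(' E ')' | i
        if peek() == "(":
            advance()
            node = parse_E()
            if peek() != ")":
                raise SyntaxError(f"expected ')' at position {pos}")
            advance()
            return node
        if is_symbol(peek()):
            c = peek()
            advance()
            return ("sym", c)
        raise SyntaxError(f"unexpected {peek()!r} at position {pos}")

    tree = parse_E()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r} at position {pos}")
    return tree
```

For example, parse_regex("(a|b)") yields ("or", ("sym", "a"), ("sym", "b")): after reading a, the lookahead | is not in FIRST(F), so the concatenation loop stops immediately.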
2. Operator precedence parsing
You are technically correct. Your original grammar is not an operator grammar. However, it is normal to augment operator precedence parsers with a small state machine (otherwise algebraic expressions including unary minus, for example, cannot be correctly parsed), and once you have done that it is clear where the implicit concatenation operator must go.
The state machine is logically equivalent to preprocessing the input to insert an explicit concatenation operator where necessary -- that is, between a and b whenever a is in {), *, i} and b is in {(, i}.
You should take note that your original grammar does not really handle regular expressions unless you augment it with an explicit ε primitive to represent the empty string. Otherwise, you have no way to express optional choices, usually represented in regular expressions as an implicit operand (such as (a|), also often written as a?). However, the state machine is easily capable of detecting implicit operands as well because there is no conflict in practice between implicit concatenation and implicit epsilon.
I think just keeping track of the previous character should be enough. So if we have
(a|b)*abb
      ^--- we are here
c = a
pc = *
We know * is unary, so 'a' cannot be its operand. So we must have concatenation. Similarly at the next step
(a|b)*abb
       ^--- we are here
c = b
pc = a
a isn't an operator, b isn't an operator, so our hidden operator is between them. One more:
(a|b)*abb
   ^--- we are here
c = b
pc = |
| is a binary operator expecting a right-hand operand, so we do not concatenate.
The full solution probably involves building a table for each possible pc, which sounds painful, but it should give you enough context to get through.
If you don't want to mess up your loop, you could do a preprocessing pass where you insert your own concatenation character using similar logic. Can't tell you if that's better or worse, but it's an idea.
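The preprocessing pass could look roughly like the Python sketch below. It assumes that the only operators are (, ), | and *, that every other character is an ordinary symbol, and that "." is free to serve as the explicit concatenation character; the name insert_concat is made up for this example. It tracks only the previous character, inserting the operator whenever the previous character can end an operand and the current one can start an operand (an opening parenthesis or a symbol).

```python
OPERATORS = set("()|*")


def insert_concat(regex: str, concat: str = ".") -> str:
    """Return regex with an explicit concatenation operator inserted."""
    out = []
    prev = None
    for c in regex:
        if prev is not None:
            # The previous character ends an operand: ')', '*', or a symbol.
            prev_ends_operand = prev in (")", "*") or prev not in OPERATORS
            # The current character starts an operand: '(' or a symbol.
            cur_starts_operand = c == "(" or c not in OPERATORS
            if prev_ends_operand and cur_starts_operand:
                out.append(concat)
        out.append(c)
        prev = c
    return "".join(out)


print(insert_concat("(a|b)*abb"))  # (a|b)*.a.b.b
print(insert_concat("a|bc"))       # a|b.c
```

The output can then be fed to the two-stack loop unchanged, with the concatenation character given a precedence between | and *.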

can removing left recursion introduce ambiguity?

Let's assume we have the following CFG G:
A -> A b A
A -> a
Which should produce the strings
a, aba, ababa, abababa, and so on. Now I want to remove the left recursion to make it suitable for predictive parsing. The dragon book gives the following rule to remove immediate left recursions.
Given
A -> Aa | b
rewrite as
A -> b A'
A' -> a A'
| ε
If we simply apply the rule to the grammar from above, we get grammar G':
A -> a A'
A' -> b A A'
| ε
Which looks good to me, but apparently this grammar is not LL(1), because of some ambiguity. I get the following First/Follow sets:
First(A) = { "a" }
First(A') = { ε, "b" }
Follow(A) = { $, "b" }
Follow(A') = { $, "b" }
From which I construct the parsing table
   | a          | b             | $        |
----------------------------------------------------
A  | A -> a A'  |               |          |
A' |            | A' -> b A A'  | A' -> ε  |
   |            | A' -> ε       |          |
and there is a conflict in T[A',b], so the grammar isn't LL(1), although there are no left recursions any more and there are also no common prefixes of the productions so that it would require left factoring.
I'm not completely sure where the ambiguity comes from. I guess that during parsing the stack would fill up with A' symbols, and that an A' can basically be removed (reduced to ε) when it is no longer needed. I think this is the case if another A' is below it on the stack.
I think the LL(1) grammar G'' that I try to get from the original one would be:
A -> a A'
A' -> b A
| ε
Am I missing something? Did I do anything wrong?
Is there a more general procedure for removing left recursion that considers this edge case? If I want to automatically remove left recursions I should be able to handle this, right?
Is the second grammar G' a LL(k) grammar for some k > 1?
The original grammar was ambiguous, so it is not surprising that the new grammar is also ambiguous.
Consider the string a b a b a. We can derive this in two ways from the original grammar:
A ⇒ A b A
⇒ A b a
⇒ A b A b a
⇒ A b a b a
⇒ a b a b a
A ⇒ A b A
⇒ A b A b A
⇒ A b A b a
⇒ A b a b a
⇒ a b a b a
Unambiguous grammars are, of course possible. Here are right- and left-recursive versions:
A ⇒ a            A ⇒ a
A ⇒ a b A        A ⇒ A b a
(Although these represent the same language, they have different parses: the right-recursive version associates to the right, while the left-recursive one associates to the left.)
Removing left recursion cannot introduce ambiguity. This kind of transformation preserves ambiguity: if the CFG is already ambiguous, the result will be ambiguous too, and if the original is not, neither is the result.
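One way to see the ambiguity of the original grammar concretely is to count parse trees. The following Python sketch (the function name count_parses is made up for this example) counts the distinct parse trees of a string under A -> A b A | a by trying every b as the top-level split point.

```python
from functools import lru_cache


def count_parses(s: str) -> int:
    """Count parse trees of s under the grammar A -> A b A | a."""

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        # Number of parse trees deriving s[i:j] from A.
        total = 1 if s[i:j] == "a" else 0  # production A -> a
        # Production A -> A b A: choose which 'b' is the top-level one.
        for k in range(i + 1, j - 1):
            if s[k] == "b":
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(s))


print(count_parses("aba"))    # 1
print(count_parses("ababa"))  # 2
```

A string with more than one parse tree, such as ababa, is enough to show that the grammar is ambiguous before any transformation is applied.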

Confused at transferring an ambiguous grammar to an unambiguous one

An ambiguous grammar is given and I am asked to rewrite the grammar to make it unambiguous. In fact, I don't know why the given grammar is ambiguous, let alone rewriting it to an unambiguous one.
The given grammar is S -> SS | a | b , and I have four choices:
A: S -> Sa | Sb | epsilon
B: S -> SS'
   S' -> a | b
C: S -> S | S'
   S' -> a | b
D: S -> Sa | Sb
For each choice: I already know that D is incorrect because it generates no strings at all, and C is incorrect because it only matches the strings 'a' and 'b'.
However, I think the answer is A, while the correct answer is given as B. I think B is wrong because it just generates S over and over again, and B can't deal with empty strings.
Why is the given grammar ambiguous?
Why is A incorrect while B is correct?
The original grammar is ambiguous because multiple right-most (or left-most) derivations are possible for any string of at least three letters. For example:
S -> SS -> SSS -> SSa -> Saa -> aaa
S -> SS -> Sa -> SSa -> Saa -> aaa
The first one corresponds, roughly speaking, to the parse a(aa) while the second to the parse (aa)a.
None of the alternatives is correct. A incorrectly matches ε while B does not match anything (like D). If B were, for example,
S -> SS' | S'
S' -> a | b
it would be correct. (This grammar is left-associative.)
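As a hedged check of these claims, the Python sketch below counts parse trees under the original grammar S -> SS | a | b and under the corrected version of B given above. The function names count_original and count_fixed are invented for this example.

```python
from functools import lru_cache


def count_original(s: str) -> int:
    """Count parse trees of s under S -> SS | a | b."""

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        # S -> a | b covers exactly one letter.
        total = 1 if j - i == 1 and s[i] in "ab" else 0
        # S -> SS: split s[i:j] into two non-empty parts.
        for k in range(i + 1, j):
            total += count(i, k) * count(k, j)
        return total

    return count(0, len(s))


def count_fixed(s: str) -> int:
    """Count parse trees of s under S -> SS' | S', S' -> a | b."""

    @lru_cache(maxsize=None)
    def count_s(j: int) -> int:
        # Number of parse trees deriving the prefix s[:j] from S.
        if j == 0:
            return 0  # S cannot derive the empty string
        last = 1 if s[j - 1] in "ab" else 0  # S' -> a | b
        single = 1 if j == 1 else 0          # S -> S'
        return last * (single + count_s(j - 1))  # plus S -> S S'

    return count_s(len(s))


print(count_original("aaa"))  # 2
print(count_fixed("aaa"))     # 1
```

Both grammars reject the empty string, but only the original assigns more than one tree to a three-letter string.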

Grammar for Arithmetic Expressions

I was assigned the task of creating a parser for arithmetic expressions (with parentheses and unary operators). I want to know whether this grammar is correct and whether it is in LL(1) form; I am also having real trouble constructing the parse table for it.
S -> TS'
S' -> +TS' | -TS' | epsilon
T -> UT'
T' -> *UT' | /UT' | epsilon
U -> VX
X -> ^U | epsilon
V -> (W) | -W | W | epsilon
W -> S | number
Precedence (high to low)
(), unary -
^
*, /
+, -
Associativity for binary operators
^ = right
+, -, *, / = left
Is it in LL(1) form?
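A quick first check for whether a grammar can be LL(1) is to expand the production rules. If any sequence of productions results in the left-hand side appearing as the first symbol on the right-hand side, the grammar is left-recursive and therefore not LL(1). (Passing this check is necessary but not sufficient: the alternatives of each nonterminal must also be distinguishable with one token of lookahead, with no FIRST/FIRST or FIRST/FOLLOW conflicts.)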
For example, consider this rule:
X --> X | x | epsilon
This clearly can't be part of an LL(1) grammar, since it's left-recursive if you apply the leftmost production. But what about this?
X --> Y | x
Y --> X + X
This isn't an LL(1) grammar either, but it's more subtle: first you have to apply X --> Y, then apply Y --> X + X to see that you now have X --> X + X, which is left-recursive.
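The expansion described above can be automated. The following Python sketch is one possible way to do it, not a standard library routine: the grammar is assumed to be a dict mapping each nonterminal to a list of alternatives, each alternative a list of symbols, with epsilon written as an empty list. The function name left_recursive_nonterminals is made up for this example. It also looks past nullable leading symbols, since those can expose a later symbol as the leftmost one.

```python
def left_recursive_nonterminals(grammar):
    """Return the set of nonterminals that can derive themselves leftmost."""
    # Step 1: find nullable nonterminals (those that can derive epsilon).
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, alternatives in grammar.items():
            if head in nullable:
                continue
            if any(all(sym in nullable for sym in alt) for alt in alternatives):
                nullable.add(head)
                changed = True

    # Step 2: record which nonterminals can appear leftmost in one step.
    edges = {head: set() for head in grammar}
    for head, alternatives in grammar.items():
        for alt in alternatives:
            for sym in alt:
                if sym in grammar:
                    edges[head].add(sym)
                if sym not in nullable:
                    break  # a non-nullable symbol hides everything after it

    # Step 3: a nonterminal is left-recursive if it can reach itself.
    def reachable(start):
        seen = set()
        stack = list(edges[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(edges[node])
        return seen

    return {head for head in grammar if head in reachable(head)}


# The indirect example from the text: X --> Y | x, Y --> X + X
example = {
    "X": [["Y"], ["x"]],
    "Y": [["X", "+", "X"]],
}
print(sorted(left_recursive_nonterminals(example)))  # ['X', 'Y']
```

Applied to the grammar in the question exactly as written, this check reports S, T, U, V and W, because V -> W followed by W -> S lets S reappear in leftmost position.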
You seem to be missing a production for the unary plus operator. Try this instead...
V -> (W) | -W | +W | epsilon

Resources