Grammar with Epsilon or Lambda - parsing

So I have a grammar:
S -> X Y
X -> a X
X ->
Y -> b
Z -> a Z
Z -> a
My only confusion with this grammar is the 2nd production for X.
There is nothing there. Is that equivalent to using Epsilon ε, or Lambda λ?
I am assuming it is merely a difference in notation between grammars, but I wanted to be sure, as I am trying to build the FIRST and FOLLOW sets.

Both ε and λ (and sometimes Λ) are used by different writers to represent the empty string. In modern writing, ε is much more common but you'll often find λ in older textbooks, and Λ in even older ones.
The point of using these symbols is to make the empty sequence visible. However it is written, it is the empty sequence and should be read as though it were nothing, as in your production X ⇒ .
If you find it difficult getting your head around the idea that a symbol means nothing, then you might enjoy reading Charles Seife's Zero: The Biography of a Dangerous Idea or Robert Kaplan's The Nothing that Is: A Natural History of Zero, both published in the emblematic year 2K and both of which explore the long and difficult struggle to understand the concept of nothing. ("No one goes out to buy zero fish" -- Alfred North Whitehead).
It has been suggested that Λ/λ comes from the German word "leer", meaning empty, while ε comes from English "empty". There was a time when German was more common in academic discussion of mathematical logic, so the theory seems reasonable.
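
If it helps with building the sets: here is a minimal Python sketch (my own illustration, not part of the original answer) of computing nullability and FIRST sets for the grammar above, encoding the empty right-hand side of X as an empty list:

    # The grammar from the question; the empty list is the epsilon production.
    grammar = {
        "S": [["X", "Y"]],
        "X": [["a", "X"], []],
        "Y": [["b"]],
        "Z": [["a", "Z"], ["a"]],
    }
    nonterminals = set(grammar)

    # A symbol is nullable if it can derive the empty string.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                if lhs not in nullable and all(s in nullable for s in rhs):
                    nullable.add(lhs)
                    changed = True

    # FIRST(A): the terminals that can begin a string derived from A.
    # Iterate to a fixed point; skip past a symbol only if it is nullable.
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                for sym in rhs:
                    new = first[sym] if sym in nonterminals else {sym}
                    if not new <= first[lhs]:
                        first[lhs] |= new
                        changed = True
                    if sym not in nullable:
                        break

    print(nullable)  # {'X'}
    print(first)     # {'S': {'a', 'b'}, 'X': {'a'}, 'Y': {'b'}, 'Z': {'a'}}

Because X is nullable, FIRST(S) includes FIRST(Y) as well as FIRST(X), which is exactly where the ε-production makes itself felt.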

Related

What production rule should I use to reduce in bottom-up parsing?

So far, my understanding of the algorithm of bottom-up parsing is this.
shift a token onto the stack
check whether some elements at the top of the stack can be reduced by some production rule
if they can be reduced, pop them and push the left-hand side of the production rule
continue those steps until the top of the stack is the start symbol and the next input is EOF
So to support my question with an example grammar,
S → aABe
A → Abc
A → b
B → d
if we have the input string
abbcde$
we will first shift a onto the stack,
and because there is no production rule that reduces a, we shift the next token b.
Then we can find a production rule A → b and reduce b to A.
Then my question is this. We have aA on the stack and the next input is b. How can the parser determine whether to reduce b to A or to wait for c to come and use the rule A → Abc?
Well, of course, reducing b to A at that point results in an error. But how does the parser know at that point that it should wait for c?
I'm sorry if I missed something while studying.
That's an excellent question, and it will be addressed in the next part of your course.
For now, it's sufficient to pretend that there is some magic black box which tells the parser when it should reduce (and, sometimes, which of several possible productions to use).
The various parsing algorithms explain the construction of this black box. Note that one possible solution is to fork reality and try both actions in parallel, but a more common solution is to process the grammar in order to work out how to predict the correct action.
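To make the black box a little less magical, here is a toy Python sketch (my own illustration) of one classic answer, the SLR(1) rule: reduce by A → α only when α sits on top of the stack and the lookahead token is in FOLLOW(A). The FOLLOW sets below were computed by hand for the example grammar.

    RULES = [("S", ["a", "A", "B", "e"]),
             ("A", ["A", "b", "c"]),
             ("A", ["b"]),
             ("B", ["d"])]
    FOLLOW = {"S": {"$"}, "A": {"b", "d"}, "B": {"e"}}

    def parse(tokens):
        stack, pos = [], 0
        while True:
            la = tokens[pos] if pos < len(tokens) else "$"
            # Try the longest reduction that the lookahead permits.
            for lhs, rhs in sorted(RULES, key=lambda r: -len(r[1])):
                if stack[-len(rhs):] == rhs and la in FOLLOW[lhs]:
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    break
            else:
                if la == "$":
                    return stack == ["S"]
                stack.append(la)
                pos += 1

    print(parse(list("abbcde")))  # True

On the spot you asked about: after aA the parser shifts b (no right-hand side matches the stack top), and then, with aAb on the stack and lookahead c, it does not reduce b to A because c is not in FOLLOW(A); no sentence in this grammar ever has c immediately after an A. A real LR parser precomputes a state machine rather than rescanning stack suffixes, but the lookahead reasoning is the same.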

How to parse this simple grammar? Is it ambiguous?

I'm delving deeper into parsing and came across an issue I don't quite understand. I made up the following grammar:
S = R | aSc
R = b | RbR
where S is the start symbol. It is possible to show that abbbc is a valid sentence based on this grammar; hopefully that is correct, but I may have completely misunderstood something. If I try to implement this using recursive descent, I seem to have a problem when trying to parse abbbc using a leftmost derivation, e.g.
S => aSc
aSc => aRc
at this point I would have thought that recursive descent would pick the first option in the second production because the next token is b leading to:
aRc => abc
and we're finished since there are no more non-terminals, which of course isn't abbbc. The only way to show that abbbc is valid is to pick the second option, but with one token of lookahead I assume it would always pick b. I don't think the grammar is ambiguous unless I missed something. So what am I doing wrong?
Update: I came across this nice derivation app at https://web.stanford.edu/class/archive/cs/cs103/cs103.1156/tools/cfg/. I used it to do a sanity check that abbbc is a valid sentence, and it is.
Thinking more about this problem, is it true to say that I can't use LL(1) to parse this grammar but in fact need LL(2)? With two lookaheads I could correctly pick the second option in the second production because I now also know there are more tokens to be read and therefore picking b would prematurely terminate the derivation.
For starters, I’m glad you’re finding our CFG tool useful! A few of my TAs made that a while back and we’ve gotten a lot of mileage out of it.
Your grammar is indeed ambiguous. This stems from your R nonterminal:
R → b | RbR
Generally speaking, if you have a recursive production rule with two copies of the same nonterminal in it, it will lead to ambiguities, because there will be multiple options for how to apply the rule twice. For example, in this case, you can derive bbbbb by first expanding R to RbR, then either
expanding the left R to RbR and converting each R to a b, or
expanding the right R to RbR and converting each R to a b.
Because this grammar is ambiguous, it isn’t going to be LL(k) for any choice of k because all LL(k) grammars must be unambiguous. That means that stepping up the power of your parser won’t help here. You’ll need to rewrite the grammar to not be ambiguous.
The nonterminal R that you’ve described here generates strings with an odd number of b’s in them, so we could try redesigning R to achieve this more directly. An initial try might be something like this:
R → b | bbR
This, unfortunately, isn’t LL(1), since after seeing a single b it’s unclear whether you’re supposed to apply the first production rule or the second. However, it is LL(2).
If you’d like an LL(1) grammar, you could do something like this:
R → bX
X → bbX | ε
This works by laying down a single b, then laying down as many optional pairs of b’s as you’d like.
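For concreteness, a minimal recursive-descent sketch (my own illustration, in Python) of that LL(1) grammar, S → R | aSc, R → bX, X → bbX | ε:

    def parse(s):
        pos = 0
        def peek():
            return s[pos] if pos < len(s) else None
        def eat(t):
            nonlocal pos
            if peek() != t:
                raise SyntaxError(f"expected {t!r} at position {pos}")
            pos += 1
        def S():
            if peek() == "a":       # FIRST(aSc) = {a}
                eat("a"); S(); eat("c")
            else:                   # FIRST(R) = {b}
                R()
        def R():
            eat("b"); X()
        def X():
            if peek() == "b":       # FIRST(bbX) = {b}
                eat("b"); eat("b"); X()
            # otherwise epsilon: FOLLOW(X) = {c, end of input}
        S()
        if pos != len(s):
            raise SyntaxError("trailing input")
        return True

    print(parse("abbbc"))  # True

Each decision needs only the single next character, which is what makes the grammar LL(1).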

Automata - Construct NFA using copies of DFA - formal definition of NFA

I'm trying to learn how to solve the following exercise, but I don't understand how to start; it's overwhelming. I do understand DFAs, NFAs, and how to convert DFAs to NFAs. I also understand the formal notation.
This is not a homework exercise; it's just for studying. I do have the solution, but I can't make sense of it either.
If someone could ELI5 the exercise, that would be amazing; examples of solving exercises like these (with proper explanations) would be great too. I have yet to find a similar exercise on the web.
Given:
Alphabet Σ
A symbol c ∈ Σ
Regular language L over Σ
Language L-c = {uv | ucv ∈ L}
Let D = (Q, Σ, 𝛿, q0, F) be a DFA with ℒ(D)=L.
Show how to construct an NFA N using copies of D such that ℒ(N)=L-c. Provide a formal definition of N.
I am not going to write down the formal definition in detail. Basically, you can use two copies of the DFA. You start in the first copy, which has no final states. For every transition reading c from a state p to a state q, you add a transition that reads nothing (an ε-transition) from p in the first copy to q in the second copy. This way you skip exactly one c. Once you are in the second copy, you know that a c has been skipped, and you can accept if the remaining input leads to an accepting state.
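Spelled out from that description (this write-up is mine, not the answerer's), the formal definition could look like the following. Use two tagged copies of Q, where the tag records whether the c has been skipped yet:

N = (Q × {1, 2}, Σ, 𝛿′, (q0, 1), F × {2})

with, for every q ∈ Q and every a ∈ Σ:

𝛿′((q, 1), a) = {(𝛿(q, a), 1)}
𝛿′((q, 1), ε) = {(𝛿(q, c), 2)}
𝛿′((q, 2), a) = {(𝛿(q, a), 2)}

The ε-transition from (q, 1) to (𝛿(q, c), 2) is exactly the "skip one c" move, and accepting only in F × {2} guarantees that exactly one c was skipped.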

Resolving reduce/reduce conflicts

We have a CFG and we construct its LR(1) parsing table. We see that one cell in the parsing table has a reduce/reduce conflict. Is it possible to resolve this conflict by using more symbols of lookahead at each step? I am asking this because I think that by increasing the lookahead we can (not always) resolve only shift/reduce conflicts. I mean, the extra lookahead in a reduce/reduce conflict doesn't help us. Am I right?
It might be possible to solve a reduce/reduce conflict with more lookahead. It might also be possible to solve it by refactoring.
It really depends on the nature of the conflict. There is no general procedure.
An example of a reduce/reduce conflict which can be solved by extra lookahead:
A → something
B → A
C → A
D → B u v
D → C u w
Here, the last two productions of D are unambiguous, but the decision about reducing A to B or to C cannot be made when the u is seen. One more symbol of lookahead would do it, though, because the second next symbol determines the reduction.
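A sketch of that decision in Python (hypothetical code, just to make the point concrete): with two tokens of lookahead, the token after u picks the reduction.

    def choose_reduction(next_two):
        # next_two is a tuple of the next two input tokens.
        if next_two == ("u", "v"):
            return ("B", ["A"])   # heading for D -> B u v
        if next_two == ("u", "w"):
            return ("C", ["A"])   # heading for D -> C u w
        raise SyntaxError("no viable reduction")

With only one token (u), both reductions remain possible, and the table has a reduce/reduce conflict.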
The refactoring solution:
Au → A u
Bu → Au
Cu → Au
D → Bu v
D → Cu w
By deferring the B/C choice by one token, we've succeeded in removing the reduce/reduce conflict. Note that this solution will work even if u is not a single token; it could, for example, be a non-terminal. So this model may work in cases where simply increasing lookahead is not sufficient.
In general, any conflict can be resolved by additional lookahead. In the extreme case you need to read to the end of the file. There is no significant difference between shift/reduce and reduce/reduce conflicts; their resolution is kind of similar.
I wrote an article about conflict resolution. It proposes a method that allows finding out the reason for the conflict. In certain cases this helps with refactoring the grammar or with defining a resolution strategy.
Please, take a look: http://cdsan.com/LinkPool/Data/2915/Conflicts-In-The-LR-Grammars.pdf
If you have questions, please let me know.

Bottom-Up-Parser: When to apply which reduction rule?

Let's take the following context-free grammar:
G = ( {Sum, Product, Number}, {decimal representations of numbers, +, *}, P, Sum)
Being P:
Sum → Sum + Product
Sum → Product
Product → Product * Number
Product → Number
Number → decimal representation of a number
I am trying to parse expressions produced by this grammar with a bottom-up parser and a look-ahead buffer (LAB) of length 1 (which supposedly should do without guessing and backtracking).
Now, given a stack and a LAB, there are often several possibilities of how to reduce the stack or whether to reduce it at all or push another token.
Currently I use this decision tree:
If the top n tokens of the stack plus the LAB are the beginning of the right side of a rule, I push the next token onto the stack.
Otherwise, I reduce the maximum number of tokens on top of the stack. I.e., if it is possible to reduce the topmost item and at the same time possible to reduce the three topmost items, I do the latter.
If no such reduction is possible, I push another token onto the stack.
Rinse and repeat.
This seems (!) to work, but it requires an awful amount of rule searching, finding matching prefixes, etc. There is no way this can run in O(NM).
What is the standard (and possibly only sensible) approach to decide whether to reduce or push (shift), and in case of reduction, which reduction to apply?
Thank you in advance for your comments and answers.
The easiest bottom-up parsing approach for grammars like yours (basically, expression grammars) is operator-precedence parsing.
Recall that bottom-up parsing involves building the parse tree left-to-right from the bottom. In other words, at any given time during the parse, we have a partially assembled tree with only terminal symbols to the right of where we're reading, and a combination of terminals and non-terminals to the left (the "prefix"). The only possible reduction is one which applies to a suffix of the prefix; if no reduction applies, we must be able to shift a terminal from the input to the prefix.
An operator grammar has the feature that there are never two consecutive non-terminals in any production. Consequently, in a bottom-up parse of an operator grammar, either the last symbol in the prefix is a terminal or the second-last symbol is one. (Both of them could be.)
An operator precedence parser is essentially blind to non-terminals; it simply doesn't distinguish between them. So you cannot have two productions whose right-hand sides contain exactly the same sequence of terminals, because the op-prec parser wouldn't know which of these two productions to apply. (That's the traditional view. It's actually possible to extend that a bit so that you can have two productions with the same terminals, provided that the non-terminals are in different places. That allows grammars which have unary - operators, for example, since the right-hand sides <non-terminal> - <non-terminal> and - <non-terminal> can be distinguished without knowing the names of the non-terminals; only their presence.)
The other requirement is that you have to be able to build a precedence relationship between the terminals. More precisely, we define three precedence relations, usually written <·, ·> and ·=· (or some typographic variation on the theme), and insist that for any two terminals x and y, at most one of the relations x ·> y, x ·=· y and x <· y is true.
Roughly speaking, the < and > in the relations correspond to the edges of a production. In other words, if x <· y, that means that x can be followed by a non-terminal with a production whose first terminal is y. Similarly, x ·> y means that y can follow a non-terminal with a production whose last terminal is x. And x ·=· y means that there is some right-hand side where x and y are consecutive terminals, in that order (possibly with an intervening non-terminal).
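Worked out for the Sum/Product grammar in the question (writing n for a number token and $ for the end marker), the relations form the familiar precedence table; rows are the left symbol x, columns the right symbol y:

         +     *     n     $
    +    ·>    <·    <·    ·>
    *    ·>    ·>    <·    ·>
    n    ·>    ·>          ·>
    $    <·    <·    <·

For example, + <· * says that a * to the right of a + belongs to a deeper production (multiplication binds tighter), while * ·> + says that when + appears after a *, the multiplication reduces first. Blank cells are pairs of terminals for which no relation holds.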
If the single-relation restriction is true, then we can parse as follows:
Let x be the last terminal in the prefix (that is, either the last or second-last symbol), and let y be the lookahead symbol, which must be a terminal. If x ·> y then we reduce, and repeat the rule. Otherwise, we shift y onto the prefix.
In order to reduce, we need to find the start of the production. We move backwards over the prefix, comparing consecutive terminals (all of which must have <· or ·=· relations) until we find one with a <· relation. Then the terminals between the <· and the ·> are the right-hand side of the production we're looking for, and we can slot the non-terminals into the right-hand side as indicated.
There is no guarantee that there will be an appropriate production; if there isn't, the parse fails. But if the input is a valid sentence, and if the grammar is an operator-precedence grammar, then we will be able to find the right production to reduce.
Note that it is usually really easy to find the production, because most productions have only one (<non-terminal> * <non-terminal>) or two (( <non-terminal> )) terminals. A naive implementation might just run the terminals together into a string and use that as the key in a hash-table.
The classic implementation of operator-precedence parsing is the so-called "Shunting Yard Algorithm", devised by Edsger Dijkstra. In that algorithm, the precedence relations are modelled by providing two functions, left-precedence and right-precedence, which map terminals to integers such that x <· y is true only if right-precedence(x) < left-precedence(y) (and similarly for the other relations). It is not always possible to find such mappings, and the mappings are a cover of the actual precedence relations because it is often the case that there are pairs of terminals for which no precedence relationship applies. Nonetheless, it is often the case that these mappings can be found, and almost always the case for simple expression grammars.
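Here is a bare-bones shunting-yard pass in Python (my own sketch) for the question's grammar, using the integer precedences just described to convert a token stream into postfix order:

    # Higher number = binds tighter. A single table suffices here
    # because + and * are both left-associative.
    PREC = {"+": 1, "*": 2}

    def to_postfix(tokens):
        out, ops = [], []
        for t in tokens:
            if t in PREC:
                # While the operator on the stack takes precedence
                # (x ·> y), it reduces first: pop it to the output.
                while ops and PREC[ops[-1]] >= PREC[t]:
                    out.append(ops.pop())
                ops.append(t)
            else:
                out.append(t)          # a number reduces immediately
        while ops:                     # end of input: everything ·> $
            out.append(ops.pop())
        return out

    print(to_postfix(["2", "+", "3", "*", "4", "+", "5"]))
    # ['2', '3', '4', '*', '+', '5', '+']

Emitting postfix is the simplest demonstration; building tree nodes, or reducing a value stack, at each pop works the same way.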
I hope that's enough to get you started. I encourage you to actually read some texts about bottom-up parsing, because I think I've already written far too much for an SO answer, and I haven't yet included a single line of code. :)

Resources