First and follow in the following grammar - parsing

The following grammar is given:
E->E+T|T
T->T*F|F
F->id
I have tried to find the FIRST and FOLLOW sets. Can anyone verify whether they are correct?
First(E)={id}
First(T)={id}
First(F)={id}
Follow(E)={+,id}
Follow(T)={+}
Follow(F)={id,*}

The FIRST sets are correct.
FOLLOW(A) of a non-terminal A is the set of terminal symbols that can appear immediately after A in some derivation.
For FOLLOW(E), check where E appears on the right-hand side of a production. It appears in
E->E+T
What follows E when we use this production in a derivation is '+'; '$' (end of input) is also added to the FOLLOW set of the start symbol, so
FOLLOW(E) = {+, $}
For FOLLOW(T): it appears on the right-hand side of three productions,
E->E+T  E->T  T->T*F
From T->T*F, '*' follows T; from E->E+T and E->T, everything in FOLLOW(E) can follow T:
FOLLOW(T) = {*} ∪ FOLLOW(E) = {*, +, $}
For FOLLOW(F): it appears on the right-hand side of two productions,
T->T*F  T->F
FOLLOW(F) = FOLLOW(T) = {*, +, $}
If you are doing this exercise to build an LL(1) parsing table, first eliminate the left recursion and then proceed.
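For larger grammars it helps to compute these sets mechanically. Here is a minimal sketch of the standard fixed-point computation, written for exactly this three-production grammar (no production derives epsilon, so nullability handling is omitted):

```python
# Fixed-point FIRST/FOLLOW computation for E -> E+T | T, T -> T*F | F, F -> id.
# No production derives epsilon here, so nullability handling is omitted.
GRAMMAR = {
    'E': [['E', '+', 'T'], ['T']],
    'T': [['T', '*', 'F'], ['F']],
    'F': [['id']],
}
TERMINALS = {'+', '*', 'id'}
START = 'E'

first = {nt: set() for nt in GRAMMAR}
follow = {nt: {'$'} if nt == START else set() for nt in GRAMMAR}

changed = True
while changed:  # repeat until no set grows any more
    changed = False
    for lhs, prods in GRAMMAR.items():
        for prod in prods:
            # FIRST(lhs) gains FIRST of the production's first symbol.
            sym = prod[0]
            gain = {sym} if sym in TERMINALS else first[sym]
            if not gain <= first[lhs]:
                first[lhs] |= gain
                changed = True
            # FOLLOW: look at what comes after each non-terminal.
            for i, b in enumerate(prod):
                if b in TERMINALS:
                    continue
                if i + 1 < len(prod):
                    nxt = prod[i + 1]
                    gain = {nxt} if nxt in TERMINALS else first[nxt]
                else:
                    gain = follow[lhs]  # b ends the production
                if not gain <= follow[b]:
                    follow[b] |= gain
                    changed = True

print(first)   # every FIRST set is {'id'}
print(follow)  # E: {'+','$'}; T and F: {'*','+','$'}
```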

Related

LR(1) parsing, problem with look ahead symbols

I understand the concept of LR(1) parsing and lookahead symbols. I have the official solution to the exercise, and it does not agree with mine.
I'm trying to fill the LR(1) parsing table for the grammar below:
S->xAz
S->BAx
A->Ay
A->e
B->yB
B->y
I don't need to augment the grammar, since S does not appear on the right-hand side of any rule.
First(A) = {y, e}
First(Ax) = {x, y}
First(B) = {y}
First(Ay) = {y}
Lookahead symbols are shown in brackets.
So, I0 = Closure(S->.xAz($) , S->.BAx($) ) =
S->.xAz($)
S->.BAx($)
B->.yB(x,y)
B->.y(x,y)
When I try GOTO(0, x), I think I should get:
S->x.Az($)
A->.Ay(z)
A->. (z)
To find the lookahead symbols for A->. and A->.Ay I take First(z). But the official book solution says the lookahead is (z,y).
Where does that y come from?
Thank you in advance
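The y comes from closure itself: once [A->.Ay, z] is in the set, it too has A after the dot, and the textbook closure rule ("for an item [B -> α.Cβ, a], add [C -> .γ, b] for every b in FIRST(βa)") then yields FIRST(yz) = {y} as additional lookaheads. A small sketch of that computation (the item encoding and helper names are my own, not from the book):

```python
# Sketch of LR(1) closure for this grammar; () is the epsilon production.
# An item is (lhs, rhs, dot, lookahead).
GRAMMAR = {
    'S': [('x', 'A', 'z'), ('B', 'A', 'x')],
    'A': [('A', 'y'), ()],       # A -> Ay | e
    'B': [('y', 'B'), ('y',)],
}

# Precompute nullability and FIRST for the non-terminals (fixed point).
nullable = {nt: False for nt in GRAMMAR}
first = {nt: set() for nt in GRAMMAR}
changed = True
while changed:
    changed = False
    for nt, prods in GRAMMAR.items():
        for rhs in prods:
            f, n = set(first[nt]), nullable[nt]
            all_null = True
            for sym in rhs:
                if sym in GRAMMAR:
                    first[nt] |= first[sym]
                    if not nullable[sym]:
                        all_null = False
                        break
                else:
                    first[nt].add(sym)
                    all_null = False
                    break
            nullable[nt] = nullable[nt] or all_null
            changed |= first[nt] != f or nullable[nt] != n

def first_of(seq, la):
    """FIRST(seq . la): first terminals of seq, plus la if seq is nullable."""
    out = set()
    for sym in seq:
        if sym in GRAMMAR:
            out |= first[sym]
            if not nullable[sym]:
                return out
        else:
            out.add(sym)
            return out
    out.add(la)
    return out

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot, la in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for prod in GRAMMAR[rhs[dot]]:
                    for b in first_of(rhs[dot + 1:], la):
                        if (rhs[dot], prod, 0, b) not in items:
                            items.add((rhs[dot], prod, 0, b))
                            changed = True
    return items

# GOTO(I0, x) starts from the kernel [S -> x.Az, $]:
state = closure({('S', ('x', 'A', 'z'), 1, '$')})
print(sorted(la for lhs, rhs, dot, la in state if lhs == 'A' and rhs == ()))
# -> ['y', 'z']: z from FIRST(z$), y from closing [A -> .Ay, z] itself
```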

[$end] lookaheads in LALR

I am trying to understand, how bison builds tables for this simple grammar:
input: rule ;
rule: rule '+' '1'
| '1' ;
I was able to calculate the LR(1) transition table and item sets, but I don't understand how state 3 is built and how it works:
State 3
1 input: rule . [$end]
2 rule: rule . '+' '1'
'+' shift, and go to state 5
$default reduce using rule 1 (input)
For the default reduce rule I should put 'r1' into all cells of the table row, for every symbol. But for the shift rule I should put 's5' into the column for the '+' terminal, and that cell already contains 'r1'. To me this looks like a shift/reduce conflict, but not to bison. Please explain how the '[$end]' lookahead symbol appeared in this state, and how this state is handled overall by the LR state machine.
default means "everything else", not "everything". In other words, you first fill in the specified actions, and then use the default action for any other lookahead symbol.
If there is no default action, the action for any unspecified lookahead symbol is an error. Default reduce actions are often used to reduce table size where the algorithm would otherwise trigger an error. This optimization may have the result of delaying the reporting of an error, but the error will always be detected before another input is consumed, precisely because an error action is never replaced with a default shift. (Indeed, many parser generators never use default shift actions at all.)
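That lookup order can be sketched in a few lines (a hypothetical table encoding, not bison's internal representation):

```python
# Hypothetical encoding of state 3: explicit actions first, then $default.
state3 = {
    'actions': {'+': ('shift', 5)},   # explicit entry for '+'
    'default': ('reduce', 1),         # any other lookahead reduces by rule 1
}

def action(state, lookahead):
    # The explicit entries are consulted first; the default only covers
    # lookaheads that have no entry of their own, so there is no conflict.
    if lookahead in state['actions']:
        return state['actions'][lookahead]
    return state.get('default', ('error',))

print(action(state3, '+'))     # ('shift', 5)
print(action(state3, '$end'))  # ('reduce', 1)
print(action(state3, '1'))     # ('reduce', 1) -- the error surfaces later,
                               # before any further input is consumed
```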
If you look at the grammar shown at the beginning of the .output file, you'll see that it has been augmented with the production:
0 $accept: input $end
Yacc/bison always adds a production like that to ensure that the complete input matches the start symbol. (The non-terminal input will, of course, be replaced by whatever start symbol has been declared with the %start directive, or with the first non-terminal in the grammar.)
There is nothing special about this rule aside from the fact that reducing the $accept symbol causes the input to be accepted. (You can see that in state 4).
$end is a special terminal symbol generated when EOF is detected. (To be more precise, it is the terminal whose token value is 0, which the scanner returns to indicate end of file; (f)lex-generated scanners do this automatically.)

Removing hidden ambiguity in grammar using left factoring

I am trying to convert the grammar of a hypothetical language we created to LL(1). I have removed most of the left-factoring issues in the grammar, using the general rule of introducing new non-terminals for them.
For example,
<assignmentstats>------- <type><id>=<E>/ id=<E>/id'['<E>']'=<E>/
is transformed to
<assignmentstats>------- <type><id>=<E>/ id<LF3>
<LF3>------- =<E>/[<E>]=<E>
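Mechanically, this transformation extracts the longest common prefix of the alternatives and moves the differing tails into a fresh non-terminal. A sketch (productions are given as symbol lists; the new name, like LF3, is just a placeholder):

```python
# Sketch: one left-factoring step. Alternatives sharing a common prefix are
# replaced by prefix + new non-terminal; the tails move to the new rule.
def left_factor(name, prods, new_name):
    prefix = []
    for column in zip(*prods):         # walk the alternatives in lockstep
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    if not prefix:
        return {name: prods}           # nothing in common: no change
    tails = [p[len(prefix):] for p in prods]
    return {name: [prefix + [new_name]], new_name: tails}

# id=<E> and id'['<E>']'=<E> share the prefix 'id', as in the question:
rules = left_factor('assignmentstats',
                    [['id', '=', 'E'], ['id', '[', 'E', ']', '=', 'E']],
                    'LF3')
print(rules)
# {'assignmentstats': [['id', 'LF3']],
#  'LF3': [['=', 'E'], ['[', 'E', ']', '=', 'E']]}
```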
After applying such rules for all the required statements, I expected an unambiguous grammar.
But the following statement is somehow still ambiguous.
<body>------- <forrelatedstuff>/ <floors>/<stats>
Here are the related statements of the grammar (only the required ones are shown):
<funcbody>----- {<stats>}
<stats>----- <stat> <stats>/ #
<floors>-------- <floor><floors>/ #
<floor>------- floo <id><arr>{stats}/id<arr>{<stats>}
<stat>----- <superstats>/<returnstats>
<superstats>----- <type>id<Zip>/id<LF3>
<building> ------- build <id> {<body>}
<returnstats>--------- return <E>
I looked more into it and found that the ambiguity is between <floors> and <stats>: FIRST(<floors>) and FIRST(<stats>), i.e. the sets of first terminal symbols, contain 'id' and '}' for both.
I can see why this is a problem and how left factoring could solve it. But how can I remove this kind of ambiguity through left factoring?
Note: Here,
'id' denotes an identifier.
'#' denotes epsilon.

Eliminating Epsilon Production for Left Recursion Elimination

I'm following the algorithm for left-recursion elimination from a grammar. It says to remove the epsilon productions, if there are any.
I have the following grammar:
S-->Aa/b
A-->Ac/Sd/∈
I can see that after removing the epsilon productions the grammar becomes
1) S-->Aa/a/b
2) A-->Ac/Sd/c/d
I'm confused about where the a/b in 1) and the c/d in 2) come from.
Can someone explain this?
Let's look at the rule S->Aa: if A->∈, then S->∈a, which is just S->a, so together with the previous rules we get S->Aa|a|b.
Now let's check the rule A->Ac: with A->∈ it becomes A->∈c, which gives us A->c.
What about A->Sd? I don't see how you got A->d as a rule. If that were a rule, then the string "da" would be accepted by this grammar (S->Aa & A->d --> "da"). But try to construct this string with the original grammar: if you start with S and the string ends with "a", you must use S->Aa; but then, in order to get a "d", you must use A->Sd, which forces another "a" or "b" before the "d". So we cannot construct this string, and the rule A->d is not correct.
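The claim that "da" is not derivable can also be checked by brute force: enumerate every sentential form with a bounded number of terminals (this terminates here because each production has at most one non-terminal on its right-hand side, so forms only grow in terminals). A sketch:

```python
# Brute-force enumeration of short strings of S -> Aa | b, A -> Ac | Sd | e.
from collections import deque

GRAMMAR = {'S': ['Aa', 'b'], 'A': ['Ac', 'Sd', '']}
MAX_TERMINALS = 4          # only care about short strings like "da"

def short_strings():
    seen, out = {'S'}, set()
    queue = deque(['S'])
    while queue:
        form = queue.popleft()
        spots = [i for i, c in enumerate(form) if c in GRAMMAR]
        if not spots:
            out.add(form)  # all terminals: a generated string
            continue
        i = spots[0]       # expand the leftmost non-terminal
        for rhs in GRAMMAR[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            terminals = sum(c not in GRAMMAR for c in new)
            # prune once the terminal count exceeds the bound
            if terminals <= MAX_TERMINALS and new not in seen:
                seen.add(new)
                queue.append(new)
    return out

strings = short_strings()
print('da' in strings)     # False: "da" is not in the language
print(sorted(s for s in strings if len(s) <= 3))
```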

Grammar: start: (a b)? a c; Input: a d. Which error is correct at position 2? 1. expected "b", "c". OR 2. expected "c".

Grammar:
rule: (a b)? a c ;
Input:
a d
Question: which error message is correct at position 2 for the given input?
1. expected "b", "c".
2. expected "c".
P.S.
I am writing a parser and I face a choice (dilemma): whether or not to count "b" as expected at that position.
Error #1 (expected "b", "c") says that the input "a b" was expected; but since it is optional, it is not required, only possible.
I don't know whether "possible" is the same as "expected" here.
Which error message is better and correct, #1 or #2?
Thanks for your answers.
P.S.
In the first case I define the testing marker as a position limit:
if (_inputPos > testing) {
    _failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}
The limit is moved in optional expressions:
OPTIONAL_EXPRESSION:
    testing = _inputPos;
The "b" expression moves _inputPos past the testing position and adds a failure at _inputPos.
In the second case I can define the testing marker as a boolean flag:
if (!testing) {
    _failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}
In this case the "b" expression does not add a failure, because it is being tested (it is inside an optional expression).
What do you think is better and correct?
Option 1: testing is a specific position, and if an expression fails beyond this position (_inputPos > testing) it adds a failure (even inside an optional expression).
Option 2: testing is a flag, and while the flag is set, failures are not taken into account. After executing an optional expression, the previous value of testing (true or false) is restored (not reset!).
Also, failures are not taken into account if the rule does not fail; they are only reported if parsing fails.
P.S.
Changes at 06 Jan 2014:
This question was raised because it relates to two different problems.
First problem:
A parsing expression grammar (PEG) describes only three atomic items of input:
terminal symbol
nonterminal symbol
empty string
This formalism provides no operation such as lexical preprocessing, and thus it has no such element as the token.
Second problem:
What is a grammar? Can two grammars be considered equal if they accept the same input but produce different results?
Assume we have two grammars:
Grammar 1
rule <- type? identifier
Grammar 2
rule <- type identifier / identifier
They both accept the same input but produce (in PEG) different results.
Grammar 1 results:
{type : type, identifier : identifier}
{type : null, identifier : identifier}
Grammar 2 results:
{type : type, identifier : identifier}
{identifier : identifier}
Questions:
Are the two grammars equal?
Is it painless to optimize grammars?
My answer to both questions is negative: not equal, and not painless.
But you may ask, "Why does this happen?"
I can answer you: "Because this is not a problem. This is a feature."
In a PEG parser an expression ALWAYS consists of these parts:
ORDERED_CHOICE => SEQUENCE => EXPRESSION
And this explanation is my answer to the question "Why does this happen?".
Another problem:
A PEG parser does not recognize WHITESPACE specially, because it has no tokens and no token separators.
Now look at this grammar (abridged):
program <- WHITESPACE expr EOF
expr <- ruleX
ruleX <- 'X' WHITESPACE
WHITESPACE <- ' '?
EOF <- ! .
All PEG grammars are described in this manner: a first WHITESPACE at the beginning, and another WHITESPACE (often) at the end of each rule.
In this case, in PEG, the optional WHITESPACE must be assumed to be expected.
But WHITESPACE does not only mean a space. It may be more complex ([\t\n\r]) and may even include comments.
But the main rule of error messages is the following:
if it is not possible to display all expected elements (or not possible to display even one of the full set of expected elements), then it is more correct to display nothing.
More precisely, an "unexpected" error message is required.
How, in PEG, would you display an expected WHITESPACE?
Parser error: expected WHITESPACE
Parser error: expected ' ', '\t', '\n', '\r'
What about the start characters of comments? They may also be part of WHITESPACE in some grammars.
In that case an optional WHITESPACE rejects all the other potential expected elements, because it is not possible to display WHITESPACE correctly in an error message: WHITESPACE is too complex to display.
Is this good or bad?
I think it is not bad, and some tricks are required to hide this nature of PEG parsers.
So in my PEG parser I do not assume that the inner expression at the first position of an optional (optional & zero_or_more) expression must be treated as expected.
But all other inner expressions (except the one at the first position) must be treated as expected.
Example 1:
List<int list; // type? ident
Here "List<int" is a "type", but the missing ">" is not at the first position of the optional "type?".
This failure is taken into account and reported as "expected '>'".
This is because we do not skip "type" but enter into it, and after the genuinely optional "List" we move the position from the first element to the next real "expected" element (which is already beyond the testing position).
"List" was at the "testing" position.
If an inner expression (inside an optional expression) fits within the limit and does not continue past that position, it is not treated as expected input.
The main question was asked on the basis of this assumption.
Just take into account that we are talking about PEG parsers and their error messages.
Here is your grammar: rule: (a b)? a c ;
What is clear here is that after the first a there are two possible inputs: b or c. Your error message should not prioritize one over the other.
The basic idea when producing an error message for an invalid input is to find the farthest point you reached before failing (if your grammar were d | (a b)? a c, then d wouldn't be part of the error), determine all the possible inputs that could have made you advance, and say "expected '...' but got '...'". There are other approaches that try to recover the parser and force it to continue: if there is only one possible expected token, temporarily insert it into the token stream and continue as if it had been there all along. This leads to better error detection, as you can find errors beyond the point where the parser first stopped.
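The farthest-failure idea for the grammar in the question can be sketched like this (a hypothetical hand-written matcher, not the asker's generated code):

```python
# Sketch: farthest-failure error reporting for  rule <- ('a' 'b')? 'a' 'c'.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.far_pos = 0          # farthest position where a match failed
        self.expected = set()     # terminals expected at that position

    def term(self, pos, t):
        if pos < len(self.tokens) and self.tokens[pos] == t:
            return pos + 1        # success: advance
        # record the failure, keeping only the farthest position
        if pos > self.far_pos:
            self.far_pos, self.expected = pos, set()
        if pos == self.far_pos:
            self.expected.add(t)
        return None

    def rule(self, pos):          # rule <- ('a' 'b')? 'a' 'c'
        p = self.term(pos, 'a')   # optional part: ('a' 'b')?
        if p is not None:
            p2 = self.term(p, 'b')
            if p2 is not None:
                pos = p2          # optional part consumed; else backtrack
        p = self.term(pos, 'a')
        if p is None:
            return None
        return self.term(p, 'c')

p = Parser(['a', 'd'])
result = p.rule(0)
print(result, p.far_pos, sorted(p.expected))  # None 1 ['b', 'c']
```

On input "a d" both the optional 'b' and the mandatory 'c' fail at position 1, so both end up in the expected set: this implementation naturally produces error #1.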
