Why does left recursion occur if I add lambda? - parsing

I am trying to write the syntax for if statements using flex and bison, and I have a problem in the parser.
Here is a grammar for the if syntax in C++:
program : //start rule
|statements;
block:
TOKEN_BEGIN statements ';'TOKEN_END;
reexpression:
| TOKEN_OPERATOR expression;
expression: //x+2*a*3
TOKEN_ID reexpression
| TOKEN_NUMBER reexpression;
assignment:
TOKEN_ID'='expression
statement:
assignment;
statements:
statement';'
| block
| if_statement;
else_statement:
TOKEN_ELSE statements ;
else_if_statement:
TOKEN_ELSE_IF '(' expression ')' statements;
if_statement:
TOKEN_IF '(' expression ')' statements else_if_statement else_statement;
I can't understand why left recursion happens if I replace these three rules. I just added lambda (an empty alternative) to these rules:
else_statement:
|TOKEN_ELSE statements ;
else_if_statement:
|TOKEN_ELSE_IF '(' expression ')' statements;
if_statement:
TOKEN_IF '(' expression ')' statements else_if_statement else_statement;
please help me understand.

There's no lambda or left-recursion involved.
When you add epsilon to the if rules (making the else optional), you get conflicts because the resulting grammar is ambiguous. This is the classic dangling-else ambiguity: when you have two ifs with a single else, the else can bind to either if.
IF ( expr1 ) IF ( expr2 ) block1 ELSE block2
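To make the ambiguity concrete, here is a minimal Python sketch that enumerates every parse tree for a cut-down version of the grammar. It is only an illustration under simplifying assumptions: the condition is omitted, "S" stands for any plain statement, and the names parses and all_trees are made up for this example.

```python
def parses(tokens, i=0):
    """Yield (tree, next_index) for every way to parse one statement at tokens[i:]."""
    if i >= len(tokens):
        return
    if tokens[i] == "S":  # a plain statement
        yield "S", i + 1
    elif tokens[i] == "IF":
        for then_tree, j in parses(tokens, i + 1):
            # Alternative 1: an if with no else part
            yield ("if", then_tree), j
            # Alternative 2: an if followed by ELSE and another statement
            if j < len(tokens) and tokens[j] == "ELSE":
                for else_tree, k in parses(tokens, j + 1):
                    yield ("if", then_tree, "else", else_tree), k


def all_trees(tokens):
    """Keep only the parses that consume the whole input."""
    return [tree for tree, end in parses(tokens) if end == len(tokens)]


trees = all_trees(["IF", "IF", "S", "ELSE", "S"])
for tree in trees:
    print(tree)
```

The input has two complete parses: one where the else belongs to the outer if and one where it belongs to the inner if. That is exactly the situation bison reports as a conflict.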

Related

Warning: "rule useless in parser due to conflicts" in Bison

I am trying to make a C lexical analyzer and I have some warnings:
rule useless in parser due to conflicts: sentenceList: sentenceList sentence
rule useless in parser due to conflicts: sentSelection: IF '(' expression ')' sentence
rule useless in parser due to conflicts: sentSelection: IF '(' expression ')' sentence ELSE sentence
rule useless in parser due to conflicts: sentSelection: SWITCH '(' expression ')' sentence
rule useless in parser due to conflicts: sentIteration: WHILE '(' expression ')' sentence
rule useless in parser due to conflicts: sentIteration: FOR '(' expression ';' expression ';' expression ')' sentence
This is the part of the code where the warnings come from:
input: /* nothing */
| input line
;
line: '\n'
| sentence '\n'
;
sentence : sentComposed
|sentSelection
|sentExpression
|sentIteration
;
sentComposed: statementsList
|sentenceList
;
statementsList: statement
| statementsList statement
;
sentenceList: sentence
|sentenceList sentence
;
sentExpression: expression ';'
|';'
;
sentSelection: IF '(' expression ')' sentence
|IF '(' expression ')' sentence ELSE sentence
|SWITCH '(' expression ')' sentence
;
sentIteration: WHILE '(' expression ')' sentence
|DO sentence WHILE '(' expression ')' ';'
|FOR '(' expression ';' expression ';' expression ')' sentence
;
statement: DATATYPE varList
;
varList: aVar
|varList ',' aVar
;
aVar: variable initial
;
variable: IDENTIFIER
;
initial: '=' NUM
;
I have just added some more information.
Every word in uppercase letters is a token.
If you need any additional information, please tell me.
Here's a considerably simplified (but complete) excerpt of your grammar. I've declared expression to be a terminal so as to avoid having to define it:
%token expression IF
%%
sentence : sentComposed
|sentSelection
|sentExpression
sentComposed: sentenceList
sentenceList: sentence
|sentenceList sentence
sentExpression: expression ';'
|';'
sentSelection: IF '(' expression ')' sentence
When I run that through bison, it reports:
ez.y: warning: 4 shift/reduce conflicts [-Wconflicts-sr]
ez.y: warning: 8 reduce/reduce conflicts [-Wconflicts-rr]
Those conflicts are the actual problem, as indicated by the following warnings ("due to conflicts"):
ez.y:8.18-38: warning: rule useless in parser due to conflicts [-Wother]
|sentenceList sentence
^^^^^^^^^^^^^^^^^^^^^
ez.y:11.18-47: warning: rule useless in parser due to conflicts [-Wother]
sentSelection: IF '(' expression ')' sentence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When bison finds a conflict in a grammar, it resolves it according to a simple procedure:
shift-reduce conflicts are resolved in favour of the shift
reduce-reduce conflicts are resolved in favour of the production which occurs earlier in the grammar.
Once it does that, it might turn out that some production can no longer ever be used, because it was eliminated from every context in which it might have been reduced. That's a clear sign that the grammar is problematic. [Note 1]
The basic problem here is that sentComposed means that statements can just be strung together to make a longer statement. So what happens if you write:
IF (e) statement1 statement2
It could be that statement1 statement2 is intended to be reduced into a single sentComposed which is the target of the IF, so the two statements execute only if e is true. Or it could be that the sentComposed consists of the IF statement with target statement1, followed by statement2. In C terms, the difference is between:
if (e) { statement1; statement2; }
and
{ if (e) { statement1; } statement2; }
So that's a real ambiguity, and you probably need to rethink the absence of braces in order to fix it.
But that's not the only problem; you also have a bunch of reduce-reduce conflicts. Those come about in a much simpler way, because part of the above grammar is the following loop:
sentence: sentComposed
sentComposed: sentenceList
sentenceList: sentence
That loop means that your grammar allows a single sentence to be wrapped in an arbitrary number of unit reductions. You certainly did not intend that; I'm certain that your intent was that sentComposed only be used if actually necessary. But bison doesn't know your intent; it only knows what you say.
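A loop like this can be found mechanically. The following Python sketch looks for nonterminals that can derive themselves using only unit productions. It is an illustration rather than anything bison does internally; the dictionary encoding of the simplified grammar and the name unit_cycles are assumptions made for the example.

```python
def unit_cycles(grammar):
    """Return the nonterminals that can derive themselves via unit productions only."""
    # unit[A] holds every nonterminal B for which there is a production A -> B
    unit = {
        lhs: {rhs[0] for rhs in alts if len(rhs) == 1 and rhs[0] in grammar}
        for lhs, alts in grammar.items()
    }
    cyclic = set()
    for start in grammar:
        seen, stack = set(), list(unit[start])
        while stack:
            nt = stack.pop()
            if nt == start:  # we came back to where we started
                cyclic.add(start)
                break
            if nt not in seen:
                seen.add(nt)
                stack.extend(unit[nt])
    return cyclic


# The simplified grammar from above; "expression" is a terminal here.
grammar = {
    "sentence": [["sentComposed"], ["sentSelection"], ["sentExpression"]],
    "sentComposed": [["sentenceList"]],
    "sentenceList": [["sentence"], ["sentenceList", "sentence"]],
    "sentExpression": [["expression", ";"], [";"]],
    "sentSelection": [["IF", "(", "expression", ")", "sentence"]],
}
print(sorted(unit_cycles(grammar)))
```

The three nonterminals in the loop are reported, while sentExpression and sentSelection are not.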
Again, you will probably solve this problem when you figure out how you actually want to identify the boundaries of a sentComposed.
Notes:
In some cases, conflicts are not actually a problem. For example, there is a shift-reduce conflict between these two productions; the so-called "dangling-else" ambiguity:
sentSelection: IF '(' expression ')' sentence
|IF '(' expression ')' sentence ELSE sentence
In a nested IF statement:
IF (e) IF (f) s1 ELSE s2
it's not clear whether the ELSE should apply to the inner or outer IF. If it applies to the inner IF, it must be shifted to allow the second production for sentSelection. If it applies to the outer IF, a reduction must first be performed to complete the inner (else-less) IF before shifting ELSE into the outer IF. Bison's default action ("prefer shift") does exactly the right thing in this case, which is to shift the ELSE immediately. (Indeed, that's why the default was chosen to be "prefer shift").
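The effect of "prefer shift" can be imitated with a small recursive-descent sketch in Python. This is not the LR automaton bison builds, only an analogue: the parser consumes an ELSE as soon as it sees one, which attaches it to the nearest IF. Conditions are reduced to a single token without parentheses, and the name parse_stmt is an assumption for this example.

```python
def parse_stmt(tokens, i=0):
    """Parse one statement at tokens[i:], returning (tree, next_index)."""
    if tokens[i] == "IF":
        cond = tokens[i + 1]
        then_tree, j = parse_stmt(tokens, i + 2)
        # "Prefer shift": if an ELSE comes next, take it immediately,
        # so it belongs to the IF we are currently finishing.
        if j < len(tokens) and tokens[j] == "ELSE":
            else_tree, k = parse_stmt(tokens, j + 1)
            return ("if", cond, then_tree, "else", else_tree), k
        return ("if", cond, then_tree), j
    return tokens[i], i + 1  # any other token is a plain statement


tree, end = parse_stmt(["IF", "e", "IF", "f", "s1", "ELSE", "s2"])
print(tree)
```

The resulting tree has the else inside the inner if (the one testing f), and the outer if has no else part.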

Ambiguous call expression in ANTLR4 grammar

I have a simple grammar (for demonstration)
grammar Test;
program
: expression* EOF
;
expression
: Identifier
| expression '(' expression? ')'
| '(' expression ')'
;
Identifier
: [a-zA-Z_] [a-zA-Z_0-9?]*
;
WS
: [ \r\t\n]+ -> channel(HIDDEN)
;
Obviously the second and third alternatives in the expression rule are ambiguous. I want to resolve this ambiguity by permitting the second alternative only if an expression is immediately followed by a '('.
So the following
bar(foo)
should match the second alternative while
bar
(foo)
should match the 1st and 3rd alternatives (even if the token between them is in the HIDDEN channel).
How can I do that? I have seen these ambiguities, between call expressions and parenthesized expressions, present in languages that have no (or have optional) expression terminator tokens (or rules) - example
The solution to this is to temporarily "unhide" whitespace in your second alternative. Have a look at this question for how this can be done.
With that solution your code could look something like this
expression
: Identifier
| {enableWS();} expression '(' {disableWS();} expression? ')'
| '(' expression ')'
;
That way the second alternative matches the input in a whitespace-sensitive way and will therefore only be matched if the identifier is directly followed by the bracket.
See here for the implementation of the MultiChannelTokenStream that is mentioned in the linked question.
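The underlying test ("is the '(' directly adjacent to what precedes it?") can be sketched outside ANTLR. The Python below is not the token-channel technique described above and uses no ANTLR API; it is a hypothetical stand-alone tokenizer that keeps source offsets and compares them. The names tokenize and call_parens are made up for the illustration.

```python
import re

# Identifiers as in the grammar's Identifier rule, plus the two parentheses.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z_0-9?]*|[()]")


def tokenize(text):
    """Return (lexeme, start, end) triples; whitespace is simply skipped."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(text)]


def call_parens(text):
    """Return indices of '(' tokens that touch a preceding identifier or ')'."""
    toks = tokenize(text)
    return [
        i
        for i in range(1, len(toks))
        if toks[i][0] == "("
        and toks[i - 1][0] != "("          # "((" is nesting, not a call
        and toks[i][1] == toks[i - 1][2]   # no gap between the two tokens
    ]


print(call_parens("bar(foo)"))    # the '(' is adjacent to bar
print(call_parens("bar\n(foo)"))  # the newline separates them
```

In the first input the parenthesis is flagged as a call; in the second it is not, because whitespace sits between the two tokens.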

YACC grammar for arithmetic expressions, with no surrounding parentheses

I want to write the rules for arithmetic expressions in YACC; where the following operations are defined:
+ - * / ()
But, I don't want the statement to have surrounding parentheses. That is, a+(b*c) should have a matching rule but (a+(b*c)) shouldn't.
How can I achieve this?
The motive:
In my grammar I define a set like this: (1,2,3,4) and I want (5) to be treated as a 1-element set. The ambiguity causes a reduce/reduce conflict.
Here's a pretty minimal arithmetic grammar. It handles the four operators you mention and assignment statements:
stmt: ID '=' expr ';'
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
It's easy to define "set" literals:
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
If we assume that a set literal can only appear as the value in an assignment statement, and not as the operand of an arithmetic operator, then we would add a syntax for "expressions or set literals":
value: expr | set
and modify the syntax for assignment statements to use that:
stmt: ID '=' value ';'
But that leads to the reduce/reduce conflict you mention because (5) could be an expr, through the expansion expr → term → factor → '(' expr ')'.
Here are three solutions to this ambiguity:
1. Explicitly remove the ambiguity
Disambiguating is tedious but not particularly difficult; we just define two kinds of subexpression at each precedence level, one which is possibly parenthesized and one which is definitely not surrounded by parentheses. We start with some short-hand for a parenthesized expression:
paren: '(' expr ')'
and then for each subexpression type X, we add a production pp_X:
pp_term: term | paren
and modify the existing production by allowing possibly parenthesized subexpressions as operands:
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
Unfortunately, we will still end up with a shift/reduce conflict, because of the way expr_list was defined. Confronted with the beginning of an assignment statement:
a = ( 5 )
having finished with the 5, so that ) is the lookahead token, the parser does not know whether the (5) is a set (in which case the next token will be a ;) or a paren (which is only valid if the next token is an operator). This is not an ambiguity -- the parse could be trivially resolved with an LR(2) parse table -- but there are not many tools which can generate LR(2) parsers. So we sidestep the issue by insisting that the expr_list has to have two expressions, and adding paren to the productions for set:
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
Now the parser doesn't need to choose between expr_list and expr in the assignment statement; it simply reduces (5) to paren and waits for the next token to clarify the parse.
So that ends up with:
stmt: ID '=' value ';'
value: expr | set
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
paren: '(' expr ')'
pp_expr: expr | paren
expr: term | pp_expr '-' pp_term | pp_expr '+' pp_term
pp_term: term | paren
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
pp_factor: factor | paren
factor: ID | NUMBER | '-' pp_factor
which has no conflicts.
2. Use a GLR parser
Although it is possible to explicitly disambiguate, the resulting grammar is bloated and not really very clear, which is unfortunate.
Bison can generate GLR parsers, which would allow for a much simpler grammar. In fact, the original grammar would work almost without modification; we just need to use the Bison %dprec dynamic precedence declaration to indicate how to disambiguate:
%glr-parser
%%
stmt: ID '=' value ';'
value: expr %dprec 1
| set %dprec 2
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
The %dprec declarations in the two productions for value tell the parser to prefer value: set if both productions are possible. (They have no effect in contexts in which only one production is possible.)
3. Fix the language
While it is possible to parse the language as specified, we might not be doing anyone any favours. There might even be complaints from people who are surprised when they change
a = ( some complicated expression ) * 2
to
a = ( some complicated expression )
and suddenly a becomes a set instead of a scalar.
It is often the case that languages for which the grammar is not obvious are also hard for humans to parse. (See, for example, C++'s "most vexing parse").
Python, which uses ( expression list ) to create tuple literals, takes a very simple approach: ( expression ) is always an expression, so a tuple needs to either be empty or contain at least one comma. To make the latter possible, Python allows a tuple literal to be written with a trailing comma; the trailing comma is optional unless the tuple contains a single element. So (5) is an expression, while (), (5,), (5,6) and (5,6,) are all tuples (the last two are semantically identical).
Python lists are written between square brackets; here, a trailing comma is again permitted, but it is never required because [5] is not ambiguous. So [], [5], [5,], [5,6] and [5,6,] are all lists.
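These rules are easy to check in Python itself. The short snippet below evaluates each of the literals mentioned above and prints the type of the result; the dictionary name values is just for the example.

```python
values = {
    "(5)": (5),
    "()": (),
    "(5,)": (5,),
    "(5,6)": (5, 6),
    "(5,6,)": (5, 6,),
    "[5]": [5],
    "[5,]": [5,],
}

for text, value in values.items():
    print(f"{text:8} -> {type(value).__name__}")
```

Only (5) comes out as an int; every other parenthesized form is a tuple, and both bracketed forms are lists.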

Solving shift/reduce conflict in expression grammar

I am new to bison and I am trying to make a grammar parsing expressions.
I am facing a shift/reduce conflict right now that I am not able to solve.
The grammar is the following:
%left "[" "("
%left "+"
%%
expression_list : expression_list "," expression
| expression
| /*empty*/
;
expression : "(" expression ")"
| STRING_LITERAL
| INTEGER_LITERAL
| DOUBLE_LITERAL
| expression "(" expression_list ")" /*function call*/
| expression "[" expression "]" /*index access*/
| expression "+" expression
;
This is my grammar, but I am facing a shift/reduce conflict with those two rules "(" expression ")" and expression "(" expression_list ")".
How can I resolve this conflict?
EDIT: I know I could solve this using precedence climbing, but I would like to not do so, because this is only a small part of the expression grammar, and the size of the expression grammar would explode using precedence climbing.
There is no shift-reduce conflict in the grammar as presented, so I suppose that it is just an excerpt of the full grammar. In particular, there will be precisely the shift/reduce conflict mentioned if the real grammar includes:
%start program
%%
program: %empty
| program expression
In that case, you will run into an ambiguity because given, for example, a(b), the parser cannot tell whether it is a single call-expression or two consecutive expressions, first a single variable, and second a parenthesized expression. To avoid this problem you need to have some token which separates expressions (statements).
There are some other issues:
expression_list : expression_list "," expression
| expression
| /*empty*/
;
That allows an expression list to be ,foo (as in f(,foo)), which is likely not desirable. Better would be
arguments: %empty
| expr_list
expr_list: expr
| expr_list ',' expr
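The difference between the two definitions can be checked with a small Python sketch. It assumes a one-character encoding of the token stream, with "e" for an expression and "," for a comma, and writes each grammar's language as an equivalent regular expression; ORIGINAL, FIXED and accepts are names invented for the illustration.

```python
import re

# One character per token: "e" is an expression, "," is a comma.
# expression_list as written in the question: (empty | e) followed by any ",e"
ORIGINAL = re.compile(r"e?(,e)*")
# arguments: %empty | expr_list, where expr_list is e followed by any ",e"
FIXED = re.compile(r"(e(,e)*)?")


def accepts(pattern, tokens):
    """True if the whole token string belongs to the pattern's language."""
    return pattern.fullmatch(tokens) is not None


for tokens in ["", "e", "e,e", ",e", "e,"]:
    print(repr(tokens), accepts(ORIGINAL, tokens), accepts(FIXED, tokens))
```

Both versions accept the empty list and ordinary comma-separated lists, but only the original accepts a list that starts with a comma.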
And the precedences are probably backwards. Usually one wants postfix operators like call and index to bind more tightly than arithmetic operators, so they should come at the end. Otherwise a+b(7) is (a+b)(7), which is unconventional.

Shift/reduce conflict in yacc due to look-ahead token limitation?

I've been trying to tackle a seemingly simple shift/reduce conflict to no avail. Naturally, the parser works fine if I just ignore the conflict, but I'd feel much safer if I reorganized my rules. Here, I've simplified a relatively complex grammar to the single conflict:
statement_list
: statement_list statement
|
;
statement
: lvalue '=' expression
| function
;
lvalue
: IDENTIFIER
| '(' expression ')'
;
expression
: lvalue
| function
;
function
: IDENTIFIER '(' ')'
;
With the verbose option in yacc, I get this output file describing the state with the mentioned conflict:
state 2
lvalue -> IDENTIFIER . (rule 5)
function -> IDENTIFIER . '(' ')' (rule 9)
'(' shift, and go to state 7
'(' [reduce using rule 5 (lvalue)]
$default reduce using rule 5 (lvalue)
Thank you for any assistance.
The problem is that this requires 2-token lookahead to know when it has reached the end of a statement. If you have input of the form:
ID = ID ( ID ) = ID
after the parser shifts the second ID (with ( as the lookahead token), it doesn't know whether that's the end of the first statement (so the ( is the beginning of a second statement) or the start of a function. So it shifts (continuing to parse a function), which is the wrong thing to do with the example input above.
If you extend function to allow an argument inside the parentheses and expression to allow actual expressions, things become worse, as the lookahead required is unbounded -- the parser needs to get all the way to the second = to determine that this is not a function call.
The basic problem here is that there's no helper punctuation to aid the parser in finding the end of a statement. Since text that is the beginning of a valid statement can also appear in the middle of a valid statement, finding statement boundaries is hard.
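For the simplified grammar in the question, where a function is exactly IDENTIFIER '(' ')', the decision after an IDENTIFIER can be sketched in Python with two tokens of lookahead. This is only an illustration of why one token is not enough; the function name after_identifier and the list-of-strings token encoding are assumptions, and the sketch does not cover the extended grammar where the lookahead is unbounded.

```python
def after_identifier(tokens, i):
    """Decide how to treat the IDENTIFIER at tokens[i].

    Returns "function" when it starts IDENTIFIER '(' ')', otherwise "lvalue".
    """
    first = tokens[i + 1] if i + 1 < len(tokens) else None
    second = tokens[i + 2] if i + 2 < len(tokens) else None
    # A '(' alone is not enough: it may open the next statement's lvalue.
    # Only '(' immediately followed by ')' makes this a function.
    if first == "(" and second == ")":
        return "function"
    return "lvalue"


tokens = ["ID", "=", "ID", "(", "ID", ")", "=", "ID"]
print(after_identifier(tokens, 2))            # lvalue
print(after_identifier(["ID", "(", ")"], 0))  # function
```

With only the first lookahead token, both inputs look the same at that point, which is the conflict yacc reports in state 2.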

Resources