Bison can't resolve shift/reduce and reduce/reduce conflicts

I am writing a parser using Bison, but can't seem to get the grammar correct.
There are two conflicts:
Here are some of the rules used around conflict one:
program : function END_OF_FILE {return 0;}
formal_parameters : OPEN_PAREN formal_parameter list_E_fparameter CLOSE_PAREN | OPEN_PAREN CLOSE_PAREN
formal_parameter : expression_parameter | function_parameter
function : return_options IDENTIFIER formal_parameters block
function_parameter : return_options IDENTIFIER formal_parameters
expression_parameter : VAR identifier_list IDENTIFIER | identifier_list IDENTIFIER
variable_creation : identifier_list COLON type SEMI_COLON
labels : LABELS identifier_list SEMI_COLON
list_E_identifiers : list_E_identifiers COMMA IDENTIFIER |
identifier_list : IDENTIFIER list_E_identifiers
return_options : VOID | IDENTIFIER
State 12 conflicts: 1 reduce/reduce
state 12
56 identifier_list: IDENTIFIER . list_E_identifiers
60 return_options: IDENTIFIER .
102 list_E_identifiers: . list_E_identifiers COMMA IDENTIFIER
103 | .
COMMA reduce using rule 103 (list_E_identifiers)
IDENTIFIER reduce using rule 60 (return_options)
IDENTIFIER [reduce using rule 103 (list_E_identifiers)]
$default reduce using rule 60 (return_options)
list_E_identifiers go to state 23
State 64 conflicts: 1 shift/reduce
state 64
8 body: OPEN_BRACE list_E_statement . CLOSE_BRACE
17 statement: . opt_declaration unlabeled_statement
18 | . compound
31 compound: . OPEN_BRACE list_NE_unlstatement CLOSE_BRACE
73 opt_declaration: . IDENTIFIER COLON
74 | .
94 list_E_statement: list_E_statement . statement
CLOSE_BRACE shift, and go to state 68
IDENTIFIER shift, and go to state 69
OPEN_BRACE shift, and go to state 70
IDENTIFIER [reduce using rule 74 (opt_declaration)]
$default reduce using rule 74 (opt_declaration)
statement go to state 71
compound go to state 72
opt_declaration go to state 73
Can anyone help me? I've looked at http://www.gnu.org/software/bison/manual/html_node/Understanding.html but can't understand what this means.
I can post the full grammar if that would help.
Thank you!

The second conflict is a classic problem with "optional" elements. It's very tempting to write a rule for optional labels the way you have, but the fact that opt_declaration can produce nothing forces the parser to make a decision before it has enough information.
LR parsers must "reduce" (recognize) a non-terminal before absorbing any further tokens. They can look ahead at the next k tokens (just the next token for an LR(1) parser, and Bison's default LALR(1) parsers likewise use a single lookahead token), but they can't tentatively use the token and later go back and do the reduction.
So when the parser is at the point where the next token, which is an identifier, should start a statement, it might be looking at a statement which starts with an identifier, or it might be looking at a label which starts with an identifier. It could tell the difference by looking at the colon which follows the identifier (if any) but it can't see that far ahead.
Now, if it weren't for the fact that it is required to reduce either an empty opt_declaration or one containing a label, there would not be a problem. If you had written something like this:
statement: basic_statement | compound
basic_statement: unlabeled_statement | declaration unlabeled_statement
declaration: IDENTIFIER COLON
then the parser would not have to make a decision when it sees the identifier. It only has to make a decision when it reaches the end of a production; it is perfectly capable of pressing forward when there are two possible productions to complete. But when you force it to recognize an optional label, then it has to know whether the label was not there in order to reduce (recognize) the empty production.
For the first conflict, we can see from the output that there is some context in which the lookahead symbol is IDENTIFIER and you could have either a return_options or an identifier_list. Since both of those productions can produce a single IDENTIFIER, the parser will not know which one to reduce.
With the actual grammar available, it is easy to find the context in which both return_options IDENTIFIER and identifier_list IDENTIFIER are possible:
formal_parameter : expression_parameter | function_parameter
expression_parameter: identifier_list IDENTIFIER
function_parameter : return_options IDENTIFIER …
That grammar is not ambiguous. If IDENTIFIER IDENTIFIER is the start of function_parameter, then it must be followed by a (; if it is an expression_parameter, then it must be followed by either , or ). But that's the second next token, which means you'd need an LR(2) parser.
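To make the "second next token" point concrete, here is a small Python sketch. It is only a toy model, not how Bison works internally; the function name and the sample token lists are made up for illustration, while the token names come from the grammar in the question.

```python
def classify_parameter(tokens):
    """Classify a formal parameter that begins with IDENTIFIER IDENTIFIER.

    tokens[0] is already on the parser stack and tokens[1] is the single
    lookahead token an LR(1) parser can see. tokens[2] is the token that
    actually settles the question.
    """
    if len(tokens) < 3 or tokens[:2] != ["IDENTIFIER", "IDENTIFIER"]:
        raise ValueError("expected IDENTIFIER IDENTIFIER plus one more token")
    decider = tokens[2]
    if decider == "OPEN_PAREN":
        # The first IDENTIFIER was a return_options.
        return "function_parameter"
    if decider in ("COMMA", "CLOSE_PAREN"):
        # The first IDENTIFIER was a one-element identifier_list.
        return "expression_parameter"
    raise ValueError(f"unexpected token {decider!r}")


# Hypothetical inputs: a function parameter and an expression parameter.
func_param = ["IDENTIFIER", "IDENTIFIER", "OPEN_PAREN", "CLOSE_PAREN"]
expr_param = ["IDENTIFIER", "IDENTIFIER", "CLOSE_PAREN"]

# Through the one-token lookahead the two inputs are indistinguishable.
print(func_param[:2] == expr_param[:2])
print(classify_parameter(func_param))
print(classify_parameter(expr_param))
```

The decision needs tokens[2], but in state 12 the parser has to choose its reduction while it can only see tokens[1].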
So I will give my usual advice on handling LR(2) grammars. It is possible to rewrite an LR(k) grammar as an LR(1) grammar, regardless of the value of k, but the result is usually bloated and ugly. So if you are using bison and you are willing to live with the possibility that action evaluations could be slightly delayed, then your easiest solution is to ask bison to generate a GLR parser. Often, just adding %glr-parser to the options section is enough.
It's worth noting that your grammar seems to be an uneasy mix between C and Pascal-like syntaxes. In C, the first token in a parameter is always a type; either the return type of a function, or the type of the following identifier. In Pascal, the last token in a parameter is the type. But in your grammar, sometimes the first token is the type and sometimes it's the last token. In a certain sense, it is this inconsistency which leads to the awkwardness in the grammar.
(Pascal has a lot more punctuation: the type is always preceded by a colon and a function parameter is preceded by the word function. These extra tokens are not needed to make the grammar work, but it can be argued that they make the syntax easier to read by humans.)

Related

Solve shift/reduce conflict across rules

I'm trying to learn bison by writing a simple math parser and evaluator. I'm currently implementing variables. A variable can be part of an expression, but I'd like to do something different when one enters only a single variable name as input. That input is by itself also a valid expression, hence the shift/reduce conflict. I've reduced the language to this:
%token <double> NUM
%token <const char*> VAR
%nterm <double> exp
%left '+'
%precedence TWO
%precedence ONE
%%
input:
%empty
| input line
;
line:
'\n'
| VAR '\n' %prec ONE
| exp '\n' %prec TWO
;
exp:
NUM
| VAR %prec TWO
| exp '+' exp { $$ = $1 + $3; }
;
%%
As you can see, I've tried solving this by adding the ONE and TWO precedences manually to some rules. However, it doesn't seem to work; I always get the exact same conflict. The goal is to prefer the line: VAR '\n' rule for a line consisting of nothing but a variable name, and otherwise parse it as an expression.
For reference, the conflicting state:
State 4
4 line: VAR . '\n'
7 exp: VAR . ['+', '\n']
'\n' shift, and go to state 8
'\n' [reduce using rule 7 (exp)]
$default reduce using rule 7 (exp)
Precedence comparisons are always, without exception, between a production and a token (at least in Yacc/Bison). So you can be sure that if your precedence level list does not contain a real token, it will have no effect whatsoever.
For the same reason, you cannot resolve reduce-reduce conflicts with precedence. That doesn't matter in this case, since it's a shift-reduce conflict, but all the same it's useful to know.
To be even more specific, the precedence comparison is between a reduction (using the precedence of the production to be reduced) and the incoming lookahead token. In this case, the lookahead token is '\n' and the reduction is exp: VAR. Without a %prec modifier, the precedence level of that production is the precedence of VAR, since that is the last terminal symbol in the production. So if you want the shift to win out over the reduction, you need to declare your precedences so that the token to be shifted has the higher level:
%precedence VAR
%precedence '\n'
No pseudotokens (or %prec modifiers) are needed.
This will not change the parse, because Bison always prefers shift if there are no applicable precedence rules. But it will suppress the warning.
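The comparison described above can be modelled in a few lines of Python. This is a simplified sketch of the documented behaviour, not Bison's source; the function name and the integer levels are assumptions made for the example (a higher number means declared later, so it binds tighter).

```python
def resolve_shift_reduce(rule_prec, token_prec, assoc=None):
    """Return (action, conflict_reported) for one shift/reduce conflict.

    rule_prec and token_prec are integer levels, or None when no precedence
    was declared. assoc is the lookahead token's associativity: "left",
    "right", "nonassoc", or None for a plain %precedence declaration.
    """
    if rule_prec is None or token_prec is None:
        # Nothing to compare: default to shift and report the conflict.
        return ("shift", True)
    if token_prec > rule_prec:
        return ("shift", False)
    if token_prec < rule_prec:
        return ("reduce", False)
    # Equal levels: associativity decides.
    if assoc == "left":
        return ("reduce", False)
    if assoc == "right":
        return ("shift", False)
    if assoc == "nonassoc":
        return ("error", False)
    # %precedence declares no associativity, so the conflict stays.
    return ("shift", True)


# Declarations from the question: '\n' never gets a level.
question = {"+": 1, "TWO": 2, "ONE": 3}
print(resolve_shift_reduce(question["TWO"], question.get("\n")))

# Declarations from the answer: %precedence VAR, then %precedence '\n'.
answer = {"VAR": 1, "\n": 2}
print(resolve_shift_reduce(answer["VAR"], answer["\n"]))
```

Both calls pick the shift. Only the first one reports a conflict, because the lookahead token '\n' has no precedence level there to compare against.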

Why doesn't this grammar have a reduce/reduce conflict?

Consider the following (admittedly nonsensical - it has been vastly simplified to illustrate the point) grammar:
negationExpression
: TOK_MINUS constantExpression %prec UNARYOP
| testRule
;
constantExpression
: TOK_INTEGER_CONSTANT
| TOK_FLOAT_CONSTANT
;
testRule
: negationExpression constantExpression // call this Rule 1
| constantExpression // Rule 2
;
Bison does not complain about a reduce/reduce conflict when run on this grammar, but to me it seems like there is one. Assume we have parsed a negationExpression and a constantExpression; to me it seems there are two things the parser could now do, based on the above definition:
1. Reduce the sequence into a testRule using Rule 1 above
2. Reduce the constantExpression into a testRule using Rule 2 above (the negationExpression would be left untouched in this case, so the parse stack would look like this: negationExpression testRule)
However no warnings are emitted, and when I look at the .output file Bison generates, it seems there is no ambiguity whatsoever:
state 5
6 testRule: constantExpression .
$default reduce using rule 6 (testRule)
...
state 9
5 testRule: negationExpression constantExpression .
$default reduce using rule 5 (testRule)
According to the Bison docs:
A reduce/reduce conflict occurs if there are two or more rules that apply to the same sequence of input.
Isn't this precisely the case here?
No, it doesn't apply here.
"Sequence of input" is an unfortunate phrasing; what is meant is really "same input", or possibly more explicitly, "same prefix subsequence of a valid input". In other words, if there are two or more rules which could apply to the entire input, up to the current read point (and taking into account the lookahead).
In your grammar, testRule never follows anything. It (and negationExpression) can only be reduced at the very beginning of some derivation. So if the (partially reduced) input ends with negationExpression constantExpression, it is impossible to reduce constantExpression to testRule, because no derivation of the start symbol can include testRule at a non-initial position.
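The claim that testRule never appears at a non-initial position can be spot-checked by brute force. The sketch below enumerates sentential forms up to a length bound; it assumes negationExpression is the start symbol, and the helper name and the bound of 6 are arbitrary choices for the example. A bounded search is evidence, not a proof.

```python
from collections import deque

# The grammar from the question, without the %prec annotation.
GRAMMAR = {
    "negationExpression": [["TOK_MINUS", "constantExpression"], ["testRule"]],
    "constantExpression": [["TOK_INTEGER_CONSTANT"], ["TOK_FLOAT_CONSTANT"]],
    "testRule": [["negationExpression", "constantExpression"],
                 ["constantExpression"]],
}


def sentential_forms(start, max_len):
    """Breadth-first enumeration of all sentential forms up to max_len symbols."""
    seen = {(start,)}
    queue = deque(seen)
    while queue:
        form = queue.popleft()
        for i, symbol in enumerate(form):
            # Terminals have no entry in GRAMMAR and are left alone.
            for rhs in GRAMMAR.get(symbol, []):
                new = form[:i] + tuple(rhs) + form[i + 1:]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
    return seen


forms = sentential_forms("negationExpression", max_len=6)
# Forms in which testRule appears anywhere other than the first position.
offenders = [form for form in forms if "testRule" in form[1:]]
print(offenders)
```

The list of offenders comes out empty, so negationExpression testRule is never a viable stack configuration, and the parser has no second reduction to consider in state 9.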

How do I fix this shift-reduce conflict in my PLY grammar?

I am writing a grammar for a programming language, but I'm running headfirst into a shift/reduce problem. The problem can be found in the state:
fn_call -> ID . L_PAREN fn_args R_PAREN
assignment -> ID . ASSIGN value
assignment -> ID . ASSIGN container
value -> ID
Before explaining a bit further, I want to clarify:
Is this shift/reduce because the program can't determine if I am calling a function or using the ID as a value (eg. constant or variable)?
Moving on, is it possible to fix this? My language does not currently use line delimiters (such as ';' in C or '\n' in Python). The parser is LALR(1).
What is the most efficient way (adding the fewest rules to the grammar) to distinguish between a function call and a variable with line delimiters?
EDIT: Here is the lookahead for that state.
! shift/reduce conflict for L_PAREN resolved as shift
L_PAREN shift and go to state 60
ASSIGN shift and go to state 61
COMMA reduce using rule 43 (value -> ID .)
R_PAREN reduce using rule 43 (value -> ID .)
DASH reduce using rule 43 (value -> ID .)
R_BRACE reduce using rule 43 (value -> ID .)
NONE reduce using rule 43 (value -> ID .)
DEFN reduce using rule 43 (value -> ID .)
FOR reduce using rule 43 (value -> ID .)
INT_T reduce using rule 43 (value -> ID .)
DBL_T reduce using rule 43 (value -> ID .)
STR_T reduce using rule 43 (value -> ID .)
ID reduce using rule 43 (value -> ID .)
INT reduce using rule 43 (value -> ID .)
DBL reduce using rule 43 (value -> ID .)
STR reduce using rule 43 (value -> ID .)
COMMENT_LINE reduce using rule 43 (value -> ID .)
L_BRACE reduce using rule 43 (value -> ID .)
SET reduce using rule 43 (value -> ID .)
! L_PAREN [ reduce using rule 43 (value -> ID .) ]
The following is just guesswork since you haven't shown much of your grammar. I'm assuming that you allow expressions as statements, and not just function calls. In that case, an expression can start with an (, and a statement can end with an ID. Since you have no statement delimiters (I think), then the following is truly ambiguous:
a = b
(c + d)
After reading the b (ID), it is unclear whether to reduce it to value, as part of the assignment, or to leave it as an ID and shift the ( as part of fn_call.
You can't remove ambiguity by adding productions. :)
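To see that the example really has two readings, here is a small Python sketch that counts the ways a token list can be split into statements. The toy grammar in the comments is a guess at the general shape of the language (assignments, calls, parenthesised sums, no statement delimiter), not the asker's actual grammar, and all names are made up.

```python
def count_parses(tokens):
    """Count the ways tokens can be split into statements of the toy grammar."""
    n = len(tokens)

    def tok(i):
        return tokens[i] if i < n else None

    def value(i):
        # value -> ID | ID '(' total ')' | '(' total ')'
        if tok(i) == "ID":
            yield i + 1
            if tok(i + 1) == "(":
                for j in total(i + 2):
                    if tok(j) == ")":
                        yield j + 1
        elif tok(i) == "(":
            for j in total(i + 1):
                if tok(j) == ")":
                    yield j + 1

    def total(i):
        # total -> value | value '+' total
        for j in value(i):
            yield j
            if tok(j) == "+":
                yield from total(j + 1)

    def statement(i):
        # statement -> ID '=' total | total
        if tok(i) == "ID" and tok(i + 1) == "=":
            yield from total(i + 2)
        yield from total(i)

    def count(i):
        # Number of ways to cover tokens[i:] with whole statements.
        return 1 if i == n else sum(count(j) for j in statement(i))

    return count(0)


# a = b (c + d), with the line break invisible to the parser
tokens = ["ID", "=", "ID", "(", "ID", "+", "ID", ")"]
print(count_parses(tokens))
```

Under these assumptions the count is 2: one reading is the single statement a = b(c + d), the other is a = b followed by the expression statement (c + d).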
If this is the set of items that form a "state" of the parser, then you haven't written it down right:
fn_call -> ID . L_PAREN fn_args R_PAREN
assignment -> ID . ASSIGN value
assignment -> ID . ASSIGN container
value -> ID . *missing lookahead set*
You don't exhibit the rest of your language, so we cannot know what the lookahead set is for the rule
value -> ID
Under the assumption that you indeed have a shift-reduce conflict in this state, then the lookahead set must contain "ASSIGN" or "L_PAREN". I can't tell you how to fix your problem without knowing more.
Given that your present grammar has these issues, you cannot fix this simply by "adding rules" of any kind, whether they involve line delimiters or not, because adding rules will not change what is already in the lookahead sets (it may add more tokens to existing sets).
EDIT: One way out of your problem may be to switch parsing technologies. Your problem is that LALR parsers cannot handle the local ambiguity that you seem to have. However, your overall grammar may not have an actual ambiguity if you look further ahead. That depends on your language syntax, but you are rolling your own, so you can do as you please. I suggest looking into GLR parsing technology, which can handle arbitrary lookahead; check out recent versions of Bison.

bison shift reduce conflict, i don't know where

Grammar: http://pastebin.com/ef2jt8Rg
y.output: http://pastebin.com/AEKXrrRG
I don't know where those conflicts are. Can someone help me with this?
The y.output file tells you exactly where the conflicts are. The first one is in state 4, so if you go down to look at state 4, you see:
state 4
99 compound_statement: '{' . '}'
100 | '{' . statement_list '}'
IDENTIFIER shift, and go to state 6
:
IDENTIFIER [reduce using rule 1 (threat_as_ref)]
IDENTIFIER [reduce using rule 2 (func_call_start)]
This is telling you that in this state (parsing a compound_statement, having seen a {), and looking at the next token being IDENTIFIER, there are 3 possible things it could do -- shift the token (which would be the beginning of a statement_list), reduce the threat_as_ref empty production, or reduce the func_call_start empty production.
The brackets tell you that it has decided to never do those actions -- the default "prefer shift over reduce" conflict resolution means that it will always do the shift.
The problem with your grammar is these empty rules threat_as_ref and func_call_start: they need to be reduced BEFORE shifting the IDENTIFIER, but in order to know whether they're valid, the parser would need to see the tokens AFTER the identifier. func_call_start should only be reduced if this is the beginning of a function call (which depends on there being a ( after the IDENTIFIER). So the parser needs more lookahead to deal with your grammar. In your specific case, your grammar is LALR(2) (2-token lookahead would suffice), but not LALR(1), so bison can't deal with it.
Now you could fix it by just getting rid of those empty rules -- func_call_start has no action at all, and the action for threat_as_ref could be moved into the action for variable, but if you want those rules in the future that may be a problem.
(1) I see at least one thing that looks odd. Your productions for expression_statement are similar to those for postfix_statement, but not quite the same. They don't have the '(' and ')' tokens:
expression_statement
: ';'
| expression ';'
| func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } func_call_end ';'
| func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } argument_expression_list func_call_end ';'
;
Since an expression can be a primary_expression, which can be an IDENTIFIER, and since func_call_start and func_call_end are epsilon (null) productions, when presented with the input
foo;
the parser has to decide whether to apply
expression_statement : expression ';'
or
expression_statement : func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } func_call_end ';'
(2) Also, I'm not certain of this, but I suspect the epsilon non-terminal threat_as_ref might be causing you some trouble. I have not traced it through, but there may be a case where the parser has to decide whether something is a variable_ref or a variable.

Bison Shift/Reduce conflict for simple grammar

I'm building a parser for a language I've designed, in which type names start with an upper case letter and variable names start with a lower case letter, such that the lexer can tell the difference and provide different tokens. Also, the string 'this' is recognised by the lexer (it's an OOP language) and passed as a separate token. Finally, data members can only be accessed on the 'this' object, so I built the grammar as so:
%token TYPENAME
%token VARNAME
%token THIS
%%
start:
Expression
;
Expression:
THIS
| THIS '.' VARNAME
| Expression '.' TYPENAME
;
%%
The first rule of Expression allows the user to pass 'this' around as a value (for example, returning it from a method or passing it to a method call). The second is for accessing data on 'this'. The third rule is for calling methods; I've removed the brackets and parameters since they are irrelevant to the problem. The original grammar was clearly much larger than this, but this is the smallest part that generates the same error (1 shift/reduce conflict). I isolated it into its own parser file and verified this, so the error has nothing to do with any other symbols.
As far as I can see, the grammar given here is unambiguous and so should not produce any errors. If you remove any of the three rules or change the second rule to
Expression '.' VARNAME
there is no conflict. In any case, I probably need someone to state the obvious of why this conflict occurs and how to resolve it.
The problem is that the parser can only look one token ahead. So when it has seen THIS and the next token is '.', is it in line 2 (Expression: THIS '.' VARNAME) or line 3 (Expression: Expression '.' TYPENAME, via a reduction according to line 1)?
The parser could reduce THIS to Expression and then look for '.' TYPENAME, or it could shift the '.' and look for a VARNAME, but it has to decide when it gets to the '.'.
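For contrast, a hand-written recognizer is not tied to a single lookahead token, so it can peek past the '.' before committing. The Python sketch below accepts the same language as the three rules in the question; the function name and loop structure are my own, and it is meant as an illustration of what information is missing at the '.', not as a replacement for the Bison grammar.

```python
def accepts(tokens):
    """Recognise THIS ('.' VARNAME)? ('.' TYPENAME)* by hand."""
    if not tokens or tokens[0] != "THIS":
        return False
    i = 1
    first_member = True
    while i < len(tokens):
        # Every further step must be a '.' followed by a member name.
        if tokens[i] != "." or i + 1 >= len(tokens):
            return False
        member = tokens[i + 1]  # one token beyond the '.' lookahead
        if member == "VARNAME" and first_member:
            pass  # Expression: THIS '.' VARNAME
        elif member == "TYPENAME":
            pass  # Expression: Expression '.' TYPENAME
        else:
            return False
        first_member = False
        i += 2
    return True


print(accepts(["THIS", ".", "VARNAME", ".", "TYPENAME"]))
print(accepts(["THIS", ".", "TYPENAME", ".", "VARNAME"]))
```

The variable member is exactly the token the LR(1) parser cannot see when it has to choose between reducing THIS and shifting the '.'.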
I try to avoid y.output but sometimes it does help. I looked at the file it produced and saw.
state 1
2 Expression: THIS . [$end, '.']
3 | THIS . '.' VARNAME
'.' shift, and go to state 4
'.' [reduce using rule 2 (Expression)]
$default reduce using rule 2 (Expression)
Basically it is saying that it sees '.' and can either reduce or shift. Reduces make me angry sometimes because they are hard to find. The shift is rule 3 and is obvious (though the output doesn't mention the rule number). The reduce on '.' in this case comes from the line
| Expression '.' TYPENAME
When the parser enters Expression, it looks at the next token (the '.') and goes in. Now it has seen THIS, so when it reaches the end of that rule it expects a '.' (or the end of input) to follow, otherwise it is an error. However, it is also sitting between THIS and '.' in rule 3 (hence the dot in the output file), and it CAN reduce a rule there, so there is a conflict between the two paths. I believe you can use %glr-parser to let it try both, but the more conflicts you have, the more likely you are to get unexpected output or an ambiguity error. I have had ambiguity errors in the past. They are annoying to deal with, especially if you don't remember which rule caused or affected them. It is recommended to avoid conflicts.
I highly recommend this book before attempting to use bison.
I can't think of a 'great' solution, but this gives no conflicts:
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
| THIS // the trick is moving this AWAY from rval so it doesn't reduce early
;
rval:
THIS '.' VARNAME
;
Alternatively, you can make it reduce later by adding more to the rule so that it doesn't reduce as soon, or by adding a token before or after to make it clear which path to take or fail on (remember, it must know BEFORE reducing ANY path):
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
;
rval:
THIS '#'
| THIS '.' VARNAME
;
%%
-edit- Note that if I want to write func param and type varname as separate rules, I can't: according to the lexer, func is a Var (which is A-Za-z0-9_), and so is type. param and varname are both Vars as well, so this will cause a reduce/reduce conflict. You can't write tokens as what they are, only as what they look like, so keep that in mind when writing the grammar. You'll have to either produce a distinct token to differentiate the two, or write them as one rule and add logic in the action code (the part in { } on the right side of the rules) to check whether it is a funcname or a type and handle both cases.

Resources