Why doesn't this grammar have a reduce/reduce conflict? - parsing

Consider the following (admittedly nonsensical - it has been vastly simplified to illustrate the point) grammar:
negationExpression
: TOK_MINUS constantExpression %prec UNARYOP
| testRule
;
constantExpression
: TOK_INTEGER_CONSTANT
| TOK_FLOAT_CONSTANT
;
testRule
: negationExpression constantExpression // call this Rule 1
| constantExpression // Rule 2
;
Bison does not complain about a reduce/reduce conflict when ran on this grammar, but to me it seems like there is one. Assume we have parsed a negationExpression and a constantExpression; to me it seems there are two things the parser could now do, based on the above definition:
Reduce the sequence into a testRule using Rule 1 above
Reduce the constantExpression into a testRule using Rule 2 above (the negationExpression would be left untouched in this case, so the parse stack would look like this: negationExpression testRule)
However no warnings are emitted, and when I look at the .output file Bison generates, it seems there is no ambiguity whatsoever:
state 5
6 testRule: constantExpression .
$default reduce using rule 6 (testRule)
...
state 9
5 testRule: negationExpression constantExpression .
$default reduce using rule 5 (testRule)
According to the Bison docs:
A reduce/reduce conflict occurs if there are two or more rules that apply to the same sequence of input.
Isn't this precisely the case here?

No, it doesn't apply here.
"Sequence of input" is an unfortunate phrasing; what is meant is really "same input", or possibly more explicitly, "same prefix subsequence of a valid input". In other words, if there are two or more rules which could apply to the entire input, up to the current read point (and taking into account the lookahead).
In your grammar, testRule never follows anything. It (and negationExpression ) can only be reduced at the very beginning of some derivation. So if the (partially-reduced) input ends with negationExpression constantExpression, it is impossible to reduce constantExpression to testRule because no derivation of the start symbol can include testRule at a non-initial position.

Related

ANTLR: Why is this grammar rule for a tuples not LL(1)?

I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.

bison shift reduce conflict, i don't know where

Grammar: http://pastebin.com/ef2jt8Rg
y.output: http://pastebin.com/AEKXrrRG
I don't know where is those conflicts, someone can help me with this?
The y.output file tells you exactly where the conflicts are. The first one is in state 4, so if you go down to look at state 4, you see:
state 4
99 compound_statement: '{' . '}'
100 | '{' . statement_list '}'
IDENTIFIER shift, and go to state 6
:
IDENTIFIER [reduce using rule 1 (threat_as_ref)]
IDENTIFIER [reduce using rule 2 (func_call_start)]
This is telling you that in this state (parsing a compound_statement, having seen a {), and looking at the next token being IDENTIFIER, there are 3 possible things it could do -- shift the token (which would be the beginning of a statement_list), reduce the threat_as_ref empty production, or reduce the func_call_start empty production.
The brackets tell you that it has decided to never do those actions -- the default "prefer shift over reduce" conflict resolution means that it will always do the shift.
The problem with your grammar is these empty rules threat_as_ref and func_call_start -- they need to be reduced BEFORE shifting the IDENTIFIER, but in order to know if they're valid, the parser would need to see the tokens AFTER the identifer. func_call_start should only be reduced if this is the beginning of the function call (which depends on there being a ( after the IDENTIFIER.) So the parser needs more lookahead to deal with your gramar. In your specific case, you grammar is LALR(2) (2 token lookahead would suffice), but not LALR(1), so bison can't deal with it.
Now you could fix it by just getting rid of those empty rules -- func_call_start has no action at all, and the action for threat_as_ref could be moved into the action for variable, but if you want those rules in the future that may be a problem.
(1) I see at least one thing that looks odd. Your productions for expression_statement are similar to those for postfix_statement, but not quite the same. They don't have the '(' and ')' tokens:
expression_statement
: ';'
| expression ';'
| func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } func_call_end ';'
| func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } argument_expression_list func_call_end ';'
;
Since an expression can be a primary_expression, which can be an IDENTIFIER, and since func_call_start and func_call_end are epsilon (null) productions, when presented with the input
foo;
the parser has to decide whether to apply
expression_statement : expression ';'
or
expression_statement : func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } func_call_end ';'
(2) Also, I'm not certain of this, but I suspect the epsilon non-terminal threat_as_ref might be causing you some trouble. I have not traced it through, but there may be a case where the parser has to decide whether something is a variable_ref or a variable.

Bison Shift/Reduce conflict for simple grammar

I'm building a parser for a language I've designed, in which type names start with an upper case letter and variable names start with a lower case letter, such that the lexer can tell the difference and provide different tokens. Also, the string 'this' is recognised by the lexer (it's an OOP language) and passed as a separate token. Finally, data members can only be accessed on the 'this' object, so I built the grammar as so:
%token TYPENAME
%token VARNAME
%token THIS
%%
start:
Expression
;
Expression:
THIS
| THIS '.' VARNAME
| Expression '.' TYPENAME
;
%%
The first rule of Expression allows the user to pass 'this' around as a value (for example, returning it from a method or passing to a method call). The second is for accessing data on 'this'. The third rule is for calling methods, however I've removed the brackets and parameters since they are irrelevant to the problem. The originally grammar was clearly much larger than this, however this is the smallest part that generates the same error (1 Shift/Reduce conflict) - I isolated it into its own parser file and verified this, so the error has nothing to do with any other symbols.
As far as I can see, the grammar given here is unambiguous and so should not produce any errors. If you remove any of the three rules or change the second rule to
Expression '.' VARNAME
there is no conflict. In any case, I probably need someone to state the obvious of why this conflict occurs and how to resolve it.
The problem is that the grammar can only look one ahead. Therefore when you see a THIS then a ., are you in line 2(Expression: THIS '.' VARNAME) or line 3 (Expression: Expression '.' TYPENAME, via a reduction according to line 1).
The grammar could reduce THIS. to Expression. and then look for a TYPENAME or shift it to THIS. and look for a VARNAME, but it has to decide when it gets to the ..
I try to avoid y.output but sometimes it does help. I looked at the file it produced and saw.
state 1
2 Expression: THIS. [$end, '.']
3 | THIS . '.' VARNAME
'.' shift, and go to state 4
'.' [reduce using rule 2 (Expression)]
$default reduce using rule 2 (Expression)
Basically it is saying it sees '.' and can reduce or it can shift. Reduce makes me anrgu sometimes because they are hard to fine. The shift is rule 3 and is obvious (but the output doesnt mention the rule #). The reduce where it see's '.' in this case is the line
| Expression '.' TYPENAME
When it goes to Expression it looks at the next letter (the '.') and goes in. Now it sees THIS | so when it gets to the end of that statement it expects '.' when it leaves or an error. However it sees THIS '.' while its between this and '.' (hence the dot in the out file) and it CAN reduce a rule so there is a path conflict. I believe you can use %glr-parser to allow it to try both but the more conflicts you have the more likely you'll either get unexpected output or an ambiguity error. I had ambiguity errors in the past. They are annoying to deal with especially if you dont remember what rule caused or affected them. it is recommended to avoid conflicts.
I highly recommend this book before attempting to use bison.
I cant think of a 'great' solution but this gives no conflicts
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
| THIS //trick is moving this AWAY so it doesnt reduce
rval:
THIS '.' VARNAME
Alternative you can make it reduce later by adding more to the rule so it doesnt reduce as soon or by adding a token after or before to make it clear which path to take or fails (remember, it must know BEFORE reducing ANY path)
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
rval:
THIS '#'
| THIS '.' VARNAME
%%
-edit- note if i want to do func param and type varname i cant because type according to the lexer func is a Var (which is A-Za-z09_) as well as type. param and varname are both var's as well so this will cause me a reduce/reduce conflict. You cant write this as what they are, only what they look like. So keep that in mind when writing. You'll have to write a token to differentiate the two or write it as one of the two but write additional logic in code (the part that is in { } on the right side of the rules) to check if it is a funcname or a type and handle both those case.

Shift-reduce: when to stop reducing?

I'm trying to learn about shift-reduce parsing. Suppose we have the following grammar, using recursive rules that enforce order of operations, inspired by the ANSI C Yacc grammar:
S: A;
P
: NUMBER
| '(' S ')'
;
M
: P
| M '*' P
| M '/' P
;
A
: M
| A '+' M
| A '-' M
;
And we want to parse 1+2 using shift-reduce parsing. First, the 1 is shifted as a NUMBER. My question is, is it then reduced to P, then M, then A, then finally S? How does it know where to stop?
Suppose it does reduce all the way to S, then shifts '+'. We'd now have a stack containing:
S '+'
If we shift '2', the reductions might be:
S '+' NUMBER
S '+' P
S '+' M
S '+' A
S '+' S
Now, on either side of the last line, S could be P, M, A, or NUMBER, and it would still be valid in the sense that any combination would be a correct representation of the text. How does the parser "know" to make it
A '+' M
So that it can reduce the whole expression to A, then S? In other words, how does it know to stop reducing before shifting the next token? Is this a key difficulty in LR parser generation?
Edit: An addition to the question follows.
Now suppose we parse 1+2*3. Some shift/reduce operations are as follows:
Stack | Input | Operation
---------+-------+----------------------------------------------
| 1+2*3 |
NUMBER | +2*3 | Shift
A | +2*3 | Reduce (looking ahead, we know to stop at A)
A+ | 2*3 | Shift
A+NUMBER | *3 | Shift (looking ahead, we know to stop at M)
A+M | *3 | Reduce (looking ahead, we know to stop at M)
Is this correct (granted, it's not fully parsed yet)? Moreover, does lookahead by 1 symbol also tell us not to reduce A+M to A, as doing so would result in an inevitable syntax error after reading *3 ?
The problem you're describing is an issue with creating LR(0) parsers - that is, bottom-up parsers that don't do any lookahead to symbols beyond the current one they are parsing. The grammar you've described doesn't appear to be an LR(0) grammar, which is why you run into trouble when trying to parse it w/o lookahead. It does appear to be LR(1), however, so by looking 1 symbol ahead in the input you could easily determine whether to shift or reduce. In this case, an LR(1) parser would look ahead when it had the 1 on the stack, see that the next symbol is a +, and realize that it shouldn't reduce past A (since that is the only thing it could reduce to that would still match a rule with + in the second position).
An interesting property of LR grammars is that for any grammar which is LR(k) for k>1, it is possible to construct an LR(1) grammar which is equivalent. However, the same does not extend all the way down to LR(0) - there are many grammars which cannot be converted to LR(0).
See here for more details on LR(k)-ness:
http://en.wikipedia.org/wiki/LR_parser
I'm not exactly sure of the Yacc / Bison parsing algorithm and when it prefers shifting over reducing, however I know that Bison supports LR(1) parsing which means it has a lookahead token. This means that tokens aren't passed to the stack immediately. Rather they wait until no more reductions can happen. Then, if shifting the next token makes sense it applies that operation.
First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token to an A because the '+' lookahead token indicates that its the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will shift the 2 and reduce to an M since A + M produces an A and the expression is complete.

How to resolve a shift-reduce conflict in unambiguous grammar

I'm trying to parse a simple grammar using an LALR(1) parser generator (Bison, but the problem is not specific to that tool), and I'm hitting a shift-reduce conflict. The docs and other sources I've found about fixing these tend to say one or more of the following:
If the grammar is ambiguous (e.g. if-then-else ambiguity), change the language to fix the ambiguity.
If it's an operator precedence issue, specify precedence explicitly.
Accept the default resolution and tell the generator not to complain about it.
However, none of these seem to apply to my situation: the grammar is unambiguous so far as I can tell (though of course it's ambiguous with only one character of lookahead), it has only one operator, and the default resolution leads to parse errors on correctly-formed input. Are there any techniques for reworking the definition of a grammar to remove shift-reduce conflicts that don't fall into the above buckets?
For concreteness, here's the grammar in question:
%token LETTER
%%
%start input;
input: /* empty */ | input input_elt;
input_elt: rule | statement;
statement: successor ';';
rule: LETTER "->" successor ';';
successor: /* empty */ | successor LETTER;
%%
The intent is to parse semicolon-separated lines of the form "[A-Za-z]+" or "[A-Za-z] -> [A-Za-z]+".
Using the Solaris version of yacc, I get:
1: shift/reduce conflict (shift 5, red'n 7) on LETTER
state 1
$accept : input_$end
input : input_input_elt
successor : _ (7)
$end accept
LETTER shift 5
. reduce 7
input_elt goto 2
rule goto 3
statement goto 4
successor goto 6
So, the trouble is, as it very often is, the empty rule - specifically, the empty successor. It isn't completely clear whether you want to allow a semi-colon as a valid input - at the moment, it is. If you modified the successor rule to:
successor: LETTER | successor LETTER;
the shift/reduce conflict is eliminated.
Thanks for whittling down the grammar and posting it. Changing the successor rule to -
successor: /* empty */ | LETTER successor;
...worked for me. ITYM the language looked unambiguous.

Resources