I've written a grammar that as rules as follows:
A : B '?'
| B
| A '+' A
;
B : "a"
| "c" A "t" A
;
And this gives me a shift/reduce conflict on
A : B . '?' (96)
A : B . (98)
I've tried multiple ways to change the grammar but I seem to create even more conflicts when I try to change things. How can I remove this conflict?
Thank you in advance, any help will be appreciated.
LALR(1) parsers resolve their conflicts using a single-character lookahead. When the lookahead can't disambiguate between the different actions, the conflict is then shown to the user.
In the following state, a "?" lookahead means the parser can shift.
A : B . '?'
A : B .
But what if it reduces A? It could possible reduce into the following state:
B: "c" A "t" A .
Which, by reducing B, could lead right back into:
A : B . '?'
A : B .
Therefore, "?" is also a valid lookahead to reduce.
So how can you solve this?
You have two options:
1) Try to rewrite the grammar with left-recursion instead of right-recursion. It might help, but that's not guaranteed.
2) Tell YACC which one to choose whenever it has this conflict (For example, using %left or %right).
Or, perhaps, use a smarter parser. For example elkhound or antlr.
Related
I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.
I've been using the Antlr Matlab grammar from Antlr grammars
I found out I need to implement the ' Matlab operator. It is the complex conjugate transpose operator, used as such
result = input'
I tried a straightforward solution of adding it to unary_expression as an option postfix_expression '\''
However, this failed to parse when multiple of these operators were used on a single line.
Here's a significantly simplified version of the grammar, still exhibiting the exact problem:
grammar Grammar;
unary_expression
: IDENTIFIER
| unary_expression '\''
;
translation_unit : unary_expression CR ;
STRING_LITERAL : '\'' [a-z]* '\'' ;
IDENTIFIER : [a-zA-Z] ;
CR : [\r\n] + ;
Test cases, being parsed as translation_unit:
"x''\n" //fails getNumberOfSyntaxErrors returns 1
"x'\n" //passes
The failure also prints the message line 1:1 extraneous input '''' expecting CR to stderr.
The failure goes away if I either remove STRING_LITERAL, or change the * to +. Neither is a proper solution of course, as removing it is entirely off the table, and mandating non-empty strings is not quite correct, though I might be able to live with it. Also, forcing non-empty string does nothing to help the real use case, when the input is something like x' + y' instead of using the operator twice.
For some reason removing CR from the grammar and \n from the tests also makes the parsing run without problems, but yet again is not a useable solution.
What can I do to the grammar to make it work correctly? I'm assuming it's a problem with lexing specifically because removing STRING_LITERAL or making it unable to match '' makes it go away.
The lexer can never be made that context aware I think, but I don't know Matlab well enough to be sure. How could you check during tokenisation that these single quotes are operators:
x' + y';
while these are strings:
x = 'x' + ' + y';
?
Maybe you can do something similar as how in ECMAScript a / can be a division operator or a regex delimiter. In this grammar that is handled by a predicate in the lexer that uses some target code to check this.
If something like the above is not possible, I see no other way than to "promote" the creation of strings to the parser. That would mean removing STRING_LITERAL and introducing a parser rule that matches something like this:
string_literal
: QUOTE ~(QUOTE | CR)* QUOTE
;
// Needed to match characters inside strings
OTHER
: .
;
However, that will fail when a string like 'hi there' is encountered: the space in between hi and there will now be skipped by the WS rule. So WS should also be removed (spaces will then get matched by the OTHER rule). But now (of course) all spaces will litter the token stream and you'll have to account for them in all parser rules (not really a viable solution).
All in all: I don't see ANTLR as a suitable tool in this case. You might look into parser generators where there is no separation between tokenisation and parsing. Google for "PEG" and/or "scannerless parsing".
Grammar: http://pastebin.com/ef2jt8Rg
y.output: http://pastebin.com/AEKXrrRG
I don't know where is those conflicts, someone can help me with this?
The y.output file tells you exactly where the conflicts are. The first one is in state 4, so if you go down to look at state 4, you see:
state 4
99 compound_statement: '{' . '}'
100 | '{' . statement_list '}'
IDENTIFIER shift, and go to state 6
:
IDENTIFIER [reduce using rule 1 (threat_as_ref)]
IDENTIFIER [reduce using rule 2 (func_call_start)]
This is telling you that in this state (parsing a compound_statement, having seen a {), and looking at the next token being IDENTIFIER, there are 3 possible things it could do -- shift the token (which would be the beginning of a statement_list), reduce the threat_as_ref empty production, or reduce the func_call_start empty production.
The brackets tell you that it has decided to never do those actions -- the default "prefer shift over reduce" conflict resolution means that it will always do the shift.
The problem with your grammar is these empty rules threat_as_ref and func_call_start -- they need to be reduced BEFORE shifting the IDENTIFIER, but in order to know if they're valid, the parser would need to see the tokens AFTER the identifer. func_call_start should only be reduced if this is the beginning of the function call (which depends on there being a ( after the IDENTIFIER.) So the parser needs more lookahead to deal with your gramar. In your specific case, you grammar is LALR(2) (2 token lookahead would suffice), but not LALR(1), so bison can't deal with it.
Now you could fix it by just getting rid of those empty rules -- func_call_start has no action at all, and the action for threat_as_ref could be moved into the action for variable, but if you want those rules in the future that may be a problem.
(1) I see at least one thing that looks odd. Your productions for expression_statement are similar to those for postfix_statement, but not quite the same. They don't have the '(' and ')' tokens:
expression_statement
: ';'
| expression ';'
| func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } func_call_end ';'
| func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } argument_expression_list func_call_end ';'
;
Since an expression can be a primary_expression, which can be an IDENTIFIER, and since func_call_start and func_call_end are epsilon (null) productions, when presented with the input
foo;
the parser has to decide whether to apply
expression_statement : expression ';'
or
expression_statement : func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } func_call_end ';'
(2) Also, I'm not certain of this, but I suspect the epsilon non-terminal threat_as_ref might be causing you some trouble. I have not traced it through, but there may be a case where the parser has to decide whether something is a variable_ref or a variable.
I've got a simple grammar. Actually, the grammar I'm using is more complex, but this is the smallest subset that illustrates my question.
Expr ::= Value Suffix
| "(" Expr ")" Suffix
Suffix ::= "->" Expr
| "<-" Expr
| Expr
| epsilon
Value matches identifiers, strings, numbers, et cetera. The Suffix rule is there to eliminate left-recursion. This matches expressions such as:
a -> b (c -> (d) (e))
That is, a graph where a goes to both b and the result of (c -> (d) (e)), and c goes to d and e. I'm trying to produce an abstract syntax tree for these expressions, but I'm running into difficulty because all of the operators can accept any number of operands on each side. I'd rather keep the logic for producing the AST within the recursive descent parsing methods, since it avoids having to duplicate the logic of extracting an expression. My current strategy is as follows:
If a Value appears, push it to the output.
If a From or To appears:
Output a separator.
Get the next Expr.
Create a Link node.
Pop the first set of operands from output into the Link until a separator appears.
Erase the separator discovered.
Pop the second set of operands into the Link until a separator.
Push the Link to the output.
If I run this through without obeying steps 2.3–2.7, I get a list of values and separators. For the expression quoted above, a -> b (c -> (d) (e)), the output should be:
A sep_1 B sep_2 C sep_3 D E
Applying the To rule would then yield:
A sep_1 B sep_2 (link from C to {D, E})
And subsequently:
(link from A to {B, (link from C to {D, E})})
The important thing to note is that sep_2, crucial to delimit the left-hand operands of the second ->, does not appear, so the parser believes that the expression was actually written:
a -> (b c -> (d) (e))
In order to solve this with my current strategy, I would need a way to produce a separator between adjacent expressions, but only if the current expression is a From or To expression enclosed in parentheses. If that's possible, then I'm just not seeing it and the answer ought to be simple. If there's a better way to go about this, however, then please let me know!
I haven't tried to analyze it in detail, but: "From or To expression enclosed in parentheses" starts to sound a lot like "context dependent", which recursive descent can't handle directly. To avoid context dependence you'll probably need a separate production for a From or To in parentheses vs. a From or To without the parens.
Edit: Though it may be too late to do any good, if my understanding of what you want to match is correct, I think I'd write it more like this:
Graph :=
| List Sep Graph
;
Sep := "->"
| "<-"
;
List :=
| Value List
;
Value := Number
| Identifier
| String
| '(' Graph ')'
;
It's hard to be certain, but I think this should at least be close to matching (only) the inputs you want, and should make it reasonably easy to generate an AST that reflects the input correctly.
I'm trying to learn about shift-reduce parsing. Suppose we have the following grammar, using recursive rules that enforce order of operations, inspired by the ANSI C Yacc grammar:
S: A;
P
: NUMBER
| '(' S ')'
;
M
: P
| M '*' P
| M '/' P
;
A
: M
| A '+' M
| A '-' M
;
And we want to parse 1+2 using shift-reduce parsing. First, the 1 is shifted as a NUMBER. My question is, is it then reduced to P, then M, then A, then finally S? How does it know where to stop?
Suppose it does reduce all the way to S, then shifts '+'. We'd now have a stack containing:
S '+'
If we shift '2', the reductions might be:
S '+' NUMBER
S '+' P
S '+' M
S '+' A
S '+' S
Now, on either side of the last line, S could be P, M, A, or NUMBER, and it would still be valid in the sense that any combination would be a correct representation of the text. How does the parser "know" to make it
A '+' M
So that it can reduce the whole expression to A, then S? In other words, how does it know to stop reducing before shifting the next token? Is this a key difficulty in LR parser generation?
Edit: An addition to the question follows.
Now suppose we parse 1+2*3. Some shift/reduce operations are as follows:
Stack | Input | Operation
---------+-------+----------------------------------------------
| 1+2*3 |
NUMBER | +2*3 | Shift
A | +2*3 | Reduce (looking ahead, we know to stop at A)
A+ | 2*3 | Shift
A+NUMBER | *3 | Shift (looking ahead, we know to stop at M)
A+M | *3 | Reduce (looking ahead, we know to stop at M)
Is this correct (granted, it's not fully parsed yet)? Moreover, does lookahead by 1 symbol also tell us not to reduce A+M to A, as doing so would result in an inevitable syntax error after reading *3 ?
The problem you're describing is an issue with creating LR(0) parsers - that is, bottom-up parsers that don't do any lookahead to symbols beyond the current one they are parsing. The grammar you've described doesn't appear to be an LR(0) grammar, which is why you run into trouble when trying to parse it w/o lookahead. It does appear to be LR(1), however, so by looking 1 symbol ahead in the input you could easily determine whether to shift or reduce. In this case, an LR(1) parser would look ahead when it had the 1 on the stack, see that the next symbol is a +, and realize that it shouldn't reduce past A (since that is the only thing it could reduce to that would still match a rule with + in the second position).
An interesting property of LR grammars is that for any grammar which is LR(k) for k>1, it is possible to construct an LR(1) grammar which is equivalent. However, the same does not extend all the way down to LR(0) - there are many grammars which cannot be converted to LR(0).
See here for more details on LR(k)-ness:
http://en.wikipedia.org/wiki/LR_parser
I'm not exactly sure of the Yacc / Bison parsing algorithm and when it prefers shifting over reducing, however I know that Bison supports LR(1) parsing which means it has a lookahead token. This means that tokens aren't passed to the stack immediately. Rather they wait until no more reductions can happen. Then, if shifting the next token makes sense it applies that operation.
First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token to an A because the '+' lookahead token indicates that its the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will shift the 2 and reduce to an M since A + M produces an A and the expression is complete.