I'm building a parser for a language I've designed, in which type names start with an upper case letter and variable names start with a lower case letter, such that the lexer can tell the difference and provide different tokens. Also, the string 'this' is recognised by the lexer (it's an OOP language) and passed as a separate token. Finally, data members can only be accessed on the 'this' object, so I built the grammar as so:
%token TYPENAME
%token VARNAME
%token THIS
%%
start:
Expression
;
Expression:
THIS
| THIS '.' VARNAME
| Expression '.' TYPENAME
;
%%
The first rule of Expression allows the user to pass 'this' around as a value (for example, returning it from a method or passing it to a method call). The second is for accessing data on 'this'. The third rule is for calling methods; I've removed the brackets and parameters since they are irrelevant to the problem. The original grammar was clearly much larger than this, but this is the smallest part that generates the same error (1 shift/reduce conflict). I isolated it into its own parser file and verified this, so the error has nothing to do with any other symbols.
As far as I can see, the grammar given here is unambiguous and so should not produce any errors. If you remove any of the three rules or change the second rule to
Expression '.' VARNAME
there is no conflict. In any case, I probably need someone to state the obvious: why does this conflict occur, and how do I resolve it?
The problem is that the parser can only look one token ahead. When it has seen THIS and the next token is '.', it cannot tell whether it is in the second alternative (Expression: THIS '.' VARNAME) or the third (Expression: Expression '.' TYPENAME, reached by first reducing THIS according to the first alternative).
The parser could reduce THIS to Expression and then look for '.' TYPENAME, or it could shift the '.' and look for a VARNAME, but it has to decide when it reaches the '.', before it can see which kind of name follows.
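The decision can be seen in a small hand-written recognizer. This is only a sketch in Python, not something Bison generates; the function name recognize_expression and the token-kind strings are made up for illustration. The point is that the check right after THIS has to look at two tokens, which is exactly what an LALR(1) parser cannot do.

```python
def recognize_expression(tokens):
    """Recognize the three-alternative Expression grammar from a list of token kinds.

    Returns the list of alternatives applied, in order.
    """
    steps = []
    pos = 0
    if tokens[pos:pos + 1] != ["THIS"]:
        raise SyntaxError("expected THIS")
    pos += 1

    # Decision point: in both conflicting cases the next token is '.'.
    # Only the token AFTER the '.' (two tokens ahead) tells them apart.
    if tokens[pos:pos + 2] == [".", "VARNAME"]:
        steps.append("Expression: THIS '.' VARNAME")
        pos += 2
    else:
        steps.append("Expression: THIS")

    # Any number of trailing ". TYPENAME" applications.
    while tokens[pos:pos + 1] == ["."]:
        if tokens[pos + 1:pos + 2] != ["TYPENAME"]:
            raise SyntaxError("expected TYPENAME after '.'")
        steps.append("Expression: Expression '.' TYPENAME")
        pos += 2

    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return steps
```

For example, recognize_expression(["THIS", ".", "TYPENAME"]) takes the bare THIS alternative first, while recognize_expression(["THIS", ".", "VARNAME"]) does not, even though both inputs start with THIS followed by '.'.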
I try to avoid y.output, but sometimes it does help. I looked at the file it produced and saw:
state 1
2 Expression: THIS . [$end, '.']
3 | THIS . '.' VARNAME
'.' shift, and go to state 4
'.' [reduce using rule 2 (Expression)]
$default reduce using rule 2 (Expression)
Basically it is saying that when it sees '.' it can either reduce or shift. Reduces make me angry sometimes because they are hard to find. The shift belongs to rule 3 and is obvious (although the output doesn't mention the rule number). The reduce on '.' in this case comes from the line
| Expression '.' TYPENAME
To match that rule the parser first needs an Expression, and a bare THIS is one. Once it reaches the end of that alternative it expects a '.' to follow, or else an error. But it is also partway through THIS '.' VARNAME, sitting between THIS and '.' (hence the dot in the output file), and there it can shift. Since it can reduce for one rule and shift for another on the same token, there is a conflict. I believe you can use %glr-parser to let it try both, but the more conflicts you have, the more likely you'll get either unexpected output or an ambiguity error. I have had ambiguity errors in the past. They are annoying to deal with, especially if you don't remember which rule caused or affected them. It is recommended to avoid conflicts.
I highly recommend this book before attempting to use bison.
I can't think of a 'great' solution, but this gives no conflicts:
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
| THIS // the trick is moving THIS away from rval so it is not reduced early
;
rval:
THIS '.' VARNAME
;
Alternatively, you can make it reduce later by adding more to the rule so that it doesn't reduce as soon, or by adding a token before or after to make it clear which path to take or where to fail (remember, it must know BEFORE reducing ANY path):
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
;
rval:
THIS '#'
| THIS '.' VARNAME
;
%%
-edit- Note: if I want to write both func param and type varname, I can't, because according to the lexer func is a Var (which is [A-Za-z0-9_]) and so is type. param and varname are both Vars as well, so this will cause a reduce/reduce conflict. You can't write rules in terms of what the tokens are, only what they look like. So keep that in mind when writing. You'll have to add a token to differentiate the two, or write the rule as one of the two and add logic in code (the part in { } on the right side of the rules) to check whether it is a funcname or a type and handle both cases.
Related
I have some grammar here in Bison: https://pastebin.com/raw/dA2bypFR.
It's fairly long but not very complex.
The problem is that after a call it won't accept anything other than ;. For example, a(b)(c) is invalid and a(b).c is invalid; both only accept a semicolon after the closing parenthesis.
a(b)+c is fine though.
I tried separating call_or_getattr into 2 where . has higher precedence than ( but this meant that a().b was invalid grammar.
I also tried putting call and getattr into the definition for basic_operand, but this resulted in 536 shift/reduce conflicts.
Your last production reads as follows (without the actions, which are an irrelevant distraction):
call_or_getattr:
basic_operand
| basic_operand '(' csv ')'
| basic_operand '.' T_ID
So those are postfix operators whose argument must be a basic_operand. In a(b)(c), the (c) argument list is not being applied to a basic_operand, so the grammar isn't going to match it.
What you were looking for, I suppose, is:
call_or_getattr:
basic_operand
| call_or_getattr '(' csv ')'
| call_or_getattr '.' T_ID
This is, by the way, very similar to the way you write productions for a binary operator. (Of course, the binary operator has a right-hand operand.)
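To see why the left-recursive form chains, here is a rough hand-written equivalent in Python. It is a sketch only: the names tokenize_calls and parse_postfix are invented here, identifiers stand in for basic_operand, and csv is assumed to be a comma-separated list of such operands. The left recursion becomes a loop that keeps wrapping the node built so far.

```python
import re

TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|[().,])")


def tokenize_calls(text):
    """Split text into identifiers and the punctuation ( ) . ,"""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_postfix(tokens):
    """Parse call_or_getattr and return a nested tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def ident():
        tok = take()
        if not (tok[0].isalpha() or tok[0] == "_"):
            raise SyntaxError(f"expected identifier, got {tok!r}")
        return tok

    def operand():
        node = ident()  # call_or_getattr: basic_operand
        # Each pass applies one postfix operator to everything parsed so far,
        # which is what the left-recursive alternatives express.
        while peek() in ("(", "."):
            if take() == "(":
                args = []
                if peek() != ")":
                    args.append(operand())
                    while peek() == ",":
                        take(",")
                        args.append(operand())
                take(")")
                node = ("call", node, args)
            else:
                node = ("getattr", node, ident())
        return node

    tree = operand()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

With this shape, parse_postfix(tokenize_calls("a(b)(c)")) yields a call whose callee is itself the call a(b), which is the nesting the original grammar could not produce.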
I've been using the Antlr Matlab grammar from Antlr grammars
I found out I need to implement the ' Matlab operator. It is the complex conjugate transpose operator, used as such
result = input'
I tried a straightforward solution of adding it to unary_expression as an option postfix_expression '\''
However, this failed to parse when multiple of these operators were used on a single line.
Here's a significantly simplified version of the grammar, still exhibiting the exact problem:
grammar Grammar;
unary_expression
: IDENTIFIER
| unary_expression '\''
;
translation_unit : unary_expression CR ;
STRING_LITERAL : '\'' [a-z]* '\'' ;
IDENTIFIER : [a-zA-Z] ;
CR : [\r\n] + ;
Test cases, being parsed as translation_unit:
"x''\n" //fails getNumberOfSyntaxErrors returns 1
"x'\n" //passes
The failure also prints the message line 1:1 extraneous input '''' expecting CR to stderr.
The failure goes away if I either remove STRING_LITERAL, or change the * to +. Neither is a proper solution of course, as removing it is entirely off the table, and mandating non-empty strings is not quite correct, though I might be able to live with it. Also, forcing non-empty string does nothing to help the real use case, when the input is something like x' + y' instead of using the operator twice.
For some reason, removing CR from the grammar and \n from the tests also makes the parsing run without problems, but again that is not a usable solution.
What can I do to the grammar to make it work correctly? I'm assuming it's a problem with lexing specifically because removing STRING_LITERAL or making it unable to match '' makes it go away.
The lexer can never be made that context aware I think, but I don't know Matlab well enough to be sure. How could you check during tokenisation that these single quotes are operators:
x' + y';
while these are strings:
x = 'x' + ' + y';
?
Maybe you can do something similar to how, in ECMAScript, a / can be either a division operator or a regex delimiter. In the ECMAScript grammar that is handled by a predicate in the lexer that uses some target code to check the context.
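The idea behind such a predicate can be sketched outside ANTLR. The following Python tokenizer is an illustration only, not ANTLR code, and the rule it uses is an assumption rather than the full Matlab definition: a single quote is treated as the transpose operator when the previous token is an identifier, a closing bracket, or another transpose, and as the start of a string otherwise. The function name tokenize_matlab and the token kinds TRANSPOSE and OP are made up for the example.

```python
def tokenize_matlab(text):
    """Return a list of (kind, lexeme) pairs, deciding what ' means from context."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch in " \t":
            pos += 1  # whitespace is skipped and does not affect the decision
        elif ch in "\r\n":
            end = pos
            while end < len(text) and text[end] in "\r\n":
                end += 1
            tokens.append(("CR", text[pos:end]))
            pos = end
        elif ch.isalpha():
            end = pos
            while end < len(text) and text[end].isalnum():
                end += 1
            tokens.append(("IDENTIFIER", text[pos:end]))
            pos = end
        elif ch == "'":
            prev = tokens[-1] if tokens else None
            # Assumed rule: after an operand-like token the quote is an operator.
            after_operand = prev is not None and (
                prev[0] in ("IDENTIFIER", "TRANSPOSE")
                or prev in (("OP", ")"), ("OP", "]"))
            )
            if after_operand:
                tokens.append(("TRANSPOSE", "'"))
                pos += 1
            else:
                end = text.find("'", pos + 1)
                if end == -1:
                    raise SyntaxError("unterminated string literal")
                tokens.append(("STRING_LITERAL", text[pos:end + 1]))
                pos = end + 1
        else:
            tokens.append(("OP", ch))
            pos += 1
    return tokens
```

Under that assumed rule, x'' becomes an identifier followed by two TRANSPOSE tokens instead of an identifier and an empty string, and x = 'x' + ' + y' still produces two string literals.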
If something like the above is not possible, I see no other way than to "promote" the creation of strings to the parser. That would mean removing STRING_LITERAL and introducing a parser rule that matches something like this:
string_literal
: QUOTE ~(QUOTE | CR)* QUOTE
;
// Needed to match characters inside strings
OTHER
: .
;
However, that will fail when a string like 'hi there' is encountered: the space in between hi and there will now be skipped by the WS rule. So WS should also be removed (spaces will then get matched by the OTHER rule). But now (of course) all spaces will litter the token stream and you'll have to account for them in all parser rules (not really a viable solution).
All in all: I don't see ANTLR as a suitable tool in this case. You might look into parser generators where there is no separation between tokenisation and parsing. Google for "PEG" and/or "scannerless parsing".
I'm working on a grammar that is context-sensitive. Here is its description:
1. It describes a set of expressions.
2. Each expression contains one or more parts separated by a logical operator.
3. Each part consists of an optional field identifier followed by a comparison operator (also optional) and a list of values.
4. Values are separated by a logical operator as well.
5. By default, a value is a sequence of characters. Sometimes (depending on context) the set of possible characters for a value can be extended. A value can even consume the comparison operator (which, according to the 3rd rule, separates the field identifier from the list of values) and treat it as one of its characters.
Here's the simplified version of a grammar:
grammar TestGrammar;
@members {
boolean isValue = false;
}
exprSet: (expr NL?)+;
expr: expr log_op expr
| part
| '(' expr ')'
;
part: (fieldId comp_op)? values;
fieldId: STRNG;
values: values log_op values
| value
| '(' values ')'
;
value: strng;
strng: ( STRNG
| {isValue}? comp_op
)+;
log_op: '&' '&';
comp_op: '=';
NL: '\r'? '\n';
WS: ' ' -> channel(HIDDEN);
STRNG: CHR+;
CHR: [A-Za-z];
I'm using a semantic predicate in the strng rule. It should extend the set of possible tokens depending on the isValue variable.
The problem occurs when the semantic predicate evaluates to false. I expect two STRNG tokens with an '=' token between them to be treated as a part node. Instead, the parser treats each STRNG token as a value and throws away the '=' token when re-synchronizing.
Here's the input string and the resulting expression tree that is incorrect:
a && b=c
To see the correct expression tree, it is enough to remove the alternative with the semantic predicate from the strng rule (which makes the rule static and so is inappropriate for my solution):
strng: ( STRNG
// | {isValue}? comp_op
)+;
Here's resulting expression tree:
BTW, when the semantic predicate evaluates to true, the result is as expected: the strng rule matches an extended set of tokens:
strng: ( STRNG
| {!isValue}? comp_op
)+;
Please explain why this happens and help me find a correct solution. Thanks!
What about removing one option from values? Otherwise the text a && b may be either a
expr -> expr log_op expr
or
expr -> part -> values log_op values
.
It seems Antlr resolves it by using the second option!
values
: //values log_op values
value
| '(' values ')'
;
I believe your expr rule is written in the wrong order. Try moving the binary expression to be the last alternative instead of the first.
OK, I've realized that the current approach is inappropriate for my task.
I've chosen another approach based on overriding the lexer's nextToken() and emit() methods, as described in ANTLR4: How to inject tokens.
It has given me almost full control over the stream of tokens. I got the following advantages:
assigning required types to tokens;
postpone sending tokens with yet undefined type to parser (by sending fake tokens on hidden channel);
possibility to split and merge tokens;
possibility to organize postponed tokens into queues.
Having all these possibilities I'm able to resolve all the ambiguities in the parser.
P.S. Thanks to everyone who tried to help, I appreciate it!
Grammar: http://pastebin.com/ef2jt8Rg
y.output: http://pastebin.com/AEKXrrRG
I don't know where those conflicts are. Can someone help me with this?
The y.output file tells you exactly where the conflicts are. The first one is in state 4, so if you go down to look at state 4, you see:
state 4
99 compound_statement: '{' . '}'
100 | '{' . statement_list '}'
IDENTIFIER shift, and go to state 6
:
IDENTIFIER [reduce using rule 1 (threat_as_ref)]
IDENTIFIER [reduce using rule 2 (func_call_start)]
This is telling you that in this state (parsing a compound_statement, having seen a {), and looking at the next token being IDENTIFIER, there are 3 possible things it could do -- shift the token (which would be the beginning of a statement_list), reduce the threat_as_ref empty production, or reduce the func_call_start empty production.
The brackets tell you that it has decided to never do those actions -- the default "prefer shift over reduce" conflict resolution means that it will always do the shift.
The problem with your grammar is these empty rules threat_as_ref and func_call_start: they need to be reduced BEFORE shifting the IDENTIFIER, but in order to know whether they're valid, the parser would need to see the tokens AFTER the identifier. func_call_start should only be reduced if this is the beginning of a function call (which depends on there being a ( after the IDENTIFIER). So the parser needs more lookahead to deal with your grammar. In your specific case, your grammar is LALR(2) (two tokens of lookahead would suffice) but not LALR(1), so bison can't deal with it.
Now you could fix it by just getting rid of those empty rules -- func_call_start has no action at all, and the action for threat_as_ref could be moved into the action for variable, but if you want those rules in the future that may be a problem.
(1) I see at least one thing that looks odd. Your productions for expression_statement are similar to those for postfix_statement, but not quite the same. They don't have the '(' and ')' tokens:
expression_statement
: ';'
| expression ';'
| func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } func_call_end ';'
| func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } argument_expression_list func_call_end ';'
;
Since an expression can be a primary_expression, which can be an IDENTIFIER, and since func_call_start and func_call_end are epsilon (null) productions, when presented with the input
foo;
the parser has to decide whether to apply
expression_statement : expression ';'
or
expression_statement : func_call_start IDENTIFIER { ras_parse_variable_psh($2); aFree($2); } func_call_end ';'
(2) Also, I'm not certain of this, but I suspect the epsilon non-terminal threat_as_ref might be causing you some trouble. I have not traced it through, but there may be a case where the parser has to decide whether something is a variable_ref or a variable.
I'm trying to parse a simple grammar using an LALR(1) parser generator (Bison, but the problem is not specific to that tool), and I'm hitting a shift-reduce conflict. The docs and other sources I've found about fixing these tend to say one or more of the following:
If the grammar is ambiguous (e.g. if-then-else ambiguity), change the language to fix the ambiguity.
If it's an operator precedence issue, specify precedence explicitly.
Accept the default resolution and tell the generator not to complain about it.
However, none of these seem to apply to my situation: the grammar is unambiguous so far as I can tell (though of course it's ambiguous with only one character of lookahead), it has only one operator, and the default resolution leads to parse errors on correctly-formed input. Are there any techniques for reworking the definition of a grammar to remove shift-reduce conflicts that don't fall into the above buckets?
For concreteness, here's the grammar in question:
%token LETTER
%%
%start input;
input: /* empty */ | input input_elt;
input_elt: rule | statement;
statement: successor ';';
rule: LETTER "->" successor ';';
successor: /* empty */ | successor LETTER;
%%
The intent is to parse semicolon-separated lines of the form "[A-Za-z]+" or "[A-Za-z] -> [A-Za-z]+".
Using the Solaris version of yacc, I get:
1: shift/reduce conflict (shift 5, red'n 7) on LETTER
state 1
$accept : input_$end
input : input_input_elt
successor : _ (7)
$end accept
LETTER shift 5
. reduce 7
input_elt goto 2
rule goto 3
statement goto 4
successor goto 6
So, the trouble is, as it very often is, the empty rule - specifically, the empty successor. It isn't completely clear whether you want to allow a semi-colon as a valid input - at the moment, it is. If you modified the successor rule to:
successor: LETTER | successor LETTER;
the shift/reduce conflict is eliminated.
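A hand-written recognizer shows why the non-empty successor works with a single token of lookahead. This is a Python sketch with an invented function name (parse_input) and token-kind strings, not generated parser code. The first LETTER is always shifted, and only then does the next token ("->" or anything else) decide between rule and statement; nothing has to be reduced before that first LETTER is read.

```python
def parse_input(tokens):
    """Classify each ';'-terminated element as "rule" or "statement".

    tokens is a list of token kinds: "LETTER", "->" or ";".
    """
    elements = []
    pos = 0
    while pos < len(tokens):
        # Both alternatives start with a LETTER, so shifting it is always safe.
        if tokens[pos] != "LETTER":
            raise SyntaxError(f"expected LETTER, got {tokens[pos]!r}")
        pos += 1

        # One token of lookahead is now enough to choose.
        if tokens[pos:pos + 1] == ["->"]:
            kind = "rule"
            pos += 1
            # successor: LETTER | successor LETTER (at least one LETTER)
            if tokens[pos:pos + 1] != ["LETTER"]:
                raise SyntaxError("expected LETTER after '->'")
        else:
            kind = "statement"  # the LETTER already read starts the successor

        while tokens[pos:pos + 1] == ["LETTER"]:
            pos += 1
        if tokens[pos:pos + 1] != [";"]:
            raise SyntaxError("expected ';'")
        pos += 1
        elements.append(kind)
    return elements
```

Note that, like the modified grammar, this version rejects a bare ";" because a statement now needs at least one LETTER.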
Thanks for whittling down the grammar and posting it. Changing the successor rule to -
successor: /* empty */ | LETTER successor;
...worked for me. ITYM the language looked unambiguous.