Antlr non-LL(*) decision - parsing

I'm having some trouble creating part of my ANTLR grammar for my programming language.
I'm getting the error when the second part of the type declaration occurs:
public type
: ID ('.' ID)* ('?')? -> ^(R__Type ID ID* ('?')?)
| '(' type (',' type)* ')' ('?')? -> ^(R__Type type* ('?')?)
;
I'm trying to either match:
A line like System.String (works fine)
A tuple such as (System.String, System.Int32)
The error occurs slightly higher up the tree, and states:
[fatal] rule statement has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
What am I doing wrong?

Right, I managed to fix this a little earlier up the tree, by editing the rule that deals with variable declarations:
'my' ID (':' type '=' constant | ':' type | '=' constant) -> ^(R__VarDecl ID type? constant?)
So that it works like:
'my' ID
(
':' type ('=' constant)?
| '=' constant
) -> ^(R__VarDecl ID type? constant?)
I got the idea from the example of syntactic predicates here:
https://wincent.com/wiki/ANTLR_predicates
Luckily I didn't need a predicate in the end!

Related

Lemon Parser - Parsing conflict between rules for a.b.c & a.b[0].c

typename ::= typename DOT ID.
typename ::= ID.
lvalue ::= lvalue DOT lvalue2.
lvalue ::= lvalue2.
lvalue2 ::= ID LSQB expr RSQB. // LSQB & RSQB: left & right square bracket. i.e. [ ].
lvalue2 ::= ID.
typename is a rule for the names of types. It matches the following code:
ClassA
package_a.ClassA
while lvalue is a rule for left values. It matches the following code:
varA
varB.C
varD.E[i].F
Now the 2 rules conflicts with each other. Maybe it is because lvalue can also match package_a.ClassA?
How can I solve this?
You can't resolve this issue grammatically because your syntax is ambiguous. a.b = 3 is valid if a.b is a member of a, and invalid if a.b is a type, but the semantics of a.b cannot be determined by the syntax.
You could resolve this in a fairly messy way if you have some way to figure that out in the lexer (which will probably involve some kind of lexical feedback, since the lexer would presumably need access to the symbol table in order to provide that information). Then the lexer could use two different token types for IDs, based on whether or not they are type names.
But probably the best option is to either abandon the idea of distinguishing grammatically between lvalues and rvalues, and or to assume that all selection operations (a.b) produce lvalues, and then to validate the use of an expression as an lvalue in the semantic action or some subsequent semantic analysis.

Bison eliminate reduce/reduce conflict among nullable non-terminals?

I am using Bison (AFAIK they use LL(1) parsing as default).
My grammar says something like this:
function_decl: ID '(' params ')' ':' TYPE ... // body may go here
function_call: ID '(' arguments ')'
params: ID ':' TYPE
| params ',' ID ':' TYPE
| %empty
arguments: ID
| arguments ',' ID
| %empty
Now, bison warns saying a reduce/reduce conflict, because params and arguments both are null-able (in case of a zero-parameter function()).
My question is, how can I remove (instead of suppressing) this conflict ?
Although someone suggested to use different parsing technique, I want to make it clear for myself if it is possible (and I should do so) or I should just ignore it.
If you ignore the warning, you will end up with a parser which cannot recognise a function call without arguments. So that is probably not a good idea.
You are quite correct that the conflict is the result of both params and arguments producing an empty string. Because the parser can only lookahead at one symbol a when it reads the ) in func(), it needs to decide whether to reduce an empty params (which will force it to proceed with function_decl) or an empty arguments (which will commit it to function_call). But there is no way to know until the next token is read.
The easiest solution is to avoid the empty reductions, although it makes the grammar slightly wordier. In the following, neither params nor arguments produce the empty string, and extra productions for function_decl and function_call are used to recognise these cases.
function_decl: ID '(' params ')' ':' TYPE ... // body may go here
function_decl: ID '(' ')' ':' TYPE ...
function_call: ID '(' arguments ')'
function_call: ID '(' ')'
params: ID ':' TYPE
| params ',' ID ':' TYPE
arguments: ID
| arguments ',' ID
This works because it lets the parser defer the decision between call and declaration. An LR parser can defer decisions until it has to reduce; it can keep several productions open at the same time, but it must reduce a production when it reaches its end.
Note that even without the conflict, your original grammar is incorrect. As written, it allows arguments (or params) to start with an arbitrary number of commas. Your intention was not to allow %empty as an alternative base case, but rather as an alternative complete expansion. The correct grammar for an optional comma-separated list requires two non-terminals:
arguments
: %empty
| argument_list
argument_list
: argument
| argument_list ',' argument

Ambiguous call expression in ANTLR4 grammar

I have a simple grammar (for demonstration)
grammar Test;
program
: expression* EOF
;
expression
: Identifier
| expression '(' expression? ')'
| '(' expression ')'
;
Identifier
: [a-zA-Z_] [a-zA-Z_0-9?]*
;
WS
: [ \r\t\n]+ -> channel(HIDDEN)
;
Obviously the second and third alternatives in the expression rule are ambiguous. I want to resolve this ambiguity by permitting the second alternative only if an expression is immediately followed by a '('.
So the following
bar(foo)
should match the second alternative while
bar
(foo)
should match the 1st and 3rd alternatives (even if the token between them is in the HIDDEN channel).
How can I do that? I have seen these ambiguities, between call expressions and parenthesized expressions, present in languages that have no (or have optional) expression terminator tokens (or rules) - example
The solution to this is to temporary "unhide" whitespace in your second alternative. Have a look at this question for how this can be done.
With that solution your code could look somthing like this
expression
: Identifier
| {enableWS();} expression '(' {disableWS();} expression? ')'
| '(' expression ')'
;
That way the second alternative matches the input WS-sensitive and will therefore only be matched if the identifier is directly followed by the bracket.
See here for the implementation of the MultiChannelTokenStream that is mentioned in the linked question.

Can I do something to avoid the need to backtrack in this grammar?

I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that
Here's the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state - I would clearly like the parser to backtrack, and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN and then things would work because the ; would match the var statement syntax all right
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is that to avoid forcing the parser to make a premature decision; once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)

Shift/reduce conflict in yacc due to look-ahead token limitation?

I've been trying to tackle a seemingly simple shift/reduce conflict with no avail. Naturally, the parser works fine if I just ignore the conflict, but I'd feel much safer if I reorganized my rules. Here, I've simplified a relatively complex grammar to the single conflict:
statement_list
: statement_list statement
|
;
statement
: lvalue '=' expression
| function
;
lvalue
: IDENTIFIER
| '(' expression ')'
;
expression
: lvalue
| function
;
function
: IDENTIFIER '(' ')'
;
With the verbose option in yacc, I get this output file describing the state with the mentioned conflict:
state 2
lvalue -> IDENTIFIER . (rule 5)
function -> IDENTIFIER . '(' ')' (rule 9)
'(' shift, and go to state 7
'(' [reduce using rule 5 (lvalue)]
$default reduce using rule 5 (lvalue)
Thank you for any assistance.
The problem is that this requires 2-token lookahead to know when it has reached the end of a statement. If you have input of the form:
ID = ID ( ID ) = ID
after parser shifts the second ID (lookahead is (), it doesn't know whether that's the end of the first statement (the ( is the beginning of a second statement), or this is a function. So it shifts (continuing to parse a function), which is the wrong thing to do with the example input above.
If you extend function to allow an argument inside the parenthesis and expression to allow actual expressions, things become worse, as the lookahead required is unbounded -- the parser needs to get all the way to the second = to determine that this is not a function call.
The basic problem here is that there's no helper punctuation to aid the parser in finding the end of a statement. Since text that is the beginning of a valid statement can also appear in the middle of a valid statement, finding statement boundaries is hard.

Resources