Yacc fails with one declaration but reduces two declarations successfully

I have the following yacc parser (it doesn't have any conflicts) for int or float variable declarations:
%token ID INT FLOAT
%token SEMICOLON
%%
program : list_declaration { printf("program\n"); }
;
list_declaration : declaration { printf("list_declaration\n"); }
| declaration declaration { printf("list_declaration\n"); }
;
declaration : var_declaration { printf("declaration\n"); }
;
var_declaration : type ID SEMICOLON { printf("var_declaration\n"); }
;
type : INT { printf("type\n"); }
| FLOAT { printf("type\n"); }
;
%%
I've been banging my head against this problem but haven't come up with a solution.
If there are two variable declarations as input, like:
int test;
float test2;
It's parsed normally; here is the output:
type
var_declaration
declaration
type
var_declaration
declaration
list_declaration
program
But if there's only one declaration, the parser never reduces it to program. For instance:
int test;
gives:
type
var_declaration
declaration
Shouldn't declaration be reduced to list_declaration and then list_declaration reduced to program? I'm planning to extend list_declaration later to any number of declarations, but I can't do that until I understand why it's not working properly even for up to two declarations.

The problem is almost certainly that you are suppressing the EOF return from yylex. yylex must return 0 on EOF; otherwise, bison parsers cannot reliably recognize the start production.
Like most parser generators -- and as described in most parsing textbooks -- bison and yacc create an "augmented" start production whose right-hand side consists of the declared (or implicit) start non-terminal followed by an EOF pseudo-token. The parse will only succeed if that production is reduced, and that production cannot be reduced without the EOF.
Because bison will reduce without lookahead for states in which lookahead is unnecessary, it is possible, with your grammar, for bison to reduce declaration declaration to program without lookahead. But it cannot reduce declaration to program without the EOF lookahead, so it doesn't. In the case with two declarations, despite the fact that program has been reduced, the parse has not actually succeeded and yyparse will not have returned.
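As a rough illustration of that contract (in Python rather than C, purely for brevity), here is a toy lexer shaped like yylex. The token codes and the names make_yylex and all_tokens are made up for this sketch; the only detail meant to mirror yacc/bison is that the function returns 0 once the input is exhausted.

```python
import re

# Hypothetical token codes; a real parser takes these from the generated header.
ID, INT, FLOAT, SEMICOLON = 258, 259, 260, 261
KEYWORDS = {"int": INT, "float": FLOAT}
TOKEN_RE = re.compile(r"\s*(?:([A-Za-z_]\w*)|(;))")

def make_yylex(text):
    """Return a yylex-like function: one token code per call, 0 at end of input."""
    pos = 0

    def yylex():
        nonlocal pos
        m = TOKEN_RE.match(text, pos)
        if m is None:
            if text[pos:].strip():
                raise ValueError("unexpected input at offset %d" % pos)
            return 0  # EOF: without this the start production is never completed
        pos = m.end()
        if m.group(2):
            return SEMICOLON
        return KEYWORDS.get(m.group(1), ID)

    return yylex

def all_tokens(text):
    """Collect every token code up to and including the EOF marker."""
    yylex = make_yylex(text)
    codes = []
    while True:
        code = yylex()
        codes.append(code)
        if code == 0:
            return codes
```

With this shape, all_tokens("int test;") ends with the 0 that the augmented start production is waiting for. A lexer that swallowed the end of input instead would leave the parser stuck after declaration.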

Related

Flex and Bison - Grammar that sometimes care about spaces

Currently I'm trying to implement a grammar which is very similar to ruby. To keep it simple, the lexer currently ignores space characters.
However, in some cases the space character makes a big difference:
def some_callback(arg=0)
arg * 100
end
some_callback (1 + 1) + 1 # 300
some_callback(1 + 1) + 1 # 201
some_callback +1 # 100
some_callback+1 # 1
some_callback + 1 # 1
So currently all whitespace is ignored by the lexer:
{WHITESPACE} { ; }
And the language says for example something like:
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
One way I can think of to solve this problem would be to explicitly add whitespace to the whole grammar, but doing so would greatly increase the complexity of the grammar:
// OLD:
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression T_ADD MultiplicativeExpression
| AdditiveExpression T_SUB MultiplicativeExpression
;
// NEW:
_:
/* empty */
| WHITESPACE _;
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression _ T_ADD _ MultiplicativeExpression
| AdditiveExpression _ T_SUB _ MultiplicativeExpression
;
//...
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
So I'd like to ask whether there is a best practice for handling this kind of grammar.
Thank you in advance!
Without having a full specification of the syntax you are trying to parse, it's not easy to give a precise answer. In the following, I'm assuming that those are the only two places where the presence (or absence) of whitespace between two tokens affects the parse.
Differentiating between f(...) and f (...) occurs in a surprising number of languages. One common strategy is for the lexer to recognize an identifier which is immediately followed by an open parenthesis as a FUNC_CALL token.
You'll find that in most awk implementations, for example; in awk, the ambiguity between a function call and concatenation is resolved by requiring that the open parenthesis in a function call immediately follow the identifier. Similarly, the C pre-processor macro definition directive distinguishes between #define foo(A) A (the definition of a macro with arguments) and #define foo (A) (an ordinary macro whose expansion starts with a ( token).
If you're doing this with (f)lex, you can use the / trailing-context operator:
[[:alpha:]_][[:alnum:]_]*/"(" { yylval = strdup(yytext); return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]* { yylval = strdup(yytext); return IDENT; }
The grammar is now pretty straight-forward:
call: FUNC_CALL '(' expression_list ')' /* foo(1, 2) */
| IDENT expression_list /* foo (1, 2) */
| IDENT /* foo * 3 */
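For readers who would like to see the trailing-context trick outside of flex, here is a small Python sketch that uses a regex lookahead to the same effect. The token names other than FUNC_CALL and IDENT (INTEGER, PUNCT, WS) and the tokenize function are assumptions made up for the example.

```python
import re

# FUNC_CALL is listed before IDENT, mirroring the order of the two flex rules.
TOKEN_RE = re.compile(r"""
    (?P<FUNC_CALL>[A-Za-z_]\w*(?=\())   # identifier immediately followed by an open paren
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<INTEGER>\d+)
  | (?P<PUNCT>[-+*/(),])
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(text):
    """Return (kind, lexeme) pairs; whitespace is dropped after classification."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r" % text[pos])
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

As with the flex rule, the parenthesis is only peeked at and not consumed, so it still shows up as its own token: tokenize("f(1)") starts with a FUNC_CALL, while tokenize("f (1)") starts with an IDENT.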
This distinction will not be useful in all syntactic contexts, so it will often prove useful to add a non-terminal which will match either identifier form:
name: IDENT | FUNC_CALL
But you will need to be careful with this non-terminal. In particular, using it as part of the expression grammar could lead to parser conflicts. But in other contexts, it will be fine:
func_defn: "def" name '(' parameters ')' block "end"
(I'm aware that this is not the precise syntax for Ruby function definitions. It's just for illustrative purposes.)
More troubling is the other ambiguity, in which it appears that the unary operators + and - should be treated as part of an integer literal in certain circumstances. The behaviour of the Ruby parser suggests that the lexer is combining the sign character with an immediately following number in the case where it might be the first argument to a function. (That is, in the context <identifier><whitespace><sign><digits> where <identifier> is not an already declared local variable.)
That sort of contextual rule could certainly be added to the lexical scanner using start conditions, although it's more than a little ugly. A not fully fleshed-out implementation, building on the previous:
%x SIGNED_NUMBERS
%%
[[:alpha:]_][[:alnum:]_]*/"("          { yylval.id = strdup(yytext);
                                         return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]*/[[:blank:]]  { yylval.id = strdup(yytext);
                                         if (!is_local(yylval.id))
                                           BEGIN(SIGNED_NUMBERS);
                                         return IDENT; }
[[:alpha:]_][[:alnum:]_]*              { yylval.id = strdup(yytext);
                                         return IDENT; }
<SIGNED_NUMBERS>[[:blank:]]+           ;
  /* Numeric patterns, one version for each context */
<SIGNED_NUMBERS>[+-]?[[:digit:]]+      { yylval.integer = strtol(yytext, NULL, 0);
                                         BEGIN(INITIAL);
                                         return INTEGER; }
[[:digit:]]+                           { yylval.integer = strtol(yytext, NULL, 0);
                                         return INTEGER; }
  /* ... */
  /* If the next character is not a digit or a sign, rescan in INITIAL state */
<SIGNED_NUMBERS>.|\n                   { yyless(0); BEGIN(INITIAL); }
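The same idea can be sketched in Python with a boolean flag standing in for the SIGNED_NUMBERS start condition. This is only a rough model under stated assumptions: the tokenize function and its local_names parameter (a stand-in for is_local) are hypothetical, and integers are read in base 10 only.

```python
import re

IDENT_RE = re.compile(r"[A-Za-z_]\w*")
SIGNED_RE = re.compile(r"[+-]?\d+")
UNSIGNED_RE = re.compile(r"\d+")

def tokenize(text, local_names=()):
    """Lex text; after '<non-local identifier><blank>' a sign glued to digits joins the number."""
    tokens = []
    pos = 0
    signed = False  # plays the role of the SIGNED_NUMBERS start condition
    while pos < len(text):
        ch = text[pos]
        if ch in " \t":
            pos += 1  # blanks are skipped and do not leave the signed state
            continue
        m = (SIGNED_RE if signed else UNSIGNED_RE).match(text, pos)
        signed = False  # the state applies to the very next token only
        if m:
            tokens.append(("INTEGER", int(m.group())))
            pos = m.end()
            continue
        m = IDENT_RE.match(text, pos)
        if m:
            name = m.group()
            pos = m.end()
            if pos < len(text) and text[pos] == "(":
                tokens.append(("FUNC_CALL", name))
            else:
                tokens.append(("IDENT", name))
                # Only a non-local identifier followed by a blank enables signed numbers.
                signed = (pos < len(text) and text[pos] in " \t"
                          and name not in local_names)
            continue
        tokens.append((ch, ch))  # any other character is its own token
        pos += 1
    return tokens
```

Under these assumptions, "f +1" lexes as an identifier followed by the single integer 1, whereas "f + 1" and "f+1" both produce a separate '+' token, and so does "x +1" when x is passed in local_names.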
Another possible solution would be for the lexer to distinguish sign characters which follow a space and are directly followed by a digit, and then let the parser try to figure out whether or not the sign should be combined with the following number. However, this will still depend on being able to distinguish between local variables and other identifiers, which will still require the lexical feedback through the symbol table.
It's worth noting that the end result of all this complication is a language whose semantics are not very obvious in some corner cases. The fact that f+3 and f +3 produce different results could easily lead to subtle bugs which might be very hard to detect. In many projects using languages with these kinds of ambiguities, the project style guide will prohibit legal constructs with unclear semantics. You might want to take this into account in your language design, if you have not already done so.

Bison: GLR-parsing of valid expression fails without error message

I'm working on a GLR-parser in GNU bison and I have the following problem:
the language I'm trying to parse allows boolean expressions including relations (<, >, <=, ...) and boolean composition (and, or, not). Now the problem is that the language also allows multiple arithmetic expressions on the right side of a relation... and they are composed using the same AND token that is used for boolean composition! This is a very dumb language design, but I can't change it.
So you can have a > b and c, which is supposed to be equivalent to (a > b) and (a > c), and you can also have a > b and c > d, which is supposed to be equivalent to (a > b) and (c > d).
The S/R conflict this causes is already obvious in this example: after reading a > b with the lookahead token and, you could either reduce the a > b to a boolean expression and wait for another boolean expression, or you could shift the and and wait for another arithmetic expression.
My grammar currently looks like this:
booleanexpression
: relation
| booleanexpression TOK_AND booleanexpression
...
;
relation
: arithmeticexpression TOK_GT maxtree
...
;
maxtree
: arithmeticexpression
| maxtree TOK_AND maxtree
...
;
The language is clearly not LR(k) for any k, since the S/R conflict can't be resolved using any constant k-lookahead, because the arithmeticexpression in between can have arbitrarily many tokens. Because of that, I turned GLR-parsing on.
But when I try to parse a > b and c with this, I can see in my debug outputs, that the parser behaves like this:
it reads the a and at lookahead > it reduces the a to an arithmeticexpression
it reads the b and at lookahead and it reduces the b to an arithmeticexpression and then already to a maxtree
it reduces the a > b to a relation
it reads the c and reduces it to an arithmeticexpression
then nothing happens! The and c are apparently discarded - the debug outputs don't show any action for these tokens. Not even an error message. The corresponding if-statement doesn't exist in my AST (I still get an AST because I have error recovery).
I would think that, after reading the b, there should be 2 stacks. But then the b shouldn't be reduced. Or at least it should give me some error message ("language is ambiguous" would be okay and I have seen that message before - I don't see why it wouldn't apply here). Can anyone make sense of this?
From looking at the grammar for a while, you can tell that the main question here is whether after the next arithmeticexpression there comes
another relation token (then you should reduce)
another boolean composition (then you should shift)
a token outside of the boolean/arithmetic -expression syntax (like THEN) which would terminate the expression and you should also shift
Can you think of a different grammar that captures the situation in a better / more deterministic way? How would you approach the problem? I'm currently thinking about making the grammar more right-to-left, like
booleanexpression : relation AND booleanexpression
maxtree : arithmeticexpression AND maxtree
etc.
I think that would make bison prefer shifting and only reduce on the right first. Maybe by using different non-terminals it would allow a quasi-"lookahead" behind the arithmeticexpression...
Side note: GnuCOBOL handles this problem by just collecting all the tokens, pushing them on an intermediate stack and manually building the expression from there. That discourages me, but I cling to the hope that they did it this way because bison didn't support GLR-parsing when they started...
EDIT:
a small reproducible example
%{
#include <stdio.h>
int yylex ();
void yyerror(const char* msg);
%}
%glr-parser
%left '&'
%left '>'
%%
input: %empty | input bool '\n' {printf("\n");};
arith : 'a' | 'b' | 'c';
maxtree : arith { printf("[maxtree : arith] "); }
| maxtree '&' maxtree { printf("[maxtree : maxtree & maxtree] "); } ;
rel : arith '>' maxtree { printf("[rel : arith > maxtree] "); } ;
bool : rel { printf("[bool : rel] "); }
| bool '&' bool { printf("[bool : bool & bool] "); } ;
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex () {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
return yyparse();
}
Strangely, this one does print the error message "syntax error" on input a>b&c.
Being able to simplify grammars by using precedence declarations is really handy (sometimes) [Note 1] but it doesn't play well with using GLR parsers because it can lead to early rejection of an unambiguous parse.
The idea behind precedence declarations is that they resolve ambiguities (or, more accurately, shift/reduce conflicts) using a simple one-token lookahead and a configured precedence between the possible reduction and the possible shift. If a grammar has no shift/reduce conflict, the precedence declarations won't be used, but if they are used they will be used to suppress either the shift or the reduce, depending on the (static) precedence relationship.
A Bison-generated GLR parser does not actually resolve ambiguity, but it allows possibly incorrect parses to continue to be developed until the ambiguity is resolved by the grammar. Unlike the use of precedence, this is a delayed resolution; a bit slower but a lot more powerful. (GLR parsers can produce a "parse forest" containing all possible parses. But Bison doesn't implement this feature, since it expects to be parsing programming languages and unlike human languages, programming languages cannot be ambiguous.)
In your language, it is impossible to resolve the non-determinism of the shift/reduce conflict statically, as you note yourself in the question. Your grammar is simply not LR(1), much less operator precedence, and GLR parsing is therefore a practical solution. But you have to allow GLR to do its work. Prematurely eliminating one of the plausible parses with a precedence comparison will prevent the GLR algorithm from considering it later. This will be particularly serious if you manage to eliminate the only parse which could have been correct.
In your grammar, it is impossible to define a precedence relationship between the rel productions and the & symbol, because no precedence relationship exists. In some sentences, the rel reduction needs to win; in other sentences, the shift should win. Since the grammar is not ambiguous, GLR will eventually figure out which is which, as long as both the shift and the reduce are allowed to proceed.
In your full language, both boolean and arithmetic expressions have something akin to operator precedence, but only within their respective domains. An operator precedence parser (and, equivalently, yacc/bison's precedence declarations) works by erasing the difference between different non-terminals; it cannot handle a grammar like yours in which some operator has different precedences in different domains (or between different domains).
Fortunately, this particular use of precedence declarations is only a shortcut; it does not give any additional power to the grammar and can easily and mechanically be implemented by creating new non-terminals, one for each precedence level. The alternative grammar will not be ambiguous. The classic example, which you can find in pretty well any textbook or tutorial which includes parsing arithmetic expressions, is the expr/term/factor grammar. Here I've also provided the precedence grammar for comparison:
                              %left '+' '-'
                              %left '*' '/'
%%                            %%
expr  : term
      | expr '+' term         expr: expr '+' expr
      | expr '-' term             | expr '-' expr
term  : factor
      | term '*' factor           | expr '*' expr
      | term '/' factor           | expr '/' expr
factor: ID                        | ID
      | '(' expr ')'              | '(' expr ')'
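To make the "one non-terminal per precedence level" idea concrete, here is a small hand-written Python parser with one function per level of the expr/term/factor grammar. It is a sketch, not generated code: the left-recursive rules are written as loops (which keeps the operators left-associative), the parse function and its tuple-shaped trees are made up for the example, and characters the tokenizer doesn't know are silently skipped.

```python
import re

def parse(text):
    """Parse with one function per precedence level; returns a nested tuple tree."""
    tokens = re.findall(r"[A-Za-z_]\w*|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def factor():  # factor: ID | '(' expr ')'
        tok = advance()
        if tok == "(":
            tree = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return tree
        if tok is None or not (tok[0].isalpha() or tok[0] == "_"):
            raise SyntaxError("expected an identifier or '('")
        return tok

    def term():  # term: factor | term '*' factor | term '/' factor
        tree = factor()
        while peek() in ("*", "/"):
            op = advance()
            tree = (op, tree, factor())
        return tree

    def expr():  # expr: term | expr '+' term | expr '-' term
        tree = term()
        while peek() in ("+", "-"):
            op = advance()
            tree = (op, tree, term())
        return tree

    tree = expr()
    if peek() is not None:
        raise SyntaxError("unexpected token %r" % peek())
    return tree
```

Because term sits below expr, parse("a+b*c") groups the multiplication first without any precedence declaration, and parse("a-b-c") nests to the left.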
In your minimal example, there are already enough non-terminals that no new ones need to be invented, so I've just rewritten it according to the above model.
I've left the actions as I wrote them, in case the style is useful to you. Note that this style leaks memory like a sieve, but that's ok for quick tests:
%code top {
#define _GNU_SOURCE 1
}
%{
#include <ctype.h>
#include <stdio.h>
#include <string.h>
int yylex(void);
void yyerror(const char* msg);
%}
%define api.value.type { char* }
%glr-parser
%token ID
%%
input : %empty
| input bool '\n' { puts($2); }
arith : ID
maxtree : arith
| maxtree '&' arith { asprintf(&$$, "[maxtree& %s %s]", $1, $3); }
rel : arith '>' maxtree { asprintf(&$$, "[COMP %s %s]", $1, $3); }
bool : rel
| bool '&' rel { asprintf(&$$, "[AND %s %s]", $1, $3); }
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex(void) {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
if (isalpha(c)) {
*(yylval = strdup(" ")) = c;
return ID;
}
else return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
#if YYDEBUG
if (argc > 1 && strncmp(argv[1], "-d", 2) == 0) yydebug = 1;
#endif
return yyparse();
}
Here's a sample run. Note the warning from bison about a shift/reduce conflict. If there had been no such warning, the GLR parser would probably be unnecessary, since a grammar without conflicts is deterministic. (On the other hand, since bison's GLR implementation optimises for determinism, there is not too much cost for using a GLR parser on a deterministic language.)
$ bison -t -o glr_prec.c glr_prec.y
glr_prec.y: warning: 1 shift/reduce conflict [-Wconflicts-sr]
$ gcc -Wall -o glr_prec glr_prec.c
$ ./glr_prec
a>b
[COMP a b]
a>b & c
[COMP a [maxtree& b c]]
a>b & c>d
[AND [COMP a b] [COMP c d]]
a>b & c & c>d
[AND [COMP a [maxtree& b c]] [COMP c d]]
a>b & c>d & e
[AND [COMP a b] [COMP c [maxtree& d e]]]
$
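For comparison, the decision that GLR postpones can be written out by hand for this minimal example. The Python sketch below is not a GLR parser and is not how bison works; it gets away with peeking two tokens ahead only because every arith here is a single letter, which is exactly the simplification that doesn't hold in the full language. The parse_line function is a made-up name, and the output strings imitate the actions in the grammar above.

```python
def parse_line(text):
    """Parse one line of the toy language and return the bracketed tree string."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek(offset=0):
        i = pos + offset
        return tokens[i] if i < len(tokens) else None

    def expect_arith():
        nonlocal pos
        tok = peek()
        if tok is None or not tok.isalpha():
            raise SyntaxError("expected an operand")
        pos += 1
        return tok

    def maxtree():
        nonlocal pos
        tree = expect_arith()
        # Absorb '& x' only if x is not the left side of a new relation 'x > ...'.
        while peek() == "&" and peek(2) != ">":
            pos += 1
            tree = "[maxtree& %s %s]" % (tree, expect_arith())
        return tree

    def rel():
        nonlocal pos
        left = expect_arith()
        if peek() != ">":
            raise SyntaxError("expected '>'")
        pos += 1
        return "[COMP %s %s]" % (left, maxtree())

    def boolean():
        nonlocal pos
        tree = rel()
        while peek() == "&":
            pos += 1
            tree = "[AND %s %s]" % (tree, rel())
        return tree

    tree = boolean()
    if peek() is not None:
        raise SyntaxError("unexpected token %r" % peek())
    return tree
```

On the inputs from the sample run, this produces the same strings as the GLR parser. Once an arith can span arbitrarily many tokens, no fixed peek distance is enough, which is the reason for using GLR in the first place.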
Notes
Although precedence declarations are handy when you understand what's actually going on, there is a huge tendency for people to just cargo-cult them from some other grammar they found on the internet, and not infrequently a grammar which was also cargo-culted from somewhere else. When the precedence declarations don't work as expected, the next step is to randomly modify them in the hopes of finding a configuration which works. Sometimes that succeeds, often leaving behind unnecessary detritus which will go on to be cargo-culted again.
So, although there are circumstances in which precedence declarations really simplify grammars and the unambiguous implementation would be quite a lot more complicated (such as dangling-else resolution in languages which have many different compound statement types), I've still found myself recommending against their use.
In a recent answer to a different question, I wrote what I hope is a good explanation of the precedence algorithm (and if it isn't, please let me know how it falls short).
Welcome to the wonderful world of COBOL. I could be wrong, but you may have a few
additional problems here. An expression such as A > B AND C in COBOL is ambiguous
until you know how C was declared. Consider the following program:
IDENTIFICATION DIVISION.
PROGRAM-ID EXAMPLE.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 A PIC 9 VALUE 2.
01 B PIC 9 VALUE 1.
01 W PIC 9 VALUE 3.
88 C VALUE 3.
PROCEDURE DIVISION.
IF A > B AND C
DISPLAY 'A > B AND 88 LEVEL C is TRUE because W = ' W
ELSE
DISPLAY 'A not > B or 88 LEVEL C is not TRUE'
END-IF
DISPLAY 'A: ' A ' B: ' B ' W:' W
GOBACK
.
Output from this program is:
A > B AND 88 LEVEL C is TRUE because W = 3
A: 2 B: 1 W: 3
In essence the expression: A > B AND C is equivalent to: A > B AND W = 3. Had C
been defined in a manner similar to A and B, the semantics would
have been: A > B AND A > C, which in this case, is FALSE.
The code mentioned above works well, but I had never gotten it to work in my real project, even though I couldn't see a difference between my real project and this code.
This drove me crazy, but I just found another problem in my code, which prevented this method from working:
I had an (admittedly cargo-culted) %skeleton "lalr1.cc" in my prologue, which disabled the GLR parsing again!
I needed to replace this with
%skeleton "glr.cc"

Why does this grammar fail to parse this input?

I'm defining a grammar for a small language using Antlr4. The idea is that in that language, there's a keyword "function" which can be used either to define a function or as a type specifier when defining parameters. I would like to be able to do something like this:
function aFunctionHere(int a, function callback) ....
However, it seems Antlr doesn't like that I use "function" in two different places. As far as I can tell, the grammar isn't even ambiguous.
In the following grammar, if I remove LINE 1, the generated parser parses the sample input without a problem. Also, if I change the token string in either LINE 2 or LINE 3, so that they are not equal, the parser works.
The error I get with the grammar as-is:
line 1:0 mismatched input 'function' expecting <INVALID>
What does "expecting <INVALID>" mean?
The (stripped down) grammar:
grammar test;
begin : function ;
function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;
parameterlist: parameter (',' parameter)+ ;
parameter: BaseParamType IDENTIFIER ;
// Lexer stuff
BaseParamType:
INT_TYPE
| FUNCTION_TYPE // <---- LINE 1
;
FUNCTION : 'function'; // <---- LINE 2
INT_TYPE : 'int';
FUNCTION_TYPE : 'function'; // <---- LINE 3
IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;
WS : [ \t\r\n]+ -> skip ;
The input I'm using:
function abc(int c, int d, int a)
The program to test the generated parser:
from antlr4 import *
from testLexer import testLexer as Lexer
from testParser import testParser as Parser
from antlr4.tree.Trees import Trees
def main(argv):
    input_stream = FileStream(argv[1] if len(argv) > 1 else "test.in")
    lexer = Lexer(input_stream)
    tokens = CommonTokenStream(lexer)
    parser = Parser(tokens)
    tree = parser.begin()
    print(Trees.toStringTree(tree, None, parser))
if __name__ == '__main__':
    import sys
    main(sys.argv)
Just use one name for the token function.
A token is just a token. Looking at function in isolation, it is not possible to decide whether it is a FUNCTION or a FUNCTION_TYPE. Since FUNCTION comes first in the file, that's what the lexer uses. That makes it impossible to match FUNCTION_TYPE, so that becomes an invalid token type.
The parser will figure out the syntactic role of the token function. So there would be no point in using two different lexical descriptors for the same token, even if it were possible.
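Here's a toy model of that rule-priority behaviour in Python. It assumes the usual convention of "longest match wins, and on a tie the rule listed first wins", applied to the lexer rules from the question (BaseParamType is left out of the model); the RULES table and next_token function are made up for the illustration.

```python
import re

# Lexer rules in the order they appear in the question's grammar.
RULES = [
    ("FUNCTION", r"function"),
    ("INT_TYPE", r"int"),
    ("FUNCTION_TYPE", r"function"),
    ("IDENTIFIER", r"[a-zA-Z_$]+[a-zA-Z_$0-9]*"),
]

def next_token(text, pos=0):
    """Pick the longest match at pos; on a tie, the rule listed first wins."""
    best = None
    for name, pattern in RULES:
        m = re.compile(pattern).match(text, pos)
        # Strictly longer only, so an earlier rule keeps priority on equal length.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best
```

In this model no input can ever come out as FUNCTION_TYPE: whatever it matches, FUNCTION matches at the same length and is listed earlier.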
In the grammar in the OP, BaseParamType is also a lexical type, which will absorb all uses of the token function, preventing FUNCTION from being recognized in the production for function. Changing its name to baseParamType, which effectively changes it to a parser non-terminal, will allow the parser to work, although I suppose it may alter the parse tree in undesirable ways.
I understand the objection that the parser "should know" which lexical tokens are possible in context, given the nature of Antlr's predictive parsing strategy. I'm far from an Antlr expert so I won't pretend to explain why it doesn't seem to work, but with the majority of parser generators -- and all the ones I commonly use -- lexical analysis is effectively performed as a prior pass to parsing, so the conversion of textual input into a stream of tokens is done prior to the parser establishing context. (Most lexical generators, including Antlr, have mechanisms with which the user can build lexical context, but IMHO these mechanisms reduce grammar readability and should only be used if strictly necessary.)
Here's the grammar file which I tested:
grammar test;
begin : function ;
function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;
parameterlist: parameter (',' parameter)+ ;
parameter: baseParamType IDENTIFIER ;
baseParamType:
INT_TYPE
| FUNCTION
;
// Lexer stuff
FUNCTION : 'function';
INT_TYPE : 'int';
IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;
WS : [ \t\r\n]+ -> skip ;

Dealing with overloaded symbols in ambiguous grammars in ANTLR4

I am trying to write a parser for a dialect of Answer Set Programming (ASP) which, in terms of grammar, looks like Prolog with some extensions.
One extension, for instance, is expansion, meaning that fact(1..3). is expanded into fact(1). fact(2). fact(3). Notice that the language understands INT and FLOAT numbers and also uses . as a terminator.
In some cases the parser fails to distinguish between integers, floats, extensions and separators, because I reckon the language is clearly ambiguous. In those cases, I have to explicitly separate tokens with white space. Any Prolog or ASP parser, however, correctly deals with such productions. I read that ANTLR4 can disambiguate problematic productions autonomously, but it probably needs some help and I don't know how to provide it! ;-) I read something like here and here, but apparently they did not help me.
Could somebody please tell me what to do to overcome this ambiguity?
Please notice that I cannot change the language because it is quite standard.
In order to simplify the experts' work, I created a minimal working example that follows.
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum: // not needed, but helps in TestRig
FLOAT;
range: // defines an expansion
INT DOTS INT ;
DOTS: '..';
DOT: '.';
FLOAT: DIGIT+ '.' DIGIT* | '.' DIGIT+ ;
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
I use the following input:
1 .
1. .
1.5 .
.5 .
1 .. 5 .
1.
1..
1.5.
.5.
1..5.
And I get the following errors, even though other tools parse this input correctly:
line 8:0 extraneous input '1.' expecting '.'
line 11:2 extraneous input '.5' expecting '.'
Many thanks in advance!
Before your DOTS rule, add a unique rule for the statement terminal dot and disambiguate the DOTS rule (and change your other rules to use the TERMINAL):
TERMINAL: DOT { isTerminal(1) }? ;
DOTS: DOT DOT { !isTerminal(2) }? ;
DOT: '.';
where the predicate method simply looks ahead on the _input character stream to see if, at the current token index, the next character is white space. Put something like this in an @lexer::members block in your grammar (the method uses lexer fields):
public boolean isTerminal(int la) {
int offset = _tokenStartCharIndex + 1 + la;
String s = _input.getText(Interval.of(offset, offset));
if (Character.isWhitespace(s.charAt(0))) {
return true;
}
return false;
}
You may have to do a bit more work if whitespace is valid between a DOTS and the trailing INT.
I recommend shifting the work to the parser.
If the lexer can't decide whether 1..2 is 1. .2 or 1 .. 2, leave it up to the parser.
Maybe there is a context in which it can be interpreted as the first alternative and another context in which it may be interpreted as the second alternative.
Btw: 1..2. could be interpreted as 1 .. 2 . (range) or as 1. . 2 . (floatNum, intNum). How do you want to deal with this?
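To see why the lexer is the wrong place for this decision, here is a rough Python model of longest-match tokenizing, run once with the lexer rules from the question and once with only DOT and INT. It is a sketch of the general longest-match convention (ties go to the rule listed first), not of ANTLR's actual implementation, and the rule tables and tokenize function are made up for the example.

```python
import re

# The lexer rules from the question, in their original order.
ORIGINAL_RULES = [
    ("DOTS", r"\.\."),
    ("DOT", r"\."),
    ("FLOAT", r"\d+\.\d*|\.\d+"),
    ("INT", r"\d+"),
]
# The reduced rule set: the parser assembles floats and ranges itself.
PARSER_DRIVEN_RULES = [("DOT", r"\."), ("INT", r"\d+")]

def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError("unexpected character %r" % text[pos])
        tokens.append(best)
        pos += len(best[1])
    return tokens
```

With the original rules, "1." becomes a single FLOAT and no terminator is left over, and "1..5." becomes FLOAT, FLOAT, DOT. With only DOT and INT, the same inputs come out as plain INT and DOT tokens, and the grouping is left to the parser rules.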
The following grammar should parse everything. But note that . . is also treated as dots, and 1 . 23 is a floatNum! You can check for these, though, while parsing or after parsing (depending on whether it should influence the parsing or not).
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum:
INT DOT INT? | DOT INT ;
range: // defines an expansion
INT dots INT ;
dots : DOT DOT;
DOT: '.';
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
Prolog does not accept 1. as a float. This feature makes your grammar significantly more ambiguous, so maybe try removing that feature.

Can I do something to avoid the need to backtrack in this grammar?

I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that.
Here are the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state - I would clearly like the parser to backtrack, and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN and then things would work because the ; would match the var statement syntax all right
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is to avoid forcing the parser to make a premature decision; once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)

Resources