I am writing a parser to a C-like grammar, but I am having a problem with a shift/reduce conflict:
Basically, the grammar accept a list of optional global variables declarations followed by the functions.
I have the following rules:
program: global_list function_list;
type_name : TKINT /* int */
| TKFLOAT /* float */
| TKCHAR /* char */
global_list : global_list var_decl ';'
|
;
var_decl : type_name NAME;
function_list : function_list function_def
|
;
function_def : type_name NAME '(' param_list ')' '{' func_body '}' ;
I understand that I have a problem because the grammar can't decide if the next type_name NAME belongs to global_list or function_list, and by default it is expecting a global_list
Ex:
int var1;
int foo(){}
error: unexpcted '(', expecting ';'
The problem is that a function_def can only occur after a function_list, which means that the parser needs to reduce an empty function_list (using the production function_list → ε) before it can recognize a function_def. Furthermore, it needs to make that decision by only looking at the token which follows the empty production. Since that token (a type_name) could start either a var_decl or a function_def, there is no way for the parser to decide.
Even leaving the decision for one more token won't help; it's not until the third token that the correct decision can be made. So your grammar is not ambiguous, but it is LR(3).
Sequences of possibly empty lists of different type always create this problem. By contrast, sequences of non-empty lists do not, so a first approach to solving the problem is to eliminate the ε-productions.
First, we expand the top-level definition to make it clear that both lists are optional:
program: global_list function_list;
| global_list
| function_list
|
;
Then we make both list types non-empty:
global_list
: var_decl
| global_list var_decl
;
function_list
: function_def
| function_list function_def
;
The rest of the grammar is unchanged.
type_name : TKINT /* int */
| TKFLOAT /* float */
| TKCHAR /* char */
var_decl : type_name NAME;
function_def : type_name NAME '(' param_list ')' '{' func_body '}' ;
It's worth noting that the problem would never have arisen if declarations could be interspersed. Is it really necessary that all global variables be defined before any function? If not, you could just use a single list type, which would also be conflict free:
program: decl_list ;
decl_list:
| decl_list var_decl;
| decl_list function_def
;
Both these solutions work because a bottom-up parser can wait until the end of the production being reduced in order to decide which is the correct reduction; it does not matter that var_decl and function_def look identical until the third token.
The problem really is that it's hard to figure out the type of nothing.
Related
I've written the following arithmetic grammar:
grammar Calc;
program
: expressions
;
expressions
: expression (NEWLINE expression)*
;
expression
: '(' expression ')' // parenExpression has highest precedence
| expression MULDIV expression // then multDivExpression
| expression ADDSUB expression // then addSubExpression
| OPERAND // finally the operand itself
;
MULDIV
: [*/]
;
ADDSUB
: [-+]
;
// 12 or .12 or 2. or 2.38
OPERAND
: [0-9]+ ('.' [0-9]*)?
| '.' [0-9]+
;
NEWLINE
: '\n'
;
And I've noticed that regardless of how I space the tokens I get the same result, for example:
1+2
2+3
Or:
1 +2
2+3
Still give me the same thing. Also I've noticed that adding in the following rule does nothing for me:
WS
: [ \r\n\t] + -> skip
Which makes me wonder whether skipping whitespace is the default behavior of antlr4?
ANTLR4 based parsers have the ability to skip over single unwanted or missing tokens and continue parsing if possible (which is the case here). And there's no default to ignore whitespaces. You have to always specify a whitespace rule which either skips them or puts them on a hidden channel.
I'm trying to parse 'for loop' according to this (partial) grammar:
grammar GaleugParserNew;
/*
* PARSER RULES
*/
relational
: '>'
| '<'
;
varChange
: '++'
| '--'
;
values
: ID
| DIGIT
;
for_stat
: FOR '(' ID '=' values ';' values relational values ';' ID varChange ')' '{' '}'
;
/*
* LEXER RULES
*/
FOR : 'for' ;
ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
DIGIT : [0-9]+ ;
SPACE : [ \t\r\n] -> skip ;
When I try to generate the gui of how it's parsed, it's not following the grammar I provided above. This is what it produces:
I've encountered this problem before, what I did then was simply exit cmd, open it again and compile everything and somehow that worked then. It's not working now though.
I'm not really very knowledgeable about antlr4 so I'm not sure where to look to solve this problem.
Must be a problem of the IDE you are using. The grammar is fine and produces this parse tree in Visual Studio Code:
I guess the IDE is using the wrong parser or lexer (maybe from a different work file?). Print the lexer tokens to see if they are what you expect. Hint: avoid defining implicit lexer tokens (like '(', '}' etc.), which will allow to give the tokens good names.
I'm building a grammar in bison, and I've narrowed down my last reduce/reduce error to the following test-case:
%{
#include <stdio.h>
#include <string.h>
extern yydebug;
void yyerror(const char *str)
{
fprintf(stderr, "Error: %s\n", str);
}
main()
{
yydebug = 1;
yyparse();
}
%}
%right '='
%precedence CAST
%left '('
%token AUTO BOOL BYTE DOUBLE FLOAT INT LONG SHORT SIGNED STRING UNSIGNED VOID
%token IDENTIFIER
%start file
%debug
%%
file
: %empty
| statement file
;
statement
: expression ';'
;
expression
: expression '=' expression
| '(' type ')' expression %prec CAST
| '(' expression ')'
| IDENTIFIER
;
type
: VOID
| AUTO
| BOOL
| BYTE
| SHORT
| INT
| LONG
| FLOAT
| DOUBLE
| SIGNED
| UNSIGNED
| STRING
| IDENTIFIER
;
Presumably the issue is that it can't tell the difference between type and expression when it sees (IDENTIFIER) in an expression.
Output:
fail.y: warning: 1 reduce/reduce conflict [-Wconflicts-rr]
fail.y:64.5-14: warning: rule useless in parser due to conflicts [-Wother]
| IDENTIFIER
^^^^^^^^^^
What can I do to fix this conflict?
If the grammar were limited to the productions shown in the OP, it would be relatively easy to remove the conflict, since the grammar is unambiguous. The only problem is that it is LR(2) and not LR(1).
The analysis in the OP is completely correct. When the parser sees, for example:
( identifier1 · )
(where · marks the current point, so the lookahead token is )), it is not possible to know whether that is a prefix of
( identifier1 · ) ;
( identifier1 · ) = ...
( identifier1 · ) identifier2
( identifier1 · ) ( ...
In the first two cases, identifier1 must be reduced to expression, so that ( expression ) can subsequently be reduced to expression, whereas in the last two cases,
identifier1 must be reduced to type so that ( type ) expression can subsequently be reduced to expression. If the parser could see one token further into the future, the decision could be made.
Since for any LR(k) grammar, there is an LR(1) grammar which recognizes the same language, there is clearly a solution; the general approach is to defer the reduction until the one-token lookahead is sufficient to distinguish. One way to do this is:
cast_or_expr : '(' IDENTIFIER ')'
;
cast : cast_or_expr
| '(' type ')'
;
expr_except_id : cast_or_expr
| cast expression %prec CAST
| '(' expr_except_id ')'
| expression '=' expression
;
expression : IDENTIFIER
| expr_except_id
;
(The rest of the grammar is the same except for the removal of IDENTIFIER from the productions for type.)
That works fine for grammars where no symbol can be both a prefix and an infix operator (like -) and where no operator can be elided (effectively, as in function calls). In particular, it won't work for C, because it will leave the ambiguities:
( t ) * a // product or cast?
( t ) ( 3 ) // function-call or cast?
Those are real ambiguities in the grammar which can only be resolved by knowing whether t is a typename or a variable/function name.
The "usual" solution for C parsers is to resolve the ambiguity by sharing the symbol table between the scanner and the parser; since typedef type alias declarations must appear before the first use of the symbol as a typename in the applicable scope, it can be known in advance of the scan of the token whether or not the token has been declared with a typedef. More accurately, if the typedef has not been seen, it can be assumed that the symbol is not a type, although it may be completely undeclared.
By using a GLR grammar and a semantic predicate, it is possible to restrict the logic to the parser. Some people find that more elegant.
I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that
Here's the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state - I would clearly like the parser to backtrack, and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN and then things would work because the ; would match the var statement syntax all right
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is that to avoid forcing the parser to make a premature decision; once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)
I am trying to parse CSP(Communicating Sequential Processes) CSP Reference Manual. I have defined following grammar rules.
assignment
: IDENT '=' processExpression
;
processExpression
: ( STOP
| SKIP
| chaos
| prefix
| prefixWithValue
| seqComposition
| interleaving
| externalChoice
....
seqComposition
: processExpression ';' processExpression
;
interleaving
: processExpression '|||' processExpression
;
externalChoice
: processExpression '[]' processExpression
;
Now ANTLR reports that
seqComposition
interleaving
externalChoice
are left recursive . Is there any way to remove this or I should better used Bison Flex for this type of grammar. (There are many such rules)
Define a processTerm. Then write rules looking like
assignment
: IDENT '=' processExpression
;
processTerm
: ( STOP
| SKIP
| chaos
| prefix
...
processExpression
: ( processTerm
| processTerm ';' processExpression
| processTerm '|||' processExpression
| processTerm '[]' processExpression
....
If you want to have things like seqComposition still defined, I think that would be OK as well. But you need to make sure that the parsing of processExpansion is going to always consume more text as you proceed through your rules.
Read the guide to removing left recursion in on the ANTLR wiki. It helped me a lot.