Problems with parsing decaf (variable declaration versus constructor)

Problems with parsing decaf (variable declaration versus constructor) - parsing

I am using bison (3.0.4) and lexer to implement the (partial) grammar of Decaf programming language. I am only implementing what is inside the class.
So, my task is simple: store every production rule (as a string) in the tree, and then just print it out.
For example, if you have the following line of code as an input
class Foo { Foo(int arg1) { some2 a; } }
you (must) get the following output
<ClassDecl> --> class identifier <classBody>
<ClassBody> --> { <VariableDecl>* <ConstructorDecl>* <MethodDecl>* }
<ConstructorDecl> --> identifier ( <ParameterList> ) <Block>
<ParameterList> --> <Parameter> <, Parameter>*
<Parameter> --> <Type> identifier
<Type> --> <SimpleType>
<SimpleType> --> int
<Block> --> { <LocalVariableDecl>* <Statement>* }
<LocalVariableDecl> --> <Type> identifier ;
<Type> --> <SimpleType>
<SimpleType> --> identifier
The first problem (solved) was that it parsed variable declaration instead of constructor declaration, though I have no variable declaration in the scope of a class itself (i.e. I have only inside the constructor's block). This is solved.
Nevertheless, if I give the following class abc { some1 abc; john doe; } it says that syntax error, unexpected SEMICOLON, expecting LP. So, the character at line 19 causes the problem.
Here is .y file (only classBody rule)
class_decl:
CLASS ID LC class_body RC
;
/* FIXME: Gotta add more grammar here */
class_body: var_declmore constructor_declmore method_declmore
| var_declmore constructor_declmore
| var_declmore method_declmore
| constructor_declmore method_declmore
| method_declmore
| var_declmore
| constructor_declmore
| %empty
;
var_declmore: var_decl
| var_declmore var_decl
;
constructor_declmore: constructor_decl
| constructor_declmore constructor_decl
;
var_decl: type ID SEMICOLON
| type ID error
;
constructor_decl: ID LP parameter_list RP block
| ID error parameter_list RP block
;
Here is the gist to the full .y file.

The essential problem is that constructor_declmore can be empty, and that both var_decl and constructor_decl can start with ID.
That's a problem because before the parser can recognize a constructor_decl, it needs to reduce an (empty) constructor_declmore. But it obviously cannot do that reduction unless it knows that the var_declmore is finished.
So when it sees an ID it has to decide between one of two actions:
Reduce an empty constructor_declmore, thereby deciding that there are no more var_decls; or
Shift the ID in order to start parsing a new var_decl.
In the absence of precedence declarations (which wouldn't help here), bison/yacc always resolves shift/reduce conflicts in favour of the shift action. So in this case, it assumes that Foo is the ID which starts a var_decl, leading to the error message you note.
The reduce/reduce conflict produced by the grammar should also be looked at. It comes from the method_declmore: method_decl rule, which conflicts with the other possible way of creating a method_declmore by starting with an empty method_declmore and then adding a method_decl.

Related

Flex and Bison - Grammar that sometimes care about spaces

Currently I'm trying to implement a grammar which is very similar to ruby. To keep it simple, the lexer currently ignores space characters.
However, in some cases the space letter makes big difference:
def some_callback(arg=0)
arg * 100
end
some_callback (1 + 1) + 1 # 300
some_callback(1 + 1) + 1 # 201
some_callback +1 # 100
some_callback+1 # 1
some_callback + 1 # 1
So currently all whitespaces are being ignored by the lexer:
{WHITESPACE} { ; }
And the language says for example something like:
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
One way I can think of to solve this problem would be to explicitly add whitespaces to the whole grammar, but doing so the whole grammar would increase a lot in complexity:
// OLD:
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression T_ADD MultiplicativeExpression
| AdditiveExpression T_SUB MultiplicativeExpression
;
// NEW:
_:
/* empty */
| WHITESPACE _;
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression _ T_ADD _ MultiplicativeExpression
| AdditiveExpression _ T_SUB _ MultiplicativeExpression
;
//...
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
So I liked to ask whether there is any best practice on how to solve this grammar.
Thank you in advance!

Without having a full specification of the syntax you are trying to parse, it's not easy to give a precise answer. In the following, I'm assuming that those are the only two places where the presence (or absence) of whitespace between two tokens affects the parse.
Differentiating between f(...) and f (...) occurs in a surprising number of languages. One common strategy is for the lexer to recognize an identifier which is immediately followed by an open parenthesis as a "FUNCTION_CALL" token.
You'll find that in most awk implementations, for example; in awk, the ambiguity between a function call and concatenation is resolved by requiring that the open parenthesis in a function call immediately follow the identifier. Similarly, the C pre-processor macro definition directive distinguishes between #define foo(A) A (the definition of a macro with arguments) and #define foo (A) (an ordinary macro whose expansion starts with a ( token.
If you're doing this with (f)lex, you can use the / trailing-context operator:
[[:alpha:]_][[:alnum:]_]*/'(' { yylval = strdup(yytext); return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]* { yylval = strdup(yytext); return IDENT; }
The grammar is now pretty straight-forward:
call: FUNC_CALL '(' expression_list ')' /* foo(1, 2) */
| IDENT expression_list /* foo (1, 2) */
| IDENT /* foo * 3 */
This distinction will not be useful in all syntactic contexts, so it will often prove useful to add a non-terminal which will match either identifier form:
name: IDENT | FUNC_CALL
But you will need to be careful with this non-terminal. In particular, using it as part of the expression grammar could lead to parser conflicts. But in other contexts, it will be fine:
func_defn: "def" name '(' parameters ')' block "end"
(I'm aware that this is not the precise syntax for Ruby function definitions. It's just for illustrative purposes.)
More troubling is the other ambiguity, in which it appears that the unary operators + and - should be treated as part of an integer literal in certain circumstances. The behaviour of the Ruby parser suggests that the lexer is combining the sign character with an immediately following number in the case where it might be the first argument to a function. (That is, in the context <identifier><whitespace><sign><digits> where <identifier> is not an already declared local variable.)
That sort of contextual rule could certainly be added to the lexical scanner using start conditions, although it's more than a little ugly. A not-fully-fleshed out implementation, building on the previous:
%x SIGNED_NUMBERS
%%
[[:alpha:]_][[:alnum:]_]*/'(' { yylval.id = strdup(yytext);
return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]*/[[:blank:]] { yylval.id = strdup(yytext);
if (!is_local(yylval.id))
BEGIN(SIGNED_NUMBERS);
return IDENT; }
[[:alpha:]_][[:alnum:]_]*/ { yylval.id = strdup(yytext);
return IDENT; }
<SIGNED_NUMBERS>[[:blank:]]+ ;
/* Numeric patterns, one version for each context */
<SIGNED_NUMBERS>[+-]?[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
BEGIN(INITIAL);
return INTEGER; }
[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
return INTEGER; }
/* ... */
/* If the next character is not a digit or a sign, rescan in INITIAL state */
<SIGNED_NUMBERS>.|\n { yyless(0); BEGIN(INITIAL); }
Another possible solution would be for the lexer to distinguish sign characters which follow a space and are directly followed by a digit, and then let the parser try to figure out whether or not the sign should be combined with the following number. However, this will still depend on being able to distinguish between local variables and other identifiers, which will still require the lexical feedback through the symbol table.
It's worth noting that the end result of all this complication is a language whose semantics are not very obvious in some corner cases. The fact that f+3 and f +3 produce different results could easily lead to subtle bugs which might be very hard to detect. In many projects using languages with these kinds of ambiguities, the project style guide will prohibit legal constructs with unclear semantics. You might want to take this into account in your language design, if you have not already done so.

Shift/Reduce conflict on C variation grammar

I am writing a parser to a C-like grammar, but I am having a problem with a shift/reduce conflict:
Basically, the grammar accept a list of optional global variables declarations followed by the functions.
I have the following rules:
program: global_list function_list;
type_name : TKINT /* int */
| TKFLOAT /* float */
| TKCHAR /* char */
global_list : global_list var_decl ';'
|
;
var_decl : type_name NAME;
function_list : function_list function_def
|
;
function_def : type_name NAME '(' param_list ')' '{' func_body '}' ;
I understand that I have a problem because the grammar can't decide if the next type_name NAME belongs to global_list or function_list, and by default it is expecting a global_list
Ex:
int var1;
int foo(){}
error: unexpcted '(', expecting ';'

The problem is that a function_def can only occur after a function_list, which means that the parser needs to reduce an empty function_list (using the production function_list → ε) before it can recognize a function_def. Furthermore, it needs to make that decision by only looking at the token which follows the empty production. Since that token (a type_name) could start either a var_decl or a function_def, there is no way for the parser to decide.
Even leaving the decision for one more token won't help; it's not until the third token that the correct decision can be made. So your grammar is not ambiguous, but it is LR(3).
Sequences of possibly empty lists of different type always create this problem. By contrast, sequences of non-empty lists do not, so a first approach to solving the problem is to eliminate the ε-productions.
First, we expand the top-level definition to make it clear that both lists are optional:
program: global_list function_list;
| global_list
| function_list
|
;
Then we make both list types non-empty:
global_list
: var_decl
| global_list var_decl
;
function_list
: function_def
| function_list function_def
;
The rest of the grammar is unchanged.
type_name : TKINT /* int */
| TKFLOAT /* float */
| TKCHAR /* char */
var_decl : type_name NAME;
function_def : type_name NAME '(' param_list ')' '{' func_body '}' ;
It's worth noting that the problem would never have arisen if declarations could be interspersed. Is it really necessary that all global variables be defined before any function? If not, you could just use a single list type, which would also be conflict free:
program: decl_list ;
decl_list:
| decl_list var_decl;
| decl_list function_def
;
Both these solutions work because a bottom-up parser can wait until the end of the production being reduced in order to decide which is the correct reduction; it does not matter that var_decl and function_def look identical until the third token.
The problem really is that it's hard to figure out the type of nothing.

Can I do something to avoid the need to backtrack in this grammar?

I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that
Here's the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state - I would clearly like the parser to backtrack, and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN and then things would work because the ; would match the var statement syntax all right
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!

I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is that to avoid forcing the parser to make a premature decision; once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)

ANTLR doesn't find the defined start rule

I'm facing a strange ANTLR issue with a that should just output an AST.
grammar ltxt.g;
options
{
language=CSharp3;
}
prog : start
;
start : '{Start 'loopname'}'statement'{Ende 'loopname'}'
| statement
;
loopname : (('a'..'z')|('A'..'Z')|('1'..'9'))*;
statement : '<%' table_ref '>'
| start;
table_ref : '{'format'}'ID;
format : FSTRING
| FSTRING OFSTRING{0,5}
;
FSTRING : '#F'
| '#D'
| '#U'
| '#K'
;
OFSTRING: 'F'
| 'D'
| 'U'
| 'K'
//| 1..65536
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
When I try to code-gen this I get
error(100):LTXT.g:1:13:syntax error: antlr: MismatchedTokenException(74!=52). I didn't declare any 74 or 52.
also I do not get a Synatx diagram, since "rule "start"" cannot be found as a start state...
I know that this isn't pretty, but I thought it would work at least :)
Best,
wishi

There are four errors that I see.
A grammar name can't contain a period. That's the syntax error you're getting. The 74!=52 error message is a hint telling you that ANTLR found token id 74 when it was expecting token id 52, which in this case just translates to "it found one thing when it expected something else."
The grammar name ("ltxt") and the file name before the extension ("LTXT") need to match exactly.
The grammar won't produce an AST unless you specify output=AST; in the options section.
format's second alternative (FSTRING OFSTRING{0,5}) won't do what I think you think it's going to do. ANTLR doesn't support an arbitrary number of matches such as "match zero to five OFSTRINGs". You'll need to redefine the rule using semantic predicates that count occurrences for you. They aren't hard to use, but they're one of the trickier parts of ANTLR.
I hope that helps get you started.

How to remove Left recursive in this ANTLR grammar?

I am trying to parse CSP(Communicating Sequential Processes) CSP Reference Manual. I have defined following grammar rules.
assignment
: IDENT '=' processExpression
;
processExpression
: ( STOP
| SKIP
| chaos
| prefix
| prefixWithValue
| seqComposition
| interleaving
| externalChoice
....
seqComposition
: processExpression ';' processExpression
;
interleaving
: processExpression '|||' processExpression
;
externalChoice
: processExpression '[]' processExpression
;
Now ANTLR reports that
seqComposition
interleaving
externalChoice
are left recursive . Is there any way to remove this or I should better used Bison Flex for this type of grammar. (There are many such rules)

Define a processTerm. Then write rules looking like
assignment
: IDENT '=' processExpression
;
processTerm
: ( STOP
| SKIP
| chaos
| prefix
...
processExpression
: ( processTerm
| processTerm ';' processExpression
| processTerm '|||' processExpression
| processTerm '[]' processExpression
....
If you want to have things like seqComposition still defined, I think that would be OK as well. But you need to make sure that the parsing of processExpansion is going to always consume more text as you proceed through your rules.

Read the guide to removing left recursion in on the ANTLR wiki. It helped me a lot.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart