Why does parsing this program with BNFC fail?

Given the following grammar:
comment "/*" "*/" ;
TInt. Type1 ::= "int" ;
TBool. Type1 ::= "bool" ;
coercions Type 1 ;
BTrue. BExp ::= "true" ;
BFalse. BExp ::= "false" ;
EOr. Exp ::= Exp "||" Exp1 ;
EAnd. Exp1 ::= Exp1 "&&" Exp2 ;
EEq. Exp2 ::= Exp2 "==" Exp3 ;
ENeq. Exp2 ::= Exp2 "!=" Exp3 ;
ELt. Exp3 ::= Exp3 "<" Exp4 ;
EGt. Exp3 ::= Exp3 ">" Exp4 ;
ELte. Exp3 ::= Exp3 "<=" Exp4 ;
EGte. Exp3 ::= Exp3 ">=" Exp4 ;
EAdd. Exp4 ::= Exp4 "+" Exp5 ;
ESub. Exp4 ::= Exp4 "-" Exp5 ;
EMul. Exp5 ::= Exp5 "*" Exp6 ;
EDiv. Exp5 ::= Exp5 "/" Exp6 ;
EMod. Exp5 ::= Exp5 "%" Exp6 ;
ENot. Exp6 ::= "!" Exp ;
EVar. Exp8 ::= Ident ;
EInt. Exp8 ::= Integer ;
EBool. Exp8 ::= BExp ;
EIver. Exp8 ::= "[" Exp "]" ;
coercions Exp 8 ;
Decl. Decl ::= Ident ":" Type ;
terminator Decl ";" ;
LIdent. Lvalue ::= Ident ;
SBlock. Stm ::= "{" [Decl] [Stm] "}" ;
SExp. Stm ::= Exp ";" ;
SWhile. Stm ::= "while" "(" Exp ")" Stm ;
SReturn. Stm ::= "return" Exp ";" ;
SAssign. Stm ::= Lvalue "=" Exp ";" ;
SPrint. Stm ::= "print" Exp ";" ;
SIf. Stm ::= "if" "(" Exp ")" "then" Stm "endif" ;
SIfElse. Stm ::= "if" "(" Exp ")" "then" Stm "else" Stm "endif" ;
terminator Stm "" ;
entrypoints Stm;
The parser generated by BNFC fails to parse
{ c = a; }
although it parses
c = a;
or
{ print a; c = a; }
I suspect the problem is that the parser sees an Ident and doesn't know whether it starts a declaration or a statement (LR limitations and so on), although I would have thought one token of lookahead should be enough. However, I couldn't find any note in the BNFC documentation saying that it doesn't work for all grammars.
Any ideas how to get this working?

I would think you would get a shift/reduce conflict report for that grammar, although where that error message shows up might well depend on which tool BNFC is using to generate the parser. As far as I know, all the backend tools have the same approach to dealing with shift/reduce conflicts, which is to (1) warn the user about the conflict, and then (2) resolve the conflict in favour of shifting.
The problematic production is this one (I've left out type annotations to reduce clutter):
Stm ::= "{" [Decl] [Stm] "}" ;
Here, [Decl] and [Stm] are macros, which automatically produce definitions for the non-terminals with those names (or something equivalent which will be accepted by the backend tool). Specifically, the automatically-produced productions are:
[Decl] ::= /* empty */
| Decl ';' [Decl]
[Stm] ::= /* empty */
| Stm [Stm]
(The ; in the first rule is the result of the "terminator" declaration. I don't know why BNFC generates right-recursive rules, but that's how I interpret the reference manual after a very quick glance, and I'm sure they have their reasons. For the purpose of this problem, it doesn't matter.)
What's important is that both Decl and Stm can start with an Ident. So let's suppose we're parsing { id ..., which might be { id : ... or { id = ..., but we've only read the { and the lookahead token id. So there are two possibilities:
id is the start of a Decl. We should shift the Ident and go to the state which includes Decl → Ident • ':' Type
id is the start of a Stm. In this case, we need to reduce the production [Decl] → • before we shift Ident into a Stm production.
So we have a shift/reduce conflict, because we cannot see the second next token (either : or =). And, as mentioned above, shift usually wins in this case, so the LR(1) parser will commit itself to expect a Decl. Consequently, { a = b ; } will fail.
An LR(2) parser generator would do fine with this grammar, but those are much harder to find. (Modern bison can produce GLR parsers, which are even more powerful than LR(2) at the cost of a bit of extra compute time, but not the version required by the BNFC tool.)
Possible solutions
Allow declarations to be intermingled with statements. This one is my preference. It is simple, and many programmers expect to be able to declare a variable at first use rather than at the beginning of the enclosing block.
Make the declaration recognizable from the first token, either by putting the type first (as in C) or by adding a keyword such as var (as in JavaScript), so that a declaration would read, for example, var c : int;
Modify the grammar to defer the lookahead. It is always possible to find an LR(1) grammar for any LR(k) language (provided k is finite), but it can be tedious. An ugly but effective alternative is to continue the lexical scan until either a : or some other non-whitespace character is found, so that id : gets tokenized as IdentDefine or some such. (This is the solution used by bison, as it happens. It means that you can't put comments between an identifier and the following :, but there are few, if any, good reasons to put a comment in that context.)
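As a rough illustration of the lexer-based workaround in the last item, here is a minimal Python sketch. It is not BNFC output, and the token names (IDENT_DEFINE, IDENT, INT, OP) and the function tokenize are made up for this example. After reading an identifier it peeks past whitespace, and if the next character is a colon it emits a single IDENT_DEFINE token, so a parser with one token of lookahead can tell a declaration from a statement right away.

```python
import re

# Hypothetical token set for the toy language in the question.
# Keywords such as "int" or "print" come out as IDENT here to keep the sketch short.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>/\*.*?\*/)
  | (?P<WS>\s+)
  | (?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<INT>\d+)
  | (?P<OP>\|\||&&|==|!=|<=|>=|[{}()\[\];:=<>+\-*/%!])
""", re.VERBOSE | re.DOTALL)


def tokenize(src):
    """Return (kind, text) pairs; 'id :' is folded into one IDENT_DEFINE token."""
    tokens = []
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at offset {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind in ("WS", "COMMENT"):
            continue
        if kind == "IDENT":
            # Peek past whitespace only. Comments are deliberately not skipped,
            # which is the restriction mentioned in the text above.
            rest = pos
            while rest < len(src) and src[rest].isspace():
                rest += 1
            if rest < len(src) and src[rest] == ":":
                tokens.append(("IDENT_DEFINE", text))
                pos = rest + 1  # the ':' is consumed as part of the token
                continue
        tokens.append((kind, text))
    return tokens
```

With such a lexer the declaration rule would start with the combined token instead of Ident ":", so the parser never has to guess after seeing a plain identifier.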

Related

Why doesn't Bison accept this grammar file?

When I use the command bison -d -o parser.java parser.y to generate a parser from my grammar file parser.y, Bison produces the following error:
:8.8-10: syntax error, unexpected string, expecting char or identifier or type
Here is the file parser.y:
%{
import java.util.*;
import java.io.*;
%}
%start PROGRAM
%token number identifier function break call if else let read return while write
%token "(" ")" "{" "}" ";" "=" "+" "-" "*" "/" "%" "<" ">" " <= " " >= " "==" "!=" "&" "|" "~" "!"
%left "+" "-"
%left "*" "/" "%"
%left "&" "|"
%nonassoc "!"
%type <Node> PROGRAM FUNCTION PARAMLIST BLOCK STATEMENT IF ELSE EXPR
%type <String> identifier
%type <Integer> number
%union {
Node node;
String identifier;
int number;
}
%%
PROGRAM:
| PROGRAM FUNCTION
| BLOCK
;
FUNCTION:
function identifier '(' PARAMLIST ')' BLOCK
;
PARAMLIST:
identifier
| identifier ',' PARAMLIST
|
;
BLOCK:
'{' STATEMENT '}'
;
STATEMENT:
BREAK
| CALL ';'
| IF
| LET
| READ
| RETURN
| WHILE
| WRITE
;
BREAK:
break ';'
;
CALL:
call identifier '(' ARGLIST ')'
;
ARGLIST:
EXPR
| EXPR ',' ARGLIST
|
;
IF:
if EXPR BLOCK ELSE
;
ELSE:
else BLOCK
|
;
LET:
let identifier '=' EXPR ';'
| let identifier '=' CALL ';'
;
READ:
read identifier ';'
;
RETURN:
return EXPR ';'
;
WHILE:
while EXPR BLOCK
;
WRITE:
write EXPR ';'
;
EXPR:
number
| identifier
| '(' EXPR ')'
| '!' EXPR
| '~' EXPR
| EXPR '+' EXPR
| EXPR '-' EXPR
| EXPR '*' EXPR
| EXPR '/' EXPR
| EXPR '%' EXPR
| EXPR '&' EXPR
| EXPR '|' EXPR
| EXPR '<' EXPR
| EXPR '>' EXPR
| EXPR "<=" EXPR
| EXPR ">=" EXPR
| EXPR "==" EXPR
| EXPR "!=" EXPR
;
%%
int yyerror(String s) {
System.err.println("error: " + s);
}

Bison doesn't allow you to declare quoted token names (such as "(") with the %token declaration. It knows they are tokens; they cannot be anything else.
You use the %token declaration to declare symbolic names for tokens, which you will find useful when writing your lexer. In the declaration, the symbolic name comes first, optionally followed by the double-quoted alias. You can repeat that as often as you like. For example, you could write:
%token TK_LE "<=" TK_GE ">="
You can then use either the symbolic name or the alias in your grammar, but using the alias makes your grammar more readable. Also, Bison uses the alias when constructing error messages, which is a good thing since "expecting TK_SEMIC" is not a great way to communicate with a user that a ";" was required.
Keep in mind that a single-quoted single character token, such as '(', is not the same token as the double-quoted alias. In your grammar, you use '(' but attempt to declare "(". Had you succeeded in declaring "(", you would have gotten an "unused token" warning. Since '(' doesn't require a symbolic name, you can just remove the declaration. You will only need them for multicharacter tokens like "<=". (Note that spaces are significant inside quotes. " <= " is not the same as "<=".)
Symbolic token names are used as Java values, so their names cannot conflict with variables or Java keywords. You cannot, for example, use break as a symbolic token name. Trying to do so will cause compilation errors.
For this reason, it's customary to write token names in ALL_CAPS, and non-terminals in lower case. Non-terminal names are not used in the generated code, so you can use whatever names you wish.
You reverse this convention, which will cause a variety of errors when you compile the generated parser, and which is hard to read for those of us accustomed to the standard style.
A couple of other notes:
The bison Java interface does not use a %union declaration. The %type declarations are sufficient.
You are missing precedence declarations for many operators, particularly comparison operators. That will lead to a large number of parser conflicts. Make sure you write the precedence levels in the correct order.

How would I implement operator-precedence in my grammar?

I'm trying to make an expression parser and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features, but right now I'm not sure how to make division precede multiplication, etc. How should I modify my grammar to implement operator precedence?

Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
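Here ( ... )* means zero or more repetitions. The grammar is deterministic and unambiguous. Multiplication and division share one precedence level, as do addition and subtraction, and the multiplicative level binds tighter than the additive one. The repetition by itself does not fix an associativity: if you fold the operands from left to right as you loop, you get the left-associative result that - and / need.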
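To make this concrete, here is a minimal recursive-descent sketch of that grammar in Python. It evaluates while parsing instead of building a tree, and the names tokenize, Parser and evaluate are just chosen for this example. Each non-terminal becomes one method, and each ( ... )* becomes a while loop that folds operands from left to right.

```python
import re


def tokenize(text):
    # Integers, the four operators and parentheses. Anything else
    # (including whitespace) is silently skipped in this sketch.
    return re.findall(r"\d+|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def exp(self):   # exp ::= add
        return self.add()

    def add(self):   # add ::= mul (("+" | "-") mul)*
        value = self.mul()
        while self.peek() in ("+", "-"):
            op = self.next()
            rhs = self.mul()
            value = value + rhs if op == "+" else value - rhs
        return value

    def mul(self):   # mul ::= term (("*" | "/") term)*
        value = self.term()
        while self.peek() in ("*", "/"):
            op = self.next()
            rhs = self.term()
            value = value * rhs if op == "*" else value / rhs
        return value

    def term(self):  # term ::= number | "(" exp ")"
        tok = self.next()
        if tok == "(":
            value = self.exp()
            if self.next() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.exp()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

With this, evaluate("1 + 2 * 3 + 4") gives 11, and evaluate("10 - 4 - 3") gives 3 because the loop in add groups the subtraction to the left.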

shift/reduce Error with Cup

Hi, I am writing a parser for a programming language my university uses, with JFlex and CUP.
I started with just the first basic structures, such as processes and variable declarations.
I get the following errors:
Warning : *** Shift/Reduce conflict found in state #4
between vardecls ::= (*)
and vardecl ::= (*) IDENT COLON vartyp SEMI
and vardecl ::= (*) IDENT COLON vartyp EQEQ INT SEMI
under symbol IDENT
Resolved in favor of shifting.
Warning : *** Shift/Reduce conflict found in state #2
between vardecls ::= (*)
and vardecl ::= (*) IDENT COLON vartyp SEMI
and vardecl ::= (*) IDENT COLON vartyp EQEQ INT SEMI
under symbol IDENT
Resolved in favor of shifting.
My Code in Cup looks like this :
non terminal programm;
non terminal programmtype;
non terminal vardecl;
non terminal vardecls;
non terminal processdecl;
non terminal processdecls;
non terminal vartyp;
programm ::= programmtype:pt vardecls:vd processdecls:pd
{: RESULT = new SolutionNode(pt, vd, pd); :} ;
programmtype ::= IDENT:v
{: RESULT = ProblemType.KA; :} ;
vardecls ::= vardecl:v1 vardecls:v2
{: v2.add(v1);
RESULT = v2; :}
|
{: ArrayList<VarDecl> list = new ArrayList<VarDecl>() ;
RESULT = list; :}
;
vardecl ::= IDENT:id COLON vartyp:vt SEMI
{: RESULT = new VarDecl(id, vt); :}
| IDENT:id COLON vartyp:vt EQEQ INT:i1 SEMI
{: RESULT = new VarDecl(id, vt, i1); :}
;
vartyp ::= INTEGER
{: RESULT = VarType.Integer ; :}
;
processdecls ::= processdecl:v1 processdecls:v2
{: v2.add(v1);
RESULT = v2; :}
| {: ArrayList<ProcessDecl> list = new ArrayList<ProcessDecl>() ;
RESULT = list; :}
;
processdecl ::= IDENT:id COLON PROCESS vardecls:vd BEGIN END SEMI
{: RESULT = new ProcessDecl(id, vd); :}
;
I guess I get the errors because the process declaration and the variable declaration both start with an identifier, then a ":", and then either the terminal PROCESS or a terminal like INTEGER. If so, I'd like to know how I can tell my parser to look ahead a bit more, or whatever other solution is possible.
Thanks for your answers.

Your diagnosis is absolutely correct. Because the parser cannot know whether IDENT starts a processdecl or a vardecl without two more lookahead tokens, it cannot know when it has just reduced a vardecl and is looking at an IDENT whether it is about to see another vardecl or a processdecl.
In the first case, it must just shift the IDENT as part of the following vardecl. In the second case, it needs to first reduce an empty vardecls and then successively reduce vardecls until it has constructed the complete list.
To get rid of the shift reduce conflict, you need to simplify the parser's decision-making.
The simplest solution is to allow the parser to accept declarations in any order. Then you end up with something like this:
program ::= program_type declaration_list ;
declaration_list ::=
var_declaration declaration_list
| process_declaration declaration_list
|
;
var_declaration_list ::=
var_declaration var_declaration_list
|
;
process_declaration ::=
IDENT:id COLON PROCESS var_declaration_list BEGIN END SEMI ;
(Personally, I'd make the declaration lists left-recursive rather than right-recursive, but it depends whether you prefer to append or prepend in the list's action. Left-recursion uses less parser stack.)
If you really want to insist that all variable declarations come before any process declaration, you can check for that in the action for declaration_list.
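The ordering check itself is a few lines of ordinary code. Here is a language-neutral sketch in Python (in CUP it would live in a Java action). The function name misplaced_variables and the (kind, name) pairs are assumptions made for this example, and the list is assumed to be in source order.

```python
def misplaced_variables(declarations):
    """Return the names of variables declared after the first process.

    `declarations` is a list of (kind, name) pairs in source order,
    where kind is either "var" or "process" (hypothetical encoding).
    """
    first_process = None
    misplaced = []
    for kind, name in declarations:
        if kind == "process":
            if first_process is None:
                first_process = name
        elif first_process is not None:
            # A variable declaration that follows a process declaration.
            misplaced.append(name)
    return misplaced
```

An empty result means the program respects the "variables first" rule; otherwise the action can report each returned name as an error.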
Alternatively, you can start by making both types of declaration list left-recursive instead of right recursive. That will almost work, but it will still generate a shift-reduce conflict in the same state as the original grammar, this time because it needs to reduce an empty process declaration list before the first process declaration can be reduced.
Fortunately, that's easier to work around. If the process declaration list cannot be empty, there is no problem, so it's just a question of rearranging the productions:
program ::= program_type var_declaration_list process_declaration_list
| program_type var_declaration_list
;
var_declaration_list ::=
var_declaration_list var_declaration
|
;
process_declaration_list ::=
process_declaration_list process_declaration
| process_declaration
;
Finally, an ugly but possible alternative is to make the variable declaration list left-recursive and the process declaration list right-recursive. In that case, there is no empty production between the last variable declaration and the first process declaration.

How to solve SHIFT/REDUCE conflict - in parser generator

I need help solving this one, and an explanation of how to deal with such shift/reduce conflicts in the future.
I have some conflicts between a few states in my CUP file.
The grammar looks like this:
I have conflicts between the "(" [ActPars] ")" states.
1. Statement = Designator ("=" Expr | "++" | "--" | "(" [ActPars] ")" ) ";"
2. Factor = number | charConst | Designator [ "(" [ActPars] ")" ].
I don't want to paste the whole 700 lines of the CUP file.
I will give you the relevant states and the error output.
This is the code for line 1.)
Matched ::= Designator LPAREN ActParamsList RPAREN SEMI_COMMA
ActParamsList ::= ActPars
|
/* EPS */
;
ActPars ::= Expr
|
Expr ActPComma
;
ActPComma ::= COMMA ActPars;
This is for the line 2.)
Factor ::= Designator ActParamsOptional ;
ActParamsOptional ::= LPAREN ActParamsList2 RPAREN
|
/* EPS */
;
ActParamsList2 ::= ActPars
|
/* EPS */
;
Expr ::= SUBSTRACT Term RepOptionalExpression
|
Term RepOptionalExpression
;
The ERROR output looks like this:
Warning : *** Shift/Reduce conflict found in state #182
between ActParamsOptional ::= LPAREN ActParamsList RPAREN (*)
and Matched ::= Designator LPAREN ActParamsList RPAREN (*) SEMI_COMMA
under symbol SEMI_COMMA
Resolved in favor of shifting.
Error : * More conflicts encountered than expected -- parser generation aborted

I believe the problem is that your parser won't know whether it should shift the token
SEMI_COMMA
or reduce to the non-terminal
ActParamsOptional
since, after a Designator, both ActParamsOptional and Matched match the same input:
LPAREN ActPars RPAREN

BNFC parser and bracket Mathematica like syntax

I played a bit with the BNF Converter and tried to re-engineer parts of the Mathematica language. My BNF had already about 150 lines and worked OK, until I noticed a very basic bug. Brackets [] in Mathematica are used for two different things
expr[arg] to call a function
list[[spec]] to access elements of an expression, e.g. a List
Let's assume I want to create the parser for a language which consists only of identifiers, function calls, element access and sequence of expressions as arguments. These forms would be valid
f[]
f[a]
f[a,b,c]
f[[a]]
f[[a,b]]
f[a,f[b]]
f[[a,f[x]]]
A direct, but obviously wrong input-file for BNFC could look like
entrypoints Expr ;
TSymbol. Expr1 ::= Ident ;
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]]" ;
coercions Expr 1 ;
separator Sequence "," ;
SequenceExpr. Sequence ::= Expr ;
This BNF does not work for the last two examples of the first code-block.
The problem seems to be located in the generated Yylex lexer file, which matches ] and ]] separately. This is wrong because, as can be seen in the last two examples, whether it's a closing ] or ]] depends on the context. So either you have to create a stack of braces to ensure the right matching, or you leave that to the parser.
Can someone enlighten me whether it's possible to realize this with BNFC?
(Btw, other hints would be gratefully taken too)

Your problem is the token "]]". If the lexer collects this without having
any memory of its past, it might be mistaken. So just don't do that!
The parser by definition remembers its left context, so you can get
it to do the bracket matching correctly.
I would define your grammar this way:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[" "[" [Sequence] "]" "]" ;
with the lexer detecting only single "[" "]" as tokens.
An odd variant:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]" "]" ;
with the lexer also detecting "[[" as a token, since it can't be mistaken.
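To show that single-bracket tokens are enough, here is a small hand-written recursive-descent sketch in Python. It is not what BNFC generates, and the names tokenize and parse as well as the tuple-shaped tree are made up for this example. The lexer only ever emits "[" and "]", and the parser decides between a function call and a part access by looking at the token after the first "[", which works because no expression can start with "[".

```python
import re


def tokenize(text):
    # Only single brackets are tokens; "[[" and "]]" are never produced.
    return re.findall(r"[A-Za-z][A-Za-z0-9]*|[\[\],]", text)


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
        pos += 1

    def sequence():
        # Possibly empty, comma-separated list of expressions.
        items = []
        if peek() == "]":
            return items
        items.append(expr())
        while peek() == ",":
            expect(",")
            items.append(expr())
        return items

    def expr():
        nonlocal pos
        tok = peek()
        if tok is None or not tok[0].isalpha():
            raise SyntaxError(f"expected identifier, got {tok!r}")
        pos += 1
        node = ("Symbol", tok)
        while peek() == "[":
            expect("[")
            if peek() == "[":
                # Second "[" right away: this is Expr "[" "[" ... "]" "]".
                expect("[")
                args = sequence()
                expect("]")
                expect("]")
                node = ("Part", node, args)
            else:
                args = sequence()
                expect("]")
                node = ("FunctionCall", node, args)
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

Both problem inputs from the question, f[a,f[b]] and f[[a,f[x]]], parse with this approach. One side effect of single-bracket tokens is that f[ [a] ] with whitespace between the brackets is also accepted as a part access.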
