Using Mid Rule Actions - bison parser

Using Mid Rule Actions - bison parser - parsing

I have following grammar:
expr : FUNC '(' expr ')' { ... }
| FUNC '(' expr ',' expr ')' { ... }
| expr '+' expr { ... }
| expr '-' expr { ... }
| NUM { ... }
And I'd like to perform different mid-rule-action on FUNC rule prior to entering the expr, depending on the rule.
Meaning:
expr : FUNC { DO_ACTION_1 } '(' expr ')' { ... }
or
expr : FUNC '(' { DO_ACTION_1 } expr ')' { ... }
and in another case:
expr : FUNC { DO_ACTION_2 } '(' expr ',' expr ')' { ... }
or
expr : FUNC '(' { DO_ACTION_2 } expr ',' expr ')' { ... }
but I am keep getting reduce/reduce error and $$1 (or $$2 depending on the usage above) will never be reduced.

This problem is described in the bison documentation, as are the mechanisms to resolve the ambiguity.
The ambiguity is due to bison needing to commit to which of the two rules to apply early in the parsing, and at the stage of your action it cannot tell.
When it has matched:
FUNC
or
FUNC '('
or even
FUNC '(' expr
It could still be one of three syntax rules and thus the action is ambiguous.
To remove the ambiguity you can attach an action to the terminal symbol (either FUNC or '(') or make a rule for the preamble which contains the action:
func : FUNC { DO_ACTION1 } ;
bra : '(' { DO_ACTION1} ;
or
func_open : FUNC '(' expr { DO_ACTION1 };
expre : func_open | func_open ',' expr | etc

Related

Parser (Yacc) seems like it ignores tokens in grammar

Parsing the c-like example code, i have the following issue. Its like some tokens, like identifiers, are ignored by grammar, causing a non-reason syntax error.
Parser code :
%{
#include <stdio.h>
#include <stdlib.h>
int yylex();
void yyerror (char const *);
%}
%token T_MAINCLASS T_ID T_PUBLIC T_STATIC T_VOID T_MAIN T_PRINTLN T_INT T_FLOAT T_FOR T_WHILE T_IF T_ELSE T_EQUAL T_SMALLER T_BIGGER T_NOTEQUAL T_NUM T_STRING
%left '(' ')'
%left '+' '-'
%left '*' '/'
%left '{' '}'
%left ';' ','
%left '<' '>'
%%
PROGRAM : T_MAINCLASS T_ID '{' T_PUBLIC T_STATIC T_VOID T_MAIN '(' ')' COMP_STMT '}'
;
COMP_STMT : '{' STMT_LIST '}'
;
STMT_LIST : /* nothing */
| STMT_LIST STMT
;
STMT : ASSIGN_STMT
| FOR_STMT
| WHILE_STMT
| IF_STMT
| COMP_STMT
| DECLARATION
| NULL_STMT
| T_PRINTLN '(' EXPR ')' ';'
;
DECLARATION : TYPE ID_LIST ';'
;
TYPE : T_INT
| T_FLOAT
;
ID_LIST : T_ID ',' ID_LIST
|
;
NULL_STMT : ';'
;
ASSIGN_STMT : ASSIGN_EXPR ';'
;
ASSIGN_EXPR : T_ID '=' EXPR
;
EXPR : ASSIGN_EXPR
| RVAL
;
FOR_STMT : T_FOR '(' OPASSIGN_EXPR ';' OPBOOL_EXPR ';' OPASSIGN_EXPR ')' STMT
;
OPASSIGN_EXPR : /* nothing */
| ASSIGN_EXPR
;
OPBOOL_EXPR : /* nothing */
| BOOL_EXPR
;
WHILE_STMT : T_WHILE '(' BOOL_EXPR ')' STMT
;
IF_STMT : T_IF '(' BOOL_EXPR ')' STMT ELSE_PART
;
ELSE_PART : /* nothing */
| T_ELSE STMT
;
BOOL_EXPR : EXPR C_OP EXPR
;
C_OP : T_EQUAL | '<' | '>' | T_SMALLER | T_BIGGER | T_NOTEQUAL
;
RVAL : RVAL '+' TERM
| RVAL '-' TERM
| TERM
;
TERM : TERM '*' FACTOR
| TERM '/' FACTOR
| FACTOR
;
FACTOR : '(' EXPR ')'
| T_ID
| T_NUM
;
%%
void yyerror (const char * msg)
{
fprintf(stderr, "C-like : %s\n", msg);
exit(1);
}
int main ()
{
if(!yyparse()){
printf("Compiled !!!\n");
}
}
Part of Lexical Scanner code :
{Empty}+ { printf("EMPTY ") ; /* nothing */ }
"mainclass" { printf("MAINCLASS ") ; return T_MAINCLASS ; }
"public" { printf("PUBLIC ") ; return T_PUBLIC; }
"static" { printf("STATIC ") ; return T_STATIC ; }
"void" { printf("VOID ") ; return T_VOID ; }
"main" { printf("MAIN ") ; return T_MAIN ; }
"println" { printf("PRINTLN ") ; return T_PRINTLN ; }
"int" { printf("INT ") ; return T_INT ; }
"float" { printf("FLOAT ") ; return T_FLOAT ; }
"for" { printf("FOR ") ; return T_FOR ; }
"while" { printf("WHILE ") ; return T_WHILE ; }
"if" { printf("IF ") ; return T_IF ; }
"else" { printf("ELSE ") ; return T_ELSE ; }
"==" { printf("EQUAL ") ; return T_EQUAL ; }
"<=" { printf("SMALLER ") ; return T_SMALLER ; }
">=" { printf("BIGGER ") ; return T_BIGGER ; }
"!=" { printf("NOTEQUAL ") ; return T_NOTEQUAL ; }
{id} { printf("ID ") ; return T_ID ; }
{num} { printf("NUM ") ; return T_NUM ; }
{string} { printf("STRING ") ; return T_STRING ; }
{punct} { printf("PUNCT ") ; return yytext[0] ; }
<<EOF>> { printf("EOF ") ; return T_EOF; }
. { yyerror("lexical error"); exit(1); }
Example :
mainclass Example {
public static void main ( )
{
int c;
float x, sum, mo;
c=0;
x=3.5;
sum=0.0;
while (c<5)
{
sum=sum+x;
c=c+1;
x=x+1.5;
}
mo=sum/5;
println (mo);
}
}
Running all this stuff it showed up this output:
C-like : syntax error
MAINCLASS EMPTY ID
It seems like id is in wrong position although in grammar we have:
PROGRAM : T_MAINCLASS T_ID '{' T_PUBLIC T_STATIC T_VOID T_MAIN '(' ')' COMP_STMT '}'

Based on the "solution" proposed in OP's self answer, it's pretty clear that the original problem was that the generated header used to compile the scanner was not the same as the header generated by bison/yacc from the parser specification.
The generated header includes definitions of all the token types as small integers; in order for the scanner to communicate with the parser, it must identify each token with the correct token type. So the parser generator (bison/yacc) produces a header based on the parser specification (the .y file), and that header must be #included into the generated scanner so that scanner actions can used symbolic token type names.
If the scanner was compiled with a header file generated from some previous version of the parser specification, it is quite possible that the token numbers no longer correspond with what the parser is expecting.
The easiest way to avoid this problem is to use a build system like make, which will automatically recompile the scanner if necessary.
The easiest way to detect this problem is to use bison's built-in trace facility. Enabling tracing requires only a couple of lines of code, and saves you from having to scatter printf statements throughout your scanner and parser. The bison trace will show you exactly what is going on, so not only is it less work than adding printfs, it is also more precise. In particular, it reports every token which is passed to the parser (and, with a little more effort, you can get it to report the semantic values of those tokens as well). So if the parser is getting the wrong token code, you'll see that right away.

After many potential helpful changes, parser worked by changing the order of these tokens.
From
%token T_MAINCLASS T_ID T_PUBLIC T_STATIC T_VOID T_MAIN T_PRINTLN T_INT T_FLOAT T_FOR T_WHILE T_IF T_ELSE T_EQUAL T_SMALLER T_BIGGER T_NOTEQUAL T_NUM T_STRING
TO
%token T_MAINCLASS T_PUBLIC T_STATIC T_VOID T_MAIN T_PRINTLN T_INT T_FLOAT T_FOR T_WHILE T_IF T_EQUAL T_ID T_NUM T_SMALLER T_BIGGER T_NOTEQUAL T_ELSE T_STRING
It looked like that the reading element was else but lexer normaly returned an id. Somehow this modification was the solution.

unclear how to add extra productions to bison grammar to create error messages

This is not homework, but it is from a book.
I'm given a following bison spec file:
%{
#include <stdio.h>
#include <ctype.h>
int yylex();
int yyerror();
%}
%token NUMBER
%%
command : exp { printf("%d\n", $1); }
; /* allows printing of the result */
exp : exp '+' term { $$ = $1 + $3; }
| exp '-' term { $$ = $1 - $3; }
| term { $$ = $1; }
;
term : term '*' factor { $$ = $1 * $3; }
| factor { $$ = $1; }
;
factor : NUMBER { $$ = $1; }
| '(' exp ')' { $$ = $2; }
;
%%
int main() {
return yyparse();
}
int yylex() {
int c;
/* eliminate blanks*/
while((c = getchar()) == ' ');
if (isdigit(c)) {
ungetc(c, stdin);
scanf("%d", &yylval);
return (NUMBER);
}
/* makes the parse stop */
if (c == '\n') return 0;
return (c);
}
int yyerror(char * s) {
fprintf(stderr, "%s\n", s);
return 0;
} /* allows for printing of an error message */
The task is to do the following:
Rewrite the spec to add the following useful error messages:
"missing right parenthesis," generated by the string (2+3
"missing left parenthesis," generated by the string 2+3)
"missing operator," generated by the string 2 3
"missing operand," generated by the string (2+)
The simplest solution that I was able to come up with is to do the following:
half_exp : exp '+' { $$ = $1; }
| exp '-' { $$ = $1; }
| exp '*' { $$ = $1; }
;
factor : NUMBER { $$ = $1; }
| '(' exp '\n' { yyerror("missing right parenthesis"); }
| exp ')' { yyerror("missing left parenthesis"); }
| '(' exp '\n' { yyerror("missing left parenthesis"); }
| '(' exp ')' { $$ = $2; }
| '(' half_exp ')' { yyerror("missing operand"); exit(0); }
;
exp : exp '+' term { $$ = $1 + $3; }
| exp '-' term { $$ = $1 - $3; }
| term { $$ = $1; }
| exp exp { yyerror("missing operator"); }
;
These changes work, however they lead to a lot of conflicts.
Here is my question.
Is there a way to rewrite this grammar in such a way so that it wouldn't generate conflicts?
Any help is appreciated.

Yes it is possible:
command : exp { printf("%d\n", $1); }
; /* allows printing of the result */
exp: exp '+' exp {
// code
}
| exp '-' exp {
// code
}
| exp '*' exp {
// code
}
| exp '/' exp {
// code
}
|'(' exp ')' {
// code
}
Bison allows Ambiguous grammars.
I don't see how can you rewrite grammar to avoid conflicts. You just missed the point of terms, factors etc. You use these when you want left recursion context free grammar.
From this grammar:
E -> E+T
|T
T -> T*F
|F
F -> (E)
|num
Once you free it from left recursion you would go to:
E -> TE' { num , ( }
E' -> +TE' { + }
| eps { ) , EOI }
T -> FT' { ( , num }
T' -> *FT' { * }
|eps { + , ) , EOI }
F -> (E) { ( }
|num { num }
These sets alongside rules are showing what input character has to be in order to use that rule. Of course this is just example for simple arithmetic expressions for example 2*(3+4)*5+(3*3*3+4+5*6) etc.
If you want to learn more about this topic I suggest you to read about "left recursion context free grammar". There are some great books covering this topic and also covering how to get input sets.
But as I said above, all of this can be avoided because Bison allows Ambiguous grammars.

Using bison rule inside another rule

Let's assume I have grammar like this:
expr : expr '-' expr { $$ = $1 - $3; }
| "Function" '(' expr ',' expr ')' { $$ = ($3 - $5) * 2; }
| NUMBER { $$ = $1; };
How can use rule
expr : expr '-' expr { $$ = $1 - $3; }
inside
expr : "Function" '(' expr ',' expr ')' { $$ = ($3 - $5) * 2; }
Because implementation of $1 - $3 is repeated? It would be much better if I can use already implemented subtraction from rule one and only add multiplication with 2. This is just the basic example, but I have very big grammar with lot of repeating calculations.

Yacc and Lex error in parsing expressions which use binary operators

I am new to Lex and Yacc and I am trying to create a parser for a simple language which allows for basic arithmetic and equality expressions. Though I have some of it working, I am encountering errors when trying to parse expressions involving binary operations. Here is my .y file:
%{
#include <stdlib.h>
#include <stdio.h>
%}
%token NUMBER
%token HOME
%token PU
%token PD
%token FD
%token BK
%token RT
%token LT
%left '+' '-'
%left '=' '<' '>'
%nonassoc UMINUS
%%
S : statement S { printf("S -> stmt S\n"); }
| { printf("S -> \n"); }
;
statement : HOME { printf("stmt -> HOME\n"); }
| PD { printf("stmt -> PD\n"); }
| PU { printf("stmt -> PU\n"); }
| FD expression { printf("stmt -> FD expr\n"); }
| BK expression { printf("stmt -> BK expr\n"); }
| RT expression { printf("stmt -> RT expr\n"); }
| LT expression { printf("stmt -> LT expr\n"); }
;
expression : expression '+' expression { printf("expr -> expr + expr\n"); }
| expression '-' expression { printf("expr -> expr - expr\n"); }
| expression '>' expression { printf("expr -> expr > expr\n"); }
| expression '<' expression { printf("expr -> expr < expr\n"); }
| expression '=' expression { printf("expr -> expr = expr\n"); }
| '(' expression ')' { printf("expr -> (expr)\n"); }
| '-' expression %prec UMINUS { printf("expr -> -expr\n"); }
| NUMBER { printf("expr -> number\n"); }
;
%%
int yyerror(char *s)
{
fprintf (stderr, "%s\n", s);
return 0;
}
int main()
{
yyparse();
}
And here is my .l file for Lex:
%{
#include "testYacc.h"
%}
number [0-9]+
%%
[ ] { /* skip blanks */ }
{number} { sscanf(yytext, "%d", &yylval); return NUMBER; }
home { return HOME; }
pu { return PU; }
pd { return PD; }
fd { return FD; }
bk { return BK; }
rt { return RT; }
lt { return LT; }
%%
When I try to enter an arithmetic expression on the command-line for evaluation, it results in the following error:
home
stmt -> HOME
pu
stmt -> PU
fd 10
expr -> number
fd 10
stmt -> FD expr
expr -> number
fd (10 + 10)
stmt -> FD expr
(expr -> number
+stmt -> FD expr
S ->
S -> stmt S
S -> stmt S
S -> stmt S
S -> stmt S
S -> stmt S
syntax error

Your lexer lacks rules to match and return tokens such as '+' and '*', so if there are any in your input, it will just echo them and discard them. This is what happens when you enter fd (10 + 10) -- the lexer returns the tokens FD NUMBER NUMBER while + and ( get echoed to stdout. The parser then gives a syntax error.
You want to add a rule to return these single character tokens. The easiest is to just add a single rule to your .l file at the end:
. { return *yytext; }
which matches any single character.
Note that this does NOT match a \n (newline), so newlines in your input will still be echoed and ignored. You might want to add them (and tabs and carriage returns) to your skip blanks rule:
[ \t\r\n] { /* skip blanks */ }

Expressions in a CoCo to ANTLR translator

I'm parsing CoCo/R grammars in a utility to automate CoCo -> ANTLR translation. The core ANTLR grammar is:
rule '=' expression '.' ;
expression
: term ('|' term)*
-> ^( OR_EXPR term term* )
;
term
: (factor (factor)*)? ;
factor
: symbol
| '(' expression ')'
-> ^( GROUPED_EXPR expression )
| '[' expression']'
-> ^( OPTIONAL_EXPR expression)
| '{' expression '}'
-> ^( SEQUENCE_EXPR expression)
;
symbol
: IF_ACTION
| ID (ATTRIBUTES)?
| STRINGLITERAL
;
My problem is with constructions such as these:
CS = { ExternAliasDirective }
{ UsingDirective }
EOF .
CS results in an AST with a OR_EXPR node although no '|' character
actually appears. I'm sure this is due to the definition of
expression but I cannot see any other way to write the rules.
I did experiment with this to resolve the ambiguity.
// explicitly test for the presence of an '|' character
expression
#init { bool ored = false; }
: term {ored = (input.LT(1).Type == OR); } (OR term)*
-> {ored}? ^(OR_EXPR term term*)
-> ^(LIST term term*)
It works but the hack reinforces my conviction that something fundamental is wrong.
Any tips much appreciated.

Your rule:
expression
: term ('|' term)*
-> ^( OR_EXPR term term* )
;
always causes the rewrite rule to create a tree with a root of type OR_EXPR. You can create "sub rewrite rules" like this:
expression
: (term -> REWRITE_RULE_X) ('|' term -> ^(REWRITE_RULE_Y))*
;
And to resolve the ambiguity in your grammar, it's easiest to enable global backtracking which can be done in the options { ... } section of your grammar.
A quick demo:
grammar CocoR;
options {
output=AST;
backtrack=true;
}
tokens {
RULE;
GROUP;
SEQUENCE;
OPTIONAL;
OR;
ATOMS;
}
parse
: rule EOF -> rule
;
rule
: ID '=' expr* '.' -> ^(RULE ID expr*)
;
expr
: (a=atoms -> $a) ('|' b=atoms -> ^(OR $expr $b))*
;
atoms
: atom+ -> ^(ATOMS atom+)
;
atom
: ID
| '(' expr ')' -> ^(GROUP expr)
| '{' expr '}' -> ^(SEQUENCE expr)
| '[' expr ']' -> ^(OPTIONAL expr)
;
ID
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')*
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
with input:
CS = { ExternAliasDirective }
{ UsingDirective }
EOF .
produces the AST:
and the input:
foo = a | b ({c} | d [e f]) .
produces:
The class to test this:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
/*
String source =
"CS = { ExternAliasDirective } \n" +
"{ UsingDirective } \n" +
"EOF . ";
*/
String source = "foo = a | b ({c} | d [e f]) .";
ANTLRStringStream in = new ANTLRStringStream(source);
CocoRLexer lexer = new CocoRLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CocoRParser parser = new CocoRParser(tokens);
CocoRParser.parse_return returnValue = parser.parse();
CommonTree tree = (CommonTree)returnValue.getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and with the output this class produces, I used the following website to create the AST-images: http://graph.gafol.net/
HTH
EDIT
To account for epsilon (empty string) in your OR expressions, you might try something (quickly tested!) like this:
expr
: (a=atoms -> $a) ( ( '|' b=atoms -> ^(OR $expr $b)
| '|' -> ^(OR $expr NOTHING)
)
)*
;
which parses the source:
foo = a | b | .
into the following AST:

The production for expression explicitly says that it can only return an OR_EXPR node. You can try something like:
expression
:
term
|
term ('|' term)+
-> ^( OR_EXPR term term* )
;
Further down, you could use:
term
: factor*;

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart