Why does my "equation" grammar break the parser? - parsing

Currently, my parser file looks like this:
%{
#include <stdio.h>
#include <math.h>
int yylex();
void yyerror (const char *s);
%}
%union {
long num;
char* str;
}
%start line
%token print
%token exit_cmd
%token <str> identifier
%token <str> string
%token <num> number
%%
line: assignment {;}
| exit_stmt {;}
| print_stmt {;}
| line assignment {;}
| line exit_stmt {;}
| line print_stmt {;}
;
assignment: identifier '=' number {printf("Assigning var %s to value %d\n", $1, $3);}
| identifier '=' string {printf("Assigning var %s to value %s\n", $1, $3);}
;
exit_stmt: exit_cmd {exit(0);}
;
print_stmt: print print_expr {;}
;
print_expr: string {printf("%s\n", $1);}
| number {printf("%d\n", $1);}
;
%%
int main(void)
{
return yyparse();
}
void yyerror (const char *s) {fprintf(stderr, "%s\n", s);}
Giving the input: myvar = 3 gives the output Assigning var myvar = 3 to value 3, as expected. However, modifying the code to include an equation grammar rule breaks such assignments.
Equation grammar:
equation: number '+' number {$$ = $1 + $3;}
| number '-' number {$$ = $1 - $3;}
| number '*' number {$$ = $1 * $3;}
| number '/' number {$$ = $1 / $3;}
| number '^' number {$$ = pow($1, $3);}
| equation '+' number {$$ = $1 + $3;}
| equation '-' number {$$ = $1 - $3;}
| equation '*' number {$$ = $1 * $3;}
| equation '/' number {$$ = $1 / $3;}
| equation '^' number {$$ = pow($1, $3);}
;
Modifying the assignment grammar accordingly as well:
assignment: identifier '=' number {printf("Assigning var %s to value %d\n", $1, $3);}
| identifier '=' equation {printf("Assigning var %s to value %d\n", $1, $3);}
| identifier '=' string {printf("Assigning var %s to value %s\n", $1, $3);}
;
And giving the equation rule the type of num in the parser's first section:
%type <num> equation
Giving the same input: var = 3 freezes the program.
I know this is a long question but can anyone please explain what is going on here?
Also, here's the lexer in case you wanna take a look.

It doesn't "freeze the program". The program is just waiting for more input.
In your first grammar, var = 3 is a complete statement which cannot be extended. But in your second grammar, it could be the beginning of var = 3 + 4, for example. So the parser needs to read another token after the 3. If you want input lines to be terminated by a newline, you will need to modify your scanner to send a newline character as a token, and then modify your grammar to expect a newline token at the end of every statement. If you intend to allow statements to be spread out over several lines, you"ll need to be aware of that fact while typing input.
There are several problems with your grammar, and also with your parser. (Flex doesn't implement non-greedy repetition, for example.) Please look at the examples in the bison and flex manuals

Related

Odd bison parsing

To pretext this, I understand that the formatting of the parse structure is weird, the teacher wanted it to be ~roughly~ in this format.
I'm making a simple "calculator" parser form an assignment using flex and bison, but I am getting odd or unusual output for the answer when using modulus. IT seems to work fine for all other operations
Input: "10 % 5"
Output: " % 10"
Input: "101 % 12"
Output: " % 101"
Input: "2^(-1 + 15/5) - 3*(4-1) + (-6)"
Output: "-11" //Correct
Relevant section of bison.y
command : pexpri {printf("%d\n", $1); return;}
;
pexpri : '-' expri '+' termi {$$ = -$2 + $4;} /* Super glued on unary, also reduce conflict, TODO: find bug */
| '-' expri '-' termi {$$ = -$2 - $4;}
| '-' expri {$$ = -$2;}
| expri {$$ = $1;}
;
expri : expri '+' termi {$$ = $1 + $3;} /* Addition subtraction level operations*/
| expri '-' termi {$$ = $1 - $3;}
| termi {$$ = $1;}
;
termi : termi '*' factori {$$ = $1 * $3;} /* Multiplication division level operations*/
| termi '/' factori {$$ = $1 / $3;}
| termi '%' factori {$$ = $1 % $3;}
| factori {$$ = $1;}
;
factori : factori '^' parti {$$ = pow($1, $3);} /* Exponentiation level operations */
| parti {$$ = $1;}
;
parti : '(' pexpri ')' {$$ = $2;} /* Parentheses handling or terminal, also adds even more reduction errors.... */
| INTEGER
;
Relevant section tokenizer.l
0 { /* To avoid useless trailing zeros. */
yylval.iVal = atoi(yytext);
return INTEGER;
}
[1-9][0-9]* {
yylval.iVal = atoi(yytext);
return INTEGER;
}
[-()^\+\*/] {return *yytext;}
The main function is essentially just a wrapper for yyparse.
I don't understand how or why it is printing the modulus symbol in the output because the ONLY print in the entire code is in the command section. I understand that the code isn't the best (in fact, it is awful), but any insight is much appreciated.
Also, if anybody can help me figure out how to manage unary negation in a more elegant way (Hopefully without spoiling much), that would also be super appreciated. (I cant just use %precidence or %left) The way I have it currently set up is ambiguous and is causing reduction errors.
If you look closely at
[-()^\+\*/] {return *yytext;}
you'll notice that it is not going to match %. The most likely consequence is that (f)lex's default fallback rule will apply. That rule matches any single character and uses ECHO to copy the matched token to the output stream.
It looks to me like whitespace characters might also be falling through to the default rule. They should be ignored explicitly.
By the way, it is not necessary to backslash-escape regular expression operators inside character classes, since they have no special meaning in that context. Hence a correct and easier to read rule would be
[-+*/%^()] {return *yytext;}
However, I strongly recommend using a fallback rule instead of listing all the possible single-character tokens. If an invalid single-character token is handled by a fallback rule, then the parser will respond by flagging an error.
[[:space:]]+ { /* Ignore whitespace*/ }
0|[1-9][0-9]* { yylval.iVal = atoi(yytext); return INTEGER; }
. { return *yytext; /* Fallback rule */ }
The default fallback rule is rarely useful in parsing, and I find it useful to add
%option nodefault
to my flex prolog, which will cause flex to produce an error message if a fallback rule is required.

How does a parser solves shift/reduce conflict?

I have a grammar for arithmetic expression which solves number of expression (one per line) in a text file. While compiling YACC I am getting message 2 shift reduce conflicts. But my calculations are proper. If parser is giving proper output how does it resolves the shift/reduce conflict. And In my case is there any way to solve it in YACC Grammar.
YACC GRAMMAR
Calc : Expr {printf(" = %d\n",$1);}
| Calc Expr {printf(" = %d\n",$2);}
| error {yyerror("\nBad Expression\n ");}
;
Expr : Term { $$ = $1; }
| Expr '+' Term { $$ = $1 + $3; }
| Expr '-' Term { $$ = $1 - $3; }
;
Term : Fact { $$ = $1; }
| Term '*' Fact { $$ = $1 * $3; }
| Term '/' Fact { if($3==0){
yyerror("Divide by Zero Encountered.");
break;}
else
$$ = $1 / $3;
}
;
Fact : Prim { $$ = $1; }
| '-' Prim { $$ = -$2; }
;
Prim : '(' Expr ')' { $$ = $2; }
| Id { $$ = $1; }
;
Id :NUM { $$ = yylval; }
;
What change should I do to remove such conflicts in my grammar ?
Bison/yacc resolves shift-reduce conflicts by choosing to shift. This is explained in the bison manual in the section on Shift-Reduce conflicts.
Your problem is that your input is just a series of Exprs, run together without any delimiter between them. That means that:
4 - 2
could be one expression (4-2) or it could be two expressions (4, -2). Since bison-generated parsers always prefer to shift, the parser will choose to parse it as one expression, even if it were typed on two lines:
4
-2
If you want to allow users to type their expressions like that, without any separator, then you could either live with the conflict (since it is relatively benign) or you could codify it into your grammar, but that's quite a bit more work. To put it into the grammar, you need to define two different types of Expr: one (which is the one you use at the top level) cannot start with an unary minus, and the other one (which you can use anywhere else) is allowed to start with a unary minus.
I suspect that what you really want to do is use newlines or some other kind of expression separator. That's as simple as passing the newline through to your parser and changing Calc to Calc: | Calc '\n' | Calc Expr '\n'.
I'm sure that this appears somewhere else on SO, but I can't find it. So here is how you disallow the use of unary minus at the beginning of an expression, so that you can run expressions together without delimiters. The non-terminals starting n_ cannot start with a unary minus:
input: %empty | input n_expr { /* print $2 */ }
expr: term | expr '+' term | expr '-' term
n_expr: n_term | n_expr '+' term | n_expr '-' term
term: factor | term '*' factor | term '/' factor
n_term: value | n_term '+' factor | n_term '/' factor
factor: value | '-' factor
value: NUM | '(' expr ')'
That parses the same language as your grammar, but without generating the shift-reduce conflict. Since it parses the same language, the input
4
-2
will still be parsed as a single expression; to get the expected result you would need to type
4
(-2)

Bison nonassociative precedence rule

I am trying to enforce some parsing errors by defining a non-associative precedence.
Here is part of my grammar file:
Comparison :
Value ComparisonOp Value
{
$2->Left($1);
$2->Right($3);
$$ = $2;
}
;
Expressions like 1 = 2 should parse, but expressions like 1 = 2 = 3 are not allowed in the grammar. To accommodate this, I tried to make my operator non associative as follows:
%nonassoc NONASSOCIATIVE
.
.(rest of the grammar)
.
Comparison :
Value ComparisonOp Value %prec NONASSOCIATIVE
{
$2->Left($1);
$2->Right($3);
$$ = $2;
}
;
1 = 2 = 3 still passes, Could someone please tell me what I am doing wrong?
You need to set the associativity of the ComparisonOp token(s). Either
%nonassoc ComparisonOp NONASSOCIATIVE
if ComparisonOp is a token or something like
%nonassoc '=' '<' '>' NOT_EQUAL GREATOR_OR_EQUAL LESS_OR_EQUAL NONASSOCIATIVE
if you have mulitple tokens and ComparisonOp is a rule that expands to any of them
As a concrete example, the following works exactly like you are requesting:
%{
#include <stdio.h>
#include <ctype.h>
int yylex();
void yyerror(const char *);
%}
%nonassoc '=' '<' '>' CMPOP
%left '+' '-' ADDOP
%left '*' '/' '%' MULOP
%token VALUE
%%
expr: expr cmp_op expr %prec CMPOP
| expr add_op expr %prec ADDOP
| expr mul_op expr %prec MULOP
| VALUE
| '(' expr ')'
;
cmp_op: '=' | '<' | '>' | '<' '=' | '>' '=' | '<' '>' ;
add_op: '+' | '-' ;
mul_op: '*' | '/' | '%' ;
%%
int main() { return yyparse(); }
int yylex() {
int ch;
while(isspace(ch = getchar()));
if (isdigit(ch)) return VALUE;
return ch;
}
void yyerror(const char *err) { fprintf(stderr, "%s\n", err); }
so if you are having other problems, try posting an MVCE that shows the actual problem you are having...

can't find a simple error when parsing with YACC

i'm trying to make a very simple YACC parser on Pascal language which just includes integer declarations, some basic expressions and if-else statements. however, i cant find the error for hours and i'm going to be crazy soon. terminal says Error at line:0 but it is impossible!. i use flex and byacc for parser.i will be very glad if you can help me. this is my lex file as you can see;
%{
#include <stdio.h>
#include <string.h>
#include "y.tab.h"
extern int yylval;
int linenum=0;
%}
digit [0-9]
letter [A-Za-z]
%%
if return IF;
then return THEN;
else return ELSE;
for return FOR;
while return WHILE;
PROGRAM return PROGRAM_SYM;
BEGIN return BEGIN_SYM;
VAR return VAR_SYM;
END return END_SYM;
INTEGER return INTEGER_SYM;
{letter}({letter}|{digit})* return identifier;
[0-9]+ return NUMBER;
[\<][\=] return CON_LE;
[\>][\=] return CON_GE;
[\=] return CON_EQ;
[\:][\=] return ASSIGNOP;
; return semiColon;
, return comma;
\n {linenum++;}
. return (int) yytext[0];
%%
and this is my Yacc file
%{
#include <stdio.h>
#include <string.h>
#include "y.tab.h"
extern FILE *yyin;
extern int linenum;
%}
%token PROGRAM_SYM VAR_SYM BEGIN_SYM END_SYM INTEGER_SYM NUMBER
%token identifier INTEGER ASSIGNOP semiColon comma THEN
%token IF ELSE FOR WHILE
%token CON_EQ CON_LE CON_GE GE LE
%left '*' '/'
%left '+' '-'
%start program
%%
program: PROGRAM_SYM identifier semiColon VAR_SYM dec_block BEGIN_SYM statement_list END_SYM '.'
;
dec_block:
dec_list semiColon;
dec_list:
dec_list dec
|
dec
;
dec:
int_dec_list
;
int_dec_list:
int_dec_list int_dec ':' type
|
int_dec ':' type
;
int_dec:
int_dec comma identifier
|
identifier
;
type:
INTEGER_SYM
;
statement_list:
statement_list statement
|
statement
;
statement:
assignment_list
|
expression_list
|
selection_list
;
assignment_list:
assignment_list assignment
|
assignment
;
assignment:
identifier ASSIGNOP expression_list
;
expression_list:
expression_list expression semiColon
|
expression semiColon
;
expression:
'(' expression ')'
|
expression '*' expression
|
expression '/' expression
|
expression '+' expression
|
expression '-' expression
|
factor
;
factor:
identifier
|
NUMBER
;
selection_list:
selection_list selection
|
selection
;
selection:
IF '(' logical_expression ')' THEN statement_list ELSE statement_list
;
logical_expression:
logical_expression '=' expression
|
logical_expression '>' expression
|
logical_expression '<' expression
;
%%
void yyerror(char *s){
fprintf(stderr,"Error at line: %d\n",linenum);
}
int yywrap(){
return 1;
}
int main(int argc, char *argv[])
{
/* Call the lexer, then quit. */
yyin=fopen(argv[1],"r");
yyparse();
fclose(yyin);
return 0;
}
and finally i take an error at the first line when i give the input;
PROGRAM myprogram;
VAR
i:INTEGER;
i3:INTEGER;
j:INTEGER;
BEGIN
i := 3;
j := 5;
i3 := i+j*2;
i := j*20;
if(i>j)
then i3 := i+50+(45*i+(40*j));
else i3 := i+50+(45*i+(40*j))+i+50+(45*i+(30*j));
END.
For debugging grammars, YYDEBUG is your friend. Either stick #define YYDEBUG 1 in the %{..%} in the top of your .y file, or compile with -DYYDEBUG, and stick a yydebug = 1; in main before calling yyparse and you'll get a slew of info about what tokens the parser is seeing and what it is doing with them....
Your lexical analyzer returns blanks and tabs as tokens, but the grammar doesn't recognize them.
Add a parser rule:
[ \t\r] { }
This gets you to line 6 instead of line 0 before you run into an error. You get that error because you don't allow semicolons between declarations:
dec_block:
dec_list semiColon;
dec_list:
dec_list dec
|
dec
;
dec:
int_dec_list
;
That should probably be:
dec_block:
dec_block dec
|
dec
;
dec:
int_dec_list semiColon
;
Doing this gets you to line 14 in the input.
Incidentally, one of the first things I did was to ensure that the lexical analyzer tells me what it is doing, by modifying the rules like this:
if { printf("IF\n"); return IF; }
In long term code, I'd make that diagnostic output selectable at run-time.
You have a general problem with where you expect semicolons. It is also not clear that you should allow an expression_list in the rule for statement (or, maybe, 'not yet' — that might be appropriate when you have function calls, but allowing 3 + 2 / 4 as a 'statement' is not very helpful).
This grammar gets to the end of the input:
%{
#include <stdio.h>
#include <string.h>
#include "y.tab.h"
extern FILE *yyin;
extern int linenum;
%}
%token PROGRAM_SYM VAR_SYM BEGIN_SYM END_SYM INTEGER_SYM NUMBER
%token identifier INTEGER ASSIGNOP semiColon comma THEN
%token IF ELSE FOR WHILE
%token CON_EQ CON_LE CON_GE GE LE
%left '*' '/'
%left '+' '-'
%start program
%%
program: PROGRAM_SYM identifier semiColon VAR_SYM dec_block BEGIN_SYM statement_list END_SYM '.'
;
dec_block:
dec_block dec
|
dec
;
dec:
int_dec_list semiColon
;
int_dec_list:
int_dec_list int_dec ':' type
|
int_dec ':' type
;
int_dec:
int_dec comma identifier
|
identifier
;
type:
INTEGER_SYM
;
statement_list:
statement_list statement
|
statement
;
statement:
assignment
|
selection
;
assignment:
identifier ASSIGNOP expression semiColon
;
expression:
'(' expression ')'
|
expression '*' expression
|
expression '/' expression
|
expression '+' expression
|
expression '-' expression
|
factor
;
factor:
identifier
|
NUMBER
;
selection:
IF '(' logical_expression ')' THEN statement_list ELSE statement_list
;
logical_expression:
expression '=' expression
|
expression '>' expression
|
expression '<' expression
;
%%
void yyerror(char *s){
fprintf(stderr,"Error at line: %d\n",linenum);
}
int yywrap(){
return 1;
}
int main(int argc, char *argv[])
{
/* Call the lexer, then quit. */
yyin=fopen(argv[1],"r");
yyparse();
fclose(yyin);
return 0;
}
Key changes include removing assignment_list and expression_list, and modifying logical_expression so that the two sides of the expansion are expression, rather than the LHS being logical_expression (which then never had a primitive definition, leading to the problems with warnings).
There are still issues to resolve; the expression_list in the selection should be more restrictive to accurately reflect the grammar of Pascal. (You need a block where that could be a single statement or a BEGIN, statement list, END.)

Extending example grammar for Fsyacc with unary minus

I tried to extend the example grammar that comes as part of the "F# Parsed Language Starter" to support unary minus (for expressions like 2 * -5).
I hit a block like Samsdram here
Basically, I extended the header of the .fsy file to include precedence like so:
......
%nonassoc UMINUS
....
and then the rules of the grammar like so:
...
Expr:
| MINUS Expr %prec UMINUS { Negative ($2) }
...
also, the definition of the AST:
...
and Expr =
| Negative of Expr
.....
but still get a parser error when trying to parse the expression mentioned above.
Any ideas what's missing? I read the source code of the F# compiler and it is not clear how they solve this, seems quite similar
EDIT
The precedences are ordered this way:
%left ASSIGN
%left AND OR
%left EQ NOTEQ LT LTE GTE GT
%left PLUS MINUS
%left ASTER SLASH
%nonassoc UMINUS
Had a play around and managed to get the precedence working without the need for %prec. Modified the starter a little though (more meaningful names)
Prog:
| Expression EOF { $1 }
Expression:
| Additive { $1 }
Additive:
| Multiplicative { $1 }
| Additive PLUS Multiplicative { Plus($1, $3) }
| Additive MINUS Multiplicative { Minus($1, $3) }
Multiplicative:
| Unary { $1 }
| Multiplicative ASTER Unary { Times($1, $3) }
| Multiplicative SLASH Unary { Divide($1, $3) }
Unary:
| Value { $1 }
| MINUS Value { Negative($2) }
Value:
| FLOAT { Value(Float($1)) }
| INT32 { Value(Integer($1)) }
| LPAREN Expression RPAREN { $2 }
I also grouped the expressions into a single variant, as I didn't like the way the starter done it. (was awkward to walk through it).
type Value =
| Float of Double
| Integer of Int32
| Expression of Expression
and Expression =
| Value of Value
| Negative of Expression
| Times of Expression * Expression
| Divide of Expression * Expression
| Plus of Expression * Expression
| Minus of Expression * Expression
and Equation =
| Equation of Expression
Taking code from my article Parsing text with Lex and Yacc (October 2007).
My precedences look like:
%left PLUS MINUS
%left TIMES DIVIDE
%nonassoc prec_uminus
%right POWER
%nonassoc FACTORIAL
and the yacc parsing code is:
expr:
| NUM { Num(float_of_string $1) }
| MINUS expr %prec prec_uminus { Neg $2 }
| expr FACTORIAL { Factorial $1 }
| expr PLUS expr { Add($1, $3) }
| expr MINUS expr { Sub($1, $3) }
| expr TIMES expr { Mul($1, $3) }
| expr DIVIDE expr { Div($1, $3) }
| expr POWER expr { Pow($1, $3) }
| OPEN expr CLOSE { $2 }
;
Looks equivalent. I don't suppose the problem is your use of UMINUS in capitals instead of prec_uminus in my case?
Another option is to split expr into several mutually-recursive parts, one for each precedence level.

Resources