Odd bison parsing

Odd bison parsing - parsing

To pretext this, I understand that the formatting of the parse structure is weird, the teacher wanted it to be ~roughly~ in this format.
I'm making a simple "calculator" parser form an assignment using flex and bison, but I am getting odd or unusual output for the answer when using modulus. IT seems to work fine for all other operations
Input: "10 % 5"
Output: " % 10"
Input: "101 % 12"
Output: " % 101"
Input: "2^(-1 + 15/5) - 3*(4-1) + (-6)"
Output: "-11" //Correct
Relevant section of bison.y
command : pexpri {printf("%d\n", $1); return;}
;
pexpri : '-' expri '+' termi {$$ = -$2 + $4;} /* Super glued on unary, also reduce conflict, TODO: find bug */
| '-' expri '-' termi {$$ = -$2 - $4;}
| '-' expri {$$ = -$2;}
| expri {$$ = $1;}
;
expri : expri '+' termi {$$ = $1 + $3;} /* Addition subtraction level operations*/
| expri '-' termi {$$ = $1 - $3;}
| termi {$$ = $1;}
;
termi : termi '*' factori {$$ = $1 * $3;} /* Multiplication division level operations*/
| termi '/' factori {$$ = $1 / $3;}
| termi '%' factori {$$ = $1 % $3;}
| factori {$$ = $1;}
;
factori : factori '^' parti {$$ = pow($1, $3);} /* Exponentiation level operations */
| parti {$$ = $1;}
;
parti : '(' pexpri ')' {$$ = $2;} /* Parentheses handling or terminal, also adds even more reduction errors.... */
| INTEGER
;
Relevant section tokenizer.l
0 { /* To avoid useless trailing zeros. */
yylval.iVal = atoi(yytext);
return INTEGER;
}
[1-9][0-9]* {
yylval.iVal = atoi(yytext);
return INTEGER;
}
[-()^\+\*/] {return *yytext;}
The main function is essentially just a wrapper for yyparse.
I don't understand how or why it is printing the modulus symbol in the output because the ONLY print in the entire code is in the command section. I understand that the code isn't the best (in fact, it is awful), but any insight is much appreciated.
Also, if anybody can help me figure out how to manage unary negation in a more elegant way (Hopefully without spoiling much), that would also be super appreciated. (I cant just use %precidence or %left) The way I have it currently set up is ambiguous and is causing reduction errors.

If you look closely at
[-()^\+\*/] {return *yytext;}
you'll notice that it is not going to match %. The most likely consequence is that (f)lex's default fallback rule will apply. That rule matches any single character and uses ECHO to copy the matched token to the output stream.
It looks to me like whitespace characters might also be falling through to the default rule. They should be ignored explicitly.
By the way, it is not necessary to backslash-escape regular expression operators inside character classes, since they have no special meaning in that context. Hence a correct and easier to read rule would be
[-+*/%^()] {return *yytext;}
However, I strongly recommend using a fallback rule instead of listing all the possible single-character tokens. If an invalid single-character token is handled by a fallback rule, then the parser will respond by flagging an error.
[[:space:]]+ { /* Ignore whitespace*/ }
0|[1-9][0-9]* { yylval.iVal = atoi(yytext); return INTEGER; }
. { return *yytext; /* Fallback rule */ }
The default fallback rule is rarely useful in parsing, and I find it useful to add
%option nodefault
to my flex prolog, which will cause flex to produce an error message if a fallback rule is required.

Related

Verifying if an expression conforms to restrictive context-free grammar

I'm trying to write a parser that accepts a toy language for a software project class. Part of the production rules relevant to the question in EBNF-like syntax is given here (there's way more relational operators, but I've removed some of them to keep it simple):
cond_expr = rel_expr
| '!' '(' cond_expr ')'
| '(' cond_expr ')' '&&' '(' cond_expr ')' ;
rel_expr = rel_factor '==' rel_factor
| rel_factor '!=' rel_factor ;
rel_factor = VAR | INTEGER | expr ;
expr = expr '+' term
| expr '-' term
| expr ;
term = term '*' factor
| term '/' factor
| factor ;
factor = VAR | INTEGER | '(' expr ')' ;
VAR = [a-zA-Z][a-zA-Z0-9]* ;
INTEGER = '0' | [1-9][0-9]* ;
I've written more or less the entire parser already. I used recursive descent for majority of the language except for expressions, which I decided to use the shunting yard algorithm to parse (because I couldn't get recursive descent to work even after left recursion elimination/left factoring).
The real problem I have is in the cond_expr rule; shunting yard is too powerful for this grammar i.e the grammar can't accept certain conditional expressions. For example, the expression (x == 1) is not accepted, neither is !(x == 1) || (y == 1). I would use the recursive descent method to check if the expression can be accepted, but the issue is with the rel_expr in cond_expr, rel_expr can be substituted with rel_factor '==' rel_factor or rel_factor '!=' rel_factor, and each rel_factor can be substituted with '(' expr ')'. This leads to ambiguity (idk if that's the correct term) when deciding what branch to take in the cond_expr method upon seeing a '(' token. Something like the below:
Expression cond_expr() {
if (next() == "!") {
expect("!");
expect("(");
auto cond = cond_expr();
expect(")");
return cond;
} else if (next() == "(") {
// this will fail for e.g (x + 1) == 2
expect("(");
auto cond1 = cond_expr();
expect(")");
expect("&&");
expect("(");
auto cond2 = cond_expr();
expect(")");
return Node("&&", cond1, cond2);
} else {
return rel_expr();
}
}
My current strategy I'm attempting is to first validate that the expression can be accepted by the grammar using some subroutine, then calling the shunting yard algorithm to parse it into the required AST. However, I'm having a lot of trouble writing this validation subroutine. Anyone have any suggestions on any methods to solve this?

Bison - Productions for longest matching expression

I am using Bison, together with Flex, to try and parse a simple grammar that was provided to me. In this grammar (almost) everything is considered an expression and has some kind of value; there are no statements. What's more, the EBNF definition of the grammar comes with certain ambiguities:
expression OP expression where op may be '+', '-' '&' etc. This can easily be solved using bison's associativity operators and setting %left, %right and %nonassoc according to common language standards.
IF expression THEN expression [ELSE expression] as well as DO expression WHILE expression, for which ignoring the common case dangling else problem I want the following behavior:
In if-then-else as well as while expressions, the embedded expressions are taken to be as long as possible (allowed by the grammar). E.g 5 + if cond_expr then then_expr else 10 + 12 is equivalent to 5 + (if cond_expr then then_expr else (10 + 12)) and not 5 + (if cond_expr then then_expr else 10) + 12
Given that everything in the language is considered an expression, I cannot find a way to re-write the production rules in a form that does not cause conflicts. One thing I tried, drawing inspiration from the dangling else example in the bison manual was:
expression: long_expression
| short_expression
;
long_expression: short_expression
| IF long_expression THEN long_expression
| IF long_expression long_expression ELSE long_expression
| WHILE long_expression DO long_expression
;
short_expression: short_expression '+' short_expression
| short_expression '-' short_expression
...
;
However this does not seem to work and I cannot figure out how I could tweak it into working. Note that I (assume I) have resolved the dangling ELSE problem using nonassoc for ELSE and THEN and the above construct as suggested in some book, but I am not sure this is even valid in the case where there are not statements but only expressions. Note as well as that associativity has been set for all other operators such as +, - etc. Any solutions or hints or examples that resolve this?
----------- EDIT: MINIMAL EXAMPLE ---------------
I tried to include all productions with tokens that have specific associativity, including some extra productions to show a bit of the grammar. Notice that I did not actually use my idea explained above. Notice as well that I have included a single binary and unary operator just to make the code a bit shorter, the rules for all operators are of the same form. Bison with -Wall flag finds no conflicts with these declarations (but I am pretty sure they are not 100% correct).
%token<int> INT32 LET IF WHILE INTEGER OBJECTID TYPEID NEW
%right <str> THEN ELSE STR
%right '^' UMINUS NOT ISNULL ASSIGN DO IN
%left '+' '-'
%left '*' '/'
%left <str> AND '.'
%nonassoc '<' '='
%nonassoc <str> LOWEREQ
%type<ast_expr> expression
%type ...
exprlist: expression { ; }
| exprlist ';' expression { ; };
block: '{' exprlist '}' { ; };
args: %empty { ; }
| expression { ; }
| args ',' expression { ; };
expression: IF expression THEN expression { ; }
| IF expression THEN expression ELSE expression { ; }
| WHILE expression DO expression { ; }
| LET OBJECTID ':' type IN expression { ; }
| NOT expression { /* UNARY OPERATORS */ ; }
| expression '=' expression { /* BINARY OPERATORS */ ; }
| OBJECTID '(' args ')' { ; }
| expression '.' OBJECTID '(' args ')' { ; }
| NEW TYPEID { ; }
| STR { ; }
| INTEGER { ; }
| '(' ')' { ; }
| '(' expression ')' { ; }
| block { ; }
;

The following associativity declarations resolved all shift/reduce conflicts and produced the expected output (in all tests I could think of at least):
...
%right <str> THEN ELSE
%right DO IN
%right ASSIGN
%left <str> AND
%right NOT
%nonassoc '<' '=' LOWEREQ
%left '+' '-'
%left '*' '/'
%right UMINUS ISNULL
%right '^'
%left '.'
...
%%
...
expression: IF expression THEN expression
| IF expression THEN expression ELSE expression
| WHILE expression DO expression
| LET OBJECTID ':' type IN expression
| LET OBJECTID ':' type ASSIGN expression IN expression
| OBJECTID ASSIGN expression
...
| '-' expression %prec UMINUS
| expression '=' expression
...
| expression LOWEREQ expression
| OBJECTID '(' args ')'
...
...
Notice that the order of declaration of associativity and precedence rules for all symbols matters! I have not included all the production rules but if-else-then, while-do, let in, unary and binary operands are the ones that produced conflicts or wrong results with different associativity declarations.

Why does my "equation" grammar break the parser?

Currently, my parser file looks like this:
%{
#include <stdio.h>
#include <math.h>
int yylex();
void yyerror (const char *s);
%}
%union {
long num;
char* str;
}
%start line
%token print
%token exit_cmd
%token <str> identifier
%token <str> string
%token <num> number
%%
line: assignment {;}
| exit_stmt {;}
| print_stmt {;}
| line assignment {;}
| line exit_stmt {;}
| line print_stmt {;}
;
assignment: identifier '=' number {printf("Assigning var %s to value %d\n", $1, $3);}
| identifier '=' string {printf("Assigning var %s to value %s\n", $1, $3);}
;
exit_stmt: exit_cmd {exit(0);}
;
print_stmt: print print_expr {;}
;
print_expr: string {printf("%s\n", $1);}
| number {printf("%d\n", $1);}
;
%%
int main(void)
{
return yyparse();
}
void yyerror (const char *s) {fprintf(stderr, "%s\n", s);}
Giving the input: myvar = 3 gives the output Assigning var myvar = 3 to value 3, as expected. However, modifying the code to include an equation grammar rule breaks such assignments.
Equation grammar:
equation: number '+' number {$$ = $1 + $3;}
| number '-' number {$$ = $1 - $3;}
| number '*' number {$$ = $1 * $3;}
| number '/' number {$$ = $1 / $3;}
| number '^' number {$$ = pow($1, $3);}
| equation '+' number {$$ = $1 + $3;}
| equation '-' number {$$ = $1 - $3;}
| equation '*' number {$$ = $1 * $3;}
| equation '/' number {$$ = $1 / $3;}
| equation '^' number {$$ = pow($1, $3);}
;
Modifying the assignment grammar accordingly as well:
assignment: identifier '=' number {printf("Assigning var %s to value %d\n", $1, $3);}
| identifier '=' equation {printf("Assigning var %s to value %d\n", $1, $3);}
| identifier '=' string {printf("Assigning var %s to value %s\n", $1, $3);}
;
And giving the equation rule the type of num in the parser's first section:
%type <num> equation
Giving the same input: var = 3 freezes the program.
I know this is a long question but can anyone please explain what is going on here?
Also, here's the lexer in case you wanna take a look.

It doesn't "freeze the program". The program is just waiting for more input.
In your first grammar, var = 3 is a complete statement which cannot be extended. But in your second grammar, it could be the beginning of var = 3 + 4, for example. So the parser needs to read another token after the 3. If you want input lines to be terminated by a newline, you will need to modify your scanner to send a newline character as a token, and then modify your grammar to expect a newline token at the end of every statement. If you intend to allow statements to be spread out over several lines, you"ll need to be aware of that fact while typing input.
There are several problems with your grammar, and also with your parser. (Flex doesn't implement non-greedy repetition, for example.) Please look at the examples in the bison and flex manuals

How does a parser solves shift/reduce conflict?

I have a grammar for arithmetic expression which solves number of expression (one per line) in a text file. While compiling YACC I am getting message 2 shift reduce conflicts. But my calculations are proper. If parser is giving proper output how does it resolves the shift/reduce conflict. And In my case is there any way to solve it in YACC Grammar.
YACC GRAMMAR
Calc : Expr {printf(" = %d\n",$1);}
| Calc Expr {printf(" = %d\n",$2);}
| error {yyerror("\nBad Expression\n ");}
;
Expr : Term { $$ = $1; }
| Expr '+' Term { $$ = $1 + $3; }
| Expr '-' Term { $$ = $1 - $3; }
;
Term : Fact { $$ = $1; }
| Term '*' Fact { $$ = $1 * $3; }
| Term '/' Fact { if($3==0){
yyerror("Divide by Zero Encountered.");
break;}
else
$$ = $1 / $3;
}
;
Fact : Prim { $$ = $1; }
| '-' Prim { $$ = -$2; }
;
Prim : '(' Expr ')' { $$ = $2; }
| Id { $$ = $1; }
;
Id :NUM { $$ = yylval; }
;
What change should I do to remove such conflicts in my grammar ?

Bison/yacc resolves shift-reduce conflicts by choosing to shift. This is explained in the bison manual in the section on Shift-Reduce conflicts.
Your problem is that your input is just a series of Exprs, run together without any delimiter between them. That means that:
4 - 2
could be one expression (4-2) or it could be two expressions (4, -2). Since bison-generated parsers always prefer to shift, the parser will choose to parse it as one expression, even if it were typed on two lines:
4
-2
If you want to allow users to type their expressions like that, without any separator, then you could either live with the conflict (since it is relatively benign) or you could codify it into your grammar, but that's quite a bit more work. To put it into the grammar, you need to define two different types of Expr: one (which is the one you use at the top level) cannot start with an unary minus, and the other one (which you can use anywhere else) is allowed to start with a unary minus.
I suspect that what you really want to do is use newlines or some other kind of expression separator. That's as simple as passing the newline through to your parser and changing Calc to Calc: | Calc '\n' | Calc Expr '\n'.
I'm sure that this appears somewhere else on SO, but I can't find it. So here is how you disallow the use of unary minus at the beginning of an expression, so that you can run expressions together without delimiters. The non-terminals starting n_ cannot start with a unary minus:
input: %empty | input n_expr { /* print $2 */ }
expr: term | expr '+' term | expr '-' term
n_expr: n_term | n_expr '+' term | n_expr '-' term
term: factor | term '*' factor | term '/' factor
n_term: value | n_term '+' factor | n_term '/' factor
factor: value | '-' factor
value: NUM | '(' expr ')'
That parses the same language as your grammar, but without generating the shift-reduce conflict. Since it parses the same language, the input
4
-2
will still be parsed as a single expression; to get the expected result you would need to type
4
(-2)

Why does ANTLR not parse the entire input?

I am quite new to ANTLR, so this is likely a simple question.
I have defined a simple grammar which is supposed to include arithmetic expressions with numbers and identifiers (strings that start with a letter and continue with one or more letters or numbers.)
The grammar looks as follows:
grammar while;
#lexer::header {
package ConFreeG;
}
#header {
package ConFreeG;
import ConFreeG.IR.*;
}
#parser::members {
}
arith:
term
| '(' arith ( '-' | '+' | '*' ) arith ')'
;
term returns [AExpr a]:
NUM
{
int n = Integer.parseInt($NUM.text);
a = new Num(n);
}
| IDENT
{
a = new Var($IDENT.text);
}
;
fragment LOWER : ('a'..'z');
fragment UPPER : ('A'..'Z');
fragment NONNULL : ('1'..'9');
fragment NUMBER : ('0' | NONNULL);
IDENT : ( LOWER | UPPER ) ( LOWER | UPPER | NUMBER )*;
NUM : '0' | NONNULL NUMBER*;
fragment NEWLINE:'\r'? '\n';
WHITESPACE : ( ' ' | '\t' | NEWLINE )+ { $channel=HIDDEN; };
I am using ANTLR v3 with the ANTLR IDE Eclipse plugin. When I parse the expression (8 + a45) using the interpreter, only part of the parse tree is generated:
Why does the second term (a45) not get parsed? The same happens if both terms are numbers.

You'll want to create a parser rule that has an EOF (end of file) token in it so that the parser will be forced to go through the entire token stream.
Add this rule to your grammar:
parse
: arith EOF
;
and let the interpreter start at that rule instead of the arith rule:

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart