Bison parser won't look-ahead for token - parsing

I have the following parser grammar (this is a small sample):
expr:
ident assignop expr
{
$$ = new NAssignment(new NAssignmentIdentifier(*$1), $2, *$3);
} |
STAR expr %prec IDEREF
{
$$ = new NDereferenceOperator(*$2);
} |
STAR expr assignop expr %prec IDEREF
{
$$ = new NAssignment(new NAssignmentDereference(*$2), $3, *$4);
} |
... ;
...
assignop:
ASSIGN_EQUAL |
ASSIGN_ADD |
ASSIGN_SUBTRACT |
ASSIGN_MULTIPLY |
ASSIGN_DIVIDE ;
Now I'm trying to parse any of the following lines:
*0x8000 = 0x7000;
*mem = 0x7000;
However, Bison keeps seeing "*mem" and reducing on the 'STAR expr' rule and not performing look-ahead to see whether 'STAR expr assignop...' matches. As far as I understand Bison, it should be doing this look-ahead. My closest guess is that %prec is turning off look-ahead or something strange like that, but I can't see why it would do so (since the prec values are equivalent).
How do I make it perform look-ahead in this case?
EDIT:
The state that it enters when encountering 'STAR expr' is:
state 45
28 expr: STAR expr .
29 | STAR expr . assignop expr
35 | expr . binaryop expr
$default reduce using rule 28 (expr)
assignop go to state 81
binaryop go to state 79
So I don't understand why it's picking $default when it could pick assignop (note that the order of the rules in the parser.y file don't affect which one it picks in this case; I've tried reordering the assignop one above the standard 'STAR expr').

This will happen if IDREF is higher precedence than ASSIGN_EQUAL, ASSIGN_ADD, etc. Specifically, in this case with the raw parser (before precedence is applies), you have shift/reduce conflicts between the expr: STAR expr rule and the various ASSIGN_XXX tokens. The precedence rules you have resolve all the conflicts in favor of the reduce.
The assignop in the state is a goto, not a shift or reduce, so doesn't enter into the lookahead or token handling at all -- gotos only occur after some token has been shifted and then later reduced to the non-terminal in question.

I ended up solving this problem by creating another rule 'deref' like so:
deref:
STAR ident
{
$$ = new NDereferenceOperator(*$<ident>2);
} |
STAR numeric
{
$$ = new NDereferenceOperator(*$2);
} |
STAR CURVED_OPEN expr CURVED_CLOSE
{
$$ = new NDereferenceOperator(*$3);
} |
deref assignop expr
{
if ($1->cType == "expression-dereference") // We can't accept NAssignments as the deref in this case.
$$ = new NAssignment(new NAssignmentDereference(((NDereferenceOperator*)$1)->expr), $2, *$3);
else
throw new CompilerException("Unable to apply dereferencing assignment operation to non-dereference operator based LHS.");
} ;
replacing both rules in 'expr' with a single 'deref'.

Related

Verifying if an expression conforms to restrictive context-free grammar

I'm trying to write a parser that accepts a toy language for a software project class. Part of the production rules relevant to the question in EBNF-like syntax is given here (there's way more relational operators, but I've removed some of them to keep it simple):
cond_expr = rel_expr
| '!' '(' cond_expr ')'
| '(' cond_expr ')' '&&' '(' cond_expr ')' ;
rel_expr = rel_factor '==' rel_factor
| rel_factor '!=' rel_factor ;
rel_factor = VAR | INTEGER | expr ;
expr = expr '+' term
| expr '-' term
| expr ;
term = term '*' factor
| term '/' factor
| factor ;
factor = VAR | INTEGER | '(' expr ')' ;
VAR = [a-zA-Z][a-zA-Z0-9]* ;
INTEGER = '0' | [1-9][0-9]* ;
I've written more or less the entire parser already. I used recursive descent for majority of the language except for expressions, which I decided to use the shunting yard algorithm to parse (because I couldn't get recursive descent to work even after left recursion elimination/left factoring).
The real problem I have is in the cond_expr rule; shunting yard is too powerful for this grammar i.e the grammar can't accept certain conditional expressions. For example, the expression (x == 1) is not accepted, neither is !(x == 1) || (y == 1). I would use the recursive descent method to check if the expression can be accepted, but the issue is with the rel_expr in cond_expr, rel_expr can be substituted with rel_factor '==' rel_factor or rel_factor '!=' rel_factor, and each rel_factor can be substituted with '(' expr ')'. This leads to ambiguity (idk if that's the correct term) when deciding what branch to take in the cond_expr method upon seeing a '(' token. Something like the below:
Expression cond_expr() {
if (next() == "!") {
expect("!");
expect("(");
auto cond = cond_expr();
expect(")");
return cond;
} else if (next() == "(") {
// this will fail for e.g (x + 1) == 2
expect("(");
auto cond1 = cond_expr();
expect(")");
expect("&&");
expect("(");
auto cond2 = cond_expr();
expect(")");
return Node("&&", cond1, cond2);
} else {
return rel_expr();
}
}
My current strategy I'm attempting is to first validate that the expression can be accepted by the grammar using some subroutine, then calling the shunting yard algorithm to parse it into the required AST. However, I'm having a lot of trouble writing this validation subroutine. Anyone have any suggestions on any methods to solve this?

Resolving messy precedence in Bison Parser

I'm currently building a parser for my own language but I'm having trouble with precedence (I think that's the problem). I only discovered this issue while I was constructing the AST. I defined the following tokens:
%token T_SEMICOLON T_COMMA T_PERIOD T_RETURN T_BOOLEAN T_EXTENDS T_TRUE T_FALSE T_IF T_DO T_NEW T_ELSE T_EQUALSSIGN T_PRINT T_INTEGER T_NONE T_WHILE T_NUMBER T_VARIABLE T_LEFTCURLY T_RIGHTCURLY T_LEFTPAREN T_RIGHTPAREN T_ARROW T_EOF
%left T_OR
%left T_AND
%left T_MINUS T_EQUALSWORD T_PLUS
%left T_GE T_GEQ
%left T_MULTIPLY T_DIVIDE
%right T_NOT
(only the ones with precedence really matter in this case) and I have a grammar that takes the form
Something: T_PRINT expression T_SEMICOLON
where I have several expression productions, some of them being
expression : expression T_PLUS expression {$$ = new PlusNode($1, $3);}
| expression T_MINUS expression {$$ = new MinusNode($1, $3);}
| expression T_EQUALSWORD expression {$$ = new EqualNode($1, $3);}
| expression T_MULTIPLY expression {$$ = new TimesNode($1, $3);}
| expression T_DIVIDE expression {$$ = new DivideNode($1, $3);}
| T_NOT expression {$$ = new NotNode($2);}
| T_MINUS expression {$$ = new NegationNode($2);}
| T_VARIABLE {$$ = new VariableNode($1);}
| T_NUMBER {$$ = new IntegerLiteralNode($1);}
| T_TRUE {$$ = new BooleanLiteralNode(new IntegerNode(1));}
| T_FALSE {$$ = new BooleanLiteralNode(new IntegerNode(0));}
| T_NEW T_VARIABLE {$$ = new NewNode($2, NULL);}
when I try to parse something like print true equals new c0 - new c0; I get a strange output AST that looks like Print(Minus(Equal(BooleanLiteral(1), New("c0")), New("c0"))). Here it looks like the T_EQUALSWORD token has higher precedence than the T_MINUS token even though I defined it otherwise?
This problem also occurs if I change the minus to a plus; inputting print true equals new c0 + new c0; I get Print(Equal(BooleanLiteral(1), GreaterEqual(New("c0"), New("c0")))) as output. The form seems correct but I for some reason am getting an T_GEQ token instead of an T_PLUS?
Interestingly, this parses correctly for T_MULTIPLY and T_DIVIDE instead:
Input: print true equals new c0 * new c0;
Output: Print(Equal(BooleanLiteral(1), Times(New("c0"), New("c0"))))
Input: print true equals new c0 / new c0;
Output: Print(Equal(BooleanLiteral(1), Divide(New("c0"), New("c0"))))
So this seems to work correctly with multiply and divide but is failing with plus and minus. Any ideas?
EDIT: I added %prec T_OR to the T_EQUALSWORD production and this solved the problem when I use subtraction, but I still get the weird T_GEQ token when I use addition.
This line
%left T_MINUS T_EQUALSWORD T_PLUS
Means the three operators have the same precedence and associate left to right. The first result you get are consistent with that.
a - b - c (a - b) - c # Left associative
a - b = c (a - b) = c # Also left associative
a = b - c (a = b) - c # Once again
If you want equality to be of lower precedence, give it its own precedence level.
The second problem, with the wrong token being produced, is probably because your scanner produces the wrong token.

How does a parser solves shift/reduce conflict?

I have a grammar for arithmetic expression which solves number of expression (one per line) in a text file. While compiling YACC I am getting message 2 shift reduce conflicts. But my calculations are proper. If parser is giving proper output how does it resolves the shift/reduce conflict. And In my case is there any way to solve it in YACC Grammar.
YACC GRAMMAR
Calc : Expr {printf(" = %d\n",$1);}
| Calc Expr {printf(" = %d\n",$2);}
| error {yyerror("\nBad Expression\n ");}
;
Expr : Term { $$ = $1; }
| Expr '+' Term { $$ = $1 + $3; }
| Expr '-' Term { $$ = $1 - $3; }
;
Term : Fact { $$ = $1; }
| Term '*' Fact { $$ = $1 * $3; }
| Term '/' Fact { if($3==0){
yyerror("Divide by Zero Encountered.");
break;}
else
$$ = $1 / $3;
}
;
Fact : Prim { $$ = $1; }
| '-' Prim { $$ = -$2; }
;
Prim : '(' Expr ')' { $$ = $2; }
| Id { $$ = $1; }
;
Id :NUM { $$ = yylval; }
;
What change should I do to remove such conflicts in my grammar ?
Bison/yacc resolves shift-reduce conflicts by choosing to shift. This is explained in the bison manual in the section on Shift-Reduce conflicts.
Your problem is that your input is just a series of Exprs, run together without any delimiter between them. That means that:
4 - 2
could be one expression (4-2) or it could be two expressions (4, -2). Since bison-generated parsers always prefer to shift, the parser will choose to parse it as one expression, even if it were typed on two lines:
4
-2
If you want to allow users to type their expressions like that, without any separator, then you could either live with the conflict (since it is relatively benign) or you could codify it into your grammar, but that's quite a bit more work. To put it into the grammar, you need to define two different types of Expr: one (which is the one you use at the top level) cannot start with an unary minus, and the other one (which you can use anywhere else) is allowed to start with a unary minus.
I suspect that what you really want to do is use newlines or some other kind of expression separator. That's as simple as passing the newline through to your parser and changing Calc to Calc: | Calc '\n' | Calc Expr '\n'.
I'm sure that this appears somewhere else on SO, but I can't find it. So here is how you disallow the use of unary minus at the beginning of an expression, so that you can run expressions together without delimiters. The non-terminals starting n_ cannot start with a unary minus:
input: %empty | input n_expr { /* print $2 */ }
expr: term | expr '+' term | expr '-' term
n_expr: n_term | n_expr '+' term | n_expr '-' term
term: factor | term '*' factor | term '/' factor
n_term: value | n_term '+' factor | n_term '/' factor
factor: value | '-' factor
value: NUM | '(' expr ')'
That parses the same language as your grammar, but without generating the shift-reduce conflict. Since it parses the same language, the input
4
-2
will still be parsed as a single expression; to get the expected result you would need to type
4
(-2)

Attempting to resolve shift-reduce parsing issue

I'm attempting to write a grammar for C and am having an issue that I don't quite understand. Relevant portions of the grammar:
stmt :
types decl SEMI { marks (A.Declare ($1, $2)) (1, 2) }
| simp SEMI { marks $1 (1, 1) }
| RETURN exp SEMI { marks (A.Return $2) (1, 2) }
| control { $1 }
| block { marks $1 (1, 1) }
;
control :
if { $1 }
| WHILE RPAREN exp LPAREN stmt { marks (A.While ($3, $5)) (1, 5) }
| FOR LPAREN simpopt SEMI exp SEMI simpopt RPAREN stmt { marks (A.For ($3, $5, $7, $9)) (1, 9) }
;
if :
IF RPAREN exp LPAREN stmt { marks (A.If ($3, $5, None)) (1, 5) }
| IF RPAREN exp LPAREN stmt ELSE stmt { marks (A.If ($3, $5, $7)) (1, 7) }
;
This doesn't work. I ran ocamlyacc -v and got the following report:
83: shift/reduce conflict (shift 86, reduce 14) on ELSE
state 83
if : IF RPAREN exp LPAREN stmt . (14)
if : IF RPAREN exp LPAREN stmt . ELSE stmt (15)
ELSE shift 86
IF reduce 14
WHILE reduce 14
FOR reduce 14
BOOL reduce 14
IDENT reduce 14
RETURN reduce 14
INT reduce 14
MAIN reduce 14
LBRACE reduce 14
RBRACE reduce 14
LPAREN reduce 14
I've read that shift/reduce conflicts are due to ambiguity in the specification of the grammar, but I don't see how I can specify this in a way that isn't ambiguous?
The grammar is certainly ambiguous, although you know what every string means, and furthermore despite the fact that ocamlyacc reports a shift/reduce conflict, its generated grammar will also produce the correct parse for every valid input.
The ambiguity comes from
if ( exp1 ) if ( exp2) stmt1 else stmt2;
Clearly stmt1 only executes if both exp1 and exp2 are true. But does stmt1 execute if exp1 is false, or if exp1 is true and exp2 is false? Those represent different parses; the first (invalid) parse attaches else stmt2 to if (exp1), while the parse that you, I and ocamlyacc know to be correct attaches else stmt2 to if (exp2).
The grammar can be rewritten, although it's a bit of a nuisance. The basic idea is to divide statements into two categories: "matched" (which means that every else in the statement is matched with some if) and "unmatched" (which means that a following else would match some if in the statement. A complete statement may be unmatched, because else clauses are optional, but you can never have an unmatched statement between an if and an else, because that else must match an if in the unmatched statement.
The following grammar is basically the one you provided, but rewritten to use bison-style single-quoted tokens, which I find more readable. I don't know if ocamlyacc handles those. (By the way, your grammar says IF RPAREN exp LPAREN... which, with the common definition of left and right parentheses, would mean if ) exp (. That's one reason I find single-quoted character terminals much more readable.)
Bison handles this grammar with no conflicts.
/* Fake non-terminals */
%token types decl simp exp
/* Keywords */
%token ELSE FOR IF RETURN WHILE
%%
stmt: matched_stmt | unmatched_stmt ;
stmt_list: stmt | stmt_list stmt ;
block: '{' stmt_list '}' ;
matched_stmt
: types decl ';'
| simp ';'
| RETURN exp ';'
| block
| matched_control
;
simpopt : simp | /* EMPTY */;
matched_control
: IF '(' exp ')' matched_stmt ELSE matched_stmt
| WHILE '(' exp ')' matched_stmt
| FOR '(' simpopt ';' exp ';' simpopt ')' matched_stmt
;
unmatched_stmt
: IF '(' exp ')' stmt
| IF '(' exp ')' matched_stmt ELSE unmatched_stmt
| WHILE '(' exp ')' unmatched_stmt
| FOR '(' simpopt ';' exp ';' simpopt ')' unmatched_stmt
;
Personally, I'd refactor a bit. Eg:
if_prefix : IF '(' exp ')'
;
loop_prefix: WHILE '(' exp ')'
| FOR '(' simpopt ';' exp ';' simpopt ')'
;
matched_control
: if_prefix matched_stmt ELSE matched_stmt
| loop_prefix matched_stmt
;
unmatched_stmt
: if_prefix stmt
| if_prefix ELSE unmatched_stmt
| loop_prefix unmatched_stmt
;
A common and simpler but less rigorous solution is to use precedence declarations as suggested in the bison manual.

Extending example grammar for Fsyacc with unary minus

I tried to extend the example grammar that comes as part of the "F# Parsed Language Starter" to support unary minus (for expressions like 2 * -5).
I hit a block like Samsdram here
Basically, I extended the header of the .fsy file to include precedence like so:
......
%nonassoc UMINUS
....
and then the rules of the grammar like so:
...
Expr:
| MINUS Expr %prec UMINUS { Negative ($2) }
...
also, the definition of the AST:
...
and Expr =
| Negative of Expr
.....
but still get a parser error when trying to parse the expression mentioned above.
Any ideas what's missing? I read the source code of the F# compiler and it is not clear how they solve this, seems quite similar
EDIT
The precedences are ordered this way:
%left ASSIGN
%left AND OR
%left EQ NOTEQ LT LTE GTE GT
%left PLUS MINUS
%left ASTER SLASH
%nonassoc UMINUS
Had a play around and managed to get the precedence working without the need for %prec. Modified the starter a little though (more meaningful names)
Prog:
| Expression EOF { $1 }
Expression:
| Additive { $1 }
Additive:
| Multiplicative { $1 }
| Additive PLUS Multiplicative { Plus($1, $3) }
| Additive MINUS Multiplicative { Minus($1, $3) }
Multiplicative:
| Unary { $1 }
| Multiplicative ASTER Unary { Times($1, $3) }
| Multiplicative SLASH Unary { Divide($1, $3) }
Unary:
| Value { $1 }
| MINUS Value { Negative($2) }
Value:
| FLOAT { Value(Float($1)) }
| INT32 { Value(Integer($1)) }
| LPAREN Expression RPAREN { $2 }
I also grouped the expressions into a single variant, as I didn't like the way the starter done it. (was awkward to walk through it).
type Value =
| Float of Double
| Integer of Int32
| Expression of Expression
and Expression =
| Value of Value
| Negative of Expression
| Times of Expression * Expression
| Divide of Expression * Expression
| Plus of Expression * Expression
| Minus of Expression * Expression
and Equation =
| Equation of Expression
Taking code from my article Parsing text with Lex and Yacc (October 2007).
My precedences look like:
%left PLUS MINUS
%left TIMES DIVIDE
%nonassoc prec_uminus
%right POWER
%nonassoc FACTORIAL
and the yacc parsing code is:
expr:
| NUM { Num(float_of_string $1) }
| MINUS expr %prec prec_uminus { Neg $2 }
| expr FACTORIAL { Factorial $1 }
| expr PLUS expr { Add($1, $3) }
| expr MINUS expr { Sub($1, $3) }
| expr TIMES expr { Mul($1, $3) }
| expr DIVIDE expr { Div($1, $3) }
| expr POWER expr { Pow($1, $3) }
| OPEN expr CLOSE { $2 }
;
Looks equivalent. I don't suppose the problem is your use of UMINUS in capitals instead of prec_uminus in my case?
Another option is to split expr into several mutually-recursive parts, one for each precedence level.

Resources