Creating a parser rule - parsing

I'm currently taking a CSCI compiler class at my college. I have to write a parser for the compiler, and I've already done addition, subtraction, multiplication, division, and the assignment statement. My question is: we now have to do less-than-or-equal (<=) and greater-than-or-equal (>=), and I'm not sure how to write the rule for it...
I was thinking something like...
expr LESSTHAN expr { $1 <= $3 }
expr GREATERTHAN expr { $1 >= $3 }
any suggestions?

You should include a more precise question. Here are some general suggestions though.
The structure of the rules for relational operations should be the same as for the arithmetic operations. In both cases you have binary operators. The difference is that one returns a number and the other returns a boolean value. While 1 + 1 >= 3 is usually valid syntax, other combinations like 1 >= 2 >= 5 are most likely invalid. Of course there are exceptions. Some languages allow it as syntactic sugar for multiple operations. Others simply define boolean values to be integers (0 and 1). It's up to you (or your assignment) what you want the syntax to look like.
Anyway, you probably don't want to simply append those rules to expr, but rather create a new rule. This way you distinguish between relational and arithmetic expressions.
expr :
expr PLUS expr |
expr MINUS expr |
... ;
relational_expr :
expr LESSTHAN expr |
expr GREATERTHAN expr ;
assignment :
identifier '=' relational_expr |
identifier '=' expr ;
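To make the effect of the separate rule concrete, here is a minimal hand-written sketch in Python (not yacc) of the same layering. All names (tokenize, Parser, parse) are made up for illustration, and it assumes integer operands and that a relational operator may appear at most once, at the top level:

```python
import re

TOKEN = re.compile(r"\s*(<=|>=|\d+|[-+*/()])")

def tokenize(text):
    """Split the input into numbers, operators and parentheses."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def relational_expr(self):
        # relational_expr : expr | expr "<=" expr | expr ">=" expr
        left = self.expr()
        if self.peek() in ("<=", ">="):
            op = self.take()
            left = (op, left, self.expr())
        # Anything left over (such as a second ">=") is an error.
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return left

    def expr(self):
        # expr : term (("+" | "-") term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    def term(self):
        # term : atom (("*" | "/") atom)*
        node = self.atom()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.atom())
        return node

    def atom(self):
        token = self.take()
        if token == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is None or not token.isdigit():
            raise SyntaxError(f"expected a number, got {token!r}")
        return int(token)

def parse(text):
    return Parser(text).relational_expr()
```

With this layering, parse("1 + 1 >= 3") builds a single comparison node whose left operand is the sum, while parse("1 >= 2 >= 5") raises SyntaxError because a comparison is never an operand of another comparison.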

Related

Simple YACC grammar with problems

I want to implement a simple YACC parser, but I have problems with my grammar:
%union {
s string
b []byte
t *ASTNode
}
%token AND OR NOT
%token<s> STRING
%token<b> BYTES
%type<t> expr
%left AND
%left OR
%right NOT
%%
expr: '(' expr ')' {{ $$ = $2 }}
| expr AND expr {{ $$ = NewASTComplexNode(OPND_AND, $1, $3) }}
| expr AND NOT expr {{ $$ = NewASTComplexNode(OPND_AND, $1, NewASTComplexNode(OPND_NOT, $4, nil)) }}
| NOT expr AND expr {{ $$ = NewASTComplexNode(OPND_AND, NewASTComplexNode(OPND_NOT, $2, nil), $4) }}
| expr OR expr {{ $$ = NewASTComplexNode(OPND_OR, $1, $3) }}
| STRING {{ $$ = NewASTSimpleNode(OPRT_STRING, $1) }}
| BYTES {{ $$ = NewASTSimpleNode(OPRT_BYTES, $1) }}
;
Can someone explain to me why it gives me these errors?
rule expr: NOT expr AND expr never reduced
1 rules never reduced
conflicts: 3 reduce/reduce
In a comment, it is clarified that the requirement is that:
The NOT operator should apply only to operands of AND and [the operands] shouldn't be both NOT.
The second part of that requirement is a little odd, since the AND operator is defined to be left-associative. That would mean that
a AND NOT b AND NOT c
would be legal, because it is grouped as (a AND NOT b) AND NOT c, in which both AND operators have one positive term. But rotating the arguments (which might not change the semantics at all) produces:
NOT b AND NOT c AND a
which would be illegal because the first grouping (NOT b AND NOT c) contains two NOT expressions.
It might be that the intention was that any conjunction (sequence of AND operators) contain at least one positive term.
Both of these constraints are possible, but they cannot be achieved using operator precedence declarations.
Operator precedence can be used to resolve ambiguity in an ambiguous grammar (and expr: expr OR expr is certainly ambiguous, since it allows OR to associate in either direction). But it cannot be used to impose additional requirements on operands, particularly not a requirement which takes both operands into account [Note 1]. In order to do that, you need to write out an unambiguous grammar. Fortunately, that's not too difficult.
The basis for the unambiguous grammar is what is sometimes called cascading precedence style; this effectively encodes the precedence into rules, so that precedence declarations are unnecessary. The basic boolean grammar looks like this:
expr: conjunction /* Cascade to the next level */
| expr OR conjunction
conjunction
: term /* Continue the cascade */
| conjunction AND term
term: atom /* NOT is a prefix operator */
| NOT term /* which is not what you want */
atom: '(' expr ')'
| STRING
Each precedence level has a corresponding non-terminal, and the levels "cascade" in the sense that each one, except the last, includes the next level as a unit production.
To adapt this to the requirement that NOT be restricted to at most one operand of an AND operator, we can write out the possibilities more or less as you did in the original grammar, but respecting the cascading style:
expr: conjunction
| expr OR conjunction
conjunction
: atom /* NOT is integrated, so no need for 'term' */
| conjunction AND atom
| conjunction AND NOT atom /* Can extend with a negative */
| NOT atom AND atom /* Special case for negative at beginning */
atom: '(' expr ')'
| STRING
The third rule for conjunction (conjunction: conjunction AND NOT atom) allows any number of NOT applications to be added at the end of a list of operands, but does not allow consecutive NOT operands at the beginning of the list. A single NOT at the beginning is allowed by the fourth rule.
If you prefer the rule that a conjunction has to have at least one positive term, you can use the following very simple adaptation:
expr: conjunction
| expr OR conjunction
conjunction
: atom /* NOT is integrated, so no need for 'term' */
| conjunction AND atom
| conjunction AND NOT atom
| negative AND atom /* Possible initial list of negatives */
negative /* This is not part of the cascade */
: NOT atom
| negative AND NOT atom
atom: '(' expr ')'
| STRING
In this variant, negative will match, for example, NOT a AND NOT b AND NOT c. But because it's not in the cascade, that sequence doesn't automatically become a valid expression. In order for it to be used as an expression, it needs to be part of conjunction: negative AND atom, which requires that the sequence include a positive.
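For experimenting with which inputs this variant accepts, here is a rough Python recognizer for the same language. It is only a sketch under assumptions: it is hand-written rather than generated from the grammar, it tracks the "at least one positive term" requirement with a flag instead of the negative non-terminal, it treats any word other than AND, OR and NOT as a STRING, and the function name parse is made up:

```python
def parse(text):
    """Parse a boolean expression; every conjunction needs a positive term."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():
        # expr: conjunction | expr OR conjunction
        node = conjunction()
        while peek() == "OR":
            take()
            node = ("OR", node, conjunction())
        return node

    def conjunction():
        # A left-associative list of operands joined by AND.
        node, has_positive = operand()
        while peek() == "AND":
            take()
            right, positive = operand()
            node = ("AND", node, right)
            has_positive = has_positive or positive
        if not has_positive:
            raise SyntaxError("a conjunction needs at least one positive term")
        return node

    def operand():
        # Returns (node, is_positive); NOT applies to a single atom only.
        if peek() == "NOT":
            take()
            return ("NOT", atom()), False
        return atom(), True

    def atom():
        token = take()
        if token == "(":
            node = expr()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is None or token in ("AND", "OR", "NOT", ")"):
            raise SyntaxError(f"expected an operand, got {token!r}")
        return token

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

Here parse("NOT b AND NOT c AND a") succeeds, since the conjunction contains the positive term a, while parse("NOT a AND NOT b") and parse("a OR NOT b") both raise SyntaxError.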
Notes
1. The %nonassoc precedence declaration can be used to reject chained use of operators from the same precedence level. It's a bit of a hack, and can sometimes have unexpected consequences. The expected use case is operators like < in languages like C which don't have special handling for chained comparison; using %nonassoc you can declare that chaining is illegal in the precedence level for comparison operators.
But %nonassoc only works within a single precedence level, and it only works if all the operators in the precedence level require the same treatment. If the intended grammar does not fully conform to those requirements, it will be necessary -- as with this grammar -- to abandon the use of precedence declarations and write out an unambiguous grammar.

Combining unary operators with different precedence

I was having some trouble with Bison creating an operator as such:
<- = identity postfix operator with a low precedence to force evaluation of what's on the left first, e.g. 1+2<-*3 (equivalent to (1+2)*3), as well as ->, which is a prefix operator that does the same thing but to the right.
I was not able to get the syntax to work properly and tested with Python using - not False, which resulted in a syntax error (in Python, - has a greater precedence than not). However, this is not a problem in C or C++, where - and !/not have the same precedence.
Of course, the difference in precedence has nothing to do with the relationship between the 2 operators, only a relationship with other operators that result in the relative precedences between them.
Why is chaining prefix or postfix operators with different precedences a problem when parsing, and how can I implement the <- and -> operators while still having higher-precedence operators like !, ++, NOT, etc.?
Obligatory Bison (this pattern is repeated for all operators, where copy has greater precedence than post_unary):
post_unary:
copy
| post_unary "++"
| post_unary "--"
| post_unary '!'
;
Chaining operators in this category, e.g. x ! -- ! works fine syntactically.
Ok, let me suggest a possible erroneous grammar based on your sketch:
low_postfix:
mid_infix
| low_postfix "<-"
mid_infix:
high_postfix
| mid_infix '+' high_postfix
high_postfix:
term
| high_postfix "++"
term:
ID
| '(' expr ')'
It should be clear just looking at those productions that var <- ++ is not part of the language. The only things that can be used as an operand to ++ are terms and other applications of ++. var <- is neither of these things.
On the other hand, var ++ <- is fine, because the operand to <- can be a mid_infix which can be a high_postfix which is an application of the ++ operator.
If the intention were to allow both of those postfix sequences, then that grammar is incorrect.
A version of that cascade is present in the Python grammar (albeit using prefix operators) which is why not - False is OK, but - not False is a syntax error. I'm reluctant to call that a bug because it may have been intentional. (Really, neither of those expressions makes much sense.) We could disagree about the value of such an intention but not on SO, which prefers to avoid opinionated discussions.
Note that what we might call "strict precedence" in this grammar and the Python grammar is by no means restricted to combinations of unary operators. Here's another one which you have likely never tried:
$ python3 -c 'print(41 + not False)'
File "<string>", line 1
print(41 + not False)
^
SyntaxError: invalid syntax
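If you want to check these cases from a script rather than the shell, the built-in compile() can be asked to parse each string as an expression. This is just a small helper for experimenting; the name parses is made up:

```python
def parses(source):
    """Return True if Python accepts `source` as an expression."""
    try:
        compile(source, "<string>", "eval")
    except SyntaxError:
        return False
    return True

for source in ("not - False", "- not False", "41 + not False", "41 + (not False)"):
    print(f"{source!r}: {'ok' if parses(source) else 'syntax error'}")
```

Adding parentheses around not False is enough to make the rejected forms acceptable, which shows that the restriction is purely one of grammar structure.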
So, how can we fix that?
On some level, it would be nice to be able to just write an unambiguous grammar which conveyed our intention. And it is certainly possible to write an unambiguous grammar, which would convey the intention to bison. But it's at least an open question as to whether it would convey anything to a human reader, because the massive clutter of multiple rules necessary in order to keep track of what is and is not an acceptable grouping would be pretty daunting.
On the other hand, it's dead simple to do with bison/yacc precedence declarations. We just list the operators in order, and the parser generator resolves all the ambiguities accordingly. [See Note 1 below]
Here's a similar grammar to the above, with precedence declarations. (I left the actions in place in case you want to play with it, although it's by no means a Reproducible Example; the infrastructure it relies upon is much bigger than the grammar itself, and of little use to anyone other than me. So you'll have to define the three functions and fill in some of the bison type declarations. Or just delete the AST functions and use your own.)
%left ','
%precedence "<-"
%precedence "->"
%left '+'
%left '*'
%precedence NEG
%right "++" '('
%%
expr: expr ',' expr { $$ = make_binop(OP_LIST, $1, $3); }
| "<-" expr { $$ = make_unop(OP_LARR, $2); }
| expr "->" { $$ = make_unop(OP_RARR, $1); }
| expr '+' expr { $$ = make_binop(OP_ADD, $1, $3); }
| expr '*' expr { $$ = make_binop(OP_MUL, $1, $3); }
| '-' expr %prec NEG { $$ = make_unop(OP_NEG, $2); }
| expr '(' expr ')' %prec '(' { $$ = make_binop(OP_CALL, $1, $3); }
| "++" expr { $$ = make_unop(OP_PREINC, $2); }
| expr "++" { $$ = make_unop(OP_POSTINC, $1); }
| VALUE { $$ = make_ident($1); }
| '(' expr ')' { $$ = $2; }
A couple of notes:
I used %prec NEG on the unary minus production in order to separate that production from the subtraction production. I also used a %prec declaration to modify the precedence of the call production (the default would be ')'), although in this particular case that's unnecessary. It is necessary to put '(' into the precedence list, though. ( is the lookahead symbol which is used in precedence comparisons.
For many unary operators, I used bison's %precedence declaration in the precedence list, rather than %right or %left. Really, there is no such thing as associativity with unary operators, so I think that it's more self-documenting to use %precedence, which doesn't resolve conflicts involving reductions and shifts in the same precedence level. However, even though there is no such thing as associativity between unary operators, the nature of the precedence resolution algorithm is that you can put prefix operators and postfix operators in the same precedence level and choose whether the postfix or prefix operators have priority by using %right or %left, respectively. %right is almost always correct. I did that with ++, because I was a bit lazy by the time I got to that point.
This does "work" (I think). It certainly resolves all the conflicts; bison happily produces a parser without warnings. And the tests that I tried worked at least as I expected them to:
? a++->
=> [-> [++/post a]]
? a->++
=> [++/post [-> a]]
? 3*f(a)+2
=> [+ [* 3 [CALL f a]] 2]
? 3*f(a)->+2
=> [+ [-> [* 3 [CALL f a]]] 2]
? 2+<-f(a)*3
=> [+ 2 [<- [* [CALL f a] 3]]]
? 2+<-f(a)*3->
=> [+ 2 [<- [-> [* [CALL f a] 3]]]]
But there are some expressions where the operator precedence, while "correct", might not be easily explained to a novice user. For example, although the arrow operators look somewhat like parentheses, they don't group that way. Furthermore, the decision as to which of the two operators has higher precedence seems to me to be totally arbitrary (and indeed I might have done it differently from what you expected). Consider:
? <-2*f(a)->+3
=> [<- [+ [-> [* 2 [CALL f a]]] 3]]
? <-2+f(a)->*3
=> [<- [* [-> [+ 2 [CALL f a]]] 3]]
? 2+<-f(a)->*3
=> [+ 2 [<- [* [-> [CALL f a]] 3]]]
There's also something a bit odd about how the arrow operators override normal operator precedence, so that you can't just drop them into a formula without changing its meaning:
? 2+f(a)*3
=> [+ 2 [* [CALL f a] 3]]
? 2+f(a)->*3
=> [* [-> [+ 2 [CALL f a]]] 3]
If that's your intention, fine. It's your language.
Note that there are operator precedence problems which are not quite so easy to solve by just listing operators in precedence order. Sometimes it would be convenient for a binary operator to have different binding power on the left- and right-hand sides.
A classic (but perhaps controversial) case is the assignment operator, if it is an operator. Assignment must associate to the right (because parsing a = b = 0 as (a = b) = 0 would be ridiculous), and the usual expectation is that it greedily accepts as much to the right as possible. If assignment had consistent precedence, then it would also accept as much to the left as possible, which seems a bit strange, at least to me. If a = 2 + b = 7 is meaningful, my intuitions say that its meaning should be a = (2 + (b = 7)) [Note 2]. That would require differential precedence, which is a bit complicated but not unheard of. C solves this problem by restricting the left-hand side of the assignment operators to (syntactic) lvalues, which cannot be binary operator expressions. But in C++, it really does mean a = ((2 + b) = 7), which is semantically valid if 2 + b has been overloaded by a function which returns a reference.
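One way to experiment with differential precedence outside of bison is a small precedence-climbing (Pratt-style) parser, where each infix operator gets separate left and right binding powers. The following Python sketch is an illustration only: the table values and the name parse are assumptions, and it is not how bison resolves precedence. Giving '=' a high left power and a low right power produces the a = (2 + (b = 7)) grouping:

```python
import re

# (left binding power, right binding power) for each infix operator.
# '=' binds tightly on its left and loosely on its right.
BINDING_POWER = {
    "=": (30, 1),
    "+": (10, 11),
    "*": (20, 21),
}

def parse(text):
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[=+*()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def atom():
        token = take()
        if token == "(":
            node = expression(0)
            if take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is None or token in BINDING_POWER or token == ")":
            raise SyntaxError(f"expected an operand, got {token!r}")
        return token

    def expression(min_power):
        left = atom()
        while peek() in BINDING_POWER:
            left_power, right_power = BINDING_POWER[peek()]
            # Stop if this operator binds its left operand too weakly.
            if left_power < min_power:
                break
            op = take()
            left = (op, left, expression(right_power))
        return left

    tree = expression(0)
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

With this table, parse("a = 2 + b = 7") nests the inner assignment under the addition, and parse("a = b = 0") still associates to the right, while + and * keep their usual left-associative behaviour.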
Notes
1. Precedence declarations do not really add any power to the parser generator. The languages it can produce a parser for are exactly the same languages; it produces the same sort of parsing machine (a pushdown automaton); and it is at least theoretically possible to take that pushdown automaton and reverse engineer a grammar out of it. (In practice, the grammars produced by this process are usually monstrous. But they exist.)
All that the precedence declarations do is resolve parsing conflicts (typically in an ambiguous grammar) according to some user-supplied rules. So it's worth asking why it's so much simpler with precedence declarations than by writing an unambiguous grammar.
The simple hand-waving answer is that precedence rules only apply when there is a conflict. If the parser is in a state where only one action is possible, that's the action which remains, regardless of what the precedence rules might say. In a simple expression grammar, an infix operator followed by a prefix operator is not at all ambiguous: the prefix operator must be shifted, because there is no reduce action for a partial sequence ending with an infix operator.
But when we're writing a grammar, we have to specify explicitly what constructs are possible at each point in the grammar, which we usually do by defining a bunch of non-terminals, each corresponding to some parsing state. An unambiguous grammar for expressions already has split the expression non-terminal into a cascading series of non-terminals, one for each operator precedence value. But unary operators do not have the same binding power on both sides (since, as noted above, one side of the unary operator cannot take an operand). That means that a binary operator could well be able to accept a unary operator for one of its operands, and not be able to accept the same unary operator for its other operand. Which in turn means that we need to split all of our non-terminals again, corresponding to whether the non-terminal appears on the left or the right side of a binary operator.
That's a lot of work, and it's really easy to make a mistake. If you're lucky, the mistake will result in a parsing conflict; but equally it could result in the grammar not being able to recognise a particular construct which you would never think of trying, but which some irate language user feels is an absolute necessity. (Like 41 + not False)
2. It's possible that my intuitions have been permanently marked by learning APL at a very early age. In APL, all operators associate to the right, basically without any precedence differences.

Bison subscript expression unexpected error

With the following grammar:
program: /*empty*/ | stmt program;
stmt: var_decl | assignment;
var_decl: type ID '=' expr ';';
assignment: expr '=' expr ';';
type: ID | ID '[' NUMBER ']';
expr: ID | NUMBER | subscript_expr;
subscript_expr: expr '[' expr ']';
I'd expect the following to be valid:
array[5] = 0;
That's just an assignment with a subscript_expr on the left-hand-side. However the generated parser gives an error for that statement:
syntax error, unexpected '=', expecting ID
Generating the parser also warns that there's 1 shift/reduce conflict. Removing subscript_expr makes it go away.
Why does this happen and how can I get it to parse array[5] = 0; as an assignment with a subscript_expr?
I'm using Bison 2.3.
The following two statements are both valid in your language:
x [ 3 ] = 42;
x [ 3 ] y = 42;
The first is an assignment of an element of the array variable x, while the second is a declaration and initialization of the array variable y whose elements are of type x.
But from the parser's viewpoint, x and y are both just IDs; it has no way of knowing that x is a variable in the first case and a type in the second case. All it can do is notice that the two statements match the productions assignment and var_decl, respectively.
Unfortunately, it cannot do that until it sees the token after the ]. If that token is an ID, then the statement must be a var_decl; otherwise, it's an assignment (assuming the statement is valid, of course).
But in order to parse the statement as an assignment, the parser must be able to produce
expr '=' expr
which in this case is the result of expr: subscript_expr, which in turn is the result of subscript_expr: expr '[' expr ']'.
So the set of reductions for the first statement will be as follows: (Note: I didn't write the shifts; rather, I mark the progress of the parse by putting a • at the end of each reduction. To get to the next step, just shift the • until you reach the end of the handle.)
ID • [ NUMBER ] = NUMBER ; expr: ID
expr [ NUMBER • ] = NUMBER ; expr: NUMBER
expr [ expr ] • = NUMBER ; subscript_expr: expr '[' expr ']'
subscript_expr • = NUMBER ; expr: subscript_expr
expr = NUMBER • ; expr: NUMBER
expr = expr ; • assignment: expr '=' expr ';'
assignment
The second statement must be parsed as follows:
ID [ NUMBER ] • ID = NUMBER ; type: ID '[' NUMBER ']'
type ID = NUMBER • ; expr: NUMBER
type ID = expr ; • var_decl: type ID '=' expr ';'
var_decl
That's a shift/reduce conflict, because the crucial decision must be made immediately after the first ID. In the first statement, we need to reduce the identifier to an expr. In the second statement, we must continue shifting until we are ready to reduce a type.
Of course, this problem wouldn't exist if we could lexically distinguish type IDs from variable name IDs, but that may not be possible (or, if possible, it may not be desirable because it requires feedback from the parser to the lexer).
As written, the shift/reduce prediction can be made with fixed lookahead, since the fourth token after the ID will determine the possibilities. That makes the grammar LALR(4), but that doesn't help much since bison only implements LALR(1) parsers. In any case, it is likely that a less simplified grammar will not be fixed-lookahead, for example if constant expressions are allowed for array sizes, or if arrays can have multiple dimensions.
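To see what that fixed lookahead amounts to, here is a tiny Python sketch of the decision an LALR(4) parser could make. It is not something bison will do for you; the function name classify and the representation of tokens as a list of token-kind strings are assumptions for illustration:

```python
def classify(tokens):
    """Classify a statement that starts with ID '[' NUMBER ']'.

    `tokens` is a list of token kinds such as "ID", "NUMBER", "[", "]", "=", ";".
    """
    if tokens[:4] != ["ID", "[", "NUMBER", "]"]:
        raise ValueError("this sketch only handles the ID [ NUMBER ] prefix")
    if len(tokens) < 5:
        raise SyntaxError("statement ends too early")
    # The fourth token after the leading ID decides which production applies.
    return "var_decl" if tokens[4] == "ID" else "assignment"
```

For x [ 3 ] = 42; the token after the ] is '=', so the statement is classified as an assignment; for x [ 3 ] y = 42; it is an ID, so the statement is a var_decl.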
Even so, the grammar is not ambiguous, so it is possible to use a GLR parser to parse it. Bison does implement GLR parsers; it is only necessary to insert
%glr-parser
into the prologue. (The shift/reduce warning will still be produced, but the parser will correctly identify both kinds of statement.)
It's worth noting that C doesn't have this particular parsing problem precisely because it puts the array size after the name of the variable being declared. I don't believe this was done to avoid parsing problems (although who knows?) but rather because it was believed that it is more natural to write declarations the way variables are used. Hence, we write int a[3] and char *p, because in the program we will dereference using a[i] and *p.
It is possible to write an LALR(1) grammar for this syntax, but it's a bit annoying. The key is to delay the reduction of the syntax ID [ NUMBER ] until we know for sure which production it will be the start of. That means we need to include the production expr: ID '[' NUMBER ']'. That will result in a larger number of shift/reduce warnings (since it makes the grammar ambiguous), but since bison always prefers to shift, it should produce a correct parser.
Adding %glr-parser solves this.

Grammar of calculator in a finite field

I have a working calculator apart from one thing: unary operator '-'.
It has to be evaluated and dealt with in 2 different cases:
When there is some expression further like so -(3+3)
When there isn't: -3
For case 1, I want to get a postfix output 3 3 + -
For case 2, I want to get just correct value of this token in this field, so for example in Z10 it's 10-3 = 7.
My current idea:
E: ...
| '-' NUM %prec NEGATIVE { $$ = correct(-yylval); appendNumber($$); }
| '-' E %prec NEGATIVE { $$ = correct(P-$2); strcat(rpn, "-"); }
| NUM { appendNumber(yylval); $$ = correct(yylval); }
Where NUM is a token, but obviously the compiler says there is a reduce/reduce conflict, as E can also be a NUM in some cases. Although it works, I want to get rid of the compiler warning, and I've run out of ideas.
It has to be evaluated and dealt with in 2 different cases:
No it doesn't. The cases are not distinct.
Both - E and - NUM are incorrect. The correct grammar would be something like:
primary
: NUM
| '-' primary
| '+' primary /* for completeness */
| '(' expression ')'
;
Normally, this should be implemented as two rules (pseudocode, I don't know bison syntax):
This is the likely rule for the 'terminal' element of an expression. Naturally, a parenthesized expression leads to a recursion to the top rule:
Element => Number
| '(' Expression ')'
The unary minus (and also the unary plus!) are just on one level up in the stack of productions (grammar rules):
Term => '-' Element
| '+' Element
| Element
Naturally, this can unbundle into all possible combinations such as '-' Number, '-' '(' Expression ')', likewise with '+' and without any unary operator at all.
Suppose we want addition / subtraction, and multiplication / division. Then the rest of the grammar would look like this:
Expression => Expression '+' MultiplicationExpr
| Expression '-' MultiplicationExpr
| MultiplicationExpr
MultiplicationExpr => MultiplicationExpr '*' Term
| MultiplicationExpr '/' Term
| Term
For the sake of completeness:
Terminals:
Number
Non-terminals:
Expression
Element
Term
MultiplicationExpr
Number, which is a terminal, shall match a regexp like this [0-9]+. In other words, it does not parse a minus sign — it's always a positive integer (or zero). Negative integers are calculated by matching a '-' Number sequence of tokens.
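The layering above can be tried out with a small recursive-descent evaluator. This is a Python sketch rather than bison, and it makes several assumptions: the function name evaluate is made up, division is left out, unary minus is written as neg in the postfix output so it can be told apart from binary minus, and values are reduced modulo the given modulus:

```python
import re

def evaluate(text, modulus):
    """Return (value, rpn) for `text`, with arithmetic done modulo `modulus`."""
    tokens = re.findall(r"\d+|[-+*()]", text)
    pos = 0
    rpn = []

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expression():
        # Expression => Expression ('+' | '-') MultiplicationExpr | MultiplicationExpr
        value = multiplication()
        while peek() in ("+", "-"):
            op = take()
            right = multiplication()
            value = (value + right if op == "+" else value - right) % modulus
            rpn.append(op)
        return value

    def multiplication():
        # MultiplicationExpr => MultiplicationExpr '*' Term | Term
        value = term()
        while peek() == "*":
            take()
            value = (value * term()) % modulus
            rpn.append("*")
        return value

    def term():
        # Term => '-' Element | '+' Element | Element
        if peek() == "-":
            take()
            value = (-element()) % modulus
            rpn.append("neg")
            return value
        if peek() == "+":
            take()
        return element()

    def element():
        # Element => Number | '(' Expression ')'
        token = take()
        if token == "(":
            value = expression()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is None or not token.isdigit():
            raise SyntaxError(f"expected a number, got {token!r}")
        rpn.append(token)
        return int(token) % modulus

    result = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result, " ".join(rpn)
```

In Z10, evaluate("-3", 10) gives the value 7 and evaluate("-(3+3)", 10) gives 4, and both go through the same Term rule; no separate production for a negative literal is needed.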

Resolving reduce/reduce conflict in yacc/ocamlyacc

I'm trying to parse a grammar in ocamlyacc (pretty much the same as regular yacc) which supports function application with no operators (like in Ocaml or Haskell), and the normal assortment of binary and unary operators. I'm getting a reduce/reduce conflict with the '-' operator, which can be used both for subtraction and negation. Here is a sample of the grammar I'm using:
%token <int> INT
%token <string> ID
%token MINUS
%start expr
%type <expr> expr
%nonassoc INT ID
%left MINUS
%left APPLY
%%
expr: INT
{ ExprInt $1 }
| ID
{ ExprId $1 }
| expr MINUS expr
{ ExprSub($1, $3) }
| MINUS expr
{ ExprNeg $2 }
| expr expr %prec APPLY
{ ExprApply($1, $2) };
The problem is that when you get an expression like "a - b" the parser doesn't know whether this should be reduced as "a (-b)" (negation of b, followed by application) or "a - b" (subtraction). The subtraction reduction is correct. How do I resolve the conflict in favor of that rule?
Unfortunately, the only answer I can come up with means increasing the complexity of the grammar.
split expr into simple_expr and expr_with_prefix
allow only simple_expr or (expr_with_prefix) in an APPLY
The first step turns your reduce/reduce conflict into a shift/reduce conflict, but the parentheses resolve that.
You're going to have the same problem with 'a b c': is it a(b(c)) or (a(b))(c)? You'll need to also break off applied_expression and require (applied_expression) in the grammar.
I think this will do it, but I'm not sure:
expr := INT
| parenthesized_expr
| expr MINUS expr
parenthesized_expr := ( expr )
| ( applied_expr )
| ( expr_with_prefix )
applied_expr := expr expr
expr_with_prefix := MINUS expr
Well, the simplest answer is to just ignore it and let the default reduce/reduce resolution handle it: reduce the rule that appears first in the grammar. In this case, that means reducing expr MINUS expr in preference to MINUS expr, which is exactly what you want. After seeing a-b, you want to parse it as a binary minus, rather than a unary minus and then an apply.

Resources