While searching for Bison grammars, I found this example of a C grammar:
https://www.lysator.liu.se/c/ANSI-C-grammar-y.html
logical_and_expression
: inclusive_or_expression
| logical_and_expression AND_OP inclusive_or_expression
;
logical_or_expression
: logical_and_expression
| logical_or_expression OR_OP logical_and_expression
;
I don't understand the reason for having a rule for each logical operation. Is there an advantage over the construction below?
binary_expression
: object // imagine it can be bool, int, real ...
| binary_expression AND_OP binary_expression
| binary_expression OR_OP binary_expression
;
The grammar you quote is unambiguous.
The one you suggest is ambiguous, although yacc/bison allow you to use precedence rules to resolve the ambiguities.
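For comparison, the ambiguous version needs declarations along these lines to be acceptable to yacc/bison (a sketch of your grammar; remember that in bison a later %left line binds more tightly):
%left OR_OP
%left AND_OP
%%
binary_expression
: object
| binary_expression AND_OP binary_expression
| binary_expression OR_OP binary_expression
;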
There are some advantages to using a grammar which makes operator precedence explicit:
It is a precise description of the language. Precedence rules are not part of the grammatical formalism and can be difficult to reason about. In particular, there is no general way to prove that they do what is desired.
The grammar is self-contained. Ambiguous grammars can only be understood with the addition of the precedence rules. This is particularly important for grammars used in language standards, but it also affects any attempt to automatically build other syntax-based tools from the grammar.
Explicit grammars are more general. Not all operator restrictions can be easily described with numeric precedence comparisons.
Precedence rules can hide errors in the grammar, by incorrectly resolving a shift-reduce conflict elsewhere in the grammar which happens to use some of the same tokens. Since the resolved conflicts are not reported, the grammar writer is not warned of the problem.
On the other hand, precedence rules do have some advantages:
The precedence table compactly describes operator precedence, which is useful for quick reference.
The resulting grammar requires fewer unit productions, slightly increasing parse speed. (Usually not noticeable, but still...)
Some conflicts are much easier to resolve with precedence declarations, although understanding how the conflict is resolved may not be obvious. (The classic example is the dangling-else ambiguity.) Such cases have little or nothing to do with the intuitive understanding of operator precedence, so the use of precedence rules is a bit of a hack.
The total size of the grammar is not really affected by using precedence rules. As mentioned, the precedence rules avoid the need for unit productions, but every unit production corresponds to one precedence declaration so the total number of lines is the same. There are fewer non-terminals, but non-terminals cost little; the major annoyance in yacc/bison is declaring all the semantic types, but that is easy to automate.
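As a concrete illustration of the dangling-else case mentioned above, the usual Bison idiom is a fragment like the following, where LOWER_THAN_ELSE is a dummy token invented purely to carry a precedence below ELSE. The default shift resolution already produces the desired parse (the else binds to the nearest if), so the declarations mostly document that choice and silence the conflict warning:
%nonassoc LOWER_THAN_ELSE
%nonassoc ELSE
%%
statement
: IF '(' expression ')' statement %prec LOWER_THAN_ELSE
| IF '(' expression ')' statement ELSE statement
;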
I am trying to parse C. I have been consulting some context-free C grammars, and I have observed that they usually model expressions using "chained" production rules; for example, [here][1] something like this is done to model logical-or and logical-and expressions:
<logical-or-expression> ::= <logical-and-expression>
| <logical-or-expression> || <logical-and-expression>
<logical-and-expression> ::= <inclusive-or-expression>
| <logical-and-expression> && <inclusive-or-expression>
I say the expressions are chained because they follow this structure:
expression with operator(N) ::= expression with operator(N+1)
| (expression with operator(N)) operator(N) (expression with operator(N+1))
where N is the precedence of the operator.
I understand that the objective is to disambiguate the language and introduce precedence and associativity rules in a purely syntactic manner.
Is there any reason to model expressions like this in an actual parser with operator precedence support? My initial idea was to implement them simply as:
constant_expression ::= expression1 binary_op expression2
where binary_op is any binary operator, and then to disambiguate by setting the precedences of all the operators. For example:
logical_expr ::= simple_expr | logical_expr && logical_expr | logical_expr || logical_expr
and then set the precedence of the && operator higher than that of ||. I think this tactic would give a much simpler grammar, as it would eliminate the need for a separate rule at every level of precedence, but I am reluctant to use it because all the implementations I have seen use the former strategy, even in cases where the parser had precedence support.
Thanks in advance.
[1]: https://cs.wmich.edu/~gupta/teaching/cs4850/sumII06/The%20syntax%20of%20C%20in%20Backus-Naur%20form.htm
Many LR-style parser generators can handle operator precedence rules through a mechanism external to the grammar itself, in part because that lets you skip this “layering” approach to writing CFGs. If your parser generator supports this, it's fine to write an ambiguous grammar and then add those external rules to get the precedence and associativity right.
As a note: parsers for CFGs and BNF rules are usually insensitive to the order in which rules are written, so listing the operators from highest to lowest precedence is not, by itself, sufficient. (PEG parsers, on the other hand, do implement ordered choice.) Also, because of how most parser generators work (associating code to execute with each production, and using the terminals in a production to determine operator precedence), it's often easier to have separate rules, one per binary operator, than a single rule of the form “Expr Operator Expr”, as sketched below. But otherwise the basic approach is sound.
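To illustrate the separate-rules point with a yacc/bison-style sketch (token and rule names are invented): each alternative below inherits the precedence of its own operator token, whereas funnelling every operator through a single BINOP token class would leave all the alternatives sharing one precedence level unless each one were annotated separately with %prec.
%left OR
%left AND
%%
expr
: expr OR expr /* inherits the precedence of OR */
| expr AND expr /* inherits the precedence of AND */
| primary
;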
I'm currently implementing an LR(k) parser interpreter, just for fun.
I'm trying to implement precedence and associativity, and I got a little stuck on how to assign associativity and precedence to the 'action' part, i.e. what the precedence and associativity should be for the reduction.
If we have a production
E ->
| E + E { action1 }
| E * E { action2 }
| (E) { action3 }
| ID { action4 }
it should be quite clear that action1 should have the same associativity and precedence as +, and action2 the same as *. But in general we cannot assume that each alternative of a rule contains only one symbol that has a precedence. A toy example:
E -> E + E - E { action }
where - and + are arbitrary operators with some precedence and associativity. Should the action be associated with -, because it is the operator that precedes the last E?
I know the rules for choosing between shift and reduce; that is not what I am asking about.
The classic precedence algorithm, as implemented by yacc (and many derivatives), uses the last terminal in each production to define its default precedence. That's not always the desired precedence for the production, so parser generators typically also provide their users with a mechanism for explicitly specifying the precedence of a production.
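In yacc/bison, that explicit mechanism is the %prec modifier. The standard illustration is unary minus, where the default precedence (that of '-', the last terminal) is too low, so a dummy token is declared just to carry the right precedence (a sketch; UMINUS is an invented name):
%token ID
%left '-'
%left '*'
%right UMINUS
%%
expr
: expr '-' expr
| expr '*' expr
| '-' expr %prec UMINUS /* without %prec, this rule would get the low precedence of '-' */
| ID
;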
This precedence model has proven to be useful, and while it is not without its problems -- see below -- it is probably the best implementation for a simple parser generator, if only because its quirks are at least documented.
This convention has perpetuated the idea that precedence is a feature of terminals (or "operators"). That's valid if you're building an operator-precedence parser, but it does not correspond to LR(k) parsing. At best it's a crude approximation, which can be highly misleading.
If the underlying grammar really is an operator-precedence grammar -- that is, no production has two consecutive non-terminals, and the imputed precedence relationships are unambiguous -- then it might be an acceptable approximation, although it's worth noting that operator precedence relationships are not transitive, so they cannot usually be summarised as monotonic comparisons. But many uses of yacc-style precedence are well outside this envelope, and can even lead to serious grammar bugs.
The problem is that modelling precedence as a simple transitive comparison between tokens can lead to the precedence declarations being used to disambiguate (and thereby hide) unrelated conflicts. In short, the use of precedence declarations in LR parsing is basically a hack. It's a useful hack, and sometimes beneficial -- as you say, it can reduce the number of states and the frequency of unit reductions -- but it needs to be approached with caution.
Indeed, some people have proposed an alternative model of precedence based on grammar rewriting. (See, for example, the 2013 paper by Ali Afroozeh et al., "Safe Specification of Operator Precedence Rules".) This model is considerably more precise but, partly as a consequence of that precision, it is not as amenable to (mis)use for other purposes, such as the resolution of the dangling-else conflict.
I am trying out parser generators with Haskell, using Happy here. I used to use parser combinators before, such as Parsec, and one thing I can't achieve now is the dynamic addition (during execution) of new externally defined operators. For example, Haskell has some basic operators, but we can add more, giving them precedence and fixity. So I would like to know how to reproduce this with Happy, following the Haskell design (see the example code below to be parsed), whether it is even feasible, or whether it should perhaps be done through parser combinators instead.
-- Adding the new operator
infixr 5 ++
(++) :: [a] -> [a] -> [a]
[] ++ ys = ys
(x:xs) ++ ys = x : xs ++ ys
-- Using the new operator taking into consideration fixity and precedence during parsing
example = "Hello, " ++ "world!"
Haskell only allows a few precedence levels. So you don't strictly need a dynamic grammar; you could just write out the grammar using precedence-level token classes instead of individual operators, leaving the lexer with the problem of associating a given symbol with a given precedence level.
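A sketch of that grammar shape, Bison-style, with one token class per level-and-fixity combination (the token names are invented, and only two of Haskell's ten levels are shown):
%token ATOM LOP5 ROP5 LOP6
%%
expr5
: expr6
| expr5 LOP5 expr6 /* all operators declared infixl 5 */
| expr6 ROP5 expr5 /* all operators declared infixr 5 */
;
expr6
: ATOM
| expr6 LOP6 ATOM /* all operators declared infixl 6 */
;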
In effect, that moves the dynamic addition of operators into the lexer. That's a slightly uncomfortable design decision, although in some cases it may not be too difficult to implement. It's uncomfortable because it requires semantic feedback to the lexer: at a minimum, the lexer needs to consult the symbol table to figure out what kind of token it is looking at. In the case of Haskell, at least, this is made more awkward by the fact that fixity declarations are scoped, so in order to track fixity information the lexer would also need to understand the scoping rules.
In practice, most languages which allow program text to define operators and operator precedence work in precisely the same way the Haskell compiler does: expressions are parsed by the grammar into a simple list of items (where parenthesized subexpressions count as a single item), and in a later semantic analysis the list is rearranged into an actual tree taking into account precedence and associativity rules, using a simple version of the shunting yard algorithm. (It's a simple version because it doesn't need to deal with parenthesized subconstructs.)
There are several reasons for this design decision:
As mentioned above, for the lexer to figure out what the precedence of a symbol is (or even if the symbol is an operator with precedence) requires a close collaboration between the lexer and the parser, which many would say violates separation of concerns. Worse, it makes it difficult or impossible to use parsing technologies without a small fixed lookahead, such as GLR parsers.
Many languages have more precedence levels than Haskell. In some cases, even the number of precedence levels is not defined by the grammar. In Swift, for example, you can declare your own precedence levels, and you define a level not with a number but with a comparison to another previously defined level, leading to a partial order between precedence levels.
IMHO, that's actually a better design decision than Haskell's, partly because it avoids the ambiguity of a precedence level containing both left- and right-associative operators, but mainly because relative precedence declarations avoid magic numbers and allow the parser to flag ambiguous uses of operators from different modules. In other words, a precedence declaration is not forced to apply mechanically to every pair of totally unrelated operators; in this sense it makes operator declarations easier to compose.
The grammar is much simpler, and arguably easier to understand, since most people rely on precedence tables anyway rather than analysing grammar productions to figure out how operators interact with each other. In that sense, having precedence set by the grammar is more of a distraction than documentation. See the C++ grammar for a good example of why precedence tables are easier to read than grammars.
On the other hand, as the C++ grammar also illustrates, a grammar is a lot more general than simple precedence declarations because it can express asymmetric precedences. (The grammar doesn't always express these gracefully, but they can be expressed.) A classic example of an asymmetric precedence is a lambda construct (λ ID expr) which binds very loosely to the right and very tightly to the left: the expected parse of a ∘ λ b b ∘ a does not ever consult the associativity of ∘ because the λ comes between them.
In practice, there is very little cost to building the tree later. The algorithm to build the tree is well-known, simple and cheap.
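For concreteness, here is a minimal sketch of that rebuilding pass in C, assuming the grammar has already delivered a flat array of operand trees and the binary operators between them. All names are invented and the fixity table is a toy; a real implementation would consult the symbol table built from the fixity declarations:
#include <stdlib.h>

typedef struct Node {
    int op;                     /* operator symbol; 0 for a leaf */
    struct Node *lhs, *rhs;
} Node;

/* Toy fixity table: '*' binds tighter than '+'; ':' is right-associative. */
static int prec(int op)        { return op == '*' ? 7 : 6; }
static int right_assoc(int op) { return op == ':'; }

static Node *branch(int op, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    n->op = op; n->lhs = l; n->rhs = r;
    return n;
}

/* Input is items[0] ops[0] items[1] ... ops[n-1] items[n];
   returns the root of the rebuilt expression tree. */
Node *build_tree(Node **items, const int *ops, int n) {
    Node *out[64]; int opstk[64];   /* fixed depth keeps the sketch short */
    int nout = 0, nops = 0;

    out[nout++] = items[0];
    for (int i = 0; i < n; i++) {
        /* Reduce while the stacked operator binds at least as tightly;
           on equal precedence, associativity decides. */
        while (nops > 0 &&
               (prec(opstk[nops-1]) > prec(ops[i]) ||
                (prec(opstk[nops-1]) == prec(ops[i]) && !right_assoc(ops[i])))) {
            Node *r = out[--nout], *l = out[--nout];
            out[nout++] = branch(opstk[--nops], l, r);
        }
        opstk[nops++] = ops[i];
        out[nout++] = items[i + 1];
    }
    while (nops > 0) {              /* fold whatever remains */
        Node *r = out[--nout], *l = out[--nout];
        out[nout++] = branch(opstk[--nops], l, r);
    }
    return out[0];
}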
Yes, I'm one of those insane people who have a parser-generator project. Minimal-LR(1) with operator-precedence was fairly straightforward. GLR support is next, preferably without making a mess of the corner cases around precedence and associativity (P&A).
Suppose you have an R/R conflict between rules with different precedence levels. A deterministic parser can safely choose the (first) rule of highest precedence. A parser designed to handle local ambiguity might not be sure, especially if the involved rules reduce to different non-terminals.
Suppose you have an R/R conflict between rules with and without precedence characteristics. A deterministic parser can reasonably choose the former. If you ask for GLR, do you mean to entertain both, or should the former clearly dominate the latter? Or is this scenario sufficiently weird to justify rejecting the grammar?
Suppose you have an S/R/R conflict where only some of the participating rules have precedence, and maybe the lookahead token does or doesn't have precedence. If P&A is all about what to do in front of the lookahead, then a lookahead token without precedence should perhaps mean all options stay viable. But is that really the intended semantics here?
Suppose you have a nonassoc declaration on a terminal, and an S/R/R conflict where only ONE of the participating production rules hits the same non-associative precedence level. Then the other rule is clearly still viable to reduce, but what of the shift? Should we take it? What if we're mid-rule in a manner that doesn't trigger the same non-associativity problem? What if the look-ahead token is higher precedence than the remaining reduce, or the remaining reduce doesn't have precedence? How can we avoid accidentally constructing an invalid parse this way? Is there some trick with the parse-items to construct a shift-state that can't go wrong, or is this kind of thing beyond the scope of GLR parsing?
Also, how should semantic predicates interact with such ugly corner cases?
The simplest thing that might work is to treat anything involving operator precedence the same way a deterministic table-generator would. But is that the intended semantics? Or perhaps: what kinds of declarations might grammar authors want in order to exert control over these weird cases?
Traditional yacc-style precedence rules cannot be used to resolve reduce/reduce conflicts.
Yacc/bison "resolve" reduce/reduce conflicts by choosing the first production in the grammar file. This has nothing to do with precedence, and in the grammars where you would want to use a GLR parser, it is almost certainly not correct; you want the GLR parser to pursue all possible paths.
The bison GLR parser requires that ambiguity be resolved; that is, that the grammar be unambiguous. However, it has two "outs": first, it lets you use "dynamic precedence" declarations (which is a completely different concept, although it happens to use the same word); second, if that's not enough, it lets you provide your own resolution function.
Amongst other possibilities, a custom resolution function can accept both reductions, for example by inserting a branch in the AST. There are some theoretical issues with this approach for fully general parsing, but it works fine with real programming languages, which tend not to be ambiguous, or at least not very ambiguous.
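In Bison's GLR mode, that user-supplied hook is the %merge directive: when two parses reduce to the same non-terminal, the named function is called with both semantic values and decides what to keep. An abbreviated sketch (stmtMerge and make_ambiguous_node are names of my choosing; a complete file would also declare the function in the prologue):
%glr-parser
%%
stmt
: expr ';' %merge <stmtMerge>
| decl %merge <stmtMerge>
;
%%
static YYSTYPE stmtMerge(YYSTYPE x0, YYSTYPE x1) {
    /* For example, build a branch node recording both parses,
       to be resolved later by semantic analysis. */
    return make_ambiguous_node(x0, x1);
}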
A typical case for dynamic precedence is implementing a (textual) rule like C++'s §9.8/1:
There is an ambiguity in the grammar involving expression-statements and declarations: An expression-statement with a function-style explicit type conversion (8.2.3) as its leftmost subexpression can be indistinguishable from a declaration where the first declarator starts with a (. In those cases the statement is a declaration.
This rule cannot be expressed by a context-free grammar -- or, at least not in a way which would be readable -- but it is trivially expressible as a dynamic precedence rule.
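In Bison, that rule reads as a pair of %dprec declarations; this is essentially the example from the Bison manual, abbreviated (the higher value wins when both alternatives can reduce to stmt):
%glr-parser
%%
stmt
: expr ';' %dprec 1
| decl %dprec 2 /* "In those cases the statement is a declaration" */
;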
As its name implies, dynamic precedence is dynamic; it's a rule applied at parse time by the parser. Bison's GLR algorithm only applies these rules if forced to; the parser handles multiple possible reductions normally (by maintaining all of them as possibilities). It is forced to apply dynamic precedence only when both possible reductions in a reduce/reduce conflict reduce to the same non-terminal.
By contrast, the yacc precedence algorithm, which as I mentioned only resolves shift/reduce conflicts, is static: it is compiled at generation time into the parse automaton (in effect, by removing actions from the transition tables), so the parser no longer sees the conflict.
This algorithm has been (justifiably) criticised for a variety of reasons, one of which is the odd behaviour of non-associative declarations in corner cases. Also, precedence rules do not compose well; because they are not scoped, they might end up accidentally applying to productions for which they were not intended. Not infrequently, they facilitate grammar bugs by hiding a conflict which should have been resolved by the grammar writer.
Best practice, therefore, is to avoid corner cases :-) Static precedence should be restricted to its originally intended use cases: simple operator precedence and, possibly, documenting the "shift preferred" heuristic which resolves the dangling-else ambiguity and certain grouped-operator parses (IIRC, there's a good example of this in the Dragon book).
If you implement dynamic precedence -- and, honestly, there are good reasons not to -- then it should be applied only to simple, easily expressed rules like the C++ rule cited above: "if it looks like a declaration, it's a declaration." Even better would be to avoid writing ambiguous grammars; that particular C++ feature leads to the infamous "most vexing parse", which has probably at some point bitten everyone who has tried to write C++ programs.
So I've been trying to parse a Haskell-like language grammar with bison. I'll omit the standard problems with grammars and unary minus (like how to tell whether (-5) is the number -5 or the section \x->x-5, or whether a-b is a-(b) or apply a (-b), which itself could still be apply a \x->x-b, haha) and go straight to the thing that surprised me.
To simplify the whole thing to the point where it matters, consider the following situation:
expression: '(' expression ')'
| expression expression /* lambda application */
| '\\' IDENTIFIER "->" expression /* lambda abstraction */
| expression '+' expression /* some operators to play with */
| expression '*' expression
| IDENTIFIER | CONSTANT /* | ..... */
;
I solved all the shift/reduce conflicts with '+' and '*' using the %left and %right precedence declarations, but I somehow failed to find any good solution for setting the precedence of the expression expression lambda-application rule. I tried %precedence, %left, and the %prec marker as shown for example here: http://www.gnu.org/software/bison/manual/html_node/Non-Operators.html#Non-Operators, but it looks like bison completely ignores any precedence setting on this rule; at least, all the combinations I tried failed. Documentation on exactly this topic is pretty sparse, and the whole mechanism looks suited only for handling the "classic" expr OPER expr case.
Question: Am I doing something wrong, or is this impossible in Bison? If it's the latter, is it just unsupported, or is there some theoretical justification why not?
Remark: Of course there's an easy workaround that forces left-folding and precedence, which would look schematically like
expression: expression_without_lambda_application
| expression expression_without_lambda_application
;
expression_without_lambda_application: /* ..operators.. */
| '(' expression ')'
;
...but that's not as neat as it could be, right? :]
Thanks!
It's easiest to understand how bison precedence works if you understand how LR parsing works, since it's based on a simple modification of the LR algorithm. (Here I'm lumping together SLR, LALR and canonical LR, because the basic algorithm is the same.)
An LR(1) machine has two possible classes of action:
Reduce the right-hand side of the production which ends just before the lookahead token (and consequently is at the top of the stack).
Shift the lookahead token.
In an LR(1) grammar, the decision can always be made on the basis of the machine state and the lookahead token. But certain common constructs -- notably infix expressions -- seem to require grammars which are more complicated than they need to be, and which require more unit reductions than ought to be necessary.
In an era in which LR parsing was new, and most practitioners were used to some sort of operator precedence grammar (see below for definition), and in which cycles were a lot more expensive than they are now so that the extra unit reductions seemed annoying, the modification of the LR algorithm to use standard precedence techniques was attractive.
The modification -- which is based on a classic algorithm for parsing operator precedence grammars -- involves assigning a precedence value (an integer) to every right-hand side (i.e. every production) and to every terminal. Then, when constructing the LR machine, if a given state and lookahead can trigger either a shift or a reduce action, the conflict is resolved by comparing the precedence of the possible reduction with the precedence of the lookahead token. If the reduction has a higher precedence, it wins; otherwise the machine shifts.
Note that reduction precedences are never compared with each other, and neither are token precedences. They can actually come from different domains. Furthermore, for a simple expression grammar, intuitively the comparison is with the operator "at the top of the stack"; this is actually accomplished by using the right-most terminal in a production to assign the precedence of the production. To handle left vs. right associativity, we don't actually use the same precedence value for a production as for a terminal. Left-associative productions are given a precedence slightly higher than the terminal's precedence, and right-associative productions are given a precedence slightly lower. This could be done by making the terminal precedences multiples of 3 and the reduction precedences a value one greater or less than the terminal. (Actually in practice the comparison is > rather than ≥ so it's possible to use even numbers for terminals, but that's an implementation detail.)
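To make that encoding concrete, the resolution step at table-construction time amounts to something like the following (an illustrative sketch, not any particular generator's source):
/* Terminal precedences are multiples of 3. A left-associative
   production gets its last terminal's precedence plus one, a
   right-associative production gets it minus one, so a production
   and a terminal can never tie. */
enum action { SHIFT, REDUCE };

static int production_prec(int last_terminal_prec, int left_assoc) {
    return left_assoc ? last_terminal_prec + 1 : last_terminal_prec - 1;
}

static enum action resolve(int reduction_prec, int lookahead_prec) {
    return reduction_prec > lookahead_prec ? REDUCE : SHIFT;
}
With, say, '+' at precedence 3 and '*' at 6, a left-associative addition rule gets precedence 4, so the parser reduces when the lookahead is '+' (4 > 3, giving left association) and shifts when it is '*' (4 < 6, giving '*' the tighter binding).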
As it turns out, languages are not always quite so simple. So sometimes -- the case of unary operators is a classic example -- it's useful to explicitly provide a reduction precedence which differs from the default. (Another case is where the precedence is more closely related to the first terminal than to the last, when the production has more than one.)
Editorial note:
Really, this is all a hack. It's a good hack, and it can be useful. But like all hacks, it can be pushed too far. Intricate tricks with precedence which require a full understanding of the algorithm and a detailed analysis of the grammar are not, IMHO, elegant. They are confusing. The whole point of using a context-free-grammar formalism and a parser generator is to simplify the presentation of the grammar and make it easier to verify. /Editorial note.
An operator precedence grammar is an operator grammar which can be bottom-up parsed using only precedence relations (using an algorithm such as the classic "shunting-yard" algorithm). An operator grammar is a grammar in which no right-hand side has two consecutive non-terminals. And the production:
expression: expression expression
cannot be expressed in an operator grammar.
In that production, the shift-reduce conflict comes in the middle, just before where the operator would be if there were an operator. In that case, one would want to compare the precedence of whichever reduction gave rise to the first expression with the invisible operator which separates the two expressions.
In some circumstances (and this requires careful grammar analysis, and is consequently very fragile), it's possible to distinguish between terminals which could start an expression and terminals which could be operators. In that case, it would be possible to use the precedence of the terminals in the FIRST set of expression as the comparators in the precedence comparison. Since those terminals will never be used as the comparators in an operator production, no additional ambiguity is created.
Of course, that fails as soon as it is possible for a terminal to be either an infix or a prefix operator, such as unary minus. So it's probably only of theoretical interest in most languages.
In summary, I personally think that the solution of explicitly defining non-application expressions is clear, elegant and consistent with the theory of LR parsing, while any attempt to use precedence relations will turn out to be far less easy to understand and verify.
But, if you insist, here is a grammar which will work in this particular case (without unary operators), based on assigning precedence values to the tokens which might start an expression:
%token IDENTIFIER CONSTANT APPLY
%left '(' ')' '\\' IDENTIFIER CONSTANT APPLY
%left '+'
%left '*'
%%
expression: '(' expression ')'
| expression expression %prec APPLY
| '\\' IDENTIFIER "->" expression
| expression '+' expression
| expression '*' expression
| IDENTIFIER | CONSTANT
;