Disallowing unnecessary outermost brackets in a BNFC grammar

This is a continuation of a question I asked earlier about a BNFC grammar for propositional logic. I got it working with parentheses, as per the definition, but I would now like to extend the grammar to work without parentheses, with one catch: no unnecessary outer parentheses allowed.
For example, the atomic sentence a should be allowed, but (a) should not be recognized. The sentence (a => b) & c should also be allowed, but not ((a => b) & c), and so forth. The last example highlights the necessity of parentheses. The precedence levels are
equivalence <=> and implication =>,
conjunction & and disjunction |,
negation -, and
atoms.
The higher the level, the tighter it binds.
I got the grammar working (unnecessary parentheses and all) by setting precedence levels for the different operators via recursion:
IFF . L ::= L "<=>" L1 ;
IF . L ::= L "=>" L1 ;
AND . L1 ::= L1 "&" L2 ;
OR . L1 ::= L1 "|" L2 ;
NOT . L2 ::= "-" L3 ;
NOT2 . L2 ::= "-" L2 ;
P . L3 ::= Ident ;
_ . L ::= L1 ;
_ . L1 ::= L2 ;
_ . L2 ::= L3 ;
_ . L3 ::= "(" L ")" ;
Now the question is: how do I disallow the outer parentheses, which are permitted by the last rule, L3 ::= "(" L ")"? That rule is strictly necessary for allowing parentheses inside an expression, but it also allows them around the whole expression. I guess I need some extra rule for removing the ambiguity, but what might that look like?
This grammar also results in about 6 reduce/reduce conflicts, but aren't those pretty much inevitable in recursive definitions?

You can do this by simply banning the parenthesised form from the toplevel. This requires writing the precedence hierarchy in a different fashion, in order to propagate the restriction through the hierarchy. In the following, the r suffix indicates that the production is "restricted" to not be a parenthesised form.
I also fixed the reduce/reduce conflicts by eliminating one of the NOT productions. See below.
(I hope I got the BNFC right. I wrote this in bison and tried to convert the syntax afterwards.)
_ . S ::= L0r ;
IFF . L0r ::= L0 "<=>" L1 ;
IF . L0r ::= L0 "=>" L1 ;
AND . L1r ::= L1 "&" L2 ;
OR . L1r ::= L1 "|" L2 ;
NOT . L2r ::= "-" L2 ;
ID . L2r ::= Ident ;
PAREN . L3 ::= "(" L0 ")" ;
_ . L0r ::= L1r ;
_ . L1r ::= L2r ;
_ . L0 ::= L0r ;
_ . L1 ::= L1r ;
_ . L2 ::= L2r ;
_ . L0 ::= L3 ;
_ . L1 ::= L3 ;
_ . L2 ::= L3 ;
(Edit: I changed the IFF, IF, AND and OR rules by removing the restriction (r) from their first arguments. That allows those rules to match expressions whose first operand starts with a parenthesis, without the whole expression having to match the PAREN syntax.)
If you also wanted to disallow redundant internal parentheses (like ((a & b))), you could change the PAREN rule to
PAREN . L3 ::= "(" L0r ")" ;
which would make the L0 rule unnecessary.
A variant approach which uses fewer unit productions can be found in Ira Baxter's answer to Grammar for expressions which disallows outer parentheses.
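If you want to sanity-check the restriction outside the grammar, note that it is equivalent to a simple procedural test: a formula has redundant outer parentheses exactly when its first token is ( and the matching ) is its last token. A minimal Python sketch of that check (the token-list representation is my own assumption):

def has_redundant_outer_parens(tokens):
    # True iff the first token is "(" and its matching ")" is the last token.
    if not tokens or tokens[0] != "(":
        return False
    depth = 0
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth == 0:                  # found the match for tokens[0]
                return i == len(tokens) - 1
    return False                            # unbalanced input

print(has_redundant_outer_parens(["(", "a", "=>", "b", ")"]))           # True: reject
print(has_redundant_outer_parens(["(", "a", "=>", "b", ")", "&", "c"])) # False: accept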
Side note:
This grammar also results in about 6 reduce/reduce conflicts, but aren't those pretty much inevitable in recursive definitions?
No, recursive grammars can and should be unambiguous. Reduce/reduce conflicts are not inevitable, and almost always indicate problems in the grammar. In this case, they are the result of the redundant productions for the unary NOT operator. Having two different productions which can both derive "-" L3 is obviously going to lead to an ambiguity, and ambiguities always produce parsing conflicts.

Related

How would I implement operator-precedence in my grammar?

I'm trying to make an expression parser, and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar, which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features, but right now I'm not sure how to make division precede multiplication, etc. How should I modify my grammar to implement operator-precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
Here ( ... )* means zero or more repetitions. This grammar is deterministic and unambiguous, and the repetition itself is neutral about associativity: whether the trees you build lean left or right depends on how you fold the loop. Multiplication and division have the same precedence, as do addition and subtraction.
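To make that concrete, here is a minimal recursive-descent sketch of this grammar in Python (the tokenizer and the evaluate-as-you-parse approach are my own assumptions, not part of the answer). Because each loop folds its result as it goes, the operators come out left-associative, and the example from the question evaluates to 11:

import re

def tokenize(src):
    return re.findall(r"\d+|[-+*/()]", src)

class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0
    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None
    def next(self):
        tok = self.toks[self.i]
        self.i += 1
        return tok
    def exp(self):                       # exp ::= add
        return self.add()
    def add(self):                       # add ::= mul (("+" | "-") mul)*
        value = self.mul()
        while self.peek() in ("+", "-"):
            op = self.next()
            rhs = self.mul()
            value = value + rhs if op == "+" else value - rhs
        return value
    def mul(self):                       # mul ::= term (("*" | "/") term)*
        value = self.term()
        while self.peek() in ("*", "/"):
            op = self.next()
            rhs = self.term()
            value = value * rhs if op == "*" else value / rhs
        return value
    def term(self):                      # term ::= number | "(" exp ")"
        if self.peek() == "(":
            self.next()
            value = self.exp()
            self.next()                  # consume ")"
            return value
        return int(self.next())

print(Parser(tokenize("1 + 2 * 3 + 4")).exp())   # 11, not 15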

Definition of Shift/Reduce conflict in LR(1) parsing

Is it correct to state:
" A shift reduce conflict occurs in an LR(1) parser if and only if there exist items:
A -> alpha .
A -> alpha . beta
such that Follow(A) is not disjoint from First(beta).
where A is a non-terminal, alpha and beta are possibly empty sequences of grammar symbols."
(Intuitively this is because there is no way to determine, based on the top of the stack and the lookahead, whether it is appropriate to shift or reduce)
Note: if you think this depends on any specifics of the definition of FIRST/FOLLOW, comment and I will provide them.
No, that statement is not correct.
Suppose in some grammar:
the productions include both:
A → α
A → α β
FOLLOW(A) ∩ FIRST(β) ≠ ∅
That is not quite sufficient for the grammar to have a shift-reduce conflict, because it also requires that the non-terminal A be reachable in a context in which a symbol from FOLLOW(A) ∩ FIRST(β) can actually appear as the lookahead.
So only if we require the grammar to be reduced, or at least to not contain any unreachable or useless productions, is the above sufficient to generate a shift-reduce conflict.
However, the above conditions are not necessary because there is no requirement that the shift and the reduction apply to the same non-terminal, or even "related" non-terminals. Consider the following simple grammar:
prog → stmt
prog → prog stmt
stmt → expr
stmt → func
expr → ID
expr → '(' expr ')'
func → ID '(' ')' stmt
That grammar (which is not ambiguous, as it happens) has a shift-reduce conflict in the state containing the items expr → ID · and func → ID · '(' ')' stmt: with ( as the lookahead, ID could be reduced to expr (and then to stmt), or the ( could be shifted as part of func → ID '(' ')' stmt.
Although it's a side point, it's worth noting that the FOLLOW set is only used in the construction of SLR(k) parsers. The canonical LR(k) construction -- and even the LALR(k) construction -- will successfully generate parsers for grammars in which using the FOLLOW set instead of a full lookahead computation would indicate a (non-existent) shift-reduce conflict. The classic example, taken from the Dragon book (example 4.39 in my copy), edited with slightly more meaningful non-terminal names:
stmt → lvalue '=' rvalue
stmt → rvalue
lvalue → '*' rvalue
lvalue → ID
rvalue → lvalue
Here, FOLLOW(rvalue) is {=, $}, and as a result in the state {stmt → lvalue · '=' rvalue, rvalue → lvalue ·}, it appears that a reduction to rvalue is possible, leading to a false shift-reduce conflict.
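To check this, here is a small fixed-point computation of the FOLLOW sets for that grammar, a sketch under the usual textbook definitions (the Python encoding of the grammar is my own; there are no ε-productions, which keeps FIRST trivial):

grammar = {
    "stmt":   [["lvalue", "=", "rvalue"], ["rvalue"]],
    "lvalue": [["*", "rvalue"], ["ID"]],
    "rvalue": [["lvalue"]],
}
nonterminals = set(grammar)

def first(sym):
    # With no ε-productions, FIRST of a non-terminal is the union of FIRST
    # of the leading symbol of each of its productions.
    if sym not in nonterminals:
        return {sym}
    return set().union(*(first(prod[0]) for prod in grammar[sym]))

follow = {nt: set() for nt in nonterminals}
follow["stmt"].add("$")                  # stmt is the start symbol
changed = True
while changed:
    changed = False
    for lhs, prods in grammar.items():
        for prod in prods:
            for i, sym in enumerate(prod):
                if sym not in nonterminals:
                    continue
                new = first(prod[i + 1]) if i + 1 < len(prod) else follow[lhs]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True

print(sorted(follow["rvalue"]))   # ['$', '='] -- the '=' is what misleads SLR

An SLR parser consults that FOLLOW set, sees '=', and reports a conflict in the state above; a canonical LR(1) parser computes the actual lookahead for the rvalue → lvalue reduction there, which is only $.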

Find an equivalent LR grammar

I am trying to find an LR(1) or LR(0) grammar for Pascal. Here is a part of my grammar which is not LR(0), as it has a shift/reduce conflict.
EXPR --> AEXPR | AEXPR realop AEXPR
AEXPR --> TERM | SIGN TERM | AEXPR addop TERM
TERM --> TERM mulop FACTOR | FACTOR
FACTOR --> id | num | ( EXPR )
SIGN --> + | -
(Uppercase words are variables and lowercase words, + , - are terminals)
As you see, EXPR --> AEXPR | AEXPR realop AEXPR causes a shift/reduce conflict in LR(0) parsing. I tried adding a new variable and some other approaches to find an equivalent LR(0) grammar, but I was not successful.
I have two problems.
First: Is this grammar an LR(1) grammar?
Second: Is it possible to find an LR(0) equivalent for this grammar? What about an LR(1) equivalent?
Yes, your grammar is an LR(1) grammar. [see note below]
It is not just the first production which causes an LR(0) conflict. In an LR(0) grammar, you must be able to predict whether to shift or reduce (and which production to reduce) without consulting the lookahead symbol. That's a highly restrictive requirement.
Nonetheless, there is a grammar which will recognize the same language. It's not an equivalent grammar in the sense that it does not produce the same parse tree (or any useful parse tree), so it depends on what you consider equivalent.
EXPR → TERM | EXPR OP TERM
TERM → num | id | '(' EXPR ')' | addop TERM
OP → addop | mulop | realop
The above works by ignoring operator precedence; it regards an expression as simply the regular language TERM (op TERM)*. (I changed + | - to addop because I couldn't see how your scanner could work otherwise, but that's not significant.)
There is a transformation normally used to make LR(1) expression grammars suitable for LL(1) parsing, but since LL(1) is allowed to examine the lookahead character, it is able to handle operator precedence in a normal way. The LL(1) "equivalent" grammar does not produce a parse tree with the correct operator associativity -- all operators become right-associative -- but it is possible to recover the correct parse tree by a simple tree rotation.
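To illustrate, here is a sketch of such a rotation on tuple-shaped trees (the tuple representation and the precedence table are my own assumptions): it turns the right-leaning tree an LL(1) parser builds for a - b - c into the left-associative one, while leaving subtrees at other precedence levels alone.

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def rotate(tree):
    # Rebuild a right-leaning chain of same-precedence operators as a
    # left-leaning one; recurse into subtrees at other precedence levels.
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    left = rotate(left)
    while isinstance(right, tuple) and PREC[right[0]] == PREC[op]:
        rop, rleft, rright = right
        left = (op, left, rotate(rleft))
        op, right = rop, rright
    return (op, left, rotate(right))

print(rotate(("-", "a", ("-", "b", "c"))))
# ('-', ('-', 'a', 'b'), 'c')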
In the case of the LR(0) grammar, where operator precedence has been lost, the tree transformation would be almost equivalent to reparsing the input, using something like the shunting yard algorithm to create the true parse tree.
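For concreteness, a hedged sketch of that reparsing step: the LR(0) grammar leaves you with a flat TERM (op TERM)* sequence, and a shunting-yard pass can rebuild the precedence-correct tree from it (the operator names and precedence table below are my own assumptions):

PREC = {"realop": 1, "addop": 2, "mulop": 3}

def rebuild(seq):
    # seq alternates operands and operator kinds,
    # e.g. ["a", "addop", "b", "mulop", "c"].
    operands, operators = [], []
    def reduce_top():
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))
    for i, item in enumerate(seq):
        if i % 2 == 0:
            operands.append(item)
        else:
            # Left-associative: reduce while the stacked operator binds at
            # least as tightly as the incoming one.
            while operators and PREC[operators[-1]] >= PREC[item]:
                reduce_top()
            operators.append(item)
    while operators:
        reduce_top()
    return operands[0]

print(rebuild(["a", "addop", "b", "mulop", "c"]))
# ('addop', 'a', ('mulop', 'b', 'c'))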
Note
I don't believe the grammar presented is the correct grammar, because it makes unary plus and minus bind less tightly than multiplication, with the result that -3*4 is parsed as -(3*4). As it happens, there is no semantic difference most of the time, but it still feels wrong to me. I would have written the grammar as:
EXPR → AEXPR | AEXPR realop AEXPR
AEXPR → TERM | AEXPR addop TERM
TERM → FACTOR | TERM mulop FACTOR
FACTOR → num | id | '(' EXPR ')' | addop FACTOR
which makes unary operators bind more tightly. (As above, I assume that addop is precisely + or -.)

Is this a LL(1) grammar?

Consider the following grammar for propositional logic:
<A> ::= <B> <-> <A> | <B>
<B> ::= <C> -> <B> | <C>
<C> ::= <D> \/ <C> | <D>
<D> ::= <E> /\ <D> | <E>
<E> ::= <F> | -<F>
<F> ::= <G> | <H>
<G> ::= (<A>)
<H> ::= p | q | r | ... | z
Precedence for the connectives is: -, /\, \/, ->, <->.
Associativity is also considered; for example, p\/q\/r should be the same as p\/(q\/r), and the same for the other connectives.
I intend to write a predictive top-down parser in Java. I don't see ambiguity or direct left recursion here, but I'm not sure that's all I need to check for this to be an LL(1) grammar. Maybe indirect left recursion?
If this is not an LL(1) grammar, what steps would be required to transform it for my purposes?
It's not LL(1). Here's why:
The first rule of an LL(1) grammar is:
A grammar G is LL(1) if and only if whenever A --> C | D are two distinct productions of G, the following conditions hold:
For no terminal a do both C and D derive strings beginning with a.
This rule exists so that there are no conflicts while parsing: the parser must always be able to pick a production from the next token alone.
Your grammar violates this rule. In each of your rules, both alternatives (the C and D above) eventually derive <G> and <H>, so both of them can derive strings beginning with (. When the parser encounters a (, it won't know which production to use.
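For the transformation itself, the standard step is to left-factor the common prefixes, introducing one "tail" non-terminal per precedence level (the primed names below are my own, and ε denotes the empty string); this removes the FIRST/FIRST conflicts while keeping your right-associativity, since each tail recurses to the right:

<A>  ::= <B> <A'>
<A'> ::= <-> <A> | ε
<B>  ::= <C> <B'>
<B'> ::= -> <B> | ε
<C>  ::= <D> <C'>
<C'> ::= \/ <C> | ε
<D>  ::= <E> <D'>
<D'> ::= /\ <D> | ε

The <E>, <F>, <G> and <H> rules can stay as they are: their alternatives already start with distinct terminals.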

Relation between grammar and operator associativity

Some compiler books / articles / papers talk about design of a grammar and the relation of its operator's associativity. I'm a big fan of top-down, especially recursive descent, parsers and so far most (if not all) compilers I've written use the following expression grammar:
Expr ::= Term { ( "+" | "-" ) Term }
Term ::= Factor { ( "*" | "/" ) Factor }
Factor ::= INTEGER | "(" Expr ")"
which is an EBNF representation of this BNF:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
According to what I've read, some regard this grammar as "wrong" because it changes the operators' associativity (left-to-right for those four operators), as shown by the parse tree growing to the right instead of the left. For a parser implemented through an attribute grammar this might be true, since an L-attributed value requires that the value be created first and then passed to child nodes. However, when implementing a normal recursive descent parser, it's up to me whether to construct this node first and then pass it to the child nodes (top-down), or to let the child nodes be created first and then add the returned values as children of this node, passed in its constructor (bottom-up). There must be something I'm missing here, because I don't agree with the claim that this grammar is "wrong", and this grammar has been used in many languages, esp. Wirthian ones. Usually the readings that call it wrong promote LR parsing instead of LL.
I think the issue here is that a language has an abstract syntax which is just like:
E ::= E + E | E - E | E * E | E / E | Int | (E)
but this is actually implemented via a concrete syntax which is used to specify associativity and precedence. So, if you're writing a recursive descent parser, you're implicitly writing the concrete syntax into it as you go along, and that's fine, though it may be good to specify it exactly as a phrase-structured grammar as well!
There are a couple of issues with your grammar if it is to be a fully-fledged concrete grammar. First of all, you need to add productions to just 'go to the next level down', so relaxing your syntax a bit:
Expr ::= Term + Term | Term - Term | Term
Term ::= Factor * Factor | Factor / Factor | Factor
Factor ::= INTEGER | (Expr)
Otherwise there's no way to derive valid sentences starting from the start symbol (in this case Expr). For example, how would you derive '1 * 2' without those extra productions?
Expr -> Term
-> Factor * Factor
-> 1 * Factor
-> 1 * 2
We can see the other grammar handles this in a slightly different way:
Expr -> Term Expr'
-> Factor Term' Expr'
-> 1 Term' Expr'
-> 1 * Factor Term' Expr'
-> 1 * 2 Term' Expr'
-> 1 * 2 ε Expr'
-> 1 * 2 ε ε
= 1 * 2
but this achieves the same effect.
Your parser is actually non-associative. To see this, ask how E + E + E would be parsed, and find that it couldn't be. Whichever + is consumed first, we get E on one side and E + E on the other, but then we're trying to parse E + E as a Term, which is not possible. Equivalently, think about deriving that expression from the start symbol; again, not possible.
Expr -> Term + Term
-> ? (can't get another + in here)
The other grammar, by contrast, can derive an arbitrarily long string E + E + ... + E. (As written it is right-recursive, so the trees it produces lean to the right; an implementation can still fold them into left-associative semantics.)
So anyway, to sum up, you're right that when writing the RDP, you can implement whatever concrete version of the abstract syntax you like and you probably know a lot more about that than me. But there are these issues when trying to produce the grammar which describes your RDP precisely. Hope that helps!
To get associative trees, you really need to have the trees formed with the operator as the subtree root node, with children having similar roots.
Your implementation grammar:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
must make that awkward; if you implement recursive descent on this, the Expr' routine has no access to the "left child" and so can't build the tree. You can always patch this up by passing around pieces (in this case, passing tree parts up the recursion) but that just seems awkward. You could have chosen this instead as a grammar:
Expr ::= Term ( ("+"|"-") Term )*;
Term ::= Factor ( ( "*" | "/" ) Factor )* ;
Factor ::= INTEGER | "(" Expr ")"
which is just as easy (easier?) to code recursive descent-wise, but now you can form the trees you need without trouble.
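A minimal Python sketch of that point (paring Term down to a bare integer so the snippet stays self-contained): because the loop keeps hold of the tree built so far, each operator node takes it as its left child, and left-associative trees fall out naturally.

def parse_expr(toks):
    # Expr ::= Term ( ("+" | "-") Term )* ; Term is just an integer here.
    node = int(toks.pop(0))
    while toks and toks[0] in ("+", "-"):
        op = toks.pop(0)
        node = (op, node, int(toks.pop(0)))   # tree so far becomes the left child
    return node

print(parse_expr("1 - 2 - 3".split()))
# ('-', ('-', 1, 2), 3)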
This doesn't really get you associativity; it just shapes the trees so that it could be allowed. Associativity means that the tree (+ (+ a b) c) means the same thing as (+ a (+ b c)); it's actually a semantic property (it certainly doesn't hold for "-", but the grammar as posed can't distinguish).
We have a tool (the DMS Software Reengineering Toolkit) that includes parsers and term-rewriting (using source-to-source transformations) in which the associativity is explicitly expressed. We'd write your grammar:
Expr ::= Term ;
[Associative Commutative] Expr ::= Expr "+" Term ;
Expr ::= Expr "-" Term ;
Term ::= Factor ;
[Associative Commutative] Term ::= Term "*" Factor ;
Term ::= Term "/" Factor ;
Factor ::= INTEGER ;
Factor ::= "(" Expr ")" ;
The grammar seems longer and clumsier this way, but it in fact allows us to break out the special cases and mark them as needed. In particular, we can now distinguish operators that are associative from those that are not, and mark them accordingly. With that semantic marking, our tree-rewrite engine automatically accounts for associativity and commutativity. You can see a full example of such DMS rules being used to symbolically simplify high-school algebra: the explicit rewrite rules are written over a typical expression grammar and don't have to account for such semantic properties, because that is built into the rewrite engine.
