Is this a LL(1) grammar? - parsing

Considering the following grammar for propositional logic:
<A> ::= <B> <-> <A> | <B>
<B> ::= <C> -> <B> | <C>
<C> ::= <D> \/ <C> | <D>
<D> ::= <E> /\ <D> | <E>
<E> ::= <F> | -<F>
<F> ::= <G> | <H>
<G> ::= (<A>)
<H> ::= p | q | r | ... | z
Precedence for conectives is: -, /\, /, ->, <->.
Associativity is also considered, for example p\/q\/r should be the same as p\/(q\/r). The same for the other conectives.
I pretend to make a predictive top-down parser in java. I dont see here ambiguity or direct left recursion, but not sure if thats all i need to consider this a LL(1) grammar. Maybe undirect left recursion?
If this is not a LL(1) grammar, what would be the steps required to transform it for my intentions?

It's not LL(1). Here's why:
The first rule of an LL(1) grammar is:
A grammar G is LL(1) if and only if whenever A --> C | D are two distinct productions of G, the following conditions hold:
For no terminal a , do both C and D derive strings beginning with a.
This rule is, so that there are no conflicts while parsing this code. When the parser encounters a (, it won't know which production to use.
Your grammar violates this first rule. All your non-terminals on the right hand of the same production , that is, all your Cs and Ds, eventually reduce to G and H, so all of them derive at least one string beginning with (.

Related

How would I implement operator-precedence in my grammar?

I'm trying to make an expression parser and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features but right now I'm not sure how to make division precede multiplication, etc.. How should I modify my grammar to implement operator-precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
Here ()* means zero or more times. This grammar will produce right associative trees and it is deterministic and unambiguous. The multiplication and the division are with the same priority. The addition and subtraction also.

Parse grammar alternating and repeating

I was able to add support to my parser's grammar for alternating characters (e.g. ababa or baba) by following along with this question.
I'm now looking to extend that by allowing repeats of characters.
For example, I'd like to be able to support abaaabab and aababaaa as well. In my particular case, only the a is allowed to repeat but a solution that allows for repeating b's would also be useful.
Given the rules from the other question:
expr ::= A | B
A ::= "a" B | "a"
B ::= "b" A | "b"
... I tried extending it to support repeats, like so:
expr ::= A | B
# support 1 or more "a"
A_one_or_more = A_one_or_more "a" | "a"
A ::= A_one_or_more B | A_one_or_more
B ::= "b" A | "b"
... but that grammar is ambiguous. Is it possible for this to be made unambiguous, and if so could anyone help me disambiguate it?
I'm using the lemon parser which is an LALR(1) parser.
The point of parsing, in general, is to parse; that is, determine the syntactic structure of an input. That's significantly different from simply verifying that an input belongs to a language.
For example, the language which consists of arbitrary repetitions of a and b can be described with the regular expression (a|b)*, which can be written in BNF as
S ::= /* empty */ | S a | S b
But that probably does not capture the syntactic structure you are trying to defind. On the other hand, since you don't specify that structure, it is hard to know.
Here are a couple more possibilities, which build different parse trees:
S ::= E | S E
E ::= A b | E b
A ::= a | A a
S ::= E | S E
E ::= A B
A ::= a | A a
B ::= b | B b
When writing a grammar to parse a language, it is useful to start by drawing your proposed parse trees. Usually, you can write the grammar directly from the form of the trees, which shows that a formal grammar is primarily a documentation tool, since it clearly describes the language in a way that informal descriptions cannot. Using a parser generator to turn that grammar into a parser ensures that the parser implements the described language. Or, at least, that is the goal.
Here is a nice tool for checking your grammar online http://smlweb.cpsc.ucalgary.ca/start.html. It actually accepts the grammar you provided as a valid LALR(1) grammar.
A different LALR(1) grammar, that allows reapeating a's, would be:
S ::= "a" S | "a" | "b" A | "b"
A ::= "a" S .

Find an equivalent LR grammar

I am trying to find an LR(1) or LR(0) grammar for pascal. Here is a part of my grammar which is not LR(0) as it has shift/reduce conflict.
EXPR --> AEXPR | AEXPR realop AEXPR
AEXPR --> TERM | SIGN TERM | AEXPR addop TERM
TERM --> TERM mulop FACTOR | FACTOR
FACTOR --> id | num | ( EXPR )
SIGN --> + | -
(Uppercase words are variables and lowercase words, + , - are terminals)
As you see , EXPR --> AEXPR | AEXPR realop AEXPR cause a shift/reduce conflict on LR(0) parsing. I tried adding a new variable , and some other ways to find an equivalent LR (0) grammar for this, but I was not successful.
I have two problems.
First: Is this grammar a LR(1) grammar?
Second: Is it possible to find a LR(0) equivalent for this grammar? what about LR(1) equivalent?
Yes, your grammar is an LR(1) grammar. [see note below]
It is not just the first production which causes an LR(0) conflict. In an LR(0) grammar, you must be able to predict whether to shift or reduce (and which production to reduce) without consulting the lookahead symbol. That's a highly restrictive requirement.
Nonetheless, there is a grammar which will recognize the same language. It's not an equivalent grammar in the sense that it does not produce the same parse tree (or any useful parse tree), so it depends on what you consider equivalent.
EXPR → TERM | EXPR OP TERM
TERM → num | id | '(' EXPR ')' | addop TERM
OP → addop | mulop | realop
The above works by ignoring operator precedence; it regards an expression as simply the regular language TERM (op TERM)*. (I changed + | - to addop because I couldn't see how your scanner could work otherwise, but that's not significant.)
There is a transformation normally used to make LR(1) expression grammars suitable for LL(1) parsing, but since LL(1) is allowed to examine the lookahead character, it is able to handle operator precedence in a normal way. The LL(1) "equivalent" grammar does not produce a parse tree with the correct operator associativity -- all operators become right-associative -- but it is possible to recover the correct parse tree by a simple tree rotation.
In the case of the LR(0) grammar, where operator precedence has been lost, the tree transformation would be almost equivalent to reparsing the input, using something like the shunting yard algorithm to create the true parse tree.
Note
I don't believe the grammar presented is the correct grammar, because it makes unary plus and minus bind less tightly than multiplication, with the result that -3*4 is parsed as -(3*4). As it happens, there is no semantic difference most of the time, but it still feels wrong to me. I would have written the grammar as:
EXPR → AEXPR | AEXPR realop AEXPR
AEXPR → TERM | AEXPR addop TERM
TERM → FACTOR | TERM mulop FACTOR
FACTOR → num | id | '(' EXPR ')' | addop FACTOR
which makes unary operators bind more tightly. (As above, I assume that addop is precisely + or -.)

Practical solution to fix a Grammar Problem

We have little snippets of vb6 code (the only use a subset of features) that gets wirtten by non-programmers. These are called rules. For the people writing these they are hard to debug so somebody wrote a kind of add hoc parser to be able to evaluete the subexpressions and thereby show better where the problem is.
This addhoc parser is very bad and does not really work woll. So Im trying to write a real parser (because im writting it by hand (no parser generator I could understand with vb6 backends) I want to go with recursive decent parser). I had to reverse-engineer the grammer because I could find anything. (Eventully I found something http://www.notebar.com/GoldParserEngine.html but its LALR and its way bigger then i need)
Here is the grammer for the subset of VB.
<Rule> ::= expr rule | e
<Expr> ::= ( expr )
| Not_List CompareExpr <and_or> expr
| Not_List CompareExpr
<and_or> ::= Or | And
<Not_List> ::= Not Not_List | e
<CompareExpr> ::= ConcatExpr comp CompareExpr
|ConcatExpr
<ConcatExpr> ::= term term_tail & ConcatExpr
|term term_tail
<term> ::= factor factor_tail
<term_tail> ::= add_op term term_tail | e
<factor> ::= add_op Value | Value
<factor_tail> ::= multi_op factor factor_tail | e
<Value> ::= ConstExpr | function | expr
<ConstExpr> ::= <bool> | number | string | Nothing
<bool> ::= True | False
<Nothing> ::= Nothing | Null | Empty
<function> ::= id | id ( ) | id ( arg_list )
<arg_list> ::= expr , arg_list | expr
<add_op> ::= + | -
<multi_op> ::= * | /
<comp> ::= > | < | <= | => | =< | >= | = | <>
All in all it works pretty good here are some simple examples:
my_function(1, 2 , 3)
looks like
(Programm
(rule
(expr
(Not_List)
(CompareExpr
(ConcatExpr
(term
(factor
(value
(function
my_function
(arg_list
(expr
(Not_List)
(CompareExpr
(ConcatExpr (term (factor (value 1))) (term_tail))))
(arg_list
(expr
(Not_List)
(CompareExpr
(ConcatExpr (term (factor (value 2))) (term_tail))))
(arg_list
(expr
(Not_List)
(CompareExpr
(ConcatExpr (term (factor (value 3))) (term_tail))))
(arg_list))))))))
(term_tail))))
(rule)))
Now whats my problem?
if you have code that looks like this (( true OR false ) AND true) I have a infinit recursion but the real problem is that in the (true OR false) AND true (after the first ( expr ) ) is understood as only (true or false).
Here is the Parstree:
So how to solve this. Should I change the grammer somehow or use some implmentation hack?
Something hard exmplale in case you need it.
(( f1 OR f1 ) AND (( f3="ALL" OR f4="test" OR f5="ALL" OR f6="make" OR f9(1, 2) ) AND ( f7>1 OR f8>1 )) OR f8 <> "")
You have several issues that I see.
You are treating OR and AND as equal precedence operators. You should have separate rules for OR, and for AND. Otherwise you will the wrong precedence (therefore evaluation) for the expression A OR B AND C.
So as a first step, I'd revise your rules as follows:
<Expr> ::= ( expr )
| Not_List AndExpr Or Expr
| Not_List AndExpr
<AndExpr> ::=
| CompareExpr And AndExpr
| Not_List CompareExpr
Next problem is that you have ( expr ) at the top level of your list. What if I write:
A AND (B OR C)
To fix this, change these two rules:
<Expr> ::= Not_List AndExpr Or Expr
| Not_List AndExpr
<Value> ::= ConstExpr | function | ( expr )
I think your implementation of Not is not appropriate. Not is an operator,
just with one operand, so its "tree" should have a Not node and a child which
is the expression be Notted. What you have a list of Nots with no operands.
Try this instead:
<Expr> ::= AndExpr Or Expr
| AndExpr
<Value> ::= ConstExpr | function | ( expr ) | Not Value
I haven't looked, but I think VB6 expressions have other messy things in them.
If you notice, the style of Expr and AndExpr I have written use right recursion to avoid left recursion. You should change your Concat, Sum, and Factor rules to follow a similar style; what you have is pretty complicated and hard to follow.
If they are just creating snippets then perhaps VB5 is "good enough" for creating them. And if VB5 is good enough, the free VB5 Control Creation Edition might be worth tracking down for them to use:
http://www.thevbzone.com/vbcce.htm
You could have them start from a "test harness" project they add snippets to, and they can even test them out.
With a little orientation this will probably prove much more practical than hand crafting a syntax analyzer, and a lot more useful since they can test for more than correct syntax.
Where VB5 is lacking you might include a static module in the "test harness" that provides a rough and ready equivalent of Split(), Replace(), etc:
http://support.microsoft.com/kb/188007

Relation between grammar and operator associativity

Some compiler books / articles / papers talk about design of a grammar and the relation of its operator's associativity. I'm a big fan of top-down, especially recursive descent, parsers and so far most (if not all) compilers I've written use the following expression grammar:
Expr ::= Term { ( "+" | "-" ) Term }
Term ::= Factor { ( "*" | "/" ) Factor }
Factor ::= INTEGER | "(" Expr ")"
which is an EBNF representation of this BNF:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor = INTEGER | "(" Expr ")"
According to what I read, some regards this grammar as being "wrong" due to the change of operator associativity (left to right for those 4 operators) proven by the growing parse tree to the right instead of left. For a parser implemented through attribute grammar, this might be true as l-attribute value requires that this value created first then passed to child nodes. however, when implementing with normal recursive descent parser, it's up to me whether to construct this node first then pass to child nodes (top-down) or let child nodes be created first then add the returned value as the children of this node (passed in this node's constructor) (bottom-up). There should be something I miss here because I don't agree with the statement saying this grammar is "wrong" and this grammar has been used in many languages esp. Wirthian ones. Usually (or all?) the reading that says it promotes LR parsing instead of LL.
I think the issue here is that a language has an abstract syntax which is just like:
E ::= E + E | E - E | E * E | E / E | Int | (E)
but this is actually implemented via a concrete syntax which is used to specify associativity and precedence. So, if you're writing a recursive decent parse, you're implicitly writing the concrete syntax into it as you go along and that's fine, though it may be good to specify it exactly as a phrase-structured grammar as well!
There are a couple of issues with your grammar if it is to be a fully-fledged concrete grammar. First of all, you need to add productions to just 'go to the next level down', so relaxing your syntax a bit:
Expr ::= Term + Term | Term - Term | Term
Term ::= Factor * Factor | Factor / Factor | Factor
Factor ::= INTEGER | (Expr)
Otherwise there's no way to derive valid sentences starting from the start symbol (in this case Expr). For example, how would you derive '1 * 2' without those extra productions?
Expr -> Term
-> Factor * Factor
-> 1 * Factor
-> 1 * 2
We can see the other grammar handles this in a slightly different way:
Expr -> Term Expr'
-> Factor Term' Expr'
-> 1 Term' Expr'
-> 1 * Factor Term' Expr'
-> 1 * 2 Term' Expr'
-> 1 * 2 ε Expr'
-> 1 * 2 ε ε
= 1 * 2
but this achieves the same effect.
Your parser is actually non-associative. To see this ask how E + E + E would be parsed and find that it couldn't. Whichever + is consumed first, we get E on one side and E + E on the other, but then we're trying to parse E + E as a Term which is not possible. Equivalently, think about deriving that expression from the start symbol, again not possible.
Expr -> Term + Term
-> ? (can't get another + in here)
The other grammar is left-associative ebcase an arbitrarily long sting of E + E + ... + E can be derived.
So anyway, to sum up, you're right that when writing the RDP, you can implement whatever concrete version of the abstract syntax you like and you probably know a lot more about that than me. But there are these issues when trying to produce the grammar which describes your RDP precisely. Hope that helps!
To get associative trees, you really need to have the trees formed with the operator as the subtree root node, with children having similar roots.
Your implementation grammar:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
must make that awkward; if you implement recursive descent on this, the Expr' routine has no access to the "left child" and so can't build the tree. You can always patch this up by passing around pieces (in this case, passing tree parts up the recursion) but that just seems awkward. You could have chosen this instead as a grammar:
Expr ::= Term ( ("+"|"-") Term )*;
Term ::= Factor ( ( "*" | "/" ) Factor )* ;
Factor ::= INTEGER | "(" Expr ")"
which is just as easy (easier?) to code recursive descent-wise, but now you can form the trees you need without trouble.
This doesn't really get you associativity; it just shapes the trees so that it could be allowed. Associativity means that the tree ( + (+ a b) c) means the same thing as (+ a (+ b c)); its actually a semantic property (sure doesn't work for "-" but the grammar as posed can't distinguish).
We have a tool (the DMS Software Reengineering Toolkit) that includes parsers and term-rewriting (using source-to-source transformations) in which the associativity is explicitly expressed. We'd write your grammar:
Expr ::= Term ;
[Associative Commutative] Expr ::= Expr "+" Term ;
Expr ::= Expr "-" Term ;
Term ::= Factor ;
[Associative Commutative] Term ::= Term "*" Factor ;
Term ::= Term "/" Factor ;
Factor ::= INTEGER ;
Factor ::= "(" Expr ")" ;
The grammar seems longer and clumsier this way, but it in fact allows us to break out the special cases and mark them as needed. In particular, we can now distinguish operators that are associative from those that are not, and mark them accordingly. With that semantic marking, our tree-rewrite engine automatically accounts for associativity and commutativity. You can see a full example of such DMS rules being used to symbolically simplify high-school algebra using explicit rewrite rules over a typical expression grammar that don't have to account for such semantic properties. That is built into the rewrite engine.

Resources