How to build a parser in Haskell

data Expr = ExprNum Double    -- constants
          | ExprVar String    -- variables
          | ExprAdd Expr Expr
          | ExprSub Expr Expr
          | ExprNeg Expr      -- the unary '-' operator
          | ExprMul Expr Expr
          | ExprDiv Expr Expr
          deriving Show
This is my user-defined data type. I want to parse arithmetic expressions such as (2 + 3 * 4 - x) into this data type without using buildExpressionParser. What can I do?
Please help. It should handle operator precedence.

Suppose we want to build an addsub-level parser. Ignoring the values actually returned and focusing on the raw parsing, we'd like to say:
addsub = muldiv >> oneOf "+-" >> muldiv
This doesn't really work, since it accepts exactly one operator. But we can rewrite it to allow repetition:
addsub = muldiv >> addsub'
addsub' = many $ oneOf "+-" >> muldiv
Here we assume muldiv is a parser for just multiplication and division, which you can write in a similar manner.
That is, instead of using the left-recursive grammar
addsub = addsub (+-) muldiv | muldiv
we use a slightly more complicated one that Parsec can actually handle:
addsub = muldiv addsub'
addsub' = (+-) muldiv addsub' | Nothing
The latter can of course be refactored into a many, which gives us a list of expressions to add. You can then fold that list into whatever form you want, such as (Add (Add a1 a2) a3).
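To make that concrete, here is a minimal sketch of the many-plus-fold approach that also returns values, assuming Parsec (Text.Parsec) is available. The helper names lexeme, unary, atom and parseExpr are made up for this example, and the Eq instance is added only so results can be compared.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Expr = ExprNum Double
          | ExprVar String
          | ExprAdd Expr Expr
          | ExprSub Expr Expr
          | ExprNeg Expr
          | ExprMul Expr Expr
          | ExprDiv Expr Expr
          deriving (Show, Eq)

-- Hypothetical helper: run a parser, then skip trailing whitespace.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- addsub = muldiv addsub', with addsub' written as a `many`.
-- The left fold makes + and - left-associative.
addsub :: Parser Expr
addsub = do
  first <- muldiv
  rest  <- many ((,) <$> lexeme (oneOf "+-") <*> muldiv)
  pure (foldl step first rest)
  where
    step l ('+', r) = ExprAdd l r
    step l (_,   r) = ExprSub l r

-- Same shape one precedence level down.
muldiv :: Parser Expr
muldiv = do
  first <- unary
  rest  <- many ((,) <$> lexeme (oneOf "*/") <*> unary)
  pure (foldl step first rest)
  where
    step l ('*', r) = ExprMul l r
    step l (_,   r) = ExprDiv l r

-- Unary minus binds tighter than the binary operators.
unary :: Parser Expr
unary = (ExprNeg <$> (lexeme (char '-') *> unary)) <|> atom

-- Only unsigned integer literals and alphabetic names, to keep the sketch short.
atom :: Parser Expr
atom = number <|> variable <|> parens
  where
    number   = lexeme (ExprNum . read <$> many1 digit)
    variable = lexeme (ExprVar <$> many1 letter)
    parens   = lexeme (char '(') *> addsub <* lexeme (char ')')

parseExpr :: String -> Either ParseError Expr
parseExpr = parse (spaces *> addsub <* eof) ""
```

With these definitions, parseExpr "2+3 *4 - x" should nest the multiplication under the addition and the addition under the subtraction, because each level only hands complete operands of the tighter level to its fold.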

Related

Adding additional parameters to data constructors using infix operators

I have written a data constructor like
data Expr = IntL Integer | Expr :*: Expr
and would like to annotate it with extra constructor parameters (such as positional information) like this:
data Expr = IntL Integer Pos | Expr :*: Expr Pos
However GHC does not like this:
Expected kind '* -> *' but 'Expr' has kind '*'
In the type 'Expr Position'
In the definition of data constructor ':*:'
In the data declaration for 'Expr'
I know I could use something like Mul Expr Expr Pos as a workaround, or even wrap Expr in another data constructor, but I'd really like to use the infix operator and cannot figure out a way to do so! Is this possible?
I've tried wrapping the constructor in brackets:
data Expr = IntL Integer Pos | (Expr :*: Expr) Pos
And also making :*: a prefix:
data Expr = IntL Integer Pos | (:*:) Expr Expr Pos
but this does not allow me to pattern match in the same way. I'm not sure this even makes sense as a type constructor but thought I'd ask just in case.
It might be better to do this with an extra constructor, so:
infixl 6 :*:
infixl 7 :#
data Expr = IntL Integer | PosExpr :*: PosExpr
data PosExpr = Expr :# Pos
Then you can construct items with:
(IntL 5 :# foo :*: IntL 6 :# bar) :# qux
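As a rough sketch of how pattern matching looks with this layout: the Pos type below is a made-up stand-in (line and column), and eval and posOf are hypothetical helpers that are not part of the question.

```haskell
infixl 6 :*:
infixl 7 :#

-- Hypothetical position type (line, column); the question leaves Pos undefined.
data Pos = Pos Int Int deriving (Show, Eq)

data Expr    = IntL Integer | PosExpr :*: PosExpr deriving (Show, Eq)
data PosExpr = Expr :# Pos deriving (Show, Eq)

-- Evaluate an annotated expression, ignoring the positions.
eval :: PosExpr -> Integer
eval (IntL n :# _)    = n
eval ((l :*: r) :# _) = eval l * eval r

-- The position attached to the outermost node.
posOf :: PosExpr -> Pos
posOf (_ :# p) = p

-- Because :# binds tighter than :*:, the inner annotations need no parentheses.
example :: PosExpr
example = (IntL 5 :# Pos 1 1 :*: IntL 6 :# Pos 1 5) :# Pos 1 3
```

The infix :*: pattern still works; the only change is that every match also goes through one :# layer.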

Parse grammar alternating and repeating

I was able to add support to my parser's grammar for alternating characters (e.g. ababa or baba) by following along with this question.
I'm now looking to extend that by allowing repeats of characters.
For example, I'd like to be able to support abaaabab and aababaaa as well. In my particular case, only the a is allowed to repeat but a solution that allows for repeating b's would also be useful.
Given the rules from the other question:
expr ::= A | B
A ::= "a" B | "a"
B ::= "b" A | "b"
... I tried extending it to support repeats, like so:
expr ::= A | B
# support 1 or more "a"
A_one_or_more ::= A_one_or_more "a" | "a"
A ::= A_one_or_more B | A_one_or_more
B ::= "b" A | "b"
... but that grammar is ambiguous. Is it possible for this to be made unambiguous, and if so could anyone help me disambiguate it?
I'm using the lemon parser which is an LALR(1) parser.
The point of parsing, in general, is to parse; that is, determine the syntactic structure of an input. That's significantly different from simply verifying that an input belongs to a language.
For example, the language which consists of arbitrary repetitions of a and b can be described with the regular expression (a|b)*, which can be written in BNF as
S ::= /* empty */ | S a | S b
But that probably does not capture the syntactic structure you are trying to define. On the other hand, since you don't specify that structure, it is hard to know.
Here are a couple more possibilities, which build different parse trees. The first:
S ::= E | S E
E ::= A b | E b
A ::= a | A a
And the second:
S ::= E | S E
E ::= A B
A ::= a | A a
B ::= b | B b
When writing a grammar to parse a language, it is useful to start by drawing your proposed parse trees. Usually, you can write the grammar directly from the form of the trees, which shows that a formal grammar is primarily a documentation tool, since it clearly describes the language in a way that informal descriptions cannot. Using a parser generator to turn that grammar into a parser ensures that the parser implements the described language. Or, at least, that is the goal.
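To illustrate what "parse tree" means here, this is a small hand-written sketch in Haskell (not lemon output) of the tree shape that the second grammar above (E ::= A B) describes: a non-empty list of groups, each a run of a's followed by a run of b's. The Group type and function names are invented for the example.

```haskell
-- One E node: the length of its A run and of its B run.
data Group = Group { countA :: Int, countB :: Int } deriving (Show, Eq)

-- S ::= E | S E   becomes a non-empty list of groups.
parseS :: String -> Maybe [Group]
parseS input = case parseE input of
  Nothing        -> Nothing
  Just (g, rest) -> go [g] rest
  where
    go acc []   = Just (reverse acc)
    go acc rest = case parseE rest of
      Nothing         -> Nothing
      Just (g, rest') -> go (g : acc) rest'

-- E ::= A B, where A is one or more 'a' and B is one or more 'b'.
parseE :: String -> Maybe (Group, String)
parseE s =
  let (runA, afterA) = span (== 'a') s
      (runB, afterB) = span (== 'b') afterA
  in if null runA || null runB
       then Nothing
       else Just (Group (length runA) (length runB), afterB)
```

Note that this particular structure rejects inputs that start with b or end with a, which is exactly the kind of decision that drawing the trees first forces you to make.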
Here is a nice tool for checking your grammar online: http://smlweb.cpsc.ucalgary.ca/start.html. It actually accepts the grammar you provided as a valid LALR(1) grammar.
A different LALR(1) grammar that allows repeating a's would be:
S ::= "a" S | "a" | "b" A | "b"
A ::= "a" S | "a" .

What to use in ANTLR4 to resolve ambiguities in more complex cases (instead of syntactic predicates)?

In ANTLR v3, syntactic predicates could be used to resolve ambiguities, i.e., to explicitly tell ANTLR which alternative should be chosen. ANTLR4 seems to simply accept grammars with similar ambiguities, but during parsing it reports these ambiguities. It produces a parse tree despite these ambiguities (by choosing the first alternative, according to the documentation). But what can I do if I want it to choose some other alternative? In other words, how can I explicitly resolve ambiguities?
(For the simple case of the dangling else problem see: What to use in ANTLR4 to resolve ambiguities (instead of syntactic predicates)?)
A more complex example:
If I have a rule like this:
expr
: expr '[' expr? ']'
| ID expr
| '[' expr ']'
| ID
| INT
;
This will parse foo[4] as (expr foo (expr [ (expr 4) ])). But I may want to parse it as (expr (expr foo) [ (expr 4) ]). (I.e., always take the first alternative if possible. It is the first alternative, so according to the documentation it should have higher precedence. So why does it build this tree?)
If I understand correctly, I have 2 solutions:
Basically implement the syntactic predicate with a semantic predicate (however, I'm not sure how, in this case).
Restructure the grammar.
For example, replace expr with e:
e : expr | pe
;
expr
: expr '[' expr? ']'
| ID expr
| ID
| INT
;
pe : '[' expr ']'
;
This seems to work, although the grammar became more complex.
I may have misunderstood some things, but both of these solutions seem less elegant and more complicated than syntactic predicates. That said, I like the solution for the dangling else problem with the ?? operator, but I'm not sure how to use it in this case. Is it possible?
You may be able to resolve this by placing the ID alternative above ID expr. When left-recursion is eliminated, all of your alternatives which are not left recursive are parsed before your alternatives which are left recursive.
For your example, the first non-left-recursive alternative ID expr matches the entire expression, so there is nothing left to parse afterwards.
To get this expression (expr (expr foo) [ (expr 4) ]), you can use
top : expr EOF;
expr : expr '[' expr? ']' | ID | INT ;
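To see why this simplified rule yields the left-nested tree, here is a hand-written sketch in Haskell of what eliminating the left recursion amounts to: parse a primary (ID or INT) first, then apply each '[' expr? ']' suffix to the tree built so far. This is only an illustration under my own naming (Expr, primary, suffixes, skip), not the code ANTLR generates.

```haskell
import Data.Char (isAlpha, isDigit, isSpace)

data Expr
  = Id String
  | IntLit Integer
  | Index Expr (Maybe Expr)   -- expr '[' expr? ']'
  deriving (Show, Eq)

-- expr : a primary followed by zero or more index suffixes.
expr :: String -> Maybe (Expr, String)
expr s = primary (skip s) >>= uncurry suffixes

-- The non-left-recursive alternatives: ID | INT.
primary :: String -> Maybe (Expr, String)
primary s@(c : _)
  | isAlpha c = let (name, rest) = span isAlpha s in Just (Id name, skip rest)
  | isDigit c = let (ds, rest) = span isDigit s in Just (IntLit (read ds), skip rest)
primary _ = Nothing

-- Each suffix wraps the tree built so far, so the result nests to the left.
suffixes :: Expr -> String -> Maybe (Expr, String)
suffixes left ('[' : rest) =
  case skip rest of
    ']' : rest' -> suffixes (Index left Nothing) (skip rest')
    inner -> do
      (ix, afterIx) <- expr inner
      case afterIx of
        ']' : rest' -> suffixes (Index left (Just ix)) (skip rest')
        _           -> Nothing
suffixes left rest = Just (left, rest)

-- Drop leading whitespace.
skip :: String -> String
skip = dropWhile isSpace

-- top : expr EOF
top :: String -> Maybe Expr
top s = case expr s of
  Just (e, "") -> Just e
  _            -> Nothing
```

Here top "foo[4]" gives Index (Id "foo") (Just (IntLit 4)), which corresponds to (expr (expr foo) [ (expr 4) ]).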

Creating a parser rule

I'm currently taking a compilers class (CSCI) at my college. I have to write a parser for the compiler, and I've already done addition, subtraction, multiplication, division, and the assignment statement. My question is: we now have to do less-than-or-equal (<=) and greater-than-or-equal (>=), and I'm not sure how to write the rule for them...
I was thinking something like...
expr LESSTHAN expr { $1 <= $3 }
expr GREATERTHAN expr { $1 >= $3 }
Any suggestions?
You should include a more precise question. Here are some general suggestions though.
The structure of the rule for relational operations should be the same as that of the arithmetic operations. In both cases you have binary operators. The difference is that one returns a number and the other returns a boolean value. While 1 + 1 >= 3 is usually valid syntax, other combinations like 1 >= 2 >= 5 are most likely invalid. Of course there are exceptions. Some languages allow it as syntactic sugar for multiple operations. Others simply define boolean values to be integers (0 and 1). It's up to you (or your assignment) what you want the syntax to look like.
Anyway, you probably don't want to simply append those rules to expr, but rather create a new rule. This way you distinguish between relational and arithmetic expressions.
expr :
expr PLUS expr |
expr MINUS expr |
... ;
relational_expr :
expr LESSTHAN expr |
expr GREATERTHAN expr ;
assignment :
identifier '=' relational_expr |
identifier '=' expr ;
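The effect of keeping relational_expr as its own rule can be sketched with a small hand-written parser over tokens. This is Haskell rather than yacc, only to show the idea; the token and constructor names (TLe, TGe, Arith, Rel) are made up, and only + and - are modelled.

```haskell
data Token = TNum Int | TPlus | TMinus | TLe | TGe deriving (Show, Eq)

-- Two separate result types: arithmetic yields numbers, relational yields a comparison.
data Arith = Num Int | Add Arith Arith | Sub Arith Arith deriving (Show, Eq)
data Rel   = Le Arith Arith | Ge Arith Arith deriving (Show, Eq)

-- expr : expr PLUS expr | expr MINUS expr | NUM   (left-associative)
arith :: [Token] -> Maybe (Arith, [Token])
arith (TNum n : rest) = go (Num n) rest
  where
    go l (TPlus  : TNum m : ts) = go (Add l (Num m)) ts
    go l (TMinus : TNum m : ts) = go (Sub l (Num m)) ts
    go l ts                     = Just (l, ts)
arith _ = Nothing

-- relational_expr : expr LESSTHAN expr | expr GREATERTHAN expr
-- Both operands are plain arithmetic expressions, so comparisons cannot chain.
relational :: [Token] -> Maybe Rel
relational ts = do
  (l, afterL) <- arith ts
  (op, afterOp) <- case afterL of
    TLe : rest -> Just (Le, rest)
    TGe : rest -> Just (Ge, rest)
    _          -> Nothing
  (r, afterR) <- arith afterOp
  if null afterR then Just (op l r) else Nothing
```

With this split, the tokens for 1 + 1 >= 3 are accepted, while those for 1 >= 2 >= 5 are rejected because the second >= is left over after the right-hand operand.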

Relation between grammar and operator associativity

Some compiler books / articles / papers talk about the design of a grammar and its relation to operator associativity. I'm a big fan of top-down parsers, especially recursive descent, and so far most (if not all) compilers I've written use the following expression grammar:
Expr ::= Term { ( "+" | "-" ) Term }
Term ::= Factor { ( "*" | "/" ) Factor }
Factor ::= INTEGER | "(" Expr ")"
which is an EBNF representation of this BNF:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
According to what I've read, some regard this grammar as "wrong" because it changes the operator associativity (left to right for those 4 operators), as shown by the parse tree growing to the right instead of the left. For a parser implemented through an attribute grammar, this might be true, since an L-attributed value must be created first and then passed to child nodes. However, when implementing a normal recursive descent parser, it's up to me whether to construct this node first and then pass it to child nodes (top-down), or to let child nodes be created first and then add the returned values as children of this node (passed in this node's constructor) (bottom-up). There must be something I'm missing here, because I don't agree with the statement that this grammar is "wrong", and this grammar has been used in many languages, especially Wirthian ones. Usually (or always?) the readings that say so promote LR parsing over LL.
I think the issue here is that a language has an abstract syntax which is just like:
E ::= E + E | E - E | E * E | E / E | Int | (E)
but this is actually implemented via a concrete syntax which is used to specify associativity and precedence. So, if you're writing a recursive descent parser, you're implicitly writing the concrete syntax into it as you go along, and that's fine, though it may be good to specify it exactly as a phrase-structured grammar as well!
There are a couple of issues with your grammar if it is to be a fully-fledged concrete grammar. First of all, you need to add productions to just 'go to the next level down', so relaxing your syntax a bit:
Expr ::= Term + Term | Term - Term | Term
Term ::= Factor * Factor | Factor / Factor | Factor
Factor ::= INTEGER | (Expr)
Otherwise there's no way to derive valid sentences starting from the start symbol (in this case Expr). For example, how would you derive '1 * 2' without those extra productions?
Expr -> Term
-> Factor * Factor
-> 1 * Factor
-> 1 * 2
We can see the other grammar handles this in a slightly different way:
Expr -> Term Expr'
-> Factor Term' Expr'
-> 1 Term' Expr'
-> 1 * Factor Term' Expr'
-> 1 * 2 Term' Expr'
-> 1 * 2 ε Expr'
-> 1 * 2 ε ε
= 1 * 2
but this achieves the same effect.
Your parser is actually non-associative. To see this, ask how E + E + E would be parsed and you'll find that it can't be. Whichever + is consumed first, we get E on one side and E + E on the other, but then we're trying to parse E + E as a Term, which is not possible. Equivalently, think about deriving that expression from the start symbol; again, it's not possible.
Expr -> Term + Term
-> ? (can't get another + in here)
The other grammar is left-associative because an arbitrarily long string of E + E + ... + E can be derived.
So anyway, to sum up, you're right that when writing the RDP, you can implement whatever concrete version of the abstract syntax you like and you probably know a lot more about that than me. But there are these issues when trying to produce the grammar which describes your RDP precisely. Hope that helps!
To get associative trees, you really need to have the trees formed with the operator as the subtree root node, with children having similar roots.
Your implementation grammar:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
must make that awkward; if you implement recursive descent on this, the Expr' routine has no access to the "left child" and so can't build the tree. You can always patch this up by passing around pieces (in this case, passing tree parts up the recursion) but that just seems awkward. You could have chosen this instead as a grammar:
Expr ::= Term ( ( "+" | "-" ) Term )* ;
Term ::= Factor ( ( "*" | "/" ) Factor )* ;
Factor ::= INTEGER | "(" Expr ")" ;
which is just as easy (easier?) to code recursive descent-wise, but now you can form the trees you need without trouble.
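A minimal recursive descent sketch of that iterative grammar, written in Haskell with a made-up Tree type: each pass through the ( op operand )* loop wraps the tree built so far, so the operator node ends up as the root with the earlier part as its left child.

```haskell
import Data.Char (isDigit, isSpace)

-- Operator at the root, operands as children.
data Tree = Leaf Integer | Node Char Tree Tree deriving (Show, Eq)

-- Expr ::= Term ( ("+"|"-") Term )*
expr :: String -> Maybe (Tree, String)
expr s = term s >>= uncurry (loop "+-" term)

-- Term ::= Factor ( ("*"|"/") Factor )*
term :: String -> Maybe (Tree, String)
term s = factor s >>= uncurry (loop "*/" factor)

-- The ( op operand )* part: fold each new operand into the tree built so far.
loop :: String -> (String -> Maybe (Tree, String)) -> Tree -> String -> Maybe (Tree, String)
loop ops operand left s = case dropWhile isSpace s of
  c : rest | c `elem` ops -> do
    (right, rest') <- operand rest
    loop ops operand (Node c left right) rest'
  rest -> Just (left, rest)

-- Factor ::= INTEGER | "(" Expr ")"
factor :: String -> Maybe (Tree, String)
factor s = case dropWhile isSpace s of
  '(' : rest -> do
    (e, rest') <- expr rest
    case dropWhile isSpace rest' of
      ')' : rest'' -> Just (e, rest'')
      _            -> Nothing
  rest@(c : _) | isDigit c ->
    let (ds, rest') = span isDigit rest in Just (Leaf (read ds), rest')
  _ -> Nothing
```

For "1 - 2 - 3" this builds Node '-' (Node '-' (Leaf 1) (Leaf 2)) (Leaf 3), the left-leaning shape, without passing partial trees through a separate Expr' routine.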
This doesn't really get you associativity; it just shapes the trees so that it could be allowed. Associativity means that the tree (+ (+ a b) c) means the same thing as (+ a (+ b c)); it's actually a semantic property (it sure doesn't work for "-", but the grammar as posed can't distinguish).
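A tiny evaluator makes the semantic point concrete. This is just an illustrative sketch in Haskell with invented names; the two tree shapes agree for "+" but not for "-".

```haskell
data E = Lit Integer | Add E E | Sub E E deriving Show

eval :: E -> Integer
eval (Lit n)   = n
eval (Add l r) = eval l + eval r
eval (Sub l r) = eval l - eval r

leftAdd, rightAdd, leftSub, rightSub :: E
leftAdd  = Add (Add (Lit 7) (Lit 2)) (Lit 1)   -- (7 + 2) + 1
rightAdd = Add (Lit 7) (Add (Lit 2) (Lit 1))   -- 7 + (2 + 1)
leftSub  = Sub (Sub (Lit 7) (Lit 2)) (Lit 1)   -- (7 - 2) - 1
rightSub = Sub (Lit 7) (Sub (Lit 2) (Lit 1))   -- 7 - (2 - 1)
```

Both addition trees evaluate to 10, while the subtraction trees evaluate to 4 and 6, so for "-" the tree shape the parser chooses changes the meaning.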
We have a tool (the DMS Software Reengineering Toolkit) that includes parsers and term-rewriting (using source-to-source transformations) in which the associativity is explicitly expressed. We'd write your grammar:
Expr ::= Term ;
[Associative Commutative] Expr ::= Expr "+" Term ;
Expr ::= Expr "-" Term ;
Term ::= Factor ;
[Associative Commutative] Term ::= Term "*" Factor ;
Term ::= Term "/" Factor ;
Factor ::= INTEGER ;
Factor ::= "(" Expr ")" ;
The grammar seems longer and clumsier this way, but it in fact allows us to break out the special cases and mark them as needed. In particular, we can now distinguish operators that are associative from those that are not, and mark them accordingly. With that semantic marking, our tree-rewrite engine automatically accounts for associativity and commutativity. You can see a full example of such DMS rules being used to symbolically simplify high-school algebra using explicit rewrite rules over a typical expression grammar that don't have to account for such semantic properties. That is built into the rewrite engine.

Resources