Best ADT representation of AST - parsing

I have the following grammar for expressions that I'm trying to represent as a Haskell ADT:
Expr = SimpleExpr [OPrelation SimpleExpr]
SimpleExpr = [OPunary] Term {OPadd Term}
Term = Factor {OPmult Factor}
where:
{} means 0 or more
[] means 0 or 1
OPmult, OPadd, OPrelation, OPunary are classes of operators
Note that this grammar does get precedence right.
Here's something I tried:
data Expr = Expr SimpleExpr (Maybe OPrelation) (Maybe SimpleExpr)
data SimpleExpr = SimpleExpr (Maybe OPunary) Term [OPadd] [Term]
data Term = Term Factor [OPmult] [Factor]
which in hindsight I think is awful, especially the [OPadd] [Term] and [OPmult] [Factor] parts. Because, for example, in the parse tree for 1+2+3 it would put [+, +] in one branch and [2, 3] in another, meaning they're decoupled.
What would be a good representation that'll play nice later in the next stages of compilation?
Decomposing { } and [ ] into more data types seems like an overkill
Using lists seems not quite right as it would no longer be a tree (Just a node that's a list)
Maybe for { }. A good idea ?
And finally, I'm assuming after parsing I'll have to pass over the Parse Tree and reduce it to an AST? or should the whole grammar be modified to be less complex? or maybe it's abstract enough?

The AST does not need to be that close to the grammar. The grammar is structured into multiple levels to encode precedence and uses repetition to avoid left-recursion while still being able to correctly handle left-associative operators. The AST does not need to worry about such things.
Instead I'd define the AST like this:
data Expr = BinaryOperation BinaryOperator Expr Expr
| UnaryOperation UnaryOperator Expr
| Literal LiteralValue
| Variable Id
data BinaryOperator = Add | Sub | Mul | Div
data UnaryOperator = Not | Negate

Here's an additional answer that might help you. I don't want to spoil your fun, so here's a very simple example grammar:
-- Expr = Term ['+' Term]
-- Term = Factor ['*' Factor]
-- Factor = number | '(' Expr ')'
-- number = one or more digits
Using a CST
As one approach, we can represent this grammar as a concrete syntax tree (CST):
data Expr = TermE Term | PlusE Term Term deriving (Show)
data Term = FactorT Factor | TimesT Factor Factor deriving (Show)
data Factor = NumberF Int | ParenF Expr deriving (Show)
A Parsec-based parser to turn the concrete syntax into a CST might look like this:
expr :: Parser Expr
expr = do
t1 <- term
(PlusE t1 <$ symbol "+" <*> term)
<|> pure (TermE t1)
term :: Parser Term
term = do
f1 <- factor
(TimesT f1 <$ symbol "*" <*> factor)
<|> pure (FactorT f1)
factor :: Parser Factor
factor = NumberF . read <$> lexeme (many1 (satisfy isDigit))
<|> ParenF <$> between (symbol "(") (symbol ")") expr
with helper functions for whitespace processing:
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces
symbol :: String -> Parser String
symbol = lexeme . string
and main entry point:
parseExpr :: String -> Expr
parseExpr pgm = case parse (spaces *> expr) "(string)" pgm of
Right e -> e
Left err -> error $ show err
after which we can run:
> parseExpr "1+1*(3+4)"
PlusE (FactorT (Number 1)) (TimesT (Number 1) (ParenF (PlusE
(FactorT (Number 3)) (FactorT (Number 4)))))
>
To convert this into the following AST:
data AExpr -- Abstract Expression
= NumberA Int
| PlusA AExpr AExpr
| TimesA AExpr AExpr
we could write:
aexpr :: Expr -> AExpr
aexpr (TermE t) = aterm t
aexpr (PlusE t1 t2) = PlusA (aterm t1) (aterm t2)
aterm :: Term -> AExpr
aterm (FactorT f) = afactor f
aterm (TimesT f1 f2) = TimesA (afactor f1) (afactor f2)
afactor :: Factor -> AExpr
afactor (NumberF n) = NumberA n
afactor (ParenF e) = aexpr e
To interpret the AST, we could use:
interp :: AExpr -> Int
interp (NumberA n) = n
interp (PlusA e1 e2) = interp e1 + interp e2
interp (TimesA e1 e2) = interp e1 * interp e2
and then write:
calc :: String -> Int
calc = interp . aexpr . parseExpr
after which we have a crude little calculator:
> calc "1 + 2 * (6 + 3)"
19
>
Skipping the CST
As an alternative approach, we could replace the parser with one that parses directly into an AST of type AExpr:
expr :: Parser AExpr
expr = do
t1 <- term
(PlusA t1 <$ symbol "+" <*> term)
<|> pure t1
term :: Parser AExpr
term = do
f1 <- factor
(TimesA f1 <$ symbol "*" <*> factor)
<|> pure f1
factor :: Parser AExpr
factor = NumberA . read <$> lexeme (many1 (satisfy isDigit))
<|> between (symbol "(") (symbol ")") expr
You can see how little the structure of these parsers changes. All that's disappeared is the distinction between expressions, terms, and factors at the type level, and constructors like TermE, FactorT, and ParenF whose only purpose is to allow embedding of these types within each other.
In more complex scenarios, the CST and AST parsers might exhibit bigger differences. (For example, in a grammar that allowed 1 + 2 + 3, this might be represented as a single constructor data Expr = ... | PlusE [Term] | ... in the CST but with a nested series of binary PlusA constructors in the same AExpr AST type as above.)
After redefining parseExpr to return an AExpr and dropping the aexpr step from calc, everything else stays the same, and we still have:
> calc "1 + 2 * (6 + 3)"
19
>
Programs for Reference
Here's the full program using an intermediate CST:
-- Calc1.hs, using a CST
{-# OPTIONS_GHC -Wall #-}
module Calc1 where
import Data.Char
import Text.Parsec
import Text.Parsec.String
data Expr = TermE Term | PlusE Term Term deriving (Show)
data Term = FactorT Factor | TimesT Factor Factor deriving (Show)
data Factor = NumberF Int | ParenF Expr deriving (Show)
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces
symbol :: String -> Parser String
symbol = lexeme . string
expr :: Parser Expr
expr = do
t1 <- term
(PlusE t1 <$ symbol "+" <*> term)
<|> pure (TermE t1)
term :: Parser Term
term = do
f1 <- factor
(TimesT f1 <$ symbol "*" <*> factor)
<|> pure (FactorT f1)
factor :: Parser Factor
factor = NumberF . read <$> lexeme (many1 (satisfy isDigit))
<|> ParenF <$> between (symbol "(") (symbol ")") expr
parseExpr :: String -> Expr
parseExpr pgm = case parse (spaces *> expr) "(string)" pgm of
Right e -> e
Left err -> error $ show err
data AExpr -- Abstract Expression
= NumberA Int
| PlusA AExpr AExpr
| TimesA AExpr AExpr
aexpr :: Expr -> AExpr
aexpr (TermE t) = aterm t
aexpr (PlusE t1 t2) = PlusA (aterm t1) (aterm t2)
aterm :: Term -> AExpr
aterm (FactorT f) = afactor f
aterm (TimesT f1 f2) = TimesA (afactor f1) (afactor f2)
afactor :: Factor -> AExpr
afactor (NumberF n) = NumberA n
afactor (ParenF e) = aexpr e
interp :: AExpr -> Int
interp (NumberA n) = n
interp (PlusA e1 e2) = interp e1 + interp e2
interp (TimesA e1 e2) = interp e1 * interp e2
calc :: String -> Int
calc = interp . aexpr . parseExpr
and here's the full program for the more traditional solution that skips an explicit CST representation:
-- Calc2.hs, with direct parsing to AST
{-# OPTIONS_GHC -Wall #-}
module Calc where
import Data.Char
import Text.Parsec
import Text.Parsec.String
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces
symbol :: String -> Parser String
symbol = lexeme . string
expr :: Parser AExpr
expr = do
t1 <- term
(PlusA t1 <$ symbol "+" <*> term)
<|> pure t1
term :: Parser AExpr
term = do
f1 <- factor
(TimesA f1 <$ symbol "*" <*> factor)
<|> pure f1
factor :: Parser AExpr
factor = NumberA . read <$> lexeme (many1 (satisfy isDigit))
<|> between (symbol "(") (symbol ")") expr
parseExpr :: String -> AExpr
parseExpr pgm = case parse (spaces *> expr) "(string)" pgm of
Right e -> e
Left err -> error $ show err
data AExpr -- Abstract Expression
= NumberA Int
| PlusA AExpr AExpr
| TimesA AExpr AExpr
interp :: AExpr -> Int
interp (NumberA n) = n
interp (PlusA e1 e2) = interp e1 + interp e2
interp (TimesA e1 e2) = interp e1 * interp e2
calc :: String -> Int
calc = interp . parseExpr

Okay so Buhr's answer is quite nice. Here's how I did though (no CST) inspired by sepp2k's response:
The AST:
data OP = OPplus | OPminus | OPstar | OPdiv
| OPidiv | OPmod | OPand | OPeq | OPneq
| OPless | OPgreater | OPle | OPge
| OPin | OPor
data Expr =
Relation Expr OP Expr -- > < == >= etc..
| Unary OP Expr -- + -
| Mult Expr OP Expr -- * / div mod and
| Add Expr OP Expr -- + - or
| FactorInt Int | FactorReal Double
| FactorStr String
| FactorTrue | FactorFalse
| FactorNil
| FactorDesig Designator -- identifiers
| FactorNot Expr
| FactorFuncCall FuncCall deriving (Show)
The parsers:
parseExpr :: Parser Expr
parseExpr = (try $ Relation <$>
parseSimpleExpr <*> parseOPrelation <*> parseSimpleExpr)
<|> parseSimpleExpr
parseSimpleExpr :: Parser Expr
parseSimpleExpr = (try simpleAdd)
<|> (try $ Unary <$> parseOPunary <*> simpleAdd)
<|> (try $ Unary <$> parseOPunary <*> parseSimpleExpr)
<|> parseTerm
where simpleAdd = Add <$> parseTerm <*> parseOPadd <*> parseSimpleExpr
parseTerm :: Parser Expr
parseTerm = (try $ Mult <$>
parseFactor <*> parseOPmult <*> parseTerm)
<|> parseFactor
parseFactor :: Parser Expr
parseFactor =
(parseKWnot >> FactorNot <$> parseFactor)
<|> (exactTok "true" >> return FactorTrue)
<|> (exactTok "false" >> return FactorFalse)
<|> (parseNumber)
<|> (FactorStr <$> parseString)
<|> (betweenCharTok '(' ')' parseExpr)
<|> (FactorDesig <$> parseDesignator)
<|> (FactorFuncCall <$> parseFuncCall)
I didn't include basic parsers like parseOPadd as those are what you'd expect and are easy to build.
I still parsed according to the grammar but tweaked it slightly to match my AST.
You could check out the full source which is a compiler for Pascal here.

Related

How to parse a bool expression in Haskell

I am trying to parse a bool expression in Haskell. This line is giving me an error: BoolExpr <$> parseBoolOp <*> (n : ns). This is the error:
• Couldn't match type ‘[]’ with ‘Parser’
Expected type: Parser [Expr]
Actual type: [Expr]
-- define the expression types
data Expr
= BoolExpr BoolOp [Expr]
deriving (Show, Eq)
-- define the type for bool value
data Value
= BoolVal Bool
deriving (Show, Eq)
-- many x = Parser.some x <|> pure []
-- some x = (:) <$> x <*> Parser.many x
kstar :: Alternative f => f a -> f [a]
kstar x = kplus x <|> pure []
kplus :: Alternative f => f a -> f [a]
kplus x = (:) <$> x <*> kstar x
symbol :: String -> Parser String
symbol xs = token (string xs)
-- a bool expression is the operator followed by one or more expressions that we have to parse
-- TODO: add bool expressions
boolExpr :: Parser Expr
boolExpr = do
n <- parseExpr
ns <- kstar (symbol "," >> parseExpr)
BoolExpr <$> parseBoolOp <*> (n : ns)
-- an atom is a literalExpr, which can be an actual literal or some other things
parseAtom :: Parser Expr
parseAtom =
do
literalExpr
-- the main parsing function which alternates between all the options you have
parseExpr :: Parser Expr
parseExpr =
do
parseAtom
<|> parseParens boolExpr
<|> parseParens parseExpr
-- implement parsing bool operations, these are 'and' and 'or'
parseBoolOp :: Parser BoolOp
parseBoolOp =
do symbol "and" >> return And
<|> do symbol "or" >> return Or
The boolExpr is expecting a Parser [Expr] but I am returning only an [Expr]. Is there a way to fix this or do it in another way? When I try pure (n:ns), evalStr "(and true (and false true) true)" returns Left (ParseError "'a' didn't match expected character") instead of Right (BoolVal False)
The expression (n : ns) is a list. Therefore the compiler thinks that the applicative operators <*> and <$> should be used in the context [], while you want Parser instead.
I would guess you need pure (n : ns) instead.

How to re-write a function in Haskell

I am making a Mini lisp language. How can I re-write this using just arrows:
-- a comp expression is the comp operator and the parsing of two expressions
compExpr :: Parser Expr
compExpr = CompExpr <$> parseCompOp <*> parseExpr <*> parseExpr
For reference, here is what the other parsers are:
data CompOp = Eq | Lt deriving (Show, Eq)
-- define the expression types
data Expr
= CompExpr CompOp Expr Expr
deriving (Show, Eq)
parseCompOp :: Parser CompOp
parseCompOp =
do parseKeyword "equal?" >> return Eq
<|> do symbol "<" >> return Lt
-- the main parsing function which alternates between all the options you have
parseExpr :: Parser Expr
parseExpr =
do
parseAtom
<|> parseParens compExpr
<|> parseParens parseExpr
-- an atom is a literalExpr, which can be an actual literal or some other things
parseAtom :: Parser Expr
parseAtom =
do
literalExpr
Here is what I have tried:
compExpr :: Parser Expr
compExpr = do
op <- parseCompOp
expr1 <- parseExpr
expr2 <- parseExpr
return (CompExpr op expr1 expr2)
Since I am not sure how to test this, is this function equivalent to the first function I showed?

Expression Evaluation using combinators in Haskell

I'm trying to make an expression evaluator in Hakell:
data Parser i o
= Success o [i]
| Failure String [i]
| Parser
{parse :: [i] -> Parser i o}
data Operator = Add | Sub | Mul | Div | Pow
data Expr
= Op Operator Expr Expr
| Val Double
expr :: Parser Char Expr
expr = add_sub
where
add_sub = calc Add '+' mul_div <|> calc Sub '-' mul_div <|> mul_div
mul_div = calc Mul '*' pow <|> calc Div '/' pow <|> pow
pow = calc Pow '^' factor <|> factor
factor = parens <|> val
val = Val <$> parseDouble
parens = parseChar '(' *> expr <* parseChar ')'
calc c o p = Op c <$> (p <* parseChar o) <*> p
My problem is that when I try to evaluate an expression with two operators with same priority (e.g. 1+1-1) the parser will fail.
How can I say that an add_sub can be an operation between two other add_subs without creating an infinite loop?
As explained by #chi the problem is that calc was using p twice which doesn't allow for patterns like muldiv + .... | muldiv - ... | ...
I just changed the definition of calc to :
calc c o p p2 = Op c <$> (p <* parseChar o) <*> p2
where p2 is the current priority (mul_div in the definition of mul_div)
it works much better but the order of calulations is backwards:
2/3/4 is parsed as 2/(3/4) instead of (2/3)/4

Haskell : Non-Exhaustive Pattern In Function Prevents Another Function From Executing Even Though Its Not Used

I'm trying to implement car, cdr, and cons functionality into a toy language I'm writing however when I try to execute my car function through main, I get the following error:
./parser "car [1 2 3]"
parser: parser.hs:(48,27)-(55,45): Non-exhaustive patterns in case
The function on lines 48-55 is the following:
parseOp :: Parser HVal
parseOp = (many1 letter <|> string "+" <|> string "-" <|> string "*" <|> string "/" <|> string "%" <|> string "&&" <|> string "||") >>=
(\x -> return $ case x of
"&&" -> Op And
"||" -> Op Or
"+" -> Op Add
"-" -> Op Sub
"*" -> Op Mult
"/" -> Op Div
"%" -> Op Mod)
I'm really unsure why the error message points to this function because it has nothing to do with the list functionality. The car function is working however because I was able to successfully execute it through GHCI. I know my problem is due to parsing but I don't see where it is. The following are the functions that relate to lists. I can't see from them how they are influenced by parseOp.
data HVal = Number Integer
| String String
| Boolean Bool
| List [HVal]
| Op Op
| Expr Op HVal HVal
| Car [HVal]
deriving (Read)
car :: [HVal] -> HVal
car xs = head xs
parseListFunctions :: Parser HVal
parseListFunctions = do
_ <- string "car "
_ <- char '['
x <- parseList
_ <- char ']'
return $ Car [x]
parseExpr :: Parser HVal
parseExpr = parseNumber
<|> parseOp
<|> parseBool
<|> parseListFunctions
<|> do
_ <- char '['
x <- parseList
_ <- char ']'
return x
<|> do
_ <- char '('
x <- parseExpression
_ <- char ')'
return x
eval :: HVal -> HVal
eval val#(Number _) = val
eval val#(String _) = val
eval val#(Boolean _) = val
eval val#(List _) = val -- Look at list eval NOT WORKING
eval val#(Op _) = val
eval (Expr op x y) = eval $ evalExpr (eval x) op (eval y)
eval (Car xs) = eval $ car xs
The removal of many1 letter in parseOp transfers the same error to the following function parseBool:
parseBool :: Parser HVal
parseBool = many1 letter >>= (\x -> return $ case x of
"True" -> Boolean True
"False" -> Boolean False)
You write
parseExpr = ... <|> parseOp <|> ... <|> parseListFunctions <|> ...
and so
car ...
is passed to parseOp first, then parseListFunctions. The parseOp parser succeeds in the
many1 letter
branch, and so in the \x -> return $ case x of ..., x is bound to "car". Because parseOp succeeds (and returns an error value with an embedded, not-yet-evaluated inexhaustive case error!), parseListFunctions is never tried.
You will need to modify your grammar to reduce the ambiguity in it, so that these conflicts where multiple branches may match do not arise.

Parsing an expression grammar having function application with parser combinators (left-recursion)

As a simplified subproblem of a parser for a real language, I am trying to implement a parser for expressions of a fictional language which looks similar to standard imperative languages (like Python, JavaScript, and so). Its syntax features the following construct:
integer numbers
identifiers ([a-zA-Z]+)
arithmetic expressions with + and * and parenthesis
structure access with . (eg foo.bar.buz)
tuples (eg (1, foo, bar.buz)) (to remove ambiguity one-tuples are written as (x,))
function application (eg foo(1, bar, buz()))
functions are first class so they can also be returned from other functions and directly be applied (eg foo()() is legal because foo() might return a function)
So a fairly complex program in this language is
(1+2*3, f(4,5,6)(bar) + qux.quux()().quuux)
the associativity is supposed to be
( (1+(2*3)), ( ((f(4,5,6))(bar)) + ((((qux.quux)())()).quuux) ) )
I'm currently using the very nice uu-parsinglib an applicative parser combinator library.
The first problem was obviously that the intuitive expression grammar (expr -> identifier | number | expr * expr | expr + expr | (expr) is left-recursive. But I could solve that problem using the the pChainl combinator (see parseExpr in the example below).
The remaining problem (hence this question) is function application with functions returned from other functions (f()()). Again, the grammar is left recursive expr -> fun-call | ...; fun-call -> expr ( parameter-list ). Any ideas how I can solve this problem elegantly using uu-parsinglib? (the problem should directly apply to parsec, attoparsec and other parser combinators as well I guess).
See below my current version of the program. It works well but function application is only working on identifiers to remove the left-recursion:
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE RankNTypes #-}
module TestExprGrammar
(
) where
import Data.Foldable (asum)
import Data.List (intercalate)
import Text.ParserCombinators.UU
import Text.ParserCombinators.UU.Utils
import Text.ParserCombinators.UU.BasicInstances
data Node =
NumberLiteral Integer
| Identifier String
| Tuple [Node]
| MemberAccess Node Node
| FunctionCall Node [Node]
| BinaryOperation String Node Node
parseFunctionCall :: Parser Node
parseFunctionCall =
FunctionCall <$>
parseIdentifier {- `parseExpr' would be correct but left-recursive -}
<*> parseParenthesisedNodeList 0
operators :: [[(Char, Node -> Node -> Node)]]
operators = [ [('+', BinaryOperation "+")]
, [('*' , BinaryOperation "*")]
, [('.', MemberAccess)]
]
samePrio :: [(Char, Node -> Node -> Node)] -> Parser (Node -> Node -> Node)
samePrio ops = asum [op <$ pSym c <* pSpaces | (c, op) <- ops]
parseExpr :: Parser Node
parseExpr =
foldr pChainl
(parseIdentifier
<|> parseNumber
<|> parseTuple
<|> parseFunctionCall
<|> pParens parseExpr
)
(map samePrio operators)
parseNodeList :: Int -> Parser [Node]
parseNodeList n =
case n of
_ | n < 0 -> parseNodeList 0
0 -> pListSep (pSymbol ",") parseExpr
n -> (:) <$>
parseExpr
<* pSymbol ","
<*> parseNodeList (n-1)
parseParenthesisedNodeList :: Int -> Parser [Node]
parseParenthesisedNodeList n = pParens (parseNodeList n)
parseIdentifier :: Parser Node
parseIdentifier = Identifier <$> pSome pLetter <* pSpaces
parseNumber :: Parser Node
parseNumber = NumberLiteral <$> pNatural
parseTuple :: Parser Node
parseTuple =
Tuple <$> parseParenthesisedNodeList 1
<|> Tuple [] <$ pSymbol "()"
instance Show Node where
show n =
let showNodeList ns = intercalate ", " (map show ns)
showParenthesisedNodeList ns = "(" ++ showNodeList ns ++ ")"
in case n of
Identifier i -> i
Tuple ns -> showParenthesisedNodeList ns
NumberLiteral n -> show n
FunctionCall f args -> show f ++ showParenthesisedNodeList args
MemberAccess f g -> show f ++ "." ++ show g
BinaryOperation op l r -> "(" ++ show l ++ op ++ show r ++ ")"
Looking briefly at the list-like combinators for uu-parsinglib (I'm more familiar with parsec), I think you can solve this by folding over the result of the pSome combinator:
parseFunctionCall :: Parser Node
parseFunctionCall =
foldl' FunctionCall <$>
parseIdentifier {- `parseExpr' would be correct but left-recursive -}
<*> pSome (parseParenthesisedNodeList 0)
This is also equivalent to the Alternative some combinator, which should indeed apply to the other parsing libs you mentioned.
I don't know this library but can show you how to remove left recursion. The standard right recursive expression grammar is
E -> T E'
E' -> + TE' | eps
T -> F T'
T' -> * FT' | eps
F -> NUMBER | ID | ( E )
To add function application you must decide its level of precedence. In most languages I've seen it is highest. So you'd add another layer of productions for function application.
E -> T E'
E' -> + TE' | eps
T -> AT'
T' -> * A T' | eps
A -> F A'
A' -> ( E ) A' | eps
F -> NUMBER | ID | ( E )
Yes this is a hairy-looking grammar and bigger than the left recursive one. That's the price of top-down predictive parsing. If you want simpler grammars use a bottom up parser generator a la yacc.

Resources