Is there a way to fix an expression with operators in it after parsing, using a table of associativities and precedences? - parsing

I'm currently working on a parser for a simple programming language written in Haskell. I ran into a problem when I tried to allow for binary operators with differing associativities and precedences. Normally this wouldn't be an issue, but since my language allows users to define their own operators, the precedence of operators isn't known by the compiler until the program has already been parsed.
Here are some of the data types I've defined so far:
data Expr
= Var String
| Op String Expr Expr
| ..
data Assoc
= LeftAssoc
| RightAssoc
| NonAssoc
type OpTable =
Map.Map String (Assoc, Int)
At the moment, the compiler parses all operators as if they were right-associative with equal precedence. So if I give it an expression like a + b * c < d the result will be Op "+" (Var "a") (Op "*" (Var "b") (Op "<" (Var "c") (Var "d"))).
I'm trying to write a function called fixExpr which takes an OpTable and an Expr and rearranges the Expr based on the associativities and precedences listed in the OpTable. For example:
operators :: OpTable
operators =
Map.fromList
[ ("<", (NonAssoc, 4))
, ("+", (LeftAssoc, 6))
, ("*", (LeftAssoc, 7))
]
expr :: Expr
expr = Op "+" (Var "a") (Op "*" (Var "b") (Op "<" (Var "c") (Var "d")))
fixExpr operators expr should evaluate to Op "<" (Op "+" (Var "a") (Op "*" (Var "b") (Var "c"))) (Var "d").
How do I define the fixExpr function? I've tried multiple solutions and none of them have worked.

An expression e may be an atomic term n (e.g. a variable or literal), a parenthesised expression, or an application of an infix operator ○.
e ⩴ n | (e​) | e1 ○ e2
We need the parentheses to know whether the user entered a * b + c, which we happen to associate as a * (b + c) and need to reassociate as (a * b) + c, or if they entered a * (b + c) literally, which should not be reassociated. Therefore I’ll make a small change to the data type:
data Expr
= Var String
| Group Expr
| Op String Expr Expr
| …
Then the method is simple:
The rebracketing of an expression ⟦e⟧ applies recursively to all its subexpressions.
⟦n⟧ = n
⟦(e)⟧ = (⟦e⟧)
⟦e1 ○ e2⟧ = ⦅⟦e1⟧ ○ ⟦e2⟧⦆
A single reassociation step ⦅e⦆ removes redundant parentheses on the right, and reassociates nested operator applications leftward in two cases: if the left operator has higher precedence, or if the two operators have equal precedence, and are both left-associative. It leaves nested infix applications alone, that is, associating rightward, in the opposite cases: if the right operator has higher precedence, or the operators have equal precedence and right associativity. If the associativities are mismatched, then the result is undefined.
⦅e ○ n⦆ = e ○ n
⦅e1 ○ (e2)⦆ = ⦅e1 ○ e2⦆
⦅e1 ○ (e2 ● e3)⦆ =
⦅e1 ○ e2⦆ ● e3, if:
a. P(○) > P(●); or
b. P(○) = P(●) and A(○) = A(●) = L
e1 ○ (e2 ● e3), if:
a. P(○) < P(●); or
b. P(○) = P(●) and A(○) = A(●) = R
undefined otherwise
NB.: P(o) and A(o) are respectively the precedence and associativity (L or R) of operator o.
This can be translated fairly literally to Haskell:
fixExpr operators = reassoc
where
-- 1.1
reassoc e#Var{} = e
-- 1.2
reassoc (Group e) = Group (reassoc e)
-- 1.3
reassoc (Op o e1 e2) = reassoc' o (reassoc e1) (reassoc e2)
-- 2.1
reassoc' o e1 e2#Var{} = Op o e1 e2
-- 2.2
reassoc' o e1 (Group e2) = reassoc' o e1 e2
-- 2.3
reassoc' o1 e1 r#(Op o2 e2 e3) = case compare prec1 prec2 of
-- 2.3.1a
GT -> assocLeft
-- 2.3.2a
LT -> assocRight
EQ -> case (assoc1, assoc2) of
-- 2.3.1b
(LeftAssoc, LeftAssoc) -> assocLeft
-- 2.3.2b
(RightAssoc, RightAssoc) -> assocRight
-- 2.3.3
_ -> error $ concat
[ "cannot mix ‘", o1
, "’ ("
, show assoc1
, " "
, show prec1
, ") and ‘"
, o2
, "’ ("
, show assoc2
, " "
, show prec2
, ") in the same infix expression"
]
where
(assoc1, prec1) = opInfo o1
(assoc2, prec2) = opInfo o2
assocLeft = Op o2 (Group (reassoc' o1 e1 e2)) e3
assocRight = Op o1 e1 r
opInfo op = fromMaybe (notFound op) (Map.lookup op operators)
notFound op = error $ concat
[ "no precedence/associativity defined for ‘"
, op
, "’"
]
Note the recursive call in assocLeft: by reassociating the operator applications, we may have revealed another association step, as in a chain of left-associative operator applications like a + b + c + d = (((a + b) + c) + d).
I insert Group constructors in the output for illustration, but they can be removed at this point, since they’re only necessary in the input.
This hasn’t been tested very thoroughly at all, but I think the idea is sound, and should accommodate modifications for more complex situations, even if the code leaves something to be desired.
An alternative that I’ve used is to parse expressions as “flat” sequences of operators applied to terms, and then run a separate parsing pass after name resolution, using e.g. Parsec’s operator precedence parser facility, which would handle these details automatically.

Related

Haskell monadic parser with anamorphisms

My problem is how to combine the recursive, F-algebra-style recursive type definitions, with monadic/applicative-style parsers, in way that would scale to a realistic programming language.
I have just started with the Expr definition below:
data ExprF a = Plus a a |
Val Integer deriving (Functor,Show)
data Rec f = In (f (Rec f))
type Expr = Rec ExprF
and I am trying to combine it with a parser which uses anamorphisms:
ana :: Functor f => (a -> f a) -> a -> Rec f
ana psi x = In $ fmap (ana psi) (psi x)
parser = ana psi
where psi :: String -> ExprF String
psi = ???
as far as I could understand, in my example, psi should either parse just an integer, or it should decide that the string is a <expr> + <expr> and then (by recursively calling fmap (ana psi)), it should parse the left-hand side and the right-hand side expressions.
However, (monadic/applicative) parsers don't work like that:
they first attempt parsing the left-hand expression,
the +,
and the right-hand expression
One solution that I see, is to change the type definition for Plus a a to Plus Integer a, such that it reflects the parsing process, however this doesn't seem like the best avenue.
Any suggestions (or reading directions) would be welcome!
If you need a monadic parser, you need a monad in your unfold:
anaM :: (Traversable f, Monad m) => (a -> m (f a)) -> a -> m (Rec f)
anaM psiM x = In <$> (psiM x >>= traverse (anaM psiM))
Then you can write something that parses just one level of an ExprF like this:
parseNum :: Parser Integer
parseNum = -- ...
char :: Char -> Parser Char
char c = -- ...
parseExprF :: Maybe Integer -> Parser (ExprF (Maybe Integer))
parseExprF (Just n) = pure (Val n)
parseExprF Nothing = do
n <- parseNum
empty
<|> (Plus (Just n) Nothing <$ char '+')
<|> (pure (Val n))
Given that, you now have your recursive Expr parser:
parseExpr :: Parser Expr
parseExpr = anaM parseExprF Nothing
You will need to have instances of Foldable and Traversable for ExprF, of course, but the compiler can write these for you and they are not themselves recursive.

Folding Set to make a new Set

Say I have a type Prop for propositions:
type Prop =
| P of string
| Disjunction of Prop * Prop
| Conjunction of Prop * Prop
| Negation of Prop
Where:
• A "p" representing the atom P,
• Disjunction(A "P", A "q") representing the proposition P ∨ q.
• Conjunction(A "P", A "q") representing the proposition P ∧ q.
• Negation(A "P") representing the proposition ¬P.
I'm supposed to use a set-based representation of formulas in disjunctive normal form. Since conjunction is commutative, associative and (a ∧ a) is equivalent to a it is convenient to represent a basic conjunct bc by its set of literals litOf(bc).
bc is defined as: A literal is an atom or the negation of an atom and a basic conjunct is a conjunction of literals
This leads me to the function for litOf:
let litOf bc =
Set.fold (fun acc (Con(x, y)) -> Set.add (x, y) acc) Set.empty bc
I'm pretty sure my litOf is wrong, and I get an error on the (Con(x,y)) part saying: "Incomplete pattern m
atches on this expression. For example, the value 'Dis (_, _)' may indicate a cas
e not covered by the pattern(s).", which I also have no clue what actually means in this context.
Any hints to how I can procede?
I assume your example type Prop changed on the way from keyboard to here, and orginally looked like this:
type Prop =
| P of string
| Dis of Prop * Prop
| Con of Prop * Prop
| Neg of Prop
There are several things that tripped you up:
Set.fold operates on input that is a set, and does something for each element in the set. In your case, the input is a boolean clause, and the output is a set.
You did not fully define what constitutes a literal. For a conjunction, the set of literals is the union of the literals on the left and on the right side. But what about a disjunction? The compiler error message means exactly that.
Here's what I think you are after:
let rec literals = function
| P s -> Set.singleton s
| Dis (x, y) -> Set.union (literals x) (literals y)
| Con (x, y) -> Set.union (literals x) (literals y)
| Neg x -> literals x
With that, you will get
> literals (Dis (P "A", Neg (Con (P "B", Con (P "A", P "C")))))
val it : Set<string> = set ["A"; "B"; "C"]

Parsing non binary operators with Parsec

Traditionally, arithmetic operators are considered to be binary (left or right associative), thus most tools are dealing only with binary operators.
Is there an easy way to parse arithmetic operators with Parsec, which can have an arbitrary number of arguments?
For example, the following expression should be parsed into the tree
(a + b) + c + d * e + f
Yes! The key is to first solve a simpler problem, which is to model + and * as tree nodes with only two children. To add four things, we'll just use + three times.
This is a great problem to solve since there's a Text.Parsec.Expr module for just this problem. Your example is actually parseable by the example code in the documentation. I've slightly simplified it here:
module Lib where
import Text.Parsec
import Text.Parsec.Language
import qualified Text.Parsec.Expr as Expr
import qualified Text.Parsec.Token as Tokens
data Expr =
Identifier String
| Multiply Expr Expr
| Add Expr Expr
instance Show Expr where
show (Identifier s) = s
show (Multiply l r) = "(* " ++ (show l) ++ " " ++ (show r) ++ ")"
show (Add l r) = "(+ " ++ (show l) ++ " " ++ (show r) ++ ")"
-- Some sane parser combinators that we can plagiarize from the Haskell parser.
parens = Tokens.parens haskell
identifier = Tokens.identifier haskell
reserved = Tokens.reservedOp haskell
-- Infix parser.
infix_ operator func =
Expr.Infix (reserved operator >> return func) Expr.AssocLeft
parser =
Expr.buildExpressionParser table term <?> "expression"
where
table = [[infix_ "*" Multiply], [infix_ "+" Add]]
term =
parens parser
<|> (Identifier <$> identifier)
<?> "term"
Running this in GHCi:
λ> runParser parser () "" "(a + b) + c + d * e + f"
Right (+ (+ (+ (+ a b) c) (* d e)) f)
There are lots of ways of converting this tree to the desired form. Here's a hacky gross slow one:
data Expr' =
Identifier' String
| Add' [Expr']
| Multiply' [Expr']
deriving (Show)
collect :: Expr -> (Expr -> Bool) -> [Expr]
collect e f | (f e == False) = [e]
collect e#(Add l r) f =
collect l f ++ collect r f
collect e#(Multiply l r) f =
collect l f ++ collect r f
isAdd :: Expr -> Bool
isAdd (Add _ _) = True
isAdd _ = False
isMultiply :: Expr -> Bool
isMultiply (Multiply _ _) = True
isMultiply _ = False
optimize :: Expr -> Expr'
optimize (Identifier s) = Identifier' s
optimize e#(Add _ _) = Add' (map optimize (collect e isAdd))
optimize e#(Multiply _ _) = Multiply' (map optimize (collect e isMultiply))
I will note, however, that almost always Expr is Good Enough™ for the purposes of a parser or compiler.

parsec using between to parse parens

If I wanted to parse a string with multiple parenthesized groups into a list of strings holding each group, for example
"((a b c) a b c)"
into
["((a b c) a b c)","( a b c)"]
How would I do that using parsec? The use of between looks nice but it does not seem possible to separate with a beginning and end value.
I'd use a recursive parser:
data Expr = List [Expr] | Term String
expr :: Parsec String () Expr
expr = recurse <|> terminal
where terminal is your primitives, in this case these seem to be strings of characters so
where terminal = Term <$> many1 letter
and recurse is
recurse = List <$>
(between `on` char) '(' ')' (expr `sepBy1` char ' ')
Now we have a nice tree of Exprs which we can gather with
collect r#(List ts) = r : concatMap collect ts
collect _ = []
While jozefg's solution is almost identical to what I came up with (and I completely agree to all his suggestions), there are some small differences that made me think that I should post a second answer:
Due to the expected result of the initial example, it is not necessary to treat space-separated parts as individual subtrees.
Further it might be interesting to see the part that actually computes the expected results (i.e., a list of strings).
So here is my version. As already suggested by jozefg, split the task into several sub-tasks. Those are:
Parse a string into an algebraic data type representing some kind of tree.
Collect the (desired) subtrees of this tree.
Turn trees into strings.
Concerning 1, we first need a tree data type
import Text.Parsec
import Text.Parsec.String
import Control.Applicative ((<$>))
data Tree = Leaf String | Node [Tree]
and then a function that can parse strings into values of this type.
parseTree :: Parser Tree
parseTree = node <|> leaf
where
node = Node <$> between (char '(') (char ')') (many parseTree)
leaf = Leaf <$> many1 (noneOf "()")
In my version I do consider the hole string between parenthesis as a Leaf node (i.e., I do not split at white-spaces).
Now we need to collect the subtrees of a tree we are interested in:
nodes :: Tree -> [Tree]
nodes (Leaf _) = []
nodes t#(Node ts) = t : concatMap nodes ts
Finally, a Show-instance for Trees allows us to turn them into strings.
instance Show Tree where
showsPrec d (Leaf x) = showString x
showsPrec d (Node xs) = showString "(" . showList xs . showString ")"
where
showList [] = id
showList (x:xs) = shows x . showList xs
Then the original task can be solved, e.g., by:
parseGroups :: Parser [String]
parseGroups = map show . nodes <$> parseTree
> parseTest parseGroups "((a b c) a b c)"
["((a b c) a b c)","(a b c)"]

Pretty print expression with as few parentheses as possible?

My Question: What is the cleanest way to pretty print an expression without redundant parentheses?
I have the following representation of lambda expressions:
Term ::= Fun(String x, Term t)
| App(Term t1, Term t2)
| Var(String x)
By convention App is left associative, that is a b c is interpreted as (a b) c and function bodies stretch as far to the right as possible, that is, λ x. x y is interpreted as λ x. (x y).
I have a parser that does a good job, but now I want a pretty printer. Here's what I currently have (pseudo scala):
term match {
case Fun(v, t) => "(λ %s.%s)".format(v, prettyPrint(t))
case App(s, t) => "(%s %s)".format(prettyPrint(s), prettyPrint(t))
case Var(v) => v
}
The above printer always puts ( ) around expressions (except for atomic variables). Thus for Fun(x, App(Fun(y, x), y)) it produces
(λ x.((λ y.x) y))
I would like to have
λ x.(λ y.x) y
Here I'll use a simple grammar for infix expressions with the associativity and precedence defined by the following grammar whose operators are listed in ascending order of precedence
E -> E + T | E - T | T left associative
T -> T * F | T / F | F left associative
F -> G ^ F | G right associative
G -> - G | ( E ) | NUM
Given an abstract syntax tree (AST) we convert the AST to a string with only the necessary parenthesis as described in the pseudocode below. We examine relative precedence and associativity as we recursively descend the tree to determine when parenthesis are necessary. Note that all decisions to wrap parentheses around an expression must be made in the parent node.
toParenString(AST) {
if (AST.type == NUM) // simple atomic type (no operator)
return toString(AST)
else if (AST.TYPE == UNARY_MINUS) // prefix unary operator
if (AST.arg.type != NUM AND
precedence(AST.op) > precedence(AST.arg.op))
return "-(" + toParenString(AST.arg) + ")"
else
return "-" + toParenString(AST.arg)
else { // binary operation
var useLeftParen =
AST.leftarg.type != NUM AND
(precedence(AST.op) > precedence(AST.leftarg.op) OR
(precedence(AST.op) == precedence(AST.leftarg.op) AND
isRightAssociative(AST.op)))
var useRightParen =
AST.rightarg.type != NUM AND
(precedence(AST.op) > precedence(AST.rightarg.op) OR
(precedence(AST.op) == precedence(AST.rightarg.op) AND
isLeftAssociative(AST.op)))
var leftString;
if (useLeftParen) {
leftString = "(" + toParenString(AST.leftarg) + ")"
else
leftString = toParenString(AST.leftarg)
var rightString;
if (useRightParen) {
rightString = "(" + toParenString(AST.rightarg) + ")"
else
rightString = toParenString(AST.rightarg)
return leftString + AST.op + rightString;
}
}
Isn't it that you just have to check the types of the arguments of App?
I'm not sure how to write this in scala..
term match {
case Fun(v: String, t: Term) => "λ %s.%s".format(v, prettyPrint(t))
case App(s: Fun, t: App) => "(%s) (%s)".format(prettyPrint(s), prettyPrint(t))
case App(s: Term, t: App) => "%s (%s)".format(prettyPrint(s), prettyPrint(t))
case App(s: Fun, t: Term) => "(%s) %s".format(prettyPrint(s), prettyPrint(t))
case App(s: Term, t: Term) => "%s %s".format(prettyPrint(s), prettyPrint(t))
case Var(v: String) => v
}

Resources