loading elements from file into a tree in haskell - parsing

I am trying to make a tree from the info in a text document. For example, in example.txt we have the arithmetic expression (3 + x) * (5 - 2). I want to make a tree that looks like this:
Node * (Node + (Leaf 3) (Leaf x)) (Node - (Leaf 5) (Leaf 2))
So far after a lot of unsuccessful attempts I have done this:
data Tree a = Empty
            | Leaf a
            | Node a (Tree a) (Tree a)
            deriving (Show)
This is the tree type I use, and:
take name = do
  elements <- readFile name
  return elements
So how can I put the elements in the tree?

You'll need to make a data type to put in the tree that can store both operations and values. One way to do this would be to create an ADT representing everything you want to store in the tree:
data Eval a
  = Val a
  | Var Char
  | Op (a -> a -> a)

type EvalTree a = Tree (Eval a)
But this isn't really ideal because someone could have Leaf (Op (+)), which doesn't make much sense here. Rather, I would suggest structuring it as
data Eval a
  = Val a
  | Var Char
  | Op (a -> a -> a) (Eval a) (Eval a)
Which is essentially the tree structure you have, just restricted to be syntactically correct. Then you can write a simple evaluator as
eval :: Data.Map.Map Char a -> Eval a -> Maybe a
eval vars (Val a) = Just a
eval vars (Var x) = Data.Map.lookup x vars
eval vars (Op op l r) = do
  left  <- eval vars l
  right <- eval vars r
  return $ left `op` right
This will just walk down both branches, evaluating as it goes, then finally returning the computed value. You just have to supply it with a map from variables to values.
So for example, (3 + x) * (5 - 2) would be represented as Op (*) (Op (+) (Val 3) (Var 'x')) (Op (-) (Val 5) (Val 2)). All that's left is to parse the file, which is another problem entirely.
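For instance, a quick sketch of using the evaluator (expr here is just an illustrative binding; it assumes the eval above and a qualified import of Data.Map):
import qualified Data.Map

expr :: Eval Int
expr = Op (*) (Op (+) (Val 3) (Var 'x')) (Op (-) (Val 5) (Val 2))

-- eval (Data.Map.fromList [('x', 4)]) expr == Just 21
-- eval Data.Map.empty expr                == Nothing   ('x' is unbound)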

Related

Avoiding code duplication for data type with lots of similar constructors

I'm working on writing a simple parser in Haskell and have this datatype, which holds the results of the parse.
data AST = Imm Integer
         | ArgName String
         | Arg Integer
         | Add AST AST
         | Sub AST AST
         | Mul AST AST
         | Div AST AST
         deriving (Show, Eq)
The problem comes when I want to map over the tree to replace variable names with their reference numbers using a map. I have to write this code:
refVars :: M.Map String Integer -> AST -> Maybe AST
refVars d (ArgName s) = case d M.!? s of
  Just n  -> Just (Arg n)
  Nothing -> Nothing
refVars _ (Imm n) = Just $ Imm n
refVars _ (Arg n) = Just $ Arg n
refVars d (Add a1 a2) = Add <$> refVars d a1 <*> refVars d a2
refVars d (Sub a1 a2) = Sub <$> refVars d a1 <*> refVars d a2
refVars d (Mul a1 a2) = Mul <$> refVars d a1 <*> refVars d a2
refVars d (Div a1 a2) = Div <$> refVars d a1 <*> refVars d a2
Which seems incredibly redundant. Ideally I'd want to have one pattern which matches any (op a1 a2) but Haskell won't allow that. Any suggestions?
As proposed in the comments, the fix for your immediate problem is to move the information about the operator type out of the constructor:
data Op = Add | Sub | Mul | Div

data AST = Imm Integer
         | ArgName String
         | Arg Integer
         | Op Op AST AST
This datatype has one constructor for all of the binary operations, so you only need one line to take it apart:
refVars :: M.Map String Integer -> AST -> Maybe AST
refVars d (ArgName s) = Arg <$> d M.!? s
refVars _ (Imm n) = Just $ Imm n
refVars _ (Arg n) = Just $ Arg n
refVars d (Op op a1 a2) = Op op <$> refVars d a1 <*> refVars d a2
You can handle all different types of binary operators without modifying refVars, but if you add different syntactic forms to your AST you'll have to add clauses to refVars.
data AST = -- other constructors as before
         | Ternary AST AST AST
         | List [AST]
         | Call AST [AST]  -- function and args
refVars -- other clauses as before
refVars d (Ternary cond tt ff) = Ternary <$> refVars d cond <*> refVars d tt <*> refVars d ff
refVars d (List l) = List <$> traverse (refVars d) l
refVars d (Call f args) = Call <$> refVars d f <*> traverse (refVars d) args
So it's still tedious - all this code does is traverse the tree to the leaves, whereupon refVars can scrutinise whether the leaf is an ArgName or otherwise. The interesting part of refVars is the one ArgName line; the remaining six lines of the function are pure boilerplate.
It'd be nice if we could define "traverse the tree" separately from "handle ArgNames". This is where generic programming comes in. There are lots of generic programming libraries out there, each with its own style and approach, but I'll demonstrate using lens.
The Control.Lens.Plated module defines a Plated class for types which know how to access their children. The deal is: you show lens how to access your children (by passing them to a callback g), and lens can recursively apply that to access the children's children and so on.
instance Plated AST where
  plate g (Op op a1 a2)        = Op op <$> g a1 <*> g a2
  plate g (Ternary cond tt ff) = Ternary <$> g cond <*> g tt <*> g ff
  plate g (List l)             = List <$> traverse g l
  plate g (Call f args)        = Call <$> g f <*> traverse g args
  plate _ a                    = pure a
Aside: you might object that even writing plate clause-by-clause is rather too much boilerplate. The compiler should be able to locate the AST's children for you. lens has your back: there's a default implementation of plate for any type which is an instance of Data, so you should be able to slap deriving Data onto AST and leave the Plated instance empty.
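For example, a minimal sketch of that shortcut, using the restructured AST from above (DeriveDataTypeable is needed for the derived Data instance):
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Data (Data)
import Control.Lens.Plated (Plated)

data Op = Add | Sub | Mul | Div
  deriving (Show, Eq, Data)

data AST = Imm Integer
         | ArgName String
         | Arg Integer
         | Op Op AST AST
         deriving (Show, Eq, Data)

-- No methods needed: the default plate is built from the Data instance.
instance Plated AST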
Now we can implement refVars using transformM :: (Monad m, Plated a) => (a -> m a) -> a -> m a.
refVars :: M.Map String Integer -> AST -> Maybe AST
refVars d = transformM $ \case
  ArgName s -> Arg <$> d M.!? s
  x         -> Just x
transformM takes a (monadic) transformation function and applies that to every descendant of the AST. Our transformation function searches for ArgName nodes and replaces them with Arg nodes, leaving any non-ArgNames unchanged.
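For example, assuming deriving Show is added to the restructured AST and Op, a quick GHCi check might look like this:
> refVars (M.fromList [("x", 0)]) (Op Add (ArgName "x") (Imm 1))
Just (Op Add (Arg 0) (Imm 1))
> refVars M.empty (Op Add (ArgName "x") (Imm 1))
Nothing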
For a more detailed explanation, see this paper (or the accompanying slides, if you prefer) by Neil Mitchell. It's what the Plated module is based on.
Here's how you could do it with Edward Kmett's recursion-schemes package:
{-# LANGUAGE DeriveTraversable, TemplateHaskell, TypeFamilies #-}
import Data.Functor.Foldable
import Data.Functor.Foldable.TH
import qualified Data.Map as M
data AST = Imm Integer
         | ArgName String
         | Arg Integer
         | Add AST AST
         | Sub AST AST
         | Mul AST AST
         | Div AST AST
         deriving (Show, Eq)
makeBaseFunctor ''AST
refVars :: M.Map String Integer -> AST -> Maybe AST
refVars d (ArgName s) = case d M.!? s of
  Just n  -> Just (Arg n)
  Nothing -> Nothing
refVars d a = fmap embed . traverse (refVars d) . project $ a
This works because your refVars function recurses just like a traverse. Doing makeBaseFunctor ''AST creates an auxiliary type based on your original type that has a Traversable instance. We then use project to switch to the auxiliary type, traverse to do the recursion, and embed to switch back to your type.
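Roughly speaking (the generated names are mechanical: an F suffix and an extra type parameter for the recursive positions), makeBaseFunctor ''AST produces something like:
data ASTF r = ImmF Integer
            | ArgNameF String
            | ArgF Integer
            | AddF r r
            | SubF r r
            | MulF r r
            | DivF r r
            deriving (Functor, Foldable, Traversable)

type instance Base AST = ASTF
along with Recursive and Corecursive instances supplying project and embed, so traverse (refVars d) visits exactly the r positions, i.e. the direct children.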
Side note: you can simplify the ArgName case to just refVars d (ArgName s) = Arg <$> d M.!? s.

Parse error when reading

GHC can't derive Read or Show for complicated GADTs, so I attempted to define custom instances below that satisfy read . show == id. I've simplified the example as much as possible (while still triggering the runtime error like my real code). I decided to let GHC do the heavy lifting of writing Read and Show instances by making newtype wrappers for each GADT constructor (more accurately: for each type output by the GADT). In the Read/Show instances, I simply read/show the newtype wrapper and convert where necessary. I assumed this was foolproof: I'm letting GHC define all of the non-trivial instances. However, I seem to have done something wrong.
In my real code, Foo below is a complicated GADT that GHC won't derive Show or Read for. Since Foo is a wrapper (in part) around a newtype, I use the derived Show/Read instances for that.
{-# LANGUAGE FlexibleContexts, FlexibleInstances, GADTs, ScopedTypeVariables #-}
import Text.Read (Read(readPrec))
newtype Bar r = Bar r deriving (Show, Read)
newtype Foo r = Foo (Bar r)
-- use the GHC-derived Show/Read for Bar
instance (Show r) => Show (Foo r) where
  show (Foo x) = show x

instance (Read r) => Read (Foo r) where
  readPrec = Foo <$> readPrec
This instance seems to work: I can call read . show and I get back the input.
Now I have a wrapper around Foo:
data U t rep r where
  U1 :: t r -> U t Int r
  U2 :: t r -> U t Char r

-- use the Read/Show instances for U1Wrap and U2Wrap
newtype U1Wrap t r = U1Wrap {unU1Wrap :: t r} deriving (Show, Read)
newtype U2Wrap t r = U2Wrap (t r) deriving (Show, Read)
instance (Read (t r)) => Read (U t Int r) where
  readPrec = (U1 . unU1Wrap) <$> readPrec

instance (Read (U2Wrap t r)) => Read (U t Char r) where
  readPrec = do
    x <- readPrec
    return $ case x of
      (U2Wrap y) -> U2 y

instance (Show (t r)) => Show (U t Int r) where
  show (U1 x) = show $ U1Wrap x

instance (Show (t r)) => Show (U t Char r) where
  show (U2 x) = show (U2Wrap x :: U2Wrap t r)
Like Foo, U is a complicated GADT, so I define custom newtype wrappers for each output type of the GADT. Unfortunately, my Show/Read instances don't work:
main :: IO ()
main = do
  let x = U1 $ Foo $ Bar 3
      y = U2 $ Foo $ Bar 3
  print $ show (read (show x) `asTypeOf` x)
  print $ show (read (show y) `asTypeOf` y)
main prints the first line, but I get Exception: Prelude.read: no parse on the second line.
This is my first time using Read, so I suspect I'm doing something incorrectly, though I don't see what that is.
My questions are:
Why am I getting this error, and logically how can I fix it? (There are several ways to poke the minimal example above to make the error go away, but I can't do those things in my real code.)
Is there a different/better high-level approach to Reading GADTs?
Your custom Show instance for Foo doesn't parenthesize correctly.
> U2 $ Foo $ Bar 3
U2Wrap Bar 3
When writing custom Show instances, you should define showsPrec instead. That's because show just gives back a string independently of the context, while showsPrec parenthesizes based on precedence. See Text.Show for further documentation.
instance (Show r) => Show (Foo r) where
  showsPrec n (Foo x) = showsPrec n x
Which works here.
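With that change, the problematic value now shows with the parentheses that the derived Read instance for U2Wrap expects, so the round-trip succeeds (expected GHCi output):
> U2 $ Foo $ Bar 3
U2Wrap (Bar 3)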
I don't know of an elegant approach that would automatically get us a Read instance for this GADT. The deriving mechanism can't seem to figure out that only a single constructor has to be considered for each rep.
At least Show can be derived even here. I also include here a manual Read instance which I hope conforms to Show. I tried to mimic the definitions in Text.Read, which you could also do in other cases. Alternatively, one could use the -ddump-deriv GHC argument to look at derived Read instances, and try to copy them in custom code.
{-# LANGUAGE GADTs, StandaloneDeriving, FlexibleInstances, FlexibleContexts #-}

import Text.Read
import Data.Proxy

data U t rep r where
  U1 :: t r -> U t Int r
  U2 :: t r -> U t Char r

deriving instance Show (t r) => Show (U t rep r)

instance Read (t r) => Read (U t Int r) where
  readPrec = parens $ do
    prec 10 $ do
      Ident "U1" <- lexP
      U1 <$> readPrec

instance Read (t r) => Read (U t Char r) where
  readPrec = parens $ do
    prec 10 $ do
      Ident "U2" <- lexP
      U2 <$> readPrec
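A quick way to exercise these instances (a hypothetical GHCi session, reusing the Bar newtype from the question as t):
> let z = U2 (Bar 3) :: U Bar Char Integer
> read (show z) `asTypeOf` z
U2 (Bar 3)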

Parse string with lex in Haskell

I'm following the Gentle Introduction to Haskell tutorial, and the code presented there seems to be broken. I need to understand whether that is really so, or whether my understanding of the concept is wrong.
I am implementing a parser for a custom type:
data Tree a = Leaf a | Branch (Tree a) (Tree a)
A printing function for convenience:
showsTree :: Show a => Tree a -> String -> String
showsTree (Leaf x) = shows x
showsTree (Branch l r) = ('<':) . showsTree l . ('|':) . showsTree r . ('>':)
instance Show a => Show (Tree a) where
  showsPrec _ x = showsTree x
This parser is fine, but it breaks when there are spaces:
readsTree :: (Read a) => String -> [(Tree a, String)]
readsTree ('<':s) = [(Branch l r, u) | (l, '|':t) <- readsTree s,
                                       (r, '>':u) <- readsTree t ]
readsTree s       = [(Leaf x, t) | (x,t) <- reads s]
This one is said to be a better solution, but it does not work without spaces:
readsTree_lex :: (Read a) => String -> [(Tree a, String)]
readsTree_lex s = [(Branch l r, x) | ("<", t) <- lex s,
                                     (l, u)   <- readsTree_lex t,
                                     ("|", v) <- lex u,
                                     (r, w)   <- readsTree_lex v,
                                     (">", x) <- lex w ]
                  ++
                  [(Leaf x, t) | (x, t) <- reads s ]
Next I pick one of the parsers to use with read:
instance Read a => Read (Tree a) where
  readsPrec _ s = readsTree s
Then I load it in GHCi using Leksah debug mode (this is irrelevant, I guess) and try to parse two strings:
read "<1|<2|3>>" :: Tree Int -- succeeds with readsTree
read "<1| <2|3> >" :: Tree Int -- succeeds with readsTree_lex
When lex encounters the |<2... part of the former string, it splits it into ("|<", _). That does not match the ("|", v) <- lex u part of the parser, so parsing fails to complete.
Two questions arise:
how do I define a parser that really ignores spaces rather than requiring them?
how can I define the rules lex uses to split up the literals it encounters?
Speaking of the second question: it is asked more out of curiosity, as defining my own lexer seems more correct than redefining the rules of an existing one.
lex splits into Haskell lexemes, skipping whitespace.
This means that since Haskell permits |< as a lexeme, lex will not split it into two lexemes, since that's not how it parses in Haskell.
You can only use lex in your parser if you're using the same (or similar) syntactic rules to Haskell.
If you want to ignore all whitespace (as opposed to making any whitespace equivalent to one space), it's much simpler and more efficient to first run filter (not.isSpace).
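For instance, a small sketch of that approach (it assumes whitespace is never significant inside a leaf token, which holds for something like Tree Int):
import Data.Char (isSpace)

instance Read a => Read (Tree a) where
  readsPrec _ = readsTree . filter (not . isSpace)

-- read "<1| <2|3> >" :: Tree Int now succeeds with the plain readsTree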
The answer to this seems to be a small gap between the text of the Gentle Introduction to Haskell and its code samples, plus an error in the sample code.
There should also be one more lexer, but there is no working example (satisfying my need) in the codebase, so I wrote one. Please point out any flaws in it:
lexAll :: ReadS String
lexAll s = case lex s of
  [("", _)] -> []                                  -- nothing to parse
  [(c, r)]  -> if length c == 1
                 then [(c, r)]                     -- try to match it only as it was lexed,
                 else [(c, r), ([head s], tail s)] -- otherwise also try splitting off the first character
  any_else  -> any_else
The author says:
Finally, the complete reader. This is not sensitive to white space as
were the previous versions. When you derive the Show class for a data
type the reader generated automatically is similar to this in style.
but lexAll should be used instead of lex (which seems to be said error):
readsTree' :: (Read a) => ReadS (Tree a)
readsTree' s = [(Branch l r, x) | ("<", t) <- lexAll s,
                                  (l, u)   <- readsTree' t,
                                  ("|", v) <- lexAll u,
                                  (r, w)   <- readsTree' v,
                                  (">", x) <- lexAll w ]
               ++
               [(Leaf x, t) | (x, t) <- reads s]
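With lexAll in place (and the Read instance switched to readsTree'), both test strings from the question should parse (expected GHCi session):
> read "<1|<2|3>>" :: Tree Int
<1|<2|3>>
> read "<1| <2|3> >" :: Tree Int
<1|<2|3>>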

Bottleneck in math parser Haskell

I got the code below from the Wikibooks page here. It parses math expressions, and it works very well for the code I'm working on. There is one problem, though: when I start to add layers of brackets to my expression, the program slows down dramatically, at some point crashing my computer. It has something to do with the number of operators I have it check for; the more operators I have, the fewer brackets I can parse. Is there any way to get around or fix this bottleneck?
Any help is much appreciated.
import Text.ParserCombinators.ReadP

-- slower
operators = [("Equality",'='), ("Sum",'+'), ("Product",'*'), ("Division",'/'), ("Power",'^')]
-- faster
-- ~ operators = [("Sum",'+'), ("Product",'*'), ("Power",'^')]

skipWhitespace = do
  many (choice (map char [' ','\n']))
  return ()

brackets p = do
  skipWhitespace
  char '('
  r <- p
  skipWhitespace
  char ')'
  return r

data Tree op = Apply (Tree op) (Tree op) | Branch op (Tree op) (Tree op) | Leaf String deriving Show

leaf = chainl1 (brackets tree
                +++ do
                      skipWhitespace
                      s <- many1 (choice (map char "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-[]"))
                      return (Leaf s))
               (return Apply)

tree = foldr (\(op,name) p ->
                let this = p +++ do
                             a <- p +++ brackets tree
                             skipWhitespace
                             char name
                             b <- this
                             return (Branch op a b)
                in this)
             (leaf +++ brackets tree)
             operators

readA str = fst $ last $ readP_to_S tree str

main = do loop

loop = do
  -- ~ try this
  -- ~ (a+b+(c*d))
  str <- getLine
  print $ last $ readP_to_S tree str
  loop
This is a classic problem in backtracking (or parallel parsing; they are basically the same thing).... Backtracking grows (at worst) exponentially with the size of the input, so the time to parse something can suddenly explode. In practice backtracking works OK in language parsing for most input, but explodes with recursive infix operator notation. You can see why by considering how many possible ways this could be parsed (using made up & and % operators):
a & b % c & d
could be parsed as
a & (b % (c & d))
a & ((b % c) & d)
(a & (b % c)) & d
((a & b) % c) & d
This grows like 2^(n-1). The solution is to add some operator precedence information earlier in the parse and throw away all but the sensible cases.... You will need an extra stack to hold pending operators, but you can then get through an infix operator expression in a single pass (each operator is pushed and popped at most once).
LR parsers like yacc do this for you.... With a parser combinator you need to do it by hand. In parsec, there is a Text.Parsec.Expr module with a buildExpressionParser function that builds this for you.
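For example, here is a rough sketch of that approach with parsec's Text.Parsec.Expr. It is not a drop-in replacement for the ReadP code above (it drops the Apply case, hard-codes the precedence levels, and the names are only illustrative), but it shows how each precedence level is handled once instead of by backtracking:
import Data.Functor.Identity (Identity)
import Text.Parsec
import Text.Parsec.Expr
import Text.Parsec.String (Parser)

data Expr = Leaf String | Branch String Expr Expr deriving Show

-- consume trailing whitespace after a token
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

term :: Parser Expr
term = lexeme (between (lexeme (char '(')) (char ')') expr)
   <|> Leaf <$> lexeme (many1 (alphaNum <|> oneOf ".-[]"))

-- one row per precedence level, highest first
table :: OperatorTable String () Identity Expr
table = [ [ binary "Power"    '^' AssocRight ]
        , [ binary "Product"  '*' AssocLeft, binary "Division" '/' AssocLeft ]
        , [ binary "Sum"      '+' AssocLeft ]
        , [ binary "Equality" '=' AssocNone ] ]
  where binary name c assoc = Infix (Branch name <$ lexeme (char c)) assoc

expr :: Parser Expr
expr = buildExpressionParser table term

parseExpr :: String -> Either ParseError Expr
parseExpr = parse (spaces *> expr <* eof) ""
-- e.g. parseExpr "(a+b+(c*d))"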

Sequence in F# folding triples

I've googled and read, and I'm trying to find a "correct" way to do it, but every question I read on SO seems to have completely different answers.
Here is the gist of my problem. files is a seq of triples (a: string, b: string, c: Int64). Being new to F#, I'm still not fluent in expressing type signatures (or, for that matter, understanding them). a is a filename, b is an internal identifier, and c is a value representing the length (size) of the file. baseconfig is a string from earlier in the code.
ignore(files
       |> Seq.filter( fun(x,y,z) -> y = baseconfig) // used to filter only files we want
       |> Seq.fold( fun f n ->
              if( (fun (_,_,z) -> z) n > 50L*1024L*1024L) then
                  zipfilex.Add((fun (z:string, _, _) -> z) n)
                  printfn("Adding 50mb to zip")
                  zipfilex.CommitUpdate()
                  zipfilex.BeginUpdate()
                  ("","",0L)
              else
                  zipfilex.Add((fun (z, _, _) -> z) n)
                  ("", "", (fun (_, _, z:Int64) -> z) n + (fun (_, _, z:Int64) -> z) f)
          ) ("","",0L)
      )
What this chunk of code is supposed to do is iterate through each file in files, add it to a zip archive (but not really: it just goes on a list to be committed later), and when the pending files exceed 50MB, commit them to the zip archive. Adding a file is cheap, committing is expensive, so I try to mitigate the cost by batching it.
So far the code kinda works... except for the ObjectDisposedException I got when it approached 150MB of committed files. But I'm not sure this is the right way to do such an operation. It feels like I'm using Seq.fold in an unconventional way, but I don't know of a better way to do it.
Bonus question: Is there a better way to snipe values out of tuples? fst and snd only work for 2 valued tuples, and I realize you can define your own functions instead of inline them like I did, but it seems there should be a better way.
Update: In my previous attempts at fold, I couldn't understand why I couldn't just use an Int64 as the accumulator. It turns out I was missing some critical parentheses. A slightly simpler version is below; it also eliminates all the crazy tuple extraction.
ignore(foundoldfiles
       |> Seq.filter( fun (x,y,z) -> y = baseconfig)
       |> Seq.fold( fun (a) (f,g,j) ->
              zipfilex.Add( f)
              if( a > 50L*1024L*1024L) then
                  printfn("Adding 50mb to zip")
                  zipfilex.CommitUpdate()
                  zipfilex.BeginUpdate()
                  0L
              else
                  a + j
          ) 0L
      )
Update 2: I'm going to have to go with an imperative solution. F# is somehow re-entering this block of code after the zip file is closed in the statement which follows it, which explains the ObjectDisposedException. I have no idea how that works or why.
As an alternative to the "dirty" imperative style, you can extend the Seq module with a general and reusable function for chunking. The function is a bit like fold, but it takes a lambda that returns option<'State>. If it returns None, then a new chunk is started and otherwise the element is added to the previous chunk. Then you can write an elegant solution:
files
|> Seq.filter(fun (x, y, z) -> y = baseconfig)
|> Seq.chunkBy(fun (x, y, z) sum ->
       if sum + z > 50L*1024L*1024L then None
       else Some(sum + z)) 0L
|> Seq.iter(fun files ->
       zipfilex.BeginUpdate()
       for f, _, _ in files do zipfilex.Add(f)
       zipfilex.CommitUpdate())
The implementation of the chunkBy function is a bit longer - it needs to use IEnumerator directly & it can be expressed using recursion:
module Seq =
  let chunkBy f initst (files:seq<_>) =
    let en = files.GetEnumerator()
    let rec loop chunk st = seq {
      if not (en.MoveNext()) then
        if chunk <> [] then yield chunk
      else
        match f en.Current st with
        | Some(nst) -> yield! loop (en.Current::chunk) nst
        | None ->
            yield chunk
            yield! loop [en.Current] initst }
    loop [] initst
I don't think your problem benefits from the use of fold. It's most useful when building immutable structures. My opinion, in this case, is that it makes what you're trying to do less clear. The imperative solution works nicely:
let mutable a = 0L
for (f, g, j) in foundoldfiles do
    if g = baseconfig then
        zipfilex.Add(f)
        if a > 50L * 1024L * 1024L then
            printfn "Adding 50mb to zip"
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            a <- 0L
        else
            a <- a + j
Here's my take:
let inline zip a b = a, b

foundoldfiles
|> Seq.filter (fun (_, internalid, _) -> internalid = baseconfig)
|> zip 0L
||> Seq.fold (fun acc (filename, _, filesize) ->
        zipfilex.Add filename
        let acc = acc + filesize
        if acc > 50L*1024L*1024L then
            printfn "Adding 50mb to zip"
            zipfilex.CommitUpdate ()
            zipfilex.BeginUpdate ()
            0L
        else acc)
|> ignore
Some notes:
The zip helper function makes for a clean pipeline through the entire function without any overhead, and in more complex scenarios helps with type inference since the state gets shifted from the right to the left side of the fold function (though that doesn't matter or help in this particular case)
The use of _ to locally discard elements of the tuple that you don't need makes the code easier to read
The approach of pipelining into ignore rather than wrapping the entire expression with extra parentheses makes the code easier to read
Wrapping the arguments of unary functions in parentheses looks bizarre; you can't use parentheses for non-unary curried functions, so using them for unary functions is inconsistent. My policy is to reserve parentheses for constructor calls and tupled-function calls
EDIT: P.S. if( a > 50L*1024L*1024L) then is incorrect logic -- the if needs to take into account the accumulator plus the current filesize. E.g., if the first file was >= 50MB then the if wouldn't trigger.
If you're not fond of mutable variables and imperative loops, you could always rewrite this using an explicit recursive (functional) loop:
let rec loop acc = function
    | (file, id, size) :: files ->
        if id = baseconfig then
            zipfilex.Add file
            if acc > 50L*1024L*1024L then
                printfn "Adding 50mb to zip"
                zipfilex.CommitUpdate()
                zipfilex.BeginUpdate()
                loop 0L files
            else
                loop (acc + size) files
        else
            loop acc files
    | [] -> ()

loop 0L foundoldfiles
The advantage of this is it explicitly states the three different ways that the inductive case can proceed and how the accumulator is transformed in each case (so you're less likely to get this wrong - witness the bug in Daniel's for loop version).
You could even move the baseconfig check into a when clause:
let rec loop acc = function
    | (file, id, size) :: files when id = baseconfig ->
        zipfilex.Add file
        if acc > 50L*1024L*1024L then
            printfn "Adding 50mb to zip"
            zipfilex.CommitUpdate()
            zipfilex.BeginUpdate()
            loop 0L files
        else
            loop (acc + size) files
    | _ :: files -> loop acc files
    | [] -> ()

loop 0L foundoldfiles
