Parser in Haskell for Triples - parsing

I am currently doing an assignment about Parsing in Haskell, but I am struggling with some of the basics.
Assignment :
I am supposed to create a function which parses a string into a list of Triples.
So that:
A, B, C
,E ,D
would result in
Triples [("A","B","C"), ("A","E","D")]
The input string is going to include ;\n as an indication of the beginning of a new Triple. The string is going to end with a dot.
The elements of the Triples can be letters or digits or combination,
e.g. abc, a, 1, abc121.
Therefore,
"a,b,c;\n d,e;\n f,g;\n h,i."
would result in:
Triples [("a","b","c"),("a","d","e"),("a","f","g"),("a","h","i")]
My current solution:
parseTriplesD :: Parser Triples
parseTriplesD = parseTriples
  >>= \rs -> return (Triples rs)
This function is pretty simple and correct. It takes the string and returns an object of the newtype Triples with the list of Triples created by parseTriples.
parseTriples :: Parser [Triple]
parseTriples = parseTriple
  >>= \r -> ((string ";\n" >> parseTriples >>= \rs -> return (r:rs))
             P.<|> (return [r]))
This function needs some work. My idea is to use another function that creates a Triple from three elements of the input string, skips the \n, and recursively calls itself while adding the created triples to a return list. When that fails because only one Triple can be created, it returns a list containing just that Triple.
I somehow need to create the first Triple, and then use first element of this triple as the first element of the other ones.
Question 1
How do I create the first Triple and use the first Elements of the Triple for the other Triples?
parseTriple :: Parser Triple
parseTriple = P.many (letter<|>digit) >>= \a -> P.char ','
  >> P.many (letter<|>digit) >>= \b -> P.char ','
  >> P.many (letter<|>digit) >>= \c -> return (a,b,c)
This function is pretty simple, but I am not sure if it's correct.
My idea is that it takes the first few characters of the string which are either a letter or a digit, up until the comma ",", and saves these characters in a.
This is repeated three times, and then it creates and returns a Triple with the three elements.
Question 2
How do I take only a few characters (which are either a letter or a digit EDIT: Or A SPACE Character) of the string up until the comma?
Is P.many (letter<|>digit) correct?
What we are given:
The Triples data structure:
newtype Triples = Triples [Triple] deriving (Show,Eq)
type Triple = (String, String, String)
Imports:
import Test.HUnit (runTestTT,Test(TestLabel,TestList),(~?=))
import qualified Text.Parsec as P (char,runP,noneOf,many,(<|>),eof)
import Text.ParserCombinators.Parsec
import Text.Parsec.String
import Text.Parsec.Char
import Data.Maybe
Test cases
runParsec :: Parser a -> String -> Maybe a
runParsec parser input = case P.runP parser () "" input of
  Left _ -> Nothing
  Right a -> Just a
-- | Tests the implementations of 'parseScore'.
main :: IO ()
main = do
  testresults <- runTestTT tests
  print testresults
-- | List of tests for 'parseScore'.
tests :: Test
tests = TestLabel "parseScoreTest" (TestList [
  runParsec parseTriplesD "0,1,2;\n2,3." ~?= Just (Triples [("0","1","2"),("0","2","3")]),
  runParsec parseTriplesD "a,bcde ,23." ~?= Just (Triples [("a","bcde ","23")]),
  runParsec parseTriplesD "a,b,c;\n d,e;\n f,g;\n h,i." ~?= Just (Triples [("a","b","c"),("a","d","e"),("a","f","g"),("a","h","i")]),
  runParsec parseTriplesD "a,bcde23." ~?= Nothing,
  runParsec parseTriplesD "a,b,c;d,e;f,g;h,i." ~?= Nothing,
  runParsec parseTriplesD "a,b,c;\nd;\nf,g;\nh,i." ~?= Nothing
  ])

What you could do is:
1. Parse the first element
2. Parse a list of pairs
3. Add the first element to each of the pairs to create triples
Using do notation will make your code more readable.
You can use alphaNum as a shorthand for letter <|> digit.
parseTriplesD :: Parser Triples
parseTriplesD = Triples <$> parseTriples
parseTriples :: Parser [Triple]
parseTriples = do
  a <- parseString
  char ','
  pairs <- parsePair `sepBy1` string ";\n"
  char '.'
  eof
  return (map (\(b, c) -> (a, b, c)) pairs)
parsePair :: Parser (String, String)
parsePair = do
  first <- parseString
  char ','
  second <- parseString
  return (first, second)
parseString :: Parser String
parseString = many (char ' ') >> many (alphaNum <|> char ' ')

Related

Passing argument to a ReadP Parser in Haskell

I am trying to create a parser from scratch in Haskell. I have problems passing a string as an argument to a function that is already part of a do block in which the parsing occurs. Why does the following minimal example return [] and not 4 as expected?
import Data.Char
import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>))
type Parser a = ReadP a
token :: Parser a -> Parser a
token combinator = (do spaces
                       combinator)
space :: Parser Char
space = satisfy isSpace
spaces :: Parser String
spaces = many space
parseString input = readP_to_S (do
  e <- pExpr
  token eof
  return e) input
pExpr = (do
  pv <- pOpHelper
  spaces
  str <- string pv
  return str
  )
pOpHelper :: Parser String
pOpHelper = (do
  e1 <- munch isDigit
  return e1
  )
I am of course interested in returning a processed version of whatever string pv returns. However, I can't understand why the current setup returns nothing but [] on parseString "4", since calling just pOpHelper without pExpr seems to work.
Edit
I think I have located the 'bug' to be part of the string function. I had a closer look at it here but I can't see from the documentation why it shouldn't work in the above. But the above code is narrowed down to the parts that produce the unintended outputs as specified.
EDIT EDIT
I have now narrowed the problem down even further. It has to do with how 'consumption' works for the parser. Given parseString "4", string pv expects the "4" returned by pOpHelper, but by then the parser has moved on to the next characters, because munch isDigit has already consumed the digit. This means it returns [("4","")] rather than [] only if the input is "4 4" (i.e. parseString "4 4"), and only if spaces has been added to the do block in pExpr.
But how can I work around this and avoid 'consuming' the input string? Is there a way to use look, for instance, from the documentation above?
As pointed out in the comments below, I am interested in transforming whatever is the input to pOpHelper and then passing its output to functions (in a recursion) that are part of the calling parent parser. But how can I do that without pOpHelper consuming the input first, so that the following example would return str on an input of "4":
pExpr = (do
  pv <- pOpHelper
  --spaces
  str <- string pv
  if str == "(4)" then return str -- do stuff!
  else pfail
  )
pOpHelper :: Parser String
pOpHelper = (do
  e1 <- munch isDigit
  return ( "(" ++ e1 ++ ")" )
  )
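One possible way around the consumption problem, offered as a sketch rather than a definitive fix: `look` from Text.ParserCombinators.ReadP returns the remaining input without consuming it, so the digits can be inspected first and consumed afterwards with `string`. The names `peekDigits` and `pExprPeek` are made up for this illustration and do not appear in the question.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- Return the leading digits of the remaining input WITHOUT consuming them.
-- (hypothetical helper, built on ReadP's 'look')
peekDigits :: ReadP String
peekDigits = takeWhile isDigit <$> look

-- Peek first, then consume exactly what was seen, then wrap the result.
pExprPeek :: ReadP String
pExprPeek = do
  pv  <- peekDigits        -- input is still untouched here
  str <- string pv         -- now the digits are actually consumed
  eof
  return ("(" ++ str ++ ")")

-- ghci> readP_to_S pExprPeek "4"
-- [("(4)","")]
```

The transformation (adding the parentheses) happens after `string` has matched the raw digits, so the parser never looks for text that is not in the input.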

How do I parse S-expressions into a data structure in Haskell?

I'm new to Haskell and could use some guidance.
The challenge: take an S-expression and parse it into a record.
Where I have succeeded: I can take a file and read it into a parsed String.
Yet converting the parsed Text into a DFA with
let
  toDFA :: [L.Text] -> EntryDFA
  toDFA t =
    let [q,a,d,s,f] = t
    in EntryDFA {
         state = read q
       , alpha = read a
       , delta = read d
       , start = read s
       , final = read f }
returns this error:
• Couldn't match type ‘L.Text’ with ‘[Char]’
Expected type: String
Actual type: L.Text
There must be a more idiomatic approach.
read is a partial function with type Read a => String -> a, which throws an exception on parsing failure. Normally you want to avoid it (use readMaybe instead if you have a string). String and L.Text are different types, which is why you're getting an error.
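As a small sketch of the `readMaybe` suggestion: the Text value has to be converted to a String before it can be read. This assumes that `L` in the question is `Data.Text.Lazy` (the import is not shown there); `readText` is a hypothetical helper name.

```haskell
import qualified Data.Text.Lazy as L
import Text.Read (readMaybe)

-- Convert the lazy Text to a String, then parse it without throwing.
-- Returns Nothing instead of raising an exception on malformed input.
readText :: Read a => L.Text -> Maybe a
readText = readMaybe . L.unpack

-- ghci> readText (L.pack "42") :: Maybe Int
-- Just 42
-- ghci> readText (L.pack "oops") :: Maybe Int
-- Nothing
```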
Your sample code is missing an extra ) after the trans-func.
I'm using the Megaparsec package which provides an easy way to work with parser combinators. The author of the library has written a longer tutorial here.
The basic idea is that Parser a is the type of a value that can parse something of type a. In Text.Megaparsec there are several functions which you can use (parse, parseMaybe etc.), to "run" the parser on a "stringy" data type (e.g. String or strict/lazy Text).
When you use do notation for IO, it means "do one action after another". Similarly, you can use do notation with Parser, it means "parse this one thing, then parse the next thing".
p1 *> p2 means run the parser p1, run p2 and return the result of running p2. p1 <* p2 means run the parser p1, run p2 and return the result of running p1. You can also look up documentation on Hoogle in case you're having trouble understanding something.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE NamedFieldPuns #-}
-- In practice, many of these imports would be unqualified, but I've
-- opted for explicitness for clarity.
import Control.Applicative (empty, many, some, (<*), (*>))
import Control.Exception (try, IOException)
import Data.Maybe (fromMaybe)
import Data.Set (Set)
import Data.Text (Text)
import qualified Data.Set as Set
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MPC
import qualified Text.Megaparsec.Char.Lexer as MPCL
type Q = Text
type E = Char
data EntryDFA = EntryDFA
  { state :: Set Q
  , alpha :: Set E
  , delta :: Set (Q,E,Q)
  , start :: Q
  , final :: Set Q
  } deriving Show
inputFile = "foo.sexp"
main :: IO ()
main = do
  -- read file and check for exception instead of checking if
  -- it exists and then trying to read it
  result <- try (TIO.readFile inputFile)
  case result of
    Left e -> print (e :: IOException)
    Right txt -> do
      case MP.parse dfaParser inputFile txt of
        Left e -> print e
        Right dfa -> print dfa
type Parser = MP.Parsec () Text
-- There are no comments in the S-exprs, so leave those empty
spaceConsumer :: Parser ()
spaceConsumer = MPCL.space MPC.space1 empty empty
symbol :: Text -> Parser Text
symbol txt = MPCL.symbol spaceConsumer txt
parens :: Parser a -> Parser a
parens p = MP.between (symbol "(") (symbol ")") p
setP :: Ord a => Parser a -> Parser (Set a)
setP p = do
  items <- parens (p `MP.sepBy1` (symbol ","))
  return (Set.fromList items)
pair :: Parser a -> Parser b -> Parser (a, b)
pair p1 p2 = parens $ do
  x1 <- p1
  x2 <- symbol "," *> p2
  return (x1, x2)
stateP :: Parser Text
stateP = do
  c <- MPC.letterChar
  cs <- many MPC.alphaNumChar
  return (T.pack (c:cs))
dfaParser :: Parser EntryDFA
dfaParser = do
  () <- spaceConsumer
  (_, state) <- pair (symbol "states") (setP stateP)
  (_, alpha) <- pair (symbol "alpha") (setP alphaP)
  (_, delta) <- pair (symbol "trans-func") (setP transFuncP)
  (_, start) <- pair (symbol "start") valP
  (_, final) <- pair (symbol "final") (setP valP)
  return (EntryDFA {state, alpha, delta, start, final})
  where
    alphaP :: Parser Char
    alphaP = MPC.letterChar <* spaceConsumer
    transFuncP :: Parser (Text, Char, Text)
    transFuncP = parens $ do
      s1 <- stateP
      a <- symbol "," *> alphaP
      s2 <- symbol "," *> stateP
      return (s1, a, s2)
    valP :: Parser Text
    valP = fmap T.pack (some MPC.digitChar)
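The `*>` and `<*` operators described above can be seen in isolation in the following minimal sketch. It uses a plain `String` input and `Void` as the error component, which are choices made for this illustration only; `digitsInParens` is a hypothetical name.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, parseMaybe, some)
import Text.Megaparsec.Char (char, digitChar)

-- Illustrative parser type: no custom errors, String input.
type P = Parsec Void String

-- Run all three parsers in order, but keep only the middle result:
-- '*>' drops what is on its left, '<*' drops what is on its right.
digitsInParens :: P String
digitsInParens = char '(' *> some digitChar <* char ')'

-- ghci> parseMaybe digitsInParens "(42)"
-- Just "42"
-- ghci> parseMaybe digitsInParens "(42"
-- Nothing
```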

Parsing simple molecule names with Attoparsec

I find it extremely difficult to learn how to use Attoparsec, because the documentation is really just an API documentation and there are basically no tutorials around (except the one from FPComplete). If you know other places where I can learn Attoparsec, that'd be great.
I have to parse simple molecule names, in the following format: NaCl, CO2, H2O, HCN, H2O2.
An element name is an uppercase letter optionally followed by a lowercase one (I'm not considering those elements with a symbol longer than 2 characters).
An element can be followed by a number (that would be the subscript in a formula).
New version (thanks to Mark's and Tarmil's suggestions), which compiles but does not parse:
module Chem
where
import Data.Text (Text, pack)
import Control.Applicative ((<*>), (<$>))
import Data.Attoparsec.Text
data Element = Element String Int deriving (Eq, Ord, Show)
type Molecule = [Element]
parseString :: String -> Result Molecule
parseString = parse (many' parseElement) . pack
parseElement :: Parser Element
parseElement = do
  el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
  n <- option 1 decimal
  return $ Element el n
pClass :: String -> Parser String
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)
Any suggestion is appreciated.
EDIT: I managed to get it running. Basically, a Partial continuation was returned, and to finish the parsing it's necessary to feed the parser an empty input. So the correct parseString would be:
parseString = flip feed empty . parse (many' parseElement) . pack
where empty is Data.Text.empty. However, since I don't need incremental parsing there is the useful function parseOnly, which does not wait for more input and returns an Either.
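The difference between the two approaches can be sketched on a smaller parser. This is only an illustration: `finish`, `viaFeed` and `viaParseOnly` are hypothetical names, and `decimal` stands in for the molecule parser.

```haskell
import Data.Attoparsec.Text
import qualified Data.Text as T

-- Signal end of input to an incremental result, then collapse it to Either.
finish :: Result a -> Either String a
finish r = eitherResult (feed r T.empty)

-- Incremental interface: 'parse' may return Partial (more digits could
-- still arrive), so the empty Text is fed to tell it the input is over.
viaFeed :: T.Text -> Either String Int
viaFeed = finish . parse decimal

-- Non-incremental interface: 'parseOnly' treats the input as complete.
viaParseOnly :: T.Text -> Either String Int
viaParseOnly = parseOnly decimal

-- ghci> viaFeed (T.pack "42")
-- Right 42
-- ghci> viaParseOnly (T.pack "42")
-- Right 42
```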
With that in mind, I rewrote the code like this (it works now):
module Chem
where
import Data.Text (Text, pack)
import Control.Applicative ((<*>), (<$>))
import Data.Attoparsec.Text
data Element = Element String Int deriving (Eq, Ord, Show)
type Molecule = [Element]
parseString :: String -> Either String Molecule
parseString = parseOnly (many' parseElement) . pack
parseElement :: Parser Element
parseElement = do
  el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
  n <- option 1 decimal
  return $ Element el n
pClass :: String -> Parser String
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)
You have two problems in the letters parsing part:
inClass is not a parser, it is a function that is meant to be passed to satisfy.
<*> has type Parser (a -> b) -> Parser a -> Parser b, so the parser on the left should return a function. Typically, it is used like this:
pf <$> p1 <*> p2 <*> ... <*> pn
where pf is a function with n arguments.
So here you probably want something like this:
-- parse a character in the given class, and transform it to a single-char string
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)
-- ...
el <- ((++) <$> pClass "A-Z" <*> pClass "a-z") <|> pClass "A-Z"
-- ...
I think this would be enhanced by using option, instead of duplicating the A-Z parser:
el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
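The `pf <$> p1 <*> ... <*> pn` pattern mentioned above can also replace the do block entirely. The following is a sketch under the same Element type as the question; `symbolP`, `elementP` and `molecule` are names chosen for this illustration, and `endOfInput` is added here so that trailing garbage is rejected.

```haskell
import Data.Attoparsec.Text
import qualified Data.Text as T

data Element = Element String Int deriving (Eq, Show)

-- One uppercase letter, optionally followed by one lowercase letter.
-- (:) is the two-argument function applied across the two parsers.
symbolP :: Parser String
symbolP = (:) <$> satisfy (inClass "A-Z")
              <*> option "" ((: []) <$> satisfy (inClass "a-z"))

-- Element is the two-argument function here; the count defaults to 1.
elementP :: Parser Element
elementP = Element <$> symbolP <*> option 1 decimal

molecule :: String -> Either String [Element]
molecule = parseOnly (many' elementP <* endOfInput) . T.pack

-- ghci> molecule "H2O"
-- Right [Element "H" 2,Element "O" 1]
```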

parsec using between to parse parens

If I wanted to parse a string with multiple parenthesized groups into a list of strings holding each group, for example
"((a b c) a b c)"
into
["((a b c) a b c)","( a b c)"]
How would I do that using parsec? The use of between looks nice but it does not seem possible to separate with a beginning and end value.
I'd use a recursive parser:
data Expr = List [Expr] | Term String
expr :: Parsec String () Expr
expr = recurse <|> terminal
where terminal is your primitives, in this case these seem to be strings of characters so
where terminal = Term <$> many1 letter
and recurse is
recurse = List <$>
  (between `on` char) '(' ')' (expr `sepBy1` char ' ')
Now we have a nice tree of Exprs which we can gather with
collect r@(List ts) = r : concatMap collect ts
collect _ = []
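Assembled into one module, the fragments of this answer might look like the sketch below. The `render` and `groups` functions are not part of the answer; they are hypothetical additions that turn the collected subtrees back into strings, assuming single spaces between elements as in the example input.

```haskell
import Data.Function (on)
import Text.Parsec
import Text.Parsec.String (Parser)

data Expr = List [Expr] | Term String deriving (Show, Eq)

expr :: Parser Expr
expr = recurse <|> terminal
  where
    terminal = Term <$> many1 letter
    -- (between `on` char) '(' ')' is between (char '(') (char ')')
    recurse  = List <$> (between `on` char) '(' ')' (expr `sepBy1` char ' ')

-- Gather every parenthesized group, outermost first.
collect :: Expr -> [Expr]
collect r@(List ts) = r : concatMap collect ts
collect _           = []

-- Hypothetical helper: print an Expr back in its source form.
render :: Expr -> String
render (Term s)  = s
render (List ts) = "(" ++ unwords (map render ts) ++ ")"

-- Hypothetical helper: parse the whole input and render each group.
groups :: String -> Either ParseError [String]
groups = fmap (map render . collect) . parse (expr <* eof) ""

-- ghci> groups "((a b c) a b c)"
-- Right ["((a b c) a b c)","(a b c)"]
```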
While jozefg's solution is almost identical to what I came up with (and I completely agree to all his suggestions), there are some small differences that made me think that I should post a second answer:
Due to the expected result of the initial example, it is not necessary to treat space-separated parts as individual subtrees.
Further it might be interesting to see the part that actually computes the expected results (i.e., a list of strings).
So here is my version. As already suggested by jozefg, split the task into several sub-tasks. Those are:
1. Parse a string into an algebraic data type representing some kind of tree.
2. Collect the (desired) subtrees of this tree.
3. Turn trees into strings.
Concerning 1, we first need a tree data type
import Text.Parsec
import Text.Parsec.String
import Control.Applicative ((<$>))
data Tree = Leaf String | Node [Tree]
and then a function that can parse strings into values of this type.
parseTree :: Parser Tree
parseTree = node <|> leaf
  where
    node = Node <$> between (char '(') (char ')') (many parseTree)
    leaf = Leaf <$> many1 (noneOf "()")
In my version I consider the whole string between parentheses as a Leaf node (i.e., I do not split at whitespace).
Now we need to collect the subtrees of a tree we are interested in:
nodes :: Tree -> [Tree]
nodes (Leaf _) = []
nodes t@(Node ts) = t : concatMap nodes ts
Finally, a Show-instance for Trees allows us to turn them into strings.
instance Show Tree where
  showsPrec d (Leaf x) = showString x
  showsPrec d (Node xs) = showString "(" . showList xs . showString ")"
    where
      showList [] = id
      showList (x:xs) = shows x . showList xs
Then the original task can be solved, e.g., by:
parseGroups :: Parser [String]
parseGroups = map show . nodes <$> parseTree
> parseTest parseGroups "((a b c) a b c)"
["((a b c) a b c)","(a b c)"]

Custom ADT vs. Tree for parser return value

I'm using Parsec to build a simple Lisp parser.
What are the (dis)advantages of using a custom ADT for the parser types versus using a standard Tree (i.e. Data.Tree)?
After trying both ways, I've come up with a couple points for custom ADTs (i.e. Parser ASTNode):
seems to be much clearer and simpler
others have done it this way(including Tiger, which is/was bundled with Parsec)
and one against (i.e. for Parser (Tree ASTNode)):
Data.Tree already has Functor, Monad, etc. instances, which will be very helpful for semantic analysis, evaluation, calculating code statistics
For example:
custom ADT
import Text.ParserCombinators.Parsec
data ASTNode
  = Application ASTNode [ASTNode]
  | Symbol String
  | Number Float
  deriving (Show)
int :: Parser ASTNode
int = many1 digit >>= (return . Number . read)
symbol :: Parser ASTNode
symbol = many1 (oneOf ['a'..'z']) >>= (return . Symbol)
whitespace :: Parser String
whitespace = many1 (oneOf " \t\n\r\f")
app :: Parser ASTNode
app =
  char '(' >>
  sepBy1 expr whitespace >>= (\(e:es) ->
  char ')' >>
  (return $ Application e es))
expr :: Parser ASTNode
expr = symbol <|> int <|> app
example use:
ghci> parse expr "" "(a 12 (b 13))"
Right
  (Application
    (Symbol "a")
    [Number 12.0, Application
      (Symbol "b")
      [Number 13.0]])
Data.Tree
import Text.ParserCombinators.Parsec
import Data.Tree
data ASTNode
  = Application (Tree ASTNode)
  | Symbol String
  | Number Float
  deriving (Show)
int :: Parser (Tree ASTNode)
int = many1 digit >>= (\x -> return $ Node (Number $ read x) [])
symbol :: Parser (Tree ASTNode)
symbol = many1 (oneOf ['a' .. 'z']) >>= (\x -> return $ Node (Symbol x) [])
whitespace :: Parser String
whitespace = many1 (oneOf " \t\n\r\f")
app :: Parser (Tree ASTNode)
app =
  char '(' >>
  sepBy1 expr whitespace >>= (\(e:es) ->
  char ')' >>
  (return $ Node (Application e) es))
expr :: Parser (Tree ASTNode)
expr = symbol <|> int <|> app
and example use:
ghci> parse expr "" "(a 12 (b 13))"
Right
  (Node
    (Application
      (Node (Symbol "a") []))
    [Node (Number 12.0) [],
     Node
       (Application
         (Node (Symbol "b") []))
       [Node (Number 13.0) []]])
(sorry for the formatting -- hopefully it's clear)
I'd absolutely go for the AST, because interpretation/compilation/language analysis in general is very much driven by the structure of your language. The AST will simply and naturally represent and respect that structure, while Tree will do neither.
For example, a common language implementation technique is to implement some complex features by translation: translate programs that involve those features or constructs into programs in a subset of the language that does not use them (Lisp macros, for example, are all about this). If you use an AST, the type system will often forbid you from producing illegal translations as output, whereas a Tree type that doesn't understand your program will not help there.
Your AST doesn't look very complicated, so writing utility functions for it should not be hard. Take this one for example:
foldASTNode :: (r -> [r] -> r) -> (String -> r) -> (Float -> r) -> ASTNode -> r
foldASTNode app sym num node =
  case node of
    Application f args -> app (subfold f) (map subfold args)
    Symbol str -> sym str
    Number n -> num n
  where subfold = foldASTNode app sym num
And in any case, what sort of Functor do you wish to have on your AST? There's no type parameter on it...
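To illustrate how far such a fold goes, here is a sketch of two small analyses written with it. The fold is repeated with a signature that takes the ASTNode as its final argument so that the block stands alone; `symbols` and `nodeCount` are hypothetical examples, not part of the answer.

```haskell
data ASTNode
  = Application ASTNode [ASTNode]
  | Symbol String
  | Number Float
  deriving (Show)

-- One handler per constructor; recursion is done by the fold itself.
foldASTNode :: (r -> [r] -> r) -> (String -> r) -> (Float -> r) -> ASTNode -> r
foldASTNode app sym num node =
  case node of
    Application f args -> app (subfold f) (map subfold args)
    Symbol str         -> sym str
    Number n           -> num n
  where subfold = foldASTNode app sym num

-- All symbol names, in left-to-right order.
symbols :: ASTNode -> [String]
symbols = foldASTNode (\f args -> f ++ concat args) (: []) (const [])

-- Total number of nodes, counting applications, symbols and numbers.
nodeCount :: ASTNode -> Int
nodeCount = foldASTNode (\f args -> 1 + f + sum args) (const 1) (const 1)

-- The tree for "(a 12 (b 13))" from the question.
example :: ASTNode
example = Application (Symbol "a") [Number 12, Application (Symbol "b") [Number 13]]

-- ghci> symbols example
-- ["a","b"]
-- ghci> nodeCount example
-- 6
```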
