parsec using between to parse parens - parsing

If I wanted to parse a string with multiple parenthesized groups into a list of strings holding each group, for example
"((a b c) a b c)"
into
["((a b c) a b c)","( a b c)"]
How would I do that using parsec? The use of between looks nice but it does not seem possible to separate with a beginning and end value.

I'd use a recursive parser:
data Expr = List [Expr] | Term String
expr :: Parsec String () Expr
expr = recurse <|> terminal
where terminal is your primitives, in this case these seem to be strings of characters so
where terminal = Term <$> many1 letter
and recurse is
recurse = List <$>
(between `on` char) '(' ')' (expr `sepBy1` char ' ')
Now we have a nice tree of Exprs which we can gather with
collect r#(List ts) = r : concatMap collect ts
collect _ = []

While jozefg's solution is almost identical to what I came up with (and I completely agree to all his suggestions), there are some small differences that made me think that I should post a second answer:
Due to the expected result of the initial example, it is not necessary to treat space-separated parts as individual subtrees.
Further it might be interesting to see the part that actually computes the expected results (i.e., a list of strings).
So here is my version. As already suggested by jozefg, split the task into several sub-tasks. Those are:
Parse a string into an algebraic data type representing some kind of tree.
Collect the (desired) subtrees of this tree.
Turn trees into strings.
Concerning 1, we first need a tree data type
import Text.Parsec
import Text.Parsec.String
import Control.Applicative ((<$>))
data Tree = Leaf String | Node [Tree]
and then a function that can parse strings into values of this type.
parseTree :: Parser Tree
parseTree = node <|> leaf
where
node = Node <$> between (char '(') (char ')') (many parseTree)
leaf = Leaf <$> many1 (noneOf "()")
In my version I do consider the hole string between parenthesis as a Leaf node (i.e., I do not split at white-spaces).
Now we need to collect the subtrees of a tree we are interested in:
nodes :: Tree -> [Tree]
nodes (Leaf _) = []
nodes t#(Node ts) = t : concatMap nodes ts
Finally, a Show-instance for Trees allows us to turn them into strings.
instance Show Tree where
showsPrec d (Leaf x) = showString x
showsPrec d (Node xs) = showString "(" . showList xs . showString ")"
where
showList [] = id
showList (x:xs) = shows x . showList xs
Then the original task can be solved, e.g., by:
parseGroups :: Parser [String]
parseGroups = map show . nodes <$> parseTree
> parseTest parseGroups "((a b c) a b c)"
["((a b c) a b c)","(a b c)"]

Related

Haskell monadic parser with anamorphisms

My problem is how to combine the recursive, F-algebra-style recursive type definitions, with monadic/applicative-style parsers, in way that would scale to a realistic programming language.
I have just started with the Expr definition below:
data ExprF a = Plus a a |
Val Integer deriving (Functor,Show)
data Rec f = In (f (Rec f))
type Expr = Rec ExprF
and I am trying to combine it with a parser which uses anamorphisms:
ana :: Functor f => (a -> f a) -> a -> Rec f
ana psi x = In $ fmap (ana psi) (psi x)
parser = ana psi
where psi :: String -> ExprF String
psi = ???
as far as I could understand, in my example, psi should either parse just an integer, or it should decide that the string is a <expr> + <expr> and then (by recursively calling fmap (ana psi)), it should parse the left-hand side and the right-hand side expressions.
However, (monadic/applicative) parsers don't work like that:
they first attempt parsing the left-hand expression,
the +,
and the right-hand expression
One solution that I see, is to change the type definition for Plus a a to Plus Integer a, such that it reflects the parsing process, however this doesn't seem like the best avenue.
Any suggestions (or reading directions) would be welcome!
If you need a monadic parser, you need a monad in your unfold:
anaM :: (Traversable f, Monad m) => (a -> m (f a)) -> a -> m (Rec f)
anaM psiM x = In <$> (psiM x >>= traverse (anaM psiM))
Then you can write something that parses just one level of an ExprF like this:
parseNum :: Parser Integer
parseNum = -- ...
char :: Char -> Parser Char
char c = -- ...
parseExprF :: Maybe Integer -> Parser (ExprF (Maybe Integer))
parseExprF (Just n) = pure (Val n)
parseExprF Nothing = do
n <- parseNum
empty
<|> (Plus (Just n) Nothing <$ char '+')
<|> (pure (Val n))
Given that, you now have your recursive Expr parser:
parseExpr :: Parser Expr
parseExpr = anaM parseExprF Nothing
You will need to have instances of Foldable and Traversable for ExprF, of course, but the compiler can write these for you and they are not themselves recursive.

Parser in Haskell for Triples

I am currently doing an assignment about Parsing in Haskell, but I am struggling with some of the basics.
Assignment :
I am supposed to create a function which parses a string into a list of Triples.
So that:
A, B, C
,E ,D
would result in
Triples [("A","B","C"), ("A","E","D")]
The input string is going to include ;\n as an indication for the beginng of a new Triple. The string is going to end with a dot.
The elements of the Triples can be letters or digits or combination,
e.g. abc, a, 1, abc121.
Therefore,
"a,b,c;\n d,e;\n f,g;\n h,i."
would result in:
Triples [("a","b","c"),("a","d","e"),("a","f","g"),("a","h","i")]
My current solution:
parseTriplesD :: Parser Triples
parseTriplesD = parseTriples
>>= \rs -> return (Triples rs)
This function is pretty simple and correct. Takes the string and returns a object of the newtype Triples with the List of Triples created by parseTriples.
parseTriples :: Parser [Triple]
parseTriples = parseTriple
>>= \r -> ((string ";\n" >> parseTriples >>= \rs -> return (r:rs))
P.<|>(return[r]))
This function needs some work. My idea is that I use another function which creates a Triple with tree Elements of the input string, ignores the /n and recursivly calls it self while adding the created triples to a return list. When this does not work because it can only create one Triple, it returns a list with the Triple.
I somehow need to create the first Triple, and then use first element of this triple as the first element of the other ones.
Question 1
How do I create the first Triple and use the first Elements of the Triple for the other Triples?
parseTriple :: Parser Triple
parseTriple = P.many (letter<|>digit) >>= \a -> P.char ','
>> P.many (letter<|>digit)>>= \b -> P.char ','
>> P.many (letter<|>digit)>>= \c -> return ((a,b,c))
This function is pretty simple but I am not sure if its correct.
My idea is that it takes the first couple of characters of the string which are either a letter or a digit, up until the comma "," , and saves these charcters in a.
It is repeated 3 times, and the creates and returns a Triple with the three elements.
Question 2
How do I take only a few characters (which are either a letter or a digit EDIT: Or A SPACE Character) of the string up until the comma?
Is P.many (letter<|>digit) correct?
What we are given:
The Triples data structue:
newtype Triples = Triples [Triple] deriving (Show,Eq)
type Triple = (String, String, String)
Imports:
import Test.HUnit (runTestTT,Test(TestLabel,TestList),(~?=))
import qualified Text.Parsec as P (char,runP,noneOf,many,(<|>),eof)
import Text.ParserCombinators.Parsec
import Text.Parsec.String
import Text.Parsec.Char
import Data.Maybe
Test cases
runParsec :: Parser a -> String -> Maybe a
runParsec parser input = case P.runP parser () "" input of
Left _ -> Nothing
Right a -> Just a
-- | Tests the implementations of 'parseScore'.
main :: IO ()
main = do
testresults <- runTestTT tests
print testresults
-- | List of tests for 'parseScore'.
tests :: Test
tests = TestLabel "parseScoreTest" (TestList [
runParsec parseTriplesD "0,1,2;\n2,3." ~?= Just (Triples [("0","1","2"),("0","2","3")]),
runParsec parseTriplesD "a,bcde ,23." ~?= Just (Triples [("a","bcde ","23")]),
runParsec parseTriplesD "a,b,c;\n d,e;\n f,g;\n h,i." ~?= Just (Triples [("a","b","c"),("a","d","e"),("a","f","g"),("a","h","i")]),
runParsec parseTriplesD "a,bcde23." ~?= Nothing,
runParsec parseTriplesD "a,b,c;d,e;f,g;h,i." ~?= Nothing,
runParsec parseTriplesD "a,b,c;\nd;\nf,g;\nh,i." ~?= Nothing
])
What you could do is:
Parse the first character
Parse a list of pairs
Add the first character to each of the pairs to create triples
Using do notation will make your code more readable.
You can use alphaNum as a shorthand for letter <|> digit.
parseTriplesD :: Parser Triples
parseTriplesD = Triples <$> parseTriples
parseTriples :: Parser [Triple]
parseTriples = do
a <- parseString
char ','
pairs <- parsePair `sepBy1` string ";\n"
char '.'
eof
return (map (\(b, c) -> (a, b, c)) pairs)
parsePair :: Parser (String, String)
parsePair = do
first <- parseString
char ','
second <- parseString
return (first, second)
parseString :: Parser String
parseString = many (char ' ') >> many (alphaNum <|> char ' ')

Parsing juxtaposition-based, indentation-aware syntax using Text.Parsec.Layout

I'm trying to parse a small language with Haskell-like syntax, using parsec-layout. The two key features that don't seem to interact too well with each other are:
Function application syntax is juxtaposition, i.e. if F and E are terms, F E is the syntax for F applied to E.
Indentation can be used to denote nesting, i.e. the following two are equivalent:
X = case Y of
A -> V
B -> W
X = case Y of A -> V; B -> W
I haven't managed to figure out a combination of skipping and keeping whitespace that would allow me to parse a list of such definitions. Here's my simplified code:
import Text.Parsec hiding (space, runP)
import Text.Parsec.Layout
import Control.Monad (void)
type Parser = Parsec String LayoutEnv
data Term = Var String
| App Term Term
| Case Term [(String, Term)]
deriving Show
name :: Parser String
name = spaced $ (:) <$> upper <*> many alphaNum
kw :: String -> Parser ()
kw = void . spaced . string
reserved :: String -> Parser ()
reserved s = try $ spaced $ string s >> notFollowedBy alphaNum
term :: Parser Term
term = foldl1 App <$> part `sepBy1` space
where
part = choice [ caseBlock
, Var <$> name
]
caseBlock = Case <$> (reserved "case" *> term <* reserved "of") <*> laidout alt
alt = (,) <$> (name <* kw "->") <*> term
binding :: Parser (String, Term)
binding = (,) <$> (name <* kw "=") <*> term
-- https://github.com/luqui/parsec-layout/issues/1
trim :: String -> String
trim = reverse . dropWhile (== '\n') . reverse
runP :: Parser a -> String -> Either ParseError a
runP p = runParser (p <* eof) defaultLayoutEnv "" . trim
If I try to run it on input like
s = unlines [ "A = case B of"
, " X -> Y Z"
, "C = D"
]
via runP (laidout binding) s, it fails on the application Y Z:
(line 2, column 10):
expecting space or semi-colon
However, if I change the definition of term to
term = foldl1 App <$> many1 part
then it doesn't stop parsing the term at the start of the (unindented!) third line, leading to
(line 3, column 4):
expecting semi-colon
I think the problem has to do with that name already eliminates the following space, so the sepBy1 in the definition of term doesn't see it.
Consider these simplified versions of term:
term0 = foldl1 App <$> (Var <$> name) `sepBy1` space
term1 = foldl1 App <$> (Var <$> name') `sepBy1` space
name' = (:) <$> upper <*> many alphaNum
term3 = foldl1 App <$> many (Var <$> name)
Then:
runP term0 "A B C" -- fails
runP term1 "A B C" -- succeeds
runP term3 "A B C" -- succeeds
I think part of the solution is to define
part = [ caseBlock, Var <$> name' ]
where name' is as above. However, there are still some issues.

Custom ADT vs. Tree for parser return value

I'm using Parsec to build a simple Lisp parser.
What are the (dis)advantages of using a custom ADT for the parser types versus using a standard Tree (i.e. Data.Tree)?
After trying both ways, I've come up with a couple points for custom ADTs (i.e. Parser ASTNode):
seems to be much clearer and simpler
others have done it this way(including Tiger, which is/was bundled with Parsec)
and one against (i.e. Parser (Tree ASTNode):
Data.Tree already has Functor, Monad, etc. instances, which will be very helpful for semantic analysis, evaluation, calculating code statistics
For example:
custom ADT
import Text.ParserCombinators.Parsec
data ASTNode
= Application ASTNode [ASTNode]
| Symbol String
| Number Float
deriving (Show)
int :: Parser ASTNode
int = many1 digit >>= (return . Number . read)
symbol :: Parser ASTNode
symbol = many1 (oneOf ['a'..'z']) >>= (return . Symbol)
whitespace :: Parser String
whitespace = many1 (oneOf " \t\n\r\f")
app :: Parser ASTNode
app =
char '(' >>
sepBy1 expr whitespace >>= (\(e:es) ->
char ')' >>
(return $ Application e es))
expr :: Parser ASTNode
expr = symbol <|> int <|> app
example use:
ghci> parse expr "" "(a 12 (b 13))"
Right
(Application
(Symbol "a")
[Number 12.0, Application
(Symbol "b")
[Number 13.0]])
Data.Tree
import Text.ParserCombinators.Parsec
import Data.Tree
data ASTNode
= Application (Tree ASTNode)
| Symbol String
| Number Float
deriving (Show)
int :: Parser (Tree ASTNode)
int = many1 digit >>= (\x -> return $ Node (Number $ read x) [])
symbol :: Parser (Tree ASTNode)
symbol = many1 (oneOf ['a' .. 'z']) >>= (\x -> return $ Node (Symbol x) [])
whitespace :: Parser String
whitespace = many1 (oneOf " \t\n\r\f")
app :: Parser (Tree ASTNode)
app =
char '(' >>
sepBy1 expr whitespace >>= (\(e:es) ->
char ')' >>
(return $ Node (Application e) es))
expr :: Parser (Tree ASTNode)
expr = symbol <|> int <|> app
and example use:
ghci> parse expr "" "(a 12 (b 13))"
Right
(Node
(Application
(Node (Symbol "a") []))
[Node (Number 12.0) [],
Node
(Application
(Node (Symbol "b") []))
[Node (Number 13.0) []]])
(sorry for the formatting -- hopefully it's clear)
I'd absolutely go for the AST, because interpretation/compilation/language analysis in general is very much driven by the structure of your language. The AST will simply and naturally represent and respect that structure, while Tree will do neither.
For example, a common form of language implementation technique is to implement some complex features by translation: translate programs that involve those features or constructs into programs in a subset of the a language that does not use them (Lisp macros, for example, are all about this). If you use an AST, the type system will, for example, often forbid you from producing illegal translations as output. Whereas a Tree type that doesn't understand your program will not help there.
Your AST doesn't look very complicated, so writing utility functions for it should not be hard. Take this one for example:
foldASTNode :: (r -> [r] -> r) -> (String -> r) -> (Float -> r) -> r
foldASTNode app sym num node =
case node of
Application f args -> app (subfold f) (map subfold args)
Symbol str -> sym str
Number n -> num n
where subfold = foldASTNode app sym num
And in any case, what sort of Functor do you wish to have on your AST? There's no type parameter on it...

Parsec parsing and separating different structures

Let's say I have different parsers p1, ..., pk. I want to define a function pk :: Parser ([t1], ..., [tk]) where pi :: Parser ti.
That will parse a collection of strings (one after the other) that match any of p1...pk and separate them in the corresponding lists. Assume for simplicity that none of the strings matches two parsers.
I managed to do it, but I am really struggling to find an elegant way (preferably using many or any other builtin parsec parser).
The first step is to turn each of your parsers into parser of the big type:
p1 :: Parser t1
p2 :: Parser t2
p3 :: Parser t3
p1 = undefined
p2 = undefined
p3 = undefined
p1', p2', p3' :: Parser ([t1], [t2], [t3])
p1' = fmap (\x -> ([x], [], [])) p1
p2' = fmap (\x -> ([], [x], [])) p2
p3' = fmap (\x -> ([], [], [x])) p3
Now, we repeatedly choose from these last parsers, and concatenate the results at the end:
parser :: Parser ([t1], [t2], [t3])
parser = fmap mconcat . many . choice $ [p1', p2', p3']
There are Monoid instances for tuples up to size five; beyond that, you can either use nested tuples or a more appropriate data structure.
Representing the parsers as a list makes this easy. Using:
choice :: [Parser a] -> Parser a
many :: Parser a -> Parser [a]
We can write:
combineParsers :: [Parser a] -> Parser [a]
combineParsers = many . choice
This isn't quite right, as it bundles them all into a single list. Let's associate each parser with a unique identifier:
combineParsers' :: [(k, Parser a)] -> Parser [(k, a)]
combineParsers' = many . choice . combine
where combine = map (\(k,p) -> (,) k <$> p)
We can then convert this back to the list form:
combineParsers :: [Parser a] -> Parser [[a]]
combineParsers ps = map snd . sortBy fst <$> combineParsers' (zip [0..] ps)
You could perhaps make this more efficient by writing combineParsers' :: [(k, Parser a)] -> Parser (Map k [a]) instead, using e.g. combine = map $ \(k,p) -> fmap (\a -> Map.insertWith (++) k [a]) p.
This requires that all the parsers have the same result type, so you'll need to wrap each of their results in a data-type with Cons <$> p for each p. You can then unwrap the constructor from each list. Admittedly, this is fairly ugly, but to do it heterogeneously with tuples would require even uglier type-class hacks.
There might be a simpler solution for your specific use-case, but this is the easiest way to do it generically that I can think of.
You can't do this completely generically without type-level trickery, unfortunately. However, an approach along the lines of Daniel Wagner's suggestion can be constructed in DIY style, using polymorphic combinators and some pre-existing recursive instances. I'll present a simple example to demonstrate.
First, a few simpleminded parsers to test with:
type Parser = Parsec String ()
parseNumber :: Parser Int
parseNumber = read <$> many1 digit
parseBool :: Parser Bool
parseBool = string "yes" *> return True
<|> string "no" *> return False
parseName :: Parser String
parseName = many1 letter
Next we create a "base case" type to mark the end of the possibilities, and give it a parser that always succeeds on no input.
data Nil = Nil deriving (Eq, Ord, Read, Show, Enum, Bounded)
instance Monoid Nil where
mempty = Nil
mappend _ _ = Nil
parseNil :: Parser Nil
parseNil = return Nil
This is equivalent to (), of course--I'm only creating a new type to disambiguate in case we actually want to parse (). Next, the combinator that chains parsers together:
infixr 3 ?|>
(?|>) :: (Monoid b) => Parser a -> Parser b -> Parser ([a], b)
p1 ?|> p2 = ((,) <$> ((:[]) <$> p1) <*> pure mempty)
<|> ((,) <$> pure [] <*> p2)
What's going on here is that p1 ?|> p2 tries p1, and if it succeeds wraps that in a singleton list and fills in mempty for the second element of the tuple. If p1 fails, if fills in an empty list and uses p2 for the second element.
parseOne :: Parser ([Int], ([Bool], ([String], Nil)))
parseOne = parseNumber ?|> parseBool ?|> parseName ?|> parseNil
Combining parsers with the new combinator is simple, and the result type is pretty self-explanatory.
parseMulti :: Parser ([Int], ([Bool], ([String], Nil)))
parseMulti = mconcat <$> many1 (parseOne <* newline)
We're relying on the recursive Monoid instance for tuples and the usual instance for lists to combine multiple lines. Now, to show that it works, define a quick test:
runTest = parseTest parseMulti testInput
testInput = unlines [ "yes", "123", "acdf", "8", "no", "qq" ]
Which parses successfully as Right ([123,8],([True,False],(["acdf","qq"],Nil))).

Resources