I am following this tutorial on implementing parser combinators (a la Parsec) in Haskell. I have implemented everything from NanoParsec that is mentioned in the post.
For some hours now, I have been struggling to implement
-- try p. If p fails continue without consuming anything
try :: Parser a -> Parser a
try p = ...
-- Parser for everything, until the character-sequence stop appears.
-- If stop does not appear at all: fail
untilStop :: String -> Parser String
untilStop stop = ...
My best attempt at untilStop looks somewhat like this, and it does not quite work:
untilStop :: String -> Parser String
untilStop (c : cs) = do
  s <- some $ satisfy (/= c)
  string (c : cs) <|> do
    i <- item
    untilStop (c : cs)
-- maybe use msum from MonadPlus to combine?
I could not figure out how to combine s, i and the result of the recursive call without everything failing whenever string does not match the whole stop sequence.
I think once I have try, untilStop should be straightforward. Can someone point me in the right direction or implement it (try) for me?
Right now I am still learning about Monads, Applicative and related concepts, so trying to understand the source code of Parsec was impossible for me.
As I said in a comment, I think you don't need to have a Parsec-like try.
For the untilStop, check this:
untilStop :: String -> Parser String
untilStop [] = everything
untilStop (c:cs) = item >>= fun
  where fun i = do { s <- untilStop cs;
                     if i == c && s == "" then return "" else failure } <|> do
                       s <- untilStop (c : cs)
                       return (i : s)
First, if the stop string is empty, you parse everything, where everything is:
everything :: Parser String
everything = Parser (\inp -> [(inp,"")])
Otherwise, if it is of the form c:cs, then parse a character i and consider two cases:
The stop string is right in the front of the parsing stream (because c == i and parsing with the rest of the string cs gives an empty result), then return "". Or,
It is somewhere in the stream, so you look for it further.
Note that the <|> operator is used to backtrack. If untilStop cs fails to be what we want, we need to reparse, using untilStop (c:cs) instead.
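For comparison, here is a minimal sketch of a more direct untilStop. It assumes a NanoParsec-style list-of-successes parser in which <|> runs the second alternative on the original input, so no separate try is needed. The Parser instances, runParser, char, and the decision to consume the stop sequence are assumptions made for this illustration and may differ from the tutorial's code.

```haskell
import Control.Applicative (Alternative (..))

-- A NanoParsec-style parser: a list of (result, remaining input) pairs.
-- (Assumed definition; the tutorial's own type may differ in detail.)
newtype Parser a = Parser { runParser :: String -> [(a, String)] }

instance Functor Parser where
  fmap f p = Parser $ \s -> [(f a, rest) | (a, rest) <- runParser p s]

instance Applicative Parser where
  pure a = Parser $ \s -> [(a, s)]
  pf <*> pa = Parser $ \s ->
    [(f a, s2) | (f, s1) <- runParser pf s, (a, s2) <- runParser pa s1]

instance Monad Parser where
  p >>= f = Parser $ \s ->
    concat [runParser (f a) rest | (a, rest) <- runParser p s]

instance Alternative Parser where
  empty = Parser (const [])
  -- The right alternative runs on the original input, so backtracking is built in.
  p <|> q = Parser $ \s -> case runParser p s of
    [] -> runParser q s
    rs -> rs

-- Consume a single character; fail on empty input.
item :: Parser Char
item = Parser $ \s -> case s of
  []       -> []
  (c : cs) -> [(c, cs)]

-- Consume exactly the given character.
char :: Char -> Parser Char
char c = do
  x <- item
  if x == c then pure x else empty

-- Consume exactly the given string.
string :: String -> Parser String
string = traverse char

-- Either the stop sequence is right here (return what was collected so far),
-- or take one character and keep looking. Fails if stop never appears.
untilStop :: String -> Parser String
untilStop stop = ("" <$ string stop) <|> ((:) <$> item <*> untilStop stop)
```

With these definitions, runParser (untilStop "ab") "xxabyy" yields [("xx","yy")], and the parser fails (returns []) when the stop sequence never appears.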
I'm trying to define a greedy function
greedy :: ReadP a -> ReadP [a]
that parses a sequence of values, returning only the "maximal" sequences that cannot be extended any further. For example,
> readP_to_S (greedy (string "a" +++ string "ab")) "abaac"
[(["a"],"baac"),(["ab","a","a"],"c")]
I'm using a very simple and probably clumsy approach: parse the values and see whether they can be parsed any further; if so, reapply the function to get all the possible values and prepend the previous ones, or else just return the value itself. However, there seem to be some type problems. Below is my code.
import Text.ParserCombinators.ReadP
addpair :: a -> [([a],String)] -> [([a],String)]
addpair a [] = []
addpair a (c:cs) = (a : (fst c), snd c ) : (addpair a cs)
greedy :: ReadP a -> ReadP [a]
greedy ap = readS_to_P (\s ->
  let list = readP_to_S ap s in
  f list )
  where
    f :: [(a,String)] -> [([a],String)]
    f ((value, str2):cs) =
      case readP_to_S ap str2 of
        [] -> ([value], str2) : (f cs)
        _ -> (addpair value (readP_to_S (greedy ap) str2)) ++ (f cs)
GHC reports that the function f has type [(a1,String)] -> [([a1],String)], whereas greedy has type ReadP a -> ReadP [a]. I wonder why, because I think their types should agree. It would also really help if anyone could suggest a cleverer, more elegant way to define greedy (my approach is definitely far too redundant).
To fix the compilation error, you need to add the language extension
{-# LANGUAGE ScopedTypeVariables #-}
to your source file, or pass the corresponding flag into the compiler. You also need to change the type signature of greedy to
greedy :: forall a. ReadP a -> ReadP [a]
This is because your two a type variables are not actually the same; they're in different scopes. With the extension and the forall, they are treated as the same variable, and your types unify properly. Even then, the code fails at runtime, because the pattern match in your definition of f is not exhaustive. If you add
f [] = []
then the code seems to work as intended.
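Putting the pieces together, the corrected version might look like the sketch below. It is the question's code with only the changes described above applied (the pragma, the forall, and the extra f [] case).

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Text.ParserCombinators.ReadP

-- Prepend a value to every partial result.
addpair :: a -> [([a], String)] -> [([a], String)]
addpair _ [] = []
addpair a (c : cs) = (a : fst c, snd c) : addpair a cs

greedy :: forall a. ReadP a -> ReadP [a]
greedy ap = readS_to_P (\s -> f (readP_to_S ap s))
  where
    -- Thanks to ScopedTypeVariables, this 'a' is the same 'a' as in greedy's signature.
    f :: [(a, String)] -> [([a], String)]
    f [] = []
    f ((value, str2) : cs) =
      case readP_to_S ap str2 of
        -- ap cannot continue from here, so this sequence is maximal
        [] -> ([value], str2) : f cs
        -- ap can continue, so extend the sequence recursively
        _  -> addpair value (readP_to_S (greedy ap) str2) ++ f cs
```

Running readP_to_S (greedy (string "a" +++ string "ab")) "abaac" should then produce the two maximal sequences from the question.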
In order to simplify your code, I took a look at the provided function munch, which is defined as:
munch :: (Char -> Bool) -> ReadP String
-- ^ Parses the first zero or more characters satisfying the predicate.
-- Always succeeds, exactly once having consumed all the characters
-- Hence NOT the same as (many (satisfy p))
munch p =
  do s <- look
     scan s
 where
  scan (c:cs) | p c = do _ <- get; s <- scan cs; return (c:s)
  scan _            = do return ""
In that spirit, your code can be rewritten as:
greedy2 :: forall a. ReadP a -> ReadP [a]
greedy2 ap = do
  -- look at the string
  s <- look
  -- try parsing it, but without do notation
  case readP_to_S ap s of
    -- if we failed, then return nothing
    [] -> return []
    -- if we parsed something, ignore it
    (_:_) -> do
      -- parse it again, but this time inside of the monad
      x <- ap
      -- recurse, greedily parsing again
      xs <- greedy2 ap
      -- and return the concatenated values
      return (x:xs)
This does have the speed disadvantage of executing ap twice as often as needed; this may be too slow for your use case. I'm sure my code could be further rewritten to avoid that, but I'm not a ReadP expert.
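One possible way to avoid running the parser twice is ReadP's left-biased choice operator <++, which only falls back to its right argument when the left one produces no result at all. The following is a sketch under that assumption, not a tuned implementation; the name greedy3 is made up for this example.

```haskell
import Text.ParserCombinators.ReadP

-- Parse one value and recurse; only if that yields nothing at all,
-- stop with the empty list. (<++) discards the right-hand side whenever
-- the left-hand side produces any result.
greedy3 :: ReadP a -> ReadP [a]
greedy3 p = ((:) <$> p <*> greedy3 p) <++ return []
```

Like the other versions, this loops forever if p can succeed without consuming any input.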
I am trying to write a parser for a small language with the following piece of code
import Text.ParserCombinators.Parsec
import Text.Parsec.Token
data Exp = Atom String | Op String Exp
instance Show Exp where
  show (Atom x) = x
  show (Op f x) = f ++ "(" ++ (show x) ++ ")"
parse_exp :: Parser Exp
parse_exp = (try parse_atom) <|> parse_op
parse_atom :: Parser Exp
parse_atom = do
  x <- many1 letter
  return (Atom x)
parse_op :: Parser Exp
parse_op = do
  x <- many1 letter
  char '('
  y <- parse_exp
  char ')'
  return (Op x y)
But when I type in ghci
>>> parse (parse_exp <* eof) "<error>" "s(t)"
I get the output
Left "<error>" (line 1, column 2):
unexpected '('
expecting letter or end of input
If I redefine parse_exp as
parse_exp = (try parse_op) <|> parse_atom
then I get the correct result
>>> parse (parse_exp <* eof) "<error>" "s(t)"
Right s(t)
But I am confused why the first one does not work. Is there a general fix to these kinds of problems in parsing?
When a Parsec parser, like parse_atom, is run on a particular string, there are four possible results:
It succeeds, consuming some input.
It fails, consuming some input.
It succeeds, consuming no input.
It fails, consuming no input.
In the Parsec source code, these are referred to as "consumed ok", "consumed err", "empty ok" and "empty err" (sometimes abbreviated cok, cerr, eok, eerr).
When two Parsec parsers are used in an alternative, like p <|> q, here's how it's parsed. First, Parsec tries to parse with p. Then:
If this results in "consumed ok" or "empty ok", the parse succeeds and this becomes the result of the entire parser p <|> q.
If this results in "empty err", Parsec tries the alternative q, and this becomes the result of the entire p <|> q parser.
If this results in "consumed err", the entire parser p <|> q fails with "consumed err" (cerr).
Note the critical difference between p returning cerr (which causes the whole parser to fail) versus returning eerr (which causes the alternative parser q to be tried).
The try function changes the behavior of a parser by converting a "cerr" result to an "eerr" result.
This means that if you are trying to parse the text "s(t)" with different parsers:
with the parser parse_atom <|> parse_op, the parser parse_atom returns "cok" consuming "s" and leaving unparseable text "(t)" which causes an error
with the parser try parse_atom <|> parse_op, the parser parse_atom still returns "cok" consuming "s", so the try (which only changes cerr to eerr) has no effect, and the unparseable text "(t)" causes the same error
with the parser parse_op <|> parse_atom, the parser parse_op successfully parses the string (actually, it doesn't because the recursive call to parse_exp can't parse "t", but let's ignore that); however, if the same parser was used on the text "s", then parse_op would consume the "s" before failing (i.e., cerr), causing the entire parse to fail instead of trying the alternative parse_atom
with the parser try parse_op <|> parse_atom, this would parse "s(t)", exactly as the previous example, and the try would have no effect; however, it would also work on the text "s", because parse_op would consume the "s" before failing with cerr, then try would "rescue" the parse by turning the cerr into an eerr, and the alternative parse_atom would be checked, successfully parsing (cok) the atom "s".
That's why the "correct" parser for your problem is try parse_op <|> parse_atom.
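The cases above can be reproduced with a small self-contained sketch. The names exp_atom_first, exp_op_first and run, the derived Eq and Show instances, and the extra parameter on parse_op (so that both orderings can share it) are assumptions introduced for this illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Exp = Atom String | Op String Exp
  deriving (Eq, Show)

parse_atom :: Parser Exp
parse_atom = Atom <$> many1 letter

-- Takes the parser for the parenthesised argument as a parameter,
-- so the same definition serves both orderings below.
parse_op :: Parser Exp -> Parser Exp
parse_op inner = do
  f <- many1 letter
  _ <- char '('
  arg <- inner
  _ <- char ')'
  return (Op f arg)

-- Atom first: parse_atom succeeds on the "s" of "s(t)", so parse_op is never tried.
exp_atom_first :: Parser Exp
exp_atom_first = try parse_atom <|> parse_op exp_atom_first

-- Op first, wrapped in try: a failure after consuming the identifier
-- is turned into an "empty err", so parse_atom gets its turn.
exp_op_first :: Parser Exp
exp_op_first = try (parse_op exp_op_first) <|> parse_atom

-- Run a parser and require that it consumes the whole input.
run :: Parser Exp -> String -> Either ParseError Exp
run p = parse (p <* eof) ""
```

Here run exp_atom_first "s(t)" returns a Left, while run exp_op_first accepts both "s(t)" and "s".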
Be warned that this behavior isn't a fundamental aspect of monadic parsers. It's a design choice made by Parsec (and compatible parsers like Megaparsec). Other monadic parsers can have different rules for how alternatives with <|> work.
The "general fix" for these kinds of Parsec parsing problems is to keep two facts in mind about the expression p <|> q:
p is tried first, and if it succeeds, q will be ignored, even if q would provide a "longer" or "better" or "more sensible" parse or avoid additional parsing errors further down the road. In parse_atom <|> parse_op, because parse_atom can succeed on strings meant for parse_op, this order won't work correctly.
q is only tried if p fails without consuming input. You must arrange for p to not consume anything on failure, possibly by using try, if you expect the alternative q to be checked. So, parse_op <|> parse_atom isn't going to work if parse_op starts to consume something (like an identifier) before realizing it can't continue and returning cerr.
As an alternative to using try, you can also think more carefully about the structure of your parser. An alternative way of writing parse_exp, for example, would be:
parse_exp :: Parser Exp
parse_exp = do
  -- there's always an identifier
  x <- many1 letter
  -- there *might* be an expression in parentheses
  y <- optionMaybe (parens parse_exp)
  case y of
    Nothing -> return (Atom x)
    Just y' -> return (Op x y')
  where parens = between (char '(') (char ')')
This can be written a little more concisely, but even then it's not as "elegant" as something like try parse_op <|> parse_atom. (It performs better, though, so that might be a consideration in some applications.)
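As one possible shorter form, the case expression can be folded into maybe. This is only a sketch; the derived Eq and Show instances are an assumption added so that results can be inspected and compared.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Exp = Atom String | Op String Exp
  deriving (Eq, Show)

parse_exp :: Parser Exp
parse_exp = do
  -- there's always an identifier
  x <- many1 letter
  -- Nothing becomes an Atom; Just y becomes Op x y
  maybe (Atom x) (Op x) <$> optionMaybe (parens parse_exp)
  where
    parens = between (char '(') (char ')')
```

No try is needed here because the shared identifier prefix is parsed once, before any choice is made.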
The problem is that the string "s" counts as an atom according to your definitions. Try this:
parse parse_atom "" "s(t)"
> Atom "s"
So your parser parse_exp actually succeeds, returning Atom "s", but then you also expect an EOF right after it, and that's where it fails, encountering an open paren instead of an EOF (just like the error message says!)
When you swap the alternatives around, it first attempts parse_op, which succeeds, returning Op "s" (Atom "t"), and then encounters EOF, just as expected.
This is a task from an online course. I've been stuck on it for two days. Please give some explanation or hints to solve it.
Here's the type:
newtype Prs a = Prs { runPrs :: String -> Maybe (a, String) }
I need to implement many1 parser. This is how it should work
> runPrs (many1 $ char 'A') "AAABCDE"
Just ("AAA","BCDE")
> runPrs (many1 $ char 'A') "BCDE"
Nothing
I have the many parser implemented like this:
many p = (:) <$> p <*> many p <|> pure []
Here's the output of many for the previous examples:
*Main> test9
Just ("AAA","BCDE")
*Main> test10
Just ("","BCDE")
Note the last result: it returns an empty string, but many1 should return Nothing. I don't know how to change the code of many to make it work like many1. I can't understand how to fail when the very first symbol is incorrect.
Your many1 will need some way to fail: as you've written it it consumes characters for a while, consing them onto a pending result, until it eventually runs out of matches. This doesn't cover any cases where the parse could fail.
What you've implemented here is, in a way, many0, a parser which consumes 0 or more repetitions of something. Can you think of a way to implement many1 in terms of many0? It will look something like:
Consume one instance of p, without an alternative in case that fails
Consume 0 or more instances of p, returning [] when that fails.
Or in Haskell,
many1 :: Prs a -> Prs [a]
many1 p = (:) <$> p <*> many0 p
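To try this out end to end, here is a self-contained sketch. The Functor, Applicative and Alternative instances and the char parser are assumptions, written the way such a course exercise typically defines them; many0 is the question's many under a different name.

```haskell
import Control.Applicative (Alternative (empty, (<|>)))

newtype Prs a = Prs { runPrs :: String -> Maybe (a, String) }

instance Functor Prs where
  fmap f p = Prs $ \s -> case runPrs p s of
    Nothing        -> Nothing
    Just (a, rest) -> Just (f a, rest)

instance Applicative Prs where
  pure a = Prs $ \s -> Just (a, s)
  pf <*> pa = Prs $ \s -> case runPrs pf s of
    Nothing      -> Nothing
    Just (f, s1) -> runPrs (f <$> pa) s1

instance Alternative Prs where
  empty = Prs (const Nothing)
  -- Fall back to q on the original input only when p fails.
  p <|> q = Prs $ \s -> case runPrs p s of
    Nothing -> runPrs q s
    result  -> result

-- Consume exactly the given character.
char :: Char -> Prs Char
char c = Prs $ \s -> case s of
  (x : xs) | x == c -> Just (c, xs)
  _                 -> Nothing

-- Zero or more repetitions: never fails.
many0 :: Prs a -> Prs [a]
many0 p = ((:) <$> p <*> many0 p) <|> pure []

-- One or more repetitions: the first p has no fallback, so it can fail.
many1 :: Prs a -> Prs [a]
many1 p = (:) <$> p <*> many0 p
```

The only difference between the two is that many1 has no pure [] alternative around its first p.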
I've been coding up an attoparsec parser and have been hitting a pattern where I want to turn parsers into recursive parsers (recursively combining them with the monad bind >>= operator).
So I created a function to turn a parser into a recursive parser as follows:
recursiveParser :: (a -> A.Parser a) -> a -> A.Parser a
recursiveParser parser a = (parser a >>= recursiveParser parser) <|> return a
Which is useful if you have a recursive data type like
data Expression = ConsExpr Expression Expression | EmptyExpr
parseRHS :: Expression -> Parser Expression
parseRHS e = ConsExpr e <$> parseFoo
parseExpression :: Parser Expression
parseExpression = parseLHS >>= recursiveParser parseRHS
where parseLHS = parseRHS EmptyExpr
Is there a more idiomatic solution? It almost seems like recursiveParser should be some kind of fold... I also saw sepBy in the docs, but this method seems to suit me better for my application.
EDIT: Oh, actually now that I think about it should actually be something similar to fix... Don't know how I forgot about that.
EDIT2: Rotsor makes a good point with his alternative for my example, but I'm afraid my AST is actually a bit more complicated than that. It actually looks something more like this (although this is still simplified)
data Segment = Choice1 Expression
             | Choice2 Expression
data Expression = ConsExpr Segment Expression
                | Token String
                | EmptyExpr
where the string a -> b brackets to the right and c:d brackets to the left, with : binding more tightly than ->.
I.e. a -> b evaluates to
(ConsExpr (Choice1 (Token "a")) (Token "b"))
and c:d evaluates to
(ConsExpr (Choice2 (Token "d")) (Token "c"))
I suppose I could use foldl for the one and foldr for the other but there's still more complexity in there. Note that it's recursive in a slightly strange way, so "a:b:c -> e:f -> :g:h ->" is actually a valid string, but "-> a" and "b:" are not. In the end fix seemed simpler to me. I've renamed the recursive method like so:
fixParser :: (a -> A.Parser a) -> a -> A.Parser a
fixParser parser a = (parser a >>= fixParser parser) <|> pure a
Thanks.
Why not just parse a list and fold it into whatever you want later?
Maybe I am missing something, but this looks more natural to me:
consChain :: [Expression] -> Expression
consChain = foldl ConsExpr EmptyExpr
parseExpression :: Parser Expression
parseExpression = consChain <$> many1 parseFoo
And it's shorter too.
As you can see, consChain is now independent from parsing and can be useful somewhere else. Also, if you separate out the result folding, the somewhat unintuitive recursive parsing simplifies down to many or many1 in this case.
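Because consChain is a pure function, its shape can be checked without any parser. The sketch below assumes the simple Expression type from the question, extended with a Token leaf (borrowed from the later edit) and derived Eq and Show instances so that there is something to fold and compare.

```haskell
-- Simplified AST; Token is added here only to have concrete leaves.
data Expression
  = ConsExpr Expression Expression
  | Token String
  | EmptyExpr
  deriving (Eq, Show)

-- Left fold: each new element becomes the right child of a new ConsExpr.
consChain :: [Expression] -> Expression
consChain = foldl ConsExpr EmptyExpr

example :: Expression
example = consChain [Token "a", Token "b", Token "c"]
-- ConsExpr (ConsExpr (ConsExpr EmptyExpr (Token "a")) (Token "b")) (Token "c")
```

This left-nested result is the same shape that repeatedly applying parseRHS builds, starting from EmptyExpr.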
You may want to take a look at how many is implemented too:
many :: (Alternative f) => f a -> f [a]
many v = many_v
  where many_v = some_v <|> pure []
        some_v = (:) <$> v <*> many_v
It has a lot in common with your recursiveParser:
some_v is similar to parser a >>= recursiveParser parser
many_v is similar to recursiveParser parser
You may ask why I called your recursive parser function unintuitive. This is because the pattern allows the value passed to parser to affect the parsing behaviour (a -> A.Parser a, remember?), which may be useful, but not obviously so (I don't see a use case for it yet). The fact that your example does not use this feature makes it look redundant.
As part of the 4th exercise here
I would like to use a reads type function such as readHex with a parsec Parser.
To do this I have written a function:
liftReadsToParse :: Parser String -> (String -> [(a, String)]) -> Parser a
liftReadsToParse p f = p >>= \s -> if null (f s) then fail "No parse" else (return . fst . head ) (f s)
Which can be used, for example in GHCI, like this:
*Main Numeric> parse (liftReadsToParse (many1 hexDigit) readHex) "" "a1"
Right 161
Can anyone suggest any improvement to this approach with regard to:
Will the term (f s) be memoised, or evaluated twice in the case of a null (f s) returning False?
Handling multiple successful parses, i.e. when length (f s) is greater than one, I do not know how parsec deals with this.
Handling the remainder of the parse, i.e. (snd . head) (f s).
This is a nice idea. A more natural approach that would make
your ReadS parser fit in better with Parsec would be to
leave off the Parser String at the beginning of the type:
liftReadS :: ReadS a -> String -> Parser a
liftReadS reader = maybe (unexpected "no parse") (return . fst) .
                   listToMaybe . filter (null . snd) . reader
This "combinator" style is very idiomatic Haskell - once you
get used to it, it makes function definitions much easier
to read and understand.
You would then use liftReadS like this in the simple case:
> parse (many1 hexDigit >>= liftReadS readHex) "" "a1"
(Note that listToMaybe is in the Data.Maybe module.)
In more complex cases, liftReadS is easy to use inside any
Parsec do block.
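As an illustration of the do-block usage, here is a small self-contained sketch. The hexLiteral parser and its "0x" prefix are hypothetical, chosen only to show liftReadS in the middle of a larger parser.

```haskell
import Data.Maybe (listToMaybe)
import Numeric (readHex)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Run a ReadS reader on an already-parsed string, accepting only
-- results that consume the whole string.
liftReadS :: ReadS a -> String -> Parser a
liftReadS reader = maybe (unexpected "no parse") (return . fst) .
                   listToMaybe . filter (null . snd) . reader

-- Hypothetical example: a hexadecimal literal such as "0xa1".
hexLiteral :: Parser Int
hexLiteral = do
  _ <- string "0x"
  digits <- many1 hexDigit
  liftReadS readHex digits
```

With these definitions, parse hexLiteral "" "0xa1" gives Right 161.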
Regarding some of your other questions:
The function reader is applied only once now, so there is nothing to "memoize".
It is common and accepted practice to ignore all except the first parse in a ReadS parser in most cases, so you're fine.
To answer the first part of your question: no, (f s) will not be memoised; you would have to share the result manually:
liftReadsToParse p f = p >>= \s -> let fs = f s in if null fs then fail "No parse"
                                                   else (return . fst . head ) fs
But I'd use pattern matching instead:
liftReadsToParse p f = p >>= \s -> case f s of
  [] -> fail "No parse"
  (answer, _) : _ -> return answer