difference between munch and many (satisfy p)? - parsing

In the module Text.ParserCombinators.ReadP, munch (and munch1) documentation says:
Parses the first zero or more characters satisfying the predicate. Always succeds, exactly once having consumed all the characters Hence NOT the same as (many (satisfy p)).
How are they different?

First, let's find a counter-example and let's use Haskell tools to do it automatically. The QuickCheck library can give us such an counter-example very quickly:
import Data.Function (on)
import Text.ParserCombinators.ReadP
import Test.QuickCheck
prop_ParseTest :: String -> Property
prop_ParseTest input = on (===) runParser (munch pred) (many (satisfy pred))
where
runParser :: ReadP a -> [(a, String)]
runParser = flip readP_to_S input -- run a given parser on a given input
pred :: Char -> Bool
pred = (> 'A')
main :: IO ()
main = quickCheck prop_ParseTest
We ask it to test, whether the two parsers much pred and many (satisfy pred) are the same. QuickCheck immediately finds that they're different and tries to produce as short counter-example as possible:
*** Failed! Falsifiable (after 5 tests and 2 shrinks):
[("a","")] /= [("","a"),("a","")]
So munch pred always consumes 'a' unconditionally, while many (satisfy pred) gives a nondeterministic answer - it might or might not not consume 'a'.
For example, consider running the following two parsers on string "abcd":
test1 = munch (> 'A') >> string "cd"
test2 = many (satisfy (> 'A')) >> string "cd"
The first fails, because munch consumes the whole string and then it's not possible to match "cd". The second one succeeds, because many (satisfy ...) creates all possible branches
[("","abcd"),("a","bcd"),("ab","cd"),("abc","d"),("abcd","")]
and string "cd" succeeds on the branch that consumed "ab".

Related

Haskell Type error in Double recursion function

I'm trying to define a greedy function
greedy :: ReadP a -> ReadP [a]
that parses a sequence of values, returning only the "maximal" sequences that cannot be extended any further. For example,
> readP_to_S (greedy (string "a" +++ string "ab")) "abaac"
[(["a"],"baac"),(["ab","a","a"],"c")]
I'm using a very simple and probably clumsy way. Just parse the values and see if they can be parsed any further; if so, then reapply the function again to get all the possible values and concat that with the previous ones, or else just return the value itself. However, there seems to be some type problems, below is my code.
import Text.ParserCombinators.ReadP
addpair :: a -> [([a],String)] -> [([a],String)]
addpair a [] = []
addpair a (c:cs) = (a : (fst c), snd c ) : (addpair a cs)
greedy :: ReadP a -> ReadP [a]
greedy ap = readS_to_P (\s ->
let list = readP_to_S ap s in
f list )
where
f :: [(a,String)] -> [([a],String)]
f ((value, str2):cs) =
case readP_to_S ap str2 of
[] -> ([value], str2) : (f cs)
_ -> (addpair value (readP_to_S (greedy ap) str2)) ++ (f cs)
The GHC processes the code and says that function "f" has type [(a1,String)] -> [([a1],String)] but greedy is ReadP a -> ReadP [a]. I wonder why it is so because I think their type should agree. It also really helps if anyone can come up with some clever and more elegant approach to define the function greedy(my approach is definitely way too redundant)
To fix the compilation error, you need to add the language extension
{-# LANGUAGE ScopedTypeVariables #-}
to your source file, or pass the corresponding flag into the compiler. You also need to change the type signature of greedy to
greedy :: forall a. ReadP a -> ReadP [a]
This is because your two a type variables are not actually the same; they're in different scopes. With the extension and the forall, they are treated as being the same variable, and your types unify properly. Even then, the code errors, because you don't have an exhaustive pattern match in your definition of f. If you add
f [] = []
then the code seems to work as intended.
In order to simplify your code, I took a look at the provided function munch, which is defined as:
munch :: (Char -> Bool) -> ReadP String
-- ^ Parses the first zero or more characters satisfying the predicate.
-- Always succeeds, exactly once having consumed all the characters
-- Hence NOT the same as (many (satisfy p))
munch p =
do s <- look
scan s
where
scan (c:cs) | p c = do _ <- get; s <- scan cs; return (c:s)
scan _ = do return ""
In that spirit, your code can be rewritten as:
greedy2 :: forall a. ReadP a -> ReadP [a]
greedy2 ap = do
-- look at the string
s <- look
-- try parsing it, but without do notation
case readP_to_S ap s of
-- if we failed, then return nothing
[] -> return []
-- if we parsed something, ignore it
(_:_) -> do
-- parse it again, but this time inside of the monad
x <- ap
-- recurse, greedily parsing again
xs <- greedy2 ap
-- and return the concatenated values
return (x:xs)
This does have the speed disadvantage of executing ap twice as often as needed; this may be too slow for your use case. I'm sure my code could be further rewritten to avoid that, but I'm not a ReadP expert.

Self made "Many1" parser

This is task from online course. I've been sitting on this for two days. Please give some explanation or hints to solve it.
Here's type
newtype Prs a = Prs { runPrs :: String -> Maybe (a, String) }
I need to implement many1 parser. This is how it should work
> runPrs (many1 $ char 'A') "AAABCDE"
Just ("AAA","BCDE")
> runPrs (many1 $ char 'A') "BCDE"
Nothing
I have parser many implemented like that
many p = (:) <$> p <*> many p <|> pure []
Here's output for previous example.
*Main> test9
Just ("AAA","BCDE")
*Main> test10
Just ("","BCDE")
Note last result, it returns empty string but many1 should return Nothing. I don't know how to change many code to make work like many1. I can't undestand how to stop on first incorrect symbol.
Your many1 will need some way to fail: as you've written it it consumes characters for a while, consing them onto a pending result, until it eventually runs out of matches. This doesn't cover any cases where the parse could fail.
What you've implemented here is, in a way, many0, a parser which consumes 0 or more repetitions of something. Can you think of a way to implement many1 in terms of many0? It will look something like:
Consume one instance of p, without an alternative in case that fails
Consume 0 or more instances of p, returning [] when that fails.
Or in Haskell,
many1 :: Prs a -> Prs [a]
many1 p = (:) <$> p <*> many0 p

Please explain the behavior of this Parsec permutation parser

Why does this Parsec permutation parser not parse b?
p :: Parser (String, String)
p = permute (pair
<$?> ("", pa)
<|?> ("", pb))
where pair a b = (a, b)
pa :: Parser String
pa = do
char 'x'
many1 (char 'a')
pb :: Parser String
pb = do
many1 (char 'b')
λ> parseTest p "xaabb"
("aa","bb") -- expected result, good
λ> parseTest p "aabb"
("","") -- why "" for b?
Parser pa is configured as optional via <$?> so I don't understand why its failing has impacted the parsing of b. I can change it to optional (char 'x') to get the expected behavior, but I don't understand why.
pa :: Parser String
pa = do
optional (char 'x')
many1 (char 'a')
pb :: Parser String
pb = do
optional (char 'x')
many1 (char 'b')
λ> parseTest p "xaaxbb"
parse error at (line 1, column 2):
unexpected "a"
expecting "b"
λ> parseTest p "xbbxaa"
("aa","bb")
How can both input orderings be supported when we have identical shared prefix "x"?
I also don't understand the impact that consumption of the optional "x" is having on the parse behavior:
pb :: Parser String
pb = do
try px -- with this try x remains unconsumed and "aa" gets parsed
-- without this try x is consumed, but "aa" isn't parsed even though "x" is optional anyway
many1 (char 'b')
px :: Parser Char
px = do
optional (char 'x')
char 'x' <?> "second x"
λ> parseTest p "xaaxbb" -- without try on px
parse error at (line 1, column 2):
unexpected "a"
expecting second x
λ> parseTest p "xaaxbb" -- with try on px
("aa","")
Why parseTest p "aabb" gives ("","")
The permutation parser tries to strip off the front of the given string prefixes that can be parsed by its constituent parsers (pa and pb in this case). Here, it will have tried to apply both pa and pb to "aabb" and failed in both cases - it never even gets around to trying to parse "bb".
Why can't both pa and pb start with optional (char 'x')
Looking at permute, you'll see it uses choice, which in turn relies on (<|>). As the documentation of (<|>) says,
This combinator implements choice. The parser p <|> q first applies p. If it succeeds, the value of p is returned. If p fails without consuming any input, parser q is tried. This combinator is defined equal to the mplus member of the MonadPlus class and the (<|>) member of Alternative.
The parser is called predictive since q is only tried when parser p didn't consume any input (i.e.. the look ahead is 1). This non-backtracking behaviour allows for both an efficient implementation of the parser combinators and the generation of good error messages.
So when you do something like parseTest p "xbb", pa doesn't fail immediately (it consumes and 'x') and then the whole thing fails because it cannot backtrack.
How to make shared prefixes work?
As Daniel has suggested, it is best to factor out your grammar. Alternately, you can use try:
The parser try p behaves like parser p, except that it pretends that it hasn't consumed any input when an error occurs
Based on what we talked about before for (<|>), you ought then to put try in front of both of optional (char 'x').
Why does this Parsec permutation parser not parse b?
Because 'a' is not a valid first character for either parser pa or parser pb.
How can both input orderings be supported when we have identical shared prefix "x"?
Shared prefixes must be factored out of your grammar; or backtracking points inserted (using try) at the cost of performance.

Preserving the failure mode of a parser in Parsec

Problem statement
Suppose I have two parsers p and q and I concatenate them like this:
r = try p *> q
In Parsec, the behavior of this is:
If p fails without consuming input, then r fails without consuming input.
If p fails after consuming input, then r fails without consuming input.
if q fails without consuming input, then r fails after consuming p.
if q fails after consuming input, then r fails after consuming p and parts of q.
However, the behavior I'm looking for is a bit unusual:
If p fails without consuming input, then r should fail without consuming input.
If p fails after consuming input, then r should fail without consuming input.
if q fails without consuming input, then r should fail without consuming input.
if q fails after consuming input, then r should fail after consuming some input.
I can't seem to think of a clean way to do this.
Rationale
The reason is that I have a parser like this:
s = (:) <$> q <*> many r
The q parser, embedded inside the r parser, needs a way to signal either: invalid input (which occurs when q consumes input but fails), or end of the many loop (which occurs when q doesn't consume anything and fails). If the input is invalid, it should just fail the parser entirely and report the problem to the user. If there is no more input to consume, then it should end the many loop (without reporting a parser error to the user). The trouble is that it's possible that the input ends with a p but without any more valid q's to consume, in which case q will fail but without consuming any input.
So I was wondering if anyone have an elegant way to solve this problem? Thanks.
Addendum: Example
p = string "P"
q = (++) <$> try (string "xy") <*> string "z"
Test input on (hypothetical) parser s, had it worked the way I wanted:
xyz (accept)
xyzP (accept; P remains unparsed)
xyzPx (accept; Px remains unparsed; q failed but did not consume any input)
xyzPxy (reject; parser q consumed xy but failed)
xyzPxyz (accept)
In the form r = try p *> q, s will actually reject both case #2 and case #3 above. Of course, it's possible to achieve the above behavior by writing:
r = (++) <$> try (string "P" *> string "xy") <*> string "z"
but this isn't a general solution that works for any parsers p and q. (Perhaps a general solution doesn't exist?)
I believe I found a solution. It's not especially nice, but seems to works. At least something to start with:
{-# LANGUAGE FlexibleContexts #-}
import Control.Applicative hiding (many, (<|>))
import Control.Monad (void)
import Control.Monad.Trans (lift)
import Control.Monad.Trans.Maybe
import Text.Parsec hiding (optional)
import Text.Parsec.Char
import Text.Parsec.String
rcomb :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m b -> ParsecT s u m b
rcomb p q = ((test $ opt p *> opt q) <|> pure (Just ()))
>>= maybe empty (\_ -> p *> q)
where
-- | Converts failure to #MaybeT Nothing#:
opt = MaybeT . optional -- optional from Control.Applicative!
-- | Tests running a parser, returns Nothing if parsers failed consuming no
-- input, Just () otherwise.
test = lookAhead . try . runMaybeT . void
This is the r combinator you're asking for. The idea is that we first execute the parsers in a "test" run (using lookAhead . try) and if any of them fails without consuming input, we record it as Nothing inside MaybeT. This is accomplished by opt, it converts a failure to Nothing and wraps it into MaybeT. Thanks to MaybeT, if opt p returns Nothing, opt q is skipped.
If both p and q succeed, the test .. part returns Just (). And if one of them consumes input, the whole test .. fails. This way, we distinguish the 3 possibilities:
Failure with some input consumed by p or q.
Failure such that the failing part doesn't consume input.
Success.
After <|> pure (Just ()) both 1. and 3. result in Just (), while 2. results in Nothing. Finally, the maybe part converts Nothing into a non-consuming failure, and Just () into running the parsers again, now without any protection. This means that 1. fails again with consuming some input, and 3. succeeds.
Testing:
samples =
[ "xyz" -- (accept)
, "xyzP" -- (accept; P remains unparsed)
, "xyzPz" -- (accept; Pz remains unparsed)
, "xyzPx" -- (accept; Px remains unparsed; q failed but did not consume any input)
, "xyzPxy" -- (reject; parser q consumed xy but failed)
, "xyzPxyz" -- (accept)
]
main = do
-- Runs a parser and then accept anything, which shows what's left in the
-- input buffer:
let run p x = runP ((,) <$> p <*> many anyChar) () x x
let p, q :: Parser String
p = string "P"
q = (++) <$> try (string "xy") <*> string "z"
let parser = show <$> ((:) <$> q <*> many (rcomb p q))
mapM_ (print . run parser) samples

Correct ReadP usage in Haskell

I did a very simple parser for lists of numbers in a file, using ReadP in Haskell. It works, but it is very slow... is this normal behavior of this type of parser or am I doing something wrong?
import Text.ParserCombinators.ReadP
import qualified Data.IntSet as IntSet
import Data.Char
setsReader :: ReadP [ IntSet.IntSet ]
setsReader =
setReader `sepBy` ( char '\n' )
innocentWhitespace :: ReadP ()
innocentWhitespace =
skipMany $ (char ' ') <++ (char '\t' )
setReader :: ReadP IntSet.IntSet
setReader = do
innocentWhitespace
int_list <- integerReader `sepBy1` innocentWhitespace
innocentWhitespace
return $ IntSet.fromList int_list
integerReader :: ReadP Int
integerReader = do
digits <- many1 $ satisfy isDigit
return $ read digits
readClusters:: String -> IO [ IntSet.IntSet ]
readClusters filename = do
whole_file <- readFile filename
return $ ( fst . last ) $ readP_to_S setsReader whole_file
setReader has exponential behavior, because it is allowing the whitespace between the numbers to be optional. So for the line:
12 34 56
It is seeing these parses:
[1,2,3,4,5,6]
[12,3,4,5,6]
[1,2,34,5,6]
[12,34,5,6]
[1,2,3,4,56]
[12,3,4,56]
[1,2,34,56]
[12,34,56]
You could see how this could get out of hand for long lines. ReadP returns all valid parses in increasing length order, so to get to the last parse you have to traverse through all these intermediate parses. Change:
int_list <- integerReader `sepBy1` innocentWhitespace
To:
int_list <- integerReader `sepBy1` mandatoryWhitespace
For a suitable definition of mandatoryWhitespace to squash this exponential behavior. The parsing strategy used by parsec is more resistant to this kind of error, because it is greedy -- once it consumes input in a given branch, it is committed to that branch and never goes back (unless you explicitly asked it to). So once it correctly parsed 12, it would never go back to parse 1 2. Of course that means it matters in which order you state your choices, which I always find to be a bit of a pain to think about.
Also I would use:
head [ x | (x,"") <- readP_to_S setsReader whole_file ]
To extract a valid whole-file parse, in case it very quickly consumed all input but there were a hundred bazillion ways to interpret that input. If you don't care about the ambiguity, you would probably rather it return the first one than the last one, because the first one will arrive faster.

Resources