Preserving the failure mode of a parser in Parsec - parsing

Problem statement
Suppose I have two parsers p and q and I concatenate them like this:
r = try p *> q
In Parsec, the behavior of this is:
If p fails without consuming input, then r fails without consuming input.
If p fails after consuming input, then r fails without consuming input.
if q fails without consuming input, then r fails after consuming p.
if q fails after consuming input, then r fails after consuming p and parts of q.
However, the behavior I'm looking for is a bit unusual:
If p fails without consuming input, then r should fail without consuming input.
If p fails after consuming input, then r should fail without consuming input.
if q fails without consuming input, then r should fail without consuming input.
if q fails after consuming input, then r should fail after consuming some input.
I can't seem to think of a clean way to do this.
Rationale
The reason is that I have a parser like this:
s = (:) <$> q <*> many r
The q parser, embedded inside the r parser, needs a way to signal either: invalid input (which occurs when q consumes input but fails), or end of the many loop (which occurs when q doesn't consume anything and fails). If the input is invalid, it should just fail the parser entirely and report the problem to the user. If there is no more input to consume, then it should end the many loop (without reporting a parser error to the user). The trouble is that it's possible that the input ends with a p but without any more valid q's to consume, in which case q will fail but without consuming any input.
So I was wondering if anyone have an elegant way to solve this problem? Thanks.
Addendum: Example
p = string "P"
q = (++) <$> try (string "xy") <*> string "z"
Test input on (hypothetical) parser s, had it worked the way I wanted:
xyz (accept)
xyzP (accept; P remains unparsed)
xyzPx (accept; Px remains unparsed; q failed but did not consume any input)
xyzPxy (reject; parser q consumed xy but failed)
xyzPxyz (accept)
In the form r = try p *> q, s will actually reject both case #2 and case #3 above. Of course, it's possible to achieve the above behavior by writing:
r = (++) <$> try (string "P" *> string "xy") <*> string "z"
but this isn't a general solution that works for any parsers p and q. (Perhaps a general solution doesn't exist?)

I believe I found a solution. It's not especially nice, but seems to works. At least something to start with:
{-# LANGUAGE FlexibleContexts #-}
import Control.Applicative hiding (many, (<|>))
import Control.Monad (void)
import Control.Monad.Trans (lift)
import Control.Monad.Trans.Maybe
import Text.Parsec hiding (optional)
import Text.Parsec.Char
import Text.Parsec.String
rcomb :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m b -> ParsecT s u m b
rcomb p q = ((test $ opt p *> opt q) <|> pure (Just ()))
>>= maybe empty (\_ -> p *> q)
where
-- | Converts failure to #MaybeT Nothing#:
opt = MaybeT . optional -- optional from Control.Applicative!
-- | Tests running a parser, returns Nothing if parsers failed consuming no
-- input, Just () otherwise.
test = lookAhead . try . runMaybeT . void
This is the r combinator you're asking for. The idea is that we first execute the parsers in a "test" run (using lookAhead . try) and if any of them fails without consuming input, we record it as Nothing inside MaybeT. This is accomplished by opt, it converts a failure to Nothing and wraps it into MaybeT. Thanks to MaybeT, if opt p returns Nothing, opt q is skipped.
If both p and q succeed, the test .. part returns Just (). And if one of them consumes input, the whole test .. fails. This way, we distinguish the 3 possibilities:
Failure with some input consumed by p or q.
Failure such that the failing part doesn't consume input.
Success.
After <|> pure (Just ()) both 1. and 3. result in Just (), while 2. results in Nothing. Finally, the maybe part converts Nothing into a non-consuming failure, and Just () into running the parsers again, now without any protection. This means that 1. fails again with consuming some input, and 3. succeeds.
Testing:
samples =
[ "xyz" -- (accept)
, "xyzP" -- (accept; P remains unparsed)
, "xyzPz" -- (accept; Pz remains unparsed)
, "xyzPx" -- (accept; Px remains unparsed; q failed but did not consume any input)
, "xyzPxy" -- (reject; parser q consumed xy but failed)
, "xyzPxyz" -- (accept)
]
main = do
-- Runs a parser and then accept anything, which shows what's left in the
-- input buffer:
let run p x = runP ((,) <$> p <*> many anyChar) () x x
let p, q :: Parser String
p = string "P"
q = (++) <$> try (string "xy") <*> string "z"
let parser = show <$> ((:) <$> q <*> many (rcomb p q))
mapM_ (print . run parser) samples

Related

Why does try not trigger backtracking in this example

I am trying to wrap my head around writing parser using parsec in Haskell, in particular how backtracking works.
Take the following simple parser:
import Text.Parsec
type Parser = Parsec String () String
parseConst :: Parser
parseConst = do {
x <- many digit;
return $ read x
}
parseAdd :: Parser
parseAdd = do {
l <- parseExp;
char '+';
r <- parseExp;
return $ l <> "+" <> r
}
parseExp :: Parser
parseExp = try parseConst <|> parseAdd
pp :: Parser
pp = parseExp <* eof
test = parse pp "" "1+1"
test has value
Left (line 1, column 2):
unexpected '+'
expecting digit or end of input
In my mind this should succeed since I used the try combinator on parseConst in the definition of parseExp.
What am I missing? I am also interrested in pointers for how to debug this in my own, I tried using parserTraced which just allowed me to conclude that it indeed wasn't backtracking.
PS.
I know this is an awful way to write an expression parser, but I'd like to understand why it doesn't work.
There are a lot of problems here.
First, parseConst can never work right. The type says it must produce a String, so read :: String -> String. That particular Read instance requires the input be a quoted string, so being passed 0 or more digit characters to read is always going to result in a call to error if you try to evaluate the value it produces.
Second, parseConst can succeed on matching zero characters. I think you probably wanted some instead of many. That will make it actually fail if it encounters input that doesn't start with a digit.
Third, (<|>) doesn't do what you think. You might think that (a <* c) <|> (b <* c) is interchangeable with (a <|> b) <* c, but it isn't. There is no way to throw try in and make it the same, either. The problem is that (<|>) commits to whichever branch succeed, if one does. In (a <|> b) <* c, if a matches, there's no way later to backtrack and try b there. Doesn't matter how you lob try around, it can't undo the fact that (<|>) committed to a. In contrast, (a <* c) <|> (b <* c) doesn't commit until both a and c or b and c match the input.
This is the situation you're encountering. You have (try parseConst <|> parseAdd) <* eof, after a bit of inlining. Since parseConst will always succeed (see the second issue), parseAdd will never get tried, even if the eof fails. So after parseConst consumes zero or more leading digits, the parse will fail unless that's the end of the input. Working around this essentially requires carefully planning your grammar such that any use of (<|>) is safe to commit locally. That is, the contents of each branch must not overlap in a way that is disambiguated only by later portions of the grammar.
Note that this unpleasant behavior with (<|>) is how the parsec family of libraries work, but not how all parser libraries in Haskell work. Other libraries work without the left bias or commit behavior the parsec family have chosen.

How do try and <|> functions from parsers lib work

(I use trifecta parser lib). I'm trying to make a parser that parses integers into Right and literal sequences (alphabet, numeral symbols and "-" are allowed) into Left:
*Lib> parseString myParser mempty "123 qwe 123qwe 123-qwe-"
Success [Right 123,Left "qwe",Left "123qwe",Left "123-qwe-"]
That is what I invented:
myParser :: Parser [Either String Integer]
myParser = sepBy1 (try (Right . read <$> (some digit <* notFollowedBy (choice [letter, char '-'])))
<|> Left <$> some (choice [alphaNum, char '-']))
(char ' ')
My problem is that I don't understand why try is needed there (and in any other similar situations). When try is not used, an error appears:
*Lib> parseString myParser mempty "123 qwe 123qwe 123-qwe-"
Failure (ErrInfo {_errDoc = (interactive):1:12: error: expected: digit
1 | 123 qwe 123qwe 123-qwe-<EOF>
| ^ , _errDeltas = [Columns 11 11]})
So try puts the parsing cursor back to where we started on failure. Imagine try isn't used:
123qwe
^ failed there, the cursor position remains there
On the other hand, <|> is like "either". It should run the second parser Left <$> some (choice [alphaNum, char '-'])) (when the first parser failed) and consume just "qwe".
Somewhere I'm wrong.
The second parser would indeed consume the "qwe" part if only it was given a chance to run. But it isn't given such chance.
Look at the definition of (<|>) for Parser:
Parser m <|> Parser n = Parser $ \ eo ee co ce d bs ->
m eo (\e -> n (\a e' -> eo a (e <> e')) (\e' -> ee (e <> e')) co ce d bs) co ce d bs
Hmm... Maybe not such a good idea to look at that. But let's push through nevertheless. To make sense of all those eo, ee, etc., let's look at their explanations on the Parser definition:
The first four arguments are behavior continuations:
epsilon success: the parser has consumed no input and has a result as well as a possible Err; the position and chunk are unchanged (see pure)
epsilon failure: the parser has consumed no input and is failing with the given Err; the position and chunk are unchanged (see empty)
committed success: the parser has consumed input and is yielding the result, set of expected strings that would have permitted this parse to continue, new position, and residual chunk to the continuation.
committed failure: the parser has consumed input and is failing with a given ErrInfo (user-facing error message)
In your case we clearly have "committed failure" - i.e. the Right parser has consumed some input and failed. So in this case it's going to call the fourth continuation - denoted ce in the definition of (<|>).
And now look at the body of the definition: the fourth continuation is passed to parser m unchanged:
m eo (\e -> n (\a e' -> eo a (e <> e')) (\e' -> ee (e <> e')) co ce d bs) co ce d bs
^
|
here it is
This means that the parser returned from (<|>) will call the fourth continuation in all cases in which parser m calls it. Which means that it will fail with "committed failure" in all cases in which the parser m fails with "committed failure". Which is exactly what you observe.

difference between munch and many (satisfy p)?

In the module Text.ParserCombinators.ReadP, munch (and munch1) documentation says:
Parses the first zero or more characters satisfying the predicate. Always succeds, exactly once having consumed all the characters Hence NOT the same as (many (satisfy p)).
How are they different?
First, let's find a counter-example and let's use Haskell tools to do it automatically. The QuickCheck library can give us such an counter-example very quickly:
import Data.Function (on)
import Text.ParserCombinators.ReadP
import Test.QuickCheck
prop_ParseTest :: String -> Property
prop_ParseTest input = on (===) runParser (munch pred) (many (satisfy pred))
where
runParser :: ReadP a -> [(a, String)]
runParser = flip readP_to_S input -- run a given parser on a given input
pred :: Char -> Bool
pred = (> 'A')
main :: IO ()
main = quickCheck prop_ParseTest
We ask it to test, whether the two parsers much pred and many (satisfy pred) are the same. QuickCheck immediately finds that they're different and tries to produce as short counter-example as possible:
*** Failed! Falsifiable (after 5 tests and 2 shrinks):
[("a","")] /= [("","a"),("a","")]
So munch pred always consumes 'a' unconditionally, while many (satisfy pred) gives a nondeterministic answer - it might or might not not consume 'a'.
For example, consider running the following two parsers on string "abcd":
test1 = munch (> 'A') >> string "cd"
test2 = many (satisfy (> 'A')) >> string "cd"
The first fails, because munch consumes the whole string and then it's not possible to match "cd". The second one succeeds, because many (satisfy ...) creates all possible branches
[("","abcd"),("a","bcd"),("ab","cd"),("abc","d"),("abcd","")]
and string "cd" succeeds on the branch that consumed "ab".

Creating a parser combinator of type Parser a -> Parser b -> Parser (Either a b)

I want to parse some text in which certain fields have structure most of the time but occasionally (due to special casing, typos etc) this structure is missing.
E.g. Regular case is Cost: 5, but occasionally it will read Cost: 5m or Cost: 3 + 1 per ally, or some other random stuff.
In the case of the normal parser (p) not working, I'd like to fallback to a parser which just takes the whole line as a string.
To this end, I'd like to create a combinator of type Parser a -> Parser b -> Either a b. However, I cannot work out how to inspect the results of attempting to see if the first parser succeeds or not, without doing something like case parse p "" txt of ....
I can't see a build in combinator, but I'm sure there's some easy way to solve this that I'm missing
I think you want something like this
eitherParse :: Parser a -> Parser b -> Parser (Either a b)
eitherParse a b = fmap Left (try a) <|> fmap Right b
The try is just to ensure that if a consumes some input and then fails, you'll backtrack properly. Then you can just use the normal methods for running a parser to yield Either ParseError (Either a b)
Which is quite easy to transform into your Either a b
case parse p "" str of
Right (Left a) -> useA a
Right (Right b) -> useB b
Left err -> handleParserError err
Try this: (<|>) :: ParsecT s u m a -> ParsecT s u m a -> ParsecT s u m a
As a rule you could use it this way:
try p <|> q

Insert a character into parser combinator character stream in Haskell

This question is related to both Parsec and uu-parsinglib. When we write parser combinators, they process characters streams from compiler. Is it somehow possible to parse a character and put it back (or return another character back) to the input stream?
I want for example to parse input "test + 5", parse the t, e, s, t and after recognition of test pattern, put for example v character back into the character stream, so while continuating the parsing process we are matching against v + 5
I do not want to use this in any particular case for now - I want to deeply learn the possibilities.
I'm not sure if it's possible with these parsers directly, but in general you can accomplish it by combining parsers with some streaming that allows injecting leftovers.
For example, using attoparsec-conduit you can turn a parser into a conduit using
sinkParser :: (AttoparsecInput a, MonadThrow m)
=> Parser a b -> Consumer a m b
where Consumer is a special kind of conduit that doesn't produce any output, only receives input and returns a final value.
Since conduits support leftovers, you can create a helper method that converts a parser that optionally returns a value to be pushed into the stream into a conduit:
import Data.Attoparsec.Types
import Data.Conduit
import Data.Conduit.Attoparsec
import Data.Functor
reinject :: (AttoparsecInput a, MonadThrow m)
=> Parser a (Maybe a, b) -> Consumer a m b
reinject p = do
(lo, r) <- sinkParser p
maybe (return ()) leftover lo
return r
Then you convert standard parsers to conduits using sinkParser and these special parsers using reinject, and then combine conduits instead of parsers.
I think the simplest way to archive this is to build a multi-layered parser. Think of a lexer + parser combination. This is a clean approach to this problem.
You have to separate the two kind of parsing. The search-and-replace parsing goes to the first parser and the build-the-AST parsing to the second. Or you can create an intermediate token representation.
import Text.Parsec
import Text.Parsec.String
parserLvl1 :: Parser String
parserLvl1 = many (try (string "test" >> return 'v') <|> anyChar)
parserLvl2 :: Parser Plus
parserLvl2 = do text1 <- many (noneOf "+")
char '+'
text2 <- many (noneOf "+")
return $ Plus text1 text2
data Plus = Plus String String
deriving Show
wholeParse :: String -> Either ParseError Plus
wholeParse source = do res1 <- parse parserLvl1 "lvl1" source
res2 <- parse parserLvl2 "lvl2" res1
return res2
Now you can parse your example. wholeParse "test+5" results in Right (Plus "v" "5").
Possible variations:
Create a class and an instance for combining wrapped parser stages. (Possibly carrying parser state.)
Create an intermediate representation, a stream of tokens
This is easily done in uu-parsinglib using the pSwitch function. But the question is why you want to do so? Because the v is missing from the input? In that case uu-parsinglib will perform error correction automatically so you do not need something like this. Otherwise you can write
pSwitch :: (st1 -> (st2, st2 -> st1)) -> P st2 a -> P st1 a
pInsert_v = pSwitch (\st1 -> (prepend v st2, id) (pSucceed ())
It depends on your actual state type how the v is actually added, so you will have to define the function prepend yourself. I do not know e.g. how such an insertion would influence the current position in the file etc.
Doaitse Swierstra

Resources