How do try and <|> functions from parsers lib work - parsing

(I use trifecta parser lib). I'm trying to make a parser that parses integers into Right and literal sequences (alphabet, numeral symbols and "-" are allowed) into Left:
*Lib> parseString myParser mempty "123 qwe 123qwe 123-qwe-"
Success [Right 123,Left "qwe",Left "123qwe",Left "123-qwe-"]
That is what I invented:
myParser :: Parser [Either String Integer]
myParser = sepBy1 (try (Right . read <$> (some digit <* notFollowedBy (choice [letter, char '-'])))
<|> Left <$> some (choice [alphaNum, char '-']))
(char ' ')
My problem is that I don't understand why try is needed there (and in any other similar situations). When try is not used, an error appears:
*Lib> parseString myParser mempty "123 qwe 123qwe 123-qwe-"
Failure (ErrInfo {_errDoc = (interactive):1:12: error: expected: digit
1 | 123 qwe 123qwe 123-qwe-<EOF>
| ^ , _errDeltas = [Columns 11 11]})
So try puts the parsing cursor back to where we started on failure. Imagine try isn't used:
123qwe
^ failed there, the cursor position remains there
On the other hand, <|> is like "either". It should run the second parser Left <$> some (choice [alphaNum, char '-'])) (when the first parser failed) and consume just "qwe".
Somewhere I'm wrong.

The second parser would indeed consume the "qwe" part if only it was given a chance to run. But it isn't given such chance.
Look at the definition of (<|>) for Parser:
Parser m <|> Parser n = Parser $ \ eo ee co ce d bs ->
m eo (\e -> n (\a e' -> eo a (e <> e')) (\e' -> ee (e <> e')) co ce d bs) co ce d bs
Hmm... Maybe not such a good idea to look at that. But let's push through nevertheless. To make sense of all those eo, ee, etc., let's look at their explanations on the Parser definition:
The first four arguments are behavior continuations:
epsilon success: the parser has consumed no input and has a result as well as a possible Err; the position and chunk are unchanged (see pure)
epsilon failure: the parser has consumed no input and is failing with the given Err; the position and chunk are unchanged (see empty)
committed success: the parser has consumed input and is yielding the result, set of expected strings that would have permitted this parse to continue, new position, and residual chunk to the continuation.
committed failure: the parser has consumed input and is failing with a given ErrInfo (user-facing error message)
In your case we clearly have "committed failure" - i.e. the Right parser has consumed some input and failed. So in this case it's going to call the fourth continuation - denoted ce in the definition of (<|>).
And now look at the body of the definition: the fourth continuation is passed to parser m unchanged:
m eo (\e -> n (\a e' -> eo a (e <> e')) (\e' -> ee (e <> e')) co ce d bs) co ce d bs
^
|
here it is
This means that the parser returned from (<|>) will call the fourth continuation in all cases in which parser m calls it. Which means that it will fail with "committed failure" in all cases in which the parser m fails with "committed failure". Which is exactly what you observe.

Related

Problem while writing a small parser in Haskell using Parsec

I am trying to write a parser for a small language with the following piece of code
import Text.ParserCombinators.Parsec
import Text.Parsec.Token
data Exp = Atom String | Op String Exp
instance Show Exp where
show (Atom x) = x
show (Op f x) = f ++ "(" ++ (show x) ++ ")"
parse_exp :: Parser Exp
parse_exp = (try parse_atom) <|> parse_op
parse_atom :: Parser Exp
parse_atom = do
x <- many1 letter
return (Atom x)
parse_op :: Parser Exp
parse_op = do
x <- many1 letter
char '('
y <- parse_exp
char ')'
return (Op x y)
But when I type in ghci
>>> parse (parse_exp <* eof) "<error>" "s(t)"
I get the output
Left "<error>" (line 1, column 2):
unexpected '('
expecting letter or end of input
If I redefine parse_exp as
parse_exp = (try parse_op) <|> parse_atom
then with I get correct result
>>> parse (parse_exp <* eof) "<error>" "s(t)"
Right s(t)
But I am confused why the first one does not work. Is there a general fix to these kinds of problems in parsing?
When a Parsec parser, like parse_atom, is run on a particular string, there are four possible results:
It succeeds, consuming some input.
It fails, consuming some input.
It succeeds, consuming no input.
It fails, consuming no input.
In the Parsec source code, these are referred to as "consumed ok", "consumed err", "empty ok" and "empty err" (sometimes abbreviated cok, cerr, eok, eerr).
When two Parsec parsers are used in an alternative, like p <|> q, here's how it's parsed. First, Parsec tries to parse with p. Then:
If this results in "consumed ok" or "empty ok", the parse succeeds and this becomes the result of the entire parser p <|> q.
If this results in "empty err", Parsec tries the alternative q, and this becomes the result of the entire p <|> q parser.
If this results in "consumed err", the entire parser p <|> q fails with "consumed err" (cerr).
Note the critical difference between p returning cerr (which causes the whole parser to fail) versus returning eerr (which causes the alternative parser q to be tried).
The try function changes the behavior of a parser by converting a "cerr" result to an "eerr" result.
This means that if you are trying to parse the text "s(t)" with different parsers:
with the parser parse_atom <|> parse_op, the parser parse_atom returns "cok" consuming "s" and leaving unparseable text "(t)" which causes an error
with the parser try parse_atom <|> parse_op, the parser parse_atom still returns "cok" consuming "s", so the try (which only changes cerr to eerr) has no effect, and the unparseable text "(t)" causes the same error
with the parser parse_op <|> parse_atom, the parser parse_op successfully parses the string (actually, it doesn't because the recursive call to parse_exp can't parse "t", but let's ignore that); however, if the same parser was used on the text "s", then parse_op would consume the "s" before failing (i.e., cerr), causing the entire parse to fail instead of trying the alternative parse_atom
with the parser try parse_op <|> parse_atom, this would parse "s(t)", exactly as the previous example, and the try would have no effect; however, it would also work on the text "s", because parse_op would consume the "s" before failing with cerr, then try would "rescue" the parse by turning the cerr into an eerr, and the alternative parse_atom would be checked, successfully parsing (cok) the atom "s".
That's why the "correct" parser for your problem is try parse_op <|> parse_atom.
Be warned that this behavior isn't a fundamental aspect of monadic parsers. It's a design choice made by Parsec (and compatible parsers like Megaparsec). Other monadic parsers can have different rules for how alternatives with <|> work.
The "general fix" for these kind of Parsec parsing problems is to be aware of the facts that in the expression p <|> q:
p is tried first, and if it succeeds, q will be ignored, even if q would provide a "longer" or "better" or "more sensible" parse or avoid additional parsing errors further down the road. In parse_atom <|> parse_op, because parse_atom can succeed on strings meant for parse_op, this order won't work correctly.
q is only tried if p fails without consuming input. You must arrange for p to not consume anything on failure, possibly by using try, if you expect the alternative q to be checked. So, parse_op <|> parse_atom isn't going to work if parse_op starts to consume something (like an identifier) before realizing it can't continue and returning cerr.
As an alternative to using try, you can also think more carefully about the structure of your parser. An alternative way of writing parse_exp, for example, would be:
parse_exp :: Parser Exp
parse_exp = do
-- there's always an identifier
x <- many1 letter
-- there *might* be an expression in parentheses
y <- optionMaybe (parens parse_exp)
case y of
Nothing -> return (Atom x)
Just y' -> return (Op x y')
where parens = between (char '(') (char ')')
This can be written a little more concisely, but even then it's not as "elegant" as something like try parse_op <|> parse_atom. (It performs better, though, so that might be a consideration in some applications.)
The problem is that the string "s" counts as an atom according to your definitions. Try this:
parse parse_atom "" "s(t)"
> Atom "s"
So your parser parse_exp actually succeeds, returning Atom "s", but then you also expect an EOF right after it, and that's where it fails, encountering an open paren instead of an EOF (just like the error message says!)
When you swap the alternative around, it would first attempt parse_op, which would succeed, returning Op "s" "t", and then encounter EOF, just as expected.

Please explain the behavior of this Parsec permutation parser

Why does this Parsec permutation parser not parse b?
p :: Parser (String, String)
p = permute (pair
<$?> ("", pa)
<|?> ("", pb))
where pair a b = (a, b)
pa :: Parser String
pa = do
char 'x'
many1 (char 'a')
pb :: Parser String
pb = do
many1 (char 'b')
λ> parseTest p "xaabb"
("aa","bb") -- expected result, good
λ> parseTest p "aabb"
("","") -- why "" for b?
Parser pa is configured as optional via <$?> so I don't understand why its failing has impacted the parsing of b. I can change it to optional (char 'x') to get the expected behavior, but I don't understand why.
pa :: Parser String
pa = do
optional (char 'x')
many1 (char 'a')
pb :: Parser String
pb = do
optional (char 'x')
many1 (char 'b')
λ> parseTest p "xaaxbb"
parse error at (line 1, column 2):
unexpected "a"
expecting "b"
λ> parseTest p "xbbxaa"
("aa","bb")
How can both input orderings be supported when we have identical shared prefix "x"?
I also don't understand the impact that consumption of the optional "x" is having on the parse behavior:
pb :: Parser String
pb = do
try px -- with this try x remains unconsumed and "aa" gets parsed
-- without this try x is consumed, but "aa" isn't parsed even though "x" is optional anyway
many1 (char 'b')
px :: Parser Char
px = do
optional (char 'x')
char 'x' <?> "second x"
λ> parseTest p "xaaxbb" -- without try on px
parse error at (line 1, column 2):
unexpected "a"
expecting second x
λ> parseTest p "xaaxbb" -- with try on px
("aa","")
Why parseTest p "aabb" gives ("","")
The permutation parser tries to strip off the front of the given string prefixes that can be parsed by its constituent parsers (pa and pb in this case). Here, it will have tried to apply both pa and pb to "aabb" and failed in both cases - it never even gets around to trying to parse "bb".
Why can't both pa and pb start with optional (char 'x')
Looking at permute, you'll see it uses choice, which in turn relies on (<|>). As the documentation of (<|>) says,
This combinator implements choice. The parser p <|> q first applies p. If it succeeds, the value of p is returned. If p fails without consuming any input, parser q is tried. This combinator is defined equal to the mplus member of the MonadPlus class and the (<|>) member of Alternative.
The parser is called predictive since q is only tried when parser p didn't consume any input (i.e.. the look ahead is 1). This non-backtracking behaviour allows for both an efficient implementation of the parser combinators and the generation of good error messages.
So when you do something like parseTest p "xbb", pa doesn't fail immediately (it consumes and 'x') and then the whole thing fails because it cannot backtrack.
How to make shared prefixes work?
As Daniel has suggested, it is best to factor out your grammar. Alternately, you can use try:
The parser try p behaves like parser p, except that it pretends that it hasn't consumed any input when an error occurs
Based on what we talked about before for (<|>), you ought then to put try in front of both of optional (char 'x').
Why does this Parsec permutation parser not parse b?
Because 'a' is not a valid first character for either parser pa or parser pb.
How can both input orderings be supported when we have identical shared prefix "x"?
Shared prefixes must be factored out of your grammar; or backtracking points inserted (using try) at the cost of performance.

Preserving the failure mode of a parser in Parsec

Problem statement
Suppose I have two parsers p and q and I concatenate them like this:
r = try p *> q
In Parsec, the behavior of this is:
If p fails without consuming input, then r fails without consuming input.
If p fails after consuming input, then r fails without consuming input.
if q fails without consuming input, then r fails after consuming p.
if q fails after consuming input, then r fails after consuming p and parts of q.
However, the behavior I'm looking for is a bit unusual:
If p fails without consuming input, then r should fail without consuming input.
If p fails after consuming input, then r should fail without consuming input.
if q fails without consuming input, then r should fail without consuming input.
if q fails after consuming input, then r should fail after consuming some input.
I can't seem to think of a clean way to do this.
Rationale
The reason is that I have a parser like this:
s = (:) <$> q <*> many r
The q parser, embedded inside the r parser, needs a way to signal either: invalid input (which occurs when q consumes input but fails), or end of the many loop (which occurs when q doesn't consume anything and fails). If the input is invalid, it should just fail the parser entirely and report the problem to the user. If there is no more input to consume, then it should end the many loop (without reporting a parser error to the user). The trouble is that it's possible that the input ends with a p but without any more valid q's to consume, in which case q will fail but without consuming any input.
So I was wondering if anyone have an elegant way to solve this problem? Thanks.
Addendum: Example
p = string "P"
q = (++) <$> try (string "xy") <*> string "z"
Test input on (hypothetical) parser s, had it worked the way I wanted:
xyz (accept)
xyzP (accept; P remains unparsed)
xyzPx (accept; Px remains unparsed; q failed but did not consume any input)
xyzPxy (reject; parser q consumed xy but failed)
xyzPxyz (accept)
In the form r = try p *> q, s will actually reject both case #2 and case #3 above. Of course, it's possible to achieve the above behavior by writing:
r = (++) <$> try (string "P" *> string "xy") <*> string "z"
but this isn't a general solution that works for any parsers p and q. (Perhaps a general solution doesn't exist?)
I believe I found a solution. It's not especially nice, but seems to works. At least something to start with:
{-# LANGUAGE FlexibleContexts #-}
import Control.Applicative hiding (many, (<|>))
import Control.Monad (void)
import Control.Monad.Trans (lift)
import Control.Monad.Trans.Maybe
import Text.Parsec hiding (optional)
import Text.Parsec.Char
import Text.Parsec.String
rcomb :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m b -> ParsecT s u m b
rcomb p q = ((test $ opt p *> opt q) <|> pure (Just ()))
>>= maybe empty (\_ -> p *> q)
where
-- | Converts failure to #MaybeT Nothing#:
opt = MaybeT . optional -- optional from Control.Applicative!
-- | Tests running a parser, returns Nothing if parsers failed consuming no
-- input, Just () otherwise.
test = lookAhead . try . runMaybeT . void
This is the r combinator you're asking for. The idea is that we first execute the parsers in a "test" run (using lookAhead . try) and if any of them fails without consuming input, we record it as Nothing inside MaybeT. This is accomplished by opt, it converts a failure to Nothing and wraps it into MaybeT. Thanks to MaybeT, if opt p returns Nothing, opt q is skipped.
If both p and q succeed, the test .. part returns Just (). And if one of them consumes input, the whole test .. fails. This way, we distinguish the 3 possibilities:
Failure with some input consumed by p or q.
Failure such that the failing part doesn't consume input.
Success.
After <|> pure (Just ()) both 1. and 3. result in Just (), while 2. results in Nothing. Finally, the maybe part converts Nothing into a non-consuming failure, and Just () into running the parsers again, now without any protection. This means that 1. fails again with consuming some input, and 3. succeeds.
Testing:
samples =
[ "xyz" -- (accept)
, "xyzP" -- (accept; P remains unparsed)
, "xyzPz" -- (accept; Pz remains unparsed)
, "xyzPx" -- (accept; Px remains unparsed; q failed but did not consume any input)
, "xyzPxy" -- (reject; parser q consumed xy but failed)
, "xyzPxyz" -- (accept)
]
main = do
-- Runs a parser and then accept anything, which shows what's left in the
-- input buffer:
let run p x = runP ((,) <$> p <*> many anyChar) () x x
let p, q :: Parser String
p = string "P"
q = (++) <$> try (string "xy") <*> string "z"
let parser = show <$> ((:) <$> q <*> many (rcomb p q))
mapM_ (print . run parser) samples

Parsing in Haskell for a simple interpreter

I'm relatively new to Haskell with main programming background coming from OO languages. I am trying to write an interpreter with a parser for a simple programming language. So far I have the interpreter at a state which I am reasonably happy with, but am struggling slightly with the parser.
Here is the piece of code which I am having problems with
data IntExp
= IVar Var
| ICon Int
| Add IntExp IntExp
deriving (Read, Show)
whitespace = many1 (char ' ')
parseICon :: Parser IntExp
parseICon =
do x <- many (digit)
return (ICon (read x :: Int))
parseIVar :: Parser IntExp
parseIVar =
do x <- many (letter)
prime <- string "'" <|> string ""
return (IVar (x ++ prime))
parseIntExp :: Parser IntExp
parseIntExp =
do x <- try(parseICon)<|>try(parseIVar)<|>parseAdd
return x
parseAdd :: Parser IntExp
parseAdd =
do x <- parseIntExp
whitespace
string "+"
whitespace
y <- parseIntExp
return (Add x y)
runP :: Show a => Parser a -> String -> IO ()
runP p input
= case parse p "" input of
Left err ->
do putStr "parse error at "
print err
Right x -> print x
The language is slightly more complex, but this is enough to show my problem.
So in the type IntExp ICon is a constant and IVar is a variable, but now onto the problem. This for example runs successfully
runP parseAdd "5 + 5"
which gives (Add (ICon 5) (ICon 5)), which is the expected result. The problem arises when using IVars rather than ICons eg
runP parseAdd "n + m"
This causes the program to error out saying there was an unexpected "n" where a digit was expected. This leads me to believe that parseIntExp isn't working as I intended. My intention was that it will try to parse an ICon, if that fails then try to parse an IVar and so on.
So I either think the problem exists in parseIntExp, or that I am missing something in parseIVar and parseICon.
I hope I've given enough info about my problem and I was clear enough.
Thanks for any help you can give me!
Your problem is actually in parseICon:
parseICon =
do x <- many (digit)
return (ICon (read x :: Int))
The many combinator matches zero or more occurrences, so it's succeeding on "m" by matching zero digits, then probably dying when read fails.
And while I'm at it, since you're new to Haskell, here's some unsolicited advice:
Don't use spurious parentheses. many (digit) should just be many digit. Parentheses here just group things, they're not necessary for function application.
You don't need to do ICon (read x :: Int). The data constructor ICon can only take an Int, so the compiler can figure out what you meant on its own.
You don't need try around the first two options in parseIntExp as it stands--there's no input that would result in either one consuming some input before failing. They'll either fail immediately (which doesn't need try) or they'll succeed after matching a single character.
It's usually a better idea to tokenize first before parsing. Dealing with whitespace at the same time as syntax is a headache.
It's common in Haskell to use the ($) operator to avoid parentheses. It's just function application, but with very low precedence, so that something like many1 (char ' ') can be written many1 $ char ' '.
Also, doing this sort of thing is redundant and unnecessary:
parseICon :: Parser IntExp
parseICon =
do x <- many digit
return (ICon (read x))
When all you're doing is applying a regular function to the result of a parser, you can just use fmap:
parseICon :: Parser IntExp
parseICon = fmap (ICon . read) (many digit)
They're the exact same thing. You can make things look even nicer if you import the Control.Applicative module, which gives you an operator version of fmap, called (<$>), as well as another operator (<*>) that lets you do the same thing with functions of multiple arguments. There's also operators (<*) and (*>) that discard the right or left values, respectively, which in this case lets you parse something while discarding the result, e.g., whitespace and such.
Here's a lightly modified version of your code with some of the above suggestions applied and some other minor stylistic tweaks:
whitespace = many1 $ char ' '
parseICon :: Parser IntExp
parseICon = ICon . read <$> many1 digit
parseIVar :: Parser IntExp
parseIVar = IVar <$> parseVarName
parseVarName :: Parser String
parseVarName = (++) <$> many1 letter <*> parsePrime
parsePrime :: Parser String
parsePrime = option "" $ string "'"
parseIntExp :: Parser IntExp
parseIntExp = parseICon <|> parseIVar <|> parseAdd
parsePlusWithSpaces :: Parser ()
parsePlusWithSpaces = whitespace *> string "+" *> whitespace *> pure ()
parseAdd :: Parser IntExp
parseAdd = Add <$> parseIntExp <* parsePlusWithSpaces <*> parseIntExp
I'm also new to Haskell, just wondering:
will parseIntExp ever make it to parseAdd?
It seems like ICon or IVar will always get parsed before reaching 'parseAdd'.
e.g. runP parseIntExp "3 + m"
would try parseICon, and succeed, giving
(ICon 3) instead of (Add (ICon 3) (IVar m))
Sorry if I'm being stupid here, I'm just unsure.

Haskell: Lifting a reads function to a parsec parser

As part of the 4th exercise here
I would like to use a reads type function such as readHex with a parsec Parser.
To do this I have written a function:
liftReadsToParse :: Parser String -> (String -> [(a, String)]) -> Parser a
liftReadsToParse p f = p >>= \s -> if null (f s) then fail "No parse" else (return . fst . head ) (f s)
Which can be used, for example in GHCI, like this:
*Main Numeric> parse (liftReadsToParse (many1 hexDigit) readHex) "" "a1"
Right 161
Can anyone suggest any improvement to this approach with regard to:
Will the term (f s) be memoised, or evaluated twice in the case of a null (f s) returning False?
Handling multiple successful parses, i.e. when length (f s) is greater than one, I do not know how parsec deals with this.
Handling the remainder of the parse, i.e. (snd . head) (f s).
This is a nice idea. A more natural approach that would make
your ReadS parser fit in better with Parsec would be to
leave off the Parser String at the beginning of the type:
liftReadS :: ReadS a -> String -> Parser a
liftReadS reader = maybe (unexpected "no parse") (return . fst) .
listToMaybe . filter (null . snd) . reader
This "combinator" style is very idiomatic Haskell - once you
get used to it, it makes function definitions much easier
to read and understand.
You would then use liftReadS like this in the simple case:
> parse (many1 hexDigit >>= liftReadS readHex) "" "a1"
(Note that listToMaybe is in the Data.Maybe module.)
In more complex cases, liftReadS is easy to use inside any
Parsec do block.
Regarding some of your other questions:
The function reader is applied only once now, so there is nothing to "memoize".
It is common and accepted practice to ignore all except the first parse in a ReadS parser in most cases, so you're fine.
To answer the first part of your question, no (f s) will not be memoised, you would have to do that manually:
liftReadsToParse p f = p >>= \s -> let fs = f s in if null fs then fail "No parse"
else (return . fst . head ) fs
But I'd use pattern matching instead:
liftReadsToParse p f = p >>= \s -> case f s of
[] -> fail "No parse"
(answer, _) : _ -> return answer

Resources