parsec: stopping at empty line - parsing

I would like to solve the following task with parsec, although splitOn "\n\n" is probably the simpler answer.
I have an input string like
testInput = unlines ["ab", "cd", "", "e"] -- "ab\ncd\n\ne"
The parser shall stop when encountering an empty line.
I tried this
import Text.ParserCombinators.Parsec
inputFileP :: GenParser Char st String
inputFileP = many (lower <|> delimP)
delimP :: GenParser Char st Char
delimP = do
  x <- char '\n'
  notFollowedBy (char '\n')
  return x
This fails with unexpected '\n'.
Why?
I was under the impression that many x parses x until it fails and then stops.

I was under the impression that many x parses x until it fails and then stops.
This is only the case if x fails without consuming any input. If x fails after consuming input, the whole parse will fail unless there's a try somewhere (this isn't just specific to many: x <|> y would also fail in that case even if y would succeed). In your case delimP fails on the notFollowedBy (char '\n') after already consuming the first \n, so the whole parse fails.
To change this behaviour, you need to explicitly enable backtracking using try like this:
delimP = try $ do
  x <- char '\n'
  notFollowedBy (char '\n')
  return x
Alternatively, you can make it so that delimP fails without consuming any input (and thus no need for try) by making it look ahead by two characters before matching the \n:
delimP = do
  notFollowedBy (string "\n\n")
  char '\n'
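To check the try-based version end to end, here is a minimal self-contained sketch. It assumes Parsec 3 (Text.Parsec) and uses the Parser alias from Text.Parsec.String instead of GenParser; firstBlock is a helper name made up for this example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A newline that is NOT followed by a second newline.
-- try rewinds the input if the lookahead fails, so many can stop cleanly.
delimP :: Parser Char
delimP = try (char '\n' <* notFollowedBy (char '\n'))

inputFileP :: Parser String
inputFileP = many (lower <|> delimP)

-- Hypothetical helper: everything up to (not including) the first empty line.
firstBlock :: String -> Maybe String
firstBlock = either (const Nothing) Just . parse inputFileP "<input>"
```

With the test input from the question, firstBlock (unlines ["ab", "cd", "", "e"]) should give Just "ab\ncd": the parser stops in front of the newline that starts the empty line.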

Related

Problem while writing a small parser in Haskell using Parsec

I am trying to write a parser for a small language with the following piece of code
import Text.ParserCombinators.Parsec
import Text.Parsec.Token
data Exp = Atom String | Op String Exp
instance Show Exp where
  show (Atom x) = x
  show (Op f x) = f ++ "(" ++ (show x) ++ ")"
parse_exp :: Parser Exp
parse_exp = (try parse_atom) <|> parse_op
parse_atom :: Parser Exp
parse_atom = do
  x <- many1 letter
  return (Atom x)
parse_op :: Parser Exp
parse_op = do
  x <- many1 letter
  char '('
  y <- parse_exp
  char ')'
  return (Op x y)
But when I type in ghci
>>> parse (parse_exp <* eof) "<error>" "s(t)"
I get the output
Left "<error>" (line 1, column 2):
unexpected '('
expecting letter or end of input
If I redefine parse_exp as
parse_exp = (try parse_op) <|> parse_atom
then I get the correct result
>>> parse (parse_exp <* eof) "<error>" "s(t)"
Right s(t)
But I am confused why the first one does not work. Is there a general fix to these kinds of problems in parsing?
When a Parsec parser, like parse_atom, is run on a particular string, there are four possible results:
It succeeds, consuming some input.
It fails, consuming some input.
It succeeds, consuming no input.
It fails, consuming no input.
In the Parsec source code, these are referred to as "consumed ok", "consumed err", "empty ok" and "empty err" (sometimes abbreviated cok, cerr, eok, eerr).
When two Parsec parsers are used in an alternative, like p <|> q, here's how it's parsed. First, Parsec tries to parse with p. Then:
If this results in "consumed ok" or "empty ok", the parse succeeds and this becomes the result of the entire parser p <|> q.
If this results in "empty err", Parsec tries the alternative q, and this becomes the result of the entire p <|> q parser.
If this results in "consumed err", the entire parser p <|> q fails with "consumed err" (cerr).
Note the critical difference between p returning cerr (which causes the whole parser to fail) versus returning eerr (which causes the alternative parser q to be tried).
The try function changes the behavior of a parser by converting a "cerr" result to an "eerr" result.
This means that if you are trying to parse the text "s(t)" with different parsers:
with the parser parse_atom <|> parse_op, the parser parse_atom returns "cok" consuming "s" and leaving unparseable text "(t)" which causes an error
with the parser try parse_atom <|> parse_op, the parser parse_atom still returns "cok" consuming "s", so the try (which only changes cerr to eerr) has no effect, and the unparseable text "(t)" causes the same error
with the parser parse_op <|> parse_atom, the parser parse_op successfully parses the string (actually, it doesn't because the recursive call to parse_exp can't parse "t", but let's ignore that); however, if the same parser was used on the text "s", then parse_op would consume the "s" before failing (i.e., cerr), causing the entire parse to fail instead of trying the alternative parse_atom
with the parser try parse_op <|> parse_atom, this would parse "s(t)", exactly as the previous example, and the try would have no effect; however, it would also work on the text "s", because parse_op would consume the "s" before failing with cerr, then try would "rescue" the parse by turning the cerr into an eerr, and the alternative parse_atom would be checked, successfully parsing (cok) the atom "s".
That's why the "correct" parser for your problem is try parse_op <|> parse_atom.
Be warned that this behavior isn't a fundamental aspect of monadic parsers. It's a design choice made by Parsec (and compatible parsers like Megaparsec). Other monadic parsers can have different rules for how alternatives with <|> work.
The "general fix" for these kinds of Parsec parsing problems is to be aware of two facts about the expression p <|> q:
p is tried first, and if it succeeds, q will be ignored, even if q would provide a "longer" or "better" or "more sensible" parse or avoid additional parsing errors further down the road. In parse_atom <|> parse_op, because parse_atom can succeed on strings meant for parse_op, this order won't work correctly.
q is only tried if p fails without consuming input. You must arrange for p to not consume anything on failure, possibly by using try, if you expect the alternative q to be checked. So, parse_op <|> parse_atom isn't going to work if parse_op starts to consume something (like an identifier) before realizing it can't continue and returning cerr.
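The cases discussed above can be tried side by side. The following is only a sketch under some assumptions: it uses Parsec 3, derives Show and Eq for Exp instead of the custom instance from the question, and the names atomP, opP, atomFirst, opFirst, tryOpFirst and run are invented for the comparison.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Exp = Atom String | Op String Exp
  deriving (Eq, Show)

atomP :: Parser Exp
atomP = Atom <$> many1 letter

-- The operator parser takes the parser for its argument as a parameter,
-- so each variant below can recurse into itself.
opP :: Parser Exp -> Parser Exp
opP inner = Op <$> many1 letter <*> between (char '(') (char ')') inner

atomFirst, opFirst, tryOpFirst :: Parser Exp
atomFirst  = atomP <|> opP atomFirst         -- atom succeeds too early
opFirst    = opP opFirst <|> atomP           -- op fails after consuming input
tryOpFirst = try (opP tryOpFirst) <|> atomP  -- op failure is rewound by try

-- Run a parser and require that it consumes the whole input.
run :: Parser Exp -> String -> Maybe Exp
run p = either (const Nothing) Just . parse (p <* eof) "<input>"
```

Here run atomFirst "s(t)" and run opFirst "s" both give Nothing, while run tryOpFirst "s(t)" gives Just (Op "s" (Atom "t")) and run tryOpFirst "s" gives Just (Atom "s").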
As an alternative to using try, you can also think more carefully about the structure of your parser. An alternative way of writing parse_exp, for example, would be:
parse_exp :: Parser Exp
parse_exp = do
  -- there's always an identifier
  x <- many1 letter
  -- there *might* be an expression in parentheses
  y <- optionMaybe (parens parse_exp)
  case y of
    Nothing -> return (Atom x)
    Just y' -> return (Op x y')
  where parens = between (char '(') (char ')')
This can be written a little more concisely, but even then it's not as "elegant" as something like try parse_op <|> parse_atom. (It performs better, though, so that might be a consideration in some applications.)
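One possible concise spelling of the try-free version, as a sketch only: it assumes Parsec 3, derives Show and Eq for Exp rather than using the question's instance, and expP and runExp are names chosen for this example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Exp = Atom String | Op String Exp
  deriving (Eq, Show)

-- Parse the identifier once, then decide between Atom and Op depending on
-- whether a parenthesised argument follows. No try is needed, because the
-- common prefix is parsed only once.
expP :: Parser Exp
expP = do
  name <- many1 letter
  maybe (Atom name) (Op name) <$> optionMaybe (between (char '(') (char ')') expP)

-- Hypothetical helper that requires the whole input to be consumed.
runExp :: String -> Maybe Exp
runExp = either (const Nothing) Just . parse (expP <* eof) "<input>"
```

runExp "s(t)" should give Just (Op "s" (Atom "t")) and runExp "s" should give Just (Atom "s").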
The problem is that the string "s" counts as an atom according to your definitions. Try this:
parse parse_atom "" "s(t)"
> Right s
So your parser parse_exp actually succeeds, returning Atom "s" (printed as s by your Show instance), but then you also expect an EOF right after it, and that's where it fails, encountering an open paren instead of an EOF (just like the error message says!)
When you swap the alternatives around, it first attempts parse_op, which succeeds, returning Op "s" (Atom "t"), and then encounters EOF, just as expected.

Why Parsec's sepBy stops and does not parse all elements?

I am trying to parse some comma separated string which may or may not contain a string with image dimensions. For example "hello world, 300x300, good bye world".
I've written the following little program:
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.Parsec
import qualified Text.Parsec.Text as PS
parseTestString :: Text -> [Maybe (Int, Int)]
parseTestString s = case parse dimensStringParser "" s of
  Left _ -> [Nothing]
  Right dimens -> dimens
dimensStringParser :: PS.Parser [Maybe (Int, Int)]
dimensStringParser = (optionMaybe dimensParser) `sepBy` (char ',')
dimensParser :: PS.Parser (Int, Int)
dimensParser = do
  w <- many1 digit
  char 'x'
  h <- many1 digit
  return (read w, read h)
main :: IO ()
main = do
  print $ parseTestString "300x300,40x40,5x5"
  print $ parseTestString "300x300,hello,5x5,6x6"
According to optionMaybe documentation, it returns Nothing if it can't parse, so I would expect to get this output:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing,Just (5,5),Just (6,6)]
but instead I get:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing]
I.e. parsing stops after first failure. So I have two questions:
Why does it behave this way?
How do I write a correct parser for this case?
In order to answer this question, it's handy to take a piece of paper, write down the input, and act as a dumb parser.
We start with "300x300,hello,5x5,6x6", our current parser is optionMaybe .... Does our dimensParser correctly parse the dimension? Let's check:
w <- many1 digit -- yes, "300"
char 'x' -- yes, "x"
h <- many1 digit -- yes, "300"
return (read w, read h) -- never fails
We've successfully parsed the first dimension. The next token is ,, so sepBy successfully parses that as well. Next, we try to parse "hello" and fail:
w <- many1 digit -- no. 'h' is not a digit. Stop
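dimensParser fails without consuming any input, so optionMaybe succeeds with Nothing and the remaining input is still "hello,...". Next, sepBy tries to parse a comma, but that's not possible, since the next token is an 'h', not a comma. Therefore, sepBy stops.
We haven't parsed all the input, but sepBy doesn't require that. You would get a proper error message if you had used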
parse (dimensStringParser <* eof)
Either way, if you want to discard anything in the list that's not a dimension, you can use
dimensStringParser1 :: PS.Parser (Maybe (Int, Int))
dimensStringParser1 = try (Just <$> dimensParser) <|> (skipMany (noneOf ",") >> return Nothing)
dimensStringParser = dimensStringParser1 `sepBy` char ','
I'd guess that optionMaybe dimensParser, when fed with input "hello,...", tries dimensParser. That fails, so optionMaybe returns success with Nothing, and consumes no portion of the input.
The last part is the crucial one: after Nothing is returned, the input string to be parsed is still "hello,...".
At that point sepBy tries to parse char ',', which fails. So, it deduces that the list is over, and terminates the output list, without consuming any more input.
If you want to skip other entities, you need a "consuming" parser that returns Nothing instead of optionMaybe. That parser, however, needs to know how much to consume: in your case, everything up to the next comma.
Perhaps you need something like (untested)
( try (Just <$> dimensParser)
  <|> (skipMany (noneOf ",") >> return Nothing))
`sepBy` char ','
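Putting the pieces together, the skip-to-the-next-comma idea can be sketched as a small self-contained module. This is an illustration under assumptions, not the asker's exact setup: it parses String rather than Text to avoid the extra imports, and numberP, dimensP, itemP and parseDimens are names made up here.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

numberP :: Parser Int
numberP = read <$> many1 digit

dimensP :: Parser (Int, Int)
dimensP = (,) <$> numberP <* char 'x' <*> numberP

-- Either a dimension, or "anything up to the next comma" mapped to Nothing.
-- try rewinds input such as "300abc", where dimensP fails after eating digits.
itemP :: Parser (Maybe (Int, Int))
itemP = try (Just <$> dimensP) <|> (Nothing <$ skipMany (noneOf ","))

-- eof makes leftover input an error instead of silently dropping it.
parseDimens :: String -> Either ParseError [Maybe (Int, Int)]
parseDimens = parse (itemP `sepBy` char ',' <* eof) "<input>"
```

For "300x300,hello,5x5,6x6" this should return Right [Just (300,300),Nothing,Just (5,5),Just (6,6)]. Note that an item like "5x5abc" still makes the whole parse fail, because dimensP succeeds on the "5x5" prefix and the trailing "abc" is then unexpected; whether that should count as Nothing instead is a design decision the question leaves open.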

Parse a sub-string with parsec (by ignoring unmatched prefixes)

I would like to extract the repository name from the first line of git remote -v, which is usually of the form:
origin git#github.com:some-user/some-repo.git (fetch)
I quickly made the following parser using parsec:
-- | Parse the repository name from the output given by the first line of `git remote -v`.
repoNameFromRemoteP :: Parser String
repoNameFromRemoteP = do
  _ <- originPart >> hostPart
  _ <- char ':'
  firstPart <- many1 alphaNum
  _ <- char '/'
  secondPart <- many1 alphaNum
  _ <- string ".git"
  return $ firstPart ++ "/" ++ secondPart
  where
    originPart = many1 alphaNum >> space
    hostPart = many1 alphaNum
      >> (string "#" <|> string "://")
      >> many1 alphaNum `sepBy` char '.'
But this parser looks a bit awkward. Actually I'm only interested in whatever follows the colon (":"), and it would be easier if I could just write a parser for it.
Is there a way to have parsec skip a character upon a failed match, and re-try from the next position?
If I've understood the question, try many (noneOf ":"). This will consume any character until it sees a ':', then stop.
Edit: Seems I had not understood the question. You can use the try combinator to turn a parser which may consume some characters before failing into one that consumes no characters on a failure. So:
skipUntil p = try p <|> (anyChar >> skipUntil p)
Beware that this can be quite expensive, both in runtime (because it will try matching p at every position) and memory (because try prevents p from consuming characters and so the input cannot be garbage collected at all until p completes). You might be able to alleviate the first of those two problems by parameterizing the anyChar bit so that the caller could choose some cheap parser for finding candidate positions; e.g.
skipUntil p skipper = try p <|> (skipper >> skipUntil p skipper)
You could then potentially use the above many (noneOf ":") construction to only try p on positions that start with a :.
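As a rough illustration of the first skipUntil variant applied to the remote line from the question, here is a sketch assuming Parsec 3. repoP and findRepo are hypothetical names, and allowing '-' in the user and repository names is an assumption added so that "some-user" matches.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Try p at the current position; on failure drop one character and retry.
skipUntil :: Parser a -> Parser a
skipUntil p = try p <|> (anyChar >> skipUntil p)

-- Only the interesting part: ":user/repo.git".
repoP :: Parser String
repoP = do
  _ <- char ':'
  user <- many1 nameChar
  _ <- char '/'
  repo <- many1 nameChar
  _ <- string ".git"
  return (user ++ "/" ++ repo)
  where
    nameChar = alphaNum <|> char '-'  -- assumption: names may contain '-'

findRepo :: String -> Maybe String
findRepo = either (const Nothing) Just . parse (skipUntil repoP) "<input>"
```

findRepo "origin git#github.com:some-user/some-repo.git (fetch)" should give Just "some-user/some-repo". If no position matches, anyChar eventually fails at the end of input and the result is Nothing.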
The sepCap combinator from replace-megaparsec can skip a character upon a failed match, and re-try from the next position. Maybe this is overkill for your particular case, but it does solve the general problem.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Control.Monad (void)
import Data.Maybe
import Data.Either
import Data.Void (Void)
username :: Parsec Void String String
username = do
  void $ single ':'
  some $ alphaNumChar <|> single '-'
listToMaybe . rights =<< parseMaybe (sepCap username)
  "origin git#github.com:some-user/some-repo.git (fetch)"
Just "some-user"

Please explain the behavior of this Parsec permutation parser

Why does this Parsec permutation parser not parse b?
p :: Parser (String, String)
p = permute (pair
             <$?> ("", pa)
             <|?> ("", pb))
  where pair a b = (a, b)
pa :: Parser String
pa = do
  char 'x'
  many1 (char 'a')
pb :: Parser String
pb = do
  many1 (char 'b')
λ> parseTest p "xaabb"
("aa","bb") -- expected result, good
λ> parseTest p "aabb"
("","") -- why "" for b?
Parser pa is configured as optional via <$?> so I don't understand why its failing has impacted the parsing of b. I can change it to optional (char 'x') to get the expected behavior, but I don't understand why.
pa :: Parser String
pa = do
  optional (char 'x')
  many1 (char 'a')
pb :: Parser String
pb = do
  optional (char 'x')
  many1 (char 'b')
λ> parseTest p "xaaxbb"
parse error at (line 1, column 2):
unexpected "a"
expecting "b"
λ> parseTest p "xbbxaa"
("aa","bb")
How can both input orderings be supported when we have identical shared prefix "x"?
I also don't understand the impact that consumption of the optional "x" is having on the parse behavior:
pb :: Parser String
pb = do
  try px -- with this try x remains unconsumed and "aa" gets parsed
         -- without this try x is consumed, but "aa" isn't parsed even though "x" is optional anyway
  many1 (char 'b')
px :: Parser Char
px = do
  optional (char 'x')
  char 'x' <?> "second x"
λ> parseTest p "xaaxbb" -- without try on px
parse error at (line 1, column 2):
unexpected "a"
expecting second x
λ> parseTest p "xaaxbb" -- with try on px
("aa","")
Why parseTest p "aabb" gives ("","")
The permutation parser repeatedly tries to strip off the front of the input a prefix that one of its constituent parsers (pa and pb in this case) can parse. Here, it tries both pa and pb on "aabb" and both fail without consuming anything, so it falls back to the default values; it never even gets around to trying to parse "bb".
Why can't both pa and pb start with optional (char 'x')
Looking at permute, you'll see it uses choice, which in turn relies on (<|>). As the documentation of (<|>) says,
This combinator implements choice. The parser p <|> q first applies p. If it succeeds, the value of p is returned. If p fails without consuming any input, parser q is tried. This combinator is defined equal to the mplus member of the MonadPlus class and the (<|>) member of Alternative.
The parser is called predictive since q is only tried when parser p didn't consume any input (i.e.. the look ahead is 1). This non-backtracking behaviour allows for both an efficient implementation of the parser combinators and the generation of good error messages.
So when you do something like parseTest p "xbb", pa doesn't fail immediately (it consumes an 'x' first), and then the whole thing fails because it cannot backtrack.
How to make shared prefixes work?
As Daniel has suggested, it is best to factor out your grammar. Alternately, you can use try:
The parser try p behaves like parser p, except that it pretends that it hasn't consumed any input when an error occurs
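Based on what we talked about before for (<|>), you ought then to wrap each of pa and pb as a whole in try. Putting try only in front of optional (char 'x') is not enough: that part succeeds, and the failure comes afterwards, once the 'x' has already been consumed.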
Why does this Parsec permutation parser not parse b?
Because 'a' is not a valid first character for either parser pa or parser pb.
How can both input orderings be supported when we have identical shared prefix "x"?
Shared prefixes must be factored out of your grammar; or backtracking points inserted (using try) at the cost of performance.
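A minimal sketch of the backtracking option, assuming Parsec 3 and assuming try is wrapped around each of pa and pb as a whole, so that a failure after the optional 'x' rewinds the input. runP is a helper name invented here.

```haskell
import Text.Parsec
import Text.Parsec.Perm (permute, (<$?>), (<|?>))
import Text.Parsec.String (Parser)

-- Both parsers share the optional prefix 'x'. try makes each one fail
-- without consuming input, so the permutation can move on to the other.
pa, pb :: Parser String
pa = try (optional (char 'x') *> many1 (char 'a'))
pb = try (optional (char 'x') *> many1 (char 'b'))

p :: Parser (String, String)
p = permute ((,) <$?> ("", pa) <|?> ("", pb))

-- Hypothetical helper that requires the whole input to be consumed.
runP :: String -> Maybe (String, String)
runP = either (const Nothing) Just . parse (p <* eof) "<input>"
```

With this, runP "xaaxbb", runP "xbbxaa" and runP "aabb" should all give Just ("aa","bb"). The price is that the shared prefix may be parsed twice; factoring it out of the grammar avoids that.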

Parsing in Haskell for a simple interpreter

I'm relatively new to Haskell; my main programming background is in OO languages. I am trying to write an interpreter with a parser for a simple programming language. So far I have the interpreter in a state I am reasonably happy with, but I am struggling slightly with the parser.
Here is the piece of code which I am having problems with
data IntExp
  = IVar Var
  | ICon Int
  | Add IntExp IntExp
  deriving (Read, Show)
whitespace = many1 (char ' ')
parseICon :: Parser IntExp
parseICon =
  do x <- many (digit)
     return (ICon (read x :: Int))
parseIVar :: Parser IntExp
parseIVar =
  do x <- many (letter)
     prime <- string "'" <|> string ""
     return (IVar (x ++ prime))
parseIntExp :: Parser IntExp
parseIntExp =
  do x <- try(parseICon)<|>try(parseIVar)<|>parseAdd
     return x
parseAdd :: Parser IntExp
parseAdd =
  do x <- parseIntExp
     whitespace
     string "+"
     whitespace
     y <- parseIntExp
     return (Add x y)
runP :: Show a => Parser a -> String -> IO ()
runP p input
  = case parse p "" input of
      Left err ->
        do putStr "parse error at "
           print err
      Right x -> print x
The language is slightly more complex, but this is enough to show my problem.
So in the type IntExp ICon is a constant and IVar is a variable, but now onto the problem. This for example runs successfully
runP parseAdd "5 + 5"
which gives (Add (ICon 5) (ICon 5)), which is the expected result. The problem arises when using IVars rather than ICons, e.g.
runP parseAdd "n + m"
This causes the program to error out saying there was an unexpected "n" where a digit was expected. This leads me to believe that parseIntExp isn't working as I intended. My intention was that it will try to parse an ICon, if that fails then try to parse an IVar and so on.
So I either think the problem exists in parseIntExp, or that I am missing something in parseIVar and parseICon.
I hope I've given enough info about my problem and I was clear enough.
Thanks for any help you can give me!
Your problem is actually in parseICon:
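parseICon =
  do x <- many (digit)
     return (ICon (read x :: Int))
The many combinator matches zero or more occurrences, so parseICon succeeds on "n" by matching zero digits and consuming nothing. parseIntExp therefore never gets to parseIVar, and the error you see comes from the next step in parseAdd, which expects whitespace but finds the "n". (Forcing the resulting ICon would also crash, because read "" throws.) Use many1 digit so that parseICon fails when there are no digits.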
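The zero-or-more behaviour is easy to check in isolation. A small sketch, assuming Parsec 3; runMany and runMany1 are names made up for the comparison.

```haskell
import Text.Parsec

-- many succeeds with an empty result when nothing matches.
runMany :: String -> Either ParseError String
runMany = parse (many digit) "<input>"

-- many1 requires at least one match and fails otherwise.
runMany1 :: String -> Either ParseError String
runMany1 = parse (many1 digit) "<input>"
```

runMany "n" returns Right "", whereas runMany1 "n" returns a Left with a parse error, which is what lets an alternative such as parseIVar be tried next.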
And while I'm at it, since you're new to Haskell, here's some unsolicited advice:
Don't use spurious parentheses. many (digit) should just be many digit. Parentheses here just group things, they're not necessary for function application.
You don't need to do ICon (read x :: Int). The data constructor ICon can only take an Int, so the compiler can figure out what you meant on its own.
You don't need try around the first two options in parseIntExp as it stands--there's no input that would result in either one consuming some input before failing. They'll either fail immediately (which doesn't need try) or they'll succeed after matching a single character.
It's usually a better idea to tokenize first before parsing. Dealing with whitespace at the same time as syntax is a headache.
It's common in Haskell to use the ($) operator to avoid parentheses. It's just function application, but with very low precedence, so that something like many1 (char ' ') can be written many1 $ char ' '.
Also, doing this sort of thing is redundant and unnecessary:
parseICon :: Parser IntExp
parseICon =
  do x <- many digit
     return (ICon (read x))
When all you're doing is applying a regular function to the result of a parser, you can just use fmap:
parseICon :: Parser IntExp
parseICon = fmap (ICon . read) (many digit)
They're the exact same thing. You can make things look even nicer if you import the Control.Applicative module, which gives you an operator version of fmap, called (<$>), as well as another operator (<*>) that lets you do the same thing with functions of multiple arguments. There's also operators (<*) and (*>) that discard the right or left values, respectively, which in this case lets you parse something while discarding the result, e.g., whitespace and such.
Here's a lightly modified version of your code with some of the above suggestions applied and some other minor stylistic tweaks:
whitespace = many1 $ char ' '
parseICon :: Parser IntExp
parseICon = ICon . read <$> many1 digit
parseIVar :: Parser IntExp
parseIVar = IVar <$> parseVarName
parseVarName :: Parser String
parseVarName = (++) <$> many1 letter <*> parsePrime
parsePrime :: Parser String
parsePrime = option "" $ string "'"
parseIntExp :: Parser IntExp
parseIntExp = parseICon <|> parseIVar <|> parseAdd
parsePlusWithSpaces :: Parser ()
parsePlusWithSpaces = whitespace *> string "+" *> whitespace *> pure ()
parseAdd :: Parser IntExp
parseAdd = Add <$> parseIntExp <* parsePlusWithSpaces <*> parseIntExp
I'm also new to Haskell, just wondering:
will parseIntExp ever make it to parseAdd?
It seems like ICon or IVar will always get parsed before reaching 'parseAdd'.
e.g. runP parseIntExp "3 + m"
would try parseICon, and succeed, giving
(ICon 3) instead of (Add (ICon 3) (IVar m))
Sorry if I'm being stupid here, I'm just unsure.
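That observation seems right: with parseICon and parseIVar listed first, parseAdd is only reached when both fail, and it then calls parseIntExp again without having consumed anything (left recursion). One common way around this in Parsec is chainl1, which parses a term followed by any number of "operator term" pairs. The following is only a sketch under assumptions: Var is taken to be String, the names termP, varNameP, plusP, intExpP and parseIntExpStr are invented here, and Eq is derived for convenience.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data IntExp
  = IVar String   -- assumption: Var is a synonym for String
  | ICon Int
  | Add IntExp IntExp
  deriving (Eq, Show)

-- A term is a constant or a variable; neither alternative is recursive.
termP :: Parser IntExp
termP = (ICon . read <$> many1 digit) <|> (IVar <$> varNameP)

varNameP :: Parser String
varNameP = (++) <$> many1 letter <*> option "" (string "'")

-- " + " surrounded by spaces; try rewinds if the spaces are not followed by '+'.
plusP :: Parser (IntExp -> IntExp -> IntExp)
plusP = Add <$ try (many1 (char ' ') *> char '+' *> many1 (char ' '))

-- One term, then zero or more "+ term", combined left-associatively.
intExpP :: Parser IntExp
intExpP = termP `chainl1` plusP

parseIntExpStr :: String -> Maybe IntExp
parseIntExpStr = either (const Nothing) Just . parse (intExpP <* eof) "<input>"
```

parseIntExpStr "3 + m" should give Just (Add (ICon 3) (IVar "m")), and "1 + 2 + 3" nests to the left as Add (Add (ICon 1) (ICon 2)) (ICon 3).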
