How to do a backtrack search with parser combinators?

I have a list of parsers e.g. [string "a",string "ab"] that are "overlapping". I can change neither the parsers themselves nor their order.
With these parsers I want to parse a sequence of tokens that would each be exact matches for one of the parsers e.g. "aaaab", "ab", "abab" but not "abb"
Without parsers I would just implement a depth-first search, but I would like to solve this with parsers.
I got about this far:
import Control.Applicative
import Text.Trifecta
parsers = [string "a",string "ab"]
parseString (many (choice parsers) <* eof) mempty "aab"
This fails because it will parse "a" both times, and not backtrack because choice doesn't do that. And further, string "a" has succeeded both times so the consumed input probably can't be retrieved anymore.
How can I implement a parser that can backtrack and produce a list of parse results, e.g. Success ["a","ab"]?
If I require the input to have the tokens separated, I still can't make it work:
This works:
parseString (try (string "a" <* eof) <|> (string "ab" <*eof)) mempty "ab"
But this does not:
parseString (try (foldl1 (<|>) $ map (\x -> x <* eof) parsers)) mempty "ab"

The try is applied at too high a level. Apply it to the individual parsers instead. For example:
parseString (foldl1 (<|>) $ map (\x -> try (x <* eof)) parsers) mempty "ab"
In the original parser you wrote:
parseString ((try (string "a" <* eof)) <|> (string "ab" <*eof)) mempty "ab"
Notice that the left operand of <|> is try (string "a" <* eof) with try included.
whereas the version you built with foldl1 is equivalent to:
parseString (try ((string "a" <* eof) <|> (string "ab" <*eof))) mempty "ab"
So here the try is not part of the left operand. As a result, if the first parser fails after consuming input, the "cursor" will not return to the point where the choice between the operands was made, and the second operand is never tried.
We can improve the above by making use of asum :: (Foldable t, Alternative f) => t (f a) -> f a:
import Data.Foldable(asum)
parseString (asum (map (\x -> try (x <* eof)) parsers)) mempty "ab"
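The fix above covers the variant where every token is followed by eof. For the original goal of splitting "aab" into ["a","ab"], the parser has to explore every alternative at each step. Here is a minimal sketch, assuming it is acceptable to swap Trifecta for ReadP from base, whose choice is symmetric and keeps all successful branches. The name tokenizations is a hypothetical helper, not something from the question:

```haskell
import Text.ParserCombinators.ReadP (ReadP, choice, eof, many, readP_to_S, string)

-- The overlapping parsers from the question, unchanged and in the same order.
parsers :: [ReadP String]
parsers = [string "a", string "ab"]

-- Hypothetical helper: every way of splitting the whole input into tokens.
-- ReadP runs all alternatives in parallel, so this is effectively the
-- exhaustive search that the question describes.
tokenizations :: String -> [[String]]
tokenizations input =
  [ toks | (toks, "") <- readP_to_S (many (choice parsers) <* eof) input ]
```

With these definitions, tokenizations "aab" yields [["a","ab"]] and tokenizations "abb" yields [], without touching the parsers or their order.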

Related

Using makeExprParser with ambiguity

I'm currently encountering a problem while translating a parser from a CFG-based tool (ANTLR) to Megaparsec.
The grammar contains lists of expressions (handled with makeExprParser) that are enclosed in brackets (<, >) and separated by ,.
Stuff like <>, <23>, <23,87> etc.
The problem now is that the expressions may themselves contain the > operator (meaning "greater than"), which causes my parser to fail.
<1223>234> should, for example, be parsed into [BinaryExpression ">" (IntExpr 1223) (IntExpr 234)].
I presume that I have to strategically place try somewhere, but the places I tried (the first argument of sepBy and the first argument of makeExprParser) unfortunately did not work.
Can I use makeExprParser in such a situation, or do I have to write the expression parser manually?
This is the relevant part of my parser:
-- uses megaparsec, text, and parser-combinators
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Monad.Combinators.Expr
import Data.Text
import Data.Void
import System.Environment
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
type BinaryOperator = Text
type Name = Text
data Expr
  = IntExpr Integer
  | BinaryExpression BinaryOperator Expr Expr
  deriving (Eq, Show)
type Parser = Parsec Void Text
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
symbol :: Text -> Parser Text
symbol = L.symbol sc
sc :: Parser ()
sc = L.space space1 (L.skipLineComment "//") (L.skipBlockCommentNested "/*" "*/")
parseInteger :: Parser Expr
parseInteger = do
  number <- some digitChar
  _ <- sc
  return $ IntExpr $ read number
parseExpr :: Parser Expr
parseExpr = makeExprParser parseInteger [[InfixL (BinaryExpression ">" <$ symbol ">")]]
parseBracketList :: Parser [Expr]
parseBracketList = do
  _ <- symbol "<"
  exprs <- sepBy parseExpr (symbol ",")
  _ <- symbol ">"
  return exprs
main :: IO ()
main = do
  text : _ <- getArgs
  let res = runParser parseBracketList "stdin" (pack text)
  case res of
    (Right suc) -> do
      print suc
    (Left err) ->
      putStrLn $ errorBundlePretty err
You've (probably) misdiagnosed the problem. Your parser fails on <1233>234> because it's trying to parse > as a left associative operator, like +. In other words, the same way:
1+2+
would fail, because the second + has no right-hand operand, your parser is failing because:
1233>234>
has no digit following the second >. Assuming you don't want your > operator to chain (i.e., 1>2>3 is not a valid Expr), you should first replace InfixL with InfixN (non-associative) in your makeExprParser table. Then, it will parse this example fine.
Unfortunately, with or without this change your parser will still fail on the simpler test case:
<1233>
because the > is interpreted as an operator within a continuing expression.
In other words, the problem isn't that your parser can't handle expressions with > characters, it's that it's overly aggressive in treating > characters as part of an expression, preventing them from being recognized as the closing angle bracket.
To fix this, you need to figure out exactly what you're parsing. Specifically, you need to resolve the ambiguity in your parser by precisely characterizing the situations where > can be part of a continuing expression and where it can't.
One rule that will probably work is to only consider a > as an operator if it is followed by a valid "term" (i.e., a parseInteger). You can do this with lookAhead. The parser:
symbol ">" <* lookAhead term
will parse a > operator only if it is followed by a valid term. If it fails to find a term, it will consume some input (at least the > symbol itself), so you must surround it with a try:
try (symbol ">" <* lookAhead term)
With the above two fixes applied to parseExpr:
parseExpr :: Parser Expr
parseExpr = makeExprParser term
  [[InfixN (BinaryExpression ">" <$ try (symbol ">" <* lookAhead term))]]
  where term = parseInteger
you'll get the following parses:
λ> parseTest parseBracketList "<23>"
[IntExpr 23]
λ> parseTest parseBracketList "<23,87>"
[IntExpr 23,IntExpr 87]
λ> parseTest parseBracketList "<23,87>18>"
[IntExpr 23,BinaryExpression ">" (IntExpr 87) (IntExpr 18)]
However, the following will fail:
λ> parseTest parseBracketList "<23,87>18"
1:10:
  |
1 | <23,87>18
  |          ^
unexpected end of input
expecting ',', '>', or digit
λ>
because the fact that the > is followed by 18 means that it is a valid operator, and it is a parse failure that the valid expression 87>18 is followed by neither a comma nor a closing > angle bracket.
If you need to parse something like <23,87>18, you have bigger problems. Consider the following two test cases:
<1,2>3,4,5,6,7,...,100000000000,100000000001>
<1,2>3,4,5,6,7,...,100000000000,100000000001
It's a challenge to write an efficient parser that will parse the first one as a list of 100000000000 expressions but the second one as a list of two expressions:
[IntExpr 1, IntExpr 2]
followed by some "extra" text. Hopefully, the underlying "language" you're trying to parse isn't so hopelessly broken that this will be an issue.
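For readers on Parsec rather than Megaparsec, the same two fixes (a non-associative operator, and a > that only counts as an operator when a term follows) can be sketched with Text.Parsec.Expr. This is an illustration under simplifying assumptions, not the asker's grammar: there is no whitespace or comment handling, and the names term, gtOp, expr and bracketList are hypothetical.

```haskell
import Text.Parsec
import Text.Parsec.Expr (Assoc (AssocNone), Operator (Infix), buildExpressionParser)
import Text.Parsec.String (Parser)

data Expr
  = IntExpr Integer
  | BinaryExpression String Expr Expr
  deriving (Eq, Show)

-- A term is a non-empty run of digits.
term :: Parser Expr
term = IntExpr . read <$> many1 digit

-- '>' is an operator only if a term follows it. The try undoes the consumed
-- '>' when the lookahead fails, so the closing bracket can still match it.
gtOp :: Parser (Expr -> Expr -> Expr)
gtOp = BinaryExpression ">" <$ try (char '>' <* lookAhead term)

expr :: Parser Expr
expr = buildExpressionParser [[Infix gtOp AssocNone]] term

bracketList :: Parser [Expr]
bracketList = char '<' *> sepBy expr (char ',') <* char '>'
```

Under these assumptions, parse bracketList "" "<23>" gives a one-element list, "<1223>234>" gives a single BinaryExpression, and "<23,87>18" is rejected because no closing bracket follows 87>18.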

Chain two parsers in Haskell (Parsec)

Parsec provides an operator to choose between two parsers:
(<|>)
:: Text.Parsec.Prim.ParsecT s u m a
-> Text.Parsec.Prim.ParsecT s u m a
-> Text.Parsec.Prim.ParsecT s u m a
Is there a similar function to chain two parsers? I didn't find one with the same signature using Hoogle.
As an example, let's say I want to parse any word optionally followed by a single digit. My first idea was to use >> but it doesn't seem to work.
parser = many1 letter >> optional (fmap pure digit)
I used fmap pure in order to convert the digit to an actual string and thus match the parsed type of many1 letter. I don't know if it is useful.
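The >> operator discards the result of its left operand, and Parsec's optional returns (), so there is nothing left to combine. Use option with a default value instead: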
parser = (++) <$> many1 letter <*> option "" (fmap pure digit)
This is equivalent to:
parser = pure (++) <*> many1 letter <*> option "" (fmap pure digit)
option "" (fmap pure digit) returns the empty string if the digit parser fails, and a one-character string containing the parsed digit otherwise.
You can also use do-notation for more readable code:
parser = do
  s1 <- many1 letter
  s2 <- option "" (fmap pure digit)
  return (s1 ++ s2)
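As far as I know, Parsec does not export an operator with exactly the requested shape, but one can be defined in a single line. The sketch below introduces a hypothetical <++> operator (the name is an assumption, not a Parsec function) that runs two list-returning parsers in sequence and concatenates their results:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical operator, not part of Parsec: run p, then q, and
-- concatenate their list results.
(<++>) :: Parser [a] -> Parser [a] -> Parser [a]
p <++> q = (++) <$> p <*> q

-- A word optionally followed by a single digit, as in the question.
wordThenDigit :: Parser String
wordThenDigit = many1 letter <++> option "" (pure <$> digit)
```

For example, parse wordThenDigit "" "abc1" returns Right "abc1", and the same parser on "abc" returns Right "abc".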

Problem while writing a small parser in Haskell using Parsec

I am trying to write a parser for a small language with the following piece of code:
import Text.ParserCombinators.Parsec
import Text.Parsec.Token
data Exp = Atom String | Op String Exp
instance Show Exp where
  show (Atom x) = x
  show (Op f x) = f ++ "(" ++ (show x) ++ ")"
parse_exp :: Parser Exp
parse_exp = (try parse_atom) <|> parse_op
parse_atom :: Parser Exp
parse_atom = do
  x <- many1 letter
  return (Atom x)
parse_op :: Parser Exp
parse_op = do
  x <- many1 letter
  char '('
  y <- parse_exp
  char ')'
  return (Op x y)
But when I type in ghci
>>> parse (parse_exp <* eof) "<error>" "s(t)"
I get the output
Left "<error>" (line 1, column 2):
unexpected '('
expecting letter or end of input
If I redefine parse_exp as
parse_exp = (try parse_op) <|> parse_atom
then I get the correct result
>>> parse (parse_exp <* eof) "<error>" "s(t)"
Right s(t)
But I am confused why the first one does not work. Is there a general fix to these kinds of problems in parsing?
When a Parsec parser, like parse_atom, is run on a particular string, there are four possible results:
It succeeds, consuming some input.
It fails, consuming some input.
It succeeds, consuming no input.
It fails, consuming no input.
In the Parsec source code, these are referred to as "consumed ok", "consumed err", "empty ok" and "empty err" (sometimes abbreviated cok, cerr, eok, eerr).
When two Parsec parsers are used in an alternative, like p <|> q, here's how it's parsed. First, Parsec tries to parse with p. Then:
If this results in "consumed ok" or "empty ok", the parse succeeds and this becomes the result of the entire parser p <|> q.
If this results in "empty err", Parsec tries the alternative q, and this becomes the result of the entire p <|> q parser.
If this results in "consumed err", the entire parser p <|> q fails with "consumed err" (cerr).
Note the critical difference between p returning cerr (which causes the whole parser to fail) versus returning eerr (which causes the alternative parser q to be tried).
The try function changes the behavior of a parser by converting a "cerr" result to an "eerr" result.
This means that if you are trying to parse the text "s(t)" with different parsers:
with the parser parse_atom <|> parse_op, the parser parse_atom returns "cok" consuming "s" and leaving unparseable text "(t)" which causes an error
with the parser try parse_atom <|> parse_op, the parser parse_atom still returns "cok" consuming "s", so the try (which only changes cerr to eerr) has no effect, and the unparseable text "(t)" causes the same error
with the parser parse_op <|> parse_atom, the parser parse_op successfully parses the string (actually, it doesn't because the recursive call to parse_exp can't parse "t", but let's ignore that); however, if the same parser was used on the text "s", then parse_op would consume the "s" before failing (i.e., cerr), causing the entire parse to fail instead of trying the alternative parse_atom
with the parser try parse_op <|> parse_atom, this would parse "s(t)", exactly as the previous example, and the try would have no effect; however, it would also work on the text "s", because parse_op would consume the "s" before failing with cerr, then try would "rescue" the parse by turning the cerr into an eerr, and the alternative parse_atom would be checked, successfully parsing (cok) the atom "s".
That's why the "correct" parser for your problem is try parse_op <|> parse_atom.
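The cases above can be checked mechanically. The following is a small sketch, assuming plain Parsec on String input. The names opWith and accepts are hypothetical helpers introduced only so that the variants can be compared side by side, and Exp derives Show here for brevity instead of using the custom instance from the question.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Exp = Atom String | Op String Exp
  deriving (Eq, Show)

parse_atom :: Parser Exp
parse_atom = Atom <$> many1 letter

-- parse_op from the question, with the recursive expression parser passed
-- in as an argument (hypothetical helper) so the variants can be compared.
opWith :: Parser Exp -> Parser Exp
opWith inner = Op <$> many1 letter <*> between (char '(') (char ')') inner

atomFirst, opFirst, tryOpFirst :: Parser Exp
atomFirst  = try parse_atom <|> opWith atomFirst    -- try has nothing to undo
opFirst    = opWith opFirst <|> parse_atom          -- cerr blocks parse_atom
tryOpFirst = try (opWith tryOpFirst) <|> parse_atom -- cerr becomes eerr

-- Hypothetical helper: does the parser accept the whole input?
accepts :: Parser Exp -> String -> Bool
accepts p = either (const False) (const True) . parse (p <* eof) ""
```

With these definitions, accepts atomFirst "s(t)" and accepts opFirst "s" are both False, while accepts tryOpFirst succeeds on both "s" and "s(t)".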
Be warned that this behavior isn't a fundamental aspect of monadic parsers. It's a design choice made by Parsec (and compatible parsers like Megaparsec). Other monadic parsers can have different rules for how alternatives with <|> work.
The "general fix" for these kinds of Parsec parsing problems is to be aware of the following facts about the expression p <|> q:
p is tried first, and if it succeeds, q will be ignored, even if q would provide a "longer" or "better" or "more sensible" parse or avoid additional parsing errors further down the road. In parse_atom <|> parse_op, because parse_atom can succeed on strings meant for parse_op, this order won't work correctly.
q is only tried if p fails without consuming input. You must arrange for p to not consume anything on failure, possibly by using try, if you expect the alternative q to be checked. So, parse_op <|> parse_atom isn't going to work if parse_op starts to consume something (like an identifier) before realizing it can't continue and returning cerr.
As an alternative to using try, you can also think more carefully about the structure of your parser. An alternative way of writing parse_exp, for example, would be:
parse_exp :: Parser Exp
parse_exp = do
  -- there's always an identifier
  x <- many1 letter
  -- there *might* be an expression in parentheses
  y <- optionMaybe (parens parse_exp)
  case y of
    Nothing -> return (Atom x)
    Just y' -> return (Op x y')
  where parens = between (char '(') (char ')')
This can be written a little more concisely, but even then it's not as "elegant" as something like try parse_op <|> parse_atom. (It performs better, though, so that might be a consideration in some applications.)
The problem is that the string "s" counts as an atom according to your definitions. Try this:
parse parse_atom "" "s(t)"
> Atom "s"
So your parser parse_exp actually succeeds, returning Atom "s", but then you also expect an EOF right after it, and that's where it fails, encountering an open paren instead of an EOF (just like the error message says!)
When you swap the alternatives around, it first attempts parse_op, which succeeds, returning Op "s" (Atom "t"), and then encounters EOF, just as expected.

Haskell - intersperse a parser with another one

I have two parsers, parser1 :: Parser a and parser2 :: Parser b.
I would now like to parse a list of a values, with parser2 interspersed between them.
The desired signature is something like
interspersedParser :: Parser b -> Parser a -> Parser [a]
For example, if Parser a parses the 'a' character and Parser b parses the 'b' character, then interspersedParser should parse
""
"a"
"aba"
"ababa"
...
I'm using megaparsec. Is there already some combinator which behaves like this, which I'm currently not able to find?
In parsec there is a sepBy parser which does that. The same parser seems to be available in megaparsec as well: https://hackage.haskell.org/package/megaparsec-4.4.0/docs/Text-Megaparsec-Combinator.html
Sure, you can use sepBy, but isn't this just:
interspersedParser sepP thingP = (:) <$> thingP <*> many (sepP *> thingP)
EDIT: Oh, this requires at least one thing to be there. You also wanted empty, so just stick a <|> pure [] on the end.
In fact, this is basically how sepBy1 (a variant of sepBy that requires at least one) is implemented:
-- | @sepBy p sep@ parses /zero/ or more occurrences of @p@, separated
-- by @sep@. Returns a list of values returned by @p@.
--
-- > commaSep p = p `sepBy` comma
sepBy :: Alternative m => m a -> m sep -> m [a]
sepBy p sep = sepBy1 p sep <|> pure []
{-# INLINE sepBy #-}
-- | @sepBy1 p sep@ parses /one/ or more occurrences of @p@, separated
-- by @sep@. Returns a list of values returned by @p@.
sepBy1 :: Alternative m => m a -> m sep -> m [a]
sepBy1 p sep = (:) <$> p <*> many (sep *> p)
{-# INLINE sepBy1 #-}
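To see the hand-written combinator at work, here is a minimal sketch using Parsec rather than Megaparsec (an assumption made to keep the example small; the definition itself only needs Applicative and Alternative operations). The name asSeparatedByBs is hypothetical.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Separator first, element second, matching the signature in the question.
interspersedParser :: Parser b -> Parser a -> Parser [a]
interspersedParser sepP thingP =
  ((:) <$> thingP <*> many (sepP *> thingP)) <|> pure []

-- The example from the question: 'a' characters separated by 'b'.
asSeparatedByBs :: Parser String
asSeparatedByBs = interspersedParser (char 'b') (char 'a') <* eof
```

This accepts "", "a", "aba" and "ababa", returning only the 'a' characters. A trailing separator as in "ab" is rejected, because the separator is consumed before the missing element is noticed.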

Haskell Parsec Parser for Encountering [...]

I'm attempting to write a parser in Haskell using Parsec. Currently I have a program that can parse
test x [1,2,3] end
The code that does this is given as follows
testParser = do {
reserved "test";
v <- identifier;
symbol "[";
l <- sepBy natural commaSep;
symbol "]";
p <- pParser;
return $ Test v (List l) p
} <?> "end"
where commaSep is defined as
commaSep = skipMany1 (space <|> char ',')
Now is there some way for me to parse a similar statement, specifically:
test x [1...3] end
Being new to Haskell, and Parsec for that matter, I'm sure there's some nice concise way of doing this that I'm just not aware of. Any help would be appreciated.
Thanks again.
I'll be using some functions from Control.Applicative like (*>). These functions are useful if you want to avoid the monadic interface of Parsec and prefer the applicative interface, because the parsers become easier to read that way in my opinion.
If you aren't familiar with the basic applicative functions, leave a comment and I'll explain them. You can look them up on Hoogle if you are unsure.
As I've understood your problem, you want a parser for some data structure like this:
data Test = Test String Numbers
data Numbers = List [Int] | Range Int Int
A parser that can parse such a data structure would look like this (I've not compiled the code, but it should work):
-- parses "test <identifier> [<numbers>] end"
testParser :: Parser Test
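testParser =
  Test <$> (reserved "test" *> identifier)
       <*> (symbol "[" *> numbersParser <* symbol "]")
       <*  reserved "end"
       <?> "test"
-- The range is tried first: listParser already succeeds on the leading
-- number of "1...3", so an alternative placed after it would never run.
numbersParser :: Parser Numbers
numbersParser = try rangeParser <|> listParser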
-- parses "<natural>, <natural>, <natural>" etc
listParser :: Parser Numbers
listParser =
  List <$> sepBy natural (symbol ",")
       <?> "list"
-- parses "<natural> ... <natural>"
rangeParser :: Parser Numbers
rangeParser =
  Range <$> natural <* symbol "..."
        <*> natural
        <?> "range"
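The list-or-range part can be tried in isolation. The sketch below assumes a lexer built with makeTokenParser over emptyDef, which may differ from the lexer in the question, and it uses Integer because Parsec's natural returns an Integer. The names lexer and numbers are hypothetical. Note that the range is attempted first: a list parser built on sepBy succeeds after the leading 1 of [1...3], so an alternative placed after it would never be tried.

```haskell
import Text.Parsec
import Text.Parsec.Language (emptyDef)
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok

data Numbers = List [Integer] | Range Integer Integer
  deriving (Eq, Show)

-- Assumed lexer; the question does not show how its token parsers are built.
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser emptyDef

-- Parses "[1,2,3]" as a list and "[1...3]" as a range.
numbers :: Parser Numbers
numbers = Tok.brackets lexer (try range <|> list)
  where
    natural = Tok.natural lexer
    symbol  = Tok.symbol lexer
    -- try rewinds the consumed number if "..." does not follow it
    range   = Range <$> natural <* symbol "..." <*> natural
    list    = List <$> sepBy natural (symbol ",")
```

With this sketch, parse numbers "" "[1,2,3]" yields Right (List [1,2,3]) and parse numbers "" "[1...3]" yields Right (Range 1 3).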
