Parsec: Handling Overlapping Parsers


I'm really new to parsing in Haskell, but it's mostly making sense.
I'm building a templating program, mostly to learn parsing better; templates can interpolate values via {{ value }} notation.
Here's my current parser:
data Template a = Template [Either String a]
data Directive = Directive String
templateFromFile :: FilePath -> IO (Either ParseError (Template Directive))
templateFromFile = parseFromFile templateParser
templateParser :: Parser (Template Directive)
templateParser = do
  tmp <- template
  eof
  return tmp
template :: Parser (Template Directive)
template = Template <$> many (dir <|> txt)
  where
    dir = Right <$> directive
    txt = Left <$> many1 (anyChar <* notFollowedBy directive)
directive :: Parser Directive
directive = do
  _ <- string "{{"
  txt <- manyTill anyChar (string "}}")
  return $ Directive txt
Then I run it on a file something like this:
{{ value }}

This is normal Text

{{ expression }}
When I run this using templateFromFile "./template.txt" I get the error:
Left "./template.txt" (line 5, column 17):
unexpected Directive " expression "
Why is this happening and how can I fix it?
My basic understanding is that many1 (anyChar <* notFollowedBy directive) should grab all of the characters up until the start of the next directive, then fail and return the list of characters up to that point; then it should fall back to the previous many, try parsing dir again, and succeed. Clearly something else is happening, though. I'm having trouble figuring out how to parse things between other things when the parsers mostly overlap.
I'd love some tips on how to structure this all more idiomatically, please let me know if I'm doing something in a silly way. Cheers! Thanks for your time!

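You have a couple of problems. First, in Parsec, if a parser consumes any input and then fails, the failure is committed: combinators such as <|> and many will not backtrack past it, so the whole parse reports an error. So, when the parser: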
anyChar <* notFollowedBy directive
fails (because the character is followed by a directive), it fails after anyChar has consumed input, and that generates an error immediately. Therefore, the parser:
let p1 = many1 (anyChar <* notFollowedBy directive)
will never succeed if it runs into a directive. For example:
parse p1 "" "okay" -- works
parse p1 "" "oops {{}}" -- will fail after consuming "oops "
You can fix this by inserting a try clause:
let p2 = many1 (try (anyChar <* notFollowedBy directive))
parse p2 "" "okay {{}}"
which yields Right "okay" and reveals the second problem. Parser p2 only consumes characters that aren't followed by a directive, so that excludes the space immediately before the directive, and you have no means in your parser to consume a character that is followed by a directive, so it gets stuck.
You actually want something like:
let p3 = many1 (notFollowedBy directive *> anyChar)
which first checks that, at the current position, we aren't looking at a directive before grabbing a character. No try clause is needed because if this fails, it fails without consuming input. (notFollowedBy never consumes input, as per the documentation.)
parse p3 "" "okay" -- returns Right "okay"
parse p3 "" "okay {{}}" -- returns Right "okay "
parse p3 "" "{{fails}}" -- correctly fails w/o consuming input
So, taking your original example with:
txt = Left <$> many1 (notFollowedBy directive *> anyChar)
should work fine.
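Putting the pieces together, here is a minimal self-contained sketch of the corrected parser, assuming the parsec package with String input. The derived Eq and Show instances and the two extra try calls around the delimiters are additions not present in the question; the try calls are there so that a lone "{" or "}" in the text does not trigger the same consume-then-fail problem.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- The question's types; the derived instances are additions so that results
-- can be printed and compared (notFollowedBy also needs Show).
data Template a = Template [Either String a] deriving (Eq, Show)
data Directive = Directive String deriving (Eq, Show)

-- try on both delimiters: a partial match of "{{" or "}}" backtracks
-- instead of failing after consuming input.
directive :: Parser Directive
directive = Directive <$> (try (string "{{") *> manyTill anyChar (try (string "}}")))

template :: Parser (Template Directive)
template = Template <$> many (dir <|> txt) <* eof
  where
    dir = Right <$> directive
    -- Check for a directive *before* taking each character.
    txt = Left <$> many1 (notFollowedBy directive *> anyChar)
```

With these definitions, parse template "" "a {{ b }} c" evaluates to Right (Template [Left "a ",Right (Directive " b "),Left " c"]).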

replace-megaparsec is a library for doing search-and-replace with parsers. The search-and-replace function is streamEdit, which can find your {{ value }} patterns and then substitute in some other text.
streamEdit is built from a generalized version of your template function called sepCap.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Char
import Data.Void
input = unlines
  [ "{{ value }}"
  , ""
  , "This is normal Text"
  , ""
  , "{{ expression }}"
  ]
directive :: Parsec Void String String
directive = do
  _ <- string "{{"
  txt <- manyTill anySingle (string "}}")
  return txt
-- Upper-case the text captured between the braces.
editor k = fmap toUpper k
streamEdit directive editor input
VALUE
This is normal Text
EXPRESSION

many1 (anyChar <* notFollowedBy directive)
This parses only characters not followed by a directive.
{{ value }}
This is normal Text
{{ expression }}
When parsing the text in the middle, it will stop at the last t, leaving the newline before the directive unconsumed (because it's, well, a character followed by a directive), so the next iteration, you try to parse a directive and you fail. Then you retry txt on that newline, the parser expects it not to be followed by a directive, but it finds one, hence the error.

Related

Why Parsec's sepBy stops and does not parse all elements?

I am trying to parse some comma separated string which may or may not contain a string with image dimensions. For example "hello world, 300x300, good bye world".
I've written the following little program:
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.Parsec
import qualified Text.Parsec.Text as PS
parseTestString :: Text -> [Maybe (Int, Int)]
parseTestString s = case parse dimensStringParser "" s of
  Left _ -> [Nothing]
  Right dimens -> dimens
dimensStringParser :: PS.Parser [Maybe (Int, Int)]
dimensStringParser = (optionMaybe dimensParser) `sepBy` (char ',')
dimensParser :: PS.Parser (Int, Int)
dimensParser = do
  w <- many1 digit
  char 'x'
  h <- many1 digit
  return (read w, read h)
main :: IO ()
main = do
  print $ parseTestString "300x300,40x40,5x5"
  print $ parseTestString "300x300,hello,5x5,6x6"
According to optionMaybe documentation, it returns Nothing if it can't parse, so I would expect to get this output:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing, Just (5,5), Just (6,6)]
but instead I get:
[Just (300,300),Just (40,40),Just (5,5)]
[Just (300,300),Nothing]
I.e. parsing stops after first failure. So I have two questions:
Why does it behave this way?
How do I write a correct parser for this case?

In order to answer this question, it's handy to take a piece of paper, write down the input, and act as a dumb parser.
We start with "300x300,hello,5x5,6x6", our current parser is optionMaybe .... Does our dimensParser correctly parse the dimension? Let's check:
w <- many1 digit -- yes, "300"
char 'x' -- yes, "x"
h <- many1 digit -- yes, "300"
return (read w, read h) -- never fails
We've successfully parsed the first dimension. The next token is ,, so sepBy successfully parses that as well. Next, we try to parse "hello" and fail:
w <- many1 digit -- no. 'h' is not a digit. Stop
Next, sepBy tries to parse ,, but that's not possible, since the next token is a 'h', not a ,. Therefore, sepBy stops.
We haven't parsed all the input, but sepBy doesn't require that, so the parse still succeeds. You would have gotten a proper error message if you had used
parse (dimensStringParser <* eof)
Either way, if you want to discard anything in the list that's not a dimension, you can use
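dimensStringParser1 :: PS.Parser (Maybe (Int, Int))
-- try lets a half-matched dimension such as "30abc" fall through to the skipping branch
dimensStringParser1 = try (Just <$> dimensParser) <|> (skipMany (noneOf ",") >> return Nothing)
dimensStringParser = dimensStringParser1 `sepBy` char ','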

I'd guess that optionMaybe dimensParser, when fed with input "hello,...", tries dimensParser. That fails, so optionMaybe returns success with Nothing, and consumes no portion of the input.
The last part is the crucial one: after Nothing is returned, the input string to be parsed is still "hello,...".
At that point sepBy tries to parse char ',', which fails. So, it deduces that the list is over, and terminates the output list, without consuming any more input.
If you want to skip other entities, you need a "consuming" parser that returns Nothing instead of optionMaybe. That parser, however, needs to know how much to consume: in your case, everything up to the next comma.
Perhaps you need something like (untested)
( try (Just <$> dimensParser)
  <|> (skipMany (noneOf ",") >> return Nothing))
  `sepBy` char ','
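For reference, a self-contained sketch of that idea, assuming the parsec package. It parses String rather than Text to keep the example dependency-free, and the names field and dimens are made up for illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

dimensParser :: Parser (Int, Int)
dimensParser = do
  w <- many1 digit
  _ <- char 'x'
  h <- many1 digit
  return (read w, read h)

-- One list element: a dimension if it parses as one, otherwise everything
-- up to the next comma is consumed and reported as Nothing.
field :: Parser (Maybe (Int, Int))
field = try (Just <$> dimensParser) <|> (Nothing <$ skipMany (noneOf ","))

-- eof makes leftover input an error instead of a silent truncation.
dimens :: Parser [Maybe (Int, Int)]
dimens = field `sepBy` char ',' <* eof
```

Here parse dimens "" "300x300,hello,5x5,6x6" gives Right [Just (300,300),Nothing,Just (5,5),Just (6,6)]. The sketch does not trim spaces around the commas, so a field like " 300x300" would come back as Nothing.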

Parse a sub-string with parsec (by ignoring unmatched prefixes)

I would like to extract the repository name from the first line of git remote -v, which is usually of the form:
origin git@github.com:some-user/some-repo.git (fetch)
I quickly made the following parser using parsec:
-- | Parse the repository name from the output given by the first line of `git remote -v`.
repoNameFromRemoteP :: Parser String
repoNameFromRemoteP = do
  _ <- originPart >> hostPart
  _ <- char ':'
  firstPart <- many1 alphaNum
  _ <- char '/'
  secondPart <- many1 alphaNum
  _ <- string ".git"
  return $ firstPart ++ "/" ++ secondPart
  where
    originPart = many1 alphaNum >> space
    hostPart = many1 alphaNum
      >> (string "@" <|> string "://")
      >> many1 alphaNum `sepBy` char '.'
But this parser looks a bit awkward. Actually I'm only interested in whatever follows the colon (":"), and it would be easier if I could just write a parser for it.
Is there a way to have parsec skip a character upon a failed match, and re-try from the next position?

If I've understood the question, try many (noneOf ":"). This will consume any character until it sees a ':', then stop.
Edit: Seems I had not understood the question. You can use the try combinator to turn a parser which may consume some characters before failing into one that consumes no characters on a failure. So:
skipUntil p = try p <|> (anyChar >> skipUntil p)
Beware that this can be quite expensive, both in runtime (because it will try matching p at every position) and memory (because try prevents p from consuming characters and so the input cannot be garbage collected at all until p completes). You might be able to alleviate the first of those two problems by parameterizing the anyChar bit so that the caller could choose some cheap parser for finding candidate positions; e.g.
skipUntil p skipper = try p <|> (skipper >> skipUntil p skipper)
You could then potentially use the above many (noneOf ":") construction to only try p on positions that start with a :.
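A small runnable sketch of the first variant, assuming the parsec package. The repoName parser and its character classes are illustrative guesses at what a repository path looks like, not something taken from the question.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Try p at the current position; on failure drop one character and retry.
skipUntil :: Parser a -> Parser a
skipUntil p = try p <|> (anyChar *> skipUntil p)

-- Hypothetical parser for the ":user/repo.git" part only.
repoName :: Parser String
repoName = do
  _ <- char ':'
  user <- many1 (alphaNum <|> oneOf "-_")
  _ <- char '/'
  repo <- manyTill (alphaNum <|> oneOf "-_.") (try (string ".git"))
  return (user ++ "/" ++ repo)
```

With this, parse (skipUntil repoName) "" "origin\tgit@github.com:some-user/some-repo.git (fetch)" yields Right "some-user/some-repo". If no position matches, skipUntil runs to the end of the input and fails there.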

The sepCap combinator from replace-megaparsec can skip a character upon a failed match, and re-try from the next position.
Maybe this is overkill for your particular case, but it does solve the general problem.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either
import Data.Void
import Control.Monad (void)
username :: Parsec Void String String
username = do
  void $ single ':'
  some $ alphaNumChar <|> single '-'
listToMaybe . rights =<< parseMaybe (sepCap username)
  "origin git@github.com:some-user/some-repo.git (fetch)"
Just "some-user"

Parsing multiple lines into a list of lists in Haskell

I am trying to parse a file that looks like:
a b c
d e f
I want to match each of the symbols in the line and parse everything into a list of lists such as:
[[A, B, C], [D, E, F]]
In order to do that I tried the following:
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as P
parserP :: Parser [[MyType]]
parserP = do
  x <- rowP
  xs <- many (newline >> rowP)
  return (x : xs)
rowP :: Parser [MyType]
rowP = manyTill cellP $ void newline <|> eof
cellP :: Parser MyType
cellP = aP <|> bP <|> ... -- rest of the parsers, they all look very similar
aP :: Parser MyType
aP = symbol "a" >> return A
bP :: Parser MyType
bP = symbol "b" >> return B
lexer = P.makeTokenParser emptyDef
symbol = P.symbol lexer
But it fails to return multiple inner lists. Instead what I get is:
[[A, B, C, D, E, F]]
What am I doing wrong? I was expecting manyTill to parse cellP until the newline character, but that's not the case.

Parser combinators are overkill for something this simple. I'd use lines :: String -> [String] and words :: String -> [String] to break up the input and then map the individual tokens into MyTypes.
toMyType :: String -> Maybe MyType
toMyType "a" = Just A
toMyType "b" = Just B
toMyType "c" = Just C
toMyType _ = Nothing
parseMyType :: String -> Maybe [[MyType]]
parseMyType = traverse (traverse toMyType) . fmap words . lines

You're right that manyTill keeps parsing until a newline. But manyTill never gets to see the newline because cellP is too eager. cellP ends up calling P.symbol, whose documentation states
symbol :: String -> ParsecT s u m String
Lexeme parser symbol s parses string s and skips trailing white space.
The keyword there is 'white space'. It turns out, Parsec defines whitespace as being any character which satisfies isSpace, which includes newlines. So P.symbol is happily consuming the c, followed by the space and the newline, and then manyTill looks and doesn't see a newline because it's already been consumed.
If you want to drop the Parsec routine, go with Benjamin's solution. But if you're determined to stick with Parsec, the basic idea is that you want to modify the token parser's whiteSpace field so that whitespace no longer includes newlines. Something like
lexer = let lexer0 = P.makeTokenParser emptyDef
        in lexer0 { P.whiteSpace = void $ many (oneOf " \t") }
That's pseudocode and probably won't work as-is: symbol and the other lexeme parsers produced by makeTokenParser are built around the original whitespace skipper, so overriding the record field afterwards does not change what they skip. The idea still stands, though. You want whitespace to be whatever you define it to be, not what the library defines by default. Note that changing this will also break your comment syntax, if you have one defined, since whiteSpace was previously equipped to handle comments.
In short, Benjamin's answer is probably the best way to go. There's no real reason to use Parsec here. But it's also helpful to know why this particular solution didn't work: Parsec's default definition of a language wasn't designed to treat newlines with significance.
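If you do want to stay with Parsec, one way around the token module is to write the lexeme helper by hand. The following is only a sketch under assumptions: MyType is cut down to three constructors, and inlineSpace, symbol and gridP are names invented for this example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data MyType = A | B | C deriving (Eq, Show)

-- Skip spaces and tabs only, so newlines stay available as row separators.
inlineSpace :: Parser ()
inlineSpace = skipMany (oneOf " \t")

-- Hand-written replacement for P.symbol that never consumes a newline.
symbol :: String -> Parser String
symbol s = try (string s) <* inlineSpace

cellP :: Parser MyType
cellP = (A <$ symbol "a") <|> (B <$ symbol "b") <|> (C <$ symbol "c")

rowP :: Parser [MyType]
rowP = many1 cellP

-- Rows separated (and optionally terminated) by newlines.
gridP :: Parser [[MyType]]
gridP = rowP `sepEndBy1` newline <* eof
```

parse gridP "" "a b c\nc b a\n" then returns Right [[A,B,C],[C,B,A]], because each row stops at the newline instead of swallowing it.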

Grouping lines with Parsec

I have a line-based text format I want to parse with Parsec†. A line either starts with a pound sign and specifies a key value pair separated by a colon or is a URL that is described by the previous tags.
Here's a short example:
#foo:bar
#faz:baz
https://example.com
#foo:beep
https://example.net
For simplicity's sake, I'm going to store everything as String. A Tag is a type Tag = (String, String), for example ("foo", "bar"). Ultimately, I'd like to group these as ([Tag], URL).
However, I struggle figuring out how to parse either [one or more tags] or [one URL].
My current approach looks like this:
import qualified System.Environment as Env
import qualified Text.Megaparsec as M
import qualified Text.Megaparsec.Text as M
type Tag = (String, String)
data Segment = Tags [Tag] | URL String
deriving (Eq, Show)
tagP :: M.Parser Tag
tagP = M.char '#' *> ((,) <$> M.someTill M.printChar (M.char ':') <*> M.someTill M.printChar M.eol) M.<?> "Tag starting with #"
urlP :: M.Parser String
urlP = M.someTill M.printChar M.eol M.<?> "Some URL"
parser :: M.Parser Segment
parser = (Tags <$> M.many tagP) M.<|> (URL <$> urlP)
main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (parser <* M.eof) fname
  print res
If I try to run this on the above sample, I get a parsing error like this:
3:1:
unexpected 'h'
expecting Tag starting with # or end of input
Clearly my use of many in combination with <|> is incorrect. Since the tag parser won't consume any input from the URL parser it cannot be related to backtracking. How do I need to change this to get to the desired result?
The full example is available on GitHub.
† I'm actually using MegaParsec here for better error messages but I think the problem is quite generic and not about any particular implementation of parser combinators.

What you're doing mostly works. The catch is that, at the moment, you only parse a single segment (i.e., either only tags or only a URL), and that doesn't consume the whole input. It's eof that's causing the error.
Simply use one more many or some, to allow for multiple segments:
main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (M.many parser <* M.eof) fname
  print res

@cocreature answered this for me on Twitter.
As leftaroundabout pointed out here, there are two separate mistakes in my code:
The parser itself misuses <|> while it should just sequentially parse the lines and skip to the next parser if it doesn't consume any input.
The invocation (parseFromFile) only applies the parser function a single time and would fail as soon as it would get to the second block.
We can fix the parser and introduce grouping in one go:
parser :: M.Parser ([Tag], String)
parser = liftA2 (,) (M.many tagP) urlP
Afterwards, we just need to apply the change suggested by leftaroundabout:
...
res <- M.parseFromFile (M.many parser <* M.eof) fname
Running this leads to the desired result:
[([("foo","bar"),("faz","baz")],"https://example.com"),([("foo","beep")],"https://example.net")]
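For completeness, here is a self-contained sketch of the grouped parser. It assumes a recent Megaparsec (version 7 or later, where the Parser alias is defined by hand instead of coming from Text.Megaparsec.Text) and String input; groupP and groupsP are names chosen for this example.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (char, eol, printChar)

type Parser = Parsec Void String
type Tag = (String, String)

-- "#key:value" followed by a line ending.
tagP :: Parser Tag
tagP = char '#' *> ((,) <$> someTill printChar (char ':') <*> someTill printChar eol)

urlP :: Parser String
urlP = someTill printChar eol

-- Zero or more tag lines, then the URL they describe.
groupP :: Parser ([Tag], String)
groupP = (,) <$> many tagP <*> urlP

groupsP :: Parser [([Tag], String)]
groupsP = many groupP <* eof
```

Calling parseMaybe groupsP on the sample file's contents (with each line terminated by a newline) produces the same grouped list, wrapped in Just.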

Parsing data with Parsec and omitting comments

I am trying to write a Haskell Parsec parser that parses input data from a file into the LogLine datatype as follows:
--Final parser that holds the individual parsers.
final :: Parser [LogLine]
final = do{ logLines <- sepBy1 logLine eol
; return logLines
}
--The logline token declaration
logLine :: Parser LogLine
logLine = do
  name <- plainValue -- parse the name (identifier)
  many1 space -- parse and throw away a space
  args1 <- bracketedValue -- parse the first arguments
  many1 space -- throw away the second space
  args2 <- bracketedValue -- parse the second list of arguments
  many1 space --
  constant <- plainValue -- parse the constant identifier
  space
  weighting <- plainValue --parse the weighting double
  space
  return $ LogLine name args1 args2 constant weighting
It parses everything just fine, but now I need to add comments to the file, and I have to modify the parser so that it ignores them.
It should support single-line comments only beginning with "--" and ending with a '\n'
I've tried defining the comment token as follows:
comments :: Parser String
comments = do
  string "--"
  comment <- (manyTill anyChar newline)
  return ""
And then plugging it into the final parser like so:
final :: Parser [LogLine]
final = do
  optional comments
  logLines <- sepBy1 logLine (comments<|>newline)
  optional comments
  return logLines
It compiles fine, but it does not parse. I've tried several minor modifications but the best result was parsing everything up to the first comment, so I'm beginning to think that this is not the way to do it.
PS:
I've seen this Similar Question, but it is slightly different from what I'm trying to achieve.

If I understand your description of the format in your comment correctly, your example for the format would be
name arg1 arg2 c1 weight
-- comment goes here
optionally followed by further log-lines and/or comments.
Then your problem is that there is a newline between the log-line and the comment line, which means that the comments part of the separator parser fails - comments must start with "--" - without consuming input, so newline is tried and succeeds. Then the next line begins with "--" which makes plainValue fail without consuming input, and thus ends the sepBy1.
The solution is to let the separator first consume a newline, and then as many comment lines as follow:
final = do
  skipMany comments
  sepEndBy1 logLine (newline >> skipMany comments)
By allowing the sequence to be ended by a separator (sepEndBy1 instead of sepBy1), any comment lines after the final LogLine are automatically skipped.
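Since the question doesn't show LogLine, plainValue or bracketedValue, the sketch below swaps in a simplified stand-in (a log line is just its list of whitespace-separated words) to demonstrate the separator structure. It assumes the parsec package; everything except the shape of final is hypothetical.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical stand-in for the question's LogLine: the words on the line.
type LogLine = [String]

-- A comment runs from "--" to the end of the line, newline included.
comment :: Parser ()
comment = () <$ (try (string "--") *> manyTill anyChar (() <$ newline <|> eof))

-- Simplified log line: words separated by spaces or tabs, newline excluded.
logLine :: Parser LogLine
logLine = many1 (noneOf " \t\n") `sepEndBy1` many1 (oneOf " \t")

-- The separator eats the newline first, then any comment lines that follow.
final :: Parser [LogLine]
final = skipMany comment *> sepEndBy1 logLine (newline *> skipMany comment) <* eof
```

On the input "-- header\nfoo [a] [b] c 1.0\n-- note\nbar [c] [d] e 2.5\n-- trailing\n" this returns Right [["foo","[a]","[b]","c","1.0"],["bar","[c]","[d]","e","2.5"]]. Unlike the question's logLine, the stand-in never consumes the newline itself, which is what lets the separator see it.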

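The way I understand your problem, each line is either a comment or a logLine. If so, final should look something like this. Both alternatives have to return the same type, so comment lines become Nothing and are dropped afterwards with catMaybes from Data.Maybe; this assumes comments stops before the newline rather than consuming it:
final :: Parser [LogLine]
final = do
  entries <- sepBy1 ((Nothing <$ comments) <|> (Just <$> logLine)) newline
  return (catMaybes entries)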
