How to negate a parser with Parsec

I have a file with line endings "\r\r\n", and I use the parser eol = string "\r\r\n" :: Parser String to handle them. To get a list of the lines between these separators, I would like to use sepBy along with a parser that returns any text that would not be captured by eol. Looking through the documentation, I did not see a combinator that negates a parser (an "anything but the pattern \r\r\n" parser).
I have tried using sepBy (many anyToken) eol, but many anyToken appears to be greedy and does not stop at eol matches. I cannot use many (noneOf "\n\r"), because there are several places in my text with a lone '\n' character.
Is there a combinator that can get me the inverse of string "\r\r\n"?

I'm afraid you're going about it backwards. Parsec parsers don't chop up the input, they build the output.
The more you try to parse by thinking about what you don't want, the harder it will be. You need to think bottom-up about what is permissible, not top-down about where to chop.
You should start with the smallest, most basic thing you do want. For example, don't think of an identifier as everything before a space; think of it as a letter followed by alphanumeric characters. You can then combine that, separated by whitespace, with the other things you expect on a line:
line = do
  i <- identifier
  whiteSpace
  string "="
  e <- expr
  return $ Line i e
Only when you've completed a parser that successfully parses what you want from a line and rejects invalid lines should you parse multiple lines:
lines = sepBy line eol
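To make that concrete, here is a self-contained sketch in the OP's setting; identifier and expr are hypothetical stand-ins for whatever the real grammar requires, and plain space-skipping is used instead of a token-based whiteSpace:
import Text.Parsec
import Text.Parsec.String (Parser)

data Line = Line String String deriving Show

-- hypothetical stand-in: a letter followed by alphanumeric characters
identifier :: Parser String
identifier = (:) <$> letter <*> many alphaNum

-- hypothetical stand-in for a real expression parser
expr :: Parser String
expr = many1 (noneOf "\r\n")

lineP :: Parser Line
lineP = do
  i <- identifier
  _ <- many (char ' ')
  _ <- char '='
  _ <- many (char ' ')
  e <- expr
  return (Line i e)

linesP :: Parser [Line]
linesP = lineP `sepBy` try (string "\r\r\n")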

As a tentative answer, it looks like manyTill anyChar (try eol) does what I want. As part of my original question though, I'm still interested in knowing whether there is a general way to negate a parser, or whether there's another recommended way of doing what I want.
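For the record, a general "anything but p" can be built from notFollowedBy: succeed on any single character that does not begin a match of p. A minimal sketch for the separator above (the names notEol and linesP are mine):
import Text.Parsec
import Text.Parsec.String (Parser)

eol :: Parser String
eol = string "\r\r\n"

-- any single character that does not start an eol sequence
notEol :: Parser Char
notEol = notFollowedBy (try eol) *> anyChar

-- assumes the segments between separators are non-empty
linesP :: Parser [String]
linesP = many1 notEol `sepEndBy` eol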

The sepCap parser combinator from the replace-megaparsec package does this kind of parser negation. It returns a list of Either, with the negative matches in Left and the positive matches in Right.
import Data.Void
import Replace.Megaparsec
import Text.Megaparsec

parseTest (sepCap (chunk "\r\r\n" :: Parsec Void String String))
    $ "one\r\r\ntwo\r\r\nthree\r\r\n"
[ Left "one"
, Right "\r\r\n"
, Left "two"
, Right "\r\r\n"
, Left "three"
, Right "\r\r\n"
]
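To keep only the text between the separators, take the Left parts of that result, e.g. with lefts from Data.Either and parseMaybe from Text.Megaparsec (a small usage sketch):
import Data.Either (lefts)

lefts <$> parseMaybe (sepCap (chunk "\r\r\n" :: Parsec Void String String))
                     "one\r\r\ntwo\r\r\nthree\r\r\n"
Just ["one","two","three"]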

Related

Why are the newline and the white space treated differently in the lexer specification?

I am using F#'s FsLex to generate a lexer, and I have difficulty understanding the following two lines from a textbook. Why is the newline (\n) treated differently from the other white space? In particular, what does "lexbuf.EndPos <- lexbuf.EndPos.NextLine" do differently from "Tokenize lexbuf" alone?
rule Tokenize = parse
| [' ' '\t' '\r'] { Tokenize lexbuf }
| '\n' { lexbuf.EndPos <- lexbuf.EndPos.NextLine; Tokenize lexbuf }
A rule is essentially a function that takes a lexer buffer as an argument. Each case on the left side of your rule matches a given character (e.g., '\n') or class of characters ([' ' '\t' '\r']) in your input. The expression on the right side of the rule, inside the curly braces { ... }, defines an action. The definition you pasted appears to be a tokenizer.
The expression Tokenize lexbuf is a recursive call to the Tokenize rule. In essence, this rule ignores whitespace characters. Why? Because tokenizers aim to simplify the input. Whitespace typically has no meaning in a programming language, so this rule filters it out. Tokenized input generally makes writing your parser simpler later on. You'll eventually want to add other cases to your Tokenize rule (e.g., for keywords, assignment statements, and other expressions) to produce a complete lexer definition.
The second rule, the one that matches \n, also ignores the whitespace, but as you correctly point out, it does something extra: it updates the current position (lexbuf.EndPos) to point at the next line (lexbuf.EndPos.NextLine) before recursively calling Tokenize again. Why? Presumably so that the position information is correct on subsequent calls.
Since you're only showing a lexer fragment here, I can only guess as to what lexbuf.EndPos is used for, but it's pretty common to keep that information around for diagnostic purposes.

Parse Within Parse - Attoparsec [duplicate]

Given the following type and function, meant to parse a field of a CSV file into a string:
type Parser resultType = ParsecT String () Identity resultType
cell :: Parser String
I have implemented the following function:
customCell :: String -> Parser res -> Parser res
customCell typeName subparser =
  cell >>= either (const $ unexpected typeName) return
           . parse (subparser <* eof) ""
Still, I cannot stop thinking that I am not using the Monad concept as much as I could, and that there may be a better way to merge the result of the inner parser with that of the outer one, especially where failure is concerned.
Does anybody know how I could do so, or is this code the way it is meant to be done?
PS - I now realised that my type simplification is probably not appropriate and that maybe what I want is to replace the underlying Identity Monad by the Either Monad.... Unfortunately, I do not feel enough acquainted with monad transformers yet.
PS2 - What the hell is the underlying monad good for anyway?
Elaborating on @Daniel Wagner's answer... The way parsers are normally built with Parsec, you start with low-level parsers that parse specific characters (e.g., a plus sign or a digit), and you build parsers on top of them using combinators (like a many1 combinator that turns a parser that reads a single digit into a parser that reads one or more digits, or a monadic parser that parses "one or more digits" followed by a "plus sign" followed by "one or more digits").
However, each parser, whether it's a low-level digit parser or a higher-level "addition expression" parser, is intended to be applied directly to the same input stream.
What you don't typically do is write a parser that gobbles a chunk of the input stream to produce, say, a String and another parser that parses that String (instead of the original input stream) and try to combine them. This is the kind of "vertical composition" that isn't directly supported by Parsec and looks unnatural and non-monadic.
As pointed out in the comments, there are some situations where vertical composition is the cleanest overall approach (like when you have one language embedded within the components or expressions of another language), but it's not the usual approach taken by a Parsec parser.
The bottom line in your application is that a cell parser that produces only a String is too specialized to be useful. A more useful Parsec framework for CSV files would be:
import Text.Parsec
import Text.Parsec.String

-- | `csv cell` parses a CSV file each of whose elements is parsed by `cell`
csv :: Parser a -> Parser [[a]]
csv cell = many (row cell)

-- | `row cell` parses a newline-terminated row of comma-separated
-- `cell`-expressions
row :: Parser a -> Parser [a]
row cell = sepBy cell (char ',') <* char '\n'
Now, you can write a custom cell parser that parses positive integers:
customCell :: Parser Int
customCell = read <$> many1 digit
and parse CSV files:
> parse (csv customCell) "" "1,2,3\n4,5,6\n"
Right [[1,2,3],[4,5,6]]
>
Here, instead of having a cell subparser that explicitly parses a comma-delimited cell into a string to be fed to a different parser, the "cell" is an implicit context: a supplied cell parser is invoked against the underlying input stream at exactly the point where a comma-delimited cell is expected, in the middle of a row, in the middle of the input stream.
Sadly I know of no parser library or parser generator for Haskell that supports vertical parser composition like this. Something like what you wrote is about as good as it gets. Dang!
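If you do end up needing the explicit two-stage version, here is roughly what the hand-rolled composition looks like when cleaned up; a sketch only (the helper name nestedP is mine), with the inner failure surfaced through parserFail:
import Text.Parsec
import Text.Parsec.String (Parser)

-- Run `inner` on the String produced by `outer`; if the inner parser
-- fails, fail the outer parser with the inner error message.
nestedP :: Parser String -> Parser a -> Parser a
nestedP outer inner = do
  s <- outer
  case parse (inner <* eof) "<nested>" s of
    Left err -> parserFail (show err)
    Right x  -> return x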

Grouping lines with Parsec

I have a line-based text format I want to parse with Parsec†. A line either starts with a pound sign and specifies a key value pair separated by a colon or is a URL that is described by the previous tags.
Here's a short example:
#foo:bar
#faz:baz
https://example.com
#foo:beep
https://example.net
For simplicity's sake, I'm going to store everything as String. A Tag is a type Tag = (String, String), for example ("foo", "bar"). Ultimately, I'd like to group these as ([Tag], URL).
However, I'm struggling to figure out how to parse either [one or more tags] or [one URL].
My current approach looks like this:
import qualified System.Environment as Env
import qualified Text.Megaparsec as M
import qualified Text.Megaparsec.Text as M

type Tag = (String, String)

data Segment = Tags [Tag] | URL String
  deriving (Eq, Show)

tagP :: M.Parser Tag
tagP = M.char '#' *> ((,) <$> M.someTill M.printChar (M.char ':') <*> M.someTill M.printChar M.eol) M.<?> "Tag starting with #"

urlP :: M.Parser String
urlP = M.someTill M.printChar M.eol M.<?> "Some URL"

parser :: M.Parser Segment
parser = (Tags <$> M.many tagP) M.<|> (URL <$> urlP)

main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (parser <* M.eof) fname
  print res
If I try to run this on the above sample, I get a parsing error like this:
3:1:
unexpected 'h'
expecting Tag starting with # or end of input
Clearly my use of many in combination with <|> is incorrect. Since the tag parser won't consume any input from the URL parser it cannot be related to backtracking. How do I need to change this to get to the desired result?
The full example is available on GitHub.
† I'm actually using MegaParsec here for better error messages but I think the problem is quite generic and not about any particular implementation of parser combinators.
What you're doing works fine; it's just that at the moment you only parse a single segment (i.e., either only tags or only a URL), and that doesn't consume the whole input. It's the eof that's causing the error.
Simply use one more many or some, to allow for multiple segments:
main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (many parser <* M.eof) fname
  print res
@cocreature answered this for me on Twitter.
As leftaroundabout pointed out here, there are two separate mistakes in my code:
The parser itself misuses <|> while it should just sequentially parse the lines and skip to the next parser if it doesn't consume any input.
The invocation (parseFromFile) only applies the parser function a single time and would fail as soon as it would get to the second block.
We can fix the parser and introduce grouping in one go (liftA2 comes from Control.Applicative):
parser :: M.Parser ([Tag], String)
parser = liftA2 (,) (M.many tagP) urlP
Afterwards, we just need to apply the change suggested by leftaroundabout:
...
res <- M.parseFromFile (M.many parser <* M.eof) fname
Running this leads to the desired result:
[([("foo","bar"),("faz","baz")],"https://example.com"),([("foo","beep")],"https://example.net")]

Searching for a pattern with Parsec

Not sure if this is possible (or recommended), but I am essentially trying to search for a sequence of characters in a file using Parsec. Example file:
START (name)
junk
morejunk=junk;
dontcare
foo ()
bar
care_about this (stuff in here i dont care about);
don't care about this
or this
foo = bar;
also_care
about_this
(dont care whats in here);
and_this too(only the names
at the front
do i care about
);
foobar
may hit something = perhaps maybe (like this);
foobar
END
And here is my attempt at getting it working:
careAbout :: Parser (String, String)
careAbout = do
  name1 <- many1 (noneOf " \n\r")
  skipMany space
  name2 <- many1 (noneOf " (\r\n")
  skipMany space
  skipMany1 parens
  skipMany space
  char ';'
  return (name1, name2)

parens :: Parser ()
parens = do
  char '('
  many (parens <|> skipMany1 (noneOf "()"))
  char ')'
  return ()
parseFile :: Parser [(String, String)]
parseFile =
  manyTill (try careAbout <|> (anyChar >> return ("", "")))
           (try $ string "END")
I'm trying to brute force the search by looking for careAbout, and if that doesn't work, eat one character and try again. I could parse all the junk in the middle (I know what it could be), but I don't care about what it is (so why bother parsing it), and it's potentially complicated.
Problem is, my solution doesn't quite work. anyChar ends up consuming everything, and the searching for END never gets a chance. Also, somewhere in the careAbout we hit eof and some Exception is thrown because of it.
This is probably the exact wrong way of doing it, and I would like to know of a way, or even better, the Right Way™, of doing it.
If not for the parens parser, this would be a good fit for a regular language parser, such as regex-applicative. This is because regular language parsers are much more "smart" about "backtracking" (in fact there's no backtracking going on at all, and yet every possible branch is explored).
However, as you probably know, matching parentheses is not a regular language. If you can relax your grammar to become regular, give regex-applicative a try.
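If you can live with that restriction (no nested parentheses), a hedged sketch with regex-applicative might look like the following; every name here is mine, and the character classes are guesses modeled on the Parsec version above:
import Control.Applicative
import Text.Regex.Applicative

-- assumes parentheses are never nested, which keeps the pattern regular
careRE :: RE Char (String, String)
careRE =
  (,) <$> some (psym (`notElem` " \n\r"))
      <*  some (psym (`elem` " \t\n\r"))
      <*> some (psym (`notElem` " (\r\n"))
      <*  many (psym (`elem` " \t\n\r"))
      <*  sym '('
      <*  many (psym (/= ')'))          -- flat contents only
      <*  sym ')'
      <*  many (psym (`elem` " \t\n\r"))
      <*  sym ';'

-- first occurrence anywhere in the input
firstCare :: String -> Maybe (String, String)
firstCare s = (\(_, m, _) -> m) <$> findFirstInfix careRE s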
I can't really tell from OP's post which parts of the file we care about or don't, so I'm not going to post a specific solution. But in general, for searching through a file for patterns which match a recursive parser, one can use replace-megaparsec.
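As a sketch of that approach (assuming careAbout were ported to Megaparsec; the name findAll is mine):
import Data.Either (rights)
import Data.Void (Void)
import Replace.Megaparsec (sepCap)
import Text.Megaparsec

type Parser = Parsec Void String

-- Collect every match of a parser anywhere in the input, discarding
-- the unmatched junk captured in between.
findAll :: Parser a -> String -> [a]
findAll p input = maybe [] rights (parseMaybe (sepCap p) input)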
