FParsec - how to parse strings separated by pipes? - f#

I'm using FParsec to write a small org-mode parser for fun, and I'm having a little trouble parsing a table row into a list of strings. My current code looks like this:
let parseRowEntries : Parser<RowEntries, unit> =
    let skipInitialPipe = skipChar '|'
    let notaPipe = function
        | '|' -> false
        | _ -> true
    let pipeSep = pchar '|'
    skipInitialPipe >>. sepEndBy (many1Satisfy notaPipe) pipeSep
    |>> RowEntries
This works fine until you parse the string |blah\n|blah\n|blah|, which should fail because of the newline character. Unfortunately, simply making \n false in the notaPipe condition causes the parser to stop after the first 'blah' and report a successful parse. What I want many1Satisfy to do is parse (almost) any character, stopping at the pipe and failing on newlines (and likely at end of input).
I've tried using charsTillString but that also just halts parsing at the first pipe, without an error.

If I've understood your spec correctly, this should work:
let parseOneRow : Parser<_, unit> =
    let notaPipe = function
        | '|' -> false
        | '\n' -> false
        | _ -> true
    let pipe = pchar '|'
    pipe >>. manyTill (many1Satisfy notaPipe .>> pipe) (skipNewline <|> eof)

let parseRowEntries : Parser<_, unit> =
    many parseOneRow

run parseRowEntries "|row|with|four|columns|\n|second|row|"
// Success: [["row"; "with"; "four"; "columns"]; ["second"; "row"]]
The structure is that each row starts with a pipe, and the segments within a row are conceptually row|, with|, and so on. The .>> combinator discards the pipe. The "till" part of that line uses skipNewline instead of newline because both sides of <|> must return the same type: the eof parser returns unit, so we need a parser that expects a newline and returns unit. That's the skipNewline parser.
I've tried throwing newlines in where they don't belong (before the pipes, for example), and that causes this parser to fail exactly as it should. It also fails if a column is empty (that is, two pipe characters occur side by side, like ||), which I think is also what you want. If you want to allow empty columns, just use manySatisfy instead of many1Satisfy.
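For readers who don't use FParsec, here is a rough, language-neutral sketch of the same row grammar in plain Python. It is not FParsec code: the function name parse_rows and the error messages are made up for illustration, and unlike the combinator version it insists on consuming the whole input.

```python
def parse_rows(text):
    """Parse pipe-delimited rows; raise ValueError on malformed input."""
    rows, pos = [], 0
    while pos < len(text):
        # Every row must open with a pipe.
        if text[pos] != "|":
            raise ValueError(f"expected '|' at offset {pos}")
        pos += 1
        cells = []
        # Read cells until the row ends at a newline or at end of input.
        while pos < len(text) and text[pos] != "\n":
            start = pos
            while pos < len(text) and text[pos] not in "|\n":
                pos += 1
            if pos == start:
                raise ValueError(f"empty cell at offset {pos}")
            # A cell must be closed by a pipe, not by a newline or end of input.
            if pos == len(text) or text[pos] != "|":
                raise ValueError(f"expected '|' at offset {pos}")
            cells.append(text[start:pos])
            pos += 1  # discard the closing pipe, like `.>> pipe`
        pos += 1  # step past the newline (harmless at end of input)
        rows.append(cells)
    return rows
```

The key point it shares with the answer is that a newline inside a cell is an error instead of a quiet stopping point, because each cell is required to end in a pipe.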

Related

How to parse many characters except few in the parentheses?

The best way to parse any character except a few is to use the noneOf combinator.
Unfortunately, it doesn't work if I combine it in the following way:
Combine.parse (Combine.parens <| Combine.many <| Combine.Char.noneOf ['"', '\\']) "()"
Err ((),{ data = "()", input = "", position = 2 },["expected \")\""])
: Result.Result
(Combine.ParseErr ()) (Combine.ParseOk () (List Char))
Your use of noneOf results in that parser consuming all characters including the closing parenthesis. Since the inner portion consumes the closing paren, the Combine.parens parser will not see the closing paren. You need to cause the many <| noneOf ... parser to halt on a closing parenthesis.
Consider adding the closing parenthesis to the list of characters in noneOf:
Combine.parse (Combine.parens <| Combine.many <| Combine.Char.noneOf ['"', '\\', ')']) "()"
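To see why the extra character matters, here is a small Python sketch of the same situation. It is only an analogy, not the Combine API: many_none_of and parens are hypothetical helpers that mimic a greedy "many noneOf" wrapped in parentheses.

```python
def many_none_of(excluded, text, pos):
    """Greedily take characters that are not in `excluded`."""
    start = pos
    while pos < len(text) and text[pos] not in excluded:
        pos += 1
    return text[start:pos], pos


def parens(excluded, text):
    """Parse '(' body ')' where body is many_none_of(excluded)."""
    if not text.startswith("("):
        raise ValueError('expected "("')
    body, pos = many_none_of(excluded, text, 1)
    # If the body already swallowed ")", nothing is left to match here.
    if pos >= len(text) or text[pos] != ")":
        raise ValueError('expected ")"')
    return body
```

With excluded set to '"\\' the body eats the closing parenthesis and parens fails with the same 'expected ")"' complaint; adding ')' to the excluded set makes the body stop in time.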

Parsec: Handling Overlapping Parsers

I'm really new to parsing in Haskell, but it's mostly making sense.
I'm working on building a Templating program mostly to learn parsing better; templates can interpolate values in via {{ value }} notation.
Here's my current parser,
data Template a = Template [Either String a]
data Directive = Directive String

templateFromFile :: FilePath -> IO (Either ParseError (Template Directive))
templateFromFile = parseFromFile templateParser

templateParser :: Parser (Template Directive)
templateParser = do
  tmp <- template
  eof
  return tmp

template :: Parser (Template Directive)
template = Template <$> many (dir <|> txt)
  where
    dir = Right <$> directive
    txt = Left <$> many1 (anyChar <* notFollowedBy directive)

directive :: Parser Directive
directive = do
  _ <- string "{{"
  txt <- manyTill anyChar (string "}}")
  return $ Directive txt
Then I run it on a file something like this:
{{ value }}

This is normal Text

{{ expression }}
When I run this using templateFromFile "./template.txt" I get the error:
Left "./template.txt" (line 5, column 17):
unexpected Directive " expression "
Why is this happening and how can I fix it?
My basic understanding is that many1 (anyChar <* notFollowedBy directive) should grab all of the characters up until the start of the next directive, then should fail and return the list of characters up to that point; then it should fall back to the previous many, try parsing dir again, and succeed. Clearly something else is happening, though. I'm having trouble figuring out how to parse things between other things when the parsers mostly overlap.
I'd love some tips on how to structure this all more idiomatically, please let me know if I'm doing something in a silly way. Cheers! Thanks for your time!
You have a couple of problems. First, in Parsec, if a parser consumes any input and then fails, that's an error. So, when the parser:
anyChar <* notFollowedBy directive
fails (because the character is followed by a directive), it fails after anyChar has consumed input, and that generates an error immediately. Therefore, the parser:
let p1 = many1 (anyChar <* notFollowedBy directive)
will never succeed if it runs into a directive. For example:
parse p1 "" "okay" -- works
parse p1 "" "oops {{}}" -- will fail after consuming "oops "
You can fix this by inserting a try clause:
let p2 = many1 (try (anyChar <* notFollowedBy directive))
parse p2 "" "okay {{}}"
which yields Right "okay" and reveals the second problem. Parser p2 only consumes characters that aren't followed by a directive, so that excludes the space immediately before the directive, and you have no means in your parser to consume a character that is followed by a directive, so it gets stuck.
You actually want something like:
let p3 = many1 (notFollowedBy directive *> anyChar)
which first checks that, at the current position, we aren't looking at a directive before grabbing a character. No try clause is needed because if this fails, it fails without consuming input. (notFollowedBy never consumes input, as per the documentation.)
parse p3 "" "okay" -- returns Right "okay"
parse p3 "" "okay {{}}" -- returns Right "okay "
parse p3 "" "{{fails}}" -- correctly fails w/o consuming input
So, taking your original example with:
txt = Left <$> many1 (notFollowedBy directive *> anyChar)
should work fine.
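The difference between p2 and p3 is just where the lookahead happens. Here is a rough Python sketch of that difference. It does not model Parsec's consumed-input rules, and the names directive_at, text_like_p2 and text_like_p3 are invented for this illustration.

```python
def directive_at(src, pos):
    """True if a complete {{ ... }} directive starts at offset pos."""
    return src.startswith("{{", pos) and src.find("}}", pos + 2) != -1


def text_like_p2(src, pos=0):
    """Take a character only if no directive starts right AFTER it."""
    start = pos
    while pos < len(src) and not directive_at(src, pos + 1):
        pos += 1
    if pos == start:
        raise ValueError("expected at least one character")
    return src[start:pos]


def text_like_p3(src, pos=0):
    """Check that no directive starts HERE, then take the character."""
    start = pos
    while pos < len(src) and not directive_at(src, pos):
        pos += 1
    if pos == start:
        raise ValueError("expected at least one character")
    return src[start:pos]
```

On "okay {{}}" the first version returns "okay" and leaves the space stranded in front of the directive, while the second returns "okay " and stops exactly where the directive begins.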
replace-megaparsec is a library for doing search-and-replace with parsers. The search-and-replace function is streamEdit, which can find your {{ value }} patterns and then substitute in some other text. streamEdit is built from a generalized version of your template function called sepCap.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Char
import Data.Void

input = unlines
  [ "{{ value }}"
  , ""
  , "This is normal Text"
  , ""
  , "{{ expression }}"
  ]

directive :: Parsec Void String String
directive = do
  _ <- string "{{"
  txt <- manyTill anySingle (string "}}")
  return txt

editor k = fmap toUpper k

streamEdit directive editor input
VALUE
This is normal Text
EXPRESSION
many1 (anyChar <* notFollowedBy directive)
This parses only characters not followed by a directive.
{{ value }}
This is normal Text
{{ expression }}
When parsing the text in the middle, it will stop at the last t, leaving the newline before the directive unconsumed (because it's, well, a character followed by a directive), so the next iteration, you try to parse a directive and you fail. Then you retry txt on that newline, the parser expects it not to be followed by a directive, but it finds one, hence the error.

Using ParserResult

The example code below appears to work nicely:
open FParsec
let capitalized : Parser<unit,unit> = (asciiUpper >>. many asciiLower >>. eof)
let inverted : Parser<unit,unit> = (asciiLower >>. many asciiUpper >>. eof)
let capsOrInvert = choice [capitalized; inverted]
You can then do:
run capsOrInvert "Dog";;
run capsOrInvert "dOG";;
and get a success or:
run capsOrInvert "dog";;
and get a failure.
Now that I have a ParserResult, how do I do things with it? For example, print the string backwards?
There are several notable issues with your code.
First off, as noted in @scrwtp's answer, your parser returns unit. Here's why: the operator (>>.) returns only the result of the right-hand parser. On the other hand, (.>>) returns the result of the left-hand parser, while (.>>.) returns a tuple of both the left and right results.
So, parser1 >>. parser2 >>. eof is essentially (parser1 >>. parser2) >>. eof.
The code in parens completely ignores the result of parser1, and the second (>>.) then ignores the entire result of the parser in parens. Finally, eof returns unit, and this value is being returned.
You may need some meaningful data returned instead, e.g. the parsed string. The easiest way is:
let capitalized = (asciiUpper .>>. many asciiLower .>> eof)
Mind the operators.
The code for inverted can be done in a similar manner.
This parser would be of type Parser<(char * char list), unit>, a tuple of first character and all the remaining ones, so you may need to merge them back. There are several ways to do that, here's one:
let mymerge (c1: char, cs: char list) = c1 :: cs // a simple cons
let pCapitalized = capitalized |>> mymerge
The beauty of this code is that mymerge is a normal function working with normal chars; it knows nothing about parsers. It just works with the data, and the (|>>) operator does the rest.
Note that pCapitalized is also a parser, but it returns a single char list.
Nothing stops you from applying further transformations. As you mentioned printing the string backwards:
let pCapitalizedAndReversed =
    capitalized
    |>> mymerge
    |>> List.rev
I have written the code this way on purpose. On separate lines you see a gradual transformation of your domain data, still within the Parser paradigm. This is an important consideration, because a subsequent step may "decide" that the data is bad for some reason and fail the parse (a step like that returns a parser and is chained with (>>=) rather than (|>>)). Or, alternatively, it may be combined with another parser.
As soon as your domain data (a parsed-out word) is complete, you extract the result as mentioned in another answer.
A minor note: choice is superfluous for only two parsers; use (<|>) instead. From experience, choosing parser combinators carefully is important, because a wrong choice deep inside your core parser logic can easily make your parsers dramatically slower.
See FParsec Primitives for further details.
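If the operator soup is hard to picture, here is a toy model of the four operators in Python. It is only a sketch and not how FParsec is implemented: a parser here is a plain function from (text, pos) to (value, new_pos) or None, and the names both, keep_left, keep_right and pmap are made-up stand-ins for .>>., .>>, >>. and |>>.

```python
def satisfy(pred):
    """Parser for one character that satisfies pred."""
    def parser(text, pos):
        if pos < len(text) and pred(text[pos]):
            return text[pos], pos + 1
        return None
    return parser


def many(p):
    """Apply p zero or more times and collect the results in a list."""
    def parser(text, pos):
        values = []
        while True:
            result = p(text, pos)
            if result is None:
                return values, pos
            value, pos = result
            values.append(value)
    return parser


def eof(text, pos):
    """Succeed with a unit-like value only at end of input."""
    return ((), pos) if pos == len(text) else None


def pmap(p, f):  # stand-in for |>>
    def parser(text, pos):
        result = p(text, pos)
        return None if result is None else (f(result[0]), result[1])
    return parser


def both(p, q):  # stand-in for .>>.
    def parser(text, pos):
        first = p(text, pos)
        if first is None:
            return None
        second = q(text, first[1])
        if second is None:
            return None
        return (first[0], second[0]), second[1]
    return parser


def keep_left(p, q):  # stand-in for .>>
    return pmap(both(p, q), lambda pair: pair[0])


def keep_right(p, q):  # stand-in for >>.
    return pmap(both(p, q), lambda pair: pair[1])


upper = satisfy(lambda c: "A" <= c <= "Z")
lower = satisfy(lambda c: "a" <= c <= "z")

# Mirrors the question: every interesting result is thrown away.
unit_only = keep_right(keep_right(upper, many(lower)), eof)

# Mirrors the fix: keep both pieces, drop only eof, then merge with plain functions.
capitalized = pmap(keep_left(both(upper, many(lower)), eof),
                   lambda pair: pair[0] + "".join(pair[1]))
reversed_word = pmap(capitalized, lambda s: s[::-1])
```

Running unit_only on "Dog" succeeds but carries only the empty tuple, while capitalized yields "Dog" and reversed_word yields "goD", which is the same progression the answer walks through.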
ParserResult is a discriminated union. You simply match the Success and Failure cases.
let r = run capsOrInvert "Dog"
match r with
| Success(result, _, _) -> printfn "Success: %A" result
| Failure(errorMsg, _, _) -> printfn "Failure: %s" errorMsg
But this is probably not what you find tricky about your situation.
The thing about your Parser<unit, unit> type is that the parsed value is of type unit (the first type argument to Parser). What this means is that this parser doesn't really produce any sensible output for you to use - it can only tell you whether it can parse a string (in which case you get back a Success ((), _, _) - carrying the single value of type unit) or not.
What do you expect to get out of this parser?
Edit: This sounds close to what you want, or at least you should be able to pick up some pointers from it. capitalized accepts capitalized strings, inverted accepts capitalized strings that have been reversed and reverses them as part of the parser logic.
let reverse (s: string) =
    System.String(Array.rev (Array.ofSeq s))

let capitalized : Parser<string,unit> =
    (asciiUpper .>>. manyChars asciiLower)
    |>> fun (upper, lower) -> string upper + lower

let inverted : Parser<string,unit> =
    (manyChars asciiLower .>>. asciiUpper)
    |>> fun (lower, upper) -> reverse (lower + string upper)

let capsOrInvert = choice [capitalized; inverted]

run capsOrInvert "Dog"
run capsOrInvert "doG"
run capsOrInvert "dog"

Haskell -- parser combinators keywords

I am working on building a parser in Haskell using parser combinators. I have an issue with parsing keywords such as "while", "true", "if", etc.
So the issue I am facing is that after a keyword there is a requirement that there is a separator or whitespace, for example in the statement
if cond then stat1 else stat2 fi;x = 1
with this statement all keywords have either a space in front of them or a semi colon. However in different situations there can be different separators.
Currently I have implemented it as follows:
keyword :: String -> Parser String
keyword k = do
  kword <- leadingWS (string k)
  check (== ';') <|> check isSpace <|> check (== ',') <|> check (== ']')
  junk
  return kword
However, the problem with this keyword parser is that it will allow programs which have statements like if; cond then stat1 else stat2 fi.
We tried passing in a (Char -> Bool) to keyword, which would then be passed to check. But this wouldn’t work because where we parse the keyword we don’t know what kind of separator is allowed.
I was wondering if I could have some help with this issue?
Don't try to handle the separators in keyword. Instead, make sure that keyword "if" is not confused with the start of an identifier such as "iffy" (see the comment by sepp2k).
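keyword :: String -> Parser String
keyword k = leadingWS $ try $ do
  kw <- string k
  notFollowedBy alphanum  -- reject "iffy" when looking for "if"
  return kw               -- return the keyword, not the () from notFollowedBy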
Handling separators for statements would go like this:
statements = statement `sepBy` semi
statement = ifStatement <|> assignmentStatement <|> ...
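The "not followed by an alphanumeric" rule can be sketched outside any parser library. The Python helper below is hypothetical and only illustrates that rule; it returns the offset just past the keyword, or None when no keyword starts at that point.

```python
def keyword(word, text, pos=0):
    """Match `word` after optional whitespace, unless it runs into an identifier."""
    # Skip leading whitespace, like leadingWS in the answer.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    end = pos + len(word)
    if not text.startswith(word, pos):
        return None
    # Reject "iffy" when looking for "if": the next character must not be alphanumeric.
    if end < len(text) and text[end].isalnum():
        return None
    return end
```

Note that keyword("if", "if;x") still matches here. Whether a semicolon may follow is left to the statement grammar (the sepBy semi level), which is where "if; cond ..." gets rejected.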

Grouping lines with Parsec

I have a line-based text format I want to parse with Parsec†. A line either starts with a pound sign and specifies a key value pair separated by a colon or is a URL that is described by the previous tags.
Here's a short example:
#foo:bar
#faz:baz
https://example.com
#foo:beep
https://example.net
For simplicity's sake, I'm going to store everything as String. A Tag is a type Tag = (String, String), for example ("foo", "bar"). Ultimately, I'd like to group these as ([Tag], URL).
However, I struggle figuring out how to parse either [one or more tags] or [one URL].
My current approach looks like this:
import qualified System.Environment as Env
import qualified Text.Megaparsec as M
import qualified Text.Megaparsec.Text as M

type Tag = (String, String)

data Segment = Tags [Tag] | URL String
  deriving (Eq, Show)

tagP :: M.Parser Tag
tagP = M.char '#' *> ((,) <$> M.someTill M.printChar (M.char ':') <*> M.someTill M.printChar M.eol) M.<?> "Tag starting with #"

urlP :: M.Parser String
urlP = M.someTill M.printChar M.eol M.<?> "Some URL"

parser :: M.Parser Segment
parser = (Tags <$> M.many tagP) M.<|> (URL <$> urlP)

main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (parser <* M.eof) fname
  print res
If I try to run this on the above sample, I get a parsing error like this:
3:1:
unexpected 'h'
expecting Tag starting with # or end of input
Clearly my use of many in combination with <|> is incorrect. Since the tag parser won't consume any input from the URL parser it cannot be related to backtracking. How do I need to change this to get to the desired result?
The full example is available on GitHub.
† I'm actually using MegaParsec here for better error messages but I think the problem is quite generic and not about any particular implementation of parser combinators.
What you're doing works quite fine, only, at the moment you only parse a single segment (i.e., either only tags or only a URL), but that doesn't consume the whole input. It's eof that's causing the error.
Simply use one more many or some, to allow for multiple segments:
main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (M.many parser <* M.eof) fname
  print res
@cocreature answered this for me on Twitter.
As leftaroundabout pointed out here, there are two separate mistakes in my code:
The parser itself misuses <|> while it should just sequentially parse the lines and skip to the next parser if it doesn't consume any input.
The invocation (parseFromFile) only applies the parser function a single time and would fail as soon as it would get to the second block.
We can fix the parser and introduce grouping in one go:
-- liftA2 comes from Control.Applicative
parser :: M.Parser ([Tag], String)
parser = liftA2 (,) (M.many tagP) urlP
Afterwards, we just need to apply the change suggested by leftaroundabout:
...
res <- M.parseFromFile (M.many parser <* M.eof) fname
Running this leads to the desired result:
[([("foo","bar"),("faz","baz")],"https://example.com"),([("foo","beep")],"https://example.net")]
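The grouping itself ("zero or more tags, then exactly one URL, repeated") is easy to check by hand. Here is a plain Python sketch of it. It is not Megaparsec, and the function name group_lines and its error messages are invented for illustration.

```python
def group_lines(text):
    """Group '#key:value' tag lines with the URL line that follows them."""
    entries, tags = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            key, sep, value = line[1:].partition(":")
            if not sep or not key or not value:
                raise ValueError(f"malformed tag line: {line!r}")
            tags.append((key, value))
        elif line:
            # A URL closes the current group, like `liftA2 (,) (many tagP) urlP`.
            entries.append((tags, line))
            tags = []
        else:
            raise ValueError("unexpected empty line")
    if tags:
        raise ValueError("tags at end of input without a URL")
    return entries
```

Applied to the sample file from the question, this produces the same nested structure as the Haskell output above, with Python tuples and lists in place of Haskell ones.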
