Skip whitespace and comments with FParsec - f#

I try to skip any whitespace or comment while parsing a programming language.
There are two types of comment I want to skip:
Line comment: ;; skip rest of line
Block comment: (; skip anything between ;)
Example code to parse with comments and whitespaces:
(type (; block comment ;) (func))
(import "env" "g" (global $g (mut i32)))
(func (type 0) ;; line comment
i32.const 100
global.set $g)
(export "f" (func 0))
I tried multiple approaches but the parser always breaks somewhere. My idea goes like this:
let comment : Parser<unit, Ctx> =
let lineComment = skipString ";;" >>. skipRestOfLine true
let blockComment = between (skipString "(;") (skipString ";)") (skipMany anyChar)
spaces >>. lineComment <|> blockComment
let wsOrComment = attempt comment <|> spaces
I would like the comments to be ignored completely like the spaces are. Any ideas how to accomplish that? (It's my first project with FParsec)

Based on the suggestion by Koenig Lear, I filtered all comments with an regex before running the text through the parser. This is maybe not the nicest option, but it does the job reliable with only two lines of code.
let removeComments s =
let regex = Regex(#"\(;.*;\)|;;.*")
regex.Replace(s, String.Empty)
let input = """
(type (; block comment ;) (func))
(import "env" "g" (global $g (mut i32)))
(func (type 0) ;; line comment
i32.const 100
global.set $g)
(export "f" (func 0))
"""
let filtered = removeComments input
// parse "filtered" with FParsec

Related

FParsec match string which have one of 2 patterns

I'm trying to learn FParsec and am trying to match strings which follow on of two patterns.
The string can either be an ordanary string like "string" or it can be a string with one dot in it, like "st.ring".
The parser should look like this: Parser<(string Option * string),unit>. The first string is optional depending of if the string is splitted by a dot or not. The optional string represent the part of the string which is before the ".".
I have tried a few different things but I feel this attempt was the closes:
let charstilldot = manyCharsTill anyChar (pstring ".")
let parser = opt(charstilldot) .>>. (many1Chars anyChar)
This works with input like this "st.ring" but not "string" since not dot exists in the latter.
I would verry much appriciate some help, thank you!
EDIT:
I have solution which basicly parse the arguments in order and swap the arguments depending of their is a dot or not in the string
let colTargetWithoutDot : Parser<string Option,unit> = spaces |>> fun _ -> None
let colTargetWithDot = (pstring "." >>. alphastring) |>> Some
let specificColumn = alphastring .>>. (colTargetWithDot <|> colTargetWithoutDot) |>> (fun (h,t) ->
match h,t with
| h,None -> (None,h)
| h,Some(t) -> (Some(h),t))
However this is not pretty so I would still appriciate another solution!
I think the main problem here is that charstilldot consumes characters even when it fails. In that situation, many1chars then fails because the entire input has already been consumed. The easiest way to address this is by using attempt to rollback when there is no dot:
let charstilldot = attempt (manyCharsTill anyChar (pstring "."))
let parser = opt(charstilldot) .>>. (many1Chars anyChar)
Result:
"str.ing" -> (Some "str", "ing")
"string" -> (None, "string")
I think there are other good solutions as well, but I've tried to give you one that requires the least change to your current code.

How to fail a nested megaparsec parser?

I am stuck at the following parsing problem:
Parse some text string that may contain zero or more elements from a limited character set, up to but not including one of a set of termination characters. Content/no content should be indicated through Maybe. Termination characters may appear in the string in escaped form. Parsing should fail on any inadmissible character.
This is what I came up with (simplified):
import qualified Text.Megaparsec as MP
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
...
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
...
-- The escape character.
escChar :: Char
...
pComponent :: Parser (Maybe Text)
pComponent = do
t <- MP.many (escaped <|> regular)
if null t then return Nothing else return $ Just (T.pack t)
where
regular = MP.satisfy isAdmissibleChar <|> fail "Inadmissible character"
escaped = do
_ <- MC.char escChar
MP.satisfy isControlChar -- only control characters may be escaped
Say, admissible characters are uppercase ASCII, escape is '\', and control is ':'.
Then, the following parses correctly: ABC\:D:EF to yield ABC:D.
However, parsing ABC&D, where & is inadmissible, does yield ABC whereas I would expect an error message instead.
Two questions:
Why does fail end parsing instead of failing the parser?
Is the above approach sensible to approach the problem, or is there a "proper", canonical way to parse such terminated strings that I am not aware of?
many has to allow its sub-parser to fail once without the whole parse
failing - for example many (char 'A') *> char 'B', while parsing
"AAAB", has to fail to parse the B to know it got to the end of the
As.
You might want manyTill which allows you to recognise the terminator
explicitly. Something like this:
MP.manyTill (escaped <|> regular) (MP.satisfy isControlChar)
"ABC&D" would give an error here assuming '&' isn't accepted by isControlChar.
Or if you want to parse more than one component you might keep your
existing definition of pComponent and use it with sepBy or similar, like:
MP.sepBy pComponent (MP.satisfy isControlChar)
If you also check for end-of-file after this, like:
MP.sepBy pComponent (MP.satisfy isControlChar) <* MP.eof
then "ABC&D" should give an error again, because the '&' will end the first component but will not be accepted as a separator.
What a parser object normally does is to extract from the input stream whatever subset it is supposed to accept. That's the usual rule.
Here, it seems you want the parser to accept strings that are followed by something specific. From your examples, it is either end of file (eof) or character ':'. So you might want to consider look ahead.
Environment and auxiliary functions:
import Data.Void (Void)
import qualified Data.Text as T
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC
type Parser = MP.Parsec Void T.Text
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
isAdmissibleChar ch = elem ch ['A' .. 'Z']
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
isControlChar ch = elem ch ":"
-- The escape character:
escChar :: Char
escChar = '\\'
Termination parser, to be used for look ahead:
termination :: Parser ()
termination = MP.eof MP.<|> do
_ <- MP.satisfy isControlChar
return ()
Modified pComponent parser:
pComponent :: Parser (Maybe T.Text)
pComponent = do
txt <- MP.many (escaped MP.<|> regular)
MP.lookAhead termination -- **CHANGE HERE**
if (null txt) then (return Nothing) else (return $ Just (T.pack txt))
where
regular = (MP.satisfy isAdmissibleChar) MP.<|> (fail "Inadmissible character")
escaped = do
_ <- MC.char escChar
MP.satisfy isControlChar -- only control characters may be escaped
Testing utility:
tryParse :: String -> IO ()
tryParse str = do
let res = MP.parse pComponent "(noname)" (T.pack str)
putStrLn $ (show res)
Let's try to rerun your examples:
$ ghci
λ>
λ> :load q67809465.hs
λ>
λ> str1 = "ABC\\:D:EF"
λ> putStrLn str1
ABC\:D:EF
λ>
λ> tryParse str1
Right (Just "ABC:D")
λ>
So that is successful, as desired.
λ>
λ> tryParse "ABC&D"
Left (ParseErrorBundle {bundleErrors = TrivialError 3 (Just (Tokens ('&' :| ""))) (fromList [EndOfInput]) :| [], bundlePosState = PosState {pstateInput = "ABC&D", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "(noname)", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}})
λ>
So that fails, as desired.
Trying our 2 acceptable termination contexts:
λ> tryParse "ABC:&D"
Right (Just "ABC")
λ>
λ>
λ> tryParse "ABCDEF"
Right (Just "ABCDEF")
λ>
fail does not end parsing in general. It just continues with the next alternative. In this case it selects the empty list alternative introduced by the many combinator, so it stops parsing without an error message.
I think the best way to solve your problem is to specify that the input must end in a termination character, that means that it cannot "succeed" halfway like this. You can do that with the notFollowedBy or lookAhead combinators. Here is the relevant part of the megaparsec tutorial.

How to add a condition that a parsed number must satisfy in FParsec?

I am trying to parse an int32 with FParsec but have an additional restriction that the number must be less than some maximum value. Is their a way to perform this without writing my own custom parser (as below) and/or is my custom parser (below) the appropriate way of achieving the requirements.
I ask because most of the built-in library functions seem to revolve around a char satisfying certain predicates and not any other type.
let pRow: Parser<int> =
let error = messageError ("int parsed larger than maxRows")
let mutable res = Reply(Error, error)
fun stream ->
let reply = pint32 stream
if reply.Status = Ok && reply.Result <= 1000000 then
res <- reply
res
UPDATE
Below is an attempt at a more fitting FParsec solution based on the direction given in the comment below:
let pRow2: Parser<int> =
pint32 >>= (fun x -> if x <= 1048576 then (preturn x) else fail "int parsed larger than maxRows")
Is this the correct way to do it?
You've done an excellent research and almost answered your own question.
Generally, there are two approaches:
Unconditionally parse out an int and let the further code to check it for validity;
Use a guard rule bound to the parser. In this case (>>=) is the right tool;
In order to make a good choice, ask yourself whether an integer that failed to pass the guard rule has to "give another chance" by triggering another parser?
Here's what I mean. Usually, in real-life projects, parsers are combined in some chains. If one parser fails, the following one is attempted. For example, in this question, some programming language is parsed, so it needs something like:
let pContent =
pLineComment <|> pOperator <|> pNumeral <|> pKeyword <|> pIdentifier
Theoretically, your DSL may need to differentiate a "small int value" from another type:
/// The resulting type, or DSL
type Output =
| SmallValue of int
| LargeValueAndString of int * string
| Comment of string
let pSmallValue =
pint32 >>= (fun x -> if x <= 1048576 then (preturn x) else fail "int parsed larger than maxRows")
|>> SmallValue
let pLargeValueAndString =
pint32 .>> ws .>>. (manyTill ws)
|>> LargeValueAndString
let pComment =
manyTill ws
|>> Comment
let pCombined =
[ pSmallValue; pLargeValueAndString; pComment]
|> List.map attempt // each parser is optional
|> choice // on each iteration, one of the parsers must succeed
|> many // a loop
Built this way, pCombined will return:
"42 ABC" gets parsed as [ SmallValue 42 ; Comment "ABC" ]
"1234567 ABC" gets parsed as [ LargeValueAndString(1234567, "ABC") ]
As we see, the guard rule impacts how the parsers are applied, so the guard rule has to be within the parsing process.
If, however, you don't need such complication (e.g., an int is parsed unconditionally), your first snippet is just fine.

how parse the between of when the right could come after a repeating pattern?

How would you use existing FParsec functionality to find a repeating consecutive pattern in the rightmost tag?
It's a legitimate possibility in this context. Pre-parsing + escaping might work, but is there a better solution? Do we need to write a new forward combinator, and if so, what does it look like?
#r"""bin\debug\FParsecCS.dll"""
#r"""bin\debug\FParsec.dll"""
open FParsec
let str = pstring
let phraseEscape = pchar '\\' >>. pchar '"'
let phraseChar = phraseEscape <|> (noneOf "|\"\r\n]") // <- this right square bracket needs to be removed
let phrase = manyChars phraseChar
let wrapped = between (str"[[") (str"]]".>>newline) phrase
run wrapped "[[some text]]\n" // <- works fine
// !! problem
run wrapped "[[array[] d]]\n" // <- that means we can't make ']' invalid in phraseChar
// !! problem
run wrapped "[[array[]]]\n" // <- and this means that the first ]] gets match leaving a floating one to break the parser
Sorry to be answering my own question, but...
See composable function phraseTill, and the pend parser that is passed to it of (notFollowedBy(s"]]]")>>.(s"]]"))
#r"""bin\debug\FParsecCS.dll"""
#r"""bin\debug\FParsec.dll"""
open FParsec
let s = pstring
let phraseChar = (noneOf "\r\n")
let phrase = manyChars phraseChar
/// keep eating characters until the pend parser is successful
let phraseTill pend = manyCharsTill phraseChar pend
/// when not followed by tipple, a double will truly be the end
let repeatedTo repeatedPtrn ptrn = notFollowedBy(s repeatedPtrn)>>.(s ptrn)
let wrapped = (s"[[")>>.phraseTill (repeatedTo "]]]" "]]")
run wrapped "[[some text]]]"
run wrapped "[[some text]]"
NB. if you try this out in FSharp Interactive (FSI), make sure you have at least one "run wrapped" line when you send your text to FSI to be evaluated (ie. right-click 'Execute In Interactive'). The type only gets inferred / pinned on application in this example. We could have provided explicit definitions at the risk of being more verbose.

Whitespace sensitive FParsec

I'm trying to implement a whitespace sensitive parser using FParsec, and I'm starting off with the baby step of defining a function which will parse lines of text that start with n chars of whitespace.
Here's what I have so far:
let test: Parser<string list,int>
= let manyNSatisfy i p = manyMinMaxSatisfy i i p
let p = fun (stream:CharStream<int>) ->
let state = stream.UserState
// Should fail softly if `state` chars wasn't parsed
let result = attempt <| manyNSatisfy state (System.Char.IsWhiteSpace) <| stream
if result.Status <> Ok
then result
else restOfLine false <| stream
sepBy p newline
My issue is that when I run
runParserOnString test 1 "test" " hi\n there\nyou" |> printfn "%A"
I get an error on "you". I was under the impression that attempt would backtrack any state changes, and returning Error as my status would give me soft failure.
How do I get ["hi"; "there"] back from my parser?
Oh dear, how embarrassing.
I wanted sepEndBy, which is to say that I should terminate the parse on the separator.
This looks more idiomatic. I have hard-coded 1, but it's easy to extract as parameter.
let skipManyNSatisfy i = skipManyMinMaxSatisfy i i
let pMyText =
( // 1st rule
skipManyNSatisfy 1 System.Char.IsWhiteSpace // skip an arbitrary # of WhiteSpaces
>>. restOfLine false |>> Some // return the rest as Option
)
<|> // If the 1st rule failed...
( // 2nd rule
skipRestOfLine false // skip till the end of the line
>>. preturn None // no result
)
|> sepBy <| newline // Wrap both rules, separated by newLine
|>> Seq.choose id // Out of received string option seq, select only Some()

Resources