How to parse comments with FParsec

I'm attempting to parse Lisp-style comments from an s-expression language with FParsec. I got some help with parsing single-line comments in a previous thread, "How to convert an FParsec parser to parse whitespace".
While that was resolved, I still need to parse multiline comments. Here's the current code:
/// Read whitespace character as a string.
let spaceAsStr = anyOf whitespaceChars |>> fun chr -> string chr
/// Read a line comment.
let lineComment = pchar lineCommentChar >>. restOfLine true
/// Read a multiline comment.
/// TODO: make multiline comments nest.
let multilineComment =
    between
        (pstring openMultilineCommentStr)
        (pstring closeMultilineCommentStr)
        (charsTillString closeMultilineCommentStr true System.Int32.MaxValue)
/// Read whitespace text.
let whitespace =
    lineComment <|>
    multilineComment <|>
    spaceAsStr
/// Skip any white space characters.
let skipWhitespace = skipMany whitespace
/// Skip at least one white space character.
let skipWhitespace1 = skipMany1 whitespace
Unfortunately, the multilineComment parser never succeeds. Since this is a combinator, I can't set breakpoints or step through it to see why it fails.
Any ideas why it won't work?

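Try changing the bool argument of charsTillString (the skipString flag that follows closeMultilineCommentStr) to false:
    (charsTillString closeMultilineCommentStr false System.Int32.MaxValue)
Otherwise charsTillString skips over the closeMultilineCommentStr string itself, leaving nothing for the closing parser of between to match.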
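As a minimal sketch of the difference, the script below runs both variants on the same input. The delimiter values "#|" and "|#" and the helper name succeeds are assumptions for illustration; the question does not show the actual delimiters.

```fsharp
#r "nuget: FParsec"
open FParsec

// Assumed delimiters; the original post does not show their values.
let openStr, closeStr = "#|", "|#"

// skipString = true: charsTillString consumes the closing delimiter itself,
// so the closing parser passed to `between` has nothing left to match.
let broken : Parser<string, unit> =
    between (pstring openStr) (pstring closeStr)
            (charsTillString closeStr true System.Int32.MaxValue)

// skipString = false: charsTillString stops just before the closing delimiter.
let fixedComment : Parser<string, unit> =
    between (pstring openStr) (pstring closeStr)
            (charsTillString closeStr false System.Int32.MaxValue)

// Hypothetical helper: true if the parser succeeds on the input.
let succeeds p input =
    match run p input with
    | Success _ -> true
    | Failure _ -> false

printfn "%b" (succeeds broken "#| hi |#")        // false
printfn "%b" (succeeds fixedComment "#| hi |#")  // true
```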
To make it work with nested comments:
let rec multilineComment o =
    let ign x = charsTillString x false System.Int32.MaxValue
    between
        (pstring openMultilineCommentStr)
        (pstring closeMultilineCommentStr)
        (attempt (ign openMultilineCommentStr >>. multilineComment >>. ign closeMultilineCommentStr) <|>
         ign closeMultilineCommentStr) <| o
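An alternative way to get nesting, sketched here rather than taken from the answer, is to define the comment parser through createParserForwardedToRef and consume the body one piece at a time, so that any number of nested comments can appear at each level. The delimiters "#|" and "|#" and all names below are assumptions for illustration.

```fsharp
#r "nuget: FParsec"
open FParsec

// Assumed delimiters; the original post does not show their values.
let openStr = "#|"
let closeStr = "|#"

// Forward declaration so the comment parser can refer to itself.
let nestedComment, nestedCommentRef = createParserForwardedToRef<unit, unit>()

// One piece of a comment body: either a whole nested comment,
// or a single char that does not begin the closing delimiter.
let commentBody =
    nestedComment
    <|> (notFollowedByString closeStr >>. skipAnyChar)

nestedCommentRef.Value <-
    skipString openStr >>. skipMany commentBody .>> skipString closeStr

// Hypothetical helper: true if the parser succeeds on the input.
let succeeds p input =
    match run p input with
    | Success _ -> true
    | Failure _ -> false

printfn "%b" (succeeds nestedComment "#| a #| b |# c |#")  // true
printfn "%b" (succeeds nestedComment "#| a #| b |# c")     // false: outer comment never closes
```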

Related

How to parse seq of words separated by double spaces using fparsec?

Given the input:
alpha beta gamma  one two three
How could I parse this into the below?
[["alpha"; "beta"; "gamma"]; ["one"; "two"; "three"]]
I can write this when there is a better separator (e.g. __), as then
sepBy (sepBy word (pchar ' ')) (pstring "__")
works, but in the case of double space, the pchar in the first sepBy consumes the first space and then the parser fails.

The FParsec manual says that in sepBy p sep, if sep succeeds and the subsequent p fails (without changing the state), the entire sepBy fails, too. Hence, your goal is:
to make the separator fail if it encounters more than a single space char;
to backtrack so that the "inner" sepBy loop ends cleanly and passes control to the "outer" sepBy loop.
Here's how to do both:
// this is your word parser; it can be different of course,
// I just made it as simple as possible;
let pWord = many1Satisfy isAsciiLetter
// this is the Inner separator to separate individual words
let pSepInner =
    pchar ' '
    .>> notFollowedBy (pchar ' ') // guard rule to prevent 2nd space
    |> attempt                    // a wrapper that fails NON-fatally
// this is the Outer separator
let pSepOuter =
    pchar ' '
    |> many1                      // loop over 1+ spaces
// this is the parser that would return String list list
let pMain =
    pWord
    |> sepBy <| pSepInner         // the Inner loop
    |> sepBy <| pSepOuter         // the Outer loop
Use:
run pMain "alpha beta gamma  one two three"
Success: [["alpha"; "beta"; "gamma"]; ["one"; "two"; "three"]]

I'd recommend replacing sepBy word (pchar ' ') with something like this:
// attempt lets pOneSpace backtrack when a second space follows,
// so the inner sepBy can end without failing
let pOneSpace = attempt (pchar ' ' .>> notFollowedBy (pchar ' '))
let pTwoSpaces = pstring "  "
// Or if two spaces are allowed as separators but *not* three spaces...
let pTwoSpaces = pstring "  " .>> notFollowedBy (pchar ' ')
sepBy (sepBy word pOneSpace) pTwoSpaces
Note: not tested (since I don't have time at the moment), just typed into the answer box. So test it in case I made a mistake somewhere.

With FParsec, how does one use the manyCharsTill and between parsers and not fail on the closing string?

I'm trying to use FParsec to parse a TOML multi-line string, and I'm having trouble with the closing delimiter ("""). I have the following parsers:
let controlChars =
    ['\u0000'; '\u0001'; '\u0002'; '\u0003'; '\u0004'; '\u0005'; '\u0006'; '\u0007';
     '\u0008'; '\u0009'; '\u000a'; '\u000b'; '\u000c'; '\u000d'; '\u000e'; '\u000f';
     '\u0010'; '\u0011'; '\u0012'; '\u0013'; '\u0014'; '\u0015'; '\u0016'; '\u0017';
     '\u0018'; '\u0019'; '\u001a'; '\u001b'; '\u001c'; '\u001d'; '\u001e'; '\u001f';
     '\u007f']
let nonSpaceCtrlChars =
    Set.difference (Set.ofList controlChars) (Set.ofList ['\n';'\r';'\t'])
let multiLineStringContents : Parser<char,unit> =
    satisfy (isNoneOf nonSpaceCtrlChars)
let multiLineString : Parser<string,unit> =
    optional newline >>. manyCharsTill multiLineStringContents (pstring "\"\"\"")
    |> between (pstring "\"\"\"") (pstring "\"\"\"")
let test parser str =
    match run parser str with
    | Success (s1, s2, s3) -> printfn "Ok: %A %A %A" s1 s2 s3
    | Failure (f1, f2, f3) -> printfn "Fail: %A %A %A" f1 f2 f3
When I test multiLineString against an input like this:
test multiLineString "\"\"\"x\"\"\""
The parser fails with this error:
Fail: "Error in Ln: 1 Col: 8
"""x"""
       ^
Note: The error occurred at the end of the input stream.
Expecting: '"""'
I'm confused by this. Wouldn't the manyCharsTill multiLineStringContents (pstring "\"\"\"") parser stop at the """ for the between parser to find it? Why is the parser eating all the input and then failing the between parser?
This seems like a relevant post: How to parse comments with FParsec
But I don't see how the solution to that one differs from what I'm doing here, really.

The manyCharsTill documentation says (emphasis mine):
manyCharsTill cp endp parses chars with the char parser cp until the parser endp succeeds. It stops *after* endp and returns the parsed chars as a string.
So the closing """ has already been consumed by endp by the time between looks for it. You don't want to use between in combination with manyCharsTill; you want to do something like pstring "\"\"\"" >>. manyCharsTill multiLineStringContents (pstring "\"\"\"").
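A minimal runnable sketch of that suggestion follows. To keep it short it uses anyChar in place of the question's control-character filter, which is a simplification and not part of the original code; the name tripleQuote is also just for illustration.

```fsharp
#r "nuget: FParsec"
open FParsec

let tripleQuote : Parser<string, unit> = pstring "\"\"\""

// manyCharsTill tries the end parser before each char and consumes it on success,
// so no separate closing parser (and no `between`) is needed.
// Simplification: anyChar stands in for the question's multiLineStringContents.
let multiLineString : Parser<string, unit> =
    tripleQuote >>. optional newline >>. manyCharsTill anyChar tripleQuote

match run multiLineString "\"\"\"x\"\"\"" with
| Success (s, _, _) -> printfn "Ok: %A" s      // Ok: "x"
| Failure (msg, _, _) -> printfn "Fail: %s" msg
```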
But as it happens, I can save you a lot of work. I've been working on a TOML parser with FParsec myself in my spare time. It's far from complete, but the string part works and handles backslash escapes correctly (as far as I can tell: I've tested thoroughly but not exhaustively). The only thing I'm missing is the "strip first newline if it appears right after the opening delimiter" rule, which you've handled with optional newline. So just add that bit into my code below and you should have a working TOML string parser.
BTW, I am planning to license my code (if I finish it) under the MIT license. So I hereby release the following code block under the MIT license. Feel free to use it in your project if it's useful to you.
// Note: `exactly` is not defined in this snippet; it appears to be a helper
// that parses exactly n chars satisfying the given predicate.
let pShortCodepointInHex = // Anything from 0000 to FFFF, *except* the range D800-DFFF
    (anyOf "dD" >>. (anyOf "01234567" <?> "a Unicode scalar value (range D800-DFFF not allowed)") .>>. exactly 2 isHex |>> fun (c,s) -> sprintf "d%c%s" c s)
    <|> (exactly 4 isHex <?> "a Unicode scalar value")
let pLongCodepointInHex = // Anything from 00000000 to 0010FFFF, *except* the range D800-DFFF
    (pstring "0000" >>. pShortCodepointInHex)
    <|> (pstring "000" >>. exactly 5 isHex)
    <|> (pstring "0010" >>. exactly 4 isHex |>> fun s -> "0010" + s)
    <?> "a Unicode scalar value (i.e., in range 00000000 to 0010FFFF)"
let toCharOrSurrogatePair p =
    p |> withSkippedString (fun codePoint _ -> System.Int32.Parse(codePoint, System.Globalization.NumberStyles.HexNumber) |> System.Char.ConvertFromUtf32)
let pStandardBackslashEscape =
    anyOf "\\\"bfnrt"
    |>> function
        | 'b' -> "\b" // U+0008 BACKSPACE
        | 'f' -> "\u000c" // U+000C FORM FEED
        | 'n' -> "\n" // U+000A LINE FEED
        | 'r' -> "\r" // U+000D CARRIAGE RETURN
        | 't' -> "\t" // U+0009 CHARACTER TABULATION a.k.a. Tab or Horizontal Tab
        | c -> string c
let pUnicodeEscape = (pchar 'u' >>. (pShortCodepointInHex |> toCharOrSurrogatePair))
                 <|> (pchar 'U' >>. ( pLongCodepointInHex |> toCharOrSurrogatePair))
let pEscapedChar = pstring "\\" >>. (pStandardBackslashEscape <|> pUnicodeEscape)
let quote = pchar '"'
let isBasicStrChar c = c <> '\\' && c <> '"' && c > '\u001f' && c <> '\u007f'
let pBasicStrChars = manySatisfy isBasicStrChar
let pBasicStr = stringsSepBy pBasicStrChars pEscapedChar |> between quote quote
let pEscapedNewline = skipChar '\\' .>> skipNewline .>> spaces
let isMultilineStrChar c = c = '\n' || isBasicStrChar c
let pMultilineStrChars = manySatisfy isMultilineStrChar
let pTripleQuote = pstring "\"\"\""
let pMultilineStr = stringsSepBy pMultilineStrChars (pEscapedChar <|> (notFollowedByString "\"\"\"" >>. pstring "\"")) |> between pTripleQuote pTripleQuote

@rmunn provided a correct answer, thanks! I also solved this in a slightly different way after playing with the FParsec API a bit more. As explained in the other answer, the endp argument to manyCharsTill was eating the closing """, so I needed to switch to something that wouldn't do that. A simple modification using lookAhead did the trick:
let multiLineString : Parser<string,unit> =
    optional newline >>. manyCharsTill multiLineStringContents (lookAhead (pstring "\"\"\""))
    |> between (pstring "\"\"\"") (pstring "\"\"\"")

Parsing in to a recursive data structure

I wish to parse a string into a recursive data structure using F#. In this question I'm going to present a simplified example that cuts to the core of what I want to do.
I want to parse a string of nested square brackets into the union type:
type Bracket = | Bracket of Bracket option
So:
"[]" -> Bracket None
"[[]]" -> Bracket ( Some ( Bracket None) )
"[[[]]]" -> Bracket ( Some ( Bracket ( Some ( Bracket None) ) ) )
I would like to do this using the parser combinators in the FParsec library. Here is what I have so far:
let tryP parser =
    parser |>> Some
    <|>
    preturn None
/// Parses up to nesting level of 3
let parseBrakets : Parser<_> =
    let mostInnerLevelBracket =
        pchar '['
        .>> pchar ']'
        |>> fun _ -> Bracket None
    let secondLevelBracket =
        pchar '['
        >>. tryP mostInnerLevelBracket
        .>> pchar ']'
        |>> Bracket
    let firstLevelBracket =
        pchar '['
        >>. tryP secondLevelBracket
        .>> pchar ']'
        |>> Bracket
    firstLevelBracket
I even have some Expecto tests:
open Expecto
[<Tests>]
let parserTests =
    [ "[]", Bracket None
      "[[]]", Bracket (Some (Bracket None))
      "[[[]]]", Bracket ( Some (Bracket (Some (Bracket None)))) ]
    |> List.map(fun (str, expected) ->
        str
        |> sprintf "Trying to parse %s"
        |> testCase
        <| fun _ ->
            match run parseBrakets str with
            | Success (x, _,_) -> Expect.equal x expected "These should have been equal"
            | Failure (m, _,_) -> failwithf "Expected a match: %s" m
        )
    |> testList "Bracket tests"
let tests =
    [ parserTests ]
    |> testList "Tests"
runTests defaultConfig tests
The problem is of course how to handle an arbitrary level of nesting - the code above only works for up to 3 levels. The code I would like to write is:
let rec pNestedBracket =
    pchar '['
    >>. tryP pNestedBracket
    .>> pchar ']'
    |>> Bracket
But F# doesn't allow this.
Am I barking up the wrong tree completely with how to solve this (I understand that there are easier ways to solve this particular problem)?

You are looking for FParsec's createParserForwardedToRef function. Because parsers are values and not functions, a parser cannot refer to itself (or to a parser defined later) in its own definition. To build self-recursive or mutually recursive parsers you have to, in a sense, declare a parser before you define it.
Your final code will end up looking something like this:
let bracketParser, bracketParserRef = createParserForwardedToRef<Bracket, unit>()
bracketParserRef := ... // here you can finally define your parser
// you can reference bracketParser, which is a parser that forwards to bracketParserRef
Also I would recommend this article for basic understanding of parser combinators. https://fsharpforfunandprofit.com/posts/understanding-parser-combinators/. The final section on a JSON parser talks about the createParserForwardedToRef method.
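Applied to the Bracket type from the question, the idea might look like the following sketch. The names pBracket and pBracketRef are illustrative, the user state is assumed to be unit, and opt plays the role of the question's tryP helper.

```fsharp
#r "nuget: FParsec"
open FParsec

type Bracket = Bracket of Bracket option

// Declare the parser first; pBracket forwards every call to whatever pBracketRef holds.
let pBracket, pBracketRef = createParserForwardedToRef<Bracket, unit>()

// Now define it, referring to pBracket recursively.
// `opt` yields Some on success and None when the inner parser fails without consuming input.
pBracketRef.Value <-
    between (pchar '[') (pchar ']') (opt pBracket) |>> Bracket

match run pBracket "[[[]]]" with
| Success (result, _, _) -> printfn "%A" result
| Failure (msg, _, _) -> printfn "%s" msg
```

Assigning through .Value does the same job as the := operator used above.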

As an example of how to use createParserForwardedToRef, here's a snippet from a small parser I wrote recently. It parses lists of space-separated integers between brackets (and the lists can be nested), and the "integers" can be small arithmetic expressions like 1+2 or 3*5.
type ListItem =
| Int of int
| List of ListItem list
let pexpr = // ... omitted for brevity
let plist,plistImpl = createParserForwardedToRef()
let pListContents = (many1 (plist |>> List .>> spaces)) <|>
                    (many (pexpr |>> Int .>> spaces))
plistImpl := pchar '[' >>. spaces
             >>. pListContents
             .>> pchar ']'
P.S. I would have put this as a comment to Thomas Devries's answer, but a comment can't contain nicely-formatted code. Go ahead and accept his answer; mine is just intended to flesh his out.

how do I handle an optional trailing comma?

given the following
let maxCount = System.Int32.MaxValue
let pmlcomment = pstring "/*" >>. skipCharsTillString "*/" true (maxCount)
let ws = pspaces >>. many (pspaces >>. pmlcomment .>> pspaces) |>> (function | [] -> () | _ -> ())
let str_ws s = pstring s .>> ws
let exprBraceSeqOpt p =
    let trailingComma = (str_ws "," |>> fun _ -> None )
    between (str_ws "{") (str_ws "}") ((opt (attempt (sepBy p (str_ws ",")))) <|> (attempt trailingComma))
let sampleP = exprBraceSeqOpt (str_ws "x")
it properly matches all of the following except the last one:
["{}";"{x}";"{x,x}";"{x,x,}"]
I'm guessing something is altering the state or something.
How do I handle an optional trailing comma in fparsec?

sepBy "eats up" the extra separator if it is present. That's just how it works, period. You cannot hack it by applying attempt in various places: if you apply attempt to the separator, it won't help, because the last separator actually succeeds, so attempt will have no effect. And applying attempt to the whole sepBy will also not help, because then the whole sepBy will be rolled back, not just the last separator. And applying attempt to the "x" parser itself, while achieving the desired trailing comma behavior, would also have the adverse effect of making the parser accept multiple commas in a row.
And because it is impossible to achieve the desired result via clever use of combinators, there is actually a special function for doing just what you're after - sepEndBy.
This would work as desired:
let exprBraceSeqOpt p =
    between (str_ws "{") (str_ws "}") (sepEndBy p (str_ws ","))
Also, as an aside, I should point out that function | [] -> () | _ -> () is a remarkably elaborate way to do ignore. :-)
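To check the behaviour, here is a small self-contained sketch that runs the sepEndBy version against the four inputs from the question plus a doubled comma. It simplifies ws to plain spaces (no comment skipping) and appends eof; both are assumptions made to keep the example short.

```fsharp
#r "nuget: FParsec"
open FParsec

// Simplification: whitespace only, without the question's /* ... */ comment handling.
let ws : Parser<unit, unit> = spaces
let str_ws s = pstring s .>> ws

let exprBraceSeqOpt p =
    between (str_ws "{") (str_ws "}") (sepEndBy p (str_ws ","))

let sampleP = exprBraceSeqOpt (str_ws "x") .>> eof

// Hypothetical helper: true if the parser succeeds on the input.
let succeeds p input =
    match run p input with
    | Success _ -> true
    | Failure _ -> false

for input in ["{}"; "{x}"; "{x,x}"; "{x,x,}"; "{x,,x}"] do
    printfn "%-8s %b" input (succeeds sampleP input)
```

The first four inputs parse and the last one is rejected: sepEndBy allows a single optional trailing separator, not several separators in a row.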

Advice on FParsec for handling whitespace

I have the following subexpression to parse 'quotes', which have the following format:
"5.75 # 5.95"
I therefore have this FParsec expression to parse it:
let pquote x = (sepBy (pfloat) ((spaces .>> (pchar '/' <|> pchar '#' )>>. spaces))) x
It works fine, except when there is a trailing space in my input, because the separator expression starts to consume content. So I wrapped it in an attempt, which works and seems, from what I understand, to be more or less what attempt was meant for.
let pquote x = (sepBy (pfloat) (attempt (spaces .>> (pchar '/' <|> pchar '#' )>>. spaces))) x
As I don't know FParsec very well, I wonder if there is a better way to write this. It seems a bit heavy (while still being very manageable, of course).

let s1 = "5.75 # 5.95 "
let s2 = "5.75/5.95 "
let pquote: Parser<_> =
    pfloat
    .>> spaces .>> skipAnyOf ['#'; '/'] .>> spaces
    .>>. pfloat
    .>> spaces
Notes:
spaces skips any sequence of zero or more whitespace characters, so there's no need to use opt (thanks @Daniel; an earlier version of this answer made the spaces optional everywhere);
type Parser<'t> = Parser<'t, UserState> - I define it this way in order to avoid the "value restriction" error; you may remove it;
An earlier version of this answer suggested setting System.Threading.Thread.CurrentThread.CurrentCulture <- Globalization.CultureInfo.GetCultureInfo "en-US" in case the program runs on a system whose default language settings use a decimal comma; this won't work (thanks @Stephan);
I would not use sepBy unless I have a value list of unknown size;
If you don't really need the value returned (e.g. '#' characters), it is recommended to use skip* functions instead of p* for performance reasons.
UPD: added slash as separator.

I would probably do something like this, which returns float * float:
let ws = spaces
let quantity = pfloat .>> ws
let price = pfloat .>> ws
let quoteSep = pstring "#" .>> ws
let quote = quantity .>> quoteSep .>>. price //`.>> eof` (if final parser)
It's typical for each parser to consume trailing whitespace. Just make sure your top-level parser includes eof.

Assuming that you could have more than two floats in the input and that '/' and '#' are the delimiters:
let ws = spaces
let str_ws s = pstring s .>> ws
let float_ws = pfloat .>> ws
let pquote = sepBy float_ws (str_ws "/" <|> str_ws "#")
Talking about handling whitespaces, this section in FParsec tutorial is really helpful.
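Putting the convention from these answers together (each parser consumes its own trailing whitespace, and the top-level parser ends with eof), a runnable sketch might look like this. The names sep and pquote and the leading ws are choices made for illustration, with unit assumed as the user state.

```fsharp
#r "nuget: FParsec"
open FParsec

let ws : Parser<unit, unit> = spaces

// Every token parser skips the whitespace that follows it.
let float_ws = pfloat .>> ws
let sep = skipAnyOf "#/" .>> ws

// The top-level parser skips leading whitespace once and insists on end of input.
let pquote = ws >>. sepBy float_ws sep .>> eof

for input in ["5.75 # 5.95 "; "5.75/5.95 "] do
    match run pquote input with
    | Success (values, _, _) -> printfn "%A" values
    | Failure (msg, _, _) -> printfn "%s" msg
```

Because the separator never starts with whitespace, it fails without consuming input when only trailing spaces remain, so no attempt is needed.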
