Recursive parsing grammar consumes input and fails to parse sequence - parsing

I'm attempting to write an Augmented Backus-Naur form parser. However, I am coming across a Stack Overflow exception whenever I attempt to parse alternatives. Below is an example which triggers the issue:
#r #"..\packages\FParsec\lib\net40-client\FParsecCS.dll"
#r #"..\packages\FParsec\lib\net40-client\FParsec.dll"
open FParsec
type Parser<'t> = Parser<'t, unit>
type Element =
| Alternates of Element list
| ParsedString of string
let (pRuleElement, pRuleElementRef) : (Parser<Element> * Parser<Element> ref) = createParserForwardedToRef()
let pString =
pchar '"' >>. manyCharsTill (noneOf ['"']) (pchar '"')
|>> ParsedString
let pAlternates : Parser<_> =
sepBy1 pRuleElement (many (pchar ' ') >>. (pchar '/') >>. many (pchar ' ') )
|>> Alternates
do pRuleElementRef :=
choice
[
pString
pAlternates
]
"\"0\" / \"1\" / \"2\" / \"3\" / \"4\" / \"5\" / \"6\" / \"7\""
|> run (pRuleElement .>> (skipNewline <|> eof))
The issue is easily resolved by simply reordering the choice like so:
do pRuleElementRef :=
choice
[
pAlternates
pString
]
However, that then causes a Stack Overflow because it continuously attempts to parse a new sequence of alternatives without consuming input. In addition, that method would then break ABNF precedence:
Strings, names formation
Comment
Value range
Repetition
Grouping, optional
Concatenation
Alternative
My question essentially boils down to this: How can I combine parsing of a single element that can be a sequence of elements or a single instance of an element? Please let me know if you require any clarification / additional examples.
Your help is much appreciated, thank you!
EDIT:
I should probably mention that there are various other kinds of groupings as well. A sequence group (element[s]) and an optional group [optional element[s]. Where element can be nested groups / optional groups / strings / other element types. Below is an example with sequence group parsing (optional group parsing not included for simplicity):
#r #"..\packages\FParsec\lib\net40-client\FParsecCS.dll"
#r #"..\packages\FParsec\lib\net40-client\FParsec.dll"
open FParsec
type Parser<'t> = Parser<'t, unit>
type Element =
| Alternates of Element list
| SequenceGroup of Element list
| ParsedString of string
let (pRuleElement, pRuleElementRef) : (Parser<Element> * Parser<Element> ref) = createParserForwardedToRef()
let pString =
pchar '"' >>. manyCharsTill (noneOf ['"']) (pchar '"')
|>> ParsedString
let pAlternates : Parser<_> =
pipe2
(pRuleElement .>> (many (pchar ' ') >>. (pchar '/') >>. many (pchar ' ')))
(sepBy1 pRuleElement (many (pchar ' ') >>. (pchar '/') >>. many (pchar ' ') ))
(fun first rest -> first :: rest)
|>> Alternates
let pSequenceGroup : Parser<_> =
between (pchar '(') (pchar ')') (sepBy1 pRuleElement (pchar ' '))
|>> SequenceGroup
do pRuleElementRef :=
choice
[
pAlternates
pSequenceGroup
pString
]
"\"0\" / ((\"1\" \"2\") / \"2\") / \"3\" / (\"4\" / \"5\") / \"6\" / \"7\""
|> run (pRuleElement .>> (skipNewline <|> eof))
If I attempt to parse alternates / sequence groups first, it terminates with a stack overflow exception because it then tries to parse alternates repeatedly.

The issue is that when you run the pRuleElement parser on the input, it correctly parses one string, leaving some unconsumed input, but then it fails later outside of the choice that would backtrack.
You can run the pAlternates parser on the main input, which actually works:
"\"0\" / \"1\" / \"2\" / \"3\" / \"4\" / \"5\" / \"6\" / \"7\""
|> run (pAlternates .>> (skipNewline <|> eof))
I suspect that you can probably just do this - the pAlternates parser works correctly, even on just a single string - it will just return Alternates containing a singleton list.

It looks like the solution was simply not attempting to parse alternatives whilst parsing alternatives in order to avoid an infinite loop resulting in a stack overflow. A working version of the code posted in my question is as follows:
#r #"..\packages\FParsec\lib\net40-client\FParsecCS.dll"
#r #"..\packages\FParsec\lib\net40-client\FParsec.dll"
open FParsec
type Parser<'t> = Parser<'t, unit>
type Element =
| Alternates of Element list
| SequenceGroup of Element list
| ParsedString of string
let (pRuleElement, pRuleElementRef) : (Parser<Element> * Parser<Element> ref) = createParserForwardedToRef()
let (pNotAlternatives, pNotAlternativesRef) : (Parser<Element> * Parser<Element> ref) = createParserForwardedToRef()
let pString =
pchar '"' >>. manyCharsTill (noneOf ['"']) (pchar '"')
|>> ParsedString
let pAlternates : Parser<_> =
pipe2
(pNotAlternatives .>>? (many (pchar ' ') >>? (pchar '/') >>. many (pchar ' ')))
(sepBy1 pNotAlternatives (many (pchar ' ') >>? (pchar '/') >>. many (pchar ' ') ))
(fun first rest -> first :: rest)
|>> Alternates
let pSequenceGroup : Parser<_> =
between (pchar '(') (pchar ')') (sepBy1 pRuleElement (pchar ' '))
|>> SequenceGroup
do pRuleElementRef :=
choice
[
pAlternates
pSequenceGroup
pString
]
do pNotAlternativesRef :=
choice
[
pSequenceGroup
pString
]
"\"0\" / (\"1\" \"2\") / \"3\" / (\"4\" / \"5\") / \"6\" / \"7\""
|> run (pRuleElement .>> (skipNewline <|> eof))
In addition to the addition of pNotAlternatives I also modified it so that it would backtrack when failing to parse the alternative separator / which allows it to proceed after "realizing" that it wasn't a list of alternatives after all.

Related

How to parse seq of words separated by double spaces using fparsec?

Given the input:
alpha beta gamma one two three
How could I parse this into the below?
[["alpha"; "beta"; "gamma"]; ["one"; "two"; "three"]]
I can write this when there is a better separator (e.g.__), as then
sepBy (sepBy word (pchar ' ')) (pstring "__")
works, but in the case of double space, the pchar in the first sepBy consumes the first space and then the parser fails.
The FParsec manual says that in sepBy p sep, if sep succeds and the subsequent p fails (without changing the state), the entire sepBy fails, too. Hence, your goal is:
to make the separator fail if it encounters more than a single space char;
to backtrack so that the "inner" sepBy loop closed happily and passed control to the "outer" sepBy loop.
Here's how to do the both:
// this is your word parser; it can be different of course,
// I just made it as simple as possible;
let pWord = many1Satisfy isAsciiLetter
// this is the Inner separator to separate individual words
let pSepInner =
pchar ' '
.>> notFollowedBy (pchar ' ') // guard rule to prevent 2nd space
|> attempt // a wrapper that fails NON-fatally
// this is the Outer separator
let pSepOuter =
pchar ' '
|> many1 // loop over 1+ spaces
// this is the parser that would return String list list
let pMain =
pWord
|> sepBy <| pSepInner // the Inner loop
|> sepBy <| pSepOuter // the Outer loop
Use:
run pMain "alpha beta gamma one two three"
Success: [["alpha"; "beta"; "gamma"]; ["one"; "two"; "three"]]
I'd recommend replacing sepBy word (pchar ' ') with something like this:
let pOneSpace = pchar ' ' .>> notFollowedBy (pchar ' ')
let pTwoSpaces = pstring " "
// Or if two spaces are allowed as separators but *not* three spaces...
let pTwoSpaces = pstring " " .>> notFollowedBy (pchar ' ')
sepBy (sepBy word pOneSpace) pTwoSpaces
Note: not tested (since I don't have time at the moment), just typed into answer box. So test it in case I made a mistake somewhere.

With FParsec, how does one use the manyCharsTill and between parsers and not fail on the closing string?

I'm trying to use FParsec to parse a TOML multi-line string, and I'm having trouble with the closing delimiter ("""). I have the following parsers:
let controlChars =
['\u0000'; '\u0001'; '\u0002'; '\u0003'; '\u0004'; '\u0005'; '\u0006'; '\u0007';
'\u0008'; '\u0009'; '\u000a'; '\u000b'; '\u000c'; '\u000d'; '\u000e'; '\u000f';
'\u0010'; '\u0011'; '\u0012'; '\u0013'; '\u0014'; '\u0015'; '\u0016'; '\u0017';
'\u0018'; '\u0019'; '\u001a'; '\u001b'; '\u001c'; '\u001d'; '\u001e'; '\u001f';
'\u007f']
let nonSpaceCtrlChars =
Set.difference (Set.ofList controlChars) (Set.ofList ['\n';'\r';'\t'])
let multiLineStringContents : Parser<char,unit> =
satisfy (isNoneOf nonSpaceCtrlChars)
let multiLineString : Parser<string,unit> =
optional newline >>. manyCharsTill multiLineStringContents (pstring "\"\"\"")
|> between (pstring "\"\"\"") (pstring "\"\"\"")
let test parser str =
match run parser str with
| Success (s1, s2, s3) -> printfn "Ok: %A %A %A" s1 s2 s3
| Failure (f1, f2, f3) -> printfn "Fail: %A %A %A" f1 f2 f3
When I test multiLineString against an input like this:
test multiLineString "\"\"\"x\"\"\""
The parser fails with this error:
Fail: "Error in Ln: 1 Col: 8 """x"""
^ Note: The error occurred at the end of the input stream. Expecting: '"""'
I'm confused by this. Wouldn't the manyCharsTill multiLineStringContents (pstring "\"\"\"") parser stop at the """ for the between parser to find it? Why is the parser eating all the input and then failing the between parser?
This seems like a relevant post: How to parse comments with FParsec
But I don't see how the solution to that one differs from what I'm doing here, really.
The manyCharsTill documentation says (emphasis mine):
manyCharsTill cp endp parses chars with the char parser cp until the parser endp succeeds. It stops after endp and returns the parsed chars as a string.
So you don't want to use between in combination with manyCharsTill; you want to do something like pstring "\"\"\"" >>. manyCharsTill (pstring "\"\"\"").
But as it happens, I can save you a lot of work. I've been working on a TOML parser with FParsec myself in my spare time. It's far from complete, but the string part works and handles backslash escapes correctly (as far as I can tell: I've tested thoroughly but not exhaustively). The only thing I'm missing is the "strip first newline if it appears right after the opening delimiter" rule, which you've handled with optional newline. So just add that bit into my code below and you should have a working TOML string parser.
BTW, I am planning to license my code (if I finish it) under the MIT license. So I hereby release the following code block under the MIT license. Feel free to use it in your project if it's useful to you.
let pShortCodepointInHex = // Anything from 0000 to FFFF, *except* the range D800-DFFF
(anyOf "dD" >>. (anyOf "01234567" <?> "a Unicode scalar value (range D800-DFFF not allowed)") .>>. exactly 2 isHex |>> fun (c,s) -> sprintf "d%c%s" c s)
<|> (exactly 4 isHex <?> "a Unicode scalar value")
let pLongCodepointInHex = // Anything from 00000000 to 0010FFFF, *except* the range D800-DFFF
(pstring "0000" >>. pShortCodepointInHex)
<|> (pstring "000" >>. exactly 5 isHex)
<|> (pstring "0010" >>. exactly 4 isHex |>> fun s -> "0010" + s)
<?> "a Unicode scalar value (i.e., in range 00000000 to 0010FFFF)"
let toCharOrSurrogatePair p =
p |> withSkippedString (fun codePoint _ -> System.Int32.Parse(codePoint, System.Globalization.NumberStyles.HexNumber) |> System.Char.ConvertFromUtf32)
let pStandardBackslashEscape =
anyOf "\\\"bfnrt"
|>> function
| 'b' -> "\b" // U+0008 BACKSPACE
| 'f' -> "\u000c" // U+000C FORM FEED
| 'n' -> "\n" // U+000A LINE FEED
| 'r' -> "\r" // U+000D CARRIAGE RETURN
| 't' -> "\t" // U+0009 CHARACTER TABULATION a.k.a. Tab or Horizonal Tab
| c -> string c
let pUnicodeEscape = (pchar 'u' >>. (pShortCodepointInHex |> toCharOrSurrogatePair))
<|> (pchar 'U' >>. ( pLongCodepointInHex |> toCharOrSurrogatePair))
let pEscapedChar = pstring "\\" >>. (pStandardBackslashEscape <|> pUnicodeEscape)
let quote = pchar '"'
let isBasicStrChar c = c <> '\\' && c <> '"' && c > '\u001f' && c <> '\u007f'
let pBasicStrChars = manySatisfy isBasicStrChar
let pBasicStr = stringsSepBy pBasicStrChars pEscapedChar |> between quote quote
let pEscapedNewline = skipChar '\\' .>> skipNewline .>> spaces
let isMultilineStrChar c = c = '\n' || isBasicStrChar c
let pMultilineStrChars = manySatisfy isMultilineStrChar
let pTripleQuote = pstring "\"\"\""
let pMultilineStr = stringsSepBy pMultilineStrChars (pEscapedChar <|> (notFollowedByString "\"\"\"" >>. pstring "\"")) |> between pTripleQuote pTripleQuote
#rmunn provided a correct answer, thanks! I also solved this in a slightly different way after playing with the FParsec API a bit more. As explained in the other answer, The endp argument to manyCharTill was eating the closing """, so I needed to switch to something that wouldn't do that. A simple modification using lookAhead did the trick:
let multiLineString : Parser<string,unit> =
optional newline >>. manyCharsTill multiLineStringContents (lookAhead (pstring "\"\"\""))
|> between (pstring "\"\"\"") (pstring "\"\"\"")

Parsing in to a recursive data structure

I wish to parse a string in to a recursive data structure using F#. In this question I'm going to present a simplified example that cuts to the core of what I want to do.
I want to parse a string of nested square brackets in to the record type:
type Bracket = | Bracket of Bracket option
So:
"[]" -> Bracket None
"[[]]" -> Bracket ( Some ( Bracket None) )
"[[[]]]" -> Bracket ( Some ( Bracket ( Some ( Bracket None) ) ) )
I would like to do this using the parser combinators in the FParsec library. Here is what I have so far:
let tryP parser =
parser |>> Some
<|>
preturn None
/// Parses up to nesting level of 3
let parseBrakets : Parser<_> =
let mostInnerLevelBracket =
pchar '['
.>> pchar ']'
|>> fun _ -> Bracket None
let secondLevelBracket =
pchar '['
>>. tryP mostInnerLevelBracket
.>> pchar ']'
|>> Bracket
let firstLevelBracket =
pchar '['
>>. tryP secondLevelBracket
.>> pchar ']'
|>> Bracket
firstLevelBracket
I even have some Expecto tests:
open Expecto
[<Tests>]
let parserTests =
[ "[]", Bracket None
"[[]]", Bracket (Some (Bracket None))
"[[[]]]", Bracket ( Some (Bracket (Some (Bracket None)))) ]
|> List.map(fun (str, expected) ->
str
|> sprintf "Trying to parse %s"
|> testCase
<| fun _ ->
match run parseBrakets str with
| Success (x, _,_) -> Expect.equal x expected "These should have been equal"
| Failure (m, _,_) -> failwithf "Expected a match: %s" m
)
|> testList "Bracket tests"
let tests =
[ parserTests ]
|> testList "Tests"
runTests defaultConfig tests
The problem is of course how to handle and arbitrary level of nesting - the code above only works for up to 3 levels. The code I would like to write is:
let rec pNestedBracket =
pchar '['
>>. tryP pNestedBracket
.>> pchar ']'
|>> Bracket
But F# doesn't allow this.
Am I barking up the wrong tree completely with how to solve this (I understand that there are easier ways to solve this particular problem)?
You are looking for FParsecs createParserForwardedToRef method. Because parsers are values and not functions it is impossible to make mutually recursive or self recursive parsers in order to do this you have to in a sense declare a parser before you define it.
Your final code will end up looking something like this
let bracketParser, bracketParserRef = createParserForwardedToRef<Bracket>()
bracketParserRef := ... //here you can finally declare your parser
//you can reference bracketParser which is a parser that uses the bracketParserRef
Also I would recommend this article for basic understanding of parser combinators. https://fsharpforfunandprofit.com/posts/understanding-parser-combinators/. The final section on a JSON parser talks about the createParserForwardedToRef method.
As an example of how to use createParserForwardedToRef, here's a snippet from a small parser I wrote recently. It parses lists of space-separated integers between brackets (and the lists can be nested), and the "integers" can be small arithmetic expressions like 1+2 or 3*5.
type ListItem =
| Int of int
| List of ListItem list
let pexpr = // ... omitted for brevity
let plist,plistImpl = createParserForwardedToRef()
let pListContents = (many1 (plist |>> List .>> spaces)) <|>
(many (pexpr |>> Int .>> spaces))
plistImpl := pchar '[' >>. spaces
>>. pListContents
.>> pchar ']'
P.S. I would have put this as a comment to Thomas Devries's answer, but a comment can't contain nicely-formatted code. Go ahead and accept his answer; mine is just intended to flesh his out.

fparsec parsing alternatives with discriminated unions for DSL

I am trying to achieve below with fparsec and unions
(1 + (2 * 3)) //DSL sample input(recursive)
type AirthmeticExpression =
| Constant of float
| AddNumber of AirthmeticExpression * AirthmeticExpression
| Mul of AirthmeticExpression * AirthmeticExpression
in fparsec I have createParserForwardedToRef for Add and Mul as
let parseExpression, implementation = createParserForwardedToRef<AirthmeticExpression, unit>();;
let parseAdd = between pstring"(" pstring ")" (tuple2 (parseExpression .>> pstring " + ") parseExpression) |>> AddNumber
let parseMul = between pstring"(" pstring ")" (tuple2 (parseExpression .>> pstring " * ") parseExpression) |>> Mul
implementation := parseConstant <|> parseAdd <|> parseMull
but fparsec doc says for alternatives if parser p1 consumes input and fails it will not try p2.
in my case both Add and Mul has same pattern before operator, so only p1 is working. how can I refactor it so I can parse my input? in fparsec doc solution example, it worked as it was just parsing and not constructing Discriminated union instance. in my case I have to know which pattern matched so that I can create either Add or Mul
Edit: my original comment was just as flawed, as pointed out by #FyodorSoikin.
You are on the right track in your comment from yesterday by making a common parser for the operators and then having a single parser for operations that uses it. To make this more functional, you can have the operator parser return the union case to apply. This way, when parsing the full operation, you can just call it as a function.
let parseExpression, implementation = createParserForwardedToRef<AirthmeticExpression, unit>();;
let parseOperator = // : Parser<AirthmeticExpression * AirthmeticExpression -> AirthmeticExpression>
(pstring " + " |>> AddNumber)
<|> (pstring " * " |>> Mul)
let parseOperation =
pipe3 parseConstant parseOperator parseConstant
(fun x op y -> op (x, y)) // Here, op is either AddNumber or Mul
|> between (pstring "(") (pstring ")")
implementation := parseConstant <|> parseOperation
Original comment:
One possibility is to use attempt as said in the comments, but that function should generally be used as a last resort. A better solution is to factor out the wrapping:
let parseExpression, implementation = createParserForwardedToRef<AirthmeticExpression, unit>();;
let parseAdd = tuple2 (parseExpression .>> pstring " + ") parseExpression |>> AddNumber
let parseMul = tuple2 (parseExpression .>> pstring " * ") parseExpression |>> Mul
let parseOp = between (pstring "(") (pstring ")") (parseAdd <|> parseMul)
implementation := parseConstant <|> parseOp

Parsing an optionally-multi-line expression with FParsec

I'm writing an FParsec parser for strings in this form:
do[ n times]([ action] | \n([action]\n)*endDo)
in other words this is a "do" statement with an optional time quantifier, and either a single "action" statement or a list of "action"s (each on a new line) with an "end do" at the end (I omitted indentations/trailing space handling for simplicity).
These are examples of valid inputs:
do action
do 3 times action
do
endDo
do 3 times
endDo
do
action
action
endDo
do 3 times
action
action
endDo
This does not look very complicated, but:
Why does this not work?
let statement = pstring "action"
let beginDo = pstring "do"
>>. opt (spaces1 >>. pint32 .>> spaces1 .>> pstring "times")
let inlineDo = tuple2 beginDo (spaces >>. statement |>> fun w -> [w])
let expandedDo = (tuple2 (beginDo .>> newline)
(many (statement .>> newline)))
.>> pstring "endDo"
let doExpression = (expandedDo <|> inlineDo)
What is a correct parser for this expression?
You need to use the attempt function.
I just modified your beginDo and doExpression functions.
This is the code:
let statement o=o|> pstring "action"
let beginDo o=
attempt (pstring "do"
>>. opt (spaces1 >>. pint32 .>> spaces1 .>> pstring "times")) <|>
(pstring "do" >>% None) <|o
let inlineDo o= tuple2 beginDo (spaces >>. statement |>> fun w -> [w]) <|o
let expandedDo o= (tuple2 (beginDo .>> newline) (many (statement .>> newline)))
.>> pstring "endDo" <|o
let doExpression o= ((attempt expandedDo) <|> inlineDo) .>> eof <|o
I added an eof at the end. This way it will be easier to test.
I added also dummy o parameters to avoid the value restriction.

Resources