Parsing an optionally-multi-line expression with FParsec - f#

I'm writing an FParsec parser for strings in this form:
do[ n times]([ action] | \n([action]\n)*endDo)
In other words, this is a "do" statement with an optional time quantifier, and either a single "action" statement or a list of "action"s (each on a new line) with an "endDo" at the end (I omitted indentation and trailing-space handling for simplicity).
These are examples of valid inputs:
do action
do 3 times action
do
endDo
do 3 times
endDo
do
action
action
endDo
do 3 times
action
action
endDo
This does not look very complicated, but:
Why does this not work?
let statement = pstring "action"
let beginDo = pstring "do"
              >>. opt (spaces1 >>. pint32 .>> spaces1 .>> pstring "times")
let inlineDo = tuple2 beginDo (spaces >>. statement |>> fun w -> [w])
let expandedDo = (tuple2 (beginDo .>> newline)
                         (many (statement .>> newline)))
                 .>> pstring "endDo"
let doExpression = (expandedDo <|> inlineDo)
What is a correct parser for this expression?

You need to use the attempt function.
I just modified your beginDo and doExpression functions.
This is the code:
let statement o = o |> pstring "action"
let beginDo o =
    attempt (pstring "do"
             >>. opt (spaces1 >>. pint32 .>> spaces1 .>> pstring "times")) <|>
    (pstring "do" >>% None) <| o
let inlineDo o = tuple2 beginDo (spaces >>. statement |>> fun w -> [w]) <| o
let expandedDo o = (tuple2 (beginDo .>> newline) (many (statement .>> newline)))
                   .>> pstring "endDo" <| o
let doExpression o = ((attempt expandedDo) <|> inlineDo) .>> eof <| o
I added an eof at the end, which makes the parser easier to test.
I also added dummy o parameters to avoid the value restriction.
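The original parser fails because opt only returns None when its argument fails without consuming input; here spaces1 has already consumed whitespace by the time pint32 fails. The following is a minimal sketch of the same fix, assuming the FParsec package is referenced. It uses explicit type annotations instead of dummy parameters, and the names P and quantifier are introduced here for illustration only.

```fsharp
open FParsec

// Hypothetical alias: fixing the user state to unit avoids the value restriction.
type P<'a> = Parser<'a, unit>

let statement : P<string> = pstring "action"

// attempt restores the stream position if the " n times" part is absent,
// so the whitespace eaten by spaces1 is given back.
let quantifier : P<int32> =
    attempt (spaces1 >>. pint32 .>> spaces1 .>> pstring "times")

let beginDo : P<int32 option> = pstring "do" >>. opt quantifier

let inlineDo : P<int32 option * string list> =
    beginDo .>>. (spaces >>. statement |>> fun s -> [s])

let expandedDo : P<int32 option * string list> =
    (beginDo .>> newline) .>>. many (statement .>> newline) .>> pstring "endDo"

// expandedDo consumes "do" before it can fail, so it needs attempt as well.
let doExpression : P<int32 option * string list> =
    (attempt expandedDo <|> inlineDo) .>> eof

for input in [ "do action"; "do 3 times action"; "do\nendDo"; "do 3 times\naction\nendDo" ] do
    match run doExpression input with
    | Success (result, _, _) -> printfn "Ok: %A" result
    | Failure (message, _, _) -> printfn "Fail: %s" message
```

Placing attempt around the quantifier alone keeps the backtracking local, so beginDo does not need a second `pstring "do"` alternative.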

Recursive parsing grammar consumes input and fails to parse sequence

I'm attempting to write an Augmented Backus-Naur form parser. However, I am coming across a Stack Overflow exception whenever I attempt to parse alternatives. Below is an example which triggers the issue:
#r @"..\packages\FParsec\lib\net40-client\FParsecCS.dll"
#r @"..\packages\FParsec\lib\net40-client\FParsec.dll"
open FParsec
type Parser<'t> = Parser<'t, unit>
type Element =
| Alternates of Element list
| ParsedString of string
let (pRuleElement, pRuleElementRef) : (Parser<Element> * Parser<Element> ref) = createParserForwardedToRef()
let pString =
pchar '"' >>. manyCharsTill (noneOf ['"']) (pchar '"')
|>> ParsedString
let pAlternates : Parser<_> =
sepBy1 pRuleElement (many (pchar ' ') >>. (pchar '/') >>. many (pchar ' ') )
|>> Alternates
do pRuleElementRef :=
choice
[
pString
pAlternates
]
"\"0\" / \"1\" / \"2\" / \"3\" / \"4\" / \"5\" / \"6\" / \"7\""
|> run (pRuleElement .>> (skipNewline <|> eof))
The issue is easily resolved by simply reordering the choice like so:
do pRuleElementRef :=
choice
[
pAlternates
pString
]
However, that then causes a Stack Overflow because it continuously attempts to parse a new sequence of alternatives without consuming input. In addition, that method would then break ABNF precedence:
Strings, names formation
Comment
Value range
Repetition
Grouping, optional
Concatenation
Alternative
My question essentially boils down to this: How can I combine parsing of a single element that can be a sequence of elements or a single instance of an element? Please let me know if you require any clarification / additional examples.
Your help is much appreciated, thank you!
EDIT:
I should probably mention that there are various other kinds of groupings as well: a sequence group (element[s]) and an optional group [optional element[s]], where an element can be a nested group, an optional group, a string, or another element type. Below is an example with sequence group parsing (optional group parsing not included for simplicity):
#r @"..\packages\FParsec\lib\net40-client\FParsecCS.dll"
#r @"..\packages\FParsec\lib\net40-client\FParsec.dll"
open FParsec
type Parser<'t> = Parser<'t, unit>
type Element =
| Alternates of Element list
| SequenceGroup of Element list
| ParsedString of string
let (pRuleElement, pRuleElementRef) : (Parser<Element> * Parser<Element> ref) = createParserForwardedToRef()
let pString =
pchar '"' >>. manyCharsTill (noneOf ['"']) (pchar '"')
|>> ParsedString
let pAlternates : Parser<_> =
pipe2
(pRuleElement .>> (many (pchar ' ') >>. (pchar '/') >>. many (pchar ' ')))
(sepBy1 pRuleElement (many (pchar ' ') >>. (pchar '/') >>. many (pchar ' ') ))
(fun first rest -> first :: rest)
|>> Alternates
let pSequenceGroup : Parser<_> =
between (pchar '(') (pchar ')') (sepBy1 pRuleElement (pchar ' '))
|>> SequenceGroup
do pRuleElementRef :=
choice
[
pAlternates
pSequenceGroup
pString
]
"\"0\" / ((\"1\" \"2\") / \"2\") / \"3\" / (\"4\" / \"5\") / \"6\" / \"7\""
|> run (pRuleElement .>> (skipNewline <|> eof))
If I attempt to parse alternates / sequence groups first, it terminates with a stack overflow exception because it then tries to parse alternates repeatedly.
The issue is that when you run the pRuleElement parser on the input, it correctly parses one string, leaving some unconsumed input, but then it fails later outside of the choice that would backtrack.
You can run the pAlternates parser on the main input, which actually works:
"\"0\" / \"1\" / \"2\" / \"3\" / \"4\" / \"5\" / \"6\" / \"7\""
|> run (pAlternates .>> (skipNewline <|> eof))
I suspect that you can probably just do this - the pAlternates parser works correctly, even on just a single string - it will just return Alternates containing a singleton list.
It looks like the solution was simply not attempting to parse alternatives whilst parsing alternatives in order to avoid an infinite loop resulting in a stack overflow. A working version of the code posted in my question is as follows:
#r @"..\packages\FParsec\lib\net40-client\FParsecCS.dll"
#r @"..\packages\FParsec\lib\net40-client\FParsec.dll"
open FParsec
type Parser<'t> = Parser<'t, unit>
type Element =
    | Alternates of Element list
    | SequenceGroup of Element list
    | ParsedString of string
let (pRuleElement, pRuleElementRef) : (Parser<Element> * Parser<Element> ref) = createParserForwardedToRef()
let (pNotAlternatives, pNotAlternativesRef) : (Parser<Element> * Parser<Element> ref) = createParserForwardedToRef()
let pString =
    pchar '"' >>. manyCharsTill (noneOf ['"']) (pchar '"')
    |>> ParsedString
let pAlternates : Parser<_> =
    pipe2
        (pNotAlternatives .>>? (many (pchar ' ') >>? (pchar '/') >>. many (pchar ' ')))
        (sepBy1 pNotAlternatives (many (pchar ' ') >>? (pchar '/') >>. many (pchar ' ') ))
        (fun first rest -> first :: rest)
    |>> Alternates
let pSequenceGroup : Parser<_> =
    between (pchar '(') (pchar ')') (sepBy1 pRuleElement (pchar ' '))
    |>> SequenceGroup
do pRuleElementRef :=
    choice
        [
            pAlternates
            pSequenceGroup
            pString
        ]
do pNotAlternativesRef :=
    choice
        [
            pSequenceGroup
            pString
        ]
"\"0\" / (\"1\" \"2\") / \"3\" / (\"4\" / \"5\") / \"6\" / \"7\""
|> run (pRuleElement .>> (skipNewline <|> eof))
Besides adding pNotAlternatives, I also modified the parser so that it backtracks when it fails to parse the alternative separator /. This allows it to proceed after "realizing" that the input was not a list of alternatives after all.
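Another way to avoid the recursion problem, and to respect the ABNF precedence listed in the question, is to give each precedence level its own parser: an alternation is a list of concatenations, and a concatenation is a list of elements. This is only a sketch under assumptions, not the code from the post: the names pElement, pConcatSep, pSlash, pConcatenation and collapse are hypothetical, single-item lists are unwrapped rather than kept as singleton Alternates, and trailing spaces are not handled.

```fsharp
open FParsec

type Element =
    | Alternates of Element list
    | SequenceGroup of Element list
    | ParsedString of string

type P<'t> = Parser<'t, unit>

// Lowest-precedence rule; defined at the end because groups refer back to it.
let pAlternation, pAlternationRef = createParserForwardedToRef<Element, unit>()

let pString : P<Element> =
    pchar '"' >>. manyCharsTill (noneOf "\"") (pchar '"') |>> ParsedString

let pGroup : P<Element> = between (pchar '(') (pchar ')') pAlternation

// Every alternative here consumes input before recursing, so there is no left recursion.
let pElement : P<Element> = pString <|> pGroup

// One or more spaces that are not the padding in front of a '/'.
let pConcatSep : P<unit> =
    attempt (skipMany1 (pchar ' ') .>> notFollowedBy (pchar '/'))

// Optional padding, '/', optional padding; backtracks when no '/' follows.
let pSlash : P<unit> =
    attempt (skipMany (pchar ' ') >>. pchar '/' >>. skipMany (pchar ' '))

// Unwrap single-item lists so a lone string stays a ParsedString.
let collapse wrap items =
    match items with
    | [ single ] -> single
    | several -> wrap several

let pConcatenation : P<Element> =
    sepBy1 pElement pConcatSep |>> collapse SequenceGroup

pAlternationRef.Value <- sepBy1 pConcatenation pSlash |>> collapse Alternates

match run (pAlternation .>> eof) "\"0\" / (\"1\" \"2\") / \"3\"" with
| Success (result, _, _) -> printfn "Ok: %A" result
| Failure (message, _, _) -> printfn "Fail: %s" message
```

Because concatenation sits below alternation in the parser hierarchy, it binds more tightly, which matches the precedence order ABNF specifies.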

With FParsec, how does one use the manyCharsTill and between parsers and not fail on the closing string?

I'm trying to use FParsec to parse a TOML multi-line string, and I'm having trouble with the closing delimiter ("""). I have the following parsers:
let controlChars =
['\u0000'; '\u0001'; '\u0002'; '\u0003'; '\u0004'; '\u0005'; '\u0006'; '\u0007';
'\u0008'; '\u0009'; '\u000a'; '\u000b'; '\u000c'; '\u000d'; '\u000e'; '\u000f';
'\u0010'; '\u0011'; '\u0012'; '\u0013'; '\u0014'; '\u0015'; '\u0016'; '\u0017';
'\u0018'; '\u0019'; '\u001a'; '\u001b'; '\u001c'; '\u001d'; '\u001e'; '\u001f';
'\u007f']
let nonSpaceCtrlChars =
Set.difference (Set.ofList controlChars) (Set.ofList ['\n';'\r';'\t'])
let multiLineStringContents : Parser<char,unit> =
satisfy (isNoneOf nonSpaceCtrlChars)
let multiLineString : Parser<string,unit> =
optional newline >>. manyCharsTill multiLineStringContents (pstring "\"\"\"")
|> between (pstring "\"\"\"") (pstring "\"\"\"")
let test parser str =
match run parser str with
| Success (s1, s2, s3) -> printfn "Ok: %A %A %A" s1 s2 s3
| Failure (f1, f2, f3) -> printfn "Fail: %A %A %A" f1 f2 f3
When I test multiLineString against an input like this:
test multiLineString "\"\"\"x\"\"\""
The parser fails with this error:
Fail: "Error in Ln: 1 Col: 8
"""x"""
       ^
Note: The error occurred at the end of the input stream.
Expecting: '"""'
I'm confused by this. Wouldn't the manyCharsTill multiLineStringContents (pstring "\"\"\"") parser stop at the """ for the between parser to find it? Why is the parser eating all the input and then failing the between parser?
This seems like a relevant post: How to parse comments with FParsec
But I don't see how the solution to that one differs from what I'm doing here, really.
The manyCharsTill documentation says (emphasis mine):
manyCharsTill cp endp parses chars with the char parser cp until the parser endp succeeds. It stops after endp and returns the parsed chars as a string.
So you don't want to use between in combination with manyCharsTill; you want to do something like pstring "\"\"\"" >>. manyCharsTill multiLineStringContents (pstring "\"\"\"").
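A minimal sketch of that suggestion follows, assuming the FParsec package is referenced. To keep it self-contained it uses anyChar in place of the question's multiLineStringContents, so it does not reject control characters; tripleQuote is a name introduced here.

```fsharp
open FParsec

let tripleQuote : Parser<string, unit> = pstring "\"\"\""

// manyCharsTill consumes the closing delimiter itself,
// so no outer `between` is needed for the closing quotes.
let multiLineString : Parser<string, unit> =
    tripleQuote >>. optional newline >>. manyCharsTill anyChar tripleQuote

match run multiLineString "\"\"\"x\"\"\"" with
| Success (contents, _, _) -> printfn "Ok: %s" contents
| Failure (message, _, _) -> printfn "Fail: %s" message
```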
But as it happens, I can save you a lot of work. I've been working on a TOML parser with FParsec myself in my spare time. It's far from complete, but the string part works and handles backslash escapes correctly (as far as I can tell: I've tested thoroughly but not exhaustively). The only thing I'm missing is the "strip first newline if it appears right after the opening delimiter" rule, which you've handled with optional newline. So just add that bit into my code below and you should have a working TOML string parser.
BTW, I am planning to license my code (if I finish it) under the MIT license. So I hereby release the following code block under the MIT license. Feel free to use it in your project if it's useful to you.
let pShortCodepointInHex = // Anything from 0000 to FFFF, *except* the range D800-DFFF
(anyOf "dD" >>. (anyOf "01234567" <?> "a Unicode scalar value (range D800-DFFF not allowed)") .>>. exactly 2 isHex |>> fun (c,s) -> sprintf "d%c%s" c s)
<|> (exactly 4 isHex <?> "a Unicode scalar value")
let pLongCodepointInHex = // Anything from 00000000 to 0010FFFF, *except* the range D800-DFFF
(pstring "0000" >>. pShortCodepointInHex)
<|> (pstring "000" >>. exactly 5 isHex)
<|> (pstring "0010" >>. exactly 4 isHex |>> fun s -> "0010" + s)
<?> "a Unicode scalar value (i.e., in range 00000000 to 0010FFFF)"
let toCharOrSurrogatePair p =
p |> withSkippedString (fun codePoint _ -> System.Int32.Parse(codePoint, System.Globalization.NumberStyles.HexNumber) |> System.Char.ConvertFromUtf32)
let pStandardBackslashEscape =
anyOf "\\\"bfnrt"
|>> function
| 'b' -> "\b" // U+0008 BACKSPACE
| 'f' -> "\u000c" // U+000C FORM FEED
| 'n' -> "\n" // U+000A LINE FEED
| 'r' -> "\r" // U+000D CARRIAGE RETURN
| 't' -> "\t" // U+0009 CHARACTER TABULATION a.k.a. Tab or Horizontal Tab
| c -> string c
let pUnicodeEscape = (pchar 'u' >>. (pShortCodepointInHex |> toCharOrSurrogatePair))
<|> (pchar 'U' >>. ( pLongCodepointInHex |> toCharOrSurrogatePair))
let pEscapedChar = pstring "\\" >>. (pStandardBackslashEscape <|> pUnicodeEscape)
let quote = pchar '"'
let isBasicStrChar c = c <> '\\' && c <> '"' && c > '\u001f' && c <> '\u007f'
let pBasicStrChars = manySatisfy isBasicStrChar
let pBasicStr = stringsSepBy pBasicStrChars pEscapedChar |> between quote quote
let pEscapedNewline = skipChar '\\' .>> skipNewline .>> spaces
let isMultilineStrChar c = c = '\n' || isBasicStrChar c
let pMultilineStrChars = manySatisfy isMultilineStrChar
let pTripleQuote = pstring "\"\"\""
let pMultilineStr = stringsSepBy pMultilineStrChars (pEscapedChar <|> (notFollowedByString "\"\"\"" >>. pstring "\"")) |> between pTripleQuote pTripleQuote
@rmunn provided a correct answer, thanks! I also solved this in a slightly different way after playing with the FParsec API a bit more. As explained in the other answer, the endp argument to manyCharsTill was eating the closing """, so I needed to switch to something that wouldn't do that. A simple modification using lookAhead did the trick:
let multiLineString : Parser<string,unit> =
optional newline >>. manyCharsTill multiLineStringContents (lookAhead (pstring "\"\"\""))
|> between (pstring "\"\"\"") (pstring "\"\"\"")

fparsec parsing alternatives with discriminated unions for DSL

I am trying to achieve the following with FParsec and discriminated unions:
(1 + (2 * 3)) //DSL sample input(recursive)
type AirthmeticExpression =
| Constant of float
| AddNumber of AirthmeticExpression * AirthmeticExpression
| Mul of AirthmeticExpression * AirthmeticExpression
In FParsec I use createParserForwardedToRef for Add and Mul as follows:
let parseExpression, implementation = createParserForwardedToRef<AirthmeticExpression, unit>();;
let parseAdd = between (pstring "(") (pstring ")") (tuple2 (parseExpression .>> pstring " + ") parseExpression) |>> AddNumber
let parseMul = between (pstring "(") (pstring ")") (tuple2 (parseExpression .>> pstring " * ") parseExpression) |>> Mul
implementation := parseConstant <|> parseAdd <|> parseMul
But the FParsec documentation says that, for alternatives, if parser p1 consumes input and then fails, p2 will not be tried.
In my case both Add and Mul have the same pattern before the operator, so only p1 works. How can I refactor this so that I can parse my input? The example solution in the FParsec documentation worked because it was only parsing, not constructing a discriminated union instance. In my case I have to know which pattern matched so that I can create either Add or Mul.
Edit: my original comment was just as flawed, as pointed out by @FyodorSoikin.
You are on the right track in your comment from yesterday by making a common parser for the operators and then having a single parser for operations that uses it. To make this more functional, you can have the operator parser return the union case to apply. This way, when parsing the full operation, you can just call it as a function.
let parseExpression, implementation = createParserForwardedToRef<AirthmeticExpression, unit>();;
let parseOperator = // : Parser<AirthmeticExpression * AirthmeticExpression -> AirthmeticExpression>
    (pstring " + " >>% AddNumber)
    <|> (pstring " * " >>% Mul)
let parseOperation =
    pipe3 parseExpression parseOperator parseExpression
        (fun x op y -> op (x, y)) // Here, op is either AddNumber or Mul
    |> between (pstring "(") (pstring ")")
implementation := parseConstant <|> parseOperation
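For reference, here is a self-contained sketch of this approach that can be pasted into a script, assuming the FParsec package is referenced. The post does not show parseConstant, so the pfloat-based definition below is an assumption, as is the BinaryCase alias. The operands are parsed with parseExpression so that nested input such as the sample from the question is accepted, and the operator parser uses >>% to return the union case as a function.

```fsharp
open FParsec

type AirthmeticExpression =
    | Constant of float
    | AddNumber of AirthmeticExpression * AirthmeticExpression
    | Mul of AirthmeticExpression * AirthmeticExpression

// Hypothetical alias for "a union case used as a function".
type BinaryCase = AirthmeticExpression * AirthmeticExpression -> AirthmeticExpression

let parseExpression, implementation =
    createParserForwardedToRef<AirthmeticExpression, unit>()

// Assumed definition; not part of the original post.
let parseConstant : Parser<AirthmeticExpression, unit> = pfloat |>> Constant

// >>% discards the matched string and returns the union case itself.
let parseOperator : Parser<BinaryCase, unit> =
    (pstring " + " >>% AddNumber) <|> (pstring " * " >>% Mul)

let parseOperation : Parser<AirthmeticExpression, unit> =
    pipe3 parseExpression parseOperator parseExpression (fun x op y -> op (x, y))
    |> between (pstring "(") (pstring ")")

implementation.Value <- parseConstant <|> parseOperation

match run (parseExpression .>> eof) "(1 + (2 * 3))" with
| Success (ast, _, _) -> printfn "Ok: %A" ast
| Failure (message, _, _) -> printfn "Fail: %s" message
```

Since the operator is parsed once, after the left operand, no input has to be re-read and no attempt is required.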
Original comment:
One possibility is to use attempt as said in the comments, but that function should generally be used as a last resort. A better solution is to factor out the wrapping:
let parseExpression, implementation = createParserForwardedToRef<AirthmeticExpression, unit>();;
let parseAdd = tuple2 (parseExpression .>> pstring " + ") parseExpression |>> AddNumber
let parseMul = tuple2 (parseExpression .>> pstring " * ") parseExpression |>> Mul
let parseOp = between (pstring "(") (pstring ")") (parseAdd <|> parseMul)
implementation := parseConstant <|> parseOp

How to parse prefix function such as Pow() with multiple parameters using FParsec

I tried to parse a prefix function such as Pow(3+2,2) using FParsec. I read the calculator tutorial in the example files, shown below, but its examples are all unary prefix functions. How can I implement prefix functions with more than one argument using FParsec.OperatorPrecedenceParser?
http://www.quanttec.com/fparsec/reference/operatorprecedenceparser.html#members.PrefixOperator
let number = pfloat .>> ws
let opp = new OperatorPrecedenceParser<float,unit,unit>()
let expr = opp.ExpressionParser
opp.TermParser <- number <|> between (str_ws "(") (str_ws ")") expr
opp.AddOperator(InfixOperator("+", ws, 1, Associativity.Left, (+)))
opp.AddOperator(InfixOperator("-", ws, 1, Associativity.Left, (-)))
opp.AddOperator(InfixOperator("*", ws, 2, Associativity.Left, (*)))
opp.AddOperator(InfixOperator("/", ws, 2, Associativity.Left, (/)))
opp.AddOperator(InfixOperator("^", ws, 3, Associativity.Right, fun x y -> System.Math.Pow(x, y)))
opp.AddOperator(PrefixOperator("-", ws, 4, true, fun x -> -x))
let ws1 = nextCharSatisfiesNot isLetter >>. ws
opp.AddOperator(PrefixOperator("log", ws1, 4, true, System.Math.Log))
opp.AddOperator(PrefixOperator("exp", ws1, 4, true, System.Math.Exp))
Update 1
I've written a quick script based on the after-string parser example, since I need an after-string parser for the actual application:
http://www.quanttec.com/fparsec/users-guide/tips-and-tricks.html#parsing-f-infix-operators
abs(pow(1,2)) can be parsed, but pow(abs(1),2) cannot. I'm puzzled about how to use a prefix function as part of the input for identWithArgs.
#I @"..\packages\FParsec.1.0.2\lib\net40-client"
#r "FParsecCS.dll"
#r "FParsec.dll"
open FParsec
type PrefixFunc = POW
type Expr =
| InfixOpExpr of string * Expr * Expr
| PrefixOpExpr of string * Expr
| PrefixFuncExpr of PrefixFunc * Expr list
| Number of int
let ws = spaces
let ws1 = spaces1
let str s = pstring s
let str_ws s = ws >>. str s .>> ws
let strci s = pstringCI s
let strci_ws s = ws >>. strci s .>> ws
let strciret_ws s x = ws >>. strci s .>> ws >>% x
let isSymbolicOperatorChar = isAnyOf "!%&*+-./<=>@^|~?"
let remainingOpChars_ws = manySatisfy isSymbolicOperatorChar .>> ws
let primitive = pint32 .>> ws |>> Number
let argList = sepBy primitive (str_ws ",")
let argListInParens = between (str_ws "(") (str_ws ")") argList
let prefixFunc = strciret_ws "pow" POW
let identWithArgs =
pipe2 prefixFunc argListInParens (fun funcId args -> PrefixFuncExpr(funcId, args))
let opp = new OperatorPrecedenceParser<Expr, string, unit>()
opp.TermParser <-
primitive <|>
identWithArgs <|>
between (pstring "(") (pstring ")") opp.ExpressionParser
// a helper function for adding infix operators to opp
let addSymbolicInfixOperators prefix precedence associativity =
let op = InfixOperator(prefix, remainingOpChars_ws,
precedence, associativity, (),
fun remOpChars expr1 expr2 ->
InfixOpExpr(prefix + remOpChars, expr1, expr2))
opp.AddOperator(op)
// the operator definitions:
addSymbolicInfixOperators "*" 10 Associativity.Left
addSymbolicInfixOperators "**" 20 Associativity.Right
opp.AddOperator(PrefixOperator("abs",remainingOpChars_ws,3,true,(),fun remOpChars expr -> PrefixOpExpr("abs", expr)))
opp.AddOperator(PrefixOperator("log",remainingOpChars_ws,3,true,(),fun remOpChars expr -> PrefixOpExpr("log", expr)))
run opp.ExpressionParser "abs(pow(1,2))"
run opp.ExpressionParser "pow(abs(1),2)"
I revisited the problem a year later and finally understood it.
I changed the following code
let argList = sepBy primitive (str_ws ",")
to
let opp = new OperatorPrecedenceParser<Expr, string, unit>()
let argList = sepBy opp.ExpressionParser (str_ws ",")
I moved the OperatorPrecedenceParser to the beginning of the code, and then made the parser recursive by using opp.ExpressionParser directly in argList.
I also realized that OperatorPrecedenceParser is very similar to createParserForwardedToRef: it creates a parser first, and the implementation is supplied later. This is how FParsec achieves recursion, as in its JSON sample parser.
After this change, both abs(pow(1,2)) and pow(abs(1),2) can be parsed. I hope this helps others who run into this problem.
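The change can be seen in isolation in the following reduced sketch, assuming the FParsec package is referenced. It is not the full script from the question: it uses unit as the after-string type, a simplified Expr type with a hypothetical FuncCall case, and a hypothetical powCall parser, so that only the recursive argument list remains.

```fsharp
open FParsec

type Expr =
    | Number of int
    | InfixOpExpr of string * Expr * Expr
    | PrefixOpExpr of string * Expr
    | FuncCall of string * Expr list

let ws : Parser<unit, unit> = spaces
let str_ws s : Parser<string, unit> = pstring s .>> ws

// Create the operator precedence parser first so argList can refer to it.
let opp = OperatorPrecedenceParser<Expr, unit, unit>()
let expr = opp.ExpressionParser

let number : Parser<Expr, unit> = pint32 .>> ws |>> Number

// Arguments are full expressions, not just primitives.
let argList : Parser<Expr list, unit> = sepBy expr (str_ws ",")

let powCall : Parser<Expr, unit> =
    str_ws "pow" >>. between (str_ws "(") (str_ws ")") argList
    |>> fun args -> FuncCall("pow", args)

opp.TermParser <- number <|> powCall <|> between (str_ws "(") (str_ws ")") expr
opp.AddOperator(InfixOperator("*", ws, 10, Associativity.Left, fun a b -> InfixOpExpr("*", a, b)))
opp.AddOperator(PrefixOperator("abs", ws, 3, true, fun e -> PrefixOpExpr("abs", e)))

for input in [ "abs(pow(1,2))"; "pow(abs(1),2)"; "pow(abs(1), 2*3)" ] do
    match run (expr .>> eof) input with
    | Success (ast, _, _) -> printfn "Ok: %A" ast
    | Failure (message, _, _) -> printfn "Fail: %s" message
```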

why combinator "between" does not work with "choice" as applied parser?

As far as I understand the choice combinator implicitly appends pzero parser to my parser list and when fparsec fails to parse next part of input stream, it should search for brackets.
Here is minimal complete code:
open System
open System.Collections.Generic
open FParsec
type IDL =
|Library of string * IDL list
|ImportLib of string
|ImportAlias of string
let comment : Parser<unit,unit> = pstring "//" >>. skipRestOfLine true >>. spaces
let ws = spaces >>. (opt comment)
let str s = pstring s >>. ws
let identifierString = ws >>. many1Satisfy isLetter .>> ws // [A-z]+
let identifierPath = ws >>. many1Satisfy (fun c -> isLetter c || isDigit c || c = '.' || c = '\\' || c = '/') .>> ws // valid path letters
let keywords = ["importlib"; "IMPORTLIB"; "interface"; "typedef"; "coclass"]
let keywordsSet = new HashSet<string>(keywords)
let isKeyword (set : HashSet<string>) str = set.Contains(str)
let pidentifier set (f : Parser<string, unit>) : Parser<string, unit> =
let expectedIdentifier = expected "identifier"
fun stream ->
let state = stream.State
let reply = f stream
if reply.Status <> Ok || not (isKeyword set reply.Result) then
printf "got id %s\n" reply.Result
ws stream |> ignore
reply
else // result is keyword, so backtrack to before the string
stream.BacktrackTo(state)
Reply(Error, expectedIdentifier)
let identifier = pidentifier keywordsSet
let stmt, stmtRef = createParserForwardedToRef()
let stmtList = sepBy1 stmt (str ";")
let importlib =
str "importlib" >>.
between (str "(" .>> str "\"") (str "\"" >>. str ")")
(identifier identifierPath) |>> ImportLib
let importalias =
str "IMPORTLIB" >>.
between (str "(") (str ")")
(identifier identifierString) |>> ImportAlias
let library =
pipe2
(str "library" >>. identifier identifierString)
(between (str "{") (str "}") stmtList)
(fun lib slist -> Library(lib, slist))
do stmtRef:= choice [importlib; importalias]
let prog =
ws >>. library .>> ws .>> eof
let s = @"
library ModExpress
{
importlib(""stdole2.tlb"");
importlib(""msxml6.dll"");
}"
let test p str =
match run p str with
| Success(result, _, _) -> printfn "Success: %A" result
| Failure(errorMsg, _, _) -> printfn "Failure: %s" errorMsg
test prog s
System.Console.Read() |> ignore
but for the input string
library ModExpress
{
importlib(""stdole2.tlb"");
importlib(""msxml6.dll"");
}
I got following error:
Failure: Error in Ln: 6 Col: 1
}
^
Expecting: '//', 'IMPORTLIB' or 'importlib'
It seems that the problem here is that the stmtList parser is implemented with the sepBy1 combinator. sepBy1 p sep parses one or more occurrences of p separated (but not ended) by sep, i.e. in EBNF: p (sep p)*. When the parser sees the semicolon after importlib("msxml6.dll"), it expects another statement after the whitespace.
If you want to make the semicolon at the end of a statement list optional, you could simply use sepEndBy1 instead of sepBy1, or if you always want to require a semicolon, you could use
let stmtList = many1 stmt
do stmtRef:= choice [importlib; importalias] .>> str ";"
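The difference between the two combinators can be seen on a reduced grammar. This is a small sketch assuming the FParsec package is referenced; item, semi, block, strict and lenient are stand-in names introduced here, not parsers from the question.

```fsharp
open FParsec

let item : Parser<string, unit> = many1Satisfy isLetter .>> spaces
let semi : Parser<string, unit> = pstring ";" .>> spaces
let block p = between (pstring "{" .>> spaces) (pstring "}") p

// sepBy1 requires another item after every separator.
let strict : Parser<string list, unit> = block (sepBy1 item semi)

// sepEndBy1 also accepts a separator after the last item.
let lenient : Parser<string list, unit> = block (sepEndBy1 item semi)

for name, parser in [ "sepBy1", strict; "sepEndBy1", lenient ] do
    match run parser "{ a; b; }" with
    | Success (items, _, _) -> printfn "%s: Ok %A" name items
    | Failure (message, _, _) -> printfn "%s: Fail\n%s" name message
```

With the trailing semicolon in the input, only the sepEndBy1 version succeeds; the sepBy1 version reports that it expected another item before the closing brace.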
