As an exercise for myself I'm using FParsec to write a function that can generate a random string from a spec in (limited) Regex form.
E.g.
Input: ([Hh]ello ){1,3} world!?
Output: Hello hello world!
Input: (\d{1,2}\.){3}(\d{1,2})
Output: 38.12.29.05
I have a lot of it working, but I'm having trouble with the idea of postfix terms (i.e. that a parser might need to go back and modify the output, not just the position in the Char stream). E.g. "a" vs "a+"
Here's a cut-down version of my domain types:
type Count =
| ExactCount of int
| MinCount of int
| MaxCount of int
| RangeCount of int * int
type Term =
| CharLiteral of char
| Count of Term * Count
type RandExpSpec = Term list
So the input ab should generate [CharLiteral 'a'; CharLiteral 'b'] but ab+ should generate [CharLiteral 'a'; Count (CharLiteral 'b', MinCount 1)]. So that kind of implies that, on encountering a Count term in the stream, the parser needs to backtrack through the output in order to wrap the last term in another object.
Now, I don't know how to do that. Here's my current parsing definition, that does (mostly) work but is very inefficient:
let parseCharLiteral = choice [ letter; digit ] |>> CharLiteral
let rec parseTerm =
parse.Delay(fun () -> choice [ parseCharLiteral ])
and parseCount =
parseTerm
.>>. choice [ skipChar '*' >>% (MinCount 0)
skipChar '+' >>% (MinCount 1)
skipChar '?' >>% (RangeCount(0, 1)) ]
|>> Count
let parseTerms =
many ((attempt parseCount) <|> parseTerm) .>> eof
You can see that in parseCount I call parseTerm first, then parse the actual count information. Then, in parseTerms I attempt the parseCount parser every time, backtracking through the input if it doesn't work. This is very inefficient because I'm essentially making two passes on almost every char in the input stream just in case it's followed by a count modifier.
Is there a more efficient way to do this? I feel like I should be writing something more like:
let parseCharLiteral = choice [ letter; digit ] |>> CharLiteral
let rec parseTerm =
parse.Delay(fun () -> choice [ parseCharLiteral ] .>>. (attempt parseCount))
and parseCount =
choice [ skipChar '*' >>% (MinCount 0)
skipChar '+' >>% (MinCount 1)
skipChar '?' >>% (RangeCount(0, 1)) ]
|>> Count
let parseTerms =
many parseTerm .>> eof
but I can't make that work since parseCount qould need to wrap the previous term returned by parseTerm.
I think you can use opt to allow parseCount to not find a count if there isn't one:
let parseCount =
parseTerm
.>>. opt (choice [ skipChar '*' >>% (MinCount 0)
skipChar '+' >>% (MinCount 1)
skipChar '?' >>% (RangeCount(0, 1)) ])
|>> function
| term, None -> term
| term, Some count -> Count (term, count)
let parseTerms =
many parseCount .>> eof
Related
I usually use FParsec for LL grammars, but sometimes it happens that in a whole grammar only one element requires left recursive parsing (so the grammar is no longer LL). Currently I have such a situation, I have a large LL grammar implemented with FParsec, but a small grammar element is bothering me because it obviously cannot be parsed correctly.
The syntax element in question is an access to an array index à la F#, e.g. myArray.[index] where myArray can be any expression and index can be any expression too. It turns out that my function calls use square brackets, not parentheses, and my identifiers can be qualified with dots.
An example of correct syntax for an expression is: std.fold[fn, f[myArray.[0]], std.tail[myArray]].
The .[] syntax element is obviously left recursive, but perhaps there is a trick that allows me to parse it anyway? My minimal code is as follows:
open FParsec
type Name = string list
type Expr =
(* foo, Example.Bar.fizz *)
| Variable of Name
(* 9, 17, -1 *)
| Integer of int
(* foo[3, 2], Std.sqrt[2] *)
| FunCall of Name * Expr list
(* (a + b), (a + (1 - c)) *)
| Parens of Expr
(* myArray.[0], table.[index - 1] *)
| ArrayAccess of Expr * Expr
(* a + b *)
| Addition of Expr * Expr
let opp =
new OperatorPrecedenceParser<Expr, _, _>()
let pExpr = opp.ExpressionParser
let pName =
let id =
identifier (IdentifierOptions(isAsciiIdStart = isAsciiLetter, isAsciiIdContinue = isAsciiLetter))
sepBy1 id (skipChar '.')
let pVariable = pName |>> Variable
let pInt = pint32 |>> Integer
let pFunCall =
pipe4
pName
(spaces >>. skipChar '[')
(sepBy (spaces >>. pExpr) (skipChar ','))
(spaces >>. skipChar ']')
(fun name _ args _ -> FunCall(name, args))
let pArrayAccess =
pipe5
pExpr
(spaces >>. skipChar '.')
(spaces >>. skipChar '[')
(spaces >>. pExpr)
(spaces >>. skipChar ']')
(fun expr _ _ index _ -> ArrayAccess(expr, index))
let pParens =
between (skipChar '(') (skipChar ')') (spaces >>. pExpr)
opp.TermParser <-
choice [ attempt pFunCall
pVariable
pArrayAccess
pInt
pParens ]
.>> spaces
let addInfixOperator str prec assoc mapping =
opp.AddOperator
<| InfixOperator(str, spaces, prec, assoc, (), (fun _ leftTerm rightTerm -> mapping leftTerm rightTerm))
addInfixOperator "+" 6 Associativity.Left (fun a b -> Addition(a, b))
let startParser = runParserOnString (pExpr .>> eof) () ""
printfn "%A" <| startParser "std.fold[fn, f[myArray.[0]], std.tail[myArray]]"
One way to do this is as follows: instead of making a list of parsing choices that also lists pArrayAccess like above, which will at some point cause an infinite loop, one can modify pExpr to parse the grammar element in question as an optional element following an expression:
let pExpr =
parse {
let! exp = opp.ExpressionParser
let pArrayAccess =
between (skipString ".[") (skipString "]") opp.ExpressionParser
match! opt pArrayAccess with
| None -> return exp
| Some index -> return ArrayAccess(exp, index)
}
After testing, it turns out that this works very well if the following two conditions are not met:
The contents of the square brackets must not contain access to another array ;
An array cannot be accessed a second time in succession (my2DArray.[x].[y]).
This restricts usage somewhat. How can I get away with this? Is there a way to do this or do I have to change the grammar?
Finally, a solution to this problem is quite simple: just expect a list of array access. If the list is empty, then return the initial expression, otherwise fold over all the array accesses and return the result. Here is the implementation:
let rec pExpr =
parse {
let! exp = opp.ExpressionParser
let pArrayAccess =
between (skipString ".[") (skipString "]") pExpr
match! many pArrayAccess with
| [] -> return exp
| xs -> return List.fold
(fun acc curr -> ArrayAccess(acc, curr)) exp xs
}
This way of doing things meets my needs, so I'd be happy with it, if anyone passes by and wants something more general and not applicable with the proposed solution, then I refer to #Martin Freedman comment, using createParserForwardedToRef().
NOTE: Not long ago, I had already asked a similar question. This is not a duplication, but the clarifications to be requested did not fall within the scope of the subject itself. I therefore allow myself to open another position dealing with the analysis of an ML-like syntax based on indentation, and considering everything as an instruction / expression.
For example:
"Hello" is an expression,
let foo = 2 + 1 is an instruction using an expression (2 + 1),
print foo is an instruction, ...
In short, a syntax and semantics that is quite modular and dynamic. Like F#, or OCaml.
To do this, I use F#, with the API (available on nuget) FParsec. The FParsec wiki provides an example of a syntax based on indentation, so I have taken it up again. The module in the code below used is IndentationParserWithoutBacktracking.
The example code to be parsed uses an elementary indentation, not mixing "literal" and "instructions/expressions":
loop i 1 10
loop k 1 10
print k
print i
print j
A simple code, and without context (but this is not important at the moment).
My implementation allows codes such as:
let foo = a + b
let foo =
let a = 9
let b = 1
a + b
let foo = 7
let foo =
loop i 1 10
print i
For example. (The loop and print are there just for the tests...)
The problem I have been having for a long week now, and that I can't solve, is the fact that the indentation module asks me every time an instruction is expected in a parser for a new line... Here is a screenshot:
This applies to all the examples mentioned above. I don't really understand the problem, and therefore don't know how to solve it.
Here is the code tested for this question, it meets the minimum and functional code criteria, however, FParsec must be used:
open FParsec
// This module come from 'https://github.com/stephan-tolksdorf/fparsec/wiki/Parsing-indentation-based-syntax-with-FParsec'
// I used the second module: 'IndentationParserWithoutBacktracking'
module IndentationParserWithoutBacktracking =
let tabStopDistance = 8
type LastParsedIndentation() =
[<DefaultValue>]
val mutable Value: int32
[<DefaultValue>]
val mutable EndIndex: int64
type UserState =
{Indentation: int
// We put LastParsedIndentation into the UserState so that we
// can conveniently use a separate instance for each stream.
// The members of the LastParsedIndentation instance will be mutated
// directly and hence won't be affected by any stream backtracking.
LastParsedIndentation: LastParsedIndentation}
with
static member Create() = {Indentation = -1
LastParsedIndentation = LastParsedIndentation(EndIndex = -1L)}
type CharStream = CharStream<UserState>
type Parser<'t> = Parser<'t, UserState>
// If this function is called at the same index in the stream
// where the function previously stopped, then the previously
// returned indentation will be returned again.
// This way we can avoid backtracking at the end of indented blocks.
let skipIndentation (stream: CharStream) =
let lastParsedIndentation = stream.UserState.LastParsedIndentation
if lastParsedIndentation.EndIndex = stream.Index then
lastParsedIndentation.Value
else
let mutable indentation = stream.SkipNewlineThenWhitespace(tabStopDistance, false)
lastParsedIndentation.EndIndex <- stream.Index
lastParsedIndentation.Value <- indentation
indentation
let indentedMany1 (p: Parser<'t>) label : Parser<'t list> =
fun stream ->
let oldIndentation = stream.UserState.Indentation
let indentation = skipIndentation stream
if indentation <= oldIndentation then
Reply(Error, expected (if indentation < 0 then "newline" else "indented " + label))
else
stream.UserState <- {stream.UserState with Indentation = indentation}
let results = ResizeArray()
let mutable stateTag = stream.StateTag
let mutable reply = p stream // parse the first element
let mutable newIndentation = 0
while reply.Status = Ok
&& (results.Add(reply.Result)
newIndentation <- skipIndentation stream
newIndentation = indentation)
do
stateTag <- stream.StateTag
reply <- p stream
if reply.Status = Ok
|| (stream.IsEndOfStream && results.Count > 0 && stream.StateTag = stateTag)
then
if newIndentation < indentation || stream.IsEndOfStream then
stream.UserState <- {stream.UserState with Indentation = oldIndentation}
Reply(List.ofSeq results)
else
Reply(Error, messageError "wrong indentation")
else // p failed
Reply(reply.Status, reply.Error)
open IndentationParserWithoutBacktracking
let isBlank = fun c -> c = ' ' || c = '\t'
let ws = spaces
let ws1 = skipMany1SatisfyL isBlank "whitespace"
let str s = pstring s .>> ws
let keyword str = pstring str >>? nextCharSatisfiesNot (fun c -> isLetter c || isDigit c) <?> str
// AST
type Identifier = Identifier of string
// A value is just a literal or a data name, called here "Variable"
type Value =
| Int of int | Float of float
| Bool of bool | String of string
| Char of char | Variable of Identifier
// All is an instruction, but there are some differences:
type Instr =
// Arithmetic
| Literal of Value | Infix of Instr * InfixOp * Instr
// Statements (instructions needing another instructions)
| Let of Identifier * Instr list
| Loop of Identifier * Instr * Instr * Instr list
// Other - the "print" function, from the link seen above
| Print of Identifier
and InfixOp =
// Arithmetic
| Sum | Sub | Mul | Div
// Logic
| And | Or | Equal | NotEqual | Greater | Smaller | GreaterEqual | SmallerEqual
// Literals
let numberFormat = NumberLiteralOptions.AllowMinusSign ||| NumberLiteralOptions.AllowFraction |||
NumberLiteralOptions.AllowHexadecimal ||| NumberLiteralOptions.AllowOctal |||
NumberLiteralOptions.AllowBinary
let literal_numeric =
numberLiteral numberFormat "number" |>> fun nl ->
if nl.IsInteger then Literal (Int(int nl.String))
else Literal (Float(float nl.String))
let literal_bool =
(choice [
(stringReturn "true" (Literal (Bool true)))
(stringReturn "false" (Literal (Bool false)))
]
.>> ws) <?> "boolean"
let literal_string =
(between (pstring "\"") (pstring "\"") (manyChars (satisfy (fun c -> c <> '"')))
|>> fun s -> Literal (String s)) <?> "string"
let literal_char =
(between (pstring "'") (pstring "'") (satisfy (fun c -> c <> '''))
|>> fun c -> Literal (Char c)) <?> "character"
let identifier =
(many1Satisfy2L isLetter (fun c -> isLetter c || isDigit c) "identifier"
|>> Identifier) <?> "identifier"
let betweenParentheses p =
(between (str "(") (str ")") p) <?> ""
let variable = identifier |>> fun id -> Literal (Variable id)
let literal = (attempt literal_numeric <|>
attempt literal_bool <|>
attempt literal_char <|>
attempt literal_string <|>
attempt variable)
// Instressions and statements
let pInstrs, pInstrimpl = createParserForwardedToRef()
// `ploop` is located here to force `pInstrs` to be of the type `Instr list`, `ploop` requesting an instression list.
let ploop =
pipe4
(keyword "loop" >>. ws1 >>. identifier)
(ws1 >>. literal)
(ws1 >>. literal)
(pInstrs)
(fun id min max stmts -> Loop(id, min, max, stmts))
// `singlepInstr` allows to use only one Instression, used just after.
let singlepInstr =
pInstrs |>> fun ex -> ex.Head
let term =
(ws >>. singlepInstr .>> ws) <|>
(betweenParentheses (ws >>. singlepInstr)) <|>
(ws >>. literal .>> ws) <|>
(betweenParentheses (ws >>. literal))
let infixOperator (p: OperatorPrecedenceParser<_, _, _>) op prec map =
p.AddOperator(InfixOperator(op, ws, prec, Associativity.Left, map))
let ops =
// Arithmetic
[ "+"; "-"; "*"; "/"; "%" ] #
// Logical
[ "&&"; "||"; "=="; "!="; ">"; "<"; ">="; "<=" ]
let opCorrespondance op =
match op with
// Arithmetic operators
| "+" -> Sum | "-" -> Sub
| "*" -> Mul | "/" -> Div
// Logical operators
| "&&" -> And | "||" -> Or
| "==" -> Equal | "!=" -> NotEqual
| ">" -> Greater | "<" -> Smaller
| ">=" -> GreaterEqual | "<=" -> SmallerEqual
| _ -> failwith ("Unknown operator: " + op)
let opParser = new OperatorPrecedenceParser<Instr, unit, UserState>()
for op in ops do
infixOperator opParser op 1 (fun x y -> Infix(x, opCorrespondance op, y))
opParser.TermParser <- term
// Statements
(*
- let:
let <identifier> = <instruction(s) / value>
- print:
print <identifier>
- loop:
loop <identifier> <literal> <literal> <indented statements>
*)
let plet =
pipe2
(keyword "let" >>. ws1 >>. identifier)
(ws >>. str "=" >>. ws >>. pInstrs)
(fun id exp -> Let(id, exp))
let print =
keyword "print" >>. ws1 >>. identifier
|>> Print
let instruction =
print <|> ploop <|> plet <|>
opParser.ExpressionParser <|>
literal
pInstrimpl := indentedMany1 instruction "instruction"
let document = pInstrs .>> spaces .>> eof
let test str =
match runParserOnString document (UserState.Create()) "" str with
| Success(result, _, _) -> printfn "%A" result
| Failure(errorMsg, _, _) -> printfn "%s" errorMsg
System.Console.Clear()
let code = test #"
let foo = a + b
"
I would like to understand first of all why it doesn't work, but also to be able to find a solution to my problem, and that this solution can be extended to the potential syntax additions of the parser.
Awaiting a salutary answer, thank you.
In order to understand why your parser doesn't work, you need to isolate the issues.
If I understand you correctly, you want your let parser to support either a single instruction on the same line or indented instructions on subsequent lines, e.g:
let x = instruction
let b =
instruction
instruction
If you can't get your existing implementation to work, I'd recommend going back to the implementation on the Wiki and trying to just add support for the let statement.
For example, I made the Wiki parser accept simple let statements with the following modifications:
type Statement = Loop of Identifier * int * int * Statement list
| Print of Identifier
| Let of Identifier * Statement list
let ws = skipManySatisfy isBlank
let str s = pstring s .>> ws
let statement, statementRef = createParserForwardedToRef()
let indentedStatements = indentedMany1 statement "statement"
let plet = keyword "let" >>. pipe2 (ws1 >>. identifier)
(ws >>. str "=" >>. ws
>>. (indentedStatements
<|> (statement |>> fun s -> [s])))
(fun id exp -> Let(id, exp))
statementRef := print <|> loop <|> plet
Note that in the modified version statement is now the parser forwarded to a ref cell, not indentedStatements.
Note also that ws is not implemented with spaces, like in your parser. This is important because spaces also consumes newlines, which would prevent the indentedMany1 from seeing the newline and properly calculating the indentation.
The reason your parser produced an "Expecting: newline" error is that indentedMany1 needs a newline at the beginning of the indented sequence in order to be able to calculate the indentation. You would have to modify the implementation of indentedMany1 if you wanted to support e.g. the following indentation pattern:
let x = instruction
instruction
instruction
I'm trying to use FParsec to parse a TOML multi-line string, and I'm having trouble with the closing delimiter ("""). I have the following parsers:
let controlChars =
['\u0000'; '\u0001'; '\u0002'; '\u0003'; '\u0004'; '\u0005'; '\u0006'; '\u0007';
'\u0008'; '\u0009'; '\u000a'; '\u000b'; '\u000c'; '\u000d'; '\u000e'; '\u000f';
'\u0010'; '\u0011'; '\u0012'; '\u0013'; '\u0014'; '\u0015'; '\u0016'; '\u0017';
'\u0018'; '\u0019'; '\u001a'; '\u001b'; '\u001c'; '\u001d'; '\u001e'; '\u001f';
'\u007f']
let nonSpaceCtrlChars =
Set.difference (Set.ofList controlChars) (Set.ofList ['\n';'\r';'\t'])
let multiLineStringContents : Parser<char,unit> =
satisfy (isNoneOf nonSpaceCtrlChars)
let multiLineString : Parser<string,unit> =
optional newline >>. manyCharsTill multiLineStringContents (pstring "\"\"\"")
|> between (pstring "\"\"\"") (pstring "\"\"\"")
let test parser str =
match run parser str with
| Success (s1, s2, s3) -> printfn "Ok: %A %A %A" s1 s2 s3
| Failure (f1, f2, f3) -> printfn "Fail: %A %A %A" f1 f2 f3
When I test multiLineString against an input like this:
test multiLineString "\"\"\"x\"\"\""
The parser fails with this error:
Fail: "Error in Ln: 1 Col: 8 """x"""
^ Note: The error occurred at the end of the input stream. Expecting: '"""'
I'm confused by this. Wouldn't the manyCharsTill multiLineStringContents (pstring "\"\"\"") parser stop at the """ for the between parser to find it? Why is the parser eating all the input and then failing the between parser?
This seems like a relevant post: How to parse comments with FParsec
But I don't see how the solution to that one differs from what I'm doing here, really.
The manyCharsTill documentation says (emphasis mine):
manyCharsTill cp endp parses chars with the char parser cp until the parser endp succeeds. It stops after endp and returns the parsed chars as a string.
So you don't want to use between in combination with manyCharsTill; you want to do something like pstring "\"\"\"" >>. manyCharsTill (pstring "\"\"\"").
But as it happens, I can save you a lot of work. I've been working on a TOML parser with FParsec myself in my spare time. It's far from complete, but the string part works and handles backslash escapes correctly (as far as I can tell: I've tested thoroughly but not exhaustively). The only thing I'm missing is the "strip first newline if it appears right after the opening delimiter" rule, which you've handled with optional newline. So just add that bit into my code below and you should have a working TOML string parser.
BTW, I am planning to license my code (if I finish it) under the MIT license. So I hereby release the following code block under the MIT license. Feel free to use it in your project if it's useful to you.
let pShortCodepointInHex = // Anything from 0000 to FFFF, *except* the range D800-DFFF
(anyOf "dD" >>. (anyOf "01234567" <?> "a Unicode scalar value (range D800-DFFF not allowed)") .>>. exactly 2 isHex |>> fun (c,s) -> sprintf "d%c%s" c s)
<|> (exactly 4 isHex <?> "a Unicode scalar value")
let pLongCodepointInHex = // Anything from 00000000 to 0010FFFF, *except* the range D800-DFFF
(pstring "0000" >>. pShortCodepointInHex)
<|> (pstring "000" >>. exactly 5 isHex)
<|> (pstring "0010" >>. exactly 4 isHex |>> fun s -> "0010" + s)
<?> "a Unicode scalar value (i.e., in range 00000000 to 0010FFFF)"
let toCharOrSurrogatePair p =
p |> withSkippedString (fun codePoint _ -> System.Int32.Parse(codePoint, System.Globalization.NumberStyles.HexNumber) |> System.Char.ConvertFromUtf32)
let pStandardBackslashEscape =
anyOf "\\\"bfnrt"
|>> function
| 'b' -> "\b" // U+0008 BACKSPACE
| 'f' -> "\u000c" // U+000C FORM FEED
| 'n' -> "\n" // U+000A LINE FEED
| 'r' -> "\r" // U+000D CARRIAGE RETURN
| 't' -> "\t" // U+0009 CHARACTER TABULATION a.k.a. Tab or Horizonal Tab
| c -> string c
let pUnicodeEscape = (pchar 'u' >>. (pShortCodepointInHex |> toCharOrSurrogatePair))
<|> (pchar 'U' >>. ( pLongCodepointInHex |> toCharOrSurrogatePair))
let pEscapedChar = pstring "\\" >>. (pStandardBackslashEscape <|> pUnicodeEscape)
let quote = pchar '"'
let isBasicStrChar c = c <> '\\' && c <> '"' && c > '\u001f' && c <> '\u007f'
let pBasicStrChars = manySatisfy isBasicStrChar
let pBasicStr = stringsSepBy pBasicStrChars pEscapedChar |> between quote quote
let pEscapedNewline = skipChar '\\' .>> skipNewline .>> spaces
let isMultilineStrChar c = c = '\n' || isBasicStrChar c
let pMultilineStrChars = manySatisfy isMultilineStrChar
let pTripleQuote = pstring "\"\"\""
let pMultilineStr = stringsSepBy pMultilineStrChars (pEscapedChar <|> (notFollowedByString "\"\"\"" >>. pstring "\"")) |> between pTripleQuote pTripleQuote
#rmunn provided a correct answer, thanks! I also solved this in a slightly different way after playing with the FParsec API a bit more. As explained in the other answer, The endp argument to manyCharTill was eating the closing """, so I needed to switch to something that wouldn't do that. A simple modification using lookAhead did the trick:
let multiLineString : Parser<string,unit> =
optional newline >>. manyCharsTill multiLineStringContents (lookAhead (pstring "\"\"\""))
|> between (pstring "\"\"\"") (pstring "\"\"\"")
I wish to parse a string in to a recursive data structure using F#. In this question I'm going to present a simplified example that cuts to the core of what I want to do.
I want to parse a string of nested square brackets in to the record type:
type Bracket = | Bracket of Bracket option
So:
"[]" -> Bracket None
"[[]]" -> Bracket ( Some ( Bracket None) )
"[[[]]]" -> Bracket ( Some ( Bracket ( Some ( Bracket None) ) ) )
I would like to do this using the parser combinators in the FParsec library. Here is what I have so far:
let tryP parser =
parser |>> Some
<|>
preturn None
/// Parses up to nesting level of 3
let parseBrakets : Parser<_> =
let mostInnerLevelBracket =
pchar '['
.>> pchar ']'
|>> fun _ -> Bracket None
let secondLevelBracket =
pchar '['
>>. tryP mostInnerLevelBracket
.>> pchar ']'
|>> Bracket
let firstLevelBracket =
pchar '['
>>. tryP secondLevelBracket
.>> pchar ']'
|>> Bracket
firstLevelBracket
I even have some Expecto tests:
open Expecto
[<Tests>]
let parserTests =
[ "[]", Bracket None
"[[]]", Bracket (Some (Bracket None))
"[[[]]]", Bracket ( Some (Bracket (Some (Bracket None)))) ]
|> List.map(fun (str, expected) ->
str
|> sprintf "Trying to parse %s"
|> testCase
<| fun _ ->
match run parseBrakets str with
| Success (x, _,_) -> Expect.equal x expected "These should have been equal"
| Failure (m, _,_) -> failwithf "Expected a match: %s" m
)
|> testList "Bracket tests"
let tests =
[ parserTests ]
|> testList "Tests"
runTests defaultConfig tests
The problem is of course how to handle and arbitrary level of nesting - the code above only works for up to 3 levels. The code I would like to write is:
let rec pNestedBracket =
pchar '['
>>. tryP pNestedBracket
.>> pchar ']'
|>> Bracket
But F# doesn't allow this.
Am I barking up the wrong tree completely with how to solve this (I understand that there are easier ways to solve this particular problem)?
You are looking for FParsecs createParserForwardedToRef method. Because parsers are values and not functions it is impossible to make mutually recursive or self recursive parsers in order to do this you have to in a sense declare a parser before you define it.
Your final code will end up looking something like this
let bracketParser, bracketParserRef = createParserForwardedToRef<Bracket>()
bracketParserRef := ... //here you can finally declare your parser
//you can reference bracketParser which is a parser that uses the bracketParserRef
Also I would recommend this article for basic understanding of parser combinators. https://fsharpforfunandprofit.com/posts/understanding-parser-combinators/. The final section on a JSON parser talks about the createParserForwardedToRef method.
As an example of how to use createParserForwardedToRef, here's a snippet from a small parser I wrote recently. It parses lists of space-separated integers between brackets (and the lists can be nested), and the "integers" can be small arithmetic expressions like 1+2 or 3*5.
type ListItem =
| Int of int
| List of ListItem list
let pexpr = // ... omitted for brevity
let plist,plistImpl = createParserForwardedToRef()
let pListContents = (many1 (plist |>> List .>> spaces)) <|>
(many (pexpr |>> Int .>> spaces))
plistImpl := pchar '[' >>. spaces
>>. pListContents
.>> pchar ']'
P.S. I would have put this as a comment to Thomas Devries's answer, but a comment can't contain nicely-formatted code. Go ahead and accept his answer; mine is just intended to flesh his out.
I'm trying to learn Parsec by implementing a small regular expression parser. In BNF, my grammar looks something like:
EXP : EXP *
| LIT EXP
| LIT
I've tried to implement this in Haskell as:
expr = try star
<|> try litE
<|> lit
litE = do c <- noneOf "*"
rest <- expr
return (c : rest)
lit = do c <- noneOf "*"
return [c]
star = do content <- expr
char '*'
return (content ++ "*")
There are some infinite loops here though (e.g. expr -> star -> expr without consuming any tokens) which makes the parser loop forever. I'm not really sure how to fix it though, because the very nature of star is that it consumes its mandatory token at the end.
Any thoughts?
You should use Parsec.Expr.buildExprParser; it is ideal for this purpose. You simply describe your operators, their precedence and associativity, and how to parse an atom, and the combinator builds the parser for you!
You probably also want to add the ability to group terms with parens so that you can apply * to more than just a single literal.
Here's my attempt (I threw in |, +, and ? for good measure):
import Control.Applicative
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr
data Term = Literal Char
| Sequence [Term]
| Repeat (Int, Maybe Int) Term
| Choice [Term]
deriving ( Show )
term :: Parser Term
term = buildExpressionParser ops atom where
ops = [ [ Postfix (Repeat (0, Nothing) <$ char '*')
, Postfix (Repeat (1, Nothing) <$ char '+')
, Postfix (Repeat (0, Just 1) <$ char '?')
]
, [ Infix (return sequence) AssocRight
]
, [ Infix (choice <$ char '|') AssocRight
]
]
atom = msum [ Literal <$> lit
, parens term
]
lit = noneOf "*+?|()"
sequence a b = Sequence $ (seqTerms a) ++ (seqTerms b)
choice a b = Choice $ (choiceTerms a) ++ (choiceTerms b)
parens = between (char '(') (char ')')
seqTerms (Sequence ts) = ts
seqTerms t = [t]
choiceTerms (Choice ts) = ts
choiceTerms t = [t]
main = parseTest term "he(llo)*|wor+ld?"
Your grammar is left-recursive, which doesn’t play nice with try, as Parsec will repeatedly backtrack. There are a few ways around this. Probably the simplest is just making the * optional in another rule:
lit :: Parser (Char, Maybe Char)
lit = do
c <- noneOf "*"
s <- optionMaybe $ char '*'
return (c, s)
Of course, you’ll probably end up wrapping things in a data type anyway, and there are a lot of ways to go about it. Here’s one, off the top of my head:
import Control.Applicative ((<$>))
data Term = Literal Char
| Sequence [Term]
| Star Term
expr :: Parser Term
expr = Sequence <$> many term
term :: Parser Term
term = do
c <- lit
s <- optionMaybe $ char '*' -- Easily extended for +, ?, etc.
return $ if isNothing s
then Literal c
else Star $ Literal c
Maybe a more experienced Haskeller will come along with a better solution.