Using FParsec to parse possibly malformed input

Using FParsec to parse possibly malformed input - parsing

I'm writing a parser for a specific file format using FParsec as a firstish foaray into learning fsharp. Part of the file has the following format
{ 123 456 789 333 }
Where the numbers in the brackets are pairs of values and there can be an arbitrary number of spaces to separate them. So these would also be valid things to parse:
{ 22 456 7 333 }
And of course the content of the brackets might be empty, i.e. {}
In addition I want the parser to be able to handle the case where the content is a bit malformed, eg. { some descriptive text } or maybe more likely { 12 3 4} (invalid since the 4 wouldn't be paired with anything). In this case I just want the contents saved to be processed separately.
I have this so far:
type DimNummer = int
type ObjektNummer = int
type DimObjektPair = DimNummer * ObjektNummer
type ObjektListResult = Result<DimObjektPair list, string>
let sieObjektLista =
let pnum = numberLiteral NumberLiteralOptions.None "dimOrObj"
let ws = spaces
let pobj = pnum .>> ws |>> fun x ->
let on: ObjektNummer = int x.String
on
let pdim = pnum |>> fun x ->
let dim: DimNummer = int x.String
dim
let pdimObj = (pdim .>> spaces1) .>>. pobj |>> DimObjektPair
let toObjektLista(objList:list<DimObjektPair>) =
let res: ObjektListResult = Result.Ok objList
res
let pdimObjs = sepBy pdimObj spaces1
let validList = pdimObjs |>> toObjektLista
let toInvalid(str:string) =
let res: ObjektListResult =
match str.Trim(' ') with
| "" -> Result.Ok []
| _ -> Result.Error str
res
let invalidList = manyChars anyChar |>> toInvalid
let pres = between (pchar '{') (pchar '}') (ws >>. (validList <|> invalidList) .>> ws)
pres
let parseSieObjektLista = run sieObjektLista
However running this on a valid sample I get an error:
{ 53735 7785 86231 36732 }
^
Expecting: whitespace or '}'

You're trying to consume too many spaces.
Look: pdimObj is a pdim, followed by some spaces, followed by pobj, which is itself a pnum followed by some spaces. So if you look at the first part of the input:
{ 53735 7785 86231 36732 }
\___/\______/\__/\/
^ ^ ^ ^
| | | |
pnum | | |
^ spaces1 | |
| | ws
pdim pnum ^
^ ^ |
| \ /
| \ /
| \/
\ pobj
\ /
\________/
^
|
pdimObj
One can clearly see from here that pdimObj consumes everything up to 86231, including the space just before it. And therefore, when sepBy inside pdimObjs looks for the next separator (which is spaces1), it can't find any. So it fails.
The smallest way to fix this is to make pdimObjs use many instead of sepBy: since pobj already consumes trailing spaces, there is no need to also consume them in sepBy:
let pdimObjs = many pdimObj
But a cleaner way, in my opinion, would be to remove ws from pobj, because, intuitively, trailing spaces aren't part of the number representing your object (whatever that is), and instead handle possible trailing spaces in pdimObjs via sepEndBy:
let pobj = pnum |>> fun x ->
let on: ObjektNummer = int x.String
on
...
let pdimObjs = sepEndBy pdimObj spaces1

The main problem here is in pdimObjs. The sepBy parser fails because the separator spaces following each number have already been consumed by pobj, so spaces1 cannot succeed. Instead, I suggest you try this:
let pdimObjs = many pdimObj
Which gives the following result on your test input:
Success: Ok [(53735, 7785); (86231, 36732)]

Related

Parsing an ML-like syntax based on indentation, and everything considered to be an instruction/expression

NOTE: Not long ago, I had already asked a similar question. This is not a duplication, but the clarifications to be requested did not fall within the scope of the subject itself. I therefore allow myself to open another position dealing with the analysis of an ML-like syntax based on indentation, and considering everything as an instruction / expression.
For example:
"Hello" is an expression,
let foo = 2 + 1 is an instruction using an expression (2 + 1),
print foo is an instruction, ...
In short, a syntax and semantics that is quite modular and dynamic. Like F#, or OCaml.
To do this, I use F#, with the API (available on nuget) FParsec. The FParsec wiki provides an example of a syntax based on indentation, so I have taken it up again. The module in the code below used is IndentationParserWithoutBacktracking.
The example code to be parsed uses an elementary indentation, not mixing "literal" and "instructions/expressions":
loop i 1 10
loop k 1 10
print k
print i
print j
A simple code, and without context (but this is not important at the moment).
My implementation allows codes such as:
let foo = a + b
let foo =
let a = 9
let b = 1
a + b
let foo = 7
let foo =
loop i 1 10
print i
For example. (The loop and print are there just for the tests...)
The problem I have been having for a long week now, and that I can't solve, is the fact that the indentation module asks me every time an instruction is expected in a parser for a new line... Here is a screenshot:
This applies to all the examples mentioned above. I don't really understand the problem, and therefore don't know how to solve it.
Here is the code tested for this question, it meets the minimum and functional code criteria, however, FParsec must be used:
open FParsec
// This module come from 'https://github.com/stephan-tolksdorf/fparsec/wiki/Parsing-indentation-based-syntax-with-FParsec'
// I used the second module: 'IndentationParserWithoutBacktracking'
module IndentationParserWithoutBacktracking =
let tabStopDistance = 8
type LastParsedIndentation() =
[<DefaultValue>]
val mutable Value: int32
[<DefaultValue>]
val mutable EndIndex: int64
type UserState =
{Indentation: int
// We put LastParsedIndentation into the UserState so that we
// can conveniently use a separate instance for each stream.
// The members of the LastParsedIndentation instance will be mutated
// directly and hence won't be affected by any stream backtracking.
LastParsedIndentation: LastParsedIndentation}
with
static member Create() = {Indentation = -1
LastParsedIndentation = LastParsedIndentation(EndIndex = -1L)}
type CharStream = CharStream<UserState>
type Parser<'t> = Parser<'t, UserState>
// If this function is called at the same index in the stream
// where the function previously stopped, then the previously
// returned indentation will be returned again.
// This way we can avoid backtracking at the end of indented blocks.
let skipIndentation (stream: CharStream) =
let lastParsedIndentation = stream.UserState.LastParsedIndentation
if lastParsedIndentation.EndIndex = stream.Index then
lastParsedIndentation.Value
else
let mutable indentation = stream.SkipNewlineThenWhitespace(tabStopDistance, false)
lastParsedIndentation.EndIndex <- stream.Index
lastParsedIndentation.Value <- indentation
indentation
let indentedMany1 (p: Parser<'t>) label : Parser<'t list> =
fun stream ->
let oldIndentation = stream.UserState.Indentation
let indentation = skipIndentation stream
if indentation <= oldIndentation then
Reply(Error, expected (if indentation < 0 then "newline" else "indented " + label))
else
stream.UserState <- {stream.UserState with Indentation = indentation}
let results = ResizeArray()
let mutable stateTag = stream.StateTag
let mutable reply = p stream // parse the first element
let mutable newIndentation = 0
while reply.Status = Ok
&& (results.Add(reply.Result)
newIndentation <- skipIndentation stream
newIndentation = indentation)
do
stateTag <- stream.StateTag
reply <- p stream
if reply.Status = Ok
|| (stream.IsEndOfStream && results.Count > 0 && stream.StateTag = stateTag)
then
if newIndentation < indentation || stream.IsEndOfStream then
stream.UserState <- {stream.UserState with Indentation = oldIndentation}
Reply(List.ofSeq results)
else
Reply(Error, messageError "wrong indentation")
else // p failed
Reply(reply.Status, reply.Error)
open IndentationParserWithoutBacktracking
let isBlank = fun c -> c = ' ' || c = '\t'
let ws = spaces
let ws1 = skipMany1SatisfyL isBlank "whitespace"
let str s = pstring s .>> ws
let keyword str = pstring str >>? nextCharSatisfiesNot (fun c -> isLetter c || isDigit c) <?> str
// AST
type Identifier = Identifier of string
// A value is just a literal or a data name, called here "Variable"
type Value =
| Int of int | Float of float
| Bool of bool | String of string
| Char of char | Variable of Identifier
// All is an instruction, but there are some differences:
type Instr =
// Arithmetic
| Literal of Value | Infix of Instr * InfixOp * Instr
// Statements (instructions needing another instructions)
| Let of Identifier * Instr list
| Loop of Identifier * Instr * Instr * Instr list
// Other - the "print" function, from the link seen above
| Print of Identifier
and InfixOp =
// Arithmetic
| Sum | Sub | Mul | Div
// Logic
| And | Or | Equal | NotEqual | Greater | Smaller | GreaterEqual | SmallerEqual
// Literals
let numberFormat = NumberLiteralOptions.AllowMinusSign ||| NumberLiteralOptions.AllowFraction |||
NumberLiteralOptions.AllowHexadecimal ||| NumberLiteralOptions.AllowOctal |||
NumberLiteralOptions.AllowBinary
let literal_numeric =
numberLiteral numberFormat "number" |>> fun nl ->
if nl.IsInteger then Literal (Int(int nl.String))
else Literal (Float(float nl.String))
let literal_bool =
(choice [
(stringReturn "true" (Literal (Bool true)))
(stringReturn "false" (Literal (Bool false)))
]
.>> ws) <?> "boolean"
let literal_string =
(between (pstring "\"") (pstring "\"") (manyChars (satisfy (fun c -> c <> '"')))
|>> fun s -> Literal (String s)) <?> "string"
let literal_char =
(between (pstring "'") (pstring "'") (satisfy (fun c -> c <> '''))
|>> fun c -> Literal (Char c)) <?> "character"
let identifier =
(many1Satisfy2L isLetter (fun c -> isLetter c || isDigit c) "identifier"
|>> Identifier) <?> "identifier"
let betweenParentheses p =
(between (str "(") (str ")") p) <?> ""
let variable = identifier |>> fun id -> Literal (Variable id)
let literal = (attempt literal_numeric <|>
attempt literal_bool <|>
attempt literal_char <|>
attempt literal_string <|>
attempt variable)
// Instressions and statements
let pInstrs, pInstrimpl = createParserForwardedToRef()
// `ploop` is located here to force `pInstrs` to be of the type `Instr list`, `ploop` requesting an instression list.
let ploop =
pipe4
(keyword "loop" >>. ws1 >>. identifier)
(ws1 >>. literal)
(ws1 >>. literal)
(pInstrs)
(fun id min max stmts -> Loop(id, min, max, stmts))
// `singlepInstr` allows to use only one Instression, used just after.
let singlepInstr =
pInstrs |>> fun ex -> ex.Head
let term =
(ws >>. singlepInstr .>> ws) <|>
(betweenParentheses (ws >>. singlepInstr)) <|>
(ws >>. literal .>> ws) <|>
(betweenParentheses (ws >>. literal))
let infixOperator (p: OperatorPrecedenceParser<_, _, _>) op prec map =
p.AddOperator(InfixOperator(op, ws, prec, Associativity.Left, map))
let ops =
// Arithmetic
[ "+"; "-"; "*"; "/"; "%" ] #
// Logical
[ "&&"; "||"; "=="; "!="; ">"; "<"; ">="; "<=" ]
let opCorrespondance op =
match op with
// Arithmetic operators
| "+" -> Sum | "-" -> Sub
| "*" -> Mul | "/" -> Div
// Logical operators
| "&&" -> And | "||" -> Or
| "==" -> Equal | "!=" -> NotEqual
| ">" -> Greater | "<" -> Smaller
| ">=" -> GreaterEqual | "<=" -> SmallerEqual
| _ -> failwith ("Unknown operator: " + op)
let opParser = new OperatorPrecedenceParser<Instr, unit, UserState>()
for op in ops do
infixOperator opParser op 1 (fun x y -> Infix(x, opCorrespondance op, y))
opParser.TermParser <- term
// Statements
(*
- let:
let <identifier> = <instruction(s) / value>
- print:
print <identifier>
- loop:
loop <identifier> <literal> <literal> <indented statements>
*)
let plet =
pipe2
(keyword "let" >>. ws1 >>. identifier)
(ws >>. str "=" >>. ws >>. pInstrs)
(fun id exp -> Let(id, exp))
let print =
keyword "print" >>. ws1 >>. identifier
|>> Print
let instruction =
print <|> ploop <|> plet <|>
opParser.ExpressionParser <|>
literal
pInstrimpl := indentedMany1 instruction "instruction"
let document = pInstrs .>> spaces .>> eof
let test str =
match runParserOnString document (UserState.Create()) "" str with
| Success(result, _, _) -> printfn "%A" result
| Failure(errorMsg, _, _) -> printfn "%s" errorMsg
System.Console.Clear()
let code = test #"
let foo = a + b
"
I would like to understand first of all why it doesn't work, but also to be able to find a solution to my problem, and that this solution can be extended to the potential syntax additions of the parser.
Awaiting a salutary answer, thank you.

In order to understand why your parser doesn't work, you need to isolate the issues.
If I understand you correctly, you want your let parser to support either a single instruction on the same line or indented instructions on subsequent lines, e.g:
let x = instruction
let b =
instruction
instruction
If you can't get your existing implementation to work, I'd recommend going back to the implementation on the Wiki and trying to just add support for the let statement.
For example, I made the Wiki parser accept simple let statements with the following modifications:
type Statement = Loop of Identifier * int * int * Statement list
| Print of Identifier
| Let of Identifier * Statement list
let ws = skipManySatisfy isBlank
let str s = pstring s .>> ws
let statement, statementRef = createParserForwardedToRef()
let indentedStatements = indentedMany1 statement "statement"
let plet = keyword "let" >>. pipe2 (ws1 >>. identifier)
(ws >>. str "=" >>. ws
>>. (indentedStatements
<|> (statement |>> fun s -> [s])))
(fun id exp -> Let(id, exp))
statementRef := print <|> loop <|> plet
Note that in the modified version statement is now the parser forwarded to a ref cell, not indentedStatements.
Note also that ws is not implemented with spaces, like in your parser. This is important because spaces also consumes newlines, which would prevent the indentedMany1 from seeing the newline and properly calculating the indentation.
The reason your parser produced an "Expecting: newline" error is that indentedMany1 needs a newline at the beginning of the indented sequence in order to be able to calculate the indentation. You would have to modify the implementation of indentedMany1 if you wanted to support e.g. the following indentation pattern:
let x = instruction
instruction
instruction

Use FParsec to parse a self-describing input

I'm using FParsec to parse an input that describes its own format. For example, consider this input:
int,str,int:4,'hello',3
The first part of the input (before the colon) describes the format of the second part of the input. In this case, the format is int, str, int, which means that the actual data consists of three comma-separated values of the given types, so the result should be 4, "hello", 3.
What is the best way to parse something like this with FParsec?
I've pasted my best effort below, but I'm not happy with it. Is there a better way to do this that is cleaner, less stateful, and less reliant on the parse monad? I think this depends on smarter management of UserState, but I don't know how to do it. Thanks.
open FParsec
type State = { Formats : string[]; Index : int32 }
with static member Default = { Formats = [||]; Index = 0 }
type Value =
| Integer of int
| String of string
let parseFormat : Parser<_, State> =
parse {
let! formats =
sepBy
(pstring "int" <|> pstring "str")
(skipString ",")
|>> Array.ofList
do! updateUserState (fun state -> { state with Formats = formats })
}
let parseValue format =
match format with
| "int" -> pint32 |>> Integer
| "str" ->
between
(skipString "'")
(skipString "'")
(manySatisfy (fun c -> c <> '\''))
|>> String
| _ -> failwith "Unexpected"
let parseValueByState =
parse {
let! state = getUserState
let format = state.Formats.[state.Index]
do! setUserState { state with Index = state.Index + 1}
return! parseValue format
}
let parseData =
sepBy
parseValueByState
(skipString ",")
let parse =
parseFormat
>>. skipString ":"
>>. parseData
[<EntryPoint>]
let main argv =
let result = runParserOnString parse State.Default "" "int,str,int:4,'hello',3"
printfn "%A" result
0

There seem to be several problems with the original code, so I took my liberty to rewrite it from scratch.
First, several library functions that may appear useful in other FParsec-related projects:
/// Simple Map
/// usage: let z = Map ["hello" => 1; "bye" => 2]
let (=>) x y = x,y
let makeMap x = new Map<_,_>(x)
/// A handy construct allowing NOT to write lengthy type definitions
/// and also avoid Value Restriction error
type Parser<'t> = Parser<'t, UserState>
/// A list combinator, inspired by FParsec's (>>=) combinator
let (<<+) (p1: Parser<'T list>) (p2: Parser<'T>) =
p1 >>= fun x -> p2 >>= fun y -> preturn (y::x)
/// Runs all parsers listed in the source list;
/// All but the trailing one are also combined with a separator
let allOfSepBy separator parsers : Parser<'T list> =
let rec fold state =
function
| [] -> pzero
| hd::[] -> state <<+ hd
| hd::tl -> fold (state <<+ (hd .>> separator)) tl
fold (preturn []) parsers
|>> List.rev // reverse the list since we appended to the top
Now, the main code. The basic idea is to run parsing in three steps:
Parse out the keys (which are plain ASCII strings)
Map these keys to actual Value parsers
Run these parsers in order
The rest seems to be commented within the code. :)
/// The resulting type
type Output =
| Integer of int
| String of string
/// tag to parser mappings
let mappings =
[
"int" => (pint32 |>> Integer)
"str" => (
manySatisfy (fun c -> c <> '\'')
|> between (skipChar ''') (skipChar ''')
|>> String
)
]
|> makeMap
let myProcess : Parser<Output list> =
let pKeys = // First, we parse out the keys
many1Satisfy isAsciiLower // Parse one key; keys are always ASCII strings
|> sepBy <| (skipChar ',') // many keys separated by comma
.>> (skipChar ':') // all this with trailing semicolon
let pValues = fun keys ->
keys // take the keys list
|> List.map // find the required Value parser
// (NO ERROR CHECK for bad keys)
(fun p -> Map.find p mappings)
|> allOfSepBy (skipChar ',') // they must run in order, comma-separated
pKeys >>= pValues
Run on string: int,int,str,int,str:4,42,'hello',3,'foobar'
Returned: [Integer 4; Integer 42; String "hello"; Integer 3; String "foobar"]

#bytebuster beat me to it but I still post my solution. The technique is similar to #bytebuster.
Thanks for an interesting question.
In compilers I believe the preferred technique is to parse the text into an AST and on that run a type-checker. For this example a potentially simpler technique would be that parsing the type definitions returns a set of parsers for the values. These parsers are then applied on the rest of the string.
open FParsec
type Value =
| Integer of int
| String of string
type ValueParser = Parser<Value, unit>
let parseIntValue : Parser<Value, unit> =
pint32 |>> Integer
let parseStringValue : Parser<Value, unit> =
between
(skipChar '\'')
(skipChar '\'')
(manySatisfy (fun c -> c <> '\''))
<?> "string"
|>> String
let parseValueParser : Parser<ValueParser, unit> =
choice
[
skipString "int" >>% parseIntValue
skipString "str" >>% parseStringValue
]
let parseValueParsers : Parser<ValueParser list, unit> =
sepBy1
parseValueParser
(skipChar ',')
// Runs a list of parsers 'ps' separated by 'sep' parser
let sepByList (ps : Parser<'T, unit> list) (sep : Parser<unit, unit>) : Parser<'T list, unit> =
let rec loop adjust ps =
match ps with
| [] -> preturn []
| h::t ->
adjust h >>= fun v -> loop (fun pp -> sep >>. pp) t >>= fun vs -> preturn (v::vs)
loop id ps
let parseLine : Parser<Value list, unit> =
parseValueParsers .>> skipChar ':' >>= (fun vps -> sepByList vps (skipChar ',')) .>> eof
[<EntryPoint>]
let main argv =
let s = "int,str,int:4,'hello',3"
let r = run parseLine s
printfn "%A" r
0
Parsing int,str,int:4,'hello',3 yields Success: [Integer 4; String "hello";Integer 3].
Parsing int,str,str:4,'hello',3 (incorrect) yields:
Failure:
Error in Ln: 1 Col: 23
int,str,str:4,'hello',3
^
Expecting: string

I rewrote #FuleSnabel's sepByList as follows to help me understand it better. Does this look right?
let sepByList (parsers : Parser<'T, unit> list) (sep : Parser<unit, unit>) : Parser<'T list, unit> =
let rec loop adjust parsers =
parse {
match parsers with
| [] -> return []
| parser :: tail ->
let! value = adjust parser
let! values = loop (fun parser -> sep >>. parser) tail
return value :: values
}
loop id parsers

How to parse a list of tokens with FParsec

I am trying to parse a list of tokens with FParsec, where each token is either a block of text or a tag - for example:
This is a {type of test} test, and it {succeeds or fails}
Here is the parser:
type Parser<'t> = Parser<'t, unit>
type Token =
| Text of string
| Tag of string
let escape fromString toString : Parser<_> =
pstring fromString |>> (fun c -> toString)
let content : Parser<_> =
let contentNormal = many1Satisfy (fun c -> c <> '{' && c <> '}')
let openBraceEscaped = escape "{{" "{"
let closeBraceEscaped = escape "}}" "}"
let contentEscaped = openBraceEscaped <|> closeBraceEscaped
stringsSepBy contentNormal contentEscaped
let ident : Parser<_> =
let isIdentifierFirstChar c = isLetter c || c = '_'
let isIdentifierChar c = isLetter c || isDigit c || c = '_'
spaces >>. many1Satisfy2L isIdentifierFirstChar isIdentifierChar "identifier" .>> spaces
let text = content |>> Text
let tag =
ident |> between (skipString "{") (skipString "}")
|>> Tag
let token = text <|> tag
let tokens = many token .>>. eof
the following tests work:
> run token "abc def" ;;
val it : ParserResult<Token,unit> = Success: Text "abc def"
> run token "{abc def}" ;;
val it : ParserResult<Token,unit> = Success: Tag "abc def"
but trying to run tokens results in an exception:
> run tokens "{abc} def" ;;
System.InvalidOperationException: (Ln: 1, Col: 10): The combinator 'many' was
applied to a parser that succeeds without consuming input and without
changing the parser state in any other way. (If no exception had been raised,
the combinator likely would have entered an infinite loop.)
I've gone over this stackoverflow question but nothing I've tried works. I even added the following, but I get the same exception:
let tokenFwd, tokenRef = createParserForwardedToRef<Token, unit>()
do tokenRef := choice [tag; text]
let readEndOfInput : Parser<unit, unit> = spaces >>. eof
let readExprs = many tokenFwd
let readExprsTillEnd = readExprs .>> readEndOfInput
run readExprsTillEnd "{abc} def" // System.InvalidOperationException ... The combinator 'many' was applied ...
I believe the problem is stringsSepBy in content, but I can't figure out any other way to get a string with the escaped items
Any help would be much appreciated - I have been going through this for a couple days now and can't figure it out.

stringsSepBy accepts zero strings, causing token to accept an empty string, causing many to complain.
I changed it to the following to verify that that was the line you need to work on.
many1 (contentNormal <|> contentEscaped) |>> fun l -> String.concat "" l
Also I got away from stringsSepBy contentNormal contentEscaped, because that says you need to match contentNormals with contentEscapeds in between them. So a{{b}}c is ok, but {{b}}, {{b}}c and a{{b}} will fail.

notEmpty can be used to consume input. If you're not consuming any input but letting the parser succeed then the "current position" of the parser is not moved forward, so when a statement doing that is inside a many it would go into an infinite loop without that exception. stringsSepBy is succeeding and parsing zero elements, you could use notEmpty to fail it if it gets zero elements:
stringsSepBy contentNormal contentEscaped |> notEmpty
Also, I tried to get your full example to parse, the tags can include spaces so you need to allow ident to include spaces to match that:
let isIdentifierChar c = isLetter c || isDigit c || c = '_' || c = ' '
Another little adjustment would be to only return a Token list rather than Token list * unit tuple (unit is the result of eof):
let tokens = many token .>> eof

FParsec: how to combine parsers so that they will be matched in arbitrary order

The task is find particular key-value pairs and parse them. The pairs can occur in any order. My partially working attempt:
open FParsec
type Parser<'a> = Parser<'a, unit>
type Status = Running | Done
type Job =
{ Id: int
Status: Status
Count: int }
let ws = spaces
let jobId: Parser<int> = ws >>. skipStringCI "Job id" >>. ws >>. skipChar '=' >>. ws >>. pint32
let status: Parser<Status> =
ws >>. skipStringCI "Status" >>. ws >>. skipChar '=' >>. ws >>. (
(skipStringCI "Running" >>% Running) <|> (skipStringCI "Done" >>% Done))
let count: Parser<int> = ws >>. skipStringCI "Count" >>. ws >>. skipChar '=' >>. ws >>. pint32
let parse: Parser<Job> = parse {
do! skipCharsTillStringCI "Job id" false 1000
let! id = jobId
do! skipCharsTillStringCI "Status" false 1000
let! status = status
do! skipCharsTillStringCI "Count" false 1000
let! count = count
return { Id = id; Status = status; Count = count }}
[<EntryPoint>]
let main argv =
let sample = """
Some irrelevant text.
Job id = 33
Some other text.
Status = Done
And another text.
Count = 10
Trailing text.
"""
printfn "%A" (run parse sample)
0
(*
result:
Success: {Id = 33;
Status = Done;
Count = 10;}
*)
So, it works but it has two problems: obvious duplication ("Job id" in jobId function and "Job id" in the top-level parser and so on), and it expects "Job id", "Status" and "Count" to be sequenced in this particular order, which is wrong by the requirement.
I have a strong feeling that there's an elegant solution for this.
Thanks!

The first problem (duplication) can be solved with a minor refactoring. The basic idea is wrapping each parser into a wrapper that would do skipping.
Note that this code is yet far from perfection, I just tried to make refactoring as small as possible.
let jobId: Parser<int> = pint32
let status: Parser<Status> =
(skipStringCI "Running" >>% Running) <|> (skipStringCI "Done" >>% Done)
let count: Parser<int> = pint32
let skipAndParse prefix parser =
skipCharsTillStringCI prefix false 1000
>>. ws >>. skipStringCI prefix >>. ws >>. skipChar '=' >>. ws >>. parser
let parse: Parser<Job> = parse {
let! id = skipAndParse "Job id" jobId
let! status = skipAndParse "Status" status
let! count = skipAndParse "Count" count
return { Id = id; Status = status; Count = count }}
The second problem is more complicated. If you want the data lines to appear in a free order, you must consider the case when
not all data lines present;
a certain data line appears twice or more;
To mitigate this, you need to produce a list of data lines found, analyze if everything required is there, and decide what to do with any possible duplicates.
Note that each data line can not afford to have "skip" part anymore, since it may skip an informative line before the actual parser.
let skipAndParse2 prefix parser =
ws >>. skipStringCI prefix >>. ws >>. skipChar '=' >>. ws >>. parser
// Here, you create a DU that will say which data line was found
type Result =
| Id of int
| Status of Status
| Count of int
| Irrelevant of string
// here's a combinator parser
let parse2 =
// list of possible data line parsers
// Note they are intentionally reordered
[
skipAndParse2 "Count" count |>> Count
skipAndParse2 "Status" status |>> Status
skipAndParse2 "Job id" jobId |>> Id
// the trailing one would skip a line in case if it has not
// been parsed by any of prior parsers
// a guard rule is needed because of specifics of
// restOfLine behavior at the end of input: namely, it would
// succeed without consuming an input, which leads
// to an infinite loop. Actually FParsec handles this and
// raises an exception
restOfLine true .>> notFollowedByEof |>> Irrelevant
]
|> List.map attempt // each parser is optional
|> choice // on each iteration, one of the parsers must succeed
|> many // a loop
Running the code:
let sample = "
Some irrelevant text.\n\
Job id = 33\n\
Some other text.\n\
Status = Done\n\
And another text.\n\
Count = 10\n\
Trailing text.\n\
"
sample |> run parse2 |> printfn "%A "
will produce the following output:
Success: [Irrelevant ""; Irrelevant "Some irrelevant text."; Id 33;
Irrelevant ""; Irrelevant "Some other text."; Status Done; Irrelevant "";
Irrelevant "And another text."; Count 10; Irrelevant ""]
Further processing requires filtering Irrelevant elements, checking for duplicates or missing items, and forming the Job record, or raising errors.
UPDATE: a simple example of further processing to hide out Result and returning Job option instead:
// naive implementation of the record maker
// return Job option
// ignores duplicate fields (uses the first one)
// returns None if any field is missing
let MakeJob arguments =
let a' =
arguments
|> List.filter (function |Irrelevant _ -> false | _ -> true)
try
let theId = a' |> List.pick (function |Id x -> Some x | _ -> None)
let theStatus = a' |> List.pick (function |Status x -> Some x | _ -> None)
let theCount = a' |> List.pick (function |Count x -> Some x | _ -> None)
Some { Id=theId; Status = theStatus; Count = theCount }
with
| :?System.Collections.Generic.KeyNotFoundException -> None
To use it, simply add the following line to the code of parse2:
|>> MakeJob

Ocaml lexer / parser rules

I wrote a program in ocaml that given an infix expression like 1 + 2, outputs the prefix notation : + 1 2
My problem is I don't find a way to make a rules like : all value, operator and bracket should be always separated by at least one space: 1+ 1 would be wrong 1 + 1 ok. I would like to not use the ocamlp4 grammar.
here is the code:
open Genlex
type tree =
| Leaf of string
| Node of tree * string * tree
let my_lexer str =
let kwds = ["("; ")"; "+"; "-"; "*"; "/"] in
make_lexer kwds (Stream.of_string str)
let make_tree_from_stream stream =
let op_parser operator_l higher_perm =
let rec aux left higher_perm = parser
[<'Kwd op when List.mem op operator_l; right = higher_perm; s >]
-> aux (Node (left, op, right)) higher_perm s
| [< >]
-> left
in
parser [< left = higher_perm; s >] -> aux left higher_perm s
in
let rec high_perm l = op_parser ["*"; "/"] brackets l
and low_perm l = op_parser ["+"; "-"] high_perm l
and brackets = parser
| [< 'Kwd "("; e = low_perm; 'Kwd ")" >] -> e
| [< 'Ident n >] -> Leaf n
| [< 'Int n >] -> Leaf (string_of_int n)
in
low_perm stream
let rec draw_tree = function
| Leaf n -> Printf.printf "%s" n
| Node(fg, r, fd) -> Printf.printf "(%s " (r);
draw_tree fg;
Printf.printf " ";
draw_tree fd;
Printf.printf ")"
let () =
let line = read_line() in
draw_tree (make_tree_from_stream (my_lexer line)); Printf.printf "\n"
Plus if you have some tips about the code or if you notice some error of prog style then I will appreciate that you let it me know. Thanks !

The Genlex provides a ready-made lexer that respects OCaml's lexical convention, and in particular ignore the spaces in the positions you mention. I don't think you can implement what you want on top of it (it is not designed as a flexible solution, but a quick way to get a prototype working).
If you want to keep writing stream parsers, you could write your own lexer for it: define a token type, and lex a char Stream.t into a token Stream.t, which you can then parse as you wish. Otherwise, if you don't want to use Camlp4, you may want to try an LR parser generator, such as menhir (a better ocamlyacc).

Categories

HOME

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Using FParsec to parse possibly malformed input - parsing

Related

Parsing an ML-like syntax based on indentation, and everything considered to be an instruction/expression

Use FParsec to parse a self-describing input

How to parse a list of tokens with FParsec

FParsec: how to combine parsers so that they will be matched in arbitrary order

Ocaml lexer / parser rules

Categories

Resources