I'm trying to implement a whitespace sensitive parser using FParsec, and I'm starting off with the baby step of defining a function which will parse lines of text that start with n chars of whitespace.
Here's what I have so far:
let test: Parser<string list,int>
    = let manyNSatisfy i p = manyMinMaxSatisfy i i p
      let p = fun (stream:CharStream<int>) ->
                  let state = stream.UserState
                  // Should fail softly if `state` chars wasn't parsed
                  let result = attempt <| manyNSatisfy state (System.Char.IsWhiteSpace) <| stream
                  if result.Status <> Ok
                  then result
                  else restOfLine false <| stream
      sepBy p newline
My issue is that when I run
runParserOnString test 1 "test" " hi\n there\nyou" |> printfn "%A"
I get an error on "you". I was under the impression that attempt would backtrack any state changes, and returning Error as my status would give me soft failure.
How do I get ["hi"; "there"] back from my parser?
Oh dear, how embarrassing.
I wanted sepEndBy, which is to say that I should terminate the parse on the separator.
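A minimal sketch of the sepEndBy version, assuming the indentation width is kept in the user state as in the question. The names isIndentChar, indentedLine, indentedLines and parseIndented are illustrative, and the whitespace predicate is narrowed to spaces and tabs so that a newline is never counted as indentation:

```fsharp
#r "nuget: FParsec" // script-style reference; in a project, reference the FParsec package instead
open FParsec

// Assumption: only spaces and tabs count as indentation.
let isIndentChar c = c = ' ' || c = '\t'

// Exactly n indentation characters (n is read from the user state), then the rest of the line.
let indentedLine : Parser<string, int> =
    getUserState
    >>= (fun n -> skipManyMinMaxSatisfy n n isIndentChar >>. restOfLine false)

// sepEndBy allows the list to end after a separator, so an unindented line ends the list.
let indentedLines : Parser<string list, int> = sepEndBy indentedLine newline

let parseIndented (indent: int) (text: string) : string list option =
    match runParserOnString indentedLines indent "test" text with
    | Success (lines, _, _) -> Some lines
    | Failure _ -> None

printfn "%A" (parseIndented 1 " hi\n there\nyou")
```

For the input from the question this is expected to give ["hi"; "there"], with "you" left unconsumed because nothing here requires end of input.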
This looks more idiomatic. I have hard-coded 1, but it's easy to extract as a parameter.
let skipManyNSatisfy i = skipManyMinMaxSatisfy i i
let pMyText =
    (   // 1st rule
        skipManyNSatisfy 1 System.Char.IsWhiteSpace // skip exactly 1 whitespace char
        >>. restOfLine false |>> Some               // return the rest as Option
    )
    <|> // If the 1st rule failed...
    (   // 2nd rule
        skipRestOfLine false                        // skip till the end of the line
        >>. preturn None                            // no result
    )
    |> sepBy <| newline                             // Wrap both rules, separated by newline
    |>> Seq.choose id                               // Out of received string option seq, select only Some()
I am stuck at the following parsing problem:
Parse some text string that may contain zero or more elements from a limited character set, up to but not including one of a set of termination characters. Content/no content should be indicated through Maybe. Termination characters may appear in the string in escaped form. Parsing should fail on any inadmissible character.
This is what I came up with (simplified):
import Control.Applicative ((<|>))
import Data.Text (Text)
import qualified Data.Text as T
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC

-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
...
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
...
-- The escape character.
escChar :: Char
...
pComponent :: Parser (Maybe Text)
pComponent = do
    t <- MP.many (escaped <|> regular)
    if null t then return Nothing else return $ Just (T.pack t)
  where
    regular = MP.satisfy isAdmissibleChar <|> fail "Inadmissible character"
    escaped = do
        _ <- MC.char escChar
        MP.satisfy isControlChar -- only control characters may be escaped
Say, admissible characters are uppercase ASCII, escape is '\', and control is ':'.
Then, the following parses correctly: ABC\:D:EF to yield ABC:D.
However, parsing ABC&D, where & is inadmissible, does yield ABC whereas I would expect an error message instead.
Two questions:
1. Why does fail end parsing instead of failing the parser?
2. Is the above a sensible approach to the problem, or is there a "proper", canonical way to parse such terminated strings that I am not aware of?
many has to allow its sub-parser to fail once without the whole parse
failing - for example many (char 'A') *> char 'B', while parsing
"AAAB", has to fail to parse the B to know it got to the end of the
As.
You might want manyTill which allows you to recognise the terminator
explicitly. Something like this:
MP.manyTill (escaped <|> regular) (MP.satisfy isControlChar)
"ABC&D" would give an error here assuming '&' isn't accepted by isControlChar.
Or if you want to parse more than one component you might keep your
existing definition of pComponent and use it with sepBy or similar, like:
MP.sepBy pComponent (MP.satisfy isControlChar)
If you also check for end-of-file after this, like:
MP.sepBy pComponent (MP.satisfy isControlChar) <* MP.eof
then "ABC&D" should give an error again, because the '&' will end the first component but will not be accepted as a separator.
What a parser object normally does is to extract from the input stream whatever subset it is supposed to accept. That's the usual rule.
Here, it seems you want the parser to accept strings that are followed by something specific. From your examples, it is either end of file (eof) or character ':'. So you might want to consider look ahead.
Environment and auxiliary functions:
import Data.Void (Void)
import qualified Data.Text as T
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MC
type Parser = MP.Parsec Void T.Text
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
isAdmissibleChar ch = elem ch ['A' .. 'Z']
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
isControlChar ch = elem ch ":"
-- The escape character:
escChar :: Char
escChar = '\\'
Termination parser, to be used for look ahead:
termination :: Parser ()
termination = MP.eof MP.<|> do
    _ <- MP.satisfy isControlChar
    return ()
Modified pComponent parser:
pComponent :: Parser (Maybe T.Text)
pComponent = do
    txt <- MP.many (escaped MP.<|> regular)
    MP.lookAhead termination -- **CHANGE HERE**
    if (null txt) then (return Nothing) else (return $ Just (T.pack txt))
  where
    regular = (MP.satisfy isAdmissibleChar) MP.<|> (fail "Inadmissible character")
    escaped = do
        _ <- MC.char escChar
        MP.satisfy isControlChar -- only control characters may be escaped
Testing utility:
tryParse :: String -> IO ()
tryParse str = do
    let res = MP.parse pComponent "(noname)" (T.pack str)
    putStrLn $ (show res)
Let's try to rerun your examples:
$ ghci
λ>
λ> :load q67809465.hs
λ>
λ> str1 = "ABC\\:D:EF"
λ> putStrLn str1
ABC\:D:EF
λ>
λ> tryParse str1
Right (Just "ABC:D")
λ>
So that is successful, as desired.
λ>
λ> tryParse "ABC&D"
Left (ParseErrorBundle {bundleErrors = TrivialError 3 (Just (Tokens ('&' :| ""))) (fromList [EndOfInput]) :| [], bundlePosState = PosState {pstateInput = "ABC&D", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "(noname)", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}})
λ>
So that fails, as desired.
Trying our 2 acceptable termination contexts:
λ> tryParse "ABC:&D"
Right (Just "ABC")
λ>
λ>
λ> tryParse "ABCDEF"
Right (Just "ABCDEF")
λ>
fail does not end parsing in general. It just continues with the next alternative. In this case it selects the empty list alternative introduced by the many combinator, so it stops parsing without an error message.
I think the best way to solve your problem is to specify that the input must end in a termination character, that means that it cannot "succeed" halfway like this. You can do that with the notFollowedBy or lookAhead combinators. Here is the relevant part of the megaparsec tutorial.
I wrote a very simple parser combinator library which seems to work alright (https://github.com/mukeshsoni/tinyparsec).
I then tried writing parser for json with the library. The code for the json parser is here - https://github.com/mukeshsoni/tinyparsec/blob/master/src/example_parsers/JsonParser.purs
The grammar for json is recursive -
data JsonVal
  = JsonInt Int
  | JsonString String
  | JsonBool Boolean
  | JsonObj (List (Tuple String JsonVal))
Which means the parser for json object must again call the parser for jsonVal. The code for jsonObj parser looks like this -
jsonValParser
  = jsonIntParser <|> jsonBoolParser <|> jsonStringParser <|> jsonObjParser

propValParser :: Parser (Tuple String JsonVal)
propValParser = do
  prop <- stringLitParser
  _ <- symb ":"
  val <- jsonValParser
  pure (Tuple prop val)

listOfPropValParser :: Parser (List (Tuple String JsonVal))
listOfPropValParser = sepBy propValParser (symb ",")

jsonObjParser :: Parser JsonVal
jsonObjParser = do
  _ <- symb "{"
  propValList <- listOfPropValParser
  _ <- symb "}"
  pure (JsonObj propValList)
But when I try to build it, I get the following error: "The value of propValParser is undefined here. So this reference is not allowed here".
I found similar issues on Stack Overflow but could not understand why the error happens or how I should refactor my code so that it handles the recursive references from jsonValParser to propValParser.
Any help would be appreciated.
See https://stackoverflow.com/a/36991223/139614 for a similar case - you'll need to make use of the fix function, or introduce Unit -> ... in front of a parser somewhere to break the cyclic definition.
I managed to get rid of the error by wrapping the blocks that were throwing the error inside a do block and starting the do block with a no-op:
listOfPropValParser :: Parser (List (Tuple String JsonVal))
listOfPropValParser = do
  _ <- pure 1 -- does nothing but defer the execution of the second line
  sepBy propValParser (symb ",")
I had to do the same for jsonValParser:
jsonValParser = do
  _ <- pure 1
  jsonIntParser <|> jsonBoolParser <|> jsonStringParser <|> jsonObjParser
The idea is to defer the execution of the code that might lead to a cyclic dependency. The added line, _ <- pure 1, does exactly that. I think it might be doing the same thing as fix from Control.Lazy or defer from Data.Lazy.
I'm trying to parse a file, using FParsec, which consists of either float or int values. I'm facing two problems that I can't find a good solution for.
1
Both pint32 and pfloat will successfully parse the same string but give different answers, e.g. pint32 will return 3 when parsing the string "3.0" and pfloat will return 3.0 when parsing the same string. Is it possible to try parsing a floating point value using pint32 and have it fail if the string is "3.0"?
In other words, is there a way to make the following code work:
let parseFloatOrInt lines =
    let rec loop intvalues floatvalues lines =
        match lines with
        | [] -> floatvalues, intvalues
        | line::rest ->
            match run floatWs line with
            | Success (r, _, _) -> loop intvalues (r::floatvalues) rest
            | Failure _ ->
                match run intWs line with
                | Success (r, _, _) -> loop (r::intvalues) floatvalues rest
                | Failure _ -> loop intvalues floatvalues rest
    loop [] [] lines
This piece of code will correctly place all floating point values in the floatvalues list, but because pfloat returns 3.0 when parsing the string "3", all integer values will also be placed in the floatvalues list.
2
The above code example seems a bit clumsy to me, so I'm guessing there must be a better way to do it. I considered combining them using choice, however both parsers must return the same type for that to work. I guess I could make a discriminated union with one option for float and one for int and convert the output from pint32 and pfloat using the |>> operator. However, I'm wondering if there is a better solution?
You're on the right path thinking about defining domain data and separating definition of parsers and their usage on source data. This seems to be a good approach, because as your real-life project grows further, you would probably need more data types.
Here's how I would write it:
/// The resulting type, or DSL
type MyData =
    | IntValue of int
    | FloatValue of float
    | Error // special case for all parse failures

// Whitespace skipper (assumed here to be FParsec's `spaces`)
let ws = spaces

// Then, let's define individual parsers:
let pMyInt =
    pint32
    |>> IntValue

// this is an alternative version of float parser.
// it ensures that the value has non-zero fractional part.
// caveat: the naive approach would treat values like 42.0 as integer
let pMyFloat =
    pfloat
    >>= (fun x -> if x % 1.0 = 0.0 then fail "Not a float" else preturn (FloatValue x))

let pError =
    // this parser must consume some input,
    // otherwise combined with `many` it would hang in a dead loop
    skipAnyChar
    >>. preturn Error

// Now, the combined parser:
let pCombined =
    [ pMyFloat; pMyInt; pError ]      // note, future parsers will be added here;
                                      // mind the order as float supersedes the int,
                                      // and Error must be the last
    |> List.map (fun p -> p .>> ws)   // I'm too lazy to add whitespace skipping
                                      // into each individual parser
    |> List.map attempt               // each parser is optional
    |> choice                         // on each iteration, one of the parsers must succeed
    |> many                           // a loop
Note that the code above is capable of working with any source: strings, streams, or whatever. Your real app may need to work with files, but unit testing can be simplified by using just a string list.
// Now, applying the parser somewhere in the code:
let maybeParseResult =
    match run pCombined myStringData with
    | Success(result, _, _) -> Some result
    | Failure(_, _, _) -> None // or anything that indicates general parse failure
UPD. I have edited the code according to comments. pMyFloat was updated to ensure that the parsed value has non-zero fractional part.
FParsec has the numberLiteral parser that can be used to solve the problem.
As a start you can use the example available at the link above:
open FParsec
open FParsec.Primitives
open FParsec.CharParsers
type Number = Int   of int64
            | Float of float

// -?[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?
let numberFormat =     NumberLiteralOptions.AllowMinusSign
                   ||| NumberLiteralOptions.AllowFraction
                   ||| NumberLiteralOptions.AllowExponent

let pnumber : Parser<Number, unit> =
    numberLiteral numberFormat "number"
    |>> fun nl ->
            if nl.IsInteger then Int (int64 nl.String)
            else Float (float nl.String)
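As a rough sketch of how the numberLiteral approach could replace the nested run calls from the question, the script below splits a list of lines into integers and floats in one pass. pLine and splitNumbers are hypothetical names, and it assumes each line holds a single number optionally surrounded by whitespace; lines that fail to parse are dropped, as in the question's loop.

```fsharp
#r "nuget: FParsec" // script-style reference; in a project, reference the FParsec package instead
open FParsec

type Number =
    | Int of int64
    | Float of float

let numberFormat =
    NumberLiteralOptions.AllowMinusSign
    ||| NumberLiteralOptions.AllowFraction
    ||| NumberLiteralOptions.AllowExponent

let pnumber : Parser<Number, unit> =
    numberLiteral numberFormat "number"
    |>> fun nl ->
            if nl.IsInteger then Int (int64 nl.String)
            else Float (float nl.String)

// Assumption: one number per line, optionally surrounded by whitespace.
let pLine : Parser<Number, unit> = spaces >>. pnumber .>> spaces .>> eof

// Hypothetical helper: returns (ints, floats) in input order, skipping unparsable lines.
let splitNumbers (lines: string list) : int64 list * float list =
    let parsed =
        lines
        |> List.choose (fun line ->
            match run pLine line with
            | Success (n, _, _) -> Some n
            | Failure _ -> None)
    let ints = parsed |> List.choose (function Int i -> Some i | _ -> None)
    let floats = parsed |> List.choose (function Float f -> Some f | _ -> None)
    ints, floats

printfn "%A" (splitNumbers [ "3"; "3.0"; "-7"; "abc" ])
```

Because the literal is classified by its text rather than by which parser happened to accept it, "3" and "3.0" land in different lists without any ordering tricks.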
I am trying to parse an int32 with FParsec but have an additional restriction that the number must be less than some maximum value. Is there a way to do this without writing my own custom parser (as below), and/or is my custom parser (below) an appropriate way of meeting the requirement?
I ask because most of the built-in library functions seem to revolve around a char satisfying certain predicates and not any other type.
let pRow: Parser<int> =
    let error = messageError ("int parsed larger than maxRows")
    let mutable res = Reply(Error, error)
    fun stream ->
        let reply = pint32 stream
        if reply.Status = Ok && reply.Result <= 1000000 then
            res <- reply
        res
UPDATE
Below is an attempt at a more fitting FParsec solution based on the direction given in the comment below:
let pRow2: Parser<int> =
    pint32 >>= (fun x -> if x <= 1048576 then (preturn x) else fail "int parsed larger than maxRows")
Is this the correct way to do it?
You've done excellent research and almost answered your own question.
Generally, there are two approaches:
1. Unconditionally parse out an int and let later code check it for validity.
2. Use a guard rule bound to the parser. In this case (>>=) is the right tool.
To make a good choice, ask yourself whether an integer that fails the guard rule should get "another chance" by triggering another parser.
Here's what I mean. Usually, in real-life projects, parsers are combined in some chains. If one parser fails, the following one is attempted. For example, in this question, some programming language is parsed, so it needs something like:
let pContent =
    pLineComment <|> pOperator <|> pNumeral <|> pKeyword <|> pIdentifier
Theoretically, your DSL may need to differentiate a "small int value" from another type:
/// The resulting type, or DSL
type Output =
    | SmallValue of int
    | LargeValueAndString of int * string
    | Comment of string

// Assumed helpers: `ws` skips whitespace,
// `pWord` reads one run of non-whitespace characters and the whitespace after it
let ws = spaces
let pWord = many1Satisfy (fun c -> not (System.Char.IsWhiteSpace c)) .>> ws

let pSmallValue =
    pint32 >>= (fun x -> if x <= 1048576 then (preturn x) else fail "int parsed larger than maxRows")
    .>> ws
    |>> SmallValue
let pLargeValueAndString =
    pint32 .>> ws .>>. pWord
    |>> LargeValueAndString
let pComment =
    pWord
    |>> Comment
let pCombined =
    [ pSmallValue; pLargeValueAndString; pComment]
    |> List.map attempt // each parser is optional
    |> choice           // on each iteration, one of the parsers must succeed
    |> many             // a loop
Built this way, pCombined will return:
"42 ABC" gets parsed as [ SmallValue 42 ; Comment "ABC" ]
"1234567 ABC" gets parsed as [ LargeValueAndString(1234567, "ABC") ]
As we see, the guard rule impacts how the parsers are applied, so the guard rule has to be within the parsing process.
If, however, you don't need such complication (e.g., an int is parsed unconditionally), your first snippet is just fine.
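The guard rule can also be factored into a small reusable combinator, which addresses the observation that the built-in predicate parsers mostly work on chars. This is only a sketch: satisfying, maxRows and tryParseRow are hypothetical names, not part of FParsec.

```fsharp
#r "nuget: FParsec" // script-style reference; in a project, reference the FParsec package instead
open FParsec

// Hypothetical helper: run p, then reject its result unless the predicate holds.
let satisfying (predicate: 'a -> bool) (message: string) (p: Parser<'a, 'u>) : Parser<'a, 'u> =
    p >>= (fun x -> if predicate x then preturn x else fail message)

let maxRows = 1048576

let pRow : Parser<int, unit> =
    satisfying (fun n -> n <= maxRows) "int parsed larger than maxRows" pint32

let tryParseRow (input: string) : int option =
    match run pRow input with
    | Success (n, _, _) -> Some n
    | Failure _ -> None

printfn "%A" (tryParseRow "42")      // within the limit
printfn "%A" (tryParseRow "2000000") // rejected by the guard
```

Note that when the guard fails, the digits have already been consumed, so a following alternative only gets its chance if the guarded parser is wrapped in attempt, as pCombined does above.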
I am using the Bill Casarin post on how to parse delimited files with FParsec, and I am dumbing the logic down to get an understanding of how the code works. I am parsing a multi-row delimited document into a Cell list list structure (for now), where a Cell is a string or a float. I am a complete newbie at this.
I am having issues parsing the floats. In a typical case (a cell delimited by tabs, containing a number) it works. However, when a cell happens to be a string that starts with a number, it falls apart.
How do I modify pFloatCell to either parse (all the way through to the tab) as a float or nothing?
Thank you
type Cell =
    | String of string
    | Float of float
.
.
.
let pStringCell delim =
    manyChars (nonQuotedCellChar delim)
    |>> String
// this is my issue. pfloat parses the string one
// char at a time, and once it starts off with a number
// it is down that path, and errors out
let pFloatCell delim =
    FParsec.CharParsers.pfloat
    |>> Float
let pCell delim =
    (pFloatCell delim) <|> (pStringCell delim)
.
.
.
let ParseTab s =
    let delim = "\t"
    let res = run (csv delim) s in
    match res with
    | Success (rows, _, _) -> { IsSuccess = true; ErrorMsg = "Ok"; Result = stripEmpty rows }
    | Failure (s, _, _) -> { IsSuccess = false; ErrorMsg = s; Result = [[]] }
.
.
.
let test() =
    let parsed = ParseTab data
Oops, it was late for me last night; I meant to post the data. This first one works:
let data =
"s10 Mar 2011 18:28:11 GMT\n"
while this returns an error:
let data =
"10 Mar 2011 18:28:11 GMT\n"
returns, both with and without ChaosP's recommendation:
ErrorMsg = "Error in Ln: 1 Col:
3\r\n10 Mar 2011 18:28:11 GMT\r\n
^\r\nExpecting: end of file, newline
or '\t'\r\n"
It looks as though attempt is working fine. In the second case it only grabs up to the 10, and the code for pfloat looks only up to the first whitespace. I either need to convince pfloat that it must look all the way up to the next tab or newline, regardless of whether there is a space before it, or write my own version of pfloat using Double.Parse, but I would rather rely on the library.
Since it seems the text you'll be parsing is a bit ambiguous you'll need to modify your pCell parser.
let sep delim =
    skipString delim <|> skipAnyOf "\r\n" <|> eof
let pCell delim =
    attempt (pFloatCell delim .>> sep delim) <|> (pStringCell delim .>> sep delim)
This also means you'll need to modify whichever parser uses pCell.
let pCells delim =
    many (pCell delim)
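A self-contained variant of the same idea, as a sketch rather than a drop-in replacement for the csv parser in the question. It checks for the end of the cell with followedBy instead of consuming the separator inside pCell, and lets sepBy handle the tabs. The names Text, Number, isCellChar, cellEnd, pRow and parseRow are illustrative, and quoted cells are not handled.

```fsharp
#r "nuget: FParsec" // script-style reference; in a project, reference the FParsec package instead
open FParsec

type Cell =
    | Text of string
    | Number of float

// Assumption: a cell ends at a tab, a newline, or the end of input.
let isCellChar c = c <> '\t' && c <> '\n' && c <> '\r'
let cellEnd : Parser<unit, unit> = skipChar '\t' <|> skipNewline <|> eof

// Accept a float only if it fills the whole cell; otherwise backtrack and read the cell as text.
let pCell : Parser<Cell, unit> =
    attempt (pfloat .>> followedBy cellEnd |>> Number)
    <|> (manySatisfy isCellChar |>> Text)

// One row of tab-separated cells.
let pRow : Parser<Cell list, unit> = sepBy pCell (skipChar '\t')

let parseRow (line: string) : Cell list option =
    match run pRow line with
    | Success (cells, _, _) -> Some cells
    | Failure _ -> None

printfn "%A" (parseRow "10 Mar 2011 18:28:11 GMT\t3.5")
```

Here the date cell is read as text even though it starts with digits, because pfloat stops at the space, followedBy cellEnd fails there, and attempt rewinds to the start of the cell.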
Note
The .>> operator is actually quite simple. Think of it as the leap-frog operator: both sides are applied in order, the value of the left-hand side is returned, and the result of the right-hand side is ignored.
Parser<'a, 'b> -> Parser<'c, 'b> -> Parser<'a, 'b>
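A small illustration of .>> next to its siblings >>. and .>>., where the dot marks the side whose result is kept. The names keepLeft, keepRight, keepBoth and tryRun are made up for this sketch.

```fsharp
#r "nuget: FParsec" // script-style reference; in a project, reference the FParsec package instead
open FParsec

// All three parsers consume an integer followed by ';' and differ only in what they return.
let keepLeft  : Parser<int, unit>        = pint32 .>> pchar ';'   // returns the int
let keepRight : Parser<char, unit>       = pint32 >>. pchar ';'   // returns the ';'
let keepBoth  : Parser<int * char, unit> = pint32 .>>. pchar ';'  // returns both as a tuple

// Hypothetical helper: Some result on success, None on failure.
let tryRun (p: Parser<'a, unit>) (input: string) : 'a option =
    match run p input with
    | Success (v, _, _) -> Some v
    | Failure _ -> None

printfn "%A" (tryRun keepLeft "12;")
printfn "%A" (tryRun keepRight "12;")
printfn "%A" (tryRun keepBoth "12;")
```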