I am trying to parse {asdc,456,ghji,abc}, but when I run
run specialListParser "{asdc,456,ghji,abc}"
the parser fails with
The error occurred at the end of the input stream.
Expecting: any char not in ‘,’, ',' or '}'
I defined my parser based on this answer:
let str : Parser<_> = many1Chars (noneOf ",")
let comma = pstring ","
let listParser = sepBy str comma
let specialListParser = between (pstring "{") (pstring "}") listParser
What am I missing?
Looks like your str parser is consuming the final }, so that between never gets to see it. Change your str parser to be many1Chars (noneOf ",}") and it should work.
Alternately, noneOf [','; '}'] would also work, and might be more explicit about your intentions.
Given the input:
alpha beta gamma  one two three
How could I parse this into the below?
[["alpha"; "beta"; "gamma"]; ["one"; "two"; "three"]]
I can write this when there is a better separator (e.g. __), as then
sepBy (sepBy word (pchar ' ')) (pstring "__")
works, but in the case of double space, the pchar in the first sepBy consumes the first space and then the parser fails.
The FParsec manual says that in sepBy p sep, if sep succeeds and the subsequent p fails (without changing the state), the entire sepBy fails, too. Hence, your goal is:
to make the separator fail if it encounters more than a single space char;
to backtrack, so that the "inner" sepBy loop ends cleanly and passes control to the "outer" sepBy loop.
Here's how to do both:
// this is your word parser; it can be different of course,
// I just made it as simple as possible;
let pWord = many1Satisfy isAsciiLetter
// this is the Inner separator to separate individual words
let pSepInner =
    pchar ' '
    .>> notFollowedBy (pchar ' ') // guard rule to prevent 2nd space
    |> attempt // a wrapper that fails NON-fatally
// this is the Outer separator
let pSepOuter =
    pchar ' '
    |> many1 // loop over 1+ spaces
// this is the parser that would return String list list
let pMain =
    pWord
    |> sepBy <| pSepInner // the Inner loop
    |> sepBy <| pSepOuter // the Outer loop
Use:
run pMain "alpha beta gamma  one two three"
Success: [["alpha"; "beta"; "gamma"]; ["one"; "two"; "three"]]
I'd recommend replacing sepBy word (pchar ' ') with something like this:
let pOneSpace = pchar ' ' .>> notFollowedBy (pchar ' ')
let pTwoSpaces = pstring "  "
// Or if two spaces are allowed as separators but *not* three spaces...
let pTwoSpaces = pstring "  " .>> notFollowedBy (pchar ' ')
sepBy (sepBy word pOneSpace) pTwoSpaces
Note: not tested (since I don't have time at the moment), just typed into answer box. So test it in case I made a mistake somewhere.
I'm trying to use FParsec to parse a TOML multi-line string, and I'm having trouble with the closing delimiter ("""). I have the following parsers:
let controlChars =
    ['\u0000'; '\u0001'; '\u0002'; '\u0003'; '\u0004'; '\u0005'; '\u0006'; '\u0007';
     '\u0008'; '\u0009'; '\u000a'; '\u000b'; '\u000c'; '\u000d'; '\u000e'; '\u000f';
     '\u0010'; '\u0011'; '\u0012'; '\u0013'; '\u0014'; '\u0015'; '\u0016'; '\u0017';
     '\u0018'; '\u0019'; '\u001a'; '\u001b'; '\u001c'; '\u001d'; '\u001e'; '\u001f';
     '\u007f']
let nonSpaceCtrlChars =
    Set.difference (Set.ofList controlChars) (Set.ofList ['\n';'\r';'\t'])
let multiLineStringContents : Parser<char,unit> =
    satisfy (isNoneOf nonSpaceCtrlChars)
let multiLineString : Parser<string,unit> =
    optional newline >>. manyCharsTill multiLineStringContents (pstring "\"\"\"")
    |> between (pstring "\"\"\"") (pstring "\"\"\"")
let test parser str =
    match run parser str with
    | Success (s1, s2, s3) -> printfn "Ok: %A %A %A" s1 s2 s3
    | Failure (f1, f2, f3) -> printfn "Fail: %A %A %A" f1 f2 f3
When I test multiLineString against an input like this:
test multiLineString "\"\"\"x\"\"\""
The parser fails with this error:
Fail: "Error in Ln: 1 Col: 8
"""x"""
       ^
Note: The error occurred at the end of the input stream.
Expecting: '"""'
I'm confused by this. Wouldn't the manyCharsTill multiLineStringContents (pstring "\"\"\"") parser stop at the """ for the between parser to find it? Why is the parser eating all the input and then failing the between parser?
This seems like a relevant post: How to parse comments with FParsec
But I don't see how the solution to that one differs from what I'm doing here, really.
The manyCharsTill documentation says (emphasis mine):
manyCharsTill cp endp parses chars with the char parser cp until the parser endp succeeds. It stops after endp and returns the parsed chars as a string.
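So you don't want to use between in combination with manyCharsTill, because manyCharsTill has already consumed the closing delimiter by the time between looks for it. You want something like pstring "\"\"\"" >>. manyCharsTill multiLineStringContents (pstring "\"\"\"").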
But as it happens, I can save you a lot of work. I've been working on a TOML parser with FParsec myself in my spare time. It's far from complete, but the string part works and handles backslash escapes correctly (as far as I can tell: I've tested thoroughly but not exhaustively). The only thing I'm missing is the "strip first newline if it appears right after the opening delimiter" rule, which you've handled with optional newline. So just add that bit into my code below and you should have a working TOML string parser.
BTW, I am planning to license my code (if I finish it) under the MIT license. So I hereby release the following code block under the MIT license. Feel free to use it in your project if it's useful to you.
// NOTE: `exactly` is not defined in this snippet; it is assumed to be a helper
// that parses exactly n chars satisfying the given predicate.
let pShortCodepointInHex = // Anything from 0000 to FFFF, *except* the range D800-DFFF
    (anyOf "dD" >>. (anyOf "01234567" <?> "a Unicode scalar value (range D800-DFFF not allowed)") .>>. exactly 2 isHex |>> fun (c,s) -> sprintf "d%c%s" c s)
    <|> (exactly 4 isHex <?> "a Unicode scalar value")
let pLongCodepointInHex = // Anything from 00000000 to 0010FFFF, *except* the range D800-DFFF
    (pstring "0000" >>. pShortCodepointInHex)
    <|> (pstring "000" >>. exactly 5 isHex)
    <|> (pstring "0010" >>. exactly 4 isHex |>> fun s -> "0010" + s)
    <?> "a Unicode scalar value (i.e., in range 00000000 to 0010FFFF)"
let toCharOrSurrogatePair p =
    p |> withSkippedString (fun codePoint _ -> System.Int32.Parse(codePoint, System.Globalization.NumberStyles.HexNumber) |> System.Char.ConvertFromUtf32)
let pStandardBackslashEscape =
    anyOf "\\\"bfnrt"
    |>> function
        | 'b' -> "\b" // U+0008 BACKSPACE
        | 'f' -> "\u000c" // U+000C FORM FEED
        | 'n' -> "\n" // U+000A LINE FEED
        | 'r' -> "\r" // U+000D CARRIAGE RETURN
        | 't' -> "\t" // U+0009 CHARACTER TABULATION a.k.a. Tab or Horizontal Tab
        | c -> string c
let pUnicodeEscape = (pchar 'u' >>. (pShortCodepointInHex |> toCharOrSurrogatePair))
                 <|> (pchar 'U' >>. ( pLongCodepointInHex |> toCharOrSurrogatePair))
let pEscapedChar = pstring "\\" >>. (pStandardBackslashEscape <|> pUnicodeEscape)
let quote = pchar '"'
let isBasicStrChar c = c <> '\\' && c <> '"' && c > '\u001f' && c <> '\u007f'
let pBasicStrChars = manySatisfy isBasicStrChar
let pBasicStr = stringsSepBy pBasicStrChars pEscapedChar |> between quote quote
let pEscapedNewline = skipChar '\\' .>> skipNewline .>> spaces
let isMultilineStrChar c = c = '\n' || isBasicStrChar c
let pMultilineStrChars = manySatisfy isMultilineStrChar
let pTripleQuote = pstring "\"\"\""
let pMultilineStr = stringsSepBy pMultilineStrChars (pEscapedChar <|> (notFollowedByString "\"\"\"" >>. pstring "\"")) |> between pTripleQuote pTripleQuote
@rmunn provided a correct answer, thanks! I also solved this in a slightly different way after playing with the FParsec API a bit more. As explained in the other answer, the endp argument to manyCharsTill was consuming the closing """, so I needed to switch to something that wouldn't do that. A simple modification using lookAhead did the trick:
let multiLineString : Parser<string,unit> =
    optional newline >>. manyCharsTill multiLineStringContents (lookAhead (pstring "\"\"\""))
    |> between (pstring "\"\"\"") (pstring "\"\"\"")
I was working on "Write Yourself a Scheme in 48 hours" to learn Haskell and I've run into a problem I don't really understand. It's for question 2 from the exercises at the bottom of this section.
The task is to rewrite
import Text.ParserCombinators.Parsec
parseString :: Parser LispVal
parseString = do
    char '"'
    x <- many (noneOf "\"")
    char '"'
    return $ String x
such that quotation marks which are properly escaped (e.g. in "This sentence \" is nonsense") get accepted by the parser.
In an imperative language I might write something like this (roughly pythonic pseudocode):
def parseString(input):
    if input[0] != "\"" or input[len(input)-1] != "\"":
        return error
    input = input[1:len(input) - 1] # slice off quotation marks
    output = "" # This is the 'zero' that accumulates over the following loop
    # If there is a '"' in our string we want to make sure the previous char
    # was '\'
    for n in range(len(input)):
        if input[n] == "\"":
            try:
                if input[n - 1] != "\\":
                    return error
            catch IndexOutOfBoundsError:
                return error
        output += input[n]
    return output
I've been looking at the docs for Parsec and I just can't figure out how to work this as a monadic expression.
I got to this:
parseString :: Parser LispVal
parseString = do
    char '"'
    regular <- try $ many (noneOf "\"\\")
    quote <- string "\\\""
    char '"'
    return $ String $ regular ++ quote
But this only works for one quotation mark and it has to be at the very end of the string--I can't think of a functional expression that does the work that my loops and if-statements do in the imperative pseudocode.
I appreciate you taking your time to read this and give me advice.
Try something like this:
dq :: Char
dq = '"'
parseString :: Parser LispVal
parseString = do
    _ <- char dq
    -- a backslash starts an escape; anything else except the quote is literal
    x <- many ((char '\\' >> escapes) <|> noneOf [dq])
    _ <- char dq
    return $ String x
  where
    escapes = dq <$ char dq
          <|> '\n' <$ char 'n'
          <|> '\r' <$ char 'r'
          <|> '\t' <$ char 't'
          <|> '\\' <$ char '\\'
The solution is to define a string literal as a starting quote + many valid characters + an ending quote, where a "valid character" is either an escape sequence or any character other than a quote.
So there is a one line change to parseString:
parseString = do char '"'
                 x <- many validChar
                 char '"'
                 return $ String x
and we add the definitions:
validChar = try escapeSequence <|> satisfy ( /= '"' )
escapeSequence = do { char '\\'; anyChar }
escapeSequence may be refined to allow a limited set of escape sequences.
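As a rough sketch of that refinement, the version below accepts only a fixed set of escapes and decodes them while parsing. The particular set (\\, \", \n, \r, \t), the decoding, and the name quoted are assumptions made for illustration, not part of the answer above. It returns a plain String rather than the tutorial's LispVal so that it stands alone. Because the backslash is excluded from the ordinary characters, an unknown escape such as \q is rejected instead of being silently accepted, and no try is needed.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Assumed escape set: only \\, \", \n, \r and \t are recognised.
-- Each alternative consumes the character after the backslash and
-- yields the character it stands for.
escapeSequence :: Parser Char
escapeSequence = char '\\' >> choice
    [ '\\' <$ char '\\'
    , '"'  <$ char '"'
    , '\n' <$ char 'n'
    , '\r' <$ char 'r'
    , '\t' <$ char 't'
    ]

-- An ordinary character is anything except a quote or a backslash,
-- so a backslash can only ever begin an escape sequence.
validChar :: Parser Char
validChar = escapeSequence <|> noneOf "\"\\"

-- Hypothetical name; plays the role of parseString without LispVal.
quoted :: Parser String
quoted = between (char '"') (char '"') (many validChar)
```

In GHCi, parse quoted "" "\"a\\\"b\"" gives Right "a\"b", while parse quoted "" "\"\\q\"" gives a Left with a parse error.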
I want to parse input strings like this: "this is \"test \" message \"sample\" text"
Now, I wrote a parser for parsing individual text without any quotes:
parseString :: Parser String
parseString = do
    char '"'
    x <- (many $ noneOf "\"")
    char '"'
    return x
This parses simple strings like this: "test message"
Then I wrote a parser for quoted strings:
quotedString :: Parser String
quotedString = do
    initial <- string "\\\""
    x <- many $ noneOf "\\\""
    end <- string "\\\""
    return $ initial ++ x ++ end
This parses strings like this: \"test message\"
Is there a way to combine both parsers so that I get the result I want? What exactly is the idiomatic way to tackle this problem?
This is what I would do:
escape :: Parser String
escape = do
    d <- char '\\'
    c <- oneOf "\\\"0nrvtbf" -- all the characters which can be escaped
    return [d, c]
nonEscape :: Parser Char
nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"
character :: Parser String
character = fmap return nonEscape <|> escape
parseString :: Parser String
parseString = do
    char '"'
    strings <- many character
    char '"'
    return $ concat strings
Now all you need to do is call it:
parse parseString "test" "\"this is \\\"test \\\" message \\\"sample\\\" text\""
Parser combinators are a bit difficult to understand at first, but once you get the hang of it they are easier than writing BNF grammars.
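quotedString = do
    char '"'
    -- Try the escape first and keep the backslash out of noneOf; otherwise
    -- noneOf consumes the backslash and the escaped quote ends the string.
    x <- many ((char '\\' >> char '"') <|> noneOf "\"\\")
    char '"'
    return x
I believe this should work.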
In case somebody is looking for a more out of the box solution, this answer in code-review provides just that. Here is a complete example with the right imports:
import Data.Functor.Identity (Identity)
import Text.Parsec
import Text.Parsec.Language
import Text.Parsec.String (Parser)
import Text.Parsec.Token
lexer :: GenTokenParser String u Identity
lexer = makeTokenParser haskellDef
strParser :: Parser String
strParser = stringLiteral lexer
parseString :: String -> Either ParseError String
parseString = parse strParser ""
I prefer the following because it is easier to read:
quotedString :: Parser String
quotedString = do
    a <- string "\""
    b <- concat <$> many quotedChar
    c <- string "\""
    -- return (a ++ b ++ c) -- if you want to preserve the quotes
    return b
  where quotedChar = try (string "\\\\")
                 <|> try (string "\\\"")
                 <|> ((noneOf "\"\n") >>= \x -> return [x] )
Aadit's solution may be faster because it does not use try but it's probably harder to read.
Note that it is different from Aadit's solution. My solution ignores escaped things in the string and really only cares about \" and \\.
For example, let's assume you have a tab character in the string.
My solution successfully parses "\"\t\"" to Right "\t". Aadit's solution reports unexpected "\t" expecting "\\" or "\"".
Also note that Aadit's solution only accepts 'valid' escapes. For example, it rejects "\"\\a\"". \a is not a valid escape sequence (well according to man ascii, it represents the system bell and is valid). My solution just returns Right "\\a".
So we have two different use cases.
My solution: Parse quoted strings with possibly escaped quotes and escaped escapes
Aadit's solution: Parse quoted strings with valid escape sequences where valid escapes means "\\\"\0\n\r\v\t\b\f"
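To make the comparison concrete, here is a small sketch that puts both parsers side by side. The names strictString, lenientString and compareOn are made up for this illustration; the parser bodies are meant to be copies of the two answers, only renamed so that they can live in one module.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Copy of Aadit's parser under a hypothetical name: raw control
-- characters are rejected and only listed escapes are accepted.
strictString :: Parser String
strictString = between (char '"') (char '"') (concat <$> many character)
  where
    character = fmap return nonEscape <|> escape
    nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"
    escape    = do
        d <- char '\\'
        c <- oneOf "\\\"0nrvtbf"
        return [d, c]

-- Copy of the parser from this answer under a hypothetical name:
-- only \\ and \" are treated specially, everything else passes through.
lenientString :: Parser String
lenientString = do
    _ <- string "\""
    b <- concat <$> many quotedChar
    _ <- string "\""
    return b
  where
    quotedChar = try (string "\\\\")
             <|> try (string "\\\"")
             <|> (noneOf "\"\n" >>= \x -> return [x])

-- Run both parsers on the same input: (strict result, lenient result).
compareOn :: String -> (Either ParseError String, Either ParseError String)
compareOn s = (parse strictString "" s, parse lenientString "" s)
```

For compareOn "\"\t\"" and compareOn "\"\\a\"", the first component is a Left and the second is a Right, which matches the behaviour described above.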
I wanted to parse quoted strings and remove any backslashes used for escaping during the parsing step. In my simple language, the only escapable characters were double quotes and backslashes. Here is my solution:
quotedString = do
    string <- between (char '"') (char '"') (many quotedStringChar)
    return string
  where
    quotedStringChar = escapedChar <|> normalChar
    escapedChar = (char '\\') *> (oneOf ['\\', '"'])
    normalChar = noneOf "\""
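Elaborating on @Priyatham's response:
pEscString :: Char -> Parser String
pEscString e = do
    char e
    -- a chunk is either a backslash plus the next char (kept verbatim)
    -- or a run of chars that are neither the delimiter nor a backslash
    s <- many (
        do { char '\\'; c <- anyChar; return ['\\', c] }
        <|> many1 (noneOf (e:"\\")))
    char e
    return $ concat s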
I have the following subexpression to parse 'quotes' which have the following format
"5.75 # 5.95"
I therefore have this parsec expression to parse it
let pquote x = (sepBy (pfloat) ((spaces .>> (pchar '/' <|> pchar '#' )>>. spaces))) x
It works fine, except when there is a trailing space in my input, because the separator expression starts to consume content. So I wrapped it in an attempt, which works and seems, from what I understand, to be more or less what attempt is meant for.
let pquote x = (sepBy (pfloat) (attempt (spaces .>> (pchar '/' <|> pchar '#' )>>. spaces))) x
As I don't know FParsec very well, I wonder if there is a better way to write this. It seems a bit heavy (while still being very manageable, of course).
let s1 = "5.75 # 5.95 "
let s2 = "5.75/5.95 "
let pquote: Parser<_> =
    pfloat
    .>> spaces .>> skipAnyOf ['#'; '/'] .>> spaces
    .>>. pfloat
    .>> spaces
Notes:
spaces skips any sequence of zero or more whitespace characters, so there's no need to use opt (thanks, @Daniel);
type Parser<'t> = Parser<'t, UserState> - I define it this way in order to avoid "value restriction" error; you may remove it;
An earlier version of this answer suggested setting System.Threading.Thread.CurrentThread.CurrentCulture <- Globalization.CultureInfo.GetCultureInfo "en-US" for systems whose default language settings use a decimal comma; that won't work (thanks, @Stephan);
I would not use sepBy unless I have a value list of unknown size.
If you don't really need the value returned (e.g. '#' characters), it is recommended to use skip* functions instead of p* for performance reasons.
Update: added slash as a separator.
I would probably do something like this, which returns float * float:
let ws = spaces
let quantity = pfloat .>> ws
let price = pfloat .>> ws
let quoteSep = pstring "#" .>> ws
let quote = quantity .>> quoteSep .>>. price //`.>> eof` (if final parser)
It's typical for each parser to consume trailing whitespace. Just make sure your top-level parser includes eof.
Assuming that you could have more than two float in the input and '/' and '#' are delimiters:
let ws = spaces
let str_ws s = pstring s .>> ws
let float_ws = pfloat .>> ws
let pquote = sepBy float_ws (str_ws "/" <|> str_ws "#")
On the subject of handling whitespace, this section in the FParsec tutorial is really helpful.