Haskell Parser Fails on "|" Read - parsing

I am working on a parser in Haskell using Parsec. The issue lies in reading in the string "| ". When I attempt to read in the following,
parseExpr = parseAtom
-- | ...
<|> do string "{|"
args <- try parseList <|> parseDottedList
string "| "
body <- try parseExpr
string " }"
return $ List [Atom "lambda", args, body]
I get a parse error, the following.
Lampas >> {|a b| "a" }
Parse error at "lisp" (line 1, column 12):
unexpected "}"
expecting letter, "\"", digit, "'", "(", "[", "{|" or "."
Another failing case is ^ which bears the following.
Lampas >> {|a b^ "a" }
Parse error at "lisp" (line 1, column 12):
unexpected "}"
expecting letter, "\"", digit, "'", "(", "[", "{|" or "."
However, it works as expected when the string "| " is replaced with "} ".
parseExpr = parseAtom
-- | ...
<|> do string "{|"
args <- try parseList <|> parseDottedList
string "} "
body <- try parseExpr
string " }"
return $ List [Atom "lambda", args, body]
The following is the REPL behavior with the above modification.
Lampas >> {|a b} "a" }
(lambda ("a" "b") ...)
So the question is (a) does pipe have a special behavior in Haskell strings, perhaps only in <|> chains?, and (b) how is this behavior averted?.

The character | may be in a set of reserved characters. Test with other characters, like ^, and I assume it will fail just as well. The only way around this would probably be to change the set of reserved characters, or the structure of your interpreter.

Related

Parsing escape characters when creating parser from scratch in Haskell

I have created the code below that is part of building a parser from scratch. I do however encounter unexpected output when using escape characters similar described here ,although my output is different as follows when using ghci:
ghci> parseString "'\\\\'"
[(Const (StringVal "\\"),"")]
ghci> parseString "'\\'"
[]
ghci> parseString "'\\\'"
[]
ghci> parseString "\\\"
<interactive>:50:18: error:
lexical error in string/character literal at end of input
ghci> parseString "\\"
[]
ghci> parseString "\\\\"
[]
where as seen I get an expected output when parsing '\\\\' but not when parsing just '\\' (as in case of the link referenced above), where I would have expected [(Const (StringVal "\"),"")] as a result.Is this something that is wrong in my code or is it due to ghci, and how can I overcome it if it is the latter?
import Data.Char
import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>))
type ParseError = String
type Parser a = ReadP a
space :: Parser Char
space = satisfy isSpace
spaces :: Parser String
spaces = many space
token :: Parser a -> Parser a
token combinator = spaces >> combinator
parseString input = readP_to_S (do
e <- pExp
token eof
return e) input
pExp :: Parser Exp
pExp = (do
pv <- stringConst
return pv)
pStr :: Parser String
pStr =
(do
string "'"
str <- many rightChar
string "'"
return str)
rightChar :: Parser Char
rightChar = (do
nextChar <- get
case nextChar of
'\\' -> (do ch <- (rightChar'); return ch)
_ -> return 'O' --nextChar
)
rightChar' :: Parser Char
rightChar' = (do
nextChar <- get
case nextChar of
'\\' -> return nextChar
'n' -> return '\n'
_ -> return 'N')
stringConst :: Parser Exp
stringConst =
(do
str <- pStr
return (Const (StringVal str)))
You need to keep in mind that the internal representation of a string differs from the characters that GHCi (or even just GHC) reads from string literals in source code and what GHCi prints as output when you show (or print) the string.
The string literal "\\" in Haskell program text, when parsed and read by GHC, creates a string consisting of a single character, a backslash. When you print this string, it appears on the console as "\\", but it's still a string consisting of a single backslash character. When you say you expect the output at the GHCi prompt to include the string literal "\", that's nonsense. There is no such string. There is no internal representation of a string that, when displayed by GHCi, would result in the three characters ", \ and " being displayed on your screen, in much the same way there is no string that would be printed as "hello with no closing double quote.
In your first test case:
ghci> parseString "'\\\\'"
you are supplying your parser with a four character string -- single quote, backslash, backslash, single quote. If this string had been read from a file, rather than typed in at the GHCi prompt, it would have been the literal four-character program text:
'\\'
Presumably, you want your parser to parse this as a single-character string consisting of a backslash. The output from your parse:
[(Const (StringVal "\\"),"")]
shows that your parser worked. The string as displayed on the screen "\\" represents a single-character string consisting of a backslash, which is what you wanted.
For your next case:
ghci> parseString "'\\'"
you are supplying your parser with the three character string:
'\'
Presumably, this is a parse error, as you appear to have escaped your closing single quote, meaning that this string is not terminated. Your parser correctly fails to parse it.
For your third test case:
ghci> parseString "'\\\'"
you have passed the same three character string to your parser:
'\'
The third backslash in your string literal is processed by GHCi as escaping the closing single quote. It is unnecessary but perfectly legal.
Your final test case:
ghci> parseString "\\\"
is syntactically invalid Haskell. The third backslash escapes the closing double quote, making it part of the string, and now your string is unterminated, as if you'd written:
ghci> parseString "ab

Parsec fails without error if reading from file

I wrote a small parsec parser to read samples from a user supplied input string or an input file. It fails properly on wrong input with a useful error message if the input is provided as a semicolon separated string:
> readUncalC14String "test1,7444,37;6800,36;testA,testB,2000,222;test3,7750,40"
*** Exception: Error in parsing dates from string: (line 1, column 29):
unexpected "t"
expecting digit
But it fails silently for the input file inputFile.txt with identical entries:
test1,7444,37
6800,36
testA,testB,2000,222
test3,7750,40
> readUncalC14FromFile "inputFile.txt"
[UncalC14 "test1" 7444 37,UncalC14 "unknownSampleName" 6800 36]
Why is that and how can I make readUncalC14FromFile fail in a useful manner as well?
Here is a minimal subset of my code:
import qualified Text.Parsec as P
import qualified Text.Parsec.String as P
data UncalC14 = UncalC14 String Int Int deriving Show
readUncalC14FromFile :: FilePath -> IO [UncalC14]
readUncalC14FromFile uncalFile = do
s <- readFile uncalFile
case P.runParser uncalC14SepByNewline () "" s of
Left err -> error $ "Error in parsing dates from file: " ++ show err
Right x -> return x
where
uncalC14SepByNewline :: P.Parser [UncalC14]
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces)
readUncalC14String :: String -> Either String [UncalC14]
readUncalC14String s =
case P.runParser uncalC14SepBySemicolon () "" s of
Left err -> error $ "Error in parsing dates from string: " ++ show err
Right x -> Right x
where
uncalC14SepBySemicolon :: P.Parser [UncalC14]
uncalC14SepBySemicolon = P.sepBy parseOneUncalC14 (P.char ';' <* P.spaces)
parseOneUncalC14 :: P.Parser UncalC14
parseOneUncalC14 = do
P.try long P.<|> short
where
long = do
name <- P.many (P.noneOf ",")
_ <- P.oneOf ","
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 name mean std)
short = do
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 "unknownSampleName" mean std)
What is happening here is that a prefix of your input is a valid string. To force parsec to use the whole input you can use the eof parser:
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces) <* P.eof
The reason that one works and the other doesn't is due to the difference between sepBy and endBy. Here is a simpler example:
sepTest, endTest :: String -> Either P.ParseError String
sepTest s = P.runParser (P.sepBy (P.char 'a') (P.char 'b')) () "" s
endTest s = P.runParser (P.endBy (P.char 'a') (P.char 'b')) () "" s
Here are some interesting examples:
ghci> sepTest "abababb"
Left (line 1, column 7):
unexpected "b"
expecting "a"
ghci> endTest "abababb"
Right "aaa"
ghci> sepTest "ababaa"
Right "aaa"
ghci> endTest "ababaa"
Left (line 1, column 6):
unexpected "a"
expecting "b"
As you can see both sepBy and endBy can fail silently, but sepBy fails silently if the prefix doesn't end in the separator b and endBy fails silently if the prefix doesn't end in the main parser a.
So you should use eof after both parsers if you want to make sure you read the whole file/string.

How to combine many parsers?

Why this parser fails and how to fix it?
λ> str1 = string "elif "
λ> str2 = string "else "
λ> strs = (,) <$> many str1 <*> optionMaybe str2
λ> parse strs "" "elif elif elif else "
Left (line 1, column 16):
unexpected "s"
expecting "elif "
How to combine a many parser and an optionalMaybe parser?
The problem is that string "elif " will eat the el in else, and since it has consumed input it will not backtrack and instead complain about an unexpected s
The easiest fix is to allow backtracking with try:
str1 = try $ string "elif "
The other answer shows the easy fix using try. That does introduce a cost, though, namely the introduction of backtracking. In this answer I introduce another solution with no backtracking, and which therefore should be slightly faster and use slightly less memory. The basic idea is to parse the shared prefix, then dispatch once we hit the part where the two things actually differ. So:
strs = (string "el" *> (elseParser <|> elifParser)) <|> pure ([], Nothing) where
elseParser = ([], Just "else ") <$ string "se "
elifParser = liftA2
(\_ (elifs, elses) -> ("elif ":elifs, elses))
(string "if ")
strs
For simplicity, I use constant "else " and "elif " strings in the results, but these can be built from the partial results of the string parsers with a bit of extra wiring.

How to express parsing logic in Parsec ParserT monad

I was working on "Write Yourself a Scheme in 48 hours" to learn Haskell and I've run into a problem I don't really understand. It's for question 2 from the exercises at the bottom of this section.
The task is to rewrite
import Text.ParserCombinators.Parsec
parseString :: Parser LispVal
parseString = do
char '"'
x <- many (noneOf "\"")
char '"'
return $ String x
such that quotation marks which are properly escaped (e.g. in "This sentence \" is nonsense") get accepted by the parser.
In an imperative language I might write something like this (roughly pythonic pseudocode):
def parseString(input):
if input[0] != "\"" or input[len(input)-1] != "\"":
return error
input = input[1:len(input) - 1] # slice off quotation marks
output = "" # This is the 'zero' that accumulates over the following loop
# If there is a '"' in our string we want to make sure the previous char
# was '\'
for n in range(len(input)):
if input[n] == "\"":
try:
if input[n - 1] != "\\":
return error
catch IndexOutOfBoundsError:
return error
output += input[n]
return output
I've been looking at the docs for Parsec and I just can't figure out how to work this as a monadic expression.
I got to this:
parseString :: Parser LispVal
parseString = do
char '"'
regular <- try $ many (noneOf "\"\\")
quote <- string "\\\""
char '"'
return $ String $ regular ++ quote
But this only works for one quotation mark and it has to be at the very end of the string--I can't think of a functional expression that does the work that my loops and if-statements do in the imperative pseudocode.
I appreciate you taking your time to read this and give me advice.
Try something like this:
dq :: Char
dq = '"'
parseString :: Parser Val
parseString = do
_ <- char dq
x <- many ((char '\\' >> escapes) <|> noneOf [dq])
_ <- char dq
return $ String x
where
escapes = dq <$ char dq
<|> '\n' <$ char 'n'
<|> '\r' <$ char 'r'
<|> '\t' <$ char 't'
<|> '\\' <$ char '\\'
The solution is to define a string literal as a starting quote + many valid characters + an ending quote where a "valid character" is either a an escape sequence or non-quote.
So there is a one line change to parseString:
parseString = do char '"'
x <- many validChar
char '"'
return $ String x
and we add the definitions:
validChar = try escapeSequence <|> satisfy ( /= '"' )
escapeSequence = do { char '\\'; anyChar }
escapeSequence may be refined to allow a limited set of escape sequences.

Parser for Quoted string using Parsec

I want to parse input strings like this: "this is \"test \" message \"sample\" text"
Now, I wrote a parser for parsing individual text without any quotes:
parseString :: Parser String
parseString = do
char '"'
x <- (many $ noneOf "\"")
char '"'
return x
This parses simple strings like this: "test message"
Then I wrote a parser for quoted strings:
quotedString :: Parser String
quotedString = do
initial <- string "\\\""
x <- many $ noneOf "\\\""
end <- string "\\\""
return $ initial ++ x ++ end
This parsers for strings like this: \"test message\"
Is there a way that I can combine both the parsers so that I obtain my desired objective ? What exactly is the idomatic way to tackle this problem ?
This is what I would do:
escape :: Parser String
escape = do
d <- char '\\'
c <- oneOf "\\\"0nrvtbf" -- all the characters which can be escaped
return [d, c]
nonEscape :: Parser Char
nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"
character :: Parser String
character = fmap return nonEscape <|> escape
parseString :: Parser String
parseString = do
char '"'
strings <- many character
char '"'
return $ concat strings
Now all you need to do is call it:
parse parseString "test" "\"this is \\\"test \\\" message \\\"sample\\\" text\""
Parser combinators are a bit difficult to understand at first, but once you get the hang of it they are easier than writing BNF grammars.
quotedString = do
char '"'
x <- many (noneOf "\"" <|> (char '\\' >> char '\"'))
char '"'
return x
I believe, this should work.
In case somebody is looking for a more out of the box solution, this answer in code-review provides just that. Here is a complete example with the right imports:
import Text.Parsec
import Text.Parsec.Language
import Text.Parsec.Token
lexer :: GenTokenParser String u Identity
lexer = makeTokenParser haskellDef
strParser :: Parser String
strParser = stringLiteral lexer
parseString :: String -> Either ParseError String
parseString = parse strParser ""
I prefer the following because it is easier to read:
quotedString :: Parser String
quotedString = do
a <- string "\""
b <- concat <$> many quotedChar
c <- string "\""
-- return (a ++ b ++ c) -- if you want to preserve the quotes
return b
where quotedChar = try (string "\\\\")
<|> try (string "\\\"")
<|> ((noneOf "\"\n") >>= \x -> return [x] )
Aadit's solution may be faster because it does not use try but it's probably harder to read.
Note that it is different from Aadit's solution. My solution ignores escaped things in the string and really only cares about \" and \\.
For example, let's assume you have a tab character in the string.
My solution successfully parses "\"\t\"" to Right "\t". Aadit's solutions says unexpected "\t" expecting "\\" or "\"".
Also note that Aadit's solution only accepts 'valid' escapes. For example, it rejects "\"\\a\"". \a is not a valid escape sequence (well according to man ascii, it represents the system bell and is valid). My solution just returns Right "\\a".
So we have two different use cases.
My solution: Parse quoted strings with possibly escaped quotes and escaped escapes
Aadit's solution: Parse quoted strings with valid escape sequences where valid escapes means "\\\"\0\n\r\v\t\b\f"
I wanted to parse quoted strings and remove any backslashes used for escaping during the parsing step. In my simple language, the only escapable characters were double quotes and backslashes. Here is my solution:
quotedString = do
string <- between (char '"') (char '"') (many quotedStringChar)
return string
where
quotedStringChar = escapedChar <|> normalChar
escapedChar = (char '\\') *> (oneOf ['\\', '"'])
normalChar = noneOf "\""
elaborating on #Priyatham response
pEscString::Char->Parser String
pEscString e= do
char e;
s<-many (
do{char '\\';c<-anyChar;return ['\\',c]}
<|>many1 (noneOf (e:"\\")))
char e
return$concat s

Resources