Parsing escape characters when creating parser from scratch in Haskell

I have written the code below as part of building a parser from scratch. However, I get unexpected output when the input contains escape characters, similar to the issue described here, although my output is different. In ghci:
ghci> parseString "'\\\\'"
[(Const (StringVal "\\"),"")]
ghci> parseString "'\\'"
[]
ghci> parseString "'\\\'"
[]
ghci> parseString "\\\"
<interactive>:50:18: error:
lexical error in string/character literal at end of input
ghci> parseString "\\"
[]
ghci> parseString "\\\\"
[]
As shown, I get the expected output when parsing '\\\\' but not when parsing just '\\' (as in the linked question), where I would have expected [(Const (StringVal "\"),"")] as the result. Is something wrong in my code, or is this due to ghci, and how can I overcome it if it is the latter?
import Data.Char
import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>))
type ParseError = String
type Parser a = ReadP a
space :: Parser Char
space = satisfy isSpace
spaces :: Parser String
spaces = many space
token :: Parser a -> Parser a
token combinator = spaces >> combinator
parseString input = readP_to_S (do
    e <- pExp
    token eof
    return e) input
pExp :: Parser Exp
pExp = (do
    pv <- stringConst
    return pv)
pStr :: Parser String
pStr =
    (do
        string "'"
        str <- many rightChar
        string "'"
        return str)
rightChar :: Parser Char
rightChar = (do
    nextChar <- get
    case nextChar of
        '\\' -> (do ch <- (rightChar'); return ch)
        _ -> return 'O' --nextChar
  )
rightChar' :: Parser Char
rightChar' = (do
    nextChar <- get
    case nextChar of
        '\\' -> return nextChar
        'n' -> return '\n'
        _ -> return 'N')
stringConst :: Parser Exp
stringConst =
    (do
        str <- pStr
        return (Const (StringVal str)))

You need to keep in mind that the internal representation of a string differs from the characters that GHCi (or even just GHC) reads from string literals in source code and what GHCi prints as output when you show (or print) the string.
The string literal "\\" in Haskell program text, when parsed and read by GHC, creates a string consisting of a single character, a backslash. When you print this string, it appears on the console as "\\", but it's still a string consisting of a single backslash character. When you say you expect the output at the GHCi prompt to include the string literal "\", that's nonsense. There is no such string. There is no internal representation of a string that, when displayed by GHCi, would result in the three characters ", \ and " being displayed on your screen, in much the same way there is no string that would be printed as "hello with no closing double quote.
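The difference between what a string contains and how it is displayed can be checked directly. The following is a minimal sketch; the names backslash, storedLength and displayed are made up for illustration:

```haskell
-- The literal "\\" in source text denotes a string holding ONE character.
backslash :: String
backslash = "\\"

-- length counts the characters actually stored in the string.
storedLength :: Int
storedLength = length backslash

-- show produces the escaped, quoted form that GHCi prints for a String:
-- a double quote, two backslashes, and a closing double quote.
displayed :: String
displayed = show backslash
```

In GHCi, putStrLn backslash writes the single raw backslash, while print backslash writes the escaped form that show produces.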
In your first test case:
ghci> parseString "'\\\\'"
you are supplying your parser with a four-character string: single quote, backslash, backslash, single quote. If this string had been read from a file, rather than typed in at the GHCi prompt, it would have been the literal four-character program text:
'\\'
Presumably, you want your parser to parse this as a single-character string consisting of a backslash. The output from your parse:
[(Const (StringVal "\\"),"")]
shows that your parser worked. The string as displayed on the screen "\\" represents a single-character string consisting of a backslash, which is what you wanted.
For your next case:
ghci> parseString "'\\'"
you are supplying your parser with the three-character string:
'\'
Presumably, this is a parse error, as you appear to have escaped your closing single quote, meaning that this string is not terminated. Your parser correctly fails to parse it.
For your third test case:
ghci> parseString "'\\\'"
you have passed the same three-character string to your parser:
'\'
The third backslash in your string literal is processed by GHCi as escaping the closing single quote. It is unnecessary but perfectly legal.
Your final test case:
ghci> parseString "\\\"
is syntactically invalid Haskell. The third backslash escapes the closing double quote, making it part of the string, and now your string is unterminated, as if you'd written:
ghci> parseString "ab
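For comparison, here is a minimal sketch of a single-quoted string parser in ReadP that behaves as described for the first two test cases. It is not the asker's code: the names strChar, quoted and parseQuoted are hypothetical, and the set of supported escapes (backslash, single quote, n) is an assumption.

```haskell
import Text.ParserCombinators.ReadP

-- One character inside the quotes: either an escape sequence or any
-- character other than a backslash or the closing single quote.
strChar :: ReadP Char
strChar = escaped <++ plain
  where
    plain = satisfy (\c -> c /= '\'' && c /= '\\')
    escaped = do
      _ <- char '\\'
      c <- get
      case c of
        '\\' -> return '\\'   -- \\ stands for one backslash
        '\'' -> return '\''   -- \' stands for a single quote
        'n'  -> return '\n'   -- \n stands for a newline
        _    -> pfail         -- assumed: any other escape is rejected

-- A complete single-quoted literal.
quoted :: ReadP String
quoted = between (char '\'') (char '\'') (many strChar)

-- Run the parser and require that all input is consumed.
parseQuoted :: String -> [(String, String)]
parseQuoted = readP_to_S (quoted <* eof)
```

With this sketch, parseQuoted "'\\\\'" yields a one-character result, and parseQuoted "'\\'" yields [] because the backslash escapes the closing quote and the literal is never terminated.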

Related

Passing argument to a ReadP Parser in Haskell

I am trying to create a parser from scratch in Haskell. I have problems passing a string as an argument to a function that is already part of a do block in which the parsing occurs. Why does the following minimal example return [] and not 4 as expected?
import Data.Char
import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>))
type Parser a = ReadP a
token :: Parser a -> Parser a
token combinator = (do spaces
                       combinator)
space :: Parser Char
space = satisfy isSpace
spaces :: Parser String
spaces = many space
parseString input = readP_to_S (do
    e <- pExpr
    token eof
    return e) input
pExpr = (do
    pv <- pOpHelper
    spaces
    str <- string pv
    return str
  )
pOpHelper :: Parser String
pOpHelper = (do
    e1 <- munch isDigit
    return e1
  )
I am of course interested in returning a processed version of whatever string pv returns. However, I can't understand why the current setup returns nothing but [] on parseString "4", since calling just pOpHelper without pExpr seems to work.
Edit
I think I have traced the 'bug' to the string function. I had a closer look at it here, but I can't see from the documentation why it shouldn't work in the code above. The code above is narrowed down to the parts that produce the unintended output.
EDIT EDIT
I have now narrowed the problem down even further. It has to do with how 'consumption' works for the parser. With parseString "4", pOpHelper has already consumed the "4", so string pv then expects another "4" in the remaining input, where munch isDigit is no longer satisfied. This means that it only returns [("4","")] rather than [] if the input is "4 4", and only if spaces has been added to the do block in pExpr.
But how can I work around this and avoid 'consuming' the string that I pass as input? Is there a way to use look, for instance, from the documentation above?
As pointed out in the comments below, I am interested in transforming whatever is the input to pOpHelper and then passing its output to functions (in a recursion) that are part of the parent parser function. But how can I do that without pOpHelper consuming the input first, so that the following example would return str on input "4":
pExpr = (do
    pv <- pOpHelper
    --spaces
    str <- string pv
    if str == "(4)" then return str -- do stuff!
    else pfail
  )
pOpHelper :: Parser String
pOpHelper = (do
    e1 <- munch isDigit
    return ( "(" ++ e1 ++ ")" )
  )
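One way to read the question is that the leading digits should be inspected without being consumed. Under that assumption, a minimal sketch using ReadP's look could be written as follows; the names peekDigits, pExprPeek and runPeek are hypothetical and not part of the question's code:

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- look returns the remaining input without consuming any of it,
-- so this only reports which digits come next.
peekDigits :: ReadP String
peekDigits = takeWhile isDigit <$> look

-- Peek first, then consume exactly what was seen.
pExprPeek :: ReadP String
pExprPeek = do
  pv  <- peekDigits   -- the input position is unchanged after this step
  str <- string pv    -- now the digits are actually consumed
  skipSpaces
  return str

-- Run the parser and require that all input is consumed.
runPeek :: String -> [(String, String)]
runPeek = readP_to_S (pExprPeek <* eof)
```

Because peekDigits leaves the input untouched, string pv still finds the "4" it is looking for, whereas munch isDigit followed by string pv would need the digits to appear twice.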

Parsec fails without error if reading from file

I wrote a small parsec parser to read samples from a user supplied input string or an input file. It fails properly on wrong input with a useful error message if the input is provided as a semicolon separated string:
> readUncalC14String "test1,7444,37;6800,36;testA,testB,2000,222;test3,7750,40"
*** Exception: Error in parsing dates from string: (line 1, column 29):
unexpected "t"
expecting digit
But it fails silently for the input file inputFile.txt with identical entries:
test1,7444,37
6800,36
testA,testB,2000,222
test3,7750,40
> readUncalC14FromFile "inputFile.txt"
[UncalC14 "test1" 7444 37,UncalC14 "unknownSampleName" 6800 36]
Why is that and how can I make readUncalC14FromFile fail in a useful manner as well?
Here is a minimal subset of my code:
import qualified Text.Parsec as P
import qualified Text.Parsec.String as P
data UncalC14 = UncalC14 String Int Int deriving Show
readUncalC14FromFile :: FilePath -> IO [UncalC14]
readUncalC14FromFile uncalFile = do
    s <- readFile uncalFile
    case P.runParser uncalC14SepByNewline () "" s of
        Left err -> error $ "Error in parsing dates from file: " ++ show err
        Right x -> return x
  where
    uncalC14SepByNewline :: P.Parser [UncalC14]
    uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces)
readUncalC14String :: String -> Either String [UncalC14]
readUncalC14String s =
    case P.runParser uncalC14SepBySemicolon () "" s of
        Left err -> error $ "Error in parsing dates from string: " ++ show err
        Right x -> Right x
  where
    uncalC14SepBySemicolon :: P.Parser [UncalC14]
    uncalC14SepBySemicolon = P.sepBy parseOneUncalC14 (P.char ';' <* P.spaces)
parseOneUncalC14 :: P.Parser UncalC14
parseOneUncalC14 = do
    P.try long P.<|> short
  where
    long = do
        name <- P.many (P.noneOf ",")
        _ <- P.oneOf ","
        mean <- read <$> P.many1 P.digit
        _ <- P.oneOf ","
        std <- read <$> P.many1 P.digit
        return (UncalC14 name mean std)
    short = do
        mean <- read <$> P.many1 P.digit
        _ <- P.oneOf ","
        std <- read <$> P.many1 P.digit
        return (UncalC14 "unknownSampleName" mean std)

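What is happening here is that a prefix of your input is itself valid input, so the parser succeeds after consuming only that prefix. In your file, the third line fails to parse without consuming anything (try backtracks out of long, and short fails on its first character), so endBy simply stops after the second entry and returns what it has. To force parsec to use the whole input you can use the eof parser: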
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces) <* P.eof
The reason that one works and the other doesn't is due to the difference between sepBy and endBy. Here is a simpler example:
sepTest, endTest :: String -> Either P.ParseError String
sepTest s = P.runParser (P.sepBy (P.char 'a') (P.char 'b')) () "" s
endTest s = P.runParser (P.endBy (P.char 'a') (P.char 'b')) () "" s
Here are some interesting examples:
ghci> sepTest "abababb"
Left (line 1, column 7):
unexpected "b"
expecting "a"
ghci> endTest "abababb"
Right "aaa"
ghci> sepTest "ababaa"
Right "aaa"
ghci> endTest "ababaa"
Left (line 1, column 6):
unexpected "a"
expecting "b"
As you can see, both sepBy and endBy can stop early without reporting an error: sepBy stops silently when the prefix it has consumed ends in an element a (the next separator b is missing), and endBy stops silently when the consumed prefix ends in a separator b (the next element a is missing).
So you should use eof after both parsers if you want to make sure you read the whole file or string.
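The effect of adding eof can be seen on the same toy grammar. This is a minimal sketch assuming the parsec package; the names items, lenient and strict are made up for illustration:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Zero or more 'a's, each one followed by a 'b'.
items :: Parser String
items = endBy (char 'a') (char 'b')

-- Succeeds as soon as a valid prefix has been read.
lenient :: String -> Either ParseError String
lenient = parse items ""

-- Additionally requires that nothing is left over after the items.
strict :: String -> Either ParseError String
strict = parse (items <* eof) ""
```

On the input "abababb", lenient returns the three parsed elements and ignores the trailing b, while strict reports a parse error at that trailing b.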

Parser in Haskell for Triples

I am currently doing an assignment about Parsing in Haskell, but I am struggling with some of the basics.
Assignment :
I am supposed to create a function which parses a string into a list of Triples.
So that:
A, B, C
,E ,D
would result in
Triples [("A","B","C"), ("A","E","D")]
The input string uses ;\n to mark the beginning of a new Triple, and it ends with a dot.
The elements of the Triples can be letters, digits, or a combination of both, e.g. abc, a, 1, abc121.
Therefore,
"a,b,c;\n d,e;\n f,g;\n h,i."
would result in:
Triples [("a","b","c"),("a","d","e"),("a","f","g"),("a","h","i")]
My current solution:
parseTriplesD :: Parser Triples
parseTriplesD = parseTriples
    >>= \rs -> return (Triples rs)
This function is pretty simple and correct. It takes the string and returns a value of the newtype Triples wrapping the list of Triples created by parseTriples.
parseTriples :: Parser [Triple]
parseTriples = parseTriple
    >>= \r -> ((string ";\n" >> parseTriples >>= \rs -> return (r:rs))
        P.<|> (return [r]))
This function needs some work. My idea is to use another function that creates a Triple from three elements of the input string, skips the \n, and recursively calls itself while adding the created triples to a return list. When that does not work because it can only create one Triple, it returns a list with that single Triple.
I somehow need to create the first Triple, and then use the first element of this Triple as the first element of the other ones.
Question 1
How do I create the first Triple and use the first Elements of the Triple for the other Triples?
parseTriple :: Parser Triple
parseTriple = P.many (letter<|>digit) >>= \a -> P.char ','
    >> P.many (letter<|>digit) >>= \b -> P.char ','
    >> P.many (letter<|>digit) >>= \c -> return ((a,b,c))
This function is pretty simple, but I am not sure if it's correct.
My idea is that it takes the first few characters of the string that are either letters or digits, up to the comma ",", and saves these characters in a.
This is repeated three times, and then it creates and returns a Triple with the three elements.
Question 2
How do I take only a few characters (which are either a letter or a digit EDIT: Or A SPACE Character) of the string up until the comma?
Is P.many (letter<|>digit) correct?
What we are given:
The Triples data structure:
newtype Triples = Triples [Triple] deriving (Show,Eq)
type Triple = (String, String, String)
Imports:
import Test.HUnit (runTestTT,Test(TestLabel,TestList),(~?=))
import qualified Text.Parsec as P (char,runP,noneOf,many,(<|>),eof)
import Text.ParserCombinators.Parsec
import Text.Parsec.String
import Text.Parsec.Char
import Data.Maybe
Test cases
runParsec :: Parser a -> String -> Maybe a
runParsec parser input = case P.runP parser () "" input of
    Left _ -> Nothing
    Right a -> Just a
-- | Tests the implementations of 'parseScore'.
main :: IO ()
main = do
    testresults <- runTestTT tests
    print testresults
-- | List of tests for 'parseScore'.
tests :: Test
tests = TestLabel "parseScoreTest" (TestList [
    runParsec parseTriplesD "0,1,2;\n2,3." ~?= Just (Triples [("0","1","2"),("0","2","3")]),
    runParsec parseTriplesD "a,bcde ,23." ~?= Just (Triples [("a","bcde ","23")]),
    runParsec parseTriplesD "a,b,c;\n d,e;\n f,g;\n h,i." ~?= Just (Triples [("a","b","c"),("a","d","e"),("a","f","g"),("a","h","i")]),
    runParsec parseTriplesD "a,bcde23." ~?= Nothing,
    runParsec parseTriplesD "a,b,c;d,e;f,g;h,i." ~?= Nothing,
    runParsec parseTriplesD "a,b,c;\nd;\nf,g;\nh,i." ~?= Nothing
    ])

What you could do is:
1. Parse the first element
2. Parse a list of pairs
3. Add the first element to each of the pairs to create triples
Using do notation will make your code more readable.
You can use alphaNum as a shorthand for letter <|> digit.
parseTriplesD :: Parser Triples
parseTriplesD = Triples <$> parseTriples
parseTriples :: Parser [Triple]
parseTriples = do
    a <- parseString
    char ','
    pairs <- parsePair `sepBy1` string ";\n"
    char '.'
    eof
    return (map (\(b, c) -> (a, b, c)) pairs)
parsePair :: Parser (String, String)
parsePair = do
    first <- parseString
    char ','
    second <- parseString
    return (first, second)
parseString :: Parser String
parseString = many (char ' ') >> many (alphaNum <|> char ' ')

Fast parsing of string that allows escaped characters?

I'm trying to parse a string that can contain escaped characters, here's an example:
import qualified Data.Text as T
exampleParser :: Parser T.Text
exampleParser = T.pack <$> many (char '\\' *> escaped <|> anyChar)
  where escaped = satisfy (\c -> c `elem` ['\\', '"', '[', ']'])
The parser above creates a String and then packs it into Text. Is there any way to parse a string with escapes like the above using the functions for efficient string handling that attoparsec provides? Like string, scan, runScanner, takeWhile, ...
Parsing something like "one \"two\" \[three\]" would produce one "two" [three].
Update:
Thanks to @epsilonhalbe I was able to come up with a generalized solution that fits my needs. Note that the following function doesn't look for matching escaped characters like [..], "..", (..), etc. Also, if it finds an escaped character that is not valid, it treats \ as a literal character.
takeEscapedWhile :: (Char -> Bool) -> (Char -> Bool) -> Parser Text
takeEscapedWhile isEscapable while = do
    x <- normal
    xs <- many escaped
    return $ T.concat (x:xs)
  where
    normal = Atto.takeWhile (\c -> c /= '\\' && while c)
    escaped = do
        x <- (char '\\' *> satisfy isEscapable) <|> char '\\'
        xs <- normal
        return $ T.cons x xs

This is possible with some escaping code on top of attoparsec and text. Altogether it is pretty straightforward, seeing that you have already worked with parsers:
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many, (<|>))
import Data.Attoparsec.Text as AT
import qualified Data.Text as T
import Data.Text (Text)
escaped, quoted, brackted :: Parser Text
-- Everything up to (but not including) the next backslash.
normal :: Parser Text
normal = AT.takeWhile (/= '\\')
escaped = do
    r <- normal
    rs <- many escaped'
    return $ T.concat $ r:rs
  where
    escaped' = do
        r1 <- normal
        r2 <- quoted <|> brackted
        return $ r1 <> r2
-- \"...\" becomes "..."
quoted = do
    string "\\\""
    res <- normal
    string "\\\""
    return $ "\"" <> res <> "\""
-- \[...\] becomes [...]
brackted = do
    string "\\["
    res <- normal
    string "\\]"
    return $ "[" <> res <> "]"
then you can use it to parse the following test cases
Prelude >: MyModule
Prelude MyModule> import Data.Attoparsec.Text as AT
Prelude MyModule AT> import Data.Text.IO as TIO
Prelude MyModule AT TIO>:set -XOverloadedStrings
Prelude MyModule AT TIO> either error TIO.putStrLn $ parseOnly escaped "test"
test
Prelude MyModule AT TIO> either error TIO.putStrLn $ parseOnly escaped "\\\"test\\\""
"test"
Prelude MyModule AT TIO> either error TIO.putStrLn $ parseOnly escaped "\\[test\\]"
[test]
Prelude MyModule AT TIO> either error TIO.putStrLn $ parseOnly escaped "test \\\"test\\\" \\[test\\]"
test "test" [test]
Note that you have to escape the escapes in a Haskell string literal, which is why you see \\\" instead of \".
Also, if you just parse without printing via putStrLn, GHCi will show the Text value escaped, like
Right "test \"test\" [test]"
for the last example.
If you parse a file, you simply write the escaped text in the file.
test.txt
I \[like\] \"Haskell\"
then you can
Prelude MyModule AT TIO> file <- TIO.readFile "test.txt"
Prelude MyModule AT TIO> either error TIO.putStrLn $ parseOnly escaped file
I [like] "Haskell"

Parser for Quoted string using Parsec

I want to parse input strings like this: "this is \"test \" message \"sample\" text"
Now, I wrote a parser for parsing individual text without any quotes:
parseString :: Parser String
parseString = do
    char '"'
    x <- (many $ noneOf "\"")
    char '"'
    return x
This parses simple strings like this: "test message"
Then I wrote a parser for quoted strings:
quotedString :: Parser String
quotedString = do
    initial <- string "\\\""
    x <- many $ noneOf "\\\""
    end <- string "\\\""
    return $ initial ++ x ++ end
This parses strings like this: \"test message\"
Is there a way to combine both parsers so that I reach my desired objective? What exactly is the idiomatic way to tackle this problem?

This is what I would do:
escape :: Parser String
escape = do
    d <- char '\\'
    c <- oneOf "\\\"0nrvtbf" -- all the characters which can be escaped
    return [d, c]
nonEscape :: Parser Char
nonEscape = noneOf "\\\"\0\n\r\v\t\b\f"
character :: Parser String
character = fmap return nonEscape <|> escape
parseString :: Parser String
parseString = do
    char '"'
    strings <- many character
    char '"'
    return $ concat strings
Now all you need to do is call it:
parse parseString "test" "\"this is \\\"test \\\" message \\\"sample\\\" text\""
Parser combinators are a bit difficult to understand at first, but once you get the hang of it they are easier than writing BNF grammars.

quotedString = do
    char '"'
    -- try the escaped quote first; noneOf "\"" on its own would consume the backslash
    x <- many (try (char '\\' >> char '\"') <|> noneOf "\"")
    char '"'
    return x
I believe this should work.

In case somebody is looking for a more out of the box solution, this answer in code-review provides just that. Here is a complete example with the right imports:
import Data.Functor.Identity (Identity)
import Text.Parsec
import Text.Parsec.Language
import Text.Parsec.String (Parser)
import Text.Parsec.Token
lexer :: GenTokenParser String u Identity
lexer = makeTokenParser haskellDef
strParser :: Parser String
strParser = stringLiteral lexer
parseString :: String -> Either ParseError String
parseString = parse strParser ""

I prefer the following because it is easier to read:
quotedString :: Parser String
quotedString = do
    a <- string "\""
    b <- concat <$> many quotedChar
    c <- string "\""
    -- return (a ++ b ++ c) -- if you want to preserve the quotes
    return b
  where
    quotedChar = try (string "\\\\")
        <|> try (string "\\\"")
        <|> ((noneOf "\"\n") >>= \x -> return [x] )
Aadit's solution may be faster because it does not use try but it's probably harder to read.
Note that it is different from Aadit's solution. My solution ignores escaped things in the string and really only cares about \" and \\.
For example, let's assume you have a tab character in the string.
My solution successfully parses "\"\t\"" to Right "\t". Aadit's solution reports unexpected "\t" expecting "\\" or "\"".
Also note that Aadit's solution only accepts 'valid' escapes. For example, it rejects "\"\\a\"". \a is not a valid escape sequence (well according to man ascii, it represents the system bell and is valid). My solution just returns Right "\\a".
So we have two different use cases.
My solution: Parse quoted strings with possibly escaped quotes and escaped escapes
Aadit's solution: Parse quoted strings with valid escape sequences where valid escapes means "\\\"\0\n\r\v\t\b\f"
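The behaviour claimed for the first use case can be checked with a self-contained version of that parser. This is a sketch assuming the parsec package; the helper parseQuoted is a hypothetical name added for illustration:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Quoted string where only \\ and \" are treated specially; both are kept
-- verbatim (backslash included) and every other character is passed through.
quotedString :: Parser String
quotedString = do
  _ <- string "\""
  b <- concat <$> many quotedChar
  _ <- string "\""
  return b
  where
    quotedChar = try (string "\\\\")
             <|> try (string "\\\"")
             <|> (noneOf "\"\n" >>= \x -> return [x])

-- Hypothetical helper that runs the parser on a plain String.
parseQuoted :: String -> Either ParseError String
parseQuoted = parse quotedString ""
```

A literal tab between the quotes is accepted as an ordinary character, and an unknown escape such as \a comes back unchanged as a backslash followed by a.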

I wanted to parse quoted strings and remove any backslashes used for escaping during the parsing step. In my simple language, the only escapable characters were double quotes and backslashes. Here is my solution:
quotedString = do
    string <- between (char '"') (char '"') (many quotedStringChar)
    return string
  where
    quotedStringChar = escapedChar <|> normalChar
    escapedChar = (char '\\') *> (oneOf ['\\', '"'])
    normalChar = noneOf "\""

Elaborating on @Priyatham's response:
pEscString :: Char -> Parser String
pEscString e = do
    char e
    s <- many (
        do { char '\\'; c <- anyChar; return ['\\', c] }
        <|> many1 (noneOf (e : "\\")))
    char e
    return $ concat s
