Writing parsers with Text.ParserCombinators - parsing

In class, we wrote parsers by defining our own Parser type and this gave us a lot of flexibility. For example, if we wanted to make code and parse out "[at]" as '#', we could write
atParser = Parser $ \s ->
case s of
w:x:y:z:zs ->
| (w:x:y:z:[]) == "[at]" = ['#',zs]
| otherwise = []
zs -> []
However, I cannot figure out how to implement this sort of parser using Text.ParserCombinators. Is it possible?
Thanks

I believe you're looking for the string combinator.
λ> :set -package parsec
package flags have changed, resetting and loading new packages...
λ> import Text.Parsec
λ> :t string
string :: Stream s m Char => String -> ParsecT s u m String
λ> parse (string "[at]") "" "[at]"
Right "[at]"
λ> parse (string "[at]") "" "[at"
Left (line 1, column 1):
unexpected end of input
expecting "[at]"
λ> parse ('#' <$ string "[at]") "" "[at]"
Right '#'

Related

Parsing escape characters when creating parser from scratch in Haskell

I have created the code below that is part of building a parser from scratch. I do however encounter unexpected output when using escape characters similar described here ,although my output is different as follows when using ghci:
ghci> parseString "'\\\\'"
[(Const (StringVal "\\"),"")]
ghci> parseString "'\\'"
[]
ghci> parseString "'\\\'"
[]
ghci> parseString "\\\"
<interactive>:50:18: error:
lexical error in string/character literal at end of input
ghci> parseString "\\"
[]
ghci> parseString "\\\\"
[]
where as seen I get an expected output when parsing '\\\\' but not when parsing just '\\' (as in case of the link referenced above), where I would have expected [(Const (StringVal "\"),"")] as a result.Is this something that is wrong in my code or is it due to ghci, and how can I overcome it if it is the latter?
import Data.Char
import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>))
type ParseError = String
type Parser a = ReadP a
space :: Parser Char
space = satisfy isSpace
spaces :: Parser String
spaces = many space
token :: Parser a -> Parser a
token combinator = spaces >> combinator
parseString input = readP_to_S (do
e <- pExp
token eof
return e) input
pExp :: Parser Exp
pExp = (do
pv <- stringConst
return pv)
pStr :: Parser String
pStr =
(do
string "'"
str <- many rightChar
string "'"
return str)
rightChar :: Parser Char
rightChar = (do
nextChar <- get
case nextChar of
'\\' -> (do ch <- (rightChar'); return ch)
_ -> return 'O' --nextChar
)
rightChar' :: Parser Char
rightChar' = (do
nextChar <- get
case nextChar of
'\\' -> return nextChar
'n' -> return '\n'
_ -> return 'N')
stringConst :: Parser Exp
stringConst =
(do
str <- pStr
return (Const (StringVal str)))
You need to keep in mind that the internal representation of a string differs from the characters that GHCi (or even just GHC) reads from string literals in source code and what GHCi prints as output when you show (or print) the string.
The string literal "\\" in Haskell program text, when parsed and read by GHC, creates a string consisting of a single character, a backslash. When you print this string, it appears on the console as "\\", but it's still a string consisting of a single backslash character. When you say you expect the output at the GHCi prompt to include the string literal "\", that's nonsense. There is no such string. There is no internal representation of a string that, when displayed by GHCi, would result in the three characters ", \ and " being displayed on your screen, in much the same way there is no string that would be printed as "hello with no closing double quote.
In your first test case:
ghci> parseString "'\\\\'"
you are supplying your parser with a four character string -- single quote, backslash, backslash, single quote. If this string had been read from a file, rather than typed in at the GHCi prompt, it would have been the literal four-character program text:
'\\'
Presumably, you want your parser to parse this as a single-character string consisting of a backslash. The output from your parse:
[(Const (StringVal "\\"),"")]
shows that your parser worked. The string as displayed on the screen "\\" represents a single-character string consisting of a backslash, which is what you wanted.
For your next case:
ghci> parseString "'\\'"
you are supplying your parser with the three character string:
'\'
Presumably, this is a parse error, as you appear to have escaped your closing single quote, meaning that this string is not terminated. Your parser correctly fails to parse it.
For your third test case:
ghci> parseString "'\\\'"
you have passed the same three character string to your parser:
'\'
The third backslash in your string literal is processed by GHCi as escaping the closing single quote. It is unnecessary but perfectly legal.
Your final test case:
ghci> parseString "\\\"
is syntactically invalid Haskell. The third backslash escapes the closing double quote, making it part of the string, and now your string is unterminated, as if you'd written:
ghci> parseString "ab

Parser written in Haskell not working as intended

I was playing around with Haskell's parsec library. I was trying to parse a hexadecimal string of the form "#x[0-9A-Fa-f]*" into an integer. This the code I thought would work:
module Main where
import Control.Monad
import Numeric
import System.Environment
import Text.ParserCombinators.Parsec hiding (spaces)
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
return (fst (head (readHex x)))
testHex :: String -> String
testHex input = case parse parseHex "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val
main :: IO ()
main = do
args <- getArgs
putStrLn (testHex (head args))
And then I tried testing the testHex function in Haskell's repl:
GHCi, version 8.6.5: http://www.haskell.org/ghc/ :? for help
[1 of 1] Compiling Main ( src/Main.hs, interpreted )
Ok, one module loaded.
*Main> testHex "#xcafebeef"
"Matched3405692655"
*Main> testHex "#xnothx"
"Does not match \"lisp\" (line 1, column 3):\nunexpected \"n\"\nexpecting hexadecimal digit"
*Main> testHex "#xcafexbeef"
"Matched51966"
The first and second try work as intended. But in the third one, the string is matching upto the invalid character. I do not want the parser to do this, but rather not match if any digit in the string is not a valid string. Why is this happening, and how do if fix this?
Thank you!
You need to place eof at the end.
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
eof
return (fst (head (readHex x)))
Alternatively, you can compose it with eof where you use it if you want to reuse parseHex in other places.
testHex :: String -> String
testHex input = case parse (parseHex <* eof) "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val

Parsec fails without error if reading from file

I wrote a small parsec parser to read samples from a user supplied input string or an input file. It fails properly on wrong input with a useful error message if the input is provided as a semicolon separated string:
> readUncalC14String "test1,7444,37;6800,36;testA,testB,2000,222;test3,7750,40"
*** Exception: Error in parsing dates from string: (line 1, column 29):
unexpected "t"
expecting digit
But it fails silently for the input file inputFile.txt with identical entries:
test1,7444,37
6800,36
testA,testB,2000,222
test3,7750,40
> readUncalC14FromFile "inputFile.txt"
[UncalC14 "test1" 7444 37,UncalC14 "unknownSampleName" 6800 36]
Why is that and how can I make readUncalC14FromFile fail in a useful manner as well?
Here is a minimal subset of my code:
import qualified Text.Parsec as P
import qualified Text.Parsec.String as P
data UncalC14 = UncalC14 String Int Int deriving Show
readUncalC14FromFile :: FilePath -> IO [UncalC14]
readUncalC14FromFile uncalFile = do
s <- readFile uncalFile
case P.runParser uncalC14SepByNewline () "" s of
Left err -> error $ "Error in parsing dates from file: " ++ show err
Right x -> return x
where
uncalC14SepByNewline :: P.Parser [UncalC14]
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces)
readUncalC14String :: String -> Either String [UncalC14]
readUncalC14String s =
case P.runParser uncalC14SepBySemicolon () "" s of
Left err -> error $ "Error in parsing dates from string: " ++ show err
Right x -> Right x
where
uncalC14SepBySemicolon :: P.Parser [UncalC14]
uncalC14SepBySemicolon = P.sepBy parseOneUncalC14 (P.char ';' <* P.spaces)
parseOneUncalC14 :: P.Parser UncalC14
parseOneUncalC14 = do
P.try long P.<|> short
where
long = do
name <- P.many (P.noneOf ",")
_ <- P.oneOf ","
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 name mean std)
short = do
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 "unknownSampleName" mean std)
What is happening here is that a prefix of your input is a valid string. To force parsec to use the whole input you can use the eof parser:
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces) <* P.eof
The reason that one works and the other doesn't is due to the difference between sepBy and endBy. Here is a simpler example:
sepTest, endTest :: String -> Either P.ParseError String
sepTest s = P.runParser (P.sepBy (P.char 'a') (P.char 'b')) () "" s
endTest s = P.runParser (P.endBy (P.char 'a') (P.char 'b')) () "" s
Here are some interesting examples:
ghci> sepTest "abababb"
Left (line 1, column 7):
unexpected "b"
expecting "a"
ghci> endTest "abababb"
Right "aaa"
ghci> sepTest "ababaa"
Right "aaa"
ghci> endTest "ababaa"
Left (line 1, column 6):
unexpected "a"
expecting "b"
As you can see both sepBy and endBy can fail silently, but sepBy fails silently if the prefix doesn't end in the separator b and endBy fails silently if the prefix doesn't end in the main parser a.
So you should use eof after both parsers if you want to make sure you read the whole file/string.

Skip everything until a successful parse

I'd like to parse all days from a text like this:
Ignore this
Also this
2019-09-05
More to ignore
2019-09-06
2019-09-07
Using Trifecta, I've defined a function to parse a day:
dayParser :: Parser Day
dayParser = do
dayString <- tillEnd
parseDay dayString
tillEnd :: Parser String
tillEnd = manyTill anyChar (try eof <|> eol)
parseDay :: String -> Parser Day
parseDay s = maybe failure return dayMaybe
where
dayMaybe = parseTime' dayFormat s
failure = unexpected $ "Failed to parse date. Expected format: " ++ dayFormat
-- %-m makes the parser accept months consisting of a single digit
dayFormat = "%Y-%-m-%-d"
eol :: Parser ()
eol = char '\n' <|> char '\r' >> return ()
-- "%Y-%-m-%-d" for example
type TimeFormat = String
-- Given a time format and a string, parses the string to a time.
parseTime' :: (Monad m, ParseTime t) => TimeFormat -> String -> m t
-- True means that the parser tolerates whitespace before and after the date
parseTime' = parseTimeM True defaultTimeLocale
Parsing a day this way works. What I'm having trouble with is ignoring anything in the text that's not a day.
The following can't work since it assumes the number of text blocks that aren't a day:
daysParser :: Parser [Day]
daysParser = do
-- Ignore everything that's not a day
_ <- manyTill anyChar $ try dayParser
days <- many $ token dayParser
_ <- manyTill anyChar $ try dayParser
-- There might be more days after this...
return days
I reckon there's a straightforward way to express this with Trifecta but I can't seem to find it.
Here's the whole module including an example text to parse:
{-# LANGUAGE QuasiQuotes #-}
module DateParser where
import Text.RawString.QQ
import Data.Time
import Text.Trifecta
import Control.Applicative ( (<|>) )
-- "%Y-%-m-%-d" for example
type TimeFormat = String
dayParser :: Parser Day
dayParser = do
dayString <- tillEnd
parseDay dayString
tillEnd :: Parser String
tillEnd = manyTill anyChar (try eof <|> eol)
parseDay :: String -> Parser Day
parseDay s = maybe failure return dayMaybe
where
dayMaybe = parseTime' dayFormat s
failure = unexpected $ "Failed to parse date. Expected format: " ++ dayFormat
-- %-m makes the parser accept months consisting of a single digit
dayFormat = "%Y-%-m-%-d"
eol :: Parser ()
eol = char '\n' <|> char '\r' >> return ()
-- Given a time format and a string, parses the string to a time.
parseTime' :: (Monad m, ParseTime t) => TimeFormat -> String -> m t
-- True means that the parser tolerates whitespace before and after the date
parseTime' = parseTimeM True defaultTimeLocale
daysParser :: Parser [Day]
daysParser = do
-- Ignore everything that's not a day
_ <- manyTill anyChar $ try dayParser
days <- many $ token dayParser
_ <- manyTill anyChar $ try dayParser
-- There might be more days after this...
return days
test = parseString daysParser mempty text1
text1 = [r|
Ignore this
Also this
2019-09-05
More to ignore
2019-09-06
2019-09-07|]
There are three large problems here.
First, the way you're defining dayParser, it's always trying to parse the rest of the text as a date. For example, if your input text is "2019-01-01 foo bar", then dayParser would first consume the whole string, so that dayString == "2019-01-01 foo bar", and then will try to parse that string as a date. Which, of course, would fail.
In order to have a saner behavior, you could only bite off the beginning of the string that kinda looks like a date and try to parse that, like:
dayParser =
parseDay =<< many (digit <|> char '-')
This implementation bites off the beginning of the input consisting of digits and dashes, and tries to parse that as a date.
Note that this is a quick-n-dirty implementation. It is imprecise. For example, this implementation would accept input like "2019-01-0123456" and try to parse that as a date, and of course will fail. From your question, it is not clear whether you'd want to still parse 2019-01-01 and leave the rest, or whether you want to not consider that a proper date. If you wanted to be super-precise about this, you could specify the exact format as precisely as you want, e.g.:
dayParser = do
y <- count 4 digit
void $ char '-'
m <- try (count 2 digit) <|> count 1 digit
void $ char '-'
d <- try (count 2 digit) <|> count 1 digit
parseDay $ y ++ "-" ++ m ++ "-" ++ d
This implementation expects exactly the format of the date.
Second, there is a logical problem: your daysParser tries to first parse some garbage, then parse many days, and then parse some garbage again. This logic does not admit a case where the many dates have some garbage between them.
Third problem is much more tricky. You see, the way the try combinator works - if the parser fails, then try will roll back the input position, but if the parser succeeds, then the input remains consumed! This means that you cannot use try as a zero-consumption lookahead, the way you're trying to do in manyTill anyChar $ try dayParser. Such a parser will parse until it finds a date, and then it will consume the date, leaving nothing for the next parser and causing it to fail.
I will illustrate with a simpler example. Consider this:
> parseString (many (char 'a')) mempty "aaa"
Success "aaa"
Cool, it parses three 'a's. Now let's add a try at the beginning:
> parseString (try (char 'b') *> many (char 'a')) mempty "aaa"
Success "aaa"
Awesome, this still works: the try fails, and then we parse three 'a's as before.
Now let's change the try from 'b' to 'a':
> parseString (try (char 'a') *> many (char 'a')) mempty "aaa"
Success "aa"
Look what happened: the try has consumed the first 'a', leaving only two to be parsed by many.
We can even extend it to more fully resemble your approach:
> p = manyTill anyChar (try (char 'a')) *> many (char 'a')
> parseString p mempty "aaa"
Success "aa"
> parseString p mempty "cccaaa"
Success "aa"
See what happens? manyTill correctly skips all the 'c's up to the first 'a', but then it also consumes that first 'a'!
There appears to be no sane way (that I see) to have a zero-consumption lookahead like this. You always have to consume the first successful hit.
If I had this problem, I would probably resort to recursion: parsing chars one by one, at every step looking if I can get a day, and concatenating in a list. Something like this:
data WhatsThis = AChar Char | ADay Day | EOF
daysParser = do
r <- (ADay <$> dayParser) <|> (AChar <$> anyChar) <|> (EOF <$ eof)
case r of
ADay d -> do
rest <- daysParser
pure $ d : rest
AChar _ ->
daysParser
EOF ->
pure []
It tries to parse a day, and if that fails, just skips a char, unless there are no more chars. If day parsing succeeded, it calls itself recursively, then prepends the day to the result of the recursive call.
Note that this approach is not very composable: it always consumes everything till the end of the input. If you want to compose it with something else, you may want consider replacing eof with a parameter:
daysParser stop = do
r <- (ADay <$> dayParser) <|> (AChar <$> anyChar) <|> (EOF <$ stop)
...

How to parse integer with base prefix using parsec in haskell?

I'm trying to parse an input integer string in haskell using parsec. The string might either be in decimal, octal or hexadecimal. The base is specified by a prefix of #d, #o or #x for decimal, octal and hexadecimal respectively, which is then followed by the integer. If no prefix is specified, the base is assumed to be 10. Here's what I've done so far:
parseNumber = do x <- noPrefix <|> withPrefix
return x
where noPrefix = many1 digit
withPrefix = do char '#'
prefix <- oneOf "dox"
return $ case prefix of
'd' -> many1 digit
'o' -> fmap (show . fst . head . readOct) (many1 octDigit)
'x' -> fmap (show . fst . head . readHex) (many1 hexDigit)
However, this isn't compiling and is failing with type errors. I don't quite really understand the type error and would just like help in general with this problem. Any alternative ways to solve it will also be appreciated.
Thank you for your time and help. :)
EDIT: Here's the error I've been getting.
In Megaparsec—a modern
fork of Parsec, this problem is non-existent (from
documentation of hexadecimal):
Parse an integer in hexadecimal representation. Representation of
hexadecimal number is expected to be according to Haskell report except
for the fact that this parser doesn't parse “0x” or “0X” prefix. It is
responsibility of the programmer to parse correct prefix before parsing
the number itself.
For example you can make it conform to Haskell report like this:
hexadecimal = char '0' >> char' 'x' >> L.hexadecimal
So in your case you can just define (note how it's more readable):
import Data.Void
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
type Parser = Parsec Void String
parseNumber :: Parser Integer
parseNumber = choice
[ L.decimal
, (string "o#" *> L.octal) <?> "octal integer"
, (string "d#" *> L.decimal) <?> "decimal integer"
, (string "h#" *> L.hexadecimal) <?> "hexadecimal integer" ]
Let's try the parser (note quality of error messages):
λ> parseTest' (parseNumber <* eof) ""
1:1:
|
1 | <empty line>
| ^
unexpected end of input
expecting decimal integer, hexadecimal integer, integer, or octal integer
λ> parseTest' (parseNumber <* eof) "d#3"
3
λ> parseTest' (parseNumber <* eof) "h#ff"
255
λ> parseTest' (parseNumber <* eof) "o#8"
1:3:
|
1 | o#8
| ^
unexpected '8'
expecting octal integer
λ> parseTest' (parseNumber <* eof) "o#77"
63
λ> parseTest' (parseNumber <* eof) "190"
190
Full-disclosure: I'm the author/maintainer of Megaparsec.
You have two slight errors:
One indention error (return x must be indented compared to do) and the parsers in withPrefix must not be returned, since they will return their results anyway.
parseNumber = do x <- noPrefix <|> withPrefix
return x
where noPrefix = many1 digit
withPrefix = do char '#'
prefix <- oneOf "dox"
case prefix of
'd' -> many1 digit
'o' -> fmap (show . fst . head . readOct) (many1 octDigit)
'x' -> fmap (show . fst . head . readHex) (many1 hexDigit)
This should work

Resources