Parsec fails without error if reading from file - parsing

I wrote a small parsec parser to read samples from a user supplied input string or an input file. It fails properly on wrong input with a useful error message if the input is provided as a semicolon separated string:
> readUncalC14String "test1,7444,37;6800,36;testA,testB,2000,222;test3,7750,40"
*** Exception: Error in parsing dates from string: (line 1, column 29):
unexpected "t"
expecting digit
But it fails silently for the input file inputFile.txt with identical entries:
test1,7444,37
6800,36
testA,testB,2000,222
test3,7750,40
> readUncalC14FromFile "inputFile.txt"
[UncalC14 "test1" 7444 37,UncalC14 "unknownSampleName" 6800 36]
Why is that and how can I make readUncalC14FromFile fail in a useful manner as well?
Here is a minimal subset of my code:
import qualified Text.Parsec as P
import qualified Text.Parsec.String as P
data UncalC14 = UncalC14 String Int Int deriving Show
readUncalC14FromFile :: FilePath -> IO [UncalC14]
readUncalC14FromFile uncalFile = do
s <- readFile uncalFile
case P.runParser uncalC14SepByNewline () "" s of
Left err -> error $ "Error in parsing dates from file: " ++ show err
Right x -> return x
where
uncalC14SepByNewline :: P.Parser [UncalC14]
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces)
readUncalC14String :: String -> Either String [UncalC14]
readUncalC14String s =
case P.runParser uncalC14SepBySemicolon () "" s of
Left err -> error $ "Error in parsing dates from string: " ++ show err
Right x -> Right x
where
uncalC14SepBySemicolon :: P.Parser [UncalC14]
uncalC14SepBySemicolon = P.sepBy parseOneUncalC14 (P.char ';' <* P.spaces)
parseOneUncalC14 :: P.Parser UncalC14
parseOneUncalC14 = do
P.try long P.<|> short
where
long = do
name <- P.many (P.noneOf ",")
_ <- P.oneOf ","
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 name mean std)
short = do
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 "unknownSampleName" mean std)

What is happening here is that a prefix of your input is a valid string. To force parsec to use the whole input you can use the eof parser:
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces) <* P.eof
The reason that one works and the other doesn't is due to the difference between sepBy and endBy. Here is a simpler example:
sepTest, endTest :: String -> Either P.ParseError String
sepTest s = P.runParser (P.sepBy (P.char 'a') (P.char 'b')) () "" s
endTest s = P.runParser (P.endBy (P.char 'a') (P.char 'b')) () "" s
Here are some interesting examples:
ghci> sepTest "abababb"
Left (line 1, column 7):
unexpected "b"
expecting "a"
ghci> endTest "abababb"
Right "aaa"
ghci> sepTest "ababaa"
Right "aaa"
ghci> endTest "ababaa"
Left (line 1, column 6):
unexpected "a"
expecting "b"
As you can see both sepBy and endBy can fail silently, but sepBy fails silently if the prefix doesn't end in the separator b and endBy fails silently if the prefix doesn't end in the main parser a.
So you should use eof after both parsers if you want to make sure you read the whole file/string.

Related

Parsec try : should try to go to next option

Currently, I have the following code:
import Control.Applicative ((<|>))
import Text.Parsec (ParseError, endBy, sepBy, try)
import Text.Parsec.String (Parser)
import qualified Data.Char as Char
import qualified Text.Parsec as Parsec
data Operation = Lt | Gt deriving (Show)
data Value =
Raw String
| Op Operation
deriving (Show)
sampleStr :: String
sampleStr = unlines
[ "#BEGIN#"
, "x <- 3.14 + 2.72;"
, "x < 10;"
]
gtParser :: Parser Value
gtParser = do
Parsec.string "<"
return $ Op Gt
ltParser :: Parser Value
ltParser = do
Parsec.string ">"
return $ Op Lt
opParser :: Parser Value
opParser = gtParser <|> ltParser
rawParser :: Parser Value
rawParser = do
str <- Parsec.many1 $ Parsec.satisfy $ not . Char.isSpace
return $ Raw str
valueParser :: Parser Value
valueParser = try opParser <|> rawParser
eolParser :: Parser Char
eolParser = try (Parsec.char ';' >> Parsec.endOfLine)
<|> Parsec.endOfLine
lineParser :: Parser [Value]
lineParser = sepBy valueParser $ Parsec.many1 $ Parsec.char ' '
fileParser :: Parser [[Value]]
fileParser = endBy lineParser eolParser
parse :: String -> Either ParseError [[Value]]
parse = Parsec.parse fileParser "fail..."
main :: IO ()
main = print $ parse sampleStr
This will fail with the message
Left "fail..." (line 2, column 4):
unexpected "-"
expecting " ", ";" or new-line
To my understanding, since I have try opParser, after Parsec sees that the token <- cannot be parsed by opParser, it should go to rawParser. (It is essentially a lookahead).
What is my misunderstanding, and how do I fix this error?
You can replicate the problem with the smaller test case:
> Parsec.parse fileParser "foo" "x <- 3.14"
The problem is that fileParser first calls lineParser, which successfully parses "x <" into [Raw "x", Op Gt] and leaves "- 3.14" yet to be parsed. Unfortunately, fileParser now expects to parse something with eolParser, but eolParser can't parse "- 3.14" because it starts with neither a semicolon nor an endOfLine.
Your try opParser has no effect here because opParser successfully parses <, so there's nothing to backtrack from.
There are many ways you might fix the problem. If <- is the only case where a < might be misparsed, you could exclude this case with notFollowedBy:
gtParser :: Parser Value
gtParser = do
Parsec.string "<"
notFollowedBy $ Parsec.string "-"
return $ Op Gt

Parser written in Haskell not working as intended

I was playing around with Haskell's parsec library. I was trying to parse a hexadecimal string of the form "#x[0-9A-Fa-f]*" into an integer. This the code I thought would work:
module Main where
import Control.Monad
import Numeric
import System.Environment
import Text.ParserCombinators.Parsec hiding (spaces)
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
return (fst (head (readHex x)))
testHex :: String -> String
testHex input = case parse parseHex "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val
main :: IO ()
main = do
args <- getArgs
putStrLn (testHex (head args))
And then I tried testing the testHex function in Haskell's repl:
GHCi, version 8.6.5: http://www.haskell.org/ghc/ :? for help
[1 of 1] Compiling Main ( src/Main.hs, interpreted )
Ok, one module loaded.
*Main> testHex "#xcafebeef"
"Matched3405692655"
*Main> testHex "#xnothx"
"Does not match \"lisp\" (line 1, column 3):\nunexpected \"n\"\nexpecting hexadecimal digit"
*Main> testHex "#xcafexbeef"
"Matched51966"
The first and second try work as intended. But in the third one, the string is matching upto the invalid character. I do not want the parser to do this, but rather not match if any digit in the string is not a valid string. Why is this happening, and how do if fix this?
Thank you!
You need to place eof at the end.
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
eof
return (fst (head (readHex x)))
Alternatively, you can compose it with eof where you use it if you want to reuse parseHex in other places.
testHex :: String -> String
testHex input = case parse (parseHex <* eof) "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val

Haskell, Parsec - Parsing application in lambda calculus

I am writing a simple parser/interpreter for lambda calculus in Haskell.
I am using the syntax "/(x.x)" for a lambda function, "x" for a name and "f y" for application, such that "/(x.x) y" is the application of identity to y and should return y.
My strategy is to have a parser for names, lambda functions and applications. The name and function parsers work, but not the application parser.
When I try to parse "/(x.x) y" it fails. It should try to parse the source as a name, as a lambda function, and after not being able to use both (because it would not parse all the source) try to parse it as an application and then succeed.
My code is:
module Main where
import Text.ParserCombinators.Parsec
data Expression =
Var String
| Lambda Expression Expression
| Application Expression Expression
deriving Eq
instance Show Expression where
show (Var v) = v
show (Lambda v e) = "/(" ++ show v ++ "." ++ show e ++ ")"
show (Application e1 e2) = "" ++ show e1 ++ " " ++ show e2 ++ ""
-- Parser
parseSource :: String -> Expression
parseSource source =
case parse (parseExpression <* eof) "" source of
Right e -> e
_ -> error "Parsing error"
parseExpression :: Parser Expression
parseExpression =
parseVar <|> parseLambda <|> parseApplication
parseVar :: Parser Expression
parseVar = Var <$> many1 letter
parseLambda :: Parser Expression
parseLambda = do
_ <- char '/'
_ <- char '('
arg <- parseVar
_ <- char '.'
body <- parseExpression
_ <- char ')'
return $ Lambda arg body
parseApplication :: Parser Expression
parseApplication = do
e1 <- parseExpression
_ <- spaces
e2 <- parseExpression
return $ Application e1 e2

Haskell read variable name

I need to write a code that parses some language. I got stuck on parsing variable name - it can be anything that is at least 1 char long, starts with lowercase letter and can contain underscore '_' character. I think I made a good start with following code:
identToken :: Parser String
identToken = do
c <- letter
cs <- letdigs
return (c:cs)
where letter = satisfy isLetter
letdigs = munch isLetter +++ munch isDigit +++ munch underscore
num = satisfy isDigit
underscore = \x -> x == '_'
lowerCase = \x -> x `elem` ['a'..'z'] -- how to add this function to current code?
ident :: Parser Ident
ident = do
_ <- skipSpaces
s <- identToken
skipSpaces; return $ s
idents :: Parser Command
idents = do
skipSpaces; ids <- many1 ident
...
This function however gives me a weird results. If I call my test function
test_parseIdents :: String -> Either Error [Ident]
test_parseIdents p =
case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left InvalidParse
multipleRes -> Left (AmbiguousIdents multipleRes)
where
prog :: Parser [Ident]
prog = do
result <- many ident
eof
return result
like this:
test_parseIdents "test"
I get this:
Left (AmbiguousIdents [(["test"],""),(["t","est"],""),(["t","e","st"],""),
(["t","e","st"],""),(["t","est"],""),(["t","e","st"],""),(["t","e","st"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],"")])
Note that Parser is just synonym for ReadP a.
I also want to encode in the parser that variable names should start with a lowercase character.
Thank you for your help.
Part of the problem is with your use of the +++ operator. The following code works for me:
import Data.Char
import Text.ParserCombinators.ReadP
type Parser a = ReadP a
type Ident = String
identToken :: Parser String
identToken = do c <- satisfy lowerCase
cs <- letdigs
return (c:cs)
where lowerCase = \x -> x `elem` ['a'..'z']
underscore = \x -> x == '_'
letdigs = munch (\c -> isLetter c || isDigit c || underscore c)
ident :: Parser Ident
ident = do _ <- skipSpaces
s <- identToken
skipSpaces
return s
test_parseIdents :: String -> Either String [Ident]
test_parseIdents p = case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left "Invalid parse"
multipleRes -> Left ("Ambiguous idents: " ++ show multipleRes)
where prog :: Parser [Ident]
prog = do result <- many ident
eof
return result
main = print $ test_parseIdents "test_1349_zefz"
So what went wrong:
+++ imposes an order on its arguments, and allows for multiple alternatives to succeed (symmetric choice). <++ is left-biased so only the left-most option succeeds -> this would remove the ambiguity in the parse, but still leaves the next problem.
Your parser was looking for letters first, then digits, and finally underscores. Digits after underscores failed, for example. The parser had to be modified to munch characters that were either letters, digits or underscores.
I also removed some functions that were unused and made an educated guess for the definition of your datatypes.

Haskell Parsec - error messages are less helpful while using custom tokens

I'm working on seperating lexing and parsing stages of a parser. After some tests, I realized error messages are less helpful when I'm using some tokens other than Parsec's Char tokens.
Here are some examples of Parsec's error messages while using Char tokens:
ghci> P.parseTest (string "asdf" >> spaces >> string "ok") "asdf wrong"
parse error at (line 1, column 7):
unexpected "w"
expecting space or "ok"
ghci> P.parseTest (choice [string "ok", string "nop"]) "wrong"
parse error at (line 1, column 1):
unexpected "w"
expecting "ok" or "nop"
So, string parser shows what string is expected when found an unexpected string, and choice parser shows what are alternatives.
But when I use same combinators with my tokens:
ghci> Parser.parseTest ((tok $ Ide "asdf") >> (tok $ Ide "ok")) "asdf "
parse error at "test" (line 1, column 1):
unexpected end of input
In this case, it doesn't print what was expected.
ghci> Parser.parseTest (choice [tok $ Ide "ok", tok $ Ide "nop"]) "asdf "
parse error at (line 1, column 1):
unexpected (Ide "asdf","test" (line 1, column 1))
And when I use choice, it doesn't print alternatives.
I expect this behavior to be related with combinator functions, and not with tokens, but seems like I'm wrong. How can I fix this?
Here's the full lexer + parser code:
Lexer:
module Lexer
( Token(..)
, TokenPos(..)
, tokenize
) where
import Text.ParserCombinators.Parsec hiding (token, tokens)
import Control.Applicative ((<*), (*>), (<$>), (<*>))
data Token = Ide String
| Number String
| Bool String
| LBrack
| RBrack
| LBrace
| RBrace
| Keyword String
deriving (Show, Eq)
type TokenPos = (Token, SourcePos)
ide :: Parser TokenPos
ide = do
pos <- getPosition
fc <- oneOf firstChar
r <- optionMaybe (many $ oneOf rest)
spaces
return $ flip (,) pos $ case r of
Nothing -> Ide [fc]
Just s -> Ide $ [fc] ++ s
where firstChar = ['A'..'Z'] ++ ['a'..'z'] ++ "_"
rest = firstChar ++ ['0'..'9']
parsePos p = (,) <$> p <*> getPosition
lbrack = parsePos $ char '[' >> return LBrack
rbrack = parsePos $ char ']' >> return RBrack
lbrace = parsePos $ char '{' >> return LBrace
rbrace = parsePos $ char '}' >> return RBrace
token = choice
[ ide
, lbrack
, rbrack
, lbrace
, rbrace
]
tokens = spaces *> many (token <* spaces)
tokenize :: SourceName -> String -> Either ParseError [TokenPos]
tokenize = runParser tokens ()
Parser:
module Parser where
import Text.Parsec as P
import Control.Monad.Identity
import Lexer
parseTest :: Show a => Parsec [TokenPos] () a -> String -> IO ()
parseTest p s =
case tokenize "test" s of
Left e -> putStrLn $ show e
Right ts' -> P.parseTest p ts'
tok :: Token -> ParsecT [TokenPos] () Identity Token
tok t = token show snd test
where test (t', _) = case t == t' of
False -> Nothing
True -> Just t
SOLUTION:
Ok, after fp4me's answer and reading Parsec's Char source more carefully, I ended up with this:
{-# LANGUAGE FlexibleContexts #-}
module Parser where
import Text.Parsec as P
import Control.Monad.Identity
import Lexer
parseTest :: Show a => Parsec [TokenPos] () a -> String -> IO ()
parseTest p s =
case tokenize "test" s of
Left e -> putStrLn $ show e
Right ts' -> P.parseTest p ts'
type Parser a = Parsec [TokenPos] () a
advance :: SourcePos -> t -> [TokenPos] -> SourcePos
advance _ _ ((_, pos) : _) = pos
advance pos _ [] = pos
satisfy :: (TokenPos -> Bool) -> Parser Token
satisfy f = tokenPrim show
advance
(\c -> if f c then Just (fst c) else Nothing)
tok :: Token -> ParsecT [TokenPos] () Identity Token
tok t = (Parser.satisfy $ (== t) . fst) <?> show t
Now I'm getting same error messages:
ghci> Parser.parseTest (choice [tok $ Ide "ok", tok $ Ide "nop"]) " asdf"
parse error at (line 1, column 1):
unexpected (Ide "asdf","test" (line 1, column 3))
expecting Ide "ok" or Ide "nop"
A beginning of solution can be to define your choice function in the Parser,
use a specific unexpected function to override unexpected error and finally
use the <?> operator to override the expecting message:
mychoice [] = mzero
mychoice (x:[]) = (tok x <|> myUnexpected) <?> show x
mychoice (x:xs) = ((tok x <|> mychoice xs) <|> myUnexpected) <?> show (x:xs)
myUnexpected = do
input <- getInput
unexpected $ (id $ first input )
where
first [] = "eof"
first (x:xs) = show $ fst x
and call your parser like that :
ghci> Parser.parseTest (mychoice [Ide "ok", Ide "nop"]) "asdf "
parse error at (line 1, column 1):
unexpected Ide "asdf"
expecting [Ide "ok",Ide "nop"]

Resources