My Megaparsec parser gets stuck and ghci debugging isn't helping either - parsing

I have worked through this Megaparsec tutorial and am trying now to write my own parser based on that. I want so write a simple parser for a made-up assembly language:
Label: lda $0ffe
sta %10100110
push $01, $02, $03
This are the simple data types I'm using:
-- Syntax.hs
module Syntax where
import Data.Int
-- |A program is made up of one or more source lines
type Program = [SourceLine]
data SourceLine = SourceLine
{ label :: Maybe String -- ^ Each line may contain a label
, instr :: Maybe String -- ^ This can either be an opcode or an assembler directive
, operand :: Maybe String -- ^ The opcode/instruction may need operand(s)
}
deriving (Show, Eq)
Here's the code of the parser:
--Parser.hs
module Parser where
import Syntax
import Control.Applicative (empty)
import Control.Monad (void)
import Control.Monad.Combinators.Expr
-- import Data.Scientific (toRealFloat)
import Data.Void
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
type Parser = Parsec Void String
-- | Parse a single source code line
sourceline :: Parser SourceLine
sourceline = do
l <- optional labelfield
i <- optional instrfield
o <- optional oprfield
return $ SourceLine l i o
-- TODO: forbid double underscores
-- | Parse the label field of a source line
labelfield :: Parser String
labelfield = (lexeme . try) $ do
l <- identifier
symbol ":"
return l
-- TODO: parse assembler directives starting with an elipse (.)
-- | Parse the instruction field of a source line
instrfield :: Parser String
instrfield = (lexeme . try) $ do
i <- some letterChar
return i
-- | Parse the operand field of a source line
oprfield :: Parser String
oprfield = (lexeme . try) $ do
o <- try identifier
<|> datalist
<|> number
return o
-- | Parses a legal identifier; identifiers must start with a letter
-- and my contain underscores or numbers
identifier :: Parser String
identifier = ((:) <$> letterChar <*> many (alphaNumChar <|> char '_'))
-- | Parse a list of values separated by commas (,)
datalist :: Parser String
datalist = do
x <- some datalist'
y <- number
return $ filter (/='\n') $ unlines x ++ y
datalist' :: Parser String
datalist' = try ((++) <$> number <*> (symbol ","))
-- | Parse numbers
number :: Parser String
number = try binnumber
<|> decnumber
<|> hexnumber
binnumber :: Parser String
binnumber = lexeme ((:) <$> char '%' <*> (some $ binDigitChar))
decnumber :: Parser String
decnumber = lexeme $ some digitChar
hexnumber :: Parser String
hexnumber = lexeme ((:) <$> char '$' <*> (some $ hexDigitChar))
----- Helper Function ----------------------------------------------------------
lineComment :: Parser ()
lineComment = L.skipLineComment "#"
-- eats all whitespace and newline
scn :: Parser ()
scn = L.space space1 lineComment empty
-- eats all whitespace but newline
sc :: Parser ()
sc = L.space (void $ takeWhile1P Nothing f) lineComment empty
where
f x = x == ' ' || x == '\t'
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
symbol :: String -> Parser String
symbol = L.symbol sc
-- this is giving me trouble
prog :: Parser Program
prog = between scn eof (sepEndBy sourceline scn)
I've put the function that is giving me trouble. I've written some tests for these functions, here's the tests:
-- file Spec.hs
import Syntax
import Parser
import Text.Megaparsec
import Test.Hspec
import Test.Hspec.Megaparsec
import Test.QuickCheck
import Control.Exception (evaluate)
main :: IO ()
main = hspec $ do
describe "Label parsing" $ do
it "Parse empty label field" $
parse sourceline "" " " `shouldParse` SourceLine Nothing Nothing Nothing
it "Parse single character lower-case label" $
parse sourceline "" "x:" `shouldParse` SourceLine (Just "x") Nothing Nothing
it "Parse multi-character label" $
parse sourceline "" "label:" `shouldParse` SourceLine (Just "label") Nothing Nothing
it "Parse multi-character label with trailing whitespace" $
parse sourceline "" "label: " `shouldParse` SourceLine (Just "label") Nothing Nothing
it "Parse label with underscore" $
parse sourceline "" "la_bel: " `shouldParse` SourceLine (Just "la_bel") Nothing Nothing
it "Parse label with underscores and numbers" $
parse sourceline "" "l4_b3l: " `shouldParse` SourceLine (Just "l4_b3l") Nothing Nothing
describe "Label and opcode parsing" $ do
it "Parse line with label and opcode" $
parse sourceline "" "label: lda" `shouldParse` SourceLine (Just "label") (Just "lda") Nothing
it "Parse line opcode only" $
parse sourceline "" "lda" `shouldParse` SourceLine Nothing (Just "lda") Nothing
describe "Opcodes and operands parsing" $ do
it "Parse an opcode with symbol operand" $
parse sourceline "" "lda label_2" `shouldParse` SourceLine Nothing (Just "lda") (Just "label_2")
it "Parse an opcode with binary operand" $
parse sourceline "" "lda %01101" `shouldParse` SourceLine Nothing (Just "lda") (Just "%01101")
it "Parse an opcode with decimal operand" $
parse sourceline "" "lda 1234" `shouldParse` SourceLine Nothing (Just "lda") (Just "1234")
it "Parse an opcode with hexdecimal operand" $
parse sourceline "" "lda $affe34" `shouldParse` SourceLine Nothing (Just "lda") (Just "$affe34")
it "Parse a labeled opcode with symbol operand" $
parse sourceline "" "label: lda label_2" `shouldParse` SourceLine (Just "label") (Just "lda") (Just "label_2")
it "Parse a labeled opcode with binary operand" $
parse sourceline "" "labe_l: lda %01101" `shouldParse` SourceLine (Just "labe_l") (Just "lda") (Just "%01101")
it "Parse a labeled opcode with decimal operand" $
parse sourceline "" "label_2: lda 1234" `shouldParse` SourceLine (Just "label_2") (Just "lda") (Just "1234")
it "Parse a labeled opcode with hexdecimal operand" $
parse sourceline "" "l4b3l: lda $affe34" `shouldParse` SourceLine (Just "l4b3l") (Just "lda") (Just "$affe34")
describe "Operand parsing" $ do
it "Parse a value/data list with decimal values" $
parse sourceline "" "lda 12,23,23,43 " `shouldParse` SourceLine Nothing (Just "lda") (Just "12,23,23,43")
it "Parse a value/data list with binary values" $
parse sourceline "" "lda %101,%111,%000,%001 " `shouldParse` SourceLine Nothing (Just "lda") (Just "%101,%111,%000,%001")
it "Parse a value/data list with hexdecimal values" $
parse sourceline "" "lda $101,$affe,$AfF3,$c3D4 " `shouldParse` SourceLine Nothing (Just "lda") (Just "$101,$affe,$AfF3,$c3D4")
it "Parse a value/data list with spaces" $
parse sourceline "" "lda $101, $affe , $AfF3,$c3D4" `shouldParse` SourceLine Nothing (Just "lda") (Just "$101,$affe,$AfF3,$c3D4")
it "Parse a value/data list with spaces and mixed values" $
parse sourceline "" "lda %101, 1234 , $AfF3,$c3D4" `shouldParse` SourceLine Nothing (Just "lda") (Just "%101,1234,$AfF3,$c3D4")
-- describe "Parse multiple lines" $ do
-- it "Parse a 3-line program" $
-- parse prog "" "label1: \n lda $10\nsta %10011001" `shouldParse` [SourceLine (Just "label1") Nothing Nothing,
-- SourceLine Nothing (Just "lda") (Just "$10"),
-- SourceLine Nothing (Just "sta") (Just "%10011001")]
As usual with assembly files, I want to parse the source code line by line. All the tests above pass, except for the out-commented one. Running prog in ghci with parseTest yields the same result, it returns no result and crashes eventually:
*Main Parser Syntax Text.Megaparsec> parseTest sourceline "lda $10 # comment ignored"
SourceLine {label = Nothing, instr = Just "lda", operand = Just "$10"}
*Main Parser Syntax Text.Megaparsec> parseTest prog "lda $10\nsta %1010"
-- crashes
I'm assuming that I am somehow ab-/overusing lexeme in my code to remove trailing whitespace from the parsed strings. What am I missing?

The sepEndBy sourceline scn will keep iterating as long as it can get the sourceLine and scn parsers to match. However, both of those parsers can complete successfully without consuming any input, so they will always match. Since all of the branches of sourceLine have a try, any parse error will cause the parser to back off and just match an infinite number of empty source lines. Even without a parse error, reaching eof will produce an infinite number of source lines.

Related

Parsec try : should try to go to next option

Currently, I have the following code:
import Control.Applicative ((<|>))
import Text.Parsec (ParseError, endBy, sepBy, try)
import Text.Parsec.String (Parser)
import qualified Data.Char as Char
import qualified Text.Parsec as Parsec
data Operation = Lt | Gt deriving (Show)
data Value =
Raw String
| Op Operation
deriving (Show)
sampleStr :: String
sampleStr = unlines
[ "#BEGIN#"
, "x <- 3.14 + 2.72;"
, "x < 10;"
]
gtParser :: Parser Value
gtParser = do
Parsec.string "<"
return $ Op Gt
ltParser :: Parser Value
ltParser = do
Parsec.string ">"
return $ Op Lt
opParser :: Parser Value
opParser = gtParser <|> ltParser
rawParser :: Parser Value
rawParser = do
str <- Parsec.many1 $ Parsec.satisfy $ not . Char.isSpace
return $ Raw str
valueParser :: Parser Value
valueParser = try opParser <|> rawParser
eolParser :: Parser Char
eolParser = try (Parsec.char ';' >> Parsec.endOfLine)
<|> Parsec.endOfLine
lineParser :: Parser [Value]
lineParser = sepBy valueParser $ Parsec.many1 $ Parsec.char ' '
fileParser :: Parser [[Value]]
fileParser = endBy lineParser eolParser
parse :: String -> Either ParseError [[Value]]
parse = Parsec.parse fileParser "fail..."
main :: IO ()
main = print $ parse sampleStr
This will fail with the message
Left "fail..." (line 2, column 4):
unexpected "-"
expecting " ", ";" or new-line
To my understanding, since I have try opParser, after Parsec sees that the token <- cannot be parsed by opParser, it should go to rawParser. (It is essentially a lookahead).
What is my misunderstanding, and how do I fix this error?
You can replicate the problem with the smaller test case:
> Parsec.parse fileParser "foo" "x <- 3.14"
The problem is that fileParser first calls lineParser, which successfully parses "x <" into [Raw "x", Op Gt] and leaves "- 3.14" yet to be parsed. Unfortunately, fileParser now expects to parse something with eolParser, but eolParser can't parse "- 3.14" because it starts with neither a semicolon nor an endOfLine.
Your try opParser has no effect here because opParser successfully parses <, so there's nothing to backtrack from.
There are many ways you might fix the problem. If <- is the only case where a < might be misparsed, you could exclude this case with notFollowedBy:
gtParser :: Parser Value
gtParser = do
Parsec.string "<"
notFollowedBy $ Parsec.string "-"
return $ Op Gt

Parsec fails without error if reading from file

I wrote a small parsec parser to read samples from a user supplied input string or an input file. It fails properly on wrong input with a useful error message if the input is provided as a semicolon separated string:
> readUncalC14String "test1,7444,37;6800,36;testA,testB,2000,222;test3,7750,40"
*** Exception: Error in parsing dates from string: (line 1, column 29):
unexpected "t"
expecting digit
But it fails silently for the input file inputFile.txt with identical entries:
test1,7444,37
6800,36
testA,testB,2000,222
test3,7750,40
> readUncalC14FromFile "inputFile.txt"
[UncalC14 "test1" 7444 37,UncalC14 "unknownSampleName" 6800 36]
Why is that and how can I make readUncalC14FromFile fail in a useful manner as well?
Here is a minimal subset of my code:
import qualified Text.Parsec as P
import qualified Text.Parsec.String as P
data UncalC14 = UncalC14 String Int Int deriving Show
readUncalC14FromFile :: FilePath -> IO [UncalC14]
readUncalC14FromFile uncalFile = do
s <- readFile uncalFile
case P.runParser uncalC14SepByNewline () "" s of
Left err -> error $ "Error in parsing dates from file: " ++ show err
Right x -> return x
where
uncalC14SepByNewline :: P.Parser [UncalC14]
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces)
readUncalC14String :: String -> Either String [UncalC14]
readUncalC14String s =
case P.runParser uncalC14SepBySemicolon () "" s of
Left err -> error $ "Error in parsing dates from string: " ++ show err
Right x -> Right x
where
uncalC14SepBySemicolon :: P.Parser [UncalC14]
uncalC14SepBySemicolon = P.sepBy parseOneUncalC14 (P.char ';' <* P.spaces)
parseOneUncalC14 :: P.Parser UncalC14
parseOneUncalC14 = do
P.try long P.<|> short
where
long = do
name <- P.many (P.noneOf ",")
_ <- P.oneOf ","
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 name mean std)
short = do
mean <- read <$> P.many1 P.digit
_ <- P.oneOf ","
std <- read <$> P.many1 P.digit
return (UncalC14 "unknownSampleName" mean std)
What is happening here is that a prefix of your input is a valid string. To force parsec to use the whole input you can use the eof parser:
uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces) <* P.eof
The reason that one works and the other doesn't is due to the difference between sepBy and endBy. Here is a simpler example:
sepTest, endTest :: String -> Either P.ParseError String
sepTest s = P.runParser (P.sepBy (P.char 'a') (P.char 'b')) () "" s
endTest s = P.runParser (P.endBy (P.char 'a') (P.char 'b')) () "" s
Here are some interesting examples:
ghci> sepTest "abababb"
Left (line 1, column 7):
unexpected "b"
expecting "a"
ghci> endTest "abababb"
Right "aaa"
ghci> sepTest "ababaa"
Right "aaa"
ghci> endTest "ababaa"
Left (line 1, column 6):
unexpected "a"
expecting "b"
As you can see both sepBy and endBy can fail silently, but sepBy fails silently if the prefix doesn't end in the separator b and endBy fails silently if the prefix doesn't end in the main parser a.
So you should use eof after both parsers if you want to make sure you read the whole file/string.

Writing parsers with Text.ParserCombinators

In class, we wrote parsers by defining our own Parser type and this gave us a lot of flexibility. For example, if we wanted to make code and parse out "[at]" as '#', we could write
atParser = Parser $ \s ->
case s of
w:x:y:z:zs ->
| (w:x:y:z:[]) == "[at]" = ['#',zs]
| otherwise = []
zs -> []
However, I cannot figure out how to implement this sort of parser using Text.ParserCombinators. Is it possible?
Thanks
I believe you're looking for the string combinator.
λ> :set -package parsec
package flags have changed, resetting and loading new packages...
λ> import Text.Parsec
λ> :t string
string :: Stream s m Char => String -> ParsecT s u m String
λ> parse (string "[at]") "" "[at]"
Right "[at]"
λ> parse (string "[at]") "" "[at"
Left (line 1, column 1):
unexpected end of input
expecting "[at]"
λ> parse ('#' <$ string "[at]") "" "[at]"
Right '#'

Unexpected end of input with Parsec

I try to parse the following text file with series of data between keywords :
many text many text many text
BEGIN
T LISTE2
1 154
2 321
3 519
4 520
5 529
6 426
END
many text many text many text
By using the following haskell program
import Text.Parsec
import Text.Parsec.String
import Text.Parsec.Char
import Text.Parsec.Combinator
endOfLine :: Parser String
endOfLine = try (string "\n")
<|> try (string "\r\n")
line = many $ noneOf "\n"
parseListing = do
spaces
many $ noneOf "\n"
spaces
cont <- between (string "BEGIN\n") (string "END\n") $ endBy line endOfLine
spaces
many $ noneOf "\n"
spaces
eof
return cont
main :: IO ()
main = do
file <- readFile ("test_list.txt")
case parse parseListing "(stdin)" file of
Left err -> do putStrLn "!!! Error !!!"
print err
Right resu -> do putStrLn $ concat resu
And when I parse my text file, I get the following error :
"(stdin)" (line 16, column 1):
unexpected end of input
expecting "\n", "\r\n" or "END\n"
I'm a newbie with parsing and I don't understand why it fail?
My sequence is yet between BEGIN and END
Do you know what is wrong with my parser and how to correct it ?
Your between will never stop, because endBy line endOfLine consumes any line and END\n too, so it will eat more and more lines until it fails.
Then your parser tries to consume string "END\n" and fails too, that's why error message mentions "END\n"
You must rewrite line parser to fail on END\n. For example:
parseListing :: Parsec String () [String]
parseListing = do
spaces
many $ noneOf "\n"
spaces
cont <- between begin end $ endBy (notFollowedBy end >> line) endOfLine
spaces
many $ noneOf "\n"
spaces
eof
return cont
where
begin = string "BEGIN\n"
end = string "END\n"

Haskell Parser Fails on "|" Read

I am working on a parser in Haskell using Parsec. The issue lies in reading in the string "| ". When I attempt to read in the following,
parseExpr = parseAtom
-- | ...
<|> do string "{|"
args <- try parseList <|> parseDottedList
string "| "
body <- try parseExpr
string " }"
return $ List [Atom "lambda", args, body]
I get a parse error, the following.
Lampas >> {|a b| "a" }
Parse error at "lisp" (line 1, column 12):
unexpected "}"
expecting letter, "\"", digit, "'", "(", "[", "{|" or "."
Another failing case is ^ which bears the following.
Lampas >> {|a b^ "a" }
Parse error at "lisp" (line 1, column 12):
unexpected "}"
expecting letter, "\"", digit, "'", "(", "[", "{|" or "."
However, it works as expected when the string "| " is replaced with "} ".
parseExpr = parseAtom
-- | ...
<|> do string "{|"
args <- try parseList <|> parseDottedList
string "} "
body <- try parseExpr
string " }"
return $ List [Atom "lambda", args, body]
The following is the REPL behavior with the above modification.
Lampas >> {|a b} "a" }
(lambda ("a" "b") ...)
So the question is (a) does pipe have a special behavior in Haskell strings, perhaps only in <|> chains?, and (b) how is this behavior averted?.
The character | may be in a set of reserved characters. Test with other characters, like ^, and I assume it will fail just as well. The only way around this would probably be to change the set of reserved characters, or the structure of your interpreter.

Resources