Haskell read variable name - parsing
I need to write a code that parses some language. I got stuck on parsing variable name - it can be anything that is at least 1 char long, starts with lowercase letter and can contain underscore '_' character. I think I made a good start with following code:
identToken :: Parser String
identToken = do
c <- letter
cs <- letdigs
return (c:cs)
where letter = satisfy isLetter
letdigs = munch isLetter +++ munch isDigit +++ munch underscore
num = satisfy isDigit
underscore = \x -> x == '_'
lowerCase = \x -> x `elem` ['a'..'z'] -- how to add this function to current code?
ident :: Parser Ident
ident = do
_ <- skipSpaces
s <- identToken
skipSpaces; return $ s
idents :: Parser Command
idents = do
skipSpaces; ids <- many1 ident
...
This function however gives me a weird results. If I call my test function
test_parseIdents :: String -> Either Error [Ident]
test_parseIdents p =
case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left InvalidParse
multipleRes -> Left (AmbiguousIdents multipleRes)
where
prog :: Parser [Ident]
prog = do
result <- many ident
eof
return result
like this:
test_parseIdents "test"
I get this:
Left (AmbiguousIdents [(["test"],""),(["t","est"],""),(["t","e","st"],""),
(["t","e","st"],""),(["t","est"],""),(["t","e","st"],""),(["t","e","st"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],"")])
Note that Parser is just synonym for ReadP a.
I also want to encode in the parser that variable names should start with a lowercase character.
Thank you for your help.
Part of the problem is with your use of the +++ operator. The following code works for me:
import Data.Char
import Text.ParserCombinators.ReadP
type Parser a = ReadP a
type Ident = String
identToken :: Parser String
identToken = do c <- satisfy lowerCase
cs <- letdigs
return (c:cs)
where lowerCase = \x -> x `elem` ['a'..'z']
underscore = \x -> x == '_'
letdigs = munch (\c -> isLetter c || isDigit c || underscore c)
ident :: Parser Ident
ident = do _ <- skipSpaces
s <- identToken
skipSpaces
return s
test_parseIdents :: String -> Either String [Ident]
test_parseIdents p = case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left "Invalid parse"
multipleRes -> Left ("Ambiguous idents: " ++ show multipleRes)
where prog :: Parser [Ident]
prog = do result <- many ident
eof
return result
main = print $ test_parseIdents "test_1349_zefz"
So what went wrong:
+++ imposes an order on its arguments, and allows for multiple alternatives to succeed (symmetric choice). <++ is left-biased so only the left-most option succeeds -> this would remove the ambiguity in the parse, but still leaves the next problem.
Your parser was looking for letters first, then digits, and finally underscores. Digits after underscores failed, for example. The parser had to be modified to munch characters that were either letters, digits or underscores.
I also removed some functions that were unused and made an educated guess for the definition of your datatypes.
Related
Parser written in Haskell not working as intended
I was playing around with Haskell's parsec library. I was trying to parse a hexadecimal string of the form "#x[0-9A-Fa-f]*" into an integer. This the code I thought would work: module Main where import Control.Monad import Numeric import System.Environment import Text.ParserCombinators.Parsec hiding (spaces) parseHex :: Parser Integer parseHex = do string "#x" x <- many1 hexDigit return (fst (head (readHex x))) testHex :: String -> String testHex input = case parse parseHex "lisp" input of Left err -> "Does not match " ++ show err Right val -> "Matched" ++ show val main :: IO () main = do args <- getArgs putStrLn (testHex (head args)) And then I tried testing the testHex function in Haskell's repl: GHCi, version 8.6.5: http://www.haskell.org/ghc/ :? for help [1 of 1] Compiling Main ( src/Main.hs, interpreted ) Ok, one module loaded. *Main> testHex "#xcafebeef" "Matched3405692655" *Main> testHex "#xnothx" "Does not match \"lisp\" (line 1, column 3):\nunexpected \"n\"\nexpecting hexadecimal digit" *Main> testHex "#xcafexbeef" "Matched51966" The first and second try work as intended. But in the third one, the string is matching upto the invalid character. I do not want the parser to do this, but rather not match if any digit in the string is not a valid string. Why is this happening, and how do if fix this? Thank you!
You need to place eof at the end. parseHex :: Parser Integer parseHex = do string "#x" x <- many1 hexDigit eof return (fst (head (readHex x))) Alternatively, you can compose it with eof where you use it if you want to reuse parseHex in other places. testHex :: String -> String testHex input = case parse (parseHex <* eof) "lisp" input of Left err -> "Does not match " ++ show err Right val -> "Matched" ++ show val
Haskell : Operator Parser keeps going to undefined rather than inputs
I'm practicing writing parsers. I'm using Tsodings JSON Parser video as reference. I'm trying to add to it by being able to parse arithmetic of arbitrary length and I have come up with the following AST. data HVal = HInteger Integer -- No Support For Floats | HBool Bool | HNull | HString String | HChar Char | HList [HVal] | HObj [(String, HVal)] deriving (Show, Eq, Read) data Op -- There's only one operator for the sake of brevity at the moment. = Add deriving (Show, Read) newtype Parser a = Parser { runParser :: String -> Maybe (String, a) } The following functions is my attempt of implementing the operator parser. ops :: [Char] ops = ['+'] isOp :: Char -> Bool isOp c = elem c ops spanP :: (Char -> Bool) -> Parser String spanP f = Parser $ \input -> let (token, rest) = span f input in Just (rest, token) opLiteral :: Parser String opLiteral = spanP isOp sOp :: String -> Op sOp "+" = Add sOp _ = undefined parseOp :: Parser Op parseOp = sOp <$> (charP '"' *> opLiteral <* charP '"') The logic above is similar to how strings are parsed therefore my assumption was that the only difference was looking specifically for an operator rather than anything that's not a number between quotation marks. It does seemingly begin to parse correctly but it then gives me the following error: λ > runParser parseOp "\"+\"" Just ("+\"",*** Exception: Prelude.undefined CallStack (from HasCallStack): error, called at libraries/base/GHC/Err.hs:80:14 in base:GHC.Err undefined, called at /DIRECTORY/parser.hs:110:11 in main:Main I'm confused as to where the error is occurring. I'm assuming it's to do with sOp mainly due to how the other functions work as intended as the rest of parseOp being a translation of the parseString function: stringLiteral :: Parser String stringLiteral = spanP (/= '"') parseString :: Parser HVal parseString = HString <$> (charP '"' *> stringLiteral <* charP '"') The only reason why I have sOp however is that if it was replaced with say Op, I would get the error that the following doesn't exist Op :: String -> Op. When I say this my inclination was that the string coming from the parsed expression would be passed into this function wherein I could return the appropriate operator. This however is incorrect and I'm not sure how to proceed. charP and Applicative Instance charP :: Char -> Parser Char charP x = Parser $ f where f (y:ys) | y == x = Just (ys, x) | otherwise = Nothing f [] = Nothing instance Applicative Parser where pure x = Parser $ \input -> Just (input, x) (Parser p) <*> (Parser q) = Parser $ \input -> do (input', f) <- p input (input', a) <- q input Just (input', f a)
The implementation of (<*>) is the culprit. You did not use input' in the next call to q, but used input instead. As a result you pass the string to the next parser without "eating" characters. You can fix this with: instance Applicative Parser where pure x = Parser $ \input -> Just (input, x) (Parser p) <*> (Parser q) = Parser $ \input -> do (input', f) <- p input (input'', a) <- q input' Just (input'', f a) With the updated instance for Applicative, we get: *Main> runParser parseOp "\"+\"" Just ("",Add)
Parsec Parsing list of different kind of statements
I'm trying to parse (for now) a subset of the Dot language. The grammar is here and my code is the following import System.Environment import System.IO import qualified Text.Parsec.Token as P import Text.ParserCombinators.Parsec.Char -- for letter import Text.Parsec import qualified Control.Applicative as App import Lib type Id = String data Dot = Undirected Id Stmts | Directed Id Stmts deriving (Show) data Stmt = NodeStmt Node | EdgeStmt Edges deriving (Show) type Stmts = [Stmt] data Node = Node Id Attributes deriving (Show) data Edge = Edge Id Id deriving (Show) type Edges = [Edge] data Attribute = Attribute Id Id deriving (Show) type Attributes = [Attribute] dotDef :: P.LanguageDef st dotDef = P.LanguageDef { P.commentStart = "/*" , P.commentEnd = "*/" , P.commentLine = "//" , P.nestedComments = True , P.identStart = letter , P.identLetter = alphaNum , P.reservedNames = ["node", "edge", "graph", "digraph", "subgraph", "strict" ] , P.caseSensitive = True , P.opStart = oneOf "-=" , P.opLetter = oneOf "->" , P.reservedOpNames = [] } lexer = P.makeTokenParser dotDef brackets = P.brackets lexer braces = P.braces lexer identifier = P.identifier lexer reserved = P.reserved lexer semi = P.semi lexer comma = P.comma lexer reservedOp = P.reservedOp lexer eq_op = reservedOp "=" undir_edge_op = reservedOp "--" dir_edge_op = reservedOp "->" edge_op = undir_edge_op <|> dir_edge_op -- -> Attribute attribute = do id1 <- identifier eq_op id2 <- identifier optional (semi <|> comma) return $ Attribute id1 id2 a_list = many attribute bracked_alist = brackets $ option [] a_list attributes = do nestedAttributes <- many1 bracked_alist return $ concat nestedAttributes nodeStmt = do nodeName <- identifier attr <- option [] attributes return $ NodeStmt $ Node nodeName attr dropLast = reverse . tail . reverse edgeStmt = do nodes <- identifier `sepBy1` edge_op return $ EdgeStmt $ fmap (\x -> Edge (fst x) (snd x)) (zip (dropLast nodes) (tail nodes)) stmt = do x <- nodeStmt <|> edgeStmt optional semi return x stmt_list = many stmt graphDecl = do reserved "graph" varName <- option "" identifier stms <- braces stmt_list return $ Undirected varName stms digraphDecl = do reserved "digraph" varName <- option "" identifier stms <- braces stmt_list return $ Directed varName stms topLevel3 = do spaces graphDecl <|> digraphDecl main :: IO () main = do (file:_) <- getArgs content <- readFile file case parse topLevel3 "" content of Right g -> print g Left err -> print err Given this input digraph PZIFOZBO{ a[toto = bar] b ; c ; w // 1 a->b // 2 } It works fine if line 1 or line 2 is commented, but if both are enabled, it fails with (line 3, column 10): unexpected "-" expecting identifier or "}" My understanding it that the parser picks first matching rule (with backtracking). Here both edge and node statement starts with and identifier, so it always pick this one. I tried reversing the order in stmt, without any luck. I also tried to sprinkle some try in stmt, nodeStmt and edgeStmt, without luck either. Any help appreciated.
Note that I get the same error whether or not line 1 is commented out, so: digraph PZIFOZBO{ a->b } also says unexpected "-". As I think you have correctly diagnosed, the problem here is that the stmt parser tries nodeStmt first. That succeeds and parses "a", leaving "->b" yet to be consumed, but ->b isn't a valid statement. Note that Parsec does not backtrack automatically in the absence of a try, so it's not going to go back and revisit this decisions when it "discovers" that ->b can't be parsed. You can "fix" this problem by swapping the order in stmt: x <- edgeStmt <|> nodeStmt but now the parse will break on an expression like a[toto = bar]. That's because edgeStmt is buggy. It parses "a" as a valid statement EdgeStmt [] because sepBy1 allows a single edge "a", which isn't what you want. If you rewrite edgeStmt to require at least one edge: import Control.Monad (guard) edgeStmt = do nodes <- identifier `sepBy1` edge_op guard $ length nodes > 1 return $ EdgeStmt $ fmap (\x -> Edge (fst x) (snd x)) (zip (dropLast nodes) (tail nodes)) and adjust stmt to "try" an edge statement first and backtrack to a node statement: stmt = do x <- try edgeStmt <|> nodeStmt optional semi return x then your example compiles fine.
Parsing juxtaposition-based, indentation-aware syntax using Text.Parsec.Layout
I'm trying to parse a small language with Haskell-like syntax, using parsec-layout. The two key features that don't seem to interact too well with each other are: Function application syntax is juxtaposition, i.e. if F and E are terms, F E is the syntax for F applied to E. Indentation can be used to denote nesting, i.e. the following two are equivalent: X = case Y of A -> V B -> W X = case Y of A -> V; B -> W I haven't managed to figure out a combination of skipping and keeping whitespace that would allow me to parse a list of such definitions. Here's my simplified code: import Text.Parsec hiding (space, runP) import Text.Parsec.Layout import Control.Monad (void) type Parser = Parsec String LayoutEnv data Term = Var String | App Term Term | Case Term [(String, Term)] deriving Show name :: Parser String name = spaced $ (:) <$> upper <*> many alphaNum kw :: String -> Parser () kw = void . spaced . string reserved :: String -> Parser () reserved s = try $ spaced $ string s >> notFollowedBy alphaNum term :: Parser Term term = foldl1 App <$> part `sepBy1` space where part = choice [ caseBlock , Var <$> name ] caseBlock = Case <$> (reserved "case" *> term <* reserved "of") <*> laidout alt alt = (,) <$> (name <* kw "->") <*> term binding :: Parser (String, Term) binding = (,) <$> (name <* kw "=") <*> term -- https://github.com/luqui/parsec-layout/issues/1 trim :: String -> String trim = reverse . dropWhile (== '\n') . reverse runP :: Parser a -> String -> Either ParseError a runP p = runParser (p <* eof) defaultLayoutEnv "" . trim If I try to run it on input like s = unlines [ "A = case B of" , " X -> Y Z" , "C = D" ] via runP (laidout binding) s, it fails on the application Y Z: (line 2, column 10): expecting space or semi-colon However, if I change the definition of term to term = foldl1 App <$> many1 part then it doesn't stop parsing the term at the start of the (unindented!) third line, leading to (line 3, column 4): expecting semi-colon
I think the problem has to do with that name already eliminates the following space, so the sepBy1 in the definition of term doesn't see it. Consider these simplified versions of term: term0 = foldl1 App <$> (Var <$> name) `sepBy1` space term1 = foldl1 App <$> (Var <$> name') `sepBy1` space name' = (:) <$> upper <*> many alphaNum term3 = foldl1 App <$> many (Var <$> name) Then: runP term0 "A B C" -- fails runP term1 "A B C" -- succeeds runP term3 "A B C" -- succeeds I think part of the solution is to define part = [ caseBlock, Var <$> name' ] where name' is as above. However, there are still some issues.
How can I parse a float with a comma in place of the decimal point?
I want to parse Float values from a file where they are stored using comma as the decimal separator. Thus i need a function myParse :: String -> Float such that, for instance, myParse "23,46" == 23.46. I have some ideas about how to do this, but they all seem overcomplicated, for example: replace , with a . in the string and use read; or follow this FP Complete blogpost (entitled Parsing Floats With Parsec), and challenge the curse of the monomorphism restriction. Is there a simpler way, or do I really need to use a parsing library? In the second case, could you please paste some suggestions in order to get me started? The monomorphism restriction scares me, and I believe that there has to be a way to do this without using language extensions.
Replacing , by . and then call read is straightforward enough; you just need to remember to use your own specialized function instead of plain old read: readFloatWithComma :: String -> Float readFloatWithComma = read . sanitize where sanitize = map (\c -> if c == ',' then '.' else c) In GHCi: λ> readFloatWithComma "23,46" 23.46 Regarding the parsec approach, despite what the article you link to suggest, the monomorphism restriction needs not be a worry, as long as you have type signatures for all your top-level bindings. In particular, the following code doesn't need any language extensions to compile properly (at least, in GHC 7.10.1): import Text.Parsec import Text.Parsec.String ( Parser ) import Control.Applicative hiding ( (<|>) ) infixr 5 <++> (<++>) :: Applicative f => f [a] -> f [a] -> f [a] a <++> b = (++) <$> a <*> b infixr 5 <:> (<:>) :: Applicative f => f a -> f [a] -> f [a] a <:> b = (:) <$> a <*> b number :: Parser String number = many1 digit plus :: Parser String plus = char '+' *> number minus :: Parser String minus = char '-' <:> number integer :: Parser String integer = plus <|> minus <|> number float :: Parser Float float = fmap rd $ integer <++> decimal <++> exponent where rd = read :: String -> Float decimal = option "" $ ('.' <$ char ',') <:> number exponent = option "" $ oneOf "eE" <:> integer In GHCi: λ> parseTest float "23,46" 23.46