Haskell read variable name - parsing

I need to write a code that parses some language. I got stuck on parsing variable name - it can be anything that is at least 1 char long, starts with lowercase letter and can contain underscore '_' character. I think I made a good start with following code:
identToken :: Parser String
identToken = do
c <- letter
cs <- letdigs
return (c:cs)
where letter = satisfy isLetter
letdigs = munch isLetter +++ munch isDigit +++ munch underscore
num = satisfy isDigit
underscore = \x -> x == '_'
lowerCase = \x -> x `elem` ['a'..'z'] -- how to add this function to current code?
ident :: Parser Ident
ident = do
_ <- skipSpaces
s <- identToken
skipSpaces; return $ s
idents :: Parser Command
idents = do
skipSpaces; ids <- many1 ident
...
This function however gives me a weird results. If I call my test function
test_parseIdents :: String -> Either Error [Ident]
test_parseIdents p =
case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left InvalidParse
multipleRes -> Left (AmbiguousIdents multipleRes)
where
prog :: Parser [Ident]
prog = do
result <- many ident
eof
return result
like this:
test_parseIdents "test"
I get this:
Left (AmbiguousIdents [(["test"],""),(["t","est"],""),(["t","e","st"],""),
(["t","e","st"],""),(["t","est"],""),(["t","e","st"],""),(["t","e","st"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],"")])
Note that Parser is just synonym for ReadP a.
I also want to encode in the parser that variable names should start with a lowercase character.
Thank you for your help.

Part of the problem is with your use of the +++ operator. The following code works for me:
import Data.Char
import Text.ParserCombinators.ReadP
type Parser a = ReadP a
type Ident = String
identToken :: Parser String
identToken = do c <- satisfy lowerCase
cs <- letdigs
return (c:cs)
where lowerCase = \x -> x `elem` ['a'..'z']
underscore = \x -> x == '_'
letdigs = munch (\c -> isLetter c || isDigit c || underscore c)
ident :: Parser Ident
ident = do _ <- skipSpaces
s <- identToken
skipSpaces
return s
test_parseIdents :: String -> Either String [Ident]
test_parseIdents p = case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left "Invalid parse"
multipleRes -> Left ("Ambiguous idents: " ++ show multipleRes)
where prog :: Parser [Ident]
prog = do result <- many ident
eof
return result
main = print $ test_parseIdents "test_1349_zefz"
So what went wrong:
+++ imposes an order on its arguments, and allows for multiple alternatives to succeed (symmetric choice). <++ is left-biased so only the left-most option succeeds -> this would remove the ambiguity in the parse, but still leaves the next problem.
Your parser was looking for letters first, then digits, and finally underscores. Digits after underscores failed, for example. The parser had to be modified to munch characters that were either letters, digits or underscores.
I also removed some functions that were unused and made an educated guess for the definition of your datatypes.

Related

Parser written in Haskell not working as intended

I was playing around with Haskell's parsec library. I was trying to parse a hexadecimal string of the form "#x[0-9A-Fa-f]*" into an integer. This the code I thought would work:
module Main where
import Control.Monad
import Numeric
import System.Environment
import Text.ParserCombinators.Parsec hiding (spaces)
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
return (fst (head (readHex x)))
testHex :: String -> String
testHex input = case parse parseHex "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val
main :: IO ()
main = do
args <- getArgs
putStrLn (testHex (head args))
And then I tried testing the testHex function in Haskell's repl:
GHCi, version 8.6.5: http://www.haskell.org/ghc/ :? for help
[1 of 1] Compiling Main ( src/Main.hs, interpreted )
Ok, one module loaded.
*Main> testHex "#xcafebeef"
"Matched3405692655"
*Main> testHex "#xnothx"
"Does not match \"lisp\" (line 1, column 3):\nunexpected \"n\"\nexpecting hexadecimal digit"
*Main> testHex "#xcafexbeef"
"Matched51966"
The first and second try work as intended. But in the third one, the string is matching upto the invalid character. I do not want the parser to do this, but rather not match if any digit in the string is not a valid string. Why is this happening, and how do if fix this?
Thank you!
You need to place eof at the end.
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
eof
return (fst (head (readHex x)))
Alternatively, you can compose it with eof where you use it if you want to reuse parseHex in other places.
testHex :: String -> String
testHex input = case parse (parseHex <* eof) "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val

Haskell : Operator Parser keeps going to undefined rather than inputs

I'm practicing writing parsers. I'm using Tsodings JSON Parser video as reference. I'm trying to add to it by being able to parse arithmetic of arbitrary length and I have come up with the following AST.
data HVal
= HInteger Integer -- No Support For Floats
| HBool Bool
| HNull
| HString String
| HChar Char
| HList [HVal]
| HObj [(String, HVal)]
deriving (Show, Eq, Read)
data Op -- There's only one operator for the sake of brevity at the moment.
= Add
deriving (Show, Read)
newtype Parser a = Parser {
runParser :: String -> Maybe (String, a)
}
The following functions is my attempt of implementing the operator parser.
ops :: [Char]
ops = ['+']
isOp :: Char -> Bool
isOp c = elem c ops
spanP :: (Char -> Bool) -> Parser String
spanP f = Parser $ \input -> let (token, rest) = span f input
in Just (rest, token)
opLiteral :: Parser String
opLiteral = spanP isOp
sOp :: String -> Op
sOp "+" = Add
sOp _ = undefined
parseOp :: Parser Op
parseOp = sOp <$> (charP '"' *> opLiteral <* charP '"')
The logic above is similar to how strings are parsed therefore my assumption was that the only difference was looking specifically for an operator rather than anything that's not a number between quotation marks. It does seemingly begin to parse correctly but it then gives me the following error:
λ > runParser parseOp "\"+\""
Just ("+\"",*** Exception: Prelude.undefined
CallStack (from HasCallStack):
error, called at libraries/base/GHC/Err.hs:80:14 in base:GHC.Err
undefined, called at /DIRECTORY/parser.hs:110:11 in main:Main
I'm confused as to where the error is occurring. I'm assuming it's to do with sOp mainly due to how the other functions work as intended as the rest of parseOp being a translation of the parseString function:
stringLiteral :: Parser String
stringLiteral = spanP (/= '"')
parseString :: Parser HVal
parseString = HString <$> (charP '"' *> stringLiteral <* charP '"')
The only reason why I have sOp however is that if it was replaced with say Op, I would get the error that the following doesn't exist Op :: String -> Op. When I say this my inclination was that the string coming from the parsed expression would be passed into this function wherein I could return the appropriate operator. This however is incorrect and I'm not sure how to proceed.
charP and Applicative Instance
charP :: Char -> Parser Char
charP x = Parser $ f
where f (y:ys)
| y == x = Just (ys, x)
| otherwise = Nothing
f [] = Nothing
instance Applicative Parser where
pure x = Parser $ \input -> Just (input, x)
(Parser p) <*> (Parser q) = Parser $ \input -> do
(input', f) <- p input
(input', a) <- q input
Just (input', f a)
The implementation of (<*>) is the culprit. You did not use input' in the next call to q, but used input instead. As a result you pass the string to the next parser without "eating" characters. You can fix this with:
instance Applicative Parser where
pure x = Parser $ \input -> Just (input, x)
(Parser p) <*> (Parser q) = Parser $ \input -> do
(input', f) <- p input
(input'', a) <- q input'
Just (input'', f a)
With the updated instance for Applicative, we get:
*Main> runParser parseOp "\"+\""
Just ("",Add)

Parsec Parsing list of different kind of statements

I'm trying to parse (for now) a subset of the Dot language.
The grammar is here and my code is the following
import System.Environment
import System.IO
import qualified Text.Parsec.Token as P
import Text.ParserCombinators.Parsec.Char -- for letter
import Text.Parsec
import qualified Control.Applicative as App
import Lib
type Id = String
data Dot = Undirected Id Stmts
| Directed Id Stmts
deriving (Show)
data Stmt = NodeStmt Node | EdgeStmt Edges
deriving (Show)
type Stmts = [Stmt]
data Node = Node Id Attributes deriving (Show)
data Edge = Edge Id Id deriving (Show)
type Edges = [Edge]
data Attribute = Attribute Id Id deriving (Show)
type Attributes = [Attribute]
dotDef :: P.LanguageDef st
dotDef = P.LanguageDef
{ P.commentStart = "/*"
, P.commentEnd = "*/"
, P.commentLine = "//"
, P.nestedComments = True
, P.identStart = letter
, P.identLetter = alphaNum
, P.reservedNames = ["node", "edge", "graph", "digraph", "subgraph", "strict" ]
, P.caseSensitive = True
, P.opStart = oneOf "-="
, P.opLetter = oneOf "->"
, P.reservedOpNames = []
}
lexer = P.makeTokenParser dotDef
brackets = P.brackets lexer
braces = P.braces lexer
identifier = P.identifier lexer
reserved = P.reserved lexer
semi = P.semi lexer
comma = P.comma lexer
reservedOp = P.reservedOp lexer
eq_op = reservedOp "="
undir_edge_op = reservedOp "--"
dir_edge_op = reservedOp "->"
edge_op = undir_edge_op <|> dir_edge_op
-- -> Attribute
attribute = do
id1 <- identifier
eq_op
id2 <- identifier
optional (semi <|> comma)
return $ Attribute id1 id2
a_list = many attribute
bracked_alist =
brackets $ option [] a_list
attributes =
do
nestedAttributes <- many1 bracked_alist
return $ concat nestedAttributes
nodeStmt = do
nodeName <- identifier
attr <- option [] attributes
return $ NodeStmt $ Node nodeName attr
dropLast = reverse . tail . reverse
edgeStmt = do
nodes <- identifier `sepBy1` edge_op
return $ EdgeStmt $ fmap (\x -> Edge (fst x) (snd x)) (zip (dropLast nodes) (tail nodes))
stmt = do
x <- nodeStmt <|> edgeStmt
optional semi
return x
stmt_list = many stmt
graphDecl = do
reserved "graph"
varName <- option "" identifier
stms <- braces stmt_list
return $ Undirected varName stms
digraphDecl = do
reserved "digraph"
varName <- option "" identifier
stms <- braces stmt_list
return $ Directed varName stms
topLevel3 = do
spaces
graphDecl <|> digraphDecl
main :: IO ()
main = do
(file:_) <- getArgs
content <- readFile file
case parse topLevel3 "" content of
Right g -> print g
Left err -> print err
Given this input
digraph PZIFOZBO{
a[toto = bar] b ; c ; w // 1
a->b // 2
}
It works fine if line 1 or line 2 is commented, but if both are enabled, it fails with
(line 3, column 10): unexpected "-" expecting identifier or "}"
My understanding it that the parser picks first matching rule (with backtracking). Here both edge and node statement starts with and identifier, so it always pick this one.
I tried reversing the order in stmt, without any luck.
I also tried to sprinkle some try in stmt, nodeStmt and edgeStmt, without luck either.
Any help appreciated.
Note that I get the same error whether or not line 1 is commented out, so:
digraph PZIFOZBO{
a->b
}
also says unexpected "-".
As I think you have correctly diagnosed, the problem here is that the stmt parser tries nodeStmt first. That succeeds and parses "a", leaving "->b" yet to be consumed, but ->b isn't a valid statement. Note that Parsec does not backtrack automatically in the absence of a try, so it's not going to go back and revisit this decisions when it "discovers" that ->b can't be parsed.
You can "fix" this problem by swapping the order in stmt:
x <- edgeStmt <|> nodeStmt
but now the parse will break on an expression like a[toto = bar]. That's because edgeStmt is buggy. It parses "a" as a valid statement EdgeStmt [] because sepBy1 allows a single edge "a", which isn't what you want.
If you rewrite edgeStmt to require at least one edge:
import Control.Monad (guard)
edgeStmt = do
nodes <- identifier `sepBy1` edge_op
guard $ length nodes > 1
return $ EdgeStmt $ fmap (\x -> Edge (fst x) (snd x)) (zip (dropLast nodes) (tail nodes))
and adjust stmt to "try" an edge statement first and backtrack to a node statement:
stmt = do
x <- try edgeStmt <|> nodeStmt
optional semi
return x
then your example compiles fine.

Parsing juxtaposition-based, indentation-aware syntax using Text.Parsec.Layout

I'm trying to parse a small language with Haskell-like syntax, using parsec-layout. The two key features that don't seem to interact too well with each other are:
Function application syntax is juxtaposition, i.e. if F and E are terms, F E is the syntax for F applied to E.
Indentation can be used to denote nesting, i.e. the following two are equivalent:
X = case Y of
A -> V
B -> W
X = case Y of A -> V; B -> W
I haven't managed to figure out a combination of skipping and keeping whitespace that would allow me to parse a list of such definitions. Here's my simplified code:
import Text.Parsec hiding (space, runP)
import Text.Parsec.Layout
import Control.Monad (void)
type Parser = Parsec String LayoutEnv
data Term = Var String
| App Term Term
| Case Term [(String, Term)]
deriving Show
name :: Parser String
name = spaced $ (:) <$> upper <*> many alphaNum
kw :: String -> Parser ()
kw = void . spaced . string
reserved :: String -> Parser ()
reserved s = try $ spaced $ string s >> notFollowedBy alphaNum
term :: Parser Term
term = foldl1 App <$> part `sepBy1` space
where
part = choice [ caseBlock
, Var <$> name
]
caseBlock = Case <$> (reserved "case" *> term <* reserved "of") <*> laidout alt
alt = (,) <$> (name <* kw "->") <*> term
binding :: Parser (String, Term)
binding = (,) <$> (name <* kw "=") <*> term
-- https://github.com/luqui/parsec-layout/issues/1
trim :: String -> String
trim = reverse . dropWhile (== '\n') . reverse
runP :: Parser a -> String -> Either ParseError a
runP p = runParser (p <* eof) defaultLayoutEnv "" . trim
If I try to run it on input like
s = unlines [ "A = case B of"
, " X -> Y Z"
, "C = D"
]
via runP (laidout binding) s, it fails on the application Y Z:
(line 2, column 10):
expecting space or semi-colon
However, if I change the definition of term to
term = foldl1 App <$> many1 part
then it doesn't stop parsing the term at the start of the (unindented!) third line, leading to
(line 3, column 4):
expecting semi-colon
I think the problem has to do with that name already eliminates the following space, so the sepBy1 in the definition of term doesn't see it.
Consider these simplified versions of term:
term0 = foldl1 App <$> (Var <$> name) `sepBy1` space
term1 = foldl1 App <$> (Var <$> name') `sepBy1` space
name' = (:) <$> upper <*> many alphaNum
term3 = foldl1 App <$> many (Var <$> name)
Then:
runP term0 "A B C" -- fails
runP term1 "A B C" -- succeeds
runP term3 "A B C" -- succeeds
I think part of the solution is to define
part = [ caseBlock, Var <$> name' ]
where name' is as above. However, there are still some issues.

How can I parse a float with a comma in place of the decimal point?

I want to parse Float values from a file where they are stored using comma as the decimal separator. Thus i need a function myParse :: String -> Float such that, for instance, myParse "23,46" == 23.46.
I have some ideas about how to do this, but they all seem overcomplicated, for example:
replace , with a . in the string and use read; or
follow this FP Complete blogpost (entitled Parsing Floats With Parsec), and challenge the curse of the monomorphism restriction.
Is there a simpler way, or do I really need to use a parsing library? In the second case, could you please paste some suggestions in order to get me started? The monomorphism restriction scares me, and I believe that there has to be a way to do this without using language extensions.
Replacing , by . and then call read is straightforward enough; you just need to remember to use your own specialized function instead of plain old read:
readFloatWithComma :: String -> Float
readFloatWithComma = read . sanitize
where
sanitize = map (\c -> if c == ',' then '.' else c)
In GHCi:
λ> readFloatWithComma "23,46"
23.46
Regarding the parsec approach, despite what the article you link to suggest, the monomorphism restriction needs not be a worry, as long as you have type signatures for all your top-level bindings. In particular, the following code doesn't need any language extensions to compile properly (at least, in GHC 7.10.1):
import Text.Parsec
import Text.Parsec.String ( Parser )
import Control.Applicative hiding ( (<|>) )
infixr 5 <++>
(<++>) :: Applicative f => f [a] -> f [a] -> f [a]
a <++> b = (++) <$> a <*> b
infixr 5 <:>
(<:>) :: Applicative f => f a -> f [a] -> f [a]
a <:> b = (:) <$> a <*> b
number :: Parser String
number = many1 digit
plus :: Parser String
plus = char '+' *> number
minus :: Parser String
minus = char '-' <:> number
integer :: Parser String
integer = plus <|> minus <|> number
float :: Parser Float
float = fmap rd $ integer <++> decimal <++> exponent
where rd = read :: String -> Float
decimal = option "" $ ('.' <$ char ',') <:> number
exponent = option "" $ oneOf "eE" <:> integer
In GHCi:
λ> parseTest float "23,46"
23.46

Resources