Parsing simple molecule names with Attoparsec - parsing

I find it extremely difficult to learn how to use Attoparsec, because the documentation is really just an API documentation and there are basically no tutorials around (except the one from FPComplete). If you know other places where I can learn Attoparsec, that'd be great.
I have to parse simple molecule names, in the following format: NaCl, CO2, H2O, HCN, H2O2.
An element name is an uppercase letter optionally followed by a lowercase one (I'm not considering those elements with a symbol longer than 2 characters).
An element can be followed by a number (that would be the subscript in a formula).
New version (thanks to Mark's and Tarmil's suggestions), which compiles but does not parse:
module Chem
where
import Data.Text (Text, pack)
import Control.Applicative ((<*>), (<$>))
import Data.Attoparsec.Text
data Element = Element String Int deriving (Eq, Ord, Show)
type Molecule = [Element]
parseString :: String -> Result Molecule
parseString = parse (many' parseElement) . pack
parseElement :: Parser Element
parseElement = do
el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
n <- option 1 decimal
return $ Element el n
pClass :: String -> Parser String
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)
Any suggestion is appreciated.
EDIT: I managed to get it running. Basically, a Partial continuation was returned, and to finish the parsing it's necessary to feed the parser with an empty bytestring. So the correct parseString would be:
parseString = flip feed empty . parse (many' parseElement) . pack
where empty is Data.Text.empty. However, since I don't need incremental parsing there is the useful function parseOnly, which does not wait for more input and returns an Either.
With that in mind, I rewrote the code like this (it works now):
module Chem
where
import Data.Text (Text, pack)
import Control.Applicative ((<*>), (<$>))
import Data.Attoparsec.Text
data Element = Element String Int deriving (Eq, Ord, Show)
type Molecule = [Element]
parseString :: String -> Either String Molecule
parseString = parseOnly (many' parseElement) . pack
parseElement :: Parser Element
parseElement = do
el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
n <- option 1 decimal
return $ Element el n
pClass :: String -> Parser String
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)

You have two problems in the letters parsing part:
inClass is not a parser, it is a function that is meant to be passed to satisfy.
<*> has type Parser (a -> b) -> Parser a -> Parser b, so the parser on the left should return a function. Typically, it is used like this:
pf <$> p1 <*> p2 <*> ... <*> pn
where pf is a function with n arguments.
So here you probably want something like this:
-- parse a character in the given class, and transform it to a single-char string
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)
-- ...
el <- ((++) <$> pClass "A-Z" <*> pClass "a-z") <|> pClass "A-Z"
-- ...
I think this would be enhanced by using option, instead of duplicating the A-Z parser:
el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")

Related

Parser written in Haskell not working as intended

I was playing around with Haskell's parsec library. I was trying to parse a hexadecimal string of the form "#x[0-9A-Fa-f]*" into an integer. This the code I thought would work:
module Main where
import Control.Monad
import Numeric
import System.Environment
import Text.ParserCombinators.Parsec hiding (spaces)
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
return (fst (head (readHex x)))
testHex :: String -> String
testHex input = case parse parseHex "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val
main :: IO ()
main = do
args <- getArgs
putStrLn (testHex (head args))
And then I tried testing the testHex function in Haskell's repl:
GHCi, version 8.6.5: http://www.haskell.org/ghc/ :? for help
[1 of 1] Compiling Main ( src/Main.hs, interpreted )
Ok, one module loaded.
*Main> testHex "#xcafebeef"
"Matched3405692655"
*Main> testHex "#xnothx"
"Does not match \"lisp\" (line 1, column 3):\nunexpected \"n\"\nexpecting hexadecimal digit"
*Main> testHex "#xcafexbeef"
"Matched51966"
The first and second try work as intended. But in the third one, the string is matching upto the invalid character. I do not want the parser to do this, but rather not match if any digit in the string is not a valid string. Why is this happening, and how do if fix this?
Thank you!
You need to place eof at the end.
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
eof
return (fst (head (readHex x)))
Alternatively, you can compose it with eof where you use it if you want to reuse parseHex in other places.
testHex :: String -> String
testHex input = case parse (parseHex <* eof) "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val

How do I parse S-expressions into a data structure in Haskell?

I'm new to Haskell and could use some guidance.
The challenge: take an S-expression and parse it into a record.
Where I have succeeded: I can take a file and read it into a parsed String.
Yet, using parsing Text to DFA s.t
let
toDFA :: [L.Text] -> EntryDFA
toDFA t =
let [q,a,d,s,f] = t
in EntryDFA {
state = read q
,alpha = read a
,delta = read d
,start = read s
,final = read f }
returns this error:
• Couldn't match type ‘L.Text’ with ‘[Char]’
Expected type: String
Actual type: L.Text
There must be a more idiomatic approach.
read is a partial function with type Read a => String -> a, which throws an exception on parsing failure. Normally you want to avoid it (use readMaybe instead if you have a string). String and L.Text are different types, which is why you're getting an error.
Your sample code is missing an extra ) after the trans-func.
I'm using the Megaparsec package which provides an easy way to work with parser combinators. The author of the library has written a longer tutorial here.
The basic idea is that Parser a is the type of a value that can parse something of type a. In Text.Megaparsec there are several functions which you can use (parse, parseMaybe etc.), to "run" the parser on a "stringy" data type (e.g. String or strict/lazy Text).
When you use do notation for IO, it means "do one action after another". Similarly, you can use do notation with Parser, it means "parse this one thing, then parse the next thing".
p1 *> p2 means run the parser p1, run p2 and return the result of running p2. p1 <* p2 means run the parser p1, run p2 and return the result of running p1. You can also look up documentation on Hoogle in case you're having trouble understanding something.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE NamedFieldPuns #-}
-- In practice, many of these imports would be unqualified, but I've
-- opted for explicitness for clarity.
import Control.Applicative (empty, many, some, (<*), (*>))
import Control.Exception (try, IOException)
import Data.Maybe (fromMaybe)
import Data.Set (Set)
import Data.Text (Text)
import qualified Data.Set as Set
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.Char as MPC
import qualified Text.Megaparsec.Char.Lexer as MPCL
type Q = Text
type E = Char
data EntryDFA = EntryDFA
{ state :: Set Q
, alpha :: Set E
, delta :: Set (Q,E,Q)
, start :: Q
, final :: Set Q
} deriving Show
inputFile = "foo.sexp"
main :: IO ()
main = do
-- read file and check for exception instead of checking if
-- it exists and then trying to read it
result <- try (TIO.readFile inputFile)
case result of
Left e -> print (e :: IOException)
Right txt -> do
case MP.parse dfaParser inputFile txt of
Left e -> print e
Right dfa -> print dfa
type Parser = MP.Parsec () Text
-- There are no comments in the S-exprs, so leave those empty
spaceConsumer :: Parser ()
spaceConsumer = MPCL.space MPC.space1 empty empty
symbol :: Text -> Parser Text
symbol txt = MPCL.symbol spaceConsumer txt
parens :: Parser a -> Parser a
parens p = MP.between (symbol "(") (symbol ")") p
setP :: Ord a => Parser a -> Parser (Set a)
setP p = do
items <- parens (p `MP.sepBy1` (symbol ","))
return (Set.fromList items)
pair :: Parser a -> Parser b -> Parser (a, b)
pair p1 p2 = parens $ do
x1 <- p1
x2 <- symbol "," *> p2
return (x1, x2)
stateP :: Parser Text
stateP = do
c <- MPC.letterChar
cs <- many MPC.alphaNumChar
return (T.pack (c:cs))
dfaParser :: Parser EntryDFA
dfaParser = do
() <- spaceConsumer
(_, state) <- pair (symbol "states") (setP stateP)
(_, alpha) <- pair (symbol "alpha") (setP alphaP)
(_, delta) <- pair (symbol "trans-func") (setP transFuncP)
(_, start) <- pair (symbol "start") valP
(_, final) <- pair (symbol "final") (setP valP)
return (EntryDFA {state, alpha, delta, start, final})
where
alphaP :: Parser Char
alphaP = MPC.letterChar <* spaceConsumer
transFuncP :: Parser (Text, Char, Text)
transFuncP = parens $ do
s1 <- stateP
a <- symbol "," *> alphaP
s2 <- symbol "," *> stateP
return (s1, a, s2)
valP :: Parser Text
valP = fmap T.pack (some MPC.digitChar)

Parsing a Complex Number from Scratch

Given a data type data CI = CI Int Int, representing a complex number, I want to build a parser for CI that can convert "a" to CI a 0 and "(a,b)" to CI a b. For example, I want a function parseCI such runParser parseCI "(1,2)" returns the value [(CI 1 2, "")] (ideally, but something similar is fine). I also want to make CI an instance of read.
I would like to do this using functions and definitions from the code below (basically, without anything advanced, like Parsec), but I'm not sure where to start. Some starting code to set me on the right track and/or a hint would be helpful. I'm not looking for a full answer, as I'd like to figure that out myself.
module Parser where
import Control.Applicative
import Control.Monad
newtype Parser a = Parser { runParser :: String -> [(a,String)] }
satisfy :: (Char -> Bool) -> Parser Char
satisfy f = Parser $ \s -> case s of
[] -> []
a:as -> [(a,as) | f a]
char :: Char -> Parser Char
char = satisfy . (==)
string :: String -> Parser String
string str = Parser $ \s -> [(t,u) | let (t,u) = splitAt (length str) s, str == t]
instance Functor Parser where
fmap f p = Parser $ \s ->
[ (f a,t)
| (a,t) <- runParser p s
]
instance Applicative Parser where
pure a = Parser $ \s -> [(a,s)]
af <*> aa = Parser $ \s ->
[ (f a,u)
| (f,t) <- runParser af s
, (a,u) <- runParser aa t
]
instance Alternative Parser where
empty = Parser $ \s -> []
p1 <|> p2 = Parser $ (++) <$> runParser p1 <*> runParser p2`
instance Monad Parser where
return = pure
ma >>= f = Parser $ \s ->
[ (b,u)
| (a,t) <- runParser ma s
, (b,u) <- runParser (f a) t
]
instance MonadPlus Parser where
mzero = empty
mplus = (<|>)
You've probably already seen it, but in case you haven't: Monadic Parsing in Haskell sets up parsing like this.
Since you have two different ways of parsing CI, you might want to approach this as two problems: make one parser parseCI1 that parses "a" to CI a 0 and make another parser parseCI2 that parses "(a,b)" to CI a b. Then, you can combine these into one with
parseCI = parseCI1 <|> parseCI2
For both of these subparsers, you will need some way of parsing integers: parseInt :: Parser Int. When making parseInt, you will likely want to use some combination of satisfy, isDigit, read, and possibly some (depending on how you go about solving this).
Making CI an instance of read is a bit more straightforward once you have parseCI done:
instance Read CI where
readsPrec _ = runParser parseCI

Indentation-aware parsing of expression trees

As a follow-up to this question, I am now trying to parse an expression language that has variables and case ... of ... expressions. The syntax should be indentation-based:
Expressions can span multiple lines, as long as every line is indented relative to the first one; i.e. this should be parsed as a single application:
f x y
z
q
Each alternative of a case expression needs to be on its own line, indented relative to the case keyword. Right-hand sides can span multiple lines.
case E of
C -> x
D -> f x
y
should be parsed into a single case with two alternatives, with x and f x y as the right-hand sides
I've simplified my code into the following:
import qualified Text.Megaparsec.Lexer as L
import Text.Megaparsec hiding (space)
import Text.Megaparsec.Char hiding (space)
import Text.Megaparsec.String
import Control.Monad (void)
import Control.Applicative
data Term = Var String
| App [Term]
| Case Term [(String, Term)]
deriving Show
space :: Parser ()
space = L.space (void spaceChar) empty empty
name :: Parser String
name = try $ do
s <- some letterChar
if s `elem` ["case", "of"]
then fail $ unwords ["Unexpected: reserved word", show s]
else return s
term :: Parser () -> Parser Term
term sp = App <$> atom `sepBy1` try sp
where
atom = choice [ caseBlock
, Var <$> L.lexeme sp name
]
caseBlock = L.lineFold sp $ \sp' ->
Case <$>
(L.symbol sp "case" *> L.lexeme sp (term sp) <* L.symbol sp' "of") <*>
alt sp' `sepBy` try sp' <* sp
alt sp' = L.lineFold sp' $ \sp'' ->
(,) <$> L.lexeme sp' name <* L.symbol sp' "->" <*> term sp''
As you can see, I am trying to use the technique from this answer to separate alternatives with sp'aces that are more indented than the case keyword.
Problems
This seems to work for single expressions made up of application only:
λ» parseTest (L.lineFold space term) "x y\n z"
App [Var "x",Var "y",Var "z"]
It doesn't work for list of such expressions using the technique from the linked answer:
λ» parseTest (L.lineFold space $ \sp -> (term sp `sepBy` try sp)) "x\n y\nz"
3:1:
incorrect indentation (got 1, should be greater than 1)
case expressions fail out of the gate when trying to use line-folding:
λ» parseTest (L.lineFold space term) "case x of\n C -> y\n D -> z"
1:5:
Unexpected: reserved word "case"
case works without line folding for the outermost expression, for one alternative only:
λ» parseTest (term space) "case x of\n C -> y\n z"
App [Case (App [Var "x"]) [("C",App [Var "y",Var "z"])]]
But case fails as soon as I have multiple alternatives:
λ» parseTest (term space) "case x of\n C -> y\n D -> z"
3:2:
incorrect indentation (got 2, should be greater than 2)
What am I doing wrong?
I'm answering since I promised to take a look at this. This problem represents a rather difficult problem for Parsec-like parsers in their current state. I probably could make it work after spending much more time that I have available, but in the slot of time I can spend answering this, I only got this far:
module Main (main) where
import Control.Applicative
import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.String
import qualified Data.List.NonEmpty as NE
import qualified Text.Megaparsec.Lexer as L
data Term = Var String
| App [Term]
| Case Term [(String, Term)]
deriving Show
scn :: Parser ()
scn = L.space (void spaceChar) empty empty
sc :: Parser ()
sc = L.space (void $ oneOf " \t") empty empty
name :: Parser String
name = try $ do
s <- some letterChar
if s `elem` ["case", "of"]
then (unexpected . Label . NE.fromList) ("reserved word \"" ++ s ++ "\"")
else return s
manyTerms :: Parser [Term]
manyTerms = many pTerm
pTerm :: Parser Term
pTerm = caseBlock <|> app -- parse a term first
caseBlock :: Parser Term
caseBlock = L.indentBlock scn $ do
void (L.symbol sc "case")
t <- Var <$> L.lexeme sc name -- not sure what sort of syntax case of
-- case expressions should have, so simplified to vars for now
void (L.symbol sc "of")
return (L.IndentSome Nothing (return . Case t) alt)
alt :: Parser (String, Term)
alt = L.lineFold scn $ \sc' ->
(,) <$> L.lexeme sc' name <* L.symbol sc' "->" <*> pTerm -- (1)
app :: Parser Term
app = L.lineFold scn $ \sc' ->
App <$> ((Var <$> name) `sepBy1` try sc' <* scn)
-- simplified here, with some effort should be possible to go from Var to
-- more general Term in applications
Your original grammar is left-recursive because every term can be either a case expression or an application and if it's an application, then the first part of it again can be either case expression or application, etc. You'll need to deal with that somehow.
Here is a session:
λ> parseTest pTerm "x y\n z"
App [Var "x",Var "y",Var "z"]
λ> parseTest pTerm "x\n y\nz"
App [Var "x",Var "y"]
λ> parseTest manyTerms "x\n y\nz"
[App [Var "x",Var "y"],App [Var "z"]]
λ> parseTest pTerm "case x of\n C -> y\n D -> z"
Case (Var "x") [("C",App [Var "y"]),("D",App [Var "z"])]
λ> parseTest pTerm "case x of\n C -> y\n z"
3:3:
incorrect indentation (got 3, should be equal to 2)
This last result is because of (1) in the code. Introducing a parameter to app makes it impossible to use it without thinking of context (it would be no longer stand-alone expression, but factored-out part of something). We can see that if you indent z with respect to start of y application, not the entire alternative, it works:
λ> parseTest pTerm "case x of\n C -> y\n z"
Case (Var "x") [("C",App [Var "y",Var "z"])]
Finally, case expression works:
λ> parseTest pTerm "case x of\n C -> y\n D -> z"
Case (Var "x") [("C",App [Var "y"]),("D",App [Var "z"])]
My advice here would be to take a look at some pre-processor and use Megaparsec on top of that. The tools in Text.Megaparsec.Lexer are not that easy to apply in this case, but they are the best we could come up with and they work fine for simple indentation-sensitive grammars.

Parsing text with optional data at the end

Please note, subsequently to posting this question I managed to derive a solution myself. See the end of this question for my final answer.
I'm working on a little parser at the moment for org-mode documents, and in these documents headings can have a title, and may optionally consist of a list of tags at the of the heading:
* Heading :foo:bar:baz:
I'm having difficulty writing a parser for this, however. The following is what I'm working with for now:
import Control.Applicative
import Text.ParserCombinators.Parsec
data Node = Node String [String]
deriving (Show)
myTest = parse node "" "Some text here :tags:here:"
node = Node <$> (many1 anyChar) <*> tags
tags = (char ':') >> (sepEndBy1 (many1 alphaNum) (char ':'))
<?> "Tag list"
While my simple tags parser works, it doesn't work in the context of node because all of the characters are used up parsing the title of the heading (many1 anyChar). Furthermore, I can't change this parser to use noneOf ":" because : is valid in the title. In fact, it's only special if it's in a taglist, at the very end of the line.
Any ideas how I can parse this optional data?
As an aside, this is my first real Haskell project, so if Parsec is not even the right tool for the job - feel free to point that out and suggest other options!
Ok, I got a complete solution now, but it needs refactoring. The following works:
import Control.Applicative hiding (many, optional, (<|>))
import Control.Monad
import Data.Char (isSpace)
import Text.ParserCombinators.Parsec
data Node = Node { level :: Int, keyword :: Maybe String, heading :: String, tags :: Maybe [String] }
deriving (Show)
parseNode = Node <$> level <*> (optionMaybe keyword) <*> name <*> (optionMaybe tags)
where level = length <$> many1 (char '*') <* space
keyword = (try (many1 upper <* space))
name = noneOf "\n" `manyTill` (eof <|> (lookAhead (try (tags *> eof))))
tags = char ':' *> many1 alphaNum `sepEndBy1` char ':'
myTest = parse parseNode "org-mode" "** Some : text here :tags: JUST KIDDING :tags:here:"
myTest2 = parse parseNode "org-mode" "* TODO Just a node"
import Control.Applicative hiding (many, optional, (<|>))
import Control.Monad
import Text.ParserCombinators.Parsec
instance Applicative (GenParser s a) where
pure = return
(<*>) = ap
data Node = Node { name :: String, tags :: Maybe [String] }
deriving (Show)
parseNode = Node <$> name <*> tags
where tags = optionMaybe $ optional (string " :") *> many (noneOf ":\n") `sepEndBy` (char ':')
name = noneOf "\n" `manyTill` try (string " :" <|> string "\n")
myTest = parse parseNode "" "Some:text here :tags:here:"
myTest2 = parse parseNode "" "Sometext here :tags:here:"
Results:
*Main> myTest
Right (Node {name = "Some:text here", tags = Just ["tags","here",""]})
*Main> myTest2
Right (Node {name = "Sometext here", tags = Just ["tags","here",""]})

Resources