Generating a parser given a list of tokens - parsing

Background
I'm trying to implement a date printing and parsing system using Parsec.
I have successfully implemented a printing function of type
showDate :: String -> Date -> Parser String
It parses a format string and builds a new string from the tokens that the format string contains.
For example
showDate "%d-%m-%Y" $ Date 2015 3 17
has the output Right "17-3-2015"
I already wrote a tokenizer to use in the showDate function, so I thought that I could just use the output of that to somehow generate a parser using the function readDate :: [Token] -> Parser Date. My idea quickly came to a halt as I realised I had no idea how to implement this.
What I want to accomplish
Assume we have the following functions and types (the implementation doesn't matter):
data Token = DayNumber | Year | MonthNumber | DayOrdinal | Literal String
-- Parses four digits and returns an integer
pYear :: Parser Integer
-- Parses two digits and returns an integer
pMonthNum :: Parser Int
-- Parses two digits and returns an integer
pDayNum :: Parser Int
-- Parses two digits and an ordinal suffix and returns an integer
pDayOrd :: Parser Int
-- Parses a string literal
pLiteral :: String -> Parser String
The parser readDate [DayNumber,Literal "-",MonthNumber,Literal "-",Year] should be equivalent to
do
    d <- pDayNum
    pLiteral "-"
    m <- pMonthNum
    pLiteral "-"
    y <- pYear
    return $ Date y m d
Similarly, the parser readDate [Literal "~~", MonthNumber,Literal "hello",DayNumber,Literal " ",Year] should be equivalent to
do
    pLiteral "~~"
    m <- pMonthNum
    pLiteral "hello"
    d <- pDayNum
    pLiteral " "
    y <- pYear
    return $ Date y m d
My intuition suggests there's some kind of concat/map/fold using monad bindings that I can use for this, but I have no idea.
Questions
Is parsec the right tool for this?
Is my approach convoluted or ineffective?
If not, how do I achieve this functionality?
If so, what should I try to do instead?

Your [Token] is a list of instructions in a small language for date formats.
import Data.Functor
import Text.Parsec
import Text.Parsec.String
data Date = Date Int Int Int deriving (Show)
data Token = DayNumber | Year | MonthNumber | Literal String
In order to interpret this language, we need a type that represents the state of the interpreter. We start off not knowing any of the components of the Date and then discover them as we encounter DayNumber, Year, or MonthNumber. The following DateState represents the state of knowing or not knowing each of the components of the Date.
data DateState = DateState {dayState :: (Maybe Int), monthState :: (Maybe Int), yearState :: (Maybe Int)}
We will start interpreting a [Token] with DateState Nothing Nothing Nothing.
Each Token will be converted into a function that reads the DateState and produces a parser that computes the new DateState.
readDateToken :: Token -> DateState -> Parser DateState
readDateToken (DayNumber) ds =
    do
        day <- pNatural
        return ds {dayState = Just day}
readDateToken (MonthNumber) ds =
    do
        month <- pNatural
        return ds {monthState = Just month}
readDateToken (Year) ds =
    do
        year <- pNatural
        return ds {yearState = Just year}
readDateToken (Literal l) ds = string l >> return ds
pNatural :: Num a => Parser a
pNatural = fromInteger . read <$> many1 digit
To read a date by interpreting a [Token], we first convert it into a list of functions that decide how to parse a new state based on the current state, with map readDateToken :: [Token] -> [DateState -> Parser DateState]. Then, starting with a parser that succeeds with the initial state, return (DateState Nothing Nothing Nothing), we bind all of these functions together with >>=. If the resulting DateState doesn't completely define the Date, we complain that the [Token] was incomplete. We could also have checked this ahead of time. If you want to report invalid dates as parsing errors, this would also be the place to check that the Date is valid and doesn't represent a non-existent date like April 31st.
readDate :: [Token] -> Parser Date
readDate tokens =
    do
        dateState <- foldl (>>=) (return (DateState Nothing Nothing Nothing)) . map readDateToken $ tokens
        case dateState of
            DateState (Just day) (Just month) (Just year) -> return (Date day month year)
            _ -> fail "Date format is incomplete"
We will run a few examples.
runp p s = runParser p () "runp" s
main = do
    print . runp (readDate [DayNumber,Literal "-",MonthNumber,Literal "-",Year]) $ "12-3-456"
    print . runp (readDate [Literal "~~", MonthNumber,Literal "hello",DayNumber,Literal " ",Year]) $ "~~3hello12 456"
    print . runp (readDate [DayNumber,Literal "-",MonthNumber,Literal "-",Year,Literal "-",Year]) $ "12-3-456-789"
    print . runp (readDate [DayNumber,Literal "-",MonthNumber]) $ "12-3"
This results in the following outputs. Notice that when we asked to read the Year twice, the second of the two years was used in the Date. You can choose a different behavior by modifying the definitions for readDateToken and possibly modifying the DateState type. When the [Token] didn't specify how to read one of the date fields we get the error Date format is incomplete with a slightly incorrect description; this could be improved upon.
Right (Date 12 3 456)
Right (Date 12 3 456)
Right (Date 12 3 789)
Left "runp" (line 1, column 5):
unexpected end of input
expecting digit
Date format is incomplete
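As noted above, a repeated Year silently overwrites the earlier value. A minimal sketch of a stricter variant follows; it is not part of the original answer. The names fresh, step and readDateStrict are hypothetical, and foldM from Control.Monad stands in for the foldl (>>=) pipeline, which threads the state in the same way.

```haskell
import Control.Monad (foldM)
import Text.Parsec
import Text.Parsec.String (Parser)

data Date = Date Int Int Int deriving (Show, Eq)
data Token = DayNumber | Year | MonthNumber | Literal String

data DateState = DateState
  { dayState   :: Maybe Int
  , monthState :: Maybe Int
  , yearState  :: Maybe Int
  }

pNatural :: Parser Int
pNatural = read <$> many1 digit

-- Hypothetical helper: parse a number only if the component is still unknown.
fresh :: String -> Maybe Int -> Parser Int
fresh name (Just _) = fail (name ++ " appears twice in the format")
fresh _    Nothing  = pNatural

-- One interpreter step; the argument order suits foldM.
step :: DateState -> Token -> Parser DateState
step ds DayNumber   = (\n -> ds { dayState   = Just n }) <$> fresh "day"   (dayState ds)
step ds MonthNumber = (\n -> ds { monthState = Just n }) <$> fresh "month" (monthState ds)
step ds Year        = (\n -> ds { yearState  = Just n }) <$> fresh "year"  (yearState ds)
step ds (Literal l) = ds <$ string l

readDateStrict :: [Token] -> Parser Date
readDateStrict tokens = do
  ds <- foldM step (DateState Nothing Nothing Nothing) tokens
  case ds of
    DateState (Just d) (Just m) (Just y) -> return (Date d m y)
    _ -> fail "Date format is incomplete"
```

With this variant, a format such as [Year, Literal "-", Year] is rejected when the second Year is reached, instead of keeping the later value.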

Related

Taking a parser in argument and try to apply it zero or more times

I am currently writing a basic parser. A parser for type a takes a string as its argument and returns either nothing, or a value of type a and the rest of the string.
Here is a simple type satisfying all these features:
type Parser a = String -> Maybe (a, String)
For example, I wrote a function that takes a Char as argument and returns a Parser Char :
parseChar :: Char -> Parser Char
parseChar _ [] = Nothing
parseChar c (x:xs)
| c == x = Just (x, xs)
| otherwise = Nothing
I would like to write a function which takes a parser as an argument and tries to apply it zero or more times, returning a list of the parsed elements:
parse :: Parser a -> Parser [a]
Usage example:
> parse (parseChar ' ') " foobar"
Just (" ", "foobar")
I tried to write a recursive function but I can't save the parsed elements in a list.
How can I apply the parsing several times and save the result in a list ?
You wrote: "I tried to write a recursive function but I can't save the parsed elements in a list."
You don't need to "save" anything. You can use pattern matching. Here's a hint. Try to reason about what should happen in each case below. The middle case is a bit subtle, don't worry if you get that wrong at first. Note how s and s' are used below.
parse :: Parser a -> Parser [a]
parse p s = case p s of
    Nothing -> ... -- first p failed
    Just (x,s') -> case parse p s' of
        Nothing -> ... -- subtle case, might not be relevant after all
        Just (xs,s'') -> ... -- merge the results
Another hint: note that according to your description parse p should never fail, since it can always return the empty list.
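For reference, here is one way the hints above could be filled in; treat it as a sketch rather than the only solution. The function is named parseMany (a hypothetical name) to keep it apart from the skeleton. It assumes p consumes input whenever it succeeds; otherwise the recursion would not terminate.

```haskell
type Parser a = String -> Maybe (a, String)

parseChar :: Char -> Parser Char
parseChar _ [] = Nothing
parseChar c (x:xs)
  | c == x    = Just (x, xs)
  | otherwise = Nothing

-- Apply p zero or more times and collect the results.
parseMany :: Parser a -> Parser [a]
parseMany p s = case p s of
  -- First p failed: succeed with no elements and leave the input untouched.
  Nothing      -> Just ([], s)
  Just (x, s') -> case parseMany p s' of
    -- Cannot happen, because parseMany never fails; kept for totality.
    Nothing        -> Just ([x], s')
    -- Put the first result in front of the remaining ones.
    Just (xs, s'') -> Just (x : xs, s'')
```

The list is built on the way back out of the recursion, so no extra accumulator is needed.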

Parse letter or number with Parsec

I am trying to write a parser for strings such as x, A (i.e. single letters), 657 and 0 (i.e. non-negative integers).
Here is the code I wrote.
import Text.Parsec
data Expression = String String | Number Int
value = letter <|> many1 digit
However I get the following error.
Couldn't match type ‘[Char]’ with ‘Char’
How do I convert a Char to a String inside the parser?
What should the type annotation be for value ?
letter parses just a single letter and returns a Char. You want to parse a String, namely [Char] (it's the same thing), so I guess you want to parse many letter?
But if you want to parse just a single letter as a String you can take advantage of the fact that Parsec _ _ has a Functor instance in order to map over its result and pack it in a list:
-- The stream type is fixed to String so that letter and digit type-check.
value :: Parsec String u String
value = fmap (:[]) letter <|> many1 digit
After the edit I guess you want to parse the Expression you have presented to us, so you will need some more fancy fmapping to wrap the results in proper constructors:
-- The stream type is fixed to String so that letter and digit type-check.
value :: Parsec String u Expression
value = fmap (String . (:[])) letter
    <|> fmap (Number . read) (many1 digit)
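To see the wrapped parser run end to end, here is a small self-contained sketch. The parseValue wrapper and the derived Show and Eq instances are additions for illustration. The eof check is an assumption that the whole input must be a single value.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Expression = String String | Number Int
  deriving (Show, Eq)

-- A single letter becomes a one-character String; a run of digits becomes a Number.
value :: Parser Expression
value = (String . (:[]) <$> letter) <|> (Number . read <$> many1 digit)

-- Hypothetical wrapper: run the parser and insist on consuming all input.
parseValue :: String -> Either ParseError Expression
parseValue = parse (value <* eof) "value"
```

For example, parseValue "657" yields Right (Number 657), while parseValue "12a" is rejected because of the trailing letter.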

Parse String to Datatype in Haskell

I'm taking a Haskell course at school, and I have to define a Logical Proposition datatype in Haskell. Everything so far works fine (definition and functions), and I've declared it as an instance of Ord, Eq and Show. The problem comes when I'm required to define a program which interacts with the user: I have to parse the input from the user into my datatype:
type Var = String
data FProp = V Var
| No FProp
| Y FProp FProp
| O FProp FProp
| Si FProp FProp
| Sii FProp FProp
where the formula: ¬q ^ p would be: (Y (No (V "q")) (V "p"))
I've been researching, and found that I can declare my datatype as an instance of Read.
Is this advisable? If it is, can I get some help in order to define the parsing method?
Not a complete answer, since this is a homework problem, but here are some hints.
The other answer suggested getLine followed by splitting at words. It sounds like you instead want something more like a conventional tokenizer, which would let you write things like:
(Y
(No (V q))
(V p))
Here’s one implementation that turns a string into tokens that are either a string of alphanumeric characters or a single, non-alphanumeric printable character. You would need to extend it to support quoted strings:
import Data.Char
type Token = String
tokenize :: String -> [Token]
{- Here, a token is either a string of alphanumeric characters, or else one
- non-spacing printable character, such as "(" or ")".
-}
tokenize [] = []
tokenize (x:xs) | isSpace x = tokenize xs
                | not (isPrint x) = error $
                    "Invalid character " ++ show x ++ " in input."
                | not (isAlphaNum x) = [x]:(tokenize xs)
                | otherwise = let (token, rest) = span isAlphaNum (x:xs)
                              in token:(tokenize rest)
It turns the example into ["(","Y","(","No","(","V","q",")",")","(","V","p",")",")"]. Note that you have access to the entire repertoire of Unicode.
The main function that evaluates this interactively might look like:
main = interact ( unlines . map show . map evaluate . parse . tokenize )
Where parse turns a list of tokens into a list of ASTs and evaluate turns an AST into a printable expression.
As for implementing the parser, your language appears to have similar syntax to LISP, which is one of the simplest languages to parse; you don’t even need precedence rules. A recursive-descent parser could do it, and is probably the easiest to implement by hand. You can pattern-match on parse ("(":xs) =, but pattern-matching syntax can also implement lookahead very easily, for example parse ("(":x1:xs) = to look ahead one token.
If you’re calling the parser recursively, you would define a helper function that consumes only a single expression, and that has a type signature like :: [Token] -> (AST, [Token]). This lets you parse the inner expression, check that the next token is ")", and proceed with the parse. However, externally, you’ll want to consume all the tokens and return an AST or a list of them.
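A minimal sketch of such a helper follows. It assumes the fully parenthesised prefix syntax shown earlier, with the constructor names used as keywords. The result is wrapped in Maybe so that malformed input can be reported, which differs slightly from the signature above. The names parseExpr, closeParen, binary and parseAll are hypothetical.

```haskell
type Var = String
type Token = String

data FProp
  = V Var
  | No FProp
  | Y FProp FProp
  | O FProp FProp
  | Si FProp FProp
  | Sii FProp FProp
  deriving (Show, Eq)

-- Consume exactly one parenthesised expression and return the unused tokens.
parseExpr :: [Token] -> Maybe (FProp, [Token])
parseExpr ("(" : "V" : name : ")" : rest) = Just (V name, rest)
parseExpr ("(" : "No" : rest) = do
  (p, rest') <- parseExpr rest
  closeParen (No p) rest'
parseExpr ("(" : op : rest) = do
  con         <- binary op
  (p, rest')  <- parseExpr rest
  (q, rest'') <- parseExpr rest'
  closeParen (con p q) rest''
parseExpr _ = Nothing

-- Check that the next token closes the current expression.
closeParen :: FProp -> [Token] -> Maybe (FProp, [Token])
closeParen e (")" : rest) = Just (e, rest)
closeParen _ _            = Nothing

-- Map a keyword to its binary constructor.
binary :: Token -> Maybe (FProp -> FProp -> FProp)
binary "Y"   = Just Y
binary "O"   = Just O
binary "Si"  = Just Si
binary "Sii" = Just Sii
binary _     = Nothing

-- External entry point: all tokens must be consumed.
parseAll :: [Token] -> Maybe FProp
parseAll ts = case parseExpr ts of
  Just (e, []) -> Just e
  _            -> Nothing
```

Applied to the token list produced for the earlier example, parseAll returns Just (Y (No (V "q")) (V "p")).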
The stylish way to write a parser is with monadic parser combinators. (And maybe someone will post an example of one.) The industrial-strength solution would be a library like Parsec, but that’s probably overkill here. Still, parsing is (mostly!) a solved problem, and if you just want to get the assignment done on time, using a library off the shelf is a good idea.
The read part of a REPL interpreter typically looks like this:
repl :: ForthState -> IO () -- parser definition
repl state
    = do putStr "> " -- prints a > character to indicate it's waiting for input
         input <- getLine -- this is what you're looking for, to read a line
         if input == "quit" -- allows the user to quit the interpreter
             then do putStrLn "Bye!"
                     return ()
             else let (is, cs, d, output) = eval (words input) state -- your grammar definition is somewhere down the chain when eval is called on input
                  in do mapM_ putStrLn output
                        repl (is, cs, d, [])
main = do putStrLn "Welcome to your very own interpreter!"
          repl initialForthState -- runs the parser, starting with read
Your eval function will have various loops, stack manipulations, conditionals, etc. to actually figure out what the user entered. I hope this helps with at least the input-reading part.

Parsing Printable Text File in Haskell

I'm trying to figure out the "right" way to parse a particular text file in Haskell.
In F#, I loop over each line, testing it against a regular expression to determine if it's a line I want to parse, and then if it is, I parse it using the regular expression. Otherwise, I ignore the line.
The file is a printable report, with headers on each page. Each record is one line, and each field is separated by two or more spaces. Here's an example:
MY COMPANY'S NAME
PROGRAM LISTING
STATE: OK PRODUCT: ProductName
(DESCRIPTION OF REPORT)
DATE: 11/03/2013
This is the first line of a a two-line description of the contents of this report. The description, as noted,
spans two lines. This is more text. I'm running out of things to write. Blah.
DIVISION CODE: 3    XYZ CODE: FAA3    AGENT CODE: 0007    PAGE NO: 1
AGENT  TARGET NAME                     ST  UD  TARGET#  XYZ#  X-DATE      YEAR  CO          ENCODING
-----  ------------------------------  --  --  -------  ----  ----------  ----  ----------  ----------
0007   SMITH, JOHN                     43  3   1234567  001   12/06/2013  2004  ABC         SIZE XL
0007   SMITH, JANE                     43  3   2345678  001   12/07/2013  2005  ACME        YELLOW
0007   DOE, JOHN                       43  3   3456789  004   12/09/2013  2008  MICROSOFT   GREEN
0007   DOE, JANE                       43  3   4567890  002   12/09/2013  2007  MICROSOFT   BLUE
0007   BORGES, JORGE LUIS              43  3   5678901  001   12/09/2013  2008  DUFEMSCHM   Y1500
0007   DEWEY, JOHN &                   43  3   6789012  003   12/11/2013  2013  ERTZEVILI   X1500
0007   NIETZSCHE, FRIEDRICH            43  3   7890123  004   12/11/2013  2006  NCORPORAT   X7
I first built the parser to test each line to see if it were a record. Were it a record, I just cut up the line based on character position with my home-grown substring function. This works just fine.
Then I discovered that I did, indeed, have a regular expression library in my Haskell installation, so I decided to try using regular expressions like I do in F#. That failed miserably, as the library rejects perfectly valid regular expressions.
Then I thought, What about Parsec? But the learning curve for using that is getting steeper the higher I climb, and I find myself wondering if it is the right tool for such a simple task as parsing this report.
So I thought I'd ask some Haskell experts: how would you go about parsing this kind of report? I'm not asking for code, though if you've got some, I'd love to see it. I'm really asking for technique or technology.
Thanks!
P.s. The output is just a colon-separated file with a line of field names at the top of the file, followed by just the records, that can be imported into Excel for the end-user.
Edit:
Thank you all so much for the great comments and answers!
Because I didn't make it clear originally: The first fourteen lines of the example repeat for every page of (print) output, with the number of records varying per page from zero to a full page (looks like 45 records). I apologize for not making that clear earlier, as it will probably affect some of the answers already offered.
My Haskell system currently is limited to Parsec (it doesn't have attoparsec) and Text.Regex.Base and Text.Regex.Posix. I'll have to see about installing attoparsec and/or additional Regex libraries. But for the time being, you've convinced me to keep at learning Parsec. Thank you for the very helpful code examples!
This is definitely a job worthy of a parsing library. My primary goal is normally (i.e., for anything I intend to use more than once or twice) to get the data into a non-textual form ASAP, something like
module ReportParser where
import Prelude hiding (takeWhile)
import Data.Text hiding (takeWhile)
import Control.Applicative
import Data.Attoparsec.Text
data ReportHeaderData = Company Text
                      | Program Text
                      | State Text
                      -- ...
                      | FieldNames [Text]
data ReportData = ReportData Int Text Int Int Int Int Date Int Text Text
data Date = Date Int Int Int
and we can say, for the sake of argument, that a report is
data Report = Report [ReportHeaderData] [ReportData]
Now, I generally create a parser which is a function of the same name as the data type
-- Ending condition for a field
doubleSpace :: Parser Char
doubleSpace = space >> space
-- Clears leading spaces
clearSpaces :: Parser Text
clearSpaces = takeWhile (== ' ') -- Naively assumes no tabs
-- Throws away everything up to and including a newline character (naively assumes unix line endings)
clearNewline :: Parser ()
clearNewline = (anyChar `manyTill` char '\n') *> pure ()
-- Parse a date
date :: Parser Date
date = Date <$> decimal <*> (char '/' *> decimal) <*> (char '/' *> decimal)
-- Parse one record line of a report
reportData :: Parser ReportData
reportData = let f1 = decimal <* clearSpaces
                 f2 = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 f3 = decimal <* clearSpaces
                 f4 = decimal <* clearSpaces
                 f5 = decimal <* clearSpaces
                 f6 = decimal <* clearSpaces
                 f7 = date <* clearSpaces
                 f8 = decimal <* clearSpaces
                 f9 = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 f10 = (pack <$> manyTill anyChar doubleSpace) <* clearNewline
             in ReportData <$> f1 <*> f2 <*> f3 <*> f4 <*> f5 <*> f6 <*> f7 <*> f8 <*> f9 <*> f10
By running one of the parse functions with a suitable combinator such as many (and possibly feed, if you end up with a Partial result), you should end up with a list of ReportData values. You can then convert them to CSV with a function of your own.
Note that I didn't deal with the header. It should be relatively trivial to write code to parse it, and build a Report with e.g.
-- Not tested
parseReport = Report <$> (many reportHeader) <*> (many reportData)
Note that I prefer the Applicative form, but it's also possible to use the monadic form if you prefer (I did in doubleSpace). The Alternative class from Control.Applicative is also useful, for reasons implied by the name.
For playing with this, I highly recommend GHCI and the parseTest function. GHCI is just overall handy and a good way to test individual parsers, while parseTest takes a parser and input string and outputs the status of the run, the parsed string, and any remaining string not parsed. Very useful when you're not quite sure what's going on.
There are very few languages in which I would recommend using a parser for something so simple (I've parsed many a file like this using regular expressions in the past), but Parsec makes it so easy:
import Text.Parsec
parseLine = do
    first <- count 4 anyChar
    second <- count 4 anyChar
    return (first, second)
parseFile = endBy parseLine (char '\n')
main = interact $ show . parse parseFile "-"
The function "parseLine" creates a parser for an individual line by chaining together two fields made up of fixed length (4 chars, any char will do).
The function "parseFile" then chains these together as a list of lines.
Of course you will have to add more fields, and cut off the header in your data still, but all of this is easy in parsec.
This is arguably much easier to read than regexps....
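The report in the question uses fields separated by two or more spaces rather than fixed widths, and its page header repeats. A hedged Parsec sketch for that variant follows. The names fieldSep, field, splitLine, isRecord and records are hypothetical, and the record test (ten fields, numeric agent code) is an assumption based on the sample data.

```haskell
import Data.Char (isDigit)
import Data.Either (rights)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Two or more spaces separate fields; try leaves a lone space unconsumed.
fieldSep :: Parser ()
fieldSep = () <$ (try (string "  ") *> many (char ' '))

-- A field runs up to the next separator or the end of the line.
field :: Parser String
field = manyTill anyChar (lookAhead (fieldSep <|> eof))

-- Split one line (without its newline) into fields.
splitLine :: String -> Either ParseError [String]
splitLine = parse (field `sepBy1` fieldSep <* eof) "line"

-- Assumption: a record has ten fields and starts with a numeric agent code.
isRecord :: [String] -> Bool
isRecord fs@(agent : _) = length fs == 10 && all isDigit agent
isRecord []             = False

-- Keep only record lines; header lines on every page simply fail the test.
-- Empty fields (from leading or trailing spaces) are dropped first.
records :: String -> [[String]]
records = filter isRecord . map (filter (not . null)) . rights . map splitLine . lines
```

Because each line is tested on its own, nothing depends on the header being exactly fourteen lines long or appearing only once.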
Assuming a few things—that the header is fixed and the field of each line is "double space" delimited—it's really quite easy to implement a parser in Haskell for this file. The end result is probably going to be longer than your regexp (and there are regexp libraries in Haskell if that fits your desire) but it's far more testable and readable. I'll demonstrate some of that while I outline how to build one for this file format.
I'll use Attoparsec. We'll also need to use the ByteString data type (and the OverloadedStrings PRAGMA which lets Haskell interpret string literals as both String and ByteString) and some combinators from Control.Applicative and Control.Monad.
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString.Char8 -- formerly Data.Attoparsec.Char8
import Control.Applicative
import Control.Monad
import qualified Data.ByteString.Char8 as S
First, we'll build a data type representing each record.
data YearMonthDay =
  YearMonthDay { ymdYear  :: Int
               , ymdMonth :: Int
               , ymdDay   :: Int
               }
  deriving ( Show )
data Line =
  Line { agent     :: Int
       , name      :: S.ByteString
       , st        :: Int
       , ud        :: Int
       , targetNum :: Int
       , xyz       :: Int
       , xDate     :: YearMonthDay
       , year      :: Int
       , co        :: S.ByteString
       , encoding  :: S.ByteString
       }
  deriving ( Show )
You could fill in more descriptive types for each field if desired, but this isn't a bad start. Since each line can be parsed independently, I'll do just that. The first step is to build a Parser Line type---read that as a parser type which returns a Line if it succeeds.
To do this, we'll build our Line type "inside of" the Parser using its Applicative interface. That sounds really complex, but it's simple and looks quite pretty. We'll start with the YearMonthDay type as a warm-up
parseYMDWrong :: Parser YearMonthDay
parseYMDWrong =
  YearMonthDay <$> decimal
               <*> decimal
               <*> decimal
Here, decimal is a built-in Attoparsec parser which parses an integral type like Int. You can read this parser as nothing more than "parse three decimal numbers and use them to build my YearMonthDay type" and you'd be basically correct. The (<*>) operator (read as "apply") sequences the parses and collects their results into our YearMonthDay constructor function.
Unfortunately, as I indicated in the name, it's a little bit wrong. Specifically, we're currently ignoring the '/' characters which delimit the numbers inside of our YearMonthDay. We fix this by using the "sequence and throw away" operator (<*). It's a visual pun on (<*>) and we use it when we want to perform a parsing action... but we don't want to keep the result.
We use (<*) to augment the first two decimal parsers with their following '/' characters using the built-in char8 parser.
parseYMD :: Parser YearMonthDay
parseYMD =
  YearMonthDay <$> (decimal <* char8 '/')
               <*> (decimal <* char8 '/')
               <*> decimal
And we can test that this is a valid parser using Attoparsec's parseOnly function
>>> parseOnly parseYMD "2013/12/12"
Right (YearMonthDay {ymdYear = 2013, ymdMonth = 12, ymdDay = 12})
We'd like to now generalize this technique to the entire Line parser. There's one hitch, however. We'd like to parse ByteString fields like "SMITH, JOHN" which might contain spaces... while also delimiting each field of our Line by double spaces. This means that we need a special ByteString parser which consumes any character including single spaces... but quits the moment it sees two spaces in a row.
We can build this using the scan combinator. scan allows us to accumulate a state while consuming characters in our parse and determine when to stop that parse on the fly. We'll keep a boolean state—"was the last character a space?"—and stop whenever we see a new space while knowing the previous character was a space too.
parseStringField :: Parser S.ByteString
parseStringField = scan False step where
  step :: Bool -> Char -> Maybe Bool
  step b ' ' | b         = Nothing
             | otherwise = Just True
  step _ _               = Just False
We can again test this little piece using parseOnly. Let's try parsing three string fields.
>>> let p = (,,) <$> parseStringField <*> parseStringField <*> parseStringField
>>> parseOnly p "foo  bar  baz"
Right ("foo "," bar "," baz")
>>> parseOnly p "foo bar  baz quux  end"
Right ("foo bar "," baz quux "," end")
>>> parseOnly p "a sentence with no double space delimiters"
Right ("a sentence with no double space delimiters","","")
Depending on your actual file format, this might be perfect. It's worth noting that it leaves trailing spaces (these could be trimmed if desired) and it allows some space delimited fields to be empty. It's easy to continue to fiddle with this piece in order to fix these errors, but I'll leave it for now.
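If the trailing spaces are unwanted, one option is to trim each field after it has been parsed. A minimal sketch, assuming the same qualified import as above; trimEnd is a hypothetical helper name:

```haskell
import qualified Data.ByteString.Char8 as S
import Data.Char (isSpace)

-- Drop trailing whitespace from a parsed field.
trimEnd :: S.ByteString -> S.ByteString
trimEnd = fst . S.spanEnd isSpace
```

It can be mapped over the field parser, for example trimEnd <$> parseStringField, so the rest of the line parser stays unchanged.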
We can now build our Line parser. Like with parseYMD, we'll follow each field's parser with a delimiting parser, someSpaces which consumes two or more spaces. We'll use the MonadPlus interface to Parser to build this atop the built-in parser space by (1) parsing some spaces and (2) checking to be sure that we got at least two of them.
someSpaces :: Parser Int
someSpaces = do
  sps <- some space
  let count = length sps
  if count >= 2 then return count else mzero
>>> parseOnly someSpaces "  "
Right 2
>>> parseOnly someSpaces "    "
Right 4
>>> parseOnly someSpaces " "
Left "Failed reading: mzero"
And now we can build the line parser
lineParser :: Parser Line
lineParser =
  Line <$> (decimal <* someSpaces)
       <*> (parseStringField <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (parseYMD <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (parseStringField <* someSpaces)
       <*> (parseStringField <* some space)
>>> parseOnly lineParser "0007  SMITH, JOHN   43  3  1234567  001  12/06/2013  2004  ABC   SIZE XL  "
Right (Line { agent = 7
, name = "SMITH, JOHN "
, st = 43
, ud = 3
, targetNum = 1234567
, xyz = 1
, xDate = YearMonthDay {ymdYear = 12, ymdMonth = 6, ymdDay = 2013}
, year = 2004
, co = "ABC "
, encoding = "SIZE XL "
})
And then we can just cut off the header and parse each line.
parseFile :: S.ByteString -> [Either String Line]
parseFile = map (parseOnly lineParser) . drop 14 . S.lines

What is an appropriate data structure or algorithm for producing an immutable concrete syntax tree in a functionally pure manner?

Given a LL(1) grammar what is an appropriate data structure or algorithm for producing an immutable concrete syntax tree in a functionally pure manner? Please feel free to write example code in whatever language you prefer.
My Idea
symbol : either a token or a node
result : success or failure
token : a lexical token from source text
    value -> string : the value of the token
    type -> integer : the named type code of the token
    next -> token : reads the next token and keeps position of the previous token
    back -> token : moves back to the previous position and re-reads the token
node : a node in the syntax tree
    type -> integer : the named type code of the node
    symbols -> linkedList : the symbols found at this node
    append -> symbol -> node : appends the new symbol to a new copy of the node
Here is an idea I have been thinking about. The main issue here is handling syntax errors.
I mean I could stop at the first error but that doesn't seem right.
let program token =
    sourceElements (node nodeType.program) token
let sourceElements node token =
    let (n, r) = sourceElement (node.append (node nodeType.sourceElements)) token
    match r with
    | success -> (n, r)
    | failure -> // ???
let sourceElement node token =
    match token.value with
    | "function" ->
        functionDeclaration (node.append (node nodeType.sourceElement)) token
    | _ ->
        statement (node.append (node nodeType.sourceElement)) token
Please Note
I will be offering up a nice bounty to the best answer so don't feel rushed. Answers that simply post a link will have less weight over answers that show code or contain detailed explanations.
Final Note
I am really new to this kind of stuff so don't be afraid to call me a dimwit.
You want to parse something into an abstract syntax tree.
In the purely functional programming language Haskell, you can use parser combinators to express your grammar. Here is an example that parses a tiny expression language:
EDIT Use monadic style to match Graham Hutton's book
-- import a library of *parser combinators*
import Parsimony
import Parsimony.Char
import Parsimony.Error
(+++) = (<|>)
-- abstract syntax tree
data Expr = I Int
          | Add Expr Expr
          | Mul Expr Expr
          deriving (Eq,Show)
-- parse an expression
parseExpr :: String -> Either ParseError Expr
parseExpr = Parsimony.parse pExpr
    where
    -- grammar
    pExpr :: Parser String Expr
    pExpr = do
        a <- pNumber +++ parentheses pExpr -- first argument
        do
            f <- pOp -- operation symbol
            b <- pExpr -- second argument
            return (f a b)
         +++ return a -- or just the first argument
    parentheses parser = do -- parse inside parentheses
        string "("
        x <- parser
        string ")"
        return x
    pNumber = do -- parse an integer
        digits <- many1 digit
        return . I . read $ digits
    pOp = -- parse an operation symbol
            do string "+"
               return Add
        +++
            do string "*"
               return Mul
Here is an example run:
*Main> parseExpr "(5*3)+1"
Right (Add (Mul (I 5) (I 3)) (I 1))
To learn more about parser combinators, see for example chapter 8 of Graham Hutton's book "Programming in Haskell" or chapter 16 of "Real World Haskell".
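To show what such combinators amount to underneath, here is a minimal library-free sketch in the spirit of the Hutton-style approach. It is not Parsimony's implementation; the Parser type, the char and number primitives and the +++ choice operator are all defined locally for illustration. Every step returns a new value and nothing is mutated, which is what keeps the resulting tree immutable.

```haskell
import Data.Char (isDigit)

data Expr = I Int | Add Expr Expr | Mul Expr Expr
  deriving (Eq, Show)

-- A parser is a pure function from input to an optional result plus the rest.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Nothing      -> Nothing
    Just (a, s') -> Just (f a, s')

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Nothing      -> Nothing
    Just (f, s') -> case pa s' of
      Nothing       -> Nothing
      Just (a, s'') -> Just (f a, s'')

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> case p s of
    Nothing      -> Nothing
    Just (a, s') -> runParser (f a) s'

-- Choice: try the first parser and fall back to the second on failure.
(+++) :: Parser a -> Parser a -> Parser a
Parser p +++ Parser q = Parser $ \s -> maybe (q s) Just (p s)

char :: Char -> Parser Char
char c = Parser $ \s -> case s of
  (x:xs) | x == c -> Just (c, xs)
  _               -> Nothing

number :: Parser Expr
number = Parser $ \s -> case span isDigit s of
  ([], _)    -> Nothing
  (ds, rest) -> Just (I (read ds), rest)

-- Same grammar as above: an operand, optionally followed by an operator and an expression.
expr :: Parser Expr
expr = do
  a <- number +++ (char '(' *> expr <* char ')')
  (do f <- (Add <$ char '+') +++ (Mul <$ char '*')
      b <- expr
      return (f a b)) +++ return a

parseExpr :: String -> Maybe Expr
parseExpr s = case runParser expr s of
  Just (e, "") -> Just e
  _            -> Nothing
```

This sketch reports failure only as Nothing; carrying a position and message in the failure case would be the natural next step towards the error handling the question asks about.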
Many parser combinator libraries can be used with different token types, as you intend to do. Token streams are usually represented as lists of tokens [Token].
Definitely check out the monadic parser combinator approach; I've blogged about it in C# and in F#.
Eric Lippert's blog series on immutable binary trees may be helpful. Obviously, you need a tree which is not binary, but it will give you the general idea.

Resources