This is the most puzzling combinator in all of FParsec...
http://www.quanttec.com/fparsec/reference/primitives.html#members.chainl1
...but there is no example on how to use it in the documentation or, AFAIK, on any web pages on the internet. I have a left-recursive parse that seems to require it, but for the life of me I can't figure out how to call it or what to pass to it.
Please help :)
I have some pretty diagrams involving chainl1 (from my own C# code) here:
http://lorgonblog.wordpress.com/2007/12/04/monadic-parser-combinators-part-three/
I put together a simple expression parser in FParsec at the end of this unrelated post. Here's an excerpt using chainl1 to make a parser for a chained operator expression from parsers for the operand and operator.
(* fop : (double -> double -> double) -> (env -> double) -> (env -> double) -> env -> double *)
let fop op fa fb env = fa env |> op <| fb env
(* Parse single operators - return function taking two operands and giving the result *)
let (addop : Parser<_,unit>) =
    sym "+" >>% fop (+)
    <|> ( sym "-" >>% fop (-) )
(* term, expr - chain of operators of a given precedence *)
let term = chainl1 atom mulop
let expr = chainl1 term addop
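For readers who want to see the same shape outside FParsec, here is a minimal sketch in Haskell using `Text.ParserCombinators.ReadP` from base, which also provides a `chainl1`. It is not the FParsec code above: `lexeme`, `number` and `evalExpr` are names made up for this illustration, and the parsers compute an `Int` directly instead of building an environment function. The point is only what gets passed to `chainl1`: an operand parser and a parser that returns a binary function.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- Run a parser, then skip trailing whitespace (helper assumed for this sketch).
lexeme :: ReadP a -> ReadP a
lexeme p = p <* skipSpaces

-- Operand: a non-negative integer literal.
number :: ReadP Int
number = lexeme (read <$> munch1 isDigit)

-- Operator parsers return the function that combines two operands.
addop, mulop :: ReadP (Int -> Int -> Int)
addop = lexeme (((+) <$ char '+') +++ ((-) <$ char '-'))
mulop = lexeme ((*) <$ char '*')

-- One chainl1 per precedence level, weakest level last.
term, expr :: ReadP Int
term = chainl1 number mulop
expr = chainl1 term addop

-- Accept only parses that consume the whole input.
evalExpr :: String -> Maybe Int
evalExpr s = case readP_to_S (skipSpaces *> expr <* eof) s of
  [(v, "")] -> Just v
  _         -> Nothing

main :: IO ()
main = do
  print (evalExpr "10 - 3 - 2")  -- Just 5, i.e. (10 - 3) - 2
  print (evalExpr "2 + 3 * 4")   -- Just 14
```

The first result shows the left associativity that `chainl1` gives you: the chain is folded as `(10 - 3) - 2`, not `10 - (3 - 2)`.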
I am trying to get more familiar with megaparsec, and I am running into some issues with precedences. By 'nested data' in the title I refer to the fact that I am trying to parse types, which in turn could contain other types. If someone could explain why this does not behave as I would expect, please don't hesitate to tell me.
I am trying to parse types similar to those found in Haskell. Types are either base types Int, Bool, Float or type variables a (any lowercase word).
We can also construct algebraic data types from type constructors (uppercase words) such as Maybe and type parameters (any other type). Examples are Maybe a and Either (Maybe Int) Bool. Functions associate to the right and are constructed with ->, such as Maybe a -> Either Int (b -> c). N-ary tuples are a sequence of types separated by , and enclosed in ( and ), such as (Int, Bool, a). A type can be wrapped in parentheses to raise its precedence level, as in (Maybe a). A unit type () is also defined.
I am using this ADT to describe this.
newtype Ident = Ident String
newtype UIdent = UIdent String
data Type a
    = TLam a (Type a) (Type a)
    | TVar a Ident
    | TNil a
    | TAdt a UIdent [Type a]
    | TTup a [Type a]
    | TBool a
    | TInt a
    | TFloat a
I have tried to write a megaparsec parser to parse such types, but I get unexpected results. I attach the relevant code below after which I will try to describe what I experience.
{-# LANGUAGE OverloadedStrings #-}
module Parser where
import AbsTinyCamiot
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as Lexer
import Text.Megaparsec.Debug
import Control.Applicative hiding (many, some, Const)
import Control.Monad.Combinators.Expr
import Control.Monad.Identity
import Data.Void
import Data.Text (Text, unpack)
type Parser a = ParsecT Void Text Identity a
-- parse types
pBaseType :: Parser (Type ())
pBaseType = choice [
    TInt () <$ label "parse int" (pSymbol "Int"),
    TBool () <$ label "parse bool" (pSymbol "Bool"),
    TFloat () <$ label "parse float" (pSymbol "Float"),
    TNil () <$ label "parse void" (pSymbol "()"),
    TVar () <$> label "parse type variable" pIdent]
pAdt :: Parser (Type ())
pAdt = label "parse ADT" $ do
    con <- pUIdent
    variables <- many $ try $ many spaceChar >> pType
    return $ TAdt () con variables
pType :: Parser (Type ())
pType = label "parse a type" $
    makeExprParser
    (choice [ try pFunctionType
            , try $ parens pType
            , try pTupleType
            , try pBaseType
            , try pAdt
            ])
    []--[[InfixR (TLam () <$ pSymbol "->")]]
pTupleType :: Parser (Type ())
pTupleType = label "parse a tuple type" $ do
    pSymbol "("
    fst <- pType
    rest <- some (pSymbol "," >> pType)
    pSymbol ")"
    return $ TTup () (fst : rest)
pFunctionType :: Parser (Type ())
pFunctionType = label "parse a function type" $ do
    domain <- pType
    some spaceChar
    pSymbol "->"
    some spaceChar
    codomain <- pType
    return $ TLam () domain codomain
parens :: Parser a -> Parser a
parens p = label "parse a type wrapped in parentheses" $ do
    pSymbol "("
    a <- p
    pSymbol ")"
    return a
pUIdent :: Parser UIdent
pUIdent = label "parse a UIdent" $ do
    a <- upperChar
    rest <- many $ choice [letterChar, digitChar, char '_']
    return $ UIdent (a:rest)
pIdent :: Parser Ident
pIdent = label "parse an Ident" $ do
    a <- lowerChar
    rest <- many $ choice [letterChar, digitChar, char '_']
    return $ Ident (a:rest)
pSymbol :: Text -> Parser Text
pSymbol = Lexer.symbol pSpace
pSpace :: Parser ()
pSpace = Lexer.space
    (void spaceChar)
    (Lexer.skipLineComment "--")
    (Lexer.skipBlockComment "{-" "-}")
This might be overwhelming, so let me explain some key points. I understand that I have a lot of different constructions that could match on an opening parenthesis, so I've wrapped those parsers in try, such that if they fail I can try the next parser that might consume an opening parenthesis. Perhaps I am using try too much? Does it affect performance to potentially backtrack so much?
I've also tried to make an expression parser by defining some terms and an operator table. You can see that I've commented out the operator (function arrow) for now, however. As the code stands, I loop infinitely when I try to parse a function type. I think this might be because when I try to parse a function type (invoked from pType) I immediately try to parse a type representing the domain of the function, which in turn calls pType. How would I do this correctly?
If I decide to use the operator table instead, and not use my custom parser for function types, things get parsed with the wrong precedences. E.g., Maybe a -> b gets parsed as Maybe (a -> b), while I would want it to be parsed as (Maybe a) -> b. Is there a way to use the operator table and still have type constructors bind more tightly than the function arrow?
Lastly, as I am learning megaparsec as I go, if anyone sees any misunderstandings or things that are weird/unexpected, please tell me. I've read most of this tutorial to get this far.
Please let me know of any edits I can make to increase the quality of my question!
Your code does not handle precedences at all, and as a result it ends up with looping left recursion.
To give an example of left recursion in your code, pFunctionType calls pType as its first action, which calls pFunctionType as its first action. This is clearly a loop.
For precedences, I recommend looking at tutorials on "recursive descent operator parsing"; a quick Google search turns up several of them. Nevertheless, I can summarize the key points here. Let's start with some code.
{-# language OverloadedStrings #-}
import Control.Monad.Identity
import Data.Text (Text)
import Data.Void
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as Lexer
type Parser a = ParsecT Void Text Identity a
newtype Ident = Ident String deriving Show
newtype UIdent = UIdent String deriving Show
data Type
= TVar Ident
| TFun Type Type -- instead of "TLam"
| TAdt UIdent [Type]
| TTup [Type]
| TUnit -- instead of "TNil"
| TBool
| TInt
| TFloat
deriving Show
pSymbol :: Text -> Parser Text
pSymbol = Lexer.symbol pSpace
pChar :: Char -> Parser ()
pChar c = void (char c <* pSpace)
pSpace :: Parser ()
pSpace = Lexer.space
    (void spaceChar)
    (Lexer.skipLineComment "--")
    (Lexer.skipBlockComment "{-" "-}")
keywords :: [String]
keywords = ["Bool", "Int", "Float"]
pUIdent :: Parser UIdent
pUIdent = try $ do
    a <- upperChar
    rest <- many $ choice [letterChar, digitChar, char '_']
    pSpace
    let x = a:rest
    if elem x keywords
        then fail "expected an ADT name"
        else pure $ UIdent x
pIdent :: Parser Ident
pIdent = try $ do
    a <- lowerChar
    rest <- many $ choice [letterChar, digitChar, char '_']
    pSpace
    return $ Ident (a:rest)
Let's stop here.
I changed the names of constructors in Type to conform to how they are called in Haskell. I also removed the parameter on Type, to have less noise in my example, but you can add it back of course.
Note the changed pUIdent and the addition of keywords. In general, if you want to parse identifiers, you have to disambiguate them from keywords. In this case, Int could parse both as Int and as an upper case identifier, so we have to specify that Int is not an identifier.
Continuing:
pClosed :: Parser Type
pClosed =
        (TInt <$ pSymbol "Int")
    <|> (TBool <$ pSymbol "Bool")
    <|> (TFloat <$ pSymbol "Float")
    <|> (TVar <$> pIdent)
    <|> (do pChar '('
            -- sepBy rather than sepBy1, so that "()" yields [] and parses as TUnit
            ts <- sepBy pFun (pChar ',') <* pChar ')'
            case ts of
              []  -> pure TUnit
              [t] -> pure t
              _   -> pure (TTup ts))
pApp :: Parser Type
pApp = (TAdt <$> pUIdent <*> many pClosed)
    <|> pClosed
pFun :: Parser Type
pFun = foldr1 TFun <$> sepBy1 pApp (pSymbol "->")
pExpr :: Parser Type
pExpr = pSpace *> pFun <* eof
We have to group operators according to binding strength. For each strength, we need to have a separate parsing function which parses all operators of that strength. In this case we have pFun, pApp and pClosed in increasing order of binding strength. pExpr is just a wrapper which handles top-level expressions, and takes care of leading whitespace and matches the end of the input.
When writing an operator parser, the first thing we should pin down is the group of closed expressions. Closed expressions are delimited by a keyword or symbol both on the left and the right. This is conceptually "infinite" binding strength, since text before and after such expressions doesn't change their parsing at all.
Keywords and variables are clearly closed, since they consist of a single token. We also have three more closed cases: the unit type, tuples and parenthesized expressions. Since all of these start with a (, I factor this out. After that, we have zero or more types separated by , and we have to branch on the number of parsed types: none is the unit type, one is a parenthesized type, and more than one is a tuple.
The rule in precedence parsing is that when parsing an operator expression of given strength, we always call the next stronger expression parser when reading the expressions between operator symbols.
, is the weakest operator, so we call the function for the second weakest operator, pFun.
pFun in turn calls pApp, which reads ADT applications, or falls back to pClosed. In pFun you can also see the handling of right associativity, as we use foldr1 TFun to combine expressions. In a left-associative infix operator, we would instead use foldl1.
Note that parser functions always parse all stronger expressions as well. So pFun falls back on pApp when there is no -> (because sepBy1 accepts the case with no separators), and pApp falls back on pClosed when there's no ADT application.
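To make the layering easy to try without installing anything, here is a rough port of the same three levels to `Text.ParserCombinators.ReadP` from base. This is a sketch under assumptions, not the megaparsec code above: the `Type` here is trimmed (only `Int` as a keyword, plain `String` names), and `lexeme`, `symbol`, `ident`, `upperName` and `parseType` are helper names invented for the illustration. `<++` is ReadP's left-biased choice.

```haskell
import Data.Char (isAlphaNum, isLower, isUpper)
import Text.ParserCombinators.ReadP

data Type
  = TVar String
  | TFun Type Type
  | TAdt String [Type]
  | TTup [Type]
  | TUnit
  | TInt
  deriving (Eq, Show)

-- Run a parser, then skip trailing whitespace.
lexeme :: ReadP a -> ReadP a
lexeme p = p <* skipSpaces

symbol :: String -> ReadP String
symbol = lexeme . string

-- An identifier whose first character satisfies the given predicate.
ident :: (Char -> Bool) -> ReadP String
ident isFirst = lexeme ((:) <$> satisfy isFirst <*> munch isRest)
  where isRest c = isAlphaNum c || c == '_'

upperName :: ReadP String
upperName = ident isUpper

-- Strongest level: keywords, variables, and anything in parentheses.
pClosed :: ReadP Type
pClosed = keyword <++ (TVar <$> ident isLower) <++ parenthesised
  where
    keyword = do
      name <- upperName
      if name == "Int" then pure TInt else pfail
    parenthesised = do
      ts <- between (symbol "(") (symbol ")") (sepBy pFun (symbol ","))
      pure $ case ts of
        []  -> TUnit
        [t] -> t
        _   -> TTup ts

-- Middle level: constructor application, falling back to pClosed.
pApp :: ReadP Type
pApp = adt <++ pClosed
  where
    adt = do
      name <- upperName
      if name == "Int" then pfail else TAdt name <$> many pClosed

-- Weakest level: right-associative function arrows between pApp operands.
pFun :: ReadP Type
pFun = foldr1 TFun <$> sepBy1 pApp (symbol "->")

-- Accept only parses that consume the whole input.
parseType :: String -> Maybe Type
parseType s = case readP_to_S (skipSpaces *> pFun <* eof) s of
  [(t, "")] -> Just t
  _         -> Nothing

main :: IO ()
main = do
  print (parseType "Maybe a -> b")
  print (parseType "a -> b -> c")
```

The first line prints `Just (TFun (TAdt "Maybe" [TVar "a"]) (TVar "b"))`, so the constructor application binds tighter than the arrow, and the second shows the arrow nesting to the right.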
I am trying to tackle the scariest part of programming for me and that is parsing and ASTs. I am working on a trivial example using F# and FParsec. I am wanting to parse a simple series of multiplications. I am only getting the first term back though. Here is what I have so far:
open FParsec
let test p str =
    match run p str with
    | Success(result, _, _) -> printfn "Success: %A" result
    | Failure(errorMsg, _, _) -> printfn "Failure: %s" errorMsg
type Expr =
    | Float of float
    | Multiply of Expr * Expr
let parseExpr, impl = createParserForwardedToRef ()
let pNumber = pfloat .>> spaces |>> (Float)
let pMultiply = parseExpr .>> pstring "*" >>. parseExpr
impl := pNumber <|> pMultiply
test parseExpr "2.0 * 3.0 * 4.0 * 5.0"
When I run this I get the following:
> test parseExpr "2.0 * 3.0 * 4.0 * 5.0";;
Success: Float 2.0
val it : unit = ()
My hope was that I get a nested set of multiplications. I feel like I am missing something tremendously obvious.
Parser combinators like FParsec are not equivalent to BNF grammars. The big difference is that when you have an alternative (<|> in FParsec), the cases are tried in order. If the left parser is successful, then it is returned and the right parser isn't tried. If the left parser fails after consuming some input, then the failure is returned and the right parser isn't tried either. It's only if the left parser fails without consuming any input that the right parser is tried. [1]
In your pNumber <|> pMultiply, pNumber is successful and returned immediately without trying to do pMultiply. You might think to fix that by writing pMultiply <|> pNumber instead, but that's not good either: when parsing the last number, pMultiply will fail to find a * after having consumed some input for its parseExpr, so the whole parsing will be marked as failed.
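The choice rule is easier to see in a toy model. The following Haskell sketch is not FParsec and not how FParsec is implemented; it is a deliberately tiny parser type, with all names (`Reply`, `andThen`, `orElse`, `attempt`) chosen for this illustration, that tracks only whether a failure happened after input was consumed.

```haskell
-- A reply records the outcome and, on failure, whether input was consumed.
data Reply a
  = Ok a String   -- result and remaining input
  | Fail Bool     -- True if input had been consumed before the failure
  deriving (Eq, Show)

newtype Parser a = Parser { runParser :: String -> Reply a }

-- Match one specific character.
char :: Char -> Parser Char
char c = Parser $ \s -> case s of
  (x:rest) | x == c -> Ok x rest
  _                 -> Fail False

-- Run p, then q. If q fails after p has eaten input, the combined
-- failure counts as a consuming failure.
andThen :: Parser a -> Parser b -> Parser (a, b)
andThen p q = Parser $ \s -> case runParser p s of
  Fail consumed -> Fail consumed
  Ok a rest -> case runParser q rest of
    Ok b rest'    -> Ok (a, b) rest'
    Fail consumed -> Fail (consumed || length rest < length s)

-- Ordered choice: the right side runs only when the left side
-- failed without consuming anything.
orElse :: Parser a -> Parser a -> Parser a
orElse p q = Parser $ \s -> case runParser p s of
  Fail False -> runParser q s
  reply      -> reply

-- Turn a consuming failure into a non-consuming one.
attempt :: Parser a -> Parser a
attempt p = Parser $ \s -> case runParser p s of
  Fail _ -> Fail False
  reply  -> reply

ab, ac :: Parser (Char, Char)
ab = char 'a' `andThen` char 'b'
ac = char 'a' `andThen` char 'c'

main :: IO ()
main = do
  print (runParser (ab `orElse` ac) "ac")          -- Fail True
  print (runParser (attempt ab `orElse` ac) "ac")  -- Ok ('a','c') ""
```

In the first call `ab` eats the `a` before failing, so `ac` is never tried, which is the same situation as a `pMultiply <|> pNumber` that fails after its first operand.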
You generally want to use FParsec's combinator functions as much as possible, and in this case the best solution is probably to use chainl1.
let pNumber = pfloat .>> spaces |>> Float
let pTimes = pstring "*" .>> spaces >>% (fun x y -> Multiply (x, y))
let pMultiply = chainl1 pNumber pTimes
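If it helps to see why this avoids the left recursion, here is a rough Haskell sketch of what a `chainl1`-style combinator does, written by hand on top of `Text.ParserCombinators.ReadP` from base. It is an illustration, not FParsec's implementation: `chainLeft`, `lexeme` and `parseProduct` are made-up names, and numbers are plain integers (`Num Int`) rather than floats to keep it short.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

data Expr = Num Int | Multiply Expr Expr
  deriving (Eq, Show)

-- Run a parser, then skip trailing whitespace.
lexeme :: ReadP a -> ReadP a
lexeme p = p <* skipSpaces

pNumber :: ReadP Expr
pNumber = lexeme (Num . read <$> munch1 isDigit)

-- The operator parser returns the function that combines two operands.
pTimes :: ReadP (Expr -> Expr -> Expr)
pTimes = lexeme (Multiply <$ char '*')

-- Parse one operand, then any number of (operator, operand) pairs,
-- and fold them from the left. The parser never calls itself first,
-- so there is no left recursion.
chainLeft :: ReadP a -> ReadP (a -> a -> a) -> ReadP a
chainLeft p op = do
  first <- p
  rest  <- many ((,) <$> op <*> p)
  pure (foldl (\acc (f, x) -> f acc x) first rest)

-- Accept only parses that consume the whole input.
parseProduct :: String -> Maybe Expr
parseProduct s =
  case readP_to_S (skipSpaces *> chainLeft pNumber pTimes <* eof) s of
    [(e, "")] -> Just e
    _         -> Nothing

main :: IO ()
main = print (parseProduct "2 * 3 * 4 * 5")
```

This prints `Just (Multiply (Multiply (Multiply (Num 2) (Num 3)) (Num 4)) (Num 5))`, the nested set of multiplications the question was hoping for.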
If your goal was to learn how to use BNF grammars, you probably want to look at FsLex and FsYacc rather than FParsec.
[1] There's a function attempt that turns a consuming failure into a non-consuming failure, but it should be used as sparingly as possible.
I've been coding up an attoparsec parser and have been hitting a pattern where I want to turn parsers into recursive parsers (recursively combining them with the monad bind >>= operator).
So I created a function to turn a parser into a recursive parser as follows:
recursiveParser :: (a -> A.Parser a) -> a -> A.Parser a
recursiveParser parser a = (parser a >>= recursiveParser parser) <|> return a
Which is useful if you have a recursive data type like
data Expression = ConsExpr Expression Expression | EmptyExpr
parseRHS :: Expression -> Parser Expression
parseRHS e = ConsExpr e <$> parseFoo
parseExpression :: Parser Expression
parseExpression = parseLHS >>= recursiveParser parseRHS
  where parseLHS = parseRHS EmptyExpr
Is there a more idiomatic solution? It almost seems like recursiveParser should be some kind of fold... I also saw sepBy in the docs, but this method seems to suit me better for my application.
EDIT: Oh, actually now that I think about it should actually be something similar to fix... Don't know how I forgot about that.
EDIT2: Rotsor makes a good point with his alternative for my example, but I'm afraid my AST is actually a bit more complicated than that. It actually looks something more like this (although this is still simplified)
data Segment = Choice1 Expression
             | Choice2 Expression
data Expression = ConsExpr Segment Expression
                | Token String
                | EmptyExpr
where the string a -> b brackets to the right and c:d brackets to the left, with : binding more tightly than ->.
I.e. a -> b evaluates to
(ConsExpr (Choice1 (Token "a")) (Token "b"))
and c:d evaluates to
(ConsExpr (Choice2 (Token "d")) (Token "c"))
I suppose I could use foldl for the one and foldr for the other but there's still more complexity in there. Note that it's recursive in a slightly strange way, so "a:b:c -> e:f -> :g:h ->" is actually a valid string, but "-> a" and "b:" are not. In the end fix seemed simpler to me. I've renamed the recursive method like so:
fixParser :: (a -> A.Parser a) -> a -> A.Parser a
fixParser parser a = (parser a >>= fixParser parser) <|> pure a
Thanks.
Why not just parse a list and fold it into whatever you want later?
Maybe I am missing something, but this looks more natural to me:
consChain :: [Expression] -> Expression
consChain = foldl ConsExpr EmptyExpr
parseExpression :: Parser Expression
parseExpression = consChain <$> many1 parseFoo
And it's shorter too.
As you can see, consChain is now independent from parsing and can be useful somewhere else. Also, if you separate out the result folding, the somewhat unintuitive recursive parsing simplifies down to many or many1 in this case.
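To check that the two formulations really build the same tree, here is a small sketch using `Text.ParserCombinators.ReadP` from base instead of attoparsec. Several things are assumptions made for the example: the `Leaf` constructor and the word-matching `parseFoo` are stand-ins (the question never shows `parseFoo`), `fullParses` is a helper name, and ReadP's left-biased `<++` plays the role of attoparsec's `<|>`.

```haskell
import Data.Char (isAlpha)
import Text.ParserCombinators.ReadP

data Expression = ConsExpr Expression Expression | Leaf String | EmptyExpr
  deriving (Eq, Show)

-- Hypothetical stand-in for parseFoo: one word plus trailing spaces.
parseFoo :: ReadP Expression
parseFoo = Leaf <$> munch1 isAlpha <* skipSpaces

-- The question's formulation: keep feeding the result back in.
recursiveParser :: (a -> ReadP a) -> a -> ReadP a
recursiveParser parser a = (parser a >>= recursiveParser parser) <++ return a

parseRHS :: Expression -> ReadP Expression
parseRHS e = ConsExpr e <$> parseFoo

viaRecursion :: ReadP Expression
viaRecursion = parseRHS EmptyExpr >>= recursiveParser parseRHS

-- The answer's formulation: parse a list, fold it afterwards.
consChain :: [Expression] -> Expression
consChain = foldl ConsExpr EmptyExpr

viaFold :: ReadP Expression
viaFold = consChain <$> many1 parseFoo

-- All results that consume the entire input.
fullParses :: ReadP a -> String -> [a]
fullParses p s = [x | (x, "") <- readP_to_S (p <* eof) s]

main :: IO ()
main = do
  print (fullParses viaFold "a b c")
  print (fullParses viaRecursion "a b c" == fullParses viaFold "a b c")  -- True
```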
You may want to take a look at how many is implemented too:
many :: (Alternative f) => f a -> f [a]
many v = many_v
  where many_v = some_v <|> pure []
        some_v = (:) <$> v <*> many_v
It has a lot in common with your recursiveParser:
some_v is similar to parser a >>= recursiveParser parser
many_v is similar to recursiveParser parser
You may ask why I called your recursive parser function unintuitive. This is because this pattern allows parser argument to affect the parsing behaviour (a -> A.Parser a, remember?), which may be useful, but not obviously (I don't see a use case for this yet). The fact that your example does not use this feature makes it look redundant.
As part of the 4th exercise here
I would like to use a reads type function such as readHex with a parsec Parser.
To do this I have written a function:
liftReadsToParse :: Parser String -> (String -> [(a, String)]) -> Parser a
liftReadsToParse p f = p >>= \s -> if null (f s) then fail "No parse" else (return . fst . head ) (f s)
Which can be used, for example in GHCI, like this:
*Main Numeric> parse (liftReadsToParse (many1 hexDigit) readHex) "" "a1"
Right 161
Can anyone suggest any improvement to this approach with regard to:
Will the term (f s) be memoised, or evaluated twice in the case of a null (f s) returning False?
Handling multiple successful parses, i.e. when length (f s) is greater than one, I do not know how parsec deals with this.
Handling the remainder of the parse, i.e. (snd . head) (f s).
This is a nice idea. A more natural approach that would make your ReadS parser fit in better with Parsec would be to leave off the Parser String at the beginning of the type:
liftReadS :: ReadS a -> String -> Parser a
liftReadS reader = maybe (unexpected "no parse") (return . fst) .
                   listToMaybe . filter (null . snd) . reader
This "combinator" style is very idiomatic Haskell - once you get used to it, it makes function definitions much easier to read and understand.
You would then use liftReadS like this in the simple case:
> parse (many1 hexDigit >>= liftReadS readHex) "" "a1"
(Note that listToMaybe is in the Data.Maybe module.)
In more complex cases, liftReadS is easy to use inside any Parsec do block.
Regarding some of your other questions:
The function reader is applied only once now, so there is nothing to "memoize".
It is common and accepted practice to ignore all except the first parse in a ReadS parser in most cases, so you're fine.
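The selection logic inside liftReadS can be tried on its own, without Parsec. The sketch below uses only base; `firstFullParse` is a hypothetical name for the pure part of the pipeline (keep parses that consumed the whole string, take the first), returning `Maybe` where liftReadS would call `unexpected`.

```haskell
import Data.Maybe (listToMaybe)
import Numeric (readHex)

-- Keep only the parses that consumed the whole string and take the first.
firstFullParse :: ReadS a -> String -> Maybe a
firstFullParse reader = fmap fst . listToMaybe . filter (null . snd) . reader

main :: IO ()
main = do
  print (firstFullParse readHex "a1"   :: Maybe Int)  -- Just 161
  print (firstFullParse readHex "a1zz" :: Maybe Int)  -- Nothing: "zz" is left over
  print (firstFullParse readHex "zz"   :: Maybe Int)  -- Nothing: no parse at all
```

The second case is the one that differs from the original liftReadsToParse, which took the head of the list and silently dropped any unread remainder.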
To answer the first part of your question: no, (f s) will not be memoised. You would have to do that manually:
liftReadsToParse p f = p >>= \s -> let fs = f s in if null fs then fail "No parse"
                                                   else (return . fst . head) fs
But I'd use pattern matching instead:
liftReadsToParse p f = p >>= \s -> case f s of
    []              -> fail "No parse"
    (answer, _) : _ -> return answer
Given an LL(1) grammar, what is an appropriate data structure or algorithm for producing an immutable concrete syntax tree in a functionally pure manner? Please feel free to write example code in whatever language you prefer.
My Idea
symbol : either a token or a node
result : success or failure
token : a lexical token from source text
    value -> string : the value of the token
    type -> integer : the named type code of the token
    next -> token : reads the next token and keeps position of the previous token
    back -> token : moves back to the previous position and re-reads the token
node : a node in the syntax tree
    type -> integer : the named type code of the node
    symbols -> linkedList : the symbols found at this node
    append -> symbol -> node : appends the new symbol to a new copy of the node
Here is an idea I have been thinking about. The main issue here is handling syntax errors.
I mean I could stop at the first error but that doesn't seem right.
let program token =
    sourceElements (node nodeType.program) token
let sourceElements node token =
    let (n, r) = sourceElement (node.append (node nodeType.sourceElements)) token
    match r with
    | success -> (n, r)
    | failure -> // ???
let sourceElement node token =
    match token.value with
    | "function" ->
        functionDeclaration (node.append (node nodeType.sourceElement)) token
    | _ ->
        statement (node.append (node nodeType.sourceElement)) token
Please Note
I will be offering up a nice bounty to the best answer so don't feel rushed. Answers that simply post a link will have less weight over answers that show code or contain detailed explanations.
Final Note
I am really new to this kind of stuff so don't be afraid to call me a dimwit.
You want to parse something into an abstract syntax tree.
In the purely functional programming language Haskell, you can use parser combinators to express your grammar. Here is an example that parses a tiny expression language:
EDIT Use monadic style to match Graham Hutton's book
-- import a library of *parser combinators*
import Parsimony
import Parsimony.Char
import Parsimony.Error
(+++) = (<|>)
-- abstract syntax tree
data Expr = I Int
          | Add Expr Expr
          | Mul Expr Expr
          deriving (Eq,Show)
-- parse an expression
parseExpr :: String -> Either ParseError Expr
parseExpr = Parsimony.parse pExpr
    where
    -- grammar
    pExpr :: Parser String Expr
    pExpr = do
        a <- pNumber +++ parentheses pExpr  -- first argument
        do
            f <- pOp                        -- operation symbol
            b <- pExpr                      -- second argument
            return (f a b)
         +++ return a                       -- or just the first argument
    parentheses parser = do                 -- parse inside parentheses
        string "("
        x <- parser
        string ")"
        return x
    pNumber = do                            -- parse an integer
        digits <- many1 digit
        return . I . read $ digits
    pOp =                                   -- parse an operation symbol
            do string "+"
               return Add
        +++
            do string "*"
               return Mul
Here is an example run:
*Main> parseExpr "(5*3)+1"
Right (Add (Mul (I 5) (I 3)) (I 1))
To learn more about parser combinators, see for example chapter 8 of Graham Hutton's book "Programming in Haskell" or chapter 16 of "Real World Haskell".
Many parser combinator libraries can be used with different token types, as you intend to do. Token streams are usually represented as lists of tokens [Token].
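As a rough illustration of the token-list idea without any library, here is a hand-written sketch of the same tiny grammar over `[Token]`. Everything in it is an assumption made for the example (the `Token` type, the `Parser` type alias, the error strings); it decides what to do by looking at one token, and every step returns a new immutable tree together with the unconsumed tokens, so nothing is mutated.

```haskell
data Token = TNum Int | TPlus | TStar | TLParen | TRParen
  deriving (Eq, Show)

data Expr = I Int | Add Expr Expr | Mul Expr Expr
  deriving (Eq, Show)

-- A parser consumes a prefix of the token list and returns the rest,
-- or an error message.
type Parser a = [Token] -> Either String (a, [Token])

-- expr ::= atom (('+' | '*') expr)?
pExpr :: Parser Expr
pExpr ts = do
  (a, rest) <- pAtom ts
  case rest of
    TPlus : rest' -> do
      (b, rest'') <- pExpr rest'
      Right (Add a b, rest'')
    TStar : rest' -> do
      (b, rest'') <- pExpr rest'
      Right (Mul a b, rest'')
    _ -> Right (a, rest)

-- atom ::= number | '(' expr ')'
pAtom :: Parser Expr
pAtom (TNum n : rest) = Right (I n, rest)
pAtom (TLParen : rest) = do
  (e, rest') <- pExpr rest
  case rest' of
    TRParen : rest'' -> Right (e, rest'')
    _                -> Left "expected ')'"
pAtom ts = Left ("unexpected " ++ show (take 1 ts))

-- Succeed only if every token was consumed.
parseTokens :: [Token] -> Either String Expr
parseTokens ts = case pExpr ts of
  Right (e, [])    -> Right e
  Right (_, extra) -> Left ("unexpected trailing tokens: " ++ show extra)
  Left err         -> Left err

main :: IO ()
main =
  -- tokens for "(5*3)+1"
  print (parseTokens [TLParen, TNum 5, TStar, TNum 3, TRParen, TPlus, TNum 1])
```

This prints `Right (Add (Mul (I 5) (I 3)) (I 1))`. It stops at the first error; collecting several errors would need a richer result type than `Either String`.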
Definitely check out the monadic parser combinator approach; I've blogged about it in C# and in F#.
Eric Lippert's blog series on immutable binary trees may be helpful. Obviously, you need a tree which is not binary, but it will give you the general idea.