Indentation-aware parsing of expression trees

Indentation-aware parsing of expression trees - parsing

As a follow-up to this question, I am now trying to parse an expression language that has variables and case ... of ... expressions. The syntax should be indentation-based:
Expressions can span multiple lines, as long as every line is indented relative to the first one; i.e. this should be parsed as a single application:
f x y
z
q
Each alternative of a case expression needs to be on its own line, indented relative to the case keyword. Right-hand sides can span multiple lines.
case E of
C -> x
D -> f x
y
should be parsed into a single case with two alternatives, with x and f x y as the right-hand sides
I've simplified my code into the following:
import qualified Text.Megaparsec.Lexer as L
import Text.Megaparsec hiding (space)
import Text.Megaparsec.Char hiding (space)
import Text.Megaparsec.String
import Control.Monad (void)
import Control.Applicative
data Term = Var String
| App [Term]
| Case Term [(String, Term)]
deriving Show
space :: Parser ()
space = L.space (void spaceChar) empty empty
name :: Parser String
name = try $ do
s <- some letterChar
if s `elem` ["case", "of"]
then fail $ unwords ["Unexpected: reserved word", show s]
else return s
term :: Parser () -> Parser Term
term sp = App <$> atom `sepBy1` try sp
where
atom = choice [ caseBlock
, Var <$> L.lexeme sp name
]
caseBlock = L.lineFold sp $ \sp' ->
Case <$>
(L.symbol sp "case" *> L.lexeme sp (term sp) <* L.symbol sp' "of") <*>
alt sp' `sepBy` try sp' <* sp
alt sp' = L.lineFold sp' $ \sp'' ->
(,) <$> L.lexeme sp' name <* L.symbol sp' "->" <*> term sp''
As you can see, I am trying to use the technique from this answer to separate alternatives with sp'aces that are more indented than the case keyword.
Problems
This seems to work for single expressions made up of application only:
λ» parseTest (L.lineFold space term) "x y\n z"
App [Var "x",Var "y",Var "z"]
It doesn't work for list of such expressions using the technique from the linked answer:
λ» parseTest (L.lineFold space $ \sp -> (term sp `sepBy` try sp)) "x\n y\nz"
3:1:
incorrect indentation (got 1, should be greater than 1)
case expressions fail out of the gate when trying to use line-folding:
λ» parseTest (L.lineFold space term) "case x of\n C -> y\n D -> z"
1:5:
Unexpected: reserved word "case"
case works without line folding for the outermost expression, for one alternative only:
λ» parseTest (term space) "case x of\n C -> y\n z"
App [Case (App [Var "x"]) [("C",App [Var "y",Var "z"])]]
But case fails as soon as I have multiple alternatives:
λ» parseTest (term space) "case x of\n C -> y\n D -> z"
3:2:
incorrect indentation (got 2, should be greater than 2)
What am I doing wrong?

I'm answering since I promised to take a look at this. This problem represents a rather difficult problem for Parsec-like parsers in their current state. I probably could make it work after spending much more time that I have available, but in the slot of time I can spend answering this, I only got this far:
module Main (main) where
import Control.Applicative
import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.String
import qualified Data.List.NonEmpty as NE
import qualified Text.Megaparsec.Lexer as L
data Term = Var String
| App [Term]
| Case Term [(String, Term)]
deriving Show
scn :: Parser ()
scn = L.space (void spaceChar) empty empty
sc :: Parser ()
sc = L.space (void $ oneOf " \t") empty empty
name :: Parser String
name = try $ do
s <- some letterChar
if s `elem` ["case", "of"]
then (unexpected . Label . NE.fromList) ("reserved word \"" ++ s ++ "\"")
else return s
manyTerms :: Parser [Term]
manyTerms = many pTerm
pTerm :: Parser Term
pTerm = caseBlock <|> app -- parse a term first
caseBlock :: Parser Term
caseBlock = L.indentBlock scn $ do
void (L.symbol sc "case")
t <- Var <$> L.lexeme sc name -- not sure what sort of syntax case of
-- case expressions should have, so simplified to vars for now
void (L.symbol sc "of")
return (L.IndentSome Nothing (return . Case t) alt)
alt :: Parser (String, Term)
alt = L.lineFold scn $ \sc' ->
(,) <$> L.lexeme sc' name <* L.symbol sc' "->" <*> pTerm -- (1)
app :: Parser Term
app = L.lineFold scn $ \sc' ->
App <$> ((Var <$> name) `sepBy1` try sc' <* scn)
-- simplified here, with some effort should be possible to go from Var to
-- more general Term in applications
Your original grammar is left-recursive because every term can be either a case expression or an application and if it's an application, then the first part of it again can be either case expression or application, etc. You'll need to deal with that somehow.
Here is a session:
λ> parseTest pTerm "x y\n z"
App [Var "x",Var "y",Var "z"]
λ> parseTest pTerm "x\n y\nz"
App [Var "x",Var "y"]
λ> parseTest manyTerms "x\n y\nz"
[App [Var "x",Var "y"],App [Var "z"]]
λ> parseTest pTerm "case x of\n C -> y\n D -> z"
Case (Var "x") [("C",App [Var "y"]),("D",App [Var "z"])]
λ> parseTest pTerm "case x of\n C -> y\n z"
3:3:
incorrect indentation (got 3, should be equal to 2)
This last result is because of (1) in the code. Introducing a parameter to app makes it impossible to use it without thinking of context (it would be no longer stand-alone expression, but factored-out part of something). We can see that if you indent z with respect to start of y application, not the entire alternative, it works:
λ> parseTest pTerm "case x of\n C -> y\n z"
Case (Var "x") [("C",App [Var "y",Var "z"])]
Finally, case expression works:
λ> parseTest pTerm "case x of\n C -> y\n D -> z"
Case (Var "x") [("C",App [Var "y"]),("D",App [Var "z"])]
My advice here would be to take a look at some pre-processor and use Megaparsec on top of that. The tools in Text.Megaparsec.Lexer are not that easy to apply in this case, but they are the best we could come up with and they work fine for simple indentation-sensitive grammars.

Related

Parser written in Haskell not working as intended

I was playing around with Haskell's parsec library. I was trying to parse a hexadecimal string of the form "#x[0-9A-Fa-f]*" into an integer. This the code I thought would work:
module Main where
import Control.Monad
import Numeric
import System.Environment
import Text.ParserCombinators.Parsec hiding (spaces)
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
return (fst (head (readHex x)))
testHex :: String -> String
testHex input = case parse parseHex "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val
main :: IO ()
main = do
args <- getArgs
putStrLn (testHex (head args))
And then I tried testing the testHex function in Haskell's repl:
GHCi, version 8.6.5: http://www.haskell.org/ghc/ :? for help
[1 of 1] Compiling Main ( src/Main.hs, interpreted )
Ok, one module loaded.
*Main> testHex "#xcafebeef"
"Matched3405692655"
*Main> testHex "#xnothx"
"Does not match \"lisp\" (line 1, column 3):\nunexpected \"n\"\nexpecting hexadecimal digit"
*Main> testHex "#xcafexbeef"
"Matched51966"
The first and second try work as intended. But in the third one, the string is matching upto the invalid character. I do not want the parser to do this, but rather not match if any digit in the string is not a valid string. Why is this happening, and how do if fix this?
Thank you!

You need to place eof at the end.
parseHex :: Parser Integer
parseHex = do
string "#x"
x <- many1 hexDigit
eof
return (fst (head (readHex x)))
Alternatively, you can compose it with eof where you use it if you want to reuse parseHex in other places.
testHex :: String -> String
testHex input = case parse (parseHex <* eof) "lisp" input of
Left err -> "Does not match " ++ show err
Right val -> "Matched" ++ show val

Parsing juxtaposition-based, indentation-aware syntax using Text.Parsec.Layout

I'm trying to parse a small language with Haskell-like syntax, using parsec-layout. The two key features that don't seem to interact too well with each other are:
Function application syntax is juxtaposition, i.e. if F and E are terms, F E is the syntax for F applied to E.
Indentation can be used to denote nesting, i.e. the following two are equivalent:
X = case Y of
A -> V
B -> W
X = case Y of A -> V; B -> W
I haven't managed to figure out a combination of skipping and keeping whitespace that would allow me to parse a list of such definitions. Here's my simplified code:
import Text.Parsec hiding (space, runP)
import Text.Parsec.Layout
import Control.Monad (void)
type Parser = Parsec String LayoutEnv
data Term = Var String
| App Term Term
| Case Term [(String, Term)]
deriving Show
name :: Parser String
name = spaced $ (:) <$> upper <*> many alphaNum
kw :: String -> Parser ()
kw = void . spaced . string
reserved :: String -> Parser ()
reserved s = try $ spaced $ string s >> notFollowedBy alphaNum
term :: Parser Term
term = foldl1 App <$> part `sepBy1` space
where
part = choice [ caseBlock
, Var <$> name
]
caseBlock = Case <$> (reserved "case" *> term <* reserved "of") <*> laidout alt
alt = (,) <$> (name <* kw "->") <*> term
binding :: Parser (String, Term)
binding = (,) <$> (name <* kw "=") <*> term
-- https://github.com/luqui/parsec-layout/issues/1
trim :: String -> String
trim = reverse . dropWhile (== '\n') . reverse
runP :: Parser a -> String -> Either ParseError a
runP p = runParser (p <* eof) defaultLayoutEnv "" . trim
If I try to run it on input like
s = unlines [ "A = case B of"
, " X -> Y Z"
, "C = D"
]
via runP (laidout binding) s, it fails on the application Y Z:
(line 2, column 10):
expecting space or semi-colon
However, if I change the definition of term to
term = foldl1 App <$> many1 part
then it doesn't stop parsing the term at the start of the (unindented!) third line, leading to
(line 3, column 4):
expecting semi-colon

I think the problem has to do with that name already eliminates the following space, so the sepBy1 in the definition of term doesn't see it.
Consider these simplified versions of term:
term0 = foldl1 App <$> (Var <$> name) `sepBy1` space
term1 = foldl1 App <$> (Var <$> name') `sepBy1` space
name' = (:) <$> upper <*> many alphaNum
term3 = foldl1 App <$> many (Var <$> name)
Then:
runP term0 "A B C" -- fails
runP term1 "A B C" -- succeeds
runP term3 "A B C" -- succeeds
I think part of the solution is to define
part = [ caseBlock, Var <$> name' ]
where name' is as above. However, there are still some issues.

Writing Parser for S Expressions

I'm trying to write a Parser for S Expressions from Prof. Yorgey's 2013 homework.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }
Given the following definitions, presented in the homework:
type Ident = String
-- An "atom" is either an integer value or an identifier.
data Atom = N Integer | I Ident
deriving Show
-- An S-expression is either an atom, or a list of S-expressions.
data SExpr = A Atom
| Comb [SExpr]
deriving Show
I wrote a parser for Parser Atom and Parser SExpr for A Atom.
parseAtom :: Parser Atom
parseAtom = alt n i
where n = (\_ z -> N z) <$> spaces <*> posInt
i = (\ _ z -> I z) <$> spaces <*> ident
parseAAtom :: Parser SExpr
parseAAtom = fmap (\x -> A x) parseAtom
Then, I attempted to write a parser to handle a Parser SExpr for the Comb ... case:
parseComb :: Parser SExpr
parseComb = (\_ _ x _ _ _ -> x) <$> (zeroOrMore spaces) <*> (char '(') <*>
(alt parseAAtom parseComb) <*> (zeroOrMore spaces)
<*> (char ')') <*> (zeroOrMore spaces)
Assuming that parseComb was right, I could simply make usage of oneOrMore for Parser [SExpr].
parseCombElements :: Parser [SExpr]
parseCombElements = oneOrMore parseComb
So, my two last functions compile, but running runParser parseComb "( foo )" never terminates.
What's wrong with my parseComb definition? Please don't give me the whole answer, but rather a hint - for my own learning.

I am very suspicious of zeroOrMore spaces, because spaces is usually a parser which itself parses zero or more spaces. Which means that it can parse the empty string if there aren't any spaces at that point. In particular, the spaces parser always succeeds.
But when you apply zeroOrMore to a parser that always succeeds, the combined parser will never stop - because zeroOrMore only stops trying again once its parser argument fails.
As an aside, Applicative expressions like (\_ _ x _ _ _ -> x) <$> ... <*> ... <*> ...... which only use a single of the subparsers can usually be written more succinctly with the *> and <* combinators:
... *> ... *> x_parser_here <* ... <* ...

Parsing an expression grammar having function application with parser combinators (left-recursion)

As a simplified subproblem of a parser for a real language, I am trying to implement a parser for expressions of a fictional language which looks similar to standard imperative languages (like Python, JavaScript, and so). Its syntax features the following construct:
integer numbers
identifiers ([a-zA-Z]+)
arithmetic expressions with + and * and parenthesis
structure access with . (eg foo.bar.buz)
tuples (eg (1, foo, bar.buz)) (to remove ambiguity one-tuples are written as (x,))
function application (eg foo(1, bar, buz()))
functions are first class so they can also be returned from other functions and directly be applied (eg foo()() is legal because foo() might return a function)
So a fairly complex program in this language is
(1+2*3, f(4,5,6)(bar) + qux.quux()().quuux)
the associativity is supposed to be
( (1+(2*3)), ( ((f(4,5,6))(bar)) + ((((qux.quux)())()).quuux) ) )
I'm currently using the very nice uu-parsinglib an applicative parser combinator library.
The first problem was obviously that the intuitive expression grammar (expr -> identifier | number | expr * expr | expr + expr | (expr) is left-recursive. But I could solve that problem using the the pChainl combinator (see parseExpr in the example below).
The remaining problem (hence this question) is function application with functions returned from other functions (f()()). Again, the grammar is left recursive expr -> fun-call | ...; fun-call -> expr ( parameter-list ). Any ideas how I can solve this problem elegantly using uu-parsinglib? (the problem should directly apply to parsec, attoparsec and other parser combinators as well I guess).
See below my current version of the program. It works well but function application is only working on identifiers to remove the left-recursion:
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE RankNTypes #-}
module TestExprGrammar
(
) where
import Data.Foldable (asum)
import Data.List (intercalate)
import Text.ParserCombinators.UU
import Text.ParserCombinators.UU.Utils
import Text.ParserCombinators.UU.BasicInstances
data Node =
NumberLiteral Integer
| Identifier String
| Tuple [Node]
| MemberAccess Node Node
| FunctionCall Node [Node]
| BinaryOperation String Node Node
parseFunctionCall :: Parser Node
parseFunctionCall =
FunctionCall <$>
parseIdentifier {- `parseExpr' would be correct but left-recursive -}
<*> parseParenthesisedNodeList 0
operators :: [[(Char, Node -> Node -> Node)]]
operators = [ [('+', BinaryOperation "+")]
, [('*' , BinaryOperation "*")]
, [('.', MemberAccess)]
]
samePrio :: [(Char, Node -> Node -> Node)] -> Parser (Node -> Node -> Node)
samePrio ops = asum [op <$ pSym c <* pSpaces | (c, op) <- ops]
parseExpr :: Parser Node
parseExpr =
foldr pChainl
(parseIdentifier
<|> parseNumber
<|> parseTuple
<|> parseFunctionCall
<|> pParens parseExpr
)
(map samePrio operators)
parseNodeList :: Int -> Parser [Node]
parseNodeList n =
case n of
_ | n < 0 -> parseNodeList 0
0 -> pListSep (pSymbol ",") parseExpr
n -> (:) <$>
parseExpr
<* pSymbol ","
<*> parseNodeList (n-1)
parseParenthesisedNodeList :: Int -> Parser [Node]
parseParenthesisedNodeList n = pParens (parseNodeList n)
parseIdentifier :: Parser Node
parseIdentifier = Identifier <$> pSome pLetter <* pSpaces
parseNumber :: Parser Node
parseNumber = NumberLiteral <$> pNatural
parseTuple :: Parser Node
parseTuple =
Tuple <$> parseParenthesisedNodeList 1
<|> Tuple [] <$ pSymbol "()"
instance Show Node where
show n =
let showNodeList ns = intercalate ", " (map show ns)
showParenthesisedNodeList ns = "(" ++ showNodeList ns ++ ")"
in case n of
Identifier i -> i
Tuple ns -> showParenthesisedNodeList ns
NumberLiteral n -> show n
FunctionCall f args -> show f ++ showParenthesisedNodeList args
MemberAccess f g -> show f ++ "." ++ show g
BinaryOperation op l r -> "(" ++ show l ++ op ++ show r ++ ")"

Looking briefly at the list-like combinators for uu-parsinglib (I'm more familiar with parsec), I think you can solve this by folding over the result of the pSome combinator:
parseFunctionCall :: Parser Node
parseFunctionCall =
foldl' FunctionCall <$>
parseIdentifier {- `parseExpr' would be correct but left-recursive -}
<*> pSome (parseParenthesisedNodeList 0)
This is also equivalent to the Alternative some combinator, which should indeed apply to the other parsing libs you mentioned.

I don't know this library but can show you how to remove left recursion. The standard right recursive expression grammar is
E -> T E'
E' -> + TE' | eps
T -> F T'
T' -> * FT' | eps
F -> NUMBER | ID | ( E )
To add function application you must decide its level of precedence. In most languages I've seen it is highest. So you'd add another layer of productions for function application.
E -> T E'
E' -> + TE' | eps
T -> AT'
T' -> * A T' | eps
A -> F A'
A' -> ( E ) A' | eps
F -> NUMBER | ID | ( E )
Yes this is a hairy-looking grammar and bigger than the left recursive one. That's the price of top-down predictive parsing. If you want simpler grammars use a bottom up parser generator a la yacc.

Custom ADT vs. Tree for parser return value

I'm using Parsec to build a simple Lisp parser.
What are the (dis)advantages of using a custom ADT for the parser types versus using a standard Tree (i.e. Data.Tree)?
After trying both ways, I've come up with a couple points for custom ADTs (i.e. Parser ASTNode):
seems to be much clearer and simpler
others have done it this way(including Tiger, which is/was bundled with Parsec)
and one against (i.e. Parser (Tree ASTNode):
Data.Tree already has Functor, Monad, etc. instances, which will be very helpful for semantic analysis, evaluation, calculating code statistics
For example:
custom ADT
import Text.ParserCombinators.Parsec
data ASTNode
= Application ASTNode [ASTNode]
| Symbol String
| Number Float
deriving (Show)
int :: Parser ASTNode
int = many1 digit >>= (return . Number . read)
symbol :: Parser ASTNode
symbol = many1 (oneOf ['a'..'z']) >>= (return . Symbol)
whitespace :: Parser String
whitespace = many1 (oneOf " \t\n\r\f")
app :: Parser ASTNode
app =
char '(' >>
sepBy1 expr whitespace >>= (\(e:es) ->
char ')' >>
(return $ Application e es))
expr :: Parser ASTNode
expr = symbol <|> int <|> app
example use:
ghci> parse expr "" "(a 12 (b 13))"
Right
(Application
(Symbol "a")
[Number 12.0, Application
(Symbol "b")
[Number 13.0]])
Data.Tree
import Text.ParserCombinators.Parsec
import Data.Tree
data ASTNode
= Application (Tree ASTNode)
| Symbol String
| Number Float
deriving (Show)
int :: Parser (Tree ASTNode)
int = many1 digit >>= (\x -> return $ Node (Number $ read x) [])
symbol :: Parser (Tree ASTNode)
symbol = many1 (oneOf ['a' .. 'z']) >>= (\x -> return $ Node (Symbol x) [])
whitespace :: Parser String
whitespace = many1 (oneOf " \t\n\r\f")
app :: Parser (Tree ASTNode)
app =
char '(' >>
sepBy1 expr whitespace >>= (\(e:es) ->
char ')' >>
(return $ Node (Application e) es))
expr :: Parser (Tree ASTNode)
expr = symbol <|> int <|> app
and example use:
ghci> parse expr "" "(a 12 (b 13))"
Right
(Node
(Application
(Node (Symbol "a") []))
[Node (Number 12.0) [],
Node
(Application
(Node (Symbol "b") []))
[Node (Number 13.0) []]])
(sorry for the formatting -- hopefully it's clear)

I'd absolutely go for the AST, because interpretation/compilation/language analysis in general is very much driven by the structure of your language. The AST will simply and naturally represent and respect that structure, while Tree will do neither.
For example, a common form of language implementation technique is to implement some complex features by translation: translate programs that involve those features or constructs into programs in a subset of the a language that does not use them (Lisp macros, for example, are all about this). If you use an AST, the type system will, for example, often forbid you from producing illegal translations as output. Whereas a Tree type that doesn't understand your program will not help there.
Your AST doesn't look very complicated, so writing utility functions for it should not be hard. Take this one for example:
foldASTNode :: (r -> [r] -> r) -> (String -> r) -> (Float -> r) -> r
foldASTNode app sym num node =
case node of
Application f args -> app (subfold f) (map subfold args)
Symbol str -> sym str
Number n -> num n
where subfold = foldASTNode app sym num
And in any case, what sort of Functor do you wish to have on your AST? There's no type parameter on it...

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Indentation-aware parsing of expression trees - parsing

Related

Parser written in Haskell not working as intended

Parsing juxtaposition-based, indentation-aware syntax using Text.Parsec.Layout

Writing Parser for S Expressions

Parsing an expression grammar having function application with parser combinators (left-recursion)

Custom ADT vs. Tree for parser return value

Categories

Resources