I am trying to parse an XML dump of Wikipedia to find certain links on each page using the Haskell Parsec library. Links are denoted by double brackets: texttext[[link]]texttext. To simplify the scenario as much as possible, let's say I am looking for the first link not enclosed in double curly braces (which can be nested): {{ {{ [[Wrong Link]] }} [[Wrong Link]] }} [[Right Link]]. I have written a parser to discard links which are enclosed in non-nested double braces:
import Text.Parsec
getLink :: String -> Either ParseError String
getLink = parse linkParser "Links"
linkParser = do
beforeLink
link <- many $ noneOf "]"
string "]]"
return link
beforeLink = manyTill (many notLink) (try $ string "[[")
notLink = try doubleCurlyBrac <|> (many1 normalText)
normalText = noneOf "[{"
<|> notFollowedByItself '['
<|> notFollowedByItself '{'
notFollowedByItself c = try ( do x <- char c
notFollowedBy $ char c
return x)
doubleCurlyBrac = between (string "{{") (string "}}") (many $ noneOf "}")
getLinkTest = fmap getLink testList
where testList = [" [[rightLink]] " --Correct link is found
, " {{ [[Wrong_Link]] }} [[rightLink]]" --Correct link is found
, " {{ {{ }} [[Wrong_Link]] }} [[rightLink]]" ] --Wrong link is found
I have tried making the doubleCurlyBrac parser recursive to also discard links in nested curly braces, without success:
doubleCurlyBrac = between (string "{{") (string "}}") betweenBraces
where betweenBraces = doubleCurlyBrac <|> (many $ try $ noneOf "}")
This parser stops consuming input after the first }}, rather than the final one, in a nested example. Is there an elegant way to write a recursive parser to (in this case) correctly ignore links in nested double curly braces? Also, can it be done without using try? I have found that since try does not consume input, it often causes the parser to hang on unexpected, ill-formed input.
Here's a more direct version that doesn't use a custom lexer. It does use try though, and I don't see how to avoid it here. The problem is that it seems we need a non-committing look ahead to distinguish double brackets from single brackets; try is for non-committing look ahead.
The high level approach is that same as in
my first answer. I've been careful
to make the three node parsers commute -- making the code more robust
to change -- by using both try and notFollowedBy:
{-# LANGUAGE TupleSections #-}
import Text.Parsec hiding (string)
import qualified Text.Parsec
import Control.Applicative ((<$>) , (<*) , (<*>))
import Control.Monad (forM_)
import Data.List (find)
import Debug.Trace
----------------------------------------------------------------------
-- Token parsers.
llink , rlink , lbrace , rbrace :: Parsec String u String
[llink , rlink , lbrace , rbrace] = reserved
reserved = map (try . Text.Parsec.string) ["[[" , "]]" , "{{" , "}}"]
----------------------------------------------------------------------
-- Node parsers.
-- Link, braces, or string.
data Node = L [Node] | B [Node] | S String deriving Show
nodes :: Parsec String u [Node]
nodes = many node
node :: Parsec String u Node
node = link <|> braces <|> string
link , braces , string :: Parsec String u Node
link = L <$> between llink rlink nodes
braces = B <$> between lbrace rbrace nodes
string = S <$> many1 (notFollowedBy (choice reserved) >> anyChar)
----------------------------------------------------------------------
parseNodes :: String -> Either ParseError [Node]
parseNodes = parse (nodes <* eof) "<no file>"
----------------------------------------------------------------------
-- Tests.
getLink :: [Node] -> Maybe Node
getLink = find isLink where
isLink (L _) = True
isLink _ = False
parseLink :: String -> Either ParseError (Maybe Node)
parseLink = either Left (Right . getLink) . parseNodes
testList = [ " [[rightLink]] "
, " {{ [[Wrong_Link]] }} [[rightLink]]"
, " {{ {{ }} [[Wrong_Link]] }} [[rightLink]]"
, " [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}"
-- Pathalogical example from comments.
, "{{ab}cd}}"
-- A more pathalogical example.
, "{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf"
-- No top level link.
, "{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}"
-- Too many '{{'.
, "{{ {{ {{ [[ asdf ]] }} }}"
-- Too many '}}'.
, "{{ {{ [[ asdf ]] }} }} }}"
-- Too many '[['.
, "[[ {{ [[{{[[asdf]]}}]]}}"
]
main =
forM_ testList $ \ t -> do
putStrLn $ "Test: ^" ++ t ++ "$"
let parses = ( , ) <$> parseNodes t <*> parseLink t
printParses (n , l) = do
putStrLn $ "Nodes: " ++ show n
putStrLn $ "Link: " ++ show l
printError = putStrLn . show
either printError printParses parses
putStrLn ""
The output is the same in the non-error cases:
Test: ^ [[rightLink]] $
Nodes: [S " ",L [S "rightLink"],S " "]
Link: Just (L [S "rightLink"])
Test: ^ {{ [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ {{ {{ }} [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",B [S " "],S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}$
Nodes: [S " ",L [B [L [S "someLink"]]],S " ",B [],S " ",B [L [S "asdf"]]]
Link: Just (L [B [L [S "someLink"]]])
Test: ^{{ab}cd}}$
Nodes: [B [S "ab}cd"]]
Link: Nothing
Test: ^{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf$
Nodes: [S "{ [ { {asf{",L [S "[asdfa"],S "]}aasdff ] ] ] ",B [L [S "asdf"]],S "asdf"]
Link: Just (L [S "[asdfa"])
Test: ^{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}$
Nodes: [B [L [S "Wrong_Link"],S "asdf",L [S "WRong_Link"],B []],B [L [L [S "Wrong"]]]]
Link: Nothing
but the parse error messages are not as informative in the cases of
unmatched openings:
Test: ^{{ {{ {{ [[ asdf ]] }} }}$
"<no file>" (line 1, column 26):
unexpected end of input
expecting "[[", "{{", "]]" or "}}"
Test: ^{{ {{ [[ asdf ]] }} }} }}$
"<no file>" (line 1, column 26):
unexpected "}}"
Test: ^[[ {{ [[{{[[asdf]]}}]]}}$
"<no file>" (line 1, column 25):
unexpected end of input
expecting "[[", "{{", "]]" or "}}"
and I couldn't figure out how to fix them.
My solution does not use try, but is relatively complicated: I used
your question as an excuse to learn how to create a lexer in
Parsec without using
makeTokenParser :D I avoid try because the only look ahead happens in the lexer (tokenize), where the various bracket pairs are identified.
The high level idea is that we treat {{, }}, [[, and ]] as
special tokens and parse the input into an AST. You didn't specify
the grammar precisely, so I chose a simple one that generates your
examples:
node ::= '{{' node* '}}'
| '[[' node* ']]'
| string
string ::= <non-empty string without '{{', '}}', '[[', or ']]'>
I parse an input string into a list of nodes. The first top-level
link ([[) node, if any, is the link you're looking for.
The approach I've taken should be relatively robust to grammar
changes. E.g., if you want to allow only strings inside links, then
change '[[' node* ']]' to '[[' string ']]'. (In the code
link = L <$> between llink rlink nodes
becomes
link = L <$> between llink rlink string
).
The code is fairly long, but mostly straightforward. Most of it
concerns creating the token stream (lexing) and parsing the individual
tokens. After this, the actual Node parsing is very simple.
Here it is:
{-# LANGUAGE TupleSections #-}
import Text.Parsec hiding (char , string)
import Text.Parsec.Pos (updatePosString , updatePosChar)
import Control.Applicative ((<$>) , (<*) , (<*>))
import Control.Monad (forM_)
import Data.List (find)
----------------------------------------------------------------------
-- Lexing.
-- Character or punctuation.
data Token = C Char | P String deriving Eq
instance Show Token where
show (C c) = [c]
show (P s) = s
tokenize :: String -> [Token]
tokenize [] = []
tokenize [c] = [C c]
tokenize (c1:c2:cs) = case [c1,c2] of
"[[" -> ts
"]]" -> ts
"{{" -> ts
"}}" -> ts
_ -> C c1 : tokenize (c2:cs)
where
ts = P [c1,c2] : tokenize cs
----------------------------------------------------------------------
-- Token parsers.
-- We update the 'sourcePos' while parsing the tokens. Alternatively,
-- we could have annotated the tokens with positions above in
-- 'tokenize', and then here we would use 'token' instead of
-- 'tokenPrim'.
llink , rlink , lbrace , rbrace :: Parsec [Token] u Token
[llink , rlink , lbrace , rbrace] =
map (t . P) ["[[" , "]]" , "{{" , "}}"]
where
t x = tokenPrim show update match where
match y = if x == y then Just x else Nothing
update pos (P s) _ = updatePosString pos s
char :: Parsec [Token] u Char
char = tokenPrim show update match where
match (C c) = Just c
match (P _) = Nothing
update pos (C c) _ = updatePosChar pos c
----------------------------------------------------------------------
-- Node parsers.
-- Link, braces, or string.
data Node = L [Node] | B [Node] | S String deriving Show
nodes :: Parsec [Token] u [Node]
nodes = many node
node :: Parsec [Token] u Node
node = link <|> braces <|> string
link , braces , string :: Parsec [Token] u Node
link = L <$> between llink (rlink <?> "]]") nodes
braces = B <$> between lbrace (rbrace <?> "}}") nodes
string = S <$> many1 char
----------------------------------------------------------------------
parseNodes :: String -> Either ParseError [Node]
parseNodes = parse (nodes <* eof) "<no file>" . tokenize
----------------------------------------------------------------------
-- Tests.
getLink :: [Node] -> Maybe Node
getLink = find isLink where
isLink (L _) = True
isLink _ = False
parseLink :: String -> Either ParseError (Maybe Node)
parseLink = either Left (Right . getLink) . parseNodes
testList = [ " [[rightLink]] "
, " {{ [[Wrong_Link]] }} [[rightLink]]"
, " {{ {{ }} [[Wrong_Link]] }} [[rightLink]]"
, " [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}"
-- Pathalogical example from comments.
, "{{ab}cd}}"
-- A more pathalogical example.
, "{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf"
-- No top level link.
, "{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}"
-- Too many '{{'.
, "{{ {{ {{ [[ asdf ]] }} }}"
-- Too many '}}'.
, "{{ {{ [[ asdf ]] }} }} }}"
-- Too many '[['.
, "[[ {{ [[{{[[asdf]]}}]]}}"
]
main =
forM_ testList $ \ t -> do
putStrLn $ "Test: ^" ++ t ++ "$"
let parses = ( , ) <$> parseNodes t <*> parseLink t
printParses (n , l) = do
putStrLn $ "Nodes: " ++ show n
putStrLn $ "Link: " ++ show l
printError = putStrLn . show
either printError printParses parses
putStrLn ""
The output from main is:
Test: ^ [[rightLink]] $
Nodes: [S " ",L [S "rightLink"],S " "]
Link: Just (L [S "rightLink"])
Test: ^ {{ [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ {{ {{ }} [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",B [S " "],S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}$
Nodes: [S " ",L [B [L [S "someLink"]]],S " ",B [],S " ",B [L [S "asdf"]]]
Link: Just (L [B [L [S "someLink"]]])
Test: ^{{ab}cd}}$
Nodes: [B [S "ab}cd"]]
Link: Nothing
Test: ^{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf$
Nodes: [S "{ [ { {asf{",L [S "[asdfa"],S "]}aasdff ] ] ] ",B [L [S "asdf"]],S "asdf"]
Link: Just (L [S "[asdfa"])
Test: ^{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}$
Nodes: [B [L [S "Wrong_Link"],S "asdf",L [S "WRong_Link"],B []],B [L [L [S "Wrong"]]]]
Link: Nothing
Test: ^{{ {{ {{ [[ asdf ]] }} }}$
"<no file>" (line 1, column 26):
unexpected end of input
expecting }}
Test: ^{{ {{ [[ asdf ]] }} }} }}$
"<no file>" (line 1, column 24):
unexpected }}
expecting end of input
Test: ^[[ {{ [[{{[[asdf]]}}]]}}$
"<no file>" (line 1, column 25):
unexpected end of input
expecting ]]
Related
I would like to parse a basic indented language using Megaparsec. Originally I was using Parsec which I managed to get working correctly with indentation but now I'm having quite some trouble.
I've been following a tutorial here and here's the code I have to parse a language ignoring indentation.
module Parser where
import Data.Functor ((<$>), (<$))
import Control.Applicative (Applicative(..))
import qualified Control.Monad as M
import Control.Monad (void)
import Data.Functor.Identity
import Data.Text (Text)
import qualified Data.Text as Text
import Data.Void
import Text.Megaparsec
import Text.Megaparsec.Char
import Text.Megaparsec.Perm
import Text.Megaparsec.Expr
import qualified Text.Megaparsec.Char.Lexer as L
import Text.Pretty.Simple
import Data.Either.Unwrap
--import Lexer
import Syntax
type Parser = Parsec Void String
lineComment :: Parser ()
lineComment = L.skipLineComment "#"
scn :: Parser ()
scn = L.space space1 lineComment empty
sc :: Parser () -- ‘sc’ stands for “space consumer”
sc = L.space (void $ takeWhile1P Nothing f) lineComment empty
where
f x = x == ' ' || x == '\t'
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
symbol :: String -> Parser String
symbol = L.symbol sc
integer :: Parser Integer
integer = lexeme L.decimal
semi :: Parser String
semi = symbol ";"
rword :: String -> Parser ()
rword w = lexeme (string w *> notFollowedBy alphaNumChar)
rws :: [String] -- list of reserved words
rws = ["if","then","else","while","do","skip","true","false","not","and","or"]
identifier :: Parser String
identifier = (lexeme . try) (p >>= check)
where
p = (:) <$> letterChar <*> many alphaNumChar
check x = if x `elem` rws
then fail $ "keyword " ++ show x ++ " cannot be an identifier"
else return x
parens :: Parser a -> Parser a
parens = between (symbol "(") (symbol ")")
whileParser :: Parser Stmt
whileParser = between sc eof stmt
stmt :: Parser Stmt
stmt = f <$> sepBy1 stmt' semi
where
-- if there's only one stmt return it without using ‘Seq’
f l = if length l == 1 then head l else Seq l
stmt' :: Parser Stmt
stmt' = ifStmt
<|> whileStmt
<|> skipStmt
<|> assignStmt
<|> parens stmt
ifStmt :: Parser Stmt
ifStmt = do
rword "if"
cond <- bExpr
rword "then"
stmt1 <- stmt
rword "else"
stmt2 <- stmt
return (If cond stmt1 stmt2)
whileStmt :: Parser Stmt
whileStmt = do
rword "while"
cond <- bExpr
rword "do"
stmt1 <- stmt
return (While cond stmt1)
assignStmt :: Parser Stmt
assignStmt = do
var <- identifier
void (symbol ":=")
expr <- aExpr
return (Assign var expr)
skipStmt :: Parser Stmt
skipStmt = Skip <$ rword "skip"
aExpr :: Parser AExpr
aExpr = makeExprParser aTerm aOperators
bExpr :: Parser BExpr
bExpr = makeExprParser bTerm bOperators
aOperators :: [[Operator Parser AExpr]]
aOperators =
[ [Prefix (Neg <$ symbol "-") ]
, [ InfixL (ABinary Multiply <$ symbol "*")
, InfixL (ABinary Divide <$ symbol "/") ]
, [ InfixL (ABinary Add <$ symbol "+")
, InfixL (ABinary Subtract <$ symbol "-") ]
]
bOperators :: [[Operator Parser BExpr]]
bOperators =
[ [Prefix (Not <$ rword "not") ]
, [InfixL (BBinary And <$ rword "and")
, InfixL (BBinary Or <$ rword "or") ]
]
aTerm :: Parser AExpr
aTerm = parens aExpr
<|> Var <$> identifier
<|> IntConst <$> integer
bTerm :: Parser BExpr
bTerm = parens bExpr
<|> (BoolConst True <$ rword "true")
<|> (BoolConst False <$ rword "false")
<|> rExpr
rExpr :: Parser BExpr
rExpr = do
a1 <- aExpr
op <- relation
a2 <- aExpr
return (RBinary op a1 a2)
relation :: Parser RBinOp
relation = (symbol ">" *> pure Greater)
<|> (symbol "<" *> pure Less)
parsePrint :: String -> IO()
parsePrint s = do
parseTest stmt' s
Running this parses correctly.
parsePrint $ unlines
[ "while (true) do if(false) then x := 5 else y := 20"
]
This is the code for parsing indentation from the second tutorial here.
{-# LANGUAGE TupleSections #-}
module Main where
import Control.Applicative (empty)
import Control.Monad (void)
import Data.Void
import Data.Char (isAlphaNum)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
type Parser = Parsec Void String
lineComment :: Parser ()
lineComment = L.skipLineComment "#"
scn :: Parser ()
scn = L.space space1 lineComment empty
sc :: Parser ()
sc = L.space (void $ takeWhile1P Nothing f) lineComment empty
where
f x = x == ' ' || x == '\t'
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
pItem :: Parser String
pItem = lexeme (takeWhile1P Nothing f) <?> "list item"
where
f x = isAlphaNum x || x == '-'
pComplexItem :: Parser (String, [String])
pComplexItem = L.indentBlock scn p
where
p = do
header <- pItem
return (L.IndentMany Nothing (return . (header, )) pLineFold)
pLineFold :: Parser String
pLineFold = L.lineFold scn $ \sc' ->
let ps = takeWhile1P Nothing f `sepBy1` try sc'
f x = isAlphaNum x || x == '-'
in unwords <$> ps <* sc
pItemList :: Parser (String, [(String, [String])])
pItemList = L.nonIndented scn (L.indentBlock scn p)
where
p = do
header <- pItem
return (L.IndentSome Nothing (return . (header, )) pComplexItem)
parser :: Parser (String, [(String, [String])])
parser = pItemList <* eof
main :: IO ()
main = return ()
I would like as an example for this to parse correctly.
parsePrint $ unlines
[ "while (true) do"
, " if(false) then x := 5 else y := 20"
]
How could I parse indentation correctly? Also are there any other places with tutorials/documentation on using Megaparsec?
After spending a lot of time on this over the last couple of weeks I managed to work it out. It was a matter of moving from using strings to using my own "Expr" data type.
For anybody else who would like to start writing an indented language this code could be a good start!
Parser
{-# LANGUAGE TupleSections #-}
module IndentTest where
import Control.Applicative (empty)
import Control.Monad (void)
import Data.Void
import Data.Char (isAlphaNum)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
import Text.Megaparsec.Expr
import Block
type Parser = Parsec Void String
-- Tokens
lineComment :: Parser ()
lineComment = L.skipLineComment "#"
scn :: Parser ()
scn = L.space space1 lineComment empty
sc :: Parser ()
sc = L.space (void $ takeWhile1P Nothing f) lineComment empty
where
f x = x == ' ' || x == '\t'
symbol :: String -> Parser String
symbol = L.symbol sc
rword :: String -> Parser ()
rword w = lexeme (string w *> notFollowedBy alphaNumChar)
rws :: [String] -- list of reserved words
rws = ["module", "println", "import", "let", "if","then","else","while","do","skip","true","false","not","and","or"]
word :: Parser String
word = (lexeme . try) (p >>= check)
where
p = (:) <$> alphaNumChar <*> many alphaNumChar
check x = if x `elem` rws
then fail $ "keyword " ++ show x ++ " cannot be an word"
else return x
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
integer :: Parser Integer
integer = lexeme L.decimal
parens :: Parser a -> Parser a
parens = between (symbol "(") (symbol ")")
aTerm :: Parser AExpr
aTerm = parens aExpr
<|> Var <$> identifier
<|> IntConst <$> integer
aOperators :: [[Operator Parser AExpr]]
aOperators =
[ [Prefix (Neg <$ symbol "-") ]
, [ InfixL (ABinary Multiply <$ symbol "*")
, InfixL (ABinary Divide <$ symbol "/") ]
, [ InfixL (ABinary Add <$ symbol "+")
, InfixL (ABinary Subtract <$ symbol "-") ]
]
aExpr :: Parser AExpr
aExpr = makeExprParser aTerm aOperators
assignArith :: Parser Expr
assignArith = do
var <- identifier
symbol ":"
vType <- valType
symbol "="
e <- aExpr
return $ AssignArith vType var e
bTerm :: Parser BExpr
bTerm = parens bExpr
<|> (BoolConst True <$ rword "true")
<|> (BoolConst False <$ rword "false")
<|> rExpr
bOperators :: [[Operator Parser BExpr]]
bOperators =
[ [Prefix (Not <$ rword "not") ]
, [InfixL (BBinary And <$ rword "and")
, InfixL (BBinary Or <$ rword "or") ]
]
bExpr :: Parser BExpr
bExpr = makeExprParser bTerm bOperators
rExpr :: Parser BExpr
rExpr = do
a1 <- aExpr
op <- relation
a2 <- aExpr
return (RBinary op a1 a2)
relation :: Parser RBinOp
relation = (symbol ">" *> pure Greater)
<|> (symbol "<" *> pure Less)
identifier :: Parser String
identifier = (lexeme . try) (p >>= check)
where
p = (:) <$> letterChar <*> many alphaNumChar
check x = if x `elem` rws
then fail $ "keyword " ++ show x ++ " cannot be an identifier"
else return x
stringLiteral :: Parser Expr
stringLiteral = do
value <- char '"' >> manyTill L.charLiteral (char '"')
symbol ";"
return $ StringLiteral value
assignString :: Parser Expr
assignString = do
var <- identifier
symbol ":"
vType <- valType
symbol "="
e <- stringLiteral
return (AssignString vType var e)
arrayDef :: Parser Expr
arrayDef = do
name <- identifier
symbol ":"
symbol "["
arrType <- word
symbol "]"
symbol "="
return $ ArrayDef arrType name
arrayValues :: Parser Expr
arrayValues = do
symbol "["
values <- many identifier
symbol "]"
return $ ArrayValues values
arrayAssign :: Parser Expr
arrayAssign = do
def <- arrayDef
values <- arrayValues
return $ ArrayAssignment def values
arrayElementSelect :: Parser Expr
arrayElementSelect = do
symbol "!!"
elementNum <- word
return $ ArrayElementSelect elementNum
moduleParser :: Parser Expr
moduleParser = L.nonIndented scn (L.indentBlock scn p)
where
p = do
rword "module"
name <- identifier
return (L.IndentSome Nothing (return . (Module name)) expr')
valType :: Parser Expr
valType = do
value <- identifier
return $ Type value
argumentType :: Parser Expr
argumentType = do
value <- identifier
return $ ArgumentType value
returnType :: Parser Expr
returnType = do
value <- identifier
return $ ReturnType value
argument :: Parser Expr
argument = do
value <- identifier
return $ Argument value
-- Function parser
functionParser :: Parser Expr
functionParser = L.indentBlock scn p
where
p = do
name <- identifier
symbol ":"
argTypes <- some argumentType
symbol "->"
rType <- IndentTest.returnType
nameDup <- L.lineFold scn $ \sp' ->
(identifier) `sepBy1` try sp' <* scn
args <- many argument
symbol "="
if(name == "main") then
return (L.IndentMany Nothing (return . (MainFunction name argTypes args rType)) expr')
else
return (L.IndentMany Nothing (return . (Function name argTypes args rType)) expr')
functionCallParser :: Parser Expr
functionCallParser = do
name <- identifier
args <- parens $ many argument
return $ FunctionCall name args
printParser :: Parser Expr
printParser = do
rword "println"
bodyArr <- identifier
symbol ";"
return $ Print bodyArr
valueToken :: Parser String
valueToken = lexeme (takeWhile1P Nothing f) <?> "list item"
where
f x = isAlphaNum x || x == '-'
ifStmt :: Parser Expr
ifStmt = L.indentBlock scn p
where
p = do
rword "if"
cond <- bExpr
return (L.IndentMany Nothing (return . (If cond)) expr')
elseStmt :: Parser Expr
elseStmt = L.indentBlock scn p
where
p = do
rword "else"
return (L.IndentMany Nothing (return . (Else)) expr')
whereStmt :: Parser Expr
whereStmt = do
rword "where"
symbol "{"
exprs <- many expr
symbol "}"
return $ (Where exprs)
expr :: Parser Expr
expr = f <$> sepBy1 expr' (symbol ";")
where
-- if there's only one expr return it without using ‘Seq’
f l = if length l == 1 then head l else Seq l
expr' :: Parser Expr
expr' = try moduleParser
<|> try functionParser
<|> try ifStmt
<|> try elseStmt
<|> try arrayAssign
<|> arrayElementSelect
<|> try assignArith
<|> try functionCallParser
<|> try assignString
<|> try printParser
<|> try whereStmt
<|> try stringLiteral
parser :: Parser Expr
parser = expr'
parseFromFile file = runParser expr file <$> readFile file
parseString input =
case parse expr' "" input of
Left e -> show e
Right x -> show x
parsePrint :: String -> IO()
parsePrint s = parseTest' parser s
Block/Expr - The AST consists of this
module Block where
import Data.List
import Text.Show.Functions
import Data.Char
import Data.Maybe
-- Boolean expressions
data BExpr
= BoolConst Bool
| Not BExpr
| BBinary BBinOp BExpr BExpr
| RBinary RBinOp AExpr AExpr
instance Show BExpr where
show (BoolConst b) = lowerString $ show b
show (Not n) = show n
show (BBinary bbinop bExpr1 bExpr2) = show bExpr1 ++ " " ++ show bbinop ++ " " ++ show bExpr2
show (RBinary rbinop aExpr1 aExpr2) = show aExpr1 ++ " " ++ show rbinop ++ " " ++ show aExpr2
-- Boolean ops
data BBinOp
= And
| Or
instance Show BBinOp where
show (And) = "&&"
show (Or) = "||"
-- R binary ops
data RBinOp
= Greater
| Less
instance Show RBinOp where
show (Greater) = ">"
show (Less) = "<"
-- Arithmetic expressions
data AExpr
= Var String
| IntConst Integer
| Neg AExpr
| ABinary ABinOp AExpr AExpr
| Parenthesis AExpr
instance Show AExpr where
show (Var v) = v
show (IntConst i) = show i
show (Neg aExpr) = "-" ++ show aExpr
show (ABinary aBinOp aExpr1 aExpr2) = show aExpr1 ++ " " ++ show aBinOp ++ " " ++ show aExpr2
show (Parenthesis aExpr) = "(" ++ show aExpr ++ ")"
-- Arithmetic ops
data ABinOp
= OpeningParenthesis
| ClosingParenthesis
| Add
| Subtract
| Multiply
| Divide
instance Show ABinOp where
show (Add) = "+"
show (Subtract) = "-"
show (Multiply) = "*"
show (Divide) = "/"
show (OpeningParenthesis) = "("
show (ClosingParenthesis) = ")"
-- Statements
data Expr
= Seq [Expr]
| Module String [Expr]
| Import String String
| MainFunction {name ::String, argTypes:: [Expr], args::[Expr], returnType::Expr, body::[Expr]}
| Function String [Expr] [Expr] Expr [Expr]
| FunctionCall String [Expr]
| Type String
| ValueType String
| Argument String
| ArgumentType String
| ReturnType String
| AssignArith Expr String AExpr
| AssignString Expr String Expr
| If BExpr [Expr]
| Else [Expr]
| While BExpr [Expr]
| Print String
| Return Expr
| ArrayValues [String]
| ArrayDef String String
| ArrayAssignment Expr Expr
| ArrayElementSelect String
| Lambda String String
| Where [Expr]
| StringLiteral String
| Skip
instance Show Expr where
show (Module name bodyArray) =
-- Get the main function tree
"public class " ++ name ++ "{\n" ++
"public static void main(String[] args){\n" ++
name ++ " " ++ lowerString name ++ "= new " ++ name ++ "();\n" ++
intercalate "\n" (map (\mStatement -> if(isFunctionCall mStatement) then (lowerString name ++ "." ++ show mStatement) else show mStatement) (body ((filter (isMainFunction) bodyArray)!!0))) ++
"}\n" ++
getFunctionString bodyArray ++
"}\n"
show (Import directory moduleName) = "import " ++ directory ++ moduleName
show (Function name argTypes args returnType body) = "public " ++ show returnType ++ " " ++ name ++ "("++ intercalate ", " (zipWith (\x y -> x ++ " " ++ y) (map show argTypes) (map show args)) ++"){\n" ++ intercalate "\n" (map show body) ++ "}"
show (MainFunction name argTypes args returnType body) =
intercalate "\n " $ map show body
show (FunctionCall name exprs) = name ++ "(" ++ (intercalate ", " (map show exprs)) ++ ");"
show (Type b) = b
show (Argument b) = b
show (ArgumentType b) = b
show (ReturnType b) = b
show (AssignArith vType name value) = "" ++ show vType ++ " " ++ name ++ "=" ++ show value ++ ";"
show (AssignString vType name value) = "" ++ show vType ++ " " ++ name ++ "=" ++ show value ++ ";"
show (If condition statement) = "if(" ++ show condition ++ "){\n" ++ intercalate "\n" (map show statement) ++ "}"
show (Else statement) = " else {\n" ++ intercalate "\n" (map show statement) ++ "}"
show (While condition statement) = "while(" ++ show condition ++ "){\n" ++ intercalate "\n" (map show statement) ++ "}"
show (Skip) = "[skip]"
show (Seq s) = "[seq]"
show (Return expr) = "return " ++ show expr ++ ";"
show (Print exprs) = "System.out.println(" ++ exprs ++ ");" --"System.out.println(" ++ intercalate " " (map show exprs) ++ ");"
show (ArrayDef arrType name) = arrType ++ "[] " ++ name ++ "="
show (ArrayValues exprs) = "{" ++ intercalate ", " exprs ++ "};"
show (ArrayAssignment arr values) = show arr ++ show values
show (ArrayElementSelect i) = "[" ++ i ++ "];"
show (Lambda valName collectionName) = ""
show (Where exprs) = intercalate "\n" (map show exprs)
show (StringLiteral value) = "\"" ++ value ++ "\""
show (_) = "<unknown>"
lowerString str = [ toLower loweredString | loweredString <- str]
extractMain :: Expr -> Maybe String
extractMain (MainFunction m _ _ _ _) = Just m
extractMain _ = Nothing
extractFunctionCall :: Expr -> Maybe String
extractFunctionCall (FunctionCall m _) = Just m
extractFunctionCall _ = Nothing
isMainFunction :: Expr -> Bool
isMainFunction e = isJust $ extractMain e
isFunctionCall :: Expr -> Bool
isFunctionCall e = isJust $ extractFunctionCall e
{--
getInnerMainFunctionString :: [Expr] -> String -> String
getInnerMainFunctionString e instanceName = do
if(isMainFunction (e!!0)) then
show (e!!0)
else
getInnerMainFunctionString (drop 1 e) instanceName
--}
getFunctionString :: [Expr] -> String
getFunctionString e = do
if(isMainFunction (e!!0)) then
""
else
"" ++ show (e!!0) ++ getFunctionString (drop 1 e)
Code Example
module IndentationTest
testFunction : int -> void
testFunction x =
if(x < 50)
println x;
nextX :int = x + 1 * 2 - 3 / 2 + 5
testFunction (nextX)
else
last :int = 1000
println last;
main : String -> IO
main args =
x :int = 3
y :int = 10
z :int = 15
arrTest:[int] = [x y z]
println arrTest;
testFunction (x)
stringTest :String = "Helloworld";
This will successfully parse the example code. Just pass it into the parsePrint function.
I'm trying to parse a small language with Haskell-like syntax, using parsec-layout. The two key features that don't seem to interact too well with each other are:
Function application syntax is juxtaposition, i.e. if F and E are terms, F E is the syntax for F applied to E.
Indentation can be used to denote nesting, i.e. the following two are equivalent:
X = case Y of
A -> V
B -> W
X = case Y of A -> V; B -> W
I haven't managed to figure out a combination of skipping and keeping whitespace that would allow me to parse a list of such definitions. Here's my simplified code:
import Text.Parsec hiding (space, runP)
import Text.Parsec.Layout
import Control.Monad (void)
type Parser = Parsec String LayoutEnv
data Term = Var String
| App Term Term
| Case Term [(String, Term)]
deriving Show
name :: Parser String
name = spaced $ (:) <$> upper <*> many alphaNum
kw :: String -> Parser ()
kw = void . spaced . string
reserved :: String -> Parser ()
reserved s = try $ spaced $ string s >> notFollowedBy alphaNum
term :: Parser Term
term = foldl1 App <$> part `sepBy1` space
where
part = choice [ caseBlock
, Var <$> name
]
caseBlock = Case <$> (reserved "case" *> term <* reserved "of") <*> laidout alt
alt = (,) <$> (name <* kw "->") <*> term
binding :: Parser (String, Term)
binding = (,) <$> (name <* kw "=") <*> term
-- https://github.com/luqui/parsec-layout/issues/1
trim :: String -> String
trim = reverse . dropWhile (== '\n') . reverse
runP :: Parser a -> String -> Either ParseError a
runP p = runParser (p <* eof) defaultLayoutEnv "" . trim
If I try to run it on input like
s = unlines [ "A = case B of"
, " X -> Y Z"
, "C = D"
]
via runP (laidout binding) s, it fails on the application Y Z:
(line 2, column 10):
expecting space or semi-colon
However, if I change the definition of term to
term = foldl1 App <$> many1 part
then it doesn't stop parsing the term at the start of the (unindented!) third line, leading to
(line 3, column 4):
expecting semi-colon
I think the problem has to do with that name already eliminates the following space, so the sepBy1 in the definition of term doesn't see it.
Consider these simplified versions of term:
term0 = foldl1 App <$> (Var <$> name) `sepBy1` space
term1 = foldl1 App <$> (Var <$> name') `sepBy1` space
name' = (:) <$> upper <*> many alphaNum
term3 = foldl1 App <$> many (Var <$> name)
Then:
runP term0 "A B C" -- fails
runP term1 "A B C" -- succeeds
runP term3 "A B C" -- succeeds
I think part of the solution is to define
part = [ caseBlock, Var <$> name' ]
where name' is as above. However, there are still some issues.
The problem
I came across a problem today and I do not know how to solve it. It is very strange to me, because the code I've written should (according to my current knowledge) is correct.
So below you can find a sample parser combinators. The most important one is pOperator, which in very simple way (only for demonstration purposes) builds an operator AST.
It consumes "x" and can consume multiple "x" separated by whitespaces.
I've got also pParens combinator which is defined like:
pPacked pParenL (pWSpaces *> pParenR)
so it consumes Whitespaces before closing bracket.
Sample input / output
The correct input/output SHOULD be:
in: "(x)"
out: Single "x"
in: "(x )"
out: Single "x"
but I'm getting:
in: "(x)"
out: Single "x"
in: "(x )"
out: Multi (Single "x") (Single "x")
-- Correcting steps:
-- Inserted 'x' at position LineColPos 0 3 3 expecting one of ['\t', ' ', 'x']
but in the second example I'm getting error - and the parser behaves like it greedy eats some tokens (and there is no greedy operation).
I would be thankful for any help with it.
Sample code
import Prelude hiding(lex)
import Data.Char hiding (Space)
import qualified Text.ParserCombinators.UU as UU
import Text.ParserCombinators.UU hiding(parse)
import qualified Text.ParserCombinators.UU.Utils as Utils
import Text.ParserCombinators.UU.BasicInstances hiding (Parser)
data El = Multi El El
| Single String
deriving (Show)
---------- Example core grammar ----------
pElement = Single <$> pSyms "x"
pOperator = applyAll <$> pElement <*> pMany (flip <$> (Multi <$ pWSpaces1) <*> pElement)
---------- Basic combinators ----------
applyAll x (f:fs) = applyAll (f x) fs
applyAll x [] = x
pSpace = pSym ' '
pTab = pSym '\t'
pWSpace = pSpace <|> pTab
pWSpaces = pMany pWSpace
pWSpaces1 = pMany1 pWSpace
pMany1 p = (:) <$> p <*> pMany p
pSyms [] = pReturn []
pSyms (x : xs) = (:) <$> pSym x <*> pSyms xs
pParenL = Utils.lexeme $ pSym '('
pParenR = Utils.lexeme $ pSym ')'
pParens = pPacked pParenL (pWSpaces *> pParenR)
---------- Program ----------
pProgram = pParens pOperator
-- if you replace it with following line, it works:
-- pProgram = pParens pElement
-- so it seems like something in pOperator is greedy
tests = [ ("test", "(x)")
, ("test", "(x )")
]
---------- Helpers ----------
type Parser a = P (Str Char String LineColPos) a
parse p s = UU.parse ( (,) <$> p <*> pEnd) (createStr (LineColPos 0 0 0) s)
main :: IO ()
main = do
mapM_ (\(desc, p) -> putStrLn ("\n=== " ++ desc ++ " ===") >> run pProgram p) tests
return ()
run :: Show t => Parser t -> String -> IO ()
run p inp = do let (a, errors) = parse p inp
putStrLn ("-- Result: \n" ++ show a)
if null errors then return ()
else do putStr ("-- Correcting steps: \n")
show_errors errors
putStrLn "-- "
where show_errors :: (Show a) => [a] -> IO ()
show_errors = sequence_ . (map (putStrLn . show))
IMPORTANT
pOperator = applyAll <$> pElement <*> pMany (flip <$> (Multi <$ pWSpaces1) <*> pElement)
is equivalent to:
foldr pChainl pElement (Multi <$ pWSpaces1)
according to: Combinator Parsing: A Short Tutorial
And it is used to define operator precedense.
The definition of pMany reads:
pMany :: IsParser p => p a -> p [a]
pMany p = pList p
and this suggest the solution. When seeing the space we should not commit immediately to the choice to continue with more x-es so we define:
pMany :: IsParser p => p a -> p [a]
pMany_ng p = pList_ng p
Of course you may also call pList_ng immediately. Even better would be to write:
pParens (pChainr_ng (pMulti <$ pWSpaces1) px) --
I did not test it since I am not sure whether between x-es there should be at least one space etc.
Doaitse
I'm trying to learn Parsec by implementing a small regular expression parser. In BNF, my grammar looks something like:
EXP : EXP *
| LIT EXP
| LIT
I've tried to implement this in Haskell as:
expr = try star
<|> try litE
<|> lit
litE = do c <- noneOf "*"
rest <- expr
return (c : rest)
lit = do c <- noneOf "*"
return [c]
star = do content <- expr
char '*'
return (content ++ "*")
There are some infinite loops here though (e.g. expr -> star -> expr without consuming any tokens) which makes the parser loop forever. I'm not really sure how to fix it though, because the very nature of star is that it consumes its mandatory token at the end.
Any thoughts?
You should use Parsec.Expr.buildExprParser; it is ideal for this purpose. You simply describe your operators, their precedence and associativity, and how to parse an atom, and the combinator builds the parser for you!
You probably also want to add the ability to group terms with parens so that you can apply * to more than just a single literal.
Here's my attempt (I threw in |, +, and ? for good measure):
import Control.Applicative
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr
data Term = Literal Char
| Sequence [Term]
| Repeat (Int, Maybe Int) Term
| Choice [Term]
deriving ( Show )
term :: Parser Term
term = buildExpressionParser ops atom where
ops = [ [ Postfix (Repeat (0, Nothing) <$ char '*')
, Postfix (Repeat (1, Nothing) <$ char '+')
, Postfix (Repeat (0, Just 1) <$ char '?')
]
, [ Infix (return sequence) AssocRight
]
, [ Infix (choice <$ char '|') AssocRight
]
]
atom = msum [ Literal <$> lit
, parens term
]
lit = noneOf "*+?|()"
sequence a b = Sequence $ (seqTerms a) ++ (seqTerms b)
choice a b = Choice $ (choiceTerms a) ++ (choiceTerms b)
parens = between (char '(') (char ')')
seqTerms (Sequence ts) = ts
seqTerms t = [t]
choiceTerms (Choice ts) = ts
choiceTerms t = [t]
main = parseTest term "he(llo)*|wor+ld?"
Your grammar is left-recursive, which doesn’t play nice with try, as Parsec will repeatedly backtrack. There are a few ways around this. Probably the simplest is just making the * optional in another rule:
lit :: Parser (Char, Maybe Char)
lit = do
c <- noneOf "*"
s <- optionMaybe $ char '*'
return (c, s)
Of course, you’ll probably end up wrapping things in a data type anyway, and there are a lot of ways to go about it. Here’s one, off the top of my head:
import Control.Applicative ((<$>))
data Term = Literal Char
| Sequence [Term]
| Star Term
expr :: Parser Term
expr = Sequence <$> many term
term :: Parser Term
term = do
c <- lit
s <- optionMaybe $ char '*' -- Easily extended for +, ?, etc.
return $ if isNothing s
then Literal c
else Star $ Literal c
Maybe a more experienced Haskeller will come along with a better solution.
Please note, subsequently to posting this question I managed to derive a solution myself. See the end of this question for my final answer.
I'm working on a little parser at the moment for org-mode documents, and in these documents headings can have a title, and may optionally consist of a list of tags at the of the heading:
* Heading :foo:bar:baz:
I'm having difficulty writing a parser for this, however. The following is what I'm working with for now:
import Control.Applicative
import Text.ParserCombinators.Parsec
data Node = Node String [String]
deriving (Show)
myTest = parse node "" "Some text here :tags:here:"
node = Node <$> (many1 anyChar) <*> tags
tags = (char ':') >> (sepEndBy1 (many1 alphaNum) (char ':'))
<?> "Tag list"
While my simple tags parser works, it doesn't work in the context of node because all of the characters are used up parsing the title of the heading (many1 anyChar). Furthermore, I can't change this parser to use noneOf ":" because : is valid in the title. In fact, it's only special if it's in a taglist, at the very end of the line.
Any ideas how I can parse this optional data?
As an aside, this is my first real Haskell project, so if Parsec is not even the right tool for the job - feel free to point that out and suggest other options!
Ok, I got a complete solution now, but it needs refactoring. The following works:
import Control.Applicative hiding (many, optional, (<|>))
import Control.Monad
import Data.Char (isSpace)
import Text.ParserCombinators.Parsec
data Node = Node { level :: Int, keyword :: Maybe String, heading :: String, tags :: Maybe [String] }
deriving (Show)
parseNode = Node <$> level <*> (optionMaybe keyword) <*> name <*> (optionMaybe tags)
where level = length <$> many1 (char '*') <* space
keyword = (try (many1 upper <* space))
name = noneOf "\n" `manyTill` (eof <|> (lookAhead (try (tags *> eof))))
tags = char ':' *> many1 alphaNum `sepEndBy1` char ':'
myTest = parse parseNode "org-mode" "** Some : text here :tags: JUST KIDDING :tags:here:"
myTest2 = parse parseNode "org-mode" "* TODO Just a node"
import Control.Applicative hiding (many, optional, (<|>))
import Control.Monad
import Text.ParserCombinators.Parsec
instance Applicative (GenParser s a) where
pure = return
(<*>) = ap
data Node = Node { name :: String, tags :: Maybe [String] }
deriving (Show)
parseNode = Node <$> name <*> tags
where tags = optionMaybe $ optional (string " :") *> many (noneOf ":\n") `sepEndBy` (char ':')
name = noneOf "\n" `manyTill` try (string " :" <|> string "\n")
myTest = parse parseNode "" "Some:text here :tags:here:"
myTest2 = parse parseNode "" "Sometext here :tags:here:"
Results:
*Main> myTest
Right (Node {name = "Some:text here", tags = Just ["tags","here",""]})
*Main> myTest2
Right (Node {name = "Sometext here", tags = Just ["tags","here",""]})