Parsec Parsing list of different kind of statements - parsing

I'm trying to parse (for now) a subset of the Dot language.
The grammar is here and my code is the following
import System.Environment
import System.IO
import qualified Text.Parsec.Token as P
import Text.ParserCombinators.Parsec.Char -- for letter
import Text.Parsec
import qualified Control.Applicative as App
import Lib
type Id = String
data Dot = Undirected Id Stmts
| Directed Id Stmts
deriving (Show)
data Stmt = NodeStmt Node | EdgeStmt Edges
deriving (Show)
type Stmts = [Stmt]
data Node = Node Id Attributes deriving (Show)
data Edge = Edge Id Id deriving (Show)
type Edges = [Edge]
data Attribute = Attribute Id Id deriving (Show)
type Attributes = [Attribute]
dotDef :: P.LanguageDef st
dotDef = P.LanguageDef
{ P.commentStart = "/*"
, P.commentEnd = "*/"
, P.commentLine = "//"
, P.nestedComments = True
, P.identStart = letter
, P.identLetter = alphaNum
, P.reservedNames = ["node", "edge", "graph", "digraph", "subgraph", "strict" ]
, P.caseSensitive = True
, P.opStart = oneOf "-="
, P.opLetter = oneOf "->"
, P.reservedOpNames = []
}
lexer = P.makeTokenParser dotDef
brackets = P.brackets lexer
braces = P.braces lexer
identifier = P.identifier lexer
reserved = P.reserved lexer
semi = P.semi lexer
comma = P.comma lexer
reservedOp = P.reservedOp lexer
eq_op = reservedOp "="
undir_edge_op = reservedOp "--"
dir_edge_op = reservedOp "->"
edge_op = undir_edge_op <|> dir_edge_op
-- -> Attribute
attribute = do
id1 <- identifier
eq_op
id2 <- identifier
optional (semi <|> comma)
return $ Attribute id1 id2
a_list = many attribute
bracked_alist =
brackets $ option [] a_list
attributes =
do
nestedAttributes <- many1 bracked_alist
return $ concat nestedAttributes
nodeStmt = do
nodeName <- identifier
attr <- option [] attributes
return $ NodeStmt $ Node nodeName attr
dropLast = reverse . tail . reverse
edgeStmt = do
nodes <- identifier `sepBy1` edge_op
return $ EdgeStmt $ fmap (\x -> Edge (fst x) (snd x)) (zip (dropLast nodes) (tail nodes))
stmt = do
x <- nodeStmt <|> edgeStmt
optional semi
return x
stmt_list = many stmt
graphDecl = do
reserved "graph"
varName <- option "" identifier
stms <- braces stmt_list
return $ Undirected varName stms
digraphDecl = do
reserved "digraph"
varName <- option "" identifier
stms <- braces stmt_list
return $ Directed varName stms
topLevel3 = do
spaces
graphDecl <|> digraphDecl
main :: IO ()
main = do
(file:_) <- getArgs
content <- readFile file
case parse topLevel3 "" content of
Right g -> print g
Left err -> print err
Given this input
digraph PZIFOZBO{
a[toto = bar] b ; c ; w // 1
a->b // 2
}
It works fine if line 1 or line 2 is commented, but if both are enabled, it fails with
(line 3, column 10): unexpected "-" expecting identifier or "}"
My understanding it that the parser picks first matching rule (with backtracking). Here both edge and node statement starts with and identifier, so it always pick this one.
I tried reversing the order in stmt, without any luck.
I also tried to sprinkle some try in stmt, nodeStmt and edgeStmt, without luck either.
Any help appreciated.

Note that I get the same error whether or not line 1 is commented out, so:
digraph PZIFOZBO{
a->b
}
also says unexpected "-".
As I think you have correctly diagnosed, the problem here is that the stmt parser tries nodeStmt first. That succeeds and parses "a", leaving "->b" yet to be consumed, but ->b isn't a valid statement. Note that Parsec does not backtrack automatically in the absence of a try, so it's not going to go back and revisit this decisions when it "discovers" that ->b can't be parsed.
You can "fix" this problem by swapping the order in stmt:
x <- edgeStmt <|> nodeStmt
but now the parse will break on an expression like a[toto = bar]. That's because edgeStmt is buggy. It parses "a" as a valid statement EdgeStmt [] because sepBy1 allows a single edge "a", which isn't what you want.
If you rewrite edgeStmt to require at least one edge:
import Control.Monad (guard)
edgeStmt = do
nodes <- identifier `sepBy1` edge_op
guard $ length nodes > 1
return $ EdgeStmt $ fmap (\x -> Edge (fst x) (snd x)) (zip (dropLast nodes) (tail nodes))
and adjust stmt to "try" an edge statement first and backtrack to a node statement:
stmt = do
x <- try edgeStmt <|> nodeStmt
optional semi
return x
then your example compiles fine.

Related

How to parse a recursive left syntax rule with FParsec?

I usually use FParsec for LL grammars, but sometimes it happens that in a whole grammar only one element requires left recursive parsing (so the grammar is no longer LL). Currently I have such a situation, I have a large LL grammar implemented with FParsec, but a small grammar element is bothering me because it obviously cannot be parsed correctly.
The syntax element in question is an access to an array index à la F#, e.g. myArray.[index] where myArray can be any expression and index can be any expression too. It turns out that my function calls use square brackets, not parentheses, and my identifiers can be qualified with dots.
An example of correct syntax for an expression is: std.fold[fn, f[myArray.[0]], std.tail[myArray]].
The .[] syntax element is obviously left recursive, but perhaps there is a trick that allows me to parse it anyway? My minimal code is as follows:
open FParsec
type Name = string list
type Expr =
(* foo, Example.Bar.fizz *)
| Variable of Name
(* 9, 17, -1 *)
| Integer of int
(* foo[3, 2], Std.sqrt[2] *)
| FunCall of Name * Expr list
(* (a + b), (a + (1 - c)) *)
| Parens of Expr
(* myArray.[0], table.[index - 1] *)
| ArrayAccess of Expr * Expr
(* a + b *)
| Addition of Expr * Expr
let opp =
new OperatorPrecedenceParser<Expr, _, _>()
let pExpr = opp.ExpressionParser
let pName =
let id =
identifier (IdentifierOptions(isAsciiIdStart = isAsciiLetter, isAsciiIdContinue = isAsciiLetter))
sepBy1 id (skipChar '.')
let pVariable = pName |>> Variable
let pInt = pint32 |>> Integer
let pFunCall =
pipe4
pName
(spaces >>. skipChar '[')
(sepBy (spaces >>. pExpr) (skipChar ','))
(spaces >>. skipChar ']')
(fun name _ args _ -> FunCall(name, args))
let pArrayAccess =
pipe5
pExpr
(spaces >>. skipChar '.')
(spaces >>. skipChar '[')
(spaces >>. pExpr)
(spaces >>. skipChar ']')
(fun expr _ _ index _ -> ArrayAccess(expr, index))
let pParens =
between (skipChar '(') (skipChar ')') (spaces >>. pExpr)
opp.TermParser <-
choice [ attempt pFunCall
pVariable
pArrayAccess
pInt
pParens ]
.>> spaces
let addInfixOperator str prec assoc mapping =
opp.AddOperator
<| InfixOperator(str, spaces, prec, assoc, (), (fun _ leftTerm rightTerm -> mapping leftTerm rightTerm))
addInfixOperator "+" 6 Associativity.Left (fun a b -> Addition(a, b))
let startParser = runParserOnString (pExpr .>> eof) () ""
printfn "%A" <| startParser "std.fold[fn, f[myArray.[0]], std.tail[myArray]]"
One way to do this is as follows: instead of making a list of parsing choices that also lists pArrayAccess like above, which will at some point cause an infinite loop, one can modify pExpr to parse the grammar element in question as an optional element following an expression:
let pExpr =
parse {
let! exp = opp.ExpressionParser
let pArrayAccess =
between (skipString ".[") (skipString "]") opp.ExpressionParser
match! opt pArrayAccess with
| None -> return exp
| Some index -> return ArrayAccess(exp, index)
}
After testing, it turns out that this works very well if the following two conditions are not met:
The contents of the square brackets must not contain access to another array ;
An array cannot be accessed a second time in succession (my2DArray.[x].[y]).
This restricts usage somewhat. How can I get away with this? Is there a way to do this or do I have to change the grammar?
Finally, a solution to this problem is quite simple: just expect a list of array access. If the list is empty, then return the initial expression, otherwise fold over all the array accesses and return the result. Here is the implementation:
let rec pExpr =
parse {
let! exp = opp.ExpressionParser
let pArrayAccess =
between (skipString ".[") (skipString "]") pExpr
match! many pArrayAccess with
| [] -> return exp
| xs -> return List.fold
(fun acc curr -> ArrayAccess(acc, curr)) exp xs
}
This way of doing things meets my needs, so I'd be happy with it, if anyone passes by and wants something more general and not applicable with the proposed solution, then I refer to #Martin Freedman comment, using createParserForwardedToRef().

Record parsing in Haskell

I am building a parser using Megaparsec and I don't know which is the best approach to parse a structure like
names a b c
surnames d e f g
where names and surnames are keywords followed by a list of strings, and each of the two line is optional. This means that also
names a b c
and
surnames d e f g
are valid.
I can parse every line with something like
maybeNames <- optional $ do
constant "names"
many identifier
where identifier parses a valid non-reserved string.
Now, I'm not sure how to express that each line is optional, but still retrieve its value if it is present
Start with writing the context free grammar for your format:
program ::= lines
lines ::= line | line lines
line ::= names | surnames
names ::= NAMES ids
surnames ::= SURNAMES ids
ids ::= id | id ids
id ::= STRING
Where upper case names are for terminals,
and lower case names are for non terminals.
You could then easily use Alex + Happy to parse your text file.
You can do something similar to what appears in this guide and use <|>
to select optional arguments. Here are the essence of things:
whileParser :: Parser Stmt
whileParser = between sc eof stmt
stmt :: Parser Stmt
stmt = f <$> sepBy1 stmt' semi
where
-- if there's only one stmt return it without using ‘Seq’
f l = if length l == 1 then head l else Seq l
stmt' = ifStmt
<|> whileStmt
<|> skipStmt
<|> assignStmt
<|> parens stmt
ifStmt :: Parser Stmt
ifStmt = do
rword "if"
cond <- bExpr
rword "then"
stmt1 <- stmt
rword "else"
stmt2 <- stmt
return (If cond stmt1 stmt2)
whileStmt :: Parser Stmt
whileStmt = do
rword "while"
cond <- bExpr
rword "do"
stmt1 <- stmt
return (While cond stmt1)

Is it possible to force backtrack all options?

I need to parse this syntax for function declaration
foo x = 1
Func "foo" (Ident "x") = 1
foo (x = 1) = 1
Func "foo" (Label "x" 1) = 1
foo x = y = 1
Func "foo" (Ident "x") = (Label "y" 1)
I have written this parser
module SimpleParser where
import Text.Parsec.String (Parser)
import Text.Parsec.Language (emptyDef)
import Text.Parsec
import qualified Text.Parsec.Token as Tok
import Text.Parsec.Char
import Prelude
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser style
where
style = emptyDef {
Tok.identLetter = alphaNum
}
parens :: Parser a -> Parser a
parens = Tok.parens lexer
commaSep :: Parser a -> Parser [a]
commaSep = Tok.commaSep1 lexer
commaSep1 :: Parser a -> Parser [a]
commaSep1 = Tok.commaSep1 lexer
identifier :: Parser String
identifier = Tok.identifier lexer
reservedOp :: String -> Parser ()
reservedOp = Tok.reservedOp lexer
data Expr = IntLit Int | Ident String | Label String Expr | Func String Expr Expr | ExprList [Expr] deriving (Eq, Ord, Show)
integer :: Parser Integer
integer = Tok.integer lexer
litInt :: Parser Expr
litInt = do
n <- integer
return $ IntLit (fromInteger n)
ident :: Parser Expr
ident = Ident <$> identifier
paramLabelItem = litInt <|> paramLabel
paramLabel :: Parser Expr
paramLabel = do
lbl <- try (identifier <* reservedOp "=")
body <- paramLabelItem
return $ Label lbl body
paramItem :: Parser Expr
paramItem = parens paramRecord <|> litInt <|> try paramLabel <|> ident
paramRecord :: Parser Expr
paramRecord = ExprList <$> commaSep1 paramItem
func :: Parser Expr
func = do
name <- identifier
params <- paramRecord
reservedOp "="
body <- paramRecord
return $ (Func name params body)
parseExpr :: String -> Either ParseError Expr
parseExpr s = parse func "" s
I can parse foo (x) = 1 but I cannot parse foo x = 1
parseExpr "foo x = 1"
Left (line 1, column 10):
unexpected end of input
expecting digit, "," or "="
I understand it tries to parse this code like Func "foo" (Label "x" 1) and fails. But after fail why it cannot try to parse it like Func "foo" (Ident "x") = 1
Is there any way to do it?
Also I have tried to swap ident and paramLabel
paramItem = parens paramRecord <|> litInt <|> try paramLabel <|> ident
paramItem = parens paramRecord <|> litInt <|> try ident <|> paramLabel
In this case I can parse foo x = 1 but I cannot parse foo (x = 1) = 2
parseExpr "foo (x = 1) = 2"
Left (line 1, column 8):
unexpected "="
expecting "," or ")"
Here is how I understand how Parsec backtracking works (and doesn't work):
In:
(try a <|> try b <|> c)
if a fails, b will be tried, and if b subsequently fails c will be tried.
However, in:
(try a <|> try b <|> c) >> d
if a succeeds but d fails, Parsec does not go back and try b. Once a succeeded Parsec considers the whole choice as parsed and it moves on to d. It will never go back to try b or c.
This doesn't work either for the same reason:
(try (try a <|> try b <|> c)) >> d
Once a or b succeeds, the whole choice succeeds, and therefor the outer try succeeds. Parsing then moves on to d.
A solution is to distribute d to within the choice:
try (a >> d) <|> try (b >> d) <|> (c >> d)
Now if a succeeds but d fails, b >> d will be tried.

Parsing juxtaposition-based, indentation-aware syntax using Text.Parsec.Layout

I'm trying to parse a small language with Haskell-like syntax, using parsec-layout. The two key features that don't seem to interact too well with each other are:
Function application syntax is juxtaposition, i.e. if F and E are terms, F E is the syntax for F applied to E.
Indentation can be used to denote nesting, i.e. the following two are equivalent:
X = case Y of
A -> V
B -> W
X = case Y of A -> V; B -> W
I haven't managed to figure out a combination of skipping and keeping whitespace that would allow me to parse a list of such definitions. Here's my simplified code:
import Text.Parsec hiding (space, runP)
import Text.Parsec.Layout
import Control.Monad (void)
type Parser = Parsec String LayoutEnv
data Term = Var String
| App Term Term
| Case Term [(String, Term)]
deriving Show
name :: Parser String
name = spaced $ (:) <$> upper <*> many alphaNum
kw :: String -> Parser ()
kw = void . spaced . string
reserved :: String -> Parser ()
reserved s = try $ spaced $ string s >> notFollowedBy alphaNum
term :: Parser Term
term = foldl1 App <$> part `sepBy1` space
where
part = choice [ caseBlock
, Var <$> name
]
caseBlock = Case <$> (reserved "case" *> term <* reserved "of") <*> laidout alt
alt = (,) <$> (name <* kw "->") <*> term
binding :: Parser (String, Term)
binding = (,) <$> (name <* kw "=") <*> term
-- https://github.com/luqui/parsec-layout/issues/1
trim :: String -> String
trim = reverse . dropWhile (== '\n') . reverse
runP :: Parser a -> String -> Either ParseError a
runP p = runParser (p <* eof) defaultLayoutEnv "" . trim
If I try to run it on input like
s = unlines [ "A = case B of"
, " X -> Y Z"
, "C = D"
]
via runP (laidout binding) s, it fails on the application Y Z:
(line 2, column 10):
expecting space or semi-colon
However, if I change the definition of term to
term = foldl1 App <$> many1 part
then it doesn't stop parsing the term at the start of the (unindented!) third line, leading to
(line 3, column 4):
expecting semi-colon
I think the problem has to do with that name already eliminates the following space, so the sepBy1 in the definition of term doesn't see it.
Consider these simplified versions of term:
term0 = foldl1 App <$> (Var <$> name) `sepBy1` space
term1 = foldl1 App <$> (Var <$> name') `sepBy1` space
name' = (:) <$> upper <*> many alphaNum
term3 = foldl1 App <$> many (Var <$> name)
Then:
runP term0 "A B C" -- fails
runP term1 "A B C" -- succeeds
runP term3 "A B C" -- succeeds
I think part of the solution is to define
part = [ caseBlock, Var <$> name' ]
where name' is as above. However, there are still some issues.

Haskell read variable name

I need to write a code that parses some language. I got stuck on parsing variable name - it can be anything that is at least 1 char long, starts with lowercase letter and can contain underscore '_' character. I think I made a good start with following code:
identToken :: Parser String
identToken = do
c <- letter
cs <- letdigs
return (c:cs)
where letter = satisfy isLetter
letdigs = munch isLetter +++ munch isDigit +++ munch underscore
num = satisfy isDigit
underscore = \x -> x == '_'
lowerCase = \x -> x `elem` ['a'..'z'] -- how to add this function to current code?
ident :: Parser Ident
ident = do
_ <- skipSpaces
s <- identToken
skipSpaces; return $ s
idents :: Parser Command
idents = do
skipSpaces; ids <- many1 ident
...
This function however gives me a weird results. If I call my test function
test_parseIdents :: String -> Either Error [Ident]
test_parseIdents p =
case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left InvalidParse
multipleRes -> Left (AmbiguousIdents multipleRes)
where
prog :: Parser [Ident]
prog = do
result <- many ident
eof
return result
like this:
test_parseIdents "test"
I get this:
Left (AmbiguousIdents [(["test"],""),(["t","est"],""),(["t","e","st"],""),
(["t","e","st"],""),(["t","est"],""),(["t","e","st"],""),(["t","e","st"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],""),
(["t","e","s","t"],""),(["t","e","s","t"],""),(["t","e","s","t"],"")])
Note that Parser is just synonym for ReadP a.
I also want to encode in the parser that variable names should start with a lowercase character.
Thank you for your help.
Part of the problem is with your use of the +++ operator. The following code works for me:
import Data.Char
import Text.ParserCombinators.ReadP
type Parser a = ReadP a
type Ident = String
identToken :: Parser String
identToken = do c <- satisfy lowerCase
cs <- letdigs
return (c:cs)
where lowerCase = \x -> x `elem` ['a'..'z']
underscore = \x -> x == '_'
letdigs = munch (\c -> isLetter c || isDigit c || underscore c)
ident :: Parser Ident
ident = do _ <- skipSpaces
s <- identToken
skipSpaces
return s
test_parseIdents :: String -> Either String [Ident]
test_parseIdents p = case readP_to_S prog p of
[(j, "")] -> Right j
[] -> Left "Invalid parse"
multipleRes -> Left ("Ambiguous idents: " ++ show multipleRes)
where prog :: Parser [Ident]
prog = do result <- many ident
eof
return result
main = print $ test_parseIdents "test_1349_zefz"
So what went wrong:
+++ imposes an order on its arguments, and allows for multiple alternatives to succeed (symmetric choice). <++ is left-biased so only the left-most option succeeds -> this would remove the ambiguity in the parse, but still leaves the next problem.
Your parser was looking for letters first, then digits, and finally underscores. Digits after underscores failed, for example. The parser had to be modified to munch characters that were either letters, digits or underscores.
I also removed some functions that were unused and made an educated guess for the definition of your datatypes.

Resources