Using the Earley library to parse with features and unification - parsing

The Earley parsing library is great for writing linguistic parsers in Haskell. CFGs can be specified in an intuitive way, and there is excellent support for backtracking and ambiguity. A simple example:
{-# LANGUAGE OverloadedStrings #-}
import Text.Earley
np = rule ("John" <|> "Mary")
vp = rule ("runs" <|> "walks")
sentence = do
subj <- np
pred <- vp
return $ (++) <$> subj <*> pred
sentence can be used to parse ["John", "runs"] or ["Mary", "walks"], among other inputs.
It would be nice to be able to use Earley to write parsers for FCFGs, where nonterminals are complexes of a label and a feature bundle, and feature matching can happen via unification (for example, the Earley parser in NLTK parses FCFGs). However, it is not clear how to do this using Earley, or whether it can even be done. An example of something we might want in something like BNF:
np[sg] ::= "John" | "Mary"
np[?x] ::= det n[?x]
n[pl] ::= "boys" | "girls"
det ::= "the"
vp[sg] ::= "runs" | "walks"
vp[pl] ::= "run" | "walk"
s ::= np[?x] vp[?x]
Under this FCFG, ["John", "runs"] is an s (since their number features match, as required by the s rule), and ["the", "boys", "walks"] isn't an s (since ["the", "boys"] parses to np[pl] and ["walks"] parses to vp[sg]).
One can in general rewrite an FCFG into an equivalent CFG, but this can be highly inconvenient, and result in a blowup of the grammar, especially when we have many possible features ranging over many possible values.

You're not actually doing any particularly interesting unification here, so perhaps it's enough to toss a very simple nondeterminism applicative of your own into the mix. The standard one is [], but for this case, even Maybe looks like enough. Like this:
{-# Language OverloadedStrings #-}
{-# Language TypeApplications #-}
import Control.Applicative
import Control.Monad
import Data.Foldable
import Text.Earley
data Feature = SG | PL deriving (Eq, Ord, Read, Show)
(=:=) :: (Feature, a) -> (Feature, b) -> Maybe (a, b)
(fa, a) =:= (fb, b) = (a, b) <$ guard (fa == fb)
data NP = Name String | Determined String String deriving (Eq, Ord, Read, Show)
np :: Grammar r (Prod r e String (Feature, NP))
np = rule . asum $
[ fmap (\name -> (SG, Name name)) ("John" <|> "Mary")
, liftA2 (\det n -> (PL, Determined det n)) "the" ("boys" <|> "girls")
]
vp :: Grammar r (Prod r e String (Feature, String))
vp = rule . asum $
[ (,) SG <$> ("runs" <|> "walks")
, (,) PL <$> ("run" <|> "walk")
]
s :: Grammar r (Prod r e String (Maybe (NP, String)))
s = liftA2 (liftA2 (=:=)) np vp
test :: [String] -> IO ()
test = print . allParses #() (parser s)
Try it out in ghci:
> sequence_ [test (words n ++ [v]) | n <- ["John", "the boys"], v <- ["walks", "walk"]]
([(Just (Name "John","walks"),2)],Report {position = 2, expected = [], unconsumed = []})
([(Nothing,2)],Report {position = 2, expected = [], unconsumed = []})
([(Nothing,3)],Report {position = 3, expected = [], unconsumed = []})
([(Just (Determined "the" "boys","walk"),3)],Report {position = 3, expected = [], unconsumed = []})
So, the result needs a bit of interpretation -- a successful parse of Nothing really counts as a failed parse -- but perhaps that's not so bad? Not sure. Certainly it's unfortunate that you don't get to reuse Earley's error-reporting and nondeterminism machinery. Probably to get either thing, you'd have to fork Earley.
If you need to do real unification you could look into returning a IntBindingT t Identity instead of a Maybe, but at least until your features are themselves recursive this is probably enough and much, much simpler.

Related

Indentation-aware parsing of expression trees

As a follow-up to this question, I am now trying to parse an expression language that has variables and case ... of ... expressions. The syntax should be indentation-based:
Expressions can span multiple lines, as long as every line is indented relative to the first one; i.e. this should be parsed as a single application:
f x y
z
q
Each alternative of a case expression needs to be on its own line, indented relative to the case keyword. Right-hand sides can span multiple lines.
case E of
C -> x
D -> f x
y
should be parsed into a single case with two alternatives, with x and f x y as the right-hand sides
I've simplified my code into the following:
import qualified Text.Megaparsec.Lexer as L
import Text.Megaparsec hiding (space)
import Text.Megaparsec.Char hiding (space)
import Text.Megaparsec.String
import Control.Monad (void)
import Control.Applicative
data Term = Var String
| App [Term]
| Case Term [(String, Term)]
deriving Show
space :: Parser ()
space = L.space (void spaceChar) empty empty
name :: Parser String
name = try $ do
s <- some letterChar
if s `elem` ["case", "of"]
then fail $ unwords ["Unexpected: reserved word", show s]
else return s
term :: Parser () -> Parser Term
term sp = App <$> atom `sepBy1` try sp
where
atom = choice [ caseBlock
, Var <$> L.lexeme sp name
]
caseBlock = L.lineFold sp $ \sp' ->
Case <$>
(L.symbol sp "case" *> L.lexeme sp (term sp) <* L.symbol sp' "of") <*>
alt sp' `sepBy` try sp' <* sp
alt sp' = L.lineFold sp' $ \sp'' ->
(,) <$> L.lexeme sp' name <* L.symbol sp' "->" <*> term sp''
As you can see, I am trying to use the technique from this answer to separate alternatives with sp'aces that are more indented than the case keyword.
Problems
This seems to work for single expressions made up of application only:
λ» parseTest (L.lineFold space term) "x y\n z"
App [Var "x",Var "y",Var "z"]
It doesn't work for list of such expressions using the technique from the linked answer:
λ» parseTest (L.lineFold space $ \sp -> (term sp `sepBy` try sp)) "x\n y\nz"
3:1:
incorrect indentation (got 1, should be greater than 1)
case expressions fail out of the gate when trying to use line-folding:
λ» parseTest (L.lineFold space term) "case x of\n C -> y\n D -> z"
1:5:
Unexpected: reserved word "case"
case works without line folding for the outermost expression, for one alternative only:
λ» parseTest (term space) "case x of\n C -> y\n z"
App [Case (App [Var "x"]) [("C",App [Var "y",Var "z"])]]
But case fails as soon as I have multiple alternatives:
λ» parseTest (term space) "case x of\n C -> y\n D -> z"
3:2:
incorrect indentation (got 2, should be greater than 2)
What am I doing wrong?
I'm answering since I promised to take a look at this. This problem represents a rather difficult problem for Parsec-like parsers in their current state. I probably could make it work after spending much more time that I have available, but in the slot of time I can spend answering this, I only got this far:
module Main (main) where
import Control.Applicative
import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.String
import qualified Data.List.NonEmpty as NE
import qualified Text.Megaparsec.Lexer as L
data Term = Var String
| App [Term]
| Case Term [(String, Term)]
deriving Show
scn :: Parser ()
scn = L.space (void spaceChar) empty empty
sc :: Parser ()
sc = L.space (void $ oneOf " \t") empty empty
name :: Parser String
name = try $ do
s <- some letterChar
if s `elem` ["case", "of"]
then (unexpected . Label . NE.fromList) ("reserved word \"" ++ s ++ "\"")
else return s
manyTerms :: Parser [Term]
manyTerms = many pTerm
pTerm :: Parser Term
pTerm = caseBlock <|> app -- parse a term first
caseBlock :: Parser Term
caseBlock = L.indentBlock scn $ do
void (L.symbol sc "case")
t <- Var <$> L.lexeme sc name -- not sure what sort of syntax case of
-- case expressions should have, so simplified to vars for now
void (L.symbol sc "of")
return (L.IndentSome Nothing (return . Case t) alt)
alt :: Parser (String, Term)
alt = L.lineFold scn $ \sc' ->
(,) <$> L.lexeme sc' name <* L.symbol sc' "->" <*> pTerm -- (1)
app :: Parser Term
app = L.lineFold scn $ \sc' ->
App <$> ((Var <$> name) `sepBy1` try sc' <* scn)
-- simplified here, with some effort should be possible to go from Var to
-- more general Term in applications
Your original grammar is left-recursive because every term can be either a case expression or an application and if it's an application, then the first part of it again can be either case expression or application, etc. You'll need to deal with that somehow.
Here is a session:
λ> parseTest pTerm "x y\n z"
App [Var "x",Var "y",Var "z"]
λ> parseTest pTerm "x\n y\nz"
App [Var "x",Var "y"]
λ> parseTest manyTerms "x\n y\nz"
[App [Var "x",Var "y"],App [Var "z"]]
λ> parseTest pTerm "case x of\n C -> y\n D -> z"
Case (Var "x") [("C",App [Var "y"]),("D",App [Var "z"])]
λ> parseTest pTerm "case x of\n C -> y\n z"
3:3:
incorrect indentation (got 3, should be equal to 2)
This last result is because of (1) in the code. Introducing a parameter to app makes it impossible to use it without thinking of context (it would be no longer stand-alone expression, but factored-out part of something). We can see that if you indent z with respect to start of y application, not the entire alternative, it works:
λ> parseTest pTerm "case x of\n C -> y\n z"
Case (Var "x") [("C",App [Var "y",Var "z"])]
Finally, case expression works:
λ> parseTest pTerm "case x of\n C -> y\n D -> z"
Case (Var "x") [("C",App [Var "y"]),("D",App [Var "z"])]
My advice here would be to take a look at some pre-processor and use Megaparsec on top of that. The tools in Text.Megaparsec.Lexer are not that easy to apply in this case, but they are the best we could come up with and they work fine for simple indentation-sensitive grammars.

Is there any trick about translating BNF to Parsec program?

The BNF that match function call chain (like x(y)(z)...):
expr = term T
T = (expr) T
| EMPTY
term = (expr)
| VAR
Translate it to Parsec program that looks so tricky.
term :: Parser Term
term = parens expr <|> var
expr :: Parser Term
expr = do whiteSpace
e <- term
maybeAddSuffix e
where addSuffix e0 = do e1 <- parens expr
maybeAddSuffix $ TermApp e0 e1
maybeAddSuffix e = addSuffix e
<|> return e
Could you list all the design patterns about translating BNF to Parsec program?
The simplest think you could do if your grammar is sizeable is to just use the Alex/Happy combo. It is fairly straightforward to use, accepts the BNF format directly - no human translation needed - and perhaps most importantly, produces blazingly fast parsers/lexers.
If you are dead set on doing it with parsec (or you are doing this as a learning exercise), I find it easier in general to do it in two stages; first lexing, then parsing. Parsec will do both!
First write the appropriate types:
{-# LANGUAGE LambdaCase #-}
import Text.Parsec
import Text.Parsec.Combinator
import Text.Parsec.Prim
import Text.Parsec.Pos
import Text.ParserCombinators.Parsec.Char
import Control.Applicative hiding ((<|>))
import Control.Monad
data Term = App Term Term | Var String deriving (Show, Eq)
data Token = LParen | RParen | Str String deriving (Show, Eq)
type Lexer = Parsec [Char] () -- A lexer accepts a stream of Char
type Parser = Parsec [Token] () -- A parser accepts a stream of Token
Parsing a single token is simple. For simplicity, a variable is 1 or more letters. You can of course change this however you like.
oneToken :: Lexer Token
oneToken = (char '(' >> return LParen) <|>
(char ')' >> return RParen) <|>
(Str <$> many1 letter)
Parsing the entire token stream is just parsing a single token many times, possible separated by whitespace:
lexer :: Lexer [Token]
lexer = spaces >> many1 (oneToken <* spaces)
Note the placement of spaces: this way, white space is accepted at the beginning and end of the string.
Since Parser uses a custom token type, you have to use a custom satisfy function. Fortunately, this is almost identical to the existing satisfy.
satisfy' :: (Token -> Bool) -> Parser Token
satisfy' f = tokenPrim show
(\src _ _ -> incSourceColumn src 1)
(\x -> if f x then Just x else Nothing)
Then we can write parsers for each of the primitive tokens.
lparen = satisfy' $ \case { LParen -> True ; _ -> False }
rparen = satisfy' $ \case { RParen -> True ; _ -> False }
strTok = (\(Str s) -> s) <$> (satisfy' $ \case { Str {} -> True ; _ -> False })
As you may imagine, parens would be useful for our purposes. It is very straightforward to write.
parens :: Parser a -> Parser a
parens = between lparen rparen
Now the interesting parts.
term, expr, var :: Parser Term
term = parens expr <|> var
var = Var <$> strTok
These two should be fairly obvious to you.
Parec contains combinators option and optionMaybe which are useful when you you need to "maybe do something".
expr = do
e0 <- term
option e0 (parens expr >>= \e1 -> return (App e0 e1))
The last line means - try to apply the parser given to option - if it fails, instead return e0.
For testing you can do:
tokAndParse = runParser (lexer <* eof) () "" >=> runParser (expr <* eof) () ""
The eof attached to each parser is to make sure that the entire input is consumed; the string cannot be a member of the grammar if there are extra trailing characters. Note - your example x(y)(z) is not actually in your grammar!
>tokAndParse "x(y)(z)"
Left (line 1, column 5):
unexpected LParen
expecting end of input
But the following is
>tokAndParse "(x(y))(z)"
Right (App (App (Var "x") (Var "y")) (Var "z"))

Parsing simple molecule names with Attoparsec

I find it extremely difficult to learn how to use Attoparsec, because the documentation is really just an API documentation and there are basically no tutorials around (except the one from FPComplete). If you know other places where I can learn Attoparsec, that'd be great.
I have to parse simple molecule names, in the following format: NaCl, CO2, H2O, HCN, H2O2.
An element name is an uppercase letter optionally followed by a lowercase one (I'm not considering those elements with a symbol longer than 2 characters).
An element can be followed by a number (that would be the subscript in a formula).
New version (thanks to Mark's and Tarmil's suggestions), which compiles but does not parse:
module Chem
where
import Data.Text (Text, pack)
import Control.Applicative ((<*>), (<$>))
import Data.Attoparsec.Text
data Element = Element String Int deriving (Eq, Ord, Show)
type Molecule = [Element]
parseString :: String -> Result Molecule
parseString = parse (many' parseElement) . pack
parseElement :: Parser Element
parseElement = do
el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
n <- option 1 decimal
return $ Element el n
pClass :: String -> Parser String
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)
Any suggestion is appreciated.
EDIT: I managed to get it running. Basically, a Partial continuation was returned, and to finish the parsing it's necessary to feed the parser with an empty bytestring. So the correct parseString would be:
parseString = flip feed empty . parse (many' parseElement) . pack
where empty is Data.Text.empty. However, since I don't need incremental parsing there is the useful function parseOnly, which does not wait for more input and returns an Either.
With that in mind, I rewrote the code like this (it works now):
module Chem
where
import Data.Text (Text, pack)
import Control.Applicative ((<*>), (<$>))
import Data.Attoparsec.Text
data Element = Element String Int deriving (Eq, Ord, Show)
type Molecule = [Element]
parseString :: String -> Either String Molecule
parseString = parseOnly (many' parseElement) . pack
parseElement :: Parser Element
parseElement = do
el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
n <- option 1 decimal
return $ Element el n
pClass :: String -> Parser String
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)
You have two problems in the letters parsing part:
inClass is not a parser, it is a function that is meant to be passed to satisfy.
<*> has type Parser (a -> b) -> Parser a -> Parser b, so the parser on the left should return a function. Typically, it is used like this:
pf <$> p1 <*> p2 <*> ... <*> pn
where pf is a function with n arguments.
So here you probably want something like this:
-- parse a character in the given class, and transform it to a single-char string
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)
-- ...
el <- ((++) <$> pClass "A-Z" <*> pClass "a-z") <|> pClass "A-Z"
-- ...
I think this would be enhanced by using option, instead of duplicating the A-Z parser:
el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")

parsec using between to parse parens

If I wanted to parse a string with multiple parenthesized groups into a list of strings holding each group, for example
"((a b c) a b c)"
into
["((a b c) a b c)","( a b c)"]
How would I do that using parsec? The use of between looks nice but it does not seem possible to separate with a beginning and end value.
I'd use a recursive parser:
data Expr = List [Expr] | Term String
expr :: Parsec String () Expr
expr = recurse <|> terminal
where terminal is your primitives, in this case these seem to be strings of characters so
where terminal = Term <$> many1 letter
and recurse is
recurse = List <$>
(between `on` char) '(' ')' (expr `sepBy1` char ' ')
Now we have a nice tree of Exprs which we can gather with
collect r#(List ts) = r : concatMap collect ts
collect _ = []
While jozefg's solution is almost identical to what I came up with (and I completely agree to all his suggestions), there are some small differences that made me think that I should post a second answer:
Due to the expected result of the initial example, it is not necessary to treat space-separated parts as individual subtrees.
Further it might be interesting to see the part that actually computes the expected results (i.e., a list of strings).
So here is my version. As already suggested by jozefg, split the task into several sub-tasks. Those are:
Parse a string into an algebraic data type representing some kind of tree.
Collect the (desired) subtrees of this tree.
Turn trees into strings.
Concerning 1, we first need a tree data type
import Text.Parsec
import Text.Parsec.String
import Control.Applicative ((<$>))
data Tree = Leaf String | Node [Tree]
and then a function that can parse strings into values of this type.
parseTree :: Parser Tree
parseTree = node <|> leaf
where
node = Node <$> between (char '(') (char ')') (many parseTree)
leaf = Leaf <$> many1 (noneOf "()")
In my version I do consider the hole string between parenthesis as a Leaf node (i.e., I do not split at white-spaces).
Now we need to collect the subtrees of a tree we are interested in:
nodes :: Tree -> [Tree]
nodes (Leaf _) = []
nodes t#(Node ts) = t : concatMap nodes ts
Finally, a Show-instance for Trees allows us to turn them into strings.
instance Show Tree where
showsPrec d (Leaf x) = showString x
showsPrec d (Node xs) = showString "(" . showList xs . showString ")"
where
showList [] = id
showList (x:xs) = shows x . showList xs
Then the original task can be solved, e.g., by:
parseGroups :: Parser [String]
parseGroups = map show . nodes <$> parseTree
> parseTest parseGroups "((a b c) a b c)"
["((a b c) a b c)","(a b c)"]

Using Parsec to parse regular expressions

I'm trying to learn Parsec by implementing a small regular expression parser. In BNF, my grammar looks something like:
EXP : EXP *
| LIT EXP
| LIT
I've tried to implement this in Haskell as:
expr = try star
<|> try litE
<|> lit
litE = do c <- noneOf "*"
rest <- expr
return (c : rest)
lit = do c <- noneOf "*"
return [c]
star = do content <- expr
char '*'
return (content ++ "*")
There are some infinite loops here though (e.g. expr -> star -> expr without consuming any tokens) which makes the parser loop forever. I'm not really sure how to fix it though, because the very nature of star is that it consumes its mandatory token at the end.
Any thoughts?
You should use Parsec.Expr.buildExprParser; it is ideal for this purpose. You simply describe your operators, their precedence and associativity, and how to parse an atom, and the combinator builds the parser for you!
You probably also want to add the ability to group terms with parens so that you can apply * to more than just a single literal.
Here's my attempt (I threw in |, +, and ? for good measure):
import Control.Applicative
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr
data Term = Literal Char
| Sequence [Term]
| Repeat (Int, Maybe Int) Term
| Choice [Term]
deriving ( Show )
term :: Parser Term
term = buildExpressionParser ops atom where
ops = [ [ Postfix (Repeat (0, Nothing) <$ char '*')
, Postfix (Repeat (1, Nothing) <$ char '+')
, Postfix (Repeat (0, Just 1) <$ char '?')
]
, [ Infix (return sequence) AssocRight
]
, [ Infix (choice <$ char '|') AssocRight
]
]
atom = msum [ Literal <$> lit
, parens term
]
lit = noneOf "*+?|()"
sequence a b = Sequence $ (seqTerms a) ++ (seqTerms b)
choice a b = Choice $ (choiceTerms a) ++ (choiceTerms b)
parens = between (char '(') (char ')')
seqTerms (Sequence ts) = ts
seqTerms t = [t]
choiceTerms (Choice ts) = ts
choiceTerms t = [t]
main = parseTest term "he(llo)*|wor+ld?"
Your grammar is left-recursive, which doesn’t play nice with try, as Parsec will repeatedly backtrack. There are a few ways around this. Probably the simplest is just making the * optional in another rule:
lit :: Parser (Char, Maybe Char)
lit = do
c <- noneOf "*"
s <- optionMaybe $ char '*'
return (c, s)
Of course, you’ll probably end up wrapping things in a data type anyway, and there are a lot of ways to go about it. Here’s one, off the top of my head:
import Control.Applicative ((<$>))
data Term = Literal Char
| Sequence [Term]
| Star Term
expr :: Parser Term
expr = Sequence <$> many term
term :: Parser Term
term = do
c <- lit
s <- optionMaybe $ char '*' -- Easily extended for +, ?, etc.
return $ if isNothing s
then Literal c
else Star $ Literal c
Maybe a more experienced Haskeller will come along with a better solution.

Resources