I'm trying to combine parsers in Haskell in such a way that I could parse certain patterns up to n times. To illustrate, imagine I want to parse up to eight digits from the input. I know I can use count from Text.Parser.Combinators to parse exactly n occurrences, e.g.:
import Text.Parser.Char (digit)
import Text.Parser.Combinators (count)
eightDigits :: Parser [Char]
eightDigits = count 8 digit
This, however, fails if it doesn't find exactly 8 digits. I could also use some to parse one or more digits:
import Text.Parser.Char (digit)
import Text.Parser.Combinators (some)
someDigits :: Parser [Char]
someDigits = some digit
The problem with the above is that it may consume more digits than I want. Finally, I could use try, which lets a parser that may consume input go back to where it started on failure, so that alternatives can be combined:
import Text.Parser.Char (digit)
import Text.Parser.Combinators (count, try)
import Control.Applicative ((<|>))
twoOrThreeDigits :: Parser [Char]
twoOrThreeDigits = try (count 3 digit) <|> count 2 digit
While this could be extended to up to 8 repetitions, it's neither scalable nor elegant. So the question is: how can I combine parsers to parse a pattern anywhere between 1 and n times?
You could construct a many-like combinator with an upper limit:
upto :: Int -> Parser a -> Parser [a]
upto n p | n > 0 = (:) <$> try p <*> upto (n-1) p <|> return []
upto _ _ = return []
And for 1 up to n, a many1-like combinator:
upto1 :: Int -> Parser a -> Parser [a]
upto1 n p | n > 0 = (:) <$> p <*> upto (n-1) p
upto1 _ _ = return []
A short demo:
> map (parse (upto 8 digitChar) "") ["", "123", "1234567890"]
[Right "",Right "123",Right "12345678"]
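For anyone who wants to try both combinators end to end, here is a minimal self-contained sketch. It assumes Parsec (Text.Parsec) rather than the libraries used above, so digit replaces digitChar, and run is a hypothetical helper that only throws away the error details:

```haskell
import Text.Parsec (digit, parse, try, (<|>))
import Text.Parsec.String (Parser)

-- Zero up to n occurrences of p.
upto :: Int -> Parser a -> Parser [a]
upto n p
  | n > 0     = ((:) <$> try p <*> upto (n - 1) p) <|> return []
  | otherwise = return []

-- One up to n occurrences of p; the first occurrence is mandatory.
upto1 :: Int -> Parser a -> Parser [a]
upto1 n p
  | n > 0     = (:) <$> p <*> upto (n - 1) p
  | otherwise = return []

-- Hypothetical helper: run a parser and discard the error details.
run :: Parser a -> String -> Maybe a
run p = either (const Nothing) Just . parse p ""
```

With these definitions, run (upto1 8 digit) "12ab" should give Just "12", while run (upto1 8 digit) "" should give Nothing because the first digit is required.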
The Earley parsing library is great for writing linguistic parsers in Haskell. CFGs can be specified in an intuitive way, and there is excellent support for backtracking and ambiguity. A simple example:
{-# LANGUAGE OverloadedStrings #-}
import Text.Earley
np = rule ("John" <|> "Mary")
vp = rule ("runs" <|> "walks")
sentence = do
  subj <- np
  pred <- vp
  return $ (++) <$> subj <*> pred
sentence can be used to parse ["John", "runs"] or ["Mary", "walks"], among other inputs.
It would be nice to be able to use Earley to write parsers for FCFGs, where nonterminals are complexes of a label and a feature bundle, and feature matching can happen via unification (for example, the Earley parser in NLTK parses FCFGs). However, it is not clear how to do this using Earley, or whether it can even be done. An example of something we might want in something like BNF:
np[sg] ::= "John" | "Mary"
np[?x] ::= det n[?x]
n[pl] ::= "boys" | "girls"
det ::= "the"
vp[sg] ::= "runs" | "walks"
vp[pl] ::= "run" | "walk"
s ::= np[?x] vp[?x]
Under this FCFG, ["John", "runs"] is an s (since their number features match, as required by the s rule), and ["the", "boys", "walks"] isn't an s (since ["the", "boys"] parses to np[pl] and ["walks"] parses to vp[sg]).
One can in general rewrite an FCFG into an equivalent CFG, but this can be highly inconvenient, and result in a blowup of the grammar, especially when we have many possible features ranging over many possible values.
You're not actually doing any particularly interesting unification here, so perhaps it's enough to toss a very simple nondeterminism applicative of your own into the mix. The standard one is [], but for this case, even Maybe looks like enough. Like this:
{-# Language OverloadedStrings #-}
{-# Language TypeApplications #-}
import Control.Applicative
import Control.Monad
import Data.Foldable
import Text.Earley
data Feature = SG | PL deriving (Eq, Ord, Read, Show)
(=:=) :: (Feature, a) -> (Feature, b) -> Maybe (a, b)
(fa, a) =:= (fb, b) = (a, b) <$ guard (fa == fb)
data NP = Name String | Determined String String deriving (Eq, Ord, Read, Show)
np :: Grammar r (Prod r e String (Feature, NP))
np = rule . asum $
  [ fmap (\name -> (SG, Name name)) ("John" <|> "Mary")
  , liftA2 (\det n -> (PL, Determined det n)) "the" ("boys" <|> "girls")
  ]
vp :: Grammar r (Prod r e String (Feature, String))
vp = rule . asum $
  [ (,) SG <$> ("runs" <|> "walks")
  , (,) PL <$> ("run" <|> "walk")
  ]
s :: Grammar r (Prod r e String (Maybe (NP, String)))
s = liftA2 (liftA2 (=:=)) np vp
test :: [String] -> IO ()
test = print . allParses @() (parser s)
Try it out in ghci:
> sequence_ [test (words n ++ [v]) | n <- ["John", "the boys"], v <- ["walks", "walk"]]
([(Just (Name "John","walks"),2)],Report {position = 2, expected = [], unconsumed = []})
([(Nothing,2)],Report {position = 2, expected = [], unconsumed = []})
([(Nothing,3)],Report {position = 3, expected = [], unconsumed = []})
([(Just (Determined "the" "boys","walk"),3)],Report {position = 3, expected = [], unconsumed = []})
So, the result needs a bit of interpretation -- a successful parse of Nothing really counts as a failed parse -- but perhaps that's not so bad? Not sure. Certainly it's unfortunate that you don't get to reuse Earley's error-reporting and nondeterminism machinery. Probably to get either thing, you'd have to fork Earley.
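That interpretation step can be a one-liner. A minimal sketch, assuming only the result shape visible in the output above (a list of value and end-position pairs, paired with a report); successes is a hypothetical helper name and not part of Earley:

```haskell
-- Keep only the parses whose feature check succeeded.
-- The report component is ignored, so any report type works here.
successes :: ([(Maybe a, Int)], report) -> [(a, Int)]
successes (results, _) = [(x, end) | (Just x, end) <- results]
```

Applied to the result of allParses, an empty list from successes would then mean "no parse with matching features", even though Earley itself reported a parse.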
If you need to do real unification you could look into returning an IntBindingT t Identity instead of a Maybe, but at least until your features are themselves recursive this is probably enough and much, much simpler.
I am working through the WikiBook "Write Yourself A Scheme in 48 Hours."
The Haskell library Parsec is being used to parse basic expressions, such as numbers (as shown in the code below).
import Lib
import Text.ParserCombinators.Parsec hiding (spaces)
import System.Environment
import Control.Monad
import Data.Typeable ( typeOf )
import Debug.Trace
data LispVal = Atom String
             | List [LispVal]
             | DottedList [LispVal] LispVal
             | Number Integer
             | String String
             | Bool Bool
             deriving Show
-- ...
parseNumber :: Parser LispVal
parseNumber = do
  x <- many1 digit
  return $ Number (read x)
One of the exercises in the book asks the reader to rewrite parseNumber using >>= notation instead. However, I keep running into scary-looking type mismatch errors. Can someone please show me how to rewrite the function using >>= notation? Or at least give me a hint?
The Haskell report has a section on do notation and how to "desugar" these do blocks.
If you write a do block as:
parseNumber = do
  x <- many1 digit
  return (Number (read x))
Then this is syntactic sugar for (it desugars to):
parseNumber :: Parser LispVal
parseNumber = many1 digit >>= \x -> return (Number (read x))
or, more elegantly:
parseNumber :: Parser LispVal
parseNumber = many1 digit >>= return . Number . read
However, we do not need to work with >>= here. If we only want to apply a function to the value that the Parser produces, we can use fmap :: Functor f => (a -> b) -> f a -> f b or its infix synonym (<$>) :: Functor f => (a -> b) -> f a -> f b for this:
parseNumber :: Parser LispVal
parseNumber = Number . read <$> many1 digit
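To check that the variants really behave the same, here is a small self-contained sketch. It assumes Parsec's Text.Parsec modules and a cut-down LispVal with only the Number constructor; the names viaBind, viaBindPointFree and viaFmap are made up for the comparison:

```haskell
import Text.Parsec (digit, many1, parse)
import Text.Parsec.String (Parser)

-- Cut-down LispVal: only the constructor needed for this comparison.
data LispVal = Number Integer
  deriving (Eq, Show)

-- Explicit bind with a lambda (the direct desugaring of the do block).
viaBind :: Parser LispVal
viaBind = many1 digit >>= \x -> return (Number (read x))

-- The same bind written point-free.
viaBindPointFree :: Parser LispVal
viaBindPointFree = many1 digit >>= return . Number . read

-- Functor version: map over the parser's result.
viaFmap :: Parser LispVal
viaFmap = Number . read <$> many1 digit
```

In ghci, parse viaBind "" "42", parse viaBindPointFree "" "42" and parse viaFmap "" "42" should all give Right (Number 42).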
I'm learning some techniques for writing a very simple parser in Haskell that evaluates arithmetic expressions (addition, subtraction and other trivial operations). The library I use is Parsec. Although I have some grasp of binary operations, I find it tough to write a parser for a unary operator, for example negation (~). Here is the code snippet I use to parse multiplication:
import Text.Parsec hiding(digit)
import Data.Functor
type Parser a = Parsec String () a
digit :: Parser Char
digit = oneOf ['0'..'9']
number :: Parser Integer
number = read <$> many1 digit
applyMany :: a -> [a -> a] -> a
applyMany x [] = x
applyMany x (h:t) = applyMany (h x) t
multiplication :: Parser Integer
multiplication = do
  lhv <- number
  spaces
  char '*'
  spaces
  rhv <- number
  return $ lhv * rhv
Switching to a unary operation, my code for factorial is as follows:
fact :: Parser Integer
fact = do
  spaces
  char '!'
  rhv <- number
  spaces
  return $ factorial rhv
factorial :: Parser Integer -> Parser Integer
factorial n
  | n == 0 || n == 1 = 1
  | otherwise = n * factorial (n-1)
Once the module is loaded, the following error message appears:
Couldn't match type `Integer'
with `ParsecT String () Data.Functor.Identity.Identity Integer'
Expected type: Parser Integer
Actual type: Integer
I can't work out what is wrong with my understanding of unary operations compared to binary ones. Any help fixing this would be appreciated.
factorial doesn't define a parser; it computes a factorial. The type should just be Integer -> Integer, not Parser Integer -> Parser Integer.
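A minimal sketch of the corrected version, assuming Parsec's own digit and the Parser alias from Text.Parsec.String instead of the hand-rolled ones in the question:

```haskell
import Text.Parsec (char, digit, many1, parse, spaces)
import Text.Parsec.String (Parser)

number :: Parser Integer
number = read <$> many1 digit

-- A plain function on integers; no parser is involved here.
factorial :: Integer -> Integer
factorial n
  | n <= 1    = 1
  | otherwise = n * factorial (n - 1)

-- Parse "!<number>" and apply factorial to the parsed value.
fact :: Parser Integer
fact = do
  spaces
  _ <- char '!'
  rhv <- number
  spaces
  return (factorial rhv)
```

With this, parse fact "" "!5" should give Right 120. The parser still does all the parsing; factorial only sees the Integer that number produced.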
I wrote the following Parsec code to decode text that represents a Word8 (unsigned 8-bit integer):
decOctetP = try e <|> try d <|> try c <|> try b <|> a
  where
    a = fmap (:[]) digit
    b = do
      m <- oneOf "123456789"
      n <- digit
      return [m, n]
    c = do
      char '1'
      m <- count 2 digit
      return ('1':m)
    d = do
      char '2'
      m <- oneOf "01234"
      n <- digit
      return ['2', m, n]
    e = do
      string "25"
      m <- oneOf "012345"
      return ['2', '5', m]
I can't help but feel there is an easier way to do this. Can someone enlighten me?
Honestly, the easiest way is just to parse it as a natural number and then fail the parse by returning mzero if it's outside the bounds 0-255.
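import Control.Monad
import Text.Parsec
import Text.Parsec.Language (emptyDef)
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok
-- Token parser built from the default (empty) language definition.
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser emptyDef
natural :: Parser Integer
natural = Tok.natural lexer
-- Accept only values in the range 0-255.
number :: Parser Integer
number = do
  n <- natural
  if n < 256 then return n
             else mzero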
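If you want a Word8 back and would rather not pull in the token machinery, the same idea can be sketched with plain many1 digit. This is an assumption-laden variant, not a drop-in replacement: octet is a made-up name, and unlike the grammar in the question it accepts leading zeros such as "007" and rejects "256" outright instead of parsing a shorter prefix.

```haskell
import Control.Monad (guard)
import Data.Word (Word8)
import Text.Parsec (digit, many1, parse, try)
import Text.Parsec.String (Parser)

-- Read a run of decimal digits and accept it only if the value fits in 0..255.
-- try makes an out-of-range number fail without consuming input.
octet :: Parser Word8
octet = try $ do
  ds <- many1 digit
  let n = read ds :: Integer   -- Integer, so long digit runs cannot overflow
  guard (n <= 255)
  return (fromIntegral n)
```

For example, parse octet "" "255" should give Right 255, while parse octet "" "256" should give a Left.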
You could replace c, d and e with something like this:
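import Control.Monad (guard)
import Data.Char (ord)
import Data.List (foldl')
decOctetP = try c <|> try b <|> a
  where
    a = fmap (:[]) digit
    b = do
      m <- oneOf "123456789"
      n <- digit
      return [m, n]
    c = do
      m <- (:) <$> oneOf "123456789" <*> count 2 digit
      guard $ combine m <= 255
      return m
    -- Interpret a string of decimal digits as a number.
    combine = foldl' (\r d -> 10 * r + (ord d - ord '0')) 0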
Still not very pretty, but it's a little shorter.
Ended up with this version which I think is clean and intuitive:
-- choice does not backtrack on its own, so each alternative is wrapped in try.
decOctetP = choice (map try [e, d, c, b, a])
  where
    a = fmap (:[]) digit
    b = sequence [oneOf "123456789", digit]
    c = sequence [char '1', digit, digit]
    d = sequence [char '2', oneOf "01234", digit]
    e = sequence [char '2', char '5', oneOf "012345"]
If I wanted to parse a string with multiple parenthesized groups into a list of strings holding each group, for example
"((a b c) a b c)"
into
["((a b c) a b c)","( a b c)"]
How would I do that using parsec? The use of between looks nice but it does not seem possible to separate with a beginning and end value.
I'd use a recursive parser:
data Expr = List [Expr] | Term String
expr :: Parsec String () Expr
expr = recurse <|> terminal
where terminal parses your primitives; in this case these seem to be strings of characters, so
  where terminal = Term <$> many1 letter
and recurse is
        recurse = List <$>
                  (between `on` char) '(' ')' (expr `sepBy1` char ' ')
Now we have a nice tree of Exprs which we can gather with
collect r@(List ts) = r : concatMap collect ts
collect _ = []
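Put together, the pieces above might look like the following sketch. The imports, the type signature for collect and the Eq/Show deriving are additions that the fragments did not spell out:

```haskell
import Data.Function (on)
import Text.Parsec (Parsec, between, char, letter, many1, parse, sepBy1, (<|>))

data Expr = List [Expr] | Term String
  deriving (Eq, Show)

expr :: Parsec String () Expr
expr = recurse <|> terminal
  where
    -- A primitive: one or more letters.
    terminal = Term <$> many1 letter
    -- A parenthesized, space-separated sequence of sub-expressions.
    recurse  = List <$> (between `on` char) '(' ')' (expr `sepBy1` char ' ')

-- Gather every List node, outermost first.
collect :: Expr -> [Expr]
collect r@(List ts) = r : concatMap collect ts
collect _           = []
```

For the input "((a b c) a b c)", fmap collect (parse expr "" "((a b c) a b c)") should yield two List values: the whole expression and the inner (a b c) group. Turning those back into strings still needs a pretty-printer.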
While jozefg's solution is almost identical to what I came up with (and I completely agree with all his suggestions), there are some small differences that made me think I should post a second answer:
Due to the expected result of the initial example, it is not necessary to treat space-separated parts as individual subtrees.
Further it might be interesting to see the part that actually computes the expected results (i.e., a list of strings).
So here is my version. As already suggested by jozefg, split the task into several sub-tasks. Those are:
1. Parse a string into an algebraic data type representing some kind of tree.
2. Collect the (desired) subtrees of this tree.
3. Turn trees into strings.
Concerning 1, we first need a tree data type
import Text.Parsec
import Text.Parsec.String
import Control.Applicative ((<$>))
data Tree = Leaf String | Node [Tree]
and then a function that can parse strings into values of this type.
parseTree :: Parser Tree
parseTree = node <|> leaf
  where
    node = Node <$> between (char '(') (char ')') (many parseTree)
    leaf = Leaf <$> many1 (noneOf "()")
In my version I treat the whole string between parentheses as a single Leaf node (i.e., I do not split at whitespace).
Now we need to collect the subtrees of a tree we are interested in:
nodes :: Tree -> [Tree]
nodes (Leaf _) = []
nodes t@(Node ts) = t : concatMap nodes ts
Finally, a Show-instance for Trees allows us to turn them into strings.
instance Show Tree where
  showsPrec d (Leaf x) = showString x
  showsPrec d (Node xs) = showString "(" . showList xs . showString ")"
    where
      showList [] = id
      showList (x:xs) = shows x . showList xs
Then the original task can be solved, e.g., by:
parseGroups :: Parser [String]
parseGroups = map show . nodes <$> parseTree
> parseTest parseGroups "((a b c) a b c)"
["((a b c) a b c)","(a b c)"]