How do I pattern match with Data.Text in Haskell? - parsing

I am currently in the process of writing a parser in Haskell. I have the following code.
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.Text
newtype Parser a = Parser { runParser :: Text -> Either Text (Text, a) }
char1 :: Char -> Parser Char
char1 c = Parser $ \case
    (x:xs) | x == c -> Right (xs, x)
    _ -> Left "Unexpected character"
It fails to compile with these two errors.
test.hs:12:6: error:
• Couldn't match expected type ‘Text’ with actual type ‘[Char]’
• In the pattern: x : xs
In a case alternative: (x : xs) | x == c -> Right (xs, x)
In the second argument of ‘($)’, namely
‘\case
(x : xs) | x == c -> Right (xs, x)
_ -> Left "Unexpected character"’
|
12 | (x:xs) | x == c -> Right (xs, x)
| ^^^^
test.hs:12:24: error:
• Couldn't match type ‘[Char]’ with ‘Text’
Expected type: Either Text (Text, Char)
Actual type: Either Text ([Char], Char)
• In the expression: Right (xs, x)
In a case alternative: (x : xs) | x == c -> Right (xs, x)
In the second argument of ‘($)’, namely
‘\case
(x : xs) | x == c -> Right (xs, x)
_ -> Left "Unexpected character"’
|
12 | (x:xs) | x == c -> Right (xs, x)
| ^^^^^^^^^^^^^
I can fix the error by replacing the Text data type with String but I would prefer to use the Text data type.
Is there a way to pattern match on the Data.Text type without explicitly converting it to a String first? Perhaps there is a GHC extension that would allow me to do this?
Thanks in advance.

As a refinement to @DanielWagner's answer, you can combine view patterns and pattern synonyms to do this. You'll need a new constructor in place of :, but it might look like:
{-# LANGUAGE PatternSynonyms #-}
{-# LANGUAGE ViewPatterns #-}
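-- An explicit import list avoids clashes with Prelude names and with any
-- similarly named synonyms that Data.Text itself may export.
import Data.Text (Text, uncons)
pattern x :> xs <- (uncons -> Just (x, xs))
pattern Empty <- (uncons -> Nothing)
findInText :: (Char -> Bool) -> Text -> Maybe Char
findInText _ Empty = Nothing
findInText p (x :> xs) | p x = Just x
                       | otherwise = findInText p xs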
The idea here is that the pattern x :> xs is a synonym for the pattern uncons -> Just (x, xs), a view pattern that applies uncons to the scrutinee and matches the result against Just (x, xs), binding x and xs for the parent pattern.
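Applied to the parser from the question, the same idea might look like the sketch below. It assumes the text package is available; the names (:>) and Empty are local to this example, and the pattern signatures and COMPLETE pragma are additions that are not part of the answer above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE PatternSynonyms #-}
{-# LANGUAGE ViewPatterns #-}

import Data.Text (Text)
import qualified Data.Text as T

-- Local, hypothetical synonyms; Data.Text is imported qualified so they
-- cannot clash with anything the library exports.
pattern (:>) :: Char -> Text -> Text
pattern x :> xs <- (T.uncons -> Just (x, xs))

pattern Empty :: Text
pattern Empty <- (T.uncons -> Nothing)

-- Tells the exhaustiveness checker that these two patterns cover every Text.
{-# COMPLETE (:>), Empty #-}

newtype Parser a = Parser { runParser :: Text -> Either Text (Text, a) }

-- The question's char1, now matching on Text directly.
char1 :: Char -> Parser Char
char1 c = Parser $ \txt -> case txt of
  x :> xs | x == c -> Right (xs, x)
  _                -> Left "Unexpected character"

-- A total function over both synonyms, to show the COMPLETE pragma at work.
toChars :: Text -> String
toChars Empty     = []
toChars (x :> xs) = x : toChars xs
```

These synonyms are one-directional (they can only be used in patterns), which is all that char1 needs.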
As per the comment, there might be some concern about whether this usage ends up calling uncons more than once. With optimization entirely shut off (-O0), the generated core does have multiple uncons calls:
-- unoptimized -O0
findInText
= \ ds ds1 ->
case uncons ds1 of {
Nothing -> Nothing;
Just ipv ->
case uncons ds1 of {
Nothing -> ...
With optimization on (-O or -O2), everything gets inlined and the generated core is incredibly complicated because of the Unicode processing going on. However, if you also define:
findInText' :: (Char -> Bool) -> Text -> Maybe Char
findInText' p txt = case uncons txt of
    Nothing -> Nothing
    Just (x, xs) | p x -> Just x
                 | otherwise -> findInText' p xs
it turns out that GHC compiles findInText' to:
findInText' = findInText
so it looks like in this case at least, GHC doesn't do any extra work as a result of the view patterns.

You can match on a call to uncons.
case uncons text of
    Just (x, xs) -> ...
    Nothing -> ...
View patterns let you do this within the pattern instead of within the scrutinee, but require you to say uncons once for each pattern.
case text of
    (uncons -> Just (x, xs)) -> ...
    (uncons -> Nothing) -> ...
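For the char1 parser in the question, matching on a call to uncons could be sketched as follows. This assumes the text package and imports it qualified as T; no extension beyond OverloadedStrings (for the error message literal) is needed.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as T

newtype Parser a = Parser { runParser :: Text -> Either Text (Text, a) }

-- Scrutinise the result of T.uncons instead of the Text itself.
char1 :: Char -> Parser Char
char1 c = Parser $ \txt -> case T.uncons txt of
  Just (x, xs) | x == c -> Right (xs, x)
  _                     -> Left "Unexpected character"
```

The wildcard alternative covers both the empty input and a non-matching first character.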

Related

Haskell: Graham Hutton Book Parsing (Ch-8): What does `parse (f v) out` do, and how does it do it?

My question is about Graham Hutton's book Programming in Haskell 1st Ed.
There is a parser created in section 8.4, and I am assuming anyone answering has the book or can see the link to slide 8 in the link above.
A basic parser called item is described as:
type Parser a = String -> [(a, String)]
item :: Parser Char
item = \inp -> case inp of
    [] -> []
    (x:xs) -> [(x,xs)]
which is used with do to define another parser p (the do parser)
p :: Parser (Char, Char)
p = do x <- item
       item
       y <- item
       return (x,y)
the relevant bind definition is:
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
p >>= f = \inp -> case parse p inp of
    [] -> []
    [(v,out)] -> parse (f v) out
return is defined as:
return :: a -> Parser a
return v = \inp -> [(v,inp)]
parse is defined as:
parse :: Parser a -> String -> [(a,String)]
parse p inp = p inp
The program (the do parser) takes a string, selects the 1st and 3rd characters, and returns them in a tuple paired with the remainder of the string, all inside a list; e.g., "abcdef" produces [(('a','c'),"def")].
I want to know how the
(f v) out
in
[(v,out)] -> parse (f v) out
returns a parser which is then applied to out.
f in the do parser is item and item taking a character 'c' returns [('c',[])]?
How can that be a parser and how can it take out as an argument?
Perhaps I am just not understanding what (f v) does.
Also how does the do parser 'drop' the returned values each time to operate on the rest of the input string when item is called again?
What is the object that works its way through the do parser, and how is it altered at each step, and by what means is it altered?
f v produces a Parser b because f is a function of type a -> Parser b and v is a value of type a. So then you're calling parse with this Parser b and the string out as arguments.
F in the 'do' parser is item
No, it's not. Let's consider a simplified (albeit now somewhat pointless) version of your parser:
p = do x <- item
return x
This will desugar to:
p = item >>= \x -> return x
So the right operand of >>=, i.e. f, is \x -> return x, not item.
Also how does the 'do' parser 'drop' the returned values each time to operate on the rest of the input string when item is called again? What is the object that works its way through the 'do' parser and how is it altered at each step and by what means is it altered?
When you apply a parser it returns a tuple containing the parsed value and a string representing the rest of the input. If you look at item for example, the second element of the tuple will be xs which is the tail of the input string (i.e. a string containing all characters of the input string except the first). This second part of the tuple will be what's fed as the new input to subsequent parsers (as per [(v,out)] -> parse (f v) out), so that way each successive parser will take as input the string that the previous parser produced as the second part of its output tuple (which will be a suffix of its input).
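The threading of the leftover string can be tried out directly with a small sketch. Because Parser is only a type synonym here, it cannot be given a Monad instance, so the sketch uses the hypothetical names bind and unit in place of >>= and return; it also returns [] for any result list that is not a singleton, where the book's definition leaves that case unhandled.

```haskell
type Parser a = String -> [(a, String)]

item :: Parser Char
item inp = case inp of
  []     -> []
  (x:xs) -> [(x, xs)]

parse :: Parser a -> String -> [(a, String)]
parse p inp = p inp

-- Stand-in for (>>=): run p, then run (f v) on what p left over.
bind :: Parser a -> (a -> Parser b) -> Parser b
bind p f = \inp -> case parse p inp of
  [(v, out)] -> parse (f v) out
  _          -> []

-- Stand-in for return: succeed with v without touching the input.
unit :: a -> Parser a
unit v = \inp -> [(v, inp)]

-- Two items in a row: the second one sees only the first one's leftover.
twoItems :: Parser (Char, Char)
twoItems =
  item `bind` \x ->
  item `bind` \y ->
  unit (x, y)
```

Here parse item "abc" gives [('a',"bc")], and it is that "bc" which bind hands to the second item.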
In response to your comments:
When you write "p = item >>= \x -> return x", is that the equivalent of just the first line "p = do x <- item"?
No, it's equivalent to the entire do-block (i.e. do {x <- item; return x}). You can't translate do-blocks line-by-line like that. do { x <- foo; rest } is equivalent to foo >>= \x -> do {rest}, so you'll always have the rest of the do-block as part of the right operand of >>=.
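To see that a do-block and its desugared form are the same parser, one can compare them directly. The sketch below assumes a newtype wrapper (so that a Monad instance is possible) and the names viaDo and viaBind, none of which come from the book.

```haskell
newtype Parser a = P { parse :: String -> [(a, String)] }

instance Functor Parser where
  fmap g p = P $ \inp -> case parse p inp of
    [(v, out)] -> [(g v, out)]
    _          -> []

instance Applicative Parser where
  pure v = P $ \inp -> [(v, inp)]
  pg <*> px = pg >>= \g -> fmap g px

instance Monad Parser where
  m >>= f = P $ \inp -> case parse m inp of
    [(v, out)] -> parse (f v) out
    _          -> []

item :: Parser Char
item = P $ \inp -> case inp of
  []     -> []
  (x:xs) -> [(x, xs)]

-- The do-block ...
viaDo :: Parser (Char, Char)
viaDo = do
  x <- item
  y <- item
  return (x, y)

-- ... and what it desugars to: the rest of the block sits inside each lambda.
viaBind :: Parser (Char, Char)
viaBind = item >>= \x -> item >>= \y -> return (x, y)
```

Both parsers give the same result on any input, for example on "ab" and on the too-short "a".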
but not how that reduces to simply making 'out' available as the input for the next line. What is parse doing if the next line of the 'do' parser is the item parser?
Let's walk through an example where we invoke item twice (this is like your p, but without the middle item). In the below I'll use === to denote that the expressions above and below the === are equivalent.
do x <- item
   y <- item
   return (x, y)
=== -- Desugaring do
item >>= \x -> item >>= \y -> return (x, y)
=== -- Inserting the definition of >>= for outer >>=
\inp -> case parse item inp of
    [] -> []
    [(v,out)] -> parse (item >>= \y -> return (v, y)) out
Now let's apply this to the input "ab":
case parse item "ab" of
    [] -> []
    [(v,out)] -> parse (item >>= \y -> return (v, y)) out
=== -- Inserting the definition of parse
case item "ab" of
    [] -> []
    [(v,out)] -> parse (item >>= \y -> return (v, y)) out
=== -- Inserting the definition of item
case [('a', "b")] of
    [] -> []
    [(v,out)] -> parse (item >>= \y -> return (v, y)) out
=== -- The second alternative matches, with v = 'a' and out = "b"
parse (item >>= \y -> return ('a', y)) "b"
Now we can expand the second >>= the same way we did the first, and eventually end up with [(('a','b'),"")].
The relevant advice is, Don't panic (meaning, don't rush it; or, take it slow), and, Follow the types.
First of all, Parsers
type Parser a = String -> [(a,String)]
are functions from String to lists of pairings of result values of type a and the leftover Strings (because type defines type synonyms, not new types like data or newtype do).
That leftovers string will be used as input for the next parsing step. That's the main thing about it here.
You are asking, in
p >>= f = \inp -> case (parse p inp) of
    [] -> []
    [(v,out)] -> parse (f v) out
how the (f v) in [(v,out)] -> parse (f v) out returns a parser which is then applied to out?
The answer is, f's type says that it does so:
(>>=) :: Parser a -> (a -> Parser b) -> Parser b -- or, the equivalent
(>>=) :: Parser a -> (a -> Parser b) -> (String -> [(b,String)])
-- p f inp
We have f :: a -> Parser b, so that's just what it does: applied to a value of type a it returns a value of type Parser b. Or equivalently,
f :: a -> (String -> [(b,String)]) -- so that
f (v :: a) :: String -> [(b,String)] -- and,
f (v :: a) (out :: String) :: [(b,String)]
So whatever is the value that parse p inp produces, it must be what f is waiting for to proceed. The types must "fit":
p :: Parser a -- m a
f :: a -> Parser b -- a -> m b
f <$> p :: Parser ( Parser b ) -- m ( m b )
f =<< p :: Parser b -- m b
or, equivalently,
p :: String -> [(a, String)]
-- inp v out
f :: a -> String -> [(b, String)]
-- v out
p >>= f :: String -> [(b, String)] -- a combined Parser
-- inp v2 out2
So this also answers your second question,
How can that be a parser and how can it take out as an argument?
The real question is, what kind of f is it, that does such a thing? Where does it come from? And that's your fourth question.
And the answer is, your example in do-notation,
p :: Parser (Char, Char)
p = do x <- item
       _ <- item
       y <- item
       return (x,y)
by Monad laws is equivalent to the nested chain
p = do { x <- item
; do { _ <- item
; do { y <- item
; return (x,y) }}}
which is a syntactic sugar for the nested chain of Parser bind applications,
p :: Parser (Char, Char) -- ~ String -> [((Char,Char), String)]
p = item >>= (\ x -> -- item :: Parser Char ~ String -> [(Char,String)]
item >>= (\ _ -> -- x :: Char
item >>= (\ y -> -- y :: Char
return (x,y) )))
and it is because the functions are nested that the final return has access to both y and x there; and it is precisely the Parser bind that arranges for the output leftovers string to be used as input to the next parsing step:
p = item >>= f -- :: String -> [((Char,Char), String)]
where
{ f x = item >>= f2
where { f2 _ = item >>= f3
where { f3 y = return (x,y) }}}
i.e. (under the assumption that inp is a string of length three or longer),
parse p inp -- assume that `inp`'s
= (item >>= f) inp -- length is at least 3 NB.
=
let [(v, left)] = item inp -- by the def of >>=
in
(f v) left
=
let [(v, left)] = item inp
in
let x = v -- inline the definition of `f`
in (item >>= f2) left
=
let [(v, left)] = item inp
in let x = v
in let [(v2, left2)] = item left -- by the def of >>=, again
in (f2 v2) left2
=
..........
=
let [(x,left1)] = item inp -- x <- item
[(_,left2)] = item left1 -- _ <- item
[(y,left3)] = item left2 -- y <- item
in
[((x,y), left3)]
=
let (x:left1) = inp -- inline the definition
(_:left2) = left1 -- of `item`
(y:left3) = left2
in
[((x,y), left3)]
=
let (x:_:y:left3) = inp
in
[((x,y), left3)]
after a few simplifications.
And this answers your third question.
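The derivation above assumes the input is long enough for all three item calls. A sketch that also covers shorter inputs is given below; the newtype wrapper and the helper names direct and agree are assumptions made for this example, and the Monad instance returns [] for any result list that is not a singleton.

```haskell
newtype Parser a = P { parse :: String -> [(a, String)] }

instance Functor Parser where
  fmap g q = P $ \inp -> case parse q inp of
    [(v, out)] -> [(g v, out)]
    _          -> []

instance Applicative Parser where
  pure v = P $ \inp -> [(v, inp)]
  pg <*> px = pg >>= \g -> fmap g px

instance Monad Parser where
  m >>= f = P $ \inp -> case parse m inp of
    [(v, out)] -> parse (f v) out
    _          -> []

item :: Parser Char
item = P $ \inp -> case inp of
  []     -> []
  (x:xs) -> [(x, xs)]

p :: Parser (Char, Char)
p = do
  x <- item
  _ <- item
  y <- item
  return (x, y)

-- Hypothetical helper: the end result of the derivation, made total by
-- returning [] when fewer than three characters are available.
direct :: String -> [((Char, Char), String)]
direct inp = case inp of
  (x:_:y:rest) -> [((x, y), rest)]
  _            -> []

-- Do the do-parser and the direct pattern match agree on this input?
agree :: String -> Bool
agree s = parse p s == direct s
```

For inputs shorter than three characters both sides give [], which is the failure case the let-based derivation leaves out.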
I am having similar problems reading the syntax, because it's not what we are used to.
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
p >>= f = \inp -> case parse p inp of
    [] -> []
    [(v,out)] -> parse (f v) out
so for the question:
I want to know how the (f v) out in [(v,out)] -> parse (f v) out returns a parser which is then applied to out.
It does because that is the signature of the 2nd arg (the f): (>>=) :: Parser a -> (a -> Parser b) -> Parser b. f takes an a and produces a Parser b. A Parser b takes a String, which here is out, hence (f v) out.
But the output of this should not be mixed up with the output of the function we are writing: >>=
We are outputting a parser: (>>=) :: Parser a -> (a -> Parser b) -> Parser b.
The Parser we are outputting has the job of wrapping and chaining the first 2 args.
A parser is a function that takes 1 arg. It is constructed right after the first =, i.e. by returning an (anonymous) function: p >>= f = \inp -> ... so inp refers to the input string of the Parser we are building.
So what is left is to define what that constructed function should do. NOTE: we are not implementing any of the input parsers, just chaining them together. The output Parser function should:
apply the input parser (p) to its input (inp): p >>= f = \inp -> case parse p inp of
take the output of that parse, [(v, out)], where v is the result and out is what remains of the input
apply the input function (f is (a -> Parser b)) to the parsed result (v)
(f v) produces a Parser b (a function that takes 1 arg)
so apply that output parser to the remainder of the input after the first parser (out)
For me the understanding lies in the use of destructuring and the realization that we are constructing a function that glues together the execution of other functions, simply by considering their interfaces.
Hope that helps ... it helped me to write it :-)

applicative functor: <*> and partial application, how it works

I am reading the book Programming in Haskell by Graham Hutton and I have some problem to understand how <*> and partial application can be used to parse a string.
I know that pure (+1) <*> Just 2
produces Just 3
because pure (+1) produces Just (+1) and then Just (+1) <*> Just 2
produces Just (2+1) and then Just 3
But in more complex case like this:
-- Define a new type containing a parser function
newtype Parser a = P (String -> [(a,String)])
-- This function apply the parser p on inp
parse :: Parser a -> String -> [(a,String)]
parse (P p) inp = p inp
-- A parser which return a tuple with the first char and the remaining string
item :: Parser Char
item = P (\inp -> case inp of
    [] -> []
    (x:xs) -> [(x,xs)])
-- A parser is a functor
instance Functor Parser where
    fmap g p = P (\inp -> case parse p inp of
        [] -> []
        [(v, out)] -> [(g v, out)])
-- A parser is also an applicative functor
instance Applicative Parser where
    pure v = P (\inp -> [(v, inp)])
    pg <*> px = P (\inp -> case parse pg inp of
        [] -> []
        [(g, out)] -> parse (fmap g px) out)
So, when I do:
parse (pure (\x y -> (x,y)) <*> item <*> item) "abc"
The answer is:
[(('a','b'),"c")]
But I don't understand what exactly happens.
First:
pure (\x y -> (x,y)) => P (\inp1 -> [(\x y -> (x,y), inp1)])
I have now a parser with one parameter.
Then:
P (\inp1 -> [(\x y -> (x,y), inp1)]) <*> item
=> P (\inp2 -> case parse (\inp1 -> [(\x y -> (x,y), inp1)]) inp2 of ???
I really don't understand what happens here.
Can someone explain, step by step, what's happens now until the end please.
Let's evaluate pure (\x y -> (x,y)) <*> item. The second application of <*> will be easy once we've seen the first:
P (\inp1 -> [(\x y -> (x,y), inp1)]) <*> item
We replace the <*> expression with its definition, substituting the expression's operands for the definition's parameters.
P (\inp2 -> case parse (P (\inp1 -> [(\x y -> (x,y), inp1)])) inp2 of
    [] -> []
    [(g, out)] -> parse (fmap g item) out)
Then we do the same for the fmap expression.
P (\inp2 -> case parse (P (\inp1 -> [(\x y -> (x,y), inp1)])) inp2 of
    [] -> []
    [(g, out)] -> parse (P (\inp -> case parse item inp of
        [] -> []
        [(v, out)] -> [(g v, out)])) out)
Now we can reduce the first two parse expressions (we'll leave parse item out for later since it's basically primitive).
P (\inp2 -> case [(\x y -> (x,y), inp2)] of
    [] -> []
    [(g, out)] -> case parse item out of
        [] -> []
        [(v, out)] -> [(g v, out)])
So much for pure (\x y -> (x,y)) <*> item. Since you created the first parser by lifting a binary function of type a -> b -> (a, b), the single application to a parser of type Parser Char represents a parser of type Parser (b -> (Char, b)).
We can run this parser through the parse function with input "abc". Since the parser has type Parser (b -> (Char, b)), this should reduce to a value of type [(b -> (Char, b), String)]. Let's evaluate that expression now.
parse (P (\inp2 -> case [(\x y -> (x,y), inp2)] of
    [] -> []
    [(g, out)] -> case parse item out of
        [] -> []
        [(v, out)] -> [(g v, out)])) "abc"
By the definition of parse this reduces to
case [(\x y -> (x,y), "abc")] of
    [] -> []
    [(g, out)] -> case parse item out of
        [] -> []
        [(v, out)] -> [(g v, out)]
Clearly, the patterns don't match in the first case, but they do in the second case. We substitute the matches for the patterns in the second expression.
case parse item "abc" of
    [] -> []
    [(v, out)] -> [((\x y -> (x,y)) v, out)]
Now we finally evaluate that last parse expression. parse item "abc" clearly reduces to [('a', "bc")] from the definition of item.
case [('a', "bc")] of
    [] -> []
    [(v, out)] -> [((\x y -> (x,y)) v, out)]
Again, the second pattern matches and we do substitution
[((\x y -> (x,y)) 'a', "bc")]
which reduces to
[(\y -> ('a', y), "bc")] :: [(b -> (Char, b), String)] -- the expected type
If you apply this same process to evaluate a second <*> application, and put the result in the parse (result) "abc" expression, you'll see that the expression indeed reduces to [(('a','b'),"c")].
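The intermediate and final values of this reduction can be checked with a runnable sketch. The definitions follow the question's code (with a wildcard alternative added so the case expressions are total); the names partial, probe and full are made up for this example. Since a function cannot be printed or compared, the intermediate result is applied to a probe character.

```haskell
newtype Parser a = P (String -> [(a, String)])

parse :: Parser a -> String -> [(a, String)]
parse (P p) inp = p inp

item :: Parser Char
item = P $ \inp -> case inp of
  []     -> []
  (x:xs) -> [(x, xs)]

instance Functor Parser where
  fmap g p = P $ \inp -> case parse p inp of
    [(v, out)] -> [(g v, out)]
    _          -> []

instance Applicative Parser where
  pure v = P $ \inp -> [(v, inp)]
  pg <*> px = P $ \inp -> case parse pg inp of
    [(g, out)] -> parse (fmap g px) out
    _          -> []

-- After one <*>: the parsed value is a function awaiting the second component.
partial :: Parser (b -> (Char, b))
partial = pure (\x y -> (x, y)) <*> item

-- Apply that function to '?' so the intermediate result becomes observable.
probe :: [((Char, Char), String)]
probe = [(g '?', out) | (g, out) <- parse partial "abc"]

-- After the second <*>: the complete pair plus the leftover input.
full :: [((Char, Char), String)]
full = parse (partial <*> item) "abc"
```

Here probe shows 'a' already captured with "bc" left over, and full is the final [(('a','b'),"c")].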
What helped me a lot while learning these things was to focus on the types of the values and functions involved. It's all about applying a function to a value (or in your case applying a function to two values).
($) :: (a -> b) -> a -> b
fmap :: Functor f => (a -> b) -> f a -> f b
(<*>) :: Applicative f => f (a -> b) -> f a -> f b
So with a Functor we apply a function on a value inside a "container/context" (i.e. Maybe, List, . .), and with an Applicative the function we want to apply is itself inside a "container/context".
The function you want to apply is (,), and the values you want to apply the function to are inside a container/context (in your case Parser a).
Using pure we lift the function (,) into the same "context/container" our values are in (note that we can use pure to lift the function into any Applicative: Maybe, List, Parser, . . ):
(,) :: a -> b -> (a, b)
pure (,) :: Parser (a -> b -> (a, b))
Using <*> we can apply the function (,) that is now inside the Parser context to a value that is also inside the Parser context. One difference to the example you provided with +1 is that (,) has two arguments. Therefore we have to use <*> twice:
(<*>) :: Applicative f => f (a -> b) -> f a -> f b
x :: Parser Int
y :: Parser Char
let p1 = pure (,) <*> x :: Parser (b -> (Int, b))
let v1 = (,) 1 :: b -> (Int, b)
let p2 = p1 <*> y :: Parser (Int, Char)
let v2 = v1 'a' :: (Int, Char)
We have now created a new parser (p2) that we can use just like any other parser!
. . and then there is more!
Have a look at this convenience function:
(<$>) :: Functor f => (a -> b) -> f a -> f b
<$> is just fmap but you can use it to write the combinators more beautifully:
data User = User {name :: String, year :: Int}
nameParser :: Parser String
yearParser :: Parser Int
let userParser = User <$> nameParser <*> yearParser -- :: Parser User
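The same shape works in any Applicative, so it can be tried without writing nameParser and yearParser first. The sketch below uses Maybe as the context; mkUser is a hypothetical helper and the sample values in the usage note are purely illustrative.

```haskell
data User = User { name :: String, year :: Int }
  deriving (Eq, Show)

-- Same combinator shape as userParser, with Maybe in place of Parser.
mkUser :: Maybe String -> Maybe Int -> Maybe User
mkUser mName mYear = User <$> mName <*> mYear
```

mkUser (Just "Ada") (Just 1815) yields Just a User, while a Nothing in either argument makes the whole result Nothing, just as a failing sub-parser makes the combined parser fail.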
Ok, this answer got longer than I expected! Well, I hope it helps. Maybe also have a look at Typeclassopedia which I found invaluable while learning Haskell which is an endless beautiful process . . :)
TL;DR: When you said you "[now] have a parser with one parameter" inp1, you got confused: inp1 is an input string to a parser, but the function (\x y -> (x,y)) - which is just (,) - is being applied to the output value(s), produced by parsing the input string. The sequence of values produced by your interim parsers is:
-- by (pure (,)):
(,) -- a function expecting two arguments
-- by the first <*> combination with (item):
(,) x -- a partially applied (,) function expecting one more argument
-- by the final <*> combination with another (item):
((,) x) y == (x,y) -- the final result, a pair of `Char`s taken off the
-- input string, first (`x`) by an `item`,
-- and the second (`y`) by another `item` parser
Working by equational reasoning can oftentimes be easier:
-- pseudocode definition of `fmap`:
parse (fmap g p) inp = case (parse p inp) of    -- g :: a -> b , p :: Parser a
    [] -> []                                     -- fmap g p :: Parser b
    [(v, out)] -> [(g v, out)]                   -- v :: a , g v :: b
(apparently this assumes any parser can only produce 0 or 1 results, as the case of a longer list isn't handled at all -- which is usually frowned upon, and with good reason);
-- pseudocode definition of `pure`:
parse (pure v) inp = [(v, inp)]                  -- v :: a , pure v :: Parser a
(parsing with pure v produces the v without consuming the input);
-- pseudocode definition of `item`:
parse (item) inp = case inp of                   -- inp :: [Char]
    [] -> []
    (x:xs) -> [(x,xs)]                           -- item :: Parser Char
(parsing with item means taking one Char off the head of the input String, if possible); and,
-- pseudocode definition of `(<*>)`:
parse (pg <*> px) inp = case (parse pg inp) of   -- px :: Parser a
    [] -> []
    [(g, out)] -> parse (fmap g px) out          -- g :: a -> b
(<*> combines two parsers with types of results that fit, producing a new, combined parser which uses the first parser to parse the input, then uses the second parser to parse the leftover string, combining the two results to produce the result of the new, combined parser);
Now, <*> associates to the left, so what you ask about is
parse ( pure (\x y -> (x,y)) <*> item <*> item ) "abc"
= parse ( (pure (,) <*> item1) <*> item2 ) "abc" -- item_i = item
the rightmost <*> is the topmost, so we expand it first, leaving the nested expression as is for now,
= case (parse (pure (,) <*> item1) "abc") of    -- by definition of <*>
    [] -> []
    [(g2, out2)] -> parse (fmap g2 item2) out2
                  = case (parse item out2) of   -- by definition of fmap
                      [] -> []
                      [(v, out)] -> [(g2 v, out)]
                  = case out2 of                -- by definition of item
                      [] -> []
                      (y:ys) -> [(g2 y, ys)]
Similarly, the nested expression is simplified as
parse (pure (,) <*> item1) "abc"
= case (parse (pure (\x y -> (x,y))) "abc") of   -- by definition of <*>
    [] -> []
    [(g1, out1)] -> parse (fmap g1 item1) out1
                  = case (parse item out1) of ....
                  = case out1 of
                      [] -> []
                      (x:xs) -> [(g1 x, xs)]
= case [((,), "abc")] of                          -- by definition of pure
    [(g1, out1)] -> case out1 of
                      [] -> []
                      (x:xs) -> [(g1 x, xs)]
= let { out1 = "abc"
      ; g1 = (,)
      ; (x:xs) = out1
      }
  in [(g1 x, xs)]
= [( (,) 'a', "bc")]
and thus we get
= case [( (,) 'a', "bc")] of
    [(g2, out2)] -> case out2 of
                      [] -> []
                      (y:ys) -> [(g2 y, ys)]
I think you can see now why the result will be [( ((,) 'a') 'b', "c")].
First, I want to emphasize one thing. I found that the crux of understanding lies in noticing the separation between the Parser itself and running the parser with parse.
In running the parser you give the Parser and input string to parse and it will give you the list of possible parses. I think that's probably easy to understand.
You will pass parse a Parser, which may be built using glue, <*>. Try to understand that when you pass parse the Parser, a, or the Parser, f <*> a <*> b, you will be passing it the same type of thing, i.e. something equivalent to (String -> [(a,String)]). I think this is probably easy to understand as well, but still it takes a while to "click".
That said, I'll talk a little about the nature of this applicative glue, <*>. An applicative, F a is a computation that yields data of type a. You can think of a term such as
... f <*> g <*> h
as a series of computations which return some data, say a then b then c. In the context of Parser, the computation involves f looking for a in the current string, then passing the remainder of the string to g, etc. If any of the computations/parses fails, then so does the whole term.
It's interesting to note that any applicative can be written with a pure function at the beginning to collect all those emitted values, so we can generally write,
pure3ArgFunction <$> f <*> g <*> h
I personally find the mental model of emitting and collecting helpful.
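Both points, that the leading pure function only collects results and that one failing parse fails the whole term, can be observed with a small sketch. It reuses the Parser definitions from the question (with a wildcard alternative added); pairPure and pairFmap are names invented here.

```haskell
newtype Parser a = P (String -> [(a, String)])

parse :: Parser a -> String -> [(a, String)]
parse (P p) inp = p inp

item :: Parser Char
item = P $ \inp -> case inp of
  []     -> []
  (x:xs) -> [(x, xs)]

instance Functor Parser where
  fmap g p = P $ \inp -> case parse p inp of
    [(v, out)] -> [(g v, out)]
    _          -> []

instance Applicative Parser where
  pure v = P $ \inp -> [(v, inp)]
  pg <*> px = P $ \inp -> case parse pg inp of
    [(g, out)] -> parse (fmap g px) out
    _          -> []

-- Collector lifted with pure, then fed by two item parsers.
pairPure :: Parser (Char, Char)
pairPure = pure (,) <*> item <*> item

-- The same parser with the collector applied via <$>.
pairFmap :: Parser (Char, Char)
pairFmap = (,) <$> item <*> item
```

On "abc" both give [(('a','b'),"c")]; on "a" the second item has nothing to consume, so both give [].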
So, with that long preamble over, onto the actual explanation. What does
parse (pure (\x y -> (x,y)) <*> item <*> item) "abc"
do? Well, parse (p :: Parser (Char,Char)) "abc" applies the parser (which I renamed p) to "abc", yielding [(('a','b'),"c")]. This is a successful parse with the return value of ('a','b') and the leftover string, "c".
Ok, that's not the question though. Why does the parser work this way? Starting with:
.. <*> item <*> item
item takes the next character from the string, yields it as a result and passes the unconsumed input. The next item does the same. The beginning can be rewritten as:
fmap (\x y -> (x,y)) item <*> item
or
(\x y -> (x,y)) <$> item <*> item
which is my way of showing that the pure function does not do anything to the input string, it just collects the results. When looked at in this light I think the parser should be easy to understand. Very easy. Too easy. I mean that in all seriousness. It's not that the concept is so hard, but our normal frame of looking at programming is just too foreign for it to make much sense at first.
Some people here did a great job on "step-by-step" guides that trace how the computation produces the final result, so I won't replicate them here.
What I think is that you really need to understand Functor and Applicative Functor deeply. Once you understand these topics, the others will be as easy as one, two, three (I mean most of them ^^).
So: what is Functor, Applicative Functor and their applications in your problem?
Best tutorials on these:
Chapter 11 of "Learn You a Haskell for a great good": http://learnyouahaskell.com/functors-applicative-functors-and-monoids.
More visual "Functors, Applicatives, And Monads in Pictures": http://adit.io/posts/2013-04-17-functors,_applicatives,_and_monads_in_pictures.html.
First, when you think about Functor, Applicative Functor, think about "values in contexts": the values are important, and the computational contexts are important too. You have to deal with both of them.
The definitions of the types:
-- Define a new type containing a parser function
newtype Parser a = P (String -> [(a,String)])
-- This function apply the parser p on inp
parse :: Parser a -> String -> [(a,String)]
parse (P p) inp = p inp
The value here is the value of type a, the first element of the tuple in the list.
The context here is the function, or the eventual value. You have to supply an input to get the final value.
Parser is a function wrapped in a P data constructor. So if you got a value b :: Parser Char, and you want to apply it to some input, you have to unwrap the inner function in b. That's why we have the function parse, it unwraps the inner function and applies it to the input value.
And, if you want to create a Parser value, you have to wrap a function in the P data constructor.
Second, Functor: something that can be "mapped" over, specified by the function fmap:
fmap :: (a -> b) -> f a -> f b
I often call the function g :: (a -> b) a normal function because, as you can see, no context wraps it. So, to be able to apply g to f a, we have to extract the a from f a somehow, so that g can be applied to a alone. That "somehow" depends on the specific Functor and is the context you are working in:
instance Functor Parser where
    fmap g p = P (\inp -> case parse p inp of
        [] -> []
        [(v, out)] -> [(g v, out)])
g is the function of type (a -> b), p is of type f a.
To unwrap p, that is, to get the value out of the context, we have to pass some input value in: parse p inp. The value is then the 1st element of the tuple. Apply g to that value to get a value of type b.
The result of fmap is of type f b, so we have to wrap the whole result in the same context; that is why we have fmap g p = P (\inp -> ...).
At this point you might wonder whether fmap could return [(g v, inp)] instead of [(g v, out)]. That would type-check, but it would not be a proper Functor: the implementation must obey the Functor laws, and with inp in place of out, fmap id item would leave the input unconsumed, so fmap id would no longer equal id. The laws are what guide the derivation of these implementations (http://mvanier.livejournal.com/4586.html). The implementation must satisfy at least 2 Functor laws:
fmap id = id.
fmap (f . g) = fmap f . fmap g.
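The first law is easy to test against a candidate implementation. The sketch below compares the usual definition with a variant that returns the original input instead of the leftover; fmapOut and fmapInp are hypothetical names, written as plain functions so that both can coexist in one module.

```haskell
newtype Parser a = P { parse :: String -> [(a, String)] }

item :: Parser Char
item = P $ \inp -> case inp of
  []     -> []
  (x:xs) -> [(x, xs)]

-- The usual definition: keep the leftover input `out`.
fmapOut :: (a -> b) -> Parser a -> Parser b
fmapOut g p = P $ \inp -> [(g v, out) | (v, out) <- parse p inp]

-- The variant under discussion: put the original input `inp` back.
fmapInp :: (a -> b) -> Parser a -> Parser b
fmapInp g p = P $ \inp -> [(g v, inp) | (v, _) <- parse p inp]
```

On "abc", fmapOut id item behaves exactly like item, whereas fmapInp id item returns [('a',"abc")], so the variant violates fmap id = id.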
fmap is often written as infix operator: <$>. When you see this, look at the 2nd operand to determine which Functor you are working with.
Third, Applicative Functor: you apply a wrapped function to a wrapped value to get another wrapped value:
(<*>) :: f (a -> b) -> f a -> f b
Unwrap the inner function.
Unwrap 1st value.
Apply the function and wrap the result.
In your case:
instance Applicative Parser where
    pure v = P (\inp -> [(v, inp)])
    pg <*> px = P (\inp -> case parse pg inp of
        [] -> []
        [(g, out)] -> parse (fmap g px) out)
pg is of type f (a -> b), px is of type f a.
Unwrap g from pg by parse pg inp, g is the 1st of the tuple.
Unwrap px and apply g to the value by using fmap g px. Note that the resulting parser is applied to out, not to inp; for the second <*> in your example, out is "bc", not "abc".
Wrap the whole result: P (\inp -> ...).
Like Functor, an implementation of Applicative Functor must obey Applicative Functor laws (in the tutorials above).
Fourth, apply to your problem:
parse (pure (\x y -> (x,y)) <*> item <*> item) "abc"
| f1 | |f2| |f3|
Unwrap f1 <*> f2 by passing "abc" to it:
Unwrap f1 by passing "abc" to it, we get [(g, "abc")].
Then fmap g on f2 and passing out="abc" to it:
Unwrap f2 get [('a', "bc")].
Apply g on 'a' get a result: [(\y -> ('a', y), "bc")].
Then fmap 1st element of the result on f3 and passing out="bc" to it:
Unwrap f3 get [('b', "c")].
Apply the function on 'b' get final result: [(('a', 'b'), "c")].
In conclusion:
Take some time for the ideas to sink in. In particular, note how the laws drive the implementations.
Next time, design your data structures so that they are easier to understand.
Haskell is one of my favorite languages and I think it will be yours soon, so be patient; there is a learning curve, and then off you go!
Happy Haskell hacking!
Hmm I am not experienced with Haskell but my attempt on generating Functor and Applicative instances of the Parser type would be as follows;
-- Define a new type containing a parser function
newtype Parser a = P (String -> [(a,String)])
-- This function apply the parser p on inp
parse :: Parser a -> String -> [(a,String)]
parse (P p) inp = p inp
-- A parser which return a tuple with the first char and the remaining string
item :: Parser Char
item = P (\inp -> case inp of
    [] -> []
    (x:xs) -> [(x,xs)])
-- A parser is a functor
instance Functor Parser where
    fmap g (P f) = P (\str -> map (\(x,y) -> (g x, y)) $ f str)
-- A parser is also an applicative functor
instance Applicative Parser where
    pure v = P (\str -> [(v, str)])
    -- Run the function parser g first, then run the argument parser f on what g left over.
    (P g) <*> (P f) = P (\str -> g str >>= \(g',s1) -> f s1 >>= \(v,s2) -> [(g' v,s2)])
(Many thanks to @Will Ness for the help with <*>.)
Accordingly...
*Main> parse (P (\s -> [((+3), s)]) <*> pure 2) "test"
[(5,"test")]
*Main> parse (P (\s -> [((,), s ++ " altered")]) <*> pure 2 <*> pure 4) "test"
[((2,4),"test altered")]
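As a cross-check against the parser from the question, here is a sketch of a list-based instance in which the function parser runs first and the argument parser runs on its leftover input. It is written with list comprehensions, which is an assumption of this example rather than the code above, and it allows a parser to return any number of results.

```haskell
newtype Parser a = P (String -> [(a, String)])

parse :: Parser a -> String -> [(a, String)]
parse (P p) inp = p inp

item :: Parser Char
item = P $ \inp -> case inp of
  []     -> []
  (x:xs) -> [(x, xs)]

instance Functor Parser where
  fmap g (P f) = P $ \str -> [(g x, rest) | (x, rest) <- f str]

instance Applicative Parser where
  pure v = P $ \str -> [(v, str)]
  -- g consumes from str; f then consumes from g's leftover s1.
  P g <*> P f = P $ \str ->
    [(g' v, s2) | (g', s1) <- g str, (v, s2) <- f s1]

-- The expression from the question, for comparison.
pairOfItems :: Parser (Char, Char)
pairOfItems = pure (,) <*> item <*> item
```

With this ordering, pairOfItems on "abc" gives [(('a','b'),"c")], matching the result reported in the question.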

How to restrict backtracking in a monad transformer parser combinator

tl;dr, How do I implement parsers whose backtracking can be restricted, where the parsers are monad transformer stacks?
I haven't found any papers, blogs, or example implementations of this approach; it seems the typical approach to restricting backtracking is a datatype with additional constructors, or the Parsec approach where backtracking is off by default.
My current implementation -- using a commit combinator, see below -- is wrong; I'm not sure about the types, whether it belongs in a type class, and my instances are less generic than it feels like they should be.
Can anyone describe how to do this cleanly, or point me to resources?
I've added my current code below; sorry for the post being so long!
The stack:
StateT
MaybeT/ListT
Either e
The intent is that backtracking operates in the middle layer -- a Nothing or an empty list wouldn't necessarily yield an error, it'd just mean that a different branch should be tried -- whereas the bottom layer is for errors (with some contextual information) that immediately abort the parsing.
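To make the three possible outcomes concrete, here is a small standalone sketch of such a stack, separate from the code that follows. It assumes the transformers package; the names P, itemP, abort, run and Outcome are hypothetical and chosen only for this illustration.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Maybe (MaybeT (..))
import Control.Monad.Trans.State (StateT (..))

-- State on top, soft failure in the middle, hard errors at the bottom.
type P e t a = StateT [t] (MaybeT (Either e)) a

-- Consume one token; on empty input, fail softly (another branch may be tried).
itemP :: P e t t
itemP = StateT $ \ts -> case ts of
  (y : ys) -> MaybeT (Right (Just (y, ys)))
  []       -> MaybeT (Right Nothing)

-- Abort the whole parse with an error from the bottom layer.
abort :: e -> P e t a
abort = lift . lift . Left

-- Unwrap all layers.
run :: P e t a -> [t] -> Either e (Maybe (a, [t]))
run p = runMaybeT . runStateT p

type Outcome a = Either String (Maybe (a, String))

success :: Outcome Char
success = run itemP "ab"                      -- Right (Just ('a',"b"))

softFailure :: Outcome Char
softFailure = run itemP ""                    -- Right Nothing

hardError :: Outcome ()
hardError = run (itemP >> abort "boom") "ab"  -- Left "boom"
```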
{-# LANGUAGE NoMonomorphismRestriction, FunctionalDependencies,
FlexibleInstances, UndecidableInstances #-}
import Control.Monad.Trans.State (StateT(..))
import Control.Monad.State.Class (MonadState(..))
import Control.Monad.Trans.Maybe (MaybeT(..))
import Control.Monad.Trans.List (ListT(..))
import Control.Monad (MonadPlus(..), guard)
type Parser e t mm a = StateT [t] (mm (Either e)) a
newtype DParser e t a =
DParser {getDParser :: Parser e t MaybeT a}
instance Monad (DParser e t) where
return = DParser . return
(DParser d) >>= f = DParser (d >>= (getDParser . f))
instance MonadPlus (DParser e t) where
mzero = DParser (StateT (const (MaybeT (Right Nothing))))
mplus = undefined -- will worry about later
instance MonadState [t] (DParser e t) where
get = DParser get
put = DParser . put
A couple of parsing classes:
class (Monad m) => MonadParser t m n | m -> t, m -> n where
item :: m t
parse :: m a -> [t] -> n (a, [t])
class (Monad m, MonadParser t m n) => CommitParser t m n where
commit :: m a -> m a
Their instances:
instance MonadParser t (DParser e t) (MaybeT (Either e)) where
item =
get >>= \xs -> case xs of
(y:ys) -> put ys >> return y;
[] -> mzero;
parse = runStateT . getDParser
instance CommitParser t (DParser [t] t) (MaybeT (Either [t])) where
commit p =
DParser (
StateT (\ts -> MaybeT $ case runMaybeT (parse p ts) of
Left e -> Left e;
Right Nothing -> Left ts;
Right (Just x) -> Right (Just x);))
And a couple more combinators:
satisfy f =
item >>= \x ->
guard (f x) >>
return x
literal x = satisfy (== x)
Then these parsers:
ab = literal 'a' >> literal 'b'
ab' = literal 'a' >> commit (literal 'b')
give these results:
> myParse ab "abcd"
Right (Just ('b',"cd")) -- succeeds
> myParse ab' "abcd"
Right (Just ('b',"cd")) -- 'commit' doesn't affect success
> myParse ab "acd"
Right Nothing -- <== failure but not an error
> myParse ab' "acd"
Left "cd" -- <== error b/c of 'commit'
The answer appears to be in the MonadOr type class (which unfortunately for me is not part of the standard libraries):
class MonadZero m => MonadOr m where
morelse :: m a -> m a -> m a
satisfying Monoid and Left Catch:
morelse mzero b = b
morelse a mzero = a
morelse (morelse a b) c = morelse a (morelse b c)
morelse (return a) b = return a
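As a rough sketch of what these laws mean in practice, the class can be written against MonadPlus (base has no MonadZero, so that substitution is an assumption here) and given instances that commit to the first success. The instances and example values below are illustrative only.

```haskell
import Control.Monad (MonadPlus (..))

-- Hypothetical class following the laws above; MonadPlus stands in for MonadZero.
class MonadPlus m => MonadOr m where
  morelse :: m a -> m a -> m a

-- Maybe: the first Just wins, so "morelse (return a) b = return a" holds.
instance MonadOr Maybe where
  morelse (Just a) _ = Just a
  morelse Nothing b  = b

-- Lists: keep the left operand unless it is empty.
instance MonadOr [] where
  morelse [] b = b
  morelse a _  = a

committed :: [Int]
committed = morelse [1] [2]   -- [1]: the right branch is dropped

collected :: [Int]
collected = mplus [1] [2]     -- [1,2]: mplus on lists keeps both branches
```

The contrast between committed and collected is the difference that matters for restricting backtracking: morelse never revisits the right branch once the left one has succeeded.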

Packrat parsing (memoization via laziness) in OCaml

I'm implementing a packrat parser in OCaml, as per the master's thesis by B. Ford. My parser should receive a data structure that represents the grammar of a language and parse given sequences of symbols.
I'm stuck with the memoization part. The original thesis uses Haskell's lazy evaluation to accomplish linear time complexity. I want to do this (memoization via laziness) in OCaml, but don't know how to do it.
So, how do you memoize functions by lazy evaluations in OCaml?
EDIT: I know what lazy evaluation is and how to exploit it in OCaml. The question is how to use it to memoize functions.
EDIT: The data structure I wrote that represents grammars is:
type ('a, 'b, 'c) expr =
| Empty of 'c
| Term of 'a * ('a -> 'c)
| NTerm of 'b
| Juxta of ('a, 'b, 'c) expr * ('a, 'b, 'c) expr * ('c -> 'c -> 'c)
| Alter of ('a, 'b, 'c) expr * ('a, 'b, 'c) expr
| Pred of ('a, 'b, 'c) expr * 'c
| NPred of ('a, 'b, 'c) expr * 'c
type ('a, 'b, 'c) grammar = ('a * ('a, 'b, 'c) expr) list
The (non-memoized) function that parses a list of symbols is:
let rec parse g v xs = parse' g (List.assoc v g) xs
and parse' g e xs =
match e with
| Empty y -> Parsed (y, xs)
| Term (x, f) ->
begin
match xs with
| x' :: xs when x = x' -> Parsed (f x, xs)
| _ -> NoParse
end
| NTerm v' -> parse g v' xs
| Juxta (e1, e2, f) ->
begin
match parse' g e1 xs with
| Parsed (y, xs) ->
begin
match parse' g e2 xs with
| Parsed (y', xs) -> Parsed (f y y', xs)
| p -> p
end
| p -> p
end
( and so on )
where the type of the return value of parse is defined by
type ('a, 'c) result = Parsed of 'c * ('a list) | NoParse
For example, the grammar of basic arithmetic expressions can be specified as g, in:
type nt = Add | Mult | Prim | Dec | Expr
let zero _ = 0
let g =
[(Expr, Juxta (NTerm Add, Term ('$', zero), fun x _ -> x));
(Add, Alter (Juxta (NTerm Mult, Juxta (Term ('+', zero), NTerm Add, fun _ x -> x), (+)), NTerm Mult));
(Mult, Alter (Juxta (NTerm Prim, Juxta (Term ('*', zero), NTerm Mult, fun _ x -> x), ( * )), NTerm Prim));
(Prim, Alter (Juxta (Term ('<', zero), Juxta (NTerm Dec, Term ('>', zero), fun x _ -> x), fun _ x -> x), NTerm Dec));
(Dec, List.fold_left (fun acc d -> Alter (Term (d, (fun c -> int_of_char c - 48)), acc)) (Term ('0', zero)) ['1';'2';'3';])]
The idea of using laziness for memoization is to memoize data structures rather than functions. Laziness means that when you write let x = foo in some_expr, foo is not evaluated immediately but only as far as some_expr needs it, and the different occurrences of x in some_expr share the same thunk: as soon as one of them forces the computation, the result is available to all of them.
This does not work for functions: if you write let f x = foo in some_expr, and call f several times in some_expr, well, each call will be evaluated independently, there is not a shared thunk to store the results.
So you can get memoization by using a data structure instead of a function. Typically, this is done using an associative data structure: instead of computing an a -> b function, you compute a Table a b, where Table is some map from the arguments to the results. One example is this Haskell presentation of Fibonacci:
fib n = fibTable !! n
fibTable = [0,1] ++ map (\n -> fib (n - 1) + fib (n - 2)) [2..]
(You can also write that with tail and zip, but this doesn't make the point clearer.)
See that you do not memoize a function, but a list: it is the list fibTable that does the memoization. You can write this in OCaml as well, for example using the LazyList module of the Batteries library:
open Batteries
module LL = LazyList
let from_2 = LL.seq 2 ((+) 1) (fun _ -> true)
let rec fib n = LL.at fib_table (n - 1) + LL.at fib_table (n - 2)
and fib_table = lazy (LL.Cons (0, LL.cons 1 <| LL.map fib from_2))
However, there is little benefit in doing so: as you can see in the example above, OCaml does not particularly favor call-by-need evaluation. It's reasonable to use, but not as convenient as in Haskell, where it is the default. It is actually just as simple to write the cache structure directly, using mutation:
open Batteries
let fib =
let fib_table = DynArray.of_list [0; 1] in
let get_fib n = DynArray.get fib_table n in
fun n ->
for i = DynArray.length fib_table to n do
DynArray.add fib_table (get_fib (i - 1) + get_fib (i - 2))
done;
get_fib n
This example may be ill-chosen, because you need a dynamic structure to store the cache. In the packrat parser case, you're tabulating parsing on a known input text, so you can use plain arrays (one per grammar rule): you would have an array of ('a, 'c) result option for each rule, with one slot per input position, initialized to None. E.g. juxta.(n) represents the result of trying the rule Juxta from input position n, or None if this has not yet been tried.
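For comparison, here is a rough Haskell sketch of the lazy version of that per-position table, in the spirit of the thesis. It hard-codes a small arithmetic grammar similar to the one in the question rather than interpreting a grammar value, and the names Cell, parseExpr and charAt are made up for this example.

```haskell
import Data.Array (Array, listArray, (!))
import Data.Char (digitToInt, isDigit)

-- Result of one rule at one position: the value and the next position.
type Result = Maybe (Int, Int)

-- One memo cell per input position. The fields are lazy, so each rule is
-- evaluated at most once per position and then shared.
data Cell = Cell { cAdd :: Result, cMult :: Result, cPrim :: Result }

parseExpr :: String -> Maybe Int
parseExpr input =
  case cAdd (table ! 0) of
    Just (v, end) | end == n -> Just v
    _                        -> Nothing
  where
    n = length input

    chars :: Array Int Char
    chars = listArray (0, n - 1) input

    charAt :: Int -> Maybe Char
    charAt i = if i < n then Just (chars ! i) else Nothing

    -- The memo table: position i holds the (unevaluated) results at i.
    table :: Array Int Cell
    table = listArray (0, n) [Cell (add i) (mult i) (prim i) | i <- [0 .. n]]

    -- Add <- Mult '+' Add / Mult
    add i = case cMult (table ! i) of
      Just (l, j)
        | charAt j == Just '+'
        , Just (r, k) <- cAdd (table ! (j + 1)) -> Just (l + r, k)
      other -> other

    -- Mult <- Prim '*' Mult / Prim
    mult i = case cPrim (table ! i) of
      Just (l, j)
        | charAt j == Just '*'
        , Just (r, k) <- cMult (table ! (j + 1)) -> Just (l * r, k)
      other -> other

    -- Prim <- '<' Add '>' / digit
    prim i = case charAt i of
      Just '<'
        | Just (v, j) <- cAdd (table ! (i + 1))
        , charAt j == Just '>' -> Just (v, j + 1)
      Just c | isDigit c -> Just (digitToInt c, i + 1)
      _ -> Nothing
```

Here parseExpr "2*<3+4>" gives Just 14. Nothing is mutated explicitly: the rules look each other up through table, and the runtime replaces each thunk with its value the first time it is forced. An OCaml version would need explicit lazy values or the option arrays described above to get the same sharing.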
Laziness is a nice way to present this kind of memoization, but it is not always expressive enough: if you need, say, to free part of your result cache to lower memory usage, you will have difficulties if you started from a lazy presentation. See this blog post for a remark on this.
Why do you want to memoize functions? What you want to memoize is, I believe, the parsing result for a given (parsing) expression and a given position in the input stream. You could for instance use OCaml's Hashtbl module for that.
The lazy keyword.
Here you can find some great examples.
If it fits your use case, you can also use OCaml streams instead of manually generating thunks.

Is this a reasonable foundation for a parser combinator library?

I've been working with FParsec lately and I found that the lack of generic parsers is a major stopping point for me. My goal for this little library is simplicity as well as support for generic input. Can you think of any additions that would improve this or is anything particularly bad?
open LazyList
type State<'a, 'b> (input:LazyList<'a>, data:'b) =
member this.Input = input
member this.Data = data
type Result<'a, 'b, 'c> =
| Success of 'c * State<'a, 'b>
| Failure of string * State<'a, 'b>
type Parser<'a,'b, 'c> = State<'a, 'b> -> Result<'a, 'b, 'c>
let (>>=) left right state =
match left state with
| Success (result, state) -> (right result) state
| Failure (message, _) -> Result<'a, 'b, 'd>.Failure (message, state)
let (<|>) left right state =
match left state with
| Success (_, _) as result -> result
| Failure (_, _) -> right state
let (|>>) parser transform state =
match parser state with
| Success (result, state) -> Success (transform result, state)
| Failure (message, _) -> Failure (message, state)
let (<?>) parser errorMessage state =
match parser state with
| Success (_, _) as result -> result
| Failure (_, _) -> Failure (errorMessage, state)
type ParseMonad() =
member this.Bind (f, g) = f >>= g
member this.Return x s = Success(x, s)
member this.Zero () s = Failure("", s)
member this.Delay (f:unit -> Parser<_,_,_>) = f()
let parse = ParseMonad()
Backtracking
Surprisingly it didn't take too much code to implement what you describe. It is a bit sloppy but seems to work quite well.
let (>>=) left right state =
    seq {
        for res in left state do
            match res with
            | Success (v, s) ->
                // Keep the first successful continuation for this branch, if any.
                let v =
                    right v s
                    |> List.tryFind (
                        fun res ->
                            match res with
                            | Success (_, _) -> true
                            | _ -> false
                    )
                match v with
                | Some v -> yield v
                | None -> ()
            | Failure (_, _) -> () // A failed branch contributes no results.
    } |> Seq.toList
let (<|>) left right state =
    left state @ right state
Backtracking Part 2
Switched around the code to use lazy lists and tail-call optimized recursion.
let (>>=) left right state =
let rec readRight lst =
match lst with
| Cons (x, xs) ->
match x with
| Success (r, s) as q -> LazyList.ofList [q]
| Failure (m, s) -> readRight xs
| Nil -> LazyList.empty<Result<'a, 'b, 'd>>
let rec readLeft lst =
match lst with
| Cons (x, xs) ->
match x with
| Success (r, s) ->
match readRight (right r s) with
| Cons (x, xs) ->
match x with
| Success (r, s) as q -> LazyList.ofList [q]
| Failure (m, s) -> readRight xs
| Nil -> readLeft xs
| Failure (m, s) -> readLeft xs
| Nil -> LazyList.empty<Result<'a, 'b, 'd>>
readLeft (left state)
let (<|>) (left:Parser<'a, 'b, 'c>) (right:Parser<'a, 'b, 'c>) state =
LazyList.delayed (fun () -> left state)
|> LazyList.append
<| LazyList.delayed (fun () -> right state)
I think that one important design decision that you'll need to make is whether you want to support backtracking in your parsers or not (I don't remember much about parsing theory, but this probably specifies the types of languages that your parser can handle).
Backtracking. In your implementation, a parser can either fail (the Failure case) or produce exactly one result (the Success case). An alternative option is to generate zero or more results (for example, represent results as seq<'c>). Sorry if this is something you already considered :-), but anyway...
The difference is that your parser always follows the first possible option. For example, if you write something like the following:
let! s1 = (str "ab" <|> str "a")
let! s2 = str "bcd"
Using your implementation, this will fail for input "abcd". It will choose the first branch of the <|> operator, which will consume the first two characters, and then the next parser in the sequence will fail. An implementation based on sequences would be able to backtrack, follow the second path in <|>, and parse the input.
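The difference can be sketched in a few lines of Haskell (used here only for brevity; the same shape carries over to seq<'c> in F#). This is an illustration of the idea, not a drop-in replacement for the library above, and the names P, str, orElse, andThen and firstOnly are hypothetical.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (maybeToList)

-- A parser returns every possible (result, remaining input) pair.
newtype P a = P { runP :: String -> [(a, String)] }

-- Match a literal prefix.
str :: String -> P String
str s = P $ \inp -> [(s, rest) | rest <- maybeToList (stripPrefix s inp)]

-- Alternation keeps the results of both branches.
orElse :: P a -> P a -> P a
orElse (P p) (P q) = P $ \inp -> p inp ++ q inp

-- Sequencing runs the second parser after every result of the first.
andThen :: P a -> P b -> P (a, b)
andThen (P p) (P q) = P $ \inp ->
  [((a, b), rest') | (a, rest) <- p inp, (b, rest') <- q rest]

-- Commit to the first result only, mimicking a single-result parser.
firstOnly :: P a -> P a
firstOnly (P p) = P (take 1 . p)

backtracking, committed :: [((String, String), String)]
backtracking = runP ((str "ab" `orElse` str "a") `andThen` str "bcd") "abcd"
committed = runP (firstOnly (str "ab" `orElse` str "a") `andThen` str "bcd") "abcd"
```

Here backtracking evaluates to [(("a","bcd"),"")], because the "a" branch is still available after "ab" leads to a dead end, while committed is empty.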
Combine. Another idea that occurs to me is that you could also add a Combine member to your parser computation builder. This is a bit subtle (because you need to understand how computation expressions are translated), but it can sometimes be useful. If you add:
member x.Combine(a, b) = a <|> b
member x.ReturnFrom(p) = p
You can then write recursive parsers nicely:
let rec many p acc =
  parser { let! r = p                    // Parse 'p' at least once
           return! many p (r::acc)       // Try parsing 'p' multiple times
           return r::acc |> List.rev }   // If that fails, return the result

Resources