Implementation differences between parser combinators and the packrat algorithm

To get a better understanding of packrat, I've had a look at the implementation provided with the paper (I'm focusing on the bind):
instance Derivs d => Monad (Parser d) where

    -- Sequencing combinator
    (Parser p1) >>= f = Parser parse
        where parse dvs = first (p1 dvs)

              first (Parsed val rem err) =
                  let Parser p2 = f val
                  in second err (p2 rem)
              first (NoParse err) = NoParse err

              second err1 (Parsed val rem err) =
                  Parsed val rem (joinErrors err1 err)
              second err1 (NoParse err) =
                  NoParse (joinErrors err1 err)

    -- Result-producing combinator
    return x = Parser (\dvs -> Parsed x dvs (nullError dvs))

    -- Failure combinator
    fail [] = Parser (\dvs -> NoParse (nullError dvs))
    fail msg = Parser (\dvs -> NoParse (msgError (dvPos dvs) msg))
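For context, the instance above is written against the paper's Parser and Result types, which look roughly like this (my own simplified restatement; the name ErrorInfo is just a placeholder for the paper's error-descriptor type):

newtype Parser d v = Parser (d -> Result d v)

data Result d v = Parsed v d ErrorInfo   -- value, remaining derivations, error information
                | NoParse ErrorInfo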
Error handling aside, this looks to me just like ordinary parser combinators (such as this simplified version of Parsec):
bind :: Parser a -> (a -> Parser b) -> Parser b
bind p f = Parser $ \s -> concatMap (\(a, s') -> parse (f a) s') $ parse p s
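(For reference, that bind assumes a list-of-results parser type along these lines, which is the common textbook definition rather than Parsec's actual representation:)

newtype Parser a = Parser { parse :: String -> [(a, String)] }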
I'm quite confused, because until now I thought that the big difference was that packrat is a parser generator with a memoization part.
Sadly, it seems that this concept is not used in this implementation.
What is the big difference between parser combinators and packrat at the implementation level?
PS: I've also had a look at Papillon, but it seems to be very different from the implementation that comes with the paper.

The point here is really that this Packrat parser combinator library is not a full implementation of the Packrat algorithm, but more like a set of definitions that can be reused between different packrat parsers.
The real trick of the packrat algorithm (namely the memoization of parse results) happens elsewhere.
Look at the following code (taken from Ford's thesis):
data Derivs = Derivs {
    dvAdditive  :: Result Int,
    dvMultitive :: Result Int,
    dvPrimary   :: Result Int,
    dvDecimal   :: Result Int,
    dvChar      :: Result Char}
pExpression :: Derivs -> Result ArithDerivs Int
Parser pExpression = (do char '('
                         l <- Parser dvExpression
                         char '+'
                         r <- Parser dvExpression
                         char ')'
                         return (l + r))
                 </> (do Parser dvDecimal)
Here, it's important to notice that the recursive call of the expression parser to itself is broken (in a kind of open-recursion fashion) by simply projecting the appropriate component of the Derivs structure.
This recursive knot is then tied in the "recursive tie-up function" (again taken from Ford's thesis):
parse :: String -> Derivs
parse s = d where
    d    = Derivs add mult prim dec chr
    add  = pAdditive d
    mult = pMultitive d
    prim = pPrimary d
    dec  = pDecimal d
    chr  = case s of
               (c:s') -> Parsed c (parse s')
               []     -> NoParse
These snippets are really where the packrat trick happens.
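Concretely, the memoization falls out of Haskell's laziness: each field of d is a thunk that is evaluated at most once, so every nonterminal is parsed at most once per input position. A driver might look like this (a sketch of my own, assuming the thesis's two-constructor Result type as used in chr above; evalAdditive is not part of Ford's code):

evalAdditive :: String -> Maybe Int
evalAdditive s =
    case dvAdditive (parse s) of     -- forces (and thereby caches) the dvAdditive thunk
        Parsed value _rest -> Just value
        NoParse            -> Nothing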
It's important to understand that this trick cannot be implemented in a standard way in a traditional parser combinator library (at least in a pure programming language like Haskell), because it needs to know the recursive structure of the grammar.
There are experimental approaches to parser combinator libraries that use a particular representation of the recursive structure of the grammar, and there it is possible to provide a standard implementation of Packrat.
For example, my own grammar-combinators library (not maintained atm, sorry) offers an implementation of Packrat.

As stated elsewhere, packrat is not an alternative to combinators, but an implementation option within such parsers. Pyparsing is a combinator library that offers an optional packrat optimization, enabled by calling ParserElement.enablePackrat(). Its implementation is almost a drop-in replacement for pyparsing's _parse method (renamed to _parseNoCache) with a _parseCache method.
Pyparsing uses a fixed-length FIFO queue for its cache, since packrat cache entries get stale once the combinator fully matches the cached expression and moves on through the input stream. A custom cache size can be passed as an integer argument to enablePackrat(), or if None is passed, the cache is unbounded. A packrat cache with the default value of 128 was only 2% less efficient than an unbounded cache against the supplied Verilog parser, with significant savings in memory.

Related

Implement `Applicative Parser`'s Apply Function

From Brent Yorgey's 2013 Penn class, after getting help on defining a Functor Parser, I'm attempting to make an Applicative Parser:
--p1 <*> p2 represents the parser which first runs p1 (which will
--consume some input and produce a function), then passes the
--remaining input to p2 (which consumes more input and produces
--some value), then returns the result of applying the function to the
--value
Here's my attempt:
instance Applicative (Parser) where
    pure x = Parser $ \_ -> Just (x, [])
    (Parser f) <*> (Parser g) = case (\ys -> f ys) of
        Nothing      -> Parser Nothing
        Just (_, xs) -> Parser $ g xs
However, I'm getting compile-time errors on the apply (<*>) definition.
Intuitively, I believe that using <*> achieves AND functionality.
If I have a parser for foo and a parser for bar, then I should be able to use apply <*> to say: foo followed by bar. In other words, input of foobar should successfully match, whereas foobip would not. It would fail on the second parser.
However, I believe that the types are:
Parser (a -> b) -> Parser a -> Parser b
So, that makes me think that my intuition is not entirely correct.
Please give me a tip to guide me towards understanding how to implement apply.
Your code is predicated on a misunderstanding of what a Parser is. Don't worry, virtually everybody makes this mistake.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }
Let's break down what this means.
String -> Maybe (a, String)
  [1]      [2]  [3]   [4]

[1]: I take a string and return Maybe (a, String)
[2]: I might not succeed in parsing the input into the desired datatype
[3]: The desired type I am parsing the String into
[4]: The remaining input after having consumed the amount of data required to parse a
A Parser is a function from text input to Maybe a tuple of a value and the rest of the text. A Parser is emphatically not a tuple; otherwise you wouldn't have a parser, just data in a tuple.
I'm not going to tell you how to implement <*> and nobody else should either as it would deprive you of the experience.
However, I'll give you pure so you understand the basic pattern:
pure a = Parser (\s -> Just (a, s))
See? It's a function of s -> Maybe (a, s). I intentionally mimicked the type variables in my terms to make it more obvious.
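As a quick sanity check of that definition, using the runParser accessor from the newtype above:

runParser (pure 'x') "rest"   -- evaluates to Just ('x', "rest")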

Conditional looping in an Applicative Functor

Suppose that Parser x is a parser that parses an x. This parser probably possesses a many combinator that parses zero or more occurrences of something (stopping when the item parser fails).
I can see how one might implement that if Parser forms a monad. I can't figure out how to do it if Parser is only an Applicative Functor. There doesn't seem to be any way to check the previous result and decide what to do next (precisely the notion that monads add). What am I missing?
The Alternative type class provides the many combinator:
class Applicative f => Alternative f where
    empty :: f a
    (<|>) :: f a -> f a -> f a

    many :: f a -> f [a]
    some :: f a -> f [a]

    some = some'
    many = many'

many' a = some' a <|> pure []
some' a = (:) <$> a <*> many' a
The many a combinator means “zero or more” a.
The some a combinator means “one or more” a.
Hence:
The some a combinator returns a list of one a followed by many a (i.e. 1 + (0 or more)).
The many a combinator returns either some a or an empty list (i.e. (1 or more) | 0).
The many combinator depends upon the (<|>) operator, which can be viewed as the default ("or else") operator found in languages like JavaScript. For example, consider the Alternative instance of Maybe:
instance Alternative Maybe where
    empty = Nothing
    Nothing <|> r = r
    l       <|> _ = l
Essentially the (<|>) should return the left hand side value if it's truthy. Otherwise it should return the right hand side value.
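A quick GHCi check (with Control.Applicative imported):

*Main> Just 3 <|> Just 5
Just 3
*Main> Nothing <|> Just 5
Just 5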
A Parser is a data structure which is defined similarly to Maybe (the idea of applicative lexer combinators and parser combinators is essentially the same):
data Lexer a = Fail | Ok (Maybe a) (Vec (Lexer a))
If parsing fails, the Fail value is returned. Otherwise an Ok value is returned. Since Fail <|> pure [] is pure [], this is how the many combinator knows when to stop and return an empty list.
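Unrolling these definitions for the Maybe instance makes the stopping behaviour concrete (note that many (Just x) would never terminate, since Just x never fails):

many Nothing
  = some Nothing <|> pure []
  = ((:) <$> Nothing <*> many Nothing) <|> pure []
  = Nothing <|> Just []
  = Just []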
It can't be done just by using what is provided by Applicative. But Alternative has a function that gives you power beyond Applicative:
(<|>) :: f a -> f a -> f a
This function lets you "combine" two Alternatives without any restriction whatsoever on the a. But how? Something intrinsic to the particular functor f must give you a means to do that.
Typically, Alternatives require some notion of failure or emptiness. Like for parsers, where (<|>) means "try to parse this, if it fails, try this other thing". But this "dependence on a previous value" is hidden in the machinery implementing (<|>). It is not available to the external interface, so to speak.
From (<|>), one can implement a zero-or-one combinator:
optional :: Alternative f => f a -> f (Maybe a)
optional v = Just <$> v <|> pure Nothing
The definitions of some and many are similar, but they require mutually recursive functions.
Notice that there are Applicatives that aren't Alternatives. You can't make the Identity functor an Alternative, for example. How would you implement empty?
many is a class method of the Alternative class (link), which suggests that a general applicative functor does not always have a many implementation.

How to parse arbitrary lists with Haskell parsers?

Is it possible to use one of the parsing libraries (e.g. Parsec) for parsing something different than a String? And how would I do this?
For the sake of simplicity, let's assume the input is a list of ints [Int]. The task could be
drop leading zeros
parse the rest into the pattern (S+L+)*, where S is a number less than 10 and L is a number larger than or equal to ten.
return a list of tuples (Int, Int), where fst is the product of the S integers and snd is the product of the L integers
It would be great if someone could show how to write such a parser (or something similar).
Yes, as user5402 points out, Parsec can parse any instance of Stream, including arbitrary lists. As there are no predefined token parsers (as there are for text), you have to roll your own (myToken below), using e.g. tokenPrim.
The only thing I find a bit awkward is the handling of "source positions". SourcePos is an abstract type (rather than a type class) and forces me to use its "filename/line/column" format, which feels a bit unnatural here.
Anyway, here is the code (without the skipping of leading zeroes, for brevity)
import Text.Parsec

myToken :: (Show a) => (a -> Bool) -> Parsec [a] () a
myToken test = tokenPrim show incPos $ justIf test where
    incPos pos _ _ = incSourceColumn pos 1
    justIf test x = if (test x) then Just x else Nothing

small = myToken (< 10)
large = myToken (>= 10)

smallLargePattern = do
    smallints <- many1 small
    largeints <- many1 large
    let prod = foldl1 (*)
    return (prod smallints, prod largeints)

myIntListParser :: Parsec [Int] () [(Int,Int)]
myIntListParser = many smallLargePattern
testMe :: [Int] -> [(Int, Int)]
testMe xs = case parse myIntListParser "your list" xs of
    Left err     -> error $ show err
    Right result -> result
Trying it all out:
*Main> testMe [1,2,55,33,3,5,99]
[(2,1815),(15,99)]
*Main> testMe [1,2,55,33,3,5,99,1]
*** Exception: "your list" (line 1, column 9):
unexpected end of input
Note the awkward line/column format in the error message
Of course one could write a function sanitiseSourcePos :: SourcePos -> MyListPosition
There is very likely a way to get Parsec to use [a] as the stream type, but the idea behind parser combinators is actually very simple, and it's not very difficult to roll your own library.
A very accessible resource I would recommend is Monadic Parsing in Haskell by Graham Hutton and Erik Meijer.
Indeed, right now Erik Meijer is teaching an intro Haskell/functional programming course on edx.org (link) and Lecture 7 is all about functional parsers. As he states in the intro to the lecture:
"... No one can follow the path towards mastering functional programming without writing their own parser combinator library. We start by explaining what parsers are and how they can naturally be viewed as side-effecting functions. Next we define a number of basic parsers and higher-order functions for combining parsers. ..."

Insert a character into parser combinator character stream in Haskell

This question relates to both Parsec and uu-parsinglib. When we write parser combinators, they process a stream of characters. Is it somehow possible to parse a character and put it back (or push another character back) onto the input stream?
For example, I want to parse the input "test + 5": after parsing t, e, s, t and recognising the test pattern, I want to push a v character back onto the character stream, so that the rest of the parse matches against v + 5.
I do not want to use this in any particular case for now - I want to deeply learn the possibilities.
I'm not sure if it's possible with these parsers directly, but in general you can accomplish it by combining parsers with some streaming that allows injecting leftovers.
For example, using attoparsec-conduit you can turn a parser into a conduit using
sinkParser :: (AttoparsecInput a, MonadThrow m)
           => Parser a b -> Consumer a m b
where Consumer is a special kind of conduit that doesn't produce any output, only receives input and returns a final value.
Since conduits support leftovers, you can create a helper method that converts a parser that optionally returns a value to be pushed into the stream into a conduit:
import Data.Attoparsec.Types
import Data.Conduit
import Data.Conduit.Attoparsec
import Data.Functor

reinject :: (AttoparsecInput a, MonadThrow m)
         => Parser a (Maybe a, b) -> Consumer a m b
reinject p = do
    (lo, r) <- sinkParser p
    maybe (return ()) leftover lo
    return r
Then you convert standard parsers to conduits using sinkParser and these special parsers using reinject, and then combine conduits instead of parsers.
I think the simplest way to achieve this is to build a multi-layered parser. Think of a lexer + parser combination. This is a clean approach to the problem.
You have to separate the two kinds of parsing: the search-and-replace parsing goes into the first parser and the build-the-AST parsing into the second. Or you can create an intermediate token representation.
import Text.Parsec
import Text.Parsec.String

parserLvl1 :: Parser String
parserLvl1 = many (try (string "test" >> return 'v') <|> anyChar)

parserLvl2 :: Parser Plus
parserLvl2 = do text1 <- many (noneOf "+")
                char '+'
                text2 <- many (noneOf "+")
                return $ Plus text1 text2

data Plus = Plus String String
    deriving Show

wholeParse :: String -> Either ParseError Plus
wholeParse source = do res1 <- parse parserLvl1 "lvl1" source
                       res2 <- parse parserLvl2 "lvl2" res1
                       return res2
Now you can parse your example. wholeParse "test+5" results in Right (Plus "v" "5").
Possible variations:
Create a class and an instance for combining wrapped parser stages. (Possibly carrying parser state.)
Create an intermediate representation, a stream of tokens
This is easily done in uu-parsinglib using the pSwitch function. But the question is why you would want to do so. Because the v is missing from the input? In that case uu-parsinglib will perform error correction automatically, so you do not need something like this. Otherwise you can write:
pSwitch :: (st1 -> (st2, st2 -> st1)) -> P st2 a -> P st1 a
pInsert_v = pSwitch (\st1 -> (prepend 'v' st1, id)) (pSucceed ())
It depends on your actual state type how the v is actually added, so you will have to define the function prepend yourself. I do not know e.g. how such an insertion would influence the current position in the file etc.
Doaitse Swierstra

Making attoparsec parsers recursive

I've been coding up an attoparsec parser and have been hitting a pattern where I want to turn parsers into recursive parsers (recursively combining them with the monad bind >>= operator).
So I created a function to turn a parser into a recursive parser as follows:
recursiveParser :: (a -> A.Parser a) -> a -> A.Parser a
recursiveParser parser a = (parser a >>= recursiveParser parser) <|> return a
Which is useful if you have a recursive data type like
data Expression = ConsExpr Expression Expression | EmptyExpr

parseRHS :: Expression -> Parser Expression
parseRHS e = ConsExpr e <$> parseFoo

parseExpression :: Parser Expression
parseExpression = parseLHS >>= recursiveParser parseRHS
    where parseLHS = parseRHS EmptyExpr
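For concreteness, if parseFoo succeeds three times with values f1, f2 and f3 before failing, parseExpression builds the left-nested value below (my own unrolling of the definitions above):

ConsExpr (ConsExpr (ConsExpr EmptyExpr f1) f2) f3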
Is there a more idiomatic solution? It almost seems like recursiveParser should be some kind of fold... I also saw sepBy in the docs, but this method seems to suit me better for my application.
EDIT: Oh, actually, now that I think about it, it should be something similar to fix... Don't know how I forgot about that.
EDIT2: Rotsor makes a good point with his alternative for my example, but I'm afraid my AST is actually a bit more complicated than that. It actually looks something more like this (although this is still simplified)
data Segment = Choice1 Expression
             | Choice2 Expression

data Expression = ConsExpr Segment Expression
                | Token String
                | EmptyExpr
where a -> b associates (brackets) to the right and c:d associates to the left, with : binding more tightly than ->.
I.e. a -> b evaluates to
(ConsExpr (Choice1 (Token "a")) (Token "b"))
and c:d evaluates to
(ConsExpr (Choice2 (Token "d")) (Token "c"))
I suppose I could use foldl for the one and foldr for the other but there's still more complexity in there. Note that it's recursive in a slightly strange way, so "a:b:c -> e:f -> :g:h ->" is actually a valid string, but "-> a" and "b:" are not. In the end fix seemed simpler to me. I've renamed the recursive method like so:
fixParser :: (a -> A.Parser a) -> a -> A.Parser a
fixParser parser a = (parser a >>= fixParser parser) <|> pure a
Thanks.
Why not just parse a list and fold it into whatever you want later?
Maybe I am missing something, but this looks more natural to me:
consChain :: [Expression] -> Expression
consChain = foldl ConsExpr EmptyExpr
parseExpression :: Parser Expression
parseExpression = consChain <$> many1 parseFoo
And it's shorter too.
As you can see, consChain is now independent from parsing and can be useful somewhere else. Also, if you separate out the result folding, the somewhat unintuitive recursive parsing simplifies down to many or many1 in this case.
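Unrolling the fold on a two-element list shows that it builds the same left-nested shape as the recursive version:

consChain [f1, f2]
    = foldl ConsExpr EmptyExpr [f1, f2]
    = ConsExpr (ConsExpr EmptyExpr f1) f2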
You may want to take a look at how many is implemented too:
many :: (Alternative f) => f a -> f [a]
many v = many_v
    where many_v = some_v <|> pure []
          some_v = (:) <$> v <*> many_v
It has a lot in common with your recursiveParser:
some_v is similar to parser a >>= recursiveParser parser
many_v is similar to recursiveParser parser
You may ask why I called your recursive parser function unintuitive. This is because this pattern allows the parser argument to affect the parsing behaviour (a -> A.Parser a, remember?), which may be useful, but not obviously so (I don't see a use case for this yet). The fact that your example does not use this feature makes it look redundant.

Resources