John Hughes' Deterministic LL(1) parsing with Arrow and errors - parsing

I wanted to write a parser based on John Hughes' paper Generalising Monads to Arrows. When reading through and trying to reimplement his code, I realized there were some things that didn't quite make sense. In one section he lays out an arrow-based parser implementation that follows Swierstra and Duponcheel's paper Deterministic, Error-Correcting Combinator Parsers. The parser type he describes looks like this:
data StaticParser ch = SP Bool [ch]
data DynamicParser ch a b = DP ((a, [ch]) -> (b, [ch]))
data Parser ch a b = P (StaticParser ch) (DynamicParser ch a b)
with the composition operator looking something like this:
(.) :: Parser ch b c -> Parser ch a b -> Parser ch a c
P (SP e2 st2) (DP f2) . P (SP e1 st1) (DP f1) =
  P (SP (e1 && e2) (st1 `union` if e1 then st2 else []))
    (DP $ f2 . f1)
The issue is that the composition of parsers q . p 'forgets' q's starting symbols. One possible interpretation I thought of is that Hughes expects all our DynamicParsers to be total, so that a symbol parser's type signature would be symbol :: ch -> Parser ch a (Maybe ch) instead of symbol :: ch -> Parser ch a ch. This still seems awkward, though, since we have to duplicate information by putting starting-symbol information in both the StaticParser and the DynamicParser. Another issue is that almost all parsers can potentially fail, which means we will spend a lot of time inside Maybe or Either, creating what is essentially the "monads do not compose" problem. This could be remedied by rewriting DynamicParser itself to handle failure, or by writing it as an Arrow transformer, but that strays quite a bit from the paper. None of these issues are addressed in the paper, and the Parser is presented as if it obviously works, so I feel like I must be missing something basic. If someone can catch what I missed, that would be super helpful.

I think the deterministic parsers described by Swierstra and Duponcheel are a bit different from traditional parsers: they do not handle failure at all, only choice.
See also the invokeDet function in the S&D paper:
invokeDet :: Symbol s => DetPar s a -> Input s -> a
invokeDet (_, p) inp = case p inp [] of (a, _) -> a
This function clearly assumes it will always be able to find a valid parse.
With the arrow version of the parsers described by Hughes you can write examples like this:
main = do
  let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
  print $ invokeDet p "ab"
  print $ invokeDet p "ac"
Which will print the expected:
'b'
'c'
However, if you write a "failing" parse:
main = do
  let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
  print $ invokeDet p "ad"
It will still print:
'c'
To make this behavior a bit more sensible, Swierstra and Duponcheel also introduce error-correction. The output 'c' is expected if we assume the erroneous character d has been corrected to be a c in the input. This requires an extra mechanism which presumably was too complicated to include in Hughes' paper.
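To make the behaviour above concrete, here is a minimal sketch of a static/dynamic parser in this style. It is not the code from the gist or from either paper: it uses plain functions (the hypothetical names andThen, orElse and runDet) instead of Arrow and ArrowPlus instances, and it assumes that a symbol parser simply skips one input character without checking it, which is what makes the "ad" case come out as 'c'.

```haskell
import Data.List (union)

-- Static part: can the parser accept the empty string, and which
-- symbols can it start with?
data StaticParser ch = SP Bool [ch]

-- Dynamic part: a total function that never reports failure.
newtype DynamicParser ch a b = DP ((a, [ch]) -> (b, [ch]))

data Parser ch a b = P (StaticParser ch) (DynamicParser ch a b)

-- Assumption: symbol does not inspect the character it consumes.
-- The choice operator is what looks at the input.
symbol :: ch -> Parser ch a ch
symbol c = P (SP False [c]) (DP step)
  where
    step (_, _ : rest) = (c, rest)
    step (_, [])       = (c, [])

-- Sequential composition (plays the role of >>>).
andThen :: Eq ch => Parser ch a b -> Parser ch b c -> Parser ch a c
andThen (P (SP e1 st1) (DP f1)) (P (SP e2 st2) (DP f2)) =
  P (SP (e1 && e2) (st1 `union` (if e1 then st2 else [])))
    (DP (f2 . f1))

-- Deterministic choice (plays the role of <+>): pick a branch by
-- looking at one symbol and the static information of both sides.
orElse :: Eq ch => Parser ch a b -> Parser ch a b -> Parser ch a b
orElse (P (SP e1 st1) (DP f1)) (P (SP e2 st2) (DP f2)) =
  P (SP (e1 || e2) (st1 `union` st2)) (DP choose)
  where
    choose (a, input) = case input of
      x : _ | x `elem` st1 -> f1 (a, input)
            | x `elem` st2 -> f2 (a, input)
      _     | e1           -> f1 (a, input)
            | otherwise    -> f2 (a, input) -- no branch fits: fall back to the right one

runDet :: Parser ch () b -> [ch] -> b
runDet (P _ (DP f)) input = fst (f ((), input))

example :: Parser Char () Char
example = symbol 'a' `andThen` (symbol 'b' `orElse` symbol 'c')
```

In this sketch the static information of the right operand of andThen is not lost: it was already used when that operand's own choice function was built, so the composed dynamic function still dispatches on it.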
I have uploaded the implementation I used to get these results here: https://gist.github.com/noughtmare/eced4441332784cc8212e9c0adb68b35
For more information about a more practical parser in the same style (but no longer deterministic and no longer limited to LL(1)) I really like the "Combinator Parsing: A Short Tutorial" by Swierstra. An interesting excerpt from section 9.3:
A subtle point here is the question how to deal with monadic parsers. As we described in [13] the static analysis does not go well with monadic computations, since in that case we dynamically build new parses based on the input produced thus far: the whole idea of a static analysis is that it is static. This observation has lead John Hughes to propose arrows for dealing with such situations [7]. It is only recently that we realised that, although our arguments still hold in general, they do not apply to the case of the LL(1) analysis. If we want to compute the symbols which can be recognised as the first symbol by a parser of the form p >>= q then we are only interested in the starting symbols of the right hand side if the left hand side can recognise the empty string; the good news is that in that case we statically know what value will be returned as a witness, and can pass this value on to q, and analyse the result of this call statically too. Unfortunately we will have to take special precautions in case the left hand side operator contains a call to pErrors in one of the empty derivations, since then it is no longer true that the witness of this alternative can be determined statically.
The full parser implementation by Swierstra can be found in the uu-parsinglib package, although I do not know how many of the extensions are implemented there.

Related

Parse String to Datatype in Haskell

I'm taking a Haskell course at school, and I have to define a Logical Proposition datatype in Haskell. Everything so far works fine (definition and functions), and I've declared it as an instance of Ord, Eq and Show. The problem comes when I'm required to define a program which interacts with the user: I have to parse the input from the user into my datatype:
type Var = String
data FProp = V Var
           | No FProp
           | Y FProp FProp
           | O FProp FProp
           | Si FProp FProp
           | Sii FProp FProp
where the formula: ¬q ^ p would be: (Y (No (V "q")) (V "p"))
I've been researching, and found that I can declare my datatype as an instance of Read.
Is this advisable? If it is, can I get some help in order to define the parsing method?
Not a complete answer, since this is a homework problem, but here are some hints.
The other answer suggested getLine followed by splitting at words. It sounds like you instead want something more like a conventional tokenizer, which would let you write things like:
(Y
(No (V q))
(V p))
Here’s one implementation that turns a string into tokens that are either a string of alphanumeric characters or a single, non-alphanumeric printable character. You would need to extend it to support quoted strings:
import Data.Char
type Token = String
tokenize :: String -> [Token]
{- Here, a token is either a string of alphanumeric characters, or else one
 - non-spacing printable character, such as "(" or ")".
 -}
tokenize [] = []
tokenize (x:xs) | isSpace x = tokenize xs
                | not (isPrint x) = error $
                    "Invalid character " ++ show x ++ " in input."
                | not (isAlphaNum x) = [x]:(tokenize xs)
                | otherwise = let (token, rest) = span isAlphaNum (x:xs)
                              in token:(tokenize rest)
It turns the example into ["(","Y","(","No","(","V","q",")",")","(","V","p",")",")"]. Note that you have access to the entire repertoire of Unicode.
The main function that evaluates this interactively might look like:
main = interact ( unlines . map show . map evaluate . parse . tokenize )
Where parse turns a list of tokens into a list of ASTs and evaluate turns an AST into a printable expression.
As for implementing the parser, your language appears to have similar syntax to LISP, which is one of the simplest languages to parse; you don’t even need precedence rules. A recursive-descent parser could do it, and is probably the easiest to implement by hand. You can pattern-match on parse ("(":xs) =, but pattern-matching syntax can also implement lookahead very easily, for example parse ("(":x1:xs) = to look ahead one token.
If you’re calling the parser recursively, you would define a helper function that consumes only a single expression, and that has a type signature like :: [Token] -> (AST, [Token]). This lets you parse the inner expression, check that the next token is ")", and proceed with the parse. However, externally, you’ll want to consume all the tokens and return an AST or a list of them.
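As a rough illustration of that helper shape, here is a sketch of a recursive-descent parser over the token list for the parenthesised syntax shown earlier. It is only one possible design: it returns Maybe instead of a bare pair so that failure can be reported, and the names parseExpr, closeParen, binaryOps and parseAll are made up for this example.

```haskell
type Var = String
type Token = String

data FProp = V Var
           | No FProp
           | Y FProp FProp
           | O FProp FProp
           | Si FProp FProp
           | Sii FProp FProp
           deriving (Eq, Show)

-- Consume exactly one expression and return the unconsumed tokens.
parseExpr :: [Token] -> Maybe (FProp, [Token])
parseExpr ("(" : "V" : name : ")" : rest) = Just (V name, rest)
parseExpr ("(" : "No" : rest) = do
  (p, rest') <- parseExpr rest
  closeParen (No p) rest'
parseExpr ("(" : op : rest) = do
  con <- lookup op binaryOps       -- unknown operator: Nothing
  (p, rest1) <- parseExpr rest
  (q, rest2) <- parseExpr rest1
  closeParen (con p q) rest2
parseExpr _ = Nothing

-- The binary connectives, keyed by their token.
binaryOps :: [(Token, FProp -> FProp -> FProp)]
binaryOps = [("Y", Y), ("O", O), ("Si", Si), ("Sii", Sii)]

-- Check that the next token is ")" and drop it.
closeParen :: FProp -> [Token] -> Maybe (FProp, [Token])
closeParen p (")" : rest) = Just (p, rest)
closeParen _ _ = Nothing

-- External entry point: succeed only if every token is consumed.
parseAll :: [Token] -> Maybe FProp
parseAll ts = case parseExpr ts of
  Just (p, []) -> Just p
  _ -> Nothing
```

Feeding it the token list shown above should give Just (Y (No (V "q")) (V "p")).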
The stylish way to write a parser is with monadic parser combinators. (And maybe someone will post an example of one.) The industrial-strength solution would be a library like Parsec, but that’s probably overkill here. Still, parsing is (mostly!) a solved problem, and if you just want to get the assignment done on time, using a library off the shelf is a good idea.
The read part of a REPL interpreter typically looks like this:
repl :: ForthState -> IO () -- parser definition
repl state
  = do putStr "> " -- puts a > character to indicate it's waiting for input
       input <- getLine -- this is what you're looking for, to read a line.
       if input == "quit" -- allows user to quit the interpreter
         then do putStrLn "Bye!"
                 return ()
         else let (is, cs, d, output) = eval (words input) state -- your grammar definition is somewhere down the chain when eval is called on input
              in do mapM_ putStrLn output
                    repl (is, cs, d, [])
main = do putStrLn "Welcome to your very own interpreter!"
          repl initialForthState -- runs the parser, starting with read
Your eval function will have various loops, stack manipulations, conditionals, etc. to actually figure out what the user entered. I hope this helps you with at least the reading-input part.

How to parse arbitrary lists with Haskell parsers?

Is it possible to use one of the parsing libraries (e.g. Parsec) for parsing something different than a String? And how would I do this?
For the sake of simplicity, let's assume the input is a list of ints [Int]. The task could be:
1: drop leading zeros
2: parse the rest into the pattern (S+L+)*, where S is a number less than 10, and L is a number greater than or equal to ten
3: return a list of tuples (Int,Int), where fst is the product of the S and snd is the product of the L integers
It would be great if someone could show how to write such a parser (or something similar).
Yes, as user5402 points out, Parsec can parse any instance of Stream, including arbitrary lists. As there are no predefined token parsers (as there are for text), you have to roll your own (myToken below), using e.g. tokenPrim.
The only thing I find a bit awkward is the handling of "source positions". SourcePos is an abstract type (rather than a type class) and forces me to use its "filename/line/column" format, which feels a bit unnatural here.
Anyway, here is the code (without the skipping of leading zeroes, for brevity):
import Text.Parsec
myToken :: (Show a) => (a -> Bool) -> Parsec [a] () a
myToken test = tokenPrim show incPos $ justIf test where
  incPos pos _ _ = incSourceColumn pos 1
  justIf test x = if (test x) then Just x else Nothing
small = myToken (< 10)
large = myToken (>= 10)
smallLargePattern = do
  smallints <- many1 small
  largeints <- many1 large
  let prod = foldl1 (*)
  return (prod smallints, prod largeints)
myIntListParser :: Parsec [Int] () [(Int,Int)]
myIntListParser = many smallLargePattern
testMe :: [Int] -> [(Int, Int)]
testMe xs = case parse myIntListParser "your list" xs of
  Left err -> error $ show err
  Right result -> result
Trying it all out:
*Main> testMe [1,2,55,33,3,5,99]
[(2,1815),(15,99)]
*Main> testMe [1,2,55,33,3,5,99,1]
*** Exception: "your list" (line 1, column 9):
unexpected end of input
Note the awkward line/column format in the error message.
Of course, one could write a function sanitiseSourcePos :: SourcePos -> MyListPosition to convert it.
There is very likely a way to get Parsec to use [a] as the stream type, but the idea behind parser combinators is actually very simple, and it's not very difficult to roll your own library.
A very accessible resource I would recommend is Monadic Parsing in Haskell by Graham Hutton and Erik Meijer.
Indeed, right now Erik Meijer is teaching an intro Haskell/functional programming course on edx.org, and Lecture 7 is all about functional parsers. As he states in the intro to the lecture:
"... No one can follow the path towards mastering functional programming without writing their own parser combinator library. We start by explaining what parsers are and how they can naturally be viewed as side-effecting functions. Next we define a number of basic parsers and higher-order functions for combining parsers. ..."
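To show how little is needed when rolling your own, here is a minimal sketch of a parser combinator type over an arbitrary token list, applied to the [Int] task from the question (including the leading zeros). It follows the general shape of the Hutton and Meijer parsers but is not taken from their paper; the names Parser, satisfy, endOfInput, smallLarge, intList and parseInts are made up here, and failure is a bare Nothing with no position information.

```haskell
import Control.Applicative (Alternative (..))

-- A parser consumes a prefix of a token list and may fail.
newtype Parser t a = Parser { runParser :: [t] -> Maybe (a, [t]) }

instance Functor (Parser t) where
  fmap f (Parser p) = Parser $ \ts -> case p ts of
    Nothing -> Nothing
    Just (a, rest) -> Just (f a, rest)

instance Applicative (Parser t) where
  pure a = Parser $ \ts -> Just (a, ts)
  Parser pf <*> Parser pa = Parser $ \ts -> case pf ts of
    Nothing -> Nothing
    Just (f, rest) -> case pa rest of
      Nothing -> Nothing
      Just (a, rest') -> Just (f a, rest')

instance Alternative (Parser t) where
  empty = Parser (const Nothing)
  -- Try the left parser; on failure, rerun the right one on the same input.
  Parser p <|> Parser q = Parser $ \ts -> case p ts of
    Nothing -> q ts
    result -> result

-- Accept one token that passes the test.
satisfy :: (t -> Bool) -> Parser t t
satisfy ok = Parser $ \ts -> case ts of
  (t : rest) | ok t -> Just (t, rest)
  _ -> Nothing

endOfInput :: Parser t ()
endOfInput = Parser $ \ts -> if null ts then Just ((), ts) else Nothing

-- One S+L+ group: product of the small numbers, product of the large ones.
smallLarge :: Parser Int (Int, Int)
smallLarge =
  (,) <$> (product <$> some (satisfy (< 10)))
      <*> (product <$> some (satisfy (>= 10)))

-- Drop leading zeros, then parse (S+L+)* up to the end of the list.
intList :: Parser Int [(Int, Int)]
intList = many (satisfy (== 0)) *> many smallLarge <* endOfInput

parseInts :: [Int] -> Maybe [(Int, Int)]
parseInts = fmap fst . runParser intList
```

Here many and some come from the Alternative class, so only the three instances and satisfy had to be written by hand.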

Maximal munch in Text.ParserCombinators.ReadP

The Read instance for Double behaves in a very straightforward way:
reads "34.567e8 foo" :: [(Double, String)] = [(3.4567e9," foo")]
However the Read instance for Scientific does something different:
reads "34.567e8 foo" :: [(Scientific, String)] =
[(34.0,".567e8 foo"),(34.567,"e8 foo"),(3.4567e9," foo")]
Strictly this is correct, in that it is presenting a list of possible parses of the input. In fact it could equally well have included (3.0, "4.567e8 foo") in the list, as well as some others. However the usual behaviour in cases like this (which the Double instance follows) is "maximal munch", meaning that the longest valid prefix is parsed.
I'm updating my Decimal library, which has a similar behaviour, and I'm wondering what the Right Thing is here. Both Scientific and Decimal are using Text.ParserCombinators.ReadP, which was designed to make it easy to write Read instances, and this seems to be a characteristic of ReadP parsers.
So my questions:
1: What is the Right Thing for "reads" to return in these cases? Should I file a bug for Data.Scientific?
2: If it should only return the maximal munch (like the Double instance does) then how do you get ReadP to do that?
I've decided that maximal munch is the Right Thing. Given "1.23" a parser that returns 1 is just wrong. I've been tripped up by this myself because I once tried to write a "maybeRead" looking like this:
maybeRead :: (Read a) => String -> Maybe a
maybeRead str = case reads str of
  [(v, "")] -> Just v
  _ -> Nothing
This worked fine for Double but failed for Decimal and Scientific. (Obviously it can be fixed to handle multiple return results, but I didn't expect to need to do this).
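For reference, a version that copes with several results might look like the sketch below. It is one possible fix, not necessarily the one used in the library: it keeps only the parses that consumed the whole input (allowing trailing whitespace, which is an added assumption) and succeeds only if exactly one remains. This is close in spirit to what Text.Read.readMaybe does.

```haskell
import Data.Char (isSpace)

-- Keep only complete parses; succeed if there is exactly one.
maybeRead :: Read a => String -> Maybe a
maybeRead str =
  case [v | (v, rest) <- reads str, all isSpace rest] of
    [v] -> Just v
    _ -> Nothing
```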
The problem turned out to be the implementation of "option" in Text.ParserCombinators.ReadP (the variant that takes a default value). This uses the symmetric choice operator "+++", which returns both the parse with the optional component and the parse without it. Hence when I wrote something like
expPart <- option "" $ do {...}
the results included a parse without the expPart.
I wrote a different version of "option" using the left-biased choice operator:
myOpt d p = p <++ return d
If the parser "p" produces any result at all then the default is not used, so the shorter parse is dropped. This does the Right Thing if you want maximal munch.
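The difference between the two choice operators can be seen in a small sketch. The parser below is a toy that recognises digits with an optional fractional part and returns the matched text; it is not the Decimal or Scientific parser, and the names optSymmetric, optLeftBiased and decimalWith are made up for the example.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- Default-value combinator built on symmetric choice:
-- both alternatives are kept.
optSymmetric :: a -> ReadP a -> ReadP a
optSymmetric d p = p +++ return d

-- Same combinator built on left-biased choice:
-- the default is used only if p yields nothing.
optLeftBiased :: a -> ReadP a -> ReadP a
optLeftBiased d p = p <++ return d

-- Digits, then an optional ".digits" part, using the supplied combinator.
decimalWith :: (String -> ReadP String -> ReadP String) -> ReadP String
decimalWith opt = do
  whole <- munch1 isDigit
  frac <- opt "" ((:) <$> char '.' <*> munch1 isDigit)
  return (whole ++ frac)

symmetricResults, leftBiasedResults :: [(String, String)]
symmetricResults = readP_to_S (decimalWith optSymmetric) "1.23 foo"
leftBiasedResults = readP_to_S (decimalWith optLeftBiased) "1.23 foo"
```

With the symmetric version, both ("1", ".23 foo") and ("1.23", " foo") appear among the results; with the left-biased version only the longer parse is returned.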
For #2, you could change the scientific package to use this parser defined in terms of the old one: scientificPmaxmuch = scientificP <* eof :: ReadP Scientific.
I don't think there is much of a convention for #1: it doesn't make a difference for people using read or Text.Read.readMaybe. readS_to_P reads :: ReadP Double is probably faster than readS_to_P reads :: ReadP Scientific, but if efficiency mattered at all you would keep everything as ReadP until the end.

Cannot compute minimal length of a parser - uu-parsinglib in Haskell

Let's look at this code snippet:
pSegmentBegin p i = pIndentExact i *> ((:) <$> p i <*> ((pEOL *> pSegment p i) <|> pure []))
If I change this code in my parser to:
pSegmentBegin p i = do
  pIndentExact i
  ((:) <$> p i <*> ((pEOL *> pSegment p i) <|> pure []))
I get an error:
cannot compute minimal length of a parser due to occurrence of a monadic bind, use addLength to override
I thought the two parsers should behave the same way. Why does this error occur?
EDIT
The above example is very simple (to simplify the question) and, as noted below, it is not necessary to use do notation here. The real case where I wanted to use it is as follows:
pSegmentBegin p i = do
  j <- pIndentAtLast i
  (:) <$> p j <*> ((pEOL *> pSegments p j) <|> pure [])
I have noticed that adding a call to addLength before the do block (as below) solves the problem, but I'm unsure if it's a correct solution:
pSegmentBegin p i = addLength 2 $ do
  j <- pIndentAtLast i
  (:) <$> p j <*> ((pEOL *> pSegments p j) <|> pure [])
As I have mentioned many times, the monadic interface should be avoided whenever possible. Let me try to explain why the applicative interface is to be preferred.
One of the distinctive features of my library is that it performs error correction by inserting or deleting symbols. Of course we could take an unlimited look-ahead here, but that would make the process VERY expensive. So we take only a limited lookahead of three steps.
Now suppose we have to insert an expression and one of the expression alternatives is:
expr := "if" expr "then" expr "else" expr
then we want to exclude this alternative since choosing this alternative would necessitate the insertion of another expression etc. So we perform an abstract interpretation of the alternatives and make sure that in case of a draw (i.e. equal costs for the limited lookahead) we take one of the non-recursive alternatives.
Unfortunately this scheme breaks down when one writes monadic parsers, since the length of the right hand side of the bind may depend on the result of the left-hand side. So we issue the error message, and ask some help from the programmer to indicate the number of tokens this alternative might consume. The actual value does not matter so much, as long as you do not provide a finite length for something which is recursive and may lead to infinite insertions. It is only used to select the shortest alternative in case of an insertion.
This abstract interpretation has some costs, and if you write all your parsers in monadic style it is unavoidable that this analysis is repeated over and over again. So: DO NOT WRITE MONADIC STYLE PARSERS WHEN USING THIS LIBRARY IF THERE IS AN APPLICATIVE ALTERNATIVE.
It's trying to statically analyze how much input needs to be read in order to optimize performance, but that kind of optimization requires a statically known parser structure. Applicatives can build that kind of structure, since with them the parser effect cannot depend on a parsed value the way it can with (>>=).
So that's what goes wrong—when you use do notation it translates to a Monadic bind which breaks the Applicative predictor. It'd be nice if the library only exposed one of the two interfaces so that this kind of error cannot happen, but instead there's some inconsistency if you use both interfaces together in the same parser.
Since this use of do is strictly unnecessary—you're not using the extra power the monadic interface gives you—it's probably better to just avoid it.
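The point about static analysis can be illustrated with a toy parser type that carries a minimal length next to its parsing function. This is only a sketch of the idea and has nothing to do with how uu-parsinglib is implemented internally; P, minLen, bindP, withLength and item are hypothetical names, with withLength standing in for the role that addLength plays.

```haskell
-- A parser together with a statically known minimal length, if any.
data P a = P
  { minLen :: Maybe Int
  , runP :: String -> Maybe (a, String)
  }

instance Functor P where
  fmap f (P n p) = P n (\s -> fmap (\(a, rest) -> (f a, rest)) (p s))

instance Applicative P where
  pure a = P (Just 0) (\s -> Just (a, s))
  -- Both operands are known up front, so their lengths can be added.
  P n1 pf <*> P n2 pa = P ((+) <$> n1 <*> n2) $ \s -> do
    (f, rest) <- pf s
    (a, rest') <- pa rest
    Just (f a, rest')

-- Monadic bind: the right-hand parser only exists once a value has
-- been parsed, so nothing can be said about its length in advance.
bindP :: P a -> (a -> P b) -> P b
bindP (P _ p) k = P Nothing $ \s -> do
  (a, rest) <- p s
  runP (k a) rest

-- The programmer supplies the length by hand.
withLength :: Int -> P a -> P a
withLength n (P _ p) = P (Just n) p

-- Consume exactly one character.
item :: P Char
item = P (Just 1) $ \s -> case s of
  (c : rest) -> Just (c, rest)
  [] -> Nothing

applicativePair :: P (Char, Char)
applicativePair = (,) <$> item <*> item

monadicChoice :: P Char
monadicChoice = item `bindP` \c -> if c == 'a' then item else pure c
```

Here minLen applicativePair is Just 2, while minLen monadicChoice is Nothing until it is wrapped in withLength.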
I have a workaround I use with monadic parsers in uu-parsinglib. It's a self-answer here:
Monadic parse with uu-parsinglib
You may find it useful.

Using Parsec to write a Read instance

Using Parsec, I'm able to write a function of type String -> Maybe MyType with relative ease. I would now like to create a Read instance for my type based on that; however, I don't understand how readsPrec works or what it is supposed to do.
My best guess right now is that readsPrec is used to build a recursive parser from scratch to traverse a string, building up the desired datatype in Haskell. However, I already have a very robust parser who does that very thing for me. So how do I tell readsPrec to use my parser? What is the "operator precedence" parameter it takes, and what is it good for in my context?
If it helps, I've created a minimal example on Github. It contains a type, a parser, and a blank Read instance, and reflects quite well where I'm stuck.
(Background: The real parser is for Scheme.)
However, I already have a very robust parser who does that very thing for me.
It's actually not that robust. Your parser has problems with superfluous parentheses; it won't parse
((1) (2))
for example, and it will throw an exception on some malformed inputs, because
singleP = Single . read <$> many digit
may use read "" :: Int.
That out of the way, the precedence argument is used to determine whether parentheses are necessary in some place, e.g. if you have
infixr 6 :+:
data a :+: b = a :+: b
data C = C Int
data D = D C
you don't need parentheses around a C 12 as an argument of (:+:), since the precedence of application is higher than that of (:+:), but you'd need parentheses around C 12 as an argument of D.
So you'd usually have something like
readsPrec p = needsParens (p >= precedenceLevel) someParser
where someParser parses a value from the input without enclosing parentheses, and needsParens True thing parses a thing between parentheses, while needsParens False thing parses a thing optionally enclosed in parentheses [you should always accept more parentheses than necessary, ((((((1)))))) should parse fine as an Int].
Since the readsPrec p parsers are used to parse parts of the input as parts of the value when reading lists, tuples etc., they must return not only the parsed value, but also the remaining part of the input.
With that, a simple way to transform a parsec parser to a readsPrec parser would be
withRemaining :: Parser a -> Parser (a, String)
withRemaining p = (,) <$> p <*> getInput
parsecToReadsPrec :: Parser a -> Int -> ReadS a
parsecToReadsPrec parsecParser prec input
  = case parse (withRemaining $ needsParens (prec >= threshold) parsecParser) "" input of
      Left _ -> []
      Right result -> [result]
If you're using GHC, it may however be preferable to use a ReadPrec / ReadP parser (built using Text.ParserCombinators.ReadP[rec]) instead of a parsec parser and define readPrec instead of readsPrec.
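A minimal sketch of that ReadP / ReadPrec route is below, assuming GHC. The Temp type and its tempP parser are hypothetical stand-ins for your own type and parser; parens accepts optional surrounding parentheses, and prec 10 makes them mandatory when the value appears in a context with higher precedence than function application.

```haskell
import Data.Char (isDigit, isSpace)
import qualified Text.ParserCombinators.ReadP as P
import Text.ParserCombinators.ReadPrec (lift, prec)
import Text.Read (Read (..), parens, readMaybe)

-- Hypothetical example type.
newtype Temp = Temp Int deriving (Eq, Show)

-- A hand-written ReadP parser for the syntax "Temp <digits>".
tempP :: P.ReadP Temp
tempP = do
  P.skipSpaces
  _ <- P.string "Temp"
  _ <- P.munch1 isSpace
  digits <- P.munch1 isDigit   -- at least one digit, so read is safe
  return (Temp (read digits))

instance Read Temp where
  -- lift turns the ReadP parser into a ReadPrec parser.
  readPrec = parens (prec 10 (lift tempP))

parsedPlain, parsedParens, parsedBad :: Maybe Temp
parsedPlain = readMaybe "Temp 12"
parsedParens = readMaybe "(Temp 12)"
parsedBad = readMaybe "Temp x"
```

Because readPrec is defined, readsPrec, reads and readMaybe all go through the same parser.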

Resources