Maximal munch in Text.ParserCombinators.ReadP - parsing

The Read instance for Double behaves in a very straightforward way:
reads "34.567e8 foo" :: [(Double, String)] = [(3.4567e9," foo")]
However the Read instance for Scientific does something different:
reads "34.567e8 foo" :: [(Scientific, String)] =
[(34.0,".567e8 foo"),(34.567,"e8 foo"),(3.4567e9," foo")]
Strictly this is correct, in that it is presenting a list of possible parses of the input. In fact it could equally well have included (3.0, "4.567e8 foo") in the list, as well as some others. However the usual behaviour in cases like this (which the Double instance follows) is "maximal munch", meaning that the longest valid prefix is parsed.
I'm updating my Decimal library, which has a similar behaviour, and I'm wondering what the Right Thing is here. Both Scientific and Decimal are using Text.ParserCombinators.ReadP, which was designed to make it easy to write Read instances, and this seems to be a characteristic of ReadP parsers.
So my questions:
1: What is the Right Thing for "reads" to return in these cases? Should I file a bug for Data.Scientific?
2: If it should only return the maximal munch (like the Double instance does) then how do you get ReadP to do that?

I've decided that maximal munch is the Right Thing. Given "1.23" a parser that returns 1 is just wrong. I've been tripped up by this myself because I once tried to write a "maybeRead" looking like this:
maybeRead :: (Read a) => String -> Maybe a
maybeRead str = case reads str of
    [(v, "")] -> Just v
    _ -> Nothing
This worked fine for Double but failed for Decimal and Scientific. (Obviously it can be fixed to handle multiple return results, but I didn't expect to need to do this).
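The fix for multiple results mentioned above can be sketched as follows: keep only the parses that consume the whole input, and succeed only if exactly one remains. This is similar in spirit to Text.Read.readMaybe. The name maybeReadAll is hypothetical and does not come from any library.

```haskell
import Data.Char (isSpace)

-- Hypothetical helper: succeed only if exactly one candidate parse
-- consumes all of the input (trailing whitespace is tolerated).
maybeReadAll :: Read a => String -> Maybe a
maybeReadAll str =
  case [v | (v, rest) <- reads str, all isSpace rest] of
    [v] -> Just v
    _   -> Nothing
```

With this version the shorter prefixes returned by a ReadP-based instance are simply filtered out, because their remaining input is not empty.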
The problem turned out to be the implementation of "option" in Text.ParserCombinators.ReadP (the variant of "optional" that takes a default value). It uses the symmetric choice operator "+++", which returns the parses both with and without the optional component. Hence when I wrote something like
expPart <- option "" $ do {...}
the results included a parse without the expPart.
I wrote a different version of "optional" using the left-biased choice operator:
myOpt d p = p <++ return d
If the parser "p" succeeds then the default is not used; the default is only tried when "p" fails. This does the Right Thing if you want maximal munch.
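The difference between the two choice operators can be seen in a small, self-contained sketch. The parsers numberSym, numberMax and fracPart are hypothetical stand-ins for a real number parser; only the ReadP functions come from base.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- A fractional part such as ".23".
fracPart :: ReadP String
fracPart = do
  _  <- char '.'
  ds <- munch1 isDigit
  return ('.' : ds)

-- Symmetric choice: option is built on (+++), so the parse with the
-- fraction and the parse without it are both kept.
numberSym :: ReadP String
numberSym = do
  whole <- munch1 isDigit
  frac  <- option "" fracPart
  return (whole ++ frac)

-- Left-biased choice: the default "" is used only when fracPart fails.
numberMax :: ReadP String
numberMax = do
  whole <- munch1 isDigit
  frac  <- fracPart <++ return ""
  return (whole ++ frac)
```

Running readP_to_S numberSym on "1.23 foo" yields two results (one for "1", one for "1.23"), while readP_to_S numberMax yields only the "1.23" parse with " foo" left over.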

For #2, you could change the scientific package to use this parser defined in terms of the old one: scientificPmaxmuch = scientificP <* eof :: ReadP Scientific.
I don't think there is much of a convention for #1: it doesn't make a difference for people using read or Text.Read.readMaybe. readS_to_P reads :: ReadP Double is probably faster than readS_to_P reads :: ReadP Scientific, but if efficiency mattered at all you would keep everything as ReadP until the end.
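One caveat about the eof approach: it keeps only parses that consume the entire input, so it removes the shorter prefixes but also rejects input with trailing text such as " foo". A minimal sketch, where numberP is a hypothetical stand-in for scientificP and fullParses is a helper invented for illustration:

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- Hypothetical ambiguous parser: option uses (+++), so both the short
-- and the long parse are produced.
numberP :: ReadP String
numberP = do
  whole <- munch1 isDigit
  frac  <- option "" ((:) <$> char '.' <*> munch1 isDigit)
  return (whole ++ frac)

-- Keep only the parses that reach the end of the input.
fullParses :: ReadP a -> String -> [a]
fullParses p s = [v | (v, _) <- readP_to_S (p <* eof) s]
```

Here fullParses numberP "1.23" gives a single result, whereas fullParses numberP "1.23 foo" gives none, which is why this is fine for read and readMaybe but not a drop-in replacement for reads.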

Related

John Hughes' Deterministic LL(1) parsing with Arrow and errors

I wanted to write a parser based on John Hughes' paper Generalizing Monads to Arrows. When reading through and trying to reimplement his code I realized there were some things that didn't quite make sense. In one section he lays out a parser implementation based on Swierstra and Duponcheel's paper Deterministic, error-correcting combinator parsers using Arrows. The parser type he describes looks like this:
data StaticParser ch = SP Bool [ch]
data DynamicParser ch a b = DP ((a, [ch]) -> (b, [ch]))
data Parser ch a b = P (StaticParser ch) (DynamicParser ch a b)
with the composition operator looking something like this:
(.) :: Parser ch b c -> Parser ch a b -> Parser ch a c
P (SP e2 st2) (DP f2) . P (SP e1 st1) (DP f1) =
    P (SP (e1 && e2) (st1 `union` if e1 then st2 else []))
      (DP $ f2 . f1)
The issue is that the composition of parsers q . p 'forgets' q's starting symbols. One possible interpretation I thought of is that Hughes expects all our DynamicParsers to be total, such that a symbol parser's type signature would be symbol :: ch -> Parser ch a (Maybe ch) instead of symbol :: ch -> Parser ch a ch. This still seems awkward, though, since we have to duplicate information by putting starting-symbol information in both the StaticParser and the DynamicParser. Another issue is that almost all parsers will have the potential to throw, which means we will have to spend a lot of time inside Maybe or Either, creating what is essentially the "monads do not compose" problem. This could be remedied by rewriting DynamicParser itself to handle failure, or as an Arrow transformer, but this is straying quite a bit from the paper. None of these issues are addressed in the paper, and the Parser is presented as if it obviously works, so I feel like I must be missing something basic. If someone can catch what I missed, that would be super helpful.
I think the deterministic parsers described by Swierstra and Duponcheel are a bit different from traditional parsers: they do not handle failure at all, only choice.
See also the invokeDet function in the S&D paper:
invokeDet :: Symbol s => DetPar s a -> Input s -> a
invokeDet (_, p) inp = case p inp [] of (a, _) -> a
This function clearly assumes it will always be able to find a valid parse.
With the arrow version of the parsers described by Hughes you can write examples like this:
main = do
    let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
    print $ invokeDet p "ab"
    print $ invokeDet p "ac"
Which will print the expected:
'b'
'c'
However, if you write a "failing" parse:
main = do
    let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
    print $ invokeDet p "ad"
It will still print:
'c'
To make this behavior a bit more sensible, Swierstra and Duponcheel also introduce error-correction. The output 'c' is expected if we assume the erroneous character d has been corrected to be a c in the input. This requires an extra mechanism which presumably was too complicated to include in Hughes' paper.
I have uploaded the implementation I used to get these results here: https://gist.github.com/noughtmare/eced4441332784cc8212e9c0adb68b35
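Independently of that gist, the behaviour described above can be reproduced with a small sketch written directly from the types in the question. This is an assumption-laden reconstruction, not the gist's code and not the paper's full implementation: the operators are defined locally instead of through the Arrow class, and runDet and example are hypothetical names.

```haskell
import Data.List (union)

data StaticParser ch = SP Bool [ch]
newtype DynamicParser ch a b = DP ((a, [ch]) -> (b, [ch]))
data Parser ch a b = P (StaticParser ch) (DynamicParser ch a b)

infixr 1 >>>
infixr 5 <+>

-- Accept one symbol. The dynamic part never inspects the input: it
-- drops one token (if there is one) and returns c.
symbol :: ch -> Parser ch a ch
symbol c = P (SP False [c]) (DP (\(_, xs) -> (c, drop 1 xs)))

-- Sequencing: run the first parser, then the second.
(>>>) :: Eq ch => Parser ch a b -> Parser ch b c -> Parser ch a c
P (SP e1 st1) (DP f1) >>> P (SP e2 st2) (DP f2) =
  P (SP (e1 && e2) (st1 `union` if e1 then st2 else []))
    (DP (f2 . f1))

-- Choice: select a branch from the next token and the static info.
-- If no starter set matches, a branch is still chosen.
(<+>) :: Eq ch => Parser ch a b -> Parser ch a b -> Parser ch a b
P (SP e1 st1) (DP f1) <+> P (SP e2 st2) (DP f2) =
  P (SP (e1 || e2) (st1 `union` st2)) (DP choose)
  where
    choose (a, xs) = case xs of
      x : _ | x `elem` st1 -> f1 (a, xs)
            | x `elem` st2 -> f2 (a, xs)
      _     | e1           -> f1 (a, xs)
            | otherwise    -> f2 (a, xs)

-- Hypothetical runner: ignore the static part and return the result.
runDet :: Parser ch () b -> [ch] -> b
runDet (P _ (DP f)) inp = fst (f ((), inp))

example :: Parser Char () Char
example = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
```

In this sketch runDet example "ad" returns 'c': the choice finds 'd' in neither starter set, falls through to the second branch, and symbol 'c' discards the 'd' without checking it. There is no failure path at all, only choice.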
For more information about a more practical parser in the same style (but no longer deterministic and no longer limited to LL(1)) I really like the "Combinator Parsing: A Short Tutorial" by Swierstra. An interesting excerpt from section 9.3:
A subtle point here is the question how to deal with monadic parsers. As we described in [13] the static analysis does not go well with monadic computations, since in that case we dynamically build new parses based on the input produced thus far: the whole idea of a static analysis is that it is static. This observation has lead John Hughes to propose arrows for dealing with such situations [7]. It is only recently that we realised that, although our arguments still hold in general, they do not apply to the case of the LL(1) analysis. If we want to compute the symbols which can be recognised as the first symbol by a parser of the form p >>= q then we are only interested in the starting symbols of the right hand side if the left hand side can recognise the empty string; the good news is that in that case we statically know what value will be returned as a witness, and can pass this value on to q, and analyse the result of this call statically too. Unfortunately we will have to take special precautions in case the left hand side operator contains a call to pErrors in one of the empty derivations, since then it is no longer true that the witness of this alternative can be determined statically.
The full parser implementation by Swierstra can be found in the uu-parsinglib package, although I do not know how many of the extensions are implemented there.

How to parse a keyword that is also an operator

I am trying to parse the following code using parsec
for x = Int in [1, 2, 3]
print x + 1
The only part of the example that might be hard to understand is x = Int which means the variable x is defined as an Int. Syntactically Int here is an expression. It might just as well be replaced with a function call that returns a type.
So far I have been able to parse all the simple literals and operators. My problem now is that in this language in is a keyword as well as an operator and types (Int) are objects like any other (that can be in lists). E.g. the following code is perfectly valid and prints false
print (Int in [1, 2, 3])
So right now my parser parses for x = correctly and then it parses Int in [1, 2, 3] as ONE expression. How can I make the for parser grab the in instead of leaving it to the expression parser? I have a feeling that parsec has something like that built in, but I have no idea how to find it.
Edit: I changed the example to make more sense...
Edit: I have this difficulty in various places; the language is very complex. Another example is the else operator, which returns its second argument if its first argument is null:
print (if true then (null else "hello") else "world")
# >> hello
print (if true then null else "hello" else "world")
# >> world
Thank you very much @talex and @n.m. for pointing me where I had to look. This is how I solved this specific problem:
I parameterized the expression parser (had to enable {-# LANGUAGE FlexibleContexts #-}) with a list of "eject" words and equally every relevant parser below it, specifically the binOperator parser
expression :: [String] -> MyParser AST
binOperator :: [String] -> MyParser AST
If one of the "eject" words is encountered in the position of a binary operator, the binOperator parser fails (and with it the chainl1-based parser that reads binary operations), thus leaving the "eject" word (in this case in) to the for parser to consume. This should work just as well with the if parser.
And I simply don't pass the eject words to the paren parser so there are no eject words recognized between ( and ) (and similar parsers like list).
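A minimal sketch of this "eject word" idea, assuming a tiny hypothetical AST and the plain Parser type from Text.Parsec.String instead of the MyParser type used above. Only the parsec combinators are real library functions; everything else is invented for illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical miniature AST, only for this sketch.
data AST
  = Num Int
  | Var String
  | List [AST]
  | BinOp String AST AST
  | For String AST AST   -- for <var> = <type expr> in <collection>
  deriving (Eq, Show)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- Recognise a binary operator, but fail (without consuming input) when
-- the operator is one of the eject words.
binOperator :: [String] -> Parser (AST -> AST -> AST)
binOperator eject = try $ do
  op <- lexeme (choice (map (try . string) ["in", "else", "+"]))
  if op `elem` eject
    then unexpected ("eject word " ++ show op)
    else return (BinOp op)

expression :: [String] -> Parser AST
expression eject = term `chainl1` binOperator eject

-- Brackets and parentheses reset the eject list to [].
term :: Parser AST
term = choice
  [ Num . read <$> lexeme (many1 digit)
  , Var <$> lexeme (many1 letter)
  , List <$> between (lexeme (char '[')) (lexeme (char ']'))
                     (expression [] `sepBy` lexeme (char ','))
  , between (lexeme (char '(')) (lexeme (char ')')) (expression [])
  ]

forHeader :: Parser AST
forHeader = do
  _    <- lexeme (string "for")
  var  <- lexeme (many1 letter)
  _    <- lexeme (char '=')
  ty   <- expression ["in"]   -- stops before the keyword "in"
  _    <- lexeme (string "in")
  coll <- expression []
  return (For var ty coll)
```

Because binOperator is wrapped in try, its failure consumes no input, so chainl1 ends the expression cleanly and the for parser sees the in keyword. Parsing "Int in [1, 2, 3]" with expression [] still treats in as an ordinary operator.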

Understanding Read instance

I made Read and Show instances of my data, but did not understand the Read instance
import Text.Printf (printf)
data Tests = Zero
           | One Int
           | Two Int Double
instance Show Tests where
    show Zero = "ZERO"
    show (One i) = printf "ONE %i" i
    show (Two i j) = printf "TWO %i %f" i j
instance Read Tests where
    readsPrec _ str = [(mkTests str, "")]
mkTests :: String -> Tests
mkTests = check . words
check :: [String] -> Tests
check ["ZERO"] = Zero
check ["ONE", i] = One (read i)
check ["TWO", i, j] = Two (read i) (read j)
check _ = error "no parse"
main :: IO ()
main = do
    print Zero
    print $ One 10
    print $ Two 1 3.14
    let x = read "ZERO" :: Tests
    print x
    let y = read "ONE 2" :: Tests
    print y
    let z = read "TWO 2 5.5" :: Tests
    print z
This is output
ZERO
ONE 10
TWO 1 3.14
ZERO
ONE 2
TWO 2 5.5
Here are my questions:
What is the recommended way to implement a Read instance?
The minimal complete definition of the Read class is readsPrec | readPrec,
and the description of readPrec :: ReadPrec a says
Proposed replacement for readsPrec using new-style parsers (GHC only).
Should I use readPrec instead? If so, how? I can't find any example on the net that I can understand.
What are the new-style parsers? Is it parsec?
What is the first Int argument of readsPrec :: Int -> ReadS a used for?
Is there any way to derive Read from Show?
In the past I could use deriving (Show,Read) to do most of the job. But this time I want to move to the next level.
In my opinion the correct way to implement Read is to derive it; otherwise, it is likely better to move on to more sophisticated parsers. Here are answers to all of your questions anyway.
readPrec is a simple parser-combinator-based approach that GHC provides. If you are willing to sacrifice portability for your Read instance you can use it, and it makes parsing easier.
I include a small example of how you could use readPrec below.
parsec is different from readPrec, although both are parser combinator libraries. Parsec is a much more complete parser library. Another parser combinator library is attoparsec, which works very similarly to parsec.
parsec and attoparsec can't be used with the ordinary Read typeclass (at least directly) but the greater flexibility they offer makes them a good idea for any time you want more complex parsing.
The Int argument to readsPrec is for dealing with precedence when parsing. This might matter when you want to parse arithmetic expressions. You can choose to fail parsing if the precedence is higher than the precedence of the current operator.
Deriving Read from Show isn't possible unfortunately.
Here are a couple of snippets that show how I would implement Read using ReadPrec.
ReadPrec example:
import Text.Read (Read(..), choice, get, pfail)
instance Read Tests where
    readPrec = choice [pZero, pOne, pTwo] where
        -- match exactly one expected character
        pChar c = do
            c' <- get
            if c == c'
                then return c
                else pfail
        pZero = traverse pChar "ZERO" *> pure Zero
        pOne = One <$> (traverse pChar "ONE " *> readPrec)
        pTwo = Two <$> (traverse pChar "TWO " *> readPrec) <*> readPrec
In general implementing Read is less intuitive than more heavyweight parsers. Depending on what you want to parse I highly suggest learning parsec or attoparsec since they are extremely useful when you want to parse even more complicated things.

Using Parsec to write a Read instance

Using Parsec, I'm able to write a function of type String -> Maybe MyType with relative ease. I would now like to create a Read instance for my type based on that; however, I don't understand how readsPrec works or what it is supposed to do.
My best guess right now is that readsPrec is used to build a recursive parser from scratch to traverse a string, building up the desired datatype in Haskell. However, I already have a very robust parser who does that very thing for me. So how do I tell readsPrec to use my parser? What is the "operator precedence" parameter it takes, and what is it good for in my context?
If it helps, I've created a minimal example on Github. It contains a type, a parser, and a blank Read instance, and reflects quite well where I'm stuck.
(Background: The real parser is for Scheme.)
However, I already have a very robust parser who does that very thing for me.
It's actually not that robust: your parser has problems with superfluous parentheses. It won't parse
((1) (2))
for example, and it will throw an exception on some malformed inputs, because
singleP = Single . read <$> many digit
may use read "" :: Int.
That out of the way, the precedence argument is used to determine whether parentheses are necessary in some place, e.g. if you have
infixr 6 :+:
data a :+: b = a :+: b
data C = C Int
data D = D C
you don't need parentheses around a C 12 as an argument of (:+:), since the precedence of application is higher than that of (:+:), but you'd need parentheses around C 12 as an argument of D.
So you'd usually have something like
readsPrec p = needsParens (p >= precedenceLevel) someParser
where someParser parses a value from the input without enclosing parentheses, and needsParens True thing parses a thing between parentheses, while needsParens False thing parses a thing optionally enclosed in parentheses [you should always accept more parentheses than necessary, ((((((1)))))) should parse fine as an Int].
Since the readsPrec p parsers are used to parse parts of the input as parts of the value when reading lists, tuples etc., they must return not only the parsed value, but also the remaining part of the input.
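To make the role of the precedence argument concrete, here is a hand-written sketch for the C and D types from above, in the style the Haskell report uses for derived instances. It is an illustration under the usual convention that constructor application has precedence 10, not the only way to write such instances.

```haskell
data C = C Int deriving (Eq, Show)
data D = D C   deriving (Eq, Show)

-- In a context with precedence above 10 (for example as a constructor
-- argument) parentheses are mandatory; otherwise they are optional.
instance Read C where
  readsPrec d = readParen (d > 10) $ \s ->
    [ (C n, rest) | ("C", s1) <- lex s, (n, rest) <- readsPrec 11 s1 ]

-- The argument of D is read at precedence 11, so "D (C 12)" is
-- accepted and "D C 12" is rejected.
instance Read D where
  readsPrec d = readParen (d > 10) $ \s ->
    [ (D c, rest) | ("D", s1) <- lex s, (c, rest) <- readsPrec 11 s1 ]
```

Each parser also returns the unconsumed input, which is what lets the D instance call the C instance on the rest of the string. Superfluous parentheses such as "((C 12))" are accepted because readParen False treats them as optional.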
With that, a simple way to transform a parsec parser to a readsPrec parser would be
withRemaining :: Parser a -> Parser (a, String)
withRemaining p = (,) <$> p <*> getInput
parsecToReadsPrec :: Parser a -> Int -> ReadS a
parsecToReadsPrec parsecParser prec input
    = case parse (withRemaining $ needsParens (prec >= threshold) parsecParser) "" input of
        Left _ -> []
        Right result -> [result]
If you're using GHC, it may however be preferable to use a ReadPrec / ReadP parser (built using Text.ParserCombinators.ReadP[rec]) instead of a parsec parser and define readPrec instead of readsPrec.
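As a rough illustration of the ReadP route, a ReadP parser can be turned into a readPrec definition with lift from Text.ParserCombinators.ReadPrec. The Version type and versionP parser below are hypothetical and unrelated to the Scheme parser in the question, and the sketch ignores precedence and parentheses entirely.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP (ReadP, char, munch1, sepBy1, skipSpaces)
import Text.ParserCombinators.ReadPrec (lift)
import Text.Read (Read (..), readMaybe)

-- Hypothetical example type: a dotted version number such as 1.2.3.
newtype Version = Version [Int] deriving (Eq, Show)

versionP :: ReadP Version
versionP = do
  skipSpaces
  parts <- munch1 isDigit `sepBy1` char '.'
  return (Version (map read parts))

-- lift :: ReadP a -> ReadPrec a
instance Read Version where
  readPrec = lift versionP
```

With this instance readMaybe "1.2.3" gives Just (Version [1,2,3]). Note that sepBy1 is non-deterministic, so reads on the same string also returns the shorter prefixes; read and readMaybe are unaffected because they only accept parses that consume the whole input.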

Parsing user input with reads in Haskell

I am trying to parse user entered string like "A12", into a Haskell tuple, like ('A', 12).
Here's what I have tried:
import Data.Maybe
type Pos = (Char, Int)
parse :: String -> Maybe Pos
parse u = do
    (c, rest) <- (listToMaybe . reads) u
    (r, _) <- (listToMaybe . reads) rest
    return $ (c, r)
But this always returns Nothing. Why does this happen, and what is the correct way to parse this string? Since this is fairly simple, I'd like to avoid using Parsec or a similar advanced parsing library.
EDIT (to clarify):
Sample Input and Output:
"A12" gives Just ('A', 12)
"J5" gives Just ('J', 5)
"A" gives Nothing
"2324" gives Nothing
read is usually the opposite of show and they both generally use Haskell syntax to represent the given values. This means that since the Haskell syntax for characters uses single quotes, show on a character will add single quotes around it, and read will expect the single quotes to be there.
In other words, your function expects syntax like 'A' 42, and indeed it works if you try that:
> parse "'A' 42"
Just ('A',42)
For your format, I would instead use pattern matching for the first character and then reads for the rest, e.g. something like this:
parse :: String -> Maybe Pos
parse [] = Nothing
parse (c:rest) = do
    (r, _) <- listToMaybe $ reads rest
    return (c, r)
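Note that the version above returns Just ('2',324) for "2324" and ignores any trailing text after the number, whereas the sample output in the question asks for Nothing on "2324". A stricter sketch, assuming the first character must be a letter and the rest must be a complete Int (the name parsePos is hypothetical):

```haskell
import Data.Char (isAlpha)
import Text.Read (readMaybe)

type Pos = (Char, Int)

-- Require a leading letter, then let readMaybe insist that the rest
-- of the string is exactly one Int.
parsePos :: String -> Maybe Pos
parsePos (c : rest)
  | isAlpha c = (,) c <$> readMaybe rest
parsePos _ = Nothing
```

readMaybe returns Nothing for an empty string, so "A" is rejected without a separate case.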
Do you have to use do notation? If not, the following function suits your needs. It's not pretty, but it gets the job done.
parse :: String -> Maybe Pos
parse (x:xs) = Just (x,read xs::Int)
I'm not sure what you consider "failing" and thus worthy of a Nothing.

Resources