Extending input in Haskell Parsec - parsing

I want to implement a Parsec parser for a simple language that allows file inclusions. I.e., the language looks like this:
include otherfile;
expression in the language;
If an inclusion is parsed, I want to read a file with this name and embed its parsed contents in the parent structure.
Since I have to read a file, the parser needs to be packed in IO. My guess was that the underlying monad m in ParsecT s u m a can be used for this. However, this leads to quite some changes in the language definition, since the LanguageDef machinery relies on Identity as the underlying monad.
Is my approach reasonable? Are there other ways to include files within a parser, e.g., extending the input stream?

Okay, here is what I came up with:
import Text.Parsec.Prim (ParsecT, mkPT, runParsecT)
import Data.Functor.Identity (Identity, runIdentity)
parsecTrans :: Monad m => ParsecT s u Identity a -> ParsecT s u m a
parsecTrans p = mkPT $ \s -> return $ fmap (return . runIdentity) $ runIdentity $ runParsecT p s
This function unpacks the ParsecT computation and generalizes it to an arbitrary monad. You can use it to lift all Identity-based parsers to IO-based ones.
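For illustration, here is a hedged sketch (not from the original answer) of how parsecTrans plus liftIO could handle the include case from the question. Stmt, statement and includeFile are made-up names, whitespace and error handling are mostly omitted, and the sketch assumes parsecTrans from above is in scope.
import Text.Parsec
import Control.Monad.IO.Class (liftIO)

data Stmt = Expr String | Included [Stmt]

-- An ordinary Identity-based parser, e.g. one built from a LanguageDef.
expr :: Parsec String () Stmt
expr = Expr <$> manyTill anyChar (char ';')

statement :: ParsecT String () IO Stmt
statement = try includeFile <|> parsecTrans expr

-- On "include otherfile;" read the named file and parse it recursively.
includeFile :: ParsecT String () IO Stmt
includeFile = do
  _    <- string "include"
  spaces
  name <- manyTill anyChar (char ';')
  src  <- liftIO (readFile name)
  sub  <- liftIO (runParserT (many statement) () name src)
  either (parserFail . show) (return . Included) sub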


When would a parser append its input string?

Given a parser
newtype Parser a = Parser { parse :: String -> [(a,String)] }
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
p >>= f = Parser $ \s -> concat [ parse (f a) s' | (a, s') <- parse p s ]
return :: a -> Parser a
return a = Parser (\s -> [(a,s)])
item :: Parser Char
item = Parser $ \s -> case s of
  ""     -> []
  (c:cs) -> [(c, cs)]
We can see that item consumes part of the input string given to it ("abc" -> [('a', "bc")]). Is there ever a case where a parser would produce additional string output or replace/modify it (e.g. Parser $ \s -> [((), 'a':s)])? I suspect that this might be the case with context-sensitive grammars but have trouble coming up with a sensible example.
Is there a reason why it would make sense to do this for a real-world problem?
References
Monadic Parsing in Haskell
Here are a couple of cases where it is convenient to inject tokens into the input stream. (How this is actually integrated into the parsing pipeline is another question.)
Macro expansion, in the style of the C/C++ preprocessing phase. This is arguably not the best model for macro expansion; hygienic macros would more likely be expanded using a tree transformation, as with C++ template resolution. But the token-oriented preprocessor is not going away soon. Since it is not tightly coupled with the language syntax, the easiest implementation is to substitute the macro (and arguments if applicable) with the tokens from its expansion.
ECMAScript-style automatic semicolon insertion (ASI). The language syntax requires a semicolon to be inserted into the token stream under certain precisely defined circumstances, which are difficult (at the least) to incorporate into a CFG. Since ASI is only possible if the next token in the input stream cannot be shifted (among other conditions), it can certainly be integrated into the parser loop.
Similarly, indentation-aware block syntax (as in Haskell and Python, for example) can certainly be implemented by replacing leading whitespace with an injected INDENT token or some number of injected DEDENTs. Since this substitution is dependent on parse context (it isn't done inside parentheses, for example), injection inside the parser could be a reasonable approach.
That's not an exhaustive list, but it might be at least indicative. Not all of those cases necessarily involve context-sensitivity (I believe ASI could, in theory, be handled with a context-free grammar although I have no intention of trying) and not all instances of context-sensitivity necessarily require token injection (the ambiguity in C between type and variable names only requires selecting the correct token).
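As a toy illustration of what token injection looks like in the list-of-results representation from the question (expandFoo and its expansion are made up), a parser can simply return a modified remainder:
newtype Parser a = Parser { parse :: String -> [(a, String)] }

-- Succeed without consuming anything, but splice a string in front
-- of whatever input remains.
inject :: String -> Parser ()
inject str = Parser $ \s -> [((), str ++ s)]

-- Recognise the "macro" FOO and replace it with its expansion, so
-- later parsers see the expanded text instead of the macro name.
expandFoo :: Parser ()
expandFoo = Parser $ \s -> case splitAt 3 s of
  ("FOO", rest) -> parse (inject "1 + 2") rest
  _             -> []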

parsec: feeding output of one parser to another [duplicate]

This question already has answers here:
Generate parser that runs a received parser on the output of another parser and monadically joins the results
I use (abuse) parsers to do some string transformation, e.g. normalizeWS :: Parser String removes duplicate whitespace and normalizeCase maps specific strings to lower case. I use parsers because the input data has some structure, for example literal strings have to be left untransformed. Is there an elegant way to feed the output of one parser as input to the next and thus form a transformation pipeline? Something in the vein of normalizeWS . normalizeCase (which of course doesn't work)?
Many thanks in advance!
I solved the problem using this approach ... maybe there is a more elegant way
preprocessor :: Parser String
preprocessor = normalizeCase `feeds` expandKettensatz `feeds` normalizeWs
feeds :: Parser String -> Parser String -> Parser String
feeds p1 p2 = do
  s <- p1
  setInput s
  p2
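For reference, a self-contained toy version of such a pipeline; normalizeWs and normalizeCase here are simplistic stand-ins for the asker's real parsers, and feeds is repeated from above:
import Text.Parsec
import Text.Parsec.String (Parser)
import Data.Char (toLower)

normalizeWs :: Parser String
normalizeWs = (unwords . words) <$> many anyChar   -- collapse whitespace runs

normalizeCase :: Parser String
normalizeCase = map toLower <$> many anyChar       -- lower-case everything

feeds :: Parser String -> Parser String -> Parser String
feeds p1 p2 = do
  s <- p1
  setInput s
  p2

-- parse (normalizeWs `feeds` normalizeCase) "" "Hello   WORLD"
--   == Right "hello world"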
If you have functions like
normalizeWhitespace :: Stream s m Char => ParsecT s u m String
normalizeCase :: Stream s m Char => Set String -> ParsecT s u m String
You could chain them together using runParser and >>=:
runBoth :: Stream s Identity Char => Set String -> SourceName -> s -> Either ParseError String
runBoth wordSet src input = do
  input <- runParser normalizeWhitespace () src input
  runParser (normalizeCase wordSet) () src input
But this doesn't give you a parser that you can chain together with other parsers. This isn't terribly surprising, as parser composition in Parsec is all about composing parsers that operate on the same stream, whereas these operate on different streams.
Having multiple different streams is pretty common too: using the output of a tokenization or lexing pass as input to parsing can make the process easier to understand, but Parsec is a little easier to use out of the box as a direct parser (without lexing/tokenization).

Insert a character into parser combinator character stream in Haskell

This question is related to both Parsec and uu-parsinglib. When we write parser combinators, they process character streams from the compiler. Is it somehow possible to parse a character and put it back (or return another character) to the input stream?
I want, for example, to parse the input "test + 5": parse the t, e, s, t, and after recognizing the test pattern, put, say, a v character back into the character stream, so that as parsing continues we are matching against v + 5.
I do not want to use this in any particular case for now - I want to deeply learn the possibilities.
I'm not sure if it's possible with these parsers directly, but in general you can accomplish it by combining parsers with some streaming that allows injecting leftovers.
For example, using attoparsec-conduit you can turn a parser into a conduit using
sinkParser :: (AttoparsecInput a, MonadThrow m)
=> Parser a b -> Consumer a m b
where Consumer is a special kind of conduit that doesn't produce any output, only receives input and returns a final value.
Since conduits support leftovers, you can create a helper method that converts a parser that optionally returns a value to be pushed into the stream into a conduit:
import Data.Attoparsec.Types
import Data.Conduit
import Data.Conduit.Attoparsec
import Data.Functor
reinject :: (AttoparsecInput a, MonadThrow m)
=> Parser a (Maybe a, b) -> Consumer a m b
reinject p = do
(lo, r) <- sinkParser p
maybe (return ()) leftover lo
return r
Then you convert standard parsers to conduits using sinkParser and these special parsers using reinject, and then combine conduits instead of parsers.
I think the simplest way to achieve this is to build a multi-layered parser. Think of a lexer + parser combination. This is a clean approach to this problem.
You have to separate the two kinds of parsing. The search-and-replace parsing goes into the first parser and the build-the-AST parsing into the second. Or you can create an intermediate token representation.
import Text.Parsec
import Text.Parsec.String
parserLvl1 :: Parser String
parserLvl1 = many (try (string "test" >> return 'v') <|> anyChar)
parserLvl2 :: Parser Plus
parserLvl2 = do
  text1 <- many (noneOf "+")
  char '+'
  text2 <- many (noneOf "+")
  return $ Plus text1 text2

data Plus = Plus String String
  deriving Show

wholeParse :: String -> Either ParseError Plus
wholeParse source = do
  res1 <- parse parserLvl1 "lvl1" source
  res2 <- parse parserLvl2 "lvl2" res1
  return res2
Now you can parse your example. wholeParse "test+5" results in Right (Plus "v" "5").
Possible variations:
Create a class and an instance for combining wrapped parser stages. (Possibly carrying parser state.)
Create an intermediate representation, a stream of tokens (see the sketch below)
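A hedged sketch of that second variation, running one Parsec parser over a String and a second one over the resulting token list via tokenPrim; Tok, lexer, tok and plus are illustrative names, and source positions are deliberately not tracked:
import Text.Parsec

data Tok = TWord String | TPlus deriving (Eq, Show)

-- Phase 1: turn the character stream into a token list.
lexer :: Parsec String () [Tok]
lexer = spaces >> many (lexeme (plusTok <|> wordTok))
  where
    lexeme p = do { t <- p; spaces; return t }
    plusTok  = char '+' >> return TPlus
    wordTok  = TWord <$> many1 (noneOf "+ \t\n")

-- Phase 2: a Parsec parser over [Tok] instead of String.
tok :: (Tok -> Maybe a) -> Parsec [Tok] () a
tok = tokenPrim show (\pos _ _ -> pos)

plus :: Parsec [Tok] () (String, String)
plus = do
  a <- tok word
  _ <- tok (\t -> if t == TPlus then Just () else Nothing)
  b <- tok word
  return (a, b)
  where
    word (TWord w) = Just w
    word _         = Nothing

-- parse lexer "lvl1" "test + 5" >>= parse plus "lvl2"
--   == Right ("test","5")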
This is easily done in uu-parsinglib using the pSwitch function. But the question is why you would want to do so. Because the v is missing from the input? In that case uu-parsinglib will perform error correction automatically, so you do not need something like this. Otherwise you can write
pSwitch :: (st1 -> (st2, st2 -> st1)) -> P st2 a -> P st1 a
pInsert_v = pSwitch (\st1 -> (prepend 'v' st1, id)) (pSucceed ())
It depends on your actual state type how the v is actually added, so you will have to define the function prepend yourself. I do not know e.g. how such an insertion would influence the current position in the file etc.
Doaitse Swierstra

Using Parsec to write a Read instance

Using Parsec, I'm able to write a function of type String -> Maybe MyType with relative ease. I would now like to create a Read instance for my type based on that; however, I don't understand how readsPrec works or what it is supposed to do.
My best guess right now is that readsPrec is used to build a recursive parser from scratch to traverse a string, building up the desired datatype in Haskell. However, I already have a very robust parser that does that very thing for me. So how do I tell readsPrec to use my parser? What is the "operator precedence" parameter it takes, and what is it good for in my context?
If it helps, I've created a minimal example on Github. It contains a type, a parser, and a blank Read instance, and reflects quite well where I'm stuck.
(Background: The real parser is for Scheme.)
However, I already have a very robust parser that does that very thing for me.
It's actually not that robust: your parser has problems with superfluous parentheses; it won't parse
((1) (2))
for example, and it will throw an exception on some malformed inputs, because
singleP = Single . read <$> many digit
may use read "" :: Int.
That out of the way, the precedence argument is used to determine whether parentheses are necessary in some place, e.g. if you have
infixr 6 :+:
data a :+: b = a :+: b
data C = C Int
data D = D C
you don't need parentheses around a C 12 as an argument of (:+:), since the precedence of application is higher than that of (:+:), but you'd need parentheses around C 12 as an argument of D.
So you'd usually have something like
readsPrec p = needsParens (p >= precedenceLevel) someParser
where someParser parses a value from the input without enclosing parentheses, and needsParens True thing parses a thing between parentheses, while needsParens False thing parses a thing optionally enclosed in parentheses [you should always accept more parentheses than necessary, ((((((1)))))) should parse fine as an Int].
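For concreteness, one way the needsParens helper could be written (a hedged sketch; the answer deliberately leaves it abstract, and whitespace handling here is minimal):
import Text.Parsec
import Text.Parsec.String (Parser)

-- Accept any number of surrounding parenthesis pairs; when the Bool
-- is True, require at least one pair.
needsParens :: Bool -> Parser a -> Parser a
needsParens True  p = inParens (needsParens False p)
needsParens False p = try (inParens (needsParens False p)) <|> p

inParens :: Parser a -> Parser a
inParens = between (char '(' >> spaces) (spaces >> char ')')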
Since the readsPrec p parsers are used to parse parts of the input as parts of the value when reading lists, tuples etc., they must return not only the parsed value, but also the remaining part of the input.
With that, a simple way to transform a parsec parser to a readsPrec parser would be
withRemaining :: Parser a -> Parser (a, String)
withRemaining p = (,) <$> p <*> getInput
parsecToReadsPrec :: Parser a -> Int -> ReadS a
parsecToReadsPrec parsecParser prec input =
  case parse (withRemaining $ needsParens (prec >= threshold) parsecParser) "" input of
    Left _       -> []
    Right result -> [result]
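Wiring it up for the question's type is then a one-liner (a sketch; myTypeParser and the chosen threshold are assumptions about the asker's code):
instance Read MyType where
  readsPrec = parsecToReadsPrec myTypeParser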
If you're using GHC, it may however be preferable to use a ReadPrec / ReadP parser (built using Text.ParserCombinators.ReadP[rec]) instead of a parsec parser and define readPrec instead of readsPrec.

Underlying Parsec Monad

Many of the Parsec combinators I use are of a type such as:
foo :: CharParser st Foo
CharParser is defined here as:
type CharParser st = GenParser Char st
CharParser is thus a type synonym involving GenParser, itself defined here as:
type GenParser tok st = Parsec [tok] st
GenParser is then another type synonym, assigned using Parsec, defined here as:
type Parsec s u = ParsecT s u Identity
So Parsec is a partial application of ParsecT, itself listed here with type:
data ParsecT s u m a
along with the words:
"ParsecT s u m a is a parser with stream type s, user state type u,
underlying monad m and return type a."
What is the underlying monad? In particular, what is it when I use the CharParser parsers? I can't see where it's inserted in the stack. Is there a relationship to the use of the list monad in Monadic Parsing in Haskell to return multiple successful parses from an ambiguous parser?
In your case the underlying monad is Identity. However ParsecT is different from most monad transformers in that it is an instance of the Monad class even if the type parameter m is not. If you look at the source code you will note the lack of "(Monad m) =>" in the instance declaration.
So then you ask yourself, "If I were to have a non-trivial monad stack, where would it be used?"
There are three answers to that question:
It is used to uncons the next token out of the stream:
class (Monad m) => Stream s m t | s -> t where
uncons :: s -> m (Maybe (t,s))
Notice that uncons takes an s (the stream of tokens t) and returns its result wrapped in your monad. This allows one to do interesting things during the process of getting the next token.
It is used in the resulting output of each parser. This means you can create parsers that don't touch the input but take action in the underlying monad, and use the combinators to bind them to regular parsers. In other words, lift (x :: m a) :: ParsecT s u m a (see the sketch after this list).
Finally, the end results of runParsecT and friends (until you build up to the point where m is replaced by Identity) are returned wrapped in this monad.
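As a small sketch of the second point above (tracedDigit is a made-up name): with IO as the underlying monad, an IO action can be lifted into the parser and sequenced with ordinary combinators.
import Text.Parsec
import Control.Monad.Trans.Class (lift)

-- Parse a digit and log it via the underlying IO monad.
tracedDigit :: ParsecT String () IO Char
tracedDigit = do
  d <- digit
  lift (putStrLn ("saw digit " ++ [d]))
  return d

-- runParserT (many1 tracedDigit) () "demo" "123"
--   :: IO (Either ParseError String)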
There is not a relationship between this monad and the one from Monadic Parsing in Haskell. In this case Hutton and Meijer are referring to the monad instance for ParsecT itself. The fact that in Parsec-3.0.0 and beyond ParsecT has become a monad transformer with an underlying monad is not relevant to the paper.
What I think you are looking for however is where the list of possible results went. In Hutton and Meijer the parser returns a list of all possible results while Parsec stubbornly returns only one. I think you are looking at the m in the result and thinking to yourself that the list of results must be hiding in there somewhere. It is not.
Parsec, for reasons of efficiency, made a choice to prefer the first matching result in Hutton and Meijer's list of results. This lets it toss away both the unused results in the tail of Hutton and Meijer's list and also the front of the stream of tokens, because we never backtrack. In Parsec, given the combined parser a <|> b, if a consumes any input, b will never be evaluated. The way around this is try, which will reset the state back to where it was if a fails, and then evaluate b.
You asked in the comments if this was done using Maybe or Either. The answer is "almost but not quite." If you look at the low-level run* functions you see that they return an algebraic type which tells whether input was consumed, and then a second one which gives either the result or an error message. These types work kind of like Either, but even they are not used directly. Rather than stretch this out further, I'll refer you to the post by Antoine Latter that explains how this works and why it is done this way.
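For reference, those result types look roughly like this (paraphrased from parsec 3's Text.Parsec.Prim; strictness annotations and minor details are omitted):
data Consumed a  = Consumed a                    -- the parser consumed input
                 | Empty a                       -- it did not

data Reply s u a = Ok a (State s u) ParseError   -- success: value, new state, deferred error
                 | Error ParseError              -- failure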
GenParser is defined in terms of Parsec, not ParsecT. Parsec in turn is defined as
type Parsec s u = ParsecT s u Identity
So the answer is that when using CharParser the underlying monad is the Identity monad.
