When would a parser append its input string? - parsing

Given a parser
newtype Parser a = Parser { parse :: String -> [(a,String)] }
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
p >>= f = Parser $ \s -> concat [ parse (f a) s' | (a, s') <- parse p s ]
return :: a -> Parser a
return a = Parser (\s -> [(a,s)])
item :: Parser Char
item = Parser $ \s -> case cs of
"" -> []
(c:cs) -> [(c,cs)]
We can see that item consumes part of the input string given to it ("abc" -> [('a', "bc")]). Is there ever a case where a parser would produce additional string output or replace/modify it (e.g. Parser $ \s -> [((), 'a':s)])? I suspect that this might be the case with context-sensitive grammars but have trouble coming up with a sensible example.
Is there a reason why it would make sense to do this for a real-world problem?
References
Monadic Parsing in Haskell

Here are a couple of cases where it is convenient to inject tokens into the input stream. (How this is actually integrated into the parsing pipeline is another question.)
Macro expansion, in the style of the C/C++ preprocessing phase. This is arguably not the best model for macro expansion; hygienic macros would more likely be expanded using a tree transformation, as with C++ template resolution. But the token-oriented preprocessor is not going away soon. Since it is not tightly coupled with the language syntax, the easiest implementation is to substitute the macro (and arguments if applicable) with the tokens from its expansion.
Ecmascript-style automatic semi-colon insertion (ASI). The language syntax requires a semi-colon to be inserted into the token stream under certain precisely-defined circumstances, which are difficult (at least) to incorporate in a CFG. Since ASI is only possible if the next token in the input stream cannot be shifted (and done other conditions), it can certainly be integrated into the parser loop.
Similarly, indentation-aware block syntax (as in Haskell and Python, for example) can certainly be implemented by replacing leading whitespace with an injected INDENT token or some number of injected DEDENTs. Since this substitution is dependent on parse context (it isn't done inside parentheses, for example), injection inside the parser could be a reasonable approach.
That's not an exhaustive list, but it might be at least indicative. Not all of those cases necessarily involve context-sensitivity (I believe ASI could, in theory, be handled with a context-free grammar although I have no intention of trying) and not all instances of context-sensitivity necessarily require token injection (the ambiguity in C between type and variable names only requires selecting the correct token).

Related

Generate parser that runs a received parser on the output of another parser and monadically joins the results

given the following type and function, meant to parse a field of a CSV field into a string:
type Parser resultType = ParsecT String () Identity resultType
cell :: Parser String
I have implemented the following function:
customCell :: String -> Parser res -> Parser res
customCell typeName subparser =
cell
>>= either (const $ unexpected typeName)
return . parse (subparser <* eof) ""
Though I cannot stop thinking that I am not using the Monad concept as much as desired and that eventually there is a better way to merge the result of the inner with the outer parser, specially on what regards its failure.
Does anybody know how could I do so, or is this code what is meant to be done?
PS - I now realised that my type simplification is probably not appropriate and that maybe what I want is to replace the underlying Identity Monad by the Either Monad.... Unfortunately, I do not feel enough acquainted with monad transformers yet.
PS2 - What the hell is the underlying monad good for anyway?
Elaborating on #Daniel Wagner's answer... The way parsers are normally built with Parsec, you start with low-level parsers that parse specific characters (e.g., a plus sign or a digit), and you build parsers on top of them using combinators (like a many1 combinator that turns a parser that reads a single digit into a parser that reads one or more digits, or a monadic parse that parsers "one or more digits" followed by a "plus sign" followed by "one or more digits").
However, each parser, whether it's a low-level digit parser or a higher-level "addition expression" parser, is intended to be applied directly to the same input stream.
What you don't typically do is write a parser that gobbles a chunk of the input stream to produce, say, a String and another parser that parses that String (instead of the original input stream) and try to combine them. This is the kind of "vertical composition" that isn't directly supported by Parsec and looks unnatural and non-monadic.
As pointed out in the comments, there are some situations where vertical composition is the cleanest overall approach (like when you have one language embedded within the components or expressions of another language), but it's not the usual approach taken by a Parsec parser.
The bottom line in your application is that a cell parser that produces only a String is too specialized to be useful. A more useful Parsec framework for CSV files would be:
import Text.Parsec
import Text.Parsec.String
-- | `csv cell` parses a CSV file each of whose elements is parsed by `cell`
csv :: Parser a -> Parser [[a]]
csv cell = many (row cell)
-- | `row cell` parses a newline-terminated row of comma separated
-- `cell`-expressions
row :: Parser a -> Parser [a]
row cell = sepBy cell (char ',') <* char '\n'
Now, you can write a custom cell parser that parses positive integers:
customCell :: Parser Int
customCell = read <$> many1 digit
and parse CSV files:
> parse (csv customCell) "" "1,2,3\n4,5,6\n"
Right [[1,2,3],[4,5,6]]
>
Here, instead of having a cell subparser that explicitly parses a comma-delimited cell into a string to be fed to a different parser, the "cell" is an implicit context in which a supplied cell parser is called to parse the underlying input stream at the appropriate point where one would expect a comma-delimited cell in the middle of a row in the middle of the input stream.
Sadly I know of no parser library or parser generator for Haskell that supports vertical parser composition like this. Something like what you wrote is about as good as it gets. Dang!

How to parse arbitrary lists with Haskell parsers?

Is it possible to use one of the parsing libraries (e.g. Parsec) for parsing something different than a String? And how would I do this?
For the sake of simplicity, let's assume the input is a list of ints [Int]. The task could be
drop leading zeros
parse the rest into the pattern (S+L+)*, where S is a number less than 10, and L is a number larger or equal to ten.
return a list of tuples (Int,Int), where fst is the product of the S and snd is the product of the L integers
It would be great if someone could show how to write such a parser (or something similar).
Yes, as user5402 points out, Parsec can parse any instance of Stream, including arbitrary lists. As there are no predefined token parsers (as there are for text) you have to roll your own, (myToken below) using e.g. tokenPrim
The only thing I find a bit awkward is the handling of "source positions". SourcePos is an abstract type (rather than a type class) and forces me to use its "filename/line/column" format, which feels a bit unnatural here.
Anyway, here is the code (without the skipping of leading zeroes, for brevity)
import Text.Parsec
myToken :: (Show a) => (a -> Bool) -> Parsec [a] () a
myToken test = tokenPrim show incPos $ justIf test where
incPos pos _ _ = incSourceColumn pos 1
justIf test x = if (test x) then Just x else Nothing
small = myToken (< 10)
large = myToken (>= 10)
smallLargePattern = do
smallints <- many1 small
largeints <- many1 large
let prod = foldl1 (*)
return (prod smallints, prod largeints)
myIntListParser :: Parsec [Int] () [(Int,Int)]
myIntListParser = many smallLargePattern
testMe :: [Int] -> [(Int, Int)]
testMe xs = case parse myIntListParser "your list" xs of
Left err -> error $ show err
Right result -> result
Trying it all out:
*Main> testMe [1,2,55,33,3,5,99]
[(2,1815),(15,99)]
*Main> testMe [1,2,55,33,3,5,99,1]
*** Exception: "your list" (line 1, column 9):
unexpected end of input
Note the awkward line/column format in the error message
Of course one could write a function sanitiseSourcePos :: SourcePos -> MyListPosition
There is very likely a way to get Parsec to use [a] as the stream type, but the idea behind parser combinators is actually very simple, and it's not very difficult to roll your own library.
A very accessible resource I would recommend is Monadic Parsing in Haskell by Graham Hutton and Erik Meijer.
Indeed, right now Erik Meijer is teaching an intro Haskell/functional programming course on edx.org (link) and Lecture 7 is all about functional parsers. As he states in the intro to the lecture:
"... No one can follow the path towards mastering functional programming without writing their own parser combinator library. We start by explaining what parsers are and how they can naturally be viewed as side-effecting functions. Next we define a number of basic parsers and higher-order functions for combining parsers. ..."

Cannot compute minimal length of a parser - uu-parsinglib in Haskell

Lets see the code snippet:
pSegmentBegin p i = pIndentExact i *> ((:) <$> p i <*> ((pEOL *> pSegment p i) <|> pure []))
if I change this code in my parser to:
pSegmentBegin p i = do
pIndentExact i
((:) <$> p i <*> ((pEOL *> pSegment p i) <|> pure []))
I've got an error:
canot compute minmal length of a parser due to occurrence of a moadic bind, use addLength to override
I thought the above parser should behave the same way. Why this error can occur?
EDIT
The above example is very simple (to simplify the question) and as noted below it is not necessary to use do notation here, but the real case I wanted it to use is as follows:
pSegmentBegin p i = do
j <- pIndentAtLast i
(:) <$> p j <*> ((pEOL *> pSegments p j) <|> pure [])
I have noticed that adding "addLength 1" before the do statement solves the problem, but I'm unsure if its a correct solution:
pSegmentBegin p i = addLength 2 $ do
j <- pIndentAtLast i
(:) <$> p j <*> ((pEOL *> pSegments p j) <|> pure [])
As I have mentioned many times the monadic interface should be avoided whenever possible. let me try to explain why the applicative interface is to be preferred.
One of the distinctive features of my library is that it performs error correction by inserting or deleting problems. Of course we can take an umlimited look-ahead here but that would make the process VERY expensive. So we take only a limited lookahead of three steps.
Now suppose we have to insert an expression and one of the expression alternatives is:
expr := "if" expr "then" expr "else" expr
then we want to exclude this alternative since choosing this alternative would necessitate the insertion of another expression etc. So we perform an abstract interpretation of the alternatives and make sure that in case of a draw (i.e. equal costs for the limited lookahead) we take one of the non-recursive alternatives.
Unfortunately this scheme breaks down when one writes monadic parsers, since the length of the right hand side of the bind may depend on the result of the left-hand side. So we issue the error message, and ask some help from the programmer to indicate the number of tokens this alternative might consume. The actual value does not matter so much, as long as you do not provide a finite length for something which is recursive and may lead to infinite insertions. It is only used to select the shortest alternative in case of an insertion.
This abstract interpretation has some costs and if you write all your parsers in monadic style it is unavoidable that this analysis is repeated over an over again. so: DO NOT WRITE MONADIC STYLE PARSERS WHEN USING THIS LIBRARY IF THERE IS AN APPLICATIVE ALTERNATIVE.
It's trying to statically analyze how much input needs to be read in order to optimize performance, but that kind of optimization requires a statically known parser structure—the kind that can be built by Applicatives since the parser effect cannot depend upon the parser value such what (>>=) does.
So that's what goes wrong—when you use do notation it translates to a Monadic bind which breaks the Applicative predictor. It'd be nice if the library only exposed one of the two interfaces so that this kind of error cannot happen, but instead there's some inconsistency if you use both interfaces together in the same parser.
Since this use of do is strictly unnecessary—you're not using the extra power the monadic interface gives you—it's probably better to just avoid it.
I have a workaround I use with monadic parsers in uuparsinglib. Its a self-answer here:
Monadic parse with uu-parsinglib
You may find it useful

Using Parsec to write a Read instance

Using Parsec, I'm able to write a function of type String -> Maybe MyType with relative ease. I would now like to create a Read instance for my type based on that; however, I don't understand how readsPrec works or what it is supposed to do.
My best guess right now is that readsPrec is used to build a recursive parser from scratch to traverse a string, building up the desired datatype in Haskell. However, I already have a very robust parser who does that very thing for me. So how do I tell readsPrec to use my parser? What is the "operator precedence" parameter it takes, and what is it good for in my context?
If it helps, I've created a minimal example on Github. It contains a type, a parser, and a blank Read instance, and reflects quite well where I'm stuck.
(Background: The real parser is for Scheme.)
However, I already have a very robust parser who does that very thing for me.
It's actually not that robust, your parser has problems with superfluous parentheses, it won't parse
((1) (2))
for example, and it will throw an exception on some malformed inputs, because
singleP = Single . read <$> many digit
may use read "" :: Int.
That out of the way, the precedence argument is used to determine whether parentheses are necessary in some place, e.g. if you have
infixr 6 :+:
data a :+: b = a :+: b
data C = C Int
data D = D C
you don't need parentheses around a C 12 as an argument of (:+:), since the precedence of application is higher than that of (:+:), but you'd need parentheses around C 12 as an argument of D.
So you'd usually have something like
readsPrec p = needsParens (p >= precedenceLevel) someParser
where someParser parses a value from the input without enclosing parentheses, and needsParens True thing parses a thing between parentheses, while needsParens False thing parses a thing optionally enclosed in parentheses [you should always accept more parentheses than necessary, ((((((1)))))) should parse fine as an Int].
Since the readsPrec p parsers are used to parse parts of the input as parts of the value when reading lists, tuples etc., they must return not only the parsed value, but also the remaining part of the input.
With that, a simple way to transform a parsec parser to a readsPrec parser would be
withRemaining :: Parser a -> Parser (a, String)
withRemaining p = (,) <$> p <*> getInput
parsecToReadsPrec :: Parser a -> Int -> ReadS a
parsecToReadsPrec parsecParser prec input
= case parse (withremaining $ needsParens (prec >= threshold) parsecParser) "" input of
Left _ -> []
Right result -> [result]
If you're using GHC, it may however be preferable to use a ReadPrec / ReadP parser (built using Text.ParserCombinators.ReadP[rec]) instead of a parsec parser and define readPrec instead of readsPrec.

How do I do python-style indent/dedent tokens with alex/haskell?

I'm writing a lexer for a small language in Alex with Haskell.
The language is specified to have pythonesque significant indentation, with an INDENT token or a DEDENT token emitted whenever the indentation level changes.
In a traditional imperative language like C, you'd keep a global in the lexer and update it with the indentation level at each line.
This doesn't work in Alex/Haskell because I can't store any global data anywhere with Haskell, and I can't put all my lexing rules inside any monad or anything.
So, how can I do this? Is it even possible? Or will i have to write my own lexer and avoid using alex?
Note that in other whitespace-sensitive languages -- like Haskell -- the layout handling is indeed done in the lexer. GHC in fact implements layout handling in Alex. Here's the source:
https://github.com/ghc/ghc/blob/master/compiler/GHC/Parser/Lexer.x
There are some serious errors in your question that lead you astray, as jrockway points out. "I can't store any global data anywhere with Haskell" is on the wrong track. Firstly, you can have global state, secondly, you should not be using global state here, when Alex fully supports state transitions in rules in a safe manner.
Look at the AlexState structure that Alex provides, letting you thread state through your lexer. Then, look at how the state is used in GHC's layout implementation to implement indent/unindent of the layout rules. (Search for "-- Layout processing" in GHC's lexer to see how the state is pushed and popped).
I can't store any global data anywhere with Haskell
This is not true; in most cases something like the State monad is sufficient, but there is also the ST monad.
You don't need global state for this task, however. Writing a parser consists of two parts; lexical analysis and syntax analysis. The lexical analysis just turns a stream of characters into a stream of meaningful tokens. The syntax analysis turns tokens into an AST; this is where you should deal with indentation.
As you are interpreting the indentation, you will call a handler function as the indentation level changes -- when it increases (nesting), you call your handler function (perhaps with one arg incremented, if you want to track the indentation level); when the level decreases, you simply return the relevant AST portion from the function.
(As an aside, using a global variable for this is something that would not occur to me in an imperative language either -- if anything, it's an instance variable. The State monad is very similar conceptually to this.)
Finally, I think the phrase "I can't put all my lexing rules inside any monad" indicates some sort of misunderstanding of monads. If I needed to parse and keep global state, my code would look like:
data AST = ...
type Step = State Int AST
parseFunction :: Stream -> Step
parseFunction s = do
level <- get
...
if anotherFunction then put (level + 1) >> parseFunction ...
else parseWhatever
...
return node
parse :: Stream -> Step
parse s = do
if looksLikeFunction then parseFunction ...
main = runState parse 0 -- initial nesting of 0
Instead of combining function applications with (.) or ($), you combine them with (>>=) or (>>). Other than that, the algorithm is the same. (There is no "monad" to be "inside".)
Finally, you might like applicative functors:
eval :: Environment -> Node -> Evaluated
eval e (Constant x) = Evaluated x
eval e (Variable x) = Evaluated (lookup e x)
eval e (Function f x y) = (f <$> (`eval` x) <*> (`eval` y)) e
(or
eval e (Function f x y) = ((`eval` f) <*> (`eval` x) <*> (`eval` y)) e
if you have something like "funcall"... but I digress.)
There is plenty of literature on parsing with applicative functors, monads, and arrows; all of which have the potential to solve your problem. Read up on those and see what you get.

Resources