How do I do python-style indent/dedent tokens with alex/haskell? - parsing

I'm writing a lexer for a small language in Alex with Haskell.
The language is specified to have pythonesque significant indentation, with an INDENT token or a DEDENT token emitted whenever the indentation level changes.
In a traditional imperative language like C, you'd keep a global in the lexer and update it with the indentation level at each line.
This doesn't work in Alex/Haskell because I can't store any global data anywhere with Haskell, and I can't put all my lexing rules inside any monad or anything.
So, how can I do this? Is it even possible? Or will i have to write my own lexer and avoid using alex?

Note that in other whitespace-sensitive languages -- like Haskell -- the layout handling is indeed done in the lexer. GHC in fact implements layout handling in Alex. Here's the source:
https://github.com/ghc/ghc/blob/master/compiler/GHC/Parser/Lexer.x
There are some serious errors in your question that lead you astray, as jrockway points out. "I can't store any global data anywhere with Haskell" is on the wrong track. Firstly, you can have global state, secondly, you should not be using global state here, when Alex fully supports state transitions in rules in a safe manner.
Look at the AlexState structure that Alex provides, letting you thread state through your lexer. Then, look at how the state is used in GHC's layout implementation to implement indent/unindent of the layout rules. (Search for "-- Layout processing" in GHC's lexer to see how the state is pushed and popped).

I can't store any global data anywhere with Haskell
This is not true; in most cases something like the State monad is sufficient, but there is also the ST monad.
You don't need global state for this task, however. Writing a parser consists of two parts; lexical analysis and syntax analysis. The lexical analysis just turns a stream of characters into a stream of meaningful tokens. The syntax analysis turns tokens into an AST; this is where you should deal with indentation.
As you are interpreting the indentation, you will call a handler function as the indentation level changes -- when it increases (nesting), you call your handler function (perhaps with one arg incremented, if you want to track the indentation level); when the level decreases, you simply return the relevant AST portion from the function.
(As an aside, using a global variable for this is something that would not occur to me in an imperative language either -- if anything, it's an instance variable. The State monad is very similar conceptually to this.)
Finally, I think the phrase "I can't put all my lexing rules inside any monad" indicates some sort of misunderstanding of monads. If I needed to parse and keep global state, my code would look like:
data AST = ...
type Step = State Int AST
parseFunction :: Stream -> Step
parseFunction s = do
level <- get
...
if anotherFunction then put (level + 1) >> parseFunction ...
else parseWhatever
...
return node
parse :: Stream -> Step
parse s = do
if looksLikeFunction then parseFunction ...
main = runState parse 0 -- initial nesting of 0
Instead of combining function applications with (.) or ($), you combine them with (>>=) or (>>). Other than that, the algorithm is the same. (There is no "monad" to be "inside".)
Finally, you might like applicative functors:
eval :: Environment -> Node -> Evaluated
eval e (Constant x) = Evaluated x
eval e (Variable x) = Evaluated (lookup e x)
eval e (Function f x y) = (f <$> (`eval` x) <*> (`eval` y)) e
(or
eval e (Function f x y) = ((`eval` f) <*> (`eval` x) <*> (`eval` y)) e
if you have something like "funcall"... but I digress.)
There is plenty of literature on parsing with applicative functors, monads, and arrows; all of which have the potential to solve your problem. Read up on those and see what you get.

Related

John Hughes' Deterministic LL(1) parsing with Arrow and errors

I wanted to write a parser based on John Hughes' paper Generalizing Monads to Arrows. When reading through and trying to reimplement his code I realized there were some things that didn't quite make sense. In one section he lays out a parser implementation based on Swierstra and Duponchel's paper Deterministic, error-correcting combinator parsers using Arrows. The parser type he describes looks like this:
data StaticParser ch = SP Bool [ch]
data DynamicParser ch a b = DP (a, [ch]) -> (b, [ch])
data Parser ch a b = P (StaticParser ch) (DynamicParser ch a b)
with the composition operator looking something like this:
(.) :: Parser ch b c -> Parser ch a b -> Parser ch a c
P (SP e2 st2) (DP f2) . P (SP e1 st1) (DP f1) =
P (SP (e1 && e2) (st1 `union` if e1 then st2 else []))
(DP $ f2 . f1)
The issue is that the composition of parsers q . p 'forgets' q's starting symbols. One possible interpretation I thought of is that Hughes' expects all our DynamicParsers to be total such that a symbol parser's type signature would be symbol :: ch -> Parser ch a (Maybe ch) instead of symbol :: ch -> Parser ch a ch. This still seems awkward though since we have to duplicate information putting starting symbol information in both the StaticParser and DynamicParser. Another issue is that almost all parsers will have the potential to throw which means we will have to spend a lot of time inside Maybe or Either creating what is essentially the "monads do not compose problem." This could be remedied by rewriting DynamicParser itself to handle failure or as an Arrow transformer, but this is straying quite a bit from the paper. None of these issues are addressed in the paper, and the Parser is presented as if it obviously works, so I feel like I must me missing something basic. If someone can catch what I missed that would be super helpful.
I think the deterministic parsers described by Swierstra and Duponcheel are a bit different from traditional parsers: they do not handle failure at all, only choice.
See also the invokeDet function in the S&D paper:
invokeDet :: Symbol s => DetPar s a -> Input s -> a
invokeDet (_, p) inp = case p inp [] of (a, _) -> a
This function clearly assumes it will always be able to find a valid parse.
With the arrow version of the parsers described by Hughes you can write a examples like this:
main = do
let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
print $ invokeDet p "ab"
print $ invokeDet p "ac"
Which will print the expected:
'b'
'c'
However, if you write a "failing" parse:
main = do
let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
print $ invokeDet p "ad"
It will still print:
'c'
To make this behavior a bit more sensible, Swierstra and Duponcheel also introduce error-correction. The output 'c' is expected if we assume the erroneous character d has been corrected to be a c in the input. This requires an extra mechanism which presumably was too complicated to include in Hughes' paper.
I have uploaded the implementation I used to get these results here: https://gist.github.com/noughtmare/eced4441332784cc8212e9c0adb68b35
For more information about a more practical parser in the same style (but no longer deterministic and no longer limited to LL(1)) I really like the "Combinator Parsing: A Short Tutorial" by Swierstra. An interesting excerpt from section 9.3:
A subtle point here is the question how to deal with monadic parsers. As we described in [13] the static analysis does not go well with monadic computations, since in that case we dynamically build new parses based on the input produced thus far: the whole idea of a static analysis is that it is static. This observation has lead John Hughes to propose arrows for dealing with such situations [7]. It is only recently that we realised that, although our arguments still hold in general, they do not apply to the case of the LL(1) analysis. If we want to compute the symbols which can be recognised as the first symbol by a parser of the form p >>= q then we are only interested in the starting symbols of the right hand side if the left hand side can recognise the empty string; the good news is that in that case we statically know what value will be returned as a witness, and can pass this value on to q, and analyse the result of this call statically too. Unfortunately we will have to take special precautions in case the left hand side operator contains a call to pErrors in one of the empty derivations, since then it is no longer true that the witness of this alternative can be determined statically.
The full parser implementation by Swierstra can be found in the uu-parsinglib package, although I do not know how many of the extensions are implemented there.

When would a parser append its input string?

Given a parser
newtype Parser a = Parser { parse :: String -> [(a,String)] }
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
p >>= f = Parser $ \s -> concat [ parse (f a) s' | (a, s') <- parse p s ]
return :: a -> Parser a
return a = Parser (\s -> [(a,s)])
item :: Parser Char
item = Parser $ \s -> case cs of
"" -> []
(c:cs) -> [(c,cs)]
We can see that item consumes part of the input string given to it ("abc" -> [('a', "bc")]). Is there ever a case where a parser would produce additional string output or replace/modify it (e.g. Parser $ \s -> [((), 'a':s)])? I suspect that this might be the case with context-sensitive grammars but have trouble coming up with a sensible example.
Is there a reason why it would make sense to do this for a real-world problem?
References
Monadic Parsing in Haskell
Here are a couple of cases where it is convenient to inject tokens into the input stream. (How this is actually integrated into the parsing pipeline is another question.)
Macro expansion, in the style of the C/C++ preprocessing phase. This is arguably not the best model for macro expansion; hygienic macros would more likely be expanded using a tree transformation, as with C++ template resolution. But the token-oriented preprocessor is not going away soon. Since it is not tightly coupled with the language syntax, the easiest implementation is to substitute the macro (and arguments if applicable) with the tokens from its expansion.
Ecmascript-style automatic semi-colon insertion (ASI). The language syntax requires a semi-colon to be inserted into the token stream under certain precisely-defined circumstances, which are difficult (at least) to incorporate in a CFG. Since ASI is only possible if the next token in the input stream cannot be shifted (and done other conditions), it can certainly be integrated into the parser loop.
Similarly, indentation-aware block syntax (as in Haskell and Python, for example) can certainly be implemented by replacing leading whitespace with an injected INDENT token or some number of injected DEDENTs. Since this substitution is dependent on parse context (it isn't done inside parentheses, for example), injection inside the parser could be a reasonable approach.
That's not an exhaustive list, but it might be at least indicative. Not all of those cases necessarily involve context-sensitivity (I believe ASI could, in theory, be handled with a context-free grammar although I have no intention of trying) and not all instances of context-sensitivity necessarily require token injection (the ambiguity in C between type and variable names only requires selecting the correct token).

How to parse arbitrary lists with Haskell parsers?

Is it possible to use one of the parsing libraries (e.g. Parsec) for parsing something different than a String? And how would I do this?
For the sake of simplicity, let's assume the input is a list of ints [Int]. The task could be
drop leading zeros
parse the rest into the pattern (S+L+)*, where S is a number less than 10, and L is a number larger or equal to ten.
return a list of tuples (Int,Int), where fst is the product of the S and snd is the product of the L integers
It would be great if someone could show how to write such a parser (or something similar).
Yes, as user5402 points out, Parsec can parse any instance of Stream, including arbitrary lists. As there are no predefined token parsers (as there are for text) you have to roll your own, (myToken below) using e.g. tokenPrim
The only thing I find a bit awkward is the handling of "source positions". SourcePos is an abstract type (rather than a type class) and forces me to use its "filename/line/column" format, which feels a bit unnatural here.
Anyway, here is the code (without the skipping of leading zeroes, for brevity)
import Text.Parsec
myToken :: (Show a) => (a -> Bool) -> Parsec [a] () a
myToken test = tokenPrim show incPos $ justIf test where
incPos pos _ _ = incSourceColumn pos 1
justIf test x = if (test x) then Just x else Nothing
small = myToken (< 10)
large = myToken (>= 10)
smallLargePattern = do
smallints <- many1 small
largeints <- many1 large
let prod = foldl1 (*)
return (prod smallints, prod largeints)
myIntListParser :: Parsec [Int] () [(Int,Int)]
myIntListParser = many smallLargePattern
testMe :: [Int] -> [(Int, Int)]
testMe xs = case parse myIntListParser "your list" xs of
Left err -> error $ show err
Right result -> result
Trying it all out:
*Main> testMe [1,2,55,33,3,5,99]
[(2,1815),(15,99)]
*Main> testMe [1,2,55,33,3,5,99,1]
*** Exception: "your list" (line 1, column 9):
unexpected end of input
Note the awkward line/column format in the error message
Of course one could write a function sanitiseSourcePos :: SourcePos -> MyListPosition
There is very likely a way to get Parsec to use [a] as the stream type, but the idea behind parser combinators is actually very simple, and it's not very difficult to roll your own library.
A very accessible resource I would recommend is Monadic Parsing in Haskell by Graham Hutton and Erik Meijer.
Indeed, right now Erik Meijer is teaching an intro Haskell/functional programming course on edx.org (link) and Lecture 7 is all about functional parsers. As he states in the intro to the lecture:
"... No one can follow the path towards mastering functional programming without writing their own parser combinator library. We start by explaining what parsers are and how they can naturally be viewed as side-effecting functions. Next we define a number of basic parsers and higher-order functions for combining parsers. ..."

Cannot compute minimal length of a parser - uu-parsinglib in Haskell

Lets see the code snippet:
pSegmentBegin p i = pIndentExact i *> ((:) <$> p i <*> ((pEOL *> pSegment p i) <|> pure []))
if I change this code in my parser to:
pSegmentBegin p i = do
pIndentExact i
((:) <$> p i <*> ((pEOL *> pSegment p i) <|> pure []))
I've got an error:
canot compute minmal length of a parser due to occurrence of a moadic bind, use addLength to override
I thought the above parser should behave the same way. Why this error can occur?
EDIT
The above example is very simple (to simplify the question) and as noted below it is not necessary to use do notation here, but the real case I wanted it to use is as follows:
pSegmentBegin p i = do
j <- pIndentAtLast i
(:) <$> p j <*> ((pEOL *> pSegments p j) <|> pure [])
I have noticed that adding "addLength 1" before the do statement solves the problem, but I'm unsure if its a correct solution:
pSegmentBegin p i = addLength 2 $ do
j <- pIndentAtLast i
(:) <$> p j <*> ((pEOL *> pSegments p j) <|> pure [])
As I have mentioned many times the monadic interface should be avoided whenever possible. let me try to explain why the applicative interface is to be preferred.
One of the distinctive features of my library is that it performs error correction by inserting or deleting problems. Of course we can take an umlimited look-ahead here but that would make the process VERY expensive. So we take only a limited lookahead of three steps.
Now suppose we have to insert an expression and one of the expression alternatives is:
expr := "if" expr "then" expr "else" expr
then we want to exclude this alternative since choosing this alternative would necessitate the insertion of another expression etc. So we perform an abstract interpretation of the alternatives and make sure that in case of a draw (i.e. equal costs for the limited lookahead) we take one of the non-recursive alternatives.
Unfortunately this scheme breaks down when one writes monadic parsers, since the length of the right hand side of the bind may depend on the result of the left-hand side. So we issue the error message, and ask some help from the programmer to indicate the number of tokens this alternative might consume. The actual value does not matter so much, as long as you do not provide a finite length for something which is recursive and may lead to infinite insertions. It is only used to select the shortest alternative in case of an insertion.
This abstract interpretation has some costs and if you write all your parsers in monadic style it is unavoidable that this analysis is repeated over an over again. so: DO NOT WRITE MONADIC STYLE PARSERS WHEN USING THIS LIBRARY IF THERE IS AN APPLICATIVE ALTERNATIVE.
It's trying to statically analyze how much input needs to be read in order to optimize performance, but that kind of optimization requires a statically known parser structure—the kind that can be built by Applicatives since the parser effect cannot depend upon the parser value such what (>>=) does.
So that's what goes wrong—when you use do notation it translates to a Monadic bind which breaks the Applicative predictor. It'd be nice if the library only exposed one of the two interfaces so that this kind of error cannot happen, but instead there's some inconsistency if you use both interfaces together in the same parser.
Since this use of do is strictly unnecessary—you're not using the extra power the monadic interface gives you—it's probably better to just avoid it.
I have a workaround I use with monadic parsers in uuparsinglib. Its a self-answer here:
Monadic parse with uu-parsinglib
You may find it useful

Underlying Parsec Monad

Many of the Parsec combinators I use are of a type such as:
foo :: CharParser st Foo
CharParser is defined here as:
type CharParser st = GenParser Char st
CharParser is thus a type synonym involving GenParser, itself defined here as:
type GenParser tok st = Parsec [tok] st
GenParser is then another type synonym, assigned using Parsec, defined here as:
type Parsec s u = ParsecT s u Identity
So Parsec is a partial application of ParsecT, itself listed here with type:
data ParsecT s u m a
along with the words:
"ParsecT s u m a is a parser with stream type s, user state type u,
underlying monad m and return type a."
What is the underlying monad? In particular, what is it when I use the CharParser parsers? I can't see where it's inserted in the stack. Is there a relationship to the use of the list monad in Monadic Parsing in Haskell to return multiple successful parses from an ambiguous parser?
In your case the underlying monad is Identity. However ParsecT is different from most monad transformers in that it is an instance of the Monad class even if the type parameter m is not. If you look at the source code you will note the lack of "(Monad m) =>" in the instance declaration.
So then you ask yourself, "If I were to have a non-trivial monad stack, where would it be used?"
There are a three of answers to that question:
It is used to uncons the next token out of the stream:
class (Monad m) => Stream s m t | s -> t where
uncons :: s -> m (Maybe (t,s))
Notice that uncons takes an s (the stream of tokens t) and returns its result wrapped in your monad. This allows one to do interesting thing while or even during the process of getting the next token.
It is used in the resulting output of each parser. This means you can create parsers that don't touch the input but take action in the underlying monad and use the combinators to bind them to regular parsers. In other words, lift (x :: m a) :: ParsecT s u m a.
Finally, the end result of RunParsecT and friends (until you build up to the point where m is replaced by Identity) return their results wrapped in this monad.
There is not a relationship between this monad and the one from Monadic Parsing in Haskell. In this case Hutton and Meijer are referring to the monad instance for ParsecT itself. The fact that in Parsec-3.0.0 and beyond ParsecT has become a monad transformer with an underlying monad is not relevant to the paper.
What I think you are looking for however is where the list of possible results went. In Hutton and Meijer the parser returns a list of all possible results while Parsec stubbornly returns only one. I think you are looking at the m in the result and thinking to yourself that the list of results must be hiding in there somewhere. It is not.
Parsec, for reasons of efficiency, made a choice to prefer the first matching result in Hutton and Meijer's list of results. This let's it toss away both the unused results in the tail of Hutton and Meijer's list and also the front of the stream of tokens because we never backtrack. In parsec, given the combined parser a <|> b, if a consumes any input b will never be evaluated. The way around this is try which will reset the state back to where it was if a fails then evaluate b.
You asked in the comments if this was done using Maybe or Either. The answer is "almost but not quite." If you look at the low lever run* functions you see that they return an Algebraic type which tell weather input was consumed then a second which give either the result or an error message. These types work kind of like Either, but even they are not used directly. Rather then stretch this out further, I'll refer you to the post by Antoine Latter that explains how this works and why it is done this way.
GenParser is defined in terms of Parsec, not ParsecT. Parsec in turn is defined as
type Parsec s u = ParsecT s u Identity
So the answer is that when using CharParser the underlying monad is the Identity monad.

Resources