How to handle selection sets of grammars that have multiple derivation trees

I need to write a parse table for a grammar that has lambda (empty) productions and multiple derivation trees. I am having trouble finding examples of parse tables for grammars with lambda productions. How should I start?
My attempt at the selection sets is as follows:
S -> ABCe = {b, c, d, e}
(A -> bB gives b; A -> lambda, B -> cC gives c; A -> lambda, B -> lambda, C -> d gives d; A -> lambda, B -> lambda, C -> lambda gives e.)
A -> bB = {b}
A -> lambda = {c, d, e}
B -> cC = {c}
B -> lambda = {d, e}
C -> d = {d}
C -> lambda = {e}
There are two problems I am having:
1) I don't know how to represent lambda in the parse table or in the actual parsing code.
2) If lambda productions are chosen based on what follows in the string, then the current token determines what gets popped, right? For example, if S goes to ABCe and the current token is b, would I push e and then push C? I also just realized something else about the lambdas: the selection sets for B's two productions are mutually exclusive. So, for instance, only if the current token at A is b would I push B; if the current token is c, d, or e, I would just pop. I'm really not sure how to go about writing it; this is just my thought process. But is it allowed to use the selection set in place of the actual grammar rule?
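For what it's worth, here is a minimal sketch of one common scheme: a table-driven LL(1) parser written in Haskell (the names Sym, table, and accepts are purely illustrative, not standard). A lambda production is simply a table entry whose right-hand side is empty, so the nonterminal is popped and nothing is pushed.

-- illustrative sketch of a table-driven LL(1) parser for
-- S -> ABCe, A -> bB | lambda, B -> cC | lambda, C -> d | lambda
data Sym = S | A | B | C | T Char deriving (Eq, Show)

-- the parse table: given a nonterminal and the current token, the
-- right-hand side to push; Just [] encodes a lambda production
table :: Sym -> Char -> Maybe [Sym]
table S t | t `elem` "bcde" = Just [A, B, C, T 'e']
table A 'b'                 = Just [T 'b', B]
table A t | t `elem` "cde"  = Just []              -- A -> lambda
table B 'c'                 = Just [T 'c', C]
table B t | t `elem` "de"   = Just []              -- B -> lambda
table C 'd'                 = Just [T 'd']
table C 'e'                 = Just []              -- C -> lambda
table _ _                   = Nothing

-- the driver: pop a terminal and match it against the current token,
-- or pop a nonterminal and push whatever right-hand side the table picks
accepts :: [Sym] -> String -> Bool
accepts [] [] = True
accepts (T t : stack) (c : cs) | t == c = accepts stack cs
accepts (n : stack) input@(c : _) =
  case table n c of
    Just rhs -> accepts (rhs ++ stack) input
    Nothing  -> False
accepts _ _ = False

-- accepts [S] "bce" ==> True; accepts [S] "e" ==> True

Note that in this scheme the driver always pushes the full right-hand side of S -> ABCe (with A on top); you never push a partial right-hand side based on the token. When the current token is, say, d, the lambda entries for A and B pop them on the next two steps, so the table's lambda entries do that pruning for you.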

Related

Why do we need FOLLOW set in LL(1) grammar parser?

In a generated parsing function we use an algorithm which peeks at the next token in the token list and chooses a rule (alternative) based on the current non-terminal's FIRST set. If that set contains epsilon (the rule is nullable), the FOLLOW set is checked as well.
Consider the following grammar [not LL(1)]:
B : A term
A : N1 | N2
N1 :
N2 :
During calculation of the FOLLOW sets, the terminal term will be propagated from A to both N1 and N2, so the FOLLOW set won't help us decide between them.
On the other hand, if there is exactly one nullable alternative, we know for sure how to continue execution, even when the current token doesn't match anything in the FIRST set (by choosing the epsilon production).
If the above statements are true, the FOLLOW set is redundant. Is it needed only for error handling?
Yes, it is not necessary for parsing itself.
I was asked precisely this question at a colloquium, and my answer that the FOLLOW set is used
to check that the grammar is LL(1),
to fail immediately when an error occurs, instead of dragging the ill-formed token to some later production, where the generated failure message may be unclear,
and for nothing else
was accepted.
While you can certainly find grammars for which FOLLOW is unnecessary (i.e., it doesn't play a role in the calculation of the parsing table), in general it is necessary.
For example, consider the grammar
S : A | C
A : B a
B : b | epsilon
C : D c
D : d | epsilon
You need to know that
Follow(B) = {a}
Follow(D) = {c}
to calculate
First(A) = {b, a}
First(C) = {d, c}
in order to make the correct choice at S.
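To make that concrete, here is a hypothetical sketch (the name chooseAtS is mine, not from the answer) of the resulting LL(1) decision at S; it only works because FOLLOW makes the nullability of B and D visible in First(A) and First(C):

-- hypothetical sketch: the LL(1) decision at S for the grammar above.
-- First(A) = {b, a}: the a comes from Follow(B), because B is nullable.
-- First(C) = {d, c}: the c comes from Follow(D), because D is nullable.
chooseAtS :: Char -> Maybe String
chooseAtS t
  | t `elem` "ba" = Just "S -> A"
  | t `elem` "dc" = Just "S -> C"
  | otherwise     = Nothing  -- reject immediately at S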

John Hughes' Deterministic LL(1) parsing with Arrow and errors

I wanted to write a parser based on John Hughes' paper Generalizing Monads to Arrows. When reading through and trying to reimplement his code I realized there were some things that didn't quite make sense. In one section he lays out a parser implementation based on Swierstra and Duponcheel's paper Deterministic, error-correcting combinator parsers using Arrows. The parser type he describes looks like this:
data StaticParser ch = SP Bool [ch]
data DynamicParser ch a b = DP ((a, [ch]) -> (b, [ch]))
data Parser ch a b = P (StaticParser ch) (DynamicParser ch a b)
with the composition operator looking something like this:
(.) :: Parser ch b c -> Parser ch a b -> Parser ch a c
P (SP e2 st2) (DP f2) . P (SP e1 st1) (DP f1) =
  P (SP (e1 && e2) (st1 `union` if e1 then st2 else []))
    (DP $ f2 . f1)
The issue is that the composition of parsers q . p 'forgets' q's starting symbols. One possible interpretation I thought of is that Hughes expects all our DynamicParsers to be total, so that a symbol parser's type signature would be symbol :: ch -> Parser ch a (Maybe ch) instead of symbol :: ch -> Parser ch a ch. This still seems awkward, though, since we have to duplicate information by putting starting-symbol information in both the StaticParser and the DynamicParser. Another issue is that almost all parsers will have the potential to fail, which means we will have to spend a lot of time inside Maybe or Either, recreating what is essentially the "monads do not compose" problem. This could be remedied by rewriting DynamicParser itself to handle failure, or as an Arrow transformer, but this is straying quite a bit from the paper. None of these issues are addressed in the paper, and the Parser is presented as if it obviously works, so I feel like I must be missing something basic. If someone can catch what I missed, that would be super helpful.
I think the deterministic parsers described by Swierstra and Duponcheel are a bit different from traditional parsers: they do not handle failure at all, only choice.
See also the invokeDet function in the S&D paper:
invokeDet :: Symbol s => DetPar s a -> Input s -> a
invokeDet (_, p) inp = case p inp [] of (a, _) -> a
This function clearly assumes it will always be able to find a valid parse.
With the arrow version of the parsers described by Hughes you can write examples like this:
main = do
  let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
  print $ invokeDet p "ab"
  print $ invokeDet p "ac"
Which will print the expected:
'b'
'c'
However, if you write a "failing" parse:
main = do
  let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
  print $ invokeDet p "ad"
It will still print:
'c'
To make this behavior a bit more sensible, Swierstra and Duponcheel also introduce error-correction. The output 'c' is expected if we assume the erroneous character d has been corrected to be a c in the input. This requires an extra mechanism which presumably was too complicated to include in Hughes' paper.
I have uploaded the implementation I used to get these results here: https://gist.github.com/noughtmare/eced4441332784cc8212e9c0adb68b35
For more information about a more practical parser in the same style (but no longer deterministic and no longer limited to LL(1)) I really like the "Combinator Parsing: A Short Tutorial" by Swierstra. An interesting excerpt from section 9.3:
A subtle point here is the question how to deal with monadic parsers. As we described in [13] the static analysis does not go well with monadic computations, since in that case we dynamically build new parses based on the input produced thus far: the whole idea of a static analysis is that it is static. This observation has lead John Hughes to propose arrows for dealing with such situations [7]. It is only recently that we realised that, although our arguments still hold in general, they do not apply to the case of the LL(1) analysis. If we want to compute the symbols which can be recognised as the first symbol by a parser of the form p >>= q then we are only interested in the starting symbols of the right hand side if the left hand side can recognise the empty string; the good news is that in that case we statically know what value will be returned as a witness, and can pass this value on to q, and analyse the result of this call statically too. Unfortunately we will have to take special precautions in case the left hand side operator contains a call to pErrors in one of the empty derivations, since then it is no longer true that the witness of this alternative can be determined statically.
The full parser implementation by Swierstra can be found in the uu-parsinglib package, although I do not know how many of the extensions are implemented there.

Erlang implementing an amb operator.

On Wikipedia it says that using call/cc you can implement the amb operator for nondeterministic choice. My question is: how would you implement the amb operator in a language in which the only support for continuations is to write in continuation-passing style, like in Erlang?
If you can encode the constraints for what constitutes a successful solution or choice as guards, list comprehensions can be used to generate solutions. For example, the list comprehension documentation shows an example of solving Pythagorean triples, which is a problem frequently solved using amb (see for example exercise 4.35 of SICP, 2nd edition). Here's the more efficient solution, pyth1/1, shown on the list comprehensions page:
pyth1(N) ->
    [ {A,B,C} ||
        A <- lists:seq(1,N-2),
        B <- lists:seq(A+1,N-1),
        C <- lists:seq(B+1,N),
        A+B+C =< N,
        A*A+B*B == C*C
    ].
One important aspect of amb is efficiently searching the solution space, which is done here by generating possible values for A, B, and C with lists:seq/2 and then constraining and testing those values with guards. Note that the page also shows a less efficient solution named pyth/1 where A, B, and C are all generated identically using lists:seq(1,N); that approach generates all permutations but is slower than pyth1/1 (for example, on my machine, pyth(50) is 5-6x slower than pyth1(50)).
If your constraints can't be expressed as guards, you can use pattern matching and try/catch to deal with failing solutions. For example, here's the same algorithm in pyth/1 rewritten as regular functions triples/1 and the recursive triples/5:
-module(pyth).
-export([triples/1]).

triples(N) ->
    triples(1,1,1,N,[]).

triples(N,N,N,N,Acc) ->
    lists:reverse(Acc);
triples(N,N,C,N,Acc) ->
    triples(1,1,C+1,N,Acc);
triples(N,B,C,N,Acc) ->
    triples(1,B+1,C,N,Acc);
triples(A,B,C,N,Acc) ->
    NewAcc = try
                 true = A+B+C =< N,
                 true = A*A+B*B == C*C,
                 [{A,B,C}|Acc]
             catch
                 error:{badmatch,false} ->
                     Acc
             end,
    triples(A+1,B,C,N,NewAcc).
We're using pattern matching for two purposes:
In the function heads, to control values of A, B and C with respect to N and to know when we're finished
In the body of the final clause of triples/5, to assert that conditions A+B+C =< N and A*A+B*B == C*C match true
If both conditions match true in the final clause of triples/5, we insert the solution into our accumulator list, but if either fails to match, we catch the badmatch error and keep the original accumulator value.
Calling triples/1 yields the same result as the list comprehension approaches used in pyth/1 and pyth1/1, but it's also half the speed of pyth/1. Even so, with this approach any constraint could be encoded as a normal function and tested for success within the try/catch expression.

Underlying Parsec Monad

Many of the Parsec combinators I use are of a type such as:
foo :: CharParser st Foo
CharParser is defined here as:
type CharParser st = GenParser Char st
CharParser is thus a type synonym involving GenParser, itself defined here as:
type GenParser tok st = Parsec [tok] st
GenParser is then another type synonym, assigned using Parsec, defined here as:
type Parsec s u = ParsecT s u Identity
So Parsec is a partial application of ParsecT, itself listed here with type:
data ParsecT s u m a
along with the words:
"ParsecT s u m a is a parser with stream type s, user state type u,
underlying monad m and return type a."
What is the underlying monad? In particular, what is it when I use the CharParser parsers? I can't see where it's inserted in the stack. Is there a relationship to the use of the list monad in Monadic Parsing in Haskell to return multiple successful parses from an ambiguous parser?
In your case the underlying monad is Identity. However ParsecT is different from most monad transformers in that it is an instance of the Monad class even if the type parameter m is not. If you look at the source code you will note the lack of "(Monad m) =>" in the instance declaration.
So then you ask yourself, "If I were to have a non-trivial monad stack, where would it be used?"
There are three answers to that question:
It is used to uncons the next token out of the stream:
class (Monad m) => Stream s m t | s -> t where
  uncons :: s -> m (Maybe (t, s))
Notice that uncons takes an s (the stream of tokens t) and returns its result wrapped in your monad. This allows one to do interesting things during the process of getting the next token.
It is used in the resulting output of each parser. This means you can create parsers that don't touch the input but take action in the underlying monad, and use the combinators to bind them to regular parsers. In other words, lift (x :: m a) :: ParsecT s u m a (see the sketch after this list).
Finally, runParsecT and friends (until you build up to the point where m is replaced by Identity) return their end results wrapped in this monad.
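To illustrate the second point, here is a small sketch of my own (not from the original answer): a parser over the IO monad that uses lift to log each consumed character.

import Text.Parsec
import Control.Monad.Trans.Class (lift)

-- a letter parser over IO: lift runs an IO action between parsing steps
loggedLetter :: ParsecT String () IO Char
loggedLetter = do
  c <- letter
  lift $ putStrLn ("consumed: " ++ [c])
  return c

main :: IO ()
main = do
  -- runParserT returns its result in the underlying monad, here IO
  result <- runParserT (many1 loggedLetter) () "<input>" "abc"
  print result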
There is not a relationship between this monad and the one from Monadic Parsing in Haskell. In this case Hutton and Meijer are referring to the monad instance for ParsecT itself. The fact that in Parsec-3.0.0 and beyond ParsecT has become a monad transformer with an underlying monad is not relevant to the paper.
What I think you are looking for, however, is where the list of possible results went. In Hutton and Meijer the parser returns a list of all possible results, while Parsec stubbornly returns only one. I think you are looking at the m in the result and thinking to yourself that the list of results must be hiding in there somewhere. It is not.
Parsec, for reasons of efficiency, made a choice to prefer the first matching result in Hutton and Meijer's list of results. This lets it toss away both the unused results in the tail of Hutton and Meijer's list and also the front of the stream of tokens, because it never backtracks. In Parsec, given the combined parser a <|> b, if a consumes any input then b will never be evaluated. The way around this is try, which resets the state back to where it was if a fails, and then evaluates b.
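A standard minimal example of this behaviour (my own, for illustration):

import Text.Parsec
import Text.Parsec.String (Parser)

-- string "ab" consumes the 'a' before failing on "ac", so the
-- right-hand alternative is never attempted
noTry :: Parser String
noTry = string "ab" <|> string "ac"

-- try resets the input when string "ab" fails, so "ac" can succeed
withTry :: Parser String
withTry = try (string "ab") <|> string "ac"

-- parse noTry   "" "ac"  ==> Left (unexpected "c")
-- parse withTry "" "ac"  ==> Right "ac"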
You asked in the comments if this was done using Maybe or Either. The answer is "almost but not quite." If you look at the low-level run* functions you will see that they return an algebraic type which tells whether input was consumed, and then a second one which gives either the result or an error message. These types work kind of like Either, but even they are not used directly. Rather than stretch this out further, I'll refer you to the post by Antoine Latter that explains how this works and why it is done this way.
GenParser is defined in terms of Parsec, not ParsecT. Parsec in turn is defined as
type Parsec s u = ParsecT s u Identity
So the answer is that when using CharParser the underlying monad is the Identity monad.

Parsing unordered sequence with parsing expression grammar

Is there a (simple) way, within a parsing expression grammar (PEG), to express an "unordered sequence"? A rule such as
Rule <- A B C
requires A, B and C to match in order. A rule such as
Rule <- (A B C) / (B C A) / (C A B) / (A C B) / (C B A) / (B A C)
allows them to match in any order (which is what we want), but it is cumbersome and inapplicable in practice with more terms in the sequence.
Is the only solution to use a syntactically looser rule such as
Rule <- (A / B / C){3}
and semantically check that each rule matches only once?
The fact that, e.g., Relax NG Compact Syntax has an "unordered list" operator to parse XML makes me suspect that there is no obvious solution.
Last question: do you think the addition of such an operator would bring ambiguity to PEG?
Grammar rules express precisely the sequence of forms that you want, regardless of the parsing engine (e.g., PEG, LALR, LL(k), ...) that you choose.
The only way to express that you want all possible orderings of a set of items using pure BNF rules is the big ugly rule you proposed.
The standard solution is to simply define:
rule <- (A | B | C)*
(or whatever syntax your parser generator accepts for lists) and semantically check that exactly 3 forms are provided and that they are unique.
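As a concrete sketch of that loose-rule-plus-semantic-check idea, here is how it might look with Parsec-style combinators in Haskell (the name unordered and the single-character "rules" a, b, c are illustrative assumptions, not part of the question's grammar):

import Text.Parsec
import Text.Parsec.String (Parser)
import Data.List (sort)

-- accept three of a/b/c in any order, then check semantically
-- that each of them appears exactly once
unordered :: Parser String
unordered = do
  xs <- count 3 (oneOf "abc")
  if sort xs == "abc"
    then return xs
    else fail "expected each of a, b and c exactly once"

-- parse unordered "" "bca" ==> Right "bca"
-- parse unordered "" "bba" ==> Left (parse error)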
Often people building parser generators add special "extended BNF" notations to let them describe special circumstances; you gave an example using {3} as special syntax, implying that you only wanted "3 of" (under the assumption that the parser generator accepts this notation and does the appropriate enforcement). One can imagine an extension notation {unique} to let you describe your situation. I've never seen a parser generator that implemented that idea.
