Dealing with infinite loops when constructing states for LR(1) parsing

Dealing with infinite loops when constructing states for LR(1) parsing - parsing

I'm currently constructing LR(1) states from the following grammar.
S->AS
S->c
A->aA
A->b
where A,S are nonterminals and a,b,c are terminals.
This is the construction of I0
I0: S' -> .S, epsilon
---------------
S -> .AS, epsilon
S -> .c, epsilon
---------------
S -> .AS, a
S -> .c, c
A -> .aA, a
A -> .b, b
And I1.
From S, I1: S' -> S., epsilon //DONE
And so on. But when I get to constructing I4...
From a, I4: A -> a.A, a
-----------
A -> .aA, a
A -> .b, b
The problem is
A -> .aA
When I attempt to construct the next state from a, I'm going to once again get the exact same content of I4, and this continues infinitely. A similar loop occurs with
S -> .AS
So, what am I doing wrong? There has to be some detail that I'm missing, but I've browsed my notes and my book and either can't find or just don't understand what's wrong here. Any help?

I'm pretty sure I figured out the answer. Obviously, states can point to each other, so that eliminates the need to create new ones if it's content already exists. I'd still like it if someone can confirm this, though.

Related

What is the canonical way to handle conditionals in Erlang?

I am working on simple list functions in Erlang to learn the syntax.
Everything was looking very similar to code I wrote for the Prolog version of these functions until I got to an implementation of 'intersection'.
The cleanest solution I could come up with:
myIntersection([],_) -> [];
myIntersection([X|Xs],Ys) ->
UseFirst = myMember(X,Ys),
myIntersection(UseFirst,X,Xs,Ys).
myIntersection(true,X,Xs,Ys) ->
[X|myIntersection(Xs,Ys)];
myIntersection(_,_,Xs,Ys) ->
myIntersection(Xs,Ys).
To me, this feels slightly like a hack. Is there a more canonical way to handle this? By 'canonical', I mean an implementation true to the spirit of what Erlang's design.
Note: the essence of this question is conditional handling of user-defined predicate functions. I am not asking for someone to point me to a library function. Thanks!

I like this one:
inter(L1,L2) -> inter(lists:sort(L1),lists:sort(L2),[]).
inter([H1|T1],[H1|T2],Acc) -> inter(T1,T2,[H1|Acc]);
inter([H1|T1],[H2|T2],Acc) when H1 < H2 -> inter(T1,[H2|T2],Acc);
inter([H1|T1],[_|T2],Acc) -> inter([H1|T1],T2,Acc);
inter([],_,Acc) -> Acc;
inter(_,_,Acc) -> Acc.
it gives the exact intersection:
inter("abcd","efgh") -> []
inter("abcd","efagh") -> "a"
inter("abcd","efagah") -> "a"
inter("agbacd","eafagha") -> "aag"
if you want that a value appears only once, simply replace one of the lists:sort/1 function by lists:usort/1
Edit
As #9000 says, one clause is useless:
inter(L1,L2) -> inter(lists:sort(L1),lists:sort(L2),[]).
inter([H1|T1],[H1|T2],Acc) -> inter(T1,T2,[H1|Acc]);
inter([H1|T1],[H2|T2],Acc) when H1 < H2 -> inter(T1,[H2|T2],Acc);
inter([H1|T1],[_|T2],Acc) -> inter([H1|T1],T2,Acc);
inter(_,_,Acc) -> Acc.
gives the same result, and
inter(L1,L2) -> inter(lists:usort(L1),lists:sort(L2),[]).
inter([H1|T1],[H1|T2],Acc) -> inter(T1,T2,[H1|Acc]);
inter([H1|T1],[H2|T2],Acc) when H1 < H2 -> inter(T1,[H2|T2],Acc);
inter([H1|T1],[_|T2],Acc) -> inter([H1|T1],T2,Acc);
inter(_,_,Acc) -> Acc.
removes any duplicate in the output.
If you know that there are no duplicate values in the input list, I think that
inter(L1,L2) -> [X || X <- L1, Y <- L2, X == Y].
is the shorter code solution but much slower (1 second to evaluate the intersection of 2 lists of 10 000 elements compare to 16ms for the previous solution, and an O(2) complexity comparable to #David Varela proposal; the ratio is 70s compare to 280ms with 2 lists of 100 000 elements!, an I guess there is a very high risk to run out of memory with bigger lists)

The canonical way ("canonical" as in "SICP") is to use an accumulator.
myIntersection(A, B) -> myIntersectionInner(A, B, []).
myIntersectionInner([], _, Acc) -> Acc;
myIntersectionInner(_, [], Acc) -> Acc;
myIntersectionInner([A|As], B, Acc) ->
case myMember(A, Bs) of
true ->
myIntersectionInner(As, Bs, [A|Acc]);
false ->
myIntersectionInner(As, Bs, [Acc]);
end.
This implementation of course produces duplicates if duplicates are present in both inputs. This can be fixed at the expense of calling myMember(A, Acc) and only appending A is the result is negative.
My apologies for the approximate syntax.

Although I appreciate the efficient implementations suggested, my intention was to better understand Erlang's implementation. As a beginner, I think #7stud's comment, particularly http://erlang.org/pipermail/erlang-questions/2009-December/048101.html, was the most illuminating. In essence, 'case' and pattern matching in functions use the same mechanism under the hood, although functions should be preferred for clarity.
In a real system, I would go with one of #Pascal's implementations; depending on whether 'intersect' did any heavy lifting.

Understand whether a grammar is LR(1) with no parsing table

I've found out an exercise that require a trick to understand whether a grammar is LR(1) with no parsing table operations.
The grammar is the followed:
S -> Aa | Bb
A -> aAb | ab
B -> aBbb | abb
Do you know what is the trick behind?
Thanks, :)

Imagine that you're an LR(1) parser and that you've just read aab with a lookahead of b. (I know, you're probably thinking "man, that happens to me all the time!") What exactly should you do here?
Looking at the grammar, you can't tell whether the initial production was Aa or Bb, so you're going to have to simultaneously consider production rules for A and for B. If you look at the A options, you'll see that one option here would be to reduce A → ab, which is plausible here because the lookahead is a b and that's precisely what you'd expect to find after seeing an ab when expanding out an A (notice that there's the rule A → aRb, so any recursively-expanded As would be followed by a b). So that tells you to reduce. On the other hand, look at the B options. If you see aab followed by a b, you'd be thinking "oh, that second b is going to make aabb, and then I'd go and reduce B → abb, because that's totally a thing I like to do because I'm an LR(1) parser." So that tells you to shift. At that point, bam! You've got a shift/reduce conflict, so you're almost certainly not going to have an LR(1) grammar.
So does that actually happen? Well, let's go build the LR(1) configurating sets that we'd see if we did indeed read aab and see b as a lookahead:
Initial State
S' -> .S [$]
S -> .Aa [$]
S -> .Bb [$]
A -> .aAb [a]
A -> .ab [a]
B -> .aBbb [b]
B -> .abb [b]
State after reading a
A -> a.Ab [a]
A -> a.b [a]
A -> .aAb [b]
A -> .ab [b]
B -> a.Bbb [b]
B -> a.bb [b]
B -> .aBbb [b]
B -> .abb [b]
State after reading aa
A -> a.Ab [b]
A -> a.b [b]
A -> .aAb [b]
A -> .ab [b]
B -> a.Bbb [b]
B -> a.bb [b]
B -> .aBbb [b]
B -> .abb [b]
State after reading aab
A -> ab. [b]
B -> ab.b [b]
And hey! There's that shift/reduce conflict we were talking about. That first item reduces on b, but the second shifts on b. So there you go! Our intuition led us to think that this isn't going to be an LR(1) grammar, and if we look at the tables the evidence is supported by the data.
So how would you know to try that? Well, in general, it's pretty hard to do this. The main cue, for me, at least, is that the parser has to guess whether it wants A or B at some point, but the way it tiebreaks is the number of bs. The parser was going to have to at some point determine whether it likes ab and to go with A or whether it likes abb and to go with B, but it can't see both of the bs before making the decision. That led me to think that we'd like to find some sort of conflict where we've seen enough to know that some recursion was happening (so that the trailing b would cause problem) and to find a place where the recursion would differ between the two production rules.

fixing a grammar to LR(0)

Question:
Given the following grammar, fix it to an LR(O) grammar:
S -> S' $
S'-> aS'b | T
T -> cT | c
Thoughts
I've been trying this for quite sometime, using automatic tools for checking my fixed grammars, with no success. Our professor likes asking this kind of questions on test without giving us a methodology for approaching this (except for repeated trying). Is there any method that can be applied to answer these kind of questions? Can anyone show this method can be applied on this example?

I don't know of an automatic procedure, but the basic idea is to defer decisions. That is, if at a particular state in the parse, both shift and reduce actions are possible, find a way to defer the reduction.
In the LR(0) parser, you can make a decision based on the token you just shifted, but not on the token you (might be) about to shift. So you need to move decisions to the end of productions, in a manner of speaking.
For example, your language consists of all sentences { ancmbn$ | n ≥ 0, m > 0}. If we restrict that to n > 0, then an LR(0) grammar can be constructed by deferring the reduction decision to the point following a b:
S -> S' $.
S' -> U | a S' b.
U -> a c T.
T -> b | c T.
That grammar is LR(0). In the original grammar, at the itemset including T -> c . and T -> c . T, both shift and reduce are possible: shift c and reduce before b. By moving the b into the production for T, we defer the decision until after the shift: after shifting b, a reduction is required; after c, the reduction is impossible.
But that forces every sentence to have at least one b. It omits sentences for which n = 0 (that is, the regular language c*$). That subset has an LR(0) grammar:
S -> S' $.
S' -> c | S' c.
We can construct the union of these two languages in a straight-forward manner, renaming one of the S's:
S -> S1' $ | S2' $.
S1' -> U | a S1' b.
U -> a c T.
T -> b | c T.
S2' -> c | S2' c.
This grammar is LR(0), but the form in which the end-of-input sentinel $ has been included seems to be cheating. At least, it violates the rule for augmented grammars, because an augmented grammar's base rule is always S -> S' $ where S' and $ are symbols not used in the original grammar.
It might seem that we could avoid that technicality by right-factoring:
S -> S' $
S' -> S1' | S2'
Unfortunately, while that grammar is still deterministic, and does recognise exactly the original language, it is not LR(0).
(Many thanks to #templatetypedef for checking the original answer, and identifying a flaw, and also to #Dennis, who observed that c* was omitted.)

Generating the LL(1) parsing table for the given CFG

The CFG is as following :
S -> SD|SB
B -> b|c
D -> a|dB
The method which I tried is as following:
I removed non-determinism from the first production (S->SD|SB) by left-factoring method.
So, the CFG after applying left-factoring is as following:
S -> SS'
S'-> D|B
B -> b|c
D -> a|dB
I need to find the first of S for the production i.e. S -> SS'
in order to proceed further. Could some one please help or advise?

You cannot convert this grammar that way into an LL(1) parser: the grammar is left recursive, you thus will have to perform left recursion removal. The point is that you can perform the following trick: since the only rule for S is S -> SS' and S -> (epsilon), it means that you simply reverse the order, and thus introduce the rule S -> S'S. So now the grammar is:
S -> S'S
S'-> D|B
B -> b|c
D -> a|dB
Now we can construct first: first(B)={b,c}, first(D)={a,d}, first(S')={a,b,c,d} and first(S)={a,b,c,d}.

Will this 'algorithm' for nullable and first work (in a parser)?

Working through this for fun: http://www.diku.dk/hjemmesider/ansatte/torbenm/Basics/
Example calculation of nullable and first uses a fixed-point calculation. (see section 3.8)
I'm doing things in Scheme and relying a lot on recursion.
If you try to implement nullable or first via recursion, it should be clear you'll recur infinitely on a production like
N -> N a b
where N is a non-terminal and a,b are terminals.
Could this be solved, recursively, by maintaining a set of non-terminals seen on the left hand side of production rules, and ignoring them after we have accounted for them once?
This seems to work for nullable. What about for first?
EDIT: This is what I have learned from playing around. Source code link at bottom.
Non terminals cannot be ignored in the calculation of first unless they are nullable.
Consider:
N -> N a
N -> X
N ->
Here we can ignore N in N a because N is nullable. We can replace N -> N a with N -> a and deduce that a is a member of first(N).
Here we cannot ignore N:
N -> N a
N -> M
M -> b
If we ignored the N in N -> N a we would deduce that a is in first(N) which is false. Instead, we see that N is not nullable, and hence when calculating first, we can omit any production where N is found as the first symbol in the RHS.
This yields:
N -> M
M -> b
which tells us b is in first(N).
Source Code: http://gist.github.com/287069
So ... does this sound OK?

I suggest to keep on reading :)
3.13 Rewriting a grammar for LL(1) parsing and especially 3.13.1 Eliminating left-recursion.
Just to note you can run into indirect left recursion as well:
A -> Bac
B -> A
B -> _also something else_
But the solution here is quite similar to eliminating the direct left recursion as in your first example.
You might want to check this paper which explains it in a little bit more straight-forward way. Less theory :)

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Dealing with infinite loops when constructing states for LR(1) parsing - parsing

I'm pretty sure I figured out the answer. Obviously, states can point to each other, so that eliminates the need to create new ones if it's content already exists. I'd still like it if someone can confirm this, though.

Related

What is the canonical way to handle conditionals in Erlang?

Understand whether a grammar is LR(1) with no parsing table

fixing a grammar to LR(0)

Generating the LL(1) parsing table for the given CFG

Will this 'algorithm' for nullable and first work (in a parser)?

Categories

Resources