LALR(1) Parser Not Parsing The Text At All

I have to admit I'm an absolute newbie in this and might not even understand what I am doing.
I am trying to make a grammar that at least contains grammar from BABA IS YOU, and if possible expands on it. I am using this tool to debug my grammar: http://jsmachines.sourceforge.net/machines/lalr1.html
Admittedly, my grammar is currently not LALR(1), as evidenced by the many shift/reduce conflicts, which I am unsure how to resolve properly.
So, when I enter "RED AND BLUE BABA IS YOU" into the parser, it fails to parse the sentence at all instead of producing the parse tree I expect.
I have no idea where to start digging to understand my problem, and I need help with at least that.
The Grammar I use is this: https://pastebin.com/5MHZrFLe
sentence' -> sentence
sentence -> give
give -> giver property
giver -> noun IS
selector -> adjective noun
multinoun -> noun AND
multinoun -> multinoun AND
multinoun -> multinoun noun
multiadjective -> adjective AND
multiadjective -> multiadjective AND
multiadjective -> multiadjective adjective
noun -> multinoun
noun -> selector
noun -> BABA
noun -> KEKE
noun -> ROBOT
adjective -> RED
adjective -> BLUE
adjective -> GREEN
property -> YOU

In order for the token AND in that sentence to be recognised, there would have to be a derivation sequence leading from sentence' to multiadjective. There is no such sequence, as can easily be verified by computing reachability over the grammar (a simple DFS).
That makes multiadjective useless in that grammar. It's slightly surprising that the tool you use doesn't warn you about that.
That's not the case for multinoun, which is reachable through the noun -> multinoun production. However, that creates a number of ambiguities, leading to shift/reduce conflicts. One example:
noun -> multinoun -> multinoun AND
vs
noun -> multinoun -> noun AND -> multinoun AND
The general pattern for a bottom-up grammar representing a list of token-separated items is:
list -> item
list -> list separator item
In such a grammar, the list is included in an outer production using the non-terminal list, not item. Adding item -> list in order to be able to refer to it as item leads to the same ambiguities as your noun non-terminal, which more or less reproduces this error.
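The reachability argument above is easy to check mechanically. Here is a small Python sketch (the function name and representation are my own, not from the question) that does the DFS over the grammar and confirms that multiadjective is unreachable from sentence':

```python
GRAMMAR = {
    "sentence'": [["sentence"]],
    "sentence": [["give"]],
    "give": [["giver", "property"]],
    "giver": [["noun", "IS"]],
    "selector": [["adjective", "noun"]],
    "multinoun": [["noun", "AND"], ["multinoun", "AND"], ["multinoun", "noun"]],
    "multiadjective": [["adjective", "AND"], ["multiadjective", "AND"],
                       ["multiadjective", "adjective"]],
    "noun": [["multinoun"], ["selector"], ["BABA"], ["KEKE"], ["ROBOT"]],
    "adjective": [["RED"], ["BLUE"], ["GREEN"]],
    "property": [["YOU"]],
}

def reachable(grammar, start):
    """Return the set of non-terminals reachable from `start` (iterative DFS)."""
    seen, stack = set(), [start]
    while stack:
        nt = stack.pop()
        if nt in seen:
            continue
        seen.add(nt)
        for rhs in grammar[nt]:
            # Only non-terminals (grammar keys) can lead to further rules.
            stack.extend(sym for sym in rhs if sym in grammar)
    return seen
```

Running `reachable(GRAMMAR, "sentence'")` yields every non-terminal except multiadjective, which is why no input containing AND between adjectives can ever be derived.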


What is the canonical way to handle conditionals in Erlang?

I am working on simple list functions in Erlang to learn the syntax.
Everything was looking very similar to code I wrote for the Prolog version of these functions until I got to an implementation of 'intersection'.
The cleanest solution I could come up with:
myIntersection([], _) -> [];
myIntersection([X|Xs], Ys) ->
    UseFirst = myMember(X, Ys),
    myIntersection(UseFirst, X, Xs, Ys).

myIntersection(true, X, Xs, Ys) ->
    [X|myIntersection(Xs, Ys)];
myIntersection(_, _, Xs, Ys) ->
    myIntersection(Xs, Ys).
To me, this feels slightly like a hack. Is there a more canonical way to handle this? By 'canonical', I mean an implementation true to the spirit of Erlang's design.
Note: the essence of this question is conditional handling of user-defined predicate functions. I am not asking for someone to point me to a library function. Thanks!
I like this one:
inter(L1,L2) -> inter(lists:sort(L1),lists:sort(L2),[]).
inter([H1|T1],[H1|T2],Acc) -> inter(T1,T2,[H1|Acc]);
inter([H1|T1],[H2|T2],Acc) when H1 < H2 -> inter(T1,[H2|T2],Acc);
inter([H1|T1],[_|T2],Acc) -> inter([H1|T1],T2,Acc);
inter([],_,Acc) -> Acc;
inter(_,_,Acc) -> Acc.
it gives the exact intersection:
inter("abcd","efgh") -> []
inter("abcd","efagh") -> "a"
inter("abcd","efagah") -> "a"
inter("agbacd","eafagha") -> "aag"
If you want each value to appear only once, simply replace one of the lists:sort/1 calls with lists:usort/1.
Edit
As @9000 says, one clause is useless:
inter(L1,L2) -> inter(lists:sort(L1),lists:sort(L2),[]).
inter([H1|T1],[H1|T2],Acc) -> inter(T1,T2,[H1|Acc]);
inter([H1|T1],[H2|T2],Acc) when H1 < H2 -> inter(T1,[H2|T2],Acc);
inter([H1|T1],[_|T2],Acc) -> inter([H1|T1],T2,Acc);
inter(_,_,Acc) -> Acc.
gives the same result, and
inter(L1,L2) -> inter(lists:usort(L1),lists:sort(L2),[]).
inter([H1|T1],[H1|T2],Acc) -> inter(T1,T2,[H1|Acc]);
inter([H1|T1],[H2|T2],Acc) when H1 < H2 -> inter(T1,[H2|T2],Acc);
inter([H1|T1],[_|T2],Acc) -> inter([H1|T1],T2,Acc);
inter(_,_,Acc) -> Acc.
removes any duplicate in the output.
If you know that there are no duplicate values in the input list, I think that
inter(L1,L2) -> [X || X <- L1, Y <- L2, X == Y].
is the shortest solution, but it is much slower (1 second to evaluate the intersection of two lists of 10,000 elements, compared to 16 ms for the previous solution) because of its O(n²) complexity, comparable to @David Varela's proposal; the ratio is 70 s compared to 280 ms with two lists of 100,000 elements, and I guess there is a very high risk of running out of memory with bigger lists.
The canonical way ("canonical" as in "SICP") is to use an accumulator.
myIntersection(A, B) -> myIntersectionInner(A, B, []).

myIntersectionInner([], _, Acc) -> Acc;
myIntersectionInner(_, [], Acc) -> Acc;
myIntersectionInner([A|As], Bs, Acc) ->
    case myMember(A, Bs) of
        true ->
            myIntersectionInner(As, Bs, [A|Acc]);
        false ->
            myIntersectionInner(As, Bs, Acc)
    end.
This implementation of course produces duplicates if duplicates are present in both inputs. This can be fixed at the expense of calling myMember(A, Acc) and only prepending A if the result is negative.
My apologies for the approximate syntax.
Although I appreciate the efficient implementations suggested, my intention was to better understand Erlang. As a beginner, I found @7stud's comment, particularly http://erlang.org/pipermail/erlang-questions/2009-December/048101.html, the most illuminating. In essence, 'case' and pattern matching in function clauses use the same mechanism under the hood, although function clauses should be preferred for clarity.
In a real system, I would go with one of @Pascal's implementations, depending on whether 'intersect' did any heavy lifting.

Generating the LL(1) parsing table for the given CFG

The CFG is as follows:
S -> SD|SB
B -> b|c
D -> a|dB
The method I tried is as follows:
I removed the non-determinism from the first production (S -> SD | SB) by left factoring.
The CFG after left factoring is:
S -> SS'
S'-> D|B
B -> b|c
D -> a|dB
I need to find FIRST(S) for the production S -> SS' in order to proceed further. Could someone please help or advise?
You cannot convert this grammar into an LL(1) parser that way: the grammar is left-recursive, so you will have to perform left-recursion removal first. Here you can use the following trick: since the only rules for S are S -> SS' and S -> (epsilon), you can simply reverse the order and introduce the rule S -> S'S instead. So now the grammar is:
S -> S'S
S'-> D|B
B -> b|c
D -> a|dB
Now we can construct the FIRST sets: FIRST(B) = {b, c}, FIRST(D) = {a, d}, FIRST(S') = {a, b, c, d}, and FIRST(S) = {a, b, c, d}.
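The fixpoint computation behind these FIRST sets can be sketched in a few lines of Python (a simplified version that ignores ε-productions, since none of the rules above is nullable; S' is spelled "S'" as a dictionary key):

```python
GRAMMAR = {
    "S":  [("S'", "S")],
    "S'": [("D",), ("B",)],
    "B":  [("b",), ("c",)],
    "D":  [("a",), ("d", "B")],
}

def first_sets(grammar):
    """Compute FIRST for each non-terminal by iterating to a fixpoint.

    Simplification: no production here derives epsilon, so only the
    first symbol of each right-hand side matters.
    """
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for lhs, prods in grammar.items():
            for rhs in prods:
                head = rhs[0]
                # A terminal contributes itself; a non-terminal contributes
                # its (current) FIRST set.
                new = {head} if head not in grammar else first[head]
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True
    return first
```

The loop stabilises after a few passes and reproduces the sets above, e.g. `first_sets(GRAMMAR)["S"]` is `{'a', 'b', 'c', 'd'}`.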

F#: Advanced use of active patterns

Here is my problem: I'm trying to write a parser leveraging the power of active patterns in F#. The basic signature of a parsing function is the following
LazyList<Token> -> 'a * LazyList<Token>
Meaning it takes a lazy list of tokens, and returns the result of the parse and the new list of tokens after parsing, so as to follow functional design.
Now, as a next step, I can define active patterns that will help me match some constructs directly in match expressions, thusly
let inline (|QualName|_|) token_stream =
    match parse_qualified_name token_stream with
    | Some id_list, new_stream -> Some (id_list, new_stream)
    | None, new_stream -> None

let inline (|Tok|_|) token_stream =
    match token_stream with
    | Cons (token, tail) -> Some (token.variant, tail)
    | _ -> None
and then match parse results in a high level fashion this way
let parse_subprogram_profile = function
    | Tok (Kw (KwProcedure | KwFunction),
        QualName (qual_name,
            Tok (Punc (OpeningPar), stream_tail))) as token_stream ->
        // some code
    | token_stream -> None, token_stream
The problem I have with this code is that every newly matched construct is nested, which is not readable, especially with a long chain of results to match. I'd like to be able to define a matching operator, such as the :: operator for lists, which would enable me to do the following:
let parse_subprogram_profile = function
    | Tok (Kw (KwProcedure | KwFunction)) ::
      QualName (qual_name) ::
      Tok (Punc (OpeningPar)) :: stream_tail as token_stream ->
        // some code
    | token_stream -> None, token_stream
But I don't think such a thing is possible in F#. I would even accept a design in which I have to call a specific "ChainN" active pattern, where N is the number of elements I want to parse, but I don't know how to design such a function, if it is possible at all.
Any advice or directions regarding this ? Is there an obvious design I didn't see ?
I had something like this in mind, too, but actually gave up going for this exact design. Something you can do is to use actual lists.
In such case, you would have a CombinedList which is made of (firstly) a normal list acting as a buffer and (secondly) a lazy list.
When you want to match against a pattern, you can do:
match tokens.EnsureBuffer(4) with
| el1 :: el2 :: el3 :: el4 :: remaining -> (el1.v - el2.v + el3.v - el4.v, tokens.SetBuffer(remaining))
| el1 :: el2 :: remaining -> (el1.v + el2.v, tokens.SetBuffer(remaining))
where EnsureBuffer and SetBuffer either mutate "tokens" and return it, return it unchanged if no change is required, or return a new instance otherwise.
Would that solve your problem?
François

Unparse AST < O(exp(n))?

Abstract problem description:
The way I see it, unparsing means creating a token stream from an AST which, when parsed again, produces an equal AST.
So parse(unparse(AST)) = AST holds.
This is equivalent to finding a valid parse tree that would produce the same AST.
The language is described by a context-free S-attributed grammar using an eBNF variant.
So the unparser has to find a valid 'path' through the traversed nodes in which all grammar constraints hold. This basically means finding a valid allocation of AST nodes to grammar production rules. This is a constraint satisfaction problem (CSP) in general and could be solved, like parsing, by backtracking in O(exp(n)).
Fortunately parsing can be done in O(n³) using GLR (or better by restricting the grammar). Because the AST structure is so close to the grammar production rule structure, I was really surprised to see an implementation where the runtime is worse than parsing: Xtext uses ANTLR for parsing and backtracking for unparsing.
Questions
Is a context-free S-attributed grammar everything a parser and unparser need to share, or are there further constraints, e.g. on the parsing technique / parser implementation?
I've got the feeling this problem isn't O(exp(n)) in general - could some genius help me with this?
Is this basically a context-sensitive grammar?
Example1:
Area returns AnyObject -> Pedestrian | Highway
Highway returns AnyObject -> "Foo" Car
Pedestrian returns AnyObject -> "Bar" Bike
Car returns Vehicle -> anyObjectInstance.name="Car"
Bike returns Vehicle -> anyObjectInstance.name="Bike"
So if I have an AST containing
AnyObject -> AnyObject -> Vehicle [name="Car"] and I know the root can be Area, I could resolve it to
Area -> Highway -> Car
So the (Highway | Pedestrian) decision depends on the subtree decisions. The problem gets worse when a leaf might, at first sight, be one of several types but has to be a specific one to form a valid path later on.
Example2:
So if I have S-attribute rules returning untyped objects, just assigning some attributes, e.g.
A -> B C {Obj, Obj}
X -> Y Z {Obj, Obj}
B -> "somekeyword" {0}
Y -> "otherkeyword" {0}
C -> "C" {C}
Z -> "Z" {Z}
So if an AST contains
   Obj
  /   \
"0"   "C"
I can unparse it to
  A
 / \
B   C
once I have resolved "C" to C.
While traversing the AST, all constraints I can generate from the grammar are satisfied for both rules, A and X, until I hit "C". This means that for
A -> B | C
B -> "map" {MagicNumber_42}
C -> "foreach" {MagicNumber_42}
both solutions for the tree
     Obj
      |
MagicNumber_42
are valid, and they are considered to have equal semantics, e.g. syntactic sugar.
Further Information:
unparsing in XText
grammar constraints for unparsing, see Serializer: Concrete Syntax Validation
Question 1: no, the grammar itself may not be enough. Take the example of an ambiguous grammar: if you ended up with a unique leftmost (rightmost) derivation (the AST) for a given string, you would somehow have to know how the parser eliminated the ambiguity. Just think of the string 'a+b*c' with the naive expression grammar 'E := E+E | E*E | ...'.
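To make the ambiguity concrete, here is a small Python sketch (my own illustration, not from the question) that counts the distinct parse trees of 'a+b*c' under that naive grammar:

```python
from functools import lru_cache

def count_parses(tokens):
    """Count distinct parse trees of `tokens` under E -> E '+' E | E '*' E | id."""
    ops = {"+", "*"}

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways to derive tokens[i:j] from E.
        n = 0
        if j - i == 1 and tokens[i] not in ops:
            n = 1  # a single identifier
        for k in range(i + 1, j - 1):
            if tokens[k] in ops:
                # Split at the operator: left part and right part are both E.
                n += count(i, k) * count(k + 1, j)
        return n

    return count(0, len(tokens))
```

`count_parses(("a", "+", "b", "*", "c"))` returns 2: the two trees correspond to (a+b)*c and a+(b*c), and the grammar alone cannot tell you which one the parser picked.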
Question 3: none of the grammar examples you give is context-sensitive. The left-hand side of each production is a single non-terminal; there is no context.

Dealing with infinite loops when constructing states for LR(1) parsing

I'm currently constructing LR(1) states from the following grammar.
S->AS
S->c
A->aA
A->b
where A,S are nonterminals and a,b,c are terminals.
This is the construction of I0
I0: S' -> .S, epsilon
---------------
S -> .AS, epsilon
S -> .c, epsilon
---------------
S -> .AS, a
S -> .c, c
A -> .aA, a
A -> .b, b
And I1.
From S, I1: S' -> S., epsilon //DONE
And so on. But when I get to constructing I4...
From a, I4: A -> a.A, a
-----------
A -> .aA, a
A -> .b, b
The problem is
A -> .aA
When I attempt to construct the next state from a, I'm going to once again get the exact same contents as I4, and this continues infinitely. A similar loop occurs with
S -> .AS
So, what am I doing wrong? There has to be some detail that I'm missing, but I've browsed my notes and my book and either can't find or just don't understand what's wrong here. Any help?
I'm pretty sure I figured out the answer. States can point to each other, which eliminates the need to create a new one if its contents already exist. I'd still like someone to confirm this, though.
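That is indeed the standard construction: keep a set of item sets already built, and when a goto produces a set you have seen before, reuse the existing state instead of creating a duplicate. Since there are only finitely many item sets, the worklist drains and the construction terminates. A Python sketch of the idea, using LR(0) items for brevity (LR(1) items just carry an extra lookahead component, but the deduplication works the same way):

```python
from collections import deque

# The grammar from the question, augmented with S' -> S.
GRAMMAR = {
    "S'": [("S",)],
    "S": [("A", "S"), ("c",)],
    "A": [("a", "A"), ("b",)],
}
NONTERMINALS = set(GRAMMAR)

def closure(items):
    """LR(0) closure: for every item with a non-terminal after the dot,
    add that non-terminal's productions with the dot at position 0."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in NONTERMINALS:
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(state, symbol):
    moved = {(l, r, d + 1) for (l, r, d) in state if d < len(r) and r[d] == symbol}
    return closure(moved) if moved else None

def canonical_collection():
    start = closure({("S'", ("S",), 0)})
    states = {start}          # frozensets, so duplicates are detected for free
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for sym in {r[d] for (_, r, d) in state if d < len(r)}:
            nxt = goto(state, sym)
            if nxt and nxt not in states:  # reuse existing state if already built
                states.add(nxt)
                queue.append(nxt)
    return states
```

For this grammar the loop stops after producing 8 LR(0) states: the goto on a from the state containing A -> a.A maps back to that same state, exactly the "states can point to each other" situation you describe.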
