Python list to Parse tree


In Python, I have an input list like the following:
[('S', ['NP', 'VP']),
('A', ['V', 'NP']),
('VP', ['V', 'NP']),
('NP', ['DET', 'NP']),
('N', "'mouse'"),
('NP', "'mouse'"),
('DET', "'the'"),
('V', "'saw'"),
('N', "'Ron'"),
('NP', "'Ron'")]
This is the output of the CYK algorithm for the following grammar:
S -> NP VP
VP -> A NP | V NP
NP -> N N | DET NP | 'chocolate' | 'cat' | 'John' | 'Ron' | 'mouse'
DET -> 'the'
N -> 'chocolate' | 'cat' | 'John' | 'Ron' | 'mouse'
V -> 'saw' | 'bought' | 'ate'
A -> V NP
The string that I want to match is "Ron saw the mouse".
I want to produce output like this:
(S (NP Ron) (VP (V saw) (NP (DET the) (NP mouse))))
I am not sure how the algorithm should be constructed, especially for an ambiguous grammar, which may yield multiple parse trees.
How should I structure the code? Any suggestions on a better approach, with or without recursion?
UPDATE:
I managed to get a single exact parse tree after adding the positions of the parent and child nodes to the input list. But this does not solve my problem for ambiguous sentences.
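One possible approach, sketched here under assumptions rather than taken from the question's own code: let each CYK chart cell keep every bracketed subtree it can build, so an ambiguous sentence simply yields more than one tree for S. The names BINARY, LEXICAL, and cyk_trees are hypothetical, and the grammar is the one listed above, split into binary and lexical rules.

```python
from collections import defaultdict

# Grammar from the question, split into binary rules and lexical rules.
BINARY = [
    ("S", "NP", "VP"), ("VP", "A", "NP"), ("VP", "V", "NP"),
    ("NP", "N", "N"), ("NP", "DET", "NP"), ("A", "V", "NP"),
]
LEXICAL = {"the": ["DET"], "saw": ["V"], "bought": ["V"], "ate": ["V"]}
for noun in ["chocolate", "cat", "John", "Ron", "mouse"]:
    LEXICAL[noun] = ["NP", "N"]


def cyk_trees(words, start="S"):
    """Return every bracketed parse tree for `words` rooted at `start`."""
    n = len(words)
    # chart[(i, j)][X] holds all trees for nonterminal X spanning words[i:j].
    chart = defaultdict(lambda: defaultdict(list))
    for i, word in enumerate(words):
        for sym in LEXICAL.get(word, []):
            chart[(i, i + 1)][sym].append(f"({sym} {word})")
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):          # split point
                for parent, left, right in BINARY:
                    for lt in chart[(i, k)].get(left, []):
                        for rt in chart[(k, j)].get(right, []):
                            chart[(i, j)][parent].append(f"({parent} {lt} {rt})")
    return chart[(0, n)].get(start, [])


for tree in cyk_trees("Ron saw the mouse".split()):
    print(tree)
```

Storing whole trees in the chart keeps the code short, but the number of trees can grow exponentially for highly ambiguous sentences; storing back-pointers and enumerating trees lazily would be the usual refinement.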

Related

How can I simplify a recursive-descent parser?

I have the following simple LL(1) grammar, which describes a language with only three valid sentences: "", "x y" and "z x y":
S -> A x y | ε .
A -> z | ε .
I have constructed the following parsing table, and from it a "naive" recursive-descent parser:
  | x          | y | z          | $
S | S -> A x y |   | S -> A x y | S -> ε
A | A -> ε     |   | A -> z     |
func S():
    if next() in ['x', 'z']:
        A()
        expect('x')
        expect('y')
        expect('$')
    elif next() == '$':
        pass
    else:
        error()
func A():
    if next() == 'x':
        pass
    elif next() == 'z':
        expect('z')
    else:
        error()
However, the function A seems to be more complicated than necessary. All of my tests still pass if it's simplified to:
func A():
    if next() == 'z':
        expect('z')
Is this a valid simplification of A? If so, are there any general rules regarding when it's valid to make simplifications like this one?

That simplification is certainly valid (and quite common).
The main difference is that there is no code associated with the production A→ε. If there are some semantics to implement, you will need to test for the condition. If you only need to ignore the nullable production, you can certainly just return.
Coalescing errors and epsilon productions has one other difference: the error (for example, in the input y) is detected later, after A() returns. Sometimes that makes it harder to produce good error messages (and sometimes it doesn't).
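As a rough check of the claim, here is a minimal runnable Python version of the parser, with a flag that switches between the original and the simplified A. The names accepts, parse_S, parse_A, and language are hypothetical. Note that in this particular grammar S only calls A when the lookahead is x or z, so the error branch of the original A cannot be reached through S at all.

```python
from itertools import product


class ParseError(Exception):
    """Raised when the token sequence is not a sentence of the grammar."""


def accepts(tokens, simplified):
    toks = list(tokens) + ["$"]   # "$" marks the end of input
    pos = 0

    def peek():
        return toks[pos]

    def expect(tok):
        nonlocal pos
        if toks[pos] != tok:
            raise ParseError(f"expected {tok!r}, got {toks[pos]!r}")
        pos += 1

    def parse_A():
        if peek() == "z":
            expect("z")                      # A -> z
        elif simplified or peek() == "x":
            pass                             # A -> ε (simplified form skips the FOLLOW check)
        else:
            raise ParseError(f"unexpected {peek()!r} in A")

    def parse_S():
        if peek() in ("x", "z"):             # S -> A x y
            parse_A()
            expect("x")
            expect("y")
            expect("$")
        elif peek() != "$":                  # "$" alone selects S -> ε
            raise ParseError(f"unexpected {peek()!r} in S")

    try:
        parse_S()
        return True
    except ParseError:
        return False


def language(simplified, max_len=3):
    """All accepted token tuples over {x, y, z} up to max_len tokens."""
    return {s for n in range(max_len + 1)
            for s in product("xyz", repeat=n) if accepts(s, simplified)}


print(language(simplified=False) == language(simplified=True))
```

Both variants accept the same three sentences; what differs in general is only where an error would be reported.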

This context-free grammar is ambiguous and I'm not sure why. The SLR(1) compiler I'm building doesn't work the way I expect it to

I'm building a syntax parser. It is supposed to be SLR(1), but I believe there are some shift/reduce conflicts, or some other kind of conflict, that make the parser reject strings too early. Here is the grammar:
Note: I did left-factor the grammar to see if that was the problem, but that doesn't get rid of the ambiguity. This is the original grammar, without left factoring:
P'' -> P'$
P' -> P
P -> C | C;D
D -> R | RD
R -> pu{P}
C -> I | I;C
I -> h | O | A | R | Z
O -> i(V) | z(V)
Y -> u
V -> S | N
S -> u
N -> u
A -> S=s | S=S | N=X
X -> N | b | L
L -> d(X,X) | s(X,X) | m(X,X)
R -> f(B)t{C} | f(B)t{C}1{C}
B -> e(V,V) | (N<N) | (N>N) | nB | a(B,B) | o(B,B)
Z -> w(B){C} | r(N=0;N<N;N=a(N,1)){C}
I understand this grammar is quite big, but if you could help me here you would be a life saver. Thank you in advance!

Having recognized an I, and with ; as the next symbol, there's a shift-reduce conflict:
The production C -> I;C says to shift the ;.
The production P -> C;D says to reduce via C -> I.
So the grammar is not SLR(1).
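The conflict can be checked mechanically: an SLR(1) parser reduces by C -> I whenever the lookahead is in FOLLOW(C), so it is enough to show that ; is in FOLLOW(C). The sketch below does this for a cut-down fragment of the grammar. The fragment itself is an assumption made for brevity: I and R are reduced to single placeholder terminals h and r, and the function names are hypothetical. The fixed-point loops assume a grammar without ε-productions, which holds for this fragment.

```python
# Fragment of the grammar; "h" and "r" are placeholder terminals.
GRAMMAR = {
    "P": [["C"], ["C", ";", "D"]],
    "C": [["I"], ["I", ";", "C"]],
    "D": [["R"], ["R", "D"]],
    "I": [["h"]],
    "R": [["r"]],
}


def first_sets(g):
    """FIRST sets for a grammar with no ε-productions."""
    first = {nt: set() for nt in g}
    changed = True
    while changed:
        changed = False
        for nt, prods in g.items():
            for rhs in prods:
                head = rhs[0]
                new = first[head] if head in g else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def follow_sets(g, start):
    """FOLLOW sets for a grammar with no ε-productions."""
    first = first_sets(g)
    follow = {nt: set() for nt in g}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for nt, prods in g.items():
            for rhs in prods:
                for i, sym in enumerate(rhs):
                    if sym not in g:
                        continue                      # terminals have no FOLLOW set
                    if i + 1 < len(rhs):
                        nxt = rhs[i + 1]
                        new = first[nxt] if nxt in g else {nxt}
                    else:
                        new = follow[nt]              # sym ends the production
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


follow = follow_sets(GRAMMAR, "P")
# The state reached after I holds both C -> I . and C -> I . ; C
print("shift/reduce conflict on ';':", ";" in follow["C"])
```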

Transform a grammar G into LL(1)

I have the following grammar and I need to convert it to LL(1) grammar
G = (N; T; P; S) N = {S,A,B,C} T = {a, b, c, d}
P = {
S -> CbSb | adB | bc
A -> BdA | b
B -> aCd | ε
C -> Cca | bA | a
}
The point is that I know how to do the conversion when it's just a single production, but I can't find any clear method for solving this on the internet.
Thanks in advance!

Remove left recursion, direct and indirect.
Build an LA(k) table. If there's no ambiguity, the grammar (and the language) is LL(k).
The obvious left recursion in the grammar is:
S ==> C... ==> C...
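For the direct case, the textbook rewrite turns A -> Aα | β into A -> βA' and A' -> αA' | ε. A minimal sketch of that single step, applied to the C productions above, might look like the following. The function name and the list-of-symbols representation are assumptions, and indirect left recursion and left factoring are not handled.

```python
def remove_direct_left_recursion(nt, productions):
    """Rewrite `nt -> productions` so that no alternative starts with `nt`.

    Each production is a list of symbols; the empty list stands for ε.
    """
    recursive = [p[1:] for p in productions if p and p[0] == nt]
    others = [p for p in productions if not p or p[0] != nt]
    if not recursive:
        return {nt: productions}
    tail = nt + "'"                      # fresh helper nonterminal, e.g. C'
    return {
        nt: [p + [tail] for p in others],
        tail: [r + [tail] for r in recursive] + [[]],
    }


# C -> Cca | bA | a
rules = remove_direct_left_recursion("C", [["C", "c", "a"], ["b", "A"], ["a"]])
for head, prods in rules.items():
    print(head, "->", " | ".join(" ".join(p) or "ε" for p in prods))
```

This prints C -> b A C' | a C' and C' -> c a C' | ε, which generate the same strings as the original C productions.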

Can a table-based LL parser handle repetition without right-recursion?

I understand how an LL recursive descent parser can handle rules of this form:
A = B*;
with a simple loop that checks whether to continue looping or not based on whether the lookahead token matches a terminal in the FIRST set of B. However, I'm curious about table based LL parsers: how can rules of this form work there? As far as I know, the only way to handle repetition like this in one is through right-recursion, but that messes up associativity in cases where a right-associative parse tree is not desired.
I'd like to know because I'm currently attempting to write an LL(1) table-based parser generator and I'm not sure how to handle a case like this without changing the intended parse tree shape.

The Grammar
Let's expand your EBNF grammar to simple BNF and assume that b is a terminal and <e> is the empty string:
A -> X
X -> BX
X -> <e>
B -> b
This grammar produces strings of terminal b's of any length.
The LL(1) Table
To construct the table, we will need to generate the first and follow sets (constructing an LL(1) parsing table).
First sets
First(α) is the set of terminals that begin strings derived from any string of grammar symbols α.
First(A) : b, <e>
First(X) : b, <e>
First(B) : b
Follow sets
Follow(A) is the set of terminals a that can appear immediately to the right of a nonterminal A.
Follow(A) : $
Follow(X) : $
Follow(B) : b, $
Table
We can now construct the table based on the sets, $ is the end of input marker.
+---+---------+----------+
| | b | $ |
+---+---------+----------+
| A | A -> X | A -> X |
| X | X -> BX | X -> <e> |
| B | B -> b | |
+---+---------+----------+
The parser action always depends on the top of the parse stack and the next input symbol.
Terminal on top of the parse stack:
    Matches the input symbol: pop stack, advance to the next input symbol
    No match: parse error
Nonterminal on top of the parse stack:
    Parse table contains production: apply production to stack
    Cell is empty: parse error
$ on top of the parse stack:
    $ is the input symbol: accept input
    $ is not the input symbol: parse error
Sample Parse
Let us analyze the input bb. The initial parse stack contains the start symbol and the end marker A $.
+-------+-------+-----------+
| Stack | Input | Action |
+-------+-------+-----------+
| A $ | bb$ | A -> X |
| X $ | bb$ | X -> BX |
| B X $ | bb$ | B -> b |
| b X $ | bb$ | consume b |
| X $ | b$ | X -> BX |
| B X $ | b$ | B -> b |
| b X $ | b$ | consume b |
| X $ | $ | X -> <e> |
| $ | $ | accept |
+-------+-------+-----------+
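The table and the action rules above can be turned into a small driver loop. The following is only a sketch: TABLE, NONTERMINALS, and ll1_parse are hypothetical names, each input character is treated as one token, and the function returns the list of actions so it can be compared with the sample parse.

```python
# LL(1) table from above: (nonterminal, lookahead) -> right-hand side.
TABLE = {
    ("A", "b"): ["X"],
    ("A", "$"): ["X"],
    ("X", "b"): ["B", "X"],
    ("X", "$"): [],            # X -> <e>
    ("B", "b"): ["b"],
}
NONTERMINALS = {"A", "X", "B"}


def ll1_parse(text, start="A"):
    """Parse `text` with the table and return the sequence of actions taken."""
    tokens = list(text) + ["$"]
    stack = ["$", start]       # the top of the stack is the end of the list
    pos = 0
    actions = []
    while True:
        top, look = stack[-1], tokens[pos]
        if top == "$":
            if look != "$":
                raise ValueError(f"unexpected {look!r} after end of sentence")
            actions.append("accept")
            return actions
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:
                raise ValueError(f"no table entry for ({top}, {look!r})")
            stack.pop()
            stack.extend(reversed(rhs))   # leftmost symbol ends up on top
            actions.append(f"{top} -> {''.join(rhs) or '<e>'}")
        elif top == look:
            stack.pop()
            pos += 1
            actions.append(f"consume {look}")
        else:
            raise ValueError(f"expected {top!r}, got {look!r}")


for action in ll1_parse("bb"):
    print(action)
```

For the input bb the printed actions follow the same order as the Action column of the sample parse.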
Conclusion
As you can see, rules of the form A = B* can be parsed without problems. The resulting concrete parse tree for input bb would be:
(A (X (B b) (X (B b) (X <e>))))

Yes, this is definitely possible. The standard method of rewriting to BNF and constructing a parse table is useful for figuring out how the parser should work – but as far as I can tell, what you're asking is how you can avoid the recursive part, which would mean that you'd get the slanted binary tree/linked list form of AST.
If you're hand-coding the parser, you can simply use a loop, using the lookaheads from the parse table that indicate a recursive call to decide to go around the loop once more. (I.e., you could just use while with those lookaheads as the condition.) Then for each iteration, you simply append the constructed subtree as a child of the current parent. In your case, then, A would get several direct B-children.
Now, as I understand it, you're building a parser generator, and it might be easiest to follow the standard procedure, going via plain BNF. However, that's not really an issue; there is no substantive difference between iteration and recursion, after all. You simply have to have a class of “helper rules” that don't introduce new AST nodes, but that rather append their result to the node of the nonterminal that triggered them. So when turning the repetition into X -> BX, rather than constructing X nodes, you have your X rule extend the child-list of the A or X (whichever triggered it) by its own children. You'll still end up with A having several B children, and no X nodes in sight.
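For the hand-coded case, the loop might look like the sketch below. It assumes the toy grammar A = B*, B = b, with one character per token; parse_A and the tuple-based tree shape are hypothetical. The point is only that A ends up with a flat list of B children and no helper nodes.

```python
def parse_A(tokens):
    """Parse A = B* with a loop and return a flat ("A", [children]) node."""
    children = []
    pos = 0
    # Keep looping while the lookahead is in FIRST(B) = {b}.
    while pos < len(tokens) and tokens[pos] == "b":
        children.append(("B", "b"))     # B -> b, attached directly to A
        pos += 1
    if pos != len(tokens):
        raise ValueError(f"unexpected {tokens[pos]!r} at position {pos}")
    return ("A", children)


print(parse_A("bbb"))
```

A generated table-driven parser can reach the same tree shape by marking X as a helper rule whose children are spliced into the parent instead of being wrapped in an X node.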

Why does this very simple grammar cause GLR parsers to choke?

I've tried several different parser generators (Bison, DParser, etc.) that claim to be able to generate GLR parsers i.e., ones that can handle ambiguous grammars. Here is a very simple ambiguous grammar of the type I'm talking about:
START: A | B;
A: C | D;
B: C | D;
C: T1 | T2;
D: T3 | T4;
T1: 't1';
T2: 't2';
T3: 't3';
T4: 't4';
I can generate the parsers just fine, but I get "unresolved ambiguity" errors or just outright crashes when I give the parser input that should be valid. There are no problems of any kind when I change the grammar to an unambiguous version.
What am I not understanding about GLR parsers? I thought the whole point was that in cases of ambiguity, ALL possible parses are tracked up until they merge or reach a dead end. All I need is a parser that can tell me whether there is ANY valid parse of the input.
Thanks for any help.
edit:
This is frustrating. Using %dprec and %merge I've been able to get Bison to handle ambiguous rules and terminals, but it still chokes on very simple but highly pathological pseudo-English grammars of the kind that I need to handle:
S: NP VP | NPA VP;
NPA: D N | NP PP;
NP: D N | NP PP | NPA;
VP: V NP | VP PP;
PP: P NP;
D: "the" | "a";
P: "in" | "with";
V: "saw";
N: "saw" | "girl" | "boy" | "park" | "telescope";
With input "a boy saw a girl", Bison is unable to parse and returns with code 1. Tom on the other hand, deals with this grammar and this input sentence flawlessly, and even naturally handles unknown terminals by just assigning them to all possible terminal types. But unlike Bison, Tom chokes on large grammars. (By "chokes" I mean fails in various different ways. If the failure modes would be helpful, I can report those.)
Anyone have any other ideas?

Unfortunately, bison really insists on producing a (single) parse, so you have to specify some way to merge ambiguous parses. If you don't, and there is more than one possible parse, bison's GLR parser will indeed complain that the parse is ambiguous.
If you don't really care which of the multiple parses is accepted, then it's not too difficult to bend bison to your will. The simplest way is to just assign a different %dprec to every possibly ambiguous production. Bison will then select whichever applicable production happens to have the best precedence.
You can even get bison to tell you about multiple parses with a simple %merge function; there is an example in the bison manual. (The documentation of this feature isn't great but it might be adequate to your needs. If not, feel free to ask a more specific question.)
I don't have much experience with DParser, but the manual indicates that its behaviour when faced with multiple possible parses is similar: the default is to complain, but you can provide a trivial merge function of your own: (The quote comes from Section 12, Ambiguities)
Ambiguities are resolved automatically based on priorities and associativities. In addition, when the other resolution techniques fail, user defined ambiguity resolution is possible. The default ambiguity handler produces a fatal error on an unresolved ambiguity. This behavior can be replaced with a user defined resolvers the signature of which is provided in dparse.h.
Here's an example bison GLR grammar for the second example. I left out the lexer, which is really not relevant (and slightly embarrassing because I was rushing).
%{
#include <stdio.h>
int yylex();
void yyerror(const char* msg);
%}
%error-verbose
%glr-parser
%token WORD_A "a"
%token WORD_BOY "boy"
%token WORD_GIRL "girl"
%token WORD_IN "in"
%token WORD_LIKED "liked"
%token WORD_PARK "park"
%token WORD_SAW "saw"
%token WORD_TELESCOPE "telescope"
%token WORD_THE "the"
%token WORD_WITH "with"
%%
S : NP VP {puts("S: NP VP");} %dprec 1
| NPA VP {puts("S: NPA VP");} %dprec 2
;
NPA: D N {puts("NPA: D N");} %dprec 3
| NP PP {puts("NPA: NP PP");} %dprec 4
;
NP : D N {puts("NP: D N");} %dprec 6
| NP PP {puts("NP: NP PP");} %dprec 7
| NPA {puts("NP: NPA");} %dprec 10
;
VP : V NP {puts("VP: V NP ");} %dprec 11
| VP PP {puts("VP: VP PP");} %dprec 12
;
PP : P NP {puts("PP: P NP");} %dprec 14
;
D : "the" {puts("D: the");} %dprec 15
| "a" {puts("D: a");} %dprec 16
;
P : "in" {puts("P: in");} %dprec 17
| "with" {puts("P: with");} %dprec 18
;
V : "liked" {puts("V: liked");} %dprec 19
| "saw" {puts("V: saw");} %dprec 20
;
N : "girl" {puts("N: girl");} %dprec 21
| "boy" {puts("N: boy");} %dprec 22
| "park" {puts("N: park");} %dprec 23
| "saw" {puts("N: saw");} %dprec 24
| "telescope"{puts("N: telescope");} %dprec 25
;
%%
int main(int argc, char** argv) {
    printf("yyparse returned %d\n", yyparse());
    return 0;
}
Compilation:
$ make ambig2
bison30 -v -d -o ambig2.c ambig2.y
ambig2.y: warning: 6 shift/reduce conflicts [-Wconflicts-sr]
ambig2.y: warning: 10 reduce/reduce conflicts [-Wconflicts-rr]
gcc-4.8 -ggdb -Wall -D_POSIX_C_SOURCE=200809L -std=c99 -c -o ambig2.o ambig2.c
gcc-4.8 ambig2.o -o ambig2
rm ambig2.o ambig2.c
Sample parses:
$ ./ambig2 <<<"a boy saw a girl"
D: a
N: boy
NPA: D N
V: saw
D: a
N: girl
NPA: D N
NP: NPA
VP: V NP
S: NPA VP
yyparse returned 0
$ ./ambig2 <<<"a saw saw the saw in a saw"
D: a
N: saw
NPA: D N
V: saw
D: the
N: saw
NPA: D N
NP: NPA
VP: V NP
P: in
D: a
N: saw
NPA: D N
NP: NPA
PP: P NP
VP: VP PP
S: NPA VP
yyparse returned 0

Your grammar doesn't cause GLR parsers to choke.
You need a GLR parsing engine that delivers what GLR parsers are supposed to deliver: parsing in the face of ambiguities, and handing you the result. Presumably you use additional context to resolve the ambiguities. (You can tangle context-checking into the parsing process if you really insist on avoiding producing context-prevented ambiguities. If you do that, you get the kind of complications that the GCC guys had when they tried to parse C and C++ with LALR.)
Here's the output for OP's problem, given to our DMS Software Reengineering Toolkit's GLR parser generator. I had to define a lexer and a grammar compatible for DMS:
Lexer (defining individual tokens as words; a more scalable version might have defined word class tokens such as D P V N):
%%
%%main
#skip "\s+"
#skip "[\u000d\u000a]+"
#token 'the' "the"
#token 'a' "a"
#token 'in' "in"
#token 'with' "with"
#token 'saw' "saw"
#token 'girl' "girl"
#token 'boy' "boy"
#token 'park' "park"
#token 'telescope' "telescope"
%%
Grammar (DMS doesn't bother with EBNF):
S = NP VP ;
S = NPA VP ;
NPA = D N ;
NPA = NP PP ;
NP = D N ;
NP = NP PP ;
NP = NPA ;
VP = V NP ;
VP = VP PP ;
PP = P NP ;
D = 'the' ;
D = 'a';
P = 'in' ;
P = 'with' ;
V = 'saw' ;
N = 'saw' ;
N = 'girl' ;
N = 'boy' ;
N = 'park' ;
N = 'telescope' ;
Sample file "aboysawagirl.txt"
a boy saw a girl\n
From start to finish, building lexer and parser (about 10 minutes of fumbling...)
Parsing the sample file and dumping the automatically built tree:
C:\DMS\Domains\simpenglish\Tools\Parser\Source>run ..\domainparser ++AST ..\..\Lexer\aboysawagirl.txt
simpenglish Domain Parser Version 2.5.15
Copyright (C) 1996-2013 Semantic Designs, Inc; All Rights Reserved; SD Confidential
Powered by DMS (R) Software Reengineering Toolkit
24 tree nodes in tree.
3 ambiguity nodes in tree.
(AMBIGUITY<S=11>#simpenglish=31##1f35140^0{2} Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(S#simpenglish=1##1f350e0^1#1f35140:1 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(AMBIGUITY<NP=12>#simpenglish=31##1f34ba0^1#1f350e0:1{2} Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(NP#simpenglish=5##1f34b80^1#1f34ba0:1 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
|(D#simpenglish=12##1f34aa0^2#1f34b80:1#1f34b40:1 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| ('a'#simpenglish=22##1f349c0^1#1f34aa0:1[Keyword:0] Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt)'a'
|)D#1f34aa0
|(N#simpenglish=18##1f34b20^2#1f34b80:2#1f34b40:2 Line 1 Column 3 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| ('boy'#simpenglish=27##1f34a80^1#1f34b20:1[Keyword:0] Line 1 Column 3 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt)'boy'
|)N#1f34b20
)NP#1f34b80
(NP#simpenglish=7##1f34c60^1#1f34ba0:2 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
|(NPA#simpenglish=3##1f34b40^2#1f35040:1#1f34c60:1 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| (D#simpenglish=12##1f34aa0^2... [ALREADY PRINTED] ...)
| (N#simpenglish=18##1f34b20^2... [ALREADY PRINTED] ...)
|)NPA#1f34b40
)NP#1f34c60
)AMBIGUITY#1f34ba0
(VP#simpenglish=8##1f34fc0^1#1f350e0:2 Line 1 Column 7 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(V#simpenglish=15##1f34d60^1#1f34fc0:1 Line 1 Column 7 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
|('saw'#simpenglish=25##1f34b00^2#1f34d60:1#1f34d40:1[Keyword:0] Line 1 Column 7 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt)'saw'
)V#1f34d60
(AMBIGUITY<NP=12>#simpenglish=31##1f34f00^2#1f34f80:2#1f34fc0:2{2} Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
|(NP#simpenglish=5##1f34e60^1#1f34f00:1 Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| (D#simpenglish=12##1f34da0^2#1f34e60:1#1f34de0:1 Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| ('a'#simpenglish=22##1f34ce0^1#1f34da0:1[Keyword:0] Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt)'a'
| )D#1f34da0
| (N#simpenglish=17##1f34dc0^2#1f34e60:2#1f34de0:2 Line 1 Column 13 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| ('girl'#simpenglish=26##1f34d80^1#1f34dc0:1[Keyword:0] Line 1 Column 13 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt)'girl'
| )N#1f34dc0
|)NP#1f34e60
|(NP#simpenglish=7##1f34f20^1#1f34f00:2 Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| (NPA#simpenglish=3##1f34de0^1#1f34f20:1 Line 1 Column 11 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
| (D#simpenglish=12##1f34da0^2... [ALREADY PRINTED] ...)
| (N#simpenglish=17##1f34dc0^2... [ALREADY PRINTED] ...)
| )NPA#1f34de0
|)NP#1f34f20
)AMBIGUITY#1f34f00
)VP#1f34fc0
)S#1f350e0
(S#simpenglish=2##1f35040^1#1f35140:2 Line 1 Column 1 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(NPA#simpenglish=3##1f34b40^2... [ALREADY PRINTED] ...)
(VP#simpenglish=8##1f34f80^1#1f35040:2 Line 1 Column 7 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
(V#simpenglish=15##1f34d40^1#1f34f80:1 Line 1 Column 7 File C:/DMS/Domains/simpenglish/Tools/Lexer/aboysawagirl.txt
|('saw'#simpenglish=25##1f34b00^2... [ALREADY PRINTED] ...)
)V#1f34d40
(AMBIGUITY<NP=12>#simpenglish=31##1f34f00^2... [ALREADY PRINTED] ...)
)VP#1f34f80
)S#1f35040
)AMBIGUITY#1f35140
Your simple English grammar parses your example sentence in several different ways.
This is a lot more spectacular with a full C++11 grammar.
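If the only requirement is a yes/no answer on whether any parse exists, as the question states, a plain recognizer is enough and no ambiguity resolution is needed. The sketch below is not GLR: it swaps in a small Earley recognizer, written in Python, for the pseudo-English grammar from the question. GRAMMAR and recognizes are hypothetical names, and the code assumes a grammar without empty productions, which holds here.

```python
GRAMMAR = {
    "S": [["NP", "VP"], ["NPA", "VP"]],
    "NPA": [["D", "N"], ["NP", "PP"]],
    "NP": [["D", "N"], ["NP", "PP"], ["NPA"]],
    "VP": [["V", "NP"], ["VP", "PP"]],
    "PP": [["P", "NP"]],
    "D": [["the"], ["a"]],
    "P": [["in"], ["with"]],
    "V": [["saw"]],
    "N": [["saw"], ["girl"], ["boy"], ["park"], ["telescope"]],
}


def recognizes(words, start="S"):
    """Earley recognizer; an item is (head, rhs, dot, origin)."""
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    for rhs in GRAMMAR[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            head, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in GRAMMAR:                       # predict
                    new, target = [(sym, tuple(r), 0, i) for r in GRAMMAR[sym]], i
                elif i < n and words[i] == sym:          # scan
                    new, target = [(head, rhs, dot + 1, origin)], i + 1
                else:
                    continue
            else:                                        # complete
                new = [(h, r, d + 1, o) for (h, r, d, o) in chart[origin]
                       if d < len(r) and r[d] == head]
                target = i
            for item in new:
                if item not in chart[target]:
                    chart[target].add(item)
                    if target == i:
                        agenda.append(item)
    return any(h == start and d == len(r) and o == 0
               for (h, r, d, o) in chart[n])


print(recognizes("a boy saw a girl".split()))
```

Because the chart stores each item at most once per position, the left-recursive and overlapping rules in this grammar do not cause any blow-up in the recognizer.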
