Parse Tree Given Grammar

I am given the following sentence:
The bird tried to escape from the strong cage.
And the following grammar rules:
s->np, vp
np->det, n
np->det, adjp
adjp->adj, n
pp->p, np
comp->p, vp
vp->v, pp
vp->v, comp
I tried a leftmost derivation to build the tree, and also a bottom-up analysis.
The question I have is whether it is possible to have two S nodes that lead up to the root of a single S.
More concretely, is this acceptable:
       s
     /   \
    s     s
   / \   / \
  NP VP VP  NP

According to your grammar, a prepositional phrase (pp) consists of a preposition (p) followed by a noun phrase (np). But your parse tree shows pps consisting only of a preposition ("to" and "from"). If you do the bottom-up parse with this in mind, you should arrive at the correct answer.
To answer your direct question, your grammar does not allow s to consist of two ss; only of a noun phrase (np) followed by a verb phrase (vp).
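Since every rule in this grammar is binary, a CYK-style bottom-up chart parse can be run mechanically. Here is a small Python sketch (my own illustration, not part of the original answer; the part-of-speech tags assigned to the words are my assumption):

```python
# CYK-style bottom-up parse; works directly because every rule is binary (X -> A B).
rules = {
    ("np", "vp"): "s",
    ("det", "n"): "np",
    ("det", "adjp"): "np",
    ("adj", "n"): "adjp",
    ("p", "np"): "pp",
    ("p", "vp"): "comp",
    ("v", "pp"): "vp",
    ("v", "comp"): "vp",
}
# Assumed part-of-speech tags for the example sentence.
pos = {"the": "det", "bird": "n", "tried": "v", "to": "p",
       "escape": "v", "from": "p", "strong": "adj", "cage": "n"}

def parse(words):
    n = len(words)
    # chart[i][j] holds the categories that span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1].add(pos[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a in chart[i][k]:
                    for b in chart[k][j]:
                        if (a, b) in rules:
                            chart[i][j].add(rules[(a, b)])
    return chart[0][n]

sentence = "the bird tried to escape from the strong cage".split()
print(parse(sentence))  # -> {'s'}
```

The chart derives a single s over the whole sentence: np (the bird) followed by vp (tried to escape from the strong cage), where comp covers to escape from the strong cage. No rule ever combines two s nodes.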


Understanding grammar for Parse trees

I am just trying to understand a natural language interface for relational databases proposed by Fei Li (2014) (available as a GitHub project). Specifically, I don't understand the grammar they define for a ParseTree of a natural language query to a database. This question is somewhat like this one, but with a more complex grammar.
Background: A natural language sentence can be parsed into a ParseTree (a common library for this is the Stanford Parser), which describes the grammatical relations of the words in a sentence.
The grammar for a valid ParseTree is:
Q -> (SClause)(ComplexCondition)*
SClause -> SELECT + GNP
ComplexCondition -> ON + (leftSubtree*rightSubtree)
leftSubtree -> GNP
rightSubtree -> GNP | VN | MIN | MAX
GNP -> (FN + GNP) | NP
NP -> NN + (NN)*(condition)*
condition -> VN | (ON + VN)
where
Q represents an entire query tree
+ a parent-child relationship
* a sibling relationship
SN is a SELECT node
ON is an OPERATOR node (e.g. =, <=)
FN is a FUNCTION node (e.g. AVG)
NN is a NAME node (e.g. a column in a db table)
VN is a VALUE node (i.e. a value in a column in a db table)
ComplexCondition must have one ON with a leftSubtree and rightSubtree
NP is one NN whose children are multiple NNs and conditions.
My questions:
Why is condition defined as VN or (ON + VN)? This would mean that something like the digit 5 by itself can be a condition. It would make more sense if only the latter, (ON + VN), were a condition (e.g. > 5).
How can a rightSubtree be just a function (e.g. MIN)? By the way, I am reading the pipe | as logical or.
I understand that GNP is recursively defined, but at a terminal node the GNP must be just an NP node, right? But an NP is defined as something that has children... how can that be?
The authors of the GitHub project quoted above state: "Take a Value Node (VN) for example, according to the grammar, it is invalid if and only if it has children." I don't know how to infer this from the grammar.
Thanks for the help

LR(0) parsing conflicts (shift/shift somehow)

I ran into a question while doing one of the exercises for my parsing course at university. The exercise involves constructing a parsing table using canonical LR items.
The given grammar production rules go as follow:
S -> NP VP
NP -> NP PP
NP -> a n
VP -> v
VP -> VP NP
VP -> VP PP
PP -> p NP
Now the problem is, when I try to construct the table by figuring out the states, I end up with this problem:
At I_0:
S' -> .S
S -> .NP VP
NP -> .NP PP
NP -> .a n
In the above case, I have to perform a goto on NP for both S -> .NP VP and NP -> .NP PP. This seems to indicate that when I create the parsing table, there would be two states that NP could bring the parser to. Am I missing something here?
Please note that while I can see this not being a problem with a lookahead, this exercise is specifically about LR(0).
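For what it's worth, the goto on NP from I_0 takes both items into the same single state: an LR(0) state is the closure of the set of all advanced items, not one state per item. A quick Python sketch of the standard construction (my own illustration, not from the exercise):

```python
# Standard LR(0) items represented as (lhs, rhs, dot_position).
GRAMMAR = [
    ("S'", ("S",)),
    ("S",  ("NP", "VP")),
    ("NP", ("NP", "PP")),
    ("NP", ("a", "n")),
    ("VP", ("v",)),
    ("VP", ("VP", "NP")),
    ("VP", ("VP", "PP")),
    ("PP", ("p", "NP")),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(items):
    """Add X -> .gamma for every nonterminal X that appears right after a dot."""
    items = set(items)
    while True:
        new = {(l, r, 0)
               for (lhs, rhs, dot) in items
               if dot < len(rhs) and rhs[dot] in NONTERMINALS
               for (l, r) in GRAMMAR if l == rhs[dot]}
        if new <= items:
            return frozenset(items)
        items |= new

def goto(state, symbol):
    # Advance the dot over `symbol` in every item that allows it, then close.
    return closure({(l, r, d + 1) for (l, r, d) in state
                    if d < len(r) and r[d] == symbol})

I0 = closure({("S'", ("S",), 0)})
state_after_NP = goto(I0, "NP")
for l, r, d in sorted(state_after_NP):
    print(l, "->", " ".join(r[:d]), ".", " ".join(r[d:]))
```

Both S -> NP . VP and NP -> NP . PP end up in this one state (together with closure items such as VP -> . v), so there is only one goto entry for NP and no conflict.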

How is this context free grammar using difference lists in Prolog functioning?

I'm reading this tutorial on context free grammars in Prolog, and they mention at the bottom of the page implementing a context free grammar in Prolog using difference lists, with the following code block included:
s(X,Z):- np(X,Y), vp(Y,Z).
np(X,Z):- det(X,Y), n(Y,Z).
vp(X,Z):- v(X,Y), np(Y,Z).
vp(X,Z):- v(X,Z).
det([the|W],W).
det([a|W],W).
n([woman|W],W).
n([man|W],W).
v([shoots|W],W).
It mentions:
Consider these rules carefully. For example, the s rule says: I know that the pair of lists X and Z represents a sentence if (1) I can consume X and leave behind a Y, and the pair X and Y represents a noun phrase, and (2) I can then go on to consume Y, leaving Z behind, and the pair Y Z represents a verb phrase. The np rule and the second of the vp rules work similarly.
And
Think of the first list as what needs to be consumed (or if you prefer: the input list), and the second list as what we should leave behind (or: the output list). Viewed from this (rather procedural) perspective, the difference list
[a,woman,shoots,a,man] [].
represents the sentence a woman shoots a man because it says: If I consume all the symbols on the left, and leave behind the symbols on the right, then I have the sentence I am interested in. That is, the sentence we are interested in is the difference between the contents of these two lists.
That’s all we need to know about difference lists to rewrite our recogniser. If we simply bear in mind the idea of consuming something and leaving something behind, we obtain the following recogniser:
That is given as an explanation, but it just isn't doing anything to clarify this code for me. I understand it's more efficient than using append/3, but as for the notion of consuming things and leaving others behind, it seems much more confusing than the previous append/3 implementation, and I just can't make heads or tails of it. Furthermore, when I copy and paste the code into a Prolog visualizer to try to understand it, the visualizer reports errors. Could anyone shed some light on this?
Listing as facts
Let's try to explain this with a contrasting example. First, let's specify the nouns, verbs, etc. with simple facts:
det(the).
det(a).
n(woman).
n(man).
v(shoots).
Now we can implement a noun phrase np as:
np([X,Y]) :-
    det(X),
    n(Y).
In other words, we say "a noun phrase is a sequence of two words, the first being a det, the second an n". And this will work: if we query np([a,woman]), it will succeed, etc.
But now we need to do something more advanced: define the verb phrase. There are two possible verb phrases. The one with a verb and a noun phrase was originally defined as:
vp(X,Z):- v(X,Y),np(Y,Z).
We could define it as:
vp([X|Y]) :-
    v(X),
    np(Y).
And the one with only a single verb:
vp(X,Z):- v(X,Z).
That can be converted into:
vp([X]) :-
    v(X).
The guessing problem
The problem, however, is that the two variants have a different number of words: there are verb phrases with one word and with three words. That's not really a problem yet, but now say (I know this is not correct English) there existed a sentence form defined as a vp followed by an np; this would be:
s(X,Z):- vp(X,Y), np(Y,Z).
in the original grammar.
The problem is that if we want to transform this into our new way of representing it, we need to know how many words vp will consume (how many words will be eaten by vp). We cannot know this in advance: at this point we don't know much about the sentence, so we cannot guess whether vp will eat one word or three.
We might of course guess the number of words with:
s([X|Y]) :-
    vp([X]),
    np(Y).
s([X,Y,Z|T]) :-
    vp([X,Y,Z]),
    np(T).
But I hope you can imagine that if you defined verb phrases with 1, 3, 5 and 7 words, things would get problematic. Another way to solve this is to leave the splitting to Prolog:
s(S) :-
    append(VP, NP, S),
    vp(VP),
    np(NP).
Now Prolog will first guess how to subdivide the sentence into two parts and then try to match each part. But the problem is that for a sentence with n words, there are n+1 possible breakpoints.
So Prolog will for instance first split it as:
VP=[],NP=[shoots,the,man,the,woman]
(remember we swapped the order of the verb phrase and noun phrase). Evidently vp will not be very happy if it gets an empty list, so that split is rejected easily. But next it tries:
VP=[shoots],NP=[the,man,the,woman]
Now vp is happy with only shoots, but it takes some computational effort to realize that, and np is not amused with the long part it is given. So Prolog backtracks again:
VP=[shoots,the],NP=[man,the,woman]
Now vp complains again that it has been given too many words. Finally, Prolog splits it correctly with:
VP=[shoots,the,man],NP=[the,woman]
The point is that this requires a large number of guesses, and for each of these guesses vp and np have to do work as well. For a really complicated grammar, vp and np might split the sentence further, resulting in a huge amount of trial and error.
The real problem is that append/3 has no "semantic" clue about how to split the sentence, so it tries all possibilities. We would much rather have an approach where vp itself can tell us which share of the sentence it actually wants.
Furthermore, if you have to split the sentence into 3 parts, the number of ways to do so grows to O(n^2), and so on. So guessing will not do the trick.
You might also try to generate a random verb phrase, and then hope the verb phrase matches:
s(S) :-
    vp(VP),
    append(VP, NP, S),
    np(NP).
But in that case the number of guessed verb phrases will blow up exponentially. Of course you can give "hints" etc. to speed up the process, but it will still take some time.
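To see how much blind splitting append/3 does, the same search can be sketched in Python (a toy illustration of the splitting step only, not of the whole parse):

```python
def splits(words):
    """Every way to cut a sentence into a prefix (VP) and a suffix (NP)."""
    return [(words[:i], words[i:]) for i in range(len(words) + 1)]

sentence = ["shoots", "the", "man", "the", "woman"]
for vp, np in splits(sentence):
    print(vp, np)
# A 5-word sentence yields 6 candidate splits; only one of them,
# VP=[shoots,the,man], NP=[the,woman], actually parses.
```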
The solution
What you want to do is to provide the part of the sentence to every predicate such that a predicate looks like:
predicate(Subsentence,Remaining)
Subsentence is the list of words available to that predicate. For a noun phrase, it might look like [the,woman,shoots,the,man]. Every predicate consumes the words it is interested in: the words up to a certain point. In this case, the noun phrase is only interested in [the,woman], because that is a noun phrase. To allow the remaining parsing, it returns the remaining part, [shoots,the,man], in the hope that some other predicate can consume the remainder of the sentence.
For our table of facts that is easy:
det([the|W],W).
det([a|W],W).
n([woman|W],W).
n([man|W],W).
v([shoots|W],W).
It thus means that if you have a sentence [the,woman,shoots,...] and you ask det/2 whether it starts with a determiner, it will say: "yes, the is a determiner, but the remaining part [woman,shoots,...] is not part of the determiner; please match it with something else."
This matching is done because a list is represented as a linked list. [the,woman,shoots,...], is actually represented as [the|[woman|[shoots|...]]] (so it points to the next "sublist"). If you match:
[the|[woman|[shoots|...]]]
det([the|W] ,W)
It will unify [woman|[shoots|...]] with W and thus result in:
det([the|[woman|[shoots|...]]], [woman|[shoots|...]]).
It thus returns the remaining list, having consumed the the part.
Higher level predicates
Now in case we define the noun phrase:
np(X,Z):- det(X,Y), n(Y,Z).
and we again call it with [the,woman,shoots,...], X unifies with that list. det is called first and consumes the, without any need for backtracking. Next, Y is equal to [woman,shoots,...]; now n/2 consumes woman and returns [shoots,...]. This is also the result np returns, and another predicate will have to consume it.
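The same consume-and-return-the-remainder idea can be mimicked outside Prolog. Here is a small Python analogue (my own sketch; it tries alternatives in order but has none of Prolog's full backtracking): each "predicate" takes the input list and returns the unconsumed remainder, or None on failure.

```python
def word(w):
    """Consume one expected word, return the remainder."""
    def p(ws):
        return ws[1:] if ws[:1] == [w] else None
    return p

def alt(*parsers):
    """Try alternatives in order, like Prolog's multiple clauses."""
    def p(ws):
        for q in parsers:
            rest = q(ws)
            if rest is not None:
                return rest
        return None
    return p

def seq(*parsers):
    """Thread the remainder through a sequence, like chained difference lists."""
    def p(ws):
        for q in parsers:
            ws = q(ws)
            if ws is None:
                return None
        return ws
    return p

det = alt(word("the"), word("a"))
n   = alt(word("woman"), word("man"))
v   = word("shoots")
np  = seq(det, n)
vp  = alt(seq(v, np), v)   # vp --> v, np.  and  vp --> v.
s   = seq(np, vp)

print(s(["a", "woman", "shoots", "a", "man"]))  # -> [] (everything consumed)
print(s(["woman", "shoots"]))                   # -> None (no determiner)
```

A sentence is recognised exactly when the remainder comes back empty, which mirrors the Prolog query s([a,woman,shoots,a,man], []).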
Usefulness
Say you introduce your name as an additional noun:
n([doug,smith|W],W).
(Sorry for using lowercase, but otherwise Prolog sees these as variables.)
It will simply try to match the first two words with doug and smith, and if that succeeds, try to match the remainder of the sentence. So one can make a sentence like [the,doug,smith,shoots,the,woman]. (Sorry about that; furthermore, in English some noun phrases map directly to a noun, np(X,Y) :- n(X,Y), so the the could be dropped in a more complete English grammar.)
Guessing completely eliminated?
Is the guessing completely eliminated? No. It is still possible that there is overlap in consuming. For instance you might add:
n([doug,smith|W],W).
n([doug|W],W).
In that case, if you query for [the,doug,smith,shoots,the,woman], it will first consume/eat the in det; next it will look for a noun to consume from [doug,smith,...]. There are two candidates. Prolog will first try to eat only doug and match [smith,shoots,...] as an entire verb phrase, but since smith is not a verb, it will backtrack, reconsider eating only one word, and thus decide to eat both doug and smith as a noun instead.
The point is that when using append, Prolog would have to backtrack as well.
Conclusion
By using difference lists, a predicate can eat an arbitrary number of words. The remainder is returned so that other sentence parts, like the verb phrase, can aim to consume it. Furthermore, the list is always fully grounded, so we definitely do not brute-force all kinds of verb phrases first.
This answer is a follow-up to the answer by @mat. Let's proceed step by step.
We start with the code shown in @mat's answer:
s --> np, vp.
np --> det, n.
vp --> v, np.
vp --> v.
det --> [the].
det --> [a].
n --> [woman].
n --> [man].
v --> [shoots].
So far, we have used (',')/2: (A,B) means sequence A followed by sequence B.
Next, we will also use ('|')/2: (A|B) means sequence A or sequence B.
For information on control-constructs in grammar bodies read this manual section.
s --> np, vp.
np --> det, n.
vp --> v, np | v.
det --> [the] | [a].
n --> [woman] | [man].
v --> [shoots].
We can make the code more concise by "inlining" a little:
s --> np, vp.
np --> ([the] | [a]), ([woman] | [man]).
vp --> v,np | v.
v --> [shoots].
How about some more inlining, as suggested by @mat in his comment:
s --> np, [shoots], (np | []).
np --> ([the] | [a]), ([woman] | [man]).
Take it to the max! (The following could be written as a one-liner.)
?- phrase(( ([the] | [a]),
            ([woman] | [man]),
            [shoots],
            (  ([the] | [a]), ([woman] | [man])
            |  []
            )
          ),
          Ws).
Different formulations all have their upsides and downsides; e.g., very compact code is harder to extend, but may be required when space is scarce, like when putting code on presentation slides.
Sample query:
?- phrase(s,Ws).
Ws = [the, woman, shoots, the, woman]
; Ws = [the, woman, shoots, the, man]
; Ws = [the, woman, shoots, a, woman]
; Ws = [the, woman, shoots, a, man]
; Ws = [the, woman, shoots]
; Ws = [the, man, shoots, the, woman]
; Ws = [the, man, shoots, the, man]
; Ws = [the, man, shoots, a, woman]
; Ws = [the, man, shoots, a, man]
; Ws = [the, man, shoots]
; Ws = [a, woman, shoots, the, woman]
; Ws = [a, woman, shoots, the, man]
; Ws = [a, woman, shoots, a, woman]
; Ws = [a, woman, shoots, a, man]
; Ws = [a, woman, shoots]
; Ws = [a, man, shoots, the, woman]
; Ws = [a, man, shoots, the, man]
; Ws = [a, man, shoots, a, woman]
; Ws = [a, man, shoots, a, man]
; Ws = [a, man, shoots]. % 20 solutions
I know exactly what you mean. It is quite hard at first to think in terms of list differences. The good news is that you usually do not have to do this.
There is a built-in formalism called Definite Clause Grammars (DCGs) which makes it completely unnecessary to manually encode list differences in cases like yours.
In your example, I recommend you simply write this as:
s --> np, vp.
np --> det, n.
vp --> v, np.
vp --> v.
det --> [the].
det --> [a].
n --> [woman].
n --> [man].
v --> [shoots].
and accept the fact that within DCGs, [T] represents the terminal T, and (',')/2 is read as "and then". That's how to describe lists with DCGs.
You can already use this to ask for all solutions, using the phrase/2 interface to DCGs:
?- phrase(s, Ls).
Ls = [the, woman, shoots, the, woman] ;
Ls = [the, woman, shoots, the, man] ;
Ls = [the, woman, shoots, a, woman] ;
etc.
It is hard to understand this in procedural terms at first, so I suggest that you do not try to. Instead, focus on a declarative reading of the grammar and see that it describes lists.
Later, you can go into more implementation details. Start with the way simple terminals are translated, and then move on to more advanced grammar constructs: rules containing a single terminal, then rules containing a terminal and a nonterminal, etc.
Difference lists work like this (a layman's explanation).
Consider using append to join two trains X and Y:
X = {1}[2][3][4]
Y = {a}[b][c]
{ } represents the compartment holding the engine, or head.
[ ] represents the compartments, or elements, in the tail. Assume that we can remove the engine from one compartment and put it into another.
Append proceeds like this:
The new train Z starts as Y, i.e., {a}[b][c].
Next, the last compartment is removed from X, the engine from Z's head is moved into it, and it becomes the new head of Z:
Z = {4}[a][b][c]
The same process is repeated:
Z = {3}[4][a][b][c]
Z = {2}[3][4][a][b][c]
Z = {1}[2][3][4][a][b][c]
our new long train.
Introducing difference lists is like having a tow hook at the end of X that can readily fasten to Y.
The final hook is discarded.
n([man|W],W).
W is the hook here; unifying W with the head of the successor list is the fastening process.

Recognizing permutations of a finite set of strings in a formal grammar

Goal: find a way to formally define a grammar that recognizes elements from a set 0 or 1 times in any order. Subsequently, I want to parse it and generate an AST as well.
For example: Say the set of valid strings in my language is {A, B, C}. I want to define a grammar that recognizes all valid permutations of any number of those elements.
Syntactically valid strings would include:
(the empty string)
A,
B A, and
C A B
Syntactically invalid strings would include:
A A, and
B A C B
To be clear, defining all possible permutations explicitly in a CFG is unacceptable for my purposes, since larger sets would be impossible to maintain.
From what I understand, such a language fails the pumping lemma for context free languages, so the solution will not be context free or regular.
Update
What I'm after is called a "permutation language", which Benedek Nagy has done some theoretical work on as an extension to context free languages.
Regarding a parser generator, I've only found talk of implementing parsers with a permutation phase (link). Permutation phrases evidently have an exponential lower bound on the size of the resulting CFG, and I haven't found any parser generators that support them anyhow.
A sort-of solution to this problem was written in ANTLR. It uses semantic predicates to 'code around' the issue.
Assuming that the set of alternative strings is fixed and known in advance, say of size n, one can come up with a (non context-free) grammar of size O(n!). This is not asymptotically smaller than enumerating all permutations, so I suppose it cannot be considered a good solution. I believe that this grammar can be reformulated as a context-sensitive grammar (although in the form I'm suggesting below it is not).
For the example {a, b, c} mentioned in the question, one such grammar is the following. I'm using lower case letters for terminal symbols and upper case letters for non-terminals, as is customary. S is the initial non-terminal symbol.
S ::= XabcY
XabcY ::= aXbcY | bXacY | cXabY
XabY ::= ab | ba
XacY ::= ac | ca
XbcY ::= bc | cb
Non-terminals X and Y enclose the substring in the production which has not been finalized yet; this substring will eventually be replaced by a permutation of the terminals that are given between X and Y (in some arbitrary order).
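Note that while writing a compact grammar is hard, merely recognising the language is easy with a little state. A minimal Python sketch, assuming the fixed alphabet {A, B, C} from the question: accept each symbol at most once, in any order.

```python
def recognise(tokens, alphabet=frozenset({"A", "B", "C"})):
    """Accept any permutation of a subset of `alphabet` (each symbol 0 or 1 times)."""
    seen = set()
    for t in tokens:
        if t not in alphabet or t in seen:
            return False
        seen.add(t)
    return True

print(recognise([]))                    # -> True  (the empty string)
print(recognise(["C", "A", "B"]))       # -> True
print(recognise(["A", "A"]))            # -> False (repeated symbol)
print(recognise(["B", "A", "C", "B"]))  # -> False
```

An AST could be built in the same pass by recording the order in which the symbols appear, which sidesteps the grammar-size blow-up entirely when a hand-written recogniser is acceptable.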

Difference between constituency parser and dependency parser

What is the difference between a constituency parser and a dependency parser? What are the different usages of the two?
A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled. For a simple sentence "John sees Bill", a constituency parse would be:
                Sentence
                   |
     +-------------+-------------+
     |                           |
Noun Phrase                 Verb Phrase
     |                           |
   John                 +--------+--------+
                        |                 |
                      Verb           Noun Phrase
                        |                 |
                      sees              Bill
A dependency parse connects words according to their relationships. Each vertex in the tree represents a word, child nodes are words that are dependent on the parent, and edges are labeled by the relationship. A dependency parse of "John sees Bill", would be:
            sees
             |
     +-------+-------+
     | subject       | object
     |               |
   John            Bill
You should use the parser type that gets you closest to your goal. If you are interested in sub-phrases within the sentence, you probably want the constituency parse. If you are interested in the dependency relationships between words, then you probably want the dependency parse.
The Stanford parser can give you either (online demo). In fact, the way it really works is to always parse the sentence with the constituency parser, and then, if needed, it performs a deterministic (rule-based) transformation on the constituency parse tree to convert it into a dependency tree.
More can be found here:
http://en.wikipedia.org/wiki/Phrase_structure_grammar
http://en.wikipedia.org/wiki/Dependency_grammar
