What is the difference between a constituency parser and a dependency parser? What are the different usages of the two?
A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled. For a simple sentence "John sees Bill", a constituency parse would be:
             Sentence
                |
   +------------+-------------+
   |                          |
Noun Phrase              Verb Phrase
   |                          |
 John                 +-------+--------+
                      |                |
                    Verb          Noun Phrase
                      |                |
                    sees             Bill
A dependency parse connects words according to their relationships. Each vertex in the tree represents a word, child nodes are words that are dependent on the parent, and edges are labeled by the relationship. A dependency parse of "John sees Bill", would be:
          sees
            |
    +-------+-------+
    |               |
 subject         object
    |               |
  John            Bill
You should use the parser type that gets you closest to your goal. If you are interested in sub-phrases within the sentence, you probably want the constituency parse. If you are interested in the dependency relationships between words, then you probably want the dependency parse.
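To make the structural difference concrete, here is a toy sketch in Python. The nested-tuple encoding is purely illustrative (it is not any particular library's output format):

```python
# Toy encodings of the two parses of "John sees Bill" shown above.
constituency = ("S",
                ("NP", "John"),
                ("VP",
                 ("V", "sees"),
                 ("NP", "Bill")))

dependency = ("sees",
              ("subject", "John"),
              ("object", "Bill"))

def leaves(tree):
    """Yield the terminal words of a constituency tree, left to right."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree[1:]:
            yield from leaves(child)

# The constituency tree stores words only at the leaves;
# the dependency tree stores a word at every node.
print(list(leaves(constituency)))              # ['John', 'sees', 'Bill']
print([child[1] for child in dependency[1:]])  # ['John', 'Bill']
```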
The Stanford parser can give you either (online demo). In fact, the way it really works is to always parse the sentence with the constituency parser, and then, if needed, it performs a deterministic (rule-based) transformation on the constituency parse tree to convert it into a dependency tree.
More can be found here:
http://en.wikipedia.org/wiki/Phrase_structure_grammar
http://en.wikipedia.org/wiki/Dependency_grammar
I am coding an expression parser along with a visualization of it, meaning that every step of the recursive descent parse, i.e. the construction of the AST, will be visualized, like a tiny version of VisuAlgo.
// Expression grammar
Goal -> Expr
Expr -> Expr + Term
| Expr - Term
| Term
Term -> Term * Factor
| Term / Factor
| Factor
Factor -> (Expr)
| num
| name
So I am wondering what data structure could easily be used to store every step of constructing the AST, and how to implement the visualization of each step. I have searched some similar questions and have implemented a recursive descent parser before, but I just can't figure this out. I would appreciate it if anyone could help.
This SO answer shows how to parse and build a tree as you parse.
You could easily just print the tree, or tree fragments, at each point where they are created.
To print the tree, just walk it recursively and print each node with indentation equal to the depth of recursion.
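A minimal sketch of both pieces: a recursive-descent parser for the expression grammar in the question that builds an AST of Node objects, plus the indented printer described above. All names here are illustrative:

```python
import re

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def tokenize(text):
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", text)

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():                   # Expr -> Expr (+|-) Term | Term
        node = term()
        while peek() in ("+", "-"):
            node = Node(eat(), [node, term()])
        return node

    def term():                   # Term -> Term (*|/) Factor | Factor
        node = factor()
        while peek() in ("*", "/"):
            node = Node(eat(), [node, factor()])
        return node

    def factor():                 # Factor -> (Expr) | num | name
        if peek() == "(":
            eat()
            node = expr()
            eat()                 # the closing ")"
            return node
        return Node(eat())

    return expr()

def print_tree(node, depth=0):
    """Walk the tree recursively; indentation equals recursion depth."""
    print("  " * depth + node.label)
    for child in node.children:
        print_tree(child, depth + 1)

print_tree(parse(tokenize("(a + 2) * b")))
```

To animate the construction, record a snapshot (or call print_tree on the partial tree) every time a Node is created; a list of such snapshots is enough to replay the build step by step.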
I am trying to understand a natural language interface for relational databases proposed by Fei Li (2014) (available as a github project). Specifically, I don't understand the grammar they define for a ParseTree of a natural language query against a database. This question is somewhat like this one, but with a more complex grammar.
Background: A natural language sentence can be parsed into a ParseTree (a common library for this is the Stanford Parser), which describes the grammatical relations between the words in the sentence.
The grammar for a valid ParseTree is:
Q -> (SClause)(ComplexCondition)*
SClause -> SELECT + GNP
ComplexCondition -> ON + (leftSubtree*rightSubtree)
leftSubtree -> GNP
rightSubtree -> GNP | VN | MIN | MAX
GNP -> (FN + GNP) | NP
NP -> NN + (NN)*(condition)*
condition -> VN | (ON + VN)
where
Q represents an entire query tree
+ a parent-child relationship
* a sibling relationship
SN is a SELECT node
ON is an OPERATOR node (e.g. =, <=)
FN is a FUNCTION node (e.g. AVG)
NN is a NAME node (e.g. a column in a db table)
VN is a VALUE node (i.e. a value in a column in a db table)
ComplexCondition must have one ON with a leftSubtree and rightSubtree
NP is one NN whose children are multiple NNs and conditions.
My questions:
Why is condition defined as VN or (ON + VN)? This would mean that something like the digit 5 by itself can be a condition. It would make more sense if only the latter, i.e. (ON + VN), were a condition (e.g. >5).
How can a rightSubtree be just a function (e.g. MIN)? Btw, I am understanding the pipe | as logical or.
I understand that GNP is recursively defined, but at a terminal node the GNP must be just an NP node, right? But an NP is defined as something that has children... HOW?!?!?!
The authors of the github project quoted above state: "Take a Value Node (VN) for example, according to the grammar, it is invalid if and only if it has children." I don't know how to infer this from the grammar.
Thanks for the help
I was able to add support to my parser's grammar for alternating characters (e.g. ababa or baba) by following along with this question.
I'm now looking to extend that by allowing repeats of characters.
For example, I'd like to be able to support abaaabab and aababaaa as well. In my particular case, only the a is allowed to repeat but a solution that allows for repeating b's would also be useful.
Given the rules from the other question:
expr ::= A | B
A ::= "a" B | "a"
B ::= "b" A | "b"
... I tried extending it to support repeats, like so:
expr ::= A | B
# support 1 or more "a"
A_one_or_more ::= A_one_or_more "a" | "a"
A ::= A_one_or_more B | A_one_or_more
B ::= "b" A | "b"
... but that grammar is ambiguous. Is it possible for this to be made unambiguous, and if so could anyone help me disambiguate it?
I'm using the lemon parser which is an LALR(1) parser.
The point of parsing, in general, is to parse; that is, determine the syntactic structure of an input. That's significantly different from simply verifying that an input belongs to a language.
For example, the language which consists of arbitrary repetitions of a and b can be described with the regular expression (a|b)*, which can be written in BNF as
S ::= /* empty */ | S a | S b
But that probably does not capture the syntactic structure you are trying to define. On the other hand, since you don't specify that structure, it is hard to know.
Here are a couple more possibilities, which build different parse trees:
S ::= E | S E
E ::= A b | E b
A ::= a | A a
S ::= E | S E
E ::= A B
A ::= a | A a
B ::= b | B b
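If I am reading them correctly, both grammars derive the same language, (a+b+)+, just with different parse trees. A quick sanity check against an equivalent regular expression:

```python
import re

# Equivalent regex for the language of both grammars above:
# one or more runs of a's each followed by a run of b's.
language = re.compile(r"(?:a+b+)+\Z")

print(bool(language.match("aabbab")))  # True: a-run b-run, repeated
print(bool(language.match("ba")))      # False: must start with a
print(bool(language.match("abaa")))    # False: must end with b
```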
When writing a grammar to parse a language, it is useful to start by drawing your proposed parse trees. Usually, you can write the grammar directly from the form of the trees, which shows that a formal grammar is primarily a documentation tool, since it clearly describes the language in a way that informal descriptions cannot. Using a parser generator to turn that grammar into a parser ensures that the parser implements the described language. Or, at least, that is the goal.
Here is a nice tool for checking your grammar online http://smlweb.cpsc.ucalgary.ca/start.html. It actually accepts the grammar you provided as a valid LALR(1) grammar.
A different LALR(1) grammar, that allows repeating a's, would be:
S ::= "a" S | "a" | "b" A | "b"
A ::= "a" S .
I am given the following sentence:
The bird tried to escape from the strong cage.
And the following grammar rules:
s->np, vp
np->det, n
np->det, adjp
adjp->adj, n
pp->p, np
comp->p, vp
vp->v, pp
vp->v, comp
I tried leftmost derivation to derive the tree and also approached it through bottom-up analysis (my attempted chart is not shown here).
The question I have is whether it is possible to have two s nodes which lead up to the root, a single s.
More concretely is this acceptable:
          s
        /   \
      s       s
     / \     / \
   NP   VP VP   NP
According to your grammar, a prepositional phrase (pp) consists of a preposition (p) followed by a noun phrase (np). But your parse tree shows pps consisting only of a preposition ("to" and "from"). If you do the bottom-up parse with this in mind, you should arrive at the correct answer.
To answer your direct question, your grammar does not allow s to consist of two ss; only of a noun phrase (np) followed by a verb phrase (vp).
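To check this mechanically, here is a tiny backtracking recognizer over POS tags. The grammar is the one from the question; the POS tagging of the sentence is my own assumption:

```python
GRAMMAR = {
    "s":    [["np", "vp"]],
    "np":   [["det", "n"], ["det", "adjp"]],
    "adjp": [["adj", "n"]],
    "pp":   [["p", "np"]],
    "comp": [["p", "vp"]],
    "vp":   [["v", "pp"], ["v", "comp"]],
}

def parse(symbol, tags, i):
    """Return every position j such that `symbol` derives tags[i:j]."""
    if symbol not in GRAMMAR:                    # terminal: a POS tag
        return [i + 1] if i < len(tags) and tags[i] == symbol else []
    ends = []
    for production in GRAMMAR[symbol]:
        positions = [i]
        for part in production:
            positions = [j for p in positions for j in parse(part, tags, p)]
        ends.extend(positions)
    return ends

# "The bird tried to escape from the strong cage."
tags = ["det", "n", "v", "p", "v", "p", "det", "adj", "n"]
print(len(tags) in parse("s", tags, 0))          # True: derivable as one s
```

Note that there is only one way to consume all nine tags, and it uses each rule at most as the grammar allows, with a single s at the root.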
Goal: find a way to formally define a grammar that recognizes elements from a set 0 or 1 times in any order. Subsequently, I want to parse it and generate an AST as well.
For example: Say the set of valid strings in my language is {A, B, C}. I want to define a grammar that recognizes all valid permutations of any number of those elements.
Syntactically valid strings would include:
(the empty string)
A,
B A, and
C A B
Syntactically invalid strings would include:
A A, and
B A C B
To be clear, defining all possible permutations explicitly in a CFG is unacceptable for my purposes, since larger sets would be impossible to maintain.
From what I understand, such a language fails the pumping lemma for context free languages, so the solution will not be context free or regular.
Update
What I'm after is called a "permutation language", which Benedek Nagy has done some theoretical work on as an extension to context free languages.
Regarding a parser generator, I've only found talk of implementing parsers with a permutation phase (link). Permutation phrases evidently have an exponential lower bound on the size of the resulting CFG, and I haven't found any parser generators that support them anyhow.
A sort-of solution to this problem was written in ANTLR. It uses semantic predicates to 'code around' the issue.
Assuming that the set of alternative strings is fixed and known in advance, say of size n, one can come up with a (non context-free) grammar of size O(n!). This is not asymptotically smaller than enumerating all permutations, so I suppose it cannot be considered a good solution. I believe that this grammar can be reformulated as a context-sensitive grammar (although in the form I'm suggesting below it is not).
For the example {a, b, c} mentioned in the question, one such grammar is the following. I'm using lower case letters for terminal symbols and upper case letters for non-terminals, as is customary. S is the initial non-terminal symbol.
S ::= XabcY
XabcY ::= aXbcY | bXacY | cXabY
XabY ::= ab | ba
XacY ::= ac | ca
XbcY ::= bc | cb
Non-terminals X and Y enclose the substring in the production which has not been finalized yet; this substring will eventually be replaced by a permutation of the terminals that are given between X and Y (in some arbitrary order).
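As a sketch of how this construction scales, the productions above can be generated mechanically for any sorted terminal set. The function and naming scheme (nt, permutation_grammar) are mine, not from the answer:

```python
from itertools import combinations

def nt(subset):
    """Nonterminal name for an unordered subset, e.g. XabY, XabcY."""
    return "X" + "".join(subset) + "Y"

def permutation_grammar(symbols):
    """Generate the productions sketched above: each nonterminal peels
    off one leading terminal from its subset; size-2 subsets expand
    directly to both orders."""
    symbols = tuple(sorted(symbols))
    rules = {"S": [nt(symbols)]}
    for size in range(2, len(symbols) + 1):
        for subset in combinations(symbols, size):
            alternatives = []
            for t in subset:
                rest = tuple(x for x in subset if x != t)
                alternatives.append(t + (nt(rest) if len(rest) > 1 else rest[0]))
            rules[nt(subset)] = alternatives
    return rules

for lhs, alts in permutation_grammar("abc").items():
    print(lhs, "::=", " | ".join(alts))
```

For {a, b, c} this reproduces exactly the five rules listed above, and it makes the growth visible: one nonterminal per subset of size at least 2, with one alternative per element of the subset.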