Understanding grammar for Parse trees - parsing

I am trying to understand a natural language interface for relational databases proposed by Fei Li (2014) (available as a GitHub project). Specifically, I don't understand the grammar they define for a ParseTree of a natural language query to a database. This question is somewhat like this one but with a more complex grammar.
Background: A natural language sentence can be parsed into a ParseTree (a common library for this is the Stanford Parser), which describes the grammatical relations of the words in the sentence.
The grammar for a valid ParseTree is:
Q -> (SClause)(ComplexCondition)*
SClause -> SELECT + GNP
ComplexCondition -> ON + (leftSubtree*rightSubtree)
leftSubtree -> GNP
rightSubtree -> GNP | VN | MIN | MAX
GNP -> (FN + GNP) | NP
NP -> NN + (NN)*(condition)*
condition -> VN | (ON + VN)
where
Q represents an entire query tree
+ a parent-child relationship
* a sibling relationship
SN is a SELECT node
ON is an OPERATOR node (e.g. =, <=)
FN is a FUNCTION node (e.g. AVG)
NN is a NAME node (e.g. a column in a db table)
VN is a VALUE node (i.e. a value in a column in a db table)
ComplexCondition must have one ON with a leftSubtree and rightSubtree
NP is one NN whose children are multiple NNs and conditions.
My questions:
Why is condition defined as VN or (ON + VN)? This would mean that something like the digit 5 by itself can be a condition. It would make more sense if only the latter, i.e. (ON + VN), were a condition (e.g. >5).
How can a rightSubtree be just a function (e.g. MIN)? By the way, I read the pipe | as a logical or.
I understand that GNP is recursively defined, but at a terminal node the GNP node must be just an NP node, right? But an NP is defined as something that has children. How?
The authors of the GitHub project quoted above state: "Take a Value Node (VN) for example, according to the grammar, it is invalid if and only if it has children." I don't know how to infer this from the grammar.
Thanks for the help
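One possible reading of the quoted remark, offered as an assumption and not as the authors' own reasoning: since + denotes a parent-child relationship, only symbols that appear on the left of a + in some rule can ever have children. The table below is transcribed by hand from the rules above, and the helper name may_have_children is hypothetical.

```python
# (parent, children) pairs read off the rules, where "+" means parent-child.
# Hand-transcribed from the grammar in the question; treat as an assumption.
PARENT_CHILD = [
    ("SELECT", ["GNP"]),                      # SClause -> SELECT + GNP
    ("ON", ["leftSubtree", "rightSubtree"]),  # ComplexCondition
    ("FN", ["GNP"]),                          # GNP -> (FN + GNP)
    ("NN", ["NN", "condition"]),              # NP -> NN + (NN)*(condition)*
    ("ON", ["VN"]),                           # condition -> (ON + VN)
]


def may_have_children(kind):
    """True if `kind` appears as a parent (left of '+') in any rule."""
    return any(parent == kind for parent, _ in PARENT_CHILD)


for kind in ("SELECT", "ON", "FN", "NN", "VN"):
    print(kind, may_have_children(kind))
```

Under this reading VN never occurs as a parent, so a VN with children cannot be produced by any rule.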

Related

Parse Tree Given Grammar

I am given the following sentence:
The bird tried to escape from the strong cage.
And the following grammar rules:
s->np, vp
np->det, n
np->det, adjp
adjp->adj, n
pp->p, np
comp->p, vp
vp->v, pp
vp->v, comp
I tried a leftmost derivation to derive the tree, and also a bottom-up analysis. Here is a simple chart I tried:
The question I have is whether it is possible to have two S nodes that lead up to a single root S.
More concretely, is this acceptable:
      s
    /   \
   s     s
  / \   / \
 NP VP VP NP
According to your grammar, a prepositional phrase (pp) consists of a preposition (p) followed by a noun phrase (np). But your parse tree shows pps consisting only of a preposition ("to" and "from"). If you do the bottom-up parse with this in mind, you should arrive at the correct answer.
To answer your direct question, your grammar does not allow s to consist of two ss; only of a noun phrase (np) followed by a verb phrase (vp).
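The bottom-up parse suggested above can be sketched as a small CYK-style chart parser, since every rule in the grammar has exactly two symbols on the right. The lexicon (which word belongs to which category) is not given in the question and is an assumption here, as are the names RULES, LEXICON and parse.

```python
from itertools import product

# Binary rules from the question: (left child, right child) -> parent.
RULES = {
    ("np", "vp"): "s",
    ("det", "n"): "np",
    ("det", "adjp"): "np",
    ("adj", "n"): "adjp",
    ("p", "np"): "pp",
    ("p", "vp"): "comp",
    ("v", "pp"): "vp",
    ("v", "comp"): "vp",
}

# Assumed lexicon; the question does not list word categories.
LEXICON = {
    "the": "det", "bird": "n", "tried": "v", "to": "p",
    "escape": "v", "from": "p", "strong": "adj", "cage": "n",
}


def parse(words):
    """Return one parse tree rooted in 's', or None if there is none."""
    n = len(words)
    # chart[i][j] maps a category to one tree covering words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        cat = LEXICON[word.lower()]
        chart[i][i + 1][cat] = (cat, word)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                pairs = product(chart[i][k].items(), chart[k][j].items())
                for (lcat, ltree), (rcat, rtree) in pairs:
                    parent = RULES.get((lcat, rcat))
                    if parent is not None:
                        chart[i][j].setdefault(parent, (parent, ltree, rtree))
    return chart[0][n].get("s")


print(parse("The bird tried to escape from the strong cage".split()))
```

There is no ("s", "s") entry in RULES, which is the chart-parser view of the statement that s cannot consist of two s nodes.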

Recognizing permutations of a finite set of strings in a formal grammar

Goal: find a way to formally define a grammar that recognizes elements from a set 0 or 1 times in any order. Subsequently, I want to parse it and generate an AST as well.
For example: Say the set of valid strings in my language is {A, B, C}. I want to define a grammar that recognizes all valid permutations of any number of those elements.
Syntactically valid strings would include:
(the empty string)
A,
B A, and
C A B
Syntactically invalid strings would include:
A A, and
B A C B
To be clear, defining all possible permutations explicitly in a CFG is unacceptable for my purposes, since larger sets would be impossible to maintain.
From what I understand, such a language fails the pumping lemma for context free languages, so the solution will not be context free or regular.
Update
What I'm after is called a "permutation language", which Benedek Nagy has done some theoretical work on as an extension to context free languages.
Regarding a parser generator, I've only found talk of implementing parsers with a permutation phase (link). Parsers evidently have an exponential lower bound on the size of resulting CFG, and I haven't found any parser generators that support it anyhow.
A sort-of solution to this problem was written in ANTLR. It uses semantic predicates to 'code around' the issue.
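The idea of coding around the issue can be illustrated outside any parser generator: accept any sequence of allowed elements, and reject repeats with a side check. This is only a sketch of that idea in plain Python, not ANTLR code; the function name is_valid and its parameters are hypothetical.

```python
def is_valid(tokens, allowed=frozenset({"A", "B", "C"})):
    """Accept any ordering in which each allowed token occurs at most once."""
    seen = set()
    for tok in tokens:
        # Reject unknown tokens and any token already consumed.
        if tok not in allowed or tok in seen:
            return False
        seen.add(tok)
    return True


for example in ("", "A", "B A", "C A B", "A A", "B A C B"):
    print(repr(example), is_valid(example.split()))
```

The check runs in time linear in the input, but the constraint lives in code and is no longer expressed by the grammar itself.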
Assuming that the set of alternative strings is fixed and known in advance, say of size n, one can come up with a (non context-free) grammar of size O(n!). This is not asymptotically smaller than enumerating all permutations, so I suppose it cannot be considered a good solution. I believe that this grammar can be reformulated as a context-sensitive grammar (although in the form I'm suggesting below it is not).
For the example {a, b, c} mentioned in the question, one such grammar is the following. I'm using lower case letters for terminal symbols and upper case letters for non-terminals, as is customary. S is the initial non-terminal symbol.
S ::= XabcY
XabcY ::= aXbcY | bXacY | cXabY
XabY ::= ab | ba
XacY ::= ac | ca
XbcY ::= bc | cb
Non-terminals X and Y enclose the substring in the production which has not been finalized yet; this substring will eventually be replaced by a permutation of the terminals that are given between X and Y (in some arbitrary order).
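To check that rules of this shape derive exactly the permutations, the scheme can be simulated directly. The sketch below assumes the pattern generalizes as "pick one remaining terminal, move it in front of X, and recurse", with the two-symbol case as the base; the function name derive is hypothetical. Note that, as written, the rules only derive strings that use every terminal exactly once.

```python
from itertools import permutations


def derive(remaining):
    """All terminal strings derivable from X<remaining>Y under the scheme."""
    symbols = sorted(remaining)
    if len(symbols) < 2:
        raise ValueError("the scheme above starts at two terminals")
    if len(symbols) == 2:
        a, b = symbols
        return [a + b, b + a]          # e.g. XabY ::= ab | ba
    results = []
    for t in symbols:                  # e.g. XabcY ::= aXbcY | bXacY | cXabY
        rest = [s for s in symbols if s != t]
        results.extend(t + tail for tail in derive(rest))
    return results


print(derive("abc"))
print(sorted(derive("abc")) == sorted("".join(p) for p in permutations("abc")))
```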

Unparse AST < O(exp(n))?

Abstract problem description:
The way I see it, unparsing means to create a token stream from an AST, which when parsed again produces an equal AST.
So parse(unparse(AST)) = AST holds.
This is equivalent to finding a valid parse tree which would produce the same AST.
The language is described by a context-free S-attributed grammar written in an EBNF variant.
So the unparser has to find a valid 'path' through the traversed nodes in which all grammar constraints hold. This basically means finding a valid assignment of AST nodes to grammar production rules. This is a constraint satisfaction problem (CSP) in general and could be solved, like parsing, by backtracking in O(exp(n)).
Fortunately for parsing, this can be done in O(n³) using GLR (or better by restricting the grammar). Because the AST structure is so close to the grammar production rule structure, I was really surprised to see an implementation where the runtime is worse than parsing: XText uses ANTLR for parsing and backtracking for unparsing.
Questions
Is a context free S-attribute grammar everything a parser and unparser need to share or are there further constraints, e.g. on the parsing technique / parser implementation?
I've got the feeling this problem isn't O(exp(n)) in general - could some genius help me with this?
Is this basically a context-sensitive grammar?
Example1:
Area returns AnyObject -> Pedestrian | Highway
Highway returns AnyObject -> "Foo" Car
Pedestrian returns AnyObject -> "Bar" Bike
Car returns Vehicle -> anyObjectInstance.name="Car"
Bike returns Vehicle -> anyObjectInstance.name="Bike"
So if I have an AST containing
AnyObject -> AnyObject -> Vehicle [name="Car"] and I know the root can be Area, I could resolve it to
Area -> Highway -> Car
So the (Highway | Pedestrian) decision depends on the subtree decisions. The problem gets worse when a leaf might, at first sight, be one of several types but has to be a specific one to form a valid path later on.
Example2:
So if I have S-attribute rules returning untyped objects, just assigning some attributes, e.g.
A -> B C {Obj, Obj}
X -> Y Z {Obj, Obj}
B -> "somekeyword" {0}
Y -> "otherkeyword" {0}
C -> "C" {C}
Z -> "Z" {Z}
So if an AST contains
    Obj
   /   \
 "0"   "C"
I can unparse it to
   A
  / \
 B   C
but only after I have resolved "C" to C.
While traversing the AST, all constraints I can generate from the grammar are satisfied for both rules, A and X, until I hit "C". This means that for
A -> B | C
B -> "map" {MagicNumber_42}
C -> "foreach" {MagicNumber_42}
both solutions for the tree
     Obj
      |
MagicNumber_42
are valid, and they are considered to have equal semantics, e.g. syntactic sugar.
Further Information:
unparsing in XText
grammar constraints for unparsing, see Serializer: Concrete Syntax Validation
Question 1: no, the grammar itself may not be enough. Take the example of an ambiguous grammar. If you ended up with a unique leftmost (rightmost) derivation (the AST) for a given string, you would somehow have to know how the parser eliminated the ambiguity. Just think of the string 'a+b*c' with the naive grammar for expressions 'E:=E+E|E*E|...'.
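The ambiguity in that example can be made concrete by counting parse trees. The sketch below assumes the elided part of the grammar reduces to single identifiers, i.e. E := E+E | E*E | id, and that every character is one token; the function name count_parses is hypothetical.

```python
from functools import lru_cache

OPERATORS = ("+", "*")


def count_parses(tokens):
    """Number of distinct parse trees for tokens under E := E+E | E*E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Ways to derive tokens[i:j] from E.
        if j - i == 1:
            return 0 if tokens[i] in OPERATORS else 1
        total = 0
        for k in range(i + 1, j - 1):
            if tokens[k] in OPERATORS:
                # tokens[k] is the top-level operator of this subtree.
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_parses("a+b*c"))  # 2: (a+b)*c and a+(b*c)
```

An unparser that only sees one of these two trees cannot tell from the grammar alone which precedence rule the parser applied.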
Question 3: none of the grammar examples you give is context-sensitive. The left-hand side of every production is a single non-terminal, so there is no context.

Conversion to Chomsky Normal Form

I do need your help.
I have these productions:
1) A--> aAb
2) A--> bAa
3) A--> ε
I need to convert them to Chomsky Normal Form (CNF).
To do that I should:
eliminate ε-productions
eliminate unit productions
remove useless symbols
Immediately I get stuck. The reason is that A is a nullable symbol (ε is one of its bodies).
Of course I can't remove the A symbol.
Can anyone help me to get the final solution?
As Wikipedia notes, there are two definitions of Chomsky Normal Form, which differ in the treatment of ε-productions. You will have to pick the one where these are allowed, as otherwise you will never get an equivalent grammar: your grammar produces the empty string, while a CNF grammar following the other definition isn't capable of that.
To begin conversion to Chomsky normal form (using Definition (1) provided by the Wikipedia page), you need to find an equivalent essentially noncontracting grammar. A grammar G with start symbol S is essentially noncontracting iff
1. S is not a recursive variable
2. G has no ε-rules other than S -> ε if ε ∈ L(G)
Calling your grammar G, an equivalent grammar G' with a non-recursive start symbol is:
G' : S -> A
     A -> aAb | bAa | ε
Clearly, the set of nullable variables of G' is {S,A}, since A -> ε is a production in G' and S -> A is a chain rule. I assume that you have been given an algorithm for removing ε-rules from a grammar. That algorithm should produce a grammar similar to:
G'' : S -> A | ε
      A -> aAb | bAa | ab | ba
The grammar G'' is essentially noncontracting; you can now apply the remaining algorithms to the grammar to find an equivalent grammar in Chomsky normal form.
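As a sanity check that removing the ε-rule did not change the language, both grammars can be enumerated up to a length bound and compared. This is a brute-force sketch, not part of the conversion algorithm; the helper names wrap_closure, lang_original and lang_converted are hypothetical.

```python
def wrap_closure(seeds, max_len):
    """Close `seeds` under w -> a w b and w -> b w a, up to max_len."""
    words = {w for w in seeds if len(w) <= max_len}
    frontier = set(words)
    while frontier:
        nxt = set()
        for w in frontier:
            for cand in ("a" + w + "b", "b" + w + "a"):
                if len(cand) <= max_len and cand not in words:
                    nxt.add(cand)
        words |= nxt
        frontier = nxt
    return words


def lang_original(max_len):
    # A -> aAb | bAa | ε
    return wrap_closure({""}, max_len)


def lang_converted(max_len):
    # S -> A | ε ;  A -> aAb | bAa | ab | ba
    return wrap_closure({"ab", "ba"}, max_len) | {""}


print(lang_original(8) == lang_converted(8))
print(sorted(lang_converted(4), key=lambda w: (len(w), w)))
```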

Binary trees, construct a tree based on preorder

Constructing a tree given its inorder is easy enough.
But let's say you are supposed to construct a tree based on its preorder (+ + y z + * x y z for example).
It's easy to see that + is the root, and how to continue in the left subtree from there.
But how do you know when you are supposed to "switch" to the right subtree?
Usually, inorder is considered the difficult case.
For preorder, you'll just have a grammar like this.
expr ::= operator expr expr | var
An operator is followed by exactly two well-formed expressions. This can be parsed easily using recursion:
If you parse a tree and get a variable, return the variable.
If you parse a tree and get an operator, parse the two following trees as the left and right subtrees, in that order.
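The two rules above can be sketched as a short recursive function. The set of operators is an assumption (the question only shows + and *), and the names parse_expr and parse_preorder are hypothetical; trees are represented as nested tuples.

```python
OPERATORS = {"+", "-", "*", "/"}  # assumed; the example only uses + and *


def parse_expr(tokens, pos=0):
    """Parse one expr starting at tokens[pos]; return (tree, next position)."""
    tok = tokens[pos]
    if tok in OPERATORS:
        # An operator is followed by exactly two expressions.
        left, pos = parse_expr(tokens, pos + 1)
        right, pos = parse_expr(tokens, pos)
        return (tok, left, right), pos
    # Anything else is a variable, i.e. a leaf.
    return tok, pos + 1


def parse_preorder(text):
    tokens = text.split()
    tree, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return tree


print(parse_preorder("+ + y z + * x y z"))
```

The "switch" to the right subtree happens by itself: the recursive call for the left subtree returns the position where it stopped, and the right subtree starts there.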
