I am looking at the source code for the L2 Syntactic Complexity Analyzer, and it has a tregex expression for clauses:
S|SINV|SQ [> ROOT <, (VP <# VB) | <# MD|VBZ|VBP|VBD | < (VP [<# MD|VBP|VBZ|VBD | < CC < (VP <# MD|VBP|VBZ|VBD)])]
I am reading the tregex syntax from this link, but I am not confident that I understand the Boolean relational operators correctly. Specifically, does the second part of this tregex:
VP [<# MD|VBP|VBZ|VBD | < CC < (VP <# MD|VBP|VBZ|VBD)]
mean
(VP <# MD|VBP|VBZ|VBD) OR ((VP < CC) AND (VP < (VP <# MD|VBP|VBZ|VBD)))
i.e., a verb phrase that contains both a CC and a VP (headed by MD/VBP/VBZ/VBD),
Or
(VP <# MD|VBP|VBZ|VBD) OR (VP < (CC < (VP <# MD|VBP|VBZ|VBD)))
i.e., a verb phrase that contains a CC that itself contains such a VP?
In tregex (following the earlier tgrep languages), a pattern A op B op C op D always means (A op B) AND (A op C) AND (A op D). If you want the other grouping, you need to use parentheses: A op (B op (C op D)). So in the second disjunct of the original expression, the VP has to contain a CC and also contain another VP headed by a word with one of the indicated part-of-speech tags (<# is the "headed by" relation). So the answer is basically the former interpretation, with the one added constraint that the first VP in each conjunct of the second disjunct has to be the same VP node.
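The structural difference can be made concrete with a toy sketch. This is not real tregex: trees are nested Python lists, the <# ("headed by") relation is crudely approximated as "first child" (wrong for real parses, but enough to contrast the two groupings), and the example VP is invented.

```python
def label(t):
    return t[0]

def children(t):
    return [c for c in t[1:] if isinstance(c, list)]

def headed_by(t, labels):
    # Crude stand-in for tregex's <# relation: just look at the first child.
    kids = children(t)
    return bool(kids) and label(kids[0]) in labels

VERBS = {"MD", "VBP", "VBZ", "VBD"}

# An invented coordinated VP: "went and saw"
VP = ["VP",
      ["VP", ["VBD", "went"]],
      ["CC", "and"],
      ["VP", ["VBD", "saw"]]]

def reading1(t):
    # (VP <# V) OR ((VP < CC) AND (VP < (VP <# V))):
    # both < constraints apply to the same outer VP.
    return label(t) == "VP" and (
        headed_by(t, VERBS)
        or (any(label(c) == "CC" for c in children(t))
            and any(label(c) == "VP" and headed_by(c, VERBS)
                    for c in children(t))))

def reading2(t):
    # (VP <# V) OR (VP < (CC < (VP <# V))):
    # here the CC itself would have to contain the inner VP.
    return label(t) == "VP" and (
        headed_by(t, VERBS)
        or any(label(c) == "CC"
               and any(label(g) == "VP" and headed_by(g, VERBS)
                       for g in children(c))
               for c in children(t)))
```

On the coordinated VP above, the first (correct) reading matches, while the second does not, because the CC node dominates no VP.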
Chris's answer is correct...
Full slides here:
https://nlp.stanford.edu/software/tregex/The_Wonderful_World_of_Tregex.ppt
I have a grammar with one production rule:
S → aSbS | bSaS | ∈
This is ambiguous. Is it allowed to remove the ambiguity this way?
S → A|B|∈
A → aS
B → bS
This makes it unambiguous.
Another grammar:
S → A | B
A → aAb | ab
B → abB | ∈
Correction to make it unambiguous
S → A | B
A → aA'b
A' → ab
B → abB | ∈
I am not using any rules to make the grammars unambiguous. If it is wrong to remove ambiguity in a grammar this way, can anyone point me to a proper set of rules for removing ambiguity from ambiguous grammars?
As @kaby76 points out in a comment, in your first example you haven't just removed the ambiguity. You have also changed the language recognised by the grammar.
S → a S b S
| b S a S
| ε
recognises only strings with the same number of a's and b's, while
S → A
| B
| ε
A → a S
B → b S
recognises any string made up of a's and b's.
So that's certainly not a legitimate disambiguation.
By the way, your second grammar could have been simplified; A and B serve no useful purpose.
S → a S
| b S
| ε
There are unambiguous grammars for this language. One example:
S → a B S
| b A S
| ε
A → a
| b A A
B → b
| a B B
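That unambiguity claim can be sanity-checked mechanically. The sketch below (my own illustration, not from the linked post) counts distinct parse trees of a string under a small CFG by memoized recursion; note the S alternatives are written as aBS | bAS | ε, so that B, which carries the extra b, follows the a. For an unambiguous grammar, every recognised string must have exactly one parse.

```python
from functools import lru_cache
from itertools import product

# Terminals are "a"/"b"; anything appearing as a dict key is a
# nonterminal; () is an epsilon right-hand side.
GRAMMAR = {
    "S": [("a", "B", "S"), ("b", "A", "S"), ()],
    "A": [("a",), ("b", "A", "A")],
    "B": [("b",), ("a", "B", "B")],
}

def count_parses(nt, s, grammar=GRAMMAR):
    """Number of distinct parse trees deriving string s from nonterminal nt."""

    @lru_cache(maxsize=None)
    def seq_count(symbols, i, j):
        # ways the symbol sequence derives s[i:j]
        if not symbols:
            return 1 if i == j else 0
        head, rest = symbols[0], symbols[1:]
        if head not in grammar:                      # terminal symbol
            return seq_count(rest, i + 1, j) if i < j and s[i] == head else 0
        return sum(nt_count(head, i, k) * seq_count(rest, k, j)
                   for k in range(i, j + 1))

    @lru_cache(maxsize=None)
    def nt_count(sym, i, j):
        return sum(seq_count(tuple(rhs), i, j) for rhs in grammar[sym])

    return nt_count(nt, 0, len(s))

# Every string over {a, b} is recognised iff it is balanced, and a
# balanced string has exactly one parse.
for n in range(7):
    for w in map("".join, product("ab", repeat=n)):
        balanced = w.count("a") == w.count("b")
        assert count_parses("S", w) == (1 if balanced else 0), w
```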
See this post on the Computer Science StackExchange for an explanation.
In your second example, the grammar
S → A
| B
A → a A b
| a b
B → a b B
| ε
is ambiguous, but only because A and B both match ab. Every other recognised string has exactly one possible parse.
In this grammar, A matches strings which consist of some number of as followed by the same number of bs, while B matches strings which consist of any number of repetitions of ab.
ab fits both criteria: it is a repetition of ab (consisting of just one copy) and it is a sequence of as followed by the same number of bs (in this case, one of each). The empty string also matches both criteria (with repetition count 0), but it has been excluded from A by making the starting rule A → a b. An easy way to make the grammar unambiguous would be to change that base rule to A → a a b b.
Again, your disambiguated grammar does not recognise the same language as the original grammar. Your change makes A non-recursive, so that it recognises only aabb; strings like aaabbb, aaaabbbb, and so on are no longer recognised. So once again, it is not simply an unambiguous version of the original.
Note that this grammar only matches strings with an equal number of as and bs, but it does not match all such strings. There are many other strings with an equal number of as and bs which are not matched, such as ba or aabbab. So its language is a subset of the language of the first grammar, but it is not the same language.
Finally, you ask if there is a mechanical procedure which can create an unambiguous grammar given an ambiguous grammar. The answer is no. It can be proven that there is no algorithm which can even decide whether a particular context-free grammar is ambiguous. Nor is there an algorithm which can decide whether two context-free grammars have the same language. So it shouldn't be surprising that there is no algorithm which can construct an equivalent unambiguous grammar from an ambiguous grammar.
That doesn't mean that the task cannot be done for certain grammars. But it's not a simple mechanical procedure, and it might not work for a particular grammar.
My question has to do with post-processing of part-of-speech tagged and parsed natural language sentences. Specifically, I am writing a component of a Lisp post-processor that takes as input a sentence parse tree (such as, for example, one produced by the Stanford Parser), extracts from that parse tree the phrase structure rules invoked to generate the parse, and then produces a table of rules and rule counts. An example of input and output would be the following:
(1) Sentence:
John said that he knows who Mary likes
(2) Parser output:
(ROOT
(S
(NP (NNP John))
(VP (VBD said)
(SBAR (IN that)
(S
(NP (PRP he))
(VP (VBZ knows)
(SBAR
(WHNP (WP who))
(S
(NP (NNP Mary))
(VP (VBZ likes))))))))))
(3) My Lisp program post-processor output for this parse tree:
(S --> NP VP) 3
(NP --> NNP) 2
(VP --> VBZ) 1
(WHNP --> WP) 1
(SBAR --> WHNP S) 1
(VP --> VBZ SBAR) 1
(NP --> PRP) 1
(SBAR --> IN S) 1
(VP --> VBD SBAR) 1
(ROOT --> S) 1
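The extraction step itself is simple; for clarity, here is a compact sketch of the same idea in Python (my own illustration, not part of my Lisp code), operating on the tree from (2) written as nested tuples:

```python
from collections import Counter

# The tree from (2), as nested tuples; preterminals like (NNP John)
# contribute no rule, matching the table in (3).
TREE = ("ROOT",
        ("S",
         ("NP", ("NNP", "John")),
         ("VP", ("VBD", "said"),
          ("SBAR", ("IN", "that"),
           ("S",
            ("NP", ("PRP", "he")),
            ("VP", ("VBZ", "knows"),
             ("SBAR",
              ("WHNP", ("WP", "who")),
              ("S",
               ("NP", ("NNP", "Mary")),
               ("VP", ("VBZ", "likes"))))))))))

def extract_rules(node, rules=None):
    """Collect one 'LHS --> RHS' rule per internal node of the tree."""
    if rules is None:
        rules = Counter()
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return rules                      # preterminal: (POS word), no rule
    rules[f"{label} --> {' '.join(child[0] for child in children)}"] += 1
    for child in children:
        extract_rules(child, rules)
    return rules
```

Calling extract_rules(TREE) reproduces exactly the counts shown in (3).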
Note the lack of punctuation in sentence (1). That's intentional. I am having trouble parsing the punctuation in Lisp -- precisely because some punctuation (commas, for example) is reserved for special purposes. But parsing sentences without punctuation changes the distribution of the parse rules as well as the symbols contained in those rules, as illustrated by the following:
(4) Input sentence:
I said no and then I did it anyway
(5) Parser output:
(ROOT
(S
(NP (PRP I))
(VP (VBD said)
(ADVP (RB no)
(CC and)
(RB then))
(SBAR
(S
(NP (PRP I))
(VP (VBD did)
(NP (PRP it))
(ADVP (RB anyway))))))))
(6) Input sentence (with punctuation):
I said no, and then I did it anyway.
(7) Parser output:
(ROOT
(S
(S
(NP (PRP I))
(VP (VBD said)
(INTJ (UH no))))
(, ,)
(CC and)
(S
(ADVP (RB then))
(NP (PRP I))
(VP (VBD did)
(NP (PRP it))
(ADVP (RB anyway))))
(. .)))
Note how including punctuation completely rearranges the parse tree and also involves different POS tags (and thus implies that different grammar rules were invoked to produce it). So including punctuation is important, at least for my application.
What I need is a way to include punctuation in rules, so that I can produce rules like the following, which would appear, for example, in a table like (3):
(8) Desired rule:
S --> S , CC S .
Rules like (8) are in fact desired for the specific application I am writing.
But I am finding that doing this in Lisp is difficult: in (7), for example, we observe the appearance of (, ,) and (. .), both of which are problematic to handle in Lisp.
I have included my relevant Lisp code below. Please note that I'm a neophyte Lisp hacker, so my code isn't particularly pretty or efficient. If someone could suggest how I might modify the code below so that I can parse (7) to produce a table like (3) that includes a rule like (8), I would be most appreciative.
Here is my Lisp code relevant to this task:
(defun WRITE-RULES-AND-COUNTS-SORTED (sent)
  (multiple-value-bind (rules-list counts-list)
      (COUNT-RULES-OCCURRENCES sent)
    ;; comblist is bound locally; SETF on an undeclared variable
    ;; is undefined behavior in Common Lisp.
    (let ((comblist (sort (pairlis rules-list counts-list) #'> :key #'cdr)))
      (format t "~%")
      (dolist (pair comblist)
        (format t "~A~26T~A~%" (car pair) (cdr pair)))
      (format t "~%"))))
(defun COUNT-RULES-OCCURRENCES (sent)
  (let* ((original-rules-list (EXTRACT-GRAMMAR sent))
         (de-duplicated-list (remove-duplicates original-rules-list :test #'equalp))
         (count-list nil))
    (dolist (i de-duplicated-list)
      (push (reduce #'+ (mapcar #'(lambda (x) (if (equalp x i) 1 0))
                                original-rules-list))
            count-list))
    (setf count-list (nreverse count-list))
    (values de-duplicated-list count-list)))
(defun EXTRACT-GRAMMAR (sent &optional (rules-stack nil))
  (cond ((null sent)
         NIL)
        ((and (= (length sent) 1)
              (listp (first sent))
              (= (length (first sent)) 2)
              (symbolp (first (first sent)))
              (symbolp (second (first sent))))
         NIL)
        ((and (symbolp (first sent))
              (symbolp (second sent))
              (= 2 (length sent)))
         NIL)
        ((symbolp (first sent))
         (push (EXTRACT-GRAMMAR-RULE sent) rules-stack)
         (append rules-stack (EXTRACT-GRAMMAR (rest sent))))
        ((listp (first sent))
         (cond ((not (and (listp (first sent))
                          (= (length (first sent)) 2)
                          (symbolp (first (first sent)))
                          (symbolp (second (first sent)))))
                (push (EXTRACT-GRAMMAR-RULE (first sent)) rules-stack)
                (append rules-stack
                        (EXTRACT-GRAMMAR (rest (first sent)))
                        (EXTRACT-GRAMMAR (rest sent))))
               (t (append rules-stack (EXTRACT-GRAMMAR (rest sent))))))))
(defun EXTRACT-GRAMMAR-RULE (sentence-or-phrase)
  (append (list (first sentence-or-phrase))
          '(-->)
          (mapcar #'first (rest sentence-or-phrase))))
The code is invoked as follows (using (1) as input, producing (3) as output):
(WRITE-RULES-AND-COUNTS-SORTED '(ROOT
(S
(NP (NNP John))
(VP (VBD said)
(SBAR (IN that)
(S
(NP (PRP he))
(VP (VBZ knows)
(SBAR
(WHNP (WP who))
(S
(NP (NNP Mary))
(VP (VBZ likes)))))))))))
S-expressions in Common Lisp
In Common Lisp s-expressions, characters like , and . are part of the default syntax.
If you want symbols with arbitrary names in Lisp s-expressions, you have to escape them. Either use a backslash to escape single characters or use a pair of vertical bars to escape multiple characters:
CL-USER 2 > (loop for symbol in '(\, \. | a , b , c .|)
do (describe symbol))
\, is a SYMBOL
NAME ","
VALUE #<unbound value>
FUNCTION #<unbound function>
PLIST NIL
PACKAGE #<The COMMON-LISP-USER package, 76/256 internal, 0/4 external>
\. is a SYMBOL
NAME "."
VALUE #<unbound value>
FUNCTION #<unbound function>
PLIST NIL
PACKAGE #<The COMMON-LISP-USER package, 76/256 internal, 0/4 external>
| a , b , c .| is a SYMBOL
NAME " a , b , c ."
VALUE #<unbound value>
FUNCTION #<unbound function>
PLIST NIL
PACKAGE #<The COMMON-LISP-USER package, 76/256 internal, 0/4 external>
NIL
Tokenizing / Parsing
If you want to deal with other input formats and not s-expressions, you might want to tokenize / parse the input yourself.
Primitive example:
CL-USER 11 > (mapcar (lambda (string)
(intern string "CL-USER"))
(split-sequence " " "S --> S , CC S ."))
(S --> S \, CC S \.)
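An alternative to escaping is to skip the Lisp reader entirely and tokenize the tree text yourself, in which case the punctuation needs no special treatment at all. A minimal sketch of that idea in Python (my own illustration, not production code):

```python
import re

def parse_tree(text):
    """Parse a Penn-Treebank-style tree into nested Python lists.
    Tokens are parentheses or maximal runs of non-space, non-paren
    characters, so , . : ! ? come through as ordinary atoms."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # consume ")"
            return node
        return tok

    return read()
```

For example, parse_tree("(S (NP (PRP I)) (, ,))") yields a nested list in which the comma is just another string.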
UPDATE:
Thank you Dr. Joswig, for your comments and for your code demo: Both were quite helpful.
In the above question I'm interested in overcoming (or at least accommodating) the fact that , and . are part of Lisp's default syntax. What I ended up doing is writing the function PRODUCE-PARSE-TREE-WITH-PUNCT-FROM-FILE-READ. It reads one parse tree from a file as a series of strings; trims whitespace from the strings; concatenates the strings to form a string representation of the parse tree; and then scans this string, character by character, searching for instances of punctuation to modify. The modification implements Dr. Joswig's suggestion. Finally, the modified string is converted to a tree (list representation) and sent off to the extractor to produce the rules table and counts. To implement this I cobbled together bits of code found elsewhere on StackOverflow along with my own original code. The result (not all punctuation can be handled, of course, since this is just a demo):
(defun PRODUCE-PARSE-TREE-WITH-PUNCT-FROM-FILE-READ (file-name)
  (let ((result (make-array 1 :element-type 'character :fill-pointer 0 :adjustable T))
        (list-of-strings-to-process (mapcar #'(lambda (x) (string-trim " " x))
                                            (GET-PARSE-TREE-FROM-FILE file-name)))
        (concatenated-string nil)
        (punct-list '(#\, #\. #\; #\: #\! #\?))
        (testchar nil)
        (string-length 0))
    (setf concatenated-string (format nil "~{ ~A~}" list-of-strings-to-process))
    (setf string-length (length concatenated-string))
    (do ((i 0 (incf i)))
        ((= i string-length) NIL)
      (setf testchar (char concatenated-string i))
      (cond ((member testchar punct-list)
             ;; Wrap reader-reserved punctuation in vertical bars so that
             ;; the Lisp reader sees an ordinary symbol.
             (vector-push-extend #\| result)
             (vector-push-extend testchar result)
             (vector-push-extend #\| result))
            (t (vector-push-extend testchar result))))
    (with-input-from-string (s result)
      (loop for x = (read s nil :end) until (eq x :end) collect x))))
(defun GET-PARSE-TREE-FROM-FILE (file-name)
  (with-open-file (stream file-name)
    (loop for line = (read-line stream nil)
          while line
          collect line)))
Note that GET-PARSE-TREE-FROM-FILE reads only one tree from a file that consists of only one tree. These two functions are not, of course, ready for prime-time!
And finally, a parse tree containing (Lisp-reserved) punctuation can be processed--and thus the original goal met--as follows (user supplies the filename containing one parse tree):
(WRITE-RULES-AND-COUNTS-SORTED
(PRODUCE-PARSE-TREE-WITH-PUNCT-FROM-FILE-READ filename))
The following output is produced:
(NP --> PRP) 3
(PP --> IN NP) 2
(VP --> VB PP) 1
(S --> VP) 1
(VP --> VBD) 1
(NP --> NN CC NN) 1
(ADVP --> RB) 1
(PRN --> , ADVP PP ,) 1
(S --> PRN NP VP) 1
(WHADVP --> WRB) 1
(SBAR --> WHADVP S) 1
(NP --> NN) 1
(NP --> DT NN) 1
(ADVP --> NP IN) 1
(VP --> VBD ADVP NP , SBAR) 1
(S --> NP VP) 1
(S --> S : S .) 1
(ROOT --> S) 1
That output was the result of using the following input (saved as filename):
(ROOT
(S
(S
(NP (PRP It))
(VP (VBD was)
(ADVP
(NP (DT the) (NN day))
(IN before))
(NP (NN yesterday))
(, ,)
(SBAR
(WHADVP (WRB when))
(S
(PRN (, ,)
(ADVP (RB out))
(PP (IN of)
(NP (NN happiness)
(CC and)
(NN mirth)))
(, ,))
(NP (PRP I))
(VP (VBD decided))))))
(: :)
(S
(VP (VB go)
(PP (IN for)
(NP (PRP it)))))
(. !)))
For the given context-free grammar:
S -> G $
G -> PG | P
P -> id : R
R -> id R | epsilon
How do I rewrite the grammar so that it is LR(1)?
The current grammar has shift/reduce conflicts when parsing the input "id : .id", where "." is the input pointer for the parser.
This grammar produces the language matching the regular expression (id:(id)*)+.
It's easy enough to produce an LR(1) grammar for the same language. The trick is finding one which has a similar parse tree, or at least from which the original parse tree can be recovered easily.
Here's a manually generated grammar, which is slightly simplified from the general algorithm. In effect, we rewrite the regular expression:
(id:id*)+
to:
id(:id+)*:id*
which induces the grammar:
S → id G $
G → P G | P'
P' → : R'
P → : R
R' → ε | id R'
R → id R | id
which is LALR(1).
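The equivalence of the two grammars can be spot-checked by brute force. The sketch below (my own check, not part of the answer) enumerates every token string each grammar derives up to a bounded length and compares the sets; note that the non-final R is written here as id+, as the rewrite to id(:id+)*:id* requires, while the final segment Rf matches id*.

```python
ORIGINAL = {
    "S": [("G", "$")],
    "G": [("P", "G"), ("P",)],
    "P": [("id", ":", "R")],
    "R": [("id", "R"), ()],
}

REWRITTEN = {
    "S": [("id", "G", "$")],
    "G": [("P", "G"), ("Pf",)],      # Pf plays the role of P'
    "P": [(":", "R")],
    "Pf": [(":", "Rf")],
    "R": [("id", "R"), ("id",)],     # non-final segments: one or more ids
    "Rf": [("id", "Rf"), ()],        # final segment: zero or more ids
}

def language(grammar, start="S", max_len=8):
    """All terminal strings of at most max_len tokens derivable from start."""
    out = set()

    def expand(form):
        # Expansions never remove terminals, so pruning on the number of
        # terminals already present bounds the search.
        if sum(1 for sym in form if sym not in grammar) > max_len:
            return
        for i, sym in enumerate(form):
            if sym in grammar:                     # leftmost nonterminal
                for rhs in grammar[sym]:
                    expand(form[:i] + rhs + form[i + 1:])
                return
        out.add(form)

    expand((start,))
    return out

assert language(ORIGINAL, max_len=7) == language(REWRITTEN, max_len=7)
```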
In effect, we've just shifted all the productions one token to the right, and there is a general algorithm which can be used to create an LR(1) grammar from an LR(k+1) grammar for any k≥1. (The version of this algorithm I'm using comes from Parsing Theory by S. Sippu & E. Soisalon-Soininen, Vol II, section 6.7.)
The non-terminals of the new grammar will have the form (x, V, y) where V is a symbol from the original grammar (either a terminal or a non-terminal) and x and y are terminal sequences of maximum length k such that:
y ∈ FOLLOWk(V)
x ∈ FIRSTk(Vy)
(The lengths of y and consequently x might be less than k if the end of input is included in the follow set. Some people avoid this issue by adding k end symbols, but I think this version is just as simple.)
A non-terminal (x, V, y) will generate the x-derivative of the strings derived from Vy from the original grammar. Informally, the entire grammar is shifted k tokens to the right; each non-terminal matches a string which is missing the first k tokens but is augmented with the following k tokens.
The productions are generated mechanically from the original productions. First, we add a new start symbol, S' with productions:
S' → x (x, S, ε)
for every x ∈ FIRSTk(S). Then, for every production
T → V0 V1 … Vm
we generate the set of productions:
(x0,T,xm+1) → (x0,V0,x1) (x1,V1,x2) … (xm,Vm,xm+1)
and for every terminal A we generate the set of productions
(Ax,A,xB) → B if |x| = k
(Ax,A,x) → ε if |x| ≤ k
Since there is an obvious homomorphism from the productions in the new grammar to the productions in the old grammar, we can directly create the original parse tree, although we need to play some tricks with the semantic values in order to correctly attach them to the parse tree.
I have stumbled upon a very curious case:
Consider
1) S -> A x
2) & 3) A -> alpha | beta
4) alpha -> b
5) & 6) beta -> epsilon | x
Now I checked and this grammar doesn't defy any rules of LL(1) grammars. But when I construct the parsing table, there are some collisions.
First Sets
S => {b,x}
A=>{b,x,epsilon}
alpha=>{b}
beta=> {x,epsilon}
Follow sets
S=> {$}
A => {x}
alpha => {x}
beta => {x}
Here is the parsing table **without considering** the RHS's which can produce epsilons:

          x     b     $
S         1     1
A         3     2
alpha           4
beta      6
So far so good, but when we do consider RHS's that can derive epsilon, we get collisions in the table!
So is this LL(1) or not?
So is this LL(1) or not?
First(A) contains x, and Follow(A) contains x. Since A can derive empty, and there is an intersection between First(A) and Follow(A), it is not LL(1).
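The FIRST/FOLLOW computation behind this answer can be mechanized. A sketch (my own illustration; nullability is tracked as a separate set rather than putting epsilon into the FIRST sets, so FIRST(A) comes out as {b, x} where the question writes {b, x, epsilon}):

```python
GRAMMAR = {
    "S": [("A", "x")],
    "A": [("alpha",), ("beta",)],
    "alpha": [("b",)],
    "beta": [(), ("x",)],
}

def analyze(grammar, start="S"):
    nts = set(grammar)
    nullable = set()
    first = {nt: set() for nt in nts}
    follow = {nt: set() for nt in nts}
    follow[start].add("$")

    def first_of(seq):
        """FIRST of a symbol sequence, plus whether the sequence is nullable."""
        out = set()
        for sym in seq:
            if sym not in nts:            # terminal
                out.add(sym)
                return out, False
            out |= first[sym]
            if sym not in nullable:
                return out, False
        return out, True

    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                f, eps = first_of(rhs)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
                if eps and nt not in nullable:
                    nullable.add(nt)
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym in nts:
                        f, eps = first_of(rhs[i + 1:])
                        add = f | (follow[nt] if eps else set())
                        if not add <= follow[sym]:
                            follow[sym] |= add
                            changed = True
    return nullable, first, follow

nullable, first, follow = analyze(GRAMMAR)
# A is nullable and FIRST(A) meets FOLLOW(A) on x, so the grammar is not LL(1).
assert "A" in nullable and first["A"] & follow["A"] == {"x"}
```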
I am really sorry, it's a blunder on my part.
Actually it doesn't satisfy all the rules of LL(1) grammars:
beta -> epsilon | x
hence FIRST(beta) and FOLLOW(beta) should be disjoint, but that's not the case!
Sorry!
I do need your help.
I have these productions:
1) A--> aAb
2) A--> bAa
3) A--> ε
I should convert it to Chomsky Normal Form (CNF).
In order to do that, I should:
eliminate ε-productions
eliminate unit productions
remove useless symbols
Immediately I get stuck. The reason is that A is a nullable symbol (ε is one of its bodies).
Of course I can't simply remove the A symbol.
Can anyone help me to get the final solution?
As Wikipedia notes, there are two definitions of Chomsky Normal Form, which differ in their treatment of ε-productions. You will have to pick the one where these are allowed, as otherwise you will never get an equivalent grammar: your grammar produces the empty string, while a CNF grammar following the other definition cannot produce it.
To begin conversion to Chomsky normal form (using Definition (1) provided by the Wikipedia page), you need to find an equivalent essentially noncontracting grammar. A grammar G with start symbol S is essentially noncontracting iff
1. S is not a recursive variable
2. G has no ε-rules other than S -> ε if ε ∈ L(G)
Calling your grammar G, an equivalent grammar G' with a non-recursive start symbol is:
G' : S -> A
A -> aAb | bAa | ε
Clearly, the set of nullable variables of G' is {S,A}, since A -> ε is a production in G' and S -> A is a chain rule. I assume that you have been given an algorithm for removing ε-rules from a grammar. That algorithm should produce a grammar similar to:
G'' : S -> A | ε
A -> aAb | bAa | ab | ba
The grammar G'' is essentially noncontracting; you can now apply the remaining algorithms to the grammar to find an equivalent grammar in Chomsky normal form.
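The ε-rule removal step assumed above can be sketched generically: compute the nullable variables by a fixed point, then replace each rule by every variant in which nullable symbols are independently kept or dropped, discarding empty right-hand sides except for S → ε when the start symbol is nullable. A Python sketch of this (my own illustration), applied to G':

```python
from itertools import product

def remove_epsilon(grammar, start="S"):
    """ε-rule elimination, yielding an essentially noncontracting grammar."""
    nullable = set()
    changed = True
    while changed:                        # fixed point for the nullable set
        changed = False
        for nt, rhss in grammar.items():
            if nt not in nullable and any(all(s in nullable for s in rhs)
                                          for rhs in rhss):
                nullable.add(nt)
                changed = True

    new = {}
    for nt, rhss in grammar.items():
        out = set()
        for rhs in rhss:
            # each nullable symbol may be kept or dropped independently
            choices = [[(sym,), ()] if sym in nullable else [(sym,)]
                       for sym in rhs]
            for combo in product(*choices):
                variant = tuple(s for part in combo for s in part)
                if variant:               # drop empty right-hand sides...
                    out.add(variant)
        if nt == start and start in nullable:
            out.add(())                   # ...but keep S -> ε: ε is in L(G)
        new[nt] = sorted(out)
    return new

# G' : S -> A ; A -> aAb | bAa | ε
G2 = remove_epsilon({"S": [("A",)],
                     "A": [("a", "A", "b"), ("b", "A", "a"), ()]})
```

This reproduces G'': S -> A | ε and A -> aAb | bAa | ab | ba.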