I parsed the sentence "The man swimming in the lake is my father.", which is a reduction of "The man who is swimming in the lake is my father." "Swimming" in this sentence is a verb, so I expected the parse to treat it as one.
However, in both the constituency tree and the dependency parse (DP) below, "swimming" is tagged as a noun:
(ROOT
(S
(NP
(NP (DT The) (NN man) (NN swimming))
(PP (IN in)
(NP (DT the) (NN lake))))
(VP (VBZ is)
(NP (PRP$ my) (NN father)))
(. .)))
det(swimming-3, The-1)
compound(swimming-3, man-2)
nsubj(father-9, swimming-3)
case(lake-6, in-4)
det(lake-6, the-5)
nmod:in(swimming-3, lake-6)
cop(father-9, is-7)
nmod:poss(father-9, my-8)
root(ROOT-0, father-9)
I am confused by the DP output. Is it correct?
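One thing worth noting: the dependency output cannot disagree with the constituency tree here, because the dependencies are converted from that tree and share its POS tags, and the tree above also tags "swimming" as NN. A quick stdlib-only Python check of the bracketing above:

```python
import re

# The constituency parse pasted from above.
tree = """(ROOT (S (NP (NP (DT The) (NN man) (NN swimming))
(PP (IN in) (NP (DT the) (NN lake))))
(VP (VBZ is) (NP (PRP$ my) (NN father))) (. .)))"""

# Pre-terminals in a Penn-Treebank bracketing look like "(TAG word)";
# collect them into a word -> tag mapping.
tags = {word: tag for tag, word in re.findall(r"\(([^\s()]+) ([^\s()]+)\)", tree)}

print(tags["swimming"])  # NN: the parser tagged it as a noun, not a verb
```

So the question is really whether the POS tagger should have chosen VBG over NN for this reduced relative clause, not whether the dependency converter is broken.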
Thank you for sharing your fantastic tool with us; excellent work.
Just a question: why do I get different constituency parsing results from the online demo and the local Python library? I thought both were based on the same model.
For example, with the same input sentence,
They quickly ran to the place which is sound came from.
(from a student's composition).
The online demo gave the result:
(S (NP (PRP They)) (ADVP (RB quickly)) (VBD ran) (PP (IN to) (NP (NP (DT the) (NN place)) (SBAR (WHNP (WDT which)) (S (VP (VBZ is) (NP ***(NN sound)***)))))) (VP (VBD came) (PP (IN from))) (. .))
but the Python library version gives:
(S (NP (PRP They)) (ADVP (RB quickly)) (VBD ran) (PP (IN to) (NP (NP (DT the) (NN place)) (SBAR (WHNP (WDT which)) (S (VP (VBZ is) (NP ***(JJ sound)***)))))) (VP (VBD came) (PP (IN from))) (. .))
It seems the online demo gave a better result.
The demo and the library sometimes go out of sync, because we update the library more often than the demo. Right now I'm in the middle of an effort to update all the demo usage information to use the new AllenNLP 2.0 version.
In your example the demo is indeed better, but your example is ungrammatical, so I would not put too much stock into the results anyways. Essentially, this is an out-of-domain sentence. If I fix the sentence ("They quickly ran to the place which the sound came from."), the parse is correct.
My question has to do with post-processing of part-of-speech-tagged and parsed natural-language sentences. Specifically, I am writing a component of a Lisp post-processor that takes as input a sentence parse tree (such as one produced by the Stanford Parser), extracts from that parse tree the phrase structure rules invoked to generate the parse, and then produces a table of rules and rule counts. An example of input and output would be the following:
(1) Sentence:
John said that he knows who Mary likes
(2) Parser output:
(ROOT
(S
(NP (NNP John))
(VP (VBD said)
(SBAR (IN that)
(S
(NP (PRP he))
(VP (VBZ knows)
(SBAR
(WHNP (WP who))
(S
(NP (NNP Mary))
(VP (VBZ likes))))))))))
(3) My Lisp program post-processor output for this parse tree:
(S --> NP VP) 3
(NP --> NNP) 2
(VP --> VBZ) 1
(WHNP --> WP) 1
(SBAR --> WHNP S) 1
(VP --> VBZ SBAR) 1
(NP --> PRP) 1
(SBAR --> IN S) 1
(VP --> VBD SBAR) 1
(ROOT --> S) 1
Note the lack of punctuation in sentence (1). That's intentional: I am having trouble parsing the punctuation in Lisp, precisely because some punctuation characters (commas, for example) are reserved for special purposes. But parsing sentences without punctuation changes the distribution of the parse rules, as well as the symbols contained in those rules, as illustrated by the following:
(4) Input sentence:
I said no and then I did it anyway
(5) Parser output:
(ROOT
(S
(NP (PRP I))
(VP (VBD said)
(ADVP (RB no)
(CC and)
(RB then))
(SBAR
(S
(NP (PRP I))
(VP (VBD did)
(NP (PRP it))
(ADVP (RB anyway))))))))
(6) Input sentence (with punctuation):
I said no, and then I did it anyway.
(7) Parser output:
(ROOT
(S
(S
(NP (PRP I))
(VP (VBD said)
(INTJ (UH no))))
(, ,)
(CC and)
(S
(ADVP (RB then))
(NP (PRP I))
(VP (VBD did)
(NP (PRP it))
(ADVP (RB anyway))))
(. .)))
Note how including punctuation completely rearranges the parse tree and also involves different POS tags (and thus implies that different grammar rules were invoked to produce it). So including punctuation is important, at least for my application.
What I need is a way to include punctuation in the rules, so that I can produce rules like the following, which would appear in a table like (3):
(8) Desired rule:
S --> S , CC S .
Rules like (8) are in fact desired for the specific application I am writing.
But I am finding that doing this in Lisp is difficult: in (7), for example, we observe the appearance of (, ,) and (. .), both of which are problematic to handle in Lisp.
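As an aside (this is not the Lisp solution asked for, just a point of comparison): in a language whose reader does not reserve , and ., the same extraction takes only a few lines, because punctuation tags are ordinary strings. A self-contained Python sketch:

```python
from collections import Counter

def tokenize(s):
    # Treat parentheses as delimiters; "," and "." are ordinary tokens here.
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    # Build a nested list from a Penn-Treebank-style bracketing.
    tok = tokens.pop(0)
    if tok != "(":
        return tok
    node = []
    while tokens[0] != ")":
        node.append(parse(tokens))
    tokens.pop(0)  # discard ")"
    return node

def extract_rules(tree, acc):
    # Pre-terminals like ["NN", "man"] or [",", ","] yield no phrase rule.
    if isinstance(tree, str) or (len(tree) == 2 and isinstance(tree[1], str)):
        return acc
    acc.append((tree[0], tuple(c if isinstance(c, str) else c[0] for c in tree[1:])))
    for child in tree[1:]:
        extract_rules(child, acc)
    return acc

def rule_table(bracketing):
    return Counter(extract_rules(parse(tokenize(bracketing)), []))
```

Applied to parse (7), rule_table yields exactly the desired rule (8) as the entry ('S', ('S', ',', 'CC', 'S', '.')).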
I have included my relevant Lisp code below. Please note that I'm a neophyte Lisp hacker and so my code isn't particularly pretty or efficient. If someone could suggest how I might modify my below code such that I can parse (7) to produce a table like (3) that includes a rule like (8), I would be most appreciative.
Here is my Lisp code relevant to this task:
(defun WRITE-RULES-AND-COUNTS-SORTED (sent)
  (multiple-value-bind (rules-list counts-list)
      (COUNT-RULES-OCCURRENCES sent)
    ;; Bind COMBLIST locally: SETF on an undeclared variable is
    ;; undefined behavior in Common Lisp.
    (let ((comblist (sort (pairlis rules-list counts-list) #'> :key #'cdr)))
      (format t "~%")
      (dolist (pair comblist)
        (format t "~A~26T~A~%" (car pair) (cdr pair)))
      (format t "~%"))))
(defun COUNT-RULES-OCCURRENCES (sent)
(let* ((original-rules-list (EXTRACT-GRAMMAR sent))
(de-duplicated-list (remove-duplicates original-rules-list :test #'equalp))
(count-list nil))
(dolist (i de-duplicated-list)
(push (reduce #'+ (mapcar #'(lambda (x) (if (equalp x i) 1 0)) original-rules-list) ) count-list))
(setf count-list (nreverse count-list))
(values de-duplicated-list count-list)))
(defun EXTRACT-GRAMMAR (sent &optional (rules-stack nil))
(cond ((null sent)
NIL)
((and (= (length sent) 1)
(listp (first sent))
(= (length (first sent)) 2)
(symbolp (first (first sent)))
(symbolp (second (first sent))))
NIL)
((and (symbolp (first sent))
(symbolp (second sent))
(= 2 (length sent)))
NIL)
((symbolp (first sent))
(push (EXTRACT-GRAMMAR-RULE sent) rules-stack)
(append rules-stack (EXTRACT-GRAMMAR (rest sent) )))
((listp (first sent))
(cond ((not (and (listp (first sent))
(= (length (first sent)) 2)
(symbolp (first (first sent)))
(symbolp (second (first sent)))))
(push (EXTRACT-GRAMMAR-RULE (first sent)) rules-stack)
(append rules-stack (EXTRACT-GRAMMAR (rest (first sent))) (EXTRACT-GRAMMAR (rest sent) )))
(t (append rules-stack (EXTRACT-GRAMMAR (rest sent) )))))))
(defun EXTRACT-GRAMMAR-RULE (sentence-or-phrase)
(append (list (first sentence-or-phrase))
'(-->)
(mapcar #'first (rest sentence-or-phrase))))
The code is invoked as follows (using (1) as input, producing (3) as output):
(WRITE-RULES-AND-COUNTS-SORTED '(ROOT
(S
(NP (NNP John))
(VP (VBD said)
(SBAR (IN that)
(S
(NP (PRP he))
(VP (VBZ knows)
(SBAR
(WHNP (WP who))
(S
(NP (NNP Mary))
(VP (VBZ likes)))))))))))
S-expressions in Common Lisp
In Common Lisp s-expressions, characters like , and . are part of the default syntax.
If you want symbols with arbitrary names in Lisp s-expressions, you have to escape them: either use a backslash to escape a single character, or use a pair of vertical bars to escape multiple characters:
CL-USER 2 > (loop for symbol in '(\, \. | a , b , c .|)
do (describe symbol))
\, is a SYMBOL
NAME ","
VALUE #<unbound value>
FUNCTION #<unbound function>
PLIST NIL
PACKAGE #<The COMMON-LISP-USER package, 76/256 internal, 0/4 external>
\. is a SYMBOL
NAME "."
VALUE #<unbound value>
FUNCTION #<unbound function>
PLIST NIL
PACKAGE #<The COMMON-LISP-USER package, 76/256 internal, 0/4 external>
| a , b , c .| is a SYMBOL
NAME " a , b , c ."
VALUE #<unbound value>
FUNCTION #<unbound function>
PLIST NIL
PACKAGE #<The COMMON-LISP-USER package, 76/256 internal, 0/4 external>
NIL
Tokenizing / Parsing
If you want to deal with other input formats and not s-expressions, you might want to tokenize / parse the input yourself.
Primitive example:
CL-USER 11 > (mapcar (lambda (string)
(intern string "CL-USER"))
(split-sequence " " "S --> S , CC S ."))
(S --> S \, CC S \.)
UPDATE:
Thank you, Dr. Joswig, for your comments and your code demo: both were quite helpful.
In the above question I'm interested in overcoming the fact that , and . are part of Lisp's default syntax (or at least accommodating that fact). So what I ended up doing is writing the function PRODUCE-PARSE-TREE-WITH-PUNCT-FROM-FILE-READ. What it does is read in one parse tree from a file as a series of strings; trim whitespace from the strings; concatenate the strings into a single string representation of the parse tree; and then scan that string, character by character, searching for instances of punctuation to modify. The modification implements Dr. Joswig's suggestion. Finally, the modified string is converted to a tree (list representation) and sent off to the extractor to produce the rules-and-counts table. To implement this I cobbled together bits of code found elsewhere on StackOverflow along with my own original code. The result (not all punctuation is handled, of course, since this is just a demo):
(defun PRODUCE-PARSE-TREE-WITH-PUNCT-FROM-FILE-READ (file-name)
(let ((result (make-array 1 :element-type 'character :fill-pointer 0 :adjustable T))
(list-of-strings-to-process (mapcar #'(lambda (x) (string-trim " " x))
(GET-PARSE-TREE-FROM-FILE file-name)))
(concatenated-string nil)
(punct-list '(#\, #\. #\; #\: #\! #\?))
(testchar nil)
(string-length 0))
(setf concatenated-string (format nil "~{ ~A~}" list-of-strings-to-process))
(setf string-length (length concatenated-string))
(do ((i 0 (incf i)))
((= i string-length) NIL)
(setf testchar (char concatenated-string i))
(cond ((member testchar punct-list)
(vector-push-extend #\| result)
(vector-push-extend testchar result)
(vector-push-extend #\| result))
(t (vector-push-extend testchar result))))
(with-input-from-string (s result)
(loop for x = (read s nil :end) until (eq x :end) collect x))))
(defun GET-PARSE-TREE-FROM-FILE (file-name)
(with-open-file (stream file-name)
(loop for line = (read-line stream nil)
while line
collect line)))
Note that GET-PARSE-TREE-FROM-FILE reads only one tree from a file that consists of only one tree. These two functions are not, of course, ready for prime-time!
And finally, a parse tree containing (Lisp-reserved) punctuation can be processed--and thus the original goal met--as follows (user supplies the filename containing one parse tree):
(WRITE-RULES-AND-COUNTS-SORTED
(PRODUCE-PARSE-TREE-WITH-PUNCT-FROM-FILE-READ filename))
The following output is produced:
(NP --> PRP) 3
(PP --> IN NP) 2
(VP --> VB PP) 1
(S --> VP) 1
(VP --> VBD) 1
(NP --> NN CC NN) 1
(ADVP --> RB) 1
(PRN --> , ADVP PP ,) 1
(S --> PRN NP VP) 1
(WHADVP --> WRB) 1
(SBAR --> WHADVP S) 1
(NP --> NN) 1
(NP --> DT NN) 1
(ADVP --> NP IN) 1
(VP --> VBD ADVP NP , SBAR) 1
(S --> NP VP) 1
(S --> S : S .) 1
(ROOT --> S) 1
That output was the result of using the following input (saved as filename):
(ROOT
(S
(S
(NP (PRP It))
(VP (VBD was)
(ADVP
(NP (DT the) (NN day))
(IN before))
(NP (NN yesterday))
(, ,)
(SBAR
(WHADVP (WRB when))
(S
(PRN (, ,)
(ADVP (RB out))
(PP (IN of)
(NP (NN happiness)
(CC and)
(NN mirth)))
(, ,))
(NP (PRP I))
(VP (VBD decided))))))
(: :)
(S
(VP (VB go)
(PP (IN for)
(NP (PRP it)))))
(. !)))
CoreNLP parsing is very slow on bad input. It emits the following kinds of warnings and takes a long time to parse.
For input:
"The Lincolns' fourth son, Thomas "Tad" Lincoln, was born on April 4, 1853, and died of heart failure at the age of 18 on July 16, 1871."
it produces these warnings:
Jul 24, 2015 4:03:42 PM edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder funkyFindLeafWithApproximateSpan
WARNING: RuleBasedCorefMentionFinder: Failed to find head token:
Tree is: (ROOT (S (NP (NP (NP (DT The) (NNS Lincolns) (POS ')) (JJ fourth) (NN son)) (, ,) (NP (NNP Thomas) (`` ``) (NNP Tad) ('' '') (NNP Lincoln)) (, ,)) (VP (VP (VBD was) (VP (VBN born) (PP (IN on) (NP (NP (NNP April) (CD 4)) (, ,) (NP (CD 1853)) (, ,))))) (CC and) (VP (VBD died) (PP (IN of) (NP (NN heart) (NN failure))) (PP (IN at) (NP (NP (DT the) (NN age)) (PP (IN of) (NP (CD 18))))) (PP (IN on) (NP (NNP July) (CD 16))) (, ,) (NP (CD 1871)))) (. .)))
token = |NP|0|, approx=0
Jul 24, 2015 4:03:42 PM edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder funkyFindLeafWithApproximateSpan
WARNING: RuleBasedCorefMentionFinder: Last resort: returning as head: 1871
Jul 24, 2015 4:03:42 PM edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder findHead
WARNING: Invalid index for head 34=34-0: originalSpan=[The Lincolns '], head=1871-35
Jul 24, 2015 4:03:42 PM edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder findHead
WARNING: Setting head string to entire mention
It took me 600.339 seconds to parse the cleaned text of this document: https://en.wikipedia.org/wiki/Abraham_Lincoln.
Is there any way to speed this up? Is there an option in CoreNLP to skip bad sentences automatically, or a way to set a time limit for parsing a sentence, after which the parser automatically skips it?
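One knob worth checking (this is a pointer to verify against the CoreNLP documentation for your version, not a confirmed fix from this thread): the parser annotator accepts a parse.maxlen property that makes it skip sentences longer than N tokens, which sidesteps exactly this kind of pathological run-on input. A sketch of a pipeline properties file:

```properties
# Sketch of pipeline properties; confirm names against your CoreNLP version.
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref
# Skip full parsing of sentences longer than 70 tokens; run-on "sentences"
# in scraped text are a common cause of pathological parse times.
parse.maxlen = 70
pos.maxlen = 70
```

Later CoreNLP versions also expose a per-sentence time limit for the parser (parse.maxtime, in milliseconds); whether your version supports it is something to verify in its documentation.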
As a beginner in machine learning, I am confused about how to parse the given sentence:
"I am in the left side of river."
I tried a lot but was not able to get an exact solution.
There are different language parsers available; which one to use depends on your requirements. Check out some of these to get started:
http://www.nltk.org/howto/parse.html
http://nlp.stanford.edu/software/lex-parser.shtml
Google "sentence parser" and you will get a long list.
Here is the result with the Stanford parser:
NLP> I am in the left side of river.
Sentence #1 (9 tokens):
I am in the left side of river.
[Text=I CharacterOffsetBegin=0 CharacterOffsetEnd=1 PartOfSpeech=PRP Lemma=I NamedEntityTag=O] [Text=am CharacterOffsetBegin=2 CharacterOffsetEnd=4 PartOfSpeech=VBP Lemma=be NamedEntityTag=O] [Text=in CharacterOffsetBegin=5 CharacterOffsetEnd=7 PartOfSpeech=IN Lemma=in NamedEntityTag=O] [Text=the CharacterOffsetBegin=8 CharacterOffsetEnd=11 PartOfSpeech=DT Lemma=the NamedEntityTag=O] [Text=left CharacterOffsetBegin=12 CharacterOffsetEnd=16 PartOfSpeech=JJ Lemma=left NamedEntityTag=O] [Text=side CharacterOffsetBegin=17 CharacterOffsetEnd=21 PartOfSpeech=NN Lemma=side NamedEntityTag=O] [Text=of CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=IN Lemma=of NamedEntityTag=O] [Text=river CharacterOffsetBegin=25 CharacterOffsetEnd=30 PartOfSpeech=NN Lemma=river NamedEntityTag=O] [Text=. CharacterOffsetBegin=30 CharacterOffsetEnd=31 PartOfSpeech=. Lemma=. NamedEntityTag=O]
(ROOT
(S
(NP (PRP I))
(VP (VBP am)
(PP (IN in)
(NP
(NP (DT the) (JJ left) (NN side))
(PP (IN of)
(NP (NN river))))))
(. .)))
root(ROOT-0, am-2)
nsubj(am-2, I-1)
det(side-6, the-4)
amod(side-6, left-5)
prep_in(am-2, side-6)
prep_of(side-6, river-8)
nltk parser:
>>> nltk.parse.chart.demo(3, print_times=False, trace=0,
... sent='I saw John with a dog', numparses=2)
* Sentence:
I saw John with a dog
['I', 'saw', 'John', 'with', 'a', 'dog']
* Strategy: Bottom-up left-corner
Nr edges in chart: 36
(S
(NP I)
(VP (VP (Verb saw) (NP John)) (PP with (NP (Det a) (Noun dog)))))
(S
(NP I)
(VP (Verb saw) (NP (NP John) (PP with (NP (Det a) (Noun dog))))))
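The chart-parser demo above uses NLTK's built-in toy grammar and a different sentence. For the sentence actually asked about, you can write your own small CFG; the grammar below is invented purely for illustration and would need to be much larger for real input:

```python
import nltk

# A toy grammar hand-written for this one sentence (an assumption for
# illustration, not a general-purpose grammar).
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> 'I' | Det JJ N | N | NP PP
    VP -> V PP
    PP -> P NP
    Det -> 'the'
    JJ -> 'left'
    N  -> 'side' | 'river'
    V  -> 'am'
    P  -> 'in' | 'of'
""")

parser = nltk.ChartParser(grammar)
trees = list(parser.parse("I am in the left side of river".split()))
for tree in trees:
    print(tree)
```

This produces a single S parse with the PP "of river" attached to "the left side", mirroring the Stanford output above.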
I am trying to figure out how to train the Stanford LexicalizedParser
(edu.stanford.nlp.parser.lexparser.LexicalizedParser) to incorporate new nouns into its lexicon.
At first my goal was to take an existing model and tweak it slightly, rather than creating a brand-new model from a vast set of training examples.
The answer to this question suggests that is not possible:
How can I add more tagged words to the Stanford POS-Tagger's trained models?
Hopefully someone out there can put me on the right track.
As a concrete example of what I want to do, say I have the word 'researchgate', which I want treated as a noun when I parse sentences. Currently 'researchgate' is treated as different parts of speech depending on its position, but I want it identified as 'NN' (noun).
Examples...
instead of this:
(NP
(NP (JJ recent) (NN activity))
(PP (IN in)
(NP (PRP$ your) (JJ researchgate) (NNS topics)))))
i want this:
(NP
(NP (JJ recent) (NN activity))
(PP (IN in)
(NP (PRP$ your) (NN researchgate) (NNS topics)))))
and instead of this:
(ROOT
(FRAG
(NP (NN subscription))
(S
(VP (TO to)
(VP (VB researchgate))))))
i want this:
(ROOT
(NP
(NP (NN subscription))
(PP (TO to)
(NP (NN researchgate)))))
I am currently using this model: models/edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz
I tried doing this:
java -cp stanford-parser.jar
edu.stanford.nlp.parser.lexparser.LexicalizedParser -train /tmp/train.txt
with the contents of /tmp/train.txt as follows:
(NP
(NP (JJ recent) (NN activity))
(PP (IN in)
(NP (PRP$ your) (JJ researchgate) (NNS topics)))))
I got a bunch of promising output, but then got this error:
Error. Can't parse test sentence: [This, is, just, a, test, .]
So clearly I need to supply more examples than just the one I have in /tmp/train.txt.
Looking at the documentation, there seems to be one promising method on LexicalizedParser that I am considering trying:
public static LexicalizedParser getParserFromTreebank(Treebank trainTreebank,
Treebank secondaryTrainTreebank,
double weight,
GrammarCompactor compactor,
Options op,
Treebank tuneTreebank,
List<List<TaggedWord>> extraTaggedWords)
I am hesitant to jump in and try this because it seems tricky to get the Options right.
The doc says:
options to the parser which MUST be the SAME at both training and testing (parsing) time in
order for the parser to work properly
so I might need guidance on how to extract the options used for edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz. Perhaps it is edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams?
Also, maybe I want to add 'researchgate' as one of my extraTaggedWords?
I have the feeling I am on the right track but was hoping to get some advice before descending into a rat hole.
Thanks in advance!
chris
I posted to the Stanford Parser mailing list and received an answer from John Bauer (thanks, John!):
John Bauer
Unfortunately, you would need to start training from the beginning. There is no way to extend a current parser model.
That feature is on "the list", but it's somewhere near the back, so don't hold your breath...
John