How to extract SOV from a sentence using popular NLP libraries? I have read that one method is to generate a dependency structure and convert it to an SOV structure.
In Stanford CoreNLP, you can take a look at our Dependency Parser, which produces dependency trees (SemanticGraphs) that can be queried using Semgrex. For example, with the pattern
{pos:/V.*/}=verb >/.subj.*/ {}=subject >/.obj.*/ {}=object
Alternatively, the Stanford OpenIE system may be of interest. To a first approximation, I think what you're looking for is OpenIE's extraction of (subject; relation; object) triples. In the same vein, the University of Washington has a number of OpenIE systems that you can take a look at: Ollie and the more recent OpenIE4.
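If you'd rather work in Python, here is a minimal sketch of the same dependency-tree route using spaCy (my substitution, not part of the answer above; the en_core_web_sm model and the dependency labels below are assumptions, and labels vary across models):

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline

def extract_svo(text):
    # Walk the dependency parse: for each verb, pair its subject
    # dependents with its direct-object dependents.
    triples = []
    for token in nlp(text):
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ == "dobj"]
            for subj in subjects:
                for obj in objects:
                    triples.append((subj.text, token.lemma_, obj.text))
    return triples

print(extract_svo("A large company needs a sustainable business model."))
# expected: [('company', 'need', 'model')]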
I am using Gensim to train Word2Vec. I know word similarities are determined by whether the words can replace each other and still make sense in a sentence. But can word similarities be used to extract relationships between entities?
Example:
I have a bunch of interview documents, and in each interview the interviewee always says the name of their manager. If I wanted to extract the name of the manager from these interview transcripts, could I just get a list of all human names in the document (using NLP) and assume that the name most similar to the word "manager" under Word2Vec is most likely the manager?
Does this thought process make any sense with Word2Vec? If it doesn't, would the ML solution to this problem be to feed my word embeddings into a sequence-to-sequence model?
Yes, word-vector similarities & relative-arrangements can indicate relationships.
In the original Word2Vec paper, this was demonstrated by using word-vectors to solve word-analogies. The most famous example involves the analogy "'man' is to 'king' as 'woman' is to ?".
By starting with the word-vector for 'king', then subtracting the vector for 'man' and adding the vector for 'woman', you arrive at a new point in the coordinate system. And then, if you look for other words close to that new point, often the closest word will be 'queen'. Essentially, the directions & distances have helped find a word that's related in a particular way – a gender-reversed equivalent.
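For instance, with Gensim and a pretrained vector set (the model name below is one of the sets distributed through gensim.downloader, chosen here as an assumption; any large pretrained vectors behave similarly):

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # pretrained word vectors
# king - man + woman lands near 'queen'
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically something like [('queen', 0.77)]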
And, in large news-based corpuses, famous names like 'Obama' or 'Bush' do wind up with vectors closer to their well-known job titles like 'president'. (There will be many contexts in such corpuses where the words appear immediately together – "President Obama today signed…" – or simply in similar roles – "The President appointed…" or "Obama appointed…", etc.)
However, I suspect that's less-likely to work with your 'manager' interview-transcripts example. Achieving meaningful word-to-word arrangements depends on lots of varied examples of the words in shared usage contexts. Strong vectors require large corpuses of millions to billions of words. So the transcripts with a single manager wouldn't likely be enough to get a good model – you'd need transcripts across many managers.
And in such a corpus each manager's name might not be strongly associated with just manager-like contexts. The same name(s) will be repeated when also mentioning other roles, and transcripts may not especially refer to managerial action in helpful third-person ways that make specific name-vectors well-positioned. (That is, there won't be clean expository statements like "John_Smith called a staff meeting" or "John_Smith cancelled the project", alongside others like "…manager John_Smith…" or "The manager cancelled the project".)
How to create a dependency graph (parse tree) for arbitrary sentences? Is there any predefined grammar for parsing English sentences using NLTK?
Example:
I want to make a parse tree for the sentence
“A large company needs a sustainable business model.”
which should look like this.
Please suggest how this can be done.
This question is a near-duplicate of 3125926. But I'll elaborate just a little on the answer given there.
I don't have personal experience with dependency parsing under NLTK, but according to the accepted answer, the integration with MaltParser is documented at http://nltk.googlecode.com/svn/trunk/doc/api/nltk.parse.malt.MaltParser-class.html
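In rough outline, usage looks something like the following (an untested sketch; NLTK's MaltParser argument names have changed across versions, and the paths and model name are placeholders for a local MaltParser download and a pretrained English model such as engmalt):

from nltk.parse.malt import MaltParser

# Placeholder paths: point these at your MaltParser install and model.
mp = MaltParser("/path/to/maltparser-1.9.2", "/path/to/engmalt.linear-1.7.mco")
graph = mp.parse_one("A large company needs a sustainable business model .".split())
print(graph.tree())  # the resulting dependency graph, rendered as a tree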
If for some reason MaltParser doesn't suit your needs, you might also take a look at MSTParser and the Stanford Parser. I think those three options are the best-known, and I expect one (or all) of them will work for you.
Note that the Stanford Parser includes routines to convert from constituency trees and between several of the standard dependency representations, so if you need a specific format, you might look at the format-conversion arguments to the edu.stanford.nlp.trees.EnglishGrammaticalStructure class.
e.g., to convert from constituency trees to basic dependencies:
java -cp stanford-parser.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile <input trees> -basic
I use strings heavily in a project, so what I am looking for is a fast library for handling them. I think the Boyer-Moore algorithm is the best.
Is there a free solution for that?
You can consider the following resources implementing the Boyer–Moore algorithm:
Boyer-Moore Horspool in Delphi 2010
Boyer-Moore-Horspool text searching
Search Components - Version 2.1
Boyer-moore, de la recherche efficace (in French: "Boyer-Moore, on efficient search")
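For reference, the Boyer-Moore-Horspool variant several of these resources implement is short enough to sketch. Here it is in Python rather than Delphi, purely to illustrate the shift-table idea:

def horspool_search(text, pattern):
    # Bad-character table: for each pattern char except the last,
    # how far the search window may jump on a mismatch.
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i  # match found at offset i
        # Jump by the table entry for the text character aligned with
        # the last pattern position (full pattern length if unseen).
        i += shift.get(text[i + m - 1], m)
    return -1

print(horspool_search("here is a simple example", "example"))  # 17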
Last Edit:
The StringSimilarity package of the theunknownones project is a good source for fuzzy and phonetic string comparison algorithms:
DamerauLevenshtein
Koelner Phonetik
SoundEx
Metaphone
DoubleMetaphone
NGram
Dice
JaroWinkler
NeedlemanWunch
SmithWatermanGotoh
MongeElkan
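To give a flavor of what the phonetic entries in that list do, here is a simplified Soundex in Python (illustrative only: it omits the h/w rule of classical Soundex, which the Delphi package implements in full):

def soundex(word):
    # Map consonants to digit classes; vowels, h, w, y get no code.
    codes = {}
    for chars, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for c in chars:
            codes[c] = digit
    word = word.lower()
    encoded = [codes.get(c) for c in word]
    result = word[0].upper()
    prev = encoded[0]
    for code in encoded[1:]:
        if code and code != prev:  # skip repeats of the same class
            result += code
        prev = code
    return (result + "000")[:4]  # pad/truncate to letter + 3 digits

print(soundex("Robert"), soundex("Rupert"))  # R163 R163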
CAUTION: this answers the comment rather than the question itself.
There is (or rather was, since it has by now been abandoned) a Delphi unit named FastStrings which implements the Boyer–Moore string search algorithm with heavy use of inline assembler. Is it the one you are looking for?
As a side note: the project homepage is now defunct, as is the author's e-mail, so I find reuse (and modification and, naturally, any further development) of this code rather problematic given how restrictive the licensing terms are.
I've studied some simple semantic network implementations and basic techniques for parsing natural language. However, I haven't seen many projects that try to bridge the gap between the two.
For example, consider the dialog:
"the man has a hat"
"he has a coat"
"what does he have?" => "a hat and coat"
A simple semantic network, based on the grammar tree parsing of the above sentences, might look like:
the_man = Entity('the man')
has = Entity('has')
a_hat = Entity('a hat')
a_coat = Entity('a coat')
Relation(the_man, has, a_hat)
Relation(the_man, has, a_coat)
print(the_man.relations(has))  # => ['a hat', 'a coat']
However, this implementation assumes the prior knowledge that the text segments "the man" and "he" refer to the same network entity.
How would you design a system that "learns" these relationships between segments of a semantic network? I'm used to thinking about ML/NLP problems in terms of creating a simple training set of attribute/value pairs and feeding it to a classification or regression algorithm, but I'm having trouble formulating this problem that way.
Ultimately, it seems I would need to overlay probabilities on top of the semantic network, but that would drastically complicate an implementation. Is there any prior art along these lines? I've looked at a few libraries, like NLTK and OpenNLP, and while they have decent tools to handle symbolic logic and parse natural language, neither seems to have any kind of probabilistic framework for converting one to the other.
There is quite a lot of history behind this kind of task. Your best start is probably by looking at Question Answering.
The general advice I always give is that if you have some highly restricted domain where you know about all the things that might be mentioned and all the ways they interact then you can probably be quite successful. If this is more of an 'open-world' problem then it will be extremely difficult to come up with something that works acceptably.
The task of extracting relationships from natural language is called 'relationship extraction' (funnily enough), and sometimes fact extraction. This is a pretty large field of research; this guy did a PhD thesis on it, as have many others. There are a large number of challenges here, as you've noticed, like entity detection, anaphora resolution, etc. This means that there will probably be a lot of 'noise' in the entities and relationships you extract.
As for representing facts that have been extracted in a knowledge base, most people tend not to use a probabilistic framework. At the simplest level, entities and relationships are stored as triples in a flat table. Another approach is to use an ontology to add structure and allow reasoning over the facts. This makes the knowledge base vastly more useful, but adds a lot of scalability issues. As for adding probabilities, I know of the Prowl project that is aimed at creating a probabilistic ontology, but it doesn't look very mature to me.
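As a toy illustration of the flat-table approach (just a sketch; a real system would use a database or a dedicated triple store):

triples = set()

def add(subj, rel, obj):
    triples.add((subj, rel, obj))

def query(subj=None, rel=None, obj=None):
    # None acts as a wildcard in any position.
    return [t for t in triples
            if subj in (None, t[0])
            and rel in (None, t[1])
            and obj in (None, t[2])]

add("the man", "has", "a hat")
add("the man", "has", "a coat")
print(query(subj="the man", rel="has"))  # both 'has' triples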
There is some research into probabilistic relational modelling, mostly into Markov Logic Networks at the University of Washington and Probabilistic Relational Models at Stanford and other places. I'm a little out of touch with the field, but this is a difficult problem and it's all early-stage research as far as I know. There are a lot of issues, mostly around efficient and scalable inference.
All in all, it's a good idea and a very sensible thing to want to do. However, it's also very difficult to achieve. If you want to look at a slick example of the state of the art (i.e., what is possible with a bunch of people and money), maybe check out PowerSet.
Interesting question, I've been doing some work on a strongly-typed NLP engine in C#: http://blog.abodit.com/2010/02/a-strongly-typed-natural-language-engine-c-nlp/ and have recently begun to connect it to an ontology store.
To me it looks like the issue here is really: How do you parse the natural language input to figure out that 'He' is the same thing as "the man"? By the time it's in the Semantic Network it's too late: you've lost the fact that statement 2 followed statement 1 and the ambiguity in statement 2 can be resolved using statement 1. Adding a third relation after the fact to say that "He" and "the man" are the same is another option but you still need to understand the sequence of those assertions.
Most NLP parsers seem to focus on parsing single sentences or large blocks of text, but less frequently on handling conversations. In my own NLP engine there's a conversation history which allows one sentence to be understood in the context of all the sentences that came before it (and also the parsed, strongly-typed objects that they referred to). So the way I would handle this is to recognize that "he" is ambiguous in the current sentence and then look back to try to figure out who the last male person mentioned was.
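In sketch form, the look-back amounts to something like this (hypothetical names and data structure; my actual engine is in C#, but Python keeps the sketch short):

history = []  # one list of mentions per utterance, most recent last

def resolve_pronoun(pronoun):
    # Walk the conversation history backwards looking for the most
    # recent mention whose gender matches the pronoun.
    wanted = {"he": "male", "she": "female"}.get(pronoun.lower())
    for mentions in reversed(history):
        for mention in reversed(mentions):
            if mention.get("gender") == wanted:
                return mention["text"]
    return None  # unresolved: no matching antecedent

history.append([{"text": "the man", "gender": "male"}])
history.append([{"text": "a coat", "gender": None}])
print(resolve_pronoun("he"))  # 'the man'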
In the case of my home, for example, it might tell you that you missed a call from a number that's not in its database. You can type "It was John Smith" and it can figure out that "it" means the call that was just mentioned to you. But if you typed "Tag it as Party Music" right after the call, it would still resolve to the song that's currently playing, because the house looks back for something that is ITaggable.
I'm not exactly sure if this is what you want, but take a look at natural language generation (Wikipedia): the "reverse" of parsing, constructing derivations that conform to the given semantic constraints.
Is there existing software for discriminative reranking, such as that used by the Charniak NLP parser, Shen, Sarkar, and Och's parser or Shen and Joshi's techniques? I'd like something that I can easily adapt for my own uses, which are similar to parse reranking.
Charniak-Johnson Reranking
The source code for the Charniak-Johnson (CJ) reranking parser is freely available; you can download a copy here.
The reranker is a separate code module that takes as input n-best lists of parses, so it's trivial to decouple it from the parsing front end.
SVM-rank
Alternatively, the SVM-rank package, from Thorsten Joachims' lab at Cornell, is a general-purpose ranker. It might be easier to go with this package if what you want to do deviates significantly from what's being done by the Charniak-Johnson parser.
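Training and prediction run from the command line, in the same spirit as SVMlight (the file names below are placeholders; each input line encodes one candidate, e.g. one parse, with a rank target, a qid grouping candidates of the same sentence, and sparse feature:value pairs):

svm_rank_learn -c 20.0 train.dat model.dat
svm_rank_classify test.dat model.dat predictions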