Is there any way to find those sentences that are describing objects?
For example sentences like "This is a good product" or "You are very beautiful"
I guess I can create an algorithm by using TokenSequencePattern and filter with POS some patterns like PRONUN + VERB + ADJECTIVE but don't think would be something reliable.
I am asking you if there is something out of the box, what I am trying to do is to identify review comments on a webpage.
Instead of POS tagging, you would achieve better results by dependency parsing. By using that instead of POS tagging & patterns as you mentioned, you will have richer and accurate information about the sentence structure.
Example:
https://demos.explosion.ai/displacy/?text=The%20product%20was%20really%20very%20good.&model=en_core_web_sm&cpu=0&cph=0
Stanford NLP does support depedency parsing.
Apart from that you can also use the brilliant SpaCy.
Related
I'm making an application using a dependency tree parser. Actually, the parser is this one:
Parser Stanford, but it rarely change one or two letters of some words in a sentence that I want to parse. This is a big trouble for me, because I can't see any pattern in these changes and I need the dependency tree with the same words of my sentence.
All I can see is that just some words have these problems. I'm working with a tweets database. So, I have a lot of grammar mistakes in this data. For example the hashtag '#AllAmericanhumour ' becomes AllAmericanhumor. It misses one letter(u).
Is there anything I can do to solve this problem? In my first view I thought using an edit distance algorithm, but I think that might be an easier way to do it.
Thanks everybody in advance
You can give options to the tokenizer with the -tokenize.options flag/property. For this particular normalization, you can turn it off with
-tokenize.options americanize=false
There are also various other normalizations that you can turn off (see PTBTokenizer or http://nlp.stanford.edu/software/tokenizer.shtml. You can turn off a lot with
-tokenize.options ptb3Escaping=false
However, the parser is trained on data that looks like the output of ptb3Escaping=true and so will tend to degrade in performance if used with unnormalized tokens. So, you may want to consider alternative strategies.
If you're working at the Java level, you can look at the word tokens, which are actually Maps, and they have various keys. OriginalTextAnnotation will give you the unnormalized token, even when it has been normalized. CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation will map to character offsets into the text.
p.s. And you should accept some answers :-).
Is there a way to to find all the sub-sentences of a sentence that still are meaningful and contain at least one subject, verb, and a predicate/object?
For example, if we have a sentence like "I am going to do a seminar on NLP at SXSW in Austin next month". We can extract the following meaningful sub-sentences from this sentence: "I am going to do a seminar", "I am going to do a seminar on NLP", "I am going to do a seminar on NLP at SXSW", "I am going to do a seminar at SXSW", "I am going to do a seminar in Austin", "I am going to do a seminar on NLP next month", etc.
Please note that there is no deduced sentences here (e.g. "There will be a NLP seminar at SXSW next month". Although this is true, we don't need this as part of this problem.) . All generated sentences are strictly part of the given sentence.
How can we approach solving this problem? I was thinking of creating annotated training data that has a set of legal sub-sentences for each sentence in the training data set. And then write some supervised learning algorithm(s) to generate a model.
I am quite new to NLP and Machine Learning, so it would be great if you guys could suggest some ways to solve this problem.
You can use dependency parser provided by Stanford CoreNLP.
Collapsed output of your sentence will look like below.
nsubj(going-3, I-1)
xsubj(do-5, I-1)
aux(going-3, am-2)
root(ROOT-0, going-3)
aux(do-5, to-4)
xcomp(going-3, do-5)
det(seminar-7, a-6)
dobj(do-5, seminar-7)
prep_on(seminar-7, NLP-9)
prep_at(do-5, -11)
prep_in(do-5, Austin-13)
amod(month-15, next-14)
tmod(do-5, month-15)
The last 5 of your sentence output are optional. You can remove one or more parts that are not essential to your sentence.
Most of this optional parts are belong to prepositional and modifier e.g : prep_in, prep_do, advmod, tmod, etc. See Stanford Dependency Manual.
For example, if you remove all modifier from the output, you will get
I am going to do a seminar on NLP at SXSW in Austin.
There's a paper titled "Using Discourse Commitments to Recognize Textual Entailment" by Hickl et al that discusses the extraction of discourse commitments (sub-sentences). The paper includes a description of their algorithm which in some level operates on rules. They used it for RTE, and there may be some minimal levels of deduction in the output. Text simplification maybe a related area to look at.
The following paper http://www.mpi-inf.mpg.de/~rgemulla/publications/delcorro13clausie.pdf processes the dependencies from the Stanford parser and contructs simple clauses (text-simplification).
See the online demo - https://d5gate.ag5.mpi-sb.mpg.de/ClausIEGate/ClausIEGate
One approach would be with a parser such as a PCFG. Trying to just train a model to detect 'subsentences' is likely to suffer from data sparsity. Also, I am doubtful that you could write down a really clean and unambiguous definition of a subsentence, and if you can't define it, you can't get annotators to annotate for it.
I am experimenting with machine learning in general, and Bayesian analysis in particular, by writing a tool to help me identify my collection of e-books. The input data consist of a set of e-book files, whose names and in some cases contents contain hints as to the book they correspond to.
Some are obvious to the human reader, like:
Artificial Intelligence - A Modern Approach 3rd.pdf
Microsoft Press - SharePoint Foundation 2010 Inside Out.pdf
The Complete Guide to PC Repair 5th Ed [2011].pdf
Hamlet.txt
Others are not so obvious:
Vsphere5.prc (Actually 'Mastering VSphere 5' by Scott Lowe)
as.ar.pdf (Actually 'Atlas Shrugged' by Ayn Rand)
Rather than try to code various parsers for different formats of file names, I thought I would build a few dozen simple rules, each with a score.
For example, one rule would look in the first few pages of the file for something resembling an ISBN number, and if found would propose a hypothesis that the file corresponds to the book identified by that ISBN number.
Another rule would look to see if the file name is in 'Author - Title' format and, if so, would propose a hypothesis that the author is 'Author' and the title is 'Title'. Similar rules for other formats.
I thought I could also get a list of book titles and authors from Amazon or an ISBN database, and search the file name and first few pages of the file for any of these; any matches found would result in a hypothesis being suggested by that rule.
In the end I would have a set of tuples like this:
[rulename,hypothesis]
I expect that some rules, such as the ISBN match, will have a high probability of being correct, when they are available. Other rules, like matches based on known book titles and authors, would be more common but not as accurate.
My questions are:
Is this a good approach for solving this problem?
If so, is Bayesian analysis a good candidate for combining all of these rules' hypotheses into compound score to help determine which hypothesis is the strongest, or most likely?
Is there a better way to solve this problem, or some research paper or book which you can suggest I turn to for more information?
It depends on the size of your collection and the time you want to spend training the classifier. It will be difficult to get good generalization that will save you time. For any type of classifier you will have to create a large training set, and also find a lot of rules before you get good accuracy. It will probably be more efficient (less false positives) to create the rules and use them only to suggest title alternatives for you to choose from, and not to implement the classifier. But, if the purpose is learning, then go ahead.
I am searching for information on algorithms to process text sentences or to follow a structure when creating sentences that are valid in a normal human language such as English. I would like to know if there are projects working in this field that I can go learn from or start using.
For example, if I gave a program a noun, provided it with a thesaurus (for related words) and part-of-speech (so it understood where each word belonged in a sentence) - could it create a random, valid sentence?
I'm sure there are many sub-sections of this kind of research so any leads into this would be great.
The field you're looking for is called natural language generation, a subfield of natural language processing
http://en.wikipedia.org/wiki/Natural_language_processing
Sentence generation is either really easy or really hard depending on how good you want the sentences to be. Currently, there aren't programs that will be able to generate 100% sensible sentences about given nouns (even with a thesaurus) -- if that is what you mean.
If, on the other hand, you would be satisfied with nonsense that was sometimes ungrammatical, then you could try an n-gram based sentence generator. These just chain together of words that tend to appear in sequence, and 3-4-gram generators look quite okay sometimes (although you'll recognize them as what generates a lot of spam email).
Here's an intro to the basics of n-gram based generation, using NLTK:
http://www.nltk.org/book/ch02.html#generating-random-text-with-bigrams
This is called NLG (Natural Language Generation), although that is mainly the task of generating text that describes a set of data. There is also a lot of research on completely random sentence generation as well.
One starting point is to use Markov chains to generate sentences. How this is done is that you have a transition matrix that says how likely it is to transition between every every part-of-speech. You also have the most likely starting and ending part-of-speech of a sentence. Put this all together and you can generate likely sequences of parts-of-speech.
Now, you are far from done, this will first of all not offer a very good result as you are only considering the probability between adjacent words (also called bi-grams), so what you want to do is to extend this to look for instance at the transition matrix between three parts-of-speech (this makes a 3D matrix and gives you trigrams). You can extend it to 4-grams, 5-grams, etc. depending on the processing power and if your corpus can fill such matrix.
Lastly, you need to patch up things such as object agreement (subject-verb-agreement, adjective-verb-agreement (not in English though), etc.) and tense, so that everything is congruent.
Yes. There is some work dealing with solving problems in NLG with AI techniques. As far as I know, currently, there is no method that you can use for any practical use.
If you have the background, I suggest getting familiar with some work by Alexander Koller from Saarland University. He describes how to code NLG to PDDL. The main article you'll want to read is "Sentence generating as a planning problem".
If you do not have any background in NLP, just search for the online courses or course materials by Michael Collings or Dan Jurafsky.
Writing random sentences is not that hard. Any parser textbook's simple-english-grammar example can be run in reverse to generate grammatically correct nonsense sentences.
Another way is the word-tuple-random-walk, made popular by the old BYTE magazine TRAVESTY, or stuff like
http://www.perlmonks.org/index.pl?node_id=94856
I've studied some simple semantic network implementations and basic techniques for parsing natural language. However, I haven't seen many projects that try and bridge the gap between the two.
For example, consider the dialog:
"the man has a hat"
"he has a coat"
"what does he have?" => "a hat and coat"
A simple semantic network, based on the grammar tree parsing of the above sentences, might look like:
the_man = Entity('the man')
has = Entity('has')
a_hat = Entity('a hat')
a_coat = Entity('a coat')
Relation(the_man, has, a_hat)
Relation(the_man, has, a_coat)
print the_man.relations(has) => ['a hat', 'a coat']
However, this implementation assumes the prior knowledge that the text segments "the man" and "he" refer to the same network entity.
How would you design a system that "learns" these relationships between segments of a semantic network? I'm used to thinking about ML/NL problems based on creating a simple training set of attribute/value pairs, and feeding it to a classification or regression algorithm, but I'm having trouble formulating this problem that way.
Ultimately, it seems I would need to overlay probabilities on top of the semantic network, but that would drastically complicate an implementation. Is there any prior art along these lines? I've looked at a few libaries, like NLTK and OpenNLP, and while they have decent tools to handle symbolic logic and parse natural language, neither seems to have any kind of proabablilstic framework for converting one to the other.
There is quite a lot of history behind this kind of task. Your best start is probably by looking at Question Answering.
The general advice I always give is that if you have some highly restricted domain where you know about all the things that might be mentioned and all the ways they interact then you can probably be quite successful. If this is more of an 'open-world' problem then it will be extremely difficult to come up with something that works acceptably.
The task of extracting relationship from natural language is called 'relationship extraction' (funnily enough) and sometimes fact extraction. This is a pretty large field of research, this guy did a PhD thesis on it, as have many others. There are a large number of challenges here, as you've noticed, like entity detection, anaphora resolution, etc. This means that there will probably be a lot of 'noise' in the entities and relationships you extract.
As for representing facts that have been extracted in a knowledge base, most people tend not to use a probabilistic framework. At the simplest level, entities and relationships are stored as triples in a flat table. Another approach is to use an ontology to add structure and allow reasoning over the facts. This makes the knowledge base vastly more useful, but adds a lot of scalability issues. As for adding probabilities, I know of the Prowl project that is aimed at creating a probabilistic ontology, but it doesn't look very mature to me.
There is some research into probabilistic relational modelling, mostly into Markov Logic Networks at the University of Washington and Probabilstic Relational Models at Stanford and other places. I'm a little out of touch with the field, but this is is a difficult problem and it's all early-stage research as far as I know. There are a lot of issues, mostly around efficient and scalable inference.
All in all, it's a good idea and a very sensible thing to want to do. However, it's also very difficult to achieve. If you want to look at a slick example of the state of the art, (i.e. what is possible with a bunch of people and money) maybe check out PowerSet.
Interesting question, I've been doing some work on a strongly-typed NLP engine in C#: http://blog.abodit.com/2010/02/a-strongly-typed-natural-language-engine-c-nlp/ and have recently begun to connect it to an ontology store.
To me it looks like the issue here is really: How do you parse the natural language input to figure out that 'He' is the same thing as "the man"? By the time it's in the Semantic Network it's too late: you've lost the fact that statement 2 followed statement 1 and the ambiguity in statement 2 can be resolved using statement 1. Adding a third relation after the fact to say that "He" and "the man" are the same is another option but you still need to understand the sequence of those assertions.
Most NLP parsers seem to focus on parsing single sentences or large blocks of text but less frequently on handling conversations. In my own NLP engine there's a conversation history which allows one sentence to be understood in the context of all the sentences that came before it (and also the parsed, strongly-typed objects that they referred to). So the way I would handle this is to realize that "He" is ambiguous in the current sentence and then look back to try to figure out who the last male person was that was mentioned.
In the case of my home for example, it might tell you that you missed a call from a number that's not in its database. You can type "It was John Smith" and it can figure out that "It" means the call that was just mentioned to you. But if you typed "Tag it as Party Music" right after the call it would still resolve to the song that's currently playing because the house is looking back for something that is ITaggable.
I'm not exactly sure if this is what you want, but take a look at natural language generation wikipedia, the "reverse" of parsing, constructing derivations that conform to the given semantical constraints.