I want to identify the different queries contained in a sentence.
For example, "Who is Bill Gates and where was he born?" or "Who is Bill Gates, where was he born?" contains two queries:
Who is Bill Gates?
Where was Bill Gates born?
I have already worked on coreference resolution, so I can identify that "he" refers to Bill Gates; the resolved sentence is "Who is Bill Gates, where was Bill Gates born?"
Likewise:
MGandhi is a good guy. Where was he born?
single query
Who is MGandhi and where was he born?
2 queries
Who is MGandhi, where was he born and where did he die?
3 queries
India won the World Cup against Australia, when?
1 query (when did India win the World Cup against Australia?)
I can perform coreference resolution, but I cannot figure out how to distinguish the individual queries.
How can I do this?
I have checked various sentence parsers, but since this is a pure NLP problem, a sentence parser alone does not identify the queries.
I tried to find "sentence disambiguation" by analogy with "word sense disambiguation", but nothing like that seems to exist.
Any help or suggestion would be much appreciated.
Natural language is full of exceptions. Especially in English, it is often said that there are more exceptions than rules. So, it is almost impossible to get a completely accurate solution that works every single time, but using a parser, you can achieve reasonably good performance.
I like to use the Berkeley parser for such tasks. Their online demo includes a graphical representation of the parse tree, which is extremely helpful when trying to formulate heuristics.
For example, consider the question "Who is Bill Gates and where was he born?". The parse tree looks like this:
Clearly, you can split the tree at the central conjunction (CC) node to extract the individual queries. In general, this will be easy if the parsed sentence is simple (where there will be only one query) or compound (where the individual queries can be split by looking at conjunction nodes, as above).
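Since the tree is just a nested structure, the split itself is mechanical. Below is a minimal sketch using NLTK's Tree class; the bracketed parse is hand-written here for illustration, but in practice it would come from the Berkeley parser's output:

from nltk import Tree

# Hand-written bracketed parse for illustration; in practice this
# string would come from the Berkeley parser.
parse = Tree.fromstring(
    "(SBARQ"
    " (SBARQ (WHNP (WP Who)) (SQ (VBZ is) (NP (NNP Bill) (NNP Gates))))"
    " (CC and)"
    " (SBARQ (WHADVP (WRB where)) (SQ (VBD was) (NP (PRP he)) (VP (VBN born)))))"
)

# Walk the top level of the tree and start a new query at each CC node.
queries, current = [], []
for child in parse:
    if isinstance(child, Tree) and child.label() == "CC":
        queries.append(current)
        current = []
    else:
        current.append(child)
queries.append(current)

for query in queries:
    print(" ".join(word for subtree in query for word in subtree.leaves()))
# -> Who is Bill Gates
# -> where was he born

Because each top-level CC starts a new query, the same loop handles more than two conjuncts.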
A more complex example from your question has three queries: "Who is Gandhi and where did he work and live?". The parse tree:
Again, you can see the conjunction node that splits "Who is Gandhi" from "where did he work and live". The parse does not, however, split the second query into two, as you would ideally want. And that brings us to the hardest part of what you are trying to do: dealing (computationally, of course) with what is known as right node raising. This is a linguistic construct where common parts get shared.
For example, consider the question "When and how did he suffer a setback?". What it really asks is (a) when did he suffer a setback?, and (b) how did he suffer a setback? Right-node raising issues cannot be solved by just parse trees. It is, in fact, one of the harder problems in computational linguistics, and belongs to the domain of hardcore academic research.
Related
I have a noob question, so go easy on me; I'll probably get the terminology wrong. I'm hoping someone can give me the "here's what to google next" explanation of how to approach creating a CoreML model that can identify tokens within spans. Since my question falls between the hello-world examples and the academic papers that cover these topics in detail, it has been hard to google for.
I'm taking my first stab at natural language processing, specifically parsing data out of recipe ingredients. CreateML supports word tagging, which I interpret to mean named entity recognition: split a string into tokens (probably words), annotate them, and feed them to the model.
"1 tablespoon (0.5 oz / 14 g) baking soda"
This scenario immediately breaks my understanding of word tagging. Tokenized by words, this string includes three measurements. However, it is really one measurement plus a clarification that contains two alternate measurements. What I really want is to label "(0.5 oz / 14 g)" as a clarification which itself contains measurements.
Or how about "Olive oil"? If I were tokenizing by words, I'd probably get two tokens labeled "ingredient", which I'd interpret to mean I have two ingredients, but they go together as one.
I've been looking at https://prodi.gy/, which does span categorization and seemingly handles this scenario: tokenize, then name the entities, then categorize them into spans. However, as far as I understand it, spans are an entirely different paradigm that wouldn't carry over to CoreML.
My naive guess for how you'd do this in CoreML is to use multiple models, or something that works recursively: one pass would tag "(0.5 oz / 14 g)" as a single token labeled "clarification", and the next pass would tokenize it into words. However, this smells like a bad idea.
So, how does one solve this problem with CoreML? Code is fine, if relevant, but I'm really just asking about how to think about the problem so I can continue my research.
Thanks for your help!
I need some advice on the following problem.
I'm given a set of keywords weighted by percentage, and I need to find the text in a database that best matches those keywords. Here is an example.
I'm presented with these keywords:
Sun (90%)
National Park (85%; note that some keywords contain two words)
Landmark (60%)
Now let's say my database contains three text entries, e.g.:
Going-to-the-Sun Road is a scenic mountain road in the Rocky Mountains of the western United States, in Glacier National Park in Montana.
Everybody has a little bit of the sun and moon in them. Everybody has a little bit of man, woman, and animal in them.
A hybrid car is one that uses more than one means of propulsion - that means combining a petrol or diesel engine with an electric motor.
Obviously the first text best describes the given set of keywords, so this is what I want to recommend to the user. The second text relates somewhat to the "sun" keyword and could be an acceptable choice too.
The third text is totally irrelevant and should only be recommended as a last resort, when everything else fails.
I'm totally new to this kind of problem, so I need some advice on which technologies/algorithms to use. It seems like some machine learning (NLP) or some kind of fuzzy logic is involved; I'm not really sure.
You need to use a combination of query-term boosting and synonyms.
Also look into: Is there a way to do fuzzy string matching for words on string?
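As a rough sketch of the boosting idea (TF-IDF plus cosine similarity via scikit-learn; the repeat-the-keyword trick below is a crude stand-in for real query-time boosting, e.g. Lucene's sun^0.9 syntax):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Going-to-the-Sun Road is a scenic mountain road in the Rocky "
    "Mountains of the western United States, in Glacier National Park "
    "in Montana.",
    "Everybody has a little bit of the sun and moon in them.",
    "A hybrid car is one that uses more than one means of propulsion.",
]

# ngram_range=(1, 2) lets two-word keywords like "national park" match.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_vectors = vectorizer.fit_transform(documents)

# Crude boosting: repeat each keyword proportionally to its weight.
weights = {"sun": 0.90, "national park": 0.85, "landmark": 0.60}
query = " ".join(kw for kw, w in weights.items() for _ in range(int(w * 10)))
query_vector = vectorizer.transform([query])

# Rank the texts by cosine similarity to the boosted query.
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(documents, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {doc[:50]}")

A real search engine (Lucene/Solr/Elasticsearch) gives you boosting and synonym filters out of the box, which is the more robust route.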
My Motivations
I'm trying to learn German and realized there is a confounding fact about its structure: every noun has a grammatical gender, which in many cases seems unrelated to the noun itself.
Unlike in languages such as English, each noun takes a different definite article depending on its gender: der (masculine), die (feminine), and das (neuter). For example:
das Mädchen ("the girl"), der Rock ("the skirt"), die Hose ("the trousers/pants"). So there seems to be no correlation between the gender assignment of nouns and their meanings.
The Data
I gathered roughly 5,000 German words with three columns (das, der, die) containing 1s and 0s for each word. So my data is already clustered via one-hot encoding, and I'm not trying to predict anything.
Why I'm here
I am clueless about where to start and how to approach this problem, as the concept of distance in clustering doesn't make sense to me in this setting. I can't think of a way to generate an understandable description of these clusters, and the mixed data makes it hard to come up with hard-coded evaluation metrics.
So, my question is:
I want to find patterns, characteristics of these words, that made them fall into a specific cluster. I don't know if I'm making any sense, but some people have already found such patterns manually (for example, certain word endings, or the tendency of elongated objects to be masculine), and I believe ML/AI could do a far better job at this. Would it be possible for me to do something like this?
Some personal thoughts
While I was doing some (perhaps naive) research, I realized that the potential options are decision trees and the COBWEB algorithm; a sketch of the decision-tree idea follows below. I was also thinking I could scrape a few images (say five) for every word, run some image classification, and inspect the intermediate layers of the network to see whether any specific shapes correlate with a specific gender. In addition, I was wondering whether scraping Google Ngram Viewer data for these words could help in any way. I couldn't think of a way to use NLP or its subdomains.
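To make the decision-tree idea concrete, here is a minimal sketch: encode each word's final characters as features, fit a tree, and read the learned rules back out. The six-word list is only illustrative; the real input would be my 5,000 words.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

words = [("Mädchen", "das"), ("Rock", "der"), ("Hose", "die"),
         ("Blume", "die"), ("Lehrer", "der"), ("Haus", "das")]

def suffix_features(word):
    # The last one to three characters often carry the signal
    # (e.g. -chen -> das, -e -> die).
    return {f"suffix{n}": word[-n:].lower() for n in (1, 2, 3)}

vec = DictVectorizer()
X = vec.fit_transform([suffix_features(w) for w, _ in words])
y = [gender for _, gender in words]
clf = DecisionTreeClassifier(max_depth=5).fit(X, y)

# The exported rules are human-readable, which is exactly what
# I would want for building mnemonics.
print(export_text(clf, feature_names=vec.get_feature_names_out().tolist()))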
Alternatives
If everything I just wrote sounds nonsensical, please suggest a way to make visual representations of my dataframe in Python (something like nodes and paths with images at the nodes, one per cluster) so that I can make pictorial mind maps and try to learn them by heart.
The ultimate purpose is to make learning German simpler for myself, and possibly for others.
If a single query from the user contains multiple questions belonging to different categories, how can they be identified, split, and parsed?
E.g.:
User - what is the weather now and tell me my next meeting
Parser - {:weather => "what is the weather", :schedule => "tell me my next meeting"}
The parser identifies that the parts of the sentence belong to two different question categories.
User - show me hotels in san francisco for tomorrow that are less than $300 but not less than $200 are pet friendly have a gym and a pool with 3 or 4 stars staying for 2 nights and dont include anything that doesnt have wifi
Parser - {:hotels => ["show me hotels in san francisco",
"for tomorrow", "less than $300 but not less than $200",
"pet friendly have a gym and a pool",
"with 3 or 4 stars", "staying for 2 nights", "with wifi"]}
The parser identifies that the question belongs to only one category but has additional constraints for refining the answer, and it creates an array ordered according to the steps to take.
From what I understand, this requires a sentence segmenter, a multi-label classifier, and coreference resolution.
But the sentence segmenters I have come across depend heavily on grammar and punctuation.
Multi-label classifiers, such as a well-trained naive Bayes classifier, work in most cases, but being multi-label, they often output multiple categories for sentences that clearly belong to a single class. Relying solely on the output array to check which labels are present would fail.
A multi-class classifier is also good for checking the output array of probable categories, but it obviously cannot tell apart the different parts of the sentence very accurately, much less indicate in what order to proceed with the next step.
As a first step, how can I tune a sentence segmenter to correctly split the sentence without relying on strict grammar rules? Good accuracy here would help a lot with classification.
"As a first step, how can I tune a sentence segmenter to correctly split the sentence without relying on strict grammar rules?"
Instead of doing this, I'd suggest you use the parse tree directly (either a dependency parse or a constituency parse).
Here is the output of the dependency parse; you can see that the two segments are separated by a "conj" arc:
(from here: http://deagol.cs.illinois.edu:8080/)
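If you want to reproduce that split programmatically, here is a minimal sketch with spaCy (a different parser from the demo linked above, so attachments may vary slightly; it assumes the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("what is the weather now and tell me my next meeting")

# The second command hangs off the first via a "conj" dependency.
root = [tok for tok in doc if tok.dep_ == "ROOT"][0]
conjuncts = [tok for tok in root.children if tok.dep_ == "conj"]

# Each conjunct's subtree is one segment; everything left over
# (minus the coordinating "and") forms the first segment.
conj_tokens = {t for c in conjuncts for t in c.subtree}
first = " ".join(t.text for t in doc if t not in conj_tokens and t.dep_ != "cc")
segments = [first] + [" ".join(t.text for t in c.subtree) for c in conjuncts]
print(segments)  # e.g. ['what is the weather now', 'tell me my next meeting']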
Another solution I'd try is ClausIE:
https://gate.d5.mpi-inf.mpg.de/ClausIEGate/ClausIEGate?inputtext=what+is+the+weather+now+and+tell+me+my+next+meeting++&processCcAllVerbs=true&processCcNonVerbs=true&type=true&go=Extract
If you want a segmentation approach that doesn't depend heavily on grammar, then chunking comes to mind. There is a section on it in the NLTK book. The approach the authors take there depends only on part-of-speech tags.
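As a minimal sketch of that idea (a toy grammar of my own, not the book's exact rules):

import nltk

# Requires NLTK's "punkt" and "averaged_perceptron_tagger" data.
sentence = "what is the weather now and tell me my next meeting"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# A chunk grammar defined purely over POS tags -- no punctuation needed.
grammar = r"""
  NP: {<DT|PRP\$>?<JJ>*<NN.*>+}   # optional determiner, adjectives, nouns
  VP: {<VB.*><NP>*}               # verb followed by noun phrases
"""
chunker = nltk.RegexpParser(grammar)
chunker.parse(tagged).pretty_print()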
By the way, Jurafsky and Martin's 3rd edition of Speech and Language Processing covers chunking in the parsing chapter, and it also contains chapters on information retrieval and chatbots.
I've studied some simple semantic network implementations and basic techniques for parsing natural language. However, I haven't seen many projects that try to bridge the gap between the two.
For example, consider the dialog:
"the man has a hat"
"he has a coat"
"what does he have?" => "a hat and coat"
A simple semantic network, based on the grammar tree parsing of the above sentences, might look like:
class Entity:
    def __init__(self, name):
        self.name, self.rels = name, {}   # relation -> list of objects
    def relations(self, rel):
        return [obj.name for obj in self.rels.get(rel, [])]

def Relation(subj, rel, obj):
    subj.rels.setdefault(rel, []).append(obj)

the_man, has = Entity('the man'), Entity('has')
a_hat, a_coat = Entity('a hat'), Entity('a coat')
Relation(the_man, has, a_hat)
Relation(the_man, has, a_coat)
print(the_man.relations(has))   # => ['a hat', 'a coat']
However, this implementation assumes the prior knowledge that the text segments "the man" and "he" refer to the same network entity.
How would you design a system that "learns" these relationships between segments of a semantic network? I'm used to thinking about ML/NL problems based on creating a simple training set of attribute/value pairs, and feeding it to a classification or regression algorithm, but I'm having trouble formulating this problem that way.
Ultimately, it seems I would need to overlay probabilities on top of the semantic network, but that would drastically complicate an implementation. Is there any prior art along these lines? I've looked at a few libraries, like NLTK and OpenNLP, and while they have decent tools to handle symbolic logic and parse natural language, neither seems to have any kind of probabilistic framework for converting one to the other.
There is quite a lot of history behind this kind of task. Your best start is probably by looking at Question Answering.
The general advice I always give is that if you have some highly restricted domain where you know about all the things that might be mentioned and all the ways they interact then you can probably be quite successful. If this is more of an 'open-world' problem then it will be extremely difficult to come up with something that works acceptably.
The task of extracting relationships from natural language is called 'relationship extraction' (funnily enough), and sometimes fact extraction. This is a pretty large field of research; this guy did a PhD thesis on it, as have many others. There are a large number of challenges here, as you've noticed, like entity detection, anaphora resolution, etc. This means that there will probably be a lot of 'noise' in the entities and relationships you extract.
As for representing facts that have been extracted in a knowledge base, most people tend not to use a probabilistic framework. At the simplest level, entities and relationships are stored as triples in a flat table. Another approach is to use an ontology to add structure and allow reasoning over the facts. This makes the knowledge base vastly more useful, but adds a lot of scalability issues. As for adding probabilities, I know of the Prowl project that is aimed at creating a probabilistic ontology, but it doesn't look very mature to me.
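To make the flat-table option concrete, a toy sketch (my own illustration, not a real triple store):

# Triples stored as (subject, relation, object) rows; None acts as a
# wildcard, like a variable in a SPARQL pattern.
triples = set()

def add(subj, rel, obj):
    triples.add((subj, rel, obj))

def query(subj=None, rel=None, obj=None):
    return [t for t in triples
            if (subj is None or t[0] == subj)
            and (rel is None or t[1] == rel)
            and (obj is None or t[2] == obj)]

add("the man", "has", "a hat")
add("the man", "has", "a coat")
print(query(subj="the man", rel="has"))

An ontology adds structure on top of exactly this kind of table, at the cost of the scalability issues mentioned above.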
There is some research into probabilistic relational modelling, mostly into Markov Logic Networks at the University of Washington and Probabilistic Relational Models at Stanford and other places. I'm a little out of touch with the field, but this is a difficult problem and it's all early-stage research as far as I know. There are a lot of issues, mostly around efficient and scalable inference.
All in all, it's a good idea and a very sensible thing to want to do. However, it's also very difficult to achieve. If you want to look at a slick example of the state of the art (i.e., what is possible with a bunch of people and money), maybe check out PowerSet.
Interesting question. I've been doing some work on a strongly-typed NLP engine in C# (http://blog.abodit.com/2010/02/a-strongly-typed-natural-language-engine-c-nlp/) and have recently begun to connect it to an ontology store.
To me it looks like the issue here is really: how do you parse the natural language input to figure out that "He" is the same thing as "the man"? By the time it's in the semantic network, it's too late: you've lost the fact that statement 2 followed statement 1, and the ambiguity in statement 2 could have been resolved using statement 1. Adding a third relation after the fact to say that "He" and "the man" are the same is another option, but you still need to understand the sequence of those assertions.
Most NLP parsers seem to focus on parsing single sentences or large blocks of text but less frequently on handling conversations. In my own NLP engine there's a conversation history which allows one sentence to be understood in the context of all the sentences that came before it (and also the parsed, strongly-typed objects that they referred to). So the way I would handle this is to realize that "He" is ambiguous in the current sentence and then look back to try to figure out who the last male person was that was mentioned.
In the case of my home, for example, it might tell you that you missed a call from a number that's not in its database. You can type "It was John Smith" and it can figure out that "It" means the call that was just mentioned. But if you typed "Tag it as Party Music" right after the call, it would still resolve "it" to the song that's currently playing, because the house looks back for something that is ITaggable.
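As a minimal sketch of that look-back idea (plain Python, my own simplification, not the C# engine above):

from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    kind: str          # e.g. "person", "call", "song"
    gender: str = ""

# Conversation history, oldest first.
history = [Mention("the man", "person", "male"), Mention("a hat", "thing")]

def resolve(pronoun):
    # Walk the history backwards for the most recent compatible mention.
    if pronoun == "he":
        for m in reversed(history):
            if m.kind == "person" and m.gender == "male":
                return m
    return None

print(resolve("he").text)  # -> the man

The interesting part is the compatibility test: "he" needs a male person, "it" in the tagging example needs something taggable, and so on.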
I'm not exactly sure if this is what you want, but take a look at natural language generation on Wikipedia, the "reverse" of parsing: constructing derivations that conform to the given semantic constraints.