NLP parsing multiple questions contained in one single query - machine-learning

If a single query from the user contains multiple questions belonging to different categories, how can they be identified, split and parsed?
Eg -
User - what is the weather now and tell me my next meeting
Parser - {:weather => "what is the weather", :schedule => "tell me my next meeting"}
Parser identifies the parts of sentences where the question belongs to two different categories
User - show me hotels in san francisco for tomorrow that are less than $300 but not less than $200 are pet friendly have a gym and a pool with 3 or 4 stars staying for 2 nights and dont include anything that doesnt have wifi
Parser - {:hotels => ["show me hotels in san francisco",
"for tomorrow", "less than $300 but not less than $200",
"pet friendly have a gym and a pool",
"with 3 or 4 stars", "staying for 2 nights", "with wifi"]}
Here the parser identifies that the question belongs to only one category but contains additional steps for fine-tuning the answer, and it creates an array ordered according to the steps to take.
From what I understand, this requires a sentence segmenter, a multi-label classifier and co-reference resolution.
But the sentence segmenters I have come across depend heavily on grammar and punctuation.
Multi-label classifiers, such as a well-trained naive Bayes classifier, work in most cases, but because they are multi-label they often output multiple categories for sentences that clearly belong to one class. Relying solely on the output array to check which labels are present would fail.
A multi-class classifier is also useful for checking the array of probable categories, but it does not indicate which part of the sentence belongs to which category, much less how to proceed with the next step.
As a first step, how can I tune a sentence segmenter to correctly split the sentence without any strict grammar rules? Good accuracy here would help a lot in classification.

As a first step, how can I tune sentence segmenter to correctly split the sentence without any strict grammar rules.
Instead of doing this I'd suggest you use the parse-tree directly (either dependency parser, or constituency parse).
Here I'm showing the output of the dependency parse and you can see that the two segments are separated via a "CONJ" arrow:
(from here: http://deagol.cs.illinois.edu:8080/)
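To make the "split at the CONJ arrow" idea concrete, here is a minimal sketch. It assumes the dependency parse is already available as (index, word, head, label) tuples; the parse below is hand-written for illustration (the labels are hypothetical and will differ between parsers), and `split_on_conj` is a made-up helper name, not part of any library.

```python
# Each token: (index, word, head_index, dependency_label).
# Hand-written parse for illustration only; a real dependency parser would produce this.
PARSE = [
    (0, "what", 1, "attr"),
    (1, "is", 1, "ROOT"),
    (2, "the", 3, "det"),
    (3, "weather", 1, "nsubj"),
    (4, "now", 1, "advmod"),
    (5, "and", 1, "cc"),
    (6, "tell", 1, "conj"),
    (7, "me", 6, "dative"),
    (8, "my", 10, "poss"),
    (9, "next", 10, "amod"),
    (10, "meeting", 6, "dobj"),
]


def subtree(parse, root, blocked):
    """Collect token indices under `root`, never descending into `blocked` nodes."""
    children = {}
    for idx, _, head, _ in parse:
        if idx != head:  # the root points at itself
            children.setdefault(head, []).append(idx)
    seen, stack = [], [root]
    while stack:
        node = stack.pop()
        seen.append(node)
        stack.extend(c for c in children.get(node, []) if c not in blocked)
    return sorted(seen)


def split_on_conj(parse):
    """Return one text segment per verb that is conjoined to the root."""
    root = next(idx for idx, _, _, label in parse if label == "ROOT")
    conj_heads = [idx for idx, _, head, label in parse
                  if label == "conj" and head == root]
    conjunctions = {idx for idx, _, head, label in parse
                    if label == "cc" and head == root}
    blocked = set(conj_heads) | conjunctions
    words = {idx: word for idx, word, _, _ in parse}
    segments = []
    for start in [root] + conj_heads:
        indices = subtree(parse, start, blocked - {start})
        segments.append(" ".join(words[i] for i in indices))
    return segments


print(split_on_conj(PARSE))
```

Each segment can then be passed to the classifier on its own. The sketch only handles conjunctions attached directly to the root, so nested coordination would need more work.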
Another solution I'd try is ClausIE:
https://gate.d5.mpi-inf.mpg.de/ClausIEGate/ClausIEGate?inputtext=what+is+the+weather+now+and+tell+me+my+next+meeting++&processCcAllVerbs=true&processCcNonVerbs=true&type=true&go=Extract

If you want something for segmentation that doesn't depend heavily on grammar, then chunking comes to mind. The NLTK book has a section on it. The approach the authors take there depends only on part-of-speech tags.
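As a rough illustration of chunking from part-of-speech tags alone, the sketch below uses only the standard library rather than NLTK. The tagged sentence is hand-written, and the tag-to-letter mapping and the pattern (optional determiner, adjectives, then nouns) are my own assumptions, not the NLTK book's grammar.

```python
import re

# Hand-tagged input for illustration; a POS tagger would normally supply the tags.
TAGGED = [
    ("what", "WP"), ("is", "VBZ"), ("the", "DT"), ("weather", "NN"),
    ("now", "RB"), ("and", "CC"), ("tell", "VB"), ("me", "PRP"),
    ("my", "PRP$"), ("next", "JJ"), ("meeting", "NN"),
]

# Map each tag to exactly one character so a regex can scan the tag sequence
# and match positions line up with token positions.
TAG_CODE = {"DT": "D", "PRP$": "D", "JJ": "J", "NN": "N", "NNS": "N", "NNP": "N"}


def chunk_noun_phrases(tagged):
    """Group tokens into noun-phrase chunks using only their POS tags."""
    codes = "".join(TAG_CODE.get(tag, "O") for _, tag in tagged)
    chunks = []
    # Optional determiner, any number of adjectives, one or more nouns.
    for match in re.finditer(r"D?J*N+", codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks


print(chunk_noun_phrases(TAGGED))
```

The chunks ("the weather", "my next meeting") do not split the query by themselves, but they give stable units that a classifier or a rule can attach categories to.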
BTW, the 3rd edition of Jurafsky and Martin's Speech and Language Processing covers chunking in the parsing chapter, and it also contains chapters on information retrieval and chatbots.

Related

Find the most similar terms from a list of given terms in a huge text corpora [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I have a 2-million long list of names of Podcasts. Also, I have a huge text corpus scraped from a sub-Reddit (Posts, comments, threads etc.) where the podcasts from our list are being mentioned a lot by the users. The task I'm trying to solve is, I've to count the number of mentions by each name in our corpora. In other words, generate a dictionary of (name: count) pairs.
The challenge here is that most of these Podcast names are several words long, For eg: "Utah's Noon News"; "Congress Hears Tech Policy Debates" etc. However, the mentions which Reddit users make are often a crude substring of the original name, for eg: "Utah Noon/ Utah New" or "Congress Tech Debates/ Congress Hears Tech". This makes identifying names from the list quite difficult.
What I've Tried:
First, I processed and concatenated all the words in the original podcast names into a single word. For instance,
"Congress Hears Tech Policy Debates" -> "Congresshearstechpolicydebates"
As I traversed the subreddit corpus, whenever I found a named-entity or a potential podcast name, I processed its words like this,
"Congress Hears Tech" (assuming this is what I found in the corpora) -> "congresshearstech"
I compared this "congresshearstech" string to all the processed names in the podcast list, using scores calculated on word-spelling similarity. I did this using the difflib Python library; other similarity measures such as Levenshtein and Hamming distance also exist. Finally, I credited the mention to the podcast name with the highest similarity score to the string found in the corpus.
My problem:
The above strategy is in fact working accurately. However, it's far too slow to run on the entire corpus, and my list of names is very long. Can anyone suggest a faster algorithm or data structure for comparing so many names against such a huge corpus? Is there a deep-learning-based approach that would work here? For example, could I train an LSTM on the 2 million podcast names so that, whenever a possible name is encountered, the trained model outputs the closest spelling of a podcast from our list?
You may be able to use something like tf-idf and cosine similarity to solve this problem. I'm not familiar with any approach to use machine learning that would be helpful here.
This article gives a more detailed description of the process and links to some useful libraries. You should also read this article which describes a somewhat similar project to yours and includes information on improving performance. I'll describe the method as I understand it here.
tf-idf is an acronym meaning "term frequency inverse document frequency". Essentially, you look at a subset of text and find the frequency of the terms in your subset relative to the frequency of those terms in the entire corpus of text. Terms that are common in your subset and in the corpus as a whole will have a low value, whereas terms that are common in your subset but rare in the corpus would have a high value.
If you can compute the tf-idf for a "document" (or subset of text) you can turn a subset of text into a vector of tf-idf values. Once you have this vector you can use it to compute the cosine-similarity of your text subset with other subsets. Say, find the similarity of an excerpt from reddit with all of your titles. (There is a way to manage this so you aren't continuously checking each reddit excerpt against literally every title - see this post).
Once you can do this then I think the solution is to pick some value n, and scan through the reddit posts n words at a time doing the tf-idf / cosine similarity scan on your titles and marking matches when the cosine-similarity is higher than a certain value (you'll need to experiment with this to find what gives you a good result). Then, you decrement n and repeat until n is 0.
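Here is a minimal standard-library sketch of that scan, assuming a simple smoothed idf and a greedy "first window that passes the threshold wins" policy. The function names, the threshold of 0.6 and the tokenizer are illustrative choices of mine; a real pipeline would more likely use a vectorizer from a library and a sparse nearest-neighbour search.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(documents):
    """Smoothed inverse document frequency over the tokenized titles."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    counts = Counter(t for t in tokens if t in idf)  # ignore out-of-vocabulary words
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_mentions(text, titles, max_n=3, threshold=0.6):
    """Scan windows of n words (n = max_n .. 1) and count titles above the threshold."""
    docs = [tokenize(t) for t in titles]
    idf = build_idf(docs)
    title_vecs = [tfidf_vector(d, idf) for d in docs]
    tokens = tokenize(text)
    counts, used = Counter(), set()
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            span = range(i, i + n)
            if any(j in used for j in span):
                continue  # these words already belong to an earlier match
            vec = tfidf_vector(tokens[i:i + n], idf)
            score, best = max((cosine(vec, tv), k) for k, tv in enumerate(title_vecs))
            if score >= threshold:
                counts[titles[best]] += 1
                used.update(span)
    return counts


titles = ["Utah's Noon News", "Congress Hears Tech Policy Debates"]
text = "I listened to utah noon yesterday and then congress hears tech"
print(find_mentions(text, titles))
```

Note that this version still compares every window against every title, which is exactly the cost the linked post is about avoiding; it only shows the scoring logic.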
If exact text matching (with or without your whitespace removal preprocessing) is sufficient, consider the Aho-Corasick string matching algorithm for detecting substring matches (i.e. the podcast names) in a body of text (i.e. the subreddit content). There are many implementations of this algorithm for python, but ahocorapy has a good readme that summarizes how to use it on a dataset.
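To show what the algorithm does, here is a compact from-scratch sketch of Aho-Corasick using only the standard library. It is not the ahocorapy API; `build_automaton` and `count_matches` are illustrative names, and for real data a maintained implementation is probably the better choice.

```python
from collections import Counter, deque


def build_automaton(patterns):
    """Build the trie, failure links and output lists for a set of patterns."""
    goto, fail, output = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)
    # Breadth-first pass: a node's failure link is the longest proper suffix
    # of its path that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def count_matches(text, patterns):
    """Count every occurrence of every pattern in a single pass over the text."""
    goto, fail, output = build_automaton(patterns)
    counts, state = Counter(), 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        counts.update(output[state])
    return counts


names = ["utah noon", "noon news", "tech"]
print(count_matches("utah noon news and tech and more tech", names))
```

The text is scanned once regardless of how many names there are, which is the property that matters with 2 million names. Matches are plain substrings, so "tech" would also be found inside "technology"; adding word-boundary checks is left out of this sketch.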
If fuzzy matching is a requirement (also matching when the mention text of the podcast name is not an exact match), then consider a fuzzy string matching library like thefuzz (aka fuzzywuzzy) if per query-document operations offer sufficient performance. Another approach is to precompute n-grams from the substrings and accumulate the support counts across all n-grams for each document as the fuzzyset package does.
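The n-gram idea can be sketched with the standard library as follows: index every name by its character trigrams, use shared trigrams to shortlist a handful of candidates, and only run the expensive similarity (difflib here, since the question already uses it) on that shortlist. This is my own simplification, not how fuzzyset or thefuzz are implemented, and `build_index` / `best_match` are hypothetical names.

```python
import difflib
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def trigrams(text):
    s = normalize(text)
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(names):
    """Inverted index: trigram -> set of positions of names containing it."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        for gram in trigrams(name):
            index[gram].add(i)
    return index


def best_match(mention, names, index, top_k=5):
    """Shortlist names by shared trigrams, then rank the shortlist with difflib."""
    votes = Counter()
    for gram in trigrams(mention):
        for i in index.get(gram, ()):
            votes[i] += 1
    best_name, best_score = None, 0.0
    target = normalize(mention)
    for i, _ in votes.most_common(top_k):
        score = difflib.SequenceMatcher(None, target, normalize(names[i])).ratio()
        if score > best_score:
            best_name, best_score = names[i], score
    return best_name, best_score


names = ["Utah's Noon News", "Congress Hears Tech Policy Debates", "Tech News Weekly"]
index = build_index(names)
print(best_match("Congress Hears Tech", names, index))
```

The index is built once; each mention then touches only the names that share at least one trigram with it, instead of all 2 million.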
If additional information about the podcasts is available in a knowledge base (i.e. more than just the name is known), then the problem is more like the general NLP task of entity linking but to a custom knowledge base (i.e. the podcast list). This is an area of active research and state of the art methods are discussed on NLP Progress here.

Can Word2Vec be used for information extraction?

I am using Gensim to train Word2Vec. I know word similarities are determined by whether the words can replace each other and still make sense in a sentence. But can word similarities be used to extract relationships between entities?
Example:
I have a bunch of interview documents, and in each interview the interviewee always says the name of their manager. If I wanted to extract the name of the manager from these interview transcripts, could I just get a list of all human names in the document (using NLP) and assume that the name most similar to the word "manager" according to Word2Vec is most likely the manager?
Does this thought process make any sense with Word2Vec? If it doesn't, would the ML solution to this problem then be to input my word embeddings into a sequence to sequence model?
Yes, word-vector similarities & relative-arrangements can indicate relationships.
In the original Word2Vec paper, this was demonstrated by using word-vectors to solve word-analogies. The most famous example involves the analogy "'man' is to 'king' as 'woman' is to ?".
By starting with the word-vector for 'king', then subtracting the vector for 'man', and adding the vector for 'woman', you arrive at a new point in the coordinate system. And then, if you look for other words close to that new point, often the closest word will be queen. Essentially, the directions & distances have helped find a word that's related in a particular way – a gender-reversed equivalent.
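The arithmetic can be illustrated with toy vectors. The two-dimensional values below are invented purely so the example works out (roughly "royalty" and "gender" axes); real Word2Vec vectors have hundreds of dimensions and are learned from text. With a trained Gensim model, the equivalent query would typically go through `most_similar` with `positive` and `negative` word lists.

```python
import math

# Toy, hand-made vectors: (royalty, gender). Purely illustrative values.
VECTORS = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "prince": (0.7, 0.7),
    "apple": (-0.6, 0.0),
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))  # exclude the inputs
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("man", "king", "woman", VECTORS))
```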
And, in large news-based corpuses, famous names like 'Obama' or 'Bush' do wind up with vectors closer to their well-known job titles like 'president'. (There will be many contexts in such corpuses where the words appear immediately together – "President Obama today signed…" – or simply in similar roles – "The President appointed…" or "Obama appointed…", etc.)
However, I suspect that's less-likely to work with your 'manager' interview-transcripts example. Achieving meaningful word-to-word arrangements depends on lots of varied examples of the words in shared usage contexts. Strong vectors require large corpuses of millions to billions of words. So the transcripts with a single manager wouldn't likely be enough to get a good model – you'd need transcripts across many managers.
And in such a corpus each manager's name might not be strongly associated with just manager-like contexts. The same name(s) will be repeated when also mentioning other roles, and transcripts may not especially refer to managerial-action in helpful third-person ways that make specific name-vectors well-positioned. (That is, there won't be clean expository statements like, "John_Smith called a staff meeting", or "John_Smith cancelled the project", alongside others like "…manager John_Smith…" or "The manager cancelled the project".)

Named entity recognition (NER) features

I'm new to Named Entity Recognition and I'm having some trouble understanding what/how features are used for this task.
Some papers I've read so far mention features used, but don't really explain them, for example in
Introduction to the CoNLL-2003 Shared Task:Language-Independent Named Entity Recognition, the following features are mentioned:
Main features used by the the sixteen systems that participated in the
CoNLL-2003 shared task sorted by performance on the English test data.
Aff: affix information (n-grams); bag: bag of words; cas: global case
information; chu: chunk tags; doc: global document information; gaz:
gazetteers; lex: lexical features; ort: orthographic information; pat:
orthographic patterns (like Aa0); pos: part-of-speech tags; pre:
previously predicted NE tags; quo: flag signing that the word is
between quotes; tri: trigger words.
I'm a bit confused by some of these, however. For example:
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
how can POS tags exactly be used as features ? Don't we have a POS tag for each word? Isn't each object/instance a "text"?
what is global document information?
what is the feature trigger words?
I think all I need here is to just to look at an example table with each of these features as columns and see their values to understand how they really work, but so far I've failed to find an easy to read dataset.
Could someone please clarify or point me to some explanation or example of these features being used?
Here's a shot at some answers (and by the way the terminology on all this stuff is super overloaded).
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
In my experience BOW feature extraction is used to produce word features out of sentences. So IMO BOW is not one feature, it is a method of generating features out of a sentence (or a block of text you are using). Using n-grams can help with accounting for sequence, but BOW features amount to unordered bags of strings.
how can POS tags exactly be used as features ? Don't we have a POS tag for each word?
POS Tags are used as features because they can help with "word sense disambiguation" (at least on a theoretical level). For instance, the word "May" can be a name of a person or a month of a year or a poorly capitalized conjugated verb, but the POS tag can be the feature that differentiates that fact. And yes, you can get a POS tag for each word, but unless you explicitly use those tags in your "feature space" then the words themselves have no idea what they are in terms of their POS.
Isn't each object/instance a "text"?
If you mean what I think you mean, then this is true only if you have extracted object-instance "pairs" and stored them as features (an array of them derived from a string of tokens).
what is global document information?
I perceive this one to mean as such: Most NLP tasks function on a sentence. Global document information is data from all the surrounding text in the entire document. For instance, if you are trying to extract geographic placenames but disambiguate them, and you find the word Paris, which one is it? Well if France is mentioned 5 sentences above, that could increase the likelihood of it being Paris France rather than Paris Texas or worst case, the person Paris Hilton. It's also really important in what is called "coreference resolution", which is when you correlate a name to a pronoun reference (mapping a name mention to "he" or "she" etc).
what is the feature trigger words?
Trigger words are specific tokens or sequences that have high reliability as a stand alone thing to have a specific meaning. For instance, in sentiment analysis, curse words with exclamation marks often indicate negativity. There can be many permutations of this.
Anyway, my answers here are not perfect, and are prone to all manner of problems in human epistemology and inter-subjectivity, but this is the way I've been thinking about these things over the years I've been trying to solve problems with NLP.
Hopefully someone else will chime in, especially if I'm way off.
You should probably keep in mind that NER classifies each word/token separately, using features that are internal or external clues. Internal clues take into account the word itself (morphology such as uppercase letters, whether the token is present in a dedicated lexicon, POS), and external ones rely on contextual information (previous and next word, document features).
isn't bag of words supposed to be a method to generate features (one
for each word)? How can BOW itself be a feature? Or does this simply
mean we have a feature for each word as in BOW, besides all the other
features mentioned?
Yes, BOW generates one feature per word, sometimes with feature selection methods that reduce the number of features taken into account (e.g. a minimum word frequency).
how can a gazetteer be a feature?
A gazetteer may also generate one feature per word, but in most cases it enriches the data by labelling words or multi-word expressions (such as full proper names). It is an ambiguous step: "George Washington" will lead to two features: the entire "George Washington" as a celebrity and "Washington" as a city.
how can POS tags exactly be used as features ? Don't we have a POS tag
for each word? Isn't each object/instance a "text"?
For classifiers, each instance is a word. This is why sequence labelling methods (e.g. CRF) are used: they make it possible to use the previous and next words as additional contextual features to classify the current word. Labelling a text is done as a process relying on the most likely NE types for each word in the sequence.
what is global document information?
This could be metadata (e.g. date, author), topics (full text categorization), coreference, etc.
what is the feature trigger words?
Triggers are external clues, contextual patterns that help disambiguation. For instance, "Mr" will be used as a feature that strongly suggests that the following tokens are a person.
I recently implemented a NER system in python and I found the following features helpful:
character-level ngrams (using CountVectorizer)
previous word features and labels (i.e. context)
viterbi or beam-search on label sequence probability
part of speech (pos), word-length, word-count, is_capitalized, is_stopword
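Since the question asks for an example table, here is a small sketch of what a per-token feature row might look like. Each token is one instance and each dictionary key is one column. The feature names, the tiny gazetteer and the trigger list are hypothetical, and the POS tags are hand-written rather than produced by a tagger.

```python
import re


def word_shape(token):
    """Orthographic pattern: 'Paris' -> 'Aaaaa', '2003' -> '0000'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "0", shape)


def token_features(tokens, pos_tags, i, gazetteer, triggers):
    """Build the feature row for token i; one row per token, one key per column."""
    token = tokens[i]
    prev_token = tokens[i - 1] if i > 0 else "<START>"
    return {
        "word": token.lower(),                      # lexical / BOW-style feature
        "suffix3": token[-3:].lower(),              # affix information
        "shape": word_shape(token),                 # orthographic pattern
        "is_capitalized": token[:1].isupper(),      # case information
        "pos": pos_tags[i],                         # part-of-speech tag
        "in_gazetteer": token in gazetteer,         # gazetteer lookup
        "prev_word": prev_token.lower(),            # context (external clue)
        "prev_is_trigger": prev_token in triggers,  # trigger word before the token
    }


tokens = ["Mr", "Smith", "visited", "Paris", "in", "2003"]
pos_tags = ["NNP", "NNP", "VBD", "NNP", "IN", "CD"]  # hand-written tags
gazetteer = {"Paris", "London"}                       # hypothetical place list
triggers = {"Mr", "Mrs", "Dr"}                        # hypothetical trigger words

for i in range(len(tokens)):
    print(token_features(tokens, pos_tags, i, gazetteer, triggers))
```

A sequence model such as a CRF would consume one such dictionary per token, along with the previously predicted labels.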

Does an algorithm exist to identify different queries/questions in sentence?

I want to identify the different queries in a sentence.
For example, "Who is Bill Gates and where he was born?" or "Who is Bill Gates, where he was born?" contains two queries:
Who is Bill Gates?
Where Bill Gates was born
I have worked on coreference resolution, so I can identify that "he" points to Bill Gates, and the resolved sentence is "Who is Bill Gates, where Bill Gates was born".
Likewise:
MGandhi is good guys, Where he was born?
single query
who is MGandhi and where was he born?
2 queries
who is MGandhi, where he was born and died?
3 queries
India won world cup against Australia, when?
1 query (when India won WC against Auz)
I can perform coreference resolution, but I don't see how to distinguish the individual queries.
How can I do this?
I checked various sentence parsers, but since this is pure NLP, a sentence parser does not identify them.
I searched for "sentence disambiguation", by analogy with "word sense disambiguation", but nothing like that seems to exist.
Any help or suggestions would be much appreciated.
Natural language is full of exceptions. Especially in English, it is often said that there are more exceptions than rules. So, it is almost impossible to get a completely accurate solution that works every single time, but using a parser, you can achieve reasonably good performance.
I like to use the Berkeley parser for such tasks. Their online demo includes a graphical representation of the parse tree, which is extremely helpful when trying to formulate heuristics.
For example, consider the question "Who is Bill Gates and where was he born?". The parse tree looks like this:
Clearly, you can split the tree at the central conjunction (CC) node to extract the individual queries. In general, this will be easy if the parsed sentence is simple (where there will be only one query) or compound (where the individual queries can be split by looking at conjunction nodes, as above).
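As a rough sketch of that heuristic, the code below splits a constituency tree at CC nodes directly under the root. The tree is written by hand as nested tuples and only approximates what a parser such as the Berkeley parser would return; the exact labels and bracketing are assumptions.

```python
# Tree format: (label, child, child, ...) where a child is a tuple or a word.
# Hand-written approximation of a parse, for illustration only.
TREE = (
    "SBARQ",
    ("SBARQ", ("WHNP", "Who"), ("SQ", "is", ("NP", "Bill", "Gates"))),
    ("CC", "and"),
    ("SBARQ", ("WHADVP", "where"), ("SQ", "was", ("NP", "he"), ("VP", "born"))),
)


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def split_at_cc(tree):
    """Split the sentence at CC nodes that are direct children of the root."""
    segments, current = [], []
    for child in tree[1:]:
        if not isinstance(child, str) and child[0] == "CC":
            if current:
                segments.append(" ".join(current))
            current = []
        else:
            current.extend(leaves(child))
    if current:
        segments.append(" ".join(current))
    return segments


print(split_at_cc(TREE))
```

After coreference resolution, "he" in the second segment can be replaced with "Bill Gates" to make each query self-contained.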
Another more complex example in your question has three queries, such as "Who is Gandhi and where did he work and live?". The parse tree:
Again, you can see the conjunction node which splits "Who is Gandhi" and "Where did he work and live". The parse does not, however, split up the second query into two, as you would ideally want. And that brings us to the hardest part of what you are trying to do: dealing with (computationally, of course) what is known as right-node raising. This is a linguistic construct where common parts get shared.
For example, consider the question "When and how did he suffer a setback?". What it really asks is (a) when did he suffer a setback?, and (b) how did he suffer a setback? Right-node raising issues cannot be solved by just parse trees. It is, in fact, one of the harder problems in computational linguistics, and belongs to the domain of hardcore academic research.

How to use Bayesian analysis to compute and combine weights for multiple rules to identify books

I am experimenting with machine learning in general, and Bayesian analysis in particular, by writing a tool to help me identify my collection of e-books. The input data consist of a set of e-book files, whose names and in some cases contents contain hints as to the book they correspond to.
Some are obvious to the human reader, like:
Artificial Intelligence - A Modern Approach 3rd.pdf
Microsoft Press - SharePoint Foundation 2010 Inside Out.pdf
The Complete Guide to PC Repair 5th Ed [2011].pdf
Hamlet.txt
Others are not so obvious:
Vsphere5.prc (Actually 'Mastering VSphere 5' by Scott Lowe)
as.ar.pdf (Actually 'Atlas Shrugged' by Ayn Rand)
Rather than try to code various parsers for different formats of file names, I thought I would build a few dozen simple rules, each with a score.
For example, one rule would look in the first few pages of the file for something resembling an ISBN number, and if found would propose a hypothesis that the file corresponds to the book identified by that ISBN number.
Another rule would look to see if the file name is in 'Author - Title' format and, if so, would propose a hypothesis that the author is 'Author' and the title is 'Title'. Similar rules for other formats.
I thought I could also get a list of book titles and authors from Amazon or an ISBN database, and search the file name and first few pages of the file for any of these; any matches found would result in a hypothesis being suggested by that rule.
In the end I would have a set of tuples like this:
[rulename,hypothesis]
I expect that some rules, such as the ISBN match, will have a high probability of being correct, when they are available. Other rules, like matches based on known book titles and authors, would be more common but not as accurate.
My questions are:
Is this a good approach for solving this problem?
If so, is Bayesian analysis a good candidate for combining all of these rules' hypotheses into compound score to help determine which hypothesis is the strongest, or most likely?
Is there a better way to solve this problem, or some research paper or book which you can suggest I turn to for more information?
It depends on the size of your collection and the time you want to spend training the classifier. It will be difficult to get good generalization that will save you time. For any type of classifier you will have to create a large training set, and also find a lot of rules before you get good accuracy. It will probably be more efficient (fewer false positives) to create the rules and use them only to suggest title alternatives for you to choose from, and not to implement the classifier. But, if the purpose is learning, then go ahead.
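If the purpose is learning, one simple way to combine the [rulename, hypothesis] tuples is a naive Bayes style score. The sketch below assumes the rules are independent, the prior over candidate books is uniform, and each rule has a known reliability (the probability that its proposal is correct). The reliability numbers are made up; in practice they would be estimated from a hand-labelled sample of files.

```python
import math


def combine_rules(proposals, reliability):
    """Return a posterior over hypotheses given (rule, hypothesis) proposals.

    Assumes independent rules and a uniform prior. A rule with reliability p
    contributes p to the hypothesis it proposed and (1 - p) to every other one.
    Reliabilities must be strictly between 0 and 1.
    """
    candidates = sorted({hyp for _, hyp in proposals})
    log_scores = {}
    for cand in candidates:
        total = 0.0  # log of a uniform prior, dropped as a constant
        for rule, hyp in proposals:
            p = reliability[rule]
            total += math.log(p if hyp == cand else 1.0 - p)
        log_scores[cand] = total
    norm = sum(math.exp(score) for score in log_scores.values())
    return {cand: math.exp(score) / norm for cand, score in log_scores.items()}


# Hypothetical reliabilities and proposals for a file named as.ar.pdf.
reliability = {"isbn": 0.95, "filename": 0.6, "title_match": 0.7}
proposals = [
    ("isbn", "Atlas Shrugged"),
    ("filename", "Atlas of the World"),
    ("title_match", "Atlas Shrugged"),
]
posterior = combine_rules(proposals, reliability)
print(max(posterior, key=posterior.get), posterior)
```

The independence assumption is rarely true (a filename rule and a title-match rule often look at the same text), so the resulting probabilities are better used for ranking suggestions than as calibrated confidence values.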
