I want to build an ML program that talks to the user and collects some input from them.
The program should analyze the input data (keywords) and then predict the best solution.
So, you are looking at an AI application which needs some sort of machine intelligence for processing natural language.
Let us say the language of choice here is English. There are many things to be considered before building such a system.
Dependency parsing
Word Sense Disambiguation
Verb Sense Disambiguation
Coreference Resolution
Semantic Role Labelling
Universe of knowledge.
In brief you need to build all the above essential modules before you can generate your response.
You need to decide what kind of problem you are working on: is it an open-domain or a closed-domain problem? In other words, what is the scope of knowledge of this application?
For example, Google Now is an open-domain problem: it can take practically any input.
Other applications pertain to a particular task, such as automating food orders in an app, where the scope of the questions that can be asked is limited.
Once that is decided, you need to parse your input sentence, and dependency parsing is the way to go. You can use the Stanford CoreNLP suite for most of the NLP tasks mentioned above.
Once the input sentence is parsed and you have the subjects, objects, etc., it is time to disambiguate the words in the sentence, since a particular word can have different meanings.
Then disambiguate the verb sense by identifying which meaning of the verb is intended (for example, "return" could mean going back to a place or giving something back).
Then you need to perform coreference resolution, which means linking the nouns, pronouns and other mentions that refer to the same entity in a given context. For example:
My name is John. I work at ABC company.
Here "I" in the second sentence refers to John.
This helps in answering questions like "Where does John work?". Since John is named only in the first sentence and his workplace is mentioned only in the second, coreference resolution is what lets us map the two together.
The next task at hand is semantic role labelling, which basically means labelling all the arguments in a sentence with respect to each of its verbs.
For example: John killed Mary.
Here the verb is kill, and John and Mary are its arguments. John takes the role A0 and Mary the role A1. The definitions of these roles for each verb come from a large frame and argument annotation resource created by the NLP community, such as PropBank. Here A0 means the person who killed and A1 means the person who was killed.
Now, once you have identified A0 and A1, just look into the definition of the kill frame and return A0 for the killer and A1 for the victim.
Another important task is to identify when your system must respond with an answer. For that, you need to know whether the given sentence is declarative (assertive) or interrogative. A simple check is whether the input sentence ends with a question mark.
Now to answer your question:
Let us say your input to the application is:
Input 1: John killed Mary.
Clearly this is an assertive sentence, so just store it and process it as described above.
Now the next input is:
Input 2: Who killed Mary?
This is an interrogative sentence, so you need to come up with a response.
Now find the semantic role labels of input 1 and input 2, and return the word of input 1 whose role matches the role of "Who" in input 2.
In this case "Who" is labeled A0 and John is labeled A0, so simply return John.
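The store-then-match idea above can be sketched in a few lines of Python. This is only an illustration: the role labels are written by hand as stand-ins for what a semantic role labeller would produce, and the names Frame, Memory, tell and ask are hypothetical, not part of any library.

```python
from dataclasses import dataclass

WH_WORDS = {"who", "whom", "what"}


@dataclass
class Frame:
    """One verb with its labelled arguments, e.g. kill(A0=John, A1=Mary)."""
    verb: str    # lemma of the predicate
    roles: dict  # role label -> argument text


def is_question(sentence):
    # The simple check described above: interrogatives end with "?".
    return sentence.strip().endswith("?")


class Memory:
    """Stores frames from assertive sentences and answers wh-questions."""

    def __init__(self):
        self.facts = []

    def tell(self, frame):
        self.facts.append(frame)

    def ask(self, query):
        # Find the role occupied by the wh-word in the question.
        wanted = [r for r, arg in query.roles.items() if arg.lower() in WH_WORDS]
        if not wanted:
            return None
        role = wanted[0]
        known = {r: a for r, a in query.roles.items() if r != role}
        for fact in self.facts:
            # Same verb, and every non-wh argument of the question must match.
            if fact.verb == query.verb and all(
                fact.roles.get(r) == a for r, a in known.items()
            ):
                return fact.roles.get(role)
        return None


memory = Memory()
# Hand-labelled stand-in for the SRL output of "John killed Mary."
memory.tell(Frame("kill", {"A0": "John", "A1": "Mary"}))
# Hand-labelled stand-in for the SRL output of "Who killed Mary?"
answer = memory.ask(Frame("kill", {"A0": "Who", "A1": "Mary"}))
print(answer)
```

A real system would fill the Frame objects from the parser and role labeller instead of by hand, and would need lemmatization so that "killed" and "kill" map to the same frame.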
Most of the NLP modules mentioned can be implemented directly with Stanford CoreNLP. However, if you want to implement some algorithms on your own, you can go through recent publications from EMNLP, NIPS, ICML, CoNLL, etc. to understand them better and implement the one that best suits you.
Good luck!
Related
I have some raw text that has questions and answers in it. I would like to identify which parts of the text are questions and which parts are the answers. This seems like it would be easy, but the questions aren't necessarily terminated with question marks. The only thing I know for sure is that after a question is over the answer begins, and after the answer is over another question begins, but there is no consistent format on how many \n are included in the answers. A question is definitely its own paragraph though.
I'm hoping for some sort of pre-trained model for this?
One possibility would be to take some existing data, manually tag each paragraph as q vs a, run Google's Universal Sentence Encoder on each paragraph to get its 512-dimensional output, and then use that as the input to train a neural net or some other classification model on the labeled data. I'm hoping to avoid this path because I don't want to manually tag a few thousand paragraphs, and after all that work, who knows if the model will have a decent classification error.
Another possibility is to use something like GPT-3: feed it the entire text and just ask it what the questions/requests are. The problem with this is that the GPT-3 API is still a bit sandboxed. I tried a sample on the GPT-3 playground and it only identified 80% of the questions.
Any other suggestions?
To give you an idea, the text may look like this:
What is the name of the company?
We are Acme Inc.
How many employees are there.
There are 50 employees.
Describe a day in the life of an employee.
An employee arrives at 9am.
Then they go to the factory and make widgets for 4 hours. After making widgets they eat lunch and then go to the QA engineer to make sure their widgets are good enough.
After QA, they write a report about how many widgets they made.
Most employees leave around 5pm.
List the pay range of your employees.
The starting salary is $22/hours.
After 1 year pay increases to $25 an hour and then increases 3% per year.
Contact information:
Acme Inc
123 Main Street
Anyplace, USA
According to the description and the text sample that you provided, I would split this problem into 2 parts:
How to split the whole text
How to "classify" which sentence (or paragraph) is a question or an answer
I tried solving this problem using a heuristic-based approach with spaCy (you can use other libraries).
You can use this technique directly, or use it to build a weakly supervised dataset on which to train a classification model (try skweak).
Sentence Detection
This is the easy part: all you have to do is follow the details at https://spacy.io/usage/linguistic-features#sbd
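import spacy

nlp = spacy.load('en_core_web_sm')  # small English pipeline; must be downloaded first
doc = nlp("Hi, I'm sentence number 1. Hi, I'm sentence number 2.")
for sent in doc.sents:  # iterate over the detected sentences
    print(sent.text)
# Hi, I'm sentence number 1.
# Hi, I'm sentence number 2.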
Question or Answer
From the sample that you shared, I can see that you want to detect questions and also imperative phrases:
Questions: How many employees are there. To detect this type of question, you can use spaCy's tag_ attribute and look for WH tokens (who, where, etc.). You could use the same logic to find subject-verb inversions, for example.
Imperative phrases: List the pay range of your employees. To detect these cases, you can search for verbs that are the first tokens of a sentence.
Here's a small example that you can follow:
def is_question(sent):
    d = nlp(sent)
    token = d[0]  # gets the first token in the sentence
    # imperative phrase: the first token is a verb and the root of the sentence
    if token.pos_ == "VERB" and token.dep_ == "ROOT":
        return True
    # loops through the sentence and checks for WH tokens
    for token in d:
        if token.tag_ in ("WDT", "WP", "WP$", "WRB"):
            return True
    return False

# `text` is assumed to hold the raw document, e.g. the sample from the question
doc = nlp(text)
for sent in doc.sents:
    print(sent.text.strip())
    if is_question(sent.text.strip()):
        print("is question")
    else:
        print("not a question")
    print("***")
# What is the name of the company?
# is question
# ***
# We are Acme Inc.
# not a question
# ***
# How many employees are there.
# is question
# ***
# There are 50 employees.
# not a question
You can apply this function to a large corpus to get a weakly annotated dataset for training a classifier, or you can just use the function directly. But...
Beware!
This is a heuristic-based approach, so not all the results are correct. For example: What a beautiful day!
The sentence starts with a WH token but it's not a question. You could fix this by checking whether it ends with a question mark, but in your corpus questions don't always end with one.
A possible solution would be to apply this to a corpus and manually filter out these outliers.
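Once each paragraph has a question/not-question label, grouping the text into question and answer pairs is mostly bookkeeping. Here is a minimal sketch, assuming every non-empty line is a paragraph and that everything between two questions belongs to the first one's answer. The function names are hypothetical, and looks_like_question is only a crude first-word stand-in so the example runs without spaCy; in practice you would pass in the is_question heuristic or a trained classifier.

```python
def pair_questions_and_answers(text, is_question):
    """Group paragraphs into (question, answer) pairs.

    Every paragraph flagged as a question starts a new pair; all following
    paragraphs up to the next question are joined into its answer.
    """
    pairs = []
    for paragraph in (p.strip() for p in text.split("\n")):
        if not paragraph:
            continue  # blank lines between paragraphs carry no information
        if is_question(paragraph):
            pairs.append([paragraph, []])
        elif pairs:
            pairs[-1][1].append(paragraph)
        # paragraphs that appear before the first question are ignored
    return [(q, "\n".join(a)) for q, a in pairs]


def looks_like_question(paragraph):
    # Crude stand-in classifier (an assumption for this example only).
    first_word = paragraph.split()[0].lower()
    return first_word in {"what", "how", "describe", "list"}


sample = """What is the name of the company?
We are Acme Inc.

How many employees are there.
There are 50 employees.
Describe a day in the life of an employee.
An employee arrives at 9am.


Most employees leave around 5pm."""

qa_pairs = pair_questions_and_answers(sample, looks_like_question)
for question, answer in qa_pairs:
    print("Q:", question)
    print("A:", answer)
    print("***")
```

Note that trailing material such as the "Contact information:" block would be glued onto the last answer, so it still needs separate handling.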
If a single query from the user contains multiple questions belonging to different categories, how can they be identified, split and parsed?
Eg -
User - what is the weather now and tell me my next meeting
Parser - {:weather => "what is the weather", :schedule => "tell me my next meeting"}
Parser identifies the parts of sentences where the question belongs to two different categories
User - show me hotels in san francisco for tomorrow that are less than $300 but not less than $200 are pet friendly have a gym and a pool with 3 or 4 stars staying for 2 nights and dont include anything that doesnt have wifi
Parser - {:hotels => ["show me hotels in san francisco",
"for tomorrow", "less than $300 but not less than $200",
"pet friendly have a gym and a pool",
"with 3 or 4 stars", "staying for 2 nights", "with wifi"]}
Parser identifies that the question belongs to only one category but has additional steps for fine-tuning the answer, and creates an array ordered according to the steps to take
From what I understand, this requires a sentence segmenter, a multi-label classifier and coreference resolution.
But the sentence segmenters I have come across depend heavily on grammar and punctuation.
Multi-label classifiers, such as a well-trained naive Bayes classifier, work in most cases, but because they are multi-label they often output multiple categories for sentences that clearly belong to one class. Depending solely on the output array to check which labels are present would fail.
A multi-class classifier is also useful for checking the array of probable categories, but it obviously doesn't tell which part of the sentence belongs to which category with any accuracy, much less how to proceed with the next step.
As a first step, how can I tune a sentence segmenter to split the sentence correctly without strict grammar rules? Good accuracy here would help a lot with classification.
As a first step, how can I tune sentence segmenter to correctly split the sentence without any strict grammar rules.
Instead of doing this, I'd suggest you use the parse tree directly (either a dependency parse or a constituency parse).
Here I'm showing the output of the dependency parse and you can see that the two segments are separated via a "CONJ" arrow:
(from here: http://deagol.cs.illinois.edu:8080/)
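To make the idea concrete, here is a small sketch of splitting on the conjunction edge. The parse below is written by hand as a plausible dependency analysis of the example query (an assumption, not the output of the demo linked above); in practice the head indices and labels would come from a dependency parser. The function names are hypothetical.

```python
# Each token: (text, index of its head, dependency label). The root points to itself.
PARSE = [
    ("what", 1, "attr"),
    ("is", 1, "ROOT"),
    ("the", 3, "det"),
    ("weather", 1, "nsubj"),
    ("now", 1, "advmod"),
    ("and", 1, "cc"),
    ("tell", 1, "conj"),
    ("me", 6, "dative"),
    ("my", 10, "poss"),
    ("next", 10, "amod"),
    ("meeting", 6, "dobj"),
]


def subtree(parse, root):
    """Return the indices of `root` and all of its descendants."""
    children = {}
    for i, (_, head, _) in enumerate(parse):
        if head != i:
            children.setdefault(head, []).append(i)
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        found.add(node)
        stack.extend(children.get(node, []))
    return found


def split_on_conjunction(parse):
    """Split a sentence into the root clause and each clause conjoined to it."""
    root = next(i for i, (_, _, dep) in enumerate(parse) if dep == "ROOT")
    conjuncts = [
        i for i, (_, head, dep) in enumerate(parse) if dep == "conj" and head == root
    ]
    conjunct_spans = [subtree(parse, c) for c in conjuncts]
    taken = set().union(*conjunct_spans)
    # The first segment is everything else, minus the coordinating word itself.
    first = {
        i for i, (_, head, dep) in enumerate(parse)
        if i not in taken and not (dep == "cc" and head == root)
    }
    groups = [first] + conjunct_spans
    return [" ".join(parse[i][0] for i in sorted(group)) for group in groups]


segments = split_on_conjunction(PARSE)
print(segments)
```

This only follows conjuncts attached directly to the root. Some parsers chain coordinations (the third clause hangs off the second), which would need a recursive version.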
Another solution I'd give a try is ClausIE:
https://gate.d5.mpi-inf.mpg.de/ClausIEGate/ClausIEGate?inputtext=what+is+the+weather+now+and+tell+me+my+next+meeting++&processCcAllVerbs=true&processCcNonVerbs=true&type=true&go=Extract
If you want something for segmentation that doesn't depend heavily on grammar, then chunking comes to mind. The NLTK book has a section on it. The approach the authors take there depends only on part-of-speech tags.
BTW, the 3rd edition of Jurafsky and Martin's Speech and Language Processing covers chunking in the parsing chapter, and it also contains chapters on information retrieval and chatbots.
I'm new to Named Entity Recognition and I'm having some trouble understanding what/how features are used for this task.
Some papers I've read so far mention the features used but don't really explain them. For example, in
"Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition", the following features are mentioned:
Main features used by the sixteen systems that participated in the
CoNLL-2003 shared task sorted by performance on the English test data.
Aff: affix information (n-grams); bag: bag of words; cas: global case
information; chu: chunk tags; doc: global document information; gaz:
gazetteers; lex: lexical features; ort: orthographic information; pat:
orthographic patterns (like Aa0); pos: part-of-speech tags; pre:
previously predicted NE tags; quo: flag signing that the word is
between quotes; tri: trigger words.
I'm a bit confused by some of these, however. For example:
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
how can POS tags exactly be used as features ? Don't we have a POS tag for each word? Isn't each object/instance a "text"?
what is global document information?
what is the feature trigger words?
I think all I need here is to look at an example table with each of these features as columns and see their values to understand how they really work, but so far I've failed to find an easy-to-read dataset.
Could someone please clarify or point me to some explanation or example of these features being used?
Here's a shot at some answers (and by the way the terminology on all this stuff is super overloaded).
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
In my experience, BOW feature extraction is used to produce word features out of sentences. So IMO BOW is not one feature; it is a method of generating features out of a sentence (or whatever block of text you are using). Using n-grams can help account for word order, but BOW features amount to unordered bags of strings.
how can POS tags exactly be used as features ? Don't we have a POS tag for each word?
POS Tags are used as features because they can help with "word sense disambiguation" (at least on a theoretical level). For instance, the word "May" can be a name of a person or a month of a year or a poorly capitalized conjugated verb, but the POS tag can be the feature that differentiates that fact. And yes, you can get a POS tag for each word, but unless you explicitly use those tags in your "feature space" then the words themselves have no idea what they are in terms of their POS.
Isn't each object/instance a "text"?
If you mean what I think you mean, then this is true only if you have extracted object-instance "pairs" and stored them as features (an array of them derived from a string of tokens).
what is global document information?
I take this one to mean the following: most NLP tasks operate on a single sentence. Global document information is data from the surrounding text in the entire document. For instance, if you are trying to extract geographic place names and disambiguate them, and you find the word Paris, which one is it? If France is mentioned five sentences above, that increases the likelihood of it being Paris, France rather than Paris, Texas or, worst case, the person Paris Hilton. It's also really important in what is called "coreference resolution", which is when you link a name to a pronoun that refers to it (mapping a name mention to "he" or "she", etc.).
what is the feature trigger words?
Trigger words are specific tokens or sequences that reliably signal a specific meaning on their own. For instance, in sentiment analysis, curse words with exclamation marks often indicate negativity. There can be many permutations of this.
Anyway, my answers here are not perfect, and are prone to all manner of problems in human epistemology and inter-subjectivity, but this is the way I've been thinking about these things over the years I've been trying to solve problems with NLP.
Hopefully someone else will chime in, especially if I'm way off.
You should probably keep in mind that NER classifies each word/token separately, using features that are internal or external clues. Internal clues take into account the word itself (morphology such as uppercase letters, whether the token is present in a dedicated lexicon, POS) and external ones rely on contextual information (previous and next word, document features).
isn't bag of words supposed to be a method to generate features (one
for each word)? How can BOW itself be a feature? Or does this simply
mean we have a feature for each word as in BOW, besides all the other
features mentioned?
Yes, BOW generates one feature per word, sometimes combined with feature selection methods that reduce the number of features taken into account (e.g. a minimum word frequency).
how can a gazetteer be a feature?
A gazetteer may also generate one feature per word, but in most cases it enriches the data by labelling words or multi-word expressions (such as full proper names). It is an ambiguous step: "George Washington" will lead to two features: the entire "George Washington" as a celebrity and "Washington" as a city.
how can POS tags exactly be used as features ? Don't we have a POS tag
for each word? Isn't each object/instance a "text"?
For classifiers, each instance is a word. This is why sequence labelling methods (e.g. CRF) are used: they make it possible to leverage the previous and next words as additional contextual features when classifying the current word. Labelling a text is then a process that relies on the most likely NE type for each word in the sequence.
what is global document information?
This could be metadata (e.g. date, author), topics (full text categorization), coreference, etc.
what is the feature trigger words?
Triggers are external clues: contextual patterns that help disambiguation. For instance, "Mr" will be used as a feature that strongly suggests that the following tokens are a person.
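Since the question asks for an example table, here is a rough sketch of what one row per token could look like. Everything in it is illustrative: the tiny gazetteer and trigger list are made up, the POS tags are assigned by hand, and the feature names only loosely follow the abbreviations from the CoNLL-2003 caption.

```python
import re

GAZETTEER_CITIES = {"washington", "paris"}              # hypothetical mini-gazetteer
TRIGGERS_PERSON = {"mr", "mr.", "mrs", "mrs.", "dr", "dr."}  # hypothetical trigger list


def shape(word):
    """Orthographic pattern: 'Paris' -> 'Aa', 'B52' -> 'A0'."""
    pattern = re.sub(r"[A-Z]", "A", word)
    pattern = re.sub(r"[a-z]", "a", pattern)
    pattern = re.sub(r"[0-9]", "0", pattern)
    return re.sub(r"(.)\1+", r"\1", pattern)  # collapse repeated symbols


def token_features(tokens, pos_tags, i):
    """Build the feature row for token i (one instance for the classifier)."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"
    return {
        "lex": word.lower(),                    # the word itself (as in BOW)
        "pos": pos_tags[i],                     # part-of-speech tag of this token
        "pat": shape(word),                     # orthographic pattern
        "ort_is_capitalized": word[0].isupper(),
        "aff_prefix3": word[:3].lower(),        # affix information
        "aff_suffix3": word[-3:].lower(),
        "gaz_city": word.lower() in GAZETTEER_CITIES,          # gazetteer lookup
        "tri_prev_is_title": prev_word in TRIGGERS_PERSON,     # trigger word
        "prev_word": prev_word,                 # external (contextual) clues
        "next_word": next_word,
    }


tokens = ["Mr.", "Washington", "visited", "Paris", "in", "May"]
pos_tags = ["NNP", "NNP", "VBD", "NNP", "IN", "NNP"]  # assigned by hand for the example
rows = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]
for token, row in zip(tokens, rows):
    print(token, row)
```

In this toy table "Washington" fires both the city gazetteer and the person trigger, which is exactly the kind of conflicting evidence the classifier has to weigh.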
I recently implemented a NER system in python and I found the following features helpful:
character-level ngrams (using CountVectorizer)
previous word features and labels (i.e. context)
Viterbi or beam search on label sequence probability
part of speech (pos), word-length, word-count, is_capitalized, is_stopword
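For the Viterbi item, a minimal sketch of decoding the most likely label sequence is shown below. The emission and transition scores are invented numbers for a three-token sentence ("Mr. Smith spoke"); in a real system they would come from the trained model, and the function signature here is just one possible design.

```python
import math


def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list with one dict per token, label -> log-score
    transitions: dict (previous_label, label) -> log-score
    """
    best = [{label: emissions[0][label] for label in labels}]
    backpointers = []
    for scores in emissions[1:]:
        column, pointers = {}, {}
        for label in labels:
            # Best previous label to arrive at `label` from.
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, label)])
            column[label] = best[-1][prev] + transitions[(prev, label)] + scores[label]
            pointers[label] = prev
        best.append(column)
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda label: best[-1][label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def logs(probabilities):
    return {key: math.log(p) for key, p in probabilities.items()}


labels = ["O", "PER"]
# Made-up per-token label probabilities for "Mr.", "Smith", "spoke".
emissions = [
    logs({"O": 0.2, "PER": 0.8}),
    logs({"O": 0.3, "PER": 0.7}),
    logs({"O": 0.9, "PER": 0.1}),
]
# Made-up label-to-label transition probabilities.
transitions = logs({
    ("O", "O"): 0.8, ("O", "PER"): 0.2,
    ("PER", "PER"): 0.6, ("PER", "O"): 0.4,
})

decoded = viterbi(emissions, transitions, labels)
print(decoded)
```

Beam search follows the same loop but keeps only the top few partial sequences per step instead of one best score per label.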
Is there a way to generate a one-sentence summarization of Q&A pairs?
For example, provided:
Q: What is the color of the car?
A: Red
I want to generate a summary as
The color of the car is red
Or, given
Q: Are you a man?
A: Yes
to
Yes, I am a man.
which accounts for both question and answer.
What would be some of the most reasonable ways to do this?
I had to once work on solving the opposite problem, i.e. generating questions out of sentences from Wikipedia articles.
I used the Stanford Parser to generate parse trees out of all possible sentences in my training dataset.
e.g.
Go to http://nlp.stanford.edu:8080/parser/index.jsp
Enter "The color of the car is red." and click "Parse".
Then look at the Parse section of the response. The first layer of that sentence is NP VP (noun phrase followed by a verb phrase).
The second layer is NP PP VBZ ADJP.
I basically collected these patterns across thousands of sentences, sorted them by how common each pattern was, and then figured out how best to modify each parse tree to convert the sentence into a different Wh-question (What, Who, When, Where, Why, etc.).
You could easily do something very similar. Study the parse trees of all of your training data, and figure out what patterns you could extract to get your work done. In many cases, just replacing the Wh-word in the question with the answer would give you a valid, albeit somewhat awkwardly phrased, sentence.
e.g. "Red is the color of the car."
In the case of questions like "Are you a man?" (i.e. primary verb is something like 'are', 'can', 'should', etc), swapping the first 2 words usually does the trick - "You are a man?"
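The two tricks just described (substituting the answer for the Wh-word, and swapping the first two words of a yes/no question) can be sketched without a parser at all. This is a deliberately naive word-level version, not the parse-tree approach I used. The word lists and the function name are assumptions, and it does not convert "you" to "I".

```python
AUXILIARIES = {"is", "are", "was", "were", "can", "could", "should",
               "will", "would", "do", "does", "did"}
WH_WORDS = {"what", "who", "when", "where", "which"}


def qa_to_statement(question, answer):
    """Turn a question and a short answer into one declarative sentence."""
    words = question.strip().rstrip("?").split()
    answer = answer.strip().rstrip(".")
    if not words or not answer:
        return None
    first = words[0].lower()
    if first in WH_WORDS:
        # Replace the Wh-word with the answer: "What is X" -> "Red is X".
        body = [answer] + words[1:]
    elif first in AUXILIARIES and len(words) > 1:
        # Swap the first two words: "Are you a man" -> "you are a man",
        # adding "not" when the answer is "No".
        negation = ["not"] if answer.lower() == "no" else []
        body = [answer + ",", words[1], first] + negation + words[2:]
    else:
        return None  # pattern not covered by these two rules
    sentence = " ".join(body)
    return sentence[0].upper() + sentence[1:] + "."


print(qa_to_statement("What is the color of the car?", "Red"))
print(qa_to_statement("Are you a man?", "Yes"))
print(qa_to_statement("Are you a man?", "No"))
```

Questions such as "When was he born?" will come out awkward with plain substitution, which is where pattern rules derived from parse trees earn their keep.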
I don't know any NLP task that explicitly handles your requirement.
Broadly, there are two kinds of questions. The first kind expects a passage as the answer, such as definition or explanation questions: What is Ebola Fever? The second kind is fill-in-the-blank, referred to as factoid questions in the literature, such as: What is the height of Mt. Everest? It is not clear what kind of question you would like to summarize. I am assuming you are interested in factoid questions, as your examples refer only to them.
A very similar problem arises in the task of Question Answering. One of the first stages of this task is query generation. In the paper "An Exploration of the Principles Underlying Redundancy-Based Factoid Question Answering" (Jimmy Lin, 2007), the author claims that better performance can be achieved by reformulating the query (see section 4.1) into a form more likely to appear in free text. Let me copy some of the examples discussed in the paper.
1. What year did Alaska became a state?
2. Alaska became a state ?x
1. Who was the first person to run the miles in less than four minutes?
2. The first person to run the miles in less than four minutes was ?x
In the above examples, the query in 1 is reformulated to 2. As you might have already observed, ?x is the blank that should be filled by the answer. This reformulation is carried out through a dozen hand-written rules, which are built into the software tool discussed in the paper: ARANEA. All you have to do is find the tool and use it. The paper is a good ten years old, though, so I cannot promise you anything :)
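To give a feel for what such a hand-written rule might look like, here is a sketch with a single regular-expression rule. It is my own illustration of the idea, not one of ARANEA's actual rules, and the names RULES, reformulate and fill are hypothetical.

```python
import re

# One illustrative rule: "Who/What <be-verb> <rest>?" -> "<rest> <be-verb> ?x"
RULES = [
    (re.compile(r"^(?:Who|What) (was|is|were|are) (.+)\?$", re.IGNORECASE), r"\2 \1 ?x"),
]


def reformulate(question):
    """Rewrite a question as a declarative template containing the blank ?x."""
    for pattern, template in RULES:
        match = pattern.match(question.strip())
        if match:
            result = match.expand(template)
            return result[0].upper() + result[1:]
    return None  # no rule applies


def fill(template, answer):
    """Put the answer into the blank to obtain a one-sentence summary."""
    return template.replace("?x", answer)


print(reformulate("Who was the first person to run the miles in less than four minutes?"))
print(fill(reformulate("What is the color of the car?"), "red"))
```

Each additional question type (When, Where, "What year did ...") needs its own rule, and some of them require changing the verb form, which a plain regex cannot do.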
Hope this helps.
I want to identify the different queries in a sentence.
For example, "Who is Bill Gates and where he was born?" or "Who is Bill Gates, where he was born?" contains two queries:
Who is Bill Gates?
Where Bill Gates was born
I have worked on coreference resolution, so I can identify that "he" points to Bill Gates; the resolved sentence is "Who is Bill Gates, where Bill Gates was born".
Likewise:
MGandhi is good guys, Where he was born?
single query
who is MGandhi and where was he born?
2 queries
who is MGandhi, where he was born and died?
3 queries
India won world cup against Australia, when?
1 query (when India won WC against Auz)
I can perform coreference resolution, but I don't see how to distinguish the queries in the sentence.
How can I do this?
I checked various sentence parsers, but since this is pure NLP, a sentence parser does not identify them.
I tried to find "sentence disambiguation", analogous to "word sense disambiguation", but nothing like that seems to exist.
Any help or suggestion would be much appreciated.
Natural language is full of exceptions. Especially in English, it is often said that there are more exceptions than rules. So, it is almost impossible to get a completely accurate solution that works every single time, but using a parser, you can achieve reasonably good performance.
I like to use the Berkeley parser for such tasks. Their online demo includes a graphical representation of the parse tree, which is extremely helpful when trying to formulate heuristics.
For example, consider the question "Who is Bill Gates and where was he born?". The parse tree looks like this:
Clearly, you can split the tree at the central conjunction (CC) node to extract the individual queries. In general, this will be easy if the parsed sentence is simple (where there will be only one query) or compound (where the individual queries can be split by looking at conjunction nodes, as above).
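As a rough sketch of that heuristic, the code below reads a bracketed parse and splits it at the first node that has a CC child with at least two clause-level siblings. The bracketed string is typed by hand in Penn Treebank style as a plausible parse of the example; it is an assumption, not actual Berkeley parser output, and the helper names are hypothetical.

```python
PARSE = (
    "(ROOT (SBARQ (SBARQ (WHNP (WP Who)) (SQ (VBZ is) (NP (NNP Bill) (NNP Gates))))"
    " (CC and)"
    " (SBARQ (WHADVP (WRB where)) (SQ (VBD was) (NP (PRP he)) (VP (VBN born))))"
    " (. ?)))"
)


def parse_brackets(text):
    """Parse a bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read():
        nonlocal position
        if tokens[position] == "(":
            label = tokens[position + 1]
            position += 2
            children = []
            while tokens[position] != ")":
                children.append(read())
            position += 1  # skip the closing bracket
            return (label, children)
        word = tokens[position]
        position += 1
        return word

    return read()


def leaves(node):
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]


def split_at_cc(node):
    """Return the text of the clauses joined by the first clause-level conjunction."""
    if isinstance(node, str):
        return []
    subtrees = [child for child in node[1] if not isinstance(child, str)]
    if any(child[0] == "CC" for child in subtrees):
        # Only split when the conjunction joins clauses (S, SQ, SBARQ, ...),
        # not noun phrases such as "a gym and a pool".
        clauses = [child for child in subtrees if child[0].startswith("S")]
        if len(clauses) >= 2:
            return [" ".join(leaves(clause)) for clause in clauses]
    for child in subtrees:
        found = split_at_cc(child)
        if found:
            return found
    return []


tree = parse_brackets(PARSE)
queries = split_at_cc(tree) or [" ".join(leaves(tree))]
print(queries)
```

After the split, coreference resolution can replace "he" in the second query with "Bill Gates".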
Another, more complex example from your question has three queries: "Who is Gandhi and where did he work and live?". The parse tree:
Again, you can see the conjunction node which splits "Who is Gandhi" and "where did he work and live". The parse does not, however, split up the second query into two, as you would ideally want. And that brings us to the hardest part of what you are trying to do: dealing with (computationally, of course) what is known as right node raising. This is a linguistic construct where common parts get shared.
For example, consider the question "When and how did he suffer a setback?". What it really asks is (a) when did he suffer a setback?, and (b) how did he suffer a setback? Right-node raising issues cannot be solved by just parse trees. It is, in fact, one of the harder problems in computational linguistics, and belongs to the domain of hardcore academic research.
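Just to show what the desired output looks like, here is a surface-level sketch that handles only the single pattern "WH and WH rest-of-question". It is a narrow string heuristic under that assumption, not a solution to right-node raising in general, and the function name is hypothetical.

```python
import re

WH = r"(?:who|what|when|where|why|how)"
COORDINATED_WH = re.compile(rf"^({WH}) and ({WH}) (.+?)\??$", re.IGNORECASE)


def expand_coordinated_wh(question):
    """Expand 'When and how did X?' into 'When did X?' and 'How did X?'."""
    match = COORDINATED_WH.match(question.strip())
    if not match:
        return [question.strip()]  # pattern not recognised: leave untouched
    first, second, shared = match.groups()
    return [f"{first.capitalize()} {shared}?", f"{second.capitalize()} {shared}?"]


expanded = expand_coordinated_wh("When and how did he suffer a setback?")
print(expanded)
```

A question like "Who is Gandhi and where did he work and live?" passes through unchanged, because the shared material there sits in the middle of the sentence rather than after two leading Wh-words.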