I'm using the gensim library for word2vec. I want to train the model on text examples that are unrelated, for example: "The cat is brown. What time is it?"
I have created the following input to the model:
[["The", "cat", "is", "brown"], ["What", "time", "is", "it"]], however I'm wondering whether the model assumes that "brown" and "What" are in the same context.
I tried to find the answer in the API docs, but could not find it.
The gensim API won't consider "brown" and "What" to be in the same context. Uneven (truncated) windows are used near sentence boundaries. So, for your example, if the window size is, say, 1, the (context, target) pairs would look like this:
([cat], The), ([The, is], cat), ([cat, brown], is), ([is], brown)
([time], What), ([What, is], time), ([time, it], is), ([is], it)
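If you want to sanity-check this yourself, here is a minimal sketch (the hyperparameters are illustrative; in gensim 4.x the argument is vector_size, while older versions call it size):

from gensim.models import Word2Vec

# Each inner list is its own sentence; context windows never cross them.
sentences = [["The", "cat", "is", "brown"], ["What", "time", "is", "it"]]

# window=1 mirrors the example above; min_count=1 keeps every word.
model = Word2Vec(sentences, vector_size=10, window=1, min_count=1)
print(model.wv.similarity("brown", "What"))  # vectors trained from separate contexts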
I hope this clears your doubt.
I'm new to Named Entity Recognition and I'm having some trouble understanding what/how features are used for this task.
Some papers I've read so far mention features used, but don't really explain them, for example in
Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, the following features are mentioned:
Main features used by the sixteen systems that participated in the
CoNLL-2003 shared task sorted by performance on the English test data.
Aff: affix information (n-grams); bag: bag of words; cas: global case
information; chu: chunk tags; doc: global document information; gaz:
gazetteers; lex: lexical features; ort: orthographic information; pat:
orthographic patterns (like Aa0); pos: part-of-speech tags; pre:
previously predicted NE tags; quo: flag signing that the word is
between quotes; tri: trigger words.
I'm a bit confused by some of these, however. For example:
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
how exactly can POS tags be used as features? Don't we have a POS tag for each word? Isn't each object/instance a "text"?
what is global document information?
what is the feature trigger words?
I think all I need here is just to look at an example table with each of these features as columns and see their values to understand how they really work, but so far I've failed to find an easy-to-read dataset.
Could someone please clarify or point me to some explanation or example of these features being used?
Here's a shot at some answers (and by the way the terminology on all this stuff is super overloaded).
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
In my experience, BOW feature extraction is used to produce word features out of sentences. So IMO, BOW is not one feature; it is a method of generating features out of a sentence (or a block of text you are using). Using n-grams can help with accounting for sequence, but BOW features amount to unordered bags of strings.
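To make that concrete, a minimal scikit-learn sketch of BOW feature generation (toy documents; get_feature_names_out needs scikit-learn 1.0+):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]
vectorizer = CountVectorizer()        # one feature (column) per vocabulary word
X = vectorizer.fit_transform(docs)    # rows are documents
print(vectorizer.get_feature_names_out())
print(X.toarray())                    # unordered word counts: word order is lost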
how exactly can POS tags be used as features? Don't we have a POS tag for each word?
POS tags are used as features because they can help with "word sense disambiguation" (at least on a theoretical level). For instance, the word "May" can be the name of a person, a month of the year, or a poorly capitalized conjugated verb, and the POS tag can be the feature that captures that distinction. And yes, you can get a POS tag for each word, but unless you explicitly use those tags in your "feature space", the words themselves carry no information about their POS.
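A quick sketch with NLTK (the tokenizer and tagger data packages must be downloaded first, and the exact tags can vary by model version):

import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("May I meet May in May?")
print(nltk.pos_tag(tokens))
# A good tagger separates the modal "May" (MD) from the proper-noun "May"s
# (NNP), and those tags can be fed to the classifier as extra features.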
Isn't each object/instance a "text"?
If you mean what I think you mean, then this is true only if you have extracted object-instance "pairs" and stored them as features (an array of them derived from a string of tokens).
what is global document information?
I take this one to mean the following: most NLP tasks function at the sentence level. Global document information is data from all the surrounding text in the entire document. For instance, if you are trying to extract geographic placenames and disambiguate them, and you find the word Paris, which one is it? Well, if France is mentioned 5 sentences above, that could increase the likelihood of it being Paris, France rather than Paris, Texas or, worst case, the person Paris Hilton. It's also really important in what is called "coreference resolution", which is when you correlate a name to a pronoun reference (mapping a name mention to "he" or "she", etc.).
what is the feature trigger words?
Trigger words are specific tokens or sequences that, on their own, reliably signal a specific meaning. For instance, in sentiment analysis, curse words with exclamation marks often indicate negativity. There can be many permutations of this.
Anyway, my answers here are not perfect, and are prone to all manner of problems in human epistemology and inter-subjectivity, but that is the way I've been thinking about these things over the years I've been trying to solve problems with NLP.
Hopefully someone else will chime in, especially if I'm way off.
You should probably keep in mind that NER classifies each word/token separately, based on features that are internal or external clues. Internal clues take into account the word itself (morphology such as uppercase letters, whether the token is present in a dedicated lexicon, POS), while external ones rely on contextual information (previous and next word, document features).
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
Yes, BOW generates one feature per word, sometimes with feature selection methods that reduce the number of features taken into account (e.g. a minimum frequency of words).
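In scikit-learn, for instance, that frequency cut-off might look like this (the threshold is illustrative):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the price of gold", "the price of silver", "a rare word"]
vec = CountVectorizer(min_df=2)   # keep only words seen in at least 2 documents
vec.fit(docs)
print(sorted(vec.vocabulary_))    # ['of', 'price', 'the']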
how can a gazetteer be a feature?
A gazetteer may also generate one feature per word, but in most cases it enriches the data by labelling words or multi-word expressions (such as full proper names). It is an ambiguous step: "Georges Washington" will lead to two features: the entire "Georges Washington" as a celebrity and "Washington" as a city.
how exactly can POS tags be used as features? Don't we have a POS tag for each word? Isn't each object/instance a "text"?
For classifiers, each instance is a word. This is why sequence labelling methods (e.g. CRF) are used: they make it possible to leverage the previous and next words as additional contextual features when classifying the current word. Labelling a text is then a process that relies on the most likely NE types for each word in the sequence.
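To make the idea concrete, here is a rough sketch of per-token feature dicts in the style expected by sklearn-crfsuite (the feature names are my own):

def token_features(sent, i):
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # internal clue: capitalization
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()  # external clue: previous word
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        feats["next.lower"] = sent[i + 1].lower()  # external clue: next word
    else:
        feats["EOS"] = True  # end of sentence
    return feats

sent = ["Mr", "Smith", "visited", "Washington"]
print([token_features(sent, i) for i in range(len(sent))])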
what is global document information?
This could be metadata (e.g. date, author), topics (full text categorization), coreference, etc.
what is the feature trigger words?
Triggers are external clues: contextual patterns that help disambiguation. For instance, "Mr" will be used as a feature that strongly suggests that the following token is a person.
I recently implemented an NER system in Python and I found the following features helpful:
character-level n-grams (using CountVectorizer; see the sketch after this list)
previous word features and labels (i.e. context)
Viterbi or beam search over label sequence probabilities
part of speech (pos), word length, word count, is_capitalized, is_stopword
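As a hedged sketch, the character-level n-gram features mentioned above can be produced like this (the parameters are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# "char_wb" builds n-grams inside word boundaries, capturing affix-like cues.
vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(["Damiano", "London", "gazette"])
print(len(vec.vocabulary_), "character n-gram features")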
I would like to create a custom NER model. This is what I did:
TRAINING DATA (stanford-ner.tsv):
Hello O
! O
My O
name O
is O
Damiano PERSON
. O
PROPERTIES (stanford-ner.prop):
trainFile = stanford-ner.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
maxLeft=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useGazettes=true
gazette=gazzetta.txt
cleanGazette=true
GAZETTE (gazzetta.txt):
PERSON John
PERSON Andrea
I built the model via the command line with:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop stanford-ner.prop
And test with:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile test.txt
I did two tests with the following texts:
>>> TEST 1 <<<
TEXT:
Hello! My name is Damiano and this is a fake text to test.
OUTPUT
Hello/O !/O
My/O name/O is/O Damiano/PERSON and/O this/O is/O a/O fake/O text/O to/O test/O ./O
>>> TEST 2 <<<
TEXT:
Hello! My name is John and this is a fake text to test.
OUTPUT
Hello/O !/O
My/O name/O is/O John/O and/O this/O is/O a/O fake/O text/O to/O test/O ./O
As you can see, only the "Damiano" entity is found. This entity is in my training data, but "John" (second test) is in the gazette. So the question is:
Why is the "John" entity not recognized?
Thank you so much in advance.
As the Stanford NER FAQ says:
If a gazette is used, this does not guarantee that words in the
gazette are always used as a member of the intended class, and it does
not guarantee that words outside the gazette will not be chosen. It
simply provides another feature for the CRF to train against. If the
CRF has higher weights for other features, the gazette features may be
overwhelmed.
If you want something that will recognize text as a member of a class
if and only if it is in a list of words, you might prefer either the
regexner or the tokensregex tools included in Stanford CoreNLP. The
CRF NER is not guaranteed to accept all words in the gazette as part
of the expected class, and it may also accept words outside the
gazette as part of the class.
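If strict list-lookup behaviour is what you want, the regexner route mentioned above could look roughly like this (the file names are illustrative, and this assumes the full CoreNLP distribution rather than the standalone NER jar):
names.txt (tab-separated pattern and class):
John	PERSON
Andrea	PERSON
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -regexner.mapping names.txt -file test.txt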
By the way, it is not good practice to test machine learning pipelines in a 'unit-test' way, i.e. with only one or two examples, because they are supposed to work on much greater volumes of data and, more importantly, they are probabilistic by nature.
If you want to check whether your gazette file is actually used, it may be better to take the existing examples (see the bottom of the page linked above for the austen.gaz.prop and austen.gaz.txt examples) and replace several names with your own, then check. If that fails, first try to change your test, e.g. add more names, reformulate the text, and so on.
The gazette only helps by contributing extra features over the training data; if these words never occur in your training data, or have no connection to labeled tokens, your model will not benefit from them. One experiment I would suggest is to add "Damiano" to your gazette.
Why is the "John" entity not recognized?
It looks to me like your minimal example should most probably also list "Damiano" in the gazetteer under the PERSON category. Currently, the training data lets the model learn that "Damiano" takes the PERSON label, but I think this is not tied to the gazetteer categories (i.e. having PERSON on both sides is not sufficient).
Is there a way to generate a one-sentence summarization of Q&A pairs?
For example, provided:
Q: What is the color of the car?
A: Red
I want to generate a summary as
The color of the car is red
Or, given
Q: Are you a man?
A: Yes
to
Yes, I am a man.
which accounts for both question and answer.
What would be some of the most reasonable ways to do this?
I once had to work on the opposite problem, i.e. generating questions out of sentences from Wikipedia articles.
I used the Stanford Parser to generate parse trees out of all possible sentences in my training dataset.
e.g.
Go to http://nlp.stanford.edu:8080/parser/index.jsp
Enter "The color of the car is red." and click "Parse".
Then look at the Parse section of the response. The first layer of that sentence is NP VP (noun phrase followed by a verb phrase).
The second layer is NP PP VBZ ADJP.
I basically collected these patterns across thousands of sentences, sorted them by how common each pattern was, and then figured out how best to modify the parse tree to convert each sentence into a different Wh-question (What, Who, When, Where, Why, etc.).
You could easily do something very similar. Study the parse trees of all of your training data, and figure out what patterns you could extract to get your work done. In many cases, just replacing the Wh-word in the question with the answer would give you a valid, albeit somewhat awkwardly phrased, sentence.
e.g. "Red is the color of the car."
In the case of questions like "Are you a man?" (i.e. where the primary verb is something like 'are', 'can', 'should', etc.), swapping the first two words usually does the trick: "You are a man?"
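A toy sketch of those two rules in Python (crude string manipulation with no parsing; every name here is my own):

def declarativize(question, answer):
    q = question.rstrip("?").split()
    wh_words = {"what", "who", "when", "where", "why", "which", "how"}
    if q[0].lower() in wh_words:
        # Rule 1: replace the Wh-word with the answer.
        return " ".join([answer] + q[1:]) + "."
    # Rule 2: yes/no question, so swap the auxiliary and the subject.
    return answer.capitalize() + ", " + q[1] + " " + q[0].lower() + " " + " ".join(q[2:]) + "."

print(declarativize("What is the color of the car?", "Red"))  # Red is the color of the car.
print(declarativize("Are you a man?", "Yes"))                 # Yes, you are a man.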
I don't know any NLP task that explicitly handles your requirement.
Broadly, there are two kinds of questions: questions that expect a passage as the answer, such as definition or explanation questions ("What is Ebola fever?"), and fill-in-the-blank questions, referred to as factoid questions in the literature, such as "What is the height of Mt. Everest?". It is not clear which kind of question you would like to summarize; I am assuming you are interested in factoid questions, as your examples refer only to them.
A very similar problem arises in the task of Question Answering, where one of the first stages is query generation. In the paper "An Exploration of the Principles Underlying Redundancy-Based Factoid Question Answering" (Jimmy Lin, 2007), the author claims that better performance can be achieved by reformulating the query (see section 4.1) into the form more likely to appear in free text. Let me copy some of the examples discussed in the paper.
1. What year did Alaska become a state?
2. Alaska became a state ?x
1. Who was the first person to run the mile in less than four minutes?
2. The first person to run the mile in less than four minutes was ?x
In the above examples, the query in 1 is reformulated into 2. As you might have already observed, ?x is the blank that should be filled by the answer. This reformulation is carried out through a dozen hand-written rules that are built into the software tool discussed in the paper, ARANEA. All you have to do is find the tool and use it; the paper is a good ten years old, so I cannot promise you anything, though :)
Hope this helps.
I am classifying input sentences into different categories, like time, distance, speed, location, etc.
I trained a classifier using MultinomialNB.
The classifier mainly considers tf as a feature; I also tried taking sentence structure into account (using 1-4 grams).
Using MultinomialNB with alpha = 0.001, these are the results for a few queries:
what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}} #for above two query also result should be money
How early can I go to mumbai
{"1": {"manner": "97.77%"}} #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is a meter
{"1": {"period": "90.74%"}, "2": {"dist": "9.26%"}} #better result should be distance
Using MultinomialNB and considering n-grams (1-4):
what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}} # for above two query also result should be money
How early can I go to mumbai
{"1": {"manner": "97.77%"}} #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is an hour
{"1": {"dist": "99.61%"}} #result should be time
So the result depends purely on word occurrence. Is there any way to add word disambiguation (or any other means by which some kind of understanding could be brought in)?
I already checked Word sense disambiguation in NLTK Python
but the issue here is identifying the main word in the sentence, which differs in every sentence.
I have already tried POS (gives NN, JJ, which the sentence does not hinge on) and NER (highly dependent on capitalization, and sometimes it also fails to disambiguate words like "early" or "cost" in the sentences above); none of them helps.
"How long" is sometimes considered time and sometimes distance. So, based on the nearby words in the sentence, the classifier should be able to understand which one it is. Similarly for "how fast", "how come", "how early": [how + word] should be understandable.
I am using nltk, scikit-learn, and Python.
Update:
40 classes (each with sentences belonging to that class)
Total data: 300 KB
Accuracy depends on the query: sometimes very good (>90%), sometimes an irrelevant class as the result, depending on how the query matches the dataset.
Attempting to deduce semantics purely by looking at individual words out of context is not going to take you very far. In your "watch" examples, the only term which actually indicates that you have "money" semantics is the one you hope to disambiguate. What other information is there in the sentence to help you reach that conclusion, as a human reader? How would you model that knowledge? (A traditional answer would reason about your perception of watches as valuable objects, or something like that.)
Having said that, you might want to look at Wordnet synsets as a possibly useful abstraction. At least then you could say that "cost", "price", and "value" are related somehow, but I suppose the word-level statistics you have already calculated show that they are not fully synonymous, and the variation you see basically accounts for that fact (though your input size sounds kind of small for adequately covering variances of usage patterns for individual word forms).
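For instance, a quick look at the noun synsets with NLTK (the WordNet corpus must be downloaded first):

from nltk.corpus import wordnet as wn
# nltk.download("wordnet")

for word in ("cost", "price", "value"):
    print(word, [s.name() for s in wn.synsets(word, pos=wn.NOUN)][:3])
# Overlapping synsets (such as a shared monetary-value sense) suggest the
# three words could be collapsed into one "money-related" feature.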
Another hint could be provided by part of speech annotation. If you know that "value" is used as a noun, that (to my mind, at least) narrows the meaning to "money talk", whereas the verb reading is much less specifically money-oriented ("we value your input", etc). In your other examples, it is harder to see whether it would help at all. Perhaps you could perform a quick experiment with POS-annotated input and see whether it makes a useful difference. (But then POS is not always possible to deduce correctly, for much the same reasons you are having problems now.)
The sentences you show as examples are all rather simple. It would not be very hard to write a restricted parser for a small subset of English where you could actually start to try to make some sense of the input grammatically, if you know that your input will generally be constrained to simple questions with no modal auxiliaries etc.
(Incidentally, I'm not sure "how come can I go to Mumbai" is "manner", if it is grammatical at all. Strictly speaking, you should have subordinate clause word order here. I would understand it to mean roughly "Why is it that I can go to Mumbai?")
Your result "depends purely on word occurrence" because that is the kind of features your code produces. If you feel that this approach is not sufficient for your problem, you need to decide what other information you need to extract. Express it as features, i.e. as key-value pairs, add them to your dictionary, and pass them to the classifier exactly as you do now. To avoid overtraining you should probably limit the number of ngrams you do include in the dictionary; e.g., keep only the frequent ones, or the ones containing certain keywords you consider relevant, or whatever.
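A minimal sketch of that workflow with scikit-learn's DictVectorizer (the extra feature starts_with_how is just an illustrative example):

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

def features(sentence):
    tokens = sentence.lower().split()
    feats = {w: 1 for w in tokens}                      # plain bag-of-words part
    feats["starts_with_how"] = int(tokens[0] == "how")  # hand-crafted extra clue
    return feats

train = ["how early can I go to mumbai", "what is the cost of Watch"]
labels = ["time", "money"]

vec = DictVectorizer()
clf = MultinomialNB(alpha=0.001)
clf.fit(vec.fit_transform([features(s) for s in train]), labels)
print(clf.predict(vec.transform([features("how fast can I go")])))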
I'm not quite sure what classification you mean by "distance, speed, location, etc.", but you've mentioned most of the tools I'd think to use for something like this. If they didn't work to your satisfaction, think about more specific ways to detect properties that might be relevant; then express them as features so they can contribute to classification along with the "bag of words" features you already have. (But note that many experts in the field get acceptable results using just the bag-of-words approach.)
Based on my understanding of the nature of your problem so far, I would suggest using an unsupervised classification method, meaning that you use a set of rules for classification. By rules I mean if ... then ... else conditions. This is how some expert systems work. To add an understanding of similar concepts and synonyms, I suggest you create an ontology. Ontologies are a sub-concept of the semantic web. Problems such as yours are usually addressed by use of the semantic web, be it using RDF schemas or ontologies. You can learn more about the semantic web here and about ontologies here. My suggestion is not to go too deep into these fields, but just to learn a general, high-level idea, and then write your own ontology in a text file (avoid using any tools to build an ontology, because they take too much effort, and your problem is easy enough not to need that effort).
When you search the web you will find some already-existing ontologies, but in your case it's better to write a small ontology of your own, use it to build the set of rules, and you are good to go.
One note about your current solution (using NB) on this kind of data: you may simply have an overfitting problem, which would result in low accuracy for some queries and high accuracy for others. I think it's better to avoid supervised learning for this problem. Let me know if you have further questions.
Edit 1: In this edit I would like to elaborate on the above answer:
Let's say you want to build an unsupervised classifier. The data you currently have can be split into about 40 different classes. Because the sentences in your dataset are already somewhat restricted and simple, you can simply classify them with a set of rules. Let me show you what I mean. Suppose a random sentence from your dataset is kept in the variable sentence:
def classify(sentence):
    # Keyword rules in the spirit of the pseudocode above.
    if "long" in sentence:
        if "meter" in sentence:
            return "distance"
        # elif ... (more rules for other "long" patterns)
        else:
            return "period"
    if "fast" in sentence:
        return "speed or time"
    if "early" in sentence:
        return "time"
So you get the idea of what I mean. If you build a simple classifier in this way, and make it as precise as possible, you can easily reach overall accuracies of almost 100%. Now, if you want to automate some complicated decision making, you need a form of knowledge base, which I'd refer to as an ontology. In a text file you could have something like this (I am writing it in plain English just to make it simple to understand; you can write it in a concise, coded manner; this is just a general example to show you what I mean):
"Value" depends 60% on "cost (measured with money)", 20% on "durability (measured in time)", 20% on "ease of use (measured in quality)"
Then, if you want to measure value, you already have a formula for it. You should decide whether you need such a formula based on your data. Or, if you want to keep a synonyms list, you can store it in a text file and substitute synonyms as needed.
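Encoded as code, that ontology line might become a simple weighted formula (the weights are the made-up numbers from the example):

# "Value" = 60% cost + 20% durability + 20% ease of use
VALUE_WEIGHTS = {"cost": 0.6, "durability": 0.2, "ease_of_use": 0.2}

def value_score(scores):
    # scores maps each aspect to a number, e.g. normalized to [0, 1]
    return sum(w * scores[k] for k, w in VALUE_WEIGHTS.items())

print(value_score({"cost": 0.9, "durability": 0.5, "ease_of_use": 0.7}))  # 0.78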
The overall implementation of the classifier for the 40 classes in the way I described requires a few days, and since the method is quite deterministic, you are destined to achieve a very high accuracy of up to 100%.
Is there a way to find all the sub-sentences of a sentence that are still meaningful and contain at least one subject, verb, and a predicate/object?
For example, if we have a sentence like "I am going to do a seminar on NLP at SXSW in Austin next month". We can extract the following meaningful sub-sentences from this sentence: "I am going to do a seminar", "I am going to do a seminar on NLP", "I am going to do a seminar on NLP at SXSW", "I am going to do a seminar at SXSW", "I am going to do a seminar in Austin", "I am going to do a seminar on NLP next month", etc.
Please note that there are no deduced sentences here (e.g. "There will be an NLP seminar at SXSW next month." Although this is true, we don't need it as part of this problem). All generated sentences are strictly parts of the given sentence.
How can we approach solving this problem? I was thinking of creating annotated training data that has a set of legal sub-sentences for each sentence in the training data set. And then write some supervised learning algorithm(s) to generate a model.
I am quite new to NLP and Machine Learning, so it would be great if you guys could suggest some ways to solve this problem.
You can use the dependency parser provided by Stanford CoreNLP.
The collapsed output for your sentence will look like this:
nsubj(going-3, I-1)
xsubj(do-5, I-1)
aux(going-3, am-2)
root(ROOT-0, going-3)
aux(do-5, to-4)
xcomp(going-3, do-5)
det(seminar-7, a-6)
dobj(do-5, seminar-7)
prep_on(seminar-7, NLP-9)
prep_at(do-5, SXSW-11)
prep_in(do-5, Austin-13)
amod(month-15, next-14)
tmod(do-5, month-15)
The last five lines of the output are optional: you can remove one or more parts that are not essential to your sentence.
Most of these optional parts are prepositional and modifier relations, e.g. prep_in, prep_at, advmod, tmod, etc. See the Stanford Dependency Manual.
For example, if you remove all the modifiers from the output, you will get:
I am going to do a seminar on NLP at SXSW in Austin.
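As a rough sketch of the same pruning idea in Python, here it is with spaCy's dependency parse instead of Stanford's (the labels differ, but the principle carries over; requires the en_core_web_sm model):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I am going to do a seminar on NLP at SXSW in Austin next month.")

# Collect every token inside a prepositional-phrase subtree, then drop them.
drop = set()
for tok in doc:
    if tok.dep_ == "prep":
        drop.update(t.i for t in tok.subtree)
print(" ".join(t.text for t in doc if t.i not in drop))
# e.g. "I am going to do a seminar next month ."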
There's a paper titled "Using Discourse Commitments to Recognize Textual Entailment" by Hickl et al. that discusses the extraction of discourse commitments (sub-sentences). The paper includes a description of their algorithm, which at some level operates on rules. They used it for RTE, and there may be some minimal level of deduction in the output. Text simplification may be a related area to look at.
The following paper, http://www.mpi-inf.mpg.de/~rgemulla/publications/delcorro13clausie.pdf, processes the dependencies from the Stanford parser and constructs simple clauses (text simplification).
See the online demo - https://d5gate.ag5.mpi-sb.mpg.de/ClausIEGate/ClausIEGate
One approach would be to use a parser such as a PCFG. Trying to just train a model to detect 'subsentences' is likely to suffer from data sparsity. Also, I am doubtful that you could write down a really clean and unambiguous definition of a subsentence, and if you can't define it, you can't get annotators to annotate for it.