Stanford NER precedence - machine-learning

I'm using Stanford NER for training and I have two questions.
1) How do I give precedence to a particular class label?
For example, if a "word" is trained as Class A and the same "word" is also trained as Class B, then when "word" is found in the text I want the conflict resolved so that it is labelled as Class B only, irrespective of how many occurrences of "word" were trained as Class A.
2) How do I train an entity made of two words? For example, "Mysore" is a City and "Mysore Road" is a Road. But in Stanford NER I need to train both "Mysore" and "Road" separately as Road, so there is a conflict over whether "Mysore" is a City or a Road.
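For question 1, one pragmatic workaround (this is not a built-in Stanford NER feature, just a hedged sketch) is to post-process the tagger's output with a precedence rule: whenever a token the model tagged as Class A also appears in a Class B override list, relabel it as Class B. The override list, label names and token format below are placeholders.

    # Hedged sketch: enforce "Class B wins" as a post-processing step on NER output.
    # The override list and label names are placeholders, not part of the Stanford API.
    PRECEDENCE_OVERRIDES = {"word": "B"}   # tokens that must always end up as Class B

    def apply_precedence(tagged_tokens):
        """tagged_tokens: list of (token, label) pairs produced by the NER model."""
        resolved = []
        for token, label in tagged_tokens:
            override = PRECEDENCE_OVERRIDES.get(token.lower())
            resolved.append((token, override if override is not None else label))
        return resolved

    print(apply_precedence([("word", "A"), ("other", "O")]))  # [('word', 'B'), ('other', 'O')]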


Zero-shot learning

I understand that in zero-shot learning, the classes are divided into seen/unseen categories. We then train the network on, for example, 50 classes and test on the other 50 that the network has not seen. I also understand that the network uses attributes of the unseen classes (I am not sure how they are used). My question is: how does the network classify the unseen classes? Does it actually label each class by its name? For example, if I am doing zero-shot action recognition and the unseen classes are, say, biking, swimming and football, does the network actually name these classes? How does it know their labels?
The network uses the seen classes to learn a relation between images and attributes, or other side information such as human gaze, word embeddings, or anything else that links classes and images. What it learns can then be mapped to new objects via their attributes.
Say your classifier sees images of pigs, dogs, horses and cats, along with their attributes, at training time, and has to classify a zebra at test time. During training it learns the relation between image pixels and attributes such as 'stripes', 'tail', 'black', 'white', and so on.
At test time, given an image and the attribute description of a zebra, you use the classifier to figure out whether they are related. Of course, you could also be given an image of a horse, which looks a lot like a zebra, so your classifier must learn to generalize well.
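As a minimal, hedged sketch of that attribute route (all class names, attribute values and numbers below are invented): learn a mapping from image features to attribute vectors on the seen classes, then label a test image with the unseen class whose attribute vector is closest to the predicted one.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)

    # Per-class attribute vectors, e.g. [stripes, tail, hooves, domestic]; values are invented.
    seen_attrs   = {"horse": [0, 1, 1, 1], "dog": [0, 1, 0, 1],
                    "pig":   [0, 1, 1, 1], "cat": [0, 1, 0, 1]}
    unseen_attrs = {"zebra": [1, 1, 1, 0], "tiger": [1, 1, 0, 0]}

    # Fake "image features" for the seen classes (in practice these would be CNN features).
    X_train, Y_train = [], []
    for name, attrs in seen_attrs.items():
        for _ in range(20):                              # 20 toy images per seen class
            X_train.append(rng.normal(loc=attrs, scale=0.3))
            Y_train.append(attrs)
    X_train, Y_train = np.array(X_train), np.array(Y_train)

    # 1) Learn image features -> attributes on the seen classes only.
    attr_regressor = Ridge(alpha=1.0).fit(X_train, Y_train)

    # 2) At test time, predict the attributes of an unseen-class image and pick the
    #    unseen class whose attribute vector is most similar (cosine similarity).
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def classify_unseen(image_features):
        pred = attr_regressor.predict(image_features.reshape(1, -1))[0]
        return max(unseen_attrs, key=lambda c: cosine(pred, np.array(unseen_attrs[c])))

    test_image = rng.normal(loc=unseen_attrs["zebra"], scale=0.3)
    print(classify_unseen(test_image))                   # expected: "zebra"

The class name here simply comes from the attribute table (zebra, tiger), so the network never invents a label; it only matches predicted attributes against the attribute descriptions you provide for the unseen classes.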

Calculating a confidence score for an entity in NLP named-entity recognition

I am working on named-entity extraction from documents (PDFs). Each PDF contains a set of entities (nearly 16 different entity types).
Here are my steps to build the NLP and ML models:
Step 1: Parsed the documents and got nearly 2 million tokens (words). Used these words with the CBOW method to build a word2vec model.
Step 2: Used the word2vec model to generate vectors for the words in the documents.
Step 3: Based on the domain, labeled the words (vectors) for training, validation and testing.
Step 4: Trained a neural network model with the labeled data.
Step 5: Once the model was built, fed the test data (words) to the model and got 85% accuracy.
Up to here everything was going well. But the problem is in the next step. :(
Step 6: Now I want to build entities, with confidence scores, from the words classified by the trained model.
The neural network model uses softmax to classify its input, so I get a score for each word.
But my question is: my entities contain multiple words (e.g., three). How can I calculate a confidence score for a generated entity?
Right now I am using P(entity) = P(w1)*P(w2)*P(w3) if the entity has three words.
Kindly help me; this approach doesn't always make sense.
For example, if the model predicts only two words in an entity then the entity confidence is P(entity) = P(w1)*P(w2).
And if the model predicts only one word in an entity then P(entity) = P(w1). :(
Why not P(entity) = P(w1)+P(w2)+P(w3)?
If you need a normalized number (0-1), and assuming that each P(w) is in the 0-1 range, make it P(entity) = (P(w1)+P(w2)+P(w3)) / 3.
For a better score, you should calculate the information content of each word. A common word should contribute less: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-S1-S21
The Stanford NLP toolkit uses min(P(wi)) as P(entity). From my perspective, neither approach is entirely sound mathematically.
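To make the options concrete, here is a small, hedged sketch comparing the aggregation schemes discussed above; the per-word probabilities are invented softmax scores.

    import math

    # Invented per-word softmax scores for a three-word entity.
    word_probs = [0.90, 0.75, 0.60]

    product  = math.prod(word_probs)                            # the original approach
    mean     = sum(word_probs) / len(word_probs)                # normalized additive score
    minimum  = min(word_probs)                                  # min-pooling, as in Stanford's toolkit
    geo_mean = math.prod(word_probs) ** (1 / len(word_probs))   # length-insensitive variant of the product

    print(product, mean, minimum, geo_mean)
    # The plain product shrinks as entities get longer; the mean, the min and the
    # geometric mean stay on a comparable scale regardless of entity length.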

What approach to use for a sentence classifier? - distant supervision

I am following the distant supervision approach from the article Distant Supervision for Relation Extraction using Ontology Class Hierarchy-Based Features.
I have already tokenized the sentence, for example:
Her most famous temple, the Parthenon, on the Acropolis in Athens takes its name from that title
and I also have lexical features extracted from this sentence, shown in a table.
The question is how to create a feature vector from this table that can be passed to logistic regression. Or is there another classification method that should be used?
A possible approach could be to use embeddings (for example word2vec). Example:
Embeddings article from TensorFlow
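A hedged sketch of that idea: average the word2vec vectors of the sentence tokens, concatenate any hand-crafted lexical feature values from the table, and feed the resulting vector to scikit-learn's logistic regression. The embedding table, feature values and relation labels below are placeholders.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    EMB_DIM = 50
    rng = np.random.default_rng(0)

    # Placeholder embedding table; in practice these vectors come from a trained word2vec model.
    embeddings = {w: rng.normal(size=EMB_DIM)
                  for w in ["parthenon", "acropolis", "athens", "temple", "mayor"]}

    def embed(token):
        return embeddings.get(token.lower(), np.zeros(EMB_DIM))

    def sentence_to_vector(tokens, lexical_features):
        """Average the token embeddings and append hand-crafted lexical feature values."""
        emb = np.mean([embed(t) for t in tokens], axis=0)
        return np.concatenate([emb, np.asarray(lexical_features, dtype=float)])

    # Toy training set: (tokens, lexical feature values, relation label); all invented.
    train = [
        (["the", "Parthenon", "on", "the", "Acropolis", "in", "Athens"], [1, 0], "locatedIn"),
        (["Athens", "is", "led", "by", "its", "mayor"],                  [0, 1], "leaderOf"),
    ]
    X = np.array([sentence_to_vector(tokens, feats) for tokens, feats, _ in train])
    y = [label for _, _, label in train]

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.predict(X))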

Methods to ignore missing word features on test data

I'm working on a text classification problem, and I have problems with missing values on some features.
I'm calculating class probabilities of words from labeled training data.
For example:
Say the word foo occurs in class A 100 times and in class B 200 times. In this case, I compute the class probability vector as [0.33, 0.67] and give it, along with the word itself, to the classifier.
The problem is that the test set contains some words that were not seen in the training data, so they have no probability vectors.
What can I do about this problem?
I've tried using the average class probability vector over all words for missing values, but it did not improve accuracy.
Is there a way to make the classifier ignore some features during evaluation, just for the specific instances that do not have a value for a given feature?
Regards
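For reference, the probability vector described above can be computed directly from the per-class counts; a minimal sketch using the foo example (the counts are the question's toy numbers):

    from collections import Counter

    # Word -> per-class counts observed in the labeled training data.
    class_counts = {"foo": Counter({"A": 100, "B": 200})}

    def class_prob_vector(word, classes=("A", "B")):
        counts = class_counts.get(word)
        if counts is None:
            return None                          # unseen word: no probability vector (the problem above)
        total = sum(counts[c] for c in classes)
        return [counts[c] / total for c in classes]

    print(class_prob_vector("foo"))              # -> [0.333..., 0.666...]
    print(class_prob_vector("unseen_word"))      # -> None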
There are many ways to achieve that.
Create and train classifiers for every subset of features you have. You can train each subset classifier on the same data as the main classifier.
For each sample, just look at the features it has and use the classifier that fits it best. Don't try to do boosting with those classifiers.
Alternatively, just create a special class for samples that can't be classified, or for cases where your experiments show results that are too poor with so few features.
Sometimes even humans can't successfully classify a sample. In many cases, samples that can't be classified should simply be ignored: the problem is not in the classifier but in the input, or can be explained by the context.
From an NLP point of view, many words have a meaning/usage that is very similar across applications, so you can use stemming/lemmatization to create classes of words.
You can also use syntactic corrections, synonyms, or translations (does the word come from another part of the world?).
If this problem is important enough to you, you will probably end up with a combination of the three previous points.
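A hedged sketch of the stemming back-off idea from the last points: when a test word has no probability vector, fall back to the vectors of training words that share its stem, and only then to a neutral prior. The stemmer here is NLTK's Porter stemmer; the back-off order is just one reasonable choice, and the probability vectors are invented.

    from collections import defaultdict
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    # Probability vectors estimated from training counts (invented values).
    word_probs = {"running": [0.33, 0.67], "run": [0.40, 0.60]}

    # Index training vectors by stem so morphological variants can share statistics.
    stem_probs = defaultdict(list)
    for w, p in word_probs.items():
        stem_probs[stemmer.stem(w)].append(p)

    def lookup(word, n_classes=2):
        if word in word_probs:                     # seen exactly as-is
            return word_probs[word]
        stem = stemmer.stem(word)
        if stem in stem_probs:                     # unseen, but a training word shares its stem
            vecs = stem_probs[stem]
            return [sum(col) / len(vecs) for col in zip(*vecs)]
        return [1.0 / n_classes] * n_classes       # truly unseen: fall back to a uniform prior

    print(lookup("runs"))                          # backs off to the shared "run" stem statistics
    print(lookup("zxqv"))                          # -> [0.5, 0.5]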

Using C4.5 classifier with multiple outcomes

I'm looking at the C4.5 classifier for a machine learning task. I have a large dataset containing city names and need to differentiate between, e.g., London (Ontario), London (England), or even London in Burgundy, France, by looking at features from the surrounding text, e.g. zip codes and state names, even when "Canada" or "England" are not mentioned. I also have access to metadata such as dialing codes, which can help determine which country it is.
Once trained, I want to run the classifier on the large dataset.
In all the examples I have found, there are only two states for the result (in the golf example: play or don't play).
Can the C4.5 classifier handle London (Canada), London (England) and London (France) as result classes, or do I need separate classifiers for London (Canada) True/False, etc.?
I see two options in your case.
The first approach is a straightforward extension to C4.5. In each leaf node, you keep all the labels instead of just the majority label. For example, as shown in the figure, red labels are actually present in three different leaves. When you query the data point indicated by the arrow, the outputs are three labels (green, red and blue) together with their corresponding conditional probabilities p(c|v) (given features x1 and x2, what is the probability that data point x belongs to class c?).
The second approach is to generate multiple decision trees, hence a random forest. The randomness can be injected by randomly sampling the subset of training data made available to each individual tree. At classification time, you aggregate the votes from all decision trees to get multi-class classification results.
The figures are borrowed from this excellent tutorial on multi-class classification by Andrew Zisserman.
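As a minimal, hedged illustration of both options with scikit-learn (whose trees are CART rather than C4.5, but the multi-class behaviour is the same idea): a single decision tree handles more than two classes natively, and predict_proba exposes the per-class probabilities from the leaf; a random forest aggregates votes over many randomized trees. The feature columns and values below are invented.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    # Invented features: [has_uk_postcode, has_ca_postal_code, dialing_44, dialing_1, dialing_33]
    X = np.array([
        [1, 0, 1, 0, 0], [1, 0, 0, 0, 0],   # London (England)
        [0, 1, 0, 1, 0], [0, 0, 0, 1, 0],   # London (Canada)
        [0, 0, 0, 0, 1], [0, 0, 0, 0, 1],   # London (France)
    ])
    y = ["London (England)", "London (England)",
         "London (Canada)",  "London (Canada)",
         "London (France)",  "London (France)"]

    # Option 1: one multi-class decision tree; leaves expose class probabilities.
    tree = DecisionTreeClassifier().fit(X, y)
    print(tree.predict([[1, 0, 1, 0, 0]]))            # -> ['London (England)']
    print(tree.predict_proba([[1, 0, 1, 0, 0]]))      # conditional probabilities p(c|v)

    # Option 2: a random forest, aggregating votes from many randomized trees.
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(forest.predict([[0, 0, 0, 0, 1]]))          # -> ['London (France)']

In either case a single model covers all the London variants, so separate True/False classifiers per city are not needed.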

Resources