This question already has answers here:
How to get most informative features for scikit-learn classifiers?
(9 answers)
Closed 1 year ago.
Is it possible to use tfidf (tfidfvectorizer in Python) to figure out which words are most important when trying to distinguish between two text classes (i.e., positive or negative sentiment, etc.)? For example, which words were most important for identifying the positive class, and then separately, which were most useful for identifying the negative class?
You can let scikit-learn do the heavy lifting: train a random forest on your binary classification data, extract the classifier's feature importance ranking, and use it to get the most important words:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# data: the tf-idf matrix from the vectorizer, labels: the class labels
clf = RandomForestClassifier()
clf.fit(data, labels)

# rank the features by importance, highest first
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
top_words = [feature_names[i] for i in indices[:100]]
Note that this will only tell you which words are the most important - not what they say about each category. To see what each word says about each class, you can classify the individual words on their own and look at which class they are assigned to.
Another option is to take all positive/negative data samples, remove the word you are trying to understand from them, and see how this affects the classification of the samples.
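For instance, a minimal sketch of that ablation idea, assuming clf and vectorizer are the fitted classifier and tf-idf vectorizer from above (the probe word and sample documents are hypothetical):

# compare predicted class probabilities with and without a given word
word = "excellent"  # hypothetical word to probe
docs = ["the movie was excellent and moving"]  # hypothetical positive sample

ablated = [" ".join(t for t in d.split() if t != word) for d in docs]

original_probs = clf.predict_proba(vectorizer.transform(docs))
ablated_probs = clf.predict_proba(vectorizer.transform(ablated))

# a large drop in the positive-class probability suggests the word pushes
# samples towards the positive class, and vice versa
print(original_probs - ablated_probs)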
Related
I have a basic question. Suppose I am training an image classifier for cats and dogs, but I need extra functionality: if an image does not belong to either category, how do I detect that? Some of the options I was thinking of were:
Instead of 2 neurons, I add a 3rd neuron to the last layer and make my training labels y a one-hot encoding of 3 labels, the 3rd being "not a cat or a dog". I would use some random examples for this 3rd class.
I will use only 2 neurons and apply some probability threshold to decide which class my image should belong to.
However, I do not think either of these methods is viable.
Can anyone suggest a good technique to handle images which do not belong to my training categories?
Before going into the solution I would first comment on the solutions proposed in the question. The first would work better than the second. This is because it is very hard to interpret the (probability) values of the neural network output: closeness of the values might simply be caused by similarity of the classes involved (in this case a dog might look like a cat), and sometimes you may end up with unseen classes being assigned to one of the known classes with high probability.
Most supervised classification machine learning algorithms are designed to map an input to one of a fixed number of classes. This type of classification is called closed-world classification.
E.g.
MNIST - handwritten digit classification
Cat - Dog classification
When classification involves unlabeled/unknown classes, the approach is called open-world classification. Various papers have been published on it [1, 2, 3].
I will explain my solution using the approach proposed in [3].
There are two options for applying open-world classification (from here on, OWC) to the problem in question:
Classifying all new classes as a single class.
Classifying all new classes as a single class, then further grouping similar samples into one class and dissimilar samples into different classes.
1. Classifying all new classes as a single class
Although many types of model could fit this type of classification (one of them being the first solution proposed in the question), I will discuss the model of [3]. Here the network first decides whether to classify or to reject the input. Ideally, if the sample is from a seen class the network classifies it into one of the seen classes; otherwise the network rejects it. The authors of [3] call this network the Open Classification Network (OCN). A Keras implementation of the OCN could look like this (I've simplified the network to focus on the outputs of the model):
from tensorflow import keras

inputs = keras.layers.Input(shape=(28, 28, 1))
x = keras.layers.Conv2D(64, 3, activation="relu")(inputs)
x = keras.layers.Flatten()(x)
embedding = keras.layers.Dense(256, activation="linear", name="embedding_layer")(x)

# one head decides whether to reject the input, the other classifies it into the seen classes
reject_output = keras.layers.Dense(1, activation="sigmoid", name="reject_layer")(embedding)
classification_output = keras.layers.Dense(num_of_classes, activation="softmax", name="classification_layer")(embedding)

ocn_model = keras.models.Model(inputs=inputs, outputs=[reject_output, classification_output])
The model is trained in a way that jointly optimizes both reject_output and classification_output losses.
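A minimal sketch of how that joint optimization could be wired up in Keras, assuming one loss per output head (the loss choices, weights, and label arrays here are assumptions, not the exact setup from [3]):

ocn_model.compile(
    optimizer="adam",
    loss={"reject_layer": "binary_crossentropy",
          "classification_layer": "categorical_crossentropy"},
    loss_weights={"reject_layer": 1.0, "classification_layer": 1.0},
)
# y_reject: 0/1 "seen vs. reject" labels, y_class: one-hot class labels (both hypothetical)
# ocn_model.fit(x_train, {"reject_layer": y_reject, "classification_layer": y_class})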
2. Classifying all new classes as a single class, then further grouping similar samples
The authors of [3] used another network to find the similarity between samples. They called this network the Pairwise Classification Network (PCN). The PCN classifies whether two inputs are from the same class or from different classes. We can reuse the embedding from the first solution and a pairwise similarity metric to build the PCN. In the PCN, the weights are shared between the two inputs. This could be implemented in Keras as follows:
# shared embedding sub-network (the weights are shared between the two inputs)
embedding_model = keras.models.Sequential([
    keras.layers.Conv2D(64, 3, activation="relu", input_shape=(28, 28, 1)),
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="linear", name="embedding_layer"),
])

input1 = keras.layers.Input(shape=(28, 28, 1))
input2 = keras.layers.Input(shape=(28, 28, 1))
embedding1 = embedding_model(input1)
embedding2 = embedding_model(input2)

# compare the two embeddings and predict whether the inputs share a class
merged = keras.layers.Concatenate()([embedding1, embedding2])
output = keras.layers.Dense(1, activation="sigmoid")(merged)
pcn_model = keras.models.Model(inputs=[input1, input2], outputs=output)
The PCN model is trained so that pairs from the same class end up close together and pairs from different classes end up far apart.
After the PCN is trained, an auto-encoder is trained to learn useful representations of the unseen classes. A clustering algorithm is then used to group (cluster) the unseen classes, using the PCN model as the distance function.
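A rough sketch of that last step, assuming pcn_model from above and an array unseen_x of rejected samples (the array, the "1 minus probability" distance, and the choice of hierarchical clustering are illustrative assumptions):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

n = len(unseen_x)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        # the PCN outputs the probability that two samples share a class;
        # turn it into a distance so that similar pairs end up close together
        same_prob = pcn_model.predict([unseen_x[i:i + 1], unseen_x[j:j + 1]])[0, 0]
        dist[i, j] = dist[j, i] = 1.0 - same_prob

# hierarchical clustering on the precomputed pairwise distances
clusters = fcluster(linkage(squareform(dist), method="average"), t=0.5, criterion="distance")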
I recently studied how word2vec works: it converts words into numerical form, so when we plot them or place them in a vector space they are spread out and reveal the relationships between words.
My question is this: I also came across RNNs and suddenly became confused. Is word2vec an alternative to RNNs, or can I use word2vec to turn the words into numeric form and then use them with RNNs?
I mean, both of them predict the next word, so I want to know whether they are different approaches to the same problem or whether I can use them together.
NOTE: I come from computer vision and am just starting in NLP, so please bear with my question. Thanks in advance.
You have not understood the meaning of word2vec clearly. word2vec is a representation of words in a multi-dimensional space, while an RNN is an algorithm, like linear regression, random forest, or logistic regression. word2vec does NOT predict the next word. Here is a small explanation of word2vec:
Take three words: apple, orange, and car. Suppose they are represented in word2vec as:
apple  = [0.01, 0.04, ...]
orange = [0.02, 0.06, ...]
car    = [0.03, 0.09, ...]
Now you know apple and orange are similar to each other while car is not. So if you take the dot product of apple and orange, the result will be close to 1, say 0.85, but if you take the dot product of apple and car, the result will be far from 1, say 0.25. This is the concept of word2vec: it gives you a numerical vector representation of words such that similar words end up near each other in the vector space.
Now for the RNN: as I said, it's an algorithm. You feed some numerical data to it and it gives you some output. You need to learn about RNNs in detail from an online tutorial.
To answer your question about how to use them together: an RNN takes numerical inputs. It can't take English words directly, so we need to convert all words into some kind of numerical form. This is where word2vec comes into the picture. You take each word, get its numerical representation from word2vec (like I showed above for apple, orange, and car), and then feed it to the RNN.
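A minimal sketch of that pipeline, assuming pretrained Google News vectors loaded with gensim (the file name, the padding scheme, and the tiny model are illustrative assumptions):

import numpy as np
from gensim.models import KeyedVectors
from tensorflow import keras

# load pretrained word2vec vectors (the file name is hypothetical)
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def sentence_to_sequence(sentence, max_len=20):
    # turn a sentence into a (max_len, 300) array of word vectors, zero-padded
    vectors = [w2v[w] for w in sentence.lower().split() if w in w2v]
    vectors = vectors[:max_len]
    padding = [np.zeros(300)] * (max_len - len(vectors))
    return np.array(vectors + padding)

x = np.stack([sentence_to_sequence("the book is good"),
              sentence_to_sequence("the book is bad")])

# a tiny RNN that consumes the sequence of word vectors
model = keras.models.Sequential([
    keras.layers.LSTM(64, input_shape=(20, 300)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")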
This is just a simple overview and it's not possible to explain everything here. If you really want to learn more, I would strongly suggest you take this course. Everything from word2vec to RNNs is explained beautifully there. It would be even better if you completed the whole specialisation instead of only this course.
I want to classify a collection of texts into two classes, let's say for sentiment classification. I have two pre-made sentiment dictionaries, one containing only positive words and the other containing only negative words. I would like to incorporate these dictionaries into the feature vector for an SVM classifier. My question is: is it possible to keep the positive and negative word dictionaries separate when building the SVM feature vectors, especially when I generate the feature vectors for the test set?
If my explanation is not clear enough, let me give an example. Let's say I have these two sentences as training data:
Pos: The book is good
Neg: The book is bad
The word 'good' exists in the positive dictionary and 'bad' exists in the negative dictionary, while the other words do not exist in either dictionary. I want the words that appear in the dictionary matching the sentence's class to have a large weight, while the other words have a small weight. So, the feature vectors would look like this:
+1 1:0.1 2:0.1 3:0.1 4:0.9
-1 1:0.1 2:0.1 3:0.1 5:0.9
If I want to classify a test sentence "The food is bad", how should I generate a feature vector for the test set with weights that depend on the dictionaries, given that I cannot match the test sentence's class against either dictionary? The best I can think of is, for the test set, to give a word a high weight as long as it exists in either dictionary.
0 1:0.1 3:0.1 5:0.9
I wonder if this is the right way to create the vector representation for both the training set and the test set.
--Edit--
I forgot to mention that these pre-made dictionaries were extracted using some kind of topic model. For example, the top 100 words from topic 1 roughly represent the positive class and the words in topic 2 represent the negative class. I want to use this kind of information to improve the classifier beyond using only bag-of-words features.
In short - this is not the way it works.
The whole point of learning is to give the classifier the ability to assign these weights on its own. You cannot "force it" to have a high value per class for a particular feature (well, you could at the optimization level, but this would require changing the whole SVM formulation).
So the right way is simply to create a "normal" representation, without any additional specification, and let the model decide; models are better at statistical analysis than human intuition, really.
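For example, a plain tf-idf representation fed straight into a linear SVM could look like this (a minimal sketch using the sentences from the question; the pipeline settings are assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["The book is good", "The book is bad"]
train_labels = [1, -1]

# the vectorizer builds the features automatically; the SVM learns their weights
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["The food is bad"]))  # should lean towards the negative class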
Core question: What are the right way(s) of using word embeddings to represent text?
I am building a sentiment classification application for tweets, classifying tweets as negative, neutral, or positive.
I am doing this using Keras on top of Theano, with word embeddings (Google's word2vec or Stanford's GloVe).
To represent the tweet text I have done the following:
Used a pre-trained model (such as the word2vec-twitter model) [M] to map words to their embeddings.
Used the words in the text to query M for the corresponding vectors. So if the tweet (T) is "Hello world", M gives vectors V1 and V2 for the words 'Hello' and 'World'.
The tweet T can then be represented as a vector V, either V1+V2 (adding the vectors) or V1V2 (concatenating the vectors) [these are 2 different strategies]. [Concatenation means juxtaposition, so if V1 and V2 are d-dimensional vectors, in my example T becomes a 2d-dimensional vector.]
Then, the tweet T is represented by vector V.
If I follow the above, then my dataset is nothing but vectors (sums or concatenations of word vectors, depending on which strategy I use).
I am training a deep net such as an FFN or LSTM on this dataset, but my results aren't coming out great.
Is this the right way to use word embeddings to represent text? Are there better ways?
Your feedback/critique will be of immense help.
I think that, for your purpose, it is better to think about another way of composing those vectors. The literature on word embeddings contains criticisms of these kinds of composition (I will edit the answer with the correct references as soon as I find them).
I would suggest you also consider other possible approaches, for instance:
Using the single word vectors as input to your net (I do not know your architecture, but the LSTM is recurrent so it can deal with sequences of words).
Using a full paragraph embedding (e.g. https://cs.stanford.edu/~quocle/paragraph_vector.pdf), sketched below.
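A minimal sketch of the paragraph-embedding route using gensim's Doc2Vec (the corpus, tokenization, and parameters are illustrative assumptions):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tweets = ["hello world", "i love this phone", "worst service ever"]  # hypothetical corpus
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(tweets)]

# train a small paragraph-vector model
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

# any tweet, seen or new, becomes one fixed-size vector to feed to the classifier
vector = model.infer_vector("hello world".split())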
Summing them doesn't make much sense, to be honest, because by summing them you get another vector which I don't think represents the semantics of "Hello World" - or maybe it does, but it certainly won't hold for longer sentences in general.
Instead, it would be better to feed them in as a sequence, since that at least preserves the word order in a meaningful way, which seems to fit your problem better.
E.g. "A hates apple" vs "Apple hates A": this difference is captured when you feed them as a sequence into an RNN, but their summations are the same, as the small demo below shows.
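A tiny demo of that point with made-up word vectors (the vectors are arbitrary; only the order-invariance of the sum matters):

import numpy as np

# hypothetical 2-dimensional word vectors
vec = {"a": np.array([0.1, 0.3]),
       "hates": np.array([0.7, 0.2]),
       "apple": np.array([0.4, 0.9])}

sent1 = ["a", "hates", "apple"]
sent2 = ["apple", "hates", "a"]

# as sums, the two sentences are indistinguishable ...
print(np.allclose(sum(vec[w] for w in sent1), sum(vec[w] for w in sent2)))  # True

# ... but as sequences (what an RNN sees) they differ
print(np.array([vec[w] for w in sent1]))
print(np.array([vec[w] for w in sent2]))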
I hope you get my point!
I'm working on a text classification problem, and I have problems with missing values on some features.
I'm calculating class probabilities of words from labeled training data.
For example:
Say the word foo belongs to class A 100 times and to class B 200 times. In this case, I compute its class probability vector as [0.33, 0.67] and give it, along with the word itself, to the classifier.
The problem is that, in the test set, there are some words that have not been seen in the training data, so they have no probability vectors.
What can I do about this problem?
I've tried using the average class probability vector over all words for the missing values, but it did not improve accuracy.
Is there a way to make the classifier ignore some features during evaluation, just for the specific instances which do not have a value for a given feature?
Regards
There are many ways to achieve that:
Create and train classifiers for every sub-set of features you have. You can train these sub-set classifiers on the same data as the main classifier.
For each sample, just look at the features it has and use the classifier that fits it best. Don't try to do boosting with those classifiers.
Alternatively, just create a special class for samples that can't be classified, or for which the results you have observed are too poor with so few features.
Sometimes humans can't successfully classify such samples either. In many cases, samples that can't be classified should simply be ignored; the problem is not in the classifier but in the input, or can be explained by the context.
From an NLP point of view, many words have a meaning/usage that is very similar across applications, so you can use stemming/lemmatization to create classes of words.
You can also use spelling/syntactic corrections, synonyms, and translations (does the word come from another part of the world?).
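A small sketch of the stemming idea with NLTK, assuming the class-probability table from the question is keyed by stem (the table and lookup scheme are hypothetical):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# hypothetical class-probability table built from the training data, keyed by stem
class_probs = {stemmer.stem("walk"): [0.33, 0.67]}

def lookup(word):
    # fall back to the stemmed form when the exact word was never seen in training
    return class_probs.get(word, class_probs.get(stemmer.stem(word)))

print(lookup("walking"))  # maps onto the seen stem "walk" -> [0.33, 0.67]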
If this problem is important enough for you, you will end up with a combination of the 3 previous points.