word2vec how to get words from vectors? - machine-learning

I use ANN to predict words from words. The input and output are all words vectors. I do not know how to get words from the output of ANN. By the way, it's gensim I am using

You can find cosine similarity of the vector with all other word-vectors to find the nearest neighbors of your vector.
The nearest neighbor search on an n-dimensional space, can be brute force, or you can use libraries like FLANN, Annoy, scikit-kdtree to do it more efficiently.
update
Sharing a gist demonstrating the same:
https://gist.github.com/kampta/139f710ca91ed5fabaf9e6616d2c762b

Related

What is the best way to handle missing words when using word embeddings?

I have a set of pre-trained word2vec word vectors and a corpus. I want to use the word vectors to represent words in the corpus. The corpus has some words in it that I don't have trained word vectors for. What's the best way to handle those words for which there is no pre-trained vector?
I've heard several suggestions.
use a vector of zeros for every missing word
use a vector of random numbers for every missing word (with a bunch of suggestions on how to bound those randoms)
an idea I had: take a vector whose values are the mean of all values in that position from all pre-trained vectors
Anyone with experience with the problem have thoughts on how to handle this?
FastText from Facebook assembles word vectors from subword n-grams which allows it to handle out of vocabulary words. See more about this approach at: Out of Vocab Word Embedding
In a pre-trained word2vec embedding matrix, you can usually use word unk as index to find a predesigned vector which is often the best vector.

Is it possible to use KDTree with cosine similarity?

Looks like I can't use this similarity metric for with sklearn KDTree, for example, but I need because I am using measuring words vectors similarity. What is fast robust customization algorithm for this case? I know about Local Sensitivity Hashing, but it should tunned & tested up a lot to find params.
The ranking your would get with cosine similarity is equivalent to the rank order of the euclidean distance when you normalize all the data points first. So you can use a KD tree to the the k nearest neighbors with KDTrees, but you will need to recompute what the cosine similarity is.
The cosine similarity is not a distance metric as normally presented, but it can be transformed into one. If done, you can then use other structures like Ball Trees to do accelerated nn with cosine similarity directly. I've implemented this in the JSAT library, if you were interested in a Java implementation.
According to the table at the end of this page, cosine support eoth k-d-tree should be possible: ELKI supports cosine with the R-tree, and you can derive bounding rectangles for the k-d-tree, too; and the k-d-tree supports at least five metrics in that table. So I do not see why it shouldn't work.
Indexing support in sklearn often is not very complete (albeit improving), unfortunately; so don't take that as a reference.
While the k-d-tree can theoretically support Cosine by
transforming the data such that Cosine becomes Euclidean distance
working with the bounding boxes and the minimum angle to the bounding box (that appears to be what ELKI is doing for the R-tree)
You should be aware that the k-d-tree does not work very well with high-dimensional data, and cosine is mostly popular for very high-dimensional data. A k-d-tree always only looks at one dimension. If you want all d dimension to be used once, you need O(2^d) data points. For high d, there is no way all attributes are used.
The R-tree is slightly better here because it uses bounding boxes; these shrink with every split in all dimensions, so the pruning does get better. But this also means it needs a lot of memory for such data, and the tree construction may suffer from the same problem.
So in essence, don't use either for high dimensional data.
But also don't assume that Cosine does magically improve your results, in particular for high-d data. It's very much overrated. As above transformation indicates, there cannot be a systematic benefit of Cosine over Euclidean: Cosine is a special case of Euclidean.
For sparse data, inverted lists (c.f. Lucene, Xapian, Solr, ...) are the way to index for cosine.

Addressing synonyms in Supervised Learning for Text Classification

I am using scikit-learn supervised learning method for text classification. I have a training dataset with input text fields and the categories they belong to. I use tf-idf, SVM classifier pipeline for creating the model. The solution works well for normal testcases. But if a new text is entered which has synoynmous words as in the training set, the solution fails to classify correctly.
For e.g: the word 'run' might be there in the training data but if I use the word 'sprint' to test, the solution fails to classify correctly.
What is the best approach here? Adding all synonyms for all words in training dataset doesn't look like a scalable approach to me
You should look into word vectors and dense document embeddings. Right now you are passing scikit-learn a matrix X, where each row is a numerical representation of a document in your dataset. You are getting this representation with tf-idf but as you noticed this doesn't capture word similarities and you are also having issues with out of vocabulary words.
A possible improvement is to represent each word with a dense vector of lets say dimension 300, in such a way that words with similar meaning are close in this 300 dimensional space. Fortunately you don't need to build these vectors from scratch (look up gensim word2vec and spacy). Another good thing is that by using word embeddings pre-trained on very large corpus like Wikipedia you are incorporating a lot of linguistic information about the world into your algorithm that you couldn't infer from your corpus otherwise (like the fact that sprint and run are synonyms).
Once you get good and semantic numeric representation for words you need to get a vector representation for each document. The simplest way would be to average the word vectors of each word in the sentence.
Example pseudocode to get you started:
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc1 = nlp('I had a good run')
>>> doc1.vector
array([ 6.17495403e-02, 2.07064897e-02, -1.56451517e-03,
1.02607915e-02, -1.30429687e-02, 1.60102192e-02, ...
Now lets try a different document:
>>> doc2 = nlp('I had a great sprint')
>>> doc2.vector
array([ 0.02453461, -0.00261007, 0.01455955, -0.01595449, -0.01795897,
-0.02184369, -0.01654281, 0.01735667, 0.00054854, ...
>>> doc2.similarity(doc1)
0.8820845113100807
Note how the vectors are similar (in the sense of cosine similarity) even when the words are different. Because the vectors are similar, a scikit-learn classifier will learn to assign them to the same category. With a tf-idf representation this would not be the case.
This is how you can use these vectors in scikit-learn:
X = [nlp(text).vector for text in corpus]
clf.fit(X, y)

Does word2vec make sense for supervised learning?

I have a list of sentence/label pairs to train the model, how should I encode the sentences as input to, say an SVM?
Are the sentences in the same language? You could start with the pretrained word2vec file that you can download from Google if it's English. Pay attention to how the train file was created, whether stemming was applied, etc. It's also somewhat important from which corpus it was generated; you'd get different results if this is from newsgroups or if it was extracted from the web or from more formal text.
Word2Vec basically encodes every word into a higher dimensional vector space. This is usually 200,300 or 500 dimensions large. After it is trained, then the "test" sentences are basically bag of words and need not be in any order.
You'd then, for each word in the bag of words, figure out the corresponding word2vec vector. Then you can create features by averaging the vectors, taking the 'minimum', the 'maximum' and if you're comparing text, look at calculating the cosine similarity between vectors. Then use those features in an SVM.

Do all machine learning algorithm usage Word frequency as a feature?

I use scikit learn for classification. And mainly work with NAive bayes, SVM, Neural network. There are variant in each of them.
I see for training algo create vectors. What does this vector contains?
For all algorithm does it consider word frequency as a feature? If yes then how they differ?
For text classification you usually create a vector of words frequency, or tf-idf to be able to compute distances between two documents. You could use all kinds of method to create these weights on word.
The words (features) can be extracted by just a splitting the documents on separator but you can use more complex methods like stemming (keep only the root of the words).
You will find lots of example in the sklearn documentation. For instance :
http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html
This iPython Notebook could be a good start too.

Resources