load pre-trained word2vec model for doc2vec - machine-learning

I'm using gensim to extract feature vector from a document.
I've downloaded the pre-trained model from Google named GoogleNews-vectors-negative300.bin and I loaded that model using the following command:
model = models.Doc2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
My purpose is to get a feature vector from a document. For a word, it's very easy to get the corresponding vector:
vector = model[word]
However, I don't know how to do it for a document. Could you please help?

A set of word vectors (such as GoogleNews-vectors-negative300.bin) is neither necessary nor sufficient for the kind of text vectors (Le/Mikolov 'Paragraph Vectors') created by the Doc2Vec class. It instead expects to be trained with example texts to learn per-document vectors. Then, also, the trained model can be used to 'infer' vectors for other new documents.
(The Doc2Vec class only supports the load_word2vec_format() method because it inherits from the Word2Vec class – not because it needs that functionality.)
There's another simple kind of text vector that can be created by averaging the vectors of all the words in a document, perhaps weighted by some per-word significance measure. But that's not what Doc2Vec provides.
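For reference, a minimal sketch of the intended Doc2Vec workflow (the toy documents below are placeholders for your own corpus, and model.dv is spelled model.docvecs in older gensim versions):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training text gets a unique tag; replace these with your own documents.
docs = [
    TaggedDocument(words=['machine', 'learning', 'with', 'gensim'], tags=['doc0']),
    TaggedDocument(words=['feature', 'vectors', 'from', 'documents'], tags=['doc1']),
]

# Train per-document vectors from the example texts.
model = Doc2Vec(docs, vector_size=100, min_count=1, epochs=40)

# Look up a vector learned during training ...
trained_vec = model.dv['doc0']

# ... or infer a vector for a new, unseen document.
new_vec = model.infer_vector(['a', 'brand', 'new', 'document'])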

I tried this:
model = models.Doc2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
and it gives me an error saying that Doc2Vec does not contain any word2vec format.

Related

Multi-Class Text Classification (using TFIDF and SVM). How to implement a scenario where one feedback may belong to more than one class?

I have a file of raw feedbacks that need to be labeled (categorized) and then serve as the training input for an SVM classifier (or any classifier for that matter).
But the catch is, I'm not assigning a whole feedback to a certain category. One feedback may belong to more than one category based on the topics it talks about (noun n-grams are extracted). So I'm labeling the topics (terms), not the feedbacks (documents). I've extracted the n-grams using TFIDF and saved their features so I could train my model on them. The problem is that TFIDF returns a document-term matrix as train_x, but on the other side I have train_y: the labels assigned to each n-gram (not to the whole document). So I've ended up with a document-term matrix of x rows (number of documents) against labels for y n-grams (number of unique topics extracted).
Below is a sample of what the data look like. Blue is the n-grams (extracted by TFIDF) while red is the labels/categories (calculated for each n-gram with a function I've written manually).
Instead of putting code, this is my strategy for implementing my concept:
The problem lies in the part where TFIDF produces x_train = tf.transform(feedbacks), which is a document-term matrix; it doesn't make sense for it to be the classifier's input against y_train, which holds the labels for the terms, not the documents. I've tried transposing the matrix, which gave me an error. I've tried feeding in a 1-D array that holds only the feature values for the terms directly, which also gave an error because the classifier expects X to be in (sample, feature) format. I'm using Sklearn's SVM and TfidfVectorizer.
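To make the shape mismatch concrete, here is a minimal sketch of the situation being described (the feedbacks and labels are placeholders, not the real data):

from sklearn.feature_extraction.text import TfidfVectorizer

feedbacks = ['slow delivery and bad packaging', 'great price but slow delivery']
tf = TfidfVectorizer(ngram_range=(1, 2))

x_train = tf.fit_transform(feedbacks)   # document-term matrix: (n_documents, n_terms)
terms = tf.get_feature_names_out()      # the extracted n-grams, length n_terms
y_train = ['logistics'] * len(terms)    # placeholder: one label per n-gram, not per document

# An SVM's fit(x_train, y_train) fails here: x_train has n_documents rows,
# while y_train has n_terms entries.
print(x_train.shape, len(y_train))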
Simply put, I want to be able to use an SVM classifier on a list of terms (n-grams) against a list of labels to train the model, and then test new data (after cleaning it and extracting its n-grams) so the SVM can predict its labels.
The solution might be something very technical, like using another classifier that expects a different format, or not using TFIDF since it's document-focused; or it might be something broader, a whole change of approach and concept (if mine is wrong).
I'd very much appreciate it if someone could help.

After training word embedding with gensim's fasttext's wrapper, how to embed new sentences?

After reading the tutorial in gensim's docs, I do not understand the correct way of generating new embeddings from a trained model. So far I have trained gensim's FastText embeddings like this:
from gensim.models.fasttext import FastText as FT_gensim
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(corpus_file=corpus_file)
# train the model
model_gensim.train(
    corpus_file=corpus_file, epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
)
Then, let's say I want to get the embedding vectors associated with these sentences:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
How can I get them with model_gensim that I trained previously?
You can look up each word's vector in turn:
wordvecs_obama = [model_gensim[word] for word in sentence_obama]
For your 7-word input sentence, you'll then have a list of 7 word-vectors in wordvecs_obama.
FastText models do not, as part of their inherent functionality, convert longer texts into single vectors. (And specifically, the model you've trained doesn't have a default way of doing that.)
There is a "classification mode" in the original Facebook FastText code that involves a different style of training, where texts are associated with known labels at training time, and all the word-vectors of the sentence are combined during training, and when the model is later asked to classify new texts. But, the gensim implementation of FastText does not currently support this mode, as gensim's goal has been to supply unsupervised rather than supervised algorithms.
You could approximate what that FastText mode does by averaging together those word-vectors:
import numpy as np
meanvec_obama = np.array(wordvecs_obama).mean(axis=0)
Depending on your ultimate purposes, something like that might still be useful. (But that average wouldn't be as useful for classification as if the word-vectors had originally been trained for that goal, with known labels, in that FastText mode.)

Gensim Doc2Vec.infer_vector() equivalent in KeyedVector

I have a working app using Doc2Vec from gensim. I know KeyedVectors is now the recommended approach, and I am trying to port over; however, I am not sure what the equivalent of Doc2Vec's infer_vector method is.
Or, better put, how do I obtain a vector for an entire document using the KeyedVectors model, so I can write it to my Annoy index?
KeyedVectors doesn't replace Doc2Vec, it's a storage and index system for word vectors:
Word vector storage and similarity look-ups. Common code independent of the way the vectors are trained (Word2Vec, FastText, WordRank, VarEmbed etc.)
The word vectors are considered read-only in this class.
This class doesn't know anything about tagged documents, and it can't implement infer_vector or an equivalent, because that procedure requires training and the whole idea of KeyedVectors is to abstract away from the training method.
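So the document vectors still have to come from a trained Doc2Vec model; a minimal sketch of feeding inferred vectors into an Annoy index (the tiny corpus and index parameters are placeholders):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from annoy import AnnoyIndex

# Placeholder corpus; in practice the Doc2Vec model is trained on your real documents.
docs = [TaggedDocument(['some', 'example', 'text'], ['doc0']),
        TaggedDocument(['another', 'example', 'document'], ['doc1'])]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Index the document vectors in Annoy.
index = AnnoyIndex(model.vector_size, 'angular')
for i, doc in enumerate(docs):
    index.add_item(i, model.infer_vector(doc.words))
index.build(10)

# Query with a vector inferred for a brand-new document.
query_vec = model.infer_vector(['a', 'new', 'document'])
print(index.get_nns_by_vector(query_vec, 2))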

Character-Word Embeddings from lm_1b in Keras

I would like to use some pre-trained word embeddings in a Keras NN model, which have been published by Google in a very well known article. They have provided the code to train a new model, as well as the embeddings here.
However, it is not clear from the documentation how to retrieve an embedding vector for a given string of characters (a word) with a simple Python function call. Much of the documentation seems to center on dumping vectors to a file for an entire sentence, presumably for sentiment analysis.
So far, I have seen that you can feed in pretrained embeddings with the following syntax:
embedding_layer = Embedding(number_of_words??,
                            output_dim=128??,
                            weights=[pre_trained_matrix_here],
                            input_length=60??,
                            trainable=False)
However, converting the different files and their structures to pre_trained_matrix_here is not quite clear to me.
They have several softmax outputs, so I am uncertain which one I need, and furthermore how to align the words in my input with the dictionary of words for which they have embeddings.
Is there a simple way to use these word/char embeddings in Keras and/or to construct the character/word embedding portion of the model in Keras, such that further layers may be added for other NLP tasks?
The Embedding layer only picks up embeddings (rows of its weight matrix) for the integer indices of input words; it does not know anything about the strings. This means you need to first convert your input sequence of words to a sequence of indices using the same vocabulary as was used in the model you take the embeddings from.
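A minimal sketch of that conversion, mirroring the Embedding syntax from the question; the vocabulary and matrix below are hypothetical stand-ins for whatever you build from the published lm_1b files:

import numpy as np
from tensorflow.keras.layers import Embedding

# Hypothetical vocabulary and matrix; row i of the matrix must hold the embedding
# of the word whose index is i in the same vocabulary.
word_to_index = {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
pre_trained_matrix = np.random.rand(len(word_to_index) + 1, 128)

# Convert a string of words into the integer indices the Embedding layer expects.
sentence = 'the cat sat on the mat'.split()
indices = [word_to_index.get(w, 0) for w in sentence]   # 0 = unknown-word row

embedding_layer = Embedding(input_dim=pre_trained_matrix.shape[0],
                            output_dim=pre_trained_matrix.shape[1],
                            weights=[pre_trained_matrix],
                            input_length=len(indices),
                            trainable=False)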
For NLP applications related to word or text encoding I would use CountVectorizer or TfidfVectorizer. Both are briefly described for Python in the following reference: http://www.bogotobogo.com/python/scikit-learn/files/Python_Machine_Learning_Sebastian_Raschka.pdf
CountVectorizer can be used for a simple application such as a spam/ham detector, while TfidfVectorizer gives deeper insight into how relevant each term (word) is, based on its frequency in a document and the number of documents in which it appears; this yields an interesting measure of how discriminative the terms are. These text feature extractors can also incorporate stop-word removal and lemmatization to improve the feature representations.
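For instance, a minimal scikit-learn sketch of both vectorizers on toy documents:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['free money now', 'meeting at noon tomorrow', 'free meeting invite']

# Raw per-document term counts.
counts = CountVectorizer(stop_words='english').fit_transform(docs)

# The same idea, reweighted by how rare (and therefore discriminative)
# each term is across the whole collection of documents.
tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

print(counts.toarray())
print(tfidf.toarray())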

Are Word2Vec and GloVe vectors suited for Entity Recognition?

I am working on Named Entity Recognition. I evaluated libraries such as MITIE, Stanford NER, NLTK NER etc., which are built upon conventional NLP techniques. I also looked at deep learning models such as word2vec and GloVe vectors for representing words in vector space; they are interesting since they provide information about the context of a word, but specifically for the task of NER, I think they are not well suited. All these vector models create a vocabulary and corresponding vector representations, so any word that is not in the vocabulary will not be recognised. And it is highly likely that a named entity is not in the vocabulary, since named entities are not bound by the language; they can be anything. So a deep learning technique that is useful in such cases would have to depend more on the structure of the sentence, using the standard English vocabulary, i.e. ignoring the named fields. Is there any such model or method available? Could a CNN or RNN be the answer?
I think you mean texts of a certain language, but the named entities in that text may contain different names (e.g. from other languages)?
The first thing that comes to my mind is some semi-supervised learning techniques that the model is being updated periodically to reflect new vocabulary.
For example, you may want to train a word2vec model on the incoming data and compare the word vectors of possible NEs with those of existing NEs; their cosine similarity should be high (i.e. the vectors should be close).
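A minimal sketch of that comparison with gensim (the two-sentence corpus is a placeholder for the periodically updated incoming data):

from gensim.models import Word2Vec

sentences = [['obama', 'visited', 'chicago'], ['merkel', 'visited', 'berlin']]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)

# Compare a candidate token against a known named entity: a high cosine
# similarity suggests the two words are used in similar contexts.
print(model.wv.similarity('merkel', 'obama'))

# Or rank the known vocabulary by similarity to the candidate.
print(model.wv.most_similar('merkel', topn=3))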
