I have a working app using doc2vec from gensim. I know KeyedVectors is now the recommended approach, and I'm trying to port over, but I'm not sure what the equivalent of Doc2Vec's infer_vector method is.
Or, better put: how do I obtain a vector for an entire document using the KeyedVectors model, so I can write it to my Annoy index?
KeyedVectors doesn't replace Doc2Vec; it's a storage and index system for word vectors:
Word vector storage and similarity look-ups. Common code independent of the way the vectors are trained (Word2Vec, FastText, WordRank, VarEmbed etc.)
The word vectors are considered read-only in this class.
This class doesn't know anything about tagged documents, and it can't offer infer_vector or any equivalent: that operation requires training, and the whole point of KeyedVectors is to be independent of how the vectors were trained.
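If the goal is to populate an Annoy index with per-document vectors, you still need a trained Doc2Vec model and its infer_vector() method. A minimal sketch of that flow follows; the toy corpus, the integer tags, and the Annoy settings are placeholder assumptions, not your actual setup:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from annoy import AnnoyIndex

# Toy corpus: one list of tokens per document (placeholder data)
corpus = [["human", "machine", "interface"], ["graph", "minors", "survey"]]
tagged = [TaggedDocument(words, [i]) for i, words in enumerate(corpus)]

# Train a Doc2Vec model; parameter values here are illustrative only
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Build an Annoy index over one inferred vector per document
index = AnnoyIndex(model.vector_size, "angular")
for i, words in enumerate(corpus):
    index.add_item(i, model.infer_vector(words))
index.build(10)  # number of trees

# Later: infer a vector for a new document and query the index
query_vec = model.infer_vector(["graph", "survey"])
print(index.get_nns_by_vector(query_vec, 2))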
I am working with Keras and experimenting with AI and machine learning. I have a few projects already, and now I'm looking to replicate a dataset. What direction do I go to learn this? What should I look up to begin learning about this kind of model? I just need an expert to point me in the right direction.
To clarify: by replicating a dataset I mean I want to take a series of numbers with an easily distinguishable pattern and then have the AI generate new data that is similar.
There are several ways to generate new data similar to a current dataset, but the most prominent way nowadays is to use a Generative Adversarial Network (GAN). This works by pitting two models against one another. The generator model attempts to generate data, and the discriminator model attempts to tell the difference between real data and generated data. There are plenty of tutorials out there on how to do this, though most of them are probably based on image data.
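As a concrete starting point, here is a deliberately tiny GAN sketch in Keras for 1-D numeric samples. The layer sizes, optimizer choices, training schedule, and the synthetic "real" data are all placeholder assumptions rather than a tuned recipe:

import numpy as np
from tensorflow.keras import layers, models

latent_dim = 16   # size of the random noise vector fed to the generator
sample_dim = 8    # length of each real/generated data sample

# Generator: noise -> synthetic sample
generator = models.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(sample_dim),
])

# Discriminator: sample -> probability that it is real
discriminator = models.Sequential([
    layers.Input(shape=(sample_dim,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model trains the generator to fool the (frozen) discriminator
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_gan(real_data, steps=1000, batch_size=64):
    for _ in range(steps):
        # 1) Train the discriminator on half real, half generated samples
        idx = np.random.randint(0, len(real_data), batch_size)
        noise = np.random.normal(size=(batch_size, latent_dim))
        fake = generator.predict(noise, verbose=0)
        x = np.concatenate([real_data[idx], fake])
        y = np.concatenate([np.ones((batch_size, 1)), np.zeros((batch_size, 1))])
        discriminator.train_on_batch(x, y)
        # 2) Train the generator to make the discriminator say "real"
        noise = np.random.normal(size=(batch_size, latent_dim))
        gan.train_on_batch(noise, np.ones((batch_size, 1)))

# Placeholder "real" data with an easily distinguishable pattern
real = np.cumsum(np.random.rand(1000, sample_dim), axis=1)
train_gan(real)
new_samples = generator.predict(np.random.normal(size=(5, latent_dim)), verbose=0)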
If you want to generate labels as well, make a conditional GAN.
The only other common method for generating data is a Variational Autoencoder (VAE), but the generated data tend to be lower-quality than what a GAN can generate. I don't know if that holds true for non-image data, though.
You can also use a Conditional Variational Autoencoder (CVAE), which produces new data conditioned on a label.
I have trained Gensim's WordToVec on a text corpus, converted it to DocToVec, and then used cosine similarity to find the similarity between documents. I need to suggest similar documents. Now suppose that among the top 5 suggestions for a particular document, we manually find that 3 of them are not similar. Can this feedback be incorporated in retraining the model?
It's not quite clear what you mean by "converted [a Word2Vec model] to DocToVec". The gensim Doc2Vec class doesn't use or require a Word2Vec model as input.
But, if you have many sets of hand-curated "this is a good suggestion" or "this is a bad suggestion" pairs for your corpus, you can use the model's scoring against all those to compare models, and train many variant models (with different model parameter values like size, window, min_count, sample, etc), picking the one that scores best on your tests.
That sort of automated parameter search is the most straightforward way to use performance on real evaluation data to adjust an unsupervised model like Word2Vec.
(Depending on the specifics of your data and problem-domain, you might also start to notice patterns in where the model is better or worse, that help you hand-tune parts of the data preprocessing. For example, a different handling of capitalization or tokenization might be suggested by error cases.)
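For illustration, here's a rough sketch of such a parameter search over Doc2Vec models, scored against hand-curated (tag_a, tag_b, is_similar) judgments. The grid values, the 0.5 threshold, and the judgments themselves are placeholder assumptions, and it uses the gensim 4.x API (model.dv rather than model.docvecs):

from itertools import product
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def score_model(model, judgments):
    # Fraction of curated (tag_a, tag_b, is_similar) pairs the model agrees with
    correct = 0
    for tag_a, tag_b, is_similar in judgments:
        sim = model.dv.similarity(tag_a, tag_b)   # model.docvecs in gensim < 4.0
        correct += (sim > 0.5) == is_similar      # naive threshold, for illustration
    return correct / len(judgments)

def grid_search(tagged_docs, judgments):
    best_score, best_model = -1.0, None
    for vector_size, window, min_count in product([50, 100], [5, 10], [2, 5]):
        model = Doc2Vec(tagged_docs, vector_size=vector_size, window=window,
                        min_count=min_count, epochs=20)
        score = score_model(model, judgments)
        if score > best_score:
            best_score, best_model = score, model
    return best_model, best_score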
I have two different word-vector models created using the word2vec algorithm. The issue I am facing is that a few words from the first model are not present in the second model. I want to create a third model from the two word-vector models, where I can use word vectors from both models without losing the meaning and context of the word vectors.
Can I do this, and if so, how?
You could potentially translate the vectors for the words only in one model to the other model's coordinate space, using other shared words to learn a translation-function.
There's a facility to do this in recent gensim versions – see the TranslationMatrix tool. There's a demo Jupyter notebook included in the docs/notebooks directory, viewable online at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
You'd presumably take the larger model (or whichever one is thought to be better, perhaps because it was trained on more data), and translate the smaller number of words it's missing into its space. You'd use as many common-reference 'anchor' words as is practical.
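A rough sketch of that approach with gensim's TranslationMatrix follows. The file names, the self-mapped anchor pairs, and the example word are assumptions, and the exact API may differ slightly between gensim versions, so treat the notebook above as the canonical workflow:

from gensim.models import KeyedVectors
from gensim.models.translation_matrix import TranslationMatrix

big = KeyedVectors.load("model_big.kv")      # target space (the 'better' model)
small = KeyedVectors.load("model_small.kv")  # source space, has the missing words

# Anchor pairs: words present in BOTH models, mapped to themselves
shared = [w for w in small.index_to_key if w in big][:5000]
word_pairs = [(w, w) for w in shared]

# Learn a linear mapping from the small model's space into the big model's space
trans = TranslationMatrix(small, big, word_pairs=word_pairs)
trans.train(word_pairs)

# Find nearest neighbours in the big space for a word only the small model knows
print(trans.translate(["some_word_missing_from_big"], topn=5))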
I'm using gensim to extract feature vector from a document.
I've downloaded the pre-trained model from Google named GoogleNews-vectors-negative300.bin and I loaded that model using the following command:
model = models.Doc2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
My purpose is to get a feature vector from a document. For a word, it's very easy to get the corresponding vector:
vector = model[word]
However, I don't know how to do it for a document. Could you please help?
A set of word vectors (such as GoogleNews-vectors-negative300.bin) is neither necessary nor sufficient for the kind of text vectors (Le/Mikolov 'Paragraph Vectors') created by the Doc2Vec class. It instead expects to be trained with example texts to learn per-document vectors. Then, also, the trained model can be used to 'infer' vectors for other new documents.
(The Doc2Vec class only supports the load_word2vec_format() method because it inherits from the Word2Vec class – not because it needs that functionality.)
There's another simple kind of text vector that can be created by averaging the vectors of all the words in a document, perhaps weighted by some per-word significance. But that's not what Doc2Vec provides.
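If that simpler averaged representation is enough for your purposes, the GoogleNews vectors can be used directly via KeyedVectors. A small sketch of that alternative; the whitespace tokenization and the equal per-word weighting are assumptions, and this is not what Doc2Vec/Paragraph Vectors computes:

import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def average_vector(tokens, kv):
    # Mean of the vectors of in-vocabulary tokens; zeros if none are known
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

doc_vec = average_vector("the quick brown fox".split(), wv)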
I tried this:
model = models.Doc2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
and it is giving me an error saying that Doc2Vec does not contain any word2vec format.
I am working on Named Entity Recognition. I evaluated libraries such as MITIE, Stanford NER, and NLTK NER, which are built on conventional NLP techniques. I also looked at deep learning models such as word2vec and GloVe vectors for representing words in vector space. They are interesting since they provide information about a word's context, but for the specific task of NER I don't think they are well suited: all these vector models create a vocabulary and a corresponding vector representation, and any word that is not in the vocabulary will not be recognised. It is highly likely that a named entity will be out of vocabulary, since named entities are not bound by the language; they can be anything. So for a deep learning technique to be useful in such cases, it would need to depend more on the structure of the sentence, using the standard English vocabulary, i.e. ignoring the named fields themselves. Is there any such model or method available? Might a CNN or RNN be the answer?
I think you mean texts of a certain language, but the named entities in that text may contain different names (e.g. from other languages)?
The first thing that comes to my mind is a semi-supervised approach, in which the model is updated periodically to reflect new vocabulary.
For example, you might train a word2vec model on the incoming data and compare the word vectors of possible NEs with those of existing NEs; their cosine distance should be small.
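A loose sketch of that idea with gensim's Word2Vec, where the vocabulary is expanded as new text arrives and candidate entities are compared to known ones by cosine similarity. The sentences and entity names are invented for illustration, and the parameter names follow gensim 4.x:

from gensim.models import Word2Vec

# Initial training data (placeholder sentences)
old_sentences = [["acme", "acquired", "globex"], ["globex", "shares", "rose"]]
model = Word2Vec(old_sentences, vector_size=100, min_count=1, epochs=20)

# A new batch of documents arrives, containing previously unseen names
new_sentences = [["initech", "acquired", "hooli"], ["hooli", "shares", "fell"]]
model.build_vocab(new_sentences, update=True)   # grow the vocabulary
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)

# Compare a candidate NE against known NEs by cosine similarity
for known in ["acme", "globex"]:
    print(known, model.wv.similarity("initech", known))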