Can Gensim's document similarity be used as supervised classification? - machine-learning

Gensim has a document similarity feature: given a query document, it outputs the similarity of that document to every document it has in its index.
Can this be used like an "approximate" version of supervised classification?
I know gensim's word2vec uses deep learning; is that involved in the step above?
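As a rough sketch of how that "approximate" classification could look, here is a hypothetical tiny labelled index built with gensim's TfidfModel and MatrixSimilarity, where a query is simply assigned the label of its most similar indexed document (1-nearest-neighbour style); the documents and labels are made up for illustration:
from gensim import corpora, models, similarities
# toy indexed documents with known labels (purely illustrative data)
docs = [["cat", "sits", "on", "mat"], ["dog", "barks", "at", "cat"],
        ["stocks", "fell", "sharply", "today"], ["market", "rally", "lifts", "stocks"]]
labels = ["pets", "pets", "finance", "finance"]
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]
tfidf = models.TfidfModel(bow_corpus)
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
# similarity of the query against every indexed document
query = dictionary.doc2bow("my dog chases the cat".lower().split())
sims = index[tfidf[query]]
# crude "classification": take the label of the most similar indexed document
print(labels[sims.argmax()])  # -> "pets"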

Related

Is word2vec a generalization or memorization algorithm?

I need to know whether word2vec is a generalization algorithm, like most ML algorithms, or a memorization algorithm like KNN.
Since we have two types of algorithms, model-based and memory-based, which category does word2vec fall into when it is used for most_similar_items?
Let me define generalization as the ability of a model that has completed training to be effective in prediction across a whole range of inputs, including inputs that were not part of training. From that perspective, Word2Vec cannot predict words that are not part of the training dataset, because it simply would not have trained on their context to create an embedding. To qualify as a generalization method, it needs to be able to predict on an input which was not part of the training dataset.
A Word2Vec model maintains a dictionary mapping each word to its corresponding embedding/vector. In short, it cannot predict on unknown words. This is one of the important differences between the fastText model and Word2Vec.
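A small illustration of that difference, assuming the gensim 4.x API (note the vector_size parameter, which replaced the older size used elsewhere on this page) and a toy two-sentence corpus:
from gensim.models import Word2Vec, FastText
# tiny toy corpus; sentences and parameters are purely illustrative
sentences = [["machine", "learning", "is", "fun"],
             ["deep", "learning", "uses", "neural", "networks"]]
w2v = Word2Vec(sentences, vector_size=10, min_count=1, epochs=10)
ft = FastText(sentences, vector_size=10, min_count=1, epochs=10)
print("learning" in w2v.wv)   # True: seen during training
print("learnings" in w2v.wv)  # False: Word2Vec has no vector for an unseen word
vec = ft.wv["learnings"]      # FastText can still build a vector from character n-grams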

After training word embeddings with gensim's fastText wrapper, how to embed new sentences?

After reading the tutorial in gensim's docs, I do not understand what the correct way of generating new embeddings from a trained model is. So far I have trained gensim's fastText embeddings like this:
from gensim.models.fasttext import FastText as FT_gensim
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(corpus_file=corpus_file)
# train the model
model_gensim.train(
    corpus_file=corpus_file, epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
)
Then, let's say I want to get the embedding vectors associated with these sentences:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
How can I get them with model_gensim that I trained previously?
You can look up each word's vector in turn:
wordvecs_obama = [model_gensim[word] for word in sentence_obama]
For your 7-word input sentence, you'll then have a list of 7 word-vectors in wordvecs_obama.
FastText models do not, as part of their inherent functionality, convert longer texts into single vectors. (And specifically, the model you've trained doesn't have a default way of doing that.)
There is a "classification mode" in the original Facebook FastText code that involves a different style of training, where texts are associated with known labels at training time, and all the word-vectors of the sentence are combined during training, and when the model is later asked to classify new texts. But, the gensim implementation of FastText does not currently support this mode, as gensim's goal has been to supply unsupervised rather than supervised algorithms.
You could approximate what that FastText mode does by averaging together those word-vectors:
import numpy as np
meanvec_obama = np.array(wordvecs_obama).mean(axis=0)
Depending on your ultimate purposes, something like that might still be useful. (But that average wouldn't be as useful for classification as if the word-vectors had originally been trained for that goal, with known labels, in that FastText mode.)
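For example, continuing the code above (same older-gensim access style, reusing sentence_president, meanvec_obama and model_gensim from earlier), you could compare the two averaged sentence vectors with cosine similarity; this is only a sketch of one possible use:
from numpy.linalg import norm
# average the second sentence's word-vectors the same way
wordvecs_president = [model_gensim[word] for word in sentence_president]
meanvec_president = np.array(wordvecs_president).mean(axis=0)
# cosine similarity between the two averaged sentence vectors
cos_sim = np.dot(meanvec_obama, meanvec_president) / (norm(meanvec_obama) * norm(meanvec_president))
print(cos_sim)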

How to use doc2vec embeddings as an input to a neural network

I'm trying to slowly begin working on a Twitter recommender system as part of a project, which requires me to use some form of deep learning. My goal is to recommend other tweets based on the topical content of a tweet with unlabelled data.
I have pre-processed my data and trained a few variations of doc2vec models to get both word embeddings and document embeddings. But my issue is that I feel a little lost about where to go from here. I've read that doc2vec output can be used as input for further training in a deeper neural network, such as an LSTM or even a CNN.
Could anyone help me understand how these document embeddings (and word embeddings; I trained the model in DM mode) are used as input, and what the purpose of the neural net would be in this case? Is it for clustering? I understand the question is a little open-ended, but I'm quite new to all this, and any help would be appreciated.
If you have trained a d-dimensional doc2vec vector for each document, that vector becomes the input for that particular tweet. With n documents, they form an n×d matrix. This matrix can then be given to a neural network. LSTM and CNN models are used for supervised learning problems (where you have labelled data).
If you don't have labelled data, go for unsupervised learning; clustering comes under this. You can run different clustering algorithms and recommend based on the resulting clusters.
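As a minimal sketch of that clustering route, assuming d2v_model is your already-trained gensim Doc2Vec model (4.x API) and an arbitrary choice of 5 clusters:
import numpy as np
from sklearn.cluster import KMeans
# n x d matrix: one learned vector per tweet (d2v_model is assumed to be trained already)
X = d2v_model.dv.vectors
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
# recommend other tweets from the same cluster as tweet 0
target = 0
same_cluster = np.where(kmeans.labels_ == kmeans.labels_[target])[0]
recommendations = [int(i) for i in same_cluster if i != target]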

scikit-learn classification using doc2vec representation

I want to classify text documents using doc2vec representation and scikit-learn models.
My problem is that I'm lost on how to get started. Can someone explain the general steps usually taken to use doc2vec with scikit-learn?
There is a great tutorial here for binary classification with scikit-learn + doc2vec. In short:
Use gensim to train/load your doc2vec model.
Each input text is converted to a fixed-dimension vector of floats (the same dimension as your embedding). These are the actual input features.
Now feel free to use any classifier in scikit-learn.
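A minimal sketch of those steps, with a made-up toy corpus and labels, using the gensim 4.x API and a logistic regression classifier (any scikit-learn classifier would do):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
# toy labelled corpus; texts and labels are purely illustrative
train_texts = [["good", "movie"], ["great", "acting"], ["terrible", "plot"], ["boring", "film"]]
train_labels = [1, 1, 0, 0]
# 1. train the Doc2Vec model
tagged = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(train_texts)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
# 2. convert each text to a fixed-dimension vector: these are the classifier features
X_train = [d2v.infer_vector(t) for t in train_texts]
# 3. fit any scikit-learn classifier on those vectors
clf = LogisticRegression().fit(X_train, train_labels)
print(clf.predict([d2v.infer_vector(["awful", "movie"])]))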

Do all machine learning algorithms use word frequency as a feature?

I use scikit-learn for classification, and mainly work with Naive Bayes, SVM, and neural networks. There are variants of each of them.
I see that for training, the algorithms create vectors. What do these vectors contain?
Do all algorithms consider word frequency as a feature? If yes, then how do they differ?
For text classification you usually create a vector of word frequencies, or tf-idf weights, to be able to compute distances between two documents. You can use all kinds of methods to create these per-word weights.
The words (features) can be extracted by just splitting the documents on a separator, but you can use more complex methods like stemming (keeping only the root of the words).
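As a minimal sketch with scikit-learn's built-in vectorizers (two toy documents, purely for illustration):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ["the cat sat on the mat", "the dog chased the cat"]  # toy documents
# raw word counts (term frequencies)
counts = CountVectorizer().fit_transform(docs)
# tf-idf weights: words frequent in a document but rare in the corpus get higher weight
tfidf = TfidfVectorizer().fit_transform(docs)
print(counts.toarray())
print(tfidf.toarray())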
You will find lots of examples in the sklearn documentation. For instance:
http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html
This IPython notebook could be a good start too.
