I want to classify text documents using doc2vec representation and scikit-learn models.
My problem is that I'm lost on how to get started. Can someone explain the general steps usually taken to use doc2vec with scikit-learn?
There is a great tutorial here for binary classification with scikit-learn + doc2vec. In short:
Use gensim to train or load your doc2vec model.
Each input text is converted to a fixed-dimension vector of floats (the same dimension as your embedding); these vectors are the actual input features.
Now feel free to use any classifier in scikit-learn.
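A minimal end-to-end sketch of those steps, assuming gensim 4.x and a tiny made-up corpus (swap in your own documents and labels):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# toy corpus of (tokens, label) pairs -- replace with your own data
train_docs = [("this movie was great".split(), 1),
              ("terrible plot and acting".split(), 0),
              ("really enjoyed the film".split(), 1),
              ("a waste of two hours".split(), 0)]

# 1. train (or load) a doc2vec model with gensim
tagged = [TaggedDocument(words, [i]) for i, (words, _) in enumerate(train_docs)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# 2. convert each text into a fixed-dimension vector of floats
X = [d2v.infer_vector(words) for words, _ in train_docs]
y = [label for _, label in train_docs]

# 3. feed those vectors to any scikit-learn classifier
clf = LogisticRegression().fit(X, y)
print(clf.predict([d2v.infer_vector("great film".split())]))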
I want to encode my multiclass classification output variable in a specific way to take ordinality into account. I want to use this in an NN with a sigmoid objective.
I have a couple of questions about this:
How could I encode my classes in this way?
This would not change the problem from multiclass to multilabel classification right?
P.S. Here is a link to the paper I based this on, and here is a figure representing the change from a normal NN to their adaptation:
1. How could I encode my classes in this way?
That depends on the framework; a PyTorch example can be found here, which also includes a code snippet for converting predictions back to labels. A framework-agnostic sketch of the encoding itself is shown after point 2 below.
2. This would not change the problem from multiclass to multilabel classification right?
No, you would have multiple binary outputs, but they are subsequently converted to a single label, thus it is still multiclass classification.
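As an illustration of the cumulative ("thermometer") encoding the paper describes, here is a minimal NumPy sketch; the helper names (ordinal_encode, ordinal_decode) are made up for this example:

import numpy as np

def ordinal_encode(labels, num_classes):
    # class k (0-indexed) becomes num_classes - 1 cumulative binary targets:
    # the first k entries are 1, the rest 0
    labels = np.asarray(labels)
    thresholds = np.arange(num_classes - 1)
    return (labels[:, None] > thresholds).astype(float)

def ordinal_decode(probs, threshold=0.5):
    # turn sigmoid outputs back into a single class label by counting
    # how many thresholds are predicted as exceeded
    return (np.asarray(probs) > threshold).sum(axis=1)

# with 4 ordered classes, class 2 becomes [1, 1, 0]
print(ordinal_encode([0, 2, 3], num_classes=4))
print(ordinal_decode([[0.9, 0.8, 0.2]]))  # -> [2]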
I am trying to build a classifier for my dataset. Each observation in the data has categorical and numerical values, as well as a more general description in free text. I understand how to build a boosting algorithm to handle the categorical and numerical values, and I have already trained a neural network that predicted over the text quite successfully. What I can't wrap my head around is how to integrate the two approaches.
Embed your free text using a language model (e.g. averaging fastText word embeddings, or using the Google Universal Sentence Encoder) into an N-dimensional vector of floats. One-hot encode the categorical stuff. Concatenate [embedding, one_hot_encoding, numericals] and badabing badaboom, you've got yourself one vector representing your data point.
TensorFlow Hub's KerasLayer + https://tfhub.dev/google/universal-sentence-encoder/4 is definitely a good starting point. If you need to train something yourself, you could look into tf.keras.layers.Embedding.
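A rough sketch of that concatenation, assuming some made-up toy data and a placeholder embed function standing in for whichever sentence encoder you pick:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data: one free-text field, one categorical field, two numeric fields
texts = ["engine makes a rattling noise", "paint is scratched on the door"]
categories = np.array([["sedan"], ["truck"]])
numericals = np.array([[2012, 85000.0], [2018, 30000.0]])

def embed(texts):
    # placeholder: swap in your real sentence encoder, e.g.
    # hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4")
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 512))

text_vecs = embed(texts)
cat_vecs = OneHotEncoder().fit_transform(categories).toarray()
num_vecs = StandardScaler().fit_transform(numericals)

# one row per observation: [embedding | one-hot | scaled numericals]
X = np.concatenate([text_vecs, cat_vecs, num_vecs], axis=1)
print(X.shape)  # (2, 512 + 2 + 2)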
I only have around 1,000 images of vehicles. I need to train a model that can identify whether an image is a vehicle or not a vehicle. I do not have a dataset for not-vehicle, as it could be anything besides a vehicle.
I guess the best method for this would be to apply transfer learning. I am trying to train on a pre-trained VGG19 model, but I am still unsure how to train a model with just vehicle images and no non-vehicle images, so I am not able to do the classification.
I am new to ML overall; any solution based on a practical implementation will be highly appreciated.
You are right about the transfer learning approach. Have a look at this article, it is exactly about going from multi-class to binary classification with transfer learning - https://medium.com/@mandygu/seefood-creating-a-binary-classifier-using-transfer-learning-da751db7cf9c
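For a concrete picture, here is a minimal Keras sketch of that approach: a frozen pretrained VGG19 backbone with a small binary head. Note that training it still needs some "not-vehicle" examples (e.g. random images sampled from an open dataset as the negative class):

import tensorflow as tf
from tensorflow.keras.applications import VGG19

# frozen VGG19 backbone + a small binary head (vehicle vs. not-vehicle)
backbone = VGG19(weights="imagenet", include_top=False, pooling="avg",
                 input_shape=(224, 224, 3))
backbone.trainable = False  # keep the pretrained features fixed

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # single binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()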
You can try using a pretrained model and taking its output. You might need to apply dimensionality reduction, e.g. PCA, to get a more manageable input size. After that you can train a novelty detection model to identify whether a new input is different from your training set.
Refer to this example: https://github.com/J-Yash/Hotdog-Not-Hotdog
Hope this helps.
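A minimal sketch of that pipeline (pretrained features -> PCA -> one-class model), using VGG19 features and scikit-learn's OneClassSVM as one possible novelty detector; the placeholder images array stands in for your vehicle photos:

import numpy as np
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

# placeholder batch standing in for your (n, 224, 224, 3) vehicle photos
images = np.random.rand(16, 224, 224, 3) * 255

# pretrained VGG19 without its classification head -> one feature vector per image
backbone = VGG19(weights="imagenet", include_top=False, pooling="avg")
features = backbone.predict(preprocess_input(images))  # shape (n, 512)

# compress the features, then fit a one-class (novelty) model on vehicles only
pca = PCA(n_components=10)
reduced = pca.fit_transform(features)
detector = OneClassSVM(gamma="auto", nu=0.1).fit(reduced)

# +1 -> looks like the training (vehicle) distribution, -1 -> novelty
# (for new images, reuse pca.transform before calling detector.predict)
print(detector.predict(reduced[:5]))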
This is a binary classification problem: whether the input is a vehicle or not.
If you are new to ML, I would suggest you start by implementing basic binary classifiers like Logistic Regression and Support Vector Machines before jumping to Convolutional Neural Networks (CNNs).
I am providing some links to binary classification implementations using different algorithms. I hope this helps.
Logistic Regression: https://github.com/JB1984/Logistic-Regression-Cat-Classifier
SVM: https://github.com/Witsung/SVM-Fruit-Image-Classifier
CNN: https://github.com/A-Jatin/CNN-implementation-for-binary-image-classification
After reading the tutorial in gensim's docs, I do not understand the correct way of generating new embeddings from a trained model. So far I have trained gensim's FastText embeddings like this:
from gensim.models.fasttext import FastText as FT_gensim
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(corpus_file=corpus_file)
# train the model
model_gensim.train(
    corpus_file=corpus_file, epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
)
Then, let's say I want to get the embedding vectors associated with these sentences:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
How can I get them with model_gensim that I trained previously?
You can look up each word's vector in turn:
wordvecs_obama = [model_gensim[word] for word in sentence_obama]
For your 7-word input sentence, you'll then have a list of 7 word-vectors in wordvecs_obama.
FastText models do not, as a matter of their inherent functionality, convert longer texts into single vectors. (And specifically, the model you've trained doesn't have a default way of doing that.)
There is a "classification mode" in the original Facebook FastText code that involves a different style of training, where texts are associated with known labels at training time, and all the word-vectors of a text are combined both during training and when the model is later asked to classify new texts. But the gensim implementation of FastText does not currently support this mode, as gensim's goal has been to supply unsupervised rather than supervised algorithms.
You could approximate what that FastText mode does by averaging together those word-vectors:
import numpy as np
meanvec_obama = np.array(wordvecs_obama).mean(axis=0)
Depending on your ultimate purposes, something like that might still be useful. (But that average wouldn't be as useful for classification as if the word-vectors had originally been trained for that goal, with known labels, in that FastText mode.)
I'm trying to slowly begin working on a Twitter recommender system as part of a project, which requires me to use some form of deep learning. My goal is to recommend other tweets based on the topical content of a tweet, using unlabelled data.
I have pre-processed my data and trained a few variations of models in doc2vec to get both word embeddings and document embeddings. But my issue is that I feel a little lost with where to go from here. I've read that doc2vec can be used as an input to a deeper neural network for training such as an LSTM or even a CNN.
Could anyone help me understand how these document embeddings (and word embeddings; I trained the model in DM mode) are used as input, and what the purpose of the neural net would be in this case? Is it for clustering? I understand the question is a little open-ended, but I'm quite new to all this; any help would be appreciated.
If you have trained a d-dimensional doc2vec vector for each document, that vector becomes the input for that particular tweet. If you have n documents, they form an n x d matrix. Now this matrix can be given to a neural network. LSTM and CNN models are used for supervised learning problems (where you have labeled data).
If you don't have labelled data, then go for unsupervised learning. Clustering comes under this! You can run different clustering algorithms and recommend based on the resulting clusters.
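One possible unsupervised sketch along those lines, assuming doc_vectors is your n x d matrix of doc2vec vectors (a random placeholder here), using k-means plus cosine similarity to recommend tweets from the same cluster:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# placeholder for your (n, d) matrix of trained doc2vec vectors, one row per tweet
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 100))

# group tweets by topic with k-means
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(doc_vectors)

def recommend(tweet_idx, top_k=5):
    # recommend the most similar tweets from the same cluster as tweet_idx
    cluster = kmeans.labels_[tweet_idx]
    candidates = np.where(kmeans.labels_ == cluster)[0]
    sims = cosine_similarity(doc_vectors[[tweet_idx]], doc_vectors[candidates])[0]
    ranked = candidates[np.argsort(-sims)]
    return [i for i in ranked if i != tweet_idx][:top_k]

print(recommend(0))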