How to evaluate accuracy in NLP while using CountVectorizer - machine-learning

I am trying to develop a movie recommender system. I used CountVectorizer and calculated the cosine similarity between movies; my question is, if I want to evaluate the accuracy of the resulting model, how can I do it? This is my Kaggle code. I used the TMDB movie dataset found on Kaggle.
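For reference, a minimal sketch of the setup the question describes (the movie texts are made up for illustration; the actual Kaggle notebook is not shown):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical stand-in for the TMDB data: one combined text field per movie
movie_texts = [
    "action adventure space hero",
    "romance drama paris love",
    "action hero city crime",
]
counts = CountVectorizer().fit_transform(movie_texts)
sim = cosine_similarity(counts)        # (n_movies, n_movies) similarity matrix
# most similar movies to movie 0, excluding itself
print(sim[0].argsort()[::-1][1:3])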

Related

After training word embeddings with gensim's FastText wrapper, how to embed new sentences?

After reading the tutorial in gensim's docs, I do not understand the correct way of generating new embeddings from a trained model. So far I have trained gensim's FastText embeddings like this:
from gensim.models.fasttext import FastText as FT_gensim
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(corpus_file=corpus_file)
# train the model
model_gensim.train(
    corpus_file=corpus_file, epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
)
Then, let's say I want to get the embedding vectors associated with these sentences:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
How can I get them with model_gensim that I trained previously?
You can look up each word's vector in turn:
wordvecs_obama = [model_gensim[word] for word in sentence_obama]
For your 7-word input sentence, you'll then have a list of 7 word-vectors in wordvecs_obama.
FastText models do not, as part of their inherent functionality, convert longer texts into single vectors. (And specifically, the model you've trained doesn't have a default way of doing that.)
There is a "classification mode" in the original Facebook FastText code that involves a different style of training, where texts are associated with known labels at training time, and all the word-vectors of the sentence are combined during training, and when the model is later asked to classify new texts. But, the gensim implementation of FastText does not currently support this mode, as gensim's goal has been to supply unsupervised rather than supervised algorithms.
You could approximate what that FastText mode does by averaging together those word-vectors:
import numpy as np
meanvec_obama = np.array(wordvecs_obama).mean(axis=0)
Depending on your ultimate purposes, something like that might still be useful. (But that average wouldn't be as useful for classification as if the word-vectors had originally been trained for that goal, with known labels, in that FastText mode.)
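For example, a hedged follow-up: build the second sentence's average the same way and compare the two averaged vectors with cosine similarity (meanvec_president is a name introduced here):

import numpy as np

# average the second sentence's word-vectors the same way
meanvec_president = np.array([model_gensim[word] for word in sentence_president]).mean(axis=0)

# cosine similarity between the two averaged sentence vectors
cos = np.dot(meanvec_obama, meanvec_president) / (
    np.linalg.norm(meanvec_obama) * np.linalg.norm(meanvec_president)
)
print(cos)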

How to use a genetic algorithm to find the weights of a voting classifier in WEKA?

I am working from this article: "A novel method for predicting kidney stone type using ensemble learning". The author used a genetic algorithm to find the optimal weight vector for voting with WEKA, but I don't see how they did that. How can I use a genetic algorithm to find the weights of a voting classifier in WEKA?
The paragraph below is extracted from the article:
In order to enhance the performance of the voting algorithm, a weighted majority vote is used. A simple majority vote algorithm is usually an effective way to combine different classifiers, but not all classifiers have the same effect on the classification problem. To optimize the results from the weighted majority vote classifier, we need to find the optimal weight vector. Applying genetic algorithms is our solution for finding the optimal weight vector in this problem.
Assuming you have some trained classifiers and a test set, you can create a method calculateFitness(double[] weights). In this method, for each Instance, compute each classifier's prediction and merge them into a combined prediction according to the weights. Use the combined predictions and the true values to calculate the total score you want to maximize/minimize.
Using the calculateFitness method, you can create a custom GA to find the best weights.
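As a rough Python illustration of that idea (the original answer targets WEKA's Java API; the array shapes and names here are assumptions), the fitness function weights each classifier's class-probability predictions, sums them, and scores the combined vote against the true labels:

import numpy as np

def calculate_fitness(weights, probas, y_true):
    # weights: (n_classifiers,) candidate weight vector proposed by the GA
    # probas:  (n_classifiers, n_samples, n_classes) per-classifier
    #          class-probability predictions on the test set
    # y_true:  (n_samples,) true labels
    combined = np.tensordot(weights, probas, axes=1)  # (n_samples, n_classes)
    predictions = combined.argmax(axis=1)
    return (predictions == y_true).mean()             # accuracy to maximize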

How to use doc2vec embeddings as an input to a neural network

I'm trying to slowly begin working on a Twitter recommender system as part of a project, which requires me to use some form of deep learning. My goal is to recommend other tweets based on the topical content of a tweet with unlabelled data.
I have pre-processed my data and trained a few variations of models in doc2vec to get both word embeddings and document embeddings. But my issue is that I feel a little lost with where to go from here. I've read that doc2vec can be used as an input to a deeper neural network for training such as an LSTM or even a CNN.
Could anyone help me understand how these document embeddings (and word embeddings; I trained the model in DM mode) are used as input, and what the purpose of the neural net would be in this case? Is it for clustering? I understand the question is a little open-ended, but I'm quite new to all this; any help would be appreciated.
If you have trained a d-dimensional doc2vec model, each document's vector becomes the input vector for that particular tweet. With n documents, the vectors form an n×d matrix. This matrix can then be given to a neural network. LSTM and CNN models are used for supervised learning problems (where you have labelled data).
If you don't have labelled data, then go for unsupervised learning. Clustering comes under this! You can run different clustering algorithms and recommend based on the resulting clusters, as in the sketch below.
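A minimal sketch of that unsupervised route (the placeholder vectors are an assumption; in practice you would stack the vectors from your trained Doc2Vec model):

import numpy as np
from sklearn.cluster import KMeans

# placeholder for the n x d matrix of doc2vec vectors described above
doc_vectors = np.random.rand(1000, 100)

kmeans = KMeans(n_clusters=20, random_state=0).fit(doc_vectors)
# tweets that share a cluster label are candidate recommendations for each other
labels = kmeans.labels_
print(np.where(labels == labels[0])[0][:10])  # neighbours of tweet 0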

How to use learning curves with random forests

I've been going through Andrew Ng's machine learning course and just got done with the learning curve lecture. I created a learning curve for a logistic regression model, and it looks like the training and CV scores converge, which means my model could benefit from more features. How could I do a similar analysis for something like a random forest? When I create a learning curve for a random forest classifier with the same data in sklearn, my training score just stays very close to 1. Do I need to use a different method of getting the training error?
Learning curves are a tool for studying the bias-variance trade-off. Since your random forest's training score stays very close to 1, your random forest model is able to learn the underlying function. If the underlying function were more non-linear or more complex, you would have to add more features. See the following example (the "Learning Curves" figure):
Start with only 2 features and train your random forest model. Then use all of your features and train your random forest model again.
You should see a similar graph for your example.
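A minimal sklearn sketch of this experiment (the synthetic data is a placeholder for your own features and labels):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# placeholder data; substitute your own feature matrix and labels
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

train_sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
)
print("train:", train_scores.mean(axis=1))  # often stays near 1 for forests
print("cv:   ", cv_scores.mean(axis=1))     # watch whether this keeps rising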

Gensim's Document similarity can be used as supervised classification?

Gensim has a document similarity feature which, given a query document, outputs the similarity of that document to all the documents in its index.
Can this be used like an "approximate" version of supervised classification?
I know gensim's word2vec uses deep learning; is that involved in the above step?
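A minimal sketch of the "approximate classification" idea: index labelled documents, then assign a query the label of its most similar indexed document, effectively a 1-nearest-neighbour classifier (the toy texts and labels here are made up):

from gensim import corpora, similarities

texts = [["human", "interface", "computer"],
         ["graph", "trees", "minors"]]
labels = ["hci", "graphs"]                      # hypothetical labels

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
index = similarities.MatrixSimilarity(corpus, num_features=len(dictionary))

query = dictionary.doc2bow(["graph", "minors", "survey"])
sims = index[query]                             # similarity to every indexed doc
print(labels[sims.argmax()])                    # label of the nearest document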
