I want to get hands-on experience with LSTMs and scale up gradually. As a first step, I'm trying to implement a YouTube sentiment analyzer with an LSTM in Keras. While searching for resources, I came across the IMDB sentiment analysis dataset and the accompanying LSTM example code. It works well for longer inputs, but shorter inputs don't do so well. The code is here: https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py
After saving the Keras model, I built a prediction module for this data with the following code:
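from keras.datasets import imdb
from keras.models import load_model
from keras.preprocessing.text import text_to_word_sequence

model = load_model('ytsentanalysis.h5')
print("Enter text")
text = input()
# Tokenize the input ('words' avoids shadowing the built-in 'list')
words = text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~\t\n', lower=True, split=" ")
print(words)
word_index = imdb.get_word_index()
x_test = [[word_index[w] for w in words if w in word_index]]
prediction = model.predict(x_test)
print(prediction)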
I feed in various inputs such as 'bad video', 'fantastic amazing', 'good great', or 'terrible bad'. The outputs are close to 1 for some negative-themed inputs, and I've seen a prediction of about 0.3 for a positive-themed input. I'd expect the output to be closer to 1 for positive and closer to 0 for negative.
In an effort to solve this problem, I set maxlen=20 for both training and prediction, because YouTube comments are much shorter, and ran the same code again. This time the predicted probabilities were all vanishingly small (e raised to some large negative power).
Is there no way I can adapt and reuse the existing dataset? If not, since labeled YouTube comment datasets aren't as extensive, should I use something like a Twitter comment dataset at the expense of losing the efficiency of the pre-built IMDB input modules in Keras? And is there any way I can see the code for those modules?
Thank you in advance for answering all these questions.
The IMDb dataset and YouTube comments are quite different: movie reviews are long and detailed compared to comments and tweets.
It may be more helpful to train a model on a publicly available dataset that is closer to YouTube comments (e.g. tweets). You can then take that pre-trained model and fine-tune it on your own YouTube comments dataset. Pre-trained word embeddings such as GloVe and word2vec can be useful as well.
Alternatively, you can look into using NLTK to analyse the comments instead.
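One more thing worth checking in the prediction code from the question is how the words are turned into integers. As far as I know, imdb.load_data() shifts every raw index from imdb.get_word_index() by index_from=3 (reserving 1 for the start token and 2 for out-of-vocabulary words) and replaces anything at or above num_words with the OOV id, so feeding word_index[w] directly gives the model ids it was not trained on. Here is a minimal sketch of a matching encoder. The function name encode_for_imdb_model and the toy_index dictionary are made up for illustration, max_features=20000 is assumed from the linked example script, and mapping unknown words to oov_char is my own assumption.

```python
def encode_for_imdb_model(words, word_index, max_features=20000,
                          index_from=3, start_char=1, oov_char=2):
    """Map tokens to ids using the same offsets imdb.load_data() applies by default."""
    ids = [start_char]  # training sequences begin with the start token
    for word in words:
        raw = word_index.get(word)
        # Shift known words by index_from; unknown words become OOV (an assumption)
        idx = raw + index_from if raw is not None else oov_char
        # Words outside the training vocabulary size are also mapped to OOV
        ids.append(idx if idx < max_features else oov_char)
    return ids


# Hypothetical stand-in for imdb.get_word_index(), so the sketch runs offline
toy_index = {"the": 1, "bad": 75, "video": 500, "rareword": 50000}
encoded = encode_for_imdb_model(["bad", "video", "zzz", "rareword"], toy_index)
print(encoded)
```

With the real word index you would then pad the result to the maxlen used in training (for example with Keras's pad_sequences) before calling model.predict.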
I want to build a classifier that predicts if user i will retweet tweet j.
The dataset is huge: it contains 160 million tweets. Each tweet comes with some metadata (e.g. whether the retweeter follows the author of the tweet).
The text tokens for a single tweet are an ordered list of BERT ids. To get the embedding of the tweet, you just use the ids (so it is not text).
Is it possible to fine-tune BERT to do the prediction? If yes, what courses/sources do you recommend for learning how to fine-tune? (I'm a beginner)
I should add that the prediction should be a probability.
If it's not possible, I'm thinking of converting the embeddings back to text then using some arbitrary classifier that I'm going to train.
You can fine-tune BERT, and you can use BERT to do retweet prediction, but you need more architecture in order to predict if user i will retweet tweet j.
Here is an architecture off the top of my head.
At a high level:
Create a dense vector representation (embedding) of user i (perhaps containing something about the user's interests, such as sports).
Create an embedding of tweet j.
Create an embedding of the combination of the first two embeddings together, such as with concatenation or Hadamard product.
Feed this embedding through a NN that performs binary classification to predict retweet or non-retweet.
Let's break this architecture down by item.
To create an embedding of user i, you will need to create some kind of neural network that accepts whatever features you have about the user and produces a dense vector. This part is the most difficult component of the architecture. This area is not in my wheelhouse, but a quick google search for "user interest embedding" brings up this research paper on an algorithm called StarSpace. It suggests that it can "obtain highly informative user embeddings according to user behaviors", which is what you want.
To create an embedding of tweet j, you can use any type of neural network that takes tokens and produces a vector. Research prior to 2018 would have suggested using an LSTM or a CNN to produce the vector. However, BERT (as you mentioned in your post) is the current state-of-the-art. It takes in text (or text indices) and produces a vector for each token; one of those tokens should have been the prepended [CLS] token, which commonly is taken to be the representation of the whole sentence. This article provides a conceptual overview of the process. It is in this part of the architecture that you can fine-tune BERT. This webpage provides concrete code using PyTorch and the Huggingface implementation of BERT to do this step (I've gone through the steps and can vouch for it). In the future, you'll want to google for "BERT single sentence classification".
To create an embedding representing the combination of user i and tweet j, you can do one of many things. You can simply concatenate them together into one vector; so if user i is an M-dimensional vector and tweet j is an N-dimensional vector, then the concatenation produces an (M+N)-dimensional vector. An alternative approach is to compute the Hadamard product (element-wise multiplication); in this case, both vectors must have the same dimension.
To make the final classification of retweet or not-retweet, build a simple NN that takes the combination vector and produces a single value. Here, since you are doing binary classification, a NN with a logistic (sigmoid) output would be appropriate. You can interpret the output as the probability of retweeting, so a value above 0.5 would be read as a predicted retweet. See this webpage for basic details on building a NN for binary classification.
In order to get this whole system to work, you need to train it all together end-to-end. That is, you have to get all the pieces hooked up first and train it rather than training the components separately.
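To make the last two steps concrete, here is a rough sketch of the forward pass only: combining the two embeddings and turning the result into a retweet probability. It is plain Python with made-up numbers, not a trained model. The vectors, weights, and function names are all hypothetical, and a single logistic unit stands in for the "simple NN". In practice the embeddings would come from the user network and BERT, and the weights would be learned end-to-end.

```python
import math


def concat(user_vec, tweet_vec):
    """Concatenate an M-dim and an N-dim vector into an (M+N)-dim vector."""
    return list(user_vec) + list(tweet_vec)


def hadamard(user_vec, tweet_vec):
    """Element-wise product; both vectors must have the same dimension."""
    if len(user_vec) != len(tweet_vec):
        raise ValueError("Hadamard product needs vectors of equal length")
    return [u * t for u, t in zip(user_vec, tweet_vec)]


def retweet_probability(combined, weights, bias):
    """Single logistic (sigmoid) unit: returns a value between 0 and 1."""
    z = sum(w * x for w, x in zip(weights, combined)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical embeddings for user i and tweet j
user_vec = [0.2, -0.5, 0.1]
tweet_vec = [0.4, 0.3, -0.7]

combined = concat(user_vec, tweet_vec)    # 6-dimensional
elementwise = hadamard(user_vec, tweet_vec)  # 3-dimensional

# Untrained placeholder weights, so the output is exactly 0.5
prob = retweet_probability(combined, weights=[0.0] * 6, bias=0.0)
print(len(combined), elementwise, prob)
```

Concatenation lets the classifier learn its own interactions between user and tweet features, while the Hadamard product bakes in a per-dimension interaction and keeps the input smaller. Which works better is an empirical question.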
Your input dataset would look something like this:
user                          tweet                  retweet?
----                          -----                  --------
20 years old, likes sports    Great game             Y
30 years old, photographer    Teen movie was good    N
If you want an easier route where there is no user personalization, then just leave out the components that create an embedding of user i. You can use BERT to build a model to determine if the tweet is retweeted without regard to user. You can again follow the links I mentioned above.
There is already an answer on this on Data Science SE, which explains why BERT cannot be used for next-word prediction. Here is the gist:
BERT can't be used for next word prediction, at least not with the current state of the research on masked language modeling.
BERT is trained on a masked language modeling task and therefore you cannot "predict the next word". You can only mask a word and ask BERT to predict it given the rest of the sentence (both to the left and to the right of the masked word).
But as I understand your case, you want to do classification, and BERT is fully equipped for that. Please refer to the link I have posted below. It will help you classify the tweets by topic so that you can view them in your leisure time.
After reading the tutorial in gensim's docs, I do not understand the correct way of generating new embeddings from a trained model. So far I have trained gensim's FastText embeddings like this:
from gensim.models.fasttext import FastText as FT_gensim
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(corpus_file=corpus_file)
# train the model
model_gensim.train(
    corpus_file=corpus_file, epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
)
Then, let's say I want to get the embedding vectors associated with these sentences:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
How can I get them with model_gensim that I trained previously?
You can look up each word's vector in turn:
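# Look up through .wv; indexing the model object directly is deprecated in newer gensim releases
wordvecs_obama = [model_gensim.wv[word] for word in sentence_obama]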
For your 7-word input sentence, you'll then have a list of 7 word-vectors in wordvecs_obama.
FastText models do not, as a matter of their inherent functionality, convert longer texts into single vectors. (And specifically, the model you've trained doesn't have a default way of doing that.)
There is a "classification mode" in the original Facebook FastText code that involves a different style of training, where texts are associated with known labels at training time, and all the word-vectors of the sentence are combined during training, and when the model is later asked to classify new texts. But, the gensim implementation of FastText does not currently support this mode, as gensim's goal has been to supply unsupervised rather than supervised algorithms.
You could approximate what that FastText mode does by averaging together those word-vectors:
import numpy as np
meanvec_obama = np.array(wordvecs_obama).mean(axis=0)
Depending on your ultimate purposes, something like that might still be useful. (But that average wouldn't be as useful for classification as it would be if the word-vectors had originally been trained for that goal, with known labels, in that FastText mode.)
I am working on a classification problem with Twitter data. User-labeled tweets (relevant, not relevant) are used to train a machine learning classifier to predict whether an unseen tweet is relevant to the user.
I use simple preprocessing techniques such as stopword removal and stemming, and sklearn's TfidfVectorizer to convert the words into numbers before feeding them into a classifier, e.g. SVM, kernel SVM, or Naïve Bayes.
I would like to determine which words (features) have the highest predictive power. What is the best way to do so?
I have tried wordcloud but it just shows the words with highest frequency in the sample.
UPDATE:
The following approach, along with sklearn's feature_selection, seems to provide the best answer to my problem so far:
top features
Any other suggestions?
Have you tried using tf-idf? It creates a weighted matrix that gives greater weight to the more distinctive words of each text, by comparing the individual text (in this case a tweet) to all of the texts (all of the tweets). It is much more helpful than raw term counts for classification and other tasks. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
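To show what that weighting does, here is a small pure-Python sketch of tf-idf on three toy tweets. It uses the smoothed idf formula ln((1 + n) / (1 + df)) + 1, which I believe matches TfidfVectorizer's default, but it skips the L2 normalisation sklearn applies afterwards, so the numbers will not be identical. The tfidf function and the example tweets are made up for illustration. Also keep in mind that tf-idf is unsupervised: a high weight means a word is distinctive for a tweet, not necessarily that it predicts the label.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    # Raw term count in the document times idf
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in docs]


tweets = [
    "great game tonight".split(),
    "great movie tonight".split(),
    "great weather".split(),
]
weights = tfidf(tweets)

# Terms of the first tweet, most distinctive first
ranked = sorted(weights[0].items(), key=lambda kv: -kv[1])
print(ranked)
```

Here "great" appears in every tweet and gets the lowest weight, while "game" appears in only one and gets the highest.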
I'm trying to slowly begin working on a Twitter recommender system as part of a project, which requires me to use some form of deep learning. My goal is to recommend other tweets based on the topical content of a tweet with unlabelled data.
I have pre-processed my data and trained a few variations of models in doc2vec to get both word embeddings and document embeddings. But my issue is that I feel a little lost with where to go from here. I've read that doc2vec can be used as an input to a deeper neural network for training such as an LSTM or even a CNN.
Could anyone help me understand how these document embeddings (and word embeddings; I trained the model in DM mode) are used as input, and what the purpose of the neural net would be in this case? Is it for clustering? I understand the question is a little open-ended, but I'm quite new to all this, so any help would be appreciated.
If you have trained a d-dimensional doc2vec model, each document's vector becomes the input vector for that particular tweet. If you have n documents, together they form an n × d matrix, which can be given to the neural network. Note that LSTM and CNN models are typically used for supervised learning problems (where you have labeled data).
If you don't have labelled data, then go for unsupervised learning. Clustering falls under this! You can run different clustering algorithms and base your recommendations on the resulting clusters.
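As a rough illustration of the unsupervised route, here is a sketch that recommends tweets by cosine similarity between document vectors. Note that this is a nearest-neighbour lookup, a simpler stand-in for a full clustering algorithm, and the vectors below are made-up 3-dimensional examples rather than real doc2vec output. The names recommend and doc_vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(query_id, doc_vectors, top_n=2):
    """Return the ids of the top_n documents most similar to query_id."""
    query = doc_vectors[query_id]
    scored = [(cosine(query, vec), doc_id)
              for doc_id, vec in doc_vectors.items() if doc_id != query_id]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_n]]


# Hypothetical document embeddings (one row of the n x d matrix per tweet)
doc_vectors = {
    "tweet_a": [0.9, 0.1, 0.0],
    "tweet_b": [0.8, 0.2, 0.1],
    "tweet_c": [0.0, 0.1, 0.9],
    "tweet_d": [0.1, 0.0, 0.8],
}
print(recommend("tweet_a", doc_vectors, top_n=1))
print(recommend("tweet_c", doc_vectors, top_n=1))
```

With real data you would fill doc_vectors from the trained doc2vec model, and you could swap the lookup for k-means or another clustering method and recommend tweets from the same cluster.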
I'm classifying content into generic topics such as Music, Technology, Arts, and Science using LDA.
This is the process I'm using:
9 topics -> Music, Technology, Arts, Science, etc.
9 documents -> Music.txt, Technology.txt, Arts.txt, Science.txt, etc.
I've filled each document (.txt file) with about 10,000 lines of what I think is "pure" categorical content.
I then classify a test document, to see how well the classifier is trained
My questions are:
a.) Is this an efficient way to classify text (using the above steps)?
b.) Where should I be looking for "pure" topical content to fill each of these files? Sources which are not too large (text data > 1GB)
Classification is only on "generic" topics such as the above.
a) The method you describe sounds fine, but everything will depend on the implementation of labeled LDA that you're using. One of the best implementations I know is the Stanford Topic Modeling Toolbox. It is not actively developed anymore, but it worked great when I used it.
b) You can look for topical content on DBPedia, which has a structured ontology of topics/entities, and links to Wikipedia articles on those topics/entities.
I suggest using a bag-of-words (BoW) representation for each class, or vectors where each column is the frequency of an important keyword related to the class you want to target.
As for dictionaries, you have DBPedia, as yves mentioned, or WordNet.
a.) The simplest solution is surely the k-nearest neighbors algorithm (kNN). It will classify new texts with categorical content using an overlap metric.
You can find resources here: https://github.com/search?utf8=✓&q=knn+text&type=Repositories&ref=searchresults
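For what it's worth, a kNN text classifier with an overlap metric fits in a few lines. This is only a sketch: I picked Jaccard similarity on word sets as the overlap metric (other choices are possible), and the tiny training set, the function names, and k=3 are all made up for illustration.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap between two token collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_classify(tokens, labeled_docs, k=3):
    """Majority vote among the k labeled documents most similar to tokens."""
    ranked = sorted(labeled_docs,
                    key=lambda item: jaccard(tokens, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples, one (tokens, topic) pair each
training = [
    ("guitar album concert band".split(), "Music"),
    ("song band tour album".split(), "Music"),
    ("software chip laptop code".split(), "Technology"),
    ("code server software cloud".split(), "Technology"),
]

print(knn_classify("new album from the band".split(), training, k=3))
print(knn_classify("laptop software code".split(), training, k=3))
```

With 10,000 lines per topic you would compare against every stored line, which is simple but slow, so some indexing or sampling may be needed at that scale.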
Dataset issue:
If you are classifying live user feeds, then I guess no single dataset will meet your requirements.
If a new movie X is released, your classifier might not catch it, because the training dataset is already out of date.
To stay up to date, I would use Twitter training datasets. Develop a dynamic algorithm that updates the classifier with the latest tweet datasets. You could select the top 15-20 hashtags for each category of your choice to get the most relevant dataset for each category.
Classifier:
Most classifiers use a bag-of-words model. You can try out various classifiers and see which gives the best result. See:
http://www.nltk.org/howto/classify.html
http://scikit-learn.org/stable/supervised_learning.html