Is it possible to fine-tune BERT to do retweet prediction? - machine-learning

I want to build a classifier that predicts if user i will retweet tweet j.
The dataset is huge; it contains 160 million tweets. Each tweet comes along with some metadata (e.g. whether the potential retweeter follows the author of the tweet).
The text tokens for a single tweet are given as an ordered list of BERT token ids, so to get the embedding of a tweet you work with the ids directly (there is no raw text).
Is it possible to fine-tune BERT to do the prediction? If yes, what courses/sources do you recommend for learning how to fine-tune? (I'm a beginner.)
I should add that the prediction should be a probability.
If it's not possible, I'm thinking of converting the token ids back to text and then using some arbitrary classifier that I'm going to train.

You can fine-tune BERT, and you can use BERT to do retweet prediction, but you need more architecture in order to predict if user i will retweet tweet j.
Here is an architecture off the top of my head.
At a high level:
Create a dense vector representation (embedding) of user i (perhaps containing something about the user's interests, such as sports).
Create an embedding of tweet j.
Create an embedding of the combination of the first two embeddings, for example by concatenation or the Hadamard product.
Feed this combined embedding through a NN that performs binary classification to predict retweet or non-retweet.
Let's break this architecture down by item.
To create an embedding of user i, you will need to create some kind of neural network that accepts whatever features you have about the user and produces a dense vector. This part is the most difficult component of the architecture. This area is not in my wheelhouse, but a quick Google search for "user interest embedding" brings up this research paper on an algorithm called StarSpace. It suggests that it can "obtain highly informative user embeddings according to user behaviors", which is what you want.
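As a concrete (if simplistic) illustration, the user encoder could be a small feed-forward network over whatever numeric user features you have. The feature count and layer sizes below are made up; this is only a sketch, not a reference implementation:

import torch.nn as nn

class UserEncoder(nn.Module):
    """Map a vector of user features (age, interests, follow stats, ...) to a dense embedding."""
    def __init__(self, n_user_features, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_user_features, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, user_features):       # (batch, n_user_features)
        return self.net(user_features)      # (batch, embed_dim)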
To create an embedding of tweet j, you can use any type of neural network that takes tokens and produces a vector. Research prior to 2018 would have suggested using an LSTM or a CNN to produce the vector. However, BERT (as you mentioned in your post) is the current state of the art. It takes in text (or token indices) and produces a vector for each token; one of those tokens is the prepended [CLS] token, which is commonly taken to represent the whole sentence. This article provides a conceptual overview of the process. It is in this part of the architecture that you can fine-tune BERT. This webpage provides concrete code using PyTorch and the Huggingface implementation of BERT to do this step (I've gone through the steps and can vouch for it). In the future, you'll want to google for "BERT single sentence classification".
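For illustration, here is a minimal sketch (not the code from the pages linked above) of pulling a tweet embedding out of the [CLS] position with the Hugging Face transformers library, assuming a recent version. The model name and example tweet are placeholders, and since you already have BERT ids you could feed those in directly instead of tokenizing text:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Great game last night!", return_tensors="pt")
with torch.no_grad():                        # drop no_grad when fine-tuning
    outputs = bert(**inputs)

# last_hidden_state has shape (batch, seq_len, 768); position 0 is the [CLS] token
tweet_embedding = outputs.last_hidden_state[:, 0, :]    # (1, 768)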
To create an embedding representing the combination of user i and tweet j, you can do one of many things. You can simply concatenate them into one vector; if user i is an M-dimensional vector and tweet j is an N-dimensional vector, the concatenation produces an (M+N)-dimensional vector. An alternative approach is to compute the Hadamard product (element-wise multiplication); in this case, both vectors must have the same dimension.
To make the final classification of retweet or not-retweet, build a simple NN that takes the combination vector and produces a single value. Here, since you are doing binary classification, a NN with a logistic (sigmoid) output would be appropriate. You can interpret the output as the probability of retweeting, so a value above 0.5 would be interpreted as a retweet. See this webpage for basic details on building a NN for binary classification.
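A minimal sketch of these two steps together, with made-up dimensions (for the Hadamard option the user embedding is first projected to the tweet embedding's size so the element-wise product is defined):

import torch
import torch.nn as nn

class RetweetHead(nn.Module):
    def __init__(self, user_dim=128, tweet_dim=768, mode="concat"):
        super().__init__()
        self.mode = mode
        in_dim = user_dim + tweet_dim if mode == "concat" else tweet_dim
        self.proj = nn.Linear(user_dim, tweet_dim)      # only used for "hadamard"
        self.classifier = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),                               # output in (0, 1)
        )

    def forward(self, user_vec, tweet_vec):
        if self.mode == "concat":
            combined = torch.cat([user_vec, tweet_vec], dim=-1)
        else:                                           # Hadamard product
            combined = self.proj(user_vec) * tweet_vec
        return self.classifier(combined)                # retweet probability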
In order to get this whole system to work, you need to train it all together end-to-end. That is, you have to get all the pieces hooked up first and train it rather than training the components separately.
Your input dataset would look something like this:
user                        | tweet               | retweet?
----------------------------|---------------------|---------
20 years old, likes sports  | Great game          | Y
30 years old, photographer  | Teen movie was good | N
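Putting the pieces together, a rough end-to-end training loop over rows like the ones above might look like this. Here `train_loader` is an assumed DataLoader yielding (user features, token ids, attention mask, label) batches, and `bert`, `UserEncoder`, and `RetweetHead` come from the sketches earlier, so treat this as an outline rather than a reference implementation:

import torch
import torch.nn as nn

user_encoder = UserEncoder(n_user_features=32)           # 32 is a placeholder
head = RetweetHead(user_dim=128, tweet_dim=768, mode="concat")
params = list(bert.parameters()) + list(user_encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=2e-5)
loss_fn = nn.BCELoss()

for user_features, token_ids, attention_mask, labels in train_loader:
    optimizer.zero_grad()
    cls_vec = bert(input_ids=token_ids,
                   attention_mask=attention_mask).last_hidden_state[:, 0, :]
    probs = head(user_encoder(user_features), cls_vec).squeeze(-1)
    loss = loss_fn(probs, labels.float())
    loss.backward()                                      # gradients flow into BERT too
    optimizer.step()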
If you want an easier route where there is no user personalization, then just leave out the components that create an embedding of user i. You can use BERT to build a model that determines whether the tweet is retweeted, without regard to the user. You can again follow the links I mentioned above.
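For that simpler route, a minimal fine-tuning sketch with Hugging Face's BertForSequenceClassification might look like the following (placeholder tweets and labels, a single training step, and a recent transformers version assumed):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["Great game", "Teen movie was good"]   # placeholder tweets
labels = torch.tensor([1, 0])                   # 1 = retweeted, 0 = not retweeted

optimizer.zero_grad()
batch = tokenizer(texts, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)         # loss is computed internally
outputs.loss.backward()
optimizer.step()

# A softmax over the logits gives a retweet probability for each tweet.
probs = torch.softmax(outputs.logits, dim=-1)[:, 1]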

There is already an answer on Data Science SE that explains why BERT cannot be used for next-word prediction. Here is the gist:
BERT can't be used for next word prediction, at least not with the current state of the research on masked language modeling.
BERT is trained on a masked language modeling task and therefore you cannot "predict the next word". You can only mask a word and ask BERT to predict it given the rest of the sentence (both to the left and to the right of the masked word).
But as I understand your case, you want to do classification, and BERT is fully equipped to do that. Please refer to the link I have posted below; it will help you classify tweets according to their topic so you can then view them at your leisure.

Related

Multi-Class Text Classification (using TFIDF and SVM). How to implement a scenario where one feedback may belong to more than one class?

I have a file of raw feedback that needs to be labeled (categorized) and then serve as the training input for an SVM classifier (or any classifier for that matter).
But the catch is that I'm not assigning a whole feedback to a single category. One feedback may belong to more than one category based on the topics it talks about (noun n-grams are extracted). So I'm labeling the topics (terms), not the feedbacks (documents). I've extracted the n-grams using TFIDF, saving their features so I could train my model on them. The problem is that TFIDF returns a document-term matrix, which becomes train_x, while on the other side I have train_y: the labels assigned to each n-gram (not to the whole document). So I've ended up with a document-term matrix of x rows (the number of documents) against labels for y n-grams (the number of unique topics extracted).
Below is a sample of what the data look like: blue is the n-grams (extracted by TFIDF), while red is the labels/categories (calculated for each n-gram with a function I made manually).
Instead of posting code, this is my strategy for implementing the concept:
The problem lies in the part where TFIDF produces x_train = tf.transform(feedbacks), which is a document-term matrix, so it doesn't make sense for it to be the classifier's input against y_train, which holds labels for the terms and not the documents. I've tried to transpose the matrix, but it gave me an error. I've tried to directly input a 1-D array that holds only the feature values for the terms, which also gave me an error because the classifier expects X to be in a (sample, feature) format. I'm using Sklearn's version of SVM and TfidfVectorizer.
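That said, to make the shape mismatch concrete, here is a tiny illustration with placeholder feedback texts (just a sketch of the problem, not my actual code):

from sklearn.feature_extraction.text import TfidfVectorizer

feedbacks = ["login page is slow", "slow checkout and broken login page"]   # placeholders
tf = TfidfVectorizer(ngram_range=(1, 2))
x_train = tf.fit_transform(feedbacks)

print(x_train.shape)          # (number of documents, number of terms)
print(len(tf.vocabulary_))    # number of terms, which is what the labels y actually cover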
Simply put, I want to be able to use an SVM classifier on a list of terms (n-grams) against a list of labels to train the model, and then test new data (after cleaning it and extracting its n-grams) so the SVM can predict its labels.
The solution might be something very technical, like using another classifier that expects a different format, or not using TFIDF since it's document-focused, or, more broadly, a whole change of approach and concept (if mine is wrong).
I'd very much appreciate it if someone could help.

Data augmentation for text classification

What is the current state-of-the-art data augmentation technique for text classification?
I did some research online about how I can extend my training set with data transformations, the same way we do in image classification.
I found some interesting ideas such as:
Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random place in the sentence. Do this n times.
Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
Random Deletion: Randomly remove each word in the sentence with probability p. (See the sketch after this list.)
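For reference, a rough sketch of the random swap and random deletion operations described above (whitespace tokenization, no edge-case handling):

import random

def random_swap(words, n=1):
    words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]   # never return an empty sentence

sentence = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(random_swap(sentence, n=2)))
print(" ".join(random_deletion(sentence, p=0.2)))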
But I found nothing about using a pre-trained word vector representation model such as word2vec. Is there a reason?
Data augmentation using word2vec might help the model get more data based on external information: for instance, randomly replacing a token in a toxic comment with its closest token in a vector space pre-trained specifically on external online comments.
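Something like the following rough sketch using gensim's KeyedVectors (the model path, replacement probability, and neighbour count are placeholders):

import random
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("comments_word2vec.bin", binary=True)  # placeholder path

def embedding_replace(words, p=0.1, topn=5):
    out = []
    for w in words:
        if w in wv and random.random() < p:
            neighbour, _ = random.choice(wv.most_similar(w, topn=topn))
            out.append(neighbour)
        else:
            out.append(w)
    return out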
Is this a good method, or am I missing some important drawbacks of this technique?
Your idea of using word2vec embeddings usually helps. However, those are context-free embeddings. To go one step further, the state of the art (SOTA) as of today (2019-02) is to use a language model trained on a large corpus of text and fine-tune your own classifier with your own training data.
The two SOTA models are:
GPT-2 https://github.com/openai/gpt-2
BERT https://github.com/google-research/bert
The data augmentation methods you mentioned might also help (it depends on your domain and the number of training examples you have). Some of them are actually used in language model training (for example, BERT randomly masks out words in a sentence at pre-training time). If I were you, I would first adopt a pre-trained model and fine-tune your own classifier with your current training data. Taking that as a baseline, you could then try each of the data augmentation methods you like and see if they really help.

Questions about the feature vector of a tweet in a Twitter sentiment analysis task

https://aclanthology.info/pdf/S/S17/S17-2132.pdf
In this paper that describes one of the systems used in the shared task of SemEval2017, it explains how an SVM Classifier was implemented for a twitter sentiment analysis task.
I am trying to imitate this system, but there are some parts that I, as a beginner, don't understand.
It lists a bunch of features that describe a single tweet. To implement a feature vector X (where a single row represents a single tweet and the columns contain all values of the features listed in the paper), am I supposed to compute a vector for each feature and concatenate them at the end?
One of the features is the Tf-Idf vectorizer. As far as I know, Tf-Idf gives us the weight of the word per document. But in this case, what would be the document? Wouldn't using Tf-idf as one of the features cause the final X to have rows with different numbers of columns? I just want to know how Tf-idf would work here as one of the feature vectors.
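For context, here is a small sketch of the kind of per-tweet concatenation I am imagining, with placeholder tweets and placeholder extra features (the Tf-Idf vocabulary is fixed after fitting on the training tweets, so every row of that block has the same width):

import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["great game tonight", "worst referee ever"]
extra_features = np.array([[0.3, 1.0],     # e.g. lexicon scores, hashtag counts (made up)
                           [0.9, 0.0]])

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(tweets)      # shape: (n_tweets, vocab_size)
X = hstack([X_tfidf, extra_features])      # one row per tweet, fixed total width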
What does the feature in section 2.8 (Topic and Hashtag) indicate? I don't quite understand what kind of value it is describing.
Any kind of advice would be great! Please help!

Embeddings with recurrent neural networks

I am working on a research project on text data (it's about supervised classification of search engine queries). I have already implemented different methods and I have also used different representations of the text (such as binary vectors of the dimension of my vocabulary - 1 if the i-th word appears in the text, 0 otherwise - or word embeddings with the word2vec model).
My advisor told me that maybe we could find another representation of the queries using a Recurrent Neural Network. This representation should take into account the sequentiality of the words in the text thanks to the recurrence relation. I have read some documentation about RNNs but I haven't found anything useful for this goal. I have read a lot about language modelling (which predicts word probabilities), but I don't understand how I could adapt this model in order to obtain something like an embedding vector.
Thank you very much!
Usually, if one wants to obtain embeddings from a query or a sentence using an RNN, the logits are used. The logits are simply the output values of the network after the forward pass of the full sentence/query.
The logit values form a vector with the dimension of the output layer (i.e. the number of target classes); usually this is the vocabulary size, since they are extracted from a language model.
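As an illustration of this idea, here is a minimal PyTorch sketch (sizes and names are made up) that takes the logits after the last time step of a language-model-style RNN as the query embedding:

import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)   # logits over the vocabulary

    def forward(self, token_ids):                  # (batch, seq_len)
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.decoder(hidden)                # (batch, seq_len, vocab_size)

model = RNNLanguageModel(vocab_size=10000)
query = torch.randint(0, 10000, (1, 6))            # a toy 6-token query
with torch.no_grad():
    query_embedding = model(query)[:, -1, :]       # logits after the full query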
For hints have a look at these:
http://arxiv.org/abs/1603.07012
How does word2vec give one hot word vector from the embedding vector?
Note that in principle one could also use bidirectional networks or networks trained on other tasks, obtaining smaller embeddings, even if this last option is somewhat exotic and, to my knowledge, has not been explored.

Am I using word-embeddings correctly?

Core question: what is the right way (or ways) of using word embeddings to represent text?
I am building a sentiment classification application for tweets, classifying tweets as negative, neutral, or positive.
I am doing this using Keras on top of Theano and using word embeddings (Google's word2vec or Stanford's GloVe).
To represent the tweet text I have done the following:
Used a pre-trained model (such as the word2vec-twitter model) [M] to map words to their embeddings.
Used the words in the text to query M to get the corresponding vectors. So if the tweet (T) is "Hello world", M gives vectors V1 and V2 for the words 'Hello' and 'World'.
The tweet T can then be represented (V) as either V1+V2 (adding the vectors) or the concatenation of V1 and V2 [these are 2 different strategies]. [Concatenation means juxtaposition, so if V1 and V2 are d-dimensional vectors, in my example T is a 2d-dimensional vector.]
Then, the tweet T is represented by vector V.
If I follow the above, then my dataset is nothing but vectors (which are sums or concatenations of word vectors, depending on which strategy I use).
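Concretely, the two strategies look like this with gensim (the model path is a placeholder):

import numpy as np
from gensim.models import KeyedVectors

# placeholder path; in practice the word2vec-twitter (or GloVe) vectors would be loaded here
M = KeyedVectors.load_word2vec_format("word2vec_twitter_model.bin", binary=True)

v1, v2 = M["hello"], M["world"]            # d-dimensional word vectors

tweet_sum = v1 + v2                        # strategy 1: add          -> d dimensions
tweet_concat = np.concatenate([v1, v2])    # strategy 2: concatenate  -> 2d dimensions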
I am training a deep net such as an FFN or LSTM on this dataset, but my results aren't coming out great.
Is this the right way to use word embeddings to represent text? What are other, better ways?
Your feedback/critique will be of immense help.
I think that, for your purpose, it is better to think about another way of composing those vectors. The literature on word embeddings contains examples of criticisms of these kinds of composition (I will edit the answer with the correct references as soon as I find them).
I would suggest you also consider other possible approaches, for instance:
Using the single word vectors as input to your net (I do not know your architecture, but the LSTM is recurrent so it can deal with sequences of words).
Using a full paragraph embedding (e.g. https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
To be honest, summing them doesn't make much sense, because when you sum them you get another vector which I don't think represents the semantics of "Hello World"; maybe it does, but it certainly won't hold true for longer sentences in general.
Instead, it would be better to feed them in as a sequence, since that way the word order is preserved in a meaningful way, which seems to fit your problem better.
For example, "A hates apple" vs. "Apple hates A": this difference would be captured when you feed them as sequences into an RNN, but their sums will be the same.
I hope you get my point!
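To make the sequence approach concrete, here is a rough Keras sketch; the vocabulary size, sequence length, and embedding matrix are placeholders (in practice the matrix would be filled with the pre-trained word2vec/GloVe vectors and kept frozen):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size, embed_dim, max_len = 20000, 300, 40
# placeholder; fill with the pre-trained word vectors, one row per word index
embedding_matrix = np.random.rand(vocab_size, embed_dim)

model = Sequential([
    Embedding(vocab_size, embed_dim, weights=[embedding_matrix],
              input_length=max_len, trainable=False),
    LSTM(128),                               # reads the tweet as a sequence of word vectors
    Dense(3, activation="softmax"),          # negative / neutral / positive
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])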

Resources