Multi-Class Text Classification (using TF-IDF and SVM): how to implement a scenario where one feedback may belong to more than one class? - machine-learning

I have a file of raw feedback that needs to be labeled (categorized) and then serve as the training input for an SVM classifier (or any classifier, for that matter).
But the catch is, I'm not assigning the whole feedback to a certain category. One feedback may belong to more than one category based on the topics it talks about (noun n-grams are extracted). So I'm labeling the topics (terms), not the feedbacks (documents). I've extracted the n-grams with TF-IDF, saving their features so I could train my model on them. The problem is that TF-IDF returns a document-term matrix, which becomes train_x, while on the other side train_y holds the labels assigned to each n-gram (not to the whole document). So I've ended up with a document-term matrix of x rows (the number of documents) set against labels for y n-grams (the number of unique topics extracted).
Below is a sample of what the data look like. Blue shows the n-grams (extracted by TF-IDF) while red shows the labels/categories (calculated for each n-gram with a function I wrote manually).
Instead of posting code, this is my strategy for implementing the concept:
The problem lies in the part where TF-IDF produces x_train = tf.transform(feedbacks), which is a document-term matrix; it doesn't make sense for it to be the classifier's input against y_train, which holds labels for the terms, not the documents. I've tried transposing the matrix, which gave me an error. I've also tried passing a 1-D array that holds only the feature values for the terms directly, which gave an error as well, because the classifier expects X to be in (sample, feature) format. I'm using scikit-learn's SVM and TfidfVectorizer.
Simply put, I want to be able to train an SVM classifier on a list of terms (n-grams) against a list of labels, and then feed it new data (after cleaning it and extracting its n-grams) so the SVM can predict its labels.
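Roughly, a minimal sketch of this term-level setup would be the following (the terms, labels, and vectorizer settings below are placeholders, not my real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Each extracted n-gram is its own sample; labels are per term, not per document.
terms = ["delivery time", "customer support", "mobile app", "late delivery"]
term_labels = ["logistics", "service", "product", "logistics"]

# Vectorise the terms themselves; character n-grams help generalise to
# similar but unseen terms at prediction time.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X_train = vectorizer.fit_transform(terms)      # shape: (n_terms, n_features)

clf = LinearSVC()
clf.fit(X_train, term_labels)                  # one label per term

# New feedback: clean it, extract its n-grams, then predict per term.
new_terms = ["slow delivery", "helpful support"]
print(clf.predict(vectorizer.transform(new_terms)))
```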
The solution might be something quite technical, like using another classifier that expects a different format, or not using TF-IDF since it's document-focused, or even broader: a whole change of approach and concept (if mine is wrong).
I'd very much appreciate it if someone could help.

Related

Time series Autoencoder with Metadata

At the moment I'm trying to build an Autoencoder for detecting anomalies in time series data.
My approach is based on this tutorial: https://keras.io/examples/timeseries/timeseries_anomaly_detection/
But, as is often the case, my data is more complex than in this simple tutorial.
I have two different time series from two sensors, plus some metadata, like which machine the time series was recorded from.
With a normal MLP network you could have one network for the time series and one for the metadata and merge them in higher layers. But how can you use this data as input to an autoencoder?
Do you have any ideas, or links to tutorials or papers I haven't found?
In this tutorial you can see an LSTM-VAE where the input time series is somehow concatenated with categorical data: https://github.com/cerlymarco/MEDIUM_NoteBook/tree/master/VAE_TimeSeries
There is an article explaining the code (but not in detail). There you can find the following explanation of the model:
"The encoder consists of an LSTM cell. It receives as input 3D sequences resulting from the concatenation of the raw traffic data and the embeddings of categorical features. As in every encoder in a VAE architecture, it produces a 2D output that is used to approximate the mean and the variance of the latent distribution. The decoder samples from the 2D latent distribution upsampling to form 3D sequences. The generated sequences are then concatenated back with the original categorical embeddings which are passed through an LSTM cell to reconstruct the original traffic sequences."
But sadly I don't understand exactly how they concatenate the input data. If you understand it, it would be nice if you could explain it =)
I think I understood it. You have to take a look at the input of the .fit() function. It is not one array; there are separate arrays for the separate categorical features, and additionally there is the original input (in this case a time series). Because he has so many arrays in the input, he needs a corresponding number of input layers. So there is one Input layer for the time series, another for the same time series (it's an autoencoder, so x_train works like y_train), and a list of Input layers, each stacked directly with an embedding layer for its categorical feature. After he has all the data in the corresponding Input layers, he can concatenate them as you said.
By the way, he's using the same list for the decoder to give it additional information. I tried it out, and it turns out it was helpful to add a dropout layer (high dropout, e.g. 0.6) between the additional inputs and the decoder. If you do so, the decoder has to learn from the latent z and not only from the additional data!
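For reference, a rough Keras sketch of that multi-input idea (layer sizes, names, and the single categorical feature are assumptions, not the notebook's exact code):

```python
import numpy as np
from tensorflow.keras import layers, Model

timesteps, n_series = 50, 2        # two sensor channels
n_machines, emb_dim = 10, 4        # categorical metadata: machine id

seq_in = layers.Input(shape=(timesteps, n_series), name="series")
machine_in = layers.Input(shape=(1,), name="machine_id")

# Embed the categorical feature and repeat it along the time axis so it can
# be concatenated with the sequence input.
emb = layers.Flatten()(layers.Embedding(n_machines, emb_dim)(machine_in))
emb_seq = layers.RepeatVector(timesteps)(emb)              # (batch, timesteps, emb_dim)

x = layers.Concatenate()([seq_in, emb_seq])                # (batch, timesteps, n_series + emb_dim)
encoded = layers.LSTM(16)(x)                               # latent code

# Decoder: repeat the code and concatenate the metadata again; the high
# Dropout forces the decoder to rely on the latent code, as described above.
d = layers.RepeatVector(timesteps)(encoded)
d = layers.Concatenate()([d, layers.Dropout(0.6)(emb_seq)])
d = layers.LSTM(16, return_sequences=True)(d)
out = layers.TimeDistributed(layers.Dense(n_series))(d)

ae = Model([seq_in, machine_in], out)
ae.compile(optimizer="adam", loss="mse")

# fit() takes a *list* of arrays, one per Input layer; the target is the
# original series, because it is an autoencoder.
X = np.random.rand(32, timesteps, n_series).astype("float32")
machine_ids = np.random.randint(0, n_machines, size=(32, 1))
ae.fit([X, machine_ids], X, epochs=2, batch_size=8)
```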
Hope I could help you =)

Is it possible to fine-tune BERT to do retweet prediction?

I want to build a classifier that predicts if user i will retweet tweet j.
The dataset is huge: it contains 160 million tweets. Each tweet comes along with some metadata (e.g. whether the retweeter follows the author of the tweet).
The text tokens for a single tweet are an ordered list of BERT ids. To get the embedding of the tweet, you just use the ids (so it is not text).
Is it possible to fine-tune BERT to do the prediction? If yes, what courses/sources do you recommend for learning how to fine-tune? (I'm a beginner.)
I should add that the prediction should be a probability.
If it's not possible, I'm thinking of converting the embeddings back to text then using some arbitrary classifier that I'm going to train.
You can fine-tune BERT, and you can use BERT to do retweet prediction, but you need more architecture in order to predict if user i will retweet tweet j.
Here is an architecture off the top of my head.
At a high level:
Create a dense vector representation (embedding) of user i (perhaps containing something about the user's interests, such as sports).
Create an embedding of tweet j.
Create an embedding of the combination of the first two embeddings, such as by concatenation or Hadamard product.
Feed this embedding through an NN that performs binary classification to predict retweet or non-retweet.
Let's break this architecture down by item.
To create an embedding of user i, you will need to create some kind of neural network that accepts whatever features you have about the user and produces a dense vector. This part is the most difficult component of the architecture. This area is not in my wheelhouse, but a quick google search for "user interest embedding" brings up this research paper on an algorithm called StarSpace. It suggests that it can "obtain highly informative user embeddings according to user behaviors", which is what you want.
To create an embedding of tweet j, you can use any type of neural network that takes tokens and produces a vector. Research prior to 2018 would have suggested using an LSTM or a CNN to produce the vector. However, BERT (as you mentioned in your post) is the current state-of-the-art. It takes in text (or text indices) and produces a vector for each token; one of those tokens should have been the prepended [CLS] token, which commonly is taken to be the representation of the whole sentence. This article provides a conceptual overview of the process. It is in this part of the architecture that you can fine-tune BERT. This webpage provides concrete code using PyTorch and the Huggingface implementation of BERT to do this step (I've gone through the steps and can vouch for it). In the future, you'll want to google for "BERT single sentence classification".
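As a quick illustration, pulling the [CLS] vector out of BERT with the Huggingface transformers library might look like this (the model name is an assumption; during fine-tuning you would drop the no_grad block and backpropagate through the model):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Great game last night!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Index 0 of the last hidden state is the prepended [CLS] token.
cls_embedding = outputs.last_hidden_state[:, 0, :]   # shape: (1, 768)
print(cls_embedding.shape)
```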
To create an embedding representing the combination of user i and tweet j, you can do one of many things. You can simply concatenate them together into one vector; so if user i is an M-dimensional vector and tweet j is an N-dimensional vector, then the concatenation produces an (M+N)-dimensional vector. An alternative approach is to compute the Hadamard product (element-wise multiplication); in this case, both vectors must have the same dimension.
To make the final classification of retweet or not-retweet, build a simple NN that takes the combination vector and produces a single value. Here, since you are doing binary classification, an NN with a logistic (sigmoid) output would be appropriate. You can interpret the output as the probability of retweeting, so a value above 0.5 would be interpreted as a retweet. See this webpage for basic details on building an NN for binary classification.
In order to get this whole system to work, you need to train it all together end-to-end. That is, you have to get all the pieces hooked up first and train it rather than training the components separately.
Your input dataset would look something like this:
user                         tweet                  retweet?
----                         -----                  --------
20 years old, likes sports   Great game             Y
30 years old, photographer   Teen movie was good    N
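Putting the pieces together, a minimal PyTorch sketch of the end-to-end model could look like the following (the dimensions, the user-feature encoder, and the choice of concatenation are illustrative assumptions, not a definitive implementation):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class RetweetModel(nn.Module):
    def __init__(self, n_user_features, user_dim=64):
        super().__init__()
        # Tweet encoder: fine-tuned together with the rest of the model.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # User-embedding component (a stand-in for something like StarSpace).
        self.user_encoder = nn.Sequential(nn.Linear(n_user_features, user_dim), nn.ReLU())
        # Binary classifier on the combined vector; the sigmoid gives P(retweet).
        self.head = nn.Sequential(
            nn.Linear(768 + user_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, input_ids, attention_mask, user_features):
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0, :]
        user = self.user_encoder(user_features)
        combined = torch.cat([cls, user], dim=1)   # concatenation; a Hadamard product would need equal dims
        return self.head(combined).squeeze(1)
```

Training this with nn.BCELoss() against 0/1 retweet labels updates every component, including BERT, end-to-end.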
If you want an easier route where there is no user personalization, then just leave out the components that create an embedding of user i. You can use BERT to build a model to determine if the tweet is retweeted without regard to user. You can again follow the links I mentioned above.
There is already an answer on this on Data Science SE, which explains why BERT cannot be used for next-word prediction. Here is the gist:
BERT can't be used for next word prediction, at least not with the current state of the research on masked language modeling.
BERT is trained on a masked language modeling task and therefore you cannot "predict the next word". You can only mask a word and ask BERT to predict it given the rest of the sentence (both to the left and to the right of the masked word).
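To make the masking point concrete, here is a quick sketch using the Huggingface fill-mask pipeline (the model name is an assumption):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
# BERT ranks candidate words for the [MASK] slot using both left and right
# context; it does not generate "the next word" like a left-to-right LM.
print(unmasker("The game last night was really [MASK]."))
```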
But as I understand your case, you want to do classification, and BERT is fully equipped to do that. Please refer to the link I have posted below. This will help you classify the tweets according to their topics so you can then view them at your leisure.

Is there a way to find the most representative set of samples of the entire dataset?

I'm working on text classification and I have a set of 200,000 tweets.
The idea is to manually label a small set of tweets and train classifiers to predict the labels of the rest. Supervised learning.
What I would like to know is whether there is a method to choose which samples to include in the training set so that it is a good representation of the whole dataset, and so that, thanks to the high diversity included in the training set, the trained classifiers can be trusted when applied to the rest of the tweets.
This sounds like a stratification question - do you have pre-existing labels or do you plan to design the labels based on the sample you're constructing?
If it's the first scenario, I think the steps in order of importance would be:
Stratify by target class proportions (so if you have three classes, and they are 50-30-20%, train/dev/test should follow the same proportions)
Stratify by features you plan to use
Stratify by tweet length/vocabulary etc.
If it's the second scenario, and you don't have labels yet, you may want to look into using n-grams as a feature, coupled with a dimensionality reduction or clustering approach. For example:
Use something like PCA or t-SNE to maximize distance between tweets (or a large subset), then pick candidates from different regions of the projected space
Cluster them based on lexical items (unigrams or bigrams, possibly using log frequencies or TF-IDF and stop word filtering, if content words are what you're looking for) - then you can cut the tree at a height that gives you n bins, which you can then use as a source for samples (stratify by branch); a rough code sketch of this idea follows the list
Use something like LDA to find n topics, then sample stratified by topic
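As a rough sketch of the clustering route (using flat k-means instead of a tree cut purely for brevity; the cluster count, sample sizes, and the `tweets` variable are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [...]  # placeholder: the 200,000 raw tweets

X = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(tweets)
clusters = KMeans(n_clusters=50, random_state=0).fit_predict(X)

# Sample up to 20 tweets per cluster, so the labelled set spans the whole space.
rng = np.random.default_rng(0)
sample_idx = np.concatenate([
    rng.choice(np.where(clusters == c)[0],
               size=min(20, int((clusters == c).sum())),
               replace=False)
    for c in range(50)
])
```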
Hope this helps!
It seems that before you know anything about the classes you are going to label, a simple uniform random sample will do almost as well as any stratified sample - because you don't know in advance what to stratify on.
After labelling this first sample and building the first classifier, you can start so-called active learning: make predictions for the unlabelled dataset, and sample some tweets on which your classifier is least confident. Label them, retrain the classifier, and repeat.
Using this approach, I managed to create a good training set after several (~5) iterations, with ~100 texts in each iteration.
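A minimal sketch of one such iteration with uncertainty sampling (the classifier choice, batch size, and the X_labelled / X_unlabelled variables are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_labelled, y_labelled)              # the ~100 tweets labelled so far

proba = clf.predict_proba(X_unlabelled)      # predictions for the unlabelled rest
confidence = proba.max(axis=1)               # how sure the model is per tweet
query_idx = np.argsort(confidence)[:100]     # the 100 least-confident tweets

# Label these by hand, move them into the labelled pool, retrain, repeat.
```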

Extract WHY a label was chosen in classification?

I currently have a system set up where I train from old posts/categories and try to predict what category a new post will be. I am using a pipeline with TfidfVectorizer and LinearSVC to train the dataset and storing that in a pickle, then I process new posts by loading that pickle and using predict from the loaded pickle to classify the new posts. Currently, I am struggling with a few labels and I don't know why.
I am looking to provide some output on what words were triggered in the new post for each classification label so that I can see why a certain label was chosen when classifying new data against a training set, but I cannot find a way to do this.
I know that I can output the top features in my vectorizer when I am training, but how can I output essentially the reason why a certain label was chosen over another one?
During the training phase of the SVM, a weight is learned for each word of the corpus vocabulary, for each of the classes.
Then, during inference, you calculate the dot product between the class weights and the vector description of the instance to be classified. The algorithm returns the class that yields the highest dot product score. Hence, you can get an estimate of how things work by examining those weights (the coef_ attribute) for your instance.
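For example, assuming the pipeline was built with make_pipeline(TfidfVectorizer(), LinearSVC()), a sketch of inspecting those weights for a new post could look like this (the step names and the pipeline / new_post variables are assumptions):

```python
import numpy as np

vec = pipeline.named_steps["tfidfvectorizer"]
svm = pipeline.named_steps["linearsvc"]
feature_names = np.array(vec.get_feature_names_out())   # get_feature_names() in older sklearn

x = vec.transform([new_post])            # TF-IDF vector of the new post
scores = x @ svm.coef_.T                 # one dot product per class (intercepts ignored)
class_idx = int(np.argmax(scores))       # for binary problems coef_ has one row and the sign decides
print("predicted label:", svm.classes_[class_idx])

# Per-word contribution to the winning class: tfidf weight * learned class weight.
contrib = x.toarray().ravel() * svm.coef_[class_idx]
top = np.argsort(contrib)[::-1][:10]
print(list(zip(feature_names[top], np.round(contrib[top], 3))))
```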
I agree, however, that other methods like trees are more interpretable.

Methods to ignore missing word features on test data

I'm working on a text classification problem, and I have problems with missing values on some features.
I'm calculating class probabilities of words from labeled training data.
For example:
Say the word foo occurs in class A 100 times and in class B 200 times. In this case, I compute its class probability vector as [0.33, 0.67] and give it, along with the word itself, to the classifier.
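For concreteness, a tiny sketch of that computation with the hypothetical counts above:

```python
counts = {"foo": {"A": 100, "B": 200}}   # class counts per word

word = "foo"
total = sum(counts[word].values())                            # 300
prob_vector = [counts[word][c] / total for c in ("A", "B")]
print(prob_vector)                                            # [0.333..., 0.666...]
```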
The problem is that, in the test set, there are some words that have not been seen in the training data, so they have no probability vectors.
What could I do about this problem?
I've tried using the average class probability vector of all words for missing values, but it did not improve accuracy.
Is there a way to make the classifier ignore some features during evaluation, just for the specific instances that do not have a value for a given feature?
Regards
There are many ways to achieve that:
Create and train classifiers for every subset of features you have. You can train each of these classifiers on its feature subset using the same data as the training of the main classifier.
For each sample, just look at the features it has and use the classifier that fits it best. Don't try to do any boosting with those classifiers.
Alternatively, just create a special class for samples that can't be classified, or for which you have found the results too poor with so few features.
Sometimes humans can't successfully classify samples either. In many cases, samples that can't be classified should just be ignored. The problem is not in the classifier but in the input, or can be explained by the context.
From an NLP point of view, many words have a meaning/usage that is very similar across many applications. So you can use stemming/lemmatization to create classes of words.
You can also use spelling/syntactic corrections, synonyms, or translations (does the word come from another part of the world?).
If this problem is important enough for you, you will probably end up with a combination of the three previous points.
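As a small sketch of the stemming idea: unseen inflected forms in the test set can be mapped back to a known stem before the probability lookup (train_probs and the default vector are hypothetical):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
train_probs = {"deliveri": [0.33, 0.67]}   # class-probability vectors keyed by stem

def lookup(word, default=(0.5, 0.5)):
    """Fall back to a neutral vector when even the stem was never seen."""
    return train_probs.get(stemmer.stem(word.lower()), list(default))

print(lookup("deliveries"))   # "deliveries" and "delivery" both stem to "deliveri" -> [0.33, 0.67]
print(lookup("refund"))       # unseen stem -> [0.5, 0.5]
```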
