Why pad sentences when using RNNs (PyTorch)? - machine-learning

Why do we pad sentences to the max length of all the sentences in the batch before feeding them into RNNs (like LSTMs)?

Related

How do RNNs for sentiment classification deal with different sentence lengths?

I have been taking a course on deep neural networks. In one of the exercises I built an RNN for sentiment classification, which worked, but I did not understand how an RNN is able to deal with sentences of different lengths while doing sentiment classification.
The RNN doesn't care about the length of the original sentences, because all the data it receives has the same length. Bringing all sentences to the same length is a matter of the method you use in the data-processing step.
For example, the simplest method is Bag of Words: https://machinelearningmastery.com/gentle-introduction-bag-words-model/
So the sentences given to the RNN all have the same length, equal to the number of neurons in the input layer; otherwise the model throws an error.
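For the original PyTorch question, a minimal sketch of the usual pattern (with made-up token ids and layer sizes) is to pad the batch to the longest sentence with pad_sequence and then pack it with pack_padded_sequence so the LSTM can skip the padded positions:
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three sentences of different lengths, already converted to token ids (illustrative ids).
sentences = [torch.tensor([4, 15, 8]),
             torch.tensor([9, 2]),
             torch.tensor([7, 3, 11, 5])]
lengths = torch.tensor([len(s) for s in sentences])

# Pad to the length of the longest sentence in the batch.
padded = pad_sequence(sentences, batch_first=True, padding_value=0)  # shape (3, 4)

embedding = torch.nn.Embedding(num_embeddings=100, embedding_dim=16, padding_idx=0)
lstm = torch.nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

# Packing records the true lengths, so the LSTM ignores the padded steps;
# h_n then holds the hidden state at the last real token of each sentence.
packed = pack_padded_sequence(embedding(padded), lengths, batch_first=True,
                              enforce_sorted=False)
output, (h_n, c_n) = lstm(packed)
Padding is simply what lets sentences of different lengths be stacked into one rectangular tensor; packing is optional, but it keeps the pad tokens from influencing the final hidden state.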

Use fastText for character embeddings?

We have pre-trained fastText word embeddings. Can we use them to find character embeddings? I found a blog post (this link), but in it the author simply averages each character's embedding over all the words. Is there any other way to get character embeddings without training an RNN or CNN?
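The averaging the blog post describes is straightforward to sketch. Assuming the pre-trained vectors are in the standard fastText .vec text format (a header line "count dim", then one word per line), a character's embedding can be taken as the mean of the vectors of all words that contain it; the file path and names below are only illustrative:
import numpy as np
from collections import defaultdict

def load_vec(path, limit=100000):
    """Read a fastText .vec file: a header line 'count dim', then 'word v1 ... vd' lines."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        next(f)  # skip the header line
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

word_vecs = load_vec('cc.en.300.vec')  # illustrative path to the pre-trained vectors

# Average each character's vector over every word that contains that character.
sums, counts = defaultdict(float), defaultdict(int)
for word, vec in word_vecs.items():
    for ch in set(word):
        sums[ch] = sums[ch] + vec
        counts[ch] += 1
char_vecs = {ch: sums[ch] / counts[ch] for ch in sums}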

Document representation with pre-trained Word Vectors for Author Classification/Regression (GP)

I am trying to replicate (https://arxiv.org/abs/1704.05513) to do Big 5 author classification on Facebook data (posts and Big 5 profiles are given).
After removing the stop words, I embed each word in the file with its pre-trained GloVe word vector. However, computing the average or coordinate-wise min/max word vector for each user and using those as input to a Gaussian Process/SVM gives me terrible results. Now I wonder what the paper means by:
Our method combines Word Embedding with Gaussian Processes. We extract the words from the users' tweets and average their Word Embedding representation into a single vector. The Gaussian Processes model then takes these vectors as an input for training and testing.
How else can I "average" the vectors to get decent results, and do they use some specific kind of Gaussian Process?
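For reference, the per-user averaging described in the quote, fed into scikit-learn's Gaussian Process classifier, might look roughly like this. The names glove (a word-to-vector dict), user_posts, and labels are assumed, and the paper does not specify a kernel, so the default RBF kernel here is only a placeholder:
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

dim = 300  # dimensionality of the GloVe vectors

def user_vector(posts, glove):
    """Average the GloVe vectors of all in-vocabulary words across one user's posts."""
    words = [w for post in posts for w in post.lower().split() if w in glove]
    if not words:
        return np.zeros(dim)
    return np.mean([glove[w] for w in words], axis=0)

X = np.vstack([user_vector(posts, glove) for posts in user_posts])
y = np.array(labels)  # one Big 5 trait, e.g. binarised high/low extraversion

clf = GaussianProcessClassifier()  # default RBF kernel; the paper may use another
clf.fit(X, y)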

Addressing synonyms in Supervised Learning for Text Classification

I am using a scikit-learn supervised learning method for text classification. I have a training dataset with input text fields and the categories they belong to. I use a tf-idf + SVM classifier pipeline to create the model. The solution works well for normal test cases, but if a new text contains words that are synonyms of words in the training set, the solution fails to classify it correctly.
For example, the word 'run' might be in the training data, but if I use the word 'sprint' at test time, the solution fails to classify correctly.
What is the best approach here? Adding all synonyms for all words in the training dataset doesn't look like a scalable approach to me.
You should look into word vectors and dense document embeddings. Right now you are passing scikit-learn a matrix X, where each row is a numerical representation of a document in your dataset. You are getting this representation with tf-idf, but as you noticed, this doesn't capture word similarity, and you are also having issues with out-of-vocabulary words.
A possible improvement is to represent each word with a dense vector of, let's say, dimension 300, in such a way that words with similar meanings are close in this 300-dimensional space. Fortunately you don't need to build these vectors from scratch (look up gensim word2vec and spaCy). Another good thing is that by using word embeddings pre-trained on a very large corpus like Wikipedia, you incorporate a lot of linguistic information about the world into your algorithm that you couldn't infer from your own corpus otherwise (like the fact that sprint and run are synonyms).
Once you have good, semantic numeric representations for words, you need a vector representation for each document. The simplest way is to average the word vectors of all the words in the sentence.
Example code to get you started:
>>> import spacy
>>> nlp = spacy.load('en_core_web_md')  # use a model that ships with word vectors
>>> doc1 = nlp('I had a good run')
>>> doc1.vector
array([ 6.17495403e-02, 2.07064897e-02, -1.56451517e-03,
1.02607915e-02, -1.30429687e-02, 1.60102192e-02, ...
Now let's try a different document:
>>> doc2 = nlp('I had a great sprint')
>>> doc2.vector
array([ 0.02453461, -0.00261007, 0.01455955, -0.01595449, -0.01795897,
-0.02184369, -0.01654281, 0.01735667, 0.00054854, ...
>>> doc2.similarity(doc1)
0.8820845113100807
Note how the vectors are similar (in the sense of cosine similarity) even though the words are different. Because the vectors are similar, a scikit-learn classifier will learn to assign them to the same category. With a tf-idf representation this would not be the case.
This is how you can use these vectors in scikit-learn (with your labels in y):
from sklearn.svm import LinearSVC  # or any other scikit-learn classifier
X = [nlp(text).vector for text in corpus]  # one dense vector per document
clf = LinearSVC()
clf.fit(X, y)

Does word2vec make sense for supervised learning?

I have a list of sentence/label pairs to train the model. How should I encode the sentences as input to, say, an SVM?
Are the sentences in the same language? If they're English, you could start with the pre-trained word2vec file that you can download from Google. Pay attention to how the training file was created, whether stemming was applied, etc. It also matters which corpus it was generated from; you'd get different results if it came from newsgroups than if it was extracted from the web or from more formal text.
Word2vec basically encodes every word into a higher-dimensional vector space, usually 200, 300, or 500 dimensions. After it is trained, the "test" sentences are basically bags of words and need not be in any order.
For each word in the bag of words, you'd then look up the corresponding word2vec vector. You can then create features by averaging the vectors, taking the coordinate-wise minimum and maximum, and, if you're comparing texts, computing the cosine similarity between vectors. Then use those features in an SVM.
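A rough sketch of that feature construction, using gensim to load the pre-trained Google News vectors (the file path is the usual download name, and sentences and y are assumed to be your texts and labels):
import numpy as np
from gensim.models import KeyedVectors
from sklearn.svm import SVC

# Pre-trained 300-dimensional Google News vectors.
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def sentence_features(sentence):
    """Concatenate the average, coordinate-wise min, and max of the word vectors."""
    vecs = [w2v[w] for w in sentence.split() if w in w2v]
    if not vecs:
        return np.zeros(3 * w2v.vector_size)
    vecs = np.array(vecs)
    return np.concatenate([vecs.mean(axis=0), vecs.min(axis=0), vecs.max(axis=0)])

X = np.vstack([sentence_features(s) for s in sentences])
clf = SVC(kernel='rbf')
clf.fit(X, y)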
