Does summing up word embedding vectors in ML destroy their meaning? - machine-learning

For example, I have a paragraph which I want to classify in a binary manner. But because the inputs have to have a fixed length, I need to ensure that every paragraph is represented by a uniform quantity.
One thing I've done is taken every word in the paragraph, vectorized it using GloVe word2vec and then summed up all of the vectors to create a "paragraph" vector, which I've then fed in as an input for my model. In doing so, have I destroyed any meaning the words might have possessed? Considering these two sentences would have the same vector:
"My dog bit Dave" & "Dave bit my dog", how do I get around this? Am I approaching this wrong?
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
I want to be able to train a model that can classify text accurately.
Surprisingly, I'm getting a high (>90%) for a relatively simple model like RandomForestClassifier just by using this summing up method. Any insights?
Edit: One suggestion I have received is to instead featurize my data as a 2D array where each word is a column, on which a CNN could work. Another suggestion I received was to use transfer learning through the huggingface transformer to get a vector for the whole paragraph. Which one is more feasible?

I want to be able to train a model that can classify text accurately. Surprisingly, I'm getting a high (>90%) for a relatively simple model like RandomForestClassifier just by using this summing up method. Any insights?
If you look up papers on aggregating word embeddings you'll find out that this in fact occurs sometimes, especially if the texts are shorter.
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
Have you tried keyword extraction? It can alleviate some of the problems with averaging
In doing so, have I destroyed any meaning the words might have
possessed?
As you remarked, you throw out information on word order. But that's not even the worst part: most of the times for longer documents if you embed everything the mean will get dominated by common words ("how", "like", "do" et c). BTW see my answer to this question
Other than that, one trick I've seen is to average word vectors, but subtract first principal component of PCA on word embedding matrix. For details you can see for example this repo which also links to the paper (BTW this paper suggests you can ignore "Smooth Inverse Frequency" stuff since principal component reduction does the useful part).

Related

Word Embedding Model

I have been searching and attempting to implement a word embedding model to predict similarity between words. I have a dataset made up 3,550 company names, the idea is that the user can provide a new word (which would not be in the vocabulary) and calculate the similarity between the new name and existing ones.
During preprocessing I got rid of stop words and punctuation (hyphens, dots, commas, etc). In addition, I applied stemming and separated prefixes with the hope to get more precision. Then words such as BIOCHEMICAL ended up as BIO CHEMIC which is the word divided in two (prefix and stem word)
The average company name length is made up 3 words with the following frequency:
The tokens that are the result of preprocessing are sent to word2vec:
#window: Maximum distance between the current and predicted word within a sentence
#min_count: Ignores all words with total frequency lower than this.
#workers: Use these many worker threads to train the model
#sg: The training algorithm, either CBOW(0) or skip gram(1). Default is 0s
word2vec_model = Word2Vec(prepWords,size=300, window=2, min_count=1, workers=7, sg=1)
After the model included all the words in the vocab , the average sentence vector is calculated for each company name:
df['avg_vector']=df2.apply(lambda row : avg_sentence_vector(row, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word)).tolist())
Then, the vector is saved for further lookups:
##Saving name and vector values in file
df.to_csv('name-submission-vectors.csv',encoding='utf-8', index=False)
If a new company name is not included in the vocab after preprocessing (removing stop words and punctuation), then I proceed to create the model again and calculate the average sentence vector and save it again.
I have found this model is not working as expected. As an example, calculating the most similar words pet is getting the following results:
ms=word2vec_model.most_similar('pet')
('fastfood', 0.20879755914211273)
('hammer', 0.20450574159622192)
('allur', 0.20118337869644165)
('wright', 0.20001833140850067)
('daili', 0.1990675926208496)
('mgt', 0.1908089816570282)
('mcintosh', 0.18571510910987854)
('autopart', 0.1729743778705597)
('metamorphosi', 0.16965581476688385)
('doak', 0.16890916228294373)
In the dataset, I have words such as paws or petcare, but other words are creating relationships with pet word.
This is the distribution of the nearer words for pet:
On the other hand, when I used the GoogleNews-vectors-negative300.bin.gz, I could not add new words to the vocab, but the similarity between pet and words around was as expected:
ms=word2vec_model.most_similar('pet')
('pets', 0.771199643611908)
('Pet', 0.723974347114563)
('dog', 0.7164785265922546)
('puppy', 0.6972636580467224)
('cat', 0.6891531348228455)
('cats', 0.6719794869422913)
('pooch', 0.6579219102859497)
('Pets', 0.636363685131073)
('animal', 0.6338439583778381)
('dogs', 0.6224827170372009)
This is the distribution of the nearest words:
I would like to get your advice about the following:
Is this dataset appropriate to proceed with this model?
Is the length of the dataset enough to allow word2vec "learn" the relationships between the words?
What can I do to improve the model to make word2vec create relationships of the same type as GoogleNews where for instance word pet is correctly set among similar words?
Is it feasible to implement another alternative such as fasttext considering the nature of the current dataset?
Do you know any public dataset that can be used along with the current dataset to create those relationships?
Thanks
3500 texts (company names) of just ~3 words each is only around 10k total training words, with a much smaller vocabulary of unique words.
That's very, very small for word2vec & related algorithms, which rely on lots of data, and sufficiently-varied data, to train-up useful vector arrangements.
You may be able to squeeze some meaningful training from limited data by using far more training epochs than the default epochs=5, and far smaller vectors than the default size=100. With those sorts of adjustments, you may start to see more meaningful most_similar() results.
But, it's unclear that word2vec, and specifically word2vec in your averaging-of-a-name's-words comparisons, is matched to your end goals.
Word2vec needs lots of data, doesn't look at subword units, and can't say anything about word-tokens not seen during training. An average-of-many-word-vectors can often work as an easy baseline for comparing multiword texts, but might also dilute some word's influence compared to other methods.
Things to consider might include:
Word2vec-related algorithms like FastText that also learn vectors for subword units, and can thus bootstrap not-so-bad guess vectors for words not seen in training. (But, these are also data hungry, and to use on a small dataset you'd again want to reduce vector size, increase epochs, and additionally shrink the number of buckets used for subword learning.)
More sophisticated comparisons of multi-word texts, like "Word Mover's Distance". (That can be quite expensive on longer texts, but for names/titles of just a few words may be practical.)
Finding more data that's compatible with your aims for a stronger model. A larger database of company names might help. If you just want your analysis to understand English words/roots, more generic training texts might work too.
For many purposes, a mere lexicographic comparison - edit distances, count of shared character-n-grams – may be helpful too, though it won't detect all synonyms/semantically-similar words.
Word2vec does not generalize to unseen words.
It does not even work well for wards that are seen but rare. It really depends on having many many examples of word usage. Furthermore a you need enough context left and right, but you only use company names - these are too short. That is likely why your embeddings perform so poorly: too little data and too short texts.
Hence, it is the wrong approach for you. Retraining the model with the new company name is not enough - you still only have one data point. You may as well leave out unseen words, word2vec cannot work better than that even if you retrain.
If you only want to compute similarity between words, probably you don't need to insert new words in your vocabulary.
By eye, I think you can also use FastText without the need to stem the words. It also computes vectors for unknown words.
From FastText FAQ:
One of the key features of fastText word representation is its ability
to produce vectors for any words, even made-up ones. Indeed, fastText
word vectors are built from vectors of substrings of characters
contained in it. This allows to build vectors even for misspelled
words or concatenation of words.
FastText seems to be useful for your purpose.
For your task, you can follow FastText supervised tutorial.
If your corpus proves to be too small, you can build your model starting from availaible pretrained vectors (pretrainedVectors parameter).

how to use both word2vec and RNN for NLP together?

I recently studied and understood how word2vec works, it is responsible to convert words into numerical form so when we plot them or put them in the world space they will be spread and reveal the relationship between every word and the other.
my question here, I found also RNNs and suddenly I became confused. Is word2vec an alternative to RNNs or I can use word2vec to transfer the words to numeric form and then use them on RNNs ?
I mean both of them predict the next word, so I want to know if they are the different approaches for the same problem or I can use them both together ?
NOTE: I finished computer vision and started in NLP so please don't judge my question I am just starting, thanks in advance.
You did not understood the meaning of word2vec clearly. word2vec is a representation of words in a multi-dimensional space while RNN is an algorithm like Linear Regression or random forest or logistic regression. word2vec do NOT predicts the next words. Here is a small explanation of word2vec:
Take three words: apple,orange and car. Suppose they are represented in word2vec as:
apple = [0.01, 0.04 ...] orange = [0.02, 0.06 ...] car = [0.03, 0.09 ...]
Now you know apple and orange are similar to each other while car is not. So if you will take the dot product of apple and orange, the result value will be close to 1, say it's 0.85 but if you take the dot product of apple and car, the result will be far from 1 say it's 0.25. This is the concept of word2vec. It gives you a vector representation of words in a numerical form such that similar words are kept near to each other in the graph.
Now for RNN, as I said, it's an algorithm. You will feed some numerical data to it and it'll give you some output. You need to learn RNN in detail from some online tutorial.
To answer your question about how to use them together, RNN takes numerical inputs. It can't take English words directly. So we need to convert all words into some kind of numerical form. This is where word2vec comes into picture. You take each word, get it's numerical representation from word2vec (like I showed above for apple, orange and car) and then feed it to the RNN.
This is just a simple overview and it's not possible to explain everything here. If you really want to learn more then I would strongly suggest you to take this course. Everything from word2vec to RNN is explained beautifully there. It would be even better if you complete the whole specialisation there instead of completing only this course.

Fuzzy matching sentences to stanzas

I have lyrics from srt subtitle files. If I want to match them to stanzas from another lyrics website, what is the best approach to this?
My approach has been taking tf-idf vector each lyric line and trying to fuzzy match to the staza, using starting and end time of the lyric line as a clue to whether the line might belong to the previous stanzas, next stanzas, or belong to a stanzas of it's own.
I've also tried dynamic programming, but with less success. Due to the high variance in the structure of the lyrics and the stanza, sometimes the results come out completely shifted or messed up, especially if there are repeated chorus.
If there is a Recurrent Neural Networks or other Machine Learning algorithm, is there an existing approach to such problem?
I suggest using Doc2Vec or Word2Vec methods, basically you train a NN with some corpus, the NN will generate a vector for each word/document, those vectors will have similarity based on the similarty of words in the real world (corpus)
so vector of love and care will be very similar, those vectors hold some other cool properties like if yo subtract or add them you can get a word that posses some meaning of the substation or addition induce
once you will get the vector of words or docs you can check similarity between vectors with various methods, commonly used is the cosine similarity
this method mixed with tf-idf to generate weights for best results
usage is very easy, for example
from gensim.models import Word2Vec
from nltk.corpus import brown
b = Word2Vec(brown.sents())
print b.most_similar('money', topn=5)
output
[(u'care', 0.9145717024803162), (u'chance', 0.9034961462020874), (u'job', 0.8980690240859985), (u'trouble', 0.8751360774040222), (u'everything', 0.873866856098175)]
I suggest to take a look at gensim

Features of vector form of sentences for opinion finding.

I want to find the opinion of a sentence either positive or negative. For example talk about only one sentence.
The play was awesome
If change it to vector form
[0,0,0,0]
After searching through the Bag of words
bad
naughty
awesome
The vector form becomes
[0,0,0,1]
Same for other sentences. Now I want to pass it to the machine learning algorithm for training it. How can I train the network using these multiple vectors? (for finding the opinion of unseen sentences) Obviously not! Because the input is fix in neural network. Is there any way? The above procedure is just my thinking. Kindly correct me if I am wrong. Thanks in advance.
Since your intuitive input format is "Sentence". Which is, indeed, a string of tokens with arbitrary length. Abstracting sentences as token series is not a good choice for many existing algorithms only works on determined format of inputs.
Hence, I suggest try using tokenizer on your entire training set. This will give you vectors of length of the dictionary, which is fixed for given training set.
Because when the length of sentences vary drastically, then size of the dictionary always keeps stable.
Then you can apply Neural Networks(or other algorithms) to the tokenized vectors.
However, vectors generated by tokenizer is extremely sparse because you only work on sentences rather than articles.
You can try LDA (supervised, not PCA), to reduce the dimension as well as amplify the difference.
That will keep the essential information of your training data as well as express your data at fixed size, while this "size" is not too large.
By the way, you may not have to label each word by its attitude since the opinion of a sentence also depends on other kind of words.
Simple arithmetics on number of opinion-expressing words many leave your model highly biased. Better label the sentences and leave the rest job to classifiers.
For the confusions
PCA and LDA are Dimensional Reduction techniques.
difference
Let's assume each tuple of sample is denoted as x (1-by-p vector).
p is too large, we don't like that.
Let's find a matrix A(p-by-k) in which k is pretty small.
So we get reduced_x = x*A, and most importantly, reduced_x must
be able to represent x's characters.
Given labeled data, LDA can provide proper A that can maximize
distance between reduced_x of different classes, and also minimize
the distance within identical classes.
In simple words: compress data, keep information.
When you've got
reduced_x, you can define training data: (reduced_x|y) where y is
0 or 1.

Am I using word-embeddings correctly?

Core question : Right way(s) of using word-embeddings to represent text ?
I am building sentiment classification application for tweets. Classify tweets as - negative, neutral and positive.
I am doing this using Keras on top of theano and using word-embeddings (google's word2vec or Stanfords GloVe).
To represent tweet text I have done as follows:
used a pre-trained model (such as word2vec-twitter model) [M] to map words to their embeddings.
Use the words in the text to query M to get corresponding vectors. So if the tweet (T) is "Hello world" and M gives vectors V1 and V2 for the words 'Hello' and 'World'.
The tweet T can then be represented (V) as either V1+V2 (add vectors) or V1V2 (concatinate vectors)[These are 2 different strategies] [Concatenation means juxtaposition, so if V1, V2 are d-dimension vectors, in my example T is 2d dimension vector]
Then, the tweet T is represented by vector V.
If I follow the above, then My Dataset is nothing but vectors (which are sum or concatenation of word vectors depending on which strategy I use).
I am training a deepnet such as FFN, LSTM on this dataset. But my results arent coming out to be great.
Is this the right way to use word-embeddings to represent text ? What are the other better ways ?
Your feedback/critique will be of immense help.
I think that, for your purpose, it is better to think about another way of composing those vectors. The literature on word embeddings contains examples of criticisms to these kinds of composition (I will edit the answer with the correct references as soon as I find them).
I would suggest you to consider also other possible approaches, for instance:
Using the single word vectors as input to your net (I do not know your architecture, but the LSTM is recurrent so it can deal with sequences of words).
Using a full paragraph embedding (i.e. https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
Summing them doesn't make any sense to be honest, because on summing them you get another vector which i don't think represents the semantics of "Hello World" or may be it does but it won't surely hold true for longer sentences in general
Instead it would be better to feed them as sequence as in that way it at least preserves sequence in meaningful way which seems to fit more to your problem.
e.g A hates apple Vs Apple hates A this difference would be captured when you feed them as sequence into RNN but their summation will be same.
I hope you get my point!

Resources