constructing skip-gram input vector - machine-learning

I am following the tutorial here for implementing word2vec, and I am not sure if I understand how the skip-gram input vector is constructed.
This is the part I am confused about. I thought we were not doing one-hot encoding in word2vec.
For example, if we were to have two sentences "dogs like cats", "cats like dogs", or some more informative sentences, what would the input vector look like? Thank you.

What skip-gram essentially tries to do is train a model that predicts the context words given the center word.
Take 'dogs like cats' as an example and assume the window size is three, which means we use the center word ("like") to predict the word before "like" and the word after "like" (the correct answers here are "dogs" and "cats").
So the input vector for this sentence will be a one-hot vector whose kth element is one (assuming "like" is the kth word in your dictionary).
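For illustration, here is a minimal sketch of building that one-hot input vector (the toy vocabulary and variable names are my own, not from the tutorial):

import numpy as np

vocab = ["cats", "dogs", "like"]              # toy dictionary, invented for illustration
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, word_to_index):
    v = np.zeros(len(word_to_index))
    v[word_to_index[word]] = 1.0
    return v

x = one_hot("like", word_to_index)            # skip-gram input for the center word "like"
# the targets for this training pair are the one-hot vectors of "dogs" and "cats"
print(x)                                      # [0. 0. 1.]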

Related

Does summing up word embedding vectors in ML destroy their meaning?

For example, I have a paragraph which I want to classify in a binary manner. But because the inputs have to have a fixed length, I need to ensure that every paragraph is represented by a vector of the same size.
One thing I've done is taken every word in the paragraph, vectorized it using GloVe word embeddings, and then summed up all of the vectors to create a "paragraph" vector, which I've then fed in as an input to my model. In doing so, have I destroyed any meaning the words might have possessed? Consider that these two sentences would have the same vector:
"My dog bit Dave" & "Dave bit my dog". How do I get around this? Am I approaching this wrong?
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
I want to be able to train a model that can classify text accurately.
Surprisingly, I'm getting high accuracy (>90%) with a relatively simple model like RandomForestClassifier just by using this summing-up method. Any insights?
Edit: One suggestion I have received is to instead featurize my data as a 2D array where each word is a column, on which a CNN could work. Another suggestion was to use transfer learning through a Hugging Face transformer to get a vector for the whole paragraph. Which one is more feasible?
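For reference, the summing approach described in the question might look like the following sketch (the GloVe lookup dictionary, variable names and embedding dimension are assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def paragraph_vector(text, glove, dim=300):
    # glove is assumed to be a dict mapping word -> numpy vector of length dim
    vectors = [glove[w] for w in text.lower().split() if w in glove]
    return np.sum(vectors, axis=0) if vectors else np.zeros(dim)

# X = np.vstack([paragraph_vector(p, glove) for p in paragraphs])
# clf = RandomForestClassifier().fit(X, labels)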
I want to be able to train a model that can classify text accurately. Surprisingly, I'm getting high accuracy (>90%) with a relatively simple model like RandomForestClassifier just by using this summing-up method. Any insights?
If you look up papers on aggregating word embeddings, you'll find that this does in fact happen sometimes: simple sums or averages of word vectors can work surprisingly well, especially when the texts are short.
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
Have you tried keyword extraction? It can alleviate some of the problems with averaging.
In doing so, have I destroyed any meaning the words might have possessed?
As you remarked, you throw out information on word order. But that's not even the worst part: most of the time, for longer documents, if you embed everything the mean gets dominated by common words ("how", "like", "do", etc.). BTW, see my answer to this question.
Other than that, one trick I've seen is to average the word vectors but subtract the first principal component of PCA on the word embedding matrix. For details you can see for example this repo, which also links to the paper (BTW this paper suggests you can ignore the "Smooth Inverse Frequency" weighting, since the principal component removal does the useful part).
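As a rough sketch, the principal-component removal can be done like this (X is assumed to be the matrix of averaged document vectors, one row per document):

import numpy as np
from sklearn.decomposition import TruncatedSVD

def remove_first_principal_component(X):
    # X: array of shape (n_documents, dim), rows are averaged word vectors
    svd = TruncatedSVD(n_components=1)
    svd.fit(X)
    pc = svd.components_            # shape (1, dim): the dominant common direction
    return X - X @ pc.T @ pc        # project that direction out of every row

# X_adjusted = remove_first_principal_component(X)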

Replacing empty texts - text embedding

I am trying to embed texts using pre-trained fastText models. Some of the texts are empty. How would one replace them to make embedding possible? I was thinking about replacing them with a dummy word, like this (docs being a pandas DataFrame object):
docs = docs.replace(np.nan, 'unknown', regex=True)
However, it doesn't really make sense, as the choice of this word is arbitrary and it is not equivalent to having an empty string.
Otherwise, I could associate the zero vector embedding with empty strings, or the average vector, but I am not convinced either would make sense, as the embedding operation is non-linear.
In fastText, the sentence embedding is basically an average of the word vectors, as shown in one of the fastText papers.
Given this fact, the zero vector might be a logical choice. But the answer depends on what you want to do with the embeddings.
If you use them as input for a classifier, it should be fine to select an arbitrary vector as the representation of the empty string, and the classifier will learn what it means. fastText also learns a special embedding for </s>, i.e., the end of a sentence. This is another natural candidate for an embedding of the empty string, especially if you do similarity search.
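A minimal sketch of the zero-vector fallback, assuming the fasttext Python package and a pre-trained .bin model (the model path and column name are placeholders):

import numpy as np
import fasttext

model = fasttext.load_model("cc.en.300.bin")    # placeholder path to a pre-trained model
dim = model.get_dimension()

def embed(text):
    text = text.strip()
    if not text:
        return np.zeros(dim)                    # zero vector for empty strings
    return model.get_sentence_vector(text)

# vectors = np.vstack([embed(t) for t in docs["text"].fillna("")])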

SVM feature vector representation by using pre-made dictionary for text classification

I want to classify a collection of texts into two classes; let's say I would like to do sentiment classification. I have two pre-made sentiment dictionaries, one containing only positive words and another containing only negative words. I would like to incorporate these dictionaries into the feature vector for an SVM classifier. My question is: is it possible to treat the positive and negative word dictionaries separately when building the SVM feature vector, especially when I generate feature vectors for the test set?
If my explanation is not clear enough, let me give an example. Let's say I have these two sentences as training data:
Pos: The book is good
Neg: The book is bad
The word 'good' exists in the positive dictionary and 'bad' exists in the negative dictionary, while the other words do not exist in either dictionary. I want the words that exist in the dictionary matching the sentence's class to have a large weight value, while the other words have a small value. So, the feature vectors will look like these:
+1 1:0.1 2:0.1 3:0.1 4:0.9
-1 1:0.1 2:0.1 3:0.1 5:0.9
If I want to classify a test sentence "The food is bad", how should I generate a feature vector for the test set with weights that depend on the dictionaries, when I cannot match the test sentence's class against either dictionary? What I can think of is: for the test set, as long as a word exists in either dictionary, I will give that word a high weight value.
0 1:0.1 3:0.1 5:0.9
I wonder if this is the right way for creating vector representation for both training set and test set.
--Edit--
I forgot to mention that these pre-made dictionaries were extracted using some kind of topic model. For example, the top 100 words from topic 1 roughly represent the positive class and the words in topic 2 represent the negative class. I want to use this kind of information to improve the classifier beyond using only bag-of-words features.
In short - this is not the way it works.
The whole point of learning is to give the classifier the ability to assign these weights on its own. You cannot "force" it to have a high value per class for a particular feature (I mean, you could at the optimization level, but this would require changing the whole SVM formulation).
So the right way is to simply create a "normal" representation, without any additional specification. Let the model decide; models are better at statistical analysis than human intuition, really.
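To make that concrete, here is a minimal sketch of such a "normal" representation with scikit-learn (the toy sentences and dictionaries are the ones from the question; the two extra dictionary-count columns are optional, and it is still the SVM that learns their weights):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_texts = ["The book is good", "The book is bad"]
train_labels = [1, -1]
pos_words = {"good"}
neg_words = {"bad"}

def dict_counts(text):
    # two extra features: how many tokens hit each dictionary
    tokens = text.lower().split()
    return [sum(t in pos_words for t in tokens), sum(t in neg_words for t in tokens)]

vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(train_texts).toarray()
X = np.hstack([X_bow, np.array([dict_counts(t) for t in train_texts])])
clf = LinearSVC().fit(X, train_labels)

test = "The food is bad"
X_test = np.hstack([vectorizer.transform([test]).toarray(), np.array([dict_counts(test)])])
print(clf.predict(X_test))    # likely [-1] for this toy example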

Am I using word-embeddings correctly?

Core question: what is the right way (or ways) of using word embeddings to represent text?
I am building a sentiment classification application for tweets, classifying tweets as negative, neutral or positive.
I am doing this using Keras on top of Theano and using word embeddings (Google's word2vec or Stanford's GloVe).
To represent the tweet text I have done the following:
Used a pre-trained model (such as the word2vec-twitter model) [M] to map words to their embeddings.
Used the words in the text to query M to get the corresponding vectors. So if the tweet (T) is "Hello world", M gives vectors V1 and V2 for the words 'Hello' and 'World'.
The tweet T can then be represented (V) as either V1+V2 (add the vectors) or V1V2 (concatenate the vectors) [these are 2 different strategies]. [Concatenation means juxtaposition, so if V1 and V2 are d-dimensional vectors, in my example T is a 2d-dimensional vector.]
Then, the tweet T is represented by vector V.
If I follow the above, then My Dataset is nothing but vectors (which are sum or concatenation of word vectors depending on which strategy I use).
I am training a deep net such as an FFN or LSTM on this dataset, but my results aren't coming out to be great.
Is this the right way to use word-embeddings to represent text ? What are the other better ways ?
Your feedback/critique will be of immense help.
I think that, for your purpose, it is better to think about another way of composing those vectors. The literature on word embeddings contains examples of criticisms of these kinds of composition (I will edit the answer with the correct references as soon as I find them).
I would suggest you also consider other possible approaches, for instance:
Using the single word vectors as a sequence input to your net (I do not know your architecture, but an LSTM is recurrent, so it can deal with sequences of words); see the sketch after this list.
Using a full paragraph embedding (i.e. https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
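As an illustration of the first option, here is a minimal Keras sketch of feeding word vectors as a sequence into an LSTM (the embedding dimension, maximum tweet length and the embeddings lookup are assumptions):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

max_len, dim = 30, 300                  # assumed max tokens per tweet and embedding size

def tweet_to_sequence(tweet, embeddings):
    # embeddings is assumed to be a dict-like word -> vector lookup (e.g. word2vec-twitter)
    vecs = [embeddings[w] for w in tweet.lower().split() if w in embeddings]
    vecs = vecs[:max_len] + [np.zeros(dim)] * (max_len - len(vecs))   # truncate / pad
    return np.array(vecs)               # shape (max_len, dim)

model = Sequential([
    LSTM(64, input_shape=(max_len, dim)),
    Dense(3, activation="softmax"),     # negative / neutral / positive
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# X = np.stack([tweet_to_sequence(t, word_vectors) for t in tweets]); model.fit(X, y_onehot)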
Summing them doesn't make much sense, to be honest, because by summing them you get another vector which I don't think represents the semantics of "Hello World"; or maybe it does, but it surely won't hold true for longer sentences in general.
Instead, it would be better to feed them in as a sequence, as that way it at least preserves word order in a meaningful way, which seems to fit your problem better.
E.g. "A hates Apple" vs "Apple hates A": this difference would be captured when you feed them as a sequence into an RNN, but their sums would be the same.
I hope you get my point!

Word2Vec Data Setup

In the Word2Vec skip-gram setup that follows, what is the data setup for the output layer? Is it a matrix that is zero everywhere except for a single "1" in each of the C rows, representing the C context words?
Edit, to clarify the data setup question:
Meaning, what would the dataset presented to the NN look like? Let's consider this as "what does a single training example look like?". I assume the total input is a matrix where each row is a word in the vocabulary (and there is a column for each word as well, with each cell zero except the one for the specific word, i.e. one-hot encoded)? Thus, a single training example is 1xV as shown below (all zeros except for the specific word, whose value is 1). This aligns with the picture above in that the input is V-dimensional. I expected that the total input matrix would have duplicated rows, however, where the same one-hot encoded vector would be repeated for each time the word was found in the corpus (as the output or target variable would be different).
The output (target) is more confusing to me. I expected it would exactly mirror the input: a single training example would have a "multi"-hot encoded vector that is zero except for a "1" in C of the cells, denoting that a particular word was in the context of the input word (C = 5 if we are looking, for example, 2 words behind and 3 words ahead of the given input word). The picture doesn't seem to agree with this, though. I don't understand what appear to be C different output layers that share the same W' weight matrix.
The skip-gram architecture has word embeddings as its output (and its input). Depending on its precise implementation, the network may therefore produce two embeddings per word (one embedding for the word as an input word, and one embedding for the word as an output word; this is the case in the basic skip-gram architecture with the traditional softmax function), or one embedding per word (this is the case in a setup with the hierarchical softmax as an approximation to the full softmax, for example).
You can find more information about these architectures in the original word2vec papers, such as Distributed Representations of Words and Phrases and their Compositionality by Mikolov et al.
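To make the shared W' point concrete, here is a toy numpy sketch of the basic skip-gram forward pass (the vocabulary size, dimension and use of the full softmax are just for illustration):

import numpy as np

V, N = 10, 4                        # toy vocabulary size and embedding dimension
W = np.random.randn(V, N)           # input embeddings: one row per word
W_out = np.random.randn(N, V)       # output embeddings: one column per word (this is W')

center = 3                          # index of the center word
x = np.zeros(V); x[center] = 1.0    # one-hot input, 1 x V

h = x @ W                           # hidden layer = the center word's input embedding
scores = h @ W_out                  # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

# The same probability distribution is compared against each of the C context words;
# that is why the diagram shows C output panels all sharing the same W' matrix.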
