What's the major difference between GloVe and word2vec?

What is the difference between word2vec and GloVe?
Are both ways to train a word embedding? If yes, then how can we use both?

Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.
Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.
GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.
Working from the same corpus, creating word-vectors of the same dimensionality, and devoting the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against some rough/arbitrary defaults of the other.)
I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
Conversely, some proponents of GloVe tout that it does fairly well without needing meta-parameter optimization.
You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.
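As a rough illustration of the window-size effect mentioned above, here is a minimal gensim (4.x) sketch; the toy corpus, parameter values, and probe word are my own illustrative choices, and a real comparison needs far more training text.

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus; real comparisons need far more text.
sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the lazy dog sleeps while the quick fox runs".split(),
    "a slow brown dog walks past the quick cat".split(),
]

# Same data, same dimensionality, different context windows.
small_window = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50, seed=1)
large_window = Word2Vec(sentences, vector_size=50, window=10, min_count=1, sg=1, epochs=50, seed=1)

# Smaller windows tend to favor 'drop-in replacement' neighbors;
# larger windows favor broader topical relatedness.
print(small_window.wv.most_similar("dog", topn=3))
print(large_window.wv.most_similar("dog", topn=3))
```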

Word2vec is a predictive model: it trains by trying to predict a target word given its context (the CBOW method) or the context words given the target (the skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model's predictions are, so as the model trains to make better predictions it will result in better embeddings.
GloVe is based on matrix factorization techniques applied to the word-context matrix. It first constructs a large matrix of (words x contexts) co-occurrence information, i.e., for each "word" (the rows), you count how frequently (the matrix values) that word appears in some "context" (the columns) in a large corpus. The number of "contexts" is very large, since it is essentially combinatorial in size. This matrix is then factorized to yield a lower-dimensional (words x features) matrix, where each row gives a vector representation for the corresponding word. In general, this is done by minimizing a "reconstruction loss", which tries to find the lower-dimensional representations that explain most of the variance in the high-dimensional data.
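As a concrete illustration of the "count co-occurrences, then factorize" idea, here is a minimal sketch. Note it uses a plain truncated SVD of log counts (an LSA-style stand-in) rather than GloVe's actual weighted least-squares objective, and the toy corpus, window size, and dimensionality are my own choices.

```python
import numpy as np

# Toy corpus; illustrative only.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog played".split(),
]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Word-context co-occurrence counts within a sliding window.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

# Factorize to get low-dimensional word vectors (one row per word).
u, s, vt = np.linalg.svd(np.log1p(counts))
dim = 5
word_vectors = u[:, :dim] * s[:dim]
print(word_vectors[idx["cat"]])
```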
Before GloVe, algorithms for learning word representations could be divided into two main streams: the count-based (e.g., LSA) and the prediction-based (Word2Vec). LSA produces low-dimensional word vectors by singular value decomposition (SVD) of the co-occurrence matrix, while Word2Vec employs a shallow three-layer neural network for a center-context word-pair classification task in which the word vectors are just a by-product.
The most striking property of Word2Vec is that similar words end up located close together in the vector space, and arithmetic operations on word vectors can capture semantic or syntactic relationships, e.g., "king" - "man" + "woman" -> "queen" or "better" - "good" + "bad" -> "worse". However, count-based methods like LSA do not maintain such linear relationships in the vector space.
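That vector arithmetic is easy to try with a pre-trained model; here is a minimal sketch, assuming gensim 4.x and its downloader (the pre-trained Google News model is several gigabytes, so the first call downloads a lot of data):

```python
import gensim.downloader as api

# Downloads a large pre-trained model on first use; any pre-trained
# KeyedVectors would work the same way.
wv = api.load("word2vec-google-news-300")

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# better - good + bad ~= worse
print(wv.most_similar(positive=["better", "bad"], negative=["good"], topn=3))
```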
The motivation of GloVe is to force the model to learn such linear relationships explicitly, based on the co-occurrence matrix. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. In this sense, it is a hybrid method that applies machine learning to a statistical co-occurrence matrix, and this is the general difference between GloVe and Word2Vec.
If we dive into the derivation of the GloVe equations, we find the difference in the underlying intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Take the example from Stanford NLP's GloVe page (Global Vectors for Word Representation) and consider the co-occurrence probabilities for the target words ice and steam with various probe words from the vocabulary:
As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid.
Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
Word2Vec, by contrast, works on raw co-occurrence: it maximizes the probability that the words surrounding a target word appear as its context.
In practice, to speed up training, Word2Vec employs negative sampling, substituting the softmax function with sigmoid functions operating on real data and noise data. This implicitly results in words clustering into a cone in the vector space, while GloVe's word vectors are located more discretely.
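To make the ratio intuition above concrete, here is a toy computation; the co-occurrence counts below are made-up illustrative numbers, not the Stanford corpus statistics.

```python
# Made-up co-occurrence counts, for illustration only.
cooc = {
    "ice":   {"solid": 190, "gas": 7,  "water": 300, "fashion": 2, "TOTAL": 100_000},
    "steam": {"solid": 2,   "gas": 80, "water": 220, "fashion": 2, "TOTAL": 100_000},
}

def p(probe, target):
    """P(probe | target): how often 'probe' appears near 'target'."""
    return cooc[target][probe] / cooc[target]["TOTAL"]

for probe in ["solid", "gas", "water", "fashion"]:
    ratio = p(probe, "ice") / p(probe, "steam")
    print(f"P({probe}|ice) / P({probe}|steam) = {ratio:.2f}")

# Large ratios flag ice-specific probes (solid), small ratios flag
# steam-specific probes (gas), and ratios near 1 (water, fashion) cancel out.
```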


Is doc vector learned through PV-DBOW equivalent to the average/sum of the word vectors contained in the doc?

I've seen some posts say that the average of the word vectors performs better on some tasks than the doc vectors learned through PV-DBOW. What is the relationship between the document's vector and the average/sum of its word vectors? Can we say that the vector d is approximately equal to the average or sum of its word vectors?
Thanks!
No. The PV-DBOW vector is calculated by a different process, based on how well the PV-DBOW-vector can be incrementally nudged to predict each word in the text in turn, via a concurrently-trained shallow neural network.
But, a simple average-of-word-vectors often works fairly well as a summary vector for a text.
So, let's assume both the PV-DBOW vector and the simple-average-vector are the same dimensionality. Since they're bootstrapped from exactly the same inputs (the same list of words), and the neural-network isn't significantly more sophisticated (in its internal state) than a good set of word-vectors, the performance of the vectors on downstream evaluations may not be very different.
For example, if the training data for the PV-DBOW model is meager, or meta-parameters not well optimized, but the word-vectors used for the average-vector are very well-fit to your domain, maybe the simple-average-vector would work better for some downstream task. On the other hand, a PV-DBOW model trained on sufficient domain text could provide vectors that outperform a simple-average based on word-vectors from another domain.
Note that FastText's classification mode (and similar modes in Facebook's StarSpace) actually optimizes word-vectors to work as parts of a simple-average-vector used to predict known text-classes. So if your end-goal is to have a text-vector for classification, and you have a good training dataset with known-labels, those techniques are worth considering as well.
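For concreteness, here is a minimal sketch contrasting the two kinds of text vector, assuming gensim 4.x; the corpus and parameters are illustrative only.

```python
import numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    "the cat sat on the mat".split(),
    "dogs and cats are common pets".split(),
    "stock prices fell sharply today".split(),
]

# PV-DBOW (dm=0): each text gets its own trained vector.
tagged = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(texts)]
dbow = Doc2Vec(tagged, vector_size=50, dm=0, min_count=1, epochs=100)
pv_dbow_vec = dbow.dv[0]

# Average-of-word-vectors: train (or load) word vectors, then mean-pool.
w2v = Word2Vec(texts, vector_size=50, min_count=1, epochs=100)
avg_vec = np.mean([w2v.wv[w] for w in texts[0]], axis=0)

# Two 50-d vectors summarizing the same text, produced by different
# processes; they are not equal, and neither is derived from the other.
print(pv_dbow_vec[:5], avg_vec[:5])
```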

Fine-tuning Glove Embeddings

Has anyone tried to fine-tune Glove embeddings on a domain-specific corpus?
Fine-tuning word2vec embeddings has proven very effective for me in various NLP tasks, but I am wondering whether generating a co-occurrence matrix on my domain-specific corpus and training GloVe embeddings (initialized with pre-trained embeddings) on that corpus would yield similar improvements.
I myself am trying to do the exact same thing. You can try mittens.
They have successfully built a framework for this, and Christopher D. Manning (co-author of GloVe) is associated with it.
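As a rough sketch of how that fine-tuning might look, based on my reading of the mittens project README; the toy corpus, the random stand-in for the pre-trained vectors, and the parameter values are all assumptions of mine.

```python
import numpy as np
from mittens import Mittens  # pip install mittens

# Toy stand-ins: in practice, load real pre-trained GloVe vectors into a
# {word: vector} dict and build the co-occurrence matrix from your corpus.
corpus = [
    "patient shows elevated troponin levels".split(),
    "elevated troponin suggests cardiac injury".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

dim = 50
pretrained = {w: np.random.randn(dim) for w in vocab}  # stand-in for glove.6B.50d

# Domain-specific word-word co-occurrence counts (symmetric sliding window).
window = 2
cooccurrence = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooccurrence[idx[w], idx[sent[j]]] += 1

# Mittens retrofits the pre-trained vectors to the new co-occurrence matrix.
model = Mittens(n=dim, max_iter=1000)
new_embeddings = model.fit(cooccurrence, vocab=vocab, initial_embedding_dict=pretrained)
print(new_embeddings.shape)  # (len(vocab), dim)
```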
word2vec and GloVe are techniques for producing word embeddings, i.e., for modelling the words of a text (a set of sentences) as computer-readable vectors.
While word2vec trains on the local context (neighboring words), GloVe looks at word co-occurrence across a whole text or corpus; its approach is more global.
word2vec
There are two main approaches for word2vec, in both of which the algorithm loops through the words of each sentence. For each current word w, it tries to predict either:
the neighboring context words from w: this is the Skip-Gram approach
w from its context words: this is the CBOW approach
Hence, word2vec will produce similar embeddings for words with similar contexts, for instance a singular noun and its plural, or two synonyms.
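A minimal gensim (4.x) sketch of the two training modes, selected via the sg flag; the corpus and parameters are illustrative only.

```python
from gensim.models import Word2Vec

sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a quick red fox leaps over a sleepy dog".split(),
]

# sg=1 -> Skip-Gram: predict the context words from the current word w.
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# sg=0 -> CBOW: predict the current word w from its context.
cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0, epochs=50)

print(skipgram.wv.most_similar("fox", topn=3))
print(cbow.wv.most_similar("fox", topn=3))
```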
GloVe
The main intuition underlying the GloVe model is the simple observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. In other words, the embeddings are based on comparisons between pairs of target words: the model relates two target words in a text by analyzing the co-occurrence of those two target words with some other probe words (context words).
https://nlp.stanford.edu/projects/glove/
For example, consider the co-occurrence probabilities for the target words "ice" and "steam" with various probe words from the vocabulary (the GloVe project page linked above reports actual probabilities from a 6-billion-word corpus):
As one might expect, "ice" co-occurs more frequently with "solid" than it does with "gas", whereas "steam" co-occurs more frequently with "gas" than it does with "solid". Both words co-occur with their shared property "water" frequently, and both co-occur with the unrelated word "fashion" infrequently. Only in the ratio of probabilities does noise from non-discriminative words like "water" and "fashion" cancel out, so that large values (much greater than 1) correlate well with properties specific to "ice", and small values (much less than 1) correlate well with properties specific to "steam". In this way, the ratio of probabilities encodes some crude form of meaning associated with the abstract concept of thermodynamic phase.
Also, GloVe is very good at analogy tasks, and performs well on the word2vec analogy dataset.

Document classification using word vectors

While I was classifying and clustering documents written in natural language, I came up with a question ...
Since word2vec, GloVe, etc. vectorize words in distributed spaces, I wonder whether there is any method recommended or commonly used for document vectorization USING word vectors.
For example,
Document1: "If you chase two rabbits, you will lose them both."
can be vectorized as,
[0.1425, 0.2718, 0.8187, .... , 0.1011]
I know about the approach known as doc2vec, where the document gets n dimensions just like word2vec. But this is a 1 x n vector, and I have been testing around to find out the limits of using doc2vec.
So, I want to know how other people apply word vectors for applications that need a fixed-size representation.
Just stacking the vectors of m words forms an m x n dimensional matrix. In this case, the dimension will not be uniform, since m depends on the number of words in the document.
If: [0.1018, ... , 0.8717]
you: [0.5182, ... , 0.8981]
..: [...]
m-th word: [...]
And this form is not favorable for running some machine learning algorithms such as CNNs. What are the suggested methods to produce document vectors of fixed size using word vectors?
It would be great if it is provided with papers as well.
Thanks!
The simplest approach to get a fixed-size vector from a text, when all you have is word-vectors, is to average all the word-vectors together. (The vectors could be weighted, but if they haven't been unit-length-normalized, their raw magnitudes from training are somewhat of an indicator of their strength-of-single-meaning: polysemous/ambiguous words tend to have vectors with smaller magnitudes.) It works OK for many purposes.
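A minimal sketch of that averaging, assuming gensim 4.x word-vectors; the toy model and document are illustrative only.

```python
import numpy as np
from gensim.models import Word2Vec

# Any source of word-vectors works; here, a tiny model trained on toy data.
sentences = [
    "if you chase two rabbits you will lose them both".split(),
    "a bird in the hand is worth two in the bush".split(),
]
model = Word2Vec(sentences, vector_size=100, min_count=1, epochs=50)

def doc_vector(tokens, wv):
    """Average the vectors of in-vocabulary tokens into one fixed-size vector."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

doc = "if you chase two rabbits you will lose them both".split()
vec = doc_vector(doc, model.wv)
print(vec.shape)  # (100,), regardless of how many words the document has
```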
Word vectors can be specifically trained to be better at composing like this, if the training texts are already associated with known classes. Facebook's FastText in its 'classification' mode does this; the word-vectors are optimized as much or more for predicting output classes of the texts they appear in, as they are for predicting their context-window neighbors (classic word2vec).
The 'Paragraph Vector' technique, often called 'doc2vec', gives every training text a sort-of floating pseudoword, that contributes to every prediction, and thus winds up with a word-vector-like position that may represent that full text, rather than the individual words/contexts.
There are many further variants, including some based on deeper predictive networks (eg 'Skip-thought Vectors'), or slightly different prediction targets (eg neighboring sentences in 'fastSent'), or other genericizations that can even include a mixture of symbolic and numeric inputs/targets during training (an option in Facebook's StarSpace, which explores other entity-vectorization possibilities related to word-vectors and FastText-like classification needs).
If you don't need to collapse a text to fixed-size vectors, but just compare texts, there are also techniques like "Word Mover's Distance" which take the "bag of word-vectors" for one text, and another, and give a similarity score.
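A small sketch of Word Mover's Distance via gensim's KeyedVectors.wmdistance (it needs an optimal-transport backend installed, e.g. POT or pyemd depending on the gensim version); the model and sentences are illustrative only.

```python
from gensim.models import Word2Vec

sentences = [
    "obama speaks to the media in illinois".split(),
    "the president greets the press in chicago".split(),
    "the cat sat on the mat".split(),
]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=100)

# Lower distance means the two "bags of word-vectors" are cheaper to
# transform into each other, i.e. the texts are more similar.
print(model.wv.wmdistance(sentences[0], sentences[1]))
print(model.wv.wmdistance(sentences[0], sentences[2]))
```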

Fuzzy clustering using unsupervised dimensionality reduction

An unsupervised dimensionality reduction algorithm takes as input a matrix NxC1, where N is the number of input vectors and C1 is the number of components of each vector (the dimensionality of the vectors). As a result, it returns a new matrix NxC2 (C2 < C1) where each vector has a lower number of components.
A fuzzy clustering algorithm takes as input a matrix NxC1, where N, here again, is the number of input vectors and C1 is the number of components of each vector. As a result, it returns a new matrix NxC2 (C2 usually lower than C1) where each component of each vector indicates the degree to which the vector belongs to the corresponding cluster.
I noticed that the input and output of both classes of algorithms have the same structure; only the interpretation of the results changes. Moreover, there is no fuzzy clustering implementation in scikit-learn, hence the following question:
Does it make sense to use a dimensionality reduction algorithm to perform fuzzy clustering?
For instance, is it nonsense to apply FeatureAgglomeration or TruncatedSVD to a dataset built from TF-IDF vectors extracted from textual data, and interpret the results as a fuzzy clustering?
In some sense, sure. It kind of depends on how you want to use the results downstream.
Consider SVD truncation or excluding principal components. We have projected into a new, variance-preserving space with essentially few other restrictions on the structure of the new manifold. The new coordinate representations of the original data points could have large negative numbers for some elements, which is a little weird. But one could shift and rescale the data without much difficulty.
One could then interpret each dimension as a cluster membership weight. But consider a common use for fuzzy clustering, which is to generate a hard clustering. Notice how easy this is with fuzzy cluster weights (e.g. just take the max). Now consider a set of points in the new dimensionally-reduced space, say p1=<0,0,1>, p2=<0,1,0>, p3=<0,100,101>, p4=<5,100,99>. A fuzzy clustering with k=2 would give something like {p1,p2}, {p3,p4} if thresholded, but if we take the max here (i.e. treat the dimensionally-reduced axes as memberships), we get {p1,p3}, {p2,p4}, for instance. Of course, one could use a better algorithm than max to derive hard memberships (say by looking at pairwise distances, which would work for my example); such algorithms are called, well, clustering algorithms.
Of course, different dimensionality reduction algorithms may work better or worse for this (e.g. MDS, which focuses on preserving distances between data points rather than variance, is more naturally cluster-like). But fundamentally, many dimensionality reduction algorithms implicitly preserve data about the underlying manifold that the data lie on, whereas fuzzy cluster vectors only hold information about the relations between data points (which may or may not implicitly encode that other information).
Overall, the purpose is a little different. Clustering is designed to find groups of similar data. Feature selection and dimensionality reduction are designed to reduce the noise and/or redundancy of the data by changing the embedding space. Often we use the latter to help with the former.
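A small scikit-learn sketch of the idea under discussion: reduce TF-IDF vectors and read the (rescaled) components as soft memberships, with NMF shown alongside TruncatedSVD because its non-negative weights lend themselves more naturally to that reading. The corpus and parameters are my own illustrative choices.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF

docs = [
    "the cat chased the mouse",
    "dogs and cats make good pets",
    "stocks fell as markets reacted to rates",
    "investors sold shares after the rate hike",
]
X = TfidfVectorizer().fit_transform(docs)

# Dimensionality reduction: components may be negative, so they need
# shifting/rescaling before any 'membership' reading makes sense.
svd_coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# NMF gives non-negative weights that read more naturally as fuzzy
# memberships once each row is normalized to sum to 1.
nmf_weights = NMF(n_components=2, random_state=0).fit_transform(X)
memberships = nmf_weights / np.maximum(nmf_weights.sum(axis=1, keepdims=True), 1e-12)

print(np.round(svd_coords, 3))
print(np.round(memberships, 3))
print(memberships.argmax(axis=1))  # a hard clustering via 'take the max'
```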

What does the "support" mean in Support Vector Machine?

What is the meaning of the word "support" in the context of a Support Vector Machine, which is a supervised learning model?
Copy-pasted from the caption of the Wikipedia figure:
Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.
In SVMs the resulting separating hyper-plane is attributed to a sub-set of the data feature vectors (i.e., the ones whose associated Lagrange multipliers are greater than 0). These feature vectors were named support vectors because, intuitively, you could say that they "support" the separating hyper-plane, or that they play the same role for the separating hyper-plane as pillars do for a building.
Now, more formally, paraphrasing Bernhard Schoelkopf's and Alexander J. Smola's book "Learning with Kernels" (page 6):
"In the searching process of the unique optimal hyper-plane we consider hyper-planes with normal vectors w that can be represented as general linear combinations (i.e., with non-uniform coefficients) of the training patterns. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce computational cost of evaluating the decision function. The hyper-plane will then only depend on a sub-set of the training patterns called Support Vectors."
That is, the separating hyper-plane depends on those training feature vectors: they influence it, it is built from them, and consequently they support it.
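A small scikit-learn sketch of exactly that: after fitting, only the samples with non-zero dual coefficients (Lagrange multipliers) are kept as support vectors, and they alone determine the hyper-plane. The toy data is mine.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated toy classes.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only these points "support" the separating hyper-plane.
print("support vector indices:", clf.support_)
print("support vectors:\n", clf.support_vectors_)
print("dual coefficients (alpha_i * y_i):", clf.dual_coef_)
# Removing any non-support point and refitting leaves the hyper-plane unchanged.
```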
In a kernel space, the simplest way to represent the separating hyperplane is in terms of kernel similarities to particular data instances. These data instances are called "support vectors".
The kernel space could be infinite. But as long as you can compute the kernel similarity to the support vectors, you can test which side of the hyperplane an object is, without actually knowing what this infinite dimensional hyperplane looks like.
In 2d, you could of course just produce an equation for the hyperplane. But this doesn't yield any actual benefits, except for understanding the SVM.
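To illustrate that last point, here is a scikit-learn sketch (RBF kernel, with gamma fixed explicitly so it can be reused) that recomputes the decision function purely from kernel similarities to the support vectors, without ever writing down the hyper-plane itself:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2) + [2, 2], rng.randn(30, 2) - [2, 2]])
y = np.array([0] * 30 + [1] * 30)

gamma = 0.5  # fixed so we can reuse it in the manual computation below
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_new = np.array([[1.0, 1.5]])

# Kernel similarities of the new point to the support vectors only.
K = rbf_kernel(x_new, clf.support_vectors_, gamma=gamma)

# decision(x) = sum_i dual_coef_i * K(sv_i, x) + intercept
manual = K @ clf.dual_coef_.ravel() + clf.intercept_
print(manual, clf.decision_function(x_new))  # the two values match
```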

Resources