From my understanding, word vectors are only ever used in terms of relations to other word vectors. For example, the word vector for "king" minus the word vector for "man" plus the word vector for "woman" should give a vector close to "queen".
Given a vector of some unknown word, can assumptions about the word be made based solely on the values of that vector?
The individual coordinates – such as dimension #7 of a 300-dimensional vector, etc – don't have easily interpretable meanings.
It's primarily the relative distances to other words (neighborhoods), and relative directions with respect to other constellations of words (orientations without regard to the perpendicular coordinate axes), that may be vaguely interpretable, because they correlate with natural-language or natural-thinking semantics.
Further, the pre-training initialization of the model, and much of the training itself, uses randomization. So even on the exact same data, words can wind up in different coordinates on repeated training runs.
The resulting word-vectors should after each run be about as useful with respect to each other, in terms of distances and directions, but neighborhoods like "words describing seasons" or "things that are 'hot'" could be in very different places in subsequent runs. Only vectors that trained together are comparable.
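To make that concrete, here is a minimal sketch. It assumes, as a simplification, that a second training run differs from the first only by a rotation of the space (real runs differ in more ways than that). The vectors are random toy data, not trained embeddings, and all names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "embeddings": 5 words, 4 dimensions (random, not trained).
vectors_run_a = rng.normal(size=(5, 4))

# A random orthogonal matrix stands in for the arbitrary orientation
# that a second training run might end up with.
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
vectors_run_b = vectors_run_a @ q


def cosine_matrix(m):
    """Pairwise cosine similarities between the rows of m."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T


# The raw coordinates differ between the two "runs" ...
coords_equal = np.allclose(vectors_run_a, vectors_run_b)
# ... but every within-run similarity is unchanged.
sims_equal = np.allclose(cosine_matrix(vectors_run_a),
                         cosine_matrix(vectors_run_b))
print(coords_equal, sims_equal)
```

So looking at dimension #7 of one word tells you nothing stable, while the neighborhoods within a single run stay meaningful.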
(There are some constrained variants of word2vec that try to force certain dimensions or directions to be more useful for certain purposes, such as answering questions or detecting hypernym/hyponym relationships – but that requires extra constraints or inputs to the training process. Plain vanilla word2vec won't be as cleanly interpretable.)
You cannot make assumptions about the word based on the values of its word vector. A single word vector does not carry information or meaning by itself, but only contains meaning in relation to other word vectors.
Algorithms such as Word2Vec and GloVe compute word vectors from the co-occurrence of words in a sequence. As an example, Word2Vec feeds the dot product of two vectors into a softmax function, which approximates the conditional probability that those two words appear in the same sequence. Word vectors are then determined such that words that frequently occur in the same context are mapped to similar vectors. Word vectors thereby capture both syntactic and semantic information.
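As a rough sketch of the softmax-over-dot-products idea (not the actual Word2Vec code; the vocabulary, the random vectors and the function name are assumptions made up for illustration):

```python
import numpy as np


def context_probabilities(center_vec, output_vecs):
    """Softmax over the dot products of one center vector with every output vector."""
    scores = output_vecs @ center_vec
    scores = scores - scores.max()      # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


rng = np.random.default_rng(1)
vocab = ["ice", "steam", "water", "fashion"]   # hypothetical tiny vocabulary
input_vecs = rng.normal(size=(len(vocab), 3))  # untrained, random vectors
output_vecs = rng.normal(size=(len(vocab), 3))

# Estimated probability of each vocabulary word appearing near "ice".
probs = context_probabilities(input_vecs[0], output_vecs)
print(dict(zip(vocab, probs.round(3))))
```

Note that the probability only exists relative to the other vectors in the model, which is why a single vector on its own says little.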
What is the difference between word2vec and glove?
Are they both ways to train a word embedding? If yes, then how can we use both?
Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.
Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.
GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.
Working from the same corpus, creating word-vectors of the same dimensionality, and devoting the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against some rough/arbitrary defaults of the other.)
I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
Conversely, some proponents of GloVe tout that it does fairly well without needing metaparameter optimization.
You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.
Word2vec is a predictive model: it trains by trying to predict a target word given a context (CBOW method) or the context words from the target (skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model’s predictions are, so as the model trains to make better predictions it will result in better embeddings.
GloVe is based on matrix factorization techniques on the word-context matrix. It first constructs a large matrix of (words x context) co-occurrence information, i.e. for each “word” (the rows), you count how frequently (matrix values) we see this word in some “context” (the columns) in a large corpus. The number of “contexts” would be very large, since it is essentially combinatorial in size. So we factorize this matrix to yield a lower-dimensional (word x features) matrix, where each row now yields a vector representation for each word. In general, this is done by minimizing a “reconstruction loss”. This loss tries to find the lower-dimensional representations which can explain most of the variance in the high-dimensional data.
Before GloVe, the algorithms for word representations could be divided into two main streams: statistics-based (LSA) and learning-based (Word2Vec). LSA produces low-dimensional word vectors by singular value decomposition (SVD) on the co-occurrence matrix, while Word2Vec employs a three-layer neural network to do the center-context word pair classification task, where word vectors are just the by-product.
The most amazing point from Word2Vec is that similar words are located together in the vector space and arithmetic operations on word vectors can express semantic or syntactic relationships, e.g., “king” - “man” + “woman” -> “queen” or “better” - “good” + “bad” -> “worse”. However, LSA cannot maintain such linear relationships in vector space.
The motivation of GloVe is to force the model to learn such linear relationships explicitly, based on the co-occurrence matrix. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. It is therefore a hybrid method that applies machine learning to the statistics matrix, and this is the general difference between GloVe and Word2Vec.
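For reference, the weighted least-squares objective can be sketched as below. This is a minimal illustration of the loss only (no training loop), with function and variable names that are my own. The defaults `x_max=100` and `alpha=0.75` are the values reported in the GloVe paper.

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): damps rare pairs and caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def glove_loss(W, W_ctx, b, b_ctx, X):
    """Sum over non-zero counts of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    i, j = np.nonzero(X)                 # only observed co-occurrences contribute
    x = X[i, j]
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return np.sum(glove_weight(x) * (pred - np.log(x)) ** 2)


# Tiny sanity check: if every count is 1, log X is 0 everywhere,
# so all-zero vectors and biases fit the matrix perfectly.
X = np.ones((3, 3))
W = np.zeros((3, 2))
b = np.zeros(3)
print(glove_loss(W, W, b, b, X))
```

The log-bilinear part is the `w_i . w~_j + b_i + b~_j` term being fitted to `log X_ij`; the weighting is what keeps very frequent pairs from dominating.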
If we dive into the derivation of the equations in GloVe, we will find the difference inherent in the intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. Take the example from StanfordNLP (Global Vectors for Word Representation), which considers the co-occurrence probabilities for target words ice and steam with various probe words from the vocabulary:
As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid.
Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
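The effect is easy to reproduce with made-up numbers. The counts below are invented for illustration (they are NOT the figures from the GloVe paper), and the probabilities are normalized over these four probe words only, which is a simplification.

```python
# Invented toy co-occurrence counts, for illustration only.
counts = {
    "ice":   {"solid": 80, "gas": 5,  "water": 300, "fashion": 2},
    "steam": {"solid": 4,  "gas": 70, "water": 280, "fashion": 2},
}


def cooccurrence_probs(row):
    """Turn raw counts for one target word into probabilities over the probe words."""
    total = sum(row.values())
    return {probe: n / total for probe, n in row.items()}


p_ice = cooccurrence_probs(counts["ice"])
p_steam = cooccurrence_probs(counts["steam"])

# Ratio P(probe | ice) / P(probe | steam) for every probe word.
ratios = {probe: p_ice[probe] / p_steam[probe] for probe in p_ice}
for probe, ratio in ratios.items():
    print(f"{probe:8s} {ratio:.2f}")
```

With these toy counts, the ratio is large for solid, small for gas, and close to 1 for both water and fashion, even though water is frequent and fashion is rare.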
Word2Vec, however, works on the plain co-occurrence probabilities: it maximizes the probability that the words surrounding the target word are predicted as its context.
In practice, to speed up the training process, Word2Vec employs negative sampling, which substitutes the softmax function with a sigmoid function operating on the real data and noise data. This implicitly results in the clustering of words into a cone in the vector space, while GloVe’s word vectors are located more discretely.
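A minimal sketch of the negative-sampling loss for a single (center, context) pair, following the form given in the word2vec paper. The function names and the toy vectors are my own assumptions, and how the noise words are sampled is left out.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(center_vec, true_context_vec, noise_vecs):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c) for one training pair."""
    positive = np.log(sigmoid(true_context_vec @ center_vec))
    negative = np.sum(np.log(sigmoid(-(noise_vecs @ center_vec))))
    return -(positive + negative)


# Toy vectors: the true context points the same way as the center word,
# while the 3 noise words point the opposite way.
center = np.array([1.0, 0.0])
true_context = np.array([2.0, 0.0])
noise = np.array([[-2.0, 0.0], [-1.0, 0.5], [-3.0, -1.0]])

good_fit = negative_sampling_loss(center, true_context, noise)
# An all-zero center vector gives sigmoid(0) = 0.5 for every term.
no_information = negative_sampling_loss(np.zeros(2), true_context, noise)
print(good_fit, no_information)
```

Only the one true pair and a handful of noise words are touched per step, instead of the whole vocabulary as with a full softmax.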
While I was classifying and clustering the documents written in natural language, I came up with a question ...
Since word2vec, GloVe, and similar methods vectorize words in a distributed space, I wonder whether there is any recommended or commonly used method for document vectorization USING word vectors.
For example,
Document1: "If you chase two rabbits, you will lose them both."
can be vectorized as,
[0.1425, 0.2718, 0.8187, .... , 0.1011]
I know about the approach known as doc2vec, where the document vector has n dimensions just like a word2vec vector. But that gives a single 1 x n vector, and I have been testing to find out the limits of using doc2vec.
So, I want to know how other people apply word vectors in applications that need a fixed-size input.
Just stacking the vectors of m words forms an m x n matrix. In this case, the size will not be uniform, since m depends on the number of words in the document.
If: [0.1018, ... , 0.8717]
you: [0.5182, ... , 0.8981]
..: [...]
m th word: [...]
And this form is not well suited to some machine learning algorithms, such as CNNs. What are the suggested methods to produce fixed-size document vectors using word vectors?
It would be great if it is provided with papers as well.
Thanks!
The simplest approach to get a fixed-size vector from a text, when all you have is word-vectors, is to average all the word-vectors together. (The vectors could be weighted, but if they haven't been unit-length-normalized, their raw magnitudes from training are somewhat of an indicator of their strength-of-single-meaning – polysemous/ambiguous words tend to have vectors with smaller magnitudes.) It works OK for many purposes.
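A minimal sketch of that averaging, assuming the word-vectors are available as a plain dict of NumPy arrays. The three 3-dimensional vectors below are made up for illustration, and the choice to skip unknown words is mine.

```python
import numpy as np

# Hypothetical word vectors; real ones would come from a trained model.
word_vectors = {
    "chase":   np.array([1.0, 0.0, 0.0]),
    "two":     np.array([0.0, 1.0, 0.0]),
    "rabbits": np.array([0.0, 0.0, 1.0]),
}


def document_vector(tokens, word_vectors, dim):
    """Average the vectors of all known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)   # no known words: fall back to a zero vector
    return np.mean(known, axis=0)


short_doc = "chase two rabbits".split()
long_doc = "if you chase two rabbits you will lose them both".split()

v_short = document_vector(short_doc, word_vectors, dim=3)
v_long = document_vector(long_doc, word_vectors, dim=3)
print(v_short.shape, v_long.shape)
```

Both documents end up with a vector of length n regardless of how many words they contain, which is what downstream classifiers or clustering need.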
Word vectors can be specifically trained to be better at composing like this, if the training texts are already associated with known classes. Facebook's FastText in its 'classification' mode does this; the word-vectors are optimized as much or more for predicting output classes of the texts they appear in, as they are for predicting their context-window neighbors (classic word2vec).
The 'Paragraph Vector' technique, often called 'doc2vec', gives every training text a sort-of floating pseudoword, that contributes to every prediction, and thus winds up with a word-vector-like position that may represent that full text, rather than the individual words/contexts.
There are many further variants, including some based on deeper predictive networks (eg 'Skip-thought Vectors'), or slightly different prediction targets (eg neighboring sentences in 'fastSent'), or other genericizations that can even include a mixture of symbolic and numeric inputs/targets during training (an option in Facebook's StarSpace, which explores other entity-vectorization possibilities related to word-vectors and FastText-like classification needs).
If you don't need to collapse a text to fixed-size vectors, but just compare texts, there are also techniques like "Word Mover's Distance" which take the "bag of word-vectors" for one text, and another, and give a similarity score.
I am sorry for my naivety, but I don't understand why word embeddings that are the result of NN training process (word2vec) are actually vectors.
Embedding is a process of dimension reduction: during training the NN reduces the 1/0 arrays of words into smaller arrays, and the process does nothing that applies vector arithmetic.
So as a result we get just arrays and not vectors. Why should I think of these arrays as vectors?
Even if we do get vectors, why does everyone depict them as vectors coming from the origin (0,0)?
Again, I am sorry if my question looks stupid.
What are embeddings?
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.
Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.
(Source: https://en.wikipedia.org/wiki/Word_embedding)
What is Word2Vec?
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.
Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
(Source: https://en.wikipedia.org/wiki/Word2vec)
What's an array?
In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key.
An array is stored so that the position of each element can be computed from its index tuple by a mathematical formula.
The simplest type of data structure is a linear array, also called one-dimensional array.
What's a vector / vector space?
A vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars.
Scalars are often taken to be real numbers, but there are also vector spaces with scalar multiplication by complex numbers, rational numbers, or generally any field.
The operations of vector addition and scalar multiplication must satisfy certain requirements, called axioms, listed below.
(Source: https://en.wikipedia.org/wiki/Vector_space)
What's the difference between vectors and arrays?
Firstly, the vector in word embeddings is not exactly the programming language data structure (so this is not the question covered in "Arrays vs Vectors: Introductory Similarities and Differences").
Programmatically, a word embedding vector IS some sort of an array (data structure) of real numbers (i.e. scalars).
Mathematically, any array of real numbers with one or more dimensions can be treated as a tensor, and a vector is the one-dimensional case: a single sequence of scalars.
To answer the OP question:
Why are word embeddings actually vectors?
By definition, word embeddings are vectors (see above)
Why do we represent words as vectors of real numbers?
To learn the differences between words, we have to quantify the difference in some manner.
Imagine that we assign these "smart" numbers to the words:
>>> semnum = semantic_numbers = {'car': 5, 'vehicle': 2, 'apple': 232, 'orange': 300, 'fruit': 211, 'samsung': 1080, 'iphone': 1200}
>>> abs(semnum['fruit'] - semnum['apple'])
21
>>> abs(semnum['samsung'] - semnum['apple'])
848
We see that fruit and apple are close to each other, but samsung and apple aren't. In this case, the single numerical "feature" of the word is capable of capturing some information about the word meanings, but not fully.
Imagine that we have two real number values for each word (i.e. a vector):
>>> import numpy as np
>>> semnum = semantic_numbers = {'car': [5, -20], 'vehicle': [2, -18], 'apple': [232, 1010], 'orange': [300, 250], 'fruit': [211, 250], 'samsung': [1080, 1002], 'iphone': [1200, 1100]}
To compute the difference, we could have done:
>>> np.array(semnum['apple']) - np.array(semnum['orange'])
array([-68, 760])
>>> np.array(semnum['apple']) - np.array(semnum['samsung'])
array([-848, 8])
That's not very informative: it returns a vector, and we can't get a single definitive measure of distance between the words. So we can try some vectorial tricks and compute the distance between the vectors, e.g. euclidean distance:
>>> import numpy as np
>>> orange = np.array(semnum['orange'])
>>> apple = np.array(semnum['apple'])
>>> samsung = np.array(semnum['samsung'])
>>> np.linalg.norm(apple-orange)
763.03604108849277
>>> np.linalg.norm(apple-samsung)
848.03773500947466
>>> np.linalg.norm(orange-samsung)
1083.4685043876448
Now, we can see more "information": apple is closer to samsung than orange is to samsung. Possibly that's because apple co-occurs in the corpus more frequently with samsung than orange does.
The big question comes, "How do we get these real numbers to represent the vector of the words?". That's where Word2Vec and other embedding training algorithms (the neural approach goes back to Bengio 2003) come in.
Taking a detour
Since adding more real numbers to the vector representing the words is more informative then why don't we just add a lot more dimensions (i.e. numbers of columns in each word vector)?
Traditionally, we compute the differences between words by computing the word-by-word matrices in the field of distributional semantics/distributed lexical semantics, but the matrices become really sparse with many zero values if the words don't co-occur with another.
Thus a lot of effort has been put into dimensionality reduction after computing the word co-occurrence matrix. IMHO, it's like a top-down view of how global relations between words are and then compressing the matrix to get a smaller vector to represent each word.
So the "deep learning" word embedding creation comes from another school of thought: it starts with a randomly (sometimes not-so-randomly) initialized layer of vectors, one for each word, and learns the parameters/weights of these vectors by minimizing some loss function based on some defined properties.
It sounds a little vague but concretely, if we look at the Word2Vec learning technique, it'll be clearer, see
https://rare-technologies.com/making-sense-of-word2vec/
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
https://arxiv.org/pdf/1402.3722.pdf (more mathematical)
Here are more resources to read up on word embeddings: https://github.com/keon/awesome-nlp#word-vectors
the process does nothing that applies vector arithmetic
The training process has nothing to do with vector arithmetic, but when the arrays are produced, it turns out they have pretty nice properties, so that one can think of "word linear space".
For example, what words have embeddings closest to a given word in this space?
Put differently, words with similar meaning form a cloud. Here's a 2-D t-SNE representation:
Another example, the distance between "man" and "woman" is very close to the distance between "uncle" and "aunt":
As a result, you have pretty much reasonable arithmetic:
W("woman") − W("man") ≃ W("aunt") − W("uncle")
W("woman") − W("man") ≃ W("queen") − W("king")
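Here is a toy illustration of that arithmetic. The 2-D vectors are hand-built so that one axis roughly encodes gender and the other royalty; they are invented for this example and do not come from a trained model.

```python
import numpy as np

# Hand-built toy vectors: axis 0 is a made-up "gender" direction,
# axis 1 a made-up "royalty" direction.
W = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "uncle": np.array([1.0, 0.5]),
    "aunt":  np.array([-1.0, 0.5]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}


def nearest_word(target, exclude=()):
    """Return the word whose vector has the smallest euclidean distance to target."""
    candidates = {w: v for w, v in W.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))


# The same offset separates each male/female pair.
offsets_match = np.allclose(W["woman"] - W["man"], W["aunt"] - W["uncle"])

# king - man + woman lands on (or next to) queen.
answer = nearest_word(W["king"] - W["man"] + W["woman"],
                      exclude=("king", "man", "woman"))
print(offsets_match, answer)
```

In real embeddings the offsets only match approximately, and the query words are usually excluded from the candidates, as done here.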
So it's not far-fetched to call them vectors. All pictures are from this wonderful post that I very much recommend reading.
Each word is mapped to a point in d-dimensional space (d is usually 300 or 600, though not necessarily), thus it's called a vector (each point in d-dimensional space is nothing but a vector in that space).
The points have some nice properties (words with similar meanings tend to occur closer to each other) [proximity is measured using cosine distance between 2 word vectors]
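For reference, cosine distance can be computed as below. This is a minimal sketch with made-up vectors; the convention used here is cosine distance = 1 - cosine similarity.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cos(angle between a and b); ignores the lengths of the vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a                      # same direction, twice the length
c = np.array([3.0, 0.0, -1.0])   # orthogonal to a

print(cosine_distance(a, b))     # direction is identical
print(np.linalg.norm(a - b))     # euclidean distance is not zero
print(cosine_distance(a, c))
```

Because only the direction matters, two words whose vectors point the same way count as close even if one vector is much longer than the other.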
The well-known Word2Vec architectures are CBOW and Skip-Gram.
Your input for CBOW is your input word vectors (each is a vector of length N; N = size of vocabulary). All these input word vectors together are an array of size M x N (M = number of words).
Now what is interesting in the graphic below is the projection step, where we force an NN to learn a lower dimensional representation of our input space to predict the output correctly. For CBOW, the desired output is the word at the center of the input (context) words.
This lower dimensional representation P consists of abstract features describing words e.g. location, adjective, etc. (in reality these learned features are not really clear). Now these features represent one view on these words.
And like with all features, we can see them as high-dimensional vectors.
If you want you can use dimensionality reduction techniques to display them in 2 or 3 dimensional space.
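The projection step can be sketched like this. The vocabulary, the embedding size and the random (untrained) projection matrix are assumptions for illustration; the averaging of the context projections follows the CBOW description in the paper linked below.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]   # hypothetical vocabulary
N = len(vocab)                               # vocabulary size
d = 3                                        # size of the lower dimensional representation

rng = np.random.default_rng(0)
P = rng.normal(size=(N, d))                  # projection matrix, one row per word (untrained)


def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v


# Context words around the center word "sat"; M = 4 input words.
context = ["the", "cat", "on", "mat"]
X = np.stack([one_hot(vocab.index(w), N) for w in context])   # shape (M, N)

projected = X @ P                  # shape (M, d); each row is just a row of P
hidden = projected.mean(axis=0)    # CBOW averages the projected context words

print(X.shape, projected.shape, hidden.shape)
```

Multiplying a one-hot vector by `P` simply selects one row of `P`, and after training those rows are the word vectors.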
More details and source of graphic: https://arxiv.org/pdf/1301.3781.pdf
I'm reading the original word2vec paper: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
According to the equation below, every word has two vectors: one is used when the word is the center word predicting its context, the other when it is a context word. The former can be updated with gradient descent in each iteration. But how is the latter updated? And which vector is the final vector in the final model?
To my understanding, irrespective of which architecture is used (skip-gram/CBOW), word vectors are read from the same word-vector matrix.
As suggested in the second footnote of the paper, v_in and v'_out of the same word (say dog) should be different, and they are assumed to come from different vocabularies during the derivation of the loss function.
Practically, the probability of a word appearing in its own context is very low, and most implementations don't save two vector representations of the same word, to save memory and improve efficiency.
What is the meaning of the word "support" in the context of Support Vector Machine, which is a supervised learning model?
Copy-pasted from Wikipedia:
Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.
In SVMs the resulting separating hyper-plane is attributed to a sub-set of data feature vectors (i.e., the ones whose associated Lagrange multipliers are greater than 0). These feature vectors were named support vectors because intuitively you could say that they "support" the separating hyper-plane, or you could say that for the separating hyper-plane the support vectors play the same role as the pillars of a building.
Now formally, paraphrasing Bernhard Schoelkopf's and Alexander J. Smola's book titled "Learning with Kernels", page 6:
"In the searching process of the unique optimal hyper-plane we consider hyper-planes with normal vectors w that can be represented as general linear combinations (i.e., with non-uniform coefficients) of the training patterns. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce computational cost of evaluating the decision function. The hyper-plane will then only depend on a sub-set of the training patterns called Support Vectors."
That is, the separating hyper-plane depends on those training data feature vectors, they influence it, it's based on them, consequently they support it.
In a kernel space, the simplest way to represent the separating hyperplane is by the distance to data instances. These data instances are called "support vectors".
The kernel space could be infinite-dimensional. But as long as you can compute the kernel similarity to the support vectors, you can test which side of the hyperplane an object is on, without actually knowing what this infinite dimensional hyperplane looks like.
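As a sketch of that idea, the decision function only ever needs kernel values between the new point and the support vectors. The two support vectors, their coefficients (alpha_i * y_i) and the bias below are hypothetical values chosen by hand, not the result of training an SVM. The RBF kernel is used because its feature space is infinite-dimensional.

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel: similarity that decays with squared euclidean distance."""
    return np.exp(-gamma * np.sum((a - b) ** 2))


def decision_value(x, support_vectors, dual_coefs, bias, kernel=rbf_kernel):
    """f(x) = sum_i (alpha_i * y_i) * K(sv_i, x) + b; the sign gives the side."""
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, dual_coefs)) + bias


# Hypothetical, hand-picked values (not from a trained model).
support_vectors = np.array([[0.0, 0.0], [2.0, 2.0]])
dual_coefs = np.array([-1.0, 1.0])   # alpha_i * y_i for each support vector
bias = 0.0

near_positive = decision_value(np.array([2.0, 2.0]), support_vectors, dual_coefs, bias)
near_negative = decision_value(np.array([0.0, 0.0]), support_vectors, dual_coefs, bias)
print(near_positive, near_negative)
```

No other training points appear anywhere in the computation, and the hyperplane itself is never written down explicitly.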
In 2d, you could of course just produce an equation for the hyperplane. But this doesn't yield any actual benefits, except for understanding the SVM.