word2vec: negative sampling (in layman term)? - machine-learning

I'm reading the paper below and I have some trouble , understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help , please?

The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have
v_c . v_w
-------------------
sum_i(v_ci . v_w)
The numerator is basically the similarity between words c (the context) and w (the target) word. The denominator computes the similarity of all other contexts ci and the target word w. Maximising this ratio ensures words that appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem- just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measures by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all other words in language. This makes word2vec much much faster to train.

Computing Softmax (Function to determine which words are similar to the current target word) is expensive since requires summing over all words in V (denominator), which is generally very large.
What can be done?
Different strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact, but modify its architecture to improve its efficiency (e.g hierarchical softmax). Sampling-based approaches on the other hand completely do away with the softmax layer and instead optimise some other loss function that approximates the softmax (They do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute like negative sampling).
The loss function in Word2vec is something like:
Which logarithm can decompose into:
With some mathematic and gradient formula (See more details at 6) it converted to:
As you see it converted to binary classification task (y=1 positive class, y=0 negative class). As we need labels to perform our binary classification task, we designate all context words c as true labels (y=1, positive sample), and k randomly selected from corpora as false labels (y=0, negative sample).
Look at the following paragraph. Assume our target word is "Word2vec". With window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words consider as positive labels. We also need some negative labels. We randomly pick some words from corpus (produce, software, Collobert, margin-based, probabilistic) and consider them as negative samples. This technique that we picked some randomly example from corpus is called negative sampling.
Reference :
(1) C. Dyer, "Notes on Noise Contrastive Estimation and Negative Sampling", 2014
(2) http://sebastianruder.com/word-embeddings-softmax/

I wrote an tutorial article about negative sampling here.
Why do we use negative sampling? -> to reduce computational cost
The cost function for vanilla Skip-Gram (SG) and Skip-Gram negative sampling (SGNS) looks like this:
Note that T is the number of all vocabs. It is equivalent to V. In the other words, T = V.
The probability distribution p(w_t+j|w_t) in SG is computed for all V vocabs in the corpus with:
V can easily exceed tens of thousand when training Skip-Gram model. The probability needs to be computed V times, making it computationally expensive. Furthermore, the normalization factor in the denominator requires extra V computations.
On the other hand, the probability distribution in SGNS is computed with:
c_pos is a word vector for positive word, and W_neg is word vectors for all K negative samples in the output weight matrix. With SGNS, the probability needs to be computed only K + 1 times, where K is typically between 5 ~ 20. Furthermore, no extra iterations are necessary to compute the normalization factor in the denominator.
With SGNS, only a fraction of weights are updated for each training sample, whereas SG updates all millions of weights for each training sample.
How does SGNS achieve this? -> by transforming multi-classification task into binary classification task.
With SGNS, word vectors are no longer learned by predicting context words of a center word. It learns to differentiate the actual context words (positive) from randomly drawn words (negative) from the noise distribution.
In real life, you don't usually observe regression with random words like Gangnam-Style, or pimples. The idea is that if the model can distinguish between the likely (positive) pairs vs unlikely (negative) pairs, good word vectors will be learned.
In the above figure, current positive word-context pair is (drilling, engineer). K=5 negative samples are randomly drawn from the noise distribution: minimized, primary, concerns, led, page. As the model iterates through the training samples, weights are optimized so that the probability for positive pair will output p(D=1|w,c_pos)≈1, and probability for negative pairs will output p(D=1|w,c_neg)≈0.

Related

What's the major difference between glove and word2vec?

What is the difference between word2vec and glove?
Are both the ways to train a word embedding? if yes then how can we use both?
Yes, they're both ways to train a word embedding. They both provide the same core output: one vector per word, with the vectors in a useful arrangement. That is, the vectors' relative distances/directions roughly correspond with human ideas of overall word relatedness, and even relatedness along certain salient semantic dimensions.
Word2Vec does incremental, 'sparse' training of a neural network, by repeatedly iterating over a training corpus.
GloVe works to fit vectors to model a giant word co-occurrence matrix built from the corpus.
Working from the same corpus, creating word-vectors of the same dimensionality, and devoting the same attention to meta-optimizations, the quality of their resulting word-vectors will be roughly similar. (When I've seen someone confidently claim one or the other is definitely better, they've often compared some tweaked/best-case use of one algorithm against some rough/arbitrary defaults of the other.)
I'm more familiar with Word2Vec, and my impression is that Word2Vec's training better scales to larger vocabularies, and has more tweakable settings that, if you have the time, might allow tuning your own trained word-vectors more to your specific application. (For example, using a small-versus-large window parameter can have a strong effect on whether a word's nearest-neighbors are 'drop-in replacement words' or more generally words-used-in-the-same-topics. Different downstream applications may prefer word-vectors that skew one way or the other.)
Conversely, some proponents of GLoVe tout that it does fairly well without needing metaparameter optimization.
You probably wouldn't use both, unless comparing them against each other, because they play the same role for any downstream applications of word-vectors.
Word2vec is a predictive model: trains by trying to predict a target word given a context (CBOW method) or the context words from the target (skip-gram method). It uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions. The loss function for training the model is related to how good the model’s predictions are, so as the model trains to make better predictions it will result in better embeddings.
The Glove is based on matrix factorization techniques on the word-context matrix. It first constructs a large matrix of (words x context) co-occurrence information, i.e. for each “word” (the rows), you count how frequently (matrix values) we see this word in some “context” (the columns) in a large corpus. The number of “contexts” would be very large, since it is essentially combinatorial in size. So we factorize this matrix to yield a lower-dimensional (word x features) matrix, where each row now yields a vector representation for each word. In general, this is done by minimizing a “reconstruction loss”. This loss tries to find the lower-dimensional representations which can explain most of the variance in the high-dimensional data.
Before GloVe, the algorithms of word representations can be divided into two main streams, the statistic-based (LDA) and learning-based (Word2Vec). LDA produces the low dimensional word vectors by singular value decomposition (SVD) on the co-occurrence matrix, while Word2Vec employs a three-layer neural network to do the center-context word pair classification task where word vectors are just the by-product.
The most amazing point from Word2Vec is that similar words are located together in the vector space and arithmetic operations on word vectors can pose semantic or syntactic relationships, e.g., “king” - “man” + “woman” -> “queen” or “better” - “good” + “bad” -> “worse”. However, LDA cannot maintain such linear relationship in vector space.
The motivation of GloVe is to force the model to learn such linear relationship based on the co-occurreence matrix explicitly. Essentially, GloVe is a log-bilinear model with a weighted least-squares objective. Obviously, it is a hybrid method that uses machine learning based on the statistic matrix, and this is the general difference between GloVe and Word2Vec.
If we dive into the deduction procedure of the equations in GloVe, we will find the difference inherent in the intuition. GloVe observes that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. Take the example from StanfordNLP (Global Vectors for Word Representation), to consider the co-occurrence probabilities for target words ice and steam with various probe words from the vocabulary:
As one might expect, ice co-occurs more frequently with solid than it
does with gas, whereas steam co-occurs more frequently with gas than
it does with solid.
Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.
Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific of steam.
However, Word2Vec works on the pure co-occurrence probabilities so that the probability that the words surrounding the target word to be the context is maximized.
In the practice, to speed up the training process, Word2Vec employs negative sampling to substitute the softmax fucntion by the sigmoid function operating on the real data and noise data. This emplicitly results in the clustering of words into a cone in the vector space while GloVe’s word vectors are located more discretely.

How to represent a set of words as a vector in my application? [duplicate]

I am sorry for my naivety, but I don't understand why word embeddings that are the result of NN training process (word2vec) are actually vectors.
Embedding is the process of dimension reduction, during the training process NN reduces the 1/0 arrays of words into smaller size arrays, the process does nothing that applies vector arithmetic.
So as result we got just arrays and not the vectors. Why should I think of these arrays as vectors?
Even though, we got vectors, why does everyone depict them as vectors coming from the origin (0,0)?
Again, I am sorry if my question looks stupid.
What are embeddings?
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.
Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.
(Source: https://en.wikipedia.org/wiki/Word_embedding)
What is Word2Vec?
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.
Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
(Source: https://en.wikipedia.org/wiki/Word2vec)
What's an array?
In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key.
An array is stored so that the position of each element can be computed from its index tuple by a mathematical formula.
The simplest type of data structure is a linear array, also called one-dimensional array.
What's a vector / vector space?
A vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars.
Scalars are often taken to be real numbers, but there are also vector spaces with scalar multiplication by complex numbers, rational numbers, or generally any field.
The operations of vector addition and scalar multiplication must satisfy certain requirements, called axioms, listed below.
(Source: https://en.wikipedia.org/wiki/Vector_space)
What's the difference between vectors and arrays?
Firstly, the vector in word embeddings is not exactly the programming language data structure (so it's not Arrays vs Vectors: Introductory Similarities and Differences).
Programmatically, a word embedding vector IS some sort of an array (data structure) of real numbers (i.e. scalars)
Mathematically, any element with one or more dimension populated with real numbers is a tensor. And a vector is a single dimension of scalars.
To answer the OP question:
Why are word embedding actually vectors?
By definition, word embeddings are vectors (see above)
Why do we represent words as vectors of real numbers?
To learn the differences between words, we have to quantify the difference in some manner.
Imagine, if we assign theses "smart" numbers to the words:
>>> semnum = semantic_numbers = {'car': 5, 'vehicle': 2, 'apple': 232, 'orange': 300, 'fruit': 211, 'samsung': 1080, 'iphone': 1200}
>>> abs(semnum['fruit'] - semnum['apple'])
21
>>> abs(semnum['samsung'] - semnum['apple'])
848
We see that the distance between fruit and apple is close but samsung and apple isn't. In this case, the single numerical "feature" of the word is capable of capturing some information about the word meanings but not fully.
Imagine the we have two real number values for each word (i.e. vector):
>>> import numpy as np
>>> semnum = semantic_numbers = {'car': [5, -20], 'vehicle': [2, -18], 'apple': [232, 1010], 'orange': [300, 250], 'fruit': [211, 250], 'samsung': [1080, 1002], 'iphone': [1200, 1100]}
To compute the difference, we could have done:
>>> np.array(semnum['apple']) - np.array(semnum['orange'])
array([-68, 761])
>>> np.array(semnum['apple']) - np.array(semnum['samsung'])
array([-848, 8])
That's not very informative, it returns a vector and we can't get a definitive measure of distance between the words, so we can try some vectorial tricks and compute the distance between the vectors, e.g. euclidean distance:
>>> import numpy as np
>>> orange = np.array(semnum['orange'])
>>> apple = np.array(semnum['apple'])
>>> samsung = np.array(semnum['samsung'])
>>> np.linalg.norm(apple-orange)
763.03604108849277
>>> np.linalg.norm(apple-samsung)
848.03773500947466
>>> np.linalg.norm(orange-samsung)
1083.4685043876448
Now, we can see more "information" that apple can be closer to samsung than orange to samsung. Possibly that's because apple co-occurs in the corpus more frequently with samsung than orange.
The big question comes, "How do we get these real numbers to represent the vector of the words?". That's where the Word2Vec / embedding training algorithms (originally conceived by Bengio 2003) comes in.
Taking a detour
Since adding more real numbers to the vector representing the words is more informative then why don't we just add a lot more dimensions (i.e. numbers of columns in each word vector)?
Traditionally, we compute the differences between words by computing the word-by-word matrices in the field of distributional semantics/distributed lexical semantics, but the matrices become really sparse with many zero values if the words don't co-occur with another.
Thus a lot of effort has been put into dimensionality reduction after computing the word co-occurrence matrix. IMHO, it's like a top-down view of how global relations between words are and then compressing the matrix to get a smaller vector to represent each word.
So the "deep learning" word embedding creation comes from the another school of thought and starts with a randomly (sometimes not-so random) initialized a layer of vectors for each word and learning the parameters/weights for these vectors and optimizing these parameters/weights by minimizing some loss function based on some defined properties.
It sounds a little vague but concretely, if we look at the Word2Vec learning technique, it'll be clearer, see
https://rare-technologies.com/making-sense-of-word2vec/
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
https://arxiv.org/pdf/1402.3722.pdf (more mathematical)
Here's more resources to read-up on word embeddings: https://github.com/keon/awesome-nlp#word-vectors
the process does nothing that applies vector arithmetic
The training process has nothing to do with vector arithmetic, but when the arrays are produced, it turns out they have pretty nice properties, so that one can think of "word linear space".
For example, what words have embeddings closest to a given word in this space?
Put it differently, words with similar meaning form a cloud. Here's a 2-D t-SNE representation:
Another example, the distance between "man" and "woman" is very close to the distance between "uncle" and "aunt":
As a result, you have pretty much reasonable arithmetic:
W("woman") − W("man") ≃ W("aunt") − W("uncle")
W("woman") − W("man") ≃ W("queen") − W("king")
So it's not far fetched to call them vectors. All pictures are from this wonderful post that I very much recommend to read.
Each word is mapped to a point in d-dimension space (d is usually 300 or 600 though not necessary), thus its called a vector (each point in d-dim space is nothing but a vector in that d-dim space).
The points have some nice properties (words with similar meanings tend to occur closer to each other) [proximity is measured using cosine distance between 2 word vectors]
Famous Word2Vec implementation is CBOW + Skip-Gram
Your input for CBOW is your input word vector (each is a vector of length N; N = size of vocabulary). All these input word vectors together are an array of size M x N; M=length of words).
Now what is interesting in the graphic below is the projection step, where we force an NN to learn a lower dimensional representation of our input space to predict the output correctly. The desired output is our original input.
This lower dimensional representation P consists of abstract features describing words e.g. location, adjective, etc. (in reality these learned features are not really clear). Now these features represent one view on these words.
And like with all features, we can see them as high-dimensional vectors.
If you want you can use dimensionality reduction techniques to display them in 2 or 3 dimensional space.
More details and source of graphic: https://arxiv.org/pdf/1301.3781.pdf

Artificial Neural Network with unbalanced weights

I have been reading the concept of ANN for applying it on my project (credit card fraud detection). Given a set of inputs to the network, say:
A1 - Time to input PIN
A2 - Amount to be withdrawn
A3 - ATM location
A4 - Global behavior (Time & date, & sequence in performing a transaction )
The more any of these inputs deviates from the "norm", the greater the weight of that input to the network. Here comes my question, how does the Neural Network treat a situation whereby one input's weight, say A1, is high whilst all the other weights are low?
The input probability density functions combine to form a multidimensional probability distribution (usually an ellipsoid in that many dimensions). The combination of the inputs is a vector, and the probability value at that point in the N-space tells you how likely it is to be real or fake. This works along each of the axes, where all but one input would be zero, as well as out where all the variables have significant values. If all of your inputs have smooth gaussian probability distributions your resulting probability distribution is a hyperellipsoid and you don't really need a neural net.
Using a neural net gets economical when you have a complicated probability density in one or more of the variables, or if combining the variables creates unexpected features (holes and bumps) in the probability density. Then the training of the neural net over a large number of real input combinations and known results tells it what regions of the input space are interesting and what regions are mundane. Again, you could just map them yourself in a big N-dimensional array with high resolution, if you have enough memory, but where's the fun in that? The neural net will also interpolate smoothly between regions, which may make its decisions more fuzzy than the actual probability space (i.e., that's where the accuracy metric drops below 100%).

Text Classification - how to find the features that most affected the decision

When using SVMlight or LIBSVM in order to classify phrases as positive or negative (Sentiment Analysis), is there a way to determine which are the most influential words that affected the algorithms decision? For example, finding that the word "good" helped determine a phrase as positive, etc.
If you use the linear kernel then yes - simply compute the weights vector:
w = SUM_i y_i alpha_i sv_i
Where:
sv - support vector
alpha - coefficient found with SVMlight
y - corresponding class (+1 or -1)
(in some implementations alpha's are already multiplied by y_i and so they are positive/negative)
Once you have w, which is of dimensions 1 x d where d is your data dimension (number of words in the bag of words/tfidf representation) simply select the dimensions with high absolute value (no matter positive or negative) in order to find the most important features (words).
If you use some kernel (like RBF) then the answer is no, there is no direct method of taking out the most important features, as the classification process is performed in completely different way.
As #lejlot mentioned, with linear kernel in SVM, one of the feature ranking strategies is based on the absolute values of weights in the model. Another simple and effective strategy is based on F-score. It considers each feature separately and therefore cannot reveal mutual information between features. You can also determine how important a feature is by removing that feature and observe the classification performance.
You can see this article for more details on feature ranking.
With other kernels in SVM, the feature ranking is not that straighforward, yet still feasible. You can construct an orthogonal set of basis vectors in the kernel space, and calculate the weights by kernel relief. Then the implicit feature ranking can be done based on the absolute value of weights. Finally the data is projected into the learned subspace.

Query about SVM mapping of input vector? And SVM optimization equation

I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector which has a set of features and bases on how the "optimization function" evaluates this input vector lets call it x, (lets say we're talking about text classification), the text associated with the input vector x is classified into one of two pre-defined classes, this is only in the case of binary classification.
So my first question is through this procedure described above, all the papers say first that this training input vector x is mapped to a higher (maybe infinite) dimensional space. So what does this mapping achieve or why is this required? Lets say the input vector x has 5 features so who decides which "higher dimension" x is going to be mapped to?
Second question is about the following optimization equation:
min 1/2 wi(transpose)*wi + C Σi = 1..n ξi
so I understand that w has something to do with the margins of the hyperplane from the support vectors in the graph and I know that C is some sort of a penalty but I dont' know what it is a penalty for. And also what is ξi representing in this case.
A simple explanation of the second question would be much appreciated as I have not had much luck understanding it by reading technical papers.
When they talk about mapping to a higher-dimensional space, they mean that the kernel accomplishes the same thing as mapping the points to a higher-dimensional space and then taking dot products there. SVMs are fundamentally a linear classifier, but if you use kernels, they're linear in a space that's different from the original data space.
To be concrete, let's talk about the kernel
K(x, y) = (xy + 1)^2 = (xy)^2 + 2xy + 1,
where x and y are each real numbers (one-dimensional). Note that
(x^2, sqrt(2) x, 1) • (y^2, sqrt(2) y, 1) = x^2 y^2 + 2 x y + 1
has the same value. So K(x, y) = phi(x) • phi(y), where phi(a) = (a^2, sqrt(2), 1), and doing an SVM with this kernel (the inhomogeneous polynomial kernel of degree 2) is the same as if you first mapped your 1d points into this 3d space and then did a linear kernel.
The popular Gaussian RBF kernel function is equivalent to mapping your points into an infinite-dimensional Hilbert space.
You're the one who decides what feature space it's mapped into, when you pick a kernel. You don't necessarily need to think about the explicit mapping when you do that, though, and it's important to note that the data is never actually transformed into that high-dimensional space explicitly - then infinite-dimensional points would be hard to represent. :)
The ξ_i are the "slack variables". Without them, SVMs would never be able to account for training sets that aren't linearly separable -- which most real-world datasets aren't. The ξ in some sense are the amount you need to push data points on the wrong side of the margin over to the correct side. C is a parameter that determines how much it costs you to increase the ξ (that's why it's multiplied there).
1) The higher dimension space happens through the kernel mechanism. However, when evaluating the test sample, the higher dimension space need not be explicitly computed. (Clearly this must be the case because we cannot represent infinite dimensions on a computer.) For instance, radial basis function kernels imply infinite dimensional spaces, yet we don't need to map into this infinite dimension space explicitly. We only need to compute, K(x_sv,x_test), where x_sv is one of the support vectors and x_test is the test sample.
The specific higher dimensional space is chosen by the training procedure and parameters, which choose a set of support vectors and their corresponding weights.
2) C is the weight associated with the cost of not being able to classify the training set perfectly. The optimization equation says to trade-off between the two undesirable cases of non-perfect classification and low margin. The ξi variables represent by how much we're unable to classify instance i of the training set, i.e., the training error of instance i.
See Chris Burges' tutorial on SVM's for about the most intuitive explanation you're going to get of this stuff anywhere (IMO).

Resources