Word2Vec Data Setup - machine-learning

Word2Vec Data Setup - machine-learning

In the Word2Vec Skip-gram setup that follows, what is the data setup for the output layer? Is it a matrix that is zero everywhere but with a single "1" in each of the C rows - that represents the words in the C context?
Add to describe Data Setup Question:
Meaning what the dataset would look like that was presented to the NN? Lets consider this to be "what does a single training example look like"?. I assume the total input is a matrix, where each row is a word in the vocabulary (and there is a column for each word as well and each cell is zero except where for the specific word - one hot encoded)? Thus, a single training example is 1xV as shown below (all zeros except for the specific word, whose value is a 1). This aligns with the picture above in that the input is V-dim. I expected that the total input matrix would have duplicated rows however - where the same one-hot encoded vector would be repeated for each time the word was found in the corpus (as the output or target variable would be different).
The Output (target) is more confusing to me. I expected it would exactly mirror the input -- a single training example has a "multi"-hot encoded vector that is zero except is a "1" in C of the cells, denoting that a particular word was in the context of the input word (C = 5 if we are looking, for example, 2 words behind and 3 words ahead of the given input word instance). The picture doesn't seem to agree with this though. I dont understand what appears like C different output layers that share the same W' weight matrix?

The skip-gram architecture has word embeddings as its output (and its input). Depending on its precise implementation, the network may therefore produce two embeddings per word (one embedding for the word as an input word, and one embedding for the word as an output word; this is the case in the basic skip-gram architecture with the traditional softmax function), or one embedding per word (this is the case in a setup with the hierarchical softmax as an approximation to the full softmax, for example).
You can find more information about these architectures in the original word2vec papers, such as Distributed Representations of Words and Phrases
and their Compositionality by Mikolov et al.

Related

Document classification using word vectors

While I was classifying and clustering the documents written in natural language, I came up with a question ...
As word2vec and glove, and or etc, vectorize the word in distributed spaces, I wonder if there are any method recommended or commonly used for document vectorization USING word vectors.
For example,
Document1: "If you chase two rabbits, you will lose them both."
can be vectorized as,
[0.1425, 0.2718, 0.8187, .... , 0.1011]
I know about the one also known as doc2vec, that this document has n dimensions just like word2vec. But this is 1 x n dimensions and I have been testing around to find out the limits of using doc2vec.
So, I want to know how other people apply the word vectors for applications with steady size.
Just stacking vectors with m words will be formed m x n dimensional vectors. In this case, the vector dimension will not be uniformed since dimension m will depends on the number of words in document.
If: [0.1018, ... , 0.8717]
you: [0.5182, ... , 0.8981]
..: [...]
m th word: [...]
And this form is not favorable form to run some machine learning algorithms such as CNN. What are the suggested methods to produce the document vectors in steady form using word vectors?
It would be great if it is provided with papers as well.
Thanks!

The most simple approach to get a fixed-size vector from a text, when all you have is word-vectors, to average all the word-vectors together. (The vectors could be weighted, but if they haven't been unit-length-normalized, their raw magnitudes from training are somewhat of an indicator of their strength-of-single-meaning – polysemous/ambiguous words tend to have vectors with smaller magnitudes.) It works OK for many purposes.
Word vectors can be specifically trained to be better at composing like this, if the training texts are already associated with known classes. Facebook's FastText in its 'classification' mode does this; the word-vectors are optimized as much or more for predicting output classes of the texts they appear in, as they are for predicting their context-window neighbors (classic word2vec).
The 'Paragraph Vector' technique, often called 'doc2vec', gives every training text a sort-of floating pseudoword, that contributes to every prediction, and thus winds up with a word-vector-like position that may represent that full text, rather than the individual words/contexts.
There are many further variants, including some based on deeper predictive networks (eg 'Skip-thought Vectors'), or slightly different prediction targets (eg neighboring sentences in 'fastSent'), or other genericizations that can even include a mixture of symbolic and numeric inputs/targets during training (an option in Facebook's StarSpace, which explores other entity-vectorization possibilities related to word-vectors and FastText-like classification needs).
If you don't need to collapse a text to fixed-size vectors, but just compare texts, there are also techniques like "Word Mover's Distance" which take the "bag of word-vectors" for one text, and another, and give a similarity score.

Do word vectors mean anything on their own?

From my understanding, word vectors are only ever used in terms of relations to other word vectors. For example, the word vector for "king" minus the word vector for "boy" should give a vector close to "queen".
Given a vector of some unknown word, can assumptions about the word be made based solely on the values of that vector?

The individual coordinates – such as dimension #7 of a 300-dimensional vector, etc – don't have easily interpretable meanings.
It's primarily the relative distances to other words (neighborhoods), and relative directions with respect to other constellations of words (orientations without regard to the perpendicular coordinate axes, that may be vaguely interpretable, because they correlate with natural-language or natural-thinking semantics.
Further, the pre-training initialization of the model, and much of the training itself, uses randomization. So even on the exact same data, words can wind up in different coordinates on repeated training runs.
The resulting word-vectors should after each run be about as useful with respect to each other, in terms of distances and directions, but neighborhoods like "words describing seasons" or "things that are 'hot'" could be in very different places in subsequent runs. Only vectors that trained together are comparable.
(There are some constrained variants of word2vec that try to force certain dimensions or directions to be more useful for certain purposes, such as answering questions or detecting hypernym/hyponym relationships – but that requires extra constraints or inputs to the training process. Plain vanilla word2vec won't be as cleanly interpretable.)

You cannot make assumptions about the word based on the values of its word vector. A single word vector does not carry information or meaning by itself, but only contains meaning in relation to other word vectors.
Word vectors using algorithms such as Word2Vec and GloVe are computed and rely on the co-occurrence of words in a sequence. As an example, Word2Vec employs the dot product of two vectors as an input to a softmax function, which approximates the conditional probability that those two words appear in the same sequence. Word vectors are then determined such that words that frequently occur in the same context are mapped to similar vectors. Word vectors thereby capture both syntactic and semantic information.

How to represent a set of words as a vector in my application? [duplicate]

I am sorry for my naivety, but I don't understand why word embeddings that are the result of NN training process (word2vec) are actually vectors.
Embedding is the process of dimension reduction, during the training process NN reduces the 1/0 arrays of words into smaller size arrays, the process does nothing that applies vector arithmetic.
So as result we got just arrays and not the vectors. Why should I think of these arrays as vectors?
Even though, we got vectors, why does everyone depict them as vectors coming from the origin (0,0)?
Again, I am sorry if my question looks stupid.

What are embeddings?
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.
Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.
(Source: https://en.wikipedia.org/wiki/Word_embedding)
What is Word2Vec?
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.
Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
(Source: https://en.wikipedia.org/wiki/Word2vec)
What's an array?
In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key.
An array is stored so that the position of each element can be computed from its index tuple by a mathematical formula.
The simplest type of data structure is a linear array, also called one-dimensional array.
What's a vector / vector space?
A vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars.
Scalars are often taken to be real numbers, but there are also vector spaces with scalar multiplication by complex numbers, rational numbers, or generally any field.
The operations of vector addition and scalar multiplication must satisfy certain requirements, called axioms, listed below.
(Source: https://en.wikipedia.org/wiki/Vector_space)
What's the difference between vectors and arrays?
Firstly, the vector in word embeddings is not exactly the programming language data structure (so it's not Arrays vs Vectors: Introductory Similarities and Differences).
Programmatically, a word embedding vector IS some sort of an array (data structure) of real numbers (i.e. scalars)
Mathematically, any element with one or more dimension populated with real numbers is a tensor. And a vector is a single dimension of scalars.
To answer the OP question:
Why are word embedding actually vectors?
By definition, word embeddings are vectors (see above)
Why do we represent words as vectors of real numbers?
To learn the differences between words, we have to quantify the difference in some manner.
Imagine, if we assign theses "smart" numbers to the words:
>>> semnum = semantic_numbers = {'car': 5, 'vehicle': 2, 'apple': 232, 'orange': 300, 'fruit': 211, 'samsung': 1080, 'iphone': 1200}
>>> abs(semnum['fruit'] - semnum['apple'])
21
>>> abs(semnum['samsung'] - semnum['apple'])
848
We see that the distance between fruit and apple is close but samsung and apple isn't. In this case, the single numerical "feature" of the word is capable of capturing some information about the word meanings but not fully.
Imagine the we have two real number values for each word (i.e. vector):
>>> import numpy as np
>>> semnum = semantic_numbers = {'car': [5, -20], 'vehicle': [2, -18], 'apple': [232, 1010], 'orange': [300, 250], 'fruit': [211, 250], 'samsung': [1080, 1002], 'iphone': [1200, 1100]}
To compute the difference, we could have done:
>>> np.array(semnum['apple']) - np.array(semnum['orange'])
array([-68, 761])
>>> np.array(semnum['apple']) - np.array(semnum['samsung'])
array([-848, 8])
That's not very informative, it returns a vector and we can't get a definitive measure of distance between the words, so we can try some vectorial tricks and compute the distance between the vectors, e.g. euclidean distance:
>>> import numpy as np
>>> orange = np.array(semnum['orange'])
>>> apple = np.array(semnum['apple'])
>>> samsung = np.array(semnum['samsung'])
>>> np.linalg.norm(apple-orange)
763.03604108849277
>>> np.linalg.norm(apple-samsung)
848.03773500947466
>>> np.linalg.norm(orange-samsung)
1083.4685043876448
Now, we can see more "information" that apple can be closer to samsung than orange to samsung. Possibly that's because apple co-occurs in the corpus more frequently with samsung than orange.
The big question comes, "How do we get these real numbers to represent the vector of the words?". That's where the Word2Vec / embedding training algorithms (originally conceived by Bengio 2003) comes in.
Taking a detour
Since adding more real numbers to the vector representing the words is more informative then why don't we just add a lot more dimensions (i.e. numbers of columns in each word vector)?
Traditionally, we compute the differences between words by computing the word-by-word matrices in the field of distributional semantics/distributed lexical semantics, but the matrices become really sparse with many zero values if the words don't co-occur with another.
Thus a lot of effort has been put into dimensionality reduction after computing the word co-occurrence matrix. IMHO, it's like a top-down view of how global relations between words are and then compressing the matrix to get a smaller vector to represent each word.
So the "deep learning" word embedding creation comes from the another school of thought and starts with a randomly (sometimes not-so random) initialized a layer of vectors for each word and learning the parameters/weights for these vectors and optimizing these parameters/weights by minimizing some loss function based on some defined properties.
It sounds a little vague but concretely, if we look at the Word2Vec learning technique, it'll be clearer, see
https://rare-technologies.com/making-sense-of-word2vec/
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
https://arxiv.org/pdf/1402.3722.pdf (more mathematical)
Here's more resources to read-up on word embeddings: https://github.com/keon/awesome-nlp#word-vectors

the process does nothing that applies vector arithmetic
The training process has nothing to do with vector arithmetic, but when the arrays are produced, it turns out they have pretty nice properties, so that one can think of "word linear space".
For example, what words have embeddings closest to a given word in this space?
Put it differently, words with similar meaning form a cloud. Here's a 2-D t-SNE representation:
Another example, the distance between "man" and "woman" is very close to the distance between "uncle" and "aunt":
As a result, you have pretty much reasonable arithmetic:
W("woman") − W("man") ≃ W("aunt") − W("uncle")
W("woman") − W("man") ≃ W("queen") − W("king")
So it's not far fetched to call them vectors. All pictures are from this wonderful post that I very much recommend to read.

Each word is mapped to a point in d-dimension space (d is usually 300 or 600 though not necessary), thus its called a vector (each point in d-dim space is nothing but a vector in that d-dim space).
The points have some nice properties (words with similar meanings tend to occur closer to each other) [proximity is measured using cosine distance between 2 word vectors]

Famous Word2Vec implementation is CBOW + Skip-Gram
Your input for CBOW is your input word vector (each is a vector of length N; N = size of vocabulary). All these input word vectors together are an array of size M x N; M=length of words).
Now what is interesting in the graphic below is the projection step, where we force an NN to learn a lower dimensional representation of our input space to predict the output correctly. The desired output is our original input.
This lower dimensional representation P consists of abstract features describing words e.g. location, adjective, etc. (in reality these learned features are not really clear). Now these features represent one view on these words.
And like with all features, we can see them as high-dimensional vectors.
If you want you can use dimensionality reduction techniques to display them in 2 or 3 dimensional space.
More details and source of graphic: https://arxiv.org/pdf/1301.3781.pdf

Naive Bayes, not so Naive?

I have a Naive Bayes classifier (implemented with WEKA) that looks for uppercase letters.
contains_A
contains_B
...
contains_Z
For a certain class the word LCD appears in almost every instance of the training data. When I get the probability for "LCD" to belong to that class it is something like 0.988. win.
When I get the probability for "L" I get a plain 0 and for "LC" I get 0.002. Since features are naive, shouldn't the L, C and D contribute to overall probability independently, and as a result "L" have some probability, "LC" some more and "LCD" even more?
At the same time, the same experiment with an MLP, instead of having the above behavior it gives percentages of 0.006, 0.5 and 0.8
So the MLP does what I would expect a Naive Bayes to do, and vise versa. Am I missing something, can anyone explain these results?

I am not familiar with the internals of WEKA - so please correct me if you think that I am not righth.
When using a text as a "feature" than this text is transformed to a vector of binary values. Each value correponds to one concrete word. The length of the vector is equal to the size of the dictionary.
if your dictionary contains 4 worlds: LCD, VHS, HELLO, WORLD
then for example a text HELLO LCD will be transformed to [1,0,1,0].
I do not know how WEKA builds it's dictionary, but I think it might go over all the words present in the examples. Unless the "L" is present in the dictionary (and therefor is present in the examples) than it's probability is logicaly 0. Actually it should not even be considered as a feature.
Actually you can not reason over the probabilities of the features - and you cannot add them together, I think there is no such a relationship between the features.

Beware that in text mining, words (letters in your case) may be given weights different than their actual counts if you are using any sort of term weighting and normalization, e.g. tf.idf. In the case of tf.idf for example, characters counts are converted into a logarithmic scale, also characters that appear in every single instance may be penalized using idf normalization.
I am not sure what options you are using to convert your data into Weka features, but you can see here that Weka has parameters to be set for such weighting and normalization options
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html
-T
Transform the word frequencies into log(1+fij)
where fij is the frequency of word i in jth document(instance).
-I
Transform each word frequency into:
fij*log(num of Documents/num of documents containing word i)
where fij if frequency of word i in jth document(instance)

I checked the weka documentation and I didn't see support for extracting letters as features. This implies the weka function may need a space or punctuation to delimit each feature from those adjacent. If so, then the search for "L", "C" and "D" would be interpreted as three separate one-letter-words and would explain why they were not found.
If you think this is it, you could try splitting the text into single characters delimited by \n or space, prior to ingestion.

Document classification using naive bayse

I have question regarding the particular Naive Bayse algorithm that is used in document classification. Following is what I understand:
construct some probability of each word in the training set for each known classification
given a document we strip all the words that it contains
multiply together the probabilities of the words being present in a classification
perform (3) for each classification
compare the result of (4) and choose the classification with the highest posterior
What I am confused about is the part when we calculate the probability of each word given training set. For example for a word "banana", it appears in 100 documents in classification A, and there are totally 200 documents in A, and in total 1000 words appears in A. To get the probability of "banana" appearing under classification A do I use 100/200=0.5 or 100/1000=0.1?

I believe your model will more accurately classify if you count the number of documents the word appears in, not the number of times the word appears in total. In other words
Classify "Mentions Fruit":
"I like Bananas."
should be weighed no more or less than
"Bananas! Bananas! Bananas! I like them."
So the answer to your question would be 100/200 = 0.5.
The description of Document Classification on Wikipedia also supports my conclusion
Then the probability that a given document D contains all of the words W, given a class C, is
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
In other words, the document classification algorithm Wikipedia describes tests how many of the list of classifying words a given document contains.
By the way, more advanced classification algorithms will examine sequences of N-words, not just each word individually, where N can be set based on the amount of CPU resources you are willing to dedicate to the calculation.
UPDATE
My direct experience is based on short documents. I would like to highlight research that #BenAllison points out in the comments that suggests my answer is invalid for longer documents. Specifically
One weakness is that by considering only the presence or absence of terms, the BIM ignores information inherent in the frequency of terms. For instance, all things being equal, we would expect that if 1 occurrence of a word is a good clue that a document belongs in a class, then 5 occurrences should be even more predictive.
A related problem concerns document length. As a document gets longer, the number of distinct words used, and thus the number of values of x(j) that equal 1 in the BIM, will in general increase.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart