TensorFlow has a tutorial on sequence-to-sequence models that shows how to translate English to French.
Is it possible to use real numbers instead of words in these models?
Suppose the vocabulary contains the following words:
[hello:0, a:1, the:2, what:3, is:4, problem:5, there:6, it:7, this:8, listen:9]
Since a word appears only once in the vocabulary, we can use its index as a reference.
So a sentence like
listen there is a problem
will be converted into an array of indices like
[9, 6, 4, 1, 5]
This is how a single input sentence can be represented using such an index-based encoding. The vocabulary with its corresponding indices is fed into the network along with the list of indexed sentences, i.e. the indexed corpus.
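A minimal sketch of this indexing step, using the example vocabulary above:

    # Vocabulary from the example above, mapping each word to its index.
    word2index = {"hello": 0, "a": 1, "the": 2, "what": 3, "is": 4,
                  "problem": 5, "there": 6, "it": 7, "this": 8, "listen": 9}

    sentence = "listen there is a problem"
    indexed = [word2index[w] for w in sentence.split()]
    print(indexed)  # [9, 6, 4, 1, 5]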
But this alone is not going to cut it, since your network also needs to learn the common relationships among words; it requires more semantic information. This can be achieved using word2vec. You can take a pre-trained word2vec model, get vectors for all the words in your vocabulary, associate them with these indices, and use this as your embedding lookup instead of the plain word2index dictionary.
Now you can map each word index to its corresponding word2vec vector and store the result in an embedding matrix, where the row index is the word index and each row is a fixed-length word vector for that word. You can then feed this information to your network, for example in the form of a feed_dict.
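Continuing the snippet above, here is a rough sketch of building such an embedding matrix (the random vectors are only placeholders; in practice they would come from a trained word2vec model):

    import numpy as np

    embedding_dim = 300  # dimensionality of the pre-trained word2vec vectors

    # Placeholder lookup standing in for pre-trained word2vec vectors.
    w2v = {word: np.random.randn(embedding_dim).astype(np.float32)
           for word in word2index}

    # Row index of the embedding matrix == word index in word2index.
    embedding_matrix = np.zeros((len(word2index), embedding_dim), dtype=np.float32)
    for word, idx in word2index.items():
        embedding_matrix[idx] = w2v[word]

    # Looking up an indexed sentence is then just row selection; in TensorFlow
    # the same operation is tf.nn.embedding_lookup(embedding_matrix, indexed).
    sentence_vectors = embedding_matrix[indexed]  # (sentence_length, embedding_dim)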
Alternatively, you can also consider word senses or richer features (a combination of word, POS tag, and NER tag), create a vocabulary over those, and then build a similar word2index mapping and embedding matrix.
I am trying to understand the concept of embedding for the deep learning models.
I understand how employing word2vec can address the limitations of using one-hot vectors.
However, recently I have seen a plethora of blog posts about ELMo, BERT, etc. that talk about contextual embeddings.
How are word embeddings different from contextual embeddings?
Both embedding techniques, traditional word embeddings (e.g. word2vec, GloVe) and contextual embeddings (e.g. ELMo, BERT), aim to learn a continuous (vector) representation for each word in the documents. These continuous representations can then be used in downstream machine learning tasks.
Traditional word embedding techniques learn a global word embedding. They first build a global vocabulary from the unique words in the documents, ignoring the meaning of words in different contexts. Then, similar representations are learnt for words that frequently appear close to each other in the documents. The problem is that such word representations ignore the words' contextual meaning (the meaning derived from the words' surroundings). For example, only one representation is learnt for "left" in the sentence "I left my phone on the left side of the table." However, "left" has two different meanings in that sentence and needs two different representations in the embedding space.
On the other hand, contextual embedding methods learn sequence-level semantics by considering the sequence of all words in the document. Thus, such techniques learn different representations for polysemous words, e.g. "left" in the example above, based on their context.
Word embeddings and contextual embeddings are slightly different.
While both word embeddings and contextual embeddings are obtained from models trained with unsupervised learning, there are some differences.
Word embeddings provided by word2vec or fastText come with a vocabulary (dictionary) of words. The elements of this vocabulary (or dictionary) are words and their corresponding word embeddings. Hence, given a word, its embedding is always the same in whichever sentence it occurs. Here, the pre-trained word embeddings are static.
Contextual embeddings, however, are generally obtained from transformer-based models. The embeddings are obtained by passing the entire sentence to the pre-trained model. Note that there is still a vocabulary of words, but the vocabulary does not contain the contextual embeddings. The embedding generated for each word depends on the other words in the given sentence. (The other words in a given sentence are referred to as the context. Transformer-based models work on an attention mechanism, and attention is a way to look at the relation between a word and its neighbours.) Thus, a given word does not have a static embedding; its embeddings are dynamically generated by the pre-trained (or fine-tuned) model.
For example, consider the two sentences:
I will show you a valid point of reference and talk to the point.
Where have you placed the point?
Now, with word embeddings from a pre-trained model such as word2vec, the embedding for the word 'point' is the same for both of its occurrences in example 1 and also the same for the word 'point' in example 2 (all three occurrences have the same embedding).
In contrast, with the embeddings from BERT, ELMo, or any such contextual model, the two occurrences of the word 'point' in example 1 will have different embeddings. Also, the word 'point' occurring in example 2 will have a different embedding than the ones in example 1.
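As a rough illustration of this difference, here is a minimal sketch using the Hugging Face transformers library (assumed to be installed) with bert-base-uncased; the model choice and code are illustrative, not part of the original answer:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    sentence = "I will show you a valid point of reference and talk to the point."
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_size)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    point_vectors = [hidden[i] for i, tok in enumerate(tokens) if tok == "point"]

    # The two vectors differ because each occurrence of "point" is encoded
    # together with its own context, unlike a static word2vec embedding.
    print(torch.allclose(point_vectors[0], point_vectors[1]))  # False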
I am working on a named entity recognition task. The traditional method is to first concatenate word embeddings and character-level embeddings to create a word representation. I also want to use affix embeddings to better capture the relation between the tags and the words.
For example, the words "Afghanistan" and "Tajikistan" are clear examples of Location. Here the suffix "istan" or "tan" will be useful for identifying future "location" tags. So I want to extract the suffixes and prefixes of all the words, create embeddings for them, and then concatenate them with the initial word representation. How can I achieve this?
You can do this:
Find a suffix vocabulary (for example, via a Google search).
Write a simple max-backward segmentation script to generate each word's suffix, and add it as another field in your training and testing data, just like the words and characters.
Concatenate the suffix embeddings with the word and character embeddings (see the sketch below).
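A minimal PyTorch sketch of the concatenation step; the vocabulary sizes, embedding dimensions, fixed-length suffix function, and placeholder character representation are all illustrative assumptions:

    import torch
    import torch.nn as nn

    word_vocab_size, suffix_vocab_size = 10000, 500
    word_dim, char_dim, suffix_dim = 100, 30, 20

    word_emb = nn.Embedding(word_vocab_size, word_dim)
    suffix_emb = nn.Embedding(suffix_vocab_size, suffix_dim)

    def suffix_of(word, n=3):
        # Crude fixed-length suffix; a max-backward segmenter could replace this.
        return word[-n:]

    # Per token: a word index, a suffix index, and a character-level
    # representation (here a dummy vector standing in for a char-CNN/BiLSTM).
    word_idx = torch.tensor([42])          # index of the word in the word vocab
    suffix_idx = torch.tensor([7])         # index of suffix_of(word) in the suffix vocab
    char_repr = torch.zeros(1, char_dim)   # placeholder character-level embedding

    token_repr = torch.cat(
        [word_emb(word_idx), char_repr, suffix_emb(suffix_idx)], dim=-1
    )  # shape: (1, word_dim + char_dim + suffix_dim)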
I am trying to embed texts using pre-trained fastText models. Some of the texts are empty. How would one replace them to make embedding possible? I was thinking about replacing them with a dummy word, like this (docs being a pandas DataFrame object):
docs = docs.replace(np.nan, 'unknown', regex=True)
However, this doesn't really make sense, as the choice of this word is arbitrary and it is not equivalent to having an empty string.
Otherwise, I could associate the zero vector or the average vector with empty strings, but I am not convinced either would make sense, as the embedding operation is non-linear.
In FastText, the sentence embedding is basically an average of the word vectors, as is shown in one of the FastText papers:
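(The equation from the fastText classification paper, Joulin et al., is presumably the one meant here; the text representation is roughly the average of the word and n-gram embeddings:)

$$h = \frac{1}{N} \sum_{i=1}^{N} x_{w_i}$$

where $x_{w_i}$ is the embedding of the $i$-th token and $N$ is the number of tokens in the sentence.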
Given this fact, zeros might be a logical choice. But the answer depends on what you want to do with the embeddings.
If you use them as input for a classifier, it should be fine to select an arbitrary vector as a representation of the empty string, and the classifier will learn what that means. FastText also learns a special embedding for </s>, i.e., the end of a sentence. This is another natural candidate for an embedding of the empty string, especially if you do similarity search.
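A sketch of both options using the official fasttext Python bindings (the model file name is only an assumption; point it at whichever pre-trained .bin model you use):

    import numpy as np
    import fasttext

    model = fasttext.load_model("cc.en.300.bin")

    def embed(text: str) -> np.ndarray:
        if text and text.strip():
            return model.get_sentence_vector(text.strip())
        # Option 1: a zero vector for empty strings.
        return np.zeros(model.get_dimension(), dtype=np.float32)
        # Option 2 (alternative): use the end-of-sentence token instead:
        # return model.get_word_vector("</s>")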
In the Word2Vec skip-gram setup that follows, what is the data setup for the output layer? Is it a matrix that is zero everywhere but with a single "1" in each of the C rows, representing the C words in the context?
Addendum, to describe the data setup question:
Meaning, what would the dataset presented to the NN look like? Let's consider this to be "what does a single training example look like?". I assume the total input is a matrix where each row is a word in the vocabulary (and there is a column for each word as well, with each cell zero except the one for that specific word, i.e. one-hot encoded). Thus, a single training example is 1×V as shown below (all zeros except for the specific word, whose value is 1). This aligns with the picture above in that the input is V-dimensional. I expected the total input matrix to have duplicated rows, however, with the same one-hot encoded vector repeated for each time the word was found in the corpus (as the output or target variable would be different).
The output (target) is more confusing to me. I expected it to exactly mirror the input: a single training example has a "multi"-hot encoded vector that is zero except for a "1" in C of the cells, denoting that a particular word was in the context of the input word (C = 5 if we are looking, for example, 2 words behind and 3 words ahead of the given input word instance). The picture doesn't seem to agree with this, though. I don't understand what appear to be C different output layers that share the same W' weight matrix.
The skip-gram architecture has word embeddings as its output (and its input). Depending on its precise implementation, the network may therefore produce two embeddings per word (one embedding for the word as an input word, and one embedding for the word as an output word; this is the case in the basic skip-gram architecture with the traditional softmax function), or one embedding per word (this is the case in a setup with the hierarchical softmax as an approximation to the full softmax, for example).
You can find more information about these architectures in the original word2vec papers, such as Distributed Representations of Words and Phrases and their Compositionality by Mikolov et al.
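Regarding the data setup itself, a common way to implement skip-gram (a sketch of the usual setup, not necessarily how every implementation does it) is to turn each (center word, single context word) pair into its own training example; the C "output layers" in the diagram are then the same softmax layer, sharing W', applied once per context word:

    corpus = ["listen", "there", "is", "a", "problem"]
    window = 2

    pairs = []
    for i, center in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                pairs.append((center, corpus[j]))  # (input word, target context word)

    # Each pair becomes (one-hot of center) -> (one-hot of single context word).
    print(pairs[:3])  # [('listen', 'there'), ('listen', 'is'), ('there', 'listen')]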
Let us consider the problem of text classification. So if the document is represented as a bag of words, then we will have an n-dimensional feature, where n = number of words in the document. Now if I decide that I also want to use the document length as a feature, then the dimension of this feature alone (length) will be one. So how do I combine the two features (length and bag of words)? Should I now consider the feature as two parts (the n-dimensional BOW vector and the 1-dimensional length feature)? If this won't work, how do I combine the features? Any pointers on this would also be helpful.
This statement is a little ambiguous: "So if the document is represented as a bag of words, then we will have an n-dimensional feature, where n = number of words in the document."
My interpretation is that you have a column for each word that occurs in your corpus (probably restricted to some dictionary of interest), and for each document you have counted the number of occurrences of that word. Your number of columns is now equal to the number of words in your dictionary that appear in ANY of the documents. You also have a "length" feature, which could be a count of the number of words in the document, and you want to know how to incorporate it into your analysis.
A simple approach would be to divide the number of occurrences of a word by the total number of words in the document.
This has the effect of scaling the word occurrences based on the size of the document, and the new feature is called a 'term frequency'. The next natural step is to weight the term frequencies to compensate for terms that are more common in the corpus (and therefore less important). Since we give HIGHER weights to terms that are LESS common, this is called 'inverse document frequency', and the whole process is called “Term Frequency times Inverse Document Frequency”, or tf-idf. You can Google this for more information.
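A minimal sketch of combining the two kinds of features using scikit-learn (assumed available here; it is not mentioned in the original answer): compute tf-idf features and horizontally stack a document-length column onto them.

    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "a very long document about cats and mats"]

    vectorizer = TfidfVectorizer()
    X_tfidf = vectorizer.fit_transform(docs)              # (n_docs, n_terms)
    lengths = np.array([[len(d.split())] for d in docs])  # (n_docs, 1)

    X = hstack([X_tfidf, csr_matrix(lengths)])  # combined matrix: (n_docs, n_terms + 1)

In practice you may also want to scale the length column (e.g. standardize it) so it is on a comparable scale with the tf-idf values.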
It's possible that you are doing word counts in a different way -- for example, counting the number of word occurrences in each paragraph (as opposed to each document). In that case, for each document, you have a word count for each paragraph, and the typical approach is to merge these paragraph-counts using a process such as Singular Value Decomposition.