Character-Word Embeddings from lm_1b in Keras - machine-learning

I would like to use some pre-trained word embeddings in a Keras NN model, which have been published by Google in a very well known article. They have provided the code to train a new model, as well as the embeddings here.
However, it is not clear from the documentation how to retrieve an embedding vector from a given string of characters (word) from a simple python function call. Much of the documentation seems to center on dumping vectors to a file for an entire sentence presumably for sentimental analysis.
So far, I have seen that you can feed in pretrained embeddings with the following syntax:
embedding_layer = Embedding(number_of_words??,
out_dim=128??,
weights=[pre_trained_matrix_here],
input_length=60??,
trainable=False)
However, converting the different files and their structures to pre_trained_matrix_here is not quite clear to me.
They have several softmax outputs, so I am uncertain which one would belong - and furthermore how to align the words in my input to the dictionary of words for which they have.
Is there a simple manner to use these word/char embeddings in keras and/or to construct the character/word embedding portion of the model in keras such that further layers may be added for other NLP tasks?

The Embedding layer only picks up embeddings (columns of the weight matrix) for integer indices of input words, it does not know anything about the strings. This means you need to first convert your input sequence of words to a sequence of indices using the same vocabulary as was used in the model you take the embeddings from.

For NLP applications that are related to word or text encoding I would use CountVectorizer or TfidfVectorizer. Both are announced and described in a brief way for Python in the following reference: http://www.bogotobogo.com/python/scikit-learn/files/Python_Machine_Learning_Sebastian_Raschka.pdf
CounterVectorizer can be used for simple application as a SPAM-HAM detector, while TfidfVectorizer gives a deeper insight of how relevant are each term (word) in terms of their frequency in the document and the number of documents in which appears this result in an interesting metric of how discriminant are the terms considered. This text feature extractors may consider a stop-word removal and lemmatization to boost features representations.

Related

After training word embedding with gensim's fasttext's wrapper, how to embed new sentences?

After reading the tutorial at gensim's docs, I do not understand what is the correct way of generating new embeddings from a trained model. So far I have trained gensim's fast text embeddings like this:
from gensim.models.fasttext import FastText as FT_gensim
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(corpus_file=corpus_file)
# train the model
model_gensim.train(
corpus_file=corpus_file, epochs=model_gensim.epochs,
total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
)
Then, let's say I want to get the embeddings vectors associated with this sentences:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
How can I get them with model_gensim that I trained previously?
You can look up each word's vector in turn:
wordvecs_obama = [model_gensim[word] for word in sentence_obama]
For your 7-word input sentence, you'll then have a list of 7 word-vectors in wordvecs_obama.
All FastText models do not, as a matter of their inherent functionality, convert longer texts into single vectors. (And specifically, the model you've trained doesn't have a default way of doing that.)
There is a "classification mode" in the original Facebook FastText code that involves a different style of training, where texts are associated with known labels at training time, and all the word-vectors of the sentence are combined during training, and when the model is later asked to classify new texts. But, the gensim implementation of FastText does not currently support this mode, as gensim's goal has been to supply unsupervised rather than supervised algorithms.
You could approximate what that FastText mode does by averaging together those word-vectors:
import numpy as np
meanvec_obama = np.array(wordvecs_obama).mean(axis=0)
Depending on your ultimate purposes, something like that might still be useful. (But, that average wouldn't be as useful for classification as if the word-vectors had originally ben trained for that goal, with known labels, in that FastText mode.)

Fuzzy matching sentences to stanzas

I have lyrics from srt subtitle files. If I want to match them to stanzas from another lyrics website, what is the best approach to this?
My approach has been taking tf-idf vector each lyric line and trying to fuzzy match to the staza, using starting and end time of the lyric line as a clue to whether the line might belong to the previous stanzas, next stanzas, or belong to a stanzas of it's own.
I've also tried dynamic programming, but with less success. Due to the high variance in the structure of the lyrics and the stanza, sometimes the results come out completely shifted or messed up, especially if there are repeated chorus.
If there is a Recurrent Neural Networks or other Machine Learning algorithm, is there an existing approach to such problem?
I suggest using Doc2Vec or Word2Vec methods, basically you train a NN with some corpus, the NN will generate a vector for each word/document, those vectors will have similarity based on the similarty of words in the real world (corpus)
so vector of love and care will be very similar, those vectors hold some other cool properties like if yo subtract or add them you can get a word that posses some meaning of the substation or addition induce
once you will get the vector of words or docs you can check similarity between vectors with various methods, commonly used is the cosine similarity
this method mixed with tf-idf to generate weights for best results
usage is very easy, for example
from gensim.models import Word2Vec
from nltk.corpus import brown
b = Word2Vec(brown.sents())
print b.most_similar('money', topn=5)
output
[(u'care', 0.9145717024803162), (u'chance', 0.9034961462020874), (u'job', 0.8980690240859985), (u'trouble', 0.8751360774040222), (u'everything', 0.873866856098175)]
I suggest to take a look at gensim

How does Fine-tuning Word Embeddings work?

I've been reading some NLP with Deep Learning papers and found Fine-tuning seems to be a simple but yet confusing concept. There's been the same question asked here but still not quite clear.
Fine-tuning pre-trained word embeddings to task-specific word embeddings as mentioned in papers like Y. Kim, “Convolutional Neural Networks for Sentence Classification,” and K. S. Tai, R. Socher, and C. D. Manning, “Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks,” had only a brief mention without getting into any details.
My question is:
Word Embeddings generated using word2vec or Glove as pretrained word vectors are used as input features (X) for downstream tasks like parsing or sentiment analysis, meaning those input vectors are plugged into a new neural network model for some specific task, while training this new model, somehow we can get updated task-specific word embeddings.
But as far as I know, during the training, what back-propagation does is updating the weights (W) of the model, it does not change the input features (X), so how exactly does the original word embeddings get fine-tuned? and where do these fine-tuned vectors come from?
Yes, if you feed the embedding vector as your input, you can't fine-tune the embeddings (at least easily). However, all the frameworks provide some sort of an EmbeddingLayer that takes as input an integer that is the class ordinal of the word/character/other input token, and performs a embedding lookup. Such an embedding layer is very similar to a fully connected layer that is fed a one-hot encoded class, but is way more efficient, as it only needs to fetch/change one row from the matrix on both front and back passes. More importantly, it allows the weights of the embedding to be learned.
So the classic way would be to feed the actual classes to the network instead of embeddings, and prepend the entire network with a embedding layer, that is initialized with word2vec / glove, and which continues learning the weights. It might also be reasonable to freeze them for several iterations at the beginning until the rest of the network starts doing something reasonable with them before you start fine tuning them.
One hot encoding is the base for constructing initial layer for embeddings. Once you train the network one hot encoding essentially serves as a table lookup. In fine-tuning step you can select data for specific works and mention variables that need to be fine tune when you define the optimizer using something like this
embedding_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="embedding_variables/kernel")
ft_optimizer = tf.train.AdamOptimizer(learning_rate=0.001,name='FineTune')
ft_op = ft_optimizer.minimize(mean_loss,var_list=embedding_variables)
where "embedding_variables/kernel" is the name of the next layer after one-hot encoding.

Addressing synonyms in Supervised Learning for Text Classification

I am using scikit-learn supervised learning method for text classification. I have a training dataset with input text fields and the categories they belong to. I use tf-idf, SVM classifier pipeline for creating the model. The solution works well for normal testcases. But if a new text is entered which has synoynmous words as in the training set, the solution fails to classify correctly.
For e.g: the word 'run' might be there in the training data but if I use the word 'sprint' to test, the solution fails to classify correctly.
What is the best approach here? Adding all synonyms for all words in training dataset doesn't look like a scalable approach to me
You should look into word vectors and dense document embeddings. Right now you are passing scikit-learn a matrix X, where each row is a numerical representation of a document in your dataset. You are getting this representation with tf-idf but as you noticed this doesn't capture word similarities and you are also having issues with out of vocabulary words.
A possible improvement is to represent each word with a dense vector of lets say dimension 300, in such a way that words with similar meaning are close in this 300 dimensional space. Fortunately you don't need to build these vectors from scratch (look up gensim word2vec and spacy). Another good thing is that by using word embeddings pre-trained on very large corpus like Wikipedia you are incorporating a lot of linguistic information about the world into your algorithm that you couldn't infer from your corpus otherwise (like the fact that sprint and run are synonyms).
Once you get good and semantic numeric representation for words you need to get a vector representation for each document. The simplest way would be to average the word vectors of each word in the sentence.
Example pseudocode to get you started:
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc1 = nlp('I had a good run')
>>> doc1.vector
array([ 6.17495403e-02, 2.07064897e-02, -1.56451517e-03,
1.02607915e-02, -1.30429687e-02, 1.60102192e-02, ...
Now lets try a different document:
>>> doc2 = nlp('I had a great sprint')
>>> doc2.vector
array([ 0.02453461, -0.00261007, 0.01455955, -0.01595449, -0.01795897,
-0.02184369, -0.01654281, 0.01735667, 0.00054854, ...
>>> doc2.similarity(doc1)
0.8820845113100807
Note how the vectors are similar (in the sense of cosine similarity) even when the words are different. Because the vectors are similar, a scikit-learn classifier will learn to assign them to the same category. With a tf-idf representation this would not be the case.
This is how you can use these vectors in scikit-learn:
X = [nlp(text).vector for text in corpus]
clf.fit(X, y)

Embeddings with recurrent neural networks

I am working on a research project on text data (it's about search engine queries supervised classification). I have already implemented different methods and I have also used different models for the text (such as binary vectors of the dimention of my vocabulary - 1 if the i-th word appears in the text, 0 otherwise - or words embedding with the model word2vec).
My advisor told me that maybe we could find another representation of the queries using Recurrent Neural Network. This representation should keep into account the sequentiality of the words in the text thanks to the recurrence relation. I have read some documentation about RNN but I haven't find anything useful for this goal. I have read lot of things about language modelling (which predict probabilities of the words), but I don't understand how I could adapt this model in order to obtain something like an embedded vector.
Thank you very much!
Usually, if one wants to obtain embeddings from a query or a sentence exploiting RNN, the logits are used. The logits are simply the output values of the network after the forward pass of the full sentence/query.
The logit values produce a vector that has the dimensions of the output layer (i.e. number of the target classes): usually, it is the vocabulary, since they are extracted from a language model.
For hints have a look at these:
http://arxiv.org/abs/1603.07012
How does word2vec give one hot word vector from the embedding vector?
Note that in principle one could use also use bidirectional networks or networks trained on other tasks, obtaining smaller embeddings, even if this last option is kind of fancy and it has not been explored up to my knowledge.

Resources