Does word2vec make sense for supervised learning? - machine-learning

I have a list of sentence/label pairs to train the model on. How should I encode the sentences as input to, say, an SVM?

Are the sentences in the same language? If they are in English, you could start with the pretrained word2vec file that you can download from Google. Pay attention to how the training file was created: whether stemming was applied, and so on. It also matters which corpus it was generated from; you'd get different results if it comes from newsgroups than if it was extracted from the web or from more formal text.
Word2vec basically encodes every word into a higher-dimensional vector space, usually 200, 300, or 500 dimensions. After it is trained, the "test" sentences are basically treated as bags of words and need not be in any order.
You'd then, for each word in the bag of words, look up the corresponding word2vec vector. You can then create features by averaging the vectors and taking the element-wise minimum and maximum, and, if you're comparing texts, by calculating the cosine similarity between vectors. Then use those features in an SVM.
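To make that concrete, here is a minimal sketch of those features (mean plus element-wise min and max of the word vectors) feeding an SVM. The gensim model path and the sentences/labels variables are illustrative placeholders, not something from the answer above.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.svm import SVC

# Illustrative path; substitute your own pretrained word2vec binary.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def sentence_features(sentence):
    # Look up the vector of every in-vocabulary word (bag of words, order ignored).
    vecs = np.array([w2v[w] for w in sentence.lower().split() if w in w2v])
    if len(vecs) == 0:
        return np.zeros(3 * w2v.vector_size)
    # Average plus element-wise min and max, concatenated into one feature vector.
    return np.concatenate([vecs.mean(axis=0), vecs.min(axis=0), vecs.max(axis=0)])

sentences = ["the movie was great", "the plot was dull"]   # placeholder data
labels = [1, 0]
X = np.vstack([sentence_features(s) for s in sentences])
clf = SVC(kernel="rbf").fit(X, labels)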

Related

Understanding embedding vectors dimension

In deep learning, particularly in NLP, words are transformed into a vector representation to be fed into a neural network such as an RNN. Referring to the link:
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/#Word%20Embeddings
In the Word Embeddings section, it is said that:
A word embedding W: words → R^n is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions).
I do not understand the purpose of the dimension of the vectors. What does it mean to have a vector of 200 dimensions compared to a vector of 20 dimensions?
Does it improve the overall accuracy of the model? Could anyone give me a simple example regarding the choice of the dimension of the vectors?
These word embeddings, also called distributed word embeddings, are based on
you shall know a word by the company it keeps
as John Rupert Firth put it.
So we learn the meaning of a word from its context. You can think of each scalar in a word's vector as representing its strength for some concept. This slide from Prof. Pawan Goyal explains it well.
So you want a vector size large enough to capture a decent number of concepts, but you do not want a vector so huge that it becomes a bottleneck when training the models in which these embeddings are used.
Also, the vector size is mostly fixed, since most people do not train their own embeddings but rather use openly available ones, which have been trained for many hours on huge datasets. Using them forces you to use an embedding layer whose dimension is given by the openly available embedding you are using (word2vec, GloVe, etc.).
Distributed word embeddings are a major milestone in deep learning for NLP. They give better accuracy compared with tf-idf based representations.
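As a small illustration of the fixed-dimension point (a sketch only; the model name below is just one of gensim's downloadable examples, not something mentioned above), the dimension is determined by whichever pretrained file you load:
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads a 100-d GloVe model on first use
print(vectors.vector_size)     # 100 -- fixed by the pretrained file
print(vectors["king"].shape)   # (100,)
Any downstream embedding layer then has to use that dimension.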

Data augmentation for text classification

What is the current state-of-the-art data augmentation technique for text classification?
I did some research online about how I can extend my training set through data transformations, the same way we do for image classification.
I found some interesting ideas such as:
Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random place in the sentence. Do this n times.
Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
Random Deletion: Randomly remove each word in the sentence with probability p.
But I found nothing about using pre-trained word vector representation models such as word2vec. Is there a reason?
Data augmentation using word2vec might help the model get more data based on external information. For instance, randomly replacing a token of a toxic comment with its closest token in a pre-trained vector space trained specifically on external online comments.
Is this a good method, or am I missing some important drawbacks of this technique?
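To make the replacement idea above concrete, here is a minimal sketch using gensim's nearest-neighbour lookup (the model name and the probability p are illustrative assumptions, not part of the question):
import random
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # any pretrained embedding would do

def augment(sentence, p=0.2):
    # Replace each in-vocabulary token, with probability p, by its nearest neighbour.
    out = []
    for w in sentence.split():
        if w in vectors and random.random() < p:
            out.append(vectors.most_similar(w, topn=1)[0][0])
        else:
            out.append(w)
    return " ".join(out)

print(augment("the service was awful and slow"))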
Your idea of using word2vec embeddings usually helps. However, word2vec is a context-free embedding. To go one step further, the state of the art (SOTA) as of today (2019-02) is to use a language model trained on a large corpus of text and fine-tune your own classifier with your own training data.
The two SOTA models are:
GPT-2 https://github.com/openai/gpt-2
BERT https://github.com/google-research/bert
The data augmentation methods you mentioned might also help (depending on your domain and the number of training examples you have). Some of them are actually used in language model training (for example, BERT has a pre-training task that randomly masks out words in a sentence). If I were you, I would first adopt a pre-trained model and fine-tune a classifier on your current training data. Taking that as a baseline, you could then try each of the data augmentation methods you like and see whether they really help.
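As a rough sketch of that "pre-train then fine-tune" recipe, using the Hugging Face transformers wrapper rather than the repositories linked above (my assumption, not the answer's) and with placeholder data, it might look like this:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

texts = ["great product", "terrible service"]   # placeholder labelled data
labels = torch.tensor([1, 0])

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                        # a few passes over the tiny example set
    out = model(**enc, labels=labels)     # the classification loss is computed internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()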

Fine-tuning Glove Embeddings

Has anyone tried to fine-tune Glove embeddings on a domain-specific corpus?
Fine-tuning word2vec embeddings has proven very effective for me in various NLP tasks, but I am wondering whether generating a co-occurrence matrix on my domain-specific corpus and training GloVe embeddings (initialized with pre-trained embeddings) on that corpus would give similar improvements.
I am trying to do the exact same thing. You can try mittens.
They have successfully built a framework for it. Christopher D. Manning (co-author of GloVe) is associated with it.
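A minimal usage sketch, following the mittens README as I understand it (the co-occurrence matrix, vocabulary, and pretrained vectors below are toy placeholders you would build from your own corpus and GloVe file):
import numpy as np
from mittens import Mittens

vocab = ["ice", "steam", "water"]                      # toy vocabulary
cooccurrence = np.array([[0., 2., 5.],                 # toy co-occurrence counts
                         [2., 0., 4.],
                         [5., 4., 0.]])
pretrained = {w: np.random.randn(50) for w in vocab}   # stand-in for real 50-d GloVe vectors

model = Mittens(n=50, max_iter=100)
new_embeddings = model.fit(cooccurrence, vocab=vocab,
                           initial_embedding_dict=pretrained)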
word2vec and GloVe are techniques for producing word embeddings, i.e., for modelling text (a set of sentences) as computer-readable vectors.
While word2vec trains on the local context (neighboring words), GloVe looks for word co-occurrences across a whole text or corpus; its approach is more global.
word2vec
There are two main approaches for word2vec, in which the algorithm loops through the words of each sentence. For each current word w it will try to predict
the neighboring words of w, i.e., its context: this is the Skip-Gram approach
w itself from its context: this is the CBOW approach
Hence, word2vec will produce similar embeddings for words with similar contexts, for instance a noun in the singular and its plural, or two synonyms.
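For reference, a minimal gensim sketch of the two approaches (parameter names follow gensim 4.x; sg=1 selects Skip-Gram, sg=0 selects CBOW; the sentences are toy data):
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "lay", "on", "the", "rug"]]

skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
print(skipgram.wv.most_similar("cat", topn=2))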
Glove
The main intuition underlying the GloVe model is the simple observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. In other words, the embeddings are based on computing distances between pairs of target words. The model computes the distance between two target words in a text by analyzing the co-occurrence of those two target words with some other probe words (contextual words).
https://nlp.stanford.edu/projects/glove/
For example, consider the co-occurrence probabilities for the target words "ice" and "steam" with various probe words from the vocabulary. Here are some actual probabilities from a 6-billion-word corpus (from the GloVe project page linked above):
Probability and ratio | k = solid | k = gas | k = water | k = fashion
P(k|ice)              | 1.9e-4    | 6.6e-5  | 3.0e-3    | 1.7e-5
P(k|steam)            | 2.2e-5    | 7.8e-4  | 2.2e-3    | 1.8e-5
P(k|ice)/P(k|steam)   | 8.9       | 8.5e-2  | 1.36      | 0.96
As one might expect, "ice" co-occurs more frequently with "solid" than it does with "gas", whereas "steam" co-occurs more frequently with "gas" than it does with "solid". Both words co-occur with their shared property "water" frequently, and both co-occur with the unrelated word "fashion" infrequently. Only in the ratio of probabilities does noise from non-discriminative words like "water" and "fashion" cancel out, so that large values (much greater than 1) correlate well with properties specific to "ice", and small values (much less than 1) correlate well with properties specific to "steam". In this way, the ratio of probabilities encodes some crude form of meaning associated with the abstract concept of thermodynamic phase.
Also, GloVe is very good at analogies and performs well on the word2vec analogy dataset.

Addressing synonyms in Supervised Learning for Text Classification

I am using a scikit-learn supervised learning method for text classification. I have a training dataset with input text fields and the categories they belong to. I use a tf-idf + SVM classifier pipeline for creating the model. The solution works well for normal test cases, but if a new text is entered that contains synonyms of words in the training set, the solution fails to classify it correctly.
For example, the word 'run' might be in the training data, but if I use the word 'sprint' to test, the solution fails to classify it correctly.
What is the best approach here? Adding all synonyms of all words in the training dataset doesn't look like a scalable approach to me.
You should look into word vectors and dense document embeddings. Right now you are passing scikit-learn a matrix X where each row is a numerical representation of a document in your dataset. You are getting this representation with tf-idf, but as you noticed this doesn't capture word similarities, and you are also having issues with out-of-vocabulary words.
A possible improvement is to represent each word with a dense vector of, let's say, dimension 300, in such a way that words with similar meanings are close in this 300-dimensional space. Fortunately you don't need to build these vectors from scratch (look up gensim word2vec and spaCy). Another good thing is that by using word embeddings pre-trained on a very large corpus like Wikipedia, you incorporate into your algorithm a lot of linguistic information about the world that you couldn't infer from your corpus otherwise (like the fact that 'sprint' and 'run' are near-synonyms).
Once you have good, semantic numeric representations for words, you need a vector representation for each document. The simplest way is to average the word vectors of the words in the sentence.
Example pseudocode to get you started:
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc1 = nlp('I had a good run')
>>> doc1.vector
array([ 6.17495403e-02, 2.07064897e-02, -1.56451517e-03,
1.02607915e-02, -1.30429687e-02, 1.60102192e-02, ...
Now let's try a different document:
>>> doc2 = nlp('I had a great sprint')
>>> doc2.vector
array([ 0.02453461, -0.00261007, 0.01455955, -0.01595449, -0.01795897,
-0.02184369, -0.01654281, 0.01735667, 0.00054854, ...
>>> doc2.similarity(doc1)
0.8820845113100807
Note how the vectors are similar (in the sense of cosine similarity) even when the words are different. Because the vectors are similar, a scikit-learn classifier will learn to assign them to the same category. With a tf-idf representation this would not be the case.
This is how you can use these vectors in scikit-learn:
X = [nlp(text).vector for text in corpus]
clf.fit(X, y)

Representing documents in vector space model

I have a very fundamental question. I have two sets of documents, one for training and one for testing, and I would like to train a logistic regression classifier with the training documents. I want to know whether I'm doing the right thing.
First, find the list of all unique words in the training documents and call it the vocabulary.
For each word in the vocabulary, find its TF-IDF score in every training document. A document is then represented as a vector of these TF-IDF scores.
My question is:
How do I represent the test documents? Say one of the test documents does not contain any word that is in the vocabulary. In that case, the TF-IDF scores will be zero for all words in the vocabulary for that document.
I'm trying to use LIBSVM, which uses a sparse vector format. For such a document, whose vector representation has all entries set to 0, how do I represent it?
You have to store enough information about the training corpus to apply the TF-IDF transform to unseen documents. This means you'll need the document frequencies of the terms in the training corpus. Ignoring unseen words in test documents is fine; your SVM won't have learned a weight for them anyway. Note that unseen terms should be rare in the test corpus if your training and test distributions are similar, so even if a few terms are dropped, you'll still have plenty of terms to classify the document.
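In scikit-learn terms (a sketch with toy documents; the question mentions LIBSVM, but the same fit-on-train / transform-on-test pattern applies), this looks like:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = ["the cat sat on the mat", "dogs bark loudly at night"]
train_labels = [0, 1]
test_docs = ["a zebra sat quietly"]              # "zebra" and "quietly" are unseen -> ignored

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learns vocabulary and document frequencies
X_test = vectorizer.transform(test_docs)         # sparse matrix; only known terms get weights

clf = LogisticRegression().fit(X_train, train_labels)
print(clf.predict(X_test))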
