Representing documents in vector space model - machine-learning

I have a very fundamental question. I have two sets of documents, one for training and one for testing. I would like to train a Logistic regression classifier with the training documents. I want to know if I'm doing the right thing.
First, find the list of all unique words in the training documents and call it the vocabulary.
For each word in the vocabulary, find its TF-IDF score in every training document. A document is then represented as a vector of these TF-IDF scores.
My question is:
1. How do I represent the test documents? Say one of the test documents does not contain any word that is in the vocabulary. In that case, the TF-IDF scores will be zero for all words in the vocabulary for that document.
I'm trying to use LIBSVM which uses the sparse vector format. For the case of the above document, which has all entries set to 0 in its vector representation, how do I represent it?

You have to store enough information about the training corpus to apply the TF-IDF transform to unseen documents. This means you'll need the document frequencies of the terms in the training corpus. Ignoring unseen words in test docs is fine: your classifier has no learned weight for them anyway. Note that unseen terms should be rare in the test corpus if your training and test distributions are similar, so even if a few terms are dropped, you'll still have plenty of terms to classify the doc.
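As a rough illustration of the answer above, here is a minimal pure-Python sketch that learns document frequencies from the training corpus only and then writes any document, seen or unseen, as a line in the LIBSVM sparse format. The function names `fit_idf` and `to_libsvm_line`, the whitespace tokenizer, and the IDF variant `1 + log(N / df)` are assumptions made for the example, not something prescribed by LIBSVM. Since the sparse format simply omits zero entries, a document with no vocabulary words should reduce to its label alone.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn feature indices and IDF weights from the training corpus only."""
    n_docs = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc.lower().split()))  # count each term once per document
    # LIBSVM feature indices start at 1 and must appear in ascending order.
    vocab = {term: i + 1 for i, term in enumerate(sorted(df))}
    # Assumed IDF variant: 1 + log(N / df).
    idf = {term: 1.0 + math.log(n_docs / df[term]) for term in df}
    return vocab, idf

def to_libsvm_line(label, doc, vocab, idf):
    """Format one document as '<label> <index>:<value> ...', dropping unseen terms."""
    counts = Counter(t for t in doc.lower().split() if t in vocab)
    feats = sorted((vocab[t], c * idf[t]) for t, c in counts.items())
    return " ".join([str(label)] + [f"{i}:{v:.4f}" for i, v in feats])

train = ["the cat sat", "the dog barked"]
vocab, idf = fit_idf(train)
line_known = to_libsvm_line(1, "the cat", vocab, idf)
line_empty = to_libsvm_line(-1, "zebra quagga", vocab, idf)  # no vocabulary words
print(line_known)
print(line_empty)
```

The second line printed contains only the label, which is how an all-zero vector would look in this sketch.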

Related

How to classify texts that are related to the bible based on their content

I have a database of texts from comments of social networks (FB,Twitter).
My goal is to classify texts that have a strong relation to the Bible based on their content (for example, if they contain citations or "biblical" words).
This is a binary classification problem and I need help figuring out how to approach it (maybe use the Bible as a dictionary somehow). Thanks!
You can train a supervised binary classifier (e.g. a logistic regression over TF-IDF features, or a fasttext classifier, or fine-tune a BertForSequenceClassification).
Then apply this classifier to your database of comments and find a reasonable value of the probability threshold to keep only the comments in which the classifier is confident enough.
As positive examples for training, you can use sentences from the Bible itself, sentences from Bible-related Wikipedia articles, etc. As negative samples, you can use any corpus of sentences collected from the web, e.g. one of the Leipzig corpora.
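The thresholding step mentioned above could look roughly like the following sketch. It assumes you already have classifier probabilities for a small hand-labelled validation set; the helper name `pick_threshold`, the target precision of 0.9, and the toy numbers are all hypothetical.

```python
def pick_threshold(probs, labels, min_precision=0.9):
    """Return the lowest threshold whose precision is at least min_precision.

    probs  -- predicted probability of the positive class for each comment
    labels -- 1 for Bible-related, 0 otherwise (hand-labelled validation data)
    Returns None if no threshold reaches the requested precision.
    """
    best = None
    for t in sorted(set(probs), reverse=True):
        # Labels of every comment the classifier would keep at threshold t.
        kept = [y for p, y in zip(probs, labels) if p >= t]
        precision = sum(kept) / len(kept)
        if precision >= min_precision:
            best = t  # thresholds are visited from high to low
    return best

# Invented validation scores, for illustration only.
val_probs = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2]
val_labels = [1, 1, 1, 0, 1, 0]
threshold = pick_threshold(val_probs, val_labels)
print(threshold)
```

Picking the lowest threshold that still meets the precision target keeps as many comments as possible while limiting false positives; a different criterion (for example, a recall target) may suit your data better.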

Data augmentation for text classification

What is the current state-of-the-art data augmentation technique for text classification?
I did some research online on how I can extend my training set by applying data transformations, the same way we do for image classification.
I found some interesting ideas such as:
Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random place in the sentence. Do this n times.
Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
Random Deletion: Randomly remove each word in the sentence with probability p.
But I found nothing about using a pre-trained word vector representation model such as word2vec. Is there a reason?
Data augmentation using word2vec might help the model by supplying more data based on external information. For instance, randomly replacing a token in a toxic comment with its closest token in a pre-trained vector space trained specifically on external online comments.
Is it a good method, or am I missing some important drawbacks of this technique?
Your idea of using word2vec embeddings usually helps. However, that is a context-free embedding. To go one step further, the state of the art (SOTA) as of today (2019-02) is to use a language model trained on a large corpus of text and fine-tune your own classifier with your own training data.
The two SOTA models are:
GPT-2 https://github.com/openai/gpt-2
BERT https://github.com/google-research/bert
The data augmentation methods you mentioned might also help (depending on your domain and the number of training examples you have). Some of them are actually used in language model training (for example, in BERT there is one task that randomly masks out words in a sentence at pre-training time). If I were you, I would first adopt a pre-trained model and fine-tune your own classifier with your current training data. Taking that as a baseline, you could try each of the data augmentation methods you like and see if they really help.
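For reference, the Random Swap and Random Deletion operations listed in the question can be sketched in a few lines of plain Python. The function names, the explicit `rng` argument, and the fallback of keeping one word when everything would be deleted are choices made for this example.

```python
import random

def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    words = list(words)  # work on a copy
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word independently with probability p."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Assumption: never return an empty sentence, keep one random word instead.
    return kept or [rng.choice(words)]

rng = random.Random(0)  # fixed seed so the augmentation is reproducible
sentence = "the quick brown fox jumps".split()
print(random_swap(sentence, 2, rng))
print(random_deletion(sentence, 0.3, rng))
```

Each augmented sentence keeps the label of its source sentence, so it is worth checking on a validation set that the perturbations do not change the meaning too often for your task.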

Addressing synonyms in Supervised Learning for Text Classification

I am using scikit-learn supervised learning methods for text classification. I have a training dataset with input text fields and the categories they belong to. I use a tf-idf plus SVM classifier pipeline to create the model. The solution works well for normal test cases. But if a new text is entered that contains synonyms of words in the training set, the solution fails to classify it correctly.
For example: the word 'run' might be in the training data, but if I use the word 'sprint' to test, the solution fails to classify correctly.
What is the best approach here? Adding all synonyms for all words in the training dataset doesn't look like a scalable approach to me.
You should look into word vectors and dense document embeddings. Right now you are passing scikit-learn a matrix X, where each row is a numerical representation of a document in your dataset. You are getting this representation with tf-idf but as you noticed this doesn't capture word similarities and you are also having issues with out of vocabulary words.
A possible improvement is to represent each word with a dense vector of, let's say, dimension 300, in such a way that words with similar meaning are close in this 300-dimensional space. Fortunately you don't need to build these vectors from scratch (look up gensim word2vec and spacy). Another good thing is that by using word embeddings pre-trained on a very large corpus like Wikipedia you are incorporating a lot of linguistic information about the world into your algorithm that you couldn't infer from your corpus otherwise (like the fact that sprint and run are synonyms).
Once you get good and semantic numeric representation for words you need to get a vector representation for each document. The simplest way would be to average the word vectors of each word in the sentence.
Example pseudocode to get you started:
>>> import spacy
>>> nlp = spacy.load('en_core_web_md')  # spaCy 3 removed the 'en' shortcut; a model with word vectors is assumed
>>> doc1 = nlp('I had a good run')
>>> doc1.vector
array([ 6.17495403e-02, 2.07064897e-02, -1.56451517e-03,
1.02607915e-02, -1.30429687e-02, 1.60102192e-02, ...
Now let's try a different document:
>>> doc2 = nlp('I had a great sprint')
>>> doc2.vector
array([ 0.02453461, -0.00261007, 0.01455955, -0.01595449, -0.01795897,
-0.02184369, -0.01654281, 0.01735667, 0.00054854, ...
>>> doc2.similarity(doc1)
0.8820845113100807
Note how the vectors are similar (in the sense of cosine similarity) even when the words are different. Because the vectors are similar, a scikit-learn classifier will learn to assign them to the same category. With a tf-idf representation this would not be the case.
This is how you can use these vectors in scikit-learn:
X = [nlp(text).vector for text in corpus]
clf.fit(X, y)

Does word2vec make sense for supervised learning?

I have a list of sentence/label pairs to train the model, how should I encode the sentences as input to, say an SVM?
Are the sentences in the same language? You could start with the pretrained word2vec file that you can download from Google if it's English. Pay attention to how the train file was created, whether stemming was applied, etc. It's also somewhat important from which corpus it was generated; you'd get different results if this is from newsgroups or if it was extracted from the web or from more formal text.
Word2Vec encodes every word as a dense vector, usually with 200, 300 or 500 dimensions. Once it is trained, the "test" sentences are treated as bags of words, so word order does not matter.
You'd then, for each word in the bag of words, figure out the corresponding word2vec vector. Then you can create features by averaging the vectors, taking the 'minimum', the 'maximum' and if you're comparing text, look at calculating the cosine similarity between vectors. Then use those features in an SVM.
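A minimal sketch of that feature construction follows. The tiny 3-dimensional embedding table `EMB` is invented for the example (real word2vec vectors would be loaded from a pre-trained file), and the helper names are hypothetical.

```python
import math

# Toy embeddings, made up for illustration; real vectors have hundreds of dimensions.
EMB = {
    "good":   [0.9, 0.1, 0.0],
    "great":  [0.8, 0.2, 0.1],
    "run":    [0.1, 0.9, 0.2],
    "sprint": [0.2, 0.8, 0.3],
}

def sentence_features(sentence, emb):
    """Concatenate the element-wise mean, min and max of the known word vectors."""
    vecs = [emb[w] for w in sentence.lower().split() if w in emb]
    if not vecs:  # no known word: fall back to an all-zero feature vector
        dim = len(next(iter(emb.values())))
        return [0.0] * (3 * dim)
    cols = list(zip(*vecs))  # one tuple per embedding dimension
    mean = [sum(c) / len(c) for c in cols]
    return mean + [min(c) for c in cols] + [max(c) for c in cols]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

f1 = sentence_features("I had a good run", EMB)
f2 = sentence_features("I had a great sprint", EMB)
sim = cosine(f1[:3], f2[:3])  # compare the averaged parts only
print(len(f1), round(sim, 3))
```

The resulting fixed-length lists can be stacked into a matrix and passed to an SVM in the same way as any other numeric features.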

How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weights for each document vector. Then I can use this matrix to train a model for document classification.
I would like to classify new documents in the future. But in order to classify a new document, I need to turn it into a document-term vector first, and the vector should be composed of TF*IDF values, too.
My question is, how could I calculate the TF*IDF with just a single document?
As far as I understand, TF can be calculated from a single document by itself, but IDF can only be calculated from a collection of documents. In my current experiment, I actually calculate the TF*IDF values for the whole collection of documents, and then I use some documents as the training set and the others as the test set.
I just suddenly realized that this seems not so applicable to real life.
ADD 1
So there are actually 2 subtly different scenarios for classification:
1. to classify some documents whose content is known but whose labels are not known.
2. to classify some totally unseen document.
For 1, we can combine all the documents, both with and without labels, and compute the TF*IDF over all of them. This way, even if we only use the documents with labels for training, the training result will still contain the influence of the documents without labels.
But my scenario is 2.
Suppose I have the following information for term T from the summary of the training set corpus:
document count for T in the training set is n
total number of training documents is N
Should I calculate the IDF of T for an unseen document D as below?
IDF(T, D) = log((N+1)/(n+1))
ADD 2
And what if I encounter a term in the new document which didn't show up in the training corpus before?
How should I calculate the weight for it in the doc-term vector?
TF-IDF doesn't make sense for a single document, independent of a corpus. It's fundamentally about emphasizing relatively rare and informative words.
You need to keep corpus summary information in order to compute TF-IDF weights. In particular, you need the document count for each term and the total number of documents.
Whether you want to use summary information from the whole training set and test set for TF-IDF, or for just the training set is a matter of your problem formulation. If it's the case that you only care to apply your classification system to documents whose contents you have, but whose labels you do not have (this is actually pretty common), then using TF-IDF for the entire corpus is okay. If you want to apply your classification system to entirely unseen documents after you train, then you only want to use the TF-IDF summary information from the training set.
TF obviously only depends on the new document.
IDF, you compute only on your training corpus.
You can add a slack term to the IDF computation, or adjust it as you suggested. But for a reasonable training set, the constant +1 term will not have a whole lot of effect. AFAICT, in classic document retrieval (think: search), you don't bother to do this. Often, the query document will not become part of your corpus, so why would it be part of IDF?
For unseen words, the TF calculation is not a problem, as TF is a document-specific metric. When computing IDF, you can use a smoothed inverse document frequency technique.
IDF = 1 + log(total documents / document frequency of a term)
Here the lower bound for IDF is 1, reached when a term occurs in every document. The formula itself is undefined for a term with document frequency 0, so for a word not seen in the training corpus you can simply assign the lower bound, 1, by convention. Since there is no universally agreed single formula for computing tf-idf or even idf, your formula for the tf-idf calculation is also reasonable.
Note that, in many cases, unseen terms are ignored if they don't have much impact in the classification task. Sometimes, people replace unseen tokens with a special symbol like UNKNOWN_TOKEN and do their computation.
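To make this concrete, here is a small sketch that weights a single new document using only the summary statistics kept from the training set, with the smoothed formula from the question, log((N+1)/(n+1)). The function names and the toy counts are invented for the example. With this formula an unseen term (n = 0) gets the finite value log(N+1).

```python
import math
from collections import Counter

def smoothed_idf(n_docs, doc_freq):
    """IDF = log((N + 1) / (n + 1)), the variant proposed in the question."""
    return math.log((n_docs + 1) / (doc_freq + 1))

def tfidf_for_new_doc(tokens, doc_freqs, n_docs):
    """Weight one unseen document using training-set statistics only."""
    tf = Counter(tokens)  # raw term counts from the new document itself
    return {t: c * smoothed_idf(n_docs, doc_freqs.get(t, 0)) for t, c in tf.items()}

# Hypothetical training summary: 9 documents and per-term document counts.
N = 9
train_doc_freqs = {"the": 9, "bayes": 1}

weights = tfidf_for_new_doc(["the", "bayes", "bayes", "zebra"], train_doc_freqs, N)
print(weights)
```

Whether to keep the weight of a term like "zebra" is a separate decision: a classifier trained on the training vocabulary has no dimension for it, so in practice the term is often dropped or mapped to a shared unknown-token feature.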
Alternative to TF-IDF: Another way of computing the weight of each term in a document is Maximum Likelihood Estimation (MLE). When computing the MLE, you can smooth the estimates with additive smoothing, also known as Laplace smoothing. MLE is used when you are working with generative models such as the Naive Bayes algorithm for document classification.
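As a hedged illustration of additive smoothing, the sketch below estimates per-class term probabilities the way a multinomial Naive Bayes model typically would: P(t | c) = (count(t, c) + alpha) / (total count in c + alpha * |V|). The function name and the toy vocabulary are assumptions for the example.

```python
from collections import Counter

def laplace_term_probs(class_tokens, vocabulary, alpha=1.0):
    """Additively smoothed estimate of P(term | class) over a fixed vocabulary."""
    counts = Counter(class_tokens)
    # Only tokens inside the vocabulary contribute to the denominator.
    total = sum(counts[t] for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}

vocabulary = ["run", "sprint", "walk"]
sports_tokens = ["run", "run", "walk"]  # all tokens seen in one class
probs = laplace_term_probs(sports_tokens, vocabulary)
print(probs)
```

With alpha = 1 the term "sprint", which never occurs in this class, still receives a non-zero probability, so a single unseen term does not drive the whole document likelihood to zero.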

Resources