In my SVM, I am using tf-idf on the documents for feature extraction. The tf-idf values are calculated over the whole set of training documents.
Now when I get a test document that I want to classify, how do I generate the vector for it?
I used stemming before calculating tf-idf, and I can perform that on the test document too. I also have the word counts for the training documents.
Should I increment the training-set word counts with the words of the test document when calculating its tf-idf, or should I use the counts directly?
Calculate them the same way as during training, but use the idf from the training documents and the tf from the test document. If many new documents keep coming in, just update the training data from time to time and retrain your model.
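A minimal plain-Python sketch of this (raw term counts as tf, idf = log(N/df); the function names and toy corpus are illustrative, not from the question):

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """idf from the training corpus only: idf(t) = log(N / df(t))."""
    n_docs = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))          # each doc counts a term at most once
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf_vector(doc, idf):
    """tf from the new document, idf from training; terms missing
    from the training vocabulary are simply dropped."""
    tf = Counter(doc)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

train = [["good", "movie"], ["bad", "movie"], ["good", "good", "film"]]
idf = fit_idf(train)
test_vec = tfidf_vector(["good", "movie", "unseenword"], idf)
```

Note that `fit_idf` never sees the test document, which is exactly the point: the idf table is frozen at training time.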
Related
I have around 100,000 documents of varying length, and I have also trained a word2vec model on the entire corpus. Now how do I go from these word vectors to features of the same dimension for each individual document?
I am aware of a couple of techniques for doing this: one is to take a simple average of the vectors of all the words in a document, and another is to do k-means clustering.
Can you suggest some other way of carrying out this task?
If you want to create a vector for each document, you might want to check Doc2Vec.
Doc2Vec - Gensim Tutorial
Doc2Vec paper
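For comparison, the simple averaging baseline mentioned in the question can be sketched like this (the embeddings below are made up for illustration; in practice they would come from the trained word2vec model):

```python
def average_vector(doc, embeddings, dim=3):
    """Document vector = mean of the embeddings of its known words.
    Out-of-vocabulary words are skipped; an all-OOV doc maps to zeros."""
    vecs = [embeddings[w] for w in doc if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(component) / len(vecs) for component in zip(*vecs)]

# Toy 3-dimensional embeddings, purely illustrative
embeddings = {
    "good": [1.0, 0.0, 0.0],
    "movie": [0.0, 1.0, 0.0],
}
doc_vec = average_vector(["good", "movie", "oov"], embeddings)
```

This gives every document a vector of the same dimension as the word vectors, which is what the question asks for; Doc2Vec learns the document vector jointly instead of averaging after the fact.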
I programmed my own classifier in Python and tested it on a text corpus using the F1 measure. Now I want to evaluate it on other data mining tasks, so I have my classifier's output file for a given corpus and I want to measure its quality using Weka's various measures. How can I pass the output file to Weka and get the quality metrics?
I think the correct procedure should be some sort of n-fold validation: divide your data set into training and test sets; develop the model on the training set and calculate its sum of squared errors, SSE(train).
Then take the model, run the test data through it, and calculate SSE(test) using the predicted and actual response values. That will help you assess the accuracy and bias of your model.
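The SSE computation referred to above is just the following (a minimal sketch; the response values are made up):

```python
def sse(actual, predicted):
    """Sum of squared errors between actual and predicted responses."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Hypothetical held-out responses and the model's predictions for them
sse_test = sse([3.0, 5.0, 2.0], [2.5, 5.5, 2.0])
```

Comparing SSE(train) against SSE(test) computed this way shows how much the model degrades on unseen data.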
Have a look at Elements of Statistical Learning Using R.
I am using document-term vectors to represent a collection of documents, with TF*IDF as the term weight in each document vector. I can then use this matrix to train a model for document classification.
I want to classify new documents in the future. But in order to classify one, I need to turn it into a document-term vector first, and that vector should be composed of TF*IDF values, too.
My question is: how can I calculate TF*IDF with just a single document?
As far as I understand, TF can be calculated from a single document by itself, but IDF can only be calculated from a collection of documents. In my current experiment, I actually calculate the TF*IDF values over the whole collection of documents, and then use some documents as the training set and the others as the test set.
I just suddenly realized that this does not seem applicable to real life.
ADD 1
So there are actually 2 subtly different scenarios for classification:
1. classifying documents whose contents are known but whose labels are not;
2. classifying totally unseen documents.
For scenario 1, we can combine all the documents, both with and without labels, and compute TF*IDF over all of them. This way, even though we only use the labeled documents for training, the training result will still reflect the influence of the unlabeled documents.
But my scenario is 2.
Suppose I have the following information for term T from the summary of the training-set corpus:
- the document count for T in the training set is n
- the total number of training documents is N
Should I calculate the IDF of T for an unseen document D as below?
IDF(T, D) = log((N+1)/(n+1))
ADD 2
And what if I encounter a term in the new document that didn't show up in the training corpus before?
How should I calculate its weight in the doc-term vector?
TF-IDF doesn't make sense for a single document, independent of a corpus. It's fundamentally about emphasizing relatively rare and informative words.
You need to keep corpus summary information in order to compute TF-IDF weights. In particular, you need the document count for each term and the total number of documents.
Whether you want to use summary information from the whole training set and test set for TF-IDF, or for just the training set is a matter of your problem formulation. If it's the case that you only care to apply your classification system to documents whose contents you have, but whose labels you do not have (this is actually pretty common), then using TF-IDF for the entire corpus is okay. If you want to apply your classification system to entirely unseen documents after you train, then you only want to use the TF-IDF summary information from the training set.
TF obviously only depends on the new document.
IDF, you compute only on your training corpus.
You can add a slack term to the IDF computation, or adjust it as you suggested. But for a reasonably sized training set, the constant +1 term will not have much effect. AFAICT, in classic document retrieval (think: search), you don't bother to do this: the query document usually does not become part of your corpus, so why would it be part of the IDF?
For unseen words, TF calculation is not a problem as TF is a document specific metric. While computing IDF, you can use smoothed inverse document frequency technique.
IDF = 1 + log(total documents / document frequency of a term)
Here the lower bound for IDF is 1, so if a word is not seen in the training corpus, its IDF is 1. Since there is no universally agreed-upon formula for computing tf-idf, or even idf, your formula for the tf-idf calculation is also reasonable.
Note that, in many cases, unseen terms are ignored if they don't have much impact in the classification task. Sometimes, people replace unseen tokens with a special symbol like UNKNOWN_TOKEN and do their computation.
Alternative to TF-IDF: another way of computing the weight of each term in a document is Maximum Likelihood Estimation. When computing the MLE, you can smooth it using additive smoothing, also known as Laplace smoothing. MLE is useful when you are using generative models, such as the Naive Bayes algorithm, for document classification.
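A small sketch of the smoothed-IDF lookup described above, with unseen terms bottoming out at 1 (the function and variable names are illustrative):

```python
import math

def smoothed_idf(term, doc_freq, n_docs):
    """IDF = 1 + log(N / df); a term unseen in training (df == 0)
    gets the lower-bound value of 1."""
    df = doc_freq.get(term, 0)
    if df == 0:
        return 1.0
    return 1.0 + math.log(n_docs / df)

doc_freq = {"movie": 50, "rare": 1}     # document frequencies from training
n_docs = 100                            # training-set size
idf_common = smoothed_idf("movie", doc_freq, n_docs)      # 1 + log(2)
idf_unseen = smoothed_idf("neverseen", doc_freq, n_docs)  # 1.0
```

This way an unseen term still gets a finite, small weight instead of crashing the computation or dominating it.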
I am trying to build an SVM classifier in SVM Light using the Vector Space Model. I have 1000 documents and a dictionary of terms I will be using to vectorize each document. Of the 1000 documents, 600 will be for my training set, while the remaining 400 will be split evenly (200 each) for my cross-validation set and my test set.
Now suppose that I were to train my SVM classifier using my training set of 600 (vectorized using tf-idf) in order to generate a model for classification.
When I apply the model to my cross-validation set, would I use the same idf (since the model corresponds to my training set), or would I need to compute a new idf based on the cross-validation set? Also, if I were to apply the model to a single document, how would I apply idf, since that set would contain only one document?
You build the idf from your training documents and use it whenever a new test document comes in. For each test document, you create its weighted word list using the idf of each term in it. If a word is not covered by the idf table, its weight is 0. The classification is then based on the established idf.
You should use the same idf as your training set, because you built your classifier against that idf, and your results would differ with a new one.
I have a very fundamental question. I have two sets of documents, one for training and one for testing. I would like to train a Logistic regression classifier with the training documents. I want to know if I'm doing the right thing.
First, find the list of all unique words in the training documents and call it the vocabulary.
For each word in the vocabulary, find its TFIDF in every training document. A document is then represented as a vector of these TFIDF scores.
My question is:
1. How do I represent the test documents? Say one of the test documents does not contain any word that is in the vocabulary. In that case, the TFIDF scores will be zero for all words in the vocabulary for that document.
I'm trying to use LIBSVM which uses the sparse vector format. For the case of the above document, which has all entries set to 0 in its vector representation, how do I represent it?
You have to store enough information about the training corpus to apply the TF-IDF transform to unseen documents. This means you'll need the document frequencies of the terms in the training corpus. Ignoring unseen words in test docs is fine: your SVM won't have learned a weight for them anyway. Note that unseen terms should be rare in the test corpus if your training and test distributions are similar, so even if a few terms are dropped, you'll still have plenty of terms to classify the document.
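On the sparse-format question: in LIBSVM's format a line consists of a label followed by `index:value` pairs for the non-zero features, so a document with no in-vocabulary words is commonly written as just its label. A small illustrative formatter (the names are made up):

```python
def to_libsvm_line(label, vector):
    """Format a {feature_index: value} dict as a LIBSVM sparse line.
    Indices are 1-based and must appear in ascending order; an empty
    vector yields just the label, i.e. an all-zero document."""
    pairs = " ".join(f"{i}:{v:g}" for i, v in sorted(vector.items()))
    return f"{label} {pairs}".rstrip()

line = to_libsvm_line(1, {3: 0.405, 7: 1.099})
empty = to_libsvm_line(-1, {})
```

Writing one such line per document gives you a file you can feed directly to LIBSVM or SVM Light.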