svm file format in weka - machine-learning

I want to classify texts using svm (smo) in weka. The file I have, contains some sentences (Persian) and a word in front of each sentence which shows its class. The question is: should I change these sentences to a binary vector and give these vectors to weka as input or is it enough if I just turn the sentences to vector by choosing "string to word vector" in weka itself?
sample file:
https://www.dropbox.com/s/ohpyortve8jbwhe/shoor.arff?dl=0

Although, it works with choosing "string to word vector" in weka, it's better to change the sentences to vectors according to 1000 most frequent words or any other features. It works faster.

Related

What is the best way to handle missing words when using word embeddings?

I have a set of pre-trained word2vec word vectors and a corpus. I want to use the word vectors to represent words in the corpus. The corpus has some words in it that I don't have trained word vectors for. What's the best way to handle those words for which there is no pre-trained vector?
I've heard several suggestions.
use a vector of zeros for every missing word
use a vector of random numbers for every missing word (with a bunch of suggestions on how to bound those randoms)
an idea I had: take a vector whose values are the mean of all values in that position from all pre-trained vectors
Anyone with experience with the problem have thoughts on how to handle this?
FastText from Facebook assembles word vectors from subword n-grams which allows it to handle out of vocabulary words. See more about this approach at: Out of Vocab Word Embedding
In a pre-trained word2vec embedding matrix, you can usually use word unk as index to find a predesigned vector which is often the best vector.

Addressing synonyms in Supervised Learning for Text Classification

I am using scikit-learn supervised learning method for text classification. I have a training dataset with input text fields and the categories they belong to. I use tf-idf, SVM classifier pipeline for creating the model. The solution works well for normal testcases. But if a new text is entered which has synoynmous words as in the training set, the solution fails to classify correctly.
For e.g: the word 'run' might be there in the training data but if I use the word 'sprint' to test, the solution fails to classify correctly.
What is the best approach here? Adding all synonyms for all words in training dataset doesn't look like a scalable approach to me
You should look into word vectors and dense document embeddings. Right now you are passing scikit-learn a matrix X, where each row is a numerical representation of a document in your dataset. You are getting this representation with tf-idf but as you noticed this doesn't capture word similarities and you are also having issues with out of vocabulary words.
A possible improvement is to represent each word with a dense vector of lets say dimension 300, in such a way that words with similar meaning are close in this 300 dimensional space. Fortunately you don't need to build these vectors from scratch (look up gensim word2vec and spacy). Another good thing is that by using word embeddings pre-trained on very large corpus like Wikipedia you are incorporating a lot of linguistic information about the world into your algorithm that you couldn't infer from your corpus otherwise (like the fact that sprint and run are synonyms).
Once you get good and semantic numeric representation for words you need to get a vector representation for each document. The simplest way would be to average the word vectors of each word in the sentence.
Example pseudocode to get you started:
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc1 = nlp('I had a good run')
>>> doc1.vector
array([ 6.17495403e-02, 2.07064897e-02, -1.56451517e-03,
1.02607915e-02, -1.30429687e-02, 1.60102192e-02, ...
Now lets try a different document:
>>> doc2 = nlp('I had a great sprint')
>>> doc2.vector
array([ 0.02453461, -0.00261007, 0.01455955, -0.01595449, -0.01795897,
-0.02184369, -0.01654281, 0.01735667, 0.00054854, ...
>>> doc2.similarity(doc1)
0.8820845113100807
Note how the vectors are similar (in the sense of cosine similarity) even when the words are different. Because the vectors are similar, a scikit-learn classifier will learn to assign them to the same category. With a tf-idf representation this would not be the case.
This is how you can use these vectors in scikit-learn:
X = [nlp(text).vector for text in corpus]
clf.fit(X, y)

Does word2vec make sense for supervised learning?

I have a list of sentence/label pairs to train the model, how should I encode the sentences as input to, say an SVM?
Are the sentences in the same language? You could start with the pretrained word2vec file that you can download from Google if it's English. Pay attention to how the train file was created, whether stemming was applied, etc. It's also somewhat important from which corpus it was generated; you'd get different results if this is from newsgroups or if it was extracted from the web or from more formal text.
Word2Vec basically encodes every word into a higher dimensional vector space. This is usually 200,300 or 500 dimensions large. After it is trained, then the "test" sentences are basically bag of words and need not be in any order.
You'd then, for each word in the bag of words, figure out the corresponding word2vec vector. Then you can create features by averaging the vectors, taking the 'minimum', the 'maximum' and if you're comparing text, look at calculating the cosine similarity between vectors. Then use those features in an SVM.

NLP - Word Representations

I am working on a Word representation algorithm, similar to Word2Vec and GloVe.I have been asked to make it more dynamic, such that new words could be added to the vocabulary,and new documents could be submitted to the program even after the representations (vectors) have been created.
The problem is, how do I know if my representation work? How do I know if it actually captures the meaning of each word? How do I compare my representation with other existing vector space models?
As of now, I am doing the following tests to check the quality of my word vectors:
Distance test:
Does the cosine distance between vectors reflect the semantic distance between words?
Analogy test:
Can the representation be used to solve problems like "King is to queen what man is to ________ ", (the answer should be woman)
Picking the odd one out:
Can the vectors be used to pick the odd word in a given list of words. If the input is {"cat","dog","phone"}, the output should be "phone"?
What are the other tests that I should do to check the quality of the vectors? What other tasks are word vectors expected to be capable of doing? Is there a benchmark for vector space models?
Your tests sound very reasonable — they are the usual evaluation tasks that are used in research papers to test the quality of word embeddings.
In addition, the website www.wordvectors.org can give you a good idea of how your vectors measure up. It allows you to upload your embeddings, generates plots, gives correlations with word pair similarity rankings, and compares your embeddings with pre-trained vectors from previous research. You can find a more detailed description in the accompanying paper.

Representing documents in vector space model

I have a very fundamental question. I have two sets of documents, one for training and one for testing. I would like to train a Logistic regression classifier with the training documents. I want to know if I'm doing the right thing.
First find the list of all unique words in the training document and call it vocabulary.
For each word in the vocabulary, find its TFIDF in every training document. A document is then represented as vector of these TFIDF scores.
My question is:
1. How do I represent the test documents? Say, one of the test documents does not have any word that is in the vocabulary. In that case , the TFIDF scores will be zero for all words in the vocabulary for that document.
I'm trying to use LIBSVM which uses the sparse vector format. For the case of the above document, which has all entries set to 0 in its vector representation, how do I represent it?
You have to store enough information about the training corpus to do the TF IDF transform on unseen documents. This means you'll need the document frequencies of the terms in the training corpus. Ignoring unseen words in test docs is fine. Your svm won't learn a weight for them anyway. Note that unseen terms should be rare in the test corpus if your training and test distributions are similar. So even if a few terms are dropped, you'll still have plenty of terms to classify the doc.

Resources