From text to K-Means Vectors input - machine-learning

I've just started diving into Machine Learning, specifically into Clustering. (I'm using Python but this is irrelevant)
My goal is, starting from a collection of tweets (100K) about fashion world, to perform KMeans over their text.
Till now I've filtered texts, truncating stopwords, useless terms, punctuation; done lemmatization (exploiting Part Of Speech tagging for better results).
I show the user the most frequent terms, hashtags, bigrams, trigrams,..9grams so that he can refine preprocessing adding words to useless terms.
My initial idea was to use the top n(1K) terms as features,
creating foreach tweet a vector of fixed size n(1K)
having a cell set to a value if the top term (of this cell) appear in this tweet (maybe calculating the cell's value with TFIDF).
Am I missing something(the 0 values will be considered)? Can I exploit n-grams in some way?
This scikit article is pretty general and I'm not understanding the whole thing.
(Is LSA dimensionality reduction useful or is it better reducing the number of features (so vectors dimension) manually? )

This other sklearn page contains an example of k-means clustering of texts.
But to address some of your specific questions:
My initial idea was to use the top n(1K) terms as features, creating foreach tweet a vector of fixed size n(1K) having a cell set to a value if the top term (of this cell) appear in this tweet (maybe calculating the cell's value with TFIDF).
A standard approach to achieve that is to use sklearn's CountVectorizer and playing with the parameter min_df.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(min_df=10)
X = cv.fit_transform(texts)
The above piece of code converts an array of texts into features X. Setting min_df=10 will ignore all words with less than 10 occurrences (to my understanding, there is no direct way to say "take the top 1000" but this is equivalent).
Can I exploit n-grams in some way?
Yes, CountVectorizer can deal with n-grams. The ngram_range parameter specifies the range of ngrams to consider (which starting "n" and which ending "n"). For instance,
cv = CountVectorizer(min_df=10, ngram_range=(2,2))
will build features based on bigrams instead of individual words (unigrams). For mixing unigrams and bigrams
cv = CountVectorizer(min_df=10, ngram_range=(2,2))
Then you can replace a CountVectorizer by a TfIdfVectorizer, which transforms the word counts to weight more informative words.
Is LSA dimensionality reduction useful or is it better reducing the number of features (so vectors dimension) manually?
Short answer, it depends on your purpose. The example in the link I mentioned above does apply LSA first. But also, in my experience, "topic model" methods like LSA or NMF can be already considered a clustering into latent semantic topics. For instance,
from sklearn.decomposition import NMF
nmf = NMF(n_components=20)
mu = nmf.fit_transform(X)
This will convert the features X into projected feature vectors mu of 20 dimensions. Each dimension d can be interpreted as the score of the text in topic d. By assigning each sample to the dimension with max score, this can also be interpreted as a clustering.

Related

Document classification using word vectors

While I was classifying and clustering the documents written in natural language, I came up with a question ...
As word2vec and glove, and or etc, vectorize the word in distributed spaces, I wonder if there are any method recommended or commonly used for document vectorization USING word vectors.
For example,
Document1: "If you chase two rabbits, you will lose them both."
can be vectorized as,
[0.1425, 0.2718, 0.8187, .... , 0.1011]
I know about the one also known as doc2vec, that this document has n dimensions just like word2vec. But this is 1 x n dimensions and I have been testing around to find out the limits of using doc2vec.
So, I want to know how other people apply the word vectors for applications with steady size.
Just stacking vectors with m words will be formed m x n dimensional vectors. In this case, the vector dimension will not be uniformed since dimension m will depends on the number of words in document.
If: [0.1018, ... , 0.8717]
you: [0.5182, ... , 0.8981]
..: [...]
m th word: [...]
And this form is not favorable form to run some machine learning algorithms such as CNN. What are the suggested methods to produce the document vectors in steady form using word vectors?
It would be great if it is provided with papers as well.
Thanks!
The most simple approach to get a fixed-size vector from a text, when all you have is word-vectors, to average all the word-vectors together. (The vectors could be weighted, but if they haven't been unit-length-normalized, their raw magnitudes from training are somewhat of an indicator of their strength-of-single-meaning – polysemous/ambiguous words tend to have vectors with smaller magnitudes.) It works OK for many purposes.
Word vectors can be specifically trained to be better at composing like this, if the training texts are already associated with known classes. Facebook's FastText in its 'classification' mode does this; the word-vectors are optimized as much or more for predicting output classes of the texts they appear in, as they are for predicting their context-window neighbors (classic word2vec).
The 'Paragraph Vector' technique, often called 'doc2vec', gives every training text a sort-of floating pseudoword, that contributes to every prediction, and thus winds up with a word-vector-like position that may represent that full text, rather than the individual words/contexts.
There are many further variants, including some based on deeper predictive networks (eg 'Skip-thought Vectors'), or slightly different prediction targets (eg neighboring sentences in 'fastSent'), or other genericizations that can even include a mixture of symbolic and numeric inputs/targets during training (an option in Facebook's StarSpace, which explores other entity-vectorization possibilities related to word-vectors and FastText-like classification needs).
If you don't need to collapse a text to fixed-size vectors, but just compare texts, there are also techniques like "Word Mover's Distance" which take the "bag of word-vectors" for one text, and another, and give a similarity score.

Features of vector form of sentences for opinion finding.

I want to find the opinion of a sentence either positive or negative. For example talk about only one sentence.
The play was awesome
If change it to vector form
[0,0,0,0]
After searching through the Bag of words
bad
naughty
awesome
The vector form becomes
[0,0,0,1]
Same for other sentences. Now I want to pass it to the machine learning algorithm for training it. How can I train the network using these multiple vectors? (for finding the opinion of unseen sentences) Obviously not! Because the input is fix in neural network. Is there any way? The above procedure is just my thinking. Kindly correct me if I am wrong. Thanks in advance.
Since your intuitive input format is "Sentence". Which is, indeed, a string of tokens with arbitrary length. Abstracting sentences as token series is not a good choice for many existing algorithms only works on determined format of inputs.
Hence, I suggest try using tokenizer on your entire training set. This will give you vectors of length of the dictionary, which is fixed for given training set.
Because when the length of sentences vary drastically, then size of the dictionary always keeps stable.
Then you can apply Neural Networks(or other algorithms) to the tokenized vectors.
However, vectors generated by tokenizer is extremely sparse because you only work on sentences rather than articles.
You can try LDA (supervised, not PCA), to reduce the dimension as well as amplify the difference.
That will keep the essential information of your training data as well as express your data at fixed size, while this "size" is not too large.
By the way, you may not have to label each word by its attitude since the opinion of a sentence also depends on other kind of words.
Simple arithmetics on number of opinion-expressing words many leave your model highly biased. Better label the sentences and leave the rest job to classifiers.
For the confusions
PCA and LDA are Dimensional Reduction techniques.
difference
Let's assume each tuple of sample is denoted as x (1-by-p vector).
p is too large, we don't like that.
Let's find a matrix A(p-by-k) in which k is pretty small.
So we get reduced_x = x*A, and most importantly, reduced_x must
be able to represent x's characters.
Given labeled data, LDA can provide proper A that can maximize
distance between reduced_x of different classes, and also minimize
the distance within identical classes.
In simple words: compress data, keep information.
When you've got
reduced_x, you can define training data: (reduced_x|y) where y is
0 or 1.

Am I using word-embeddings correctly?

Core question : Right way(s) of using word-embeddings to represent text ?
I am building sentiment classification application for tweets. Classify tweets as - negative, neutral and positive.
I am doing this using Keras on top of theano and using word-embeddings (google's word2vec or Stanfords GloVe).
To represent tweet text I have done as follows:
used a pre-trained model (such as word2vec-twitter model) [M] to map words to their embeddings.
Use the words in the text to query M to get corresponding vectors. So if the tweet (T) is "Hello world" and M gives vectors V1 and V2 for the words 'Hello' and 'World'.
The tweet T can then be represented (V) as either V1+V2 (add vectors) or V1V2 (concatinate vectors)[These are 2 different strategies] [Concatenation means juxtaposition, so if V1, V2 are d-dimension vectors, in my example T is 2d dimension vector]
Then, the tweet T is represented by vector V.
If I follow the above, then My Dataset is nothing but vectors (which are sum or concatenation of word vectors depending on which strategy I use).
I am training a deepnet such as FFN, LSTM on this dataset. But my results arent coming out to be great.
Is this the right way to use word-embeddings to represent text ? What are the other better ways ?
Your feedback/critique will be of immense help.
I think that, for your purpose, it is better to think about another way of composing those vectors. The literature on word embeddings contains examples of criticisms to these kinds of composition (I will edit the answer with the correct references as soon as I find them).
I would suggest you to consider also other possible approaches, for instance:
Using the single word vectors as input to your net (I do not know your architecture, but the LSTM is recurrent so it can deal with sequences of words).
Using a full paragraph embedding (i.e. https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
Summing them doesn't make any sense to be honest, because on summing them you get another vector which i don't think represents the semantics of "Hello World" or may be it does but it won't surely hold true for longer sentences in general
Instead it would be better to feed them as sequence as in that way it at least preserves sequence in meaningful way which seems to fit more to your problem.
e.g A hates apple Vs Apple hates A this difference would be captured when you feed them as sequence into RNN but their summation will be same.
I hope you get my point!

How to encode different size of feature vectors in SVM

I work on classifying some reviews (paragraphs) consists of multiple sentences. I classified them with bag-of-word features in Weka via libSVM. However, I had another idea which I don't know how to implement :
I thought creating syntactical and shallow-semantics based features per sentence in the reviews is worth to try. However, I couldn't find any way to encode those features sequentially, since a paragraph's sentence size varies. The reason that I wanted to keep those features in an order is that the order of sentence features may give a better clue for classification. For example, if I have two instances P1 (with 3 sentences) and P2 (2 sentences), I would have a space like that (assume each sentence has one binary feature as a or b):
P1 -> a b b /classX
P2 -> b a /classY
So, my question is that whether I can implement that classification of different feature sizes in feature space or not? If yes, is there any kind of classifier that I can use in Weka, scikit-learn or Mallet? I would appreciate any responses.
Thanks
Regardless of the implementation, an SVM with the standard kernels (linear, poly, RBF) requires fixed-length feature vectors. You can encode any information in those feature vectors by encoding as booleans; e.g. collect all syntactical/semantic features that occur in your corpus, then introduce booleans that represent that "feature such and such occurred in this document". If it's important to capture the fact that these features occur in multiple sentences, count them and use put the frequency in the feature vector (but be sure to normalize your frequencies by document length, as SVMs are not scale-invariant).
In case you are classifying textual data, I would suggest looking at "Rational Kernels" which are made on weighted finite transducers for classifying natural language texts. Rational Kernels can be applied on varied length vectors and are already implemented as an open source project (OpenFST).
It is the library's problem, since SVM itself does not require fixed-length feature vectors, it only need a kernel function, if you can provide a kernel function with varied length vector, it should be OK for SVM

How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of document. I use TF*IDF to calculate the term weight for each document vector. Then I could use this matrix to train a model for document classification.
I am looking forward to classify new document in future. But in order to classify it, I need to turn the document into a document-term vector first, and the vector should be composed of TF*IDF values, too.
My question is, how could I calculate the TF*IDF with just a single document?
As far as I understand, TF can be calculated based on a single document itself, but the IDF can only be calculated with a collection of document. In my current experiment, I actually calculate the TF*IDF value for the whole collection of documents. And then I use some documents as training set and the others as test set.
I just suddenly realized that this seems not so applicable to real life.
ADD 1
So there are actually 2 subtly different scenarios for classification:
to classify some documents whose content are known but label are not
known.
to classify some totally unseen document.
For 1, we can combine all the documents, both with and without labels. And get the TF*IDF over all of them. This way, even we only use the documents with labels for training, the training result will still contain the influence of the documents without labels.
But my scenario is 2.
Suppose I have the following information for term T from the summary of the training set corpus:
document count for T in the training set is n
total number of training documents is N
Should I calculate the IDF of t for a unseen document D as below?
IDF(t, D)= log((N+1)/(n+1))
ADD 2
And what if I encounter a term in the new document which didn't show up in the training corpus before?
How should I calculate the weight for it in the doc-term vector?
TF-IDF doesn't make sense for a single document, independent of a corpus. It's fundamentally about emphasizing relatively rare and informative words.
You need to keep corpus summary information in order to compute TF-IDF weights. In particular, you need the document count for each term and the total number of documents.
Whether you want to use summary information from the whole training set and test set for TF-IDF, or for just the training set is a matter of your problem formulation. If it's the case that you only care to apply your classification system to documents whose contents you have, but whose labels you do not have (this is actually pretty common), then using TF-IDF for the entire corpus is okay. If you want to apply your classification system to entirely unseen documents after you train, then you only want to use the TF-IDF summary information from the training set.
TF obviously only depends on the new document.
IDF, you compute only on your training corpus.
You can add a slack term to the IDF computation, or adjust it as you suggested. But for a reasonable training set, the constant +1 term will not have a whole lot of effect. AFAICT, in classic document retrieval (think: search), you don't bother to do this. Often, they query document will not become part of your corpus, so why would it be part of IDF?
For unseen words, TF calculation is not a problem as TF is a document specific metric. While computing IDF, you can use smoothed inverse document frequency technique.
IDF = 1 + log(total documents / document frequency of a term)
Here the lower bound for IDF is 1. So if a word is not seen in the training corpus, its IDF is 1. Since, there is no universally agreed single formula for computing tf-idf or even idf, your formula for tf-idf calculation is also reasonable.
Note that, in many cases, unseen terms are ignored if they don't have much impact in the classification task. Sometimes, people replace unseen tokens with a special symbol like UNKNOWN_TOKEN and do their computation.
Alternative of TF-IDF: Another way of computing weight of each term of a document is using Maximum Likelihood Estimation. While computing MLE, you can smooth using additive smoothing technique which is also known as Laplace smoothing. MLE is used in case you are using Generative models like Naive Bayes algorithm for document classification.

Resources