Get sentence vector for a K-means clustering task - machine-learning

I am working on a project which groups jobs posted on various job portals into clusters based on the description of the jobs using K-means.
I found the work vector using Word2Vec, but i guess this will not serve the purpose as I will need a vector of the whole job description.
I know that I can average out the word vector of a sentence to get the sentence vector but worried about the accuracy as this will loose the ordering of the words.
Is there any other way I can get the vectors ?

The most using approaches for text vectorization:
Pure TF-IDF, still can be useful, especially using n-grams.
Using Word2Vec to get vectors for the words. For the whole text using the mean value of all vectors.
Combine the first two methods: get a weighted mean of all words in the text using the coefficients from the TF-IDF.
I would suggest trying each and pick what is performed better in your case. The results can be slightly different depends on the nature of the data.

You can facilitate transfer learning by very useful sentence embedding methods such as Bert-as-service or SentenceBert or even Universal Sentence encoding. All of them are easy to use and full of tutorials on the web. They will work better then TF-IDF in most cases.

You can also try doc2vec, an extension of word2vec that builds representations of a whole document. There is an implementation in gensim available:
https://radimrehurek.com/gensim/models/doc2vec.html

Related

Is it a good idea to use word2vec for encoding of categorical features?

I am facing a binary prediction task and have a set of features of which all are categorical. A key challenge is therefore to encode those categorical features to numbers and I was looking for smart ways to do so.
I stumbled over word2vec, which is mostly used for NLP, but I was wondering whether I could use it to encode my variables, i.e. simply take the weights of the neural net as the encoded features.
However, I am not sure, whether it is a good idea since, the context words, which serve as the input features in word2vec are in my case more or less random, in contrast to real sentences which word2vec was originially made for.
Do you guys have any advice, thoughts, recommendations on this?
You should look into entity embedding if you are searching for a way to utilize embeddings for categorical variables.
google has a good crash course on the topic: https://developers.google.com/machine-learning/crash-course/embeddings/categorical-input-data
this is a good paper on arxiv written by a team from a Kaggle competition: https://arxiv.org/abs/1604.06737
It's certainly possible to use the word2vec algorithm to train up 'dense embeddings' for things like keywords, tags, categories, and so forth. It's been done, sometimes beneficially.
Whether it's a good idea in your case will depend on your data & goals – the only way to know for sure is to try it, and evaluate the results versus your alternatives. (For example, if the number of categories is modest from a controlled vocabulary, one-hot encoding of the categories may be practical, and depending on the kind of binary classifier you use downstream, the classifier may itself be able to learn the same sorts of subtle interrelationships between categories that could also otherwise be learned via a word2vec model. On the other hand, if categories are very numerous & chaotic, the pre-step of 'compressing' them into a smaller-dimensional space, where similar categories have similar representational vectors, may be more helpful.)
That such tokens don't quite have the same frequency distributions & surrounding contexts as true natural language text may mean it's worth trying a wider range of non-default training options on any word2vec model.
In particular, if your categories don't have a natural ordering giving rise to meaningful near-neighbors relationships, using a giant window (so all words in a single 'text' are in each others' contexts) may be worth considering.
Recent versions of the Python gensim Word2Vec allow changing a parameter named ns_exponent – which was fixed at 0.75 in many early implementations, but at least one paper has suggested can usefully vary far from that value for certain corpus data and recommendation-like applications.

Document similarity: Vector embedding versus Tf-Idf performance?

I have a collection of documents, where each document is rapidly growing with time. The task is to find similar documents at any fixed time. I have two potential approaches:
A vector embedding (word2vec, GloVe or fasttext), averaging over word vectors in a document, and using cosine similarity.
Bag-of-Words: tf-idf or its variations such as BM25.
Will one of these yield a significantly better result? Has someone done a quantitative comparison of tf-idf versus averaging word2vec for document similarity?
Is there another approach, that allows to dynamically refine the document's vectors as more text is added?
doc2vec or word2vec ?
According to article, the performance of doc2vec or paragraph2vec is poor for short-length documents. [Learning Semantic Similarity for Very Short Texts, 2015, IEEE]
Short-length documents ...?
If you want to compare the similarity between short documents, you might want to vectorize the document via word2vec.
how construct ?
For example, you can construct a document vector with a weighted average vector using tf-idf.
similarity measure
In addition, I recommend using ts-ss rather than cosine or euclidean for similarity.
Please refer to the following article or the summary in github below.
"A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering"
https://github.com/taki0112/Vector_Similarity
thank you
You have to try it: the answer may vary based on your corpus and application-specific perception of 'similarity'. Effectiveness may especially vary based on typical document lengths, so if "rapidly growing with time" also means "growing arbitrarily long", that could greatly affect what works over time (requiring adaptations for longer docs).
Also note that 'Paragraph Vectors' – where a vector is co-trained like a word vector to represent a range-of-text – may outperform a simple average-of-word-vectors, as an input to similarity/classification tasks. (Many references to 'Doc2Vec' specifically mean 'Paragraph Vectors', though the term 'Doc2Vec' is sometimes also used for any other way of turning a document into a single vector, like a simple average of word-vectors.)
You may also want to look at "Word Mover's Distance" (WMD), a measure of similarity between two texts that uses word-vectors, though not via any simple average. (However, it can be expensive to calculate, especially for longer documents.) For classification, there's a recent refinement called "Supervised Word Mover's Distance" which reweights/transforms word vectors to make them more sensitive to known categories. With enough evaluation/tuning data about which of your documents should be closer than others, an analogous technique could probably be applied to generic similarity tasks.
You also might consider trying Jaccard similarity, which uses basic set algebra to determine the verbal overlap in two documents (although it is somewhat similar to a BOW approach). A nice intro on it can be found here.

How to prepare feature vectors for text classification when the words in the text is not frequently repeating?

I need to perform the text classification on set of emails. But all the words in my text are thinly sparse i.e frequency of each word with respect to all the documents are very less. words are not that much frequently repeating. Since to train the classifiers I think document term matrix with frequency as weightage is not suitable. Can you please suggest me what kind of other methods I need to use .
Thanks
The real problem will be, that if your words are that sparse a learned classifier will not generalise to the real world data. However, there are several solutions to it
1.) Use more data. This is kind-of a no-brainer. However, you can not only add labeled data you can also use unlabelled data in a semi-supervised learning
2.) Use more data (part b). You can look into the transfer learning setting. There you build a classifier on a large data set with similar characteristics. This might be twitter streams and then adapt this classifier to your domain
3.) Get your processing pipeline right. Your problem might origin from a suboptimal processing pipeline. Are you doing stemming? In the email the word steming should be mapped onto stem. This can be pushed even further by using synonym matching with a dictionary.

Features in Document clustering/classification?

This may sound very naive but i just wanted to be sure that when talking in Machine Learning terminology, features in Document Clustering is words which are chosen from a document, if some are discarded after stemming or as stop-words.
I am trying to use LibSvm library and it says that there are different approaches for different types of { no_of_instances, no_of_features }.
Like if no_of_instances is much lower than no_of_features, linear kernels would do. if both are large, linear would be fast. However, if no_of_features is small, non-linear kernels is better.
So for my document clustering/classification, i have small number of documents like 100 each may have words around 2000. So i fall in small no_of_instances and large no_of_features category depending upon what i consider a feature is.
I would like to use tf-idf for the document.
So Is no_of_features is the size of the vector that i get from tf-idf ?
What you are talking about here is just one of possibilities, in fact the most trivial way of defining features for documents. In machine learning terminology feature is any mapping from the input space (in this particular example - from space of documents) into some abstract space, which is suited for a particular machine learning model. Most of the ML models (like neural networks, support vector machines, etc.) work on the numerical vectors, so features has to be mappings from documents to (constant size) vectors of numbers. This is a reason for sometimes choosing a representation of bag of owrds, where we have a words' count vector as a document representation. This limitation can be overcomed by using specific models, like for example Naive Bayes (or a custom kernel for SVM, which enables them to work with nonnumeric data), which can work on any objects, as long as we can define perticular conditional probabilities - here, the most basic approach is treating document containing a particular word or not as a "feature". In general this is not the only possibility, there are dozens of methods that use statistical features, semantic features (based on some ontologies like wordnet) etc.
To sum up - this is only one, most simple representation of document for the machine learning model. Good to start with, good to understand the basics, but far from being a "feature definition".
EDIT
no_of_features is size of the vector you use for your documents' representation, so if you use tf-idf, then size of resulting vecor is a no_of_featuers.

How can HMMs be used for handwriting recognition?

The problem is a bit different than traditional handwriting recognition. I have a dataset that are thousands of the following. For one drawn character, I have several sequential (x, y) coordinates where the pen was pressed down. So, this is a sequential (temporal) problem.
I want to be able to classify handwritten characters based on this data, and would love to implement HMMs for learning purposes. But, is this the right approach? How can they be used to do this?
I think HMM can be used in both problems mentioned by #jens. I'm working on online handwriting too, and HMM is used in many articles. The simplest approach is like this:
Select a feature.
If selected feature is continuous convert it to discrete.
Choose HMM parameters: Topology and # of states.
Train character models using HMM. one model for each class.
Test using test set.
for each item:
the simplest feature is angle of vector which connects consecutive
points. You can use more complicated features like angles of vectors
obtained by Douglas & Peucker algorithm.
the simplest way for discretization is using Freeman codes, but
clustering algorithms like k-means and GMM can be used too.
HMM topologies: Ergodic, Left-Right, Bakis and Linear. # of states
can be obtained by trial & error. HMM parameters can be variable for
each model. # of observations is determined by discretization.
observation samples can be have variable length.
I recommend Kevin Murphy HMM toolbox.
Good luck.
This problem is actually a mix of two problems:
recognizing one character from your data
recognizing a word from a (noisy) sequence of characters
A HMM is used for finding the most likely sequence of a finite number of discrete states out of noisy measurements. This is exactly problem 2, since noisy measurements of discrete states a-z,0-9 follow eachother in a sequence.
For problem 1, a HMM is useless because you aren't interested in the underlying sequence. What you want is to augment your handwritten digit with information on how you wrote it.
Personally, I would start by implementing regular state-of-the-art handwriting recognition which already is very good (with convolutional neural networks or deep learning). After that, you can add information about how it was written, for example clockwise/counterclockwise.

Resources