Text corpus clustering - machine-learning

I have 27,000 free-text elements, each of around 2-3 sentences. I need to cluster these by similarity. So far, I have had pretty limited success. I have tried the following:
I used Python's Natural Language Toolkit (NLTK) to remove stop words, lemmatize and tokenize, then generated semantically similar words for each word in each sentence before inserting them into a Neo4j graph database. I then tried querying the graph using the TF counts for each word and its related words. That didn't work very well; it only made it easy to calculate the similarity between two text items.
I then looked at GraphAware's NLP library to annotate, enrich and calculate the cosine similarity between each pair of text items. After 4 days of processing similarities I checked the log, only to find that it would take 1.5 years to finish. Apparently the community version of the plugin isn't optimised, so I guess it's not appropriate for this volume of data.
I then wrote a custom implementation that took the same approach as the GraphAware plugin, but in much simpler form. I used scikit-learn's TfidfVectorizer to calculate the cosine similarity between each text item and every other text item and saved those as relationships between the Neo4j nodes. However, with 27,000 text items that creates 27,000 * 27,000 = 729,000,000 relationships! The intention was to load the graph into Gephi, keep only relationships above some similarity threshold X, and use modularity clustering to extract clusters. Processing for this takes around 4 days, which is much better; it is still running and not yet complete. However, I believe Gephi has a maximum edge count of 1M, so I expect this to restrict what I can do.
So I turned to more conventional ML techniques using scikit-learn's KMeans, DBSCAN, and MeanShift algorithms. I am getting clusters, but when they are plotted on a scatter chart there is no separation (I can show code if that would help). Here is what I get with DBSCAN:
I get similar results with KMeans. These algorithms run within a few seconds, which obviously makes life easier, but the results seem poor.
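For reference, a rough sketch of the kind of scikit-learn pipeline I mean (the vectorizer options and the DBSCAN eps/min_samples values here are placeholders, not my exact settings):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

texts = ["first free text item", "second free text item"]  # placeholder for the 27,000 items

# TF-IDF vectors with English stop words removed
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# DBSCAN can run directly on the sparse TF-IDF matrix with a cosine metric,
# which avoids materialising the full 27,000 x 27,000 similarity matrix
labels = DBSCAN(eps=0.5, min_samples=5, metric="cosine").fit_predict(X)
# label -1 marks points DBSCAN considers noise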
So my questions are:
Is there a better approach to this?
Can I expect to find distinct clusters at all in free text?
What should my next move be?
Thank you very much.

I think your question is too general to be a good fit for Stack Overflow, but to give you some pointers...
What is your data? You discuss your methods in detail but not your data. What sort of clusters are you expecting?
Example useful description:
I have a bunch of short product reviews. I expect to be able to separate reviews of shoes, hats, and refrigerators.
Have you tried topic modelling? It's not fancy but it's a traditional method of sorting textual documents into clusters. Start with LDA if you're not familiar with anything.
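For example, a minimal LDA sketch with scikit-learn (the corpus and the number of topics here are placeholders to tune for your data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["review of running shoes", "review of a winter hat", "review of a fridge"]  # placeholder corpus

# LDA works on raw term counts rather than TF-IDF
counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)      # one topic distribution per document
clusters = doc_topics.argmax(axis=1)        # hard-assign each document to its strongest topic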
Are you looking for duplicates? If you're looking for plagiarism or bot-generated spam, look into MinHash, SimHash, and the FuzzyWuzzy library for Python.
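For instance, a quick near-duplicate score with FuzzyWuzzy (just an illustration; MinHash via the datasketch library scales better for large corpora):

from fuzzywuzzy import fuzz

a = "The quick brown fox jumps over the lazy dog"
b = "A quick brown fox jumped over a lazy dog"

# token_set_ratio ignores word order and repeated tokens; returns a score from 0 to 100
print(fuzz.token_set_ratio(a, b))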

Related

Recommendation engine without ratings

I have found what must be dozens of articles on Towards Data Science, Medium, etc. of people making recommendation engines with IMDb data (based on the ratings users gave to movies, which movies should we recommend to those users).
These articles begin with 'memory based approaches' of user-based content filtering and item-based content filtering.
I have been tasked with making a recommendation engine, and since none of the suits really care or know anything about this, I want to do the bare minimum (which seems to be user-based content filtering).
Problem is, all of my data is binary (no ratings; just based on the items that other users bought, should we recommend items to similar users?). This is actually similar to the cartoons that all of the Medium articles have stolen from each other, but none of the Medium articles give an example of how to do that.
All of the articles use Pearson correlation or cosine similarity to determine user similarity. Can I use these approaches with binary dimensions (bought or not)? If so, how? And if not, is there a different way to measure user similarity?
I am working with Python, by the way. I was also thinking of maybe using Hamming distance (is there a reason that wouldn't be good?).
Similarity-score-based approaches do work even with binary dimensions. When you have scores, two similar users may look like [5,3,2,0,1] and [4,3,3,0,0], whereas in your case it would be something like [1,1,1,0,1] and [1,1,1,0,0].
from scipy.spatial.distance import cosine

# scipy's cosine() returns the cosine distance, so 1 - cosine() gives the similarity
1 - cosine([5, 3, 2, 0, 1], [4, 3, 3, 0, 0])
# 0.961161313666907
1 - cosine([1, 1, 1, 0, 1], [1, 1, 1, 0, 0])
# 0.8660254037844386
Another approach: if you can get the number of times a user bought a product, that count can be used as a rating, and then similarities can be calculated.
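A rough sketch of that idea (the column names and data here are made up for illustration):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# one row per purchase event
purchases = pd.DataFrame({"user": ["u1", "u1", "u2", "u2", "u3"],
                          "item": ["a", "a", "a", "b", "c"]})

# purchase counts act as implicit "ratings"
counts = pd.crosstab(purchases["user"], purchases["item"])

user_sim = cosine_similarity(counts)   # user-user cosine similarity matrix
print(user_sim)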
The data you have is implicit data, which means the interactions do not necessarily indicate the user's interest; they are just interactions. An interaction value of 1 and an interaction value of 1000 make no difference in this case; they both show an interaction and nothing else, so memory-based algorithms are of little use here. If you are not familiar with neural networks, then you should at least use matrix factorization techniques to make a meaningful recommendation from this data. You can start with the surprise library
here, which has a bunch of matrix factorization models.
It will be better if you use ALS as the optimization technique, but SGD will also do the job. If you are OK with deep learning, I can refer you to the sources of the best work so far.
I once used the non-negative matrix factorization (NMF) algorithm in surprise for data like yours and the results were good enough.
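A minimal sketch of that with surprise (the column names, the toy data, and the rating scale are assumptions for illustration):

import pandas as pd
from surprise import NMF, Dataset, Reader

# binary interactions treated as a "rating" of 1 (placeholder data)
df = pd.DataFrame({"user": ["u1", "u1", "u2", "u3"],
                   "item": ["a", "b", "a", "c"],
                   "bought": [1, 1, 1, 1]})

data = Dataset.load_from_df(df[["user", "item", "bought"]], Reader(rating_scale=(0, 1)))
trainset = data.build_full_trainset()

algo = NMF(n_factors=5)
algo.fit(trainset)
print(algo.predict("u3", "a").est)   # estimated preference of user u3 for item a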
It seems that in your situation the best approach would be collaborative filtering. You don't need scores; all you need is a user-item interaction matrix. The simplest algorithm in this case is Alternating Least Squares (ALS).
There are already a few implementations in Python, for instance this one. Also,
there's an implementation in PySpark's recommendation module.
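A minimal PySpark sketch of implicit-feedback ALS (the column names and parameters here are placeholders):

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.getOrCreate()

# (user, item, interaction) triples; all interactions are implicit "1"s
df = spark.createDataFrame([(0, 10, 1.0), (0, 11, 1.0), (1, 10, 1.0)],
                           ["userId", "itemId", "bought"])

# implicitPrefs=True switches ALS to the implicit-feedback formulation
als = ALS(userCol="userId", itemCol="itemId", ratingCol="bought",
          implicitPrefs=True, rank=10, regParam=0.1)
model = als.fit(df)

model.recommendForAllUsers(2).show()   # top-2 item recommendations per user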

How can I group text questions that are similar to each other?

I have a dataset of 200k questions, and I would like to group them together by similarity/duplicates.
How can I use NLP/machine learning to group these questions with similar intents together?
Given a question and a list of questions, how can I find the question or questions that are similar or duplicates?
Are there any services that can do this?
Generally, you'd want to convert the questions into an abstract numerical format (such as a single high-dimensional vector, or 'bags of words/vectors'), from which it is then possible to calculate numerical pairwise similarities between questions.
For example: you could turn each question into a simple average of the word-vectors for its individual words. (Those word-vectors might come from your own training corpus, that matches the questions' usage domain exactly, or from some other outside source that's good enough.)
If the word-vectors are 300-dimensional, averaging all the words-vectors of a question together then gives you a 300-dimensional vector for the question. You can then use a typical measure of vector-similarity, such as "cosine similarity", to get a number from -1.0 to 1.0 for each pair of questions, with larger values indicating "more similar".
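A minimal sketch of that averaging approach with gensim (assuming gensim 4.x; the tiny corpus here is a placeholder):

import numpy as np
from gensim.models import Word2Vec

# pre-tokenized placeholder questions; in practice, train on your full corpus
questions = [["how", "do", "i", "reset", "my", "password"],
             ["how", "to", "change", "my", "password"]]

w2v = Word2Vec(questions, vector_size=100, min_count=1)

def question_vector(tokens):
    # simple average of the word-vectors for the question's words
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vectors, axis=0)

a, b = question_vector(questions[0]), question_vector(questions[1])
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))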
Such a simple approach is often a good baseline. Being smarter about dropping some words, or weighting words by their observed significance (e.g. by TF-IDF weighting), may improve it.
But there are other ways to get summary vectors that may work better than a simple average. One relatively straightforward algorithm, largely similar to the way word-vectors are created, is called "Paragraph Vectors", and is sometimes called in popular libraries (like Python gensim) "Doc2Vec". It's not quite a simple average of word-vectors, as much as creating a synthetic word-like token for a full text, which then is trained to be as good as possible at predicting the text's words. Again, once you have a (for example) 300-dimensional text-vector, calculating cosine-similarity can rank question similarities.
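A minimal Doc2Vec sketch with gensim (again assuming gensim 4.x and placeholder data):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tokenized = [["how", "do", "i", "reset", "my", "password"],
             ["recommend", "a", "good", "laptop"]]   # placeholder corpus

corpus = [TaggedDocument(words, [i]) for i, words in enumerate(tokenized)]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# infer a vector for a new question, then rank the training questions by similarity
vec = model.infer_vector(["how", "to", "change", "my", "password"])
print(model.dv.most_similar([vec], topn=2))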
There's also an interesting algorithm called "Word Mover's Distance", which leaves the texts as variable-sized bags of each constituent word-vector, as if each word-vector was a pile-of-meaning. It then calculates the "effort" to move the piles from one text's shape-of-piles, to another text's – and less effort seems to correlate well with humans' sense of text similarity. (However, finding these minimal-shifts is a lot more computationally expensive than simple cosine-similarity – so this works best with short texts, or small corpuses, or when you can massively parallelize the computation.)
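gensim also exposes this as wmdistance on a set of word-vectors (it needs an optional dependency such as POT/pyemd installed); a tiny sketch:

from gensim.models import Word2Vec

sentences = [["reset", "my", "password"], ["change", "my", "password"]]
w2v = Word2Vec(sentences, vector_size=50, min_count=1)

# lower distance = more similar texts
distance = w2v.wv.wmdistance(sentences[0], sentences[1])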
Once you have any of these numeric-similarity measures working, you can also use clustering algorithms to find groups of highly related questions – and often, once you have those groups, the most-common words in those groups (as opposed to others), or human editorial work, can name the groups.

unsupervised learning on sentences

I have data that represents comments from operators on various activities performed on an industrial device. A comment could reflect either a routine maintenance/replacement activity, or indicate that some damage occurred and had to be repaired.
I have a set of 200,000 sentences that need to be classified into two buckets: Repair or Scheduled Maintenance (or undetermined). These have no labels, hence I am looking for an unsupervised-learning-based solution.
Some sample data is as shown below:
"Motor coil damaged .Replaced motor"
"Belt cracks seen. Installed new belt"
"Occasional start up issues. Replaced switches"
"Replaced belts"
"Oiling and cleaning done".
"Did a preventive maintainence schedule"
The first three sentences have to be labeled as Repair, while the last three as Scheduled Maintenance.
What would be a good approach to this problem? Though I have some exposure to machine learning, I am new to NLP-based machine learning.
I see many papers related to this, e.g.:
https://pdfs.semanticscholar.org/a408/d3b5b37caefb93629273fa3d0c192668d63c.pdf
https://arxiv.org/abs/1611.07897
but I wanted to understand whether there is any standard approach to such problems.
Seems like you could use some reliable keywords (verbs, it seems, in this case) to create training samples for an NLP classifier. Or you could use KMeans or KMedoids clustering with K=2, which would do a pretty good job of separating the set. If you want to get really involved, you could use something like Latent Dirichlet Allocation, which is a form of unsupervised topic modeling. However, for a problem like this, on the small amount of data you have, the fancier you get the more frustrated with the results you will become, IMO.
Both OpenNLP and StanfordNLP have text classifiers for this, so I recommend the following if you want to go the classification route:
- Use keyword searches to produce a few thousand examples of your two categories (a rough sketch of this and the next step follows the list)
- Put those sentences in a file with a label based on the OpenNLP format (label |space| sentence |newline|)
- Train a classifier with the OpenNLP DocumentClassifier; I recommend stemming for one of your feature generators
- After you have the model, use it in Java and classify each sentence
- Keep track of the scores, and quarantine low scores (you will have ambiguous classes, I'm sure)
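A rough sketch of the first two steps (the keyword stems here are made-up examples you'd refine for your comments):

import re

repair_kw = re.compile(r"damag|crack|broken|leak|issue|fail", re.I)
maint_kw = re.compile(r"replac|oiling|cleaning|preventive|scheduled|maintenance|maintainence", re.I)

def label(sentence):
    # crude keyword rules just to bootstrap training examples
    if repair_kw.search(sentence):
        return "Repair"
    if maint_kw.search(sentence):
        return "Maintenance"
    return None

sentences = ["Motor coil damaged. Replaced motor",
             "Did a preventive maintenance schedule"]

# OpenNLP doccat format: label <space> sentence, one example per line
with open("train.txt", "w") as f:
    for s in sentences:
        lab = label(s)
        if lab:
            f.write(f"{lab} {s}\n")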
If you don't want to go that route, I recommend using a text-indexing technology du jour like Solr or Elasticsearch, or your favorite RDBMS's text indexing, to perform a "More Like This" type of function so you don't have to play the machine-learning continuous-model-updating game.

Clustering of news articles

My scenario is pretty straightforward: I have a bunch of news articles (~1k at the moment) for which I know that some cover the same story/topic. I now would like to group these articles based on shared story/topic, i.e., based on their similarity.
What I did so far is to apply basic NLP techniques including stopword removal and stemming. I also calculated the tf-idf vector for each article, and with this I can also calculate, e.g., the cosine similarity based on these tf-idf vectors. But now I struggle a bit with the grouping of the articles. I see two principal ways -- probably related -- to do it:
1) Machine Learning / Clustering: I already played a bit with existing clustering libraries, with more or less success; see here. On the one hand, algorithms such as k-means require the number of clusters as input, which I don't know. Other algorithms require parameters that are also not intuitive to specify (for me that is).
2) Graph algorithms: I can represent my data as a graph with the articles being the nodes and weighted edges representing the pairwise (cosine) similarity between the articles. With that, for example, I can first remove all edges that fall below a certain threshold and then apply graph algorithms to look for strongly connected subgraphs (a small sketch of this is shown below).
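A small sketch of option 2 with networkx (the similarity matrix and threshold are placeholders):

import numpy as np
import networkx as nx

# pairwise cosine similarities already computed from the tf-idf vectors (placeholder values)
similarity = np.array([[1.0, 0.8, 0.1],
                       [0.8, 1.0, 0.2],
                       [0.1, 0.2, 1.0]])
threshold = 0.5

G = nx.Graph()
G.add_nodes_from(range(similarity.shape[0]))
for i in range(similarity.shape[0]):
    for j in range(i + 1, similarity.shape[0]):
        if similarity[i, j] >= threshold:
            G.add_edge(i, j, weight=similarity[i, j])

# each connected component is one candidate group of articles covering the same story
groups = list(nx.connected_components(G))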
In short, I'm not sure where best to go from here -- I'm still pretty new in this area. I wonder if there are some best practices for that, or some kind of guidelines on which methods / algorithms can (not) be applied in certain scenarios.
(EDIT: forgot to link to related question of mine)
Try the class of Hierarchical Agglomerative Clustering (HAC) algorithms with single and complete linkage.
These algorithms do not need the number of clusters as input.
The basic principle is similar to growing a minimal spanning tree across a given set of data points and then stopping based on a threshold criterion. A closely related class is the divisive clustering algorithms, which first build up the minimal spanning tree and then prune off branches of the tree based on inter-cluster similarity ratios.
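A minimal sketch with scipy (the data here is a random placeholder; the distance threshold needs tuning):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(10, 5)   # placeholder for your dense tf-idf matrix (documents x terms)

dist = pdist(X, metric="cosine")                     # condensed pairwise distance matrix
Z = linkage(dist, method="complete")                 # complete linkage; "single" also works
labels = fcluster(Z, t=0.7, criterion="distance")    # cut at a distance threshold, no k needed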
You can also try a canopy variation on k-means to create a relatively quick estimate for the number of clusters (k).
http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
Will you be recomputing over time or do you only care about a static set of news? I ask because your k may change a bit over time.
Since you can model your dataset as a graph, you could apply stochastic clustering based on Markov models. Here are links to resources on the MCL algorithm:
Official thesis description and code base
Gephi plugin for MCL (to experiment and evaluate the method)

Mahout: RowSimilarity vs Clustering

I was trying to cluster some documents using the KMeansClustering approach and successfully created the clusters. I saved the cluster id corresponding to a particular document for recommendations. So whenever I wanted to recommend documents similar to a particular document, I would query all the documents in a particular cluster and return n random documents from the cluster. However, returning any random document from the cluster did not seem appropriate and I read somewhere that we should be returning the documents nearest to the document in question.
So I started searching for ways to calculate the distance between documents and stumbled upon the RowSimilarity approach, which returns the 10 most similar documents to each document, ordered by distance. Now this approach relies on a similarity metric like LogLikelihood etc. to calculate the distance between documents.
Now my question is this: how is clustering better or worse than RowSimilarity, given that both approaches use a similarity/distance metric to calculate the distance between documents?
What I'm trying to achieve is that I'm trying to cluster products on the basis of their titles and other text properties to recommend similar products. Any help is appreciated.
Clustering is not just another variant of classification or recommendation. It is a different discipline.
When you are doing cluster analysis, you want to discover structure in the data. But then, you should actually be analyzing the structure you found.
Now k-means is not really meant for documents. It tries to find a near optimal partitioning of a data set into k Voronoi cells. Unless you have a good reason to believe that Voronoi cells are a good partitioning for your data, the algorithm may be pretty much useless. Just because it returns a result does not at all indicate that the result is useful.
For documents, Euclidean distance (and k-means is in fact optimizing Euclidean distances) are usually pretty much meaningless. The vectors are very sparse, and k-means cluster centers will then often resemble impossible (and thus insensible) "average documents".
And I haven't even started on the need to find an appropriate value of k, on the Mahout implementation likely just being an approximation of Lloyd's k-means approximation, and so on. Did you even check the cluster sizes? In situations like these, k-means will often produce degenerate results, for example almost all clusters containing 1 or 0 elements, and a mega-cluster containing the rest. In this situation, you might in fact be returning just random documents from your database...
Just because you can use it does not mean it is helpful. Make sure to validate the individual steps of your approach, for example if the clusters are in any way useful and sensible!
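For example, a quick sanity check of the cluster-size distribution (labels being whatever your clustering returned):

from collections import Counter

labels = [0, 0, 0, 0, 0, 0, 1, 2]   # placeholder cluster assignments

# one mega-cluster plus many singletons is a warning sign of a degenerate result
print(Counter(labels).most_common())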
Similarity is not the same thing as distance -- one is big when the other is small. Clustering is not the same as computing distances either. First you should decide whether you have a clustering problem -- it does not sound like you do based on what you say. So, don't use k-means.
