Apache Spark comes with a package for TF-IDF calculations that I find quite handy:
https://spark.apache.org/docs/latest/mllib-feature-extraction.html
Is there any equivalent, or maybe a way to do this with Dask? If so, can it also be done with horizontally scaled Dask (i.e., a cluster with multiple GPUs)?
This was also asked on the Dask Gitter, with the following reply by @stsievert:
The counting/hashing vectorizers are similar. They're in Dask-ML and are the same as TF-IDF without the normalization/IDF function.
I think this would be a good github issue/feature request.
Here is the link to the API for HashingVectorizer.
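For example, here is a rough sketch of getting hashed term counts out of Dask-ML (the corpus is an illustrative stand-in; note this gives hashed counts, not IDF-weighted values):

import dask.bag as db
from dask_ml.feature_extraction.text import HashingVectorizer

# Illustrative corpus; in practice this would be a large bag or series of documents.
corpus = db.from_sequence(
    ["the quick brown fox", "jumped over the lazy dog", "the dog barked"],
    npartitions=2,
)

vectorizer = HashingVectorizer()      # stateless, so it scales out naturally
X = vectorizer.fit_transform(corpus)  # lazy dask array of sparse blocks
print(X.compute())                    # materialize as a scipy sparse matrix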
Related
I am working on a project which groups jobs posted on various job portals into clusters based on the description of the jobs using K-means.
I computed word vectors using Word2Vec, but I guess this will not serve the purpose, as I need a vector for the whole job description.
I know that I can average the word vectors of a sentence to get a sentence vector, but I am worried about accuracy, as this loses the ordering of the words.
Is there any other way I can get the vectors ?
The most common approaches to text vectorization are:
Pure TF-IDF, which can still be useful, especially with n-grams.
Using Word2Vec to get vectors for the words, then taking the mean of all word vectors for the whole text.
Combine the first two methods: get a weighted mean of all words in the text using the coefficients from the TF-IDF.
I would suggest trying each and picking whichever performs better in your case. The results can differ depending on the nature of the data; a sketch of the third approach follows below.
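A minimal sketch of the third approach (TF-IDF-weighted mean of Word2Vec vectors); the toy corpus and model parameters are illustrative assumptions:

import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["senior python developer", "java backend engineer", "python data engineer"]
tokenized = [d.split() for d in docs]

w2v = Word2Vec(tokenized, vector_size=50, min_count=1, seed=0)
tfidf = TfidfVectorizer().fit(docs)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def doc_vector(tokens):
    # Weight each word vector by its IDF; fall back to 1.0 for unseen words.
    weights = [idf.get(t, 1.0) for t in tokens]
    vectors = [w2v.wv[t] for t in tokens]
    return np.average(vectors, axis=0, weights=weights)

X = np.vstack([doc_vector(t) for t in tokenized])
print(X.shape)  # (3, 50)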
You can leverage transfer learning via very useful sentence-embedding methods such as bert-as-service, SentenceBERT, or the Universal Sentence Encoder. All of them are easy to use, with plenty of tutorials on the web, and they will work better than TF-IDF in most cases.
You can also try doc2vec, an extension of word2vec that builds representations of a whole document. There is an implementation in gensim available:
https://radimrehurek.com/gensim/models/doc2vec.html
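A minimal Doc2Vec sketch with gensim; the corpus here is an illustrative assumption:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["senior python developer needed", "java backend engineer role"]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
vec = model.infer_vector("python developer position".split())  # embed unseen text
print(vec.shape)  # (50,)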
I have 27000 free text elements, each of around 2-3 sentences. I need to cluster these by similarity. So far, I have pretty limited success. I have tried the following:
I used the Python Natural Language Toolkit (NLTK) to remove stop words, lemmatize, and tokenize, then generated semantically similar words for each word in the sentence before inserting them into a Neo4j graph database. I then tried querying that using the TF counts for each word and related word. That didn't work very well and only let me easily calculate the similarity between two text items at a time.
I then looked at Graphaware's NLP library to annotate, enrich, and calculate the cosine similarity between each text item. After 4 days of processing similarities, I checked the log to find that it would take 1.5 years to complete. Apparently the community version of the plugin isn't optimised, so I guess it's not appropriate for this volume of data.
I then wrote a custom implementation that took the same approach as the Graphaware plugin, but in much simpler form. I used scikit-learn's TfidfVectorizer to calculate the cosine similarity between each text item and every other text item and saved those as relationships between the Neo4j nodes. However, with 27000 text items that creates 27000 * 27000 = 729,000,000 relationships! The intention was to take the graph into Gephi, select relationships over some threshold of similarity, and use modularity clustering to extract clusters. Processing for this is around 4 days, which is much better. Processing is incomplete and is currently running. However, I believe that Gephi has a max edge count of 1M, so I expect this to restrict what I can do.
So I turned to more conventional ML techniques using scikit-learn's KMeans, DBSCAN, and MeanShift algorithms. I am getting clustering, but when it's plotted on a scatter chart there is no separation (I can show code if that would help). Here is what I get with DBSCAN (scatter plot not reproduced here):
I get similar results with KMeans. These algorithms run within a few seconds, which obviously makes life easier, but the results seem poor.
So my questions are:
Is there a better approach to this?
Can I expect to find distinct clusters at all in free text?
What should my next move be?
Thank you very much.
I think your question is too general to be a good fit for Stack Overflow, but to give you some pointers...
What is your data? You discuss your methods in detail but not your data. What sort of clusters are you expecting?
Example useful description:
I have a bunch of short product reviews. I expect to be able to separate reviews of shoes, hats, and refrigerators.
Have you tried topic modelling? It's not fancy but it's a traditional method of sorting textual documents into clusters. Start with LDA if you're not familiar with anything.
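A minimal LDA sketch with scikit-learn; texts is an illustrative stand-in for your 27000 documents:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = ["cheap running shoes", "warm wool hat", "fridge stopped cooling"]

counts = CountVectorizer(stop_words="english").fit_transform(texts)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
doc_topics = lda.transform(counts)  # per-document topic mixture
labels = doc_topics.argmax(axis=1)  # hard cluster-style assignment
print(labels)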
Are you looking for duplicates? If you're looking for plagiarism or bot-generated spam, look into MinHash, SimHash, and the FuzzyWuzzy library for Python.
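For instance, a quick FuzzyWuzzy sketch for near-duplicate detection (assumes the fuzzywuzzy package is installed; the strings are illustrative):

from fuzzywuzzy import fuzz

a = "Senior Python developer, remote"
b = "Remote senior Python developer"
print(fuzz.ratio(a, b))             # raw similarity score, 0-100
print(fuzz.token_sort_ratio(a, b))  # order-insensitive comparison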
I would like to know if I can use Dask instead of pandas. What issues might I face?
1) I guess Dask will be slower than pandas for smaller datasets. I am OK with that because there are times when I do not know the size of the data, nor do I know the server configuration.
2) I will have to learn a slightly different syntax (e.g., compute).
Will I face a situation where a Dask dataframe cannot do something that a pandas dataframe can?
This is a very broad question. In general I recommend referring to the dask.dataframe documentation.
Dask.dataframe does not implement the entire pandas interface. Missing functionality includes the following sorts of operations:
Mutating operations
Operations that are hard to do exactly in parallel, like median (though approximate solutions often exist, like approximate quantiles)
Iterating over rows of a dataframe
Small corners of the API that no one has bothered to copy over.
However, because a dask dataframe is just a collection of many small pandas dataframes, you can often get around some of these limitations in simple cases.
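For example, a minimal sketch of two such workarounds (the toy dataframe and column names are illustrative assumptions):

import dask.dataframe as dd
import pandas as pd

# Illustrative data; in practice this might come from dd.read_csv.
pdf = pd.DataFrame({"group": list("aabb"), "value": [1, 4, 2, 8]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Exact median is hard to parallelize, but an approximate quantile is built in:
approx_median = ddf["value"].quantile(0.5).compute()

# And arbitrary pandas code can be applied partition-by-partition:
top = ddf.map_partitions(lambda part: part.nlargest(1, "value")).compute()
print(approx_median)
print(top)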
I got a MemoryError when I was running the DBSCAN algorithm from scikit.
My data is about 20000 x 10000; it's a binary matrix.
(Maybe DBSCAN isn't suitable for such a matrix. I'm a beginner in machine learning, and I just want a clustering method which doesn't need an initial number of clusters.)
Anyway, I found scikit's sparse matrices and feature extraction:
http://scikit-learn.org/dev/modules/feature_extraction.html
http://docs.scipy.org/doc/scipy/reference/sparse.html
But I still have no idea how to use them. In DBSCAN's documentation, there is no mention of sparse matrices. Is it not allowed?
If anyone knows how to use a sparse matrix with DBSCAN, please tell me.
Or you can suggest a more suitable clustering method.
The scikit-learn implementation of DBSCAN is, unfortunately, very naive. It needs to be rewritten to take indexing (ball trees etc.) into account.
As of now, it will apparently insist on computing a complete distance matrix, which wastes a lot of memory.
May I suggest that you just reimplement DBSCAN yourself? It's fairly easy; good pseudocode exists, e.g., on Wikipedia and in the original publication. It should be just a few lines, and you can then easily take advantage of your data representation. E.g., if you already have a similarity graph in a sparse representation, it's usually fairly trivial to do a "range query" (i.e., use only the edges that satisfy your distance threshold), as in the sketch below.
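For example, a minimal sketch of that range query on a scipy.sparse graph, assuming only edges below the threshold were stored (the matrix here is illustrative):

import numpy as np
from scipy.sparse import csr_matrix

# Toy similarity graph: stored entries are the pairwise distances kept as edges.
graph = csr_matrix(np.array([[0.0, 0.2, 0.0],
                             [0.2, 0.0, 0.7],
                             [0.0, 0.7, 0.0]]))

def range_query(graph, point, eps):
    # The eps-neighbors of `point` are just its stored edges within eps.
    row = graph.getrow(point)
    return row.indices[row.data <= eps]

print(range_query(graph, 1, eps=0.5))  # -> [0]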
Here is an issue in the scikit-learn GitHub repository where they talk about improving the implementation. A user reports that his version using a ball tree is 50x faster (which doesn't surprise me; I've seen similar speedups with indexes before, and it will likely become more pronounced as the data set size increases).
Update: the DBSCAN version in scikit-learn has received substantial improvements since this answer was written.
You can pass a distance matrix to DBSCAN, so assuming X is your sample matrix, the following should work:
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances

D = euclidean_distances(X, X)  # dense n_samples x n_samples distance matrix
db = DBSCAN(metric="precomputed").fit(D)
However, the matrix D will be even larger than X: n_samples² entries. With sparse matrices, k-means is probably the best option.
(DBSCAN may seem attractive because it doesn't need a pre-determined number of clusters, but it trades that for two parameters that you have to tune. It's mostly applicable in settings where the samples are points in space and you know how close you want those points to be to be in the same cluster, or when you have a black box distance metric that scikit-learn doesn't support.)
Yes, since version 0.16.1.
Here's a commit for a test:
https://github.com/scikit-learn/scikit-learn/commit/494b8e574337e510bcb6fd0c941e390371ef1879
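A minimal sketch of DBSCAN on a sparse input with a recent scikit-learn (the random matrix is an illustrative stand-in for real data):

from scipy.sparse import random as sparse_random
from sklearn.cluster import DBSCAN

X = sparse_random(100, 1000, density=0.01, format="csr", random_state=0)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # sparse input accepted
print(labels[:10])  # -1 marks noise points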
Sklearn's DBSCAN algorithm doesn't take sparse arrays. However, KMeans and spectral clustering do, so you could try those. More on sklearn's clustering methods: http://scikit-learn.org/stable/modules/clustering.html
I'm doing research which involves "unsupervised classification".
Basically, I have a trainSet and I want to cluster the data into X classes in an unsupervised way. The idea is similar to what k-means does.
Let's say
Step1)
featureSet is a [1057x10] matrix and I want to cluster it into 88 clusters.
Step2)
Use the previously calculated classes to determine how the testData is classified.
Question
- Is it possible to do this with an SVM or a neural network? Anything else?
- Any other recommendations?
There are many clustering algorithms out there, and the web is awash with information on them and sample implementations. A good starting point is the Wikipedia entry on cluster analysis.
As you have a working k-means implementation, you could try one of the many variants to see if they yield better results (k-means++ perhaps, seeing as you mentioned SVM). If you want a completely different approach, have a look at Kohonen maps, also called self-organising feature maps. If that looks too tricky, a simple hierarchical clustering would be easy to implement (find the nearest two items, combine, rinse and repeat; see the sketch after this paragraph).
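A minimal sketch of that hierarchical approach with scipy; the random matrix stands in for the [1057x10] featureSet from the question:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

featureSet = np.random.rand(1057, 10)  # stand-in for the real data

Z = linkage(featureSet, method="average")         # agglomerative merge tree
labels = fcluster(Z, t=88, criterion="maxclust")  # cut the tree into 88 clusters
print(np.unique(labels).size)  # up to 88 clusters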
This sounds like a classic clustering problem. Neither SVMs nor neural networks are going to be able to directly solve this problem. You can use either approach for dimensionality reduction, for example to embed your 10-dimensional data in two-dimensional space, but they will not put the data into clusters for you.
There are a huge number of clustering algorithms besides k-means. If you wanted a contrasting approach, you might want to try an agglomerative clustering algorithm. I don't know what kind of computing environment you are using, but I quite like R and this (very) short guide on clustering.