Dask in-place replacement of pandas? - dask

I will like to know if I can use dask instead of pandas. What are the issues I may face?
1) I guess dask will be slower than pandas for smaller datasets. I am OK with that because there are times when I do not know the size of the data nor do I know server configuration.
2) I will have to learn a slightly different syntax (for e.g. compute)
Will I face a situation where dask dataframe can not do something that pandas dataframe can?

This is a very broad question. In general I recommend referring to the dask.dataframe documentation.
Dask.dataframe does not implement all pandas. This includes the following sorts of operations:
Mutating operations
Operations that are hard to do exactly in parallel, like median, (though approximate solutions often exist, like approximate quantiles)
Iterating over rows of a dataframe
Small corners of the API that no one has bothered to copy over.
However, because a dask dataframe is just a collection of many small dataframes, you can often get around some of these limitations in simple cases.

Related

TF-IDF calculation in Dask

Apache Spark comes with a package to do TF-IDF calculations that I find it quite handy:
https://spark.apache.org/docs/latest/mllib-feature-extraction.html
Is there any equivalent, or maybe a way to do this with Dask? If so, can it also be done in horizontally scaled Dask (i.e., cluster with multiple GPUs)
This was also asked on the dask gitter, with the following reply by #stsievert :
counting/hashing vectorizer are similar. They’re in Dask-ML and are the same as TFIDF without the normalization/function.
I think this would be a good github issue/feature request.
Here is the link to the API for HashingVectorizer.

DBSCAN: How to Cluster Large Dataset with One Huge Cluster

I am trying to perform DBSCAN on 18 million data points, so far just 2D but hoping to go up to 6D. I have not been able to find a way to run DBSCAN on that many points. The closest I got was 1 million with ELKI and that took an hour. I have used Spark before but unfortunately it does not have DBSCAN available.
Therefore, my first question is if anyone can recommend a way of running DBSCAN on this much data, likely in a distributed way?
Next, the nature of my data is that the ~85% lies in one huge cluster (anomaly detection). The only technique I have been able to come up with to allow me to process more data is to replace a big chunk of that huge cluster with one data point in a way that it can still reach all its neighbours (the deleted chunk is smaller than epsilon).
Can anyone provide any tips whether I'm doing this right or if there is a better way to reduce the complexity of DBSCAN when you know that most data is in one cluster centered around (0.0,0.0)?
Have you added an index to ELKI, and tried the parallel version? Except for the git version, ELKI will not automatically add an index; and even then fine-turning the index for the problem can help.
DBSCAN is not a good approach for anomaly detection - noise is not the same as anomalies. I'd rather use a density-based anomaly detection. There are variants that try to skip over "clear inliers" more efficiently if you know you are only interested in the top 10%.
If you already know that most of your data is in one huge cluster, why don't you directly model that big cluster, and remove it / replace it with a smaller approximation.
Subsample. There usually is next to no benefit to using the entire data. Even (or in particular) if you are interested in the "noise" objects, there is the trivial strategy of randomly splitting your data in, e.g., 32 subsets, then cluster each of these subsets, and join the results back. These 32 parts can be trivially processed in parallel on separate cores or computers; but because the underlying problem is quadratic in nature, the speedup will be anywhere between 32 and 32*32=1024.
This in particular holds for DBSCAN: larger data usually means you also want to use much larger minPts. But then the results will not differ much from a subsample with smaller minPts.
But by any means: before scaling to larger data, make sure your approach solves your problem, and is the smartest way of solving this problem. Clustering for anomaly detection is like trying to smash a screw into the wall with a hammer. It works, but maybe using a nail instead of a screw is the better approach.
Even if you have "big" data, and are proud of doing "big data", always begin with a subsample. Unless you can show that the result quality increases with data set size, don't bother scaling to big data, the overhead is too high unless you can prove value.

Template matching for time series using dask

I would like to use template matching with time-series, and I would like to be able to port this to very large datasets. The objective is to look for many relatively short 1d pattern in a relatively long time-series. Any suggestion on how to do this in Dask? I mean to have something like https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.Series.corr with many other and split_every equal or lower than length of other...
Thanks!
I recommend solving this problem first on pandas dataframes and then using the map_partitions method to apply your solution to every pandas dataframe that makes up the dask dataframe.
If your solution requires neighboring rows in order to operate, then I would look at map_overlap
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_overlap

Text corpus clustering

I have 27000 free text elements, each of around 2-3 sentences. I need to cluster these by similarity. So far, I have pretty limited success. I have tried the following:
I used Python Natural Language Toolkit to remove stop words, lemmatize and tokenize, then generated semantically similar words for each word in the sentence before inserting them into a Neo4j graph database. I then tried querying that using the TF counts for each word and related word. That didn't work very well and only resulted in being able to easily calculate the similarity between two text items.
I then looked at Graphawares NLP library to annotate, enrich and calculate the cosine similarity between each text item. After 4 days of processing similarity I checked the log to find that it would take 1.5 years to process. Apparently the community version of the plugin isn't optimised, so I guess it's not appropriate for this kind of volume of data.
I then wrote a custom implementation that took the same approach as the Graphaware plugin, but in much simpler form. I used scikitlearn's TfidfVectorizer to calculate the cosine similarity between each text item and every other text item and saved those as relationships between the Neo4j nodes. However, with 27000 text items that creates 27000 * 27000 = 729000000 relationships! The intention was to take the graph into Grephi selecting relationships of over X threshold of similarity and use modularity clustering to extract clusters. Processing for this is around 4 days which is much better. Processing is incomplete and is currently running. However, I believe that Grephi has a max edge count of 1M, so I expect this to restrict what I can do.
So I turned to more conventional ML techniques using scikitlearn's KMeans, DBSCAN, and MeanShift algorithms. I am getting clustering, but when it's plotted on a scatter chart there is no separation (I can show code if that would help). Here is what I get with DBSCAN:
I get similar results with KMeans. These algorithms run within a few seconds, which obviously makes life easier, but the results seem poor.
So my questions are:
Is there a better approach to this?
Can I expect to find distinct clusters at all in free text?
What should my next move be?
Thank you very much.
I think your question is too general to be a good fit for Stack Overflow, but to give you some pointers...
What is your data? You discuss your methods in detail but not your data. What sort of clusters are you expecting?
Example useful description:
I have a bunch of short product reviews. I expect to be able to separate reviews of shoes, hats, and refrigerators.
Have you tried topic modelling? It's not fancy but it's a traditional method of sorting textual documents into clusters. Start with LDA if you're not familiar with anything.
Are you looking for duplicates? If you're looking for plagiarism or bot-generated spam, look into MinHash, SimHash, and the FuzzyWuzzy library for Python.

What is the difference between Hashing vectorizer and Count vectorizer, when each to be used?

I am trying with various SVM variants in scikit-learn along with CountVectorizer and HashingVectorizer. They use fit or fit_transform in different examples, confusing me which to be used when.
Any clarification would be much honored.
They serve a similar purpose. The documentation provides some pro's and con's for the HashingVectorizer :
This strategy has several advantages:
it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an
in-memory vocabulary):
there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to
introspect which features are most important to a model.
there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if
n_features is large enough (e.g. 2 ** 18 for text classification
problems).
no IDF weighting as this would render the transformer stateful.

Resources