Association rule mining - Dask

I have used algorithms like Apriori and FP-Growth in Python in the past. My question is how to use the same algorithms when the dataset does not fit in memory. The Dask documentation says it supports scikit-learn, NumPy, pandas, etc., but I can't find an implementation of the association rule algorithms.
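As far as I know, Dask itself does not ship association-rule algorithms, but the counting passes of Apriori map naturally onto dask.bag. Below is a minimal sketch of the first pass (frequent single items); the file pattern, delimiter, and support threshold are assumptions for illustration, not a finished implementation.

import dask.bag as db

min_support = 1000          # absolute support threshold (assumed)

# One transaction per line, items separated by commas (assumed file format)
transactions = db.read_text("transactions-*.csv").map(
    lambda line: line.strip().split(","))

# Count how many transactions each item appears in
item_counts = (transactions
               .map(set)                  # de-duplicate items within a transaction
               .flatten()
               .frequencies())            # lazy (item, count) pairs

frequent_items = item_counts.filter(lambda kv: kv[1] >= min_support).compute()
print(sorted(frequent_items, key=lambda kv: -kv[1])[:10])

Later passes (candidate pairs, triples, and the rule-generation step) would follow the same map/filter pattern over the bag, keeping only the counts in memory rather than the transactions themselves.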

Related

Is there a native library written in Julia for Machine Learning?

I have started using Julia. I read that it is faster than C.
So far I have seen some libraries like Knet and Flux, but both are for deep learning.
There is also PyCall, a package to use Python inside Julia.
But I am interested in classical machine learning too, so I would like to use SVM, random forest, KNN, XGBoost, etc., but in Julia.
Is there a native library written in Julia for machine learning?
Thank you.
A lot of algorithms are plainly available through dedicated packages, like BayesNets.jl.
For "classical machine learning" there is MLJ.jl, a pure-Julia machine learning framework written by the Alan Turing Institute, with very active development.
For neural networks, Flux.jl is the way to go in Julia. It is also very active, GPU-ready, and allows all the exotic combinations that exist in the Julia ecosystem, like DiffEqFlux.jl, a package that combines Flux.jl and DifferentialEquations.jl.
Also keep an eye on Zygote.jl, a source-to-source automatic differentiation package that will serve as a backend for Flux.jl.
Of course, if you're more comfortable with Python ML tools you still have TensorFlow.jl and ScikitLearn.jl, but the OP asked for pure Julia packages and those are just Julia wrappers of Python packages.
Have a look at this kNN implementation and this one for XGBoost.
There are SVM implementations, but they are outdated and unmaintained (search for SVM.jl). But, really, consider other algorithms for much better prediction quality and model-construction performance. Have a look at the OLS (orthogonal least squares) and OFR (orthogonal forward regression) algorithm family. You will easily find detailed algorithm descriptions that are easy to code in any suitable language. However, there is currently no Julia implementation I am aware of; I found only Matlab implementations and made my own Java implementation some years ago. I have plans to port it to Julia, but that currently has no priority and may take some years. Meanwhile, why not code it yourself? You won't find any other language that makes it easier to code a prototype and turn it into a highly efficient production algorithm running heavy loads on a CUDA-enabled GPGPU.
I recommend this fairly recent publication to start with: Nonlinear identification using orthogonal forward regression with nested optimal regularization.
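If you do want to code it yourself, the core OLS/OFR term-selection step is short. Here is a rough NumPy sketch (in Python rather than Julia, purely to illustrate the algorithm; the function name and interface are made up, not taken from any package):

import numpy as np

def ofr_select(P, y, n_terms):
    """Greedy OLS/OFR term selection (sketch).

    P       : (N, M) matrix whose columns are candidate regressors
    y       : (N,) target vector
    n_terms : number of regressors to select
    Returns the indices of the selected columns and their error-reduction ratios.
    """
    N, M = P.shape
    y = y.astype(float)
    yy = y @ y
    selected, errs = [], []
    W = np.zeros((N, 0))                  # orthogonalised versions of selected regressors

    for _ in range(n_terms):
        best_err, best_j, best_w = -1.0, None, None
        for j in range(M):
            if j in selected:
                continue
            w = P[:, j].astype(float).copy()
            # Gram-Schmidt: remove components along already-selected directions
            for k in range(W.shape[1]):
                w -= (W[:, k] @ P[:, j]) / (W[:, k] @ W[:, k]) * W[:, k]
            ww = w @ w
            if ww < 1e-12:                # candidate already (nearly) in the span
                continue
            err = (w @ y) ** 2 / (ww * yy)   # error-reduction ratio
            if err > best_err:
                best_err, best_j, best_w = err, j, w
        if best_j is None:
            break
        selected.append(best_j)
        errs.append(best_err)
        W = np.column_stack([W, best_w])

    return selected, errs

The nested regularization from the paper above is not included; this only shows the greedy selection loop.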

One particular dataset is a benchmark for image classification in the computer vision and machine learning literature

Whenever I read articles about datasets (like MNIST and CIFAR-10), I mostly find statements like:
CIFAR-10 is a standard benchmark dataset for image classification in the computer vision and machine learning literature.
In particular, if I talk about convolutional neural networks (deep learning), then, as far as I know, the accuracy of a network architecture depends on the dataset used.
I am really confused by the statement quoted above.
What exactly does that statement mean?
Thanks in advance if someone can help me with a real-life example.
In this context, "benchmarking" has been defined on Wikipedia as:
In computing, a benchmark is the act of running a computer program, a set of programs, or other operations, in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it.
In the computer vision community, some researchers focus their efforts on improving image classification performance on benchmark datasets. There are many benchmark datasets, e.g. MNIST, CIFAR-10, ImageNet. The existence of benchmark datasets means researchers can more easily compare the performance of their proposed method against existing methods. The researchers know that (generally speaking) everyone who benchmarks their method on this dataset has access to the same input data.
In some cases, a dataset provider keeps a leaderboard of the best-performing results on their benchmark dataset. For example, ImageNet keeps a record of the performance of the methods submitted to the benchmark contest each year. Here is an example of the best-performing methods for the 2017 ImageNet object classification task.
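To make this concrete, here is a minimal sketch of how such a benchmark number is produced: everyone trains on the same fixed training images and reports accuracy on the same held-out test images, so the numbers are directly comparable across papers. The simple classifier below is only illustrative; fetch_openml downloads scikit-learn's copy of MNIST.

from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0                                   # scale pixel values to [0, 1]
X_train, X_test = X[:60000], X[60000:]          # the standard 60k/10k MNIST split
y_train, y_test = y[:60000], y[60000:]

clf = LogisticRegression(max_iter=100).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))

A paper proposing a new method would report exactly this kind of test-set number, which readers can then compare against the published results of other methods on the same split.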

Incremental Learning of SVM

What are some real world applications where incremental learning of (machine learning) algorithms is useful?
Are SVMs preferred for such applications?
Is the solution more computationally intensive than retraining with the set containing the old support vectors and the new training vectors?
There is a well-known incremental version of SVM:
http://www.isn.ucsd.edu/pubs/nips00_inc.pdf
However, there are not many existing implementations available; there may be something in Matlab:
http://www.isn.ucsd.edu/svm/incremental/
The advantage of that approach is that it offers exact leave-one-out evaluation of the generalization performance on the training data.
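As a practical aside (this is not the exact algorithm from the paper above and does not give exact leave-one-out), scikit-learn's SGDClassifier with hinge loss behaves like a linear SVM that can be updated incrementally via partial_fit. A rough sketch with synthetic data:

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])
clf = SGDClassifier(loss="hinge", alpha=1e-4)   # hinge loss gives a linear SVM objective

for step in range(10):                          # pretend batches arrive over time
    X_batch = rng.normal(size=(200, 20))
    y_batch = (X_batch[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)   # update without retraining

print("weight matrix shape after 10 batches:", clf.coef_.shape)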
There is a trend towards large, "out of core" datasets, which are often streamed in from the network, disk, or a database. A real-world example is the popular NYC taxi dataset, which, at 330+ GB, cannot easily be tackled by desktop statistical models.
SVMs, as a "one batch" algorithm, must load the entire dataset into memory. As such, they are not preferred for incremental learning. Rather, learners like logistic regression, k-means, and neural nets, which are capable of partial learning, are preferred for such tasks.
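A minimal sketch of that style of out-of-core learning: stream a large CSV in chunks and update the model with partial_fit. The file name, column names, and label definition below are made up for illustration.

import pandas as pd
from sklearn.linear_model import SGDClassifier

# Logistic regression trained by SGD; use loss="log" on older scikit-learn versions
clf = SGDClassifier(loss="log_loss")
classes = [0, 1]
feature_cols = ["trip_distance", "passenger_count", "fare_amount"]   # assumed columns

# Stream the (too large for memory) file in 100k-row chunks
for chunk in pd.read_csv("nyc_taxi.csv", chunksize=100_000):
    X = chunk[feature_cols].to_numpy()
    y = (chunk["tip_amount"] > 0).astype(int).to_numpy()             # assumed label
    clf.partial_fit(X, y, classes=classes)

Only one chunk is ever held in memory, so the same loop works whether the data comes from disk, a database cursor, or a network stream.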

Are there similar datasets to MNIST?

I am doing research on machine learning. Now I want to test my algorithms on some famous datasets. Since I am a newbie in this area, I can't find other suitable datasets apart from MNIST. I think MNIST is quite suitable for our research. Does anyone know of datasets similar to MNIST?
P.S. I know another handwritten digit dataset that is often used, called the USPS dataset. But I need a dataset with more training examples (typically more than 10,000, and comparable to the number of training examples in MNIST), so USPS is out of my selection.
The UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) contains quite a variety of datasets, including those suitable for classification like MNIST, e.g. http://archive.ics.uci.edu/ml/datasets/Skin+Segmentation.
I can't say which of them would be suitable without knowing what you're trying to demonstrate with your algorithm, but anything inside the UCI archive is well known.
You can try Fashion-MNIST or Kuzushiji-MNIST, which have very similar properties to MNIST but are a bit harder to predict (see the loading sketch after this list). From Fashion-MNIST's page:
Seriously, we are talking about replacing MNIST. Here are some good reasons:
MNIST is too easy. Convolutional nets can achieve 99.7% on MNIST. Classic machine learning algorithms can also achieve 97% easily. Check out our side-by-side benchmark for Fashion-MNIST vs. MNIST, and read "Most pairs of MNIST digits can be distinguished pretty well by just one pixel."
MNIST is overused. In this April 2017 Twitter thread, Google Brain research scientist and deep learning expert Ian Goodfellow calls for people to move away from MNIST.
MNIST cannot represent modern CV tasks, as noted in this April 2017 Twitter thread by deep learning expert and Keras author François Chollet.
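As a quick check that Fashion-MNIST really is a drop-in replacement, here is a minimal loading sketch, assuming TensorFlow/Keras is installed:

from tensorflow.keras.datasets import fashion_mnist

# Same format and split as MNIST: 60,000 training and 10,000 test 28x28 grayscale images
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

Because the shapes and splits match MNIST exactly, existing MNIST code usually runs on it without changes.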

Is NLTK's naive Bayes Classifier suitable for commercial applications?

I need to train a naive Bayes classifier on two corpora consisting of approx. 15,000 tokens each. I'm using a basic bag-of-words feature extractor with binary labeling, and I'm wondering if NLTK is powerful enough to handle all this data without significantly slowing down run time if such an application were to gain many users. The program would basically be classifying a regular stream of text messages from potentially thousands of users. Are there other machine learning packages you'd recommend integrating with NLTK if it isn't suitable?
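For context, the kind of setup described looks roughly like this; the feature extractor and the tiny placeholder corpora are purely illustrative:

import nltk

def bag_of_words(text):
    # Binary bag-of-words features: word -> True if present
    return {word: True for word in text.lower().split()}

train_data = [("free prize click now", "spam"),
              ("are we still meeting for lunch", "ham")]   # placeholder corpora

featuresets = [(bag_of_words(text), label) for text, label in train_data]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

print(classifier.classify(bag_of_words("click to claim your free prize")))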
Your corpora are not very big, so NLTK should do the job. However, I wouldn't recommend it in general; it is quite slow and buggy in places. Weka is a more powerful tool, but the fact that it can do so much more makes it harder to understand. If Naive Bayes is all you plan to use, it would probably be fastest to code it yourself.
EDIT (much later):
Try scikit-learn; it is very easy to use.
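For example, a roughly equivalent binary bag-of-words Naive Bayes in scikit-learn (the example texts are placeholders):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["free prize click now", "are we still meeting for lunch"]   # placeholder messages
labels = ["spam", "ham"]

# binary=True gives presence/absence features, matching the setup in the question
model = make_pipeline(CountVectorizer(binary=True), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))

The vectorizer and classifier are both designed for exactly this workload and will comfortably handle a steady stream of short messages.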
