How to implement mini-batch k-means using Apache Spark MLlib? - machine-learning

I have implemented k-means using Spark. But since my data is huge and the feature count is very large, I want to implement mini-batch k-means using Apache Spark MLlib. Is there any example or document on how to implement it?

The paper below doesn't cover Apache Spark MLlib, but it does walk through mini-batch k-means:
Sculley, David. “Web-Scale K-Means Clustering.” In Proceedings of the 19th International Conference on World Wide Web, 1177–1178. ACM, 2010. http://dl.acm.org/citation.cfm?id=1772862
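As far as I know, Spark MLlib does not ship a mini-batch k-means (its StreamingKMeans is the closest built-in relative), so you would have to implement the paper's update rule yourself. Below is a minimal NumPy sketch of Algorithm 1 from the paper; the function name and parameters are mine, not a Spark API:

    import numpy as np

    def mini_batch_kmeans(X, k, batch_size=100, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize centers with k distinct points drawn from X.
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        counts = np.zeros(k)  # per-center update counts
        for _ in range(n_iter):
            batch = X[rng.choice(len(X), size=batch_size)]
            # Assignment step: nearest center for each batch point.
            dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            nearest = dists.argmin(axis=1)
            # Update step: per-center learning rate decays as 1/count.
            for x, c in zip(batch, nearest):
                counts[c] += 1
                eta = 1.0 / counts[c]
                centers[c] = (1.0 - eta) * centers[c] + eta * x
        return centers

The decaying per-center learning rate is what makes the updates converge. On Spark you could draw each batch with something like RDD.sample() and keep the small centers array on the driver.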

Related

A light and accurate classifier that is feasible on a device with limited resources

I have a project in which I must classify data coming from several sensors (time-series data), such as a gyroscope, into several classes. I have used several classifiers, including SVM, decision trees, neural networks, and KNN, in a batch scenario. My ultimate goal is a real-time classifier that is accurate, lightweight, and able to improve itself, so that I can run it on my device, which has limited resources (CPU, RAM, ...). I was thinking of a semi-supervised classifier, since I can store a few labeled data points on the device and use future data points to improve the classifier. Does anyone have any recommendations or experience in this regard?
Online learning is very challenging. I recommend you steer away from it for now and use batch learning. You can always update the model when you update the mobile app, or have the app fetch a new model from your server every x days.
Now, how do you run a machine learning algorithm efficiently on a phone with limited resources? First, identify which platform you are using; I'll assume you want a platform-agnostic answer. Most ML algorithms (except lazy-learning ones such as KNN) can run efficiently on a smartphone; have a look at this benchmarking experiment.
You have several options here:
iOS: Here's a list of all machine learning libraries available publicly.
Android: Weka for Android, this lib has a huge number of ML algorithms.
Platform-agnostic deep learning: with TensorFlow you can export your models to TensorFlow Lite (tutorial) and deploy them on any mobile OS, and with Caffe2 you can train deep learning models and export them to any smartphone OS; a minimal export sketch follows this list.
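A minimal sketch of the TensorFlow Lite export step, assuming TensorFlow 2.x (the linked tutorial may show an older API); the layer sizes here are placeholders, not a recommendation:

    import tensorflow as tf

    # A stand-in model; in practice this is your trained sensor classifier.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(6,)),                      # e.g. 6 sensor channels
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(4, activation="softmax"),  # e.g. 4 activity classes
    ])
    # ... model.fit(...) on your training data ...

    # Convert and save; the .tflite file is what ships with the mobile app.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    with open("model.tflite", "wb") as f:
        f.write(converter.convert())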

How intensive is training a machine learning algorithm?

I'd like to make an app using iOS's new CoreML framework that does image recognition. To do so I'd probably have to train my own model, and I'm wondering exactly how much data and compute power it would require. Is it something I could feasibly accomplish on a dual-core i5 MacBook Pro using Google Images for source data, or would it be much more involved?
It depends on what sort of images you want to train your model to recognize.
What is often done is fine-tuning an existing model. You take a pretrained version of Inception-v3 (let's say) and then replace the final layer with your own. You train this last layer on your own images.
You still need a fair number of training images (a few hundred per category, but more is better), but you can do this on your MacBook Pro in anywhere from 30 minutes to a few hours.
TensorFlow comes with a script that makes it really easy to do this, and Keras has a great blog post on how to do it. I used the TensorFlow script to retrain Inception-v3 to tell apart my two cats, using 50 or so images of each cat. A sketch of the same idea follows.
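For concreteness, here is a minimal sketch of that fine-tuning recipe in recent Keras (the script and blog post above predate this API; the directory layout and class count are placeholders):

    import tensorflow as tf

    # Frozen pretrained base: everything except the new final layer stays fixed.
    base = tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=(299, 299, 3))
    base.trainable = False

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # Inception expects [-1, 1]
        base,
        tf.keras.layers.Dense(2, activation="softmax"),     # e.g. my two cats
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Assumes images are sorted into data/train/<class_name>/ folders.
    train = tf.keras.utils.image_dataset_from_directory(
        "data/train", image_size=(299, 299), batch_size=32)
    model.fit(train, epochs=5)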
If you want to train from scratch you probably want to do this in the cloud using AWS, Google's Cloud ML Engine, or something easy like FloydHub.

Machine-Learning - Concept / Recommendations

Hi, I'm new to machine learning and am therefore looking for a text classification solution. Could someone recommend a nice framework written in Java? I have thought about using WEKA, but have also heard about MALLET. Which is better, and what are the main differences?
My goal is to classify unlabeled text. For training, I have prepared about 18 topics and 100 texts per topic.
What would you recommend doing? I would also appreciate a small example or a hint on how to proceed.
You have a very small text data set, so you could use any library; it wouldn't really matter. More advanced options would require more data than you have to be meaningful, so it's not an issue worth considering. The simple way text classification problems are handled is to use a bag-of-words model and a linear classifier; both Weka and MALLET support this (a short sketch of the recipe follows).
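Here is a sketch of that recipe, shown in Python with scikit-learn for brevity (Weka's StringToWordVector filter plus a linear classifier, or a MALLET pipeline, is the Java equivalent); the toy data stands in for your 18 topics x 100 texts:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Two toy documents and topics instead of 18 x 100.
    texts = ["the team won the match", "stocks fell on weak earnings"]
    labels = ["sports", "finance"]

    # Bag of words (tf-idf weighted) feeding a linear classifier.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["earnings beat expectations"]))  # -> ['finance']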
Personally, I find Weka to be a pain and MALLET to be poorly documented (and out of date where documentation exists), so I use JSAT. There is an example of doing spam classification here.
(bias warning, I'm the author of JSAT).
Since your task is fairly simple and, as you mentioned, you're new to ML, I'd recommend using Weka, as it is easy to use and has a large user community.
Otherwise, here are some general-purpose machine learning frameworks in Java that you can have a look at:
Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications
ELKI - Java toolkit for data mining. (unsupervised: clustering, outlier detection etc.)
H2O - ML engine that supports distributed learning on data stored in HDFS.
htm.java - General Machine Learning library using Numenta’s Cortical Learning Algorithm
java-deeplearning - Distributed deep learning platform for Java, Clojure, Scala
JAVA-ML - A general ML library with a common interface for all algorithms in Java
JSAT - Numerous machine learning algorithms for classification, regression, and clustering.
Mahout - Distributed machine learning
Meka - An open source implementation of methods for multi-label classification and evaluation (extension to Weka).
MLlib in Apache Spark - Distributed machine learning library in Spark
Neuroph - A lightweight Java neural network framework
ORYX - Simple real-time large-scale machine learning infrastructure.
RankLib - RankLib is a library of learning to rank algorithms
RapidMiner - RapidMiner integration into Java code
Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes.
WalnutiQ - object oriented model of the human brain
Weka - Weka is a collection of machine learning algorithms for data mining tasks
Source: Awesome Machine Learning

Are there similar datasets to MNIST?

I am doing research on machine learning. Now I want to test my algorithms on some well-known datasets. Since I am a newbie in this area, I can't find suitable datasets other than MNIST. I think MNIST is quite suitable for our research. Does anyone know of datasets similar to MNIST?
P.S. I know of another handwritten digit dataset that is often used, called the USPS dataset. But I need a dataset with more training examples (typically more than 10,000, comparable to the number of training examples in MNIST), so USPS is out of my selection.
The UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) contains quite a variety of datasets, including some, like MNIST, suitable for classification, e.g. http://archive.ics.uci.edu/ml/datasets/Skin+Segmentation.
I can't say which of them would be suitable without knowing what you're trying to demonstrate with your algorithm, but anything inside the UCI archive is well known.
You can try Fashion-MNIST or Kuzushiji-MNIST, which have very similar properties to MNIST but are a bit harder to predict. From Fashion-MNIST's page:
Seriously, we are talking about replacing MNIST. Here are some good reasons:
MNIST is too easy. Convolutional nets can achieve 99.7% on MNIST. Classic machine learning algorithms can also achieve 97% easily. Check out our side-by-side benchmark for Fashion-MNIST vs. MNIST, and read "Most pairs of MNIST digits can be distinguished pretty well by just one pixel."
MNIST is overused. In this April 2017 Twitter thread, Google Brain research scientist and deep learning expert Ian Goodfellow calls for people to move away from MNIST.
MNIST cannot represent modern CV tasks, as noted in this April 2017 Twitter thread by deep learning expert and Keras author François Chollet.
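Because Fashion-MNIST keeps MNIST's exact format (28x28 grayscale images, 10 classes, 60k train / 10k test), swapping it in is a one-line change if you use Keras, which bundles a loader for it; a minimal sketch follows (Kuzushiji-MNIST is distributed as NumPy arrays from its repository and drops in the same way):

    import tensorflow as tf

    # Same shapes as MNIST: (60000, 28, 28) train, (10000, 28, 28) test.
    (x_train, y_train), (x_test, y_test) = \
        tf.keras.datasets.fashion_mnist.load_data()
    print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)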

Does MapR have scalable machine learning algorithms, like Mahout?

I am specifically wondering if MapR has k-means clustering, just like Mahout?
As far as I know, MapR is only a "faster" Hadoop; there are no algorithms included.
So your existing Hadoop jobs should be compatible.
But what is the big deal in implementing your own? K-means is ultra simple (a bare-bones sketch follows below). See my blog post:
http://codingwiththomas.blogspot.com/2011/05/k-means-clustering-with-mapreduce.html
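To back up "ultra simple": plain sequential Lloyd's k-means fits in a dozen lines of NumPy. This is just the core loop, not the MapReduce version from the post or the BSP version mentioned below:

    import numpy as np

    def kmeans(X, k, n_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize centers with k distinct points from X.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: nearest center for every point.
            labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
            # Update step: move each center to the mean of its points,
            # keeping the old center if a cluster ends up empty.
            centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        return centers, labels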
However, I have implemented k-means clustering with BSP (Bulk Synchronous Parallel) and Apache Hama, and it is almost ten times faster if you compare it with the Mahout benchmark results in this book: http://www.manning.com/ingersoll/ (linked JIRA: https://issues.apache.org/jira/browse/MAHOUT-588)
Here is the benchmark of k-means with Apache Hama: http://wiki.apache.org/hama/Benchmarks
You can find it here:
https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/clustering/KMeansBSP.java
