I am specifically wondering if MapR has Kmeans clustering just like Mahout?
As far as I know, MapR is only a "faster" Hadoop. There are no algorithms included.
So your jobs should be compatible.
But what is the deal in implementing your own? K-means is ultra simple. See my blog post:
http://codingwiththomas.blogspot.com/2011/05/k-means-clustering-with-mapreduce.html
However I have implemented a k-means clustering with BSP (Bulk Synchronous Parallel) and Apache Hama which is almost ten times faster if you compare it with the Mahout benchmark results in this book: http://www.manning.com/ingersoll/ (linked jira: https://issues.apache.org/jira/browse/MAHOUT-588)
Here is the benchmark of k-means with Apache Hama: http://wiki.apache.org/hama/Benchmarks
You can find it here:
https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/clustering/KMeansBSP.java
Related
Scikit-Learn algorithms are single node implementations. Does this mean, that they are not an appropriate choice for building machine learning models on Databricks cluster for the reason that they cannot take advantage of the cluster computing resources ?
They are not appropriate, in the sense that, as you say, they cannot take advantage of the cluster computing resources, which Databricks is arguably all about. The raison d'ĂȘtre of Databricks is Apache Spark, and specifically for ML tasks, its ML library Spark MLlib.
This does not mean that you cannot use scikit-learn in Databricks (you'll find that a Databricks cluster comes by scikit-learn installed by default), only that it is usable for problems that do not actually require a cluster. If you want to exploit the cluster resource capabilities for ML, you need to revert to Spark MLlib.
I think desertnaut hit the nail on the head here. I believe Scikit Learn algos are designed only for non-parallel processing jobs, and all the MLlib stuff is designed to leverage cluster compute resources and parallel processing resources. Take a look at the link below for sample code for standard regression and classification tasks.
https://spark.apache.org/docs/latest/ml-classification-regression.html
In addition, here are some code samples for different clustering tasks.
https://spark.apache.org/docs/latest/ml-clustering.html
That should probably cover most of the things you will be doing.
I believe that it depends on the task at hand. I see two general scenarios:
Your data is big and does not fit into memory. Go with the Spark MLlib and their distributed algos.
Your data is not that big and you want to utilize sheer computing power. The typical use case is hyperparameter search.
Databricks allow for distributing such workloads from the driver node to the executors with hyperopt and its SparkTrials (random + Bayesian search).
Some docs here>
http://hyperopt.github.io/hyperopt/scaleout/spark/
However, there are much more attempts to make the sklearn on spark work. You can supposedly distribute the workloads through UDF, using joblib, or others. I am investigating the issue myself, and will update the answer later.
I have started using Julia.I read that it is faster than C.
So far I have seen some libraries like KNET and Flux, but both are for Deep Learning.
also there is a command "Pycall" tu use Python inside Julia.
But I am interested in Machine Learning too. So I would like to use SVM, Random Forest, KNN, XGBoost, etc but in Julia.
Is there a native library written in Julia for Machine Learning?
Thank you
A lot of algorithms are just plain available using dedicated packages. Like BayesNets.jl
For "classical machine learning" MLJ.jl which is a pure Julia Machine Learning framework, it's written by the Alan Turing Institute with very active development.
For Neural Networks Flux.jl is the way to go in Julia. Also very active, GPU-ready and allow all the exotics combinations that exist in the Julia ecosystem like DiffEqFlux.jl a package that combines Flux.jl and DifferentialEquations.jl.
Just wait for Zygote.jl a source-to-source automatic differentiation package that will be some sort of backend for Flux.jl
Of course, if you're more confident with Python ML tools you still have TensorFlow.jl and ScikitLearn.jl, but OP asked for pure Julia packages and those are just Julia wrappers of Python packages.
Have a look at this kNN implementation and this for XGboost.
There are SVM implementations, but outdated an unmaintained (search for SVM .jl). But, really, think about other algorithms for much better prediction qualities and model construction performance. Have a look at the OLS (orthogonal least squares) and OFR (orthogonal forward regression) algorithm family. You will easily find detailed algorithm descriptions, easy to code in any suitable language. However, there is currently no Julia implementation I am aware of. I found only Matlab implementations and made my own java implementation, some years ago. I have plans to port it to julia, but that has currently no priority and may last some years. Meanwhile - why not coding by yourself? You won't find any other language making it easier to code a prototype and turn it into a highly efficient production algorithm running heavy load on a CUDA enabled GPGPU.
I recommend this quite new publication, to start with: Nonlinear identification using orthogonal forward regression with nested optimal regularization
I have been using SVM for training and testing one dimensional data (15000 sample points for training, 7500 sample points for testing) and it has brought up satisfactory results so far. But to improve on the results, I am thinking of using Deep Learning for the same. Will it be able to improve results? What should I study for a quick implementation of Deep Learning algorithms? I am new to the DL field but want a quick implementation, if at all it is justifiable.
In machine learning applications it is hard to say if an algorithm will improve the results or not because the results really depend on the data. There is no best algorithm. You should follow the steps given below:
Analyze your data
Apply the appropriate algorithms by the help of your machine learning background
Evaluate the results
There are many machine learning libraries for different programming languages i.e. Weka for Java and scikit-learn for Python. The implementations may have special names other than the abstract names like Deep Learning. Thus, research for the implementation you are looking for in the library you are using.
For a novice to machine learning, what are the learning prerequisites to using Apache Mahout in an efficient way?
I know that a committer to Mahout would need calculus, linear algebra, probability and machine learning before they can contribute anything useful. But does a "User" of Apache Mahout need all of this?
I'm asking this because learning/revising all of the above would take me ages..
Mahout In Action provides a good overview of what you need to know to use Mahout.
Typically, scalable machine learning does not require advanced mathematics for use. It may require serious math to develop, but not necessarily to use.
The primary requirement is that you really understand your data and its origins and what you want to do with it. That understanding doesn't have to come all at once and can be developed over time.
Try to Google the topics below:
Programming Collaborative Intelligence
Similarity calculation with vectors
What's the different between cluster and classification.
This question does not have a single "right" answer.
I'm interested in running Map Reduce algorithms, on a cluster, on Terabytes of data.
I want to learn more about the running time of said algorithms.
What books should I read?
I'm not interested in setting up Map Reduce clusters, or running standard algorithms. I want rigorous theoretical treatments or running time.
EDIT: The issue is not that map reduce changes running time. The issue is -- most algorithms do not distribute well to map reduce frameworks. I'm interested in algorithms that run on the map reduce framework.
Technically, there's no real different in the runtime analysis of MapReduce in comparison to "standard" algorithms - MapReduce is still an algorithm just like any other (or specifically, a class of algorithms that occur in multiple steps, with a certain interaction between those steps).
The runtime of a MapReduce job is still going to scale how normal algorithmic analysis would predict, when you factor in division of tasks across multiple machines and then find the maximum individual machine time required for each step.
That is, if you have a task which requires M map operations, and R reduce operations, running on N machines, and you expect that the average map operation will take m time and the average reduce operation r time, then you'll have an expected runtime of ceil(M/N)*m + ceil(R/N)*r time to complete all of the tasks in question.
Prediction of the values for M,R,m, and r are all something that can be accomplished with normal analysis of whatever algorithm you're plugging into MapReduce.
There are only two books that i know of that are published, but there are more in the works:
Pro hadoop and Hadoop: The Definitive Guide
Of these, Pro Hadoop is more of a beginners book, whilst The Definitive Guide is for those that know what Hadoop actually is.
I own The Definitive Guide and think its an excellent book. It provides good technical details on how the HDFS works, as well as covering a range of related topics such as MapReduce, Pig, Hive, HBase etc. It should also be noted that this book was written by Tom White who has been involved with the development of Hadoop for a good while, and now works at cloudera.
As far as the analysis of algorithms goes on Hadoop you could take a look at the TeraByte sort benchmarks. Yahoo have done a write up of how Hadoop performs for this particular benchmark: TeraByte Sort on Apache Hadoop. This paper was written in 2008.
More details about the 2009 results can be found here.
There is a great book about Data Mining algorithms applied to the MapReduce model.
It was written by two Stanford Professors and it if available for free:
http://infolab.stanford.edu/~ullman/mmds.html