How to choose a parallel computing framework for machine learning? - machine-learning

How do I choose a parallel computing framework for machine learning? I am a beginner, and I saw there are Spark, Hadoop, OpenMP... What should I consider besides the language?

Look up Horovod from Uber. It's specifically designed for machine learning and is available for several frameworks such as TensorFlow and PyTorch. It's available in the Docker image repository on AWS too.

Related

Using scikit-learn on Databricks

Scikit-learn algorithms are single-node implementations. Does this mean that they are not an appropriate choice for building machine learning models on a Databricks cluster, because they cannot take advantage of the cluster computing resources?
They are not appropriate, in the sense that, as you say, they cannot take advantage of the cluster computing resources, which Databricks is arguably all about. The raison d'être of Databricks is Apache Spark, and specifically for ML tasks, its ML library Spark MLlib.
This does not mean that you cannot use scikit-learn in Databricks (you'll find that a Databricks cluster comes with scikit-learn installed by default), only that it is usable for problems that do not actually require a cluster. If you want to exploit the cluster resources for ML, you need to turn to Spark MLlib.
I think desertnaut hit the nail on the head here. I believe Scikit Learn algos are designed only for non-parallel processing jobs, and all the MLlib stuff is designed to leverage cluster compute resources and parallel processing resources. Take a look at the link below for sample code for standard regression and classification tasks.
https://spark.apache.org/docs/latest/ml-classification-regression.html
In addition, here are some code samples for different clustering tasks.
https://spark.apache.org/docs/latest/ml-clustering.html
That should probably cover most of the things you will be doing.
I believe that it depends on the task at hand. I see two general scenarios:
Your data is big and does not fit into memory. Go with Spark MLlib and its distributed algorithms.
Your data is not that big and you want to utilize sheer computing power. The typical use case is hyperparameter search.
Databricks allows for distributing such workloads from the driver node to the executors with hyperopt and its SparkTrials (random + Bayesian search).
Some docs here:
http://hyperopt.github.io/hyperopt/scaleout/spark/
However, there are many more attempts to make sklearn work on Spark. You can supposedly distribute the workloads through UDFs, using joblib, or by other means. I am investigating the issue myself, and will update the answer later.
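To make the second scenario concrete, here is a minimal sketch of why hyperparameter search parallelizes so easily: every trial is independent, so trials can simply be handed out to workers. This is not the hyperopt or SparkTrials API. It uses only the Python standard library, and the objective `validation_loss`, the search space in `sample_params`, and all parameter names are hypothetical stand-ins for "train a model and score it".

```python
import random
from concurrent.futures import ThreadPoolExecutor


def validation_loss(params):
    """Toy stand-in for 'train a model and return its validation loss'."""
    lr, depth = params["lr"], params["depth"]
    return (lr - 0.1) ** 2 + (depth - 5) ** 2 / 100.0


def sample_params(rng):
    """Draw one random configuration from a hypothetical search space."""
    return {"lr": rng.uniform(0.001, 1.0), "depth": rng.randint(1, 10)}


def parallel_random_search(n_trials=32, n_workers=4, seed=0):
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    candidates = [sample_params(rng) for _ in range(n_trials)]
    # Trials do not depend on each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        losses = list(pool.map(validation_loss, candidates))
    best = min(range(n_trials), key=losses.__getitem__)
    return candidates[best], losses[best]


best_params, best_loss = parallel_random_search()
print(best_params, best_loss)
```

A thread pool keeps the sketch short; for CPU-bound training in plain Python, a process pool (or, on a cluster, executors on other machines) is the more likely choice, and the structure of the search stays the same.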

Is there a native library written in Julia for Machine Learning?

I have started using Julia. I read that it is faster than C.
So far I have seen some libraries like Knet and Flux, but both are for deep learning.
There is also a package, PyCall, for using Python inside Julia.
But I am interested in machine learning too, so I would like to use SVM, Random Forest, KNN, XGBoost, etc., but in Julia.
Is there a native library written in Julia for machine learning?
Thank you
A lot of algorithms are simply available in dedicated packages, like BayesNets.jl.
For "classical machine learning" there is MLJ.jl, a pure Julia machine learning framework. It's written by the Alan Turing Institute and is under very active development.
For neural networks, Flux.jl is the way to go in Julia. It is also very active and GPU-ready, and it allows all the exotic combinations that exist in the Julia ecosystem, like DiffEqFlux.jl, a package that combines Flux.jl and DifferentialEquations.jl.
Just wait for Zygote.jl, a source-to-source automatic differentiation package that will be some sort of backend for Flux.jl.
Of course, if you're more confident with Python ML tools you still have TensorFlow.jl and ScikitLearn.jl, but OP asked for pure Julia packages and those are just Julia wrappers of Python packages.
Have a look at this kNN implementation and this for XGboost.
There are SVM implementations, but they are outdated and unmaintained (search for SVM .jl). But, really, think about other algorithms for much better prediction quality and model construction performance. Have a look at the OLS (orthogonal least squares) and OFR (orthogonal forward regression) algorithm family. You will easily find detailed algorithm descriptions that are easy to code in any suitable language. However, there is currently no Julia implementation I am aware of. I found only Matlab implementations and wrote my own Java implementation some years ago. I have plans to port it to Julia, but that currently has no priority and may take some years. Meanwhile, why not code it yourself? You won't find any other language that makes it easier to code a prototype and turn it into a highly efficient production algorithm running heavy load on a CUDA-enabled GPGPU.
I recommend this quite new publication, to start with: Nonlinear identification using orthogonal forward regression with nested optimal regularization

Image Classification in Azure Machine Learning

I'm preparing for the Azure Machine Learning exam, and here is a question that confuses me.
You are designing an Azure Machine Learning workflow. You have a dataset that contains two million large digital photographs. You plan to detect the presence of trees in the photographs. You need to ensure that your model supports the following:
Solution: You create a Machine Learning experiment that implements the Multiclass Decision Jungle module. Does this meet the goal?
Solution: You create a Machine Learning experiment that implements the Multiclass Neural Network module. Does this meet the goal?
The answer to the first question is No, while the answer to the second is Yes, but I cannot understand why Multiclass Decision Jungle doesn't meet the goal, since it is a classifier. Can someone explain the reason to me?
I suppose that this is part of a series of questions that present the same scenario, and there should definitely be some constraints in the scenario.
Moreover, if you have a look at the Azure documentation:
However, recent research has shown that deep neural networks (DNN) with many layers can be very effective in complex tasks such as image or speech recognition. The successive layers are used to model increasing levels of semantic depth.
Thus, Azure recommends using neural networks for image classification. Remember that the goal of the exam is to test your capacity to design data science solutions using Azure, so it is better to use their official documentation as a reference.
And compared with the other solutions:
You create an Azure notebook that supports the Microsoft Cognitive Toolkit.
You create a Machine Learning experiment that implements the Multiclass Decision Jungle module.
You create an endpoint to the Computer vision API.
You create a Machine Learning experiment that implements the Multiclass Neural Network module.
You create an Azure notebook that supports the Microsoft Cognitive Toolkit.
There are only 2 Azure ML Studio modules among them, and as the question is about constructing a workflow, I guess we can only choose between those two. (CNTK is actually the best solution, as it allows constructing a deep neural network with ReLU whereas AML Studio doesn't, and an API call is not about data science at all.)
Finally, I do agree with the other contributors that the question is absurd. Hope this helps.
This question is indeed part of a series of questions that present the same scenario with multiple options. Both of the solutions approach the problem as a multi-class classification problem, which is correct. However, the key element here is dimensionality.
Your inputs (images) are high-dimensional, which calls for a deep learning approach in order to be effective. A decision jungle won't be able to learn effectively in such a high-dimensional feature space, whereas a neural network has a better chance of doing so.
I hope it helps.
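To put a rough number on "high-dimensional", here is a small back-of-the-envelope sketch. The two million images come from the question; the 4000 x 3000 RGB resolution is purely an assumption, since the question only says the photographs are "large".

```python
def flattened_feature_count(width, height, channels=3):
    """Number of raw input features if every pixel channel is one feature."""
    return width * height * channels


# Hypothetical resolution; the exam question does not give one.
width, height = 4000, 3000
n_images = 2_000_000  # from the question

n_features = flattened_feature_count(width, height)
raw_bytes = n_features * n_images  # one byte per channel value, uncompressed

print(n_features)        # 36000000 features per image
print(raw_bytes / 1e12)  # 72.0 (terabytes of raw pixel data)
```

Under that assumption each image has tens of millions of raw features, which is the regime the answer above is describing.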

Suggestions for machine learning toolset without Matlab

I am new to the field of machine learning. I am planning to use Python as the programming language for implementing algorithms and Java for system architecture.
As far as I understand, machine learning is more about modeling data specific to the domain, visualizing the data, and choosing appropriate models and parameters. Implementing the models/algorithms is the last and relatively easy step.
Matlab seems to have everything for machine learning, but it is too expensive and requires learning a new language.
What tools other than a programming language do I need in general for machine learning in enterprise projects? Things like data modeling, visualization, etc.
After a couple of years of trial and error, I would suggest you go directly with Python, possibly with scikit-learn or TensorFlow (if you want to go hardcore :).
I also tried R in the past, and while it is a very valid language, it has some limitations: it is single-threaded by default, and although there are solutions for that, they are not as clean as in Python.
Also, Python seems to be THE language for machine learning. It is easy to learn and fast (depending on the interpreter implementation, of course), and there is huuuuuuge support for it: lots of tutorials, documentation and, more importantly, libraries that are actively developed and supported.
Finally, I recommend you consider Spyder as a good IDE for data science. I also tried Rodeo, but it does not seem as mature and stable as Spyder.
Hope this helps.

Spiking Neural Network Classifier Implementation

Are there any machine learning packages that implement spiking neural networks? or any other stand-alone implementations of them that could get me started to work with?
A python library named Brian ought to be useful for you.
There's also what I believe is a programming language named NEURON, but Brian is fairly easy to learn, at least for the basics. It took me a while, though, to figure out how to do a couple of small things, since it's a really high-level language or whatnot.
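If you want something stand-alone to experiment with before picking up one of these packages, a single leaky integrate-and-fire (LIF) neuron is a common starting point. The sketch below is plain Python and does not use Brian's or NEURON's API; the function name `simulate_lif` and all parameter values (time constant, threshold, reset) are illustrative assumptions.

```python
def simulate_lif(input_current, dt=1.0, tau=10.0,
                 v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    """Simulate one leaky integrate-and-fire neuron with forward Euler steps.

    input_current holds one input value per time step. Returns the list of
    time-step indices at which the neuron spiked.
    """
    v = v_rest
    spikes = []
    for step, current in enumerate(input_current):
        # The membrane potential leaks toward rest and integrates the input.
        v += dt * (-(v - v_rest) + current) / tau
        if v >= v_thresh:
            # Threshold crossed: record a spike and reset the potential.
            spikes.append(step)
            v = v_reset
    return spikes


# Constant input above threshold produces a regular spike train.
spike_times = simulate_lif([1.5] * 100)
print(spike_times)
```

With a constant input below the threshold the potential settles without ever spiking, and a stronger input spikes more often, which is the basic rate-coding behaviour that SNN classifiers build on.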
There are several other SNN platforms these days that allow you to run classification. I have worked with NeuCube (https://kedri.aut.ac.nz/R-and-D-Systems/neucube), which is a Matlab- and Java-based SNN platform.
Also, check out the Akida Development Environment (ADE) from Brainchip Inc (https://brainchipinc.com/). One of the best features of ADE is that its APIs are based on the tensorflow/keras structure, and it also supports a CNN2SNN converter so you can use your deep learning models in the SNN domain. SNN models developed using this platform can be deployed on their neuromorphic processor, Akida.
I believe there are other platforms such as PyNN and Nengo (compatibility to run models on Loihi) within the SNN domain.
Here are links for the Brian simulator:
https://github.com/brian-team/brian2
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2605403/
http://briansimulator.org/
You can install the Nengo Loihi library for deployment not only of spiking neural networks but also neuromorphic neural networks.
Here's the link to their website: https://www.nengo.ai/nengo-loihi/v1.0.0/index.html
On Kaggle you can find an implementation for the CIFAR-10 dataset, locally loaded, using the Nengo Loihi library. Here's the link:
https://www.kaggle.com/migueltoms/neuromorphic-ciphar-10-loihi-comparison-of-results
