What's the best way to run machine learning algorithms on Google Cloud Dataflow? I can imagine that using Mahout would be one option given it's Java based.
The answer is probably no, but is there a way to invoke R or Python (that have strong support for algorithms) based scripts to offload ML execution?
-Girish
You can already implement many algorithms in terms of Dataflow transforms.
A class of algorithms that may not be as easy to implement are iterative algorithms, where the pipeline's execution graph depends on the data itself. Simplifying implementation of iterative algorithms is something that we are interested in, and you can expect future improvements and simplifications in this area.
Invoking Python (or any other) executable shouldn't be hard from a Dataflow pipeline. A ParDo can, for example, shell out and start an arbitrary process. You can use, for example, --filesToStage pipeline option to add additional files to the Dataflow worker environment.
There is also http://quickml.org/ (haven't used personally) and Weka. I remember the docs mention that it's possible to launch a new process from within the job, but AFAIK it's not recommended.
Related
I'm planning a TFF scheme in which the clients send to the sever data besides the weights, like their hardware information (e.g CPU frequency). To achieve that, I need to call functions of third-party python libraries, like psutils. Is it possible to serialize (using tff.tf_computation) such kind of functions?
If not, what could be a solution to achieve this objective in a scenario where I'm using a remote executor setting through gRPC?
Unfortunately no, this does not work without modification. TFF uses TensorFlow graphs to serialize the computation logic to run on remote machines. TFF does not interpret Python code on the remote machines.
There maybe a solution by creating a TensorFlow custom op. This would mean writing C++ code to retrieve CPU frequency, and then a Python API to add the operation to the TensorFlow graph during computation construction. TensorFlow's guide for Create an op can provide detailed instructions.
Scikit-Learn algorithms are single node implementations. Does this mean, that they are not an appropriate choice for building machine learning models on Databricks cluster for the reason that they cannot take advantage of the cluster computing resources ?
They are not appropriate, in the sense that, as you say, they cannot take advantage of the cluster computing resources, which Databricks is arguably all about. The raison d'être of Databricks is Apache Spark, and specifically for ML tasks, its ML library Spark MLlib.
This does not mean that you cannot use scikit-learn in Databricks (you'll find that a Databricks cluster comes by scikit-learn installed by default), only that it is usable for problems that do not actually require a cluster. If you want to exploit the cluster resource capabilities for ML, you need to revert to Spark MLlib.
I think desertnaut hit the nail on the head here. I believe Scikit Learn algos are designed only for non-parallel processing jobs, and all the MLlib stuff is designed to leverage cluster compute resources and parallel processing resources. Take a look at the link below for sample code for standard regression and classification tasks.
https://spark.apache.org/docs/latest/ml-classification-regression.html
In addition, here are some code samples for different clustering tasks.
https://spark.apache.org/docs/latest/ml-clustering.html
That should probably cover most of the things you will be doing.
I believe that it depends on the task at hand. I see two general scenarios:
Your data is big and does not fit into memory. Go with the Spark MLlib and their distributed algos.
Your data is not that big and you want to utilize sheer computing power. The typical use case is hyperparameter search.
Databricks allow for distributing such workloads from the driver node to the executors with hyperopt and its SparkTrials (random + Bayesian search).
Some docs here>
http://hyperopt.github.io/hyperopt/scaleout/spark/
However, there are much more attempts to make the sklearn on spark work. You can supposedly distribute the workloads through UDF, using joblib, or others. I am investigating the issue myself, and will update the answer later.
Golang is much faster than Python.
However, in the case of Google Cloud Dataflow where Apache Beam is used as a programming model,
I want to understand whether the processing speed difference between Python and Golang is nearly the same or Golang is much faster than Python.
So I'm looking for Golang and Python benchmark material with big data in Dataflow.
Furthermore, it is even better to indicate the cause of the speed difference.
While Go as a language has advantages over Python, python benefits from tight interoperation with C, which gives it some speed benefits, since many of the more popular libraries have C implementations. Python is highly favoured for Machine Learning as a result, including for Beam.
Do you have any particular usecases that you would find valuable as comparisons?
The current expectation is that Go workers will have improved startup time compared to Python workers, but beyond that, it's harder to say without a concrete scenario.
I work on the Beam Go SDK. Currently the SDK isn't supported on Dataflow, and there are no comparative benchmarks between the Go and Python SDKs at present.
The Go SDK is still considered experimental. See the roadmap for the blockers in resolving that, which are currently in progress.
The Watson Machine Learning service provides three options for training deep learning models. The docs list the following:
There are several ways to train models Use one of the following
methods to train your model:
Experiment Builder
Command line interface (CLI)
Python client
I believe these approaches will differ with their (1) maturity and (2) the features they support.
What are the differences in these approaches? To ensure this question meets the quality requirements, can you please provide a objective list of the differences? Providing your answer as a community wiki answer will also allow the answer to be updated over time when the list changes.
If you feel this question is not a good fit for stack overflow, please provide a comment listing why and I will do my best to improve it.
The reasons to use these techniques depends on a user's skillset and how they are fitting the training/monitoring/deploying steps into their workflow:
Command Line Interface (CLI)
The CLI is useful for quick and random access to details about your training runs. It's also useful if you're building a data science workflow using shell scripts.
Python Library
WML's python library allows users to integrate their model training+deployment into a programmatic workflow. It can be used both within notebooks as well as via IDEs. The library has become the most widely used way for executing batch training experiments.
Experiment Builder UI
This is the "easy button" for executing batch training experiments within Watson Studio. It's a quick way to learn the basics of the batch training capabilities in Watson Studio. At present, it's not expected that data scientists would use Experiment Builder as their primary way of starting batch training experiments. Perhaps as Model Builder matures, this could change but the Python library is more flexible for integrating into production workflows.
Following the Spark MLlib Guide we can read that Spark has two machine learning libraries:
spark.mllib, built on top of RDDs.
spark.ml, built on top of Dataframes.
According to this and this question on StackOverflow, Dataframes are better (and newer) than RDDs and should be used whenever possible.
The problem is that I want to use common machine learning algorithms (e.g: Frequent Pattern Mining,Naive Bayes, etc.) and spark.ml (for dataframes) don't provide such methods, only spark.mllib(for RDDs) provides this algorithms.
If Dataframes are better than RDDs and the referred guide recommends the use of spark.ml, why aren't common machine learning methods implemented in that lib?
What's the missing point here?
Spark 2.0.0
Currently Spark moves strongly towards DataFrame API with ongoing deprecation of RDD API. While number of native "ML" algorithms is growing the main points highlighted below are still valid and internally many stages are implemented directly using RDDs.
See also: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0
Spark < 2.0.0
I guess that the main missing point is that spark.ml algorithms in general don't operate on DataFrames. So in practice it is more a matter of having a ml wrapper than anything else. Even native ML implementation (like ml.recommendation.ALS use RDDs internally).
Why not implement everything from scratch on top of DataFrames? Most likely because only a very small subset of machine learning algorithms can actually benefit from the optimizations which are currently implemented in Catalyst not to mention be efficiently and naturally implemented using DataFrame API / SQL.
Majority of the ML algorithms require efficient linear algebra library not a tabular processing. Using cost based optimizer for linear algebra could be an interesting addition (I think that flink already has one) but it looks like for now there is nothing to gain here.
DataFrames API gives you very little control over the data. You cannot use partitioner*, you cannot access multiple records at the time (I mean a whole partition for example), you're limited to a relatively small set of types and operations, you cannot use mutable data structures and so on.
Catalyst applies local optimizations. If you pass a SQL query / DSL expression it can analyze it, reorder, apply early projections. All of that is that great but typical scalable algorithms require iterative processing. So what you really want to optimize is a whole workflow and DataFrames alone are not faster than plain RDDs and depending on an operation can be actually slower.
Iterative processing in Spark, especially with joins, requires a fine graded control over the number of partitions, otherwise weird things happen. DataFrames give you no control over partitioning. Also, DataFrame / Dataset don't provide native checkpoint capabilities (fixed in Spark 2.1) which makes iterative processing almost impossible without ugly hacks
Ignoring low level implementation details some groups of algorithms, like FPM, don't fit very well into a model defined by ML pipelines.
Many optimizations are limited to native types, not UDT extensions like VectorUDT.
There is one more problem with DataFrames, which is not really related to machine learning. When you decide to use a DataFrame in your code you give away almost all benefits of static typing and type inference. It is highly subjective if you consider it to be a problem or not but one thing for sure, it doesn't feel natural in Scala world.
Regarding better, newer and faster I would take a look at Deep Dive into Spark SQL’s Catalyst Optimizer, in particular the part related to quasiquotes:
The following figure shows that quasiquotes let us generate code with performance similar to hand-tuned programs.
* This has been changed in Spark 1.6 but it is still limited to default HashPartitioning