I want to understand the difference between Dask and RAPIDS. What benefits does RAPIDS provide that Dask doesn't?
Does RAPIDS internally use Dask code? If so, why do we have Dask at all, since even Dask can interact with the GPU?
Dask is a Python library that enables out-of-core computation and parallel or distributed execution, both for some popular Python libraries and for custom functions.
Take Pandas, for example. Pandas is a popular library for working with Dataframes in Python. However, it is single-threaded, and the Dataframes you are working on must fit within memory.
Dask has a subpackage called dask.dataframe which follows most of the same API as Pandas but instead breaks your Dataframe down into partitions which can be operated on in parallel and can be swapped in and out of memory. Dask uses Pandas under the hood, so each partition is a valid Pandas Dataframe.
The overall Dask Dataframe can scale out and use multiple cores or multiple machines.
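For illustration, here is a minimal sketch of the "same API, but partitioned" idea; the file name and column names are made up:

```python
import dask.dataframe as dd

# Lazily read the CSV into a Dask Dataframe split into partitions;
# each partition is a regular Pandas Dataframe.
df = dd.read_csv("transactions.csv", blocksize="64MB")

# Familiar Pandas-style API; nothing runs yet, Dask only builds a task graph.
result = df.groupby("customer_id")["amount"].sum()

# Trigger execution: partitions are processed in parallel (threads, processes,
# or a distributed cluster, depending on the scheduler in use).
print(result.compute().head())
```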
RAPIDS is a collection of GPU accelerated Python libraries which follow the API of other popular Python packages.
To continue with our Pandas theme, RAPIDS has a package called cuDF, which has much of the same API as Pandas. However cuDF stores Dataframes in GPU memory and uses the GPU to perform computations.
Because GPUs can accelerate computations, this can bring performance benefits to your Dataframe operations and lets you scale up your workflow.
RAPIDS and Dask also work together, and Dask is considered a component of RAPIDS because of this. So instead of having a Dask Dataframe made up of individual Pandas Dataframes, you could have one made up of cuDF Dataframes. This is possible because they follow the same API.
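As a minimal sketch of the "swap the import" idea, assuming a machine with an NVIDIA GPU and cuDF installed (the file and column names are again made up):

```python
import cudf

# The Dataframe lives in GPU memory and operations run on the GPU.
gdf = cudf.read_csv("transactions.csv")
result = gdf.groupby("customer_id")["amount"].sum()

# Convert back to Pandas if downstream code expects a CPU Dataframe.
print(result.to_pandas().head())
```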
This way you can both scale up by using a GPU and also scale out using multiple GPUs on multiple machines.
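A hedged sketch of what that combination can look like, assuming the dask_cuda and dask_cudf packages are installed (file and column names are made up):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# One Dask worker per local GPU.
cluster = LocalCUDACluster()
client = Client(cluster)

# A Dask Dataframe whose partitions are cuDF Dataframes in GPU memory.
gddf = dask_cudf.read_csv("transactions-*.csv")
result = gddf.groupby("customer_id")["amount"].sum()

print(result.compute())
```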
Dask provides the ability to distribute a job. Dask can scale both horizontally (multiple machines) and vertically (same machine).
RAPIDS provides a set of GPU-accelerated PyData APIs: Pandas (cuDF), Scikit-learn (cuML), NumPy (CuPy), and so on. This means you can take the code you already wrote against those APIs, swap in the RAPIDS library, and benefit from GPU acceleration.
When you combine Dask and RAPIDS together, you basically get a framework (Dask) that scales horizontally and vertically, and PyData APIs (RAPIDS) which can leverage underlying GPUs.
If you look at broader solutions, Dask can also integrate with orchestration tools like Kubernetes and SLURM to provide even better resource utilization across a large environment.
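As a hedged sketch of that kind of deployment, using the dask_jobqueue package for SLURM (the queue name and resource sizes below are assumptions; a Kubernetes deployment via dask-kubernetes would look similar):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Describe what one SLURM-launched Dask worker job looks like.
cluster = SLURMCluster(
    queue="gpu",      # hypothetical partition name
    cores=8,
    memory="32GB",
)
cluster.scale(4)      # ask SLURM for 4 worker jobs

client = Client(cluster)

# From here, dask.dataframe / dask_cudf / client.submit work exactly as before,
# but tasks run on the SLURM-allocated workers.
```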
As of now, the LightGBM model supports GPU training and distributed training (using Dask).
If it is possible, how can I use distributed training with Dask together with my GPU, or is there any other way to do so?
My task is to use both the power of the GPU and distributed training with the LightGBM model.
I may be missing a concept here, because I'm a beginner.
I'm not a LightGBM expert, so it might be better to wait for someone else to chime in. But from what I've been able to find, LightGBM does not really support Dask and GPU training at the same time.
See https://github.com/microsoft/LightGBM/issues/4761#issuecomment-956358341:
Right now the dask interface doesn't directly support distributed training using GPU, you can subscribe to #3776 if you're interested in that. Are you getting any warnings about this? I think it probably isn't using the GPU at all.
Furthermore, if your data fits in a single machine then it's probably best not using distributed training at all. The dask interface is there to help you train a model on data that doesn't fit on a single machine by having partitions of the data on different machines which communicate with each other, which adds some overhead compared to single-node training.
And https://github.com/microsoft/LightGBM/issues/3776:
The Dask interface in https://github.com/microsoft/LightGBM/blob/706f2af7badc26f6ec68729469ec6ec79a66d802/python-package/lightgbm/dask.py currently only supports CPU-based training.
Anyway, if you have only one GPU, Dask shouldn't be of much help.
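If CPU-based distributed training is enough for you, the Dask interface does work today. Here is a minimal sketch, assuming a recent lightgbm installed with its optional Dask dependencies; the cluster size and synthetic data are just placeholders:

```python
import dask.array as da
from dask.distributed import Client, LocalCluster
import lightgbm as lgb

# A small local cluster; in practice this would be multiple machines.
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

# Dask collections: each chunk can live on a different worker.
X = da.random.random((100_000, 20), chunks=(25_000, 20))
y = da.random.random((100_000,), chunks=(25_000,))

model = lgb.DaskLGBMRegressor(n_estimators=50)
model.fit(X, y)            # distributed, but CPU-only training
preds = model.predict(X)   # returns a Dask array
print(preds.compute()[:5])
```

Note that this runs on CPUs only, which matches the limitation described in the issues quoted above.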
Scikit-Learn algorithms are single-node implementations. Does this mean that they are not an appropriate choice for building machine learning models on a Databricks cluster, since they cannot take advantage of the cluster's computing resources?
They are not appropriate, in the sense that, as you say, they cannot take advantage of the cluster computing resources, which Databricks is arguably all about. The raison d'être of Databricks is Apache Spark, and specifically for ML tasks, its ML library Spark MLlib.
This does not mean that you cannot use scikit-learn in Databricks (you'll find that a Databricks cluster comes with scikit-learn installed by default), only that it is usable for problems that do not actually require a cluster. If you want to exploit the cluster's resources for ML, you need to turn to Spark MLlib.
I think desertnaut hit the nail on the head here. I believe Scikit-Learn algorithms are designed for single-node processing, while all the MLlib implementations are designed to leverage cluster compute resources and parallel processing. Take a look at the link below for sample code for standard regression and classification tasks.
https://spark.apache.org/docs/latest/ml-classification-regression.html
In addition, here are some code samples for different clustering tasks.
https://spark.apache.org/docs/latest/ml-clustering.html
That should probably cover most of the things you will be doing.
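For a concrete flavour of those docs, here is a minimal PySpark sketch of a distributed logistic regression; the data path follows the example files shipped with Spark and may differ in your setup:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# libsvm is the format used throughout the Spark ML example docs.
training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)   # training is distributed across the executors

print("Coefficients:", model.coefficients)
```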
I believe that it depends on the task at hand. I see two general scenarios:
Your data is big and does not fit into memory. Go with Spark MLlib and its distributed algos.
Your data is not that big and you want to utilize sheer computing power. The typical use case is hyperparameter search.
Databricks allows for distributing such workloads from the driver node to the executors with hyperopt and its SparkTrials (random + Bayesian search).
Some docs are here:
http://hyperopt.github.io/hyperopt/scaleout/spark/
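A hedged sketch of what that looks like; the objective function and search space below are toy assumptions, not a recommendation:

```python
from hyperopt import fmin, tpe, hp, SparkTrials

def objective(c):
    # In practice: train a scikit-learn model on the executor with this
    # hyperparameter value and return a validation loss.
    return (c - 3.0) ** 2

search_space = hp.uniform("c", 0.0, 10.0)

# Each trial runs as a Spark task on an executor; parallelism caps how many
# trials run concurrently.
spark_trials = SparkTrials(parallelism=4)

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=32,
    trials=spark_trials,
)
print(best)
```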
However, there are many more attempts to make sklearn work on Spark. You can supposedly distribute the workloads through UDFs, using joblib, or in other ways. I am investigating the issue myself and will update the answer later.
Do dask.delayed objects get distributed by dask on a cluster?
Also, is the execution of its task graph also distributed on a cluster?
The short answer is yes.
From the dask.distributed documentation:
Users interact by connecting a local Python session to the scheduler and submitting work, either by individual calls to the simple interface client.submit(function, *args, **kwargs) or by using the large data collections and parallel algorithms of the parent dask library. The collections in the dask library like dask.array and dask.dataframe provide easy access to sophisticated algorithms and familiar APIs like NumPy and Pandas, while the simple client.submit interface provides users with custom control when they want to break out of canned “big data” abstractions and submit fully custom workloads.
Dask delayed objects are included in the "parallel algorithms of the parent dask library".
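As a minimal sketch (the scheduler address and the toy functions are assumptions on my part):

```python
from dask import delayed
from dask.distributed import Client

# Connect to an existing cluster; the address here is hypothetical.
client = Client("tcp://scheduler-address:8786")

@delayed
def load(i):
    return list(range(i))

@delayed
def process(data):
    return sum(data)

# Building the graph is lazy and happens locally ...
results = [process(load(i)) for i in range(10)]
total = delayed(sum)(results)

# ... but compute() with an active distributed Client ships the task graph to
# the scheduler, which runs the tasks on the cluster's workers.
print(total.compute())
```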
See the documentation for more info.
http://distributed.dask.org/en/latest/
I recently took a course by Andrew Ng on Coursera. After that I shifted to Python and used Pandas, NumPy, and Scikit-learn to implement ML algorithms. While surfing the web I came across TensorFlow and found it pretty amazing, and I implemented this example, which takes MNIST data as input.
But I am unsure why one would use such a library (TensorFlow).
We are not doing any parallel calculations, since the weights updated in one epoch are used in the next one?
I am finding it difficult to see a reason to use such a library.
There are several forms of parallelism that TensorFlow provides when training a convolutional neural network (and many other machine learning models), including:
Parallelism within individual operations (such as tf.nn.conv2d() and tf.matmul()). These operations have efficient parallel implementations for multi-core CPUs and GPUs, and TensorFlow uses these implementations wherever available.
Parallelism between operations. TensorFlow uses a dataflow graph representation for your model, and where there are two nodes that aren't connected by a directed path in the dataflow graph, these may execute in parallel. For example, the Inception image recognition model has many parallel branches in its dataflow graph (see figure 3 in this paper), and TensorFlow can exploit this to run many operations at the same time. The AlexNet paper also describes how to use "model parallelism" to run operations in parallel on different parts of the model, and TensorFlow supports that using the same mechanism.
Parallelism between model replicas. TensorFlow is also designed for distributed execution. One common scheme for parallel training ("data parallelism") involves sharding your dataset across a set of identical workers, performing the same training computation on each of those workers for different data, and sharing the model parameters between the workers. A sketch of this data-parallel scheme follows below.
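Here is a hedged sketch of that third form using the modern tf.distribute.MirroredStrategy API in TF 2.x, which replicates the model across the local GPUs and averages gradients; the toy Keras model is my assumption, not part of the question:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # one replica per visible GPU

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# Each batch is split across the replicas; gradients are averaged before the update.
model.fit(x_train, y_train, batch_size=256, epochs=1)
```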
In addition, libraries like TensorFlow and Theano can perform various optimizations when they can work with the whole dataflow graph of your model. For example, they can eliminate common subexpressions, avoid recomputing constant values, and generate more efficient fused code.
You might be able to find pre-baked models in sklearn or other libraries, but TensorFlow allows for really fast iteration of custom machine learning models. It also comes with a ton of useful functions that you would have to (and probably shouldn't) write yourself.
To me, it's less about performance (though they certainly care about performance), and more about whipping out neural networks really quickly.
After playing with the current distributed training implementation for a while, I think it views each GPU as a separate worker. However, it is common now to have 2-4 GPUs in one box. Isn't it better to adopt a single-box multi-GPU approach, i.e. compute the averaged gradients within a box first and then sync up across multiple nodes? This would ease the I/O traffic a lot, which is always the bottleneck in data parallelism.
I was told this is possible with the current implementation by treating all the GPUs in a single box as one worker, but I am not able to figure out how to tie the averaged gradients to SyncReplicasOptimizer, since SyncReplicasOptimizer directly takes the optimizer as input.
Any ideas from anyone?
Distributed TensorFlow supports multiple GPUs in the same worker task. One common way to perform distributed training for image models is to perform synchronous training across multiple GPUs in the same worker, and asynchronous training across workers (though other configurations are possible). This way you only pull the model parameters to the worker once, and they are distributed among the local GPUs, easing the network bandwidth utilization.
To do this kind of training, many users perform "in-graph replication" across the GPUs in a single worker. This can use an explicit loop across the local GPU devices, like in the CIFAR-10 example model; or higher-level library support, like in the model_deploy() utility from TF-Slim.
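A hedged, TF-1.x-style sketch of that explicit in-graph replication loop; the toy model, shapes, and two-GPU count are assumptions:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

NUM_GPUS = 2
opt = tf.train.GradientDescentOptimizer(0.01)

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])

# Split the batch, one slice per local GPU ("tower").
x_splits = tf.split(x, NUM_GPUS)
y_splits = tf.split(y, NUM_GPUS)

tower_grads = []
for i in range(NUM_GPUS):
    with tf.device(f"/gpu:{i}"), tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        # Each tower computes the loss and gradients for its slice,
        # sharing variables with the other towers.
        preds = tf.layers.dense(x_splits[i], 1)
        loss = tf.losses.mean_squared_error(y_splits[i], preds)
        tower_grads.append(opt.compute_gradients(loss))

# Average the per-tower gradients for each variable, then apply one update.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars if g is not None]
    var = grads_and_vars[0][1]
    avg_grads.append((tf.reduce_mean(tf.stack(grads), axis=0), var))

train_op = opt.apply_gradients(avg_grads)
```

In the question's setting, one common approach is to wrap `opt` with SyncReplicasOptimizer and call compute_gradients/apply_gradients on the wrapper in the same pattern, so the in-box averaged gradients are what get synchronized across workers.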