Do dask.delayed objects get distributed by dask on a cluster?
And is the execution of its task graph distributed on a cluster as well?
The short answer is yes.
Users interact by connecting a local Python session to the scheduler and submitting work, either by individual calls to the simple interface client.submit(function, *args, **kwargs) or by using the large data collections and parallel algorithms of the parent dask library. The collections in the dask library like dask.array and dask.dataframe provide easy access to sophisticated algorithms and familiar APIs like NumPy and Pandas, while the simple client.submit interface provides users with custom control when they want to break out of canned “big data” abstractions and submit fully custom workloads.
Dask delayed objects are included in the "parallel algorithms of the parent dask library".
See the documentation for more info.
http://distributed.dask.org/en/latest/
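As a minimal sketch of how this looks in practice (the scheduler address is an assumption), delayed objects built locally are executed across the cluster's workers once a distributed Client is connected:

```python
from dask import delayed
from dask.distributed import Client

# Connect the local session to an existing scheduler; the address is an assumption.
client = Client("tcp://scheduler-address:8786")

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

total = add(inc(1), inc(2))

# compute() ships the delayed task graph to the scheduler, which distributes
# the individual tasks across the cluster's workers.
print(total.compute())
```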
What is the difference between the GCP pipeline services Cloud Dataflow and Cloud Data Fusion, and which should you use when?
I did a high-level pricing comparison, taking 10 instances with the Basic edition in Data Fusion and a 10-instance cluster (n1-standard-8) in Dataflow.
The pricing is more than double for Data Fusion.
What are the pros and cons of each over the other?
Cloud Dataflow is purpose-built for highly parallelized graph processing and can be used for both batch and stream processing. It is also built to be fully managed, removing the need to manage and understand underlying resource scaling concepts, e.g. how to optimize shuffle performance or deal with key imbalance issues. The user/developer is responsible for building the graph via code: creating N transforms and/or operations to achieve the desired goal. For example: read files from storage, process each line in a file, extract data from the line, cast the data to numeric, sum the data in groups of X, write the output to a data lake. A minimal sketch of such a pipeline follows.
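As a rough illustration of what "building the graph via code" means on Dataflow, here is a minimal Apache Beam sketch of the pipeline described above (the file paths and field positions are assumptions):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")   # assumed path
        | "Parse" >> beam.Map(lambda line: line.split(","))               # process each line
        | "ToKeyValue" >> beam.Map(lambda f: (f[0], float(f[1])))         # extract + cast to numeric
        | "SumPerKey" >> beam.CombinePerKey(sum)                          # sum in groups
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/sums")    # write to data lake
    )
```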
Cloud Data Fusion is focused on enabling data integration scenarios: reading from sources (via an extensible set of connectors) and writing to targets, e.g. BigQuery, storage, etc. It does have parallelization concepts, but they are not fully managed as in Cloud Dataflow. CDF rides on top of Cloud Dataproc, which is a managed service for Hadoop-based processing. Its sweet spot is visual graph development, leveraging an extensible set of connectors and operators.
Your question is framed around cost. My advice is to take a step back and define what your processing/graph goals look like, then look at each product's value. If you want full control over processing semantics, with a greater focus on analytics, and need batch and/or streaming, focus on Dataflow. If you want point-and-click data movement, with less need for data analytics, and do not need streaming, then look at CDF.
I'm planning a TFF scheme in which the clients send the server data besides the weights, like their hardware information (e.g. CPU frequency). To achieve that, I need to call functions from third-party Python libraries, like psutil. Is it possible to serialize (using tff.tf_computation) such functions?
If not, what could be a solution to achieve this objective in a scenario where I'm using a remote executor setting through gRPC?
Unfortunately no, this does not work without modification. TFF uses TensorFlow graphs to serialize the computation logic to run on remote machines. TFF does not interpret Python code on the remote machines.
There may be a solution by creating a TensorFlow custom op. This would mean writing C++ code to retrieve the CPU frequency, and then a Python API to add the operation to the TensorFlow graph during computation construction. TensorFlow's Create an op guide provides detailed instructions.
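To sketch what the Python side might look like, assuming you had already built a hypothetical custom op library cpu_freq_op.so exposing a "CpuFrequency" op (the library name and op name are assumptions):

```python
import tensorflow as tf
import tensorflow_federated as tff

# Hypothetical: a compiled custom op library (built per the Create an op guide)
# that exposes a "CpuFrequency" op implemented in C++.
cpu_freq_module = tf.load_op_library("./cpu_freq_op.so")

@tff.tf_computation
def report_cpu_frequency():
    # The op executes natively inside the serialized TensorFlow graph on the
    # remote worker, so no Python code (e.g. psutil) needs to run there.
    return cpu_freq_module.cpu_frequency()
```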
I want to understand what the difference is between Dask and RAPIDS, and what benefits RAPIDS provides that Dask doesn't have.
Does RAPIDS internally use Dask code? If so, then why do we have Dask, since even Dask can interact with GPUs?
Dask is a Python library which enables out-of-core parallelism and distribution of some popular Python libraries as well as custom functions.
Take Pandas for example. Pandas is a popular library for working with Dataframes in Python. However, it is single-threaded, and the Dataframes you are working on must fit within memory.
Dask has a subpackage called dask.dataframe which follows most of the same API as Pandas but instead breaks your Dataframe down into partitions which can be operated on in parallel and can be swapped in and out of memory. Dask uses Pandas under the hood, so each partition is a valid Pandas Dataframe.
The overall Dask Dataframe can scale out and use multiple cores or multiple machines.
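As a small sketch of this pandas-like workflow (the file pattern and column names are assumptions):

```python
import dask.dataframe as dd

df = dd.read_csv("data/*.csv")                    # each partition is a pandas DataFrame
result = df.groupby("user_id")["amount"].sum()    # lazy, same API shape as pandas
print(result.compute())                           # runs the partitions in parallel
```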
RAPIDS is a collection of GPU accelerated Python libraries which follow the API of other popular Python packages.
To continue with our Pandas theme, RAPIDS has a package called cuDF, which has much of the same API as Pandas. However cuDF stores Dataframes in GPU memory and uses the GPU to perform computations.
GPUs can accelerate computations, which can lead to performance benefits for your Dataframe operations and enables you to scale up your workflow.
RAPIDS and Dask also work together and Dask is considered a component of RAPIDS because of this. So instead of having a Dask Dataframe made up of individual Pandas Dataframes you could instead have one made up of cuDF Dataframes. This is possible because they follow the same API.
This way you can both scale up by using a GPU and also scale out using multiple GPUs on multiple machines.
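A brief sketch of both pieces, assuming hypothetical file paths and column names:

```python
import cudf
import dask_cudf

# cuDF: much of the pandas API, but the DataFrame lives in GPU memory.
gdf = cudf.read_csv("data/part-0.csv")
print(gdf.groupby("user_id")["amount"].sum())

# dask_cudf: a Dask DataFrame whose partitions are cuDF DataFrames,
# so the same workflow can scale out across multiple GPUs or machines.
ddf = dask_cudf.read_csv("data/*.csv")
print(ddf.groupby("user_id")["amount"].sum().compute())
```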
Dask provides the ability to distribute a job. Dask can scale both horizontally (multiple machines) and vertically (same machine).
RAPIDS provides a set of PyData APIs which are GPU-accelerated: Pandas (cuDF), scikit-learn (cuML), NumPy (CuPy), etc. This means that you can take code you already wrote against those APIs, swap in the RAPIDS library, and benefit from GPU acceleration.
When you combine Dask and RAPIDS together, you basically get a framework (Dask) that scales horizontally and vertically, and PyData APIs (RAPIDS) which can leverage underlying GPUs.
If you look at broader solutions, Dask can then integrate with orchestration tools like Kubernetes and SLURM to be able to provide even better resource utilization across a large environment.
I am looking to host 5 deep learning models where data preprocessing/postprocessing is required.
It seems straightforward to host each model using TF serving (and Kubernetes to manage the containers), but if that is the case, where should the data pre and post-processing take place?
I'm not sure there's a single definitive answer to this question, but I've had good luck deploying models at scale by bundling the data pre- and post-processing code into fairly vanilla Go or Python (e.g., Flask) applications that are connected to my persistent storage for other operations.
For instance, to take the movie recommendation example, on the predict route it's pretty performant to pull the 100 films a user has watched from the database, dump them into a NumPy array of the appropriate size and encoding, dispatch to the TensorFlow serving container, and then do the minimal post-processing (like pulling the movie name, description, cast from a different part of the persistent storage layer) before returning.
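A minimal sketch of that pattern, assuming a Flask app in front of a TensorFlow Serving container (the model name, service address, and database helpers are all assumptions):

```python
import numpy as np
import requests
from flask import Flask, jsonify

app = Flask(__name__)

# Assumed model name and service address for the TF Serving REST API.
TF_SERVING_URL = "http://tf-serving:8501/v1/models/recommender:predict"

@app.route("/predict/<user_id>")
def predict(user_id):
    # Hypothetical persistent-storage lookup for the films the user has watched.
    watched_ids = fetch_watched_movie_ids(user_id)

    # Pre-processing: encode into the fixed-size array the model expects.
    features = np.zeros(100, dtype=np.int64)
    n = min(len(watched_ids), 100)
    features[:n] = watched_ids[:n]

    # Dispatch to the TensorFlow Serving container over its REST API.
    resp = requests.post(TF_SERVING_URL, json={"instances": [features.tolist()]})
    scores = resp.json()["predictions"][0]

    # Post-processing: map the top scores back to movie metadata (hypothetical helper).
    top_movies = lookup_movie_metadata(np.argsort(scores)[-10:].tolist())
    return jsonify(top_movies)
```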
In addition to josephkibe's answer, you can:
Implement the processing in the model itself (see signatures for Keras models and input receivers for Estimators in the SavedModel guide).
Install Seldon Core. It is a whole framework for serving that handles building images and networking. It builds the service as a graph of pods with different APIs, one of which is transformers that pre/post-process data.
It's certainly possible to view a Dask graph at any stage while holding onto the object. However, once .compute() is called on a Dask object, there is an opportunity to apply additional optimizations to the Dask graph before running the computation. Any optimizations applied at this stage would impact how the computation is run, yet this optimized graph would not necessarily be attached to a corresponding Dask object available to the user. Is there a way to also view the final Dask graph that was actually used for the computation?
The graph is not easily accessible after it has been submitted.
If you are using the distributed scheduler you can inspect the state there after submission, but it is no longer in a form that matches the traditional graph specification.
The best option I can think of is to optimize the graph before computing, and to investigate this. This isn't guaranteed to be exactly the same, but is likely close.
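A small sketch of that approach, using dask.optimize to apply the standard optimizations up front and then inspecting the result (the array and chunk sizes are arbitrary):

```python
import dask
import dask.array as da

x = da.random.random((1000, 1000), chunks=(100, 100))
y = (x + x.T).sum(axis=0)

# Apply the same kind of optimizations compute() would apply, then inspect
# the optimized collection before running it. Not guaranteed to match the
# scheduler's final graph exactly, but likely close.
(y_opt,) = dask.optimize(y)
print(len(dict(y_opt.__dask_graph__())))   # number of tasks in the optimized graph
y_opt.visualize("optimized.png")           # optional; requires graphviz
result = y_opt.compute()
```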