Interpreting the Dask UI - dask

I was looking at the Dask UI and trying to figure out what each field means. However, I could not make sense of the write_bytes and read_bytes fields shown in the image below. Also, write_bytes exceeds read_bytes in some cases. I was not able to find any documentation regarding this. What exactly do write_bytes and read_bytes mean?
I am running a simple Logistic Regression task on the MNIST data using Joblib dask backend.
Dask UI image

Those fields are about network traffic going in and out of that worker process.

These are common computing terms: the number of bytes written out over a network interface, and the number of bytes read in from one.
It's a bandwidth utilisation measurement.
Some tools (like Windows' Resource Monitor) call this "Sent" and "Received" instead.
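The idea behind these counters can be sketched with a thin wrapper that tallies bytes as they cross a socket. This is a hypothetical illustration of what such per-worker counters measure, not Dask's actual implementation (Dask reads the numbers from OS-level network statistics):

```python
import socket

class CountingSocket:
    """Wraps a socket and tallies bytes written/read, like a
    worker's write_bytes / read_bytes counters."""
    def __init__(self, sock):
        self.sock = sock
        self.write_bytes = 0
        self.read_bytes = 0

    def send(self, data):
        n = self.sock.send(data)
        self.write_bytes += n
        return n

    def recv(self, bufsize):
        data = self.sock.recv(bufsize)
        self.read_bytes += len(data)
        return data

# Demo with a local socket pair: one end writes, the other reads.
a, b = socket.socketpair()
left, right = CountingSocket(a), CountingSocket(b)
left.send(b"hello worker")
msg = right.recv(1024)
print(left.write_bytes, right.read_bytes)  # 12 12
a.close()
b.close()
```

Note that write_bytes can exceed read_bytes simply because a worker may send out more result data than it receives as input.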

Related

Is there a way to simulate the communications costs in tensorflow-federated?

I am working on optimizing the communication costs in Federated Learning. Therefore, I need to simulate realistic network delays and measure communication overhead (the communication between the clients and the server). Is it possible to do that with TFF? Is there a realistic networking model for communications in Federated Learning setting?
Introducing network latency or delays in the execution stack is not something that TFF currently supports out of the box.
However, architecturally this is absolutely possible. One example of a recent contribution that addresses a similar request is the SizingExecutor, which measures bits passed through it on the way down and up in the execution hierarchy. Placing a SizingExecutor immediately on top of each executor representing a client, then, measures the bits broadcast and aggregated in each federated computation run through this execution stack; this implementation can be found here, and is in fact exposed in the public API.
Your desire is not entirely dissimilar to the sizing executor, and the sizing executor may serve your purpose directly if you take total bits per round as the metric you are trying to optimize. If, however, you would rather examine other aspects of distributed computation (e.g. random data corruption), you may imagine doing so by implementing similar functionality to the sizing executor, though one could also imagine doing this at the computation level (a client chooses at random whether to return its true result or a corrupted version of its result).
I think from a design perspective, TFF would prefer any new executors to leave the semantics of the computations they are executing unchanged, and would steer towards either simply measuring properties like bits per round, or introducing any corruptions into the computation or algorithm directly, rather than in the execution of these computations. The kind of corruption or delay a client can choose to introduce is effectively arbitrary; here is an example of a recent research project attempting to attack the global model by inserting malicious updates on certain clients. The same approach could be used, I imagine, to simulate any desired network property (e.g., some clients sleep, some send back corrupted updates, etc.).
Hope this helps!
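The sizing executor's core idea, measuring the size of values that cross the client/server boundary each round, can be sketched outside TFF with plain serialization. The names below are hypothetical; TFF's real SizingExecutor hooks into the execution stack instead:

```python
import pickle

def measure_round(client_updates):
    """Serialize each client's update and tally the bits that would
    be aggregated in one federated round."""
    total_bits = 0
    for update in client_updates:
        total_bits += len(pickle.dumps(update)) * 8
    return total_bits

# Three clients each return a small model delta (lists standing in for tensors).
updates = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
bits = measure_round(updates)
print(bits)  # total bits "on the wire" this round
```

Optimizing this per-round total is the kind of metric the answer above suggests taking as the objective.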

Loading large datasets with dask

I am in an HPC environment with clusters, tightly coupled interconnects, and backing Lustre filesystems. We have been exploring how to leverage Dask to not only provide computation, but also to act as a distributed cache to speed up our workflows. Our proprietary data format is n-dimensional and regular, and we have coded a lazy reader to pass into the from_array/from_delayed methods.
We have had some issues with loading and persisting larger-than-memory datasets across a Dask cluster.
Example with hdf5:
# Dask scheduler has been started and connected to 8 workers
# spread out on 8 machines, each with --memory-limit=150e9.
# File locking for reading hdf5 is also turned off
from dask.distributed import Client
import dask.array as da
import h5py

c = Client(ip_of_scheduler)
hf = h5py.File('path_to_600GB_hdf5_file', 'r')
ds = hf[list(hf.keys())[0]]  # first dataset in the file
x = da.from_array(ds, chunks=(100, -1, -1))
x = c.persist(x)  # takes 40 minutes, far below network and filesystem capabilities
print(x[300000, :, :].compute())  # works as expected
We have also loaded datasets (using slicing, dask.delayed, and from_delayed) from some of our own file formats, and have seen similar degradation of performance as the file size increases.
My questions: Are there inherent bottlenecks to using Dask as a distributed cache? Will all data be forced to funnel through the scheduler? Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow? If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?
Are there inherent bottlenecks to using Dask as a distributed cache?
There are bottlenecks to every system, but it sounds like you're not close to running into the bottlenecks that I would expect from Dask.
I suspect that you're running into something else.
Will all data be forced to funnel through the scheduler?
No, workers can execute functions that load data on their own. That data will then stay on the workers.
Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow?
Workers are just Python processes, so if Python processes running on your cluster can take advantage of Lustre (this is almost certainly the case) then yes, Dask Workers can take advantage of Lustre.
If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?
This is certainly common. The tradeoff here is between distributed bandwidth to your network filesystem and the availability of distributed memory.
In your position I would use Dask's diagnostics to figure out what was taking up so much time. You might want to read through the documentation on understanding performance and the section on the dashboard in particular. That section has a video that might be particularly helpful. I would ask two questions:
Are workers running tasks all the time? (status page, Task Stream plot)
Within those tasks, what is taking up time? (profile page)
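A quick back-of-envelope check supports the suspicion that something other than Dask is the bottleneck: the numbers below just restate the figures from the question, and the implied throughput is far below what a Lustre filesystem typically delivers.

```python
# Figures from the question above.
file_size_bytes = 600e9     # 600GB HDF5 file
persist_seconds = 40 * 60   # persist took ~40 minutes
n_workers = 8

aggregate_mb_per_s = file_size_bytes / persist_seconds / 1e6
per_worker_mb_per_s = aggregate_mb_per_s / n_workers
print(aggregate_mb_per_s)   # 250.0 MB/s across the whole cluster
print(per_worker_mb_per_s)  # 31.25 MB/s per worker -- suspiciously low
```

If the dashboard shows workers idle between tasks, the time is going somewhere other than reading bytes.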

Is GPU efficient on parameter server for data parallel training?

In data-parallel training, I guess a GPU instance is not necessarily efficient for parameter servers, because parameter servers only keep the values and don't run any computation such as matrix multiplication.
Therefore, I think the example config for Cloud ML Engine (using CPU for parameter servers and GPU for others) below has good cost performance:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  parameterServerType: standard_cpu
  workerCount: 3
  parameterServerCount: 4
Is that right?
Your assumption is a reasonable rule of thumb. That said, Parag points to a paper that describes a model that can leverage GPUs in the parameter server, so it's not always the case that parameter servers are not able to leverage GPUs.
In general, you may want to try both for a short time and see if throughput improves.
If you have any question as to what ops are actually being assigned to your parameter server, you can log the device placement. If it looks like ops are on the parameter server that can benefit from the GPU (and supposing they really should be there), then you can go ahead and try a GPU in the parameter server.

TensorFlow scalability

I am using TensorFlow to train a DNN. My network structure is very simple, and each minibatch takes about 50ms with only one parameter server and one worker. To process huge numbers of samples, I am using distributed ASGD training. However, I found that increasing the worker count does not increase throughput: for example, 40 machines achieve 1.5 million samples per second, and after doubling both the parameter server and worker machine counts, the cluster still processes only 1.5 million samples per second, or even fewer. The reason is that each step takes much longer when the cluster is large. Does TensorFlow have good scalability, and is there any advice for speeding up training?
General approach to solving these problems is to find where bottlenecks are. You could be hitting a bottleneck in software or in your hardware.
General example of doing the math -- suppose you have 250M parameters, and each backward pass takes 1 second. This means each worker will be sending 1GB/sec of data and receiving 1GB/sec of data. If you have 40 machines, that'll be 80GB/sec of transfer between workers and parameter server. Suppose parameter server machines only have 1GB/sec fully duplex NIC cards. This means that if you have less than 40 parameter server shards, then your NIC card speed will be the bottleneck.
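That arithmetic can be written out directly, using the same assumed figures as the example above:

```python
# Assumed figures from the example above.
n_params = 250e6       # model parameters
bytes_per_param = 4    # float32
step_seconds = 1.0     # one backward pass per second
n_workers = 40
nic_gb_per_s = 1.0     # per parameter-server NIC, full duplex

# Each worker sends (and receives) the full parameter set each step.
per_worker_gb_per_s = n_params * bytes_per_param / step_seconds / 1e9
total_gb_per_s = per_worker_gb_per_s * n_workers

# Shards needed so the parameter-server NICs are not the bottleneck.
min_ps_shards = total_gb_per_s / nic_gb_per_s
print(per_worker_gb_per_s)  # 1.0 GB/s per worker, each direction
print(total_gb_per_s)       # 40.0 GB/s each direction between workers and PS
print(min_ps_shards)        # 40.0 -- fewer shards and the NICs saturate
```

Plugging in your own parameter count, step time, and NIC speed tells you quickly whether the network is even capable of the throughput you expect.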
After ruling that out, you should consider interconnect speed. You may have N network cards in your cluster, but the cluster most likely can't handle all network cards sending data to all other network cards. Can your cluster handle 80GB/sec of data flowing between 80 machines? Google designs their own network hardware to handle their interconnect demands, so this is an important problem constraint.
Once you checked that your network hardware can handle the load, I would check software. IE, suppose you have a single worker, how does "time to send" scale with the number of parameter server shards? If the scaling is strongly sublinear, this suggests a bottleneck, perhaps some inefficient scheduling of threads or some-such.
As an example of finding and fixing a software bottleneck, see the "grpc RecvTensor is slow" issue. That issue involved the gRPC layer becoming inefficient when sending messages larger than 100MB. The issue was fixed in an upstream gRPC release, but not yet integrated into a TensorFlow release, so the current work-around is to break messages into pieces of 100MB or smaller.
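The work-around of capping message size can be sketched as a simple chunking helper. This is a hypothetical illustration of the idea; TensorFlow's actual fix lives in the gRPC transport layer:

```python
MAX_CHUNK = 100 * 1024 * 1024  # stay at or below 100MB per message

def split_message(payload, max_chunk=MAX_CHUNK):
    """Split a byte payload into pieces no larger than max_chunk."""
    return [payload[i:i + max_chunk] for i in range(0, len(payload), max_chunk)]

# Small-scale demo: a 250-byte payload with a 100-byte cap
# splits the same way a 250MB tensor would with a 100MB cap.
payload = b"x" * 250
chunks = split_message(payload, max_chunk=100)
print([len(c) for c in chunks])  # [100, 100, 50]
assert b"".join(chunks) == payload  # reassembles losslessly
```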
The general approach to finding these is to write lots of benchmarks to validate your assumptions about the speed.
Here are some examples:
benchmark sending messages between workers (local)
benchmark of a sharded PS (local)

Storm process increasing memory

I am implementing a distributed algorithm for pagerank estimation using Storm. I have been having memory problems, so I decided to create a dummy implementation that does not explicitly save anything in memory, to determine whether the problem lies in my algorithm or my Storm structure.
Indeed, while the only thing the dummy implementation does is message-passing (a lot of it), the memory of each worker process keeps rising until the pipeline is clogged. I do not understand why this might be happening.
My cluster has 18 machines (some with 8g, some 16g and some 32g of memory). I have set the worker heap size to 6g (-Xmx6g).
My topology is very very simple:
One spout
One bolt (with parallelism).
The bolt receives data from the spout (fieldsGrouping) and also from other tasks of itself.
My message-passing pattern is based on random walks with a certain stopping probability. More specifically:
The spout generates a tuple.
One specific task from the bolt receives this tuple.
Based on a certain probability, this task generates another tuple and emits it again to another task of the same bolt.
I am stuck at this problem for quite a while, so it would be very helpful if someone could help.
Best Regards,
Nick
It seems you have a bottleneck in your topology, i.e., a bolt receives more data than it can process. Thus, the bolt's input queue grows over time, consuming more and more memory.
You can either increase the parallelism for the "bottleneck bolt" or enable Storm's fault-tolerance mechanism, which also provides flow control via a limited number of in-flight tuples (https://storm.apache.org/documentation/Guaranteeing-message-processing.html). For this, you also need to set the "max spout pending" parameter.
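The effect of "max spout pending" can be sketched with a bounded queue: the producer blocks once a fixed number of tuples are in flight, so memory stays bounded instead of growing with the backlog. This is a generic Python illustration of the flow-control idea, not Storm's API:

```python
import queue
import threading

MAX_PENDING = 10  # analogous to Storm's "max spout pending"

in_flight = queue.Queue(maxsize=MAX_PENDING)  # bounded input queue
processed = []

def slow_bolt():
    # Consumes tuples one by one, like an overloaded bolt.
    while True:
        tup = in_flight.get()
        if tup is None:
            break
        processed.append(tup)
        in_flight.task_done()

t = threading.Thread(target=slow_bolt)
t.start()

# The spout: put() blocks whenever MAX_PENDING tuples are already
# queued, so the queue (and memory) can never grow beyond the cap.
for i in range(100):
    in_flight.put(i)
in_flight.put(None)  # sentinel: shut the bolt down
t.join()
print(len(processed))  # 100 tuples processed, never more than 10 buffered
```

Without the maxsize cap, the queue would grow without bound whenever the consumer falls behind, which is exactly the memory growth described in the question.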