This may be a very silly problem, but I cannot diagnose bandwidth from the dask dashboard. I am under the impression the line is always so low that it is not visible (see screen grab).
Can I use the dashboard to get a value in such a situation?
Yes, there are a few places in the dashboard where bandwidth is mentioned:
workers: the read/write network bandwidth is listed in real time per worker
bandwidth-per-workers: the aggregate bandwidth is accumulated per worker pair
bandwidth-per-type: the aggregate bandwidth is accumulated per type of data
Some of these are more accessible in the performance_report, which may interest you. See https://docs.dask.org/en/latest/diagnostics-distributed.html#capture-diagnostics
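If you want a concrete number rather than a chart, you can also pull the scheduler's accumulated measurements directly. Below is a minimal sketch using client.run_on_scheduler; the bandwidth, bandwidth_workers and bandwidth_types attribute names are my assumption of what backs those dashboard pages, so check them against your version of distributed:

from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # hypothetical scheduler address

def grab_bandwidth(dask_scheduler):
    # attribute names are assumptions; inspect dask_scheduler if they differ
    return {
        "overall_estimate": dask_scheduler.bandwidth,              # bytes/s
        "per_worker_pair": dict(dask_scheduler.bandwidth_workers),
        "per_type": dict(dask_scheduler.bandwidth_types),
    }

print(client.run_on_scheduler(grab_bandwidth))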
Related
I have a distributed Dask cluster that I send a bunch of work to via Dask Distributed Client.
At the end of sending a bunch of work, I'd love to get a report or something that tells me what was the peak memory usage of each worker.
Is this possible via existing diagnostics tools? https://docs.dask.org/en/latest/diagnostics-distributed.html
Thanks!
Best,
Specifically for memory, it's possible to extract information from the scheduler (while it's running) using client.scheduler_info() (this can be dumped as JSON). For peak memory there would have to be an extra function that compares the current usage with the highest usage seen so far and keeps the maximum.
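As a rough sketch of that extra function, the loop below polls client.scheduler_info() while work is running and keeps the maximum memory seen per worker; the exact location of the memory figure in the returned dict ("metrics"/"memory" here) is an assumption and may differ between versions:

import time
from collections import defaultdict
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # hypothetical scheduler address
peak = defaultdict(int)

def sample_peak_memory():
    # keep the maximum per-worker memory seen so far
    for addr, info in client.scheduler_info()["workers"].items():
        mem = info.get("metrics", {}).get("memory", 0)  # assumed key names
        peak[addr] = max(peak[addr], mem)

futures = client.map(lambda x: x ** 2, range(1000))  # placeholder workload
while not all(f.done() for f in futures):
    sample_peak_memory()
    time.sleep(1)
print(dict(peak))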
For a lot of other useful information, but not the peak memory consumption, there's the built-in report:
from dask.distributed import performance_report

with performance_report(filename="dask-report.html"):
    ...  # some dask computation
(code from the documentation: https://docs.dask.org/en/latest/diagnostics-distributed.html)
Update: there is also a dedicated plugin for dask to record min/max memory usage per task: https://github.com/itamarst/dask-memusage
Update 2: there is a nice blog post with code to track memory usage by dask: https://blog.dask.org/2021/03/11/dask_memory_usage
Is there any built-in way to monitor memory usage of an application running in managed Google Cloud Run instances?
In the "Metrics" page of a managed Cloud Run service, there is an item called "Container Memory Allocation". However, as far as I understand it, this graph refers to the instance's maximum allocated memory (chosen in the settings), and not to the memory actually used inside the container. (Please correct me if I'm wrong.)
In the Stackdriver Monitoring list of available metrics for managed Cloud Run ( https://cloud.google.com/monitoring/api/metrics_gcp#gcp-run ), there also doesn't seem to be any metric related to the memory usage, only to allocated memory.
Thank you in advance.
Cloud Run now exposes a new metric named "Memory Utilization" in Cloud Monitoring; see more details here.
This metric captures the distribution of container memory utilization across all container instances of the revision. It is recommended to look at the percentiles of this metric (50th, 95th, and 99th) to understand how utilized your instances are.
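If you want to read that metric programmatically rather than in the console, a sketch with the Cloud Monitoring Python client could look like the following; the metric type string run.googleapis.com/container/memory/utilizations is my best guess at the "Memory Utilization" metric and should be checked against the Cloud Run metrics list:

import time
from google.cloud import monitoring_v3  # pip install google-cloud-monitoring

project = "my-project"  # hypothetical project ID
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)
results = client.list_time_series(
    request={
        "name": f"projects/{project}",
        "filter": 'metric.type="run.googleapis.com/container/memory/utilizations"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in results:
    print(ts.resource.labels.get("service_name"), ts.points[0].value)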
Currently, there seems to be no way to monitor the memory usage of a Google Cloud Run instance through Stackdriver or on the "Cloud Run" page in the Google Cloud Console.
I have filed a feature request on your behalf to add memory usage metrics to Cloud Run. You can see and track this feature request via the following link.
There is not currently a metric on memory utilization. However, if your service reaches a memory limit, the following log will appear in Stackdriver Logging with ERROR-level severity:
"Memory limit of 256M exceeded with 325M used. Consider increasing the memory limit, see https://cloud.google.com/run/docs/configuring/memory-limits"
(Replace specific numbers accordingly.)
Based on this log message, you could create a Log-based Metric for memory exceeded.
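For example, using the Cloud Logging Python client, something like the sketch below could create such a metric; the metric name and the filter text are illustrative and should be adjusted to match the exact log entry your service produces:

from google.cloud import logging  # pip install google-cloud-logging

client = logging.Client()
metric = client.metric(
    "cloud-run-memory-limit-exceeded",  # illustrative name
    filter_=(
        'resource.type="cloud_run_revision" '
        'severity>=ERROR '
        'textPayload:"Memory limit of"'
    ),
    description="Cloud Run instances that exceeded their memory limit",
)
metric.create()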
I am using azure log analytics to store monitoring data from AKS clusters. 72% of the data stored is performance data. Is there a way to limit how often AKS reports performance data?
At this point we do not provide a mechanism to change performance metric collection frequency. It is set to 1 minute and cannot be changed.
We were actually thinking about adding an option for more frequent collection, as requested by some customers.
Given the number of objects (pods, containers, etc.) running in the cluster, collecting even a few perf metrics may generate a noticeable amount of data... You need that data in order to figure out what is going on in case of a problem.
Curious: you say your perf data is 72% of the total - how much is it in terms of GB/day, do you know? Do you have any active applications running on the cluster generating tracing? What we see is that once you stand up a new cluster, perf data is "the king" of volume, but once you start adding active apps that trace, logs become more and more of a factor in telemetry data volume...
I am using tensorflow to train a DNN. My network structure is very simple, and each minibatch takes about 50 ms with only one parameter server and one worker. In order to process huge numbers of samples, I am using distributed ASGD training; however, I found that increasing the worker count does not increase throughput. For example, 40 machines can achieve 1.5 million samples per second, but after doubling the parameter server and worker machine counts, the cluster still processes only 1.5 million samples per second, or even fewer. The reason is that each step takes much longer when the cluster is large. Does tensorflow have good scalability, and is there any advice for speeding up training?
General approach to solving these problems is to find where bottlenecks are. You could be hitting a bottleneck in software or in your hardware.
General example of doing the math -- suppose you have 250M parameters and each backward pass takes 1 second. This means each worker will be sending 1GB/sec of data and receiving 1GB/sec of data. If you have 40 machines, that'll be 80GB/sec of transfer between workers and the parameter servers. Suppose the parameter server machines only have 1GB/sec full-duplex NIC cards. This means that if you have fewer than 40 parameter server shards, then your NIC card speed will be the bottleneck.
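To make that arithmetic concrete, here is the same back-of-the-envelope calculation in a few lines of Python (assuming float32 parameters):

params = 250e6        # number of model parameters
bytes_per_param = 4   # float32
step_time = 1.0       # seconds per backward pass

per_worker = params * bytes_per_param / step_time  # bytes/sec in each direction
workers = 40
aggregate = 2 * workers * per_worker               # send + receive across the cluster
print(f"{per_worker / 1e9:.1f} GB/s per worker, {aggregate / 1e9:.0f} GB/s aggregate")
# -> 1.0 GB/s per worker, 80 GB/s aggregate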
After ruling that out, you should consider interconnect speed. You may have N network cards in your cluster, but the cluster most likely can't handle all network cards sending data to all other network cards. Can your cluster handle 80GB/sec of data flowing between 80 machines? Google designs their own network hardware to handle their interconnect demands, so this is an important problem constraint.
Once you've checked that your network hardware can handle the load, I would check software. I.e., suppose you have a single worker: how does "time to send" scale with the number of parameter server shards? If the scaling is strongly sublinear, this suggests a bottleneck, perhaps some inefficient scheduling of threads or some such.
As an example of finding and fixing a software bottleneck, see the grpc RecvTensor is slow issue. That issue involved the gRPC layer becoming inefficient when sending messages larger than 100MB. The issue was fixed in an upstream gRPC release but has not been integrated into a TensorFlow release yet, so the current work-around is to break messages into pieces of 100MB or smaller.
The general approach to finding these is to write lots of benchmarks to validate your assumptions about the speed.
Here are some examples:
benchmark sending messages between workers (local)
sharded PS benchmark (local)
Here is my problem:
There are n peers in the P2P network, which request the same data block, subject to some constraints:
1. Each peer has its own upload bandwidth, and the average bandwidth equals the size of the data block.
2. The peers have different deadlines for this data block. If a peer doesn't get the entire block before its deadline, it has to ask the server for help.
3. A peer can transfer data (partial or entire) only if it has the entire data block.
The objective is to minimize the server's total upload. I can't figure out whether there is an optimal algorithm or whether it is an NP problem. Deadline-first or largest-bandwidth-first may fail in some situations.
Is there some NP problem similar to this? It looks like a graph flow problem or instruction scheduling, but I find it difficult because I have to deal with the deadlines and the growth of the suppliers' total bandwidth at the same time.
I hope I can get some directions or resources about the solution :)
Thanks.
Considering that each peer acts individually in your case, it is not a single automaton solving your issue, but many. Since fetching a data block when it is not available within a given delay is typically a polynomial problem, and since the job is accomplished by individual peers, your issue is not an NP problem for each peer locally.
On the other hand, if a server has to compute the minimal allocation of backup resources to transfer 'missing blocks', you would first have to find out the probability that a peer misses a block (average + standard deviation, for example). Assuming you know the statistical distribution of such events, you could compute the total bandwidth you would need to transfer those missing blocks with a chosen risk of failure/tolerance in the bandwidth. If you are using multiple servers to cover the need, make sure your peers contact them randomly to distribute the load.
Solving this statistical problem is not an NP issue. You can collect failure info from each peer and add it on a central/server peer. Therefore, my conclusion is that your issue is not an NP problem.
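As an illustration of that kind of provisioning calculation, here is a small sketch that sizes the server's upload bandwidth from an assumed per-peer miss probability, using a normal approximation to the number of misses; all the numbers and tolerance levels are made up:

import math

def server_bandwidth_needed(n_peers, p_miss, block_size, window, tolerance=0.01):
    # bandwidth so that, with probability 1 - tolerance, all peers that miss
    # the block within the time window can still be served by the server
    mean = n_peers * p_miss
    std = math.sqrt(n_peers * p_miss * (1 - p_miss))
    z = {0.05: 1.645, 0.01: 2.326, 0.001: 3.090}.get(tolerance, 2.326)
    worst_case_misses = mean + z * std
    return worst_case_misses * block_size / window

# example: 10,000 peers, 2% miss rate, 4 MB block, 60 s window, 1% risk
print(server_bandwidth_needed(10_000, 0.02, 4e6, 60))  # ~15.5 MB/s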
PART II:
Oh, I understand your case better now: multiple 'server' peers can potentially help one peer get a full block. In this case, the number of serving peers grows exponentially in your system for a given block, and this optimization problem has all the characteristics of a flooding problem to me, and those are NP.
Even if your graph of peers and the potential connections between them was static (which is never the case in a real P2P network), computing the optimal solution in a reasonable amount of time for more than 50 or 100 nodes is virtually impossible, unless you can make very specific assumptions on this graph (which is almost never the case in general and not always useful).
But do you absolutely need to have the absolute optimal solution or is something near the optimal good enough?
Heuristics will tell you that if your peers have more or less the same download bandwidth capacity, then it generally makes sense to serve the peers with the highest UPLOAD bandwidth first, to maximize the avalanche effect and to reduce the risk of a peer having to ask for help.
If your graph is relatively balanced (that is, most peers can connect to most peers), then I bet the minimum bandwidth of initial servers will be a logarithmic function of the number of nodes in your graph times the average speed at which peers expect to be served. This is only my gut feeling and should be validated with real measures or a strong modeling of your case.
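If you want to experiment with the 'highest upload bandwidth first' heuristic before building a full model, a toy simulation along these lines could help; it assumes one upload at a time per peer, a fixed block size, a single initial seed, and that a peer starts uploading only once it holds the entire block (your constraint 3):

import heapq

def simulate(peers, block_size, seed_bw):
    # peers: list of (upload_bw, deadline); returns indices of peers that
    # miss their deadline and would have to fall back to the server
    pending = sorted(range(len(peers)), key=lambda i: -peers[i][0])  # biggest uploaders first
    uploaders = [(0.0, seed_bw)]  # (time the uploader becomes free, its bandwidth)
    finish = {}
    for i in pending:
        free_at, bw = heapq.heappop(uploaders)          # earliest available uploader
        done = free_at + block_size / bw
        finish[i] = done
        heapq.heappush(uploaders, (done, bw))           # the uploader is free again
        heapq.heappush(uploaders, (done, peers[i][0]))  # peer i can now upload too
    return [i for i, (_, deadline) in enumerate(peers) if finish[i] > deadline]

# example: six peers given as (upload_bw, deadline), block size 1, seed bandwidth 1
print(simulate([(2, 1.5), (1, 2), (1, 2), (0.5, 3), (0.5, 3), (0.25, 4)], 1.0, 1.0))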