In data-parallel training, I guess a GPU instance is not necessarily efficient for the parameter servers, because parameter servers only hold the parameter values and don't run any heavy computation such as matrix multiplication.
Therefore, I think the example config for Cloud ML Engine below (CPU for the parameter servers, GPU for everything else) has good cost performance:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  parameterServerType: standard_cpu
  workerCount: 3
  parameterServerCount: 4
Is that right?
Your assumption is a reasonable rule of thumb. That said, Parag points to a paper that describes a model that can leverage GPUs in the parameter server, so a parameter server is not always unable to take advantage of a GPU.
In general, you may want to try both for a short time and see if throughput improves.
If you have any question as to which ops are actually being assigned to your parameter server, you can log the device placement. If it looks like ops that could benefit from the GPU are landing on the parameter server (and supposing they really should be there), then you can go ahead and try a GPU in the parameter server.
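For example, here is a minimal sketch of device placement logging in TF 1.x graph mode; the matmul is just a stand-in for your model's ops, and in a distributed job you would pass the same config to the session that talks to your cluster:

import tensorflow as tf  # assumes the TF 1.x graph-mode API

a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
c = tf.matmul(a, b)

# log_device_placement prints every op's device assignment to stderr, e.g.
# "MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0".
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(c)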
I am working on optimizing the communication costs in Federated Learning. Therefore, I need to simulate realistic network delays and measure communication overhead (the communication between the clients and the server). Is it possible to do that with TFF? Is there a realistic networking model for communications in Federated Learning setting?
Introducing network latency or delays in the execution stack is not something that TFF currently supports out of the box.
However, architecturally this is absolutely possible. One example of a recent contribution that addresses a similar request is the SizingExecutor, which measures bits passed through it on the way down and up in the execution hierarchy. Placing a SizingExecutor immediately on top of each executor representing a client, then, measures the bits broadcast and aggregated in each federated computation run through this execution stack; this implementation can be found here, and is in fact exposed in the public API.
Your use case is not entirely dissimilar to the sizing executor, and the sizing executor may serve your purpose directly if you take total bits per round as the metric you are trying to optimize. If, however, you would rather examine other aspects of distributed computation (e.g., random data corruption), you might do so by implementing functionality similar to the sizing executor, though one could also imagine doing this at the computation level (a client chooses at random whether to return its true result or a corrupted version of its result).
From a design perspective, I think TFF would prefer any new executors to leave the semantics of the computations they execute unchanged, and would steer towards either simply measuring properties like bits per round, or introducing any corruptions into the computation or algorithm directly, rather than in the execution of those computations. The kind of corruption or delay a client can choose to introduce is effectively arbitrary; here is an example of a recent research project attempting to attack the global model by inserting malicious updates on certain clients. The same approach could be used, I imagine, to simulate any desired network property (e.g., some clients sleep, some send back corrupted updates, etc.).
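As a toy illustration of that computation-level approach (plain NumPy, no TFF APIs; the model size, client count, and corruption probability are all made up for the sketch), each client here randomly decides whether to return its true update or a corrupted one:

import numpy as np

rng = np.random.default_rng(0)
NUM_CLIENTS = 10
CORRUPTION_PROB = 0.2          # hypothetical probability that a client misbehaves
global_model = np.zeros(5)     # stand-in for the global model weights

def client_update(model):
    # Pretend local training: a small random step away from the global model.
    return model + rng.normal(scale=0.1, size=model.shape)

def maybe_corrupt(update):
    # A client chooses at random whether to return its true result or garbage.
    if rng.random() < CORRUPTION_PROB:
        return rng.normal(scale=10.0, size=update.shape)
    return update

for round_num in range(3):
    client_updates = [maybe_corrupt(client_update(global_model))
                      for _ in range(NUM_CLIENTS)]
    global_model = np.mean(client_updates, axis=0)   # plain federated averaging
    print(round_num, np.linalg.norm(global_model))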
Hope this helps!
I am using TensorFlow to train a DNN. My network structure is very simple, and each minibatch takes about 50 ms with only one parameter server and one worker. To process a huge number of samples I am using distributed asynchronous SGD training; however, I found that increasing the worker count does not increase throughput. For example, 40 machines achieve 1.5 million samples per second, and after doubling both the parameter server count and the worker count, the cluster still processes only 1.5 million samples per second, or even fewer. The reason is that each step takes much longer when the cluster is large. Does TensorFlow have good scalability, and is there any advice for speeding up training?
The general approach to solving these problems is to find where the bottlenecks are. You could be hitting a bottleneck in software or in your hardware.
Here is a general example of doing the math: suppose you have 250M parameters and each backward pass takes 1 second. At 4 bytes per parameter that is about 1 GB of gradients, so each worker will be sending 1 GB/sec of data and receiving 1 GB/sec of data. If you have 40 machines, that's 80 GB/sec of transfer between the workers and the parameter servers. Suppose the parameter server machines only have 1 GB/sec full-duplex NICs. Then if you have fewer than 40 parameter server shards, NIC speed will be the bottleneck.
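A back-of-the-envelope version of that arithmetic in Python (same hypothetical numbers as above):

params = 250e6                                   # model parameters
bytes_per_param = 4                              # float32
grad_bytes = params * bytes_per_param            # ~1 GB per backward pass
workers = 40
step_time_sec = 1.0

# Each worker both sends gradients and receives updated parameters every step.
per_worker_gb_per_sec = grad_bytes / step_time_sec / 1e9     # ~1 GB/s each way
total_into_ps_tier = workers * per_worker_gb_per_sec         # ~40 GB/s inbound

ps_nic_gb_per_sec = 1.0                          # full-duplex NIC per PS machine
min_ps_shards = total_into_ps_tier / ps_nic_gb_per_sec       # ~40 shards needed
print(per_worker_gb_per_sec, total_into_ps_tier, min_ps_shards)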
After ruling that out, you should consider interconnect speed. You may have N network cards in your cluster, but the cluster most likely can't handle all network cards sending data to all other network cards at full rate. Can your cluster handle 80 GB/sec of data flowing between 80 machines? Google designs its own network hardware to handle its interconnect demands, so this is an important constraint to keep in mind.
Once you have checked that your network hardware can handle the load, I would check software. For example, with a single worker, how does "time to send" scale with the number of parameter server shards? If the scaling is strongly sublinear, that suggests a software bottleneck, perhaps inefficient scheduling of threads or some such.
As an example of finding and fixing a software bottleneck, see the "grpc RecvTensor is slow" issue. That issue involved the gRPC layer becoming inefficient when sending messages larger than 100 MB. It was fixed in an upstream gRPC release but has not been integrated into a TensorFlow release yet, so the current workaround is to break messages into pieces of 100 MB or smaller.
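As a rough sketch of that chunking idea at the application level (this is not the internal TensorFlow fix, just an illustration; the parameter count and limit are hypothetical):

import numpy as np

def split_into_chunks(flat_params, max_bytes=100 * 1024 * 1024):
    # Split a flat parameter array into pieces no larger than max_bytes each.
    max_elems = max_bytes // flat_params.dtype.itemsize
    n_chunks = max(1, int(np.ceil(flat_params.size / max_elems)))
    return np.array_split(flat_params, n_chunks)

params = np.zeros(250 * 1000 * 1000, dtype=np.float32)   # ~1 GB of float32 parameters
chunks = split_into_chunks(params)
print(len(chunks), max(c.nbytes for c in chunks))        # 10 pieces, each under the limit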
The general approach to finding these is to write lots of benchmarks to validate your assumptions about the speed.
Here are some examples:
benchmark sending messages between workers (local)
benchmark of a sharded parameter server (local)
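In a similar spirit, here is a rough local sketch (not one of the benchmarks linked above; TF 1.x graph mode, arbitrary tensor size) that times fetching a large variable through an in-process gRPC server:

import time
import tensorflow as tf  # assumes the TF 1.x graph-mode API

# An in-process gRPC server, so fetches travel through the same RPC path
# they would use against a real cluster.
server = tf.train.Server.create_local_server()

params = tf.Variable(tf.random_normal([100 * 1000 * 1000]))  # ~400 MB of float32
read_op = params.read_value()

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(read_op)                     # warm-up
    runs = 10
    start = time.time()
    for _ in range(runs):
        sess.run(read_op)
    elapsed = (time.time() - start) / runs
    gb = 4 * params.shape.num_elements() / 1e9
    print("%.2f GB fetched in %.3f s (%.2f GB/s)" % (gb, elapsed, gb / elapsed))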
I am implementing a distributed algorithm for pagerank estimation using Storm. I have been having memory problems, so I decided to create a dummy implementation that does not explicitly save anything in memory, to determine whether the problem lies in my algorithm or my Storm structure.
Indeed, while the only thing the dummy implementation does is message-passing (a lot of it), the memory of each worker process keeps rising until the pipeline is clogged. I do not understand why this might be happening.
My cluster has 18 machines (some with 8g, some 16g and some 32g of memory). I have set the worker heap size to 6g (-Xmx6g).
My topology is very very simple:
One spout
One bolt (with parallelism).
The bolt receives data from the spout (fieldsGrouping) and also from other tasks of itself.
My message-passing pattern is based on random walks with a certain stopping probability. More specifically:
The spout generates a tuple.
One specific task from the bolt receives this tuple.
Based on a certain probability, this task generates another tuple and emits it again to another task of the same bolt.
I have been stuck on this problem for quite a while, so it would be very helpful if someone could help.
Best Regards,
Nick
It seems you have a bottleneck in your topology, i.e., a bolt receives more data than it can process. Thus, the bolt's input queue grows over time, consuming more and more memory.
You can either increase the parallelism of the "bottleneck bolt", or enable the fault-tolerance mechanism, which also provides flow control via a limited number of in-flight tuples (https://storm.apache.org/documentation/Guaranteeing-message-processing.html). For this, you also need to set the "max spout pending" parameter.
I noticed there is an option that allows specifying a machine type.
What is the criteria I should use to decide whether to override the default machine type?
In some experiments I saw that throughput is better with smaller instances, but on the other hand jobs tend to experience more "system" failures when many small instances are used instead of a smaller number of default instances.
Thanks,
G
Dataflow will eventually optimize the machine type for you. In the meantime, here are some scenarios I can think of where you might want to change the machine type.
If your ParDo operation needs a lot of memory, you might want to change the machine type to one of the high-memory machines that Google Compute Engine provides.
Optimizing for cost and speed. If your CPU utilization is less than 100% you could probably reduce the cost of your job by picking a machine with fewer CPUs. Alternatively, if you increase the number of machines and reduce the number of CPUs per machine (so total CPUs stays approximately constant) you can make your job run faster but cost approximately the same.
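For example, with the Beam Python SDK you might pin the worker machine type like this (a sketch only; the option name can differ across SDK versions, and the project, region, and bucket below are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                 # placeholder project id
    region='us-central1',
    temp_location='gs://my-bucket/tmp',   # placeholder staging bucket
    machine_type='n1-highmem-8',          # override the default worker machine type
)

with beam.Pipeline(options=options) as p:
    _ = (p
         | beam.Create([1, 2, 3])
         | beam.Map(lambda x: x * x))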
Can you please elaborate more on what type of system failures you are seeing? A large class of failures (e.g. VM interruptions) are probabilistic so you would expect to see a larger absolute number of failures as the number of machines increases. However, failures like VM interruptions should be fairly rare so I'd be surprised if you noticed an increase unless you were using order of magnitude more VMs.
On the other hand, it's possible you are seeing more failures because of resource contention due to the increased parallelism of using more machines. If that's the case, we'd really like to know about it to see if this is something we can address.
Azure embraces the notion of elastic scaling, and I've been able to achieve this with my Worker Roles. However, when it comes to my Web Roles (e.g. MVC apps), I am not sure what to monitor (or how) to determine when it's a good time to increase (or decrease) the number of running instances. I'm assuming I need to monitor one or many performance counters, but I'm not sure where to start.
Can anyone recommend a best practice for assessing an MVC Web Role instance's load relative to scaling decisions?
This question is a bit open-ended, as monitoring is typically app-specific. Having said that:
Start with simple measurements that you'd look at on a local server, representing KPIs for your app, such as network utilization. This TechNet article describes performance counters collected by System Center for Windows Azure. For instance:
ASP.NET Applications Requests/sec
Network Interface Bytes Received/sec
Network Interface Bytes Sent/sec
Processor % Processor Time Total
LogicalDisk Free Megabytes
LogicalDisk % Free Space
Memory Available Megabytes
You may also want to watch # of requests queued and request wait time.
Network utilization is interesting, since your NIC provides approx. 100Mbps per core and could end up being a bottleneck even when CPU and other resources are underutilized. You may need to scale out to more instances to handle high-bandwidth scenarios.
Also: I tend to give less importance to CPU utilization, even though it's so easy to measure (and shows up so frequently in examples). Running a CPU at near capacity is a good thing usually, since you're paying for it and might as well use as much as possible.
As far as decreasing: this needs to be handled a bit more carefully. Windows Azure compute is billed by the hour. If, say, you scale out to an extra instance at 11:50 and scale in again at 12:10, you've just incurred two CPU-hours. Also: you don't want to scale out, then take new measurements and decide you can now scale back again (effectively creating a constant pulse of adding and removing instances). To make things easier, consider the Autoscaling Application Block (WASABi), found in the Enterprise Library. This has all the scale rules baked in (such as the ones I just mentioned) and is very straightforward to use.