I am implementing a distributed algorithm for PageRank estimation using Storm. I have been having memory problems, so I decided to create a dummy implementation that does not explicitly save anything in memory, to determine whether the problem lies in my algorithm or in my Storm structure.
Indeed, even though the only thing the dummy implementation does is message passing (a lot of it), the memory of each worker process keeps rising until the pipeline is clogged. I do not understand why this might be happening.
My cluster has 18 machines (some with 8 GB, some with 16 GB, and some with 32 GB of memory). I have set the worker heap size to 6 GB (-Xmx6g).
My topology is very very simple:
One spout
One bolt (with parallelism).
The bolt receives data from the spout (fieldsGrouping) and also from other tasks of itself.
My message-passing pattern is based on random walks with a certain stopping probability. More specifically:
The spout generates a tuple.
One specific task from the bolt receives this tuple.
Based on a certain probability, this task generates another tuple and emits it again to another task of the same bolt (a rough sketch of this pattern follows below).
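A minimal, non-Storm sketch of the pattern (the 0.85 continue probability and all names are purely illustrative):

```python
# Toy model of the random-walk emission pattern: each incoming tuple is forwarded
# to a random peer task with probability `continue_prob`, otherwise the walk stops.
import random

def walk_length(continue_prob):
    """Number of bolt-to-bolt hops triggered by one spout tuple."""
    hops = 0
    while random.random() < continue_prob:
        hops += 1
    return hops

# With continue_prob = 0.85 (a typical PageRank damping factor), each spout tuple
# triggers about 0.85 / (1 - 0.85) ~= 5.7 extra bolt messages on average, so the
# bolt receives several times more traffic than the spout emits.
samples = [walk_length(0.85) for _ in range(100_000)]
print(sum(samples) / len(samples))
```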
I have been stuck on this problem for quite a while, so it would be very helpful if someone could point me in the right direction.
Best Regards,
Nick
It seems you have a bottleneck in your topology, i.e., a bolt receives more data than it can process. Thus, the bolt's input queue grows over time, consuming more and more memory.
You can either increase the parallelism of the "bottleneck bolt" or enable Storm's fault-tolerance mechanism, which also gives you flow control via a limited number of in-flight tuples (https://storm.apache.org/documentation/Guaranteeing-message-processing.html). For this, you also need to set the "max spout pending" parameter.
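To see why that cap matters, here is a toy Python simulation (not Storm code; all names and numbers are made up) of a bolt input queue that grows without bound when the spout outpaces the bolt, versus one protected by a limit on in-flight tuples:

```python
# Toy flow-control model: an unbounded input queue vs. a cap on in-flight tuples.
from collections import deque

def simulate(steps, emit_per_step, process_per_step, max_in_flight=None):
    queue = deque()      # the bolt's input queue
    in_flight = 0        # tuples emitted but not yet fully processed (acked)
    peak = 0
    for _ in range(steps):
        # Spout side: emit only while under the in-flight cap (if any).
        for _ in range(emit_per_step):
            if max_in_flight is None or in_flight < max_in_flight:
                queue.append(object())
                in_flight += 1
        # Bolt side: process (and ack) a fixed number of tuples per step.
        for _ in range(min(process_per_step, len(queue))):
            queue.popleft()
            in_flight -= 1
        peak = max(peak, len(queue))
    return peak

# Spout emits 100 tuples/step, bolt handles only 60/step.
print(simulate(10_000, 100, 60))                     # peak queue ~400,000 tuples
print(simulate(10_000, 100, 60, max_in_flight=500))  # peak queue stays at the cap (500)
```

In Storm itself, that cap is what the "max spout pending" setting (topology.max.spout.pending) controls, and it only takes effect when tuples are emitted with message IDs, i.e., with the reliability mechanism enabled.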
In the clustering / causal consistency docs, Neo4j talks about how it uses async replication between the primaries (which are in a Raft group and replicate writes to over half the nodes) and the secondaries (read replicas). AIUI, you can guarantee read-after-write consistency by passing a token from the writer (after commit) to the reader; the reader then uses the token to query, and the read replica will only respond once it has replicated that data.
That seems great, but what is the minimum replication latency I would have to wait for? The docs are understandably cagey about committing to a maximum (since an overwhelmed or misconfigured reader might not make progress at all!), but does log tailing after a commit happen essentially continuously, or once a second, or once a minute?
The reason I care is that I want to bound my write-to-read latency to less than a second (except in rare tail cases where read replicas are unhealthy). If the log shipping is slower than that even in the happy case, I can't use read replicas and would have to scale out the primaries instead (which has its own latency impact, of course, but is probably a bit better).
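Concretely, the pattern I have in mind looks roughly like the sketch below (assuming the official Neo4j Python driver and its 5.x bookmark API; the URI, credentials, and queries are placeholders):

```python
# Sketch of read-after-write via causal-consistency bookmarks (Neo4j Python driver 5.x assumed).
from neo4j import GraphDatabase, READ_ACCESS

driver = GraphDatabase.driver("neo4j://cluster.example.com:7687",
                              auth=("neo4j", "password"))

# Write on a primary and capture the bookmark (the "token") after commit.
with driver.session() as write_session:
    write_session.run("CREATE (:Event {id: $id})", id=42).consume()
    token = write_session.last_bookmarks()

# Read on a replica, passing the bookmark: the replica holds the query until it
# has replicated at least up to that transaction. The wall-clock wait here is
# exactly the replication lag I am asking about.
with driver.session(bookmarks=token, default_access_mode=READ_ACCESS) as read_session:
    result = read_session.run("MATCH (e:Event {id: $id}) RETURN e", id=42)
    print(result.single())

driver.close()
```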
I am working on optimizing the communication costs in federated learning, so I need to simulate realistic network delays and measure the communication overhead (the communication between the clients and the server). Is it possible to do that with TFF? Is there a realistic networking model for communication in a federated learning setting?
Introducing network latency or delays in the execution stack is not something that TFF currently supports out of the box.
However, architecturally this is absolutely possible. One example of a recent contribution that addresses a similar request is the SizingExecutor, which measures bits passed through it on the way down and up in the execution hierarchy. Placing a SizingExecutor immediately on top of each executor representing a client, then, measures the bits broadcast and aggregated in each federated computation run through this execution stack; this implementation can be found here, and is in fact exposed in the public API.
What you want is not entirely dissimilar from the sizing executor, and the sizing executor may serve your purpose directly if you take total bits per round as the metric you are trying to optimize. If, however, you would rather examine other aspects of distributed computation (e.g., random data corruption), you might do so by implementing functionality similar to the sizing executor, though one could also imagine doing this at the computation level (a client chooses at random whether to return its true result or a corrupted version of it).
I think from a design perspective, TFF would prefer any new executors to leave the semantics of the computations they execute unchanged, and would steer you towards either simply measuring properties like bits per round, or introducing any corruptions into the computation or algorithm directly, rather than into the execution of these computations. The kind of corruption or delay a client can choose to introduce is effectively arbitrary; here is an example of a recent research project attempting to attack the global model by inserting malicious updates on certain clients. The same approach could be used, I imagine, to simulate any desired network property (e.g., some clients sleep, some send back corrupted updates, etc.).
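To make the computation-level option concrete, here is a plain Python/NumPy sketch (none of this is TFF API; the function names, probabilities, and shapes are all illustrative):

```python
# Illustrative client update that randomly drops or corrupts its result, plus a
# "bits per round" metric of the kind the SizingExecutor approach measures.
import random
import numpy as np

def client_update(global_weights, corruption_prob=0.1, drop_prob=0.05):
    """Compute a (fake) model delta, then randomly corrupt or drop it."""
    delta = np.random.randn(*global_weights.shape) * 0.01  # stand-in for local training
    if random.random() < drop_prob:
        return None                                        # client sleeps / never reports back
    if random.random() < corruption_prob:
        delta = delta + np.random.randn(*delta.shape)      # corrupted update
    return delta

def round_upload_bits(updates):
    """Total bits uploaded this round by clients that actually responded."""
    return sum(u.nbytes * 8 for u in updates if u is not None)

global_weights = np.zeros((1000,))
updates = [client_update(global_weights) for _ in range(100)]
print(round_upload_bits(updates))
```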
Hope this helps!
I am using TensorFlow to train a DNN. My network structure is very simple, and each minibatch takes about 50 ms with only one parameter server and one worker. In order to process huge numbers of samples, I am using distributed ASGD training. However, I found that increasing the worker count does not increase throughput: for example, 40 machines achieve 1.5 million samples per second, and after doubling both the parameter server and worker machine counts, the cluster still processes only 1.5 million samples per second, or even fewer. The reason is that each step takes much longer when the cluster is large. Does TensorFlow have good scalability, and is there any advice for speeding up training?
The general approach to solving these problems is to find where the bottlenecks are. You could be hitting a bottleneck in software or in your hardware.
A general example of doing the math: suppose you have 250M parameters and each backward pass takes 1 second. This means each worker will be sending 1 GB/sec of data and receiving 1 GB/sec of data. If you have 40 machines, that's 80 GB/sec of transfer between workers and the parameter servers. Suppose the parameter server machines only have 1 GB/sec full-duplex NIC cards. This means that if you have fewer than 40 parameter server shards, your NIC speed will be the bottleneck.
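The same estimate in code form (assuming float32 parameters, which is what makes 250M parameters come out to 1 GB):

```python
# Back-of-envelope bandwidth math for 40 workers pushing full gradients every second.
params = 250e6
bytes_per_param = 4                  # float32
step_time_s = 1.0                    # one backward pass per second
workers = 40
ps_nic_gbytes_per_s = 1.0            # full duplex: 1 GB/s in each direction

per_worker = params * bytes_per_param / step_time_s / 1e9   # 1.0 GB/s sent and 1.0 GB/s received
each_way = workers * per_worker                             # 40 GB/s up + 40 GB/s down = 80 GB/s total
min_ps_shards = each_way / ps_nic_gbytes_per_s              # need at least 40 shards or the NICs saturate

print(per_worker, each_way, min_ps_shards)
```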
After ruling that out, you should consider interconnect speed. You may have N network cards in your cluster, but the cluster most likely can't handle all network cards sending data to all other network cards at full speed. Can your cluster handle 80 GB/sec of data flowing between 80 machines? Google designs its own network hardware to handle its interconnect demands, so this is an important problem constraint.
Once you have checked that your network hardware can handle the load, I would check the software. For example, suppose you have a single worker: how does "time to send" scale with the number of parameter server shards? If the scaling is strongly sublinear, this suggests a bottleneck, perhaps some inefficient scheduling of threads or something similar.
As an example of finding and fixing a software bottleneck, see the "grpc RecvTensor is slow" issue. That issue involved the gRPC layer becoming inefficient when sending messages larger than 100 MB. The issue was fixed in an upstream gRPC release but has not been integrated into a TensorFlow release yet, so the current work-around is to break messages into pieces of 100 MB or smaller.
The general approach to finding these is to write lots of benchmarks to validate your assumptions about the speed.
Here are some examples:
benchmark sending messages between workers (local)
benchmark of a sharded parameter server (local)
I am confused by a Hadoop namenode memory problem.
When namenode memory usage is higher than a certain percentage (say 75%), reading and writing HDFS files through the Hadoop API fails (for example, some open() calls throw exceptions). What is the reason? Has anyone seen the same thing?
P.S. This time the namenode disk I/O is not high and the CPU is relatively idle.
What determines the namenode's QPS (queries per second)?
Thanks very much!
Since the namenode is basically just an RPC server managing a HashMap of the blocks, you have two major memory problems:
The Java HashMap is quite costly in memory, and its collision resolution (separate chaining) is costly as well, because it stores colliding elements in a linked list.
The RPC server needs threads to handle requests. Hadoop ships with its own RPC framework, and you can configure the handler count with dfs.namenode.service.handler.count for the datanodes (it defaults to 10), or with dfs.namenode.handler.count for other clients, such as MapReduce jobs and JobClients that want to run a job. When a request comes in and the namenode wants to create a new handler, it may run out of memory (new threads also allocate a good chunk of stack space; maybe you need to increase this).
So these are the reasons why your namenode needs so much memory.
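As a rough back-of-envelope illustration of where that memory goes (the per-object and per-thread figures below are common rules of thumb, not exact values for your JVM or Hadoop version, and the block count is just an example):

```python
# Rough namenode memory estimate; all figures are rules of thumb or examples.
blocks = 50_000_000                  # example: files + blocks tracked by the namenode
bytes_per_namespace_object = 150     # commonly quoted ~150 bytes per file/directory/block object
handler_threads = 10 + 10            # dfs.namenode.service.handler.count + dfs.namenode.handler.count (defaults)
thread_stack_bytes = 1 * 1024**2     # default -Xss on a 64-bit JVM is roughly 1 MB

namespace_gb = blocks * bytes_per_namespace_object / 1024**3
stacks_mb = handler_threads * thread_stack_bytes / 1024**2
print(f"namespace objects: ~{namespace_gb:.1f} GB, handler thread stacks: ~{stacks_mb:.0f} MB")
```

The point is that the HashMap-backed namespace dominates, while extra handler threads add a smaller but still real amount on top of it.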
What determines the namenode's QPS (queries per second)?
I haven't benchmarked it yet, so I can't give you very good tips on that. Certainly, tune the handler counts to be higher than the number of tasks that can run in parallel plus speculative execution.
Depending on how you submit your jobs, you have to fine tune the other property as well.
Of course, you should always give the namenode enough memory so that it has headroom and does not fall into full garbage collection cycles.
Between nodes, messages are (must be) passed over TCP/IP. However, by what mechanism are they passed between processes running on the same node? Is TCP/IP used in this case as well? Unix domain sockets? What is the difference in performance between "within node" and "between node" message passing?
by what mechanism are they passed between processes running on the same node?
Because Erlang processes on the same node are all running within a single native process — the BEAM emulator — message structures are simply copied into the receiver's message queue. The message structure is copied, rather than simply referenced, for all the standard no-side-effects functional programming reasons.
See erts_send_message() in erts/emulator/beam/erl_message.c in the Erlang sources for more detail. In R15B01, the bits most relevant to your question start at line 980 or so, with the call to erts_queue_message().
If you did choose to run multiple BEAM emulators on a single physical machine, I would guess messages get sent between them the same way as between different physical machines. There's probably no good reason to do that now that BEAM has good SMP support, though.
What is the difference in performance between "within node" and "between node" message passing?
A simple benchmark on your actual hardware would be more useful to you than anecdotal evidence from others.
If you want generalities, however, observe that memory bandwidths are around 20 GByte/sec these days, and that you're unlikely to have a network link faster than 10 Gbit/sec between nodes. That means that while there may be many differences between your actual application and any simple benchmark you perform or find, these differences probably cannot swamp an order of magnitude difference in transfer rate.
If you "only" have a 1 Gbit/sec end-to-end network link between nodes, intranode transfers will probably be over two orders of magnitude faster than internode transfers.
"All data in messages between Erlang processes is copied, with the exception of refc binaries on the same Erlang node.":
http://erlang.org/doc/efficiency_guide/processes.html#id2265332