Here is my problem:
There are n peers in the P2P network, which request the same data block; And with some constraint.
1. Peers with its own upload bandwidth, and the average bandwidth is the size of the data block.
2. The peers have different deadline about this data block. If one peer didnt get the entire block before the deadline, it has to search for the server help.
3. A peer can transfer data (partial or entire) only if it has the entire data block.
The object is to minimize the server total upload, I cant figure it out if it has an optimal algorithm or it is an NP problem. Deadline first or largest bandwidth first may not deal with some situation
Is there some NP problem similar to this? This is like a graph flow problem or an instruction scheduling, but I found that it is difficult cause I have to deal with the deadline and the growth of the suppliers total bandwidth at the same time.
I hope that I can get some directions or resource about the solution :)
Thanks.
Considering that each peer acts individually in your case, it is not like only one automata is solving your issue, but many. Since fetching a data block when it is not available within a given delay, is typically a polynomial problem, and since the job is accomplished by individual peers, your issue is not an NP problem for each peer locally.
On the other side, if a server has to compute the minimal allocation of backup resources to transfer 'missing blocks', you would have to first find out about the probability that a peer misses a block (average + standard deviation for example). Assuming you know the statistical distribution of such events, you could compute the total bandwidth you would need to transfer those missing blocks with a chosen risk of failure/tolerance in the bandwidth. If you are using multiple servers to cover for the need, make sure your peers contact them randomly to distribute the load.
Solving this statistical problem is not an NP issue. You can collect failure info from each peer and add it on a central/server peer. Therefore, my conclusion is that your issue is not an NP problem.
PART II:
Oh, I understand your case better now: multiple 'server' peers can potentially help one peer getting a full block. In this case, the number of server peers grows exponentially in your system for a given block. In this case, this optimization problem has all the characteristic of a flooding problem for me and they are NP.
Even if your graph of peers and the potential connections between them was static (which is never the case in a real P2P network), computing the optimal solution in a reasonable amount of time for more than 50 or 100 nodes is virtually impossible, unless you can make very specific assumptions on this graph (which is almost never the case in general and not always useful).
But do you absolutely need to have the absolute optimal solution or is something near the optimal good enough?
Heuristics will tell you that if your peers have more or less the same download bandwidth capacity, then it makes sense to serve peers with the highest UPLOAD bandwidth first to maximize the avalanche effect and to reduce the risk for a peer having to ask for help, in general.
If your graph is relatively balanced (that is, most peers can connect to most peers), then I bet the minimum bandwidth of initial servers will be a logarithmic function of the number of nodes in your graph times the average speed at which peers expect to be served. This is only my gut feeling and should be validated with real measures or a strong modeling of your case.
Related
I am working on optimizing the communication costs in Federated Learning. Therefore, I need to simulate realistic network delays and measure communication overhead (the communication between the clients and the server). Is it possible to do that with TFF? Is there a realistic networking model for communications in Federated Learning setting?
Introducing network latency or delays in the execution stack is not something that TFF currently supports out of the box.
However, architecturally this is absolutely possible. One example of a recent contribution that addresses a similar request is the SizingExecutor, which measures bits passed through it on the way down and up in the execution hierarchy. Placing a SizingExecutor immediately on top of each executor representing a client, then, measures the bits broadcast and aggregated in each federated computation run through this execution stack; this implementation can be found here, and is in fact exposed in the public API.
Your desire is not entirely dissimilar to the sizing executor, and the sizing executor may serve your purpose directly, if you take total bits ber round as the metric you are trying to optimize. If however you would rather be examining other aspects of distributed computation (e.g. random data corruption) you may imagine doing so by implementing similar functionality to the sizing executor, though one could also imagine doing this at the computation level (a client chooses at random whether to return its true result or a corrupted version of its result).
I think from a design perspective, TFF would prefer any new executors to leave the semantics of the computations they are executing unchanged, and would steer towards either simply measuring properties like bits per round, or introducing any corruptions into the computation or algorithm directly, rather then in execution of these computations. The kind of corruption or delay a client can choose to introduce is effectively arbitrary; here is an example of a recent research project attempting to attack the global model by inserting malicious updates on certain clients. The same approach could be used, I imagine, to simulate any desired network property (e.g., some clients sleep, some send back corrupted updates, etc.).
Hope this helps!
I'm doing some research about the problem blockchain applications is facing (scalability).
At the moment I'm reading: https://hackernoon.com/blockchains-dont-scale-not-today-at-least-but-there-s-hope-2cb43946551a
There was something I got stuck on.
"The number of transactions the blockchain can process can never
exceed that of a single node that is participating in the network."
Is this correct? Are we talking strictly about PoW? I can't seem to understand this.
I tough the highest transaction throughput is capped at maximum block size divided by block interval.
Validating a single transaction is relatively inconsequential. This is not the same as "mining" a new block with Bitcoin's PoW algorithm. Validating a transaction usually means confirming that a cryptographic signature is valid, as well as some other data validation. This can be done quickly, but adds up as you get more transactions. On the other hand, mining a block means brute forcing a hash, and is extremely CPU intensive. However, this only needs to be done once per block for the whole network.
The article is well written, and accurate as far as I know. Blockchains as they currently exist will be limited to a relatively low transaction throughput, probably not more than a few thousand per second. This will be fine for many use cases, but will likely prevent them being used for high rate applications like a stock market.
This is true for a blockchain network based on a single chain. Security of any blockchain network is based on full nodes which do a full verification of block-candidates according to protocol rules. That is a reason why transaction throughput can't exceed aforementioned limit without some trade-of.
This is true for PoS networks too. However, calculation of this limit for PoS networks is more sophisticated.
Many projects attempt to resolve this problem by splitting the blockchain into multiple chains or shards. However, these shard chains remain heavily interconnected. Thus state splitting there often appears to be an illusion.
I suggest you to take a look at JaxNet protocol.
I am using tensorflow to train DNN, my network structure is very simple, each minibatch takes about 50ms when only one parameter server and one worker. In order to process huge samples, I am using distributed ASGD training, however, I found that increasing worker count could not increase throughput, for example, 40 machines could achieve 1.5 million samples per second, after doubling parameter server machine count and worker machine count, cluster still could only process 1.5 million samples per second or even worse. The reason is each step takes much longer when cluster is large. Does tensorflow have good scalibility, and any advice for speeding up training?
General approach to solving these problems is to find where bottlenecks are. You could be hitting a bottleneck in software or in your hardware.
General example of doing the math -- suppose you have 250M parameters, and each backward pass takes 1 second. This means each worker will be sending 1GB/sec of data and receiving 1GB/sec of data. If you have 40 machines, that'll be 80GB/sec of transfer between workers and parameter server. Suppose parameter server machines only have 1GB/sec fully duplex NIC cards. This means that if you have less than 40 parameter server shards, then your NIC card speed will be the bottleneck.
After ruling that out, you should consider interconnect speed. You may have N network cards in your cluster, but the cluster most likely can't handle all network cards sending data to all other network cards. Can your cluster handle 80GB/sec of data flowing between 80 machines? Google designs their own network hardware to handle their interconnect demands, so this is an important problem constraint.
Once you checked that your network hardware can handle the load, I would check software. IE, suppose you have a single worker, how does "time to send" scale with the number of parameter server shards? If the scaling is strongly sublinear, this suggests a bottleneck, perhaps some inefficient scheduling of threads or some-such.
As an example of finding and fixing a software bottleneck, see grpc RecvTensor is slow issue. That issue involved gRPC layer become inefficient if you are trying to send more than 100MB messages. This issue was fixed in upstream gRPC release, but not integrated into TensorFlow release yet, so current work-around is to break messages into pieces 100MB or smaller.
The general approach to finding these is to write lots of benchmarks to validate your assumptions about the speed.
Here are some examples:
benchmark sending messages between workers(local)
benchmark sharded PS benchmark (local)
I am implementing a distributed algorithm for pagerank estimation using Storm. I have been having memory problems, so I decided to create a dummy implementation that does not explicitly save anything in memory, to determine whether the problem lies in my algorithm or my Storm structure.
Indeed, while the only thing the dummy implementation does is message-passing (a lot of it), the memory of each worker process keeps rising until the pipeline is clogged. I do not understand why this might be happening.
My cluster has 18 machines (some with 8g, some 16g and some 32g of memory). I have set the worker heap size to 6g (-Xmx6g).
My topology is very very simple:
One spout
One bolt (with parallelism).
The bolt receives data from the spout (fieldsGrouping) and also from other tasks of itself.
My message-passing pattern is based on random walks with a certain stopping probability. More specifically:
The spout generates a tuple.
One specific task from the bolt receives this tuple.
Based on a certain probability, this task generates another tuple and emits it again to another task of the same bolt.
I am stuck at this problem for quite a while, so it would be very helpful if someone could help.
Best Regards,
Nick
It seems you have a bottleneck in your topology, ie, a bolt receivers more data than in can process. Thus, the bolt's input queue grows over time consuming more and more memory.
You can either increase the parallelism for the "bottleneck bolt" or enable fault-tolerance mechanism which also enables flow-control via limited number of in-flight tuples (https://storm.apache.org/documentation/Guaranteeing-message-processing.html). For this, you also need to set "max spout pending" parameter.
I need to build and analyze a complex network using neo4j and would like to know what is the recommended hardware for the following setup:
There are three types of nodes.
There are three types of relationships.
At the steady state, the network will contain about 1M nodes of each type and about the same amount of edges
Every day, about 500K relationships are updated and 100K nodes and edges are added. Approximately the same amount of nodes/edges are also removed.
Network update will be done in daily batches and we can tolerate update times of 1-2 hours
Once the system is up, we will quire the database for shortest paths between different nodes. Not more than 500K times per day. We can live with batch query.
Most probably, I'll use REST API
I think you should take a look at Neo4j Hardware requirements.
For the server you're talking about, I think the first thing needed will obviously be a large bandwidth. If your requests are done in a short time, it'll be needed.
Apart from that, a "normal" server should be enough :
8 or more cores
At least 24Go ram
At least 1To SSD storage (this one is important and expensive)
A good bandwidth (like 1Gbps)
By the way, it's not a programming question, so I think you should have asked this to Neo4j.
You can use Neo4j Hardware sizing calculator for rough estimation of the HW needs.