Erlang messages when there are lots of nodes or binary data

Would native Erlang messages provide reasonable performance when there are lots of nodes or binary data?
Case 1: There's a dynamic pool of about 50-200 machines (Erlang nodes). It's constantly changing, with about 5-50 machines added or removed every 10 minutes.
Case 2: Let's say we are using this cluster to build a YouTube clone and plan to stream video data via messages.
By reasonable performance I mean: it's OK to be 2-3 times slower than the best performance achievable with complex, hand-tuned Erlang code; 10 times slower is not OK.

There is no significant difference between sending a message and sending binary data. The message is simply transformed into a binary packet using term_to_binary and sent over TCP, and the same applies to binary data. (Well, it is a little smarter than that, because the textual form of the same atoms is not sent again and again, as plain term_to_binary would do.) So the difference is negligible.
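For a rough feel of what this means, here is a sketch (the module and payload names are made up): it encodes a bare binary chunk and a message tuple wrapping it with term_to_binary/1, the same external term format the distribution layer uses, and compares the two sizes. They differ only by a few bytes of framing.

    -module(wire_size).
    -export([compare/0]).

    %% Sketch only: a message wrapping a binary and the bare binary encode
    %% to almost the same number of bytes in the external term format.
    compare() ->
        Payload = crypto:strong_rand_bytes(64 * 1024),   %% 64 KB binary chunk
        Msg     = {video_chunk, 17, Payload},            %% message wrapping the chunk
        io:format("bare binary: ~p bytes, message: ~p bytes~n",
                  [byte_size(term_to_binary(Payload)),
                   byte_size(term_to_binary(Msg))]).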

There are important details:
1) In clusters of over 100 nodes, the keep-alive ("ping") traffic of a fully connected cluster becomes a significant part of the network traffic. Even bigger deployments require deep changes to the Erlang VM and the OS.
2) If you want to stream video or audio, you need to plan the capacity of a single node: clients per node, TCP/UDP packet rate, network bandwidth.
3) There is a performance limit of roughly 150-200K messages/s between two processes on different nodes; a rough way to measure this on your own hardware is sketched after this list.
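For point 3, a minimal sketch along these lines (the module, function, and node names are made up, not part of the original answer) times synchronous round trips between two already connected nodes. Pipelined one-way throughput will be higher, but this gives an order-of-magnitude feel; the module must be compiled and loaded on both nodes.

    -module(msg_bench).
    -export([run/1, echo/0]).

    %% Sketch only. Usage: msg_bench:run('other@host').
    run(Node) ->
        Echo = spawn(Node, ?MODULE, echo, []),   %% echo process on the remote node
        N = 100000,
        {Micros, ok} = timer:tc(fun() -> loop(Echo, N) end),
        io:format("~p round trips/s~n", [N * 1000000 div Micros]).

    echo() ->
        receive {From, Msg} -> From ! Msg, echo() end.

    loop(_Echo, 0) -> ok;
    loop(Echo, N) ->
        Echo ! {self(), ping},
        receive ping -> ok end,                  %% wait for the echo
        loop(Echo, N - 1).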

Related

STM32F407ZET6, Is it possible for Multiple streams of a DMA to run in Parallel?

Hi, I am using the STM32F407ZET6 microcontroller and I want to use multiple streams of DMA1. Is it possible to trigger two different streams of the same DMA to transfer data to two different peripherals simultaneously (i.e., in parallel)?
In the advanced AHB bus matrix I observe that for each DMA there are only two lines, one for memory and one for the peripheral, which suggests to me that at most two streams can run in parallel at any time, and even then only if neither stream is doing a memory<->peripheral transfer. Is this assumption correct? And is it also correct that, for two streams to run in parallel through a single DMA, neither should be doing a memory<->peripheral transfer? What I mean is that, from the look of the AHB matrix, it seems that if only mem-to-mem and periph-to-periph transfers are done then two streams can probably run in parallel, but if either of them does a memory<->peripheral transfer, the use of the DMA's memory and peripheral interfaces for a single transfer will probably make that impossible. Can you shed some light on this?
I would like to request some guidance on this particular topic, as I could not find satisfactory information on it. Also, if running streams in parallel depends on the bus bandwidth, how is the bandwidth divided among multiple channels for a single bus performing multiple transfers? If there is any such example, I would be thankful. As a reference, I have put the AHB matrix below.
You can only select one channel per stream, but you can enable all 8 streams per DMA peripheral at once if you like, subject to the hardware defect listed in the errata sheet*.
Each of the masters takes turns to access the buses. Once a master takes the bus, it decides how long to use it for. For the DMA master, this is configured with the MBURST and PBURST bits of the DMA_SxCR register. If you require very low latency in the system and do not want the processor or another master (Ethernet, etc.) to be stalled waiting for the DMA to get off the bus, then set the burst configuration short (but even the longest burst you can configure will still only last a microsecond or so).
(*) there is a hardware defect in DMA2 which disallows concurrent use of AHB and APB peripherals, see the errata for details.

TensorFlow scalability

I am using TensorFlow to train a DNN. My network structure is very simple, and each minibatch takes about 50ms with only one parameter server and one worker. In order to process huge numbers of samples, I am using distributed ASGD training. However, I found that increasing the worker count does not increase throughput. For example, 40 machines could achieve 1.5 million samples per second; after doubling the parameter server machine count and the worker machine count, the cluster could still only process 1.5 million samples per second, or even fewer. The reason is that each step takes much longer when the cluster is large. Does TensorFlow have good scalability, and is there any advice for speeding up training?
The general approach to solving these problems is to find where the bottlenecks are. You could be hitting a bottleneck in software or in your hardware.
A general example of doing the math -- suppose you have 250M parameters (roughly 1GB at 4 bytes per parameter), and each backward pass takes 1 second. This means each worker will be sending 1GB/sec of data and receiving 1GB/sec of data. If you have 40 machines, that'll be 80GB/sec of transfer between workers and the parameter server. Suppose the parameter server machines only have 1GB/sec full-duplex NICs. This means that if you have fewer than 40 parameter server shards, then your NIC speed will be the bottleneck.
After ruling that out, you should consider interconnect speed. You may have N network cards in your cluster, but the cluster most likely can't handle all network cards sending data to all other network cards. Can your cluster handle 80GB/sec of data flowing between 80 machines? Google designs their own network hardware to handle their interconnect demands, so this is an important problem constraint.
Once you have checked that your network hardware can handle the load, I would check the software. I.e., suppose you have a single worker: how does "time to send" scale with the number of parameter server shards? If the scaling is strongly sublinear, this suggests a bottleneck, perhaps some inefficient scheduling of threads or some such.
As an example of finding and fixing a software bottleneck, see the "grpc RecvTensor is slow" issue. That issue involved the gRPC layer becoming inefficient when sending messages larger than 100MB. It was fixed in an upstream gRPC release but had not yet been integrated into a TensorFlow release at the time, so the work-around was to break messages into pieces of 100MB or smaller.
The general approach to finding these is to write lots of benchmarks to validate your assumptions about the speed.
Here are some examples:
a benchmark sending messages between workers (local)
a sharded parameter server benchmark (local)

Mirrored queue performance factors

We operate two dual-node brokers, each broker having quite different queues and workloads. Each box has 24 cores (H/T) worth of Xeon E5645 @ 2.4GHz with 48GB RAM, connected by Gigabit LAN with ~150μs latency, running RHEL 5.6, RabbitMQ 3.1, and Erlang R16B with HiPE off. We've tried with HiPE on, but it made no noticeable performance impact and was very crashy.
We appear to have hit a ceiling for our message rates of between 1,000/s and 1,400/s both in and out. This is broker-wide, not per-queue. Adding more consumers doesn't improve throughput overall, just gives that particular queue a bigger slice of this apparent "pool" of resource.
Every queue is mirrored across the two nodes that make up the broker. Our publishers and consumers connect equally to both nodes in a persistent way. We notice an ADSL-like asymmetry in the rates too; if we manage to publish a high rate of messages, the delivery rate drops to high double digits. Testing with an un-mirrored queue has much higher throughput, as expected. Queues and exchanges are durable; messages are not persistent.
We'd like to know what we can do to improve the situation. The CPU on the box is fine, beam takes a core and a half for 1 process, then another 80% each of two cores for another couple of processes. The rest of the box is essentially idle. We are using ~20GB of RAM in userland with system cache filling the rest. IO rates are fine. Network is fine.
Is there any Erlang/OTP tuning we can do? delegate_count is the default 16, could someone explain what this does in a bit more detail please?
This is difficult to answer without knowing more about how your producers and consumers are configured, which client library you're using, and so on. As discussed on IRC (http://dev.rabbitmq.com/irclog/index.php?date=2013-05-22) a minute ago, I'd suggest you attempt to reproduce the topology using the MulticastMain Java load test tool that ships with the RabbitMQ Java client. You can configure multiple producers/consumers, message sizes and so on. I can certainly get 5kHz out of a two-node cluster with HA on my desktop, so this may be a client (or application code) related issue.

How scalable is distributed Erlang?

Part A:
Erlang has a lot of success stories about running concurrent agents e.g. the millions of simultaneous Facebook chats. That's millions of agents, but of course it's not millions of CPUs across a network. I'm having trouble finding metrics on how well Erlang scales when scaling is "horizontal" across a LAN/WAN.
Let's assume that I have many (tens of thousands) physical nodes (running Erlang on Linux) that need to communicate and synchronize small infrequent amounts of data across the LAN/WAN. At what point will I have communications bottlenecks, not between agents, but between physical nodes? (Or will this just work, assuming a stable network?)
Part B:
I understand (as an Erlang newbie, meaning I could be totally wrong) that Erlang nodes attempt to all connect to and be aware of each other, resulting in an N^2 point-to-point connection network. Assuming that part A won't just work with N in the tens of thousands, can Erlang be configured easily (using out-of-the-box config or trivial boilerplate, not writing a full implementation of grouping/routing algorithms myself) to cluster nodes into manageable groups and route system-wide messages through the cluster/group hierarchy?
We should specify that we are talking about the horizontal scalability of physical machines -- that's the only problem. The CPUs on one machine will be handled by one VM, no matter how many of them there are.
node = machine.
To begin, I can say that you get 30-60 nodes out of the box (a vanilla OTP installation) with any custom application written on top of it (in Erlang). Proof: ejabberd.
~100-150 is possible with an optimized custom application. I mean, it has to be good code, written with knowledge of GC, the characteristics of the data types, message passing, etc.
Over 150 is still all right, but when we talk about numbers like 300 or 500, it will require optimizations and customizations of the TCP layer. Also, the application has to be aware of the cost of, for example, synchronous calls across the cluster.
The other thing is the DB layer. Mnesia (built-in), due to its features, will not be effective beyond ~20 nodes (my experience; I may be wrong). Solution: just use something else: Dynamo-style DBs, a separate cluster of MySQL servers, HBase, etc.
The most common technique to balance the cost of building a high-quality application against scalability is a federation of clusters of ~20-50 nodes each. So internally it is an efficient mesh of ~50 Erlang nodes, connected via any suitable protocol to N other 50-node clusters. To sum up, such a system is a federation of N Erlang clusters.
Distributed Erlang is designed to run in one data center. If you need more, geographically distant, nodes, then use federations.
There are lots of config options, for example ones which do not connect all nodes to each other. They may be helpful; however, in a ~50-node cluster the Erlang overhead is not significant. You can also create a graph of Erlang nodes using 'hidden' connections, which do not join the full mesh, but then also cannot benefit from being connected to all nodes.
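As a concrete sketch of those options (the command lines and node names are illustrative, not from the original answer):

    %% Start a node that does not automatically connect to every node it
    %% hears about, i.e. it stays out of the full mesh:
    %%   erl -name relay@host1 -connect_all false
    %%
    %% Start a node whose connections are "hidden": they are not propagated
    %% to other nodes and are not returned by nodes/0:
    %%   erl -name probe@host2 -hidden
    %%
    %% From a running node you can inspect both kinds of connection:
    nodes().          %% visible (fully meshed) nodes
    nodes(hidden).    %% hidden connections only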
The biggest problem I see in this kind of system is designing it as a masterless system. If you do not need that, everything should be OK.

How does Erlang pass messages between processes on the same node?

Between nodes, messages are (must be) passed over TCP/IP. However, by what mechanism are they passed between processes running on the same node? Is TCP/IP used in this case as well? Unix domain sockets? What is the difference in performance between "within node" and "between node" message passing?
by what mechanism are they passed between processes running on the same node?
Because Erlang processes on the same node are all running within a single native process — the BEAM emulator — message structures are simply copied into the receiver's message queue. The message structure is copied, rather than simply referenced, for all the standard no-side-effects functional programming reasons.
See erts_send_message() in erts/emulator/beam/erl_message.c in the Erlang sources for more detail. In R15B01, the bits most relevant to your question start at line 980 or so, with the call to erts_queue_message().
If you did choose to run multiple BEAM emulators on a single physical machine, I would guess messages get sent between them the same way as between different physical machines. There's probably no good reason to do that now that BEAM has good SMP support, though.
What is the difference in performance between "within node" and "between node" message passing?
A simple benchmark on your actual hardware would be more useful to you than anecdotal evidence from others.
If you want generalities, however, observe that memory bandwidths are around 20 GByte/sec these days, and that you're unlikely to have a network link faster than 10 Gbit/sec between nodes. That means that while there may be many differences between your actual application and any simple benchmark you perform or find, these differences probably cannot swamp an order of magnitude difference in transfer rate.
If you "only" have a 1 Gbit/sec end-to-end network link between nodes, intranode transfers will probably be over two orders of magnitude faster than internode transfers.
"All data in messages between Erlang processes is copied, with the exception of refc binaries on the same Erlang node.":
http://erlang.org/doc/efficiency_guide/processes.html#id2265332
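A quick way to see that refc-binary exception in practice (a sketch to paste into an Erlang shell; the sizes are arbitrary): sending a large binary to a local process copies only a small handle rather than the payload, so the send cost barely grows with the binary's size.

    Big  = binary:copy(<<0>>, 50 * 1024 * 1024),        %% a 50 MB refc binary
    Sink = spawn(fun() -> receive _ -> ok end end),     %% throwaway receiver
    {Micros, _} = timer:tc(fun() -> Sink ! Big end),    %% only the handle is copied
    io:format("local send of 50 MB took ~p microseconds~n", [Micros]).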
