In a parallel CPU architecture, how do CPUs in a distributed memory architecture communicate with each other?

In a shared memory CPU architecture, CPUs that want to communicate with each other have to look for a "variable" that they share inside the shared memory. But what if we have distributed memory instead? How do the CPUs communicate with each other, if at all?
PS - The type of processing here is parallel.

Q : "How do the CPU's ( In parallel CPU architecture ) communicate with each other, if at all?"
Let me take one such smart example: the Epiphany™ architecture, designed by Andreas Olofsson and his Adapteva team.
These smart many-core parallel RISC CPUs use hardware-implemented NoC 2D-mesh networks (named eMesh™) for intra-system, inter-node hardware interactions, consisting of three specialised network layers [ cMesh | xMesh | rMesh ], and extend their reach with the inter-system eLINK™ network to communicate with other, non-local systems.
This shows how smart parallel architectures can advance the state of the art. Great respect to Andreas Olofsson and his team.
Just for context - this came decades after the pioneers from InMOS (Bristol, UK) first launched the Transputer architecture with (guess what) similar parallel networks, together with a tightly knit parallel language, occam. Occam helped generate truly parallel software that fit the Transputer's in-silicon properties, and it was, IMHO, until recent years the most productive parallel-systems language with support for real-time system control, used to design such smart & demanding parallel-control systems as the ESA Giotto satellite and many others.
PS :
A cool lesson on how to do [PARALLEL]-computing right, right from the in-silicon level, wasn't it? All respect goes to the InMOS / occam people.

Related

How to discover the high-performance network interface on a linux HPC cluster?

I have a distributed program which communicates with ZeroMQ that runs on HPC clusters.
ZeroMQ uses TCP sockets, so by default on HPC clusters the communications will use the admin network, so I have introduced an environment variable read by my code to force communication on a particular network interface.
With Infiniband (IB), usually it is ib0. But there are cases where another IB interface is used for the parallel file system, or on Cray systems the interface is ipogif, on some non-HPC systems it can be eth1, eno1, p4p2, em2, enp96s0f0, or whatever...
The problem is that I need to ask the administrator of the cluster the name of the network interface to use, while codes using MPI don't need to because MPI "knows" which network to use.
What is the most portable way to discover the name of the high-performance network interface on a linux HPC cluster? (I don't mind writing a small MPI program for this if there is no simple way)
There is no simple way and I doubt a complete solution exists. For example, Open MPI comes with an extensive set of ranked network communication modules and tries to instantiate all of them, selecting in the end the one that has the highest rank. The idea is that ranks somehow reflect the speed of the underlying network and that if a given network type is not present, its module will fail to instantiate. So, faced with a system that has both Ethernet and InfiniBand, it will pick InfiniBand, as its module has higher precedence. This is why larger Open MPI jobs start relatively slowly, and the approach is definitely not foolproof - in some cases one has to intervene and manually select the right modules, especially if the node has several network interfaces or InfiniBand HCAs and not all of them provide node-to-node connectivity. This is usually configured system-wide by the system administrator or the vendor and is why MPI "just works" (pro tip: in a not-so-small number of cases it actually doesn't).
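The ranked-module idea described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not Open MPI's actual code: the module names, the priorities and the probe functions are all made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TransportModule:
    name: str                  # hypothetical label, not a real Open MPI component name
    priority: int              # higher means "assumed to be the faster network"
    probe: Callable[[], bool]  # returns True if the module can be instantiated


def select_transport(modules: List[TransportModule]) -> Optional[TransportModule]:
    """Try to instantiate every module and keep the usable one with the highest priority."""
    usable = [m for m in modules if m.probe()]
    if not usable:
        return None
    return max(usable, key=lambda m: m.priority)


# Made-up example: Ethernet and InfiniBand are present, a third fabric is not.
modules = [
    TransportModule("tcp-ethernet", priority=10, probe=lambda: True),
    TransportModule("infiniband", priority=50, probe=lambda: True),
    TransportModule("other-fabric", priority=90, probe=lambda: False),
]
chosen = select_transport(modules)
print(chosen.name if chosen else "no usable transport")
```

Probing every module is also why startup gets slower as the list grows, and a wrong priority or a probe that succeeds on an unconnected interface leads to the wrong choice.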
You may copy the approach taken by Open MPI and develop a set of detection modules for your program. For TCP, spawn two or more copies on different nodes, list their active network interfaces and the corresponding IP addresses, match the network addresses and bind on all interfaces on one node, then try to connect to it from the other node(s). Upon successful connection, run something like the TCP version of NetPIPE to measure the network speed and latency and pick the fastest network. Once you've gotten this information from the initial small set of nodes, it is very likely that the same interface is used on all other nodes too, since most HPC systems are as homogeneous as possible when it comes to their nodes' network configuration.
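The "match the network addresses" step could look roughly like the sketch below. It assumes each node has already reported its interfaces as a mapping from interface name to address in CIDR form; how that table is collected, and the names and addresses used here, are assumptions for the example. The connection test and the speed measurement are not shown.

```python
import ipaddress
from typing import Dict, List, Tuple


def shared_networks(
    node_a: Dict[str, str], node_b: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (iface_on_a, iface_on_b, network) for every non-loopback subnet both nodes are on."""
    nets_a = {name: ipaddress.ip_interface(addr).network for name, addr in node_a.items()}
    nets_b = {name: ipaddress.ip_interface(addr).network for name, addr in node_b.items()}
    matches = []
    for name_a, net_a in nets_a.items():
        for name_b, net_b in nets_b.items():
            if net_a == net_b and not net_a.is_loopback:
                matches.append((name_a, name_b, str(net_a)))
    return matches


# Made-up interface tables for two nodes of the same cluster.
node1 = {"lo": "127.0.0.1/8", "eno1": "192.168.1.11/24", "ib0": "10.10.0.11/16"}
node2 = {"lo": "127.0.0.1/8", "eno1": "192.168.1.12/24", "ib0": "10.10.0.12/16"}
candidates = shared_networks(node1, node2)
print(candidates)
```

Each candidate pair would then be tested with a real connection and a bandwidth/latency measurement before picking one.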
If there is a working MPI implementation installed, you can use it to launch the test program. You may also enable debug logging in the MPI library and parse the output, but this will require that the target system has an MPI implementation supported by your log parser. Also, most MPI libraries use native InfiniBand or whatever high-speed network API there is and will not tell you which is the IP-over-whatever interface, because they won't use it at all (unless configured otherwise by the system administrator).
Q : What is the most portable way to discover the name of the high-performance network interface on a linux HPC cluster?
This seems to be in a gray zone - trying to solve a multi-faceted problem spanning site-specific (technical) hardware interface naming and the non-technical, weakly administratively maintained, preferred ways of using those interfaces.
As-is State :
ZeroMQ can (as per RFC 37/ZMTP v3.0+) specify <hardware(interface)>:<port>/<service> details :
zmq_bind (server_socket, "tcp://eth0:6000/system/name-service/test");
And:
zmq_connect (client_socket, "tcp://192.168.55.212:6000/system/name-service/test");
yet has no means, to my knowledge, to reverse-engineer the primary use of such an interface in the holistic context of the HPC site and its hardware configuration.
It seems to me that your idea is the practical route: pre-test the administrative mappings via an MPI tool first, then let the ZeroMQ deployment use these externally detected configuration details (if they are indeed auto-detectable, as you assumed above) to select the proper (preferred) interface.
The Safe Way to Go :
Asking the HPC-infrastructure Support Team ( who is responsible for knowing all of the above and trained to help Scientific Teams to use the HPC in the most productive manner ) would be my preferred way to go.
Disclaimer :
Sorry in case this did not help your will to read & auto-detect all the needed configuration details ( a universal BlackBox-HPC-ecosystem detection and auto-configuration strategy would hardly be a trivial one-liner, I guess, wouldn't it? )

Why would one choose many smaller machine types instead of fewer big machine types?

In a clustering high-performance computing framework such as Google Cloud Dataflow (or for that matter even Apache Spark or Kubernetes clusters etc), I would think that it's far more performant to have fewer really BIG machine types rather than many small machine types, right? As in, it's more performant to have 10 n1-highcpu-96 rather than say 120 n1-highcpu-8 machine types, because
the cpus can use shared memory, which is way way faster than network communications
if a single thread needs access to lots of memory for a single threaded operation (eg sort), it has access to that greater memory in a BIG machine rather than a smaller one
And since the price is the same (eg 10 n1-highcpu-96 costs the same as 120 n1-highcpu-8 machine types), why would anyone opt for the smaller machine types?
As well, I have a hunch that for the n1-highcpu-96 machine type, we'd occupy the whole host, so we don't need to worry about competing demands on the host by another VM from another Google cloud customer (eg contention in the CPU caches or motherboard bandwidth etc.), right?
Finally, although I don't think the Google Compute VMs correctly report the "true" CPU topology of the host system, if we do choose the n1-highcpu-96 machine type, the reported CPU topology may be a touch closer to the "truth", because presumably the VM is using up the whole host. Any programs running on that VM that attempt to take advantage of the topology (eg the "NUMA" aware option in Java?) would then have a better chance of making the "right decisions".
Whether to choose many instances of a smaller machine type or a few instances of a big machine type depends on many factors.
VM sizes differ not only in the number of cores and the amount of RAM, but also in network I/O performance.
Instances with small machine types are limited in CPU and I/O power and are inadequate for heavy workloads.
Also, if you are planning to grow and scale it is better to design and develop your application in several instances. Having small VMs gives you a better chance of having them distributed across physical servers in the datacenter that have the best resource situation at the time the machines are provisioned.
Having many small instances helps to isolate fault domains. If one of your small nodes crashes, that only affects a small number of processes. If a large node crashes, many more processes go down.
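To put rough numbers on the fault-domain point, here is a small calculation using the two configurations from the question (10 x n1-highcpu-96 versus 120 x n1-highcpu-8). It assumes work is spread evenly over all cores, which is a simplification.

```python
def blast_radius(node_count: int, cores_per_node: int) -> float:
    """Fraction of the cluster's cores lost when a single node fails."""
    total_cores = node_count * cores_per_node
    return cores_per_node / total_cores


big = blast_radius(10, 96)     # few big machines
small = blast_radius(120, 8)   # many small machines

# Both layouts have the same total core count.
assert 10 * 96 == 120 * 8 == 960

print(f"one big node lost:   {big:.1%} of capacity")
print(f"one small node lost: {small:.1%} of capacity")
```

With the same 960 cores in total, losing one big node takes out a tenth of the cluster, while losing one small node takes out less than one percent.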
It also depends on the application you are running on your cluster and on the workload. I would also recommend going through this link to see the sizing recommendation for an instance.

Main difference between Shared memory and Distributed memory

I'm a bit confused about the difference between shared memory and distributed memory. Can you clarify?
Is shared memory for one processor and distributed for many (for network)?
Why do we need distributed memory, if we have shared memory?
Short answer
Shared memory and distributed memory are low-level programming abstractions that are used with certain types of parallel programming. Shared memory allows multiple processing elements to share the same location in memory (that is, to see each other's reads and writes) without any other special directives, while distributed memory requires explicit commands to transfer data from one processing element to another.
Detailed answer
There are two issues to consider regarding the terms shared memory and distributed memory. One is what do these mean as programming abstractions, and the other is what do they mean in terms of how the hardware is actually implemented.
In the past there were true shared memory cache-coherent multiprocessor systems. The systems communicated with each other and with shared main memory over a shared bus. This meant that any access from any processor to main memory would have equal latency. Today these types of systems are not manufactured. Instead there are various point-to-point links between processing elements and memory elements (this is the reason for non-uniform memory access, or NUMA). However, the idea of communicating directly through memory remains a useful programming abstraction. So in many systems this is handled by the hardware and the programmer does not need to insert any special directives. Some common programming techniques that use these abstractions are OpenMP and Pthreads.
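As a minimal sketch of the shared-memory abstraction, the Python example below has several threads update one variable that they all see directly. Python threads stand in here for OpenMP or Pthreads purely for illustration; the variable and function names are made up.

```python
import threading

counter = 0               # one memory location visible to every thread
lock = threading.Lock()   # protects the read-modify-write on the shared counter


def add_many(n: int) -> None:
    """Increment the shared counter n times."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

No data is ever sent anywhere: each thread simply reads and writes the same variable, and the only extra work for the programmer is synchronizing access.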
Distributed memory has traditionally been associated with processors performing computation on local memory and then using explicit messages to exchange data with remote processors. This adds complexity for the programmer, but simplifies the hardware implementation because the system no longer has to maintain the illusion that all memory is actually shared. This type of programming has traditionally been used with supercomputers that have hundreds or thousands of processing elements. A commonly used technique is MPI.
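The message-passing style can be sketched as follows. This is only an emulation inside one Python process: each worker is handed its own private chunk of data and communicates solely by putting an explicit message on a queue, which stands in for a network send. A real distributed-memory program would use something like MPI across separate machines; the names below are made up.

```python
import queue
import threading
from typing import List


def worker(rank: int, local_data: List[int], outbox: queue.Queue) -> None:
    """Compute on private data, then send the result as an explicit message."""
    partial = sum(local_data)
    outbox.put((rank, partial))


chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # each worker only ever sees its own chunk
inbox: queue.Queue = queue.Queue()

workers = [
    threading.Thread(target=worker, args=(rank, chunk, inbox))
    for rank, chunk in enumerate(chunks)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

# The coordinator receives one message per worker and combines them.
received = dict(inbox.get() for _ in chunks)
total = sum(received.values())
print(total)
```

The extra work for the programmer is visible: data has to be partitioned up front and results have to be sent and collected explicitly.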
However, supercomputers are not the only systems with distributed memory. Another example is GPGPU programming which is available for many desktop and laptop systems sold today. Both CUDA and OpenCL require the programmer to explicitly manage sharing between the CPU and the GPU (or other accelerator in the case of OpenCL). This is largely because when GPU programming started the GPU and CPU memory was separated by the PCI bus which has a very long latency compared to performing computation on the locally attached memory. So the programming models were developed assuming that the memory was separate (or distributed) and communication between the two processing elements (CPU and GPU) required explicit communication. Now that many systems have GPU and CPU elements on the same die there are proposals to allow GPGPU programming to have an interface that is more like shared memory.
In modern x86 terms, for example, all the CPUs in one physical computer share memory. Take a 4-socket system with four 18-core CPUs: each CPU has its own memory controllers, but they talk to each other, so all the CPUs are part of one coherency domain. The system is NUMA shared memory, not distributed.
A room full of these machines forms a distributed-memory cluster which communicates by sending messages over a network.
Practical considerations are one major reason for distributed memory: it's impractical to have thousands or millions of CPU cores sharing the same memory with any kind of coherency semantics that make it worth calling it shared memory.

How scalable is distributed Erlang?

Part A:
Erlang has a lot of success stories about running concurrent agents e.g. the millions of simultaneous Facebook chats. That's millions of agents, but of course it's not millions of CPUs across a network. I'm having trouble finding metrics on how well Erlang scales when scaling is "horizontal" across a LAN/WAN.
Let's assume that I have many (tens of thousands) physical nodes (running Erlang on Linux) that need to communicate and synchronize small infrequent amounts of data across the LAN/WAN. At what point will I have communications bottlenecks, not between agents, but between physical nodes? (Or will this just work, assuming a stable network?)
Part B:
I understand (as an Erlang newbie, meaning I could be totally wrong) that Erlang nodes attempt to all connect to and be aware of each other, resulting in an N^2 point-to-point connection network. Assuming that part A won't just work with N in the tens of thousands, can Erlang be configured easily (using out-of-the-box config or trivial boilerplate, not writing a full implementation of grouping/routing algorithms myself) to cluster nodes into manageable groups and route system-wide messages through the cluster/group hierarchy?
We should specify that we talk about horizontal scalability of physical machines -- that's the only problem. CPUs on one machine will be handled by one VM, no matter what the number of those is.
node = machine.
To begin with, 30-60 nodes is what you get out of the box (vanilla OTP installation) with any custom application written on top of that (in Erlang). Proof: ejabberd.
~100-150 is possible with an optimized custom application. I mean, it has to be good code, written with knowledge of GC, the characteristics of data types, message passing, etc.
Over 150 is all right, but numbers like 300 or 500 will require optimization & customization of the TCP layer. Also, our app has to be aware of the cost of e.g. sync calls across the cluster.
The other thing is the DB layer. Mnesia (built-in), due to its features, will not be effective over 20 nodes (my experience - I may be wrong). Solution: just use something else: dynamo DBs, a separate cluster of MySQLs, HBase, etc.
The most common technique for balancing the cost of creating a high-quality application against scalability is a federation of ~20-50-node clusters. So internally it's an efficient mesh of ~50 Erlang nodes, and it's connected via any suitable protocol to N other 50-node clusters. To sum up, such a system is a federation of N Erlang clusters.
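A quick back-of-the-envelope calculation shows why federation helps. The sketch below counts point-to-point connections for a full mesh of 10,000 nodes versus the same nodes split into 50-node clusters. The inter-cluster topology is an assumption made for the example (one link between every pair of clusters); a real federation could use fewer.

```python
def full_mesh_links(n: int) -> int:
    """Number of point-to-point connections when every node connects to every other node."""
    return n * (n - 1) // 2


def federated_links(total_nodes: int, cluster_size: int) -> int:
    """Connections for fully meshed clusters joined by one link per pair of clusters.

    Assumes total_nodes divides evenly by cluster_size and that clusters are
    themselves fully meshed with each other (an assumption for this example).
    """
    clusters = total_nodes // cluster_size
    intra = clusters * full_mesh_links(cluster_size)
    inter = full_mesh_links(clusters)
    return intra + inter


flat = full_mesh_links(10_000)
federated = federated_links(10_000, 50)
print(flat, federated)
```

Under these assumptions the flat mesh needs 49,995,000 connections, while the federation needs 264,900, and no single node holds more than a few dozen of them.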
Distributed erlang is designed to run in one data center. If you need more, geographically distant nodes, then use federations.
There are lots of config options, e.g. ones that do not connect all nodes to each other. That may be helpful; however, in a ~50-node cluster the Erlang overhead is not significant. You can also create a graph of Erlang nodes using 'hidden' connections, which do not join the full mesh but also cannot benefit from being connected to all nodes.
The biggest problem I see in this kind of system is designing it as a master-less system. If you do not need that, everything should be OK.

Erlang Documentation/SMP: single-node and multi-node per machine or per application, and the confusion that may follow

I'm studying Erlang's process model at the moment. I have hit a snag in a tech report (section 3, paragraph 2) on Erlang:
This explains why it in some cases can be more efficient to run several SMP VM's with one scheduler each instead on one SMP VM with several schedulers. Of course the running of several VM's require that the application can run in many parallel tasks which has no or very little communication with each other.
Now this paragraph is confusing me; I can see the uni-process, multiple-scheduler scenario, but I am failing to see multiple processes with a single scheduler each. Presumably each process would have a different node name, and this would mean that a given application cannot be used with this model without modification; the virtue of not requiring modification has been mentioned as a key feature of SMP in the report. If the multiple processes have the same node name, then performance would be disastrous due to inter-Erlang-process messaging storms -- this assumes the use of in-memory Mnesia. Is there some process model that is not introduced in the article and that I am missing here?
What is the author trying to say here? Is he trying to suggest that an application would have to be rewritten (to take multiple unique node names into account) for the multi-process single-scheduler case?
-- edit 1: Clarification of Source of Problem --
The question has been answered through discussion; the following is an outline of the trouble I had.
The issue for this question has been that the documentation, as I recall, does not touch on a scenario of running multiple Erlang emulators per physical machine -- it has always been shown that the emulator represents your physical machine (in industrial usage); also, the scenario of having to explicitly partition a program for computational efficiency has never been considered. This sudden introduction has been the source of my woe.
The convention is still biased towards creating LOTS of processes, and the future holds many improvements for the Erlang SMP emulator, which means that a single node per machine is still a very viable option, assuming favourable application design.
Rewrite after reading article:
This explains why it in some cases can be more efficient to run several SMP VM's with one scheduler each instead on one SMP VM with several schedulers.
Non-SMP VM has no-lock so runs fast.
Single scheduler SMP VM 10% slower, due to cost of checking locks
Multiple scheduler SMP VM slower again due to using/waiting for locks
Of course the running of several VM's require that the application can run in many parallel tasks which has no or very little communication with each other.
I think: Nodes on the same server have to have different names.
Messaging between nodes will be slower, due to its inter-process nature, than intra-process messaging within a single VM node.
If you have multiple schedulers in a single VM, they will inevitably contend over various resources (e.g. ets meta table, atom-table, scheduler run-queue during migration, etc.) because of the inner architecture. If you have a single scheduler, contention will obviously not occur. Lock checking and acquiring will still be done though, so running a non SMP VM instead shall yield even better performance (but requires a rebuilding of the VM from source).
Take a four-core machine for example. Option one means running a single Erlang VM with four schedulers, each scheduler's affinity set to a different processor core. Option two means that you run four instances of the Erlang VM, each with a single scheduler, with affinity set to different processor cores.
If you have a whole lot of independent processes to run, option two will result in better performance, because the four cores will be fully utilized (theoretically). In contrast, in option one, this won't be possible, because the lock contention will make execution on cores wait for each other every now and then.
On the other hand if your processes need to chatter a lot, option one is the way to go because the inter-process communication is way cheaper than communication between different VMs. You gain more with this than you lose with lock contention.
I believe the answer is in the preceding paragraph:
The SMP VM with only one scheduler is slightly slower (10%) than the non SMP VM. This is because the SMP VM need to use locks for all shared datastructures. But as long as there are no lock-conflicts the overhead caused by locking is not that high (it is the lock conflicts that takes time).
The schedulers' reliance on locks for shared data structures can impose an overhead on a given system. It seems to follow that having multiple schedulers on one SMP VM imposes a collectively greater overhead.
There are some advantages to running several nodes on one physical machine.
1) Resource locking overhead as mentioned.
2) Fail-over. In telecom products you really don't want to have the beam come crashing down on you. If you have NIFs or linked-in drivers in your system this might occur.
3) Memory locality. A few nodes give you a poor man's way to force processes onto a few cores. This could be a big boost, typically for NUMA architectures but also for SMP. The scheduler doesn't take NUMA into account (yet). You can spawn a process on a specific scheduler and lock it there so that it won't migrate, but that is an undocumented feature ... or it was removed altogether. I forget.
With several nodes you will need a load balancer between the nodes of course but that is the usual way to do it anyways. Some logic that supervises the nodes.
However, the numbers from the EUC papers are over a year old [#] and I wouldn't recommend a multi-node approach if you don't really need it. The runtime system is much better at handling these types of problems today. A lot of lock overhead has been removed and the mrq-scheduler has been improved.
# 2009's numbers look like this.
Edit:
Regarding 3), the spawn feature I mentioned is
spawn_opt(fun() -> ... end, [{scheduler, Id}]) -> pid(),
where Id is an integer and refers to a specific scheduler.
I wouldn't recommend using it since it is undocumented.

Resources