CPM and APM in supercomputing?

I am doing research for a paper on the subject of supercomputers, specifically Tianhe-2. I was reading a report by Professor Jack Dongarra, and he mentions the CPM and APM halves of the board: "The compute board has two compute nodes and is composed of two half's the CPM and the APM halves. The CPM portion of the compute board contains the 4 Ivy Bridge processors, memory, and 1 Xeon Phi board and the CPM half contains the 5 Xeon Phi boards".
So the first thing I have a problem with is "compute" as a term, because I don't know how to translate "compute board" if I have Xeon Phi boards on that compute board... O.o?
The second thing is about CPM and APM. What are CPM and APM? What are their full names? And how do they function?
Please help me, I'm stuck and can't find an explanation anywhere.
Thanks.
Tami

Tianhe-2 is a cluster: a set of computers (called 'nodes') linked together with a fast interconnect (network), plus a distributed storage system. Most nodes are dedicated to computing (the 'compute nodes'), while some others are dedicated to management ('management nodes'). Dongarra's document also mentions blades as a synonym for nodes. A blade is a node form factor that works a bit like a laptop in a docking station.
Traditionally, a node is a full computer, with a main circuit board (the 'board', or 'motherboard') onto which the processors and memory modules are plugged, a network interface, possibly a local hard disk, and an operating system.
On Tianhe-2, things are a bit different. A single board is made of two distinct parts (modules) plugged together (the CPM and the APU), and that single board hosts two separate nodes. Rather than having two identical boards for two distinct nodes, Tianhe-2 uses one two-part board for two distinct nodes.
One of the halves (the CPM) hosts the CPUs (Intel Ivy Bridge) and the memory, plus one accelerator (Intel Xeon Phi) and two network connections, while the other (the APU) hosts 5 accelerators. Plugged together, they offer two nodes, each with 2 CPUs, 3 accelerators, and one network connection.
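To make the split concrete, here is a small back-of-the-envelope tally in Python; the counts come from the quotes above, the variable names are mine, and this is an illustration only:

    # Tally of one Tianhe-2 compute board, using only the counts quoted above.
    cpm = {"cpus": 4, "xeon_phis": 1, "network_links": 2}   # CPM half
    apu = {"cpus": 0, "xeon_phis": 5, "network_links": 0}   # APU/APM half
    nodes_per_board = 2

    board = {k: cpm[k] + apu[k] for k in cpm}               # whole two-part board
    per_node = {k: v // nodes_per_board for k, v in board.items()}

    print(board)     # {'cpus': 4, 'xeon_phis': 6, 'network_links': 2}
    print(per_node)  # {'cpus': 2, 'xeon_phis': 3, 'network_links': 1}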
The Intel Xeon Phi is an extension card that is plugged into the main board. Inside that extension card lives a fully featured mini-computer with a CPU, some memory, and... a tiny motherboard.
The exact meanings of CPM and APU (also referred to as APM in Dongarra's document, which looks more like a typo(?), though it has been quoted in many places) are nowhere to be found; one could guess they stand for Central Processing Module and Accelerated Processing Unit, or a variant thereof.

Related

Add nodes to Docker swarm from different servers

I am new to Docker swarm. I have read the documentation and googled the topic, but the results were vague.
Is it possible to add worker or manager nodes from distinct and separate virtual private servers?
The idea is to connect many unrelated hosts into a swarm, which then gives distribution over many systems and resiliency in case of any hardware failures.
Yes, it is possible. The only thing you need to watch out for is that the internet connection between the hosts is stable and that all of the ports required by the official documentation are open. Then you are good to go :)
Oh, and between managers you want a VERY stable internet connection without random ping spikes, or you may encounter weird behaviour (because of Raft consensus and decision making).
Other than that, it is fine.
Refer to Administer and maintain a swarm of Docker Engines
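For illustration, here is a minimal sketch of the init/join flow using the Python Docker SDK (docker-py); the IP address is a placeholder, and the plain docker swarm init / docker swarm join CLI commands do the same thing:

    # Sketch: initialise a swarm on one VPS and join a second VPS to it,
    # using the Python Docker SDK (docker-py). The IP is a placeholder.
    import docker

    # Run this part on the first VPS (it becomes a manager):
    manager = docker.from_env()
    manager.swarm.init(advertise_addr="203.0.113.10")            # public IP of this host
    manager.swarm.reload()                                       # refresh local swarm state
    worker_token = manager.swarm.attrs["JoinTokens"]["Worker"]   # share this with the workers

    # Run this part on any other VPS (ports 2377/tcp, 7946/tcp+udp, 4789/udp must be open):
    worker = docker.from_env()
    worker.swarm.join(remote_addrs=["203.0.113.10:2377"], join_token=worker_token)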
In production, the best practice to maximise swarm HA is to spread your swarm managers across multiple availability zones. Availability zones are geographically co-located but distinct sites, i.e. instead of having a single London data centre, have 3, each connected to different internet and power utilities. That way, if any single ISP or power utility has an outage, you still have 2 data centres connected to the internet.
Swarm was designed with this kind of highly available topology in mind and can scale to having its managers, and workers, distributed across nodes in different data centres.
However, Swarm is sensitive to latency over longer distances, so global distribution is not a good idea. Within a single city, data-centre-to-data-centre latencies will be in the low tens of ms, which is fine.
Connecting data centres in different cities or continents moves the latency to the low-to-mid hundreds of ms, which does cause problems and leads to instability.
Otherwise, go ahead. Build your swarm across AZ distributed nodes.
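As a rough illustration of why an odd number of managers spread across zones matters, here is the standard Raft majority arithmetic (a generic sketch, not Swarm-specific code):

    # Raft-style majority math: how many manager failures a swarm can tolerate.
    def quorum(managers: int) -> int:
        return managers // 2 + 1              # majority needed to keep making decisions

    def tolerated_failures(managers: int) -> int:
        return managers - quorum(managers)

    for n in (1, 3, 5, 7):
        print(f"{n} managers -> quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
    # With 3 managers in 3 availability zones, the swarm keeps working if any one zone goes down.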

In a parallel CPU architecture, how do CPUs in a distributed-memory architecture communicate with each other?

When we are talking about a shared-memory CPU architecture, if the CPUs want to communicate with each other, they look for a "variable" that they share inside the shared memory. But what if we have distributed memory instead? How do the CPUs communicate with each other, if at all?
PS - The type of processing here is parallel.
Q : "How do the CPU's ( In parallel CPU architecture ) communicate with each other, if at all?"
Let me take one such smart example - an EpiphanyTM architecture, designed by Andreas Olofsson and his Adapteva team.
These smart many-core parallel RISC CPU use both the NoC-hardware implemented 2D-mesh networks for intra-system, inter-node hardware interactions (named eMeshTM), consisting of triple-layered, specialised, networks [ cMesh | xMesh | rMesh ], and have an extended reach of using inter-system eLINKTM network to communicate with other, non-local systems.
This shows, how smart parallel architectures can promote the state-of-art solutions. Great respect to Andreas Olofsson and his team.
Just for the context - this came some fourty years after the pioneers from InMOS (UK, Bristol) first launched a Transputer architecture with (guess what, similar) parallel-networks equipped with adding also a tight-most knit parallel-language occam, that helped generate principally hard parallel software right-fit onto the Transputer in-silicon properties and yet was IMHO up until these years the most productive parallel-systems language with support for Real-Time System controls, to design such smart & demanding parallel-control systems, as the ESA Giotto satellite and many others.
PS :
A cool lesson on how to do [PARALLEL]-computing right right from the in-silicon level, wasn't it? - all respect goes to InMOS / occam people.
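More generally, at the software level, CPUs in a distributed-memory system exchange explicit messages over the interconnect instead of reading a shared variable. Here is a minimal message-passing sketch using MPI via mpi4py (my own illustration, not tied to the Epiphany hardware above):

    # Minimal message-passing sketch; run with e.g.: mpiexec -n 2 python demo.py
    # Each rank is a separate process with its own private memory; data moves
    # only via explicit send/recv messages over the interconnect.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        payload = {"step": 1, "value": 42.0}   # exists only in rank 0's memory
        comm.send(payload, dest=1, tag=0)      # explicit message to rank 1
    elif rank == 1:
        received = comm.recv(source=0, tag=0)  # rank 1 gets its own private copy
        print("rank 1 received:", received)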

Why would one choose many smaller machine types instead of fewer big machine types?

In a clustered high-performance computing framework such as Google Cloud Dataflow (or, for that matter, Apache Spark or Kubernetes clusters, etc.), I would think that it's far more performant to have fewer really BIG machine types rather than many small machine types, right? As in, it's more performant to have 10 n1-highcpu-96 machines rather than, say, 120 n1-highcpu-8 machines, because
the CPUs can use shared memory, which is way faster than network communication;
if a single thread needs access to lots of memory for a single-threaded operation (e.g. sort), it has access to that greater memory in a BIG machine rather than a smaller one.
And since the price is the same (e.g. 10 n1-highcpu-96 instances cost the same as 120 n1-highcpu-8 instances), why would anyone opt for the smaller machine types?
As well, I have a hunch that with the n1-highcpu-96 machine type we'd occupy the whole host, so we don't need to worry about competing demands on the host from another VM belonging to another Google Cloud customer (e.g. contention in the CPU caches or motherboard bandwidth, etc.), right?
Finally, although I don't think the Google Compute Engine VMs correctly report the "true" CPU topology of the host system, if we do choose the n1-highcpu-96 machine type, the reported CPU topology may be a touch closer to the truth, because presumably the VM is using up the whole host, so any program (e.g. the NUMA-aware option in Java?) that attempts to take advantage of the topology has a better chance of making the "right decisions".
Whether to choose many instances with smaller machine types or a few instances with big machine types will depend on many factors.
The VM sizes differ not only in the number of cores and amount of RAM, but also in network I/O performance.
Instances with small machine types are limited in CPU and I/O power and are inadequate for heavy workloads.
Also, if you are planning to grow and scale, it is better to design and develop your application across several instances. Having small VMs gives you a better chance of having them distributed across the physical servers in the data centre that have the best resource situation at the time the machines are provisioned.
Having many small instances also helps to isolate fault domains. If one of your small nodes crashes, that only affects a small number of processes; if a large node crashes, multiple processes go down.
It also depends on the application you are running on your cluster and on the workload. I would also recommend going through this link to see the sizing recommendations for an instance.

What are hardware requirements to run Hyperledger Fabric peer?

What are the minimum hardware requirements to run a Hyperledger Fabric v1 peer?
It can run on a Raspberry Pi, so technically it does not need much if you aren't planning on doing much with it. However, to achieve the performance results you might expect, you'll need to find the right balance of network, processor and disk speed. Additionally, as the peer is essentially managing a database, you'll need to take into consideration the data storage needs over time.
You'll also need to consider such factors as the number of chaincode smart contracts, the number of expected channels and the size of the network. IOW, the hardware requirements will really depend on many more factors than simply what the peer (or orderer) process requires to minimally function.
If you are merely interested in running a development/test cluster of 4 peer nodes, an orderer and a CA, keep in mind that this can all be easily handled on a MacBook Pro with 16 GB of memory, and with slightly less ease with 8 GB. You can use that as a yardstick for cloud instances to run a development/test cluster.
Finally, there's a LOT of crypto processing, so you will want to consider hardware crypto acceleration to yield optimal performance.

How scalable is distributed Erlang?

Part A:
Erlang has a lot of success stories about running concurrent agents, e.g. the millions of simultaneous Facebook chats. That's millions of agents, but of course it's not millions of CPUs across a network. I'm having trouble finding metrics on how well Erlang scales when the scaling is "horizontal" across a LAN/WAN.
Let's assume that I have many (tens of thousands) physical nodes (running Erlang on Linux) that need to communicate and synchronize small infrequent amounts of data across the LAN/WAN. At what point will I have communications bottlenecks, not between agents, but between physical nodes? (Or will this just work, assuming a stable network?)
Part B:
I understand (as an Erlang newbie, meaning I could be totally wrong) that Erlang nodes attempt to all connect to and be aware of each other, resulting in an N^2 point-to-point connection network. Assuming that part A won't just work with N in the tens of thousands, can Erlang be configured easily (using out-of-the-box config or trivial boilerplate, not writing a full implementation of grouping/routing algorithms myself) to cluster nodes into manageable groups and route system-wide messages through the cluster/group hierarchy?
We should specify that we are talking about the horizontal scalability of physical machines; that's the real problem here. The CPUs on one machine will be handled by one VM, no matter how many of them there are.
node = machine.
To begin, I can say that you get 30-60 nodes out of the box (vanilla OTP installation) with any custom application written on top of it (in Erlang). Proof: ejabberd.
~100-150 is possible with an optimized custom application. I mean, it has to be good code, written with knowledge of the GC, the characteristics of the data types, message passing, etc.
Over 150 is still all right, but when we talk about numbers like 300 or 500 it will require optimizations and customizations of the TCP layer. Also, our app has to be aware of the cost of, e.g., synchronous calls across the cluster.
The other thing is the DB layer. Mnesia (built-in), due to its features, will not be effective over 20 nodes (my experience - I may be wrong). Solution: just use something else: Dynamo-style DBs, a separate cluster of MySQL servers, HBase, etc.
The most common technique to balance the cost of building a high-quality application against scalability is a federation of ~20-50-node clusters. Internally it is an efficient mesh of ~50 Erlang nodes, connected via any suitable protocol to N other 50-node clusters. To sum up, such a system is a federation of N Erlang clusters.
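A rough back-of-the-envelope sketch of why federation helps: a full mesh of N nodes needs N*(N-1)/2 TCP connections, so the count explodes with N (the concrete numbers below are illustrative only):

    # Link-count arithmetic: one flat full mesh vs. a federation of small clusters.
    def full_mesh_links(n: int) -> int:
        return n * (n - 1) // 2              # every node connects to every other node

    total_nodes = 10_000
    cluster_size = 50
    clusters = total_nodes // cluster_size   # 200 clusters

    flat = full_mesh_links(total_nodes)                    # one giant Erlang mesh
    federated = clusters * full_mesh_links(cluster_size)   # intra-cluster links only
    print(f"flat mesh: {flat:,} links")                    # 49,995,000
    print(f"federated: {federated:,} links, plus whatever the federation protocol adds")  # 245,000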
Distributed Erlang is designed to run in one data center. If you need more geographically distant nodes, then use federations.
There are lots of config options, e.g. not connecting all nodes to each other. That may be helpful; however, in a ~50-node cluster the Erlang overhead is not significant. You can also create a graph of Erlang nodes using 'hidden' connections, which do not join the full mesh, but then you also cannot benefit from connections to all nodes.
The biggest problem I see in this kind of system is designing it as a masterless system. If you do not need that, everything should be OK.
