Main difference between shared memory and distributed memory

I'm a bit confused about the difference between shared memory and distributed memory. Can you clarify?
Is shared memory for one processor and distributed for many (for network)?
Why do we need distributed memory, if we have shared memory?

Short answer
Shared memory and distributed memory are low-level programming abstractions used in certain types of parallel programming. Shared memory allows multiple processing elements to share the same location in memory (that is, to see each other's reads and writes) without any other special directives, while distributed memory requires explicit commands to transfer data from one processing element to another.
Detailed answer
There are two issues to consider regarding the terms shared memory and distributed memory. One is what they mean as programming abstractions, and the other is what they mean in terms of how the hardware is actually implemented.
In the past there were true shared-memory, cache-coherent multiprocessor systems. The processors communicated with each other and with shared main memory over a shared bus. This meant that any access from any processor to main memory would have equal latency. Today these types of systems are no longer manufactured. Instead, there are various point-to-point links between processing elements and memory elements (this is the reason for non-uniform memory access, or NUMA). However, the idea of communicating directly through memory remains a useful programming abstraction, so in many systems this is handled by the hardware and the programmer does not need to insert any special directives. Some common programming techniques that use this abstraction are OpenMP and Pthreads.
Distributed memory has traditionally been associated with processors performing computation on local memory and then using explicit messages to exchange data with remote processors. This adds complexity for the programmer, but simplifies the hardware implementation because the system no longer has to maintain the illusion that all memory is actually shared. This type of programming has traditionally been used with supercomputers that have hundreds or thousands of processing elements. A commonly used technique is MPI.
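To make the contrast between the two abstractions concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: both versions run as threads in one process, and a queue.Queue stands in for the network or MPI channel that a real distributed-memory system would use. The function names shared_memory_sum and message_passing_sum are made up for this example.

```python
import queue
import threading


def shared_memory_sum(data, n_workers=4):
    """Shared-memory style: every worker updates the same location directly."""
    total = [0]              # one location that all threads can read and write
    lock = threading.Lock()  # synchronisation is still the programmer's job

    def worker(chunk):
        partial = sum(chunk)
        with lock:
            total[0] += partial  # plain write to shared state, no explicit transfer

    threads = [
        threading.Thread(target=worker, args=(data[i::n_workers],))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def message_passing_sum(data, n_workers=4):
    """Distributed-memory style: workers own private data and send messages."""
    inbox = queue.Queue()  # assumption: stands in for a network / MPI channel

    def worker(rank, local_chunk):
        # Compute on local data only, then explicitly send the result.
        inbox.put((rank, sum(local_chunk)))

    threads = [
        # Explicit "scatter": each worker receives its own private copy.
        threading.Thread(target=worker, args=(rank, list(data[rank::n_workers])))
        for rank in range(n_workers)
    ]
    for t in threads:
        t.start()
    total = 0
    for _ in range(n_workers):
        _rank, partial = inbox.get()  # explicit receive on the root
        total += partial
    for t in threads:
        t.join()
    return total
```

In the first function the data movement is implicit; in the second, every transfer is spelled out by the programmer, which is the extra complexity described above.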
However, supercomputers are not the only systems with distributed memory. Another example is GPGPU programming, which is available for many desktop and laptop systems sold today. Both CUDA and OpenCL require the programmer to explicitly manage sharing between the CPU and the GPU (or other accelerator in the case of OpenCL). This is largely because when GPU programming started, GPU and CPU memory were separated by the PCI bus, which has a very long latency compared to performing computation on locally attached memory. So the programming models were developed assuming that the memory was separate (or distributed) and that communication between the two processing elements (CPU and GPU) had to be explicit. Now that many systems have GPU and CPU elements on the same die, there are proposals to give GPGPU programming an interface that is more like shared memory.

In modern x86 terms, for example, all the CPUs in one physical computer share memory. Consider a 4-socket system with four 18-core CPUs: each CPU has its own memory controllers, but they talk to each other so all the CPUs are part of one coherency domain. The system is NUMA shared memory, not distributed.
A room full of these machines forms a distributed-memory cluster that communicates by sending messages over a network.
Practical considerations are one major reason for distributed memory: it's impractical to have thousands or millions of CPU cores sharing the same memory with any kind of coherency semantics that make it worth calling it shared memory.

Related

What if a process doesn't fit in memory?

If processes don't fit in memory, what moves them in and out of memory to run?
This question is based on operating system memory management theory.
I have read about the purpose of the memory management unit. Is this related to swapping?
The operating system will use a memory management technique called virtual memory.
This is when a computer compensates for a shortage of physical memory by temporarily transferring pages (fixed-size blocks of memory) of data from RAM to a backing store. RAM is much faster than secondary storage, so when a computer has to fall back on secondary storage the user will feel it running slower.
The operating system's virtual memory manager is responsible for managing this. It uses techniques such as moving pages that have not been referenced for a while into secondary storage (your hard disk, for example), and when a page in secondary storage is required it moves that page back into primary memory.
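The policy of evicting pages that have not been referenced for a while can be sketched as a least-recently-used (LRU) simulation. This is a simplified model, not how any particular operating system implements paging (real kernels typically use cheaper approximations of LRU); the function name simulate_lru and its parameters are hypothetical.

```python
from collections import OrderedDict


def simulate_lru(reference_string, frames):
    """Count page faults for a sequence of page references with `frames`
    physical frames, evicting the least recently used page when full."""
    resident = OrderedDict()  # resident pages, least recently used first
    faults = 0
    evicted = []              # pages written out to the backing store, in order
    for page in reference_string:
        if page in resident:
            resident.move_to_end(page)  # hit: mark as most recently used
        else:
            faults += 1                 # miss: page must be brought into RAM
            if len(resident) >= frames:
                victim, _ = resident.popitem(last=False)  # evict the LRU page
                evicted.append(victim)
            resident[page] = None
    return faults, evicted
```

For example, simulate_lru([1, 2, 3, 1, 4, 2], frames=3) reports 5 faults and evicts pages 2 and then 3, because page 1 was touched again before memory filled up.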
Another point is that most modern apps will page themselves out, for example when they are minimised, to free memory for other running applications.

Why would one choose many smaller machine types instead of fewer big machine types?

In a clustering high-performance computing framework such as Google Cloud Dataflow (or for that matter even Apache Spark or Kubernetes clusters etc), I would think that it's far more performant to have fewer really BIG machine types rather than many small machine types, right? As in, it's more performant to have 10 n1-highcpu-96 rather than say 120 n1-highcpu-8 machine types, because
the cpus can use shared memory, which is way way faster than network communications
if a single thread needs access to lots of memory for a single threaded operation (eg sort), it has access to that greater memory in a BIG machine rather than a smaller one
And since the price is the same (eg 10 n1-highcpu-96 costs the same as 120 n1-highcpu-8 machine types), why would anyone opt for the smaller machine types?
As well, I have a hunch that for the n1-highcpu-96 machine type, we'd occupy the whole host, so we don't need to worry about competing demands on the host by another VM from another Google Cloud customer (e.g. contention in the CPU caches or motherboard bandwidth etc.), right?
Finally, although I don't think the Google Compute VMs correctly report the "true" CPU topology of the host system, if we do choose the n1-highcpu-96 machine type, the reported CPU topology may be a touch closer to the "truth", because presumably the VM is using up the whole host. Any programs running on that VM that attempt to take advantage of the topology (e.g. the "NUMA"-aware option in Java?) would then have a better chance of making the "right decisions".
Whether to choose many instances of a smaller machine type or a few instances of a big machine type depends on many factors.
VM sizes differ not only in number of cores and RAM, but also in network I/O performance.
Instances with small machine types are limited in CPU and I/O power and are inadequate for heavy workloads.
Also, if you are planning to grow and scale, it is better to design and develop your application to run across several instances. Having small VMs gives you a better chance of having them distributed across physical servers in the datacenter that have the best resource situation at the time the machines are provisioned.
Having many small instances helps to isolate fault domains. If one of your small nodes crashes, that only affects a small number of processes. If a large node crashes, many more processes go down.
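The fault-domain argument can be put in numbers using the two configurations from the question. This is a back-of-the-envelope sketch that assumes work is spread evenly over the nodes and takes the vCPU counts (96 and 8) from the machine type names; cluster_summary is a hypothetical helper.

```python
from fractions import Fraction


def cluster_summary(nodes, vcpus_per_node):
    """Total capacity and the share of it lost when a single node fails,
    assuming work is spread evenly across nodes."""
    return {
        "total_vcpus": nodes * vcpus_per_node,
        "lost_per_node_failure": Fraction(1, nodes),
    }


big = cluster_summary(nodes=10, vcpus_per_node=96)    # 10 x n1-highcpu-96
small = cluster_summary(nodes=120, vcpus_per_node=8)  # 120 x n1-highcpu-8
```

Both clusters have the same total of 960 vCPUs, but a single failure takes out 1/10 of the capacity in the first case and only 1/120 in the second.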
It also depends on the application you are running on your cluster and on the workload. I would also recommend going through this link to see the sizing recommendation for an instance.

Can two processes share the same GPU memory? (CUDA)

In the CPU world one can do this via a memory map. Can something similar be done for the GPU?
If two processes can share the same CUDA context, I think it will be trivial: just pass the GPU memory pointer around. Is it possible to share the same CUDA context between two processes?
Another possibility I could think of is to map device memory to memory-mapped host memory. Since it's memory mapped, it can be shared between two processes. Does this make sense, is it possible, and is there any overhead?
CUDA MPS effectively allows CUDA activity emanating from 2 or more processes to behave as if they share the same context on the GPU. (For clarity: CUDA MPS does not cause two or more processes to share the same context. However the work scheduling behavior appears similar to what you would observe if the work were emanating from the same process and therefore the same context.) However this won't provide for what you are asking for:
can two processes share the same GPU memory?
One method to achieve this is via the CUDA IPC (interprocess communication) API.
This will allow you to share an allocated device memory region (i.e. a memory region allocated via cudaMalloc) between multiple processes. This answer contains additional resources to learn about CUDA IPC.
However, according to my testing, this does not enable sharing of host pinned memory regions (e.g. a region allocated via cudaHostAlloc) between multiple processes. The memory region itself can be shared using ordinary IPC mechanisms available for your particular OS, but it cannot be made to appear as "pinned" memory in 2 or more processes (according to my testing).

Is Intel QuickPath Interconnect (QPI) used by processors to access memory?

I have read An Introduction to the Intel® QuickPath Interconnect. The document does not mention that QPI is used by processors to access memory. So I think that processors don't access memory through QPI.
Is my understanding correct?
Intel QuickPath Interconnect (QPI) is not wired to the DRAM DIMMs and as such is not used to access the memory that is connected to the CPU's integrated memory controller (iMC).
In the paper you linked this picture is present
That shows the connections of a processor, with the QPI signals pictured separately from the memory interface.
The text just before the picture confirms that QPI is not used to access memory:
"The processor also typically has one or more integrated memory controllers. Based on the level of scalability supported in the processor, it may include an integrated crossbar router and more than one Intel® QuickPath Interconnect port."
Furthermore, if you look at a typical datasheet you'll see that the CPU pins for accessing the DIMMs are not the ones used by QPI.
The QPI is however used to access the uncore, the part of the processor that contains the memory controller.
Courtesy of QPI article on Wikipedia
QPI is a fast internal general-purpose bus: in addition to giving access to the uncore of the CPU, it gives access to other CPUs' uncore.
Due to this link, every resource available in the uncore can potentially be accessed with QPI, including the iMC of a remote CPU.
QPI defines a protocol with multiple message classes, two of which are used to read memory through another CPU's iMC.
The flow uses a stack similar to the usual network stack.
Thus the path to remote memory includes a QPI segment, but the path to local memory doesn't.
Update
For the Xeon E7 v3-18C CPU (designed for multi-socket systems), the Home agent doesn't access the DIMMs directly; instead, it uses an Intel SMI2 link to access the Intel C102/C104 Scalable Memory Buffer, which in turn accesses the DIMMs.
The SMI2 link is faster than the DDR3 and the memory controller implements reliability or interleaving with the DIMMS.
Initially the CPU used an FSB to access the North bridge, which contained the memory controller and was linked to the South bridge (ICH - IO Controller Hub in Intel terminology) through DMI.
Later the FSB was replaced by QPI.
Then the memory controller was moved into the CPU (using its own bus to access memory and QPI to communicate with the CPU).
Later, the North bridge (IOH - IO Hub in Intel terminology) was integrated into the CPU and was used to access the PCH (that now replaces the south bridge) and PCIe was used to access fast devices (like the external graphic controller).
Recently the PCH has been integrated into the CPU as well, which now exposes only PCIe, DIMM pins, SATAexpress and any other common internal bus.
As a rule of thumb the buses used by the processors are:
To other CPUs - QPI
To IOH - QPI (if IOH present)
To the uncore - QPI
To DIMMs - Pins as mandated by the supported DRAM technology (DDR3, DDR4, ...). For Xeon v2+ Intel uses a fast SMI(2) link to connect to an off-core memory controller (Intel C102/104) that handles the DIMMs and channels based on two configurations.
To PCH - DMI
To devices - PCIe, SATAexpress, I2C, and so on.
Yes, QPI is used to access all remote memory on multi-socket systems, and much of its design and performance is intended to support such access in a reasonable fashion (i.e., with latency and bandwidth not too much worse than local access).
Basically, most x86 multi-socket systems are lightly NUMA [1]: every DRAM bank is attached to the memory controller of a particular socket. This memory is then local memory for that socket, while the remaining memory (attached to some other socket) is remote memory. All access to remote memory goes over the QPI links, and on many systems [2] that is half or more of all memory accesses.
So QPI is designed to be low latency and high bandwidth to make such access still perform well. Furthermore, aside from pure memory access, QPI is the link through which the cache coherence between sockets occurs, e.g., notifying the other socket of invalidations, lines which have transitioned into the shared state, etc.
[1] That is, the NUMA factor is fairly low, typically less than 2 for latency and bandwidth.
[2] E.g., with NUMA interleave mode on, and 4 sockets, 75% of your access is remote.
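The 75% figure follows from simple arithmetic. A small sketch, assuming memory is interleaved evenly across all sockets so that each access is equally likely to land on any socket's memory (remote_access_fraction is a hypothetical helper name):

```python
from fractions import Fraction


def remote_access_fraction(sockets):
    """Fraction of accesses that are remote when memory is interleaved
    evenly across `sockets` sockets: all but the local 1/sockets share."""
    if sockets < 1:
        raise ValueError("need at least one socket")
    return Fraction(sockets - 1, sockets)
```

With 2 sockets this gives 1/2 and with 4 sockets 3/4, matching the figures above.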

Nehalem memory architecture address mapping

Given a 2 processor Nehalem Xeon server with 12GB of RAM (6x2GB), how are memory addresses mapped onto the physical memory modules?
I would imagine that on a single processor Nehalem with 3 identical memory modules, the address space would be striped over the modules to give better memory bandwidth. But with what kind of stripe size? And how does the second processor (+memory) change that picture?
Intel is not very clear on that; you have to dig into their hardcore technical documentation to find out all the details. Here's my understanding. Each processor has an integrated memory controller. Some Nehalems have triple-channel controllers, some have dual-channel controllers. Each memory module is assigned to one of the processors. Triple channel means that accesses are interleaved across three banks of modules; dual channel means two banks.
The specific interleaving pattern is configurable to some extent, but, given their design, it's almost inevitable that you'll end up with 64 to 256 byte stripes.
If one of the processors wants to access memory that's attached to the IMC of some other processor, the access goes through both processors and incurs additional latency.
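As a rough illustration of the striping described above, the following sketch maps a physical address to a socket and a memory channel. It is a deliberately simplified model and not Intel's actual mapping: it assumes each socket owns one contiguous 6 GB range (as in the 2 x 6 GB example from the question), a 64-byte stripe, three channels and plain modulo interleaving. The function name locate and all of its parameters are hypothetical.

```python
GIB = 2**30


def locate(addr, stripe_bytes=64, channels=3, bytes_per_socket=6 * GIB):
    """Return (socket, channel) for a physical address under a simplified
    model: contiguous per-socket ranges, round-robin stripes per channel."""
    socket = addr // bytes_per_socket        # which processor's IMC owns it
    local = addr % bytes_per_socket          # offset within that socket's range
    channel = (local // stripe_bytes) % channels
    return socket, channel
```

Under these assumptions, consecutive 64-byte blocks rotate through channels 0, 1 and 2 of socket 0, and the first address at 6 GB falls on socket 1, where an access from socket 0 would take the slower remote path.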
