I am reading some articles discussing about "memory subsystems". What is the definition for memory subsystems?
As far as I understand by googling or reading other documents, it kind of indicates a group of main memory and processor cache. Is that correct?
There are three types of memory subsystem comoponents, RAM (R) components, single access (S) components, and dual-access (D) components. All memory subsystem components are for automatically retrieving operands from and storing results in their associated memory modules. All memory subsystem components have an output data connection and an input data connection. Therefore, they must be capable of handling both an output data stream and an input data stream. In addition, a D component includes a second pair of input and output connections. All memory subsystem components have a queue in each of their input and output data streams.
A significant difference between the memory subsystem components and the other components is that a Number of Operands In (NumOpsIn) register as well as a NumOpsOut register must be included. The NumOpsIn register serves the same purpose for the input data stream as NumOpsOut does for the output stream. Both NumOpsIn and NumOpsOut must be zero before new instructions can be distributed to the component's programmable registers.
The term "memory subsystem" seems to refer to a DRAM module, not CPU caches, and not CPU registers. I claim this based on the usages of the term listed below.
However, each usage is in the context of CPU caches, cache misses, and performance.
Here's a sample of usages:
Ulrich Drepper in What Every Programmer Should Know About Memory:
Unfortunately, neither the structure nor the cost of using the memory
subsystem of a computer or the caches on CPUs is well understood by
most programmers.
Intel in Intel 64 and IA-32 Architectures Optimization Reference Manual:
The manual also distinguishes between cache and the memory subsystems
Memory Bound corresponds to execution stalls related to the cache and
memory subsystems.
Intel's VTune Profiler lumps CPU caches in with DRAM when listing things related to "memory subsystem issues", but doesn't claim that CPU caches are part of a "memory subsystem":
This metric shows how memory subsystem issues affect the performance. [...]
The manual contrasts between an on-package memory subsystem and an off-package memory subsystem ("main memory") in Knights Landing:
MCDRAM is an on-package, high bandwidth memory subsystem that provides
peak bandwidth for read traffic, but lower bandwidth for write traffic
(compared to reads). The aggregate bandwidth provided by MCDRAM is
higher than the off-package memory subsystem (i.e., DDR memory).
Related
Multi processor systems perform "real" memory operations (those that influence definitive executions, not just speculative execution) out of order and asynchronously as waiting for global synchronization of global state would needlessly stall all executions nearly all the time. On the other hand, immediately outside each individual core, it seems that the memory system, starting with L1 cache, is purely synchronous, consistent, flat from the allowed behavior point of view (allowed semantics); obviously timing depends on the cache size and behavior.
So on a CPU there on one extreme are named "registers" which are private by definition, and on the other extreme there is memory which is shared; it seems a shame that outside the minuscule space of registers, which have peculiar naming or addressing mode, the memory is always global, shared and globally synchronous, and effectively entirely subject to all fences, even if it's memory used as unnamed registers, for the purpose of storing more data than would fit in the few registers, without a possibility of being examined by other threads (except by debugging with ptrace which obviously stalls, halts, serializes and stores the complete observable state of an execution).
Is that always the case on modern computers (modern = those that can reasonably support C++ and Java)?
Why doesn't the dedicated L1 cache provide register-like semantics for those memory units that are only used by a particular core? The cache must track which memory is shared, no matter what. Memory operations on such local data doesn't have to be stalled when strict global ordering of memory operations are needed, as no other core is observing it, and the cache has the power to stall such external accesses if needed. The cache would just have to know which memory units are private (non globally readable) until a stall of out of order operations, which makes then consistent (the cache would probably need a way to ask the core to serialize operations and publish a consistent state in memory).
Do all CPU stall and synchronize all memory accesses on a fence or synchronizing operation?
Can the memory be used as an almost infinite register resource not subject to fencing?
In practice, a single core operating on memory that no other threads are accessing doesn't slow down much in order to maintain global memory semantics, vs. how a uniprocessor system could be designed.
But on a big multi-socket system, especially x86, cache-coherency (snooping the other socket) is part of what makes memory latency worse for cache misses than on a single-socket system, though. (For accesses that miss in private caches).
Yes, all multi-core systems that you can run a single multi-threaded program on have coherent shared memory between all cores, using some variant of the MESI cache-coherency protocol. (Any exceptions to this rule are considered exotic and have to be programmed specially.)
Huge systems with multiple separate coherency domains that require explicit flushing are more like a tightly-coupled cluster for efficient message passing, not an SMP system. (Normal NUMA multi-socket systems are cache-coherent: Is mov + mfence safe on NUMA? goes into detail for x86 specifically.)
While a core has a cache line in MESI Modified or Exclusive state, it can modify it without notifying other cores about changes. M and E states in one cache mean that no other caches in the system have any valid copy of the line. But loads and stores still have to respect the memory model, e.g. an x86 core still has to commit stores to L1d cache in program order.
L1d and L2 are part of a modern CPU core, but you're right that L1d is not actually modified speculatively. It can be read speculatively.
Most of what you're asking about is handled by a store buffer with store forwarding, allowing store/reload to execute without waiting for the store to become globally visible.
what is a store buffer? and Size of store buffers on Intel hardware? What exactly is a store buffer?
A store buffer is essential for decoupling speculative out-of-order execution (writing data+address into the store buffer) from in-order commit to globally-visible L1d cache.
It's very important even for an in-order core, otherwise cache-miss stores would stall execution. And generally you want a store buffer to coalesce consecutive narrow stores into a single wider cache write, especially for weakly-ordered uarches that can do so aggressively; many non-x86 microarchitectures only have fully efficient commit to cache for aligned 4-byte or wider chunks.
On a strongly-ordered memory model, speculative out-of-order loads and checking later to see if any other core invalidated the line before we're "allowed" to read it is also essential for high performance, allowing hit-under-miss for out-of-order exec to continue instead of one cache miss load stalling all other loads.
There are some limitations to this model:
limited store-buffer size means we don't have much private store/reload space
a strongly-ordered memory model stops private stores from committing to L1d out of order, so a store to a shared variable that has to wait for the line from another core could result in the store buffer filling up with private stores.
memory barrier instructions like x86 mfence or lock add, or ARM dsb ish have to drain the store buffer, so stores to (and reloads from) thread-private memory that's not in practice shared still has to wait for stores you care about to become globally visible.
conversely, waiting for shared store you care about to become visible (with a barrier or a release-store) has to also wait for private memory operations even if they're independent.
the memory is always global, shared and globally synchronous, and
effectively entirely subject to all fences, even if it's memory used
as unnamed registers,
I'm not sure what you mean here. If a thread is accessing private data (i.e., not shared with any other thread), then there is almost no need for memory fence instructions1. Fences are used to control the order in which memory accesses from one core are seen by other cores.
Why doesn't the dedicated L1 cache provide register-like semantics for
those memory units that are only used by a particular execution unit?
I think (if I understand you correctly) what you're describing is called a scratchpad memory (SPM), which is a hardware memory structure that is mapped to the architectural physical address space or has its own physical address space. The software can directly access any location in an SPM, similar to main memory. However, unlike main memory, SPM has a higher bandwidth and/or lower latency than main memory, but is typically much smaller in size.
SPM is much simpler than a cache because it doesn't need tags, MSHRs, a replacement policy, or hardware prefetchers. In addition, the coherence of SPM works like main memory, i.e., it comes into play only when there are multiple processors.
SPM has been used in many commercial hardware accelerators such as GPUs, DSPs, and manycore processor. One example I am familiar with is the MCDRAM of the Knights Landing (KNL) manycore processor, which can be configured to work as near memory (i.e., an SPM), a last-level cache for main memory, or as a hybrid. The portion of the MCDRAM that is configured to work as SPM is mapped to the same physical address space as DRAM and the L2 cache (which is private to each tile) becomes the last-level cache for that portion of MCDRAM. If there is a portion of MCDRAM that is configured as a cache for DRAM, then it would be the last-level cache of DRAM only and not the SPM portion. MCDRAM has a much higher bandwdith than DRAM, but the latency is about the same.
In general, SPM can be placed anywhere in the memory hierarchy. For example, it could placed at the same level as the L1 cache. SPM improves performance and reduces energy consumption when there is no or little need to move data between SPM and DRAM.
SPM is very suitable for systems with real-time requirements because it provides guarantees regarding the maximum latency and/or lowest bandwdith, which is necessary to determine with certainty whether real-time constraints can be met.
SPM is not very suitable for general-purpose desktop or server systems where they can be multiple applications running concurrently. Such systems don't have real-time requirements and, currently, the average bandwdith demand doesn't justify the cost of including something like MCDRAM. Moreover, using an SPM at the L1 or L2 level imposes size constraints on the SPM and the caches and makes difficult for the OS and applications to exploit such a memory hierarchy.
Intel Optance DC memory can be mapped to the physical address space, but it is at the same level as main memory, so it's not considered as an SPM.
Footnotes:
(1) Memory fences may still be needed in single-thread (or uniprocessor) scenarios. For example, if you want to measure the execution time of a specific region of code on an out-of-order processor, it may be necessary to wrap the region between two suitable fence instructions. Fences are also required when communicating with an I/O device through write-combining memory-mapped I/O pages to ensure that all earlier stores have reached the device.
On an arm based SoC running Android/Linux, I observed following:
Allocate a memory area as un-cached for device DMA input. After DMA finishes, the content of this memory area is copied to another system memory area.
Alloc a memory area as cached for device DMA input. After DMA finished, invalid the memory range, then copy the content of this memory area to anther system memory area.
The size of memory area allocated is about 2MB which is larger than the cache size (the L2 cache size is 256KB).
method 2 is x10 faster than method 1
That is: the memory copy operation of method 2 is x10 faster than method 1
I speculate that method 2 using cache read by cache line size from system memory when copying and the method 1 needs cpu read by bus transaction size from system memory bypassing the cache hardware.
However, I cannot find explicit explanation. I appreciate who can help providing detailed explaination.
There are so many hardware items involved that it is difficult to give specifics. The SOC determines a lot of this. However, what you observe is typical in performance terms for modern ARM systems.
The main factor is SDRAM. All DRAM is structured with 'rows' and 'columns'.DRAM history On the DRAM chip, an entire 'row' can be read at one time. Ie, there is a matrix of transistors and there is a physical point/wiring where an entire row can be read (in fact there maybe SRAM to store the ROW on the chip). When you read another 'column', you need to 'un-charge/pre-charge' the wiring to access the new 'row'. This takes some time. The main point is that DRAM can read sequential memory very fast in large chunks. Also, there is no command overhead as the memory streams out with each clock edge.
If you mark memory as un-cached, then a CPU/SOC may issue single beat reads. Often these will 'pre-charge' consuming extra cycles during a single read/write and many extra commands must be sent to the DRAM device.
SDRAM also has 'banks'. A bank has a separate 'ROW' buffer (static RAM/multi-transistor memory) which allows you to read from one bank to another without having to recharge/re-read. The banks are often very far apart. If your OS has physically allocated the 'un-cached' memory in a different bank from the 2nd 'cached' area, then this will also add an additional efficiency. It common in an OS to manage cached/un-cached memory separately (for MMU issues). The memory pools are often distant enough to be in separate banks.
I understand that cudaMallocManaged simplifies memory access by eliminating the need for explicit memory allocations on host and device. Consider a scenario where the host memory is significantly larger than the device memory, say 16 GB host & 2 GB device which is fairly common these days. If I am dealing with input data of large size say 4-5 GB which is read from an external data source. Am I forced to resort to explicit host and device memory allocation (as device memory is insufficient to accommodate at once) or does the CUDA unified memory model has a way to get around this (something like, auto allocate/deallocate on need basis)?
Am I forced to resort to explicit host and device memory allocation?
You are not forced to resort to explicit host and device memory allocation, but you will be forced to handle the amount of allocated memory manually. This is because, on current hardware at least, the CUDA unified virtual memory doesn't allow you to oversubscribe GPU memory. In other words, cudaMallocManaged will fail once you allocate more memory than what is available on the device. But that doesn't mean you can't use cudaMallocManaged, it merely means you have to keep track of the amount of memory allocated and never exceed what the device could support, by "streaming" your data instead of allocating everything at once.
Pure speculation as I can't speak for NVIDIA, but I believe this could be one of the future improvements on upcoming hardware.
And indeed, one year and a half after the above prediction, as of CUDA 8, Pascal GPUs are now enhanced with a page-faulting capability that allows memory pages to migrate between the host and the device without explicit intervention from the programmer.
So my understanding is that every process has its own virtual memory space ranging from 0x0 to 0xFF....F. These virtual addresses correspond to addresses in physical memory (RAM). Why is this level of abstraction helpful? Why not just use the direct addresses?
I understand why paging is beneficial, but not virtual memory.
There are many reasons to do this:
If you have a compiled binary, each function has a fixed address in memory and the assembly instructions to call functions have that address hardcoded. If virtual memory didn't exist, two programs couldn't be loaded into memory and run at the same time, because they'd potentially need to have different functions at the same physical address.
If two or more programs are running at the same time (or are being context-switched between) and use direct addresses, a memory error in one program (for example, reading a bad pointer) could destroy memory being used by the other process, taking down multiple programs due to a single crash.
On a similar note, there's a security issue where a process could read sensitive data in another program by guessing what physical address it would be located at and just reading it directly.
If you try to combat the two above issues by paging out all the memory for one process when switching to a second process, you incur a massive performance hit because you might have to page out all of memory.
Depending on the hardware, some memory addresses might be reserved for physical devices (for example, video RAM, external devices, etc.) If programs are compiled without knowing that those addresses are significant, they might physically break plugged-in devices by reading and writing to their memory. Worse, if that memory is read-only or write-only, the program might write bits to an address expecting them to stay there and then read back different values.
Hope this helps!
Short answer: Program code and data required for execution of a process must reside in main memory to be executed, but main memory may not be large enough to accommodate the needs of an entire process.
Two proposals
(1) Using a very large main memory to alleviate any need for storage allocation: it's not feasible due to very high cost.
(2) Virtual memory: It allows processes that may not be entirely in the memory to execute by means of automatic storage allocation upon request. The term virtual memory refers to the abstraction of separating LOGICAL memory--memory as seen by the process--from PHYSICAL memory--memory as seen by the processor. Because of this separation, the programmer needs to be aware of only the logical memory space while the operating system maintains two or more levels of physical memory space.
More:
Early computer programmers divided programs into sections that were transferred into main memory for a period of processing time. As higher level languages became popular, the efficiency of complex programs suffered from poor overlay systems. The problem of storage allocation became more complex.
Two theories for solving the problem of inefficient memory management emerged -- static and dynamic allocation. Static allocation assumes that the availability of memory resources and the memory reference string of a program can be predicted. Dynamic allocation relies on memory usage increasing and decreasing with actual program needs, not on predicting memory needs.
Program objectives and machine advancements in the '60s made the predictions required for static allocation difficult, if not impossible. Therefore, the dynamic allocation solution was generally accepted, but opinions about implementation were still divided.
One group believed the programmer should continue to be responsible for storage allocation, which would be accomplished by system calls to allocate or deallocate memory. The second group supported automatic storage allocation performed by the operating system, because of increasing complexity of storage allocation and emerging importance of multiprogramming.
In 1961, two groups proposed a one-level memory store. One proposal called for a very large main memory to alleviate any need for storage allocation. This solution was not possible due to very high cost. The second proposal is known as virtual memory.
cne/modules/vm/green/defn.html
To execute a process its data is needed in the main memory (RAM). This might not be possible if the process is large.
Virtual memory provides an idealized abstraction of the physical memory which creates the illusion of a larger virtual memory than the physical memory.
Virtual memory combines active RAM and inactive memory on disk to form
a large range of virtual contiguous addresses. implementations usually require hardware support, typically in the form of a memory management
unit built into the CPU.
The main purpose of virtual memory is multi-tasking and running large programmes. It would be great to use physical memory, because it would be a lot faster, but RAM memory is a lot more expensive than ROM.
Good luck!
I looked through the programming guide and best practices guide and it mentioned that Global Memory access takes 400-600 cycles. I did not see much on the other memory types like texture cache, constant cache, shared memory. Registers have 0 memory latency.
I think constant cache is the same as registers if all threads use the same address in constant cache. Worst case I am not so sure.
Shared memory is the same as registers so long as there are no bank conflicts? If there are then how does the latency unfold?
What about texture cache?
For (Kepler) Tesla K20 the latencies are as follows:
Global memory: 440 clocks
Constant memory
L1: 48 clocks
L2: 120 clocks
Shared memory: 48 clocks
Texture memory
L1: 108 clocks
L2: 240 clocks
How do I know? I ran the microbenchmarks described by the authors of Demystifying GPU Microarchitecture through Microbenchmarking. They provide similar results for the older GTX 280.
This was measured on a Linux cluster, the computing node where I was running the benchmarks was not used by any other users or ran any other processes. It is BULLX linux with a pair of 8 core Xeons and 64 GB RAM, nvcc 6.5.12. I changed the sm_20 to sm_35 for compiling.
There is also an operands cost chapter in PTX ISA although it is not very helpful, it just reiterates what you already expect, without giving precise figures.
The latency to the shared/constant/texture memories is small and depends on which device you have. In general though GPUs are designed as a throughput architecture which means that by creating enough threads the latency to the memories, including the global memory, is hidden.
The reason the guides talk about the latency to global memory is that the latency is orders of magnitude higher than that of other memories, meaning that it is the dominant latency to be considered for optimization.
You mentioned constant cache in particular. You are quite correct that if all threads within a warp (i.e. group of 32 threads) access the same address then there is no penalty, i.e. the value is read from the cache and broadcast to all threads simultaneously. However, if threads access different addresses then the accesses must serialize since the cache can only provide one value at a time. If you're using the CUDA Profiler, then this will show up under the serialization counter.
Shared memory, unlike constant cache, can provide much higher bandwidth. Check out the CUDA Optimization talk for more details and an explanation of bank conflicts and their impact.