Do desktop processors support weakly ordered memory? - memory

Are there any Intel/AMD desktop processors that support weakly ordered memory or this is a feature of a server setup with multiple processors?

I'm not sure if this is what you're after, but IIRC there have been some architectures that had weakly-ordered memory accesses in that they could be ordered arbitrarily, and you had to insert memory barriers to ensure a particular ordering.
Modern processors use what's called a "load-store queue" that hides memory reordering, making it look almost as though it's happening in program order. Reads are often reordered (but with some care), writes can be made out of order but are committed in-order (although multiple writes to the same location are consolidated), and reads and writes are reordered wrt each other only carefully and speculatively. The latter is called "hoisting", where a read is performed speculatively ahead of a write (that appears earlier in the instruction sequence) and may be canceled (like a mispredicted branch) if it turns out that preceding write would have affected it.
Also, if memory is marked as uncached, then CPUs generally infer that to mean it's I/O space and perform no access reordering. x86 and SPARC are like this. However, PowerPC will still reorder reads to I/O memory space, and we have to use the EIEIO (ensure in order execution of I/O) instruction to force a particular ordering. IIRC, we also had to use memory barriers on PA-RISC and Alpha. Moreover, there are memory barriers on x86, but I'm not familiar with their use (possibly to ensure ordering of accesses to cached memory space).
You mention multi-core systems. In general, elaborate cache coherency protocols are employed to make all memory access appear to conform to certain interleaving rules, such that accesses hit the last-level caches and main memory in an order that would be possible if there were no caching.

Many modern processors now use out-of-order execution to improve performance by hiding memory latencies. This is not related to multiple processors/cores, it can be done with a single-cored processor acting alone. For this reason, you should not rely on memory ordering.

Related

What's the relationship between CPU Out-of-order execution and memory order?

In my understanding, CPU changes the operations order which are written on machine code for optimization and it is called out-of-order execution.
In the term "memory order", it defines the order of accessing to the memory. For example, in relaxed order, it defines very weak ordering rules and execution reordering is easy to happen.
There are some memory ordering models like TSO in x86. In such memory ordering model, the semantics of memory access order by the processor is defined.
What I don't understand is the relationship of them.
Is memory order a kind of out of order execution and are there any other ways for OoOe?
Or, is memory order the implementation of out of order execution and all the reorders by processors are based on the semantics?
The general issue is that on a modern multiprocessor system, load and store instructions may become visible to other cores in a different order than program order. Out-of-order execution is one way in which this can happen, but there are others.
For instance, you could have a CPU which executes and retires all instructions in strict program order, but when it does a store instruction, instead of committing it to L1 cache immediately, it puts it in a store buffer to be written to cache later. The store buffer could be designed to write out stores in a different order than they came in; for instance, if a first store misses L1 cache but a second one would hit, you could save time by writing out the second one while waiting for the first one's cache line to load.
Or, even if the store buffer doesn't reorder, you could have a situation where, while a store is still waiting in the store buffer, the CPU executes a load instruction that came later in program order. Other cores will thus see the load happening before the store. This is the situation with x86, for instance.
The memory ordering model defines, in an abstract way, what the programmer is entitled to expect about the order in which loads and stores become visible to other cores (or hardware, etc). It also usually specifies how the programmer can gain stronger guarantees when needed (e.g. by executing barrier instructions). The CPU then has to be designed to provide the defined behavior, which may place constraints on the features it can include. For instance, if the architecture promises TSO, the CPU probably can't include a store buffer that's capable of reordering, unless they manage to do it in such a clever way that the reordering can never be noticed by other cores.
Related questions:
Are memory barriers needed because of cpu out of order execution or because of cache consistency problem?
Out of Order Execution and Memory Fences
How does memory reordering help processors and compilers?
How do modern Intel x86 CPUs implement the total order over stores

Is memory outside each core always conceptually flat/uniform/synchronous in a multiprocessor system?

Multi processor systems perform "real" memory operations (those that influence definitive executions, not just speculative execution) out of order and asynchronously as waiting for global synchronization of global state would needlessly stall all executions nearly all the time. On the other hand, immediately outside each individual core, it seems that the memory system, starting with L1 cache, is purely synchronous, consistent, flat from the allowed behavior point of view (allowed semantics); obviously timing depends on the cache size and behavior.
So on a CPU there on one extreme are named "registers" which are private by definition, and on the other extreme there is memory which is shared; it seems a shame that outside the minuscule space of registers, which have peculiar naming or addressing mode, the memory is always global, shared and globally synchronous, and effectively entirely subject to all fences, even if it's memory used as unnamed registers, for the purpose of storing more data than would fit in the few registers, without a possibility of being examined by other threads (except by debugging with ptrace which obviously stalls, halts, serializes and stores the complete observable state of an execution).
Is that always the case on modern computers (modern = those that can reasonably support C++ and Java)?
Why doesn't the dedicated L1 cache provide register-like semantics for those memory units that are only used by a particular core? The cache must track which memory is shared, no matter what. Memory operations on such local data doesn't have to be stalled when strict global ordering of memory operations are needed, as no other core is observing it, and the cache has the power to stall such external accesses if needed. The cache would just have to know which memory units are private (non globally readable) until a stall of out of order operations, which makes then consistent (the cache would probably need a way to ask the core to serialize operations and publish a consistent state in memory).
Do all CPU stall and synchronize all memory accesses on a fence or synchronizing operation?
Can the memory be used as an almost infinite register resource not subject to fencing?
In practice, a single core operating on memory that no other threads are accessing doesn't slow down much in order to maintain global memory semantics, vs. how a uniprocessor system could be designed.
But on a big multi-socket system, especially x86, cache-coherency (snooping the other socket) is part of what makes memory latency worse for cache misses than on a single-socket system, though. (For accesses that miss in private caches).
Yes, all multi-core systems that you can run a single multi-threaded program on have coherent shared memory between all cores, using some variant of the MESI cache-coherency protocol. (Any exceptions to this rule are considered exotic and have to be programmed specially.)
Huge systems with multiple separate coherency domains that require explicit flushing are more like a tightly-coupled cluster for efficient message passing, not an SMP system. (Normal NUMA multi-socket systems are cache-coherent: Is mov + mfence safe on NUMA? goes into detail for x86 specifically.)
While a core has a cache line in MESI Modified or Exclusive state, it can modify it without notifying other cores about changes. M and E states in one cache mean that no other caches in the system have any valid copy of the line. But loads and stores still have to respect the memory model, e.g. an x86 core still has to commit stores to L1d cache in program order.
L1d and L2 are part of a modern CPU core, but you're right that L1d is not actually modified speculatively. It can be read speculatively.
Most of what you're asking about is handled by a store buffer with store forwarding, allowing store/reload to execute without waiting for the store to become globally visible.
what is a store buffer? and Size of store buffers on Intel hardware? What exactly is a store buffer?
A store buffer is essential for decoupling speculative out-of-order execution (writing data+address into the store buffer) from in-order commit to globally-visible L1d cache.
It's very important even for an in-order core, otherwise cache-miss stores would stall execution. And generally you want a store buffer to coalesce consecutive narrow stores into a single wider cache write, especially for weakly-ordered uarches that can do so aggressively; many non-x86 microarchitectures only have fully efficient commit to cache for aligned 4-byte or wider chunks.
On a strongly-ordered memory model, speculative out-of-order loads and checking later to see if any other core invalidated the line before we're "allowed" to read it is also essential for high performance, allowing hit-under-miss for out-of-order exec to continue instead of one cache miss load stalling all other loads.
There are some limitations to this model:
limited store-buffer size means we don't have much private store/reload space
a strongly-ordered memory model stops private stores from committing to L1d out of order, so a store to a shared variable that has to wait for the line from another core could result in the store buffer filling up with private stores.
memory barrier instructions like x86 mfence or lock add, or ARM dsb ish have to drain the store buffer, so stores to (and reloads from) thread-private memory that's not in practice shared still has to wait for stores you care about to become globally visible.
conversely, waiting for shared store you care about to become visible (with a barrier or a release-store) has to also wait for private memory operations even if they're independent.
the memory is always global, shared and globally synchronous, and
effectively entirely subject to all fences, even if it's memory used
as unnamed registers,
I'm not sure what you mean here. If a thread is accessing private data (i.e., not shared with any other thread), then there is almost no need for memory fence instructions1. Fences are used to control the order in which memory accesses from one core are seen by other cores.
Why doesn't the dedicated L1 cache provide register-like semantics for
those memory units that are only used by a particular execution unit?
I think (if I understand you correctly) what you're describing is called a scratchpad memory (SPM), which is a hardware memory structure that is mapped to the architectural physical address space or has its own physical address space. The software can directly access any location in an SPM, similar to main memory. However, unlike main memory, SPM has a higher bandwidth and/or lower latency than main memory, but is typically much smaller in size.
SPM is much simpler than a cache because it doesn't need tags, MSHRs, a replacement policy, or hardware prefetchers. In addition, the coherence of SPM works like main memory, i.e., it comes into play only when there are multiple processors.
SPM has been used in many commercial hardware accelerators such as GPUs, DSPs, and manycore processor. One example I am familiar with is the MCDRAM of the Knights Landing (KNL) manycore processor, which can be configured to work as near memory (i.e., an SPM), a last-level cache for main memory, or as a hybrid. The portion of the MCDRAM that is configured to work as SPM is mapped to the same physical address space as DRAM and the L2 cache (which is private to each tile) becomes the last-level cache for that portion of MCDRAM. If there is a portion of MCDRAM that is configured as a cache for DRAM, then it would be the last-level cache of DRAM only and not the SPM portion. MCDRAM has a much higher bandwdith than DRAM, but the latency is about the same.
In general, SPM can be placed anywhere in the memory hierarchy. For example, it could placed at the same level as the L1 cache. SPM improves performance and reduces energy consumption when there is no or little need to move data between SPM and DRAM.
SPM is very suitable for systems with real-time requirements because it provides guarantees regarding the maximum latency and/or lowest bandwdith, which is necessary to determine with certainty whether real-time constraints can be met.
SPM is not very suitable for general-purpose desktop or server systems where they can be multiple applications running concurrently. Such systems don't have real-time requirements and, currently, the average bandwdith demand doesn't justify the cost of including something like MCDRAM. Moreover, using an SPM at the L1 or L2 level imposes size constraints on the SPM and the caches and makes difficult for the OS and applications to exploit such a memory hierarchy.
Intel Optance DC memory can be mapped to the physical address space, but it is at the same level as main memory, so it's not considered as an SPM.
Footnotes:
(1) Memory fences may still be needed in single-thread (or uniprocessor) scenarios. For example, if you want to measure the execution time of a specific region of code on an out-of-order processor, it may be necessary to wrap the region between two suitable fence instructions. Fences are also required when communicating with an I/O device through write-combining memory-mapped I/O pages to ensure that all earlier stores have reached the device.

Triggering L2 cache write to global memory on AMD GCN architecture using OpenCL

I am writing a series of test for a GPU's DRAM (global) memory. Specifically targeting AMD GCN architecture of Tahiti and Hawaii model lines. The archs have a write-back L2 caches.
What I want is to ensure that the stores to global memory are indeed written through to global memory before another thread does a read.
The barrier and mem_fence documentation in the spec states:
CLK_GLOBAL_MEM_FENCE - The barrier function will queue a memory fence to ensure correct ordering of memory operations to global memory. This can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data.
However, this only enforces correct ordering. My question is does this trigger a write to global memory of the L2 cache data?
OpenCL 1.2 gives next to no control of this. The fences are very poorly defined and technically if you read carefully only affect the work-group. So most likely nothing will force the cache to flush until the kernel completes.
OpenCL 2.0 gives you full ordering control. Ordering is all you get, not explicit cache operations.
If you do a release write to all_svm_devices scope then by the time you can see that in a work-item on a different device you know that every write before it must be visible too. This may mean the cache has been flushed if the cache was not using a standard ownership-based coherence protocol.
If you release to device scope only and L2 is shared across the whole device there would be no need to flush it to guarantee that ordering.
The memory model is defined entirely in terms of ordering, not in terms of caches, but with scopes it is intended to allow efficient implementation on very relaxed cache hierarchies.

How does the cpu decide which data it puts in what memory (ram, cache, registers)?

When the cpu is executing a program, does it move all data through the memory pipeline? Then any piece of data would be moved from ram->cache->registers so all data that's executed goes in the cpu registers at some point. Or does it somehow select the code it puts in those faster memory types, or can you as a programmer select specific code you want to keep in, for example, the cache for optimization?
The answer to this question is an entire course in itself! A very brief summary of what (usually) happens is that:
You, the programmer, specify what goes in RAM. Well, the compiler does it on your behalf, but you're in control of this by how you declare your variables.
Whenever your code accesses a variable the CPU's MMU will check if the value is in the cache and if it is not, then it will fetch the 'line' that contains the variable from RAM into the cache. Some CPU instruction sets may allow you to prevent it from doing so (causing a stall) for specific low-frequecy operations, but it requires very low-level code to do so. When you update a value, the MMU will perform a 'cache flush' operation, committing the cached memory to RAM. Again, you can affect how and when this happens by low-level code. It will also depend on the MMU configuration such as whether the cache is write-through, etc.
If you are going to do any kind of operation on the value that will require it being used by an ALU (arithmetic Logic Unit) or similar, then it will be loaded into an appropriate register from the cache. Which register will depend on the instruction the compiler generated.
Some CPUs support Dynamic Memory Access (DMA), which provides a shortcut for operations that do not really require the CPU to be involved. These include memory-to-memory copies and the transfer of data between memory and memory-mapped peripheral control blocks (such as UARTs and other I/O blocks). These will cause data to be moved, read or written in RAM without actually affecting the CPU core at all.
At a higher level, some operating systems that support multiple processes will save the RAM allocated to the current process to the hard disk when the process is swapped out, and load it back in again from the disk when the process runs again. (This is why you may find 'Page Files' on your C: drive and the options to limit their size.) This allows all of the running processes to utilise most of the available RAM, even though they can't actually share it all simultaneously. Paging is yet another subject worthy of a course on its own. (Thanks to Leeor for mentioning this.)

GCC's reordering of read/write instructions

Linux's synchronization primitives (spinlock, mutex, RCUs) use memory barrier instructions to force the memory access instructions from getting re-ordered. And this reordering can be done either by the CPU itself or by the compiler.
Can someone show some examples of GCC produced code where such reordering is done ? I am interested mainly in x86. The reason I am asking this is to understand how GCC decides what instructions can be reordered. Different x86 mirco architectures (for ex: sandy bridge vs ivy bridge) use different cache architecture. Hence I am wondering how GCC does effective reordering that helps in the execution performance irrespective of the cache architecture. Some example C code and reordered GCC generated code would be very useful. Thanks!
The reordering that GCC may do is unrelated to the reordering an (x86) CPU may do.
Let's start off with compiler reordering. The C language rules are such that GCC is forbidden from reordering volatile loads and store memory accesses with respect to each other, or deleting them, when a sequence point occurs between them (Thanks to bobc for this clarification). That is to say, in the assembly output, those memory accesses will appear, and will be sequenced precisely in the order you specified. Non-volatile accesses, on the other hand, can be reordered with respect to all other accesses, volatile or not, provided that (by the as-if rule) the end result of the calculation is the same.
For instance, a non-volatile load in the C code could be done as many times as the code says, but in a different order (e.g. If the compiler feels it's more convenient to do it earlier or later when more registers are available). It could be done fewer times than the code says (e.g. If a copy of the value happened to still be available in a register in the middle of a large expression). Or it could even be deleted (e.g. if the compiler can prove the uselessness of the load, or if it moved a variable entirely into a register).
To prevent compiler reorderings at other times, you must use a compiler-specific barrier. GCC uses __asm__ __volatile__("":::"memory"); for this purpose.
This is different from CPU reordering, a.k.a. the memory-ordering model. Ancient CPUs executed instructions precisely in the order they appeared in the program; This is called program ordering, or the strong memory-ordering model. Modern CPUs, however, sometimes resort to "cheats" to run faster, by weakening a little the memory model.
The way x86 CPUs weaken the memory model is documented in Intel's Software Developer Manuals, Volume 3, Chapter 8, Section 8.2.2 "Memory Ordering in P6 and More Recent Processor Families". This is, in part, what it reads:
Reads are not reordered with other reads.
Writes are not reordered with older reads.
Writes to memory are not reordered with other writes, with [some] exceptions.
Reads may be reordered with older writes to different locations but not with older writes to the same location.
Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
Reads cannot pass earlier LFENCE and MFENCE instructions.
Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
LFENCE instructions cannot pass earlier reads.
SFENCE instructions cannot pass earlier writes.
MFENCE instructions cannot pass earlier reads or writes.
It also gives very good examples of what can and cannot be reordered, in Section 8.2.3 "Examples Illustrating the Memory-Ordering Principles".
As you can see, one uses FENCE instructions to prevent an x86 CPU from reordering memory accesses inappropriately.
Lastly, you may be interested in this link, which goes into further detail and comes with the assembly examples you crave.
The C language rules are such that GCC is forbidden from reordering volatile loads and store memory accesses with respect to each other, or deleting them.
That is not true, and is quite misleading. The C spec does not make such guarantee.
See When is a Volatile Object Accessed?
The standard encourages compilers to refrain from optimizations concerning accesses to volatile objects, but leaves it implementation defined as to what constitutes a volatile access. The minimum requirement is that at a sequence point all previous accesses to volatile objects have stabilized and no subsequent accesses have occurred. Thus an implementation is free to reorder and combine volatile accesses that occur between sequence points, but cannot do so for accesses across a sequence point.
Traditionally, programmers have relied on volatile as a cheap synchronization method but this is no longer a reliable method.

Resources