Why can't a load bypass a value written by another thread on the same core from a write buffer?

If a CPU core uses a write buffer, then a load can bypass the most recent store to the referenced location from the write buffer, without waiting until that store appears in the cache. But, as A Primer on Memory Consistency and Coherence puts it, if the CPU implements the TSO memory model, then
... multithreading introduces a subtle write buffer issue for TSO. TSO
write buffers are logically private to each thread context (virtual
core). Thus, on a multithreaded core, one thread context should never
bypass from the write buffer of another thread context. This logical
separation can be implemented with per-thread-context write buffers
or, more commonly, by using a shared write buffer with entries tagged
by thread-context identifiers that permit bypassing only when tags
match.
I can't grasp the necessity of this limitation. Could you please give me an example where allowing a thread to bypass a write buffer entry written by another thread on the same core leads to a violation of the TSO memory model?

The classic example of how TSO differs from sequential consistency (SC) is:
(This is example 2.4 here - http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf)
thread 0        | thread 1
----------------+----------------
write 1 --> [x] | write 1 --> [y]
a = read [x]    | b = read [y]
c = read [y]    | d = read [x]
Both addresses store 0 initially. The question is: would c = d = 0 be a valid outcome? We know a and b must return 1, since each of those loads matches the address of the local store before it and will most likely be forwarded from the local thread's store buffer. However, c and d may not be forwarded across thread contexts, so they may still observe the old value.
The interesting gotcha here is that since each thread observes both stores and forwards the local one, an outcome of a=1, c=0 would mean that t0 saw the store to [x] occur before the store to [y], while an outcome of b=1, d=0 would mean that t1 saw the store to [y] occur before the store to [x]. The fact that this combination is a possible outcome due to store buffer forwarding breaks sequential consistency, which requires that all contexts agree on a single global order of stores. Instead, x86 settled for the weaker TSO model, which allows this case.
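For concreteness, here is a minimal C++11 sketch of that litmus test. It assumes (my assumption, not part of the quoted text) that std::atomic with relaxed ordering compiles to plain loads and stores on x86, so the store-buffer effect is observable. Run long enough, it will occasionally report the c = d = 0 outcome that TSO permits but SC forbids:
#include <atomic>
#include <cstdio>
#include <thread>

// Shared locations; both reset to 0 before each iteration.
std::atomic<int> x{0}, y{0};
int a, b, c, d;

void thread0() {
    x.store(1, std::memory_order_relaxed);  // write 1 --> [x]
    a = x.load(std::memory_order_relaxed);  // forwarded from the local store buffer
    c = y.load(std::memory_order_relaxed);  // may still observe 0
}

void thread1() {
    y.store(1, std::memory_order_relaxed);  // write 1 --> [y]
    b = y.load(std::memory_order_relaxed);
    d = x.load(std::memory_order_relaxed);  // may still observe 0
}

int main() {
    for (int i = 0; i < 100000; ++i) {
        x = 0; y = 0;
        std::thread t0(thread0), t1(thread1);
        t0.join(); t1.join();
        if (a == 1 && b == 1 && c == 0 && d == 0)
            std::printf("iteration %d: c = d = 0 observed (allowed by TSO, forbidden by SC)\n", i);
    }
}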
Forwarding stores globally is practically impossible, since buffered stores are not necessarily committed; they may even be on the wrong path of a branch misprediction. Forwarding locally is fine, since a pipeline flush would also eliminate all the loads that forwarded from those stores, but across thread contexts you don't have that guarantee.
I've also seen work that tries to buffer stores globally outside of the core, but this is not very practical due to latency and bandwidth. For further reading, here's a recent paper that may be relevant - http://ieeexplore.ieee.org/abstract/document/7783736/

Related

OpenCL memory consistency

I have a question concerning the OpenCL memory consistency model. Consider the following kernel:
__kernel void foo() {
    __local int lmem[1];
    lmem[0] = 1;
    lmem[0] += 2;
}
In this case, is any synchronization or memory fence necessary to ensure that lmem[0] == 3?
According to section 3.3.1 of the OpenCL specification,
within a work-item memory has load / store consistency.
To me, this says that the assignment will always be executed before the increment.
However, section 6.12.9 defines the mem_fence function as follows:
Orders loads and stores of a work-item executing a kernel. This means that loads and stores preceding the mem_fence will be committed to memory before any loads and stores following the mem_fence.
Doesn't this contradict section 3.3.1? Or maybe my understanding of load / store consistency is wrong? I would appreciate your help.
As long as only one work-item performs read/write access to a local memory cell, that work-item has a consistent view of it. Committing to memory using a barrier is only necessary to propagate writes to other work-items in the work-group. For example, an OpenCL implementation would be permitted to keep any changes to local memory in private registers until a barrier is encountered. Within the work-item, everything would appear fine, but other work-items would never see these changes. This is how the phrase "committed to memory" should be interpreted in 6.12.9.
Essentially, the interaction between local memory and barriers boils down to this:
Between barriers:
Only one work-item is allowed read/write access to a given local memory cell, OR
any number of work-items in a work-group are allowed read-only access to that local memory cell.
In other words, no work-item may read or write a local memory cell which has been written to by another work-item after the last barrier.

boost lockfree spsc_queue cache memory access

I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where a pointer to a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to use the element, does it not need to dereference the pointer, finding and loading the pointed-to object either from the L3 cache or across the interconnect (if the other thread is on a different CPU socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ is principally a pay-for-what-you-need ecosystem.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a lock-free container for a value type that exceeds the system's native register size (say, an int64_t).
Good question.
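As a quick sanity check of that intuition, a small sketch (assuming C++11 std::atomic; Big is just an illustrative stand-in for a large queue element) shows that a machine-word-sized type is typically lock-free while a large value type is not:
#include <atomic>
#include <cstdio>

// A 64-byte value type, standing in for a "large" element.
struct Big { char bytes[64]; };

int main() {
    std::printf("atomic<int> lock-free: %d\n", std::atomic<int>{}.is_lock_free());  // typically 1
    std::printf("atomic<Big> lock-free: %d\n", std::atomic<Big>{}.is_lock_free());  // typically 0
}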
Enter Ringbuffers
Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, you only have to maintain the indices of the buffer's front and end positions atomically. The container can, at its leisure, update an internal 'dirty front index' while it copies ahead of the external front. (The copy can use relaxed memory ordering.) Only once the whole copy is known to have completed is the external front index updated. This update needs to be in acq_rel/cst memory order[1].
As long as the container is able to guard the invariant that the front never fully wraps around and reaches the back, this is a sweet deal. I think this idea was popularized in the Disruptor library (of LMAX fame); a minimal sketch of the scheme follows the list below. You get mechanical sympathy from
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple) physical cache lines
all the data is local unless the POD contains raw references outside that record
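Here is that minimal sketch: my own simplification of the scheme described above, not boost's actual implementation. The producer copies the element first and only then publishes the new write index with a release store; the consumer mirrors this for the read index:
#include <atomic>
#include <cstddef>

// Single-producer / single-consumer ring buffer. One slot is left unused so
// that the front can never wrap around onto the back.
template <typename T, std::size_t Capacity>
class SpscRing {
    T buf_[Capacity];
    std::atomic<std::size_t> write_{0};  // advanced only by the producer
    std::atomic<std::size_t> read_{0};   // advanced only by the consumer

public:
    bool push(const T& v) {
        std::size_t w    = write_.load(std::memory_order_relaxed);
        std::size_t next = (w + 1) % Capacity;
        if (next == read_.load(std::memory_order_acquire))
            return false;                                   // full
        buf_[w] = v;                                        // copy first ...
        write_.store(next, std::memory_order_release);      // ... then publish
        return true;
    }

    bool pop(T& out) {
        std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire))
            return false;                                   // empty
        out = buf_[r];                                      // read the element ...
        read_.store((r + 1) % Capacity, std::memory_order_release);  // ... then free the slot
        return true;
    }
};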
How Does Boost's spsc_queue Actually Do This?
Yes, spsc_queue stores the raw element values in a contiguous, aligned block of memory (e.g. from compile_time_sized_ringbuffer, which underlies spsc_queue when the maximum capacity is supplied statically):
typedef typename boost::aligned_storage<max_size * sizeof(T),
                                        boost::alignment_of<T>::value
                                       >::type storage_type;

storage_type storage_;

T * data()
{
    return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there are only the "internal" indices on the read and write sides. This is possible because there is only one writing thread and only one reading thread, which means that there could only be more space at the end of a write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::uninitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the uninitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)
All in all, we are looking at a best-in-class implementation of the ring buffer idea.
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing the frequently used data right there in the value, and referencing the less frequently used data via a pointer (the cost of the pointer is low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can control the memory access patterns there too.
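As a sketch of that trade-off (Message and Payload are hypothetical types and the field set is purely illustrative): keep the hot fields inline in a cache-line-sized element and attach the cold data by pointer:
#include <boost/lockfree/spsc_queue.hpp>
#include <cstdint>

struct Payload;  // hypothetical cold data, owned and allocated elsewhere

// Hot fields inline, padded to a single 64-byte cache line; cold data by pointer.
struct alignas(64) Message {
    std::uint64_t  sequence;    // hot: inspected on every pop
    double         price;       // hot
    std::uint32_t  size;        // hot
    const Payload* attachment;  // cold: only dereferenced when actually needed
};
static_assert(sizeof(Message) == 64, "one element per cache line on a 64-bit platform");

// Element stored by value, capacity fixed at compile time.
boost::lockfree::spsc_queue<Message, boost::lockfree::capacity<1024>> queue;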
Let your profiler guide you.
[1] I suppose that for SPSC, acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend everyone else follow my example :)

Questions about Memory models

When I was reading a book about compilers, I saw that there are two major memory models:
the register-to-register model and the memory-to-memory model.
In the book, it says that register-to-register models ignore machine limitations on the number of registers, and compiler back-ends must insert loads and stores. Is it because register-to-register models can use virtual registers...and this model keeps all values that can be stored in registers, so before finishing it must insert loads and stores (related to memory)?
Also, in the memory to memory part, the book says that the compiler back-end can remove redundant loads and stores. Does it mean that the model has to remove redundant uses of memory for optimization?
I'm going to answer your question in the context of compilers because that's what you said you were reading about. In a computer architecture context these answers will not apply, so read with caution.
Is it because register-to-register models can use virtual registers...and this model keeps all values that can be stored in registers, so before finishing it must insert loads and stores (related to memory)?
That's likely one reason. If the underlying machine does not support register/register operations, then the "virtual register" operations will need to be translated into loads and stores instead. Similarly, if your compiler assumes an infinite register machine during the IR phase, it might be necessary to spill some registers to memory during the register allocation phase (in which you map your infinite set of virtual registers to a finite set of real registers, using memory accesses when you run out).
Does it mean that the model has to remove redundant uses of memory for optimization?
Yes, this is something the compiler may do as an optimization step. If we do something like this:
register1 <- LOAD 1234
// Operation using register 1 that leaves the result in register 1
STORE register1, 1234
register1 <- LOAD 1234
// Another operation that uses register 1
STORE register1, 1235
This can be optimised to simply leave the value in the register instead (assuming location 1234 is not read elsewhere in the meantime, so the reload and the intermediate store are redundant), like this:
register1 <- LOAD 1234
// Operation using register 1 that leaves the result in register 1
// Another operation that uses register 1
STORE register1, 1235
This is clearly more efficient because it avoids additional DRAM accesses that are slow when compared to registers.

Memory Barriers and Relaxed Memory Models

Currently I try to improve my understanding of memory barriers, locks and memory model.
As far as I know there exist four different types of relaxations, namely
Write -> Read, Write -> Write, Read -> Write and Read -> Read.
An x86 processor allows just Write->Read relaxation which is often called Total Store Order (TSO).
Partial Store Order (PSO) allows further Write->Write relaxations and Relaxed Store Order (RSO)
allows all the above relaxations.
Further, there exist three types of memory barriers: release, acquire, and full (both together).
Locks can use just acquire and release barriers, or sometimes full barriers (.NET).
Now consider the following example:
// thread 0
x = 1
flag = 1
// thread 1
while (flag != 1);
print x
My current understanding tells me that I need no additional memory barriers if I run this code on a TSO machine.
If it is a PSO machine, I need a release barrier between x = 1 and flag = 1 to ensure that thread 1 sees the actual value of x once flag == 1.
If it is an RSO machine, I additionally need an acquire barrier between while (flag != 1); and print x to prevent thread 1 from reading the value of x too early.
Are my observations correct?
I think your code sample is close to the one in this question.
That said, for RSO you need more memory barriers than you describe; more specifically, for example, one that provides a freshness guarantee for thread 1 before the while loop.
I am unsure about the TSO and PSO parts; I hope this is helpful, because I was also trying to understand memory barriers in that question and a couple of related ones.
Reordering can happen at both the software (compiler) and hardware level, so keep that in mind. Even though on a TSO CPU the two stores would not be reordered, there is nothing that prevents the compiler from reordering the two stores (or the two loads). So flag needs to be a synchronization variable: the store to flag needs to be a release store, and the load of flag needs to be an acquire load.
But if we assume that the above code represents x86 instructions:
Then with TSO the above will work correctly since it will prevent the 2 stores and the 2 loads from being reordered.
But with PSO the above could fail because the 2 stores could be reordered.
So imagine you would have the following:
b = 1
x = 1
flag = 1
Where b is a value on the same cache line as flag. Then, with write coalescing, the flag = 1 and b = 1 stores could be coalesced, and as a consequence flag = 1 could overtake x = 1 and become globally visible before it.
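A minimal C++11 sketch of the flag/data handoff done the way this answer recommends, with the compiler- and hardware-level ordering made explicit (the store to flag is a release store, the load of flag is an acquire load):
#include <atomic>
#include <cstdio>
#include <thread>

int x = 0;                    // plain data, protected by the flag
std::atomic<int> flag{0};     // synchronization variable

void producer() {
    x = 1;                                        // data write
    flag.store(1, std::memory_order_release);     // publish: x = 1 cannot sink below this
}

void consumer() {
    while (flag.load(std::memory_order_acquire) != 1)
        ;                                         // spin until published
    std::printf("%d\n", x);                       // guaranteed to print 1
}

int main() {
    std::thread t0(producer), t1(consumer);
    t0.join(); t1.join();
}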

Atomically read/write int value w/o additional operation on the int value itself

GCC offers a nice set of built-in functions for atomic operations. And being on MacOS or iOS, even Apple offers a nice set of atomic functions. However, all these functions perform an operation, e.g. an addition/subtraction, a logical operation (AND/OR/XOR) or a compare-and-set/compare-and-swap. What I am looking for is a way to atomically assign/read an int value, like:
int a;
/* ... */
a = someVariable;
That's all. a will be read by another thread and it is only important that a either has its old value or its new value. Unfortunately the C standard does not guarantee that assigning or reading a value is an atomic operation. I remember that I once read somewhere, that writing or reading a value to a variable of type int is guaranteed to be atomic in GCC (regardless the size of int) but I searched everywhere on the GCC homepage and I cannot find this statement any longer (maybe it was removed).
I cannot use sig_atomic_t because sig_atomic_t has no guaranteed size and it might also have a different size than int.
Since only one thread will ever "write" a value to a, while both threads will "read" the current value of a, I don't need to perform the operations themselves in an atomic manner, e.g.:
/* thread 1 */
someVariable = atomicRead(a);
/* Do something with someVariable, non-atomic, when done */
atomicWrite(a, someVariable);
/* thread 2 */
someVariable = atomicRead(a);
/* Do something with someVariable, but never write to a */
If both threads were going to write to a, then all operations would have to be atomic, but that way, this may only waste CPU time; and we are extremely low on CPU resources in our project. So far we use a mutex around read/write operations of a and even though the mutex is held for such a tiny amount of time, this already causes problems (one of the threads is a realtime thread and blocking on a mutex causes it to fail its realtime constraints, which is pretty bad).
Of course I could use a __sync_fetch_and_add to read the variable (and simply add "0" to it, to not modify its value) and for writing use a __sync_val_compare_and_swap for writing it (as I know its old value, so passing that in will make sure the value is always exchanged), but won't this add unnecessary overhead?
A __sync_fetch_and_add with a 0 argument is indeed the best bet if you want your load to be atomic and act as a memory barrier. Similarly, you can use an AND with 0 or an OR with -1 to store 0 and -1 atomically with a memory barrier. For writing, you can use __sync_lock_test_and_set (actually an xchg operation) if an "acquire" barrier is enough, or, if using Clang, __sync_swap (which is an xchg operation with a full barrier).
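For example, a sketch of that approach using the GCC __sync builtins (atomic_read and atomic_write are illustrative names I am introducing here, not builtins):
static inline int atomic_read(int *p) {
    return __sync_fetch_and_add(p, 0);      // full-barrier atomic load (adds nothing)
}

static inline void atomic_write(int *p, int v) {
    __sync_lock_test_and_set(p, v);         // xchg; acquire barrier only
}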
However, in many cases that's overkill and you may prefer to add memory barriers manually. If you do not want the memory barrier, you can use a volatile access to atomically read or write a variable that is aligned and no wider than a machine word:
#define __sync_access(x) (*(volatile __typeof__(x) *) &(x))
(This macro yields an lvalue, so you can also use it for a store, like __sync_access(x) = 0.) The macro implements the same semantics as the C++11 memory_order_consume form, but only under two assumptions:
that your machine has coherent caches; if not, you need a memory barrier or global cache flush before the load (or before the first of a group of loads).
that your machine is not a DEC Alpha. The Alpha had very relaxed semantics for reordering memory accesses, so on it you'd need a memory barrier after the load (and after each load in a group of loads). On the Alpha the above macro only provides memory_order_relaxed semantics. BTW, the first versions of the Alpha couldn't even store a byte atomically (only a word, which was 8 bytes).
In either case, the __sync_fetch_and_add would work. As far as I know, no other machine imitated the Alpha so neither assumption should pose problems on current computers.
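For illustration, here is how the macro above might be used in the question's single-writer/single-reader scenario (publish and poll are hypothetical names):
#define __sync_access(x) (*(volatile __typeof__(x) *) &(x))

int a;  // written by one thread, read concurrently by another

void publish(int v) { __sync_access(a) = v; }     // volatile store; atomic for an aligned int
int  poll(void)     { return __sync_access(a); }  // volatile load; no ordering beyond atomicity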
Volatile, aligned, word-sized reads/writes are atomic on most platforms. Checking your assembly would be the best way to find out whether this is true on your platform. Atomic registers cannot produce nearly as many interesting wait-free structures as more complicated mechanisms like compare-and-swap, which is why those more complicated mechanisms are provided.
See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.5659&rank=3 for the theory.
Regarding __sync_fetch_and_add with a 0 argument: this seems like the safest bet. If you're worried about efficiency, profile the code and see whether you're meeting your performance targets. You may be falling victim to premature optimization.
