Is a read on an atomic variable guaranteed to acquire the current value of it in C++11?

It is known that the modifications on a single atomic variable form a total order. Suppose we have an atomic read operation on some atomic variable v at wall-clock time T. Then, is this read guaranteed to acquire the current value of v, i.e. the value written by the last write in the modification order of v as of time T? To put it another way, if an atomic write is done before an atomic read in natural time, and there are no other writes in between, is the read guaranteed to return the value just written?
What actually answered my question is the 6th comment Cubbi made under his own answer.

Wall-clock time is irrelevant. However, what you're describing sounds like the write-read coherence guarantee:
§1.10 [intro.multithread]/20
If a side effect X on an atomic object M happens before a value computation B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M.
(translating the standardese, "value computation" is a read, and "side effect" is a write)
In particular, if your relaxed write and your relaxed read are in different statements of the same function, they are connected by a sequenced-before relationship, therefore they are connected by a happens-before relationship, therefore the guarantee holds.
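For illustration, here is a minimal sketch of that single-thread case (the names are mine, and relaxed ordering is chosen deliberately to show that coherence alone, not synchronization, provides the guarantee):

#include <atomic>

std::atomic<int> v{0};

void same_thread() {
    v.store(42, std::memory_order_relaxed);     // side effect X on v
    int r = v.load(std::memory_order_relaxed);  // value computation B of v
    // The store is sequenced before the load, hence happens before it,
    // so the load must return 42 or a value that follows 42 in the
    // modification order of v (written by some other thread).
    (void)r;
}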

It depends on the memory order you specify for the load() operation.
By default it is std::memory_order_seq_cst and the answer is yes, it guarantees the current value stored by another thread (if stored at all, i.e. the store must use at least the std::memory_order_release memory order, otherwise the store's visibility is not guaranteed).
But if you specify std::memory_order_relaxed for the load operation, the documentation says: "Relaxed ordering: there are no synchronization or ordering constraints, only atomicity is required of this operation." I.e. the program could end up not reading from memory at all.
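As a hedged sketch of the store-visibility requirement this answer mentions (a release store paired with an acquire load; all names are mine):

#include <atomic>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

// writer thread
void producer() {
    data.store(123, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);  // at least release, so the store can be published
}

// reader thread
void consumer() {
    if (ready.load(std::memory_order_acquire)) {   // the default seq_cst load would also work
        int d = data.load(std::memory_order_relaxed);
        // d is guaranteed to be 123: the release store to ready synchronizes
        // with the acquire load that observed true.
        (void)d;
    }
}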

Is a read on an atomic variable guaranteed to acquire the current value of it
No
Even though each atomic variable has a single modification order (which is observed by all threads), that does not mean that all threads observe modifications at the same moment in time.
Consider this code:
std::atomic<int> g{0};
// thread 1
g.store(42);
// thread 2
int a = g.load();
// do stuff with a
int b = g.load();
A possible outcome is:
thread 1: 42 is stored at time T1
thread 2: the first load returns 0 at time T2
thread 2: the store from thread 1 becomes visible at time T3
thread 2: the second load returns 42 at time T4.
This outcome is possible even though the first load at time T2 occurs after the store at T1 (in clock time).
The standard says:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
It does not require a store to become visible right away and it even allows room for a store to remain invisible (e.g. on systems without cache-coherency).
In that case, an atomic read-modify-write (RMW) is required to access the last value.
Atomic read-modify-write operations shall always read the last value (in the modification order) written
before the write associated with the read-modify-write operation.
Needless to say, RMWs are more expensive to execute (they lock the bus) and that is why a regular atomic load is allowed to return an older (cached) value.
If a regular load were required to return the last value, performance would be horrible while there would be hardly any benefit.
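For completeness, a small sketch of reading through an RMW (my example, not from the answer): fetch_add with zero forces the read to observe the last value in the modification order, at the cost of an RMW:

#include <atomic>

std::atomic<int> g{0};

int read_latest() {
    // An atomic RMW must read the last value in the modification order,
    // so adding 0 is a (comparatively expensive) way to force an
    // up-to-date read instead of a possibly older cached value.
    return g.fetch_add(0);
}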

Related

Atomic property and usage

I have read many stackoverflow answers for this, like Is an atomic property thread safe?, When to use #atomic? or Atomic properties vs thread-safe in Objective-C, but I have a question about this:
Please correct me if I am wrong. Say I am using a count variable which I have declared as an atomic property and currently its value is 5,
and it is accessed by two threads: the first thread increases the count by 2 and the second thread decreases it by 1. According to my understanding this goes sequentially: the first thread increases the value, which is now 5 + 2 = 7; only after that can the second thread access the count variable and decrease it by 1, giving 7 - 1 = 6?
The first thread increases the count by 2
That is not an atomic operation, and atomic in no way helps you. This is (at least) three separate atomic operations:
read value
increase value
write value
This is the classic multi-writer race condition. Another thread might read between "read value" and "write value." In your example the final result could be 4, such that the increase operation is completely lost (A reads 5, B reads 5, A +2, A writes 7, B -1, B writes 4).
The problem that atomic is meant to solve is that "read value" and "write value" aren't even atomic operations in many platform-specific situations. The above may actually be 5 operations such as:
read lower word
read upper word
increase value
write lower word
write upper word
Without atomic, another thread might read between "write lower word" and "write upper word" and get a garbage value that was never written (half of one value and half of another). By using atomic, you ensure that readers will always receive a "legitimate" value (one that was written at some point). But that's not much of a promise.
But as is noted in the questions you provide, if you need to make reads and writes atomic, you almost certainly need more than that, because you also want to make "increase the value" atomic. So in practice atomic is rarely useful. If you don't need it, it's slow. If you do need it, it's probably insufficient.
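To make the lost-update scenario concrete, here is a minimal C++ sketch of the same idea (std::atomic<int> stands in for the Objective-C atomic property; this substitution is mine): a separate load and store can lose an update, while a single read-modify-write cannot:

#include <atomic>

std::atomic<int> count{5};

// Racy: the load and the store are each atomic, but another thread can
// run between them, so an increment can be lost (A reads 5, B reads 5,
// A writes 7, B writes 4).
void add_two_racy() {
    int tmp = count.load();
    count.store(tmp + 2);
}

// Correct: the whole increment is a single atomic read-modify-write.
void add_two_atomic() {
    count.fetch_add(2);
}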
With an atomic property, where two threads access the same object to increase or decrease a value, you can understand it like this: both threads have read the same, whole value, which is 5, but while the first thread is updating that value it is locked and no other thread can update it at the same time; however, the second thread also holds the same value of 5 and can only update the count object once the first thread has finished its update.
OK, but threads might not execute sequentially; the first thread created might run after the second one. If the threads execute in the order you describe, the behavior you mention is fine. But I think that you have a misconception about thread safety.
I encourage you to read more in the Concurrency Programming Guide and about thread safety.

In which scenarios does the keyword "volatile" need to be declared in Objective-C?

As I know, volatile is usually used to prevent unexpected compiler optimization during some hardware operations. But in which scenarios volatile should be declared in a property definition puzzles me. Please give some representative examples.
Thanks.
A compiler assumes that the only way a variable can change its value is through code that changes it.
int a = 24;
Now the compiler assumes that a is 24 until it sees any statement that changes the value of a. If you write code somewhere below that statement that says
int b = a + 3;
the compiler will say "I know what a is, it's 24! So b is 27. I don't have to write code to perform that calculation, I know that it will always be 27". The compiler may just optimize the whole calculation away.
But the compiler would be wrong in case a has changed between the assignment and the calculation. However, why would a do that? Why would a suddenly have a different value? It won't.
If a is a stack variable, it cannot change value, unless you pass a reference to it, e.g.
doSomething(&a);
The function doSomething has a pointer to a, which means it can change the value of a and after that line of code, a may not be 24 any longer. So if you write
int a = 24;
doSomething(&a);
int b = a + 3;
the compiler will not optimize the calculation away. Who knows what value a will have after doSomething? The compiler for sure doesn't.
Things get more tricky with global variables or instance variables of objects. These variables are not on the stack; they are on the heap or in static storage, and that means that different threads can have access to them.
// Global Scope
int a = 0;
int b = 0;

void function ( ) {
    a = 24;
    b = a + 3;
}
Will b be 27? Most likely the answer is yes, but there is a tiny chance that some other thread has changed the value of a between these two lines of code and then it won't be 27. Does the compiler care? No. Why? Because C doesn't know anything about threads - at least it didn't use to (the latest C standard finally knows native threads, but all thread functionality before that was only an API provided by the operating system and not native to C). So a C compiler will still assume that b is 27 and optimize the calculation away, which may lead to incorrect results.
And that's what volatile is good for. If you tag a variable volatile like that
volatile int a = 0;
you are basically telling the compiler: "The value of a may change at any time. No seriously, it may change out of the blue. You don't see it coming and *bang*, it has a different value!". For the compiler that means it must not assume that a has a certain value just because it used to have that value 1 pico-second ago and there was no code that seemed to have changed it. Doesn't matter. When accessing a, always read its current value.
Overuse of volatile prevents a lot of compiler optimizations, may slow down calculation code dramatically, and very often people use volatile in situations where it isn't even necessary. For example, the compiler never makes value assumptions across memory barriers. What exactly is a memory barrier? Well, that's a bit beyond the scope of my reply. You just need to know that typical synchronization constructs are memory barriers, e.g. locks, mutexes or semaphores, etc. Consider this code:
#include <pthread.h>

// Global Scope
int a = 0;
int b = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void function ( ) {
    a = 24;
    pthread_mutex_lock(&m);
    b = a + 3;
    pthread_mutex_unlock(&m);
}
pthread_mutex_lock is a memory barrier (pthread_mutex_unlock as well, by the way) and thus it's not necessary to declare a as volatile; the compiler will never make an assumption about the value of a across a memory barrier.
Objective-C is pretty much like C in all these aspects; after all it's just C with extensions and a runtime. One thing to note is that atomic properties in Obj-C are memory barriers, so you don't need to declare properties volatile. If you access the property from multiple threads, declare it atomic, which is even the default, by the way (if you don't mark it nonatomic, it will be atomic). If you never access it from multiple threads, tagging it nonatomic will make access to that property a lot faster, but that only pays off if you access the property really a lot (and a lot doesn't mean ten times a minute, it rather means several thousand times a second).
So you want Obj-C code that requires volatile?
@implementation SomeObject {
    volatile bool done;
}

- (void)someMethod {
    done = false;
    // Start some background task that performs an action
    // and when it is done with that action, it sets `done` to true.
    // ...
    // Wait till the background task is done
    while (!done) {
        // Run the runloop for 10 ms, then check again
        [[NSRunLoop currentRunLoop]
            runUntilDate:[NSDate dateWithTimeIntervalSinceNow:0.01]
        ];
    }
}
@end
Without volatile, the compiler may be dumb enough to assume that done will never change here and replace !done simply with true. And while (true) is an endless loop that will never terminate.
I haven't tested that with modern compilers. Maybe the current version of clang is more intelligent than that. It may also depend on how you start the background task. If you dispatch a block, the compiler can actually easily see whether it changes done or not. If you pass a reference to done somewhere, the compiler knows that the receiver may change the value of done and will not make any assumptions. But I tested exactly that code a long time ago when Apple was still using GCC 2.x, and back then not using volatile really caused an endless loop that never terminated (yet only in release builds with optimizations enabled, not in debug builds). So I would not rely on the compiler being clever enough to do it right.
Just some more fun facts about memory barriers:
If you ever had a look at the atomic operations that Apple offers in <libkern/OSAtomic.h>, then you might have wondered why every operation exists twice: Once as x and once as xBarrier (e.g. OSAtomicAdd32 and OSAtomicAdd32Barrier). Well, now you finally know it. The one with "Barrier" in its name is a memory barrier, the other one isn't.
Memory barriers are not just for compilers, they are also for CPUs (there exist CPU instructions that are considered memory barriers, while normal instructions are not). The CPU needs to know about these barriers because CPUs like to reorder instructions and perform operations out of order. E.g. if you do
a = x + 3 // (1)
b = y * 5 // (2)
c = a + b // (3)
and the pipeline for additions is busy, but the pipeline for multiplication is not, the CPU may perform instruction (2) before (1), after all the order won't matter in the end. This prevents a pipeline stall. Also the CPU is clever enough to know that it cannot perform (3) before either (1) or (2) because the result of (3) depends on the results of the other two calculations.
Yet, certain kinds of order changes will break the code, or the intention of the programmer. Consider this example:
x = y + z // (1)
a = 1 // (2)
The addition pipe might be busy, so why not just perform (2) before (1)? They don't depend on each other, the order shouldn't matter, right? Well, it depends. Consider another thread monitors a for changes and as soon as a becomes 1, it reads the value of x, which should now be y+z if the instructions were performed in order. Yet if the CPU reordered them, then x will have whatever value it used to have before getting to this code and this makes a difference as the other thread will now work with a different value, not the value the programmer would have expected.
So in this case the order will matter and that's why barriers are also needed for CPUs: CPUs don't reorder instructions across such barriers and thus instruction (2) would need to be a barrier instruction (or there needs to be such an instruction between (1) and (2); that depends on the CPU). However, reordering instructions is only performed by modern CPUs; a much older problem is delayed memory writes. If a CPU delays memory writes (very common for some CPUs, as memory access is horribly slow for a CPU), it will make sure that all delayed writes are performed and have completed before a memory barrier is crossed, so all memory is in a correct state in case another thread might now access it (and now you also know where the name "memory barrier" actually comes from).
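As a rough sketch of how the publish pattern above is usually expressed so that neither the compiler nor the CPU reorders the two stores (using C++11 atomics rather than the OSAtomic*Barrier calls, purely as an illustration):

#include <atomic>

int x = 0;
std::atomic<bool> a{false};

// publishing thread
void publish(int y, int z) {
    x = y + z;                                 // (1) compute the payload
    a.store(true, std::memory_order_release);  // (2) release store: (1) cannot be reordered after it
}

// monitoring thread
void monitor() {
    if (a.load(std::memory_order_acquire)) {   // acquire load pairs with the release store
        int seen = x;                          // guaranteed to observe y + z
        (void)seen;
    }
}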
You are probably working a lot more with memory barriers than you are even aware of (GCD - Grand Central Dispatch - is full of them, and NSOperation/NSOperationQueue is based on GCD), which is why you really need to use volatile only in very rare, exceptional cases. You might get away with writing 100 apps and never have to use it even once. However, if you write a lot of low-level, multi-threading code that aims to achieve the maximum performance possible, you will sooner or later run into a situation where only volatile can guarantee correct behavior; not using it in such a situation will lead to strange bugs where loops don't seem to terminate or variables simply seem to have incorrect values and you find no explanation for that. If you run into bugs like these, especially if you only see them in release builds, you are probably missing a volatile or a memory barrier somewhere in your code.
A good explanation is given here: Understanding “volatile” qualifier in C
The volatile keyword is intended to prevent the compiler from applying any optimizations on objects that can change in ways that cannot be determined by the compiler.
Objects declared as volatile are omitted from optimization because their values can be changed by code outside the scope of the current code at any time. The system always reads the current value of a volatile object from its memory location rather than keeping the value in a temporary register at the point it is requested, even if a previous instruction asked for a value from the same object. So the simple question is: how can the value of a variable change in a way that the compiler cannot predict? Consider the following cases for an answer to this question.
1) Global variables modified by an interrupt service routine outside the scope: For example, a global variable can represent a data port (usually a global pointer, referred to as memory-mapped IO) which will be updated dynamically. The variable holding the data port must be declared as volatile in order to fetch the latest data available at the port. If the variable is not declared volatile, the compiler will optimize the code in such a way that it reads the port only once and keeps using the same value from a temporary register to speed up the program (speed optimization). In general, an ISR updates these data ports when there is an interrupt due to the availability of new data.
2) Global variables within a multi-threaded application: There are multiple ways for threads to communicate, viz., message passing, shared memory, mailboxes, etc. A global variable is a weak form of shared memory. When two threads share information via a global variable, it needs to be qualified with volatile. Since threads run asynchronously, any update of a global variable by one thread should be fetched freshly by the other, consuming thread. The compiler can read the global variable and place it in a temporary variable of the current thread context. To nullify the effect of such compiler optimizations, these global variables need to be qualified as volatile.
If we do not use volatile qualifier, the following problems may arise
1) Code may not work as expected when optimization is turned on.
2) Code may not work as expected when interrupts are enabled and used.
volatile comes from C. Type "C language volatile" into your favourite search engine (some of the results will probably come from SO), or read a book on C programming. There are plenty of examples out there.

ARM single-copy atomicity

I am currently wading through the ARM architecture manual for the ARMv7 core. In chapter A3.5.3 about atomicity of memory accesses, it states:
If a single-copy atomic load overlaps a single-copy atomic store and
for any of the overlapping bytes the load returns the data written by
the write inserted into the Coherence order of that byte by the
single-copy atomic store then the load must return data from a point
in the Coherence order no earlier than the writes inserted into the
Coherence order by the single-copy atomic store of all of the
overlapping bytes.
As a non-native English speaker I admit that I am slightly challenged in understanding this sentence.
Is there a scenario where writes to a memory byte are not inserted in the Coherence Order and thus the above does not apply? If not, am I correct to say that shortening and rephrasing the sentence to the following:
If the load happens to return at least one byte of the write, then the load must return all overlapping bytes from a point no earlier than where the write inserted them into the Coherence order of all of the overlapping bytes.
still transports the same meaning?
I see that wording in the ARMv8 ARM, which really tries to remove any possible ambiguity in a lot of places (even if it does make the memory ordering section virtually unreadable).
In terms of general understanding (as opposed to actually implementing the specification), a little bit of ambiguity doesn't always hurt, so whilst it fails to make it absolutely clear what a "memory location" means, I think the old v7 manual (DDI0406C.b) is a nicer read in this case:
A read or write operation is single-copy atomic if the following conditions are both true:
After any number of write operations to a memory location, the value of the memory location is the value written by one of the write operations. It is impossible for part of the value of the memory location to come from one write operation and another part of the value to come from a different write operation
When a read operation and a write operation are made to the same memory location, the value obtained by the read operation is one of:
the value of the memory location before the write operation
the value of the memory location after the write operation.
It is never the case that the value of the read operation is partly the value of the memory location before the write operation and partly the value of the memory location after the write operation.
So your understanding is right - the defining point of a single-copy atomic operation is that at any given time you can only ever see either all of it, or none of it.
There is a case in v7 whereby (if I'm interpreting it right) two normally single-copy atomic stores that occur to the same location at the same time but with different sizes break any guarantee of atomicity, so in theory you could observe some unexpected mix of bytes there - this looks to have been removed in v8.
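A small C++ sketch of the all-or-nothing property being discussed (my illustration, not from the ARM manual): a read of a single-copy atomic object may observe the old value or the new value, but never a byte-wise mixture of both:

#include <atomic>
#include <cstdint>

std::atomic<uint32_t> word{0x00000000};

// writer
void store_new() {
    word.store(0xAAAAAAAA);    // single-copy atomic store of all four bytes
}

// reader (possibly on another core)
void check() {
    uint32_t v = word.load();  // must be 0x00000000 or 0xAAAAAAAA,
                               // never e.g. 0x0000AAAA (a torn mix of the two)
    (void)v;
}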

Memory Barriers and Relaxed Memory Models

Currently I try to improve my understanding of memory barriers, locks and memory model.
As far as I know there exist four different types of relaxations, namely
Write -> Read, Write -> Write, Read -> Write and Read -> Read.
An x86 processor allows just Write->Read relaxation which is often called Total Store Order (TSO).
Partial Store Order (PSO) allows further Write->Write relaxations and Relaxed Store Order (RSO)
allows all the above relaxations.
Further, there exist three types of memory barriers: release, acquire, and both together.
Locks can use just acquire and release barriers or sometimes full barriers (.Net).
Now consider the following example:
// thread 0
x = 1
flag = 1
//thread 1
while (flag != 1);
print x
My current understanding tells me that I need no additional memory barriers if I run this code on a TSO machine.
If it is a PSO machine, I need a release barrier between x = 1 and flag = 1 to ensure that thread 1 gets the actual value of x if flag = 1.
If it is an RSO machine, I further need an acquire barrier between while (flag != 1); and print x to prevent thread 1 from reading the value of x too early.
Are my observations correct?
I think your code sample is close to the one in this question.
That said, for RSO you need more memory barriers than you describe; more specifically, for example, one that provides a freshness guarantee for thread 1 before the while loop.
I am unsure about the TSO and PSO part; I hope this can be helpful, because I was also trying to understand memory barriers in that question and a couple of related ones.
Reordering can happen on both the software (compiler) and hardware level, so keep that in mind. Even though on a TSO CPU the 2 stores would not be reordered, there is nothing that prevents the compiler from reordering the 2 stores (or the 2 loads). So flag needs to be a synchronization variable, the store of flag needs to be a release store, and the load of flag needs to be an acquire load.
But if we assume that the above code represents X86 instructions:
Then with TSO the above will work correctly since it will prevent the 2 stores and the 2 loads from being reordered.
But with PSO the above could fail because the 2 stores could be reordered.
So imagine you had the following:
b = 1
x = 1
flag = 1
Here b is a value on the same cache line as flag. Then, with write coalescing, flag = 1 and b = 1 could be coalesced, and as a consequence flag = 1 could overtake x = 1 and hence become globally visible before it.
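As a concrete illustration of making flag a synchronization variable with a release store and an acquire load (a sketch in C++11 atomics, which is my choice of notation; it is portable across TSO, PSO and RSO hardware):

#include <atomic>
#include <cstdio>

int x = 0;
std::atomic<int> flag{0};

// thread 0
void writer() {
    x = 1;
    flag.store(1, std::memory_order_release);              // release: x = 1 cannot move below this store
}

// thread 1
void reader() {
    while (flag.load(std::memory_order_acquire) != 1) { }  // acquire: x is read only after flag == 1 is seen
    std::printf("%d\n", x);                                 // guaranteed to print 1
}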

Atomically read/write int value w/o additional operation on the int value itself

GCC offers a nice set of built-in functions for atomic operations. And being on MacOS or iOS, even Apple offers a nice set of atomic functions. However, all these functions perform an operation, e.g. an addition/subtraction, a logical operation (AND/OR/XOR) or a compare-and-set/compare-and-swap. What I am looking for is a way to atomically assign/read an int value, like:
int a;
/* ... */
a = someVariable;
That's all. a will be read by another thread and it is only important that a either has its old value or its new value. Unfortunately the C standard does not guarantee that assigning or reading a value is an atomic operation. I remember that I once read somewhere that writing or reading a value to a variable of type int is guaranteed to be atomic in GCC (regardless of the size of int), but I searched everywhere on the GCC homepage and I cannot find this statement any longer (maybe it was removed).
I cannot use sig_atomic_t because sig_atomic_t has no guaranteed size and it might also have a different size than int.
Since only one thread will ever "write" a value to a, while both threads will "read" the current value of a, I don't need to perform the operations themselves in an atomic manner, e.g.:
/* thread 1 */
someVariable = atomicRead(a);
/* Do something with someVariable, non-atomic, when done */
atomicWrite(a, someVariable);
/* thread 2 */
someVariable = atomicRead(a);
/* Do something with someVariable, but never write to a */
If both threads were going to write to a, then all operations would have to be atomic, but that way, this may only waste CPU time; and we are extremely low on CPU resources in our project. So far we use a mutex around read/write operations of a and even though the mutex is held for such a tiny amount of time, this already causes problems (one of the threads is a realtime thread and blocking on a mutex causes it to fail its realtime constraints, which is pretty bad).
Of course I could use a __sync_fetch_and_add to read the variable (and simply add "0" to it, to not modify its value) and use a __sync_val_compare_and_swap for writing it (as I know its old value, passing that in will make sure the value is always exchanged), but won't this add unnecessary overhead?
A __sync_fetch_and_add with a 0 argument is indeed the best bet if you want your load to be atomic and act as a memory barrier. Similarly, you can use an and with 0 or an or with -1 to store 0 and -1 atomically with a memory barrier. For writing, you can use __sync_lock_test_and_set (actually an xchg operation) if an "acquire" barrier is enough, or if using Clang you can use __sync_swap (which is an xchg operation with a full barrier).
However, in many cases that's overkill and you may prefer to add memory barriers manually. If you do not want the memory barrier, you can use a volatile load to atomically read/write a variable that is aligned and no wider than a word:
#define __sync_access(x) (*(volatile __typeof__(x) *) &(x))
(This macro is an lvalue, so you can also use it for a store like __sync_access(x) = 0). The macro implements the same semantics as the C++11 memory_order_consume form, but only under two assumptions:
that your machine has coherent caches; if not, you need a memory barrier or global cache flush before the load (or before the first of a group of loads).
that your machine is not a DEC Alpha. The Alpha had very relaxed semantics for reordering memory accesses, so on it you'd need a memory barrier after the load (and after each load in a group of loads). On the Alpha the above macro only provides memory_order_relaxed semantics. BTW, the first versions of the Alpha couldn't even store a byte atomically (only a word, which was 8 bytes).
In either case, the __sync_fetch_and_add would work. As far as I know, no other machine imitated the Alpha so neither assumption should pose problems on current computers.
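A hedged usage sketch of the macro above (assuming GCC or Clang, an int that is aligned and no wider than a machine word, and a cache-coherent machine; the barrier variants mentioned in the answer are shown in comments):

#define __sync_access(x) (*(volatile __typeof__(x) *) &(x))

static int a = 0;

// thread 1: the only writer
void writer(int someVariable) {
    __sync_access(a) = someVariable;        // atomic store, no memory barrier
    // to add a full barrier around the store: __sync_synchronize();
}

// thread 2: reader only
int reader(void) {
    int someVariable = __sync_access(a);    // atomic load, no memory barrier
    // load with a full barrier instead: someVariable = __sync_fetch_and_add(&a, 0);
    return someVariable;
}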
Volatile, aligned, word-sized reads/writes are atomic on most platforms. Checking your assembly would be the best way to find out if this is true on your platform. Atomic registers cannot produce nearly as many interesting wait-free structures as more complicated mechanisms like compare-and-swap, which is why the latter are included.
See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.5659&rank=3 for the theory.
Regarding __sync_fetch_and_add with a 0 argument - this seems like the safest bet. If you're worried about efficiency, profile the code and see if you're meeting your performance targets. You may be falling victim to premature optimization.

Resources