ARM single-copy atomicity - memory

I am currently wading through the ARM architecture manual for the ARMv7 core. In chapter A3.5.3 about atomicity of memory accesses, it states:
If a single-copy atomic load overlaps a single-copy atomic store and
for any of the overlapping bytes the load returns the data written by
the write inserted into the Coherence order of that byte by the
single-copy atomic store then the load must return data from a point
in the Coherence order no earlier than the writes inserted into the
Coherence order by the single-copy atomic store of all of the
overlapping bytes.
As a non-native English speaker, I admit that I am slightly challenged in understanding this sentence.
Is there a scenario where writes to a memory byte are not inserted in the Coherence Order and thus the above does not apply? If not, am I correct to say that shortening and rephrasing the sentence to the following:
If the load happens to return at least one byte of the
write, then the load must return all overlapping bytes from a point
no earlier than where the write inserted them into the
Coherence order of all of the overlapping bytes.
still transports the same meaning?

I see that wording in the ARMv8 ARM, which really tries to remove any possible ambiguity in a lot of places (even if it does make the memory ordering section virtually unreadable).
In terms of general understanding (as opposed to actually implementing the specification), a little bit of ambiguity doesn't always hurt, so whilst it fails to make it absolutely clear what a "memory location" means, I think the old v7 manual (DDI0406C.b) is a nicer read in this case:
A read or write operation is single-copy atomic if the following conditions are both true:
After any number of write operations to a memory location, the value of the memory location is the value written by one of the write operations. It is impossible for part of the value of the memory location to come from one write operation and another part of the value to come from a different write operation
When a read operation and a write operation are made to the same memory location, the value obtained by the read operation is one of:
the value of the memory location before the write operation
the value of the memory location after the write operation.
It is never the case that the value of the read operation is partly the value of the memory location before the write operation and partly the value of the memory location after the write operation.
So your understanding is right - the defining point of a single-copy atomic operation is that at any given time you can only ever see either all of it, or none of it.
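As a small illustration of the "all of it, or none of it" rule, the sketch below builds the values a reader could observe if bytes from the old and new contents were mixed, and checks which of them single-copy atomicity permits. The helper names mix_bytes and allowed_by_single_copy_atomicity are hypothetical and only model the definition quoted above; they do not exercise real hardware.

```cpp
#include <cassert>
#include <cstdint>

// Build a 32-bit value whose byte i comes from `after` when bit i of `mask`
// is set, and from `before` otherwise. Masks 0x0 and 0xF are the untorn cases.
std::uint32_t mix_bytes(std::uint32_t before, std::uint32_t after, unsigned mask) {
    assert(mask <= 0xFu);  // only four bytes to choose from
    std::uint32_t result = 0;
    for (unsigned i = 0; i < 4; ++i) {
        const std::uint32_t byte_mask = 0xFFu << (8 * i);
        const std::uint32_t source = ((mask >> i) & 1u) ? after : before;
        result |= source & byte_mask;
    }
    return result;
}

// A single-copy atomic read overlapping a single-copy atomic write may only
// return the complete old value or the complete new value.
bool allowed_by_single_copy_atomicity(std::uint32_t before, std::uint32_t after,
                                      std::uint32_t observed) {
    return observed == before || observed == after;
}
```

With before = 0x11111111 and after = 0x22222222, the mixed value 0x11112222 (mask 0x3) is rejected, while masks 0x0 and 0xF reproduce the old and new values and are accepted.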
There is a case in v7 whereby (if I'm interpreting it right) two normally single-copy atomic stores that occur to the same location at the same time but with different sizes break any guarantee of atomicity, so in theory you could observe some unexpected mix of bytes there - this looks to have been removed in v8.


When would a linked list be preferred over a circular buffer?

In terms of big-O runtime, it seems that both data structures have in the "average" case:
O(1) insertion/removal into the start and end
O(n) insertion/removal into some arbitrary index
O(1) lookup of the start and end
Advantages of circular buffer:
O(1) lookup instead of O(n) of some arbitrary index
Doesn't need to create nodes, thus doesn't need a dynamic allocation on each insertion
Faster traversal due to better cache locality and hardware prefetching
Faster removal due to vectorization (e.g. using memmove) to fill the gap
Typically needs less space (because in a linked list, for each node, you have to store pointers to the next and/or previous node)
Advantages of linked list:
Easier to get O(1) insertion/removal to some specific place (e.g., could get it for midway through the linked list). Circular buffers can do it, but it's more complicated
O(1) insertion in the worst case, unlike circular buffers that are O(n) (when it needs to grow the buffer)
Based on this list, it seems to me that circular buffers are a far better choice in almost every case. Am I missing something?
The MCS lock is one of the most scalable lock designs there is. A thread uses an atomic compare and exchange to attempt to seize the lock. If it works it is done. If it doesn't work, the thread uses an atomic exchange to enqueue itself at the tail of the list of waiters.
There is no way to do a similar thing with circular buffers without locks or more complicated use of atomic instructions.
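To make the linked-list aspect concrete, here is a minimal sketch of an MCS-style lock. It follows the common textbook formulation, in which every acquirer enqueues itself with an atomic exchange on the tail pointer and compare-and-exchange is used only on release, so it differs slightly from the description above. The names McsNode, McsLock and is_free are chosen for illustration, and the memory orderings are one reasonable choice rather than a tuned implementation.

```cpp
#include <atomic>

// One queue node per waiting thread; each waiter spins on its own flag.
struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class McsLock {
public:
    void lock(McsNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Enqueue at the tail of the waiter list with a single atomic exchange.
        McsNode* prev = tail_.exchange(&me, std::memory_order_acq_rel);
        if (prev != nullptr) {
            // Link behind the predecessor, then spin on our own node only.
            prev->next.store(&me, std::memory_order_release);
            while (me.locked.load(std::memory_order_acquire)) { /* spin */ }
        }
    }

    void unlock(McsNode& me) {
        McsNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No visible successor: try to swing the tail back to empty.
            McsNode* expected = &me;
            if (tail_.compare_exchange_strong(expected, nullptr,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
                return;
            }
            // A successor is in the middle of enqueueing; wait for its link.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) { /* spin */ }
        }
        succ->locked.store(false, std::memory_order_release);  // hand over the lock
    }

    bool is_free() const { return tail_.load(std::memory_order_acquire) == nullptr; }

private:
    std::atomic<McsNode*> tail_{nullptr};
};
```

Each thread passes its own McsNode (typically on its stack) to lock and the same node to unlock. The waiter list grows and shrinks one pointer at a time, which is what a fixed array of slots cannot express as simply.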

Does the order or syntax of allocate statement affect performance? (Fortran)

Because of performance issues when moving a code from static to dynamic allocation, I started to wonder how memory allocation is managed in Fortran code.
Specifically, in this question, I wonder whether the order or syntax used for the allocate statement makes any difference. That is, does it make any difference to allocate vectors like:
allocate(x(DIM),y(DIM))
versus
allocate(x(DIM))
allocate(y(DIM))
The syntax suggests that in the first case the program would allocate all the space for the vectors at once, possibly improving the performance, while in the second case it must allocate the space for one vector at a time, so that they could end up far from each other. If the syntax does not make any difference, I wonder whether there is a way to control that allocation (for instance, allocating one vector for all the space and using pointers to address the allocated space as multiple variables).
Finally, I notice now that I don't even know one thing: does an allocate statement guarantee that at least a single vector occupies a contiguous space in memory (or the best it can)?
From the point of view of the language standard, both ways of writing the allocation are valid. The compiler is free to allocate the arrays where it wants. It normally calls malloc() to allocate some piece of memory and makes the allocatable arrays from that piece.
Whether it might allocate a single piece of memory for two different arrays in a single allocate statement is up to the compiler, but I haven't heard about any compiler doing that.
I just verified that my gfortran just calls __builtin_malloc two times in this case.
Another issue is already pointed out by High Performance Mark. Even when malloc() successfully returns, the actual memory pages might still not be assigned. On Linux that happens when you first access the array.
I don't think it is too important if those arrays are close to each other in memory or not anyway. The CPU can cache arrays from different regions of address space if it needs them.
Is there a way to control the allocation? Yes, you can override malloc with your own allocator which does some clever things. It may be used to always have memory aligned to 32 bytes, or for similar purposes (example). Whether you will improve the performance of your code by allocating things close to each other is questionable, but you can give it a try. (Of course this is completely compiler-dependent; a compiler doesn't have to use malloc() at all, but mostly they do.) Unfortunately, this only works when the calls to malloc are not inlined.
There are (at least) two issues here, firstly the time taken to allocate the memory and secondly the locality of memory in the arrays and the impact of this on performance. I don't know much about the actual allocation process, although the links suggested by High Performance Mark and the answer by Vadimir F cover this.
From your question, it seems you are more interested in cache hits and the memory locality given by arrays being next to each other. I would guess there is no guarantee that either allocate statement places both arrays next to each other in memory. This is based on how arrays allocated within a derived type are described in the Fortran 2003 standard (May 2004 working draft, J3/04-007):
NOTE 4.20
Unless the structure includes a SEQUENCE statement, the use of this terminology in no way implies that these components are stored in this, or any other, order. Nor is there any requirement that contiguous storage be used.
From the discussion with Vadimir F, if you put allocatable arrays in a type and use the sequence keyword, e.g.
type botharrays
   SEQUENCE
   double precision, dimension(:), allocatable :: x, y
end type
this DOES NOT ensure they are allocated as adjacent in memory. For static arrays or lots of variables, a sequential type sounds like it may work like your idea of "allocating a vector for all space and using pointers to address the space allocated as multiple variables". I think common blocks (Fortran 77) allowed you to specify the relationship between memory location of arrays and variables in memory, but don't work with allocatable arrays either.
In short, I think this means you cannot ensure two allocated arrays are adjacent in memory. Even if you could, I don't see how this will result in a reduction in cache misses or improved performance. Even if you typically use the two together, unless the arrays are small enough that the cache will include multiple arrays in one read (assuming reads are allowed to go beyond array bounds) you won't benefit from the memory locality.
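The idea raised in the question, allocating one block for all the space and addressing parts of it as separate arrays, is language-independent. The sketch below shows it in C++ rather than Fortran, purely as an illustration; the type TwoArrays and its members are hypothetical names, and whether this layout helps performance at all is, as argued above, doubtful.

```cpp
#include <cstddef>
#include <vector>

// One backing allocation, viewed as two adjacent arrays x and y of length n.
struct TwoArrays {
    std::vector<double> pool;  // single contiguous block of 2 * n doubles
    double* x;                 // first half of the block
    double* y;                 // second half, starting right after x
    std::size_t n;

    explicit TwoArrays(std::size_t dim)
        : pool(2 * dim), x(pool.data()), y(pool.data() + dim), n(dim) {}

    // Copying would leave x and y pointing into the source object's pool.
    TwoArrays(const TwoArrays&) = delete;
    TwoArrays& operator=(const TwoArrays&) = delete;
};
```

Here y[0] is guaranteed to sit directly after x[n - 1], which two separate allocations cannot promise.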

Tracking address when writing to flash

My system needs to store data in an EEPROM flash. Strings of bytes will be written to the EEPROM one at a time, not continuously at once. The length of strings may vary. I want the strings to be saved in order without wasting any space by continuing from the last write address. For example, if the first string of bytes was written at address 0x00~0x08, then I want the second string of bytes to be written starting at address 0x09.
How can this be achieved? I found that some EEPROMs' write commands do not require the address to be specified and just continue from the last written point, but the EEPROM I am using does not support that. (I am using Spansion's S25FL1-K.) I thought about allocating part of the memory to track the address and storing the address every time I write, but that might wear out the flash faster. What is a widely used method to handle such a case?
Thanks.
EDIT:
What I am asking is how to track/save the address in a non-volatile way so that when next write happens, I know what address to start.
I have never worked with this particular flash, but I've implemented something similar. Unfortunately, without knowing your constraints and priorities (memory or CPU efficiency, how often writes happen, etc.) it is impossible to give a definite answer. Here are some techniques that you may want to consider. I don't know if they are widely used, though.
Option 1: Write X bytes containing the string length before each string. Then on initialization you can parse your flash: read the length n, jump n bytes forward, and read the next length field. If it is empty (all ones for your flash, according to the datasheet), you have found your first empty byte. Otherwise you have just read the length of the next string, so repeat the process.
This method allows you to quickly search for the last used sector, since the first byte of a used sector is guaranteed to have a value. The flip side is the overhead of X extra bytes (depending on the maximum string length) each time you write a string, and having to parse the flash to get the value (although this only needs to be done once, on boot).
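A minimal sketch of Option 1, assuming a one-byte length prefix (X = 1) and an in-memory array standing in for the flash. The function names are hypothetical, and on real hardware the array reads and writes would be replaced by the device's read and page-program commands. A length of 0xFF is rejected because it would be indistinguishable from erased flash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint8_t kErased = 0xFF;  // erased flash reads back as all ones

// Walk the length-prefixed records and return the offset of the first free
// byte, or `size` if the scan runs off the end of the region.
std::size_t find_next_free(const std::uint8_t* flash, std::size_t size) {
    assert(flash != nullptr);
    std::size_t pos = 0;
    while (pos < size && flash[pos] != kErased) {
        pos += 1 + static_cast<std::size_t>(flash[pos]);  // skip prefix + payload
    }
    return pos < size ? pos : size;
}

// Append one record (length byte followed by the data) at the first free
// offset. Returns false if the length is reserved or the record does not fit.
bool append_record(std::uint8_t* flash, std::size_t size,
                   const std::uint8_t* data, std::uint8_t len) {
    if (len == kErased) return false;  // 0xFF is reserved for "empty"
    const std::size_t pos = find_next_free(flash, size);
    if (pos + 1 + len > size) return false;
    flash[pos] = len;
    for (std::size_t i = 0; i < len; ++i) {
        flash[pos + 1 + i] = data[i];
    }
    return true;
}
```

In practice find_next_free would run once at boot and the result would be kept in RAM, so later writes do not need to rescan.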
Option 2: Instead of prepending the size, append a unique "end-of-string" sequence, and then parse on boot for the last such sequence before the ones that represent empty flash.
Disadvantage here is longer parse, but you possibly could get away with just 1 byte-long overhead for each string.
Option 3 would be just what you already thought of: allocating a separate sector that contains the value you need. To reduce flash wear you could write these values back-to-back and search for the last one each time you boot. Also, you might weigh the expected lifetime of the device that you program against the 100,000 erases that your flash can sustain (again according to the datasheet): is wear even a problem? That of course depends on how often data will be saved.
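A rough sketch of the back-to-back variant of Option 3, again with an in-memory array standing in for a dedicated flash sector and with hypothetical function names. Each new address is written into the next erased 32-bit slot, so the sector only needs to be erased once every slot has been used.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kErasedWord = 0xFFFFFFFFu;  // an unwritten slot

// Index of the first erased slot; equals `slots` when the log sector is full.
std::size_t first_free_slot(const std::uint32_t* log, std::size_t slots) {
    std::size_t i = 0;
    while (i < slots && log[i] != kErasedWord) ++i;
    return i;
}

// Most recently logged address, or 0 if nothing has been logged yet.
std::uint32_t latest_address(const std::uint32_t* log, std::size_t slots) {
    const std::size_t used = first_free_slot(log, slots);
    return used == 0 ? 0u : log[used - 1];
}

// Store a new address in the next free slot. Returns false when the sector
// is full (the caller must erase it first) or the value is the erased pattern.
bool log_address(std::uint32_t* log, std::size_t slots, std::uint32_t addr) {
    const std::size_t used = first_free_slot(log, slots);
    if (used == slots || addr == kErasedWord) return false;
    log[used] = addr;
    return true;
}
```

With N slots per sector, the sector is erased once per N updates instead of once per update, which stretches the erase budget by roughly that factor.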
Hope that helps.

boost lockfree spsc_queue cache memory access

I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where the pointer of a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to access the object itself, does it not need to dereference the pointer, finding and loading the object either from the L3 cache or across the interconnect (if the other thread is on a different CPU socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ is principally a pay-for-what-you-need ecosystem.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
Enter Ringbuffers
Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. If you treat the array as a circular buffer, you only have to maintain the indices of the buffer's front and end positions atomically. The container could, at its leisure, update an internal 'dirty front index' while it copies ahead of the external front. (The copy can use relaxed memory ordering.) Only once the whole copy is known to have completed is the external front index updated. This update needs to use acq_rel/seq_cst memory order[1].
As long as the container is able to guard the invariant that the front never fully wraps around and reaches back, this is a sweet deal. I think this idea was popularized in the Disruptor Library (of LMAX fame). You get mechanical sympathy from
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple of) physical cache lines
all the data is local unless the POD contains raw references outside that record
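The scheme described above can be sketched as a small single-producer/single-consumer ring buffer. This is an illustrative minimal version, not Boost's implementation: the class name SpscRing, the 64-byte alignment used to keep the two indices on separate cache lines, and the choice to leave one slot unused (so the usable capacity is N - 1) are all assumptions made for the example.

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer: exactly one thread may call push and exactly one
// other thread may call pop. T must be default-constructible and copyable.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2, "one slot is kept empty to tell full from empty");

public:
    bool push(const T& value) {
        const std::size_t w = write_.load(std::memory_order_relaxed);
        const std::size_t next = (w + 1) % N;
        if (next == read_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        buf_[w] = value;  // plain copy into the flat array
        // Publish the element only after the copy has completed.
        write_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {
        const std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = buf_[r];
        // Release the slot back to the producer once the copy is done.
        read_.store((r + 1) % N, std::memory_order_release);
        return true;
    }

private:
    T buf_[N];
    // Assumed 64-byte cache lines; keeps producer and consumer indices apart.
    alignas(64) std::atomic<std::size_t> write_{0};
    alignas(64) std::atomic<std::size_t> read_{0};
};
```

Because each index has a single writer, a release store paired with an acquire load on the other side is enough here; no compare-and-swap is needed.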
How Does Boost's spsc_queue Actually Do This?
Yes, spsc_queue stores the raw element values in a contiguous aligned block of memory (e.g. from compile_time_sized_ringbuffer, which underlies spsc_queue with a statically supplied maximum capacity):
typedef typename boost::aligned_storage<max_size * sizeof(T),
                                        boost::alignment_of<T>::value
                                       >::type storage_type;
storage_type storage_;

T * data()
{
    return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there is only the "internal" index on either the read or the write side. This is possible because there is only one writing thread and only one reading thread, which means that there can only be more space at the end of a write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::uninitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the uninitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)
All in all, we see a best-in-class design for a ring buffer
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing frequently used data right there in the value, and referencing the less-often-used data using a pointer (the cost of the pointer will be low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can achieve favorable memory access patterns there too.
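As a sketch of that "hot data in the value, cold data behind a pointer" layout, the element type below is padded to one cache line. The field names, the Attachment type and the 64-byte line size are all assumptions made for illustration; the actual fields depend on your messages.

```cpp
#include <cstdint>
#include <type_traits>

struct Attachment;  // hypothetical: rarely used, possibly large payload

// Element type sized and aligned to one (assumed 64-byte) cache line.
struct alignas(64) Message {
    std::uint64_t sequence;        // hot: read by the consumer on every pop
    std::uint64_t timestamp_ns;    // hot
    double        price;           // hot
    std::uint32_t quantity;        // hot
    std::uint32_t kind;            // hot
    const Attachment* attachment;  // cold: followed only when needed
};

static_assert(sizeof(Message) == 64, "Message should occupy one cache line");
static_assert(std::is_trivially_copyable<Message>::value,
              "trivially copyable elements can be copied with memcpy");
```

An element like this can be stored by value in the queue, so a pop touches a single cache line unless the consumer actually dereferences attachment.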
Let your profiler guide you.
[1] I suppose that for spsc, acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend that everyone else follow my example :)

Atomically read/write int value w/o additional operation on the int value itself

GCC offers a nice set of built-in functions for atomic operations. And being on MacOS or iOS, even Apple offers a nice set of atomic functions. However, all these functions perform an operation, e.g. an addition/subtraction, a logical operation (AND/OR/XOR) or a compare-and-set/compare-and-swap. What I am looking for is a way to atomically assign/read an int value, like:
int a;
/* ... */
a = someVariable;
That's all. a will be read by another thread and it is only important that a either has its old value or its new value. Unfortunately the C standard does not guarantee that assigning or reading a value is an atomic operation. I remember that I once read somewhere that writing or reading a value to a variable of type int is guaranteed to be atomic in GCC (regardless of the size of int), but I searched everywhere on the GCC homepage and I cannot find this statement any longer (maybe it was removed).
I cannot use sig_atomic_t because sig_atomic_t has no guaranteed size and it might also have a different size than int.
Since only one thread will ever "write" a value to a, while both threads will "read" the current value of a, I don't need to perform the operations themselves in an atomic manner, e.g.:
/* thread 1 */
someVariable = atomicRead(a);
/* Do something with someVariable, non-atomic, when done */
atomicWrite(a, someVariable);
/* thread 2 */
someVariable = atomicRead(a);
/* Do something with someVariable, but never write to a */
If both threads were going to write to a, then all operations would have to be atomic, but that way, this may only waste CPU time; and we are extremely low on CPU resources in our project. So far we use a mutex around read/write operations of a and even though the mutex is held for such a tiny amount of time, this already causes problems (one of the threads is a realtime thread and blocking on a mutex causes it to fail its realtime constraints, which is pretty bad).
Of course I could use __sync_fetch_and_add to read the variable (simply adding "0" so as not to modify its value) and __sync_val_compare_and_swap to write it (as I know its old value, passing that in will make sure the value is always exchanged), but won't this add unnecessary overhead?
A __sync_fetch_and_add with a 0 argument is indeed the best bet if you want your load to be atomic and act as a memory barrier. Similarly, you can use an and with 0 or an or with -1 to store 0 and -1 atomically with a memory barrier. For writing, you can use __sync_test_and_set (actually an xchg operation) if an "acquire" barrier is enough, or if using Clang you can use __sync_swap (which is an xchg operation with a full barrier).
However, in many cases that's overkill and you may prefer to add memory barriers manually. If you do not want the memory barrier, you can use a volatile access to atomically read/write a variable that is aligned and no wider than a word:
#define __sync_access(x) (*(volatile __typeof__(x) *) &(x))
(This macro is an lvalue, so you can also use it for a store like __sync_access(x) = 0). The macro implements the same semantics as the C++11 memory_order_consume form, but only under two assumptions:
that your machine has coherent caches; if not, you need a memory barrier or global cache flush before the load (or before the first of a group of loads).
that your machine is not a DEC Alpha. The Alpha had very relaxed semantics for reordering memory accesses, so on it you'd need a memory barrier after the load (and after each load in a group of loads). On the Alpha the above macro only provides memory_order_relaxed semantics. BTW, the first versions of the Alpha couldn't even store a byte atomically (only a word, which was 8 bytes).
In either case, the __sync_fetch_and_add would work. As far as I know, no other machine imitated the Alpha so neither assumption should pose problems on current computers.
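If a C++11 (or C11) compiler is available, which the question does not state, the same single-writer pattern can be expressed with the standard atomic types instead of the __sync builtins or a volatile cast. The sketch below is one way to write the atomicRead/atomicWrite helpers from the question; the names atomic_read and atomic_write and the choice of acquire/release ordering are assumptions for illustration.

```cpp
#include <atomic>

// Shared between the writer thread and the reader threads.
std::atomic<int> shared_value{0};

// Plain atomic load: returns either the old or the new value, never a mix.
int atomic_read(const std::atomic<int>& a) {
    return a.load(std::memory_order_acquire);
}

// Plain atomic store: no read-modify-write operation is involved.
void atomic_write(std::atomic<int>& a, int value) {
    a.store(value, std::memory_order_release);
}
```

On common platforms a load or store of std::atomic<int> compiles to an ordinary move plus whatever barrier the chosen ordering requires; memory_order_relaxed can be used instead if no ordering with other data is needed. shared_value.is_lock_free() reports whether the implementation avoids an internal lock.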
Volatile, aligned, word-sized reads/writes are atomic on most platforms. Checking your assembly would be the best way to find out if this is true on your platform. Atomic registers cannot produce nearly as many interesting wait-free structures as more complicated mechanisms like compare-and-swap, which is why the latter are provided.
See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.5659&rank=3 for the theory.
Regarding __sync_fetch_and_add with a 0 argument: this seems like the safest bet. If you're worried about efficiency, profile the code and see if you're meeting your performance targets. You may be falling victim to premature optimization.
