is the overhead of PTHREAD_MUTEX_ERRORCHECK mutex type significant over PTHREAD_MUTEX_NORMAL types? - pthreads

Am trying to understand if the overhead caused by mutex checks for mutex attribute type PTHREAD_MUTEX_ERRORCHECK is significant over the PTHREAD_MUTEX_NORMAL type. I mean, are there any benchmark results or any data available to make this decision in our implementation.
Any references are greatly appreciated.

Related

Would there be a practical application for a more memory efficient boolean?

I've noticed that booleans occupy a whole byte, despite only needing 1 bit. I was wondering whether we could have something like
struct smartbool{char data;}
, which would store 8 booleans at once.
I am aware that it would take more time to retrieve data, although would the tradeoff be a practical application in some scenarios?
Am I missing something about the memory usage of booleans?
Normally variables are aligned on word boundaries, memory use is balanced against efficiency of access. For one-off boolean variables it may not make sense to store them in a denser form.
If you do need a bunch of booleans you can use things like this BitSet data structure: https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/util/BitSet.html.
There is a type of database index that stores booleans efficiently:
https://en.wikipedia.org/wiki/Bitmap_index. The less space an index takes up the easier it is to keep in memory.
There are already widely used data types that support multiple booleans, they are called integers. you can store and retrieve multiple booleans in an integral type, using bitwise operations, screening out the bits you don't care about with a pattern of bits called a bitmask.
This sort of "packing" is certainly possible and sometimes useful, as a memory-saving optimization. Many languages and libraries provide a way to make it convenient, e.g. std::vector<bool> in C++ is meant to be implemented this way.
However, it should be done only when the programmer knows it will happen and specifically wants it. There is a tradeoff in speed: if bits are used, then setting / clearing / testing a specific bool requires first computing a mask with an appropriate shift, and setting or clearing it now requires a read-modify-write instead of just a write.
And there is a more serious issue in multithreaded programs. Languages like C++ promise that different threads can freely modify different objects, including different elements of the same array, without needing synchronization or causing a data race. For instance, if we have
bool a, b; // not atomic
void thread1() { /* reads and writes a */ }
void thread2() { /* reads and writes b */ }
then this is supposed to work fine. But if the compiler made a and b two different bits in the same byte, concurrent accesses to them would be a data race on that byte, and could cause incorrect behavior (e.g. if the read-modify-writes being done by the two threads were interleaved). The only way to make it safe would be to require that both threads use atomic operations for all their accesses, which are typically many times slower. And if the compiler could freely pack bools in this way, then every operation on a potentially shared bool would have to be made atomic, throughout the entire program. That would be prohibitively expensive.
So this is fine if the programmer wants to pack bools to save memory, is willing to take the hit to speed, and can guarantee that they won't be accessed concurrently. But they should be aware that it's happening, and have control over whether it does.
(Indeed, some people feel that having C++ provide this with vector<bool> was a mistake, since programmers have to know that it is a special exception to the otherwise general rule that vector<T> behaves like an array of T, and different elements of the vector can safely be accessed concurrently. Perhaps they should have left vector<bool> to work in the naive way, and given a different name to the packed version, similar to std::bitset.)

When to use #atomic?

I've already seen this question:
What's the difference between the atomic and nonatomic attributes?
I understand that #atomic does not guarantee thread safe, and I have to use other mechanisms (e.g. #synchronized) to realize that. Based on that, I don't still know EXACTLY when to use #atomic attribute. I'd like to know USE CASE of using #atomic alone.
The typical use-case for atomic properties is when dealing with a primitive data type across multiple threads. For example, let's say you have some background thread doing some processing and you have some BOOL state property, e.g. isProcessComplete and your main thread wants to check to see if the background process is complete:
if (self.isProcessComplete) {
// Do something
}
In this case, declaring this property as atomic allows us to use/update this property across multiple threads without any more complicated synchronization mechanism because:
we're dealing with a scalar, primitive data type, e.g. BOOL;
we declared it to be atomic; and
we're using the accessor method (e.g. self.) rather than accessing the ivar directly.
When dealing with objects or other more complicated situations, atomic generally is insufficient. As you point out, in practice, atomic, alone, is rarely sufficient to achieve thread safety, which is why we don't use it very often. But for simple, stand-alone, primitive data types, atomic can be an easy way to ensure safe access across multiple threads.
#atomic will guarantee You, that the value, that You will receive will not be gibberish. A possible situation is to read a given value from one thread and to set its value from another. Then the #atomic keyword will ensure that You receive a whole value. Important thing to now is that the value that You get is not guaranteed to be the one that has been set most recently.
The second part of the Your question, about the use case is purely circumstantial, depending on the implementation the intentions. For example if You have some kind of time based update on a list every second or so, You could use atomic to ensure that You will get whole values, and the timer that updates Your list can ensure that You will have the latest data on screen or under the hood in some implicit logic.
EDIT: After a remark from #Rob I saw the need of paraphrasing the last part of my answer. In most cases atomic is not enough to get the job done. And there is a need of a better solution, like #synchronized.
atomic guarantees that competing threads will get/set whole values, whether you're setting a primitive type or a pointer to an object. With nonatomic access, it's possible to get a torn read or write, especially if you're dealing with 64-bit values.
atomic also guarantees that an object pointer won't be garbage when you retrieve it:
https://github.com/opensource-apple/objc4/blob/master/runtime/objc-accessors.mm#L58-L61.
That being said, if you're pointing to a mutable object, atomic won't offer you any guarantees about the object's state.
atomic is appropriate if you're dealing with:
Immutable objects
Primitives (like int, char, float)
atomic is inappropriate if:
The object or any of its descendant properties are mutable
If you need thread safety for a primitive value or immutable object, this is one of the fastest ways to guard against torn reads/writes that you can find.

Questions about CUDA memory

I am quite new to CUDA programming and there are some stuff about the memory model that are quite unclear to me. Like, how does it work? For example if I have a simple kernel
__global__ void kernel(const int* a, int* b){
some computation where different threads in different blocks might
write at the same index of b
}
So I imagine a will be in the so-called constant memory. But what about b? Since different threads in different blocks will write in it, how will it work? I read somewhere that it was guaranteed that in the case of concurrent writes in global memory by different threads in the same block at least one would be written, but there's no guarantee about the others. Do I need to worry about that, ie for example have every thread in a block write in shared memory and once they are all done, have one write it all to the global memory? Or is CUDA taking care of it for me?
So I imagine a will be in the so-called constant memory.
Yes, a the pointer will be in constant memory, but not because it is marked const (this is completely orthogonal). b the pointer is also in constant memory. All kernel arguments are passed in constant memory (except in CC 1.x). The memory pointed-to by a and b could, in theory, be anything (device global memory, host pinned memory, anything addressable by UVA, I believe). Where it resides is chosen by you, the user.
I read somewhere that it was guaranteed that in the case of concurrent writes in global memory by different threads in the same block at least one would be written, but there's no guarantee about the others.
Assuming your code looks like this:
b[0] = 10; // Executed by all threads
Then yes, that's a (benign) race condition, because all threads write the same value to the same location. The result of the write is defined, however the number of writes is unspecified and so is the thread that does the "final" write. The only guarantee is that at least one write happens. In practice, I believe one write per warp is issued, which is a waste of bandwidth if your blocks contain more than one warp (which they should).
On the other hand, if your code looks like this:
b[0] = threadIdx.x;
This is plain undefined behavior.
Do I need to worry about that, ie for example have every thread in a block write in shared memory and once they are all done, have one write it all to the global memory?
Yes, that's how it's usually done.

Which is thread Safe atomic or non-atomic?

After reading this answer I am very confused.
Some says atomic is thread safe and some are saying nonatomic is thread safe.
What is the exact answer of this.
Thread unsafety of operations is caused by the fact that an operation can be divided into several suboperations, for example:
a = a + 1
can be subdivided into operations
load value of a
add 1 to the loaded value
assign the calculated value to a.
The word "Atomic" comes from "atom" which comes from greek "atomos", which means "that which can't be split". For an operation it means that it is always performed as a whole, it is never performed one suboperation at a time. That's why it is thread safe.
TL;DR Atomic = thread safe.
Big warning: Having properties atomic does not mean that a whole function/class is thread safe. Atomic properties means only that operations with the given properties are thread safe.
Obviously, nonatomic certainly isn’t thread safe. More interestingly, atomic is closer, but it is insufficient to achieve “thread safety”, as well. To quote Apple’s Programming with Objective-C: Encapsulating Data:
Note: Property atomicity is not synonymous with an object’s thread safety.
It goes on to provide an example:
Consider an XYZPerson object in which both a person’s first and last names are changed using atomic accessors from one thread. If another thread accesses both names at the same time, the atomic getter methods will return complete strings (without crashing), but there’s no guarantee that those values will be the right names relative to each other. If the first name is accessed before the change, but the last name is accessed after the change, you’ll end up with an inconsistent, mismatched pair of names.
This example is quite simple, but the problem of thread safety becomes much more complex when considered across a network of related objects. Thread safety is covered in more detail in Concurrency Programming Guide.
Also see bbum’s Objective-C: Atomic, properties, threading and/or custom setter/getter.
The reason this is so confusing is that, in fact, the atomic keyword does ensure that your access to that immediate reference is thread safe. Unfortunately, when dealing with objects, that’s rarely sufficient. First, you have no assurances that the property’s own internal properties are thread safe. Second, it doesn’t synchronize your app’s access to the object’s individual properties (such as Apple’s example above). So, atomic is almost always insufficient to achieve thread safety, so you generally have to employ some higher-level degree of synchronization. And if you provide that higher-level synchronization, adding atomicity to that mix is redundant and inefficient.
So, with objects, atomic rarely has any utility. It can be useful, though, when dealing primitive C data types (e.g. integers, booleans, floats). For example, you might have some boolean that might be updated in some other thread indicating whether that thread’s asynchronous task is completed. This is a perfect use case for atomic.
Otherwise, we generally reach for higher-level synchronization mechanisms for thread safety, such as GCD serial queues or reader-writer pattern (... or, less common nowadays, locks, the #synchronized directive, etc.).
As you can read at developer.apple you should use atomic functions for thread savety.
You can read more about atomic functions here: Atomic Man page
In short:
Atomic ~ not splittable ~ not shared by threads
As is mentioned in several answers to the posted question, atomic is thread safe. This means that getter/setter working on any thread should finish first, before any other thread can perform getter/setter.

boost lockfree spsc_queue cache memory access

I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where the pointer of a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to access the element, does it not need to access the pointer itself by finding and loading the element either from either the L3 cache or across the interconnect (if the other thread is on a different cpu socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ principally a pay-for-what-you-need eco-system.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
Enter Ringbuffers
Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, one would just have to maintain the index of the buffer front and end positions atomically. The container could, at leisure update in internal 'dirty front index' while it copies ahead of the external front. (The copy can use relaxed memory ordering). Only as soon as the whole copy is known to have completed, the external front index is updated. This update needs to be in acq_rel/cst memory order[1].
As long as the container is able to guard the invariant that the front never fully wraps around and reaches back, this is a sweet deal. I think this idea was popularized in the Disruptor Library (of LMAX fame). You get mechanical resonance from
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple) physical cache lines
all the data is local unless the POD contains raw references outside that record
How Does Boost's spsc_queue Actually Do This?
Yes, spqc_queue stores the raw element values in a contiguous aligned block of memory: (e.g. from compile_time_sized_ringbuffer which underlies spsc_queue with statically supplied maximum capacity:)
typedef typename boost::aligned_storage<max_size * sizeof(T),
boost::alignment_of<T>::value
>::type storage_type;
storage_type storage_;
T * data()
{
return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there are only the "internal" index on either read or write side. This is possible because there's only one writing thread and also only one reading thread, which means that there could only be more space at the end of write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::unitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the unitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)
All in all, we see a best-in-class possible idea for a ringbuffer
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing frequently used data right there in the value, and referencing the less-of-used data using a pointer (the cost of the pointer will be low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can achieve memory access patterns there too.
Let your profiler guide you.
[1] I suppose say for spsc acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend anyone else to follow my example :)

Resources