What's the use case for passing a semaphore a value greater than zero? - ios

Most of the articles I've found use a semaphore to wait for a task to finish before doing the rest of the work. In that case the semaphore always starts with a value of 0, and the value is then incremented (signaled) in the completion handler.
However, according to documentation:
Passing a value greater than zero is useful for managing a finite pool of resources, where the pool size is equal to the value
I've tried to find an example for this case, but everything I've found talks about using dispatch_group_wait to wait for multiple tasks to finish. Would someone please give me an example and a scenario for using a semaphore with an initial value greater than 0?
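One scenario, sketched here purely for illustration (not taken from the thread): a semaphore created with a value of N acts as a pool of N slots, so at most N tasks can use a scarce resource (network connections, file handles, decoder instances) at the same time. The download scenario and the pool size of 4 below are made up for illustration; the code uses GCD's C API and assumes it is compiled with blocks support (e.g. as Objective-C++ or with -fblocks).

#include <dispatch/dispatch.h>
#include <stdio.h>

int main(void)
{
    // A pool of 4 slots: at most 4 tasks may use the resource concurrently.
    dispatch_semaphore_t pool = dispatch_semaphore_create(4);
    dispatch_queue_t queue = dispatch_get_global_queue(QOS_CLASS_DEFAULT, 0);
    dispatch_group_t group = dispatch_group_create();

    for (int i = 0; i < 20; i++) {
        dispatch_group_async(group, queue, ^{
            // Block until one of the 4 slots is free.
            dispatch_semaphore_wait(pool, DISPATCH_TIME_FOREVER);
            printf("task %d is using a slot\n", i);
            // ... perform the download / use the scarce resource here ...
            dispatch_semaphore_signal(pool);   // give the slot back to the pool
        });
    }
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
    return 0;
}

Note that the semaphore here never signals "a task finished"; it only limits how many tasks run at once, which is exactly the "finite pool of resources" case from the documentation quote.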

Related

How to implement a lock and unlock sequence in a Metal shader?

How should I implement a lock/unlock sequence with compare-and-swap using a Metal compute shader?
I’ve tested this sample code but it does not seem to work. For some reason, the threads are not detecting that the lock was released.
Here is a brief explanation of the code below:
The depthFlag is an array of atomic_bools. In this simple example, I try to take a lock by comparing and swapping the contents of depthFlag[1]. I then go ahead and do my operation, and once the operation is done, I unlock.
As stated above, only one thread is able to do the locking/work/unlocking but the rest of the threads get stuck in the while loop. They NEVER leave. I expect another thread to detect the unlock and go through the sequence.
What am I doing wrong? My knowledge on CAS is limited, so I appreciate any tips.
kernel void testFunction(device float *depthBuffer [[buffer(4)]],
                         device atomic_bool *depthFlag [[buffer(5)]],
                         uint index [[thread_position_in_grid]])
{
    // lock
    bool expected = false;
    while (!atomic_compare_exchange_weak_explicit(&depthFlag[1], &expected, true,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
    {
        // wait
        expected = false;
    }

    // Do my operation here

    // unlock
    atomic_store_explicit(&depthFlag[1], false, memory_order_relaxed);

    // barrier
}
You essentially can't use the locking programming model for GPU concurrency. For one, the relaxed memory order model (the only one available) is not suitable for this; for another, you can't guarantee that other threads will make progress between your atomic operations. Your code must always be able to make progress, regardless of what the other threads are doing.
My recommendation is that you use something like the following model instead:
Read atomic value to check if another thread has already completed the operation in question.
If no other thread has done it yet, perform the operation. (But don't cause any side effects, i.e. don't write to device memory.)
Perform an atomic operation to indicate your thread has completed the operation while checking whether another thread got there first. (e.g. compare-and-swap a boolean, but increasing a counter also works)
If another thread got there first, don't perform side effects.
If your thread "won" and no other thread registered completion, perform your operation's side effects, e.g. do whatever you need to do to write out the result etc.
This works well if there's not much competition, and if the result does not vary depending on which thread performs the operation.
The occasional discarded work should not matter. If there is significant competition, use thread groups; within a thread group, the threads can coordinate which thread will perform which operation. You may still end up with wasted computation from competition between groups. If this is a problem, you may need to change your approach more fundamentally.
If the results of the operation are not deterministic, and the threads all need to proceed using the same result, you will need to change your approach. For example, split your kernels up so any computation which depends on the result of the operation in question runs in a sequentially queued kernel.
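To make the shape of that pattern easier to see, here is a rough CPU-side analogue using C++ std::atomic. This is only an illustration of the claim-then-commit idea, not Metal code; on the GPU you would express the same structure with the relaxed device atomics from the question and write the winner's result to device memory.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// All threads compute the same (deterministic) result; only the thread that
// wins the compare-and-swap publishes it. Nobody ever waits on anyone else.
std::atomic<bool> claimed{false};
int result = 0;                        // written only by the winning thread

void worker()
{
    if (claimed.load(std::memory_order_acquire))
        return;                        // another thread already finished the work

    int local = 6 * 7;                 // do the operation with no side effects yet

    bool expected = false;
    if (claimed.compare_exchange_strong(expected, true,
                                        std::memory_order_acq_rel)) {
        result = local;                // we won: perform the side effect
    }
    // losers simply discard `local`; their work is wasted but harmless
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i) threads.emplace_back(worker);
    for (auto &t : threads) t.join();
    std::printf("result = %d\n", result);  // written exactly once
    return 0;
}

The important property is that no thread ever spins waiting on another: each thread either sees the work already claimed or claims it itself, so forward progress never depends on another thread being scheduled.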

Atomic property and usage

I have read many stackoverflow answers for this like Is an atomic property thread safe?, When to use #atomic? or Atomic properties vs thread-safe in Objective-C but I have question for this:
Please correct me if I am wrong. Say I am using a count variable declared with the atomic property attribute, and currently its value is 5.
It is accessed by two threads: the first thread increasing the count value by 2 and the second thread decreasing it by 1. According to my understanding this goes sequentially, i.e. when the first thread has increased the value to 5 + 2 = 7, only then can the second thread access the count variable and decrease it by 1, giving 7 - 1 = 6?
First thread that is increasing count value by 2
That is not an atomic operation, and atomic in no way helps you. This is (at least) three separate atomic operations:
read value
increase value
write value
This is the classic multi-writer race condition. Another thread might read between "read value" and "write value." In your example the final result could be 4, such that the increase operation is completely lost (A reads 5, B reads 5, A +2, A writes 7, B -1, B writes 4).
The problem that atomic is meant to solve is that "read value" and "write value" aren't even atomic operations in many platform-specific situations. The above may actually be 5 operations such as:
read lower word
read upper word
increase value
write lower word
write upper word
Without atomic, another thread might read between "write lower word" and "write upper word" and get a garbage value that was never written (half of one value and half of another). By using atomic, you ensure that readers will always receive a "legitimate" value (one that was written at some point). But that's not much of a promise.
But as is noted in the questions you provide, if you need to make reads and writes atomic, you almost certainly need more than that, because you also want to make "increase the value" atomic. So in practice atomic is rarely useful. If you don't need it, it's slow. If you do need it, it's probably insufficient.
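The question is about an Objective-C atomic property, but the lost-update race is easiest to see in a small self-contained program. The following C++ sketch (my illustration, not code from the question) shows the same idea: each load and store is atomic, yet the combined read-modify-write is not, unless you use an atomic RMW operation.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> count{5};

int main()
{
    // Each individual load and store of `count` is atomic, but
    // "count = count + 2" is still read / add / write: another thread
    // can interleave between the read and the write.
    std::thread a([] { count = count + 2; });   // racy read-modify-write
    std::thread b([] { count = count - 1; });   // racy read-modify-write
    a.join();
    b.join();
    std::printf("racy result: %d\n", count.load());       // may be 4, 6 or 7

    // To make the whole increment one atomic step, use an atomic RMW:
    count = 5;
    std::thread c([] { count.fetch_add(2); });
    std::thread d([] { count.fetch_sub(1); });
    c.join();
    d.join();
    std::printf("atomic RMW result: %d\n", count.load()); // always 6
    return 0;
}

In Objective-C terms, an atomic property gives you the first guarantee (no torn reads or writes), not the second; for compound operations you need explicit synchronization or atomic primitives.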
An atomic property where two threads access the same object to increase or decrease a value: you can understand this as both threads seeing the same whole value, 5, but while the first thread is updating the value it is locked, so no other thread can update it at the same time; the second thread, which also holds the value 5, can only update it once the first thread has finished its update.
OK, but threads might not execute sequentially; the first thread created might run after the second one. If the threads execute in the order you describe, the behavior you mention is fine, but I think you have a misconception about thread safety.
I encourage you to read more in the Concurrency Programming Guide and about thread safety.

Do I have to read NSOperationQueue's operationCount on the operation queue itself?

Sometimes it seems operationCount does not return the right value. Do I have to access it from the queue itself, or does it not matter if I access it from another thread?
"the value returned by this property reflects the instantaneous number of operations at the time the property was accessed."
It isn't guaranteed to be precise or stable, and you should NOT use it for calculations or decisions.

Is a read on an atomic variable guaranteed to acquire the current value of it in C++11?

It is known that the modifications of a single atomic variable form a total order. Suppose we have an atomic read of some atomic variable v at wall-clock time T. Is this read then guaranteed to acquire the current value of v, i.e. the one written by the last write in the modification order of v before time T? To put it another way: if an atomic write happens before an atomic read in wall-clock time, and there are no other writes in between, is the read guaranteed to return the value just written?
My accepted answer is the 6th comment made by Cubbi to his answer.
Wall-clock time is irrelevant. However, what you're describing sounds like the write-read coherence guarantee:
§1.10 [intro.multithread]/20
If a side effect X on an atomic object M happens before a value computation B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M.
(translating the standardese, "value computation" is a read, and "side effect" is a write)
In particular, if your relaxed write and your relaxed read are in different statements of the same function, they are connected by a sequenced-before relationship, therefore they are connected by a happens-before relationship, therefore the guarantee holds.
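To illustrate that last point with a tiny sketch (added here for clarity, not part of the quoted answer): when the relaxed write and the relaxed read are in the same thread, sequenced-before already gives you the coherence guarantee.

#include <atomic>
#include <cassert>

std::atomic<int> v{0};

int main()
{
    v.store(1, std::memory_order_relaxed);      // side effect X on v
    int r = v.load(std::memory_order_relaxed);  // value computation B, sequenced after X
    // Write-read coherence: B must take its value from X or from a write
    // later in v's modification order, so r cannot be the initial 0 here.
    assert(r == 1);   // no other thread writes v in this sketch
    return r;
}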
Depends on the memory order which you can specify for the load() operation.
By default, it is std::memory_order_seq_cst and the answer is yes, it guarantees the current value stored by another thread (if stored at all, i.e. it must use std::memory_order_release memory order at least, otherwise the store visibility is not guaranteed).
But if you specify std::memory_order_relaxed for the load operation, the documentation says: "Relaxed ordering: there are no synchronization or ordering constraints; only atomicity is required of this operation." I.e., the program could end up not reading from memory at all.
Is a read on an atomic variable guaranteed to acquire the current value of it
No
Even though each atomic variable has a single modification order (which is observed by all threads), that does not mean that all threads observe modifications at the same time scale.
Consider this code:
std::atomic<int> g{0};
// thread 1
g.store(42);
// thread 2
int a = g.load();
// do stuff with a
int b = g.load();
A possible outcome is:
thread 1: 42 is stored at time T1
thread 2: the first load returns 0 at time T2
thread 2: the store from thread 1 becomes visible at time T3
thread 2: the second load returns 42 at time T4.
This outcome is possible even though the first load at time T2 occurs after the store at T1 (in clock time).
The standard says:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
It does not require a store to become visible right away and it even allows room for a store to remain invisible (e.g. on systems without cache-coherency).
In that case, an atomic read-modify-write (RMW) is required to access the last value.
Atomic read-modify-write operations shall always read the last value (in the modification order) written
before the write associated with the read-modify-write operation.
Needless to say, RMWs are more expensive to execute (they lock the bus), which is why a regular atomic load is allowed to return an older (cached) value.
If a regular load were required to return the last value, performance would be horrible while there would be hardly any benefit.
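If you genuinely need the latest value in the modification order, one (expensive) option is to read it with a read-modify-write that leaves the value unchanged. A small sketch of that idea (my illustration, not from the answer):

#include <atomic>

std::atomic<int> g{0};

int read_latest()
{
    // fetch_add(0) is a read-modify-write, so per the quoted rule it must
    // observe the last value in g's modification order, at the cost of a full RMW.
    return g.fetch_add(0, std::memory_order_acq_rel);
}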

Mutexes are needed to protect condition variables

It is said that a mutex is needed to protect a condition variable.
Is the reference here to the actual condition variable declared as pthread_cond_t,
OR
a normal shared variable count whose value decides the signaling and waiting?
Is the reference here to the actual condition variable declared as pthread_cond_t, or to a normal shared variable count whose value decides the signaling and waiting?
The reference is to both.
The mutex makes it so that the shared variable (count in your question) can be checked, and, if the value of that variable doesn't meet the desired condition, the wait performed inside pthread_cond_wait() will occur atomically with respect to that check.
The problem being solved with the mutex is that you have two separate operations that need to be atomic:
check the condition of count
wait inside of pthread_cond_wait() if the condition isn't met yet.
A pthread_cond_signal() doesn't 'persist' - if there are no threads waiting on the pthread_cond_t object, a signal does nothing. So if there wasn't a mutex making the two operations listed above atomic with respect to one another, you could find yourself in the following situation:
Thread A wants to do something once count is non-zero
Thread B will signal when it increments count (which will set count to something other than zero)
thread "A" checks count and finds that it's zero
before "A" gets to call pthread_cond_wait(), thread "B" comes along and increments count to 1 and calls pthread_cond_signal(). That call actually does nothing of consequence since "A" isn't waiting on the pthread_cond_t object yet.
"A" calls pthread_cond_wait(), but since condition variable signals aren't remembered, it will block at this point and wait for the signal that has already come and gone.
The mutex (as long as all threads are following the rules) makes it so that item #2 cannot occur between items 1 and 3. The only way that thread "B" will get a chance to increment count is either before A looks at count or after "A" is already waiting for the signal.
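For concreteness, here is a minimal sketch of the pattern this answer describes for the count scenario (my own illustration, not code from the answer). Because the check of count and the entry into pthread_cond_wait() both happen while the mutex is held, the signalling step cannot slip in between them:

#include <pthread.h>

int count = 0;                          // the shared condition data
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

// Thread A: wait until count is non-zero
void waitForWork()
{
    pthread_mutex_lock(&mtx);
    while (count == 0) {                 // re-check the condition after every wakeup
        pthread_cond_wait(&cond, &mtx);  // atomically releases mtx while waiting
    }
    // ... count is non-zero here, and mtx is held again ...
    pthread_mutex_unlock(&mtx);
}

// Thread B: increment count and signal
void produceWork()
{
    pthread_mutex_lock(&mtx);            // cannot run between A's check and A's wait
    count++;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mtx);
}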
A condition variable must always be associated with a mutex, to avoid the race condition where a thread prepares to wait on a condition variable and another thread signals the condition just before the first thread actually waits on it.
Some sample code:
Thread 1 (Waits for the condition)
pthread_mutex_lock(cond_mutex);
while (i < 5)
{
    pthread_cond_wait(cond, cond_mutex);
}
pthread_mutex_unlock(cond_mutex);
Thread 2 (Signals the condition)
pthread_mutex_lock(cond_mutex);
i++;
if (i >= 5)
{
    pthread_cond_signal(cond);
}
pthread_mutex_unlock(cond_mutex);
As you can see in the sample above, the mutex protects the variable 'i', which determines the condition. When we see that the condition is not met, we go into a condition wait, which implicitly releases the mutex, thereby allowing the signalling thread to acquire the mutex, work on 'i', and avoid a race condition.
Now, as per your question: if the signalling thread signals first, it should have acquired the mutex before doing so. Otherwise the first thread might check the condition, see that it is not met, and go into the condition wait; since the second thread has already signalled, nothing will signal it thereafter and the first thread will keep waiting forever. So, in this sense, the mutex is for both the condition and the condition variable.
Per the pthreads docs, the reason the mutex was not separated from the condition wait is that combining them allows a significant performance improvement, and they expected that, because of the common race conditions you get without a mutex, it was almost always going to be used anyway.
https://linux.die.net/man/3/pthread_cond_wait
Features of Mutexes and Condition Variables
It had been suggested that the mutex acquisition and release be
decoupled from condition wait. This was rejected because it is the
combined nature of the operation that, in fact, facilitates realtime
implementations. Those implementations can atomically move a
high-priority thread between the condition variable and the mutex in a
manner that is transparent to the caller. This can prevent extra
context switches and provide more deterministic acquisition of a mutex
when the waiting thread is signaled. Thus, fairness and priority
issues can be dealt with directly by the scheduling discipline.
Furthermore, the current condition wait operation matches existing
practice.
I thought another use case might help better explain condition variables and their associated mutex.
I use POSIX condition variables to implement a barrier sync. Basically, I use it in an app where I have 15 (data plane) threads that all do the same thing, and I want them all to wait until every data plane has completed its initialization. Once they have all finished their (internal) data plane initialization, they can start processing data.
Here is the code. Notice I copied the algorithm from Boost since I couldn't use templates in this particular application:
void LinuxPlatformManager::barrierSync()
{
    // Algorithm taken from boost::barrier.
    // In the class constructor, the variables are initialized as follows:
    //   barrierGeneration_ = 0;
    //   barrierCounter_    = numCores_;   // numCores_ is 15
    //   barrierThreshold_  = numCores_;

    // Locking the mutex here synchronizes all condVar logic manipulation
    // from this point until the point where either pthread_cond_wait() or
    // pthread_cond_broadcast() is called below.
    pthread_mutex_lock(&barrierMutex_);

    int gen = barrierGeneration_;
    if (--barrierCounter_ == 0)
    {
        // The last thread to call barrierSync() enters here,
        // meaning they have all called barrierSync().
        barrierGeneration_++;
        barrierCounter_ = barrierThreshold_;
        // broadcast is the same as signal, but it signals ALL waiting threads
        pthread_cond_broadcast(&barrierCond_);
    }
    while (gen == barrierGeneration_)
    {
        // All but the last thread to call this method enter here.
        // This call is blocking, not on the mutex, but on the condVar;
        // this call actually releases the mutex.
        pthread_cond_wait(&barrierCond_, &barrierMutex_);
    }
    pthread_mutex_unlock(&barrierMutex_);
}
Notice that every thread that enters the barrierSync() method locks the mutex, which makes everything between the mutex lock and the call to either pthread_cond_wait() or pthread_mutex_unlock() atomic. Also notice that the mutex is released/unlocked inside pthread_cond_wait(), as mentioned in the pthread_cond_wait() documentation, which also notes that the behavior is undefined if you call pthread_cond_wait() without having first locked the mutex.
If pthread_cond_wait() did not release the mutex lock, then all threads would block on the call to pthread_mutex_lock() at the beginning of the barrierSync() method, and it wouldn't be possible to decrement the barrierCounter_ variable (or manipulate the related variables) atomically and in a thread-safe manner to know how many threads have called barrierSync().
So to summarize all of this: the mutex associated with the condition variable is not used to protect the condition variable itself, but rather to make the logic associated with the condition (barrierCounter_, etc.) atomic and thread-safe. When the threads block waiting for the condition to become true, they are actually blocking on the condition variable, not on the associated mutex. And a call to pthread_cond_broadcast()/pthread_cond_signal() will unblock them.
Here is another resource on pthread_cond_broadcast() and pthread_cond_signal() for additional reference.
