As it is said that Mutex are needed to protect the Condition Variables.
Is the reference here to the actual condition variable declared as pthread_cond_t
OR
A normal shared variable count whose values decide the signaling and wait.
?
is the reference here to the actual condition variable declared as pthread_cond_t or a normal shared variable count whose values decide the signaling and wait?
The reference is to both.
The mutex makes it so, that the shared variable (count in your question) can be checked, and if the value of that variable doesn't meet the desired condition, the wait that is performed inside pthread_cond_wait() will occur atomically with respect to that check.
The problem being solved with the mutex is that you have two separate operations that need to be atomic:
check the condition of count
wait inside of pthread_cond_wait() if the condition isn't met yet.
A pthread_cond_signal() doesn't 'persist' - if there are no threads waiting on the pthread_cond_t object, a signal does nothing. So if there wasn't a mutex making the two operations listed above atomic with respect to one another, you could find yourself in the following situation:
Thread A wants to do something once count is non-zero
Thread B will signal when it increments count (which will set count to something other than zero)
thread "A" checks count and finds that it's zero
before "A" gets to call pthread_cond_wait(), thread "B" comes along and increments count to 1 and calls pthread_cond_signal(). That call actually does nothing of consequence since "A" isn't waiting on the pthread_cond_t object yet.
"A" calls pthread_cond_wait(), but since condition variable signals aren't remembered, it will block at this point and wait for the signal that has already come and gone.
The mutex (as long as all threads are following the rules) makes it so that item #2 cannot occur between items 1 and 3. The only way that thread "B" will get a chance to increment count is either before A looks at count or after "A" is already waiting for the signal.
A condition variable must always be associated with a mutex, to avoid the race condition where a thread prepares to wait on a condition variable and another thread signals the condition just before the first thread actually waits on it.
More info here
Some Sample:
Thread 1 (Waits for the condition)
pthread_mutex_lock(cond_mutex);
while(i<5)
{
pthread_cond_wait(cond, cond_mutex);
}
pthread_mutex_unlock(cond_mutex);
Thread 2 (Signals the condition)
pthread_mutex_lock(cond_mutex);
i++;
if(i>=5)
{
pthread_cond_signal(cond);
}
pthread_mutex_unlock(cond_mutex);
As you can see in the same above, the mutex protects the variable 'i' which is the cause of the condition. When we see that the condition is not met, we go into a condition wait, which implicitly releases the mutex and thereby allowing the thread doing the signalling to acquire the mutex and work on 'i' and avoid race condition.
Now, as per your question, if the signalling thread signals first, it should have acquired the mutex before doing so, else the first thread might simply check the condition and see that it is not being met and might go for condition wait and since the second thread has already signalled it, no one will signal it there after and the first thread will keep waiting forever.So, in this sense, the mutex is for both the condition & the conditional variable.
Per the pthreads docs the reason that the mutex was not separated is that there is a significant performance improvement by combining them and they expect that because of common race conditions if you don't use a mutex, it's almost always going to be done anyway.
https://linux.die.net/man/3/pthread_cond_wait
Features of Mutexes and Condition Variables
It had been suggested that the mutex acquisition and release be
decoupled from condition wait. This was rejected because it is the
combined nature of the operation that, in fact, facilitates realtime
implementations. Those implementations can atomically move a
high-priority thread between the condition variable and the mutex in a
manner that is transparent to the caller. This can prevent extra
context switches and provide more deterministic acquisition of a mutex
when the waiting thread is signaled. Thus, fairness and priority
issues can be dealt with directly by the scheduling discipline.
Furthermore, the current condition wait operation matches existing
practice.
I thought that a better use-case might help better explain conditional variables and their associated mutex.
I use posix conditional variables to implement what is called a Barrier Sync. Basically, I use it in an app where I have 15 (data plane) threads that all do the same thing, and I want them all to wait until all data planes have completed their initialization. Once they have all finished their (internal) data plane initialization, then they can start processing data.
Here is the code. Notice I copied the algorithm from Boost since I couldnt use templates in this particular application:
void LinuxPlatformManager::barrierSync()
{
// Algorithm taken from boost::barrier
// In the class constructor, the variables are initialized as follows:
// barrierGeneration_ = 0;
// barrierCounter_ = numCores_; // numCores_ is 15
// barrierThreshold_ = numCores_;
// Locking the mutex here synchronizes all condVar logic manipulation
// from this point until the point where either pthread_cond_wait() or
// pthread_cond_broadcast() is called below
pthread_mutex_lock(&barrierMutex_);
int gen = barrierGeneration_;
if(--barrierCounter_ == 0)
{
// The last thread to call barrierSync() enters here,
// meaning they have all called barrierSync()
barrierGeneration_++;
barrierCounter_ = barrierThreshold_;
// broadcast is the same as signal, but it signals ALL waiting threads
pthread_cond_broadcast(&barrierCond_);
}
while(gen == barrierGeneration_)
{
// All but the last thread to call this method enter here
// This call is blocking, not on the mutex, but on the condVar
// this call actually releases the mutex
pthread_cond_wait(&barrierCond_, &barrierMutex_);
}
pthread_mutex_unlock(&barrierMutex_);
}
Notice that every thread that enters the barrierSync() method locks the mutex, which makes everything between the mutex lock and the call to either pthread_cond_wait() or pthread_mutex_unlock() atomic. Also notice that the mutex is released/unlocked in pthread_cond_wait() as mentioned here. In this link it also mentions that the behavior is undefined if you call pthread_cond_wait() without having first locked the mutex.
If pthread_cond_wait() did not release the mutex lock, then all threads would block on the call to pthread_mutex_lock() at the beginning of the barrierSync() method, and it wouldnt be possible to decrease the barrierCounter_ variables (nor manipulate related vars) atomically (nor in a thread safe manner) to know how many threads have called barrierSync()
So to summarize all of this, the mutex associated with the Conditional Variable is not used to protect the Conditional Variable itself, but rather it is used to make the logic associated with the condition (barrierCounter_, etc) atomic and thread-safe. When the threads block waiting for the condition to become true, they are actually blocking on the Conditional Variable, not on the associated mutex. And a call to pthread_cond_broadcast/signal() will unblock them.
Here is another resource related to pthread_cond_broadcast() and pthread_cond_signal() for an additional reference.
Related
How should I implement a lock/unlock sequence with Compare and Swap using a Metal compute shader.
I’ve tested this sample code but it does not seem to work. For some reason, the threads are not detecting that the lock was released.
Here is a brief explanation of the code below:
The depthFlag is an array of atomic_bools. In this simple example, I simply try to do a lock by comparing the content of depthFlag[1]. I then go ahead and do my operation and once the operation is done, I do an unlock.
As stated above, only one thread is able to do the locking/work/unlocking but the rest of the threads get stuck in the while loop. They NEVER leave. I expect another thread to detect the unlock and go through the sequence.
What am I doing wrong? My knowledge on CAS is limited, so I appreciate any tips.
kernel void testFunction(device float *depthBuffer[[buffer(4)]], device atomic_bool *depthFlag [[buffer(5)]], uint index[[thread_position_in_grid]]){
//lock
bool expected=false;
while(!atomic_compare_exchange_weak_explicit(&depthFlag[1],&expected,true,memory_order_relaxed,memory_order_relaxed)){
//wait
expected=false;
}
//Do my operation here
//unlock
atomic_store_explicit(&depthFlag[1], false, memory_order_relaxed);
//barrier
}
You essentially can't use the locking programming model for GPU concurrency. For one, the relaxed memory order model (the only one available) is not suitable for this; for another, you can't guarantee that other threads will make progress between your atomic operations. Your code must always be able to make progress, regardless of what the other threads are doing.
My recommendation is that you use something like the following model instead:
Read atomic value to check if another thread has already completed the operation in question.
If no other thread has done it yet, perform the operation. (But don't cause any side effects, i.e. don't write to device memory.)
Perform an atomic operation to indicate your thread has completed the operation while checking whether another thread got there first. (e.g. compare-and-swap a boolean, but increasing a counter also works)
If another thread got there first, don't perform side effects.
If your thread "won" and no other thread registered completion, perform your operation's side effects, e.g. do whatever you need to do to write out the result etc.
This works well if there's not much competition, and if the result does not vary depending on which thread performs the operation.
The occasional discarded work should not matter. If there is significant competition, use thread groups; within a thread group, the threads can coordinate which thread will perform which operation. You may still end up with wasted computation from competition between groups. If this is a problem, you may need to change your approach more fundamentally.
If the results of the operation are not deterministic, and the threads all need to proceed using the same result, you will need to change your approach. For example, split your kernels up so any computation which depends on the result of the operation in question runs in a sequentially queued kernel.
func doSomething() -> Int {
var sum = 0
let increaseWork = DispatchWorkItem {
sum = sum + 100 //point 1
}
DispatchQueue.global().async(execute:increaseWork)
increaseWork.wait()
return sum //point 2
}
Thread Sanitizer is saying that there is race condition between point 1 and point 2. But I don't think there is any race condition as increaseWork.wait() is blocking call and it will not pass until the closure is executed.
There could be a race condition between those 2 points. Imagine what happens if you execute the doSomething() function on 2 separate threads like this:
The first thread executes the increaseWork() closure and finishes it. It's at the line with wait right now
The second thread starts executing and hits the async execute instruction
The first thread hits the return instruction as the wait can continue
At the same time, the closure from the second is scheduled and executed
At this point, you can't tell for sure what is executed first: the sum = sum + 100 from the second thread or the return sum from the first one.
The idea is that sum is a shared resource which is not synchronised, so, in theory, that could be a race condition. Even if you took care that such thing does not happen, the Thread Sanitizer detects a possible race condition as it does not know whether you start a single thread or you execute doSomething() function from 100 different threads at the same time.
UPDATE:
As I missed the fact that the sum variable is local, the above explanation does not answer the current question. The scenario described would never take place in the given piece of code.
But, even though the sum is a local variable, due to the fact it is used and retained inside a closure, it will be allocated on heap, rather than on stack. That's because the Swift compiler cannot determine whether the closure will be finished its execution before the doSomething() function return or not.
Why? Because the closure is passed to a constructor which behaves like an #escaping parameter. It implies that there is no guarantee when the closure will be executed, thus all the variables retained by the closure must be allocated on heap for safety. Not knowing when the closure will be executed, the Thread Sanitizer cannot determine that the return sum statement will be indeed executed after the closure finishes.
So even though here we can be sure that no race condition can happen, the Thread Sanitizer raises an alarm, as it cannot determine it cannot happen.
From this Apple's document about NSCondition, the usage of NSCondition should be:
Thead 1:
[cocoaCondition lock];
while (timeToDoWork <= 0)
[cocoaCondition wait];
timeToDoWork--;
// Do real work here.
[cocoaCondition unlock];
Thread 2:
[cocoaCondition lock];
timeToDoWork++;
[cocoaCondition signal];
[cocoaCondition unlock];
And in the document of method signal in NSConditon:
You use this method to wake up one thread that is waiting on the condition. You may call this method multiple times to wake up multiple threads. If no threads are waiting on the condition, this method does nothing. To avoid race conditions, you should invoke this method only while the receiver is locked.
My question is:
I don't want the Thread 2 been blocked in any situation, so I removed the lock and unlock call in Thread 2. That is, Thread 2 can put as many work as it wish, Thread 1 will do the work one by one, if no more work, it wait (blocked). This is also a producer-consumer pattern, but the producer never been blocked.
But the way is not correct according to Apple's document So what things could possibly go wrong in this pattern? Thanks.
Failing to lock is a serious problem when multiple threads are accessing shared data. In the example from Apple's code, if Thread 2 doesn't lock the condition object then it can be incrementing timeToDoWork at the same time that Thread 1 is decrementing it. That can result in the results from one of those operations being lost. For example:
Thread 1 reads the current value of timeToDoWork, gets 1
Thread 2 reads the current value of timeToDoWork, gets 1
Thread 2 computes the incremented value (timeToDoWork + 1), gets 2
Thread 1 computes the decremented value (timeToDoWork - 1), gets 0
Thread 2 writes the new value of timeToDoWork, stores 2
Thread 1 writes the new value of timeToDoWork, stores 0
timeToDoWork started at 1, was incremented and decremented, so it should end at 1, but it actually ends at 0. By rearranging the steps, it could end up at 2, instead. Presumably, the value of timeToDoWork represents something real and important. Getting it wrong would probably screw up the program.
If your two threads are doing something as simple as incrementing and decrementing a number, they can do it without locks by using the atomic operation functions, such as OSAtomicIncrement32Barrier() and OSAtomicDecrement32Barrier(). However, if the shared data is any more complicated than that (and it probably is in any non-trivial case), then they really need to use synchronization mechanisms such as condition locks.
Does pthread_cond_signal unblock exactly one thread? If not, what will be the case it releases more than one thread? The specification says as follows:
The pthread_cond_signal() function shall unblock at least one of the
threads that are blocked on the specified condition variable cond (if
any threads are blocked on cond).
The pthreads specification allows for "spurious wakeups" in an implementation. See, for example, the hypothetical implementation of pthread_cond_signal and pthread_cond_wait sketched in the specification that allows for just this condition.
The possibility of spurious wakeups is why one always associates some predicate with a condition, and checks that predicate upon wakeup.
Update: The while() condition below gets optimized out by the compiler, so both threads just skip the condition and enter the C.S. even with -O0 flag. Does anyone know why the compiler is doing this? By the way, declaring the global variables volatile causes the program to hang for some odd reason...
I read the CUDA programming guide but I'm still a bit unclear on how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy) Basically, I am running tests trying to break sequential consistency. The algorithm I am using is Peterson's algorithm for mutual exclusion between two threads inside the kernel function:
flag[threadIdx.x] = 1; // both these are global
turn = 1-threadIdx.x;
while(flag[1-threadIdx.x] == 1 && turn == (1- threadIdx.x));
shared_gloabl_variable_x ++;
flag[threadIdx.x] = 0;
This is fairly straightforward. Each thread asks for the critical section by setting its flag to one and by being nice by giving the turn to the other thread. At the evaluation of the while(), if the other thread did not set its flag, the requesting thread can then enter the critical section safely. Now a subtle problem with this approach is that if the compiler re-orders the writes so that the write to turn executes before the write to flag. If this happens both threads will end up in the C.S. at the same time. This fairly easy to prove with normal Pthreads, since most processors don't implement sequential consistency. But what about GPUs?
Both of these threads will be in the same warp. And they will execute their statements in lock-step mode. But when they reach the turn variable they are writing to the same variable so the intra-warp execution becomes serialized (doesn't matter what the order is). Now at this point, does the thread that wins proceed onto the while condition, or does it wait for the other thread to finish its write, so that both can then evaluate the while() at the same time? The paths again will diverge at the while(), because only one of them will win while the other waits.
After running the code, I am getting it to consistently break SC. The value I read is ALWAYS 1, which means that both threads somehow are entering the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I have compiled it with -O0, so no compiler optimization, and hence no use of volatile).
Edit: since you have only two threads and 1-threadIdx.x works, then you must be using thread IDs 0 and 1. Threads 0 and 1 will always be part of the same warp on all current NVIDIA GPUs. Warps execute instructions SIMD fashion, with a thread execution mask for divergent conditions. Your while loop is a divergent condition.
When turn and flags are not volatile, the compiler probably reorders the instructions and you see the behavior of both threads entering the C.S.
When turn and flags are volatile, you see a hang. The reason is that one of the threads will succeed at writing turn, so turn will be either 0 or 1. Let's assume turn==0: If the hardware chooses to execute thread 0's part of the divergent branch, then all is OK. But if it chooses to execute thread 1's part of the divergent branch, then it will spin on the while loop and thread 0 will never get its turn, hence the hang.
You can probably avoid the hang by ensuring that your two threads are in different warps, but I think that the warps must be concurrently resident on the SM so that instructions can issue from both and progress can be made. (Might work with concurrent warps on different SMs, since this is global memory; but that might require __threadfence() and not just __threadfence_block().)
In general this is a great example of why code like this is unsafe on GPUs and should not be used. I realize though that this is just an investigative experiment. In general CUDA GPUs do not—as you mention most processors do not—implement sequential consistency.
Original Answer
the variables turn and flag need to be volatile, otherwise the load of flag will not be repeated and the condition turn == 1-threadIdx.X will not be re-evaluated but instead will be taken as true.
There should be a __threadfence_block() between the store to flag and store to turn to get the right ordering.
There should be a __threadfence_block() before the shared variable increment (which should also be declared volatile). You may also want a __syncthreads() or at least __threadfence_block() after the increment to ensure it is visible to other threads.
I have a hunch that even after making these fixes you may still run into trouble, though. Let us know how it goes.
BTW, you have a syntax error in this line, so it's clear this isn't exactly your real code:
while(flag[1-threadIdx.x] == 1 and turn==[1- threadIdx.x]);
In the absence of extra memory barriers such as __threadfence(), sequential consistency of global memory is enforced only within a given thread.