How to implement a lock and unlock sequence in a metal shader? - metal

How should I implement a lock/unlock sequence with Compare and Swap using a Metal compute shader.
I’ve tested this sample code but it does not seem to work. For some reason, the threads are not detecting that the lock was released.
Here is a brief explanation of the code below:
The depthFlag is an array of atomic_bools. In this simple example, I simply try to do a lock by comparing the content of depthFlag[1]. I then go ahead and do my operation and once the operation is done, I do an unlock.
As stated above, only one thread is able to do the locking/work/unlocking but the rest of the threads get stuck in the while loop. They NEVER leave. I expect another thread to detect the unlock and go through the sequence.
What am I doing wrong? My knowledge on CAS is limited, so I appreciate any tips.
kernel void testFunction(device float *depthBuffer[[buffer(4)]], device atomic_bool *depthFlag [[buffer(5)]], uint index[[thread_position_in_grid]]){
//lock
bool expected=false;
while(!atomic_compare_exchange_weak_explicit(&depthFlag[1],&expected,true,memory_order_relaxed,memory_order_relaxed)){
//wait
expected=false;
}
//Do my operation here
//unlock
atomic_store_explicit(&depthFlag[1], false, memory_order_relaxed);
//barrier
}

You essentially can't use the locking programming model for GPU concurrency. For one, the relaxed memory order model (the only one available) is not suitable for this; for another, you can't guarantee that other threads will make progress between your atomic operations. Your code must always be able to make progress, regardless of what the other threads are doing.
My recommendation is that you use something like the following model instead:
Read atomic value to check if another thread has already completed the operation in question.
If no other thread has done it yet, perform the operation. (But don't cause any side effects, i.e. don't write to device memory.)
Perform an atomic operation to indicate your thread has completed the operation while checking whether another thread got there first. (e.g. compare-and-swap a boolean, but increasing a counter also works)
If another thread got there first, don't perform side effects.
If your thread "won" and no other thread registered completion, perform your operation's side effects, e.g. do whatever you need to do to write out the result etc.
This works well if there's not much competition, and if the result does not vary depending on which thread performs the operation.
The occasional discarded work should not matter. If there is significant competition, use thread groups; within a thread group, the threads can coordinate which thread will perform which operation. You may still end up with wasted computation from competition between groups. If this is a problem, you may need to change your approach more fundamentally.
If the results of the operation are not deterministic, and the threads all need to proceed using the same result, you will need to change your approach. For example, split your kernels up so any computation which depends on the result of the operation in question runs in a sequentially queued kernel.

Related

Why object's mutability is relevant to its Thread safe?

I am working on CoreImage on IOS, you guys know CIContext and CIImage are immutable so they can be shared on threads. The problem is i am not sure why objects' mutability is closely relevant to its thread safe.
I can guess the rough reason is to prevent multiple threads from do something on a certain object at the same time. But can anybody provide some theoretical evidence or give a specific answer to it?
You are correct, mutable objects are not thread safe because multiple threads can write to that data at the same time. This is in contrast to reading data, an operation multiple threads can do simultaneously without causing problems. Immutable types can only be read.
However, when multiple threads are writing to the same data, these operations may interfere. For example, an NSMutableArray which two threads are editing could easily be corrupted. One thread is editing at one location, suddenly changing the memory the other thread was updating. iOS will use what is called an atomic operation for simple data types. What this means is that iOS requires the edit operation to fully finish before anything else can happen. This has an efficiency advantage over locks. Just Google about this stuff if you want to know more.
Well, actually, you are right.
If you have an immutable objects, all you can do is to read data from them. And every thread will get the same data, since the object can not be changed.
When you have a mutable object, the problem is that (in general) read and write operations are not atomic. It means that they are not performed instantaneously, they take time. So they can overlap each other. Even if we have a single-core processor, it can switch between threads at arbitrary moments of time. Possible problems it might cause:
1) Two threads write at the same object simultaneously. In this case the object might become corrupted (for instance, half of data comes from the first thread, the other half – from the second, the result is unpredictable.
2) One thread writes the data, another one reads it. One problem is that thread might read already outdated data. Another one is that it might read a corrupted data (if the writing thread haven't finished writing yet).
Take the example with an array of values. Let's say I'm doing some computation and I find the count of the array to be 4
var array = [1,2,3,4]
let count = array.count
Now maybe I want to loop through all the elements in the array, so I setup a loop that goes through index i<4 or something along those lines. So far so good.
This array is mutable though, so on a separate thread, I could easily remove an element. So perhaps I'm on thread 1 and I get the count to be 4, I perhaps start looping through the array. Now we switch to thread 2 and I remove an element, so now this same array has only 3 values in it. If I end up going back to thread 1 and I still assume I have 4 values while looping through the array, my program is going to crash when it tries to access the 4th element.
This is why immutability is desirable, it can guarantee some level of consistency across threads.

Using random() function on multiple threads

I'm working on an app where I need reproducible random numbers. I use srandom() with a seed to initialize the random number sequence. Then I use random() to generate the random numbers from this seed. If this is the only thread generating random numbers, everything works fine. However, if there are multiple threads generating random numbers, they interfere with each other.
Apparently, the sequence of random numbers is not thread safe. There must be a central random number generator that is called by all threads.
My app generates hundreds of objects, each one of which has four sequences of 14 random numbers generated this way. Each of these 4 sequences has its own non-random seed. This way, the random numbers should be reproducible. The problem is, because of the thread interference I just described, sometimes the sequence of 14 numbers being generated will be interrupted by a random number request by another thread.
After thinking about this for a while, I've decided to call
dispatch_sync(dispatch_get_main_queue(), ^{//generate the 14 numbers});
to get each sequence. This should force them to get generated in the proper sequence. In reading the documentation, it says there could be a deadlock if dispatch_sync is called on the queue it's running in. How can I tell if I'm already on the main queue? If I am, I don't need to dispatch anything, right?
Is there a better way to do this?
I suspect another way to do this is similar to this but using a dedicated queue instead of the main queue. I've never tried making my own queue before. Also, the method that needs to call the queue is an ephemeral one, so I'd need to somehow pass the custom queue around if I'm going to go that route. How does one pass a queue as an argument?
For now, I'm running with my idea, above, dispatching synchronously to the main queue, and the app seems to work fine. Worst case scenario, this snippet of code would be run about 4800 times (4 for each of 1200 objects, which is currently the max.).
I assume you want computationally random numbers, rather than cryptographic random numbers.
My suggestion would be to have separate RNGs for each thread, with each thread RNG seeded centrally from a master RNG. Since the system RNG is not thread safe, then create your own small RNG method -- a good LCG should work -- for use exclusively within one thread.
Use the built-in random() to produce only the initial seeds for each of your sub-threads. Setting the overall initial seed with srandom() will ensure that the thread local my_random() methods will all get a consistent initial reseed as long as the threads are started in the same order each time.
Effectively you are building a hierarchy of RNGs to match your hierarchy of threads.
Another option would be to have a singleton do the computation. The object needing the set of random numbers would ask the singleton for them in a batch.

Mutex are needed to protect the Condition Variables

As it is said that Mutex are needed to protect the Condition Variables.
Is the reference here to the actual condition variable declared as pthread_cond_t
OR
A normal shared variable count whose values decide the signaling and wait.
?
is the reference here to the actual condition variable declared as pthread_cond_t or a normal shared variable count whose values decide the signaling and wait?
The reference is to both.
The mutex makes it so, that the shared variable (count in your question) can be checked, and if the value of that variable doesn't meet the desired condition, the wait that is performed inside pthread_cond_wait() will occur atomically with respect to that check.
The problem being solved with the mutex is that you have two separate operations that need to be atomic:
check the condition of count
wait inside of pthread_cond_wait() if the condition isn't met yet.
A pthread_cond_signal() doesn't 'persist' - if there are no threads waiting on the pthread_cond_t object, a signal does nothing. So if there wasn't a mutex making the two operations listed above atomic with respect to one another, you could find yourself in the following situation:
Thread A wants to do something once count is non-zero
Thread B will signal when it increments count (which will set count to something other than zero)
thread "A" checks count and finds that it's zero
before "A" gets to call pthread_cond_wait(), thread "B" comes along and increments count to 1 and calls pthread_cond_signal(). That call actually does nothing of consequence since "A" isn't waiting on the pthread_cond_t object yet.
"A" calls pthread_cond_wait(), but since condition variable signals aren't remembered, it will block at this point and wait for the signal that has already come and gone.
The mutex (as long as all threads are following the rules) makes it so that item #2 cannot occur between items 1 and 3. The only way that thread "B" will get a chance to increment count is either before A looks at count or after "A" is already waiting for the signal.
A condition variable must always be associated with a mutex, to avoid the race condition where a thread prepares to wait on a condition variable and another thread signals the condition just before the first thread actually waits on it.
More info here
Some Sample:
Thread 1 (Waits for the condition)
pthread_mutex_lock(cond_mutex);
while(i<5)
{
pthread_cond_wait(cond, cond_mutex);
}
pthread_mutex_unlock(cond_mutex);
Thread 2 (Signals the condition)
pthread_mutex_lock(cond_mutex);
i++;
if(i>=5)
{
pthread_cond_signal(cond);
}
pthread_mutex_unlock(cond_mutex);
As you can see in the same above, the mutex protects the variable 'i' which is the cause of the condition. When we see that the condition is not met, we go into a condition wait, which implicitly releases the mutex and thereby allowing the thread doing the signalling to acquire the mutex and work on 'i' and avoid race condition.
Now, as per your question, if the signalling thread signals first, it should have acquired the mutex before doing so, else the first thread might simply check the condition and see that it is not being met and might go for condition wait and since the second thread has already signalled it, no one will signal it there after and the first thread will keep waiting forever.So, in this sense, the mutex is for both the condition & the conditional variable.
Per the pthreads docs the reason that the mutex was not separated is that there is a significant performance improvement by combining them and they expect that because of common race conditions if you don't use a mutex, it's almost always going to be done anyway.
https://linux.die.net/man/3/pthread_cond_wait​
Features of Mutexes and Condition Variables
It had been suggested that the mutex acquisition and release be
decoupled from condition wait. This was rejected because it is the
combined nature of the operation that, in fact, facilitates realtime
implementations. Those implementations can atomically move a
high-priority thread between the condition variable and the mutex in a
manner that is transparent to the caller. This can prevent extra
context switches and provide more deterministic acquisition of a mutex
when the waiting thread is signaled. Thus, fairness and priority
issues can be dealt with directly by the scheduling discipline.
Furthermore, the current condition wait operation matches existing
practice.
I thought that a better use-case might help better explain conditional variables and their associated mutex.
I use posix conditional variables to implement what is called a Barrier Sync. Basically, I use it in an app where I have 15 (data plane) threads that all do the same thing, and I want them all to wait until all data planes have completed their initialization. Once they have all finished their (internal) data plane initialization, then they can start processing data.
Here is the code. Notice I copied the algorithm from Boost since I couldnt use templates in this particular application:
void LinuxPlatformManager::barrierSync()
{
// Algorithm taken from boost::barrier
// In the class constructor, the variables are initialized as follows:
// barrierGeneration_ = 0;
// barrierCounter_ = numCores_; // numCores_ is 15
// barrierThreshold_ = numCores_;
// Locking the mutex here synchronizes all condVar logic manipulation
// from this point until the point where either pthread_cond_wait() or
// pthread_cond_broadcast() is called below
pthread_mutex_lock(&barrierMutex_);
int gen = barrierGeneration_;
if(--barrierCounter_ == 0)
{
// The last thread to call barrierSync() enters here,
// meaning they have all called barrierSync()
barrierGeneration_++;
barrierCounter_ = barrierThreshold_;
// broadcast is the same as signal, but it signals ALL waiting threads
pthread_cond_broadcast(&barrierCond_);
}
while(gen == barrierGeneration_)
{
// All but the last thread to call this method enter here
// This call is blocking, not on the mutex, but on the condVar
// this call actually releases the mutex
pthread_cond_wait(&barrierCond_, &barrierMutex_);
}
pthread_mutex_unlock(&barrierMutex_);
}
Notice that every thread that enters the barrierSync() method locks the mutex, which makes everything between the mutex lock and the call to either pthread_cond_wait() or pthread_mutex_unlock() atomic. Also notice that the mutex is released/unlocked in pthread_cond_wait() as mentioned here. In this link it also mentions that the behavior is undefined if you call pthread_cond_wait() without having first locked the mutex.
If pthread_cond_wait() did not release the mutex lock, then all threads would block on the call to pthread_mutex_lock() at the beginning of the barrierSync() method, and it wouldnt be possible to decrease the barrierCounter_ variables (nor manipulate related vars) atomically (nor in a thread safe manner) to know how many threads have called barrierSync()
So to summarize all of this, the mutex associated with the Conditional Variable is not used to protect the Conditional Variable itself, but rather it is used to make the logic associated with the condition (barrierCounter_, etc) atomic and thread-safe. When the threads block waiting for the condition to become true, they are actually blocking on the Conditional Variable, not on the associated mutex. And a call to pthread_cond_broadcast/signal() will unblock them.
Here is another resource related to pthread_cond_broadcast() and pthread_cond_signal() for an additional reference.

The memory consistency model CUDA 4.0 and global memory?

Update: The while() condition below gets optimized out by the compiler, so both threads just skip the condition and enter the C.S. even with -O0 flag. Does anyone know why the compiler is doing this? By the way, declaring the global variables volatile causes the program to hang for some odd reason...
I read the CUDA programming guide but I'm still a bit unclear on how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy) Basically, I am running tests trying to break sequential consistency. The algorithm I am using is Peterson's algorithm for mutual exclusion between two threads inside the kernel function:
flag[threadIdx.x] = 1; // both these are global
turn = 1-threadIdx.x;
while(flag[1-threadIdx.x] == 1 && turn == (1- threadIdx.x));
shared_gloabl_variable_x ++;
flag[threadIdx.x] = 0;
This is fairly straightforward. Each thread asks for the critical section by setting its flag to one and by being nice by giving the turn to the other thread. At the evaluation of the while(), if the other thread did not set its flag, the requesting thread can then enter the critical section safely. Now a subtle problem with this approach is that if the compiler re-orders the writes so that the write to turn executes before the write to flag. If this happens both threads will end up in the C.S. at the same time. This fairly easy to prove with normal Pthreads, since most processors don't implement sequential consistency. But what about GPUs?
Both of these threads will be in the same warp. And they will execute their statements in lock-step mode. But when they reach the turn variable they are writing to the same variable so the intra-warp execution becomes serialized (doesn't matter what the order is). Now at this point, does the thread that wins proceed onto the while condition, or does it wait for the other thread to finish its write, so that both can then evaluate the while() at the same time? The paths again will diverge at the while(), because only one of them will win while the other waits.
After running the code, I am getting it to consistently break SC. The value I read is ALWAYS 1, which means that both threads somehow are entering the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I have compiled it with -O0, so no compiler optimization, and hence no use of volatile).
Edit: since you have only two threads and 1-threadIdx.x works, then you must be using thread IDs 0 and 1. Threads 0 and 1 will always be part of the same warp on all current NVIDIA GPUs. Warps execute instructions SIMD fashion, with a thread execution mask for divergent conditions. Your while loop is a divergent condition.
When turn and flags are not volatile, the compiler probably reorders the instructions and you see the behavior of both threads entering the C.S.
When turn and flags are volatile, you see a hang. The reason is that one of the threads will succeed at writing turn, so turn will be either 0 or 1. Let's assume turn==0: If the hardware chooses to execute thread 0's part of the divergent branch, then all is OK. But if it chooses to execute thread 1's part of the divergent branch, then it will spin on the while loop and thread 0 will never get its turn, hence the hang.
You can probably avoid the hang by ensuring that your two threads are in different warps, but I think that the warps must be concurrently resident on the SM so that instructions can issue from both and progress can be made. (Might work with concurrent warps on different SMs, since this is global memory; but that might require __threadfence() and not just __threadfence_block().)
In general this is a great example of why code like this is unsafe on GPUs and should not be used. I realize though that this is just an investigative experiment. In general CUDA GPUs do not—as you mention most processors do not—implement sequential consistency.
Original Answer
the variables turn and flag need to be volatile, otherwise the load of flag will not be repeated and the condition turn == 1-threadIdx.X will not be re-evaluated but instead will be taken as true.
There should be a __threadfence_block() between the store to flag and store to turn to get the right ordering.
There should be a __threadfence_block() before the shared variable increment (which should also be declared volatile). You may also want a __syncthreads() or at least __threadfence_block() after the increment to ensure it is visible to other threads.
I have a hunch that even after making these fixes you may still run into trouble, though. Let us know how it goes.
BTW, you have a syntax error in this line, so it's clear this isn't exactly your real code:
while(flag[1-threadIdx.x] == 1 and turn==[1- threadIdx.x]);
In the absence of extra memory barriers such as __threadfence(), sequential consistency of global memory is enforced only within a given thread.

Why does epoll_wait only provide huge 1ms timeout?

epoll_wait, select and poll functions all provide a timeout. However with epoll, it's at a large resolution of 1ms. Select & ppoll are the only one providing sub-millisecond timeout.
That would mean doing other things at 1ms intervals at best. I could do a lot of other things within 1ms on a modern CPU.
So to do other things more often than 1ms I actually have to provide a timeout of zero (essentially disabling it). And I'd probably add my own usleep somewhere in the main loop to stop it chewing up too much CPU.
So the question is, why is the timeout in milli's when I would think clearly there is a case for a higher resolution timeout.
Since you are on Linux, instead of providing a zero timeout value and manually usleeeping in the loop body, you could simply use the timerfd API. This essentially lets you create a timer (with a resolution finer than 1ms) associated with a file descriptor, which you can add to the set of monitored descriptors.
The epoll_wait interface just inherited a timeout measured in milliseconds from poll. While it doesn't make sense to poll for less than a millisecond, because of the overhead of adding the calling thread to all the wait sets, it does make sense for epoll_wait. A call to epoll_wait doesn't require ever putting the calling thread onto more than one wait set, the calling overhead is very low, and it could make sense, on very rare occasions, to block for less than a millisecond.
I'd recommend just using a timing thread. Most of what you would want to do can just be done in that timing thread, so you won't need to break out of epoll_wait. If you do need to make a thread return from epoll_wait, just send a byte to a pipe that thread is polling and the wait will terminate.
In Linux 5.11, an epoll_pwait2 API has been added, which uses a struct timespec as timeout. This means you can now wait using nanoseconds precision.

Resources