My block operation completion handler is displaying random results, and I'm not sure why. Everything I've read says it is similar to dispatch groups in GCD.
Please find my code below:
import Foundation

let sentence = "I love my car"
let wordOperation = BlockOperation()
var wordArray = [String]()

for word in sentence.split(separator: " ") {
    wordOperation.addExecutionBlock {
        print(word)
        wordArray.append(String(word))
    }
}

wordOperation.completionBlock = {
    print(wordArray)
    print("Completion Block")
}

wordOperation.start()
I was expecting my output to be ["I", "love", "my", "car"] (it should display all these words, either in sequence or in random order).
But when I run it, my output is either ["my"] or ["love"] or ["I", "car"]; it prints a random subset rather than all the expected values.
Not sure why this is happening. Please advise.
The problem is that those separate execution blocks may run concurrently with respect to each other, on separate threads. This is true whether you start the operation directly, as you have, or add it to an operation queue, even one with a maxConcurrentOperationCount of 1. As the documentation says about addExecutionBlock:
The specified block should not make any assumptions about its execution environment.
On top of this, Swift arrays are not thread-safe. So in the absence of synchronization, concurrent interaction with a non-thread-safe object may result in unexpected behavior, such as what you’ve shared with us.
If you turn on TSan, the Thread Sanitizer (found in “Product” » “Scheme” » “Edit Scheme...”, or press ⌘+<, and then choose “Run” » “Diagnostics” » “Thread Sanitizer”), it will warn you about the data race.
So, bottom line, the problem isn’t addExecutionBlock, per se, but rather the attempt to mutate the array from multiple threads at the same time. If you used a concurrent queue in conjunction with a dispatch group, you could experience similar problems (though, like many race conditions, they can be hard to manifest).
Theoretically, one could add synchronization to your code snippet, and that would fix the problem. But then again, it would be silly to initiate a bunch of concurrent updates only to employ synchronization within them to prevent concurrent updates. It would work, but it would be inefficient. You only employ that pattern when the work done on the background threads is substantial in comparison to the time spent synchronizing updates to the shared resource, and that’s not the case here.
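For completeness, here is a minimal sketch of that synchronized variant, using an NSLock to serialize access to the array (any lock or serial queue would do). It is correct, just not an efficient use of concurrency:

import Foundation

let sentence = "I love my car"
let wordOperation = BlockOperation()
var wordArray = [String]()
let lock = NSLock() // serializes all access to wordArray

for word in sentence.split(separator: " ") {
    wordOperation.addExecutionBlock {
        lock.lock()
        wordArray.append(String(word))
        lock.unlock()
    }
}

wordOperation.completionBlock = {
    lock.lock()
    print(wordArray) // all four words, though their order may vary
    lock.unlock()
}

wordOperation.start()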
How should I implement a lock/unlock sequence with compare-and-swap in a Metal compute shader?
I’ve tested this sample code but it does not seem to work. For some reason, the threads are not detecting that the lock was released.
Here is a brief explanation of the code below:
depthFlag is an array of atomic_bools. In this simple example, I try to take the lock by comparing the contents of depthFlag[1]. I then go ahead and do my operation, and once the operation is done, I do an unlock.
As stated above, only one thread is able to do the locking/work/unlocking, while the rest of the threads get stuck in the while loop. They NEVER leave. I expect another thread to detect the unlock and go through the sequence.
What am I doing wrong? My knowledge of CAS is limited, so I appreciate any tips.
kernel void testFunction(device float *depthBuffer [[buffer(4)]],
                         device atomic_bool *depthFlag [[buffer(5)]],
                         uint index [[thread_position_in_grid]]) {
    // lock
    bool expected = false;
    while (!atomic_compare_exchange_weak_explicit(&depthFlag[1], &expected, true,
                                                  memory_order_relaxed, memory_order_relaxed)) {
        // wait
        expected = false;
    }

    // Do my operation here

    // unlock
    atomic_store_explicit(&depthFlag[1], false, memory_order_relaxed);

    // barrier
}
You essentially can't use the locking programming model for GPU concurrency. For one, the relaxed memory order model (the only one available) is not suitable for this; for another, you can't guarantee that other threads will make progress between your atomic operations. Your code must always be able to make progress, regardless of what the other threads are doing.
My recommendation is that you use something like the following model instead:
1. Read an atomic value to check whether another thread has already completed the operation in question.
2. If no other thread has done it yet, perform the operation, but don't cause any side effects yet (i.e. don't write to device memory).
3. Perform an atomic operation to indicate that your thread has completed the operation, while checking whether another thread got there first (e.g. compare-and-swap a boolean, though incrementing a counter also works).
4. If another thread got there first, don't perform the side effects.
5. If your thread "won" and no other thread registered completion, perform your operation's side effects, e.g. do whatever you need to do to write out the result.
This works well if there's not much competition, and if the result does not vary depending on which thread performs the operation.
The occasional discarded work should not matter. If there is significant competition, use thread groups; within a thread group, the threads can coordinate which thread will perform which operation. You may still end up with wasted computation from competition between groups. If this is a problem, you may need to change your approach more fundamentally.
If the results of the operation are not deterministic, and the threads all need to proceed using the same result, you will need to change your approach. For example, split your kernels up so any computation which depends on the result of the operation in question runs in a sequentially queued kernel.
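To make the control flow concrete, here is a CPU-side sketch of that win-the-race model in Swift, using the swift-atomics package (my choice purely for illustration; in an actual kernel you would use the corresponding MSL atomic functions):

import Atomics
import Dispatch

let claimed = ManagedAtomic<Int>(0) // 0 = unclaimed, 1 = some thread has won
var result = 0                      // the side effect, written only by the winner

DispatchQueue.concurrentPerform(iterations: 8) { _ in
    // 1. Check whether another thread already completed the operation.
    guard claimed.load(ordering: .relaxed) == 0 else { return }

    // 2. Perform the operation, with no side effects yet.
    let computed = 42 // stand-in for the real computation

    // 3. Atomically register completion, checking for an earlier winner.
    let (won, _) = claimed.compareExchange(expected: 0,
                                           desired: 1,
                                           ordering: .acquiringAndReleasing)

    // 4 and 5. Only the winner performs the side effects; losers discard their work.
    if won { result = computed }
}

print(result) // 42, committed exactly once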
Please consider the following statement:
DispatchQueue.global(qos: .userInitiated).asyncAfter(deadline: .now() + .milliseconds(500),
                                                     qos: .utility,
                                                     flags: .noQoS) {
    print("What is my QOS?")
}
Notice how many of the parameters refer to quality of service. How can a mere mortal possibly sort out the permutations?
Generally you shouldn't try to sort out all those permutations. In most cases, messing around too much with QoS is a recipe for trouble. But there are fairly simple rules.
Queues have priorities, and they assign that priority to blocks that request to inherit it.
This particular block is explicitly requesting a lower priority, but then says "ignore my QoS request." As a rule, don't do that. The only reason I know of for doing that is if you're interacting with some legacy API that doesn't understand QoS. (I've never encountered this myself, and it's hard to imagine it coming up in user-level code.)
A more interesting question IMO (and one that comes up much more often in real code) is this one:
DispatchQueue.global(qos: .utility).async(qos: .userInitiated) {}
What is the priority of this block? The answer is .userInitiated, and the block will "loan" its priority to the queue until it finishes executing. So for some period of time, this entire queue will become .userInitiated. This is to prevent priority inversion (where a high-priority task blocks waiting for a low-priority task).
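You can check this yourself by reading the effective QoS class from inside the block; a minimal sketch (using qos_class_self() from the Dispatch overlay):

import Foundation

DispatchQueue.global(qos: .utility).async(qos: .userInitiated) {
    // Report the QoS class the block is actually running at.
    let effective = DispatchQoS.QoSClass(rawValue: qos_class_self())
    print("Effective QoS:", effective ?? .unspecified) // .userInitiated, not .utility
}

Thread.sleep(forTimeInterval: 1) // keep a command-line process alive long enough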
This is all discussed in depth in Concurrent Programming With GCD in Swift 3, which is a must-watch for anyone interested in non-trivial GCD.
I'm working on an app where I need reproducible random numbers. I use srandom() with a seed to initialize the random number sequence. Then I use random() to generate the random numbers from this seed. If this is the only thread generating random numbers, everything works fine. However, if there are multiple threads generating random numbers, they interfere with each other.
Apparently, the sequence of random numbers is not thread safe. There must be a central random number generator that is called by all threads.
My app generates hundreds of objects, each one of which has four sequences of 14 random numbers generated this way. Each of these 4 sequences has its own non-random seed. This way, the random numbers should be reproducible. The problem is, because of the thread interference I just described, sometimes the sequence of 14 numbers being generated will be interrupted by a random number request by another thread.
After thinking about this for a while, I've decided to call
dispatch_sync(dispatch_get_main_queue(), ^{
    // generate the 14 numbers
});
to get each sequence. This should force them to be generated in the proper order. The documentation says there could be a deadlock if dispatch_sync is called on the queue it's already running on. How can I tell if I'm already on the main queue? If I am, I don't need to dispatch anything, right?
Is there a better way to do this?
I suspect another way to do this is the same approach but with a dedicated queue instead of the main queue. I've never tried making my own queue before. Also, the method that needs to call the queue is an ephemeral one, so I'd need to somehow pass the custom queue around if I'm going to go that route. How does one pass a queue as an argument?
For now, I'm running with my idea, above, dispatching synchronously to the main queue, and the app seems to work fine. Worst case scenario, this snippet of code would be run about 4800 times (4 for each of 1200 objects, which is currently the max.).
I assume you want computationally random numbers, rather than cryptographic random numbers.
My suggestion would be to have separate RNGs for each thread, with each thread's RNG seeded centrally from a master RNG. Since the system RNG is not thread-safe, create your own small RNG method -- a good LCG should work -- for use exclusively within one thread.
Use the built-in random() only to produce the initial seeds for each of your sub-threads. Setting the overall initial seed with srandom() will ensure that the thread-local my_random() methods all get a consistent initial reseed, as long as the threads are started in the same order each time.
Effectively you are building a hierarchy of RNGs to match your hierarchy of threads.
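As a sketch of that hierarchy, here in Swift for brevity (the generator and names are mine; the LCG constants are Knuth's MMIX multiplier and increment):

import Foundation

// A small linear congruential generator; one private instance per sequence/thread.
struct LCG: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { self.state = seed }
    mutating func next() -> UInt64 {
        state = state &* 6364136223846793005 &+ 1442695040888963407
        return state
    }
}

srandom(42) // one master seed for the whole run

// The master RNG hands each of the four sequences its own deterministic seed...
var sequenceRNGs = (0..<4).map { _ in LCG(seed: UInt64(random())) }

// ...so each sequence of 14 numbers is reproducible, independent of the other threads.
let firstSequence = (0..<14).map { _ in sequenceRNGs[0].next() % 100 }
print(firstSequence)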
Another option would be to have a singleton do the computation. The object needing the set of random numbers would ask the singleton for them in a batch.
I have a general Objective-C pattern/practice question relating to a problem I'm trying to solve in my app. I could not find a similar Objective-C-focused question/answer here yet.
My app holds a mutable array of objects which I call "Records". The app gathers records and puts them into that array in one of two ways:
1. It reads data from a SQLite database available locally within the app's sandbox. The read is usually very fast.
2. It requests data asynchronously from a web service, waits for it to finish, then parses the data. The read can be fast, but often it is not.
Sometimes the app reads from the database (1) and requests data from the web service (2) at essentially the same time. It is often the case that (1) will finish before (2), and adding Records to the mutable array does not cause a conflict.
I am worried that at some point my SQLite read will take a bit longer than expected and will try to add objects to the mutable array at the exact same time the async request finishes and does the same, or vice versa. These are edge cases that seem difficult to test for, but they would surely make my app crash, or at the very least cause issues with my array of records.
I should also point out that the Records are to be merged into the mutable array. For example: if (1) runs first and returns 10 records, then shortly after (2) finishes and returns 5 records, my mutable array will contain all 15 records. I'm combining the data rather than overwriting it.
What I want to know is:
1. Is it safe for me to add objects to the same mutable array instance when the processes, either (1) or (2) finish?
2. Is there a good pattern/practice to implement for this sort of processing in Objective-C?
3. Does this involve locking access to the mutable array, so that when (1) is adding objects to it, (2) can't add any objects until (1) is done with it?
I appreciate any info you could share.
[EDIT #1]
For posterity, I found this URL to be a great help in understanding how to use NSOperations and an NSOperationQueue. It is a bit out of date but works nonetheless:
http://www.raywenderlich.com/19788/how-to-use-nsoperations-and-nsoperationqueues
Also, it doesn't talk specifically about the problem I'm trying to solve, but the example it uses is practical and easy to understand.
[EDIT #2]
I've decided to go with the approach suggested by danh, where I'll read locally and, as needed, hit my web service after the local read finishes (which should be fast anyway). That said, I'm going to try to avoid synchronization issues altogether. Why? Because Apple says so, here:
http://developer.apple.com/library/IOS/#documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html#//apple_ref/doc/uid/10000057i-CH8-SW8
Avoid Synchronization Altogether
For any new projects you work on, and even for existing projects, designing your code and data structures to avoid the need for synchronization is the best possible solution. Although locks and other synchronization tools are useful, they do impact the performance of any application. And if the overall design causes high contention among specific resources, your threads could be waiting even longer.
The best way to implement concurrency is to reduce the interactions and inter-dependencies between your concurrent tasks. If each task operates on its own private data set, it does not need to protect that data using locks. Even in situations where two tasks do share a common data set, you can look at ways of partitioning that set or providing each task with its own copy. Of course, copying data sets has its costs too, so you have to weigh those costs against the costs of synchronization before making your decision.
Is it safe for me to add objects to the same mutable array instance when the processes, either (1) or (2) finish?
Absolutely not. NSMutableArray, like the rest of the collection classes, is not synchronized. You could use it in conjunction with some kind of lock when you add and remove objects, but that's definitely a lot slower than just making two arrays (one for each operation) and merging them when both have finished.
Is there a good pattern/practice to implement for this sort of processing in Objective-C?
Unfortunately, no. The most you can come up with is tripping a Boolean, or incrementing an integer to a certain number in a common callback. To see what I mean, here's a little pseudo-code:
- (void)someAsyncOpDidFinish:(NSSomeOperation *)op {
    finishedOperations++;
    if (finishedOperations == 2) {
        finishedOperations = 0;
        // Both are finished, process
    }
}
Does this involve locking access to the mutable array, so that when (1) is adding objects to it, (2) can't add any objects until (1) is done with it?
Yes, see above.
You should either lock around your array modifications or schedule your modifications on the main thread. The SQL fetch is probably running on the main thread, so in your remote-fetch code you could do something like:
dispatch_async(dispatch_get_main_queue(), ^{
    [myArray addObject:newThing];
});
If you are adding a bunch of objects, this will be slow, since it puts a new task on the scheduler for each record. If that is the case, collect the records in a temporary array on the background thread and add them all at once with addObjectsFromArray:.
Personally, I'd be inclined to have a concurrent NSOperationQueue and add two retrieval operations to it: one for the database read and one for the network request. I would then have a dedicated serial queue for adding records to the NSMutableArray, which each of the two concurrent retrieval operations would use when appending records. That way you have a single queue feeding the mutable array, fed by the two retrieval operations running on the other, concurrent queue. If you need to know when both retrieval operations are done, add a third operation to the concurrent queue and set its dependencies to be the two retrieval operations; it will fire automatically when they finish. A sketch follows.
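Here is a minimal sketch of that arrangement, in Swift for brevity (the operation names and stand-in data are mine):

import Foundation

let retrievalQueue = OperationQueue()                  // concurrent by default
let arrayQueue = DispatchQueue(label: "records.array") // serializes array access
var records = [String]()

let databaseOp = BlockOperation {
    let local = ["db-record-1", "db-record-2"]         // stand-in for the SQLite read
    arrayQueue.sync { records.append(contentsOf: local) }
}

let networkOp = BlockOperation {
    let remote = ["web-record-1"]                      // stand-in for the web request
    arrayQueue.sync { records.append(contentsOf: remote) }
}

let doneOp = BlockOperation {
    arrayQueue.sync { print("All \(records.count) records loaded") }
}
doneOp.addDependency(databaseOp)
doneOp.addDependency(networkOp)

retrievalQueue.addOperations([databaseOp, networkOp, doneOp], waitUntilFinished: true)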
In addition to the good suggestions above, consider not launching the GET and the SQL read concurrently:
[self doTheLocalLookupThen:^{
    // update the array and ui
    [self doTheServerGetThen:^{
        // update the array and ui
    }];
}];
- (void)doTheLocalLookupThen:(void (^)(void))completion {
    if ([self skipTheLocalLookup]) return completion();
    // do the local lookup, invoke completion
}

- (void)doTheServerGetThen:(void (^)(void))completion {
    if ([self skipTheServerGet]) return completion();
    // do the server get, invoke completion
}
Update: the while() condition below gets optimized out by the compiler, so both threads just skip the condition and enter the C.S., even with the -O0 flag. Does anyone know why the compiler is doing this? By the way, declaring the global variables volatile causes the program to hang for some odd reason...
I read the CUDA programming guide but I'm still a bit unclear on how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy) Basically, I am running tests trying to break sequential consistency. The algorithm I am using is Peterson's algorithm for mutual exclusion between two threads inside the kernel function:
flag[threadIdx.x] = 1; // both these are global
turn = 1 - threadIdx.x;
while (flag[1 - threadIdx.x] == 1 && turn == (1 - threadIdx.x));
shared_global_variable_x++;
flag[threadIdx.x] = 0;
This is fairly straightforward. Each thread asks for the critical section by setting its flag to one, and is nice by giving the turn to the other thread. At the evaluation of the while(), if the other thread has not set its flag, the requesting thread can enter the critical section safely. Now, a subtle problem with this approach is that the compiler may re-order the writes so that the write to turn executes before the write to flag. If this happens, both threads will end up in the C.S. at the same time. This is fairly easy to prove with normal Pthreads, since most processors don't implement sequential consistency. But what about GPUs?
Both of these threads will be in the same warp, and they will execute their statements in lock-step. But when they reach the turn variable, they are writing to the same location, so the intra-warp execution becomes serialized (it doesn't matter in what order). Now, at this point, does the thread that wins proceed to the while condition, or does it wait for the other thread to finish its write, so that both evaluate the while() at the same time? The paths will again diverge at the while(), because only one of them will win while the other waits.
After running the code, I am getting it to consistently break SC. The value I read is ALWAYS 1, which means that both threads somehow are entering the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I have compiled it with -O0, so no compiler optimization, and hence no use of volatile).
Edit: since you have only two threads and 1-threadIdx.x works, you must be using thread IDs 0 and 1. Threads 0 and 1 will always be part of the same warp on all current NVIDIA GPUs. Warps execute instructions in SIMD fashion, with a thread-execution mask for divergent conditions. Your while loop is a divergent condition.
When turn and flag are not volatile, the compiler probably reorders the instructions and you see the behavior of both threads entering the C.S.
When turn and flag are volatile, you see a hang. The reason is that one of the threads will succeed at writing turn, so turn will be either 0 or 1. Let's assume turn == 0: if the hardware chooses to execute thread 0's part of the divergent branch, then all is OK. But if it chooses to execute thread 1's part of the divergent branch, it will spin on the while loop, and thread 0 will never get its turn, hence the hang.
You can probably avoid the hang by ensuring that your two threads are in different warps, but I think that the warps must be concurrently resident on the SM so that instructions can issue from both and progress can be made. (Might work with concurrent warps on different SMs, since this is global memory; but that might require __threadfence() and not just __threadfence_block().)
In general this is a great example of why code like this is unsafe on GPUs and should not be used. I realize, though, that this is just an investigative experiment. In general, CUDA GPUs do not implement sequential consistency (as you mention, most processors do not).
Original Answer
1. The variables turn and flag need to be declared volatile; otherwise the load of flag will not be repeated, and the condition turn == 1-threadIdx.x will not be re-evaluated but will instead be taken as true.
2. There should be a __threadfence_block() between the store to flag and the store to turn to get the right ordering.
3. There should be a __threadfence_block() before the shared variable increment (and that variable should also be declared volatile). You may also want a __syncthreads(), or at least a __threadfence_block(), after the increment to ensure it is visible to other threads.
I have a hunch that even after making these fixes you may still run into trouble, though. Let us know how it goes.
BTW, you have a syntax error in this line, so it's clear this isn't exactly your real code:
while(flag[1-threadIdx.x] == 1 and turn==[1- threadIdx.x]);
In the absence of extra memory barriers such as __threadfence(), sequential consistency of global memory is enforced only within a given thread.