Using random() function on multiple threads - ios

I'm working on an app where I need reproducible random numbers. I use srandom() with a seed to initialize the random number sequence. Then I use random() to generate the random numbers from this seed. If this is the only thread generating random numbers, everything works fine. However, if there are multiple threads generating random numbers, they interfere with each other.
Apparently, the sequence of random numbers is not thread safe. There must be a central random number generator that is called by all threads.
My app generates hundreds of objects, each one of which has four sequences of 14 random numbers generated this way. Each of these 4 sequences has its own non-random seed. This way, the random numbers should be reproducible. The problem is, because of the thread interference I just described, sometimes the sequence of 14 numbers being generated will be interrupted by a random number request by another thread.
After thinking about this for a while, I've decided to call
dispatch_sync(dispatch_get_main_queue(), ^{//generate the 14 numbers});
to get each sequence. This should force them to get generated in the proper sequence. In reading the documentation, it says there could be a deadlock if dispatch_sync is called on the queue it's running in. How can I tell if I'm already on the main queue? If I am, I don't need to dispatch anything, right?
Is there a better way to do this?
I suspect another way to do this is similar to this but using a dedicated queue instead of the main queue. I've never tried making my own queue before. Also, the method that needs to call the queue is an ephemeral one, so I'd need to somehow pass the custom queue around if I'm going to go that route. How does one pass a queue as an argument?
For now, I'm running with my idea, above, dispatching synchronously to the main queue, and the app seems to work fine. Worst case scenario, this snippet of code would be run about 4800 times (4 for each of 1200 objects, which is currently the max.).

I assume you want computationally random numbers, rather than cryptographic random numbers.
My suggestion would be to have a separate RNG for each thread, with each thread's RNG seeded centrally from a master RNG. Since the system RNG is not thread safe, create your own small RNG method (a good LCG should work) for use exclusively within one thread.
Use the built-in random() to produce only the initial seeds for each of your sub-threads. Setting the overall initial seed with srandom() will ensure that the thread local my_random() methods will all get a consistent initial reseed as long as the threads are started in the same order each time.
Effectively you are building a hierarchy of RNGs to match your hierarchy of threads.
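A minimal sketch of such a private generator in plain C, assuming a 32-bit LCG is good enough for the app. The names my_rng, my_rng_seed, my_rng_next and my_rng_fill are made up for illustration, and the multiplier and increment are the well-known Numerical Recipes constants rather than anything the question prescribes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-sequence generator: all state lives in this struct,
   so two generators can never disturb each other. */
typedef struct {
    uint32_t state;
} my_rng;

/* Start a sequence from a fixed, non-random seed. */
static void my_rng_seed(my_rng *rng, uint32_t seed) {
    assert(rng != NULL);
    rng->state = seed;
}

/* One LCG step. Unsigned 32-bit arithmetic wraps around,
   which supplies the modulus 2^32 for free. */
static uint32_t my_rng_next(my_rng *rng) {
    assert(rng != NULL);
    rng->state = rng->state * 1664525u + 1013904223u;
    return rng->state >> 8;  /* drop the weakest low-order bits */
}

/* Fill out[0..count-1] with the sequence that belongs to `seed`. */
static void my_rng_fill(uint32_t seed, uint32_t *out, size_t count) {
    assert(out != NULL);
    my_rng rng;
    my_rng_seed(&rng, seed);
    for (size_t i = 0; i < count; i++) {
        out[i] = my_rng_next(&rng);
    }
}
```

Because the state is passed in explicitly, each of the four sequences per object can own its generator, and a call such as my_rng_fill(seed, numbers, 14) gives the same 14 values no matter what other threads are doing.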

Another option would be to have a singleton do the computation. The object needing the set of random numbers would ask the singleton for them in a batch.
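One way the batch idea could look in plain C, as a rough sketch: a single lock stands in for the singleton, and the whole batch is seeded and drawn while the lock is held. The function name random_batch is hypothetical, and this only helps if every caller of random() in the process goes through it:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* One process-wide lock guards the shared state behind random(). */
static pthread_mutex_t batch_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper: reseed and draw the whole batch while holding
   the lock, so no other caller of this function can interleave draws. */
static void random_batch(unsigned int seed, long *out, size_t count) {
    assert(out != NULL);
    pthread_mutex_lock(&batch_lock);
    srandom(seed);
    for (size_t i = 0; i < count; i++) {
        out[i] = random();
    }
    pthread_mutex_unlock(&batch_lock);
}
```

Unlike dispatch_sync to the main queue, a plain mutex like this cannot deadlock just because the caller happens to be on the main thread, as long as random_batch is never called re-entrantly.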

Related

How to implement a lock and unlock sequence in a metal shader?

How should I implement a lock/unlock sequence with Compare and Swap using a Metal compute shader.
I’ve tested this sample code but it does not seem to work. For some reason, the threads are not detecting that the lock was released.
Here is a brief explanation of the code below:
The depthFlag is an array of atomic_bools. In this simple example, I simply try to do a lock by comparing the content of depthFlag[1]. I then go ahead and do my operation and once the operation is done, I do an unlock.
As stated above, only one thread is able to do the locking/work/unlocking but the rest of the threads get stuck in the while loop. They NEVER leave. I expect another thread to detect the unlock and go through the sequence.
What am I doing wrong? My knowledge on CAS is limited, so I appreciate any tips.
kernel void testFunction(device float *depthBuffer[[buffer(4)]], device atomic_bool *depthFlag [[buffer(5)]], uint index[[thread_position_in_grid]]){
//lock
bool expected=false;
while(!atomic_compare_exchange_weak_explicit(&depthFlag[1],&expected,true,memory_order_relaxed,memory_order_relaxed)){
//wait
expected=false;
}
//Do my operation here
//unlock
atomic_store_explicit(&depthFlag[1], false, memory_order_relaxed);
//barrier
}
You essentially can't use the locking programming model for GPU concurrency. For one, the relaxed memory order model (the only one available) is not suitable for this; for another, you can't guarantee that other threads will make progress between your atomic operations. Your code must always be able to make progress, regardless of what the other threads are doing.
My recommendation is that you use something like the following model instead:
1) Read the atomic value to check whether another thread has already completed the operation in question.
2) If no other thread has done it yet, perform the operation. (But don't cause any side effects, i.e. don't write to device memory.)
3) Perform an atomic operation to indicate that your thread has completed the operation, while checking whether another thread got there first (e.g. compare-and-swap a boolean, but incrementing a counter also works).
4) If another thread got there first, don't perform side effects.
5) If your thread "won" and no other thread registered completion, perform your operation's side effects, e.g. do whatever you need to do to write out the result.
This works well if there's not much competition, and if the result does not vary depending on which thread performs the operation.
The occasional discarded work should not matter. If there is significant competition, use thread groups; within a thread group, the threads can coordinate which thread will perform which operation. You may still end up with wasted computation from competition between groups. If this is a problem, you may need to change your approach more fundamentally.
If the results of the operation are not deterministic, and the threads all need to proceed using the same result, you will need to change your approach. For example, split your kernels up so any computation which depends on the result of the operation in question runs in a sequentially queued kernel.

Why is an object's mutability relevant to its thread safety?

I am working with Core Image on iOS. As you know, CIContext and CIImage are immutable, so they can be shared between threads. The problem is that I am not sure why an object's mutability is so closely tied to its thread safety.
I can guess the rough reason is to prevent multiple threads from doing something to a certain object at the same time. But can anybody provide some theoretical background or give a specific answer?
You are correct, mutable objects are not thread safe because multiple threads can write to that data at the same time. This is in contrast to reading data, an operation multiple threads can do simultaneously without causing problems. Immutable types can only be read.
However, when multiple threads write to the same data, the operations can interfere. For example, an NSMutableArray that two threads are editing could easily be corrupted: one thread edits at one location and suddenly changes memory the other thread was updating. For simple data types, iOS can use what is called an atomic operation, which means the edit must fully finish before anything else can touch the value. This is more efficient than taking a lock. Search for "atomic operations" if you want to know more.
Well, actually, you are right.
If you have an immutable object, all you can do is read data from it. Every thread will get the same data, since the object cannot be changed.
When you have a mutable object, the problem is that (in general) read and write operations are not atomic. That means they are not performed instantaneously; they take time, so they can overlap each other. Even on a single-core processor, the system can switch between threads at arbitrary moments. Possible problems this might cause:
1) Two threads write to the same object simultaneously. In this case the object might become corrupted (for instance, half of the data comes from the first thread and the other half from the second, so the result is unpredictable).
2) One thread writes the data while another one reads it. One problem is that the reading thread might get outdated data. Another is that it might read corrupted data (if the writing thread hasn't finished writing yet).
Take the example with an array of values. Let's say I'm doing some computation and I find the count of the array to be 4
var array = [1,2,3,4]
let count = array.count
Now maybe I want to loop through all the elements in the array, so I set up a loop that runs while index i < 4 or something along those lines. So far so good.
This array is mutable though, so on a separate thread, I could easily remove an element. So perhaps I'm on thread 1 and I get the count to be 4, I perhaps start looping through the array. Now we switch to thread 2 and I remove an element, so now this same array has only 3 values in it. If I end up going back to thread 1 and I still assume I have 4 values while looping through the array, my program is going to crash when it tries to access the 4th element.
This is why immutability is desirable, it can guarantee some level of consistency across threads.
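When the array does have to stay mutable, the stale-count crash described above can be avoided by doing the bounds check and the read under one lock. A small sketch in C with pthreads, assuming a fixed-capacity array of ints; shared_array and its functions are hypothetical names:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define CAPACITY 8

typedef struct {
    pthread_mutex_t lock;
    int items[CAPACITY];
    size_t count;
} shared_array;

static void shared_array_init(shared_array *a, const int *values, size_t n) {
    assert(a != NULL && values != NULL && n <= CAPACITY);
    pthread_mutex_init(&a->lock, NULL);
    for (size_t i = 0; i < n; i++) {
        a->items[i] = values[i];
    }
    a->count = n;
}

/* What "thread 2" does in the story: shrink the array by one. */
static void shared_array_remove_last(shared_array *a) {
    assert(a != NULL);
    pthread_mutex_lock(&a->lock);
    if (a->count > 0) {
        a->count--;
    }
    pthread_mutex_unlock(&a->lock);
}

/* Check the index and read the element in one critical section, so a
   count observed earlier can never lead to an out-of-range read. */
static bool shared_array_get(shared_array *a, size_t index, int *out) {
    assert(a != NULL && out != NULL);
    pthread_mutex_lock(&a->lock);
    bool ok = index < a->count;
    if (ok) {
        *out = a->items[index];
    }
    pthread_mutex_unlock(&a->lock);
    return ok;
}
```

A reader that remembered a count of 4 and then asks for index 3 after the removal gets false back instead of touching memory that is no longer part of the array.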

Best practice for writing to resource from two different processes in objective-c

I have a general objective-c pattern/practice question relative to a problem I'm trying to solve with my app. I could not find a similar objective-c focused question/answer here, yet.
My app holds a mutable array of objects which I call "Records". The app gathers records and puts them into that array in one of two ways:
(1) It reads data from a SQLite database available locally within the app's sandbox. The read is usually very fast.
(2) It requests data asynchronously from a web service, waits for it to finish, then parses the data. The read can be fast, but often it is not.
Sometimes the app reads from the database (1) and requests data from the web service (2) at essentially the same time. It is often the case that (1) will finish before (2) finishes and adding Records to the mutable array does not cause a conflict.
I am worried that at some point my SQLite read process will take a bit longer than expected and it will try to add objects to the mutable array at the exact same time the async request finishes and does the same; or vice-versa. These are edge cases that seem difficult to test for but that surely would make my app crash or at the very least cause issues with my array of records.
I should also point out that the Records are to be merged into the mutable array. For example: if (1) runs first and returns 10 records, then shortly after (2) finishes and returns 5 records, my mutable array will contain all 15 records. I'm combining the data rather than overwriting it.
What I want to know is:
Is it safe for me to add objects to the same mutable array instance when the processes, either (1) or (2) finish?
Is there a good pattern/practice to implement for this sort of processing in objective-c?
Does this involve locking access to the mutable array so when (1) is adding objects to it (2) can't add any objects until (1) is done with it?
I appreciate any info you could share.
[EDIT #1]
For posterity, I found this URL to be a great help in understanding how to use NSOperations and an NSOperationQueue. It is a bit out of date, but works nonetheless:
http://www.raywenderlich.com/19788/how-to-use-nsoperations-and-nsoperationqueues
It doesn't talk specifically about the problem I'm trying to solve, but the example it uses is practical and easy to understand.
[EDIT #2]
I've decided to go with the approach suggested by danh, where I'll read locally and, as needed, hit my web service after the local read has finished (which should be fast anyway). That said, I'm going to try to avoid synchronization issues altogether. Why? Because Apple says so, here:
http://developer.apple.com/library/IOS/#documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html#//apple_ref/doc/uid/10000057i-CH8-SW8
Avoid Synchronization Altogether
For any new projects you work on, and even for existing projects, designing your code and data structures to avoid the need for synchronization is the best possible solution. Although locks and other synchronization tools are useful, they do impact the performance of any application. And if the overall design causes high contention among specific resources, your threads could be waiting even longer.
The best way to implement concurrency is to reduce the interactions and inter-dependencies between your concurrent tasks. If each task operates on its own private data set, it does not need to protect that data using locks. Even in situations where two tasks do share a common data set, you can look at ways of partitioning that set or providing each task with its own copy. Of course, copying data sets has its costs too, so you have to weigh those costs against the costs of synchronization before making your decision.
Is it safe for me to add objects to the same mutable array instance when the processes, either (1) or (2) finish?
Absolutely not. NSMutableArray, like the rest of the mutable collection classes, is not synchronized. You can use it in conjunction with some kind of lock when you add and remove objects, but that's definitely a lot slower than just making two arrays (one for each operation) and merging them when both operations finish.
Is there a good pattern/practice to implement for this sort of processing in objective-c?
Unfortunately, no. The most you can come up with is tripping a Boolean, or incrementing an integer to a certain number in a common callback. To see what I mean, here's a little pseudo-code:
- (void)someAsyncOpDidFinish:(NSSomeOperation *)op {
    finishedOperations++;
    if (finishedOperations == 2) {
        finishedOperations = 0;
        // Both are finished, process
    }
}
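The same idea in plain C, as a sketch only: each operation fills its own private array, and whichever one finishes last does the merge. The counter is a C11 atomic here because the two callbacks may arrive on different threads, which the pseudo-code above does not guard against. record_store, operation_did_finish and the int "records" are placeholders:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_RECORDS 64

typedef struct {
    int local_records[MAX_RECORDS];   /* filled only by the database read */
    size_t local_count;
    int remote_records[MAX_RECORDS];  /* filled only by the web request */
    size_t remote_count;
    int merged[2 * MAX_RECORDS];      /* written only by the last finisher */
    size_t merged_count;
    atomic_int finished_operations;
} record_store;

/* Each operation calls this once, after it has filled its own array.
   Returns true for the caller that finished last and did the merge. */
static bool operation_did_finish(record_store *s) {
    assert(s != NULL);
    /* atomic_fetch_add returns the previous value of the counter. */
    if (atomic_fetch_add(&s->finished_operations, 1) + 1 < 2) {
        return false;  /* the other operation is still running */
    }
    memcpy(s->merged, s->local_records,
           s->local_count * sizeof s->merged[0]);
    memcpy(s->merged + s->local_count, s->remote_records,
           s->remote_count * sizeof s->merged[0]);
    s->merged_count = s->local_count + s->remote_count;
    atomic_store(&s->finished_operations, 0);  /* ready for the next round */
    return true;
}
```

Since neither operation ever touches the other's array, no lock is needed while they run; the only shared step is the counter update.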
Does this involve locking access to the mutable array so when (1) is adding objects to it (2) can't add any objects until (1) is done with it?
Yes, see above.
You should either lock around your array modifications, or schedule your modifications in the main thread. The SQL fetch is probably running in the main thread, so in your remote fetch code you could do something like:
dispatch_async(dispatch_get_main_queue(), ^{
    [myArray addObject:newThing];
});
If you are adding a bunch of objects this will be slow since it is putting a new task on the scheduler for each record. You can bunch the records in a separate array in the thread and add the temp array using addObjectsFromArray: if that is the case.
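For the "lock around your array modifications" route, the batching advice applies just the same: collect records in a private temporary array and append them in one go. A sketch in C with pthreads, where record_list and record_list_append_batch are hypothetical names and ints stand in for records:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LIST_CAPACITY 128

typedef struct {
    pthread_mutex_t lock;
    int records[LIST_CAPACITY];
    size_t count;
} record_list;

static void record_list_init(record_list *list) {
    assert(list != NULL);
    pthread_mutex_init(&list->lock, NULL);
    list->count = 0;
}

/* Append a whole batch under a single lock acquisition, the C analogue
   of addObjectsFromArray:. Returns false if the batch does not fit. */
static bool record_list_append_batch(record_list *list,
                                     const int *batch, size_t n) {
    assert(list != NULL && batch != NULL);
    pthread_mutex_lock(&list->lock);
    bool ok = n <= LIST_CAPACITY - list->count;
    if (ok) {
        memcpy(list->records + list->count, batch, n * sizeof batch[0]);
        list->count += n;
    }
    pthread_mutex_unlock(&list->lock);
    return ok;
}
```

Taking the lock once per batch rather than once per record keeps the critical section short and the contention low.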
Personally, I'd be inclined to have a concurrent NSOperationQueue and add the two retrieval operations to it: one for the database operation, one for the network operation. I would then have a dedicated serial queue for adding the records to the NSMutableArray, which each of the two concurrent retrieval operations would use to add records to the mutable array. That way you have one queue for adding records, fed by the two retrieval operations running on the other, concurrent queue. If you need to know when the two concurrent retrieval operations are done, I'd add a third operation to that concurrent queue and set its dependencies to be the two retrieval operations, so it fires automatically when they are done.
In addition to the good suggestions above, consider not launching the GET and the sql concurrently.
[self doTheLocalLookupThen:^{
// update the array and ui
[self doTheServerGetThen:^{
// update the array and ui
}];
}];
- (void)doTheLocalLookupThen:(void (^)(void))completion {
if ([self skipTheLocalLookup]) return completion();
// do the local lookup, invoke completion
}
- (void)doTheServerGetThen:(void (^)(void))completion {
if ([self skipTheServerGet]) return completion();
// do the server get, invoke completion
}

Parse a list with 3 threads at a time; after 5 completed works, signal the server to do something

Hi, I am curious whether anyone knows a tutorial or example where semaphores are used with more than one process or thread. I am trying to solve the following problem. I have an array of elements and x threads. The threads work on the array, but only 3 at a time. After 5 works have been completed, the server is signaled and it cleans up those 5 nodes. I'm having trouble designing this. (Each node contains a worker value holding the 'name' of the thread that is allowed to work on it, namely nrNodes % nrThreads.)
To make changes to the list, a mutex is necessary so that threads do not overwrite each other or make false evaluations.
But I have no clue how to limit the list to 3 threads parsing it at any given point, or how to signal the main thread to start a cleaning session. I have been thinking about using a semaphore and a global counter. When the counter reaches 5, the server (which would probably be another thread) is signaled.
Sorry for the lack of code, but this is a conceptual question; what I have written so far doesn't affect the question in any way.
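One possible shape for this, sketched in C with a pthread mutex and two condition variables rather than a raw semaphore (a counting semaphore initialised to 3 would do the same job as the active counter). All names here are made up, and the limits 3 and 5 are taken from the description above:

```c
#include <assert.h>
#include <pthread.h>

#define MAX_ACTIVE 3  /* threads allowed on the list at once */
#define BATCH_SIZE 5  /* completed works before the server cleans up */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t slot_free;    /* signalled when a worker leaves */
    pthread_cond_t batch_ready;  /* signalled when a batch is complete */
    int active;
    int completed;
} work_gate;

static void gate_init(work_gate *g) {
    assert(g != NULL);
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->slot_free, NULL);
    pthread_cond_init(&g->batch_ready, NULL);
    g->active = 0;
    g->completed = 0;
}

/* Worker: block until fewer than MAX_ACTIVE threads are working. */
static void gate_enter(work_gate *g) {
    pthread_mutex_lock(&g->lock);
    while (g->active == MAX_ACTIVE) {
        pthread_cond_wait(&g->slot_free, &g->lock);
    }
    g->active++;
    pthread_mutex_unlock(&g->lock);
}

/* Worker: report one finished work and free the slot. */
static void gate_leave(work_gate *g) {
    pthread_mutex_lock(&g->lock);
    g->active--;
    g->completed++;
    pthread_cond_signal(&g->slot_free);
    if (g->completed >= BATCH_SIZE) {
        pthread_cond_signal(&g->batch_ready);
    }
    pthread_mutex_unlock(&g->lock);
}

/* Server: block until a full batch is done, then take it for cleaning. */
static void gate_collect(work_gate *g) {
    pthread_mutex_lock(&g->lock);
    while (g->completed < BATCH_SIZE) {
        pthread_cond_wait(&g->batch_ready, &g->lock);
    }
    g->completed -= BATCH_SIZE;
    pthread_mutex_unlock(&g->lock);
}
```

Each worker would call gate_enter, process its node, then gate_leave; the server thread loops on gate_collect and cleans five nodes each time it returns. The list itself still needs its own mutex for the actual edits.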

The memory consistency model CUDA 4.0 and global memory?

Update: The while() condition below gets optimized out by the compiler, so both threads just skip the condition and enter the C.S. even with -O0 flag. Does anyone know why the compiler is doing this? By the way, declaring the global variables volatile causes the program to hang for some odd reason...
I read the CUDA programming guide but I'm still a bit unclear on how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy) Basically, I am running tests trying to break sequential consistency. The algorithm I am using is Peterson's algorithm for mutual exclusion between two threads inside the kernel function:
flag[threadIdx.x] = 1; // both these are global
turn = 1-threadIdx.x;
while(flag[1-threadIdx.x] == 1 && turn == (1- threadIdx.x));
shared_global_variable_x++;
flag[threadIdx.x] = 0;
This is fairly straightforward. Each thread asks for the critical section by setting its flag to one and by being nice and giving the turn to the other thread. At the evaluation of the while(), if the other thread did not set its flag, the requesting thread can then enter the critical section safely. A subtle problem with this approach arises if the compiler re-orders the writes so that the write to turn executes before the write to flag. If this happens, both threads will end up in the C.S. at the same time. This is fairly easy to prove with normal Pthreads, since most processors don't implement sequential consistency. But what about GPUs?
Both of these threads will be in the same warp. And they will execute their statements in lock-step mode. But when they reach the turn variable they are writing to the same variable so the intra-warp execution becomes serialized (doesn't matter what the order is). Now at this point, does the thread that wins proceed onto the while condition, or does it wait for the other thread to finish its write, so that both can then evaluate the while() at the same time? The paths again will diverge at the while(), because only one of them will win while the other waits.
After running the code, I am getting it to consistently break SC. The value I read is ALWAYS 1, which means that both threads somehow are entering the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I have compiled it with -O0, so no compiler optimization, and hence no use of volatile).
Edit: since you have only two threads and 1-threadIdx.x works, then you must be using thread IDs 0 and 1. Threads 0 and 1 will always be part of the same warp on all current NVIDIA GPUs. Warps execute instructions SIMD fashion, with a thread execution mask for divergent conditions. Your while loop is a divergent condition.
When turn and flags are not volatile, the compiler probably reorders the instructions and you see the behavior of both threads entering the C.S.
When turn and flags are volatile, you see a hang. The reason is that one of the threads will succeed at writing turn, so turn will be either 0 or 1. Let's assume turn==0: If the hardware chooses to execute thread 0's part of the divergent branch, then all is OK. But if it chooses to execute thread 1's part of the divergent branch, then it will spin on the while loop and thread 0 will never get its turn, hence the hang.
You can probably avoid the hang by ensuring that your two threads are in different warps, but I think that the warps must be concurrently resident on the SM so that instructions can issue from both and progress can be made. (Might work with concurrent warps on different SMs, since this is global memory; but that might require __threadfence() and not just __threadfence_block().)
In general this is a great example of why code like this is unsafe on GPUs and should not be used. I realize though that this is just an investigative experiment. In general CUDA GPUs do not—as you mention most processors do not—implement sequential consistency.
Original Answer
The variables turn and flag need to be volatile; otherwise the load of flag will not be repeated, and the condition turn == 1-threadIdx.x will not be re-evaluated but will instead be taken as true.
There should be a __threadfence_block() between the store to flag and store to turn to get the right ordering.
There should be a __threadfence_block() before the shared variable increment (which should also be declared volatile). You may also want a __syncthreads() or at least __threadfence_block() after the increment to ensure it is visible to other threads.
I have a hunch that even after making these fixes you may still run into trouble, though. Let us know how it goes.
BTW, you have a syntax error in this line, so it's clear this isn't exactly your real code:
while(flag[1-threadIdx.x] == 1 and turn==[1- threadIdx.x]);
In the absence of extra memory barriers such as __threadfence(), sequential consistency of global memory is enforced only within a given thread.
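For comparison, here is roughly what the same algorithm needs on a CPU, written with C11 atomics instead of CUDA. This is only a sketch of the ordering requirement: the default sequentially consistent atomic operations play the role of the fences discussed above. It assumes two preemptively scheduled threads with ids 0 and 1, so it does not carry over to two threads of the same warp, for the divergence reasons already explained. The function names are made up:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int flag[2];   /* zero-initialised: nobody wants the lock */
static atomic_int turn;
static int shared_counter;   /* plain variable, protected by the lock */

static void peterson_lock(int self) {
    assert(self == 0 || self == 1);
    int other = 1 - self;
    /* Sequentially consistent stores: the store to flag cannot be
       reordered after the store to turn, by compiler or hardware. */
    atomic_store(&flag[self], 1);
    atomic_store(&turn, other);
    /* Atomic loads are re-read on every iteration, which is what
       volatile was being used for in the kernel. */
    while (atomic_load(&flag[other]) == 1 && atomic_load(&turn) == other) {
        /* spin */
    }
}

static void peterson_unlock(int self) {
    assert(self == 0 || self == 1);
    atomic_store(&flag[self], 0);
}

/* The critical section from the question. */
static void increment_shared(int self) {
    peterson_lock(self);
    shared_counter++;
    peterson_unlock(self);
}
```

With relaxed or non-atomic accesses in place of the two stores, the reordering described in the question becomes legal again and both threads can enter the critical section together.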

Resources