trouble reading from __global memory after atom_inc in OpenCL - memory

OpenCL doesn't have a global barrier that will stop all threads, so I'm trying to create a workaround with the following code:
void barrier(__global uint* scratch) {
    uint nThreads = get_global_size(0);
    atom_inc(scratch);
    /* this loop never terminates */
    while (scratch[0] < nThreads) {
        continue;
    }
}
The idea is that each thread loops until all of them increment that one piece of memory.
However, the value read from scratch[0] never changes for the threads once it's been read, and it loops forever. I know it's being incremented because it's the correct value when I read it back to the host.
Is the global memory being locally cached? What's going on here?

Found the problem: the order in which work groups are executed is implementation defined. This means that some threads might start only after others have finished.
In the code I gave, the work groups that are started first will loop forever waiting on the others to hit the 'barrier'. And the work groups that would be started later never start, because they're waiting for the first ones to finish.
If the implementation (I'm on a Radeon 5750, using Stream SDK 2.2) executes all work groups concurrently, then it probably wouldn't be an issue. But that's not the case for my setup.
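For what it's worth, the usual way to get a true global barrier is to split the work into two kernel launches on the host, so the boundary between enqueues acts as the barrier across all work-groups. A minimal host-side sketch only; phase1_kernel, phase2_kernel, queue, global_size and local_size are hypothetical names, not from the original code:
/* everything before the 'barrier' goes into phase1, everything after into phase2 */
clEnqueueNDRangeKernel(queue, phase1_kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
clFinish(queue);   /* every work-group of phase1 has completed past this point */
clEnqueueNDRangeKernel(queue, phase2_kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);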

Related

Need explanation for an excerpt from Apple's documentation on NSRunLoop

Apple's official documentation is sometimes difficult to understand, especially for non-native speakers. This is an excerpt from Anatomy of NSRunLoop:
A run loop is very much like its name sounds. It is a loop your thread enters and uses to run event handlers in response to incoming events. Your code provides the control statements used to implement the actual loop portion of the run loop—in other words, your code provides the while or for loop that drives the run loop. Within your loop, you use a run loop object to "run" the event-processing code that receives events and calls the installed handlers.
This confuses me. My code never provides while or for loops, even for non-main threads. What is meant here? Can anyone explain?
Keep reading until the Using Run Loop Objects section: Apple's code samples there do show control statements like while loops.
Listing 3-1
NSInteger loopCount = 10;
do
{
    // Run the run loop 10 times to let the timer fire.
    [myRunLoop runUntilDate:[NSDate dateWithTimeIntervalSinceNow:1]];
    loopCount--;
}
while (loopCount);
Listing 3-2
do
{
    // Start the run loop but return after each source is handled.
    SInt32 result = CFRunLoopRunInMode(kCFRunLoopDefaultMode, 10, YES);
    // If a source explicitly stopped the run loop, or if there are no
    // sources or timers, go ahead and exit.
    if ((result == kCFRunLoopRunStopped) || (result == kCFRunLoopRunFinished))
        done = YES;
    // Check for any other exit conditions here and set the
    // done variable as needed.
}
while (!done);
The intended way to use NSRunLoop does require you to invoke the next run, again and again until a certain condition is met.
But if you start your run loop with -[NSRunLoop run], it runs indefinitely without help. That’s what the main thread does.
In case you're wondering why Apple lets (or wants) you to control every loop, NeXTSTEP shipped in the 80s, when every CPU cycle counted. Functions like -[NSRunLoop runMode:beforeDate:] let you fine-tune the frequency and behaviour of your run loops down to every single run.
Oh, you do run a loop on the main thread, but you don't know it.
Set a breakpoint on an action method and look at the stack trace. There will be something like:
#9 0x00007fff912eaa29 in -[NSApplication run] ()
That's the loop.
In another thread you very often do not need an instance of NSRunLoop. Its primary job is to receive events and dispatch them. But in an additional thread you usually want to run computations straight through. To put a term on it: additional threads are usually not event-driven.
So you need a run loop (and have to run it) only rarely, typically when you have networking or file access that is dispatched via a run loop. In such a case, a common mistake is forgetting to run the thread's run loop.
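To illustrate that mistake, here is a minimal sketch at the CoreFoundation level (which NSRunLoop wraps); the names workerThread and timerFired are made up for the example. A secondary thread only receives its timer or stream callbacks if it actually runs its run loop:
#include <CoreFoundation/CoreFoundation.h>

static void timerFired(CFRunLoopTimerRef timer, void *info) {
    // handle the event here
}

static void *workerThread(void *arg) {
    // install a repeating timer on this thread's run loop
    CFRunLoopTimerRef timer = CFRunLoopTimerCreate(NULL,
        CFAbsoluteTimeGetCurrent() + 1.0, 1.0, 0, 0, timerFired, NULL);
    CFRunLoopAddTimer(CFRunLoopGetCurrent(), timer, kCFRunLoopDefaultMode);
    CFRunLoopRun();   // without this call, timerFired is never invoked on this thread
    return NULL;
}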

Restart a task in FreeRTOS

I have a specific task routine which performs some operations in a specific order, and these operations handle a few volatile variables. There is a specific interrupt which updates these volatile variables asynchronously. Hence, the task routine should restart if such an interrupt occurs. Normally FreeRTOS will resume the task, but this will result in wrong derived values, hence the requirement for restarting the routine. I also cannot keep the task routine inside a critical section, because I should not be missing any interrupts.
Is there a way in FreeRTOS with which I can achieve this? Like a vtaskRestart API. I could have deleted the task and re-created it, but this adds a lot of memory management complications, which I would like to avoid. Currently my only option is to add checks in the routine on a flag to see if a context switch has occurred and, if so, restart, else continue.
Googling did not fetch any clue on this. It seems like people never faced such a problem, or maybe it's that this design is poor. On the FreeRTOS forum, the few who asked for a task restart didn't seem to have this same problem. Stack Overflow didn't have a result on freertos + task + restart. So, this could be the first post with this tag combination ;)
Can someone please tell me if this is directly possible in FreeRTOS?
You can use a semaphore for this purpose. If you decide to use a semaphore, you should follow the steps below.
Firstly, you should create a binary semaphore.
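A minimal sketch of this first step (assuming the handle name Example_xSemaphore used in the snippets below):
#include "FreeRTOS.h"
#include "semphr.h"

SemaphoreHandle_t Example_xSemaphore;

void vSetupSyncSemaphore(void)
{
    /* binary semaphore, created empty: the ISR gives it, the task takes it */
    Example_xSemaphore = xSemaphoreCreateBinary();
}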
The semaphore must be given in the interrupt routine with
xSemaphoreGiveFromISR( Example_xSemaphore, &xHigherPriorityTaskWoken );
Then, the task must block on taking the semaphore:
void vExample_Task( void * pvParameters )
{
    for( ;; )
    {
        if (xSemaphoreTake( Example_xSemaphore, Example_PROCESS_TIME ) == pdTRUE)
        {
            /* the ISR has signalled: the volatile variables have changed,
               so restart the routine's calculations from the top here */
        }
    }
}
For this purpose you should use a queue, and use the queue peek function to peek at your volatile data.
I'm using it as I have a real-time timer, and this way I make the time available to all my tasks without any blocking.
Here is how it goes:
Declare the queue:
xQueueHandle RTC_Time_Queue;
Create the queue of 1 element:
RTC_Time_Queue = xQueueCreate( 1, sizeof(your volatile struct) );
Overwrite the queue every time your interrupt occurs (note that xQueueOverwriteFromISR() also takes a pxHigherPriorityTaskWoken argument):
xQueueOverwriteFromISR(RTC_Time_Queue, (void*) &time, &xHigherPriorityTaskWoken);
And from other task peek the queue:
xQueuePeek(RTC_GetReadQueue(), (void*) &TheTime, 0);
The 0 at the end of xQueuePeek means you don't want to wait if the queue is empty. The queue peek won't delete the value in the queue, so it will be present every time you peek and the call will never block.
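As a rough sketch of how the reading side fits together (the struct type RTC_Time_t, the task name and the 10 ms polling period are made up for illustration):
void vConsumerTask( void * pvParameters )
{
    RTC_Time_t TheTime;   /* hypothetical struct holding the RTC value */
    for( ;; )
    {
        /* non-blocking peek: the value stays in the queue for other tasks */
        if (xQueuePeek(RTC_Time_Queue, (void*) &TheTime, 0) == pdTRUE)
        {
            /* use TheTime here */
        }
        vTaskDelay(pdMS_TO_TICKS(10));
    }
}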
Also, you should avoid having a variable accessed directly from both the ISR and the RTOS code, as you may get unexpected corruption.

Call to CFReadStreamRead stops execution in thread

NB: The entire code base for this project is so large that posting any meaningful amount would render this question too localised; I have tried to distil the code down to the bare essentials. I'm not expecting anyone to solve my problems directly, but I will upvote those answers I find helpful or intriguing.
This project uses a modified version of AudioStreamer to play back audio files that are saved locally on the device (iPhone).
The stream is set up and scheduled on the current loop using this code (unaltered from the standard AudioStreamer project as far as I know):
CFStreamClientContext context = {0, self, NULL, NULL, NULL};
CFReadStreamSetClient(
    stream,
    kCFStreamEventHasBytesAvailable | kCFStreamEventErrorOccurred | kCFStreamEventEndEncountered,
    ASReadStreamCallBack,
    &context);
CFReadStreamScheduleWithRunLoop(stream, CFRunLoopGetCurrent(), kCFRunLoopCommonModes);
The ASReadStreamCallBack calls:
- (void)handleReadFromStream:(CFReadStreamRef)aStream
eventType:(CFStreamEventType)eventType
On the AudioStreamer object, this all works fine until the stream is read using this code:
BOOL hasBytes = NO; //Added for debugging
hasBytes = CFReadStreamHasBytesAvailable(stream);
length = CFReadStreamRead(stream, bytes, kAQDefaultBufSize);
hasBytes is YES, but when CFReadStreamRead is called, execution stops. The app does not crash; it just stops executing. Any breakpoints below the CFReadStreamRead call are not hit, and ASReadStreamCallBack is not called again.
I am at a loss as to what might cause this; my best guess is that the thread is being terminated? But the hows and whys are why I'm asking SO.
Has anyone seen this behaviour before? How can I track it down and ideas on how I might solve it will be very much welcome!
Additional Info Requested via Comments
This is 100% repeatable
CFReadStreamHasBytesAvailable was added by me for debugging but removing it has no effect
First, I assume that CFReadStreamScheduleWithRunLoop() is running on the same thread as CFReadStreamRead()?
Is this thread processing its runloop? Failure to do this is my main suspicion. Do you have a call like CFRunLoopRun() or equivalent on this thread?
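For reference, a bare-bones sketch of what driving the run loop on that thread could look like (this assumes the stream is scheduled and opened on the same thread; your code may of course open the stream elsewhere):
CFReadStreamScheduleWithRunLoop(stream, CFRunLoopGetCurrent(), kCFRunLoopCommonModes);
if (CFReadStreamOpen(stream)) {
    // ASReadStreamCallBack only fires while this thread's run loop is running;
    // without a call like this, no stream events are delivered.
    CFRunLoopRun();
}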
Typically there is no reason to spawn a separate thread for reading streams asynchronously, so I'm a little confused about your threading design. Is there really a background thread involved here? Also, typically CFReadStreamRead() would be in your client callback (when you receive the kCFStreamEventHasBytesAvailable event (which it appears to be in the linked code), but you're suggesting ASReadStreamCallBack is never called. How have you modified AudioStreamer?
It is possible that the stream pointer is just corrupt in some way. CFReadStreamRead should certainly not block if bytes are available (it certainly would never block for more than a few milliseconds for local files). Can you provide the code you use to create the stream?
Alternatively, CFReadStreams send messages asynchronously but it is possible (but not likely) that it's blocking because the runloop isn't being processed.
If you prefer, I've uploaded my AudioPlayer inspired by Matt's AudioStreamer hosted at https://code.google.com/p/audjustable/. It supports local files (as well as HTTP). I think it does what you wanted (stream files from more than just HTTP).

pthread_create and EAGAIN

I got an EAGAIN when trying to spawn a thread using pthread_create. However, from what I've checked, the threads seem to have been terminated properly.
What determines the OS to give EAGAIN when trying to create a thread using pthread_create? Would it be possible that unclosed sockets/file handles play a part in causing this EAGAIN (i.e they share the same resource space)?
And lastly, is there any tool to check resource usage, or any functions that can be used to see how many pthread objects are active at the time?
Okay, found the answer. Even if pthread_exit or pthread_cancel is called, the parent process still needs to call pthread_join to release the pthread ID, which will then become recyclable.
Putting a pthread_join(tid, NULL) in the end did the trick.
edit (was not waitpid, but rather pthread_join)
As a practical matter EAGAIN is almost always related to running out of memory for the process. Often this has to do with the stack size allocated for the thread which you can adjust with pthread_attr_setstacksize(). But there are process limits to how many threads you can run. You can query the hard and soft limits with getrlimit() using RLIMIT_NPROC as the first parameter.
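A small sketch of both suggestions; the 256 KiB stack size and the worker function are illustrative values only:
#include <pthread.h>
#include <stdio.h>
#include <sys/resource.h>

static void *worker(void *arg) { return NULL; }

int main(void)
{
    /* shrink the per-thread stack if EAGAIN comes from memory pressure */
    pthread_t tid;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 * 1024);
    if (pthread_create(&tid, &attr, worker, NULL) == 0)
        pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);

    /* query the soft/hard limits mentioned above */
    struct rlimit rl;
    if (getrlimit(RLIMIT_NPROC, &rl) == 0)
        printf("RLIMIT_NPROC: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
    return 0;
}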
There are quite a few questions here dedicated to keeping track of threads, their number, whether they are dead or alive, etc. Simply put, the easiest way to keep track of them is to do it yourself through some mechanism you code, which can be as simple as incrementing and decrementing a global counter (protected by a mutex) or something more elaborate.
Open sockets or other file descriptors shouldn't cause pthread_create() to fail. If you reached the maximum for descriptors you would have already failed before creating the new thread and the new thread would have already have had to be successfully created to open more of them and thus could not have failed with EAGAIN.
As per my observation, if the parent thread calls pthread_join() while child threads try to release themselves by calling pthread_exit() or pthread_cancel(), the system is not able to release those threads properly. In that case, calling pthread_detach() immediately after a successful pthread_create() solved the problem. A snapshot is here:
err = pthread_create(&(receiveThread), NULL, &receiver, temp);
if (err != 0)
{
    MyPrintf("\nCan't create thread. Reason: %s\n", (err == EAGAIN) ? "EAGAIN" : (err == EINVAL) ? "EINVAL" : (err == EPERM) ? "EPERM" : "UNKNOWN");
    free(temp);
}
else
{
    threadnumber++;
    MyPrintf("Count: %d Thread ID: %u\n", threadnumber, receiveThread);
    pthread_detach(receiveThread);
}
Another potential cause: I was getting this problem (EAGAIN on pthread_create) because I had forgotten to call pthread_attr_init on the pthread_attr_t I was trying to initialize my thread with.

pthreads : pthread_cond_signal() from within critical section

I have the following piece of code in thread A, which blocks using pthread_cond_wait()
pthread_mutex_lock(&my_lock);
if ( false == testCondition )
    pthread_cond_wait(&my_wait, &my_lock);
pthread_mutex_unlock(&my_lock);
I have the following piece of code in thread B, which signals thread A
pthread_mutex_lock(&my_lock);
testCondition = true;
pthread_cond_signal(&my_wait);
pthread_mutex_unlock(&my_lock);
Provided there are no other threads, would it make any difference if pthread_cond_signal(&my_wait) is moved out of the critical section block, as shown below?
pthread_mutex_lock(&my_lock);
testCondition = true;
pthread_mutex_unlock(&my_lock);
pthread_cond_signal(&my_wait);
My recommendation is typically to keep the pthread_cond_signal() call inside the locked region, but probably not for the reasons you think.
In most cases, it doesn't really matter whether you call pthread_cond_signal() with the lock held or not. Ben is right that some schedulers may force a context switch when the lock is released if there is another thread waiting, so your thread may get switched away before it can call pthread_cond_signal(). On the other hand, some schedulers will run the waiting thread as soon as you call pthread_cond_signal(), so if you call it with the lock held, the waiting thread will wake up and then go right back to sleep (because it's now blocked on the mutex) until the signaling thread unlocks it. The exact behavior is highly implementation-specific and may change between operating system versions, so it isn't anything you can rely on.
But, all of this looks past what should be your primary concern, which is the readability and correctness of your code. You're not likely to see any real-world performance benefit from this kind of micro-optimization (remember the first rule of optimization: profile first, optimize second). However, it's easier to think about the control flow if you know that the set of waiting threads can't change between the point where you set the condition and send the signal. Otherwise, you have to think about things like "what if thread A sets testCondition=TRUE and releases the lock, and then thread B runs and sees that testCondition is true, so it skips the pthread_cond_wait() and goes on to reset testCondition to FALSE, and then finally thread A runs and calls pthread_cond_signal(), which wakes up thread C because thread B wasn't actually waiting, but testCondition isn't true anymore". This is confusing and can lead to hard-to-diagnose race conditions in your code. For that reason, I think it's better to signal with the lock held; that way, you know that setting the condition and sending the signal are atomic with respect to each other.
On a related note, the way you are calling pthread_cond_wait() is incorrect. It's possible (although rare) for pthread_cond_wait() to return without the condition variable actually being signaled, and there are other cases (for example, the race I described above) where a signal could end up awakening a thread even though the condition isn't true. In order to be safe, you need to put the pthread_cond_wait() call inside a while() loop that tests the condition, so that you call back into pthread_cond_wait() if the condition isn't satisfied after you reacquire the lock. In your example it would look like this:
pthread_mutex_lock(&my_lock);
while ( false == testCondition ) {
    pthread_cond_wait(&my_wait, &my_lock);
}
pthread_mutex_unlock(&my_lock);
(I also corrected what was probably a typo in your original example, which is the use of my_mutex for the pthread_cond_wait() call instead of my_lock.)
The thread waiting on the condition variable should keep the mutex locked, and the other thread should always signal with the mutex locked. This way, you know the other thread is waiting on the condition when you send the signal. Otherwise, it's possible the waiting thread won't see the condition being signaled and will block indefinitely waiting on it.
Condition variables are typically used like this:
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int go = 0;

void *threadproc(void *data) {
    printf("Sending go signal\n");
    pthread_mutex_lock(&lock);
    go = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(int argc, char *argv[]) {
    pthread_t thread;
    pthread_mutex_lock(&lock);
    printf("Waiting for signal to go\n");
    pthread_create(&thread, NULL, &threadproc, NULL);
    while (!go) {
        pthread_cond_wait(&cond, &lock);
    }
    printf("We're allowed to go now!\n");
    pthread_mutex_unlock(&lock);
    pthread_join(thread, NULL);
    return 0;
}
This is valid:
void *threadproc(void *data) {
    printf("Sending go signal\n");
    go = 1;
    pthread_cond_signal(&cond);
    return NULL;
}
However, consider what's happening in main:
while (!go) {
    /* Suppose a long delay happens here, during which the signal is sent */
    pthread_cond_wait(&cond, &lock);
}
If the delay described by that comment happens, pthread_cond_wait will be left waiting—possibly forever. This is why you want to signal with the mutex locked.
Both are correct; however, for reactivity reasons, most schedulers hand control to another thread when a lock is released. If you don't signal before unlocking, your waiting thread A is not in the ready list and thus will not be scheduled until B is scheduled again and calls pthread_cond_signal().
The Open Group Base Specifications Issue 7 IEEE Std 1003.1, 2013 Edition (which as far as I can tell is the official pthread specification) says this on the matter:
The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().
To add my personal experience: I was working on an application that had code where the condition variable was destroyed (and its memory freed) by the thread that was woken up. We found that on a multi-core device (an iPad Air 2), pthread_cond_signal() could actually crash sometimes if it was called outside the mutex lock, as the waiter woke up and destroyed the condition variable before pthread_cond_signal() had completed. This was quite unexpected.
So I would definitely veer towards the 'signal inside the lock' version, it appears to be safer.
Here is a nice write-up about condition variables: Techniques for Improving the Scalability of Applications Using POSIX Thread Condition Variables (look under the 'Avoiding the Mutex Contention' section, point 7).
It says that the second version may have some performance benefits, because it makes it possible for the thread calling pthread_cond_wait() to wait less frequently.
