Using NSOperationQueue to speed up long calculation - ios

I'm looking for ways to speed up a lengthy calculation (with two nested for-loops), the results of which will be shown in a plot. I tried NSOperationQueue thinking that each of the inner for loops would run concurrently. But apparently that's not the case, at least in my implementation. If I remove the NSOperationQueue calls, I get my results in my plot, so I know the calculation is done properly.
Here's a code snippet:
NSInteger half_window, len;
len = [myArray length];
if (!len)
    return;
NSOperationQueue *queue = [[NSOperationQueue alloc] init];
half_window = 0.5 * (self.slidingWindowSize - 1);
numberOfPoints = len - 2 * half_window;
double __block minY = 0;
double __block maxY = 0;
double __block sum, y;
xPoints = (double *) malloc (numberOfPoints * sizeof(double));
yPoints = (double *) malloc (numberOfPoints * sizeof(double));
for ( NSUInteger i = half_window; i < (len - half_window); i++ )
{
    [queue addOperationWithBlock: ^{
        sum = 0.0;
        for ( NSUInteger j = -half_window; j <= half_window; j++ )
        {
            MyObject *mo = [myArray objectAtIndex: (i+j)];
            sum += mo.floatValue;
        }
        xPoints[i - half_window] = (double) i+1;
        y = (double) (sum / self.slidingWindowSize);
        yPoints[i - half_window] = y;
        if (y > maxY)
            maxY = y;
        if (y < minY)
            minY = y;
    }];
    [queue waitUntilAllOperationsAreFinished];
}
// update my core-plot
self.maximumValueForXAxis = len;
self.minimumValueForYAxis = floor(minY);
self.maximumValueForYAxis = ceil(maxY);
[self setUpPlotSpaceAndAxes];
[graph reloadData];
// cleanup
free(xPoints);
free(yPoints);
Is there a way to make this execute any faster?

You are waiting for all operations in the queue to finish after adding each item. This:

    [queue waitUntilAllOperationsAreFinished];
}
// update my core-plot
self.maximumValueForXAxis = len;

should be:

}
[queue waitUntilAllOperationsAreFinished];
// update my core-plot
self.maximumValueForXAxis = len;
You are also setting the sum variable to 0.0 in each operation queue block. Since sum is a single __block variable shared by every block, concurrent operations race on it (as they do on minY and maxY); if each operation needs its own running total, declare sum locally inside the block.

This looks odd:
for ( NSUInteger j = -half_window; j <= half_window; j++ )
Assuming half_window is positive, you're assigning a negative number to an unsigned integer. That will wrap around to a huge unsigned value, the loop condition will fail immediately, and the inner loop will never execute.
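A minimal fix is to declare the inner loop counter as a signed NSInteger (half_window is already an NSInteger in the question's code):

for ( NSInteger j = -half_window; j <= half_window; j++ )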
However, this isn't the cause of your slowness.

Revised answer
Below, in my original answer, I address two types of performance improvement: (1) designing a responsive UI by moving complicated calculations to the background; and (2) making complicated calculations perform more quickly by making them multi-threaded (which is a little complicated, so be careful).
In retrospect, I now realize that you're computing a moving average, so the performance hit from the nested for loops can be eliminated entirely, cutting the Gordian knot. Using pseudo code, you can keep a running sum that drops the point leaving the window and adds the point entering it as you go along (where n is the number of points you're averaging; e.g., for a 30-point moving average over your large set, n is 30):
double sum = 0.0;
for (NSInteger i = 0; i < n; i++)
{
    sum += originalDataPoints[i];
}
movingAverageResult[n - 1] = sum / n;

for (NSInteger i = n; i < totalNumberOfPointsInOriginalDataSet; i++)
{
    sum = sum - originalDataPoints[i - n] + originalDataPoints[i];
    movingAverageResult[i] = sum / n;
}
That makes this a linear-time problem, which should be much faster. You definitely do not need to break this into multiple operations added to a queue to make the algorithm multi-threaded (which is great, because you thereby bypass the complications I warn about in point #2 below). You can, though, wrap the whole algorithm in a single operation that you add to a dispatch/operation queue so it runs asynchronously of your user interface (my point #1 below), if you want, as sketched below.
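For concreteness, here is a minimal sketch of that idea against the question's own variables (myArray, yPoints, len, and slidingWindowSize), ignoring edge cases such as len < n; treat it as an outline rather than a drop-in implementation:

[queue addOperationWithBlock:^{
    NSInteger n = self.slidingWindowSize;
    double sum = 0.0;

    // Prime the window with the first n points.
    for (NSInteger i = 0; i < n; i++) {
        MyObject *mo = [myArray objectAtIndex: i];
        sum += mo.floatValue;
    }
    yPoints[0] = sum / n;

    // Slide the window: drop the point that leaves, add the point that enters.
    for (NSInteger i = n; i < len; i++) {
        MyObject *leaving  = [myArray objectAtIndex: (i - n)];
        MyObject *entering = [myArray objectAtIndex: i];
        sum += entering.floatValue - leaving.floatValue;
        yPoints[i - n + 1] = sum / n;
    }

    [[NSOperationQueue mainQueue] addOperationWithBlock:^{
        // now update the plot
    }];
}];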
Original answer
It's not entirely clear from your question what the performance issue is. There are two classes of performance issues:
User interface responsiveness: If you're concerned about the responsiveness of the UI, you should definitely eliminate the waitUntilAllOperationsAreFinished call, because it makes the calculation synchronous with respect to your UI. To address responsiveness, you might (a) remove the operation block inside the for loop; and then (b) wrap the two nested for loops inside a single block that you add to your background queue. At a high level, the code would end up looking like:
[queue addOperationWithBlock:^{
    // do all of your time consuming stuff here with
    // your nested for loops, no operations dispatched
    // inside the for loop

    // when all done
    [[NSOperationQueue mainQueue] addOperationWithBlock:^{
        // now update your UI
    }];
}];
Note, do not add any waitUntilAllOperationsAreFinished call here. The goal in responsive user interfaces is to run the work asynchronously, and the waitUntil... method effectively makes it synchronous, the enemy of a responsive UI.
Or, you can use the GCD equivalent:
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    // do all of your time consuming stuff here

    // when all done
    dispatch_async(dispatch_get_main_queue(), ^{
        // now update your UI
    });
});
Again, we're calling dispatch_async (the equivalent of making sure you don't call waitUntilAllOperationsAreFinished) so this code is dispatched to the background and we return immediately, keeping the UI responsive.
When you do this, the method will return almost instantaneously, keeping the UI from stuttering/freezing during this operation. And when the operation is done, it will update the UI accordingly.
Note, this presumes that you're doing it all in a single operation, not submitting a bunch of separate background operations. You submit this single operation to the background, it does its complicated calculations, and when it's done, it updates your user interface. In the meantime your user interface can remain responsive (let the user do other things, or, if that doesn't make sense, show a UIActivityIndicatorView, a spinner, so they know the app is doing something special for them and will be right back).
The take-home message, though, is that anything that freezes a UI (even temporarily) is not a great design. And be forewarned that if your existing process takes long enough, the watchdog process may even kill your app. Apple's counsel is that, at the very least, anything taking more than a few hundred milliseconds should be done asynchronously. And if the UI is trying to do anything else at the same time (e.g. some animation, a scrolling view, etc.), even a few hundred milliseconds is far too long.
Optimizing performance by making the calculation, itself, multi-threaded: If you're trying to tackle the more fundamental performance issue by making the calculation multi-threaded, far more care must be taken regarding how you do it.
First, you probably want to restrict the number of concurrent operations to some reasonable number (you never want to risk using up all of the available threads). I'd suggest you set maxConcurrentOperationCount to some small, reasonable number (e.g. 4 or 6). You'll hit diminishing returns beyond that anyway, because the device only has a limited number of cores available.
Second, and just as important, you should pay special attention to synchronizing your updates of the variables shared across operations (like your minY, maxY, etc.). Say maxY is currently 100 and you have two concurrent operations, one trying to set it to 300 and another trying to set it to 200. Both confirm that their value is greater than the current 100 and proceed to write; if the operation setting 300 happens to win the race, the other can then reset it back to 200, blowing away your 300.
When you write concurrent code with separate operations updating the same variables, you have to think very carefully about synchronizing those shared variables. See the Synchronization section of the Threading Programming Guide for a discussion of the variety of locking mechanisms that address this problem. Or you can use a dedicated serial queue to synchronize the values, as discussed in Eliminating Lock-Based Code in the Concurrency Programming Guide.
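For illustration, a minimal sketch of the lock-based option, assuming an NSLock created alongside the queue (the rest of each operation is elided):

NSLock *lock = [[NSLock alloc] init];
...
[queue addOperationWithBlock:^{
    // ... compute y for this window ...
    [lock lock];
    if (y > maxY) maxY = y;
    if (y < minY) minY = y;
    [lock unlock];
}];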
Finally, when thinking about synchronization, you can always step back and ask whether the cost of synchronizing these variables is really necessary (there is a performance hit to synchronization even when there is no contention). For example, though it might seem counter-intuitive, it might be faster not to update minY and maxY during the operations at all, eliminating the need for synchronization: forgo tracking the range of y values while the calculations run, wait until all of the operations are done, and then make one final pass through the entire result set to find the min and max. This is an approach you can verify empirically: try it once with locks (or another synchronization method), and again computing the range in a single pass at the very end, where locks aren't necessary. Surprisingly, sometimes adding the extra loop at the end (and thereby eliminating the need to synchronize) is faster.
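In that lock-free variant, once all the operations have finished, a single serial pass over the question's yPoints buffer recovers the range, for example:

minY = yPoints[0];
maxY = yPoints[0];
for (NSInteger k = 1; k < numberOfPoints; k++) {
    if (yPoints[k] < minY) minY = yPoints[k];
    if (yPoints[k] > maxY) maxY = yPoints[k];
}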
The bottom line is that you generally can't take a serial piece of code and make it concurrent without special attention to both of these considerations: constrain how many threads you'll consume, and, if multiple operations update the same variables, decide how you're going to synchronize them. And even if you tackle this second issue, the multi-threaded calculation itself, you should still think about the first, the responsive UI, and perhaps marry both methods.

Related

how many operations NSOperationQueue can cache

I created an NSOperationQueue and set its maxConcurrentOperationCount property to 2.
If I create 2 operations that never finish and then continue to add operations, the NSOperationQueue will cache those tasks. What is the maximum number of operations the NSOperationQueue can cache, and will it cause a memory surge?
NSOperationQueue *queue = [[NSOperationQueue alloc] init];
queue.maxConcurrentOperationCount = 2;

[queue addOperationWithBlock:^{
    while (YES) {
        NSLog(@"thread1 : %@", [NSThread currentThread]);
    }
}];

[queue addOperationWithBlock:^{
    while (YES) {
        NSLog(@"thread2 : %@", [NSThread currentThread]);
    }
}];

// this operation will wait
[queue addOperationWithBlock:^{
    while (YES) {
        NSLog(@"thread3 : %@", [NSThread currentThread]);
    }
}];
Above is my code; the third operation will never run.
From what I understand, the queue will hold these tasks, so if you keep adding operations to it, memory will keep going up.
Will NSOperationQueue handle this situation internally?
An operation queue will handle a large number of operations without incident. That is one of the reasons we use operation queues: to gracefully handle constrained concurrency for a number of operations that exceeds maxConcurrentOperationCount.
Obviously, your particular example, with operations spinning indefinitely, is both inefficient (tying up two worker threads with a computationally intensive process) and will prevent the third operation from ever starting. But if you changed the operations to something more practical (e.g., ones that finish in some finite period of time), an operation queue can gracefully handle a very large number of operations.
That is not to say that operation queues can be used recklessly. For example, one can easily create operation queue scenarios that suffer from thread explosion and exhaust the limited worker thread pool. Or if you had operations that individually used tons of memory, then with enough of them queued you could theoretically introduce memory issues.
But don't worry about theoretical problems. If you have a practical example for which you are having a problem, post that as a separate question. In answer to your question here, though: operation queues can handle many queued operations quite well.
Let us consider queuing 100,000 operations, each taking one second to finish:
NSOperationQueue *queue = [[NSOperationQueue alloc] init];
queue.maxConcurrentOperationCount = 2;

for (NSInteger i = 0; i < 100000; i++) {
    [queue addOperationWithBlock:^{
        [NSThread sleepForTimeInterval:1];
        NSLog(@"%ld", (long) i);
    }];
}
NSLog(@"done queuing");
The operation queue handles all of these operations, only running two at a time, perfectly well. It will take some memory (e.g. 80 MB) to hold these 100,000 operations at any given moment, but it handles them fine. Even at 1,000,000 operations it works fine (but will take roughly 500 MB of memory). Clearly, at some point you will run out of memory, but if you're contemplating something with this many operations, you should be considering other patterns anyway.
There are obviously practical limitations.
Let us consider a degenerate example: imagine you had a multi-gigabyte video file and you wanted to run some task on each frame. It would be a poor design to add an operation for each frame up-front, passing each one the contents of its frame of the video (you would effectively be trying to hold the entire video, frame by frame, in memory at once).
But this is not an operation queue limitation; it is just a practical memory limitation. We would generally use a different pattern. In this hypothetical example, I would consider dispatch_apply (known as concurrentPerform in Swift), which would simply load the relevant frame in a just-in-time manner.
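For illustration, a minimal dispatch_apply sketch (frameCount and processFrame are hypothetical placeholders for this video example, not names from the question):

dispatch_apply(frameCount, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t frameIndex) {
    // Load just this one frame, process it, and let it go;
    // only a handful of frames are ever in memory at once.
    processFrame(frameIndex);
});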
Bottom line, just consider how much memory each operation added to the queue will hold, and use your own good judgment about whether holding all of them in memory at once is acceptable. If you have many operations, each holding a large piece of data in memory, consider other patterns.

Thread synchronization of a flag

I have a flag that can be read, cleared, and set by two threads, and I think I need some kind of synchronisation. All the documentation I've read focuses on resource protection, and I'm wondering if there is something more lightweight for my use case. I'm particularly interested in something that doesn't call down to the kernel every time. I'm currently coding in Objective-C.
Here's my particular use case:
// First thread does the following between a few
// and a few hundred times per second:
updateBuffer();
if (displayNeedsUpdate)
{
    displayBuffer();
    displayNeedsUpdate = 0;
}

// Second thread does this 60 times per second:
displayNeedsUpdate = 1;
Many thanks!
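One lightweight possibility, sketched with C11 atomics (an assumption: <stdatomic.h> requires a C11-capable toolchain; OSAtomic or locks are alternatives). atomic_exchange reads and clears the flag in one indivisible step, so no update is ever lost; updateBuffer and displayBuffer are the question's own placeholders:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool displayNeedsUpdate;

// Second thread, 60 times per second:
atomic_store(&displayNeedsUpdate, true);

// First thread:
updateBuffer();
if (atomic_exchange(&displayNeedsUpdate, false)) {
    displayBuffer();
}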

Why memory leaks with a dynamic array of a custom class?

I'm creating an indicator that recognizes candlestick shapes.
To do that, I created a separate class, Candlestick, that I include in the indicator file.
The problem is that I suffer from memory leaks.
I'm new to pointers, and after reading and watching a lot, I still seem to be missing something here.
This is the Indicator class. The content of the Candlestick class is irrelevant so I leave that out.
Candlestick *candles[];

void OnDeinit(const int reason)
{
    for (int i = 0; i < ArraySize(candles); i++) {
        delete(candles[i]);
    }
}

int OnCalculate(args here)
{
    ArrayResize(candles, Bars);
    for (int i = MathMax(Bars-2-IndicatorCounted(), 1); i >= 0; i--)
    {
        candles[i] = new Candlestick();
        // Do stuff with this candle (and other candles) here e.g.
        if (candles[i+1].type == BULLISH) Print("Last candle was Bullish");
    }
}
When I do this I get memory-leak errors. It seems that I need to delete the pointers to the candles in that dynamic array. The problem is: when and where? I need the candles in the next iteration of the for(){...} loop, so I can't delete them there.
And when I delete them in the OnDeinit() function, there are still candles out there and I still get the leak error.
How come?
First, Nick, welcome to the Worlds of MQL4.
You might have already realised that MQL4 code is not C.
Among many important differences, the key here is what the code-execution platform (the MetaTrader Terminal 4) does, and at what moment.
OnCalculate() is a zombie-alike process: it gets invoked many times, but definitely not under your control.
Next, OnCalculate() by design does not mean a new Bar.
How to?
MQL4 conceptually originates from days when computing resources were many orders of magnitude smaller and much more expensive in terms of their time-sharing CPU-MUX-ing during a code-execution phase.
Thus the MQL4 user-domain language retains benefits from some hidden gems that are not accessible directly. One of these is a very efficient register-based update-processing, and keeping dynamic resource allocations to a minimum, given their devastatingly adverse effects on real-time execution predictability.
This will help you understand how to design and handle your conceptual objects far more smartly, best by mimicking this "stone-age"-but-VERY-efficient behaviour (both time-wise and memory-wise), instead of flooding your memory pool with an unmanaged instance on each call of OnCalculate(), which sprinkles an endless count of new Candlestick(); // *--> candles[]
A best next step:
If in doubt, read about best practices for ArrayResize() in the platform's localhost help/documentation, to start realising the things that introduce overheads (if not blocks) in a domain where nano$econd$ count & hurt in professional software design.

How to CUDA-ize code when all cores require global memory access

Without CUDA, my code is just two for loops that calculate the distance between all pairs of coordinates in a system and sort those distances into bins.
The problem with my CUDA version is that apparently threads can't write to the same global memory locations at the same time (race conditions?). The values I end up getting for each bin are incorrect because only one of the threads ended up writing to each bin.
__global__ void computePcf(
    double const * const atoms,
    double * bins,
    int numParticles,
    double dr)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numParticles - 1) {
        for (int j = i + 1; j < numParticles; j++) {
            double r = distance(&atoms[3*i + 0], &atoms[3*j + 0]);
            int binNumber = floor(r/dr);
            // Problem line right here.
            // This memory address is modified by multiple threads
            bins[binNumber] += 2.0;
        }
    }
}
So... I have no clue what to do. I've been Googling and reading about shared memory, but the problem is that I don't know what memory area I'm going to be accessing until I do my distance computation!
I know this is possible, because a program called VMD uses the GPU to speed up this computation. Any help (or even ideas) would be greatly appreciated. I don't need this optimized, just functional.
How many bins[] are there?
Is there some reason that bins[] needs to be of type double? It's not obvious from your code. What you have is essentially a histogram operation, and you may want to look at fast parallel histogram techniques. Thrust may be of interest.
There are several possible avenues to consider with your code:
See if there is a way to restructure your algorithm so that a given group of threads (or bin computations) aren't stepping on each other. This might be accomplished by sorting distances, perhaps.
Use atomics. This should solve your problem, but it will likely be costly in terms of execution time (though since it's so simple, you might want to give it a try). In place of this:
bins[binNumber] += 2.0;
Something like this:
int * bins,
...
atomicAdd(bins+binNumber, 2);
You can still do this if bins are of type double, it's just a bit more complicated. Refer to the documentation for the example of how to do atomicAdd on a double.
If the number of bins is small (maybe a few thousand, or fewer), you could create several sets of bins, each updated by a different threadblock, and then use a reduction operation (adding the sets of bins together, element by element) at the end of the processing sequence. In this case, you might want to use a smaller number of threads or threadblocks, each of which processes multiple elements: put an additional loop in your kernel code so that after each particle is processed, the loop jumps to the next particle by adding gridDim.x*blockDim.x to the i variable, and repeats the process. Since each thread or threadblock has its own local copy of the bins, it can do this without stepping on other threads' accesses.
For example, suppose I only needed 1000 bins of type int. I could create 1000 sets of bins, which would take up only about 4 megabytes. I could then give each of 1000 threads its own bin set to update, and no atomics would be required, since no thread could interfere with any other. By having each thread loop through multiple particles, I can still keep the machine busy this way. When all the particle-binning is done, I then have to add my 1000 bin sets together, perhaps with a separate kernel call.

Erratic timing results from mach_absolute_time()

I'm trying to optimize a function (an FFT) on iOS, and I've set up a test program to time its execution over several hundred calls. I'm using mach_absolute_time() before and after the function call to time it. I'm doing the tests on an iPod touch 4th generation running iOS 6.
Most of the timing results are roughly consistent with each other, but occasionally one run will take much longer than the others (as much as 100x longer).
I'm pretty certain this has nothing to do with my actual function. Each run has the same input data, and is a purely numerical calculation (i.e. there are no system calls or memory allocations). I can also reproduce this if I replace the FFT with an otherwise empty for loop.
Has anyone else noticed anything like this?
My current guess is that my app's thread is somehow being interrupted by the OS. If so, is there any way to prevent this from happening? (This is not an app that will be released on the App Store, so non-public APIs would be OK for this.)
I no longer have an iOS 5.x device, but I'm pretty sure this was not happening prior to the update to iOS 6.
EDIT:
Here's a simpler way to reproduce:
for (int i = 0; i < 1000; ++i)
{
    uint64_t start = mach_absolute_time();
    for (int j = 0; j < 1000000; ++j);
    uint64_t stop = mach_absolute_time();
    printf("%llu\n", stop - start);
}
Compile this in debug (so the for loop is not optimized away) and run; most of the values are around 220000, but occasionally a value is 10 times larger or more.
In my experience, mach_absolute_time is not reliable. Now I use CFAbsoluteTime instead. It returns the current time in seconds, with much finer than one-second precision.
const CFAbsoluteTime newTime = CFAbsoluteTimeGetCurrent();
mach_absolute_time() is actually very low-level and reliable. It runs at a steady 24 MHz on all iOS devices, from the 3GS to the iPad 4th gen. It's also the fastest way to get timing information, taking between 0.5 µs and 2 µs depending on the CPU. But if you get interrupted by another thread, of course you're going to get spurious results.
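For reference, a small sketch of converting the raw tick deltas that mach_absolute_time() returns into nanoseconds via mach_timebase_info:

#include <mach/mach_time.h>

mach_timebase_info_data_t timebase;
mach_timebase_info(&timebase);

uint64_t start = mach_absolute_time();
// ... code under test ...
uint64_t stop = mach_absolute_time();

// Scale ticks to nanoseconds using the device's timebase ratio.
uint64_t elapsedNanos = (stop - start) * timebase.numer / timebase.denom;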
SCHED_FIFO with maximum priority will allow you to hog the CPU, but only for a few seconds at most; then the OS decides you're being too greedy. You might want to try sleep( 5 ) before running your timing test, as this will build up some "credit".
You don't actually need to start a new thread; you can temporarily change the priority of the current thread with this:
struct sched_param sched;
sched.sched_priority = 62;
pthread_setschedparam( pthread_self(), SCHED_FIFO, &sched );
Note that sched_get_priority_min & sched_get_priority_max return a conservative 15 & 47, but this only corresponds to an absolute priority of about 0.25 to 0.75. The actual usable range is 0 to 62, which corresponds to 0.0 to 1.0.
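A sketch of wrapping a timing run with a temporary priority boost, saving and restoring the thread's original scheduling via pthread_getschedparam:

#include <pthread.h>

int oldPolicy;
struct sched_param oldSched;
pthread_getschedparam( pthread_self(), &oldPolicy, &oldSched );

struct sched_param sched;
sched.sched_priority = 62;
pthread_setschedparam( pthread_self(), SCHED_FIFO, &sched );

// ... run the timing loop here ...

// restore the original policy and priority
pthread_setschedparam( pthread_self(), oldPolicy, &oldSched );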
It happens when the app spends some time in other threads.
