Concurrent file enumeration - iOS

I have to perform a complex operation on a large number of files. Fortunately, enumeration order is not important and the jobs can be done in parallel without locking.
Does the platform provide a way to do this? For lack of a better API, I was thinking of:
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    NSArray *paths = [[NSFileManager defaultManager] subpathsAtPath:folder];
    [paths enumerateObjectsWithOptions:NSEnumerationConcurrent
                            usingBlock:^(NSString *path, NSUInteger idx, BOOL *stop) {
        // Complex operation
    }];
});
Is there a better way?

Your current code puts one block on the global queue. That single block will run on a background thread and do all of the iteration and processing itself.
To have your processing tasks run concurrently, you want to do something a bit different: do the iteration on the main thread and add a block to the global queue on each iteration of the loop.
Better yet, create an NSOperation subclass and put your logic there. Create an instance of the operation on each iteration of the loop and add it to an operation queue. This is a higher-level API that gives you options for adding dependencies, tailoring the maximum concurrency, checking the number of operations still to be completed, and so on.
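A minimal sketch of that approach (FileProcessingOperation is a made-up name, and the concurrency limit of 4 is an arbitrary assumption):

```objc
// Hypothetical NSOperation subclass; put your per-file logic in -main.
@interface FileProcessingOperation : NSOperation
@property (nonatomic, copy) NSString *path;
@end

@implementation FileProcessingOperation
- (void)main {
    if (self.isCancelled) return;
    // Complex operation on self.path goes here.
}
@end

// Enqueue one operation per file, capping concurrency:
NSOperationQueue *queue = [[NSOperationQueue alloc] init];
queue.maxConcurrentOperationCount = 4; // tune for your workload

for (NSString *path in [[NSFileManager defaultManager] subpathsAtPath:folder]) {
    FileProcessingOperation *op = [[FileProcessingOperation alloc] init];
    op.path = path;
    [queue addOperation:op];
}
```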

Here's an approach you can consider. If you have (or may have) tens of thousands of files, instead of enumerating with enumerateObjectsWithOptions:usingBlock: you may want to enumerate the array manually in batches (let's say 100 elements each). When the current batch completes execution (you can use dispatch groups to check that) you start the next batch. With this approach you can avoid adding tens of thousands of blocks to the queue.
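A rough sketch of that batching idea using a dispatch group (the batch size of 100 and queue choice are assumptions, and this loop itself should run off the main thread since it blocks between batches):

```objc
NSArray *paths = [[NSFileManager defaultManager] subpathsAtPath:folder];
NSUInteger batchSize = 100;
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

for (NSUInteger start = 0; start < paths.count; start += batchSize) {
    dispatch_group_t group = dispatch_group_create();
    NSUInteger end = MIN(start + batchSize, paths.count);
    for (NSUInteger i = start; i < end; i++) {
        NSString *path = paths[i];
        dispatch_group_async(group, queue, ^{
            // Complex operation on path
        });
    }
    // Wait for the current batch before queuing the next one,
    // so only ~100 blocks are ever on the queue at once.
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
}
```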
BTW I've deleted my previous answer, because it was wrong.

Related

How many operations can NSOperationQueue cache?

I created an NSOperationQueue and set the maxConcurrentOperationCount property to 2.
If I create 2 operations that never finish and then keep adding operations, the NSOperationQueue will cache those pending tasks. What is the maximum number of operations the NSOperationQueue can cache, and will this cause a memory surge?
NSOperationQueue *queue = [[NSOperationQueue alloc] init];
queue.maxConcurrentOperationCount = 2;

[queue addOperationWithBlock:^{
    while (YES) {
        NSLog(@"thread1 : %@", [NSThread currentThread]);
    }
}];

[queue addOperationWithBlock:^{
    while (YES) {
        NSLog(@"thread2 : %@", [NSThread currentThread]);
    }
}];

// this operation will wait
[queue addOperationWithBlock:^{
    while (YES) {
        NSLog(@"thread3 : %@", [NSThread currentThread]);
    }
}];
Above is my code; the third operation will never run. As I understand it, the queue will hold these tasks, so if you keep adding operations, memory will keep going up. Does NSOperationQueue handle this situation internally?
An operation queue will handle a large number of operations without incident. That is one of the reasons we use operation queues: to gracefully handle constrained concurrency when the number of operations exceeds maxConcurrentOperationCount.
Obviously, your particular example, with operations spinning indefinitely, is both inefficient (tying up two worker threads with computationally intensive process) and will prevent the third operation from ever starting. But if you changed the operations to something more practical (e.g., ones that finish in some finite period of time), the operation queue can gracefully handle a very large number of operations.
That is not to say that operation queues can be used recklessly. For example, one can easily create operation queue scenarios that suffer from thread explosion and exhaust the limited worker thread pool. Or if you had operations that individually used tons of memory, then eventually, if you had enough of those queued up, you could theoretically introduce memory issues.
But don’t worry about theoretical problems. If you have a practical example for which you are having a problem, then post that as a separate question. But, in answer to your question here, operation queues can handle many queued operations quite well.
Let us consider queuing 100,000 operations, each taking one second to finish:
NSOperationQueue *queue = [[NSOperationQueue alloc] init];
queue.maxConcurrentOperationCount = 2;

for (NSInteger i = 0; i < 100000; i++) {
    [queue addOperationWithBlock:^{
        [NSThread sleepForTimeInterval:1];
        NSLog(@"%ld", (long)i);
    }];
}
NSLog(@"done queuing");
The operation queue handles all of these operations, only running two at a time, perfectly well. It takes some memory (e.g. 80 MB) to hold these 100,000 operations in memory at a given moment, but it handles it fine. Even at 1,000,000 operations it works fine (but will take roughly 500 MB of memory). Clearly, at some point you will run out of memory, but if you're contemplating something with this many operations, you should be considering other patterns anyway.
There are obviously practical limitations.
Let us consider a degenerate example: Imagine you had a multi-gigabyte video file and you wanted to run some task on each frame of the video. It would be a poor design to add operations for each frame, up-front, passing it the contents of the relevant frame of the video (because you would effectively be trying to hold the entire video, frame-by-frame, in memory at one time).
But this is not an operation queue limitation. This is just a practical memory limitation. We would generally consider a different pattern. In my hypothetical example, I would consider dispatch_apply, known as concurrentPerform in Swift, that would simply load the relevant frame in a just-in-time manner.
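A sketch of that dispatch_apply pattern for the hypothetical video example (frameCount, loadFrame, and processFrame are made-up placeholders for whatever your real frame source provides):

```objc
// Hypothetical frame-processing loop: each iteration loads its frame
// just-in-time, so at most a handful of frames are in memory at once.
size_t frameCount = 100000;
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

dispatch_apply(frameCount, queue, ^(size_t i) {
    FrameData *frame = loadFrame(i); // placeholder: load frame i on demand
    processFrame(frame);             // frame memory is released as each iteration ends
});
```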
Bottom line, just consider how much memory each operation added to the queue will consume, and use your own good judgment as to whether holding all of them in memory at the same time is reasonable. If you have many operations, each holding a large piece of data in memory, then consider other patterns.

How do you cancel an NSArray sort that is in progress?

NSArray and NSMutableArray offer multiple ways to sort themselves via the sortedArrayUsing... and sortUsing... methods, respectively; however, none of those methods appears to offer a way to terminate a sort after it has been started.
For relatively small arrays, or when the comparison logic is trivial, this is probably not a big deal, but with larger arrays or when the comparison logic is not trivial, I would like to be able to cancel a sort already in process.
Trivial Use Case Example
Sorting a set of results that match based on a user's fuzzy search string. As the user types in the search field, results are fetched on a background thread and sorted before being presented to the user. If the fetch-and-sort operation is not completed before the user changes the search string, then it should be cancelled and a new fetch-and-sort operation started. The problem is that if the fetch-and-sort operation has already reached the sorting stage and called one of the NSArray sort methods above, then there's no way to cancel it. Instead, the next fetch-and-sort operation is left waiting for the now stale sort operation to complete.
So far, I've come up with two possible solutions but neither seems all that elegant.
Attempted Solution #1
Allow newer fetch-and-sort operations to start before any stale fetch-and-sort operations are finished. I just keep track of which is the latest operation using some internal state and as the operations complete, if they aren't the primary operation, then their results are discarded.
This works, but it can quickly result in multiple outstanding sorting operations all running concurrently, whether they need to be or not. This can be somewhat mitigated by throttling the maximum number of concurrent operations, but then I'm just adding an arbitrary limit. Pending, stale operations can be cancelled before they get executed, but I'm still left with situations where sorting work is being done when it doesn't need to be.
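Solution #1 can be sketched with a simple generation counter (the names here are illustrative, not from the original code):

```objc
// Each new search bumps the generation; stale completions compare and bail.
@property (atomic) NSUInteger searchGeneration;

- (void)searchFor:(NSString *)query {
    NSUInteger generation = self.searchGeneration + 1;
    self.searchGeneration = generation;
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        NSArray *results = [self fetchAndSortResultsFor:query]; // placeholder
        dispatch_async(dispatch_get_main_queue(), ^{
            if (generation != self.searchGeneration) return; // stale, discard
            [self presentResults:results]; // placeholder
        });
    });
}
```

The downside is exactly as described above: a stale sort still runs to completion before its result is thrown away.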
Attempted Solution #2
Roll my own quick sort or merge sort implementation and add an isCancelled flag to those routines so that they can quickly unwind and terminate. This is working, and working fairly well, but when the sorting operation doesn't need to be cancelled, the run time is about 15-20% slower than using one of the NSArray methods. Part of this, I imagine, is the overhead of calling methods like objectAtIndex: and exchangeObjectAtIndex:withObjectAtIndex:, which I assume the internal sorting routines can bypass depending on how the NSArray is storing the objects internally.
It also feels wrong to be rolling my own sorting implementations in 2015 against something like AppKit and NSArray.
Semi-Attempted Solutions
Keeping a previously sorted array around and re-using that for filtering: This doesn't really work for what I'm trying to do so for sake of discussion, assume that the array I have to sort on is always unsorted and has no relationship to the previously sorted array.
Moving away from NSArray and back to C-style arrays: This works pretty well and the performance is quite good, but I'm left playing a bunch of games with ARC, and the complexity of the overall implementation is significantly higher because at the end of the day I'm always dealing with NSObjects. There's also a non-zero cost of going back and forth between NSArray and C-style arrays.
Summary
So, all of that to get back to the original question: "How do you cancel an in-progress NSArray sorting method?"
Tech Note
For those that are curious why this is a problem to begin with, I'm attempting to sort somewhere between 500,000 to 1,000,000 strings using compare methods like localizedStandardCompare, which is dramatically slower than just a straight NSString compare. The runtime difference between the various sortUsing... methods is relatively insignificant when compared to the total time to sort.
Starting where you end:
So, all of that to get back to the original question: "How do you cancel an in-progress NSArray sorting method?"
You don't. Cancellation isn't supported and anything you come up with is bound to be fragile.
So back to what you've done:
Roll my own quick sort or merge sort implementation and add an isCancelled flag to those routines so that they can quickly unwind and terminate. This is working, and working fairly well, but when the sorting operation doesn't need to be cancelled, the run time is about 15-20% slower than using one of the NSArray methods.
This is the way to go in this case, you just need to work on that slowdown...
You might be right, part of the slowdown might be the need to call methods for indexing and exchanging elements. Have you tried caching C function pointers to the common methods you require? If at the start of a sort you obtain direct C function pointers to objectAtIndex: et al. using the Objective-C runtime function class_getMethodImplementation() you can replace all the calls to method lookup with simple indirection.
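That IMP-caching idea might look something like the following sketch (it uses methodForSelector:, which wraps the same runtime lookup as class_getMethodImplementation(); the loop indices and comparator are placeholders):

```objc
// Cache direct C function pointers once, before the sort's hot loop.
typedef id   (*ObjectAtIndexIMP)(id, SEL, NSUInteger);
typedef void (*ExchangeIMP)(id, SEL, NSUInteger, NSUInteger);

SEL objAtIdxSel = @selector(objectAtIndex:);
SEL exchangeSel = @selector(exchangeObjectAtIndex:withObjectAtIndex:);

ObjectAtIndexIMP objAtIdx = (ObjectAtIndexIMP)[array methodForSelector:objAtIdxSel];
ExchangeIMP exchange      = (ExchangeIMP)[array methodForSelector:exchangeSel];

// Inside the sort loop, call through the pointers, skipping objc_msgSend dispatch:
id left  = objAtIdx(array, objAtIdxSel, i);
id right = objAtIdx(array, objAtIdxSel, j);
if (comparator(left, right) == NSOrderedDescending) {
    exchange(array, exchangeSel, i, j);
}
```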
If such manipulations fail, then maybe look at C arrays again. Since NSArray is toll-free bridged to CFArrayRef, you can use CFArrayGetValues to copy the elements out into a malloc'ed C array, sort that, and then use CFArrayCreate to get back to an NSArray. Provided you are careful and do not mutate the array you are sorting, you can probably handle memory management by doing nothing: the elements remain retained by the original array, and creating the new array retains them once more. Sorting the C array will be faster, but the extraction and creation are O(N) operations on top of the sort.
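A hedged sketch of that round trip (ARC bridging casts added; qsort_b stands in for your own cancellable sort, which is the real point of the exercise):

```objc
#include <stdlib.h>

CFArrayRef cfArray = (__bridge CFArrayRef)nsArray;
CFIndex count = CFArrayGetCount(cfArray);

// Copy the (already retained) element pointers into a plain C buffer.
const void **values = malloc(sizeof(void *) * count);
CFArrayGetValues(cfArray, CFRangeMake(0, count), values);

// Sort the raw pointers; a hand-rolled sort here could check a
// cancellation flag and unwind early.
qsort_b(values, count, sizeof(void *), ^int(const void *a, const void *b) {
    NSString *s1 = (__bridge NSString *)*(const void **)a;
    NSString *s2 = (__bridge NSString *)*(const void **)b;
    return (int)[s1 localizedStandardCompare:s2];
});

// Rebuild an NSArray; CFArrayCreate retains each element once more.
NSArray *sorted = CFBridgingRelease(
    CFArrayCreate(NULL, values, count, &kCFTypeArrayCallBacks));
free(values);
```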
HTH
After several days of testing, I've opted to go with a custom, in-place merge sort implementation that accepts a boolean flag to trigger a cancellation.
A few follow-up points for those interested:
The raw performance of my merge sort implementation still lags somewhat behind the raw performance of the NSArray sortUsingComparator method. Instruments indicates that NSArray is using a merge sort as well, so I suspect the performance difference can be attributed to a more tuned implementation by Apple than I came up with and the ability to directly access NSArray's internals. NSArray's implementation took about 28 seconds to sort 1,000,000 strings using localizedStandardCompare as compared to 31.5 seconds for mine. (MacBook Air 2013)
Converting an NSMutableArray to a C array of objects did not yield enough of a performance improvement to warrant the added complexity. The sort time was only reduced by 0.5 to 1.0 seconds. Nice to have, but still dwarfed by the time spent in localizedStandardCompare. For much smaller input arrays (100,000 elements), the speed difference was almost negligible. This surprised me, but Instruments shows that all of the "overhead" of using an NSMutableArray is mostly noise compared to the sort operation itself.
Parallelizing the merge function and farming out the tasks via GCD yielded a noticeable improvement of 6.0 to 7.0 seconds, reducing the total sort time to less than what NSArray's sortUsingComparator was taking. Tuning the job count and stride length based on input array size could offer further improvements (albeit minor ones at this stage).
Ultimately, a parallelized and cancelable implementation is proving to offer the best user experience for what I have in mind.
Firstly, I think the common way to handle the problem you mention is not to cancel the sort, but to add a delay before the fetch/sort operation begins.
Users usually type in short bursts, so add a delay of x seconds (e.g. 0.5 s) before the fetch and sort actually begin.
Example:
User types 'a'.
Start a x second timer.
Before timer expires user types 'b'
Invalidate old timer and start a new one with x seconds.
Timer expires, start fetch and sort operation.
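The steps above can be sketched with NSTimer (the property and selector names are illustrative):

```objc
// Restart the timer on every keystroke; only the last one fires.
- (void)searchTextDidChange:(NSString *)text {
    [self.searchTimer invalidate];
    self.searchTimer = [NSTimer scheduledTimerWithTimeInterval:0.5
                                                        target:self
                                                      selector:@selector(startFetchAndSort:)
                                                      userInfo:text
                                                       repeats:NO];
}

- (void)startFetchAndSort:(NSTimer *)timer {
    NSString *query = timer.userInfo;
    // kick off the fetch-and-sort operation for query
}
```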
Hope this helps.
Instead of implementing your own sorting algorithm (which checks for cancellation), you can implement your own comparator, which checks the cancellation condition and throws an exception to interrupt the NSArray sortUsing... call.
The call to NSArray sortUsing... should be enclosed in a @try/@catch block:
- (void)testInterruptSort
{
    NSArray *a = @[@"beta", @"alpha", @"omega", @"foo", @"bar"];
    NSArray *sorted;
    BOOL interrupted = NO;
    NSString * const myException = @"User";

    @try {
        int __block n = 0;
        sorted = [a sortedArrayUsingComparator:^NSComparisonResult(NSString *s1, NSString *s2) {
            n++;
            if (/* cancel condition */ (1) && (n > 5)) {
                NSException *e = [NSException exceptionWithName:myException
                                                         reason:@"interrupted"
                                                       userInfo:nil];
                [e raise];
            }
            return [s1 localizedStandardCompare:s2];
        }];
    }
    @catch (NSException *exception) {
        // should check if this is the "User" exception
        // see https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/Exceptions/Tasks/HandlingExceptions.html
        interrupted = YES;
    }
    @finally {
    }

    NSLog(@"interrupted: %@, result = %@\n", interrupted ? @"YES" : @"NO", sorted);
}

Best practice for writing to resource from two different processes in objective-c

I have a general objective-c pattern/practice question relative to a problem I'm trying to solve with my app. I could not find a similar objective-c focused question/answer here, yet.
My app holds a mutable array of objects which I call "Records". The app gathers records and puts them into that array in one of two ways:
It reads data from a SQLite database available locally within the App's sand box. The read is usually very fast.
It requests data asynchronously from a web service, waits for it to finish then parses the data. The read can be fast, but often it is not.
Sometimes the app reads from the database (1) and requests data from the web service (2) at essentially the same time. It is often the case that (1) will finish before (2) finishes and adding Records to the mutable array does not cause a conflict.
I am worried that at some point my SQLite read process will take a bit longer than expected and it will try to add objects to the mutable array at the exact same time the async request finishes and does the same; or vice-versa. These are edge cases that seem difficult to test for but that surely would make my app crash or at the very least cause issues with my array of records.
I should also point out that the Records are to be merged into the mutable array. For example: if (1) runs first and returns 10 records, then shortly after (2) finishes and returns 5 records, my mutable array will contain all 15 records. I'm combining the data rather than overwriting it.
What I want to know is:
Is it safe for me to add objects to the same mutable array instance when the processes, either (1) or (2) finish?
Is there a good pattern/practice to implement for this sort of processing in objective-c?
Does this involve locking access to the mutable array so when (1) is adding objects to it (2) can't add any objects until (1) is done with it?
I appreciate any info you could share.
[EDIT #1]
For posterity, I found this URL to be a great help in understanding how to use NSOperations and an NSOperationQueue. It is a bit out of date, but works, none the less:
http://www.raywenderlich.com/19788/how-to-use-nsoperations-and-nsoperationqueues
Also, It doesn't talk specifically about the problem I'm trying to solve, but the example it uses is practical and easy to understand.
[EDIT #2]
I've decided to go with the approach suggested by danh, where I'll read locally and, as needed, hit my web service after the local read finishes (which should be fast anyway). That said, I'm going to try to avoid synchronization issues altogether. Why? Because Apple says so, here:
http://developer.apple.com/library/IOS/#documentation/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html#//apple_ref/doc/uid/10000057i-CH8-SW8
Avoid Synchronization Altogether
For any new projects you work on, and even for existing projects, designing your code and data structures to avoid the need for synchronization is the best possible solution. Although locks and other synchronization tools are useful, they do impact the performance of any application. And if the overall design causes high contention among specific resources, your threads could be waiting even longer.
The best way to implement concurrency is to reduce the interactions and inter-dependencies between your concurrent tasks. If each task operates on its own private data set, it does not need to protect that data using locks. Even in situations where two tasks do share a common data set, you can look at ways of partitioning that set or providing each task with its own copy. Of course, copying data sets has its costs too, so you have to weigh those costs against the costs of synchronization before making your decision.
Is it safe for me to add objects to the same mutable array instance when the processes, either (1) or (2) finish?
Absolutely not. NSMutableArray, along with the rest of the collection classes, is not synchronized. You can use them in conjunction with some kind of lock when you add and remove objects, but that's definitely a lot slower than just making two arrays (one for each operation) and merging them when they both finish.
Is there a good pattern/practice to implement for this sort of processing in objective-c?
Unfortunately, no. The most you can come up with is tripping a Boolean, or incrementing an integer to a certain number in a common callback. To see what I mean, here's a little pseudo-code:
- (void)someAsyncOpDidFinish:(NSSomeOperation *)op {
    finishedOperations++;
    if (finishedOperations == 2) {
        finishedOperations = 0;
        // Both are finished, process
    }
}
Does this involve locking access to the mutable array so when (1) is adding objects to it (2) can't add any objects until (1) is done with it?
Yes, see above.
You should either lock around your array modifications, or schedule your modifications in the main thread. The SQL fetch is probably running in the main thread, so in your remote fetch code you could do something like:
dispatch_async(dispatch_get_main_queue(), ^{
    [myArray addObject:newThing];
});
If you are adding a bunch of objects this will be slow since it is putting a new task on the scheduler for each record. You can bunch the records in a separate array in the thread and add the temp array using addObjectsFromArray: if that is the case.
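That batching suggestion might look like this sketch (myArray and fetchedRecords are placeholders):

```objc
// On the background thread: accumulate into a local array first.
NSMutableArray *batch = [NSMutableArray array];
for (id record in fetchedRecords) {
    [batch addObject:record]; // or parse/transform each record here
}

// One hop to the main queue for the whole batch, instead of one per record.
dispatch_async(dispatch_get_main_queue(), ^{
    [myArray addObjectsFromArray:batch];
});
```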
Personally, I'd be inclined to have a concurrent NSOperationQueue and add the two retrieval operations to it: one for the database operation, one for the network operation. I would then have a dedicated serial queue for adding records to the NSMutableArray, which each of the two concurrent retrieval operations would use when appending records. That way you have one queue feeding the array, fed by the two retrieval operations running on the concurrent queue. If you need to know when the two retrieval operations are done, add a third operation to that concurrent queue and set its dependencies to the two retrieval operations; it will fire automatically when both retrieval operations are done.
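A hedged sketch of that arrangement (the fetch methods are placeholders for your real database and web-service code):

```objc
// Serial queue guards every mutation of the shared array.
dispatch_queue_t arrayQueue =
    dispatch_queue_create("com.example.records", DISPATCH_QUEUE_SERIAL);
NSMutableArray *records = [NSMutableArray array];

NSOperationQueue *retrievalQueue = [[NSOperationQueue alloc] init]; // concurrent

NSBlockOperation *dbOp = [NSBlockOperation blockOperationWithBlock:^{
    NSArray *dbRecords = [self fetchRecordsFromDatabase]; // placeholder
    dispatch_async(arrayQueue, ^{ [records addObjectsFromArray:dbRecords]; });
}];

NSBlockOperation *webOp = [NSBlockOperation blockOperationWithBlock:^{
    NSArray *webRecords = [self fetchRecordsFromWebService]; // placeholder
    dispatch_async(arrayQueue, ^{ [records addObjectsFromArray:webRecords]; });
}];

// Fires only after both retrievals have finished; the serial queue's FIFO
// order guarantees both merges have landed before this block runs.
NSBlockOperation *doneOp = [NSBlockOperation blockOperationWithBlock:^{
    dispatch_async(arrayQueue, ^{
        NSLog(@"All %lu records merged", (unsigned long)records.count);
    });
}];
[doneOp addDependency:dbOp];
[doneOp addDependency:webOp];

[retrievalQueue addOperations:@[dbOp, webOp, doneOp] waitUntilFinished:NO];
```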
In addition to the good suggestions above, consider not launching the GET and the sql concurrently.
[self doTheLocalLookupThen:^{
    // update the array and ui
    [self doTheServerGetThen:^{
        // update the array and ui
    }];
}];

- (void)doTheLocalLookupThen:(void (^)(void))completion {
    if ([self skipTheLocalLookup]) return completion();
    // do the local lookup, invoke completion
}

- (void)doTheServerGetThen:(void (^)(void))completion {
    if ([self skipTheServerGet]) return completion();
    // do the server get, invoke completion
}

How to programmatically control and balance the number of threads an iOS app is executing?

How do I control and balance the number of threads my app is executing, and how do I limit their number so the app doesn't block because the thread limit is reached?
Here on SO I saw the following possible answer: "Main concurrent queue (dispatch_get_global_queue) manages the number of threads automatically" which I don't like for the following reason:
Consider the following pattern (in my real app there are both more simple and more complex examples):
dispatch_queue_t defaultBackgroundQueue() {
    return dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
}

dispatch_queue_t databaseQueue() {
    return dispatch_queue_create("Database private queue", 0);
}

dispatch_async(defaultBackgroundQueue(), ^{
    [AFNetworkingAsynchronousRequestWithCompletionHandler:^(data) {
        dispatch_async(databaseQueue(), ^{
            // data is about 100-200 elements to parse
            for (el in data) {
            }
            // maybe more AFNetworking requests and/or processing in other queues, or:
            dispatch_async(dispatch_get_main_queue(), ^{
                // At last! We can do something on UI.
            });
        });
    }];
});
This design very often leads to situations where:
the app locks up because the thread limit is reached (something like > 64 threads);
the slower, and thus narrower, queues can be overwhelmed with a large number of pending jobs;
the second point also produces a cancellation problem: if we have 100 jobs already waiting for execution in a serial queue, we can't cancel them all at once.
The obvious and dumb solution would be to replace sensitive dispatch_async methods with dispatch_sync, but it is definitely the one I don't like.
What is recommended approach for this kind of situations?
I hope an answer more smart than just "Use NSOperationQueue - it can limit the number of concurrent operations" does exist (similar topic: Number of threads with NSOperationQueueDefaultMaxConcurrentOperationCount).
UPDATE 1: The only decent pattern I see is to replace all dispatch_async calls of blocks to concurrent queues with those blocks wrapped in NSOperations, added to NSOperationQueue-based concurrent queues with a maximum operations limit set (in my case, perhaps also setting a max operations limit on the NSOperationQueue that AFNetworking uses to run all its operations).
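That pattern might be sketched as follows (the limit of 4 is an arbitrary assumption to tune):

```objc
// One shared, limited queue replaces unbounded dispatch_async calls.
NSOperationQueue *backgroundQueue = [[NSOperationQueue alloc] init];
backgroundQueue.maxConcurrentOperationCount = 4;

// Instead of dispatch_async(defaultBackgroundQueue(), block):
[backgroundQueue addOperationWithBlock:^{
    // background work here; at most 4 of these run at once,
    // so the worker-thread pool can't explode past the limit
}];

// Pending (not yet started) work can also be cancelled wholesale:
[backgroundQueue cancelAllOperations];
```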
You are starting too many network requests. AFAIK it's not documented anywhere, but you can run up to 6 simultaneous network connections (which is a sensible number considering RFC 2616 section 8.1.4, paragraph 6). After that you get locking, and GCD compensates by creating more threads, which, by the way, have a stack of 512 KB each with pages allocated on demand. So yes, use NSOperation for this. I use it to queue network requests, increase the priority when the same object is requested again, and pause and serialize to disk if the user leaves. I also monitor the speed of the network requests in bytes/time and adjust the number of concurrent operations.
While I don't see from your example where exactly you're creating "too many" background threads, I'll just try to answer the question of how to control the exact number of threads per queue. Apple's documentation says:
Concurrent queues (also known as a type of global dispatch queue) execute one or more tasks concurrently, but tasks are still started in the order in which they were added to the queue. The currently executing tasks run on distinct threads that are managed by the dispatch queue. The exact number of tasks executing at any given point is variable and depends on system conditions.
While you can now (since iOS5) create concurrent queues manually, there is no way to control how many jobs will be run concurrently by such a queue. The OS will balance the load automatically. If, for whatever reason, you don't want that, you could for example create a set of n serial queues manually and dispatch new jobs to one of your n queues at random:
NSArray *queues = @[dispatch_queue_create("com.myapp.queue1", 0),
                    dispatch_queue_create("com.myapp.queue2", 0),
                    dispatch_queue_create("com.myapp.queue3", 0)];

NSUInteger randQueue = arc4random() % [queues count];
dispatch_async([queues objectAtIndex:randQueue], ^{
    NSLog(@"Do something");
});

randQueue = arc4random() % [queues count];
dispatch_async([queues objectAtIndex:randQueue], ^{
    NSLog(@"Do something else");
});
I'm by no means endorsing this design - I think concurrent queues are pretty good at balancing system resources. But since you asked, I think this is a feasible approach.

Massive parallel computation ios

I have a method that performs a mathematical operation repeatedly (possibly millions of times) with different data. What is the best way to do this in iOS (it will run on iPad devices)? I understand that performSelectorInBackground:withObject: is deprecated... ? I also need to aggregate all the results in an NSArray. The best way seems to be: post a notification to the notification center and add the method as an observer. Is this correct? The array will need to be declared as atomic, I believe... Plus I will need to show a progress bar as the operations complete... How many threads can I start in parallel? I don't think starting 1,000,000 threads is such a good idea on an iDevice..
Thanks in advance...
Look into Grand Central Dispatch, it's the preferred way to do multi-threading on iOS (and Mac).
A simple example of using GCD would look like:
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
dispatch_async(queue, ^{
    // do long running task here
});
This will execute a block asynchronously, off the main thread. GCD has numerous other ways of dispatching tasks; one taken directly from the Wikipedia article on Grand Central Dispatch is:
dispatch_apply(count, dispatch_get_global_queue(0, 0), ^(size_t i) {
    results[i] = do_work(data, i);
});
total = summarize(results, count);
This particular code sample is probably exactly what you're looking for, assuming this "large task" of yours is embarrassingly parallel.
While you could use dispatch_apply() and spin off all of the runs simultaneously, that'll end up being slower.
You'll want to throttle the number of runs in flight, with the number of simultaneous computations being something you'll need to tune.
I've often used a dispatch_semaphore_t to allow for easy tuning of the number of in-flight computations.
Details of doing so are in an answer here: https://stackoverflow.com/a/4535110/25646
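A hedged sketch of that semaphore throttle (the width of 4 is an assumption to tune, and do_work/results are placeholders carried over from the earlier example):

```objc
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
dispatch_group_t group = dispatch_group_create();

// Allow at most 4 computations in flight at once.
dispatch_semaphore_t throttle = dispatch_semaphore_create(4);

for (size_t i = 0; i < runCount; i++) {
    dispatch_semaphore_wait(throttle, DISPATCH_TIME_FOREVER); // block until a slot frees
    dispatch_group_async(group, queue, ^{
        results[i] = do_work(data, i);       // placeholder work function
        dispatch_semaphore_signal(throttle); // release the slot
    });
}
dispatch_group_wait(group, DISPATCH_TIME_FOREVER); // all runs finished
```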
