My intent is to understand the “cooperative thread pool” used by Swift 5.5’s async-await, and how task groups automatically constrain the degree of concurrency: Consider the following task group code, doing 32 calculations in parallel:
func launchTasks() async {
    await withTaskGroup(of: Void.self) { group in
        for i in 0 ..< 32 {
            group.addTask { [self] in
                let value = await doSomething(with: i)
                // do something with `value`
            }
        }
    }
}
While I expected it to constrain the degree of concurrency, as advertised, I'm only getting two (!) concurrent tasks at a time, which is far more constrained than I would have expected:
If I use the old GCD concurrentPerform ...
func launchTasks2() {
    DispatchQueue.global().async {
        DispatchQueue.concurrentPerform(iterations: 32) { [self] i in
            let value = doSomething(with: i)
            // do something with `value`
        }
    }
}
... I get twelve at a time, taking full advantage of the device (iOS 15 simulator on my 6-core i9 MacBook Pro) while avoiding thread-explosion:
(FWIW, both of these were profiled in Xcode 13.0 beta 1 (13A5154h) running on Big Sur. And please disregard the minor differences in the individual “jobs” in these two runs, as the function in question is just spinning for a random duration; the key observation is the degree of concurrency is what we would have expected.)
It is excellent that this new async-await (and task groups) automatically limits the degree of parallelism, but the cooperative thread pool of async-await is far more constrained than I would have expected. And I see no way to adjust the parameters of that pool. How can we better take advantage of our hardware while still avoiding thread explosion (without resorting to old techniques like non-zero semaphores or operation queues)?
It looks like this curious behavior is a limitation of the simulator. If I run it on my physical iPhone 12 Pro Max, the async-await task group approach results in 6 concurrent tasks ...
... which is essentially the same as the concurrentPerform behavior:
The behavior, including the degree of concurrency, is essentially the same on the physical device.
One is left to infer that the simulator is configured to constrain async-await more tightly than what is achievable with direct GCD calls. But on actual physical devices, the async-await task group behaves as one would expect.
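As an aside, the six concurrent tasks on the iPhone 12 Pro Max line up with its six cores; my inference (not something the profiles above prove) is that the cooperative pool generally sizes itself to the active core count, which is easy to sanity-check:
import Foundation

// Cooperative pool width appears to track the number of active cores (my assumption):
print("Active cores:", ProcessInfo.processInfo.activeProcessorCount)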
Related
Despite my best efforts to make CoreML's MLModel process its predictions in parallel, it seems like, under the hood, Apple is forcing it to run in a serial, one-by-one manner.
I made a public repository with a PoC reproducing the issue:
https://github.com/SocialKitLtd/coreml-concurrency-issue.
What I have tried:
Re-creating the MLModel every time instead of using a global instance
Using only the .cpuAndGPU configuration
What I'm trying to achieve:
I'm trying to utilize multithreading to process a bunch of video frames at the same time (assuming the CPU/RAM can take it) faster than the one-by-one strategy.
Code (Also presented in the repository):
class ViewController: UIViewController {

    override func viewDidLoad() {
        super.viewDidLoad()

        let parallelTaskCount = 3

        for i in 0 ..< parallelTaskCount {
            DispatchQueue.global(qos: .userInteractive).async {
                let image = UIImage(named: "image.jpg")!
                self.runPrediction(index: i, image: image)
            }
        }
    }

    func runPrediction(index: Int, image: UIImage) {
        let conf = MLModelConfiguration()
        conf.computeUnits = .cpuAndGPU
        conf.allowLowPrecisionAccumulationOnGPU = true
        let myModel = try! MyModel(configuration: conf)
        let myModelInput = try! MyModelInput(LR_inputWith: image.cgImage!)
        // Prediction
        let prediction = try! myModel.prediction(input: myModelInput)
        print("finished processing \(index)")
    }

}
Any help will be highly appreciated.
When you employ parallel execution on the CPU, you can achieve significant performance gains on CPU-bound calculations only. But CoreML is not CPU-bound. When you leverage the GPU (e.g., with .cpuAndGPU), you will not achieve the same sort of CPU-parallelism-driven performance gains that you see with CPU-only calculations.
That having been said, using the GPU (or the Neural Engine) is so much faster than the parallelized CPU rendition that one would generally forgo parallelized CPU calculations altogether and favor the GPU: the non-parallel GPU rendition will often be faster than a parallelized CPU rendition.
Still, when I employed parallelism in GPU-compute tasks, there was some modest performance gain. In my experiments, I saw minor performance benefits (13% and 18% faster when I went from serial to three concurrent operations on an iPhone and an M1 iPad, respectively) running GPU-based CoreML calculations, but slightly more material benefits on a Mac. Just do not expect a dramatic performance improvement.
Profiling with Instruments (by pressing command-i in Xcode or choosing “Product” » “Profile”) can be illuminating. See Recording Performance Data.
First, let us consider the computeUnits of .cpuOnly scenario. Here it is running 20 CoreML prediction calls sequentially (with a maxConcurrentOperationCount of 1):
And, if I switch to the CPU view, I can see that it is jumping between two performance cores on my iPhone 12 Pro Max:
That makes sense. OK, now if we change the maxConcurrentOperationCount to 3, the overall processing time (the processAll function) drops from 5 to 3½ minutes:
And when I switch to the CPU view to see what is going on, it looks like it started running on both performance cores in parallel, but then switched to some of the efficiency cores (probably because the thermal state of the device was getting stressed, which explains why we did not achieve anything close to 2× performance):
So, when doing CPU-only CoreML calculations, parallel execution can yield significant benefits. That having been said, the CPU-only calculations are much slower than the GPU calculations.
When I switched to .cpuAndGPU, the difference maxConcurrentOperationCount of 1 vs 3 was far less pronounced, taking 45 seconds when allowing three concurrent operations and 50 seconds when executing serially. Here it is running three in parallel:
And sequentially:
But in contrast to the .cpuOnly scenarios, you can see in the CPU track that the CPUs are largely idle. Here is the latter with the CPU view to show the details:
So, one can see that letting the predictions run on multiple CPU cores does not achieve much of a performance gain, as this work is not CPU-bound, but rather is constrained by the GPU.
Here is my code for the above. Note, I used OperationQueue as it provides a simple mechanism to control the degree of concurrency (via its maxConcurrentOperationCount):
import os.log
private let poi = OSLog(subsystem: "Test", category: .pointsOfInterest)
and
func processAll() {
    let parallelTaskCount = 20

    let queue = OperationQueue()
    queue.maxConcurrentOperationCount = 3          // or try `1`

    let id = OSSignpostID(log: poi)
    os_signpost(.begin, log: poi, name: #function, signpostID: id)

    for i in 0 ..< parallelTaskCount {
        queue.addOperation {
            let image = UIImage(named: "image.jpg")!
            self.runPrediction(index: i, image: image, shouldAddCounter: true)
        }
    }

    queue.addBarrierBlock {
        os_signpost(.end, log: poi, name: #function, signpostID: id)
    }
}

func runPrediction(index: Int, image: UIImage, shouldAddCounter: Bool = false) {
    let id = OSSignpostID(log: poi)
    os_signpost(.begin, log: poi, name: #function, signpostID: id, "%d", index)
    defer { os_signpost(.end, log: poi, name: #function, signpostID: id, "%d", index) }

    let conf = MLModelConfiguration()
    conf.computeUnits = .cpuAndGPU                 // contrast to `.cpuOnly`
    conf.allowLowPrecisionAccumulationOnGPU = true
    let myModel = try! MyModel(configuration: conf)
    let myModelInput = try! MyModelInput(LR_inputWith: image.cgImage!)

    // Prediction
    let prediction = try! myModel.prediction(input: myModelInput)
    os_signpost(.event, log: poi, name: "finished processing", "%d %@", index, prediction.featureNames.description)
}
Note, above I have focused on CPU usage. You can also use the “Core ML” template in Instruments. E.g., here are the Points of Interest and the CoreML tracks next to each other on my M1 iPad Pro (with maxConcurrentOperationCount set to 2 to keep it simple):
At first glance, it looks like CoreML is processing these in parallel, but if I run it again with a maxConcurrentOperationCount of 1 (i.e., serially), the time for those individual compute tasks is shorter, which suggests that in the parallel scenario there is some GPU-related contention.
Anyway, in short, you can use Instruments to observe what is going on. One can achieve significant performance improvements through parallel processing only for CPU-bound tasks; anything requiring the GPU or Neural Engine will be further constrained by that hardware.
I have a question about something that other people usually try to avoid.
I am building an application that measures battery discharge.
My plan is to simulate high CPU usage and then measure the time it takes the battery to drop to a certain level.
How can I cause high CPU usage on purpose, but without blocking the UI?
Can I do something like this?
DispatchQueue.global(qos: .background).async { [weak self] in
    guard let self = self else { return }
    for _ in 0..<Int.max {
        while self.isRunning {
        }
        break
    }
}
What you want is a persistent load on the CPU, with the number of threads concurrently loading the CPU >= the number of CPU cores.
So, something like:
DispatchQueue.concurrentPerform(iterations: 100) { iteration in
    for _ in 1...10000000 {
        let a = 1 + 1
    }
}
Where:
The concurrentPerform with iterations set to 100 makes sure we are running in parallel on every available thread. This is overkill, of course; 4 threads would be enough to keep every core busy on a quad-core device, and about 10 threads is roughly the maximum iOS typically allocates per process, but 100 simply makes sure it really happens.
The 1...10000000 makes the loop run for a long time.
The let a = 1 + 1 gives the CPU something to do.
On my iPhone 8 simulator, running this code produced a picture like this (I stopped it after about 30 seconds):
Careful though! You may overheat your device
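If you want something you can switch off (and keep an eye on the device's thermals while it runs), here is a sketch of a variant; the LoadGenerator type and its isRunning flag are my own additions, mirroring the flag in the question's snippet:
import Foundation

final class LoadGenerator {
    // Mirrors the `isRunning` flag from the question; make this thread-safe in a real app.
    var isRunning = true

    func start() {
        let cores = ProcessInfo.processInfo.activeProcessorCount
        DispatchQueue.global(qos: .utility).async {
            // One worker per core is enough to saturate the CPU.
            DispatchQueue.concurrentPerform(iterations: cores) { _ in
                var x = 0.0
                while self.isRunning {
                    x += 1            // trivial work to keep the core busy
                }
                _ = x                 // silence the "written but never read" warning
            }
            // ProcessInfo also reports thermal pressure, worth watching given the overheating risk.
            print("load stopped; thermal state:", ProcessInfo.processInfo.thermalState.rawValue)
        }
    }
}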
I am trying to process a series of UIImages using Core Image & Metal and also display them. My problem is that I want to drop the incoming image if my GCD block is busy. How do I achieve this with GCD queues? How do I define the maximum buffer size of a queue?
There’s no native mechanism for this, but you can achieve what you want algorithmically with semaphores.
Before I dive into the “process 4 at a time, but discard any that come if we’re busy” scenario, let me first consider the simpler “process all, but not more than 4 at any given time” pattern. (I’m going to answer your question below, but building on this simpler situation.)
For example, let’s imagine that you had some preexisting array of objects and you want to process them concurrently, but not more than four at any given time (perhaps to minimize peak memory usage):
DispatchQueue.global().async {
    let semaphore = DispatchSemaphore(value: 4)
    for object in objects {
        semaphore.wait()
        processQueue.async {
            self.process(object)
            semaphore.signal()
        }
    }
}
Basically, the wait function will, as the documentation says, “Decrement the counting semaphore. If the resulting value is less than zero, this function waits for a signal to occur before returning.”
We’re starting our semaphore with a count of 4, so if objects had 10 items in it, the first four would start immediately, but the fifth wouldn’t start until one of the earlier ones finished and sent a signal (which increments the semaphore counter back up by 1), and so on, achieving a “run concurrently, but a max of 4 at any given time” behavior.
So, let’s return to your question. Let’s say you wanted to process no more than four images at a time and drop any incoming image if there were already four images currently being processed. You can accomplish that by telling wait to not really wait at all, i.e., check right .now() whether the semaphore counter has hit zero already, i.e., something like:
let semaphore = DispatchSemaphore(value: 4)
let processQueue = DispatchQueue(label: "com.domain.app.process", attributes: .concurrent)

func submit(_ image: UIImage) {
    if semaphore.wait(timeout: .now()) == .timedOut { return }

    processQueue.async {
        self.process(image)
        self.semaphore.signal()
    }
}
Note, we generally want to avoid blocking the main thread (like wait can do), but because I’m using a timeout of .now(), it will never block; we’re just using the semaphore to keep track of where we are in a nice, thread-safe manner.
One final approach is to consider operation queues:
// create queue that will run no more than four at a time (no semaphores needed; lol)

let processQueue: OperationQueue = {
    let queue = OperationQueue()
    queue.maxConcurrentOperationCount = 4
    return queue
}()

func submit(_ image: UIImage) {
    // cancel all but the last three unstarted operations

    processQueue.operations
        .filter { $0.isReady && !$0.isFinished && !$0.isExecuting && !$0.isCancelled }
        .dropLast(3)
        .forEach { $0.cancel() }

    // now add new operation to the queue

    processQueue.addOperation(BlockOperation {
        self.process(image)
    })
}
The behavior is slightly different (keeping the most recent four images queued up, ready to go), but is something to consider.
GCD queues don't have a maximum queue size.
You can use a semaphore for this. Initialize it with the maximum queue length you want to support. Use dispatch_semaphore_wait() with DISPATCH_TIME_NOW as the timeout to try to reserve a spot before submitting a task to the queue. If it times out, don't enqueue the task (discard it, or whatever). Have the task signal the semaphore when it's complete to release the spot you reserved for it to be used for another task, later.
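For reference, a minimal Swift sketch of that pattern (the queue label and the process(_:) stub here are placeholders of mine):
import UIKit

let gate = DispatchSemaphore(value: 4)    // the maximum "queue size" you want to allow
let processQueue = DispatchQueue(label: "com.example.process", attributes: .concurrent)

func process(_ image: UIImage) {
    // placeholder for your CoreImage/Metal work
}

func submit(_ image: UIImage) {
    // Try to reserve a spot without waiting; if no spot is free, discard the image.
    guard gate.wait(timeout: .now()) == .success else { return }
    processQueue.async {
        process(image)
        gate.signal()                     // release the reserved spot for a later image
    }
}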
There are several questions asking the exact opposite of this, and I don't understand how/why running my app in release mode works, but crashes with an EXC_BAD_ACCESS error in debug mode.
The method that crashes is recursive, and extremely (!!) substantial; as long as there aren't too many recursions, it works fine in both debug (fewer than ~1000 recursions on an iPhone XS, unlimited on the simulator) and release mode (unlimited?).
I'm at a loss as to where to begin debugging this, and I'm wondering if there is some kind of recursion soft-limit in debug builds due to the stack trace or some other unknown. Could it even be down to the cable, since I'm able to run successfully in the simulator without problems?
I should note that Xcode reports the crashes at seemingly random spots, such as property getters that I know are instantiated and valid, in case that helps.
I'm going to go refactor it down into smaller chunks but thought I would post here in case anybody had any ideas about what might be causing this.
See:
https://gist.github.com/ThomasHaz/3aa89cc9b7bda6d98618449c9d6ea1e1
You’re running out of stack memory.
Consider this very simple recursive function to add up integers between 1 and n:
func sum(to n: Int) -> Int {
    guard n > 0 else { return 0 }
    return n + sum(to: n - 1)
}
You’ll find that if you try, for example, summing the numbers between 1 and 100,000, the app will crash in both release and debug builds, but will simply crash sooner in debug builds. I suspect there is just more diagnostic information pushed onto the stack in debug builds, causing it to run out of stack space sooner. In release builds of the above, the stack pointer advanced by 0x20 bytes on each recursive call, whereas debug builds advanced it by 0x80 bytes each time. And if you’re doing anything material in your recursive function, these increments may be larger and the crash may occur with even fewer recursive calls. The stack size on my device (iPhone Xs Max) and on my simulator (Thread.current.stackSize) is 524,288 bytes, and that corresponds to the amount by which the stack pointer is advancing and the maximum number of recursive calls I’m able to achieve. If your device is crashing sooner than the simulator, perhaps your device has less RAM and has therefore allotted a smaller stackSize.
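If you want to see these numbers on your own device, a quick sketch (the 0x80 figure is just what I observed in my debug builds, not a constant):
import Foundation

// Stack size of the current thread, in bytes (524,288 on the devices discussed above).
print("stack size:", Thread.current.stackSize)

// Very rough upper bound on recursion depth, assuming ~0x80 bytes per frame in a debug build.
print("approximate max depth:", Thread.current.stackSize / 0x80)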
Bottom line, you might want to refactor your algorithm to a non-recursive one if you want to enjoy fast performance but don’t want to incur the memory overhead of a huge call stack. As an aside, the non-recursive rendition of the above was an order of magnitude faster than the recursive rendition.
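For example, an iterative rendition of the sum above might look like this (a sketch; in practice you could also just use the closed-form n(n + 1) / 2):
func sum(to n: Int) -> Int {
    guard n > 0 else { return 0 }
    var total = 0
    for value in 1...n {
        total += value
    }
    return total
}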
Alternatively, you can dispatch your recursive calls asynchronously, which eliminates the stack size issues, but introduces GCD overhead. An asynchronous rendition of the above was two to three orders of magnitude slower than the simple recursive rendition and, obviously, yet another order of magnitude slower than the iterative rendition.
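To illustrate what I mean by dispatching the recursive calls asynchronously, here is a sketch (the completion-handler shape is my own, not part of the original routine):
import Foundation

func sum(to n: Int, runningTotal: Int = 0, completion: @escaping (Int) -> Void) {
    guard n > 0 else {
        completion(runningTotal)
        return
    }
    // Each "recursive" step is a fresh dispatch, so the call stack never grows,
    // but every hop pays GCD overhead (hence the orders-of-magnitude slowdown).
    DispatchQueue.global().async {
        sum(to: n - 1, runningTotal: runningTotal + n, completion: completion)
    }
}

// e.g. sum(to: 100_000) { print($0) }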
Admittedly, my simple sum method is so trivial that the overhead of the recursive calls starts to represent a significant portion of the overall computation time, and given that your routine would appear to be more complicated, I suspect the difference will be less stark. Nonetheless, if you want to avoid running out of stack space, I’d simply suggest pursuing a non-recursive rendition.
I’d refer you to the following WWDC videos:
WWDC 2012 iOS App Performance: Memory acknowledges the different types of memory, including stack memory (but doesn’t go into the latter in any great detail);
WWDC 2018 iOS Memory Deep Dive is a slightly more contemporary version of the above video; and
WWDC 2015 Profiling in Depth touches upon tail-recursion optimization.
It’s worth noting that deeply recursive routines don’t always have to consume a large stack. Notably, we can sometimes employ tail-recursion, where our recursive call is the very last call that is made. E.g., my snippet above does not employ a tail call because it’s adding n to the value returned by the recursive call. But we can refactor it to pass the running total, thereby ensuring that the recursive call is a true “tail call”:
func sum(to n: Int, previousTotal: Int = 0) -> Int {
    guard n > 0 else { return previousTotal }
    return sum(to: n - 1, previousTotal: previousTotal + n)
}
Release builds are smart enough to optimize this tail-recursion (through a process called “tail call optimization”, TCO, also known as “tail call elimination”), mitigating the stack growth for the recursive calls. WWDC 2015 Profiling in Depth, while on a different topic (the time profiler), shows exactly what’s happening when tail calls are optimized.
The net effect is that if your recursive routine is employing tail calls, release builds can use tail call elimination to mitigate stack memory issues, but debug (non-optimized) builds will not do this.
EXC_BAD_ACCESS usually means that you are trying to access an object which is no longer in memory or was probably not properly initialized.
Check in your code whether you are accessing your dictionary variable after it has somehow been removed. Is your variable properly initialized? You might have declared the variable but never initialized it before accessing it.
There could be a ton of reasons, and I can't say much without seeing any code.
Try turning on zombie objects (NSZombie); this might provide you more debug information. Refer to: How to enable NSZombie in Xcode?
If you would like to know where and when exactly the error is occurring, you could check for memory leaks using Instruments. This might be helpful: http://www.raywenderlich.com/2696/instruments-tutorial-for-ios-how-to-debug-memory-leaks
I'm trying to optimize a function (an FFT) on iOS, and I've set up a test program to time its execution over several hundred calls. I'm using mach_absolute_time() before and after the function call to time it. I'm doing the tests on an iPod touch 4th generation running iOS 6.
Most of the timing results are roughly consistent with each other, but occasionally one run will take much longer than the others (as much as 100x longer).
I'm pretty certain this has nothing to do with my actual function. Each run has the same input data, and is a purely numerical calculation (i.e. there are no system calls or memory allocations). I can also reproduce this if I replace the FFT with an otherwise empty for loop.
Has anyone else noticed anything like this?
My current guess is that my app's thread is somehow being interrupted by the OS. If so, is there any way to prevent this from happening? (This is not an app that will be released on the App Store, so non-public APIs would be OK for this.)
I no longer have an iOS 5.x device, but I'm pretty sure this was not happening prior to the update to iOS 6.
EDIT:
Here's a simpler way to reproduce:
for (int i = 0; i < 1000; ++i)
{
    uint64_t start = mach_absolute_time();
    for (int j = 0; j < 1000000; ++j);
    uint64_t stop = mach_absolute_time();
    printf("%llu\n", stop - start);
}
Compile this in debug (so the for loop is not optimized away) and run; most of the values are around 220000, but occasionally a value is 10 times larger or more.
In my experience, mach_absolute_time is not reliable. Now I use CFAbsoluteTime instead. It returns the current time in seconds, with much finer than one-second precision.
const CFAbsoluteTime newTime = CFAbsoluteTimeGetCurrent();
mach_absolute_time() is actually very low level and reliable. It runs at a steady 24MHz on all iOS devices, from the 3GS to the iPad 4th gen. It's also the fastest way to get timing information, taking between 0.5µs and 2µs depending on CPU. But if you get interrupted by another thread, of course you're going to get spurious results.
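As an aside, the raw ticks from mach_absolute_time() can be converted to nanoseconds with mach_timebase_info; here is a sketch in Swift (the question's snippet is C, but the same calls exist there):
import Darwin

var timebase = mach_timebase_info_data_t()
mach_timebase_info(&timebase)

let start = mach_absolute_time()
// ... the work being timed ...
let elapsedTicks = mach_absolute_time() - start
let elapsedNanoseconds = elapsedTicks * UInt64(timebase.numer) / UInt64(timebase.denom)
print(elapsedNanoseconds)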
SCHED_FIFO with maximum priority will allow you to hog the CPU, but only for a few seconds at most; then the OS decides you're being too greedy. You might want to try sleep(5) before running your timing test, as this will build up some "credit".
You don't actually need to start a new thread, you can temporarily change the priority of the current thread with this:
struct sched_param sched;
sched.sched_priority = 62;
pthread_setschedparam( pthread_self(), SCHED_FIFO, &sched );
Note that sched_get_priority_min & max return a conservative 15 & 47, but this only corresponds to an absolute priority of about 0.25 to 0.75. The actual usable range is 0 to 62, which corresponds to 0.0 to 1.0.
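If you want to confirm those bounds on your own hardware, you can query them (a sketch; 15 and 47 are simply what I see reported, not guaranteed values):
import Darwin

print("SCHED_FIFO min priority:", sched_get_priority_min(SCHED_FIFO))   // 15 in my tests
print("SCHED_FIFO max priority:", sched_get_priority_max(SCHED_FIFO))   // 47 in my tests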
It happens when the app spends some time in other threads.