Intentionally causing high CPU usage - iOS

I have a question that is the opposite of what people usually ask.
I am building an application that measures battery discharge.
My plan is to simulate high CPU usage and then measure the time it takes the battery to drop to a certain level.
How can I cause high CPU usage on purpose, but without blocking the UI?
Can I do something like this?
DispatchQueue.global(qos: .background).async { [weak self] in
    guard let self = self else { return }
    for _ in 0 ..< Int.max {
        while self.isRunning {
        }
        break
    }
}

What you want is a persistent load on the CPU, with the number of threads concurrently loading the CPU greater than or equal to the number of CPU cores. So, something like:
DispatchQueue.concurrentPerform(iterations: 100) { iteration in
    for _ in 1...10_000_000 {
        let a = 1 + 1
    }
}
Where:
- concurrentPerform with iterations set to 100 makes sure we are running in parallel on every available worker thread. This is overkill, of course: four threads would be enough to keep every core busy on a quad-core device, and about ten threads is the most iOS typically allocates per process, but 100 simply makes sure it really happens.
- The 1...10_000_000 range makes the loop really, really long.
- The let a = 1 + 1 gives the CPU something to do.
Running this code on my iPhone 8 simulator produced a picture like this (I stopped it after about 30 seconds):
Careful though! You may overheat your device
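If you also want the load to stop on its own, and to survive release-build optimizations (where a do-nothing statement like let a = 1 + 1 may be stripped out), here is a minimal sketch along the same lines. The burnAllCores name and the sin-based busywork are illustrative choices, not part of the answer above:

import Foundation

/// Spins every CPU core for the given duration.
/// The accumulator and the final `print` give the loop an observable
/// side effect, so the optimizer cannot elide the work.
func burnAllCores(seconds: TimeInterval) {
    let cores = ProcessInfo.processInfo.activeProcessorCount
    let deadline = Date().addingTimeInterval(seconds)
    DispatchQueue.concurrentPerform(iterations: cores) { _ in
        var accumulator = 0.0
        while Date() < deadline {
            accumulator += sin(accumulator + 1)   // real floating-point work
        }
        print(accumulator)
    }
}

// `concurrentPerform` blocks the thread that calls it, so dispatch it
// off the main thread to keep the UI responsive.
DispatchQueue.global(qos: .userInitiated).async {
    burnAllCores(seconds: 30)
}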

Related

CoreML can't work in concurrency (Multithreading)?

Despite my best efforts to make a CoreML MLModel process its predictions in parallel, it seems that, under the hood, Apple is forcing it to run in a serial, one-by-one manner.
I made a public repository reproducing the PoC of the issue:
https://github.com/SocialKitLtd/coreml-concurrency-issue.
What I have tried:
Re-create the MLModel every time instead of a global instance
Use only .cpuAndGpu configuration
What I'm trying to achieve:
I'm trying to utilize multithreading to process a bunch of video frames at the same time (assuming the CPU/RAM can take it) faster than the one-by-one strategy.
Code (Also presented in the repository):
class ViewController: UIViewController {

    override func viewDidLoad() {
        super.viewDidLoad()

        let parallelTaskCount = 3

        for i in 0 ..< parallelTaskCount {
            DispatchQueue.global(qos: .userInteractive).async {
                let image = UIImage(named: "image.jpg")!
                self.runPrediction(index: i, image: image)
            }
        }
    }

    func runPrediction(index: Int, image: UIImage) {
        let conf = MLModelConfiguration()
        conf.computeUnits = .cpuAndGPU
        conf.allowLowPrecisionAccumulationOnGPU = true
        let myModel = try! MyModel(configuration: conf)
        let myModelInput = try! MyModelInput(LR_inputWith: image.cgImage!)

        // Prediction
        let prediction = try! myModel.prediction(input: myModelInput)
        print("finished processing \(index)")
    }
}
Any help will be highly appreciated.
When you employ parallel execution on the CPU, you can achieve significant performance gains on CPU-bound calculations only. But CoreML is not CPU-bound. When you leverage the GPU (e.g., with .cpuAndGPU), you will not achieve the same sort of CPU-parallelism-driven performance gains that you see with CPU-only calculations.
That having been said, using the GPU (or the Neural Engine) is so much faster than the parallelized CPU rendition that one would generally forgo parallelized CPU calculations altogether and favor the GPU; the non-parallel GPU rendition will often be faster than a parallelized CPU rendition.
Still, when I employed parallelism in GPU-compute tasks, there was some modest performance gain. In my experiments, I saw minor benefits (13% and 18% faster when going from serial to three concurrent operations on an iPhone and an M1 iPad, respectively) running GPU-based CoreML calculations, and slightly more material benefits on a Mac. Just do not expect a dramatic performance improvement.
Profiling with Instruments (by pressing command-I in Xcode, or choosing “Product” » “Profile”) can be illuminating. See Recording Performance Data.
First, let us consider the .cpuOnly scenario. Here it is running 20 CoreML prediction calls sequentially (with a maxConcurrentOperationCount of 1):
And, if I switch to the CPU view, I can see that it is jumping between two performance cores on my iPhone 12 Pro Max:
That makes sense. Now, if we change the maxConcurrentOperationCount to 3, the overall processing time (the processAll function) drops from 5 minutes to 3½ minutes:
And when I switch to the CPU view to see what is going on, it looks like it started running on both performance cores in parallel, but then switched to some of the efficiency cores (probably because the thermal state of the device was getting stressed, which explains why we did not achieve anything close to a 2× speedup):
So, when doing CPU-only CoreML calculations, parallel execution can yield significant benefits. That having been said, the CPU-only calculations are much slower than the GPU calculations.
When I switched to .cpuAndGPU, the difference maxConcurrentOperationCount of 1 vs 3 was far less pronounced, taking 45 seconds when allowing three concurrent operations and 50 seconds when executing serially. Here it is running three in parallel:
And sequentially:
But in contrast to the .cpuOnly scenarios, you can see in the CPU track, that the CPUs are largely idle. Here is the latter with the CPU view to show the details:
So, one can see that letting the operations run on multiple CPUs does not achieve much performance gain, as this work is not CPU-bound but rather constrained by the GPU.
Here is my code for the above. Note that I used OperationQueue, as it provides a simple mechanism to control the degree of concurrency (maxConcurrentOperationCount):
import os.log
private let poi = OSLog(subsystem: "Test", category: .pointsOfInterest)
and
func processAll() {
    let parallelTaskCount = 20

    let queue = OperationQueue()
    queue.maxConcurrentOperationCount = 3 // or try `1`

    let id = OSSignpostID(log: poi)
    os_signpost(.begin, log: poi, name: #function, signpostID: id)

    for i in 0 ..< parallelTaskCount {
        queue.addOperation {
            let image = UIImage(named: "image.jpg")!
            self.runPrediction(index: i, image: image, shouldAddCounter: true)
        }
    }

    queue.addBarrierBlock {
        os_signpost(.end, log: poi, name: #function, signpostID: id)
    }
}
func runPrediction(index: Int, image: UIImage, shouldAddCounter: Bool = false) {
    let id = OSSignpostID(log: poi)
    os_signpost(.begin, log: poi, name: #function, signpostID: id, "%d", index)
    defer { os_signpost(.end, log: poi, name: #function, signpostID: id, "%d", index) }

    let conf = MLModelConfiguration()
    conf.computeUnits = .cpuAndGPU // contrast to `.cpuOnly`
    conf.allowLowPrecisionAccumulationOnGPU = true
    let myModel = try! MyModel(configuration: conf)
    let myModelInput = try! MyModelInput(LR_inputWith: image.cgImage!)

    // Prediction
    let prediction = try! myModel.prediction(input: myModelInput)
    os_signpost(.event, log: poi, name: "finished processing", "%d %@", index, String(describing: prediction.featureNames))
}
Note, above I have focused on CPU usage. You can also use the “Core ML” template in Instruments. E.g., here are the Points of Interest and Core ML tracks next to each other on my M1 iPad Pro (with maxConcurrentOperationCount set to 2 to keep it simple):
At first glance, it looks like Core ML is processing these in parallel, but if I run it again with a maxConcurrentOperationCount of 1 (i.e., serially), the time for those individual compute tasks is shorter, which suggests that in the parallel scenario there is some GPU-related contention.
Anyway, in short, you can use Instruments to observe what is going on. One can achieve significant performance improvements through parallel processing for CPU-bound tasks only; anything requiring the GPU or Neural Engine will be further constrained by that hardware.

Does GlobalQueue cause memory leaks in iOS?

Here's a very simple demo:
class ViewController: UIViewController {
    override func viewDidLoad() {
        super.viewDidLoad()

        for i in 0 ..< 500_000 {
            DispatchQueue.global().async {
                print(i)
            }
        }
    }
}
When I run this demo in the simulator, memory usage goes up to ~17 MB and then drops to ~15 MB at the end. However, if I comment out the dispatch code and keep only the print() line, memory usage is only ~10 MB. The amount of the increase varies whenever I change the loop count.
Is there a memory leak? I tried Leaks and didn't find anything.
When looking at memory usage, one has to run the cycle several times before concluding that there is a “leak”. It could just be caching.
In this particular case, you might see memory growth after the first iteration, but it will not continue to grow on subsequent iterations. If it was a true leak, the post-peak baseline would continue to creep up. But it does not. Note that the baseline after the second peak is basically the same as after the first peak.
As an aside, the memory characteristics here are a result of the thread explosion (which you should always avoid). Consider:
for i in 0 ..< 500_000 {
    DispatchQueue.global().async {
        print(i)
    }
}
That dispatches half a million work items to a queue whose underlying pool can only support 64 worker threads at a time. That is “thread explosion”, exceeding the worker thread pool.
You should instead do the following, which constrains the degree of concurrency with concurrentPerform:
DispatchQueue.global().async {
    DispatchQueue.concurrentPerform(iterations: 500_000) { i in
        print(i)
    }
}
That achieves the same thing but limits the degree of concurrency to the number of available CPU cores. That avoids many problems (specifically, it avoids exhausting the limited worker-thread pool, which could deadlock other parts of the system), and it avoids the memory spike, too. (See the above graph, where there is no spike after either of the latter two concurrentPerform runs.)
So, while the thread-explosion scenario does not actually leak, it should be avoided at all costs because of both the memory spike and the potential deadlock risks.
Memory used is not memory leaked.
There is a certain amount of overhead associated with certain OS services. I remember answering a similar question posed by someone using a WebView. There are global caches. There is simply code that has to be paged in from disk to memory (which is big with WebKit), and once the code is paged in, it's incredibly unlikely to ever be paged out.
I've not looked at the libdispatch source code lately, but GCD maintains one or more pools of threads that it uses to execute the blocks you enqueue. Every one of those threads has a stack. The default thread stack size on macOS is 8MB. I don't know about iOS's default thread stack size, but it for sure has one (and I'd bet one stiff drink that it's 8MB). Once GCD creates those threads, why would it shut them down? Especially when you've shown the OS that you're going to rapidly queue 500K operations?
The OS is optimizing for performance/speed at the expense of memory use. You don't get to control that. It's not a "leak" which is memory that's been allocated but has no live references to it. This memory surely has live references to it, they're just not under your control. If you want more (albeit different) visibility into memory usage, look into the vmmap command. It can (and will) show you things that might surprise you.
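As an illustration of that per-thread stack cost, here is a minimal sketch (mine, not the answerer's) using Thread, which, unlike GCD's anonymous worker threads, lets you inspect and set the stack size before the thread starts; the 512 KB figure is an arbitrary illustrative value:

import Foundation

// Every thread carries its own stack. `Thread.stackSize` can be read,
// and it can be set at any time before `start()` is called.
let worker = Thread {
    print("worker running; stack size was configured before start")
}
worker.stackSize = 512 * 1024   // 512 KB, illustrative only
worker.start()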

Maximum number of threads with async-await task groups

My intent is to understand the “cooperative thread pool” used by Swift 5.5’s async-await, and how task groups automatically constrain the degree of concurrency. Consider the following task group code, doing 32 calculations in parallel:
func launchTasks() async {
    await withTaskGroup(of: Void.self) { group in
        for i in 0 ..< 32 {
            group.addTask { [self] in
                let value = await doSomething(with: i)
                // do something with `value`
            }
        }
    }
}
While I hoped it would constrain the degree of concurrency, as advertised, I'm only getting two (!) concurrent tasks at a time. That is far more constrained than I would have expected:
If I use the old GCD concurrentPerform ...
func launchTasks2() {
    DispatchQueue.global().async {
        DispatchQueue.concurrentPerform(iterations: 32) { [self] i in
            let value = doSomething(with: i)
            // do something with `value`
        }
    }
}
... I get twelve at a time, taking full advantage of the device (iOS 15 simulator on my 6-core i9 MacBook Pro) while avoiding thread-explosion:
(FWIW, both of these were profiled in Xcode 13.0 beta 1 (13A5154h) running on Big Sur. And please disregard the minor differences in the individual “jobs” in these two runs, as the function in question is just spinning for a random duration; the key observation is the degree of concurrency is what we would have expected.)
It is excellent that this new async-await (and task groups) automatically limits the degree of parallelism, but the cooperative thread pool of async-await is far more constrained than I would have expected, and I see no way to adjust the parameters of that pool. How can we better take advantage of our hardware while still avoiding thread explosion (and without resorting to old techniques like non-zero semaphores or operation queues)?
It looks like this curious behavior is a limitation of the simulator. If I run it on my physical iPhone 12 Pro Max, the async-await task group approach results in 6 concurrent tasks ...
... which is essentially the same as the concurrentPerform behavior:
The behavior, including the degree of concurrency, is essentially the same on the physical device.
One is left to infer that the simulator appears to be configured to constrain async-await more than what is achievable with direct GCD calls. But on actual physical devices, the async-await task group behavior is as one would expect.
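If you do want to cap in-flight work yourself, independent of the width of the cooperative pool, one common pattern is to seed the group with a fixed number of tasks and add the next task only as each one finishes. This is my own sketch, not part of the answer above, and it assumes the same doSomething(with:) from the question:

func launchTasksCapped(total: Int = 32, limit: Int = 4) async {
    await withTaskGroup(of: Void.self) { group in
        var next = 0

        // Seed the group with up to `limit` tasks.
        while next < min(limit, total) {
            let i = next
            group.addTask { _ = await doSomething(with: i) }
            next += 1
        }

        // Each completion frees a slot for one more task;
        // `withTaskGroup` awaits any remaining tasks at the end.
        while await group.next() != nil, next < total {
            let i = next
            group.addTask { _ = await doSomething(with: i) }
            next += 1
        }
    }
}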

How to limit gcd queue buffer size

I am trying to process a series of UIImages using Core Image and Metal, and also display them. My problem is that I want to drop the incoming image if my GCD block is busy. How do I achieve this with GCD queues? How do I define the maximum buffer size of a queue?
There’s no native mechanism for this, but you can achieve what you want algorithmically with semaphores.
Before I dive into the “process 4 at a time, but discard any that come if we’re busy” scenario, let me first consider the simpler “process all, but not more than 4 at any given time” pattern. (I’m going to answer your question below, but building on this simpler situation.)
For example, let’s imagine that you had some preexisting array of objects and you want to process them concurrently, but not more than four at any given time (perhaps to minimize peak memory usage):
DispatchQueue.global().async {
    let semaphore = DispatchSemaphore(value: 4)

    for object in objects {
        semaphore.wait()
        processQueue.async {
            self.process(object)
            semaphore.signal()
        }
    }
}
Basically, the wait function will, as the documentation says, “Decrement the counting semaphore. If the resulting value is less than zero, this function waits for a signal to occur before returning.”
We start our semaphore with a count of 4. So, if objects had 10 items in it, the first four would start immediately, and the fifth wouldn’t start until one of the earlier ones finished and signaled (incrementing the semaphore counter back up by 1), and so on, achieving a “run concurrently, but at most 4 at any given time” behavior.
So, let’s return to your question. Let’s say you wanted to process no more than four images at a time, dropping any incoming image if four were already being processed. You can accomplish that by telling wait not to really wait at all, i.e., to check right .now() whether the semaphore counter has already hit zero, i.e., something like:
let semaphore = DispatchSemaphore(value: 4)
let processQueue = DispatchQueue(label: "com.domain.app.process", attributes: .concurrent)

func submit(_ image: UIImage) {
    if semaphore.wait(timeout: .now()) == .timedOut { return }

    processQueue.async {
        self.process(image)
        semaphore.signal()
    }
}
Note, we generally want to avoid blocking the main thread (as wait can do), but because I’m using a timeout of .now(), it will never block; we’re just using the semaphore to keep track of where we are in a nice, thread-safe manner.
One final approach is to consider operation queues:
// create queue that will run no more than four at a time (no semaphores needed; lol)
let processQueue: OperationQueue = {
    let queue = OperationQueue()
    queue.maxConcurrentOperationCount = 4
    return queue
}()

func submit(_ image: UIImage) {
    // cancel all but the last three unstarted operations
    processQueue.operations
        .filter { $0.isReady && !$0.isFinished && !$0.isExecuting && !$0.isCancelled }
        .dropLast(3)
        .forEach { $0.cancel() }

    // now add new operation to the queue
    processQueue.addOperation(BlockOperation {
        self.process(image)
    })
}
The behavior is slightly different (keeping the most recent four images queued up, ready to go), but is something to consider.
GCD queues don't have a maximum queue size.
You can use a semaphore for this. Initialize it with the maximum queue length you want to support. Use dispatch_semaphore_wait() with DISPATCH_TIME_NOW as the timeout to try to reserve a spot before submitting a task to the queue. If it times out, don't enqueue the task (discard it, or whatever). Have the task signal the semaphore when it's complete, releasing the spot you reserved so it can be used for another task later.
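For reference, here is a minimal Swift sketch of that same pattern; the enqueue name and the limit of 4 are illustrative:

import Foundation

let slots = DispatchSemaphore(value: 4)   // the maximum "queue length"
let workQueue = DispatchQueue(label: "com.domain.app.work", attributes: .concurrent)

func enqueue(_ work: @escaping () -> Void) {
    // Try to reserve a spot without waiting; if none is free, discard the task.
    guard slots.wait(timeout: .now()) == .success else { return }
    workQueue.async {
        work()
        slots.signal()   // release the reserved spot for a later task
    }
}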

Swift concurrent operation 2x slower

I have a large JSON array that I need to save to Realm; the problem is that this operation takes around 45 seconds, which is too long. I tried running the save operation concurrently for each element in the JSON array, like this:
for element in jsonArray { // jsonArray has about 25 elements
    DispatchQueue.global(qos: .userInitiated).async {
        let realm = try! Realm()
        let savedObject = realm.objects(MyObject.self).filter("name == '\(element.name)'").first!
        try! realm.write {
            for subElement in element { // element is an array of around 1,000 elements
                // MyModel initialization is a simple, lightweight process that
                // copies values from one model to another
                let myModel = MyModel(initWith: subElement)
                savedObject.models.append(myModel)
            }
        }
    }
}
When I try to run the same code but with DispatchQueue.main.async, it finishes around 2x faster, even though it's not concurrent. I also tried running the code above with a quality of service of .userInteractive, but the speed is the same.
When I run this code, CPU utilization is about 30% and memory usage about 45 MB. Is it possible to speed up this operation, or have I reached a dead end?
The entire loop should be inside the DispatchQueue.global(qos: .userInitiated).async block.
As documented on the Realm website:
Realm write operations are synchronous and blocking, not asynchronous. If thread A starts a write operation, then thread B starts a write operation on the same Realm before thread A is finished, thread A must finish and commit its transaction before thread B’s write operation takes place. Write operations always refresh automatically on beginWrite(), so no race condition is created by overlapping writes.
This means you won't get any benefit from trying to write from multiple threads.
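Here is a minimal sketch of that suggestion, assuming the same jsonArray, MyObject, and MyModel from the question: move everything onto one background queue and batch the mutations into a single write transaction:

import Foundation
import RealmSwift

DispatchQueue.global(qos: .userInitiated).async {
    let realm = try! Realm()
    try! realm.write {   // one transaction for the whole batch
        for element in jsonArray {
            guard let savedObject = realm.objects(MyObject.self)
                .filter("name == %@", element.name)
                .first
            else { continue }

            for subElement in element {
                savedObject.models.append(MyModel(initWith: subElement))
            }
        }
    }
}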
