Despite my best efforts to make a CoreML MLModel process its predictions in parallel, it seems that, under the hood, Apple is forcing it to run in a serial/one-by-one manner.
I made a public repository with a PoC reproducing the issue:
https://github.com/SocialKitLtd/coreml-concurrency-issue.
What I have tried:
Re-create the MLModel every time instead of a global instance
Use only .cpuAndGpu configuration
What I'm trying to achieve:
I'm trying to utilize multithreading to process a bunch of video frames at the same time (assuming the CPU/RAM can take it) faster than the one-by-one strategy.
Code (Also presented in the repository):
class ViewController: UIViewController {
    override func viewDidLoad() {
        super.viewDidLoad()

        let parallelTaskCount = 3

        for i in 0 ..< parallelTaskCount {
            DispatchQueue.global(qos: .userInteractive).async {
                let image = UIImage(named: "image.jpg")!
                self.runPrediction(index: i, image: image)
            }
        }
    }

    func runPrediction(index: Int, image: UIImage) {
        let conf = MLModelConfiguration()
        conf.computeUnits = .cpuAndGPU
        conf.allowLowPrecisionAccumulationOnGPU = true

        let myModel = try! MyModel(configuration: conf)
        let myModelInput = try! MyModelInput(LR_inputWith: image.cgImage!)

        // Prediction
        let prediction = try! myModel.prediction(input: myModelInput)
        print("finished processing \(index)")
    }
}
Any help will be highly appreciated.
When you employ parallel execution on the CPU, you can achieve significant performance gains on CPU-bound calculations only. But CoreML is not CPU-bound. When you leverage the GPU (e.g., with .cpuAndGPU), you will not achieve the same sort of CPU-parallelism-driven performance gains that you will see with CPU-only calculations.
That having been said, using the GPU (or the Neural Engine) is so much faster than the parallelized CPU rendition that one would generally forgo parallelized CPU calculations altogether and favor the GPU rendition. The non-parallel GPU compute rendition will often be faster than a parallelized CPU rendition.
That having been said, when I employed parallelism in GPU-compute tasks, there was still some modest performance gain. In my experiments, I saw minor performance benefits (13% and 18% faster when I went from serial to three concurrent operations on an iPhone and an M1 iPad, respectively) running GPU-based CoreML calculations, but slightly more material benefits on a Mac. Just do not expect a dramatic performance improvement.
Profiling with Instruments (by pressing command-i in Xcode or choosing “Product” » “Profile”) can be illuminating. See Recording Performance Data.
First, let us consider the computeUnits = .cpuOnly scenario. Here it is running 20 CoreML prediction calls sequentially (with a maxConcurrentOperationCount of 1):
And, if I switch to the CPU view, I can see that it is jumping between two performance cores on my iPhone 12 Pro Max:
That makes sense. OK, now let us change the maxConcurrentOperationCount to 3; the overall processing time (the processAll function) drops from 5 to 3½ minutes:
And when I switch to the CPU view to see what is going on, it looks like it started running on both performance cores in parallel, but switched to some of the efficiency cores (probably because the thermal state of the device was getting stressed, which explains why we did not achieve anything close to 2× performance):
So, when doing CPU-only CoreML calculations, parallel execution can yield significant benefits. That having been said, the CPU-only calculations are much slower than the GPU calculations.
When I switched to .cpuAndGPU, the difference between a maxConcurrentOperationCount of 1 and 3 was far less pronounced, taking 45 seconds when allowing three concurrent operations and 50 seconds when executing serially. Here it is running three in parallel:
And sequentially:
But in contrast to the .cpuOnly scenarios, you can see in the CPU track that the CPUs are largely idle. Here is the latter with the CPU view to show the details:
So, one can see that letting them run on multiple CPUs does not achieve much performance gain as this is not CPU-bound, but is obviously constrained by the GPU.
Here is my code for the above. Note, I used OperationQueue as it provides a simple mechanism to control the degree of concurrency (the maxConcurrentOperationCount):
import os.log
private let poi = OSLog(subsystem: "Test", category: .pointsOfInterest)
and
func processAll() {
    let parallelTaskCount = 20

    let queue = OperationQueue()
    queue.maxConcurrentOperationCount = 3          // or try `1`

    let id = OSSignpostID(log: poi)
    os_signpost(.begin, log: poi, name: #function, signpostID: id)

    for i in 0 ..< parallelTaskCount {
        queue.addOperation {
            let image = UIImage(named: "image.jpg")!
            self.runPrediction(index: i, image: image, shouldAddCounter: true)
        }
    }

    queue.addBarrierBlock {
        os_signpost(.end, log: poi, name: #function, signpostID: id)
    }
}

func runPrediction(index: Int, image: UIImage, shouldAddCounter: Bool = false) {
    let id = OSSignpostID(log: poi)
    os_signpost(.begin, log: poi, name: #function, signpostID: id, "%d", index)
    defer { os_signpost(.end, log: poi, name: #function, signpostID: id, "%d", index) }

    let conf = MLModelConfiguration()
    conf.computeUnits = .cpuAndGPU                 // contrast to `.cpuOnly`
    conf.allowLowPrecisionAccumulationOnGPU = true

    let myModel = try! MyModel(configuration: conf)
    let myModelInput = try! MyModelInput(LR_inputWith: image.cgImage!)

    // Prediction
    let prediction = try! myModel.prediction(input: myModelInput)

    os_signpost(.event, log: poi, name: "finished processing", "%d %@", index, prediction.featureNames)
}
Note, above I have focused on CPU usage. You can also use the “Core ML” template in Instruments. E.g., here are the Points of Interest and the CoreML tracks next to each other on my M1 iPad Pro (with maxConcurrentOperationCount set to 2 to keep it simple):
At first glance, it looks like CoreML is processing these in parallel, but if I run it again with a maxConcurrentOperationCount of 1 (i.e., serially), the time for those individual compute tasks is shorter, which suggests that in the parallel scenario there is some GPU-related contention.
Anyway, in short, you can use Instruments to observe what is going on. One can achieve significant performance improvements through parallel processing for CPU-bound tasks only; anything requiring the GPU or Neural Engine will be further constrained by that hardware.
Related
I have a question that is the reverse of the problem other people usually have.
I am building an application that measures battery discharge.
My plan is to simulate high CPU usage and then measure the time it takes the battery to drop to a certain level.
How can I cause high CPU usage on purpose, but without blocking the UI?
Can I do something like this?
DispatchQueue.global(qos: .background).async { [weak self] in
    guard let self = self else { return }

    for _ in 0 ..< Int.max {
        while self.isRunning {
        }
        break
    }
}
What you want is a persistent load on the CPU, with the number of threads concurrently loading the CPU >= the number of CPUs.
So, something like:
DispatchQueue.concurrentPerform(iterations: 100) { iteration in
    for _ in 1...10000000 {
        let a = 1 + 1
    }
}
Where:
The concurrentPerform with iterations set to 100 makes sure we are running in parallel, using every available thread. This is overkill, of course; 4 threads would be enough to keep every CPU busy on a quad-core device, and about 10 threads is the most iOS typically allocates per process. But 100 simply makes sure it really happens.
The 1...10000000 makes the loop really, really long.
The let a = 1 + 1 gives the CPU something to do.
On my iPhone 8 simulator, running this code produced a picture like this (I stopped it after about 30 sec):
Careful though! You may overheat your device.
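If you also want to be able to stop the load on demand without blocking the UI (along the lines of the isRunning flag in the question), one option is to have each worker periodically check a flag. This is purely an illustrative sketch; the LoadGenerator type and its start()/stop() methods are hypothetical names, not part of any framework:
import Foundation

// Illustrative sketch: the same concurrentPerform-based load, but each worker
// periodically checks a flag so the load can be stopped on demand.
final class LoadGenerator {
    private let stateQueue = DispatchQueue(label: "com.example.loadgenerator.state")
    private var _isRunning = true

    private var isRunning: Bool {
        get { stateQueue.sync { _isRunning } }
        set { stateQueue.sync { _isRunning = newValue } }
    }

    func start() {
        // Dispatch to a background queue so that `concurrentPerform`
        // (which blocks its caller) never blocks the main thread.
        DispatchQueue.global(qos: .userInitiated).async {
            DispatchQueue.concurrentPerform(iterations: 100) { _ in
                var a = 0
                while self.isRunning {
                    for _ in 1...100_000 { a &+= 1 }   // burn some cycles between flag checks
                }
            }
        }
    }

    func stop() {
        isRunning = false
    }
}
Compared to a bare while self.isRunning {} spin on a single background queue, this keeps the UI responsive (the main thread is never blocked) and loads every core, while still stopping promptly when stop() is called.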
My intent is to understand the “cooperative thread pool” used by Swift 5.5’s async-await, and how task groups automatically constrain the degree of concurrency. Consider the following task group code, doing 32 calculations in parallel:
func launchTasks() async {
    await withTaskGroup(of: Void.self) { group in
        for i in 0 ..< 32 {
            group.addTask { [self] in
                let value = await doSomething(with: i)
                // do something with `value`
            }
        }
    }
}
While I hoped it would constrain the degree of concurrency, as advertised, I'm only getting two (!) concurrent tasks at a time. That is far more constrained than I would have expected:
If I use the old GCD concurrentPerform ...
func launchTasks2() {
    DispatchQueue.global().async {
        DispatchQueue.concurrentPerform(iterations: 32) { [self] i in
            let value = doSomething(with: i)
            // do something with `value`
        }
    }
}
... I get twelve at a time, taking full advantage of the device (iOS 15 simulator on my 6-core i9 MacBook Pro) while avoiding thread-explosion:
(FWIW, both of these were profiled in Xcode 13.0 beta 1 (13A5154h) running on Big Sur. And please disregard the minor differences in the individual “jobs” in these two runs, as the function in question is just spinning for a random duration; the key observation is the degree of concurrency.)
It is excellent that this new async-await (and task groups) automatically limits the degree of parallelism, but the cooperative thread pool of async-await is far more constrained than I would have expected. And I see no way to adjust the parameters of that pool. How can we better take advantage of our hardware while still avoiding thread explosion (without resorting to old techniques like non-zero semaphores or operation queues)?
It looks like this curious behavior is a limitation of the simulator. If I run it on my physical iPhone 12 Pro Max, the async-await task group approach results in 6 concurrent tasks ...
... which is essentially the same as the concurrentPerform behavior:
The behavior, including the degree of concurrency, is essentially the same on the physical device.
One is left to infer that the simulator appears to be configured to constrain async-await more than what is achievable with direct GCD calls. But on actual physical devices, the async-await task group behavior is as one would expect.
I am trying to process a series of UIImages using CoreImage and Metal and also display them. My problem is that I want to drop the incoming image if my GCD block is busy. How do I achieve this with GCD queues, and how do I define the maximum buffer size of a queue?
There’s no native mechanism for this, but you can achieve what you want algorithmically with semaphores.
Before I dive into the “process 4 at a time, but discard any that come if we’re busy” scenario, let me first consider the simpler “process all, but not more than 4 at any given time” pattern. (I’m going to answer your question below, but building on this simpler situation.)
For example, let’s imagine that you had some preexisting array of objects and you want to process them concurrently, but not more than four at any given time (perhaps to minimize peak memory usage):
DispatchQueue.global().async {
    let semaphore = DispatchSemaphore(value: 4)

    for object in objects {
        semaphore.wait()
        processQueue.async {
            self.process(object)
            semaphore.signal()
        }
    }
}
Basically, the wait function will, as the documentation says, “Decrement the counting semaphore. If the resulting value is less than zero, this function waits for a signal to occur before returning.”
We’re starting our semaphore with a count of 4, so if objects had 10 items in it, the first four would start immediately, but the fifth wouldn’t start until one of the earlier ones finished and sent a signal (which increments the semaphore counter back up by 1), and so on, achieving a “run concurrently, but a max of 4 at any given time” behavior.
So, let’s return to your question. Let’s say you wanted to process no more than four images at a time and drop any incoming image if there were already four images currently being processed. You can accomplish that by telling wait not to really wait at all, i.e., to check right .now() whether the semaphore counter has already hit zero, with something like:
let semaphore = DispatchSemaphore(value: 4)
let processQueue = DispatchQueue(label: "com.domain.app.process", attributes: .concurrent)

func submit(_ image: UIImage) {
    if semaphore.wait(timeout: .now()) == .timedOut { return }

    processQueue.async {
        self.process(image)
        self.semaphore.signal()
    }
}
Note, we generally want to avoid blocking the main thread (like wait can do), but because I’m using a timeout of .now(), it will never block; we’re just using the semaphore to keep track of where we are in a nice, thread-safe manner.
One final approach is to consider operation queues:
// create queue that will run no more than four at a time (no semaphores needed; lol)
let processQueue: OperationQueue = {
    let queue = OperationQueue()
    queue.maxConcurrentOperationCount = 4
    return queue
}()

func submit(_ image: UIImage) {
    // cancel all but the last three unstarted operations
    processQueue.operations
        .filter { $0.isReady && !$0.isFinished && !$0.isExecuting && !$0.isCancelled }
        .dropLast(3)
        .forEach { $0.cancel() }

    // now add new operation to the queue
    processQueue.addOperation(BlockOperation {
        self.process(image)
    })
}
The behavior is slightly different (keeping the most recent four images queued up, ready to go), but is something to consider.
GCD queues don't have a maximum queue size.
You can use a semaphore for this. Initialize it with the maximum queue length you want to support. Use dispatch_semaphore_wait() with DISPATCH_TIME_NOW as the timeout to try to reserve a spot before submitting a task to the queue. If it times out, don't enqueue the task (discard it, or whatever). Have the task signal the semaphore when it's complete to release the spot you reserved for it to be used for another task, later.
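In Swift, those C names correspond to DispatchSemaphore, wait(timeout:), and .now(). A minimal sketch of the idea, with illustrative names, might look like this:
import Foundation

// The semaphore's initial value is the maximum number of tasks allowed to be
// queued or running at any one time (the "maximum queue length").
let maximumTasks = 4
let slots = DispatchSemaphore(value: maximumTasks)
let workQueue = DispatchQueue(label: "com.example.work", attributes: .concurrent)

func submitIfRoom(_ work: @escaping () -> Void) {
    // Try to reserve a spot without waiting; if none is free, discard the task.
    guard slots.wait(timeout: .now()) == .success else { return }

    workQueue.async {
        work()
        slots.signal()   // release the reserved spot for a later task
    }
}
This is essentially the same pattern as the Swift submit(_:) example in the previous answer, just spelled with the modern Swift API names.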
I'm trying to apply live camera filters through Metal using the default MPSKernel filters given by Apple and custom compute shaders.
In the compute shader I did the in-place encoding with the MPSImageGaussianBlur, and the code goes here:
func encode(to commandBuffer: MTLCommandBuffer,
            sourceTexture: MTLTexture,
            destinationTexture: MTLTexture,
            cropRect: MTLRegion = MTLRegion(),
            offset: CGPoint) {

    let blur = MPSImageGaussianBlur(device: device, sigma: 0)
    blur.clipRect = cropRect
    blur.offset = MPSOffset(x: Int(offset.x), y: Int(offset.y), z: 0)

    let threadsPerThreadgroup = MTLSizeMake(4, 4, 1)
    let threadgroupsPerGrid = MTLSizeMake(sourceTexture.width / threadsPerThreadgroup.width,
                                          sourceTexture.height / threadsPerThreadgroup.height,
                                          1)

    let commandEncoder = commandBuffer.makeComputeCommandEncoder()
    commandEncoder.setComputePipelineState(pipelineState!)
    commandEncoder.setTexture(sourceTexture, at: 0)
    commandEncoder.setTexture(destinationTexture, at: 1)
    commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    commandEncoder.endEncoding()

    autoreleasepool {
        var inPlaceTexture = destinationTexture
        blur.encode(commandBuffer: commandBuffer, inPlaceTexture: &inPlaceTexture, fallbackCopyAllocator: nil)
    }
}
But sometimes the in-place texture tends to fail, and eventually it creates a jerky effect on the screen.
So if anyone can suggest a solution without using the in-place texture, or explain how to use the fallbackCopyAllocator, or how to use the compute shaders in a different way, that would be really helpful.
I have done enough coding in this area (applying compute shaders to the video stream from the camera), and the most common problem you run into is the "pixel buffer reuse" issue.
The Metal texture you create from the sample buffer is backed by a pixel buffer, which is managed by the video session and can be reused for subsequent video frames, unless you retain the reference to the sample buffer (retaining the reference to the Metal texture is not enough).
Feel free to take a look at my code at https://github.com/snakajima/vs-metal, which applies various compute shaders to a live video stream.
The VSContext:set() method takes an optional sampleBuffer parameter in addition to the texture parameter, and retains the reference to the sampleBuffer until the compute shader's computation is completed (in the VSRuntime:encode() method).
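The core of that idea, independent of the vs-metal code, is roughly the following. This is a hedged sketch, not the repository's actual implementation; the FrameProcessor type is a made-up name:
import AVFoundation
import Metal

// Keep a strong reference to the CMSampleBuffer (not just the MTLTexture)
// until the GPU work that reads the camera-backed texture has completed.
final class FrameProcessor {
    func encode(texture: MTLTexture,
                from sampleBuffer: CMSampleBuffer,
                into commandBuffer: MTLCommandBuffer) {
        // ... encode the compute shader work against `texture` here ...

        // The completion handler captures `sampleBuffer`, so the underlying
        // pixel buffer cannot be recycled by the capture session until the
        // command buffer has finished executing on the GPU.
        commandBuffer.addCompletedHandler { _ in
            withExtendedLifetime(sampleBuffer) { }
        }
        commandBuffer.commit()
    }
}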
The in-place operation method can be hit or miss depending on what the underlying filter is doing. If it is a single-pass filter for some parameters, then you'll end up running out of place for those cases.
Since that method was added, MPS has added an underlying MTLHeap to manage memory a bit more transparently for you. If your MPSImage doesn't need to be viewed by the CPU and exists for only a short period of time on the GPU, it is recommended that you just use an MPSTemporaryImage instead. When the readCount hits 0 on that, the backing store will be recycled through the MPS heap and made available for other MPSTemporaryImages and other temporary resources used downstream. Likewise, the backing store for it isn't actually allocated from the heap until absolutely necessary (e.g., the texture is written to, or .texture is called). A separate heap is allocated for each command buffer.
Using temporary images should help reduce memory usage quite a lot. For example, in an Inception v3 neural network graph, which has over a hundred passes, the heap was able to automatically reduce the graph to just four allocations.
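A rough sketch of the MPSTemporaryImage pattern described above (the helper name and parameters here are illustrative, not from the question's code):
import MetalPerformanceShaders

// Write the blur result into an MPSTemporaryImage whose backing store comes from
// the command buffer's internal heap; it is recycled once its readCount reaches
// zero, i.e. after the next kernel reads from it.
func blurIntoTemporary(commandBuffer: MTLCommandBuffer,
                       source: MTLTexture,
                       sigma: Float) -> MPSTemporaryImage {
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: source.pixelFormat,
                                                              width: source.width,
                                                              height: source.height,
                                                              mipmapped: false)
    descriptor.usage = [.shaderRead, .shaderWrite]

    // readCount defaults to 1: the backing store is released after one kernel reads the image.
    let intermediate = MPSTemporaryImage(commandBuffer: commandBuffer, textureDescriptor: descriptor)

    let blur = MPSImageGaussianBlur(device: commandBuffer.device, sigma: sigma)
    blur.encode(commandBuffer: commandBuffer,
                sourceTexture: source,
                destinationTexture: intermediate.texture)   // accessing .texture allocates the backing store

    return intermediate
}
The returned image would then be consumed by a downstream kernel in the same command buffer, after which its storage is returned to the heap.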
I am making a rigid body simulation for iPhone/iPad using Apple Metal. To do this, I need to make many kernel function calls, and I see that this takes a long time, in contrast to CUDA, for example.
I implemented the Metal kernel function calls as described in the Apple tutorial:
let commandQueue = device.newCommandQueue()

var commandBuffers: [MTLCommandBuffer] = []
var gpuPrograms: [MTLFunction] = []
var computePipelineFilters: [MTLComputePipelineState] = []
var computeCommandEncoders: [MTLComputeCommandEncoder] = []

// Here I fill all the arrays for my command queue,
// and next I execute it.

let threadsPerGroup = MTLSize(width: 1, height: 1, depth: 1)
let numThreadgroups = MTLSize(width: threadsAmount, height: 1, depth: 1)

for computeCommandEncoder in computeCommandEncoders {
    computeCommandEncoder.dispatchThreadgroups(numThreadgroups, threadsPerThreadgroup: threadsPerGroup)
}

for computeCommandEncoder in computeCommandEncoders {
    computeCommandEncoder.endEncoding()
}

for commandBuffer in commandBuffers {
    commandBuffer.enqueue()
}

for commandBuffer in commandBuffers {
    commandBuffer.commit()
}

for commandBuffer in commandBuffers {
    commandBuffer.waitUntilCompleted()
}
I execute up to a few dozen Metal kernel functions every frame, and it works too slowly. I tested it with empty kernel functions, and that shows me that the problem is in the Swift part of the execution. I mean, when I want to execute a kernel function in CUDA, I just call it like a usual function and it works very fast. But here I must perform many actions for every execution of every function, every frame. Maybe I am missing something, but I want to create all the additional objects once, and then just do something like
commandQueue.execute()
to execute all kernel functions.
Am I right in how I execute many kernel functions, or is there some other way to do it faster?
I have a few projects that use multiple shaders in a single step. I only create a single buffer and encoder but multiple pipeline states; one for each compute function.
Remember that MTLCommandQueue is persistent, so it only needs to be created once. My MetalKit view's drawRect() function is roughly this (there are more shaders and textures being passed between them, but you get an idea of the structure):
let commandBuffer = commandQueue.commandBuffer()
let commandEncoder = commandBuffer.computeCommandEncoder()

commandEncoder.setComputePipelineState(advect_pipelineState)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid,
                                    threadsPerThreadgroup: threadsPerThreadgroup)

commandEncoder.setComputePipelineState(divergence_pipelineState)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid,
                                    threadsPerThreadgroup: threadsPerThreadgroup)

[...]

commandEncoder.endEncoding()
commandBuffer.commit()
My code actually iterates over one of the shaders twenty times and still runs pretty nippily, so if you reorganise to follow this structure, with a single buffer and a single encoder, and only call endEncoding() and commit() once per pass, you may see an improvement in performance.
May being the operative word :)
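Applied to the arrays in the question, a hedged sketch of that reorganisation might look like the following (computePipelineFilters and threadsAmount are the question's names; the function itself is illustrative and assumes the modern, optional-returning Metal API):
import Metal

// One command buffer and one encoder per frame, iterating over the pipeline
// states, with endEncoding()/commit() called only once.
func executeAll(commandQueue: MTLCommandQueue,
                computePipelineFilters: [MTLComputePipelineState],
                threadsAmount: Int) {
    guard let commandBuffer = commandQueue.makeCommandBuffer(),
          let commandEncoder = commandBuffer.makeComputeCommandEncoder() else { return }

    let threadsPerGroup = MTLSize(width: 1, height: 1, depth: 1)
    let numThreadgroups = MTLSize(width: threadsAmount, height: 1, depth: 1)

    for pipelineState in computePipelineFilters {
        commandEncoder.setComputePipelineState(pipelineState)
        commandEncoder.dispatchThreadgroups(numThreadgroups, threadsPerThreadgroup: threadsPerGroup)
    }

    commandEncoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}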