Making use of Metal's volatile MTLPurgeableState while triple buffering - ios

I have a number of cached MTLTextures that I would like to make purgeable using MTLTexture's setPurgeableState API. To do this I follow the process outlined in Apple's 2019 WWDC video, which suggests:
- Setting cached MTLTexture instances to volatile
- Flagging MTLTexture instances as nonVolatile while 'in use' as part of a command buffer
- Using MTLCommandBuffer's addCompletedHandler to set all MTLTexture instances back to volatile after the command buffer completes its work
This approach works great, but quickly runs into issues in a triple buffered renderer where more than one command buffer is in-flight simultaneously. In these instances I receive the following error:
Cannot set purgeability state to volatile while resource is in use by a command buffer.
... which makes sense. I'm obviously attempting to flag an MTLTexture as volatile while it's still in use by another in-flight command buffer. But what's the best way around this without obliterating the performance advantages afforded by triple buffering in the first place?
Attached below is a basic sketch of the approach I'm using (and the one outlined by Apple in the aforementioned WWDC video):
var cache = [String: MTLTexture]()

// Insert a number of textures into the CPU cache and set
// their purgeable state to volatile
for (_, texture) in cache {
    texture.setPurgeableState(.volatile)
}

// Ensure no more than 3 command buffers are in-flight
// at any given time
semaphore.wait()

// Determine which cached textures are to be used for the
// next render pass and mark them nonVolatile
if cache["texture_to_be_used"]?.setPurgeableState(.nonVolatile) == .empty {
    // If the texture has been purged, recreate it as necessary...
}

guard let commandBuffer = commandQueue.makeCommandBuffer() else { return }
// Construct command buffer, binding textures where necessary
// ...
// Set a completion handler for the command buffer
commandBuffer.addCompletedHandler({ _ in
    // Set any textures marked as nonVolatile back to volatile
    // so that they can be purged when required
    //
    // The issue here is that as I'm using a triple buffered
    // approach I have no way of (efficiently) knowing if a given
    // texture is being bound as part of another in-flight
    // command buffer
    for (_, texture) in cache {
        texture.setPurgeableState(.volatile)
    }

    // Rely on a semaphore to manage triple buffered rendering
    self.semaphore.signal()
})
commandBuffer.commit()
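One possible way to reconcile this with triple buffering (a sketch of my own, not something from the WWDC video) is to keep a per-texture in-flight count, and only return a texture to volatile once no command buffer is still using it. The helper names below (inFlightCounts, markUsed, markCompleted) are hypothetical:

var inFlightCounts = [String: Int]()
let countsLock = NSLock()

// Call before committing a command buffer, with the keys it binds
func markUsed(_ keys: [String], in cache: [String: MTLTexture]) {
    countsLock.lock(); defer { countsLock.unlock() }
    for key in keys {
        inFlightCounts[key, default: 0] += 1
        cache[key]?.setPurgeableState(.nonVolatile)
    }
}

// Call from that command buffer's addCompletedHandler, with the same keys
func markCompleted(_ keys: [String], in cache: [String: MTLTexture]) {
    countsLock.lock(); defer { countsLock.unlock() }
    for key in keys {
        let remaining = (inFlightCounts[key] ?? 1) - 1
        if remaining <= 0 {
            // No other in-flight command buffer uses this texture,
            // so it should be safe to make it purgeable again
            inFlightCounts[key] = nil
            cache[key]?.setPurgeableState(.volatile)
        } else {
            inFlightCounts[key] = remaining
        }
    }
}

Because the counts are protected by a lock, markCompleted can be called safely from the completion handler's thread while another frame is being encoded.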

Related

Reduce memory usage of AVAssetWriter

As the title says, I am having some trouble with AVAssetWriter and memory.
Some notes about my environment/requirements:
I am NOT using ARC, but if there is a way to simply use it and get it all working, I'm all for it; my attempts so far have not made any difference, though. And the environment I will be using this in requires memory to be minimised / released ASAP.
Objective-C is a requirement
Memory usage must be as low as possible; the ~300 MB it takes up now makes it unstable when testing on my device (iPhone X).
The code
This is the code used when taking the screenshots below https://gist.github.com/jontelang/8f01b895321d761cbb8cda9d7a5be3bd
The problem / items kept around in memory
Most of the things that seem to take up a lot of memory throughout the processing appear to be allocated at the beginning.
So at this point it doesn't seem to me that the issue is with my code. The code I personally have control over, namely loading the images, creating the buffers, and releasing them, does not seem to be where the memory problem lies. For example, if I mark in Instruments the majority of the time after the point above, the memory is stable and none of it is kept around.
The only reason the 5 MB persists is that it is deallocated just after the marking period ends.
Now what?
I actually started writing this question with the focus on whether my code was releasing things correctly or not, but now it seems like that is fine. So what are my options now?
Is there something I can configure within the current code to make the memory requirements smaller?
Is there simply something wrong with my setup of the writer/input?
Do I need to use a totally different way of making a video to be able to make this work?
A note on using CVPixelBufferPool
In the documentation of CVPixelBufferCreate Apple states:
If you need to create and release a number of pixel buffers, you should instead use a pixel buffer pool (see CVPixelBufferPool) for efficient reuse of pixel buffer memory.
I have tried this as well, but I saw no change in memory usage. Changing the attributes for the pool didn't seem to have any effect either, so there is a small possibility that I am not actually using it 100% properly, although from comparing against code online it seems like I am, at least. And the output file works.
The code for that is here: https://gist.github.com/jontelang/41a702d831afd9f9ceeb0f9f5365de03
And here is a slightly different version where I set up the pool in a slightly different way: https://gist.github.com/jontelang/c0351337bd496a6c7e0c94293adf881f
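For reference, a minimal sketch of the pool approach (in Swift for brevity; the pixel format and dimensions below are placeholder assumptions, not values from my code):

import CoreVideo

// Placeholder attributes; match these to your writer's settings
let attrs: [String: Any] = [
    kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA,
    kCVPixelBufferWidthKey as String: 1920,
    kCVPixelBufferHeightKey as String: 1080,
]

var pool: CVPixelBufferPool?
CVPixelBufferPoolCreate(kCFAllocatorDefault, nil, attrs as CFDictionary, &pool)

// Buffers vended by the pool recycle their memory when released,
// instead of allocating fresh backing storage each time
var pixelBuffer: CVPixelBuffer?
if let pool = pool {
    CVPixelBufferPoolCreatePixelBuffer(kCFAllocatorDefault, pool, &pixelBuffer)
}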
Update 1
So I looked a bit deeper into a trace, to figure out when/where the majority of the allocations are coming from. Here is an annotated image of that:
The takeaway is:
The space is not allocated "with" the AVAssetWriter
The 500mb that is held until the end is allocated within 500ms after the processing starts
It seems that it is done internally in AVAssetWriter
I have the .trace file uploaded here: https://www.dropbox.com/sh/f3tf0gw8gamu924/AAACrAbleYzbyeoCbC9FQLR6a?dl=0
When creating the dispatch queue, ensure you create a queue with an autorelease pool: replace DISPATCH_QUEUE_SERIAL with DISPATCH_QUEUE_SERIAL_WITH_AUTORELEASE_POOL.
Wrap each iteration of the for loop in an autorelease pool as well, like this:
[assetWriterInput requestMediaDataWhenReadyOnQueue:recordingQueue usingBlock:^{
    for (int i = 1; i < 200; ++i) {
        @autoreleasepool {
            while (![assetWriterInput isReadyForMoreMediaData]) {
                [NSThread sleepForTimeInterval:0.01];
            }
            NSString *path = [NSString stringWithFormat:@"/Users/jontelang/Desktop/SnapperVideoDump/frames/frame_%i.jpg", i];
            UIImage *image = [UIImage imageWithContentsOfFile:path];
            CGImageRef ref = [image CGImage];
            CVPixelBufferRef buffer = [self pixelBufferFromCGImage:ref pool:writerAdaptor.pixelBufferPool];
            CMTime presentTime = CMTimeAdd(CMTimeMake(i, 60), CMTimeMake(1, 60));
            [writerAdaptor appendPixelBuffer:buffer withPresentationTime:presentTime];
            CVPixelBufferRelease(buffer);
        }
    }
    [assetWriterInput markAsFinished];
    [assetWriter finishWritingWithCompletionHandler:^{}];
}];
No, I see it peaking at around 240 MB in the app. It's my first time using this allocation instrument - interesting.
I'm using AVAssetWriter to write a video file by streaming CMSampleBuffers received from AVCaptureVideoDataOutputSampleBufferDelegate as the camera capture output delivers them in real time.
While I have not yet found the actual issue, the memory problem I described in this question was solved simply by running on an actual device instead of on the simulator.
@Eugene_Dudnyk's answer is spot on - the autorelease pool INSIDE the for or while loop is the key. Here is how I got it working in Swift; also, please use AVAssetWriterInputPixelBufferAdaptor for the pixel buffer pool:
videoInput.requestMediaDataWhenReady(on: videoInputQueue) { [weak self] in
    while videoInput.isReadyForMoreMediaData {
        autoreleasepool {
            guard let sample = assetReaderVideoOutput.copyNextSampleBuffer(),
                  let buffer = CMSampleBufferGetImageBuffer(sample) else {
                print("Error while processing video frames")
                videoInput.markAsFinished()
                DispatchQueue.main.async {
                    videoFinished = true
                    closeWriter()
                }
                return
            }
            // Process the image and render it back into the buffer (an in-place
            // operation, where ciProcessedImage is your processed new image)
            self?.getCIContext().render(ciProcessedImage, to: buffer)
            let timeStamp = CMSampleBufferGetPresentationTimeStamp(sample)
            self?.adapter?.append(buffer, withPresentationTime: timeStamp)
        }
    }
}
My memory usage stopped rising.
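For the queue itself, the Swift counterpart of the DISPATCH_QUEUE_SERIAL_WITH_AUTORELEASE_POOL advice above is the autoreleaseFrequency parameter when creating the queue (the label below is illustrative):

let videoInputQueue = DispatchQueue(label: "video.input.queue",
                                    autoreleaseFrequency: .workItem)  // drain an autorelease pool after each work item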

What is ID3D12GraphicsCommandList::DiscardResource?

What exactly should I expect to happen when using DiscardResource?
What's the difference between discard and destroying/deleting a resource.
When is a good time/use-case to discard a resource?
Unfortunately Microsoft doesn't seem to say much about it other than that it "discards a resource".
TL;DR: It is a rarely used function that provides a driver hint related to handling clear/compression structures. You are unlikely to use it except based on specific performance advice.
DiscardResource is the DirectX 12 version of the Direct3D 11.1 DiscardView / DiscardResource methods. See Microsoft Docs.
The primary use of these methods is to optimize the performance of tile-based deferred rasterizer graphics parts by discarding the render target after present. This is a hint to the driver that the contents of the render target are no longer relevant to the operation of the program, so it can avoid some internal clearing operations on the next use.
For DirectX 11, the DirectX 11 App template uses DiscardView because it makes use of DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL:
void DX::DeviceResources::Present()
{
    // The first argument instructs DXGI to block until VSync, putting the application
    // to sleep until the next VSync. This ensures we don't waste any cycles rendering
    // frames that will never be displayed to the screen.
    DXGI_PRESENT_PARAMETERS parameters = { 0 };
    HRESULT hr = m_swapChain->Present1(1, 0, &parameters);

    // Discard the contents of the render target.
    // This is a valid operation only when the existing contents will be entirely
    // overwritten. If dirty or scroll rects are used, this call should be removed.
    m_d3dContext->DiscardView1(m_d3dRenderTargetView.Get(), nullptr, 0);

    // Discard the contents of the depth stencil.
    m_d3dContext->DiscardView1(m_d3dDepthStencilView.Get(), nullptr, 0);

    // If the device was removed either by a disconnection or a driver upgrade, we
    // must recreate all device resources.
    if (hr == DXGI_ERROR_DEVICE_REMOVED || hr == DXGI_ERROR_DEVICE_RESET)
    {
        HandleDeviceLost();
    }
    else
    {
        DX::ThrowIfFailed(hr);
    }
}
The DirectX 12 App template doesn't need those explicit calls because it uses DXGI_SWAP_EFFECT_FLIP_DISCARD.
If you are wondering why the DirectX 11 app doesn't just use DXGI_SWAP_EFFECT_FLIP_DISCARD, it probably should. The DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL swap effect was the only one supported by Windows 8.x for Windows Store apps, which is when DiscardView was introduced. For Windows 10 / DirectX 12 / UWP, it's probably better to always use DXGI_SWAP_EFFECT_FLIP_DISCARD unless you specifically don't want the backbuffer discarded.
It is also useful for multi-GPU SLI / Crossfire configurations since the clearing operation can require synchronization between the GPUs. See this GDC 2015 talk
There are also other scenario-specific usages. For example, if doing deferred rendering for the G-buffer where you know every single pixel will be overwritten, you can use DiscardResource instead of doing ClearRenderTargetView / ClearDepthStencilView.

Apple Metal blitCommandEncoder in multi-thread situation

I have a loop that sends jobs off to the GPU using the managed memory model. The code is:
var commandBufferArray: [MTLCommandBuffer] = []
var blitCommandArray: [MTLBlitCommandEncoder] = []
var outputDeviateBufferArray: [MTLBuffer] = []

for i_cycle in 0..<n
{
    commandBufferArray.append(mc.metalCommandQueue.makeCommandBuffer())
    let outputDeviate = [float4](repeating: float4(0.0), count: 1024)
    outputDeviateBufferArray.append(mc.createFloat4MetalBufferManaged(outputDeviate))
    populateBuffersMetalJob(.....)
    blitCommandArray.append(commandBufferArray[i_cycle].makeBlitCommandEncoder())
    blitCommandArray[i_cycle].synchronize(resource: outputDeviateBufferArray[i_cycle])
    blitCommandArray[i_cycle].endEncoding()
    commandBufferArray[i_cycle].addCompletedHandler({ _ in
        // do stuff with result
    })
    commandBufferArray[i_cycle].commit()
}

for i_cycle in 0..<n
{
    commandBufferArray[i_cycle].waitUntilCompleted()
}
I am using the AMD GPU on a 2015 MBP. If n = 1, this works fine. Once n > 1, it seems to hang on the synchronization call and never completes.
Any thoughts on what is going wrong here?
What is in the // do stuff with result code? I suspect you're doing something in there that's deadlocking. Perhaps it's trying to run something on the main thread, where the code you've shown is blocked. Or it's trying to access a resource that you have locked. That prevents the completed handler(s) from finishing, which prevents the command buffer from moving on and letting the next command buffer run or complete.
If you take a sample of the process, it can provide hints about where it's stuck and what it's waiting for. You can do that using the sample command-line tool or Activity Monitor > View > Sample Process.
Also, why are you using multiple command buffers? And why multiple blit command encoders? You do realize you could do all of this using a single command buffer and a single blit command encoder, right?
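To illustrate that last point, here is a sketch of the consolidated version. It reuses the question's names (mc, populateBuffersMetalJob, float4, n), so the surrounding code is an assumption:

guard let commandBuffer = mc.metalCommandQueue.makeCommandBuffer() else { return }
var outputDeviateBufferArray: [MTLBuffer] = []

// Encode every cycle's work into the same command buffer
for _ in 0..<n {
    let outputDeviate = [float4](repeating: float4(0.0), count: 1024)
    outputDeviateBufferArray.append(mc.createFloat4MetalBufferManaged(outputDeviate))
    // populateBuffersMetalJob(.....) — encodes this cycle's work on commandBuffer
}

// A single blit encoder then synchronizes all of the managed buffers
guard let blitEncoder = commandBuffer.makeBlitCommandEncoder() else { return }
for buffer in outputDeviateBufferArray {
    blitEncoder.synchronize(resource: buffer)
}
blitEncoder.endEncoding()

commandBuffer.addCompletedHandler({ _ in
    // do stuff with the results for all cycles
})
commandBuffer.commit()
commandBuffer.waitUntilCompleted()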

Does metal have a back buffer?

I'm currently tracking down some visual popping in my Metal app, and believe it is because I'm drawing directly to the framebuffer rather than to a back buffer.
// this is when I've finished passing commands to the render buffer and issue the draw command. I believe this sends all the images directly to the framebuffer instead of using a backbuffer
[renderEncoder endEncoding];
[mtlCommandBuffer presentDrawable:frameDrawable];
[mtlCommandBuffer commit];
[mtlCommandBuffer release];
//[frameDrawable present]; // This line isn't needed (and I believe is performed by presentDrawable)
Several googles later, I haven't found any documentation of back buffers in Metal. I know I could roll my own, but I can't believe Metal doesn't support a back buffer.
Here is a code snippet showing how I've set up my CAMetalLayer object.
+ (id)layerClass
{
    return [CAMetalLayer class];
}

- (void)initCommon
{
    self.opaque = YES;
    self.backgroundColor = nil;
    ...
}

- (id <CAMetalDrawable>)getMetalLayer
{
    id <CAMetalDrawable> frameDrawable = nil;
    while (!frameDrawable || !frameDrawable.texture)
    {
        frameDrawable = [self->_metalLayer nextDrawable];
    }
    return frameDrawable;
}
Can I enable a back buffer on my CAMetalLayer object, or will I need to roll my own?
I assume by back-buffer, you mean a renderbuffer that is being rendered to, while the corresponding front-buffer is being displayed?
In Metal, the concept is provided by the drawables that you extract from CAMetalLayer. The CAMetalLayer instance maintains a small pool of drawables (generally 3), retrieves one of them from the pool each time you invoke nextDrawable, and returns it back to the pool after you've invoked presentDrawable and once rendering is complete (which may be some time later, since the GPU runs asynchronously from the CPU).
Effectively, on each frame loop, you grab a back-buffer by invoking nextDrawable, and make it eligible to become the front-buffer by invoking presentDrawable: and committing the MTLCommandBuffer.
Since there are only 3 drawables in the pool, the catch is that you have to manage this lifecycle yourself, by adding appropriate CPU resource synchronization at the time you invoke nextDrawable and in the callback you get once rendering is complete (as per the MTLCommandBuffer addCompletedHandler: callback set-up).
Typically you use a dispatch_semaphore_t for this:
_resource_semaphore = dispatch_semaphore_create(3);
then put the following just before you invoke nextDrawable:
dispatch_semaphore_wait(_resource_semaphore, DISPATCH_TIME_FOREVER);
and this in your addCompletedHandler: callback handler:
dispatch_semaphore_signal(_resource_semaphore);
Have a look at some of the simple Metal sample apps from Apple to see this in action. There is not a lot in terms of Apple documentation on this.
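Putting those pieces together, a minimal sketch of the pattern in Swift (the function and names are illustrative, not taken from Apple's samples):

let inFlightSemaphore = DispatchSemaphore(value: 3)  // matches the drawable pool size

func drawFrame(layer: CAMetalLayer, commandQueue: MTLCommandQueue) {
    // Block until one of the (at most 3) in-flight frames has finished
    inFlightSemaphore.wait()

    guard let drawable = layer.nextDrawable(),
          let commandBuffer = commandQueue.makeCommandBuffer() else {
        inFlightSemaphore.signal()
        return
    }

    // ... encode rendering into drawable.texture here ...

    commandBuffer.addCompletedHandler { _ in
        // The GPU has finished with this frame's drawable; free up a slot
        inFlightSemaphore.signal()
    }
    commandBuffer.present(drawable)
    commandBuffer.commit()
}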

Is it ok to create EAGLContext for each thread?

I want to do some work in my OpenGL ES project on concurrent GCD queues. Is it OK to create an EAGLContext for each thread? I'm going to do it this way:
queue_ = dispatch_queue_create("test.queue", DISPATCH_QUEUE_CONCURRENT);
dispatch_async(queue_, ^{
    NSMutableDictionary *threadDictionary = [[NSThread currentThread] threadDictionary];
    EAGLContext *context = threadDictionary[@"context"];
    if (!context) {
        context = /* creating EAGLContext with sharegroup */;
        threadDictionary[@"context"] = context;
    }
    if ([EAGLContext setCurrentContext:context]) {
        // rendering
        [EAGLContext setCurrentContext:nil];
    }
});
If this is not correct, what is the best practice for parallelizing OpenGL rendering?
Not only is it okay, this is the only way you can share OpenGL resources between multiple threads. Note that shareable resources are typically limited to resources that allocate memory (e.g. buffer objects, textures, shaders). They do not include objects that merely store state (e.g. the global state machine, Framebuffer Objects or Vertex Array Objects). But if you are considering modifying data that you are using for rendering, I would strongly advise against this.
Whenever GL has a command in the pipeline that has not finished, any attempt to modify a resource used by that command will block until the command finishes. A better solution would be to double-buffer your resources, have a copy you use for rendering and a separate copy you use for updating. When you finish updating, the next time your drawing thread uses that resource, have it swap the buffers used for updating and drawing. This will reduce the amount of time the driver has to synchronize your worker threads with the drawing thread.
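As a sketch of that double-buffering idea (generic Swift with the GL upload/draw calls elided, and every name here hypothetical):

final class DoubleBuffered<Resource: AnyObject> {
    private var drawCopy: Resource
    private var updateCopy: Resource
    private let lock = NSLock()

    init(_ a: Resource, _ b: Resource) {
        drawCopy = a
        updateCopy = b
    }

    // Drawing thread: render from the current draw copy
    func forDrawing() -> Resource {
        lock.lock(); defer { lock.unlock() }
        return drawCopy
    }

    // Worker thread: mutate the update copy, then swap roles so the next
    // frame renders the fresh data. This sketch assumes the swap is gated
    // on the GPU having finished with the old draw copy (e.g. via a fence
    // or frame-completion callback), which keeps the driver from stalling.
    func update(_ body: (Resource) -> Void) {
        body(updateCopy)
        lock.lock()
        swap(&drawCopy, &updateCopy)
        lock.unlock()
    }
}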
Now, if you are suggesting here that you want to draw from multiple threads, then you should re-think your strategy. OpenGL generally does not benefit from issuing draw commands from multiple threads, it just creates a synchronization nightmare. Multi-threading is useful mostly for controlling VSYNC on multiple windows (probably not something you will ever encounter in ES) or streaming resource data in the background.
