Metal in-place operation for `MTLComputeCommandEncoder ` on iOS - ios

Is it possible to have a Metal compute function that processes a texture in-place on iOS? I have noticed that some MPS image filters support in-place processing, and was wondering if there is a way to accomplish this with custom kernels.
Specifically, I am looking to combine two textures into one using a blend function. I am easily able to do this by making the first texture a render target and using a shader to write the second one on top, but it feels like an overkill since both textures are the same size.

Yes, you can take a texture parameter with the access::read_write attribute, and read and write it within the same kernel function invocation. You'll need to ensure that the texture is created with both the .read and .write usage flags. Additionally, note that writes are not guaranteed to be seen by any subsequent reads by the same thread unless you call the flush() function after the write.
By the way, MetalPerformanceShaders kernels that are able to operate "in-place" don't necessarily use read_write textures; it's often the case that they use auxiliary textures and buffers and do their work across multiple passes. Per the documentation, any kernel can fail to operate in-place for any number of reasons, so you should always provide a fallback allocator to handle such cases.

Related

When to use setVertexBytes/setVertexBuffer when dealing with small data in Metal?

The documentation for setVertexBytes says:
Use this method for single-use data smaller than 4 KB. Create a MTLBuffer object if your data exceeds 4 KB in length or persists for multiple uses.
What exactly does single-use mean?
For example, if I have a uniforms struct which is less than 4KB(and is updated every frame), is it better to use a triple buffer technique or simply use setVertexBytes?
From what I understand using setVertexBytes would copy the data every time into a MTLBuffer that Metal manages. This sounds slower than using triple buffering.
But then if I have different objects, each with its own uniforms, I would have to triple buffer everything, since it's dynamically updated.
And if I have a material that updates rarely but is passed to the shader every frame, would it be better to keep it in a buffer or pass it as a pointer using setVertexBytes?
It's not necessarily the case that Metal manages a distinct resource into which this data is written. As user Columbo notes in their comment, some hardware allows constant data to be recorded directly into command buffer memory, from which it can be subsequently read by a shader.
As always, you should profile in order to find the difference between the two approaches on your target hardware, but if the amount of data you're pushing per draw call is small, you might very well find that using setVertexBytes:... is faster than writing into a buffer and calling setVertexBuffer:....
For data that doesn't vary every frame (your slow-varying material use case), it may indeed be more efficient to keep that data in a buffer (double- or triple-buffered) rather than using setVertexBytes:....

Metal Best Practice: Triple-buffering – Textures too?

In the Metal Best Practices Guide, it states that for best performance one should "implement a triple buffering model to update dynamic buffer data," and that "dynamic buffer data refers to frequently updated data stored in a buffer."
Does an MTLTexture qualify as "frequently updated data stored in a buffer" if it needs to be updated every frame? All the examples in the guide above focus on MTLBuffers.
I notice Apple's implementation in MetalKit has a concept of a nextDrawable, so perhaps that's what's happening here?
If a command could be in flight and it could access (read/sample/write) the texture while you're modifying that same texture on the CPU (e.g. using one of the -replaceRegion:... methods or by writing to a backing IOSurface), then you will need a multi-buffering technique, yes.
If you're only modifying the texture on the GPU (by rendering to it, writing to it from a shader function, or using blit command encoder methods to copy to it), then you don't need multi-buffering. You may need to use a texture fence within the shader function or you may need to call -textureBarrier on the render command encoder between draw calls, depending on exactly what you're doing.
Yes, nextDrawable provides a form of multi-buffering. In this case, it's not due to CPU access, though. You're going to render to one texture while the previously-rendered texture may still be on its way to the screen. You don't want to use the same texture for both because the new rendering could overdraw the texture just before it's put on screen, thus showing corrupt results.

Texture streaming in DirectX11, Immutable vs Dynamic

We often have the case where we need to stream textures to the graphics card (in game case: terrains, in my case image from different input sources like cameras/capture cards/videos)
Of course in camera case, I receive my data in a separate thread, but still need to upload that data to the GPU for display.
I know 2 models for it.
Use a dynamic resource:
You create a dynamic texture which has the same size and format as your input image, when you receive a new image you set a flag that tells you need upload, and then use map in the device context to upload the texture data (with eventual double buffer of course).
Advantage is you have a single memory location, hence you don't have memory fragmentation over time.
Drawback is you need to upload in immediate context, so your upload had to be in your render loop.
Use immutable and load/discard
In that case you upload in the image receiving thread, by creating a new resource, push the data and discard the old resource.
Advantage is you should have a stall free upload (no need for immediate context, you can still run your command list while texture is uploading), resource can be used with a simple trigger once available (to swap SRV).
Drawback is you can fragment memory over time (by allocating and freeing resources in a constant manner (30 fps for a standard camera as example).
Also you have to deal with throttling yourself (but that part is not a big deal).
So is there something I missed in those techniques, or is there an even better way to handle this?
These are the two main methods of updating textures D3D11.
However, the assumption that the first method will not result in memory usage patterns identical to the second case is dependent on the driver, and likely is not true. You would use D3D11_MAP_WRITE_DISCARD if you are overwriting the whole image (which it sounds like what you are doing), meaning that the current contents of the buffer become undefined. However, this is only true from the CPU's point-of-view. They are retained for the GPU, if they are potentially used in a pending draw operation. Most (maybe all?) drivers will actually allocate new storage for the write location of the mapped texture in this case, otherwise command buffer processing would need to stall. The same holds if you do not use the discard flag. Instead, when the map command is processed in the command buffer, the resource's buffer is updated to the value returned from Map in D3D11_MAPPED_SUBRESOURCE.
Also, it is not true that you must update dynamic textures in the immediate context. Only that if you update them in a deferred context, you must use the D3D11_MAP_DISCARD flag. This means you could update the texture on a worker thread, if you are overwriting the entire texture.
The bottom line is that, since the CPU/GPU system on a PC is not a unified memory system, there will be synchronization issues updating GPU resources coming from the CPU.

Cascading Image filters in opencl without using Texture Memory

I am working on a custom device that supports OpenCL 1.2 Embedded Profile and does not have Image support or Texture Memory. I have to pass an image through a Sobel filter and then a Median filter. What could be the best (fastest) way of doing this? Can I avoid having to send the image back to the host after Sobel filter and then reading it back on the device for Median filter? Where to store the intermediate image, global memory, local memory or elsewhere?
You can keep the buffer in the global memory of the device between kernel calls to avoid the extra copies. When you create the buffer, make sure you use the flag 'CL_MEM_READ_WRITE', this will allow the Sobel kernel to write to it, and the Median kernel to read from it afterward. You can get away with two buffers, but I would use three if memory is not a restriction.
create 3 buffers. call them whatever you'd like. (originalBuff, middleBuff, finalBuff)
copy the image data to originalBuff
optionally set other buffers to an all-zero state (can be done on the device by the kernels which write to these buffers)
call the sobel filter kernel with params (originalBuff, middleBuff)
call median kernel with params (middleBuff, finalBuff)
read finalBuff back to host
I left out the other steps, such as creating context/program/queue/etc.. in order to focus on the answer to your question.
Read about clCreateBuffer here.
EDIT:
I have not tried the flag 'CL_MEM_HOST_NO_ACCESS' before, but I think it is worth a try. In my example, middleBuff might benefit from this flag. Like most opencl features, any possible benefit would be implementation-dependent.

Can I reference an texture unit besides the previous and current in a combiner?

The PowerVR SGX platform, for example, supports 8 texture units (TEXTURE0...TEXTURE7), which can be accessed directly without shaders.
Using texture combiners, I can access values from the previous texture stage (GL_PREVIOUS) or the currently bound texture (GL_TEXTURE), etc. Is there a way to access anything from any stage before the immediately-previous one?
E.g. if I want to set up essentially multiple independent threads of processing and then combine the ultimate result for output, is this possible? Or am I restricted to flowing data from n to n+1 only?
No, the flow is restricted from n to n+1. The Combiner API hasn't been touched in years, it's modern replacement (Shaders) are much more flexible.

Resources