When to use setVertexBytes/setVertexBuffer when dealing with small data in Metal?

The documentation for setVertexBytes says:
Use this method for single-use data smaller than 4 KB. Create a MTLBuffer object if your data exceeds 4 KB in length or persists for multiple uses.
What exactly does single-use mean?
For example, if I have a uniforms struct that is smaller than 4 KB (and is updated every frame), is it better to use a triple-buffering technique or simply use setVertexBytes?
From what I understand, using setVertexBytes would copy the data every time into an MTLBuffer that Metal manages. This sounds slower than using triple buffering.
But then if I have different objects, each with its own uniforms, I would have to triple buffer everything, since it's dynamically updated.
And if I have a material that updates rarely but is passed to the shader every frame, would it be better to keep it in a buffer or pass it as a pointer using setVertexBytes?

It's not necessarily the case that Metal manages a distinct resource into which this data is written. As user Columbo notes in their comment, some hardware allows constant data to be recorded directly into command buffer memory, from which it can be subsequently read by a shader.
As always, you should profile in order to find the difference between the two approaches on your target hardware, but if the amount of data you're pushing per draw call is small, you might very well find that using setVertexBytes:... is faster than writing into a buffer and calling setVertexBuffer:....
For data that doesn't vary every frame (your slow-varying material use case), it may indeed be more efficient to keep that data in a buffer (double- or triple-buffered) rather than using setVertexBytes:....
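As a rough illustration of the two options discussed above, here is a minimal CPU-side sketch in C++. It makes no Metal calls. The 4 KB threshold comes from the documentation quoted in the question; the names fitsInlineBytes and bufferedSlot are hypothetical helpers, and the slot scheme assumes the app never lets more than `copies` frames be in flight at once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Threshold from the setVertexBytes documentation quoted in the question.
constexpr std::size_t kInlineBytesLimit = 4096;

// Hypothetical helper: true when the data matches the documented
// "single-use, smaller than 4 KB" case for passing bytes inline.
constexpr bool fitsInlineBytes(std::size_t sizeInBytes, bool singleUse) {
    return singleUse && sizeInBytes < kInlineBytesLimit;
}

// Hypothetical helper for the buffered alternative: frame f writes copy
// f % copies. This is only safe under the assumption that the app waits
// so that at most `copies` frames are ever in flight.
inline std::uint32_t bufferedSlot(std::uint64_t frameIndex, std::uint32_t copies = 3) {
    assert(copies > 0);
    return static_cast<std::uint32_t>(frameIndex % copies);
}
```

Which path is actually faster still depends on the hardware, so treat the helper as a starting point for profiling rather than a rule.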


Binding all textures on one huge descriptor set

I have a design question for a Vulkan game engine:
In my game engine I bind all "static" texture resources in one huge descriptor set (256k descriptors), and my shaders access those samplers by dynamic indexing.
[For example, when I want to sample a normal map that belongs to a certain game object, I add a new uint to the material's UBO that holds the index of the object's normal-map descriptor inside the huge descriptor set; then I sample it and compute the final object color.]
I wondered whether this way of accessing an object's textures is efficient compared to the idea of binding each object's textures in its own per-object descriptor set (alongside the material UBO).
Can the size of a descriptor set drastically affect texel access speed?
Or is my idea bad?
Again, sorry about my English.
There are no performance issues with indexing from an array of sampler descriptors. The only real reason not to do things this way is that implementations may not let you dynamically index such arrays. But if you're requiring that from the implementation (all desktop implementations allow it), then just keep doing it; it's a common technique for reducing the number of state changes you have to issue on the CPU.
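The CPU-side bookkeeping for this technique can be sketched as follows. This is a minimal C++ illustration with no Vulkan calls: TextureSlotTable, MaterialUbo and kInvalidSlot are hypothetical names, and the sketch assumes slots are never freed. Each texture gets a fixed index in the big descriptor array, and the material data carries that index instead of a binding.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Sentinel returned when the descriptor array is full (hypothetical).
constexpr std::uint32_t kInvalidSlot = 0xFFFFFFFFu;

// Hypothetical bookkeeping for one large descriptor array.
class TextureSlotTable {
public:
    explicit TextureSlotTable(std::uint32_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Returns the array index for a texture, assigning the next free slot
    // on first use. Returns kInvalidSlot when the array is full.
    std::uint32_t slotFor(const std::string& textureName) {
        auto it = slots_.find(textureName);
        if (it != slots_.end()) return it->second;
        if (next_ >= capacity_) return kInvalidSlot;
        slots_.emplace(textureName, next_);
        return next_++;
    }

private:
    std::uint32_t capacity_;
    std::uint32_t next_ = 0;
    std::unordered_map<std::string, std::uint32_t> slots_;
};

// Material data as the shader would read it: indices, not bindings.
struct MaterialUbo {
    std::uint32_t albedoIndex;
    std::uint32_t normalMapIndex;
};
```

With this layout, switching materials only changes a couple of integers in a uniform buffer, which is where the reduction in CPU-side state changes comes from.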

DirectCompute: How to read from a RWTexture2D<float4>?

I have the following buffer:
RWTexture2D<float4> Output : register(u0);
This buffer is used by a compute shader for rendering a computed image.
To write a pixel in that texture, I just use code similar to this:
Output[XY] = SomeFunctionReturningFloat4(SomeArgument);
This works very well and my computed image is correctly rendered on screen.
Now at some stage in the compute shader, I would like to read back an
already computed pixel and process it again.
Output[XY] = SomeOtherFunctionReturningFloat4(Output[XY]);
The compiler returns an error:
error X3676: typed UAV loads are only allowed for single-component 32-bit element types
Any help appreciated.
In compute shaders, data access is limited for some data types, and the rules are not at all intuitive. In your case, you use a RWTexture2D<float4>.
That is a typed UAV with the DXGI_FORMAT_R32G32B32A32_FLOAT format.
This format is supported for typed UAV stores, but not for typed UAV loads.
Basically, you can write to it but not read from it. Typed UAV loads only support single-component 32-bit formats, in your case DXGI_FORMAT_R32_FLOAT.
Your code should run if you use a RWTexture2D<float> but I suppose this is not enough for you.
Possible workarounds that spring to mind are:
1. Use 4 different RWTexture2D<float>, one for each component.
2. Use 2 different textures: a RWTexture2D<float4> to write your values and a Texture2D<float4> to read from.
3. Use a RWStructuredBuffer instead of the texture.
I don’t know your code, so I don’t know whether solutions 1. and 2. are viable. However, I strongly suggest going for 3. and using a RWStructuredBuffer. A RWStructuredBuffer can hold any type of struct and can easily cover all your needs. To be honest, in compute shaders I almost exclusively use them to pass data. If you need the final output to be a texture, you can do all your calculations on the buffer, then copy the results to the texture when you’re done. I would add that drivers often use CompletePath to access RWTexture2D data and FastPath to access RWStructuredBuffer data, making the former much slower than the latter.
Reference for data type access is here. Scroll down to UAV typed load.
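To make workaround 3 concrete, here is a CPU-side C++ sketch of the idea; it is not HLSL and not the asker's code. PixelBuffer stands in for a RWStructuredBuffer of float4 elements, and Float4, indexOf and halveInPlace are hypothetical names. The point is that a 2D pixel coordinate maps to a 1D element index, and that element can then be read and written in place.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Float4 { float x, y, z, w; };

// CPU stand-in for a structured buffer holding a width * height image.
struct PixelBuffer {
    std::size_t width;
    std::size_t height;
    std::vector<Float4> data;

    PixelBuffer(std::size_t w, std::size_t h) : width(w), height(h), data(w * h) {}

    // Row-major mapping from a 2D pixel coordinate to the 1D element index.
    std::size_t indexOf(std::size_t x, std::size_t y) const {
        assert(x < width && y < height);
        return y * width + x;
    }

    Float4& at(std::size_t x, std::size_t y) { return data[indexOf(x, y)]; }
};

// Reads an already computed pixel and processes it again in place,
// which is the operation the typed UAV load restriction prevents.
inline void halveInPlace(PixelBuffer& buf, std::size_t x, std::size_t y) {
    Float4& p = buf.at(x, y);
    p = Float4{p.x * 0.5f, p.y * 0.5f, p.z * 0.5f, p.w * 0.5f};
}
```

In a shader the same y * width + x mapping would be used to index the buffer, with the width passed in as a constant.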

Metal Best Practice: Triple-buffering – Textures too?

In the Metal Best Practices Guide, it states that for best performance one should "implement a triple buffering model to update dynamic buffer data," and that "dynamic buffer data refers to frequently updated data stored in a buffer."
Does an MTLTexture qualify as "frequently updated data stored in a buffer" if it needs to be updated every frame? All the examples in the guide above focus on MTLBuffers.
I notice Apple's implementation in MetalKit has a concept of a nextDrawable, so perhaps that's what's happening here?
If a command could be in flight and it could access (read/sample/write) the texture while you're modifying that same texture on the CPU (e.g. using one of the -replaceRegion:... methods or by writing to a backing IOSurface), then you will need a multi-buffering technique, yes.
If you're only modifying the texture on the GPU (by rendering to it, writing to it from a shader function, or using blit command encoder methods to copy to it), then you don't need multi-buffering. You may need to use a texture fence within the shader function or you may need to call -textureBarrier on the render command encoder between draw calls, depending on exactly what you're doing.
Yes, nextDrawable provides a form of multi-buffering. In this case, it's not due to CPU access, though. You're going to render to one texture while the previously-rendered texture may still be on its way to the screen. You don't want to use the same texture for both because the new rendering could overdraw the texture just before it's put on screen, thus showing corrupt results.

DirectX 12 Updating the Descriptor Heap

I'm currently writing my own graphics framework for DirectX12 (I've already written several DirectX 11 frameworks for personal game engines), and I'm currently trying to copy the methods used in the recent Hitman game for resource binding.
I'm confused about the best way to handle per-object resource binding for the SRV/CBV/UAV heap. I've watched several GDC presentations, and they all seem to gloss over this.
Only 1 SRV/CBV/UAV heap can be bound at a time, and switching the currently-bound heap in the middle of a command list can be bad for performance on some hardware by forcing a flush. Because of this, what is the best way to handle updating the heap with new descriptors? To me, it seems like each command list would:
Get a hold of a SRV/CBV/UAV heap for itself.
For each object in a subset of objects, create descriptors on the heap pointing to per-object data that was placed into a separate upload heap.
Afterwards, another command list takes this filled descriptor heap and binds it, then issues draw calls mixed with SetGraphicsRootDescriptorTable in order to move through the current descriptor heap.
This being said, several sources online (including another SO post) suggest using one large SRV/CBV/UAV heap and copying into it using CPU-visible heaps. I'm assuming they're not attempting to use the asynchronous CopyDescriptors, but rather CopyBufferRegion. I tried using CopyBufferRegion to update data per-object, but to me this seems under-performant with so many transitions between D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER and D3D12_RESOURCE_STATE_COPY_DEST. Am I misunderstanding something? Any clarity would be appreciated.
CopyDescriptors is not asynchronous; it is a CPU operation that completes immediately on the CPU. For volatile descriptors, the copy can happen any time before the command list is executed (even after the command that uses the descriptor has been recorded). For static descriptors (root signature 1.1), the descriptor has to be ready at the point where it is used.
The usual approach is to have a large descriptor heap, keep a portion for static descriptors, then use the rest as a ring buffer, allocating descriptor table offset on demand to copy and use the needed descriptor for any draw/compute operation.
CopyBufferRegion is not needed here. Remember that writing to a mapped buffer is also an immediate CPU operation, so you can likewise treat a big chunk of memory as a ring buffer for your per-object constant buffers and cycle through it. The only catch is that you must not overwrite memory or descriptors while they may still be in use, so you have to use a fence to prevent that.
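The ring-buffer-plus-fence scheme described above could be sketched like this. It is a CPU-only C++ illustration with no D3D12 calls; DescriptorRing, kInvalidOffset, allocate and retire are hypothetical names. The sketch assumes fence values are passed to allocate in non-decreasing order and that the caller reports the last fence value the GPU has completed.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Sentinel returned when an allocation would overwrite in-use slots.
constexpr std::uint32_t kInvalidOffset = 0xFFFFFFFFu;

// Hypothetical allocator: slots [0, staticCount) are reserved for static
// descriptors, the remainder of the heap is handed out as a ring.
class DescriptorRing {
public:
    DescriptorRing(std::uint32_t heapSize, std::uint32_t staticCount)
        : base_(staticCount), size_(heapSize - staticCount) {
        assert(heapSize > staticCount);
    }

    // Allocates `count` contiguous slots tagged with the fence value of the
    // submission that will use them. Returns the heap offset of the first slot.
    std::uint32_t allocate(std::uint32_t count, std::uint64_t fenceValue) {
        if (count == 0 || count > size_) return kInvalidOffset;
        // If the range would run past the end, skip the tail and wrap.
        std::uint32_t padding = (head_ + count > size_) ? size_ - head_ : 0;
        if (used_ + padding + count > size_) return kInvalidOffset;
        std::uint32_t start = (padding != 0) ? 0 : head_;
        head_ = (start + count) % size_;
        used_ += padding + count;
        inFlight_.push_back({fenceValue, padding + count});
        return base_ + start;
    }

    // Frees every allocation whose fence value the GPU has completed.
    void retire(std::uint64_t completedFence) {
        while (!inFlight_.empty() && inFlight_.front().fenceValue <= completedFence) {
            used_ -= inFlight_.front().slots;
            inFlight_.pop_front();
        }
    }

private:
    struct Batch { std::uint64_t fenceValue; std::uint32_t slots; };
    std::uint32_t base_;
    std::uint32_t size_;
    std::uint32_t head_ = 0;  // next free slot, relative to base_
    std::uint32_t used_ = 0;  // slots still owned by in-flight work
    std::deque<Batch> inFlight_;
};
```

The same structure works for a ring of per-object constant data if the offsets are interpreted as bytes instead of descriptor slots.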

Texture streaming in DirectX11, Immutable vs Dynamic

We often have the case where we need to stream textures to the graphics card (in game case: terrains, in my case image from different input sources like cameras/capture cards/videos)
Of course in camera case, I receive my data in a separate thread, but still need to upload that data to the GPU for display.
I know 2 models for it.
Use a dynamic resource:
You create a dynamic texture which has the same size and format as your input image. When you receive a new image, you set a flag that tells you an upload is needed, and then use Map on the device context to upload the texture data (possibly with double buffering, of course).
Advantage is you have a single memory location, hence you don't have memory fragmentation over time.
Drawback is you need to upload in the immediate context, so your upload has to be in your render loop.
Use immutable and load/discard
In that case you upload in the image receiving thread, by creating a new resource, push the data and discard the old resource.
Advantage is you should have a stall free upload (no need for immediate context, you can still run your command list while texture is uploading), resource can be used with a simple trigger once available (to swap SRV).
Drawback is you can fragment memory over time by constantly allocating and freeing resources (30 fps for a standard camera, as an example).
Also you have to deal with throttling yourself (but that part is not a big deal).
So is there something I missed in those techniques, or is there an even better way to handle this?
These are the two main methods of updating textures in D3D11.
However, the assumption that the first method will not result in memory usage patterns identical to the second case is dependent on the driver, and likely is not true. You would use D3D11_MAP_WRITE_DISCARD if you are overwriting the whole image (which it sounds like what you are doing), meaning that the current contents of the buffer become undefined. However, this is only true from the CPU's point-of-view. They are retained for the GPU, if they are potentially used in a pending draw operation. Most (maybe all?) drivers will actually allocate new storage for the write location of the mapped texture in this case, otherwise command buffer processing would need to stall. The same holds if you do not use the discard flag. Instead, when the map command is processed in the command buffer, the resource's buffer is updated to the value returned from Map in D3D11_MAPPED_SUBRESOURCE.
Also, it is not true that you must update dynamic textures in the immediate context. The only restriction is that if you update them in a deferred context, you must use the D3D11_MAP_WRITE_DISCARD flag. This means you could update the texture on a worker thread, if you are overwriting the entire texture.
The bottom line is that, since the CPU/GPU system on a PC is not a unified memory system, there will be synchronization issues updating GPU resources coming from the CPU.
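Whichever update path is used, the copy into mapped texture memory has to respect the row pitch reported in D3D11_MAPPED_SUBRESOURCE (its RowPitch member), which can be larger than the width of a row in bytes. A minimal C++ sketch of that copy, with no D3D11 calls, follows; copyRows is a hypothetical helper, and in real code dst and dstPitch would come from pData and RowPitch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies `height` rows of `rowBytes` bytes each. Rows in the source are
// `srcPitch` bytes apart and rows in the destination `dstPitch` bytes apart,
// so any padding at the end of a destination row is left untouched.
inline void copyRows(std::uint8_t* dst, std::size_t dstPitch,
                     const std::uint8_t* src, std::size_t srcPitch,
                     std::size_t rowBytes, std::size_t height) {
    assert(rowBytes <= dstPitch && rowBytes <= srcPitch);
    for (std::size_t y = 0; y < height; ++y) {
        std::memcpy(dst + y * dstPitch, src + y * srcPitch, rowBytes);
    }
}
```

A single memcpy of the whole image is only correct when both pitches equal rowBytes, which should not be assumed for mapped textures.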
