We often need to stream textures to the graphics card (in the game case: terrain; in my case: images from different input sources such as cameras, capture cards, and videos).
Of course, in the camera case I receive the data on a separate thread, but I still need to upload it to the GPU for display.
I know of two models for this.
Use a dynamic resource:
You create a dynamic texture with the same size and format as your input image. When you receive a new image, you set a flag indicating that an upload is needed, then use Map on the device context to upload the texture data (with optional double buffering, of course).
The advantage is that you have a single memory location, so you don't fragment memory over time.
The drawback is that you need to upload on the immediate context, so the upload has to happen in your render loop.
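A minimal sketch of this dynamic-resource path, assuming BGRA8 camera frames and an already-created device and immediate context (the names are placeholders, not part of any real framework):

    #include <d3d11.h>
    #include <cstdint>
    #include <cstring>

    // Created once, matching the camera frame size/format (BGRA8 assumed here).
    ID3D11Texture2D* CreateStreamingTexture(ID3D11Device* device, UINT width, UINT height)
    {
        D3D11_TEXTURE2D_DESC desc = {};
        desc.Width = width;
        desc.Height = height;
        desc.MipLevels = 1;
        desc.ArraySize = 1;
        desc.Format = DXGI_FORMAT_B8G8R8A8_UNORM;
        desc.SampleDesc.Count = 1;
        desc.Usage = D3D11_USAGE_DYNAMIC;
        desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
        desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
        ID3D11Texture2D* texture = nullptr;
        device->CreateTexture2D(&desc, nullptr, &texture);
        return texture;
    }

    // Called from the render loop, on the immediate context, when the "new frame" flag is set.
    void UploadFrame(ID3D11DeviceContext* ctx, ID3D11Texture2D* texture,
                     const uint8_t* frame, UINT width, UINT height)
    {
        D3D11_MAPPED_SUBRESOURCE mapped;
        if (SUCCEEDED(ctx->Map(texture, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
        {
            uint8_t* dst = static_cast<uint8_t*>(mapped.pData);
            for (UINT row = 0; row < height; ++row)   // copy row by row: RowPitch may differ
                std::memcpy(dst + row * mapped.RowPitch, frame + row * width * 4, width * 4);
            ctx->Unmap(texture, 0);
        }
    }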
Use an immutable resource and create/discard
In that case you upload in the image-receiving thread by creating a new resource, pushing the data, and discarding the old resource.
The advantage is that the upload should be stall-free (no need for the immediate context; you can keep running your command lists while the texture uploads), and the resource can be used with a simple trigger once available (to swap the SRV).
The drawback is that you can fragment memory over time by constantly allocating and freeing resources (at 30 fps for a standard camera, for example).
Also, you have to deal with throttling yourself (but that part is not a big deal).
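A sketch of this second path, assuming the capture thread holds a pointer to the device (D3D11 device methods are free-threaded, unlike the immediate context) and that frames are BGRA8; the published SRV and mutex are placeholder names:

    #include <d3d11.h>
    #include <wrl/client.h>
    #include <mutex>

    using Microsoft::WRL::ComPtr;

    std::mutex srvMutex;
    ComPtr<ID3D11ShaderResourceView> currentSrv;       // read by the render thread each frame

    // Called on the capture thread for each new frame.
    void PublishFrame(ID3D11Device* device, const void* frameData, UINT width, UINT height)
    {
        D3D11_TEXTURE2D_DESC desc = {};
        desc.Width = width;
        desc.Height = height;
        desc.MipLevels = 1;
        desc.ArraySize = 1;
        desc.Format = DXGI_FORMAT_B8G8R8A8_UNORM;
        desc.SampleDesc.Count = 1;
        desc.Usage = D3D11_USAGE_IMMUTABLE;             // data is provided once at creation
        desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;

        D3D11_SUBRESOURCE_DATA init = {};
        init.pSysMem = frameData;
        init.SysMemPitch = width * 4;

        ComPtr<ID3D11Texture2D> texture;
        ComPtr<ID3D11ShaderResourceView> srv;
        if (SUCCEEDED(device->CreateTexture2D(&desc, &init, &texture)) &&
            SUCCEEDED(device->CreateShaderResourceView(texture.Get(), nullptr, &srv)))
        {
            // Swap in the new SRV; the old texture is released here, but D3D11 keeps it
            // alive until any in-flight GPU work that references it has finished.
            std::lock_guard<std::mutex> lock(srvMutex);
            currentSrv = srv;
        }
    }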
So is there something I missed in those techniques, or is there an even better way to handle this?
These are the two main methods of updating textures in D3D11.
However, the assumption that the first method will not result in memory usage patterns identical to the second case depends on the driver, and is likely not true. You would use D3D11_MAP_WRITE_DISCARD if you are overwriting the whole image (which sounds like what you are doing), meaning that the current contents of the buffer become undefined. However, this is only true from the CPU's point of view. They are retained for the GPU if they are potentially used in a pending draw operation. Most (maybe all?) drivers will actually allocate new storage for the write location of the mapped texture in this case; otherwise command-buffer processing would need to stall. The same holds if you do not use the discard flag: instead of discarding the old contents, when the map command is processed in the command buffer, the resource's storage is updated from the memory that Map returned in D3D11_MAPPED_SUBRESOURCE.
Also, it is not true that you must update dynamic textures on the immediate context, only that if you update them on a deferred context, you must use the D3D11_MAP_WRITE_DISCARD flag. This means you could update the texture on a worker thread, provided you are overwriting the entire texture.
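For example (a sketch, assuming a deferred context created up front and a dynamic texture like the one above), the worker thread can record the upload and the render thread can replay it:

    #include <d3d11.h>

    // Worker thread: record the upload on a deferred context (discard is mandatory there).
    ID3D11CommandList* RecordUpload(ID3D11DeviceContext* deferredCtx, ID3D11Texture2D* texture)
    {
        D3D11_MAPPED_SUBRESOURCE mapped;
        if (SUCCEEDED(deferredCtx->Map(texture, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
        {
            // ... copy the new frame into mapped.pData, honoring mapped.RowPitch ...
            deferredCtx->Unmap(texture, 0);
        }
        ID3D11CommandList* commandList = nullptr;
        deferredCtx->FinishCommandList(FALSE, &commandList);
        return commandList;
    }

    // Render thread: replay the recorded update on the immediate context.
    void SubmitUpload(ID3D11DeviceContext* immediateCtx, ID3D11CommandList* commandList)
    {
        immediateCtx->ExecuteCommandList(commandList, TRUE);
        commandList->Release();
    }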
The bottom line is that, since the CPU/GPU system on a PC is not a unified memory system, there will be synchronization issues when updating GPU resources from the CPU.
Is it possible to have a Metal compute function that processes a texture in-place on iOS? I have noticed that some MPS image filters support in-place processing, and was wondering if there is a way to accomplish this with custom kernels.
Specifically, I am looking to combine two textures into one using a blend function. I can easily do this by making the first texture a render target and using a shader to write the second one on top, but it feels like overkill since both textures are the same size.
Yes, you can take a texture parameter with the access::read_write attribute, and read and write it within the same kernel function invocation. You'll need to ensure that the texture is created with both the .shaderRead and .shaderWrite usage flags. Additionally, note that writes are not guaranteed to be seen by subsequent reads from the same thread unless you call the texture's fence() function after the write.
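A minimal kernel sketch of that approach in the Metal Shading Language (the 50/50 blend factor, function name, and texture indices are arbitrary; the texture at index 0 must have been created with both usage flags as noted above):

    #include <metal_stdlib>
    using namespace metal;

    // Blends 'source' into 'target' in place; 'target' is declared access::read_write.
    kernel void blend_in_place(texture2d<float, access::read_write> target [[texture(0)]],
                               texture2d<float, access::read>       source [[texture(1)]],
                               uint2 gid [[thread_position_in_grid]])
    {
        if (gid.x >= target.get_width() || gid.y >= target.get_height())
            return;
        float4 dst = target.read(gid);
        float4 src = source.read(gid);
        target.write(dst * 0.5f + src * 0.5f, gid);   // simple 50/50 blend as an example
        // If this thread needed to read back the texel it just wrote, it would call
        // target.fence() after the write.
    }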
By the way, MetalPerformanceShaders kernels that are able to operate "in-place" don't necessarily use read_write textures; it's often the case that they use auxiliary textures and buffers and do their work across multiple passes. Per the documentation, any kernel can fail to operate in-place for any number of reasons, so you should always provide a fallback allocator to handle such cases.
The documentation for setVertexBytes says:
Use this method for single-use data smaller than 4 KB. Create a MTLBuffer object if your data exceeds 4 KB in length or persists for multiple uses.
What exactly does single-use mean?
For example, if I have a uniforms struct that is less than 4 KB (and is updated every frame), is it better to use a triple-buffering technique or simply use setVertexBytes?
From what I understand, using setVertexBytes would copy the data every time into an MTLBuffer that Metal manages. This sounds slower than using triple buffering.
But then if I have different objects, each with its own uniforms, I would have to triple buffer everything, since it's dynamically updated.
And if I have a material that updates rarely but is passed to the shader every frame, would it be better to keep it in a buffer or pass it as a pointer using setVertexBytes?
It's not necessarily the case that Metal manages a distinct resource into which this data is written. As user Columbo notes in their comment, some hardware allows constant data to be recorded directly into command buffer memory, from which it can be subsequently read by a shader.
As always, you should profile in order to find the difference between the two approaches on your target hardware, but if the amount of data you're pushing per draw call is small, you might very well find that using setVertexBytes:... is faster than writing into a buffer and calling setVertexBuffer:....
For data that doesn't vary every frame (your slow-varying material use case), it may indeed be more efficient to keep that data in a buffer (double- or triple-buffered) rather than using setVertexBytes:....
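For illustration, the two approaches might look like this (written against the metal-cpp bindings to keep these examples in one language; the Uniforms struct, buffer index, 256-byte slot stride, and the assumption that the ring buffer was allocated with shared storage of at least 3 * 256 bytes are all placeholders):

    #include <Metal/Metal.hpp>   // metal-cpp
    #include <cstdint>
    #include <cstring>

    struct Uniforms { float mvp[16]; };               // hypothetical per-draw data, well under 4 KB
    constexpr uint32_t kSlotStride = 256;              // conservative per-slot alignment
    constexpr uint32_t kFramesInFlight = 3;

    // Option 1: small, single-use data. Let Metal copy it for this draw.
    void encodeWithBytes(MTL::RenderCommandEncoder* enc, const Uniforms& u)
    {
        enc->setVertexBytes(&u, sizeof(u), /*index*/ 1);
    }

    // Option 2: a triple-buffered MTLBuffer the CPU rewrites each frame.
    void encodeWithBuffer(MTL::RenderCommandEncoder* enc, MTL::Buffer* ring,
                          uint32_t frameIndex, const Uniforms& u)
    {
        const uint32_t offset = (frameIndex % kFramesInFlight) * kSlotStride;
        std::memcpy(static_cast<uint8_t*>(ring->contents()) + offset, &u, sizeof(u));
        enc->setVertexBuffer(ring, offset, /*index*/ 1);
    }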
In the Metal Best Practices Guide, it states that for best performance one should "implement a triple buffering model to update dynamic buffer data," and that "dynamic buffer data refers to frequently updated data stored in a buffer."
Does an MTLTexture qualify as "frequently updated data stored in a buffer" if it needs to be updated every frame? All the examples in the guide above focus on MTLBuffers.
I notice Apple's implementation in MetalKit has a concept of a nextDrawable, so perhaps that's what's happening here?
If a command could be in flight and it could access (read/sample/write) the texture while you're modifying that same texture on the CPU (e.g. using one of the -replaceRegion:... methods or by writing to a backing IOSurface), then you will need a multi-buffering technique, yes.
If you're only modifying the texture on the GPU (by rendering to it, writing to it from a shader function, or using blit command encoder methods to copy to it), then you don't need multi-buffering. You may need to use a texture fence within the shader function or you may need to call -textureBarrier on the render command encoder between draw calls, depending on exactly what you're doing.
Yes, nextDrawable provides a form of multi-buffering. In this case, it's not due to CPU access, though. You're going to render to one texture while the previously-rendered texture may still be on its way to the screen. You don't want to use the same texture for both because the new rendering could overdraw the texture just before it's put on screen, thus showing corrupt results.
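As a rough sketch of the CPU-update case (metal-cpp again; the pool size of three, the global state, and the elided pixel copy are all simplifications), the usual pattern is to gate reuse of each texture with a semaphore signalled from a command-buffer completion handler:

    #include <Metal/Metal.hpp>
    #include <dispatch/dispatch.h>
    #include <cstdint>

    constexpr int kMaxFramesInFlight = 3;
    MTL::Texture* texturePool[kMaxFramesInFlight];     // created once with identical descriptors
    dispatch_semaphore_t inFlight = dispatch_semaphore_create(kMaxFramesInFlight);
    uint32_t frameIndex = 0;

    void drawFrame(MTL::CommandQueue* queue, const void* newPixels)
    {
        // Block until one of the pooled textures is no longer referenced by in-flight GPU work.
        dispatch_semaphore_wait(inFlight, DISPATCH_TIME_FOREVER);
        MTL::Texture* tex = texturePool[frameIndex % kMaxFramesInFlight];

        // ... update 'tex' on the CPU here with newPixels (e.g. replaceRegion or its IOSurface) ...

        MTL::CommandBuffer* cmd = queue->commandBuffer();
        // ... encode draws that sample 'tex' ...
        cmd->addCompletedHandler([](MTL::CommandBuffer*) {
            dispatch_semaphore_signal(inFlight);       // this texture slot is free again
        });
        cmd->commit();
        ++frameIndex;
    }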
I'm currently writing my own graphics framework for DirectX12 (I've already written several DirectX 11 frameworks for personal game engines), and I'm currently trying to copy the methods used in the recent Hitman game for resource binding.
I'm confused about the best way to handle per-object resource binding for the SRV/CBV/UAV heap. I've watched several GDC presentations, and they all seem to gloss over this.
Only 1 SRV/CBV/UAV heap can be bound at a time, and switching the currently-bound heap in the middle of a command list can be bad for performance on some hardware by forcing a flush. Because of this, what is the best way to handle updating the heap with new descriptors? To me, it seems like each command list would:
Get a hold of a SRV/CBV/UAV heap for itself.
For each object in a subset of objects, create descriptors on the heap pointing to per-object data that was placed into a separate upload heap.
Afterwards, another command list takes this filled descriptor heap and binds it, then issues draw calls mixed with SetGraphicsRootDescriptorTable in order to move through the current descriptor heap.
This being said, several sources online (including another SO post) suggest using one large SRV/CBV/UAV heap and copying into it using CPU-visible heaps. I'm assuming they're not attempting to use the asynchronous CopyDescriptors, but rather CopyBufferRegion. I tried using CopyBufferRegion to update data per-object, but to me this seems under-performant with so many transitions between D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER and D3D12_RESOURCE_STATE_COPY_DEST. Am I misunderstanding something? Any clarity would be appreciated.
CopyDescriptors is not asynchronous; it is an immediate CPU operation. For a volatile descriptor it can happen at any time before the command list is executed (even after the command-list operation that uses it has been recorded), while for a static descriptor (root signature 1.1) it has to be ready at the point of use.
The usual approach is to have a large descriptor heap: keep a portion for static descriptors, then use the rest as a ring buffer, allocating descriptor-table offsets on demand and copying the needed descriptors into them for each draw/compute operation.
CopyBufferRegion has nothing to do with this. Remember that mapping buffers is also an immediate operation, so you likewise ring-buffer a big chunk of memory for your per-object constant buffers and cycle through it. The only thing you need to make sure of is that you do not overwrite memory or descriptors while they may still be in use, so you have to fence to prevent that case.
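A rough sketch of that ring-buffer pattern (the heap size, descriptor counts, root-parameter index, and the fence bookkeeping are simplified assumptions, not a complete allocator):

    #include <d3d12.h>
    #include <cstdint>

    // One large shader-visible heap: the first part reserved for static descriptors,
    // the rest treated as a ring buffer.
    ID3D12DescriptorHeap* CreateGpuHeap(ID3D12Device* device, UINT numDescriptors)
    {
        D3D12_DESCRIPTOR_HEAP_DESC desc = {};
        desc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
        desc.NumDescriptors = numDescriptors;          // size is an assumption
        desc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
        ID3D12DescriptorHeap* heap = nullptr;
        device->CreateDescriptorHeap(&desc, IID_PPV_ARGS(&heap));
        return heap;
    }

    // Per draw: allocate a range from the ring, copy the object's descriptors in from a
    // non-shader-visible staging heap, and point the root descriptor table at it.
    void BindObjectDescriptors(ID3D12Device* device, ID3D12GraphicsCommandList* cmdList,
                               ID3D12DescriptorHeap* gpuHeap, UINT& ringCursor,
                               D3D12_CPU_DESCRIPTOR_HANDLE objectDescriptorsCpuStart,
                               UINT descriptorsPerObject)
    {
        const UINT increment =
            device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
        const UINT base = ringCursor;       // wrap-around and the "has the GPU finished with
        ringCursor += descriptorsPerObject; // this range?" fence check are omitted for brevity

        D3D12_CPU_DESCRIPTOR_HANDLE dst = gpuHeap->GetCPUDescriptorHandleForHeapStart();
        dst.ptr += SIZE_T(base) * increment;
        device->CopyDescriptorsSimple(descriptorsPerObject, dst, objectDescriptorsCpuStart,
                                      D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);

        D3D12_GPU_DESCRIPTOR_HANDLE table = gpuHeap->GetGPUDescriptorHandleForHeapStart();
        table.ptr += UINT64(base) * increment;
        cmdList->SetDescriptorHeaps(1, &gpuHeap);      // in practice, set once per command list
        cmdList->SetGraphicsRootDescriptorTable(0, table);
        // ... issue the draw ...
    }
    // After ExecuteCommandLists, signal a fence on the queue; ring entries (and the matching
    // per-object constant-buffer ring memory) are recycled only once that fence value completes.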
I'm experimenting with gstreamer on an embedded system and I'm wondering if there is a way to determine the maximum amount of memory gstreamer will use. If I have a simple source -> filter -> filter -> sink pipeline, can I figure out how many buffers each stage will allocate and what their maximum size would be?
My understanding is that I can't limit the memory usage, but I would at least like to understand the worst-case scenario. Is this possible, or is it too dependent on run-time conditions and/or data content? I'm also new to gstreamer, so please let me know if there is something I could add to the pipeline to make it more deterministic.
Thanks!
With gstreamer-0.10 you can use gst-tracelib (http://cgit.freedesktop.org/~ensonic/gst-tracelib/) to get, e.g., peak memory consumption and various data-flow-related statistics. Normally, elements don't keep copies of buffers around. Exceptions are, e.g., queue-like elements and codecs (which need to keep reference buffers). Many elements try to work in-place, that is, they don't allocate new buffers but rather change the buffer they received and pass it on (e.g. volume).