If I am generating 0-12 triangles in a compute shader, is there a way I can stream them to a buffer that will then be used for rendering to screen?
My current strategy is:
create a buffer of float3 of size threads * 12, so can store the maximum possible number of triangles;
write to the buffer using an index that depends on the thread position in the grid, so there are no race conditions.
If I want to render from this though, I would need to skip the empty memory. It sounds ugly, but probably there is no other way currently. I know CUDA geometry shaders can have variable length output, but I wonder if/how games on iOS can generate variable-length data on GPU.
UPDATE 1:
As soon as I wrote the question, I thought about the possibility of using a second buffer that would point out how many triangles are available for each block. The vertex shader would then process all vertices of all triangles of that block.
This will not solve the problem of the unused memory though and as I have a big number of threads, the total memory wasted would be considerable.
What you're looking for is the Metal equivalent of D3D's "AppendStructuredBuffer". You want a type that can have structures added to it atomically.
I'm not familiar with Metal, but it does support Atomic operations such as 'Add' which is all you really need to roll your own Append Buffer. Initialise the counter to 0 and have each thread add '1' to the counter and use the original value as the index to write to in your buffer.
Related
The documentation for setVertexBytes says:
Use this method for single-use data smaller than 4 KB. Create a MTLBuffer object if your data exceeds 4 KB in length or persists for multiple uses.
What exactly does single-use mean?
For example, if I have a uniforms struct which is less than 4KB(and is updated every frame), is it better to use a triple buffer technique or simply use setVertexBytes?
From what I understand using setVertexBytes would copy the data every time into a MTLBuffer that Metal manages. This sounds slower than using triple buffering.
But then if I have different objects, each with its own uniforms, I would have to triple buffer everything, since it's dynamically updated.
And if I have a material that updates rarely but is passed to the shader every frame, would it be better to keep it in a buffer or pass it as a pointer using setVertexBytes?
It's not necessarily the case that Metal manages a distinct resource into which this data is written. As user Columbo notes in their comment, some hardware allows constant data to be recorded directly into command buffer memory, from which it can be subsequently read by a shader.
As always, you should profile in order to find the difference between the two approaches on your target hardware, but if the amount of data you're pushing per draw call is small, you might very well find that using setVertexBytes:... is faster than writing into a buffer and calling setVertexBuffer:....
For data that doesn't vary every frame (your slow-varying material use case), it may indeed be more efficient to keep that data in a buffer (double- or triple-buffered) rather than using setVertexBytes:....
I am trying to move from OpenGL to Metal for my iOS apps. In my OpenGL code I use glColorMask (if I want to write only to selected channels, for example only to alpha channel of a texture) in many places.
In Metal, for render pipeline (though vertex and fragment shader) seems like MTLColorWriteMask is the equivalent of glColorMask. I can setup it up while creating a MTLRenderPipelineState through the MTLRenderPipelineDescriptor.
But I could not find a similar option for compute pipeline (through kernel function). I always need to write all the channels (red, green, blue and alpha) every time I write to an output texture. What if I want to preserve the alpha (or any other channel) and only want to modify the color channels? I can create a copy of the output texture and use it as one of the inputs and read alpha from it to preserve the values but that is expensive.
Computer memory architectures don't like writing only some bytes of data. A write to 1 out of 4 bytes usually involves reading those four bytes into the cache, modifying one of them in the cache, and then writing those four bytes back out into memory. Well, most computers read/write a lot more than 4 bytes at a time, but you get the idea.
This happens with framebuffers too. If you do a partial write mask, the hardware is still going to be doing the equivalent of a read/modify/write on that texture. It's just not changing all of the bytes its reads.
So you can do the same thing from your compute shader. Read the 4-vector value, modify the channels you want, and then write it back out. As long as the read and write are from the same shader invocation, there should be no synchronization problems (assuming that no other invocations are trying to read/write to that same location, but if that were the case, you'd have problems anyway).
I'm looking at getting a program written for DirectX11 to play nice on DirectX10. To do that, I need to compile the shaders for model 4, not 5. Right now the only problem with that is that the geometry shaders use instancing which is unsupported by 4. The general model is
[instance(NUM_INSTANCES)]
void Gs(..., in uint instanceId : SV_GSInstanceID) { }
I can't seem to find many documents on why this exists, because my thought is: can't I just replace this with a loop from instanceId=0 to instanceId=NUM_INSTANCES-1?
The answer seems to be no, as it doesn't seems to output correctly, but besides my exact problem - can you help me understand why the concept of instancing exists. Is there some implication on the entire pipeline that instancing has beyond simply calling the main function twice with a different index?
With regards to why my replacement did not work:
Geometry shaders are annotated with [maxvertexcount(N)]. I had incorrectly assumed this was the vertex input count, and ignored it. In fact, input is determined by the type of primitive coming in, and so this was about the output. Before, if N was my output over I instances, each instance output N vertices. But now that I want to use a loop, a single instance outputs N*I vertices. As such, the answer was to do as I suggested, and also use [maxvertexcount(N*NUM_INSTANCES)].
To more broadly answer my question on why instances may be useful in a world that already has loops, I can only guess
Loops are not truly supported in shaders, it turns out - graphics card cores do not have a concept of control flow. When loops are written in shaders, the loop is unrolled (see [unroll]). This has limitations, makes compilation slower, and makes the shader blob bigger.
Instances can be parallelized - one GPU core can run one instance of a shader while another runs the next instance of the same shader with the same input.
Is there any scheme using WebGL which allows to process one data record to an previously unknown number of records?
Using OpenGL for example, a geometry program can be used to multiply vertices depending on their attributes, and thus output data of unknown length.
Is there any trick to use WebGL in a likewise fashion, or is this only possible on the JavaScript side?
Yup, there is no Geometry Shader in WebGL (just Vertex and Fragment).
So, yes, something multiplicative needs to be implemented on the JS side, by making more data or more calls to gl.drawTriangles/gl.drawElements.
One approach that might be applicable, is to have lots of data (triangles, say), and use the Fragment Shader to algorithmicly throw-away some or all of them. Kind of the opposite of multiplying. But if you keep the same triangles, and change their processing with uniforms, or perhaps smaller data in textures, you can at least save the hit of sending up lots of different data.
To "Throw away" a vertex, need to put it outside the NDC (the -1 to +1 unit cube), for all three vertices of a triangle.
I'm trying to implement one complex algorithm using GPU. The only problem is HW limitations and maximum available feature level is 9_3.
Algorithm is basically "stereo matching"-like algorithm for two images. Because of mentioned limitations all calculations has to be performed in Vertex/Pixel shaders only (there is no computation API available). Vertex shaders are rather useless here so I considered them as pass-through vertex shaders.
Let me shortly describe the algorithm:
Take two images and calculate cost volume maps (basically conterting RGB to Grayscale -> translate right image by D and subtract it from the left image). This step is repeated around 20 times for different D which generates Texture3D.
Problem here: I cannot simply create one Pixel Shader which calculates
those 20 repetitions in one go because of size limitation of Pixel
Shader (max. 512 arithmetics), so I'm forced to call Draw() in a loop
in C++ which unnecessary involves CPU while all operations are done on
the same two images - it seems to me like I have one bottleneck here. I know that there are multiple render targets but: there are max. 8 targets (I need 20+), if I want to generate 8 results in one pixel shader I exceed it's size limit (512 arithmetic for my HW).
Then I need to calculate for each of calculated textures box filter with windows where r > 9.
Another problem here: Because window is so big I need to split box filtering into two Pixel Shaders (vertical and horizontal direction separately) because loops unrolling stage results with very long code. Manual implementation of those loops won't help cuz still it would create to big pixel shader. So another bottleneck here - CPU needs to be involved to pass results from temp texture (result of V pass) to the second pass (H pass).
Then in next step some arithmetic operations are applied for each pair of results from 1st step and 2nd step.
I haven't reach yet here with my development so no idea what kind of bottlenecks are waiting for me here.
Then minimal D (value of parameter from 1st step) is taken for each pixel based on pixel value from step 3.
... same as in step 3.
Here basically is VERY simple graph showing my current implementation (excluding steps 3 and 4).
Red dots/circles/whatever are temporary buffers (textures) where partial results are stored and at every red dot CPU is getting involved.
Question 1: Isn't it possible somehow to let GPU know how to perform each branch form up to the bottom without involving CPU and leading to bottleneck? I.e. to program sequence of graphics pipelines in one go and then let the GPU do it's job.
One additional question about render-to-texture thing: Does all textures resides in GPU memory all the time even between Draw() method calls and Pixel/Vertex shaders switching? Or there is any transfer from GPU to CPU happening... Cuz this may be another issue here which leads to bottleneck.
Any help would be appreciated!
Thank you in advance.
Best regards,
Lukasz
Writing computational algorithms in pixel shaders can be very difficult. Writing such algorithms for 9_3 target can be impossible. Too much restrictions. But, well, I think I know how to workaround your problems.
1. Shader repetition
First of all, it is unclear, what do you call "bottleneck" here. Yes, theoretically, draw calls in for loop is a performance loss. But does it bottleneck? Does your application really looses performance here? How much? Only profilers (CPU and GPU) can answer. But to run it, you must first complete your algorithm (stages 3 and 4). So, I'd better stick with current solution, and started to implement whole algorithm, then profile and than fix performance issues.
But, if you feel ready to tweaks... Common "repetition" technology is instancing. You can create one more vertex buffer (called instance buffer), which will contains parameters not for each vertex, but for one draw instance. Then you do all the stuff with one DrawInstanced() call.
For you first stage, instance buffer can contain your D value and index of target Texture3D layer. You can pass-through them from vertex shader.
As always, you have a tradeof here: simplicity of code to (probably) performance.
2. Multi-pass rendering
CPU needs to be involved to pass results from temp texture (result of
V pass) to the second pass (H pass)
Typically, you do chaining like this, so no CPU involved:
// Pass 1: from pTexture0 to pTexture1
// ...set up pipeline state for Pass1 here...
pContext->PSSetShaderResources(slot, 1, pTexture0); // source
pContext->OMSetRenderTargets(1, pTexture1, 0); // target
pContext->Draw(...);
// Pass 2: from pTexture1 to pTexture2
// ...set up pipeline state for Pass1 here...
pContext->PSSetShaderResources(slot, 1, pTexture1); // previous target is now source
pContext->OMSetRenderTargets(1, pTexture2, 0);
pContext->Draw(...);
// Pass 3: ...
Note, that pTexture1 must have both D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_RENDER_TARGET flags. You can have multiple input textures and multiple render targets. Just make sure, that every next pass knows what previous pass outputs.
And if previous pass uses more resources than current, don't forget to unbind unneeded, to prevent hard-to-find errors:
pContext->PSSetShaderResources(2, 1, 0);
pContext->PSSetShaderResources(3, 1, 0);
pContext->PSSetShaderResources(4, 1, 0);
// Only 0 and 1 texture slots will be used
3. Resource data location
Does all textures resides in GPU memory all the time even between
Draw() method calls and Pixel/Vertex shaders switching?
We can never know that. Driver chooses appropriate location for resources. But if you have resources created with DEFAULT usage and 0 CPU access flag, you can be almost sure it will always be in video memory.
Hope it helps. Happy coding!