Performance of dynamic constant buffer indexing in pixel shaders - directx

I have a pixel shader, written in HLSL, that declares the following constant buffer:
cbuffer RenderParametersData : register(b2)
{
float4 LineColor[16];
};
In one of the shader functions, I look up the output color based on the index "color" (which is not really a color, just a convenient place to put the index into the array of LineColors):
output.Color = LineColor[input.Color.b * 255];
This results in a dramatic increase in the number of instruction slots in the resulting assembly code. Keeping everything else constant, a constant array lookup - output.Color = LineColor[0]; - takes 10 arithmetic operations, while the dynamic lookup takes 37. Almost all of the additional operations look like this:
cmp r2, -r1.x, c0, r0.w
cmp r2, -r1.y, c1, r2
cmp r2, -r1.z, c2, r2
cmp r1, -r1.w, c3, r2
Where c increases up to c15, one cmp per element of LineColor. Resizing LineColor to 8 elements produced the same comparison chain, but with c only going up to c7, again one register per array element. Going back to a constant lookup, the number of operations dropped back to 10.
So it seems that dynamic constant buffer array lookup carries a pretty significant additional cost, adding one comparison instruction per element in the array, plus some overhead. I am genuinely surprised at how expensive this array lookup is, and given that my array size will soon increase by an order of magnitude, this will push me over the 64 arithmetic instruction limit.
Is this the expected behavior? Am I doing something wrong here, or is this a necessary consequence of dynamic array indexing?
Thanks!
EDIT: Just to add some additional detail, the effect I'm after is to color some quads based on data from the vertex shader and texture coordinates. I would do the work in the vertex shader, but interpolation of the texture coordinates has to occur first.
EDIT2: I've resolved this. I was specifying to FXC that my target is ps_4_0_level_9_1, which results in it generating assembly for both Shader Model 2.0 and 4.0. I discovered that the additional comparison-per-element problem only occurs in the Shader Model 2.0 assembly code. Switching the compiler target to ps_4_0 results in getting only the Shader Model 4.0 code, and since I'm not constrained to level 9_1, things are now working well.

I resolved this by specifying that shader model 2.0 assembly should not be generated by the compiler. More details at the end of the question.
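For reference, the fixed compile invocation looks roughly like this (the file and entry-point names are placeholders); with /T ps_4_0_level_9_1 FXC emits both the Shader Model 2.0 and 4.0 code paths, while /T ps_4_0 emits only the 4.0 path:
fxc /T ps_4_0 /E MainPS MyPixelShader.hlsl /Fo MyPixelShader.cso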

Related

What exactly is a constant buffer (cbuffer) used for in hlsl?

Currently I have this code in my vertex shader class:
cbuffer MatrixBuffer
{
    matrix worldMatrix;
    matrix viewMatrix;
    matrix projectionMatrix;
};
I don't know why I need to wrap those variables in a cbuffer. If I delete the buffer, my code works as well. I would really appreciate it if someone could give me a brief explanation of why cbuffers are necessary.
The reason it works either way is due to the legacy way constants were handled in Direct3D 8/Direct3D 9. Back then, there was only a single shared array of constants for the entire shader (one for the VS and one for the PS). This meant you had to update the constant array every single time you called Draw.
In Direct3D 10, constants were reorganized into one or more Constant Buffers to make it easier to update some constants while leaving others alone, and thus sending less data to the GPU.
See the classic presentation Windows to Reality: Getting the Most out of Direct3D 10 Graphics in Your Games for a lot of details on the impacts of constant update.
The upshot here is that if you don't specify cbuffer, all the constants get put into a single implicit constant buffer bound to register b0 to emulate the old 'one constants array' behavior.
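As a hedged illustration of why explicit cbuffers help (the buffer names and register assignments below are just an example), the same constants can be grouped by how often they change, so that updating the world matrix per object no longer re-uploads the view and projection matrices:
// Updated once per frame
cbuffer PerFrame : register(b0)
{
    matrix viewMatrix;
    matrix projectionMatrix;
};
// Updated once per object/draw
cbuffer PerObject : register(b1)
{
    matrix worldMatrix;
};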
There are compiler flags to control the acceptance of legacy constructs: /Gec for backwards compatibility mode to support old Direct3D 8/9 intrinsics, and /Ges to enable a more strict compilation to weed out older constructs. That said, the HLSL compiler will pretty much always accept global constants without cbuffer and stick them into a single implicit constant buffer because this pattern is extremely common in shader code.

how to access the neighbor unit in metal kernel function correctly

I used Metal to do an interpolation task. I wrote the kernel function as follows:
kernel void kf_interpolation( device short *dst, device uchar *src, uint id [[ thread_position_in_grid ]] )
{
dst[id] = src[id-1] + src[id] + src[id+1];
}
That kernel function did not give the expected values, and I found the cause was that src[id-1] was always 0, which is the wrong value. However, src[id+1] contained the right value. The question is: how can I access the neighbouring element, e.g. src[id-1], correctly in kernel functions? Thanks in advance.
The most efficient way to handle edge cases like this is usually to grow your source array at each end and offset the indices. So for N calculations, allocate your src array with N+2 elements, fill elements 1 through N (inclusive) with the source data, and set elements 0 and N+1 to whatever you want the edge condition to be.
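A minimal sketch of the kernel under that layout (the buffer indices and the +1 offset are assumptions added on top of the original code):
kernel void kf_interpolation( device short *dst [[ buffer(0) ]], device const uchar *src [[ buffer(1) ]], uint id [[ thread_position_in_grid ]] )
{
    // src was allocated with N+2 elements; the real samples live in src[1..N]
    uint i = id + 1;
    dst[id] = src[i - 1] + src[i] + src[i + 1];
}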
An even more efficient method would be to use MTLTextures instead of MTLBuffers. Texture sampling goes through a sampler whose addressing mode causes the hardware to automatically substitute either zero or the nearest valid texel when you read off the edge of a texture. Textures can also do linear interpolation in hardware for free, which can be of great help for resampling, assuming bilinear interpolation is good enough for you. If not, I recommend looking at MPSImageLanczosScale as an alternative.
You can make a MTLTexture from a MTLBuffer. The two will alias the same pixel data.

iOS Import .obj file to Model I/O without duplicating vertices

I'm trying to import a .obj file to use in SceneKit via the Model I/O framework. I initially used the simple MDLAsset initWithURL: initializer, but after transferring the mesh to an SCNGeometry, I realized this function was triangulating the mesh, such that each face had 3 unique vertices, and there were separate vertices at the same location for bordering faces. This was causing some major problems with my other functions, so I tried to fix it by instead using the MDLAsset initWithURL:vertexDescriptor:bufferAllocator:preserveTopology initializer with preserveTopology set to YES and the descriptor/allocator left at the default with nil. Preserving topology fixed my problem of duplicated vertices, so the faces/edges were all good, but in the process I lost the normals data.
By 'lost the normals' I don't mean multiple indexing; I mean that after setting preserveTopology to YES, the buffer did not contain any normal values at all. Whereas before it was v1/n1/v2/n2... with a stride of 24 bytes (3 components * 4 bytes/float * 2 attributes), now the first half of the buffer is v1/v2/... with a stride of 12, and the entire second half of the buffer is just 0.0 floats.
Also, something weird with this: when you look at the SCNGeometrySources of the geometry, there are 2 sources, one with semantic kGeometrySourceSemanticVertex and one with semantic kGeometrySourceSemanticNormal. You would think that the vertex source would contain the position data and the normal source would contain the normal data, but that is not the case. No matter what you set preserveTopology to, both are buffers sized to contain position and normal data, with identical values. So when I said before that there was no normal data, I mean that both of these buffers, semantic vertex AND semantic normal, went from being v1/n1/v2/n2... to v1/v2/.../(0.0, 0.0, 0.0)/(0.0, 0.0, 0.0)/... I went into the MDLMesh's buffer (before the transfer to SceneKit) and found the same problem, so the problem must be with initWithURL, not with the Model I/O-to-SceneKit bridge.
So I figured there must be something wrong with the default vertex descriptor and buffer allocator (since I was using nil) and went about trying to create my own that matched these 2 possible data formats. Alas after much trying I was unable to get something that worked.
Any ideas on how I should do this? How to give MDLAsset the proper vertexDescriptor and bufferAllocator (I feel like nil should be ok here) for importing a .obj file? Thanks
An obj file with vertices and normals has vertices, indicated by v lines, normals, indicated by vn lines, and faces, indicated by f lines.
The v and vn lines will just be the floating point values you expect, and the f line will be of the form -
f v0//n0 v1//n1 etc
Since OpenGL and Metal don't allow multiple indexing, you'll see the first effect you described: vertices being duplicated. For example,
f 0//0 1//2 2//0
can't work as a vertex buffer because it uses different indices for the position and the normal of the same vertex. So typical OBJ parsers have to create new vertices that allow the face to become
f 0//0 1//1 2//2
The preserve topology option doesn't help you. It preserves the connectivity and shape of the mesh (no triangulation occurs, shared edges remain shared) but it still enforces a single index per vertex component.
One solution would be to make sure that the tool outputting your OBJ files uses single indexing during export, if that is an option.
Another option, though it won't solve the problem immediately, would be to file a request that multiple indexing be supported at the Model I/O level. SceneKit would still have to uniquely index, because it has to be able to render.
Another option would be to use a format like PLY that doesn't have multiple indexing.

DirectX 11, Combining pixel shaders to prevent bottlenecks

I'm trying to implement one complex algorithm on the GPU. The only problem is hardware limitations: the maximum available feature level is 9_3.
The algorithm is basically a "stereo matching"-like algorithm for two images. Because of the mentioned limitations, all calculations have to be performed in vertex/pixel shaders only (there is no compute API available). Vertex shaders are rather useless here, so I treat them as pass-through vertex shaders.
Let me shortly describe the algorithm:
Take two images and calculate cost volume maps (basically converting RGB to grayscale -> translating the right image by D and subtracting it from the left image). This step is repeated around 20 times for different D, which generates a Texture3D.
Problem here: I cannot simply create one pixel shader which calculates those 20 repetitions in one go, because of the pixel shader size limitation (max. 512 arithmetic instructions), so I'm forced to call Draw() in a loop in C++, which unnecessarily involves the CPU while all operations are done on the same two images - it seems to me like I have one bottleneck here. I know that there are multiple render targets, but: there are max. 8 targets (I need 20+), and if I want to generate 8 results in one pixel shader I exceed its size limit (512 arithmetic instructions for my HW).
Then, for each of the calculated textures, I need to compute a box filter with a window where r > 9.
Another problem here: because the window is so big, I need to split the box filtering into two pixel shaders (vertical and horizontal directions separately), because the loop-unrolling stage produces very long code. Manually implementing those loops won't help, since it would still create too big a pixel shader. So there is another bottleneck here - the CPU needs to be involved to pass results from a temporary texture (the result of the V pass) to the second pass (the H pass).
Then, in the next step, some arithmetic operations are applied to each pair of results from the 1st and 2nd steps.
I haven't reached this point in my development yet, so I have no idea what kind of bottlenecks are waiting for me here.
Then the minimal D (the parameter value from the 1st step) is taken for each pixel, based on the pixel value from step 3.
... same as in step 3.
Here, basically, is a VERY simple graph showing my current implementation (excluding steps 3 and 4).
The red dots/circles are temporary buffers (textures) where partial results are stored, and at every red dot the CPU gets involved.
Question 1: Is there any way to let the GPU know how to perform each branch from top to bottom without involving the CPU and creating a bottleneck? I.e., to program the sequence of graphics pipelines in one go and then let the GPU do its job?
One additional question about the render-to-texture approach: do all textures reside in GPU memory all the time, even between Draw() calls and pixel/vertex shader switches? Or is there some transfer from GPU to CPU happening? This may be another issue here which leads to a bottleneck.
Any help would be appreciated!
Thank you in advance.
Best regards,
Lukasz
Writing computational algorithms in pixel shaders can be very difficult. Writing such algorithms for the 9_3 target can be impossible - there are too many restrictions. But, well, I think I know how to work around your problems.
1. Shader repetition
First of all, it is unclear what you call a "bottleneck" here. Yes, theoretically, issuing draw calls in a loop is a performance loss. But is it the bottleneck? Does your application really lose performance here? How much? Only profilers (CPU and GPU) can answer that. But to run them, you must first complete your algorithm (stages 3 and 4). So I would stick with the current solution, implement the whole algorithm, then profile and only then fix performance issues.
But if you feel ready for tweaks... The common "repetition" technique is instancing. You can create one more vertex buffer (called an instance buffer), which contains parameters not for each vertex, but for each draw instance. Then you do all the work with a single DrawInstanced() call.
For your first stage, the instance buffer can contain your D value and the index of the target Texture3D layer. You can pass them through from the vertex shader, roughly as sketched below.
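A minimal pass-through vertex shader sketch, assuming the per-instance D and layer index arrive as extra input semantics fed from the instance buffer (all names here are made up):
struct VSIn
{
    float4 pos   : POSITION;
    float2 uv    : TEXCOORD0;
    float  d     : TEXCOORD1; // per-instance disparity D
    float  layer : TEXCOORD2; // per-instance target layer index
};
struct VSOut
{
    float4 pos   : SV_Position;
    float2 uv    : TEXCOORD0;
    float  d     : TEXCOORD1;
    float  layer : TEXCOORD2;
};
VSOut VSMain(VSIn input)
{
    VSOut o;
    o.pos   = input.pos; // quad is already in clip space, just pass it through
    o.uv    = input.uv;
    o.d     = input.d;
    o.layer = input.layer;
    return o;
}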
As always, you have a tradeoff here: simplicity of code versus (probably) performance.
2. Multi-pass rendering
CPU needs to be involved to pass results from the temp texture (result of the V pass) to the second pass (H pass)
Typically, you do chaining like this, so no CPU involved:
// Pass 1: from pTexture0 to pTexture1
// ...set up pipeline state for pass 1 here...
pContext->PSSetShaderResources(0, 1, &pTexture0SRV);     // SRV of pTexture0 (source)
pContext->OMSetRenderTargets(1, &pTexture1RTV, nullptr); // RTV of pTexture1 (target)
pContext->Draw(...);
// Pass 2: from pTexture1 to pTexture2
// ...set up pipeline state for pass 2 here...
pContext->PSSetShaderResources(0, 1, &pTexture1SRV);     // previous target is now the source
pContext->OMSetRenderTargets(1, &pTexture2RTV, nullptr);
pContext->Draw(...);
// Pass 3: ...
Note that pTexture1 must have both the D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_RENDER_TARGET flags. You can have multiple input textures and multiple render targets. Just make sure that every pass knows what the previous pass outputs.
And if the previous pass uses more resources than the current one, don't forget to unbind the unneeded ones to prevent hard-to-find errors:
ID3D11ShaderResourceView* nullSRVs[3] = { nullptr, nullptr, nullptr };
pContext->PSSetShaderResources(2, 3, nullSRVs); // clear slots 2, 3 and 4
// Only texture slots 0 and 1 remain bound
3. Resource data location
Do all textures reside in GPU memory all the time, even between Draw() method calls and pixel/vertex shader switching?
We can never know that for sure; the driver chooses the appropriate location for resources. But if your resources are created with DEFAULT usage and a CPU access flag of 0, you can be almost certain they will always stay in video memory.
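For instance, an intermediate texture for the chaining above would typically be created like this (a sketch; the format, dimensions, and variable names are placeholders):
D3D11_TEXTURE2D_DESC desc = {};
desc.Width            = width;
desc.Height           = height;
desc.MipLevels        = 1;
desc.ArraySize        = 1;
desc.Format           = DXGI_FORMAT_R32_FLOAT;   // placeholder format
desc.SampleDesc.Count = 1;
desc.Usage            = D3D11_USAGE_DEFAULT;     // GPU-managed residency, no CPU mapping
desc.BindFlags        = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_RENDER_TARGET;
desc.CPUAccessFlags   = 0;                       // no CPU read/write path
HRESULT hr = pDevice->CreateTexture2D(&desc, nullptr, &pTexture1);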
Hope it helps. Happy coding!

HLSL: Index to unaligned/packed floats

I have a vertex shader (2.0) doing some instancing - each vertex specifies an index into an array.
If I have an array like this:
float instanceData[100];
The compiler allocates it 100 constant registers. Each constant register is a float4, so it's allocating 4 times as much space as is needed.
I need a way to make it allocate just 25 constant registers and store four values in each of them.
Ideally I'd like a method where it still looks like a float[] on both the CPU and GPU (right now I am calling EffectParameter.SetValue(Single[]); I'm using XNA). But manually packing and unpacking a float4[] is an option, too.
Also: what are the performance implications for doing this? Is it actually worth it? (For me, this will save about one batch in every four or five).
Would this help?
float4 packedInstanceData[25];
...
float data = packedInstanceData[index / 4][index % 4];
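If the index arrives as a float (the natural form for shader model 2.0, which has no native integer divide or modulo), a hedged variant of the same lookup could look like this (the helper name is made up):
float4 packedInstanceData[25];
float FetchInstanceData(float index)
{
    float reg  = floor(index / 4);          // which float4 register
    float comp = index - reg * 4;           // which component inside it (0..3)
    float4 v   = packedInstanceData[reg];   // relative addressing is allowed in vertex shaders
    // pick the component with a selection mask instead of a dynamic swizzle
    return dot(v, float4(comp == 0, comp == 1, comp == 2, comp == 3));
}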
