Direct3D 10 Hardware Instancing using Structured Buffers

I am trying to implement hardware instancing with Direct3D 10+ using Structured Buffers for the per-instance data, but I've not used them before.
I understand how to implement instancing when combining the per-vertex and per-instance data into a single structure in the Vertex Shader - i.e. you bind two vertex buffers to the input assembler and call the DrawIndexedInstanced function.
Can anyone tell me the procedure for binding the input assembler and making the draw call etc. when using Structured Buffers with hardware instancing? I can't seem to find a good example of it anywhere.
It's my understanding that Structured Buffers are bound as ShaderResourceViews; is this correct?

Yup, that's exactly right. Just don't put any per-instance vertex attributes in your vertex buffer or your input layout; instead, create a ShaderResourceView of the buffer and set it on the vertex shader. You can then use the SV_InstanceID semantic to query which instance you're on and fetch the relevant struct from your buffer.
StructuredBuffers are very similar to normal buffers. The only differences are that you specify the D3D11_RESOURCE_MISC_BUFFER_STRUCTURED flag on creation, fill in StructureByteStride, and when you create a ShaderResourceView the Format is DXGI_FORMAT_UNKNOWN (the format is specified implicitly by the struct in your shader).
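For concreteness, here is a minimal C++ sketch of that creation path, assuming a hypothetical per-instance struct InstanceData and using the D3D11 API names from above (error handling omitted):

#include <d3d11.h>

struct InstanceData { float world[16]; }; // hypothetical per-instance payload

ID3D11ShaderResourceView* CreateInstanceSRV(ID3D11Device* device,
                                            const InstanceData* instances,
                                            UINT instanceCount)
{
    // Structured buffer: no format, just an element stride and the misc flag.
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeof(InstanceData) * instanceCount;
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(InstanceData);

    D3D11_SUBRESOURCE_DATA init = {};
    init.pSysMem = instances;

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, &init, &buffer);

    // The SRV uses DXGI_FORMAT_UNKNOWN; the layout comes from the shader struct.
    D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
    srvDesc.Format = DXGI_FORMAT_UNKNOWN;
    srvDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
    srvDesc.Buffer.FirstElement = 0;
    srvDesc.Buffer.NumElements = instanceCount;

    ID3D11ShaderResourceView* srv = nullptr;
    device->CreateShaderResourceView(buffer, &srvDesc, &srv);
    buffer->Release(); // the view holds its own reference
    return srv;
}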
StructuredBuffer<MyStruct> myInstanceData : register(t0);
is the syntax in HLSL for a StructuredBuffer and you just access it using the [] operator like you would an array.
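The binding and draw call then look just like regular indexed instancing, except that only the per-vertex stream goes to the input assembler; a C++ sketch (vertexBuffer, indexBuffer, the counts and the Vertex struct are assumed to exist):

UINT stride = sizeof(Vertex), offset = 0;
context->IASetVertexBuffers(0, 1, &vertexBuffer, &stride, &offset);
context->IASetIndexBuffer(indexBuffer, DXGI_FORMAT_R32_UINT, 0);
context->VSSetShaderResources(0, 1, &instanceSRV); // matches register(t0)
context->DrawIndexedInstanced(indexCount, instanceCount, 0, 0, 0);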
Is there anything else that's unclear?

Related

DirectCompute: How to read from a RWTexture2D<float4>?

I have the following buffer:
RWTexture2D<float4> Output : register(u0);
This buffer is used by a compute shader for rendering a computed image.
To write a pixel in that texture, I just use code similar to this:
Output[XY] = SomeFunctionReturningFloat4(SomeArgument);
This works very well and my computed image is correctly rendered on screen.
Now at some stage in the compute shader, I would like to read back an
already computed pixel and process it again.
Output[XY] = SomeOtherFunctionReturningFloat4(Output[XY]);
The compiler returns an error:
error X3676: typed UAV loads are only allowed for single-component 32-bit element types
Any help appreciated.
In Compute Shaders, data access is limited for some data types, and it is not at all intuitive or straightforward. In your case, you use a
RWTexture2D<float4>
That is a typed UAV of format DXGI_FORMAT_R32G32B32A32_FLOAT.
This format is supported for UAV typed store, but it's not supported for UAV typed load.
Basically, you can only write to it, not read from it. UAV typed load only supports single-component 32-bit formats, in your case DXGI_FORMAT_R32_FLOAT.
Your code should run if you use a RWTexture2D<float>, but I suppose this is not enough for you.
Possible workarounds that spring to my mind are:
1. using four different RWTexture2D<float>, one for each component
2. using two different textures: a RWTexture2D<float4> to write your values and a Texture2D<float4> to read from
3. using a RWStructuredBuffer instead of the texture
I don't know your code, so I don't know if solutions 1 and 2 are viable. However, I strongly suggest going for 3 and using a StructuredBuffer. A RWStructuredBuffer can hold any type of struct and can easily cover all your needs; to be honest, in compute shaders I almost exclusively use them to pass data. If you need the final output to be a texture, you can do all your calculations on the buffer, then copy the results to the texture when you're done. I would add that drivers often use CompletePath to access RWTexture2D data and FastPath to access RWStructuredBuffer data, making the former awfully slower than the latter.
Reference for data type access is here. Scroll down to UAV typed load.
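If workaround 2 fits your code, the ping-pong between passes can be shaped like this C++ sketch (assuming two identically described textures with matching SRV/UAV pairs; all names are hypothetical):

#include <d3d11.h>

// Read the previous result through the SRV, write the new one through the UAV.
void RunPass(ID3D11DeviceContext* ctx, ID3D11ComputeShader* cs,
             ID3D11ShaderResourceView** srvRead,
             ID3D11UnorderedAccessView** uavWrite,
             UINT groupsX, UINT groupsY)
{
    ctx->CSSetShader(cs, nullptr, 0);
    ctx->CSSetShaderResources(0, 1, srvRead);                // Texture2D<float4> at t0
    ctx->CSSetUnorderedAccessViews(0, 1, uavWrite, nullptr); // RWTexture2D<float4> at u0
    ctx->Dispatch(groupsX, groupsY, 1);

    // Unbind both views so the two textures can swap roles in the next pass;
    // a resource may not be bound as SRV and UAV at the same time.
    ID3D11ShaderResourceView* nullSRV = nullptr;
    ID3D11UnorderedAccessView* nullUAV = nullptr;
    ctx->CSSetShaderResources(0, 1, &nullSRV);
    ctx->CSSetUnorderedAccessViews(0, 1, &nullUAV, nullptr);
}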

What is the correct sequence for uploading a uniform block?

The example page at https://www.lighthouse3d.com/tutorials/glsl-tutorial/uniform-blocks/ has this:
uniformBlockBinding()
bindBuffer()
bufferData()
bindBufferBase()
But conceptually, wouldn't this be more correct?
bindBuffer()
bufferData()
uniformBlockBinding()
bindBufferBase()
The idea being that uploading to a buffer (bindBuffer+bufferData) should be agnostic about what the buffer will be used for - and then, separately, uniformBlockBinding()+bindBufferBase() would be used to update those uniforms, per shader, when the relevant buffer has changed?
Adding an answer since the accepted answer has lots of info irrelevant to WebGL2.
At init time you call uniformBlockBinding. For the given program, it sets up which uniform buffer index bind point that particular program will get a particular uniform buffer from.
At render time you call bindBufferRange or bindBufferBase to bind a specific buffer to a specific uniform buffer index bind point.
If you also need to upload new data to that buffer, you can then call bufferData.
In pseudo code:

// at init time
for each uniform block
    gl.uniformBlockBinding(program, indexOfBlock, indexOfBindPoint)

// at render time
for each uniform block
    gl.bindBufferRange(gl.UNIFORM_BUFFER, indexOfBindPoint, buffer, offset, size)
    if (need to update data in buffer)
        gl.bufferData/gl.bufferSubData(gl.UNIFORM_BUFFER, data, ...)
Note that there is no "correct" sequence; how you update your buffers is really up to you. Since you might store multiple uniform blocks' data in a single buffer at different offsets, calling gl.bufferData/gl.bufferSubData like above is really not "correct" either, it's just one way of hundreds.
WebGL2 (OpenGL ES 3.0) does not support the layout(binding = x) mentioned in the accepted answer. There is also no such thing as glGenBuffers in WebGL2.
Neither is "more correct" than the other; both work. But if you're talking about separation of concerns, the first one better emphasizes correct separation.
glUniformBlockBinding modifies the program; it doesn't affect the nature of the buffer object or context buffer state. Indeed, by all rights, that call shouldn't even be in the same function; it's part of program object setup. In a modern GL tutorial, they would use layout(binding=X) to set the binding, so the function wouldn't even appear. For older code, it should be set to a known, constant value after creating the program and then left alone.
So calling the function between allocating storage for the buffer and binding it to an indexed bind point for use creates the impression that they should be calling glUniformBlockBinding every frame, which is the wrong impression.
And speaking of wrong impressions, glBindBufferBase shouldn't even be called there. The rest of that code is buffer setup code; it should only be done once, at the beginning of the application. glBindBufferBase should be called as part of the rendering process, not the setup process. In a good application, that call shouldn't be anywhere near the glGenBuffers call.
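To make that separation concrete, a desktop-GL sketch in C++ (a hypothetical uniform block named "Matrices" on bind point 0; a struct MatricesData mirroring its std140 layout is assumed):

#include <GL/glew.h> // or any loader exposing GL 3.1+ entry points

// Program setup, once: pin the block to a known bind point and leave it alone.
GLuint blockIndex = glGetUniformBlockIndex(program, "Matrices");
glUniformBlockBinding(program, blockIndex, 0);

// Buffer setup, once: allocate storage.
GLuint ubo = 0;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(MatricesData), nullptr, GL_DYNAMIC_DRAW);

// Rendering, per frame: upload fresh data and attach the buffer to the bind point.
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(MatricesData), &matricesData);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);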

Instance vs Loops in HLSL Model 5 Geometry Shaders

I'm looking at getting a program written for DirectX 11 to play nice on DirectX 10. To do that, I need to compile the shaders for model 4, not 5. Right now the only problem with that is that the geometry shaders use instancing, which is unsupported by model 4. The general model is
[instance(NUM_INSTANCES)]
void Gs(..., in uint instanceId : SV_GSInstanceID) { }
I can't seem to find much documentation on why this exists, because my thought is: can't I just replace this with a loop from instanceId=0 to instanceId=NUM_INSTANCES-1?
The answer seems to be no, as it doesn't seem to output correctly. But beyond my exact problem, can you help me understand why the concept of instancing exists? Is there some implication for the rest of the pipeline beyond simply calling the main function twice with a different index?
With regards to why my replacement did not work:
Geometry shaders are annotated with [maxvertexcount(N)]. I had incorrectly assumed this was the vertex input count and ignored it. In fact, the input size is determined by the type of primitive coming in, so the annotation is about the output. Before, with N as my output over I instances, each instance output N vertices. Now that I want to use a loop, a single instance outputs N*I vertices. As such, the answer was to do as I suggested and also use [maxvertexcount(N*NUM_INSTANCES)].
To answer my broader question of why instancing may be useful in a world that already has loops, I can only guess:
1. Loops are not truly first-class in shaders: graphics card cores have only limited support for control flow, so when loops are written in shaders the loop is often unrolled (see [unroll]). This has limitations, makes compilation slower, and makes the shader blob bigger.
2. Instances can be parallelized: one GPU core can run one instance of a shader while another runs the next instance of the same shader with the same input.

Random access to D3D11 buffer with R8G8B8A8_UNorm format in HLSL

I have a D3D11 buffer with a few million elements that is supposed to hold data in the R8G8B8A8_UNorm format.
The desired behavior is the following: One shader calculates a vec4 and writes it to the buffer in a random access pattern. In the next pass, another shader reads the data in a random access pattern and processes them further.
My best guess would be to create an UnorderedAccessView with the R8G8B8A8_UNorm format. But how do I declare the RWBuffer<?> in HLSL, and how do I write to and read from it? Is it necessary to declare it as RWBuffer<uint> and do the packing from vec4 to uint manually?
In OpenGL I would create a buffer and a buffer texture. Then I can declare an imageBuffer with the rgba8 format in the shader, access it with imageLoad and imageStore, and the hardware does all the conversions for me. Is this possible in D3D11?
This is a little tricky due to a lot of different gotchas, but you should be able to do something like this.
In your shader that writes to the buffer declare:
RWBuffer<float4> WriteBuf : register( u1 );
Note that it is bound to register u1 instead of u0. Unordered access views (UAVs) must start at slot 1 here because the u# registers are shared with the render target outputs.
To write to the buffer just do something like:
WriteBuf[0] = float4(0.5, 0.5, 0, 1);
Note that you must write all 4 values at once.
In your C++ code, you must create the buffer with unordered access binding and create a UAV for it. You can use the DXGI_FORMAT_R8G8B8A8_UNORM format. When you write 4 floats to it, the values will automatically be converted and packed. The UAV can be bound to the pipeline using OMSetRenderTargetsAndUnorderedAccessViews.
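A sketch of that write-side setup in C++ (elementCount, buffer, uav, rtv and dsv are assumptions; error handling omitted):

D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = elementCount * 4; // R8G8B8A8 = 4 bytes per element
desc.Usage = D3D11_USAGE_DEFAULT;
desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
device->CreateBuffer(&desc, nullptr, &buffer);

D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
uavDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM; // packs the float4 on store
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.FirstElement = 0;
uavDesc.Buffer.NumElements = elementCount;
device->CreateUnorderedAccessView(buffer, &uavDesc, &uav);

// Render target at slot 0, the UAV starting at slot 1 (matches register(u1)).
context->OMSetRenderTargetsAndUnorderedAccessViews(1, &rtv, dsv, 1, 1, &uav, nullptr);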
In your shader that reads from the buffer declare a read only buffer:
Buffer<float4> ReadBuf : register( t0 );
Note that this buffer uses t0 because it will be bound as a shader resource view (SRV) instead of UAV.
To access the buffer use something like:
float4 val = ReadBuf[0];
In your C++ code, you can create an SRV for the same buffer you created earlier and bind that instead of the UAV. The SRV can be bound to the pipeline using PSSetShaderResources and can also be created with DXGI_FORMAT_R8G8B8A8_UNORM.
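The read-side setup might look like this (same hypothetical names as above):

D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
srvDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM; // unpacks to float4 on load
srvDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
srvDesc.Buffer.FirstElement = 0;
srvDesc.Buffer.NumElements = elementCount;
device->CreateShaderResourceView(buffer, &srvDesc, &srv);
context->PSSetShaderResources(0, 1, &srv); // matches register(t0)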
You can't bind the same buffer as both an SRV and a UAV to the pipeline at the same time. So you must bind the UAV first and run your first shader pass, then unbind the UAV, bind the SRV, and run the second shader pass.
There are probably other ways to do this as well. Note that all of this requires shader model 5.

DirectX11: Pass data from ComputeShader to VertexShader?

Is it possible to apply a filter to the geometry data that is to be rendered using a Compute Shader and then use the result as an input buffer in the Vertex Shader? That would save me the trouble (and time) of reading back the data.
Any help is much appreciated.
Yes, absolutely. First you create two identical ID3D11Buffers of structures using the D3D11_BIND_VERTEX_BUFFER, D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_UNORDERED_ACCESS bind flags, plus the associated UAVs and SRVs.
The first step is to apply your filter to the input source buffer and write to the destination buffer during your compute pass.
Then, during the draw pass, you just have to bind the destination buffer to the IA stage. You can do some ping-pong if you need to accumulate computations on the vertices (I assume that by filter you mean a functional map, to use the Functional Programming term).
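One caveat: D3D11 does not allow D3D11_RESOURCE_MISC_BUFFER_STRUCTURED to be combined with D3D11_BIND_VERTEX_BUFFER, so when the same buffer must be both a UAV target and a vertex buffer, a raw (byte-address) view is the usual substitute. A C++ sketch under that assumption (Vertex, the shader objects and the views are hypothetical; the compute shader would see the buffer as a RWByteAddressBuffer):

D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = vertexCount * sizeof(Vertex);
desc.Usage = D3D11_USAGE_DEFAULT;
desc.BindFlags = D3D11_BIND_VERTEX_BUFFER | D3D11_BIND_UNORDERED_ACCESS;
desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS;
device->CreateBuffer(&desc, nullptr, &dstBuffer);

D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
uavDesc.Format = DXGI_FORMAT_R32_TYPELESS; // required for raw views
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.FirstElement = 0;
uavDesc.Buffer.NumElements = vertexCount * sizeof(Vertex) / 4;
uavDesc.Buffer.Flags = D3D11_BUFFER_UAV_FLAG_RAW;
device->CreateUnorderedAccessView(dstBuffer, &uavDesc, &dstUAV);

// Compute pass: filter the source vertices into the destination buffer.
context->CSSetShader(filterCS, nullptr, 0);
context->CSSetShaderResources(0, 1, &srcSRV);
context->CSSetUnorderedAccessViews(0, 1, &dstUAV, nullptr);
context->Dispatch((vertexCount + 63) / 64, 1, 1);

// Unbind the UAV, then draw straight from the same buffer via the IA stage.
ID3D11UnorderedAccessView* nullUAV = nullptr;
context->CSSetUnorderedAccessViews(0, 1, &nullUAV, nullptr);
UINT stride = sizeof(Vertex), offset = 0;
context->IASetVertexBuffers(0, 1, &dstBuffer, &stride, &offset);
context->Draw(vertexCount, 0);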
