Optimizing branching for lookup tables - webgl

Branching in WebGL seems to be something like the following (paraphrased from various articles):
The shader executes its code in parallel, and if it needs to evaluate whether a condition is true before continuing (e.g. with an if statement) then it must diverge and somehow communicate with the other threads in order to come to a conclusion.
Maybe that's a bit off - but ultimately, it seems like the problem with branching in shaders is when each thread may be seeing different data. Therefore, branching with uniforms-only is typically okay, whereas branching on dynamic data is not.
Question 1: Is this correct?
Question 2: How does this relate to something that's fairly predictable but not a uniform, such as an index in a loop?
Specifically, I have the following function:
vec4 getMorph(int morphIndex) {
  /* doesn't work - can't access morphs via dynamic index
  vec4 morphs[8];
  morphs[0] = a_Morph_0;
  morphs[1] = a_Morph_1;
  ...
  morphs[7] = a_Morph_7;
  return morphs[morphIndex];
  */
  // need to do this instead:
  if (morphIndex == 0) {
    return a_Morph_0;
  } else if (morphIndex == 1) {
    return a_Morph_1;
  }
  ...
  else if (morphIndex == 7) {
    return a_Morph_7;
  }
}
And I call it in something like this:
for (int i = 0; i < 8; i++) {
  pos += weight * getMorph(i);
  normal += weight * getMorph(i);
  ...
}
Technically, it works fine - but my concern is all the if/else branches based on the dynamic index. Is that going to slow things down in a case like this?
For the sake of comparison - though it's tricky to explain concisely here - I have an alternative idea: always run all the calculations for every attribute. That would involve potentially 24 superfluous vec4 += float * vec4 operations per vertex. Would that usually be better or worse than branching 8 times on an index?
Note: in my actual code there are a few more levels of mapping and indirection. While it does boil down to the same getMorph(i) question, my use case involves getting that index both from an index in a loop and from a lookup of that index in a uniform integer array.

I know this is not a direct answer to your question but ... why not skip the loop entirely?
vec3 pos = weight[0] * a_Morph_0 +
weight[1] * a_Morph_1 +
weight[2] * a_Morph_2 ...
If you want generic code (i.e. where you can set the number of morphs) then either get creative with #if, #else, #endif:
const numMorphs = ?
const shaderSource = `
...
#define NUM_MORPHS ${numMorphs}
vec3 pos = weight[0] * a_Morph_0
#if NUM_MORPHS >= 2
  + weight[1] * a_Morph_1
#endif
#if NUM_MORPHS >= 3
  + weight[2] * a_Morph_2
#endif
;
...
`;
or generate the shader in JavaScript with string manipulation.
function createMorphShaderSource(numMorphs) {
  const morphStrs = [];
  for (let i = 1; i < numMorphs; ++i) {
    morphStrs.push(`+ weight[${i}] * a_Morph_${i}`);
  }
  return `
..shader code..
${morphStrs.join('\n')}
..shader code..
`;
}
Shader generation through string manipulation is a normal thing to do. You'll find all major 3d libraries do this (three.js, unreal, unity, pixi.js, playcanvas, etc...)
As for whether or not branching is slow, it really depends on the GPU, but the general rule is that yes, it's slower no matter how it's done.
You generally can avoid branches by writing custom shaders instead of trying to be generic.
Instead of
uniform bool haveTexture;
if (haveTexture) {
...
} else {
...
}
Just write 2 shaders. One with a texture and one without.
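One common way to keep a single source file while still getting two distinct shaders (a sketch; the define name is just illustrative) is to prepend a #define per variant at compile time:

```javascript
// Build two shader variants from one source string by prepending a
// #define, so the branch is resolved at compile time instead of with
// a uniform bool at run time.
function makeVariants(fragmentSource) {
  return {
    withTexture: '#define HAVE_TEXTURE 1\n' + fragmentSource,
    withoutTexture: fragmentSource,
  };
}
```

The shader body then uses #ifdef HAVE_TEXTURE where it previously branched on the uniform.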
Another way to avoid branches is to get creative with your math. For example let's say we want to support vertex colors or textures
varying vec4 vertexColor;
uniform sampler2D textureColor;
...
vec4 tcolor = texture2D(textureColor, ...);
gl_FragColor = tcolor * vertexColor;
Now when we want just a vertex color, set textureColor to a 1x1 pixel white texture. When we want just a texture, turn off the attribute for vertexColor and set that attribute to white with gl.vertexAttrib4f(vertexColorAttributeLocation, 1, 1, 1, 1). And bonus: we can modulate the texture with vertex colors by supplying both a texture and vertex colors.
Similarly, we could pass in a 0 or a 1 to multiply certain things by 0 or 1 to remove their influence. In your morph example, a 3D engine targeting performance would generate shaders for different numbers of morphs. A 3D engine that didn't care about performance could have one shader that supports N morph targets and just set the weight to 0 for any unused targets.
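To make the weight-0 trick concrete, here is the same arithmetic on the CPU (a sketch; the flat [x, y, z] array layout is just for illustration):

```javascript
// Sum N weighted morph targets; a target with weight 0 contributes
// nothing, so no branch is needed to "skip" unused targets.
function blendMorphs(base, morphs, weights) {
  const pos = base.slice(); // copy of [x, y, z]
  for (let i = 0; i < morphs.length; i++) {
    for (let c = 0; c < 3; c++) {
      pos[c] += weights[i] * morphs[i][c]; // weight 0 => no influence
    }
  }
  return pos;
}
```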
Yet another way to avoid branching is the step function, which is defined as:
float step(float edge, float x) {
  return x < edge ? 0.0 : 1.0;
}
So you can choose a or b with
v = mix(a, b, step(edge, x));
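The same selection written out on the CPU, to show the math (GLSL's built-in step and mix re-expressed in JavaScript):

```javascript
// Branchless select: picks a when x < edge, b otherwise.
const step = (edge, x) => (x < edge ? 0.0 : 1.0);
const mix = (a, b, t) => a * (1.0 - t) + b * t;
const select = (a, b, edge, x) => mix(a, b, step(edge, x));
```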

Related

How can I update a dynamic vertex buffer quickly?

I'm trying to make a simple 3D modeling tool.
There is some work to move a vertex (or vertices) to transform the model.
I used a dynamic vertex buffer because I thought it would need frequent updates,
but performance is too low on a high-polygon model even though I change just one vertex.
Are there other methods, or am I doing it the wrong way?
Here is my D3D11_BUFFER_DESC:
Usage = D3D11_USAGE_DYNAMIC;
CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
BindFlags = D3D11_BIND_VERTEX_BUFFER;
ByteWidth = sizeof(ST_Vertex) * _nVertexCount;
D3D11_SUBRESOURCE_DATA d3dBufferData;
d3dBufferData.pSysMem = pVerticesInfo;
hr = pd3dDevice->CreateBuffer(&descBuffer, &d3dBufferData, &_pVertexBuffer);
And my update function:
D3D11_MAPPED_SUBRESOURCE d3dMappedResource;
pImmediateContext->Map(_pVertexBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &d3dMappedResource);
ST_Vertex* pBuffer = (ST_Vertex*)d3dMappedResource.pData;
for (int i = 0; i < vIndice.size(); ++i)
{
    pBuffer[vIndice[i]].xfPosition.x = pVerticesInfo[vIndice[i]].xfPosition.x;
    pBuffer[vIndice[i]].xfPosition.y = pVerticesInfo[vIndice[i]].xfPosition.y;
    pBuffer[vIndice[i]].xfPosition.z = pVerticesInfo[vIndice[i]].xfPosition.z;
}
pImmediateContext->Unmap(_pVertexBuffer, 0);
As mentioned in the previous answer, you are updating your whole buffer every time, which will be slow depending on model size.
The solution is indeed to implement partial updates. There are two possibilities: you want to update a single vertex, or you want to update arbitrary indices (for example, you want to move N vertices in one go, in different locations, like vertices 1, 20, and 23).
The first solution is rather simple. First create your buffer with the following description:
Usage = D3D11_USAGE_DEFAULT;
CPUAccessFlags = 0;
BindFlags = D3D11_BIND_VERTEX_BUFFER;
ByteWidth = sizeof(ST_Vertex) * _nVertexCount;
D3D11_SUBRESOURCE_DATA d3dBufferData;
d3dBufferData.pSysMem = pVerticesInfo;
hr = pd3dDevice->CreateBuffer(&descBuffer, &d3dBufferData, &_pVertexBuffer);
This makes sure your vertex buffer is gpu visible only.
Next create a second dynamic buffer which has the size of a single vertex (you do not need any bind flags in that case, as it will be used only for copies)
_pCopyVertexBuffer
Usage = D3D11_USAGE_DYNAMIC; //Staging works as well
CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
BindFlags = 0;
ByteWidth = sizeof(ST_Vertex);
hr = pd3dDevice->CreateBuffer(&descBuffer, NULL, &_pCopyVertexBuffer); //no initial data needed
When you move a vertex, copy the changed vertex into the copy buffer:
ST_Vertex changedVertex;
D3D11_MAPPED_SUBRESOURCE d3dMappedResource;
pImmediateContext->Map(_pCopyVertexBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &d3dMappedResource);
ST_Vertex* pBuffer = (ST_Vertex*)d3dMappedResource.pData;
pBuffer->xfPosition.x = changedVertex.xfPosition.x;
pBuffer->xfPosition.y = changedVertex.xfPosition.y;
pBuffer->xfPosition.z = changedVertex.xfPosition.z;
pImmediateContext->Unmap(_pCopyVertexBuffer, 0);
Since you use D3D11_MAP_WRITE_DISCARD, make sure to write all attributes there (not only position).
Now once you're done, you can use ID3D11DeviceContext::CopySubresourceRegion to copy only the modified vertex to its current location. I assume that vertexID is the index of the modified vertex:
pd3DeviceContext->CopySubresourceRegion(_pVertexBuffer,
    0, //must be 0
    vertexID * sizeof(ST_Vertex), //location of the vertex in your gpu vertex buffer
    0, //must be 0
    0, //must be 0
    _pCopyVertexBuffer,
    0, //must be 0
    NULL //in this case we copy the full content of _pCopyVertexBuffer, so we can set it to null
);
Now if you want to update a list of vertices, things get more complicated and you have several options:
-First, you apply this single-vertex technique in a loop. This will work quite well if your changeset is small.
-If your changeset is very big (close to the full vertex count), you can probably rewrite the whole buffer instead.
-An intermediate technique is to use a compute shader to perform the updates (that's the one I normally use, as it's the most flexible version).
Posting all c++ binding code would be way too long, but here is the concept :
your vertex buffer must have BindFlags = D3D11_BIND_VERTEX_BUFFER | D3D11_BIND_UNORDERED_ACCESS; //this allows writes from compute
you need to create an ID3D11UnorderedAccessView for this buffer (so the shader can write to it)
you need the following misc flag: D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS //this allows writing to it as a RWByteAddressBuffer
you then create two dynamic structured buffers (I prefer those over byte-address buffers, but combining vertex buffer and structured buffer flags is not allowed in dx11, so for the write target you need raw instead)
the first structured buffer has a stride of ST_Vertex (this is your changeset)
the second structured buffer has a stride of 4 (uint, these are the indices)
both structured buffers get an arbitrary element count (normally I use 1024 or 2048), so that will be the maximum number of vertices you can update in a single pass
for both structured buffers you need an ID3D11ShaderResourceView (shader visible, read only)
Then update process is the following :
write the modified vertices and their locations into the structured buffers (using map discard; it's ok to copy less than the full size)
attach both structured buffers for read
attach ID3D11UnorderedAccessView for write
set your compute shader
call dispatch
detach ID3D11UnorderedAccessView for write (this is VERY important)
This is a sample compute shader (I assume your vertex is position only, for simplicity):
cbuffer cbUpdateCount : register(b0)
{
    uint updateCount;
};

RWByteAddressBuffer RWVertexPositionBuffer : register(u0);
StructuredBuffer<float3> ModifiedVertexBuffer : register(t0);
StructuredBuffer<uint> ModifiedVertexIndicesBuffer : register(t1);

//this is the stride of your vertex buffer, since here we use float3 it is 12 bytes
#define WRITE_STRIDE 12

[numthreads(64, 1, 1)]
void CS( uint3 tid : SV_DispatchThreadID )
{
    //make sure you do not go past the element count, as we run 64 threads at a time
    if (tid.x >= updateCount) { return; }
    uint readIndex = tid.x;
    uint writeIndex = ModifiedVertexIndicesBuffer[readIndex];
    float3 vertex = ModifiedVertexBuffer[readIndex];
    //byte address buffers do not understand float; asuint is a binary cast.
    RWVertexPositionBuffer.Store3(writeIndex * WRITE_STRIDE, asuint(vertex));
}
For the purposes of this question I'm going to assume you already have a mechanism for selecting a vertex from a list of vertices based upon ray casting or some other picking method and a mechanism for creating a displacement vector detailing how the vertex was moved in model space.
The method you have for updating the buffer is sufficient for anything less than a few hundred vertices, but on large scale models it becomes extremely slow. This is because you're updating everything, rather than the individual vertices you modified.
To fix this, you should only update the vertices you have changed, and to do that you need to create a change set.
In concept, a change set is nothing more than a set of changes made to the data - a list of the vertices that need to be updated. Since we already know which vertices were modified (otherwise we couldn't have manipulated them), we can map in the GPU buffer, go to that vertex specifically, and copy just those vertices into the GPU buffer.
In your vertex modification method, record the index of the vertex that was modified by the user:
//Modify the vertex coordinates based on mouse displacement
pVerticesInfo[SelectedVertexIndex].xfPosition.x += DisplacementVector.x;
pVerticesInfo[SelectedVertexIndex].xfPosition.y += DisplacementVector.y;
pVerticesInfo[SelectedVertexIndex].xfPosition.z += DisplacementVector.z;
//Add the changed vertex to the list of changes.
changedVertices.add(SelectedVertexIndex);
//And update the GPU buffer
UpdateD3DBuffer();
In UpdateD3DBuffer(), do the following:
D3D11_MAPPED_SUBRESOURCE d3dMappedResource;
pImmediateContext->Map(_pVertexBuffer, 0, D3D11_MAP_WRITE_NO_OVERWRITE, 0, &d3dMappedResource); //dynamic buffers don't allow plain D3D11_MAP_WRITE
ST_Vertex* pBuffer = (ST_Vertex*)d3dMappedResource.pData;
for (int i = 0; i < changedVertices.size(); ++i)
{
    pBuffer[changedVertices[i]].xfPosition.x = pVerticesInfo[changedVertices[i]].xfPosition.x;
    pBuffer[changedVertices[i]].xfPosition.y = pVerticesInfo[changedVertices[i]].xfPosition.y;
    pBuffer[changedVertices[i]].xfPosition.z = pVerticesInfo[changedVertices[i]].xfPosition.z;
}
pImmediateContext->Unmap(_pVertexBuffer, 0);
changedVertices.clear();
This has the effect of only updating the vertices that have changed, rather than all vertices in the model.
This also allows for some more complex manipulations. You can select multiple vertices and move them all as a group, select a whole face and move all the connected vertices, or move entire regions of the model relatively easily, assuming your picking method is capable of handling this.
In addition, if you record the change sets with enough information (the affected vertices and the displacement vector), you can fairly easily implement an undo function by simply reversing the displacement vector and reapplying the selected change set.
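The change-set-plus-undo idea can be sketched independently of D3D (plain JavaScript here; the flat [x, y, z, ...] vertex layout is just for illustration):

```javascript
// A change set records which vertices moved and by how much; undo simply
// re-applies the negated displacement to the same indices.
function applyChange(vertices, change) {
  for (const i of change.indices) {
    vertices[3 * i]     += change.displacement[0];
    vertices[3 * i + 1] += change.displacement[1];
    vertices[3 * i + 2] += change.displacement[2];
  }
}

function undoChange(vertices, change) {
  applyChange(vertices, {
    indices: change.indices,
    displacement: change.displacement.map(v => -v),
  });
}
```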

WebGL shader save multiple 32 bit values

I need to save up to 8 32-bit values in each WebGL fragment shader invocation (including when no OES_texture_float or OES_texture_half_float extensions are available). It seems I can only store a single 32-bit value by packing it into the 4x8-bit RGBA gl_FragColor.
Is there a way to store 8 values ?
The only way to draw more than one vec4 worth of data per call in the fragment shader is to use WEBGL_draw_buffers which lets you bind multiple color attachments to a framebuffer and then render to all of them in a single fragment shader call using
gl_FragData[constAttachmentIndex] = result;
If WEBGL_draw_buffers is not available, the only workarounds I can think of are:
Rendering in multiple draw calls.
Call gl.drawArrays to render the first vec4, then again with different parameters or a different shader to render the second vec4.
Render based on gl_FragCoord where you change the output for each pixel.
In other words, the 1st pixel gets the first vec4, the second pixel gets the second vec4, etc. For example
float mode = mod(gl_FragCoord.x, 2.);
gl_FragColor = mix(result1, result2, mode);
In this way the results are stored like this
1212121212
1212121212
1212121212
into one texture. For more vec4s you could do this
float mode = mod(gl_FragCoord.x, 4.); // 4 vec4s
if (mode < 0.5) {
  gl_FragColor = result1;
} else if (mode < 1.5) {
  gl_FragColor = result2;
} else if (mode < 2.5) {
  gl_FragColor = result3;
} else {
  gl_FragColor = result4;
}
This may or may not be faster than method #1. Your shader is more complicated because it's potentially doing calculations for both result1 and result2 for every pixel but depending on the GPU and pipelining you might get some of that for free.
As for getting 32-bit values out even when there's no OES_texture_float, you're basically going to have to write out even more 8-bit values using one of the techniques above.
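For example, if the shader writes the raw IEEE-754 bytes of each float into RGBA8 pixels (one float per pixel), the JavaScript side can reassemble them after gl.readPixels. A sketch, assuming little-endian byte order in the packing:

```javascript
// Reassemble float32 values from bytes read back as RGBA8 pixels,
// one float per pixel (4 bytes per value).
function bytesToFloats(rgbaBytes, count) {
  const view = new DataView(rgbaBytes.buffer, rgbaBytes.byteOffset);
  const out = new Float32Array(count);
  for (let i = 0; i < count; i++) {
    out[i] = view.getFloat32(i * 4, true); // little-endian
  }
  return out;
}
```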
In WebGL2, draw buffers is a required feature, whereas it's optional in WebGL1. In WebGL2 there's also transform feedback, which writes the outputs of a vertex shader (the varyings) into buffers.

SceneKit Rigged Character Animation increase performance

I have *.DAE files for characters, each with 45-70 bones.
I want to have about 100 animated characters on the screen.
However, when I have ~60 characters, the animations take ~13ms of my update loop, which is very costly and leaves me almost no room for other tasks.
I am setting the CAAnimationGroup animations on the mesh SCNNode.
When I want to swap animations, I remove the previous animations with fadeOut set to 0.2 and add the new animation with fadeIn set to 0.2 as well. -> Is that bad? Should I just pause the previous animation and play a new one, or is that even worse?
Are there better ways to animate rigged characters in SceneKit, maybe using the GPU?
Please get me started in the right direction to reduce the animation overhead in my update loop.
Update
After Contacting Apple via Bug radar I received this issue via E-Mail:
This issue is being worked on to be fixed in a future update, we will
let you know as soon as we have a beta build you can test and verify
this issue.
Thank you for your patience.
So let's wait and see how far Apple's engineers will enhance it :).
SceneKit does the skeletal animation on the GPU only if your vertices have at most 4 influences. From the docs, reproduced below:
SceneKit performs skeletal animation on the GPU only if the componentsPerVector count in this geometry source is 4 or less. Larger vectors result in CPU-based animation and drastically reduced rendering performance.
I have used the following code to detect if the animation is done on the GPU:
- (void)checkGPUSkinningForInScene:(SCNScene*)character
                          forNodes:(NSArray*)skinnedNodes {
    for (NSString* nodeName in skinnedNodes) {
        SCNNode* skinnedNode =
            [character.rootNode childNodeWithName:nodeName recursively:YES];
        SCNSkinner* skinner = skinnedNode.skinner;
        NSLog(@"******** Skinner for node %@ is %@ with skeleton: %@",
              skinnedNode.name, skinner, skinner.skeleton);
        if (skinner) {
            SCNGeometrySource* boneIndices = skinner.boneIndices;
            SCNGeometrySource* boneWeights = skinner.boneWeights;
            NSInteger influences = boneWeights.componentsPerVector;
            if (influences <= 4) {
                NSLog(@" This node %@ with %ld influences is skinned on the GPU",
                      skinnedNode.name, (long)influences);
            } else {
                NSLog(@" This node %@ with %ld influences is skinned on the CPU",
                      skinnedNode.name, (long)influences);
            }
        }
    }
}
You pass the SCNScene and the names of nodes which have SCNSkinner attached to check if the animation is done on the GPU or the CPU.
However, there is one other hidden piece of information about GPU animation: if your skeleton has more than 60 bones, it won't be executed on the GPU. The trick to discover that is to print the default vertex shader, by attaching an invalid shader modifier entry as explained in this post.
The vertex shader contains the following skinning related code:
#ifdef USE_SKINNING
uniform vec4 u_skinningJointMatrices[60];
....
#ifdef USE_SKINNING
{
  vec3 pos = vec3(0.);
#ifdef USE_NORMAL
  vec3 nrm = vec3(0.);
#endif
#if defined(USE_TANGENT) || defined(USE_BITANGENT)
  vec3 tgt = vec3(0.);
#endif
  for (int i = 0; i < MAX_BONE_INFLUENCES; ++i) {
#if MAX_BONE_INFLUENCES == 1
    float weight = 1.0;
#else
    float weight = a_skinningWeights[i];
#endif
    int idx = int(a_skinningJoints[i]) * 3;
    mat4 jointMatrix = mat4(u_skinningJointMatrices[idx], u_skinningJointMatrices[idx+1], u_skinningJointMatrices[idx+2], vec4(0., 0., 0., 1.));
    pos += (_geometry.position * jointMatrix).xyz * weight;
#ifdef USE_NORMAL
    nrm += _geometry.normal * mat3(jointMatrix) * weight;
#endif
#if defined(USE_TANGENT) || defined(USE_BITANGENT)
    tgt += _geometry.tangent.xyz * mat3(jointMatrix) * weight;
#endif
  }
  _geometry.position.xyz = pos;
which clearly implies that your skeleton should be restricted to 60 bones.
If all your characters have the same skeleton, then I would suggest just checking whether the animation is executed on the CPU or the GPU using the above tips. Otherwise you may have to fix your character skeleton to have no more than 60 bones and no more than 4 influences per vertex.

How to get the texture size in HLSL?

For an HLSL shader I'm working on (for practice) I'm trying to execute a part of the code if the texture coordinates (on a model) are above half the respective size (that is, x > width / 2 or y > height / 2). I'm familiar with C/C++ and know the basics of HLSL (the very basics). If no other solution is possible, I will set the texture size manually with XNA (in which I'm using the shader, as a matter of fact). Is there a better solution? I'm trying to remain within Shader Model 2.0 if possible.
The default texture coordinate space is normalized to 0..1 so x > width / 2 should simply be texcoord.x > 0.5.
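For reference, the relationship between texel indices and the normalized 0..1 coordinate space the answer refers to is (a small sketch):

```javascript
// Texel i of a width-w texture has its center at (i + 0.5) / w in
// normalized coordinates, which is why "x > width / 2" is just u > 0.5.
const texelCenterToUV = (i, w) => (i + 0.5) / w;
```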
Be careful here. tex2D() and other texture sampling calls should NOT be inside if()/else clauses, because they rely on screen-space gradients that are undefined inside divergent branches. So if you have a pixel shader input IN.UV and you're aiming at OUT.color, you need to do it this way:
float4 aboveCol = tex2D(mySampler, some_texcoords);
float4 belowCol = tex2D(mySampler, some_other_texcoords);
if (IN.UV.x >= 0.5) {
  OUT.color = /* some function of... */ aboveCol;
} else {
  OUT.color = /* some function of... */ belowCol;
}
rather than putting the tex2D() calls inside the if() blocks.

Problem with HLSL looping/sampling

I have a piece of HLSL code which looks like this:
float4 GetIndirection(float2 TexCoord)
{
    float4 indirection = tex2D(IndirectionSampler, TexCoord);
    for (half mip = indirection.b * 255; mip > 1 && indirection.a < 128; mip--)
    {
        indirection = tex2Dlod(IndirectionSampler, float4(TexCoord, 0, mip));
    }
    return indirection;
}
The results I am getting are consistent with that loop only executing once. I checked the shader in PIX and things got even more weird: the yellow arrow indicating the position in the code reaches the loop, goes through it once, and jumps back to the start. From that point the yellow arrow never moves again, yet the cursor moves through the code and returns a result (a bug in PIX, or am I just using it wrong?).
I have a suspicion this may be a problem to do with texture reads getting moved outside the loop by the compiler, however I thought that didn't happen with tex2Dlod since I'm setting the LOD manually :/
So:
1) What's the problem?
2) Any suggested solutions?
Problem solved: it was a simple coding mistake; I needed to increase the mip level on each iteration, not decrease it.
float4 GetIndirection(float2 TexCoord)
{
    float4 indirection = tex2D(IndirectionSampler, TexCoord);
    for (half mip = indirection.b * 255; mip > 1 && indirection.a < 128; mip++)
    {
        indirection = tex2Dlod(IndirectionSampler, float4(TexCoord, 0, mip));
    }
    return indirection;
}
