Problem with HLSL looping/sampling - directx

I have a piece of HLSL code which looks like this:
float4 GetIndirection(float2 TexCoord)
{
    float4 indirection = tex2D(IndirectionSampler, TexCoord);
    for (half mip = indirection.b * 255; mip > 1 && indirection.a < 128; mip--)
    {
        indirection = tex2Dlod(IndirectionSampler, float4(TexCoord, 0, mip));
    }
    return indirection;
}
The results I am getting are consistent with that loop executing only once. I checked the shader in PIX and things got even weirder: the yellow arrow indicating the position in the code gets to the loop, goes through it once, and jumps back to the start. At that point the yellow arrow never moves again, but the cursor moves through the code and returns a result (a bug in PIX, or am I just using it wrong?).
I suspect this may be caused by the compiler moving texture reads outside the loop; however, I thought that didn't happen with tex2Dlod since I'm setting the LOD manually :/
1) What's the problem?
2) Any suggested solutions?

The problem was solved. It was a simple coding mistake: I needed to increase the mip level on each iteration, not decrease it.
float4 GetIndirection(float2 TexCoord)
{
    float4 indirection = tex2D(IndirectionSampler, TexCoord);
    for (half mip = indirection.b * 255; mip > 1 && indirection.a < 128; mip++)
    {
        indirection = tex2Dlod(IndirectionSampler, float4(TexCoord, 0, mip));
    }
    return indirection;
}


Vulkan Ray Tracing - Any Hit Shader doesn't write to buffer

I set up a minimal ray tracing pipeline in Vulkan, with any hit and closest hit shaders that write into buffers and ray payloads.
The problem is that buffer writes from the any hit shader do not seem to take effect.
Here is the source code for the closest hit shader:
layout(set = 0, binding = 0, std430) writeonly buffer RayStatusBuffer {
    uint items[];
} gRayStatus;
layout(location = 0) rayPayloadInEXT uint gRayPayload;
void main(void)
{
    gRayStatus.items[0] = 1;
    gRayPayload = 2;
}
The any hit shader code is identical, except for writing 3 and 4 for ray status buffer item and ray payload, respectively.
The buffer associated with gRayStatus is initialized to 0 and fed to the pipeline with:
VkDescriptorSetLayoutBinding statusLB{};
statusLB.binding = 0;
statusLB.descriptorCount = 1;
By calling traceRayEXT(..., flags = 0, ...) from the raygen shader, I can read back the values 1 and 2 for ray status buffer item and ray payload, respectively and as expected.
But when calling traceRayEXT(..., flags = gl_RayFlagsSkipClosestHitShaderEXT, ...) I would expect the output of the any hit shader (3 and 4) to be present. Instead I get 0 and 4, as if the buffer write had been ignored.
Any idea on this?
Sorry for the late response.
From what I know, there could be two causes:
1) The any hit shaders are not called because of the VkGeometryFlagBitsKHR flags (for example VK_GEOMETRY_OPAQUE_BIT_KHR) set in the VkAccelerationStructureGeometryKHR struct used when creating the BLAS.
2) The conditions under which any hit shaders are called. Looking at this image helped me a lot:
As you can see from the picture, an any hit shader is called only if the hit is closer than the closest hit found so far and the geometry is not opaque.
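The two conditions in that answer can be written as a small predicate. This is only a JavaScript sketch of the rule as the answer states it, not Vulkan API code: `shouldInvokeAnyHit` and the fields `t`, `opaque`, and `closestT` are hypothetical names, and ray flags or instance flags that override opacity are deliberately ignored.

```javascript
// Simplified model of the rule described above; not Vulkan API code.
// shouldInvokeAnyHit, t, opaque and closestT are hypothetical names.
function shouldInvokeAnyHit(candidate, closestT) {
  // The candidate must be closer than the closest hit accepted so far.
  const isCloser = candidate.t < closestT;
  // Geometry flagged as opaque skips the any hit stage.
  return isCloser && !candidate.opaque;
}

// Non-opaque geometry, first hit along the ray: any hit runs.
console.log(shouldInvokeAnyHit({ t: 2.0, opaque: false }, Infinity)); // true
// Same hit, but the geometry was built with the opaque flag: skipped.
console.log(shouldInvokeAnyHit({ t: 2.0, opaque: true }, Infinity)); // false
// Non-opaque, but farther away than an already accepted hit: skipped.
console.log(shouldInvokeAnyHit({ t: 5.0, opaque: false }, 3.0)); // false
```

Under this model, a buffer that is never written by the any hit shader would be consistent with the geometry having been built as opaque.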

Optimizing branching for lookup tables

Branching in WebGL seems to be something like the following (paraphrased from various articles):
The shader executes its code in parallel, and if it needs to evaluate whether a condition is true before continuing (e.g. with an if statement) then it must diverge and somehow communicate with the other threads in order to come to a conclusion.
Maybe that's a bit off - but ultimately, it seems like the problem with branching in shaders is when each thread may be seeing different data. Therefore, branching with uniforms-only is typically okay, whereas branching on dynamic data is not.
Question 1: Is this correct?
Question 2: How does this relate to something that's fairly predictable but not a uniform, such as an index in a loop?
Specifically, I have the following function:
vec4 getMorph(int morphIndex) {
    /* doesn't work - can't access morphs via dynamic index
    vec4 morphs[8];
    morphs[0] = a_Morph_0;
    morphs[1] = a_Morph_1;
    ...
    morphs[7] = a_Morph_7;
    return morphs[morphIndex];
    */

    //need to do this:
    if(morphIndex == 0) {
        return a_Morph_0;
    } else if(morphIndex == 1) {
        return a_Morph_1;
    }
    ...
    else if(morphIndex == 7) {
        return a_Morph_7;
    }
}
And I call it in something like this:
for(int i = 0; i < 8; i++) {
    pos += weight * getMorph(i);
    normal += weight * getMorph(i);
}
Technically, it works fine - but my concern is all the if/else branches based on the dynamic index. Is that going to slow things down in a case like this?
For the sake of comparison, though it's tricky to explain in a few concise words here - I have an alternative idea to always run all the calculations for each attribute. This would involve potentially 24 superfluous vec4 += float * vec4 calculations per vertex. Would that be better or worse than branching 8 times on an index, usually?
note: in my actual code there's a few more levels of mapping and indirection, while it does boil down to the same getMorph(i) question, my use case involves getting that index from both an index in a loop, and a lookup of that index in a uniform integer array
I know this is not a direct answer to your question but ... why not just not use a loop?
vec3 pos = weight[0] * a_Morph_0 +
weight[1] * a_Morph_1 +
weight[2] * a_Morph_2 ...
If you want generic code (ie where you can set the number of morphs) then either get creative with #if, #else, #endif
const numMorphs = ?
const shaderSource = `
#define NUM_MORPHS ${numMorphs}
vec3 pos = weight[0] * a_Morph_0
#if NUM_MORPHS >= 2
    + weight[1] * a_Morph_1
#endif
#if NUM_MORPHS >= 3
    + weight[2] * a_Morph_2
#endif
    ;
`;
or generate the shader in JavaScript with string manipulation.
function createMorphShaderSource(numMorphs) {
  const morphStrs = [];
  // a_Morph_0 is always present, so start at 1.
  for (let i = 1; i < numMorphs; ++i) {
    morphStrs.push(`+ weight[${i}] * a_Morph_${i}`);
  }
  return `
    ..shader code..
    ${morphStrs.join('\n')}
    ..shader code..
  `;
}
Shader generation through string manipulation is a normal thing to do. You'll find all major 3d libraries do this (three.js, unreal, unity, pixi.js, playcanvas, etc...)
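As a concrete illustration of the string-manipulation approach, here is a minimal sketch that builds only the morph sum expression. `buildMorphSum` is a hypothetical helper; the `weight[i]` and `a_Morph_i` names are taken from the snippets above, and the surrounding shader source is left out.

```javascript
// Hypothetical helper: builds the GLSL expression
// "weight[0] * a_Morph_0 + weight[1] * a_Morph_1 + ..." for numMorphs targets.
function buildMorphSum(numMorphs) {
  if (!Number.isInteger(numMorphs) || numMorphs < 1) {
    throw new RangeError('numMorphs must be a positive integer');
  }
  const terms = [];
  for (let i = 0; i < numMorphs; ++i) {
    terms.push(`weight[${i}] * a_Morph_${i}`);
  }
  return terms.join(' + ');
}

console.log(buildMorphSum(3));
// weight[0] * a_Morph_0 + weight[1] * a_Morph_1 + weight[2] * a_Morph_2
```

The resulting string would then be spliced into the shader template, so each morph count gets its own loop-free, branch-free shader.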
As for whether or not branching is slow it really depends on the GPU but the general rule is that yes, it's slower no matter how it's done.
You generally can avoid branches by writing custom shaders instead of trying to be generic.
Instead of
uniform bool haveTexture;
if (haveTexture) {
    ...
} else {
    ...
}
Just write 2 shaders. One with a texture and one without.
Another way to avoid branches is to get creative with your math. For example let's say we want to support vertex colors or textures
varying vec4 vertexColor;
uniform sampler2D textureColor;
vec4 tcolor = texture2D(textureColor, ...);
gl_FragColor = tcolor * vertexColor;
Now, when we want just a vertex color, set textureColor to a 1x1 pixel white texture. When we want just a texture, turn off the attribute for vertexColor and set that attribute to white with gl.vertexAttrib4f(vertexColorAttributeLocation, 1, 1, 1, 1). As a bonus, we can modulate the texture with vertex colors by supplying both a texture and vertex colors.
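The white-texture and constant-attribute setup described above might look like the following on the JavaScript side. This is a sketch under assumptions: `gl` is assumed to be a WebGLRenderingContext, and `createWhiteTexture` and `useConstantWhiteVertexColor` are hypothetical helper names; only standard WebGL calls are used inside them.

```javascript
// Sketch only: `gl` is assumed to be a WebGLRenderingContext.
// createWhiteTexture and useConstantWhiteVertexColor are hypothetical helpers.
function createWhiteTexture(gl) {
  const tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, tex);
  // Upload a single opaque white RGBA pixel (1x1 texture, mip level 0).
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, 1, 1, 0, gl.RGBA, gl.UNSIGNED_BYTE,
                new Uint8Array([255, 255, 255, 255]));
  return tex;
}

function useConstantWhiteVertexColor(gl, vertexColorAttributeLocation) {
  // With the attribute array disabled, the shader reads the constant below.
  gl.disableVertexAttribArray(vertexColorAttributeLocation);
  gl.vertexAttrib4f(vertexColorAttributeLocation, 1, 1, 1, 1);
}
```

Multiplying by white leaves the other factor unchanged, which is why the same shader covers the texture-only and vertex-color-only cases without a branch.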
Similarly, we could pass in a 0 or a 1 and multiply certain things by it to remove their influence. In your morph example, a 3D engine that targets performance would generate shaders for different numbers of morphs. A 3D engine that didn't care about performance would have one shader that supports N morph targets and just set the weight to 0 for any unused targets.
Yet another way to avoid branching is the step function which is defined as
step(edge, x) {
    return x < edge ? 0.0 : 1.0;
}
So you can choose a or b with
v = mix(a, b, step(edge, x));
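To see why this selects between a and b, here is a scalar-only JavaScript emulation. `step` and `mix` below are stand-ins written to mirror the GLSL built-ins, and `select` is a hypothetical name for the combined expression.

```javascript
// JavaScript stand-ins for the GLSL built-ins, scalars only.
function step(edge, x) {
  return x < edge ? 0.0 : 1.0;
}

function mix(a, b, t) {
  // Linear blend: t = 0 gives a, t = 1 gives b.
  return a * (1 - t) + b * t;
}

// Hypothetical name for the branch-free selection shown above.
function select(a, b, edge, x) {
  return mix(a, b, step(edge, x));
}

console.log(select(10, 20, 0.5, 0.25)); // 10 (x is below the edge, so a)
console.log(select(10, 20, 0.5, 0.75)); // 20 (x is at or above the edge, so b)
```

Because step only ever returns 0 or 1, the blend degenerates into picking one operand, at the cost of evaluating both a and b.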

SceneKit Rigged Character Animation increase performance

I have *.DAE files for characters, each with 45-70 bones, and I want to have about 100 animated characters on screen.
However, with ~60 characters the animations take ~13 ms of my update loop, which is very costly and leaves me almost no room for other tasks.
I am adding the animations (CAAnimationGroup) to the mesh SCNNode.
When I want to swap animations, I remove the previous animation with its fade-out set to 0.2 and add the new animation with its fade-in set to 0.2 as well. Is that bad? Should I just pause the previous animation and play the new one, or is that even worse?
Are there better ways to animate rigged characters in SceneKit, maybe using the GPU?
Please point me in the right direction to decrease the animation overhead in my update loop.
After contacting Apple via Bug Reporter (Radar), I received this reply by email:
This issue is being worked on to be fixed in a future update, we will
let you know as soon as we have a beta build you can test and verify
this issue.
Thank you for your patience.
So let's wait and see how far Apple's engineers will improve it :).
SceneKit does the skeletal animation on the GPU if your vertices have 4 or fewer influences. From the docs, reproduced below:
SceneKit performs skeletal animation on the GPU only if the componentsPerVector count in this geometry source is 4 or less. Larger vectors result in CPU-based animation and drastically reduced rendering performance.
I have used the following code to detect if the animation is done on the GPU:
- (void)checkGPUSkinningForInScene:(SCNScene*)character
                          forNodes:(NSArray*)skinnedNodes {
  for (NSString* nodeName in skinnedNodes) {
    SCNNode* skinnedNode =
        [character.rootNode childNodeWithName:nodeName recursively:YES];
    SCNSkinner* skinner = skinnedNode.skinner;
    NSLog(@"******** Skinner for node %@ is %@ with skeleton: %@", nodeName, skinner, skinner.skeleton);
    if (skinner) {
      SCNGeometrySource* boneIndices = skinner.boneIndices;
      SCNGeometrySource* boneWeights = skinner.boneWeights;
      // componentsPerVector is the number of bone influences per vertex.
      NSInteger influences = boneWeights.componentsPerVector;
      if (influences <= 4) {
        NSLog(@" This node %@ with %ld influences is skinned on the GPU", nodeName, (long)influences);
      } else {
        NSLog(@" This node %@ with %ld influences is skinned on the CPU", nodeName, (long)influences);
      }
    }
  }
}
You pass the SCNScene and the names of nodes which have SCNSkinner attached to check if the animation is done on the GPU or the CPU.
However, there is one other hidden piece of information about animation on the GPU which is that if your skeleton has more than 60 bones, it won't be executed on the GPU. The trick to know that is to print the default vertex shader, by attaching an invalid shader modifier entry as explained in this post.
The vertex shader contains the following skinning related code:
uniform vec4 u_skinningJointMatrices[60];
vec3 pos = vec3(0.);
vec3 nrm = vec3(0.);
#if defined(USE_TANGENT) || defined(USE_BITANGENT)
vec3 tgt = vec3(0.);
#endif
for (int i = 0; i < MAX_BONE_INFLUENCES; ++i) {
#if MAX_BONE_INFLUENCES == 1
    float weight = 1.0;
#else
    float weight = a_skinningWeights[i];
#endif
    int idx = int(a_skinningJoints[i]) * 3;
    mat4 jointMatrix = mat4(u_skinningJointMatrices[idx], u_skinningJointMatrices[idx+1], u_skinningJointMatrices[idx+2], vec4(0., 0., 0., 1.));
    pos += (_geometry.position * jointMatrix).xyz * weight;
    nrm += _geometry.normal * mat3(jointMatrix) * weight;
#if defined(USE_TANGENT) || defined(USE_BITANGENT)
    tgt += _geometry.tangent.xyz * mat3(jointMatrix) * weight;
#endif
}
_geometry.position.xyz = pos;
which clearly implies that your skeleton should be restricted to 60 bones.
If all your characters share the same skeleton, I would suggest simply checking whether the animation runs on the CPU or the GPU using the tips above. Otherwise, you may have to rework your character skeletons so that they have no more than 60 bones and no more than 4 influences per vertex.
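The two limits discussed in this answer can be summarized as a small check. This JavaScript sketch only restates the answer's claims (at most 4 influences per vertex from the documentation, and a 60-bone limit inferred from the shader); `isLikelySkinnedOnGPU` is a hypothetical helper, not a SceneKit API.

```javascript
// Restates the limits claimed above; not a SceneKit API.
// The 60-bone figure is the answer's inference from the default shader.
const MAX_GPU_BONES = 60;
const MAX_GPU_INFLUENCES = 4;

function isLikelySkinnedOnGPU(boneCount, influencesPerVertex) {
  return boneCount <= MAX_GPU_BONES && influencesPerVertex <= MAX_GPU_INFLUENCES;
}

console.log(isLikelySkinnedOnGPU(45, 4)); // true
console.log(isLikelySkinnedOnGPU(70, 4)); // false (too many bones)
console.log(isLikelySkinnedOnGPU(45, 8)); // false (too many influences)
```

For the 45-70 bone characters in the question, this would suggest that the larger skeletons are the ones most likely to fall back to the CPU.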

How to get the texture size in HLSL?

For a HLSL shader I'm working on (for practice) I'm trying to execute a part of the code if the texture coordinates (on a model) are above half the respective size (that is x > width / 2 or y > height / 2). I'm familiar with C/C++ and know the basics of HLSL (the very basics). If no other solution is possible, I will set the texture size manually with XNA (in which I'm using the shader, as a matter of fact). Is there a better solution? I'm trying to remain within Shader Model 2.0 if possible.
The default texture coordinate space is normalized to 0..1 so x > width / 2 should simply be texcoord.x > 0.5.
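The equivalence is just a division by the texture size. A quick numeric check in JavaScript, where `toTexCoord` is a hypothetical helper and the 256-pixel width is an arbitrary example value:

```javascript
// Hypothetical helper: map a pixel coordinate to a normalized 0..1 coordinate.
function toTexCoord(pixel, size) {
  return pixel / size;
}

const width = 256; // arbitrary example size
for (const x of [64, 200]) {
  const u = toTexCoord(x, width);
  // Both tests agree, so the shader never needs to know the width.
  console.log(x > width / 2, u > 0.5);
}
// prints "false false" then "true true"
```

Since the comparison against 0.5 gives the same answer for any size, nothing needs to be passed in from XNA for this particular test.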
Be careful here. tex2D() and other texture calls should NOT be within if()/else clauses. So if you have a pixel shader input "IN.UV" and you're aiming at "OUT.color", you need to do it this way:
float4 aboveCol = tex2D(mySampler, some_texcoords);
float4 belowCol = tex2D(mySampler, some_other_texcoords);
if (IN.UV.x >= 0.5) {
    OUT.color = /* some function of... */ aboveCol;
} else {
    OUT.color = /* some function of... */ belowCol;
}
rather than putting the tex2D() calls inside the if() blocks.

2D Pixel Shader has no effect?

I set up a basic pixel shader (right now, it's configured for testing), and it doesn't seem to do anything. I set it up like so:
uniform extern texture ScreenTexture;
const float bloomThreshhold = 0.4;
const float existingPixelColorMult = 1.1;
sampler ScreenS = sampler_state
{
    Texture = <ScreenTexture>;
};
float4 BloomedColor(float2 texCoord: TEXCOORD0) : COLOR
{
    // pick a pixel on the screen for this pixel, based on
    // the calculated offset and direction
    float2 temp = texCoord;
    temp.x += 1;
    float4 mainPixelColor = 0;
    float4 pixelPlus1X = tex2D(ScreenS, temp);
    temp.x -= 2;
    float4 pixelMinus1X = tex2D(ScreenS, temp);
    temp.x += 1;
    temp.y += 1;
    float4 pixelPlus1Y = tex2D(ScreenS, temp);
    temp.y -= 2;
    float4 pixelMinus1Y = tex2D(ScreenS, temp);
    return mainPixelColor;
}
technique Bloom
{
    pass P0
    {
        PixelShader = compile ps_1_1 BloomedColor();
    }
}
with the loading code like:
glowEffect = Content.Load<Effect>("GlowShader");
glowEffect.CurrentTechnique = glowEffect.Techniques[0];
and the drawing code is:
spriteBatch.Draw(screenImage, Vector2.Zero, Color.White);
Loading seems to work fine, and there are no errors thrown when I use that method to render the texture, but it acts like the effect code isn't in there. It can't be that I'm using the wrong version of shaders (I tested with 2.0 and 1.1 versions), so why? (Using XNA 3.1)
You're returning 0 for every pixel. You have commented out any code that would return a different value than 0. 0 is black and if you're doing any sort of render you'll either get black (if the blend mode shows this as a color) or no change (if the blend mode multiplies the result). You can of course (if you were just attempting to see if the shader is being loaded and operated) try using an oddball color. Neon green anyone? Then, once you confirm it is being at least processed, start uncommenting that code and assessing the result.
Finally, if Bloom is what you're after, Microsoft has a very useful sample you will probably learn a lot from here:
If you're using XNA 4.0, see what Shawn Hargreaves has to say about this.
