Shader code optimization

Shader code optimization - directx

I have this code snippet (for cubemap PCF filtering) and would like to optimize it for shader model 2. I tried eliminating the branches with permutation matrices stored in uniforms, but that requires too many of them (2x24).
float3 l = normalize(ldir);
float3 al = abs(l);
float3 off2, off3, off4;
if( al.x < al.y )
{
if( al.y < al.z )
{
// z is dominant
off2 = CubeOffset(l.zxy, float2(0, 1), texelsize).yzx;
off3 = CubeOffset(l.zxy, float2(1, 0), texelsize).yzx;
off4 = CubeOffset(l.zxy, float2(1, 1), texelsize).yzx;
}
else
{
// y is dominant
off2 = CubeOffset(l.yxz, float2(0, 1), texelsize).yxz;
off3 = CubeOffset(l.yxz, float2(1, 0), texelsize).yxz;
off4 = CubeOffset(l.yxz, float2(1, 1), texelsize).yxz;
}
}
else
{
if( al.x < al.z )
{
// z is dominant
off2 = CubeOffset(l.zxy, float2(0, 1), texelsize).yzx;
off3 = CubeOffset(l.zxy, float2(1, 0), texelsize).yzx;
off4 = CubeOffset(l.zxy, float2(1, 1), texelsize).yzx;
}
else
{
// x is dominant
off2 = CubeOffset(l, float2(0, 1), texelsize);
off3 = CubeOffset(l, float2(1, 0), texelsize);
off4 = CubeOffset(l, float2(1, 1), texelsize);
}
}
Perhaps a mathematical relation can be found between the comparisons (al.xyy < al.yzz) and the swizzles.
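One possible relation, offered only as a sketch: each comparison can be turned into a 0/1 mask (HLSL's step intrinsic does this), the masks select the swizzled input once, CubeOffset runs once per offset instead of once per branch, and the same masks undo the swizzle. The Python below is a CPU-side check that the masked form matches the branchy form. The helper names (step, swizzle, branchy, masked) and the max(0, ...) clamp inside cube_offset are assumptions made for illustration, and whether the equivalent HLSL fits into 64 arithmetic slots has not been verified.

```python
import math

def step(edge, x):
    # Same convention as HLSL step(edge, x): 1 when x >= edge, else 0.
    return 1.0 if x >= edge else 0.0

def swizzle(v, order):
    return tuple(v[i] for i in order)

def cube_offset(swiz, off, texel):
    # Port of CubeOffset from the question; the clamp is an added assumption
    # so that sqrt never sees a slightly negative argument.
    y = swiz[1] + 2.0 * off[0] * texel[0]
    z = swiz[2] + 2.0 * off[1] * texel[1]
    x = math.sqrt(max(0.0, 1.0 - (y * y + z * z)))
    return (-x if swiz[0] < 0 else x, y, z)

def branchy(l, off, texel):
    # Same decision tree as the nested ifs in the question.
    ax, ay, az = (abs(c) for c in l)
    if ax < ay:
        if ay < az:  # z is dominant: l.zxy in, .yzx out
            return swizzle(cube_offset(swizzle(l, (2, 0, 1)), off, texel), (1, 2, 0))
        # y is dominant: l.yxz in, .yxz out
        return swizzle(cube_offset(swizzle(l, (1, 0, 2)), off, texel), (1, 0, 2))
    if ax < az:      # z is dominant
        return swizzle(cube_offset(swizzle(l, (2, 0, 1)), off, texel), (1, 2, 0))
    return cube_offset(l, off, texel)  # x is dominant

def masked(l, off, texel):
    ax, ay, az = (abs(c) for c in l)
    mx = step(ay, ax) * step(az, ax)          # x dominant
    my = (1.0 - step(ay, ax)) * step(az, ay)  # y dominant
    mz = 1.0 - mx - my                        # otherwise z dominant
    # Blend the three swizzled inputs; exactly one mask is 1.
    s = tuple(mx * a + my * b + mz * c
              for a, b, c in zip(l, swizzle(l, (1, 0, 2)), swizzle(l, (2, 0, 1))))
    r = cube_offset(s, off, texel)
    # Undo the swizzle with the same masks.
    return tuple(mx * a + my * b + mz * c
                 for a, b, c in zip(r, swizzle(r, (1, 0, 2)), swizzle(r, (1, 2, 0))))
```

With this shape the shader would evaluate the square root three times (once per offset) rather than once per offset in every flattened branch, at the cost of a few multiply-adds for the blends.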
UPDATE: definition of CubeOffset:
float3 CubeOffset(float3 swiz, float2 off, float2 texelsize)
{
float3 ret;
ret.yz = swiz.yz + 2.0f * off * texelsize;
ret.x = sqrt(1.0f - dot(ret.yz, ret.yz));
if( swiz.x < 0 )
ret.x *= -1.0f;
return ret;
}
And the HLSL compiler errors when targeting SM 2.0:
error X5608: Compiled shader code uses too many arithmetic instruction slots (107).
Max. allowed by the target (ps_2_0) is 64.
error X5609: Compiled shader code uses too many instruction slots (111).
Max. allowed by the target (ps_2_0) is 96.
GLSL handles it fine. The goal is backward compatibility.
(btw. the algorithm is faulty, but that's not an issue right now)

Not really an optimization, but consider testing this.
An obvious option, though not always desirable and rarely the best one, is to move to the CPU whatever code does not fit in the shader (because of the instruction count, for example). For branching you can:
move the branch condition check to the CPU
split the branch bodies into separate shaders
bind the appropriate shader according to the result of the condition check
This is the simplest thing you can do, and there is no need to poke around in assembler. The catch is that it does not work when the condition is calculated inside the shader.
Hope it helps somehow.

I don't know whether this can be solved with SM 2.0, but considering the advances in GPU power, here is an SM 3.0 solution.
Please keep in mind that this code is a snippet from my own shader language (which is similar to HLSL):
template <int samples>
float PCFIrregularCUBE(sampler shadowmap, sampler noisetex, float3 ldir, float2 sloc, float2 texelsize)
{
const float kernelradius = 2.0f;
float3 l = normalize(ldir);
float3 al = abs(l);
float2 noise;
float2 rotated;
float sd, t, s;
float d = length(ldir);
noise = tex2D(noisetex, sloc);
noise = normalize(noise * 2.0f - 1.0f);
float2 rotmat0 = float2(noise.x, noise.y);
float2 rotmat1 = float2(-noise.y, noise.x);
float3 off;
s = 0;
for( int i = 0; i < samples; ++i ) {
rotated.x = dot(irreg_kernel[i], rotmat0) * kernelradius;
rotated.y = dot(irreg_kernel[i], rotmat1) * kernelradius;
if( al.x < al.y ) {
if( al.y < al.z )
off = CubeOffsetZXY(l, rotated, texelsize);
else
off = CubeOffsetYXZ(l, rotated, texelsize);
} else {
if( al.x < al.z )
off = CubeOffsetZXY(l, rotated, texelsize);
else
off = CubeOffsetXYZ(l, rotated, texelsize);
}
sd = texCUBE(shadowmap, off).r;
t = ((d > sd) ? 0.0f : 1.0f);
s += ((sd < 0.001f) ? 1.0f : t);
}
return s * (1.0f / samples);
}
The CubeOffsetXXX functions look like this:
float3 CubeOffsetZXY(float3 swiz, float2 off, float2 texelsize)
{
float3 ret;
ret.xy = swiz.xy + 2.0f * off * texelsize * swiz.z;
ret.z = sqrt(1.0f - dot(ret.xy, ret.xy));
if( swiz.z < 0 )
ret.z *= -1.0f;
return ret;
}
For more details, search for irregular PCF. The result at its worst (that is, when the camera is close) is:
Notice the "salt and pepper" noise caused by irregular PCF. From a distance it is perfectly acceptable (this is the method used in Crysis 1).
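The per-pixel rotation in PCFIrregularCUBE can be checked on the CPU with a minimal sketch like the one below. It assumes the noise value has already been remapped from [0, 1] to [-1, 1] and is non-zero; rotate_sample is a hypothetical name, not part of the answer's shader.

```python
import math

def rotate_sample(kernel_sample, noise, kernel_radius=2.0):
    # Normalizing the noise vector makes the two rows below a pure rotation.
    n = math.hypot(noise[0], noise[1])
    nx, ny = noise[0] / n, noise[1] / n
    rot0 = (nx, ny)    # corresponds to rotmat0 in the shader
    rot1 = (-ny, nx)   # corresponds to rotmat1 in the shader
    rx = (kernel_sample[0] * rot0[0] + kernel_sample[1] * rot0[1]) * kernel_radius
    ry = (kernel_sample[0] * rot1[0] + kernel_sample[1] * rot1[1]) * kernel_radius
    return (rx, ry)
```

Because the rows are orthonormal, every kernel sample keeps its length (scaled by the kernel radius) and only its direction changes from pixel to pixel, which is what trades banding for the noise visible in the close-up.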

Related

HLSL: Which DDX DDY are expected for TextureCube.SampleGrad()

I am wondering which DDX DDY values the SampleGrad() function expects for a TextureCube object.
I know that it's the change in UV coordinates for 2D textures. So I thought, it would be the change in the direction in this case. However, this does not seem to be the case.
I get different results if I try to use the Sample function vs. SampleGrad:
Sample:
// calculate reflected ray
float3 reflRay = reflect(-viewDir, normal);
// reflection map lookup
return reflectionMap.Sample(linearSampler, reflRay);
SampleGrad:
// calculate reflected ray
float3 reflRay = reflect(-viewDir, normal);
// reflection map lookup
float3 dxr = ddx(reflRay);
float3 dyr = ddy(reflRay);
return reflectionMap.SampleGrad(linearSampler, reflRay, dxr, dyr);

I still don't know which values are required for DDX and DDY, but I found an acceptable workaround that computes the level of detail from my gradients. Unfortunately, the quality of this solution is not as good as that of a real Sample call with anisotropic filtering.
In case anyone needs it:
The computation is described in: https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm#LODCalculation
My HLSL implementation:
// calculate reflected ray
float3 reflRay = reflect(-viewDir, normal);
// reflection map lookup
float3 dxr = ddx(reflRay);
float3 dyr = ddy(reflRay);
// cubemap size for lod computation
float reflWidth, reflHeight;
reflectionMap.GetDimensions(reflWidth, reflHeight);
// calculate lod based on raydiffs
float lod = calcLod(getCubeDiff(reflRay, dxr).xy * reflWidth, getCubeDiff(reflRay, dyr).xy * reflHeight);
return reflectionMap.SampleLevel(linearSampler, reflRay, lod).rgb;
Helper functions:
float pow2(float x) {
return x * x;
}
// calculates texture coordinates [-1, 1] for the view direction (xy values must be divided by axisMajorValue for proper [-1, 1] range).
// z coordinate is the faceId
float3 getCubeCoord(float3 viewDir, out float axisMajorValue)
{
// according to dx spec: https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm#PointSampling
// Choose the largest magnitude component of the input vector. Call this magnitude of this value AxisMajor. In the case of a tie, the following precedence should occur: Z, Y, X.
int axisMajor = 0;
int axisFlip = 0;
axisMajorValue = 0.0;
[unroll] for (int i = 0; i < 3; ++i)
{
if (abs(viewDir[i]) >= axisMajorValue)
{
axisMajor = i;
axisFlip = viewDir[i] < 0.0f ? 1 : 0;
axisMajorValue = abs(viewDir[i]);
}
}
int faceId = axisMajor * 2 + axisFlip;
// Select and mirror the minor axes as defined by the TextureCube coordinate space. Call this new 2d coordinate Position.
int axisMinor1 = axisMajor == 0 ? 2 : 0; // first coord is x or z
int axisMinor2 = 3 - axisMajor - axisMinor1;
// Project the coordinate onto the cube by dividing the components Position by AxisMajor.
//float u = viewDir[axisMinor1] / axisMajorValue;
//float v = -viewDir[axisMinor2] / axisMajorValue;
// don't project for getCubeDiff function!
float u = viewDir[axisMinor1];
float v = -viewDir[axisMinor2];
switch (faceId)
{
case 0:
case 5:
u *= -1.0f;
break;
case 2:
v *= -1.0f;
break;
}
return float3(u, v, float(faceId));
}
float3 getCubeDiff(float3 ray, float3 diff)
{
// from: https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm#LODCalculation
// Using TC, determine which component is of the largest magnitude, as when calculating the texel location. If any of the components are equivalent, precedence is as follows: Z, Y, X. The absolute value of this will be referred to as AxisMajor.
// select and mirror the minor axes of TC as defined by the TextureCube coordinate space to generate TC'.uv
float axisMajor;
float3 tuv = getCubeCoord(ray, axisMajor);
// select and mirror the minor axes of the partial derivative vectors as defined by the TextureCube coordinate space, generating 2 new partial derivative vectors dX'.uv & dY'.uv.
float derivateMajor;
float3 duv = getCubeCoord(diff, derivateMajor);
// Calculate 2 new dX and dY vectors for future calculations as follows:
// dX.uv = (AxisMajor*dX'.uv - TC'.uv*DerivativeMajorX)/(AxisMajor*AxisMajor)
float3 res;
res.z = 0.0;
res.xy = (axisMajor * duv.xy - tuv.xy * derivateMajor) / (axisMajor * axisMajor);
return res * 0.5;
}
// dx, dy in pixel coordinates
float calcLod(float2 dX, float2 dY)
{
// from: https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm#LODCalculation
float A = pow2(dX.y) + pow2(dY.y);
float B = -2.0 * (dX.x * dX.y + dY.x * dY.y);
float C = pow2(dX.x) + pow2(dY.x);
float F = pow2(dX.x * dY.y - dY.x * dX.y);
float p = A - C;
float q = A + C;
float t = sqrt(pow2(p) + pow2(B));
float lengthX = sqrt(abs(F * (t+p) / ( t * (q+t))) + abs(F * (t-p) / ( t * (q+t))));
float lengthY = sqrt(abs(F * (t-p) / ( t * (q-t))) + abs(F * (t+p) / ( t * (q-t))));
return log2(max(lengthX,lengthY));
}
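One thing worth checking in calcLod is the division by t: for a perfectly isotropic, axis-aligned footprint such as dX = (2, 0), dY = (0, 2), the values p and B are both zero, so t is zero and the expression becomes 0/0. The Python port below is a sketch that falls back to the simpler isotropic estimate, log2 of the longer derivative vector, in that case. The fallback, the eps threshold and the function names are assumptions added here, not part of the answer's HLSL.

```python
import math

def pow2(x):
    return x * x

def calc_lod_isotropic(dX, dY, eps=1e-12):
    # Simpler, swapped-in estimate: log2 of the longer derivative (in texels).
    rho = max(math.hypot(dX[0], dX[1]), math.hypot(dY[0], dY[1]))
    return math.log2(max(rho, eps))

def calc_lod(dX, dY, eps=1e-12):
    # Same terms as the HLSL calcLod above.
    A = pow2(dX[1]) + pow2(dY[1])
    B = -2.0 * (dX[0] * dX[1] + dY[0] * dY[1])
    C = pow2(dX[0]) + pow2(dY[0])
    F = pow2(dX[0] * dY[1] - dY[0] * dX[1])
    p = A - C
    q = A + C
    t = math.sqrt(pow2(p) + pow2(B))
    # t == 0: circular footprint; q == t: degenerate (zero-area) footprint.
    if t < eps or (q - t) < eps:
        return calc_lod_isotropic(dX, dY, eps)
    length_x = math.sqrt(abs(F * (t + p) / (t * (q + t))) + abs(F * (t - p) / (t * (q + t))))
    length_y = math.sqrt(abs(F * (t - p) / (t * (q - t))) + abs(F * (t + p) / (t * (q - t))))
    return math.log2(max(length_x, length_y))
```

For an axis-aligned footprint of 4 by 1 texels both paths agree on LOD 2; the guard only changes the result in the cases where the original expression would produce a NaN.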

DirectX Intel HD and NVIDIA different behavior Geometry Shader

My code uses a geometry shader to produce thick lines using this: https://forum.libcinder.org/topic/smooth-thick-lines-using-geometry-shader
(Uses geometry shader approach)
I got it to work on my local machine, which uses an Intel HD Graphics card.
However, with the same settings on my destination machine the lines are drawn with weird gaps.
I don't understand why, because it also works on a different Intel HD device.
Note that my target is an NVS 300, which is fairly old but, as far as I can tell, supports feature level 10_1 and geometry shaders. The Intel devices I tried may be a bit newer.
Since I force the feature level to 10_1 at device creation on my development machine, I expected no difference.
I don't see any error codes in the output that would explain the differing behavior, even with native code debugging or remote debugging set up.
Does anyone have a clue why this behaves differently?
I could add images, but basically you see a continuous thick sine curve on my local machine and a thick curve broken up by gaps on the target.
Thanks in advance for any clues.
cbuffer constBuffer
{
float THICKNESS;
float2 WIN_SCALE;
};
struct PSInput
{
float4 Position : SV_POSITION;
};
float2 toScreenSpace(float4 vertex)
{
//float2 WIN_SCALE = { 100.0f, 100.0f };
return float2(vertex.xy) * WIN_SCALE;
}
[maxvertexcount(7)]
void main(lineadj float4 vertices[4] : SV_POSITION, inout TriangleStream<PSInput> triStream)
{
//float2 WIN_SCALE = { 100.0f, 100.0f };
float2 p0 = toScreenSpace(vertices[0]); // start of previous segment
float2 p1 = toScreenSpace(vertices[1]); // end of previous segment, start of current segment
float2 p2 = toScreenSpace(vertices[2]); // end of current segment, start of next segment
float2 p3 = toScreenSpace(vertices[3]); // end of next segment
// perform naive culling
float2 area = WIN_SCALE * 1.2;
if (p1.x < -area.x || p1.x > area.x)
return;
if (p1.y < -area.y || p1.y > area.y)
return;
if (p2.x < -area.x || p2.x > area.x)
return;
if (p2.y < -area.y || p2.y > area.y)
return;
float2 v0 = normalize(p1 - p0);
float2 v1 = normalize(p2 - p1);
float2 v2 = normalize(p3 - p2);
// determine the normal of each of the 3 segments (previous, current, next)
float2 n0 = { -v0.y, v0.x};
float2 n1 = { -v1.y, v1.x};
float2 n2 = { -v2.y, v2.x};
// determine miter lines by averaging the normals of the 2 segments
float2 miter_a = normalize(n0 + n1); // miter at start of current segment
float2 miter_b = normalize(n1 + n2); // miter at end of current segment
// determine the length of the miter by projecting it onto normal and then inverse it
//float THICKNESS = 10;
float length_a = THICKNESS / dot(miter_a, n1);
float length_b = THICKNESS / dot(miter_b, n1);
float MITER_LIMIT = -1;
//float MITER_LIMIT = -1;
//float MITER_LIMIT = 1;
PSInput v;
float2 temp;
//// prevent excessively long miters at sharp corners
if (dot(v0, v1) < -MITER_LIMIT)
{
miter_a = n1;
length_a = THICKNESS;
// close the gap
if (dot(v0, n1) > 0)
{
temp = (p1 + THICKNESS * n0) / WIN_SCALE;
v.Position = float4(temp, 0, 1.0);
triStream.Append(v);
temp = (p1 + THICKNESS * n1) / WIN_SCALE;
v.Position = float4(temp, 0, 1.0);
triStream.Append(v);
v.Position = float4(p1 / WIN_SCALE, 0, 1.0);
triStream.Append(v);
triStream.RestartStrip();
}
else
{
temp = (p1 - THICKNESS * n1) / WIN_SCALE;
v.Position = float4(temp, 0, 1.0);
triStream.Append(v);
temp = (p1 - THICKNESS * n0) / WIN_SCALE;
v.Position = float4(temp, 0, 1.0);
triStream.Append(v);
v.Position = float4(p1 / WIN_SCALE, 0, 1.0);
triStream.Append(v);
triStream.RestartStrip();
}
}
if (dot(v1, v2) < -MITER_LIMIT)
{
miter_b = n1;
length_b = THICKNESS;
}
// generate the triangle strip
temp = (p1 + length_a * miter_a) / WIN_SCALE;
v.Position = float4(temp, 0, 1.0);
triStream.Append(v);
temp = (p1 - length_a * miter_a) / WIN_SCALE;
v.Position = float4(temp, 0, 1.0);
triStream.Append(v);
temp = (p2 + length_b * miter_b) / WIN_SCALE;
v.Position = float4(temp, 0, 1.0);
triStream.Append(v);
temp = (p2 - length_b * miter_b) / WIN_SCALE;
v.Position = float4(temp, 0, 1.0);
triStream.Append(v);
triStream.RestartStrip();
}
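The miter math in the shader can be reproduced on the CPU to see where it becomes fragile. The sketch below assumes one possible cause only: if two consecutive input points coincide (for example duplicated end points used as adjacency vertices), normalize(p1 - p0) operates on a zero vector, the result is undefined, and drivers may handle it differently. The names safe_normalize and miter and the fallback values are hypothetical; nothing here is confirmed to be the reason for the gaps on the NVS 300.

```python
import math

def safe_normalize(v, eps=1e-8):
    # Returns a zero vector instead of dividing by zero (assumed fallback).
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n) if n > eps else (0.0, 0.0)

def miter(p0, p1, p2, thickness, eps=1e-8):
    # Directions of the previous and the current segment.
    v0 = safe_normalize((p1[0] - p0[0], p1[1] - p0[1]))
    v1 = safe_normalize((p2[0] - p1[0], p2[1] - p1[1]))
    # Segment normals, as in the shader: n = (-v.y, v.x).
    n0 = (-v0[1], v0[0])
    n1 = (-v1[1], v1[0])
    # Miter direction is the normalized average of the two normals.
    m = safe_normalize((n0[0] + n1[0], n0[1] + n1[1]))
    d = m[0] * n1[0] + m[1] * n1[1]
    # Project onto the normal and invert; fall back to plain thickness
    # when the projection vanishes (segments folding back on themselves).
    length = thickness / d if d > eps else thickness
    return m, length
```

A straight joint gives a miter length equal to the thickness, a 90 degree corner gives thickness times the square root of two, and a duplicated point degrades to the plain segment normal instead of producing NaNs.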

DX 11 Compute Shader\SharpDX Deferred Tiled lighting, Point light problems

I have just finished porting my engine from XNA to SharpDX (DX11).
Everything is going really well and I have solved most of my issues without asking for help, but now I'm really stuck. Maybe I just need another set of eyes to look over my code, so here it is.
I'm implementing tile-based lighting (point lights only for now), and I'm basing my code on the Intel sample because it's not as messy as the ATI one.
My problem is that the lights move with the camera. I have looked all over the place for a fix and have tried everything (am I crazy?).
I made sure all my normal and light vectors are in view space and normalized (still the same).
I have tried the inverse view matrix, the inverse projection matrix, a mix of both, and a few other bits from around the net, but I can't fix it.
Here is my CPU code:
Dim viewSpaceLPos As Vector3 = Vector3.Transform(New Vector3(pointlight.PosRad.X, pointlight.PosRad.Y, pointlight.PosRad.Z), Engine.Camera.EyeTransform)
Dim lightMatrix As Matrix = Matrix.Scaling(pointlight.PosRad.W) * Matrix.Translation(New Vector3(pointlight.PosRad.X, pointlight.PosRad.Y, pointlight.PosRad.Z))
Here is my CS shader code:
[numthreads(GROUP_WIDTH, GROUP_HEIGHT, GROUP_DEPTH)]
void TileLightingCS(uint3 dispatchThreadID : SV_DispatchThreadID, uint3 GroupID : SV_GroupID, uint3 GroupThreadID : SV_GroupThreadID)
{
int2 globalCoords = dispatchThreadID.xy;
uint groupIndex = GroupThreadID.y * GROUP_WIDTH + GroupThreadID.x;
float minZSample = FrameBufferCamNearFar.x;
float maxZSample = FrameBufferCamNearFar.y;
float2 gbufferDim;
DepthBuffer.GetDimensions(gbufferDim.x, gbufferDim.y);
float2 screenPixelOffset = float2(2.0f, -2.0f) / gbufferDim;
float2 positionScreen = (float2(globalCoords)+0.5f) * screenPixelOffset.xy + float2(-1.0f, 1.0f);
float depthValue = DepthBuffer[globalCoords].r;
float3 positionView = ComputePositionViewFromZ(positionScreen, Projection._43 / (depthValue - Projection._33));
// Avoid shading skybox/background or otherwise invalid pixels
float viewSpaceZ = positionView.z;
bool validPixel = viewSpaceZ >= FrameBufferCamNearFar.x && viewSpaceZ < FrameBufferCamNearFar.y;
[flatten] if (validPixel)
{
minZSample = min(minZSample, viewSpaceZ);
maxZSample = max(maxZSample, viewSpaceZ);
}
// How many total lights?
uint totalLights, dummy;
InputBuffer.GetDimensions(totalLights, dummy);
// Initialize shared memory light list and Z bounds
if (groupIndex == 0)
{
sTileNumLights = 0;
sMinZ = 0x7F7FFFFF; // Max float
sMaxZ = 0;
}
GroupMemoryBarrierWithGroupSync();
if (maxZSample >= minZSample) {
InterlockedMin(sMinZ, asuint(minZSample));
InterlockedMax(sMaxZ, asuint(maxZSample));
}
GroupMemoryBarrierWithGroupSync();
float minTileZ = asfloat(sMinZ);
float maxTileZ = asfloat(sMaxZ);
// Work out scale/bias from [0, 1]
float2 tileScale = float2(FrameBufferCamNearFar.zw) * rcp(float(2 * GROUP_WIDTH));
float2 tileBias = tileScale - float2(GroupID.xy);
// Now work out composite projection matrix
// Relevant matrix columns for this tile frusta
float4 c1 = float4(Projection._11 * tileScale.x, 0.0f, tileBias.x, 0.0f);
float4 c2 = float4(0.0f, -Projection._22 * tileScale.y, tileBias.y, 0.0f);
float4 c4 = float4(0.0f, 0.0f, 1.0f, 0.0f);
// Derive frustum planes
float4 frustumPlanes[6];
// Sides
frustumPlanes[0] = c4 - c1;
frustumPlanes[1] = c4 + c1;
frustumPlanes[2] = c4 - c2;
frustumPlanes[3] = c4 + c2;
// Near/far
frustumPlanes[4] = float4(0.0f, 0.0f, 1.0f, -minTileZ);
frustumPlanes[5] = float4(0.0f, 0.0f, -1.0f, maxTileZ);
// Normalize frustum planes (near/far already normalized)
[unroll] for (uint i = 0; i < 4; ++i)
{
frustumPlanes[i] *= rcp(length(frustumPlanes[i].xyz));
}
// Cull lights for this tile
for (uint lightIndex = groupIndex; lightIndex < totalLights; lightIndex += (GROUP_WIDTH * GROUP_HEIGHT))
{
PointLight light = InputBuffer[lightIndex];
float3 lightVS = light.PosRad.xyz;// mul(float4(light.Pos.xyz, 1), View);
// Cull: point light sphere vs tile frustum
bool inFrustum = true;
[unroll]
for (uint i = 0; i < 6; ++i)
{
float d = dot(frustumPlanes[i], float4(lightVS, 1.0f));
inFrustum = inFrustum && (d >= -light.PosRad.w);
}
[branch]
if (inFrustum)
{
uint listIndex;
InterlockedAdd(sTileNumLights, 1, listIndex);
sTileLightIndices[listIndex] = lightIndex;
}
}
GroupMemoryBarrierWithGroupSync();
uint numLights = sTileNumLights;
if (all(globalCoords < FrameBufferCamNearFar.zw))
{
float4 NormalMap = NormalBuffer[globalCoords];
float3 normal = DecodeNormal(NormalMap);
if (numLights > 0)
{
float3 lit = float3(0.0f, 0.0f, 0.0f);
for (uint tileLightIndex = 0; tileLightIndex < numLights; ++tileLightIndex)
{
PointLight light = InputBuffer[sTileLightIndices[tileLightIndex]];
float3 lDir = light.PosRad.xyz - positionView;
lDir = normalize(lDir);
float3 nl = saturate(dot(lDir, normal));
lit += ((light.Color.xyz * light.Color.a) * nl) * 0.1f;
}
PointLightColor[globalCoords] = float4(lit, 1);
}
else
{
PointLightColor[globalCoords] = 0;
}
}
GroupMemoryBarrierWithGroupSync();
}
So I know the culling works because there are lights drawn, they just move with the camera.
Could it be a handedness issue?
Am I setting my CPU light code up right?
Have I messed my spaces up?
What am I missing?
Am I reconstructing my position from depth wrong? (don't think it's this because the culling works)
ps. I write depth out like this:
VS shader
float4 viewSpacePos = mul(float4(input.Position,1), WV);
output.Depth=viewSpacePos.z ;
PS Shader
-input.Depth.x / FarClip
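One thing that may be worth ruling out, sketched here under stated assumptions: the expression Projection._43 / (depthValue - Projection._33) inverts a post-projection depth value, whereas the pixel shader above writes a linear view-space depth divided by FarClip. If DepthBuffer holds the linear value rather than the hardware depth, the two encodings need different inverses. The Python below assumes a left-handed D3D-style projection and positive view-space z; the function names and the near/far values are hypothetical.

```python
def proj_terms(near, far):
    # _33 and _43 of a left-handed D3D-style perspective matrix (assumption).
    m33 = far / (far - near)
    m43 = -near * far / (far - near)
    return m33, m43

def encode_projected(z_view, m33, m43):
    # Post-projection depth: (z * _33 + _43) / z
    return m33 + m43 / z_view

def decode_projected(depth, m33, m43):
    # Matches Projection._43 / (depthValue - Projection._33) in the shader.
    return m43 / (depth - m33)

def encode_linear(z_view, far):
    # Matches writing view-space depth divided by FarClip (sign ignored here).
    return z_view / far

def decode_linear(depth, far):
    return depth * far
```

Each encoding round-trips with its own decoder, but feeding a linear value into decode_projected yields a very different view-space z, which would shift the reconstructed positions relative to the lights.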

XNA Projected texture in two directions (one is opposite direction)

I created a projector with:
Matrix.CreateLookAt(position, direction, Vector3.Up);
Matrix.CreatePerspectiveFieldOfView(MathHelper.ToRadians(45), 1, 1, 2);
I pass to the shader multiplication of these matrices (in shader called View), then in shader I do:
float4 proj(float3 Position)
{
float4 texCoord = mul(float4(Position, 1.0), View);
texCoord.x = ( (texCoord.x / texCoord.w)/2) + 0.5;
texCoord.y = (-(texCoord.y / texCoord.w)/2) + 0.5;
return tex2D(shape, texCoord.xy);
}
The texture's UVW addressing is set to Clamp. I use it in the lighting stage of deferred shading. Resulting image (red arrow is the correct direction):
image
What should I do to make it go only in correct direction?
SOLVED:
The problem was back projection, which was simply solved:
float4 proj(float3 Position)
{
float4 texCoord = mul(float4(Position, 1.0), View);
if(texCoord.z < 0)
return 0;
texCoord.x = ( (texCoord.x / texCoord.w)/2) + 0.5;
texCoord.y = (-(texCoord.y / texCoord.w)/2) + 0.5;
return tex2D(shape, texCoord.xy);
}

The problem is back projection, which is simply solved:
float4 proj(float3 Position)
{
float4 texCoord = mul(float4(Position, 1.0), View);
if(texCoord.z < 0)
return 0;
texCoord.x = ( (texCoord.x / texCoord.w)/2) + 0.5;
texCoord.y = (-(texCoord.y / texCoord.w)/2) + 0.5;
return tex2D(shape, texCoord.xy);
}
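Why the early return works can be seen from the perspective divide: a point behind the projector has a negative w, so dividing x and y by w flips both signs and the point lands on the same texture coordinates as its mirror image in front. The sketch below assumes a right-handed XNA-style perspective matrix applied to row vectors and a projector sitting at the origin looking down -Z (so the look-at matrix is the identity); project and uv are hypothetical helper names.

```python
import math

def project(pos, fov_deg=45.0, aspect=1.0, near=1.0, far=2.0):
    # Clip-space position for a projector at the origin looking down -Z.
    x, y, z = pos
    f = 1.0 / math.tan(math.radians(fov_deg) / 2.0)
    clip_z = z * far / (near - far) + near * far / (near - far)
    return (x * f / aspect, y * f, clip_z, -z)

def uv(clip):
    # Same mapping as proj() in the shader.
    u = (clip[0] / clip[3]) / 2.0 + 0.5
    v = -(clip[1] / clip[3]) / 2.0 + 0.5
    return (u, v)

front = project((0.2, 0.1, -1.5))    # in front of the projector
back = project((-0.2, -0.1, 1.5))    # mirrored point behind it
```

Both points map to the same uv, but only the one behind the projector has a negative clip-space z (and a negative w), which is what the added test rejects.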

Pixel Shader performance on xbox

I've got a pixel shader (below) that I'm using with XNA. On my laptop (crappy graphics card) it runs a little jerky, but OK. I've just tried running it on the Xbox and it's horrible!
There's nothing else to the game (it's just a fractal renderer), so it has to be the pixel shader causing the issues. I also think it's the PS code because when I lower the iterations it's OK. I've also checked, and the GC delta is zero.
Are there any HLSL functions that are no-nos on the Xbox? I must be doing something wrong here; performance can't be that bad!
#include "FractalBase.fxh"
float ZPower;
float3 Colour;
float3 ColourScale;
float ComAbs(float2 Arg)
{
return sqrt(Arg.x * Arg.x + Arg.y * Arg.y);
}
float2 ComPow(float2 Arg, float Power)
{
float Mod = pow(Arg.x * Arg.x + Arg.y * Arg.y, Power / 2);
float Ang = atan2(Arg.y, Arg.x) * Power;
return float2(Mod * cos(Ang), Mod * sin(Ang));
}
float4 FractalPixelShader(float2 texCoord : TEXCOORD0, uniform float Iterations) : COLOR0
{
float2 c = texCoord.xy;
float2 z = 0;
float i;
float oldBailoutTest = 0;
float bailoutTest = 0;
for(i = 0; i < Iterations; i++)
{
z = ComPow(z, ZPower) + c;
bailoutTest = z.x * z.x + z.y * z.y;
if(bailoutTest >= ZPower * ZPower)
{
break;
}
oldBailoutTest = bailoutTest;
}
float normalisedIterations = i / Iterations;
float factor = (bailoutTest - oldBailoutTest) / (ZPower * ZPower - oldBailoutTest);
float4 Result = normalisedIterations + (1 / factor / Iterations);
Result = (i >= Iterations - 1) ? float4(0.0, 0.0, 0.0, 1.0) : float4(Result.x * Colour.r * ColourScale.x, Result.y * Colour.g * ColourScale.y, Result.z * Colour.b * ColourScale.z, 1);
return Result;
}
technique Technique1
{
pass
{
VertexShader = compile vs_3_0 SpriteVertexShader();
PixelShader = compile ps_3_0 FractalPixelShader(128);
}
}
Below is FractalBase.fxh:
float4x4 MatrixTransform : register(vs, c0);
float2 Pan;
float Zoom;
float Aspect;
void SpriteVertexShader(inout float4 Colour : COLOR0,
inout float2 texCoord : TEXCOORD0,
inout float4 position : SV_Position)
{
position = mul(position, MatrixTransform);
// Convert the position into from screen space into complex coordinates
texCoord = (position) * Zoom * float2(1, Aspect) - float2(Pan.x, -Pan.y);
}
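ComPow in the shader above evaluates pow, atan2, cos and sin on every iteration. If ZPower happens to be exactly 2 (an assumption; the question does not state its value), the same result follows from plain multiplication, which may be cheaper on the GPU, although that has not been measured on the Xbox. The sketch below compares the two forms on the CPU; com_square is a hypothetical name.

```python
import math

def com_pow(arg, power):
    # Polar form, as in the shader's ComPow.
    mod = (arg[0] * arg[0] + arg[1] * arg[1]) ** (power / 2.0)
    ang = math.atan2(arg[1], arg[0]) * power
    return (mod * math.cos(ang), mod * math.sin(ang))

def com_square(arg):
    # (x + iy)^2 = (x^2 - y^2) + i(2xy): no transcendental functions needed.
    return (arg[0] * arg[0] - arg[1] * arg[1], 2.0 * arg[0] * arg[1])
```

For other integer powers the same idea applies by repeated complex multiplication; non-integer powers still need the polar form.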
EDIT: I did try removing the conditional by using lots of lerps, but when I did that I got loads of artifacts (and not the kind that "belong in a museum"!). I changed things around and fixed a few logic errors, but the key was to multiply the GreaterThan result by 1 + epsilon, so that a rounding error yielding 0.9999 is not truncated to 0 when converted to an integer. See the fixed code below:
#include "FractalBase.fxh"
float ZPower;
float3 Colour;
float3 ColourScale;
float ComAbs(float2 Arg)
{
return sqrt(Arg.x * Arg.x + Arg.y * Arg.y);
}
float2 ComPow(float2 Arg, float Power)
{
float Mod = pow(Arg.x * Arg.x + Arg.y * Arg.y, Power / 2);
float Ang = atan2(Arg.y, Arg.x) * Power;
return float2(Mod * cos(Ang), Mod * sin(Ang));
}
float GreaterThan(float x, float y)
{
return ((x - y) / (2 * abs(x - y)) + 0.5) * 1.001;
}
float4 FractalPixelShader(float2 texCoord : TEXCOORD0, uniform float Iterations) : COLOR0
{
float2 c = texCoord.xy;
float2 z = 0;
int i;
float oldBailoutTest = 0;
float bailoutTest = 0;
int KeepGoing = 1;
int DoneIterations = Iterations;
int Bailout = 0;
for(i = 0; i < Iterations; i++)
{
z = lerp(z, ComPow(z, ZPower) + c, KeepGoing);
bailoutTest = lerp(bailoutTest, z.x * z.x + z.y * z.y, KeepGoing);
Bailout = lerp(Bailout, GreaterThan(bailoutTest, ZPower * ZPower), -abs(Bailout) + 1);
KeepGoing = lerp(KeepGoing, 0.0, Bailout);
DoneIterations = lerp(DoneIterations, min(i, DoneIterations), Bailout);
oldBailoutTest = lerp(oldBailoutTest, bailoutTest, KeepGoing);
}
float normalisedIterations = DoneIterations / Iterations;
float factor = (bailoutTest - oldBailoutTest) / (ZPower * ZPower - oldBailoutTest);
float4 Result = normalisedIterations + (1 / factor / Iterations);
Result = (DoneIterations >= Iterations - 1) ? float4(0.0, 0.0, 0.0, 1.0) : float4(Result.x * Colour.r * ColourScale.x, Result.y * Colour.g * ColourScale.y, Result.z * Colour.b * ColourScale.z, 1);
return Result;
}
technique Technique1
{
pass
{
VertexShader = compile vs_3_0 SpriteVertexShader();
PixelShader = compile ps_3_0 FractalPixelShader(128);
}
}
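The arithmetic comparison used in GreaterThan can be checked in isolation. The sketch below mirrors the formula and puts it next to a step-style comparison, which is how HLSL's step(y, x) intrinsic behaves (1 when x >= y, otherwise 0). The Python function names are illustrative only; note that the arithmetic form divides by zero when x equals y.

```python
def greater_than(x, y):
    # Mirrors the shader: sign(x - y) / 2 + 0.5, scaled by 1 + epsilon.
    # Undefined for x == y (division by zero; 0/0 on the GPU).
    return ((x - y) / (2.0 * abs(x - y)) + 0.5) * 1.001

def step(edge, x):
    # Same convention as HLSL step(edge, x).
    return 1.0 if x >= edge else 0.0
```

On the CPU greater_than(3.0, 1.0) is slightly above 1 and truncates to 1, while greater_than(1.0, 3.0) is 0. The 1.001 factor matters on hardware where the division is approximate and the unscaled value could come out just below 1.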

The Xbox has a pretty large block size, so branching on the Xbox isn't always so great. Also, the compiler isn't always the most effective at emitting dynamic branches, which your code seems to use.
Look into the branch attribute: http://msdn.microsoft.com/en-us/library/bb313972%28v=xnagamestudio.31%29.aspx
Also, if you remove the early bailout, does the PC performance become more similar to the Xbox?
Keep in mind that modern graphics cards are actually quite a bit faster than the Xenon unit by now.
