DirectX: Pixel Shader derivative calculation

As far as I understand, a pixel shader operates on a per-pixel basis. But there are functions like ddx and ddy that calculate derivatives. How can one calculate derivatives from just one pixel's coordinates? Can someone help me with this? It also raises questions, as in
tex.Sample(s0, t0);
Does that mean the Sample function is evaluated on a per-pixel basis? I thought sampler instructions operate on a per-subspan basis.
Example:
If I have the following 16 pixels:
* * * *
* * * *
* * * *
* * * *
And my pixel shader looks like this:
float4 PS(PS_INPUT input) : SV_Target
{
    float2 derivX = ddx_fine(input.tex);
    float2 derivY = ddy_fine(input.tex);   // ddy_fine for the vertical derivative
    return tex.SampleGrad(s0, input.tex, derivX, derivY);
}
How many times will the above code be called for a 4 x 4 grid of pixels?
Thanks.

A pixel shader is the program for exactly one fragment, but all pixels are shaded in 2x2 blocks. The pixel shaders run simultaneously for these 4 pixels, so derivatives for mipmapping etc. can be computed. If you call tex.Sample, it is called for all pixels in the respective block, so the gradient can be determined. This is the reason why gradient functions cannot be used in branching ifs or loops: the program flow could differ within the fragment quad.
There is a good description of how mipmapping works, for example: http://www.arcsynthesis.org/gltut/Texturing/Tut15%20How%20Mipmapping%20Works.html
Regarding your edit (which should be answered by my link, but anyway): I'm not sure how it's implemented, but in the best case your shader would be run 16 times. The pixels are grouped into 2x2 quads. There are cases at triangle edges where pixels are computed and then discarded, but probably only with slanted edges.
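To make the quad behaviour concrete, here is a minimal HLSL sketch (the resource declarations and register slots are assumed, not taken from the question): because ddx/ddy are just differences between neighbouring invocations of the same 2x2 quad, a plain Sample is conceptually equivalent to a SampleGrad fed with those derivatives.
// Assumed declarations; the question does not show them.
Texture2D    tex : register(t0);
SamplerState s0  : register(s0);

struct PS_INPUT
{
    float4 pos : SV_Position;
    float2 tex : TEXCOORD0;   // interpolated texture coordinate
};

float4 PS(PS_INPUT input) : SV_Target
{
    // Differences to the horizontal/vertical neighbour within the 2x2 quad.
    float2 dx = ddx_fine(input.tex);
    float2 dy = ddy_fine(input.tex);

    // Sample picks its mip level from the same quad derivatives internally,
    // so these two calls should give matching results.
    float4 implicitLod = tex.Sample(s0, input.tex);
    float4 explicitLod = tex.SampleGrad(s0, input.tex, dx, dy);

    return explicitLod;
}
For the 4 x 4 grid in the question that means four 2x2 quads, i.e. 16 invocations running in lock-step groups of four.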

Related

Direct3D11: "gradient instruction used in a loop with varying iteration, forcing loop to unroll", warning: X3570

I'm working on a graphics engine using Direct3D 11 and Visual Studio 2015. In the HLSL shaders for the main draw calls, I sample shadow maps for directional and point lights with percentage-closer-filtering, i.e. I sample a small square area around the target shadow map texel and average the results to get soft shadows. Now, every call to shadowMap_.Sample(...) creates a warning: "gradient instruction used in a loop with varying iteration, forcing loop to unroll" (X3570). I want to fix this or, if that is not possible, hide the warning as it completely floods my warning output.
I tried searching online for the error message and couldn't find any further descriptions. I couldn't even find an explanation of what a gradient instruction is supposed to be. I checked the Microsoft documentation for a different sampler or sampling function that lets me replace the loop with native sampling functionality, but didn't find anything like that either. Here is the function I use for sampling my shadow cube maps for point lights:
float getPointShadowValue(in uint index, in float3 worldPosition)
{
    // (Half-)Radius for percentage closer filtering
    int hFilterRadius = 2;

    // Calculate the vector inside the cube that points to the fragment
    float3 fragToLight = worldPosition.xyz - pointEmitters_[index].position.xyz;

    // Calculate the depth of the current fragment
    float currentDepth = length(fragToLight);

    float sum = 0.0;
    for (float z = -hFilterRadius; z <= hFilterRadius; z++)
    {
        for (float y = -hFilterRadius; y <= hFilterRadius; y++)
        {
            for (float x = -hFilterRadius; x <= hFilterRadius; x++)
            {
                // Offset the currently targeted cube map texel and sample at that position
                float3 off = float3(x, y, z) * 0.05;
                float closestDepth = pointShadowMaps_.Sample(sampler_, float4(fragToLight + off, index)).x * farPlane_;
                sum += (currentDepth - 0.1 > closestDepth ? 1.0 : 0.0);
            }
        }
    }

    // Calculate the average and return the shadow value clamped to [0, 1]
    float shadow = sum / (pow(hFilterRadius * 2 + 1, 3));
    return min(shadow, 1.0);
}
The code still works fine as it is, but I get a huge amount of these warnings and don't know if this causes a relevant performance impact. Any further information about the warning and what can be done about it is greatly appreciated.
Thanks in advance.
Gradient functions are all texture sampling methods that determine the mip level to use by themselves, such as the Sample method you are using. To do this they use ddx (doc) and ddy (doc) internally. Fragments are computed on the GPU in 2x2 chunks, so the difference in the texture coordinate between neighbouring fragments can be compared; the larger the difference, the higher the mip level that is used. With dynamic branching this method no longer works, because it is not guaranteed that each fragment takes the same computation path, so gradient functions don't work within dynamic branches. As loops use branching, the compiler has to make them static in order to use gradient functions. In your case this is done by unrolling, since the loops always run the same number of iterations: the compiler has already detected this and compiles your loops by writing out all steps one after another, producing non-branching code. With the [unroll] (doc) attribute you can hint the compiler to do so and suppress the warning.
Another way for your code would be to use sampling methods that aren't gradient functions, such as SampleLevel (doc), where you pass the desired mip level yourself (0 in your case, as your shadow map doesn't have mip levels) and the GPU doesn't have to determine it. As far as I know the performance impact is negligible, as this happens at a very low level where most such functions are processed equally fast on the GPU, but perhaps you should do your own tests.
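As a hedged sketch of that change, based on the loop body in the question (and assuming sampler_ is a plain SamplerState, as the existing Sample call suggests), the inner lookup would become:
// SampleLevel takes an explicit mip level (0 here), so it is not a gradient
// instruction and no longer forces the surrounding loops to unroll.
float closestDepth = pointShadowMaps_.SampleLevel(sampler_,
                                                  float4(fragToLight + off, index),
                                                  0).x * farPlane_;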
One addition that doesn't apply to your case, but is a further non-gradient method to fetch texels: Load (doc) directly fetches a specific texel by its integer texel index.
As Chuck Walbourn already stated, adding an [unroll] statement before the for loops fixes the warnings. This type of warning is basically the compiler informing you that a loop can't be unrolled, or that it would be less performant to do so (as can be read in the Microsoft documentation for the HLSL for-loop). I assume this can be safely accepted.
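For reference, here is a minimal sketch of the loop headers from the function above with the [unroll] attribute added (the loop bodies stay exactly as they are; since hFilterRadius is a compile-time constant, the trip counts are fixed):
[unroll]
for (float z = -hFilterRadius; z <= hFilterRadius; z++)
{
    [unroll]
    for (float y = -hFilterRadius; y <= hFilterRadius; y++)
    {
        [unroll]
        for (float x = -hFilterRadius; x <= hFilterRadius; x++)
        {
            // ... unchanged sampling and accumulation code ...
        }
    }
}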

How to correctly linearize depth in OpenGL ES in iOS?

I'm trying to render a forest scene for an iOS app with OpenGL. To make it a little bit nicer, I'd like to add a depth effect to the scene. However, I need a linearized depth value from the OpenGL depth buffer to do so. Currently I am using a computation in the fragment shader (which I found here).
Therefore my terrain fragment shader looks like this:
#version 300 es
precision mediump float;

layout(location = 0) out lowp vec4 out_color;

// Near/far plane distances - assumed uniforms, set from the application
uniform float nearz;
uniform float farz;

float linearizeDepth(float depth) {
    return 2.0 * nearz / (farz + nearz - depth * (farz - nearz));
}

void main(void) {
    float depth = gl_FragCoord.z;
    float linearized = linearizeDepth(depth);
    out_color = vec4(linearized, linearized, linearized, 1.0);
}
However, this results in the following output:
As you can see, the "further" you get away, the more "stripy" the resulting depth value gets (especially behind the ship). If the terrain tile is close to the camera, the output is somewhat okay.
I even tried another computation:
float linearizeDepth(float depth) {
    return 2.0 * nearz * farz / (farz + nearz - (2.0 * depth - 1.0) * (farz - nearz));
}
which resulted in a way too high value so I scaled it down by dividing:
float linearized = (linearizeDepth(depth) - 2.0) / 40.0;
Nevertheless, it gave a similar result.
So how do I achieve a smooth, linear transition between the near and the far plane, without any stripes? Has anybody had a similar problem?
The problem is that you store non-linear depth values which are truncated, so when you read the depth values back later you get a choppy result, because you lose accuracy the further you are from the znear plane. No matter how you post-process the values, you will not obtain better results unless you:
Lower the accuracy loss
You can change the znear and zfar values so they are closer together. Enlarge znear as much as you can, so the more accurate range covers more of your scene.
Another option is to use more bits per depth value (16 bits is too low). I am not sure if you can do this in OpenGL ES, but in standard OpenGL you can use 24 or 32 bits on most cards.
Use a linear depth buffer
So store linear values in the depth buffer. There are two ways. One is to compute the depth so that after all the underlying operations you end up with a linear value.
Another option is to use a separate texture/FBO and store the linear depth directly in it. The problem is that you cannot use its contents in the same rendering pass.
[Edit1] Linear Depth buffer
To linearize the depth buffer itself (not just the values read from it), try this:
Vertex:
varying float depth;
void main()
{
    vec4 p = ftransform();
    depth = p.z;
    gl_Position = p;
    gl_FrontColor = gl_Color;
}
Fragment:
uniform float znear, zfar;
varying float depth; // original z in camera space instead of gl_FragCoord.z, which is already truncated
void main(void)
{
    float z = (depth - znear) / (zfar - znear);
    gl_FragDepth = z;
    gl_FragColor = gl_Color;
}
Non linear Depth buffer linearized on CPU side (as you do):
Linear Depth buffer GPU side (as you should):
The scene parameters are:
// 24 bits per Depth value
const double zang = 60.0;
const double znear= 0.01;
const double zfar =20000.0;
and a simple rotated plate covering the whole depth field of view. Both images were taken by glReadPixels(0,0,scr.xs,scr.ys,GL_DEPTH_COMPONENT,GL_FLOAT,zed); and converted to a 2D RGB texture on the CPU side, then rendered as a single QUAD covering the whole screen with unit matrices ...
Now, to obtain the original depth value from the linear depth buffer you just do this:
z = znear + (zfar-znear)*depth_value;
I used the ancient fixed-function stuff just to keep this simple, so port it to your profile ...
Beware, I do not code in OpenGL ES or iOS, so I hope I did not miss something related to that (I am used to Windows and PC).
To show the difference I added another rotated plate to the same scene (so they intersect) and used colored output (no depth readback anymore):
As you can see, the linear depth buffer is much better (for scenes covering a large part of the depth FOV).

Why the sphere texture map does not match correctly

I have a sphere in my 3D project and an Earth texture, and I use the algorithm from the wiki to calculate the texture coordinates.
The code in my effect file looks like this:
float pi = 3.14159265359f;
output.uvCoords.x = 0.5 + atan2(input.normal.z, input.normal.x) / (2 * pi);
output.uvCoords.y = 0.5f - asin(input.normal.y) / pi;
The result is shown in the pictures below:
1. Look from the left (there is a line; this is my question)
2. Look from the front
3. Look from the right
This doesn't pretend to be a complete answer at all, but here are some ideas:
Try 6.28 instead of 6.18, because 3.14 * 2 = 6.28. It is always a good idea to create variables or macros instead of plain numbers, to prevent such sad mistakes in the future.
Try to use a more precise value of pi (more digits to the right of the decimal point).
Try to normalize the normal vector before the calculations.
Even better, calculate the texcoords on the CPU once and for all, instead of on each shader invocation. You can use any asset library for this purpose, or just quickly move your HLSL code over to the main code.
#define PI 3.14159265359f
#define PImul2 6.28318530718f // pi*2
#define PIdiv2 1.57079632679f // pi/2
#define PImul3div2 2.09439510239 // 3*pi/2
#define PIrev 0.31830988618f // 1/pi
...
Hope it helps.
Finally I figured it out by myself. The problem lies in the fact that I was calculating the texture coordinates in the vertex shader. One vertex can lie on the far right of the texture while the other two vertices of the triangle lie on the far left, which results in almost the whole texture being visible on such a triangle, so there is a line of jumbled texture coordinates. The solution is to send the normal to the pixel shader and calculate the texture coordinate in the pixel shader.
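A hedged sketch of that fix (the resource and struct names are illustrative, not from the question): forward the normal from the vertex shader and compute the spherical UV per pixel, so the atan2 wrap-around is no longer interpolated across a triangle.
// Illustrative declarations - the actual effect file will differ.
Texture2D    earthTexture  : register(t0);
SamplerState linearSampler : register(s0);

static const float PI = 3.14159265359f;

struct VS_OUTPUT
{
    float4 pos    : SV_Position;
    float3 normal : TEXCOORD0;   // pass the (unit sphere) normal through
};

float4 PS(VS_OUTPUT input) : SV_Target
{
    float3 n = normalize(input.normal);

    // Same mapping as before, but evaluated per pixel instead of per vertex.
    float2 uv;
    uv.x = 0.5f + atan2(n.z, n.x) / (2.0f * PI);
    uv.y = 0.5f - asin(n.y) / PI;

    return earthTexture.Sample(linearSampler, uv);
}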

CUDA implementation of the Circle Hough Transform

I'm trying to implement a maximum-performance Circle Hough Transform in CUDA, whereby edge pixel coordinates cast votes in the Hough space. Pseudocode for the CHT is as follows; I'm using image sizes of 256 x 256 pixels:
int maxRadius = 100;
int minRadius = 20;
int imageWidth = 256;
int imageHeight = 256;
int houghSpace[imageWidth x imageHeight * maxRadius];

for (int radius = minRadius; radius < maxRadius; ++radius)
{
    for (float theta = 0.0; theta < 180.0; ++theta)
    {
        xCenter = edgeCoordinateX + (radius * cos(theta));
        yCenter = edgeCoordinateY + (radius * sin(theta));
        houghSpace[xCenter, yCenter, radius] += 1;
    }
}
My basic idea is to have each thread block calculate a (small) tile of the output Hough space (maybe one block for each row of the output Hough space). Therefore, I need to get the required part of the input image into shared memory somehow in order to carry out the voting for a particular output sub-Hough space.
My questions are as follows:
How do I calculate and store the coordinates for the required part of the input image in shared memory?
How do I retrieve the x,y coordinates of the edge pixels, previously stored in shared memory?
Do I cast votes in another shared memory array or write the votes directly to global memory?
Thanks everyone in advance for your time. I'm new to CUDA and any help with this would be gratefully received.
I don't profess to know much about this sort of filtering, but the basic idea of propagating characteristics from a source doesn't sound too different to marching and sweeping methods for solving the stationary Eikonal equation. There is a very good paper on solving this class of problem (PDF might still be available here):
A Fast Iterative Method for Eikonal Equations. Won-Ki Jeong, Ross T.
Whitaker. SIAM Journal on Scientific Computing, Vol 30, No 5,
pp.2512-2534, 2008
The basic idea is to decompose the computational domain into tiles, and then sweep the characteristic from the source across the domain. As tiles get touched by the advancing characteristic, they get added to a list of active tiles and calculated. Each time a tile is "solved" (converged to a numerical tolerance in the Eikonal case, probably a state in your problem) it is retired from the working set and its neighbours are activated. If a tile is touched again, it is re-added to the active list. The process continues until all tiles are calculated and the active list is empty. Each calculation iteration can be solved by a kernel launch, which explicitly synchronizes the calculation. Run as many kernels in series as required to reach an empty work list.
I don't think it is worth trying to answer your questions until you have a more concrete algorithmic approach and are getting into implementation details.

Dealing with Boundary conditions / Halo regions in CUDA

I'm working on image processing with CUDA and I have a doubt about pixel processing.
What is often done with the boundary pixels of an image when applying an m x m convolution filter?
In a 3 x 3 convolution kernel, ignoring the 1-pixel boundary of the image is easier to deal with, especially when the code is improved with shared memory. Indeed, in this case, one does not need to check whether a given pixel has its whole neighbourhood available (e.g. a pixel at coordinates (0, 0) has no left, upper-left, or upper neighbours). However, removing the 1-pixel boundary of the original image could generate partial results.
In contrast, I'd like to process all the pixels within the image, also when using shared memory improvements, e.g. loading 16 x 16 pixels but computing the inner 14 x 14. Also in this case, ignoring the boundary pixels generates clearer code.
What is usually done in this case?
Does anyone usually use my approach of ignoring the boundary pixels?
Of course, I'm aware the answer depends on the type of problem, e.g. adding two images pixel-wise does not have this problem.
Thanks in advance.
A common approach to dealing with border effects is to pad the original image with extra rows and columns based on your filter size. Some common choices for the padded values are listed below (a small index-mapping sketch follows the list):
A constant (e.g. zero)
Replicate the first and last row / column as many times as needed
Reflect the image at the borders (e.g. column[-1] = column[1], column[-2] = column[2])
Wrap the image values (e.g. column[-1] = column[width-1], column[-2] = column[width-2])
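The last three options boil down to remapping an out-of-range index back into [0, width-1]. Here is a hedged sketch of that mapping, written in HLSL-style syntax for consistency with the shader code elsewhere on this page; the helper names are mine, and the same integer arithmetic applies unchanged in a CUDA kernel.
// Remap an out-of-range index i into [0, n - 1] according to the border rule.
// Assumes i lies at most one image size outside the valid range.
int clampIndex(int i, int n)    // replicate the first/last row or column
{
    return clamp(i, 0, n - 1);
}

int reflectIndex(int i, int n)  // reflect: column[-1] = column[1], column[-2] = column[2]
{
    if (i < 0)  return -i;
    if (i >= n) return 2 * n - 2 - i;
    return i;
}

int wrapIndex(int i, int n)     // wrap: column[-1] = column[n-1], column[-2] = column[n-2]
{
    return (i + n) % n;
}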
tl;dr: It depends on the problem you're trying to solve -- there is no solution for this that applies to all problems. In fact, mathematically speaking, I suspect there may be no "solution" at all since I believe it's an ill-posed problem you're forced to deal with.
(Apologies in advance for my reckless abuse of mathematics)
To demonstrate let's consider a situation where all pixel components and kernel values are assumed to be positive. To get an idea of how some of these answers could lead us astray let's further think about a simple averaging ("box") filter. If we set values outside the boundary of the image to zero then this will clearly drag down the average at every pixel within ceil(n/2) (manhattan distance) of the boundary. So you'll get a "dark" border on your filtered image (assuming a single intensity component or RGB colorspace -- your results will vary by colorspace!). Note that similar arguments can be made if we set the values outside the boundary to any arbitrary constant -- the average will tend towards that constant. A constant of zero might be appropriate if the edges of your typical image tend towards 0 anyway. This is also true if we consider more complex filter kernels like a gaussian however the problem will be less pronounced because the kernel values tend to decrease quickly with distance from the center.
Now suppose that instead of using a constant we choose to repeat the edge values. This is the same as making a border around the image and copying rows, columns, or corners enough times to ensure the filter stays "inside" the new image. You could also think of it as clamping/saturating the sample coordinates. This has problems with our simple box filter because it overemphasizes the values of the edge pixels. A set of edge pixels will appear more than once yet they all receive the same weight w=(1/(n*n)).
Suppose we sample an edge pixel with value K 3 times. That means its contribution to the average is:
K*w + K*w + K*w = K*3*w
So effectively that one pixel has a higher weight in the average. Note that since this is an average filter the weight is a constant over the kernel. However this argument applies to kernels with weights that vary by position too (again: think of the gaussian kernel..).
Suppose we wrap or reflect the sampling coordinates so that we're still using values from within the boundary of the image. This has some valuable advantages over using a constant but isn't necessarily "correct" either. For instance, how many photos do you take where the objects at the upper border are similar to those at the bottom? Unless you're taking pictures of mirror-smooth lakes I doubt this is true. If you're taking pictures of rocks to use as textures in games wrapping or reflecting could be appropriate. I'm sure there are significant points to be made here about how wrapping and reflecting will likely reduce any artifacts that result from using a fourier transform. However this comes back to the same idea: that you have a periodic signal which you do not wish to distort by introducing spurious new frequencies or overestimating the amplitude of existing frequencies.
So what can you do if you're filtering photos of bright red rocks beneath a blue sky? Clearly you don't want to add orange-ish haze in the blue sky and blue-ish fuzz on the red rocks. Reflecting the sample coordinate works because we expect similar colors to those pixels found at the reflected coordinates... unless, just for the sake of argument, we imagine the filter kernel is so big that the reflected coordinate would extend past the horizon.
Let's go back to the box filter example. An alternative with this filter is to stop thinking about using a static kernel and think back to what this kernel was meant to do. An averaging/box filter is designed to sum the pixel components then divide by the number of pixels summed. The idea is that this smooths out noise. If we're willing to trade a reduced effectiveness in suppressing noise near the boundary we can simply sum fewer pixels and divide by a correspondingly smaller number. This can be extended to filters with similar what-I-will-call-"normalizing" terms -- terms that are related to the area or volume of the filter. For "area" terms you count the number of kernel weights that are within the boundary and ignore those weights that are not. Then use this count as the "area" (which might involve an extra multiplication). For volume (again: assuming positive weights!) simply sum the kernel weights. This idea is probably awful for derivative filters because there are fewer pixels to compete with the noisy pixels and differentials are notoriously sensitive to noise. Also, some filters have been derived by numeric optimization and/or empirical data rather than from ab-initio/analytic methods and thus may lack a readily apparent "normalizing" factor.
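To illustrate that "area"/"volume" renormalization idea on the box filter, here is a hedged sketch (again in HLSL syntax for consistency with the shaders above; the resource name is assumed, and a CUDA kernel would use the same structure): only taps that fall inside the image are accumulated, and the sum is divided by the number of taps actually used rather than by n*n.
Texture2D<float4> inputImage : register(t0);   // assumed resource

// Box filter of radius r around pixel p, renormalized near the border.
float4 boxFilterRenormalized(int2 p, int2 imageSize, int r)
{
    float4 sum   = 0.0;
    float  count = 0.0;

    for (int dy = -r; dy <= r; ++dy)
    {
        for (int dx = -r; dx <= r; ++dx)
        {
            int2 q = p + int2(dx, dy);

            // Skip taps outside the image instead of inventing values for them...
            if (all(q >= 0) && all(q < imageSize))
            {
                sum   += inputImage.Load(int3(q, 0));
                count += 1.0;
            }
        }
    }

    // ...and divide by the "area" actually covered (count instead of (2r+1)^2).
    return sum / count;
}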
Your question is somewhat broad and I believe it mixes two problems:
dealing with boundary conditions;
dealing with halo regions.
The first problem (boundary conditions) is encountered, for example, when computing the convolution between an image and a 3 x 3 kernel. When the convolution window runs across the boundary, one has the problem of extending the image outside of its boundaries.
The second problem (halo regions) is encountered, for example, when loading a 16 x 16 tile within shared memory and one has to process the internal 14 x 14 tile to compute second order derivatives.
For the second issue, I think a useful question is the following: Analyzing memory access coalescing of my CUDA kernel.
Concerning the extension of a signal outside of its boundaries, a useful tool is provided in this case by texture memory, thanks to the different addressing modes it provides; see The different addressing modes of CUDA textures.
Below, I'm providing an example on how a median filter can be implemented with periodic boundary conditions using texture memory.
#include <stdio.h>
#include "TimingGPU.cuh"
#include "Utilities.cuh"
texture<float, 1, cudaReadModeElementType> signal_texture;
#define BLOCKSIZE 32
/*************************************************/
/* KERNEL FUNCTION FOR MEDIAN FILTER CALCULATION */
/*************************************************/
__global__ void median_filter_periodic_boundary(float * __restrict__ d_vec, const unsigned int N) {

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        float signal_center = tex1D(signal_texture, tid - 0);
        float signal_before = tex1D(signal_texture, tid - 1);
        float signal_after  = tex1D(signal_texture, tid + 1);

        printf("%i %f %f %f\n", tid, signal_before, signal_center, signal_after);

        d_vec[tid] = (signal_center + signal_before + signal_after) / 3.f;
    }
}
/********/
/* MAIN */
/********/
int main() {

    const int N = 10;

    // --- Input host array declaration and initialization
    float *h_arr = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) h_arr[i] = (float)i;

    // --- Output host and device array vectors
    float *h_vec = (float *)malloc(N * sizeof(float));
    float *d_vec; gpuErrchk(cudaMalloc(&d_vec, N * sizeof(float)));

    // --- CUDA array declaration and texture memory binding; CUDA array initialization
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    // Alternatively:
    // cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

    cudaArray *d_arr; gpuErrchk(cudaMallocArray(&d_arr, &channelDesc, N, 1));
    gpuErrchk(cudaMemcpyToArray(d_arr, 0, 0, h_arr, N * sizeof(float), cudaMemcpyHostToDevice));

    cudaBindTextureToArray(signal_texture, d_arr);
    signal_texture.normalized = false;
    signal_texture.addressMode[0] = cudaAddressModeWrap;

    // --- Kernel execution
    median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_vec, d_vec, N * sizeof(float), cudaMemcpyDeviceToHost));
    for (int i = 0; i < N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    printf("Test finished\n");

    return 0;
}
