I'm trying to implement a maximum-performance Circle Hough Transform in CUDA, whereby edge pixel coordinates cast votes in the Hough space. Pseudocode for the CHT is as follows; I'm using image sizes of 256 x 256 pixels:
int maxRadius = 100;
int minRadius = 20;
int imageWidth = 256;
int imageHeight = 256;
int houghSpace[imageWidth * imageHeight * maxRadius];
for(int radius = minRadius; radius < maxRadius; ++radius)
{
    for(float theta = 0.0; theta < 180.0; ++theta)
    {
        xCenter = edgeCoordinateX + (radius * cos(theta));
        yCenter = edgeCoordinateY + (radius * sin(theta));
        houghSpace[xCenter, yCenter, radius] += 1;
    }
}
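A direct (and naive) CUDA translation of this voting loop might look like the sketch below - one thread per (edge pixel, radius) pair, with votes going straight to global memory via atomicAdd (the edge list and all names are illustrative, not part of my actual code):

// Naive baseline: one thread per (edge pixel, radius) pair, votes written
// straight to global memory with atomicAdd. Assumes the edge pixels have
// already been extracted into d_edgeX / d_edgeY.
__global__ void chtVoteNaive(const int *d_edgeX, const int *d_edgeY, int numEdges,
                             int *d_houghSpace, int imageWidth, int imageHeight,
                             int minRadius, int maxRadius)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int numRadii = maxRadius - minRadius;
    if (idx >= numEdges * numRadii) return;

    int e      = idx / numRadii;              // which edge pixel
    int radius = minRadius + idx % numRadii;  // which radius

    for (int t = 0; t < 180; ++t) {
        float theta = t * 3.14159265f / 180.0f;           // degrees -> radians
        int xCenter = d_edgeX[e] + __float2int_rn(radius * cosf(theta));
        int yCenter = d_edgeY[e] + __float2int_rn(radius * sinf(theta));
        if (xCenter >= 0 && xCenter < imageWidth && yCenter >= 0 && yCenter < imageHeight) {
            int bin = (radius - minRadius) * imageWidth * imageHeight
                    + yCenter * imageWidth + xCenter;     // flatten (x, y, radius)
            atomicAdd(&d_houghSpace[bin], 1);
        }
    }
}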
My basic idea is to have each thread block calculate a (small) tile of the output Hough space (maybe one block for each row of the output Hough space). Therefore, I need to get the required part of the input image into shared memory somehow in order to carry out the voting for a particular output sub-Hough space.
My questions are as follows:
How do I calculate and store the coordinates for the required part of the input image in shared memory?
How do I retrieve the x,y coordinates of the edge pixels, previously stored in shared memory?
Do I cast votes in another shared memory array or write the votes directly to global memory?
Thanks everyone in advance for your time. I'm new to CUDA and any help with this would be gratefully received.
I don't profess to know much about this sort of filtering, but the basic idea of propagating characteristics from a source doesn't sound too different to marching and sweeping methods for solving the stationary Eikonal equation. There is a very good paper on solving this class of problem (PDF might still be available here):
A Fast Iterative Method for Eikonal Equations. Won-Ki Jeong, Ross T. Whitaker. SIAM Journal on Scientific Computing, Vol. 30, No. 5, pp. 2512-2534, 2008.
The basic idea is to decompose the computational domain into tiles, and then sweep the characteristic from the source across the domain. As tiles get touched by the advancing characteristic, they are added to a list of active tiles and calculated. Each time a tile is "solved" (converged to a numerical tolerance in the Eikonal case, probably a state in your problem) it is retired from the working set and its neighbours are activated. If a tile is touched again, it is re-added to the active list. The process continues until all tiles are calculated and the active list is empty. Each calculation iteration can be solved by a kernel launch, which explicitly synchronizes the calculation. Run as many kernels in series as required to reach an empty work list.
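To make the last point concrete, the host-side pattern of "launch over the active list, rebuild the list, repeat until it is empty" could be skeletonised roughly as follows (my own illustration, not code from the paper):

// Skeleton of the iterative launch pattern: each block processes one active
// tile and appends any neighbour it (re)activates to the next work list.
__global__ void processActiveTiles(const int *activeList, int numActive,
                                   int *nextList, int *nextCount)
{
    if (blockIdx.x >= numActive) return;
    int tile = activeList[blockIdx.x];
    // ... update this tile here; when a neighbour must be (re)activated, append it:
    // int slot = atomicAdd(nextCount, 1); nextList[slot] = neighbourOf(tile);
    (void)tile; (void)nextList; (void)nextCount;    // placeholder body
}

void sweepUntilEmpty(int *d_activeList, int *d_nextList, int *d_nextCount, int numActive)
{
    while (numActive > 0) {
        processActiveTiles<<<numActive, 256>>>(d_activeList, numActive, d_nextList, d_nextCount);
        cudaDeviceSynchronize();                    // the kernel boundary is the global sync point
        cudaMemcpy(&numActive, d_nextCount, sizeof(int), cudaMemcpyDeviceToHost);
        cudaMemset(d_nextCount, 0, sizeof(int));
        int *tmp = d_activeList; d_activeList = d_nextList; d_nextList = tmp;   // swap work lists
    }
}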
I don't think it is worth trying to answer your questions until you have a more concrete algorithmic approach and are getting into implementation details.
I'm looking to display coordinates on a map. The coordinates are at a relatively fine resolution (3 decimal places), but I need to anonymise and aggregate them to a coarser resolution.
All the approaches I've seen run the risk of the coarse coordinates being the same as, or very close to, the original coordinates, since they rely on rounding or adding random noise to the original.
For example, with rounding:
53.401, -2.899 -> 53.4, -2.9 # less than 100m
With adding 'noise', e.g.:
lat = 53.456
// 'fuzz' in range -0.1 to 0.1
rnd = (Math.random() * 2 - 1) * 0.1
newLat = lat + rnd
However if rnd is close to 0, then the coordinates don't 'move' much.
Is there a (simple) way to 'move' a coordinate in a random way a certain (minimum) distance from its original location?
I've looked at other answers here but they don't seem to solve this issue of the new coordinates overlapping with the original coordinates:
Rounding Lat and Long to Show Approximate Location in Google Maps
Is there any easy way to make GPS coordinates coarse?
To add random noise, you could displace every point by a fixed distance in a random direction. On a flat projection, for a radius r:
angle = Math.random() * 2 * PI
newLat = lat + (r * sin(angle))
newLon = lon + (r * cos(angle))
That would guarantee a fixed displacement (r) for every point, in an unpredictable direction.
Alternatively, you could anonymise by joining to a polygon at a coarser grain, and then plot the data by the polygon rather than the points. It could be as simple as a grid on a flat projection. Or something more sophisticated such as the Australian Statistical Geography Standard which offers multiple choices, the most granular being a "mesh block" which they guarantee to always contain 30-60 dwellings.
All the approaches I've seen run the risk of the coarse coordinates being the same as, or very close to, the original coordinates, since they rely on rounding or adding random noise to the original.
Could you explain what risk you are concerned about here? Yes, the coarse coordinate might happen to be the same, but it is still anonymized - whoever sees the coarse data would not know whether it is coincidentally close or not. All they know is that the actual location is within some distance R_max of the coarse location.
Re the other solution,
displace every point by a fixed distance in a random direction
I would say it is much worse: here it would be easy to discover the fixed displacement distance by knowing just a single original location. Then, for any "coarse" location, we would know the original lies on a thin, unfilled circle centered on the "coarse" location - much worse than the filled circle or rectangle in the original solution.
At the very least, I would use a random radius, and maybe not allow it to be zero if you are concerned about a coincidental collision (but you should not be). E.g. this varies the radius from r_max / 2 to r_max:
r = (Math.random() + 1) * r_max / 2;
and then you can use this random radius with Schepo's solution.
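Putting the two parts together, a small C++ sketch of a random bearing combined with a radius drawn from [r_max / 2, r_max] (illustrative only; the flat-projection simplification and all names are mine):

#include <cmath>
#include <random>

struct Coord { double lat, lon; };

// Displace a point by a random distance in [rMax / 2, rMax] (in degrees here,
// for simplicity) in a random direction on a flat projection.
Coord anonymise(Coord c, double rMax, std::mt19937 &rng)
{
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    const double pi = std::acos(-1.0);

    double angle  = unit(rng) * 2.0 * pi;          // random bearing
    double radius = (unit(rng) + 1.0) * rMax / 2;  // never closer than rMax / 2

    // Note: for real latitude/longitude the longitude offset should also be
    // scaled by 1 / cos(latitude); omitted here to stay close to the snippets above.
    return { c.lat + radius * std::sin(angle),
             c.lon + radius * std::cos(angle) };
}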
I'm working on a graphics engine using Direct3D 11 and Visual Studio 2015. In the HLSL shaders for the main draw calls, I sample shadow maps for directional and point lights with percentage-closer-filtering, i.e. I sample a small square area around the target shadow map texel and average the results to get soft shadows. Now, every call to shadowMap_.Sample(...) creates a warning: "gradient instruction used in a loop with varying iteration, forcing loop to unroll" (X3570). I want to fix this or, if that is not possible, hide the warning as it completely floods my warning output.
I tried searching online for the error message and couldn't find any further descriptions. I couldn't even find an explanation of what a gradient instruction is supposed to be. I checked the Microsoft documentation for a different sampler or sampling function that would let me replace the loop with native sampling functionality, but didn't find anything like that either. Here is the function I use for sampling my shadow cube maps for point lights:
float getPointShadowValue(in uint index, in float3 worldPosition)
{
    // (Half-)Radius for percentage closer filtering
    int hFilterRadius = 2;

    // Calculate the vector inside the cube that points to the fragment
    float3 fragToLight = worldPosition.xyz - pointEmitters_[index].position.xyz;

    // Calculate the depth of the current fragment
    float currentDepth = length(fragToLight);

    float sum = 0.0;
    for (float z = -hFilterRadius; z <= hFilterRadius; z++)
    {
        for (float y = -hFilterRadius; y <= hFilterRadius; y++)
        {
            for (float x = -hFilterRadius; x <= hFilterRadius; x++)
            {
                // Offset the currently targeted cube map texel and sample at that position
                float3 off = float3(x, y, z) * 0.05;
                float closestDepth = pointShadowMaps_.Sample(sampler_, float4(fragToLight + off, index)).x * farPlane_;
                sum += (currentDepth - 0.1 > closestDepth ? 1.0 : 0.0);
            }
        }
    }

    // Calculate the average and return the shadow value clamped to [0, 1]
    float shadow = sum / (pow(hFilterRadius * 2 + 1, 3));
    return min(shadow, 1.0);
}
The code still works fine as it is, but I get a huge amount of these warnings and don't know if this causes a relevant performance impact. Any further information about the warning and what can be done about it is greatly appreciated.
Thanks in advance.
Gradient functions are all texture sampling methods that determine the mip level to use by themselves, such as the Sample method you are using. To do this they use ddx (doc) and ddy (doc) internally. Fragments are computed on the GPU in 2x2 chunks, so they can compare the differences in their texture coordinates with each other; the larger the difference, the higher the mip map level that is used. With dynamic branching this method no longer works, as it is not assured that each fragment takes the same computation path, so gradient functions don't work within dynamic branches. As loops use branching, the compiler has to make them static in order to use gradient functions. In your case this is done by unrolling, since the loop bounds are always the same. The compiler has already detected this and compiles your loops by writing out all iterations one after another automatically to produce non-branching code. With the [unroll] (doc) attribute you can hint the compiler to do so and suppress the warnings.
Another way for your code would be to use sampling methods which aren't gradient functions, such as SampleLevel (doc), where you pass the desired mip map level (0 in your case, as your shadow map doesn't have mip map levels) and the GPU doesn't have to determine it. As far as I know the performance impact is negligible, as this happens at a very low level where most functions are processed equally fast on the GPU, but perhaps you should do your own tests.
One addition, which doesn't apply to your case but is a further non-gradient method to fetch texels, is Load (doc), which directly fetches a specific texel by its integer texel index.
As Chuck Walbourn already stated, adding an [unroll] statement before the for loops fixes the warnings. This type of warning is basically the compiler informing you that the loop has to be unrolled because of the gradient instruction, even though unrolling it may be less performant (as can be read in the Microsoft documentation for the HLSL for loop). I assume this can be safely accepted.
I have an application that should calculate the illuminance of a given photo.
My problem is: I can't find a way to calculate this illuminance value (in lux).
I can get luminosity with this code:
UIImage* image = [UIImage imageNamed:@"image.png"];
unsigned char* pixels = [image rgbaPixels];
double totalLuminance = 0.0;
for (int p = 0; p < image.size.width * image.size.height * 4; p += 4) {
    totalLuminance += pixels[p] * 0.299 + pixels[p+1] * 0.587 + pixels[p+2] * 0.114;
}
totalLuminance /= (image.size.width * image.size.height);
totalLuminance /= 255.0;
NSLog(@"Image.png = %f", totalLuminance);
Source
but the results are between 0 and 1, which I don't think I can use to calculate illuminance in lux.
Looking for an answer on Stack Overflow I found this answer https://stackoverflow.com/a/2720635/839049, which gave me a direction, but I don't know how to proceed because I don't know what to do with the exposure.
.. you can just ignore the pixel data and just use the exposure information as a light meter. ...
How? Does anyone know how I do that?
Or is there a better way to do it?
If you know the exposure value from the Exif data, then the total scene illuminance can be calculated as
pow(2, Exposure)*(0.3 * Calibration);
where Calibration, unfortunately, depends on the physical characteristics of the scene you're imaging. If you have a lot of dark, low-reflectivity objects, then the constant will have to be set higher.
Usually the Exposure works out so that the average illuminance you get from your formula, i.e., the sum of all your Y values divided by the number of pixels, is around 0.5 (but that depends on the camera's "brain": some cameras divide the scene in "zones" and apply different weights to each zone, e.g. they try to get 0.5 in the central area even if this means darkening the edges; the latest cameras integrate the contrast values, so as to capture the most details from what they deduce to be the "zone of interest").
This means that your image will always be "scaled", unless you somehow instruct the camera to take pictures at a fixed speed and stop setting, without compensating in any way. If you do, you will be able to use the average pixel luminance to determine total apparent illuminance.
You will always have to calibrate your results with a known meter, though.
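As a toy illustration of the formula above (my own sketch; both the exposure value and the calibration constant are hypothetical placeholders that you would have to calibrate against a known meter):

#include <cmath>
#include <cstdio>

int main() {
    double exposureValue = 12.0; // hypothetical EV read from the photo's Exif data
    double calibration   = 8.0;  // placeholder; depends on scene and camera, see above
    double illuminance   = std::pow(2.0, exposureValue) * (0.3 * calibration);
    std::printf("Approximate scene illuminance: %.1f lux\n", illuminance);
    return 0;
}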
I'm working on image processing with CUDA and I have a question about pixel processing.
What is often done with the boundary pixels of an image when applying a m x m convolution filter?
With a 3 x 3 convolution kernel, ignoring the 1-pixel boundary of the image is easier to deal with, especially when the code is improved with shared memory. Indeed, in this case, one does not need to check whether a given pixel has all its neighbourhood available (e.g. the pixel at coordinates (0, 0) has no left, upper-left, or upper neighbours). However, removing the 1-pixel boundary of the original image could generate partial results.
In contrast, I'd like to process all the pixels within the image, also when using shared-memory improvements, i.e., for example, loading 16 x 16 pixels but computing the inner 14 x 14. Also in this case, ignoring the boundary pixels generates clearer code.
What is usually done in this case?
Does anyone usually use my approach ignoring the boundary pixels?
Of course, I'm aware the answer depends on the type of problem, i.e. adding two images pixel-wise does not have this problem.
Thanks in advance.
A common approach to dealing with border effects is to pad the original image with extra rows & columns based on your filter size. Some common choices for the padded values are (see the index-mapping sketch after this list):
A constant (e.g. zero)
Replicate the first and last row / column as many times as needed
Reflect the image at the borders (e.g. column[-1] = column[1], column[-2] = column[2])
Wrap the image values (e.g. column[-1] = column[width-1], column[-2] = column[width-2])
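To make these concrete, here is a minimal sketch (mine, not part of the answer) of CUDA device helpers that map an out-of-range column index back into [0, width) for the replicate, reflect and wrap cases; padding the image simply materialises the same values up front. They assume the filter radius is smaller than the image width:

// Map an out-of-range x back into [0, width) according to a border policy.
// Illustrative helpers, not part of any library.
__device__ __forceinline__ int clampIndex(int x, int width)    // replicate the edge
{
    return min(max(x, 0), width - 1);
}

__device__ __forceinline__ int reflectIndex(int x, int width)  // column[-1] -> column[1]
{
    if (x < 0)      x = -x;
    if (x >= width) x = 2 * (width - 1) - x;
    return x;
}

__device__ __forceinline__ int wrapIndex(int x, int width)     // column[-1] -> column[width-1]
{
    return (x % width + width) % width;                        // also correct for negative x
}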
tl;dr: It depends on the problem you're trying to solve -- there is no solution for this that applies to all problems. In fact, mathematically speaking, I suspect there may be no "solution" at all since I believe it's an ill-posed problem you're forced to deal with.
(Apologies in advance for my reckless abuse of mathematics)
To demonstrate let's consider a situation where all pixel components and kernel values are assumed to be positive. To get an idea of how some of these answers could lead us astray let's further think about a simple averaging ("box") filter. If we set values outside the boundary of the image to zero then this will clearly drag down the average at every pixel within ceil(n/2) (manhattan distance) of the boundary. So you'll get a "dark" border on your filtered image (assuming a single intensity component or RGB colorspace -- your results will vary by colorspace!). Note that similar arguments can be made if we set the values outside the boundary to any arbitrary constant -- the average will tend towards that constant. A constant of zero might be appropriate if the edges of your typical image tend towards 0 anyway. This is also true if we consider more complex filter kernels like a gaussian however the problem will be less pronounced because the kernel values tend to decrease quickly with distance from the center.
Now suppose that instead of using a constant we choose to repeat the edge values. This is the same as making a border around the image and copying rows, columns, or corners enough times to ensure the filter stays "inside" the new image. You could also think of it as clamping/saturating the sample coordinates. This has problems with our simple box filter because it overemphasizes the values of the edge pixels. A set of edge pixels will appear more than once yet they all receive the same weight w=(1/(n*n)).
Suppose we sample an edge pixel with value K 3 times. That means its contribution to the average is:
K*w + K*w + K*w = K*3*w
So effectively that one pixel has a higher weight in the average. Note that since this is an average filter the weight is a constant over the kernel. However this argument applies to kernels with weights that vary by position too (again: think of the gaussian kernel..).
Suppose we wrap or reflect the sampling coordinates so that we're still using values from within the boundary of the image. This has some valuable advantages over using a constant but isn't necessarily "correct" either. For instance, how many photos do you take where the objects at the upper border are similar to those at the bottom? Unless you're taking pictures of mirror-smooth lakes I doubt this is true. If you're taking pictures of rocks to use as textures in games wrapping or reflecting could be appropriate. I'm sure there are significant points to be made here about how wrapping and reflecting will likely reduce any artifacts that result from using a fourier transform. However this comes back to the same idea: that you have a periodic signal which you do not wish to distort by introducing spurious new frequencies or overestimating the amplitude of existing frequencies.
So what can you do if you're filtering photos of bright red rocks beneath a blue sky? Clearly you don't want to add orange-ish haze in the blue sky and blue-ish fuzz on the red rocks. Reflecting the sample coordinate works because we expect similar colors to those pixels found at the reflected coordinates... unless, just for the sake of argument, we imagine the filter kernel is so big that the reflected coordinate would extend past the horizon.
Let's go back to the box filter example. An alternative with this filter is to stop thinking about using a static kernel and think back to what this kernel was meant to do. An averaging/box filter is designed to sum the pixel components then divide by the number of pixels summed. The idea is that this smooths out noise. If we're willing to trade a reduced effectiveness in suppressing noise near the boundary we can simply sum fewer pixels and divide by a correspondingly smaller number. This can be extended to filters with similar what-I-will-call-"normalizing" terms -- terms that are related to the area or volume of the filter. For "area" terms you count the number of kernel weights that are within the boundary and ignore those weights that are not. Then use this count as the "area" (which might involve an extra multiplication). For volume (again: assuming positive weights!) simply sum the kernel weights. This idea is probably awful for derivative filters because there are fewer pixels to compete with the noisy pixels and differentials are notoriously sensitive to noise. Also, some filters have been derived by numeric optimization and/or empirical data rather than from ab-initio/analytic methods and thus may lack a readily apparent "normalizing" factor.
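To make the renormalisation idea concrete, here is a minimal sketch (my own, assuming a single-channel float image) of a box filter that sums only the in-bounds neighbours and divides by however many were actually summed:

// Box filter that renormalizes at the image boundary: only in-bounds pixels
// are summed, and the divisor is the number of pixels actually summed.
__global__ void boxFilterRenormalized(const float *in, float *out,
                                      int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum   = 0.0f;
    int   count = 0;                        // "area" term: in-bounds taps only
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                sum += in[ny * width + nx];
                ++count;
            }
        }
    }
    out[y * width + x] = sum / count;       // count >= 1, since (x, y) itself is in bounds
}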
Your question is somewhat broad and I believe it mixes two problems:
dealing with boundary conditions;
dealing with halo regions.
The first problem (boundary conditions) is encountered, for example, when computing the convolution between an image and a 3 x 3 kernel. When the convolution window runs across the boundary, one has the problem of extending the image outside of its boundaries.
The second problem (halo regions) is encountered, for example, when loading a 16 x 16 tile within shared memory and one has to process the internal 14 x 14 tile to compute second order derivatives.
For the second issue, I think a useful question is the following: Analyzing memory access coalescing of my CUDA kernel.
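For illustration (a sketch of my own, not taken from that question), the halo pattern typically looks like the following: each 16 x 16 block stages a 16 x 16 input tile, including a 1-pixel halo, into shared memory, and only the inner 14 x 14 threads compute the second order derivative (a Laplacian here); out-of-range halo pixels are simply clamped to the border:

#define TILE_DIM 16   // staged tile, including a 1-pixel halo on each side

// Launch with block = (16, 16) and grid = (ceil(width / 14), ceil(height / 14)).
__global__ void laplacianTiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    // Each block produces a (TILE_DIM - 2) x (TILE_DIM - 2) output tile.
    const int outTile = TILE_DIM - 2;
    int gx = blockIdx.x * outTile + threadIdx.x - 1;   // -1: shift left/up to cover the halo
    int gy = blockIdx.y * outTile + threadIdx.y - 1;

    // Stage the tile; clamp (replicate) out-of-range coordinates at the image border.
    int cx = min(max(gx, 0), width  - 1);
    int cy = min(max(gy, 0), height - 1);
    tile[threadIdx.y][threadIdx.x] = in[cy * width + cx];
    __syncthreads();

    // Only the inner threads compute; the outer ring of threads was load-only.
    if (threadIdx.x >= 1 && threadIdx.x < TILE_DIM - 1 &&
        threadIdx.y >= 1 && threadIdx.y < TILE_DIM - 1 &&
        gx < width && gy < height) {
        float lap = tile[threadIdx.y][threadIdx.x - 1] + tile[threadIdx.y][threadIdx.x + 1]
                  + tile[threadIdx.y - 1][threadIdx.x] + tile[threadIdx.y + 1][threadIdx.x]
                  - 4.0f * tile[threadIdx.y][threadIdx.x];
        out[gy * width + gx] = lap;
    }
}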
Concerning the extension of a signal outside of its boundaries, a useful tool is provided in this case by texture memory, thanks to the different addressing modes it provides; see The different addressing modes of CUDA textures.
Below, I'm providing an example of how a median filter can be implemented with periodic boundary conditions using texture memory.
#include <stdio.h>
#include "TimingGPU.cuh"
#include "Utilities.cuh"
texture<float, 1, cudaReadModeElementType> signal_texture;
#define BLOCKSIZE 32
/*************************************************/
/* KERNEL FUNCTION FOR MEDIAN FILTER CALCULATION */
/*************************************************/
__global__ void median_filter_periodic_boundary(float * __restrict__ d_vec, const unsigned int N){

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        // --- Wrap addressing is applied only to normalized coordinates, so texel i
        //     is fetched at (i + 0.5) / N; indices outside [0, N) wrap around.
        float signal_center = tex1D(signal_texture, ((int)tid     + 0.5f) / (float)N);
        float signal_before = tex1D(signal_texture, ((int)tid - 1 + 0.5f) / (float)N);
        float signal_after  = tex1D(signal_texture, ((int)tid + 1 + 0.5f) / (float)N);

        printf("%i %f %f %f\n", tid, signal_before, signal_center, signal_after);

        // --- Median of the three samples
        d_vec[tid] = fmaxf(fminf(signal_before, signal_center),
                           fminf(fmaxf(signal_before, signal_center), signal_after));
    }
}
/********/
/* MAIN */
/********/
int main() {

    const int N = 10;

    // --- Input host array declaration and initialization
    float *h_arr = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) h_arr[i] = (float)i;

    // --- Output host and device array vectors
    float *h_vec = (float *)malloc(N * sizeof(float));
    float *d_vec; gpuErrchk(cudaMalloc(&d_vec, N * sizeof(float)));

    // --- CUDA array declaration and texture memory binding; CUDA array initialization
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    //Alternatively
    //cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

    cudaArray *d_arr; gpuErrchk(cudaMallocArray(&d_arr, &channelDesc, N, 1));
    gpuErrchk(cudaMemcpyToArray(d_arr, 0, 0, h_arr, N * sizeof(float), cudaMemcpyHostToDevice));

    // --- cudaAddressModeWrap is honoured only for normalized coordinates, so the
    //     texture is configured as normalized and fetched at (i + 0.5) / N in the kernel
    signal_texture.normalized     = true;
    signal_texture.filterMode     = cudaFilterModePoint;
    signal_texture.addressMode[0] = cudaAddressModeWrap;
    cudaBindTextureToArray(signal_texture, d_arr);

    // --- Kernel execution
    median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_vec, d_vec, N * sizeof(float), cudaMemcpyDeviceToHost));

    for (int i = 0; i < N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    printf("Test finished\n");

    return 0;
}