Let's say I have a texture that is 100 wide and 100 high, and I dispatch it to the kernel function with a threadgroup count of {10, 10, 1} and a threadgroup size of {10, 10, 1}.
I am having trouble understanding whether thread_position_in_grid goes from 0-9 or from 0-99 (since the size of my texture is 100 x 100).
I am using float4 c = in.read(gid); to retrieve the color of the texture, but I would like to know how gid is mapped to the texture coordinates used to read the float4 result (would there be cases where it falls between pixel (texture?) coordinates?).
I would like to understand in depth how the above works, because what I would like to achieve is to retrieve, inside the kernel function, the exact positions defined by the size of my texture; i.e.:
Dispatch size {10, 10, 1}, texture size 100 x 100, and values retrieved in the kernel function:
0, 1, 2, 3, 4, 5..., 99.
In summary: how are gid, texture coordinates and pixel coordinates related within the kernel function, and how are they mapped from one to another? I have been reading the documentation and similar questions for the past 4 days, but I have not been able to find a concrete and authoritative answer.
Thank you.
In your example, you're dispatching a grid of work like this:
MTLSize threadsPerThreadgroup = { 10, 10, 1 };
MTLSize threadgroupCount = { 10, 10, 1 };
[computeEncoder dispatchThreadgroups:threadgroupCount
               threadsPerThreadgroup:threadsPerThreadgroup];
Your gid, with attribute thread_position_in_grid, is probably of type uint2, since the domain of the grid is two-dimensional. So far, so good.
gid will range from (0, 0) to (99, 99). How you map it to other quantities, including texture coordinates, is entirely up to you.
For example, say you're not operating on a 100x100 texture, but instead a 300x300 texture, and you want to perform some operation on the lower-right corner. In that case, you might add uint2(200, 200) to gid before reading/writing the texture, in order to address the region of pixels with coordinates (200, 200) to (299, 299). Other, arbitrary transformations are just as possible.
Basically, specifying the grid dimensions is a convenient way for you to describe the logical shape of the operation you want to perform, while also (through threadsPerThreadgroup) allowing you to optimize how much of that work can be executed in parallel.
As for (normalized) texture coordinates, in the simple case where your grid dimensions exactly match your texture dimensions, you can get the normalized coordinates corresponding to a kernel function invocation by dividing by a parameter that is qualified with the [[threads_per_grid]] attribute, remembering to cast first to avoid truncation:
float2 coords = float2(gid) / float2(tpg);
where tpg is another parameter to your kernel function declared as uint2 tpg [[threads_per_grid]].
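To make that concrete, here is a minimal sketch of a kernel, assuming the grid dimensions exactly match the texture dimensions; the function, texture and parameter names (readEveryPixel, inTexture, outTexture, tpg) are placeholders for illustration:

#include <metal_stdlib>
using namespace metal;

kernel void readEveryPixel(texture2d<float, access::read>  inTexture  [[texture(0)]],
                           texture2d<float, access::write> outTexture [[texture(1)]],
                           uint2 gid [[thread_position_in_grid]],
                           uint2 tpg [[threads_per_grid]])
{
    // gid is an integer pixel coordinate: (0, 0) ... (99, 99) for a 100 x 100 grid.
    float4 c = inTexture.read(gid);

    // Normalized coordinates in [0, 1): cast before dividing to avoid integer truncation.
    float2 coords = float2(gid) / float2(tpg);

    // Do whatever you like with them; here the pixel is simply tinted by its own coordinates.
    outTexture.write(c * float4(coords, 1.0, 1.0), gid);
}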
For example, if I want to do a grayscale transformation, I need to set up my threads per threadgroup and threadgroup count in the following way.
NSUInteger maxTotalThreadsPerThreadgroup = [self.computePipelineState maxTotalThreadsPerThreadgroup];
MTLSize threadgroupCounts = MTLSizeMake(threadExecutionWidth * 2, threadExecutionWidth * 2, 1);
MTLSize threadsPerThreadGroup = MTLSizeMake([self.texutre width] / threadgroupCounts.width + 1,
                                            [self.texutre height] / threadgroupCounts.height + 1,
                                            1);
I know the image will be chopped into different blocks and each one will be processed by one thread group. But it seems in the kernel, we will just read the 2d texture, and then output the processed texture.
But the question is that how the image is chopped into different 2d texture? How do we know if each block of image get assigned to a thread to process? Is this done by Metal itself? Or we need to manually assign each block to each threadgroup by using the gid?
Metal doesn't know or care whether your shader is operating on an image. It doesn't "chop" the image or anything like that.
A compute shader is processed over a "grid". The grid is an abstraction. It's an arbitrary way for you to organize the work. Metal doesn't assign any significance to the grid, such as associating a position in the grid with a pixel in an image.
Such an association, if it exists, is implicit in how your shader code behaves. Yes, that is largely based on what the shader does with thread_position_in_grid, thread_position_in_threadgroup, thread_index_in_threadgroup, etc.
So, if you're using a gid variable with the thread_position_in_grid attribute, and you use its coordinates as image coordinates, then that usage is what dictates that each grid position corresponds to an image pixel. Once you do that, then it follows that each thread group corresponds to a block of the image, since a thread group is just a block of grid positions. Again, though, this is not something that Metal is doing, it's something that your shader is doing.
You could do something entirely different and Metal wouldn't care.
Although this is an inappropriate use of a compute shader, I was doing some experiments to determine whether I could use one to produce the general UV gradient where one channel of the image goes linearly from 0-1 across the x axis and the other channel goes from 0-1 across the y axis of the image. However, I became confused when I generated this image by varying the b value of a texture by the thread_position_in_grid.x value divided by the image width. I edited the pixel of the texture at the thread_position_in_grid position:
Yes, it was a gradient, but it certainly did not appear to be the gradient from 0-1 that I wanted. I dropped it into an image editor and, sure enough, it was not linear. (The part added below shows what a linear gradient from 0-1 would look like.)
It would appear that I do not understand what exactly the thread_position_in_grid value means. I know it has something to do with the threads per threadgroup and thread execution width, but I don't exactly understand what. I suppose my end goal is to know whether it would be possible to generate the gradient below in a compute shader, but I don't understand what is going on.
For reference, I was working with a 100x100 texture with the following thread settings. Really, I don't know why I use these values, but this is what I saw recommended somewhere, so I am sticking with them. I would love to be able to generalize this problem to other texture sizes as well, including rectangles.
let w = greenPipeline.threadExecutionWidth
let h = greenPipeline.maxTotalThreadsPerThreadgroup / w
let threadsPerThreadgroup = MTLSizeMake(w, h, 1)
let threadgroupsPerGrid = MTLSize(width: (texture.width + w - 1) / w,
                                  height: (texture.height + h - 1) / h,
                                  depth: 1)
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
And my shader looks like this:
kernel void green(texture2d<float, access::write> outputTexture [[texture(0)]],
                  uint2 position [[thread_position_in_grid]])
{
    if (position.x >= outputTexture.get_width() || position.y >= outputTexture.get_height()) {
        return;
    }
    outputTexture.write(float4(position.x / 100.0, 0.0, 0.0, 0.0), position);
}
Two things about this shader confuse me because I cannot explain them:
I am using position as the coordinate to write to on the texture, so it bothers me that position doesn't work to generate the gradient.
You cannot replace the position.x / 100.0 value with position.x / outputTexture.get_width() even though it should also be 100. Doing so causes a black image. Yet when I made a shader that colored everything with outputTexture.get_width() as its value, it did indeed shade everything to a value equivalent to 100 (or, more accurately, 101 because of rounding).
It is ok to use position to check if the kernel is within bounds but not to create the UV gradient.
What is going on?
The thread_position_in_grid means whatever you want it to mean because you decide how large the grid is and what each thread in the grid does.
In your example, thread_position_in_grid is the pixel coordinate in the texture, because your grid size is equal to the number of pixels in the texture (rounded up to a multiple of the threadgroup dimensions you computed from the pipeline's thread execution width).
You can see this if you change the threadGroupsPerGrid to:
let threadgroupsPerGrid = MTLSize(width: (texture.width/2 + w - 1) / w,
                                  height: (texture.height/2 + h - 1) / h,
                                  depth: 1)
Now only the top-left quarter of your texture should be filled in, because the grid only covers half of the texture's width and height.
As to why your texture looks weird, it's probably related to the pixel format. After all, you're writing into the red color component and your texture comes out as blue.
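For reference, here is a sketch of a kernel that produces the kind of UV gradient asked about, using the same position-as-pixel-coordinate mapping (the function name is a placeholder). Note the explicit casts to float: dividing two integers, such as position.x / outputTexture.get_width(), truncates to 0 for every in-bounds pixel.

kernel void uvGradient(texture2d<float, access::write> outputTexture [[texture(0)]],
                       uint2 position [[thread_position_in_grid]])
{
    if (position.x >= outputTexture.get_width() || position.y >= outputTexture.get_height()) {
        return;
    }

    // Cast before dividing; an integer division here would yield 0 everywhere.
    float u = float(position.x) / float(outputTexture.get_width());
    float v = float(position.y) / float(outputTexture.get_height());

    outputTexture.write(float4(u, v, 0.0, 1.0), position);
}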
In iOS or OS X, what texture coordinates are used in a Metal Shading Language kernel function? For example, given an MTLTexture and uint2 gid [[thread_position_in_grid]], are gid.x and gid.y between 0..1 (x and y are floats) or between 0..inTexture.get_width() (x and y are integers)?
Thanks in advance.
thread_position_in_grid is an index (an integer) in the grid that takes values in the ranges you specify in dispatchThreadgroups:threadsPerThreadgroup:. It's up to you to decide how many thread groups you want, and how many threads per group.
In sample code where threadsPerGroup.width * numThreadgroups.width == inputImage.width and threadsPerGroup.height * numThreadgroups.height == inputImage.height, a position in the grid will thus be a non-normalized (integer) pixel coordinate.
Each launch of a compute shader in Metal is accompanied by a dense rectangular 3D grid of thread IDs. The dimensions of the grid are set when you call [MTLComputeCommandEncoder dispatchThreadgroups:threadsPerThreadgroup:]. You can, for example, have a threadgroup size of {16,16,1} (256 threads in a threadgroup as a 16x16x1 square) and a threadgroup count of {1,2,1}, which will cause two threadgroups to be launched with a total of 512 threads in the shape {16,32,1}. These are the integers that appear at the top of your kernel as [[thread_position_in_grid]]. The thread position is the way that you tell which thread you are, just like the threadID parameter passed to a block by dispatch_apply().
Metal specifies no mapping from [[thread_position_in_grid]] to coordinates in a texture. This is done by you in software in your compute shader. If you want to read every other pixel in a region of a texture at some offset in the image, then you need to multiply the threadID by two and add an offset in your kernel before passing the new coordinate to texture2d.sample. Since Metal can not launch partial threadgroups, it is up to you to make sure that unneeded threadgroups are not executed. For example, when applied to a smaller texture, the full size of your 32x64 launch might cause you to write off the end of your texture. In this case you must check the threadID to see if the thread will write off the end and then either return out of the shader or skip over the texture write call for that thread to avoid the problem.
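As a rough sketch of those last points (the function and texture names and the offset are invented for illustration, and read() is used instead of sample() for brevity), a kernel that reads every other pixel at an offset and skips threads that would land outside the texture might look like this:

kernel void everyOtherPixel(texture2d<float, access::read>  inTexture  [[texture(0)]],
                            texture2d<float, access::write> outTexture [[texture(1)]],
                            uint2 tid [[thread_position_in_grid]])
{
    // Map the thread ID to a source coordinate: every other pixel, starting at an offset.
    uint2 coord = tid * 2u + uint2(16, 16);

    // The launch covers whole threadgroups, so some threads may fall outside the
    // region of interest; skip them instead of reading or writing out of bounds.
    if (coord.x >= inTexture.get_width() || coord.y >= inTexture.get_height()) {
        return;
    }

    // Assumes outTexture is at least as large as the dispatched grid.
    outTexture.write(inTexture.read(coord), tid);
}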
thread_position_in_grid is always made of unsigned integers, and provides these options, but none of them are related to texture coordinates. It may be helpful to ask another, related question, because you seem to be conflating the idea of textures and kernel functions.
16- or 32-bit
1D, 2D, or 3D
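For illustration, those options correspond to kernel parameter declarations along these lines (the function names are placeholders):

kernel void grid1D(uint gid [[thread_position_in_grid]]) { /* 1D grid, 32-bit */ }
kernel void grid2D(uint2 gid [[thread_position_in_grid]]) { /* 2D grid, 32-bit */ }
kernel void grid3D(uint3 gid [[thread_position_in_grid]]) { /* 3D grid, 32-bit */ }
kernel void grid2DShort(ushort2 gid [[thread_position_in_grid]]) { /* 2D grid, 16-bit */ }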
I am trying to use FFTW for image convolution.
At first, just to test that the system was working properly, I performed the FFT and then the inverse FFT, and got the exact same image back.
Then, as a small step forward, I used the identity kernel (i.e., kernel[0][0] = 1 and all the other components equal 0). I took the component-wise product between the image and the kernel (both in the frequency domain), then did the inverse FFT. Theoretically I should be able to get the identical image back. But the result I got is not even close to the original image. I suspect this has something to do with where I center my kernel before transforming it into the frequency domain (since I put the "1" at kernel[0][0], it basically means that I centered the positive part at the top left). Could anyone enlighten me about what goes wrong here?
For each dimension, the indexes of samples should be from -n/2 ... 0 ... n/2 -1, so if the dimension is odd, center around the middle. If the dimension is even, center so that before the new 0 you have one sample more than after the new 0.
E.g. -4, -3, -2, -1, 0, 1, 2, 3 for a width/height of 8 or -3, -2, -1, 0, 1, 2, 3 for a width/height of 7.
The FFT is relative to the middle; in its scale there are negative points.
In memory the points are 0...n-1, but the FFT treats them as -floor(n/2)...ceil(n/2)-1, so memory index 0 corresponds to -floor(n/2) and n-1 corresponds to ceil(n/2)-1 (matching the examples above).
The identity matrix is a matrix of zeros with 1 in the 0,0 location (the center - according to above numbering). (In the spatial domain.)
In the frequency domain the identity matrix should be a constant (all real values 1 or 1/(N*M) and all imaginary values 0).
If you do not receive this result, then the identity matrix might need padding differently (to the left and down instead of around all sides) - this may depend on the FFT implementation.
Center each dimension separately (this is an index centering, no change in actual memory).
You will probably need to pad the image (after centering) to a whole power of 2 in each dimension (2^n * 2^m where n doesn't have to equal m).
Pad relative to FFT's 0,0 location (to center, not corner) by copying existing pixels into a new larger image, using center-based-indexes in both source and destination images (e.g. (0,0) to (0,0), (0,1) to (0,1), (1,-2) to (1,-2))
Assuming your FFT uses regular floating-point cells and not complex cells, the complex image has to be of size 2*ceil(n/2) * 2*ceil(m/2) even if you don't need a whole power of 2 (since it has half the samples, but the samples are complex).
If your image has more than one color channel, you will first have to reshape it so that the channels are the most significant part of the sub-pixel ordering, instead of the least significant. You can reshape and pad in one go to save time and space.
Don't forget the FFTSHIFT after the IFFT. (To swap the quadrants.)
The result of the IFFT is 0...n-1. You have to take pixels floor(n/2)+1..n-1 and move them before 0...floor(n/2).
This is done by copying pixels to a new image, copying floor(n/2)+1 to memory-location 0, floor(n/2)+2 to memory-location 1, ..., n-1 to memory-location floor(n/2), then 0 to memory-location ceil(n/2), 1 to memory-location ceil(n/2)+1, ..., floor(n/2) to memory-location n-1.
When you multiply in the frequency domain, remember that the samples are complex (one cell real then one cell imaginary) so you have to use a complex multiplication.
The result might need dividing by N^2*M^2 where N is the size of n after padding (and likewise for M and m). - You can tell this by (a. looking at the frequency domain's values of the identity matrix, b. comparing result to input.)
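To make the last two points concrete, here is a small sketch of the complex pointwise multiplication and the scaling, assuming FFTW's real-to-complex layout in which an N x M real image transforms into N * (M/2 + 1) complex samples. The function name is illustrative, and the 1/(N*M) factor is the one for an unnormalized forward/inverse round trip; as suggested above, verify the exact factor for your setup by comparing the result against the input.

#include <fftw3.h>

// Pointwise complex product of two spectra produced by fftw_plan_dft_r2c_2d on
// N x M real inputs, scaled so that the subsequent (unnormalized) c2r inverse
// transform returns values on the original scale.
static void multiply_spectra(const fftw_complex *img, const fftw_complex *ker,
                             fftw_complex *out, int N, int M)
{
    const int n_complex = N * (M / 2 + 1);      // r2c stores only half the spectrum
    const double scale = 1.0 / ((double)N * (double)M);

    for (int i = 0; i < n_complex; i++) {
        const double a = img[i][0], b = img[i][1];   // a + b*i
        const double c = ker[i][0], d = ker[i][1];   // c + d*i
        out[i][0] = (a * c - b * d) * scale;         // real part of (a+bi)*(c+di)
        out[i][1] = (a * d + b * c) * scale;         // imaginary part
    }
}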
I think that your understanding of the identity kernel may be off. An identity kernel should have the 1 at the center of the 2D kernel, not at the (0, 0) position.
For example, for a 3 x 3 kernel, you have yours set up as follows:
1, 0, 0
0, 0, 0
0, 0, 0
It should be
0, 0, 0
0, 1, 0
0, 0, 0
Check this out also
What is the "do-nothing" convolution kernel
also look here, at the bottom of page 3.
http://www.fmwconcepts.com/imagemagick/digital_image_filtering.pdf
I took the component-wise product between the image and kernel in frequency domain, then did the inverse fft. Theoretically I should be able to get the identical image back.
I don't think that doing a forward transform with a non-fft kernel, and then an inverse fft transform should lead to any expectation of getting the original image back, but perhaps I'm just misunderstanding what you were trying to say there...
I'm working on image processing with CUDA and I have a question about pixel processing.
What is often done with the boundary pixels of an image when applying a m x m convolution filter?
With a 3 x 3 convolution kernel, ignoring the 1-pixel boundary of the image is easier to deal with, especially when the code is improved with shared memory. Indeed, in this case, one does not need to check whether a given pixel has its whole neighbourhood available (e.g., the pixel at coordinates (0, 0) has no left, upper-left, or upper neighbours). However, removing the 1-pixel boundary of the original image could generate partial results.
Opposite to that, I'd like to process all the pixels within the image, also when using shared-memory improvements, i.e., for example, loading 16 x 16 pixels but computing the inner 14 x 14. Also in this case, ignoring the boundary pixels generates clearer code.
What is usually done in this case?
Does anyone usually use my approach of ignoring the boundary pixels?
Of course, I'm aware the answer depends on the type of problem, i.e., adding two images pixel-wise does not have this problem.
Thanks in advance.
A common approach to dealing with border effects is to pad the original image with extra rows & columns based on your filter size. Some common choices for the padded values are:
A constant (e.g. zero)
Replicate the first and last row / column as many times as needed
Reflect the image at the borders (e.g. column[-1] = column[1], column[-2] = column[2])
Wrap the image values (e.g. column[-1] = column[width-1], column[-2] = column[width-2])
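As a minimal sketch, the last three choices can be implemented as small index-remapping helpers applied to each neighbour coordinate before reading the source pixel; the constant case is usually handled at the call site by returning the constant whenever the coordinate is out of range. The helper names are illustrative, and they assume the filter radius is smaller than the image size (so at most one bounce is needed):

__host__ __device__ inline int clampIndex(int i, int n)    // replicate first/last row or column
{
    return (i < 0) ? 0 : ((i >= n) ? n - 1 : i);
}

__host__ __device__ inline int reflectIndex(int i, int n)  // reflect at the borders
{
    if (i < 0)  i = -i;               // column[-1] -> column[1], column[-2] -> column[2]
    if (i >= n) i = 2 * n - 2 - i;    // column[n]  -> column[n-2]
    return i;
}

__host__ __device__ inline int wrapIndex(int i, int n)     // wrap around (periodic)
{
    return ((i % n) + n) % n;         // column[-1] -> column[width-1]
}

A convolution loop can then read src[reflectIndex(y + dy, height) * width + reflectIndex(x + dx, width)] (or the clamp/wrap variants) instead of branching on whether each neighbour exists.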
tl;dr: It depends on the problem you're trying to solve -- there is no solution for this that applies to all problems. In fact, mathematically speaking, I suspect there may be no "solution" at all since I believe it's an ill-posed problem you're forced to deal with.
(Apologies in advance for my reckless abuse of mathematics)
To demonstrate, let's consider a situation where all pixel components and kernel values are assumed to be positive. To get an idea of how some of these answers could lead us astray, let's further think about a simple averaging ("box") filter. If we set values outside the boundary of the image to zero, then this will clearly drag down the average at every pixel within ceil(n/2) (Manhattan distance) of the boundary. So you'll get a "dark" border on your filtered image (assuming a single intensity component or RGB colorspace -- your results will vary by colorspace!). Note that similar arguments can be made if we set the values outside the boundary to any arbitrary constant -- the average will tend towards that constant. A constant of zero might be appropriate if the edges of your typical image tend towards 0 anyway. This is also true if we consider more complex filter kernels like a Gaussian; however, the problem will be less pronounced because the kernel values tend to decrease quickly with distance from the center.
Now suppose that instead of using a constant we choose to repeat the edge values. This is the same as making a border around the image and copying rows, columns, or corners enough times to ensure the filter stays "inside" the new image. You could also think of it as clamping/saturating the sample coordinates. This has problems with our simple box filter because it overemphasizes the values of the edge pixels. A set of edge pixels will appear more than once yet they all receive the same weight w=(1/(n*n)).
Suppose we sample an edge pixel with value K 3 times. That means its contribution to the average is:
K*w + K*w + K*w = K*3*w
So effectively that one pixel has a higher weight in the average. Note that since this is an average filter, the weight is constant over the kernel. However, this argument applies to kernels with weights that vary by position too (again: think of the Gaussian kernel).
Suppose we wrap or reflect the sampling coordinates so that we're still using values from within the boundary of the image. This has some valuable advantages over using a constant, but isn't necessarily "correct" either. For instance, how many photos do you take where the objects at the upper border are similar to those at the bottom? Unless you're taking pictures of mirror-smooth lakes, I doubt this is true. If you're taking pictures of rocks to use as textures in games, wrapping or reflecting could be appropriate. I'm sure there are significant points to be made here about how wrapping and reflecting will likely reduce any artifacts that result from using a Fourier transform. However, this comes back to the same idea: that you have a periodic signal which you do not wish to distort by introducing spurious new frequencies or overestimating the amplitude of existing frequencies.
So what can you do if you're filtering photos of bright red rocks beneath a blue sky? Clearly you don't want to add orange-ish haze in the blue sky and blue-ish fuzz on the red rocks. Reflecting the sample coordinate works because we expect similar colors to those pixels found at the reflected coordinates... unless, just for the sake of argument, we imagine the filter kernel is so big that the reflected coordinate would extend past the horizon.
Let's go back to the box filter example. An alternative with this filter is to stop thinking about using a static kernel and think back to what this kernel was meant to do. An averaging/box filter is designed to sum the pixel components and then divide by the number of pixels summed. The idea is that this smooths out noise. If we're willing to trade reduced effectiveness in suppressing noise near the boundary, we can simply sum fewer pixels and divide by a correspondingly smaller number. This can be extended to filters with similar what-I-will-call-"normalizing" terms -- terms that are related to the area or volume of the filter. For "area" terms you count the number of kernel weights that are within the boundary and ignore those weights that are not, then use this count as the "area" (which might involve an extra multiplication). For volume (again: assuming positive weights!), simply sum the kernel weights. This idea is probably awful for derivative filters because there are fewer pixels to compete with the noisy pixels, and differentials are notoriously sensitive to noise. Also, some filters have been derived by numeric optimization and/or empirical data rather than from ab-initio/analytic methods, and thus may lack a readily apparent "normalizing" factor.
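Here is a sketch of that renormalization idea for the box filter: it sums only the taps that fall inside the image and divides by how many were actually summed. The kernel name and the single-channel row-major layout are assumptions made for illustration.

// Box filter of radius r that renormalizes near the border: instead of padding,
// it sums only the in-bounds neighbours and divides by their count.
__global__ void boxFilterRenormalized(const float *src, float *dst,
                                      int width, int height, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    int count = 0;                      // the "area" term shrinks near the border

    for (int dy = -r; dy <= r; dy++) {
        for (int dx = -r; dx <= r; dx++) {
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                sum += src[ny * width + nx];
                count++;
            }
        }
    }

    dst[y * width + x] = sum / (float)count;   // count >= 1: (x, y) itself is always in bounds
}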
Your question is somewhat broad and I believe it mixes two problems:
dealing with boundary conditions;
dealing with halo regions.
The first problem (boundary conditions) is encountered, for example, when computing the convolution between an image and a 3 x 3 kernel. When the convolution window comes across the boundary, one has the problem of extending the image outside of its boundaries.
The second problem (halo regions) is encountered, for example, when loading a 16 x 16 tile within shared memory and one has to process the internal 14 x 14 tile to compute second order derivatives.
For the second issue, I think a useful question is the following: Analyzing memory access coalescing of my CUDA kernel.
Concerning the extension of a signal outside of its boundaries, a useful tool is provided in this case by texture memory thanks to the different provided addressing modes, see The different addressing modes of CUDA textures.
Below, I'm providing an example of how a median filter can be implemented with periodic boundary conditions using texture memory.
#include <stdio.h>
#include "TimingGPU.cuh"
#include "Utilities.cuh"
texture<float, 1, cudaReadModeElementType> signal_texture;
#define BLOCKSIZE 32
/*************************************************/
/* KERNEL FUNCTION FOR MEDIAN FILTER CALCULATION */
/*************************************************/
__global__ void median_filter_periodic_boundary(float * __restrict__ d_vec, const unsigned int N){

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        float signal_center = tex1D(signal_texture, tid - 0);
        float signal_before = tex1D(signal_texture, tid - 1);
        float signal_after  = tex1D(signal_texture, tid + 1);

        printf("%i %f %f %f\n", tid, signal_before, signal_center, signal_after);

        d_vec[tid] = (signal_center + signal_before + signal_after) / 3.f;
    }
}
/********/
/* MAIN */
/********/
int main() {

    const int N = 10;

    // --- Input host array declaration and initialization
    float *h_arr = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) h_arr[i] = (float)i;

    // --- Output host and device array vectors
    float *h_vec = (float *)malloc(N * sizeof(float));
    float *d_vec; gpuErrchk(cudaMalloc(&d_vec, N * sizeof(float)));

    // --- CUDA array declaration and texture memory binding; CUDA array initialization
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    //Alternatively
    //cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

    cudaArray *d_arr; gpuErrchk(cudaMallocArray(&d_arr, &channelDesc, N, 1));
    gpuErrchk(cudaMemcpyToArray(d_arr, 0, 0, h_arr, N * sizeof(float), cudaMemcpyHostToDevice));

    cudaBindTextureToArray(signal_texture, d_arr);
    signal_texture.normalized = false;
    signal_texture.addressMode[0] = cudaAddressModeWrap;

    // --- Kernel execution
    median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_vec, d_vec, N * sizeof(float), cudaMemcpyDeviceToHost));

    for (int i=0; i<N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    printf("Test finished\n");

    return 0;
}