Image analysis through GPU (with OpenGL ES) or CPU?

Image analysis through GPU (with OpenGL ES) or CPU? - ios

I'd like to analyze the constantly updating image feed that comes from an iPhone camera to determine a general "lightness coefficient". Meaning: if the coefficient returns 0.0, the image is completely black, if it returns 1.0 the image is completely white. Of course all values in between are the ones that I care about the most (background info: I'm using this coefficient to calculate the intensity of some blending effects in my fragment shader).
So I'm wondering if I should run a for loop over my pixelbuffer and analyze the image every frame (30 fps) and send the coeff as a uniform to my fragment shader or is there a way to analyze my image in OpenGL. If so, how should I do that?

There are more answers, each one with its own strong and weak points.
On CPU, it's fairly simple: cycle through the pixels, sum them up, divide, and that's it. It's a five minutes work. It will take a few milliseconds with a good implementation.
int i, sum = 0, count = width * height * channels;
for(i=0;i<count;i++)
avg += buffer[i];
double avg = double(sum) / double(count);
On GPU, it is likely to be much faster, but there are a few drawbacks: First one is the amount of work needed just to put everything in place. The GPUImage framework will save you some work, but it will also add a lot of code. If all you want to do is to sum the pixels, it may be a waste. Second problem is that sending pixels to GPU may take more than summing them up in the CPU. Actually, the GPU will justify the work only if you really need serious processing.
A third option, to use the CPU with a library, has the drawback that you add a lot of code for what you can do in 10 lines. But the result will be beautiful. Again, it justify if you also use the lib for other tasks. Here is an example in OpenCV:
cv::Mat frame(buffer, width, height, channels, type);
double avgLuminance = cv::sum(frame)/(double(frame.total()*frame.channels()));

There is of course OpenCL, which allows you to use your GPU for general processing.

Related

Efficiently count how many transparent pixels are in UIImage/CIImage with Metal

What is the fastest way we can count how many transparent pixels exist in CIImage/UIImage?
For example:
My first thought, if we speak about efficiency, is to use Metal Kernel using either CIColorKernel or so, but I can't understand how to use it to output "count".
Also other ideas I had in mind:
Use some kind of average color to calculate it, the "redder" the more filled with pixels? Maybe some kind of linear calculation depends on the image size (using CIAreaAverage CIFilter?
Count pixels one by one and check the RGB values?
Using Metal parallel capabilities, similar to this post: Counting coloured pixels on the GPU - Theory?
Scale down the image and then count? Or do all the other processes suggested above just with scaled than version, and multiple it back depends on the scale down proportions after calculating?
What is the fastest way to achieve this count?

To answer your question how to do it metal, you would use device atomic_int.
Essentially you create an Int MTLBuffer and pass it to your kernel and increment it with atomic_fetch_add_explicit.
Create buffer once:
var bristleCounter = 0
counterBuffer = device.makeBuffer(bytes: &bristleCounter, length: MemoryLayout<Int>.size, options: [.storageModeShared])
Reset counter to 0 and binding counter buffer:
var z = 0
counterBuffer.contents().copyMemory(from: &z, byteCount: MemoryLayout<Int>.size)
kernelEncoder.setBuffer(counterBuffer, offset: 0, index: 0)
Kernel:
kernel void myKernel (device atomic_int *counter [[buffer(0)]]) {}
Increment counter in Kernel (and get the value):
int newCounterValue = atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
Get the counter on the CPU side:
kernelEncoder.endEncoding()
kernelBuffer.commit()
kernelBuffer.waitUntilCompleted()
//Counter from kernel now in counterBuffer
let bufPointer = counterBuffer.contents().load(as: Int.self)
print("Counter: \(bufPointer)")

What you want to perform is a reduction operation, which is not necessarily well-suited for the GPU due to its massively parallel nature. I'd recommend not writing a reduction operation for the GPU yourself, but rather use some highly optimized built-in APIs that Apple provides (like CIAreaAverage or the corresponding Metal Performance Shaders).
The most efficient way depends a bit on your use case, specifically where the image comes from (loaded via UIImage/CGImage or the result of a Core Image pipeline?) and where you'd need the resulting count (on the CPU/Swift side or as an input for another Core Image filter?).
It also depends on if the pixels could also be semi-transparent (alpha not 0.0 or 1.0).
If the image is on the GPU and/or the count should be used on the GPU, I'd recommend using CIAreaAverage. The alpha value of the result should reflect the percentage of transparent pixels. Note that this only works if there are now semi-transparent pixels.
The next best solution is probably just iterating the pixel data on the CPU. It might be a few million pixels, but the operation itself is very fast so this should take almost no time. You could even use multi-threading by splitting the image up in chunks and use concurrentPerform(...) of DispatchQueue.
A last, but probably overkill solution would be to use Accelerate (this would make #FlexMonkey happy): Load the image's pixel data into a vDSP buffer and use the sum or average methods to calculate the percentage using the CPU's vector units.
Clarification
When I was saying that a reduction operation is "not necessarily well-suited for the GPU", I meant to say that it's rather complicated to implement in an efficient way and by far not as straightforward as a sequential algorithm.
The check whether a pixel is transparent or not can be done in parallel, sure, but the results need to be gathered into a single value, which requires multiple GPU cores reading and writing values into the same memory. This usually requires some synchronization (and thereby hinders parallel execution) and incurs latency cost due to access to the shared or global memory space. That's why efficient gather algorithms for the GPU usually follow a multi-step tree-based approach. I can highly recommend reading NVIDIA's publications on the topic (e.g. here and here). That's also why I recommended using built-in APIs when possible since Apple's Metal team knows how to best optimize these algorithms for their hardware.
There is also an example reduction implementation in Apple's Metal Shading Language Specification (pp. 158) that uses simd_shuffle intrinsics for efficiently communicating intermediate values down the tree. The general principle is the same as described by NVIDIA's publications linked above, though.

If the image contains semitransparent pixels, it can be easily preprocessed to make all pixels with alpha below certain threshold fully transparent, or fully opaque otherwise. Then the CIAreaAverage could be applied, as was originally suggested in the question, and finally the approximate number of the fully opaque pixels can be calculated by multiplying alpha component of the result by the image size.
For pre-processing we could use a trivial CIColorKernel like this:
half4 clampAlpha(coreimage::sample_t color) {
half4 out = half4(color);
out.a = step(half(0.99), out.a);
return out;
}
(Choose whatever threshold you like instead of 0.99)
To get the alpha component out of the output of CIAreaAverage we could do something like this:
let context = CIContext(options: [.workingColorSpace: NSNull(), .outputColorSpace: NSNull()])
var color: [Float] = [0, 0, 0, 0]
context.render(output,
toBitmap: &color,
rowBytes: MemoryLayout<Float>.size * 4,
bounds: CGRect(origin: .zero, size: CGSize(width: 1, height: 1)),
format: .RGBAf,
colorSpace: nil)
// color[3] contains alpha component of the result
With that approach everything is done on GPU while taking advantage of its inherent parallelism.
BTW, check this app out https://apps.apple.com/us/app/filter-magic/id1594986951. It lets you play with every single CoreImage filter out there.

Simple way to check if an image bitmap is blur

I am looking for a "very" simple way to check if an image bitmap is blur. I do not need accurate and complicate algorithm which involves fft, wavelet, etc. Just a very simple idea even if it is not accurate.
I've thought to compute the average euclidian distance between pixel (x,y) and pixel (x+1,y) considering their RGB components and then using a threshold but it works very bad. Any other idea?

Don't calculate the average differences between adjacent pixels.
Even when a photograph is perfectly in focus, it can still contain large areas of uniform colour, like the sky for example. These will push down the average difference and mask the details you're interested in. What you really want to find is the maximum difference value.
Also, to speed things up, I wouldn't bother checking every pixel in the image. You should get reasonable results by checking along a grid of horizontal and vertical lines spaced, say, 10 pixels apart.
Here are the results of some tests with PHP's GD graphics functions using an image from Wikimedia Commons (Bokeh_Ipomea.jpg). The Sharpness values are simply the maximum pixel difference values as a percentage of 255 (I only looked in the green channel; you should probably convert to greyscale first). The numbers underneath show how long it took to process the image.
If you want them, here are the source images I used:
original
slightly blurred
blurred
Update:
There's a problem with this algorithm in that it relies on the image having a fairly high level of contrast as well as sharp focused edges. It can be improved by finding the maximum pixel difference (maxdiff), and finding the overall range of pixel values in a small area centred on this location (range). The sharpness is then calculated as follows:
sharpness = (maxdiff / (offset + range)) * (1.0 + offset / 255) * 100%
where offset is a parameter that reduces the effects of very small edges so that background noise does not affect the results significantly. (I used a value of 15.)
This produces fairly good results. Anything with a sharpness of less than 40% is probably out of focus. Here's are some examples (the locations of the maximum pixel difference and the 9×9 local search areas are also shown for reference):
(source)
(source)
(source)
(source)
The results still aren't perfect, though. Subjects that are inherently blurry will always result in a low sharpness value:
(source)
Bokeh effects can produce sharp edges from point sources of light, even when they are completely out of focus:
(source)
You commented that you want to be able to reject user-submitted photos that are out of focus. Since this technique isn't perfect, I would suggest that you instead notify the user if an image appears blurry instead of rejecting it altogether.

I suppose that, philosophically speaking, all natural images are blurry...How blurry and to which amount, is something that depends upon your application. Broadly speaking, the blurriness or sharpness of images can be measured in various ways. As a first easy attempt I would check for the energy of the image, defined as the normalised summation of the squared pixel values:
1 2
E = --- Σ I, where I the image and N the number of pixels (defined for grayscale)
N
First you may apply a Laplacian of Gaussian (LoG) filter to detect the "energetic" areas of the image and then check the energy. The blurry image should show considerably lower energy.
See an example in MATLAB using a typical grayscale lena image:
This is the original image
This is the blurry image, blurred with gaussian noise
This is the LoG image of the original
And this is the LoG image of the blurry one
If you just compute the energy of the two LoG images you get:
E = 1265 E = 88
or bl
which is a huge amount of difference...
Then you just have to select a threshold to judge which amount of energy is good for your application...

calculate the average L1-distance of adjacent pixels:
N1=1/(2*N_pixel) * sum( abs(p(x,y)-p(x-1,y)) + abs(p(x,y)-p(x,y-1)) )
then the average L2 distance:
N2= 1/(2*N_pixel) * sum( (p(x,y)-p(x-1,y))^2 + (p(x,y)-p(x,y-1))^2 )
then the ratio N2 / (N1*N1) is a measure of blurriness. This is for grayscale images, for color you do this for each channel separately.

Fast image thresholding

What is a fast and reliable way to threshold images with possible blurring and non-uniform brightness?
Example (blurring but uniform brightness):
Because the image is not guaranteed to have uniform brightness, it's not feasible to use a fixed threshold. An adaptive threshold works alright, but because of the blurriness it creates breaks and distortions in the features (here, the important features are the Sudoku digits):
I've also tried using Histogram Equalization (using OpenCV's equalizeHist function). It increases contrast without reducing differences in brightness.
The best solution I've found is to divide the image by its morphological closing (credit to this post) to make the brightness uniform, then renormalize, then use a fixed threshold (using Otsu's algorithm to pick the optimal threshold level):
Here is code for this in OpenCV for Android:
Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_ELLIPSE, new Size(19,19));
Mat closed = new Mat(); // closed will have type CV_32F
Imgproc.morphologyEx(image, closed, Imgproc.MORPH_CLOSE, kernel);
Core.divide(image, closed, closed, 1, CvType.CV_32F);
Core.normalize(closed, image, 0, 255, Core.NORM_MINMAX, CvType.CV_8U);
Imgproc.threshold(image, image, -1, 255, Imgproc.THRESH_BINARY_INV
+Imgproc.THRESH_OTSU);
This works great but the closing operation is very slow. Reducing the size of the structuring element increases speed but reduces accuracy.
Edit: based on DCS's suggestion I tried using a high-pass filter. I chose the Laplacian filter, but I would expect similar results with Sobel and Scharr filters. The filter picks up high-frequency noise in the areas which do not contain features, and suffers from similar distortion to the adaptive threshold due to blurring. it also takes about as long as the closing operation. Here is an example with a 15x15 filter:
Edit 2: Based on AruniRC's answer, I used Canny edge detection on the image with the suggested parameters:
double mean = Core.mean(image).val[0];
Imgproc.Canny(image, image, 0.66*mean, 1.33*mean);
I'm not sure how to reliably automatically fine-tune the parameters to get connected digits.

Using Vaughn Cato and Theraot's suggestions, I scaled down the image before closing it, then scaled the closed image up to regular size. I also reduced the kernel size proportionately.
Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_ELLIPSE, new Size(5,5));
Mat temp = new Mat();
Imgproc.resize(image, temp, new Size(image.cols()/4, image.rows()/4));
Imgproc.morphologyEx(temp, temp, Imgproc.MORPH_CLOSE, kernel);
Imgproc.resize(temp, temp, new Size(image.cols(), image.rows()));
Core.divide(image, temp, temp, 1, CvType.CV_32F); // temp will now have type CV_32F
Core.normalize(temp, image, 0, 255, Core.NORM_MINMAX, CvType.CV_8U);
Imgproc.threshold(image, image, -1, 255,
Imgproc.THRESH_BINARY_INV+Imgproc.THRESH_OTSU);
The image below shows the results side-by-side for 3 different methods:
Left - regular size closing (432 pixels), size 19 kernel
Middle - half-size closing (216 pixels), size 9 kernel
Right - quarter-size closing (108 pixels), size 5 kernel
The image quality deteriorates as the size of the image used for closing gets smaller, but the deterioration isn't significant enough to affect feature recognition algorithms. The speed increases slightly more than 16-fold for the quarter-size closing, even with the resizing, which suggests that closing time is roughly proportional to the number of pixels in the image.
Any suggestions on how to further improve upon this idea (either by further reducing the speed, or reducing the deterioration in image quality) are very welcome.

Alternative approach:
Assuming your intention is to have the numerals to be clearly binarized ... shift your focus to components instead of the whole image.
Here's a pretty easy approach:
Do a Canny edgemap on the image. First try it with parameters to Canny function in the range of the low threshold to 0.66*[mean value] and the high threshold to 1.33*[mean value]. (meaning the mean of the greylevel values).
You would need to fiddle with the parameters a bit to get an image where the major components/numerals are visible clearly as separate components. Near perfect would be good enough at this stage.
Considering each Canny edge as a connected component (i.e. use the cvFindContours() or its C++ counterpart, whichever) one can estimate the foreground and background greylevels and reach a threshold.
For the last bit, do take a look at sections 2. and 3. of this paper. Skipping most of the non-essential theoretical parts it shouldn't be too difficult to have it implemented in OpenCV.
Hope this helped!
Edit 1:
Based on the Canny edge thresholds here's a very rough idea just sufficient to fine-tune the values. The high_threshold controls how strong an edge must be before it is detected. Basically, an edge must have gradient magnitude greater than high_threshold to be detected in the first place. So this does the initial detection of edges.
Now, the low_threshold deals with connecting nearby edges. It controls how much nearby disconnected edges will get combined together into a single edge. For a better idea, read "Step 6" of this webpage. Try setting a very small low_threshold and see how things come about. You could discard that 0.66*[mean value] thing if it doesn't work on these images - its just a rule of thumb anyway.

We use Bradleys algorithm for very similar problem (to segment letters from background, with uneven light and uneven background color), described here: http://people.scs.carleton.ca:8008/~roth/iit-publications-iti/docs/gerh-50002.pdf, C# code here: http://code.google.com/p/aforge/source/browse/trunk/Sources/Imaging/Filters/Adaptive+Binarization/BradleyLocalThresholding.cs?r=1360. It works on integral image, which can be calculated using integral function of OpenCV. It is very reliable and fast, but itself is not implemented in OpenCV, but is easy to port.
Another option is adaptiveThreshold method in openCV, but we did not give it a try: http://docs.opencv.org/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold. The MEAN version is the same as bradleys, except that it uses a constant to modify the mean value instead of a percentage, which I think is better.
Also, good article is here: https://dsp.stackexchange.com/a/2504

You could try working on a per-tile basis if you know you have a good crop of the grid. Working on 9 subimages rather than the whole pic will most likely lead to more uniform brightness on each subimage. If your cropping is perfect you could even try going for each digit cell individually; but it all depends on how reliable is your crop.

Ellipse shape is complex to calculate if compared to a flat shape.
Try to change:
Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_ELLIPSE, new Size(19,19));
to:
Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(19,19));
can speed up your enough solution with low impact to accuracy.

Why is cv::resize so slow?

I'm doing some edge detection on a live video feed:
- (void)processImage:(Mat&)image;
{
cv::resize(image, smallImage, cv::Size(288,352), 0, 0, CV_INTER_CUBIC);
edgeDetection(smallImage);
cv::resize(smallImage, image, image.size(), 0, 0, CV_INTER_LINEAR);
}
edgeDetection does some fairly heavy lifting, and was running at quite a low framerate with the video frame size of 1280x720. Adding in the resize calls dramatically decreased the framerate, quite the reverse of what I was expecting. Is this just because a resize operation is slow, or becuase I'm doing something wrong?
smallImage is declared in the header thus:
#interface CameraController : UIViewController
<CvVideoCameraDelegate>
{
Mat smallImage;
}
There is no initialisation of it, and it works ok.

Resizing an image is slow, and you are doing it twice for each processed frame. There are several ways to somehow improve your solution but you have to provide more details about the problem you are trying to solve.
To begin with, resizing an image before detecting edges will result in feeding the edge detection with less information so it will result in less edges being detected - or at least it will make it harder to detect them.
Also the resizing algorithm used affects its speed, CV_INTER_LINEAR is the fastest for cv::resize if my memory does not fail - and you are using CV_INTER_CUBIC for the first resize.
One alternative to resize an image is to instead process a smaller region of the original image. To that you should take a look at opencv image ROI's (region of interest). It is quite easy to do, you have lots of questions in this site regarding those. The downside is that you will be only detecting edges in a region and not for the whole image, that might be fine, depending on the problem.
If you really want to resize the images, opencv developers usually use the pyrDown and pyrUp functions when they want to process smaller images, instead of resize. I think it is because it is faster, but you can test it to be sure. More information about pyrDown and pyrUp in this link.
About cv::resize algorithms, here is the list:
INTER_NEAREST - a nearest-neighbor interpolation
INTER_LINEAR - a bilinear interpolation (used by default)
INTER_AREA - resampling using pixel area relation. It may be a preferred method for image decimation, as it gives moire’-free results. But when the image is zoomed, it is similar to the INTER_NEAREST method.
INTER_CUBIC - a bicubic interpolation over 4x4 pixel neighborhood
INTER_LANCZOS4 - a Lanczos interpolation over 8x8 pixel neighborhood
Can't say for sure if INTER_LINEAR is the fastest of them all but it is for sure faster than INTER_CUBIC.

INTER_NEAREST is the fastest and has the worst quality results. In the downscale, for each pixel, it just uses the pixel nearest to the hypothetical place.
INTER_LINEAR is a good compromise of performance and quality, but it is slower than INTER_NEAREST.
INTER_CUBIC is slower than INTER_LINEAR because it uses more interpolation.
INTER_LANCZOS4 is the algorithm with the best quality result, but it is slower than the others.
Here you can find a good comparison article: http://tanbakuchi.com/posts/comparison-of-openv-interpolation-algorithms/

Time trial on 4 core CPU (not GPU).
From: (1440, 2560, 3)
To: (300, 300, 3)
Fastest to slowest:
INTER_NEAREST resize: Time Taken: 0:00:00.001024
INTER_LINEAR resize: Time Taken: 0:00:00.004321
INTER_CUBIC resize: Time Taken: 0:00:00.007929
INTER_LANCZOS4 resize: Time Taken: 0:00:00.021042
INTER_AREA resize: Time Taken: 0:00:00.065569

Dealing with Boundary conditions / Halo regions in CUDA

I'm working on image processing with CUDA and i've a doubt about pixel processing.
What is often done with the boundary pixels of an image when applying a m x m convolution filter?
In a 3 x 3 convolution kernel, ignoring the 1 pixel boundary of the image is easier to deal with, especially when the code is improved with shared memory. Indeed, in this case, one does not need to check if a given pixel has all the neigbourhood available (i.e. pixel at coord (0, 0) has not left, left-upper, upper neighbours). However, removing the 1 pixel boundary of the original image could generate partial results.
Opposite to that, I'd like to process all the pixels within the image, also when using shared memory improvements, i.e., for example, loading 16 x 16 pixels, but computing the inner 14 x 14. Also in this case, ignoring the boundary pixels generates a clearer code.
What is usually done in this case?
Does anyone usually use my approach ignoring the boundary pixels?
Of course, I'm aware the answer depends on the type of problem, i.e. adding two images pixel-wise has not this problem.
Thanks in advance.

A common approach to dealing with border effects is to pad the original image with extra rows & columns based on your filter size. Some common choices for the padded values are:
A constant (e.g. zero)
Replicate the first and last row / column as many times as needed
Reflect the image at the borders (e.g. column[-1] = column[1], column[-2] = column[2])
Wrap the image values (e.g. column[-1] = column[width-1], column[-2] = column[width-2])

tl;dr: It depends on the problem you're trying to solve -- there is no solution for this that applies to all problems. In fact, mathematically speaking, I suspect there may be no "solution" at all since I believe it's an ill-posed problem you're forced to deal with.
(Apologies in advance for my reckless abuse of mathematics)
To demonstrate let's consider a situation where all pixel components and kernel values are assumed to be positive. To get an idea of how some of these answers could lead us astray let's further think about a simple averaging ("box") filter. If we set values outside the boundary of the image to zero then this will clearly drag down the average at every pixel within ceil(n/2) (manhattan distance) of the boundary. So you'll get a "dark" border on your filtered image (assuming a single intensity component or RGB colorspace -- your results will vary by colorspace!). Note that similar arguments can be made if we set the values outside the boundary to any arbitrary constant -- the average will tend towards that constant. A constant of zero might be appropriate if the edges of your typical image tend towards 0 anyway. This is also true if we consider more complex filter kernels like a gaussian however the problem will be less pronounced because the kernel values tend to decrease quickly with distance from the center.
Now suppose that instead of using a constant we choose to repeat the edge values. This is the same as making a border around the image and copying rows, columns, or corners enough times to ensure the filter stays "inside" the new image. You could also think of it as clamping/saturating the sample coordinates. This has problems with our simple box filter because it overemphasizes the values of the edge pixels. A set of edge pixels will appear more than once yet they all receive the same weight w=(1/(n*n)).
Suppose we sample an edge pixel with value K 3 times. That means its contribution to the average is:
K*w + K*w + K*w = K*3*w
So effectively that one pixel has a higher weight in the average. Note that since this is an average filter the weight is a constant over the kernel. However this argument applies to kernels with weights that vary by position too (again: think of the gaussian kernel..).
Suppose we wrap or reflect the sampling coordinates so that we're still using values from within the boundary of the image. This has some valuable advantages over using a constant but isn't necessarily "correct" either. For instance, how many photos do you take where the objects at the upper border are similar to those at the bottom? Unless you're taking pictures of mirror-smooth lakes I doubt this is true. If you're taking pictures of rocks to use as textures in games wrapping or reflecting could be appropriate. I'm sure there are significant points to be made here about how wrapping and reflecting will likely reduce any artifacts that result from using a fourier transform. However this comes back to the same idea: that you have a periodic signal which you do not wish to distort by introducing spurious new frequencies or overestimating the amplitude of existing frequencies.
So what can you do if you're filtering photos of bright red rocks beneath a blue sky? Clearly you don't want to add orange-ish haze in the blue sky and blue-ish fuzz on the red rocks. Reflecting the sample coordinate works because we expect similar colors to those pixels found at the reflected coordinates... unless, just for the sake of argument, we imagine the filter kernel is so big that the reflected coordinate would extend past the horizon.
Let's go back to the box filter example. An alternative with this filter is to stop thinking about using a static kernel and think back to what this kernel was meant to do. An averaging/box filter is designed to sum the pixel components then divide by the number of pixels summed. The idea is that this smooths out noise. If we're willing to trade a reduced effectiveness in suppressing noise near the boundary we can simply sum fewer pixels and divide by a correspondingly smaller number. This can be extended to filters with similar what-I-will-call-"normalizing" terms -- terms that are related to the area or volume of the filter. For "area" terms you count the number of kernel weights that are within the boundary and ignore those weights that are not. Then use this count as the "area" (which might involve a extra multiplication). For volume (again: assuming positive weights!) simply sum the kernel weights. This idea is probably awful for derivative filters because there are fewer pixels to compete with the noisy pixels and differentials are notoriously sensitive to noise. Also, some filters have been derived by numeric optimization and/or empirical data rather than from ab-initio/analytic methods and thus may lack a readily apparent "normalizing" factor.

Your question is somewhat broad and I believe it mixes two problems:
dealing with boundary conditions;
dealing with halo regions.
The first problem (boundary conditions) is encountered, for example, when computing the convolution between and image and a 3 x 3 kernel. When the convolution window comes across the boundary, one has the problem of extending the image outside of its boundaries.
The second problem (halo regions) is encountered, for example, when loading a 16 x 16 tile within shared memory and one has to process the internal 14 x 14 tile to compute second order derivatives.
For the second issue, I think a useful question is the following: Analyzing memory access coalescing of my CUDA kernel.
Concerning the extension of a signal outside of its boundaries, a useful tool is provided in this case by texture memory thanks to the different provided addressing modes, see The different addressing modes of CUDA textures.
Below, I'm providing an example on how a median filter can be implemented with periodic boundary conditions using texture memory.
#include <stdio.h>
#include "TimingGPU.cuh"
#include "Utilities.cuh"
texture<float, 1, cudaReadModeElementType> signal_texture;
#define BLOCKSIZE 32
/*************************************************/
/* KERNEL FUNCTION FOR MEDIAN FILTER CALCULATION */
/*************************************************/
__global__ void median_filter_periodic_boundary(float * __restrict__ d_vec, const unsigned int N){
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < N) {
float signal_center = tex1D(signal_texture, tid - 0);
float signal_before = tex1D(signal_texture, tid - 1);
float signal_after = tex1D(signal_texture, tid + 1);
printf("%i %f %f %f\n", tid, signal_before, signal_center, signal_after);
d_vec[tid] = (signal_center + signal_before + signal_after) / 3.f;
}
}
/********/
/* MAIN */
/********/
int main() {
const int N = 10;
// --- Input host array declaration and initialization
float *h_arr = (float *)malloc(N * sizeof(float));
for (int i = 0; i < N; i++) h_arr[i] = (float)i;
// --- Output host and device array vectors
float *h_vec = (float *)malloc(N * sizeof(float));
float *d_vec; gpuErrchk(cudaMalloc(&d_vec, N * sizeof(float)));
// --- CUDA array declaration and texture memory binding; CUDA array initialization
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
//Alternatively
//cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaArray *d_arr; gpuErrchk(cudaMallocArray(&d_arr, &channelDesc, N, 1));
gpuErrchk(cudaMemcpyToArray(d_arr, 0, 0, h_arr, N * sizeof(float), cudaMemcpyHostToDevice));
cudaBindTextureToArray(signal_texture, d_arr);
signal_texture.normalized = false;
signal_texture.addressMode[0] = cudaAddressModeWrap;
// --- Kernel execution
median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaMemcpy(h_vec, d_vec, N * sizeof(float), cudaMemcpyDeviceToHost));
for (int i=0; i<N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);
printf("Test finished\n");
return 0;
}

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart