Are GPUs good for case-based image filtering? - image-processing

I am trying to figure out whether a certain problem is a good candidate for using CUDA to put it on a GPU.
I am essentially doing a box filter that changes based on some edge detection. There are basically 8 cases that are tested for each pixel, and then the rest of the operations happen - typical mean calculations and such. Is the presence of these switch statements in my loop going to make this problem a bad candidate for the GPU?
I am not sure how to avoid the switch statements, because the edge detection has to happen at every pixel. I suppose the edge detection could be split out from the processing algorithm, storing a buffer that records which filter to use for each pixel, but that seems like it would add a lot of pre-processing to the algorithm.
Edit: Just to give some context - this algorithm is already written, and OpenMP has been used to pretty good effect at speeding it up. However, the 8 cores on my development box pale in comparison to the 512 in the GPU.

Edge detection, mean calculations and cross-correlation can all be implemented as 2D convolutions, and convolutions can be implemented very effectively on the GPU (speed-ups > 10x, up to 100x relative to the CPU), especially for large kernels. So yes, it may make sense to rewrite the image filtering for the GPU.
Though I wouldn't use the GPU as a development platform for such a method.
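For instance, the mean (box) filter and an edge kernel are both plain 2D convolutions. Here is a minimal CPU-side sketch with SciPy to make that concrete (a GPU version would swap in a CUDA convolution, but the structure is the same):

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(512, 512).astype(np.float32)

# 3x3 box (mean) filter expressed as a convolution kernel
box = np.full((3, 3), 1.0 / 9.0, dtype=np.float32)

# Sobel kernel for horizontal edge detection
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

mean_image = convolve2d(image, box, mode='same', boundary='symm')
edges_x = convolve2d(image, sobel_x, mode='same', boundary='symm')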

Typically, unless you are on the newest CUDA architecture, you will want to avoid branching: because GPUs are basically SIMD machines, the pipeline is extremely vulnerable to, and suffers tremendously from, stalls caused by branch divergence.
If you think there are significant benefits to be gained by using a GPU, do some preliminary benchmarks to get a rough idea.
If you want to learn a bit about how to write non-branching code, head over to http://cellperformance.beyond3d.com/ and have a look.
Further, investigating whether this problem can run on multiple CPU cores might also be worth it - in which case you will probably want to look into either OpenCL or Intel's performance libraries (such as TBB).
Another go-to source for problems targeting the GPU, be it graphics, computational geometry or otherwise, is IDAV, the Institute for Data Analysis and Visualization: http://idav.ucdavis.edu

Branching is actually not that bad, if there is spatial coherence in the branching. In other words, if you expect chunks of adjacent pixels to go through the same branch, the performance hit is minimized.

Using a GPU for processing can often be counter-intuitive: things that are obviously inefficient in normal serial code are actually the best way to do it in parallel on the GPU.
The pseudo-code below looks inefficient (since it computes 8 filtered values for every pixel) but will run efficiently on a GPU:
# Compute the 8 possible filtered values for each pixel
for i = 1..8
    # filter[i] is the box filter to apply to pixels of the i'th edge type
    result[i] = GPU_RunBoxFilter(filter[i], Image)

# Compute the edge type of each pixel
# (this is the value you would normally 'switch' on)
edge_type = GPU_ComputeEdgeType(Image)

# Set up an empty result image
final_result = zeros(sizeof(Image))

# For each possible switch value, replace all pixels of that edge type
# with the corresponding filtered value
for i = 1..8
    final_result = GPU_ReplacePixelIfTrue(final_result, result[i], edge_type == i)
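For what it's worth, here is a hedged NumPy sketch of the same compute-everything-then-select pattern (the GPU_* calls above are placeholders; here uniform_filter and the filter sizes stand in for the question's 8 edge-type-specific box filters):

import numpy as np
from scipy.ndimage import uniform_filter

def filter_then_select(image, edge_type, sizes=(3, 3, 5, 5, 7, 7, 9, 9)):
    # Run every candidate filter over the whole image...
    results = [uniform_filter(image, size=s) for s in sizes]
    # ...then, per pixel, keep the result matching its edge type (1..8)
    final = np.zeros_like(image)
    for i, res in enumerate(results, start=1):
        final = np.where(edge_type == i, res, final)
    return final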
Hopefully that helps!

Yep, control flow usually carries a performance penalty on the GPU, be it ifs, switches or ternary operators, because divergent control flow prevents the GPU from running its threads optimally. So the usual tactic is to avoid branching where possible. In some cases an IF can be replaced by a formula, where the IF conditions map to formula coefficients. But the concrete solution/optimization depends on the concrete GPU kernel... Maybe you can show the exact code, so it can be analyzed further by the Stack Overflow community.
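To illustrate the conditions-to-coefficients idea with a hedged toy example (x, t, a and b are arbitrary placeholder values):

x, t, a, b = 0.7, 0.5, 2.0, 0.5   # placeholder input, threshold, gains
# Branchy version:
#     y = a * x if x > t else b * x
# Branch-free version - the condition becomes a 0/1 coefficient:
w = float(x > t)
y = (w * a + (1.0 - w) * b) * x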
EDIT:
Just in case you are interested, here is a convolution pixel shader that I wrote.

Related

DBSCAN: How to Cluster Large Dataset with One Huge Cluster

I am trying to perform DBSCAN on 18 million data points, so far just 2D but hoping to go up to 6D. I have not been able to find a way to run DBSCAN on that many points. The closest I got was 1 million with ELKI, and that took an hour. I have used Spark before, but unfortunately it does not have DBSCAN available.
Therefore, my first question is whether anyone can recommend a way of running DBSCAN on this much data, likely in a distributed way.
Next, the nature of my data is that ~85% of it lies in one huge cluster (anomaly detection). The only technique I have been able to come up with to allow me to process more data is to replace a big chunk of that huge cluster with a single data point, in a way that it can still reach all its neighbours (the deleted chunk is smaller than epsilon).
Can anyone provide any tips on whether I'm doing this right, or whether there is a better way to reduce the complexity of DBSCAN when you know that most of the data is in one cluster centered around (0.0, 0.0)?
Have you added an index to ELKI, and tried the parallel version? Except for the git version, ELKI will not automatically add an index; and even then, fine-tuning the index for the problem can help.
DBSCAN is not a good approach for anomaly detection - noise is not the same as anomalies. I'd rather use a density-based anomaly detection method. There are variants that try to skip over "clear inliers" more efficiently if you know you are only interested in the top 10%.
If you already know that most of your data is in one huge cluster, why don't you directly model that big cluster and remove it / replace it with a smaller approximation?
Subsample. There is usually next to no benefit to using the entire data set. Even (or in particular) if you are interested in the "noise" objects, there is the trivial strategy of randomly splitting your data into, e.g., 32 subsets, clustering each of these subsets, and joining the results back together. These 32 parts can be trivially processed in parallel on separate cores or computers; and because the underlying problem is quadratic in nature, the speedup will be anywhere between 32 and 32*32 = 1024.
This holds in particular for DBSCAN: larger data usually means you also want a much larger minPts, but then the results will not differ much from a subsample with a smaller minPts.
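A hedged scikit-learn sketch of that split-and-recluster strategy (eps, min_samples and the part count are placeholders; joining clusters that span two parts is left out):

import numpy as np
from sklearn.cluster import DBSCAN

def split_dbscan(X, n_parts=32, eps=0.05, min_samples=10):
    rng = np.random.default_rng(0)
    labels = np.full(len(X), -1)        # -1 = noise, as in sklearn
    offset = 0
    for part in np.array_split(rng.permutation(len(X)), n_parts):
        part_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[part])
        clustered = part_labels >= 0
        labels[part[clustered]] = part_labels[clustered] + offset
        offset += part_labels.max() + 1  # keep cluster ids distinct per part
    return labels

Each of the n_parts iterations is independent, so they can run in parallel on separate cores or machines.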
But by all means: before scaling to larger data, make sure your approach solves your problem, and is the smartest way of solving it. Clustering for anomaly detection is like trying to smash a screw into the wall with a hammer. It works, but maybe using a nail instead of a screw is the better approach.
Even if you have "big" data and are proud of doing "big data", always begin with a subsample. Unless you can show that result quality increases with data set size, don't bother scaling to big data; the overhead is too high unless you can prove value.

Analysis of 3d image data as 2d

I have a tif image around ~10 GB in size. I need to perform object classification or pixel classification in this image. The image data has zyx dimension order, with a voxel size of x=0.6, y=0.6 and z=1.2, where z is the depth of the object. My RAM cannot hold the whole image.
If I classify the pixels in each Z plane separately and then merge the results to get the final shape and volume of the object,
would I lose any information, and would my final shape or volume of the object be wrong?
@ankit agrawal, you have probably found the answer already, but my advice would definitely not be to just say you need more memory.
I have had a similar problem, and if anyone else comes across it, the options below will help.
Options
The answer about splitting into z planes is correct - you could lose information in the z direction. The idea isn't a bad one, but you could take Regions of Interest (ROIs) / split your image into chunks so that they become more manageable, say splitting at x/2, y/2 and z/2. Then you get a bunch of chunks that fit in memory, and you stack the data back up later.
Use the library Dask (https://dask.org/); it handles all of this for you. It's designed for parallelism and can scale on a single computer or a cluster. The dask.array part lets you create lots of chunked numpy arrays. Even better, use dask-image (there is a link to it in the Dask docs), a wrapper around dask.array and many scipy.ndimage functions. Lastly, when the file is split appropriately, the computation can be faster because of the parallelism. Not always, but I have easily worked with 20 GB data sets on a laptop with 16 GB of RAM. The files were 8-bit, and many libraries and functions up-convert to float, blowing your memory up; Dask lets you keep a handle on it all. If you stick to the core functions it will work fine; it gets harder when you work with mapped blocks.
Hopefully this helps if you still have this issue.
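A minimal sketch of that Dask approach (the file name, chunk shape and filter are placeholders):

from dask_image.imread import imread
from dask_image.ndfilters import uniform_filter

# Lazily open the big TIFF as a chunked array - it is never fully in RAM
stack = imread('volume.tif')              # shape (z, y, x)
stack = stack.rechunk((8, 1024, 1024))    # chunk size is a tuning knob

# Operations build a task graph; chunks are processed in parallel
smoothed = uniform_filter(stack, size=3)
result = smoothed.compute()               # work happens only here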
The issue with doing the classification in each z-plane individually is that you might not be able to classify objects with that sort of restricted information.
Think of it the same way as a 2D face detection problem in which you tried to detect the face in each row individually - that would probably not be very robust, you would lose valuable spatial information, and in the end you'd likely have no detections to merge.
Solution proposal:
My advice would be to increase the size of your voxels until the volume can be processed by your processing unit, i.e. decrease the resolution of your data, and do a classification with a low confidence threshold. Then come back and do another classification on the volumes that contain detections, this time aiming for a higher confidence threshold. This can be done iteratively as needed; a rough sketch follows below.
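A rough sketch of that coarse-to-fine loop (classify_volume, the thresholds, and the box format are hypothetical placeholders for your own classifier):

import numpy as np
from scipy.ndimage import zoom

def coarse_to_fine(volume, classify_volume, factor=4):
    # Pass 1: heavily downsampled copy, permissive confidence threshold
    coarse = zoom(volume, 1.0 / factor)   # small enough to fit in RAM
    boxes = classify_volume(coarse, confidence=0.3)   # (z, y, x, d, h, w)
    # Pass 2: revisit only the detected regions at full resolution
    results = []
    for (z, y, x, d, h, w) in boxes:
        crop = volume[z * factor:(z + d) * factor,
                      y * factor:(y + h) * factor,
                      x * factor:(x + w) * factor]
        results.append(classify_volume(crop, confidence=0.8))
    return results

Here volume can be a np.memmap, so the full-resolution crops are loaded lazily.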
I think breaking the image along any (x/y/z) plane kind of defeats the point of the voxel concept, because the representation of a three-dimensional object is flattened and you lose the spatial relational data.
I think a couple of options are:
Use a distributed computing cluster, like Hadoop.
Look into storing that image in a geospatial database like GeoMesa, so it may be queried efficiently, then you can just hold in memory what you need to train locally.
10GB isn't so large, so perhaps upgrade your memory capacity?

Method for limiting an image to certain colors

I've got 11 specific colors that I want to limit a normal RGB image to. The method I'm using right now is finding the nearest color using the minimum Euclidean distance - the simplest method.
But since I'll be using a microprocessor to do this in the future, speed is important to me.
Are there other methods, such as ANN, other machine learning, or image processing techniques, that I can use to speed up the process?
Thanks in advance.
P.S.: What is the name of this problem? So that I can search better.
Given your 11 colors, build a kd-tree (or a similar data structure; a keyword to search for would be spatial partitioning trees) over those 11 vectors of size 3 (RGB) each.
Remark: these data structures require you to define the metric in use.
Euclidean distance / the L2 norm sounds okay, but from an image-algorithm perspective I recommend transforming everything into a color space built for human perception and then using L2 there.
Then, for every pixel of your image, you query the nearest neighbor - the core operation of these data structures.
The possible speedup depends on the details. As Mark said: benchmark it!
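A minimal SciPy sketch (the random palette is a placeholder for your 11 colors; for a perceptual metric, convert the palette and image to a space like CIELAB first):

import numpy as np
from scipy.spatial import cKDTree

palette = np.random.randint(0, 256, size=(11, 3))    # your 11 RGB colors
tree = cKDTree(palette)

image = np.random.randint(0, 256, size=(480, 640, 3))
pixels = image.reshape(-1, 3)

_, nearest = tree.query(pixels)     # nearest palette index per pixel
quantized = palette[nearest].reshape(image.shape)

With only 11 colors, a brute-force distance computation against all 11 may well beat the tree on a microprocessor - hence: benchmark it.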
The keyword you are looking for is probably Nearest Neighbor Search. There are many alternatives, including approximations (probably not needed for your case).

Can GLSL perform a recursive formula calculation? Or how can I speed up this formula?

I want to implement this formula in my iOS app. Is there any way to use GLSL to speed it up? Or can I use Metal or something similar?
imageOut[0] = imageIn[0] * b;  // k = 0 has no previous output to recurse on
for (k = 1; k < imageSize; k++) {
    imageOut[k] = imageOut[k - 1] * a + imageIn[k] * b;
}
OpenCL is not available.
This is a classic IIR filter, and the data dependencies cause problems when converting it to SIMD code. This means that you can't do the operation as a simple transform feedback or render-to-texture operation. In other words, the GPU is designed to work on a bunch of data in parallel, but your formula forces the output to be computed serially (you can't compute out[k] without computing out[k-1] first).
I see two ways to optimize this:
You can use SIMD on the CPU. For iOS, this means ARM NEON. See articles like Optimising IIR Filters Using ARM NEON or Parallelization of IIR Filters using SIMD Extensions.
You can redesign the filter as an FIR filter, completely eliminating data dependencies.
Unfortunately, there is no easy translation to GLSL. Maybe you could use Metal instead of NEON, I'm not sure.
What you have there, as Dietrich Epp already pointed out, is an IIR filter. Now, on a computer there's no such thing as "infinite" - you're always limited by number precision, memory, available computation time, etc. Even if you executed that loop ad infinitum, due to the limited precision of your typical number representation you'd lose anything meaningful to round-off errors quite early on.
So let's be honest about it and call it an FIR filter with a very long response time. Can those be parallelized? Yes, they can, but for that we have to leave the time domain and look at the problem from the frequency domain.
Assume you can record the response of a system (= filter) to every possible input signal; then "playing back" that response for a given signal gives you the desired output. In the frequency domain that would be a recording of the system's response to a broadband signal covering all frequencies. But such a signal is just a simple impulse. That's where the terms FIR and IIR get their middle "I" (impulse) from.
Applying the impulse response of the system to an arbitrary signal by means of a convolution gives you what the system's response to that signal would be. And calculating a convolution in the time domain is the same as multiplying the Fourier transform of the signal with the Fourier transform of the impulse response and transforming the result back, i.e.
s * r = F^-1(F(s) · F(r))
And Fourier transforms are one of the things that parallelize very well and that GPUs are really quite good at.
Now, there are GLSL-based Fourier transform implementations, but normally these are written in OpenCL or CUDA to run on GPUs.
Anyway, here's the recipe for you:
1. Determine the cutoff k at which your IIR becomes indistinguishable from an FIR.
2. Determine the Fourier transform of the impulse response (= the complex spectral response, CSR).
3. Fourier transform the signal (= image), multiply with the CSR, and transform back.
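A hedged NumPy sketch of that recipe for the questioner's 1D loop (a, b and the cutoff K are example values; for an image you would use the 2D transforms instead):

import numpy as np

a, b = 0.95, 0.05                  # coefficients from the IIR loop
signal = np.random.rand(4096)      # stand-in for one image row

K = 256                            # cutoff: b * a**K is negligibly small
h = b * a ** np.arange(K)          # truncated impulse response (the FIR)

n = len(signal) + K - 1            # pad so the convolution is linear
out = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(h, n), n)[:len(signal)]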

GPU based SIFT feature extractor for iOS?

I've been playing with the excellent GPUImage library, which implements several feature detectors: Harris, FAST, Shi-Tomasi, Noble. However, none of those implementations help with the feature extraction and matching part; they simply output a set of detected corner points.
My understanding (which is shaky) is that the next step would be to examine each of those detected corner points and extract the feature from them, resulting in a descriptor - i.e., a 32- or 64-bit number that could be used to index the point near other, similar points.
From reading Chapter 4.1 of [Computer Vision: Algorithms and Applications, Szeliski], I understand that a Best-Bin-First approach would help to efficiently find neighbouring features to match, etc. However, I don't actually know how to do this, and I'm looking for some example code that does it.
I've found this project [https://github.com/Moodstocks/sift-gpu-iphone], which claims to implement as much as possible of the feature extraction on the GPU. I've also seen some discussion indicating that it might generate buggy descriptors.
In any case, that code doesn't go on to show how the extracted features would best be matched against another image.
My use case is trying to find objects in an image.
Does anyone have any code that does this, or at least a good implementation that shows how the extracted features are matched? I'm hoping not to have to rewrite the whole set of algorithms.
Thanks,
Rob.
First, you need to be careful with SIFT implementations, because the SIFT algorithm is patented and the owners of those patents require license fees for its use. I've intentionally avoided using that algorithm for anything as a result.
Finding good feature detection and extraction methods that also work well on a GPU is a little tricky. The Harris, Shi-Tomasi, and Noble corner detectors in GPUImage are all derivatives of the same base operation, and probably aren't the fastest way to identify features.
As you can tell, my FAST corner detector isn't operational yet. The idea there is to use a lookup texture based on a local binary pattern (which is why I built that filter first, to test the concept), and to have that texture return whether a point is a corner or not. That should be much faster than the Harris, etc. corner detectors. I also need to finish my histogram pyramid point extractor so that feature extraction isn't done in an extremely slow loop on the GPU.
The use of a lookup texture for a FAST corner detector is inspired by this paper by Jaco Cronje on a technique they refer to as BFROST. In addition to using the quick, texture-based lookup for feature detection, the paper proposes using the binary pattern as a quick descriptor for the feature. There's a little more to it than that, but in general that's what they propose.
Feature matching is done by Hamming distance, but while there are quick CPU-side and CUDA instructions for calculating that, OpenGL ES doesn't have one. A different approach might be required there. Similarly, I don't have a good solution for finding a best match between groups of features beyond something CPU-side, but I haven't thought that far yet.
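For what it's worth, a hedged NumPy sketch of CPU-side Hamming matching between packed binary descriptors (the sizes are placeholders; a CUDA popcount-based version would be much faster):

import numpy as np

# Two sets of 256-bit descriptors, packed into 32 bytes each
desc_a = np.random.randint(0, 256, size=(500, 32), dtype=np.uint8)
desc_b = np.random.randint(0, 256, size=(500, 32), dtype=np.uint8)

# XOR every pair, count the set bits: a full Hamming distance matrix
diff = desc_a[:, None, :] ^ desc_b[None, :, :]
dist = np.unpackbits(diff, axis=2).sum(axis=2)
best = dist.argmin(axis=1)   # nearest descriptor in desc_b for each in desc_a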
It is a primary goal of mine to have this in the framework (it's one of the reasons I built it), but I haven't had the time to work on this lately. The above are at least my thoughts on how I would approach this, but I warn you that this will not be easy to implement.
For object recognition these days (as of a couple of weeks ago), it's best to use TensorFlow / convolutional neural networks for this.
Apple has some metal sample code recently added. https://developer.apple.com/library/content/samplecode/MetalImageRecognition/Introduction/Intro.html#//apple_ref/doc/uid/TP40017385
To do feature detection within an image, I draw your attention to an out-of-the-box option: the KAZE/AKAZE algorithm in OpenCV.
http://www.robesafe.com/personal/pablo.alcantarilla/kaze.html
For iOS, I glued the AKAZE class together with another stitching sample to illustrate:
cv::Ptr<cv::AKAZE> detector = cv::AKAZE::create();
std::vector<cv::KeyPoint> keypoints;
detector->detect(mat, keypoints);        // this finds the keypoints
cv::drawKeypoints(mat, keypoints, mat);
// A sample keypoint, as shown in the debugger (this is the pseudo-SIFT descriptor data):
[255] = {
    pt       = (x = 645.707153, y = 56.4605064)
    size     = 4.80000019
    angle    = 0
    response = 0.00223364262
    octave   = 0
    class_id = 0
}
https://github.com/johndpope/OpenCVSwiftStitch
Here is a GPU accelerated SIFT feature extractor:
https://github.com/lukevanin/SIFTMetal
The code is written in Swift 5 and uses Metal compute shaders for most operations (scaling, Gaussian blur, keypoint detection and interpolation, feature extraction). The implementation is largely based on the paper and code from the "Anatomy of the SIFT Method" article published in the Image Processing On Line (IPOL) journal in 2014 (http://www.ipol.im/pub/art/2014/82/). Some parts are based on code by Rob Hess (https://github.com/robwhess/opensift), which I believe is now used in OpenCV.
For feature matching, I tried using a kd-tree with the best-bin-first (BBF) method proposed by David Lowe. While BBF does provide some benefit up to about 10 dimensions, at the higher dimensionality used by SIFT it is no better than quadratic search, due to the "curse of dimensionality". That is to say, if you compare 1,000 descriptors against 1,000 other descriptors, it still ends up making 1,000 x 1,000 = 1,000,000 comparisons - the same as brute-force pairwise matching.
In the linked code I use a different approach, optimised for performance over accuracy. I use a trie to locate the general vicinity of potential neighbours, then search a fixed number of sibling leaf nodes for the nearest neighbours. In practice this matches about 50% of the descriptors but makes only 1,000 x 20 = 20,000 comparisons - about 50x faster, scaling linearly instead of quadratically.
I am still testing and refining the code. Hopefully it helps someone.
