Convolving a large filter in GPGPU - image-processing

I wish to apply a certain 2D filter to 2D images; however, the filter size is huge. Image dimensions are about 2000x2000 and the filter size is about 500x500.
No, I cannot do this in the frequency domain, so FFT is not an option. I'm aware of the usual GPU convolution approach and the use of shared memory to coalesce memory accesses, but shared memory doesn't seem feasible here: the space needed by the filter is large and would therefore have to be split up, which might prove very complex to write.
Any ideas?

I think you can manage filtering images of that size without much trouble. You can transfer hundreds of megabytes to video memory, so data of this size should work fine.
You can use byte matrices to transfer the image data, and then run your filter over it.
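For concreteness, here is a minimal sketch of the brute-force route (illustrative names and layouts, not from the question): one thread per output pixel, reading the filter directly from global memory, with no shared-memory staging at all. With a 2000x2000 image and a 500x500 filter this is on the order of 10^12 multiply-adds, so it will run, but don't expect it to be fast.

```cuda
// Sketch: direct 2D convolution, one thread per output pixel.
// image, filter and out are row-major float buffers already on the device;
// the filter is read straight from global memory (no shared-memory tiling).
__global__ void convolve2D(const float* image, const float* filter, float* out,
                           int imgW, int imgH, int filtW, int filtH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= imgW || y >= imgH) return;

    int halfW = filtW / 2;
    int halfH = filtH / 2;
    float acc = 0.0f;

    // Walk the (large) filter window, clamping reads at the image border.
    for (int fy = 0; fy < filtH; ++fy) {
        int iy = min(max(y + fy - halfH, 0), imgH - 1);
        for (int fx = 0; fx < filtW; ++fx) {
            int ix = min(max(x + fx - halfW, 0), imgW - 1);
            acc += image[iy * imgW + ix] * filter[fy * filtW + fx];
        }
    }
    out[y * imgW + x] = acc;
}
```

Launched with something like `dim3 block(16, 16); dim3 grid((imgW + 15) / 16, (imgH + 15) / 16);`, this relies on the cache to absorb the repeated filter reads; if the filter happens to be separable, two 1D passes would cut the work from 500x500 to 500+500 taps per pixel.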

Related

Is it OK to mix different texture formats at the same time?

I am using the expensive R32G32B32A32Float format for quality reasons; in particular, I need to retrieve the result by undoing the pre-multiply on the buffer. Now I want to do some optimization, so I wonder whether there would be problems or hidden traps if I mixed textures in different formats, e.g. used a more lightweight texture format for non-transparent images.
Note: I am not making games, but doing image processing. Speed is not much of a concern, since hardware acceleration is already faster than doing this on the CPU, but GPU memory is quite limited, hence the question.

Add the same image with different offsets to the accumulating image on GPU

As the title states, I am trying to add the same image with different offsets, stored in a list, to the accumulating image.
The current implementation performs this on a CPU, and with some intrinsics it can be quite fast.
However, with larger images (2048x2048) and many offsets in the list (~10000), the performance is not satisfactory.
My question is, can the accumulation of the image with different offsets be efficiently implemented on a GPU?
Yes, you can, and it will likely be much faster than on the CPU. The trick is not to send the data for each addition, and not even to launch a new kernel for each addition: the kernel you have should do a decent number of offset additions at once, at least 16 but possibly a few hundred, depending on your typical list size (and you can have more than one kernel, of course).
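As a sketch of that batching idea (names, types and layouts assumed, not the poster's code): upload the offset list once, then let each thread own one pixel of the accumulator and loop over the whole batch, so one kernel launch covers many additions and no atomics are needed.

```cuda
// Sketch: accumulate the same source image at many offsets in one launch.
// src and accum are row-major float buffers on the device; offsets is a
// device array of int2 uploaded once (or in chunks) before the launch.
__global__ void accumulateOffsets(const float* src, float* accum,
                                  const int2* offsets, int numOffsets,
                                  int srcW, int srcH, int accW, int accH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= accW || y >= accH) return;

    float sum = 0.0f;
    // Each thread owns one accumulator pixel, so the per-offset reads are
    // independent and the single write at the end needs no atomics.
    for (int i = 0; i < numOffsets; ++i) {
        int sx = x - offsets[i].x;
        int sy = y - offsets[i].y;
        if (sx >= 0 && sx < srcW && sy >= 0 && sy < srcH)
            sum += src[sy * srcW + sx];
    }
    accum[y * accW + x] += sum;
}
```

With ~10000 offsets you could pass the whole list to one launch, or split it into a few chunks if you want to overlap transfers and keep individual launches short.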

What consumes less memory: an actual image or a drawn image?

I am designing an app and I am creating some images with PaintCode.
Using that program I get the actual code for each image that I create, which lets me choose between inserting code or using an actual image. I was wondering what would consume less memory, the image code or an actual PNG?
I know an image's memory consumption is width x height x 4 bytes, but I have no idea whether an image that is generated by code is more memory efficient, less memory efficient, or breaks even.
This decision is particularly important given the different screen resolutions. It's a lot easier to create an image in code and expand it to whatever size I want than to go to Photoshop every time.
This answer differs from the others because I have the impression the graphics context is your most common destination -- that you are not always rendering to a discrete bitmap. So, for the purposes of typical drawing:
I was wondering what would consume less memory, the image code or an actual PNG?
It's most likely that the code will result in far less memory consumption.
I have no idea whether an image that is generated by code is more memory efficient, less memory efficient or breaks even?
There are a lot of variables and there is no simple equation to tell you which is better for any given input. If it's simple enough to create with a WYSIWYG, it's likely much smaller as code.
If you need to create intermediate rasterizations or layers for a vector-based renderer, then memory will be about equal once you have added the first layer. Typically, one does not (and should not) render each view or layer (not CALayer, btw) to these intermediates, but instead renders directly into the graphics context. When all your views render directly into the graphics context, they write to the same destination.
With code, you also open yourself to a few other variables which have the potential to add a lot of memory. The effects of font loading and caching can be quite high, and the code generator you use is not going to examine how you could achieve the best caching and sharing of these resources if you find you need to minimize memory consumption.
If your goal is to draw images, you should try to use UIImageView if you possibly can. It's generally the fastest and cheapest way to get an image to the screen, and it's reasonably flexible.
Someone explained it better here.
source
A vector image is almost always smaller in storage than its raster counterpart, except for photographs. In memory, though, both have to be rasterized if you need to display them, so they will use more or less the same amount of memory.
However, I am highly skeptical of the usefulness of PaintCode; in general it's better to use a standard image format such as .svg or .eps, rather than a non-standard format such as a domain-specific language (DSL) within Objective-C.
It makes no difference at all, provided the final image size (in point dimensions) is the same as the display size (in point dimensions). What is ultimately displayed in your app is, say, a 100x100 bitmap. Those are the same number of bits no matter how they were obtained to start with.
The place where memory gets wasted is in holding on to an image that is much larger (in point dimensions) than it is actually being displayed in the interface.
If I load a 3MB PNG from my app bundle, scale it down to 100x100, draw it in the interface, and let go of the original 3MB PNG, the result is exactly the same amount of memory in the backing store as if I had drawn the content of a 100x100 graphics context from scratch myself using Core Graphics (which is what PaintCode helps you do).
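To put numbers on that (a back-of-the-envelope example, not from the original answer): a 100x100-point image drawn at 1x scale into an RGBA backing store occupies 100 x 100 x 4 = 40,000 bytes (about 39 KB), whether it was decoded from a 3MB PNG and scaled down or drawn from scratch with Core Graphics; at a 2x Retina scale it is 200 x 200 x 4 = 160,000 bytes either way.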

How to manage large 2D FFTs in cuda

I have successfully written some CUDA FFT code that does a 2D convolution of an image, as well as some other calculations.
How do I go about figuring out what the largest FFTs I can run are? It seems that a plan for a 2D R2C convolution takes 2x the image size, and another 2x the image size for the C2R. This seems like a lot of overhead!
Also, it seems like most of the benchmarks and such are for relatively small FFTs. Why is this? It seems like for large images I am going to run out of memory quickly. How is this typically handled? Can I perform an FFT convolution on a tile of an image and combine those results, and expect it to be the same as if I had run a 2D FFT on the entire image?
Thanks for answering these questions.
CUFFT plans a different algorithm depending on your image size. If the data can't fit in shared memory and the dimensions are not a power of 2, CUFFT plans an out-of-place transform, while smaller images of the right sizes are more amenable to the library.
If you're set on FFTing the whole image and need to see what your GPU can handle, my best answer would be to guess and check with different image sizes, as CUFFT planning is complicated.
See the documentation: http://developer.download.nvidia.com/compute/cuda/1_1/CUFFT_Library_1.1.pdf
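If you want to make the guess-and-check a bit more systematic, a rough helper like the sketch below (my own illustration, with assumed sizes) asks CUFFT to estimate the work areas for an R2C/C2R pair and compares the total, plus the data buffers, against the free device memory. Build with something like `nvcc estimate.cu -lcufft`.

```cuda
// Sketch: estimate whether a full-image R2C/C2R FFT pair will fit in memory.
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    const int nx = 2048, ny = 2048;                 // candidate (padded) image size

    size_t r2c = 0, c2r = 0;
    cufftEstimate2d(nx, ny, CUFFT_R2C, &r2c);       // estimated scratch, forward plan
    cufftEstimate2d(nx, ny, CUFFT_C2R, &c2r);       // estimated scratch, inverse plan

    // Data buffers: the real image plus the packed complex spectrum
    // (the R2C output of an nx-by-ny image is nx * (ny/2 + 1) complex values).
    size_t realBytes  = (size_t)nx * ny * sizeof(cufftReal);
    size_t cmplxBytes = (size_t)nx * (ny / 2 + 1) * sizeof(cufftComplex);

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    printf("plan scratch ~%zu MB, data ~%zu MB, free %zu of %zu MB\n",
           (r2c + c2r) >> 20, (realBytes + cmplxBytes) >> 20,
           freeBytes >> 20, totalBytes >> 20);
    return 0;
}
```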
I agree with Mark and say that tiling the image is the way to go for convolution. Since convolution amounts to just computing many independent integrals you can simply decompose the domain into its constituent parts, compute those independently, and stitch them back together. The FFT convolution trick simply reduces the complexity of the integrals you need to compute.
I expect that your GPU code should outperform matlab by a large factor in all situations unless you do something weird.
It's not usually practical to run an FFT on an entire image. Not only does it take a lot of memory, but the image must be a power of 2 in width and height, which places an unreasonable constraint on your input.
Cutting the image into tiles is perfectly reasonable. The size of the tiles will determine the frequency resolution you're able to achieve. You may want to overlap the tiles as well.
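To make the padding and overlap arithmetic concrete (illustrative numbers, not from the answers): each tile must be padded by filterSize - 1 in each dimension so the FFT performs a linear rather than circular convolution, each tile's result is tile + filter - 1 samples long per dimension, and adjacent results overlap by filter - 1 samples where they are summed (overlap-add).

```cuda
// Sketch: pick a padded FFT size per tile for overlap-add convolution.
#include <cstdio>

int main()
{
    const int filterW = 500, filterH = 500;   // filter size from the question
    const int tileW   = 1024, tileH  = 1024;  // one possible tile choice

    // Pad each tile to at least tile + filter - 1, rounded up to the next
    // power of two, which CUFFT tends to handle efficiently.
    int fftW = 1, fftH = 1;
    while (fftW < tileW + filterW - 1) fftW <<= 1;
    while (fftH < tileH + filterH - 1) fftH <<= 1;

    printf("padded FFT per tile: %d x %d, output overlap: %d x %d\n",
           fftW, fftH, filterW - 1, filterH - 1);
    return 0;
}
```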

How to handle large images in MATLAB without running out of memory?

I am creating a mosaic of two images based on the region matches between them, using SIFT descriptors. The problem is that when the created mosaic's size gets too large, MATLAB runs out of memory.
Is there some way of stitching the images without actually loading the complete images into memory?
If not, how do other gigapixel image generation techniques, or the panorama apps, work?
Determine the size of the final mosaic prior to stitching (easy to compute from the size of your input images and the homography).
Write a blank mosaic to file (not in any specific format, but as a sequence of bytes, just as in memory).
I'm assuming you're inverse-mapping the pixels from the original images to the mosaic. So, just write to file when you're about to store the intensity of a pixel in your mosaic.
There are a few ways you can save memory:
You should use integer data types, such as uint8, for your data.
If you're stitching, you can keep only the regions of interest in memory, such as the potential overlap regions.
If none of the others work, you can spatially downsample the images using imresize and work on the resulting smaller images.
You can potentially use distributed arrays in the Parallel Computing Toolbox.
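As a rough illustration of the first point (sizes assumed for the example): a 20000x20000-pixel grayscale mosaic stored in MATLAB's default double class takes 20000 x 20000 x 8 bytes = 3.2 GB, while the same mosaic as uint8 takes 400 MB; for RGB, multiply both by three.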
