I need to slice an image into N tiles (the rects might be located anywhere and may overlap), where N could potentially be quite large. Since CGImage operates on the CPU, and this is a performance-critical operation that happens several times per second, I was wondering whether there is a faster way to do this on the GPU.
What's the fastest possible solution to slice an image (possibly using the GPU)?
PS: If it helps in any way, the image is grayscale only (an array of floats between 0 and 1). It doesn't have to be a CGImage/UIImage; a float array suffices.
Since slicing images is basically just copying chunks of the image into a new image, there is not really a way to speed up the copy itself. Depending on what you are doing with the slices, you might be able to get away with not copying the data at all: if you keep only the coordinates of your slices, you can access the underlying storage of the original image.
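If a zero-copy approach works for your use case, the grayscale float buffer makes this easy: a "tile" can simply be a rectangle plus a reference to the shared pixel storage. A minimal sketch (the type and member names are mine, not from any framework):

```swift
/// Grayscale image stored as a flat, row-major array of floats in 0...1.
final class GrayImage {
    let width: Int
    let height: Int
    var pixels: [Float]

    init(width: Int, height: Int, pixels: [Float]) {
        precondition(pixels.count == width * height)
        self.width = width
        self.height = height
        self.pixels = pixels
    }

    subscript(x: Int, y: Int) -> Float {
        get { pixels[y * width + x] }
        set { pixels[y * width + x] = newValue }
    }
}

/// A "tile" is just a rectangle into the shared image: creating one is O(1),
/// copies no pixel data, and overlapping tiles cost nothing extra.
struct TileView {
    let image: GrayImage
    let originX: Int, originY: Int
    let width: Int, height: Int

    // Reads go straight through to the original storage.
    subscript(x: Int, y: Int) -> Float {
        image[originX + x, originY + y]
    }
}
```

Only if a consumer genuinely needs its own contiguous copy of a tile do you pay for copying, and at that point a straight per-row memcpy is about as fast as the CPU will go.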
I am trying to come up with an efficient algorithm to query the top k most dominant colors in a UIImage (jpg or png). By dominant color I mean the color that is present in the greatest number of pixels. The use case is mostly geared toward finding the top-1 (single most dominant) color to figure out the image's background. Below I document the algorithm I am currently trying to implement and wanted some feedback.
First, my algorithm takes an image and draws it into a bitmap context. I do this so I have control over how many bytes per pixel will be present in the image data for processing/parsing.
Second, after drawing the image, I loop through every pixel in the image buffer. I quickly realized that looping through every pixel would not scale; it became a huge performance bottleneck. To optimize this, I realized I need not analyze every pixel in the image. I also noticed that as I scaled an image down, the dominant colors became more prominent, and the loop had far fewer pixels to visit. So the second step is actually to loop through every pixel in the resized image buffer.
Third, as I loop through each pixel I build up a CountedSet of UIColor objects; this basically keeps a histogram of color counts.
Finally, once I have my counted set, it is easy to loop through it and answer top-k dominant color queries.
So, in short, my algorithm is to resize the image (scaling it down by some function proportional to the image's size), draw it into a bitmap context buffer once and cache it, then loop through all the data in the buffer and build a histogram.
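Not my actual code, but here is a minimal sketch of that pipeline (downscale by drawing into a small bitmap context, then walk the buffer and build an NSCountedSet); the 64x64 sample size and the function name are illustrative choices:

```swift
import UIKit

func dominantColors(in image: UIImage,
                    sampleSize: CGSize = CGSize(width: 64, height: 64),
                    k: Int = 1) -> [(color: UIColor, count: Int)] {
    let width = Int(sampleSize.width), height = Int(sampleSize.height)

    // 1) Draw the image into a small RGBA bitmap context; the draw performs
    //    the downscale, and we control the byte layout.
    guard let cgImage = image.cgImage,
          let context = CGContext(data: nil, width: width, height: height,
                                  bitsPerComponent: 8, bytesPerRow: 0,
                                  space: CGColorSpaceCreateDeviceRGB(),
                                  bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)
    else { return [] }
    context.draw(cgImage, in: CGRect(x: 0, y: 0, width: width, height: height))
    guard let data = context.data else { return [] }
    let pixels = data.bindMemory(to: UInt8.self, capacity: context.bytesPerRow * height)

    // 2) Walk every pixel of the resized buffer and count colors.
    let counts = NSCountedSet()
    for y in 0..<height {
        for x in 0..<width {
            let i = y * context.bytesPerRow + x * 4
            counts.add(UIColor(red: CGFloat(pixels[i]) / 255,
                               green: CGFloat(pixels[i + 1]) / 255,
                               blue: CGFloat(pixels[i + 2]) / 255,
                               alpha: 1))
        }
    }

    // 3) Answer a top-k query from the counted set.
    let ranked = counts.allObjects
        .compactMap { $0 as? UIColor }
        .map { (color: $0, count: counts.count(for: $0)) }
        .sorted { $0.count > $1.count }
    return Array(ranked.prefix(k))
}
```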
My question to the Stack Overflow community is: how efficient is my algorithm, and are there any gains or further optimizations I can make? I just need something that gives me reasonable performance, and after some performance testing this seemed to work pretty darn well. Furthermore, how accurate will this be? The rescale operation in particular worries me. Am I trading a significant amount of accuracy for performance here? At the end of the day this will mostly just be used to determine the background color of an image.
Ideas for potential performance improvements:
1) When analyzing for a single dominant color, do some math to figure out whether I have already found the most dominant color based on the number of pixels analyzed, and exit early (see the sketch after this list).
2) For the top-k query, answer it quickly by leveraging a binary-heap data structure (the typical top-k query algorithm).
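For idea 1, a conservative but simple early-exit test is the strict majority: once one color's count exceeds half of all the samples you intend to look at, no other color can overtake it. A hedged sketch over an array of packed RGB values, however you obtain them (names are illustrative):

```swift
import Foundation

/// Returns the dominant packed-RGB value, stopping the scan as soon as the
/// current leader holds a strict majority of all samples.
func dominantColorWithEarlyExit(samples: [UInt32]) -> UInt32? {
    let counts = NSCountedSet()
    var best: UInt32?
    var bestCount = 0

    for color in samples {
        counts.add(color)
        let c = counts.count(for: color)
        if c > bestCount { bestCount = c; best = color }

        // A strict majority can never be overtaken, even if every remaining
        // sample were a single other color. (A tighter test would also track
        // the runner-up and the number of samples left.)
        if bestCount * 2 > samples.count { break }
    }
    return best
}
```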
You can use some performance tweaks to avoid the downscaling altogether. Since we cannot see your implementation, it is hard to say where the bottleneck really is, so here are some pointers on what to look for, check, or improve. Keep in mind that I do not code for your environment, so take the specifics with a grain of salt:
Pixel access
Most pixel-access functions I have seen are SLOW, especially ones named putpixel, getpixel, pixels, and so on, because every single pixel access performs too many sanity/safety checks and color/space/address conversions. Instead, use direct pixel access. Most image interfaces I have seen offer some kind of ScanLine[] access that gives you a direct pointer to a single line of the image. If you fill your own array with such pointers, you get direct pixel access with no slowdown. This usually speeds up the algorithm by a factor of 100 to 10000 on most platforms (depending on usage).
To check for this, try to read or fill a 1024x1024 32-bit image and measure the time. On a standard PC it should take a few [ms] or less; with slow access it can take seconds. For more info see Display an array of color in C.
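On iOS/macOS the equivalent of ScanLine[] access is to create a bitmap context once, grab its raw data pointer, and index rows yourself instead of going through any per-pixel API. A sketch, including the 1024x1024 32-bit fill-and-time test suggested above (names and the test color are arbitrary):

```swift
import CoreGraphics
import Foundation

let width = 1024, height = 1024
guard let context = CGContext(data: nil, width: width, height: height,
                              bitsPerComponent: 8, bytesPerRow: 0,
                              space: CGColorSpaceCreateDeviceRGB(),
                              bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue),
      let base = context.data
else { fatalError("could not create bitmap context") }

// Build a table of row pointers (the equivalent of ScanLine[]).
let bytesPerRow = context.bytesPerRow
let rows: [UnsafeMutablePointer<UInt32>] = (0..<height).map { y in
    (base + y * bytesPerRow).assumingMemoryBound(to: UInt32.self)
}

// Fill the 1024x1024 32-bit image through the row pointers and time it.
let start = Date()
for y in 0..<height {
    let row = rows[y]
    for x in 0..<width {
        row[x] = 0xFF44AA88        // one 32-bit RGBA pixel, written directly
    }
}
print("fill took \(Date().timeIntervalSince(start) * 1000) ms")
```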
Dominant color
If #1 is still not fast enough, you can take advantage of the fact that the dominant color has the highest probability of occurring in the image. So in theory you do not need to sample the whole image; instead you could:
Sample every n-th pixel (which is downscaling with a nearest-neighbor filter), or use randomized pixel positions for sampling. Both approaches have their pros and cons, but if you combine them you can get much better results with far fewer pixels to process than the whole image. Of course this will occasionally give wrong results (when you miss many of the dominant pixels), which is improbable but possible.
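Both strategies are easy to express over a raw RGBA buffer of the kind a bitmap context produces; a sketch, where the function name and the packed-RGB return format are my own choices:

```swift
/// Collects packed 0xRRGGBB samples from a tightly described RGBA buffer,
/// first on a regular grid (every n-th pixel in x and y), then at random
/// positions to reduce aliasing against regular patterns in the image.
func sampleColors(pixels: UnsafePointer<UInt8>,
                  width: Int, height: Int, bytesPerRow: Int,
                  stride n: Int, randomSamples: Int) -> [UInt32] {
    func packed(_ i: Int) -> UInt32 {
        UInt32(pixels[i]) << 16 | UInt32(pixels[i + 1]) << 8 | UInt32(pixels[i + 2])
    }

    var samples: [UInt32] = []

    // 1) Regular grid: nearest-neighbor "downscaling" without building an image.
    for y in Swift.stride(from: 0, to: height, by: n) {
        for x in Swift.stride(from: 0, to: width, by: n) {
            samples.append(packed(y * bytesPerRow + x * 4))
        }
    }

    // 2) Randomized positions.
    for _ in 0..<randomSamples {
        let x = Int.random(in: 0..<width), y = Int.random(in: 0..<height)
        samples.append(packed(y * bytesPerRow + x * 4))
    }
    return samples
}
```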
Histogram structure
For a low color count, such as up to 16 bits, you can use bucket-sort/histogram acquisition, which is fast and can be done in O(n), where n is the number of pixels; no searching is needed. So if you reduce colors from true color to 16 bits, you can significantly boost the speed of the histogram computation: the constant factor drops hugely and the complexity goes from O(n*m) to O(n), which for a high color count m is a really big difference. See my C++ histogram example; it is in HSV, but RGB is almost the same...
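A sketch of the reduced 16-bit case: shift each 8-bit channel down to RGB565 and count into a flat 65536-entry array, in one pass and with no searching (assumes a tightly packed RGBA buffer; names are illustrative):

```swift
/// Bucket-style histogram over RGB565-reduced colors: O(n) in the pixel count.
func histogram565(pixels: UnsafePointer<UInt8>, pixelCount: Int) -> [Int] {
    var bins = [Int](repeating: 0, count: 1 << 16)
    for p in 0..<pixelCount {
        let i = p * 4
        let r = UInt32(pixels[i])     >> 3   // 5 bits
        let g = UInt32(pixels[i + 1]) >> 2   // 6 bits
        let b = UInt32(pixels[i + 2]) >> 3   // 5 bits
        bins[Int((r << 11) | (g << 5) | b)] += 1
    }
    return bins
}
```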
If you need true color, you have 16.7M possible colors, which is not practical for the bucket-sort style. In that case you need a dictionary with binary search to speed up the color lookup in the histogram. If you do not have this, then this is your slowdown.
Histogram sort
How did you sort the histogram? A wrong sort implementation can take a lot of time for big color counts. I usually use bubble sort in my examples because it is less code to write and usually enough, but I have seen bubble sort wrongly implemented here on SO too many times, always hitting the worst-case O(n^2) time (and even I sometimes do it). For time-sensitive code I use quick-sort. See bubble sort in C++.
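In Swift (or most modern standard libraries) the safest route is simply the built-in sort, which is documented as O(n log n); continuing from the bins array in the histogram sketch above:

```swift
/// Ranks the non-empty buckets of a 65536-bin histogram, most frequent first,
/// using the standard library's sort instead of a hand-rolled bubble sort.
func rankedBuckets(_ bins: [Int]) -> [(bucket: Int, count: Int)] {
    bins.enumerated()
        .filter { $0.element > 0 }
        .sorted { $0.element > $1.element }
        .map { (bucket: $0.offset, count: $0.element) }
}
```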
Also, your task really resembles color quantization (or is it just me?), so take a look at: Effective gif/image color quantization?
Downscaling an image requires looking at each pixel so that you can pick a new pixel closest to the average color of some group of neighbors. The reason this appears to happen so fast compared to your implementation of iterating through all the pixels is that Core Graphics hands the scaling task off to the GPU, whereas your approach iterates through each pixel on the CPU, which is much slower.
So what you need to do is write some GPU-based code that scans through your original image, looking at each pixel and tallying up the color counts as you go. This has the advantage not only of being very fast, but also of giving you an accurate count of colors. Downsampling, as I mentioned, produces pixels that are color averages, so you won't end up with reliably correct color counts that correspond to your original image (unless you happen to be downscaling solid colors, but in the typical case you'll end up with something other than what you started with).
I recommend looking into Apple's Metal framework for an API that lets you write code directly for the GPU. It will be a challenge to learn, but I think you'll find it interesting, and when you're done your code will scan original images extremely fast without any extra downsampling step.
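To make that concrete, here is a hedged sketch of one way it could look with Metal: a compute kernel reads the original texture and atomically bumps a 65536-bin (RGB565) histogram, so every pixel is counted exactly and no downsampling is involved. The function and kernel names are illustrative and error handling is minimal:

```swift
import CoreGraphics
import Foundation
import Metal
import MetalKit

// Metal Shading Language kernel: one thread per pixel, atomic add into a
// 65536-bin RGB565 histogram held in a device buffer.
let histogramSource = """
#include <metal_stdlib>
using namespace metal;

kernel void colorHistogram(texture2d<float, access::read> image [[texture(0)]],
                           device atomic_uint *bins             [[buffer(0)]],
                           uint2 gid [[thread_position_in_grid]])
{
    if (gid.x >= image.get_width() || gid.y >= image.get_height()) return;
    float4 c = image.read(gid);
    uint r = uint(c.r * 31.0f), g = uint(c.g * 63.0f), b = uint(c.b * 31.0f);
    atomic_fetch_add_explicit(&bins[(r << 11) | (g << 5) | b], 1u,
                              memory_order_relaxed);
}
"""

func gpuHistogram(of cgImage: CGImage) throws -> [UInt32] {
    let device = MTLCreateSystemDefaultDevice()!
    let queue = device.makeCommandQueue()!
    let library = try device.makeLibrary(source: histogramSource, options: nil)
    let pipeline = try device.makeComputePipelineState(
        function: library.makeFunction(name: "colorHistogram")!)

    // Upload the original image, then allocate and zero the histogram buffer.
    let texture = try MTKTextureLoader(device: device)
        .newTexture(cgImage: cgImage, options: [.SRGB: false])
    let binCount = 1 << 16
    let binsBuffer = device.makeBuffer(length: binCount * MemoryLayout<UInt32>.stride,
                                       options: .storageModeShared)!
    memset(binsBuffer.contents(), 0, binCount * MemoryLayout<UInt32>.stride)

    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setTexture(texture, index: 0)
    encoder.setBuffer(binsBuffer, offset: 0, index: 0)

    // One thread per pixel. dispatchThreads needs non-uniform threadgroup
    // support; on older GPUs round the grid up and use dispatchThreadgroups.
    encoder.dispatchThreads(MTLSize(width: texture.width, height: texture.height, depth: 1),
                            threadsPerThreadgroup: MTLSize(width: 16, height: 16, depth: 1))
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    let raw = binsBuffer.contents().bindMemory(to: UInt32.self, capacity: binCount)
    return Array(UnsafeBufferPointer(start: raw, count: binCount))
}
```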
As the title states, I am trying to add the same image with different offsets, stored in a list, to the accumulating image.
The current implementation performs this on a CPU, and with some intrinsics it can be quite fast.
However, with larger images (2048x2048) and many offsets in the list (~10000), the performance is not satisfactory.
My question is, can the accumulation of the image with different offsets be efficiently implemented on a GPU?
Yes, you can, and the result will likely be much faster than on the CPU. The trick is not to send the data for each addition, and not even to launch a new kernel for each addition: one kernel launch should handle a decent number of offset additions at once, at least 16 but possibly a few hundred, depending on your typical list size (and you can of course use more than one kernel).
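The question does not name a GPU API, so as an illustration here is a sketch of that idea in Metal (a CUDA kernel would look very similar): the image and the whole offset list are uploaded once, and a single dispatch lets each output pixel loop over the offsets. For simplicity it assumes the accumulator has the same dimensions as the source; a larger accumulator only changes the bounds test and indexing. All names are mine:

```swift
import Foundation
import Metal

// One thread per output pixel; each thread walks the whole offset list, so a
// single dispatch performs all ~10000 additions.
let accumulateSource = """
#include <metal_stdlib>
using namespace metal;

kernel void accumulate(device const float *src         [[buffer(0)]],
                       device float       *acc         [[buffer(1)]],
                       device const int2  *offsets     [[buffer(2)]],
                       constant uint      &offsetCount [[buffer(3)]],
                       constant uint2     &size        [[buffer(4)]],
                       uint2 gid [[thread_position_in_grid]])
{
    if (gid.x >= size.x || gid.y >= size.y) return;
    float sum = 0.0f;
    for (uint i = 0; i < offsetCount; ++i) {
        int2 p = int2(gid) - offsets[i];            // source pixel landing here
        if (p.x >= 0 && p.y >= 0 && p.x < int(size.x) && p.y < int(size.y)) {
            sum += src[p.y * int(size.x) + p.x];
        }
    }
    acc[gid.y * size.x + gid.x] += sum;
}
"""

func gpuAccumulate(source: [Float], width: Int, height: Int,
                   offsets: [SIMD2<Int32>]) throws -> [Float] {
    let device = MTLCreateSystemDefaultDevice()!
    let queue = device.makeCommandQueue()!
    let library = try device.makeLibrary(source: accumulateSource, options: nil)
    let pipeline = try device.makeComputePipelineState(
        function: library.makeFunction(name: "accumulate")!)

    let floatBytes = source.count * MemoryLayout<Float>.stride
    let srcBuffer = device.makeBuffer(bytes: source, length: floatBytes, options: [])!
    let accBuffer = device.makeBuffer(length: floatBytes, options: .storageModeShared)!
    memset(accBuffer.contents(), 0, floatBytes)
    let offsetBuffer = device.makeBuffer(bytes: offsets,
                                         length: offsets.count * MemoryLayout<SIMD2<Int32>>.stride,
                                         options: [])!
    var offsetCount = UInt32(offsets.count)
    var size = SIMD2<UInt32>(UInt32(width), UInt32(height))

    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(srcBuffer, offset: 0, index: 0)
    encoder.setBuffer(accBuffer, offset: 0, index: 1)
    encoder.setBuffer(offsetBuffer, offset: 0, index: 2)
    encoder.setBytes(&offsetCount, length: MemoryLayout<UInt32>.stride, index: 3)
    encoder.setBytes(&size, length: MemoryLayout<SIMD2<UInt32>>.stride, index: 4)
    encoder.dispatchThreads(MTLSize(width: width, height: height, depth: 1),
                            threadsPerThreadgroup: MTLSize(width: 16, height: 16, depth: 1))
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    let result = accBuffer.contents().bindMemory(to: Float.self, capacity: source.count)
    return Array(UnsafeBufferPointer(start: result, count: source.count))
}
```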
I wish to apply a certain 2D filter to 2D images; however, the filter size is huge. Image dimensions are about 2000x2000 and the filter size is about 500x500.
No, I cannot do this in the frequency domain, so FFT is a no-go. I'm aware of normal GPU convolution and the use of shared memory for coalesced memory access; however, shared memory doesn't seem feasible, since the space needed by the filter is so large that it would have to be divided up, which might prove very complex to write.
Any ideas?
I think you can manage filtering images of this size directly. You can transfer hundreds of megabytes to video memory, so data of this size will work fine.
You can transfer the image data as byte matrices and then run your filter over it on the GPU.
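Whether that is fast enough you would have to measure, but the brute-force kernel itself is simple. Here is a hedged sketch of it in Metal, shown as the shader source only (host-side pipeline, buffer, and dispatch setup is the usual compute boilerplate); each thread computes one output pixel straight from global memory, with zero padding at the borders, and the buffer indices and names are my own:

```swift
// Metal Shading Language source, embedded as a Swift string for use with
// MTLDevice.makeLibrary(source:options:).
let convolveSource = """
#include <metal_stdlib>
using namespace metal;

kernel void convolve2D(device const float *image      [[buffer(0)]],
                       device const float *filter     [[buffer(1)]],
                       device float       *output     [[buffer(2)]],
                       constant uint2     &imageSize  [[buffer(3)]],   // e.g. 2000 x 2000
                       constant uint2     &filterSize [[buffer(4)]],   // e.g. 500 x 500
                       uint2 gid [[thread_position_in_grid]])
{
    if (gid.x >= imageSize.x || gid.y >= imageSize.y) return;
    int2 radius = int2(filterSize) / 2;
    float sum = 0.0f;
    for (uint fy = 0; fy < filterSize.y; ++fy) {
        for (uint fx = 0; fx < filterSize.x; ++fx) {
            int2 p = int2(gid) + int2(int(fx), int(fy)) - radius;   // zero padding outside
            if (p.x >= 0 && p.y >= 0 && p.x < int(imageSize.x) && p.y < int(imageSize.y)) {
                sum += image[p.y * int(imageSize.x) + p.x] * filter[fy * filterSize.x + fx];
            }
        }
    }
    output[gid.y * imageSize.x + gid.x] = sum;
}
"""
```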
I am designing an app and I am creating some images with PaintCode.
Using that program I get the actual code for each image I create, which allows me to choose between inserting the code and using an actual image. I was wondering what would consume less memory, the image code or an actual PNG?
I know an image's memory consumption is width x height x 4 bytes, but I have no idea whether an image that is generated by code is more memory efficient, less memory efficient, or breaks even.
This decision is particularly important given the different screen resolutions. It's a lot easier to create an image in code and expand it to whatever size I want than to go back to Photoshop every time.
This answer differs from the others because I have the impression that the graphics context is your most common destination -- that you are not always rendering to a discrete bitmap. So, for the purposes of typical drawing:
I was wondering what would consume less memory, the image code or an actual PNG?
It's most likely that the code will result in far less memory consumption.
I have no idea whether an image that is generated by code is more memory efficient, less memory efficient or breaks even?
There are a lot of variables and there is no simple equation to tell you which is better for any given input. If it's simple enough to create with a WYSIWYG, it's likely much smaller as code.
If you need to create intermediate rasterizations or layers for a vector-based renderer, then memory will be about equal once you have added the first layer. Typically, though, one does not (and should not) render each view or layer (not CALayer, by the way) into such intermediates, but renders directly into the graphics context instead. When all your views render directly into the graphics context, they write to the same destination.
With code, you also open yourself up to a few other variables that have the potential to add a lot of memory. The cost of font loading and caching can be quite high, and the code generator you use will not work out how best to cache and share those resources if you find you need to minimize memory consumption.
If your goal is to draw images, you should try to use UIImageView if you possibly can. It's generally the fastest and cheapest way to get an image to the screen, and it's reasonably flexible.
Someone explained it better here (source):
A vector image is almost always smaller in storage than its raster counterpart, except for photographs. In memory, though, both have to be rasterized if you need to display them, so they will use more or less the same amount of memory.
However, I am highly skeptical of the usefulness of PaintCode; in general it's better to use a standard image format such as .svg or .eps, instead of a non-standard format such as a domain-specific language (DSL) within Objective-C.
It makes no difference at all, provided the final image size (in point dimensions) is the same as the display size (in point dimensions). What is ultimately displayed in your app is, say, a 100x100 bitmap. Those are the same number of bits no matter how they were obtained to start with.
The place where memory gets wasted is from holding on to an image that is much larger (in point dimensions) than it is actually being displayed in the interface.
If I load a 3MB PNG from my app bundle, scale it down to 100x100, draw it in the interface, and let go of the original 3MB PNG, the result is exactly the same amount of memory in the backing store as if I had drawn the content of a 100x100 graphics context from scratch myself using Core Graphics (which is what PaintCode helps you do).
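A small illustration of that point (the asset name is hypothetical): whether the 100x100 image comes from scaling down a large PNG or from Core Graphics drawing code of the kind PaintCode generates, the bitmap you end up holding is the same width x height x 4 bytes, times the screen scale squared:

```swift
import UIKit

let size = CGSize(width: 100, height: 100)
let renderer = UIGraphicsImageRenderer(size: size)

// Option A: decode a large PNG, draw it scaled down, let go of the original.
let fromPNG = renderer.image { _ in
    let big = UIImage(named: "big3MBImage")        // hypothetical bundled asset
    big?.draw(in: CGRect(origin: .zero, size: size))
}

// Option B: draw the same content from scratch, PaintCode-style.
let fromCode = renderer.image { context in
    UIColor.systemBlue.setFill()
    context.cgContext.fillEllipse(in: CGRect(origin: .zero, size: size))
}

// Both images end up backed by a bitmap of the same size:
// 100 * 100 * 4 bytes, multiplied by the screen scale squared.
```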
I am creating a mosaic of two images based on the region matches between them, using SIFT descriptors. The problem is that when the created mosaic's size gets too large, MATLAB runs out of memory.
Is there some way of stitching the images without actually loading the complete images into memory?
If not, how do other gigapixel image generation techniques, or panorama apps, work?
Determine the size of the final mosaic prior to stitching (easy to compute with the size of your input images and the homography).
Write a blank mosaic to file (not in any specific format, just a sequence of bytes laid out as in memory).
I'm assuming you're inverse-mapping pixels from the original images to the mosaic. So, just write to the file whenever you would otherwise store a pixel's intensity in the in-memory mosaic.
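The question is about MATLAB, where fopen/fseek/fwrite or memmapfile would do this job, but the idea itself is language-agnostic. A sketch of the offset arithmetic (sizes and path are made up):

```swift
import Foundation

// 1) Preallocate a blank, full-size mosaic on disk (1 byte per grayscale pixel).
let mosaicWidth = 20_000, mosaicHeight = 12_000    // known up front from the homography
let url = URL(fileURLWithPath: "/tmp/mosaic.raw")
_ = FileManager.default.createFile(atPath: url.path, contents: nil)
let handle = try! FileHandle(forWritingTo: url)
try! handle.truncate(atOffset: UInt64(mosaicWidth * mosaicHeight))

// 2) While inverse-mapping, write each computed intensity at its byte offset
//    instead of into an in-memory array. (In practice you would buffer whole
//    rows or memory-map the file; single-byte writes just keep the arithmetic
//    obvious here.)
func writePixel(x: Int, y: Int, value: UInt8) throws {
    try handle.seek(toOffset: UInt64(y * mosaicWidth + x))
    try handle.write(contentsOf: Data([value]))
}
```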
There are a few ways you can save memory:
You should use integer data types, such as uint8 for your data.
If you're stitching, you can keep only the regions of interest in memory, such as the potential overlap regions.
If none of the others work, you can spatially downsample the images using imresize and work on the resulting smaller images.
You can potentially use distributed arrays from the Parallel Computing Toolbox.