Image resize on GPU is slower than cv::resize - opencv

I am resizing this test picture:
Mat im = Mat::zeros(Size(832*3,832*3),CV_8UC3);
putText(im,"HI THERE",Point2i(10,90),1,7,Scalar(255,255,255),2);
with the standard cv::resize
and with the CUDA version of resize:
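static void gpuResize(Mat in, Mat &out){
    double k = in.cols/416.;
    cuda::GpuMat gpuInImage;
    cuda::GpuMat gpuOutImage;
    gpuInImage.upload(in); // copy the source image to device memory
    const Size2i &newSize = Size(416, in.rows / k);
    //cout << "newSize " << newSize<< endl;
    // fx and fy must be passed (0, 0) before the interpolation flag
    cuda::resize(gpuInImage, gpuOutImage, newSize, 0, 0, INTER_NEAREST);
    gpuOutImage.download(out); // copy the result back to host memory
}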
Measuring the time shows that cv::resize is ~25 times faster. What am I doing wrong? I am on a GTX 1080 Ti video card, but I observe the same situation on a Jetson Nano. Are there any alternative methods to resize an image faster than cv::resize with NVIDIA hardware acceleration?

I was doing similar things today, and had the same results on my Jetson NX running in the NVP model 2 mode (15W, 6 core).
Using the CPU to resize an image 10,000 times was faster than resizing the same image 10,000 times with the GPU.
This was my code for the CPU:
cv::Mat cpu_original_image = cv::imread("test.png"); // 1400x690 RGB image
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
{
    cv::Mat cpu_resized_image;
    cv::resize(cpu_original_image, cpu_resized_image, desired_image_size);
}
This was my code for the GPU:
cv::cuda::GpuMat gpu_original_image;
gpu_original_image.upload(cpu_original_image); // not included in the loop timing
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
{
    cv::cuda::GpuMat gpu_resized_image;
    cv::cuda::resize(gpu_original_image, gpu_resized_image, desired_image_size);
}
My timing code (not shown above) was only for the for() loops, it didn't include imread() nor upload().
When called in a loop 10K times, my results were:
CPU: 5786.930 milliseconds
GPU: 9678.054 milliseconds (plus an additional 170.587 milliseconds for the upload())
Then I made 1 change to each loop. I moved the "resized" mat outside of the loop to prevent it from being created and destroyed at each iteration. My code then looked like this:
cv::Mat cpu_original_image = cv::imread("test.png"); // 1400x690 RGB image
cv::Mat cpu_resized_image;
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
cv::resize(cpu_original_image, cpu_resized_image, desired_image_size);
...and for the GPU:
cv::cuda::GpuMat gpu_original_image;
gpu_original_image.upload(cpu_original_image); // not included in the loop timing
cv::cuda::GpuMat gpu_resized_image;
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
    cv::cuda::resize(gpu_original_image, gpu_resized_image, desired_image_size);
The for() loop timing results are now:
CPU: 5768.181 milliseconds (basically unchanged)
GPU: 2827.898 milliseconds (from 9.7 seconds to 2.8 seconds)
This looks much better! GPU resize is now faster than CPU, as long as you're doing lots of work with the GPU and not a single resize, and as long as you don't continuously re-allocate temporary GPU mats, as that seems to be quite expensive.
But after all this, to go back to your original question: if all you are doing is resizing a single image once, or resizing many images once each, the GPU resize won't help you since uploading each image to the GPU mat will take longer than the original resize! Here are my results when trying that on a Jetson NX:
single image resize on CPU: 3.565 milliseconds
upload mat to GPU: 186.966 milliseconds
allocation of 2nd GPU mat and gpu resize: 225.925 milliseconds
So on the CPU the NX can do it in < 4 milliseconds, while on the GPU it takes over 400 milliseconds.
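To put a number on "lots of work", the timings above can be turned into a break-even count. This is only back-of-the-envelope arithmetic on the Jetson NX figures quoted in this answer (170.587 ms for the upload, 2827.898 ms and 5768.181 ms per 10,000 resizes). The function name breakEvenResizes is made up for the example, and the cost of downloading results is ignored.

```cpp
#include <cassert>
#include <cmath>

// Smallest number of resizes n for which one upload plus n GPU resizes is
// cheaper than n CPU resizes. Returns -1 if the GPU never catches up.
// All times are in milliseconds. (Hypothetical helper, not an OpenCV API.)
long breakEvenResizes(double uploadMs, double gpuPerResizeMs, double cpuPerResizeMs) {
    assert(uploadMs >= 0.0 && gpuPerResizeMs >= 0.0 && cpuPerResizeMs >= 0.0);
    const double savedPerResize = cpuPerResizeMs - gpuPerResizeMs;
    if (savedPerResize <= 0.0) {
        return -1; // the GPU is not faster per resize, so the upload is never repaid
    }
    // We need uploadMs + n * gpu < n * cpu, i.e. n > uploadMs / savedPerResize.
    return static_cast<long>(std::floor(uploadMs / savedPerResize)) + 1;
}
```

With those figures, breakEvenResizes(170.587, 2827.898 / 10000.0, 5768.181 / 10000.0) returns 581, so on that board the GPU path only pays off after several hundred resizes of the same uploaded image.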


u-disparity shows half the expected cols

I cannot see what I am doing wrong after checking the code a thousand times.
The algorithm is very simple: I have a CV_16U image with the disparity values called disp, and I am trying to implement the building of the u and v disparities in order to detect obstacles.
Mat v_disparity, u_disparity;
v_disparity=Mat::zeros(disp.rows,numberOfDisparities*16, CV_16U);
u_disparity=Mat::zeros(numberOfDisparities*16,disp.cols, CV_16U);
ushort *d;
for(int i = 0; i < disp.rows; i++) {
    d = disp.ptr<ushort>(i); //d[j] is the disparity value
    for (int j = 0; j < disp.cols; ++j) {
        v_disparity.at<uchar>(i,(d[j]))++;
        u_disparity.at<uchar>((d[j]),j)++;
    }
}
The problem shows up when I use imshow to display both disparities after converting them to 8-bit unsigned: the u-disparity is wrong. It has the shape it should have, but it fills only half the horizontal dimension, and the pixels on the right are black.
I finally figured it out. I was using the wrong template type when accessing the pixel values of the u- and v-disparities. I didn't notice it in the v-disparity because I thought there were no pixels in disp with high disparity values.
To sum up, the following lines:
v_disparity.at<uchar>(i,(d[j]))++;
u_disparity.at<uchar>((d[j]),j)++;
must be replaced by:
v_disparity.at<ushort>(i,(d[j]))++;
u_disparity.at<ushort>((d[j]),j)++;
since both images are CV_16U, and the type uchar is 8 bit, not 16 bit.
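For reference, here is a minimal sketch of the same accumulation without OpenCV, so the element type is explicit. Image16 and buildUVDisparity are hypothetical names invented for this example; the struct is only a stand-in for a CV_16U Mat. The reason for the half-width picture, as far as I can tell, is that at<uchar>(r, c) addresses byte c of a row, while a CV_16U row has two bytes per pixel, so the column indices only ever reach the first half of each row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major 16-bit image; a stand-in for a CV_16U cv::Mat (hypothetical type).
struct Image16 {
    int rows;
    int cols;
    std::vector<std::uint16_t> data;
    Image16(int r, int c) : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, 0) {}
    std::uint16_t& at(int r, int c) { return data[static_cast<std::size_t>(r) * cols + c]; }
    std::uint16_t at(int r, int c) const { return data[static_cast<std::size_t>(r) * cols + c]; }
};

// v gets one histogram of disparities per image row (rows x maxDisp);
// u gets one histogram per image column (maxDisp x cols).
void buildUVDisparity(const Image16& disp, int maxDisp, Image16& v, Image16& u) {
    assert(maxDisp > 0);
    v = Image16(disp.rows, maxDisp);
    u = Image16(maxDisp, disp.cols);
    for (int i = 0; i < disp.rows; ++i) {
        for (int j = 0; j < disp.cols; ++j) {
            const int d = disp.at(i, j);
            if (d >= maxDisp) continue; // ignore values outside the histogram range
            v.at(i, d)++;               // 16-bit counter, indexed in 16-bit elements
            u.at(d, j)++;
        }
    }
}
```

Every access goes through a 16-bit element type, which is what at<ushort> does in the corrected OpenCV code.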

SpriteKit SKTextureAtlas, Terminated Due to Memory Pressure while loading texture

Similar to the SpriteKit Featured Game "Adventure" from WWDC, I am trying to load my background image via tiles. I have created a Texture Atlas that contains 6,300 "tiles" that are each 100x100 pixels in size. The complete background image is a total of 30,000x2048 (for retina displays). The idea is that the background will move from right to left (side-scroller). The first column and the last column match so that they seem continuous.
When the application runs, it loads my initial loading screen and title images and spikes to 54MB in my memory tab with a CPU usage of 16%. This stays the same as I navigate through the menus until I choose my level, which tells a background thread to load the level assets (which include the aforementioned background image). The entire .atlas folder is only 35.4MB. I don't believe that this is a problem since the Adventure .atlas folder (from WWDC) is only 32.7MB.
Once I select the level, it loads approximately 20 of the textures in the .atlas folder before I start receiving memory warnings and it crashes the application. I've checked in Instruments for leaks and it doesn't show any memory leaks. I don't receive any compiler errors (not even the EXC_BAD_ACCESS one). I've looked at my device console and have found a few lines of where the application crashes, but it doesn't look to make much sense to me. I've also checked for Zombies, but haven't seemed to find any.
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^{
// Used to determine time spent to load
NSDate *startDate = [NSDate date];
// Atlas to load
SKTextureAtlas *tileAtlas = [SKTextureAtlas atlasNamed:@"Day"];
// Make sure the array is empty, before storing the tiles
sBackgroundTiles = nil;
sBackgroundTiles = [[NSMutableArray alloc] initWithCapacity:6300];
// For each row (21 Totals Rows)
for (int y = 0; y < 21; y++) {
// For each Column (100 Total Columns)
for (int x = 1; x <= 100; x++) {
// Get the tile number (0 * 32) + 0;
int tileNumber = (y * 300) + x;
// Create a SpriteNode of that tile
SKSpriteNode *tileNode = [SKSpriteNode spriteNodeWithTexture:[tileAtlas textureNamed:[NSString stringWithFormat:@"tile_%d.png", tileNumber]]];
// Position the SpriteNode
CGPoint position = CGPointMake((x * 100), (y * 100));
tileNode.position = position;
// At layer
tileNode.zPosition = -1.0f;
tileNode.blendMode = SKBlendModeReplace;
// Add to array
[(NSMutableArray *)sBackgroundTiles addObject:tileNode];
}
}
NSLog(@"Loaded all world tiles in %f seconds", [[NSDate date] timeIntervalSinceDate:startDate]);
});
This is what seems to pertain to the crash from the Debug console:
[9438] <Warning>: 1 +0.000000 sec [24de/1807]: error: ::read ( -1, 0x4069ec, 18446744069414585344 ) => -1 err = Bad file descriptor (0x00000009)
[9438] <Warning>: Exiting.
[1] (UIKitApplication:tv.thebasement.Coin-Voyage[0x641d][9441]) <Notice>: (UIKitApplication:tv.thebasement.Coin-Voyage[0x641d]) Exited: Killed: 9
I don't have enough reputation to post images, so I linked a screenshot of my allocations in Instruments.
Any help and advice is much appreciated! If I've left out some pertinent information, I'm glad to update.
The file size of an image file (PNG, JPG, atlas folder, etc) tells you nothing about the memory usage.
Instead you have to calculate the texture memory usage using the formula:
width * height * (color bit depth / 8) = texture size in bytes
For example an image with dimensions 4096x4096 pixels and 32 bits color depth (4 bytes) uses this much memory when loaded as a texture (uncompressed):
4096 * 4096 * 4 = 67108864 bytes (64 Megabytes)
According to your specs (6,300 tiles, each 100x100 pixels, assuming they're all unique and 32 bits per pixel) you're way, wayyyyyy above any reasonable limit for texture memory usage: 6,300 * 100 * 100 * 4 = 252,000,000 bytes, or about 240 Megabytes. The atlas size of 35 Megabytes (which is huge for an atlas btw) points the same way: assuming a mere 10:1 compression ratio you would be looking at 350 Megabytes of texture memory usage.
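The formula is easy to wrap in a small helper for checking your own assets. This is just a sketch of the arithmetic above; textureBytes is a made-up name, and it assumes uncompressed textures whose bit depth is a multiple of 8, with no padding or per-texture overhead added by the runtime.

```cpp
#include <cassert>
#include <cstdint>

// Uncompressed texture size in bytes: width * height * (color bit depth / 8).
// (Hypothetical helper; not part of SpriteKit.)
std::uint64_t textureBytes(std::uint64_t width, std::uint64_t height, std::uint64_t bitsPerPixel) {
    assert(bitsPerPixel % 8 == 0); // whole bytes per pixel only
    return width * height * (bitsPerPixel / 8);
}
```

textureBytes(4096, 4096, 32) gives 67108864, matching the example above, and textureBytes(100, 100, 32) gives 40000 bytes for a single 100x100 tile.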

Read and write Mat (Opencv) in parallel with OpenMP

I'm doing image processing on a Mat in OpenCV, and I'm trying to speed up the process by parallelising it with OpenMP:
Mat *input,*output = ...;
#pragma omp parallel for private(i,j)
for(i=startvalueX; i<stopvalueX; i++) {
for(j=startvalueY; j<stopvalueY; j++) {
if(input->at<uchar>(i,j)!=0 && simplePoint(i,j,input)) {
simplePoint is a method that grabs neighbours in input, and checks if it meets a predefined neighbourhood in a lookup table (around 15 lines of code).
Serially, this program takes 1.56s; in parallel, 3.5s.
The difference (using gprof) is:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
38.40      0.67      0.33       44     7.50    18.97  _GLOBAL__sub_I__Z10splitImagePcPN2cv3MatES2_
Any ideas?
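For readers unfamiliar with this kind of test, here is a minimal sketch of a lookup-table neighbourhood check like the simplePoint described above. Everything in it is an assumption for illustration: the post does not show simplePoint, so the flat uchar buffer, the neighbour ordering, the 256-entry table and the names neighbourCode and isSimplePoint are all hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs the 8 neighbours of pixel (i, j) into one byte: bit k is set when the
// k-th neighbour (clockwise, starting at the top-left) is non-zero.
// (i, j) must not lie on the image border.
std::uint8_t neighbourCode(const std::vector<std::uint8_t>& img, int cols, int i, int j) {
    static const int di[8] = {-1, -1, -1, 0, 1, 1, 1, 0};
    static const int dj[8] = {-1, 0, 1, 1, 1, 0, -1, -1};
    assert(i >= 1 && j >= 1 && j < cols - 1);
    assert(static_cast<std::size_t>(i + 1) * cols + (j + 1) < img.size());
    std::uint8_t code = 0;
    for (int k = 0; k < 8; ++k) {
        const std::size_t idx = static_cast<std::size_t>(i + di[k]) * cols + (j + dj[k]);
        if (img[idx] != 0) {
            code = static_cast<std::uint8_t>(code | (1u << k));
        }
    }
    return code;
}

// Looks the neighbourhood code up in a precomputed table of accepted patterns.
bool isSimplePoint(const std::vector<std::uint8_t>& img, int cols, int i, int j,
                   const std::array<bool, 256>& lut) {
    return lut[neighbourCode(img, cols, i, j)];
}
```

Because both functions only read img and lut, concurrent calls from an OpenMP loop do not race with each other, provided no thread writes to img at the same time.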

Average blurring mask produces different results

I am multiplying each pixel by the average blurring mask *(1/9), but the result is totally different.
PImage toAverageBlur(PImage a) {
  PImage aBlur = new PImage(a.width, a.height);
  for(int i = 0; i < a.width; i++) {
    for(int j = 0; j < a.height; j++) {
      int pixelPosition = i*a.width + j;
      int aPixel = ((a.pixels[pixelPosition] /9));
      aBlur.pixels[pixelPosition] = color(aPixel);
    }
  }
  return aBlur;
}
Currently, you are not applying an average filter, you are only scaling the image by a factor of 1/9, which would make it darker. Your terminology is good, you are trying to apply a 3x3 moving average (or neighbourhood average), also known as a boxcar filter.
For each pixel i,j, you need to take the sum of (i-1,j-1), (i-1,j), (i-1,j+1), (i,j-1), (i,j),(i,j+1),(i+1,j-1),(i+1,j),(i+1,j+1), then divide by 9 (for a 3x3 average). For this to work, you need to not consider the pixels on the image edge, which do not have 9 neighbours (so you start at pixel (1,1), for example). The output image will be a pixel smaller on each side. Alternatively, you can mirror values out to add an extra line to your input image which will make the output image the same size as the original.
There are more efficient ways of doing this for large masks, for example FFT-based convolution, whose cost does not grow with the mask size. For a 3x3 mask the direct neighbourhood sum is fine.
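The neighbourhood sum described above can be sketched as follows. It is written in C++ rather than Processing and works on a single gray value per pixel; boxBlur3x3 is a made-up name. In Processing the entries of pixels[] are packed ARGB values, so there each channel would have to be unpacked and averaged separately. The sketch takes the "skip the edge pixels" option, so the output is one pixel smaller on each side.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 3x3 moving average over a row-major grayscale image.
// Edge pixels are skipped, so the result is (width - 2) x (height - 2).
std::vector<int> boxBlur3x3(const std::vector<int>& src, int width, int height) {
    assert(width >= 3 && height >= 3);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    const int outW = width - 2;
    const int outH = height - 2;
    std::vector<int> dst(static_cast<std::size_t>(outW) * outH);
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            int sum = 0;
            // Add up the pixel itself and its 8 neighbours.
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    sum += src[static_cast<std::size_t>(y + dy) * width + (x + dx)];
                }
            }
            dst[static_cast<std::size_t>(y - 1) * outW + (x - 1)] = sum / 9;
        }
    }
    return dst;
}
```

For a 3x3 input holding the values 1 to 9, the single output pixel is 45 / 9 = 5.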

Magnitude of FFT result depends on wave frequency?

I'm baffled by the results I'm getting from FFT and would appreciate any help.
I'm using FFTW 3.2.2 but have gotten similar results with other FFT implementations (in Java). When I take the FFT of a sine wave, the scaling of the result depends on the frequency (Hz) of the wave--specifically, whether it's close to a whole number or not. The resulting values are scaled really small when the frequency is near a whole number, and they're orders of magnitude larger when the frequency is in between whole numbers. This graph shows the magnitude of the spike in the FFT result corresponding to the wave's frequency, for different frequencies. Is this right??
I checked that the inverse FFT of the FFT is equal to the original sine wave times the number of samples, and it is. The shape of the FFT also seems to be correct.
It wouldn't be so bad if I were analyzing individual sine waves, because I could just look for the spike in the FFT regardless of its height. The problem is that I want to analyze sums of sine waves. If I'm analyzing a sum of sine waves at, say, 440 Hz and 523.25 Hz, then only the spike for the one at 523.25 Hz shows up. The spike for the other is so tiny that it just looks like noise. There must be some way to make this work because in Matlab it does work-- I get similar-sized spikes at both frequencies. How can I change the code below to equalize the scaling for different frequencies?
#include <cstdlib>
#include <cstring>
#include <cmath>
#include <fftw3.h>
#include <cstdio>
using namespace std;
const double PI = 3.141592;
/* Samples from 1-second sine wave with given frequency (Hz) */
void sineWave(double a[], double frequency, int samplesPerSecond, double ampFactor);
int main(int argc, char** argv) {
    /* Args: frequency (Hz), samplesPerSecond, ampFactor */
    if (argc != 4) return -1;
    double frequency = atof(argv[1]);
    int samplesPerSecond = atoi(argv[2]);
    double ampFactor = atof(argv[3]);
    /* Init FFT input and output arrays. */
    double * wave = new double[samplesPerSecond];
    sineWave(wave, frequency, samplesPerSecond, ampFactor);
    double * fftHalfComplex = new double[samplesPerSecond];
    int fftLen = samplesPerSecond/2 + 1;
    double * fft = new double[fftLen];
    double * ifft = new double[samplesPerSecond];
    /* Do the FFT. */
    fftw_plan plan = fftw_plan_r2r_1d(samplesPerSecond, wave, fftHalfComplex, FFTW_R2HC, FFTW_ESTIMATE);
    fftw_execute(plan);
    memcpy(fft, fftHalfComplex, sizeof(double) * fftLen);
    fftw_destroy_plan(plan);
    /* Do the IFFT. */
    fftw_plan iplan = fftw_plan_r2r_1d(samplesPerSecond, fftHalfComplex, ifft, FFTW_HC2R, FFTW_ESTIMATE);
    fftw_execute(iplan);
    fftw_destroy_plan(iplan);
    printf("%s,%s,%s", argv[1], argv[2], argv[3]);
    for (int i = 0; i < samplesPerSecond; i++) {
        printf("\t%.6f", wave[i]);
    }
    printf("\n");
    printf("%s,%s,%s", argv[1], argv[2], argv[3]);
    for (int i = 0; i < fftLen; i++) {
        printf("\t%.9f", fft[i]);
    }
    printf("\n");
    printf("%s,%s,%s", argv[1], argv[2], argv[3]);
    for (int i = 0; i < samplesPerSecond; i++) {
        printf("\t%.6f (%.6f)", ifft[i], samplesPerSecond * wave[i]); // actual and expected result
    }
    printf("\n");
    delete[] wave;
    delete[] fftHalfComplex;
    delete[] fft;
    delete[] ifft;
    return 0;
}
void sineWave(double a[], double frequency, int samplesPerSecond, double ampFactor) {
    for (int i = 0; i < samplesPerSecond; i++) {
        double time = i / (double) samplesPerSecond;
        a[i] = ampFactor * sin(2 * PI * frequency * time);
    }
}
The resulting values are scaled really small when the frequency is near a whole number, and they're orders of magnitude larger when the frequency is in between whole numbers.
That's because a Fast Fourier Transform assumes the input is periodic and is repeated infinitely. If the sampled window holds a nonintegral number of sine cycles and you repeat it, the result is not a perfect sine wave. This causes an FFT result that suffers from "spectral leakage".
Look into window functions. These attenuate the input at the beginning and end, so that spectral leakage is diminished.
p.s.: if you want to get precise frequency content around the fundamental, capture lots of wave cycles and you don't need to capture too many points per cycle (32 or 64 points per cycle is probably plenty). If you want to get precise frequency content at higher harmonics, capture a smaller number of cycles, and more points per cycle.
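To see the leakage and the effect of a window in isolation, here is a small sketch that uses a naive O(N^2) DFT instead of FFTW so it needs no external library. The names sampleSine and magnitudeSpectrum are invented for the example, and the Hann window is just one common choice. One more thing worth knowing: in FFTW's halfcomplex layout the first n/2 + 1 outputs are the real parts only, and a sine that falls exactly on a bin has an almost purely imaginary transform, so the sketch computes the magnitude from both parts.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const double kPi = 3.14159265358979323846;

// n samples of a unit sine with the given number of cycles per window,
// optionally multiplied by a Hann window.
std::vector<double> sampleSine(std::size_t n, double cyclesPerWindow, bool hann) {
    assert(n > 1);
    std::vector<double> x(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double w = hann ? 0.5 - 0.5 * std::cos(2.0 * kPi * i / n) : 1.0;
        x[i] = w * std::sin(2.0 * kPi * cyclesPerWindow * i / n);
    }
    return x;
}

// Magnitude |X[k]| for bins k = 0 .. n/2, computed with a direct DFT.
std::vector<double> magnitudeSpectrum(const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> mag(n / 2 + 1);
    for (std::size_t k = 0; k <= n / 2; ++k) {
        double re = 0.0;
        double im = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            const double phase = 2.0 * kPi * k * i / n;
            re += x[i] * std::cos(phase);
            im -= x[i] * std::sin(phase);
        }
        mag[k] = std::hypot(re, im); // uses both real and imaginary parts
    }
    return mag;
}
```

With 64 samples, a sine of exactly 8 cycles puts a magnitude of about 32 (n/2) in bin 8 and essentially nothing elsewhere. At 8.5 cycles the energy spreads into distant bins, and applying the Hann window reduces that spread considerably.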
I can only recommend that you look at GNU Radio code. The file that could be of particular interest to you is
