i'm using image processing on a mat in opencv, and i try to speed up the process with openmp by parallellising with openmp
Mat *input,*output = ...;
#pragma omp parallel for private(i,j)
for(i=startvalueX; i<stopvalueX; i++) {
for(j=startvalueY; j<stopvalueY; j++) {
if(input->at<uchar>(i,j)!=0 && simplePoint(i,j,input) {
output->at<uchar>(i,j)=0;
}
}
}
simplePoint is a method that grabs neighbours in input, and checks if it meets a predefined neighbourhood in a lookup table (around 15 lines of code).
serially, this program takes 1.56s, in parallel, 3.5s
the difference (using gprof) is
% cumulative self self total
time seconds seconds calls ms/call ms/call name
38.40 0.67 0.33 44 7.50 18.97 _GLOBAL__sub_I__Z10splitImagePcPN2cv3MatES2_
Any ideas?
Related
I am resizing this test picture:
Mat im = Mat::zeros(Size(832*3,832*3),CV_8UC3);
putText(im,"HI THERE",Point2i(10,90),1,7,Scalar(255,255,255),2);
by standard
cv::resize(im,out,Size(416,416),0,0,INTER_NEAREST);
and by CUDA version of resize:
static void gpuResize(Mat in, Mat &out){
double k = in.cols/416.;
cuda::GpuMat gpuInImage;
cuda::GpuMat gpuOutImage;
gpuInImage.upload(in);
const Size2i &newSize = Size(416, in.rows / k);
//cout << "newSize " << newSize<< endl;
cuda::resize(gpuInImage, gpuOutImage, newSize,INTER_NEAREST);
gpuOutImage.download(out);
}
Measuring time shows that cv::resize is ~25 times faster. What am I doing wrong? I on GTX1080ti videocard, but also observe same situation on Jetson NANO. May be there are any alternative methods to resize image faster then cv::resize with nvidia hardware acceleration?
I was doing similar things today, and had the same results on my Jetson NX running in the NVP model 2 mode (15W, 6 core).
Using the CPU to resize an image 10,000 times was faster than resizing the same image 10,000 times with the GPU.
This was my code for the CPU:
cv::Mat cpu_original_image = cv::imread("test.png"); // 1400x690 RGB image
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
{
cv::Mat cpu_resized_image;
cv::resize(cpu_original_image, cpu_resized_image, desired_image_size);
}
This was my code for the GPU:
cv::cuda::GpuMat gpu_original_image;
gpu_original_image.upload(cpu_original_image);
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
{
cv::cuda::GpuMat gpu_resized_image;
cv::cuda::resize(gpu_original_image, gpu_resized_image, desired_image_size);
}
My timing code (not shown above) was only for the for() loops, it didn't include imread() nor upload().
When called in a loop 10K times, my results were:
CPU: 5786.930 milliseconds
GPU: 9678.054 milliseconds (plus an additional 170.587 milliseconds for the upload())
Then I made 1 change to each loop. I moved the "resized" mat outside of the loop to prevent it from being created and destroyed at each iteration. My code then looked like this:
cv::Mat cpu_original_image = cv::imread("test.png"); // 1400x690 RGB image
cv::Mat cpu_resized_image;
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
{
cv::resize(cpu_original_image, cpu_resized_image, desired_image_size);
}
...and for the GPU:
cv::cuda::GpuMat gpu_original_image;
gpu_original_image.upload(cpu_original_image);
cv::cuda::GpuMat gpu_resized_image;
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
{
cv::cuda::resize(gpu_original_image, gpu_resized_image, desired_image_size);
}
The for() loop timing results are now:
CPU: 5768.181 milliseconds (basically unchanged)
GPU: 2827.898 milliseconds (from 9.7 seconds to 2.8 seconds)
This looks much better! GPU resize is now faster than CPU resize...as long as you're doing lots of work with the GPU and not a single resize. And as long as you don't continuously re-allocate temporary GPU mats, as that seems to be quite expensive.
But after all this, to go back to your original question: if all you are doing is resizing a single image once, or resizing many images once each, the GPU resize won't help you since uploading each image to the GPU mat will take longer than the original resize! Here are my results when trying that on a Jetson NX:
single image resize on CPU: 3.565 milliseconds
upload mat to GPU: 186.966 milliseconds
allocation of 2nd GPU mat and gpu resize: 225.925 milliseconds
So on the CPU the NX can do it in < 4 milliseconds, while on the GPU it takes over 400 milliseconds.
I have a task - to multiply big row vector (10 000 elements) via big column-major matrix (10 000 rows, 400 columns). I decided to go with ARM NEON since I'm curious about this technology and would like to learn more about it.
Here's a working example of vector matrix multiplication I wrote:
//float* vec_ptr - a pointer to vector
//float* mat_ptr - a pointer to matrix
//float* out_ptr - a pointer to output vector
//int matCols - matrix columns
//int vecRows - vector rows, the same as matrix
for (int i = 0, max_i = matCols; i < max_i; i++) {
for (int j = 0, max_j = vecRows - 3; j < max_j; j+=4, mat_ptr+=4, vec_ptr+=4) {
float32x4_t mat_val = vld1q_f32(mat_ptr); //get 4 elements from matrix
float32x4_t vec_val = vld1q_f32(vec_ptr); //get 4 elements from vector
float32x4_t out_val = vmulq_f32(mat_val, vec_val); //multiply vectors
float32_t total_sum = vaddvq_f32(out_val); //sum elements of vector together
out_ptr[i] += total_sum;
}
vec_ptr = &myVec[0]; //switch ptr back again to zero element
}
The problem is that it's taking very long time to compute - 30 ms on iPhone 7+ when my goal is 1 ms or even less if it's possible. Current execution time is understandable since I launch multiplication iteration 400 * (10000 / 4) = 1 000 000 times.
Also, I tried to process 8 elements instead of 4. It seems to help, but numbers still very far from my goal.
I understand that I might make some horrible mistakes since I'm newbie with ARM NEON. And I would be happy if someone can give me some tip how I can optimize my code.
Also - is it worth doing big vector-matrix multiplication via ARM NEON? Does this technology fit well for such purpose?
Your code is completely flawed: it iterates 16 times assuming both matCols and vecRows are 4. What's the point of SIMD then?
And the major performance problem lies in float32_t total_sum = vaddvq_f32(out_val);:
You should never convert a vector to a scalar inside a loop since it causes a pipeline hazard that costs around 15 cycles everytime.
The solution:
float32x4x4_t myMat;
float32x2_t myVecLow, myVecHigh;
myVecLow = vld1_f32(&pVec[0]);
myVecHigh = vld1_f32(&pVec[2]);
myMat = vld4q_f32(pMat);
myMat.val[0] = vmulq_lane_f32(myMat.val[0], myVecLow, 0);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[1], myVecLow, 1);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[2], myVecHigh, 0);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[3], myVecHigh, 1);
vst1q_f32(pDst, myMat.val[0]);
Compute all the four rows in a single pass
Do a matrix transpose (rotation) on-the-fly by vld4
Do vector-scalar multiply-accumulate instead of vector-vector multiply and horizontal add that causes the pipeline hazards.
You were asking if SIMD is suitable for matrix operations? A simple "yes" would be a monumental understatement. You don't even need a loop for this.
I am using this GitHub project called "The Amazing Audio Engine", to capture audio from the microphone. So I am using this method:
id<AEAudioReceiver> receiver = [AEBlockAudioReceiver audioReceiverWithBlock: ^(void *source, const AudioTimeStamp *time, UInt32 frames, AudioBufferList *audio) {
// Do something with 'audio'
}];
This method fires every 23 ms delivering an audio array containing all amplitudes of the sound wave over that 23 ms interval.
This is the catch. This audio sound I am dealing with is a FM signal, composed of two frequencies, one at 1000 Hz and one at twice the frequency that represents zeros and ones of a digital stream.
This is my problem. At that point I have an array of audio amplitudes over 0.23 ms.
So I thought I could do a FFT to convert the signal into frequency levels. I used this code:
// Setup the length
vDSP_Length log2n = log2f(numFrames);
// Calculate the weights array. This is a one-off operation.
FFTSetup fftSetup = vDSP_create_fftsetup(log2n, FFT_RADIX2);
// For an FFT, numSamples must be a power of 2, i.e. is always even
int nOver2 = numFrames/2;
// Populate *window with the values for a hamming window function
float *window = (float *)malloc(sizeof(float) * numFrames);
vDSP_hamm_window(window, numFrames, 0);
// Window the samples
vDSP_vmul(data, 1, window, 1, data, 1, numFrames);
// Define complex buffer
COMPLEX_SPLIT A;
A.realp = (float *) malloc(nOver2*sizeof(float));
A.imagp = (float *) malloc(nOver2*sizeof(float));
// Pack samples:
// C(re) -> A[n], C(im) -> A[n+1]
vDSP_ctoz((COMPLEX*)data, 2, &A, 1, numFrames/2);
// RUN THE FFT
//Perform a forward FFT using fftSetup and A
//Results are returned in A
vDSP_fft_zrip(fftSetup, &A, 1, log2n, FFT_FORWARD);
Because each interval is 172 Hz and I want to isolate 1000Hz, I think the 6th "bucket" of the FFT result would be the one, so I have this code:
//Convert COMPLEX_SPLIT A result to magnitudes
float amp[numFrames];
amp[0] = A.realp[0]/(numFrames*2);
for(int i=1; i<numFrames; i++) {
amp[i]=A.realp[i]*A.realp[i]+A.imagp[i]*A.imagp[i];
}
// I need the 6th and the 12th bucket, so I need a[5] and a[11]
but then I am starting to think that the FFT is not what I want because a[5] and a[11] will give me the amplitudes of ~1000Hz and ~2000Hz over 0.23 ms but in fact what I need are all the variations of the 1000 Hz and 2000 Hz sounds had over the 0.23ms time. In fact I need to obtain arrays, not single values.
In broad lines what should I do to obtain the amplitudes over time of the two frequencies, 1000 and 2000 Hz?
If you know what time resolution you want, two Goertzel filters slid by that length would allow you to measure the amplitudes of your two frequencies with much less overhead than using FFTs. The length of the filter or FFT need not (usually should not) be the same length as the number of frames from each audio callback. You can use a circular buffer or fifo to decouple the lengths. (In iOS, the numFrames can be different on different device models, and may suddenly change depending on other factors outside the apps control).
I'm currently using FFT code from here:
https://github.com/syedhali/EZAudio/tree/master/EZAudioExamples/iOS/EZAudioFFTExample
Here's the code from the 2 relevant methods:
-(void)createFFTWithBufferSize:(float)bufferSize withAudioData:(float*)data {
// Setup the length
_log2n = log2f(bufferSize);
// Calculate the weights array. This is a one-off operation.
_FFTSetup = vDSP_create_fftsetup(_log2n, FFT_RADIX2);
// For an FFT, numSamples must be a power of 2, i.e. is always even
int nOver2 = bufferSize/2;
// Populate *window with the values for a hamming window function
float *window = (float *)malloc(sizeof(float)*bufferSize);
vDSP_hamm_window(window, bufferSize, 0);
// Window the samples
vDSP_vmul(data, 1, window, 1, data, 1, bufferSize);
free(window);
// Define complex buffer
_A.realp = (float *) malloc(nOver2*sizeof(float));
_A.imagp = (float *) malloc(nOver2*sizeof(float));
}
-(void)updateFFTWithBufferSize:(float)bufferSize withAudioData:(float*)data {
// For an FFT, numSamples must be a power of 2, i.e. is always even
int nOver2 = bufferSize/2;
// Pack samples:
// C(re) -> A[n], C(im) -> A[n+1]
vDSP_ctoz((COMPLEX*)data, 2, &_A, 1, nOver2);
// Perform a forward FFT using fftSetup and A
// Results are returned in A
vDSP_fft_zrip(_FFTSetup, &_A, 1, _log2n, FFT_FORWARD);
// Convert COMPLEX_SPLIT A result to magnitudes
float amp[nOver2];
float maxMag = 0;
for(int i=0; i<nOver2; i++) {
// Calculate the magnitude
float mag = _A.realp[i]*_A.realp[i]+_A.imagp[i]*_A.imagp[i];
maxMag = mag > maxMag ? mag : maxMag;
}
for(int i=0; i<nOver2; i++) {
// Calculate the magnitude
float mag = _A.realp[i]*_A.realp[i]+_A.imagp[i]*_A.imagp[i];
// Bind the value to be less than 1.0 to fit in the graph
amp[i] = [EZAudio MAP:mag leftMin:0.0 leftMax:maxMag rightMin:0.0 rightMax:1.0];
}
I've modified the updateFFTWithBufferSize method above so that I could get the frequency in Hz like this:
for(int i=0; i<nOver2; i++) {
// Calculate the magnitude
float mag = _A.realp[i]*_A.realp[i]+_A.imagp[i]*_A.imagp[i];
if(maxMag < mag) {
_i_max = i;
}
maxMag = mag > maxMag ? mag : maxMag;
}
float frequency = _i_max / bufferSize * 44100;
NSLog(#"FREQUENCY: %f", frequency);
I've generated a few pure sine tones with Audacity at different frequencies to test with. The issue I'm seeing is that the code is returning the same frequency for two different sine tones that are relatively close in value.
For example:
A sine tone generated at 19255Hz will show up from FFT as 19293.750000Hz. So will a sine tone generated at 19330Hz. Something must be off in the calculations.
Any assistance in how I can modify the above code to get a more precise FFT frequency reading for pure sine tones is greatly appreciated.
Thank you!
You can get a rough frequency estimate by fitting a parabolic curve to the 3 FFT bin magnitudes around the peak magnitude bin, and then finding the extrema of that parabola.
A better estimate can be created by using the transform of your FFT window as an interpolation kernel, and doing successive approximation to refine an estimate of the maxima of the interpolated points. (Zero padding and using a much longer FFT will give you a similar type of interpolated estimate.)
The easy way for a stationary signal is, if possible, to just use a longer FFT with more samples that span a longer time interval.
You've got a number of problems going on here:
1) Your frequency axis spacing is fmax/N, or about 80Hz, so you're not going to get a resolution much better than that.
2) You're signal is very close to the Nyquist frequency (ie, 20KHz/44.1KHz is almost 0.5), and when you're this close to the Nyquist limit you need to be very careful if you want accurate results. (That is, at 20KHz, you're only recording about two data points for each full oscillation cycle.)
3) Since 20KHz is at the edge of human hearing (and higher for most people), many microphones don't really worry about it. Here's a measurement for the iPhone.
Perhaps your sampling frequency isn't high enough?
The FFT is a very good method to get a spectrum if you don't know anything about the input. If you know that the input is a pure sine wave, you can do much better. Start off by calculating the FFT to get a rough idea where the sine is. Get the minimum and maximum to estimate the amplitude [or get that from the FFT - square all inputs, add them, take square root] , get the phase at the begin and end given the estimated frequency and amplitude.
In general, you'll find that the phase does not match. That's because the phase at the end is off by 2*Δf * N. f - Δf is a better estimate of the frequency. Keep in mind that such a method is super noise sensitive. The method works because the input is a pure sine wave, and noise is everything but that. Using this method iteratively blows up quickly; you even hit rounding errors (not sinusoidal either)
Another similar trick is subtracting the estimated wave. The difference between two sines is the product of two sines, one with the frequencies added (in your case, ±38.5 kHz) and one with the frequencies subtracted (Δ_f_, less than 100 Hz). See also Heterodyne detection
I'm baffled by the results I'm getting from FFT and would appreciate any help.
I'm using FFTW 3.2.2 but have gotten similar results with other FFT implementations (in Java). When I take the FFT of a sine wave, the scaling of the result depends on the frequency (Hz) of the wave--specifically, whether it's close to a whole number or not. The resulting values are scaled really small when the frequency is near a whole number, and they're orders of magnitude larger when the frequency is in between whole numbers. This graph shows the magnitude of the spike in the FFT result corresponding to the wave's frequency, for different frequencies. Is this right??
I checked that the inverse FFT of the FFT is equal to the original sine wave times the number of samples, and it is. The shape of the FFT also seems to be correct.
It wouldn't be so bad if I were analyzing individual sine waves, because I could just look for the spike in the FFT regardless of its height. The problem is that I want to analyze sums of sine waves. If I'm analyzing a sum of sine waves at, say, 440 Hz and 523.25 Hz, then only the spike for the one at 523.25 Hz shows up. The spike for the other is so tiny that it just looks like noise. There must be some way to make this work because in Matlab it does work-- I get similar-sized spikes at both frequencies. How can I change the code below to equalize the scaling for different frequencies?
#include <cstdlib>
#include <cstring>
#include <cmath>
#include <fftw3.h>
#include <cstdio>
using namespace std;
const double PI = 3.141592;
/* Samples from 1-second sine wave with given frequency (Hz) */
void sineWave(double a[], double frequency, int samplesPerSecond, double ampFactor);
int main(int argc, char** argv) {
/* Args: frequency (Hz), samplesPerSecond, ampFactor */
if (argc != 4) return -1;
double frequency = atof(argv[1]);
int samplesPerSecond = atoi(argv[2]);
double ampFactor = atof(argv[3]);
/* Init FFT input and output arrays. */
double * wave = new double[samplesPerSecond];
sineWave(wave, frequency, samplesPerSecond, ampFactor);
double * fftHalfComplex = new double[samplesPerSecond];
int fftLen = samplesPerSecond/2 + 1;
double * fft = new double[fftLen];
double * ifft = new double[samplesPerSecond];
/* Do the FFT. */
fftw_plan plan = fftw_plan_r2r_1d(samplesPerSecond, wave, fftHalfComplex, FFTW_R2HC, FFTW_ESTIMATE);
fftw_execute(plan);
memcpy(fft, fftHalfComplex, sizeof(double) * fftLen);
fftw_destroy_plan(plan);
/* Do the IFFT. */
fftw_plan iplan = fftw_plan_r2r_1d(samplesPerSecond, fftHalfComplex, ifft, FFTW_HC2R, FFTW_ESTIMATE);
fftw_execute(iplan);
fftw_destroy_plan(iplan);
printf("%s,%s,%s", argv[1], argv[2], argv[3]);
for (int i = 0; i < samplesPerSecond; i++) {
printf("\t%.6f", wave[i]);
}
printf("\n");
printf("%s,%s,%s", argv[1], argv[2], argv[3]);
for (int i = 0; i < fftLen; i++) {
printf("\t%.9f", fft[i]);
}
printf("\n");
printf("\n");
printf("%s,%s,%s", argv[1], argv[2], argv[3]);
for (int i = 0; i < samplesPerSecond; i++) {
printf("\t%.6f (%.6f)", ifft[i], samplesPerSecond * wave[i]); // actual and expected result
}
delete[] wave;
delete[] fftHalfComplex;
delete[] fft;
delete[] ifft;
}
void sineWave(double a[], double frequency, int samplesPerSecond, double ampFactor) {
for (int i = 0; i < samplesPerSecond; i++) {
double time = i / (double) samplesPerSecond;
a[i] = ampFactor * sin(2 * PI * frequency * time);
}
}
The resulting values are scaled really small when the frequency is near a whole number, and they're orders of magnitude larger when the frequency is in between whole numbers.
That's because a Fast Fourier Transform assumes the input is periodic and is repeated infinitely. If you have a nonintegral number of sine waves, and you repeat this waveform, it is not a perfect sine wave. This causes an FFT result that suffers from "spectral leakage"
Look into window functions. These attenuate the input at the beginning and end, so that spectral leakage is diminished.
p.s.: if you want to get precise frequency content around the fundamental, capture lots of wave cycles and you don't need to capture too many points per cycle (32 or 64 points per cycle is probably plenty). If you want to get precise frequency content at higher harmonics, capture a smaller number of cycles, and more points per cycle.
I can only recommend that you look at GNU Radio code. The file that could be of particular interest to you is usrp_fft.py.