FFTW fftwf_plan_r2r_2d() with FFTW_REDFT01 equivalent - opencv

I am trying to port code that uses FFTW to use KissFFT.
The code uses fftwf_plan_r2r_2d() with FFTW_REDFT01.
What would be the equivalent call in KissFFT?
If this call (with FFTW_REDFT01) is equivalent to a DCT, could I just use a direct DCT transform instead, such as OpenCV's cv::dct?
Is there some input data modification I'd need to do, like reflections and symmetrizations?

Answering my own question...
With the help of these two references, I ended up not using DFT at all, but using OpenCV's cv::dct() and cv::idct() instead.
To answer the question, fftwf_plan_r2r_2d(...,FFTW_REDFT10, FFTW_REDFT10,...) can be replaced by this OpenCV code with the additional scaling:
cv::dct(img, resFFT); // fwd dct. This is like Matlab's dct2()
resFFT *= (4 * sqrt(float(img.rows/2)) * sqrt(float(img.cols/2)));
resFFT.row(0) *= sqrt(2.f);
resFFT.col(0) *= sqrt(2.f);
The inverse with FFTW_REDFT01 can be done like so:
// First re-scale the data for idct():
resFFT /= (4 * sqrt(float(img.rows/2)) * sqrt(float(img.cols/2)));
resFFT.row(0) /= sqrt(2.f);
resFFT.col(0) /= sqrt(2.f);
cv::idct(resFFT, outImg); // this will return the input exactly
// However, the transforms computed by FFTW are unnormalized, exactly like the corresponding real and complex DFTs,
// so computing a transform followed by its inverse yields the original array scaled by N, where N is the logical DFT size.
// The logical DFT size is N = 2*n for each axis; this is the implicit symmetrization
// of the image: reflect right and then reflect both halves down.
int logicalSizeN = (2*img.rows) * (2*img.cols);
outImg *= logicalSizeN; // scale to be like FFTW result
More helpful links here and here.
Note that OpenCV's cv::dct() supports only arrays with an even number of rows and columns.
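To make the "scaled by the logical DFT size" remark concrete, here is a minimal 1D sketch in plain C++ (no FFTW or OpenCV). It assumes FFTW's documented unnormalized conventions for REDFT10 and REDFT01; the function names redft10 and redft01 are made up for this illustration, and the O(n^2) loops are for checking only. Applying one after the other returns the input multiplied by 2*n; in 2D the same factor applies along each axis, which gives the (2*rows)*(2*cols) used above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive DCT-II in FFTW's REDFT10 convention (unnormalized):
//   Y[k] = 2 * sum_{j=0}^{n-1} X[j] * cos(pi * (j + 0.5) * k / n)
std::vector<double> redft10(const std::vector<double>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<double> y(n, 0.0);
    for (std::size_t k = 0; k < n; ++k) {
        double acc = 0.0;
        for (std::size_t j = 0; j < n; ++j)
            acc += x[j] * std::cos(pi * (j + 0.5) * k / n);
        y[k] = 2.0 * acc;
    }
    return y;
}

// Naive DCT-III in FFTW's REDFT01 convention (unnormalized):
//   Y[k] = X[0] + 2 * sum_{j=1}^{n-1} X[j] * cos(pi * j * (k + 0.5) / n)
std::vector<double> redft01(const std::vector<double>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<double> y(n, 0.0);
    if (n == 0) return y;
    for (std::size_t k = 0; k < n; ++k) {
        double acc = 0.0;
        for (std::size_t j = 1; j < n; ++j)
            acc += x[j] * std::cos(pi * j * (k + 0.5) / n);
        y[k] = x[0] + 2.0 * acc;
    }
    return y;
}
```

Dividing redft01(redft10(x)) by 2*n recovers x up to rounding error, which is the 1D counterpart of the outImg scaling in the answer.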


How to write a multiplication of double values in NEON assembly?

The line in question is pretty contained:
w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1]
Considering these variables are double (but I can downgrade to float), I think I can pass one value per register? Would it be more efficient to use the 2x2 matrix W directly?
This line is inside a loop that is fired hundreds of times per second and has real-time requirements. Instruments says this line takes 60% of the time of the loop.
This is the loop(s) I'm talking about:
for (int x=startingX; x<endingX; ++x) {
    for (int y=startingY; y<endingY; ++y) {
        Matx21d position(x,y);
        // warp patch
        uint8_t *data;
        [self backwardWarpPatchWithWarpingMatrix:warpingMatrix withWarpData:&data withReferenceImage:_initialView withCenter:position];
        // check that the backward patch was successful
        if (!data) {
            continue;
        }
        // calculate zero mean (on the patch) sum of squared differences
        int ssd = [self computeZMSSDScoreWithX:x withY:y withCurrentTargetPatch:data];
        if (fabs(ssd) < bestSSD) {
            bestPosition = position;
            bestSSD = ssd;
        }
    }
}
// Inside backwardWarpPatchWithWarpingMatrix:withWarpData:withReferenceImage:withCenter:
Matx22d warpingMatrixInverse = warpingMatrix.inv();
double wmi0 = warpingMatrixInverse(0,0), wmi1 = warpingMatrixInverse(0,1), wmi2 = warpingMatrixInverse(1,0), wmi3 = warpingMatrixInverse(1,1);
if (isnan(wmi0)) {
    warpingMatrixInverse = Matx22d::eye();
}
// Perform the warp on a larger patch.
int LEVEL_REF = 0, halfPatchSize = PATCH_SIZE/2;
Matx21d centerInLevel = center * (1.0 / (1<<LEVEL_REF));
__block Mat warped(PATCH_SIZE, PATCH_SIZE, CV_8UC1);
dispatch_apply(PATCH_SIZE, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t y) {
    for (int x=0; x<PATCH_SIZE; ++x) {
        double pp0 = x - halfPatchSize, pp1 = (double)y - halfPatchSize;
        Matx21d multiplication(wmi0 * pp0 + wmi1 * pp1, wmi2 * pp0 + wmi3 * pp1);
        Matx21d px(multiplication(0) + centerInLevel(0), multiplication(1) + centerInLevel(1));
        double warpedPixel = [self interpolatePointInImage:referenceImage withU:px(0) withV:px(1)];
        warped.at<uchar>(y,x) = (uint8_t)warpedPixel;
    }
});
// Inside interpolatePointInImage:withU:withV:
int x = (int)u;
int y = (int)v;
float subpixX = u - x,
subpixY = v - y,
oneMinusSubpixX = 1.0 - subpixX,
oneMinusSubpixY = 1.0 - subpixY;
float w00 = oneMinusSubpixX * oneMinusSubpixY,
w01 = oneMinusSubpixX * subpixY,
w10 = subpixX * oneMinusSubpixY,
w11 = 1.0f - w00 - w01 - w10;
const int stride = (int)image.step.p[0];
uchar* ptr = image.data + y * stride + x;
return w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1];
You typically don't translate a single line of code into assembly. For it to be worth writing in assembly, you have to first assume that you can generate better assembly than the compiler will. Sometimes that's true for vectorized code on NEON, but it's usually because you have special knowledge about a complex loop. You're unlikely to beat the compiler significantly on a single line of code (and will likely lose). Is this line part of a loop that you've profiled and identified as a major bottleneck? Have you already tried Accelerate? Have you analyzed the assembly the compiler is generating and found mistakes that it's making?
Trying to do this in ObjC++ is very inefficient. ObjC++ is a glue language for tying together C++ and ObjC; doing both in the same file imposes several performance costs, especially with ARC. Calling an ObjC method inside of a performance-critical inner-loop is very expensive in any case (even if there weren't mixed-in C++). You should never do any kind of function call (least of all an ObjC method dispatch) inside of a tight inner-loop. It's not clear where you're actually calling computeReferencePatchScores. The use of GCD here is probably hurting you more than helping (since it prevents the compiler from applying certain vector optimizations).
This is all to say: how a particular line of code is being compiled into assembly is by far the least of your problems in this code. Its structure is fighting clang's optimizer.
Step one is to step back and ask what computation you want to execute, and then read through the Core Image Programming Guide and the vImage Programming Guide and verify that it isn't already available. You might also look over OpenGL ES, but OpenGL is often a whole approach to drawing (so it's a bit more of a commitment). It looks like you're already using OpenCV, so make sure it doesn't have available functions to do what you want. (Most of what I see in there looks like stuff built into both OpenCV and vImage.)
The simplest way to improve performance without moving to more powerful frameworks is to move the entire loop into a single C++ function. Then the optimizer can see all the code and apply vector operations on its own. But the next step is to make use of the high-level high-performance frameworks already available.
In any case, you'll want to sit down and carefully work through exactly the calculations you need to perform (I usually do this by hand on paper). Make sure you're not duplicating anything, that you need every calculation you're performing, and that each change you make still generates the same result.
This looks to be a 2x2 convolution. If the data set is large, then vImageConvolve_PlanarF with a 3x3 kernel with some zero padding in it will do the job. It tries to skip work on kernel elements that are 0. You would need to convert the data set to single precision.
If the data set is small, then you are probably stuck with scalar code performance. Inline the function if you can. Perhaps you can figure out how to aggregate a bunch of these together to take advantage of a heavier duty high performance routine.
However, if the weights change from pixel to pixel, then a convolution isn't going to work. You may look instead at the N-dimensional lookup table feature in vImage/Transform.h, if your data set is not huge.
I am a bit skeptical that the time is really spent just in that line. It is best to look at the assembly view in Instruments to see where the samples really land.

Real to Complex FFT with CUFFT, using OpenCV as Data source

I'm having an issue trying to perform a two dimensional transform on an array of floats using cuFFT. I've had a look at the documentation, but some of the information is contradictory/not clear; so I have a few questions:
My data is 480 rows, with 640 columns (e.g. float data[480][640] but in a single dimension so float data[480*640])
If we say my input dimensions (of real data) are N1 = 480 and N2 = 640. Are the dimensions (after a real to complex transform) N1=480, N2=321?
Can I cudaMemcpy the data directly into a cufftReal array of the same size? Or must it be a cufftComplex array?
If it must be a cufftComplex array, I am assuming the elements need to be in the place of the real components?
What is the correct structure of a call to cufftPlan2d, cufftExecR2C and cufftExecC2R given the above values?
I think that's all for now...
Many thanks in advance
EDIT: So, I've implemented the Forward and Inverse transforms as suggested by JackOLantern. However my results are not what I am expecting (an identical Result after FFT as Before it). I have an image gallery here showing two sets of examples. The first is from my room, the second from my University Project.
In the cuFFT Documentation, there is ambiguity in the use of cufftPlan2d (hence why I asked). In the documentation, for a two dimensional array, the data should be input as above (float data[480][640] == float data[NY][NX]) So NY represents the rows. However in the function listing for cufftPlan2d, it states that nx (the parameter) is for the rows...
Swapping the values of NX and NY in the function call gives the result as in the project image (correct orientation, but split into three partially overlapping images at 1/4 the normal size) however, using the parameters as JackOLantern states in his answer gives a slanted/skewed result.
Am I doing something wrong here? Or does the cuFFT library have issues with this type of thing.
ALSO: I have undone a couple of the edits made by JackOLantern to this question as my issues MAY stem from the fact my data is coming from OpenCV.
EDIT: I've recently found out that I was the one who made a mistake in the way I used the function.
Originally I thought the function definition referred to the size of the data being passed into it.
However, it appears that the parameters actually refer directly to the size of the REAL part.
This means that the parameters refer to:
The size of the input data when using R2C (Real to Complex)
The size of the output data when using C2R (Complex to Real)
So it appears that the cuFFT documentation and the library itself do not correspond.
When performing an R2C followed by a C2R (real to complex, complex to real respectively), the documentation states that for a Real input of NX x NY dimensions, the Complex output is NX x (floor(NY/2) +1); and vice versa.
However the actual output is of dimensions NX x NY and the actual input is of dimensions NX x NY. This is (half) mentioned on the very first page as
C2R - Symmetric complex input to real output
Implying that the complex data must be Symmetric, i.e. must also have the redundant data in addition to the non-redundant data.
There are a number of other contradictions within the documentation as well which I won't go into.
Needless to say, the problem has been solved.
I have included a MWE below. Near the top are a couple of lines with #define NUM_C2 and appropriate comments. Changing this changes whether the documentation format is followed, or my "fix".
The output is
The Input Real data
The Intermediate Complex data
The output Real data
The ratio of the output data to the input data (there are minor FFT errors, ~1 indicates correct)
Feel free to change the parameters (NUM_R and NUM_C) and feel free to comment if you think I have made a mistake somewhere.
#include <iostream>
#include <math.h>
#include <cufft.h>
// e.g. float data[NUM_R][NUM_C]
#define NUM_R 12
#define NUM_C 16
// Documentation Version
//#define NUM_C2 (1+NUM_C/2)
// "Correct" Version
#define NUM_C2 NUM_C
using namespace std;
int main(int argc, char** argv)
{
    cufftReal *in_h, *out_h, *in_d, *out_d;
    cufftComplex *mid_d, *mid_h;
    cufftHandle pF, pI;
    int r, c;
    in_h = (cufftReal*) malloc(NUM_R * NUM_C * sizeof(cufftReal));
    out_h= (cufftReal*) malloc(NUM_R * NUM_C * sizeof(cufftReal));
    mid_h= (cufftComplex*)malloc(NUM_C2*NUM_R*sizeof(cufftComplex));
    cudaMalloc((void**) &in_d, NUM_R * NUM_C * sizeof(cufftReal));
    cudaMalloc((void**)&out_d, NUM_R * NUM_C * sizeof(cufftReal));
    cudaMalloc((void**)&mid_d, NUM_C2 * NUM_R * sizeof(cufftComplex));
    cufftPlan2d(&pF, NUM_R, NUM_C, CUFFT_R2C);
    cufftPlan2d(&pI, NUM_R,NUM_C2, CUFFT_C2R);
    for(r=0; r<NUM_R; r++)
    {
        for(c=0; c<NUM_C; c++)
        {
            in_h[c + NUM_C * r] = cos(2.0*M_PI*(c*7.0/NUM_C+r*3.0/NUM_R));
            out_h[c+ NUM_C * r] = 0.f;
            if(c<(NUM_C-1)) cout<<", ";
            else cout<<endl;
        }
    }
    cudaMemcpy((cufftReal*)in_d, (cufftReal*)in_h, NUM_R * NUM_C * sizeof(cufftReal),cudaMemcpyHostToDevice);
    cufftExecR2C(pF, (cufftReal*)in_d, (cufftComplex*)mid_d);
    cudaMemcpy((cufftComplex*)mid_h, (cufftComplex*)mid_d, NUM_C2*NUM_R*sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    for(r=0; r<NUM_R; r++)
    {
        for(c=0; c<NUM_C2; c++)
        {
            if(c<(NUM_C2-1)) cout<<", ";
            else cout<<endl;
        }
    }
    cufftExecC2R(pI, (cufftComplex*)mid_d, (cufftReal*)out_d);
    cudaMemcpy((cufftReal*)out_h, (cufftReal*)out_d, NUM_R*NUM_C*sizeof(cufftReal), cudaMemcpyDeviceToHost);
    for(r=0; r<NUM_R; r++)
    {
        for(c=0; c<NUM_C; c++)
        {
            if(c<(NUM_C-1)) cout<<", ";
            else cout<<endl;
        }
    }
    for(r=0; r<NUM_R; r++)
    {
        for(c=0; c<NUM_C; c++)
        {
            if(c<(NUM_C-1)) cout<<", ";
            else cout<<endl;
        }
    }
    return 0;
}
1) If we say my input dimensions (of real data) are N1 = 480 and N2 = 640. Are the dimensions (after a real to complex transform) N1=480, N2=321?
The output of cufftExecR2C is a NX*(NY/2+1) cufftComplex matrix. So in your case, you will have a 480x321 float2 matrix as output.
2) Can I cudaMemcpy the data directly into a cufftReal array of the same size? Or must it be a cufftComplex array?
If it must be a cufftComplex array, I am assuming the elements need to be in the place of the real components?
Yes, you can copy the N1xN2 data directly into a cufftReal array of the same size; it does not need to be a cufftComplex array.
3) What is the correct structure of a call to cufftPlan2d, cufftExecR2C and cufftExecC2R given the above values?
cufftPlan2d(&plan, N1, N2, CUFFT_R2C);
cufftExecR2C(plan, (cufftReal*)idata, (cufftComplex*) odata);
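The buffer sizes implied by this answer can be worked out in a few lines of plain C++. This is only a sketch of the arithmetic: the R2CLayout struct and its helper functions are hypothetical names, nothing here calls cuFFT, and the comment about plan dimensions reflects the asker's later finding that the plan parameters describe the size of the real data.

```cpp
#include <cstddef>

// Sizes for a 2D real-to-complex transform of a rows x cols real array
// stored row-major (cols is the fastest-varying dimension).
struct R2CLayout {
    int rows;         // N1
    int cols;         // N2, real samples per row
    int complexCols;  // N2/2 + 1 non-redundant complex values per row
};

inline R2CLayout makeR2CLayout(int rows, int cols) {
    return R2CLayout{rows, cols, cols / 2 + 1};
}

// Number of real elements in the input (and in the C2R output).
inline std::size_t realCount(const R2CLayout& l) {
    return static_cast<std::size_t>(l.rows) * l.cols;
}

// Number of complex elements in the R2C output (and in the C2R input).
inline std::size_t complexCount(const R2CLayout& l) {
    return static_cast<std::size_t>(l.rows) * l.complexCols;
}

// Byte size of the complex buffer, assuming one complex value is two floats.
inline std::size_t complexBytes(const R2CLayout& l) {
    return complexCount(l) * 2 * sizeof(float);
}

// Linear index of complex element (r, c) in the half-spectrum buffer.
inline std::size_t complexIndex(const R2CLayout& l, int r, int c) {
    return static_cast<std::size_t>(r) * l.complexCols + c;
}

// Intended use (not compiled here): the plan takes the real size in both
// directions, e.g. cufftPlan2d(&plan, l.rows, l.cols, CUFFT_R2C), while the
// complex buffer is allocated with complexBytes(l).
```

For the 480 x 640 case, makeR2CLayout(480, 640) gives 321 complex columns, 307200 real elements and 154080 complex elements.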

cvPerspectiveTransform: What am I supposed to provide?

I'm trying to use cvPerspectiveTransform to transform four 2D points. I got the transformation matrix (3x3) already through cvFindHomography. I can't figure out what kind of structure to provide to not run into some error.
Would anybody be so kind to show me how to do it with these points?
I'm using OpenCV 2.4.0 on Win.
This is one way to initialize your matrices correctly. It's probably not the most elegant, but it works:
CvMat* input = cvCreateMat(1, 4, CV_32FC2);
CvMat* output = cvCreateMat(1, 4, CV_32FC2);
float data[8] = {0,0,0,640,480,0,640,480};
for (int i =0; i < 8; i++)
input->data.fl[i] = data[i];
cvPerspectiveTransform(input, output, matrix_from_cvFindHomography);
The C++ API offers a more intuitive implementation. Many OpenCV functions, like perspectiveTransform, accept vectors of points as inputs, which can be initialized in this manner:
std::vector<cv::Point2f> inputs;
std::vector<cv::Point2f> outputs;
cv::perspectiveTransform(inputs, outputs, matrix_from_findHomography);
Assuming you have a 3x3 cv::Mat of floats, you can convert it to a cv::Matx33f and apply it to each point yourself (if you want doubles, change all the f's to d's):
cv::Matx33f transform(your_cv_Mat);
cv::Matx31f pt1(0,0,1);
cv::Matx31f pt2(640,0,1);
pt1 = transform*pt1;
pt2 = transform*pt2;
Make sure you normalize by the third coordinate; read up on homogeneous coordinates if that does not make sense:
pt1 *= 1/pt1(2);
pt2 *= 1/pt2(2);
cv::Point2f final_pt1(pt1(0),pt1(1));
cv::Point2f final_pt2(pt2(0),pt2(1));
You do not need to do this with Matx, it will work with cv::Mat just as well. Personally I like Matx for working with transforms because its size and type is easier to keep track of and its contents can be more easily viewed in the debugger.

How to calculate the Absolute value of complex numbers in opencv

Can anyone help me get the absolute value of a complex matrix? The matrix contains the real values in one channel and the imaginary values in another channel.
If possible, please give me an example.
Thanks in advance
Let's assume you have 2 components: X and Y, two matrices of the same size and type. In your case they can hold the real and imaginary values.
// n rows, m cols, type float; we assume the following matrices are filled
cv::Mat X(n,m,CV_32F);
cv::Mat Y(n,m,CV_32F);
You can compute the absolute value of each complex number like this:
// create a new matrix for storage
cv::Mat A(n,m,CV_32F,cv::Scalar(0.0));
for(int i=0;i<n;i++){
    // pointer to row(i) values
    const float* rowi_x = X.ptr<float>(i);
    const float* rowi_y = Y.ptr<float>(i);
    float* rowi_a = A.ptr<float>(i);
    // j must stay below m; j<=m would read and write one element past the row
    for(int j=0;j<m;j++){
        rowi_a[j] = sqrt(rowi_x[j]*rowi_x[j]+rowi_y[j]*rowi_y[j]);
    }
}
If you look in the OpenCV phasecorr.cpp module, there's a function called magSpectrums that does this already and will handle conjugate symmetry-packed DFT results too. I don't think it's exposed by the header file, but it's easy enough to copy it. If you care about speed, make sure you compile with any available SIMD options turned on too because they can make a big difference with this calculation.
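The per-element computation is simple enough to show without OpenCV. Below is a small sketch over two plain float arrays holding the real and imaginary parts; complexMagnitude is a hypothetical name. std::hypot is used instead of sqrt(x*x + y*y) because it avoids intermediate overflow. If you stay inside OpenCV, cv::magnitude(x, y, magnitude) performs the same operation on two single-channel matrices.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Element-wise |re + i*im| for two planar arrays.
// If the sizes differ, only the common prefix is processed.
std::vector<float> complexMagnitude(const std::vector<float>& re,
                                    const std::vector<float>& im) {
    const std::size_t n = std::min(re.size(), im.size());
    std::vector<float> mag(n);
    for (std::size_t i = 0; i < n; ++i)
        mag[i] = std::hypot(re[i], im[i]);
    return mag;
}
```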

how to get r,g,b value using opencv2.3 [duplicate]

Possible Duplicate:
OpenCV rgb value for cv::Point in cv::Mat
As you know, in matlab it's easy to get r/g/b values using r = image(:,:,1).
But in OpenCV (before 2.2) we must use pointers like this:
IplImage* img=cvCreateImage(cvSize(640,480),IPL_DEPTH_32F,3);
((float *)(img->imageData + i*img->widthStep))[j*img->nChannels + 0]=111; // B
((float *)(img->imageData + i*img->widthStep))[j*img->nChannels + 1]=112; // G
((float *)(img->imageData + i*img->widthStep))[j*img->nChannels + 2]=113; // R
But since OpenCV 2.3 came out, it's easy to get the pixel value of a single-channel image like this:
Mat image;
int pixel = image.at<uchar>(row,col);
So I just wonder: is there also an easy way to get the r,g,b pixel values of a multichannel image, just like in Matlab? Any help will be appreciated =)
For the C++ interface you can do:
Vec3f pixel = image.at<Vec3f>(row, col); // for a CV_32FC3 image; use Vec3b for an 8-bit CV_8UC3 image
float b = pixel[0];
float g = pixel[1];
float r = pixel[2];
As vasile said, getting a cell as a Vec3 gives you the pixel with easy access to its colour components. This is the simplest solution in OpenCV, since the data structure stores the pixels interleaved, in the format "BGRBGRBGRBGRBGR...", while Matlab stores them as separate planes, "RRRRRRRGGGGGGGBBBBBBB...".
To get a specific channel like in Matlab you can use cvSplit (or cv::split in the C++ API). This function splits the image into its 3-4 separate channels so you can access a channel like in Matlab. In the provided links you can also find a reference for the opposite function, merge.
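To make the layout difference concrete, here is a sketch of what splitting does to an interleaved 8-bit buffer, written in plain C++ without OpenCV. The Planes struct and splitBGR are hypothetical names, the buffer is assumed to be tightly packed (no row padding), and the channel order is assumed to be B, G, R as in the pointer example from the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Three separate channel planes, one value per pixel in each.
struct Planes {
    std::vector<uint8_t> b, g, r;
};

// De-interleave a tightly packed B,G,R,B,G,R,... buffer into planes.
// Any trailing bytes that do not form a full pixel are ignored.
Planes splitBGR(const std::vector<uint8_t>& interleaved) {
    const std::size_t pixels = interleaved.size() / 3;
    Planes p;
    p.b.resize(pixels);
    p.g.resize(pixels);
    p.r.resize(pixels);
    for (std::size_t i = 0; i < pixels; ++i) {
        p.b[i] = interleaved[3 * i + 0];
        p.g[i] = interleaved[3 * i + 1];
        p.r[i] = interleaved[3 * i + 2];
    }
    return p;
}
```

After the split, p.r plays the role of Matlab's image(:,:,1), stored row by row.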
