I need help with cuFFT: my results are wrong and I have no idea why.
Here is my code:
#include <cstdio>
#include <cufft.h>

// Prints the first row of c, indexing it as an h x w array.
__global__ void print(cufftDoubleComplex *c, int h, int w){
    for(int i=0; i<1; i++){
        for (int j=0; j<w; j++){
            printf("(%d,%d): %f + %fi\n",i+1, j+1, c[i*w+j].x, c[i*w+j].y);
        }
    }
}

int main(int argc, char *argv[]){
    const int img_w=5;
    const int img_h=5;
    double fx[img_w*img_h], *d_fx;
    cudaMalloc((void**)&d_fx, img_w*img_h*sizeof(double));
    cufftDoubleComplex *otfFx;
    cudaMalloc((void**)&otfFx, img_w*img_h*sizeof(cufftDoubleComplex));
    // Assumption: the loop body is not shown in the post; these values
    // reproduce the Octave matrix A given further down.
    for(int i=0; i<img_w*img_h; i++){
        fx[i] = 0.0;
    }
    fx[0] = 1.0;
    fx[4] = -1.0;
    cudaMemcpy(d_fx, fx, img_w*img_h*sizeof(double), cudaMemcpyHostToDevice);
    cufftHandle plan_fx;
    cufftPlan2d(&plan_fx, img_h, img_w, CUFFT_D2Z);
    cufftExecD2Z(plan_fx, d_fx, otfFx);
    print<<<1,1>>>(otfFx, img_h, img_w);
    cudaDeviceSynchronize(); // wait for the kernel so its printf output is flushed
    return 0;
}
That's what I'm getting in the first line of the result:
0.00000 + 0.00000i 0.69098 - 0.95106i 1.80902 - 0.58779i 0.00000 + 0.00000i 0.69098 - 0.95105i
It should be:
0.00000 + 0.00000i 0.69098 - 0.95106i 1.80902 - 0.58779i 1.80902 + 0.58779i 0.69098 + 0.95106i
Everything after otfFx[14] is garbage; it looks as if the result were 5x3 when it should be 5x5.
This is the Octave code that gives me the "right" results:
A=[1 0 0 0 -1; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0];
You are right: cuFFT produces a 5x3 result. The spectrum of real input is Hermitian-symmetric, so the real-data transforms (D2Z/Z2D/R2C/C2R) store only the non-redundant half, i.e. img_w/2+1 = 3 complex values per row.
cuFFT follows the convention of standard FFT libraries here. Please have a look: http://www.fftw.org/doc/The-1d-Real_002ddata-DFT.html http://docs.nvidia.com/cuda/cufft/index.html#multi-dimensional
If you want to recreate the full spectrum, use the fact that the elements in the missing half are the complex conjugates of elements in the stored half: for an h x w transform, X[i][j] = conj(X[(h-i)%h][w-j]) for j >= w/2+1.
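As a rough host-side illustration of that symmetry, here is a minimal C++ sketch that expands a half spectrum into the full one. It assumes a row-major rows x (cols/2+1) layout like the one cuFFT and FFTW document for real-to-complex transforms; the function name expand_half_spectrum is hypothetical and not part of either library.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Expand a row-major rows x (cols/2 + 1) half spectrum, as produced by a 2D
// real-to-complex transform, into the full rows x cols spectrum.
// Hypothetical helper for illustration; not part of cuFFT or FFTW.
std::vector<std::complex<double>> expand_half_spectrum(
    const std::vector<std::complex<double>>& half,
    std::size_t rows, std::size_t cols)
{
    const std::size_t half_cols = cols / 2 + 1;
    assert(half.size() == rows * half_cols);

    std::vector<std::complex<double>> full(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            if (j < half_cols) {
                // Stored column: copy as-is.
                full[i * cols + j] = half[i * half_cols + j];
            } else {
                // Missing column: X[i][j] = conj(X[(rows-i)%rows][cols-j]).
                const std::size_t si = (rows - i) % rows;
                const std::size_t sj = cols - j;
                full[i * cols + j] = std::conj(half[si * half_cols + sj]);
            }
        }
    }
    return full;
}
```

For the 5x5 case in the question, calling this on the 15 stored values (copied back to the host) would fill columns 4 and 5 of each row with the conjugates of columns 3 and 2.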
I am trying to use Eigen's unsupported FFT module with the FFTW backend. Specifically, I want to do a 2D FFT. Here's my code:
void fft2(Eigen::MatrixXf * matIn,Eigen::MatrixXcf * matOut)
{
    const int nRows = matIn->rows();
    const int nCols = matIn->cols();
    Eigen::FFT< float > fft;
    for (int k = 0; k < nRows; ++k) {
        Eigen::VectorXcf tmpOut(nRows);
        fft.fwd(tmpOut, matIn->row(k));
        matOut->row(k) = tmpOut;
    }
    for (int k = 0; k < nCols; ++k) {
        Eigen::VectorXcf tmpOut(nCols);
        fft.fwd(tmpOut, matOut->col(k));
        matOut->col(k) = tmpOut;
    }
}
I have 2 problems :
First, I get a segmentation fault when using this code on some matrices. The error doesn't happen for all matrices. I guess it's related to an alignment error. I use the function in the following way:
Eigen::MatrixXcf matFFT(mat.rows(),mat.cols());
where mat can be any matrix. Oddly, the code crashes only when I compute the FFT over the second dimension, never over the first one. This doesn't happen with the kissFFT backend.
Second, when the function does work, I don't get the same result as Matlab (which uses FFTW). E.g.:
Input Matrix :
[2, 1, 2]
[3, 2, 1]
[1, 2, 3]
Eigen gives :
[ (0,5), (0.5,0.86603), (0,0.5)]
[ (-4.3301,-2.5), (-1,-1.7321), (0.31699,-1.549)]
[ (-1.5,-0.86603), (2,3.4641), (2,3.4641)]
Matlab gives :
17 + 0i 0.5 + 0.86603i 0.5 - 0.86603i
-1 + 0i -1 - 1.7321i 2 - 3.4641i
-1 + 0i 2 + 3.4641i -1 + 1.7321i
Only the central part is the same.
Any help would be welcome.
In my first solution I failed to activate EIGEN_FFTW_DEFAULT; activating it reveals an error in Eigen's FFTW support. The following works:
#include <iostream>
#include <unsupported/Eigen/FFT>

int main(int argc, char *argv[])
{
    Eigen::MatrixXf A(3,3);
    A << 2,1,2, 3,2,1, 1,2,3;
    const int nRows = A.rows();
    const int nCols = A.cols();
    std::cout << A << "\n\n";
    Eigen::MatrixXcf B(3,3);
    // First pass: real-to-complex FFT of every row (each row has nCols entries).
    Eigen::FFT< float > fft;
    for (int k = 0; k < nRows; ++k) {
        Eigen::VectorXcf tmpOut(nCols);
        fft.fwd(tmpOut, A.row(k));
        B.row(k) = tmpOut;
    }
    std::cout << B << "\n\n";
    // Second pass: complex FFT of every column (each column has nRows entries).
    Eigen::FFT< float > fft2; // Workaround: Using the same FFT object for a real and a complex FFT seems not to work with FFTW
    for (int k = 0; k < nCols; ++k) {
        Eigen::VectorXcf tmpOut(nRows);
        fft2.fwd(tmpOut, B.col(k));
        B.col(k) = tmpOut;
    }
    std::cout << B << '\n';
    return 0;
}
I get this output:
2 1 2
3 2 1
1 2 3
(17,0) (0.5,0.866025) (0.5,-0.866025)
(-1,0) (-1,-1.73205) (2,-3.4641)
(-1,0) (2,3.4641) (-1,1.73205)
Which is the same as your Matlab result.
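To make the row-then-column decomposition used above easier to check by hand, here is a small standard-library-only sketch. It replaces the library FFT with a naive O(n^2) DFT, so it is only meant for tiny inputs; the names cplx, dft1d and dft2d are made up for this illustration and are not part of Eigen or FFTW.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(n^2) forward DFT of one sequence (stand-in for a library FFT call).
std::vector<cplx> dft1d(const std::vector<cplx>& x)
{
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> X(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                -two_pi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += x[t] * cplx(std::cos(angle), std::sin(angle));
        }
        X[k] = sum;
    }
    return X;
}

// 2D DFT of a row-major rows x cols real matrix: transform every row,
// then transform every column of the intermediate result.
std::vector<cplx> dft2d(const std::vector<double>& in,
                        std::size_t rows, std::size_t cols)
{
    assert(in.size() == rows * cols);
    std::vector<cplx> out(in.begin(), in.end());

    for (std::size_t r = 0; r < rows; ++r) {
        std::vector<cplx> row(cols);
        for (std::size_t c = 0; c < cols; ++c) row[c] = out[r * cols + c];
        const std::vector<cplx> t = dft1d(row);
        for (std::size_t c = 0; c < cols; ++c) out[r * cols + c] = t[c];
    }
    for (std::size_t c = 0; c < cols; ++c) {
        std::vector<cplx> col(rows);
        for (std::size_t r = 0; r < rows; ++r) col[r] = out[r * cols + c];
        const std::vector<cplx> t = dft1d(col);
        for (std::size_t r = 0; r < rows; ++r) out[r * cols + c] = t[r];
    }
    return out;
}
```

Calling dft2d on the 3x3 matrix from the question (values 2,1,2, 3,2,1, 1,2,3 in row-major order) gives 17 in the top-left entry, matching the Matlab output up to rounding.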
N.B.: FFTW seems to support 2D real->complex FFT natively (without using individual FFTs). This is likely more efficient.
fftwf_plan fftwf_plan_dft_r2c_2d(int n0, int n1,
float *in, fftwf_complex *out, unsigned flags);
I have a real 2D matrix and I am taking its FFT using FFTW. However, the result of a real-to-complex FFT differs from that of a complex-to-complex FFT on the same data (with the imaginary part set to zero).
real matrix
0 1 2
3 4 5
6 7 8
result of real to complex fft
36 -4.5+2.59808i -13.5+7.79423i
0 -13.5-7.79423i 0
0 0 0
int r = 3, c = 3;
int sz = r * c;
double *in = (double*) malloc(sizeof(double) * sz);
fftw_complex *out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * sz);
// Create the plan before filling the input: planning with FFTW_MEASURE overwrites the arrays.
fftw_plan p = fftw_plan_dft_r2c_2d(r, c, in, out, FFTW_MEASURE);
for ( int i=0; i<r; ++i ){
    for ( int j=0; j<c; ++j ){
        in[i*c+j] = i*c + j;
    }
}
fftw_execute(p);
using a complex matrix with imaginary part of zero
complex matrix
0+0i 1+0i 2+0i
3+0i 4+0i 5+0i
6+0i 7+0i 8+0i
result of complex to complex fft
36 -4.5 + 2.59808i -4.5 - 2.59808i
-13.5 + 7.79423i 0 0
-13.5 - 7.79423i 0 0
int r = 3, c = 3;
int sz = r * c;
fftw_complex *out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * sz);
fftw_complex *inc = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * sz);
fftw_plan p = fftw_plan_dft_2d( r,c, inc, out, FFTW_FORWARD,FFTW_MEASURE);
for ( int i=0; i<r; ++i ){
    for ( int j=0; j<c; ++j ){
        inc[i*c+j][0] = i*c+j;
        inc[i*c+j][1] = 0;
    }
}
fftw_execute(p);
I am after the result of the complex-to-complex FFT, but the real-to-complex FFT is much faster and my data is real. Am I making a programming mistake, or should the results be different?
As indicated in the FFTW documentation:
Then, after an r2c transform, the output is an n0 × n1 × n2 × … × (n_{d-1}/2 + 1) array of fftw_complex values in row-major order
In other words, the output for your real-to-complex transform of your sample real matrix really is:
36 -4.5+2.59808i
-13.5+7.79423i 0
-13.5-7.79423i 0
You may notice that these two columns match exactly the first two columns of your complex-to-complex transform. The missing column is omitted from the real-to-complex transform since it is redundant due to symmetry. As such, the full 3x3 matrix including the missing column could be constructed using:
fftw_complex *outfull = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * sz);
int outc = (c/2+1);
for ( int i=0; i<r; ++i ){
    // copy existing columns
    for ( int j=0; j<outc; ++j ){
        outfull[i*c+j][0] = out[i*outc+j][0];
        outfull[i*c+j][1] = out[i*outc+j][1];
    }
    // generate missing column(s) from symmetry
    for ( int j=outc; j<c; ++j){
        int row = (r-i)%r;
        int col = c-j;
        outfull[i*c+j][0] = out[row*outc+col][0];
        outfull[i*c+j][1] = -out[row*outc+col][1];
    }
}
I'm in the rather poor situation of not being able to use the CUDA debugger. I'm getting some strange results from usage of __syncthreads in an application with a single shared array (deltas). The following piece of code is performed in a loop:
__syncthreads(); //if I comment this out, things get funny
deltas[lex_index_block] = intensity - mean;
__syncthreads(); //this line doesnt seem to matter regardless if the first sync is commented out or not
//after sync: do something with the values of delta written in this threads and other threads of this block
Basically, I have code with overlapping blocks (required due to the nature of the algorithm). The program compiles and runs, but I get systematically wrong values in the areas of vertical overlap. This confuses me, as I thought the correct way to sync was to sync after the threads have performed their writes to shared memory.
This is the whole function:
//XC without repetitions
template <int blocksize, int order>
__global__ void __xc(unsigned short* raw_input_data, int num_frames, int width, int height,
float * raw_sofi_data, int block_size, int order_deprecated){
//we make a distinction between real pixels and virtual pixels
//real pixels are pixels that exist in the original data
//overlap correction: every new block has a margin of 3 threads doing less work (only computing deltas)
int x_corrected = global_x() - blockIdx.x * 3;
int y_corrected = global_y() - blockIdx.y * 3;
//if the thread is responsible for any real pixel
if (x_corrected < width && y_corrected < height){
// __shared__ float deltas[blocksize];
__shared__ float deltas[blocksize];
//the outer pixels of a block do not update SOFI values as they do not have sufficient information available
//they are used only to compute mean and delta
//also, pixels at the global edge have to be thrown away (as there is not sufficient data to interpolate)
bool within_inner_block =
threadIdx.x > 0
&& threadIdx.y > 0
&& threadIdx.x < blockDim.x - 2
&& threadIdx.y < blockDim.y - 2
//global edge
&& x_corrected > 0
&& y_corrected > 0
&& x_corrected < width - 1
&& y_corrected < height - 1
//init virtual pixels
float virtual_pixels[order * order];
if (within_inner_block){
for (int i = 0; i < order * order; ++i) {
virtual_pixels[i] = 0;
float mean = 0;
float intensity;
int lex_index_block = threadIdx.x + threadIdx.y * blockDim.x;
//main loop
for (int frame_idx = 0; frame_idx < num_frames; ++frame_idx) {
//shared memory read and computation of mean/delta
intensity = raw_input_data[lex_index_3D(x_corrected,y_corrected, frame_idx, width, height)];
__syncthreads(); //if I comment this out, things break
deltas[lex_index_block] = intensity - mean;
__syncthreads(); //this doesnt seem to matter
mean = deltas[lex_index_block]/(float)(frame_idx+1);
//if the thread is responsible for correlated pixels, i.e. not at the border of the original frame
if (within_inner_block){
virtual_pixels[0] += deltas[lex_index_2D(
threadIdx.y + 1,
threadIdx.y - 1,
virtual_pixels[1] += deltas[lex_index_2D(
threadIdx.x + 1,
virtual_pixels[2] += deltas[lex_index_2D(
threadIdx.y + 1,
virtual_pixels[3] += deltas[lex_index_2D(
// xc_update<order>(virtual_pixels, delta2, mean);
if (within_inner_block){
for (int virtual_idx = 0; virtual_idx < order*order; ++virtual_idx) {
raw_sofi_data[lex_index_2D(x_corrected*order + virtual_idx % order,
y_corrected*order + (int)floorf(virtual_idx / order),
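From what I can see, there could be a hazard in your application between loop iterations. The write to deltas[lex_index_block] in loop iteration frame_idx+1 could map to the same location as the read of deltas[lex_index_2D(threadIdx.x, threadIdx.y -1, blockDim.x)] by a different thread in iteration frame_idx. Without a barrier between them, the two accesses are unordered and the result is nondeterministic. That is why the first __syncthreads() matters: it stops any thread from overwriting deltas until every thread in the block has finished reading the previous frame's values. The second __syncthreads() is still needed so that all writes are complete before neighbouring threads read them, even if removing it happens not to change the output. Try running the app with cuda-memcheck --tool racecheck.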
The code below calculates the dot product of two vectors a and b. The correct result is 8192. When I run it for the first time the result is correct. Then when I run it for the second time the result is the previous result + 8192 and so on:
1st iteration: result = 8192
2nd iteration: result = 8192 + 8192
3rd iteration: result = 8192 + 8192 + 8192
and so on.
I checked by printing it on screen, and the device variable dev_c is not freed. What's more, writing to it produces something like a sum: the result is the previous value plus the new one being written. I guess it could have something to do with the atomicAdd() operation, but cudaFree(dev_c) should erase it anyway.
#define N 8192
// Assumption: this definition is not shown in the post; 512 is an illustrative value that divides N.
#define THREADS_PER_BLOCK 512
#include <stdio.h>
#include <stdlib.h>

__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads(); // all products must be written before thread 0 sums them
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i= 0; i< THREADS_PER_BLOCK; i++ ){
            sum += temp[i];
        }
        atomicAdd( c, sum ); // adds this block's partial sum to whatever *c holds
    }
}

int main( void ) {
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;
    int size = N * sizeof( int);
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, sizeof(int));
    a = (int*)malloc(size);
    b = (int*)malloc(size);
    c = (int*)malloc(sizeof(int));
    for(int i = 0 ; i < N ; i++){
        a[i] = 1;
        b[i] = 1;
    }
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice);
    dot<<< N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>( dev_a, dev_b, dev_c);
    cudaMemcpy( c, dev_c, sizeof(int) , cudaMemcpyDeviceToHost);
    printf("Dot product = %d\n", *c);
    free(a); free(b); free(c);
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
cudaFree doesn't erase anything, it simply returns memory to a pool to be re-allocated. cudaMalloc doesn't guarantee the value of memory that has been allocated. You need to initialize memory (both global and shared) that your program uses, in order to have consistent results. The same is true for malloc and free, by the way.
From the documentation of cudaMalloc():
The memory is not cleared.
That means that dev_c is not initialized, and your atomicAdd(c,sum); will add to any random value that happens to be stored in memory at the returned position.
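The effect can be mimicked on the host. The sketch below is plain C++, not CUDA: the hypothetical function accumulate_dot models each block adding its partial sum into *c, the way atomicAdd(c, sum) does, and shows that the outcome depends entirely on what *c held beforehand. On the device, one way to give dev_c a known starting value would be cudaMemset(dev_c, 0, sizeof(int)) before the kernel launch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the kernel's accumulation (hypothetical helper):
// each "block" sums its slice of a[i] * b[i] and adds that partial sum
// into *c. Nothing here resets *c, just as atomicAdd does not.
void accumulate_dot(const std::vector<int>& a, const std::vector<int>& b,
                    std::size_t threads_per_block, int* c)
{
    assert(a.size() == b.size());
    assert(threads_per_block > 0);
    for (std::size_t start = 0; start < a.size(); start += threads_per_block) {
        const std::size_t end = std::min(start + threads_per_block, a.size());
        int sum = 0;
        for (std::size_t i = start; i < end; ++i) {
            sum += a[i] * b[i];
        }
        *c += sum; // accumulates onto the existing value of *c
    }
}
```

With two vectors of 8192 ones and *c set to 0 first, the result is 8192. Calling it again without resetting *c yields 16384, which is the pattern described in the question.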
I am running the following code using shared memory:
__global__ void computeAddShared(int *in , int *out, int sizeInput){
    //not made parameters gidata and godata to emphasize that parameters get copy of address and are different from pointers in host code
    extern __shared__ float temp[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int ltid = threadIdx.x;
    temp[ltid] = 0;
    while(tid < sizeInput){
        temp[ltid] += in[tid];
        tid+=gridDim.x * blockDim.x; // to handle array of any size
    }
    __syncthreads(); // all partial sums must be in shared memory before reducing
    int offset = 1;
    while(offset < blockDim.x){
        if(ltid % (offset * 2) == 0){
            temp[ltid] = temp[ltid] + temp[ltid + offset];
        }
        __syncthreads();
        offset*=2;
    }
    if(ltid == 0){
        out[blockIdx.x] = temp[0];
    }
}
int main(){
    int size = 16; // size of present input array. Changes after every loop iteration
    int cidata[] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    /*FILE *f;
    f = fopen("invertedList.txt" , "w");
    a[0] = 1 + (rand() % 8);
    fprintf(f, "%d,",a[0]);
    for( int i = 1 ; i< N; i++){
        a[i] = a[i-1] + (rand() % 8) + 1;
        fprintf(f, "%d,",a[i]);
    }*/
    int* gidata;
    int* godata;
    cudaMalloc((void**)&gidata, size* sizeof(int));
    cudaMemcpy(gidata,cidata, size * sizeof(int), cudaMemcpyHostToDevice);
    int TPB = 4;
    int blocks = 10; //to get things kicked off
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    while(blocks != 1 ){
        if(size < TPB){
            TPB = size; // size is 2^sth
        }
        blocks = (size+ TPB -1 ) / TPB;
        cudaMalloc((void**)&godata, blocks * sizeof(int));
        // the third launch parameter is the dynamic shared memory size in bytes
        computeAddShared<<<blocks, TPB,TPB>>>(gidata, godata,size);
        gidata = godata;
        size = blocks;
    }
    //printf("The error by cuda is %s",cudaGetErrorString(cudaGetLastError()));
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop); // wait until the stop event has actually been recorded
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime , start, stop);
    printf("time is %f ms", elapsedTime);
    int *output = (int*)malloc(sizeof(int));
    cudaMemcpy(output, gidata, sizeof(int), cudaMemcpyDeviceToHost);
    //Cant free either earlier as both point to same location
    cudaError_t chk = cudaFree(godata);
    if(chk != cudaSuccess){
        printf("First chk also printed error. Maybe error in my logic\n");
    }
    printf("The error by threadsyn is %s", cudaGetErrorString(cudaGetLastError()));
    printf("The sum of the array is %d\n", output[0]);
    return 0;
}
Clearly, the first while loop in computeAddShared causes an out-of-bounds access, because I am allocating only 4 bytes of shared memory. Why does cuda-memcheck not catch this? Below is the output of cuda-memcheck:
time is 12.334816 msThe error by threadsyn is no errorThe sum of the array is 13
========= ERROR SUMMARY: 0 errors
This is down to shared memory allocation granularity. The hardware most likely has a page size for allocations (probably the same as the L1 cache line size). With only 4 threads per block, there will "accidentally" be enough shared memory in a single page to let your code work. If you used a sensible number of threads per block (i.e. a round multiple of the warp size), the error would be detected because there would not be enough allocated memory.
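For reference, the size mismatch itself is simple arithmetic. The sketch below is host-side C++ with hypothetical helper names; it computes how many bytes of dynamic shared memory a kernel needs when it stores one element per thread, and how far a given request falls short.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of dynamic shared memory needed for one element per thread.
// Hypothetical helper for illustration only.
std::size_t required_shared_bytes(std::size_t threads_per_block,
                                  std::size_t element_size)
{
    assert(element_size > 0);
    return threads_per_block * element_size;
}

// Number of bytes by which a request falls short (0 if it is large enough).
std::size_t shared_bytes_shortfall(std::size_t requested_bytes,
                                   std::size_t threads_per_block,
                                   std::size_t element_size)
{
    const std::size_t needed =
        required_shared_bytes(threads_per_block, element_size);
    return needed > requested_bytes ? needed - requested_bytes : 0;
}
```

For the launch in the question, 4 threads each storing a float need 4 * sizeof(float) bytes, while only 4 bytes are requested. Passing TPB * sizeof(float) as the third launch parameter would request the right amount.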