Compute Kernel Metal - How to retrieve results and debug? - ios

I've downloaded Apple's TrueDepth streamer example and am trying to add a compute pipeline. I think I'm retrieving the results of the computation, but I'm not sure, as they all seem to be zero.
I'm a beginner at iOS development, so there may be quite a few mistakes. Please bear with me!
The pipeline setup (I wasn't quite sure how to create the results buffer, since the kernel outputs a float3):
int resultsCount = CVPixelBufferGetWidth(depthFrame) * CVPixelBufferGetHeight(depthFrame);
// because I will output 3 floats for each value in depthFrame
id<MTLBuffer> resultsBuffer = [self.device newBufferWithLength:(sizeof(float) * 3 * resultsCount) options:MTLResourceOptionCPUCacheModeDefault];
_threadgroupSize = MTLSizeMake(16, 16, 1);
// Calculate the number of rows and columns of threadgroups given the width of the input image
// Ensure that you cover the entire image (or more) so you process every pixel
_threadgroupCount.width = (inTexture.width + _threadgroupSize.width - 1) / _threadgroupSize.width;
_threadgroupCount.height = (inTexture.height + _threadgroupSize.height - 1) / _threadgroupSize.height;
// Since we're only dealing with a 2D data set, set depth to 1
_threadgroupCount.depth = 1;
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
[computeEncoder setComputePipelineState:_computePipelineState];
[computeEncoder setTexture: inTexture atIndex:0];
[computeEncoder setBuffer:resultsBuffer offset:0 atIndex:1];
[computeEncoder setBytes:&intrinsics length:sizeof(intrinsics) atIndex:0];
[computeEncoder dispatchThreadgroups:_threadgroupCount threadsPerThreadgroup:_threadgroupSize];
[computeEncoder endEncoding];
// Finalize rendering here & push the command buffer to the GPU
[commandBuffer commit];
//for testing
[commandBuffer waitUntilCompleted];
I have added the following compute kernel:
kernel void
calc(texture2d<float, access::read> inTexture [[texture(0)]],
device float3 *resultsBuffer [[buffer(1)]],
constant float3x3& cameraIntrinsics [[ buffer(0) ]],
uint2 gid [[thread_position_in_grid]])
float val = * 1000.0f;
float xrw = (gid.x - cameraIntrinsics[2][0]) * val / cameraIntrinsics[0][0];
float yrw = (gid.y - cameraIntrinsics[2][1]) * val / cameraIntrinsics[1][1];
int vertex_id = ((gid.y * inTexture.get_width()) + gid.x);
resultsBuffer[vertex_id] = float3(xrw, yrw, val);
Code for seeing buffer result: (I tried two different ways and both are outputting all zeroes at the moment)
void *output = [resultsBuffer contents];
for (int i = 0; i < 10; ++i) {
NSLog(@"value is %f", *(float *)(output) ); //= *(float *)(output + 4 * i);
NSData *data = [NSData dataWithBytesNoCopy:resultsBuffer.contents length:(sizeof(float) * 3 * resultsCount)freeWhenDone:NO];
float *finalArray = new float [resultsCount * 3];
[data getBytes:&finalArray[0] length:sizeof(finalArray)];
for (int i = 0; i < 10; ++i) {
NSLog(@"here is output %f", finalArray[i]);

I see a couple of problems here, but neither of them is related to your Metal code per se.
In your first output loop, as written, you're just printing the first element of the results buffer 10 times. The first element may legitimately be 0, leading you to believe all of the results are zero. But when I changed the first log line to
NSLog(@"value is %f", ((float *)output)[i]);
I saw different values printed when running your kernel on a test image.
The other issue is related to your getBytes:length: call. You want to pass the number of bytes to copy, but sizeof(finalArray) is actually the size of the finalArray pointer itself (4 or 8 bytes, depending on the platform), not the total size of the buffer it points to. This is an extremely common error in C and C++ code.
Instead, you can use the same byte count as the one you used when allocating space:
[data getBytes:&finalArray[0] length:(sizeof(float) * 3 * resultsCount)];
You should then find that you get the same (non-zero) values printed as in the previous step.
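As a rough illustration of the pointer-size pitfall (sketched in Python with ctypes rather than Objective-C, and with a made-up results_count of 10 instead of the real frame size), the size of a pointer says nothing about the size of the data it points to, and indexing with i is what walks through the elements:

```python
import ctypes

# Hypothetical, tiny stand-in for the results buffer: 3 floats per result.
results_count = 10
FloatArray = ctypes.c_float * (3 * results_count)
final_array = FloatArray(*[float(i) for i in range(3 * results_count)])

# A float* view of the same memory, like `float *finalArray` in the question.
ptr = ctypes.cast(final_array, ctypes.POINTER(ctypes.c_float))

# Size of the pointer itself: 4 or 8 bytes depending on the platform.
pointer_bytes = ctypes.sizeof(ptr)

# Size of the data: element size times element count, as used when allocating.
buffer_bytes = ctypes.sizeof(ctypes.c_float) * 3 * results_count

# Indexing with i reads successive elements instead of the first one repeatedly.
first_ten = [ptr[i] for i in range(10)]
```

Here buffer_bytes is the value that belongs in a copy call; pointer_bytes would copy only the first element or two.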


How to get more precise output out of an FFT?

I am trying to make a colored waveform using the output of the following code. But when I run it, I only get certain numbers as output frequencies (see the freq variable, which is computed from the bin size, frame rate and index). I'm no math expert; I cobbled this together from existing code and answers.
// colored_waveform.c
// MixDJ
// Created by Jonathan Silverman on 3/14/19.
// Copyright © 2019 Jonathan Silverman. All rights reserved.
#include "colored_waveform.h"
#include "fftw3.h"
#include <math.h>
#include "sndfile.h"
//int N = 1024;
// helper function to apply a windowing function to a frame of samples
void calcWindow(double* in, double* out, int size) {
for (int i = 0; i < size; i++) {
double multiplier = 0.5 * (1 - cos(2*M_PI*i/(size - 1)));
out[i] = multiplier * in[i];
// helper function to compute FFT
void fft(double* samples, fftw_complex* out, int size) {
fftw_plan p;
p = fftw_plan_dft_r2c_1d(size, samples, out, FFTW_ESTIMATE);
// find the index of array element with the highest absolute value
// probably want to take some kind of moving average of buf[i]^2
// and return the maximum found
double maxFreqIndex(fftw_complex* buf, int size, float fS) {
double max_freq = 0;
double last_magnitude = 0;
for(int i = 0; i < (size / 2) - 1; i++) {
double freq = i * fS / size;
// printf("freq: %f\n", freq);
double magnitude = sqrt(buf[i][0]*buf[i][0] + buf[i][1]*buf[i][1]);
if(magnitude > last_magnitude)
max_freq = freq;
last_magnitude = magnitude;
return max_freq;
//// map a frequency to a color, red = lower freq -> violet = high freq
//int freqToColor(int i) {
void generateWaveformColors(const char path[]) {
printf("Generating waveform colors\n");
SNDFILE *infile = NULL;
SF_INFO sfinfo;
infile = sf_open(path, SFM_READ, &sfinfo);
sf_count_t numSamples = sfinfo.frames;
// sample rate
float fS = 44100;
// float songLengLengthSeconds = numSamples / fS;
// printf("seconds: %f", songLengLengthSeconds);
// size of frame for analysis, you may want to play with this
float frameMsec = 5;
// samples in a frame
int frameSamples = (int)(fS / (frameMsec * 1000));
// how much overlap each frame, you may want to play with this one too
int frameOverlap = (frameSamples / 2);
// color to use for each frame
// int outColors[(numSamples / frameOverlap) + 1];
// scratch buffers
double* tmpWindow;
fftw_complex* tmpFFT;
tmpWindow = (double*) fftw_malloc(sizeof(double) * frameSamples);
tmpFFT = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * frameSamples);
printf("Processing waveform for colors\n");
for (int i = 0, outptr = 0; i < numSamples; i += frameOverlap, outptr++)
double inSamples[frameSamples];
sf_read_double(infile, inSamples, frameSamples);
// window another frame for FFT
calcWindow(inSamples, tmpWindow, frameSamples);
// compute the FFT on the next frame
fft(tmpWindow, tmpFFT, frameSamples);
// which frequency is the highest?
double freqIndex = maxFreqIndex(tmpFFT, frameSamples, fS);
printf("%i: ", i);
printf("Max freq: %f\n", freqIndex);
// map to color
// outColors[outptr] = freqToColor(freqIndex);
sf_close (infile);
Here is some of the output:
2094216: Max freq: 5512.500000
2094220: Max freq: 0.000000
2094224: Max freq: 0.000000
2094228: Max freq: 0.000000
2094232: Max freq: 5512.500000
2094236: Max freq: 5512.500000
It only shows certain numbers, not a wide variety of frequencies as it probably should. Or am I wrong? Is there anything wrong with my code that you can see? The color stuff is commented out because I haven't done it yet.
The frequency resolution of an FFT is limited by the length of the data sample you have. The more samples you have, the higher the frequency resolution.
In your specific case you chose frames of 5 milliseconds, which is then transformed to a number of samples on the following line:
// samples in a frame
int frameSamples = (int)(fS / (frameMsec * 1000));
This corresponds to only 8 samples at the specified 44100Hz sampling rate. The frequency resolution with such a small frame size can be computed to be
44100 / 8
or 5512.5Hz, a rather poor resolution. Correspondingly, the observed frequencies will always be one of 0, 5512.5, 11025, 16537.5 or 22050Hz.
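To get a higher resolution you should increase the number of samples used for analysis. Note that the expression above divides by frameMsec, so increasing frameMsec as written would actually shrink the frame; a frame that really spans frameMsec milliseconds contains fS * frameMsec / 1000 samples (about 220 for 5 ms at 44100Hz). With the conversion written that way, increasing frameMsec (as suggested by the comment "size of frame for analysis, you may want to play with this") lengthens the frame and improves the resolution.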
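The arithmetic behind those numbers can be checked with a short sketch (Python is used here only for the calculation; frame_samples_5ms assumes the milliseconds-to-samples conversion fS * frameMsec / 1000, which is not what the posted code computes):

```python
fS = 44100.0        # sampling rate in Hz
frame_msec = 5.0    # frame length from the question

# Frame size as computed by the posted expression.
frame_samples_as_written = int(fS / (frame_msec * 1000))     # 8

# Spacing between FFT bins for that frame size.
bin_width = fS / frame_samples_as_written                    # 5512.5 Hz

# The only frequencies an 8-point real FFT can report (bins 0..N/2).
possible_freqs = [k * bin_width
                  for k in range(frame_samples_as_written // 2 + 1)]

# For comparison: a frame that actually spans 5 ms (assumed conversion).
frame_samples_5ms = int(fS * frame_msec / 1000)              # 220
bin_width_5ms = fS / frame_samples_5ms                       # about 200.45 Hz
```

Even a true 5 ms frame only resolves frequencies to roughly 200 Hz, so longer frames are still needed if finer distinctions matter.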

Getting a phase image from CUDA FFT

I'm trying to apply a cuFFT, forward then inverse, to a 2D image. I need the real and imaginary parts as separate outputs so I can compute a phase and magnitude image. I haven't been able to recreate the input image, and a non-zero phase is returned as well. In particular, I am unsure whether I'm correctly creating a full-size image from the reduced-size cuFFT complex output, which apparently stores only the left side of the spectrum. Here's my current code:
// Load image
cv::Mat_<float> img;
img = cv::imread(path,0);
std::cout<<"Input cv::Mat is not continuous!"<<std::endl;
return -1;
float *h_Data, *d_Data;
h_Data = img.ptr<float>(0);
// Complex device pointers
// Plans for cuFFT execution
// Image dimensions
const int dataH = img.rows;
const int dataW = img.cols;
const int complexW = dataW/2+1;
// Allocate memory
h_Result = (cufftComplex *)malloc(dataH * complexW * sizeof(cufftComplex));
checkCudaErrors(cudaMalloc((void **)&d_DataSpectrum, dataH * complexW * sizeof(cufftComplex)));
checkCudaErrors(cudaMalloc((void **)&d_Data, dataH * dataW * sizeof(float)));
checkCudaErrors(cudaMalloc((void **)&d_Result, dataH * complexW * sizeof(cufftComplex)));
// Copy image to GPU
checkCudaErrors(cudaMemcpy(d_Data, h_Data, dataH * dataW * sizeof(float), cudaMemcpyHostToDevice));
// Forward FFT
checkCudaErrors(cufftPlan2d(&fftPlanFwd, dataH, dataW, CUFFT_R2C));
checkCudaErrors(cufftExecR2C(fftPlanFwd, (cufftReal *)d_Data, (cufftComplex *)d_DataSpectrum));
// Inverse FFT
checkCudaErrors(cufftPlan2d(&fftPlanInv, dataH, dataW, CUFFT_C2C));
checkCudaErrors(cufftExecC2C(fftPlanInv, (cufftComplex *)d_DataSpectrum, (cufftComplex *)d_Result, CUFFT_INVERSE));
// Copy result to host memory
checkCudaErrors(cudaMemcpy(h_Result, d_Result, dataH * complexW * sizeof(cufftComplex), cudaMemcpyDeviceToHost));
// Convert cufftComplex to OpenCV real and imag Mat
Mat_<float> resultReal = Mat_<float>(dataH, dataW);
Mat_<float> resultImag = Mat_<float>(dataH, dataW);
for(int i=0; i<dataH; i++){
float* rowPtrReal = resultReal.ptr<float>(i);
float* rowPtrImag = resultImag.ptr<float>(i);
for(int j=0; j<dataW; j++){
rowPtrReal[j] = h_Result[i*complexW+j].x/(dataH*dataW);
rowPtrImag[j] = h_Result[i*complexW+j].y/(dataH*dataW);
// Right side?
rowPtrReal[j] = h_Result[i*complexW+(dataW-j)].x/(dataH*dataW);
rowPtrImag[j] = -h_Result[i*complexW+(dataW-j)].y/(dataH*dataW);
// Compute phase and normalize to 8 bit
Mat_<float> resultPhase;
phase(resultReal, resultImag, resultPhase);
cv::subtract(resultPhase, 2*M_PI, resultPhase, (resultPhase > M_PI));
resultPhase = ((resultPhase+M_PI)*255)/(2*M_PI);
Mat_<uchar> normalized = Mat_<uchar>(dataH, dataW);
resultPhase.convertTo(normalized, CV_8U);
// Save phase image
// Compute amplitude and normalize to 8 bit
Mat_<float> resultAmplitude;
magnitude(resultReal, resultImag, resultAmplitude);
Mat_<uchar> normalizedAmplitude = Mat_<uchar>(dataH, dataW);
resultAmplitude.convertTo(normalizedAmplitude, CV_8U);
// Save amplitude image
I'm not sure where my error is. Is that the correct way to get back the whole image from the reduced version (the for loop)?
I think I got it now. The 'trick' is to start with a complex matrix. Starting with a real one, you need to apply an R2C transform, which uses a reduced size due to the symmetry of the spectrum, and then a C2C transform, which preserves that reduced size. The solution was to create a complex input from the real one by inserting zeros as the imaginary part, then applying two C2C transforms in a row, which both preserve the whole image and make it easy to get the full-sized real and imaginary matrices afterwards:
// Load image
cv::Mat_<float> img;
img = cv::imread(path,0);
std::cout<<"Input cv::Mat is not continuous!"<<std::endl;
return -1;
float *h_DataReal = img.ptr<float>(0);
cufftComplex *h_DataComplex;
// Image dimensions
const int dataH = img.rows;
const int dataW = img.cols;
// Convert real input to complex
h_DataComplex = (cufftComplex *)malloc(dataH * dataW * sizeof(cufftComplex));
for(int i=0; i<dataH*dataW; i++){
h_DataComplex[i].x = h_DataReal[i];
h_DataComplex[i].y = 0.0f;
// Complex device pointers
// Plans for cuFFT execution
// Allocate memory
h_Result = (cufftComplex *)malloc(dataH * dataW * sizeof(cufftComplex));
checkCudaErrors(cudaMalloc((void **)&d_DataSpectrum, dataH * dataW * sizeof(cufftComplex)));
checkCudaErrors(cudaMalloc((void **)&d_Data, dataH * dataW * sizeof(cufftComplex)));
checkCudaErrors(cudaMalloc((void **)&d_Result, dataH * dataW * sizeof(cufftComplex)));
// Copy image to GPU
checkCudaErrors(cudaMemcpy(d_Data, h_DataComplex, dataH * dataW * sizeof(cufftComplex), cudaMemcpyHostToDevice));
// Forward FFT
checkCudaErrors(cufftPlan2d(&fftPlanFwd, dataH, dataW, CUFFT_C2C));
checkCudaErrors(cufftExecC2C(fftPlanFwd, (cufftComplex *)d_Data, (cufftComplex *)d_DataSpectrum, CUFFT_FORWARD));
// Inverse FFT
checkCudaErrors(cufftPlan2d(&fftPlanInv, dataH, dataW, CUFFT_C2C));
checkCudaErrors(cufftExecC2C(fftPlanInv, (cufftComplex *)d_DataSpectrum, (cufftComplex *)d_Result, CUFFT_INVERSE));
// Copy result to host memory
checkCudaErrors(cudaMemcpy(h_Result, d_Result, dataH * dataW * sizeof(cufftComplex), cudaMemcpyDeviceToHost));
// Convert cufftComplex to OpenCV real and imag Mat
Mat_<float> resultReal = Mat_<float>(dataH, dataW);
Mat_<float> resultImag = Mat_<float>(dataH, dataW);
for(int i=0; i<dataH; i++){
float* rowPtrReal = resultReal.ptr<float>(i);
float* rowPtrImag = resultImag.ptr<float>(i);
for(int j=0; j<dataW; j++){
rowPtrReal[j] = h_Result[i*dataW+j].x/(dataH*dataW);
rowPtrImag[j] = h_Result[i*dataW+j].y/(dataH*dataW);
This is an old question, but I'd like to provide additional information: the R2C transform preserves the same amount of information as a C2C transform, it just does so with about half as many elements. The R2C (and C2R) transforms take advantage of Hermitian symmetry to reduce the number of elements that are computed and stored in memory (i.e. the FFT of real input is conjugate-symmetric, so you don't actually need about half of the terms that a C2C transform stores).
To generate a 2D image of the real and imaginary components, you could use the R2C transform and then write a kernel that translates the (Nx/2+1)*Ny output array into a pair of arrays of size Nx*Ny, taking advantage of the symmetry yourself to write the terms to the correct positions. But using a C2C transform is a bit less code, and more foolproof.
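A minimal sketch of that symmetry, in plain Python with a naive DFT standing in for cuFFT (the helper names dft2 and expand_half_spectrum and the tiny 3x4 test image are made up for illustration). The point to notice is that the mirrored element comes from the mirrored row as well as the mirrored column:

```python
import cmath

def dft2(img):
    """Naive 2D DFT of a real H x W image given as a list of rows."""
    h, w = len(img), len(img[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            s = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * cmath.pi * (u * y / h + v * x / w)
                    s += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = s
    return out

def expand_half_spectrum(half, h, w):
    """Rebuild the full H x W spectrum from the H x (W//2 + 1) half."""
    cw = w // 2 + 1
    full = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            if v < cw:
                full[u][v] = half[u][v]
            else:
                # Hermitian symmetry: F[u][v] = conj(F[(H-u) % H][(W-v) % W])
                full[u][v] = half[(h - u) % h][w - v].conjugate()
    return full

img = [[1.0, 2.0, 0.0, 4.0],
       [3.0, 1.0, 5.0, 2.0],
       [0.0, 6.0, 1.0, 1.0]]
h, w = 3, 4

full_ref = dft2(img)                              # full complex spectrum
half = [row[: w // 2 + 1] for row in full_ref]    # the part an R2C layout keeps
full = expand_half_spectrum(half, h, w)

max_err = max(abs(full[u][v] - full_ref[u][v])
              for u in range(h) for v in range(w))
```

If max_err is at rounding-error level, the half spectrum carried everything needed to rebuild the full one.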

Initialising texture of MTLPixelFormatR32Float in metal

I have a buffer initialised with a single-channel floating point image, which I need to get into a floating point format texture (MTLPixelFormatR32Float). I've tried creating the texture with that format and doing the following:
float *rawData = (float*)malloc(sizeof(float) * img.cols * img.rows);
for(int i = 0; i < img.rows; i++){
for(int j = 0; j < img.cols; j++){
rawData[i * img.cols + j] = img.at<float>(i, j);
MTLTextureDescriptor *textureDescriptor = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatR32Float
[texture replaceRegion:region mipmapLevel:0 withBytes:&rawData bytesPerRow:bytesPerRow];
where rawData is my buffer with the necessary floating point data. This doesn't work, I get an EXC_BAD_ACCESS error on the [texture replaceRegion...] line. I've also tried the MTKTextureLoader, which also returns nil instead of the texture.
Help would be appreciated. I would be most grateful if anyone has a working method of how to initialise the MTLPixelFormatR32Float texture with custom floating point data for data-parallel computation purposes.
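The bytes that you pass to replaceRegion should point to your data. rawData is already a pointer to the pixel data, so &rawData is the address of the pointer variable itself; you are incorrectly passing a pointer to a pointer.
To fix it, replace withBytes:&rawData with withBytes:rawData.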

OpenCL: Access the proper index using get_global_id()

I am coding in OpenCL.
I am converting a C function that works on a 2D array, with loops starting from i=1 and j=1. Please find it below.
cv::Mat input; //Input :having some data in it ..
//Image input size is :input.rows=288 ,input.cols =640
cv::Mat output(input.rows-2,input.cols-2,CV_32F); //Output buffer
//Image output size is :output.rows=286 ,output.cols =638
This is the code I want to port to OpenCL:
for(int i=1;i<output.rows-1;i++)
for(int j=1;j<output.cols-1;j++)
float xVal = input.at<uchar>(i-1,j-1) - input.at<uchar>(i-1,j+1) + 2*(input.at<uchar>(i,j-1) - input.at<uchar>(i,j+1)) + input.at<uchar>(i+1,j-1) - input.at<uchar>(i+1,j+1);
float yVal = input.at<uchar>(i-1,j-1) - input.at<uchar>(i+1,j-1) + 2*(input.at<uchar>(i-1,j) - input.at<uchar>(i+1,j)) + input.at<uchar>(i-1,j+1) - input.at<uchar>(i+1,j+1);
output.at<float>(i-1,j-1) = xVal*xVal+yVal*yVal;
Host code :
//Input Image size is :input.rows=288 ,input.cols =640
//Output Image size is :output.rows=286 ,output.cols =638
OclStr->global_work_size[0] =(input.cols);
OclStr->global_work_size[1] =(input.rows);
size_t outBufSize = (output.rows) * (output.cols) * 4;//4 as I am copying all 4 uchar values into one float variable space
cl_mem cl_input_buffer = clCreateBuffer(
(input.rows) * (input.cols),
static_cast<void *>(, &OclStr->returnstatus);
cl_mem cl_output_buffer = clCreateBuffer(
(output.rows) * (output.cols) * sizeof(float),
static_cast<void *>(, &OclStr->returnstatus);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 0, sizeof(cl_mem), (void *)&cl_input_buffer);
OclStr->returnstatus = clSetKernelArg(OclStr->objkernel, 1, sizeof(cl_mem), (void *)&cl_output_buffer);
OclStr->returnstatus = clEnqueueNDRangeKernel(
clEnqueueMapBuffer(OclStr->command_queue, cl_output_buffer, true, CL_MAP_READ, 0, outBufSize, 0, NULL, NULL, &OclStr->returnstatus);
kernel Code :
__kernel void Sobel_uchar (__global uchar *pSrc, __global float *pDstImage)
const uint cols = get_global_id(0)+1;
const uint rows = get_global_id(1)+1;
const uint width= get_global_size(0);
uchar Opsoble[8];
Opsoble[0] = pSrc[(cols-1)+((rows-1)*width)];
Opsoble[1] = pSrc[(cols+1)+((rows-1)*width)];
Opsoble[2] = pSrc[(cols-1)+((rows+0)*width)];
Opsoble[3] = pSrc[(cols+1)+((rows+0)*width)];
Opsoble[4] = pSrc[(cols-1)+((rows+1)*width)];
Opsoble[5] = pSrc[(cols+1)+((rows+1)*width)];
Opsoble[6] = pSrc[(cols+0)+((rows-1)*width)];
Opsoble[7] = pSrc[(cols+0)+((rows+1)*width)];
float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
pDstImage[(cols-1)+(rows-1)*width] = gx*gx + gy*gy;
Here I am not able to get the output as expected.
I have some questions:
My for loop starts from i=1 instead of zero. How can I get the proper index using get_global_id() in the x and y directions?
What is going wrong in my kernel code above?
I suspect there is a problem with the buffer stride, but I have not been able to get any further after working on it for a whole day.
I have observed that with the logic below, the output skips one or two frames after a sequence of 7 or 8 frames.
I have added a screenshot of my output compared with the reference output.
The logic above only partially applies the Sobel filter to my input. I changed the width to:
const uint width = get_global_size(0)+1;
Your suggestions are most welcome!
It looks like you may be fetching values in (y,x) format in your opencl version. Also, you need to add 1 to the global id to replicate your for loops starting from 1 rather than 0.
I don't know why there is an unused iOffset variable. Maybe your bug is related to this? I removed it in my version.
Does this kernel work better for you?
__kernel void simple(__global uchar *pSrc, __global float *pDstImage)
const uint i = get_global_id(0) +1;
const uint j = get_global_id(1) +1;
const uint width = get_global_size(0) +2;
uchar Opsoble[8];
Opsoble[0] = pSrc[(i-1) + (j - 1)*width];
Opsoble[1] = pSrc[(i-1) + (j + 1)*width];
Opsoble[2] = pSrc[i + (j-1)*width];
Opsoble[3] = pSrc[i + (j+1)*width];
Opsoble[4] = pSrc[(i+1) + (j - 1)*width];
Opsoble[5] = pSrc[(i+1) + (j + 1)*width];
Opsoble[6] = pSrc[(i-1) + (j)*width];
Opsoble[7] = pSrc[(i+1) + (j)*width];
float gx = Opsoble[0]-Opsoble[1]+2*(Opsoble[2]-Opsoble[3])+Opsoble[4]-Opsoble[5];
float gy = Opsoble[0]-Opsoble[4]+2*(Opsoble[6]-Opsoble[7])+Opsoble[1]-Opsoble[5];
pDstImage[(i-1) + (j-1)*width] = gx*gx + gy*gy ;
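The index mapping can be checked on the CPU with a small sketch (plain Python standing in for the kernel; sobel_loops and sobel_work_item are hypothetical names). It assumes the global work size equals the output size, so the input row stride is the global width plus 2 and the output row stride is the global width itself:

```python
def sobel_loops(src, rows, cols):
    """Reference: nested loops over the interior pixels, starting at 1."""
    dst = [0.0] * ((rows - 2) * (cols - 2))
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            def p(r, c):
                return src[r * cols + c]
            gx = (p(i-1, j-1) - p(i-1, j+1) + 2 * (p(i, j-1) - p(i, j+1))
                  + p(i+1, j-1) - p(i+1, j+1))
            gy = (p(i-1, j-1) - p(i+1, j-1) + 2 * (p(i-1, j) - p(i+1, j))
                  + p(i-1, j+1) - p(i+1, j+1))
            dst[(i - 1) * (cols - 2) + (j - 1)] = float(gx * gx + gy * gy)
    return dst

def sobel_work_item(src, dst, gid_x, gid_y, global_w):
    """What a single work item would compute from its global id."""
    col = gid_x + 1          # +1 mimics loops that start at 1
    row = gid_y + 1
    src_w = global_w + 2     # input stride: output width plus 2 border columns
    def p(r, c):
        return src[r * src_w + c]
    gx = (p(row-1, col-1) - p(row-1, col+1) + 2 * (p(row, col-1) - p(row, col+1))
          + p(row+1, col-1) - p(row+1, col+1))
    gy = (p(row-1, col-1) - p(row+1, col-1) + 2 * (p(row-1, col) - p(row+1, col))
          + p(row-1, col+1) - p(row+1, col+1))
    dst[gid_y * global_w + gid_x] = float(gx * gx + gy * gy)   # output stride

rows, cols = 5, 6                                   # tiny stand-in image
src = [(7 * k * k + 3 * k) % 256 for k in range(rows * cols)]
out_h, out_w = rows - 2, cols - 2

dst = [0.0] * (out_h * out_w)
for gid_y in range(out_h):                          # emulate the 2D NDRange
    for gid_x in range(out_w):
        sobel_work_item(src, dst, gid_x, gid_y, out_w)

matches_reference = dst == sobel_loops(src, rows, cols)
```

Using one stride for reads and a different one for writes is the detail to watch; if the kernel is launched over the input size instead, the strides and the valid id range need to change accordingly.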
I am a bit apprehensive about posting an answer suggesting optimizations to your kernel, seeing as the original output has not been reproduced exactly as of yet. There is a major improvement available to be made for problems related to image processing/filtering.
Using local memory will help you out by reducing the number of global reads by a factor of eight, as well as grouping the global writes together for potential gains with the single write-per-pixel output.
The kernel below reads a block of up to 34x34 from pSrc, and outputs a 32x32(max) area of the pDstImage. I hope the comments in the code are enough to guide you in using the kernel. I have not been able to give this a complete test, so there could be changes required. Any comments are appreciated as well.
__kernel void sobel_uchar_wlocal (__global uchar *pSrc, __global float *pDstImage, __global uint2 dimDstImage)
//call this kernel 1-dimensional work group size: 32x1
//calculates 32x32 region of output with 32 work items
const uint wid = get_local_id(0);
const uint wid_1 = wid+1; // corrected for the calculation step
const uint2 gid = (uint2)(get_group_id(0),get_group_id(1));
const uint localDim = get_local_size(0);
const uint2 globalTopLeft = (uint2)(localDim * gid.x, localDim * gid.y); //position in pSrc to copy from/to
//dimLocalBuff is used for the right and bottom edges of the image, where the work group may run over the border
const uint2 dimLocalBuff = (uint2)(localDim,localDim);
if(dimDstImage.x - globalTopLeft.x < dimLocalBuff.x){
dimLocalBuff.x = dimDstImage.x - globalTopLeft.x;
if(dimDstImage.y - globalTopLeft.y < dimLocalBuff.y){
dimLocalBuff.y = dimDstImage.y - globalTopLeft.y;
int i,j;
//save region of data into local memory
__local uchar srcBuff[34][34]; //34^2 uchar = 1156 bytes
srcBuff[i+1][j+1] = pSrc[globalTopLeft.x+i][globalTopLeft.y+j];
//compute output and store locally
__local float dstBuff[32][32]; //32^2 float = 4096 bytes
if(wid_1 < dimLocalBuff.x){
float gx = srcBuff[(wid_1-1)+ (i - 1)]-srcBuff[(wid_1-1)+ (i + 1)]+2*(srcBuff[wid_1+ (i-1)]-srcBuff[wid_1+ (i+1)])+srcBuff[(wid_1+1)+ (i - 1)]-srcBuff[(wid_1+1)+ (i + 1)];
float gy = srcBuff[(wid_1-1)+ (i - 1)]-srcBuff[(wid_1+1)+ (i - 1)]+2*(srcBuff[(wid_1-1)+ (i)]-srcBuff[(wid_1+1)+ (i)])+srcBuff[(wid_1-1)+ (i + 1)]-srcBuff[(wid_1+1)+ (i + 1)];
dstBuff[wid][i] = gx*gx + gy*gy;
//copy results to output
srcBuff[i][j] = pSrc[globalTopLeft.x+i][globalTopLeft.y+j];

Why are my frequency values for iPhone FFT incorrect?

I've been trying to get exact frequencies using the FFT in Apple's Accelerate framework, but I'm having trouble working out why my values are off the true frequency.
I have been using this article as the basis for my implementation, and after really struggling to get to the point I'm at now, I am totally stumped.
So far I've got audio in -> Hanning window -> FFT -> phase calculation -> weird final output. I'd think that there will be a problem with my maths somewhere, but I'm really out of ideas by now.
The outputs are a lot lower than they should be, e.g., I input 440Hz and it prints out 190Hz, or I input 880Hz and it prints out 400Hz. For the most part these results are consistent, but not always, and there doesn't seem to be any common factor between them either...
Here is my code:
sampleRate = 44100,
osamp = 4,
samples = 4096,
range = samples * 7 / 16,
step = samples / osamp
NSMutableArray *fftResults;
static FFTSetup setupReal;
static uint32_t log2n, n, nOver2;
static int32_t stride;
static float expct = 2*M_PI*((double)step/(double)samples);
static float phase1[range];
static float phase2[range];
static float dPhase[range];
- (void)fftSetup
// Declaring integers
log2n = 12;
n = 1 << log2n;
stride = 1;
nOver2 = n / 2;
// Allocating memory for complex vectors
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
// Allocating memory for FFT
setupReal = vDSP_create_fftsetup(log2n, FFT_RADIX2);
// Setting phase
memset(phase2, 0, range * sizeof(float));
// For each sample in buffer...
for (int bufferCount = 0; bufferCount < audioBufferList.mNumberBuffers; bufferCount++)
// Declaring samples from audio buffer list
SInt16 *samples = (SInt16*)audioBufferList.mBuffers[bufferCount].mData;
// Creating Hann window function
for (int i = 0; i < nOver2; i++)
double hannMultiplier = 0.5 * (1 - cos((2 * M_PI * i) / (nOver2 - 1)));
// Applying window to each sample
A.realp[i] = hannMultiplier * samples[i];
A.imagp[i] = 0;
// Applying FFT
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);
// Detecting phase
vDSP_zvphas(&A, stride, phase1, stride, range);
// Calculating phase difference
vDSP_vsub(phase2, stride, phase1, stride, dPhase, stride, range);
// Saving phase
memcpy(phase2, phase1, range * sizeof(float));
// Extracting DSP outputs
for (size_t j = 0; j < nOver2; j++)
NSNumber *realNumbers = [NSNumber numberWithFloat:A.realp[j]];
NSNumber *imagNumbers = [NSNumber numberWithFloat:A.imagp[j]];
[real addObject:realNumbers];
[imag addObject:imagNumbers];
// Combining real and imaginary parts
[resultsCombined addObject:real];
[resultsCombined addObject:imag];
// Filling FFT output array
[fftResults addObject:resultsCombined];
int fftCount = [fftResults count];
NSLog(@"FFT Count: %d",fftCount);
// For each FFT...
for (int i = 0; i < fftCount; i++)
// Declaring integers for peak detection
float peak = 0;
float binNumber = 0;
// Declaring integers for phase detection
float deltaPhase;
static float trueFrequency[range];
for (size_t j = 1; j < range; j++)
// Calculating bin magnitude
float realVal = [[[[fftResults objectAtIndex:i] objectAtIndex:0] objectAtIndex:j] floatValue];
float imagVal = [[[[fftResults objectAtIndex:i] objectAtIndex:1] objectAtIndex:j] floatValue];
float magnitude = sqrtf(realVal*realVal + imagVal*imagVal);
// Peak detection
if (magnitude > peak)
peak = magnitude;
binNumber = (float)j;
// Getting phase difference
deltaPhase = dPhase[j];
// Subtract expected difference
deltaPhase -= (float)j*expct;
// Map phase difference into +/- pi interval
int qpd = deltaPhase / M_PI;
if (qpd >= 0)
qpd += qpd&1;
qpd -= qpd&1;
deltaPhase -= M_PI * (float)qpd;
// Getting bin deviation from +/- pi interval
float deltaFrequency = osamp * deltaPhase / (2 * M_PI);
// Calculating true frequency at the j-th partial
trueFrequency[j] = (j * (sampleRate/samples)) + (deltaFrequency * (sampleRate/samples));
UInt32 mag;
mag = binNumber;
// Extracting frequency at bin peak
float f = trueFrequency[mag];
NSLog(@"True frequency = %fHz", f);
float b = roundf(binNumber*(sampleRate/nOver2));
NSLog(@" Bin frequency = %fHz", b);
Note that the expected phase difference (even for a bin-centered frequency) depends on both the window offset (or overlap) between the pair of FFTs and the bin number (or frequency) of the FFT result. For example, if you offset the windows by very little (1 sample), the unwrapped phase difference between the 2 FFTs will be smaller than with a larger offset. At the same offset, if the frequency is higher, the expected phase difference between the same bin of two FFTs will be greater (or it will wrap more).
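A small sketch of how the offset and the bin number enter the calculation, in plain Python with a direct single-bin DFT instead of vDSP (bin_phase is a made-up helper, and the 1024-sample frame and 440Hz test tone are arbitrary choices). The expected phase advance for bin k between two frames that are hop samples apart is taken to be 2*pi*k*hop/n:

```python
import cmath
import math

def bin_phase(signal, start, n, k):
    """Phase of DFT bin k for a Hann-windowed frame of n samples at `start`."""
    acc = 0j
    for i in range(n):
        w = 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
        acc += w * signal[start + i] * cmath.exp(-2j * math.pi * k * i / n)
    return cmath.phase(acc)

sample_rate = 44100.0
n = 1024                 # frame length
osamp = 4                # oversampling factor
hop = n // osamp         # offset between the two frames, in samples
tone_hz = 440.0          # test tone

signal = [math.sin(2.0 * math.pi * tone_hz * t / sample_rate)
          for t in range(n + hop)]
k = round(tone_hz * n / sample_rate)     # nearest bin (10 here)

# Expected advance grows with both the hop size and the bin number.
expected = 2.0 * math.pi * k * hop / n

delta = bin_phase(signal, hop, n, k) - bin_phase(signal, 0, n, k) - expected
delta = (delta + math.pi) % (2.0 * math.pi) - math.pi    # wrap into [-pi, pi)

# Convert the leftover phase into a fractional bin offset, then into Hz.
estimated_hz = (k + delta * n / (2.0 * math.pi * hop)) * sample_rate / n
```

Both frames here are taken from the same continuous signal exactly hop samples apart; if successive FFTs are computed on buffers separated by some other distance, that distance is what belongs in expected.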
