Problem:
I need to fill an MTLBuffer of Floats with a constant value, say 1729.68921, and I need it to be as fast as possible.
That rules out filling the buffer on the CPU side (i.e. getting an UnsafeMutablePointer<Float> from the MTLBuffer and assigning values serially).
My approach
Ideally I'd use MTLBlitCommandEncoder.fill(), however AFAIK it can only fill a buffer with a repeated UInt8 value (since UInt8 is 1 byte and Float is 4 bytes, I can't specify an arbitrary value for my Float constant).
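For what it's worth, the only constant that approach can produce directly is one whose four bytes are all identical, the obvious case being zero. A sketch of that special case (assuming a command buffer is at hand; it doesn't help with an arbitrary constant like mine):
id<MTLBlitCommandEncoder> blitEncoder = [commandBuffer blitCommandEncoder];
// fillBuffer:range:value: writes a single repeating byte; a value of 0 yields 0.0f floats.
[blitEncoder fillBuffer:buffer range:NSMakeRange(0, buffer.length) value:0];
[blitEncoder endEncoding];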
So far I can see only 2 options left, but both seem to be overkill:
create another buffer B filled with the constant value and copy its contents into my buffer via MTLBlitCommandEncoder
create a kernel function that'd fill the buffer
Questions
What's the fastest way to fill an MTLBuffer of Floats with a constant value?
Using a compute shader that writes to multiple buffer elements from each thread was the fastest approach in my experiments. This is hardware-dependent, so you should test on the full range of devices you expect the app to be deployed on.
I wrote two compute shaders: one that fills 16 contiguous array elements without checking against the array bounds, and one that sets a single array element after checking against the length of the buffer:
kernel void fill_16_unchecked(device float *buffer [[buffer(0)]],
constant float &value [[buffer(1)]],
uint index [[thread_position_in_grid]])
{
for (int i = 0; i < 16; ++i) {
buffer[index * 16 + i] = value;
}
}
kernel void single_fill_checked(device float *buffer [[buffer(0)]],
constant float &value [[buffer(1)]],
constant uint &buffer_length [[buffer(2)]],
uint index [[thread_position_in_grid]])
{
if (index < buffer_length) {
buffer[index] = value;
}
}
If you know that your buffer count will always be a multiple of the thread execution width multiplied by the number of elements you set in the loop, you can just use the first function. The second function is a fallback for when you might dispatch a grid that would otherwise overrun the buffer.
Once you have two pipelines built from these functions, you can dispatch the work with a pair of compute commands as follows:
NSInteger executionWidth = [unchecked16Pipeline threadExecutionWidth];
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
[computeEncoder setBuffer:buffer offset:0 atIndex:0];
[computeEncoder setBytes:&value length:sizeof(float) atIndex:1];
if (bufferCount / (executionWidth * 16) != 0) {
[computeEncoder setComputePipelineState:unchecked16Pipeline];
[computeEncoder dispatchThreadgroups:MTLSizeMake(bufferCount / (executionWidth * 16), 1, 1)
threadsPerThreadgroup:MTLSizeMake(executionWidth, 1, 1)];
}
if (bufferCount % (executionWidth * 16) != 0) {
int remainder = bufferCount % (executionWidth * 16);
[computeEncoder setComputePipelineState:checkedSinglePipeline];
// Rebind the buffer at the offset of the unfilled tail so this dispatch covers
// the elements the unchecked kernel did not reach, and pass the tail length.
[computeEncoder setBuffer:buffer offset:(bufferCount - remainder) * sizeof(float) atIndex:0];
[computeEncoder setBytes:&remainder length:sizeof(remainder) atIndex:2];
[computeEncoder dispatchThreadgroups:MTLSizeMake((remainder / executionWidth) + 1, 1, 1)
threadsPerThreadgroup:MTLSizeMake(executionWidth, 1, 1)];
}
[computeEncoder endEncoding];
Note that doing the work in this manner will not necessarily be faster than the naive approach that just writes one element per thread. In my tests, it was 40% faster on A8, roughly equivalent on A10, and 2-3x slower (!) on A9. Always test with your own workload.
Related
I have a metal shader that computes an image histogram like this:
#define CHANNEL_SIZE (256)
typedef atomic_uint HistoBuffer[CHANNEL_SIZE];
kernel void
computeHisto(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
device HistoBuffer &histo [[buffer(0)]],
uint2 grid [[thread_position_in_grid]]) {
if (grid.x >= sourceTexture.get_width() || grid.y >= sourceTexture.get_height()) { return; }
half gray = sourceTexture.read(grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
atomic_fetch_add_explicit(&histo[grayvalue], 1, memory_order_relaxed);
}
This works as expected but takes too long (>1 ms). I tried to optimise it by reducing the number of atomic operations and came up with the following improved code. The idea is to compute local histograms per threadgroup and then add them atomically into the global histo buffer.
kernel void
computeHisto_fast(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
device HistoBuffer &histo [[buffer(0)]],
uint2 t_pos_grid [[thread_position_in_grid]],
uint2 tg_pos_grid [[ threadgroup_position_in_grid ]],
uint2 t_pos_tg [[ thread_position_in_threadgroup]],
uint t_idx_tg [[ thread_index_in_threadgroup ]],
uint2 t_per_tg [[ threads_per_threadgroup ]]
)
{
threadgroup uint localhisto[CHANNEL_SIZE] = { 0 };
if (t_pos_grid.x >= sourceTexture.get_width() || t_pos_grid.y >= sourceTexture.get_height()) { return; }
half gray = sourceTexture.read(t_pos_grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
localhisto[grayvalue]++;
// wait for all threads in threadgroup to finish
threadgroup_barrier(mem_flags::mem_none);
// copy the thread group result atomically into global histo buffer
if(t_idx_tg == 0) {
for(uint i=0;i<CHANNEL_SIZE;i++) {
atomic_fetch_add_explicit(&histo[i], localhisto[i], memory_order_relaxed);
}
}
}
There are 2 problems:
The improved routine does not yield results identical to the first one, and I currently don't see why.
The run time didn't improve; in fact, it takes 4 times the runtime of the unoptimised version. According to the debugger, the for loop is the problem. But I do not understand this, since the number of atomic operations is reduced by roughly 3 orders of magnitude, i.e. by the threadgroup size, here (32x32) = 1024.
Can anybody explain what I am doing wrong here? Thanks
EDIT: 2019-12-22:
Following Matthijs' answer, I have changed the local histogram to use atomic operations as well:
threadgroup atomic_uint localhisto[CHANNEL_SIZE] = {0};
half gray = sourceTexture.read(t_pos_grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
atomic_fetch_add_explicit(&localhisto[grayvalue], 1, memory_order_relaxed);
However, the result still is not the same as in the reference implementation above. There must be another severe conceptual bug.
You'll still need to use atomic operations on the threadgroup memory, since it's still being shared by multiple threads. This should be faster than in your first version because there is less contention for the same locks.
I think the problem is with initializing the shared memory; I don't think that definition does the job. Also, threadgroup-level memory synchronization is required between zeroing the shared memory and the atomic updates.
As for the device memory update, doing it with a single thread is clearly suboptimal. Updating the whole 256-entry histogram from each threadgroup can have a huge overhead, depending on the threadgroup size.
A sample I used for a small (16-element) histogram with 8x8 threadgroups:
kernel void gaussian_filter(device const uchar* data,
device atomic_uint* p_hist,
uint2 imageShape [[threads_per_grid]],
uint2 idx [[thread_position_in_grid]],
uint tidx [[thread_index_in_threadgroup]])
{
threadgroup atomic_uint sh_hist[16];
if (tidx < 16)
atomic_store_explicit(sh_hist + tidx, 0, memory_order_relaxed);
threadgroup_barrier(mem_flags::mem_threadgroup);
uint histBin = (uint)data[imageShape[0]*idx[1] + idx[0]]/16;
atomic_fetch_add_explicit(sh_hist + histBin, 1, memory_order_relaxed);
threadgroup_barrier(mem_flags::mem_threadgroup);
if (tidx < 16)
atomic_fetch_add_explicit(p_hist + tidx, atomic_load_explicit(sh_hist + tidx, memory_order_relaxed), memory_order_relaxed);
}
I'm trying to implement code in Metal that performs a 1D convolution between two vectors (a data vector and a filter vector). I've implemented the following, which works correctly:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
const device int& dataSize [[ buffer(1) ]],
const device float *filterVector [[ buffer(2) ]],
const device int& filterSize [[ buffer(3) ]],
device float *outVector [[ buffer(4) ]],
uint id [[ thread_position_in_grid ]]) {
int outputSize = dataSize - filterSize + 1;
for (int i=0;i<outputSize;i++) {
float sum = 0.0;
for (int j=0;j<filterSize;j++) {
sum += dataVector[i+j] * filterVector[j];
}
outVector[i] = sum;
}
}
My problem is that it takes about 10 times longer to process the same data (computation plus data transfer to/from the GPU) using Metal than in Swift on the CPU. My question is: how do I replace the inner loop with a single vector operation, or is there another way to speed up the above code?
The key to taking advantage of the GPU's parallelism in this case is to let it manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we'll invoke it for each element in the data vector. The kernel function simplifies to this:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
const constant int &dataSize [[ buffer(1) ]],
const constant float *filterVector [[ buffer(2) ]],
const constant int &filterSize [[ buffer(3) ]],
device float *outVector [[ buffer(4) ]],
uint id [[ thread_position_in_grid ]])
{
float sum = 0.0;
for (int i = 0; i < filterSize; ++i) {
sum += dataVector[id + i] * filterVector[i];
}
outVector[id] = sum;
}
In order to dispatch this work, we select a threadgroup size based on the thread execution width recommended by the compute pipeline state. The one tricky thing here is making sure that there's enough padding in the input and output buffers so that we can slightly overrun the actual size of the data. This does cause us to waste a small amount of memory and computation, but saves us the complexity of doing a separate dispatch just to compute the convolution for the elements at the end of the buffer.
// We should ensure here that the data buffer and output buffer each have a size that is a multiple of
// the compute pipeline's threadExecutionWidth, by padding the amount we allocate for each of them.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).
let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)
let commandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, at: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, at: 2)
commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, at: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, at: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I'm curious to hear how it compares to your CPU implementation.
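As an aside, if you do benchmark against the CPU, note that the inner loop above is just a sliding dot product, which maps onto a single Accelerate call. A hedged sketch using the question's variable names (vDSP_conv with a forward filter stride computes correlation, which is exactly what the question's loop does, since it never reverses the filter):
// dataVector has dataSize elements, filterVector has filterSize elements,
// and outVector must hold (dataSize - filterSize + 1) results.
vDSP_conv(dataVector, 1, filterVector, 1, outVector, 1,
          (vDSP_Length)(dataSize - filterSize + 1), (vDSP_Length)filterSize);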
The following code shows how to render encoded commands in parallel on the GPU using the Objective-C Metal API (the threading code above only divides rendering of the output into grid sections for parallel processing; the calculations themselves are still not performed in parallel). It is what you're referring to in your question, even though it's not exactly what you want. I've provided this answer to help anyone who might have stumbled upon this question expecting it to be about parallel rendering (which, in fact, it is not):
- (void)drawInMTKView:(MTKView *)view
{
dispatch_async(((AppDelegate *)UIApplication.sharedApplication.delegate).cameraViewQueue, ^{
id <CAMetalDrawable> drawable = [view currentDrawable]; //[(CAMetalLayer *)view.layer nextDrawable];
MTLRenderPassDescriptor *renderPassDesc = [view currentRenderPassDescriptor];
renderPassDesc.colorAttachments[0].loadAction = MTLLoadActionClear;
renderPassDesc.colorAttachments[0].clearColor = MTLClearColorMake(0.0,0.0,0.0,1.0);
renderPassDesc.renderTargetWidth = self.texture.width;
renderPassDesc.renderTargetHeight = self.texture.height;
renderPassDesc.colorAttachments[0].texture = drawable.texture;
if (renderPassDesc != nil)
{
dispatch_semaphore_wait(self._inflight_semaphore, DISPATCH_TIME_FOREVER);
id <MTLCommandBuffer> commandBuffer = [self.metalContext.commandQueue commandBuffer];
[commandBuffer enqueue];
// START PARALLEL RENDERING OPERATIONS HERE
id <MTLParallelRenderCommandEncoder> parallelRCE = [commandBuffer parallelRenderCommandEncoderWithDescriptor:renderPassDesc];
// FIRST PARALLEL RENDERING OPERATION
id <MTLRenderCommandEncoder> renderEncoder = [parallelRCE renderCommandEncoder];
[renderEncoder setRenderPipelineState:self.metalContext.renderPipelineState];
[renderEncoder setVertexBuffer:self.metalContext.vertexBuffer offset:0 atIndex:0];
[renderEncoder setVertexBuffer:self.metalContext.uniformBuffer offset:0 atIndex:1];
[renderEncoder setFragmentBuffer:self.metalContext.uniformBuffer offset:0 atIndex:0];
[renderEncoder setFragmentTexture:self.texture
atIndex:0];
[renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip
vertexStart:0
vertexCount:4
instanceCount:1];
[renderEncoder endEncoding];
// ADD SECOND, THIRD, ETC. PARALLEL RENDERING OPERATION HERE
.
.
.
// SUBMIT ALL RENDERING OPERATIONS IN PARALLEL HERE
[parallelRCE endEncoding];
__block dispatch_semaphore_t block_sema = self._inflight_semaphore;
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
dispatch_semaphore_signal(block_sema);
}];
if (drawable)
[commandBuffer presentDrawable:drawable];
[commandBuffer commit];
[commandBuffer waitUntilScheduled];
}
});
}
In the above example, you would duplicate the renderEncoder-related code for each operation you want to perform in parallel. I don't see how this would benefit your code example, though, as one operation appears to depend on another. The best you could hope for, then, is probably the code provided to you by warrenm, even though that doesn't really qualify as parallel rendering.
I'm building an iOS app using EZAudio. Its delegate returns a float** buffer, which contains float values indicating the volume detected. The delegate is called constantly, and its work is done on a different thread.
What I am trying to do is to take the float value from EZAudio and convert it into decibels.
EZAudioDelegate
Here's my simplified EZAudio Delegate for getting Microphone Data:
- (void)microphone:(EZMicrophone *)microphone hasAudioReceived:(float **)buffer withBufferSize:(UInt32)bufferSize withNumberOfChannels:(UInt32)numberOfChannels {
/*
* Returns a float array called buffer that contains the stereo signal data
* buffer[0] is the left audio channel
* buffer[1] is the right audio channel
*/
// Using a separate audio thread to not block the main UI thread
dispatch_async(dispatch_get_main_queue(), ^{
float decibels = [self getDecibelsFromVolume:buffer withBufferSize:bufferSize];
NSLog(@"Decibels: %f", decibels);
});
}
The Problem
The problem is that, after implementing the solutions from the links below, I don't understand how the code works. If someone could explain how it converts volume to decibels, I would be very grateful.
How to convert audio input to DB? #85
How to change the buffer size to increase FFT window? #50
Changing Buffer Size #84
The Code
The solution uses the following methods from the Accelerate Framework to convert the volume into decibels:
vDSP_vsq
vDSP_meanv
vDSP_vdbcon
Below is the method getDecibelsFromVolume that is called from the EZAudio Delegate. It is passed the float** buffer and bufferSize from the delegate.
- (float)getDecibelsFromVolume:(float**)buffer withBufferSize:(UInt32)bufferSize {
// Decibel Calculation.
float one = 1.0;
float meanVal = 0.0;
float tiny = 0.1;
float lastdbValue = 0.0;
vDSP_vsq(buffer[0], 1, buffer[0], 1, bufferSize);
vDSP_meanv(buffer[0], 1, &meanVal, bufferSize);
vDSP_vdbcon(&meanVal, 1, &one, &meanVal, 1, 1, 0);
// Exponential moving average to dB level to only get continous sounds.
float currentdb = 1.0 - (fabs(meanVal) / 100);
if (lastdbValue == INFINITY || lastdbValue == -INFINITY || isnan(lastdbValue)) {
lastdbValue = 0.0;
}
float dbValue = ((1.0 - tiny) * lastdbValue) + tiny * currentdb;
lastdbValue = dbValue;
return dbValue;
}
I'll explain how one would compute a dB value for a signal using code and then show how that relates to the vDSP example.
First, compute the RMS (root mean square) value of a chunk of data:
double sumSquared = 0;
for (int i = 0 ; i < numSamples ; i++)
{
sumSquared += samples[i]*samples[i];
}
double rms = sqrt(sumSquared/numSamples);
For more information on RMS
Next convert the RMS value to dB
double dBvalue = 20*log10(rms);
How this relates to the example code
vDSP_vsq(buffer[0], 1, buffer[0], 1, bufferSize);
This line loops over the buffer and squares all of the elements in place. If the buffer contained the values [1,2,3,4] before the call, then after the call it contains [1,4,9,16].
vDSP_meanv(buffer[0], 1, &meanVal, bufferSize);
This line loops over the buffer, summing the values, and then returns the sum divided by the number of elements. So for the input buffer [1,4,9,16] it computes the sum 30, divides by 4, and returns 7.5.
vDSP_vdbcon(&meanVal, 1, &one, &meanVal, 1, 1, 0);
This line converts the meanVal to decibels. There is really no point in calling a vectorized function here since it is only operating on a single element. What it is doing however is plugging the parameters into the following formula:
meanVal = n*log10(meanVal/one)
where n is either 10 or 20 depending on the last parameter. In this case it is 10. 10 is used for power measurements and 20 is used for amplitudes. I think 20 would make more sense for you to use.
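Concretely, that is just a change to the final flag of the same call from the question (a minimal sketch; note that meanVal as computed above is a mean of squares, i.e. a power-like quantity, which is the distinction the flag is meant to reflect):
// Pass 1 as the last argument for the amplitude form (20 * log10),
// or 0 for the power form (10 * log10) used in the original code.
vDSP_vdbcon(&meanVal, 1, &one, &meanVal, 1, 1, 1);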
The last little bit of code looks to be doing some simple smoothing of the result to make the meter a little less bouncy.
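One caveat about that smoothing as written: lastdbValue is a local variable re-initialised to 0.0 on every call, so the moving average never actually accumulates between invocations. For it to have any effect, the previous value has to persist across calls, for example (a sketch):
// Keep the previous value across calls (as a static or an instance variable)
// instead of declaring it as a local initialised to 0.0 each time.
static float lastdbValue = 0.0;
float dbValue = ((1.0 - tiny) * lastdbValue) + tiny * currentdb;
lastdbValue = dbValue;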
I am trying to port an existing FFT based low-pass filter to iOS using the Accelerate vDSP framework.
It seems like the FFT works as expected for about the first 1/4 of the sample, but after that the results seem wrong and, even more oddly, are mirrored (the last half of the signal mirrors most of the first half).
You can see the results from a test application below. First is plotted the original sampled data, then an example of the expected filtered results (filtering out signal higher than 15Hz), then finally the results of my current FFT code (note that the desired results and example FFT result are at a different scale than the original data):
The actual code for my low-pass filter is as follows:
double *lowpassFilterVector(double *accell, uint32_t sampleCount, double lowPassFreq, double sampleRate )
{
double stride = 1;
int ln = log2f(sampleCount);
int n = 1 << ln;
// So that we get an FFT of the whole data set, we pad out the array to the next highest power of 2.
int fullPadN = n * 2;
double *padAccell = malloc(sizeof(double) * fullPadN);
memset(padAccell, 0, sizeof(double) * fullPadN);
memcpy(padAccell, accell, sizeof(double) * sampleCount);
ln = log2f(fullPadN);
n = 1 << ln;
int nOver2 = n/2;
DSPDoubleSplitComplex A;
A.realp = (double *)malloc(sizeof(double) * nOver2);
A.imagp = (double *)malloc(sizeof(double) * nOver2);
// This can be reused, just including it here for simplicity.
FFTSetupD setupReal = vDSP_create_fftsetupD(ln, FFT_RADIX2);
vDSP_ctozD((DSPDoubleComplex*)padAccell,2,&A,1,nOver2);
// Use the FFT to get frequency counts
vDSP_fft_zripD(setupReal, &A, stride, ln, FFT_FORWARD);
const double factor = 0.5f;
vDSP_vsmulD(A.realp, 1, &factor, A.realp, 1, nOver2);
vDSP_vsmulD(A.imagp, 1, &factor, A.imagp, 1, nOver2);
A.realp[nOver2] = A.imagp[0];
A.imagp[0] = 0.0f;
A.imagp[nOver2] = 0.0f;
// Set frequencies above target to 0.
// This tells us which bin the frequencies over the minimum desired correspond to
NSInteger binLocation = (lowPassFreq * n) / sampleRate;
// We add 2 because bin 0 holds special FFT meta data, so bins really start at "1" - and we want to filter out anything OVER the target frequency
for ( NSInteger i = binLocation+2; i < nOver2; i++ )
{
A.realp[i] = 0;
}
// Clear out all imaginary parts
bzero(A.imagp, (nOver2) * sizeof(double));
//A.imagp[0] = A.realp[nOver2];
// Now shift back all of the values
vDSP_fft_zripD(setupReal, &A, stride, ln, FFT_INVERSE);
double *filteredAccell = (double *)malloc(sizeof(double) * fullPadN);
// Converts complex vector back into 2D array
vDSP_ztocD(&A, stride, (DSPDoubleComplex*)filteredAccell, 2, nOver2);
// Have to scale results to account for Apple's FFT library algorithm, see:
// http://developer.apple.com/library/ios/#documentation/Performance/Conceptual/vDSP_Programming_Guide/UsingFourierTransforms/UsingFourierTransforms.html#//apple_ref/doc/uid/TP40005147-CH202-15952
double scale = (float)1.0f / fullPadN;//(2.0f * (float)n);
vDSP_vsmulD(filteredAccell, 1, &scale, filteredAccell, 1, fullPadN);
// Tracks results of conversion
printf("\nInput & output:\n");
for (int k = 0; k < sampleCount; k++)
{
printf("%3d\t%6.2f\t%6.2f\t%6.2f\n", k, accell[k], padAccell[k], filteredAccell[k]);
}
// Acceleration data will be replaced in-place.
return filteredAccell;
}
In the original code the library was handling non power-of-two sizes of input data; in my Accelerate code I am padding out the input to the nearest power of two. In the case of the sample test below the original sample data is 1000 samples so it's padded to 1024. I don't think that would affect results but I include that for the sake of possible differences.
If you want to experiment with a solution, you can download the sample project that generates the graphs here (in the FFTTest folder):
FFT Example Project code
Thanks for any insight, I've not worked with FFT's before so I feel like I am missing something critical.
If you want a strictly real (not complex) result, then the data before the IFFT must be conjugate symmetric. If you don't want the result to be mirror symmetric, then don't zero the imaginary component before the IFFT. Merely zeroing bins before the IFFT creates a filter with a huge amount of ripple in the passband.
The Accelerate framework also supports more FFT lengths than just powers of 2.
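Applied to the code in the question, that means zeroing both parts of the stop-band bins only and dropping the blanket bzero of A.imagp, so the pass-band bins keep their phase information. A minimal sketch using the question's variable names (as noted above, a hard cutoff like this still leaves ripple in the passband):
// Zero only the bins above the cutoff, both real and imaginary parts,
// and leave the remaining imaginary components untouched; they carry the
// phase the inverse FFT needs to reconstruct a non-mirrored signal.
for ( NSInteger i = binLocation+2; i < nOver2; i++ )
{
    A.realp[i] = 0.0;
    A.imagp[i] = 0.0;
}
// ...and remove the bzero(A.imagp, (nOver2) * sizeof(double)) call entirely.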
I'm trying to get frequencies from the iPhone / iPod music library for a spectrum app, helping myself with reading-audio-samples-via-avassetreader to get audio samples and then with using-the-apple-fft-and-accelerate-framework and Apple's vDSP samples, but somewhere I'm going wrong and I'm unable to calculate the frequency.
So step by step:
read audio sample
Hanning window
calculate fft
Is this the correct way to get frequencies from an iPod mp3 library?
Here is my code:
static COMPLEX_SPLIT A;
static FFTSetup setupReal;
static uint32_t log2n, n, nOver2;
static int32_t stride;
static float *obtainedReal;
static float scale;
+ (void)initialize
{
log2n = 10;
n = 1 << log2n;
stride = 1;
nOver2 = n / 2;
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
obtainedReal = (float *) malloc(n * sizeof(float));
setupReal = vDSP_create_fftsetup(log2n, FFT_RADIX2);
}
- (float) performAcceleratedFastFourierTransForAudioBuffer:(AudioBufferList)ioData
{
NSUInteger * sampleIn = (NSUInteger *)ioData.mBuffers[0].mData;
for (int i = 0; i < nOver2; i++) {
double multiplier = 0.5 * (1 - cos(2*M_PI*i/nOver2-1));
A.realp[i] = multiplier * sampleIn[i];
A.imagp[i] = 0;
}
memset(ioData.mBuffers[0].mData, 0, ioData.mBuffers[0].mDataByteSize);
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);
vDSP_zvmags(&A, 1, A.realp, 1, nOver2);
scale = (float) 1.0 / (2 * n);
vDSP_vsmul(A.realp, 1, &scale, A.realp, 1, nOver2);
vDSP_vsmul(A.imagp, 1, &scale, A.imagp, 1, nOver2);
vDSP_ztoc(&A, 1, (COMPLEX *)obtainedReal, 2, nOver2);
int peakIndex = 0;
for (size_t i=1; i < nOver2-1; ++i) {
if ((obtainedReal[i] > obtainedReal[i-1]) && (obtainedReal[i] > obtainedReal[i+1]))
{
peakIndex = i;
break;
}
}
//here I don't know how to calculate frequency with my data
float frequency = obtainedReal[peakIndex-1] / 44100 / n;
vDSP_destroy_fftsetup(setupReal);
free(obtainedReal);
free(A.realp);
free(A.imagp);
return frequency;
}
I got 1.485757 and 1.332233 as my first frequencies
It looks to me like there is a problem in the conversion to complex input for the FFT. vDSP_ctoz() splits a buffer where real and imaginary components are interleaved into two buffers, one real and one imaginary. Your input to that function appears to be just real data that has been cast to COMPLEX. This means that your input buffer to vDSP_ctoz() is only half as long as it needs to be, and some garbage data beyond the buffer is getting converted.
You either need to make sampleOut 2*n in length and set every other value (the real parts), or better yet, bypass vDSP_ctoz() and copy your input data directly into A.realp while setting A.imagp to zeros. vDSP_ctoz() should only be needed when interfacing with a source that produces interleaved complex data.
Edit
Ok, I think I was wrong in my first suggestion: the vDSP documentation says that the real input to the real-to-complex in-place FFT should be formatted into the split complex format such that realp contains the even-indexed samples and imagp contains the odd-indexed samples. I have not actually used the vDSP library, but I am familiar with a lot of other FFT libraries and I missed that detail.
You should be able to find the peaks using A.realp after the call to vDSP_zvmags(&A, 1, A.realp, 1, nOver2); at that point, A.realp contains the squared magnitude of each FFT output bin, which is a real scalar value. If you are going to do the scaling, it should be done before the magnitude-squared operation, but it may not be needed if you are just looking for the peaks.
To get the real frequencies represented by the FFT output, use this formula:
F = (i * Fs) / N, i=0,1,...,N/2
where
i is the index of the FFT output buffer
Fs is the audio sampling rate
N is the FFT length
so your calculation might look like this:
float frequency = (peakIndex * 44100) / n;
Keep in mind that vDSP only returns the first half of the input spectrum for real input since the second half is redundant. So the FFT output represents frequencies from 0 to Fs/2.
One other note is that I don't know if your peak finding algorithm will work very well since FFT output will not be smooth and there will often be a lot of oscillation. You are simply taking the first sample where the two adjacent samples are lower. If you just want to find a single peak, it would be better just to find the max magnitude across the entire output. If you want to find multiple peaks, you will have to do something more sophisticated.
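For the simple single-peak case, a minimal sketch of scanning the whole magnitude-squared output for its maximum, using the variable names from the question (44100 is the assumed sample rate and n the FFT length):
// After vDSP_zvmags, A.realp[i] holds the squared magnitude of bin i.
int peakIndex = 0;
float peakValue = A.realp[0];
for (int i = 1; i < nOver2; i++) {
    if (A.realp[i] > peakValue) {
        peakValue = A.realp[i];
        peakIndex = i;
    }
}
// Convert the bin index to Hz with F = (i * Fs) / N; with n = 1024 each bin is ~43 Hz wide.
float frequency = (peakIndex * 44100.0f) / n;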