Using HLSL to invisibly stress a graphics card - How to stress the memory?

I've been developing, for a while now, an invisible (read: produces no visual output) stressor to test the capabilities of my graphics card (and as an exploration of DirectCompute in general, with which I'm pretty new). I've got the following code right now that I'm pretty proud of:
RWStructuredBuffer<uint> BufferOut : register(u0);
[numthreads(1, 1, 1)]
void CSMain( uint3 DTid : SV_DispatchThreadID )
{
    uint total = 0;
    float p = 0;
    while(p++ < 40.0){
        float s = 4.0;
        float M = pow(2.0, p) - 1.0;
        for(uint i = 0; i <= p - 2; i++)
        {
            s = ((s*s) - 2) % M;
        }
        if(s < 1.0) total++;
    }
    BufferOut[DTid.x] = total;
}
This runs the Lucas-Lehmer test for the first 40 powers of two. When I dispatch this code in a timed loop and watch my graphics card's stats in GPU-Z, my GPU load shoots to 99% for the duration. I'm pretty happy with this, but I also notice that the heat generated by a fully loaded GPU is actually pretty minimal (I'm getting about a 5 to 10 degree Celsius jump, nowhere near the heat jump I get when running, say, Borderlands 2). My thought is that most of the heat comes from memory accesses, so I would need to include consistent memory accesses across the run. My initial code looked like this:
RWStructuredBuffer<uint> BufferOut : register(u0);
groupshared float4 memory_buffer[1024];
[numthreads(1, 1, 1)]
void CSMain( uint3 DTid : SV_DispatchThreadID )
{
    uint total = 0;
    float p = 0;
    while(p++ < 40.0){
        [fastopt] // to lower compile times - code efficiency is strangely not what I'm looking for right now
        for(uint j = 0; j < 1024; ++j)
            memory_buffer[j] = float4(p, p, p, p); // touch shared memory on every pass
        float s = 4.0;
        float M = pow(2.0, p) - 1.0;
        for(uint i = 0; i <= p - 2; i++)
        {
            s = ((s*s) - 2) % M;
        }
        if(s < 1.0) total++;
    }
    BufferOut[DTid.x] = total;
}

Read a lot of non-coherent samples from large textures. Try both DXT1-compressed and uncompressed formats. Also use render-to-texture, and MRT. All of these will beat on the GPU's memory system.

Related

How to vectorize Mersenne Twister loops over arrays

I'm currently working with a custom implementation of the Mersenne Twister, and I'd like to improve my understanding of vector operations.
I have the following code:
#define N 624
#define M 397
for( k = N - 1; k; k-- )
{
    array[i] = (array[i] ^ ((array[i-1] ^ (array[i-1] >> 30)) * 1566083941UL)) - i;
    array[i] &= 0xffffffffUL;
    ++i;
    if ( i >= N )
    {
        array[0] = array[N-1];
        i = 1;
    }
}
Here I'm working with 32-bit integers only, so as I understand it, I could perform eight times as many operations at the same time using AVX2 instructions? How can I do that in practice?
I know how to deal with the addition of two vectors, but this case seems to be more complicated. I don't know how to begin.
For a scalar approach I'd work like this, but I'd like to be sure how to perform these actions in my case.
for (i = 0; i < 1024; i++)
{
    C[i] = A[i]*B[i];
}
for (i = 0; i < 1024; i+=4)
{
    C[i:i+3] = A[i:i+3]*B[i:i+3];
}
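For that simple element-wise multiply, the AVX2 translation is mechanical; here is a minimal sketch (hypothetical function name, assuming 32-bit integer arrays whose length is a multiple of 8). Note, though, that the Mersenne Twister loop above cannot be vectorized this way directly, because array[i] depends on array[i-1], which was written in the previous iteration.
#include <immintrin.h>
#include <stdint.h>

/* Sketch: C[i] = A[i] * B[i] on 32-bit integers, 8 lanes per AVX2 operation.
   Assumes n is a multiple of 8; unaligned loads/stores are used. */
static void multiply_avx2(const uint32_t *A, const uint32_t *B, uint32_t *C, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256i a = _mm256_loadu_si256((const __m256i *)(A + i));
        __m256i b = _mm256_loadu_si256((const __m256i *)(B + i));
        __m256i c = _mm256_mullo_epi32(a, b); /* low 32 bits of each product */
        _mm256_storeu_si256((__m256i *)(C + i), c);
    }
}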
Unfortunately there are no lessons about intrinsics at my university, but I'm quite curious and would like to get an improvement.
I'm also having some thoughts about how to create the array using vectors. Maybe a matrix? (Maybe _mm256_setr_epi32?)
I hope to get some advice regarding this topic!

Efficiently generate a sine wave in iOS

What is the most efficient way of generating a sine wave for a device running iOS? For the purposes of the exercise, assume a frequency of 440Hz, a sampling rate of 44100Hz and 1024 samples.
A vanilla C implementation looks something like this:
#include <math.h>

#define SAMPLES 1024
#define TWO_PI (3.14159 * 2)
#define FREQUENCY 440
#define SAMPLING_RATE 44100

int main(int argc, const char * argv[]) {
    float samples[SAMPLES];
    float phaseIncrement = TWO_PI * FREQUENCY / SAMPLING_RATE;
    float currentPhase = 0.0;
    for (int i = 0; i < SAMPLES; i++){
        samples[i] = sin(currentPhase);
        currentPhase += phaseIncrement;
    }
    return 0;
}
To take advantage of the Accelerate framework and the vecLib vvsinf function, the loop can be changed to do only the addition:
#include <Accelerate/Accelerate.h>

#define SAMPLES 1024
#define TWO_PI (3.14159 * 2)
#define FREQUENCY 440
#define SAMPLING_RATE 44100

int main(int argc, const char * argv[]) {
    float samples[SAMPLES] __attribute__ ((aligned));
    float results[SAMPLES] __attribute__ ((aligned));
    float phaseIncrement = TWO_PI * FREQUENCY / SAMPLING_RATE;
    float currentPhase = 0.0;
    for (int i = 0; i < SAMPLES; i++){
        samples[i] = currentPhase;
        currentPhase += phaseIncrement;
    }
    const int n = SAMPLES;
    vvsinf(results, samples, &n); // vvsinf takes a pointer to the element count
    return 0;
}
But is just applying the vvsinf function as far as I should go in terms of efficiency?
I don't really understand the Accelerate framework well enough to know whether I can also replace the loop. Is there a vecLib or vDSP function I can use?
For that matter, is it possible to use an entirely different algorithm to fill a buffer with a sine wave?
Given that you are computing the sine of a phase argument that increases in fixed increments, it is generally much faster to implement the signal generation with a recurrence equation, as described in the "How to Create Oscillators in Software" post and, in more detail, in the "DSP Trick: Sinusoidal Tone Generator" post, both on dspguru:
y[n] = 2*cos(w)*y[n-1] - y[n-2]
Note that this recurrence equation is subject to numerical roundoff error accumulation, so you should avoid computing too many samples at a time (your selection of SAMPLES == 1024 should be fine). The recurrence equation can be used after you have obtained the first two values y[0] and y[1] (the initial conditions). Since you are generating with an initial phase of 0, those are simply:
samples[0] = 0;
samples[1] = sin(phaseIncrement);
or more generally with an arbitrary initial phase (particularly useful to reinitialize the recurrence equation every so often to avoid the numerical roundoff error accumulation I mentioned earlier):
samples[0] = sin(initialPhase);
samples[1] = sin(initialPhase+phaseIncrement);
The recurrence equation can then be implemented directly with:
float scale = 2*cos(phaseIncrement);
// initialize the first 2 samples for the 0-initial-phase case
samples[0] = 0;
samples[1] = sin(phaseIncrement);
for (int i = 2; i < SAMPLES; i++){
    samples[i] = scale * samples[i-1] - samples[i-2];
}
Note that this implementation could be vectorized by computing multiple tones (each with the same frequency, but with larger phase increments between samples) with appropriate relative phase shifts, then interleaving the results to obtain the original tone (e.g. computing sin(4*w*n), sin(4*w*n+w), sin(4*w*n+2*w) and sin(4*w*n+3*w)). This would however make the implementation a lot more obscure, for a relatively small gain.
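As a minimal sketch of that interleaved variant (reusing the variables above): each of the four subsequences samples[4n+k] is itself a sinusoid at four times the frequency, so the same recurrence applies with a stride of 4, and the four recurrences are independent and can be evaluated in parallel.
// Sketch: stride-4 version of y[n] = 2*cos(w)*y[n-1] - y[n-2].
// sin(w*i) satisfies sin(w*i) = 2*cos(4*w)*sin(w*(i-4)) - sin(w*(i-8)),
// so the lanes i%4 == 0..3 can be updated independently.
float scale4 = 2*cos(4*phaseIncrement);
for (int i = 0; i < 8; i++)               // initial conditions: 2 per lane
    samples[i] = sin(i * phaseIncrement);
for (int i = 8; i < SAMPLES; i++)
    samples[i] = scale4 * samples[i-4] - samples[i-8];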
Alternatively, the equation can be implemented by making use of vDSP_deq22:
// set up a dummy array which will hold zeros as input
float nullInput[SAMPLES];
memset(nullInput, 0, SAMPLES * sizeof(float));
// set up the filter coefficients
float coefficients[5];
coefficients[0] = 0;
coefficients[1] = 0;
coefficients[2] = 0;
coefficients[3] = -2*cos(phaseIncrement);
coefficients[4] = 1.0;
// initialize the first 2 samples for the 0-initial-phase case
samples[0] = 0;
samples[1] = sin(phaseIncrement);
vDSP_deq22(nullInput, 1, coefficients, samples, 1, SAMPLES-2);
If efficiency is required, you could pre-load a 440Hz (44100 / 440 samples) sine waveform look-up table and loop around it without further mapping, or pre-load a 1Hz (44100 / 44100 samples) sine waveform look-up table and loop around it by skipping samples to reach 440Hz, just as you did by incrementing a phase counter. Using look-up tables should be faster than computing sin().
Method A (using a 440Hz sine waveform):
#define SAMPLES 1024
#define FREQUENCY 440
#define SAMPLING_RATE 44100
#define WAVEFORM_LENGTH (SAMPLING_RATE / FREQUENCY)

int main(int argc, const char * argv[]) {
    float waveform[WAVEFORM_LENGTH];
    LoadSinWaveForm(waveform);
    float samples[SAMPLES] __attribute__ ((aligned));
    for (int i = 0; i < SAMPLES; i++){
        // the table already contains the sine values, so no vvsinf is needed
        samples[i] = waveform[i % WAVEFORM_LENGTH];
    }
    return 0;
}
Method B (using a 1Hz sine waveform):
#define SAMPLES 1024
#define FREQUENCY 440
#define SAMPLING_RATE 44100
#define WAVEFORM_LENGTH SAMPLING_RATE // since it's 1Hz

int main(int argc, const char * argv[]) {
    float waveform[WAVEFORM_LENGTH];
    LoadSinWaveForm(waveform);
    float samples[SAMPLES] __attribute__ ((aligned));
    // with a 1Hz table of SAMPLING_RATE entries, a 440Hz tone advances
    // FREQUENCY table entries per output sample
    float phaseIncrement = (float)WAVEFORM_LENGTH * FREQUENCY / SAMPLING_RATE;
    float currentPhase = 0.0;
    for (int i = 0; i < SAMPLES; i++){
        samples[i] = waveform[(int)currentPhase % WAVEFORM_LENGTH];
        currentPhase += phaseIncrement;
    }
    return 0;
}
Please note that:
Method A is susceptible to frequency inaccuracy, because it assumes that your frequency always divides the sampling rate exactly, which is not true. That means you may get 441Hz, or 440Hz with a glitch.
Method B is susceptible to aliasing as the frequency goes up and gets closer to the Nyquist frequency, but it's a good trade-off between performance, quality and memory consumption when synthesizing reasonably low frequencies such as the one in your example.
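A common refinement of Method B is to interpolate linearly between the two nearest table entries instead of truncating the phase, which reduces the lookup quantization error at a small extra cost. A minimal sketch of the inner loop, reusing Method B's variables:
// Sketch: Method B's loop with linear interpolation between adjacent
// table entries.
for (int i = 0; i < SAMPLES; i++){
    int idx0 = (int)currentPhase;
    int idx1 = (idx0 + 1) % WAVEFORM_LENGTH;
    float frac = currentPhase - idx0;   // fractional table position
    samples[i] = waveform[idx0] + frac * (waveform[idx1] - waveform[idx0]);
    currentPhase += phaseIncrement;
    if (currentPhase >= WAVEFORM_LENGTH) // wrap to preserve float precision
        currentPhase -= WAVEFORM_LENGTH;
}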

Kiss fft does not work after giving it more than 32 samples

I am trying to take data from an accelerometer and apply Kiss FFT to the samples. I'm using a Freescale Kinetis FRDM-K22F board. I want to use 64 samples, but when I run the program I get an error saying "kiss fft usage error: improper alloc". I started turning down the sample size and saw that the FFT does work with 32 samples, but when given 33 samples the program just stops and returns no errors. Giving it any more samples gives similar results.
I played around with how I set up the FFT and followed a few websites and forum posts:
KissFFT output of kiss_fftr
http://digiphd.com/programming-reconstruction-fast-fourier-transform-real-signal-kiss-fft-libraries/
Kiss FFT on a dsPIC33
From what I can see, I haven't done anything different from what the above websites and forums have done. I've included my code below. Any help or advice is greatly appreciated.
void Sample_RUN()
{
    int size = 64;
    kiss_fft_scalar zero;
    memset(&zero, 0, sizeof(zero));
    kiss_fft_cpx fft_in[size];
    kiss_fft_cpx fft_out[size];
    kiss_fftr_cfg fft = kiss_fftr_alloc(size*2, 0, NULL, NULL);
    signed short samples[size];
    for (int i = 0; i < size; i++) {
        fft_in[i].r = zero;
        fft_in[i].i = zero;
        fft_out[i].r = zero;
        fft_out[i].i = zero;
    }
    printf("Data Collection Begins \r\n");
    for(int j = 0; j < size; j++)
    {
        for(;;)
        {
            dr_status = My_I2C_ReadByte(STATUS_REG);
            dr_status = (dr_status & 0x04);
            if (dr_status == 0x04)
            {
                // READING FROM ACCEL OUTPUT DATA REGISTERS
                AccelData[0] = My_I2C_ReadByte(OUT_X_MSB_REG);
                AccelData[1] = My_I2C_ReadByte(OUT_X_LSB_REG);
                AccelData[2] = My_I2C_ReadByte(OUT_Y_MSB_REG);
                AccelData[3] = My_I2C_ReadByte(OUT_Y_LSB_REG);
                AccelData[4] = My_I2C_ReadByte(OUT_Z_MSB_REG);
                AccelData[5] = My_I2C_ReadByte(OUT_Z_LSB_REG);
                // 14-bit accelerometer data
                Xout_Accel_14_bit = ((signed short) (AccelData[0]<<8 | AccelData[1])) >> 2; // Compute 14-bit X-axis acceleration output value
                Yout_Accel_14_bit = ((signed short) (AccelData[2]<<8 | AccelData[3])) >> 2; // Compute 14-bit Y-axis acceleration output value
                Zout_Accel_14_bit = ((signed short) (AccelData[4]<<8 | AccelData[5])) >> 2; // Compute 14-bit Z-axis acceleration output value
                mag_accel = sqrt(pow(Xout_Accel_14_bit, 2) + pow(Yout_Accel_14_bit, 2) + pow(Zout_Accel_14_bit, 2));
                printf("%d \r\n", mag_accel);
                samples[j] = mag_accel;
                break;
            } // end if
        } // end infinite for
    } // end for
    for (int j = 0; j < size; j++)
    {
        fft_in[j].r = samples[j];
        fft_in[j].i = zero;
        fft_out[j].r = zero;
        fft_out[j].i = zero;
    }
    printf("Executing FFT\r\n");
    kiss_fftr(fft, (kiss_fft_scalar*) fft_in, fft_out);
    printf("Printing FFT Outputs\r\n");
    for(int j = 0; j < size; j++)
    {
        printf("%d \r\n", fft_out[j].r);
    }
    kiss_fft_cleanup();
    free(fft);
} // end Sample_RUN
Sounds like you are running out of memory. I am not familiar with that chip, but perhaps you should be using the last two arguments of kiss_fftr_alloc so you can skip the heap allocation.
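As a sketch, assuming the kiss_fftr_alloc signature (nfft, inverse_fft, mem, lenmem): if you pass a caller-owned buffer and its length, the config is placed in that buffer instead of being malloc'd, and if the buffer is too small the call returns NULL and writes the required size into lenmem. The buffer size below is a placeholder guess, not a known-correct value:
#include <stdio.h>
#include "kiss_fftr.h"

#define NFFT 128                 /* size*2, as in the question */
#define FFT_MEM_SIZE 4096        /* a guess; kiss_fftr_alloc reports the real need */
static char fft_mem[FFT_MEM_SIZE];

kiss_fftr_cfg make_fft_cfg(void)
{
    size_t len = sizeof(fft_mem);
    kiss_fftr_cfg cfg = kiss_fftr_alloc(NFFT, 0, fft_mem, &len);
    if (cfg == NULL) {
        /* buffer too small: len now holds the required size */
        printf("need %u bytes for the FFT config\r\n", (unsigned)len);
    }
    return cfg; /* no free() needed: the memory is static */
}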

Fast Gaussian Blur image filter with ARM NEON

I'm trying to make a fast mobile version of the Gaussian Blur image filter.
I've read other questions, like: Fast Gaussian blur on unsigned char image - ARM NEON Intrinsics - iOS Dev
For my purpose I need only a fixed-size (7x7), fixed-sigma (2) Gaussian filter.
So, before optimizing for ARM NEON, I'm implementing a 1D Gaussian kernel in C++ and comparing performance with the OpenCV GaussianBlur() method directly in a mobile environment (Android with the NDK). This way it will result in much simpler code to optimize.
However, the result is that my implementation is 10 times slower than the OpenCV4Android version. I've read that OpenCV for Tegra has an optimized GaussianBlur implementation, but I don't think that standard OpenCV4Android has those kinds of optimizations, so why is my code so slow?
Here is my implementation (note: reflect101 is used for pixel reflection when applying the filter near borders):
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_8UC1);
    float sum, x1, y1;
    // coefficients of 1D gaussian kernel with sigma = 2
    double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
    // Normalize coeffs
    float coeffs_sum = 0.9230247873f;
    for (int i = 0; i < 7; i++){
        coeffs[i] /= coeffs_sum;
    }
    // filter vertically
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                y1 = reflect101(src.rows, y - i);
                sum += coeffs[i + 3]*src.at<uchar>(y1, x);
            }
            temp.at<uchar>(y,x) = sum;
        }
    }
    // filter horizontally
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                x1 = reflect101(src.cols, x - i); // reflect against the width here
                sum += coeffs[i + 3]*temp.at<uchar>(y, x1);
            }
            dst.at<uchar>(y,x) = sum;
        }
    }
    return dst;
}
A big part of the problem here is that the algorithm is overly precise, as @PaulR pointed out. It's usually best to keep your coefficient table no more precise than your data. In this case, since you appear to be processing uchar data, you would use roughly an 8-bit coefficient table.
Keeping these weights small will particularly matter in your NEON implementation, because the narrower your arithmetic is, the more lanes you can process at once.
Beyond that, the first major slowdown that stands out is having the image-edge reflection code within the main loop. That makes the bulk of the work less efficient, because it will generally not need to do anything special in that case.
It might work out better if you use a special version of the loop near the edges, and then, when you're safe from that, use a simplified inner loop that doesn't call the reflect101() function.
Second (more relevant to prototype code): it's possible to add the wings of the window together before applying the weighting function, because the table contains the same coefficients on both sides:
sum = src.at<uchar>(y, x) * coeffs[3]; // centre tap
for(int i = -3; i < 0; i++) {
    int tmp = src.at<uchar>(y + i, x) + src.at<uchar>(y - i, x);
    sum += coeffs[i + 3] * tmp;
}
This saves you six multiplies per pixel, and it's a step towards some other optimisations around controlling overflow conditions.
Then there are a couple of other problems related to the memory system.
The two-pass approach is good in principle, because it saves you from performing a lot of recomputation. Unfortunately it can push the useful data out of L1 cache, which can make everything quite a lot slower. It also means that when you write the result out to memory, you're quantising the intermediate sum, which can reduce precision.
When you convert this code to NEON, one of the things you will want to focus on is trying to keep your working set inside the register file, but without discarding calculations before they've been fully utilised.
When people do use two passes, it's usual for the intermediate data to be transposed -- that is, a column of input becomes a row of output.
This is because the CPU will really not like fetching small amounts of data across multiple lines of the input image. It works out much more efficient (because of the way the cache works) if you collect together a bunch of horizontal pixels and filter those. If the temporary buffer is transposed, then the second pass also collects together a bunch of horizontal points (which would be vertical in the original orientation), and it transposes its output again so it comes out the right way.
If you optimise to keep your working set localised, then you might not need this transposition trick, but it's worth knowing about so that you can set yourself a healthy baseline performance. Unfortunately, localisation like this does force you to go back to the non-optimal memory fetches, but with the wider data types that penalty can be mitigated.
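Combining the 8-bit coefficient table with the pair-summing trick above, a NEON inner loop for the interior of the horizontal pass might look like this minimal sketch (the function name and the coefficients-scaled-by-256 convention are assumptions; edge columns and any leftover interior pixels are left to scalar code):
#include <arm_neon.h>
#include <stdint.h>

// Sketch: symmetric 7-tap horizontal pass, interior pixels only, 8 outputs
// per iteration. coeffs[0..3] are the kernel weights for distances 3,2,1,0
// from the centre, pre-scaled by 256 so they sum to roughly 256.
static void blur7_row_interior_neon(const uint8_t *src, uint8_t *dst,
                                    int width, const uint16_t coeffs[4])
{
    for (int x = 3; x + 8 <= width - 3; x += 8) {
        // centre tap: widen to u16 and weight
        uint16x8_t sum = vmulq_n_u16(vmovl_u8(vld1_u8(src + x)), coeffs[3]);
        // symmetric pairs: add the two mirrored taps first, then weight once
        for (int i = 1; i <= 3; i++) {
            uint16x8_t pair = vaddl_u8(vld1_u8(src + x - i), vld1_u8(src + x + i));
            sum = vmlaq_n_u16(sum, pair, coeffs[3 - i]);
        }
        // divide by 256 with rounding and narrow back to u8
        vst1_u8(dst + x, vqrshrn_n_u16(sum, 8));
    }
}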
If this is specifically for 8-bit images then you really don't want floating-point coefficients, especially not double precision. Also, you don't want to use floats for x1, y1. You should just use integers for coordinates, and you can use fixed point (i.e. integer) for the coefficients to keep all the filter arithmetic in the integer domain, e.g.
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_16UC1); // <<<
    int sum, x1, y1; // <<<
    // coefficients of 1D gaussian kernel with sigma = 2
    double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
    int coeffs_i[7] = { 0 }; // <<<
    // Normalize coeffs
    float coeffs_sum = 0.9230247873f;
    for (int i = 0; i < 7; i++){
        coeffs_i[i] = (int)(coeffs[i] / coeffs_sum * 256); // <<<
    }
    // filter vertically
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0; // <<<
            for(int i = -3; i <= 3; i++){
                y1 = reflect101(src.rows, y - i);
                sum += coeffs_i[i + 3]*src.at<uchar>(y1, x); // <<<
            }
            temp.at<ushort>(y,x) = sum; // <<< 16-bit intermediate
        }
    }
    // filter horizontally
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0; // <<<
            for(int i = -3; i <= 3; i++){
                x1 = reflect101(src.cols, x - i);
                sum += coeffs_i[i + 3]*temp.at<ushort>(y, x1); // <<<
            }
            dst.at<uchar>(y,x) = sum / (256 * 256); // <<<
        }
    }
    return dst;
}
This is the code after implementing all the suggestions of @Paul R and @sh1, summarized as follows:
1) use only integer arithmetic (with precision to taste)
2) add the values of the pixels at the same distance from the mask centre before applying the multiplications, to reduce the number of multiplications
3) apply only horizontal filters, to take advantage of the row-major storage of the matrices
4) separate the loops over the edges from the loops over the inside of the image, so as not to make unnecessary calls to the reflection functions. I removed the reflection functions entirely, inlining them in the loops along the edges.
5) In addition, as a personal observation, to improve rounding without calling a (slow) "round" or "cvRound" function, I've added 0.5 (= 32768 in the fixed-point scale) to both the temporary and final pixel results, to reduce the error/difference compared to OpenCV.
Now the performance is much better: from about 15 times slower than OpenCV to about 6 times slower.
However, the resulting matrix is not perfectly identical to the one obtained with OpenCV's Gaussian Blur. This does not appear to be due to arithmetic width (which is sufficient), yet the difference remains. Note that it is a minimal difference, between 0 and 2 (in absolute value) of pixel intensity, between the matrices produced by the two versions. The coefficients are the same ones used by OpenCV, obtained with getGaussianKernel with the same size and sigma.
Mat myGaussianBlur(Mat src){
    Mat dst(src.cols, src.rows, CV_8UC1); // allocated transposed; transposed back at the end
    Mat temp(src.rows, src.cols, CV_8UC1);
    int sum;
    int x1;
    double coeffs[] = {0.070159, 0.131075, 0.190713, 0.216106, 0.190713, 0.131075, 0.070159};
    int coeffs_i[7] = { 0 };
    for (int i = 0; i < 7; i++){
        coeffs_i[i] = (int)(coeffs[i] * 65536);
    }
    // filter horizontally - inside the image
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = 3; x < (src.cols - 3); x++){
            sum = ptr[x] * coeffs_i[3];
            for(int i = -3; i < 0; i++){
                int tmp = ptr[x+i] + ptr[x-i];
                sum += coeffs_i[i + 3]*tmp;
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    // filter horizontally - edges - needs reflect
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = 0; x <= 2; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 < 0){
                    x1 = -x1; // reflect101 at the left edge
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = (src.cols - 3); x < src.cols; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 >= src.cols){
                    x1 = 2*src.cols - x1 - 2; // reflect101 at the right edge
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    // transpose to apply the horizontal filter again - better cache data locality
    transpose(temp, temp);
    // filter horizontally - inside the image
    // (loop bounds use temp's dimensions, which are swapped after the transpose)
    for(int y = 0; y < temp.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = 3; x < (temp.cols - 3); x++){
            sum = ptr[x] * coeffs_i[3];
            for(int i = -3; i < 0; i++){
                int tmp = ptr[x+i] + ptr[x-i];
                sum += coeffs_i[i + 3]*tmp;
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    // filter horizontally - edges - needs reflect
    for(int y = 0; y < temp.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = 0; x <= 2; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 < 0){
                    x1 = -x1;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    for(int y = 0; y < temp.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = (temp.cols - 3); x < temp.cols; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 >= temp.cols){
                    x1 = 2*temp.cols - x1 - 2;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    transpose(dst, dst);
    return dst;
}
According to Google's documentation, on Android devices using float/double is twice as slow as using int/uchar.
You may find some solutions for speeding up your C++ code in this Android document:
https://developer.android.com/training/articles/perf-tips

Cepstrum and Formant Tracking Using Apple Accelerate Framework

I've been using this web page as a guideline for formant tracking of speech...
http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
It all seems to be going pretty well, except for the last step, which is converting the cepstrum into a smoothed representation for simple peak picking for the formant tracking. The spectrograph looks good, and the cepstrograph (can I say that? :P) also looks good (from what I can tell), but at the final stage the results (the smoothed formant representation) are not what I expected.
I uploaded a sample of each stage as visual images to...
http://imgur.com/a/62duS
This sample is for the speech of the sound 'i' as in 'beed'. According to this site...
http://home.cc.umanitoba.ca/~robh/howto.html#formants
the first formant should come in at around 500hz, and the second and third at around 2200hz and 2800hz respectively. The spectrograph shows something very similar, but at the last stage I am getting results similar to...
F1 - 891
F2 - 1550
F3 - 2329
Any insight would be greatly appreciated. I've been going round in circles on this for some time. My code looks as follows...
// set up fft parameters
UInt32 log2n = 9;
UInt32 n = 512;
UInt32 window = n;
UInt32 halfN = n/2;
UInt32 stride = 1;
FFTSetup setupReal = [appDelegate getFftSetup];
int stepSize = (hpBuffer.sampleCount-window) / quantizeCount;
// calculate volume from raw samples, because it seems more reliable than fft
UInt32 volumeWindow = 128;
volumeBuffer = malloc(sizeof(float)*quantizeCount);
int windowPos = 0;
for (int i=0; i < quantizeCount; i++) {
    windowPos += stepSize;
    float total = 0.0f;
    float max = 0.0f;
    for (int p=windowPos; p < windowPos+volumeWindow; p++) {
        total += sampleBuffer.buffer[p];
        if (sampleBuffer.buffer[p] > max)
            max = sampleBuffer.buffer[p];
    }
    volumeBuffer[i] = max;
}
// normalize volumebuffer
[FloatAudioBuffer normalizePositiveBuffer:volumeBuffer ofSize:quantizeCount];
// allocate memory for complex array
COMPLEX_SPLIT complexArray;
complexArray.realp = (float*)malloc(4096*sizeof(float));
complexArray.imagp = (float*)malloc(4096*sizeof(float));
// allocate some space for a temporary hamming buffer
float *hamBuffer = malloc(n*sizeof(float));
// create spectrum and feature buffers
spectrumBuffer = malloc(sizeof(float)*halfN*quantizeCount);
formantBuffer = malloc(sizeof(float)*4096*quantizeCount);
cepstrumBuffer = malloc(sizeof(float)*halfN*quantizeCount);
lowCepstrumBuffer = malloc(sizeof(float)*featureCount*quantizeCount);
featureBuffer = malloc(sizeof(float)*featureCount*quantizeCount);
// create a data point for each quantize segment
float TWOPI = 2.0f * M_PI;
for (int s=0; s < quantizeCount; s++) {
    // copy buffer data into a separate array and apply hamming window
    int offset = (int)(s * stepSize);
    for (int i=0; i < n; i++)
        hamBuffer[i] = hpBuffer.buffer[offset+i] * ((1.0f-0.46f) - 0.46f*cos(TWOPI*i/((float)n-1.0f)));
    // configure float array into acceptable input array format (interleaved)
    vDSP_ctoz((COMPLEX*)hamBuffer, 2, &complexArray, 1, halfN);
    // run FFT
    vDSP_fft_zrip(setupReal, &complexArray, stride, log2n, FFT_FORWARD);
    // absolute square (equivalent to mag^2)
    complexArray.imagp[0] = 0.0f;
    vDSP_zvmags(&complexArray, 1, complexArray.realp, 1, halfN);
    bzero(complexArray.imagp, (halfN) * sizeof(float));
    // scale
    float scale = 1.0f / (2.0f*(float)n);
    vDSP_vsmul(complexArray.realp, 1, &scale, complexArray.realp, 1, halfN);
    // get log of absolute values for passing to inverse FFT for cepstrum
    for (int i=0; i < halfN; i++)
        complexArray.realp[i] = logf(sqrtf(complexArray.realp[i]));
    // save this into the spectrum buffer
    memcpy(&spectrumBuffer[s*halfN], complexArray.realp, halfN*sizeof(float));
    // convert spectrum to interleaved, ready for inverse fft
    vDSP_ctoz((COMPLEX*)&spectrumBuffer[s*halfN], 2, &complexArray, 1, halfN/2);
    // create cepstrum
    vDSP_fft_zrip(setupReal, &complexArray, stride, log2n-1, FFT_INVERSE);
    // convert interleaved to real, straight into the cepstrum buffer
    vDSP_ztoc(&complexArray, 1, (COMPLEX*)&cepstrumBuffer[s*halfN], 2, halfN/2);
    // copy the first part of the cepstrum into the low cepstrum buffer
    memcpy(&lowCepstrumBuffer[s*featureCount], &cepstrumBuffer[s*halfN], featureCount*sizeof(float));
    // make an 8192-point array based on the first 15 values
    float *tempArray = malloc(8192*sizeof(float));
    for (int i=0; i < 8192; i++) {
        if (i < 15)
            tempArray[i] = cepstrumBuffer[s*halfN+i];
        else
            tempArray[i] = 0.0f;
    }
    vDSP_ctoz((COMPLEX*)tempArray, 2, &complexArray, 1, 4096);
    float newLog2n = log2f(8192.0f);
    complexArray.imagp[0] = 0.0f;
    vDSP_fft_zrip(setupReal, &complexArray, stride, newLog2n, FFT_FORWARD);
    vDSP_zvmags(&complexArray, 1, complexArray.realp, 1, 4096);
    bzero(complexArray.imagp, (4096) * sizeof(float));
    // scale
    scale = 1.0f / (2.0f*(float)8192);
    vDSP_vsmul(complexArray.realp, 1, &scale, complexArray.realp, 1, 4096);
    // get magnitude
    for (int i=0; i < 4096; i++)
        complexArray.realp[i] = sqrtf(complexArray.realp[i]);
    // write to the formant buffer
    memcpy(&formantBuffer[s*4096], complexArray.realp, 4096*sizeof(float));
    // the complex array now contains the formant spectrum
    // it's large, so get features here!
    // try a simple peak-picking algorithm for the first 3 formants
    int formantIndex = 0;
    float *peaks = malloc(6*sizeof(float));
    for (int i=0; i < 6; i++)
        peaks[i] = 0.0f;
    for (int i=1; i < 4096-1 && formantIndex < 6; i++) {
        if (complexArray.realp[i-1] < complexArray.realp[i] &&
            complexArray.realp[i+1] < complexArray.realp[i])
            peaks[formantIndex++] = i;
    }
