I want to measure the similarity between two grayscale images of the same size using mean squared error. I can't use any framework that is not part of the macOS SDK (e.g. OpenCV, Eigen). A simple implementation of this algorithm without vectorization looks like this:
vImage_Buffer imgA;
vImage_Buffer imgB;
NSUInteger mse = 0;
unsigned char *pxlsA = (unsigned char *)imgA.data;
unsigned char *pxlsB = (unsigned char *)imgB.data;
for (size_t i = 0; i < imgA.height * imgA.width; ++i) {
NSInteger d = (NSInteger)pxlsA[i] - (NSInteger)pxlsB[i];
mse += d * d;
}
Is there a way to do this without a loop, in a more vectorized way? Maybe something like:
mse = ((imgA - imgB) ^ 2).sum();
The answer to this question lies in the vDSP library, which is part of the macOS SDK.
https://developer.apple.com/documentation/accelerate/vdsp
vDSP - Perform basic arithmetic operations and common digital signal processing routines on large vectors.
In my situation the vectors are not really that big, but still.
Firstly, you need to convert unsigned char * to float *. This is a significant step, and I don't know how to do it without a loop. Then you need two vDSP functions: vDSP_vsbsbm and vDSP_sve.
vDSP_vsbsbm - Multiplies the difference of two single-precision vectors by a second difference of two single-precision vectors.
vDSP_sve - Calculates the sum of values in a single-precision vector.
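To make the two descriptions above concrete, here is a scalar sketch of what each call computes with unit strides. The names ref_vsbsbm and ref_sve are hypothetical stand-ins written in plain C so the example compiles without Accelerate; they are not part of vDSP.

```c
#include <stddef.h>

/* Scalar reference for vDSP_vsbsbm with unit strides:
   out[i] = (a[i] - b[i]) * (c[i] - d[i]).
   Hypothetical helper, not an Accelerate function. */
void ref_vsbsbm(const float *a, const float *b,
                const float *c, const float *d,
                float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (a[i] - b[i]) * (c[i] - d[i]);
}

/* Scalar reference for vDSP_sve with unit stride: sum of all elements.
   Hypothetical helper, not an Accelerate function. */
float ref_sve(const float *v, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}
```

Passing the same pair of vectors twice, as in ref_vsbsbm(a, b, a, b, out, n), yields the squared differences, and ref_sve then gives their sum. As for the conversion step, Accelerate also declares vDSP_vfltu8, which converts unsigned 8-bit values to float and may remove the need for a hand-written conversion loop; I have not measured whether it changes the timing.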
So the final code looks like this:
float *fpxlsA = (float *)malloc(imgA.height * imgA.width * sizeof(float));
float *fpxlsB = (float *)malloc(imgB.height * imgB.width * sizeof(float));
float *output = (float *)malloc(imgB.height * imgB.width * sizeof(float));
for (size_t i = 0; i < imgA.height * imgA.width; ++i) {
fpxlsA[i] = (float)(pxlsA[i]);
fpxlsB[i] = (float)(pxlsB[i]);
}
vDSP_vsbsbm(fpxlsA, 1, fpxlsB, 1, fpxlsA, 1, fpxlsB, 1, output, 1, imgA.height * imgB.width);
float sum;
vDSP_sve(output, 1, &sum, imgA.height * imgB.width);
free(output);
free(fpxlsA);
free(fpxlsB);
This code does exactly what I wanted, in a more vectorized form, but the result isn't good enough. Comparing the performance of the two approaches, vDSP is about twice as fast when no additional memory allocation is involved. In practice, where the additional allocation does take place, the loop approach is slightly faster.
This appears to be part of macOS: https://developer.apple.com/documentation/accelerate
A nice, fast way to write the loop using pointer arithmetic would be as follows:
int d;
size_t i = imgA.height * imgA.width;
while ( i -- )
{
d = ( int )(*pxlsA++) - ( int )(*pxlsB++);
mse += d * d;
}
EDIT
Oops: since those are unsigned chars and we are calculating a difference, we need to use signed integers to do so.
And another edit: you must use pxls... here; I don't know what img... is.
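For reference, a minimal portable sketch of the whole calculation, including the final division by the pixel count that turns the sum of squared differences into a mean. The function name mse_u8 is made up for illustration, and the sketch assumes both buffers are tightly packed with the same number of pixels.

```c
#include <stddef.h>
#include <stdint.h>

/* Mean squared error of two 8-bit grayscale buffers of equal length.
   Returns 0.0 for an empty buffer. Hypothetical helper name. */
double mse_u8(const uint8_t *a, const uint8_t *b, size_t count)
{
    if (count == 0)
        return 0.0;
    uint64_t sum = 0;
    for (size_t i = 0; i < count; ++i) {
        int d = (int)a[i] - (int)b[i];   /* signed difference, -255..255 */
        sum += (uint64_t)(d * d);        /* d * d is at most 65025 */
    }
    return (double)sum / (double)count;
}
```

A vImage_Buffer also carries a rowBytes field, and rows may be padded so that rowBytes is larger than width. If that is the case for your buffers, it is probably safer to call a function like this once per row instead of treating data as one block of height * width bytes.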
I am attempting to implement a Fast Fourier Transform with associated complex magnitude function on the STM32F411RE Nucleo developer board. My goal is to separate a combined signal with multiple sinusoidal elements into their separate frequency components, with correct amplitude.
My issue is that I cannot correctly line up the frequency bin outcomes from the complex magnitude function with the frequencies. I am also starting to question the validity of these outcomes as such.
I have tried to use a number of different implementations posted by people for the FFT algorithm with the magnitude fix, most notably the examples listed on Stack Overflow by SleuthEye and the blog by LB9MG.
AFAIK I have a similar approach, but somehow their approaches yield the desired results and mine do not. Below is my code that I have altered to work via the implementation that SleuthEye has created.
int main(void)
{
fftLen = 32; // can be 32, 64, 128, 256, 512, 1024, 2048, 4096
half_fftLen = fftLen/2;
volatile float32_t sampleFreq = 50 * fftLen; // Fs = binsize * fft length, desired binsize = 50 hz
arm_rfft_fast_instance_f32 inst;
arm_status status;
status = arm_rfft_fast_init_f32(&inst, fftLen);
float32_t signalCombined[fftLen] = {0};
float32_t fftCombined[fftLen] = {0};
float32_t fftMagnitude[fftLen] = {0};
volatile float32_t fftFreq[fftLen] = {0};
float32_t maxAmp;
uint32_t maxAmpInd;
while (1)
{
for (int i = 0; i< fftLen; i++)
{
signalCombined[i] = 40 * arm_sin_f32(450 * i); // 450 frequency at 40 amplitude
}
arm_rfft_fast_f32(&inst, signalCombined, fftCombined, 0); // perhaps switch to complex transform to allow for negative frequencies?
arm_cmplx_mag_f32(fftCombined, fftMagnitude, half_fftLen);
fftMagnitude[0] = fftCombined[0];
fftMagnitude[half_fftLen] = fftCombined[1];
arm_max_f32(fftMagnitude, half_fftLen, &maxAmp, &maxAmpInd); // We need the 3 max values
for (int k = 0; k < fftLen ; k++)
{
fftFreq[k] = ((k*sampleFreq)/fftLen);
}
}
Shown below are the results that I get out of the code listed above: whilst I do get a magnitude out of the algorithms (at the correct index 12), it does not correspond to the frequency or the amplitude of the input array signalCombined[].
Does anyone have an idea of why this is happening? Like so many of my errors it is probably a really trivial and stupid thing, but I cannot figure out for the life of me why this is happening.
EDIT: thanks to SleuthEye's help finding the frequencies is now possible, as the initial approach for generating the sin() signal was done incorrectly.
Some new issues popped up as the FFT only appears to yield the correct frequencies for the 32 samples, despite the bin size scaling accordingly to accommodate the adjusted sample size.
I am also unable to implement the amplitude fixing algorithm: as per SleuthEye's link with the example code 2*(1/N)*abs(X(k))^2, I have made my own implementation 2 * powf(fabs(fftMagnitude[j]), 2) / fftLen as shown in the code below, but this does not yield results that are even close to correct.
while (1)
{
for (int i = 0; i < fftLen; i++)
{
signalCombined[i] = 400 * arm_sin_f32(2 * PI * 450 * i / sampleFreq); // Sin Alpha, 400 amp at 10 kHz
// 700 * arm_sin_f32(2 * PI * 33000 * i / sampleFreq) + // Sin Bravo, 700 amp at 33 kHz
// 300 * arm_sin_f32(2 * PI * 50000 * i / sampleFreq); // Sin Charlie, 300 amp at 50 kHz
}
arm_rfft_fast_f32(&inst, signalCombined, fftCombined, 0); // calculate the fourier transform of the time domain signal
arm_cmplx_mag_f32(fftCombined, fftMagnitude, half_fftLen); // calculate the magnitude of the fourier transform
fftMagnitude[0] = fftCombined[0];
fftMagnitude[half_fftLen] = fftCombined[1];
for (int j = 0; j < sizeof(fftMagnitude); j++)
{
fftMagnitude[j] = 2 * powf(fabs(fftMagnitude[j]), 2) / fftLen; // Algorithm to fix the amplitude of each unique frequency
}
arm_max_f32(fftMagnitude, half_fftLen, &maxAmp, &maxAmpInd); // We need the 3 max values
for (int k = 0; k < fftLen ; k++)
{
fftFreq[k] = ((k*sampleFreq)/fftLen);
}
}
Your tone generation does not take into account the sampling frequency of 1600Hz, so you are effectively generating a tone at a frequency of 450*1600/(2*PI) ~ 114591Hz which gets aliased to ~608Hz. That 608Hz frequency roughly corresponds to a frequency index around 12 when using an FFT size of 32.
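The aliasing arithmetic above can be checked with a small sketch. The helper name aliased_frequency is hypothetical; it simply folds a frequency into the range 0 to Fs/2.

```c
/* Apparent frequency (Hz) of a tone at f Hz after sampling at fs Hz,
   folded into the range 0..fs/2. Hypothetical helper; assumes fs > 0
   and that f / fs fits in a long long. */
double aliased_frequency(double f, double fs)
{
    if (f < 0.0)
        f = -f;
    double r = f - fs * (double)(long long)(f / fs);  /* remainder of f / fs */
    return (r > fs / 2.0) ? fs - r : r;
}
```

With f = 450 * 1600 / (2 * PI) and fs = 1600 this gives roughly 608 Hz, and dividing by the 50 Hz bin size lands near index 12.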
The generation of a 450Hz tone at a 1600Hz sampling frequency should be done as follows:
for (int i = 0; i< fftLen; i++)
{
signalCombined[i] = 40 * arm_sin_f32(2 * PI * 450 * i / sampleFreq);
}
As far as matching the amplitude, keep in mind that there is a scaling factor between the time domain and the frequency domain of approximately 0.5*fftLen (see this other post of mine).
I am building some kind of an audio fader effect.
I am using vDSP_vdbcon to turn a buffer of volumes into decibels, applying some modifications in db-space and would like to convert the decibel buffer into volume using the accelerate framework.
Thanks!
Here is what I use for each element for decibel values between -40 and 0. It gives pretty good results.
float decibelsToMag(float decibels){
return pow (10, (0.05 * decibels));
}
I don't know the Accelerate vector equivalent for the pow function. But here's a half vectorized version.
void decibelsToMags(float *decibels, float *mag, int count){
float mul = 0.05;
vDSP_vsmul(decibels, 1, &mul, mag, 1, count);
for (int i = 0; i < count; i++) {
mag[i] = pow(10,mag[i]);
}
}
Post back if you can figure out the vDSP version of the loop.
I am trying to port an existing FFT based low-pass filter to iOS using the Accelerate vDSP framework.
It seems like the FFT works as expected for about the first 1/4 of the sample. But after that the results seem wrong and, even more oddly, are mirrored (with the last half of the signal mirroring most of the first half).
You can see the results from a test application below. First is plotted the original sampled data, then an example of the expected filtered results (filtering out signal higher than 15Hz), then finally the results of my current FFT code (note that the desired results and example FFT result are at a different scale than the original data):
The actual code for my low-pass filter is as follows:
double *lowpassFilterVector(double *accell, uint32_t sampleCount, double lowPassFreq, double sampleRate )
{
double stride = 1;
int ln = log2f(sampleCount);
int n = 1 << ln;
// So that we get an FFT of the whole data set, we pad out the array to the next highest power of 2.
int fullPadN = n * 2;
double *padAccell = malloc(sizeof(double) * fullPadN);
memset(padAccell, 0, sizeof(double) * fullPadN);
memcpy(padAccell, accell, sizeof(double) * sampleCount);
ln = log2f(fullPadN);
n = 1 << ln;
int nOver2 = n/2;
DSPDoubleSplitComplex A;
A.realp = (double *)malloc(sizeof(double) * nOver2);
A.imagp = (double *)malloc(sizeof(double) * nOver2);
// This can be reused, just including it here for simplicity.
FFTSetupD setupReal = vDSP_create_fftsetupD(ln, FFT_RADIX2);
vDSP_ctozD((DSPDoubleComplex*)padAccell,2,&A,1,nOver2);
// Use the FFT to get frequency counts
vDSP_fft_zripD(setupReal, &A, stride, ln, FFT_FORWARD);
const double factor = 0.5f;
vDSP_vsmulD(A.realp, 1, &factor, A.realp, 1, nOver2);
vDSP_vsmulD(A.imagp, 1, &factor, A.imagp, 1, nOver2);
A.realp[nOver2] = A.imagp[0];
A.imagp[0] = 0.0f;
A.imagp[nOver2] = 0.0f;
// Set frequencies above target to 0.
// This tells us which bin the frequencies over the minimum desired correspond to
NSInteger binLocation = (lowPassFreq * n) / sampleRate;
// We add 2 because bin 0 holds special FFT meta data, so bins really start at "1" - and we want to filter out anything OVER the target frequency
for ( NSInteger i = binLocation+2; i < nOver2; i++ )
{
A.realp[i] = 0;
}
// Clear out all imaginary parts
bzero(A.imagp, (nOver2) * sizeof(double));
//A.imagp[0] = A.realp[nOver2];
// Now shift back all of the values
vDSP_fft_zripD(setupReal, &A, stride, ln, FFT_INVERSE);
double *filteredAccell = (double *)malloc(sizeof(double) * fullPadN);
// Converts complex vector back into 2D array
vDSP_ztocD(&A, stride, (DSPDoubleComplex*)filteredAccell, 2, nOver2);
// Have to scale results to account for Apple's FFT library algorithm, see:
// http://developer.apple.com/library/ios/#documentation/Performance/Conceptual/vDSP_Programming_Guide/UsingFourierTransforms/UsingFourierTransforms.html#//apple_ref/doc/uid/TP40005147-CH202-15952
double scale = (float)1.0f / fullPadN;//(2.0f * (float)n);
vDSP_vsmulD(filteredAccell, 1, &scale, filteredAccell, 1, fullPadN);
// Tracks results of conversion
printf("\nInput & output:\n");
for (int k = 0; k < sampleCount; k++)
{
printf("%3d\t%6.2f\t%6.2f\t%6.2f\n", k, accell[k], padAccell[k], filteredAccell[k]);
}
// Acceleration data will be replaced in-place.
return filteredAccell;
}
In the original code the library was handling non power-of-two sizes of input data; in my Accelerate code I am padding out the input to the nearest power of two. In the case of the sample test below the original sample data is 1000 samples so it's padded to 1024. I don't think that would affect results but I include that for the sake of possible differences.
If you want to experiment with a solution, you can download the sample project that generates the graphs here (in the FFTTest folder):
FFT Example Project code
Thanks for any insight, I've not worked with FFT's before so I feel like I am missing something critical.
If you want a strictly real (not complex) result, then the data before the IFFT must be conjugate symmetric. If you don't want the result to be mirror symmetric, then don't zero the imaginary component before the IFFT. Merely zeroing bins before the IFFT creates a filter with a huge amount of ripple in the passband.
The Accelerate framework also supports more FFT lengths than just powers of 2.
//EDIT...
I'm editing my question slightly to address the issue of working specifically with non-power-of-two images. I've got a basic structure that works with square grayscale images with sizes like 256x256 or 1024x1024, but can't see how to generalize to arbitrarily sized images. The fft functions seem to want you to include the log2 of the width and height, but then it's unclear how to unpack the resulting data, or whether the data is just getting scrambled. I suppose the obvious thing to do would be to center the npot image within a larger, all-black image and then ignore any values in those positions when looking at the data. But I'm wondering if there's a less awkward way to work with npot data.
//...END EDIT
I'm having a bit of trouble with the Accelerate Framework documentation. I would normally use FFTW3, but I'm having trouble getting that to compile on an actual iOS device (see this question). Can anybody point me to a super simple implementation using Accelerate that does something like the following:
1) Turns image data into an appropriate data structure that can be passed to Accelerate's FFT methods.
In FFTW3, at its simplest, using a grayscale image, this involves placing the unsigned bytes into a "fftw_complex" array, which is simply a struct of two floats, one holding the real value and the other the imaginary (and where the imaginary is initialized to zero for each pixel).
2) Takes this data structure and performs an FFT on it.
3) Prints out the magnitude and phase.
4) Performs an IFFT on it.
5) Recreates the original image from the data resulting from the IFFT.
Although this is a very basic example, I am having trouble using the documentation from Apple's site. The SO answer by Pi here is very helpful, but I am still somewhat confused about how to use Accelerate to do this basic functionality using a grayscale (or color) 2D image.
Anyhow, any pointers or especially some simple working code that processes a 2D image would be extremely helpful!
\\\ EDIT \\\
Okay, after taking some time to dive into the documentation and some very helpful code on SO as well as on pkmital's github repo, I've got some working code that I thought I'd post since 1) it took me a while to figure it out and 2) since I have a couple of remaining questions...
Initialize FFT "plan". Assuming a square power-of-two image:
#include <Accelerate/Accelerate.h>
...
UInt32 N = log2(length*length);
UInt32 log2nr = N / 2;
UInt32 log2nc = N / 2;
UInt32 numElements = 1 << ( log2nr + log2nc );
float SCALE = 1.0/numElements;
SInt32 rowStride = 1;
SInt32 columnStride = 0;
FFTSetup setup = create_fftsetup(MAX(log2nr, log2nc), FFT_RADIX2);
Pass in a byte array for a square power-of-two grayscale image and turn it into a COMPLEX_SPLIT:
COMPLEX_SPLIT in_fft;
in_fft.realp = ( float* ) malloc ( numElements * sizeof ( float ) );
in_fft.imagp = ( float* ) malloc ( numElements * sizeof ( float ) );
for ( UInt32 i = 0; i < numElements; i++ ) {
if (i < t->width * t->height) {
in_fft.realp[i] = t->data[i] / 255.0;
in_fft.imagp[i] = 0.0;
}
}
Run the FFT on the transformed image data, then grab the magnitude and phase:
COMPLEX_SPLIT out_fft;
out_fft.realp = ( float* ) malloc ( numElements * sizeof ( float ) );
out_fft.imagp = ( float* ) malloc ( numElements * sizeof ( float ) );
fft2d_zop ( setup, &in_fft, rowStride, columnStride, &out_fft, rowStride, columnStride, log2nc, log2nr, FFT_FORWARD );
magnitude = (float *) malloc(numElements * sizeof(float));
phase = (float *) malloc(numElements * sizeof(float));
for (int i = 0; i < numElements; i++) {
magnitude[i] = sqrt(out_fft.realp[i] * out_fft.realp[i] + out_fft.imagp[i] * out_fft.imagp[i]) ;
phase[i] = atan2(out_fft.imagp[i],out_fft.realp[i]);
}
Now you can run an IFFT on the out_fft data to get the original image...
COMPLEX_SPLIT out_ifft;
out_ifft.realp = ( float* ) malloc ( numElements * sizeof ( float ) );
out_ifft.imagp = ( float* ) malloc ( numElements * sizeof ( float ) );
fft2d_zop (setup, &out_fft, rowStride, columnStride, &out_ifft, rowStride, columnStride, log2nc, log2nr, FFT_INVERSE);
vsmul( out_ifft.realp, 1, &SCALE, out_ifft.realp, 1, numElements );
vsmul( out_ifft.imagp, 1, &SCALE, out_ifft.imagp, 1, numElements );
Or you can run an IFFT on the magnitude to get an autocorrelation...
COMPLEX_SPLIT in_ifft;
in_ifft.realp = ( float* ) malloc ( numElements * sizeof ( float ) );
in_ifft.imagp = ( float* ) malloc ( numElements * sizeof ( float ) );
for (int i = 0; i < numElements; i++) {
in_ifft.realp[i] = (magnitude[i]);
in_ifft.imagp[i] = 0.0;
}
fft2d_zop ( setup, &in_ifft, rowStride, columnStride, &out_ifft, rowStride, columnStride, log2nc, log2nr, FFT_INVERSE );
vsmul( out_ifft.realp, 1, &SCALE, out_ifft.realp, 1, numElements );
vsmul( out_ifft.imagp, 1, &SCALE, out_ifft.imagp, 1, numElements );
Finally, you can put the ifft results back into an image array:
for ( UInt32 i = 0; i < numElements; i++ ) {
t->data[i] = (int) (out_ifft.realp[i] * 255.0);
}
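One thing the plain cast in the loop above does not guard against is values that fall slightly outside 0..1 after the IFFT, which can happen through rounding error or when the result is an autocorrelation instead of the original image. A hedged sketch of a rounding, clamping conversion (unit_float_to_byte is a name invented for this example):

```c
#include <stdint.h>

/* Map a normalised sample (nominally 0.0..1.0) to an 8-bit pixel,
   rounding to nearest and clamping out-of-range values.
   Hypothetical helper name. */
uint8_t unit_float_to_byte(float v)
{
    float scaled = v * 255.0f + 0.5f;
    if (!(scaled > 0.0f))        /* negative values and NaN map to 0 */
        return 0;
    if (scaled >= 255.0f)
        return 255;
    return (uint8_t)scaled;      /* truncation after the +0.5 rounds to nearest */
}
```

The loop body would then read t->data[i] = unit_float_to_byte(out_ifft.realp[i]); assuming t->data holds unsigned bytes.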
I haven't figured out how to use the Accelerate framework to handle non-power-of-two images. If I allocate enough memory in the setup, then I can do an FFT, followed by an IFFT, to get my original image. But if I try to do an autocorrelation (with the magnitude of the FFT), then I get wonky results. I'm not sure of the best way to pad the image appropriately, so hopefully someone has an idea of how to do this. (Or share a working version of the vDSP_conv method!)
I would say that in order to perform work on arbitrary image sizes, all you have to do is size your input value array appropriately to the next power of 2.
The hard part is where to put your original image data and what to fill with. What you are really trying to do to the image or data mine from the image is crucial.
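As a starting point, here is a minimal sketch of the sizing and copying step. It places the image in the top-left corner of a zero-filled buffer, which is only one of several possible placements (centering is another), and the right choice depends on what you want from the transform. The names next_pow2 and pad_to_pow2 are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Smallest power of two >= n (returns 1 for n == 0).
   Assumes n is small enough that the result fits in size_t. */
size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Copy a width x height float image into the top-left corner of a newly
   allocated, zero-filled padW x padH buffer. The caller frees the result.
   Returns NULL on allocation failure. */
float *pad_to_pow2(const float *src, size_t width, size_t height,
                   size_t *padW, size_t *padH)
{
    *padW = next_pow2(width);
    *padH = next_pow2(height);
    float *dst = calloc(*padW * *padH, sizeof *dst);
    if (!dst)
        return NULL;
    for (size_t y = 0; y < height; ++y)
        for (size_t x = 0; x < width; ++x)
            dst[y * *padW + x] = src[y * width + x];
    return dst;
}
```

For a 900 by 900 image this produces a 1024 by 1024 buffer, so log2 of each padded dimension (10 here) is what would be handed to the FFT setup.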
In the linked PDF below, pay particular attention to the paragraph just above 12.4.2
http://www.mathcs.org/java/programs/FFT/FFTInfo/c12-4.pdf
While the above speaks about the manipulation along 2 axes, we could potentially apply a similar idea before the second dimension and then carry it on into the second dimension. If I'm correct, then this example could apply (and this is by no means an exact algorithm yet):
say we have an image that is 900 by 900:
first we could split the image into vertical strips of 512, 256, 128, and 4.
We would then process 4 1D FFTs for each row, one for the first 512 pixels, the next for the following 256 pixels, the next for the following 128, then the last for the remaining 4. Since the output of the FFT is essentially popularity of frequency, then these could simply be added (from the frequency ONLY perspective, not the angular offset).
We could then push this same technique toward the 2nd dimension. At this point we would have taken into consideration every input pixel without actually having to pad.
This is really just food for thought, I have not tried this myself, and indeed should research this myself. If you are truly doing this kind of work right now, you may have more time than I at this point though.
I'm trying to get frequency from iPhone / iPod music library for a spectrum app on iPod library, helping myself with reading-audio-samples-via-avassetreader to get audio samples and then with using-the-apple-fft-and-accelerate-framework and Apple vDSP Samples, but somehow I'm wrong somewhere and unable to calculate the frequency.
So step by step:
read audio sample
Hanning window
calculate fft
Is this the correct way to get frequencies from an iPod mp3 library?
Here is my code:
static COMPLEX_SPLIT A;
static FFTSetup setupReal;
static uint32_t log2n, n, nOver2;
static int32_t stride;
static float *obtainedReal;
static float scale;
+ (void)initialize
{
log2n = 10;
n = 1 << log2n;
stride = 1;
nOver2 = n / 2;
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
obtainedReal = (float *) malloc(n * sizeof(float));
setupReal = vDSP_create_fftsetup(log2n, FFT_RADIX2);
}
- (float) performAcceleratedFastFourierTransForAudioBuffer:(AudioBufferList)ioData
{
NSUInteger * sampleIn = (NSUInteger *)ioData.mBuffers[0].mData;
for (int i = 0; i < nOver2; i++) {
double multiplier = 0.5 * (1 - cos(2*M_PI*i/nOver2-1));
A.realp[i] = multiplier * sampleIn[i];
A.imagp[i] = 0;
}
memset(ioData.mBuffers[0].mData, 0, ioData.mBuffers[0].mDataByteSize);
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);
vDSP_zvmags(&A, 1, A.realp, 1, nOver2);
scale = (float) 1.0 / (2 * n);
vDSP_vsmul(A.realp, 1, &scale, A.realp, 1, nOver2);
vDSP_vsmul(A.imagp, 1, &scale, A.imagp, 1, nOver2);
vDSP_ztoc(&A, 1, (COMPLEX *)obtainedReal, 2, nOver2);
int peakIndex = 0;
for (size_t i=1; i < nOver2-1; ++i) {
if ((obtainedReal[i] > obtainedReal[i-1]) && (obtainedReal[i] > obtainedReal[i+1]))
{
peakIndex = i;
break;
}
}
//here I don't know how to calculate frequency with my data
float frequency = obtainedReal[peakIndex-1] / 44100 / n;
vDSP_destroy_fftsetup(setupReal);
free(obtainedReal);
free(A.realp);
free(A.imagp);
return frequency;
}
I got 1.485757 and 1.332233 as my first frequencies
It looks to me like there is a problem in the conversion to complex input for the FFT. vDSP_ctoz() splits a buffer where real and imaginary components are interleaved into two buffers, one real and one imaginary. Your input to that function appears to be just real data that has been cast to COMPLEX. This means that your input buffer to vDSP_ctoz() is only half as long as it needs to be and some garbage data beyond the buffer size is getting converted.
You either need to create sampleOut to be 2*n in length and set every other value (the real parts) or better yet, you can bypass the vDSP_ctoz() and directly copy your input data into A.realp and set A.imagp to zeros. vDSP_ctoz() should only be needed when interfacing to a source that produces interleaved complex data.
Edit
OK, I think I was wrong on my first suggestion, since the vDSP documentation says that the real input of the real-to-complex in-place FFT should be formatted into the split complex format such that realp contains the even samples and imagp contains the odd samples. I have not actually used the vDSP library, but I am familiar with a lot of other FFT libraries and I missed that detail.
You should be able to find the peaks using A.realp after the call to vDSP_zvmags(&A, 1, A.realp, 1, nOver2); At that point, A.realp should contain the magnitude squared of the FFT output, which is scalar. If you are going to do the scaling, it should be done before the mag2 operation, but it may not be needed if you are just looking for the peaks.
To get the real frequencies represented by the FFT output, use this formula:
F = (i * Fs) / N, i=0,1,...,N/2
where
i is the index of the FFT output buffer
Fs is the audio sampling rate
N is the FFT length
so your calculation might look like this:
float frequency = (peakIndex * 44100.0f) / n;
Keep in mind that vDSP only returns the first half of the spectrum for real input, since the second half is redundant. So the FFT output represents frequencies from 0 to Fs/2.
One other note is that I don't know if your peak finding algorithm will work very well since FFT output will not be smooth and there will often be a lot of oscillation. You are simply taking the first sample where the two adjacent samples are lower. If you just want to find a single peak, it would be better just to find the max magnitude across the entire output. If you want to find multiple peaks, you will have to do something more sophisticated.
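A sketch of the single-peak approach combined with the frequency formula, in plain C. The names max_bin and bin_to_hz are hypothetical, and mag is assumed to hold the nOver2 magnitude (or magnitude squared) values.

```c
#include <stddef.h>

/* Index of the largest value in mag[0..count-1]; returns 0 if count is 0. */
size_t max_bin(const float *mag, size_t count)
{
    size_t best = 0;
    for (size_t i = 1; i < count; ++i)
        if (mag[i] > mag[best])
            best = i;
    return best;
}

/* Frequency in Hz of bin `index` for an FFT of length n at sample rate fs.
   The arithmetic is done in floating point to avoid integer truncation. */
float bin_to_hz(size_t index, float fs, size_t n)
{
    return (float)index * fs / (float)n;
}
```

For example, bin_to_hz(max_bin(A.realp, nOver2), 44100.0f, n) would give the frequency of the strongest bin. vDSP also has vDSP_maxvi, which returns the maximum value and its index and could stand in for the loop in max_bin.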