Understanding Overlap and Add for Filtering - signal-processing

I am trying to implement the overlap-add method in order to apply a filter in a real-time context. However, it seems that there is something I am doing wrong, as the resulting output has a larger error than I would expect. To compare the accuracy of my computations I created a reference file that I process in a single chunk, and I compare it with the output of the overlap-add process; the resulting difference serves as an indicator of the accuracy of the computation. So here is my process of doing overlap and add:
I take a chunk of length L from my input signal
I pad the chunk with zeros to length L*2
I transform that signal into the frequency domain
I multiply the signal in the frequency domain with my filter response of length L*2 in the frequency domain (the filter response is actually created by interpolating control points in the UI - so it is not transformed from the time domain. However, using length L*2 in the frequency domain should be equivalent to FFT-ing a time-domain response of length L padded to L*2)
Then I transform the resulting signal back to the time domain and add it to the output stream with an overlap of L
Is there anything wrong with that procedure? After reading a lot of different papers and books I've gotten pretty unsure which is the right way to deal with that.
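To make the block structure explicit, here is a minimal sketch of the scheme as I understand it (illustrative only: it uses a direct time-domain convolution of each zero-padded block in place of the FFT multiply, and the function and variable names are placeholders rather than my actual code):

#include <algorithm>
#include <cstddef>
#include <vector>

// Overlap-add with block (hop) length L: each block is zero-padded to 2*L,
// convolved with an FIR impulse response h (length M <= L), and the
// 2*L-sample result is accumulated into the output with a hop of L samples.
std::vector<double> overlapAdd(const std::vector<double>& x,
                               const std::vector<double>& h,
                               std::size_t L)
{
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t start = 0; start < x.size(); start += L)
    {
        std::size_t blockLen = std::min(L, x.size() - start);
        // linear convolution of the zero-padded block with h;
        // in the FFT version this is the transform -> multiply -> inverse transform step
        std::vector<double> block(2 * L, 0.0);
        for (std::size_t n = 0; n < blockLen; ++n)
            for (std::size_t k = 0; k < h.size(); ++k)
                block[n + k] += x[start + n] * h[k];
        // the tail of this block overlaps the first half of the next block
        for (std::size_t n = 0; n < 2 * L && start + n < y.size(); ++n)
            y[start + n] += block[n];
    }
    return y;
}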
Here is some more data from the tests I have been running:
I created a signal, which consists of three cosine waves
I used this filter function in the frequency domain for filtering. (It's symmetric, as it is applied to the whole output of the FFT, which also is symmetric for real input signals)
The output of the IFFT looks like this: It can be seen that low frequencies are attenuated more than frequencies in the mid range.
For the overlap add/save and the windowed processing I divided the input signal into 8 chunks of 256 samples. After reassembling them they look like this (samples 490 - 540):
Output Signal overlap and add:
output signal overlap and save:
output signal using STFT with Hanning window:
It can be seen that the overlap add/save processes differ from the STFT version at the point where chunks are put together (sample 511). This is the main error, and it leads to different results when comparing the windowed process with overlap add/save. However, the STFT output is closer to the signal that was processed in one chunk.
I have been stuck at this point for a few days now. What is wrong here?
Here is my source
// overlap and add
// init Buffers
for (UInt32 j = 0; j < samples; j++){
    output[j] = 0.0;
}
// process multiple chunks of data
for (UInt32 i = 0; i < (float)div * 2; i++){
    for (UInt32 j = 0; j < chunklength/2; j++){
        // copy input data to the first half of the current buffer
        inBuffer[j] = input[(int)((float)i * chunklength / 2 + j)];
        // pad second half with zeros
        inBuffer[j + chunklength/2] = 0.0;
    }
    // clear buffers ([0] = real part, [1] = imaginary part)
    for (UInt32 j = 0; j < chunklength; j++){
        outBuffer[j][0] = 0.0;
        outBuffer[j][1] = 0.0;
        FFTBuffer[j][0] = 0.0;
        FFTBuffer[j][1] = 0.0;
    }
    FFT(inBuffer, FFTBuffer, chunklength);
    // processing
    for (UInt32 j = 0; j < chunklength; j++){
        // multiply with filter
        FFTBuffer[j][0] *= multiplier[j];
        FFTBuffer[j][1] *= multiplier[j];
    }
    // Inverse Transform
    IFFT((const double**)FFTBuffer, outBuffer, chunklength);
    for (UInt32 j = 0; j < chunklength; j++){
        // copy to output, overlapping consecutive chunks by chunklength/2
        if ((int)((float)i * chunklength / 2 + j) < samples){
            output[(int)((float)i * chunklength / 2 + j)] += outBuffer[j][0];
        }
    }
}
After the suggestion below, I tried the following:
I IFFT-ed my filter. It looks like this:
I set the second half to zero:
I FFT-ed the result and compared the magnitudes with the old filter (blue):
After trying to do overlap and add with this filter, the results have obviously gotten worse instead of better. In order to make sure my FFT works correctly, I tried to IFFT and FFT the filter without setting the second half to zero. The result is identical to the original filter, so the problem shouldn't be the FFT itself. I suppose this is more a matter of general understanding of the overlap and add method, but I still can't figure out what is going wrong...

One thing to check is the length of the impulse response of your filter. It must be shorter than the length of zero padding used before the fast convolution FFT, or you will get wrap around errors.
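For instance, a rough way to check this (a sketch only: the naive inverse DFT below stands in for whatever IFFT routine you actually use, and the function name is made up) is to inverse-transform the 2*L-point response built from the UI control points and see how much of the impulse response's energy sits around the middle of the buffer. For a real, symmetric response (H[k] == H[N-k]) the impulse response is concentrated around n = 0 and wraps around to n = N-1, so significant energy near the middle means the response is effectively too long and will alias in time during the fast convolution:

#include <cmath>
#include <cstddef>
#include <vector>

// H is the real-valued 2*L-point frequency response; returns the fraction of
// impulse-response energy sitting in the middle half of the buffer.
double midBufferEnergyRatio(const std::vector<double>& H)
{
    const std::size_t N = H.size();            // N == 2*L
    std::vector<double> h(N, 0.0);
    // naive inverse DFT; a real, symmetric H yields a real impulse response
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t k = 0; k < N; ++k)
            h[n] += H[k] * std::cos(2.0 * M_PI * double(k) * double(n) / double(N)) / double(N);
    double total = 0.0, middle = 0.0;
    for (std::size_t n = 0; n < N; ++n)
    {
        total += h[n] * h[n];
        if (n >= N / 4 && n < 3 * N / 4)
            middle += h[n] * h[n];
    }
    return middle / total;                     // should be very close to 0
}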

I think the problem might be in the windowing approach that you are using. You simply add zeros to the chunks, so there is no actual overlap. In the overlap and add method, you need to damp the edges of the window. What this means is that where you add zeros to the chunk, you instead add weighted input signal, and the weight in your case should be 0.5, since only two windows overlap.
The rest of the procedure seems OK. You then simply take the FTs, multiply, take the inverse FTs, and finally add up all the chunks to get the final signal, which should be exactly the same as if you had filtered the whole signal at once.

Related

Image computation on GPU and value returning

I have a C# project in which I retrieve grey-scale images from cameras and do some computation with the image data. The computations are quite time-consuming, since I need to loop over the whole image several times, and I am doing it all on the CPU.
Now I would like to try to get the evaluation running on the GPU, but I am struggling to achieve that, since I have never done any GPU calculations before.
The software should be able to run on several computers with varying hardware, so CUDA for example is not a solution for me, since the code should also run on laptops which only have onboard graphics. After some research I came across Cloo (found it on this project), which seems to be a quite reasonable choice.
So far I have integrated Cloo in my project and tried to get this hello world example running. I guess it is running, since I don't get any exception, but I don't know where I can see the printed output.
For my computations I need to pass the image to the GPU and I also need the x-y coordinates during the computation. So, in C# the computation looks like this:
int a = 0;
for (int y = 0; y < img_height; y++){
    for (int x = 0; x < img_width; x++){
        a += image[x,y] * x * y;
    }
}
int b = 0;
for (int y = 0; y < img_height; y++){
    for (int x = 0; x < img_width; x++){
        b += image[x,y] * (x-a) * y;
    }
}
Now I want these calculations to run on the GPU, and I want to parallelize the y-loop, so that each task runs one x-loop. Then I could take all the resulting a values and add them up before the second loop block starts.
Afterwards I would like to return the values a and b to my C# code and use them there.
So, to wrap up my questions:
Is Cloo a recommendable choice for this task?
What is the best way to pass the image-data (16bit, short-array) and the dimensions (img_width, img_height) to the GPU?
How can I return a value from the GPU? As far as I know kernels are always used as kernel void...
What would be the best way to implement the loops?
I hope my questions are clear and I provided sufficient information to understand my struggles. Any help is appreciated. Thanks in advance.
Let's reverse engineer the problem. Understanding the efficient processing of the "dependency-chain" of image[][], image_height, image_width, a, b
Ad 4) The tandem of identical for-loops has poor performance.
Given the defined code, there could be just a single loop, with reduced overhead costs and, best of all, the chance to maximise cache-aligned vectorised code.
Cache-Naive re-formulation:
int a = 0;
int c = 1;
for ( int y = 0; y < img_height; y++ ){
    for ( int x = 0; x < img_width; x++ ){
        int intermediate = image[x,y] * y; // .SET   PROD(i[x,y],y)
        a += x * intermediate;             // .REUSE 1st
        c -= intermediate;                 // .REUSE 2nd
    }
}
int b = a * c; // was my fault upon being in a hurry leaving for weekend :o)
Moving the code into the split tandem loops is only increasing these overheads and devastating any possible cache-friendly tricks in the code-performance tweaking.
Ad 3 + 2 ) kernel call-signature + CPU-side methods allow this
OpenCL and Cloo document these details, so nothing magical beyond the documented methods is needed here.
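For illustration only (the kernel name, argument layout and per-row reduction strategy are my own choices, not something prescribed by Cloo), the device-side part of the first sum could look like the following OpenCL C source, embedded here as a C++ raw string. Each work-item handles one image row and writes one partial sum; the host reads the img_height partial sums back and adds them before launching the second pass, which needs the final value of a:

// hypothetical kernel source; it would be compiled at runtime via the documented Cloo program/kernel classes
static const char* kRowSumKernel = R"CLC(
__kernel void row_sum_xy(__global const short* image,   // 16-bit grey-scale pixels, row-major
                         const int img_width,
                         const int img_height,
                         __global long* partial)        // one partial sum per row
{
    int y = get_global_id(0);                           // one work-item per row
    if (y >= img_height) return;
    long acc = 0;
    for (int x = 0; x < img_width; ++x)
        acc += (long)image[y * img_width + x] * x * y;
    partial[y] = acc;
}
)CLC";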
Yet, there are latency costs associated with each such host-side to device-side + device-side to host-side transfers. Given you claim that the 16bit-1920x1200 image-data are to be re-processed ~ 10 times in a loop, there are some chances these latencies need not be spent on every such loop pass-through.
The worst performance-killer is a very shallow kernel mathematical density. The problem is, there is indeed not much to calculate in the kernel, so the chances for any efficient SIMD / GPU parallel tricks are indeed pretty low.
In this sense, the CPU-side smart-vectorised code will do much better than the ( H2D + D2H )-overheads-far latency-hostile computationally-shallow GPU-kernel processing.
Ad 1) Given 2 + 3 and 4 above, 1 may easily lose sense.
As prototyped, and given additional cache-friendly vectorised tricks, the in-RAM + in-cache vectorised code will have a good chance to beat all OpenCL and mixed-GPU/CPU automated ad-hoc kernel-compilation-generated device code and its computing efforts.
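For comparison, a single-pass CPU-side version of the same dependency chain might look like this (a C++ sketch over a flat, row-major array; the C# reformulation above follows the same structure):

#include <cstdint>
#include <vector>

// Each pixel is touched exactly once: both accumulators reuse the same
// image[y*width + x] * y product, exactly as in the reformulation above.
void accumulate(const std::vector<int16_t>& image, int width, int height,
                long long& a, long long& b)
{
    a = 0;
    long long c = 1;                      // same c = 1 trick as above
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            long long intermediate = (long long)image[y * width + x] * y;
            a += x * intermediate;        // builds sum(image * x * y)
            c -= intermediate;            // builds 1 - sum(image * y)
        }
    }
    b = a * c;                            // b = a - a * sum(image * y)
}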

How would one write a multiplication of double values in NEON assembly?

The line in question is pretty contained:
w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1]
Considering these variables are double (but I can downgrade to float), I think I can pass one value per register? Would it be more efficient to use the 2x2 matrix W directly?
EDIT 1:
This line is inside a loop that is fired hundreds of times per second and has real-time requirements. Instruments says this line takes 60% of the time of the loop.
EDIT 2:
This is the loop(s) I'm talking about:
for (int x=startingX; x<endingX; ++x)
{
    for (int y=startingY; y<endingY; ++y)
    {
        Matx21d position(x,y);
        // warp patch
        uint8_t *data;
        [self backwardWarpPatchWithWarpingMatrix:warpingMatrix withWarpData:&data withReferenceImage:_initialView withCenter:position];
        // check that the backward patch was successful
        if (!data)
            continue;
        // calculate zero mean (on the patch) sum of squared differences
        int ssd = [self computeZMSSDScoreWithX:x withY:y withCurrentTargetPatch:data];
        if (fabs(ssd) < bestSSD)
        {
            bestPosition = position;
            bestSSD = ssd;
        }
    }
}
backwardWarpPatchWithWarpingMatrix:
Matx22d warpingMatrixInverse = warpingMatrix.inv();
double wmi0 = warpingMatrixInverse(0,0), wmi1 = warpingMatrixInverse(0,1), wmi2 = warpingMatrixInverse(1,0), wmi3 = warpingMatrixInverse(1,1);
if (isnan(wmi0))
{
warpingMatrixInverse = Matx22d::eye();
}
// Perform the warp on a larger patch.
int LEVEL_REF = 0, halfPatchSize = PATCH_SIZE/2;
Matx21d centerInLevel = center * (1.0 / (1<<LEVEL_REF));
__block Mat warped(PATCH_SIZE, PATCH_SIZE, CV_8UC1);
dispatch_apply(PATCH_SIZE, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t y)
{
    for (int x=0; x<PATCH_SIZE; ++x)
    {
        double pp0 = x - halfPatchSize, pp1 = (double)y - halfPatchSize;
        Matx21d multiplication(wmi0 * pp0 + wmi1 * pp1, wmi2 * pp0 + wmi3 * pp1);
        Matx21d px(multiplication(0) + centerInLevel(0), multiplication(1) + centerInLevel(1));
        double warpedPixel = [self interpolatePointInImage:referenceImage withU:px(0) withV:px(1)];
        warped.at<uchar>(y,x) = (uint8_t)warpedPixel;
    }
});
computeReferencePatchScores:
int x = (int)u;
int y = (int)v;
float subpixX = u - x,
subpixY = v - y,
oneMinusSubpixX = 1.0 - subpixX,
oneMinusSubpixY = 1.0 - subpixY;
float w00 = oneMinusSubpixX * oneMinusSubpixY,
w01 = oneMinusSubpixX * subpixY,
w10 = subpixX * oneMinusSubpixY,
w11 = 1.0f - w00 - w01 - w10;
const int stride = (int)image.step.p[0];
uchar* ptr = image.data + y * stride + x;
return w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1];
You typically don't translate a single line of code into assembly. For it to be worth writing in assembly, you have to first assume that you can generate better assembly than the compiler will. Sometimes that's true for vectorized code on NEON, but it's usually because you have special knowledge about a complex loop. You're unlikely to beat the compiler significantly on a single line of code (and will likely lose). Is this line part of a loop that you've profiled and identified as a major bottleneck? Have you already tried Accelerate? Have you analyzed the assembly the compiler is generating and found mistakes that it's making?
Trying to do this in ObjC++ is very inefficient. ObjC++ is a glue language for tying together C++ and ObjC; doing both in the same file imposes several performance costs, especially with ARC. Calling an ObjC method inside of a performance-critical inner-loop is very expensive in any case (even if there weren't mixed-in C++). You should never do any kind of function call (least of all an ObjC method dispatch) inside of a tight inner-loop. It's not clear where you're actually calling computeReferencePatchScores. The use of GCD here is probably hurting you more than helping (since it prevents the compiler from applying certain vector optimizations).
This is all to say: how a particular line of code is being compiled into assembly is by far the least of your problems in this code. Its structure is fighting clang's optimizer.
Step one is to step back and ask what computation you want to execute, and then read through the Core Image Programming Guide and the vImage Programming Guide and verify that it isn't already available. You might also look over OpenGL ES, but OpenGL is often a whole approach to drawing (so it's a bit more of a commitment). It looks like you're already using OpenCV, so make sure it doesn't have available functions to do what you want. (Most of what I see in there looks like stuff built into both OpenCV and vImage.)
The simplest way to improve performance without moving to more powerful frameworks is to move the entire loop into a single C++ function. Then the optimizer can see all the code and apply vector operations on its own. But the next step is to make use of the high-level high-performance frameworks already available.
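As an illustration of that first step (a sketch, not a drop-in replacement for the posted methods; the function and parameter names here are made up), the bilinear sample can live as a small inline C++ function in the same translation unit as the loop, so clang can inline it and vectorise across pixels:

#include <cstdint>

// Bilinear sample of an 8-bit single-channel image at (u, v);
// data points to the top-left pixel and stride is the row length in bytes,
// mirroring the weight computation from computeReferencePatchScores above.
static inline float bilinearSample(const uint8_t* data, int stride, float u, float v)
{
    const int x = (int)u;
    const int y = (int)v;
    const float subpixX = u - x;
    const float subpixY = v - y;
    const float w00 = (1.0f - subpixX) * (1.0f - subpixY);
    const float w01 = (1.0f - subpixX) * subpixY;
    const float w10 = subpixX * (1.0f - subpixY);
    const float w11 = 1.0f - w00 - w01 - w10;
    const uint8_t* ptr = data + y * stride + x;
    return w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride + 1];
}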
In any case, you'll want to sit down and carefully work through exactly the calculations you need to perform (I usually do this by hand on paper). Make sure you're not duplicating anything, that you need every calculation you're performing, and that each change you make still generates the same result.
This looks to be a 2x2 convolution. If the data set is large, then vImageConvolve_PlanarF with a 3x3 kernel with some zero padding in it will do the job. It tries to skip work on kernel elements that are 0. You would need to convert the data set to single precision.
If the data set is small, then you are probably stuck with scalar code performance. Inline the function if you can. Perhaps you can figure out how to aggregate a bunch of these together to take advantage of a heavier duty high performance routine.
However, if the weights change from pixel to pixel, then a convolution isn't going to work. You may look instead at the N-dimensional lookup table feature in vImage/Transform.h, if your data set is not huge.
I am a bit skeptical that the time is really spent just in that line. It is best to look at the assembly view in instruments to see where the samples really land.

Real to Complex FFT with CUFFT, using OpenCV as Data source

I'm having an issue trying to perform a two dimensional transform on an array of floats using cuFFT. I've had a look at the documentation, but some of the information is contradictory/not clear; so I have a few questions:
My data is 480 rows, with 640 columns (e.g. float data[480][640] but in a single dimension so float data[480*640])
If we say my input dimensions (of real data) are N1 = 480 and N2 = 640. Are the dimensions (after a real to complex transform) N1=480, N2=321?
Can I cudaMemcpy the data directly into a cufftReal array of the same size? Or must it be a cufftComplex array?
If it must be a cufftComplex array, I am assuming the elements need to be in the place of the real components?
What is the correct structure of a call to cufftPlan2d, cufftExecR2C and cufftC2R given the above values.
I think that's all for now...
Many thanks in advance
EDIT: So, I've implemented the Forward and Inverse transforms as suggested by JackOLantern. However, my results are not what I am expecting (I expect the result after the forward and inverse transforms to be identical to the input). I have an image gallery here showing two sets of examples. The first is from my room, the second from my University Project.
In the cuFFT Documentation, there is ambiguity in the use of cufftPlan2d (hence why I asked). In the documentation, for a two dimensional array, the data should be input as above (float data[480][640] == float data[NY][NX]) So NY represents the rows. However in the function listing for cufftPlan2d, it states that nx (the parameter) is for the rows...
Swapping the values of NX and NY in the function call gives the result as in the project image (correct orientation, but split into three partially overlapping images at 1/4 the normal size) however, using the parameters as JackOLantern states in his answer gives a slanted/skewed result.
Am I doing something wrong here? Or does the cuFFT library have issues with this type of thing.
ALSO: I have undone a couple of the edits made by JackOLantern to this question as my issues MAY stem from the fact my data is coming from OpenCV.
EDIT: I've recently found out that I was the one who made a mistake in the way I used the function.
Originally I thought the function definition referred to the size of the data being passed into it.
However, it appears that the parameters actually refer directly to the size of the REAL part.
This means that the parameters refer to:
The size of the input data when using R2C (Real to Complex)
The size of the output data when using C2R (Complex to Real)
So it appears that the cuFFT documentation and the library itself do not correspond.
When performing an R2C followed by a C2R (real to complex, complex to real respectively), the documentation states that for a Real input of NX x NY dimensions, the Complex output is NX x (floor(NY/2) +1); and vice versa.
However the actual output is of dimensions NX x NY and the actual input is of dimensions NX x NY. This is (half) mentioned on the very first page as
C2R - Symmetric complex input to real output
Implying that the complex data must be Symmetric, i.e. must also have the redundant data in addition to the non-redundant data.
There are a number of other contradictions within the documentation as well which I won't go into.
Needless to say, the problem has been solved.
I have included a MWE below. Near the top are a couple of lines with #define NUM_C2 and appropriate comments. Changing this changes whether the documentation format is followed, or my "fix".
The output is
The Input Real data
The Intermediate Complex data
The output Real data
The ratio of the output data to the input data (there are minor FFT errors, ~1 indicates correct)
Feel free to change the parameters (NUM_R and NUM_C) and feel free to comment if you think I have made a mistake somewhere.
#include <iostream>
#include <math.h>
#include <cufft.h>
// e.g. float data[NUM_R][NUM_C]
#define NUM_R 12
#define NUM_C 16
// Documentation Version
//#define NUM_C2 (1+NUM_C/2)
// "Correct" Version
#define NUM_C2 NUM_C
using namespace std;
int main(int argc, char** argv)
{
cufftReal *in_h, *out_h, *in_d, *out_d;
cufftComplex *mid_d, *mid_h;
cufftHandle pF, pI;
int r, c;
in_h = (cufftReal*) malloc(NUM_R * NUM_C * sizeof(cufftReal));
out_h= (cufftReal*) malloc(NUM_R * NUM_C * sizeof(cufftReal));
mid_h= (cufftComplex*)malloc(NUM_C2*NUM_R*sizeof(cufftComplex));
cudaMalloc((void**) &in_d, NUM_R * NUM_C * sizeof(cufftReal));
cudaMalloc((void**)&out_d, NUM_R * NUM_C * sizeof(cufftReal));
cudaMalloc((void**)&mid_d, NUM_C2 * NUM_R * sizeof(cufftComplex));
cufftPlan2d(&pF, NUM_R, NUM_C, CUFFT_R2C);
cufftPlan2d(&pI, NUM_R,NUM_C2, CUFFT_C2R);
cout<<endl<<"------"<<endl;
for(r=0; r<NUM_R; r++)
{
    for(c=0; c<NUM_C; c++)
    {
        in_h[c + NUM_C * r] = cos(2.0*M_PI*(c*7.0/NUM_C+r*3.0/NUM_R));
        out_h[c + NUM_C * r] = 0.f;
        cout<<in_h[c+NUM_C*r];
        if(c<(NUM_C-1)) cout<<", ";
        else cout<<endl;
    }
}
cudaMemcpy((cufftReal*)in_d, (cufftReal*)in_h, NUM_R * NUM_C * sizeof(cufftReal),cudaMemcpyHostToDevice);
cufftExecR2C(pF, (cufftReal*)in_d, (cufftComplex*)mid_d);
cudaMemcpy((cufftComplex*)mid_h, (cufftComplex*)mid_d, NUM_C2*NUM_R*sizeof(cufftComplex), cudaMemcpyDeviceToHost);
cout<<endl<<"------"<<endl;
for(r=0; r<NUM_R; r++)
{
    for(c=0; c<NUM_C2; c++)
    {
        cout<<mid_h[c+(NUM_C2)*r].x<<"|"<<mid_h[c+(NUM_C2)*r].y;
        if(c<(NUM_C2-1)) cout<<", ";
        else cout<<endl;
    }
}
cufftExecC2R(pI, (cufftComplex*)mid_d, (cufftReal*)out_d);
cudaMemcpy((cufftReal*)out_h, (cufftReal*)out_d, NUM_R*NUM_C*sizeof(cufftReal), cudaMemcpyDeviceToHost);
cout<<endl<<"------"<<endl;
for(r=0; r<NUM_R; r++)
{
    for(c=0; c<NUM_C; c++)
    {
        cout<<out_h[c+NUM_C*r]/(NUM_R*NUM_C);
        if(c<(NUM_C-1)) cout<<", ";
        else cout<<endl;
    }
}
cout<<endl<<"------"<<endl;
for(r=0; r<NUM_R; r++)
{
    for(c=0; c<NUM_C; c++)
    {
        cout<<(out_h[c+NUM_C*r]/(NUM_R*NUM_C))/in_h[c+NUM_C*r];
        if(c<(NUM_C-1)) cout<<", ";
        else cout<<endl;
    }
}
free(in_h);
free(out_h);
free(mid_h);
cudaFree(in_d);
cudaFree(out_d);
cudaFree(mid_d);
return 0;
}
1) If we say my input dimensions (of real data) are N1 = 480 and N2 = 640. Are the dimensions (after a real to complex transform) N1=480, N2=321?
The output of cufftExecR2C is a NX*(NY/2+1) cufftComplex matrix. So in your case, you will have a 480x321 float2 matrix as output.
2) Can I cudaMemcpy the data directly into a cufftReal array of the same size? Or must it be a cufftComplex array?
If it must be a cufftComplex array, I am assuming the elements need to be in the place of the real components?
Yes, you can cudaMemcpy the N1 x N2 real data directly into a cufftReal array of the same size; it does not need to be a cufftComplex array.
3) What is the correct structure of a call to cufftPlan2d, cufftExecR2C and cufftC2R given the above values.
cufftPlan2d(&plan, N1, N2, CUFFT_R2C);
cufftExecR2C(plan, (cufftReal*)idata, (cufftComplex*) odata);
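For the inverse direction (a sketch, using odata from above as the complex input and a hypothetical recovered device buffer for the real output), the call structure is symmetric; note that cuFFT transforms are unnormalized, so the round-trip result has to be divided by N1*N2:
cufftHandle planInverse;
cufftPlan2d(&planInverse, N1, N2, CUFFT_C2R);
cufftExecC2R(planInverse, (cufftComplex*)odata, (cufftReal*)recovered);
// divide every sample of "recovered" by N1 * N2 (on the device or after
// copying back to the host) to get the original data back after R2C -> C2R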

Extract Treble and Bass from audio in iOS

I'm looking for a way to get the treble and bass data from a song for some increment of time (say 0.1 seconds) and in the range of 0.0 to 1.0. I've googled around but haven't been able to find anything remotely close to what I'm looking for. Ultimately I want to be able to represent the treble and bass level while the song is playing.
Thanks!
It's reasonably easy. You need to perform an FFT and then sum up the bins that interest you. A lot of how you select will depend on the sampling rate of your audio.
You then need to choose an appropriate FFT order to get good information in the frequency bins returned.
So if you do an order 8 FFT you will need 256 samples. This will return you 128 complex pairs.
Next you need to convert these to magnitude. This is actually quite simple. If you are using std::complex you can simply perform a std::abs on the complex number and you will have its magnitude (sqrt( r^2 + i^2 )).
Interestingly, at this point there is something called Parseval's theorem. This theorem states that after performing a Fourier transform, the sum of the bins returned is equal to the sum of mean squares of the input signal.
This means that to get the amplitude of a specific set of bins, you can simply add them together, divide by the number of them, and then take the square root to get the RMS amplitude value of those bins.
So where does this leave you?
Well from here you need to figure out which bins you are adding together.
A treble tone is defined as above 2000Hz.
A bass tone is below 300Hz (if my memory serves me correctly).
Mids are between 300Hz and 2kHz.
Now suppose your sample rate is 8kHz. The Nyquist rate says that the highest frequency you can represent in 8kHz sampling is 4kHz. Each bin thus represents 4000/128 or 31.25Hz.
So the first 10 bins (up to 312.5Hz) are used for the bass frequencies, bins 10 to 63 represent the mids, and finally bins 64 to 127 are the trebles.
You can then calculate the RMS value as described above and you have the RMS values.
RMS values can be converted to dBFS values by performing 20.0f * log10f( rmsVal );. This will return you a value from 0dB (max amplitude) down to -infinity dB (min amplitude). Be aware amplitudes do not range from -1 to 1.
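Putting the pieces above together, a small sketch of the band measurement (assuming an 8kHz sample rate and the 128 magnitude values from the example; the magnitudes are squared before averaging, since Parseval's theorem works with squared magnitudes) could look like this:

#include <cmath>
#include <cstddef>
#include <vector>

// RMS level of the bins covering [lowHz, highHz), given the magnitudes of the
// half-spectrum (e.g. 128 bins) and the sample rate.
float bandRms(const std::vector<float>& magnitudes, float sampleRate,
              float lowHz, float highHz)
{
    const float binWidth = (sampleRate / 2.0f) / magnitudes.size();  // 4000 / 128 = 31.25 Hz
    float sumSquares = 0.0f;
    int count = 0;
    for (std::size_t bin = 0; bin < magnitudes.size(); ++bin)
    {
        const float freq = bin * binWidth;
        if (freq >= lowHz && freq < highHz)
        {
            sumSquares += magnitudes[bin] * magnitudes[bin];
            ++count;
        }
    }
    return count > 0 ? std::sqrt(sumSquares / count) : 0.0f;
}

// e.g. bass below 300 Hz, treble above 2 kHz, then convert to dBFS:
//   float bassDb   = 20.0f * log10f(bandRms(mags, 8000.0f, 0.0f,    300.0f));
//   float trebleDb = 20.0f * log10f(bandRms(mags, 8000.0f, 2000.0f, 4000.0f));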
To help you along, here is a bit of my C++ based FFT class for iPhone (which uses vDSP under the hood):
MacOSFFT::MacOSFFT( unsigned int fftOrder ) :
BaseFFT( fftOrder )
{
mFFTSetup = (void*)vDSP_create_fftsetup( mFFTOrder, 0 );
mImagBuffer.resize( 1 << mFFTOrder );
mRealBufferOut.resize( 1 << mFFTOrder );
mImagBufferOut.resize( 1 << mFFTOrder );
}
MacOSFFT::~MacOSFFT()
{
vDSP_destroy_fftsetup( (FFTSetup)mFFTSetup );
}
bool MacOSFFT::ForwardFFT( std::vector< std::complex< float > >& outVec, const std::vector< float >& inVec )
{
return ForwardFFT( &outVec.front(), &inVec.front(), inVec.size() );
}
bool MacOSFFT::ForwardFFT( std::complex< float >* pOut, const float* pIn, unsigned int num )
{
// Bring in a pre-allocated imaginary buffer that is initialised to 0.
DSPSplitComplex dspscIn;
dspscIn.realp = (float*)pIn;
dspscIn.imagp = &mImagBuffer.front();
DSPSplitComplex dspscOut;
dspscOut.realp = &mRealBufferOut.front();
dspscOut.imagp = &mImagBufferOut.front();
vDSP_fft_zop( (FFTSetup)mFFTSetup, &dspscIn, 1, &dspscOut, 1, mFFTOrder, kFFTDirection_Forward );
vDSP_ztoc( &dspscOut, 1, (DSPComplex*)pOut, 1, num );
return true;
}
It seems that you're looking for Fast Fourier Transform sample code.
It is quite a large topic to cover in an answer.
The tools you will need are already built into iOS: the vDSP API
This should help you: vDSP Programming Guide
And there is also an FFT sample code available
You might also want to check out iPhoneFFT. Though that code is slightly outdated, it can help you understand processes "under-the-hood".
Refer to the aurioTouch2 example from Apple - it has everything from frequency analysis to UI representation of what you want.

Need a specific example of U-Matrix in Self Organizing Map

I'm trying to develop an application using SOM to analyze data. However, after finishing training, I cannot find a way to visualize the result. I know that the U-Matrix is one such method, but I cannot understand it properly. Hence, I'm asking for a specific and detailed example of how to construct a U-Matrix.
I also read an answer at U-matrix and self organizing maps but it only refers to a 1-row map; what about a 3x3 map? I know that for a 3x3 map:
m(1) m(2) m(3)
m(4) m(5) m(6)
m(7) m(8) m(9)
a 5x5 matrix must be created:
u(1) u(1,2) u(2) u(2,3) u(3)
u(1,4) u(1,2,4,5) u(2,5) u(2,3,5,6) u(3,6)
u(4) u(4,5) u(5) u(5,6) u(6)
u(4,7) u(4,5,7,8) u(5,8) u(5,6,8,9) u(6,9)
u(7) u(7,8) u(8) u(8,9) u(9)
but I don't know how to calculate u-weight u(1,2,4,5), u(2,3,5,6), u(4,5,7,8) and u(5,6,8,9).
Finally, after constructing U-Matrix, is there any way to visualize it using color, e.g. heat map?
Thank you very much for your time.
Cheers
I don't know if you are still interested in this but I found this link
http://www.uni-marburg.de/fb12/datenbionik/pdf/pubs/1990/UltschSiemon90
which explains very specifically how to calculate the U-matrix.
Hope it helps.
By the way, the site where I found the link has several resources referring to SOMs; I leave it here in case anyone is interested:
http://www.ifs.tuwien.ac.at/dm/somtoolbox/visualisations.html
The essential idea of a Kohonen map is that the data points are mapped to a
lattice, which is often a 2D rectangular grid.
In the simplest implementations, the lattice is initialized by creating a 3D
array with these dimensions:
width * height * number_features
This is the U-matrix.
Width and height are chosen by the user; number_features is just the number
of features (columns or fields) in your data.
Intuitively this is just creating a 2D grid of dimensions w * h
(e.g., if w = 10 and h = 10 then your lattice has 100 cells), then
into each cell, placing a random 1D array (sometimes called "reference tuples")
whose size and values are constrained by your data.
The reference tuples are also referred to as weights.
How is the U-matrix rendered?
In my example below, the data consists of rgb tuples, so the reference tuples
have a length of three, and each of the three values must lie between 0 and 255.
It's with this 3D array ("lattice") that you begin the main iterative loop.
The algorithm iteratively positions each data point so that it is closest to others similar to it.
If you plot it over time (iteration number) then you can visualize cluster
formation.
The plotting tool I use for this is the brilliant Python library, Matplotlib,
which plots the lattice directly, just by passing it into the imshow function.
Below are eight snapshots of the progress of a SOM algorithm, from initialization to 700 iterations. The newly initialized (iteration_count = 0) lattice is rendered in the top left panel; the result from the final iteration, in the bottom right panel.
Alternatively, you can use a lower-level imaging library (in Python, e.g., PIL) and transfer the reference tuples onto the 2D grid, one at a time:
for y in range(h):
    for x in range(w):
        img.putpixel( (x, y), (
            SOM.Umatrix[y, x, 0],
            SOM.Umatrix[y, x, 1],
            SOM.Umatrix[y, x, 2])
        )
Here img is an instance of PIL's Image class. The image is created by iterating over the grid one pixel at a time; for each pixel, putpixel is called on img once, and the rgb tuple it receives is built from the three values stored at that grid position.
From the matrix that you create:
u(1) u(1,2) u(2) u(2,3) u(3)
u(1,4) u(1,2,4,5) u(2,5) u(2,3,5,6) u(3,6)
u(4) u(4,5) u(5) u(5,6) u(6)
u(4,7) u(4,5,7,8) u(5,8) u(5,6,8,9) u(6,9)
u(7) u(7,8) u(8) u(8,9) u(9)
The elements with single numbers like u(1), u(2), ..., u(9), as well as the elements with more than two numbers like u(1,2,4,5), u(2,3,5,6), ..., u(5,6,8,9), are calculated using something like the mean, median, min or max of the values in the neighborhood.
It's a good idea to calculate the elements with two numbers first; one possible code for that is:
for i in range(self.h_u_matrix):
    for j in range(self.w_u_matrix):
        nb = (0, 0)
        if not (i % 2) and (j % 2):
            nb = (0, 1)
        elif (i % 2) and not (j % 2):
            nb = (1, 0)
        self.u_matrix[(i, j)] = np.linalg.norm(
            self.weights[i // 2, j // 2] - self.weights[i // 2 + nb[0], j // 2 + nb[1]],
            axis = 0
        )
In the code above, self.h_u_matrix = self.weights.shape[0]*2 - 1 and self.w_u_matrix = self.weights.shape[1]*2 - 1 are the dimensions of the U-Matrix. With that said, to calculate the other elements it is necessary to obtain a list of their neighbours and apply, for example, a mean. The following code implements that idea:
for i in range(self.h_u_matrix):
    for j in range(self.w_u_matrix):
        if not (i % 2) and not (j % 2):
            # single-number elements such as u(1): average the adjacent two-number elements
            nodelist = []
            if i > 0:
                nodelist.append((i - 1, j))
            if i < self.h_u_matrix - 1:
                nodelist.append((i + 1, j))
            if j > 0:
                nodelist.append((i, j - 1))
            if j < self.w_u_matrix - 1:
                nodelist.append((i, j + 1))
            meanlist = [self.u_matrix[u_node] for u_node in nodelist]
            self.u_matrix[(i, j)] = np.mean(meanlist)
        elif (i % 2) and (j % 2):
            # four-number elements such as u(1,2,4,5): average the four surrounding two-number elements
            nodelist = [
                (i - 1, j),
                (i + 1, j),
                (i, j - 1),
                (i, j + 1)]
            meanlist = [self.u_matrix[u_node] for u_node in nodelist]
            self.u_matrix[(i, j)] = np.mean(meanlist)
