Understanding FFT in aurioTouch2 - ios

I've been looking at aurioTouch 2 from Apple' sample code (found here). At the end of the day I want to analyze the frequencies myself. For now I'm trying to understand some of what's going on here. My apologies if this is trivial, just trying to understand some of the uncommented magic numbers floating around in some of the source. My main points of confusion right now are:
Why do they zero out the nyquist value in FFTBufferManager::ComputeFFT? Can this value really just be thrown away? (~line 112 of FFTBufferManager.cpp).
They scale everything down by -128db, so I'm assuming that the results are thus in the range of (-128, 0). However, later in aurioTouchAppDelegate.mm (~line 807), They convert this to a value between 0 and 1 by adding 80 and dividing by 64, then clamping to 0 and 1. Why the fuzziness? Also, am I right in assuming values will be in the vicinity of (-128, 0)?

Well, it's not trivial for me either but this is how i understand it. If i've over simplified it is purely for my benefit, i don't mean to be patronising.
Zeroing the result corresponding to the Nyquist frequency:
I'm going to suppose we are computing the forward FFT of 1024 input samples. At 44100hz input this is usually true in my case (but isn't what AurioTouch is doing, which i find a bit weird, but i'm no expert). It's easier for me to understand with specific values.
Given 1024 (n) input samples, arranged as needed (even indexes' first then odd indexes' { in[0], in[2], in[4], …, in1, in[3], in[5], … }) (use vDSP_ctoz() to order your input)
The output of FFT 1024 (n) input samples is 513 ((n/2)+1) complex values. ie 513 real components and 513 imaginary components, a total of 1026 values.
However, imaginary[0] and imaginary[512] (n/2) are always, necessarily, zero. So by placing real[512] (the real component of the Nyquist frequency bin) at imaginary[0] and forgetting imaginary[512] - which is always zero and can be inferred, the results are packed into an 1024 (n) length buffer.
So, for the returned results to be valid you must at least set imaginary[0] back to zero. If you require all 513 ((n/2)+1) frequency bins you need to append another complex value to the result and set it thus..
unpackedVal = imaginary[0]
real[512]=unpackedVal, imaginary[512]=0
imaginary[0] = 0
In AurioTouch i always supposed they just don't bother. n/2 results is obviously more convenient to work with and you can hardly tell from the visualizer:- "Oh look, it's missing one magnitude at the Nyquist frequency"
The UsingFourierTransforms docs explain the packing
NB the specific values 1024, 513, 512, etc. are examples not the actual values of n, (n/2)+1, n/2 from AurioTouch.
They scale everything down by -128db
Not quite, the range of the output values is relative to the number of input samples so it has to be normalised. The scale is 1.0/(2*inNumberFrames).
After scaling the range is -1.0 –> +1.0. The magnitude of the complex vector is then taken (the phase is ignored) giving a Scalar value for each frequency bin between 0 and 1.0
This value is then interpreted as a decibel value between -128 and 0
The drawing stuff… +80 / 64. …*120… …i'm not sure. I may be completely wrong or it may be …artistic license?

Related

How to calculate 512 point FFT using 2048 point FFT hardware module

I have a 2048 point FFT IP. How may I use it to calculate 512 point FFT ?
There are different ways to accomplish this, but the simplest is to replicate the input data 4 times, to obtain a signal of 2048 samples. Note that the DFT (which is what the FFT computes) can be seen as assuming the input signal being replicated infinitely. Thus, we are just providing a larger "view" of this infinitely long periodic signal.
The resulting FFT will have 512 non-zero values, with zeros in between. Each of the non-zero values will also be four times as large as the 512-point FFT would have produced, because there are four times as many input samples (that is, if the normalization is as commonly applied, with no normalization in the forward transform and 1/N normalization in the inverse transform).
Here is a proof of principle in MATLAB:
data = randn(1,512);
ft = fft(data); % 512-point FFT
data = repmat(data,1,4);
ft2 = fft(data); % 2048-point FFT
ft2 = ft2(1:4:end) / 4; % 512-point FFT
assert(all(ft2==ft))
(Very surprising that the values were exactly equal, no differences due to numerical precision appeared in this case!)
An alternate solution from the correct solution provided by Cris Luengo which does not require any rescaling is to pad the data with zeros to the required length of 2048 samples. You then get your result by reading every 2048/512 = 4 outputs (i.e. output[0], output[3], ... in a 0-based indexing system).
Since you mention making use of a hardware module, this could be implemented in hardware by connecting the first 512 input pins and grounding all other inputs, and reading every 4th output pin (ignoring all other output pins).
Note that this works because the FFT of the zero-padded signal is an interpolation in the frequency-domain of the original signal's FFT. In this case you do not need the interpolated values, so you can just ignore them. Here's an example computing a 4-point FFT using a 16-point module (I've reduced the size of the FFT for brievety, but kept the same ratio of 4 between the two):
x = [1,2,3,4]
fft(x)
ans> 10.+0.j,
-2.+2.j,
-2.+0.j,
-2.-2.j
x = [1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,0]
fft(x)
ans> 10.+0.j, 6.499-6.582j, -0.414-7.242j, -4.051-2.438j,
-2.+2.j, 1.808+1.804j, 2.414-1.242j, -0.257-2.3395j,
-2.+0.j, -0.257+2.339j, 2.414+1.2426j, 1.808-1.8042j,
-2.-2.j, -4.051+2.438j, -0.414+7.2426j, 6.499+6.5822j
As you can see in the second output, the first column (which correspond to output 0, 3, 7 and 11) is identical to the desired output from the first, smaller-sized FFT.

Does FFT neccessary to find peaks and pits on audio files

I'm able to read a wav files and its values. I need to find peaks and pits positions and their values. First time, i tried to smooth it by (i-1 + i + i +1) / 3 formula then searching on array as array[i-1] > array[i] & direction == 'up' --> pits style solution but because of noise and other reasons of future calculations of project, I'm tring to find better working area. Since couple days, I'm researching FFT. As my understanding, fft translates the audio files to series of sines and cosines. After fft operation the given values is a0's and a1's for a0 + ak * cos(k*x) + bk * sin(k*x) which k++ and x++ as this picture
http://zone.ni.com/images/reference/en-XX/help/371361E-01/loc_eps_sigadd3freqcomp.gif
My question is, does fft helps to me find peaks and pits on audio? Does anybody has a experience for this kind of problems?
It depends on exactly what you are trying to do, which you haven't really made clear. "finding the peaks and pits" is one thing, but since there might be various reasons for doing this there might be various methods. You already tried the straightforward thing of actually looking for the local maximum and minima, it sounds like. Here are some tips:
you do not need the FFT.
audio data usually swings above and below zero (there are exceptions, including 8-bit wavs, which are unsigned, but these are exceptions), so you must be aware of positive and negative values. Generally, large positive and large negative values carry large amounts of energy, though, so you want to count those as the same.
due to #2, if you want to average, you might want to take the average of the absolute value, or more commonly, the average of the square. Once you find the average of the squares, take the square root of that value and this gives the RMS, which is related to the power of the signal, so you might do something like this is you are trying to indicate signal loudness, intensity or approximate an analog meter. The average of absolutes may be more robust against extreme values, but is less commonly used.
another approach is to simply look for the peak of the absolute value over some number of samples, this is commonly done when drawing waveforms, and for digital "peak" meters. It makes less sense to look at the minimum absolute.
Once you've done something like the above, yes you may want to compute the log of the value you've found in order to display the signal in dB, but make sure you use the right formula. 10 * log_10( amplitude ) is not it. Rule of thumb: usually when computing logs from amplitude you will see a 20, not a 10. If you want to compute dBFS (the amount of "headroom" before clipping, which is the standard measurement for digital meters), the formula is -20 * log_10( |amplitude| ), where amplitude is normalize to +/- 1. Watch out for amplitude = 0, which gives an infinite headroom in dB.
If I understand you correctly, you just want to estimate the relative loudness/quietness of an audio digital sample at a given point.
For this estimation, you don't need to use FFT. However your method of averaging the signal does not produce the appropiate picture neither.
The digital signal is the value of the audio wave at a given moment. You need to find the overall amplitude of the signal at that given moment. You can somewhat see it as the local maximum value for a given interval around the moment you want to calculate. You may have a moving max for the signal and get your amplitude estimation.
At a 16 bit sound sample, the sound signal value can go from 0 up to 32767. At a 44.1 kHz sample rate, you can find peaks and pits of around 0.01 secs by finding the max value of 441 samples around a given t moment.
max=1;
for (i=0; i<441; i++) if (array[t*44100+i]>max) max=array[t*44100+i];
then for representing it on a 0 to 1 scale you (not really 0, because we used a minimum of 1)
amplitude = max / 32767;
or you might represent it in relative dB logarithmic scale (here you see why we used 1 for the minimum value)
dB = 20 * log10(amplitude);
all you need to do is take dy/dx, which can getapproximately by just scanning through the wave and and subtracting the previous value from the current one and look at where it goes to zero or changes from positive to negative
in this code I made it really brief and unintelligent for sake of brevity, of course you could handle cases of dy being zero better, find the 'centre' of a long section of a flat peak, that kind of thing. But if all you need is basic peaks and troughs, this will find them.
lastY=0;
bool goingup=true;
for( i=0; i < wave.length; i++ ) {
y = wave[i];
dy = y - lastY;
bool stillgoingup = (dy>0);
if( goingup != direction ) {
// changed direction - note value of i(place) and 'y'(height)
stillgoingup = goingup;
}
}

Can FFT length affect filtering accuracy?

I am designing a fractional delay filter, and my lagrange coefficient of order 5 h(n) have 6 taps in time domain. I have tested to convolute the h(n) with x(n) which is 5000 sampled signal using matlab, and the result seems ok. When I tried to use FFT and IFFT method, the output is totally wrong. Actually my FFT is computed with 8192 data in frequency domain, which is the nearest power of 2 for 5000 signal sample. For the IFFT portion, I convert back the 8192 frequency domain data back to 5000 length data in time domain. So, the problem is, why this thing works in convolution, but not in FFT multiplication. Does converting my 6 taps h(n) to 8192 taps in frequency domain causes this problem?
Actually I have tried using overlap-save method, which perform the FFT and multiplication with smaller chunks of x(n) and doing it 5 times separately. The result seems slight better than the previous, and at least I can see the waveform pattern, but still slightly distorted. So, any idea where goes wrong, and what is the solution. Thank you.
The reason I am implementing the circular convolution in frequency domain instead of time domain is, I am try to merge the Lagrange filter with other low pass filter in frequency domain, so that the implementation can be more efficient. Of course I do believe implement filtering in frequency domain will be much faster than convolution in time domain. The LP filter has 120 taps in time domain. Due to the memory constraints, the raw data including the padding will be limited to 1024 in length, and so with the fft bins.
Because my Lagrange coefficient has only 6 taps, which is huge different with 1024 taps. I doubt that the fft of the 6 taps to 1024 bins in frequency domain will cause error. Here is my matlab code on Lagrange filter only. This is just a test code only, not implementation code. It's a bit messy, sorry about that. Really appreciate if you can give me more advice on this problem. Thank you.
t=1:5000;
fs=2.5*(10^12);
A=70000;
x=A*sin(2*pi*10.*t.*(10^6).*t./fs);
delay=0.4;
N=5;
n = 0:N;
h = ones(1,N+1);
for k = 0:N
index = find(n ~= k);
h(index) = h(index) * (delay-k)./ (n(index)-k);
end
pad=zeros(1,length(h)-1);
out=[];
H=fft(hh,1024);
H=fft([h zeros(1,1024-length(h))]);
for i=0:1:ceil(length(x)/(1024-length(h)+1))-1
if (i ~= ceil(length(x)/(1024-length(h)+1))-1)
a=x(1,i*(1024-length(h)+1)+1:(i+1)*(1024-length(h)+1));
else
temp=x(1,i*(1024-length(h)+1)+1:length(x));
a=[temp zeros(1,1024-length(h)+1-length(temp))];
end
xx=[pad a];
X=fft(xx,1024);
Y=H.*X;
y=abs(ifft(Y,1024));
out=[out y(1,length(h):length(y))];
pad=y(1,length(a)+1:length(y));
end
Some comments:
The nearest power of two is actually 4096. Do you expect the remaining 904 samples to contribute much? I would guess that they are significant only if you are looking for relatively low-frequency features.
How did you pad your signal out to 8192 samples? Padding your sample out to 8192 implies that approximately 40% of your data is "fictional". If you used zeros to lengthen your dataset, you likely injected a step change at the pad point - which implies a lot of high-frequency content.
A short code snippet demonstrating your methods couldn't hurt.

Why do the convolution results have different lengths when performed in time domain vs in frequency domain?

I'm not a DSP expert, but I understand that there are two ways that I can apply a discrete time-domain filter to a discrete time-domain waveform. The first is to convolve them in the time domain, and the second is to take the FFT of both, multiply both complex spectrums, and take IFFT of the result. One key difference in these methods is the second approach is subject to circular convolution.
As an example, if the filter and waveforms are both N points long, the first approach (i.e. convolution) produces a result that is N+N-1 points long, where the first half of this response is the filter filling up and the 2nd half is the filter emptying. To get a steady-state response, the filter needs to have fewer points than the waveform to be filtered.
Continuing this example with the second approach, and assuming the discrete time-domain waveform data is all real (not complex), the FFT of the filter and the waveform both produce FFTs of N points long. Multiplying both spectrums IFFT'ing the result produces a time-domain result also N points long. Here the response where the filter fills up and empties overlap each other in the time domain, and there's no steady state response. This is the effect of circular convolution. To avoid this, typically the filter size would be smaller than the waveform size and both would be zero-padded to allow space for the frequency convolution to expand in time after IFFT of the product of the two spectrums.
My question is, I often see work in the literature from well-established experts/companies where they have a discrete (real) time-domain waveform (N points), they FFT it, multiply it by some filter (also N points), and IFFT the result for subsequent processing. My naive thinking is this result should contain no steady-state response and thus should contain artifacts from the filter filling/emptying that would lead to errors in interpreting the resulting data, but I must be missing something. Under what circumstances can this be a valid approach?
Any insight would be greatly appreciated
The basic problem is not about zero padding vs the assumed periodicity, but that Fourier analysis decomposes the signal into sine waves which, at the most basic level, are assumed to be infinite in extent. Both approaches are correct in that the IFFT using the full FFT will return the exact input waveform, and both approaches are incorrect in that using less than the full spectrum can lead to effects at the edges (that usually extend a few wavelengths). The only difference is in the details of what you assume fills in the rest of infinity, not in whether you are making an assumption.
Back to your first paragraph: Usually, in DSP, the biggest problem I run into with FFTs is that they are non-causal, and for this reason I often prefer to stay in the time domain, using, for example, FIR and IIR filters.
Update:
In the question statement, the OP correctly points out some of the problems that can arise when using FFTs to filter signals, for example, edge effects, that can be particularly problematic when doing a convolution that is comparable in the length (in the time domain) to the sampled waveform. It's important to note though that not all filtering is done using FFTs, and in the paper cited by the OP, they are not using FFT filters, and the problems that would arise with an FFT filter implementation do not arise using their approach.
Consider, for example, a filter that implements a simple average over 128 sample points, using two different implementations.
FFT: In the FFT/convolution approach one would have a sample of, say, 256, points and convolve this with a wfm that is constant for the first half and goes to zero in the second half. The question here is (even after this system has run a few cycles), what determines the value of the first point of the result? The FFT assumes that the wfm is circular (i.e. infinitely periodic) so either: the first point of the result is determined by the last 127 (i.e. future) samples of the wfm (skipping over the middle of the wfm), or by 127 zeros if you zero-pad. Neither is correct.
FIR: Another approach is to implement the average with an FIR filter. For example, here one could use the average of the values in a 128 register FIFO queue. That is, as each sample point comes in, 1) put it in the queue, 2) dequeue the oldest item, 3) average all of the 128 items remaining in the queue; and this is your result for this sample point. This approach runs continuously, handling one point at a time, and returning the filtered result after each sample, and has none of the problems that occur from the FFT as it's applied to finite sample chunks. Each result is just the average of the current sample and the 127 samples that came before it.
The paper that OP cites takes an approach much more similar to the FIR filter than to the FFT filter (note though that the filter in the paper is more complicated, and the whole paper is basically an analysis of this filter.) See, for example, this free book which describes how to analyze and apply different filters, and note also that the Laplace approach to analysis of the FIR and IIR filters is quite similar what what's found in the cited paper.
Here's an example of convolution without zero padding for the DFT (circular convolution) vs linear convolution. This is the convolution of a length M=32 sequence with a length L=128 sequence (using Numpy/Matplotlib):
f = rand(32); g = rand(128)
h1 = convolve(f, g)
h2 = real(ifft(fft(f, 128)*fft(g)))
plot(h1); plot(h2,'r')
grid()
The first M-1 points are different, and it's short by M-1 points since it wasn't zero padded. These differences are a problem if you're doing block convolution, but techniques such as overlap and save or overlap and add are used to overcome this problem. Otherwise if you're just computing a one-off filtering operation, the valid result will start at index M-1 and end at index L-1, with a length of L-M+1.
As to the paper cited, I looked at their MATLAB code in appendix A. I think they made a mistake in applying the Hfinal transfer function to the negative frequencies without first conjugating it. Otherwise, you can see in their graphs that the clock jitter is a periodic signal, so using circular convolution is fine for a steady-state analysis.
Edit: Regarding conjugating the transfer function, the PLLs have a real-valued impulse response, and every real-valued signal has a conjugate symmetric spectrum. In the code you can see that they're just using Hfinal[N-i] to get the negative frequencies without taking the conjugate. I've plotted their transfer function from -50 MHz to 50 MHz:
N = 150000 # number of samples. Need >50k to get a good spectrum.
res = 100e6/N # resolution of single freq point
f = res * arange(-N/2, N/2) # set the frequency sweep [-50MHz,50MHz), N points
s = 2j*pi*f # set the xfer function to complex radians
f1 = 22e6 # define 3dB corner frequency for H1
zeta1 = 0.54 # define peaking for H1
f2 = 7e6 # define 3dB corner frequency for H2
zeta2 = 0.54 # define peaking for H2
f3 = 1.0e6 # define 3dB corner frequency for H3
# w1 = natural frequency
w1 = 2*pi*f1/((1 + 2*zeta1**2 + ((1 + 2*zeta1**2)**2 + 1)**0.5)**0.5)
# H1 transfer function
H1 = ((2*zeta1*w1*s + w1**2)/(s**2 + 2*zeta1*w1*s + w1**2))
# w2 = natural frequency
w2 = 2*pi*f2/((1 + 2*zeta2**2 + ((1 + 2*zeta2**2)**2 + 1)**0.5)**0.5)
# H2 transfer function
H2 = ((2*zeta2*w2*s + w2**2)/(s**2 + 2*zeta2*w2*s + w2**2))
w3 = 2*pi*f3 # w3 = 3dB point for a single pole high pass function.
H3 = s/(s+w3) # the H3 xfer function is a high pass
Ht = 2*(H1-H2)*H3 # Final transfer based on the difference functions
subplot(311); plot(f, abs(Ht)); ylabel("abs")
subplot(312); plot(f, real(Ht)); ylabel("real")
subplot(313); plot(f, imag(Ht)); ylabel("imag")
As you can see, the real component has even symmetry and the imaginary component has odd symmetry. In their code they only calculated the positive frequencies for a loglog plot (reasonable enough). However, for calculating the inverse transform they used the values for the positive frequencies for the negative frequencies by indexing Hfinal[N-i] but forgot to conjugate it.
I can shed some light to the reason why "windowing" is applied before FFT is applied.
As already pointed out the FFT assumes that we have a infinite signal. When we take a sample over a finite time T this is mathematically the equivalent of multiplying the signal with a rectangular function.
Multiplying in the time domain becomes convolution in the frequency domain. The frequency response of a rectangle is the sync function i.e. sin(x)/x. The x in the numerator is the kicker, because it dies down O(1/N).
If you have frequency components which are exactly multiples of 1/T this does not matter as the sync function is zero in all points except that frequency where it is 1.
However if you have a sine which fall between 2 points you will see the sync function sampled on the frequency point. It lloks like a magnified version of the sync function and the 'ghost' signals caused by the convolution die down with 1/N or 6dB/octave. If you have a signal 60db above the noise floor, you will not see the noise for 1000 frequencies left and right from your main signal, it will be swamped by the "skirts" of the sync function.
If you use a different time window you get a different frequency response, a cosine for example dies down with 1/x^2, there are specialized windows for different measurements. The Hanning window is often used as a general purpose window.
The point is that the rectangular window used when not applying any "windowing function" creates far worse artefacts than a well chosen window. i.e by "distorting" the time samples we get a much better picture in the frequency domain which closer resembles "reality", or rather the "reality" we expect and want to see.
Although there will be artifacts from assuming that a rectangular window of data is periodic at the FFT aperture width, which is one interpretation of what circular convolution does without sufficient zero padding, the differences may or may not be large enough to swamp the data analysis in question.

Kohonen SOM Maps: Normalizing the input with unknown range

According to "Introduction to Neural Networks with Java By Jeff Heaton", the input to the Kohonen neural network must be the values between -1 and 1.
It is possible to normalize inputs where the range is known beforehand:
For instance RGB (125, 125, 125) where the range is know as values between 0 and 255:
1. Divide by 255: (125/255) = 0.5 >> (0.5,0.5,0.5)
2. Multiply by two and subtract one: ((0.5*2)-1)=0 >> (0,0,0)
The question is how can we normalize the input where the range is unknown like our height or weight.
Also, some other papers mention that the input must be normalized to the values between 0 and 1. Which is the proper way, "-1 and 1" or "0 and 1"?
You can always use a squashing function to map an infinite interval to a finite interval. E.g. you can use tanh.
You might want to use tanh(x * l) with a manually chosen l though, in order not to put too many objects in the same region. So if you have a good guess that the maximal values of your data are +/- 500, you might want to use tanh(x / 1000) as a mapping where x is the value of your object It might even make sense to subtract your guess of the mean from x, yielding tanh((x - mean) / max).
From what I know about Kohonen SOM, they specific normalization does not really matter.
Well, it might through specific choices for the value of parameters of the learning algorithm, but the most important thing is that the different dimensions of your input points have to be of the same magnitude.
Imagine that each data point is not a pixel with the three RGB components but a vector with statistical data for a country, e.g. area, population, ....
It is important for the convergence of the learning part that all these numbers are of the same magnitude.
Therefore, it does not really matter if you don't know the exact range, you just have to know approximately the characteristic amplitude of your data.
For weight and size, I'm sure that if you divide them respectively by 200kg and 3 meters all your data points will fall in the ]0 1] interval. You could even use 50kg and 1 meter the important thing is that all coordinates would be of order 1.
Finally, you could a consider running some linear analysis tools like POD on the data that would give you automatically a way to normalize your data and a subspace for the initialization of your map.
Hope this helps.

Resources