While using FFT sample code from Apple documentation, what actually does the N, log2n, n and nOver2 mean?
Does N refer to the window size of the fft or the whole number of samples in a given audio, and
how do I calculate N from an audio file?
how are they related to the audio sampling rate i.e. 44.1kHz?
What would be the FFT frame size in this code?
Code:
/* Set the size of FFT. */
log2n = N;
n = 1 << log2n;
stride = 1;
nOver2 = n / 2;
printf("1D real FFT of length log2 ( %d ) = %d\n\n", n, log2n);
/* Allocate memory for the input operands and check its availability,
* use the vector version to get 16-byte alignment. */
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
originalReal = (float *) malloc(n * sizeof(float));
obtainedReal = (float *) malloc(n * sizeof(float));
N or n typically refers to the number of elements. log2n is the base-two logarithm of n. (The base-two logarithm of 32 is 5.) nOver2 is n/2, n divided by two.
In the context of an FFT, n is the number of samples being fed into the FFT.
n is usually determined by a variety of constraints. You want more samples to provide a better quality result, but you do not want so many samples that processing takes up a lot of computer time or that the result is not available until so late that the user notices a lag. Usually, it is not the length of an audio file that determines the size. Rather, you design a “window” that you will use for processing, then you read samples from the audio file into a buffer big enough to hold your window, then you process the buffer, then you repeat with more samples from the file. Repetitions continue until the entire file is processed.
A higher audio sampling rate means there will be more samples in a given period of time. E.g., if you want to keep your window under 1/30th of a second, then a 44.1 kHz sampling rate will have less than 44.1•1000/30 = 1470 samples. A higher sampling rate means you have more work to do, so you may need to adjust your window size to keep the processing within limits.
That code uses N for log2n, which is unfortunate, since it may confuse people. Otherwise, the code is as I described above, and the FFT frame size is n.
There can be some confusion about FFT size or length when a mix of real data and complex data is involved. Typically, for a real-to-complex FFT, the number of real elements is said to be the length. When doing a complex-to-complex FFT, the number of complex elements is the length.
'N' is the number of samples, i.e., your vector size. Corresponding, 'log2N' is the logarithm of 'N' with the base 2, and 'nOver2' is the half of 'N'.
To answer the other questions, one must know, what do you want to do with FFT. This document, even it is written with a specific system in mind, can serve as an survey about the relation and the meaning of the parameters in (D)FFT.
Related
Using channel modeling software(Quadriga), I am calculating the spectral efficiency(bps/Hz) of 2x2 a MIMO system A being higher than the respective capacity of a 2x2 MIMO system B for a fixed frequency. However system A has lower isolation (mutual coupling-S21) than system B. Why is that? The software takes as input the Realized Gain pattern. It also takes as an input a coupling matrix which I used the S parameters of each system respectively.
System A for the given frequnecy s21=-20dB||
System B for the given frequnecy s21=-40dB
The capacity of a MIMO system depends on two factors: The overall SNR (averaged over all MIMO sublinks) and the "orthogonality" of the channel matrix. The latter can be determined by a singular-value decomposition of the matrix. If all singular values are equal, the matrix is orthogonal.
In the 2x2 case, you would be able to transmit 2 data-streams in parallel, i.e. the capacity scales linear with the number of streams (and logarithmic with the SNR). However, perfect orthogonality is rarely achieved. If you use identical channel model settings (multipath structure, frequency, bandwidth, power, etc.) and only adjust the coupling between the antennas ports, then a lower isolation would result in a lower capacity due to a higher correlation in the channel coefficients. This can be seen in the following code example (using QuaDRiGa v2.2):
clear all
iso = 0:5:40
for n = 1:numel(iso)
% Keep identical random ssed
RandStream.setGlobalStream(RandStream('mt19937ar','seed',1));
l = qd_layout; % Quadriga Layout
l.randomize_rx_positions(100,1.5,1.5,0); % Random receiver position
l.tx_array = qd_arrayant('ula2'); % ULA antenna
l.tx_array.coupling = [ 1,sqrt( 10^(-iso(n)/10) ); 0,1 ];
l.rx_array = l.tx_array.copy;
l.set_scenario('3GPP_38.901_UMi_NLOS');
c = l.get_channels; % Channel coefficients
H = c.fr(1e9,1); % MIMO matrix
P = mean(abs(H(:).^2)); % Average Power
H = H./sqrt(P); % Normalize Power
C(n) = log2(abs(det( eye(2) + 10/2 * (H*H') ))); % MIMO Capacity # 10 dB SNR
end
plot(iso,C,'-o')
I have a 2048 point FFT IP. How may I use it to calculate 512 point FFT ?
There are different ways to accomplish this, but the simplest is to replicate the input data 4 times, to obtain a signal of 2048 samples. Note that the DFT (which is what the FFT computes) can be seen as assuming the input signal being replicated infinitely. Thus, we are just providing a larger "view" of this infinitely long periodic signal.
The resulting FFT will have 512 non-zero values, with zeros in between. Each of the non-zero values will also be four times as large as the 512-point FFT would have produced, because there are four times as many input samples (that is, if the normalization is as commonly applied, with no normalization in the forward transform and 1/N normalization in the inverse transform).
Here is a proof of principle in MATLAB:
data = randn(1,512);
ft = fft(data); % 512-point FFT
data = repmat(data,1,4);
ft2 = fft(data); % 2048-point FFT
ft2 = ft2(1:4:end) / 4; % 512-point FFT
assert(all(ft2==ft))
(Very surprising that the values were exactly equal, no differences due to numerical precision appeared in this case!)
An alternate solution from the correct solution provided by Cris Luengo which does not require any rescaling is to pad the data with zeros to the required length of 2048 samples. You then get your result by reading every 2048/512 = 4 outputs (i.e. output[0], output[3], ... in a 0-based indexing system).
Since you mention making use of a hardware module, this could be implemented in hardware by connecting the first 512 input pins and grounding all other inputs, and reading every 4th output pin (ignoring all other output pins).
Note that this works because the FFT of the zero-padded signal is an interpolation in the frequency-domain of the original signal's FFT. In this case you do not need the interpolated values, so you can just ignore them. Here's an example computing a 4-point FFT using a 16-point module (I've reduced the size of the FFT for brievety, but kept the same ratio of 4 between the two):
x = [1,2,3,4]
fft(x)
ans> 10.+0.j,
-2.+2.j,
-2.+0.j,
-2.-2.j
x = [1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,0]
fft(x)
ans> 10.+0.j, 6.499-6.582j, -0.414-7.242j, -4.051-2.438j,
-2.+2.j, 1.808+1.804j, 2.414-1.242j, -0.257-2.3395j,
-2.+0.j, -0.257+2.339j, 2.414+1.2426j, 1.808-1.8042j,
-2.-2.j, -4.051+2.438j, -0.414+7.2426j, 6.499+6.5822j
As you can see in the second output, the first column (which correspond to output 0, 3, 7 and 11) is identical to the desired output from the first, smaller-sized FFT.
I need to do the following two operations:
float x[4];
float y[16];
// 1-to-4 broadcast
for ( int i = 0; i < 16; ++i )
y[i] = x[i / 4];
// 4-to-1 reduce-add
for ( int i = 0; i < 16; ++i )
x[i / 4] += y[i];
What would be an efficient AVX-512 implementation?
For the reduce-add, just do in-lane shuffles and adds (vmovshdup / vaddps / vpermilps imm8/vaddps) like in Fastest way to do horizontal float vector sum on x86 to get a horizontal sum in each 128-bit lane, and then vpermps to shuffle the desired elements to the bottom. Or vcompressps with a constant mask to do the same thing, optionally with a memory destination.
Once packed down to a single vector, you have a normal SIMD 128-bit add.
If your arrays are actually larger than 16, instead of vpermps you could vpermt2ps to take every 4th element from each of two source vectors, setting you up for doing the += part with into x[] 256-bit vectors. (Or combine again with another shuffle into 512-bit vectors, but that will probably bottleneck on shuffle throughput on SKX).
On SKX, vpermt2ps is only a single uop, with 1c throughput / 3c latency, so it's very efficient for how powerful it is. On KNL it has 2c throughput, worse than vpermps, but maybe still worth it. (KNL doesn't have AVX512VL, but for adding to x[] with 256-bit vectors you (or a compiler) can use AVX1 vaddps ymm if you want.)
See https://agner.org/optimize/ for instruction tables.
For the load:
Is this done inside a loop, or repeatedly? (i.e. can you keep a a shuffle-control vector in a register? If so, you could
do a 128->512 broadcast with VBROADCASTF32X4 (single uop for a load port).
do an in-lane shuffle with vpermilps zmm,zmm,zmm to broadcast a different element within each 128-bit lane. (Has to be separate from the broadcast-load, because a memory-source vpermilps can either have a m512 or m32bcst source. (Instructions typically have their memory broadcast granularity = their element size, unfortunately in some cases like this where it's not at all useful. And vpermilps takes the control vector as a memory operand, not the source data.)
This is slightly better than vpermps zmm,zmm,zmm because the shuffle has 1 cycle latency instead of 3 (on Skylake-avx512).
Even outside a loop, loading a shuffle-control vector might still be your best bet.
I'm using Apples vDSP APIs to calculate the FFT of audio. However, my results (in amp[]) aren't symmetrical around N/2, which they should be, from my understanding of FFTs on real inputs?
In the below frame is an array[128] of floats containing the audio samples.
int numSamples = 128;
vDSP_Length log2n = log2f(numSamples);
FFTSetup fftSetup = vDSP_create_fftsetup(log2n, FFT_RADIX2);
int nOver2 = numSamples/2;
COMPLEX_SPLIT A;
A.realp = (float *) malloc(nOver2*sizeof(float));
A.imagp = (float *) malloc(nOver2*sizeof(float));
vDSP_ctoz((COMPLEX*)frame, 2, &A, 1, nOver2);
//Perform FFT using fftSetup and A
//Results are returned in A
vDSP_fft_zrip(fftSetup, &A, 1, log2n, FFT_FORWARD);
//Convert COMPLEX_SPLIT A result to float array to be returned
float amp[numSamples];
amp[0] = A.realp[0]/(numSamples*2);
for(int i=1;i<numSamples;i++) {
amp[i]=A.realp[i]*A.realp[i]+A.imagp[i]*A.imagp[i];
printf("%f ",amp[i]);
}
If I put the same float array into an online FFT calculator I do get a symmetrical output. Am I doing something wrong above?
For some reason, most values in amp[] are around 0 to 1e-5, but I also get one huge value of about 1e23. I'm not doing any windowing here, just trying to get a basic FFT working initially.
I've attached a picture of the two FFT outputs, using the same data. You can see they are similar upto 64, although not by a constant scaling factor, so I'm not sure what they are different by. Then over 64 they are completely different.
Because the mathematical output of a real-to-complex FFT is symmetrical, there is no value in returning the second half. There is also no space for it in the array that is passed to vDSP_fft_zrip. So vDSP_fft_zrip returns only the first half (except for the special N/2 point, discussed below). The second half is usually not needed explicitly and, if it is, you can compute it easily from the first half.
The output of vDSP_fft_zrip when used for a forward (real to complex) transformation has the H0 output (which is purely real; its imaginary part is zero) in A.realp[0]. The HN/2 output (which is also purely real) is stored in A.imagp[0]. The remaining values Hi, for 0 < i < N/2, are stored normally in A.realp[i] and A.imagp[i].
Documentation explaining this is here, in the section “Data Packing for Real FFTs”.
To get symmetric results from strictly real inout to a basic FFT, your complex data input and output arrays have to be the same length as your FFT. You seem to be allocating and copying only half your data into the FFT input, which could be feeding non-real memory garbage to the FFT.
I have an array of 240 data points sampled at 600hz, representing 400ms. I need to resample this data to 512 data points sampled at 1024hz, representing 500ms. I assume since I'm starting with 400ms of data, the last 100ms will just need to be padded with 0s.
Is there a best approach to take to accomplish this?
If you want to avoid interpolation then you need to upsample to a 76.8 kHz sample rate (i.e. insert 127 0s after every input sample), low pass filter, then decimate (drop 74 out of every 75 samples).
You can use windowed Sinc interpolation, which will give you the same result as upsampling and downsampling using a linear phase FIR low-pass filter with a windowed Sinc impulse response. When using a FIR filter, one normally has to pad a signal with zeros the length of the FIR filter kernel on both sides.
Added:
Another possibility is to zero pad 240 samples with 60 zeros, apply a non-power-of-2 FFT of length 300, "center" zero pad the FFT result with 212 complex zeros to make it 512 long, but with the identical spectrum, and do an IFFT of length 512 to get the resampled result.
Yes to endolith's response, if you want to interpolate x[n] by simply computing the FFT, zero-stuff, and then IFFT, you'll get errors if x[n] is not periodic. See this reference: http://www.embedded.com/design/other/4212939/Time-domain-interpolation-using-the-Fast-Fourier-Transform-
FFT based resampling/upsampling is pretty easy...
If you can use python, scipy.signal.resample should work.
For C/C++, there is a simple fftw trick to upsample if you have real (as opposed to complex) data.
nfft = the original data length
upnfft = the new data length
double * data = the original data
// allocate
fftw_complex * tmp_fd = (fftw_complex*)fftw_malloc((upnfft/2+1)*sizeof(fftw_complex));
double * result = (double*)fftw_malloc(upnfft*sizeof(double));
// create fftw plans
fftw_plan fft_plan = fftw_plan_dft_r2c_1d(nfft, data, tmp_fd, FFTW_ESTIMATE);
fftw_plan ifft_plan = fftw_plan_dft_c2r_1d(upnfft, tmp_fd, result, FFTW_ESTIMATE);
// zero out tmp_fd
memset(tmp_fd, 0, (upnfft/2+1)*sizeof(fftw_complex));
// execute the plans (forward then reverse)
fftw_execute_dft_r2c(fft_plan, data, tmp_fd);
fftw_execute_dft_c2r(ifft_plan, tmp_fd, result);
// cleanup
fftw_free(tmp_fd);