vDSP: Do the FFT functions include windowing? - signal-processing

I am working on implementing an algorithm using vDSP.
1) take FFT
2) take log of square of absolute value (can be done with lookup table)
3) take another FFT
4) take absolute value
I'm not sure if it is up to me to throw the incoming data through a windowing function before I run the FFT on it.
vDSP_fft_zrip(setupReal, &A, stride, log2n, direction);
that is my FFT function
Do I need to throw the data through vDSP_hamm_window(...) first?

The iOS Accelerate library function vDSP_fft_zrip() does not include applying a window function (unless you count the implied rectangular window due to the finite length parameter).
So you need to apply your chosen window function (there are many different ones) first.

It sounds like you're doing cepstral analysis and yes, you do need a window function prior to the first FFT. I would suggest a simple Hann or Hamming window.

I don't have any experience with your particular library, but in every other FFT library I know of it's up to you to window the data first. If nothing else, the library can't know what window you wish to use, and sometimes you don't want to use a window (if you're using the FFT for overlap-add filtering, or if you know the signal is exactly periodic in the transform block).
Also, just offhand, it seems like if you're doing 2 FFTs, the overhead of calling a logarithm function is relatively minor.

Related

How to use apple's Accelerate framework in swift in order to compute the FFT of a real signal?

How am I supposed to use the Accelerate framework to compute the FFT of a real signal in Swift on iOS?
Available example on the web
Apple’s Accelerate framework seems to provide functions to compute the FFT of a signal efficiently.
Unfortunately, most of the examples available on the Internet, like Swift-FFT-Example and TempiFFT, crash if tested extensively and call the Objective C API.
The Apple documentation answers many questions, but also leads to some others (Is this piece mandatory? Why do I need this call to convert?).
Threads on Stack Overflow
There are few threads addressing various aspects of the FFT with concrete examples. Notably FFT Using Accelerate In Swift, DFT result in Swift is different than that of MATLAB and FFT Calculating incorrectly - Swift.
None of them address directly the question “What is the proper way to do it, starting from 0”?
It took me one day to figure out how to properly do it, so I hope this thread can give a clear explanation of how you are supposed to use Apple's FFT, show what are the pitfalls to avoid, and help developers save precious hours of their time.
TL ; DR : If you need a working implementation to copy past here is a gist.
What is FFT?
The Fast Fourier transform is an algorithm that take a signal in the time domain -- a collection of measurements took at a regular, usual small, interval of time -- and turn it into a signal expressed into the phase domain (a collection of frequency).
The ability to express the signal along time lost by the transformation (the transformation is invertible, which means no information is lost by computing the FFT and you can apply a IFFT to get the original signal back), but we get the ability to distinguish between frequencies that the signal contained. This is typically used to display the spectrograms of the music you are listening to on various hardware and youtube videos.
The FFT works with complexe numbers. If you don't know what they are, lets just pretend it is a combination of a radius and an angle. There is one complex number per point on a 2D plane. Real numbers (your usual floats) can be saw as a position on a line (negative on the left, positive on the right).
Nb: FFT(FFT(FFT(FFT(X))) = X (up to a constant depending on your FFT implementation).
How to compute the FFT of a real signal.
Usual you want to compute the FFT of a small window of an audio signal. For sake of the example, we will take a small 1024 samples window. You would also prefer to use a power of two, otherwise things gets a little bit more difficult.
var signal: [Float] // Array of length 1024
First, you need to initialize some constants for the computation.
// The length of the input
length = vDSP_Length(signal.count)
// The power of two of two times the length of the input.
// Do not forget this factor 2.
log2n = vDSP_Length(ceil(log2(Float(length * 2))))
// Create the instance of the FFT class which allow computing FFT of complex vector with length
// up to `length`.
fftSetup = vDSP.FFT(log2n: log2n, radix: .radix2, ofType: DSPSplitComplex.self)!
Following apple's documentation, we first need to create a complex array that will be our input.
Dont get mislead by the tutorial. What you usual want is to copy your signal as the real part of the input, and keep the complex part null.
// Input / Output arrays
var forwardInputReal = [Float](signal) // Copy the signal here
var forwardInputImag = [Float](repeating: 0, count: Int(length))
var forwardOutputReal = [Float](repeating: 0, count: Int(length))
var forwardOutputImag = [Float](repeating: 0, count: Int(length))
Be careful, the FFT function do not allow to use the same splitComplex as input and output at the same time. If you experience crashs, this may be the cause. This is why we define both an input and an output.
Now, we have to be careful and "lock" the pointer to this four arrays, as showed in the documentation example. If you simply use &forwardInputReal as argument of your DSPSplitComplex, the pointer may become invalidated at the following line and you will likely experience sporadic crash of your app.
forwardInputReal.withUnsafeMutableBufferPointer { forwardInputRealPtr in
forwardInputImag.withUnsafeMutableBufferPointer { forwardInputImagPtr in
forwardOutputReal.withUnsafeMutableBufferPointer { forwardOutputRealPtr in
fforwardOutputImag.withUnsafeMutableBufferPointer { forwardOutputImagPtr in
// Input
let forwardInput = DSPSplitComplex(realp: forwardInputRealPtr.baseAddress!, imagp: forwardInputImagPtr.baseAddress!)
// Output
var forwardOutput = DSPSplitComplex(realp: forwardOutputRealPtr.baseAddress!, imagp: forwardOutputImagPtr.baseAddress!)
// FFT call goes here
}
}
}
}
Now, the finale line: the call to your fft:
fftSetup.forward(input: forwardInput, output: &forwardOutput)
The result of your FFT is now available in forwardOutputReal and forwardOutputImag.
If you want only the amplitude of each frequency, and you don't care about the real and imaginary part, you can declare alongside the input and output an additional array:
var magnitudes = [Float](repeating: 0, count: Int(length))
add right after your fft compute the amplitude of each "bin" with:
vDSP.absolute(forwardOutput, result: &magnitudes)

Python: time stretch wave files - comparison between three methods

I'm doing some data augmentation on a speech dataset, and I want to stretch/squeeze each audio file in the time domain.
I found the following three ways to do that, but I'm not sure which one is the best or more optimized way:
dimension = int(len(signal) * speed)
res = librosa.effects.time_stretch(signal, speed)
res = cv2.resize(signal, (1, dimension)).squeeze()
res = skimage.transform.resize(signal, (dimension, 1)).squeeze()
However, I found that librosa.effects.time_stretch adds unwanted echo (or something like that) to the signal.
So, my question is: What are the main differences between these three ways? And is there any better way to do that?
librosa.effects.time_stretch(signal, speed) (docs)
In essence, this approach transforms the signal using stft (short time Fourier transform), stretches it using a phase vocoder and uses the inverse stft to reconstruct the time domain signal. Typically, when doing it this way, one introduces a little bit of "phasiness", i.e. a metallic clang, because the phase cannot be reconstructed 100%. That's probably what you've identified as "echo."
Note that while this approach effectively stretches audio in the time domain (i.e., the input is in the time domain as well as the output), the work is actually being done in the frequency domain.
cv2.resize(signal, (1, dimension)).squeeze() (docs)
All this approach does is interpolating the given signal using bilinear interpolation. This approach is suitable for images, but strikes me as unsuitable for audio signals. Have you listened to the result? Does it sound at all like the original signal only faster/slower? I would assume not only the tempo changes, but also the frequency and perhaps other effects.
skimage.transform.resize(signal, (dimension, 1)).squeeze() (docs)
Again, this is meant for images, not sound. Additionally to the interpolation (spline interpolation with the order 1 by default), this function also does anti-aliasing for images. Note that this has nothing to do with avoiding audio aliasing effects (Nyqist/Aliasing), therefore you should probably turn that off by passing anti_aliasing=False. Again, I would assume that the results may not be exactly what you want (changing frequencies, other artifacts).
What to do?
IMO, you have several options.
If what you feed into your ML algorithms ends up being something like a Mel spectrogram, you could simply treat it as image and stretch it using the skimage or opencv approach. Frequency ranges would be preserved. I have successfully used this kind of approach in this music tempo estimation paper.
Use a better time_stretch library, e.g. rubberband. librosa is great, but its current time scale modification (TSM) algorithm is not state of the art. For a review of TSM algorithms, see for example this article.
Ignore the fact that the frequency changes and simply add 0 samples on a regular basis to the signal or drop samples on a regular basis from the signal (much like your image interpolation does). If you don't stretch too far it may still work for data augmentation purposes. After all the word content is not changed, if the audio content has higher or lower frequencies.
Resample the signal to another sampling frequency, e.g. 44100 Hz -> 43000 Hz or 44100 Hz -> 46000 Hz using a library like resampy and then pretend that it's still 44100 Hz. This still change the frequencies, but at least you get the benefit that resampy does proper filtering of the result so that you avoid the aforementioned aliasing, which otherwise occurs.

Frequency Modulation

im trying to make frequency modulation. But could anyone explain, what about non-sinusoidal (and maybe non-periodic) carrier? Could we assume some FM( A(t), B(t) ) function, which modulates carrier, given by ABSTRACT(non-sinusoidal) function A(t) with signal, given by abstract function B(t)? could anyone write/explain something about that? what will formula look like in that most common case? I want some kind of recursive formula in terms, like "A(t-1)". Or either some explanation, if that is not possible and why.
Frequency modulation (FM)
proposes some kind of "varying playback speed" - but seems it does something wrong.
so i am repeating asking "how?"
Well, for a non-sinusoidal but periodic carrier you could easer use a look-up table as suggested by the answer of Paul R, or you could break up the periodic carrier into its Fourier modes, create an individual oscillator for each mode, modulate each one and sum then up.
In the case of a non-periodic signal the phase or frequency is not defined in general. Just think of noise, how should that be modulated? You would need to define what frequency modulation means for arbitrary signals.
If you are using a look-up table for your waveform generation then it's pretty easy to modify the standard phase accumulator synthesis method to add an FM input. See e.g. this answer.

What is an ideal domain for FFT?

I am using the Jtransforms library which seems to be wicked fast for my purpose.
At this point I think I have a pretty good handle on how FFT works so now I am wondering if there is any form of a standard domain which is used for audio visualizations like spectograms?
Thanks to android's native FFT in 2.3 I had been using bytes as the range although I am still unclear as to whether it is signed or not. (I know java doesn't have unsigned bytes, but Google implemented these functions natively and the waveform is PCM 8bit unsigned)
However I am adapting my app to work with mic audio and 2.1 phones. At this point having the input domain being in the range of bytes whether it is [-128, 127] or [0, 255] no longer seems quite optimal.
I would like the range of my FFT function to be [0,1] so that I can scale it easily.
So should I use a domain of [-1, 1] or [0, 1]?
Essentially, the input domain does not matter. At most, it causes an offset and a change in scaling on your original data, which will be turned into an offset on bin #0 and an overall change in scaling on your frequency-domain results, respectively.
As to limiting your FFT output to [0,1]; that's essentially impossible. In general, the FFT output will be complex, there's no way to manipulate your input data so that the output is restricted to positive real numbers.
If you use DCT instead of FFT your output range will be real. (Read about the difference and decide if DCT is suitable for your application.)
FFT implementations for real numbers (as input domain) use half the samples for the output range (since there are only even results when the input is real), therefore the fact you have both real and imaginary parts for each sample doesn't effect the size of the result (vs the size of the source) much (output size is ceil(n/2)*2).

Advantages of cv::Matx

I noticed that a new data structure cv::Matx was added to the new OpenCV version, intended for small matrices of known size at compilation time, for example
cv::Matx31f // matrix 3x1 of float type
Checking the documentation I saw that most of matrix operations are available, but still I don't see the advantages of using this new type instead of the old cv::Mat.
When should I use Matx instead of Mat?
Short answer: cv::Mat uses the heap to store its data, while cv::Matx uses the stack.
A cv::Mat uses dynamic memory allocation (on the heap). This is appropriate for big matrices (like images) and lets you do things like shallow copies of a matrix, which is the default behavior of cv::Mat.
However, for the small matrices that cv::Matx is designed for, heap allocation would be very expensive compared to doing the same thing on the stack. I have seen a block of math reduce processing time by over 75% by switching to using stack-allocated types (e.g. cv::Point and cv::Matx) instead of cv::Mat.
It's about memory management and not wasting (in some cases important) memory or just reservation of memory for an object you'll use later.
That's how I understand it – may be someone else can give a better explanation.
This is a late late answer, but it is still an interesting question!
dom's answer is quite accurate, and the heap/stack reference in user1460044's is also interesting.
From a practical point of view, I wouldn't use Matx (or Vec), except if it were completely necessary. The major advantages of Matx are
Using the stack (efficient![1])
Initialization.
The problem is, at the end you will have to move your Matx data to a Mat to do most of stuff, and so, you will be back at the heap again.
On the other hand, the "cool initialization" of a Matx can be done in a normal Mat:
// Matx initialization:
Matx31f A(1.f,2.f,3.f);
// Mat initialization:
Mat B = (Mat_<float>(3,1) << 1.f, 2.f, 3.f);
Also, there is a difference in initialization (beyond the heap/stack) stuff. If you try to put 5 values into the Matx31, it will crash (runtime exception), while calling the Mat_::operator<< with 5 values will only store the first three.
[1] Efficient if your program has to create lots of matrices of less than ~10 elements. In that case use Matx matrices.
There are 2 other reasons why I prefer Matx to Mat:
Readability: people reading the code can immediately see the size of the matrices, for example:
cv::Matx34d transform = ...;
It's clear that this is a 3x4 matrix, so it contains a 3D transformation of type (R,t), where R is a rotation matrix (as opposed to say, axis-angle).
Similarly, accessing an element is more natural with transform(i,j) vs transform.at<double>(i,j).
Easy debugging. Since the elements for Matx are allocated on the stack in an array of known length, IDEs or debuggers can display the entire contents nicely when stepping through the code.

Resources