I'm looking for a way to get the treble and bass levels of a song at a regular time increment (say every 0.1 seconds), each scaled to the range 0.0 to 1.0. I've googled around but haven't been able to find anything remotely close to what I'm looking for. Ultimately I want to be able to display the treble and bass levels while the song is playing.
Thanks!
It's reasonably easy. You need to perform an FFT and then sum up the bins that interest you. Which bins you select depends on the sampling rate of your audio.
You then need to choose an appropriate FFT order to get good information in the frequency bins returned.
So if you do an order 8 FFT you will need 256 samples. For real input this gives you 128 useful complex values (the other half of the output is just their mirror image).
Next you need to convert these to magnitudes. This is actually quite simple: if you are using std::complex you can simply call std::abs on the complex number and you will have its magnitude (sqrt( r^2 + i^2 )).
Interestingly, at this point there is something called Parseval's theorem. It states that the sum of the squared magnitudes of the bins returned by a Fourier transform equals (up to a scale factor that depends on the FFT length) the sum of the squares of the input signal.
This means that to get the amplitude of a specific set of bins you can square their magnitudes, add them together, divide by the number of bins and then take the square root to get an RMS amplitude value for those bins.
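For reference, here is the relation written out. This assumes an unnormalised forward DFT of N real samples; FFT libraries differ in their scaling, so check the one you use. B is the set of bins in a band (taken from bins 0 to N/2), and c_k is 2 for bins that have a mirror image in the upper half of the spectrum and 1 for the DC and Nyquist bins.

```latex
\sum_{n=0}^{N-1} x[n]^2 \;=\; \frac{1}{N}\sum_{k=0}^{N-1} \lvert X[k]\rvert^2
\qquad\Longrightarrow\qquad
\mathrm{RMS}_{B} \;=\; \sqrt{\frac{1}{N^2}\sum_{k \in B} c_k\,\lvert X[k]\rvert^2}
```

With this scaling a full-scale sine that falls entirely inside the band gives an RMS of 1/sqrt(2). Dividing by the number of bins instead only changes the overall scale, not the relative levels over time.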
So where does this leave you?
Well from here you need to figure out which bins you are adding together.
A treble tone is defined as above 2000Hz.
A bass tone is below 300Hz (if my memory serves me correctly).
Mids are between 300Hz and 2kHz.
Now suppose your sample rate is 8kHz. The Nyquist theorem says that the highest frequency you can represent at 8kHz sampling is 4kHz. Each bin thus represents 4000/128 or 31.25Hz.
So the first 10 bins (bins 0 to 9, up to 312.5Hz) are used for bass frequencies. Bins 10 to 63 represent the mids. Finally bins 64 to 127 are the trebles.
You can then calculate the RMS value of each band as described above.
RMS values can be converted to dBFS values by performing 20.0f * log10f( rmsVal ). This will return you a value from 0dB (max amplitude) down to -infinity dB (min amplitude). Be aware amplitudes do not range from -1 to 1.
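To make the band arithmetic concrete, here is a minimal Python sketch of the steps above for the 8kHz / 256-sample example. It is only an illustration: the function names are made up, it uses a naive DFT from the standard library instead of a real FFT, and it normalises by N^2 (Parseval with an unnormalised transform) rather than by the bin count, so that a sine of amplitude A in a band reads A/sqrt(2).

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitudes of bins 0 .. N/2-1 via a naive O(N^2) DFT (use a real FFT in practice)."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples)))
        for k in range(n // 2)
    ]

def band_rms(mags, first_bin, last_bin, n):
    """RMS of the signal content in bins first_bin..last_bin (inclusive)."""
    # Parseval with an unnormalised DFT: mean square = sum(|X|^2) / N^2.
    # Every bin except DC has a mirror image in the upper half, so count it twice.
    power = sum((1 if k == 0 else 2) * mags[k] ** 2
                for k in range(first_bin, last_bin + 1)) / (n * n)
    return math.sqrt(power)

def to_dbfs(rms):
    """Convert an RMS value to dBFS; silence maps to -infinity."""
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")

SAMPLE_RATE = 8000
N = 256  # order-8 FFT, 31.25 Hz per bin

# Test tone: 1 kHz sine at half of full scale; 1000 Hz is exactly bin 32 (a mid).
signal = [0.5 * math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(N)]
mags = dft_magnitudes(signal)

bass = band_rms(mags, 0, 9, N)       # up to 312.5 Hz
mids = band_rms(mags, 10, 63, N)     # 312.5 Hz to 2 kHz
treble = band_rms(mags, 64, 127, N)  # 2 kHz to 4 kHz
print(bass, mids, treble, to_dbfs(mids))
```

For a 0.0 to 1.0 display value you would still have to pick a mapping yourself, for example clamping the dBFS value to some range such as -60..0 and rescaling it.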
To help you along, here is a bit of my C++ based FFT class for iPhone (which uses vDSP under the hood):
MacOSFFT::MacOSFFT( unsigned int fftOrder ) :
    BaseFFT( fftOrder )
{
    mFFTSetup = (void*)vDSP_create_fftsetup( mFFTOrder, 0 );
    mImagBuffer.resize( 1 << mFFTOrder );
    mRealBufferOut.resize( 1 << mFFTOrder );
    mImagBufferOut.resize( 1 << mFFTOrder );
}

MacOSFFT::~MacOSFFT()
{
    vDSP_destroy_fftsetup( (FFTSetup)mFFTSetup );
}

bool MacOSFFT::ForwardFFT( std::vector< std::complex< float > >& outVec, const std::vector< float >& inVec )
{
    return ForwardFFT( &outVec.front(), &inVec.front(), inVec.size() );
}

bool MacOSFFT::ForwardFFT( std::complex< float >* pOut, const float* pIn, unsigned int num )
{
    // Bring in a pre-allocated imaginary buffer that is initialised to 0.
    DSPSplitComplex dspscIn;
    dspscIn.realp = (float*)pIn;
    dspscIn.imagp = &mImagBuffer.front();

    DSPSplitComplex dspscOut;
    dspscOut.realp = &mRealBufferOut.front();
    dspscOut.imagp = &mImagBufferOut.front();

    vDSP_fft_zop( (FFTSetup)mFFTSetup, &dspscIn, 1, &dspscOut, 1, mFFTOrder, kFFTDirection_Forward );
    vDSP_ztoc( &dspscOut, 1, (DSPComplex*)pOut, 1, num );
    return true;
}
It seems that you're looking for Fast Fourier Transform sample code.
It is quite a large topic to cover in an answer.
The tools you will need are already built into iOS: the vDSP API.
This should help you: vDSP Programming Guide
There is also FFT sample code available.
You might also want to check out iPhoneFFT. Though that code is slightly outdated, it can help you understand the processes "under the hood".
Refer to the auriotouch2 example from Apple - it has everything from frequency analysis to UI representation of what you want.
I want to use the medianBlur function with a very high ksize, like 301 or more. But if I pass too high a ksize, the function sometimes crashes. The error message is:
OpenCV Error: (k < 16) in cv::medianBlur_8u_O1, in file ../opencv\modules\imgproc\src\smooth.cpp
(I use opencv4nodejs, but I also tried the original OpenCV 3.4.6).
I tried reducing the ksize in a try/catch loop, but that is not very effective, since I have to work with videos.
I checked out the OpenCV source code and did some research.
In OpenCV 3.4.6, the crash comes from line 241 of opencv\modules\imgproc\src\median_blur.simd.hpp:
for ( k = 0; k < 16 ; ++k )
{
    sum += H.coarse[k];
    if ( sum > t )
    {
        sum -= H.coarse[k];
        break;
    }
}
CV_Assert( k < 16 ); // Error here
t is calculated based on ksize, but the calculations of sum and the H.coarse array are quite complicated.
After further research, I found a paper about the algorithm: https://www.researchgate.net/publication/321690537_Efficient_Scalable_Median_Filtering_Using_Histogram-Based_Operations
I am trying to read it but honestly, I don't understand much of it.
How do I calculate the maximum ksize with a given image?
The maximum kernel size is determined from the bit depth of the image. As mentioned in the publication you cited:
"An 8-bit value is limited to a max value of 255. Our goal is to
support larger kernel sizes, including kernels that are greater in
size than 17 × 17, thus the larger 32-bit data type is used"
so for an image of data type CV_8U the maximum kernel size is 255.
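If you take that limit at face value, you can clamp the requested size before calling medianBlur instead of catching the assertion afterwards. A minimal sketch in plain Python; clamp_median_ksize is a made-up helper, the default of 255 is simply the figure stated above (I have not verified it against the OpenCV source), and the only other rule it encodes is the documented one that ksize must be odd and greater than 1.

```python
def clamp_median_ksize(requested, max_ksize=255):
    """Largest odd kernel size in [3, max_ksize] that does not exceed `requested`.

    max_ksize=255 is an assumption taken from the answer above, not from OpenCV itself.
    """
    k = max(3, min(int(requested), max_ksize))
    if k % 2 == 0:
        k -= 1  # medianBlur only accepts odd sizes
    return k

print(clamp_median_ksize(301))  # a too-large request is reduced to the assumed limit
print(clamp_median_ksize(10))   # an even request is rounded down to the next odd size
```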
I am using a nice FFT library I found online to see if I can write a pitch-detection program. So far, I have been able to successfully let the library do FFT calculation on a test audio signal containing a few sine waves including one at 440Hz (I'm using 16384 samples as the size and the sample rate at 44100Hz).
The FFT output looks like:
433.356Hz - Real: 590.644 - Imag: -27.9856 - MAG: 16529.5
436.047Hz - Real: 683.921 - Imag: 51.2798 - MAG: 35071.4
438.739Hz - Real: 4615.24 - Imag: 1170.8 - MAG: 5.40352e+006
441.431Hz - Real: -3861.97 - Imag: 2111.13 - MAG: 8.15315e+006
444.122Hz - Real: -653.75 - Imag: 341.107 - MAG: 222999
446.814Hz - Real: -564.629 - Imag: 186.592 - MAG: 105355
As you can see, the 441.431Hz and 438.739Hz bins both show equally high magnitude outputs (the right-most numbers following "MAG:"), so it's obvious that the target frequency 440Hz falls somewhere between. Increasing the resolution might be one way to close in, but that would add to the calculation time.
How do I calculate the exact frequency that falls between two frequency bins?
UPDATE:
I tried out Barry Quinn's "Second Estimator" discussed on the DSPGuru website and got excellent results. The following shows the result for 440Hz square wave - now I'm only off by 0.003Hz!
Here is the code I used. I simply adapted this example I found, which was for Swift. Thank you everyone for your very valuable input, this has been a great learning journey :)
To calculate the "true" frequency, I once used a parabola fit algorithm. It worked very well for my use case.
This is the approach I followed to find the fundamental frequency:
1. Calculate the DFT (WOLA).
2. Find peaks in your DFT bins.
3. Find the Harmonic Product Spectrum. Not the most reliable nor precise, but this is a very easy way of finding your fundamental frequency candidates.
4. Based on the peaks and the HPS, use the parabola fit algorithm to find the fundamental pitch frequency (and amplitude if needed).
For example, HPS says the fundamental (strongest) pitch is concentrated in bin x of your DFT; if bin x belongs to the peak y, then parabola fit frequency is taken from the peak y and that is the pitch you were looking for.
If you are not looking for fundamental pitch, but exact frequency in any bin, just apply parabola fit for that bin.
Some code to get you started:
struct Peak
{
    float    freq;      // Peak frequency calculated by parabola fit algorithm.
    float    amplitude; // True amplitude.
    float    strength;  // Peak strength when compared to neighbouring bins.
    uint16_t startPos;  // Peak starting position (DFT bin).
    uint16_t maxPos;    // Peak location (DFT bin).
    uint16_t stopPos;   // Peak stop position (DFT bin).
};

// bins is taken to be the number of magnitude bins (half the DFT length), fs the sample rate in Hz.
void calculateTrueFrequency( Peak & peak, float const bins, uint32_t const fs, DFT_Magnitudes mags )
{
    // Parabola fit through the peak bin and its two neighbours
    // (peak.maxPos must not be the first or the last bin):
    float a = mags[ peak.maxPos - 1 ];
    float b = mags[ peak.maxPos     ];
    float c = mags[ peak.maxPos + 1 ];

    float p   = 0.5f * ( a - c ) / ( a - 2.0f * b + c ); // Peak offset in bins, -0.5 .. 0.5.
    float bin = static_cast<float>( peak.maxPos ) + p;

    peak.freq      = static_cast<float>( fs ) * bin / bins / 2;
    peak.amplitude = b - 0.25f * ( a - c ) * p; // Height of the fitted parabola at its vertex.
}
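Here is the same parabola fit as a small self-contained Python sketch, so you can try it on a synthetic tone. Everything around the three-point formula is an assumption for the demo: the names are invented, a naive DFT stands in for the FFT, and the Hann window and log magnitudes are my choice because the fit tends to be more accurate that way (the code above works on whatever magnitudes you pass in).

```python
import cmath
import math

def parabolic_peak(a, b, c):
    """Fit a parabola through three equally spaced values, b being the largest.

    Returns (offset, height): the vertex position in bins relative to the
    middle point (-0.5 .. 0.5) and the value of the parabola at the vertex.
    """
    p = 0.5 * (a - c) / (a - 2.0 * b + c)
    return p, b - 0.25 * (a - c) * p

def estimate_peak_frequency(samples, sample_rate):
    n = len(samples)
    # Hann window to reduce leakage before transforming.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i, s in enumerate(samples)]
    # Log magnitudes of bins 0 .. n/2-1 (naive DFT; use a real FFT in practice).
    log_mags = []
    for k in range(n // 2):
        acc = sum(w * cmath.exp(-2j * math.pi * k * i / n) for i, w in enumerate(windowed))
        log_mags.append(math.log(abs(acc) + 1e-12))
    # Strongest bin, skipping the edges so both neighbours exist.
    peak = max(range(1, n // 2 - 1), key=lambda k: log_mags[k])
    offset, _ = parabolic_peak(log_mags[peak - 1], log_mags[peak], log_mags[peak + 1])
    return (peak + offset) * sample_rate / n

SAMPLE_RATE = 8000
N = 512  # bin spacing is 15.625 Hz, so 440 Hz falls between bins 28 and 29
tone = [math.sin(2 * math.pi * 440.0 * i / SAMPLE_RATE) for i in range(N)]
estimate = estimate_peak_frequency(tone, SAMPLE_RATE)
print(estimate)
```

The estimate should land well within one bin width of 440 Hz; how close depends on the window and on noise.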
Sinc interpolation can be used to accurately interpolate (or reconstruct) the spectrum between FFT result bins. A zero-padded FFT will produce a similar interpolated spectrum. You can use a high quality interpolator (such as a windowed Sinc kernel) with successive approximation to estimate the actual spectral peak to whatever resolution the S/N allows. This reconstruction might not work near the DC or Fs/2 FFT result bins unless you include the effects of the spectrum's conjugate image in the interpolation kernel.
See: https://ccrma.stanford.edu/~jos/Interpolation/Ideal_Bandlimited_Sinc_Interpolation.html and https://en.wikipedia.org/wiki/Whittaker%E2%80%93Shannon_interpolation_formula for details about time domain reconstruction, but the same interpolation method works in either domain, frequency or time, for bandlimited or time limited signals respectively.
If you require a less accurate estimator with far less computational overhead, parabolic interpolation (and other similar curve fitting estimators) might work. See: https://www.dsprelated.com/freebooks/sasp/Quadratic_Interpolation_Spectral_Peaks.html and https://mgasior.web.cern.ch/mgasior/pap/FFT_resol_note.pdf for details for parabolic, and http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.555.2873&rep=rep1&type=pdf for other curve fitting peak estimators.
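As a rough illustration of the successive-approximation idea, the sketch below evaluates the spectrum of a finite block directly at arbitrary frequencies and narrows in on the maximum. For a time-limited signal, evaluating the DTFT this way should give the same values an ideal interpolation between the FFT bins would. This is a swapped-in brute-force stand-in, not a windowed-sinc kernel; the function names, the Hann window and the ternary search are all assumptions made for the demo.

```python
import cmath
import math

def dtft_magnitude(samples, freq_hz, sample_rate):
    """|DTFT| of a finite block at an arbitrary frequency, not just at bin centres."""
    w = 2 * math.pi * freq_hz / sample_rate
    return abs(sum(s * cmath.exp(-1j * w * i) for i, s in enumerate(samples)))

def refine_peak(samples, sample_rate, lo_hz, hi_hz, iterations=40):
    """Ternary search for the maximum of |DTFT| on [lo_hz, hi_hz].

    Assumes the magnitude has a single maximum in that interval, which holds
    when the interval lies inside one main lobe.
    """
    for _ in range(iterations):
        m1 = lo_hz + (hi_hz - lo_hz) / 3
        m2 = hi_hz - (hi_hz - lo_hz) / 3
        if dtft_magnitude(samples, m1, sample_rate) < dtft_magnitude(samples, m2, sample_rate):
            lo_hz = m1
        else:
            hi_hz = m2
    return 0.5 * (lo_hz + hi_hz)

SAMPLE_RATE = 8000
N = 512
bin_hz = SAMPLE_RATE / N
# Hann-windowed 440 Hz test tone.
block = [math.sin(2 * math.pi * 440.0 * i / SAMPLE_RATE) * (0.5 - 0.5 * math.cos(2 * math.pi * i / N))
         for i in range(N)]
# Coarse peak: strongest bin centre (this is what the plain FFT would give you).
coarse_bin = max(range(1, N // 2), key=lambda k: dtft_magnitude(block, k * bin_hz, SAMPLE_RATE))
# Refine within one bin either side of the coarse peak.
refined = refine_peak(block, SAMPLE_RATE, (coarse_bin - 1) * bin_hz, (coarse_bin + 1) * bin_hz)
print(coarse_bin, refined)
```

Each probe costs a full pass over the block, which is why the cheaper curve-fitting estimators are attractive when you need speed.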
A small program that I wrote in octave does not yield desired phase spectrum. The magnitude plot is perfect though.
f = 200;
fs = 1000;
phase = pi/3;
t = 0: 1/fs: 1;
sig = sin((2*pi*f*t) + phase);
sig_fft = fft(sig);
sig_fft_phase = angle(sig_fft) * 180/pi;
sig_fft_phase(201)
sig_fft_phase(201) returns 5.998 (6 degrees) rather than 60 degrees. What am I doing wrong? Is my expectation incorrect?
In your example, if you generate the frequency axis: (sorry, I don’t have Octave here, so Python will have to do—I’m sure it’s the same in Octave):
faxis = (np.arange(0, t.size) / t.size) * fs
you’ll see that faxis[200] (Python is 0-indexed, equivalent to Octave’s 201 index) is 199.80019980019981. You think you’re asking for the phase at 200 Hz but you’re not, you’re asking for the phase of 199.8 Hz.
(This happens because your t vector includes 1.0—that one extra sample slightly decreases the spectral spacing! I don’t think the link #Sardar_Usama posted in their comment is correct—it has nothing to do with the fact that the sinusoid doesn’t end on a complete cycle, since this approach should work with incomplete cycles.)
The solution: zero-pad the 1001-long sig vector to 2000 samples. Then, with a new faxis frequency vector, faxis[400] (Octave’s 401st index) corresponds to exactly 200 Hz:
In [54]: sig_fft = fft.fft(sig, 2000);
In [55]: faxis = np.arange(0, sig_fft.size) / sig_fft.size * fs
In [56]: faxis[400]
Out[56]: 200.0
In [57]: np.angle(sig_fft[400]) * 180 / np.pi
Out[57]: -29.950454729683386
But oh no, what happened? This says the angle is -30°?
Well, recall that Euler’s formula says that sin(x) = (exp(i * x) - exp(-i * x)) / 2i. That i in the denominator means that the phase recovered by the FFT won’t be 60°, even though the input sine wave has phase of 60°. Instead, the FFT bin’s phase will be 60 - 90 degrees, since -90° = angle(1/i) = angle(-i). So this is actually the right answer! To recover the sine wave’s phase, you’d need to add 90° to the phase of the FFT bin.
So to summarize, you need to fix two things:
Make sure you’re looking at the right frequency bin. For an N-point FFT (and no fftshift), the bins are [0 : N - 1] / N * fs. Above, we just used an N=2000 point FFT to ensure that 200 Hz is represented.
Understand that, although you have a sine wave, as far as the FFT is concerned, it gets two complex exponentials, at +200 and -200 Hz, and with amplitudes 1/(2i) and -1/(2i). That imaginary value in the denominators shifts the phase you expect by -90° and +90° respectively.
If you happened to have used cos, a cosine wave, for sig, you wouldn’t have run into this mathematical obstacle 😆, so pay attention to the difference between sin and cos in the future 💪!
Change t to t=0:1/fs:1-1/fs; then
sig_fft_phase(201)
ans = -30.000
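The same check can be done without Octave or NumPy. This is a small standard-library Python sketch (dft_bin is just a helper name for this example) that uses exactly 1000 samples, so bin 200 sits at exactly 200 Hz, and then adds the 90° that separates a sine from a cosine:

```python
import cmath
import math

def dft_bin(samples, k):
    """Single bin of an unnormalised DFT."""
    n = len(samples)
    return sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))

f, fs, phase = 200, 1000, math.pi / 3
n = fs  # exactly one second: t = 0, 1/fs, ..., 1 - 1/fs, so bins are 1 Hz apart
sig = [math.sin(2 * math.pi * f * i / fs + phase) for i in range(n)]

bin_phase_deg = math.degrees(cmath.phase(dft_bin(sig, 200)))  # phase the FFT reports
sine_phase_deg = bin_phase_deg + 90.0                         # phase of the sine itself
print(bin_phase_deg, sine_phase_deg)
```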
I'm interested in getting my iOS app to turn on the microphone and only listen for frequencies above 17000 hz. If it hears something in that range, I'd like the app to call a method.
I was able to find a repository that detects frequency: https://github.com/krafter/DetectingAudioFrequency
And here is a post breaking down FFT:
Get Hz frequency from audio stream on iPhone
Using these examples, I've been able to get the phone to react to the strongest frequency it hears, but I'm more interested in just reacting to the above 17000 hz frequencies.
The fact that I wrote that code helps me answer this question, but the answer probably only applies to this code.
You can easily limit the frequencies you listen to just by trimming that output array to a piece that contains only the range you need.
In detail: to keep it simple, array[0..255] contains your audio in the frequency domain. Say, for example, your sample rate was 44100 when you did the FFT.
Then the maximum frequency you can encode is 22050 Hz (Nyquist theorem).
That means each element covers 22050/256 = 86.13 Hz: array[0] is the DC component, array[1] corresponds to roughly 86.13 Hz, array[2] to 86.13*2 = 172.26 Hz, array[3] to 86.13*3 = 258.39 Hz, and so on. Your full range is distributed across those 256 values (and yes, precision suffers).
So if you only need to listen to some range, let's say above 17000Hz, you just take a piece of that array and ignore the rest. In this case you take the subarray from index 17000/86.13=197 to 255 and you have it: only the 17000-22050 range.
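The index arithmetic is easy to check in a few lines. This is a plain Python sketch with invented names and a fake spectrum, just to show the slicing; it assumes bin k sits at roughly k times the bin width.

```python
def band_start_index(freq_limit_hz, sample_rate, n_bins):
    """Index of the bin that contains freq_limit_hz, for n_bins bins spanning 0..Nyquist."""
    nyquist = sample_rate / 2.0
    hz_per_bin = nyquist / n_bins
    return int(freq_limit_hz / hz_per_bin)

SAMPLE_RATE = 44100
N_BINS = 256
start = band_start_index(17000, SAMPLE_RATE, N_BINS)

# Fake magnitude spectrum, only for illustration.
spectrum = [0.0] * N_BINS
spectrum[50] = 9.0    # strong low-frequency content that should be ignored
spectrum[220] = 2.0   # weaker content inside the band of interest

high_band = spectrum[start:]  # keep only the bins from about 17 kHz upwards
peak_index = start + max(range(len(high_band)), key=high_band.__getitem__)
peak_hz = peak_index * (SAMPLE_RATE / 2.0) / N_BINS
print(start, peak_index, peak_hz)
```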
In my repo you modify the strongestFrequencyHZ function like this:
static Float32 strongestFrequencyHZ(Float32 *buffer, FFTHelperRef *fftHelper, UInt32 frameSize, Float32 *freqValue) {
    Float32 *fftData = computeFFT(fftHelper, buffer, frameSize);
    fftData[0] = 0.0; // ignore the DC component
    unsigned long length = frameSize/2;
    Float32 max = 0;
    unsigned long maxIndex = 0;

    Float32 freqLimit = 17000; //HZ
    Float32 freqsPerIndex = NyquistMaxFreq/length;
    // index of the bin that contains freqLimit (cast the whole quotient, not just freqLimit)
    unsigned long lowestLimitIndex = (unsigned long)(freqLimit/freqsPerIndex);
    unsigned long newLen = length-lowestLimitIndex;
    Float32 *newData = fftData+lowestLimitIndex; //address arithmetic

    // search only the bins from lowestLimitIndex upwards
    max = vectorMaxValueACC32_index(newData, newLen, 1, &maxIndex);
    if (freqValue!=NULL) { *freqValue = max; }
    // maxIndex is relative to newData, so add the offset back
    Float32 HZ = frequencyHerzValue(lowestLimitIndex+maxIndex, length, NyquistMaxFreq);
    return HZ;
}
I did some address arithmetic in there so it looks kind of complicated. You can just take that fftData array and do the regular stuff.
Other things to keep in mind:
Finding the strongest frequency is easy. You just find the maximum in that array. That's it. But in your case you need to monitor the range and find when it goes from regular weak noise to some strong signal. In other words, when stuff peaks. This is not so trivial, but possible. You can probably just set some limit above which the signal counts as detected, although this is not the best option.
I would rather be optimistic about this, because in real life you don't get much noise at 18000Hz around you. The only thing I can remember is some old TVs that produce that high-pitched sound when they're on.
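One way to do slightly better than a fixed limit is to track the usual noise level in the band and only trigger when a frame clearly exceeds it. This is only a sketch of that idea in plain Python: BandDetector, ratio and smoothing are invented for the example, and the values would need tuning against real microphone data.

```python
class BandDetector:
    """Flags a frame when the band peak rises well above a slowly tracked noise floor."""

    def __init__(self, ratio=8.0, smoothing=0.05):
        self.ratio = ratio          # how many times above the floor counts as a signal
        self.smoothing = smoothing  # how quickly the floor follows the quiet frames
        self.floor = None

    def update(self, band_magnitudes):
        level = max(band_magnitudes)
        if self.floor is None:
            self.floor = level      # the first frame only initialises the floor
            return False
        detected = level > self.ratio * max(self.floor, 1e-9)
        if not detected:
            # Only quiet frames update the floor, so a long tone does not raise it.
            self.floor += self.smoothing * (level - self.floor)
        return detected

detector = BandDetector()
quiet = [0.010, 0.012, 0.011]   # made-up magnitudes of the bins above 17 kHz
loud = [0.010, 0.500, 0.011]
results = [detector.update(quiet) for _ in range(20)]
results.append(detector.update(loud))
results.append(detector.update(quiet))
print(results[-2:])
```

In the app you would feed update() with the high-frequency slice of each FFT frame and call your method when it returns true.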
I am trying to implement the overlap and add method in order to apply a filter in a real time context. However, it seems that I am doing something wrong, as the resulting output has a larger error than I would expect. To check the accuracy of my computations I created a file that I process in one chunk. I compare this with the output of the overlap and add process and take the difference as an indicator of the accuracy of the computation. So here is my process for doing overlap and add:
I take a chunk of length L from my input signal
I pad the chunk with zeros to length L*2
I transform that signal into frequency domain
I multiply the signal in frequency domain with my filter response of length L*2 in frequency domain (the filter response is actually created by interpolating control points in the UI - so this is not transformed from time domain. However using length L*2 in frequency domain should be similar to using a ffted time domain signal of length L padded to L*2)
Then I transform the resulting signal back to time domain and add it to the output stream with an overlap of L
Is there anything wrong with that procedure? After reading a lot of different papers and books I've gotten pretty unsure which is the right way to deal with that.
Here is some more data from the tests I have been running:
I created a signal, which consists of three cosine waves
I used this filter function in the time domain for filtering. (It's symmetric, as it is applied to the whole output of the FFT, which also is symmetric for real input signals)
The output of the IFFT looks like this: It can be seen that low frequencies are attenuated more than frequency in the mid range.
For the overlap add/save and the windowed processing I divided the input signal into 8 chunks of 256 samples. After reassembling them they look like that. (sample 490 - 540)
Output Signal overlap and add:
output signal overlap and save:
output signal using STFT with Hanning window:
It can be seen that the overlap add/save processes differ from the STFT version at the point where chunks are put together (sample 511). This is the main error which leads to different results when comparing windowed process and overlap add/save. However the STFT is closer to the output signal, which has been processed in one chunk.
I have been pretty much stuck at this point for a few days. What is wrong here?
Here is my source
// overlap and add
// init Buffers
for (UInt32 j = 0; j < samples; j++){
    output[j] = 0.0;
}
// process multiple chunks of data
for (UInt32 i = 0; i < (float)div * 2; i++){
    for (UInt32 j = 0; j < chunklength/2; j++){
        // copy input data to the first half of current buffer
        inBuffer[j] = input[(int)((float)i * chunklength / 2 + j)];
        // pad second half with zeros
        inBuffer[j + chunklength/2] = 0.0;
    }
    // clear buffers ([0] = real part, [1] = imaginary part)
    for (UInt32 j = 0; j < chunklength; j++){
        outBuffer[j][0] = 0.0;
        outBuffer[j][1] = 0.0;
        FFTBuffer[j][0] = 0.0;
        FFTBuffer[j][1] = 0.0;
    }
    FFT(inBuffer, FFTBuffer, chunklength);
    // processing
    for (UInt32 j = 0; j < chunklength; j++){
        // multiply with filter
        FFTBuffer[j][0] *= multiplier[j];
        FFTBuffer[j][1] *= multiplier[j];
    }
    // Inverse Transform
    IFFT((const double**)FFTBuffer, outBuffer, chunklength);
    for (UInt32 j = 0; j < chunklength; j++){
        // copy to output
        if ((int)((float)i * chunklength / 2 + j) < samples){
            output[(int)((float)i * chunklength / 2 + j)] += outBuffer[j][0];
        }
    }
}
After the suggestion below, I tried the following:
IFFTed my Filter. This looks like this:
set the second half to zero:
FFTed the signal and compared the magnitudes to the old filter (blue):
After trying to do overlap and add with this filter, the results have obviously gotten worse instead of better. In order to make sure my FFT works correctly, I tried to IFFT and FFT the filter without setting the second half to zero. The result is identical to the original filter. So the problem shouldn't be the FFTing. I suppose that this is more a matter of some general understanding of the overlap and add method. But I still can't figure out what is going wrong...
One thing to check is the length of the impulse response of your filter. It must be shorter than the length of zero padding used before the fast convolution FFT, or you will get wrap around errors.
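To see the length condition in action, here is a small standard-library Python sketch of overlap-add. It is not your code: the names, the tiny recursive FFT and the 5-tap averaging filter are all made up for the demo. The FFT size is chosen so that block length + filter length - 1 fits, and the result is compared against direct convolution of the whole signal.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(X):
    """Inverse FFT via the conjugate trick."""
    n = len(X)
    return [v.conjugate() / n for v in fft([v.conjugate() for v in X])]

def direct_convolve(x, h):
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            out[i + j] += xv * hv
    return out

def overlap_add(signal, h, block_len):
    m = len(h)
    nfft = 1
    while nfft < block_len + m - 1:   # room for the full linear convolution of one block
        nfft *= 2
    H = fft(list(h) + [0.0] * (nfft - m))
    out = [0.0] * (len(signal) + m - 1)
    for start in range(0, len(signal), block_len):
        block = signal[start:start + block_len]
        X = fft(list(block) + [0.0] * (nfft - len(block)))
        y = ifft([a * b for a, b in zip(X, H)])
        # Each block contributes len(block) + m - 1 samples; the tails overlap and add.
        for j in range(len(block) + m - 1):
            out[start + j] += y[j].real
    return out

signal = [math.sin(0.3 * i) for i in range(50)]
h = [0.2] * 5                          # short impulse response (5-tap moving average)
reference = direct_convolve(signal, h)
blocked = overlap_add(signal, h, block_len=8)
max_error = max(abs(a - b) for a, b in zip(reference, blocked))
print(max_error)
```

A filter drawn directly in the frequency domain at length 2L can have an impulse response as long as 2L samples, which leaves no room in a 2L-point FFT for an L-sample block, so the tail wraps around.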
I think the problem might be in the windowing approach that you are using. You simply add zeros to the chunks, so there is no actual overlap. In the overlap and add method, you need to damp the edges of the window. What this means is that where you add zeros to the chunk you instead have to add the weighted input signal, and the weight in your case should be 0.5 since only two windows overlap.
The rest of the procedure seems OK. You then simply take FTs, multiply, take inverse FTs and finally add up all the chunks to get the final signal, which should be exactly the same as if you had filtered the whole signal at once.