How to fix the parameters for MFCC feature extraction? - signal-processing

I want to use librosa to extract MFCC features from a 10-second speech recording made at 44.1 kHz.
Should I fix my sr at 8 kHz or keep it at 44.1 kHz (given that speech energy sits mostly in the lower band)?
Also, how do I choose the values for hop_length, window_length, n_mels, n_mfcc, and n_fft? How does the calculation work?
I would like to use this audio for an ASR task. Thanks in advance for your expertise!
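As an illustration of how these parameters fit together in librosa, here is a minimal sketch assuming resampling to 16 kHz and common ASR-style choices (25 ms window, 10 ms hop, 13 coefficients, 40 mel bands); the file name and all parameter values are illustrative assumptions, not prescriptions:

import librosa

# Assumed values: 16 kHz keeps the useful speech band (below 8 kHz) while
# shrinking the data; 25 ms windows with 10 ms hops are common ASR defaults.
sr = 16000
y, _ = librosa.load("speech.wav", sr=sr)  # hypothetical file; resampled from 44.1 kHz to 16 kHz

win_length = int(0.025 * sr)   # 25 ms -> 400 samples
hop_length = int(0.010 * sr)   # 10 ms -> 160 samples
n_fft = 512                    # next power of two >= win_length

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                 # 13 cepstral coefficients is a common choice
    n_mels=40,                 # 40 mel bands is a typical starting point
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
)
print(mfcc.shape)              # (n_mfcc, n_frames); roughly 1000 frames for 10 s at a 10 ms hop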

Related

How to "MUX" audio tracks in Audacity?

I have two mono tracks (A & B) with a 48 kHz sampling frequency. Is there a way to generate a new mono track at a 96 kHz sampling rate in which the samples alternate between the original mono tracks, i.e. a1, b1, a2, b2, a3, b3, ...?
Is there a better tool for doing this kind of processing?
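As an illustration of the interleaving step itself (outside Audacity), a short Python sketch using numpy and soundfile; the file names are hypothetical:

import numpy as np
import soundfile as sf

# Hypothetical inputs: two equal-length mono tracks at 48 kHz.
a, fs_a = sf.read("track_a.wav")
b, fs_b = sf.read("track_b.wav")
assert fs_a == fs_b == 48000 and a.shape == b.shape

# Interleave sample by sample: a1, b1, a2, b2, ...
muxed = np.empty(a.size + b.size, dtype=a.dtype)
muxed[0::2] = a
muxed[1::2] = b

# Write the result at twice the original rate (96 kHz).
sf.write("muxed_96k.wav", muxed, 2 * fs_a)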

How can I resample an audio file programmatically in Swift?

I would like to know if it's possible to resample an already written AVAudioFile.
All the references I found don't address this particular problem, since:
They propose resampling while the user is recording the AVAudioFile, i.e. while installTap is running. In that approach, the AVAudioConverter processes each buffer chunk delivered by the inputNode and appends it to the AVAudioFile. [1] [2]
The point is that I would like to resample my audio file independently of the recording process.
The harder approach would be to upsample the signal by a factor of L and then decimate it by a factor of M, using vDSP:
Audio on Compact Disc has a sampling rate of 44.1 kHz; to transfer it to a digital medium that uses 48 kHz, method 1 above can be used with L = 160, M = 147 (since 48000/44100 = 160/147). For the reverse conversion, the values of L and M are swapped. Per above, in both cases, the low-pass filter should be set to 22.05 kHz. [3]
The last one obviously seems like too hard-coded a way to solve it. I hope there's a way to do the resampling with AVAudioConverter, but it lacks documentation :(
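For reference, the quoted L/M method corresponds to polyphase resampling; here is a rough Python illustration with scipy (not Swift/vDSP), just to show the arithmetic, with a hypothetical input file:

import soundfile as sf
from scipy.signal import resample_poly

# Hypothetical 44.1 kHz input; upsample by L = 160, decimate by M = 147
# (48000 / 44100 = 160 / 147). resample_poly applies the low-pass
# anti-aliasing filter internally.
x, fs = sf.read("input_44100.wav")
assert fs == 44100
y = resample_poly(x, up=160, down=147)
sf.write("output_48000.wav", y, 48000)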

Keyword spotter doesn't work well with narrowband speech signal. How to solve it?

Here's what I have:
An acoustic model (CMU Sphinx) to be used in a keyword spotter. It was trained on speech sampled at 16 kHz and performs well, but it doesn't perform well when presented with speech sampled at 8 kHz, or with speech whose bandwidth is at most 4 kHz even though the sample rate is 16 kHz.
A microphone which only delivers a narrowband signal. The bandwidth of the signal is at most 4 kHz. I can set the sample rate (audio driver API) to 16 kHz, but the bandwidth stays the same since the underlying hardware samples at 8 kHz. I can't change that!
Here's the result:
The keyword spotter fails when it's presented with a speech signal (sample rate 16 kHz) which only has a bandwidth of 4 kHz.
Here's my question:
Would it be reasonable to expect the keyword spotter to work if I "fake it" by bandwidth-extending the narrowband signal before sending it to the keyword spotter?
What is the simplest bandwidth extender? (I'm looking for something that can be implemented quickly.)
Thanks
There are 8 kHz models; you should use them instead:
https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English/cmusphinx-en-us-ptm-8khz-5.2.tar.gz
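Since the underlying hardware really samples at 8 kHz, one option is to resample (or simply capture) the signal at 8 kHz and feed it to such a model; a rough offline sketch in Python, with hypothetical file names:

import librosa
import soundfile as sf

# The signal is nominally 16 kHz but carries only ~4 kHz of bandwidth,
# so downsampling to 8 kHz loses nothing and matches the 8 kHz model.
y, _ = librosa.load("narrowband_16k.wav", sr=16000)
y_8k = librosa.resample(y, orig_sr=16000, target_sr=8000)
sf.write("narrowband_8k.wav", y_8k, 8000)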

Audio Unit Bandpass Filter iOS

I'm developing a bandpass filter for iOS, and the filter needs two parameters (center frequency and bandwidth). The problem is basically that the bandwidth is specified in cents (range 100-1200) instead of Hz. I've tried to find a way to convert from cents to Hz but apparently there is no way. I also tried this link, but the range of bandwidths I'm using doesn't fit.
So, does anyone know something about this? Is there another way to implement a bandpass filter using audio units?
Thanks for the help. Any explanation would be really helpful!
The explanation can be found in AudioUnitProperties.h:
if f is the frequency in hertz, then absoluteCents = 1200 * log2(f / 440) + 6900
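Inverting that formula gives Hz from absolute cents, and a bandwidth in cents can be read as a frequency ratio between the band edges. A small sketch of those conversions; note that splitting the bandwidth symmetrically in cents around the center frequency is my assumption, not something stated in Apple's header:

import math

def hz_to_absolute_cents(f_hz: float) -> float:
    # Formula from AudioUnitProperties.h: 440 Hz maps to 6900 cents.
    return 1200.0 * math.log2(f_hz / 440.0) + 6900.0

def absolute_cents_to_hz(cents: float) -> float:
    # Inverse of the formula above.
    return 440.0 * 2.0 ** ((cents - 6900.0) / 1200.0)

def band_edges_hz(center_hz: float, bandwidth_cents: float):
    # Assumes the bandwidth in cents is split symmetrically around the
    # center frequency, i.e. the center is the geometric mean of the edges.
    half = bandwidth_cents / 2.0
    lo = center_hz * 2.0 ** (-half / 1200.0)
    hi = center_hz * 2.0 ** (half / 1200.0)
    return lo, hi

print(absolute_cents_to_hz(6900.0))   # 440.0 Hz
print(band_edges_hz(1000.0, 1200.0))  # one octave wide around 1 kHz: ~707 Hz to ~1414 Hz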

MATLAB - Trouble of converting training data to spectrogram

I am a student and was new to signal processing until just a few months ago. I picked "A Novel Fuzzy Approach to Speech Recognition" for my project (you can google for the downloadable version).
I am a little stuck on converting the training data into a spectrogram that has been passed through a mel filter.
I use this for my mel-filterbank, with a little modification of course.
Then I wrote this simple code to make the spectrogram of my training data:
p = 25;
fl = 0.0;
fh = 0.5;
w = 'hty';
[a, fs] = wavread('a.wav'); % you can simply record a sound and name it a.wav; the other params will follow
n = length(a) + 1;
fa = rfft(a);
xa = melbank_me(p, n, fs);  % the mel-filterbank function
za = log(xa * abs(fa).^2);
ca = dct(za);
spectrogram(ca(:,1))
All I got is just like this, which is not what the paper shows:
Please let me know whether my code and the resulting spectrogram are right. If so, what do I have to do to make my spectrogram look like the paper's? And if not, please tell me where I went wrong.
And another question: is it OK to have an FFT length that large? Because when I try to lower it, my code gives errors.
You shouldn't be doing an FFT of the entire file - that will include too much time-varying information - you should pick a window size over which the sound is relatively stationary, e.g. 10 ms @ 44.1 kHz = 441 samples, so perhaps N = 512 might be a good starting point. You can then generate your spectrogram over successive windows if needed, in order to display the time-varying frequency content.
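In Python rather than MATLAB, that windowed approach looks roughly like this (hypothetical input file; scipy's stft handles the successive short windows):

import numpy as np
import soundfile as sf
from scipy.signal import stft

# Hypothetical mono input at 44.1 kHz; the key point is the short analysis window.
x, fs = sf.read("a.wav")
nperseg = 512  # ~11.6 ms at 44.1 kHz, as suggested above
f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)

# Log-magnitude spectrogram: one column per analysis window.
S = 20 * np.log10(np.abs(Z) + 1e-12)
print(S.shape)  # (frequency bins, time frames)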
