I am new to the machine learning domain. I am currently trying to implement an audio language detection system based on the MFCCs, deltas, delta-deltas, and mel spectrum coefficients of an audio file. These features are extracted using librosa, which returns a 2D matrix of MFCCs per file. The problem is that I want to train a Gaussian Mixture Model on them. scikit-learn takes input in the format (n_samples, n_features), but I have a 3D array of the form (n_samples, n_mfcc, n_time) built from the output of librosa.feature.mfcc(). How can I provide a 3D input to a GMM?
Also, is there a way to feed all four features mentioned above into the model?
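One common workaround, shown as a minimal sketch below, is to treat every time frame as a sample and stack the four feature matrices along the feature axis before fitting the GMM; the file name, number of components, and parameter values here are placeholder assumptions, not part of the original question:

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

# Load one audio file (the path is a placeholder).
y, sr = librosa.load("example.wav")

# Each feature is a (n_features, n_frames) matrix.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
melspec = librosa.feature.melspectrogram(y=y, sr=sr)

# Stack the four feature matrices along the feature axis, then transpose
# so that every time frame becomes one sample: shape (n_frames, n_features).
features = np.vstack([mfcc, delta, delta2, melspec]).T

# Fit one GMM per language on the frames of that language's files;
# a new file is scored by the average log-likelihood of its frames.
gmm = GaussianMixture(n_components=8, covariance_type="diag")
gmm.fit(features)
print(gmm.score(features))
```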
I am currently learning ML, and one project (for learning purposes) I am considering is classifying DTMF tones using ML.
I will be using numpy/scipy. I will have a time-domain DTMF signal (for each of the digits 0-9), run it through an FFT, and get an array of frequency magnitudes in which the two dial-pad frequencies stand out above the rest.
Hypothetical example: a DTMF tone has two frequencies, 100 Hz and 300 Hz. The FFT array will have a bin spacing of 100 Hz (only for this example; my actual implementation will have finer increments):
index 0 (100Hz) - 1.0
index 1 (200Hz) - 0.0
index 2 (300Hz) - 1.0
index 3 to n - 0.0
Most of the scikit-learn examples I have seen use a single value for classification. How can I use this array of FFT frequencies to train and classify the DTMF data?
What I am currently thinking is to use matplotlib to plot the FFT magnitudes, save those plots as pictures, and use image classification to train the model and classify the DTMF signals. But that seems an "expensive" approach. What approach could I use without resorting to image classification?
Credit: picture from https://blogs.mathworks.com/cleve/2014/09/01/touch-tone-telephone-dialing/
A linear classifier would be a plausible ML approach for this task:
Compute the FFT of the input signal. Take the magnitude (abs) of the spectrum, since phase is not relevant for distinguishing DTMF tones. This will result in an array of N real nonnegative values.
Follow this with a linear layer (without bias term), taking a size-N input and producing a size-4 output. In other words, the layer is an Nx4 matrix, and you multiply the spectrum with this matrix to get 4 values.
Finally, apply a softmax to get 4 normalized confidences in the [0, 1] range. Interpret the ith confidence as predicting whether the ith tone is present. Find the largest two confidences to determine which DTMF symbol it is.
When trained, the linear layer should learn which band of frequencies to associate with each tone. If N is large compared to the number of training examples, it will help to add an L2 penalty or other regularization on the linear layer's parameters.
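A minimal numpy sketch of the forward pass described above; the weight matrix W is random (i.e. untrained) and the signal is synthetic, both purely for illustration:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Synthetic example: 1 second at 8 kHz containing two DTMF frequencies.
fs = 8000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 697 * t) + np.sin(2 * np.pi * 1209 * t)

# Step 1: magnitude spectrum (phase is discarded).
spectrum = np.abs(np.fft.rfft(signal))      # N real nonnegative values
N = spectrum.size

# Step 2: linear layer without bias, an N x 4 matrix.
# Here it is random; training would learn these weights.
W = 0.01 * np.random.randn(N, 4)

# Step 3: softmax gives 4 normalized confidences in [0, 1].
confidences = softmax(spectrum @ W)

# The two largest confidences indicate which tones are predicted present.
top_two = np.argsort(confidences)[-2:]
print(confidences, top_two)
```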
We know that a generative adversarial network (GAN) can generate data similar to real data. In general, the generator takes a random variable z as input and generates a vector representing data x. How can I calculate the probability P_G(x) when I have new data and want to know the probability of the GAN generating it?
You cannot directly compute the likelihood of data using a GAN.
The original GAN paper uses a Parzen-window-based estimate of the likelihood. You can generate data from the GAN and use that data to estimate the likelihood with whatever method you like.
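A minimal sketch of that idea, with scikit-learn's KernelDensity standing in for the Parzen-window estimator; the `generator` stub and the bandwidth are placeholder assumptions, not part of the original answer:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def generator(z):
    # Placeholder for your trained GAN generator G(z); this stub just returns z.
    return z

# Draw many samples from the generator.
z = np.random.randn(10000, 2)
samples = generator(z)

# Fit a Parzen-window (Gaussian kernel) density estimate on the generated data.
kde = KernelDensity(kernel="gaussian", bandwidth=0.1).fit(samples)

# Estimated log P_G(x) for new data points x.
x_new = np.random.randn(5, 2)
print(kde.score_samples(x_new))
```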
I am developing back-end speech recognition software in which the user can import mp3 files. How can I extract features from this digital audio file? Should I convert it back to analog first?
Your question is unclear, since you are using the terms analog and digital incorrectly. Analog refers to a real-world, continuous function, e.g. voltage or pressure. Digital is a discrete (sampled) and quantized version of the analog signal. You must calculate the FFT of your audio frames when calculating the MFCCs; you can extract MFCCs only from the digital signal - it is rather impossible to do it with the analog one.
If you are asking whether it is possible to extract MFCCs from an mp3 file, then yes, it is possible. All you need to do is perform the standard algorithm to get your features (a full walkthrough is outside the scope of this question); a short librosa sketch follows the steps below:
Calculate the FFT for frames of data.
Calculate the power spectrum by squaring the magnitude of the FFT bins.
Apply the mel filterbank and sum the energy within each band.
Calculate the logarithm of each of the energies.
Calculate the DCT of the logarithms of energies.
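For instance, a minimal librosa sketch (librosa decodes mp3 through its audioread/ffmpeg backend, so one of those must be installed; the file name and sample rate are placeholders):

```python
import librosa

# librosa decodes the mp3 into a digital (sampled) signal for you.
y, sr = librosa.load("speech.mp3", sr=16000)

# 13 MFCCs per frame, shape (13, n_frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)
```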
You're confusing things here; like @jojek said, you can do all of that WITH the digital signal. This is a pretty spot-on tutorial:
http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
This one is more practical:
http://www.speech.cs.cmu.edu/15-492/slides/03_mfcc.pdf
From Wikipedia: [http://en.wikipedia.org/wiki/Mel-frequency_cepstrum]
MFCCs are commonly derived as follows:[1][2]
Take the Fourier transform of (a windowed excerpt of) a signal. (This means a short-time Fourier transform.)
Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows. (Calculation described in the links above)
Take the logs of the powers at each of the mel frequencies.
Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
The MFCCs are the amplitudes of the resulting spectrum.
and here's a Matlab toolbox to help you understand it better:
http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
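As a rough Python illustration of the steps listed above, here is a minimal numpy/scipy sketch for a single frame; the window, filterbank size, and other parameters are illustrative, and librosa's mel filterbank is used for brevity:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def frame_mfcc(frame, sr, n_mels=26, n_mfcc=13):
    """MFCCs of one windowed frame, following the steps above."""
    # 1. Fourier transform of the windowed excerpt.
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed)

    # 2. Map the powers of the spectrum onto the mel scale
    #    with triangular overlapping filters.
    power = np.abs(spectrum) ** 2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=len(frame), n_mels=n_mels)
    mel_energies = mel_fb @ power

    # 3. Logs of the powers at each mel frequency.
    log_energies = np.log(mel_energies + 1e-10)

    # 4. Discrete cosine transform of the mel log powers.
    cepstrum = dct(log_energies, type=2, norm="ortho")

    # 5. The MFCCs are the leading amplitudes of the resulting spectrum.
    return cepstrum[:n_mfcc]

# Example: one 25 ms frame of a synthetic 440 Hz tone at 16 kHz.
sr = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(int(0.025 * sr)) / sr)
print(frame_mfcc(frame, sr))
```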
Hello everybody. I am doing a project that consists of detecting objects using a Kinect together with SVM and ANN machine learning. If possible, I would like the names of libraries for SVM and ANN with a graphical tool, because I only want to train the ANN with that library, save it as .xml, and then load the .xml with OpenCV.
SVM is a classifier used to classify samples based upon their feature vectors. So, your task is to convert the images into feature vectors which can be used by SVM for its training and testing.
OK, there are several possibilities for creating feature vectors from your images; I am going to mention some very common techniques:
A very easy method is to create a normalized hue histogram of each image. Let's say you have created a hue histogram with 5 bins. Based on the colors in your image there will be some values in these 5 bins, say { 0.32 0.56 0 0 0.12 }. This is now one input vector with 5 dimensions (i.e. the number of bins). You do the same procedure for all training samples, and then for the test image too (a small sketch of this approach follows after this list).
Extract keypoints from your input samples (e.g. using SIFT or SURF) and then compute their descriptors. You can then use these descriptors as the input to your SVM for training.
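A small sketch of the hue-histogram approach using OpenCV and scikit-learn; the image paths, labels, and bin count are placeholder assumptions:

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def hue_histogram(path, bins=5):
    # Normalized 5-bin hue histogram of one image -> 5-dimensional feature vector.
    img = cv2.imread(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).flatten()
    return hist / (hist.sum() + 1e-10)

# Placeholder training data: image paths and their class labels.
train_paths = ["obj1_a.png", "obj1_b.png", "obj2_a.png", "obj2_b.png"]
train_labels = [0, 0, 1, 1]

X = np.array([hue_histogram(p) for p in train_paths])
y = np.array(train_labels)

clf = SVC(kernel="linear").fit(X, y)

# Classify a test image in exactly the same way.
print(clf.predict([hue_histogram("test.png")]))
```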
I am using libsvm to train an SVM with HOG features. The model file has n support vectors, but when I try to use it in OpenCV's SVM I find that there is only one vector in OpenCV's model. How does OpenCV do this?
I guess libsvm stores the support vectors, whereas OpenCV just uses a weight vector to store the hyperplane (one vector plus one scalar suffices to describe a plane). You can get there from the decision function over the support vectors by swapping the sum and the scalar product.
Here is the explanation from Learning OpenCV3:
In the case of linear SVM, all the support vectors for each decision plane can be compressed into a single vector that will basically describe the separating hyperplane.
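A small numpy sketch of that compression: summing alpha_i * y_i * x_i over the support vectors yields a single weight vector w, and w.x - rho gives the same decision value as the support-vector form (the arrays below are illustrative, not taken from a real model file):

```python
import numpy as np

# Illustrative values, as if read from a trained linear SVM model file.
support_vectors = np.array([[ 1.0,  2.0],
                            [ 2.0,  0.5],
                            [-1.0, -1.5]])     # n support vectors
dual_coef = np.array([0.7, 0.3, -1.0])         # alpha_i * y_i per support vector
rho = 0.2                                      # bias / intercept

# Swap the sum and the scalar product: collapse all support vectors
# into a single weight vector describing the separating hyperplane.
w = dual_coef @ support_vectors                # shape (n_features,)

# Both forms of the decision function give the same value.
x = np.array([0.5, -0.3])
f_sv = np.sum(dual_coef * (support_vectors @ x)) - rho
f_w = w @ x - rho
print(f_sv, f_w)
```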