Are MFCC features required for speech recognition?

I'm currently developing a speech recognition project and I'm trying to select the most meaningful features.
Most of the relevant papers suggest using Zero Crossing Rates, F0, and MFCC features, so I'm using those.
My concern is that a training sample with a duration of 00:03 has 268 features. Considering I'm doing a multi-class classification project with 50+ training samples per class, including all MFCC features may expose the project to the curse of dimensionality or 'reduce the importance' of the other features.
So my question is: should I include all MFCC features, and if not, can you suggest an alternative?

You should not use F0 and zero crossing rate; they are too unstable. You can simply increase your training data and use MFCCs, which have good representation capabilities. But remember to mean-normalize them.

After getting the MFCC coefficients of each frame, you can represent the MFCC features as the combination of:
1) First 12 MFCC
2) 1 energy feature
3) 12 delta MFCC features
4) 12 double-delta MFCC features
5) 1 delta energy feature
6) 1 double delta energy feature
The concept of delta MFCC features is described in this link.
The 39-dimensional MFCC feature vector is fed into an HMM or a recurrent neural network.
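If it helps to see the layout concretely, here is a minimal sketch of assembling that 39-dimensional vector per frame, assuming the librosa library (the answer itself doesn't prescribe one); sample.wav is a placeholder file name.

```python
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=16000)       # hypothetical recording

# 13 cepstral coefficients per frame; the 0th is commonly treated as the energy term
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

delta = librosa.feature.delta(mfcc)                # first-order (delta) features
delta2 = librosa.feature.delta(mfcc, order=2)      # second-order (double-delta) features

features = np.vstack([mfcc, delta, delta2])        # shape: (39, num_frames)
print(features.shape)
```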

The point I'd like to make is that MFCCs are not required. You can use MFCCs, and you can use the energy, delta and delta-delta features, as mentioned by @Mahendra Thapa, but it is not "required". Some researchers use 40 CCs, some drop the DCT from the MFCC calculation, making them MFSCs (spectral, not cepstral). Some add extra features; some use fewer. Susceptibility to the curse of dimensionality depends on your classifier, doesn't it? Some recently even claim to have made progress towards the "holy grail" of speech recognition: training on the raw signal using deep learning, learning the best features rather than hand-crafting them.

MFCC is widely used, and in practice it gives relatively better results.

Related

Machine learning: PCA for different number of features

I am quite new to machine learning and I am building a simple app to recognize spoken digits.
I used MFCC to extract the filtering characteristics of my audio files. MFCC outputs a 13 x length_of_audio matrix. I would like to use this information for my feature vector, but obviously each example would have a different number of features.
My question is: what are the approaches to handling a different number of features? E.g. could I use PCA to always extract a fixed number of features and then use them in a particular learning algorithm?
I would like to use logistic regression as the learning algorithm.
This is what I obtained when analyzing one of the spoken digits.
In your case, if for a user you have length_of_audio = N, you don't have one feature vector of length 13*N; you have N feature vectors of length 13 (they form a sequence, but they are separate feature vectors).
You must compose a matrix with 13 feature columns:
MFCCUser1,Slot1
MFCCUser1,Slot2
....
MFCCUser1,SlotN
MFCCUser2,Slot1
MFCCUser2,Slot2
...
And then you can apply Principal Component Analysis.
You only have 13 features, do you really need to reduce them?
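If you do decide to try it, here is a minimal sketch of the frame stacking plus PCA described above, assuming librosa for the MFCC step and scikit-learn for PCA (neither is prescribed by the answer); the file names are placeholders.

```python
import librosa
import numpy as np
from sklearn.decomposition import PCA

files = ["user1_digit3.wav", "user2_digit3.wav"]          # hypothetical recordings

frames = []
for path in files:
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # shape (13, N_frames)
    frames.append(mfcc.T)                                 # one row per frame, 13 columns

X = np.vstack(frames)           # (total_frames, 13): the matrix described above

pca = PCA(n_components=8)       # arbitrary target dimension; likely unnecessary for 13 features
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```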

Suggested unsupervised feature selection / extraction method for 2 class classification?

I've got a set of F features, e.g. Lab color space and entropy. By concatenating all features together, I obtain a feature vector of dimension d (between 12 and 50, depending on which features are selected).
I usually get between 1000 and 5000 new samples, denoted x. A Gaussian Mixture Model is then trained with these vectors, but I don't know which class the feature vectors come from. What I do know is that there are only 2 classes. Based on the GMM prediction I get a probability of a feature vector belonging to class 1 or class 2.
My question now is: how do I obtain the best subset of features, for instance only entropy and normalized RGB, that will give me the best classification accuracy? I guess this is achieved if the feature subset selection increases the class separability.
Maybe I can utilize Fisher's linear discriminant analysis, since I already have the mean and covariance matrices obtained from the GMM? But wouldn't I then have to calculate the score for each combination of features?
It would be nice to get some feedback on whether this is an unrewarding approach and I'm on the wrong track, and/or any other suggestions.
One way of finding "informative" features is to use the features that will maximise the log likelihood. You could do this with cross validation.
https://www.cs.cmu.edu/~kdeng/thesis/feature.pdf
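As an illustration of the log-likelihood idea, here is a hedged sketch using scikit-learn's GaussianMixture (the answer doesn't name a library): for a fixed subset size, each candidate feature subset is scored by its held-out average log-likelihood. The size is kept fixed here because likelihoods of subsets with different dimensionality are not directly comparable.

```python
from itertools import combinations
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X = np.random.rand(2000, 12)        # placeholder for your d-dimensional samples
k = 3                               # fixed subset size to compare (arbitrary choice)

best_score, best_subset = -np.inf, None
for subset in combinations(range(X.shape[1]), k):
    X_train, X_val = train_test_split(X[:, subset], test_size=0.3, random_state=0)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X_train)
    score = gmm.score(X_val)        # mean per-sample log-likelihood on held-out data
    if score > best_score:
        best_score, best_subset = score, subset

print(best_subset, best_score)
```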
Another idea might be to use another unsupervised algorithm that automatically selects features, such as a clustering forest:
http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf
In that case the clustering algorithm will automatically split the data based on information gain.
Fisher LDA will not select features but project your original data into a lower-dimensional subspace. If you are looking into subspace methods, another interesting approach might be spectral clustering, which also operates in a subspace, or unsupervised neural networks such as autoencoders.

Does SVM need to do learning each time when detecting people?

I'm using SVM and HOG in OpenCV to implement people detection.
Say using my own dataset: 3000 positive samples and 6000 negative samples.
My question is: does the SVM need to do the learning each time when detecting people?
If so, the learning time and predicting time could be so time-consuming. Is there any way to implement real-time people detection?
Thank you in advance.
Thank you for your answers. I have obtained the XML result after training (3000 positive and 6000 negative), so I can just use this result to write another standalone program that only calls svm.load() and svm.predict()? That's great. Besides, I found that the prediction time for an image with 1000 detection windows (128x64) is also quite time-consuming (about 10 seconds), so how can this handle a normal surveillance camera capture (320x240 or higher) with a 1- or 2-pixel scanning step size in real time?
I implemented HOG according to the original paper: 8x8 pixels per cell, 2x2 cells per block (50% overlap), so a 3780-dimensional vector for one detection window (128x64). Is the time problem caused by the huge feature vector? Should I reduce the dimensions for each window?
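For reference, the 3780 figure is consistent with that configuration; a quick sanity check (my own arithmetic, assuming the usual 9 orientation bins per cell, which the question doesn't state):

```python
# 128x64 window, 8x8 cells, 2x2 cells per block, 50% overlap (8-pixel block stride)
cells_y, cells_x = 128 // 8, 64 // 8            # 16 x 8 cells
blocks_y, blocks_x = cells_y - 1, cells_x - 1   # 15 x 7 overlapping blocks
dims = blocks_y * blocks_x * (2 * 2 * 9)        # 36 values per block
print(dims)                                     # 3780
```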
This is a very specific question on a general topic.
Short answer: no, you don't need to do the learning every time you want to use an SVM. It is a two-step process. The first step, learning (in your case, providing your learning algorithm with many labeled pictures, containing people or not containing people), results in a model which is used in the second step: testing (in your case, detecting people).
No, you don't have to re-train an SVM each and every time.
You do the training once, then svm.save() the trained model to an xml/yml file.
Later you just svm.load() that file instead of (re-)training, and do your predictions.
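A minimal sketch of that train-once / load-later workflow with OpenCV's Python bindings (illustrative only; the cv2.ml names below are from OpenCV 3.x+ and may differ slightly between versions, and the arrays are placeholders for your real HOG features):

```python
import cv2
import numpy as np

# --- training program (run once) ---
samples = np.random.rand(900, 3780).astype(np.float32)    # placeholder HOG features
labels = np.hstack([np.ones(300), -np.ones(600)]).astype(np.int32)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_LINEAR)
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)
svm.save("people_model.xml")                               # persist the trained model

# --- detection program (run any time later, no retraining) ---
svm2 = cv2.ml.SVM_load("people_model.xml")
window = np.random.rand(1, 3780).astype(np.float32)        # one detection window's features
_, prediction = svm2.predict(window)
print(prediction)
```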

How to speed up svm.predict?

I'm writing a sliding window to extract features and feed it into CvSVM's predict function.
However, what I've stumbled upon is that the svm.predict function is relatively slow.
Basically the window slides thru the image with fixed stride length, on number of image scales.
The speed of traversing the image plus extracting features for each window is around 1000 ms (1 sec).
Including weak classifiers trained by AdaBoost brought it to around 1200 ms (1.2 secs).
However, when I pass the features (which have been marked as positive by the weak classifiers) to the svm.predict function, the overall time slows down to around 16000 ms (16 secs).
Trying to collect all 'positive' features first, before passing them to svm.predict using TBB's threads, resulted in 19000 ms (19 secs), probably due to the overhead needed to create the threads, etc.
My OpenCV build was compiled to include both TBB (threading) and OpenCL (GPU) functions.
Has anyone managed to speed up OpenCV's SVM.predict function ?
I've been stuck on this issue for quite some time, since it's frustrating to run this detection algorithm through my test data for statistics and threshold adjustment.
Thanks a lot for reading through this!
(Answer posted to formalize my comments, above:)
The prediction algorithm for an SVM takes O(nSV * f) time, where nSV is the number of support vectors and f is the number of features. The number of support vectors can often be reduced by penalizing margin violations more heavily, i.e. by increasing the hyperparameter C (possibly at a cost in predictive accuracy).
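As a quick illustration of both points, here is a scikit-learn sketch (scikit-learn is used here purely for convenience; the question is about OpenCV, but the nSV-versus-C behaviour is the same idea, and the data is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=100, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C, gamma="scale").fit(X, y)
    # prediction cost grows with (number of support vectors) x (feature count)
    print(f"C={C}: {clf.support_vectors_.shape[0]} support vectors")
```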
I'm not sure what features you are extracting, but from the size of your feature vector (3780) I would say you are extracting HOG. There is a very robust, optimized, and fast implementation of HOG "prediction" in the cv::HOGDescriptor class. All you need to do is:
extract your HOGs for training
put them in the svmLight format
use svmLight linear kernel to train a model
calculate the 3780 + 1 dimensional vector necessary for prediction
feed the vector to the setSVMDetector() method of a cv::HOGDescriptor object
use detect() or detectMultiScale() methods for detection
The following document has very good information about how to achieve what you are trying to do: http://opencv.willowgarage.com/wiki/trainHOG, although I must warn you that there is a small problem in the original program, but it teaches you how to approach this problem properly.
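Once the weight vector from steps 4-5 is available, the detection side looks roughly like this in the Python bindings (a sketch under the assumption that w and rho are the primal weights and bias recovered from your trained linear SVM; the file names and parameter values are illustrative):

```python
import cv2
import numpy as np

# w: 3780-dim primal weight vector, rho: bias, both recovered from the trained linear SVM
w = np.load("svm_weights.npy")              # hypothetical files produced by your training step
rho = float(np.load("svm_rho.npy"))

hog = cv2.HOGDescriptor()                   # defaults: 64x128 window, 8x8 cells, 2x2 blocks
hog.setSVMDetector(np.append(w, -rho).astype(np.float32))

img = cv2.imread("frame.png")
# detectMultiScale slides and rescales the window internally, in optimized native code
rects, weights = hog.detectMultiScale(img, winStride=(8, 8), padding=(8, 8), scale=1.05)
for (x, y, bw, bh) in rects:
    cv2.rectangle(img, (x, y), (x + bw, y + bh), (0, 255, 0), 2)
```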
As Fred Foo has already mentioned, you have to reduce the number of support vectors. In my experience, 5-10% of the training base is enough to give a good level of prediction.
Other ways to make it faster:
reduce the size of the feature vector. 3780 is way too much. I'm not sure what this feature size describes in your case, but in my experience, for example, the description of an image like an automobile logo can effectively be packed into 150-200 dimensions;
PCA can be used to reduce the size of the feature vector as well as to reduce its "noise". There are examples of how it can be used with SVM;
if that doesn't help, try other principles of image description, for example LBP and/or LBP histograms;
LDA (alone or with SVM) can also be used.
Try a linear SVM first. It is much faster, and your feature size of 3780 dimensions is more than enough "space" to get good separation in higher dimensions if your sets are linearly separable in principle. If that is not good enough, try an RBF kernel with a fairly standard setup like C = 1 and gamma = 0.1, and only after that POLY, the slowest one.
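To try the "reduce the feature vector, then linear SVM first" advice quickly, here is a scikit-learn sketch (purely illustrative; the component count, C, and the random placeholder data are arbitrary stand-ins for your HOG features):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 3780)           # placeholder for your window features
y = np.random.randint(0, 2, 500)        # placeholder labels (person / not person)

clf = make_pipeline(StandardScaler(), PCA(n_components=150), LinearSVC(C=1.0, max_iter=5000))
print(cross_val_score(clf, X, y, cv=5).mean())
```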

What's the difference between ANN, SVM and KNN classifiers?

I am doing remote sensing image classification using the object-oriented method: first I segment the image into different regions, then I extract features from the regions such as color, shape and texture. The number of features in a region may be 30; commonly there are 2000 regions in all, and I will choose 5 classes with 15 samples for each class.
In summary:
Sample data 1530
Test data 197530
How do I choose the proper classifier? If there are 3 classifiers (ANN, SVM, and KNN), which should I choose for better classification?
KNN is the most basic machine learning algorithm to parameterise and implement, but, as alluded to by @etov, it would likely be outperformed by SVM due to the small training data sizes. ANNs have been observed to be limited by insufficient training data as well. However, KNN makes the fewest assumptions about your data, other than that accurate training data should form relatively discrete clusters. ANN and SVM are notoriously difficult to parameterise, especially if you wish to repeat the process using multiple datasets, and they rely upon certain assumptions, such as that your data is linearly separable (SVM).
I would also recommend the Random Forests algorithm as this is easy to implement and is relatively insensitive to training data size, but I would advise against using very small training data sizes.
The scikit-learn module contains these algorithms and is able to cope with large training data sizes, so you could increase the number of training data samples. The best way to know for sure would be to investigate them yourself, as suggested by @etov.
If your "sample data" is the train set, it seems very small. I'd first suggest using more than 15 examples per class.
As said in the comments, it's best to match the algorithm to the problem, so you can simply test to see which algorithm works better. But to start with, I'd suggest SVM: it works better than KNN with small train sets, and generally easier to train then ANN, as there are less choices to make.
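Since the advice above boils down to "just try them", here is a small scikit-learn sketch of that comparison (the 75 x 30 placeholder array stands in for the 15 samples per class x 5 classes x 30 features described in the question; the classifier settings are arbitrary starting points):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X = np.random.rand(75, 30)               # placeholder: 5 classes x 15 samples, 30 features
y = np.repeat(np.arange(5), 15)          # placeholder class labels

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "ANN": MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```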
Have a look at the mind map below:
KNN: KNN performs well when the sample size is < 100K records, for non-textual data. If accuracy is not high, immediately move to SVC (Support Vector Classifier of SVM).
SVM: when the sample size is > 100K records, go for SVM with SGDClassifier.
ANN: ANNs have evolved over time and they are powerful. You can use both ANN and SVM in combination to classify images.
More details are available at semanticscholar.org.
