Using MFCC coefficients for simple voice activity detection - signal-processing

Since MFCC coefficients store information about the amplitudes of frequency bands (which depend on the filter bank used), how can those coefficients be used for voice activity detection?
Would it be sufficient to use these coefficients to perform further energy calculations and make decisions with them?

Question 1) Since MFCC coefficients store information about the amplitudes of frequency bands (which depend on the filter bank used), how can those coefficients be used for voice activity detection?
Answer: As MFCC coefficients store information about the amplitudes of frequency bands, different people produce different frequency patterns when uttering the same sentence. The frequency bands of the spoken words are compared against the band frequencies in a database, and the person is identified.
Question 2) Would it be sufficient to use these coefficients to perform further energy calculations and make decisions with them?
Answer: Yes, it would be sufficient to perform further energy calculations, but if you add LPC and other features alongside the MFCCs, you will get better decisions.
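As a minimal sketch of the energy-based idea, assuming librosa (a hypothetical choice; any MFCC implementation works): the 0th coefficient tracks each frame's log energy, so thresholding it gives a crude speech/non-speech decision. The file name and threshold rule below are made up for illustration and would need tuning on real recordings.

```python
# Minimal energy-based VAD sketch on top of MFCCs, assuming librosa.
# "speech.wav" and the threshold rule are illustrative assumptions.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)        # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, n_frames)

energy = mfcc[0]                     # C0 ~ per-frame log energy
threshold = energy.mean()            # naive global threshold
voiced = energy > threshold          # True where the frame looks like speech
print(f"{voiced.mean():.0%} of frames flagged as speech")
```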

Related

DSP and ML: How to classify DTMF tones?

I am currently learning ML, and one project (for learning purposes) that I am considering is to classify DTMF tones using ML.
I will be using numpy/scipy. I will have a time-domain DTMF signal (for each of the numbers 0-9), run it through an FFT function, and get an array of frequency values in which two frequencies, corresponding to the phone dial pad, have higher values than the rest.
Hypothetical example: a hypothetical DTMF tone has two frequencies, 100 Hz and 300 Hz. The FFT array will have an increment of 100 Hz (only in this example; my actual implementation will have finer increments):
index 0 (100Hz) - 1.0
index 1 (200Hz) - 0.0
index 2 (300Hz) - 1.0
index 3 to n - 0.0
Most of the scikit-learn examples I have seen use a single value for classification. How can I use this array of FFT frequencies to train and classify the DTMF data?
What I am currently thinking is to use matplotlib to plot the FFT frequencies, save those plots as pictures, and use image classification to train the model and classify the DTMF signals. But that seems an "expensive" approach. What approach could I use without resorting to image classification?
[Figure: DTMF keypad frequency layout, from https://blogs.mathworks.com/cleve/2014/09/01/touch-tone-telephone-dialing/]
A linear classifier would be a plausible ML approach for this task:
Compute the FFT of the input signal. Take the magnitude (abs) of the spectrum, since phase is not relevant for distinguishing DTMF tones. This will result in an array of N real nonnegative values.
Follow this with a linear layer (without a bias term), taking a size-N input and producing a size-8 output, one value per DTMF frequency (four row tones and four column tones). In other words, the layer is an Nx8 matrix, and you multiply the spectrum with this matrix to get 8 values.
Finally, apply a softmax to get 8 normalized confidences in the [0, 1] range. Interpret the ith confidence as predicting whether the ith tone is present. Find the largest two confidences to determine which DTMF symbol it is.
When trained, the linear layer should learn which band of frequencies to associate with each tone. If N is large compared to the number of training examples, it will help to add an L2 penalty or other regularization on the linear layer's parameters.
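Below is a minimal sketch of this idea with numpy and scikit-learn (assumed choices; the answer does not prescribe a library). It uses one L2-regularized logistic output per frequency instead of a single softmax, since two tones are present at once; the synthetic tones, sample rate, and noise level are made up for illustration.

```python
# Linear classification of DTMF tones from FFT magnitudes (sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

FS = 8000                          # sample rate in Hz (assumed)
ROWS = [697, 770, 852, 941]        # DTMF row frequencies
COLS = [1209, 1336, 1477, 1633]    # DTMF column frequencies

def tone_spectrum(row, col, n=2048, noise=0.1):
    """Synthesize one noisy DTMF tone and return its FFT magnitude."""
    t = np.arange(n) / FS
    x = np.sin(2 * np.pi * ROWS[row] * t) + np.sin(2 * np.pi * COLS[col] * t)
    x += noise * np.random.randn(n)
    return np.abs(np.fft.rfft(x))  # phase is irrelevant, keep magnitude only

# Training set: each example is a spectrum; each label is an 8-element
# 0/1 vector marking which of the 8 DTMF frequencies are present.
X, Y = [], []
for _ in range(200):
    r, c = np.random.randint(4), np.random.randint(4)
    X.append(tone_spectrum(r, c))
    y = np.zeros(8, dtype=int)
    y[r], y[4 + c] = 1, 1
    Y.append(y)
X, Y = np.array(X), np.array(Y)

# One L2-regularized linear model per frequency ~ the N x 8 linear layer.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Decode a fresh tone: most confident row frequency + column frequency.
spec = tone_spectrum(2, 1)
conf = np.array([m.predict_proba([spec])[0, 1] for m in clf.estimators_])
print("row:", ROWS[int(np.argmax(conf[:4]))],
      "col:", COLS[int(np.argmax(conf[4:]))])
```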

Forest Classification with multispectral imagery

How can I classify forest cover into stratified classes of forest canopy density with 12-band Sentinel-2 data?
I am attempting to use recent Sentinel-2 data to classify all forest over 5 m in height. If possible, I would also like canopy density (as a metric for percentage tree cover); however, I am not sure how feasible this is without control points on the ground.
I have read that it is possible to use the shadowing in the panchromatic band in conjunction with the NIR and IR bands to determine depth, but this does not appear to be available with Sentinel-2.
I have completed forest inventory before utilizing LiDAR, and on another occasion with multispectral imagery, ground control points, and object-based classification. However, I have never attempted it using solely multispectral imagery.
A method has been developed previously which used multispectral satellite imagery from the Landsat 7 Enhanced Thematic Mapper Plus (ETM+) sensor and a supervised learning algorithm to determine tree cover, but details regarding the finer points of the method are scant.
I am unfortunately limited to ArcGIS 10 (Advanced licence) with just the Spatial Analyst and Geostatistical Analyst extensions.
Any assistance with this would be greatly appreciated.
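For what it's worth, here is a rough, heavily hedged sketch of the kind of supervised approach the ETM+ study hints at: per-pixel band values fed to a supervised learner. Everything here is hypothetical (file names, label scheme, classifier choice); it is not the study's actual method, and it assumes you can export the Sentinel-2 bands and some training labels as numpy arrays.

```python
# Hypothetical per-pixel supervised classification of canopy density.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

bands = np.load("sentinel2_pixels.npy")   # shape (n_pixels, 12), hypothetical
labels = np.load("training_labels.npy")   # e.g. -1=unlabelled, 0=non-forest,
                                          # 1..4 = canopy-density classes
train = labels >= 0                       # keep only labelled pixels

clf = RandomForestClassifier(n_estimators=200)
clf.fit(bands[train], labels[train])
density_class = clf.predict(bands)        # per-pixel canopy-density class
```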

How can I normalize data to have the same average sum-of-squares?

In a lot of articles in my field, this sentence is repeated: "The 2 matrices have been normalized to have the same average sum-of-squares (computed across all subjects and all voxels for each modality)." Suppose we have two matrices in which the rows correspond to different subjects and the columns to features (voxels). These articles offer little explanation of the normalization method. Does anybody know how I should normalize the data to have the "same average sum-of-squares"? I don't understand it at all. Thanks.
For a start, normalization in this context is also known as feature scaling, which pretty much sums it up: you scale your features so that differences in variance and range of values do not distort your algorithm and your results.
https://en.wikipedia.org/wiki/Feature_scaling
In data processing, normalization is quite useful (depending on the application). For example, in distance-based machine learning algorithms you should normalize your features so that each contributes proportionally to the outcome, independent of the range of values the features span.
To do so, you can use different statistical measurements, like the sum of squares:
$\sum_i (x_i - \bar{x})^2$
Other than that you could use the variance or the standard deviation of your data.
https://www.westgard.com/lesson35.htm#4
Those statistical terms can then be used to normalize your data and improve, e.g., the clustering quality of your algorithm. Which term and which method to use depend highly on the algorithms and data you are using and on what you are aiming at.
Here is a paper which compares some of the approaches you could choose from for clustering:
http://maxwellsci.com/print/rjaset/v6-3299-3303.pdf
I hope this can help you a little.
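To address the quoted phrase directly: one plausible reading (an assumption, since the articles do not spell the method out) is to divide each modality's matrix by the root mean square of its entries, so that the mean of the squared entries, computed across all subjects and voxels, becomes 1 for both modalities. A minimal numpy sketch:

```python
# Scale each modality so its mean sum-of-squares equals 1 (one reading
# of "same average sum-of-squares"; an assumption, not the papers' method).
import numpy as np

def scale_to_unit_mean_square(M):
    """Divide M by the RMS of its entries so that mean(M**2) == 1."""
    return M / np.sqrt(np.mean(M ** 2))

rng = np.random.default_rng(0)
A = 5.0 * rng.normal(size=(20, 1000))   # modality 1: subjects x voxels
B = 0.3 * rng.normal(size=(20, 800))    # modality 2: very different scale

A_n, B_n = scale_to_unit_mean_square(A), scale_to_unit_mean_square(B)
print(np.mean(A_n ** 2), np.mean(B_n ** 2))   # both ~1.0 after scaling
```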

Simple word detector using MFCC

I am implementing software for speech recognition using Mel-frequency cepstral coefficients. In particular, the system must recognize a single specified word. From the audio file I get the MFCCs in a matrix with 12 rows (the MFCCs) and as many columns as there are voice frames. I average across the frames, so I get a vector with only 12 rows (the ith row is the average of the ith MFCC over all frames). My question is how to train a classifier to detect the word. I have a training set with only positive samples: the MFCCs that I get from several audio files (several recordings of the same word).
I average across the frames, so I get a vector with only 12 rows (the ith row is the average of the ith MFCC over all frames).
This is a very bad idea, because you lose all temporal information about the word. You need to analyze the whole MFCC sequence, not a summary of it.
My question is how to train a classifier to detect the word?
The simple form would be a GMM classifier; you can check here:
http://www.mathworks.com/company/newsletters/articles/developing-an-isolated-word-recognition-system-in-matlab.html
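As a minimal sketch of that GMM form, assuming scikit-learn (a hypothetical choice; the linked article uses MATLAB): fit one mixture to the frame-level MFCCs of the positive recordings, then accept an utterance when its average log-likelihood clears a threshold tuned on held-out recordings. This thresholding suits a training set with only positive samples.

```python
# GMM word detector sketch; the threshold value is made up and must be tuned.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_word_model(train_mfccs, n_components=8):
    """train_mfccs: list of (12, n_frames) MFCC matrices for the word."""
    frames = np.hstack(train_mfccs).T   # stack to (total_frames, 12)
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(frames)

def is_target_word(gmm, mfcc, threshold=-40.0):
    """score() returns the average per-frame log-likelihood."""
    return gmm.score(mfcc.T) > threshold
```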
In a more complex form, you need to learn a more complex model like an HMM. You can learn more about HMMs from a textbook like this one:
http://www.amazon.com/Fundamentals-Speech-Recognition-Lawrence-Rabiner/dp/0130151572

Why should we perform cosine normalization for SVM feature vectors?

I was recently playing around with the well known movie review dataset used in binary sentiment analysis. It consists of 1,000 positive and 1,000 negative reviews. While exploring various feature-encodings with unigram features, I noticed that all previous research publications normalize the vectors by their Euclidean norm in order to scale them to unit-length.
In my experiments using Liblinear, however, I found that such length-normalization decreases the classification accuracy significantly. I studied the vectors, and I think this is the reason: the dimension of the vector space is, say, 10,000. As a result, the Euclidean norm of the vectors is very high compared to the individual projections. Therefore, after normalization, all the vectors get very small numbers on each axis (i.e., the projection on an axis).
This surprised me, because all publications in this field claim that they perform cosine normalization, whereas I found that NOT normalizing yields better classification.
Thus my question: is there any specific disadvantage if we don't perform cosine normalization for SVM feature vectors? (Basically, I am seeking a mathematical explanation for this need for normalization).
After perusing the manual of LibSVM, I realize why the normalization was yielding much lower accuracy when compared to not normalizing. They recommend scaling the data to a [0,1] or [-1,1] interval. This is something I had not done. Scaling up will resolve the issue of having too many data points very close to zero, while retaining the advantages of length-normalization.
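For illustration, here is a small sketch of that combination using scikit-learn (an assumed tool; the same can be done with LibSVM's svm-scale utility): length-normalize each review vector, then rescale each feature so the values are no longer crowded near zero.

```python
# Cosine normalization followed by per-feature rescaling (sketch).
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, Normalizer

X = np.random.rand(2000, 10000)   # stand-in unigram feature vectors

X_unit = Normalizer(norm="l2").fit_transform(X)   # unit-length rows
X_scaled = MaxAbsScaler().fit_transform(X_unit)   # per-feature [0, 1] here,
                                                  # since inputs are nonnegative
```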
