How to extract human voice from an audio clip, using machine learning? - machine-learning

How can we use machine learning to extract the human voice from an audio clip that may contain a lot of noise across the whole frequency domain?

As in any ML application, the process is simple: collect samples, design features, train a classifier. For the samples you can use your own noisy recordings, or you can find plenty of noise in web sound collections like freesound.org. For the features you can use mean-normalized mel-frequency cepstral coefficients; an implementation is available in the CMUSphinx speech recognition toolkit. For the classifier you can pick a GMM or an SVM. If you have enough data it will work fairly well.
To improve accuracy you can add the assumption that noise and voice are continuous: instead of analyzing every frame individually, analyze the detection history with a hangover scheme (essentially an HMM) to detect whole voice chunks.
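For illustration, here is a minimal sketch of the frame-level GMM classifier described above, assuming librosa and scikit-learn are available; the training and test file names are hypothetical placeholders for your own recordings.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13):
    """Load audio and return mean-normalized MFCC frames (n_frames x n_mfcc)."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
    return mfcc - mfcc.mean(axis=0)          # cepstral mean normalization

# Hypothetical training clips: one containing speech, one containing noise only.
speech = mfcc_frames("speech.wav")
noise = mfcc_frames("noise.wav")

# One GMM per class, as suggested above.
gmm_speech = GaussianMixture(n_components=8).fit(speech)
gmm_noise = GaussianMixture(n_components=8).fit(noise)

# Frame-level decision on a new noisy clip: keep frames where speech is more likely.
test = mfcc_frames("noisy_clip.wav")
is_voice = gmm_speech.score_samples(test) > gmm_noise.score_samples(test)
print(f"{is_voice.mean():.0%} of frames classified as voice")
```

A hangover scheme would then smooth the resulting frame decisions, for example by only switching between voice and noise after several consecutive frames agree.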

Related

Classification task based on speech recordings

I'm currently working with a huge dataset consisting of speech recordings of conversations between 120 persons. For each person, I have around 5 conversation recordings lasting 40-60 min (the conversations are dyadic). For each recording, I have a label (i.e., around 600 labels in total). The labels provide information about the mental state of one of the persons in the conversation (three classes). To classify this mental state based on the speech recordings, I see the following three possibilities:
1. Extracting mel-spectrograms or MFCCs (better for speech) and training an RNN/CNN (e.g., ConvLSTM) for the classification task. Here I see the problem that, with the small number of labels, it might overfit. In addition, the recordings are long, and thus training an RNN might be difficult. A network pretrained on a different task (e.g., speech recognition?) could also be used (but is probably not available for RNNs).
2. Training a CNN autoencoder on mel-spectrograms or MFCCs over small shifted windows (e.g., 1 minute). The encoder could then be used to extract features. The problem here is that probably the whole recording must be used for the prediction. Thus, features need to be extracted over the whole recording, which would require the same shifted windows as for training the autoencoder.
3. Extracting the features manually (e.g., frequency-based features etc.) and using an SVM or Random Forest for the prediction, which might suit the small number of labels better (a fully-connected network could also be used for comparison). Here the advantage is that features can be chosen that are independent of the length of the recording (see the sketch below).
Which option do you think is best? Do you have any recommendations?
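For illustration, a minimal sketch of option 3 under assumed inputs: the file names, the label list, and the choice of MFCC summary statistics are hypothetical placeholders, and librosa plus scikit-learn are assumed to be available.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def recording_features(path, sr=16000, n_mfcc=20):
    """Length-independent summary features: mean and std of MFCCs over the recording."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical inputs: one path and one mental-state label (0/1/2) per labelled recording.
paths = ["rec_001.wav", "rec_002.wav"]   # ... stands in for the ~600 recordings
labels = [0, 2]                          # ... stands in for the ~600 labels

X = np.stack([recording_features(p) for p in paths])
y = np.array(labels)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
```

When evaluating such a model, the cross-validation folds should be grouped by speaker so that the same person never appears in both the training and the test data.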

(Speech/sound recognition) Why do most article/book show codes that train machine using JPEG file of plotted spectrogram?

Background: Hi, I am a total beginner in machine learning with a civil engineering background.
I am attempting to train a machine (using Python) to create an anomaly-detection algorithm that can detect defects inside concrete using the sound from a hammering test.
From what I understand, to train a machine for speech recognition, you need to process your sound signal with signal processing such as short-time Fourier analysis or wavelet analysis. From this analysis, your sound data is decomposed into the frequencies (and times) that it is made up of, i.e., a time-frequency-amplitude array.
After that, most articles that I've read plot a spectrogram from this array and save it as a JPG/JPEG. The image data is then processed again to be fed into a neural network. The rest is the same as training a machine for an image recognition task.
My question is: why do we need to plot the array as a spectrogram (image file) and feed our machine the image file instead of using the array directly?
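As a hedged illustration of feeding the array directly, the sketch below computes a log-spectrogram with SciPy and shapes it as a float tensor for a CNN; the file name and STFT parameters are placeholder assumptions, and a mono recording is assumed.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Hypothetical hammering-test recording (mono).
sr, signal = wavfile.read("hammer_tap.wav")

# Short-time Fourier analysis: frequency bins x time bins, amplitude values.
freqs, times, Sxx = spectrogram(signal, fs=sr, nperseg=1024, noverlap=512)

# Log-scale and normalize; this float array can be fed to a CNN directly
# (just add batch and channel axes), no need to render and re-load a JPEG image.
log_S = np.log1p(Sxx)
log_S = (log_S - log_S.mean()) / (log_S.std() + 1e-8)
x = log_S[np.newaxis, np.newaxis, :, :].astype(np.float32)  # (batch, channel, freq, time)
print(x.shape)
```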

Condition Based Monitoring | CBM

Machine Learning (ML) can do two things with vibration/acoustic signals for Condition Based Monitoring (CBM):
1. Feature extraction and
2. Classification
But if we look through the research, why are signal processing techniques used for pre-processing and ML only for the rest, i.e., classification?
We could use ML alone for all of this, yet I keep seeing models that merge the two techniques: the conventional signal processing approach and ML.
I want to know the specific reason for that. Why do researchers use both when they could do it with ML only?
Yes, you can do so. However, the task becomes more complicated.
The FFT, for example, transforms the input space into a more meaningful representation. If you have rotating equipment, you would expect the spectrum to be concentrated around the frequency of rotation. However, if there is a problem, the spectrum changes. This can often be detected by, for example, SVMs.
If you don't do the FFT but only provide the raw signal, SVMs have a hard time.
Nevertheless, I've seen recent practical examples using deep convolutional networks that have learned to predict problems from raw vibration data. The disadvantage, however, is that you need more data. More data is not a problem in general, but if you take a wind turbine, for example, getting more failure data obviously -- or hopefully ;-) -- is a problem.
The other thing is that the ConvNet learned the FFT all by itself. But why not use prior knowledge if you have it?
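A minimal sketch of the "signal processing + ML" combination described above, assuming NumPy and scikit-learn are available; the random vibration windows and labels are stand-ins for real healthy/faulty measurements from a monitored machine.

```python
import numpy as np
from numpy.fft import rfft
from sklearn.svm import SVC

def spectrum_features(window, n_bins=64):
    """Magnitude spectrum of one vibration window, pooled into coarse frequency bins."""
    mag = np.abs(rfft(window * np.hanning(len(window))))
    return np.array([chunk.mean() for chunk in np.array_split(mag, n_bins)])

# Placeholder data: raw vibration windows (n_windows x n_samples) with
# healthy (0) / faulty (1) labels.
rng = np.random.default_rng(0)
windows = rng.normal(size=(200, 4096))
labels = rng.integers(0, 2, size=200)

# Pre-processing with the FFT, classification with an SVM.
X = np.stack([spectrum_features(w) for w in windows])
clf = SVC(kernel="rbf").fit(X, labels)
```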

hand motion recognition using Hidden Markov Model

I'm doing a hand motion recognition project for my final assignment. The core of my code is a Hidden Markov Model. Some papers say that we first need to detect the object, perform feature extraction, and then use an HMM to recognize the motion.
I'm using OpenCV. I've done the hand detection using a Haar classifier and I've prepared the HMM code in C++, but I'm missing something:
- I don't know how to integrate the Haar classifier with the HMM.
- How do I perform feature extraction on the detected hand (from the Haar classifier)?
- I know we should first train the HMM for motion recognition, but I don't know how to train on motion data. What kind of data should I use? How do I prepare it? Where can I find it, or how can I collect it?
- When I search on Google, some people say that HMM motion recognition is similar to HMM speech recognition, but I'm confused about which part is similar.
Someone please tell me if I'm doing something wrong and give me suggestions on what I should do. Please teach me, master.
To my understanding:
1) A Haar classifier is used to detect static objects, which means it works within a single frame of the image.
2) An HMM is used to recognize temporal features, which means it works across frames.
So what you want to do is first track the hand, extract features of the hand, and then train the gesture movement with an HMM.
As for the features, the most naive one is the "pixel by pixel" feature: you just put all the pixels' intensities together. After this, a dimensionality reduction is needed, say PCA.
After that, one way of using an HMM is to discretize the features into clusters, train the model with the discretized state sequences, and then predict the probability of a given sequence of features belonging to each of the gestures.
Note
This is not a standard gesture recognition procedure, but it is simple enough for your final project.
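A minimal sketch of that pipeline (PCA, clustering into discrete symbols, discrete HMM), assuming the hmmlearn and scikit-learn packages; note that in older hmmlearn versions the discrete-observation class is MultinomialHMM rather than CategoricalHMM, and the random pixel data stands in for real tracked hand frames.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from hmmlearn import hmm

# Placeholder input: one gesture = a sequence of hand-region frames,
# each flattened to a pixel-intensity vector (n_frames x n_pixels).
rng = np.random.default_rng(0)
gesture_frames = rng.random((40, 32 * 32))

# 1) Reduce the per-frame pixel features with PCA.
features = PCA(n_components=10).fit_transform(gesture_frames)

# 2) Discretize frames into cluster indices (the HMM's observation symbols).
symbols = KMeans(n_clusters=8, n_init=10).fit_predict(features).reshape(-1, 1)

# 3) Train a discrete HMM on the symbol sequence.
model = hmm.CategoricalHMM(n_components=5, n_iter=50)
model.fit(symbols)
print(model.score(symbols))   # log-likelihood of the sequence under this model
```

In practice you would fit the PCA and the clustering on all training gestures, train one HMM per gesture class, and classify a new sequence by picking the class whose HMM assigns it the highest log-likelihood.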

Simple technique to upsample/interpolate video features?

I'm trying to analyse audio and visual features in tandem. My audio speech features are mel-frequency cepstral coefficients sampled at 100 fps using the Hidden Markov Model Toolkit. My visual features come from a lip-tracking programme I built and are sampled at 29.97 fps.
I know that I need to interpolate my visual features so that the sample rate is also 100 fps, but I can't find a good explanation or tutorial on how to do this online. Most of the help I have found comes from the speech recognition community, which assumes knowledge of interpolation on the part of the reader, i.e., most cover the step with a simple "interpolate the visual features so that the sample rate equals 100 fps".
Can anyone point me in the right direction?
Thanks a million.
Since face movement is not low-pass filtered prior to video capture, most of the classic DSP interpolation methods may not apply. You might as well try linear interpolation of your feature vectors to get from one set of time points to another. Just pick the two closest video frames and interpolate to get more data points in between. You could also try spline interpolation if your facial tracking algorithm measures accelerations in face motion.
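A minimal sketch of the linear interpolation suggested above, using only NumPy; the feature dimensions and frame counts are made-up placeholders.

```python
import numpy as np

def upsample_features(visual, src_fps=29.97, target_fps=100.0):
    """Linearly interpolate each visual feature dimension onto the audio frame times."""
    n_frames, n_dims = visual.shape
    src_times = np.arange(n_frames) / src_fps
    target_times = np.arange(0, src_times[-1], 1.0 / target_fps)
    return np.stack(
        [np.interp(target_times, src_times, visual[:, d]) for d in range(n_dims)],
        axis=1,
    )

# Placeholder lip-tracking features: 300 video frames, 6 feature dimensions.
visual = np.random.rand(300, 6)
print(upsample_features(visual).shape)   # roughly 100/29.97 times more frames
```

For spline interpolation, scipy.interpolate.interp1d with kind="cubic" over the same source and target time axes is a drop-in alternative.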
