I am implementing software for speech recognition using Mel Frequency Cepstrum Coefficients. In particular, the system must recognize a single specified word. From the audio file I get the MFCCs in a matrix with 12 rows (the MFCCs) and as many columns as the number of voice frames. I take the average of each row, so I get a vector with only 12 rows (the i-th row is the average of the i-th MFCC over all frames). My question is how to train a classifier to detect the word? I have a training set with only positive samples: the MFCCs that I get from several audio files (several recordings of the same word).
I take the average of each row, so I get a vector with only 12 rows (the i-th row is the average of the i-th MFCC over all frames).
This is a very bad idea, because you lose all information about the word. You need to analyze the whole MFCC sequence, not a part of it.
My question is how to train a classifier to detect the word?
The simplest form would be a GMM classifier; you can check here:
http://www.mathworks.com/company/newsletters/articles/developing-an-isolated-word-recognition-system-in-matlab.html
In a more complex form you need to learn a more complex model such as an HMM. You can learn more about HMMs from a textbook such as this one:
http://www.amazon.com/Fundamentals-Speech-Recognition-Lawrence-Rabiner/dp/0130151572
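For illustration, here is a minimal sketch of the GMM route in Python rather than MATLAB, assuming librosa and scikit-learn are installed; the file names, the number of mixture components, and the acceptance threshold are placeholders, not from the linked article.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, n_mfcc=12):
    """Return an (n_frames, n_mfcc) matrix of MFCCs for one recording."""
    y, sr = librosa.load(path, sr=None)
    # librosa returns (n_mfcc, n_frames); transpose so each row is one frame
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Fit one GMM on the pooled frames of all positive recordings of the keyword.
train_files = ["word_01.wav", "word_02.wav"]          # placeholder file names
train_frames = np.vstack([mfcc_frames(f) for f in train_files])
word_model = GaussianMixture(n_components=8, covariance_type="diag").fit(train_frames)

# Score a test utterance by its average per-frame log-likelihood under the model
# and accept the keyword when the score clears a threshold tuned on held-out data.
test_frames = mfcc_frames("test.wav")                 # placeholder file name
score = word_model.score(test_frames)
print("keyword" if score > -40.0 else "not the keyword")   # illustrative threshold
```

This keeps every frame of every recording instead of collapsing them to one averaged vector, which is the point of the answer above.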
I am trying to replicate the approach in (https://arxiv.org/abs/1704.05513) to do Big 5 author classification on Facebook data (posts and Big 5 profiles are given).
After removing the stop words, I embed each word in the file with its pre-trained GloVe word vector. However, computing the average or the coordinate-wise min/max word vector for each user and using those as input for a Gaussian Process/SVM gives me terrible results. Now I wonder what the paper means by:
"Our method combines Word Embedding with Gaussian Processes. We extract the words from the users’ tweets and average their Word Embedding representation into a single vector. The Gaussian Processes model then takes these vectors as an input for training and testing."
How else can I "average" the vectors to get decent results, and do they use some specific Gaussian Process?
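For reference, this is a sketch of the averaging pipeline the quote describes, written with scikit-learn; the GloVe file path, the example users, and the trait scores are placeholders, and the GP kernel is just one reasonable choice, not necessarily the paper's.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def load_glove(path):
    """Load GloVe vectors into a dict mapping word -> numpy array."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.300d.txt")   # placeholder path to pre-trained vectors

def user_vector(tokens, dim=300):
    """Average the embeddings of all in-vocabulary tokens of one user."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

# Placeholder data: tokenized, stop-word-filtered posts and one trait score per user.
users = [["love", "music", "friends"], ["work", "deadline", "stress"]]
scores = np.array([3.8, 2.1])

X = np.stack([user_vector(tokens) for tokens in users])   # one averaged vector per user
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, scores)                                         # regress one Big Five trait
print(gp.predict(X))
```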
I have time series data of size 100000×5: 100000 samples and five variables. I have labeled each of the 100000 samples as either 0 or 1, i.e. binary classification.
I want to train it using an LSTM because of the time series nature of the data. I have seen examples of LSTMs for time series prediction; is it suitable to use one in my case?
Not sure about your needs.
An LSTM is best suited to sequence models, like the time series you mention, but your description doesn't quite read like a time series problem.
Anyway, you can use an LSTM on time series not only for prediction but also for classification, as in this article; a sketch of that setup follows below.
In my experience, for binary classification with only 5 features you could find better methods; an LSTM will consume more memory than other methods and could give worse results.
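To make the classification route concrete, here is a minimal PyTorch sketch of an LSTM sequence classifier; the window length, hidden size, and training data are illustrative placeholders, not taken from the question.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, n_features=5, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (batch, seq_len, n_features)
        _, (h, _) = self.lstm(x)             # h[-1]: last hidden state per sequence
        return self.head(h[-1]).squeeze(-1)  # one logit per sequence

model = LSTMClassifier()
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch: 16 windows of 50 time steps, 5 variables each, with 0/1 labels.
x = torch.randn(16, 50, 5)
y = torch.randint(0, 2, (16,)).float()

opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
print(loss.item())
```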
First of all, you can look at it from a different perspective: instead of having 100,000 labeled samples of 5 variables, you can treat it as 100,000 unlabeled samples of 6 variables, where the 6th variable is the label.
Therefore, you can train your LSTM as a multivariate predictor for your 6th variable, that is, the sample label, and compare its predictions with the ground truth during testing to evaluate its performance.
I have the following dataset for a chemical process in a refinery. It is comprised of 5×5 input vectors, where each vector is sampled every minute. The output is the result of the whole process and is sampled every 5 minutes.
I concluded that the output (yellow) depends strongly on past input vectors over time. I recently started looking at LSTMs and am trying to learn a bit about them in Python and Torch.
However, I have no idea how to prepare my dataset so that my LSTM can process it and produce future predictions when tested with new input vectors.
Is there a straightforward way to preprocess my dataset accordingly?
EDIT1: Actually, I found this great blog post about training LSTMs for natural language processing: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ . Long story short, an LSTM takes a character as input and tries to generate the next character. Eventually, it can be trained on Shakespeare poems to generate new Shakespeare poems! But GPU acceleration is recommended.
EDIT2: Based on EDIT1, the simplest way to format my dataset is to just transform my Excel file into a txt file with tab-separated columns. I'll post the results of the LSTM prediction on the numbers dataset above as soon as possible.
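On the original question of shaping the data, here is a sketch of one way to window it with NumPy, assuming each minute's 5×5 vector is flattened to 25 features and the five minutes covered by each output form one sequence; the array contents below are random placeholders.

```python
import numpy as np

n_minutes = 60                               # placeholder series length
inputs = np.random.rand(n_minutes, 5, 5)     # one 5x5 input vector per minute (placeholder)
outputs = np.random.rand(n_minutes // 5)     # one output every 5 minutes (placeholder)

X, y = [], []
for k, target in enumerate(outputs):
    window = inputs[k * 5:(k + 1) * 5]       # the 5 minutes covered by this output
    X.append(window.reshape(5, 25))          # (timesteps=5, features=25)
    y.append(target)

X = np.stack(X)                              # (n_samples, 5, 25), batch-first layout
y = np.asarray(y)
print(X.shape, y.shape)
```

Arrays in this (samples, timesteps, features) layout can be fed directly to a batch-first LSTM.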
As described in several books, the process of recognizing isolated words consists of the following steps:
1. For a given set of signals (templates), determine a feature vector for each template – an M×N matrix, where M is the number of features (MFCC, ZCR, …) and N is the number of frames.
2. Train the templates with some algorithm, such as an ANN, HMM, GMM, or SVM.
3. Recognize the test signal with the trained model.
Because speech signals have different durations, their lengths are aligned with the Dynamic Time Warping (DTW) technique so that N is the same for all templates. This can be done during training.
My question is: how do I change the length of the test signal? I cannot use DTW on it, since I do not know which class it belongs to. Should I use "time stretching" algorithms that preserve pitch, and if so, how will this affect recognition accuracy?
You can get an M×N feature vector equivalent to that of a "time stretched" signal by extracting the features with the N frames spaced closer together or farther apart in time.
You do not need to change the length to make a match. You extract features from the reference samples and the test sample; they all have different numbers of frames. Then you apply DTW between each reference and the test sample, thus aligning them. As a result of the DTW runs you get a score for the match between the test sample and each of the references. In effect, you stretch each reference sample non-uniformly to match the test sample. Because each reference was compared with the same test sample, the DTW scores are directly comparable, so you select the reference with the best score as the result.
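To make the scoring idea concrete, here is a bare-bones DTW sketch over MFCC frame sequences; it assumes each sequence is already an (n_frames, n_mfcc) array, and the Euclidean frame distance is one common choice rather than the only one.

```python
import numpy as np

def dtw_distance(ref, test):
    """Accumulated DTW cost between two MFCC sequences of different lengths."""
    n, m = len(ref), len(test)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref[i - 1] - test[j - 1])    # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],            # stretch the reference
                                 cost[i, j - 1],            # stretch the test
                                 cost[i - 1, j - 1])        # match the frames
    return cost[n, m]

# Placeholder sequences with different numbers of frames but the same 12 MFCCs.
reference = np.random.rand(40, 12)
test_seq = np.random.rand(55, 12)
print(dtw_distance(reference, test_seq))
# With several references, pick the one with the lowest score as the recognized word.
```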
For details and ideas on DTW speech recognition, check this presentation.
If you want to go deeper into the ideas of speech recognition with DTW, you can read the book Fundamentals of Speech Recognition, 1st Edition, by Lawrence Rabiner and Biing-Hwang Juang.
I'm using SVM and HOG in OpenCV to implement people detection.
Say I am using my own dataset: 3000 positive samples and 6000 negative samples.
My question is: does the SVM need to do the learning each time it detects people?
If so, the learning and prediction could be very time-consuming. Is there any way to implement real-time people detection?
Thank you in advance.
Thank you for your answers. I have obtained the xml result after training (3000 positive and 6000 negative samples), so I can just use this result to write another standalone program that only calls svm.load() and svm.predict()? That's great. Besides, I found that the prediction time for 1000 detection windows of size 128x64 is also quite long (about 10 seconds), so how can this handle a normal surveillance camera capture (320x240 or higher) with a scanning step size of 1 or 2 pixels in real time?
I implemented HOG according to the original paper: 8x8 pixels per cell and 2x2 cells per block (50% overlap), so a 3780-dimensional vector for one detection window (128x64). Is the time problem caused by the huge feature vector? Should I reduce the dimensionality for each window?
This is a very specific question about a general topic.
Short answer: no, you don't need to do the learning every time you want to use an SVM. It is a two-step process. The first step, learning (in your case, providing your learning algorithm with many, many labeled pictures that either contain or do not contain people), results in a model which is used in the second step: testing (in your case, detecting people).
No, you don't have to re-train an SVM each and every time.
You do the training once, then svm.save() the trained model to an xml/yml file.
Later you just svm.load() that file instead of (re-)training, and do your predictions.
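Here is a sketch of that train-once / load-later workflow with OpenCV's Python bindings (3.x/4.x); the synthetic windows and the model file name are placeholders just to keep the sketch self-contained.

```python
import cv2
import numpy as np

hog = cv2.HOGDescriptor()        # default 64x128 window -> 3780-dimensional descriptor

def describe(window):
    """Compute the HOG feature vector for one 128x64 (rows x cols) window."""
    return hog.compute(window).flatten()

# --- training, done once (random windows stand in for real 128x64 crops) ---
pos = [np.random.randint(0, 255, (128, 64), np.uint8) for _ in range(20)]
neg = [np.random.randint(0, 255, (128, 64), np.uint8) for _ in range(40)]
samples = np.float32([describe(w) for w in pos + neg])
labels = np.int32([1] * len(pos) + [-1] * len(neg))

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_LINEAR)
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)
svm.save("people_svm.xml")       # placeholder file name

# --- detection, as it would run in a separate standalone program ---
loaded = cv2.ml.SVM_load("people_svm.xml")
test_window = np.random.randint(0, 255, (128, 64), np.uint8)
_, result = loaded.predict(np.float32([describe(test_window)]))
print("person" if result[0][0] > 0 else "no person")
```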