How to change the length of the unidentified speech signal during recognition? - signal-processing

As described in several books, the process of recognition of isolated words consists of the following:
For a given set of signals (templates), determine a feature vector for each template – an M×N matrix, where M is the number of features (MFCC, ZCR, …) and N is the number of frames.
Train the templates with some algorithm, such as an ANN, HMM, GMM, or SVM.
Recognize the test signal with the trained model.
Because speech signals have different durations, their lengths are aligned with the Dynamic Time Warping (DTW) technique, so that N is the same for all templates. This can be done during training.
My question is: how do I change the length of the test signal? I cannot use DTW on it, since I do not know which class it belongs to. Should I use "time stretching" algorithms that preserve pitch, and if so, how will this affect recognition accuracy?

You can get a feature matrix equivalent to that of a "time-stretched" signal simply by extracting the features with the N frames spaced closer together or farther apart in time.
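A minimal sketch of that idea in Python, assuming librosa for feature extraction; the file name and the target frame count N are hypothetical. The hop length is chosen from the signal length so that every utterance yields roughly the same number of frames:

```python
import librosa

N = 50   # target number of frames (hypothetical choice)
M = 13   # number of MFCCs per frame

# Any template or test utterance, loaded at its native sample rate.
y, sr = librosa.load("utterance.wav", sr=None)

# Space the analysis frames so that roughly N of them cover the signal.
hop = max(1, len(y) // N)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=M, hop_length=hop)
print(mfcc.shape)  # approximately (M, N); the exact count depends on padding
```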

You do not need to change the length to make a match. You extract features from the reference samples and the test sample; they all have different numbers of frames. Then you apply DTW between each reference and the test, which aligns them: DTW stretches each reference non-uniformly to match the test sample. Each DTW run gives you a score for the match between the test sample and that reference, and because every reference was compared with the same test sample, the scores are comparable. You then select the reference with the best score as the result.
For details and ideas on DTW-based speech recognition, check this presentation.
If you want to dig deeper into speech recognition with DTW, you can read the book Fundamentals of Speech Recognition by Lawrence Rabiner and Biing-Hwang Juang.
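To make the matching procedure concrete, here is a minimal sketch in Python: a plain dynamic-programming DTW over frame-to-frame Euclidean distances, with randomly generated "MFCC" sequences standing in for real templates. It is my own illustration, not code from the presentation or the book:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW cost between two feature sequences a (Ta x M) and b (Tb x M)."""
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Ta, Tb]

def recognize(test, references):
    """references: dict mapping word label -> MFCC array of shape (frames, features)."""
    scores = {word: dtw_distance(test, ref) for word, ref in references.items()}
    return min(scores, key=scores.get)  # the reference with the lowest cost wins

# Toy example with random "MFCC" sequences of different lengths.
rng = np.random.default_rng(0)
refs = {"yes": rng.normal(size=(40, 12)), "no": rng.normal(size=(55, 12))}
test = rng.normal(size=(47, 12))
print(recognize(test, refs))
```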

Related

What is used to train a self-attention mechanism?

I've been trying to understand self-attention, but nothing I found explains the concept very well at a high level.
Let's say we use self-attention in an NLP task, so our input is a sentence.
Then self-attention can be used to measure how "important" each word in the sentence is for every other word.
The problem is that I do not understand how that "importance" is measured. Important for what?
What exactly is the goal vector the weights in the self-attention algorithm are trained against?
Connecting language with underlying meaning is called grounding. A sentence like “The ball is on the table” results in an image that can be reproduced with multimodal learning. Multimodal means that different kinds of words are available, for example events, action words, subjects and so on. A self-attention mechanism maps input vectors to output vectors, with a neural network between them. The output vector of the neural network refers to the grounded situation.
Let us take a short example. We need a 300x200 pixel image, a sentence in natural language, and a parser. The parser works in both directions: it can convert text to an image, so the sentence “The ball is on the table” gets converted into the 300x200 image, but it can also parse a given image and extract the natural sentence back. Self-attention learning is a bootstrapping technique to learn and use this grounded relationship: to verify existing language models, to learn new ones, and to predict future system states.
This question is old now, but I came across it, so I figured I should update others as my own understanding has increased.
Attention simply refers to some operation that takes the output and combines it with some other information. Typically this happens by taking the dot product of the output with some other vector, so it can "attend" to it in some way.
Self-attention combines the output with other parts of the input (hence the "self" part). Again, the combination usually occurs via the dot product between the vectors.
Finally how is attention (or self-attention) trained?
Let's call Z our output, W our weight matrix and X our input (we'll use # as the matrix multiplication symbol).
Z = X^T # W^T # X
In NLP we compare Z to whatever we want the resulting output to be; in machine translation, for example, it is the sentence in the other language. We can compare the two with the average cross-entropy loss over each predicted word. Finally we update W with backpropagation.
How do we see what is important? We can look at the magnitudes of Z to see after the attention what words were most "attended" to.
This is a slightly simplified example as it only has one weight matrix and typically the inputs are embedded but I think it still highlights some of the necessary details concerning attention.
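For illustration, here is that single-weight-matrix form in NumPy. Note two additions of my own that the formula above omits: the scores are passed through a softmax, and the normalized scores are then used to mix the input vectors (real implementations also use separate query/key/value matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 "words", each a 3-dimensional embedding
W = rng.normal(size=(3, 3))   # the single trainable weight matrix

scores = X @ W @ X.T          # pairwise scores between positions (4 x 4)
A = softmax(scores, axis=-1)  # each row: how much a word attends to the others
Z = A @ X                     # output: attention-weighted mix of the inputs

print(A)  # large entries show which words were most "attended" to
```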
Here is a useful resource with visualizations for more information about attention.
Here is another resource with visualizations covering attention in transformers, specifically self-attention.

NLP - Word Representations

I am working on a word representation algorithm, similar to Word2Vec and GloVe. I have been asked to make it more dynamic, so that new words can be added to the vocabulary and new documents can be submitted to the program even after the representations (vectors) have been created.
The problem is: how do I know if my representations work? How do I know if they actually capture the meaning of each word? How do I compare my representation with other existing vector space models?
As of now, I am doing the following tests to check the quality of my word vectors:
Distance test:
Does the cosine distance between vectors reflect the semantic distance between words?
Analogy test:
Can the representation be used to solve problems like "King is to queen what man is to ________" (the answer should be woman)?
Picking the odd one out:
Can the vectors be used to pick the odd word out of a given list? If the input is {"cat","dog","phone"}, the output should be "phone".
What are the other tests that I should do to check the quality of the vectors? What other tasks are word vectors expected to be capable of doing? Is there a benchmark for vector space models?
Your tests sound very reasonable — they are the usual evaluation tasks that are used in research papers to test the quality of word embeddings.
In addition, the website www.wordvectors.org can give you a good idea of how your vectors measure up. It allows you to upload your embeddings, generates plots, gives correlations with word pair similarity rankings, and compares your embeddings with pre-trained vectors from previous research. You can find a more detailed description in the accompanying paper.
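As a concrete illustration of the analogy and odd-one-out tests from the question, here is a minimal sketch using cosine similarity over a hypothetical dictionary of word vectors (the embedding itself is not provided here):

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(emb, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), e.g. king:queen :: man:?"""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def odd_one_out(emb, words):
    """Return the word that is, on average, least similar to the others."""
    def avg_sim(w):
        return np.mean([cosine(emb[w], emb[o]) for o in words if o != w])
    return min(words, key=avg_sim)

# emb would be your trained vectors: {"king": np.array([...]), "queen": ..., ...}
# print(analogy(emb, "king", "queen", "man"))        # hopefully "woman"
# print(odd_one_out(emb, ["cat", "dog", "phone"]))   # hopefully "phone"
```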

Simple word detector using MFCC

I am implementing software for speech recognition using Mel Frequency Cepstrum Coefficients. In particular, the system must recognize a single specified word. From the audio file I get the MFCCs as a matrix with 12 rows (the MFCCs) and as many columns as there are voice frames. I average each row, so I get a vector with only 12 values (the i-th value is the average of the i-th MFCC over all frames). My question is: how do I train a classifier to detect the word? I have a training set with only positive samples: the MFCCs that I get from several audio files (several recordings of the same word).
I average each row, so I get a vector with only 12 values (the i-th value is the average of the i-th MFCC over all frames).
This is a very bad idea, because you lose all information about the word. You need to analyze the whole MFCC sequence, not a summary of it.
My question is how to train a classifier to detect the word?
The simplest form would be a GMM classifier; you can check here:
http://www.mathworks.com/company/newsletters/articles/developing-an-isolated-word-recognition-system-in-matlab.html
In a more complex form you need to learn a more complex model such as an HMM. You can learn more about HMMs from a textbook such as this one:
http://www.amazon.com/Fundamentals-Speech-Recognition-Lawrence-Rabiner/dp/0130151572
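Not the MathWorks example itself, but a minimal sketch of the simple GMM idea under the question's constraint of positive-only training data: fit a GMM to the frame-level MFCCs of the target word and accept a new utterance when its average log-likelihood clears a threshold. librosa and scikit-learn are assumed, and the file names and threshold are hypothetical:

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, n_mfcc=12):
    """Return the full MFCC sequence (frames x coefficients), not its average."""
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Train on frames pooled from several recordings of the target word.
train_files = ["word_01.wav", "word_02.wav", "word_03.wav"]  # hypothetical files
train_frames = np.vstack([mfcc_frames(f) for f in train_files])

gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(train_frames)

def looks_like_word(path, threshold=-60.0):  # threshold is a placeholder; tune on held-out positives
    frames = mfcc_frames(path)
    return gmm.score(frames) > threshold     # score() = mean log-likelihood per frame

print(looks_like_word("test.wav"))
```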

facial expression classification using k-means

My method for classifying facial expressions using k-means is:
Use OpenCV to detect the face in the image.
Use ASM and stasm to get the facial feature points.
Calculate the distances between facial features (as shown in the picture). There will be 5 distances.
Calculate the centroid of each distance for each facial expression (e.g., for distance D1 there are 7 centroids, one per expression: happy, angry, ...).
Run 5 k-means, one per distance; each k-means outputs the expression whose centroid (calculated in the previous step) is closest to the measured distance.
The final expression is the expression that appears most often across the 5 k-means results.
However, using that method my results are wrong.
Is my method correct, or is it wrong somewhere?
K-means is not a classification algorithm. Once run, it simply finds the centroids of K groups, so it splits the data into K parts, but in most cases this will have nothing to do with the desired classes. This algorithm (like all clustering methods) should be used when you want to explore data and find some distinguishable objects, distinguishable in any sense. If your task is to build a system that recognizes some given classes, then it is a classification problem, not clustering. One of the simplest methods, easy to both implement and understand, is KNN (K-nearest neighbours), which roughly does what you are trying to accomplish: it checks which class's objects are closest to some predefined ones.
To better see the difference, let us consider your case: you are trying to detect emotional state based on face features. Running k-means on such data can split your face photos into many kinds of groups:
If you use photos of different people, it can cluster photos of particular people together (as their distances differ from others').
It can split the data into, for example, men and women, as there are gender-specific differences in such features.
It can even split your data based on the distance from the camera, as perspective changes your features, creating "clusters".
etc.
As you can see, there are dozens of possible "reasonable" splits (and even more completely uninterpretable ones), and k-means (like any other clustering algorithm) will simply find one of them, in most cases an uninterpretable one. Classification methods are used to overcome this issue, to "explain" to the algorithm what you are expecting.
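To make that contrast concrete, here is a minimal sketch of the KNN alternative on the five facial distances, assuming scikit-learn; the training rows and labels below are hypothetical placeholders:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One row per training face, 5 columns = the 5 facial distances (hypothetical data).
X_train = np.array([
    [0.42, 0.31, 0.18, 0.55, 0.27],   # labelled "happy"
    [0.39, 0.35, 0.22, 0.48, 0.30],   # labelled "angry"
    # ... more labelled examples per expression
])
y_train = ["happy", "angry"]

knn = KNeighborsClassifier(n_neighbors=1)  # use more neighbours once there is more data
knn.fit(X_train, y_train)

# Classify a new face from its 5 measured distances.
new_face = np.array([[0.41, 0.32, 0.19, 0.53, 0.28]])
print(knn.predict(new_face))  # predicted expression label
```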

How can HMMs be used for handwriting recognition?

The problem is a bit different from traditional handwriting recognition. I have a dataset consisting of thousands of examples of the following: for one drawn character, I have several sequential (x, y) coordinates recorded while the pen was pressed down. So this is a sequential (temporal) problem.
I want to be able to classify handwritten characters based on this data, and would love to implement HMMs for learning purposes. But is this the right approach? How can they be used to do this?
I think HMMs can be used for both problems mentioned by #jens. I'm working on online handwriting too, and HMMs are used in many articles. The simplest approach is this:
Select a feature.
If the selected feature is continuous, convert it to discrete.
Choose the HMM parameters: topology and number of states.
Train character models using the HMM, one model for each class.
Test using a test set.
For each step:
The simplest feature is the angle of the vector connecting consecutive points. You can use more complicated features, such as the angles of the vectors obtained by the Douglas-Peucker algorithm.
The simplest way to discretize is to use Freeman codes, but clustering algorithms like k-means and GMM can be used too.
Common HMM topologies are ergodic, left-right, Bakis and linear. The number of states can be found by trial and error, and the HMM parameters can differ for each model. The number of observation symbols is determined by the discretization; observation sequences can have variable length.
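As a rough sketch of the one-model-per-class scheme above, here is a version using hmmlearn's GaussianHMM on the continuous angle feature (so the discretization step is skipped); the data and helper names are hypothetical:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def angle_features(strokes):
    """strokes: list of (x, y) point arrays; feature = angle between consecutive points."""
    feats = []
    for pts in strokes:
        d = np.diff(pts, axis=0)
        feats.append(np.arctan2(d[:, 1], d[:, 0]).reshape(-1, 1))
    return feats

def train_class_model(sequences, n_states=6):
    """Fit one HMM on all training sequences of a single character class."""
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def classify(models, sequence):
    """Pick the character whose model gives the highest log-likelihood."""
    return max(models, key=lambda c: models[c].score(sequence))

# models = {"a": train_class_model(angle_features(strokes_for_a)), ...}
# print(classify(models, angle_features([test_stroke])[0]))
```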
I recommend Kevin Murphy's HMM toolbox.
Good luck.
This problem is actually a mix of two problems:
recognizing one character from your data
recognizing a word from a (noisy) sequence of characters
An HMM is used for finding the most likely sequence of a finite number of discrete states from noisy measurements. This is exactly problem 2, since noisy measurements of the discrete states a-z, 0-9 follow each other in a sequence.
For problem 1, an HMM is useless because you aren't interested in the underlying sequence. What you want is to augment your handwritten character with information on how you wrote it.
Personally, I would start by implementing regular state-of-the-art handwriting recognition, which is already very good (with convolutional neural networks or deep learning). After that, you can add information about how it was written, for example clockwise/counterclockwise.
