I would like to prepare an audio dataset for a machine learning model.
Each .wav file should be represented as an MFCC image.
While all of the images will have the same number of MFCC coefficients (20), the lengths of the .wav
files vary between 3 and 5 seconds.
Should I manipulate all the .wav files to have the same length?
Should I normalize the MFCC values (between 0 and 1) prior to plotting?
Are there any important steps to do with such data before passing it to a machine learning model?
Further reading links would also be appreciated.
Most classifiers will require a fixed size input, yes. You can do this by cutting or padding the MFCCs after you have calculated them. No need to manipulate the WAV/waveform, per se.
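As a rough illustration of cutting or padding after the MFCCs are computed, here is a minimal NumPy sketch. The helper name pad_or_truncate, the target of 173 frames, and the random stand-in matrices are assumptions made for the example; in practice the MFCC matrices would come from your feature extractor.

```python
import numpy as np

def pad_or_truncate(mfcc, target_frames):
    """Return an (n_mfcc, target_frames) array: cut extra frames or zero-pad on the right."""
    n_mfcc, n_frames = mfcc.shape
    if n_frames >= target_frames:
        return mfcc[:, :target_frames]
    padded = np.zeros((n_mfcc, target_frames), dtype=mfcc.dtype)
    padded[:, :n_frames] = mfcc
    return padded

# Stand-in MFCC matrices (20 coefficients each) for a short and a long clip.
short_clip = np.random.rand(20, 130)
long_clip = np.random.rand(20, 216)

# Hypothetical target length; pick one that suits your data.
fixed = [pad_or_truncate(m, 173) for m in (short_clip, long_clip)]
print([f.shape for f in fixed])
```

Zero-padding on the right is only one option; padding with the minimum value or repeating the clip are also used.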
Another approach is to split your audio files into multiple analysis windows, say 1 second each. A 3 second file can then be handled with 3 predictions (or more if one uses overlap), while a 5 second file would take 5 predictions (or more). To get a clip-wide prediction, one would then merge the predictions over all windows in the clip. The easiest way to train like this requires assuming that a label given for the clip is valid for each individual analysis window.
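The windowing idea could look roughly like this. The window size of 43 frames, the function names, and the hard-coded per-window probabilities are all assumptions for illustration; averaging is just one simple way to merge window predictions, and the sketch assumes each clip is at least one window long.

```python
import numpy as np

def split_into_windows(mfcc, window_frames, hop_frames):
    """Slice an (n_mfcc, n_frames) matrix into fixed-size windows along the time axis."""
    n_frames = mfcc.shape[1]
    starts = range(0, n_frames - window_frames + 1, hop_frames)
    return np.stack([mfcc[:, s:s + window_frames] for s in starts])

def clip_prediction(window_probs):
    """Merge per-window class probabilities into one clip-level vector by averaging."""
    return np.mean(window_probs, axis=0)

mfcc = np.random.rand(20, 130)  # stand-in for a roughly 3 second clip
windows = split_into_windows(mfcc, window_frames=43, hop_frames=43)
print(windows.shape)  # (number of windows, 20, 43)

# Hypothetical classifier outputs, one row of class probabilities per window.
window_probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
merged = clip_prediction(window_probs)
predicted_class = int(np.argmax(merged))
```

Setting hop_frames smaller than window_frames gives overlapping windows and therefore more predictions per clip.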
Related
I am researching some information about audio classification, more specifically: balanced vs. imbalanced audio datasets.
So, assume I have two folders for the two classes in my dataset: car sounds and motorcycle sounds. The car folder has 1000 .wav files and the motorcycle folder has 1000 .wav files too. Does that mean I have a balanced dataset just because the numbers are equal? What if the total size of the .wav files in the car class is 500 MB and in the other one it is 200 MB? And even if both folders have the same size, what if the individual car recordings are longer than the clips in the motorcycle class?
Balanced dataset means the same number from both classes. Often shorter data is padded to make it the same length to fit into classifiers. I don't have a background in audio so I can't say if padding is the norm, but if your network has some way of reconciling different input lengths that does not involve creating more inputs it will be balanced 1000-1000.
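To make the distinction concrete, here is a small sketch that compares the number of clips per class with the number of inputs each class would produce if clips were split into fixed windows. The durations and the 1 second window are made-up values, and class_balance is a hypothetical helper.

```python
def class_balance(durations_by_class, window_seconds=1.0):
    """Count clips and full fixed-length windows per class."""
    summary = {}
    for label, durations in durations_by_class.items():
        summary[label] = {
            "clips": len(durations),
            "windows": sum(int(d // window_seconds) for d in durations),
        }
    return summary

# Made-up durations in seconds: same clip count, different clip lengths.
durations = {"car": [5.0] * 1000, "motorcycle": [3.0] * 1000}
summary = class_balance(durations)
print(summary)
```

With padding to a fixed length, the classifier sees 1000 inputs per class. With windowing, the longer class contributes more inputs, so the effective balance changes.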
I'm training on three CT volumes using the Selective Sampler to ensure that enough samples are taken around the RoI (due to class imbalance), with some random samples. I'm also augmenting the data by scaling, rotation, and flipping, which takes a significant amount of time whenever samples are created.
Setting sample_per_volume to some large value (such as 32768) and batch_size to 128, it seems like NiftyNet will do 256 iterations of 128 samples just taken from the first volume, then switch to samples only taken from the 2nd volume (with a sharp jump in loss) and so on. I want each batch of 128 samples to be a roughly even mixture of samples taken from all of the training volumes.
I've tried setting sample_per_volume to roughly 1/3 of the batch_size so that samples are reselected for each iteration, but this slows down each iteration from around 2s to 50-60s.
Am I misunderstanding something? Or is there a way around this to ensure my batches are made up of samples from a mix of all the training data? Thanks.
The samples populate a queue of length queue_length, given in the .ini file. They are then randomly taken from the queue to populate the batch.
I would make the queue_length parameter bigger. Then it will be filled with data from several different subjects.
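The effect of the queue length can be illustrated with a toy simulation. This is not NiftyNet's code, only a simplified model of a bounded queue that is filled in volume order and sampled at random; the function name and all sizes are assumptions.

```python
import random

def batches_from_queue(sample_stream, queue_length, batch_size, rng):
    """Yield batches drawn at random from a bounded queue that is refilled from the stream."""
    stream = iter(sample_stream)
    queue = []
    for item in stream:
        queue.append(item)
        if len(queue) == queue_length:
            break
    while len(queue) >= batch_size:
        batch = []
        for _ in range(batch_size):
            batch.append(queue.pop(rng.randrange(len(queue))))
            nxt = next(stream, None)  # refill the freed slot if samples remain
            if nxt is not None:
                queue.append(nxt)
        yield batch

# Each sample is represented only by the index of the volume it came from.
stream = [0] * 100 + [1] * 100 + [2] * 100

small_queue_batch = next(batches_from_queue(stream, 30, 30, random.Random(0)))
large_queue_batch = next(batches_from_queue(stream, 300, 30, random.Random(0)))
print(sorted(set(small_queue_batch)), sorted(set(large_queue_batch)))
```

In this toy model, a queue no larger than the number of samples taken per volume yields batches from a single volume, while a queue spanning several volumes yields mixed batches.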
To handle sequences of different lengths we use bucketing and padding. In bucketing we create several buckets, each with its own max_len, in order to reduce the amount of padding. After creating the buckets we train a different model on each bucket.
This is what I have found so far. What I don't understand is how all these different models are trained and how they are used to translate a new sentence.
Both at training and inference time, the algorithm needs to pick the network that is best suited for the current input sentence (or batch). Usually, it simply takes the smallest bucket whose input size is greater than or equal to the sentence length.
For example, suppose there are just two buckets [10, 16] and [20, 32]: the first one takes any input up to length 10 (padded to exactly 10) and outputs the translated sentence up to length 16 (padded to 16). Likewise the second bucket handles the inputs up to length 20. The two networks corresponding to these buckets accept non-intersecting input sets.
Then, for a sentence of length 8, it's better to select the first bucket. Note that if this is a test sentence, the second bucket can handle it as well, but its neural network was trained on longer sentences, from 11 to 20 words, so it's likely not to recognize this sentence well. The network that corresponds to the first bucket was trained on inputs of length 1 to 10, hence it is the better choice.
You may be in trouble if the test sentence has length 25, longer than any available bucket. There's no universal solution here. A reasonable course of action is to trim the input to 20 words and try to translate anyway.
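The bucket selection rule described above could be sketched as follows. The bucket sizes are the ones from the example; the PAD token and the select_bucket helper are hypothetical names, and trimming over-long input is just the fallback suggested above.

```python
PAD = "<pad>"
BUCKETS = [(10, 16), (20, 32)]  # (max input length, max output length)

def select_bucket(tokens, buckets=BUCKETS):
    """Pick the smallest bucket that fits the input; trim to the largest bucket if none does."""
    for bucket_id, (max_in, _) in enumerate(buckets):
        if len(tokens) <= max_in:
            # Pad the input up to the bucket's fixed input length.
            return bucket_id, tokens + [PAD] * (max_in - len(tokens))
    last_id = len(buckets) - 1
    max_in = buckets[last_id][0]
    return last_id, tokens[:max_in]

short_id, short_input = select_bucket(["w"] * 8)
medium_id, medium_input = select_bucket(["w"] * 15)
long_id, long_input = select_bucket(["w"] * 25)
print(short_id, len(short_input), medium_id, len(medium_input), long_id, len(long_input))
```

The returned bucket id would then be used to choose which of the per-bucket networks processes the padded input.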
I have the dataset below for a chemical process in which 5 consecutive input vectors produce 1 output. Each input is sampled every minute while the output is sampled every 5 minutes.
Since I believe the output depends on the 5 previous input vectors, I decided to look into LSTMs for my design. After a lot of research on what my LSTM architecture should be, I concluded that I should mask part of the output sequence with zeros and only keep the last output. The final architecture, based on my dataset, is below:
My question is: What should be my 3D input tensor parameters? E.g. [5, 5, ?]? And also what should be my "Batch size"? Should it be the quantity of my samples?
Since you are going for many-to-one sequence modelling, you don't need to pad your output with zeros. The easiest thing would be to perform classification at the last time step, i.e. after the RNN/LSTM sees the 5th input. The dimension of your 3D input tensor will be [batch_size, sequence_length, input_dimensionality], where sequence_length is 5 in your case (rows 1-5, 7-11, 13-17 etc.), and input_dimensionality is also 5 (i.e. columns A-E).
The batch size depends on the number of examples (and also on how reliable your data is). If you have more than 10,000 examples, then a batch size of 30-50 should be okay (read this explanation about choosing the appropriate batch size).
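For illustration, here is how such a tensor could be built with NumPy. This assumes the input readings are already available as one contiguous array with one row per minute and 5 columns, which may differ from the actual spreadsheet layout; the number of examples and the batch size of 32 are made-up values.

```python
import numpy as np

n_outputs = 200  # hypothetical number of measured outputs

# Stand-in data: one row of 5 input variables per minute, one output per 5 minutes.
readings = np.random.rand(n_outputs * 5, 5)
targets = np.random.rand(n_outputs)

# Group every 5 consecutive rows into one sequence:
# [num_examples, sequence_length, input_dimensionality]
X = readings.reshape(n_outputs, 5, 5)

batch_size = 32  # assumed value
first_batch = X[:batch_size]
print(X.shape, first_batch.shape)
```

Each X[i] then holds the 5 input vectors that precede targets[i], and a framework would consume the data in slices of shape [batch_size, 5, 5].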
Looking at the previous answer, I would say that you do not have to use a many-to-one architecture. It really depends on the problem you have. For example, if your system has a lot of dependencies on the past, i.e. more than 5 samples in your case, it would be better to use a many-to-many architecture with different input and output frequencies. But if you think that the previous 5 samples do not affect your next 5 samples, then a many-to-one architecture would do.
Also, if your problem is regression, you can add a Dense layer on top, since the output of an LSTM cell passes through a tanh and is therefore limited to the range (-1, 1).
I am implementing software for speech recognition using Mel Frequency Cepstral Coefficients. In particular, the system must recognize a single specified word. From the audio file I get the MFCCs in a matrix with 12 rows (the MFCCs) and as many columns as the number of voice frames. I make the average of the rows, so I get a vector with only the 12 rows (the ith-row is the average of all ith-MFCCs of all frames). My question is how to train a classifier to detect the word? I have a training set with only positive samples: the MFCCs that I get from several audio files (several recordings of the same word).
I make the average of the rows, so I get a vector with only the 12 rows (the ith-row is the average of all ith-MFCCs of all frames).
This is a very bad idea because you lose all information about the temporal structure of the word. You need to analyze the whole MFCC sequence, not a summary of it.
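A quick way to see the problem: a random stand-in MFCC matrix and the same matrix with its frames in reverse order have identical per-coefficient averages, although the sequences are clearly different. The 12 x 40 matrix here is synthetic data used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
word = rng.normal(size=(12, 40))  # stand-in MFCC matrix: 12 coefficients x 40 frames
reversed_word = word[:, ::-1]     # the same frames in reverse time order

# Averaging over frames discards the frame order entirely.
same_mean = np.allclose(word.mean(axis=1), reversed_word.mean(axis=1))
same_sequence = np.allclose(word, reversed_word)
print(same_mean, same_sequence)  # True False
```

Any classifier trained on the averaged vectors cannot tell these two sequences apart.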
My question is how to train a classifier to detect the word?
The simplest approach would be a GMM classifier; you can check here:
http://www.mathworks.com/company/newsletters/articles/developing-an-isolated-word-recognition-system-in-matlab.html
For a more sophisticated approach you need a more complex model such as an HMM. You can learn more about HMMs from a textbook like this one:
http://www.amazon.com/Fundamentals-Speech-Recognition-Lawrence-Rabiner/dp/0130151572