Dealing With Variable Length Audio in Machine Learning - machine-learning

I'm working on a model for speech emotion recognition, and I'm currently in the pre-processing phase, creating a utility that can transform audio files into a feature space of fixed dimensions. I'm planning on experimenting with spectrograms, mel-spectrograms and MFCCs (including deltas and delta-deltas) as input features for convolutional neural networks. One glaring issue is that the audio is of variable length.
Now, I know the typical way of dealing with this is to pick some fixed length and either expand all the audio files to fit that length or truncate the longer ones, but I imagine padding is preferable, because truncation loses data that could be valuable in training. So I intend to pad the audio files with zeros to expand them to a fixed length: I found the maximum duration of an audio file in my dataset and took its ceiling as the target length, and now I intend to add trailing zeros (after resampling to a fixed sampling rate) so that all the audio files have the same static length.
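For concreteness, a minimal sketch of that resample-and-pad step (assuming librosa and numpy; the 16 kHz rate, the 8-second cap and the file names are placeholders, not values from my dataset) would look something like this:

```python
# Sketch: resample each clip to a fixed rate, then zero-pad (or truncate) to a fixed length.
import numpy as np
import librosa

TARGET_SR = 16000                     # fixed sampling rate (placeholder)
MAX_SECONDS = 8                       # ceiling of the longest clip in the dataset (placeholder)
TARGET_LEN = TARGET_SR * MAX_SECONDS  # fixed number of samples per clip

def load_fixed_length(path):
    y, _ = librosa.load(path, sr=TARGET_SR)        # load and resample to TARGET_SR
    if len(y) < TARGET_LEN:
        y = np.pad(y, (0, TARGET_LEN - len(y)))    # trailing zeros
    else:
        y = y[:TARGET_LEN]                         # safety truncation, should rarely trigger
    return y

# Fixed-shape features can then be computed, e.g. a mel-spectrogram:
# mel = librosa.feature.melspectrogram(y=load_fixed_length("clip.wav"), sr=TARGET_SR)
```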
My question is: could these extra, all-zero dimensions confuse the model? I know neural networks handle feature extraction automatically, but are there any caveats I should be aware of, or perhaps an alternative approach that may produce better results?
Thanks.

Related

CNN for audio recognition has near perfect test and validation accuracy but not generalizing to new data

I am creating a CNN to recognize a friend's speech. She uses unique vocalizations (not in any language) to communicate. To begin, I recorded 60 samples of each of three of these sounds (180 .wav files total). After training the model, I was getting near-perfect accuracy on both the test and validation data. I then recorded new sounds immediately after this training and got about 50 percent accuracy, which showed some level of learning and generalizing, since random guesses on 3 classes would give about 33% accuracy.
The next day I tried recording new audio again, and the model's predictions were as good as random. My guess as to the problem is that the model is sensitive to very small changes in environment. It showed some learning immediately after training because the environment would have been very similar. However, the following day, there were probably more substantial changes to environment (background noise, distance from microphone, sitting in different part of the room, etc.). Does this seem like a reasonable guess as to the cause of the problem? And if so, how can I make my model less sensitive to environment? Would adding white noise help? Are there ways to add background noise to my samples? Any help would be appreciated.
That is to be expected! 180 samples is not nearly enough to train a CNN on. A CNN contains thousands to millions of parameters, so you could quite possibly have more parameters to tune than bytes of data in your dataset!
Furthermore, your claim of getting near-perfect accuracy on the test set seems suspicious. I'd wager that you have accidentally used test data to train the model.
You could try "growing" your dataset by adding randomized noise to the sound files. I don't think that would help much though. The network would become resilient to the type of white noise you added, but probably not to the type of noise found in actual recordings. For example, in speech recognition noises people make when speaking like breathing, saying "uh" and "eh", or clearing their throats, can confuse the recognizer. It is very hard to synthetically add such noises.
Also, while two sounds may sound similar to human ears, their waveforms might be completely different. A song played in different keys sounds similar or even identical to human ears but has a totally different waveform. You get the same effect if you listen to someone talking indoors vs. outdoors vs. in a noisy bar. Even whether someone is standing or sitting can completely change the sound of their voice.
Bottom line: you need waaaay more data. I also recommend experimenting with RNNs and bidirectional RNNs. They are a better fit for temporal data like sound samples than CNNs. Generally they also require fewer parameters, so training is faster.
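Just to make the suggestion concrete, here is a hedged sketch of a small bidirectional RNN over per-frame features (Keras; the layer sizes, the 13-MFCC input and the 3-class output are assumptions for illustration, not a tuned recipe):

```python
# Sketch: a tiny bidirectional LSTM classifier over variable-length MFCC sequences.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 13)),                  # (frames, 13 MFCCs), variable length
    tf.keras.layers.Masking(mask_value=0.0),                  # skip zero-padded frames
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),  # small recurrent core, few weights
    tf.keras.layers.Dense(3, activation="softmax"),           # 3 vocalization classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```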

OpenCV poor performance of Haar classifier trained by me

I would like to use the Haar classifier to detect the presence of vehicles in a scene (trying with only cars so far). Since I have not found many trained XML files online, I decided to generate my own.
I found some image sets of vehicles that have been used for similar purposes (training computer vision algorithms) and used these to create my own XML files. It has been almost a week and some of them have finished, so I tried using them, but the results were terrible. The classifiers I found online worked decently: at least they appear to be trying to detect vehicles and run fast enough for a real-time application (maybe 5-10 FPS or so).
Mine, by contrast, can take several minutes to analyze a frame with detectMultiScale() using the same parameters. If I pass different parameters (e.g. increase the minimum size, decrease the maximum size, increase the scale factor) it runs faster (maybe 1 FPS) but detects nothing of note: it never detects a vehicle and randomly flags some spots of asphalt as one.
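For reference, the kind of call I'm tuning looks roughly like this (the cascade file name, the frame and the thresholds are placeholders):

```python
# Sketch of the detection loop whose parameters are being discussed.
import cv2

cascade = cv2.CascadeClassifier("cars.xml")                  # my trained (or downloaded) cascade
gray = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2GRAY)

boxes = cascade.detectMultiScale(
    gray,
    scaleFactor=1.2,      # larger step between pyramid scales -> faster, less thorough
    minNeighbors=4,       # overlapping detections required to keep a box
    minSize=(48, 48),
    maxSize=(300, 300),
)
for (x, y, w, h) in boxes:
    cv2.rectangle(gray, (x, y), (x + w, y + h), 255, 2)
```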
Where did I go wrong in generating my files? I have limited time to complete this task and these classifiers can take a whole week to train so I have very few attempts remaining. For reference, my methodology is (following this tutorial):
-Take all positive and negative images; if no negative images supplied, take negative images from another data set, at least as many negatives as positives
-Generate as many samples as the number of positives
-Use same parameters as suggested, except image size (set to the size of images in a given data set), and nstages (set to 10 because 20 takes far too long)
-For the npos parameter, I use 1/10th of the number of samples; using the full number of samples resulted in "assertion failed" after a few hours. Apparently the number of samples cannot be the same as npos according to this, so I gave myself a safety margin.
TL;DR Haar classifier I trained myself performs much worse than one found online (in terms of time and accuracy), need advice on how to improve it and not waste another week training it.
There are two problems here. One, the accuracy of the classifier is low. The other, the classifier runs too slowly.
There seems to be no problem with the reference that you used. The steps seem accurate, and I have personally tried them in that order and managed to get good results.
As @Micka mentions, nPos around 90% of the original sample count is good enough. minHitRate is a parameter that you can change. Did you observe the numbers that are displayed while training? How was the accuracy improving, and did your classifier finish training (or are you using the parameters from before training ended)?
For the low speed in detection, the most likely reason is that your training data did not have simple features to learn quickly. Did you try detection on the data that you used for training? How were the results in that case? Compiler settings or high image resolution can be a problem too, but if you tried the same inputs and settings with other classifiers, this is unlikely.
If you would like to try a different approach (and have a GPU), YOLOv2 should be much faster and more accurate for this task.

Adjusting hyperparameters of neural network used for (offline) handwriting recognition

I'm currently using a Java library to do some naive experimentation with offline handwriting recognition. I give my program an image of a pre-written English sentence, segment it into individual characters, and feed those to a very naively constructed neural network.
I'm new to the idea of neural nets, so my question is where to start with regard to optimising this network's hyperparameters. Currently it's a simple feed-forward network which I train using resilient propagation, so the only parameters I can optimise are the number of hidden layers and the number of neurons in each hidden layer. I could of course do an exhaustive search through a large but finite number of combinations, but this would be very time-consuming, and I'm sure someone out there who is more informed in this art can point me in the right direction.
I found a post somewhere on here stating that a good starting point for any network is a single hidden layer with the number of neurons equal to the mean of the input and output layer sizes, so that's what I'm doing at the moment.
I'm getting about 40-60% accuracy (depending on the character) with this model.
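In case it helps to make the search concrete, here is a hedged sketch of that starting heuristic plus a small grid search around it, using scikit-learn as a stand-in for the Java library I'm actually using (the input/output sizes and the parameter grid are illustrative assumptions):

```python
# Sketch: one hidden layer sized at the mean of input and output sizes, then a small grid search.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

n_inputs, n_outputs = 256, 26            # e.g. 16x16 character images, 26 letters (placeholders)
start = (n_inputs + n_outputs) // 2      # the "mean of input and output layers" heuristic

search = GridSearchCV(
    MLPClassifier(max_iter=500),
    param_grid={"hidden_layer_sizes": [(start // 2,), (start,), (start, start // 2)]},
    cv=3,
)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```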

Wavetable sampling variation

I am interested in making a simple digital synthesizer to be implemented on an 8-bit MCU. I would like to make wavetables for accurate representations of the sound. Standard wavetables seem either to have a table for each of several frequencies, or to have a single sample that is read in fractional increments, with the missing data interpolated by the program to create different frequencies.
Would it be possible to create a single table for a given waveform, likely at a low frequency, and change the rate at which the program reads the table to generate different frequencies, which would then be processed? My MCU (a free one, no budget) is rather slow, so I don't have the space for lots of wavetables nor for large amounts of processing, and I am trying to skimp where I can. Has anyone seen this implementation?
You should consider using a single table with a phase accumulator and linear interpolation. See this question on DSP.SE for many useful suggestions.
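What that looks like in practice, as a rough sketch (Python here only for readability; on the MCU this would be integer/fixed-point arithmetic, and the table size and sample rate below are placeholder choices):

```python
# Sketch: one wavetable, a phase accumulator, and linear interpolation between entries.
import numpy as np

SR = 31250                                          # output sample rate (placeholder)
TABLE = np.sin(2 * np.pi * np.arange(256) / 256)    # one cycle of the waveform, 256 entries

def render(freq_hz, n_samples):
    phase = 0.0
    step = freq_hz * len(TABLE) / SR                # table entries to advance per output sample
    out = np.empty(n_samples)
    for i in range(n_samples):
        idx = int(phase)
        frac = phase - idx
        a, b = TABLE[idx], TABLE[(idx + 1) % len(TABLE)]
        out[i] = a + frac * (b - a)                 # linear interpolation between neighbours
        phase = (phase + step) % len(TABLE)
    return out

# tone = render(440.0, SR)                          # one second of A4 from a single table
```

The same single table serves every frequency; only the step size changes.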

Detecting the fundamental frequency [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
There's a tech festival at IIT Bombay, India, with an event called "Artbots" in which we're supposed to design artbots with artistic abilities. I had an idea for a musical robot which takes a song as input, detects the notes in the song and plays it back on a piano. I need some method that will help me compute the pitches of the notes of the song. Any idea/suggestion on how to go about it?
This is exactly what I'm doing as my final-year project :) except that my project is about tracking the pitch of the human singing voice (and I don't have a robot to play the tune).
The quickest way I can think of is to utilize the BASS library. It contains a ready-to-use function that can give you FFT data from the default recording device. Take a look at the "livespec" code example that comes with BASS.
By the way, raw FFT data will not be enough to determine the fundamental frequency. You need an algorithm such as the Harmonic Product Spectrum to get the F0.
Another consideration is the audio source. If you are going to take an FFT and apply the Harmonic Product Spectrum to it, you will need to make sure the input has only one audio source. If it contains multiple sources, as in most modern songs, there will be too many frequencies to consider.
Harmonic Product Spectrum Theory
If the input signal is a musical note, then its spectrum should consist of a series of peaks, corresponding to the fundamental frequency with harmonic components at integer multiples of the fundamental frequency. Hence when we compress the spectrum a number of times (downsampling) and compare it with the original spectrum, we can see that the strongest harmonic peaks line up. The first peak in the original spectrum coincides with the second peak in the spectrum compressed by a factor of two, which coincides with the third peak in the spectrum compressed by a factor of three. Hence, when the various spectra are multiplied together, the result will form a clear peak at the fundamental frequency.
Method
First, we divide the input signal into segments by applying a Hanning window, where the window size and hop size are given as an input. For each window, we utilize the Short-Time Fourier Transform to convert the input signal from the time domain to the frequency domain. Once the input is in the frequency domain, we apply the Harmonic Product Spectrum technique to each window.
The HPS involves two steps: downsampling and multiplication. To downsample, we compress the spectrum twice in each window by resampling: the first time, we compress the original spectrum by two and the second time, by three. Once this is completed, we multiply the three spectra together and find the frequency that corresponds to the peak (maximum value). This particular frequency represents the fundamental frequency of that window.
Limitations of the HPS method
Some nice features of this method include: it is computationally inexpensive, reasonably resistant to additive and multiplicative noise, and adjustable to different kinds of inputs. For instance, we could change the number of compressed spectra to use, and we could replace the spectral multiplication with a spectral addition. However, since human pitch perception is basically logarithmic, low pitches may be tracked less accurately than high pitches.
Another severe shortfall of the HPS method is that its resolution is only as good as the length of the FFT used to calculate the spectrum. If we perform a short and fast FFT, we are limited in the number of discrete frequencies we can consider. In order to gain a higher resolution in our output (and therefore see less graininess in our pitch output), we need to take a longer FFT, which requires more time.
from: http://cnx.org/content/m11714/latest/
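For illustration, a rough sketch of the HPS step described above (window, FFT, compress by 2 and 3, multiply, pick the peak) might look like this; the 4096-sample window and 44.1 kHz rate are assumptions, not values from the article:

```python
# Sketch: Harmonic Product Spectrum over a single analysis window.
import numpy as np

def hps_pitch(frame, sr):
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    n = len(spectrum)
    product = spectrum.copy()
    for factor in (2, 3):                         # compressed copies of the spectrum
        compressed = spectrum[::factor]
        product[:len(compressed)] *= compressed
    peak = 1 + np.argmax(product[1:n // 3])       # skip DC, stay where all three copies overlap
    return peak * sr / len(frame)                 # FFT bin index -> frequency in Hz

# f0 = hps_pitch(signal[:4096], sr=44100)
```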
Just a comment: the fundamental harmonic may well be missing from a (harmonic) sound, and this doesn't change the perceived pitch. As a limit case, if you take a square wave (say, a C# note) and completely suppress the first harmonic, the perceived note is still C#, in the same octave. In a way, our brain is able to compensate for the absence of some harmonics, even the first, when it guesses a note.
Hence, to detect a pitch with frequency-domain techniques you should take into account all the harmonics (local maxima in the magnitude of the Fourier transform), and extract some sort of "greatest common divisor" of their frequencies. Pitch detection is not a trivial problem at all...
DAFX has about 30 pages dedicated to pitch detection, with examples and Matlab code.
Autocorrelation - http://en.wikipedia.org/wiki/Autocorrelation
Zero-crossing - http://en.wikipedia.org/wiki/Zero_crossing (this method is used in cheap guitar tuners)
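A hedged sketch of the autocorrelation approach (the 50-1000 Hz search range, frame length and sample rate are assumptions and would need tuning for the material at hand):

```python
# Sketch: estimate F0 as the strongest autocorrelation lag within a plausible pitch range.
import numpy as np

def autocorr_pitch(frame, sr, fmin=50.0, fmax=1000.0):
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # non-negative lags only
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])                  # lag of the strongest repeat
    return sr / lag

# f0 = autocorr_pitch(signal[:2048], sr=44100)
```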
Try YAAPT pitch tracking, which detects fundamental frequency in both time and frequency domains. You can download Matlab source code from the link and look for peaks in the FFT output using the spectral process part.
Python package http://bjbschmitt.github.io/AMFM_decompy/pYAAPT.html#
Did you try Wikipedia's article on pitch detection? It contains a few references that can be interesting to you.
In addition, here's a list of DSP applications and libraries, where you can poke around. The list only mentions Linux software packages, but many of them are cross-platform, and there's a lot of source code you can look at.
Just FYI, detecting the pitch of the notes in a monophonic recording is within reach of most DSP-savvy people. Detecting the pitches of all notes, including chords and stuff, is a lot harder.
Just a thought - but do you need to process a digital audio stream as input?
If not, consider using a symbolic representation of music (such as MIDI). The pitches of the notes will then be stated explicitly, and you can synthesize sounds (and movements) corresponding to the pitch, rhythm and many other musical parameters extremely easily.
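For instance, with a symbolic input the pitch extraction reduces to reading note events; a minimal sketch (assuming the mido Python package, with "song.mid" as a placeholder path):

```python
# Sketch: read note pitches directly from a MIDI file instead of estimating them from audio.
import mido

mid = mido.MidiFile("song.mid")
for msg in mid:                                        # messages in playback order
    if msg.type == "note_on" and msg.velocity > 0:     # ignore note-offs encoded as velocity 0
        freq = 440.0 * 2 ** ((msg.note - 69) / 12)     # MIDI note number -> frequency in Hz
        print(msg.note, round(freq, 1))
```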
If you need to analyse a digital audio stream (mp3, wav, live input, etc) bear in mind that while pitch detection of simple monophonic sounds is quite advanced, polyphonic pitch detection is an unsolved problem. In this case, you may find my answer to this question helpful.
For extracting the fundamental frequency of the melody from polyphonic music you could try the MELODIA plug-in: http://mtg.upf.edu/technologies/melodia
Extracting the F0's of all the instruments in a song (multi-F0 tracking) or transcribing them into notes is an even harder task. Both melody extraction and music transcription are still open research problems, so regardless of the algorithm/tool you use don't expect to obtain perfect results for either.
If you're trying to detect the notes of a polyphonic recording (multiple notes at the same time) good luck. That's a very tricky problem. I don't know of any way to listen to, say, a recording of a string quartet and have an algorithm separate the four voices. (Wavelets maybe?) If it's just one note at a time, there are several pitch tracking algorithms out there, many of them mentioned in other comments.
The algorithm you want to use will depend on the type of music you are listening to. If you want it to pick up people singing, there are a lot of good algorithms out there designed specifically for voice. (That's where most of the research is.) If you are trying to pick up specific instruments you'll have to be a bit more creative. Voice algorithms can be simple because the range of the human singing voice is generally limited to about 100-2000 Hz. (The speaking range is much narrower.) The fundamental frequencies on a piano, however, go from about 27 Hz to 4200 Hz, so you're dealing with a wider range usually ignored by voice pitch detection algorithms.
The waveform of most instruments is going to be fairly complex, with lots of harmonics, so a simple approach like counting zeros or just taking the autocorrelation won't work. If you knew roughly what frequency range you were looking in, you could low-pass filter and then zero count. I'd think you'd be better off, though, with a more complex algorithm such as the Harmonic Product Spectrum mentioned by another user, or YAAPT ("Yet Another Algorithm for Pitch Tracking"), or something similar.
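For what it's worth, the "low-pass then count zero crossings" idea looks roughly like this (the cutoff, frame length and sample rate are assumptions, and it only makes sense for roughly monophonic input):

```python
# Sketch: low-pass filter a frame, then estimate pitch from the zero-crossing count.
import numpy as np
from scipy.signal import butter, filtfilt

def zero_crossing_pitch(frame, sr, cutoff_hz=500.0):
    b, a = butter(4, cutoff_hz / (sr / 2), btype="low")         # 4th-order Butterworth low-pass
    smooth = filtfilt(b, a, frame)
    crossings = np.count_nonzero(np.diff(np.signbit(smooth)))   # number of sign changes
    return (crossings / 2) * sr / len(frame)                    # two crossings per period

# f0 = zero_crossing_pitch(signal[:4096], sr=44100)
```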
One last problem: some instruments, the piano in particular, will have the problem of missing fundamentals and inharmonicity. Missing fundamentals can be dealt with by the pitch tracking algorithms...in fact they have to be, since fundamentals are often cut out in electronic transmission...though you'll probably still get some octave errors. Inharmonicity, however, will give you problems if somebody plays a note in the bottom octaves of the piano. Normal pitch tracking algorithms aren't designed to deal with inharmonicity because the human voice is not significantly inharmonic.
You basically need a spectrum analyzer. You might be able to do an FFT on a recording of an analog input, but much depends on the resolution of the recording.
What immediately comes to my mind:
-filter out very low frequencies (drums, bass line)
-filter out high frequencies (harmonics)
-FFT
-look for peaks in the FFT output for the melody
I am not sure if that works for very polyphonic sounds; maybe googling for "FFT, analysis, melody etc." will return more info on possible problems.
regards
