Separation of instruments' audio from a single-channel non-MIDI music file - machine-learning

My friend Prasad Raghavendra and I were trying to experiment with machine learning on audio.
We were doing it to learn and to explore interesting possibilities at any upcoming get-togethers.
I decided to see how deep learning, or any other machine learning method, could be fed audio clips rated by humans (evaluation).
To our dismay, we found that the problem had to be split up to cope with the dimensionality of the input.
So, we decided to discard vocals and assess by accompaniments with an assumption that vocals and instruments are always correlated.
We looked for an mp3/wav to MIDI converter. Unfortunately, the ones on SourceForge and GitHub only handle single instruments, and the other options are paid (Ableton Live, Fruity Loops etc.). We decided to take this as a sub-problem.
We thought of using FFT, band-pass filters and a moving window for this.
But we do not understand how to go about splitting the instruments if chords are played and there are 5-6 instruments in the file.
1. What are the algorithms that I can look for?
2. My friend can play the keyboard, so I will be able to get MIDI data. But are there any datasets meant for this?
3. How many instruments can these algorithms detect?
4. How do we split the audio? We do not have multiple recordings or the mixing matrix.
We were also thinking about finding out the patterns of accompaniments and using those accompaniments in real-time while singing along. I guess we will be able to think about it once we get answers to 1,2,3 and 4. (We are thinking about both Chord progressions and Markovian dynamics)
Thanks for all help!
P.S.: We also tried FFT and we are able to see some harmonics. Is it due to Sinc() in the FFT when a rectangular wave is the input in the time domain? Can that be used to determine timbre?
We were able to formulate the problem roughly, but we are still finding it difficult to formulate it precisely. If we use the frequency domain at a given frequency, the instruments are indistinguishable: a trombone playing at 440 Hz and a guitar playing at 440 Hz have the same frequency and differ only in timbre. We still do not know how to determine timbre. We decided to work in the time domain by considering notes. If a note falls outside the current octave, we would use a separate dimension for that: +1 for the next octave, 0 for the current octave and -1 for the previous octave.
If notes are represented by letters such as 'A', 'B', 'C' etc, then the problem reduces to mixing matrices.
$O = MI$ during training.
$M$ is the mixing matrix, which has to be found from the known output $O$ and the input $I$ taken from the MIDI file.
During prediction, though, $M$ must be replaced by a probability matrix $P$, which would be generated from the previous $M$ matrices.
The problem reduces to $I_{\text{predicted}} = P^{-1} O$. The error to minimize would then be the least mean squared error (LMSE) of $I$. We can use a DNN to adjust $P$ using back-propagation.
But in this approach we assume that the notes 'A', 'B', 'C' etc. are known. How do we detect them instantaneously, or within a short duration like 0.1 seconds? Template matching may not work because of harmonics. Any suggestions would be much appreciated.

Splitting out the different parts is a machine learning problem all of its own. Unfortunately, you can't look at this problem in audio land only. You must consider the music.
You need to train something to understand musical patterns and progressions in the context of the type of music you give it. It needs to understand what the different instruments sound like, both mixed and not mixed. It needs to understand how these instruments are often played together, if it's going to have any chance at all at separating what's going on.
This is a very, very difficult problem.

This is a very hard problem, mainly because converting audio to pitch isn't simple: aliasing folds harmonics above the Nyquist frequency (22 kHz+) back down, and saturators/distortion and other analogue equipment introduce further harmonics.
The fundamental isn't always the loudest harmonic, which is why your plan will not work.
The hardest thing to measure would be a distorted guitar. The harmonics some pedals/plugins can produce are crazy.

Related

CNN for audio recognition has near perfect test and validation accuracy but not generalizing to new data

I am creating a CNN to recognize a friend's speech. She uses unique vocalizations (not in any language) to communicate. To begin, I recorded 60 samples of three of these sounds (180 .wav samples total). After training the model, I was getting near perfect accuracy from both the test and validation data. I then recorded new sounds immediately after this training, and was getting about 50 percent accuracy, which showed some level of learning and generalizing, since random guesses on 3 classes should have been getting about 33% accuracy.
The next day I tried recording new audio again, and the model's predictions were as good as random. My guess as to the problem is that the model is sensitive to very small changes in environment. It showed some learning immediately after training because the environment would have been very similar. However, the following day, there were probably more substantial changes to environment (background noise, distance from microphone, sitting in different part of the room, etc.). Does this seem like a reasonable guess as to the cause of the problem? And if so, how can I make my model less sensitive to environment? Would adding white noise help? Are there ways to add background noise to my samples? Any help would be appreciated.
That is to be expected! 180 samples is not nearly enough to train the CNN on. A CNN contains thousands up to millions of parameters so you could quite possibly have many more parameters to tune than bytes of data in your dataset!
Furthermore, your claim of getting perfect accuracy on the test set seems suspicious. I'd wager that you have accidentally used test data to train the model.
You could try "growing" your dataset by adding randomized noise to the sound files. I don't think that would help much though. The network would become resilient to the type of white noise you added, but probably not to the type of noise found in actual recordings. For example, in speech recognition noises people make when speaking like breathing, saying "uh" and "eh", or clearing their throats, can confuse the recognizer. It is very hard to synthetically add such noises.
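As a rough illustration of the "growing" idea above, here is a minimal sketch that mixes Gaussian white noise into a clip at a chosen signal-to-noise ratio. The function names, the SNR levels and the synthetic 440 Hz clip are assumptions made for the example; real recordings would be loaded from the .wav files instead.

```python
import math
import random


def add_white_noise(samples, snr_db, rng=None):
    """Return a copy of `samples` with Gaussian white noise added at `snr_db` dB."""
    rng = rng or random.Random()
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def augment(samples, snr_levels_db=(30, 20, 10), seed=0):
    """Make one noisy copy of a clip per SNR level (hypothetical helper)."""
    rng = random.Random(seed)
    return [add_white_noise(samples, snr, rng) for snr in snr_levels_db]


# A 440 Hz tone sampled at 8 kHz stands in for a recorded clip (assumption).
clip = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
noisy_copies = augment(clip)
```

This only covers stationary white noise, so the caveat above still applies: it will not reproduce room acoustics, microphone distance or breathing sounds.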
Also, while two sounds may sound similar to human ears, their waveforms might be completely different. A song played in different keys sounds similar or even identical to human ears but has a totally different waveform. You have the same effect if you listen to someone talking indoors vs. outdoors vs. in a noisy bar. Even whether someone is standing or sitting can completely change the sound of their voice.
Bottom line: you need waaaay more data. I also recommend experimenting with RNNs and bidirectional RNNs. They are a better fit for temporal data like sound samples than CNNs. Generally they also require fewer parameters, so training is faster.

Waveform Comparison

I am working on a personal research project.
My objective is to be able to recognize a sound and identify whether it belongs to the IPA or not by comparing its waveform to a waveform in my database. I have some skill with Mathematica, SciPy, and PyBrain.
For the first phase, I'm only using the English (US) phonetic alphabet.
I have a simple test bank of English phonetic alphabet sound files I found online. The trick here is:
I want to separate a sound file into wave forms that correspond to different syllables- this will take a learning algorithm. So, 'I like apples' would be cut up into the syllable waveforms that would make up the sentence.
Each waveform is then compared against the English PA's wave forms. I'm not certain how to do this part. I was thinking of using Praat to detect the waveforms, capture the image of the wave form and compare it to the one stored in the database with image analysis (which is kind of fun to do).
The problem here is that I don't know how to make Praat generate a waveform file automatically and then cut it up between syllables into waveform chunks. Logically, I would just prepare test cases for a learning algorithm and teach the computer to do it.
Instead of needing a waveform image, could I do this with a fast Fourier transform and compare two FFTs: within an x% margin of error, consider it syllable y?
Frankly, I don't really know Praat, but I find your project super cool and interesting. I have experience with fault detection in car motors using their sound, which might be connected to your project. I used neural networks and SVMs for the classification because multiple research papers supported that approach, so I didn't have any doubts about the way I chose. So my advice is that you should research and read some papers about it. It really helps when you have questions like these (Will it work? Can I use it instead? Am I using the optimal solution? etc.). And good luck, that's an awesome project :)
You could try Praat scripting.
Using just FFT will give you rather terrible results: a very long feature vector that will be really difficult to segment and to train on. That's thousands of points for a single syllable. Some deep neural networks are able to cope with it, but that's assuming you design them properly and provide a huge training set. The advantage of using neural networks is that they can build features for you from the "raw data" (and I would consider FFT also "raw"). However, when you work with sound, that's not so badly needed: you can engineer features manually. In the case of sound, science knows very well what sort of "features" sounds have.
You can calculate these features with libraries like Yaafe. I recommend checking it even if you are not doing it in C++ or Python - the link I provided also delivers formulas for calculating them. I used some of them in my kiwi classifier.
Another good approach comes from scikit-talkbox, which provides exactly the tooling you might need.

People Detection and Tracking

I want to do pedestrian detection and tracking.
Input: Video Stream from CCTV camera.
Output:
Number of people going from left to right
Number of people going from right to left
Number of people in the middle
What I have done so far:
For pedestrian detection I am using HOG and SVM. The detection is decent, but with a high false positive rate. It is also very slow, as I am running on the Android platform.
Question:
After detection, how do I calculate the values listed above? Can anyone tell me which tracking algorithm I should use, and suggest a good algorithm for pedestrian detection?
Or should I use a tracking algorithm at all? Is there a way to do it without one?
Any references to code/blogs/technical papers are appreciated.
Platform: C++ & OpenCV / android.
--Thanks
This is pretty close to a research problem.
You may want to have a look at this website, which gathers a lot of references.
In particular, the work by the group from Oxford presented there is pretty close to what you are doing, since they are using HOG for detection. (That work has been extremely illuminating for me.)
EPFL and Julich have also done work in the field.
You may also want to take a look at this review, which describes several detection/tracking techniques, often involving variants of the HOG algorithm.
Along with @Acorbe's response, I suggest the publications section of this (archived) website.
A recent work at the end of last year also released a code base here:
https://bitbucket.org/rodrigob/doppia
There have also been earlier pedestrian detector works that have released code as well:
https://sites.google.com/site/wujx2001/home/c4
http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians
The most accurate approach is to use a tracking algorithm rather than purely statistical counting of the detections that appear on the left, on the right and in the middle.
You can use extended statistical models that estimate how many inputs produce each of the outputs, and validate back from the output detections to the inputs.
In my experience, tracking gives better results than the approach above, but it is also a bit more complicated. This is multi-target tracking, where the critical step is matching each detection with the tracked model that should be updated from it. If a detection is matched with the wrong model, that is where the problems start.
Here on YouTube is a multi-target tracker I developed, using a simple LBP people detector plus multiple models and Kalman filters for tracking. Both capabilities are available in OpenCV. When something is detected, you need to create a new Kalman filter for each object and update it whenever you match the same detection again. Predict when the detection is missing from a frame, and remove the Kalman filter once it is no longer necessary to track the object.
1 Detect
2 Match detections with the Kalman filters, using the Hungarian algorithm and the L2 norm (for example).
3 Decide whether a Kalman filter should be created, removed or updated, or whether the object was not detected and should be predicted. This step is a lot of work.
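The matching in step 2 can be sketched roughly as follows. This is not the Hungarian algorithm itself: it is a brute-force search over assignments that gives the same minimum-cost result for a handful of objects, used here only to keep the example dependency-free. The function name, the `max_dist` gate and the sample coordinates are assumptions.

```python
import math
from itertools import permutations


def match_detections(predicted, detections, max_dist=50.0):
    """Match predicted track positions to detections by minimum total L2 distance.

    Brute force over all assignments: a stand-in for the Hungarian algorithm
    that is only practical for a handful of objects per frame.
    Returns (matches, unmatched_tracks, unmatched_detections).
    """
    # None slots let a track stay unmatched, at a cost of max_dist each.
    slots = list(range(len(detections))) + [None] * len(predicted)
    best_cost, best = math.inf, ()
    for assignment in permutations(slots, len(predicted)):
        cost = 0.0
        for position, det in zip(predicted, assignment):
            dist = max_dist if det is None else math.dist(position, detections[det])
            # Pairs farther apart than max_dist are never allowed to match.
            cost += dist if dist <= max_dist else math.inf
        if cost < best_cost:
            best_cost, best = cost, assignment
    matches = [(t, d) for t, d in enumerate(best) if d is not None]
    matched = {d for _, d in matches}
    unmatched_tracks = [t for t, d in enumerate(best) if d is None]
    unmatched_detections = [d for d in range(len(detections)) if d not in matched]
    return matches, unmatched_tracks, unmatched_detections


# Two tracked people (predicted positions) and three detections in the new frame.
predicted = [(100.0, 200.0), (300.0, 210.0)]
detections = [(305.0, 212.0), (98.0, 205.0), (600.0, 50.0)]
matches, lost, new = match_detections(predicted, detections)
```

The three return values map onto step 3: matched pairs update their Kalman filter, unmatched tracks are only predicted, and unmatched detections are candidates for a new filter.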
The purely statistical approach is less accurate; the second one is for experienced people, with at least one month of coding and 3 months of tuning. If you need to be faster and your resources are quite limited, smart statistics on pure detections can get you results much faster with slightly less accuracy. People judge image and video tracking harshly, yet multi-target tracking is capable of beating a human. Try to count and register each person in a video and count the exit points yourself: beyond a certain number of people you will not be able to do it. It really depends on what you want, the application, the customer you have, and the results you show to customers. If the output is four numbers (incoming, left, right, middle) and your error is 20 percent, that is still much better than what one bored, poorly paid guard would achieve counting all day long.
https://www.youtube.com/watch?v=d-RCKfVjFI4
On my blog you can find some datasets for people detection and car detection, as well as scripts for learning, ideas, tutorials and tracking examples.
Opencv blog tutorials code and ideas
You can use KLT for this purpose, as it will tell you the flow of a person traveling from left to right. You can then compute the direction from the line length, which in the given example is drawn using cv2.line; you can use the input parameters of this function to compute your case, with a little math involved. If there is a flow of pixels from left to right, this is case 1; if right to left, case 3; and for no flow, case 2. Or you can use this basic tutorial to track object movement. LINK
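Whichever tracker produces the trajectories, turning them into the three counts from the question could look roughly like this. It is only a sketch: the `tracks` dictionary (track id mapped to x positions over time), the label names and the `min_shift` threshold in pixels are all assumptions, and treating "no net movement" as "in the middle" is a simplification.

```python
from collections import Counter


def classify_track(xs, min_shift=20.0):
    """Label one track by its net horizontal displacement in pixels."""
    shift = xs[-1] - xs[0]
    if shift > min_shift:
        return "left_to_right"
    if shift < -min_shift:
        return "right_to_left"
    return "middle"


def count_directions(tracks, min_shift=20.0):
    """Count tracks per direction; `tracks` maps a track id to its x positions."""
    return Counter(classify_track(xs, min_shift) for xs in tracks.values())


# Made-up x coordinates for three tracked people.
tracks = {1: [10, 60, 140], 2: [300, 220, 90], 3: [150, 155, 149]}
counts = count_directions(tracks)
```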

How to compare word pronunciation?

This is for a personal project of mine, and I have no idea from where to start as it falls way beyond my comfort zone.
I know that there are a few language learning programs out there that allow the user to record his or her voice and compare the pronunciation with that of a native speaker of said language.
My question is, how to achieve this?
I mean, how does one compare the pronunciation of the user with that of the native speaker?
If you're looking for something relatively simple, you could simply compute the MFCC (http://en.wikipedia.org/wiki/Mel-frequency_cepstrum) of the recording, and then look at something simple like the correlation between the recording and the average coefficients of that word being pronounced by a native speaker. The MFCC will transform the audio into a space where euclidean distance corresponds more closely with perceptual difference.
Of course, there are several possible problems:
Aligning the two recordings so the coefficients match up. To fix this, you could look at the maximum cross-correlation of the coefficients, rather than the simple correlation, so you will get an automatic "best alignment" for free. Also, you might have to clip off ends of the recording, so only the actual pronunciation of the word remains in the recording.
The MFCC maps to perceptual space, but might not correspond so well to accent inaccuracies. You could perhaps try to fix this by instead of comparing it to just the "ideal" pronunciation, comparing it to the average for several different types of mispronunciation, and looking at which model it is closest to.
Even good accented words will be on average some "distance" from the ideal. You'll have to take that into account, and compare the input's distance to the "relative" good distance.
Correlation might not be the best way to compare the relative similarity of two sounds. Experiment with lots of different metrics... try different L^p norms: (http://en.wikipedia.org/wiki/Lp_space), or try weighing the different MFCCs differently (if I recall, even after MFCC have been taken, although they are all supposed to have the same perceptual "weight", the ones in the middle are still more important for how we perceive a sound than the high or low ones.)
There might be certain parts of the sound where the pronunciation matters much more for the quality of the accent. Perhaps transient detection to find those positions and mark them as more important would be good. If you had a whole bunch of "good pronunciation" and "bad pronunciation" examples, you could probably automatically extract those locations.
Again, in the end the only way you're going to know which combination of these options works best is by testing.
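For the alignment problem mentioned above, a minimal sketch of "maximum cross-correlation" is shown here on a single per-frame feature track. Real MFCCs have several coefficients per frame, so using one value per frame is a simplification, and the function names, `max_lag` and `min_overlap` are assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0


def best_alignment(ref, test, max_lag, min_overlap=8):
    """Return (lag, correlation) for the shift that maximises the correlation.

    A positive lag means `test` starts `lag` frames later than `ref`.
    """
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        a, b = (ref, test[lag:]) if lag >= 0 else (ref[-lag:], test)
        n = min(len(a), len(b))
        if n < min_overlap:
            continue
        r = pearson(a[:n], b[:n])
        if r > best[1]:
            best = (lag, r)
    return best


# The learner's attempt is the reference contour preceded by 5 frames of silence.
reference = [math.sin(0.3 * n) * math.exp(-0.05 * n) for n in range(40)]
attempt = [0.0] * 5 + reference
lag, score = best_alignment(reference, attempt, max_lag=10)
```

The score at the best lag can then be compared against the typical score of good pronunciations, as suggested above, rather than against a perfect 1.0.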
I've read about adapting Gaussian mixture models for the phonetic space of a general speaker to an individual. This might be useful for training for a non-canonical accent for private use.
If you just compare the speaker to a general pronunciation model, then the match might not be very good. So the idea is to adjust the models to fit the speaker better during individual training.
Speaker Verification using Adapted Gaussian Mixture Models
EDIT: looking over your question again, I think I answered a different question. But the technique uses similar models:
Model the various languages (Do you have lots of data for different languages? Collecting the data might be the hard part.) GMMs work well for this.
Compare the data point from the speaker to the various language models
Choose the model that is the best predictor for the speaker data as the winner.
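The three steps above can be sketched in a few lines. To keep it short, each "language model" here is a single one-dimensional Gaussian rather than a full Gaussian mixture over MFCC vectors, and the training numbers are made up purely for illustration; the model names and helper functions are likewise assumptions.

```python
import math


def fit_gaussian(values):
    """Fit a 1-D Gaussian (mean, variance): a stand-in for a full GMM."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, max(var, 1e-6)  # floor the variance to avoid division by zero


def log_likelihood(values, model):
    """Total log-likelihood of the observations under one Gaussian model."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var) for v in values)


def best_model(values, models):
    """Pick the model name that best predicts the speaker's data."""
    return max(models, key=lambda name: log_likelihood(values, models[name]))


# Made-up scalar features for two hypothetical languages.
training = {
    "language_a": [1.0, 1.2, 0.9, 1.1, 1.3],
    "language_b": [3.0, 2.8, 3.1, 3.3, 2.9],
}
models = {name: fit_gaussian(values) for name, values in training.items()}
winner = best_model([2.9, 3.2, 3.0], models)
```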

Detecting the fundamental frequency [closed]

There's this tech-festival in IIT-Bombay, India, where they're having an event called "Artbots" where we're supposed to design artbots with artistic abilities. I had an idea about a musical robot which takes a song as input, detects the notes in the song and plays it back on a piano. I need some method which will help me compute the pitches of the notes of the song. Any idea/suggestion on how to go about it?
This is exactly what I'm doing here as my final-year project :) except that my project is about tracking the pitch of the human singing voice (and I don't have the robot to play the tune).
The quickest way I can think of is to use the BASS library. It contains ready-to-use functions that can give you FFT data from the default recording device. Take a look at the "livespec" code example that comes with BASS.
By the way, raw FFT data will not be enough to determine the fundamental frequency. You need an algorithm such as Harmonic Product Spectrum to get the F0.
Another consideration is the audio source. If you are going to do an FFT and apply Harmonic Product Spectrum to it, you will need to make sure the input has only one audio source. If it contains multiple sources, as in modern songs, there will be too many frequencies to consider.
Harmonic Product Spectrum Theory
If the input signal is a musical note, then its spectrum should consist of a series of peaks, corresponding to fundamental frequency with harmonic components at integer multiples of the fundamental frequency. Hence when we compress the spectrum a number of times (downsampling), and compare it with the original spectrum, we can see that the strongest harmonic peaks line up. The first peak in the original spectrum coincides with the second peak in the spectrum compressed by a factor of two, which coincides with the third peak in the spectrum compressed by a factor of three. Hence, when the various spectrums are multiplied together, the result will form a clear peak at the fundamental frequency.
Method
First, we divide the input signal into segments by applying a Hanning window, where the window size and hop size are given as an input. For each window, we utilize the Short-Time Fourier Transform to convert the input signal from the time domain to the frequency domain. Once the input is in the frequency domain, we apply the Harmonic Product Spectrum technique to each window.
The HPS involves two steps: downsampling and multiplication. To downsample, we compress the spectrum twice in each window by resampling: the first time, we compress the original spectrum by two and the second time, by three. Once this is completed, we multiply the three spectra together and find the frequency that corresponds to the peak (maximum value). This particular frequency represents the fundamental frequency of that particular window.
Limitations of the HPS method
Some nice features of this method include: it is computationally inexpensive, reasonably resistant to additive and multiplicative noise, and adjustable to different kinds of inputs. For instance, we could change the number of compressed spectra to use, and we could replace the spectral multiplication with a spectral addition. However, since human pitch perception is basically logarithmic, this means that low pitches may be tracked less accurately than high pitches.
Another severe shortfall of the HPS method is that its resolution is only as good as the length of the FFT used to calculate the spectrum. If we perform a short and fast FFT, we are limited in the number of discrete frequencies we can consider. In order to gain a higher resolution in our output (and therefore see less graininess in our pitch output), we need to take a longer FFT which requires more time.
from: http://cnx.org/content/m11714/latest/
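A minimal single-frame sketch of the method quoted above might look like this. It uses a small recursive FFT so that it runs without extra libraries; the function names, the 8 kHz sample rate, the 2048-sample frame and the synthetic 220 Hz test note (with a deliberately weak fundamental) are assumptions, not part of the quoted text.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def hps_pitch(frame, sample_rate, n_spectra=3):
    """Estimate the fundamental of one frame with the Harmonic Product Spectrum."""
    n = len(frame)
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    spectrum = [abs(c) for c in fft([s * w for s, w in zip(frame, hann)])[: n // 2]]
    # Compressing the spectrum by h means reading every h-th bin: spectrum[k * h].
    limit = len(spectrum) // n_spectra
    product = [
        math.prod(spectrum[k * h] for h in range(1, n_spectra + 1))
        for k in range(limit)
    ]
    peak_bin = max(range(1, limit), key=lambda k: product[k])  # skip the DC bin
    return peak_bin * sample_rate / n


# Synthetic note at 220 Hz whose second harmonic is louder than the fundamental.
sample_rate = 8000
amplitudes = [0.3, 1.0, 0.6, 0.4]
frame = [
    sum(a * math.sin(2 * math.pi * 220.0 * (h + 1) * i / sample_rate)
        for h, a in enumerate(amplitudes))
    for i in range(2048)
]
estimate = hps_pitch(frame, sample_rate)
```

The estimate can only land on a bin centre, so its resolution is `sample_rate / len(frame)` (about 3.9 Hz here), which is the limitation described in the last quoted paragraph.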
Just a comment: the fundamental may well be missing from a (harmonic) sound; this doesn't change the perceived pitch. As a limit case, if you take a square wave (say, a C# note) and completely suppress the first harmonic, the perceived note is still C#, in the same octave. In a way, our brain is able to compensate for the absence of some harmonics, even the first, when it guesses a note.
Hence, to detect a pitch with frequency-domain techniques you should take into account all the harmonics (local maxima in the magnitude of the Fourier transform), and extract some sort of "greatest common divisor" of their frequencies. Pitch detection is not a trivial problem at all...
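One crude way to sketch the "greatest common divisor" idea: try dividing the lowest peak by 1, 2, 3, ... and keep the first (highest) candidate for which every peak sits near an integer multiple. The function name, `max_divisor`, the tolerance and the example peak frequencies (odd harmonics 3, 5 and 7 of a C#4 at about 277.18 Hz, fundamental removed) are assumptions.

```python
def fundamental_from_peaks(peaks_hz, max_divisor=10, tolerance=0.05):
    """Approximate 'greatest common divisor' of a set of spectral peak frequencies.

    Returns the highest candidate f0 for which every peak lies within
    `tolerance` (in harmonic numbers) of an integer multiple, or None.
    """
    lowest = min(peaks_hz)
    for divisor in range(1, max_divisor + 1):
        candidate = lowest / divisor
        ratios = [p / candidate for p in peaks_hz]
        if all(abs(r - round(r)) <= tolerance for r in ratios):
            return candidate
    return None


# Peaks of a square-wave-like C#4 with the first harmonic suppressed.
peaks = [831.5, 1385.9, 1940.3]
f0 = fundamental_from_peaks(peaks)
```

With noisy or inharmonic peaks the tolerance needs tuning, and spurious peaks can break the "every peak" condition, so this is a starting point rather than a robust detector.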
DAFX has about 30 pages dedicated to pitch detection, with examples and Matlab code.
Autocorrelation - http://en.wikipedia.org/wiki/Autocorrelation
Zero-crossing - http://en.wikipedia.org/wiki/Zero_crossing (this method is used in cheap guitar tuners)
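Both methods listed above fit in a few lines for a single frame. This is a sketch under assumptions: the function names, the 80-1000 Hz search range and the synthetic test tones are made up for the example, and the autocorrelation is the plain unnormalised version.

```python
import math


def zero_crossing_pitch(samples, sample_rate):
    """Estimate pitch by counting upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)


def autocorrelation_pitch(samples, sample_rate, f_min=80.0, f_max=1000.0):
    """Estimate pitch from the lag that maximises the autocorrelation."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)

    def correlation(lag):
        return sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))

    best_lag = max(range(min_lag, max_lag + 1), key=correlation)
    return sample_rate / best_lag


sample_rate = 8000
# A clean 200 Hz sine: zero-crossing counting works fine here.
pure = [math.sin(2 * math.pi * 200 * n / sample_rate + 0.3) for n in range(2048)]
# The same note with a dominant second harmonic: extra zero crossings appear.
rich = [
    0.5 * math.sin(2 * math.pi * 200 * n / sample_rate)
    + 1.0 * math.sin(2 * math.pi * 400 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 600 * n / sample_rate)
    for n in range(2048)
]
pitch_zc = zero_crossing_pitch(pure, sample_rate)
pitch_ac = autocorrelation_pitch(rich, sample_rate)
```

Zero-crossing counting is cheap but is easily thrown off by strong harmonics, whereas the autocorrelation still peaks at the true period of the harmonic-rich tone in this example.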
Try YAAPT pitch tracking, which detects fundamental frequency in both time and frequency domains. You can download Matlab source code from the link and look for peaks in the FFT output using the spectral process part.
Python package http://bjbschmitt.github.io/AMFM_decompy/pYAAPT.html#
Did you try Wikipedia's article on pitch detection? It contains a few references that can be interesting to you.
In addition, here's a list of DSP applications and libraries, where you can poke around. The list only mentions Linux software packages, but many of them are cross-platform, and there's a lot of source code you can look at.
Just FYI, detecting the pitch of the notes in a monophonic recording is within reach of most DSP-savvy people. Detecting the pitches of all notes, including chords and stuff, is a lot harder.
Just a thought - but do you need to process a digital audio stream as input?
If not, consider using a symbolic representation of music (such as MIDI). The pitches of the notes will then be stated explicitly, and you can synthesize sounds (and movements) corresponding to the pitch, rhythm and many other musical parameters extremely easily.
If you need to analyse a digital audio stream (mp3, wav, live input, etc) bear in mind that while pitch detection of simple monophonic sounds is quite advanced, polyphonic pitch detection is an unsolved problem. In this case, you may find my answer to this question helpful.
For extracting the fundamental frequency of the melody from polyphonic music you could try the MELODIA plug-in: http://mtg.upf.edu/technologies/melodia
Extracting the F0's of all the instruments in a song (multi-F0 tracking) or transcribing them into notes is an even harder task. Both melody extraction and music transcription are still open research problems, so regardless of the algorithm/tool you use don't expect to obtain perfect results for either.
If you're trying to detect the notes of a polyphonic recording (multiple notes at the same time) good luck. That's a very tricky problem. I don't know of any way to listen to, say, a recording of a string quartet and have an algorithm separate the four voices. (Wavelets maybe?) If it's just one note at a time, there are several pitch tracking algorithms out there, many of them mentioned in other comments.
The algorithm you want to use will depend on the type of music you are listening to. If you want it to pick up people singing there are a lot of good algorithms out there designed specifically for voice. (That's where most of the research is.) If you are trying to pick up specific instruments you'll have to be a bit more creative. Voice algorithms can be simple because the range of the human singing voice is generally limited to about 100-2000 Hz. (Speaking range is much narrower.) The fundamental frequencies on a piano, however, go from about 27 Hz to 4200 Hz, so you're dealing with a wider range usually ignored by voice pitch detection algorithms.
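The piano range quoted above follows from the equal-temperament formula, which is also what you need to turn a detected frequency into a key to press. A small sketch, assuming A4 = 440 Hz tuning and standard MIDI note numbers (21 for A0, 108 for C8 on an 88-key piano); the function names are made up for the example.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_freq(note, a4=440.0):
    """Frequency in Hz of a MIDI note number in equal temperament."""
    return a4 * 2 ** ((note - 69) / 12)


def freq_to_midi(freq, a4=440.0):
    """Nearest MIDI note number for a frequency in Hz."""
    return round(69 + 12 * math.log2(freq / a4))


def note_name(note):
    """Scientific pitch name of a MIDI note number, e.g. 69 -> 'A4'."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"


lowest_key = midi_to_freq(21)    # A0, the lowest key of an 88-key piano
highest_key = midi_to_freq(108)  # C8, the highest key
detected = note_name(freq_to_midi(277.18))
```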
The waveform of most instruments is going to be fairly complex, with lots of harmonics, so a simple approach like counting zeros or just taking the autocorrelation won't work. If you knew roughly what frequency range you were looking in you could low-pass filter and then zero count. I'd think you'd be better off though with a more complex algorithm such as the Harmonic Product Spectrum mentioned by another user, or YAAPT ("Yet Another Algorithm for Pitch Tracking"), or something similar.
One last problem: some instruments, the piano in particular, will have the problem of missing fundamentals and inharmonicity. Missing fundamentals can be dealt with by the pitch tracking algorithms...in fact they have to be since fundamentals are often cut out in electronic transmission...though you'll probably still get some octave errors. Inharmonicity however, will give you problems if somebody plays a note in the bottom octaves of the piano. Normal pitch tracking algorithms aren't designed to deal with inharmonicity because the human voice is not significantly inharmonic.
You basically need a spectrum analyzer. You might be able to do an FFT on a recording of an analog input, but much depends on the resolution of the recording.
what immediately comes to my mind:
filter out very low frequencies (drums, bass-line),
filter out high frequencies (harmonics)
FFT,
look for peaks in the FFT output for the melody
I am not sure, if that works for very polyphonic sounds - maybe googling for "FFT, analysis, melody etc." will return more info on possible problems.
regards

Resources