Recognizing individual voices - signal-processing

I plan to write a conversation analysis software, which will recognize the individual speakers, their pitch and intensity. Pitch and intensity are somewhat straightforward (pitch via autocorrelation).
How would I go about recognizing individual speakers, so I can record his/her features? Will storing some heuristics for each speaker's frequencies be enough? I can assume that only one person speaks at a time (strictly non-overlapping). I can also assume that for training, each speaker can record a minute's worth of data before actual analysis.

Pitch and intensity on their own tell you nothing. You really need to analyse how pitch varies. In order to identify different speakers you need to transform the speech audio into some kind of feature space, and then make comparisons against your database of speakers in this feature space. The general term that you might want to Google for is prosody - see e.g. http://en.wikipedia.org/wiki/Prosody_(linguistics). While you're Googling you might also want to read up on speaker identification aka speaker recognition, see e.g. http://en.wikipedia.org/wiki/Speaker_identification

If you are still working on this... are you using speech-recognition on the sound input? Because Microsoft SAPI for example provides the application with a rich API for digging into the speech sound wave, which could make the speaker-recognition problem more tractable. I think you can get phoneme positions within the waveform. That would let you do power-spectrum analysis of vowels, for example, which could be used to generate features to distinguish speakers. (Before anybody starts muttering about pitch and volume, keep in mind that the formant curves come from vocal-tract shape and are fairly independent of pitch, which is vocal-cord frequency, and the relative position and relative amplitude of formants are (relatively!) independent of overall volume.) Phoneme duration in-context might also be a useful feature. Energy distribution during 'n' sounds could provide a 'nasality' feature. And so on. Just a thought. I expect to be working in this area myself.

Related

Calculate the percentage of accuracy with which user made the assigned sound

I want to design a web-app for my cousin who is 2 years of age in which i have implemented a functionality in which when an image is clicked some sound gets played and the user has to make the same sound which gets recorded.
For eg-If i click on image of "Apple" the sound made is "A for Apple".Now the user has to say those words which get recorded.
Now I want to calculate the percentage of accuracy with which the user spoke.I want to know how can i know the accuracy percentage.I have not used machine learning or Natural Language Processing earlier so i want some guidance on what should i learn about or ways of implementing this functionality.I need some help on that.
Also use nodejs frameworks quite frequently so is there any module in nodejs with the help of which the above requirement can be fulfilled.
What you want to reach is a quite complex and non-trivial task that can be faced at several levels. First of all, you should answer a question in before for yourself:
What do you mean with "accuarcy"? Which metric do you want to use for that? Accuracy means to compare a result with its optimum. So what would be the optimum of saying "Apple"?
I think there are several levels on which you could measure speech accuracy:
On the audio level: Here are several correlation metrics that can compute the similarity of two audio files. See e.g. here for more details. SImply said, the idea is directly comparing the audio samples. In your case, you would need a reference audio track that is the "correct" result. The correct time alignment might become a problem though.
On the level of speech recognition: You could use a speech recognizer -- commercial or open source -- and return a string of spoken words. In this case you should think about when the recording is stopped, to limit the record length. Then you have to think about a metric that evaluates the correctness of the transcription. Some that I worked with are Levensthein-Distance or Word-Error-Rate. Wit these you can compute a similarity.

Seperation of instruments' audios from a single channel non-MIDI musical file

My friend Prasad Raghavendra and me, were trying to experiment with Machine Learning on audio.
We were doing it to learn and to explore interesting possibilities at any upcoming get-togethers.
I decided to see how deep learning or any machine learning can be fed with certain audios rated by humans (evaluation).
To our dismay, we found that the problem had to be split to accommodate for the dimensionality of input.
So, we decided to discard vocals and assess by accompaniments with an assumption that vocals and instruments are always correlated.
We tried to look for mp3/wav to MIDI converter. Unfortunately, they were only for single instruments on SourceForge and Github and other options are paid options. (Ableton Live, Fruity Loops etc.) We decided to take this as a sub-problem.
We thought of FFT, band-pass filters and moving window to accommodate for these.
But, we are not understanding as to how we can go about splitting instruments if chords are played and there are 5-6 instruments in file.
What are the algorithms that I can look for?
My friend knows to play Keyboard. So, I will be able to get MIDI data. But, are there any data-sets meant for this?
How many instruments can these algorithms detect?
How do we split the audio? We do not have multiple audios or the mixing matrix
We were also thinking about finding out the patterns of accompaniments and using those accompaniments in real-time while singing along. I guess we will be able to think about it once we get answers to 1,2,3 and 4. (We are thinking about both Chord progressions and Markovian dynamics)
Thanks for all help!
P.S.: We also tried FFT and we are able to see some harmonics. Is it due to Sinc() in fft when rectangular wave is input in time domain? Can that be used to determine timbre?
We were able to formulate the problem roughly. But, still, we are finding it difficult to formulate the problem. If we use frequency domain for certain frequency, then the instruments are indistinguishable. A trombone playing at 440 Hz or a Guitar playing at 440 Hz would have same frequency excepting timbre. We still do not know how we can determine timbre. We decided to go by time domain by considering notes. If a note exceeds a certain octave, we would use that as a separate dimension +1 for next octave, 0 for current octave and -1 for the previous octave.
If notes are represented by letters such as 'A', 'B', 'C' etc, then the problem reduces to mixing matrices.
O = MI during training.
M is the mixing matrix that will have to be found out using the known O output and I input of MIDI file.
During prediction though, M must be replaced by a probability matrix P which would be generated using previous M matrices.
The problem reduces to Ipredicted = P-1O. The error would then be reduced to LMSE of I. We can use DNN to adjust P using back-propagation.
But, in this approach, we assume that the notes 'A','B','C' etc are known. How do we detect them instantaneously or in small duration like 0.1 seconds? Because, template matching may not work due to harmonics. Any suggestions would be much appreciated.
Splitting out the different parts is a machine learning problem all to its own. Unfortunately, you can't look at this problem in audio land only. You must consider the music.
You need to train something to understand musical patterns and progressions in the context of the type of music you give it. It needs to understand what the different instruments sound like, both mixed and not mixed. It needs to understand how these instruments are often played together, if it's going to have any chance at all at separating what's going on.
This is a very, very difficult problem.
This is a very hard problem mainly because converting audio to pitch isnt very simple due to Nyquist folding harmonics that are 22Khz+ back down and also other harmonic introductions such as saturators/distortion and other analogue equipment that introduce harmonics.
The fundamental harmonic isnt always the loudest which is why your plan will not work.
The hardest thing to measure would be a distorted guitar. The harmonic some pedals/plugins can make is crazy.

How do I do a decent speech detection?

I need to write a speech detection algorithm (not speech recognition).
At first I thought I just have to measure the microphone power and compare it to some threshold value. But the problem gets much harder once you have to take the ambient sound level into consideration (for example in a pub a simple power threshold is crossed immediately because of other people talking).
So in the second version I thought I have to measure the current power spikes against the average sound level or something like that. Coding this idea proved to be quite hairy for me, at which point I decided it might be time to research already existing solutions.
Do you know of some general algorithm description for speech detection? Existing code or library in C/C++/Objective-C is also fine, be it commercial or free.
P.S. I guess there is a difference between “speech” and “sound” recognition, with the first one only responding to frequencies close to human speech range. I’m fine with the second, simpler case.
The key phrase that you need to Google for is Voice Activity Detection (VAD) – it's implemented widely in telecomms, particularly in Acoustic Echo Cancellation (AEC).

How to compare word pronounce?

This is for a personal project of mine, and I have no idea from where to start as it falls way beyond my comfort zone.
I know that there are a few language learning software out there that allows the user to record his or her voice and compare the pronounce with a native speaker of said language.
My question is, how to achieve this?
I mean, how one compares the pronunciation between the user and the native speaker?
If you're looking for something relatively simple, you could simply compute the MFCC (http://en.wikipedia.org/wiki/Mel-frequency_cepstrum) of the recording, and then look at something simple like the correlation between the recording and the average coefficients of that word being pronounced by a native speaker. The MFCC will transform the audio into a space where euclidean distance corresponds more closely with perceptual difference.
Of course, there are several possible problems:
Aligning the two recordings so the coefficients match up. To fix this, you could look at the maximum cross-correlation of the coefficients, rather than the simple correlation, so you will get an automatic "best alignment" for free. Also, you might have to clip off ends of the recording, so only the actual pronunciation of the word remains in the recording.
The MFCC maps to perceptual space, but might not correspond so well to accent inaccuracies. You could perhaps try to fix this by instead of comparing it to just the "ideal" pronunciation, comparing it to the average for several different types of mispronunciation, and looking at which model it is closest to.
Even good accented words will be on average some "distance" from the ideal. You'll have to take that into account, and compare the input's distance to the "relative" good distance.
Correlation might not be the best way to compare the relative similarity of two sounds. Experiment with lots of different metrics... try different L^p norms: (http://en.wikipedia.org/wiki/Lp_space), or try weighing the different MFCCs differently (if I recall, even after MFCC have been taken, although they are all supposed to have the same perceptual "weight", the ones in the middle are still more important for how we perceive a sound than the high or low ones.)
There might be certain parts of the sound where the pronunciation matters much more for the quality of the accent. Perhaps transient detection to find those positions and mark them as more important would be good. If you had a whole bunch of "good pronunciation" and "bad pronunciation" examples, you could probably automatically extract those locations.
Again, in the end the only way you're going to know which combination of these options works best is by testing.
I've read about adapting gaussian mixture models for the phonetic space of a general speaker to an individual. This might be useful for training for a non-canonical accent for private use.
If you just compare the speaker to a general pronunciation model, then the match might not be very good. So the idea is to adjust the models to fit the speaker better during individual training.
Speaker Verification using Adapted Gaussian Mixture Models
EDIT: looking over your question again, I think I answered a different question. But the technique uses similar models:
Model various language (Do you have lots of data for different languages? Collecting the data might be the hard part.) GMMs work well for this.
Compare the data point from the speaker to the various language models
Choose the model that is the best predictor for the speaker data as the winner.

Detecting the fundamental frequency [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
There's this tech-festival in IIT-Bombay, India, where they're having an event called "Artbots" where we're supposed to design artbots with artistic abilities. I had an idea about a musical robot which takes a song as input, detects the notes in the song and plays it back on a piano. I need some method which will help me compute the pitches of the notes of the song. Any idea/suggestion on how to go about it?
This is exactly what I'm doing here as my last year project :) except one thing that my project is about tracking the pitch of human singing voice (and I don't have the robot to play the tune)
The quickest way I can think of is to utilize BASS library. It contains ready-to-use function that can give you FFT data from default recording device. Take a look at "livespec" code example that comes with BASS.
By the way, raw FFT data will not enough to determine fundamental frequency. You need algorithm such as Harmonic Product Spectrum to get the F0.
Another consideration is the audio source. If you are going to do FFT and apply Harmonic Product Spectrum on it. You will need to make sure the input has only one audio source. If it contains multiple sources such as in modern songs there will be to many frequencies to consider.
Harmonic Product Spectrum Theory
If the input signal is a musical note,
then its spectrum should consist of a
series of peaks, corresponding to
fundamental frequency with harmonic
components at integer multiples of the
fundamental frequency. Hence when we
compress the spectrum a number of
times (downsampling), and compare it
with the original spectrum, we can see
that the strongest harmonic peaks line
up. The first peak in the original
spectrum coincides with the second
peak in the spectrum compressed by a
factor of two, which coincides with
the third peak in the spectrum
compressed by a factor of three.
Hence, when the various spectrums are
multiplied together, the result will
form clear peak at the fundamental
frequency.
Method
First, we divide the input signal into
segments by applying a Hanning window,
where the window size and hop size are
given as an input. For each window,
we utilize the Short-Time Fourier
Transform to convert the input signal
from the time domain to the frequency
domain. Once the input is in the
frequency domain, we apply the
Harmonic Product Spectrum technique to
each window.
The HPS involves two steps:
downsampling and multiplication. To
downsample, we compressed the spectrum
twice in each window by resampling:
the first time, we compress the
original spectrum by two and the
second time, by three. Once this is
completed, we multiply the three
spectra together and find the
frequency that corresponds to the peak
(maximum value). This particular
frequency represents the fundamental
frequency of that particular window.
Limitations of the HPS method
Some nice features of this method
include: it is computationally
inexpensive, reasonably resistant to
additive and multiplicative noise, and
adjustable to different kind of
inputs. For instance, we could change
the number of compressed spectra to
use, and we could replace the spectral
multiplication with a spectral
addition. However, since human pitch
perception is basically logarithmic,
this means that low pitches may be
tracked less accurately than high
pitches.
Another severe shortfall of the HPS
method is that it its resolution is
only as good as the length of the FFT
used to calculate the spectrum. If we
perform a short and fast FFT, we are
limited in the number of discrete
frequencies we can consider. In order
to gain a higher resolution in our
output (and therefore see less
graininess in our pitch output), we
need to take a longer FFT which
requires more time.
from: http://cnx.org/content/m11714/latest/
Just a comment: The fundamental harmonic may as well be missing from a (harmonic) sound, this doesn't change the perceived pitch. As a limit case, if you take a square wave (say, a C# note) and completely suppress the first harmonic, the perceived note is still C#, in the same octave. In a way, our brain is able to compensate the absence of some harmonics, even the first, when it guesses a note.
Hence, to detect a pitch with frequency-domain techniques you should take into account all the harmonics (local maxima in the magnitude of the Fourier transform), and extract some sort of "greatest common divisor" of their frequencies. Pitch detection is not a trivial problem at all...
DAFX has about 30 pages dedicated to pitch detection, with examples and Matlab code.
Autocorrelation - http://en.wikipedia.org/wiki/Autocorrelation
Zero-crossing - http://en.wikipedia.org/wiki/Zero_crossing (this method is used in cheap guitar tuners)
Try YAAPT pitch tracking, which detects fundamental frequency in both time and frequency domains. You can download Matlab source code from the link and look for peaks in the FFT output using the spectral process part.
Python package http://bjbschmitt.github.io/AMFM_decompy/pYAAPT.html#
Did you try Wikipedia's article on pitch detection? It contains a few references that can be interesting to you.
In addition, here's a list of DSP applications and libraries, where you can poke around. The list only mentions Linux software packages, but many of them are cross-platform, and there's a lot of source code you can look at.
Just FYI, detecting the pitch of the notes in a monophonic recording is within reach of most DSP-savvy people. Detecting the pitches of all notes, including chords and stuff, is a lot harder.
Just a thought - but do you need to process a digital audio stream as input?
If not, consider using a symbolic representation of music (such as MIDI). The pitches of the notes will then be stated explicitly, and you can synthesize sounds (and movements) corresponding to the pitch, rhythm and many other musical parameters extremely easily.
If you need to analyse a digital audio stream (mp3, wav, live input, etc) bear in mind that while pitch detection of simple monophonic sounds is quite advanced, polyphonic pitch detection is an unsolved problem. In this case, you may find my answer to this question helpful.
For extracting the fundamental frequency of the melody from polyphonic music you could try the MELODIA plug-in: http://mtg.upf.edu/technologies/melodia
Extracting the F0's of all the instruments in a song (multi-F0 tracking) or transcribing them into notes is an even harder task. Both melody extraction and music transcription are still open research problems, so regardless of the algorithm/tool you use don't expect to obtain perfect results for either.
If you're trying to detect the notes of a polyphonic recording (multiple notes at the same time) good luck. That's a very tricky problem. I don't know of any way to listen to, say, a recording of a string quartet and have an algorithm separate the four voices. (Wavelets maybe?) If it's just one note at a time, there are several pitch tracking algorithms out there, many of them mentioned in other comments.
The algorithm you want to use will depend on the type of music you are listening to. If you want it to pick up people singing there are a lot of good algorithms out there designed specifically for voice. (That's where most of the research is.) If you are trying to pick up specific instruments you'll have to be a bit more creative. Voice algorithms can be simple because the range of the human singing voice is generally limited to about 100-2000 Hz. (Speaking range is much more narrow). The fundamental frequencies on a piano, however, go from about 27 Hz. to 4200 Hz., so you're dealing with a wider range usually ignored by voice pitch detection algorithms.
The waveform of most instruments is going to be fairly complex, with lots of harmonics, so a simple approach like counting zeros or just taking the autocorrelation won't work. If you knew roughly what frequency range you were looking in you could low-pass filter and then zero count. I'd think you'd be better off though with a more complex algorithm such as the Harmonic Product Spectrum mentioned by another user, or YAAPT ("Yet Another Algorithm for Pitch Tracking"), or something similar.
One last problem: some instruments, the piano in particular, will have the problem of missing fundamentals and inharmonicity. Missing fundamentals can be dealt with by the pitch tracking algorithms...in fact they have to be since fundamentals are often cut out in electronic transmission...though you'll probably still get some octave errors. Inharmonicity however, will give you problems if somebody plays a note in the bottom octaves of the piano. Normal pitch tracking algorithms aren't designed to deal with inharmonicity because the human voice is not significantly inharmonic.
You basically need a spectrum analyzer. You might be able to to a FFT on a recording of an analog input, but much depends on the resolution of the recording.
what immediately comes to my mind:
filter out very low frequencies (drums, bass-line),
filter out high frequencies (harmonics)
FFT,
look for peaks in the FFT output for the melody
I am not sure, if that works for very polyphonic sounds - maybe googling for "FFT, analysis, melody etc." will return more info on possible problems.
regards

Resources