Is it possible to run a speaker recognition algorithm on a low-level embedded system(like Cortex-M0)? - machine-learning

I'm trying to implement a hello-world algorithm in speaker recognition, GMM-UBM, on an end-device MCU. Wifi/BLE...etc are not available due to some limitation.
The UBM is pre-trained on PC then downloaded to a flash memory that my MCU can access.
What will actually be run on the MCU are only speaker enrollment and testing, as known as a GMM model adaptation procedure and score(log-likelihood ratio) calculation.
For the purpose of this system is to tell if an input voice segment is from a target speaker or not, a score threshold should be selected. Hence, currently, after speaker enrollment, I use a bunch of impostor voices(pre-saved in flash memory) to calculate the impostor scores corresponding to the target speaker. Assuming the impostor scores is a 1-D Gaussian distribution, then I can decide the score threshold depend on the false alarm rate acceptable.
But here's the problem, the procedure above for threshold selection is time-consuming especially on a device with CPU clock rate < 100MHz. Imagine that you have to wait 1 minute or even longer time to get response from Google assistant first time you use it, that will be embarrassing.
The question are as below:
Is my concept right or there is some misunderstanding?
Is this a standard procedure to enroll a speaker and select a threshold?
Is there some method to reduce the computing time of the threshold selection procedure, or even a pre-defined threshold to avoid the on-chip computing?
Thanks a billion!!

Related

Calculate the percentage of accuracy with which user made the assigned sound

I want to design a web-app for my cousin who is 2 years of age in which i have implemented a functionality in which when an image is clicked some sound gets played and the user has to make the same sound which gets recorded.
For eg-If i click on image of "Apple" the sound made is "A for Apple".Now the user has to say those words which get recorded.
Now I want to calculate the percentage of accuracy with which the user spoke.I want to know how can i know the accuracy percentage.I have not used machine learning or Natural Language Processing earlier so i want some guidance on what should i learn about or ways of implementing this functionality.I need some help on that.
Also use nodejs frameworks quite frequently so is there any module in nodejs with the help of which the above requirement can be fulfilled.
What you want to reach is a quite complex and non-trivial task that can be faced at several levels. First of all, you should answer a question in before for yourself:
What do you mean with "accuarcy"? Which metric do you want to use for that? Accuracy means to compare a result with its optimum. So what would be the optimum of saying "Apple"?
I think there are several levels on which you could measure speech accuracy:
On the audio level: Here are several correlation metrics that can compute the similarity of two audio files. See e.g. here for more details. SImply said, the idea is directly comparing the audio samples. In your case, you would need a reference audio track that is the "correct" result. The correct time alignment might become a problem though.
On the level of speech recognition: You could use a speech recognizer -- commercial or open source -- and return a string of spoken words. In this case you should think about when the recording is stopped, to limit the record length. Then you have to think about a metric that evaluates the correctness of the transcription. Some that I worked with are Levensthein-Distance or Word-Error-Rate. Wit these you can compute a similarity.

Detecting exact frequency of Bluetooth signal

I was wondering if there was a way that I could detect the exact frequency of a BLE signal with an iphone. I know it will be in the 2.4 GHz range but i would like to know the difference down to the 1 Hz range between the transmitted frequency and the received frequency. The difference would be caused by the doppler effect meaning that the central or the peripheral would have to be moving. Also is there an exact frequency that iphones transmit BLE at or does it depend on the iphone's antenna?
Bluetooth doesn't have one particular frequency it operates on. Via bluetooth.com:
Bluetooth technology operates in the unlicensed industrial, scientific and medical (ISM) band at 2.4 to 2.485 GHz, using a spread spectrum, frequency hopping, full-duplex signal at a nominal rate of 1600 hops/sec.
… adaptive hopping among 79 frequencies at 1 MHz intervals gives a high degree of interference immunity and also allows for more efficient transmission within the spectrum.
So there'll be a wide spread of frequencies in use for even a single connection to a single device. There's hardware on the market like the Ubertooth that can do packet captures and spectrum analysis.
To my knowledge, iOS doesn't offer API to find out this information. OS X does at some level, probably via SPI or an IOBluetooth API, because Apple's Hardware Tools (search for "Bluetooth") offer a way to monitor spectrum usage of Bluetooth Classic devices on OS X.
As to your desire to detect movement via the Doppler effect on the radios, my instincts say that it's going to be very, very difficult to do. I'm not sure what the exact mathematics behind it would look like, but you'll want to examine what the Doppler effect on a transmission at 2.4 GHz would be as a result of low-to-moderate rates of motion. (A higher rate of motion or relative speed, say, over a few tens of miles an hour, will be quickly make Bluetooth the wrong radio technology to use because of its low transmit power.)

How do I do a decent speech detection?

I need to write a speech detection algorithm (not speech recognition).
At first I thought I just have to measure the microphone power and compare it to some threshold value. But the problem gets much harder once you have to take the ambient sound level into consideration (for example in a pub a simple power threshold is crossed immediately because of other people talking).
So in the second version I thought I have to measure the current power spikes against the average sound level or something like that. Coding this idea proved to be quite hairy for me, at which point I decided it might be time to research already existing solutions.
Do you know of some general algorithm description for speech detection? Existing code or library in C/C++/Objective-C is also fine, be it commercial or free.
P.S. I guess there is a difference between “speech” and “sound” recognition, with the first one only responding to frequencies close to human speech range. I’m fine with the second, simpler case.
The key phrase that you need to Google for is Voice Activity Detection (VAD) – it's implemented widely in telecomms, particularly in Acoustic Echo Cancellation (AEC).

Recognizing individual voices

I plan to write a conversation analysis software, which will recognize the individual speakers, their pitch and intensity. Pitch and intensity are somewhat straightforward (pitch via autocorrelation).
How would I go about recognizing individual speakers, so I can record his/her features? Will storing some heuristics for each speaker's frequencies be enough? I can assume that only one person speaks at a time (strictly non-overlapping). I can also assume that for training, each speaker can record a minute's worth of data before actual analysis.
Pitch and intensity on their own tell you nothing. You really need to analyse how pitch varies. In order to identify different speakers you need to transform the speech audio into some kind of feature space, and then make comparisons against your database of speakers in this feature space. The general term that you might want to Google for is prosody - see e.g. http://en.wikipedia.org/wiki/Prosody_(linguistics). While you're Googling you might also want to read up on speaker identification aka speaker recognition, see e.g. http://en.wikipedia.org/wiki/Speaker_identification
If you are still working on this... are you using speech-recognition on the sound input? Because Microsoft SAPI for example provides the application with a rich API for digging into the speech sound wave, which could make the speaker-recognition problem more tractable. I think you can get phoneme positions within the waveform. That would let you do power-spectrum analysis of vowels, for example, which could be used to generate features to distinguish speakers. (Before anybody starts muttering about pitch and volume, keep in mind that the formant curves come from vocal-tract shape and are fairly independent of pitch, which is vocal-cord frequency, and the relative position and relative amplitude of formants are (relatively!) independent of overall volume.) Phoneme duration in-context might also be a useful feature. Energy distribution during 'n' sounds could provide a 'nasality' feature. And so on. Just a thought. I expect to be working in this area myself.

Detecting the fundamental frequency [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
There's this tech-festival in IIT-Bombay, India, where they're having an event called "Artbots" where we're supposed to design artbots with artistic abilities. I had an idea about a musical robot which takes a song as input, detects the notes in the song and plays it back on a piano. I need some method which will help me compute the pitches of the notes of the song. Any idea/suggestion on how to go about it?
This is exactly what I'm doing here as my last year project :) except one thing that my project is about tracking the pitch of human singing voice (and I don't have the robot to play the tune)
The quickest way I can think of is to utilize BASS library. It contains ready-to-use function that can give you FFT data from default recording device. Take a look at "livespec" code example that comes with BASS.
By the way, raw FFT data will not enough to determine fundamental frequency. You need algorithm such as Harmonic Product Spectrum to get the F0.
Another consideration is the audio source. If you are going to do FFT and apply Harmonic Product Spectrum on it. You will need to make sure the input has only one audio source. If it contains multiple sources such as in modern songs there will be to many frequencies to consider.
Harmonic Product Spectrum Theory
If the input signal is a musical note,
then its spectrum should consist of a
series of peaks, corresponding to
fundamental frequency with harmonic
components at integer multiples of the
fundamental frequency. Hence when we
compress the spectrum a number of
times (downsampling), and compare it
with the original spectrum, we can see
that the strongest harmonic peaks line
up. The first peak in the original
spectrum coincides with the second
peak in the spectrum compressed by a
factor of two, which coincides with
the third peak in the spectrum
compressed by a factor of three.
Hence, when the various spectrums are
multiplied together, the result will
form clear peak at the fundamental
frequency.
Method
First, we divide the input signal into
segments by applying a Hanning window,
where the window size and hop size are
given as an input. For each window,
we utilize the Short-Time Fourier
Transform to convert the input signal
from the time domain to the frequency
domain. Once the input is in the
frequency domain, we apply the
Harmonic Product Spectrum technique to
each window.
The HPS involves two steps:
downsampling and multiplication. To
downsample, we compressed the spectrum
twice in each window by resampling:
the first time, we compress the
original spectrum by two and the
second time, by three. Once this is
completed, we multiply the three
spectra together and find the
frequency that corresponds to the peak
(maximum value). This particular
frequency represents the fundamental
frequency of that particular window.
Limitations of the HPS method
Some nice features of this method
include: it is computationally
inexpensive, reasonably resistant to
additive and multiplicative noise, and
adjustable to different kind of
inputs. For instance, we could change
the number of compressed spectra to
use, and we could replace the spectral
multiplication with a spectral
addition. However, since human pitch
perception is basically logarithmic,
this means that low pitches may be
tracked less accurately than high
pitches.
Another severe shortfall of the HPS
method is that it its resolution is
only as good as the length of the FFT
used to calculate the spectrum. If we
perform a short and fast FFT, we are
limited in the number of discrete
frequencies we can consider. In order
to gain a higher resolution in our
output (and therefore see less
graininess in our pitch output), we
need to take a longer FFT which
requires more time.
from: http://cnx.org/content/m11714/latest/
Just a comment: The fundamental harmonic may as well be missing from a (harmonic) sound, this doesn't change the perceived pitch. As a limit case, if you take a square wave (say, a C# note) and completely suppress the first harmonic, the perceived note is still C#, in the same octave. In a way, our brain is able to compensate the absence of some harmonics, even the first, when it guesses a note.
Hence, to detect a pitch with frequency-domain techniques you should take into account all the harmonics (local maxima in the magnitude of the Fourier transform), and extract some sort of "greatest common divisor" of their frequencies. Pitch detection is not a trivial problem at all...
DAFX has about 30 pages dedicated to pitch detection, with examples and Matlab code.
Autocorrelation - http://en.wikipedia.org/wiki/Autocorrelation
Zero-crossing - http://en.wikipedia.org/wiki/Zero_crossing (this method is used in cheap guitar tuners)
Try YAAPT pitch tracking, which detects fundamental frequency in both time and frequency domains. You can download Matlab source code from the link and look for peaks in the FFT output using the spectral process part.
Python package http://bjbschmitt.github.io/AMFM_decompy/pYAAPT.html#
Did you try Wikipedia's article on pitch detection? It contains a few references that can be interesting to you.
In addition, here's a list of DSP applications and libraries, where you can poke around. The list only mentions Linux software packages, but many of them are cross-platform, and there's a lot of source code you can look at.
Just FYI, detecting the pitch of the notes in a monophonic recording is within reach of most DSP-savvy people. Detecting the pitches of all notes, including chords and stuff, is a lot harder.
Just a thought - but do you need to process a digital audio stream as input?
If not, consider using a symbolic representation of music (such as MIDI). The pitches of the notes will then be stated explicitly, and you can synthesize sounds (and movements) corresponding to the pitch, rhythm and many other musical parameters extremely easily.
If you need to analyse a digital audio stream (mp3, wav, live input, etc) bear in mind that while pitch detection of simple monophonic sounds is quite advanced, polyphonic pitch detection is an unsolved problem. In this case, you may find my answer to this question helpful.
For extracting the fundamental frequency of the melody from polyphonic music you could try the MELODIA plug-in: http://mtg.upf.edu/technologies/melodia
Extracting the F0's of all the instruments in a song (multi-F0 tracking) or transcribing them into notes is an even harder task. Both melody extraction and music transcription are still open research problems, so regardless of the algorithm/tool you use don't expect to obtain perfect results for either.
If you're trying to detect the notes of a polyphonic recording (multiple notes at the same time) good luck. That's a very tricky problem. I don't know of any way to listen to, say, a recording of a string quartet and have an algorithm separate the four voices. (Wavelets maybe?) If it's just one note at a time, there are several pitch tracking algorithms out there, many of them mentioned in other comments.
The algorithm you want to use will depend on the type of music you are listening to. If you want it to pick up people singing there are a lot of good algorithms out there designed specifically for voice. (That's where most of the research is.) If you are trying to pick up specific instruments you'll have to be a bit more creative. Voice algorithms can be simple because the range of the human singing voice is generally limited to about 100-2000 Hz. (Speaking range is much more narrow). The fundamental frequencies on a piano, however, go from about 27 Hz. to 4200 Hz., so you're dealing with a wider range usually ignored by voice pitch detection algorithms.
The waveform of most instruments is going to be fairly complex, with lots of harmonics, so a simple approach like counting zeros or just taking the autocorrelation won't work. If you knew roughly what frequency range you were looking in you could low-pass filter and then zero count. I'd think you'd be better off though with a more complex algorithm such as the Harmonic Product Spectrum mentioned by another user, or YAAPT ("Yet Another Algorithm for Pitch Tracking"), or something similar.
One last problem: some instruments, the piano in particular, will have the problem of missing fundamentals and inharmonicity. Missing fundamentals can be dealt with by the pitch tracking algorithms...in fact they have to be since fundamentals are often cut out in electronic transmission...though you'll probably still get some octave errors. Inharmonicity however, will give you problems if somebody plays a note in the bottom octaves of the piano. Normal pitch tracking algorithms aren't designed to deal with inharmonicity because the human voice is not significantly inharmonic.
You basically need a spectrum analyzer. You might be able to to a FFT on a recording of an analog input, but much depends on the resolution of the recording.
what immediately comes to my mind:
filter out very low frequencies (drums, bass-line),
filter out high frequencies (harmonics)
FFT,
look for peaks in the FFT output for the melody
I am not sure, if that works for very polyphonic sounds - maybe googling for "FFT, analysis, melody etc." will return more info on possible problems.
regards

Resources