I want to build an app that responds to the sound you make when blowing out birthday candles. This is not speech recognition per se (that sound isn't a word in English), and the very kind Halle over at OpenEars told me that it's not possible using that framework. (Thanks for your quick response, Halle!)
Is there a way to "teach" an app a sound such that the app can subsequently recognize it?
How would I go about this? Is it even doable? Am I crazy or taking on a problem that is much more difficult than I think it is? What should my homework be?
The good news is that it's achievable and you don't need any third party frameworks—AVFoundation is all you really need.
There's a good article from Mobile Orchard that covers the details, but, somewhat inevitably for a four-year-old article, there are some gotchas you need to be aware of.
Before recording on a real device, I needed to set the audio session category, like so:
[[AVAudioSession sharedInstance] setCategory:AVAudioSessionCategoryPlayAndRecord error:nil];
Play around with the threshold in this line:
if (lowPassResults > 0.95)
I found 0.95 to be too high and got better results setting it somewhere between 0.55 and 0.75. Similarly, I played around with the 0.05 multiplier in this line:
double peakPowerForChannel = pow(10, (0.05 * [recorder peakPowerForChannel:0]));
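For context, here's a minimal sketch of how those pieces fit together, following the Mobile Orchard approach: a metering-enabled AVAudioRecorder that records to /dev/null (we only want the levels, not the audio), polled by a timer that applies the low-pass filter and threshold from the snippets above. The recorder settings, the lowPassResults property, and the levelTimerCallback: name are my own choices, not from the original article:

#import <AVFoundation/AVFoundation.h>

- (void)startListening {
    [[AVAudioSession sharedInstance] setCategory:AVAudioSessionCategoryPlayAndRecord error:nil];

    // Record to /dev/null: we only care about the metering levels, not the audio data.
    NSURL *url = [NSURL fileURLWithPath:@"/dev/null"];
    NSDictionary *settings = @{ AVSampleRateKey: @44100.0,
                                AVFormatIDKey: @(kAudioFormatAppleLossless),
                                AVNumberOfChannelsKey: @1,
                                AVEncoderAudioQualityKey: @(AVAudioQualityMax) };

    NSError *error = nil;
    self.recorder = [[AVAudioRecorder alloc] initWithURL:url settings:settings error:&error];
    self.recorder.meteringEnabled = YES;
    [self.recorder record];

    // Poll the meter roughly 30 times a second.
    [NSTimer scheduledTimerWithTimeInterval:0.03
                                     target:self
                                   selector:@selector(levelTimerCallback:)
                                   userInfo:nil
                                    repeats:YES];
}

- (void)levelTimerCallback:(NSTimer *)timer {
    [self.recorder updateMeters];

    // Convert the dB meter reading to a 0..1 value and smooth it with a low-pass filter.
    const double ALPHA = 0.05;
    double peakPowerForChannel = pow(10, (0.05 * [self.recorder peakPowerForChannel:0]));
    self.lowPassResults = ALPHA * peakPowerForChannel + (1.0 - ALPHA) * self.lowPassResults;

    if (self.lowPassResults > 0.55) {   // tune between roughly 0.55 and 0.75, as above
        NSLog(@"Blow detected!");
    }
}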
Using simple thresholds on energy levels would probably not be robust enough for your use case.
A good way to go about this would be to first extract some properties from the sound stream that are specific to the sound of blowing out candles. Then use a machine learning algorithm to train a model based on training examples (a set of recordings of the sound you want to recognize), which can then be used to classify snippets of sound coming into your microphone in real-time when using the application.
Given the possible environmental sounds going on while you blow out candles (birthdays are always noisy, aren't they?), it may be difficult to train a model that is robust to these background sounds. This is not a simple problem if you care about accuracy.
It may be doable, though.
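To make the feature-extraction step concrete, here is a toy sketch (my own illustration, not a production recipe). It computes two simple frame-level features, RMS energy and zero-crossing rate, from a buffer of mono float PCM samples; blowing is a broadband, noise-like sound, so a real system would use richer features such as MFCCs, but the shape of the pipeline is the same: frames in, feature vectors out, trained classifier on top.

#import <Foundation/Foundation.h>
#import <math.h>

typedef struct {
    double rmsEnergy;         // overall loudness of the frame
    double zeroCrossingRate;  // tends to be high for noise-like sounds such as blowing
} FrameFeatures;

// Compute toy features for one frame of mono float PCM samples.
static FrameFeatures FeaturesForFrame(const float *samples, NSUInteger count) {
    FrameFeatures f = {0.0, 0.0};
    if (count == 0) return f;

    double sumSquares = 0.0;
    NSUInteger crossings = 0;
    for (NSUInteger i = 0; i < count; i++) {
        sumSquares += samples[i] * samples[i];
        if (i > 0 && (samples[i - 1] < 0.0f) != (samples[i] < 0.0f)) {
            crossings++;
        }
    }
    f.rmsEnergy = sqrt(sumSquares / count);
    f.zeroCrossingRate = (double)crossings / count;
    return f;
}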
Forgive me the self-promotion, but my company developed an SDK that provides an answer to the question you are asking: "Is there a way to 'teach' an app a sound such that the app can subsequently recognize it?"
I am not sure if the specific sound of blowing out candles would work, as the SDK was primarily aimed at applications involving somewhat percussive sounds, but it might still work for your case. Here is a link, where you will also find a demo program you can download and try if you like: SampleSumo PSR SDK
I am working on a game for iPhone that is fully usable by providing YES / NO responses.
It would be great to make this game available to blind users, runners, and people driving cars by allowing voice control. This does not require full speech recognition, I am looking to implement keyword spotting.
I can already detect the start and stop of utterances, and have implemented this at https://github.com/fulldecent/FDSoundActivatedRecorder. The next step is to distinguish between YES and NO responses reliably for a wide variety of users.
THE QUESTION: For reasonable performance (distinguish YES / NO / STOP within 0.5 sec after speech stops), is AVAudioRecorder a reasonable choice? Is there a published algorithm that meets these needs?
Your best bet here is OpenEars, a free and open voice recognition platform for iOS.
http://www.politepix.com/openears/
You most likely DO NOT want to get into the algorithmic side of this. It's massive and nasty - there is a reason only a small number of companies do voice recognition from scratch.
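As a rough sketch of what the OpenEars route looks like for keyword spotting (based on the politepix tutorial of the time; class names and signatures have changed across OpenEars versions, so treat this as an outline rather than exact current API):

#import <OpenEars/LanguageModelGenerator.h>
#import <OpenEars/PocketsphinxController.h>
#import <OpenEars/AcousticModel.h>

// Generate a tiny language model containing only the keywords we care about.
LanguageModelGenerator *lmGenerator = [[LanguageModelGenerator alloc] init];
NSArray *words = @[@"YES", @"NO", @"STOP"];
NSError *err = [lmGenerator generateLanguageModelFromArray:words
                                            withFilesNamed:@"YesNoModel"
                                    forAcousticModelAtPath:[AcousticModel pathToModel:@"AcousticModelEnglish"]];

if ([err code] == noErr) {
    NSDictionary *results = [err userInfo];
    NSString *lmPath  = [results objectForKey:@"LMPath"];
    NSString *dicPath = [results objectForKey:@"DictionaryPath"];

    // Start listening; hypotheses arrive via OpenEarsEventsObserver's delegate callbacks.
    [self.pocketsphinxController startListeningWithLanguageModelAtPath:lmPath
                                                      dictionaryAtPath:dicPath
                                                   acousticModelAtPath:[AcousticModel pathToModel:@"AcousticModelEnglish"]
                                                   languageModelIsJSGF:NO];
}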
I am working on making an app that performs an action when the sound of a clap is recognized. I have looked into simply measuring the average and peak power from an AVAudioRecorder and this works okay, but if there are other sounds then it reports lots of false positives. I believe I need some kind of audio fingerprinting for this to work while other audio is playing. Now I know that this has been asked a lot before on SO, but most of the answers say something along the lines of "Use FFT" and then the person says "Oh okay!" but no clear explanation is given and I still have no idea how to correctly identify sounds using an FFT.
Can anyone clearly explain, cite another tutorial, or post a link to a library that can identify sounds using audio fingerprinting?
Thanks!
I am trying to build an app that allows the user to record individual people speaking, and then save the recordings on the device and tag each recording with the name of the person who spoke. Then there is a detection mode, in which I record someone and the app can tell what their name is if they are in the local database.
First of all - is this possible at all? I am very new to iOS development and not so familiar with the available APIs.
More importantly, which API should I use (ideally free) to correlate between the incoming voice and the recordings I have in the local db? This should behave something like Shazam, but much simpler, since the database I am looking for a match against is much smaller.
If you're new to iOS development, I'd start with the core app to record the audio and let people manually choose a profile/name to attach it to and worry about the speaker recognition part later.
You obviously have two options for the recognition side of things: You can either tie in someone else's speech authentication/speaker recognition library (which will probably be in C or C++), or you can try to write your own.
How many people are going to use your app? You might be able to create something basic yourself: if it's the difference between a man and a woman, you could probably figure that out by doing an FFT spectral analysis of the audio and finding where the frequency peaks are. Obviously the frequencies used to enunciate different phonemes are going to vary somewhat, so solving the general case for two people who sound fairly similar is probably hard. You'll need to train the system with a bunch of text and build some kind of model of frequency distributions. You could try to do clustering or something, but you're going to run into a fair bit of maths fairly quickly (Gaussian mixture models, et al.). There are libraries/projects that'll do this; you might be able to port this one from MATLAB, for example: https://github.com/codyaray/speaker-recognition
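As an illustration of that spectral-analysis idea, here's a sketch using the Accelerate framework's vDSP FFT to find the strongest frequency in a 4096-sample frame (my own example; a naive peak pick like this will happily land on a harmonic rather than the fundamental, so treat it as a starting point, not a finished pitch estimator):

#import <Accelerate/Accelerate.h>

// Returns the frequency (Hz) of the strongest bin in a 4096-sample mono float frame.
static double DominantFrequency(const float *samples, double sampleRate) {
    const vDSP_Length log2n = 12;          // 2^12 = 4096-point FFT
    const vDSP_Length n = 1 << log2n;
    FFTSetup setup = vDSP_create_fftsetup(log2n, kFFTRadix2);

    // Apply a Hann window to reduce spectral leakage.
    float window[4096], windowed[4096];
    vDSP_hann_window(window, n, vDSP_HANN_NORM);
    vDSP_vmul(samples, 1, window, 1, windowed, 1, n);

    // Pack the real signal into split-complex form and run an in-place real FFT.
    float real[2048], imag[2048];
    DSPSplitComplex split = { real, imag };
    vDSP_ctoz((const DSPComplex *)windowed, 2, &split, 1, n / 2);
    vDSP_fft_zrip(setup, &split, 1, log2n, FFT_FORWARD);

    // Squared magnitudes, then find the biggest bin (skipping DC at index 0).
    float mags[2048];
    vDSP_zvmags(&split, 1, mags, 1, n / 2);
    float peak = 0;
    vDSP_Length peakIndex = 0;
    vDSP_maxvi(mags + 1, 1, &peak, &peakIndex, n / 2 - 1);

    vDSP_destroy_fftsetup(setup);
    return (double)(peakIndex + 1) * sampleRate / (double)n;
}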
If you want to take something off-the-shelf, I'd go with a straight C library like Mistral, as it should be relatively easy to call into from Objective-C.
The SpeakHere sample code should get you started for audio recording and playback.
Also, it may well take the user longer to train your app to recognise them than they'd save by just picking their name from a list. Unless you're intending their voice to be some kind of security passport type thing, it might just not be worth bothering with.
My aim is to code a project which records human sound and changes it (with effects).
E.g.: a person will record their voice over the microphone (speak for a while), and then the program makes it sound like a baby.
This should run efficiently and fast (the altering operation must run while recording, too).
What is the optimal way to do it?
Thanks
If you're looking for either XNA or DirectX to do this for you, I'm pretty sure you're going to be out of luck (I don't have much experience with DirectSound; maybe somebody can correct me). What it sounds like you want to do is realtime digital signal processing, which means that you're either going to need to write your own code to manipulate the raw waveform, or find somebody else who's already written the code for you.
If you don't have experience writing this sort of thing, it's probably best to use somebody else's signal processing library, because this sort of thing can quickly get complicated. Since you're developing for the PC, you're in luck; you can use any library you like using P/Invoke. You might try out some of the solutions suggested here and here.
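To give a flavour of the DSP involved, here's a hypothetical, deliberately naive pitch shifter in plain C (my own sketch, not from any library). It resamples the buffer with linear interpolation, so a factor above 1.0 raises the pitch toward "baby" territory but also shortens the clip; real voice changers use techniques like PSOLA or a phase vocoder to shift pitch without changing speed. Being plain C, it could be compiled into a native DLL and called from C# via P/Invoke:

#include <stddef.h>

// Naive pitch shift by resampling: reads the input at `factor` times normal
// speed, interpolating between neighbouring samples. factor > 1.0 raises
// pitch (and shortens duration); factor < 1.0 lowers it. Returns the number
// of samples written to `out`.
size_t naive_pitch_shift(const float *in, size_t inCount,
                         float *out, size_t outCapacity, double factor)
{
    size_t outCount = 0;
    for (double pos = 0.0;
         (size_t)pos + 1 < inCount && outCount < outCapacity;
         pos += factor) {
        size_t i = (size_t)pos;
        double frac = pos - (double)i;
        out[outCount++] = (float)((1.0 - frac) * in[i] + frac * in[i + 1]);
    }
    return outCount;
}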
MSDN has some info about XNA's Audio namespace, and the microphone recording support introduced in XNA 4.0:
Working with Microphones
Recording Audio from a Microphone
Keep in mind that recorded data is returned in PCM format.
I am searching for an algorithm to determine whether realtime audio input matches one of 144 given (and comfortably distinct) phoneme-pairs.
Preferably the lowest level that does the job.
I'm developing radical / experimental musical training software for iPhone / iPad.
My musical system comprises 12 consonant phonemes and 12 vowel phonemes, demonstrated here. That makes 144 possible phoneme pairs. The student has to sing the correct phoneme pair 'laa duu bee' etc in response to visual stimulus.
I have done a lot of research into this, and it looks like my best bet may be to use one of the iOS Sphinx wrappers (iPhone App › Add voice recognition? is the best source of information I have found). However, I can't see how I would adapt such a package; can anyone with experience using one of these technologies give a basic rundown of the steps that would be required?
Would training by the user be necessary? I would have thought not, as it is such an elementary task compared with full language models of thousands of words and a far larger and more subtle phoneme base. However, it would be acceptable (not ideal) to have the user train 12 phoneme pairs: { consonant1+vowel1, consonant2+vowel2, ..., consonant12+vowel12 }. The full 144 would be too burdensome.
Is there a simpler approach? I feel like using a fully featured continuous speech recogniser is using a sledgehammer to crack a nut. It would be far more elegant to use the minimum technology that would solve the problem.
So really I'm hunting for any open source software that recognises phonemes.
PS: I need a solution which runs pretty much in real-time. So even as the student is singing the note, it first blinks to show that it picked up the phoneme pair that was sung, and then it glows to show whether they are singing the correct pitch.
If you are looking for a phone-level open source recogniser, then I would recommend HTK. Very good documentation is available with this tool in the form of the HTK Book. It also contains an entire chapter dedicated to building a phone-level real-time speech recogniser. From your problem statement above, it seems to me that you might be able to re-work that example into your own solution. Possible pitfalls:
Since you want to build a phone-level recogniser, the amount of data needed to train the phone models would be very large. Your training database should also be balanced in terms of the distribution of the phones.
Building a speaker-independent system would require data from more than one speaker. And lots of it, too.
Since this is open source, you should also check the licensing info for any restrictions on shipping the code. A good alternative would be to use the on-phone recorder and then send the recorded waveform over a data channel to a server for the recognition, pretty much like what Google does.
I have a little bit of experience with this type of signal processing, and I would say that this is probably not the type of finite question that can be answered definitively.
One thing worth noting is that although you may restrict the phonemes you are interested in, the possibility space remains the same (i.e. infinite-ish). User training might help the algorithms along a bit, but useful training takes quite a bit of time and it seems you are averse to too much of that.
Using Sphinx is probably a great start on this problem. I haven't gotten very far in the library myself, but my guess is that you'll be working with its source code yourself to get exactly what you want. (Hooray for open source!)
...using a sledgehammer to crack a nut.
I wouldn't label your problem a nut, I'd say it's more like a beast. It may be a different beast than natural language speech recognition, but it is still a beast.
All the best with your problem solving.
Not sure if this would help: check out OpenEars' LanguageModelGenerator. OpenEars uses Sphinx and other libraries.
http://www.hfink.eu/matchbox
This page links to both a YouTube video demo and the GitHub source.
I'm guessing it would still be a lot of work to mould it into the shape I'm after, but it definitely does do a lot of the work already.