I am working on a game for iPhone that is fully usable by providing YES / NO responses.
It would be great to make this game available to blind users, runners, and people driving cars by allowing voice control. This does not require full speech recognition; I am looking to implement keyword spotting.
I can already detect start and stop of utterances, and have implemented this at https://github.com/fulldecent/FDSoundActivatedRecorder. The next step is to distinguish between YES and NO responses reliably for a wide variety of users.
THE QUESTION: For reasonable performance (distinguishing YES / NO / STOP within 0.5 sec after speech stops), is AVAudioRecorder a reasonable choice? Is there a published algorithm that meets these needs?
Your best bet here is OpenEars, a free and open voice recognition platform for iOS.
http://www.politepix.com/openears/
You most likely DO NOT want to get into the algorithmic side of this. It's massive and nasty - there is a reason only a small number of companies do voice recognition from scratch.
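That said, if you do want to experiment, a small fixed vocabulary like YES/NO/STOP is the classic territory of template matching with dynamic time warping (DTW). Here's a toy Python sketch of the matching step only; the 1-D feature sequences are made up for illustration, where a real implementation would compare per-frame MFCC vectors extracted from the audio:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences,
    tolerant of the utterances being spoken faster or slower."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

def classify(utterance, templates):
    """Return the label of the nearest template by DTW distance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))

# Toy 1-D "feature" tracks standing in for real per-frame MFCC vectors
templates = {
    "YES": [0.1, 0.8, 0.9, 0.3],
    "NO":  [0.9, 0.2, 0.1, 0.1],
}
print(classify([0.2, 0.7, 0.8, 0.2], templates))  # nearest to the YES template
```

Even this much only works well for a single enrolled speaker; making it robust across many voices is exactly the hard part the frameworks solve for you.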
Related
I am working on an application that gathers a user's voice input for an IVR. The input we're capturing is a limited set of proper nouns, but even though we have added hints for all of the possible options, we very frequently get back unintelligible results, possibly because our users have accents from all parts of the world. I'm looking for a way to improve the speech recognition results beyond just using hints. The available Google adaptive classes will not be useful, as there are none that match the type of input we're gathering. I see that Twilio recently added something called experimental_utterances that may help, but I'm finding little technical documentation on what it does or how to implement it.
Any guidance on how to improve our speech recognition results?
Google does a decent job of recognizing proper names, but only asynchronously, not in real time. I've not seen a PaaS tool that can do this in real time. I recommend you change your approach: identify callers based on ANI or account number, or have them record their name for manual transcription.
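One post-processing trick that often helps with a closed set of proper nouns: fuzzy-match whatever transcript comes back against your known option list, which absorbs many accent-driven misrecognitions. A minimal sketch using Python's standard difflib (the name list here is invented for illustration):

```python
import difflib

VALID_NAMES = ["Nguyen", "Kowalski", "Okafor", "Srinivasan"]  # your known option set

def best_match(raw_transcript, choices=VALID_NAMES, cutoff=0.6):
    """Snap a noisy transcript to the closest known option, or None
    if nothing is similar enough to trust."""
    matches = difflib.get_close_matches(raw_transcript, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(best_match("Sreenivasan"))  # -> "Srinivasan"
```

The cutoff is a tuning knob: raise it to reduce false matches, lower it to accept sloppier transcripts. When no candidate clears the cutoff, re-prompt the caller rather than guessing.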
david
I'm using speech recognition in my app. It's quite important for user experience, so I want it to be good (and free or cheap).
Right now, I'm using Speech Kit from Apple, and it works like a charm, but it's not very reliable because there are per-app and per-device limits, and I don't know what those limits are.
Another option is to use OpenEars. It's not nearly as good as Speech Kit for me, so I'm thinking about silently switching from Speech Kit to OpenEars when Speech Kit is not working (and back, when Speech Kit is alive and well).
But is there a way to know that Speech Kit is not working right now before ever using it?
The only way I know of is to try to recognise some audio file before every user session, but that takes time (at least several seconds, and several seconds is a lot), and it's not a very good solution in terms of service usage: it seems too costly to recognise audio just to check whether Speech Kit is working or not. Also, I don't know how to debug this, because right now I don't hit any limits in my app.
What is the best way to solve this?
I also thought about this question not long ago. Here's an answer from the Apple Q & A: "The current rate limit for the number of SFSpeechRecognitionRequest calls a device can make is 1000 requests per hour." There is also an example of the error you receive when the limit is reached, so you can prepare for that :)
Here's the link: Apple Q & A
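Rather than probing the service up front, one option is to react to the error when it happens: catch the rate-limit failure and route requests to your fallback engine for a cooldown window. A language-agnostic sketch of that pattern in Python (the recognizer callables are stand-ins for your Speech Kit and OpenEars wrappers, and RateLimitError stands in for the actual error Apple returns):

```python
import time

class RateLimitError(Exception):
    """Stand-in for whatever error your primary engine raises at its quota."""

class RecognizerWithFallback:
    """Use the primary recognizer until it hits a rate limit, then route
    requests to the backup engine for a cooldown period."""

    def __init__(self, primary, backup, cooldown_seconds=3600.0):
        self.primary = primary
        self.backup = backup
        self.cooldown = cooldown_seconds
        self.blocked_until = 0.0  # timestamp before which primary is skipped

    def recognize(self, audio):
        if time.time() >= self.blocked_until:
            try:
                return self.primary(audio)
            except RateLimitError:
                self.blocked_until = time.time() + self.cooldown
        return self.backup(audio)

# Stand-ins for your Speech Kit and OpenEars wrappers
def exhausted_primary(audio):
    raise RateLimitError("quota exhausted")

recognizer = RecognizerWithFallback(exhausted_primary, lambda audio: "backup result")
print(recognizer.recognize(b""))  # primary fails once, backup answers
```

This costs you one failed request when the limit is first hit, but nothing in the common case, and it never burns quota on a probe.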
I've been researching several iOS speech recognition frameworks and have found it hard to accomplish something I would think is pretty straightforward.
I have an app that allows people to record their voices. After a recording is made, they have the option to create a text version.
Looking into the services out there (e.g., Nuance), most require you to use the microphone. OpenEars allows you to transcribe a recording, but its dictionary is very limited because it is an offline solution (they recommend 300 words or fewer).
There are a few other things going on with the app that would make it very unappealing to switch from the current recording method. For what it is worth, I am using the Amazing Audio Engine framework.
Does anyone have any other suggestions for frameworks? Or is there a way to dig deeper with Nuance to transcribe a recorded file?
Thank you for your time.
There are a few cloud-hosted speech recognition services you can use. You simply post the audio file to their URL and receive the text back. Most of them have no constraint on the vocabulary, and you can of course choose any recording method you like.
See here: Server-side Voice Recognition. Many of them offer free trials as well.
I want to build an app that responds to the sound you make when blowing out birthday candles. This is not speech recognition per se (that sound isn't a word in English), and the very kind Halle over at OpenEars told me that it's not possible using that framework. (Thanks for your quick response, Halle!)
Is there a way to "teach" an app a sound such that the app can subsequently recognize it?
How would I go about this? Is it even doable? Am I crazy or taking on a problem that is much more difficult than I think it is? What should my homework be?
The good news is that it's achievable and you don't need any third party frameworks—AVFoundation is all you really need.
There's a good article from Mobile Orchard that covers the details but, somewhat inevitably for a four-year-old article, there are some gotchas you need to be aware of.
Before you begin recording on a real device, I found I needed to set the audio session category, like so:
[[AVAudioSession sharedInstance] setCategory:AVAudioSessionCategoryPlayAndRecord error:nil];
Play around with the threshold in this line:
if (lowPassResults > 0.95)
I found 0.95 to be too high and got better results setting it somewhere between 0.55 and 0.75. Similarly, I played around with the 0.05 multiplier in this line:
double peakPowerForChannel = pow(10, (0.05 * [recorder peakPowerForChannel:0]));
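For reference, those two lines amount to a dB-to-linear conversion followed by exponential smoothing and a threshold test. Here is the same logic sketched in Python; the alpha and threshold values are tuning knobs as noted above, not canonical:

```python
def detect_blow(peak_powers_db, alpha=0.5, threshold=0.65):
    """Return True once the smoothed linear level crosses the threshold.

    peak_powers_db: per-tick readings like AVAudioRecorder's
    peakPowerForChannel:, i.e. decibels where 0 dB is full scale.
    """
    low_pass = 0.0
    for db in peak_powers_db:
        linear = 10 ** (0.05 * db)  # dB -> linear amplitude in (0, 1]
        # Exponential smoothing: blend the new reading with the running value
        low_pass = alpha * linear + (1 - alpha) * low_pass
        if low_pass > threshold:
            return True
    return False

print(detect_blow([-40, -35, -3, -2, -1]))  # loud sustained input trips it
```

Seeing it this way makes the tuning advice concrete: the threshold sets how loud the sustained level must be, and the 0.05 multiplier is just the standard dB conversion, so changing it effectively rescales all your readings.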
Using simple thresholds on energy levels would probably not be robust enough for your use case.
A good way to go about this would be to first extract some properties from the sound stream that are specific to the sound of blowing out candles. Then use a machine learning algorithm to train a model based on training examples (a set of recordings of the sound you want to recognize), which can then be used to classify snippets of sound coming into your microphone in real-time when using the application.
Given the possible environmental sounds going on while you blow out candles (birthdays are always noisy, aren't they?), it may be difficult to train a model that is robust enough to these background sounds. This is not a simple problem if you care about accuracy.
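As a rough illustration of that pipeline, here is a toy Python sketch: hand-picked spectral features plus a nearest-centroid classifier, trained on synthetic signals standing in for real recordings. A serious system would use richer features (e.g., MFCCs) and a proper classifier, and would need real training data:

```python
import numpy as np

def features(signal, sr=16000):
    """Toy feature vector: zero-crossing rate and normalized spectral centroid."""
    zcr = np.mean(np.abs(np.diff(np.sign(signal)))) / 2.0
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sr)
    centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
    return np.array([zcr, centroid / (sr / 2)])

def train(examples):
    """Nearest-centroid model: mean feature vector per label."""
    return {label: np.mean([features(s) for s in sigs], axis=0)
            for label, sigs in examples.items()}

def classify(model, signal):
    """Label an incoming snippet by its closest class centroid."""
    f = features(signal)
    return min(model, key=lambda label: np.linalg.norm(f - model[label]))

# Synthetic stand-ins: broadband noise ~ "blow", a pure tone ~ "other"
rng = np.random.default_rng(0)
blow = [rng.standard_normal(1600) for _ in range(5)]
tone = [np.sin(2 * np.pi * 440 * np.arange(1600) / 16000) for _ in range(5)]
model = train({"blow": blow, "other": tone})
print(classify(model, rng.standard_normal(1600)))  # noisy input -> "blow"
```

The breathy, broadband character of a blow is what separates it from tonal sounds here; the hard part in practice is that party chatter and clapping are also broadband, which is exactly the robustness problem described above.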
It may be doable though:
Forgive me the self-promotion, but my company developed an SDK that provides an answer to the question you are asking: "Is there a way to "teach" an app a sound such that the app can subsequently recognize it?"
I am not sure if the specific sound of blowing out candles would work, as the SDK was primarily aimed at applications involving somewhat percussive sounds, but it might still work for your case. Here is a link, where you will also find a demo program you can download and try if you like: SampleSumo PSR SDK
I am trying to build an app that allows the user to record individual people speaking, then save the recordings on the device and tag each one with the name of the person who spoke. Then there is a detection mode, in which I record someone and can tell what his name is if he is in the local database.
First of all - is this possible at all? I am very new to iOS development and not so familiar with the available APIs.
More importantly, which API should I use (ideally free) to correlate the incoming voice with the recordings I have in the local db? It should behave something like Shazam, but much simpler, since the database I am matching against is much smaller.
If you're new to iOS development, I'd start with the core app to record the audio and let people manually choose a profile/name to attach it to and worry about the speaker recognition part later.
You obviously have two options for the recognition side of things: You can either tie in someone else's speech authentication/speaker recognition library (which will probably be in C or C++), or you can try to write your own.
How many people are going to use your app? You might be able to create something basic yourself: if it's the difference between a man and a woman, you could probably figure that out by doing an FFT spectral analysis of the audio and seeing where the frequency peaks are. The frequencies used to enunciate different phonemes will vary somewhat, so solving the general case for two people who sound fairly similar is probably hard. You'll need to train the system with a bunch of speech samples and build some kind of model of frequency distributions. You could try clustering or something, but you're going to run into a fair bit of maths fairly quickly (Gaussian mixture models, et al.). There are libraries/projects that'll do this. You might be able to port this one from MATLAB, for example: https://github.com/codyaray/speaker-recognition
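To make the first idea concrete, here is a minimal Python sketch of the FFT peak-picking approach; the 165 Hz split point is a rough rule of thumb for typical male versus female fundamental frequency (roughly 85-155 Hz versus 165-255 Hz in speech), not a reliable classifier:

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Estimate the strongest frequency component via FFT peak picking."""
    windowed = signal * np.hanning(len(signal))  # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

def guess_speaker(signal, sample_rate, split_hz=165):
    """Crude pitch-based split; real speaker recognition needs far more."""
    if dominant_frequency(signal, sample_rate) < split_hz:
        return "low-pitched"
    return "high-pitched"

sr = 8000
t = np.arange(sr) / sr
print(guess_speaker(np.sin(2 * np.pi * 120 * t), sr))  # 120 Hz tone -> "low-pitched"
```

On real speech you would run this per short frame and track the peak over time, since the dominant component shifts with each phoneme; that instability is exactly why the general case drives you toward the Gaussian mixture models mentioned above.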
If you want to take something off-the-shelf, I'd go with a straight C library like mistral, as it should be relatively easy to call into from Objective-C.
The SpeakHere sample code should get you started for audio recording and playback.
Also, it may well take longer for the user to train your app to recognise them than the time it saves over just picking their name from a list. Unless you're intending their voice to be some kind of security passport, it might just not be worth bothering with.