Best approach to compare recognized speech with a known text - iOS

Given a known manuscript (text) which I expect the user to read (more or less accurately), what is the best approach to recognize the user's progress within the manuscript?
While I'm searching for a particular solution on iOS, I'm also interested in a more general answer.
iOS provides a speech recognition framework called Speech that I can use to recognize arbitrary speech. My current approach is to take the string results of this framework and match them against the manuscript. However, it seems to me that this carries quite a bit of overhead and that it would save resources and increase precision if I could first feed the speech recognizer the expected words so that it "knows" what to listen for.
For example, when the next word in the manuscript is "fish", I don't need the speech recognizer to search the whole English-language dictionary for the word that best matches the recorded audio – I only need a probability value for how likely it is that the user just said "fish".
I think this is very similar to keyword spotting, except that I'm not spotting just a few keywords but the words of a whole manuscript.
Unfortunately, I haven't been able to find such an API on iOS. Is there any better approach to achieve this "speech tracking" than the one described above?

However, it seems to me that this carries quite a bit of overhead and that it would save resources and increase precision if I could first feed the speech recognizer the expected words so that it "knows" what to listen for.
Maybe it would, but the speech framework provides no way for you to do that, so you can't.
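Since the framework only hands back plain transcriptions, the tracking itself has to be implemented on top of those string results. A minimal sketch of that approach in Swift is below; ManuscriptTracker and its word-by-word matching are purely illustrative (not framework API), while SFSpeechRecognizer, SFSpeechAudioBufferRecognitionRequest, shouldReportPartialResults and contextualStrings are real Speech framework API. Audio-engine setup and authorization are omitted.

import Speech

// Illustrative helper (not framework API): tracks how far the reader has
// progressed through the known manuscript by matching transcribed words
// against the expected word sequence in order.
final class ManuscriptTracker {
    private let manuscriptWords: [String]
    private(set) var currentIndex = 0

    init(manuscript: String) {
        manuscriptWords = ManuscriptTracker.normalize(manuscript)
    }

    // Each partial result contains the full transcription of the session so far,
    // so progress is recomputed from the start on every callback.
    func update(with transcription: String) {
        var index = 0
        for word in ManuscriptTracker.normalize(transcription) where index < manuscriptWords.count {
            if word == manuscriptWords[index] { index += 1 }
        }
        currentIndex = index
    }

    static func normalize(_ text: String) -> [String] {
        text.lowercased()
            .components(separatedBy: .whitespacesAndNewlines)
            .map { $0.trimmingCharacters(in: .punctuationCharacters) }
            .filter { !$0.isEmpty }
    }
}

let manuscript = "One fish two fish red fish blue fish"
let tracker = ManuscriptTracker(manuscript: manuscript)

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
let request = SFSpeechAudioBufferRecognitionRequest()
request.shouldReportPartialResults = true
// contextualStrings only nudges recognition toward these words; it is not a grammar.
request.contextualStrings = Array(Set(ManuscriptTracker.normalize(manuscript)))

_ = recognizer.recognitionTask(with: request) { result, _ in
    guard let result = result else { return }
    tracker.update(with: result.bestTranscription.formattedString)
    print("Reader is at word \(tracker.currentIndex)")
}

As far as I can tell, contextualStrings is the closest the framework gets to "feeding it the expected words", and it is documented as a hint for unusual phrases rather than a way to constrain recognition.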

Related

Improving Twilio Speech Recognition of Proper Nouns

I am working on an application that gathers a user's voice input for an IVR. The input we're capturing is a limited set of proper nouns, but even though we have added hints for all of the possible options, we very frequently get back unintelligible results, possibly because our users have various accents from all parts of the world. I'm looking for a way to further improve the speech recognition results beyond just using hints. The available Google adaptive classes will not be useful, as there are none that match the type of input we're gathering. I see that Twilio recently added something called experimental_utterances that may help, but I'm finding little technical documentation on what it does or how to implement it.
Any guidance on how to improve our speech recognition results?
Google does a decent job of recognizing proper names, but only asynchronously, not in real time, and I've not seen a PaaS tool that can do this in real time. I recommend you change your approach: identify callers based on ANI or an account number, or have them record their name for manual transcription.
david

Does the iOS Speech API support grammar?

I was investigating various speech recognition strategies and I liked the idea of grammars as defined in the Web Speech spec. It seems that if you can tell the speech recognition service that you expect “Yes” or “No”, the service could more reliably recognize a “Yes” as “Yes”, a “No” as “No”, and hopefully also be able to say “it didn’t sound like either of those!”.
However, in SFSpeechRecognitionRequest, I only see taskHint with values from SFSpeechRecognitionTaskHint of confirmation, dictation, search, and unspecified.
I also see SFSpeechRecognitionRequest.contextualStrings, but it seems to be for a different purpose. I.e., I think I should put brands/trademark type things in there. Putting “Yes” and “No” in wouldn’t make those words any more likely to be selected because they already exist in the system dictionary (this is an assumption I’m making based on the little the documentation says).
Is there a way with the API to do something more like grammars or, even more simply, just to provide a list of expected phrases so that the speech recognition is more likely to come up with a result I expect instead of similar-sounding gibberish/homophones? Does contextualStrings perhaps increase the likelihood that the system chooses one of those strings instead of just expanding the system dictionary? Or maybe I’m taking the wrong approach and am supposed to enforce the grammar on my own and enumerate over SFSpeechRecognitionResult.transcriptions until I find one matching an expected word?
Unfortunately, I can’t test these APIs myself; I am merely researching the viability of writing a native iOS app and do not have the necessary development environment.
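For reference, here is roughly what the two fallbacks mentioned above look like in code: taskHint plus contextualStrings as hints, and a "grammar" enforced by hand by scanning SFSpeechRecognitionResult.transcriptions. This is only a sketch with audio setup omitted, and it does not settle the question of whether contextualStrings boosts words that are already in the system vocabulary.

import Speech

let expectedPhrases = ["yes", "no"]

let request = SFSpeechAudioBufferRecognitionRequest()
request.taskHint = .confirmation            // the closest built-in hint for a yes/no style answer
request.contextualStrings = expectedPhrases // documented as a recognition hint, not a constraining grammar

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
_ = recognizer.recognitionTask(with: request) { result, _ in
    guard let result = result, result.isFinal else { return }
    // Hand-rolled "grammar": scan every candidate transcription for an expected phrase.
    let candidates = result.transcriptions.map {
        $0.formattedString.lowercased().trimmingCharacters(in: .whitespacesAndNewlines)
    }
    if let match = candidates.first(where: { expectedPhrases.contains($0) }) {
        print("Recognized expected phrase: \(match)")
    } else {
        print("It didn’t sound like any of the expected phrases.")
    }
}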

Openears detecting words not in the dictionary and/or language file

I am working on a navigation app that uses voice recognition to point the user towards their destination. However, this may require phrases or words outside the obvious set. "Navigate to" would be simple, since I can add it to the dictionary, but "starbucks" won't be so easy. I could simply add starbucks, but that only solves one venue out of the hundreds of thousands that have non-standard names. I am looking for a more general way to handle this.
Is there any way to set up or configure OpenEars to detect and understand all words said?

Open Ears API says every sound it hears is a word, even a cough

I am trying to use OpenEars for a small part of my app. I have three or four keywords that I want to be able to "listen" for, something like "Add", "Subtract", etc. I am just using the sample app found here. I want to have a special case in the app when I hear "Add" etc., as opposed to a word that is not one of my four keywords. Right now I set my language to contain only the four keywords, but whenever the OpenEars API hears anything, it picks between my four keywords. So if I cough, it picks the closest word out of the four.
How can I listen for a specific word without always choosing one of the keywords?
I was thinking I could have a whole bunch of words, a few hundred, and just check which word was spoken, with a special case for my four keywords, but I don't want to have to type out each word. Does OpenEars provide any default languages?
OpenEars developer here. Check out the dynamic grammar generation API that was just added in OpenEars 1.7 which may provide the right results for your requirements: http://www.politepix.com/2014/04/10/openears-1-7-introducing-dynamic-grammar-generation/
This approach might be more suitable for keyword detection and detection of fixed phrases. Please bring further questions to the OpenEars forums if you'd like to troubleshoot them with me.

voice recognition on iOS - convert OOV words to phonemes on iOS?

As suggested on StackOverflow, I’ve tried OpenEars successfully and can generate custom vocabularies from arrays of NSStrings.
However, we also need to recognize names from the address book, and here the fallback method very often fails miserably…
I could write a parser and dynamically transcribe the texts (mainly French- and Dutch-sounding names) to phonemes myself, but that would be a lot of (guessing) work… I’m pretty sure the data I need is generated somewhere in the recognition process, so maybe someone could point me to a hook in the OpenEars or Flite code that I can exploit on iOS?
Or is there some other library that would convert user speech to a string of phonemes I can feed into OpenEars?
The right way to recognize names in OpenEars is to put specific pronunciations in the phonetic dictionary. You do not need to analyze phonetic strings yourself; in fact, the recognizer has no notion of a phonetic string for the audio, so you cannot even retrieve one. Also, there is no clear correspondence between the audio and a phoneme sequence.
For example, the grapheme-to-phoneme code might derive the following pronunciation:
tena T IH N
While the correct pronunciation is
tena T EH N AH
With the incorrect predicted pronunciation, the recognizer will not be able to recognize the name. With the corrected one, it will recognize the name accurately.
The problem is that the automatic word-to-phoneme conversion in OpenEars might fail, and for foreign words it fails even more frequently. What you need to do is add the names to the dictionary so that the recognizer knows their proper phonetic sequences. If the proper sequence is known, the recognizer will be able to detect the word by itself. You can also improve the grapheme-to-phoneme code in OpenEars to make it more accurate. Modern PocketSphinx uses the Phonetisaurus API, which is both more accurate than Flite and also trainable on special cases like foreign names.
For any accuracy issue, the first recommendation is to collect a database of test samples in order to enable a stringent accuracy analysis. Once you have such a database, you can improve accuracy significantly. See the following for details:
http://cmusphinx.sourceforge.net/wiki/faq#qwhy_my_accuracy_is_poor
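The accuracy analysis referred to above boils down to comparing recognizer output against reference transcripts, usually as word error rate. As a self-contained illustration (plain Swift, independent of OpenEars or CMUSphinx), this is the standard edit-distance computation over words:

// Word error rate: (substitutions + deletions + insertions) / number of reference words,
// computed as the Levenshtein distance between the two word sequences.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.lowercased().split(separator: " ").map(String.init)
    let hyp = hypothesis.lowercased().split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }   // degenerate cases
    guard !hyp.isEmpty else { return 1 }

    // dp[i][j] = edit distance between the first i reference words
    // and the first j hypothesis words.
    var dp = Array(repeating: Array(repeating: 0, count: hyp.count + 1),
                   count: ref.count + 1)
    for i in 0...ref.count { dp[i][0] = i }
    for j in 0...hyp.count { dp[0][j] = j }
    for i in 1...ref.count {
        for j in 1...hyp.count {
            let substitution = dp[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1)
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        }
    }
    return Double(dp[ref.count][hyp.count]) / Double(ref.count)
}

// One substitution ("tena" -> "tina") in a four-word reference gives a word error rate of 0.25.
print(wordErrorRate(reference: "please call tena today",
                    hypothesis: "please call tina today"))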

Resources