Voice recognition on iOS - convert OOV words to phonemes on iOS?

As suggested on StackOverflow, I’ve tried OpenEars successfully, generating custom vocabularies from arrays of NSStrings.
However, we also need to recognize names from the address book, and here the fallback method fails miserably very often.
I could write a parser and dynamically transcribe the texts (mainly French- and Dutch-sounding names) to phonemes myself, but that would be a lot of (guess)work. I’m pretty sure the data I need is generated somewhere in the recognition process, so maybe someone could point me to a hook in the OpenEars or Flite code that I can exploit on iOS?
Or some other library that would convert user speech to a string of phonemes I can feed into OpenEars?

The right way to recognize names in OpenEars is to put specific pronunciations in the phonetic dictionary. You do not need to analyze phonetic strings yourself; in fact, the recognizer does not keep any information about the phonetic string, so you cannot retrieve it. Also, there is no clear correspondence between the audio and a phoneme sequence.
For example, the grapheme-to-phoneme code can infer the following pronunciation:
tena T IH N
While the correct pronunciation is
tena T EH N AH
With the incorrect pronunciation predicted, the recognizer will not be able to recognize the name. With the corrected one, it will recognize the name accurately.
The problem is that automatic word-to-phoneme conversion in OpenEars might fail, and for foreign words it might fail even more frequently. What you need to do is add the names to the dictionary so that the recognizer knows their proper phonetic sequences. If the proper sequence is known, the recognizer will be able to detect the word by itself. You can also improve the grapheme-to-phoneme code in OpenEars to make it more accurate. Modern pocketsphinx uses the Phonetisaurus API, which is both more accurate than Flite and trainable on special cases like foreign names.
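A minimal sketch of the dictionary step, assuming the dictionary is kept as plain "word PHONE PHONE ..." lines like the examples above. The function name merge_pronunciations and the sample entries are hypothetical and not part of OpenEars; how the resulting lines get loaded into the recognizer is left out.

```python
def merge_pronunciations(guessed, corrected):
    """Return dictionary lines where hand-checked pronunciations
    replace the automatically guessed ones.

    Both arguments map a word to a list of phoneme symbols.
    (Hypothetical helper, not an OpenEars API.)
    """
    merged = {word.lower(): phones for word, phones in guessed.items()}
    # Hand-checked entries win over grapheme-to-phoneme guesses.
    merged.update({word.lower(): phones for word, phones in corrected.items()})
    return [f"{word} {' '.join(phones)}" for word, phones in sorted(merged.items())]


guessed = {"tena": ["T", "IH", "N"], "hello": ["HH", "AH", "L", "OW"]}
corrected = {"Tena": ["T", "EH", "N", "AH"]}
lines = merge_pronunciations(guessed, corrected)
print("\n".join(lines))
```

Keeping the corrections in a separate table means they survive when the guessed entries are regenerated from a new address book.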
For any accuracy issue, it's recommended first of all to collect a database of test samples to enable string accuracy analysis. Once you have such a database, you can improve accuracy significantly. For details, see
http://cmusphinx.sourceforge.net/wiki/faq#qwhy_my_accuracy_is_poor
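One common way to score such a test database is word error rate: the word-level edit distance between the expected transcript and the recognizer output, divided by the number of expected words. The sketch below is a generic illustration, not a CMUSphinx tool; the function name and sample strings are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # word missing from hypothesis
                cur[j - 1] + 1,                        # extra word in hypothesis
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


samples = [("call tena now", "call tina now"), ("call tena now", "call tena now")]
for expected, recognized in samples:
    print(expected, "|", recognized, "|", word_error_rate(expected, recognized))
```

Running this over the same samples before and after a dictionary change shows whether the change actually helped.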

Related

Best approach to compare recognized speech with a known text

Given a known manuscript (text) which I expect the user to read (more or less accurately), what is the best approach to recognize the user's progress within the manuscript?
While I'm searching for a particular solution on iOS, I'm also interested in a more general answer.
iOS provides a speech recognition framework called Speech that I can use to recognize any speech. My current approach is to use the string results of this framework to match them against the manuscript. However, it seems to me like this has quite some overhead and that it would save resources and increase precision when I first feed the speech recognizer with the expected words so that it "knows" what to listen for.
For example, when the next word in the manuscript is "fish", I don't need the speech recognizer to search the whole English language dictionary for a word that best matches the recorded audio – I only need to get a probability value how likely it is that the user just said "fish".
I think it's very similar to keyword spotting only that I'm not only spotting a few keywords but the words in a whole manuscript.
Unfortunately, I haven't been able to find such an API on iOS. Is there any better approach to achieve this "speech tracking" than the one described above?
> However, it seems to me like this has quite some overhead and that it would save resources and increase precision when I first feed the speech recognizer with the expected words so that it "knows" what to listen for.
Maybe it would, but the speech framework provides no way for you to do that, so you can't.
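That leaves the approach from the question: match the recognizer's string results against the manuscript. A minimal sketch of that matching, assuming the recognized text arrives as a list of words; advance_position, the look-ahead window and the similarity cutoff are all hypothetical choices, and the Speech framework itself is not involved here.

```python
import difflib


def advance_position(manuscript_words, position, heard_words, window=5, cutoff=0.8):
    """Move the reading position forward for each heard word that
    fuzzily matches one of the next `window` manuscript words."""
    for heard in heard_words:
        upcoming = manuscript_words[position:position + window]
        for offset, expected in enumerate(upcoming):
            ratio = difflib.SequenceMatcher(None, heard.lower(), expected.lower()).ratio()
            if ratio >= cutoff:
                # Words skipped before the match are treated as read.
                position += offset + 1
                break
    return position


manuscript = "the quick brown fox jumps over the lazy dog".split()
pos = advance_position(manuscript, 0, ["the", "quik", "brown"])
print(pos)
```

Only looking a few words ahead keeps a misrecognized word from jumping the position to a repeated word much later in the text.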

Does the iOS Speech API support grammar?

I was investigating various Speech Recognition strategies and I liked the idea of grammars as defined in the Web Speech spec. It seems that if you can tell the speech recognition service that you expect “Yes” or “No”, the service could more reliably recognize a “Yes” as “Yes”, a “No” as “No”, and hopefully also be able to say “it didn’t sound like either of those!”.
However, in SFSpeechRecognitionRequest, I only see taskHint with values from SFSpeechRecognitionTaskHint of confirmation, dictation, search, and unspecified.
I also see SFSpeechRecognitionRequest.contextualStrings, but it seems to be for a different purpose. I.e., I think I should put brands/trademark type things in there. Putting “Yes” and “No” in wouldn’t make those words any more likely to be selected because they already exist in the system dictionary (this is an assumption I’m making based on the little the documentation says).
Is there a way with the API to do something more like grammars or, even more simply, to provide a list of expected phrases so that the speech recognition is more likely to come up with a result I expect instead of similar-sounding gibberish/homophones? Does contextualStrings perhaps increase the likelihood that the system chooses one of those strings instead of just expanding the system dictionary? Or maybe I’m taking the wrong approach and am supposed to enforce grammar on my own and enumerate over SFSpeechRecognitionResult.transcriptions until I find one matching an expected word?
Unfortunately, I can’t test these APIs myself; I am merely researching the viability of writing a native iOS app and do not have the necessary development environment.
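The last option, enforcing the grammar in the app, can be sketched independently of the Speech API. Below, candidates stands in for the text of each alternative transcription in ranked order; pick_expected and the sample strings are hypothetical, and whether the real API returns useful alternatives for short words is not something this sketch can confirm.

```python
def pick_expected(candidates, expected_phrases):
    """Return the first expected phrase that matches a candidate
    transcription, ignoring case and trailing punctuation."""
    normalized = {phrase.strip().lower(): phrase for phrase in expected_phrases}
    for candidate in candidates:
        key = candidate.strip().lower().rstrip(".!?")
        if key in normalized:
            return normalized[key]
    # None plays the role of "it didn't sound like either of those".
    return None


print(pick_expected(["Yeah s", "Yes."], ["Yes", "No"]))
print(pick_expected(["maybe"], ["Yes", "No"]))
```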

OpenEars detecting words not in the dictionary and/or language file

I am working on a navigation app that utilizes voice recognition to point the user towards their destination. However, this may require phrases or words outside what might be obvious. "Navigate to" would be simple, since I can add it to the dictionary, but "starbucks" won't be so easy. I could simply add starbucks, but that only solves one venue out of the hundreds of thousands that have non-standard names. I am looking for a way to do this in a more widespread way.
Is there any way to set up or configure OpenEars to detect and understand all words said?

Scanning image word by word using Tesseract OCR for iOS

Right now, I am using TesseractOCR for iOS to scan images and convert them into text. I want to be able to find a word and highlight it in the original image, so I am thinking of scanning the document word by word and looking for the phrase or word passed in by the user. However, I can't find any resources on the TesseractOCR website that point me in this direction. I need to be able to highlight the word on the original image, which is why I think I should scan the original image word by word. Is there any way I can scan the original image word by word using TesseractOCR (probably involving detecting whitespace)? If so, any relevant resources would be helpful. If I can't use TesseractOCR, should I be using something else, or is it not possible at all?
Thanks in advance.
TesseractOCR for iOS has an API call which returns recognized blocks by iterator level. You can set the iterator level to G8PageIteratorLevelWord to obtain words.
What's also important is that each recognized block has a boundingBox property, which points directly to the location of the block on the image. You can use this for highlighting the word on the image.
If, after this, you want to find some specific phrase or word in the obtained set of words, you will have to be a little more creative :) OCR results can contain errors, so you can't rely on exact string matching and should use fuzzy matching instead. Also, searching for phrases (as opposed to single words) raises the question of how the OCR results are laid out, because the words of one phrase aren't always adjacent in the OCR result.
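A rough sketch of the fuzzy phrase search, assuming the word-level results have already been collected as (text, bounding box) pairs in reading order. The find_phrase helper, the box tuples and the cutoff are hypothetical; it only handles the simple case where the words of the phrase are adjacent in the result.

```python
import difflib


def find_phrase(ocr_words, phrase, cutoff=0.8):
    """Return the bounding boxes of the run of adjacent OCR words that
    best matches `phrase`, or an empty list if nothing is close enough."""
    target = " ".join(phrase.lower().split())
    n = len(target.split())
    best_score, best_boxes = 0.0, []
    for start in range(len(ocr_words) - n + 1):
        window = ocr_words[start:start + n]
        text = " ".join(word.lower() for word, _ in window)
        # A similarity ratio tolerates OCR slips such as "1" read for "l".
        score = difflib.SequenceMatcher(None, text, target).ratio()
        if score > best_score:
            best_score, best_boxes = score, [box for _, box in window]
    return best_boxes if best_score >= cutoff else []


# Boxes are (x, y, width, height); values are made up for illustration.
words = [("Tota1", (0, 0, 40, 10)), ("amount", (45, 0, 60, 10)), ("due", (110, 0, 30, 10))]
print(find_phrase(words, "total amount"))
```

The returned boxes are what would be drawn over the original image to highlight the match.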
Note: my company MicroBlink offers a commercial OCR engine for mobile devices. On iOS you can easily try it using CocoaPods:
pod try PPBlinkOCR
BlinkOCR solves all of the problems above, and you can contact us for support while you use it.

Profanity checking for promotional codes

I have a slightly unusual profanity-related question.
Now we're used to dealing with profanity-filtering of user-generated content — any method is imperfect, but products like CleanSpeak and WebPurify do a good-enough job.
The problem we have at the moment, though, is that we've been building an engine to run promotional-code–based competitions, that will be used internationally. We could do with checking that none of these codes is profane in Latin American Spanish or Malay (at least in the first instance), to make sure we don't send out a code that's equivalent to FUCK23 or PEN15 or something.
We've tried Googling around and asking people we know, but we can't find an easy way of getting hold of an es-419 or an ms profanity list to filter the codes against. As there are literally millions of codes per locale, we'd rather do an offline check than hit an API for each code (which would be expensive both in terms of bandwidth and usage fees).
I know this is a bit of a long shot, but does anyone know of a good source for profanity lists in different languages?
#disclaim: We know that no profanity filtering is perfect, that it's essentially futile with user-generated content and we have read SO #273516: How do you implement a good profanity filter? — that's not what we're asking.
Building or finding lists in other languages is extremely time consuming and difficult (trust me, we've built many of them at Inversoft). You might be better off tweaking the code generators instead (from what I could tell your code is generating the promotional codes rather than humans).
The best way to tweak a generator is to ensure that the codes can't easily form words based on the general use of consonants and vowels in most European languages. Things get a bit dicey in Polish and others, but it usually works.
Generally, a code that starts with a vowel should be followed by another vowel or a non-joining consonant (like 'q' without a 'u'). If the code starts with a consonant, the next character should be the same consonant or one that has a low probability of following it. For example, if you start with 's', then adding 'g' is a good choice.
You could also use wiktionary or other similar sources (like Linux dictionary files) to build a statistical approach to this. By extracting the probability of characters being next to each other, you should be able to generate codes with good accuracy of never being words in any language.
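A minimal sketch of that statistical idea, assuming a plain word list per language is available. The tiny Spanish-looking list, the function names and the threshold below are placeholders for illustration, not tuned values.

```python
from collections import Counter


def bigram_frequencies(words):
    """Relative frequency of each adjacent letter pair in a word list."""
    counts = Counter()
    for word in words:
        w = word.lower()
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    total = sum(counts.values()) or 1
    return {bigram: count / total for bigram, count in counts.items()}


def looks_wordlike(code, freqs, threshold=0.01):
    """True if any adjacent pair in the code is common in the word list."""
    letters = code.lower()
    return any(
        freqs.get(letters[i:i + 2], 0.0) >= threshold
        for i in range(len(letters) - 1)
    )


freqs = bigram_frequencies(["casa", "cosa", "mesa"])  # placeholder word list
print(looks_wordlike("XSA9Q", freqs))
print(looks_wordlike("QXZ7K", freqs))
```

A generator would discard any candidate code for which looks_wordlike returns True and draw again.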
However, if I misread your question and you aren't generating the codes programmatically, you can ignore my response completely. :)
I have had the same thoughts while trying to generate 6-character codes for a project I am doing.
I decided to reduce the likelihood of obvious profane codes, so I removed the vowels that I found in as many "bad" words as I could think of from my initial base 36 generation code. That left me with something more like a base 28 system that did not include a, e, i, o, u, 1, or 0. The one and zero were removed to reduce confusion between those characters and I, L, and O in some fonts.
So far I have not seen a "profane" code generated, and a base of roughly 28 still leaves hundreds of millions of unique 6-character combinations.
I cannot vouch for other languages, and had not even considered them...
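The reduced-alphabet idea can be sketched as follows. This is an illustration, not the poster's actual code: dropping exactly the seven characters listed above from the 36 alphanumerics leaves 29 symbols, which gives 29^6 = 594,823,321 possible 6-character codes.

```python
import secrets
import string

# Digits and letters minus the vowels and the look-alike digits 1 and 0.
ALPHABET = "".join(
    ch for ch in string.digits + string.ascii_uppercase if ch not in "AEIOU10"
)


def generate_code(length=6):
    """Draw each character independently from the reduced alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(len(ALPHABET), len(ALPHABET) ** 6)
print(generate_code())
```

Digits such as 3 or 4 can still stand in for vowels, so this reduces rather than eliminates readable words.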
