Is there an event or something that could be used to indicate what word is being spoken currently?
I can't find anything in the documentation but I want to double check.
I need that so that, for example, it's possible to move back X words.
Thank you
Polly can generate speech marks, i.e. metadata giving the position and timing of each word in the synthesized audio. Using this information, you could certainly achieve what you have in mind.
To generate speech marks, call the SynthesizeSpeech API with the 'json' output format and request the 'word' speech mark type.
https://docs.aws.amazon.com/polly/latest/dg/using-speechmarks.html
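For reference, a minimal sketch (Python/boto3) of requesting word-level speech marks as described above; the text and voice are placeholders:

```python
# Request word-level speech marks instead of audio by using the 'json'
# output format with SpeechMarkTypes=['word'].
import json
import boto3

polly = boto3.client("polly")

response = polly.synthesize_speech(
    Text="Hello world, this is a speech mark test.",
    VoiceId="Joanna",
    OutputFormat="json",       # speech marks, not audio
    SpeechMarkTypes=["word"],  # one JSON object per spoken word
)

# The stream is newline-delimited JSON, e.g.
# {"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
for line in response["AudioStream"].read().decode("utf-8").splitlines():
    if line:
        mark = json.loads(line)
        print(mark["time"], mark["value"])  # offset in ms, spoken word
```

You would synthesize the audio itself in a separate call (same text, an audio OutputFormat such as 'mp3') and use the millisecond offsets to map the playback position to a word index, which is what lets you jump back X words.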
I am banging my head on this. Twilio Studio says it supports SSML with Amazon Polly voices in the Say/Play and Gather widgets: https://www.twilio.com/docs/studio/widget-library/sayplay#ssml-support-for-polly-voices
I cannot make it work no matter what I try.
I tried using the examples from their docs, but nothing. What I currently have is this:
[Screenshot: Twilio Gather widget configuration]
I have also tried wrapping the whole block of text in valid SSML, using neural and non-neural voices, single quoting, and escaping. Nothing seems to work the way the docs say it will.
When I look in the call log, the converted TwiML just has all of the SSML stripped out. It looks like this:
[Screenshot: Twilio call log showing the converted TwiML]
Any idea what I am doing wrong?
I figured this out in the end. The whole text block needed to be wrapped in a <speak></speak> block, and the ampersands in the text needed to be removed.
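For anyone hitting the same issue, a small sketch (Python; the helper name is made up) of the fix: strip the ampersands and wrap the whole text in a single <speak> element before pasting it into the widget. Escaping '&' as '&amp;' is the standard XML alternative, but I only verified removing them.

```python
def to_studio_ssml(text: str) -> str:
    # Ampersands make the SSML invalid XML, which appears to be why
    # Studio strips it; replace them (escaping to &amp; may also work).
    safe = text.replace("&", "and")
    # The whole block must sit inside one <speak> element.
    return f"<speak>{safe}</speak>"

print(to_studio_ssml('Sales & Support: press 1 <break time="500ms"/> or say agent'))
# -> <speak>Sales and Support: press 1 <break time="500ms"/> or say agent</speak>
```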
I am working on an application that gathers a user's voice input for an IVR. The input we're capturing is a limited set of proper nouns, but even though we have added hints for all of the possible options, we very frequently get back unintelligible results, possibly because our users have accents from all parts of the world. I'm looking for a way to further improve the speech recognition results beyond just using hints. The available Google adaptive classes will not be useful, as there are none that match the type of input we're gathering. I see that Twilio recently added something called experimental_utterances that may help, but I'm finding little technical documentation on what it does or how to implement it.
Any guidance on how to improve our speech recognition results?
Google does a decent job of recognizing proper names, but only asynchronously, not in real time. I've not seen a PaaS tool that can do this in real time. I recommend you change your approach: maybe identify callers based on ANI or account number, or have them record their name for manual transcription.
david
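For what it's worth, if you do want to try the experimental_utterances route the question mentions, it is selected via the speech model setting on <Gather>. A hedged sketch using the twilio Python helper library (the hint values and the /handle-speech webhook are my assumptions, and I haven't verified how Studio exposes the equivalent setting):

```python
from twilio.twiml.voice_response import Gather, VoiceResponse

response = VoiceResponse()
gather = Gather(
    input="speech",
    hints="Acme, Globex, Initech",           # the proper nouns you expect
    speech_model="experimental_utterances",  # intended for short, single utterances
    action="/handle-speech",                 # placeholder webhook
)
gather.say("Please say the name of your company.")
response.append(gather)
print(response)  # emits <Gather> TwiML with the speechModel attribute set
```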
I'm trying to build software that will identify the language being spoken.
My plan is to use Google's Cloud Speech-to-Text to transcribe the speech, and then put the transcription through the Cloud Translation API to detect its language.
However, since Speech-to-Text requires a language code to be set before transcribing, I was planning to run it multiple times with different languages and compare the "confidence" values to find the most confident transcription, which would then be put through the Cloud Translation API.
Would this be the ideal way, or are there other possible options?
Maybe you can check the "Detecting language spoken automatically" page in the Google Cloud Speech-to-Text documentation.
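That page documents passing several candidate languages in one request (via alternative_language_codes), which avoids running separate transcriptions per language and comparing confidences by hand. A minimal sketch (Python, google-cloud-speech; the file name and language list are placeholders):

```python
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

with open("audio.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",                                   # primary guess
    alternative_language_codes=["es-ES", "ja-JP", "de-DE"],  # other candidates
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Each result reports which candidate language it was decoded in.
    print(result.language_code, result.alternatives[0].transcript)
```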
Given a known manuscript (text) which I expect the user to read (more or less accurately), what is the best approach to recognize the user's progress within the manuscript?
While I'm searching for a particular solution on iOS, I'm also interested in a more general answer.
iOS provides a speech recognition framework called Speech that I can use to recognize any speech. My current approach is to use the string results of this framework to match them against the manuscript. However, it seems to me that this has quite a lot of overhead, and that it would save resources and increase precision if I could first feed the speech recognizer the expected words so that it "knows" what to listen for.
For example, when the next word in the manuscript is "fish", I don't need the speech recognizer to search the whole English dictionary for the word that best matches the recorded audio – I only need a probability value for how likely it is that the user just said "fish".
I think this is very similar to keyword spotting, except that I'm not spotting just a few keywords but the words of a whole manuscript.
Unfortunately, I haven't been able to find such an API on iOS. Is there any better approach to achieve this "speech tracking" than the one described above?
However, it seems to me that this has quite a lot of overhead, and that it would save resources and increase precision if I could first feed the speech recognizer the expected words so that it "knows" what to listen for.
Maybe it would, but the speech framework provides no way for you to do that, so you can't.
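Given that, matching the recognizer's running string output against the manuscript yourself is what remains, as you describe. A minimal, language-agnostic sketch of that alignment step (shown in Python; the fuzzy matching and lookahead window are illustrative choices, and on iOS the words would come from the Speech framework's partial results):

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    return re.findall(r"[\w']+", text.lower())

class ManuscriptTracker:
    def __init__(self, manuscript, lookahead=5, threshold=0.8):
        self.words = tokenize(manuscript)
        self.position = 0           # index of the next expected word
        self.lookahead = lookahead  # how far ahead to search for a match
        self.threshold = threshold  # fuzzy-match cutoff, 0..1

    def feed(self, recognized_words):
        """Advance the position using newly recognized words."""
        for spoken in recognized_words:
            window = self.words[self.position:self.position + self.lookahead]
            for offset, expected in enumerate(window):
                if SequenceMatcher(None, spoken, expected).ratio() >= self.threshold:
                    self.position += offset + 1  # skip over small misreads
                    break
        return self.position

tracker = ManuscriptTracker("The quick brown fox jumps over the lazy dog")
print(tracker.feed(["the", "quick", "browne"]))  # -> 3 despite the misrecognition
```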
I am trying to use OpenEars for a small part of my app. I have three or four keywords that I want to be able to "listen" for, something like "Add", "Subtract", etc. I am just using the sample app found here. I want to have a special case in the app when I hear "Add" etc., as opposed to a word that is not one of my four keywords. Right now I set my language to contain only the four keywords, but whenever the OpenEars API hears anything, it picks between my four keywords. So if I cough, it picks the closest word out of the four.
How can I listen for a specific word without always choosing one of the keywords?
I was thinking I could have a whole bunch of words, a few hundred, and just check which word was spoken, with a special case for my four keywords, but I don't want to have to type out each word. Does OpenEars provide any default language models?
OpenEars developer here. Check out the dynamic grammar generation API that was just added in OpenEars 1.7 which may provide the right results for your requirements: http://www.politepix.com/2014/04/10/openears-1-7-introducing-dynamic-grammar-generation/
This approach might be more suitable for keyword detection and detection of fixed phrases. Please bring further questions to the OpenEars forums if you'd like to troubleshoot them with me.
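OpenEars wraps CMU PocketSphinx, and the same out-of-vocabulary problem is usually handled there with keyword spotting: each keyword gets its own detection threshold, and anything that doesn't clear a threshold (a cough, an unrelated word) is simply not reported rather than being forced onto the closest of the four words. A hedged sketch of that idea using the pocketsphinx-python bindings rather than the OpenEars API itself (the keyword file and thresholds are placeholders you would tune):

```python
from pocketsphinx import LiveSpeech

# keywords.list -- one keyword per line with a detection threshold:
#   add /1e-20/
#   subtract /1e-20/
#   multiply /1e-20/
#   divide /1e-20/
speech = LiveSpeech(lm=False, kws="keywords.list")

for phrase in speech:
    # Only utterances that clear a keyword's threshold arrive here;
    # out-of-vocabulary sounds are ignored instead of mis-matched.
    print(phrase.segments(detailed=True))
```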