Does anyone know whether Twilio can create multiple audio recordings during a call, split on some kind of audio flag or pattern such as silence? That way a callback could fire at the end of each portion of speech and generate text during the call.
Thanks.
Twilio Evangelist here.
So, you could use the timeout attribute on the <Record> verb to get short 'bursts' of spoken text, but this may mean you time out while the caller is speaking a word. So you would only get half of it! This may make it difficult to decipher what is being said, and I would personally not use this approach.
You can end recording on a key-press (a DTMF tone) with the finishOnKey attribute, which may suit your needs.
You cannot currently get a live, or near realtime transcription. You will receive the transcription very quickly, but we only support the timeout and key presses to end a recording and begin transcription.
Hope this helps!
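A minimal sketch of the two options above, assuming you generate TwiML from Python with only the standard library. The attribute names (timeout, finishOnKey, transcribe, transcribeCallback, action) are <Record> attributes as I understand the Twilio docs; the example.com URLs, the prompt text and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET


def record_twiml(action_url, transcribe_callback_url,
                 silence_timeout=3, finish_on_key="#"):
    """Build TwiML that records one burst of speech and requests a transcription."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = "Please speak after the beep, then press pound."
    ET.SubElement(response, "Record", {
        "action": action_url,                 # hypothetical endpoint called when recording ends
        "timeout": str(silence_timeout),      # end the recording after this much silence
        "finishOnKey": finish_on_key,         # or as soon as the caller presses this key
        "transcribe": "true",
        "transcribeCallback": transcribe_callback_url,  # hypothetical endpoint for the text
    })
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(record_twiml("https://example.com/recording-done",
                       "https://example.com/transcript"))
```

The transcription still arrives after the recording ends, so this gives you text per recording rather than live text.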
I can't seem to find any documentation on this, but I'd like to say "one moment" in a Gather block in between when a user stops speaking and when the speech recognition processor delivers the words they said (since anecdotally this can take a few seconds and result in dead air in the meantime).
All of the examples in the documentation are for things like:
<Response>
  <Gather>
    <Say>Voice prompt to read to the user before collection</Say>
    <Say>Say more things if you want</Say>
  </Gather>
  <Say>Something to say if the user doesn't provide feedback</Say>
</Response>
Having around 5 seconds of dead air isn't the worst thing ever, but it lacks polish.
Twilio developer evangelist here.
There is no way to play a message between the moment the user finishes speaking to the <Gather> and the moment the speech result is ready and sent to the action URL. However, I think you might be mischaracterising the delay.
Twilio streams the voice to the speech detection service, so we get real time results (you can get partial results by setting a partialResultCallback URL). Instead, the time that elapses between the end of the caller speaking and the action being called is based on the timeout which is 5 seconds by default.
What I would suggest is that you try different values for the speechTimeout attribute, including auto, which "will stop speech recognition when there is a pause in speech and return the results immediately."
Let me know if that helps at all.
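For illustration, here is a rough sketch of a <Gather> that uses speechTimeout="auto" and a partialResultCallback, built with Python's standard library. The attribute names come from the answer above; the URLs, prompts and function name are assumptions, not anything Twilio prescribes.

```python
import xml.etree.ElementTree as ET


def gather_twiml(action_url, partial_callback_url=None, speech_timeout="auto"):
    """Build TwiML for a speech <Gather> that returns as soon as the caller pauses."""
    response = ET.Element("Response")
    attrs = {
        "input": "speech",
        "action": action_url,                  # hypothetical endpoint for the final result
        "speechTimeout": str(speech_timeout),  # "auto" or a number of seconds
    }
    if partial_callback_url:
        # Optional: receive interim results while the caller is still talking.
        attrs["partialResultCallback"] = partial_callback_url
    gather = ET.SubElement(response, "Gather", attrs)
    ET.SubElement(gather, "Say").text = "What can I help you with?"
    # Reached only if the caller says nothing.
    ET.SubElement(response, "Say").text = "Sorry, I did not hear anything."
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(gather_twiml("https://example.com/handle-speech",
                       "https://example.com/partial"))
```

Trying a few numeric values against "auto" with real callers is probably the quickest way to see which one removes the dead air.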
My Twilio application dials our conference line, waits two seconds and then sends the conference PIN, followed by #.
$dial->number('442031234567', ['sendDigits' => 'wwww123456789'] );
I would like to be able to give my users an estimate of how long they should expect silence (while Twilio is sending the PIN digits) before the call is ready. I can make the call multiple times and time the delay, but that seems less exact than finding the underlying timings!
I know that each w character takes 0.5s, but I can't find any documentation for the amount of time each digit takes after that wait.
I've looked at Twilio's docs for sendDigits and also for <Play>.
Twilio developer evangelist here.
I don't believe we give any guidance on how long the DTMF tones will take, but I believe they are a constant time. I would recommend trying it a few times, along with the system that you are dialling in to in order to estimate the time for your users.
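Once you have timed a few calls, the arithmetic can be sketched like this. The 0.5 seconds per w comes from the question; the per-digit duration is NOT a documented Twilio figure, so it is a parameter you would fill in from your own measurements (the 0.25 below is only a placeholder).

```python
def estimate_send_digits_seconds(send_digits, seconds_per_digit, seconds_per_wait=0.5):
    """Estimate how long a sendDigits string takes to play.

    seconds_per_digit is an assumption you measure yourself;
    Twilio does not publish it as far as this thread knows.
    """
    waits = send_digits.count("w")
    tones = sum(1 for ch in send_digits if ch in "0123456789*#")
    return waits * seconds_per_wait + tones * seconds_per_digit


if __name__ == "__main__":
    # Four waits (2 s) plus nine tones at a placeholder 0.25 s each.
    print(estimate_send_digits_seconds("wwww123456789", seconds_per_digit=0.25))
```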
I'm currently using Twilio to make phone calls and I'd like to add a speech recognition element such that if a user says a specific phrase, my backend can take specific actions. If you're familiar with Twilio, something akin to the Gather verb. It needs to be real-time since if there are issues with recognition, the user would be prompted for clarification.
To add speech recognition to the Twilio Gather verb, add "speech" to the Gather input value, for example: input="dtmf speech". After the caller says something and then goes quiet, the Twilio server translates the speech into text and sends the text to the action URL, then waits for response instructions. Your program can use the text to respond however you choose. One choice is to have your program respond with correction instructions (Say verb) and have the caller say something more, which would be processed again by your action URL.
Twilio Gather documentation including the implementation of speech recognition:
https://www.twilio.com/docs/api/twiml/gather
Example TwiML with a Gather verb using the speech recognition identifier.
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather input="dtmf speech" language="en-us"
          numDigits="1"
          timeout="6"
          action="http://hostname/processUserResponse.py">
    <Say voice="alice" language="en-CA">
      Okay, speech recognition test. Enter any digit or say something.
    </Say>
  </Gather>
  <Say voice="alice" language="en-CA">
    Waited too long to say something. Response canceled ....
  </Say>
</Response>
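On the receiving side, the action URL handler could look roughly like the sketch below. It assumes Twilio posts the recognised text as SpeechResult, a score as Confidence and any key presses as Digits, which is how I understand the Gather docs. The phrase table, the retry URL and the confidence threshold are invented for the example, and the web framework plumbing is left out so that the function only maps request parameters to TwiML.

```python
import xml.etree.ElementTree as ET

# Hypothetical phrases and replies, purely for illustration.
COMMANDS = {
    "billing": "Connecting you to billing.",
    "support": "Connecting you to support.",
}
RETRY_ACTION_URL = "https://example.com/processUserResponse"  # hypothetical


def handle_gather_result(params, min_confidence=0.5):
    """Turn the parameters posted by <Gather> into the next TwiML document."""
    speech = (params.get("SpeechResult") or "").lower()
    confidence = float(params.get("Confidence") or 0.0)
    digits = params.get("Digits") or ""

    reply = None
    if digits:
        reply = f"You pressed {digits}."
    elif confidence >= min_confidence:
        # First configured phrase found in the transcript, if any.
        reply = next((text for phrase, text in COMMANDS.items() if phrase in speech), None)

    response = ET.Element("Response")
    if reply is not None:
        ET.SubElement(response, "Say").text = reply
    else:
        # Not understood: ask again, which posts back to the same handler.
        gather = ET.SubElement(response, "Gather", {
            "input": "dtmf speech",
            "numDigits": "1",
            "action": RETRY_ACTION_URL,
        })
        ET.SubElement(gather, "Say").text = (
            "Sorry, I did not catch that. Please say billing or support."
        )
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(handle_gather_result({"SpeechResult": "Billing please", "Confidence": "0.9"}))
```

Re-prompting with another Gather is what gives the "ask for clarification" loop described above.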
This was briefly covered here: https://stackoverflow.com/a/30224103/6189694
Seems like you would have to set up a conference call, and then join in as a muted user to listen in on the call.
I don't believe there is anything that works in real-time to do this. You could, however, use voice recording, pass the recording to another service (IBM's Watson Speech to Text comes to mind) and then handle it from there. It should be able to do this relatively quickly with the right workflow. I have never used Watson, just seen it used. So I am not sure on how long it would take to process the recording. I would think one or two word commands should be completed quickly.
Sorry I can't provide more guidance. Someone else in the community may have another method.
C# .NET Core IVR Gather example that uses a list of enums instead of the combined enum from the older official C# example, as per my comment above (I also had to convert the url.actionurl to this monstrosity):
List<Gather.InputEnum> bothDtmfAndSpeech =
    new List<Gather.InputEnum>(2) {
        Gather.InputEnum.Dtmf, Gather.InputEnum.Speech
    };
var gather = new Gather(
    action: new Uri(Url.Action("Show", "Menu")),
    numDigits: 1, input: bothDtmfAndSpeech, bargeIn: true);
The IBM Watson Speech To Text service (STT) has this capability; it is called Keyword Spotting (https://www.ibm.com/watson/developercloud/doc/speech-to-text/output.shtml). Watson STT lets you push a live stream of telephony audio and produces not only recognition hypotheses but also detects whether the user said sentences or commands specified beforehand. There is a demo that showcases this functionality, please give it a try:
https://speech-to-text-demo.mybluemix.net/
Our call center deals with businesses and we use Twilio to make our calls. However, many businesses have a menu to navigate before we get to talk to someone. How can I create a 10-key pad on our end and use it to send menu selections to the call we are connected with?
I know about the sendDigits attribute for dialing numbers with Twilio, but this sends preprogrammed tones. We have no way of knowing what the tones need to be until we are connected and in the menu, so this won't work.
I've been through the API pretty thoroughly and can't seem to find anything relating to this.
If there is nothing, is there another software that anyone can recommend that allows for making calls out, generating recordings of calls and allows me to send keytones manually after the call has been started?
Check out the digits attribute of the 'Play' tag.
https://www.twilio.com/docs/api/twiml/play#attributes-digits
Each 'w' character tells Twilio to wait 0.5 seconds instead of playing a digit.
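As a rough sketch, a keypad button on your side could be turned into TwiML like this. The digits attribute on <Play> is the one from the link above; the allowed character set is my reading of the docs, and the function name is made up. How the TwiML reaches the call that is already in progress (for example by pointing the live call at a URL that serves it) is outside this sketch and would need checking against your call setup.

```python
import xml.etree.ElementTree as ET

# Characters assumed to be valid in the digits attribute: keypad tones plus 'w' waits.
ALLOWED_DIGITS = set("0123456789*#w")


def play_digits_twiml(digits):
    """Build TwiML that plays DTMF tones for the given keypad input."""
    if not digits or any(ch not in ALLOWED_DIGITS for ch in digits):
        raise ValueError(f"unsupported digits string: {digits!r}")
    response = ET.Element("Response")
    ET.SubElement(response, "Play", {"digits": digits})
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    # One second of waiting, then the '3' key.
    print(play_digits_twiml("ww3"))
```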
Assuming I am understanding your problem, could you not use MP3s of DTMF tones (http://jetcityorange.com/dtmf/) and PLAY to send the tones after the call has started?
I want to get four related pieces of information from a caller, and get recordings of each. Do I need to implement this as four separate calls (Say/Record pairs in the XML file) to the Twilio API and my web endpoint (the Record 'action' or StatusCallback)? I'm using Python and Flask, but examples in other languages and frameworks would be helpful too.
Twilio evangelist here.
Four separate recordings is probably the easiest way to do what you want.
<Response>
  <Say>Begin Recording 1</Say>
  <Record/>
  <Say>Begin Recording 2</Say>
  <Record/>
  <Say>Begin Recording 3</Say>
  <Record/>
  <Say>Begin Recording 4</Say>
  <Record/>
  <Say>Thanks for recording four things</Say>
</Response>
The hard part is going to be knowing that the user actually gave you the right information, so you might want to think about adding a step at the end of your call flow that allows your caller to hear their recording and choose to re-record the info if they want.
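Since the question mentions Python, here is one way the four steps could be chained from a single endpoint, sketched with the standard library only. As I read the <Record> docs, Twilio requests the action URL when a recording finishes (passing along RecordingUrl) and carries on with whatever TwiML that URL returns, so each step hands over to the next one. The prompts, the base URL and the step query parameter are all hypothetical, and the Flask route that would call this function is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical prompts for the four pieces of information.
PROMPTS = [
    "Please say your name, then press pound.",
    "Please say your address, then press pound.",
    "Please say your phone number, then press pound.",
    "Please describe your request, then press pound.",
]


def recording_step_twiml(step, base_url="https://example.com/record"):
    """Return TwiML for recording number `step` (0-based), or the sign-off."""
    response = ET.Element("Response")
    if step >= len(PROMPTS):
        ET.SubElement(response, "Say").text = "Thanks for recording four things."
        ET.SubElement(response, "Hangup")
        return ET.tostring(response, encoding="unicode")

    ET.SubElement(response, "Say").text = PROMPTS[step]
    ET.SubElement(response, "Record", {
        # When this recording ends, Twilio requests the next step.
        "action": f"{base_url}?step={step + 1}",
        "finishOnKey": "#",
        "maxLength": "30",
    })
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    for step in range(len(PROMPTS) + 1):
        print(recording_step_twiml(step))
```

A review step would slot in the same way: after the last recording, return TwiML that plays the saved recordings back and gathers a key press to confirm or start over.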
Hope that helps.