When initializing the Google speech recognition service, we see a response in around 300 milliseconds. However, if we stop the recognition service and start it again, each subsequent "start" has a much longer delay before the service is ready to receive our audio stream. We've seen this delay range anywhere from 5 to 11 seconds.
Does anyone have any idea why this might be occurring?
-- ADDITIONAL INFORMATION --
2019-05-30:
Our development team is closely following the example that we found here. The only difference is that we're not sending a file; instead, we're redirecting an audio stream to this service.
When debugging our code, we see that the delay appears to be within these three lines:
auto creds = grpc::GoogleDefaultCredentials();                        // build the default Google credentials
auto channel = grpc::CreateChannel( "speech.googleapis.com", creds ); // create the channel to the Speech endpoint
std::unique_ptr<Speech::Stub> speech( Speech::NewStub( channel ) );   // stub bound to the channel
(These are the first three lines in our stream-creation thread.)
There seem to be many similar questions/issues with a Beam pipeline being stuck in GroupByKey, but after a whole day of trying different settings for PubsubIO and window triggers, I still have not come any further.
I have a PubSub topic producing a steady stream of data:
When starting my pipeline in Dataflow, a bunch of elements are added. After that, the number of elements remains the same. The data watermark increases slowly and stays around 10-11 minutes behind the current time, which coincides with the 10-minute message retention time of the PubSub subscription. I have tried different setups for reading PubSub: with and without attributes, adding the timestamp myself, etc. In this job it is just a vanilla read without attributes, relying on Google to compute the timestamps.
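For reference, a plain read of that kind looks roughly like this (a sketch only; the topic name is illustrative, not the actual job):
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

// Vanilla read: no timestamp attribute, so the PubSub publish time
// becomes the element timestamp.
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
PCollection<String> messages =
    p.apply("ReadPubSub",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/events"));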
Here I am trying to run my pipeline, which doesn't do any grouping by key:
I tried a lot of window setups. My goal is a sliding window of 30 minutes duration every 1 minute, but here I am just trying to get it to work with a FixedWindow of 1 minute. I also tried a lot of different triggers with both early and late firings added. In this job, I did not specify anything, i.e. it should use the default trigger.
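Continuing the sketch above, the 1-minute fixed window with the default trigger would look something like this (the key and the grouping are illustrative stand-ins for the real transforms):
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

messages
    // Default trigger: fire once when the watermark passes the end of the window.
    .apply("Window", Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply("Key", WithKeys.of("all"))
    .apply("GBK", GroupByKey.<String, String>create());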
The job id is 2019-11-01_06_38_28-2654303843876161133
Does anyone have any suggestions on what else I can try to get some elements through the GBK?
Update:
I have simplified my pipeline to continue troubleshooting, following the hint in one of the comments to look at the watermark when reading the PubSub messages.
I have an attribute on the PubSub message that I use as the timestamp (.withTimestampAttribute(...)). Logging timestamp() in ProcessContext does give the correct timestamp and window assignment. Messages stream in near real time with a couple of seconds of lag, but the issue is that the data watermark stays well behind (I have observed around 1.5 hours), so the window is never triggered and GroupByKey does not work.
If I omit .withTimestampAttribute(...) when reading from PubSub, everything seems to work correctly, but I get a lag on my timestamps, which causes messages to be assigned to a later window in many cases.
I found a workaround by triggering on processing time instead of event time, but I haven't assessed whether this is a real solution:
.triggering(AfterProcessingTime.pastFirstElementInPane()
    .alignedTo(Duration.standardMinutes(1))
    .plusDelayOf(Duration.standardMinutes(1))) // fire on a processing-time schedule instead of waiting for the watermark
.withAllowedLateness(Duration.ZERO)            // drop anything the watermark would consider late
.discardingFiredPanes()                        // each fired pane contains only new elements
The question is: how can I make sure the watermark is updated when reading from PubSub using the timestamp attribute?
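For completeness, the read in question looks roughly like this (attribute name illustrative). One thing worth double-checking: Beam requires the timestamp attribute value to be either milliseconds since the Unix epoch or an RFC 3339 string, and, as I understand it, the watermark is then estimated from the backlog of that attribute, so sparse or lagging values can hold it back:
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

// The attribute value must be millis-since-epoch or an RFC 3339 timestamp;
// anything else breaks the watermark tracking.
PCollection<String> messages =
    p.apply("ReadPubSub",
        PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/events")
            .withTimestampAttribute("eventTime"));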
The Google Cloud Speech API's streaming recognition requires the length of a streaming session to be less than 60 seconds. To handle streaming beyond this limitation, we need to split the streaming data into several chunks, using something like single_utterance, for example. When submitting such a chunk, we use RequestObserver.onCompleted() to mark the end of the streaming session. However, it seems the gRPC threads that were created to handle streaming keep running even after the final results arrive, causing the error "io.grpc.StatusRuntimeException: OUT_OF_RANGE: Exceeded maximum allowed stream duration of 65 seconds".
Is there any other mechanism that could be used to properly terminate the gRPC threads, so that they won't run until the allowed 60-second limit? (It seems we could use SpeechClient.close() or SpeechClient.shutdown() to release all background resources, but that would require recreating the SpeechClient instances, which would be somewhat heavy.)
Are there any other recommended ways to stream data beyond the 60-second limit without splitting the stream into several chunks?
Parameters: [Encoding=LINEAR16, Rate=44100]
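For context, here is a minimal sketch of the chunked approach using the gax ClientStream API instead of the RequestObserver style we currently use; it lets one session be half-closed (closeSend()) and the next one opened via splitCall() on the same SpeechClient, rather than recreating the client (the class and flow are illustrative, not our production code):
import com.google.api.gax.rpc.ClientStream;
import com.google.api.gax.rpc.ResponseObserver;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.StreamingRecognizeRequest;
import com.google.cloud.speech.v1.StreamingRecognizeResponse;

public class ChunkedStreamer {
    private final SpeechClient client;  // created once, shared by all sessions
    private ClientStream<StreamingRecognizeRequest> stream;

    public ChunkedStreamer(SpeechClient client) { this.client = client; }

    public void startSession(ResponseObserver<StreamingRecognizeResponse> observer,
                             StreamingRecognizeRequest configRequest) {
        // splitCall opens a fresh gRPC stream on the shared client.
        stream = client.streamingRecognizeCallable().splitCall(observer);
        stream.send(configRequest);     // the first request carries the config
    }

    public void sendAudio(StreamingRecognizeRequest audioRequest) {
        stream.send(audioRequest);
    }

    public void endSession() {
        // Half-close the stream; the server finishes and tears down its side,
        // so nothing is left running into the 60-second limit.
        stream.closeSend();
    }
}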
I'm using TwiML to collect the user's input.
What I'm seeing is a significant delay of 4-6 seconds from the time the user stops speaking to the time my service (endpoint) is called. This happens even with very simple sentences (e.g. "my name is John").
Is this a known issue? From a user-experience point of view it is not great.
I tried to add a 'filler', but it does not have any effect, because the earliest I can start it is when the endpoint is called.
Maybe there is a way to play a file in parallel while the audio is converted to text.
From the documentation:
The 'timeout' attribute sets the limit in seconds that Twilio will wait for the caller to press another digit or say another word before moving on and making a request to the 'action' URL. For example, if 'timeout' is '3', Twilio will wait three seconds for the caller to press another key or say another word before submitting the previously entered digits or speech to the 'action' URL. Twilio waits until completing the execution of all nested verbs before beginning the timeout period.
The delay of between 4 and 6 seconds that you are seeing is probably explained by the default timeout setting of 5 seconds.
Have you tried using a partialResultCallback URL? If set, Twilio will submit the results of speech recognition in real time to this URL. It's also worth adding hints if you expect callers to say certain words, as this can speed up recognition.
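For example, with the Twilio Java helper library the TwiML could look something like this (a sketch only; the URLs and hints are placeholders, and the builder methods are quoted from memory, so check them against the SDK version you use):
import com.twilio.twiml.VoiceResponse;
import com.twilio.twiml.voice.Gather;
import com.twilio.twiml.voice.Say;

// Gather speech, stream partial results as they are recognized,
// and hint the phrases we expect to hear.
Gather gather = new Gather.Builder()
        .inputs(Gather.Input.SPEECH)
        .timeout(2)                                           // wait 2s instead of the default 5s
        .hints("my name is, John, Jane")                      // expected phrases can speed up recognition
        .partialResultCallback("https://example.com/partial") // receives interim results in real time
        .action("https://example.com/completed")
        .say(new Say.Builder("Please say your name.").build())
        .build();

VoiceResponse response = new VoiceResponse.Builder().gather(gather).build();
System.out.println(response.toXml());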
We are leaving a voice message (using an MP3) using Twilio's answering machine detection. We are seeing the correct calls to/from the API in our logs (answered by answering machine, post of our recorded message), with no errors.
But of the people we are testing on, only about 1 in 4 actually get a voicemail. The rest receive no voicemail, even though the logs show the correct API calls. What is happening here?
Here is the code that returns the TwiML.
if (Request.Form["AnsweredBy"] != null)
{
    switch (Request.Form["AnsweredBy"])
    {
        case "machine_end_beep":
        case "machine_end_silence":
        case "machine_end_other":
            // The machine's greeting has finished, so leave the message now.
            SaveTwilioMessage(transaction.Campaign.Id.ToString());
            if (!string.IsNullOrWhiteSpace(transaction.Campaign.VoicemailMessageUrl))
            {
                response.Play(transaction.Campaign.VoicemailMessageUrl);
            }
            else
            {
                response.Say(transaction.Campaign.VoicemailMessage, voice: _voice);
            }
            return new TwiMLResult(response);
        case "human":
        case "fax":
        case "unknown":
        default:
            // A person (or unknown) answered; fall through to normal call handling.
            break;
    }
}
And here is the call that generates this:
var call = await CallResource.CreateAsync(
    url: callbackUrl,
    to: new PhoneNumber(phoneNumber),
    from: new PhoneNumber(fromPhone),
    machineDetection: "DetectMessageEnd");
var result = new TelephonicResource(call.Sid);
return result;
Any thoughts?
Twilio developer evangelist here.
When using Twilio's answering machine detection you have two options for detecting a machine. You turn answering machine detection on by supplying a MachineDetection parameter in the REST API request that creates the call.
The MachineDetection parameter can be either Enable or DetectMessageEnd. If you want to leave a message after the answering machine message is complete, you need to use DetectMessageEnd.
Then, when you get the webhook callback for the start of the call, you get an extra parameter, AnsweredBy, which is either machine_end_beep, machine_end_silence, machine_end_other, human, fax or unknown.
If you get machine_end_beep, machine_end_silence or machine_end_other, you should be able to leave your message at that point. Other results you can handle as you would a normal call.
If you just use MachineDetection=Enable, Twilio will attempt to connect you to the call with the result as soon as it has figured out whether it is a human or a machine. If you want to leave a message, I would not choose this option.
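For reference, creating such a call with the Java helper library looks roughly like this, mirroring the C# CreateAsync call in the question (a sketch; the credentials, numbers and URL are placeholders):
import com.twilio.Twilio;
import com.twilio.rest.api.v2010.account.Call;
import com.twilio.type.PhoneNumber;
import java.net.URI;

// DetectMessageEnd makes the webhook fire after the machine's greeting
// finishes, which is the moment to play the voicemail message.
String accountSid = "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"; // placeholder
String authToken = "your_auth_token";                     // placeholder
Twilio.init(accountSid, authToken);

Call call = Call.creator(
        new PhoneNumber("+15005550006"),           // to
        new PhoneNumber("+15005550001"),           // from
        URI.create("https://example.com/voice"))   // webhook that receives AnsweredBy
    .setMachineDetection("DetectMessageEnd")
    .create();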
Let me know if this helps at all.
Answering my own question here. We finally got some support from Twilio after pushing up through sales, after the ticket remained open and unworked for days.
Basically, if you want to pull off AMD successfully, you need to be able to respond within 150ms. On our calls, the voicemails were starting, detecting no sound, and saying "we're sorry, but you're not talking", and only then would our message start. The correction was to do fewer DB lookups in our API calls by changing programming practice, and to move our MP3 somewhere on the East coast (AWS preferred).
We investigated this...and found that while our API response was taking ~1 second, the AMD was sometimes waiting 15+ seconds after the beep to play the message. Still confused.
We are using Twilio AMD to broadcast phone appointment reminders and have similar difficulty with leaving voicemail messages. Having tested this on about 50 calls, we see cases where the AMD detects a machine_end_silence well before the beep that should trigger the Play. So when you listen to the recorded call, the Play occurs at the same time as the "your call has been forwarded to an automated voice messaging service... please leave a message". The API calls all look correct, but the user doesn't receive a voicemail (as the played WAV file ends just before the beep).
We saw other cases where the AMD doesn't hear the beep and instead waits for the machineDetectionTimeout before playing the WAV, leaving a long gap of dead air on the receiver's voicemail. Even within my small development team we saw differences in behavior. For instance, we made test calls to a few iPhones that all had the same default voicemail setup (the setup you get when you haven't recorded a custom greeting) on the same Verizon service plan, so the AMD should be hearing the exact same answer. Yet AMD would detect the beep on some of our phones, but not all of them.
Given all these challenges, we found that it's a good idea to leave a 'longer' version of the message (repeating the critical information a few times). Assuming your message is longer than the machineDetectionTimeout, you at least get some of your message saved on the voicemail.
I am a newbie to multimedia work. I want to capture audio as samples and transfer it to another iOS device over the network. How should I start? I have gone through Apple's multimedia guide and the SpeakHere example; it is full of C++ code, and it writes to a file and then starts services, but I need the buffers directly. Please help me start my work in the correct way.
Thanks in advance
I just spent a bunch of time working on real-time audio. You can use AudioQueue, but it has latency issues of around 100-200ms.
If you want to do something like the T-Pain app, you have to use the
RemoteIO API
Audio Unit API
They are equally difficult to implement, so I would just pick the RemoteIO path.
Source can be found here:
http://atastypixel.com/blog/using-remoteio-audio-unit/
I have upvoted the answer above, but I wanted to add a piece of information that took me a while to figure out. When using AudioQueue for recording, the intuitive notion is that the callback fires at regular intervals corresponding to whatever the number of samples represents. That notion is incorrect: AudioQueue seems to gather samples for a long period of time and then deliver them in very fast iterations of the callback.
In my case, I was using 20ms buffers and receiving 320 samples per callback. When printing out the timestamps of the calls, I noticed a pattern: one call every 2ms, then after a while a gap of ~180ms. Since I was doing VoIP, this presented the symptom of an increasing delay on the receiving end. Switching to Remote I/O seems to have solved the issue.