Speech-to-Text REST API is slow. How can it be improved? - google-cloud-speech

I am using the Google Cloud Speech API (REST) for speech-to-text conversion. It takes almost 2.8 seconds to convert 2 words. Is there any way to improve this and bring it down to below 1 second?

Based on the Speech API best practices, you can use StreamingRecognize with the single_utterance property to optimize recognition for short utterances and minimize the latency of your calls.
Additionally, check the frame size to verify you are not using a very large frame, as that can add latency. A 100-millisecond frame size is recommended as a good tradeoff between latency and efficiency.
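As a rough illustration, here is a minimal sketch of streaming recognition with single_utterance using the google-cloud-speech Python client. The audio format (16 kHz, 16-bit mono LINEAR16), the ~100 ms chunk size, and the file name are assumptions, and client library versions differ slightly in how the streaming config is passed.

```python
# Minimal sketch: streaming recognition with single_utterance.
# Assumes 16 kHz, 16-bit mono LINEAR16 audio; "utterance.raw" is a placeholder.
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,          # assumed sample rate
    language_code="en-US",
)
streaming_config = speech.StreamingRecognitionConfig(
    config=config,
    single_utterance=True,            # end the stream as soon as the utterance finishes
    interim_results=True,             # receive partial results before the final one
)

def request_generator(path, chunk_size=3200):   # ~100 ms of 16 kHz 16-bit mono audio
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield speech.StreamingRecognizeRequest(audio_content=chunk)

responses = client.streaming_recognize(
    config=streaming_config,
    requests=request_generator("utterance.raw"),
)
for response in responses:
    for result in response.results:
        print(result.is_final, result.alternatives[0].transcript)
```

With single_utterance enabled, the service ends the stream as soon as it detects the end of speech, which is what cuts the latency for short phrases.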

Related

How to improve Google speech recognition performance with pre-processing

When I try Google speech recognition, it performs poorly on a traditional Chinese audio file with background noise. Can I improve recognition performance with some pre-processing (like speech enhancement)? Does that work with the Google speech service?
I would suggest that you go through this page in the Google Cloud Speech documentation, which outlines best practices for providing speech data to the service, including recommendations for pre-processing.
Keep the recording as close to the original speech signal as possible: no distortion, no clipping, no noise, and no artificial pre-processing such as noise suppression and automatic gain control. I think that kind of pre-processing can damage the useful information in speech signals.
I copied the key points from Google and pasted them below.
Position the microphone as close as possible to the person that is speaking, particularly when background noise is present.
Avoid audio clipping.
Do not use automatic gain control (AGC).
All noise reduction processing should be disabled.
Listen to some sample audio. It should sound clear, without distortion or unexpected noise.
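If you want to sanity-check recordings programmatically rather than by ear, a small sketch like the one below can flag clipping and very quiet audio before you send it to the service. It assumes a 16-bit mono PCM WAV file and numpy; the thresholds are arbitrary starting points, not recommended values.

```python
# Rough pre-flight check for audio quality before sending it to the API.
# Assumes a 16-bit mono PCM WAV file; thresholds are arbitrary starting points.
import wave
import numpy as np

def check_recording(path, clip_level=0.999, quiet_dbfs=-40.0):
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    x = samples.astype(np.float64) / 32768.0

    clipped = np.mean(np.abs(x) >= clip_level)               # fraction of full-scale samples
    rms_dbfs = 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

    if clipped > 0.001:
        print(f"Warning: {clipped:.2%} of samples look clipped")
    if rms_dbfs < quiet_dbfs:
        print(f"Warning: recording is very quiet ({rms_dbfs:.1f} dBFS)")
    return clipped, rms_dbfs
```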

How long should audio samples be for music/speech discrimination?

I am working on a convolutional neural net which takes an audio spectrogram as input to discriminate between music and speech, using the GTZAN dataset.
If individual samples are shorter, that gives more samples overall. But if samples are too short, might they lack important features?
How much data is needed for recognizing if a piece of audio is music or speech?
How long should the audio samples be ideally?
The ideal audio length depends on a number of factors.
The basic idea is to get just enough samples.
Since audio changes constantly, it is preferable to work with shorter segments. However, a very small frame would result in few or no features being captured.
On the other hand, a very large sample would capture too many features, adding complexity.
So, in most use cases the ideal audio length is around 25 seconds, but that is not a hard rule and you may adjust it accordingly. Just make sure the frame size is not very small or very large.
Update for the dataset
Check this link for a dataset of 30-second clips.
How much data is needed for recognizing if a piece of audio is music or speech?
If someone knew the answer to this question exactly then the problem would be solved already :)
But seriously, it depends on what your downstream application will be. Imagine trying to discriminate between speech with background music vs acapella singing (hard) or classifying orchestral music vs audio books (easy).
How long should the audio samples be ideally?
Like everything in machine learning, it depends on the application. For you, I would say test with at least 10, 20, and 30 secs, or something like that. You are correct in that the spectral values can change rather drastically depending on the length!
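One practical way to run that comparison is to cut each clip into fixed-length windows and train the same network on each length. The sketch below assumes librosa is available and GTZAN-style 30-second clips; the sample rate and mel-spectrogram parameters are arbitrary example choices, not tuned values.

```python
# Sketch: cut each (30 s) clip into fixed-length windows so the same CNN can be
# trained on 10 s, 20 s, or 30 s inputs and the lengths compared.
# Assumes librosa; sample rate and mel parameters are arbitrary example choices.
import librosa
import numpy as np

def clip_to_spectrogram_windows(path, window_s=10.0, sr=22050,
                                n_fft=2048, hop_length=512, n_mels=128):
    y, sr = librosa.load(path, sr=sr, mono=True)
    win = int(window_s * sr)
    windows = []
    for start in range(0, len(y) - win + 1, win):        # non-overlapping windows
        seg = y[start:start + win]
        mel = librosa.feature.melspectrogram(y=seg, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        windows.append(librosa.power_to_db(mel, ref=np.max))
    return np.stack(windows)     # shape: (num_windows, n_mels, frames)
```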

How to track rate of speech

I am developing an iPhone app that tracks rate of speech, and I am hoping to use Nuance Speechkit (https://developer.nuance.com/public/Help/DragonMobileSDKReference_iOS/SpeechKit_Guide/Basics.html).
Is there a way to track rate of speech (e.g., updating WPM every few seconds) with the framework? Right now it seems to just do speech-to-text at the end of a long utterance, as opposed to every word or so (i.e., returning partial results).
There are easier ways; for example, you can use CMUSphinx with a phonetic recognizer to recognize just phonemes instead of words. It would work locally on the device and would be very fast. From the rate of phones you can estimate the rate of words with pretty high accuracy.
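As a rough sketch of that idea, the class below turns a stream of timestamped phone segments (from any phonetic recognizer, such as CMUSphinx in allphone mode) into a rolling words-per-minute estimate. The phones-per-word constant, the window length, and the silence labels are assumptions you would calibrate against real transcripts.

```python
# Sketch: rolling WPM estimate from timestamped phone segments.
# PHONES_PER_WORD is an assumed average -- calibrate it on your own data.
from collections import deque

PHONES_PER_WORD = 3.0     # rough average for conversational English (assumption)
WINDOW_S = 5.0            # length of the rolling window in seconds

class SpeechRateTracker:
    def __init__(self):
        self.phone_times = deque()             # end times of non-silence phones

    def add_phone(self, phone, end_time_s):
        if phone in ("SIL", "sil", "+NSN+"):   # skip silence/noise markers (names vary)
            return self.wpm(end_time_s)
        self.phone_times.append(end_time_s)
        return self.wpm(end_time_s)

    def wpm(self, now_s):
        # Drop phones that fell out of the rolling window, then scale to a per-minute rate.
        while self.phone_times and now_s - self.phone_times[0] > WINDOW_S:
            self.phone_times.popleft()
        words = len(self.phone_times) / PHONES_PER_WORD
        return words * (60.0 / WINDOW_S)
```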

High speed design using FPGA

I'm designing a high-speed FIR filter on an FPGA. Currently my sampling rate is 3600 MSPS, but the clock supported by the device is 350 MHz. Please suggest how to go about multiple instantiation or a parallel implementation of the FIR filter so that it meets the design requirement.
Also suggest how to pass samples to the parallel implementation.
It's difficult to answer your question based on the information you have given.
The first question I would ask myself is: can you reduce the sample rate at all? 3600 MSPS is very high. The sample rate only needs to be this high if you are truly supporting data requiring that bandwidth.
Assuming you really do need that rate, then in order to implement an FIR filter running at such a high sample rate, you will need to parallelise the architecture as you suggested. Such a structure is generally straightforward to implement. An example approach is shown here:
http://en.wikipedia.org/wiki/Parallel_Processing_%28DSP_implementation%29#Parallel_FIR_Filters
Each clock cycle you will pass a parallel word into each filter section, and extract a word from the combined filter output.
But only you know the requirements and constraints of your FPGA design; you will have to craft the FIR filter according to your requirements.
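To make the block-parallel idea concrete, here is a numerical model in Python (not HDL) of the structure: each "slow" clock consumes P new input samples and produces P outputs, so each lane runs at fs / P. With fs = 3600 MSPS and a ~350 MHz fabric clock you need at least ceil(3600 / 350) = 11 lanes; 12 is used below purely as a convenient example, and the tap values are random placeholders.

```python
# Numerical model of a block-parallel FIR; verifies it matches a serial FIR.
import numpy as np

def serial_fir(x, h):
    return np.convolve(x, h)[:len(x)]

def parallel_fir(x, h, P=12):
    n_blocks = -(-len(x) // P)                       # ceil division
    xp = np.concatenate([np.zeros(len(h) - 1), x,    # history of len(h)-1 past samples
                         np.zeros(n_blocks * P - len(x))])
    y = np.zeros(n_blocks * P)
    for b in range(n_blocks):                        # one iteration = one slow clock cycle
        for p in range(P):                           # P lanes computed concurrently in hardware
            n = b * P + p                            # absolute output sample index
            y[n] = np.dot(xp[n:n + len(h)], h[::-1]) # same taps, different delayed window
    return y[:len(x)]

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
h = rng.standard_normal(63)          # example 63-tap filter
assert np.allclose(parallel_fir(x, h), serial_fir(x, h))
```

In hardware, the inner loop over p becomes P physical multiply-accumulate datapaths fed from a wide input word, which is how the samples are passed to the parallel implementation.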

How do I do a decent speech detection?

I need to write a speech detection algorithm (not speech recognition).
At first I thought I just have to measure the microphone power and compare it to some threshold value. But the problem gets much harder once you have to take the ambient sound level into consideration (for example in a pub a simple power threshold is crossed immediately because of other people talking).
So in the second version I thought I have to measure the current power spikes against the average sound level or something like that. Coding this idea proved to be quite hairy for me, at which point I decided it might be time to research already existing solutions.
Do you know of some general algorithm description for speech detection? Existing code or library in C/C++/Objective-C is also fine, be it commercial or free.
P.S. I guess there is a difference between “speech” and “sound” recognition, with the first one only responding to frequencies close to human speech range. I’m fine with the second, simpler case.
The key phrase that you need to Google for is Voice Activity Detection (VAD) – it's implemented widely in telecoms, particularly in Acoustic Echo Cancellation (AEC).
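If you just want something working quickly, one option is the webrtcvad package, a Python wrapper around the WebRTC voice activity detector (the question asked for C/C++/Objective-C, and the underlying WebRTC VAD is C, so the same detector is usable there too). The sketch below assumes 16 kHz, 16-bit mono PCM input.

```python
# Minimal VAD sketch using the webrtcvad package.
# Assumes 16 kHz, 16-bit, mono PCM audio.
import webrtcvad

vad = webrtcvad.Vad(2)          # aggressiveness 0 (least) .. 3 (most)

SAMPLE_RATE = 16000
FRAME_MS = 30                                      # webrtcvad accepts 10, 20 or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples -> 2 bytes each

def speech_frames(pcm_bytes):
    """Yield (frame_index, is_speech) for each complete frame in a PCM buffer."""
    for i in range(0, len(pcm_bytes) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm_bytes[i:i + FRAME_BYTES]
        yield i // FRAME_BYTES, vad.is_speech(frame, SAMPLE_RATE)
```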

Resources