How to improve Google speech recognition performance with pre-processing - google-cloud-speech

When I try Google speech recognition, it performs poorly on Traditional Chinese audio files with background noise. Can I improve recognition performance by pre-processing the audio first (for example, with speech enhancement)? Does that work with the Google speech service?

I would suggest that you go through the best practices page in the Google Cloud Speech documentation, which outlines how to provide speech data to the service, including recommendations for pre-processing.

Keep the recording as close to the original speech signal as possible: no distortion, no clipping, no noise, and no artificial pre-processing such as noise suppression or automatic gain control. I think this kind of pre-processing can damage the useful information in the speech signal.
I copied the key points from Google's documentation and pasted them below:
Position the microphone as close as possible to the person that is speaking, particularly when background noise is present.
Avoid audio clipping.
Do not use automatic gain control (AGC).
All noise reduction processing should be disabled.
Listen to some sample audio. It should sound clear, without distortion or unexpected noise.
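As a rough way to check the "avoid audio clipping" point before sending audio to the service, you could count how many samples sit at full scale. This is only a minimal sketch: the function name `clipping_ratio` and the sample values are made up for illustration, and it assumes 16-bit signed PCM samples already decoded into a list of integers.

```python
def clipping_ratio(samples, full_scale=32767):
    """Return the fraction of samples at or beyond full scale (likely clipped).

    Assumes 16-bit signed PCM values; `full_scale` is an assumption you
    should change for other sample formats.
    """
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= full_scale)
    return clipped / len(samples)


# Hypothetical samples: two of the eight values hit full scale.
pcm = [0, 1200, 32767, 32767, -500, -20000, 100, 7]
print(clipping_ratio(pcm))  # 0.25
```

A ratio noticeably above zero suggests the recording level was too high. It does not replace listening to the audio, as the last point recommends.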

Related

Low Level Speech Recognition in iOS

From what I have seen online, Apple dev APIs enable use of speech recognition but only at a very high level.
Is there a way to access low-level speech recognition tools so that, for example, the speech can be used to recognize a particular individual or determine the mood of the speaker? Of course, these would require bespoke algorithms, but I am asking whether or not the data would be available.
Thanks.

How to amplify the voice recorded from far distance

When a person speaks far away from a mobile phone, the recorded voice is quiet.
When a person speaks near the phone, the recorded voice is loud. What I want is to play back the human voice at equal volume no matter how far (within reason) the speaker was from the phone when the voice was recorded.
What I have already tried:
Adjusting the volume based on the measured dB level, for example with AVAudioPlayer. The problem is that the dB level includes all the environmental sound, so this only works when the level of the human voice varies heavily.
Then I thought I should find a way to sample the intensity of the human voice in the recording, which led me to voice recognition. But this is a huge topic, and I cannot narrow it down to the areas that could solve my problem.
A voice recorded from a distance suffers from significant corruption. One problem is noise; another is echo. To amplify the voice, you first need to clean it of echo and noise. Ideally you would do that with a better microphone, but if only a single microphone is available you have to apply signal processing. The signal processing algorithms you are interested in are:
Noise cancellation. You can find many samples on Google, from simple to very advanced ones.
Echo cancellation. Again, you can find many implementations.
There is no ready-made library that does all of the above, so you will have to implement a large part yourself. You can look at the WebRTC code, which has both noise and echo cancellation, as described in this question:
Is it possible to reduce background noise while streaming audio on the iPhone?
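To illustrate why the cleaning step matters, here is a minimal sketch of the levelling idea from the question: measure the RMS only over samples marked as speech and scale the whole clip toward a target level. It is not noise or echo cancellation. The function name `normalize_speech_level`, the `speech_mask` input, and the target value are assumptions for illustration; the mask would have to come from some voice detector.

```python
import math


def normalize_speech_level(samples, speech_mask, target_rms=3000.0, full_scale=32767):
    """Scale a clip so that its speech-only RMS matches `target_rms`.

    `speech_mask` is a hypothetical per-sample list of booleans marking
    where speech is present. Output is clamped to the 16-bit range.
    """
    speech = [s for s, is_speech in zip(samples, speech_mask) if is_speech]
    if not speech:
        return list(samples)  # nothing to measure, leave the clip unchanged
    rms = math.sqrt(sum(s * s for s in speech) / len(speech))
    if rms == 0:
        return list(samples)
    gain = target_rms / rms
    return [max(-full_scale, min(full_scale, int(round(s * gain)))) for s in samples]


# Hypothetical clip: two speech samples followed by two background samples.
clip = [100, -100, 10, -10]
mask = [True, True, False, False]
print(normalize_speech_level(clip, mask, target_rms=1000.0))  # [1000, -1000, 100, -100]
```

Note that the background samples are amplified by the same gain as the voice, which is exactly why a distant recording needs noise and echo removed before it is boosted.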

How to extract human voice from an audio clip, using machine learning?

How can we use machine learning to extract the human voice from an audio clip that may contain a lot of noise across the whole frequency range?
As in any ML application, the process is simple: collect samples, design features, train the classifier. For the samples, you can use your own noisy recordings, or you can find many noise recordings in web sound collections like freesound.org. For the features, you can use mean-normalized mel-frequency cepstral coefficients; you can find an implementation in the CMUSphinx speech recognition toolkit. For the classifier, you can pick a GMM or an SVM. If you have enough data, it will work fairly well.
To improve the accuracy, you can add the assumption that noise and voice are continuous. For that reason, you can analyze the detection history with a hangover scheme (essentially an HMM) to detect voice chunks instead of analyzing every frame individually.
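The hangover idea can be sketched without any ML library. The example below assumes a per-frame classifier has already produced a list of booleans, and it uses a plain counter rather than a real HMM. The names `apply_hangover` and `speech_chunks` are made up for illustration.

```python
def apply_hangover(frame_flags, hangover=2):
    """Keep the speech decision on for `hangover` frames after the last speech frame."""
    smoothed = []
    remaining = 0
    for is_speech in frame_flags:
        if is_speech:
            remaining = hangover      # restart the hangover counter
            smoothed.append(True)
        elif remaining > 0:
            remaining -= 1            # bridge a short gap
            smoothed.append(True)
        else:
            smoothed.append(False)
    return smoothed


def speech_chunks(flags):
    """Return (start, end) frame index pairs for each run of True; end is exclusive."""
    chunks = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            chunks.append((start, i))
            start = None
    if start is not None:
        chunks.append((start, len(flags)))
    return chunks


# Hypothetical raw per-frame decisions with a one-frame dropout at index 3.
raw = [False, True, True, False, True, False, False, False, False, False, True]
print(speech_chunks(apply_hangover(raw, hangover=2)))  # [(1, 7), (10, 11)]
```

Without the hangover, the dropout at index 3 would split the first chunk in two. The cost is that each chunk is extended by up to `hangover` frames past its true end.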

How do I do a decent speech detection?

I need to write a speech detection algorithm (not speech recognition).
At first I thought I just have to measure the microphone power and compare it to some threshold value. But the problem gets much harder once you have to take the ambient sound level into consideration (for example in a pub a simple power threshold is crossed immediately because of other people talking).
So in the second version I thought I have to measure the current power spikes against the average sound level or something like that. Coding this idea proved to be quite hairy for me, at which point I decided it might be time to research already existing solutions.
Do you know of some general algorithm description for speech detection? Existing code or library in C/C++/Objective-C is also fine, be it commercial or free.
P.S. I guess there is a difference between “speech” and “sound” recognition, with the first one only responding to frequencies close to human speech range. I’m fine with the second, simpler case.
The key phrase that you need to Google for is Voice Activity Detection (VAD) – it's implemented widely in telecomms, particularly in Acoustic Echo Cancellation (AEC).
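For the "compare against the average sound level" idea in the question, a very simple energy-based detector could look like the sketch below. It is not how any particular VAD library works. The function names, the `ratio` and `alpha` values, and the assumption that the first frame contains only background sound are all illustrative choices.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(frames, ratio=3.0, alpha=0.05, min_floor=1.0):
    """Flag frames whose RMS exceeds `ratio` times a running noise floor.

    The floor starts at the first frame's level (assumed to be background)
    and is updated only on frames judged to be non-speech.
    """
    flags = []
    floor = None
    for frame in frames:
        rms = frame_rms(frame)
        if floor is None:
            floor = max(rms, min_floor)
        is_speech = rms > ratio * floor
        if not is_speech:
            # Slowly track the ambient level with an exponential average.
            floor = max((1 - alpha) * floor + alpha * rms, min_floor)
        flags.append(is_speech)
    return flags


# Hypothetical frames: quiet background, then two loud frames, then quiet again.
quiet = [10, -10, 10, -10]
loud = [100, -100, 100, -100]
print(detect_speech([quiet, quiet, loud, loud, quiet]))  # [False, False, True, True, False]
```

In a noisy place like a pub, the floor rises with the ambient level, so only sound clearly louder than the surroundings is flagged. Telling speech apart from other loud sounds needs more than energy.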

Are there built-in developer tools for speech recognition in iOS?

How can I detect that speech has started in an audio file? I only need to detect the start and stop of the speech, without recognition.
Thank you.
Check out this app
http://developer.apple.com/library/ios/#samplecode/SpeakHere/Introduction/Intro.html
you can tinker with this sample code a little to get what you need...
Here is one more link that I have come across
http://developer.apple.com/library/ios/#samplecode/aurioTouch/Introduction/Intro.html#//apple_ref/doc/uid/DTS40007770
You could use a pitch detector to listen for the presence of harmonic tones within the range of human speech. I don't know of any pitch detector for iOS, though. I wrote my own, and it was very hard.
Dirac does pitch detection, but I don't know how accurate it is because I don't want to spend £1000 on the licence.
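For reference, one common textbook approach to pitch detection is autocorrelation: look for the lag at which a frame best matches a shifted copy of itself. The sketch below is not the answerer's detector and not what Dirac does. The function name `estimate_pitch`, the 80 to 400 Hz search range, and the 0.3 threshold are assumptions chosen for illustration.

```python
import math


def estimate_pitch(samples, sample_rate, fmin=80.0, fmax=400.0, min_corr=0.3):
    """Estimate the fundamental frequency in Hz, or None if no clear periodicity.

    Searches lags corresponding to fmin..fmax and picks the one with the
    highest autocorrelation, normalized by the frame energy.
    """
    n = len(samples)
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), n - 1)
    energy = sum(s * s for s in samples)
    if energy == 0 or min_lag < 1 or max_lag <= min_lag:
        return None
    best_lag, best_corr = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(samples[i] * samples[i + lag] for i in range(n - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    if best_lag is None or best_corr < min_corr:
        return None
    return sample_rate / best_lag


# Hypothetical test tone: 0.1 s of a 200 Hz sine sampled at 8 kHz.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(800)]
print(estimate_pitch(tone, rate))
```

A frame that returns a frequency inside the speech range is a hint that voiced speech is present. Real recordings are much messier than a pure tone, which is likely part of why writing a robust detector is hard.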

Resources