How to amplify a voice recorded from a distance - iOS

When a person speaks far away from the phone, the recorded voice is quiet; when they speak close to it, the recorded voice is loud. What I want is to play back the human voice at roughly the same volume no matter how far away (within reason) the speaker was from the phone when the voice was recorded.
What I have already tried:
Adjusting the playback volume based on the measured dB level (for example via audio metering, then setting the AVAudioPlayer volume). But the problem is that the dB reading includes all of the environmental sound, so it only works when the human voice varies heavily relative to the background (a sketch of this metering approach appears below).
Then I thought I should find a way to measure the intensity of just the human voice in the recording, which led me to voice recognition. But that is a huge topic, and I cannot narrow down which parts of it would solve my problem.
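For reference, a minimal sketch of the dB-metering approach described above, assuming an AVAudioRecorder is metering the input while an AVAudioPlayer handles playback; the dB-to-volume mapping and thresholds are illustrative only:

```swift
import AVFoundation

// Minimal sketch of the dB-metering approach described in the question.
// Assumes `recorder` and `player` are already configured elsewhere;
// the mapping from dB to playback volume is illustrative only.
final class LevelBasedGain {
    private let recorder: AVAudioRecorder
    private let player: AVAudioPlayer

    init(recorder: AVAudioRecorder, player: AVAudioPlayer) {
        self.recorder = recorder
        self.player = player
        recorder.isMeteringEnabled = true   // required for averagePower(forChannel:)
    }

    // Call periodically (e.g. from a Timer) while recording.
    func updatePlaybackVolume() {
        recorder.updateMeters()
        // averagePower is in dBFS: 0 dB = full scale, roughly -60 dB for near-silence.
        let level = recorder.averagePower(forChannel: 0)
        // Map roughly -60...0 dBFS onto a 1.0...0.25 playback volume
        // (boost quiet input, attenuate loud input). Purely illustrative.
        let normalized = max(0, min(1, (level + 60) / 60))
        player.volume = 1.0 - 0.75 * normalized
    }
}
```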

A voice recorded from a distance suffers from significant corruption. One problem is noise; another is echo. To amplify it you need to clean the voice of echo and noise. Ideally you would do that with a better microphone, but if only a single microphone is available you have to apply signal processing. The signal-processing algorithms you are interested in are:
Noise cancellation. You can find many examples via Google, from simple to very advanced.
Echo cancellation. Again, you can find many implementations.
There is no ready-made library for the above; you will have to implement a large part yourself. You can look at the WebRTC code, which has both noise and echo cancellation, as described in this question:
Is it possible to reduce background noise while streaming audio on the iPhone?
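As a lighter-weight alternative to integrating WebRTC's audio processing: on iOS 13 and later, AVAudioEngine's input node can enable Apple's built-in voice processing, which applies echo cancellation and some noise suppression. A minimal sketch, not a substitute for the full cleanup discussed above:

```swift
import AVFoundation

// Sketch: enable Apple's built-in voice processing (iOS 13+), which applies
// echo cancellation and some noise suppression to the microphone input.
// This is an alternative to integrating WebRTC's audio processing module.
func makeVoiceProcessingEngine() throws -> AVAudioEngine {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord, mode: .voiceChat)
    try session.setActive(true)

    let engine = AVAudioEngine()
    // Must be enabled before the engine starts; throws if unsupported.
    try engine.inputNode.setVoiceProcessingEnabled(true)

    let format = engine.inputNode.outputFormat(forBus: 0)
    engine.inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        // Processed (echo-cancelled) microphone audio arrives here;
        // write it to a file or feed it to further processing.
    }
    try engine.start()
    return engine
}
```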

Related

How to improve Google speech recognition performance with pre-processing

When I try Google speech recognition, it performs poorly on Traditional Chinese audio files with background noise. Can I improve recognition performance with some pre-processing (such as speech enhancement)? Does that work with the Google speech service?
I would suggest that you go through the page in the Google Cloud Speech documentation outlining best practices on how to provide speech data to the service, including recommendations for pre-processing.
Keep the recording as close to the original speech signal as possible: no distortion, no clipping, no added noise, and no artificial pre-processing such as noise suppression or automatic gain control. I think that kind of pre-processing can damage the useful information in speech signals.
I have copied the key points from Google below:
Position the microphone as close as possible to the person that is speaking, particularly when background noise is present.
Avoid audio clipping.
Do not use automatic gain control (AGC).
All noise reduction processing should be disabled.
Listen to some sample audio. It should sound clear, without distortion or unexpected noise.
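On iOS specifically, one way to honor the "no AGC, no noise reduction" advice when capturing audio for recognition is to record with the audio session's measurement mode, which minimizes system-supplied signal processing. A minimal sketch; the recorder settings are illustrative (16 kHz mono linear PCM is a common choice for speech services):

```swift
import AVFoundation

// Sketch: capture audio with minimal system-supplied processing
// (no OS-level AGC / noise suppression), per the guidance above.
// The recorder settings are illustrative only.
func startCleanRecording(to url: URL) throws -> AVAudioRecorder {
    let session = AVAudioSession.sharedInstance()
    // .measurement mode minimizes the amount of system-supplied signal processing.
    try session.setCategory(.record, mode: .measurement)
    try session.setActive(true)

    let settings: [String: Any] = [
        AVFormatIDKey: Int(kAudioFormatLinearPCM),
        AVSampleRateKey: 16_000,
        AVNumberOfChannelsKey: 1,
        AVLinearPCMBitDepthKey: 16,
        AVLinearPCMIsFloatKey: false
    ]
    let recorder = try AVAudioRecorder(url: url, settings: settings)
    recorder.record()
    return recorder
}
```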

How to equalize all iOS audio streams?

I have decided to write an equalizer for iOS that would allow the user to change the level of different audio frequency bands to improve audibility for people with hearing problems. For example, my left ear has poor sensitivity to high frequencies, and I would like to be able to boost the high frequencies in all applications (Skype, YouTube, etc.), including voice calls over the cellular connection. How could this be implemented?
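For the audio your own app plays, a high-frequency boost can be applied with AVAudioUnitEQ and a high-shelf band; a minimal sketch follows (the frequency and gain values are illustrative). Equalizing other apps' audio or cellular calls system-wide is a different matter and is not covered by this sketch.

```swift
import AVFoundation

// Sketch: boost high frequencies for audio played by your own app using
// AVAudioUnitEQ with a high-shelf band. This does not affect other apps
// or cellular calls; it only illustrates the per-app EQ part.
func makeHighShelfPlayer(fileURL: URL) throws -> AVAudioEngine {
    let engine = AVAudioEngine()
    let player = AVAudioPlayerNode()
    let eq = AVAudioUnitEQ(numberOfBands: 1)

    let band = eq.bands[0]
    band.filterType = .highShelf
    band.frequency = 4_000      // Hz; illustrative shelf corner
    band.gain = 12              // dB boost for the high-frequency-loss example
    band.bypass = false

    let file = try AVAudioFile(forReading: fileURL)
    engine.attach(player)
    engine.attach(eq)
    engine.connect(player, to: eq, format: file.processingFormat)
    engine.connect(eq, to: engine.mainMixerNode, format: file.processingFormat)

    try engine.start()
    player.scheduleFile(file, at: nil)
    player.play()
    return engine
}
```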

How can I apply guitar effects such as dive (pitch shift) or wah-wah (compression) to guitar audio samples played on an iOS app?

I am building an iOS app that allows the user to play guitar sounds - e.g. plucking or strumming.
I'd like to allow the user to apply pitch shifting or wah-wah (compression) on the guitar sound being played.
Currently, I am using audio samples of the guitar sound.
I've done some basic reading on DSP and audio synthesis, but I'm no expert. I have looked at libraries such as Csound and the STK, and it appears that the sounds they produce are synthesized (i.e. not played back from audio samples). I am not sure how to apply them, or whether I can use them to apply effects such as pitch shifting or wah-wah to audio samples.
Can someone point me in the right direction for this?
You can use open-source audio processing libraries. Essentially, you are getting audio samples in, and you need to process them and send them out again as samples. The processing can be done by these libraries, or you can use one of your own. Here's one DSP library (disclaimer: I wrote it). Look at the process(float,float) method for any of the classes to see how this is done.
Wah-wah and compression are two completely different effects. Wah-wah is a lowpass filter whose center frequency varies slowly, whereas compression is a method of evening out the volume. The above library has a Compressor class that you can check out.
The STK does have effects classes as well, not just synthesis classes; JCRev is one for reverb. But I would highly recommend staying away from them, as they are really hard to compile and maintain.
If you haven't seen it already, check out Julius Smith's excellent and comprehensive book Physical Audio Signal Processing.
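If pulling in a full DSP library feels heavy, Apple's built-in effect nodes cover part of what the question asks for: AVAudioUnitTimePitch can pitch-shift audio samples played through AVAudioEngine (a wah or compressor would still need one of the libraries above or a custom filter). A minimal sketch:

```swift
import AVFoundation

// Sketch: play a guitar sample with a pitch shift applied, using the
// built-in AVAudioUnitTimePitch node. A wah or compressor effect would
// need a custom filter or one of the libraries mentioned above.
func playPitchShifted(sampleURL: URL, semitones: Float) throws -> AVAudioEngine {
    let engine = AVAudioEngine()
    let player = AVAudioPlayerNode()
    let pitch = AVAudioUnitTimePitch()
    pitch.pitch = semitones * 100   // the pitch property is measured in cents

    let file = try AVAudioFile(forReading: sampleURL)
    engine.attach(player)
    engine.attach(pitch)
    engine.connect(player, to: pitch, format: file.processingFormat)
    engine.connect(pitch, to: engine.mainMixerNode, format: file.processingFormat)

    try engine.start()
    player.scheduleFile(file, at: nil)
    player.play()
    return engine
}
```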

Analyze voice patterns on iOS

I am looking for a way / library to analyze voice patterns. Say, there are 6 people in the room. I want to identify each one by voice.
Any hints are much appreciated.
Dmitry
The task of taking a long contiguous audio recording and splitting it up into chunks in which only one speaker is speaking, without any prior knowledge of the voice characteristics of each speaker, is called "speaker diarization". You can find links to research code on the Wikipedia page.
If you have prior recordings of each voice and would rather do classification, this is a slightly different problem (speaker recognition or speaker identification). Software tools for that are available as well (note that general-purpose speech recognition packages like Sphinx or HTK are flexible enough to be coaxed into doing this).
This is also answered here: https://dsp.stackexchange.com/questions/3119/library-to-differentiate-people-by-their-voice-timbre

Recognizing individual voices

I plan to write conversation analysis software that will recognize the individual speakers and their pitch and intensity. Pitch and intensity are somewhat straightforward (pitch via autocorrelation).
How would I go about recognizing the individual speakers, so that I can record each one's features? Will storing some heuristics for each speaker's frequencies be enough? I can assume that only one person speaks at a time (strictly non-overlapping). I can also assume that, for training, each speaker can record a minute's worth of data before the actual analysis.
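Since the question mentions pitch via autocorrelation, here is a minimal sketch of that part: a naive time-domain estimate over a single frame. Real implementations add windowing, interpolation, and a voicing check; the 60–500 Hz search range is an illustrative assumption.

```swift
// Naive autocorrelation pitch estimate, as mentioned in the question.
// Searches lags corresponding to roughly 60–500 Hz and returns the
// frequency whose lag maximizes the autocorrelation.
// Real implementations add windowing, interpolation and voicing checks.
func estimatePitch(samples: [Float], sampleRate: Float) -> Float? {
    let minLag = Int(sampleRate / 500)   // highest pitch considered (~500 Hz)
    let maxLag = Int(sampleRate / 60)    // lowest pitch considered (~60 Hz)
    guard minLag > 0, samples.count > maxLag else { return nil }

    var bestLag = 0
    var bestScore: Float = 0
    for lag in minLag...maxLag {
        var score: Float = 0
        for i in 0..<(samples.count - lag) {
            score += samples[i] * samples[i + lag]
        }
        if score > bestScore {
            bestScore = score
            bestLag = lag
        }
    }
    return bestLag > 0 ? sampleRate / Float(bestLag) : nil
}
```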
Pitch and intensity on their own tell you nothing. You really need to analyse how pitch varies. In order to identify different speakers you need to transform the speech audio into some kind of feature space, and then make comparisons against your database of speakers in this feature space. The general term that you might want to Google for is prosody - see e.g. http://en.wikipedia.org/wiki/Prosody_(linguistics). While you're Googling you might also want to read up on speaker identification aka speaker recognition, see e.g. http://en.wikipedia.org/wiki/Speaker_identification
If you are still working on this... are you using speech-recognition on the sound input? Because Microsoft SAPI for example provides the application with a rich API for digging into the speech sound wave, which could make the speaker-recognition problem more tractable. I think you can get phoneme positions within the waveform. That would let you do power-spectrum analysis of vowels, for example, which could be used to generate features to distinguish speakers. (Before anybody starts muttering about pitch and volume, keep in mind that the formant curves come from vocal-tract shape and are fairly independent of pitch, which is vocal-cord frequency, and the relative position and relative amplitude of formants are (relatively!) independent of overall volume.) Phoneme duration in-context might also be a useful feature. Energy distribution during 'n' sounds could provide a 'nasality' feature. And so on. Just a thought. I expect to be working in this area myself.
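For the power-spectrum analysis of vowels mentioned above, Accelerate's vDSP FFT routines are the usual starting point on iOS. A minimal sketch that returns squared magnitudes for one power-of-two frame; windowing and formant peak-picking are not shown.

```swift
import Accelerate

// Sketch: squared-magnitude spectrum of one power-of-two frame, a starting
// point for the vowel/formant analysis discussed above. No windowing or
// formant peak-picking is shown.
func powerSpectrum(frame: [Float]) -> [Float] {
    let n = frame.count
    guard n > 0, n & (n - 1) == 0 else { return [] }          // require power of two
    let log2n = vDSP_Length(log2(Float(n)))
    guard let setup = vDSP_create_fftsetup(log2n, FFTRadix(kFFTRadix2)) else { return [] }
    defer { vDSP_destroy_fftsetup(setup) }

    var real = [Float](repeating: 0, count: n / 2)
    var imag = [Float](repeating: 0, count: n / 2)
    var magnitudes = [Float](repeating: 0, count: n / 2)

    real.withUnsafeMutableBufferPointer { realPtr in
        imag.withUnsafeMutableBufferPointer { imagPtr in
            var split = DSPSplitComplex(realp: realPtr.baseAddress!, imagp: imagPtr.baseAddress!)
            // Pack real samples into the split-complex layout vDSP_fft_zrip expects.
            frame.withUnsafeBufferPointer { framePtr in
                framePtr.baseAddress!.withMemoryRebound(to: DSPComplex.self, capacity: n / 2) {
                    vDSP_ctoz($0, 2, &split, 1, vDSP_Length(n / 2))
                }
            }
            vDSP_fft_zrip(setup, &split, 1, log2n, FFTDirection(FFT_FORWARD))
            // Squared magnitudes of the half-spectrum.
            vDSP_zvmags(&split, 1, &magnitudes, 1, vDSP_Length(n / 2))
        }
    }
    return magnitudes
}
```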
