I am looking for a way / library to analyze voice patterns. Say, there are 6 people in the room. I want to identify each one by voice.
Any hints are much appreciated.
Dmitry
The task of taking a long contiguous audio recording and splitting it into chunks in which only one speaker is speaking, without any prior knowledge about the voice characteristics of each speaker, is called "speaker diarization". You can find links to research code on the Wikipedia page.
If you have prior recordings of each voice and would rather do classification, this is a slightly different problem (speaker recognition or speaker identification). Software tools for that are available here (note that general-purpose speech recognition packages like Sphinx or HTK are flexible enough to be coaxed into doing that).
Answered here https://dsp.stackexchange.com/questions/3119/library-to-differentiate-people-by-their-voice-timbre
I am working on a convolutional neural net that takes an audio spectrogram as input to discriminate between music and speech, using the GTZAN dataset.
If single samples are shorter, then this gives more samples overall. But if samples are too short, then they may lack important features?
How much data is needed for recognizing if a piece of audio is music or speech?
How long should the audio samples be ideally?
The right audio length depends on a number of factors.
The basic idea is to use samples that are just long enough.
Since audio changes constantly, it is preferable to work on shorter pieces of data. However, a very short frame captures few or no useful features.
On the other hand, a very long sample captures too many features, which adds complexity.
So although 25 seconds is given as the ideal audio length for most use cases, it is not a written rule and you may adjust it as needed. Just make sure the frame size is neither very small nor very large.
Update for dataset
Check this link for a dataset of 30-second clips.
How much data is needed for recognizing if a piece of audio is music or speech?
If someone knew the answer to this question exactly then the problem would be solved already :)
But seriously, it depends on what your downstream application will be. Imagine trying to discriminate between speech with background music vs acapella singing (hard) or classifying orchestral music vs audio books (easy).
How long should the audio samples be ideally?
Like everything in machine learning, it depends on the application. For you, I would say test with at least 10, 20, and 30 secs, or something like that. You are correct in that the spectral values can change rather drastically depending on the length!
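To make the trade-off between sample length and sample count concrete, here is a minimal sketch that cuts one clip into fixed-length segments. The function name `split_into_segments`, the optional overlap via `hop_seconds`, and the 22050 Hz sample rate are assumptions for illustration, not something taken from the question; check them against your own files.

```python
def split_into_segments(samples, sample_rate, segment_seconds, hop_seconds=None):
    """Cut a 1-D sequence of audio samples into fixed-length segments.

    A trailing piece shorter than segment_seconds is dropped.
    """
    if hop_seconds is None:
        hop_seconds = segment_seconds  # no overlap by default
    seg_len = int(round(segment_seconds * sample_rate))
    hop_len = int(round(hop_seconds * sample_rate))
    if seg_len <= 0 or hop_len <= 0:
        raise ValueError("segment and hop lengths must be positive")
    return [samples[start:start + seg_len]
            for start in range(0, len(samples) - seg_len + 1, hop_len)]


# Usage: how many training samples does one 30-second clip yield?
sample_rate = 22050                      # assumed sample rate
clip = [0.0] * (30 * sample_rate)        # stand-in for one decoded clip
for seconds in (1, 5, 10, 30):
    segments = split_into_segments(clip, sample_rate, seconds)
    print(seconds, "s segments:", len(segments), "per clip")
```

Shorter segments multiply the number of training examples, but each spectrogram then covers less context, so it is worth comparing validation accuracy across a few lengths rather than fixing one up front.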
When a person speaks far away from a mobile phone, the recorded voice is quiet.
When a person speaks near the phone, the recorded voice is loud. What I want is to play back the human voice at the same volume no matter how far (within reason) the speaker was from the phone when the voice was recorded.
What I have already tried:
Adjusting the volume based on the dB level, for example with AVAudioPlayer. The problem is that the dB level includes all the environmental sound, so this only works when the level of the human voice varies heavily.
Then I thought I should find a way to sample the intensity of the human voice in the recording, which led me to voice recognition. But this is a huge topic, and I cannot narrow it down to the areas that could solve my problem.
Voice recorded from a distance suffers from significant corruption. One problem is noise, another is echo. To amplify the voice you first need to clean it of echo and noise. Ideally you would do that with a better microphone, but if only a single microphone is available you have to apply signal processing. The signal processing algorithms you are interested in are:
Noise cancellation. You can find many samples on Google, from simple to very advanced ones.
Echo cancellation. Again, you can find many implementations.
There is no ready-made library that does the above, so you will have to implement a large part yourself. You can look at the WebRTC code, which has both noise and echo cancellation, as described in this question:
Is it possible to reduce background noise while streaming audio on the iPhone?
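As a much simpler starting point than full noise or echo cancellation, the level measurement can at least be restricted to frames that stand out from the background, so that environmental sound does not dominate the dB reading. The sketch below is only that: it does not remove noise or echo, and the gain it returns amplifies the background along with the voice. All names and defaults (`voice_gated_gain`, the 10th-percentile noise floor, the 10 dB margin, the -20 dBFS target) are assumptions chosen for illustration. Samples are assumed to be floats in [-1, 1].

```python
import math


def rms(frame):
    """Root-mean-square level of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def to_dbfs(level, floor=1e-10):
    """Convert a linear level to dB relative to full scale (1.0)."""
    return 20.0 * math.log10(max(level, floor))


def voice_gated_gain(samples, frame_len=480, target_dbfs=-20.0,
                     margin_db=10.0, max_gain_db=30.0):
    """Return a linear gain that brings the louder frames to target_dbfs.

    Frames within margin_db of the estimated noise floor are ignored,
    so a long stretch of background sound does not drag the estimate down.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    levels = [to_dbfs(rms(f)) for f in frames]
    if not levels:
        return 1.0
    # Crude noise floor: the 10th percentile of the frame levels.
    noise_floor = sorted(levels)[len(levels) // 10]
    active = [lv for lv in levels if lv > noise_floor + margin_db]
    if not active:
        return 1.0
    # Average the power (not the dB values) of the active frames.
    mean_power = sum(10.0 ** (lv / 10.0) for lv in active) / len(active)
    voice_db = 10.0 * math.log10(mean_power)
    gain_db = min(target_dbfs - voice_db, max_gain_db)
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples, gain):
    """Scale the samples and hard-clip to the valid range."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

On a real recording the "loud frames" may still be a slamming door rather than speech, which is why a proper voice activity detector and noise suppression are the more robust route.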
I decided to write an equalizer for iOS that would let the user change the level of individual audio frequencies to improve audibility for people with hearing problems. For example, my left ear cannot hear high frequencies well, and I would like to be able to boost the high frequencies in all applications (Skype, YouTube, etc.), including voice calls over the cellular connection. How could this be implemented? Sorry for my bad English.
I am building an iOS app that allows the user to play guitar sounds - e.g. plucking or strumming.
I'd like to allow the user to apply pitch shifting or wah-wah (compression) on the guitar sound being played.
Currently, I am using audio samples of the guitar sound.
I've done some basic read-ups on DSP and audio synthesis, but I'm no expert in it. I saw libraries such as csound and stk, and it appears that the sounds they produced are synthesized (i.e. not played from audio samples). I am not sure how to apply them, or if I can use them to apply effects such as pitch shifting or wah-wah to audio samples.
Can someone point me in the right direction for this?
You can use open-source audio processing libraries. Essentially, you are getting audio samples in and you need to process them and send them as samples out. The processing can be done by these libraries, or you use one of your own. Here's one DSP-Library (Disclaimer: I wrote this). Look at the process(float,float) method for any of the classes to see how one does this.
Wah-wah and compression are two completely different effects. Wah-wah is a resonant band-pass filter whose center frequency is swept slowly, whereas compression reduces the dynamic range so that loud and quiet passages end up closer in volume. The above library has a Compressor class that you can check out.
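To show how different the two effects are in practice, here is a rough sketch of each, operating on plain lists of float samples in [-1, 1]. This is not the code from the library mentioned above. The wah uses a Chamberlin state-variable filter whose center frequency is swept by a sine LFO; the compressor is a static per-sample gain curve with no attack or release smoothing, which a real compressor would need. The function names and every default value are assumptions for illustration.

```python
import math


def sine_tone(freq_hz, seconds, sample_rate):
    """Helper: generate a unit-amplitude sine tone."""
    count = int(seconds * sample_rate)
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
            for n in range(count)]


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def wah(samples, sample_rate, lfo_hz=1.5, f_min=500.0, f_max=3000.0,
        damping=0.1):
    """Resonant filter with an LFO-swept center frequency."""
    low = band = 0.0
    q = 2.0 * damping                      # lower q means a sharper peak
    out = []
    for n, x in enumerate(samples):
        # Sine LFO mapped onto [f_min, f_max].
        lfo = 0.5 * (1.0 + math.sin(2.0 * math.pi * lfo_hz * n / sample_rate))
        fc = f_min + (f_max - f_min) * lfo
        f = 2.0 * math.sin(math.pi * fc / sample_rate)
        # Chamberlin state-variable filter update.
        low += f * band
        high = x - low - q * band
        band += f * high
        # Scale the band-pass output so the gain at fc is roughly 1.
        out.append(q * band)
    return out


def compress(samples, threshold_db=-20.0, ratio=4.0):
    """Static compressor: levels above the threshold are reduced by ratio."""
    out = []
    for x in samples:
        level_db = 20.0 * math.log10(max(abs(x), 1e-10))
        if level_db > threshold_db:
            gain_db = (threshold_db - level_db) * (1.0 - 1.0 / ratio)
            x *= 10.0 ** (gain_db / 20.0)
        out.append(x)
    return out
```

Both functions take samples in and return samples out, so they can be chained on a decoded guitar sample before playback. Pitch shifting is a separate and harder problem that neither of these addresses.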
The STK does have effects classes as well, not just synthesis classes (JCRev is one, for reverb), but I would highly recommend staying away from it, as it is really hard to compile and maintain.
If you haven't seen this already, check out Julius Smith's excellent and comprehensive book, Physical Audio Signal Processing.
I need to write a speech detection algorithm (not speech recognition).
At first I thought I just have to measure the microphone power and compare it to some threshold value. But the problem gets much harder once you have to take the ambient sound level into consideration (for example in a pub a simple power threshold is crossed immediately because of other people talking).
So in the second version I thought I have to measure the current power spikes against the average sound level or something like that. Coding this idea proved to be quite hairy for me, at which point I decided it might be time to research already existing solutions.
Do you know of some general algorithm description for speech detection? Existing code or library in C/C++/Objective-C is also fine, be it commercial or free.
P.S. I guess there is a difference between “speech” and “sound” recognition, with the first one only responding to frequencies close to human speech range. I’m fine with the second, simpler case.
The key phrase to Google for is Voice Activity Detection (VAD). It is implemented widely in telecoms, particularly in Acoustic Echo Cancellation (AEC).
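For the simpler "sound above the ambient level" case described in the question, a very basic energy-based detector can be sketched as follows. It compares each frame's energy with a slowly adapting noise-floor estimate instead of a fixed threshold. This is an illustration of the idea only, not the algorithm used by any particular telecom VAD; the function names and all defaults (6 dB ratio, smoothing factors, 5-frame hangover) are assumptions, and it assumes the recording starts with background sound only.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def detect_activity(samples, frame_len=160, ratio_db=6.0,
                    floor_alpha=0.05, floor_rise=1.01, hangover=5):
    """Return one bool per frame: True where sound exceeds the ambient level.

    The noise floor follows the energy of quiet frames and creeps upward
    during active frames, so a lasting rise in ambient level (a busy pub)
    is eventually absorbed into the floor.
    """
    ratio = 10.0 ** (ratio_db / 10.0)
    noise = None
    hang = 0
    flags = []
    for energy in frame_energies(samples, frame_len):
        if noise is None:
            noise = max(energy, 1e-12)   # first frame is taken as background
        if energy > noise * ratio:
            hang = hangover              # keep the gate open a little longer
            noise *= floor_rise          # slow upward drift while active
            flags.append(True)
        else:
            # Track the background with an exponential moving average.
            noise = max((1.0 - floor_alpha) * noise + floor_alpha * energy,
                        1e-12)
            flags.append(hang > 0)       # hangover frames still count as active
            hang = max(hang - 1, 0)
    return flags
```

With 8 kHz audio, the default `frame_len=160` corresponds to 20 ms frames. Telling speech apart from other loud sounds needs more than energy (for example spectral features), which is what production VAD implementations add.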