I am trying to create a neural network that performs speaker recognition. I would like to be able to serve it such that it takes streaming audio - i.e. I want to perform partial recognition on 100ms frames and then calculate an average at the end.
I would like to know which of the following two foreseeable options is better.
1. Training the network on audio clips of 100 ms.
2. Using audio clips of arbitrary length and feeding successive 100 ms segments into some sort of recurrent network. I was thinking that, similar to text analysis, maintaining some state information could be useful in real-time speaker identification.
Does anyone have some guidance in this regard? Thanks.
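For concreteness, here is a minimal sketch of the two options I have in mind, assuming PyTorch and one log-mel feature vector per 100 ms frame (the feature size, layer sizes, and class names are placeholders, not an actual design):

```python
# Sketch of the two serving strategies, assuming one log-mel feature vector
# per 100 ms frame. Sizes and names are illustrative placeholders.
import torch
import torch.nn as nn

N_MELS = 40        # assumed features per 100 ms frame
EMB_DIM = 128
N_SPEAKERS = 10    # placeholder

class FrameEncoder(nn.Module):
    """Option 1: embed each 100 ms frame independently, average at the end."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_MELS, 256), nn.ReLU(),
            nn.Linear(256, EMB_DIM))
        self.head = nn.Linear(EMB_DIM, N_SPEAKERS)

    def forward(self, frames):               # frames: (batch, n_frames, N_MELS)
        emb = self.net(frames).mean(dim=1)   # average the per-frame embeddings
        return self.head(emb)

class StatefulEncoder(nn.Module):
    """Option 2: a recurrent layer carries state across streamed frames."""
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(N_MELS, EMB_DIM, batch_first=True)
        self.head = nn.Linear(EMB_DIM, N_SPEAKERS)

    def forward(self, frame, hidden=None):   # frame: (batch, 1, N_MELS)
        out, hidden = self.gru(frame, hidden)
        return self.head(out[:, -1]), hidden  # keep `hidden` between 100 ms chunks
```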
I am trying to build an iOS app where I have a mono-channel input (real-time from the mic) and a two-channel impulse response; the input needs to be convolved in real time with the impulse response to produce a two-channel (stereo) output. Is there a way to do that on iOS with Apple's Audio Toolbox?
You should first decide whether you will be doing the convolution in the time domain or the frequency domain; each has benefits depending on the length of your signal and impulse response. This is something you should research yourself.
For the time domain, rolling your own convolution should be straightforward enough. For the frequency domain you will need an FFT function; you could roll your own, but more efficient implementations already exist. For example, the Accelerate framework provides one.
But for basic audio I/O, Audio Toolbox is a valid choice.
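For illustration, here is a rough sketch in Python/NumPy (not Audio Toolbox or Accelerate code) of the two approaches; for real-time, block-by-block processing you would also need an overlap-add or overlap-save scheme on top of this:

```python
# Contrast of time-domain vs. FFT-based (frequency-domain) convolution.
# The same idea maps onto the FFT routines in Accelerate on iOS.
import numpy as np
from scipy.signal import fftconvolve

def convolve_stereo(mono, ir_stereo):
    """mono: (N,) input; ir_stereo: (M, 2) impulse response -> (N+M-1, 2) stereo."""
    left = fftconvolve(mono, ir_stereo[:, 0])    # frequency-domain, fast for long IRs
    right = fftconvolve(mono, ir_stereo[:, 1])
    return np.stack([left, right], axis=1)

# Time-domain equivalent (fine for short impulse responses):
#   left = np.convolve(mono, ir_stereo[:, 0])
```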
There are certain machine learning algorithms in use that take video files as input. If I have to pull all the videos from YouTube that are associated with a certain tag and provide them as input to this algorithm, what should my input format be?
There is no format in which you can pass a video file directly to a machine learning algorithm, since it won't understand the contents of the video on its own.
You need to preprocess the video first; how depends on how you intend to use it. In general you can do something like converting each frame of the video to a numeric array or CSV (the same as preprocessing an image), which you can then pass to your machine learning algorithm. If you want to process the frames sequentially, you may want to use a recurrent neural network. Also, if the video has audio, extract its audio time series and combine each part of the time series with its corresponding video frame.
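As a rough sketch (using OpenCV, which is just one possible choice; the helper name and sizes are made up for illustration), frame extraction might look like this:

```python
# Decode a video into per-frame arrays that an ML model can consume.
import cv2
import numpy as np

def video_to_frames(path, size=(224, 224)):
    """Return the video as a (n_frames, H, W, 3) uint8 array."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))
    cap.release()
    return np.stack(frames) if frames else np.empty((0, *size, 3), dtype=np.uint8)

# frames = video_to_frames("clip.mp4")  # feed frames (or frame sequences) to the model
```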
I am working on a convolutional neural net which takes an audio spectrogram as input to discriminate between music and speech, using the GTZAN dataset.
If individual samples are shorter, this gives more samples overall. But if samples are too short, might they lack important features?
How much data is needed for recognizing if a piece of audio is music or speech?
How long should the audio samples be ideally?
The appropriate audio length depends on a number of factors.
The basic idea is to get just enough samples.
Since audio changes constantly, it is preferable to work on shorter segments. However, a very small frame would result in few or no features being captured.
On the other hand, a very large sample would capture too many features at once, leading to complexity.
So, in most use cases, a clip length of around 25 seconds works well, but it is not a written rule and you may adjust it accordingly. Just make sure the frame size is neither very small nor very large.
Update for the dataset
Check this link for a dataset of 30 s clips.
How much data is needed for recognizing if a piece of audio is music or speech?
If someone knew the answer to this question exactly then the problem would be solved already :)
But seriously, it depends on what your downstream application will be. Imagine trying to discriminate between speech with background music vs a cappella singing (hard) or classifying orchestral music vs audio books (easy).
How long should the audio samples be ideally?
Like everything in machine learning, it depends on the application. For you, I would say test with at least 10, 20, and 30 secs, or something like that. You are correct in that the spectral values can change rather drastically depending on the length!
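As a rough sketch (assuming librosa; the file name and parameters are placeholders), you could slice each clip into windows of different lengths and compare the resulting mel spectrograms:

```python
# Slice a GTZAN-style clip into fixed-length windows and compute a
# mel spectrogram per window, to compare the effect of clip length.
import librosa
import numpy as np

def clip_spectrograms(path, window_s=10.0, sr=22050, n_mels=128):
    y, sr = librosa.load(path, sr=sr)
    hop = int(window_s * sr)                      # window length in samples
    specs = []
    for start in range(0, len(y) - hop + 1, hop):
        seg = y[start:start + hop]
        m = librosa.feature.melspectrogram(y=seg, sr=sr, n_mels=n_mels)
        specs.append(librosa.power_to_db(m, ref=np.max))
    return specs                                  # one (n_mels, frames) array per window

# for w in (10, 20, 30):
#     print(w, [s.shape for s in clip_spectrograms("blues.00000.wav", window_s=w)])
```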
When a person speaks far away from a mobile phone, the recorded voice is quiet.
When a person speaks near the phone, the recorded voice is loud. What I want is to play back the human voice at the same volume no matter how far away (within reason) the speaker was from the phone when the voice was recorded.
What I have already tried:
1. Adjusting the volume based on the dB level (e.g. with AVAudioPlayer). But the problem is that the dB level includes all the environmental sound, so it only works when the human voice varies heavily. (A rough sketch of what I mean is shown after this list.)
2. I then thought I should find a way to sample the intensity of the human voice in the recording, which led me to voice recognition. But this is a huge topic, and I cannot narrow down which areas could solve my problem.
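For reference, here is a minimal sketch (plain NumPy, not my actual AVAudioPlayer code; the target level and block size are arbitrary) of the kind of dB/RMS-based volume adjustment I mean:

```python
# Per-block RMS normalization of float audio in [-1, 1]. As noted above,
# this also boosts environmental noise, not just the voice.
import numpy as np

def normalize_rms(x, target_rms=0.1, block=4096, max_gain=20.0):
    out = np.asarray(x, dtype=np.float32).copy()
    for start in range(0, len(out), block):
        seg = out[start:start + block]
        rms = np.sqrt(np.mean(seg ** 2)) + 1e-9
        gain = min(target_rms / rms, max_gain)    # cap the gain so silence isn't blown up
        out[start:start + block] = seg * gain
    return np.clip(out, -1.0, 1.0)
```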
Voice recorded from a distance suffers from significant corruption. One problem is noise, another is echo. To amplify it you need to clean the voice of echo and noise. Ideally you would do that with a better microphone, but if only a single microphone is available you have to apply signal processing. The signal processing algorithms you are interested in are:
- Noise cancellation. You can find many samples on Google, from simple to very advanced ones (a minimal sketch of the idea is shown below).
- Echo cancellation. Again, you can find many implementations.
There is no ready-made library that does all of the above; you will have to implement a large part yourself. You can look at the WebRTC code, which has both noise and echo cancellation, as described in this question:
Is it possible to reduce background noise while streaming audio on the iPhone?
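This is not the WebRTC implementation, just a minimal spectral-subtraction sketch (NumPy/SciPy) of the idea behind the noise cancellation step, assuming the first half second of the recording is noise only:

```python
# Crude spectral subtraction: estimate the noise spectrum from the first
# noise_s seconds and subtract it from every frame's magnitude spectrum.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(x, fs, noise_s=0.5):
    f, t, X = stft(x, fs, nperseg=512)                          # hop = 256 by default
    noise_frames = max(1, int(noise_s * fs / 256))
    noise_mag = np.abs(X[:, :noise_frames]).mean(axis=1, keepdims=True)
    mag = np.maximum(np.abs(X) - noise_mag, 0.05 * np.abs(X))   # floor to limit musical noise
    _, y = istft(mag * np.exp(1j * np.angle(X)), fs, nperseg=512)
    return y
```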
How can we use machine learning to extract the human voice from an audio clip that may contain a lot of noise across the whole frequency domain?
As in any ML application, the process is simple: collect samples, design features, train the classifier. For the samples you can use your own noisy recordings, or you can find plenty of noise in web sound collections like freesound.org. For the features you can use mean-normalized mel-frequency cepstral coefficients; you can find an implementation in the CMUSphinx speech recognition toolkit. For the classifier you can pick a GMM or an SVM. If you have enough data it will work fairly well.
To improve accuracy you can add the assumption that noise and voice are continuous; for that reason you can analyze the detection history with a hangover scheme (essentially an HMM) to detect voice chunks instead of analyzing every frame individually.
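A hedged sketch of that recipe, using librosa and scikit-learn rather than CMUSphinx (paths, labels, and the hold length are placeholders):

```python
# Mean-normalized MFCC features + an SVM frame classifier, plus a simple
# hangover smoother over the per-frame decisions.
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)
    return m - m.mean(axis=0)                               # mean normalization

# X: stacked per-frame features, y: 1 = voice, 0 = noise (labels come from your data)
# X = np.vstack([mfcc_features(p) for p in paths])
# clf = SVC().fit(X, y)
# frame_decisions = clf.predict(mfcc_features("test.wav"))

def hangover(frame_decisions, hold=10):
    """Keep 'voice' active for `hold` frames after the last voiced frame."""
    out, count = [], 0
    for d in frame_decisions:
        count = hold if d else max(count - 1, 0)
        out.append(1 if count > 0 else 0)
    return out
```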