Extracting commentary from match video - signal-processing

I'm working on event classification for sports videos and as a part of it, I was looking to extract information from the commentator's excited tone. Since the frequency of human voice is bound by a range, can I just extract that from the audio signal on a time scale and work with that? I've tried using the fdesign.bandpass function, but don't know how to proceed further with it.
Or is there a better approach to doing this?

Related

Calculate the percentage of accuracy with which user made the assigned sound

I want to design a web-app for my cousin who is 2 years of age in which i have implemented a functionality in which when an image is clicked some sound gets played and the user has to make the same sound which gets recorded.
For eg-If i click on image of "Apple" the sound made is "A for Apple".Now the user has to say those words which get recorded.
Now I want to calculate the percentage of accuracy with which the user spoke.I want to know how can i know the accuracy percentage.I have not used machine learning or Natural Language Processing earlier so i want some guidance on what should i learn about or ways of implementing this functionality.I need some help on that.
Also use nodejs frameworks quite frequently so is there any module in nodejs with the help of which the above requirement can be fulfilled.
What you want to reach is a quite complex and non-trivial task that can be faced at several levels. First of all, you should answer a question in before for yourself:
What do you mean with "accuarcy"? Which metric do you want to use for that? Accuracy means to compare a result with its optimum. So what would be the optimum of saying "Apple"?
I think there are several levels on which you could measure speech accuracy:
On the audio level: Here are several correlation metrics that can compute the similarity of two audio files. See e.g. here for more details. SImply said, the idea is directly comparing the audio samples. In your case, you would need a reference audio track that is the "correct" result. The correct time alignment might become a problem though.
On the level of speech recognition: You could use a speech recognizer -- commercial or open source -- and return a string of spoken words. In this case you should think about when the recording is stopped, to limit the record length. Then you have to think about a metric that evaluates the correctness of the transcription. Some that I worked with are Levensthein-Distance or Word-Error-Rate. Wit these you can compute a similarity.

Filtering background noise frequency

I would like to ask if it is possible to filter the frequency of the human voice only via AudioKit or otherwise. I want to create an emotional analyzer based on these frequencies from a human voice, but the problem is that the microphone captures all the frequencies around me. Is there any way to remove this?
And next, I would like to ask if it is possible to recognize which one is just talking. I mean conversation between two people.
Thank you in advance for a possible answer.

How long should audio samples be for music/speech discrimination?

I am working on a convolutional neural net which takes an audio spectrogram to discriminate between music and speech using the GTZAN dataset
If single samples are shorter, then this gives more samples overall. But if samples are too short, then they may lack important features?
How much data is needed for recognizing if a piece of audio is music or speech?
How long should the audio samples be ideally?
The length of audios vary on number of factors.
The basic idea is to get just enough samples.
Since audio changes constantly, it is preferred to work on a shorter data. However, very small frame would result into less/no feature to be captured.
On the other hand very large sample would capture too many features, thereby leading to complexity.
So, in most usecases, although the ideal audio length is 25seconds, but it is not a written rule and you may manipulate it accordingly.Just make sure the frame size is not very small or very large.
Update for dataset
Check this link for dataset of 30s
How much data is needed for recognizing if a piece of audio is music or speech?
If someone knew the answer to this question exactly then the problem would be solved already :)
But seriously, it depends on what your downstream application will be. Imagine trying to discriminate between speech with background music vs acapella singing (hard) or classifying orchestral music vs audio books (easy).
How long should the audio samples be ideally?
Like everything in machine learning, it depends on the application. For you, I would say test with at least 10, 20, and 30 secs, or something like that. You are correct in that the spectral values can change rather drastically depending on the length!

Recognize sound based on recorded library of sounds

I am trying to create an iOS app that will perform an action when it detects a clapping sound.
Things I've tried:
1) My first approach was to simply measure the overall power using an AVAudioRecorder. This worked OK but it could get set off by talking too loud, other noises, etc so I decided to take a different approach.
2) I then implemented some code that uses a FFT to get the frequency and magnitude of the live streaming audio from the microphone. I found that the clap spike generally resides in the 13kHZ-20kHZ range while most talking resides in a lot lower frequencies. I then implemented a simple thresh-hold in this frequency range, and this worked OK, but other sounds could set it off. For example, dropping a pencil on the table right next to my phone would pass this thresh-hold and be counted as a clap.
3) I then tried splitting this frequency range up into a couple hundred bins and then getting enough data where when a sound passed that thresh-hold my app would calculate the Z-Score (probability from statistics) and if the Z-Score was good, then could that as a clap. This did not work at all as some claps were not recognized and some other sounds were recognized.
Graph:
To try to help me understand how to detect claps, I made this graph in Excel (each graph has around 800 data points) and it covers the 13kHZ-21kHZ range:
Where I am now:
Even after all of this, I am still not seeing how to recognize a clap versus other sounds.
Any help is greatly appreciated!

Recognizing individual voices

I plan to write a conversation analysis software, which will recognize the individual speakers, their pitch and intensity. Pitch and intensity are somewhat straightforward (pitch via autocorrelation).
How would I go about recognizing individual speakers, so I can record his/her features? Will storing some heuristics for each speaker's frequencies be enough? I can assume that only one person speaks at a time (strictly non-overlapping). I can also assume that for training, each speaker can record a minute's worth of data before actual analysis.
Pitch and intensity on their own tell you nothing. You really need to analyse how pitch varies. In order to identify different speakers you need to transform the speech audio into some kind of feature space, and then make comparisons against your database of speakers in this feature space. The general term that you might want to Google for is prosody - see e.g. http://en.wikipedia.org/wiki/Prosody_(linguistics). While you're Googling you might also want to read up on speaker identification aka speaker recognition, see e.g. http://en.wikipedia.org/wiki/Speaker_identification
If you are still working on this... are you using speech-recognition on the sound input? Because Microsoft SAPI for example provides the application with a rich API for digging into the speech sound wave, which could make the speaker-recognition problem more tractable. I think you can get phoneme positions within the waveform. That would let you do power-spectrum analysis of vowels, for example, which could be used to generate features to distinguish speakers. (Before anybody starts muttering about pitch and volume, keep in mind that the formant curves come from vocal-tract shape and are fairly independent of pitch, which is vocal-cord frequency, and the relative position and relative amplitude of formants are (relatively!) independent of overall volume.) Phoneme duration in-context might also be a useful feature. Energy distribution during 'n' sounds could provide a 'nasality' feature. And so on. Just a thought. I expect to be working in this area myself.

Resources