Filtering background noise frequency - ios

I would like to ask whether it is possible, via AudioKit or otherwise, to filter the input so that only the frequencies of the human voice remain. I want to create an emotion analyzer based on the frequencies of a human voice, but the problem is that the microphone captures all the frequencies around me. Is there any way to remove them?
I would also like to ask whether it is possible to recognize who is currently talking, for example in a conversation between two people.
Thank you in advance for a possible answer.

Related

Calculate the percentage of accuracy with which user made the assigned sound

I want to design a web app for my cousin, who is 2 years old. When an image is clicked, a sound is played, and the user has to make the same sound, which is recorded.
For example, if I click on the image of an apple, the sound played is "A for Apple". The user then has to say those words, which are recorded.
Now I want to calculate the percentage of accuracy with which the user spoke. How can I determine this accuracy percentage? I have not used machine learning or natural language processing before, so I would like some guidance on what I should learn or on ways of implementing this functionality.
I also use Node.js frameworks quite frequently, so is there a Node.js module that can fulfill the above requirement?
What you want to achieve is a complex, non-trivial task that can be approached at several levels. First of all, you should answer one question for yourself:
What do you mean by "accuracy"? Which metric do you want to use for it? Accuracy means comparing a result with its optimum. So what would be the optimum of saying "Apple"?
I think there are several levels on which you could measure speech accuracy:
On the audio level: There are several correlation metrics that can compute the similarity of two audio files. See e.g. here for more details. Simply put, the idea is to compare the audio samples directly. In your case, you would need a reference audio track that is the "correct" result. Getting the time alignment right might become a problem, though.
On the level of speech recognition: You could use a speech recognizer, commercial or open source, that returns a string of spoken words. In this case you should think about when the recording is stopped, to limit its length. Then you need a metric that evaluates the correctness of the transcription. Some that I have worked with are Levenshtein distance and word error rate. With these you can compute a similarity.
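As a rough illustration of the transcription-level metrics mentioned above, here is a minimal Python sketch of Levenshtein distance and a word error rate built on top of it. It assumes you already have the recognizer's output as a plain string; the function names and the `accuracy_percent` mapping (100 minus the error rate, clamped at 0) are illustrative choices, not part of any particular library.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hyp_words) / len(ref_words)


def accuracy_percent(reference, hypothesis):
    """Hypothetical score: 100% for a perfect match, never below 0%."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis)) * 100.0
```

For example, `accuracy_percent("A for Apple", "A four Apple")` counts one wrong word out of three. Whether such a score is meaningful for a two-year-old's speech depends on how well the recognizer handles it in the first place.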

Is it feasible to transfer data with iphone's headphone jack

Question 1: Is it feasible? (As far as I can tell from searching Google, it is feasible. However, I need a more definitive answer.)
Question 2: Say I have a device that generates a square wave. How can I read the message it sends?
As a beginner, I want to know which class I should pay attention to.
Thanks, any info will be appreciated.
This is a great question. I've put together a C library which does this, so you might be able to adapt it for iOS.
Library: https://github.com/quiet/quiet
Live Demo: https://quiet.github.io/quiet-js/lab.html
With a sound port, you want to avoid square waves. They make inefficient use of the range of amplitudes you have available, and they are not very spectrally efficient. The most basic modulation people typically use here is frequency shift keying. My library offers that (as Gaussian minimum shift keying) but also more advanced modes like phase shift keying and quadrature amplitude modulation. I've managed to reach transfer speeds of 64 kbps using this library.
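To make the idea of frequency shift keying concrete, here is a minimal Python sketch of plain binary FSK: each bit becomes a short burst of one of two tones, and the receiver decides which tone is stronger in each bit-length window using the Goertzel algorithm. This is not the quiet library's API and not GMSK; the sample rate, baud rate, and tone frequencies are assumptions chosen so that each bit holds a whole number of cycles, and the demodulator assumes it is already aligned to the bit boundaries.

```python
import math

SAMPLE_RATE = 44100                    # samples per second (assumed)
BAUD = 300                             # bits per second (assumed)
FREQ_ZERO = 1200.0                     # tone for a 0 bit, in Hz (assumed)
FREQ_ONE = 2400.0                      # tone for a 1 bit, in Hz (assumed)
SAMPLES_PER_BIT = SAMPLE_RATE // BAUD  # 147 samples per bit


def modulate(bits):
    """Turn a list of 0/1 bits into audio samples in the range [-1, 1]."""
    samples = []
    phase = 0.0
    for bit in bits:
        freq = FREQ_ONE if bit else FREQ_ZERO
        step = 2.0 * math.pi * freq / SAMPLE_RATE
        for _ in range(SAMPLES_PER_BIT):
            samples.append(math.sin(phase))
            phase += step  # carry the phase over to avoid jumps between bits
    return samples


def tone_power(chunk, freq):
    """Goertzel algorithm: power of a single frequency within a chunk."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / SAMPLE_RATE)
    s_prev, s_prev2 = 0.0, 0.0
    for x in chunk:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def demodulate(samples):
    """Recover bits by comparing the two tone powers in each bit window."""
    bits = []
    for start in range(0, len(samples) - SAMPLES_PER_BIT + 1, SAMPLES_PER_BIT):
        chunk = samples[start:start + SAMPLES_PER_BIT]
        is_one = tone_power(chunk, FREQ_ONE) > tone_power(chunk, FREQ_ZERO)
        bits.append(1 if is_one else 0)
    return bits
```

A real link over a headphone jack would also need synchronization, framing, and error handling, which this sketch leaves out.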

Extracting commentary from match video

I'm working on event classification for sports videos and as a part of it, I was looking to extract information from the commentator's excited tone. Since the frequency of human voice is bound by a range, can I just extract that from the audio signal on a time scale and work with that? I've tried using the fdesign.bandpass function, but don't know how to proceed further with it.
Or is there a better approach to doing this?

How can I graph the intonation of a voice sample?

I want to make an iOS app that graphs the intonation (the rise and fall of the pitch of the voice) of an audio sample recorded by the user. Intonation is very important in various languages around the world, and this would be an attempt to practice intonation as well as pronunciation.
I am not very well versed in speech/audio technology, so what do I need? Are there libraries that ship with Cocoa Touch that give me access to the data I need from a voice sample? What exactly should I be looking to capture?
If anyone has an idea of the technology I am going to need to leverage, I would appreciate a point in the right direction.
Thanks!
What you're looking for is called formant analysis.
Formants are, in essence, the spectral peaks of the uttered sounds. They are listed in order of frequency, as in f1, f2, etc. Seems to me that what you're looking to plot is f1.
Formant analysis is at the core of speech recognition; usually f1 and f2 are enough to tell vowels apart. I'd recommend you search for formant analysis algorithms and take it from there.
Good luck :)

Recognizing individual voices

I plan to write conversation analysis software that will recognize the individual speakers, their pitch, and their intensity. Pitch and intensity are somewhat straightforward (pitch via autocorrelation).
How would I go about recognizing individual speakers, so that I can record each speaker's features? Will storing some heuristics for each speaker's frequencies be enough? I can assume that only one person speaks at a time (strictly non-overlapping). I can also assume that, for training, each speaker can record a minute's worth of data before the actual analysis.
Pitch and intensity on their own tell you nothing. You really need to analyse how pitch varies. In order to identify different speakers you need to transform the speech audio into some kind of feature space, and then make comparisons against your database of speakers in this feature space. The general term that you might want to Google for is prosody - see e.g. http://en.wikipedia.org/wiki/Prosody_(linguistics). While you're Googling you might also want to read up on speaker identification aka speaker recognition, see e.g. http://en.wikipedia.org/wiki/Speaker_identification
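The comparison step described above can be sketched very roughly in Python as a nearest-neighbour lookup. This assumes each utterance has already been reduced to a fixed-length feature vector; the two features used here (mean pitch and pitch spread, both in Hz), the speaker names, and all the numbers are made up for illustration. A real system would use richer features, scaled so that no single one dominates the distance.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify_speaker(features, enrolled):
    """Return the name of the enrolled speaker closest in feature space."""
    return min(enrolled, key=lambda name: distance(features, enrolled[name]))


# Hypothetical database built from each speaker's one-minute training sample:
# [mean pitch in Hz, standard deviation of pitch in Hz]
enrolled = {
    "alice": [210.0, 35.0],
    "bob": [120.0, 20.0],
}

# Features of a new utterance; 'alice' is the closest match here.
closest = identify_speaker([200.0, 30.0], enrolled)
```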
If you are still working on this... are you using speech-recognition on the sound input? Because Microsoft SAPI for example provides the application with a rich API for digging into the speech sound wave, which could make the speaker-recognition problem more tractable. I think you can get phoneme positions within the waveform. That would let you do power-spectrum analysis of vowels, for example, which could be used to generate features to distinguish speakers. (Before anybody starts muttering about pitch and volume, keep in mind that the formant curves come from vocal-tract shape and are fairly independent of pitch, which is vocal-cord frequency, and the relative position and relative amplitude of formants are (relatively!) independent of overall volume.) Phoneme duration in-context might also be a useful feature. Energy distribution during 'n' sounds could provide a 'nasality' feature. And so on. Just a thought. I expect to be working in this area myself.

Resources