Recognize sound based on recorded library of sounds - iOS

I am trying to create an iOS app that will perform an action when it detects a clapping sound.
Things I've tried:
1) My first approach was to simply measure the overall power using an AVAudioRecorder. This worked OK, but it could get set off by loud talking, other noises, etc., so I decided to take a different approach.
2) I then implemented some code that uses an FFT to get the frequency and magnitude of the live streaming audio from the microphone. I found that the clap spike generally resides in the 13 kHz-20 kHz range, while most talking resides at much lower frequencies. I then implemented a simple threshold in this frequency range, and this worked OK, but other sounds could set it off. For example, dropping a pencil on the table right next to my phone would pass this threshold and be counted as a clap.
3) I then tried splitting this frequency range up into a couple hundred bins and gathering enough data so that when a sound passed the threshold my app would calculate the z-score (the standard score from statistics) and, if the z-score was good, count that as a clap. This did not work at all, as some claps were not recognized and some other sounds were recognized. (A rough sketch of approaches 2 and 3 follows this list.)
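For reference, here is a rough sketch combining the band energy from approach 2 with the z-score test from approach 3. It is only a sketch: it uses a naive DFT instead of a real FFT (on iOS you would use vDSP/Accelerate), and the sample rate, bin count, history length and threshold are invented values, not ones from my experiments:

#include <cmath>
#include <numeric>
#include <vector>

const double PI = 3.14159265358979323846;

// Naive DFT magnitude of one frequency bin (a real FFT library would be used in practice).
static double binMagnitude(const std::vector<float>& frame, double freqHz, double sampleRate) {
    double re = 0.0, im = 0.0;
    for (size_t n = 0; n < frame.size(); ++n) {
        double phase = 2.0 * PI * freqHz * n / sampleRate;
        re += frame[n] * std::cos(phase);
        im -= frame[n] * std::sin(phase);
    }
    return std::sqrt(re * re + im * im) / frame.size();
}

// Sum the magnitudes of a few hundred bins between 13 kHz and 20 kHz, then flag a clap when
// the current frame's band energy is an outlier (large z-score) relative to recent history.
bool looksLikeClap(const std::vector<float>& frame, std::vector<double>& history,
                   double sampleRate = 44100.0, int numBins = 200, double zThreshold = 3.0) {
    double bandEnergy = 0.0;
    for (int b = 0; b < numBins; ++b) {
        double freq = 13000.0 + b * (20000.0 - 13000.0) / numBins;
        bandEnergy += binMagnitude(frame, freq, sampleRate);
    }

    bool clap = false;
    if (history.size() >= 32) {                       // need some history before judging
        double mean = std::accumulate(history.begin(), history.end(), 0.0) / history.size();
        double var = 0.0;
        for (double h : history) var += (h - mean) * (h - mean);
        double sd = std::sqrt(var / history.size());
        double z = sd > 0.0 ? (bandEnergy - mean) / sd : 0.0;
        clap = z > zThreshold;                        // outlier in the 13-20 kHz band
    }
    history.push_back(bandEnergy);                    // update the running history
    if (history.size() > 256) history.erase(history.begin());
    return clap;
}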
Graph:
To try to help me understand how to detect claps, I made this graph in Excel (each graph has around 800 data points), and it covers the 13 kHz-21 kHz range:
Where I am now:
Even after all of this, I am still not seeing how to recognize a clap versus other sounds.
Any help is greatly appreciated!

Related

Working on a small project dealing with motion detection and sending alerts if displacement exceeds a certain boundary. Is it possible to implement YOLO?

This is the first mini project I am working on, and I have been searching for some information regarding YOLO. I want to know if we could train YOLO to recognise objects in a real-time webcam feed and set up a boundary (not to be confused with the bounding boxes) that sends out a simple alert if the object in question (in our case, a face) goes out of the boundary.
This is my first time asking here and I don't know if it is appropriate to do so; please let me know, and I will keep reading APIs related to motion detection. If there are any suggestions, please do give them.
I would check out the open-source CCTV solution Shinobi (https://shinobi.video/). I used it once to do motion detection, and I think it could be much easier for you than building something from scratch.
Here are some articles they have that sound related to what you are trying to do:
https://hub.shinobi.video/articles/view/JtJiGkdbcpAig40
https://hub.shinobi.video/articles/view/xEMps3O4y4VEaYk
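Whichever detector you end up with (YOLO or Shinobi's built-in motion detection), the alert logic itself is small once you have a bounding box per frame. A minimal sketch, assuming the detector already gives you the face's box each frame; the region coordinates are made up for illustration:

#include <iostream>

struct Box { float x, y, w, h; };            // detector output: top-left corner + size

// Returns true when the centre of the detected box has left the allowed region.
bool outsideBoundary(const Box& detection, const Box& allowedRegion) {
    float cx = detection.x + detection.w / 2.0f;
    float cy = detection.y + detection.h / 2.0f;
    return cx < allowedRegion.x || cx > allowedRegion.x + allowedRegion.w ||
           cy < allowedRegion.y || cy > allowedRegion.y + allowedRegion.h;
}

int main() {
    Box allowed{100, 100, 400, 300};          // the "boundary" drawn by the user (made-up numbers)
    Box face{520, 180, 80, 80};               // one detection from the current frame
    if (outsideBoundary(face, allowed))
        std::cout << "ALERT: face left the boundary\n";   // hook your real alert here
}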

Calculate the percentage of accuracy with which the user made the assigned sound

I want to design a web app for my cousin, who is 2 years of age, in which I have implemented a functionality where, when an image is clicked, a sound gets played and the user has to make the same sound, which gets recorded.
For example, if I click on the image of an "Apple", the sound played is "A for Apple". Now the user has to say those words, which get recorded.
Now I want to calculate the percentage of accuracy with which the user spoke, and I want to know how I can get that accuracy percentage. I have not used machine learning or natural language processing before, so I want some guidance on what I should learn about, or on ways of implementing this functionality. I need some help with that.
I also use Node.js frameworks quite frequently, so is there any module in Node.js with the help of which the above requirement can be fulfilled?
What you want to reach is a quite complex and non-trivial task that can be faced at several levels. First of all, you should answer a question for yourself beforehand:
What do you mean by "accuracy"? Which metric do you want to use for that? Accuracy means comparing a result with its optimum. So what would be the optimum of saying "Apple"?
I think there are several levels on which you could measure speech accuracy:
On the audio level: There are several correlation metrics that can compute the similarity of two audio files. See e.g. here for more details. Simply said, the idea is to directly compare the audio samples. In your case, you would need a reference audio track that is the "correct" result. The correct time alignment might become a problem, though.
On the level of speech recognition: You could use a speech recognizer -- commercial or open source -- and get back a string of spoken words. In this case you should think about when the recording is stopped, to limit the record length. Then you have to think about a metric that evaluates the correctness of the transcription. Some that I have worked with are the Levenshtein distance and the Word Error Rate (WER). With these you can compute a similarity; a minimal sketch follows below.
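To make the second option concrete, here is a minimal sketch of a word-level Levenshtein distance; Word Error Rate is then just this distance divided by the number of reference words. The same dynamic programming ports directly to a small Node.js module:

#include <algorithm>
#include <string>
#include <vector>

// Word-level Levenshtein distance: the minimum number of word insertions,
// deletions and substitutions needed to turn `ref` into `hyp`.
int levenshtein(const std::vector<std::string>& ref, const std::vector<std::string>& hyp) {
    std::vector<std::vector<int>> d(ref.size() + 1, std::vector<int>(hyp.size() + 1));
    for (size_t i = 0; i <= ref.size(); ++i) d[i][0] = static_cast<int>(i);
    for (size_t j = 0; j <= hyp.size(); ++j) d[0][j] = static_cast<int>(j);
    for (size_t i = 1; i <= ref.size(); ++i)
        for (size_t j = 1; j <= hyp.size(); ++j) {
            int cost = (ref[i - 1] == hyp[j - 1]) ? 0 : 1;   // substitution cost
            d[i][j] = std::min({ d[i - 1][j] + 1,            // deletion
                                 d[i][j - 1] + 1,            // insertion
                                 d[i - 1][j - 1] + cost });  // match or substitution
        }
    return d[ref.size()][hyp.size()];
}

// Word Error Rate: edit distance normalised by the reference length.
double wordErrorRate(const std::vector<std::string>& ref, const std::vector<std::string>& hyp) {
    return ref.empty() ? 0.0 : static_cast<double>(levenshtein(ref, hyp)) / ref.size();
}

For example, comparing the reference {"a", "for", "apple"} against a recognized {"a", "for", "app"} gives a distance of 1 and a WER of 1/3, so about 67% if you define accuracy as 1 - WER.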

Separation of instrument audio from a single-channel, non-MIDI musical file

My friend Prasad Raghavendra and I were trying to experiment with machine learning on audio.
We were doing it to learn and to explore interesting possibilities at any upcoming get-togethers.
I decided to see how deep learning, or any machine learning, could be fed certain audio clips rated by humans (for evaluation).
To our dismay, we found that the problem had to be split to accommodate the dimensionality of the input.
So, we decided to discard vocals and assess the accompaniment, on the assumption that vocals and instruments are always correlated.
We tried to look for an MP3/WAV-to-MIDI converter. Unfortunately, the ones on SourceForge and GitHub handle only single instruments, and the other options are paid (Ableton Live, Fruity Loops, etc.). We decided to take this as a sub-problem.
We thought of FFTs, band-pass filters and a moving window to accommodate these.
But we do not understand how we can go about splitting the instruments if chords are played and there are 5-6 instruments in the file.
1. What are the algorithms that I can look at?
2. My friend knows how to play the keyboard, so I will be able to get MIDI data. But are there any data sets meant for this?
3. How many instruments can these algorithms detect?
4. How do we split the audio? We do not have multiple recordings or the mixing matrix.
We were also thinking about finding out the patterns of the accompaniments and using those accompaniments in real time while singing along. I guess we will be able to think about that once we get answers to 1, 2, 3 and 4. (We are thinking about both chord progressions and Markovian dynamics.)
Thanks for all help!
P.S.: We also tried an FFT, and we are able to see some harmonics. Is it due to the sinc() in the FFT when a rectangular wave is the input in the time domain? Can that be used to determine timbre?
We were able to formulate the problem roughly. But we are still finding it difficult to formulate it precisely. If we use the frequency domain for a certain frequency, then the instruments are indistinguishable: a trombone playing at 440 Hz and a guitar playing at 440 Hz would have the same frequency, differing only in timbre. We still do not know how we can determine timbre. We decided to go by the time domain, by considering notes. If a note exceeds a certain octave, we would use that as a separate dimension: +1 for the next octave, 0 for the current octave and -1 for the previous octave.
If notes are represented by letters such as 'A', 'B', 'C', etc., then the problem reduces to mixing matrices.
O = MI during training.
M is the mixing matrix that will have to be found out, using the known output O and the known input I from the MIDI file.
During prediction, though, M must be replaced by a probability matrix P, which would be generated using previous M matrices.
The problem reduces to I_predicted = P^-1 * O. The error would then be reduced to the LMSE of I. We can use a DNN to adjust P using back-propagation.
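As a sanity check on the O = MI formulation, here is a minimal sketch of fitting M from known (O, I) training pairs by gradient descent on the squared error (the same objective a DNN would minimise). The matrix sizes and learning rate are placeholders, and a real linear-algebra library would replace these loops:

#include <vector>

using Matrix = std::vector<std::vector<double>>;     // row-major, small and dense

// One gradient step on 0.5 * ||O - M*I||^2 with respect to the mixing matrix M.
// O is (outputs x frames), I is (notes x frames), M is (outputs x notes).
void gradientStep(Matrix& M, const Matrix& O, const Matrix& I, double lr) {
    size_t outputs = O.size(), frames = O[0].size(), notes = I.size();
    for (size_t r = 0; r < outputs; ++r)
        for (size_t n = 0; n < notes; ++n) {
            double grad = 0.0;
            for (size_t f = 0; f < frames; ++f) {
                double pred = 0.0;                    // (M*I)[r][f]
                for (size_t k = 0; k < notes; ++k) pred += M[r][k] * I[k][f];
                grad += (O[r][f] - pred) * I[n][f];   // residual times input activation
            }
            M[r][n] += lr * grad;                     // move M toward the least-squares fit
        }
}

Repeating gradientStep over the training frames drives M toward the least-squares solution; back-propagating the same error through a probability matrix P is the DNN variant described above.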
But in this approach, we assume that the notes 'A', 'B', 'C', etc. are known. How do we detect them instantaneously, or over a small duration like 0.1 seconds? Template matching may not work due to harmonics. Any suggestions would be much appreciated.
Splitting out the different parts is a machine learning problem all its own. Unfortunately, you can't look at this problem in audio land only. You must consider the music.
You need to train something to understand musical patterns and progressions in the context of the type of music you give it. It needs to understand what the different instruments sound like, both mixed and not mixed. It needs to understand how these instruments are often played together, if it's going to have any chance at all at separating what's going on.
This is a very, very difficult problem.
This is a very hard problem, mainly because converting audio to pitch isn't very simple, due to Nyquist folding harmonics that are 22 kHz+ back down, and also other harmonic introductions such as saturators/distortion and other analogue equipment that introduce harmonics.
The fundamental harmonic isn't always the loudest, which is why your plan will not work.
The hardest thing to measure would be a distorted guitar. The harmonics some pedals/plugins can make are crazy.

Programming Musical Instrument Emulators? [closed]

Can someone provide me with information pertaining to programming musical instrument emulators? As an example, see here (Smule's Ocarina app for the iPhone).
I have been unable to find sufficient information on this topic. Running with the Ocarina app as an example, how are the individual notes produced? Since the results are based on strength of breath and which "holes" are held down, some of it must be handled programmatically, but is the whole sound generated programmatically, or would it use a sound sample on the back end and modify that (or those, if multiple samples are used)?
Are there any resources on this topic? All of my searches come up with information on how to play music (just standard audio) or how to make music (in music editing software), but none on how to do what is shown in that video.
Responses needn't be strictly related to ocarinas, though I wouldn't mind if they were.
That particular musical instrument sounds to me like it's a fairly simple synthesis module, based perhaps on a square wave or FM, with a reverb filter tacked on. So I'm guessing it's artificially generated sound all the way down. If you were going to build one of these instruments yourself, you could use a sample set as your basis instead if you wished. There's another possibility I'm going to mention a ways below.
Dealing with breath input: The breath input is generally translated to a value that represents the air pressure on the input microphone. This can be done by taking small chunks of the input audio signal and calculating the peak or RMS of each chunk. I prefer RMS, which is calculated by something like:
#include <math.h>               // for sqrtf()

const int BUFFER_SIZE = 1024;   // just for purposes of this example
float buffer[BUFFER_SIZE];      // 1 channel of float samples between -1.0 and 1.0, filled by your audio input
float rms = 0.0f;
for (int i = 0; i < BUFFER_SIZE; ++i) {
    rms += buffer[i] * buffer[i];   // accumulate squared samples
}
rms = sqrtf(rms / BUFFER_SIZE);     // root of the mean square
In MIDI, this value is usually transmitted as continuous controller 2 (CC2), with a value between 0 and 127. That value is then used to continually control the volume of the output sound. (On the iPhone, MIDI may or may not be used internally, but the concept's the same. I'll call this value CC2 from here on out regardless.)
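As a quick sketch of that scaling (the rounding and clamping are one reasonable choice, not the only one):

#include <algorithm>
#include <cmath>

// Map an RMS value in [0.0, 1.0] to a MIDI-style controller value in [0, 127].
int rmsToCC2(float rms) {
    int cc2 = static_cast<int>(std::lround(rms * 127.0f));
    return std::clamp(cc2, 0, 127);   // guard against out-of-range input
}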
Dealing with key presses: The key presses in this case are probably just mapped directly to the notes that they correspond to. These would then be sent as new note events to the instrument. I don't think there's any fancy modeling there.
Other forms of control: The Ocarina instrument uses the tilt of the iPhone to control vibrato frequency and volume. This is usually modeled simply by a low-frequency oscillator (LFO) that's scaled, offset, and multiplied with the output of the rest of your instrument to produce a fluttering volume effect. It can also be used to control the pitch of your instrument, where it will cause the pitch to fluctuate. (This can be hard to do right if you're working with samples, but relatively easy if you're using waveforms.) Fancy MIDI wind controllers also track finger pressure and bite-down pressure, and can expose those as parameters for you to shape your sound with as well.
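A minimal sketch of that LFO-driven volume flutter, applied to an output buffer; the rate and depth values are just examples (tilt would modulate them on the device):

#include <cmath>
#include <vector>

const double PI = 3.14159265358979323846;

// Multiply the output by a slow sine ("LFO") that is scaled and offset so the
// gain flutters between (1 - depth) and 1.0.
void applyTremolo(std::vector<float>& out, double sampleRate,
                  double lfoRateHz = 5.0, double depth = 0.3) {
    for (size_t n = 0; n < out.size(); ++n) {
        double lfo = std::sin(2.0 * PI * lfoRateHz * n / sampleRate);   // -1..1
        double gain = 1.0 - depth * 0.5 * (1.0 + lfo);                  // (1-depth)..1
        out[n] = static_cast<float>(out[n] * gain);
    }
}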
Breath instruments 201: There are some tricks that people pull to make sounds more expressive when they are controlled by a breath controller:
Make sure that your output is only playing one note at a time; switching to a new note automatically ends the previous note.
Make sure that the volume from the old note to the new note remains smooth if the breath pressure is constant and the key presses are connected. This allows you to distinguish between legato playing and detached playing.
Breath instruments 301: And then we get to the fun stuff: how to simulate overblowing, timbre change, partial fingering, etc. like a real wind instrument can do. There are several approaches I can think of here:
Mix in the sound of the breath input itself, perhaps filtered in some way, to impart a natural chiff or breathiness to your sound.
Use crossfading between velocity layers to transform the sound at high velocities into an entirely different sound. In other words, you literally fade out the old sound while you're fading in the new sound; they're playing the same pitch, but the new tonal characteristics of the new sound will make themselves gradually apparent.
Use a complex sound with plenty of high-frequency components. Hook up a low-pass filter whose cutoff frequency is controlled by CC2. Have the cutoff frequency increase as the value of CC2 increases. This can increase the high-frequency content in an interesting way as you blow harder on the input (see the sketch after this list).
The hard-core way to do this is called physical modeling. It involves creating a detailed mathematical model of the physical behavior of the instrument you're trying to emulate. Doing this can give you a quite realistic overblowing effect, and it can capture many subtle effects of how the breath input and fingering shape the sound. There's a quick overview of this approach at Princeton's Sound Lab and a sample instrument to poke at in the STK C++ library – but be warned, it's not for the mathematically faint of heart!
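Here is a sketch of the filter idea from the third bullet above: a one-pole low-pass filter whose cutoff follows CC2, so blowing harder opens up the high end. The cutoff range is a made-up example:

#include <cmath>
#include <vector>

const double PI = 3.14159265358979323846;

// One-pole low-pass filter whose cutoff frequency tracks the breath value
// (CC2, 0..127): harder breath -> higher cutoff -> brighter sound.
void breathFilter(std::vector<float>& samples, double sampleRate, int cc2) {
    double cutoffHz = 200.0 + (cc2 / 127.0) * 8000.0;             // example range 200 Hz..8.2 kHz
    double a = 1.0 - std::exp(-2.0 * PI * cutoffHz / sampleRate); // smoothing coefficient
    double state = 0.0;
    for (float& s : samples) {
        state += a * (s - state);                                 // y[n] = y[n-1] + a*(x[n]-y[n-1])
        s = static_cast<float>(state);
    }
}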
First of all, I'm not quite sure what your question is.
There are quite a few kinds of sound synthesis. A few I know about are:
Frequency Modulation
Oscillation
Wave Table (sample based)
Oscillation is quite simple and probably the place to start. If you generate a square wave at 440 Hz, you have the note "A", or more specifically middle A.
That kind of simple synthesis is really quite fun and easy to do. Maybe you can start by making a simple synth for the PC speaker. Oh, but I don't know if all OSes let you access that. LADSPA has some good examples. There are lots of libs for Linux with docs to get you started. You might want to have a look at Csound for starters: http://www.csounds.com/chapter1/index.html
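For example, a 440 Hz square wave is just a phase accumulator and a sign flip; a minimal sketch that fills a buffer of samples (sending the buffer to the sound card is left to whatever audio API or library you use):

#include <vector>

// Fill one second of a 440 Hz square wave (the A mentioned above) at the given sample rate.
std::vector<float> makeSquareWave(double freqHz = 440.0, double sampleRate = 44100.0,
                                  float amplitude = 0.25f) {
    std::vector<float> samples(static_cast<size_t>(sampleRate));
    double phase = 0.0, step = freqHz / sampleRate;     // cycles advanced per sample
    for (float& s : samples) {
        s = (phase < 0.5) ? amplitude : -amplitude;     // high half of the cycle, then low half
        phase += step;
        if (phase >= 1.0) phase -= 1.0;                 // wrap once per cycle
    }
    return samples;
}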
I played around with it a bit and have a couple corny synths going on...

Recognizing individual voices

I plan to write conversation analysis software, which will recognize the individual speakers and their pitch and intensity. Pitch and intensity are somewhat straightforward (pitch via autocorrelation).
How would I go about recognizing individual speakers, so I can record each speaker's features? Will storing some heuristics for each speaker's frequencies be enough? I can assume that only one person speaks at a time (strictly non-overlapping). I can also assume that, for training, each speaker can record a minute's worth of data before actual analysis.
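For the pitch part, the autocorrelation I have in mind is roughly this; the lag bounds are assumptions for speech:

#include <cstddef>
#include <vector>

// Crude autocorrelation pitch estimate: find the lag (within a speech-like
// range, ~80-400 Hz here) whose autocorrelation with the frame is largest.
double estimatePitchHz(const std::vector<float>& frame, double sampleRate) {
    size_t minLag = static_cast<size_t>(sampleRate / 400.0);   // highest pitch considered
    size_t maxLag = static_cast<size_t>(sampleRate / 80.0);    // lowest pitch considered
    size_t bestLag = 0;
    double bestScore = 0.0;
    for (size_t lag = minLag; lag <= maxLag && lag < frame.size(); ++lag) {
        double score = 0.0;
        for (size_t n = 0; n + lag < frame.size(); ++n)
            score += frame[n] * frame[n + lag];                // correlation at this lag
        if (score > bestScore) { bestScore = score; bestLag = lag; }
    }
    return bestLag ? sampleRate / bestLag : 0.0;               // 0.0 means "no pitch found"
}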
Pitch and intensity on their own tell you nothing. You really need to analyse how pitch varies. In order to identify different speakers you need to transform the speech audio into some kind of feature space, and then make comparisons against your database of speakers in this feature space. The general term that you might want to Google for is prosody - see e.g. http://en.wikipedia.org/wiki/Prosody_(linguistics). While you're Googling you might also want to read up on speaker identification aka speaker recognition, see e.g. http://en.wikipedia.org/wiki/Speaker_identification
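To illustrate the feature-space comparison very crudely: store each enrolled speaker as the average feature vector of their training minute and assign new speech to the nearest one. What actually goes into the feature vector (MFCCs, prosodic statistics, formant measurements) is the real work and is not shown here:

#include <limits>
#include <string>
#include <vector>

// A speaker is stored as the average feature vector of their enrolment audio.
struct Speaker {
    std::string name;
    std::vector<double> centroid;   // average feature vector from the training minute
};

// Return the enrolled speaker whose centroid is closest (Euclidean distance)
// to the feature vector of the current utterance.
const Speaker* identify(const std::vector<double>& features, const std::vector<Speaker>& speakers) {
    const Speaker* best = nullptr;
    double bestDist = std::numeric_limits<double>::max();
    for (const Speaker& s : speakers) {
        double d = 0.0;
        for (size_t i = 0; i < features.size() && i < s.centroid.size(); ++i)
            d += (features[i] - s.centroid[i]) * (features[i] - s.centroid[i]);
        if (d < bestDist) { bestDist = d; best = &s; }
    }
    return best;    // nullptr if no speakers are enrolled
}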
If you are still working on this... are you using speech-recognition on the sound input? Because Microsoft SAPI for example provides the application with a rich API for digging into the speech sound wave, which could make the speaker-recognition problem more tractable. I think you can get phoneme positions within the waveform. That would let you do power-spectrum analysis of vowels, for example, which could be used to generate features to distinguish speakers. (Before anybody starts muttering about pitch and volume, keep in mind that the formant curves come from vocal-tract shape and are fairly independent of pitch, which is vocal-cord frequency, and the relative position and relative amplitude of formants are (relatively!) independent of overall volume.) Phoneme duration in-context might also be a useful feature. Energy distribution during 'n' sounds could provide a 'nasality' feature. And so on. Just a thought. I expect to be working in this area myself.
