Identify a specific sound on iOS - ios

I'd like to be able to recognise a specific sound in an iOS application. I guess it would basically work like speech recognition in that it's fairly fuzzy, but it would only have to be for 1 specific sound.
I've done some quick FFT stuff to identify specific frequencies over a certain threshold and only when they're solo (ie, they're not surrounded by other frequencies) so I can identify individual tones pretty easily. I'm thinking it's just an extension of this, but comparing to an FFT data set of a recording of the sound, and compare say 0.1 second chunks over the length of the audio. And I would also have to account for variation in amplitude, a little in pitch and a little in time.
Can anyone point me to any pre-existing source that I could use to speed this process along? I can't seem to find anything usable. Or failing that, any ideas on how to get started on something like this?
Thanks very much

From your description it is not entirely clear what you want to do.
What is the "specific" sound like? Does it have high background noise?
What's the specific recognizable feature (e.g. pitch, inharmonicity, timbre ...)?
Against which other "sounds" do you want to compare it?
Do you simply want to match an arbitrary sound spectrum against a "template sound"?
Is your sound percussive, melodic, speech, ...? Is it long, short ...?
In which frequency range do you expect the best discriminability? Are the features invariant with time?
There is no "general" solution that works for everything. Speech recognition in itself is fairly complex and won't work well for abstract sounds whose discriminable frequencies do not fall in, e.g., the mel bands.
So in conclusion, you are leaving too many open questions to get a useful answer.
The only suggestion I can make based on this little information is the following:
For the template sound:
1) Extract spectral peak positions from the power spectrum
2) Measure the standard deviation around the peaks and construct a Gaussian from it
3) Save the Gaussians for later classification
For unknown sounds:
1) Extract spectral peak positions
2) Project those points onto the saved Gaussians, which leaves you with z-scores of the peak positions
3) With the computed z-scores you should be able to classify your template sound
Note: This is a very crude method which discriminates sounds according to their most powerful frequencies. Using the Gaussians leaves room for slight shifts in the most powerful frequencies.
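The two lists above can be sketched roughly as follows. This is only one reading of the method: it assumes the power spectra are already available as plain lists, that "standard deviation around the peaks" means the spread of each peak's bin position across several recordings of the template, and that the helper names (`spectral_peaks`, `build_template`, `matches`) and the `max_z` cut-off are made up for illustration.

```python
import statistics

def spectral_peaks(power, n_peaks):
    """Bin indices of the n_peaks strongest local maxima, sorted by bin."""
    maxima = [i for i in range(1, len(power) - 1)
              if power[i] > power[i - 1] and power[i] >= power[i + 1]]
    strongest = sorted(maxima, key=lambda i: power[i], reverse=True)[:n_peaks]
    return sorted(strongest)

def build_template(frames, n_peaks):
    """Fit one Gaussian (mean, std) per peak from several template spectra."""
    peak_sets = [spectral_peaks(p, n_peaks) for p in frames]
    peak_sets = [s for s in peak_sets if len(s) == n_peaks]
    if not peak_sets:
        raise ValueError("no template frame has enough spectral peaks")
    template = []
    for k in range(n_peaks):
        positions = [peaks[k] for peaks in peak_sets]
        mean = statistics.mean(positions)
        # Assumed floor of half a bin so a perfectly stable peak
        # does not produce a zero standard deviation.
        std = max(statistics.pstdev(positions), 0.5)
        template.append((mean, std))
    return template

def z_scores(power, template):
    """Distance of each peak of an unknown spectrum from its Gaussian, in stds."""
    peaks = spectral_peaks(power, len(template))
    return [abs(p - mean) / std for p, (mean, std) in zip(peaks, template)]

def matches(power, template, max_z=3.0):
    """Accept the sound if every peak lies within max_z stds of the template."""
    scores = z_scores(power, template)
    return len(scores) == len(template) and all(z <= max_z for z in scores)
```

Working in bin positions only, as here, ignores amplitude entirely, which is in line with the "very crude" caveat above.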

Related

How to decorrelate accelerometer data

Is it possible to decorrelate accelerometer data in real-time? If so, how is it done?
Background:
My application is receiving (X,Y,Z) accelerometer data in real-time (sample rate is 6.75Hz). The sensor is moving in a periodic motion but the motion is not necessarily along only one axis. The 3 signals x(t), y(t) and z(t) are therefore slightly correlated and I would like to know if I can find a rotation matrix (in real time) which can be used to rotate the measured (x,y,z) into a new vector (x*,y*,z*) so that the entire motion is along the z-axis?
I would like to implement the algorithm in C.
Thanks.
What you're trying to do is generally called "principal component analysis". The Wikipedia article is pretty good:
https://en.wikipedia.org/wiki/Principal_component_analysis
For static data you generally use the eigenvectors of the covariance matrix as your new coordinate basis.
PCA in real time is doable, but not super easy. See, for example: http://www.bio-conferences.org/articles/bioconf/pdf/2011/01/bioconf_skills_00055.pdf
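For the static case, a minimal sketch of the covariance-eigenvector idea might look like the following. It uses power iteration to find only the dominant eigenvector (the main direction of motion) and is written in Python for brevity, although the same loops carry over to C almost line for line. The function names are illustrative, and this is a batch computation, not the real-time method from the linked paper.

```python
import math

def covariance(samples):
    """3x3 sample covariance matrix of a list of (x, y, z) tuples."""
    n = len(samples)
    mean = [sum(s[i] for s in samples) / n for i in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for s in samples:
        d = [s[i] - mean[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov

def _unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0.0 else None

def principal_axis(cov, iterations=100):
    """Dominant eigenvector of cov by power iteration (sign is arbitrary)."""
    # Start from the row with the largest variance, so the start vector
    # is not orthogonal to the dominant direction in the rank-1 case.
    start = max(range(3), key=lambda i: cov[i][i])
    v = _unit(cov[start])
    if v is None:
        return [0.0, 0.0, 1.0]  # no motion at all: any axis will do
    for _ in range(iterations):
        w = _unit([sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)])
        if w is None:
            break
        v = w
    return v

def project(samples, axis):
    """Component of each sample along the principal axis (the new z*)."""
    return [sum(s[i] * axis[i] for i in range(3)) for s in samples]
```

Projecting onto the returned axis gives the signal along the main direction of motion; a full rotation matrix would also need the other two eigenvectors.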
I'd like to first of all emphasize that Matt Timmermans' answer has done exactly what people are actually doing when classifying accelerometer data from clinical studies (a project I worked on).
Then: you're observing a sampled signal. In general, if you have a sensor that gives you samples at a rate of 6.75 Hz, the highest signal frequency you can detect is 6.75 Hz / 2 = 3.375 Hz. Everything with a higher frequency will inherently be aliased back and look like something with a frequency f with 0 <= f < 3.375 Hz. If you've not considered this, please go and read up on the Nyquist–Shannon sampling theorem. Especially: shield your sensors (however you do that, e.g. by employing dampeners) from all input above that limit, otherwise your measurements might be worth very little or even nothing. If your sensor does this internally (that's absolutely possible; there are enough accelerometers with analog low-pass filters), this has been taken care of. However, document the characteristics of your sensor.
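To make the aliasing arithmetic concrete, here is a small sketch that folds a real-valued tone of any frequency back into the observable band. The function names are made up for this example.

```python
def nyquist(sample_rate):
    """Highest frequency that can be represented at this sample rate."""
    return sample_rate / 2.0

def aliased_frequency(f, sample_rate):
    """Apparent frequency of a real tone at f Hz after sampling."""
    f = f % sample_rate
    return sample_rate - f if f > sample_rate / 2.0 else f
```

With the 6.75 Hz rate from the question, a 5 Hz vibration would show up as `aliased_frequency(5.0, 6.75)`, i.e. 1.75 Hz, indistinguishable from a genuine 1.75 Hz motion.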
Now, your case is a little bit easier because you know pretty well that your whole observation is going to be periodic, and it's measured along three orthogonal axes.
In this case, just doing three discrete Fourier transforms at once, extracting the "strongest" spectral component over all three channels, and finding the phase of that spectral component (which is but the complex argument of that DFT bin) in the two others would give you something that you can map to a periodic movement around a specific axis in 3D space. If you want to, remove these values (set the bins to 0), and search for the strongest component again, etc.
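A rough sketch of that three-channel DFT step, using a naive DFT so it needs nothing beyond the standard library (at 6.75 Hz the cost is irrelevant). The function names and the returned dictionary layout are assumptions for illustration; bin 0 is skipped because it only holds the constant offset such as gravity.

```python
import cmath
import math

def dft(signal):
    """Naive DFT; returns the complex bins 0 .. n//2 of a real signal."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n // 2 + 1)]

def strongest_component(x, y, z, sample_rate):
    """Strongest non-DC bin over all three channels, with per-axis phase."""
    n = len(x)
    spectra = [dft(channel) for channel in (x, y, z)]
    # Skip bin 0 (DC) and, for even n, the Nyquist bin.
    candidates = range(1, (n + 1) // 2)
    k = max(candidates, key=lambda b: max(abs(s[b]) for s in spectra))
    return {
        "frequency": k * sample_rate / n,
        "amplitudes": [2 * abs(s[k]) / n for s in spectra],
        "phases": [cmath.phase(s[k]) for s in spectra],
    }
```

The per-axis amplitudes and the phase differences between axes together describe the shape of the periodic motion at that frequency.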
Discrete cosine transforms can be done at staggering speed nowadays. With 6.75 Hz, no PC in this world will ever get into trouble when you try this while you receive further samples. It's a hilariously low sampling rate.
Another, more elegant approach (read: you need fewer samples to compute it) would be using a parametric estimator; in your case, direction-of-arrival estimation from the world of multi-antenna RF technology would, as far as I can think, map directly to detection of the rotational axis. The classical algorithms here are MUSIC and ESPRIT, and for your case (limited, known number of oscillating parts), ESPRIT might be the better choice.

How does knocktounlock work?

I am trying to figure out how knocktounlock.com is able to detect "knocks" on the iPhone. I am sure they use the accelerometer to achieve this; however, all my attempts produce false positives (if the user moves, jumps, etc., it sometimes fires).
Basically, I want to be able to detect when a user knocks/taps/smacks their phone (and be able to distinguish that from other things that may also cause a rise in the accelerometer signal). So I am looking for sharp, high peaks. The device will be in a pocket, so it will not move very much.
I have tried things like high/low pass (not sure if there would be a better option)
This is a duplicate of this: Detect hard taps anywhere on iPhone through accelerometer But it has not received any answers.
Any help/suggestions would be awesome! Thanks.
EDIT: Looking for more thoughts before I accept the answer below. I did hear back from Knocktounlock and they use the fourth derivative (jounce) to get better values to then analyse. Which is interesting.
I would consider a knock on the iPhone to be exactly the same as bumping two phones against each other. Check out this GitHub repo:
https://github.com/joejcon1/iOS-Accelerometer-visualiser
Build and run the app on an iPhone and check out the spikes on the green line. You can see the value of the spike clearly.
Knocking the iPhone:
As you can see, the actual spike is very short when you knock the phone. The spike patterns differ a little between a hard knock and a soft knock, but they can be distinguished programmatically.
Now let's see the accelerometer pattern when the iPhone moves freely in space:
As you can see, the spikes are bell-shaped, which means it takes a little time for the spike value to return to 0.
With these patterns it will be easier to determine the knocking pattern. Good luck.
Also, this will drain your battery, as the sensor will always be running and the iPhone needs to maintain a connection with the Mac via Bluetooth.
P.S.: Also check this answer, https://stackoverflow.com/a/7580185/753603
I think the way to go here is using pattern recognition with accelerometer data.
You could (write and) train a classifier (e.g. K-nearest neighbor) with data you gathered and that has been classified by hand. Neural networks are also an option. However, there will be many different ways to solve that problem. But there is probably no straightforward way for achieving this.
Some papers showing pattern recognition approaches to similar topics (activity, movement), like
http://www.math.unipd.it/~cpalazzi/papers/Palazzi-Accelerometer.pdf
(some more, but I am not allowed to post them with my reputation count. You can search for "pattern recognition accelerometer data")
There is also a master thesis about gesture recognition on the iPhone:
http://klingmann.ch/msc_thesis_marco_klingmann_iphone_gestures.pdf
In general you won't achieve 100% correct classification. Depending on the time and knowledge one has, the result will vary between good-usable and we-could-use-random-classification-instead.
Just a thought, but it could be useful to add the output of the microphone to the mix, listening for really short, loud noises at the same time that a possible "knock" movement has been detected.
I am surprised that the 4th derivative is needed; intuitively it feels to me that the 3rd ("jerk", the derivative of acceleration) should be enough. It is a big hint about what to keep an eye on, though.
It seems quite simple to me: collect accelerometer data at high rates, plot it on a chart, observe. Calculate the first derivative from that, plot and observe. Then rinse and repeat with the derivative of the last one. Draw conclusions. I highly doubt you will need to do pattern recognition per se (clustering, classifiers, what have you); I think you will see a very distinct peak on one of your charts, and may only need to tune the collection rate and smoothing.
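The "differentiate, plot, repeat" procedure can be sketched with plain finite differences. Starting from acceleration, order 1 gives jerk and order 2 gives jounce (the fourth derivative of position mentioned in the question's edit). The function names and the `peak_ratio` measure are illustrative assumptions, not anything KnockToUnlock is known to use.

```python
def derivative(values, dt):
    """Forward finite difference of evenly spaced samples."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

def nth_derivative(values, dt, order):
    """Apply the finite difference `order` times."""
    for _ in range(order):
        values = derivative(values, dt)
    return values

def peak_ratio(values):
    """Largest magnitude divided by the mean magnitude (spikiness)."""
    mags = [abs(v) for v in values]
    mean = sum(mags) / len(mags)
    return max(mags) / mean if mean > 0 else 0.0
```

Each differentiation amplifies sample noise as well as knocks, which is why the collection rate and smoothing mentioned above matter.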
What is more interesting to me is how come you don't have to be running the KnockToUnlock app for this to work. And if it is running in the background, what lets it run there for unlimited time? I don't think the accelerometer qualifies for unlimited background running. After some pondering, I am guessing the reason is that the app uses Bluetooth to connect to the Mac as an accessory, and as such gets a pass from iOS to run in the background (and suck your battery, shhht).
To solve this problem you need to choose the sampling frequency. A tap (knock) has very high-frequency content, so you should choose an accelerometer sampling frequency of no less than 50 Hz (perhaps even 100 Hz) for good tap detection in the presence of noise from other movements.
The use of a classifier is necessary, but to save battery you should not call the classifier very often. You should write a simple algorithm that finds only taps and situations similar to knocks, and reports to your program that it needs to call the classifier.
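A minimal sketch of that two-stage idea, assuming the input is a list of acceleration magnitudes and that `classify` is whatever (expensive) classifier you have trained. The threshold, refractory period, and window size are placeholder values, not tuned ones.

```python
def candidate_knocks(magnitudes, threshold, refractory):
    """Cheap first stage: indices where the sample-to-sample jump is large."""
    hits = []
    last = -refractory
    for i in range(1, len(magnitudes)):
        jump = abs(magnitudes[i] - magnitudes[i - 1])
        if jump >= threshold and i - last >= refractory:
            hits.append(i)
            last = i
    return hits

def detect_knocks(magnitudes, classify, threshold=0.5, refractory=10, window=8):
    """Run the expensive classifier only on windows around candidates."""
    knocks = []
    for i in candidate_knocks(magnitudes, threshold, refractory):
        segment = magnitudes[max(0, i - window): i + window]
        if classify(segment):
            knocks.append(i)
    return knocks
```

The refractory period keeps the falling edge of a spike from triggering a second classifier call for the same knock.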
Note the gyro signal: it also responds to knocks. Besides, the gyroscope signal does not need to be separated from a constant component, and it contains less noise.
Here is a good video about the basics of working with smartphone sensors: http://talkminer.com/viewtalk.jsp?videoid=C7JQ7Rpwn2k#.UaTTYUC-2Sp

Comparing pitches with digital audio

I am working on an application which will compare musical notes with digital audio. My first idea was to analyze a WAV file (or sound in real time) with a polyphonic pitch detection algorithm, extract the notes and chords, and then compare them with the notes in a dataset. I went through a lot of pages and it seems to be a lot of hard work, because existing implementations and algorithms mainly or only focus on monophonic sound.
Now I've got the idea to do this the opposite way. In the dataset I have, for example, the note A4, or as a better example, the chord A4 B4 H4. My idea is to generate some wave (or whatever; I don't know what) from this note or chord and then compare it with a piece of digital audio.
Is this a good idea? Is it a better or harder solution?
If yes, can you recommend how to do it?
The easiest solution is to take the FFT (Fast Fourier Transform) of the waveform: all the notes (and their harmonics) will be present in the signal. You then look for the frequencies that correspond to notes, and there's your solution.
Note - in order to get decent frequency resolution you need a sufficiently long sample, and high enough sample rate. But try it and you will see.
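To put numbers on the resolution remark and on "frequencies that correspond to notes", here is a small sketch. It assumes twelve-tone equal temperament with A4 = 440 Hz and English note names (so the German "H" appears as "B"); the function names are made up for this example.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def bin_width(sample_rate, n_samples):
    """Spacing in Hz between adjacent FFT bins."""
    return sample_rate / n_samples

def nearest_note(freq, a4=440.0):
    """Nearest equal-tempered note name and the deviation in cents."""
    midi = round(69 + 12 * math.log2(freq / a4))
    name = NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
    ideal = a4 * 2 ** ((midi - 69) / 12)
    cents = 1200 * math.log2(freq / ideal)
    return name, cents
```

For example, 4096 samples at 44100 Hz give bins about 10.8 Hz apart, which is coarser than the roughly 6.5 Hz gap between A2 (110 Hz) and A#2, so low notes need a longer window.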
Here are a couple of screen shots of an app called SpectraWave that I took sitting in front of my piano. The first is of middle A (f = 440 Hz as you know):
and the second is of an A-minor chord (as you can see, my middle finger is a little stronger and the C is showing up as the note with the greatest volume). The harmonics will soon make it hard to see more than just a few notes…
Your "solution" most likely makes matching even more difficult, since you will have no idea what waveform to make for each note. Most musical instruments and voices not only produce waveforms that are significantly different from single sine waves or any other familiar waveform, but these waveforms evolve over time. Thus guessing the proper waveform to use for each note for a match is extremely improbable.

Feature Detection in Noisy Images

I've built an imaging system with a webcam and feature matching such that, as I move the camera around, I can track the camera's motion. I am doing something similar to here, except with the webcam frames as the input.
It works really well for "good" images, but when taking images in really low light lots of noise appears (camera high gain), and that messes with the feature detection and matching. Basically, it doesn't detect any good features, and when it does, it cannot match them correctly between frames.
Does anyone know a good solution for this? What other methods are used for finding and matching features?
Here are two example images with very low features:
I think phase correlation is going to be your best bet here. It is designed to tell you the phase shift (i.e., translation) between two images. It is much more resilient (but not immune) to noise than feature detection because it operates in frequency space; whereas, feature detectors operate spatially. Another benefit is, it is very fast when compared with feature detection methods. I have an implementation available in the OpenCV trunk that is sub-pixel accurate located here.
However, your images are pretty much "featureless" with the exception of the crease in the middle, so even phase correlation may have some trouble with it. Think of it like trying to detect translation in a snow storm. If all you can see is white, you can't tell that you have translated at all, thus the term whiteout. In your case, the algorithm might suffer from "greenout" :)
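To show what phase correlation computes, here is a bare-bones sketch on tiny grayscale images given as lists of rows. It is not the OpenCV implementation mentioned above: it uses a naive DFT (only practical for very small images), assumes the shift wraps around at the borders, and returns whole-pixel shifts only. The function names are illustrative.

```python
import cmath
import math

def dft2(img, inverse=False):
    """Naive, unnormalized 2-D DFT of a small image (list of rows)."""
    sign = 1 if inverse else -1

    def dft1(vec):
        n = len(vec)
        return [sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t, v in enumerate(vec))
                for k in range(n)]

    rows, cols = len(img), len(img[0])
    tmp = [dft1(row) for row in img]
    by_col = [dft1([tmp[r][c] for r in range(rows)]) for c in range(cols)]
    return [[by_col[c][r] for c in range(cols)] for r in range(rows)]

def phase_correlation(a, b):
    """Return (dy, dx) such that b is roughly a shifted by (dy, dx)."""
    fa, fb = dft2(a), dft2(b)
    cross = []
    for row_a, row_b in zip(fa, fb):
        row = []
        for x, y in zip(row_a, row_b):
            p = y * x.conjugate()
            # Keep only the phase; drop bins with no energy.
            row.append(p / abs(p) if abs(p) > 1e-12 else 0j)
        cross.append(row)
    corr = dft2(cross, inverse=True)
    rows, cols = len(a), len(a[0])
    _, dy, dx = max((corr[r][c].real, r, c)
                    for r in range(rows) for c in range(cols))
    # Map shifts past the halfway point to negative values.
    if dy > rows // 2:
        dy -= rows
    if dx > cols // 2:
        dx -= cols
    return dy, dx
```

Because only the phase of each frequency bin is kept, the correlation peak stays sharp even when overall brightness changes between frames; a nearly featureless frame, as described above, gives a weak and unreliable peak.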
Can you adjust the camera settings to work better in low-light conditions? Have you fully opened the iris? Can you live with lower frame rates? Setting a longer exposure time will allow the camera to gather more light, thus giving you more features at the cost of adding motion blur. Or, if low light is your default environment, you probably want something designed for this, like an IR camera, but those can be expensive. Other than that, a big lens and long exposures are your friend :)
Histogram equalization may be of interest in improving the image contrast. But, sometimes it can just enhance the noise. OpenCV has a global histogram equalization function called equalizeHist. For a more localized implementation, you'll want to look at Contrast Limited Adaptive Histogram Equalization or CLAHE for short. Here is a good article on it. This page has some nice examples, and some code.
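For reference, global histogram equalization boils down to remapping each gray level through the normalized cumulative histogram. The sketch below is the textbook formula on a list-of-rows image with integer pixel values, not OpenCV's own code, and the function name is chosen here only to echo `equalizeHist`.

```python
def equalize_hist(image, levels=256):
    """Global histogram equalization of an integer grayscale image."""
    hist = [0] * levels
    for row in image:
        for p in row:
            hist[p] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        return [row[:] for row in image]  # constant image: nothing to stretch
    # Map each level so the cumulative distribution becomes roughly linear.
    lut = [max(0, round((c - cdf_min) * (levels - 1) / (total - cdf_min)))
           for c in cdf]
    return [[lut[p] for p in row] for row in image]
```

On a dark, noisy frame this stretches the noise along with the signal, which is the caveat raised above and the motivation for the contrast limit in CLAHE.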

How to detect the voice from an audio stream

I need to determine when someone speaks in an audio stream. I applied the Hamming window and calculated the FFT. How do I detect the human voice from here?
If you want to experiment with your own voice activity detection algorithms, an FFT can be used as an initial stage. Next you might want to try subtracting any characterized stationary spectral noise background. Then you could try using the modified FFT results to calculate a cepstrum (or some weighted cepstral coefficients) for feature extraction. You could then do some statistical pattern matching on whatever feature vectors you decided to extract, and feed the results to a decision algorithm.
Each of the above steps has likely been a research topic, and a good implementation might involve studying dozens of published research papers, which perhaps can be found in your university library.
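As a very rough starting point, the sketch below covers only a simplified form of the noise-background idea: it compares per-frame energy against a noise floor estimated from the first few frames. It deliberately swaps the spectral and cepstral stages for plain time-domain energy, so it flags any loud sound, not specifically speech. The function names, the assumption that the stream starts with background noise only, and the `ratio` value are all illustrative.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def voice_activity(samples, frame_len, noise_frames=5, ratio=4.0):
    """One boolean per frame: True where energy clearly exceeds the noise floor."""
    energies = frame_energies(samples, frame_len)
    # Assumption: the first noise_frames frames contain background only.
    noise = sum(energies[:noise_frames]) / noise_frames
    return [e > ratio * max(noise, 1e-12) for e in energies]
```

Replacing the energy with features computed from the FFT (band energies or cepstral coefficients) and the fixed ratio with a trained decision rule moves this toward the pipeline described above.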
You don't need to do an FFT for this, you need to implement a Voice Activity Detection algorithm.
