How to detect the voice from an audio stream - signal-processing

I need to determine when someone speaks in an audio stream. I applied a Hamming window and calculated the FFT. How do I detect the human voice from here?

If you want to experiment with your own voice activity detection algorithms, an FFT can be used as an initial stage. Next you might want to try subtracting any characterized stationary spectral noise background. Then you could try using the modified FFT results to calculate a cepstrum (or some weighted cepstral coefficients) for feature extraction. You could then do some statistical pattern matching on whatever feature vectors you decided to extract, and feed the results to a decision algorithm.
Each of the above steps has likely been a research topic, and a good implementation might involve studying dozens of published research papers, which perhaps can be found in your university library.
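To make the pipeline above concrete, here is a minimal sketch of its first stages in Python/NumPy, assuming a mono signal x whose first few frames contain only background noise; the frame sizes, threshold, and energy-based decision are illustrative stand-ins for the feature extraction and statistical matching described above.

```python
import numpy as np

def vad_frames(x, frame_len=400, hop=160, noise_frames=10, margin_db=6.0):
    """Return one True/False per frame: True means 'speech-like'.
    frame_len/hop roughly correspond to 25 ms / 10 ms at 16 kHz."""
    win = np.hamming(frame_len)
    spectra = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * win
        spectra.append(np.abs(np.fft.rfft(frame)) ** 2)    # power spectrum per frame
    spectra = np.array(spectra)

    # Characterize the stationary noise from the first few (assumed speech-free)
    # frames and subtract it, flooring at zero.
    noise = spectra[:noise_frames].mean(axis=0)
    clean = np.maximum(spectra - noise, 0.0)

    # A (real) cepstrum could be computed here for richer features, e.g.:
    #   ceps = np.fft.irfft(np.log(clean + 1e-12), axis=1)

    # Crude decision stage: frame energy relative to the noise floor.
    energy_db = 10 * np.log10(clean.sum(axis=1) + 1e-12)
    noise_db = 10 * np.log10(noise.sum() + 1e-12)
    return energy_db > noise_db + margin_db
```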

You don't need an FFT for this; you need to implement a Voice Activity Detection (VAD) algorithm.
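If you would rather not roll your own, one off-the-shelf option (not from the answer above, just a suggestion) is the WebRTC VAD, exposed in Python by the webrtcvad package; a minimal sketch, assuming raw 16-bit mono PCM at 16 kHz:

```python
import webrtcvad

def speech_flags(pcm, sample_rate=16000, frame_ms=30, mode=2):
    """Yield one True/False per frame of raw little-endian 16-bit mono PCM bytes."""
    vad = webrtcvad.Vad(mode)                               # 0 = least .. 3 = most aggressive
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2    # 2 bytes per sample
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield vad.is_speech(pcm[i:i + frame_bytes], sample_rate)
```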

Related

Why don't we use motion vector data from video in optical flow?

All the implementations of optical flow I have seen in OpenCV take video as an array of frames and then run optical flow on each image. That involves slicing the image into NxN blocks and searching for a velocity vector.
Although the motion vectors in a video codec can be misleading and do not necessarily contain true motion information, why don't we use them to check which blocks are likely to contain motion and then run optical flow only on those blocks? Shouldn't that speed up the process?
The motion vectors calculated for video encoding are not "true motion vectors". The calculation aims to find the best-matching block to achieve the highest compression, not the best motion estimate. That's why they can't be used directly for motion estimation.
However, the motion vectors from the decoder can be used in a few ways to make motion estimation faster and better, e.g. (a rough sketch follows this list):
As a seed for your motion estimation algorithm.
Apply a filter such as a median (depending on the data) to remove outlier motion vectors, then use the result for better motion estimation.
Either of the above can serve as the first step of a two-step motion estimation algorithm.
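As a rough illustration of the seeding idea, here is a sketch that feeds median-filtered decoder vectors into OpenCV's Farneback flow via the OPTFLOW_USE_INITIAL_FLOW flag; block_mvs, a mapping from block index to the decoder's (dx, dy) vector, is a hypothetical input you would have to extract from your own decoder.

```python
import cv2
import numpy as np

def flow_with_codec_seed(prev_gray, next_gray, block_mvs, block_size=16):
    h, w = prev_gray.shape
    flow = np.zeros((h, w, 2), np.float32)
    # Expand the per-block codec vectors into a dense flow field.
    for (by, bx), (dx, dy) in block_mvs.items():
        flow[by*block_size:(by+1)*block_size,
             bx*block_size:(bx+1)*block_size] = (dx, dy)
    # Median-filter each component to suppress outlier vectors.
    for c in range(2):
        flow[..., c] = cv2.medianBlur(np.ascontiguousarray(flow[..., c]), 5)
    # Refine with Farneback, starting from the codec-derived initial flow
    # (arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags).
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, flow,
                                        0.5, 3, 15, 3, 5, 1.2,
                                        cv2.OPTFLOW_USE_INITIAL_FLOW)
```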
OpenCV is a general-purpose image processing framework. It takes in frames, not compressed video, for its algorithms.
You can certainly write a video decoder that also hands displacement info from the codec to OpenCV, but that will be very codec-specific and thus isn't in the scope of OpenCV itself.
I distinctly remember having read academic work on the use of "motion" vectors embedded in H.264 and similar codecs for optical flow/vision motion analysis. The easiest way to use them is in iterative estimators (such as Horn and Schunck): one can simply bootstrap the iterative algorithm with the vectors from the decoder.
This will not improve the estimation accuracy, at best I would guess it could speed up the rate of convergence, allowing for earlier stopping.
Not all H.264 encoding is of the same quality either, especially from real-time systems with limited hardware. There are likely cases where the motion vectors will be outright awful and actually be a detriment to the flow estimation instead of helping. I am thinking, for example, of a low-end IP camera designed for mostly static scenes that is moving under varying lighting, and so on.

How to decorrelate accelerometer data

Is it possible to decorrelate accelerometer data in real-time? If so, how is it done?
Background:
My application receives (X, Y, Z) accelerometer data in real time (the sample rate is 6.75 Hz). The sensor moves in a periodic motion, but the motion is not necessarily along only one axis. The three signals x(t), y(t), and z(t) are therefore slightly correlated, and I would like to know whether I can find a rotation matrix (in real time) that rotates the measured (x, y, z) into a new vector (x*, y*, z*) so that the entire motion is along the z-axis.
I would like to implement the algorithm in C.
Thanks.
What you're trying to do is generally called "principal component analysis". The Wikipedia article is pretty good:
https://en.wikipedia.org/wiki/Principal_component_analysis
For static data you generally use the eigenvectors of the covariance matrix as your new coordinate basis.
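A minimal batch (non-real-time) sketch of that eigenvector approach, assuming samples is an (N, 3) NumPy array of (x, y, z) accelerometer readings; the function name is just illustrative.

```python
import numpy as np

def rotate_to_principal_axes(samples):
    """Rotate (x, y, z) samples so the direction of largest variance maps to z*."""
    centered = samples - samples.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # 3x3 covariance of the three axes
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    rotated = centered @ eigvecs                # project onto the eigenvector basis
    # The last column of `rotated` now carries the largest variance, so most of
    # the periodic motion ends up in the new z* coordinate.
    return rotated, eigvecs
```

For the real-time case you would need to update the covariance estimate incrementally, which is where the reference below comes in.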
PCA in real time is doable, but not super easy. See, for example: http://www.bio-conferences.org/articles/bioconf/pdf/2011/01/bioconf_skills_00055.pdf
I'd like to first of all emphasize that Matt Timmermans' answer describes exactly what people actually do when classifying accelerometer data from clinical studies (a project I worked on).
Then: you're observing a sampled signal. In general, if you have a sensor that gives you samples at a rate of 6.75 Hz, the highest frequency you can detect is 6.75 Hz / 2 = 3.375 Hz. Everything with a higher frequency will inherently be aliased back and look like something with a frequency f, 0 <= f < 3.375 Hz. If you haven't considered this, please go and read up on the Nyquist–Shannon sampling theorem. In particular: shield your sensors (however you do that, e.g. by employing dampeners) from all input above that limit, otherwise your measurements might be worth very little or even nothing. If your sensor does this internally (that's absolutely possible; there are plenty of accelerometers with analog low-pass filters), this has been taken care of. In any case, document that characteristic of your sensor.
Now, your case is a little bit easier because you know pretty well that your whole observation is going to be periodic, and it's measured along three orthogonal axes.
In this case, just doing three discrete Fourier transforms at once, extracting the "strongest" spectral component over all three channels, and finding the phase of that spectral component (which is simply the complex argument of that DFT bin) in the other two channels would give you something you can map to a periodic movement around a specific axis in 3D space. If you want, remove those values (set the bins to 0) and search for the strongest component again, and so on.
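A small sketch of that step, assuming samples is an (N, 3) array of the three channels sampled at 6.75 Hz; it picks the bin with the largest combined power and returns its frequency together with the per-axis amplitude and phase (the complex argument of that bin).

```python
import numpy as np

def dominant_component(samples, fs=6.75):
    """samples: (N, 3) array of x(t), y(t), z(t); returns (freq, amplitudes, phases)."""
    spectra = np.fft.rfft(samples, axis=0)        # one DFT per channel
    power = np.abs(spectra) ** 2
    k = np.argmax(power[1:].sum(axis=1)) + 1      # strongest bin over all channels, skipping DC
    freq = k * fs / samples.shape[0]
    amplitude = np.abs(spectra[k])                # per-axis magnitude of that bin
    phase = np.angle(spectra[k])                  # per-axis phase = complex argument
    return freq, amplitude, phase
```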
Discrete Fourier (and cosine) transforms can be computed at staggering speed nowadays. With 6.75 Hz, no PC in this world will ever get into trouble when you try this while receiving further samples; it's a hilariously low sampling rate.
Another, more elegant approach (read: you need fewer samples to compute it) would be to use a parametric estimator; in your case, a direction-of-arrival estimator from the world of RF technology with multiple antennas would, as far as I can tell, map directly to detecting the rotational axis. The classical algorithms here are MUSIC and ESPRIT, and for your case (a limited, known number of oscillating parts), ESPRIT might be the better choice.

Denoise EEG signal using Daubechies wavelets

I have an EEG signal that contains eye-blink artifacts. I have read some references and know that eye blinks can be detected and removed using the wavelet transform, but I don't know how to do it. How do I detect eye blinks? Are there any tutorials? After transforming the EEG signal into wavelet coefficients, what should I do, and which level of Daubechies wavelet should be used? Thank you!
I don't know whether this will work but you can give it a try.
Wavelet transform works like a filter bank.
Set the wavelet decomposition level to a value such that the last level of the decomposition gives you a band of roughly 0 Hz to 5 Hz.
Get the detail coefficients at this level, apply (soft/hard) thresholding to them, and then reconstruct the signal using the new coefficients.
Blinks have a relatively high amplitude and thresholding on them might give you what you want.
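A rough sketch of those steps using PyWavelets, assuming eeg is a one-dimensional signal sampled at fs Hz; the db4 wavelet and the threshold rule are illustrative choices, not recommendations.

```python
import numpy as np
import pywt

def threshold_blink_band(eeg, fs, wavelet="db4"):
    # Pick a decomposition depth so the coarsest band covers roughly 0-5 Hz.
    level = max(1, int(np.floor(np.log2(fs / 5.0))))
    coeffs = pywt.wavedec(eeg, wavelet, level=level)
    # coeffs[0] is the approximation, coeffs[1] the coarsest detail band;
    # high-amplitude, low-frequency blink activity ends up around here.
    detail = coeffs[1]
    thr = 3.0 * np.median(np.abs(detail)) / 0.6745   # robust threshold estimate
    coeffs[1] = pywt.threshold(detail, thr, mode="soft")
    return pywt.waverec(coeffs, wavelet)[:len(eeg)]
```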
If you want to remove eye blinks, a commonly used approach is running Independent Component Analysis (ICA) on the data, identifying the blink artifact and backtransforming to the original data without that independent component. There are other approaches, but ICA works quite well even in very noisy EEG data (e.g. from simultaneous EEG-fMRI sessions).
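A minimal sketch of that ICA workflow using scikit-learn's FastICA, assuming eeg is a (samples, channels) array; picking the blink component by kurtosis is a simple heuristic stand-in for the manual or automatic identification the answer refers to.

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

def remove_blink_component(eeg):
    ica = FastICA(n_components=eeg.shape[1], random_state=0)
    sources = ica.fit_transform(eeg)              # (samples, independent components)
    # Blink components are typically spiky, so pick the most kurtotic one.
    blink = int(np.argmax(kurtosis(sources, axis=0)))
    sources[:, blink] = 0.0                       # drop that component
    return ica.inverse_transform(sources)         # back-transform to channel space
```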
Eye blinks generally have a frequency between 2 and 5 Hz.
You can first train a system to capture eye blinks, then use it to detect the blinks in an EEG signal.

Real time tracking of hand

I am trying to detect and track a hand in real time using OpenCV. I thought Haar cascade classifiers would yield a fair result. After training with 10k positive and 20k negative images, I obtained a classifier XML file. Unfortunately, it detects the hand only in certain positions, suggesting that it works best only for rigid objects. So I am now thinking of adopting another algorithm that can track the hand once it has been detected by the Haar classifier.
My question is: if I make sure that the Haar classifier detects the hand in a certain frame, in a certain position, what method would yield robust tracking of the hand from there?
I searched the web a bit and understand I could go for optical flow of the detected hand, a Kalman filter, or a particle filter, but I have also come across their respective disadvantages.
Also, if I incorporate stereo vision, would it help, since I could then possibly reconstruct the hand in 3D?
You concluded rightly about Haar features - they aren't that useful when it comes to non-rigid objects.
Take a look at the following papers which use skin colour to detect hands.
Interaction between hands and wearable cameras
Markerless inspection of augmented reality objects
and this paper that uses KLT features to track the hand after the first detection:
Fast 2D hand tracking with flocks of features and multi-cue integration
I would say that a stereo camera will not help your cause much, as 3D reconstruction of non-rigid objects isn't straightforward and would require a whole lot of innovation and development. However, you can take a look at the papers in the hand pose estimation section of this page if you wish to pursue 3D tracking.
EDIT: Also take a look at this recent paper, which seems to get good results.
Zhang et al.'s Real-time Compressive Tracking does a reasonable job of tracking an object, once it has been detected by some other method, provided that the motion is not too fast. They have an OpenCV implementation (but it would need a bit of work to reuse).
This research paper describes a method to track hands, without using gloves by using a stereo camera setup.
There have been similar questions on Stack Overflow; have a look at my answer and those of others: https://stackoverflow.com/a/17375647/1463143
You can certainly get better results by avoiding Haar training and detection for deformable entities.
CamShift algorithm is generally fast and accurate, if you want to track the hand as a single entity. OpenCV documentation contains a good, easy-to-understand demo program that you can easily modify.
If you need to track fingers etc., however, further modeling will be needed.
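For reference, a minimal sketch along the lines of the OpenCV CamShift demo, assuming cap is an opened cv2.VideoCapture and roi = (x, y, w, h) comes from your prior hand detection; the histogram and termination parameters are the usual demo values.

```python
import cv2
import numpy as np

def track_hand(cap, roi):
    x, y, w, h = roi
    ok, frame = cap.read()
    hsv_roi = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
    # Hue histogram of the detected hand region (skin tones cluster in hue).
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    track_window = roi
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        # CamShift adapts the window position, size, and orientation each frame.
        rot_rect, track_window = cv2.CamShift(backproj, track_window, term)
        box = np.int32(cv2.boxPoints(rot_rect))
        cv2.polylines(frame, [box], True, (0, 255, 0), 2)
        cv2.imshow("tracking", frame)
        if cv2.waitKey(30) == 27:   # Esc quits
            break
```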

Identify a specific sound on iOS

I'd like to be able to recognise a specific sound in an iOS application. I guess it would basically work like speech recognition in that it's fairly fuzzy, but it would only have to be for 1 specific sound.
I've done some quick FFT work to identify specific frequencies over a certain threshold, and only when they're solo (i.e. not surrounded by other frequencies), so I can identify individual tones pretty easily. I'm thinking this is just an extension of that, but comparing against an FFT data set of a recording of the sound, say in 0.1-second chunks over the length of the audio. I would also have to account for variation in amplitude, a little in pitch, and a little in time.
Can anyone point me to any pre-existing source that I could use to speed this process along? I can't seem to find anything usable. Or failing that, any ideas on how to get started on something like this?
Thanks very much
From your description it is not entirely clear what you want to do.
What is the "specific" sound like? Does it have high background noise?
What's the specific recognizable feature (e.g. pitch, inharmonicity, timbre, ...)?
Against which other "sounds" do you want to compare it?
Do you simply want to match an arbitrary sound spectrum against a "template sound"?
Is your sound percussive, melodic, speech, ...? Is it long, short ...?
What's the frequency range in which you expect the best discriminability? Are the features invariant over time?
There is no "general" solution that works for everything. Speech recognition in itself is fairly complex and won't work well for abstract sounds whose discriminable frequencies are not in, for example, the mel bands.
So in conclusion, you are leaving too many open questions to get a useful answer.
The only suggestion I can make based on the little information given is the following:
For the template sound:
1) Extract the spectral peak positions from the power spectrum.
2) Measure the standard deviation around each peak and construct a Gaussian from it.
3) Save the Gaussians for later classification.
For unknown sounds:
1) Extract the spectral peak positions.
2) Project those points onto the saved Gaussians, which leaves you with z-scores of the peak positions.
3) With the computed z-scores you should be able to decide whether the sound matches your template.
Note: this is a very crude method that discriminates sounds according to their most powerful frequencies. Using Gaussians leaves room for slight shifts in those frequencies.
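A very rough sketch of that recipe, assuming mono signals at a common sample rate fs; the peak-picking parameters and the fixed spread standing in for the measured standard deviations are illustrative.

```python
import numpy as np
from scipy.signal import find_peaks

def template_model(signal, fs, n_peaks=5, sigma_hz=10.0):
    """Model the template sound as Gaussians centered on its strongest peaks."""
    power = np.abs(np.fft.rfft(signal * np.hanning(len(signal)))) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    idx, _ = find_peaks(power, distance=5)
    top = idx[np.argsort(power[idx])[-n_peaks:]]
    # sigma_hz is a fixed stand-in for the measured spread around each peak.
    return [(freqs[i], sigma_hz) for i in sorted(top)]

def match_score(signal, fs, model):
    """Mean z-score of the template peaks against the nearest observed peaks."""
    power = np.abs(np.fft.rfft(signal * np.hanning(len(signal)))) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    idx, _ = find_peaks(power, distance=5)
    peaks = freqs[idx[np.argsort(power[idx])[-len(model):]]]
    z = [abs(min(peaks, key=lambda p: abs(p - mu)) - mu) / sigma
         for mu, sigma in model]
    return float(np.mean(z))   # lower means a closer match to the template
```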
