Optical flow and finger tracking - opencv

I am using opencv to implement finger tracking system
And also use
calcOpticalFlowPyrLK(pGmask,nGmask,fingers,track,status,err);
to perform a LK tracker.
The concept I am not clear, after I implement the LK tracker, how should I detect the movement of fingers? Also, the tracker get the last frame and current frame, how to detect a series of action or continuous gesture like within 5 frames?

The 4th parameter of calcOpticalFlowPyrLK (here track) will contain the calculated new positions of input features in the second image (here nGmask).
In the simple case, you can estimate the centroid separately of fingers and track where you can infer to the movement. Making decision can be done from the direction and magnitude of the vector pointing from fingers' centroid to track's centroid.
Furthermore, complex movements can be considered as time series, because movements are consisting of some successive measurements made over a time interval. These measurements could be the direction and magnitude of the vector mentioned above. So any movement can be represented as below:
("label of movement", time_series), where
time_series = {(d1, m1), (d2, m2), ..., (dn, mn)}, where
di is direction and mi is magnitude of the ith vector (i=1..n)
So the time-series consists of n * 2 measurements (sampling n times), that's the only question how to recognize movements?
If you have prior information about the movement, i.e. you know how to perform a circular movement, write an a letter etc. then the question can be reduced to: how to align time series to themselves?
Here comes the well known Dynamic Time Warping (DTW). It can be also considered as a generative model, but it is used between pairs of sequences. DTW is an algorithm for measuring similarity between two temporal sequences which may vary in time or speed (such in our case).
In general, DTW calculates an optimal match between two given time series with certain restrictions. The sequences are warped non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension.

Related

accuracy of dense optical flow

Currently I am learning dense optical flow by myself. To understand it, I conduct one experiment. I produce one image using Matlab. One box with a given grays value is placed under one uniform background and the box is translated two pixels in x and y directions in another image. The two images are input into the implementation of the algorithm called TV-L1. The generated motion vector outer of the box is not zero. Is the reason that the gradient outer of the box is zero? Is the values filled in from the values with large gradient value?
In Horn and Schunck's paper, it reads
In parts of the image where the brightness gradient is zero, the velocity
estimates will simply be averages of the neighboring velocity estimates. There
is no local information to constrain the apparent velocity of motion of the
brightness pattern in these areas.
The progress of this filling-in phenomena is similar to the propagation effects
in the solution of the heat equation for a uniform flat plate, where the time rate of change of temperature is proportional to the Laplacian.
Is it not possible to obtain correct motion vectors for pixels with small gradients? Or the experiment is not practical. In practical applications, this doesn't happen.
Yes, in so called homogenous image regions with very small gradients no information where a motion can dervided from exists. That's why the motion from your rectangle is propagated outer the border. If you give your background a texture this effect will be less dominant. I know such problem when it comes to estimate the ego-motion of a car. Then the streat makes a lot of problems cause of here homogenoutiy.
Two pioneers in this field Lukas&Kanade (LK) and Horn&Schunch (HS) are developed methods for computing Optical Flow (OF). Both rely on brightness constancy assumption which feature location pixel values between two sequence frames not change. This constraint may be expressed as two equations: I(x+dx,y+dy,t+dt)=I(x,y,t) and ∂I/∂x dx+∂I/∂y dy+∂I/∂t dt=0 by using a Taylor series expansion I(x+dx,y+dy,t+dt) , we get (x+dx,y+dy,t+dt)=I(x,y,t)+∂I/∂x dx+∂I/∂y dy+∂I/∂t dt… letting ∂x/∂t=u and ∂y/∂t=v and combining these equations we get the OF constraint equation: ∂I/∂t=∂I/∂t u+∂I/∂t v . The OF equation has more than one solution, so the different techniques diverge here. LK equations are derived assuming that pixels in a neighborhood of each tracked feature move with the same velocity as the feature. In OpenCV, to catch large motions with a small window size (to keep the “same local velocity” assumption).

Measure of motion in a video

I am trying to get a measure of motion between frames of an American Football match. When the players are set on the line of scrimmage, the motion should be minimal as few parts are moving. Then once the play begins, there will be a sharp increase in motion.
I am aware of optical flow being used to find the general motion between parts of an image. However, is there any way to quantify how much motion is occurring?
By taking the sum of all motion vectors, you get global motion between frames.
By definition, the motion vectors calculated by optical flow algorithms will be the transformations between corresponding parts of frame(i) and frame(i+1). Therefore, the amount of motion between two frames is the sum of the motion vectors (motion of each part). You may simply add together their magnitude for a rough estimate of how chaotic the transformation/motion was. You may also add together the vectors to receive a new vector that shows net motion.
As far as I can tell, you want the first value; sum(motionVectors.magnitude). This will give you a number allowing you to compare (roughly), the amount of motion in two different frames. This also allows you to graph motion over time, allowing you to compare two videos using metrics such as 'unexpected motion', or 'total motion'.
There are many uses for this data, and I'll leave them to your imagination.

What FFT descriptors should be used as feature to implement classification or clustering algorithm?

I have some geographical trajectories sampled to analyze, and I calculated the histogram of data in spatial and temporal dimension, which yielded a time domain based feature for each spatial element. I want to perform a discrete FFT to transform the time domain based feature into frequency domain based feature (which I think maybe more robust), and then do some classification or clustering algorithms.
But I'm not sure using what descriptor as frequency domain based feature, since there are amplitude spectrum, power spectrum and phase spectrum of a signal and I've read some references but still got confused about the significance. And what distance (similarity) function should be used as measurement when performing learning algorithms on frequency domain based feature vector(Euclidean distance? Cosine distance? Gaussian function? Chi-kernel or something else?)
Hope someone give me a clue or some material that I can refer to, thanks~
Edit
Thanks to #DrKoch, I chose a spatial element with the largest L-1 norm and plotted its log power spectrum in python and it did show some prominent peaks, below is my code and the figure
import numpy as np
import matplotlib.pyplot as plt
sp = np.fft.fft(signal)
freq = np.fft.fftfreq(signal.shape[-1], d = 1.) # time sloth of histogram is 1 hour
plt.plot(freq, np.log10(np.abs(sp) ** 2))
plt.show()
And I have several trivial questions to ask to make sure I totally understand your suggestion:
In your second suggestion, you said "ignore all these values."
Do you mean the horizontal line represent the threshold and all values below it should be assigned to value zero?
"you may search for the two, three largest peaks and use their location and probably widths as 'Features' for further classification."
I'm a little bit confused about the meaning of "location" and "width", does "location" refer to the log value of power spectrum (y-axis) and "width" refer to the frequency (x-axis)? If so, how to combine them together as a feature vector and compare two feature vector of "a similar frequency and a similar widths" ?
Edit
I replaced np.fft.fft with np.fft.rfft to calculate the positive part and plot both power spectrum and log power spectrum.
code:
f, axarr = plt.subplot(2, sharex = True)
axarr[0].plot(freq, np.abs(sp) ** 2)
axarr[1].plot(freq, np.log10(np.abs(sp) ** 2))
plt.show()
figure:
Please correct me if I'm wrong:
I think I should keep the last four peaks in first figure with power = np.abs(sp) ** 2 and power[power < threshold] = 0 because the log power spectrum reduces the difference among each component. And then use the log spectrum of new power as feature vector to feed classifiers.
I also see some reference suggest applying a window function (e.g. Hamming window) before doing fft to avoid spectral leakage. My raw data is sampled every 5 ~ 15 seconds and I've applied a histogram on sampling time, is that method equivalent to apply a window function or I still need apply it on the histogram data?
Generally you should extract just a small number of "Features" out of the complete FFT spectrum.
First: Use the log power spec.
Complex numbers and Phase are useless in these circumstances, because they depend on where you start/stop your data acquisiton (among many other things)
Second: you will see a "Noise Level" e.g. most values are below a certain threshold, ignore all these values.
Third: If you are lucky, e.g. your data has some harmonic content (cycles, repetitions) you will see a few prominent Peaks.
If there are clear peaks, it is even easier to detect the noise: Everything between the peaks should be considered noise.
Now you may search for the two, three largest peaks and use their location and probably widths as "Features" for further classification.
Location is the x-value of the peak i.e. the 'frequency'. It says something how "fast" your cycles are in the input data.
If your cycles don't have constant frequency during the measuring intervall (or you use a window before caclculating the FFT), the peak will be broader than one bin. So this widths of the peak says something about the 'stability' of your cycles.
Based on this: Two patterns are similar if the biggest peaks of both hava a similar frequency and a similar widths, and so on.
EDIT
Very intersiting to see a logarithmic power spectrum of one of your examples.
Now its clear that your input contains a single harmonic (periodic, oscillating) component with a frequency (repetition rate, cycle-duration) of about f0=0.04.
(This is relative frquency, proprtional to the your sampling frequency, the inverse of the time beetween individual measurment points)
Its is not a pute sine-wave, but some "interesting" waveform. Such waveforms produce peaks at 1*f0, 2*f0, 3*f0 and so on.
(So using an FFT for further analysis turns out to be very good idea)
At this point you should produce spectra of several measurements and see what makes a similar measurement and how differ different measurements. What are the "important" features to distinguish your mesurements? Thinks to look out for:
Absolute amplitude: Height of the prominent (leftmost, highest) peaks.
Pitch (Main cycle rate, speed of changes): this is position of first peak, distance between consecutive peaks.
Exact Waveform: Relative amplitude of the first few peaks.
If your most important feature is absoulute amplitude, you're better off with calculating the RMS (root mean square) level of our input signal.
If pitch is important, you're better off with calculationg the ACF (auto-correlation function) of your input signal.
Don't focus on the leftmost peaks, these come from the high frequency components in your input and tend to vary as much as the noise floor.
Windows
For a high quality analyis it is importnat to apply a window to the input data before applying the FFT. This reduces the infulens of the "jump" between the end of your input vector ant the beginning of your input vector, because the FFT considers the input as a single cycle.
There are several popular windows which mark different choices of an unavoidable trade-off: Precision of a single peak vs. level of sidelobes:
You chose a "rectangular window" (equivalent to no window at all, just start/stop your measurement). This gives excellent precission of your peaks which now have a width of just one sample. Your sidelobes (the small peaks left and right of your main peaks) are at -21dB, very tolerable given your input data. In your case this is an excellent choice.
A Hanning window is a single cosine wave. It makes your peaks slightly broader but reduces side-lobe levels.
The Hammimg-Window (cosine-wave, slightly raised above 0.0) produces even broader peaks, but supresses side-lobes by -42 dB. This is a good choice if you expect further weak (but important) components between your main peaks or generally if you have complicated signals like speech, music and so on.
Edit: Scaling
Correct scaling of a spectrum is a complicated thing, because the values of the FFT lines depend on may things like sampling rate, lenght of FFT, window, and even implementation details of the FFT algorithm (there exist several different accepted conventions).
After all, the FFT should show the underlying conservation of energy. The RMS of the input signal should be the same as the RMS (Energy) of the spectrum.
On the other hand: if used for classification it is enough to maintain relative amplitudes. As long as the paramaters mentioned above do not change, the result can be used for classification without further scaling.

fast fourier transform apply window and overlap

This may be a naive question, but I didn't find exact details in searching.
In FFT with window overlapping, after we've applied window functions to sequences of data set with overlapping and got the FFT results, how do we combine those FFT results for overlapping sequence?
Do we just add them together, treating those frequency domain results as non-overlapping parts?
Are magnitudes of these results in complex numbers frequency magnitudes?
Thank you.
For each FFT you typically calculate the magnitude of each complex output bin - this gives you a spectrum (magnitude versus frequency) for one window. The sequence of magnitude spectra for all time windows is effectively a 3D data set or graph - magnitude versus frequency versus time - which is typically plotted as a a spectrogram, waterfall or time varying 2D spectrum.
In the specific case where the data is statistically stationary and you just want to reduce the variance you can average the successive magnitude spectra - this is called ensemble averaging. Normally though for time-varying signals such as speech or music you would not want to do this.

Features for gesture recognition

I would like to create an application which can learn to classify a sequence of points drawn by a user, e.g. something like handwriting recognition. If the data point consists of a number of (x,y) pairs (like the pixels corresponding to a gesture instance), what are the best features to compute about the instance which would make for a good multi-class classifier (e.g. SVM, NN, etc)? Particularly if there are limited training examples provided.
If I were you, I would find the data points that correspond with corners, end points and intersections, use those as features and discard the intermediate points. You could include the angle or some other descriptor of these interest points as well.
For detecting interest points you could use a Harris detector, you could then use the gradient value at that point as a simple descriptor. Alternatively you could go with a more fancy method like SIFT.
You could use the descriptor of every pixel in your downsampled image and then classify with SVM. The disadvantage of that is that there would be a large amount of uninteresting data points in the feature vector.
An alternative would be to not approach it as a classification problem but as a template matching problem (fairly common in computer-vision). In this case a gesture can be specified as an arbitrary number of interest points, completely leaving out the non-interesting data. A certain threshold percentage of an instance's points has to match a template for a positive identification. For example, when matching the corner points of an instance of 'R' against the template for 'X', the bottom right point should match, being end points in the same position orientation, but the others are too dissimilar, giving a fairly low score and the identification R=X will be rejected.

Resources