I've been struggling with a phase vocoder for a few weeks. The ultimate goal is time-stretching a signal. I've made a lot of progress, but I still have two issues to solve.
Issue 1: Do I need a synthesis window?
I take overlapping frames from the input signal (a sine wave) with any hop size (e.g. N/2, N = samples per frame). I apply a Hanning window to the frame and feed the result to FFT.
To achieve time-stretching I perform iFFT and overlap-add the output frames using a different hop size than the one used during analysis.
The problem is that with an output hop factor of 0.5 (hop size = N/2) the output is smooth, but for larger hop sizes I can hear 'vibrations'. The image shows the output of 8 frames with a hop factor of 1 (zero overlap). It is evident why the sound is vibrating. For small hop sizes the frames overlap much more and the sound is smoother. I've read a lot about phase vocoding, but I still don't understand how to get a smooth output for large hop sizes. What am I missing?
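In rough C++, the loop I'm describing looks like this (fft()/ifft()/processPhases() are stand-ins for my actual calls, and phase correction is omitted):

#include <vector>
#include <cmath>

// out is assumed pre-sized to the stretched length and zero-filled.
void timeStretch(const std::vector<float>& in, std::vector<float>& out,
                 int N, int analysisHop, int synthesisHop)
{
    const float pi = 3.14159265f;
    std::vector<float> win(N);
    for (int n = 0; n < N; ++n)  // Hanning window
        win[n] = 0.5f * (1.0f - std::cos(2.0f * pi * n / (N - 1)));

    int outPos = 0;
    for (int inPos = 0; inPos + N <= (int)in.size(); inPos += analysisHop)
    {
        std::vector<float> frame(N);
        for (int n = 0; n < N; ++n)
            frame[n] = in[inPos + n] * win[n];        // analysis window
        // frame = ifft(processPhases(fft(frame)));   // stand-in FFT step
        for (int n = 0; n < N && outPos + n < (int)out.size(); ++n)
            out[outPos + n] += frame[n];  // Issue 1: apply win[n] again here?
        outPos += synthesisHop;
    }
}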
Issue 2: Phase correction.
Currently the output sounds worse with phase correction but I'll leave that for another post.
Thanks in advance for taking the time.
I'm an amateur at this, but wouldn't you get a better result if you started with a much bigger overlap, e.g. a "hop size" of N/10 or something like that? Then you'd have more freedom to adjust it on output while still keeping a substantial overlap.
Also, it might pay to adjust the steepness of the window depending on how much you're expanding/compressing time.
I am trying the stereo_calib example and it fails with garbage output.
However, it is finding corners in my images...
My xml file and images are all here:
https://drive.google.com/open?id=12-5jBN7FK-LO6SLb4r3YYkrOnP7f_xmG
What am I doing wrong? I first tried printing a pattern on a sheet of paper, then thought ok that must be too wavy or something, so had this printed on foam board. But no dice.
(we chatted on a side channel, so this is to the benefit of the rest of the world)
tl;dr: hold the board very still or get a camera with global shutter.
Rolling shutter (see here and there), an attribute of most webcam sensors, many camcorder sensors, and some industrial image sensors, will distort objects that are moving. If you've moved the board even just a little during a frame capture (visible in files right19/right20), it will be captured with distortion. That will affect everything you do with the picture, starting with intrinsic calibration.
To give a sense of scale for the distortions: assuming a 30 FPS video stream, the worst case rolling shutter lag is 33 ms. A pedestrian travels 40-50 mm in that time. If your hands are moving slightly, you can maybe expect a tenth of that, which is still a lot in proportion to the square sizes most people use.
Another source of trouble is printers. If you've printed your checkerboard pattern, make sure to measure the width and height of your squares; they might be slightly rectangular. It's also a good idea to make sure the pattern is quite flat, not bent.
I'm trying to make a program that detects people in CCTV footage and I've made a lot of progress. Unfortunately, the amount of noise in the videos varies a lot between different cameras and times of day, and so it differs with each of the sample videos. This means that the NoiseSigma needed varies from 1 to 25.
I've used the fastNlMeansDenoisingColored function and that helped a bit, but the NoiseSigma is still an issue.
Would it be effective to maybe loop through the video once, and somehow get an idea for how noisy the video is and make a relationship for noise vs NoiseSigma? Any ideas would be welcome.
I don't think it's possible to determine the noise level in an image (or video) without reference data which doesn't contain any noise. One thing that comes to my mind is to record some static scenery, measure how all the frames differ from each other, and then try to find some relationship (hopefully linear) between that measure and NoiseSigma. If there were no noise, the accumulated difference between frames would be 0. By accumulated difference I mean something like this:
double cumulativeError = 0.0;
for (size_t i = 1; i < frames.size(); ++i)
{
    cv::Mat diff;
    cv::absdiff(frames[i], frames[i - 1], diff);  // per-pixel |frame(i) - frame(i-1)|
    cumulativeError += cv::sum(diff)[0];          // add up all elements
}
cumulativeError /= frames.size();
Here cv::sum adds up all elements of the difference image to produce a scalar value.
Please keep in mind that I'm just following my intuition here and it's not a method I've seen before.
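If a roughly linear relationship does show up, the mapping could be as simple as this sketch (the calibration numbers are made-up placeholders you would replace with measurements from your own footage):

#include <algorithm>

// Linear fit through two hypothetical calibration points (error, sigma),
// measured once on a clean clip and once on the noisiest sample video.
double noiseSigmaFor(double cumulativeError)
{
    const double e1 = 2.0,  s1 = 1.0;   // placeholder: cleanest camera
    const double e2 = 60.0, s2 = 25.0;  // placeholder: noisiest camera
    double sigma = s1 + (cumulativeError - e1) * (s2 - s1) / (e2 - e1);
    return std::min(25.0, std::max(1.0, sigma));  // clamp to the 1-25 range
}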
I'm writing an algorithm for auto focus. For that I'm using a stepper motor which has 3318 steps for focus.
To find the focus, after every frame from the camera I take statistics and perform a calculation that yields a numeric value, the focus value (fv). The motor step where I get the highest fv is where the image is best focused.
Right now I am traversing all the steps to find the maximum fv. It works, but it takes too long: about 15 seconds.
Is there any algorithm I can use to reduce the number of steps and minimize the time to find the focused point?
If you assume there is:
A single global maximum sharpness score;
No local maxima
Then your focus function should be relatively smooth.
In this case, you can do a search that is faster than linear.
Basically, start somewhere and start rolling downhill.
You can use, e.g., golden-section search, or, by calculating the local change (derivatives), Newton's method (rolling downhill) or conjugate gradients (jumping downhill).
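A minimal sketch of golden-section search over the stepper range, assuming the focus function is unimodal; moveToStep() and measureFocusValue() are hypothetical stand-ins for your motor and metric code:

#include <cmath>

// Hypothetical hardware/metric hooks (not from the question):
void moveToStep(int step);     // drive the stepper to an absolute position
double measureFocusValue();    // grab a frame and compute its fv

double focusValueAt(int step)
{
    moveToStep(step);
    return measureFocusValue();
}

// Golden-section search for the maximum of a unimodal function on [lo, hi].
int goldenSectionFocus(int lo, int hi)
{
    const double invPhi = (std::sqrt(5.0) - 1.0) / 2.0;  // ~0.618
    int a = lo, b = hi;
    int c = b - (int)(invPhi * (b - a));
    int d = a + (int)(invPhi * (b - a));
    double fc = focusValueAt(c), fd = focusValueAt(d);
    while (b - a > 2)  // stop when the bracket is a few steps wide
    {
        if (fc > fd)   // maximum lies in [a, d]
        {
            b = d; d = c; fd = fc;
            c = b - (int)(invPhi * (b - a));
            fc = focusValueAt(c);
        }
        else           // maximum lies in [c, b]
        {
            a = c; c = d; fc = fd;
            d = a + (int)(invPhi * (b - a));
            fd = focusValueAt(d);
        }
    }
    return (fc > fd) ? c : d;  // e.g. goldenSectionFocus(0, 3318)
}

Because of the integer rounding, it's worth finishing with a short linear scan of the few steps around the returned position.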
First, find out what exactly your bottleneck is:
time to take a frame
time to move the stepper motor to a specific position
time to process the frame and get a focus function value
Then learn something about the functional dependence of your focus function on the focus position in general (for your samples).
Is it smooth or bumpy (noisy)?
Is it wide (very flat maximum) or is it narrow (very steep, small maximum)?
Is it approximately quadratic?
Most probably there is not much noise, the maximum is rather wide and approximately quadratic.
Then Newton's method or the Levenberg-Marquardt fitting algorithm would converge in a few iterations.
However, like the golden-section search mentioned in the answer by Adi Shavit, they only find local optima.
When noise is a problem, I recommend a robust, zoom-in approach:
Measure 10 frames over the whole range (332 steps away each)
Smooth the resulting 10 values slightly if there is noise present
Take the position of the best frame
Measure 20 frames over a range of [-330,330] steps around this best frame with a step size of 33 steps per frame
Smooth the resulting 20 values slightly if there is noise present
Take the position of the best frame
Measure 10 frames over a range of [-15, 15] steps around this best frame with a step size of 3 steps per frame
Smooth the resulting 10 values slightly if there is noise present
Take the best frame and measure one frame above and below
Take the best frame, it's the focus position
This needs 10+20+10+2 = 42 frames recorded and may therefore give an approximately 80-times speedup compared to taking 3318 frames (about 0.2 s instead of 15 s), if taking the frames is the crucial part and not moving the stepper motor.
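In rough C++, the zoom-in scan could look like this (smoothing omitted for brevity; moveToStep() and measureFocusValue() are the same hypothetical stand-ins as in the earlier sketch):

#include <algorithm>

// Hypothetical hooks, as before:
void moveToStep(int step);
double measureFocusValue();

// Scan [center - halfRange, center + halfRange] with the given stride and
// return the step with the best focus value.
int scanForBest(int center, int halfRange, int stride, int maxStep = 3318)
{
    int bestStep = center;
    double bestFv = -1.0;
    int from = std::max(0, center - halfRange);
    int to   = std::min(maxStep, center + halfRange);
    for (int s = from; s <= to; s += stride)
    {
        moveToStep(s);
        double fv = measureFocusValue();
        if (fv > bestFv) { bestFv = fv; bestStep = s; }
    }
    return bestStep;
}

int autoFocus()
{
    int best = scanForBest(3318 / 2, 3318 / 2, 332);  // pass 1: whole range
    best = scanForBest(best, 330, 33);                // pass 2: +-330, step 33
    best = scanForBest(best, 15, 3);                  // pass 3: +-15, step 3
    best = scanForBest(best, 1, 1);                   // pass 4: neighbours
    return best;                                      // the focus position
}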
Lately I have been experimenting with audio and FFTs, specifically the Minim library in Processing (basically Java, not that it's particularly important for this question). What I have come to understand is that with a buffer/sample size N and sample rate K, after performing a forward FFT I will get N frequency bins (only N/2 of which are usable; in fact Minim only returns N/2 bins), linearly spaced, representing the spectrum from 0 to K/2 Hz.
With Minim (as well as other typical FFT implementations) you wait to gather N samples, and then perform the forward transformation, then wait for N more samples, and so on. In order to get a reasonable frame-rate (for audio visualizations, beat detection, etc.), I must use a small sample size relative to the sampling frequency.
The problem with this, though, is that a small sample size results in very low resolution for the low end of the spectrum when I compute logarithmically spaced averages (since a bass octave is much narrower, in Hz, than a high-pitched octave).
I was wondering if a possible way to squeeze out more apparent resolution would be to perform FFTs more often than every N samples, on a slightly larger sample size than I am currently using (i.e. with an input buffer of size 2048, every 100 samples add the new samples to the buffer, remove the oldest 100 samples, and perform an FFT). It seems like this would create a rolling-average type of effect (which I can live with), but I'm not too sure.
What would be the pros and cons of this approach? Are there any other ways I could increase my apparent resolution while still being able to do real-time visualization and analysis?
That approach goes by the name short-time Fourier transform (STFT). You get all the answers to your question on Wikipedia: https://en.wikipedia.org/wiki/Short-time_Fourier_transform
It works great in practice, and you can even get better resolution out of it than you would expect from a rolling window by using the phase difference between the FFTs.
Here is one article that does pitch shifting of audio signals; how to get higher frequency resolution is well explained there: http://www.dspdimension.com/admin/pitch-shifting-using-the-ft/
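To make the bookkeeping concrete, here is a minimal sketch of the rolling buffer you describe (2048-sample frame, FFT every 100 samples); fftMagnitudes() is a hypothetical stand-in for whatever FFT routine you use:

#include <vector>
#include <deque>
#include <cmath>

// Hypothetical transform: N windowed samples in, N/2 magnitudes out.
std::vector<float> fftMagnitudes(const std::vector<float>& frame);

const size_t kFrameSize = 2048;  // FFT length from the question
const size_t kHop = 100;         // samples between successive FFTs

std::deque<float> buffer;        // always holds the last kFrameSize samples
size_t sinceLastFft = 0;

void onSamples(const std::vector<float>& incoming)
{
    const float pi = 3.14159265f;
    for (float s : incoming)
    {
        buffer.push_back(s);
        if (buffer.size() > kFrameSize) buffer.pop_front();
        if (buffer.size() == kFrameSize && ++sinceLastFft >= kHop)
        {
            sinceLastFft = 0;
            std::vector<float> frame(buffer.begin(), buffer.end());
            for (size_t n = 0; n < kFrameSize; ++n)  // Hann window
                frame[n] *= 0.5f * (1.0f - std::cos(2.0f * pi * n / (kFrameSize - 1)));
            std::vector<float> mags = fftMagnitudes(frame);
            // ...feed mags to the visualization / beat detection...
        }
    }
}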
We use the approach you describe, which we call overlapping, to make sure all the rows of a spectral waterfall are filled in. Overlap can be used to provide spectra that are spaced as closely as a single sample interval.
The primary disadvantage is the extra processing to produce all those spectra.
On the positive side, while the time resolution of each spectrum is still constrained by FFT size, looking at closely spaced adjacent spectra seems to provide a kind of visual interpolation that, I think, lets you see the data with higher precision.
One common way this is done is to use multiple lengths of windowed FFTs on the same data, short FFTs for good time resolution, much longer FFTs for better frequency resolution of lower frequencies. Then the problem for visualization becomes picking the best FFT result out of several possible at each plot point (such as the highest contrast sub-block, etc.) and blending them attractively.
Most modern processors (in PCs and mobile phones, etc.) can easily do multiple lengths (dozens) of FFTs still in real-time for audio.
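As a rough sketch of that multi-length blending (the FFT lengths and the band split are made-up placeholders, and fftMagnitudes() is again a hypothetical stand-in): take low display bands from a long FFT for frequency resolution and high bands from a short FFT for time resolution.

#include <vector>

// Hypothetical transform, as above: N samples in, N/2 magnitudes out.
std::vector<float> fftMagnitudes(const std::vector<float>& frame);

// Build one display column of numBands values from the most recent audio.
// Assumes recentAudio holds at least 8192 samples.
std::vector<float> blendedColumn(const std::vector<float>& recentAudio,
                                 int numBands)
{
    // Placeholder lengths: 512 for the highs, 8192 for the lows.
    std::vector<float> shortMag = fftMagnitudes(
        std::vector<float>(recentAudio.end() - 512, recentAudio.end()));
    std::vector<float> longMag = fftMagnitudes(
        std::vector<float>(recentAudio.end() - 8192, recentAudio.end()));

    std::vector<float> column(numBands);
    for (int b = 0; b < numBands; ++b)
    {
        // Placeholder split: lowest quarter of the bands from the long FFT.
        const std::vector<float>& mags = (b < numBands / 4) ? longMag : shortMag;
        // Same frequency fraction either way; the longer FFT has finer bins.
        size_t bin = (size_t)((double)b / numBands * mags.size());
        column[b] = mags[bin];
    }
    return column;
}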
I am developing a feature tracking application and so far, after trying almost all the feature detectors/descriptors, I've got the most satisfactory overall results with ORB.
Both my feature detector and descriptor are ORB.
I select a specific area for detecting features on my source image (by masking) and then match it with features detected on subsequent frames.
Then I filter my matches by performing a ratio test on the 'matches' obtained from the following code:
std::vector<std::vector<DMatch>> matches1;
m_matcher.knnMatch(m_descriptorsSrcScene, m_descriptorsCurScene, matches1, 2);
I also tried the two-way ratio test (filtering matches from the source to the current scene and vice versa, then keeping the common matches), but it didn't help much, so I went ahead with the one-way ratio test.
I also add a minimum-distance check to my ratio test, which, it appears, gives better results:
// Inside a loop over matches1 (index i), with min_dist precomputed:
const DMatch& bestMatch = matches1[i][0];
float distanceRatio = bestMatch.distance / matches1[i][1].distance;
if (distanceRatio < m_fThreshRatio && bestMatch.distance < 5 * min_dist)
{
    refinedMatches.push_back(bestMatch);
}
And in the end, I estimate the homography:
Mat H = findHomography(points1,points2);
I've tried using the RANSAC method for estimating inliers and then using those to recalculate my homography, but that gives more instability and consumes more time.
Then, in the end, I draw a rectangle around the specific region to be tracked. I get the plane coordinates with:
perspectiveTransform( obj_corners, scene_corners, H);
where 'obj_corners' are the coordinates of my masked (or unmasked) region.
The rectangle I draw using 'scene_corners' seems to be vibrating. Increasing the number of features has reduced it quite a bit, but I can't increase them too much because of the time constraint.
How can I improve the stability?
Any suggestions would be appreciated.
Thanks.
If it is the vibrations that are really bothersome to you then you could try taking the moving average of the homography matrices over time:
cv::Mat homoG = cv::findHomography(obj, scene, CV_RANSAC);
if (homography.empty()) {
    homoG.copyTo(homography);  // first frame: seed the running average
}
cv::accumulateWeighted(homoG, homography, 0.1);  // exponential moving average
Make the 'homography' variable global, and keep calling this every time you get a new frame.
The alpha parameter of accumulateWeighted is roughly the reciprocal of the period of the moving average.
So 0.1 averages over roughly the last 10 frames, 0.2 over roughly the last 5, and so on...
A suggestion that comes to mind from experience with feature detection/matching is that sometimes you just have to accept that the matched feature points will not work perfectly. Even subtle changes in the scene you are looking at, for example changes in light or unwanted objects coming into view, can cause somewhat annoying problems.
It appears to me that you have decently working feature matching in place. You may want to work on keeping the region of interest constant: if you know the typical speed or other movement patterns of the object you are tracking between frames, or any constraints on the position of your camera, that may help you avoid recalculating the region of interest unnecessarily, which causes vibrations. It may also enable a more efficient search, allowing you to increase the number of feature points you can detect and use.
Another (small) hack you can use is to avoid redrawing the region window if the previous window was of similar size and position.