accuracy of dense optical flow - image-processing

Currently I am learning dense optical flow by myself. To understand it, I conduct one experiment. I produce one image using Matlab. One box with a given grays value is placed under one uniform background and the box is translated two pixels in x and y directions in another image. The two images are input into the implementation of the algorithm called TV-L1. The generated motion vector outer of the box is not zero. Is the reason that the gradient outer of the box is zero? Is the values filled in from the values with large gradient value?
In Horn and Schunck's paper, it reads
In parts of the image where the brightness gradient is zero, the velocity
estimates will simply be averages of the neighboring velocity estimates. There
is no local information to constrain the apparent velocity of motion of the
brightness pattern in these areas.
The progress of this filling-in phenomena is similar to the propagation effects
in the solution of the heat equation for a uniform flat plate, where the time rate of change of temperature is proportional to the Laplacian.
Is it not possible to obtain correct motion vectors for pixels with small gradients? Or the experiment is not practical. In practical applications, this doesn't happen.

Yes, in so called homogenous image regions with very small gradients no information where a motion can dervided from exists. That's why the motion from your rectangle is propagated outer the border. If you give your background a texture this effect will be less dominant. I know such problem when it comes to estimate the ego-motion of a car. Then the streat makes a lot of problems cause of here homogenoutiy.

Two pioneers in this field Lukas&Kanade (LK) and Horn&Schunch (HS) are developed methods for computing Optical Flow (OF). Both rely on brightness constancy assumption which feature location pixel values between two sequence frames not change. This constraint may be expressed as two equations: I(x+dx,y+dy,t+dt)=I(x,y,t) and ∂I/∂x dx+∂I/∂y dy+∂I/∂t dt=0 by using a Taylor series expansion I(x+dx,y+dy,t+dt) , we get (x+dx,y+dy,t+dt)=I(x,y,t)+∂I/∂x dx+∂I/∂y dy+∂I/∂t dt… letting ∂x/∂t=u and ∂y/∂t=v and combining these equations we get the OF constraint equation: ∂I/∂t=∂I/∂t u+∂I/∂t v . The OF equation has more than one solution, so the different techniques diverge here. LK equations are derived assuming that pixels in a neighborhood of each tracked feature move with the same velocity as the feature. In OpenCV, to catch large motions with a small window size (to keep the “same local velocity” assumption).

Related

How to find hinge point or axis of rotation point from top view using image processing?

I have a problem at hand where I need to detect/predict the coordinates of the hinge point or axis of rotation point using image processing. The image is as shown below:
I've used a method where I started with tracking the circular movement (in an arc) of a few feature points in an RoI around the default hinge coordinates (entered manually) in a configuration file. This circular motion of these tracked points happens around the vertical axis which passes through the hinge point. Now, I tracked these points from their initial position until the connecting bar made a particular angle (15°/20°) with the y-axis, I drew secants between these different positions (start and end positions) of the same point and drew its perpendicular bisector, which will ideally pass through the centre of the (concentric) circles, which is the ideal hinge point.
Eg:
y_intercepts calculated for each point
H0 (322, 42)
H1 (322, 64) (within tolerance, closest to GT)
H2 (322, 48)
H_avg (322,52)
H_groundtruth (x,y): (322, 61)
We need an accuracy or tolerance of +/- 3 pixels.
Now, the issues we faced in this ideal scenario to practical working of it is:
Different tracked points give different potential hinge points (different dots on the vertical yellow line), (few of which are very close the ground truth(yellow circle)), but their weighted/average (big green circle) goes off the mark. Quite frankly, this is a problem of too many in which we do get the closest potentially to ground truth, but we’re not sure, which of these points is the closest as we’re not to use the default hitch coordinates (entered manually) from config file.
One solution could be to use frameworks already implemented for image registration such as elastix. If you configure it for a rigid registration, you can get the transformation matrix and therefore the center of the rotation.
The problem here is that only one part of your image is moving. Before doing the registration, I would simply mask the region of interest by calculating a mask from the subtraction of the two images, to keep only the part where something actually moved.
Such approach could get a subpixel accuracy. You could also repeat it for multiple angles and average the result. Alternatively to the averaging, you could use the RANSAC algorithm to know which hinge points are off (outliers) and exclude them.
Here is an example how to do a simple rigid transformation with elastix.
I hope this helps!
I intended this as only a comment, but it ended up significantly over the character limit:
The problem from an accuracy perspective (sorry, couldn't resist) seems to be that you're trying to use a planar euclidean geometry technique to solve a projective geometry problem.
Those feature tracks are only circular arcs in 3D world space. They're actually (noisy) elliptical arcs in 2D image pixel space due to the projection.
Your hinge rotation axis isn't a single pixel either, unless your camera's optical axis is directly aligned with the hinge axis. If that's not the case (as the perspective in the photo you added suggests), then your hinge axis is actually a line in pixel space, not a point, and different heights for the different tracks in model space will be 'centered' around different pixels on that line. So asking for +/- 3 pixel hinge 'point' accuracy is unclear, and so is measuring angles in pixel space in general in a way that doesn't account for perspective.
I only mention these details because you seem focused on measuring accurately. Often, those kinds of 2D approximations are fine for many applications, but high accuracy and precision from a single camera (if that's really what you need) requires better 3D scene understanding. (Or you could train a deep network with a bunch of labeled ground truth images and let it figure out the mappings.)
Now maybe you don't need such high accuracy for your application after all. In that case, simple affine geometry techniques like that mentioned in the other answer might work well enough.

Training Algorithm to train this data

I am working in MATLAB
PLots
NOTE : Here, the data plotted is the track of x - position of the pixel at position (i,j) of the FIRST frame throughout all the frames. It means that the pixel at (23,87) in the first frame has, at the end of the sequence, x-position as 35 (as visible in the plot).
Here is some typical plots of x_pos for some different values of (i,j) . (i,j) refers to a pixel at (i,j) in the first frame not throughout all frames
For (i,j) = (23 ,87)
(i,j) = (42 ,56)
(i,j) = (67 ,19)
So it's not about pixels in the image, but more about moving object, which makes the task much more tractable. Your data is indeed time series, thus time-aware algorithms are preferable. Markov models (in particular Markov chains and a bit more sophisticated Hidden Markov models) are classic examples of them.
However, your input is noisy because of camera instability. Thus, even better solution would be to use Kalman filter - model similar to HMMs, but with explicit notion of noise. It is widely used in robotics, navigation and similar areas to estimate current and predict future position of a vehicle based on inexact sensor data and historical information. Doesn't it sound similar to what you need?
I'm not big fun of Matlab, but it seems to have kalman function that implements mentioned filter.
A video is like a sequence of photos of real objects.
And real object, in front of a camera, can do only 2 different things:
they stand still
they move
If the pixel you are trying to predict are from a video, then you need to look ad how pixel are moving on screen, because object are moving on screen.
And this is how video codec compression works (H264, H265...) (clearly video compression algorithm are much more complex that just try to understand the direction of a pixel... :-) )
Here is some question/answer on stackoverflow that may help you:
Motion vectors calculation
Kalman filter in computer vision: the choice of Q and R noise covariances
How to do motion tracking of an object using video?
Vehicle segmentation and tracking

Algorithm for selecting outer points on a graph ("rich" convex hull)

I'm looking for an efficient way of selecting a relatively large portion of points (2D Euclidian graph) that are the furthest away from the center. This resembles the convex hull, but would include (many) more points. Further criteria:
The number of points in the selection / set ("K") must be within a specified range. Most likely it won't be very narrow, but it most work for different ranges (eg. 0.01*N < K < 0.05*N as well as 0.1*N < K < 0.2*N).
The algorithm must be able to balance distance from the center and "local density". If there are dense areas near the upper part of the graph range, but sparse areas near the lower part, then the algorithm must make sure to select some points from the lower part even if they are closer to the center than the points in the upper region. (See example below)
Bonus: rather than simple distance from center, taking into account distance to a specific point (or both a point and the center) would be perfect.
My attempts so far have focused on using "pigeon holing" (divide graph into CxR boxes, assign points to boxes based on coordinates) and selecting "outer" boxes until we have sufficient points in the set. However, I haven't been successful at balancing the selection (dense regions over-selected because of fixed box size) nor at using a selected point as reference instead of (only) the center.
I've (poorly) drawn an Example: The red dots are the points, the green shape is an example of what I want (outside the green = selected). For sparse regions, the bounding shape comes closer to the center to find suitable points (but doesn't necessarily find any, if they're too close to the center). The yellow box is an example of what my Pigeon Holing based algorithms does. Even when trying to adjust for sparser regions, it doesn't manage well.
Any and all ideas are welcome!
I don't think there are any standard algorithms that will give you what you want. You're going to have to get creative. Assuming your points are embedded in 2D Euclidean space here are some ideas:
Iteratively compute several convex hulls. For example, compute the convex hull, keep the points that are part of the convex hull, then compute another convex hull ignoring the points from the original convex hull. Continue to do this until you have a sufficient number of points, essentially plucking off points on the perimeter for each iteration. The only problem with this approach is that it will not work well for concavities in your data set (e.g., the one on the bottom of your sample you posted).
Fit a Gaussian to your data and keep everything > N standard
deviations away from the mean (where N is a value that you'd have to
choose). This should work pretty well if your data is Gaussian. If
it isn't, you could always model it with several Gaussians (instead
of one), and keep points with a joint probability less than some threshold. Using multiple Gaussians will probably handle concavities decently. References:
http://en.wikipedia.org/wiki/Gaussian_function
How to fit a gaussian to data in matlab/octave?\
Use Kernel Density Estimation - If you create a kernel density
surface, you could slice the surface at some height (e.g., turning
it into a plateau), giving you a perimeter shape (the shape of the
plateau) around the points. The trick would be to slice it at the
right location though, because you could end up getting no points
outside of the shape, but with the right selection you could easily
get the green shape you drew. This approach will work well and give you the green shape in your example if you choose the slice point wisely (which may be difficult to do). The big drawback of this approach is that it is very computationally expensive. More information:
http://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation
Use alpha shapes to get a general shape the wraps tightly around
the outside perimeter of the point set. Then erode the shape a
little to force some points outside of the shape. I don't have a lot of experience with alpha shapes, but this approach will also be quite computationally expensive. More info:
http://doc.cgal.org/latest/Alpha_shapes_2/index.html

Vehicle segmentation and tracking

I've been working on a project for some time, to detect and track (moving) vehicles in video captured from UAV's, currently I am using an SVM trained on bag-of-feature representations of local features extracted from vehicle and background images. I am then using a sliding window detection approach to try and localise vehicles in the images, which I would then like to track. The problem is that this approach is far to slow and my detector isn't as reliable as I would like so I'm getting quite a few false positives.
So I have been considering attempting to segment the cars from the background to find the approximate position so to reduce the search space before applying my classifier, but I am not sure how to go about this, and was hoping someone could help?
Additionally, I have been reading about motion segmentation with layers, using optical flow to segment the frame by flow model, does anyone have any experience with this method, if so could you offer some input to as whether you think this method would be applicable for my problem.
Below is two frames from a sample video
frame 0:
frame 5:
Assumimg your cars are moving, you could try to estimate the ground plane (road).
You may get a descent ground plane estimate by extracting features (SURF rather than SIFT, for speed), matching them over frame pairs, and solving for a homography using RANSAC, since plane in 3d moves according to a homography between two camera frames.
Once you have your ground plane you can identify the cars by looking at clusters of pixels that don't move according to the estimated homography.
A more sophisticated approach would be to do Structure from Motion on the terrain. This only presupposes that it is rigid, and not that it it planar.
Update
I was wondering if you could expand on how you would go about looking for clusters of pixels that don't move according to the estimated homography?
Sure. Say I and K are two video frames and H is the homography mapping features in I to features in K. First you warp I onto K according to H, i.e. you compute the warped image Iw as Iw( [x y]' )=I( inv(H)[x y]' ) (roughly Matlab notation). Then you look at the squared or absolute difference image Diff=(Iw-K)*(Iw-K). Image content that moves according to the homography H should give small differences (assuming constant illumination and exposure between the images). Image content that violates H such as moving cars should stand out.
For clustering high-error pixel groups in Diff I would start with simple thresholding ("every pixel difference in Diff larger than X is relevant", maybe using an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and clustered with connected components. This may be too simplistic, but its easy to implement for a first try, and it should be fast. For something more fancy look at Clustering in Wikipedia. A 2D Gaussian Mixture Model may be interesting; when you initialize it with the detection result from the previous frame it should be pretty fast.
I did a little experiment with the two frames you provided, and I have to say I am somewhat surprised myself how well it works. :-) Left image: Difference (color coded) between the two frames you posted. Right image: Difference between the frames after matching them with a homography. The remaining differences clearly are the moving cars, and they are sufficiently strong for simple thresholding.
Thinking of the approach you currently use, it may be intersting combining it with my proposal:
You could try to learn and classify the cars in the difference image D instead of the original image. This would amount to learning what a car motion pattern looks like rather than what a car looks like, which could be more reliable.
You could get rid of the expensive window search and run the classifier only on regions of D with sufficiently high value.
Some additional remarks:
In theory, the cars should even stand out if they are not moving since they are not flat, but given your distance to the scene and camera resolution this effect may be too subtle.
You can replace the feature extraction / matching part of my proposal with Optical Flow, if you like. This amounts to identifying flow vectors that "stick out" from a consistent frame-to-frame motion of the ground. It may be prone to outliers in the optical flow, however. You can also try to get the homography from the flow vectors.
This is important: Regardless of which method you use, once you have found cars in one frame you should use this information to robustify your search of these cars in consecutive frame, giving a higher likelyhood to detections close to the old ones (Kalman filter, etc). That's what tracking is all about!
If the number of cars in your field of view always remain the same but move around then you can use optical flow...it will give you good results against a still background...if the number of cars are changing then you need to call goodFeaturestoTrack function in OpenCV after certain number of frames and again track the cars using optical flow.
You can use background modelling to model the background and hence the cars are always your foreground.The simplest example is frame differentiation...subtract the previous frame current frame. diff(x,y,k) = I(x,y,k) - I(x,y,k-1) .As your cars are moving in each frame you will get their position..
Both the process will work fine since you have a still background I presume..check this link to find what Optical flow can do.

Remove high frequency vertical shear noise from image

I have a some scanned images, where the scanner appears to have introduced a certain kind of noise that I've not encountered before. I would like to find a way to remove it automatically. The noise looks like high frequency vertical shear. In other words, a horizontal line that should look like ------------ shows up as /\/\/\/\/\/\/\/\/\, where the amplitude and frequency of the shear seem pretty regular.
Can someone suggest a way of doing the following steps?
Given an image, identify the frequency and amplitude of the shear noise. One can assume that it is always vertical and the characteristic frequency is higher than other frequencies that naturally appear in the image.
Given the above parameters, apply an opposite, vertical, periodic shear to the image to cancel this noise.
It would also be helpful to know how these could be implemented using the tools implemented by a freely available image processing package. (Netpbm, ImageMagick, Gimp, some Python library are some examples.)
Update: Here's a sample from an image with this kind of distortion. Actually, this sample shows that the shear amplitude need not be uniform throughout the image. :-(
The original images are higher resolution (600 dpi).
My solution to the problem would be to convert the image to frequency domain using FFT. The result will be two matrices: the image signal amplitude and the image signal phase. These two matrices should have the same dimensions of the input image.
Now, you should use the amplitude matrix to detect a spike in the area tha corresponds to the noise frequency. Note that the top left of this corner of this matrix should correspond to low frequency components and bottom right to high frequencies.
After you have indentified the spike, you should set the corresponding coefficients (amplitude matrix entries) to zero. After you apply the inverse FFT you should get the input image without the noise.
Please provide an example image for a more concrete (a practical) solution to your problem.
You could use a Hough fit or RANSAC to fit lines first. For Hough to work you may need to "smear" the points using Gaussian blur or morphological dilation so that you get more hits for a given (rho, theta) line in parameter space.
Once you have line fits, you can determine the relative distance of the original points to each line. From that spatial information you can use FFT to find help find a "best fit" spatial frequency and then shift pixels up/down accordingly.
As a first take, you might even skip FFT and use more of a brute force method:
Find the best fit lines using Hough or RANSAC.
Determine the orientation of the lines.
Sampling perpendicular to the (nominally) horizontal lines, find the points along that column with respect to the closest best fit lines.
If the points along one sample are on average a distance +N away from their best fit lines, shift all the pixels in that column (or along that perpendicular sample) by -N.
This sort of technique should work if the shear is consistent along a vertical sample, but not necessarily from left to right. If the shear is always exactly vertical, then finding horizontal lines should be relatively easy.
Judging from your sample image, it looks as though the shear may be consistent across a horizontal line segment between a 3-way or 4-way intersection with a nominally vertical line segment. You could use corner detectors or other methods to find these intersections to limit the extent over which a pixel shifting operation takes place.
A technique I posted here is another way to find horizontal stretches of dark pixels in case they don't fall on a line:
Is there an efficient algorithm for segmentation of handwritten text?
All that aside, is there a chance you could have the scanner fixed?

Resources