I need to get the u and v flow components so that I can compute an obstacle-avoidance strategy for blind people.
I will divide the frame into two halves and sum the flow components (u + v) in each. The avoidance strategy is that the blind person moves away from the half with the higher flow value.
The function calcOpticalFlowPyrLK in OpenCV returns the positions of the points in the new frame; however, I need the u and v components.
How can that be achieved? Also, is there a better avoidance strategy than this one that uses only an RGB camera?
To get the u and v components, I suggest simply subtracting the point coordinates before and after the motion. You can try to speed this up by, for example, putting all the points into two 2-channel matrices and subtracting one from the other.
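A minimal sketch of that subtraction combined with the two-halves idea from the question, in plain Python. The function names are hypothetical, the point lists are assumed to be already filtered by the status output of calcOpticalFlowPyrLK, and summing |u| + |v| instead of the signed u + v is an assumption made here so that opposite motions do not cancel out.

```python
def flow_sums_by_half(prev_pts, next_pts, frame_width):
    """Sum the flow amount |u| + |v| of tracked points per frame half.

    prev_pts / next_pts: matching lists of (x, y) positions in the old
    and new frame. Returns (left_sum, right_sum).
    """
    left = right = 0.0
    for (x0, y0), (x1, y1) in zip(prev_pts, next_pts):
        u = x1 - x0  # horizontal flow component
        v = y1 - y0  # vertical flow component
        amount = abs(u) + abs(v)
        if x0 < frame_width / 2:
            left += amount
        else:
            right += amount
    return left, right


def steer_away(left_sum, right_sum):
    """Suggest moving away from the half with the larger flow."""
    if left_sum > right_sum:
        return "right"
    if right_sum > left_sum:
        return "left"
    return "straight"


prev_pts = [(10, 10), (90, 10)]
next_pts = [(12, 11), (95, 10)]
left_sum, right_sum = flow_sums_by_half(prev_pts, next_pts, frame_width=100)
direction = steer_away(left_sum, right_sum)
```

Normalising each sum by the number of points in that half may be worth trying, since a half with more trackable texture would otherwise dominate.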
As for a better approach, here is an article I based my master's thesis on. It describes a trick that uses the amount of optical flow for obstacle detection.
I have a problem at hand where I need to detect/predict the coordinates of the hinge point or axis of rotation point using image processing. The image is as shown below:
I started by tracking the circular (arc) movement of a few feature points in an RoI around the default hinge coordinates, which are entered manually in a configuration file. This circular motion happens around the vertical axis that passes through the hinge point. I tracked these points from their initial position until the connecting bar made a particular angle (15°/20°) with the y-axis. I then drew a secant between the start and end positions of each point and constructed its perpendicular bisector, which should ideally pass through the centre of the (concentric) circles, i.e. the hinge point.
Eg:
y_intercepts calculated for each point
H0 (322, 42)
H1 (322, 64) (within tolerance, closest to GT)
H2 (322, 48)
H_avg (322,52)
H_groundtruth (x,y): (322, 61)
We need an accuracy or tolerance of +/- 3 pixels.
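For reference, the secant and perpendicular-bisector construction described above can be sketched as follows. The function name and the sample coordinates are hypothetical; the sketch assumes the hinge lies on a known vertical line x = axis_x and ignores perspective effects.

```python
def hinge_y_on_axis(start, end, axis_x):
    """Intersect the perpendicular bisector of the chord start->end
    with the vertical line x = axis_x and return the y coordinate.

    Returns None when the chord is horizontal, because its bisector is
    then vertical and never crosses the axis at a single point.
    """
    (x1, y1), (x2, y2) = start, end
    dx, dy = x2 - x1, y2 - y1
    if abs(dy) < 1e-12:
        return None
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # chord midpoint
    # Points p on the bisector satisfy (p - m) . (dx, dy) = 0.
    return my - (axis_x - mx) * dx / dy


# Two positions of one point on a circle of radius 10 around (322, 61).
y_hinge = hinge_y_on_axis((332, 61), (328, 69), axis_x=322)
```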
Now, the issue we face in going from this ideal scenario to practice is:
Different tracked points give different potential hinge points (the different dots on the vertical yellow line). A few of them are very close to the ground truth (yellow circle), but their weighted average (big green circle) is off the mark. Quite frankly, this is a problem of too many candidates: one of them is close to the ground truth, but we cannot tell which, since we are not supposed to use the default hitch coordinates (entered manually) from the config file.
One solution could be to use frameworks already implemented for image registration such as elastix. If you configure it for a rigid registration, you can get the transformation matrix and therefore the center of the rotation.
The problem here is that only one part of your image is moving. Before doing the registration, I would simply mask the region of interest by calculating a mask from the subtraction of the two images, to keep only the part where something actually moved.
Such an approach could reach subpixel accuracy. You could also repeat it for multiple angles and average the results. As an alternative to averaging, you could use the RANSAC algorithm to find out which hinge points are off (outliers) and exclude them.
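A minimal sketch of the outlier-rejection idea for one-dimensional hinge candidates (the y-intercepts on the vertical axis). It is a deterministic, exhaustive variant of RANSAC that tries every candidate as the hypothesis instead of sampling at random. The function name, the tolerance and the sample values are hypothetical.

```python
def consensus_hinge(candidates, tolerance=3.0):
    """Return (estimate, inliers) for a list of candidate y values.

    Each candidate is tried as a hypothesis; the hypothesis supported
    by the most candidates within `tolerance` wins, and the estimate
    is the mean of its inliers.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    best_inliers = []
    for hypothesis in candidates:
        inliers = [c for c in candidates if abs(c - hypothesis) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return sum(best_inliers) / len(best_inliers), best_inliers


# Made-up candidates: four agree with each other, two are outliers.
estimate, inliers = consensus_hinge([42, 60, 62, 61, 48, 63], tolerance=3.0)
```

This only helps when several candidates agree with each other. With as few as three widely spread candidates there is no consensus to find, so more tracks or more angles would be needed first.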
Here is an example of how to do a simple rigid transformation with elastix.
I hope this helps!
I intended this as only a comment, but it ended up significantly over the character limit:
The problem from an accuracy perspective (sorry, couldn't resist) seems to be that you're trying to use a planar Euclidean geometry technique to solve a projective geometry problem.
Those feature tracks are only circular arcs in 3D world space. They're actually (noisy) elliptical arcs in 2D image pixel space due to the projection.
Your hinge rotation axis isn't a single pixel either, unless your camera's optical axis is directly aligned with the hinge axis. If that's not the case (as the perspective in the photo you added suggests), then your hinge axis is actually a line in pixel space, not a point, and different heights for the different tracks in model space will be 'centered' around different pixels on that line. So asking for +/- 3 pixel hinge 'point' accuracy is unclear, and so is measuring angles in pixel space in general in a way that doesn't account for perspective.
I only mention these details because you seem focused on measuring accurately. Often, those kinds of 2D approximations are fine for many applications, but high accuracy and precision from a single camera (if that's really what you need) requires better 3D scene understanding. (Or you could train a deep network with a bunch of labeled ground truth images and let it figure out the mappings.)
Now maybe you don't need such high accuracy for your application after all. In that case, simple affine geometry techniques like that mentioned in the other answer might work well enough.
I am currently teaching myself dense optical flow. To understand it, I ran an experiment. I generated an image in Matlab: a box with a given gray value is placed on a uniform background, and in a second image the box is translated by two pixels in both the x and y directions. The two images are fed into an implementation of the TV-L1 algorithm. The resulting motion vectors outside the box are not zero. Is this because the gradient outside the box is zero? Are those values filled in from pixels with large gradients?
Horn and Schunck's paper reads:
In parts of the image where the brightness gradient is zero, the velocity
estimates will simply be averages of the neighboring velocity estimates. There
is no local information to constrain the apparent velocity of motion of the
brightness pattern in these areas.
The progress of this filling-in phenomena is similar to the propagation effects
in the solution of the heat equation for a uniform flat plate, where the time rate of change of temperature is proportional to the Laplacian.
Is it impossible to obtain correct motion vectors for pixels with small gradients? Or is the experiment just unrealistic, so that this does not happen in practical applications?
Yes. In so-called homogeneous image regions with very small gradients, there is no information from which motion can be derived. That is why the motion of your rectangle is propagated beyond its border. If you give your background a texture, this effect will be less dominant. I know this problem from estimating the ego-motion of a car, where the street causes a lot of trouble because of its homogeneity.
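To illustrate the filling-in effect, here is a toy one-dimensional sketch. It is not TV-L1 or the full Horn and Schunck scheme. It only mimics the smoothness term: pixels with usable gradient keep their flow value, and every other pixel is repeatedly replaced by the average of its neighbours. All names and numbers are made up for the illustration.

```python
def fill_in_flow(values, constrained, iterations=500):
    """Propagate flow into unconstrained pixels by neighbour averaging.

    values:      initial flow per pixel (1D list)
    constrained: True where the image gradient pins the flow value
    The border pixels reuse their own value as the missing neighbour.
    """
    flow = list(values)
    n = len(flow)
    for _ in range(iterations):
        new = flow[:]
        for i in range(n):
            if constrained[i]:
                continue  # the data term fixes this pixel
            left = flow[max(i - 1, 0)]
            right = flow[min(i + 1, n - 1)]
            new[i] = 0.5 * (left + right)
        flow = new
    return flow


# One textured pixel moving by 2 px, surrounded by a flat background.
initial = [0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]
pinned = [False, False, False, False, True, False, False, False, False]
result = fill_in_flow(initial, pinned)
```

With nothing else constraining the background, every pixel ends up with the box's motion. A textured background would add more pinned pixels with zero flow and stop the spread.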
Two pioneers in this field, Lucas & Kanade (LK) and Horn & Schunck (HS), developed methods for computing optical flow (OF). Both rely on the brightness constancy assumption: the pixel values at a feature's location do not change between two consecutive frames. This constraint can be written as I(x+dx, y+dy, t+dt) = I(x, y, t). Expanding the left-hand side in a Taylor series gives I(x+dx, y+dy, t+dt) = I(x, y, t) + ∂I/∂x dx + ∂I/∂y dy + ∂I/∂t dt + …, and combining the two (dropping the higher-order terms) yields ∂I/∂x dx + ∂I/∂y dy + ∂I/∂t dt = 0. Dividing by dt and letting dx/dt = u and dy/dt = v, we get the OF constraint equation: ∂I/∂x u + ∂I/∂y v = −∂I/∂t. This is one equation in two unknowns, so it has more than one solution, and the different techniques diverge here. The LK equations are derived assuming that the pixels in a neighborhood of each tracked feature move with the same velocity as the feature. In OpenCV, image pyramids are used to catch large motions with a small window size (to keep the “same local velocity” assumption).
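A minimal sketch of the LK idea for a single window, assuming the derivatives Ix = ∂I/∂x, Iy = ∂I/∂y and It = ∂I/∂t have already been computed for every pixel in the window. Each pixel contributes one equation Ix u + Iy v = −It, and the shared (u, v) is found by least squares. The function name and the sample numbers are hypothetical.

```python
def lucas_kanade_window(ix, iy, it, eps=1e-9):
    """Least-squares flow (u, v) for one window.

    ix, iy, it: lists with the spatial and temporal derivatives of
    each pixel in the window. Returns None when the 2x2 normal matrix
    is singular, e.g. in a flat region where all gradients vanish.
    """
    sxx = sum(a * a for a in ix)
    syy = sum(b * b for b in iy)
    sxy = sum(a * b for a, b in zip(ix, iy))
    sxt = sum(a * c for a, c in zip(ix, it))
    syt = sum(b * c for b, c in zip(iy, it))
    det = sxx * syy - sxy * sxy
    if abs(det) < eps:
        return None  # no unique solution from this window
    # Solve [[sxx, sxy], [sxy, syy]] [u, v]^T = [-sxt, -syt]^T.
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


# Synthetic window whose pixels all move with (u, v) = (1.5, -0.5).
ix = [1.0, 0.0, 2.0, 1.0]
iy = [0.0, 1.0, 1.0, 3.0]
it = [-(a * 1.5 + b * -0.5) for a, b in zip(ix, iy)]
flow = lucas_kanade_window(ix, iy, it)
flat = lucas_kanade_window([0.0] * 4, [0.0] * 4, [0.0] * 4)
```

The None case is the homogeneous-region situation discussed above: without gradients in the window, the constraint gives no information about the motion.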
I am using the standard OpenCV functions to calibrate a camera for its intrinsic parameters. I know that, in order to obtain good results, we have to use images of the chessboard taken from different angles (i.e. lying in different planes in 3D). This is stated in all the documentation
and papers, but I really don't understand why it is so important to consider different planes, or whether there is an optimal number of planes to use for the best calibration results.
I would be glad if you could point me to a paper or documentation that explains this. (I think Zhang's paper talks about it, but it's mathematically intensive and was hard to digest.)
Thanks
Mathematically, a unique solution for the intrinsic parameters (up to scale) is defined only if you have 3 or more distinct images of the planar target. See page 6 of Zhang's paper: "If n images of the model plane are observed, by stacking n such equations as (8) we have Vb = 0 ; (9) where V is a 2n×6 matrix. If n ≥ 3, we will have in general a unique solution b defined up to a scale factor..."
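The counting behind that statement can be sketched in a few lines. Each view of the plane contributes two equations, and b has six entries defined up to scale, i.e. five degrees of freedom. The helper name below is hypothetical, and the count assumes every view gives independent constraints (planes that are merely translated copies of each other do not).

```python
import math


def min_views(unknown_intrinsics):
    """Smallest number of plane views whose 2 constraints per view
    cover the given number of unknown intrinsic parameters."""
    return math.ceil(unknown_intrinsics / 2)


full_model = min_views(5)   # two focal lengths, skew, principal point
zero_skew = min_views(4)    # skew assumed to be zero
focal_only = min_views(2)   # skew zero and principal point assumed known
```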
There isn't an "optimal" number of planes. Where data are concerned, the more the merrier. But as the solution starts to converge, the marginal gain in calibration accuracy from adding an extra image becomes negligible. Of course, this assumes that the images show planes well separated in both pose and location.
See also this other answer of mine for practical tips.
If you're looking for a little intuition, here's an example of why one plane isn't enough. Imagine your calibration chessboard is tilting away from you at a 45° angle:
You can see that when you move up the chessboard by 1 meter in the +y direction, you also move away from the camera by 1 meter in the +z direction. This means there's no way to separate the effect of moving in the y direction vs the z direction. The y and z movement directions are effectively tied to each other, for all our training points. So, if we just look at points on this one plane, there's no way to tease apart the effects of y movement vs z movement.
For example, from this 1 plane, we can't tell the difference between these scenarios:
1. The camera has perspective distortion such that things appear smaller in the image as they move in the world's +y direction.
2. The camera focal length is such that things appear smaller in the image as they move in the world's +z direction.
3. Any mixture of the effects in #1 and #2.
Mathematically, this ambiguity means that there are many equally possible solutions when OpenCV tries to fit a camera matrix to match the data. (Note that the 45° angle was not important. Any plane you choose will have the same problem: training examples' (x,y,z) dimensions are coupled together, so you can't separate their effects.)
One last note: if you make enough assumptions about the camera matrix (e.g. no perspective distortion, x and y scale identically, etc) then you can end up with a situation with fewer unknowns (in an extreme case, maybe you're just calculating the focal length) and in that case you could calibrate with just 1 plane.
I have a few doubts about how to approach my goal. I have an outdoor camera that records people, and I want to draw an ellipse on every person.
Right now I extract the feature points of the people from the frame (using a mask so that I only get feature points on the people), set up an EM algorithm and train it with my samples (the extracted feature points). The number of clusters is twice the number of people in the image (I obtain that number before starting the EM algorithm, using other methods such as pixel counting with a codebook).
My question is
(a) Do I have to just train it only for the first frame and then use predict in the following frames? or,
(b) use train with the feature points in every frame?
Right now I am doing option (b) (I don't use predict), because I don't really know how to use predict.
If I do (a), can you help me with it and, after that, with how to draw the ellipses? If I do (b), can you help me draw one ellipse for every person? Right now I get several ellipses for the same person from the cov, mean, etc. (one for the arm, for example).
What I want to achieve is this paper using the Gaussian model: Link
If you drew bounding boxes rather than ellipses, you could use the function groupRectangles to merge the different bounding boxes.
But, more importantly, for people detection you can simply use OpenCV's person detector (based on HOG) or the latent SVM detector with the person model.
You should do (b) anyway; otherwise you would be trying to match the keypoints to the clusters (persons) from the first frame, and after a few seconds those would no longer be relevant.
It seems reasonable to assume that the change from frame to frame is not going to be overwhelming, so reusing the results of the training on frame N-1 is a good seed for training on frame N, and is likely to converge faster than running EM from scratch on each frame.
To draw the ellipses, you can build on the Gaussian mixture example in the Python bindings:
https://github.com/opencv/opencv/blob/master/samples/python/gaussian_mix.py
Note that if you use a diagonal covariance matrix, your ellipses are going to be "straight", with their axes aligned with the X and Y axes of the frame, so you can skip the calculation of the ellipse angle.
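For a full 2x2 covariance matrix, the ellipse parameters follow from its eigen-decomposition. Below is a minimal closed-form sketch without OpenCV. The function name and the choice of drawing n_std standard deviations are assumptions. The returned semi-axes and angle in degrees are in the form cv2.ellipse expects for its axes and angle arguments (after rounding the axes to integers).

```python
import math


def covariance_to_ellipse(cov, n_std=2.0):
    """Return (semi_major, semi_minor, angle_deg) for a 2x2 covariance.

    cov: [[var_x, cov_xy], [cov_xy, var_y]]
    angle_deg is the direction of the major axis measured from the
    x axis, in image coordinates.
    """
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    half_trace = (a + c) / 2.0
    root = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)  # guard against round-off
    angle_deg = math.degrees(0.5 * math.atan2(2.0 * b, a - c))
    return n_std * math.sqrt(lam_major), n_std * math.sqrt(lam_minor), angle_deg


wide = covariance_to_ellipse([[4.0, 0.0], [0.0, 1.0]], n_std=1.0)
tall = covariance_to_ellipse([[1.0, 0.0], [0.0, 4.0]], n_std=1.0)
```

To get one ellipse per person rather than one per cluster, one option is to compute the mean and covariance of all feature points assigned to the clusters of that person and feed those into the same function.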
I've been working for some time on a project to detect and track (moving) vehicles in video captured from UAVs. Currently I am using an SVM trained on bag-of-features representations of local features extracted from vehicle and background images. I then use a sliding-window detection approach to try to localise vehicles in the images, which I would then like to track. The problem is that this approach is far too slow, and my detector isn't as reliable as I would like, so I'm getting quite a few false positives.
So I have been considering segmenting the cars from the background to find their approximate positions, in order to reduce the search space before applying my classifier. I am not sure how to go about this and was hoping someone could help.
Additionally, I have been reading about motion segmentation with layers, which uses optical flow to segment the frame by flow model. Does anyone have experience with this method? If so, could you offer some input as to whether it would be applicable to my problem?
Below are two frames from a sample video:
frame 0:
frame 5:
Assuming your cars are moving, you could try to estimate the ground plane (road).
You may get a decent ground plane estimate by extracting features (SURF rather than SIFT, for speed), matching them over frame pairs, and solving for a homography using RANSAC, since a plane in 3D moves according to a homography between two camera frames.
Once you have your ground plane, you can identify the cars by looking for clusters of pixels that don't move according to the estimated homography.
A more sophisticated approach would be to do Structure from Motion on the terrain. This only presupposes that it is rigid, not that it is planar.
Update
I was wondering if you could expand on how you would go about looking for clusters of pixels that don't move according to the estimated homography?
Sure. Say I and K are two video frames and H is the homography mapping features in I to features in K. First you warp I onto K according to H, i.e. you compute the warped image Iw as Iw( [x y]' )=I( inv(H)[x y]' ) (roughly Matlab notation). Then you look at the squared or absolute difference image Diff=(Iw-K)*(Iw-K). Image content that moves according to the homography H should give small differences (assuming constant illumination and exposure between the images). Image content that violates H such as moving cars should stand out.
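The warp-and-difference step can be sketched in plain Python on small list-of-lists grayscale images. This is an illustration only: the function name is hypothetical, the inverse homography is assumed to be given, and nearest-neighbour sampling is used where a real implementation would interpolate.

```python
def warp_and_diff(I, K, H_inv):
    """Squared difference between I warped onto K and K itself.

    I, K:  grayscale images as lists of rows
    H_inv: 3x3 inverse homography mapping pixel (x, y) of K back into I
    Pixels whose source falls outside I get a difference of 0.
    """
    rows, cols = len(K), len(K[0])
    diff = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            sx = H_inv[0][0] * x + H_inv[0][1] * y + H_inv[0][2]
            sy = H_inv[1][0] * x + H_inv[1][1] * y + H_inv[1][2]
            w = H_inv[2][0] * x + H_inv[2][1] * y + H_inv[2][2]
            if abs(w) < 1e-12:
                continue
            col, row = round(sx / w), round(sy / w)  # nearest neighbour
            if 0 <= row < len(I) and 0 <= col < len(I[0]):
                d = I[row][col] - K[y][x]
                diff[y][x] = d * d
    return diff


# The background (value 5) shifts one pixel to the right between the
# frames; the value 9 in K is an object that does not follow that motion.
I = [[0, 0, 0, 0], [0, 5, 0, 0], [0, 0, 0, 0]]
K = [[0, 0, 0, 0], [0, 0, 5, 0], [0, 0, 0, 9]]
H_inv = [[1, 0, -1], [0, 1, 0], [0, 0, 1]]  # inverse of "shift right by 1"
Diff = warp_and_diff(I, K, H_inv)
```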
For clustering high-error pixel groups in Diff, I would start with simple thresholding ("every pixel difference in Diff larger than X is relevant", maybe using an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and clustered with connected components. This may be too simplistic, but it's easy to implement for a first try, and it should be fast. For something fancier, look at Clustering in Wikipedia. A 2D Gaussian Mixture Model may be interesting; when you initialize it with the detection result from the previous frame, it should be pretty fast.
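A minimal sketch of the thresholding and connected-components step, again on a list-of-lists image. The morphological clean-up is left out, the function name is hypothetical, and 4-connectivity is an arbitrary choice.

```python
from collections import deque


def find_blobs(diff, threshold):
    """Group pixels with diff > threshold into 4-connected blobs.

    Returns a list of blobs, each a list of (row, col) pixels.
    """
    rows, cols = len(diff), len(diff[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or diff[r][c] <= threshold:
                continue
            # Breadth-first flood fill starting from this pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and diff[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(pixels)
    return blobs


diff_image = [[9, 9, 0, 0],
              [0, 0, 0, 7],
              [0, 0, 0, 8]]
blobs = find_blobs(diff_image, threshold=5)
```

Discarding blobs below a minimum pixel count is a cheap substitute for the erosion step when only isolated noise pixels need to go.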
I did a little experiment with the two frames you provided, and I have to say I am somewhat surprised myself how well it works. :-) Left image: Difference (color coded) between the two frames you posted. Right image: Difference between the frames after matching them with a homography. The remaining differences clearly are the moving cars, and they are sufficiently strong for simple thresholding.
Thinking of the approach you currently use, it may be interesting to combine it with my proposal:
You could try to learn and classify the cars in the difference image Diff instead of the original image. This would amount to learning what a car motion pattern looks like rather than what a car looks like, which could be more reliable.
You could get rid of the expensive window search and run the classifier only on regions of Diff with sufficiently high values.
Some additional remarks:
In theory, the cars should even stand out if they are not moving since they are not flat, but given your distance to the scene and camera resolution this effect may be too subtle.
You can replace the feature extraction / matching part of my proposal with Optical Flow, if you like. This amounts to identifying flow vectors that "stick out" from a consistent frame-to-frame motion of the ground. It may be prone to outliers in the optical flow, however. You can also try to get the homography from the flow vectors.
This is important: regardless of which method you use, once you have found cars in one frame you should use this information to make your search for these cars in subsequent frames more robust, giving a higher likelihood to detections close to the old ones (Kalman filter, etc.). That's what tracking is all about!
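As a much simplified stand-in for a Kalman filter, the idea of favouring detections near the old ones can be sketched with a constant-velocity prediction and a distance gate. The function name, the track layout and the gate size are all hypothetical, and the greedy matching is only a first approximation of proper data association.

```python
import math


def associate(tracks, detections, gate):
    """Match each track to the nearest unused detection within `gate`.

    tracks:     dict track_id -> (x, y, vx, vy) from the previous frame
    detections: list of (x, y) positions found in the current frame
    Returns dict track_id -> index into detections.
    """
    assignments = {}
    used = set()
    for track_id, (x, y, vx, vy) in tracks.items():
        px, py = x + vx, y + vy  # constant-velocity prediction
        best, best_dist = None, gate
        for i, (dx, dy) in enumerate(detections):
            if i in used:
                continue
            dist = math.hypot(dx - px, dy - py)
            if dist <= best_dist:
                best, best_dist = i, dist
        if best is not None:
            used.add(best)
            assignments[track_id] = best
    return assignments


tracks = {1: (10, 10, 2, 0), 2: (50, 50, 0, -1)}
detections = [(50, 49), (12.5, 10), (200, 200)]
matches = associate(tracks, detections, gate=5.0)
```

Detections left unmatched, such as the third one here, are either new cars or likely false positives, and could be required to persist for a few frames before a track is started.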
If the number of cars in your field of view always stays the same and they just move around, you can use optical flow; it gives good results against a still background. If the number of cars changes, you need to call OpenCV's goodFeaturesToTrack function every certain number of frames and then track the cars again using optical flow.
You can use background modelling to model the background, so that the cars are always your foreground. The simplest example is frame differencing: subtract the previous frame from the current frame, diff(x,y,k) = I(x,y,k) - I(x,y,k-1). As your cars move in each frame, you will get their positions.
Both approaches will work fine, since I presume you have a still background. Check this link to see what optical flow can do.