Finding displacement between two camera frames - opencv

I'm currently working on a visual odometry project. So far I've implemented everything up to the essential matrix decomposition stage, but the resulting translation vector is normalized, so I can't plot the actual movement.
How can I compute the displacement at some real-world scale? I have seen suggestions to use a planar homography to compute the absolute translation, but I don't quite see how, since the outdoor environment is not simply planar. At least, if the ground is treated as planar, how do I obtain the translation from it? I've seen a suggestion here. Is it possible to use this approach to get the displacement between two frames?

What you are referring to is called registration, and it is a vast field. At the two ends of the spectrum there are methods that estimate a single linear transformation across the entire image, and per-pixel methods. Naturally, per-pixel methods are typically far slower and suffer from many local errors.
Typically two consecutive frames differ by very little, and a simple homography will do to find the overall scaling between them, especially if you are talking about aerial photos. If your scene is very far from planar, then you may want something closer to a pixel-wise method, for example spline-based registration: https://www.mathworks.com/matlabcentral/fileexchange/20057-b-spline-grid--image-and-point-based-registration
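A rough sketch of the homography route in OpenCV, assuming ORB features and a RANSAC fit; the frame file names are placeholders:

```python
import cv2
import numpy as np

# Rough sketch: estimate a homography between two frames with ORB + RANSAC.
# Frame file names are placeholders.
img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# RANSAC keeps only matches consistent with a single planar motion.
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
print(H)  # 3x3 homography mapping frame 1 onto frame 2
```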

Generally speaking, you cannot recover scale unless you can recognize one or more objects of known physical size in the scene.
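As a minimal sketch of that idea (my own illustration, with placeholder values): if you have triangulated, up to scale, two 3D points whose true metric separation you happen to know, the same factor fixes the scale of the unit translation.

```python
import numpy as np

# X1, X2: two up-to-scale triangulated 3D points whose true separation
# "known_length_m" is known (e.g. the length of a lane marking);
# t_unit: unit-norm translation from the essential matrix decomposition.
def metric_scale(X1, X2, known_length_m):
    return known_length_m / np.linalg.norm(X1 - X2)

X1 = np.array([0.2, 0.1, 4.0])      # placeholder points
X2 = np.array([0.2, 0.1, 5.5])
t_unit = np.array([0.0, 0.0, 1.0])  # placeholder unit translation

t_metric = metric_scale(X1, X2, 3.0) * t_unit  # translation in metres
```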

Related

OpenCV - Feature Matching vs Optical Flow

I am interested in making a motion tracking app using OpenCV, and there has been a wealth of information available online. However, I am a tad confused between feature matching and tracking features using a sparse optical flow algorithm such as Lucas-Kanade. With that in mind, I have the following questions:
What is the main difference between the two (feature matching and optical flow) if I have specified a region of pixels to track? I'm not interested in tracking in real time, if that helps clear up any assumptions.
In addition, since I'm not doing real time tracking, is it a better idea to use dense optical flow (Farneback) to keep track of the pixels in my specified region of interest?
Thank you.
I would like to add a few thoughts about that theme since I found this a very interesting question too.
As said before, feature matching is a technique based on:
A feature detection step, which returns a set of so-called feature points. These feature points are located at positions with salient image structures, e.g. edge-like structures when you are using FAST or blob-like structures if you are using SIFT or SURF.
The second step is the matching, i.e. the association of feature points extracted from two different images. The matching is based on local visual descriptors, e.g. histograms of gradients or binary patterns, that are extracted around the feature positions. The descriptor is a feature vector, and associated feature point pairs are pairs with minimal feature-vector distances.
Most feature matching methods are scale and rotation invariant and are robust to changes in illumination (e.g. caused by shadows or different contrast). Thus these methods can be applied to image sequences, but are more often used to align image pairs captured from different views or with different devices. The disadvantage of feature matching methods is that you cannot control where the matches appear, and that the feature pairs (which in an image sequence are motion vectors) are in general very sparse. In addition, the subpixel accuracy of matching approaches is very limited, as most detectors are confined to integer positions.
In my experience, the main advantage of feature matching approaches is that they can handle very large motions/displacements.
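To make the two steps concrete, here is a hedged sketch with ORB and a ratio test (SIFT or SURF work the same way if your OpenCV build includes them; the image file names are placeholders):

```python
import cv2

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1500)
kp1, des1 = orb.detectAndCompute(img1, None)   # step 1: feature detection
kp2, des2 = orb.detectAndCompute(img2, None)

bf = cv2.BFMatcher(cv2.NORM_HAMMING)
pairs = bf.knnMatch(des1, des2, k=2)           # step 2: descriptor matching

# Keep a match only if it is clearly better than the second-best candidate.
good = []
for pair in pairs:
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

vis = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
cv2.imwrite("matches.png", vis)
```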
OpenCV offers some feature matching methods, but there are many more recent, faster and more accurate approaches available online, e.g.:
DeepMatching, which relies on deep learning and is often used to initialize optical flow methods to help them deal with long-range motions.
StereoScan, which is a very fast approach originally proposed for visual odometry.
Optical flow methods, in contrast, rely on minimizing the brightness constancy term plus additional constraints, e.g. smoothness. They derive motion vectors from the spatial and temporal image gradients of a sequence of consecutive frames, so they are better suited to image sequences than to image pairs captured from very different viewpoints. The main challenges in estimating motion with optical flow are large motions, occlusion, strong illumination changes, changes in the appearance of objects and, not least, achieving a low runtime. However, optical flow methods can be highly accurate and compute dense motion fields that respect the shared motion boundaries of the objects in a scene.
However, the accuracy of different optical flow methods varies greatly. Local methods such as PLK (pyramidal Lucas-Kanade) are in general less accurate, but they allow you to compute motion vectors only at pre-selected positions and can thus be very fast. (In recent years we have done some research to improve the accuracy of the local approach; see here for further information.)
The main OpenCV trunk offers global approaches such as Farnebäck's method, but this is a rather outdated approach. Try the OpenCV contrib trunk, which contains more recent methods. To get a good overview of the most recent methods, take a look at the public optical flow benchmarks, where you will also find code and implementations, e.g.:
MPI-Sintel optical flow benchmark
KITTI 2012 optical flow benchmark. Both offer links, e.g. to Git repositories or source code, for some newer methods such as FlowFields.
From my point of view, I would not reject a specific approach (matching or optical flow) at an early stage. Try as many of the implementations available online as possible and see which works best for your application.
Feature matching uses the feature descriptors to match features with one another (usually) using a nearest neighbor search in the feature descriptor space. The basic idea is you have descriptor vectors, and the same feature in two images should be near each other in the descriptor space, so you just match that way.
Optical flow algorithms do not look at a descriptor space; instead, they look at pixel patches around features and try to match those patches. If you're familiar with dense optical flow, sparse optical flow just does dense optical flow on small patches of the image around feature points. Thus optical flow assumes brightness constancy, that is, that pixel brightness doesn't change between frames. Also, since you're looking at neighboring pixels, you need to assume that points neighboring your features move similarly to the features themselves. Finally, since it's using a dense flow algorithm on small patches, the points cannot move very far in the image from the original feature location. If they do, then the pyramid-resolution approach is recommended, where you scale down the image first so that what once was a 16-pixel translation becomes a 2-pixel translation, and you then scale back up using the found transformation as your prior.
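A hedged sketch of this pyramidal sparse tracking with OpenCV's calcOpticalFlowPyrLK (frame file names are placeholders):

```python
import cv2

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # placeholder frames
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Corners to track in the first frame.
pts0 = cv2.goodFeaturesToTrack(prev, maxCorners=300, qualityLevel=0.01, minDistance=7)

# maxLevel controls the number of pyramid levels, which is what lets the
# tracker cope with motions much larger than the patch size.
pts1, status, err = cv2.calcOpticalFlowPyrLK(
    prev, curr, pts0, None,
    winSize=(21, 21), maxLevel=3,
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

good_new = pts1[status.flatten() == 1]
good_old = pts0[status.flatten() == 1]
flow_vectors = good_new - good_old  # per-point motion vectors
```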
So feature matching algorithms are, all in all, far better when it comes to using templates where the scale is not exactly the same, or when there's a perspective difference between the image and the template, or when the transformations are large. However, your matches are only as good as your feature detector is exact. With optical flow algorithms, as long as they're looking in the right spot, the estimated transformations can be really precise. Both are somewhat computationally expensive: optical flow algorithms, being iterative, are expensive (and although you'd think the pyramid approach would add cost by running on more images, it can actually make reaching the desired accuracy faster in some cases), and nearest-neighbor searches are also expensive. Optical flow algorithms, on the other hand, work really well when the transformations are small, but anything in your scene that messes with the lighting, or even a few incorrect pixels (say, from minor occlusion), can really throw them off.
Which one to use definitely depends on the project. For a project I worked on with satellite imagery, I used dense optical flow because the images of desert terrain I was working with did not have precise enough features (in location) and different feature descriptors happen to look relatively similar so searching that feature space wasn't giving tons of great matches. In this case, optical flow was the better method. However, if you were doing image alignment on satellite imagery of a city where buildings can occlude parts of the scene, there are a lot of features that will stay matched and give a better result.
The OpenCV Lucas-Kanade tutorial doesn't give a whole lot of insight but should get your code moving in the right direction with the above in mind.
key-point matching = sparse optical flow
KLT tracking is a good example of sparse flow; see the demo LKDemo.cpp (it had a Python wrapper example too, but I can't remember it now).
For a dense example, see samples/python/opt_flow.py, which uses Farnebäck's method.
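A minimal dense-flow sketch in the spirit of samples/python/opt_flow.py, using Farnebäck's method (frame file names are placeholders):

```python
import cv2
import numpy as np

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# flow[y, x] = (dx, dy); visualise magnitude/angle as an HSV image.
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
hsv = np.zeros((prev.shape[0], prev.shape[1], 3), dtype=np.uint8)
hsv[..., 0] = ang * 180 / np.pi / 2
hsv[..., 1] = 255
hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
cv2.imwrite("flow_vis.png", cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))
```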
You are right in being confused... The entire world is confused about this terribly simple topic. A lot of the reason is that people believe Lucas-Kanade to be sparse flow (due to a badly named and commented example in OpenCV, LKdemo, which should be called KLTDemo).

How to do grid-based (dense) optical flow on a masked image?

I am trying to track multiple people using a video camera. I do not want to use blob segmentation techniques.
What I want to do:
Perform background subtraction to obtain a mask isolating the people's motion.
Perform grid-based optical flow on those areas.
What would be my best bet?
I am struggling to implement this. I have tried blob detection and also some sparse optical flow examples, but sparse didn't really do it for me as I wasn't getting enough feature points from goodFeaturesToTrack(). I would like to end up with at least 20 trackable points per person, which is why I think a grid-based method would be better for me. I will use the motion vectors obtained to classify different people (clustering on magnitude and direction, possibly?).
I am using OpenCV 3 with Python 3.5, but am still quite new to this field.
Would appreciate some guidance immensely!
For sparse optical flow (in OpenCV, the pyramidal Lucas-Kanade method) you don't necessarily need goodFeaturesToTrack to get the positions.
The calcOpticalFlowPyrLK function allows you to estimate the motion at predefined positions, and these can be supplied by you.
So just initialize a grid of cv::Point2f yourself, e.g. create a list of points, set their positions to the grid points located inside your blobs, and run calcOpticalFlowPyrLK().
The idea of goodFeaturesToTrack is that it gives you the points where the calcOpticalFlowPyrLK() result is most likely to be accurate, namely image locations with edge-like structures. But in my experience this does not always give the optimal feature point set; I prefer to use regular grids as feature point sets, as in the sketch below.
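A hedged sketch of that grid idea in Python: build a regular grid, keep only the points inside the foreground mask from background subtraction, and feed them to calcOpticalFlowPyrLK (frame and mask file names are placeholders):

```python
import cv2
import numpy as np

def grid_points_in_mask(mask, step=8):
    # Regular grid of (x, y) points, keeping only those on foreground pixels.
    ys, xs = np.mgrid[step // 2:mask.shape[0]:step, step // 2:mask.shape[1]:step]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    keep = mask[pts[:, 1].astype(int), pts[:, 0].astype(int)] > 0
    return pts[keep].reshape(-1, 1, 2)

prev_gray = cv2.imread("prev.png", cv2.IMREAD_GRAYSCALE)   # placeholder inputs
curr_gray = cv2.imread("curr.png", cv2.IMREAD_GRAYSCALE)
fg_mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)     # e.g. from MOG2

p0 = grid_points_in_mask(fg_mask, step=8)
p1, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None,
                                           winSize=(21, 21), maxLevel=3)

vectors = (p1 - p0)[status.flatten() == 1]  # motion vectors for clustering
```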

Vehicle segmentation and tracking

I've been working on a project for some time to detect and track (moving) vehicles in video captured from UAVs. Currently I am using an SVM trained on bag-of-features representations of local features extracted from vehicle and background images. I am then using a sliding-window detection approach to try to localise vehicles in the images, which I would then like to track. The problem is that this approach is far too slow, and my detector isn't as reliable as I would like, so I'm getting quite a few false positives.
So I have been considering attempting to segment the cars from the background to find their approximate positions and reduce the search space before applying my classifier, but I am not sure how to go about this and was hoping someone could help.
Additionally, I have been reading about motion segmentation with layers, using optical flow to segment the frame by flow model. Does anyone have any experience with this method? If so, could you offer some input as to whether you think it would be applicable to my problem?
Below are two frames from a sample video.
frame 0:
frame 5:
Assuming your cars are moving, you could try to estimate the ground plane (road).
You may get a decent ground plane estimate by extracting features (SURF rather than SIFT, for speed), matching them over frame pairs, and solving for a homography using RANSAC, since a plane in 3D maps between two camera frames according to a homography.
Once you have your ground plane you can identify the cars by looking at clusters of pixels that don't move according to the estimated homography.
A more sophisticated approach would be to do structure from motion on the terrain. This only presupposes that it is rigid, not that it is planar.
Update
I was wondering if you could expand on how you would go about looking for clusters of pixels that don't move according to the estimated homography?
Sure. Say I and K are two video frames and H is the homography mapping features in I to features in K. First you warp I onto K according to H, i.e. you compute the warped image Iw as Iw([x y]') = I(inv(H) [x y]') (roughly Matlab notation). Then you look at the squared or absolute difference image Diff = (Iw - K).*(Iw - K). Image content that moves according to the homography H should give small differences (assuming constant illumination and exposure between the images), while image content that violates H, such as moving cars, should stand out.
For clustering high-error pixel groups in Diff I would start with simple thresholding ("every pixel difference in Diff larger than X is relevant", maybe using an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and clustered with connected components. This may be too simplistic, but it's easy to implement for a first try, and it should be fast. For something fancier, look at Clustering on Wikipedia. A 2D Gaussian mixture model may be interesting; when you initialize it with the detection result from the previous frame it should be pretty fast.
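A rough OpenCV sketch of the whole pipeline just described, under the assumption that H comes from feature matching plus cv2.findHomography (here H is just a placeholder identity, and the frame file names are made up):

```python
import cv2
import numpy as np

I = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
K = cv2.imread("frame5.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
H = np.eye(3)  # placeholder: in practice, H from cv2.findHomography(...)

h, w = K.shape
Iw = cv2.warpPerspective(I, H, (w, h))   # warp I onto K
diff = np.abs(Iw - K)                    # absolute difference image

# Threshold, clean up with morphology, and label connected components.
diff_u8 = cv2.normalize(diff, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
_, binary = cv2.threshold(diff_u8, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)  # remove speckles
binary = cv2.dilate(binary, kernel)                        # merge fragments

num, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
# Each row of stats (skipping label 0 = background) is a candidate moving car.
cars = [stats[i] for i in range(1, num) if stats[i, cv2.CC_STAT_AREA] > 50]
```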
I did a little experiment with the two frames you provided, and I have to say I am somewhat surprised myself how well it works. :-) Left image: Difference (color coded) between the two frames you posted. Right image: Difference between the frames after matching them with a homography. The remaining differences clearly are the moving cars, and they are sufficiently strong for simple thresholding.
Thinking of the approach you currently use, it may be interesting to combine it with my proposal:
You could try to learn and classify the cars in the difference image Diff instead of the original image. This would amount to learning what a car's motion pattern looks like rather than what a car looks like, which could be more reliable.
You could get rid of the expensive window search and run the classifier only on regions of Diff with sufficiently high values.
Some additional remarks:
In theory, the cars should even stand out if they are not moving since they are not flat, but given your distance to the scene and camera resolution this effect may be too subtle.
You can replace the feature extraction / matching part of my proposal with Optical Flow, if you like. This amounts to identifying flow vectors that "stick out" from a consistent frame-to-frame motion of the ground. It may be prone to outliers in the optical flow, however. You can also try to get the homography from the flow vectors.
This is important: regardless of which method you use, once you have found cars in one frame you should use this information to robustify your search for these cars in consecutive frames, giving a higher likelihood to detections close to the old ones (Kalman filter, etc.). That's what tracking is all about!
If the number of cars in your field of view always remains the same but they move around, then you can use optical flow; it will give you good results against a still background. If the number of cars changes, then you need to call the goodFeaturesToTrack function in OpenCV after a certain number of frames and track the cars again using optical flow.
You can use background modelling to model the background, so the cars are always your foreground. The simplest example is frame differencing: subtract the previous frame from the current frame, diff(x,y,k) = I(x,y,k) - I(x,y,k-1). As your cars are moving, in each frame you will get their position.
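A minimal sketch of that frame-differencing idea in OpenCV (frame file names are placeholders); cv2.absdiff gives |I(x,y,k) - I(x,y,k-1)| without unsigned-integer wraparound:

```python
import cv2

prev = cv2.imread("frame_k_minus_1.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_k.png", cv2.IMREAD_GRAYSCALE)

diff = cv2.absdiff(curr, prev)                          # |I_k - I_{k-1}|
_, moving = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

# For a more robust background model than plain differencing, OpenCV also
# provides cv2.createBackgroundSubtractorMOG2().
```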
Both approaches will work fine, since you have a still background, I presume. Check this link to see what optical flow can do.

Block Bundle Adjustment Flow

I am working on bundle block adjustment for finding
X,Y,Z values of image points
Corrected values of camera characteristics (extrinsic parameters, etc.)
Corrected values of measurements
In my opinion, the bundle block adjustment process is done by following these steps (camera intrinsics are given):
Gather tie points (x, y for each image pair) and ground control points (x, y and the related X, Y, Z positions for each image)
Calculate initial extrinsic parameters( camera pose ) for each view
Calculate each tie point's initial real world position by using camera pose
Execute sparse bundle adjustment step by using all these initial values and other parameters as inputs
Use output of sparse bundle adjustment as accurate results of real world position, extrinsic characteristics and measurements.
One thing I want to ask is whether that flow is correct. There are lots of methods for structure and motion estimation from views, so I cannot be sure about it.
As I searched through various resources, I found that there are libraries that handle each part of the bundle block adjustment operation. For each step:
Image processing libraries like OpenCV may be used for automatic tie point collection
cvFindExtrinsicCameraParams2 may be used for space resection (but it requires 4 points; for bundle block adjustment it is mentioned that 3 ground control points are enough for each view. Should I use another method, like pose estimation from stereo views?)
By using triangulation and projection methods of OpenCV, real world positions may be calculated
SBA or SSBA is suitable for this operation
N/A
Another question: if the previously mentioned flow is right, are the matched libraries enough for implementing the entire flow? (Maybe there are better suggestions for each part.)
I am a newbie in this field, so I appreciate any help on this subject. Thanks...
You have described the default approach to stereo photogrammetry. Rather than using computer vision terms (extrinsic, intrinsic) I suggest you search using the terms interior- and exterior-orientation. This is a good approach if you have finite numbers of overlapping images and it has the benefit of some well defined error estimation methods.
Here is some basic math:
http://itee.uq.edu.au/~elec4600/elec4600_lectures/1perpage/uq1.pdf
http://itee.uq.edu.au/~elec4600/elec4600_lectures/1perpage/uq2.pdf
2. cvFindExtrinsicCameraParams2 may be used for space resection (but it requires 4 points; for bundle block adjustment it is mentioned that 3 ground control points are enough for each view).
The reason four control points are required by cvFindExtrinsicCameraParams2 is that the equations are under-determined with only three. If you don't have enough control, you might have to use an alternate approach (or sensor) to estimate the initial camera pose vector.
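If I remember correctly, cvFindExtrinsicCameraParams2 is the legacy C API; the equivalent in the current Python/C++ API is cv2.solvePnP, which in its default iterative mode likewise needs at least 4 correspondences (the P3P solvers also expect a 4th point to disambiguate among the up-to-four solutions). A hedged sketch with placeholder numbers:

```python
import cv2
import numpy as np

# Ground control points (X, Y, Z) and their measured pixel coordinates.
object_points = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=np.float64)
image_points = np.array([[320, 240], [400, 240], [400, 320], [320, 320]], dtype=np.float64)

K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)  # intrinsics
dist = np.zeros(5)                                                          # distortion

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)  # initial exterior orientation for the bundle adjustment
```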

3D reconstruction -- How to create 3D model from 2D image?

If I take a picture with a camera and I know the distance from the camera to the object, such as a scale model of a house, I would like to turn this into a 3D model that I can maneuver around so I can comment on different parts of the house.
If I sit down and think about taking more than one picture, labeling direction, and distance, I should be able to figure out how to do this, but, I thought I would ask if someone has some paper that may help explain more.
What language you explain in doesn't matter, as I am looking for the best approach.
Right now I am considering showing the house and letting the user provide some assistance for height, such as the distance from the camera to the top of a part of the model; given enough of this it should be possible to start calculating heights for the rest, especially if there is a top-down image plus pictures taken at angles from the four sides to calculate relative heights.
Then I expect that parts will also need to differ in color to help separate out the various parts of the model.
As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Performing stereo reconstruction requires that pairs of images are taken that have a good amount of visible overlap of physical points. You need to find corresponding points such that you can then use triangulation to find the 3D co-ordinates of the points.
Epipolar geometry
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
the intrinsic camera parameters (requiring camera calibration),
the camera's position and rotation (its extrinsic parameters), and
8 or more physical points with matching known positions in two photos (when using the eight-point algorithm)
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
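A hedged sketch of that calculation in OpenCV: with 8+ matched points the fundamental matrix can be estimated with the eight-point algorithm, and with known intrinsics K the essential matrix follows. The point arrays below are stand-ins for illustration only.

```python
import cv2
import numpy as np

pts1 = (np.random.rand(20, 2) * 640).astype(np.float32)  # placeholder matches
pts2 = pts1 + np.float32([2.0, 1.0])                      # stand-in correspondences

F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)

K = np.array([[700, 0, 320], [0, 700, 240], [0, 0, 1]], dtype=np.float64)
E = K.T @ F @ K if F is not None else None  # E = K^T F K

# Alternatively, cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
# estimates E directly and is more robust to outliers.
```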
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
Correspondence problem
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
Manual selection: have a person hand-select matching points.
Custom markers: place markers or use specific patterns/colours that you can easily identify.
Sum of squared differences: take a region around a point and find the closest whole matching region in the other image.
Graph cuts: a global optimisation technique based on optimisation using graph theory.
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
Multi-view reconstruction
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
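A hedged sketch of that triangulation step with cv2.triangulatePoints; the projection matrices, relative pose and matched points below are placeholders:

```python
import cv2
import numpy as np

K = np.array([[700, 0, 320], [0, 700, 240], [0, 0, 1]], dtype=np.float64)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])        # first camera at the origin
R, t = np.eye(3), np.array([[0.1], [0.0], [0.0]])        # assumed relative pose
P2 = K @ np.hstack([R, t])

pts1 = np.array([[320, 240], [350, 260]], dtype=np.float64).T  # 2xN image points
pts2 = np.array([[310, 240], [340, 260]], dtype=np.float64).T

X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # 4xN homogeneous coordinates
X = (X_h[:3] / X_h[3]).T                         # Nx3 Euclidean 3D points
```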
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
Alternatives
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the XBox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
Epipolar geometry references
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I found a PDF chapter on Multiple View Geometry that contains most of the critical theory. In fact the textbook Multiple View Geometry in Computer Vision should also be quite useful (sample chapters available here).
Here's a page describing a project on uncalibrated stereo reconstruction that seems to include some source code that could be useful. They find matching points in an automated manner using one of many feature detection techniques. If you want this part of the process to be automated as well, then SIFT feature detection is commonly considered to be an excellent non-real-time technique (since it's quite slow).
A paper about Scene Reconstruction from Multiple Uncalibrated Views.
A slideshow on Methods for 3D Reconstruction from Multiple Images (it has some more references below its slides towards the end).
A paper comparing different multi-view stereo reconstruction algorithms can be found here. It limits itself to algorithms that "reconstruct dense object models from calibrated views".
Here's a paper that goes into lots of detail for the case that you have stereo cameras that take multiple images: Towards robust metric reconstruction via a dynamic uncalibrated stereo head. They then find methods to self-calibrate the cameras.
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.
Research has made significant progress, and these days it is possible to obtain pretty good-looking 3D shapes from 2D images. For instance, our recent research work titled "Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks" took a big step in solving the problem of obtaining 3D shapes from 2D images. In our work, we show that you can not only go from 2D to 3D directly and get a good, approximate 3D reconstruction, but you can also learn a distribution of 3D shapes in an efficient manner and generate/synthesize 3D shapes. Below is an image from our work showing that we are able to do 3D reconstruction even from a single silhouette or depth map (on the left). The ground-truth 3D shapes are shown on the right.
The approach we took has some contributions related to cognitive science, or the way the brain works: the model we built shares parameters for all shape categories instead of being specific to only one category. It also obtains consistent representations and takes the uncertainty of the input view into account when producing a 3D shape as output, so it is able to give meaningful results even for very ambiguous inputs. If you look at the citations to our paper you can see even more progress just in terms of going from 2D images to 3D shapes.
This problem is known as Photogrammetry.
Google will supply you with endless references, just be aware that if you want to roll your own, it's a very hard problem.
Check out The Deadalus Project; although the website does not contain a gallery with illustrative information about the solution, it posts several papers and info about the working method.
I watched a lecture by one of the main researchers of the project (Roger Hubbold), and the image results are quite amazing! Although it is a complex and long problem, with a lot of tricky details to take into account to get an approximation of the 3D data. Take, for example, the 3D information from wall surfaces, for which the heuristic works as follows: take a photo of the scene under normal illumination, then retake the picture from the same position with full flash active, subtract the two images and divide the result by a pre-taken flash calibration image, apply a box filter to this new result, and then post-process to estimate depth values. The whole process is explained in detail in this paper (which is also posted/referenced on the project website).
Google Sketchup (free) has a photo matching tool that allows you to take a photograph and match its perspective for easy modeling.
EDIT: It appears that you're interested in developing your own solution. I thought you were trying to obtain a 3D model of an image in a single instance. If this answer isn't helpful, I apologize.
Hope this helps if you are trying to construct a 3D volume from a 2D stack of images! You can use an open-source tool such as ImageJ Fiji, which comes with a 3D viewer plugin.
https://quppler.com/creating-a-classifier-using-image-j-fiji-for-3d-volume-data-preparation-from-stack-of-images/
